Leon Yin

About

Leon Yin is an award-winning investigative data journalist at The Markup. He uses interdisciplinary methods and bespoke datasets to report on technology and monopolies. His work has been cited by legislators, the academy, civil rights groups, and popular media. In 2022, he received a Gerald Loeb Award for the series "Amazon's Advantage". Before journalism, he was a research scientist at The Center for Social Media and Politics at NYU. Leon started his career writing Fortran scripts at NASA.

Projects

Still Loading (2022)

Aaron Sankin and I found that four ISPs charged the same price for drastically different internet speeds based on where you live. In cities across the U.S., neighborhoods that were historically redlined, lower-income, and had the highest concentration of people of color were disproportionately asked to overpay for slow speeds. Building on a technique by Princeton researchers, I found and used undocumented APIs to collect more than 1 million internet plans. I merged in socioeconomic data from the U.S. Census and digitized redlining maps. We published a story recipe to guide reporters to localized data.
Listen to the On The Media interview
Read the methodology
See the code on GitHub

Amazon Brands and Exclusives (2021)

My co-author Adrianne Jeffries and I found that Amazon gave its own branded products an advantage over better-rated competitors in search results. I trained a random forest to predict which product Amazon placed at the top of thousands of popular searches. The investigation was cited by the House antitrust committee in a letter to Amazon, and received a Gerald Loeb Award for personal finance and consumer reporting in 2022. I was lead engineer on Amazon Brand Detector, a Firefox and Chrome extension that helps shoppers spot Amazon-branded products.
Download Amazon Brand Detector
Read the methodology
See the code on GitHub

YouTube's Keyword Blocklist for Ad Targeting (2021)

I found an undocumented API and worked with civil rights groups to audit YouTube's in-house brand safety tools. My co-author Aaron Sankin and I found racial justice phrases like Black Lives Matter were blocked, while hate terms like White Lives Matter were not. In fact, we found that YouTube only blocked one-third of well-known hate terms from advertisers. Removing spaces from phrases circumvented the block in almost every instance. In 2022, the series was part of a portfolio honored by NABJ for best practices reporting on algorithmic bias.
Read the hate methodology
Read the social justice methodology
See the code and keywords on GitHub
See Color of Change's petition

Counting pixels on Google Search (2020)

I developed a staining technique to audit Google search results. My reporting partner Adrianne Jeffries and I developed a categorization scheme for all the things found on Google Search. We found Google's own products and answers covered 41% of the first page. Our research was cited in the congressional subcommittee hearing on Big Tech and antitrust. In 2021, "Google the Giant" was a finalist for a Gerald Loeb Award in explanatory journalism.
Read the methodology
A video of Web Assay
See the code on GitHub
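
A coverage figure like the one above is, at its core, an area calculation. The sketch below is not the published methodology: it assumes each page element has already been assigned a category and measured in pixels, and the category labels and sizes are made-up placeholders.

```python
from collections import defaultdict


def area_share(elements):
    """Return each category's share of the total measured area, in percent."""
    totals = defaultdict(float)
    for category, width_px, height_px in elements:
        totals[category] += width_px * height_px
    page_area = sum(totals.values())
    if page_area == 0:
        return {}
    return {cat: 100 * area / page_area for cat, area in totals.items()}


# Hypothetical elements: (category, width in pixels, height in pixels).
elements = [
    ("category_a", 400, 300),
    ("category_b", 400, 500),
    ("category_c", 400, 200),
]
shares = area_share(elements)
for category, percent in sorted(shares.items()):
    print(f"{category}: {percent:.1f}% of measured area")
```

Summing areas per category before dividing keeps the calculation independent of how many elements a category is split into.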

Citizen Browser (2021)

I contributed to an ambitious project to distribute a privacy-preserving app to collect Facebook data from a national panel. I built the redaction system and data pipelines alongside Micha Gorelick. Alfred Ng and I debunked Facebook's promise to stop Political Group recommendations. Our story was cited by Senator Ed Markey, who demanded answers from Facebook for its broken commitments. The project received an Edward R. Murrow Award in Innovation.
Read the methodology
See the top 100 Groups on GitHub

Google Keyword Planner (2020)

Building on Safiya Noble's book "Algorithms of Oppression: How Search Engines Reinforce Racism" and Latanya Sweeney's "Discrimination in Online Ad Delivery", I developed an audit of the Google Ads Keyword Planner with my reporting partner Aaron Sankin. We found hundreds of pornographic keyword suggestions for Black, Latina, and Asian girls, but no results whatsoever for "White girls". The story was featured in a NOVA documentary.
Read the article
Videos of related works
See the code on GitHub

The Internet Research Agency: Hyperlinks, News, and Marketing Tools (2018)

How impactful was "fake news" in foreign info ops during the 2016 U.S. Presidential Election? I analyzed hyperlinks to Junk, National, and Local news sources sent by accounts released by the Senate Intelligence Committee and Twitter's Elections Integrity initiative. My analysis reveals the surprising role of local news, group identity, and free marketing tools in info ops.
Read the report
Identifying Local News outlets

Disinfo Doppler (2018)

An open-source computer vision toolkit for tracing and measuring image-based activity online. It is designed to assist evidence-based reporting and to reduce vicarious trauma for those studying ephemeral spaces rife with coordinated hoaxes, harassment campaigns, and racist propaganda.
Animated mosaic post-Charlottesville
See the code on GitHub

Reverse Image Search (2017)

A demonstration of a simple, robust, and scalable reverse image search engine that leverages features from convolutional neural networks and the distance returned from the K-nearest neighbors algorithm.
Jupyter Notebook
Presentation at PyData 2017
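
The nearest-neighbor step described above can be sketched in a few lines. This is a minimal illustration, not the code from the notebook: the short vectors stand in for features that would come from a convolutional neural network, the file names are hypothetical, and cosine distance is an assumed choice of metric.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_images(query, index, k=2):
    """Return the k (distance, image_id) pairs closest to the query vector."""
    scored = [(cosine_distance(query, vec), image_id)
              for image_id, vec in index.items()]
    return sorted(scored)[:k]


# Placeholder feature vectors; real ones would be CNN activations.
index = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "car_1.jpg": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
results = nearest_images(query, index, k=2)
for distance, image_id in results:
    print(f"{image_id}: distance {distance:.3f}")
```

The returned distance doubles as a similarity score: a small value suggests a near-duplicate, while a large value suggests the query has no close match in the index. A brute-force scan like this is fine for small collections; larger ones would call for an indexed nearest-neighbor structure.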



Open Source Software

YouTube-Data-API

Despite having the largest user base among American adults, YouTube is a social media platform that is often overlooked in academic research. youtube-data-api is a Python client that makes this data source more accessible, while introducing new applications and methods to analyze the platform.
ReadTheDocs
GitHub Repo
PyPI Page

urlExpander

urlExpander is a Python package for quickly and thoroughly expanding shortened URLs. Marketing and analytics services like bit.ly are great for tracking engagement. However, these services obfuscate the destination of URLs for social media analysts.
Jupyter Notebook Quickstart
GitHub Repo
PyPI Page
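
The core idea behind expanding a shortened link is to follow its redirect chain to the end. The sketch below is not urlExpander's API: `expand_url`, `get_redirect`, and the example URLs are hypothetical names, and a lookup table stands in for real HTTP responses so the example runs offline.

```python
def expand_url(url, get_redirect, max_hops=10):
    """Follow redirects from url until a page stops redirecting.

    get_redirect(url) must return the next URL, or None if there is none.
    """
    seen = {url}
    for _ in range(max_hops):
        next_url = get_redirect(url)
        if next_url is None:
            return url
        if next_url in seen:
            raise ValueError(f"redirect loop at {next_url}")
        seen.add(next_url)
        url = next_url
    raise ValueError(f"gave up after {max_hops} redirects")


# Hypothetical redirect table standing in for real HTTP responses.
fake_redirects = {
    "https://short.example/abc": "https://tracker.example/r?id=1",
    "https://tracker.example/r?id=1": "https://news.example/story",
}
final = expand_url("https://short.example/abc", fake_redirects.get)
print(final)
```

Against live links, `get_redirect` could issue an HTTP request and read the `Location` header of a 3xx response. The hop limit and loop check guard against chains that never resolve.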

S3 Helper

A high-level Python AWS CLI wrapper to smooth workflows with private data stored in S3 cloud storage. This Jupyter notebook showcases the module's ability to stream CSV and JSON files into pandas DataFrames and save scikit-learn models to S3 buckets.
Jupyter Notebook Tutorial
GitHub Repo
PyPI Page



If any of the projects are dated or contain inaccuracies, please let me know via email or an issue on GitHub :)


Get in Touch

hello [at] {this-domain}
@leonyin


Especially if you're interested in:

  1. 🙈 Undocumented APIs and auditing algorithms
  2. 🙊 Sending a secure tip :)