Leon Yin


Twitter: @LeonYin
Bluesky: @leonyin.org
Insta/Threads: @leonyin_
Signal: leon.100
GitHub: yinleon

About

Leon Yin is an award-winning investigative journalist at Bloomberg News. He builds datasets and develops methods to measure technology's impact on society. He writes Inspect Element, a practitioner's guide to auditing algorithms. His work has been cited by legislators, the academy, and popular media. In 2023, the series "Still Loading" received a Philip Meyer Award recognizing the best uses of social science methods in journalism. Leon got his start in news at The Markup, and his start in research writing Fortran scripts at NASA.

Projects

GPT Racial and Gender Bias in Hiring (2024)

Recruiters and HR vendors are using AI chatbots to interview and screen job candidates. Davey Alba, Leo Nicoletti, and I found that OpenAI's GPT discriminates based on names when ranking resumes. Building off classic hiring and algorithmic bias studies, we randomly assigned demographically-distinct names to equally-qualified resumes and asked GPT-3.5 and GPT-4 to rank the resumes against four real job descriptions. We repeated the experiment 1,000 times, cycling through hundreds of unique names for each job. Differences in how often GPT ranked resumes from each group surpassed benchmarks for adverse impact, a standard used to test for discriminatory hiring practices.
Watch a video explainer on TikTok
Read the methodology
See the code on GitHub
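
As a rough illustration of what an adverse impact check can look like, here is a minimal Python sketch. It assumes the benchmark is the four-fifths rule, under which a group is flagged when its selection rate falls below 80% of the highest group's rate; the methodology linked above is the authority on the tests actually used. The group labels, counts, and function names are hypothetical and are not results from the study.

```python
def selection_rates(top_counts, totals):
    """Share of trials in which each group's resume was ranked on top."""
    return {group: top_counts[group] / totals[group] for group in top_counts}


def adverse_impact_ratios(rates):
    """Each group's selection rate divided by the highest group's rate."""
    best = max(rates.values())
    return {group: rate / best for group, rate in rates.items()}


def flag_adverse_impact(rates, threshold=0.8):
    """Groups whose ratio falls below the threshold (four-fifths by default)."""
    ratios = adverse_impact_ratios(rates)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Made-up counts for illustration only: times each group was ranked first
# out of 1,000 trials.
top_counts = {"group_a": 150, "group_b": 110, "group_c": 125}
totals = {"group_a": 1000, "group_b": 1000, "group_c": 1000}

rates = selection_rates(top_counts, totals)
ratios = adverse_impact_ratios(rates)
flagged = flag_adverse_impact(rates)
print(flagged)
```

With these made-up numbers, only group_b falls below the four-fifths threshold relative to group_a.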

Still Loading (2022)

Aaron Sankin and I found that four ISPs charge the same price for drastically different internet speeds based on where you live. In cities across the U.S., neighborhoods that were historically redlined, lower-income, and had the highest concentration of people of color were disproportionately asked to overpay for slow speeds. Building off a technique by Princeton researchers, I found and used undocumented APIs to collect more than 1 million internet plans. I merged in socioeconomic data from the U.S. Census and digitized redlining maps. We wrote a story recipe to guide reporters to localized data, and a "Build Your Own Dataset" guide for others to reproduce our study with little-to-no coding. The project was honored with Scripps Howard, SIGMA, NABJ, ONA, SABEW, and Philip Meyer awards.
Listen to the On The Media interview
Read the methodology
See the code on GitHub

Amazon Brands and Exclusives (2021)

My co-author Adrianne Jeffries and I found Amazon gave its own branded products an advantage over better-rated competitors in search results. I trained a random forest to predict which product Amazon placed on top of thousands of popular searches. The investigation was cited by the House antitrust committee in a letter to Amazon, and received a Gerald Loeb Award for personal finance and consumer reporting in 2022. I was lead engineer on Amazon Brand Detector, a Firefox and Chrome extension that helps shoppers spot Amazon branded products.
Download Amazon Brand Detector
Read the methodology
See the code on GitHub

YouTube's Keyword Blocklist for Ad Targeting (2021)

I found an undocumented API and worked with civil rights groups to audit YouTube's in-house brand safety tools. My co-author Aaron Sankin and I found racial justice phrases like Black Lives Matter were blocked, while hate terms like White Lives Matter were not. In fact, we found that YouTube only blocked one-third of well-known hate terms from advertisers. Removing spaces from phrases circumvented the block in almost every instance. In 2022, the series was part of a portfolio honored by NABJ for best practices reporting on algorithmic bias.
Read the hate methodology
Read the social justice methodology
See the code and keywords on GitHub
See Color of Change's petition

Counting pixels on Google Search (2020)

I developed a staining technique to audit Google search results. My reporting partner Adrianne Jeffries and I developed a categorization scheme for all the things found on Google Search. We found Google's own products and answers covered 41% of the first page. Our research was cited in the congressional subcommittee hearing on Big Tech and antitrust. In 2021, "Google the Giant" was a finalist for a Gerald Loeb Award in explanatory journalism.
Read the methodology
A video of Web Assay
See the code on GitHub

Citizen Browser (2021)

I contributed to an ambitious project to distribute a privacy-preserving app to collect Facebook data from a national panel. I built the redaction system and data pipelines alongside Micha Gorelick. Alfred Ng and I debunked Facebook's promise to stop Political Group recommendations. Our story was cited by Senator Ed Markey, who demanded answers from Facebook for its broken commitments. The project received an Edward R. Murrow Award in Innovation.
Read the methodology
See the top 100 Groups on GitHub

Google Keyword Planner (2020)

Building off Safiya Noble's book "Algorithms of Oppression: How Search Engines Reinforce Racism" and Latanya Sweeney's "Discrimination in Online Ad Delivery", I developed an audit of Google Ads' Keyword Planner with my reporting partner Aaron Sankin. We found hundreds of pornographic keyword suggestions for Black, Latina, and Asian girls, but no results whatsoever for "White girls". The story was featured in a NOVA documentary.
Read the article
Videos of related works
See the code on GitHub

The Internet Research Agency: Hyperlinks, News, and Marketing Tools (2018)

How impactful was "fake news" in foreign info ops during the 2016 U.S. Presidential Election? I analyzed hyperlinks to Junk, National, and Local news sources sent by accounts released by the Senate Intelligence Committee and Twitter's Elections Integrity initiative. My analysis reveals the surprising role of local news, group identity, and free marketing tools in info ops.
Read the report
Identifying Local News outlets

Disinfo Doppler (2018)

An open source computer vision toolkit used to trace and measure image-based activity online. Designed to assist evidence-based reporting and reduce vicarious trauma when working in ephemeral spaces rife with coordinated hoaxes, harassment campaigns, and racist propaganda.
Animated mosaic post-Charlottesville
See the code on GitHub

Reverse Image Search (2017)

A demonstration of a simple, robust, and scalable reverse image search engine that leverages features from convolutional neural networks and the distance returned from the K-nearest neighbors algorithm.
Jupyter Notebook
Presentation at PyData 2017
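
The retrieval step described above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's implementation: it assumes each image has already been turned into a feature vector by a convolutional neural network, and it uses brute-force Euclidean distance for the nearest-neighbor search. The file names, the three-dimensional vectors, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_images(query, index, k=3):
    """Return the k indexed images closest to the query, as (name, distance)."""
    scored = sorted((euclidean(query, vec), name) for name, vec in index.items())
    return [(name, dist) for dist, name in scored[:k]]


# Hypothetical index: in practice each vector would come from a CNN layer
# and have hundreds or thousands of dimensions.
index = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "dog_1.jpg": [0.1, 0.9, 0.2],
}

query = [0.9, 0.1, 0.0]  # features of the image being searched for
matches = nearest_images(query, index, k=2)
for name, dist in matches:
    print(name, round(dist, 3))
```

A small distance suggests a near-duplicate, while larger distances indicate looser visual similarity. A brute-force scan like this would be slow on a large collection, where an indexed nearest-neighbor search is the usual choice.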



Open Source Software

United States Place Sampler

Need a random sample of U.S. addresses? I partnered with Big Local News to simplify that process. Now it's easier than ever to get receipts from streets to test for disparate outcomes.
Read why we built this
Try it out

YouTube-Data-API

Despite having the largest userbase amongst American adults, YouTube is a social media platform that is often overlooked in academic research. youtube-data-api is a Python client to make this data source more accessible, while introducing new applications and methods to analyze this platform.
ReadTheDocs
GitHub Repo
PyPI Page

urlExpander

urlExpander is a Python package for quickly and thoroughly expanding shortened URLs. Marketing and analytics services like bit.ly are great for tracking engagement. However, these services obfuscate the destination of URLs for social media analysts.
Jupyter Notebook Quickstart
GitHub Repo
PyPI Page



If any of the projects are dated or contain inaccuracies, please let me know via email or an issue on GitHub :)
>> Lastly, I keep a log of speaking events here.

Get in Touch

hello [at] {this-domain} No PR pitches...
Twitter: @leonyin
Bluesky: @leonyin.org
Threads/Insta: @leonyin_
Mastodon: LeonYin@mastodon.social
LinkedIn: Leon Yin
Signal: leon.100
GitHub: yinleon

Especially if you're interested in:

  1. 🙈 Undocumented APIs and auditing algorithms
  2. 🙉 Testing for tangible harms from technologies, including applications of AI and ML
  3. 🙊 Sending a secure tip ;)