Leon Yin

About

I'm an investigative data journalist at The Markup. I build datasets and use interdisciplinary methods to measure the impacts of technology on society. My work is regularly cited by legislators, the academy, civil rights groups, and others. In 2021, I was a finalist for a Gerald Loeb Award. Before joining the newsroom in 2019, I was a data scientist at The Center for Social Media and Politics at NYU, and a research affiliate at Data & Society. I started my career writing Fortran code to analyze oceanographic data at NASA.

Projects

Amazon Brands and Exclusives (2021)

My co-author Adrianne Jeffries and I found Amazon gave its own branded products an advantage over better-rated competitors in search results. I trained a random forest to predict which product Amazon placed at the top of thousands of popular searches. The investigation was cited by the House antitrust committee in a letter to Amazon, and was a finalist for a Deadline Club Award in 2022. I was lead engineer on Amazon Brand Detector, a Firefox and Chrome extension that helps shoppers spot Amazon-branded products.
Download Amazon Brand Detector
Read the methodology
See the code on GitHub

YouTube's Keyword Blocklist for Ad Targeting (2021)

I found an undocumented API and worked with civil rights groups to audit YouTube's in-house brand safety tools. My co-author Aaron Sankin and I found racial justice phrases like Black Lives Matter were blocked, while hate terms like White Lives Matter were not. In fact, we found that YouTube only blocked one-third of well-known hate terms from advertisers. Removing spaces from phrases circumvented the block in almost every instance. In 2022, the series was part of a portfolio honored by NABJ for best practices reporting on algorithmic bias.
Read the hate methodology
Read the social justice methodology
See the code and keywords on GitHub
See Color of Change's petition

Measuring What's Google on Search (2020)

Inspired by biochemistry, I developed a staining technique to audit Google's self-referential search results. My reporting partner Adrianne Jeffries and I developed a categorization scheme for all the things found on Google Search. We found Google's own products and answers covered 41% of the first page. Our research was cited in the congressional subcommittee hearing on Big Tech and antitrust. The series was a finalist for the 2021 Gerald Loeb Award in investigative journalism.
Read the methodology
A video of Web Assay
See the code on GitHub
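
The 41% figure is a share of page area. As a rough illustration of that arithmetic (not the Web Assay code itself), the sketch below assumes every element on a results page has already been labeled with a category and measured in pixels. The category names and box sizes are made up for the example.

```python
from collections import defaultdict


def area_share(elements):
    """Return each category's share of the total labeled area (0 to 1)."""
    totals = defaultdict(float)
    for category, width, height in elements:
        totals[category] += width * height
    page_area = sum(totals.values())
    if page_area == 0:
        return {}
    return {category: area / page_area for category, area in totals.items()}


# Hypothetical (category, width_px, height_px) boxes for one results page.
page = [
    ("google_product", 600, 200),
    ("google_answer", 600, 100),
    ("non_google", 600, 300),
    ("ads", 600, 150),
]

shares = area_share(page)
for category, share in sorted(shares.items()):
    print(f"{category}: {share:.1%}")
```

Averaging these per-page shares over many searches would give a single headline percentage for each category.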

Citizen Browser (2021)

I contributed to an ambitious project to distribute a privacy-preserving app to collect Facebook data from a national panel. I built the redaction system and web parsers alongside Micha Gorelick. Alfred Ng and I debunked Facebook's promise to stop political group recommendations. Our story was cited by Senator Ed Markey, who demanded answers from Facebook for its broken commitments. The project is a finalist for a Scripps Howard Award in Innovation.
Read the methodology
See the top 100 Groups on GitHub

Google Keyword Planner (2020)

Building on Safiya Noble's book "Algorithms of Oppression: How Search Engines Reinforce Racism" and Latanya Sweeney's paper "Discrimination in Online Ad Delivery", I developed an audit of the Google Ads Keyword Planner with my reporting partner Aaron Sankin. We found hundreds of pornographic keyword suggestions for Black, Latina, and Asian girls, but no results whatsoever for "White girls".
Read the article
Videos of related works
See the code on GitHub

The Internet Research Agency: Hyperlinks, News, and Marketing Tools (2018)

How impactful was "fake news" in foreign info ops during the 2016 U.S. Presidential Election? I analyzed hyperlinks to Junk, National, and Local news sources sent by accounts released by the Senate Intelligence Committee and Twitter's Elections Integrity initiative. My analysis reveals the surprising role of local news, group identity, and free marketing tools in info ops.
Read the report
Identifying Local News outlets

Disinfo Doppler (2018)

An open source computer vision toolkit used to trace and measure image-based activity online. It is designed to assist evidence-based reporting and reduce vicarious trauma when working in ephemeral spaces rife with coordinated hoaxes, harassment campaigns, and racist propaganda.
Animated mosaic post-Charlottesville
See the code on GitHub

Reverse Image Search (2017)

A demonstration of a simple, robust, and scalable reverse image search engine that leverages features from convolutional neural networks and the distance returned from the K-nearest neighbors algorithm.
Jupyter Notebook
Presentation at PyData 2017
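
The nearest-neighbor half of that idea can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it assumes each image has already been converted into a feature vector by a convolutional network, and the three-dimensional vectors and file names below are invented stand-ins for those features.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_images(query, index, k=2):
    """Return the k (distance, image_id) pairs closest to the query vector."""
    scored = [(euclidean(query, vec), image_id) for image_id, vec in index.items()]
    return sorted(scored)[:k]


# Hypothetical index: image id -> precomputed CNN feature vector.
index = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "car_1.jpg": [0.0, 0.1, 0.9],
}

matches = nearest_images([0.9, 0.1, 0.05], index, k=2)
for distance, image_id in matches:
    print(f"{image_id}: {distance:.3f}")
```

Because the distance comes back with each neighbor, a threshold on it can separate near-duplicates from images that are merely the closest available. A brute-force scan like this one is only practical for small collections.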

Are US Legislators Ideologically Polarized? (2017)

A time series visualization of legislators' voting history using DW-NOMINATE, a measure of where legislators fall on the liberal-conservative spectrum.
Jupyter Notebook
Ideological Polarization of Congress JFK-2014



Open Source Software

YouTube-Data-API

Despite having the largest userbase amongst American adults, YouTube is a social media platform that is often overlooked in academic research. youtube-data-api is a Python client to make this data source more accessible, while introducing new applications and methods to analyze this platform.
ReadTheDocs
GitHub Repo
PyPI Page

urlExpander

urlExpander is a Python package for quickly and thoroughly expanding shortened URLs. Marketing and analytics services like bit.ly are great for tracking engagement. However, these services obfuscate the destination of URLs for social media analysts.
Jupyter Notebook Quickstart
GitHub Repo
PyPI Page
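
Expanding a link requires a network request, but the step before it, deciding which links are worth expanding, can be sketched offline. The snippet below is an illustration using only the standard library, not urlExpander's implementation. The function name and the short list of shortener domains are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately incomplete list of link-shortener domains.
SHORTENER_DOMAINS = {"bit.ly", "t.co", "goo.gl", "tinyurl.com"}


def looks_shortened(url):
    """Return True if the URL's host is a known link shortener."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_DOMAINS


urls = [
    "https://bit.ly/2abcXYZ",
    "https://www.nytimes.com/section/politics",
]

# Only these links would need a request to resolve their destination.
to_expand = [u for u in urls if looks_shortened(u)]
print(to_expand)
```

Filtering first keeps the number of requests down when a dataset holds millions of links, most of which already point at their final destination.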

S3 Helper

A high-level Python wrapper around the AWS CLI to smooth workflows with private data stored in S3 cloud storage. This Jupyter notebook showcases the module's ability to stream CSV and JSON files to pandas DataFrames, and save scikit-learn models to S3 buckets.
Jupyter Notebook Tutorial
GitHub Repo
PyPI Page



If any of the projects are dated or contain inaccuracies, please let me know via email or an issue on GitHub :)


Get in Touch

hello [at] {this-domain}
@leonyin
(wire) @leonyin


Especially if you're interested in:

  1. 🙈 Undocumented APIs and auditing algorithms
  2. 🙊 Sending a secure tip :)