Leon Yin

About

I'm an investigative data journalist at The Markup, a non-profit newsroom focused on tech accountability. I build datasets, run experiments, and audit algorithms. Before joining The Markup, I was a research scientist working alongside political scientists at NYU, and a research affiliate at Data & Society. I started my career writing Fortran code to analyze oceanographic data at NASA.

Projects

YouTube's Keyword Blocklist for Ad Targeting (2021)

I found an undocumented API and worked with experts to audit YouTube's in-house brand safety tools. My co-author Aaron Sankin and I found YouTube only blocked one-third of well-known hate terms. Removing spaces from phrases circumvented the block in almost every instance. We also found social and racial justice terms like Black Lives Matter were blocked, while terms like White Lives Matter were not.
Read the hate methodology
Read the social and racial justice methodology
See the code and keywords on GitHub
See Color of Change's petition that cites our findings

How to Measure Real Estate on a Website? (2020)

I developed a novel spatial web parsing technique called Web Assay to audit Google's self-referential search results. My reporting partner Adrianne Jeffries and I developed a categorization scheme for all the things found on Google Search. We found Google's own products and answers covered 41% of the first page. Our research was cited in the congressional subcommittee hearing on Big Tech and antitrust.
Read the methodology
A video of Web Assay
See the code on GitHub
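To give a sense of what "measuring real estate" can mean, here is a minimal sketch. It is not the Web Assay implementation. It assumes each parsed element already has a category label and a bounding box in pixels, and that elements do not overlap; the function name, field names, and toy numbers are all hypothetical.

```python
from collections import defaultdict

def area_share(elements, page_width, page_height):
    """Return the fraction of the page area covered by each category.

    Assumes elements do not overlap (a simplification for this sketch).
    """
    page_area = page_width * page_height
    totals = defaultdict(float)
    for el in elements:
        # Clip each bounding box to the page so off-page pixels are not counted.
        left = max(0, el["x"])
        top = max(0, el["y"])
        right = min(page_width, el["x"] + el["width"])
        bottom = min(page_height, el["y"] + el["height"])
        width = max(0, right - left)
        height = max(0, bottom - top)
        totals[el["category"]] += width * height
    return {category: area / page_area for category, area in totals.items()}

# Toy data, invented for illustration only.
elements = [
    {"category": "platform", "x": 0, "y": 0, "width": 400, "height": 300},
    {"category": "organic", "x": 0, "y": 300, "width": 400, "height": 500},
    # This box runs past the bottom of the page and is clipped to 200px tall.
    {"category": "platform", "x": 0, "y": 800, "width": 400, "height": 400},
]
shares = area_share(elements, page_width=400, page_height=1000)
for category, share in sorted(shares.items()):
    print(f"{category}: {share:.0%}")
```

Clipping to the page bounds keeps a share from exceeding 100% when an element extends past the measured region.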

Citizen Browser (2021)

I contributed to an ambitious project to distribute a privacy-preserving app to collect Facebook data from a national panel. I built the redaction system and web parsers alongside Micha Gorelick. Alfred Ng and I debunked Facebook's promise to stop Political Group recommendations. Our story was cited by Senator Ed Markey, who demanded answers from Facebook for its broken commitments.
Read the methodology
See the top 100 Groups on GitHub

Google Keyword Planner (2020)

Building off Safiya Noble's book "Algorithms of Oppression: How Search Engines Reinforce Racism" and Latanya Sweeney's "Discrimination in Online Ad Delivery", I developed an audit of Google Ads' Keyword Planner with my reporting partner Aaron Sankin. We found hundreds of pornographic keyword suggestions for Black, Latina, and Asian girls, but no results whatsoever for "White girls".
Read the article
Videos of related works
See the code on GitHub

The Internet Research Agency: Hyperlinks, News, and Marketing Tools (2018)

How impactful was "fake news" in foreign info ops during the 2016 U.S. Presidential Election? I analyzed hyperlinks to Junk, National, and Local news sources sent by accounts released by the Senate Intelligence Committee and Twitter's Elections Integrity initiative. My analysis reveals the surprising role of local news, group identity, and free marketing tools in info ops.
Read the report
Identifying Local News outlets

Disinfo Doppler (2018)

An open source computer vision toolkit used to trace and measure image-based activity online. Designed to assist evidence-based reporting and reduce vicarious trauma amongst ephemeral spaces rife with coordinated hoaxes, harassment campaigns and racist propaganda.
Animated mosaic post-Charlottesville
See the code on GitHub

Reverse Image Search (2017)

A demonstration of a simple, robust, and scalable reverse image search engine that leverages features from convolutional neural networks and the distance returned from the K-nearest neighbors algorithm.
Jupyter Notebook
Presentation at PyData 2017
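The core lookup can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it assumes each image has already been reduced to a feature vector by a convolutional neural network, and it uses random vectors as stand-ins for those features. The function names are hypothetical, and the brute-force search would be replaced by an indexed nearest-neighbor structure at scale.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(index, query, k=3):
    """Return the k closest indexed vectors as (position, distance) pairs."""
    scored = [(euclidean(vec, query), i) for i, vec in enumerate(index)]
    scored.sort()  # ascending distance, so the best matches come first
    return [(i, dist) for dist, i in scored[:k]]

# Stand-in for CNN features: 100 "images", 64 dimensions each (assumption).
random.seed(0)
features = [[random.gauss(0, 1) for _ in range(64)] for _ in range(100)]

# A slightly perturbed copy of image 42 plays the role of a re-encoded upload.
query = [x + random.gauss(0, 0.01) for x in features[42]]

matches = nearest_neighbors(features, query, k=3)
for position, distance in matches:
    print(position, round(distance, 3))
```

The returned distance doubles as a confidence signal: a near-zero distance suggests a near-duplicate image, while larger distances indicate looser visual similarity.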

Are US Legislators Ideologically Polarized? (2017)

A time-series visualization of legislator voting history using DW-NOMINATE, a measure of where legislators fall on the liberal-conservative spectrum.
Jupyter Notebook
Ideological Polarization of Congress JFK-2014



Open Source Software

YouTube-Data-API

Despite having the largest userbase amongst American adults, YouTube is a social media platform that is often overlooked in academic research. youtube-data-api is a Python client to make this data source more accessible, while introducing new applications and methods to analyze this platform.
ReadTheDocs
GitHub Repo
PyPI Page

urlExpander

urlExpander is a Python package for quickly and thoroughly expanding shortened URLs. Marketing and analytics services like bit.ly are great for tracking engagement. However, these services obfuscate the destination of URLs for social media analysts.
Jupyter Notebook Quickstart
GitHub Repo
PyPI Page
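The underlying idea can be sketched with the Python standard library alone. This is not urlExpander's API: the function names and the short list of shortener domains below are assumptions made for illustration, and `expand` needs network access to do anything useful.

```python
import urllib.request
from urllib.parse import urlparse

# Hypothetical, deliberately tiny list of link-shortening domains.
SHORTENER_DOMAINS = {"bit.ly", "t.co", "goo.gl"}

def is_short(url):
    """Return True if the URL's host is one of the known shortener domains."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_DOMAINS

def expand(url, timeout=10):
    """Follow redirects and return the final URL (requires network access)."""
    request = urllib.request.Request(url, method="HEAD")
    # urlopen follows HTTP redirects by default; geturl() reports where it ended up.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()

links = ["https://bit.ly/abc", "https://example.com/story"]
to_expand = [link for link in links if is_short(link)]
print(to_expand)
```

Filtering with `is_short` first avoids making a network request for links that already point at their destination.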

S3 Helper

A high-level Python wrapper around the AWS CLI to smooth workflows with private data stored on S3 cloud storage. This Jupyter notebook showcases the module's ability to stream CSV and JSON files to pandas DataFrames, and save scikit-learn models to S3 buckets.
Jupyter Notebook Tutorial
GitHub Repo
PyPI Page



If any of the projects are dated or contain inaccuracies, please let me know via email or an issue on GitHub :)
The next section contains a JavaScript app that cycles through a collection of quotes I like.


Get in Touch

hello [at] {this-domain}
@leonyin
(wire) @leonyin


Especially if you're interested in:

  1. 🙈 Undocumented APIs and auditing algorithms
  2. 🙊 Sending a secure tip :)