Leon Yin

Computer science is sudo science.


I am an investigative journalist at a non-profit newsroom called The Markup. Previously, I was a research scientist at the SMaPP lab at NYU and a research affiliate at Data & Society's Media Manipulation Initiative. I lay data pipelines, run experiments, and build tools to investigate social and technical systems. I am focusing on building infrastructure to accelerate evidence-building for people and machines. Before that, I wrote scientific software at NASA and wrangled data for Sony. I received my BS from NYU in 2015.


The Internet Research Agency, Hyperlinks, News, and Marketing Tools

How impactful was "fake news" in foreign info ops during the 2016 U.S. Presidential Election? I analyzed hyperlinks to Junk, National, and Local news sources sent by accounts released by the Senate Intelligence Committee and Twitter's Elections Integrity initiative. My analysis reveals a deliberate use of local news articles to build trust in fake local news accounts, and to highlight key social issues while masquerading as white conservatives and black activists. I also identify marketing tools the accounts used to automate, optimize, and measure the reach of their content.
Read the Report
Identifying Local News Outlets

Three Body Problem Language Model

How can we learn about text from vast quantities of unlabeled documents? Researchers from fast.ai and UW suggest that deep recurrent language models learn useful word representations for an array of NLP tasks. This project was my intro to PyTorch, with re-usable code for pre-processing text, loading data, and initializing, training, and evaluating bi-directional LSTM neural networks.
Jupyter Notebook

Reverse Image Search

A demonstration of a simple, robust, and scalable reverse image search engine that leverages features from convolutional neural networks and the distance returned from the K-nearest neighbors algorithm.
Jupyter Notebook
Presentation at PyData 2017
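
A minimal sketch of the nearest-neighbor lookup behind a reverse image search, not the notebook's actual code. It assumes each image has already been reduced to a feature vector by a convolutional network; the toy vectors, file names, and the `nearest_images` helper are hypothetical, and a brute-force Euclidean scan stands in for a proper K-nearest neighbors index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def nearest_images(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k indexed images closest to the query vector.

    Brute-force scan: compute the Euclidean distance from the query to
    every stored vector, then keep the k smallest.
    """
    scored = [(math.dist(query, vec), name) for name, vec in index.items()]
    scored.sort()
    return [(name, dist) for dist, name in scored[:k]]


# Hypothetical 3-dimensional vectors standing in for CNN features,
# which in practice have hundreds or thousands of dimensions.
toy_index = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "dog_1.jpg": [0.1, 0.9, 0.2],
}

results = nearest_images([0.9, 0.1, 0.05], toy_index, k=2)
for name, dist in results:
    print(f"{name}\t{dist:.3f}")
```

The returned distance doubles as a similarity score: a small distance suggests a near-duplicate, while a large one suggests the query image is not in the collection.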

Fwd: My Great New Friend

What cultural biases do ML algorithms pick up on? I trained a character-level recurrent neural network with one Long Short-Term Memory layer on 2000 emails from the Enron corpus to finish lines in a love poem. The model picked up corporate culture and rambled endlessly about "the company" and "compensation". In collaboration with Constant Dullaart and Rhizome. Presented at the New Museum for The Making of Natural Language.
Jupyter Notebook

Are US Legislators Ideologically Polarized?

A time-series visualization of legislator voting history using DW-NOMINATE, a measure of where legislators fall on the liberal-conservative spectrum.
Jupyter Notebook
Ideological Polarization of Congress JFK-2014

Who is on the Receiving End of Tax-Payer Dollars?

Government contracts are available to the public on USASpending.gov. In this notebook I show how to download records and aggregate financial data for the largest private prison systems in the US. This may someday become a Twitter bot.
Jupyter Notebook 1 2 3
Plotly CoreCivic contracts by state

What Research Does the NSF Support (and How Much)?

NSF grants are available to the public and contain rich metadata. For this project, I ingest XML files into SQLite tables to power dashboards and wordclouds. I look into the funding history and ongoing projects of several notable oceanographers.
Jupyter Notebook
Plot.ly 1 2
d3.js Network Graph (a welcome mistake)
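
A minimal sketch of the XML-to-SQLite step, using only the Python standard library. The element names (`Award`, `AwardID`, `AwardTitle`, `AwardAmount`), the sample record, and the `awards` table layout are assumptions for illustration and may not match the real grant files or the notebook's schema.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# Hypothetical record; real files carry many more fields.
SAMPLE_XML = """<rootTag>
  <Award>
    <AwardID>0000001</AwardID>
    <AwardTitle>Example Ocean Mixing Study</AwardTitle>
    <AwardAmount>250000</AwardAmount>
  </Award>
</rootTag>"""


def parse_award(xml_text: str) -> Tuple[str, str, int]:
    """Pull the id, title, and dollar amount out of one award document."""
    award = ET.fromstring(xml_text).find("Award")
    return (
        award.findtext("AwardID"),
        award.findtext("AwardTitle"),
        int(award.findtext("AwardAmount")),
    )


def load_awards(conn: sqlite3.Connection, xml_docs: Iterable[str]) -> None:
    """Create the table if needed and upsert one row per document."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS awards "
        "(award_id TEXT PRIMARY KEY, title TEXT, amount INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO awards VALUES (?, ?, ?)",
        [parse_award(doc) for doc in xml_docs],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_awards(conn, [SAMPLE_XML])
total = conn.execute("SELECT SUM(amount) FROM awards").fetchone()[0]
print(total)
```

Once the rows are in SQLite, aggregate queries such as total funding per year or per investigator can feed a dashboard directly.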

Open Source Software


Despite having the largest userbase amongst American adults, YouTube is a social media platform that is often overlooked in academic research. youtube-data-api is a Python client to make this data source more accessible, while introducing new applications and methods to analyze this platform.
Github Repo
PyPi Page


urlExpander is a Python package for quickly and thoroughly expanding shortened URLs. Marketing and analytics services like bit.ly are great for tracking engagement. However, these services obfuscate the destination of URLs for social media analysts.
Jupyter Notebook Quickstart
Github Repo
PyPi Page
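
A minimal sketch of the underlying idea using only the Python standard library. This is not urlExpander's API: the `SHORTENER_DOMAINS` set and the `is_shortened` and `expand` helpers are hypothetical, and the approach simply follows HTTP redirects to find where a short link lands.

```python
import urllib.request
from urllib.parse import urlparse

# Hypothetical, incomplete list of link-shortening domains.
SHORTENER_DOMAINS = {"bit.ly", "t.co", "goo.gl", "tinyurl.com"}


def is_shortened(url: str) -> bool:
    """Check whether the URL's host is a known link shortener."""
    host = urlparse(url).hostname or ""
    return host in SHORTENER_DOMAINS


def expand(url: str, timeout: float = 10.0) -> str:
    """Return the final destination of a URL after redirects.

    URLs that are not on a known shortener are returned unchanged.
    urlopen follows redirects by default, so the response's URL is the
    final destination. Network errors propagate to the caller.
    """
    if not is_shortened(url):
        return url
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()
```

Checking the domain first avoids a network request for links that are already fully expanded, which matters when processing millions of URLs from social media data.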

S3 Helper

A high-level Python AWS-cli wrapper to smooth workflows with private data stored on S3 cloud storage. This Jupyter notebook showcases the module's ability to stream CSV and JSON files to pandas DataFrames, and save scikit-learn models to S3 buckets.
Jupyter Notebook Tutorial
Github Repo
PyPi Page

Data Pipes and Web Scrapers

Coming soon!

  • Local News Dataset
  • Linktree
  • Clarkesworld
  • Much more!

Software and analyses adopt historical mistakes and bias. If any of the projects are dated or contain inaccuracies, please let me know via email or an issue on GitHub :)



    Get in Touch

    hello [at] {this-domain}

    Especially if you're interested in:

    1. 🙈 Backdoor APIs and auditing algorithms.
    2. 🙊 Mischief and metadata.
    3. 🙉 Methods of studying the information ecosystem.