Leon Yin

Solving puzzles across domains, byte-by-byte.


I'm a Data Scientist at the Social Media and Political Participation lab at NYU, and a research affiliate with Data & Society's Media Manipulation team. I write software to collect, analyze, and model social movements and misinformation online. I am broadly interested in cross-platform analysis and NLP. Previously, I wrote scientific software at NASA and wrangled data for Sony.


Academic pursuits (by progress descending)
   Image-Based Anomaly Detection
   Platforms and Protests
   Language Modelling
Writing barely legible for computers
   Scroll to the next section, fellow bot.

Open Projects

Three Body Problem Language Model

How can we learn about text from vast quantities of unlabeled documents? Researchers from fast.ai and UW suggest that language models learn word representations that hold useful information for an array of NLP tasks. This project was my genesis into PyTorch, with reusable code for pre-processing text, loading data, initializing, training, and evaluating bi-directional recurrent neural networks, and saving word embeddings.
Jupyter Notebook

Reverse Image Search

A demonstration of a simple, robust, and scalable reverse image search engine that leverages features from Convolutional Neural Networks and the distance returned from the K-Nearest Neighbors algorithm.
Jupyter Notebook
Presentation at PyData 2017
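
As a rough illustration of the nearest-neighbor step (not the notebook's actual code), the sketch below ranks a tiny index of feature vectors by cosine distance to a query vector. The file names and three-dimensional vectors are made up for the example; in practice the vectors would come from a CNN layer and have hundreds or thousands of dimensions, and a library implementation of K-Nearest Neighbors would replace the brute-force loop.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=3):
    """Brute-force search: return the k (distance, name) pairs closest to query."""
    scored = [(cosine_distance(query, vec), name) for name, vec in index.items()]
    return sorted(scored)[:k]


# Hypothetical feature vectors standing in for CNN activations.
index = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "dog_1.jpg": [0.1, 0.9, 0.2],
}

results = nearest_neighbors([0.85, 0.15, 0.05], index, k=2)
for distance, name in results:
    print(f"{name}: {distance:.4f}")
```

The returned distance doubles as a confidence signal: a small distance suggests a near-duplicate, while a large one suggests the query image is not in the index.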

Fwd: My Great New Friend

I trained a character-level recurrent neural network, with one Long Short-Term Memory layer, on 2,000 emails from the Enron corpus to finish lines in a love poem. The model picked up corporate culture and rambled endlessly about "the company" and "compensation". In collaboration with Constant Dullaart and Rhizome. Presented at the New Museum for The Making of Natural Language.
Jupyter Notebook

Are US Legislators Ideologically Polarized?

A time-series visualization of legislator voting history using DW-NOMINATE, a metric of the liberal-conservative spectrum.
Jupyter Notebook
Ideological Polarization of Congress JFK-2014

Who is on the Receiving End of Tax-Payer Dollars?

Government contracts are available to the public on USASpending.gov. In this notebook I show how to download records and aggregate financial data from the largest private prison systems in the US. This may someday become a Twitter bot.
Jupyter Notebook 1 2
Plotly CoreCivic contracts by state

What Research Does the NSF Support (and How Much)?

NSF grants are available to the public and contain rich metadata. For this project, I ingest XML files into SQLite tables to power dashboards and word clouds. I look into funding history and ongoing projects from several notable oceanographers.
Jupyter Notebook
Plot.ly 1 2
d3.js Network Graph (a welcome mistake)
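
The XML-to-SQLite step could look roughly like the sketch below, which uses only the Python standard library. The element names (`Award`, `AwardID`, `AwardTitle`, `AwardAmount`), the table layout, and the sample record are assumptions made for illustration, not the schema or data used in the notebook.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up record; the tag names are assumed, not taken from the notebook.
SAMPLE_XML = """
<rootTag>
  <Award>
    <AwardID>0000001</AwardID>
    <AwardTitle>Hypothetical Ocean Study</AwardTitle>
    <AwardAmount>125000</AwardAmount>
  </Award>
</rootTag>
"""


def parse_award(xml_text):
    """Pull the fields of interest out of one award document."""
    award = ET.fromstring(xml_text).find("Award")
    return (
        award.findtext("AwardID"),
        award.findtext("AwardTitle"),
        int(award.findtext("AwardAmount")),
    )


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE awards (award_id TEXT PRIMARY KEY, title TEXT, amount INTEGER)"
)
conn.execute("INSERT INTO awards VALUES (?, ?, ?)", parse_award(SAMPLE_XML))
conn.commit()

# Once loaded, aggregate questions become ordinary SQL queries.
total = conn.execute("SELECT SUM(amount) FROM awards").fetchone()[0]
print(total)
```

With many files, the same `parse_award` call would run once per document and the rows could be inserted in bulk with `executemany`.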

What Can We Learn From the Poles About Climate Change?

I explore 15 years of biogeochemical seawater measurements along the sample grid of Antarctica's Palmer Station. I analyze spatiotemporal variation within the water column, and calculate mixed layer depth and net community production.
Jupyter Notebook

What Makes a Drop of Seawater Unique?

The answer is in its chemistry! If we inspect the relationship between isotope-enrichment and salinity, we can trace the droplet to its landfall origin. Scientists have been collecting this data as far back as 1949. However, it existed in disparate sources. In 1999, a group of scientists led by Gavin Schmidt centralized these sources. A decade and a half later, I contributed by building a data pipeline, performing anomaly detection, and making some visualizations. I did not know it at the time, but this was my genesis into data science.
Jupyter Notebook
Presentation Poster at AGU 2015
d3.js map

Open Source Software

S3 4 me

A high-level Pythonic AWS S3 wrapper to smooth workflows with private data. This Jupyter notebook showcases the module's ability to stream CSV and JSON files to pandas DataFrames, and save scikit-learn models to S3 buckets. It just got listed on PyPI, so please give it a whirl!
Jupyter Notebook Tutorial
Github Repo
PyPi Page


A low-level JSON parser to work with large files (like Twitter dumps) that don't fit in memory. This software uses generators and streaming (de)compression to transform so-called "big data" problems into smart data problems. Below is a tutorial on how to use SmappDragon to analyze links from questionable media sources listed on opensources.co.
Jupyter Notebook Tutorial
Github Repo
PyPi Page
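
The generator-plus-streaming-decompression pattern described above can be sketched with the standard library alone. This is not SmappDragon's API: the function names, the `urls` field, and the in-memory gzip buffer are all hypothetical stand-ins, and the input is assumed to be gzip-compressed, newline-delimited JSON.

```python
import gzip
import io
import json


def stream_json_lines(fileobj):
    """Yield one parsed record at a time from gzip-compressed JSON lines."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def extract_links(records):
    """Yield every URL found in the (hypothetical) 'urls' field of each record."""
    for record in records:
        for url in record.get("urls", []):
            yield url


# Build a small compressed buffer in place of a large file on disk.
sample_records = [
    {"id": 1, "urls": ["http://example.com/a"]},
    {"id": 2, "urls": []},
    {"id": 3, "urls": ["http://example.com/b", "http://example.com/c"]},
]
raw = "\n".join(json.dumps(r) for r in sample_records)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

# The generators are chained, so only one record is held in memory at a time.
links = list(extract_links(stream_json_lines(buffer)))
print(links)
```

Because each stage is lazy, memory use stays roughly constant regardless of file size; only the final `list(...)` call materializes results, and it could be replaced by a loop that writes links out as they arrive.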

Data Pipes and Web Scrapers

Coming soon!

Technology adopts historical mistakes and bias.
If any of the projects have room to improve, please let me know via email or GitHub :)
The next section contains a Javascript app that cycles through a collection of quotes I like.



Get in Touch


Especially if you're interested in:

  1. The spread of images and memes across platforms.
  2. Machine learning for social science.
  3. Mis/disinformation on the web.

Stay Updated

I occasionally write about the birds and the bees.