Leon Yin

Solving puzzles across domains, byte-by-byte.

About

I am a Data Scientist at the Social Media and Political Participation lab at NYU and a research affiliate at Data & Society's Media Manipulation team. Previously, I wrote scientific software at NASA, and wrangled data for Sony.

Research

Academic pursuits (by progress descending)
   Link Analysis on Social Media.
   Detecting Political Polarity in YouTube Transcripts.
   Containment + Detection of Disinfo Campaigns.
   Language Modelling with Recurrent Neural Nets.
For research outputs and past projects...
   Scroll to the next section, fellow bot.

Projects

Three Body Problem Language Model

How can we learn about text from vast quantities of unlabeled documents? Researchers from fast.ai and UW suggest that deep recurrent language models learn useful word representations for an array of NLP tasks. This project was my introduction to PyTorch, with reusable code for pre-processing text, loading data, and initializing, training, and evaluating bi-directional LSTM neural networks.
Jupyter Notebook

Reverse Image Search

A demonstration of a simple, robust, and scalable reverse image search engine that leverages features from convolutional neural networks and the distance returned from the K-nearest neighbors algorithm.
Jupyter Notebook
Presentation at PyData 2017
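
The nearest-neighbor lookup described above can be sketched in a few lines. This is an illustration only, not the notebook's code: it assumes the CNN features have already been extracted, substitutes random vectors for them, and uses a brute-force Euclidean search. The function name and image ids are hypothetical.

```python
import heapq
import math
import random


def nearest_neighbors(index_features, query_feature, k=3):
    """Return the k (distance, image_id) pairs closest to the query, nearest first."""
    scored = (
        (math.dist(vector, query_feature), image_id)
        for image_id, vector in index_features.items()
    )
    return heapq.nsmallest(k, scored)


# Stand-in for CNN features: 100 random 64-dimensional vectors (made-up data).
random.seed(0)
features = {
    f"img_{i}": [random.gauss(0, 1) for _ in range(64)] for i in range(100)
}

# Query with a slightly perturbed copy of one indexed image.
query = [x + random.gauss(0, 0.01) for x in features["img_42"]]
matches = nearest_neighbors(features, query, k=3)
```

The returned distances double as a rough similarity score: a small distance suggests a near-duplicate, while a large one suggests the query has no close match in the index.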

Fwd: My Great New Friend

What artifacts of culture do ML algorithms pick up on? I trained a character-level recurrent neural network with one Long Short-Term Memory layer on 2000 emails from the Enron corpus to finish lines in a love poem. The model picked up corporate culture and rambled endlessly about "the company" and "compensation". In collaboration with Constant Dullaart and Rhizome. Presented at the New Museum for The Making of Natural Language.
Jupyter Notebook
Poem

Are US Legislators Ideologically Polarized?

A timeseries visualization of legislator voting history using DW-Nominate, a metric of the liberal-conservative spectrum.
Jupyter Notebook
Ideological Polarization of Congress JFK-2014

Who is on the Receiving End of Tax-Payer Dollars?

Government contracts are available to the public on USASpending.gov. In this notebook I show how to download records and aggregate financial data for the largest private prison systems in the US. This may someday become a Twitter bot.
Jupyter Notebook 1 2
Plotly CoreCivic contracts by state

What Research Does the NSF Support (and How Much)?

NSF grants are available to the public and contain rich metadata. For this project, I ingest XML files into SQLite tables to power dashboards and wordclouds. I look into the funding history and ongoing projects of several notable oceanographers.
Jupyter Notebook
Plot.ly 1 2
d3.js Network Graph (a welcome mistake)
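
The XML-to-SQLite step described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the element names (`award`, `id`, `title`, `amount`), the table layout, and the sample records are all made up, and the real NSF files may be structured differently.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up sample records; the real NSF schema is an assumption left unmodeled here.
SAMPLE_XML = """<awards>
  <award><id>A1</id><title>Example Ocean Study</title><amount>250000</amount></award>
  <award><id>A2</id><title>Example Reef Survey</title><amount>150000</amount></award>
</awards>"""


def load_awards(xml_text, conn):
    """Parse award elements and upsert them into an `awards` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS awards "
        "(award_id TEXT PRIMARY KEY, title TEXT, amount INTEGER)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (a.findtext("id"), a.findtext("title"), int(a.findtext("amount")))
        for a in root.iter("award")
    ]
    conn.executemany("INSERT OR REPLACE INTO awards VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_awards(SAMPLE_XML, conn)

# Once the data is in SQLite, aggregates for a dashboard are a single query.
total = conn.execute("SELECT SUM(amount) FROM awards").fetchone()[0]
```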



Open Source Software

urlExpander

urlExpander is a Python package for quickly and thoroughly expanding shortened URLs. Marketing and analytics services like bit.ly are great for tracking engagement. However, these services obfuscate the destination of URLs for social media analysts.
Jupyter Notebook Quickstart
Github Repo
PyPi Page

S3 4 me

A high-level Pythonic AWS S3 wrapper to smooth workflows with private data. This Jupyter notebook showcases the module's ability to stream CSV and JSON files to pandas DataFrames, and save scikit-learn models to S3 buckets. It just got listed on PyPI, so please give it a whirl!
Jupyter Notebook Tutorial
Github Repo
PyPi Page

SmappDragon

A low-level JSON parser to work with large files (like Twitter dumps) that don't fit in memory. This software utilizes generators and streaming (de)compression to transform so-called "big data" problems into smart data problems. Below is a tutorial on how to use SmappDragon to analyze links from questionable media sources from opensources.co.
Jupyter Notebook Tutorial
Github Repo
PyPi Page
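
The generator-plus-streaming-decompression idea can be illustrated with the standard library. This sketch does not use SmappDragon's API: the function names, the one-JSON-object-per-line layout, and the `urls` key are assumptions made for the example.

```python
import gzip
import json
import os
import tempfile


def stream_json_lines(path):
    """Yield one parsed JSON object per line without loading the whole file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def extract_links(records):
    """Yield every URL under the (hypothetical) 'urls' key of each record."""
    for record in records:
        yield from record.get("urls", [])


# Write a tiny compressed sample so the sketch runs end to end.
sample = [
    {"id": 1, "urls": ["http://example.com/a"]},
    {"id": 2, "urls": []},
    {"id": 3, "urls": ["http://example.com/b", "http://example.com/c"]},
]
with tempfile.NamedTemporaryFile(suffix=".json.gz", delete=False) as tmp:
    sample_path = tmp.name
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for record in sample:
        handle.write(json.dumps(record) + "\n")

# Only one record is decompressed and parsed at a time.
links = list(extract_links(stream_json_lines(sample_path)))
os.remove(sample_path)
```

Because each stage is a generator, memory use stays roughly constant regardless of file size; only the final `list(...)` call materializes results, and it could be replaced by a running count or a write to disk.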



Data Pipes and Web Scrapers

Coming soon!

  • Local News Dataset
  • YouTube Data API
  • Red Hen Yelp Campaign
  • And so much more!



  • Technology adopts historical mistakes and bias.
    If any of the projects have room to improve, please let me know via email or an issue on GitHub :)
    The next section contains a JavaScript app that cycles through a collection of quotes I like.

    Get in Touch

    @leonyin
    hello[at]{this-domain}
    pgp.txt

    Especially if you're interested in:

    1. The spread of images and memes across platforms.
    2. Machine learning for social science.
    3. Mis/disinformation on the web.


    Stay Updated

    I occasionally write about the burts and the bees.