In this tutorial we'll use the following functions:

urlexpander.tweet_utils.get_link
urlexpander.expand
urlexpander.tweet_utils.count_matrix
urlexpander.html_utils.get_webpage_meta

Software for this tutorial is found in this requirements.txt
file, and can be downloaded as follows:
pip install -r requirements.txt
Download data here:
python download_data.py
NOTE: at the time of this writing, download_data.py
does not work! Please go to OSF in the meantime.
If you have yet to look at the backend of a Tweet here you go: https://bit.ly/tweet_anatomy_link.
Today I'll show you how to extract and work with links from Tweets. We're working with Tweets from members of congress collected by Greg Eady.
import os
import json
import glob
import itertools
from multiprocessing import Pool
from tqdm import tqdm
import pandas as pd
import urlexpander
from smappdragon import JsonCollection
from config import INTERMEDIATE_DIRECTORY, \
RAW_TWEETS_DIRECTORY, \
CONGRESS_METADATA_DIRECTORY
Let's preview one file. The file is saved as a newline-delimited json file like this
{"tweet_id" : "123", "more_data" : {"here" : "it is"}
{"tweet_id" : "124", "more_data" : {"here" : "it is again"}
and bzip2 compressed!
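For intuition, here is a sketch of how you could read such a file with only the standard library (the tutorial uses smappdragon below, so this is purely illustrative):

```python
import bz2
import json

def iter_ndjson_bz2(path):
    """Yield one parsed JSON object per line of a bzip2-compressed,
    newline-delimited JSON file."""
    with bz2.open(path, mode='rt') as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```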
f
'data/tweets-raw/1089859058__2018-03.json.bz2'
The file structure is identical to how we store all Tweet data at SMaPP. We developed software like smappdragon to work with Tweets:
collect = JsonCollection(f, compression='bz2',
throw_error=False,
verbose=1)
smappdragon's JsonCollection
class reads through JSON files as a generator.
collect
<smappdragon.collection.json_collection.JsonCollection at 0x10a495ba8>
The con is that generators are hard to inspect; the pro is that they don't hold the full dataset in memory. We access the data on a row-by-row basis by iterating through the collect
object. Here we only read the first row, and we can see its contents by printing row:
for row in collect.get_iterator():
print(json.dumps(row, indent=1)[:100])
break
{ "created_at": "Wed Mar 21 17:53:40 +0000 2018", "id": 976517212063322112, "id_str": "9765172120
Each Tweet can have more than one link, so we need to unpack those values! urlexpander has a function to do just this: urlexpander.tweet_utils.get_link()
Once again we have another generator
# returns a generator, which is uninterpretable!
urlexpander.tweet_utils.get_link(row)
<generator object get_link at 0x11b8a3f10>
# we can access the data by iterating through it.
for link_meta in urlexpander.tweet_utils.get_link(row):
print(link_meta)
{'user_id': 1089859058, 'tweet_id': 976517212063322112, 'tweet_created_at': 'Wed Mar 21 17:53:40 +0000 2018', 'tweet_text': None, 'link_url_long': 'http://bit.ly/2FC7bMz', 'link_domain': 'bit.ly', 'link_url_short': 'https://t.co/5P1JAaxwQV'}
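Conceptually, get_link walks each Tweet's entities.urls field. The real function also handles retweets, quotes, and media; this toy version (the field coverage here is my simplification, not the library's actual logic) just shows the gist:

```python
def extract_links(tweet):
    """Toy sketch of a get_link-style helper: yield one record
    per URL found in a Tweet's entities."""
    for url in tweet.get('entities', {}).get('urls', []):
        yield {
            'user_id': tweet['user']['id'],
            'tweet_id': tweet['id'],
            'link_url_short': url.get('url'),
            'link_url_long': url.get('expanded_url'),
        }
```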
def read_file_extract_links(f):
'''
This function takes in a Tweet file that is bzip2-compressed,
newline-delimited JSON, and returns a list of dictionaries
for link data.
'''
# read the json file into a generator
collection = JsonCollection(f, compression='bz2', throw_error=False)
# iterate through the json file, extract links, flatten the generator of links
# into a list, and store into a Pandas dataframe
df_ = pd.DataFrame(list(
itertools.chain.from_iterable(
[urlexpander.tweet_utils.get_link(t)
for t in collection.get_iterator()
if t]
)))
df_['file'] = f
return df_.to_dict(orient='records')
We can iterate through files and run the function on each one. From there, we can instantiate a Pandas DataFrame.
data = []
for f in tqdm(files[:2]):
# read the json file into a generator
data.extend(read_file_extract_links(f))
df_links = pd.DataFrame(data)
100%|██████████| 2/2 [00:01<00:00, 1.03s/it]
The for loop is slow, and this task is not memory-intensive, so we can parallelize it using the Pool
class from the multiprocessing package.
Each core on our computer will read a JSON file of Tweets and filter for links.
data = []
with Pool(4) as pool:
iterable = pool.imap_unordered(read_file_extract_links, files)
for link_data in tqdm(iterable, total=len(files)):
data.extend(link_data)
# link meta -> dataframe
df_links = pd.DataFrame(data)
# save it
df_links.to_csv(file_raw_links, index=False)
# preview it
df_links.head(2)
file | link_domain | link_url_long | link_url_short | tweet_created_at | tweet_id | tweet_text | user_id | |
---|---|---|---|---|---|---|---|---|
0 | /scratch/olympus/projects/mediascore/Data/json... | frc.org | https://www.frc.org/wwlivewithtonyperkins/rep-... | https://t.co/l9dXT0L7oT | Fri Mar 23 14:38:34 +0000 2018 | 977192888781168640 | nan | 2966758114 |
1 | /scratch/olympus/projects/mediascore/Data/json... | thehill.com | http://thehill.com/379188-watch-fund-governmen... | https://t.co/YbdvepWNQ3 | Thu Mar 22 15:21:32 +0000 2018 | 976841314024206339 | nan | 2966758114 |
The bulk of URLs we encounter in the wild are sent through a link shortener. For this dataset, 30% of all URLs come from known link shorteners:
len(df_links[df_links['link_domain'].isin(short_domains)]) \
/ len(df_links)
0.2965885363056201
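Note that short_domains isn't defined above; in practice it comes from urlexpander's list of known shorteners. A hand-curated stand-in (abbreviated, hypothetical list) shows the computation:

```python
import pandas as pd

# hypothetical, heavily abbreviated list of known shortener domains
short_domains = ['bit.ly', 'goo.gl', 'ow.ly', 'tinyurl.com']

toy = pd.DataFrame({'link_domain': ['bit.ly', 'thehill.com',
                                    'goo.gl', 'frc.org']})
# fraction of links whose domain is a known shortener
share_short = toy['link_domain'].isin(short_domains).mean()
# → 0.5 for this toy frame
```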
Link shorteners record transactional information whenever a shortened link is clicked. Unfortunately, shortening also makes it hard for us to see what was actually being shared.
links = df_links['link_url_long'].tolist()
links[-5:]
['http://goo.gl/kDUwP', 'http://bit.ly/12clU3p', 'http://nyti.ms/Z4rdlU', 'http://goo.gl/LxkrY', 'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html']
This is why urlexpander
was made. We can run the expand
function on single URLs, as well as a list of URLs.
urlexpander.expand(links[-5])
'https://degette.house.gov/index.php?option=com_content&view=article&id=1260:congressional-lgbt-equality-caucus-praises-re-introduction-of-employment-non-discrimination-act&catid=76:press-releases-&Itemid=227'
By default, urlexpander will expand every URL it is shown. However, you can pass a boolean function (one that returns True or False for an input string) to the filter_function
parameter.
urlexpander.expand(links[-5:], filter_function=urlexpander.is_short)
['https://degette.house.gov/index.php?option=com_content&view=article&id=1260:congressional-lgbt-equality-caucus-praises-re-introduction-of-employment-non-discrimination-act&catid=76:press-releases-&Itemid=227', 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html', 'http://nyti.ms/Z4rdlU', 'http://www.civiccenterconservancy.org/history-2012-nhl-designation_25.html', 'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html']
Under the hood, expand works in the following steps:

['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'bit.ly/cbc23']
--> Remove duplicates
['abc.com/123', 'bbc.co.uk/123', 'bit.ly/cbc23']
--> Filter for shortened URLs (optional)
['bit.ly/cbc23']
--> Check the cache file (optional), unshorten new URLs
[{'original_url': 'bit.ly/cbc23', 'resolved' : 'cspan.com/123'}]
--> Swap short URLs for full URLs
['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'cspan.com/123']
urlexpander parallelizes, filters and caches the input, which is essential for social media data.
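Those steps can be sketched in a few lines of Python. The real expand adds threading, chunking, and an on-disk cache; here resolver is a stand-in for the network call:

```python
def expand_sketch(urls, resolver, is_short, cache=None):
    """Dedupe, filter for shorteners, consult the cache, resolve what's
    left, then map resolved URLs back onto the original input order."""
    cache = {} if cache is None else cache
    for url in set(urls):                       # remove duplicates
        if is_short(url) and url not in cache:  # filter + cache check
            cache[url] = resolver(url)          # network call in real life
    return [cache.get(url, url) for url in urls]
```

With the toy list above, only the bit.ly link hits the (fake) resolver; everything else passes through untouched.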
?urlexpander.expand
Signature:
urlexpander.expand(links_to_unshorten, chunksize=1280, n_workers=1,
                   cache_file=None, random_seed=303, verbose=0,
                   filter_function=None, **kwargs)

Docstring:
Calls expand with multiple (``n_workers``) threads to unshorten a list of urls. Unshortens all urls by default, unless one sets a ``filter_function``.

:param links_to_unshorten: (list, str) either an individual url or a list of urls to unshorten
:param chunksize: (int) chunks links_to_unshorten, which makes computation quicker with larger inputs
:param n_workers: (int) how many threads
:param cache_file: (str) a path to a json file to read and write results in
:param random_seed: (int) initializes the random state for shuffling the input
:param verbose: (int) whether to print updates and errors. 0 is silent. 1 is progress bar. 2 is progress bar and errors.
:param filter_function: (func) a boolean used to filter url shorteners out
:returns: (list) a list of resolved urls

File: ~/anaconda3/lib/python3.6/site-packages/urlexpander/core/api.py
Type: function
The above is a toy example with 5 links; let's see how this works for 1.7 million links. For reference, this took me an hour on HPC.
resolved_urls = urlexpander.expand(links,
filter_function=urlexpander.is_short,
n_workers=64,
chunksize=1280,
cache_file=file_cache,
verbose=1)
len(resolved_urls)
1700150
df_links['link_resolved'] = resolved_urls
df_links['link_resolved_domain'] = df_links['link_resolved'].apply(urlexpander.get_domain)
df_links.to_csv(file_expanded_links, index=False)
With the links resolved, how can we use links as data?
df_links = pd.read_csv(file_expanded_links)
df_links.head(2)
file | link_domain | link_url_long | link_url_short | tweet_created_at | tweet_id | tweet_text | user_id | link_resolved | link_resolved_domain | |
---|---|---|---|---|---|---|---|---|---|---|
0 | /scratch/olympus/projects/mediascore/Data/json... | frc.org | https://www.frc.org/wwlivewithtonyperkins/rep-... | https://t.co/l9dXT0L7oT | Fri Mar 23 14:38:34 +0000 2018 | 977192888781168640 | nan | 2966758114 | https://www.frc.org/wwlivewithtonyperkins/rep-... | frc.org |
1 | /scratch/olympus/projects/mediascore/Data/json... | thehill.com | http://thehill.com/379188-watch-fund-governmen... | https://t.co/YbdvepWNQ3 | Thu Mar 22 15:21:32 +0000 2018 | 976841314024206339 | nan | 2966758114 | https://thehill.com/379188-watch-fund-governme... | thehill.com |
We can get an overview of the most frequently shared domains:
df_links['link_resolved_domain'].value_counts().head(15)
twitter.com           255532
house.gov             199218
youtube.com            93986
facebook.com           90061
senate.gov             78645
washingtonpost.com     29886
instagram.com          28460
nytimes.com            25014
thehill.com            22925
politico.com           13488
foxnews.com            12045
cnn.com                11611
wsj.com                11289
twimg.com               9633
ow.ly                   9463
Name: link_resolved_domain, dtype: int64
We can also look at the contents of each URL. Twitter provides URL metadata (if you pay), but we provide a workaround!
url = 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html'
meta = urlexpander.html_utils.get_webpage_meta(url)
meta
OrderedDict([('url', 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html'), ('title', 'House Hydro Bill Tests Water for Broad Energy Deals'), ('description', ' In February, the House did something rare: It passed an energy bill unanimously. Unlike the previous Congress’ standard fare of anti-EPA, pro-drilling measures, the first energy bill of the 113th Congress promoted small-scale hydropower projects and the electrification of existing dams.'), ('paragraphs', ['', '', 'In February, the House did something rare: It passed an energy bill unanimously. Unlike the previous Congress’ standard fare of anti-EPA, pro-drilling measures, the first energy bill of the 113th Congress promoted small-scale hydropower projects and the electrification of existing dams.', 'In other words, the Republican-controlled House passed a clean-energy bill.', 'Of course, few Republicans could object on ideological grounds to legislation that aimed to expedite or remove regulatory requirements for expanding hydropower facilities. And it certainly helped that the chamber passed similar legislation in 2012 by a large margin. ', 'Nevertheless, hydropower proponents say that developing the resource represents “low-hanging fruit” that members tired of the partisanship that has permeated energy policy can all agree is worth advancing to President Barack Obama’s desk. Hydropower represented nearly two-thirds — the largest share by far — of domestic renewable-energy production and 8 percent of total U.S. electricity generation in 2011, according to the Energy Information Administration. 
More than half of that total powered the Pacific Northwest region, and hydro is one of the few renewable resources that can provide baseload electricity — power that is available at all times — to the grid.', 'But just 3 percent of the 80,000 dams in the United States generate power, representing great potential for growing the resource, according to legislation championed by Reps. Cathy McMorris Rodgers, R-Wash., and Diana DeGette, D-Colo.', '“If you can work on regulatory reform for those projects, then you can have small hydro throughout this country,” DeGette said. ', 'The House lawmakers’ legislation would let small hydroelectric facilities generating up to 10 megawatts of power bypass Federal Energy Regulatory Commission licensing requirements that currently apply to projects producing more than 5 megawatts. The bill also would require FERC to study the feasibility of carrying out a two-year hydropower licensing pilot program at unpowered dams and would allow the commission to extend preliminary permits for two additional years. ', 'Jeff Leahey, government affairs director at the National Hydropower Association, said House leaders probably moved the bill so they could promote energy legislation that “checked the boxes” on encouraging the development of a resource that is renewable and reliable. It also helped that the bill — along with another measure passed April 10 that would designate an Interior Department agency as the lead regulator of small federal conduits — moved through the House last Congress and didn’t need much additional work, he said.', 'A significant factor in the refocused spotlight on the “original renewable” is the new leadership on the Senate Energy and Natural Resources Committee and its representation of key hydropower-producing states. 
Chairman Ron Wyden, D-Ore., promised industry representatives at the hydropower association’s annual conference this week that he plans to “quickly” mark up hydropower legislation after a panel hearing Tuesday. ', 'Wyden attributes the rise in hydro’s profile to better environmental stewardship on the part of facility operators and a more cooperative relationship between hydropower lobbyists and environmental groups focused on protecting river ecosystems. The effect of dams on fisheries and riverine habitat, as well as operational costs, has compelled organizations to promote the removal of dams in some instances.', '“Hydro’s environmental performance has improved dramatically,” Wyden said. ', 'Association President David Moller of Pacific Gas and Electric Co. acknowledged the role that historically low natural-gas prices have played in limiting hydropower expansion in recent years. But he said opportunities for hydropower still flourish because of state renewable portfolio mandates, coal-fired power plants being pushed into retirement and technological advances in powering existing dams and water channels.', '“The price of natural gas has dropped, but it will never match hydropower’s fuel price of zero, or its attributes of being renewable and non-carbon-based,” he said. ', 'Hydro proponents in the private sector and in Congress said this week that they will continue to promote hydropower development, possibly in future legislation. That could include examining additional regulatory issues that contribute to long lead times for completing electrified projects or adjusting current benefits that exist for clean power in the tax code. 
Prospects for he latter — which would involve extending the production tax credit for a multiyear period or making it permanent as Obama proposes — are dim outside a comprehensive tax code overhaul.', '“I think at the end of the day, it’s all about making sure that hydropower and the benefits that come from hydropower projects are competitive in the marketplace with other energy projects,” Leahey said. ', 'Whether the bipartisan camaraderie that has surrounded promoting hydropower can translate to moving broader energy legislation is anyone’s guess. But members close to the debate express optimism that the current spate of legislative action could beget compromise in the future.', '“I’m not sure we’re any closer to that comprehensive policy, but I’d think that common ground on these issues like hydro can only be helpful,” DeGette said.', '', '×', '$${CardTitle}', '$${CardTitle}'])])
If we want to know more about what kinds of information are being shared by members of congress, we can enrich the dataset by joining metadata about domains. In this example we will use the Local News Dataset to inspect the local media outlets that members of congress share:
local_news_url = 'https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018.csv'
df_local = pd.read_csv(local_news_url)
df_local = df_local[~(df_local.domain.isnull()) &
(df_local.domain != 'facebook.com')]
df_local.head()
name | state | website | domain | youtube | owner | medium | source | collection_date | |||
---|---|---|---|---|---|---|---|---|---|---|---|
0 | KWHE | HI | http://www.kwhe.com/ | kwhe.com | NaN | NaN | NaN | LeSea | TV station | stationindex | 2018-08-02 14:55:24.612585 |
1 | WGVK | MI | http://www.wgvu.org/ | wgvu.org | NaN | NaN | NaN | Grand Valley State University | TV station | stationindex | 2018-08-02 14:55:24.612585 |
3 | KTUU | AK | http://www.ktuu.com/ | ktuu.com | NaN | NaN | NaN | Schurz Communications | TV station | stationindex | 2018-08-02 14:55:24.612585 |
4 | KTBY | AK | http://www.ktbytv.com/ | ktbytv.com | NaN | NaN | NaN | Coastal Television Broadcasting | TV station | stationindex | 2018-08-02 14:55:24.612585 |
5 | KYES | AK | http://www.kyes.com/ | kyes.com | NaN | NaN | NaN | Fireweed Communications | TV station | stationindex | 2018-08-02 14:55:24.612585 |
# this is a SQL-like join in Pandas that merges the two datasets based on domain name!
df_links_state = df_links.merge(df_local,
left_on='link_resolved_domain',
right_on='domain')
This dataset unlocks insights regarding the locality of news articles shared, as well as media ownership.
df_links_state.state.value_counts().head(20)
TX    43962
CA    35029
NJ    33442
NY    33145
MI    28912
FL    21519
NC    19841
PA    19060
OH    18857
MD    16533
GA    13419
MO    12660
TN    12324
AZ    11052
MA    10767
WA    10325
NV    10261
LA     9519
MN     9510
OR     9404
Name: state, dtype: int64
df_links_state.owner.value_counts().head(25)
Nexstar                                         7543
Advance Local                                   4464
Tegna Media                                     4265
Sinclair                                        3457
Hearst                                          2571
Tribune                                         2086
Gray Television                                 1817
Fox Television Stations                         1580
Hearst Television                               1528
Acvance Local                                   1522
Raycom                                          1396
NBC Universal                                   1344
Georgia Public Broadcasting                     1248
New Jersey Public Broadcasting Authority        1156
The Philadelphia Inquirer                       1103
ABC                                              978
Evening Post Publishing                          820
Meredith                                         750
Oregon Public Broadcasting                       740
Georgia Public Telecommunications Commission     624
Graham Media Group                               588
E. W. Scripps Company                            540
Cox Enterprises                                  526
WGBH Educational Foundation                      425
Piedmont Television                              333
Name: owner, dtype: int64
df_links_state[df_links_state.owner == 'Sinclair']['user_id'].value_counts().head(10)
66891808              100
818948638890217472     68
1065995022             66
1058345042             64
90651198               63
368948092              56
27676828               48
2929491549             48
2987671552             44
58579942               44
Name: user_id, dtype: int64
This is one example of a dataset enrichment, you can create your own categorizations and join them in. Alexa.com is a good starting place, and Greg and Andreu have the most experience categorizing websites.
The frequency of domains shared per-user make rich features. We can aggregate the data using the following utility function:
count_matrix = urlexpander.tweet_utils.count_matrix(
df_links,
user_col='user_id',
domain_col='link_resolved_domain',
min_freq=20,
)
count_matrix.head()
1011fmtheanswer.com | 10best.com | 10tv.com | 11alive.com | 123formbuilder.com | 12news.com | 12newsnow.com | 13abc.com | 13wham.com | 13wmaz.com | ... | yorkdispatch.com | youarecurrent.com | youcaring.com | youngcons.com | yourconroenews.com | yourdailyjournal.com | youtube.com | zeldinforcongress.com | zeldinforsenate.com | zpolitics.com | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
813286 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 198 | 0 | 0 | 0 |
939091 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 106 | 0 | 0 | 0 |
5496932 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 427 | 0 | 0 | 0 |
5511752 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 73 | 0 | 0 | 0 |
5558312 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 53 | 0 | 0 | 0 |
5 rows × 2376 columns
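Under the hood this is essentially a user-by-domain crosstab. A minimal pandas equivalent (with min_freq approximated as a column-sum threshold, which may differ from the library's exact pruning rule):

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'link_resolved_domain': ['youtube.com', 'cnn.com',
                             'youtube.com', 'youtube.com', 'wsj.com'],
})
# one row per user, one column per domain, cells are share counts
m = pd.crosstab(toy['user_id'], toy['link_resolved_domain'])
# approximate min_freq: drop domains shared fewer than 2 times overall
m = m.loc[:, m.sum() >= 2]
```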
Let's see how the count_matrix
features fare as predictors of political affiliation.
This will be an exploratory step, where we will try to visualize the dataset of counts of links shared by user. We will reduce the high-dimensional data (where each domain is one dimension) to two dimensions for visualization using the UMAP algorithm (much like the popular t-SNE algorithm).
count_matrix = urlexpander.tweet_utils.count_matrix(
df_links_,
user_col='user_id',
domain_col='link_resolved_domain',
min_freq=5,
exclude_domain_list=exclude,
)
count_matrix.head(2)
frc.org | thehill.com | iheart.com | c-span.org | news9.com | speaker.gov | foxnews.com | frcaction.org | koco.com | okcfox.com | ... | mullinforcongress.com | therepublicanstandard.com | barbaracomstockforcongress.com | thefriendshipchallenge.com | tomreedforcongress.com | sincomillas.com | detodopr.com | thedowneypatriot.com | garretgravesforcongress.com | about.com | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
1004855106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1009269193.0 | 0 | 12 | 0 | 16 | 0 | 2 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 rows × 5301 columns
viz_umap_embed(count_matrix,
title="Link Sharing of Members of Congress Embedded by UMAP",
threshold=5, n_neighbors=50,
min_dist=0.1, metric='dice',
random_state=303)
/Users/leonyin/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype float32 was converted to bool by check_pairwise_arrays. warnings.warn(msg, DataConversionWarning)
In the unsupervised case, we already see Democrats and Republicans sectioned off. Here we will fit a logistic regression model to predict whether a congress member is a Democrat or a Republican based on the count_matrix
we just created.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# create the training set
X, y = count_matrix_.values, parties
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=303,
test_size=.15)
len(X_train), len(X_test)
(820, 145)
logreg = LogisticRegression(penalty='l2', C=.7,
solver='liblinear',
random_state=303)
logreg.fit(X_train, y_train)
logreg.score(X_test, y_test)
0.9517241379310345
plot_confusion_matrix(cnf_matrix, classes=class_names,
title='Confusion matrix, without normalization')
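Note that cnf_matrix and plot_confusion_matrix are computed and defined off-screen; the matrix itself comes from scikit-learn's confusion_matrix. A toy illustration (made-up labels, not the notebook's actual values):

```python
from sklearn.metrics import confusion_matrix

# hypothetical true/predicted party labels for five members
y_true = ['Democrat', 'Democrat', 'Republican', 'Republican', 'Republican']
y_pred = ['Democrat', 'Republican', 'Republican', 'Republican', 'Democrat']
class_names = ['Democrat', 'Republican']

# rows are true labels, columns are predicted labels
cnf_matrix = confusion_matrix(y_true, y_pred, labels=class_names)
```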
df_meta.set_index('twitter_id').loc[
y_test[y_test != y_pred].index
][['twitter_name']]
twitter_name | |
---|---|
twitter_id | |
9.410800851211756e+17 | SenDougJones |
23820360.0 | billhuizenga |
136526394.0 | WebsterCongress |
4615689368.0 | GeneGreen29 |
19726613.0 | SenatorCollins |
242376736.0 | RepCharlieDent |
16056306.0 | JeffFlake |
df_ind = df_meta.set_index('twitter_id').loc[count_matrix_indep_.index][['twitter_name']]
df_ind['preds'] = logreg.predict(count_matrix_indep_)
df_ind
twitter_name | preds | |
---|---|---|
twitter_id | ||
1068481578.0 | SenAngusKing | Republican |
216776631.0 | BernieSanders | Democrat |
2915095729.0 | AkGovBillWalker | Republican |
29442313.0 | SenSanders | Democrat |
3196634042.0 | GovernorMapp | Republican |
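The accuracy printed below comes from cross-validation run off-screen. A sketch of how such scores would be computed, with synthetic data standing in for the count matrix and party labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the user-by-domain count matrix and party labels
X, y = make_classification(n_samples=200, n_features=20, random_state=303)

model = LogisticRegression(penalty='l2', C=.7, solver='liblinear',
                           random_state=303)
# one held-out accuracy score per fold
scores = cross_val_score(model, X, y, cv=5)
```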
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.96 (+/- 0.02)