Tools for Link Analysis: urlExpander

View this on GitHub | NBviewer | Binder
Author: Leon Yin
Updated on: 2018-10-01

Today

  1. What kind of link data does Twitter provide?
  2. How to extract link data from Tweets (urlexpander.tweet_utils.get_link)
  3. Processing data by expanding shortened URLs (urlexpander.expand)
  4. Avenues of analysis with link data (urlexpander.tweet_utils.count_matrix, urlexpander.html_utils.get_webpage_meta)
  5. Using links as features to predict political affiliation

Software for this tutorial is found in this requirements.txt file, and can be downloaded as follows:

pip install -r requirements.txt

Download data here:

python download_data.py

NOTE: at the time of this writing, download_data.py does not work! Please go to OSF in the meantime.

If you have yet to look at the backend of a Tweet here you go: https://bit.ly/tweet_anatomy_link.

Today I'll show you how to extract and work with links from Tweets. We're working with Tweets from members of Congress collected by Greg Eady.

In [2]:
import os
import json
import glob
import itertools
from multiprocessing import Pool

from tqdm import tqdm
import pandas as pd
import urlexpander
from smappdragon import JsonCollection

from config import INTERMEDIATE_DIRECTORY, \
                   RAW_TWEETS_DIRECTORY, \
                   CONGRESS_METADATA_DIRECTORY

Let's preview one file. The file is saved as a newline-delimited json file like this

{"tweet_id" : "123", "more_data" : {"here" : "it is"}
{"tweet_id" : "124", "more_data" : {"here" : "it is again"}

and bzip2 compressed!
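If you want to peek at such a file without any helper libraries, here is a minimal sketch using only the standard library, assuming the path below points at one of the downloaded files:

import bz2
import json

# stream the bzip2-compressed, newline-delimited JSON file one Tweet at a time
with bz2.open('data/tweets-raw/1089859058__2018-03.json.bz2', 'rt') as fh:
    for line in fh:
        tweet = json.loads(line)
        print(tweet['id_str'])
        break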

In [6]:
f
Out[6]:
'data/tweets-raw/1089859058__2018-03.json.bz2'
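For context, f is one of the raw Tweet files. A minimal sketch of how files and f are presumably set up, assuming RAW_TWEETS_DIRECTORY (imported from config above) points at the data/tweets-raw folder:

# collect the per-account Tweet files downloaded earlier
files = glob.glob(os.path.join(RAW_TWEETS_DIRECTORY, '*.json.bz2'))
f = files[0]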

The file structure is identical to how we store all Tweet data at SMaPP. We developed software like smappdragon to work with Tweets:

In [7]:
collect = JsonCollection(f, compression='bz2', 
                         throw_error=False, 
                         verbose=1)

smappdragon's JsonCollection class reads through JSON files as a generator.

In [8]:
collect
Out[8]:
<smappdragon.collection.json_collection.JsonCollection at 0x10a495ba8>

The con is that generators are hard to interpret; the pro is that they don't load the whole file into memory. We access the data row by row by iterating through the collect object. Here we only get the first row, and we can see its contents by printing row:

In [11]:
for row in collect.get_iterator():
    print(json.dumps(row, indent=1)[:100])
    break
{
 "created_at": "Wed Mar 21 17:53:40 +0000 2018",
 "id": 976517212063322112,
 "id_str": "9765172120

How do we get the links?

Each Tweet can have more than one link, so we need to unpack those values! We created urlexpander.tweet_utils.get_link() to do just that...

Once again, the result is a generator:

In [9]:
# returns a generator, which is uninterpretable on its own!
urlexpander.tweet_utils.get_link(row)
Out[9]:
<generator object get_link at 0x11b8a3f10>
In [10]:
# we can access the data by iterating through it.
for link_meta in urlexpander.tweet_utils.get_link(row):
    print(link_meta)
{'user_id': 1089859058, 'tweet_id': 976517212063322112, 'tweet_created_at': 'Wed Mar 21 17:53:40 +0000 2018', 'tweet_text': None, 'link_url_long': 'http://bit.ly/2FC7bMz', 'link_domain': 'bit.ly', 'link_url_short': 'https://t.co/5P1JAaxwQV'}
In [15]:
def read_file_extract_links(f):
    '''
    This function takes in a Tweet file that is bzip2-compressed,
    newline-delimited JSON, and returns a list of dictionaries
    of link data.
    '''
    # read the json file into a generator
    collection = JsonCollection(f, compression='bz2', throw_error=False)
    
    # iterate through the json file, extract links, flatten the generator of links
    # into a list, and store into a Pandas dataframe
    df_ = pd.DataFrame(list(
            itertools.chain.from_iterable(
                [urlexpander.tweet_utils.get_link(t) 
                 for t in collection.get_iterator() 
                 if t]
            )))
    df_['file'] = f
    return df_.to_dict(orient='records')

We can loop through the files and run the function on each one. From there, we can instantiate a Pandas DataFrame.

In [14]:
data = []
for f in tqdm(files[:2]):
    # extract link records from this file and add them to the running list
    data.extend(read_file_extract_links(f))
df_links = pd.DataFrame(data)
100%|██████████| 2/2 [00:01<00:00,  1.03s/it]

Advanced (but practical) usage

The for loop is slow, and this task is not memory-intensive.

We can parallelize it using the Pool class from the multiprocessing package.

Each core on our computer will read a JSON file of Tweets and filter for links.

In [16]:
data = []
with Pool(4) as pool:
    iterable = pool.imap_unordered(read_file_extract_links, files)
    for link_data in tqdm(iterable, total=len(files)):
        data.extend(link_data)
# link meta -> dataframe
df_links = pd.DataFrame(data)
# save it
df_links.to_csv(file_raw_links, index=False)
# preview it
df_links.head(2)
Out[16]:
file link_domain link_url_long link_url_short tweet_created_at tweet_id tweet_text user_id
0 /scratch/olympus/projects/mediascore/Data/json... frc.org https://www.frc.org/wwlivewithtonyperkins/rep-... https://t.co/l9dXT0L7oT Fri Mar 23 14:38:34 +0000 2018 977192888781168640 nan 2966758114
1 /scratch/olympus/projects/mediascore/Data/json... thehill.com http://thehill.com/379188-watch-fund-governmen... https://t.co/YbdvepWNQ3 Thu Mar 22 15:21:32 +0000 2018 976841314024206339 nan 2966758114

How useful is this data?

The bulk of URLs we encounter in the wild are sent through a link shortener. For this dataset, roughly 30% of all URLs come from known link shorteners:

In [17]:
len(df_links[df_links['link_domain'].isin(short_domains)]) \
/ len(df_links)
Out[17]:
0.2965885363056201
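Note that short_domains (a collection of known shortener domains) is not defined in the cells shown here. One way to get an equivalent estimate without that list is to apply urlexpander.is_short (the same helper used as a filter below) directly to the long URLs; a minimal sketch:

# fraction of URLs whose domain looks like a known link shortener
df_links['link_url_long'].apply(urlexpander.is_short).mean()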

Link shorteners record transactional information whenever a link is clicked. Unfortunately, they also make it hard for us to see what was actually shared.

In [17]:
links = df_links['link_url_long'].tolist()
links[-5:]
Out[17]:
['http://goo.gl/kDUwP',
 'http://bit.ly/12clU3p',
 'http://nyti.ms/Z4rdlU',
 'http://goo.gl/LxkrY',
 'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html']

This is why urlexpander was made. We can run the expand function on a single URL, as well as on a list of URLs.

In [18]:
urlexpander.expand(links[-5])
Out[18]:
'https://degette.house.gov/index.php?option=com_content&view=article&id=1260:congressional-lgbt-equality-caucus-praises-re-introduction-of-employment-non-discrimination-act&catid=76:press-releases-&Itemid=227'

By default, urlexpander will expand every URL it is shown. However, you can pass a boolean function (one that returns True or False for an input string) to the filter_function parameter.

In [19]:
urlexpander.expand(links[-5:], filter_function=urlexpander.is_short)
Out[19]:
['https://degette.house.gov/index.php?option=com_content&view=article&id=1260:congressional-lgbt-equality-caucus-praises-re-introduction-of-employment-non-discrimination-act&catid=76:press-releases-&Itemid=227',
 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html',
 'http://nyti.ms/Z4rdlU',
 'http://www.civiccenterconservancy.org/history-2012-nhl-designation_25.html',
 'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html']
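The filter function doesn't have to be urlexpander.is_short; any function that maps a URL string to True or False works. A minimal sketch of a hypothetical custom filter that only expands bit.ly and goo.gl links:

def only_bitly_or_googl(url):
    '''Hypothetical filter: expand a URL only if it was shortened by bit.ly or goo.gl.'''
    return urlexpander.get_domain(url) in {'bit.ly', 'goo.gl'}

urlexpander.expand(links[-5:], filter_function=only_bitly_or_googl)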

What's happening behind the scenes?

['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'bit.ly/cbs23']
--> Remove duplicates
['abc.com/123', 'bbc.co.uk/123', 'bit.ly/cbs23']
--> Filter for shortened URLs (optional)
['bit.ly/cbs23']
--> Check the cache file (optional), Unshorten new urls
[{'original_url': 'bit.ly/cbs23', 'resolved' : 'cspan.com/123'}]
--> swap short URLs for full URLs
['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'cspan.com/123']

urlexpander parallelizes, filters and caches the input, which is essential for social media data.
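A minimal, un-parallelized sketch of that pipeline, ignoring caching and using a hypothetical resolve() stand-in for the actual unshortening request:

def expand_sketch(urls, filter_function=None, resolve=lambda u: u):
    '''Illustrative only: dedupe, optionally filter, resolve, then map back.'''
    # 1. remove duplicates
    unique_urls = set(urls)
    # 2. optionally keep only the URLs that look shortened
    if filter_function:
        unique_urls = {u for u in unique_urls if filter_function(u)}
    # 3. unshorten each remaining URL (the real library batches, threads, and caches this step)
    resolved = {u: resolve(u) for u in unique_urls}
    # 4. swap short URLs for full URLs, preserving the original order and duplicates
    return [resolved.get(u, u) for u in urls]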

In [23]:
?urlexpander.expand
Signature: urlexpander.expand(links_to_unshorten, chunksize=1280, n_workers=1, cache_file=None, random_seed=303, verbose=0, filter_function=None, **kwargs)
Docstring:
Calls expand with multiple (``n_workers``) threads to unshorten a list of urls. Unshortens all urls by default, unless one sets a ``filter_function``.

:param links_to_unshorten: (list, str) either an idividual or list (str) of urls to unshorten
:param chunksize: (int) chunks links_to_unshorten, which makes computation quicker with larger inputs
:param n_workers: (int) how many threads
:param cache_file: (str) a path to a json file to read and write results in
:param random_seed: (int) initializes the random state for shuffling the input
:param verbose: (int) whether to print updates and errors. 0 is silent. 1 is progress bar. 2 is progress bar and errors.
:param filter_function: (func) a boolean used to filter url shorteners out
    

:returns: (list) a list of resolved urls
File:      ~/anaconda3/lib/python3.6/site-packages/urlexpander/core/api.py
Type:      function

The above is a toy example with 5 links; let's see how this works for 1.7 million links. For reference, this took me about an hour on HPC.

In [30]:
resolved_urls = urlexpander.expand(links, 
                                   filter_function=urlexpander.is_short,
                                   n_workers=64,
                                   chunksize=1280,
                                   cache_file=file_cache,
                                   verbose=1)
In [51]:
len(resolved_urls)
Out[51]:
1700150
In [52]:
df_links['link_resolved'] = resolved_urls
df_links['link_resolved_domain'] = df_links['link_resolved'].apply(urlexpander.get_domain)
In [254]:
df_links.to_csv(file_expanded_links, index=False)

Analytics

With the links resolved, how can we use links as data?

In [25]:
df_links = pd.read_csv(file_expanded_links)
df_links.head(2)
Out[25]:
file link_domain link_url_long link_url_short tweet_created_at tweet_id tweet_text user_id link_resolved link_resolved_domain
0 /scratch/olympus/projects/mediascore/Data/json... frc.org https://www.frc.org/wwlivewithtonyperkins/rep-... https://t.co/l9dXT0L7oT Fri Mar 23 14:38:34 +0000 2018 977192888781168640 nan 2966758114 https://www.frc.org/wwlivewithtonyperkins/rep-... frc.org
1 /scratch/olympus/projects/mediascore/Data/json... thehill.com http://thehill.com/379188-watch-fund-governmen... https://t.co/YbdvepWNQ3 Thu Mar 22 15:21:32 +0000 2018 976841314024206339 nan 2966758114 https://thehill.com/379188-watch-fund-governme... thehill.com

We can get an overview of the most frequently shared domains:

In [32]:
df_links['link_resolved_domain'].value_counts().head(15)
Out[32]:
twitter.com           255532
house.gov             199218
youtube.com            93986
facebook.com           90061
senate.gov             78645
washingtonpost.com     29886
instagram.com          28460
nytimes.com            25014
thehill.com            22925
politico.com           13488
foxnews.com            12045
cnn.com                11611
wsj.com                11289
twimg.com               9633
ow.ly                   9463
Name: link_resolved_domain, dtype: int64

Text-based URL Metadata

We can also look at the contents of each URL. Twitter provides URL metadata (if you pay); we provide a workaround!

In [33]:
url = 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html'
meta = urlexpander.html_utils.get_webpage_meta(url)
meta
Out[33]:
OrderedDict([('url',
              'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html'),
             ('title', 'House Hydro Bill Tests Water for Broad Energy Deals'),
             ('description',
              ' In February, the House did something rare: It passed an energy bill unanimously. Unlike the previous Congress’ standard fare of anti-EPA, pro-drilling measures, the first energy bill of the 113th Congress promoted small-scale hydropower projects and the electrification of existing dams.'),
             ('paragraphs',
              ['',
               '',
               'In February, the House did something rare: It passed an energy bill unanimously. Unlike the previous Congress’ standard fare of anti-EPA, pro-drilling measures, the first energy bill of the 113th Congress promoted small-scale hydropower projects and the electrification of existing dams.',
               'In other words, the Republican-controlled House passed a clean-energy bill.',
               'Of course, few Republicans could object on ideological grounds to legislation that aimed to expedite or remove regulatory requirements for expanding hydropower facilities. And it certainly helped that the chamber passed similar legislation in 2012 by a large margin.               ',
               'Nevertheless, hydropower proponents say that developing the resource represents “low-hanging fruit” that members tired of the partisanship that has permeated energy policy can all agree is worth advancing to President Barack Obama’s desk. Hydropower represented nearly two-thirds — the largest share by far — of domestic renewable-energy production and 8 percent of total U.S. electricity generation in 2011, according to the Energy Information Administration. More than half of that total powered the Pacific Northwest region, and hydro is one of the few renewable resources that can provide baseload electricity — power that is available at all times — to the grid.',
               'But just 3 percent of the 80,000 dams in the United States generate power, representing great potential for growing the resource, according to legislation championed by Reps. Cathy McMorris Rodgers, R-Wash., and  Diana DeGette, D-Colo.',
               '“If you can work on regulatory reform for those projects, then you can have small hydro throughout this country,” DeGette said. ',
               'The House lawmakers’ legislation would let small hydroelectric facilities generating up to 10 megawatts of power bypass Federal Energy Regulatory Commission licensing requirements that currently apply to projects producing more than 5 megawatts. The bill also would require FERC to study the feasibility of carrying out a two-year hydropower licensing pilot program at unpowered dams and would allow the commission to extend preliminary permits for two additional years.  ',
               'Jeff Leahey, government affairs director at the National Hydropower Association, said House leaders probably moved the bill so they could promote energy legislation that “checked the boxes” on encouraging the development of a resource that is renewable and reliable. It also helped that the bill — along with another measure passed April 10 that would designate an Interior Department agency as the lead regulator of small federal conduits — moved through the House last Congress and didn’t need much additional work, he said.',
               'A significant factor in the refocused spotlight on the “original renewable” is the new leadership on the Senate Energy and Natural Resources Committee and its representation of key hydropower-producing states. Chairman Ron Wyden, D-Ore., promised industry representatives at the hydropower association’s annual conference this week that he plans to “quickly” mark up hydropower legislation after a panel hearing Tuesday. ',
               'Wyden attributes the rise in hydro’s profile to better environmental stewardship on the part of facility operators and a more cooperative relationship between hydropower lobbyists and environmental groups focused on protecting river ecosystems. The effect of dams on fisheries and riverine habitat, as well as operational costs, has compelled organizations to promote the removal of dams in some instances.',
               '“Hydro’s environmental performance has improved dramatically,” Wyden said. ',
               'Association President David Moller of Pacific Gas and Electric Co. acknowledged the role that historically low natural-gas prices have played in limiting hydropower expansion in recent years. But he said opportunities for hydropower still flourish because of state renewable portfolio mandates, coal-fired power plants being pushed into retirement and technological advances in powering existing dams and water channels.',
               '“The price of natural gas has dropped, but it will never match hydropower’s fuel price of zero, or its attributes of being renewable and non-carbon-based,” he said. ',
               'Hydro proponents in the private sector and in Congress said this week that they will continue to promote hydropower development, possibly in future legislation. That could include examining additional regulatory issues that contribute to long lead times for completing electrified projects or adjusting current benefits that exist for clean power in the tax code. Prospects for he latter — which would involve extending the production tax credit for a multiyear period or making it permanent as Obama proposes — are dim outside a comprehensive tax code overhaul.',
               '“I think at the end of the day, it’s all about making sure that hydropower and the benefits that come from hydropower projects are competitive in the marketplace with other energy projects,” Leahey said. ',
               'Whether the bipartisan camaraderie that has surrounded promoting hydropower can translate to moving broader energy legislation is anyone’s guess. But members close to the debate express optimism that the current spate of legislative action could beget compromise in the future.',
               '“I’m not sure we’re any closer to that comprehensive policy, but I’d think that common ground on these issues like hydro can only be helpful,” DeGette said.',
               '',
               '×',
               '$${CardTitle}',
               '$${CardTitle}'])])

Categorizing what is shared

If we want to know more about what kinds of information are being shared by members of congress, we can enrich the dataset by joining metadata about domains. In this example we will use the Local News Dataset to inspect the local media outlets that members of congress share:

In [36]:
local_news_url = 'https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018.csv'
df_local = pd.read_csv(local_news_url)
df_local = df_local[~(df_local.domain.isnull()) & 
                    (df_local.domain != 'facebook.com')]
df_local.head()
Out[36]:
name state website domain twitter youtube facebook owner medium source collection_date
0 KWHE HI http://www.kwhe.com/ kwhe.com NaN NaN NaN LeSea TV station stationindex 2018-08-02 14:55:24.612585
1 WGVK MI http://www.wgvu.org/ wgvu.org NaN NaN NaN Grand Valley State University TV station stationindex 2018-08-02 14:55:24.612585
3 KTUU AK http://www.ktuu.com/ ktuu.com NaN NaN NaN Schurz Communications TV station stationindex 2018-08-02 14:55:24.612585
4 KTBY AK http://www.ktbytv.com/ ktbytv.com NaN NaN NaN Coastal Television Broadcasting TV station stationindex 2018-08-02 14:55:24.612585
5 KYES AK http://www.kyes.com/ kyes.com NaN NaN NaN Fireweed Communications TV station stationindex 2018-08-02 14:55:24.612585
In [40]:
# this is a SQL-like join in Pandas that merges the two datasets based on domain name!
df_links_state = df_links.merge(df_local, 
                                left_on='link_resolved_domain', 
                                right_on='domain')

This dataset unlocks insights regarding the locality of news articles shared, as well as media ownership.

In [54]:
df_links_state.state.value_counts().head(20)
Out[54]:
TX    43962
CA    35029
NJ    33442
NY    33145
MI    28912
FL    21519
NC    19841
PA    19060
OH    18857
MD    16533
GA    13419
MO    12660
TN    12324
AZ    11052
MA    10767
WA    10325
NV    10261
LA     9519
MN     9510
OR     9404
Name: state, dtype: int64
In [43]:
df_links_state.owner.value_counts().head(25)
Out[43]:
Nexstar                                         7543
Advance Local                                   4464
Tegna Media                                     4265
Sinclair                                        3457
Hearst                                          2571
Tribune                                         2086
Gray Television                                 1817
Fox Television Stations                         1580
Hearst Television                               1528
Acvance Local                                   1522
Raycom                                          1396
NBC Universal                                   1344
Georgia Public Broadcasting                     1248
New Jersey Public Broadcasting Authority        1156
The Philadelphia Inquirer                       1103
ABC                                              978
Evening Post Publishing                          820
Meredith                                         750
Oregon Public Broadcasting                       740
Georgia Public Telecommunications Commission     624
Graham Media Group                               588
E. W. Scripps Company                            540
Cox Enterprises                                  526
WGBH Educational Foundation                      425
Piedmont Television                              333
Name: owner, dtype: int64
In [53]:
df_links_state[df_links_state.owner == 'Sinclair']['user_id'].value_counts().head(10)
Out[53]:
66891808              100
818948638890217472     68
1065995022             66
1058345042             64
90651198               63
368948092              56
27676828               48
2929491549             48
2987671552             44
58579942               44
Name: user_id, dtype: int64

This is just one example of dataset enrichment; you can create your own categorizations and join them in. Alexa.com is a good starting place, and Greg and Andreu have the most experience categorizing websites.
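As a minimal sketch of what a homemade categorization could look like (the category labels here are made up for illustration):

# hypothetical hand-made mapping from domain to category
my_categories = pd.DataFrame([
    {'domain': 'foxnews.com', 'category': 'national news'},
    {'domain': 'nytimes.com', 'category': 'national news'},
    {'domain': 'youtube.com', 'category': 'video platform'},
])

# join it onto the resolved links, just like the Local News Dataset merge above
df_links_cat = df_links.merge(my_categories,
                              left_on='link_resolved_domain',
                              right_on='domain')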

User-Aggregated Activity

The frequencies of domains shared per user make rich features. We can aggregate the data using the following utility function:

In [55]:
count_matrix = urlexpander.tweet_utils.count_matrix(
      df_links,
      user_col='user_id',
      domain_col='link_resolved_domain',
      min_freq=20,
)

count_matrix.head()
Out[55]:
1011fmtheanswer.com 10best.com 10tv.com 11alive.com 123formbuilder.com 12news.com 12newsnow.com 13abc.com 13wham.com 13wmaz.com ... yorkdispatch.com youarecurrent.com youcaring.com youngcons.com yourconroenews.com yourdailyjournal.com youtube.com zeldinforcongress.com zeldinforsenate.com zpolitics.com
user_id
813286 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 198 0 0 0
939091 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 106 0 0 0
5496932 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 427 0 0 0
5511752 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 73 0 0 0
5558312 16 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 53 0 0 0

5 rows × 2376 columns

Let's see how the count_matrix features fare as predictors of political affiliation.

Unsupervised Learning

This is an exploratory step, where we will try to visualize the dataset of link-share counts per user. We will reduce the high-dimensional data (where each domain is one dimension) to two dimensions for visualization using the UMAP algorithm (much like the popular t-SNE algorithm).

In [28]:
count_matrix = urlexpander.tweet_utils.count_matrix(
      df_links_,
      user_col='user_id',
      domain_col='link_resolved_domain',
      min_freq=5,
      exclude_domain_list=exclude,
)
In [63]:
count_matrix.head(2)
Out[63]:
frc.org thehill.com iheart.com c-span.org news9.com speaker.gov foxnews.com frcaction.org koco.com okcfox.com ... mullinforcongress.com therepublicanstandard.com barbaracomstockforcongress.com thefriendshipchallenge.com tomreedforcongress.com sincomillas.com detodopr.com thedowneypatriot.com garretgravesforcongress.com about.com
user_id
1004855106.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1009269193.0 0 12 0 16 0 2 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 rows × 5301 columns

In [64]:
viz_umap_embed(count_matrix, 
               title="Link Sharing of Members of Congress Embedded by UMAP",
               threshold=5, n_neighbors=50,
               min_dist=0.1, metric='dice',
               random_state=303)
/Users/leonyin/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype float32 was converted to bool by check_pairwise_arrays.
  warnings.warn(msg, DataConversionWarning)
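The viz_umap_embed helper isn't defined in the cells shown here. A minimal sketch of what such a helper might do, assuming the umap-learn and matplotlib packages, with the binarization-by-threshold step being an assumption and the party-based coloring omitted:

import umap
import matplotlib.pyplot as plt

def viz_umap_embed_sketch(count_matrix, title='', threshold=5,
                          n_neighbors=50, min_dist=0.1,
                          metric='dice', random_state=303):
    '''Illustrative only: binarize the counts, embed with UMAP, scatter-plot the users.'''
    # treat a domain as "shared" only if it appears at least `threshold` times
    X = (count_matrix.values >= threshold)
    embedding = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                          metric=metric, random_state=random_state).fit_transform(X)
    plt.scatter(embedding[:, 0], embedding[:, 1], s=8)
    plt.title(title)
    plt.show()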

Supervised Learning

In the unsupervised case, we already see Democrats and Republicans sectioned off. Here we will fit a logistic regression model to predict whether a congress member is a Democrat or a Republican based on the count_matrix we just created.
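Neither count_matrix_ nor parties is built in the cells shown here. A minimal sketch of how the labels might be lined up with the count matrix, assuming df_meta (the congress metadata used below) carries a 'party' column, which is an assumption:

# assumption: df_meta has one row per account with 'twitter_id' and 'party' columns
party_lookup = df_meta.set_index('twitter_id')['party']

# align the party labels to the count matrix rows, and keep Democrats and Republicans only
parties = party_lookup.reindex(count_matrix.index)
two_party = parties.isin(['Democrat', 'Republican'])
count_matrix_, parties = count_matrix[two_party], parties[two_party]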

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
In [31]:
# create the training set
X, y = count_matrix_.values, parties
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=303, 
                                                    test_size=.15)
len(X_train), len(X_test) 
Out[31]:
(820, 145)
In [33]:
logreg = LogisticRegression(penalty='l2', C=.7,
                            solver='liblinear',
                            random_state=303)
logreg.fit(X_train, y_train)
logreg.score(X_test, y_test)
Out[33]:
0.9517241379310345

Evaluation

In [46]:
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
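Here cnf_matrix, class_names, y_pred, and the plot_confusion_matrix helper (a standard scikit-learn example function) are not defined in the cells shown. A minimal sketch of the missing pieces, with the label names assumed to match the party strings used elsewhere:

from sklearn.metrics import confusion_matrix

# predictions on the held-out set, reused below to inspect misclassifications
y_pred = logreg.predict(X_test)

class_names = ['Democrat', 'Republican']
cnf_matrix = confusion_matrix(y_test, y_pred, labels=class_names)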

Who did we get wrong?

In [71]:
df_meta.set_index('twitter_id').loc[
    y_test[y_test != y_pred].index
][['twitter_name']]
Out[71]:
twitter_name
twitter_id
9.410800851211756e+17 SenDougJones
23820360.0 billhuizenga
136526394.0 WebsterCongress
4615689368.0 GeneGreen29
19726613.0 SenatorCollins
242376736.0 RepCharlieDent
16056306.0 JeffFlake

How did we do with Independents?

In [73]:
df_ind = df_meta.set_index('twitter_id').loc[count_matrix_indep_.index][['twitter_name']]
df_ind['preds'] = logreg.predict(count_matrix_indep_)
df_ind
Out[73]:
twitter_name preds
twitter_id
1068481578.0 SenAngusKing Republican
216776631.0 BernieSanders Democrat
2915095729.0 AkGovBillWalker Republican
29442313.0 SenSanders Democrat
3196634042.0 GovernorMapp Republican

K-Fold Cross Validation
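The scores array isn't computed in the cells shown; a minimal sketch using scikit-learn's cross_val_score, with the number of folds (5) being an assumption:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the same logistic regression on the full labeled set
scores = cross_val_score(logreg, X, y, cv=5)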

In [76]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.96 (+/- 0.02)