General Information
In this workshop you learn several ways to collect Twitter data for your own research. The second main emphasis is descriptive analysis and how to conduct a network analysis with the data.
We mainly use "ready-to-use" functions of existing packages, but if you conduct a research project later on, you will probably have to write customized functions to reach your goals. This may seem tough if you are new to R, but it's the aspect that actually makes R unique compared to other statistical programs.
I've decided to use the opinionated tidyverse package, a collection of R packages designed for data science, for some code chunks. The packages share a common philosophy and work smoothly together. This approach includes the coding practice of piping, which makes R code easier to read. Some R practitioners consider the introduction of piping the most important innovation of recent years. I will point out the first few times I use this coding principle in the code chunks. But don't worry, you don't need to use the pipe operators yourself to complete this workshop successfully.
The overall subject of the workshop is the Bundestagswahl 2017, chosen to demonstrate the opportunities on an issue of political science. I recommend going through the instructions linearly, as some sections build on knowledge you acquire in previous sections. My second piece of advice is that you try to understand the logic behind my code chunks; you don't necessarily have to replicate them all, and in a few cases it won't be possible. To check your knowledge, each section finishes with a small exercise.
Happy coding!
Packages
Make sure you have the latest versions of these packages installed and loaded:
# installs packages
install.packages(c("rtweet", "tidyverse", "ggplot2", "tm", "igraph", "data.table",
"stringr"), repos = "http://cran.us.r-project.org")
# loads packages
library("rtweet")
library("tidyverse")
library("ggplot2")
library("tm")
library("igraph")
library("data.table")
library("stringr")
Piping
Traditional R coding forces you to either wrap functions into other functions or assign a lot of intermediate variables. The concept of piping allows you to forward the output of one function to the next one and read the sequence from left to right (readability). This functionality comes with the package magrittr, which is part of the tidyverse.
Let me show you two code chunks that do the same and decide for yourself which one you consider more intuitive and more readable.
Traditional approach:
# you have to read from the center outwards to understand what's going on
mean(mtcars[which(mtcars$mpg > 15), "mpg"])
## [1] 21.72308
Pipe approach:
# input data
mtcars %>%
    # filters the data for each observation that has a value greater than 15 for mpg
    filter(mpg > 15) %>%
    # selects the column mpg
    select(mpg) %>%
    colMeans
## mpg
## 21.72308
While the pipe operator %>% is the most important operator, others exist for different uses as well. Consult the vignette and the documentation of the package for more information. I sometimes use . within a function. The dot explicitly states where the output of the previous function shall be placed, which is necessary in a few cases. In other words:
# the following code snippets are the same
mtcars %>% filter(mpg > 15)
# and
mtcars %>% filter(., mpg > 15)
Data Collection
Twitter Access
- To retrieve data from Twitter you first have to register an account on twitter.com.
- You need to create an application under apps.twitter.com. The name, description, and website specifications do not matter as you are not creating an actual application for other Twitter users. You can fill these fields with your creativity.
- You have to specify http://127.0.0.1:1410 as your callback URL to ensure that your access runs smoothly with the R packages.
- Save your app's credentials in R. They encompass:
Consumer Key
Consumer Secret
Access Token
Access Secret
You find the access token and the access secret on the same page a bit further down.
consumer.key <- "CONSUMER KEY"
consumer.secret <- "CONSUMER SECRET"
access.token <- "ACCESS TOKEN"
access.secret <- "ACCESS SECRET"
Though possible, it is not recommended to use two applications at the same time, e.g. accessing the Twitter API for two different research projects simultaneously through two applications. You may find yourself running into the rate limits. In the worst case Twitter bans your account!
rtweet
The package implements a very simple way to connect its functions with Twitter's APIs. You only have to call create_token() with your application name, consumer key and consumer secret as inputs. Once you've created the token, you are ready to use the functions that come with the package.
# defines application name
application_name <- "Name of Application"
# creates token for Twitter API
rtweet_token <- create_token(app = application_name, consumer_key = consumer.key,
consumer_secret = consumer.secret)
The authentication was successful when you see a tab in your browser that says “Authentication complete. Please close this page and return to R”.
The author of the package explains a way to 'preserve' the generated token for future R sessions. We, however, stay with this simpler solution.
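If you nevertheless want to keep a token around for later sessions without that setup, a simple alternative is to save it to disk yourself. A minimal sketch (the file name is just an example):
# saves the token object to a file of your choice
saveRDS(rtweet_token, file = "rtweet_token.rds")
# in a future R session, read it back in and pass it via the token argument of rtweet's functions
rtweet_token <- readRDS("rtweet_token.rds")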
Exercise 1
- If you have no Twitter account yet, create one.
- Register one application on Twitter with the previously mentioned callback url.
- Create a Twitter token in R for the rtweet package.
Obtaining user specific data
There are various kinds of data you can collect about users. Twitter roughly divides them into a user's attributes, like the number of followers and friends, the date when the account was created, geo location etc., and lists, like which users actually follow them and whom the user follows. The documentation of the Rest API lists the data you can get.
Favorites
If you want to analyze how politicians communicate, it might be wise to know whose tweets they favor. The function get_favorites() of rtweet allows you to obtain up to 3,000 favorites of the specified user.
# assigns Trump's account to the variable
politician <- "realDonaldTrump"
# requests the last 1000 favorites of Trump
favorites <- get_favorites(user = politician, n = 1000)
# displays the screen_name column of the first few favorites
head(favorites[, "screen_name"])
## [1] "realDonaldTrump" "realDonaldTrump" "realDonaldTrump" "sheba418"
## [5] "foxandfriends" "IvankaTrump"
# counts how many times each screen_name was favored by Trump
table(favorites[, "screen_name"])
##
## DonaldJTrumpJr FLOTUS foxandfriends IvankaTrump
## 2 2 1 1
## mike_pence realDonaldTrump Ross_7_7 sheba418
## 1 3 1 1
## StrongChestwell
## 2
Donald Trump rarely favors any tweets. Is this evidence that politicians tend to broadcast rather than interact on Twitter?
# display the date of Trump's favorites
favorites[, "created_at"]
## [1] "2017-10-08 14:09:00 UTC" "2017-10-08 13:59:58 UTC"
## [3] "2017-09-19 22:04:11 UTC" "2017-09-01 13:15:35 UTC"
## [5] "2017-08-14 09:58:51 UTC" "2017-05-24 12:06:47 UTC"
## [7] "2017-05-21 09:00:15 UTC" "2017-04-14 19:36:17 UTC"
## [9] "2016-11-08 23:14:27 UTC" "2016-10-10 02:45:08 UTC"
## [11] "2016-04-19 13:28:26 UTC" "2013-08-24 02:46:09 UTC"
## [13] "2013-03-02 04:27:26 UTC" "2013-03-02 04:27:26 UTC"
Followers
Now assume we want to find out who follows Trump's account, as they are presumably Trump supporters and we want to know who is in favor of him. The function get_followers() returns the followers of the specified user.
# tries to retrieve all followers of the politician
followers <- get_followers(politician, n = "all")
# shows us the first observations of the followers
head(followers[, 1])
## [1] "917747309881982976" "917746118548705280" "917747294639939584"
## [4] "917746766635847680" "2849434548" "917734518525591552"
Note: The output only reveals the user_id. We would have to request more information from the Twitter API, if we want to know more about the users’ followers.
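Such a follow-up request could look like the following sketch, using the function lookup_users() that we will meet again below (only a small sample here to stay well within the rate limits):
# looks up the attributes of the first ten followers
followers.info <- lookup_users(followers$user_id[1:10])
# displays their screen names and follower counts
followers.info %>% select(screen_name, followers_count)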
Something seems wrong however, if we check the number of followers that were returned:
# informs about how long the list of followers is, i.e. the number of
# followers
length(followers$user_id)
## [1] 75000
We have only received 75,000 followers, but Trump has millions! The API rate limits mean that we cannot get more than this number of followers every 15 minutes. But of course we can work around this barrier.
# instructs R to pause 15 minutes
Sys.sleep(15 * 60)
# determines the last follower before we hit the rate limit
page <- next_cursor(followers)
# gets second round of followers
followers.p2 <- get_followers(politician, n = "all", page = page)
# binds the dataframes together
followers <- rbind(followers, followers.p2)
# verifies that the new 75,000 are not equal to the first 75,000 followers
unique(followers$user_id) %>% length
## [1] 149997
This approach however is not very practical for two reasons:
- We do not want to do this by hand for accounts that have hundreds of thousands of followers or more.
- We do not know yet how many followers Trump actually has.
Luckily rtweet offers the function lookup_users()
that delivers us some attributes about the specified user.
# requests the available attributes for the politician
lookedUp <- lookup_users(politician)
# displays us the available attributes
names(lookedUp)[1:18]
## [1] "user_id" "name" "screen_name"
## [4] "location" "description" "protected"
## [7] "followers_count" "friends_count" "listed_count"
## [10] "created_at" "favourites_count" "utc_offset"
## [13] "time_zone" "geo_enabled" "verified"
## [16] "statuses_count" "lang" "contributors_enabled"
A more general solution can look like the following custom function. The function is only an example and not perfect yet, as you might experience:
# the function returns all followers of the specified users input: users
# whose followers you want output: list with a dataframe per user
# parameters: none
iterateFollowers <- function(users) {
# executes the following on each user stated
lapply(1:length(users), function(x) {
# assigns the user name by position
user <- users[x]
# informs the R practitioner
message(paste("Finds followers for", user))
# request the user's attributes
info <- lookup_users(user)
# determines how many followers the user has
n.followers <- info$followers_count
# calculates how many batches are needed to retrieve every follower
n.batches <- ceiling(n.followers/75000)
# different procedure for users with more than 75,000 followers
if (n.batches > 1) {
# request 75,000 followers
followers <- get_followers(user, n = "all")
# informs the R practitioner
message("Pause")
# 'sleeps' for 15 minutes
Sys.sleep(15 * 60)
# executes more or less the same in a loop
for (i in 2:n.batches) {
message("Extracts batch ", i)
# saves the last follower of the previous batch
page <- next_cursor(followers)
# request the next 75,000 or less followers
batch <- get_followers(user, n = "all", page = page)
# binds the new batch with the follower up to now
followers <- rbind(followers, batch)
# only pauses when it's not the last batch
if (x != length(users) & i != n.batches) {
Sys.sleep(15 * 60)
message("Pause")
}
}
# simpler procedure if the user has less than 75,000 followers
} else {
followers <- get_followers(user, n = "all")
}
return(followers)
})
}
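A hypothetical call could then look like this (the two account names only serve as examples):
# DO NOT RUN: collects the followers of both accounts into a list of data frames
follower.list <- iterateFollowers(c("cducsubt", "spdbt"))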
Friends
The function get_friends()
works pretty much the same except that it informs about whom the specified user follows. These users are referred to as friends
in Twitter’s architecture.
# requests the friends of the politician
friends <- get_friends(politician)
# displays the user ids of the first few friends
head(friends)
Timeline
Assume now that we're interested in each tweet Trump shared with the world. For this purpose rtweet provides the function get_timeline().
# requests the last 5000 tweets of Donald Trump
tweets <- get_timeline(politician, n = 5000)
tweets %>% select("created_at", "text", "retweet_count", "favorite_count") %>%
# takes a random sample
sample_n(10)
Again we run into the API restrictions imposed by Twitter. We can't access more than 3,200 tweets per user through the Rest API.
nrow(tweets)
## [1] 3223
There are ways to work around this restriction. The easiest one is to instantly capture the tweets of users as they are published, though this doesn't let you access tweets from the past. The other way is web scraping, which is not as easy.
Let’s find out which tweet has the most retweets:
# piping: the operator %>% pipes the output of the previous expression to
# the following expression
# selects the column retweet_count of the data frame 'tweets'
tweets[, "retweet_count"] %>% which.max %>% tweets[., ] %>% select(created_at,
text, retweet_count)
The video shows Trump pretending to hit a man whose face is replaced by the CNN logo. In case you don't know it yet, here you go.
Not every tweet of Trump has this style as his tweet with the most favorites shows:
# selects the column favorite_count
tweets[, "favorite_count"] %>% # determines the observation, i.e. the row number, with the most favorites
which.max %>% # selects the whole row of this observation by piping the row number
tweets[., ] %>% # displays only the interesting columns
select(created_at, text, favorite_count)
Exercise 2
- Find a German politician that has more than 75,000 followers.
- Is he favoring any tweets? Whose?
- How many users follow him?
- Extract at least 75,001 followers with a single function.
- Catch the politician's last 100 tweets and assign them to the variable tweets.
Collect Tweets based on keywords
There are two ways to obtain tweets that either match certain keywords or mention specific users without having to know the authors of those tweets beforehand. One of them is using the search feature of the Rest API, the other is filtering the Streaming API for these keywords or users. Rtweet provides the function search_tweets()
for the first and stream_tweets()
for the second way.
A quick recall: Twitter does not impose any limit on the number of tweets you can get per se. But the Rest API does not return tweets older than a week and the Streaming API limits the amount of tweets to 1% of all tweets at any moment. To give you a figure of how many tweets you can capture per day: the daily volume is approximately 230 million tweets.
Search Engine
Let’s search for the last 100 tweets about the German election:
# searches for tweets that contain the hashtag #btw17
tweets.btw17 <- search_tweets(q = "#btw17", n = 100, type = "recent")
# displays only the text of those tweets
tweets.btw17 %>% select(text) %>% # random sample
sample_n(size = 10)
You can ask Twitter's Rest API for different kinds of search results. search_tweets() by default returns the most recent tweets, but you can request mixed search results or the most popular ones.
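A mixed request, for example, only differs in the type argument; a quick sketch:
# searches for a mix of recent and popular tweets containing the hashtag #btw17
tweets.mixed <- search_tweets(q = "#btw17", n = 10, type = "mixed")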
Let's find the most popular tweets related to the past election using the hashtag #btw17.
# searches for the most popular tweets containing the keywords in the last 7
# (!) days
tweets.popular <- search_tweets(q = "#btw17", n = 10, type = "popular")
# only display certain columns of the popular tweets
tweets.popular %>% select(screen_name, text, retweet_count, favorite_count) %>%
# random sample
sample_n(size = 10)
We can use the common Boolean operators AND and OR to narrow the query.
# searches for tweets that contain the hashtag #btw17 and the keyword 'Petry'
tweets.btw17_Petry <- search_tweets(q = "#btw17 AND Petry", n = 100, type = "recent")
tweets.btw17_Petry %>% select(text)
# searches for tweets that contain the keyword 'Gauland' or 'Höcke' (or both)
tweets.btw17_GauHoeck <- search_tweets(q = "Gauland OR Höcke", n = 100, type = "recent")
tweets.btw17_GauHoeck %>% select(text)
The function more or less accepts all parameters that Twitter’s Rest API offers, e.g. filtering for one language or only for tweets that are not retweets.
Who dared to tweet about our chancellor Merkel in English recently? And who referred to the famous SPIEGEL lately?
# captures the last 10 tweets about Merkel written in English
tweets_english <- search_tweets(q = "Merkel", n = 10, lang = "en")
tweets_english %>% select(text)
# returns the last 10 tweets about Spiegel that are no retweets
tweets_noRT <- search_tweets(q = "Spiegel", n = 10, include_rts = FALSE)
tweets_noRT %>% select(text)
Access Twitter’s Stream
If you plan to capture Twitter data related to a (huge) political event, the Streaming API will likely be your best choice. Thousands of users express their opinion about ongoing politics on Twitter every day, but you don't know in advance who is going to unleash their anger about Trump's impeachment and who will set off fireworks. Using the Streaming API lets you collect data instantly based on keywords. After some time you can easily end up with millions of tweets, but that's not a problem with the computing power available nowadays. It's easier to reduce unnecessary data in the data preparation phase later than to try to close data gaps afterwards.
Let's see who is tweeting about Merkel, the SPD, Deutschland or more broadly about the last German election. With the parameter timeout of the function stream_tweets() you can specify for how long you want to listen to Twitter's stream. If you choose FALSE, it streams indefinitely.
# captures tweets containing one of the specified keywords for one minute
tweets.streamed <- stream_tweets(q = "Merkel,Bundestag,SPD,Bundesregierung,Bundestagswahl",
timeout = (1 * 60))
# shows only the text of the tweets
tweets.streamed %>% select(text)
You can give a desired language to the function too.
# captures any tweets in English mentioning Merkel for two minutes
tweets.lang <- stream_tweets(q = "Merkel", timeout = (2 * 60), language = "en")
# displays the tweets
tweets.lang %>% select(text)
Moreover, Twitter permits tracking up to 5,000 user IDs via the Streaming API. In this case you have to call c() and use a separate string for each user_id.
# DO NOT RUN
tweets.users <- stream_tweets(q = c("regierungssprecher", "spiegelonline"),
timeout = (24 * 60 * 60))
tweets.users %>% select(text)
Michael W. Kearney, the author of rtweet, has written the convenient function users_data()
to retrieve user data on collected tweets. You only need to input the data frame containing the tweets and you’ll get the information available about the tweets’ authors.
users_data(tweets.streamed)
users_data(friends)
Exercise 3
- Find the last 100 or fewer tweets about your politician.
- Can you find any tweets about him in English?
- Now search again: this time for tweets containing the keyword Koalition or Jamaika. Find the 10 most popular ones.
- Access the Streaming API for some minutes to collect tweets about your politician. If you can't find any, search for another politician or the keywords Koalition and Jamaika.
- Obtain information about the tweets' authors.
Rate Limits
A few last words about the (rate) limits of Twitter's APIs. They are only important for the use of the Rest API, as you can't work around the volume limit of the Streaming API other than by buying yourself exclusive access. Rtweet includes the function rate_limit() that shows you how much of your volume you've used up in the current 15-minute time window. It also shows you when your access volume will be reset. You can access this information and use it to build functions that respect the rate limits.
# displays the current rate limits
rate_limit(token = rtweet_token)
# returns the remaining number of accepted requests for the current time
# window
rate_limit(rtweet_token)[1, 3]
## [1] 15
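For example, a collection function could check the followers endpoint before each batch and sleep until the window resets. A rough sketch, assuming the columns query, remaining and reset that rate_limit() returns:
# extracts the rate limit entry of the followers endpoint
limit.followers <- rate_limit(rtweet_token) %>% filter(query == "followers/ids")
# sleeps until the window resets, if no requests are left
if (limit.followers$remaining == 0) {
Sys.sleep(as.numeric(limit.followers$reset, units = "secs"))
}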
Exercise 3.1
- How can you improve my function iterateFollowers by using rate_limit()?
Data Analysis
The number of (social) scientists using data from social networks such as Twitter has increased significantly over the last couple of years. It's therefore not surprising that the number of tools to analyse this data has increased with it as well. But it is not only the number of available tools that has changed: their capacities and the range of analyses they offer "out-of-the-box" are well beyond what was ready to use some years ago. In this chapter, you'll learn to apply some analyses to Twitter data. First, we look at frequencies of words and create time plots. Afterwards we continue with the more complex task of creating a network from the data and conducting a network analysis.
We will work with real twitter data that I collected during 2017’s election campaign for the Bundestag.
Data Preparation
The classic data frame of R handles large data sets poorly for several reasons. But of course some R enthusiasts have already developed a solution for that problem with the package data.table, which increases performance significantly. fread() is a lot faster than read.csv().
You can’t reproduce the following part for now, but I provide you with a dataset for the exercise.
# loads the package
library(data.table)
# reads in the tweets if the file resides in your home directory
tweets.sample <- fread("tweets_btw17.csv")
## Read 1977422 rows and 16 (of 16) columns from 0.572 GB file in 00:00:17
Descriptive Statistics
The public quite intensely and controversially debated the use of (social) bots for campaigning. In general, it's interesting to see what's happening on social networks around election days. In the following I introduce a few descriptive statistics about a collection of tweets. Keep in mind, however, that the most fitting statistics depend on your research question.
Top Users
topusers <- tweets.sample %>% # groups the tweets by the variable user_name
group_by(user_name) %>% # counts the occurrences of each user_name
count %>% # rearranges the rows by the number of occurrences in descending order
arrange(desc(n))
topusers
Let's take a closer look at these users. They don't seem to belong to any major media network. lookup_users() gives us more information about them.
topusers[1:10, "user_name"] %>% lookup_users
A look at some of their profile background images makes it clear that most of their tweets are not written by a human being.
We see a very different picture if we not only look at which users post the most tweets, but combine these figures with the number of followers these accounts have.
# selects the 5,000 top users
mostfollowers <- topusers$user_name[1:5000] %>% # request their attributes
lookup_users() %>% # puts the users with the most followers to the top
arrange(desc(followers_count))
mostfollowers
There is more to discover as the available variables show us:
glimpse(tweets.sample)
## Observations: 1,977,422
## Variables: 16
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $ lang <chr> "de", "de", "de", "de", "de", "de", "de", "de...
## $ polarity <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000...
## $ created <chr> "2017-08-24T18:11:35", "2017-08-24T18:11:36",...
## $ text <chr> "@CSU Licht: Wird es nicht\nSchatten: Das nae...
## $ user_description <chr> "Dualitaet praegt das Leben. Gut/ Schlecht. Y...
## $ user_followers <int> 15, 666, 25329, 5, 28, 12494, 62, 259, 2, 159...
## $ user_location <chr> "", "", "Berlin, Deutschland", "Berlin, Deuts...
## $ coordinates <chr> "", "", "", "", "", "", "", "", "", "", "", "...
## $ user_bg_color <chr> "000000", "C0DEED", "ACDED6", "F5F8FA", "C0DE...
## $ id_str <S3: integer64> 1.812680e-248, 1.812681e-248, 1.812...
## $ subjectivity <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0...
## $ user_created <chr> "2017-01-04T03:50:14", "2010-01-07T12:58:09",...
## $ retweet_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ geo <chr> "", "", "", "", "", "", "", "", "", "", "", "...
## $ user_name <chr> "LichtuSchatt3n", "Gekko125", "Martin_Lejeune...
Number of tweets over time
We can plot the number of tweets per day. We have two distinct peaks: one on the day of the 'TV-Duell', the other on election day:
# the compound operator %<>% pipes the object on the left and assigns the output back to it
tweets.sample$created %<>% as.POSIXct
# plots the frequency of the tweets by days
ts_plot(tweets.sample, by = "days", dtname = "created")
Word Frequencies
If we want to extract patterns like keywords, we have to use regular expressions. Twitter delivers mentions as attributes via its APIs, but the sample data does not contain this attribute, which is why we have to extract them here with a custom function. Stringr is a package that eases the work with strings.
# most convenient package
library(stringr)
# extracts all mentions in each tweet
sample.mentions <- lapply(tweets.sample$text, function(x) {
# extracts every string that starts with @
str_extract_all(string = x, pattern = "(@[^\\s]+)")
}) %>% # keeps the mentions just in one vector
unlist
# now we take a look
sample.mentions %>% sample(40)
## [1] "@caranoia" "@CDU" "@YouTube-Video:"
## [4] "@MartinSchulz" "@rbbabendschau" "@ardmoma"
## [7] "@CarmenKuprat" "@MSFTMechanics" "@MartinSchulz"
## [10] "@BILD" "@PaulUmstaetter" "@MartinSchulz"
## [13] "@AfD" "@AfD" "@UdoHemmelgarn"
## [16] "@coderboypb" "@hubertus_heil" "@LesVertsSuisse"
## [19] "@zeitonline" "@c_lindner" "@tgd_att"
## [22] "@ulfposh" "@AfD" "@heuteshow"
## [25] "@Die_Gruenen" "@KonstantinKuhle" "@SPIEGELONLINE"
## [28] "@CDU" "@journ_online" "@Georg_Pazderski):"
## [31] "@MartinSchulz" "@VP" "@Fenerinho55"
## [34] "@welt" "@waldruhe" "@Ralf_Stegner"
## [37] "@JulianRoepcke" "@Fabian_Junge" "@HansAlbers6"
## [40] "@Mica4711"
This vector still contains every mention in any tweet, i.e. duplicates across tweets. Who was mentioned most often in our tweets?
sample.mentions.counted <- factor(sample.mentions) %>% table %>% sort(decreasing = TRUE)
I like to use ggplot2 for plots. Plotly is a good choice when you need interactive graphs or use plots in a shiny application. For time series, dygraphs is highly recommendable.
library(ggplot2)
# converts the vector into a data frame and renames the columns
sample.mentions.counted %<>% as.data.frame %>% rename(., User = ., Frequency = Freq)
# plots the 20 most frequent mentions
ggplot(data = sample.mentions.counted[1:20, ], aes(x = User, y = Frequency,
fill = User)) + geom_bar(stat = "identity") + ggtitle("number of mentions") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# let's put the figures in relation to the total number of tweets
ggplot(data = sample.mentions.counted[1:20, ], aes(x = User, y = (Frequency/nrow(tweets.sample)),
fill = User)) + geom_bar(stat = "identity") + ggtitle("Percentage of mentions in relation to total tweets",
subtitle = paste("n=", nrow(tweets.sample))) + ylab("Frequency in Percentage") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
We can find the most used hashtags as well:
# extracts all hashtags in each tweet
sample.hashtags <- lapply(tweets.sample$text, function(x) {
# lower case to aggregate similar/same hashtags
tolower(x) %>% # extracts every string that starts with #
str_extract_all(string = ., pattern = "(#[^\\s]+)")
}) %>% # keeps the hashtags just in one vector
unlist
# now we take a look
sample.hashtags %>% sample(40)
## [1] "#btw17" "#schulz:"
## [3] "#btw2017" "#btw2017"
## [5] "#le0609" "#news"
## [7] "#btw17" "#btwahl2017"
## [9] "#btw17." "#zschäpe"
## [11] "#geschichte" "#spd"
## [13] "#fuer" "#thermilindner"
## [15] "#merkel" "#grüne"
## [17] "#afd" "#aliceweidelgeruechte"
## [19] "#bundestag" "#btwgezwitscher"
## [21] "#dieschmidt" "#afd-chefin"
## [23] "#gysi" "#noafd"
## [25] "#linkspartei" "#afd"
## [27] "#btw17" "#csu"
## [29] "#cdu" "#afd"
## [31] "#spd" "#afd"
## [33] "#sektchen" "#wesermarsch…"
## [35] "#hetze." "#merkel"
## [37] "#spd" "#gehtwählen"
## [39] "#wahl2017" "#oezuguz"
As before we have to count the occurrences across tweets.
# counts the occurrences of each hashtag
sample.hashtags.counted <- factor(sample.hashtags) %>% table %>% sort(decreasing = TRUE)
# reshapes data to comply with plot requirements
sample.hashtags.counted %<>% as.data.frame %>% rename(., User = ., Frequency = Freq)
# plots the hashtags as bar chart
ggplot(data = sample.hashtags.counted[1:20, ], aes(x = User, y = Frequency,
fill = User)) + geom_bar(stat = "identity") + ggtitle("occurrences of hashtags") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
There is a bias towards the AfD in proportion to its size. Furthermore, I suspect my data set is biased a bit towards the political right. My choice of track keywords is probably responsible for this outcome. On the one hand the AfD was very present in the election campaign online and offline, but on the other hand this example shows how dependent the result of my analysis is on the data I collected.
Text Corpus Analysis
Another possible analysis would be to examine the occurrences of 'regular' words. In this case, we have to create a text corpus and clean it of stop words, punctuation and so forth. A corpus is an abstract concept that can comprise several types of text document collections. Texts of a common domain typically compose one corpus in order to analyze the use of language in this domain. For instance, the election manifestos of one party across elections can form one corpus. This collection would allow us to analyze how the party's use of language has changed over time or to examine whether a certain kind of language is connected to electoral success.
library(tm)
# filters the data set to tweets from the official parties accounts
tweets.parteien <- filter(tweets.sample, user_name == "spdde" | user_name ==
"CDU" | user_name == "fdp" | user_name == "Die_Gruenen" | user_name == "dieLinke")
# defines a corpus; we have to specify a language
corpus.parteien <- SimpleCorpus(VectorSource(tweets.parteien$text), control = list(language = "ger"))
corpus.parteien
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3057
# enables to inspect the corpus
inspect(corpus.parteien[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] Thomas de Maizière @BMI_Bund hat die linksextremistische Internetplattform ""linksuntenindymediaorg"" und den zugehör… https://t.co/8cadG2JAwF
## [2] Damit Wohnen bezahlbar bleibt: Am 24.9. @MartinSchulz und die SPD wählen - für faire Mieten! Mehr:… https://t.co/aGEEyj4SkD
## [3] Zu Gast bei der @CSU in Bad Kissingen: Angela #Merkel https://t.co/5kyqeZodEv
## [4] Unsere bayerische Schwester! https://t.co/w696nj1tzD
## [5] @SEENOTRETTUNG @CDUdresden @SPD_Dresden @dieLinke @DiePARTEI @fdp @Piratenpartei @_VPartei_ @Tierschutzparte… https://t.co/MbFSOM21tY
Before we can run a text analysis on the corpus, we have to apply some transformations to the corpus such as converting all characters to lower case. The package offers the function tm_map()
for this purpose. It applies (maps) a function to all elements of a corpus.
# custom function to remove URLS
removeURL <- content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "",
x, perl = T))
corpus.parteien %<>% # strips unnecessary whitespace
tm_map(stripWhitespace) %>% # transforms all words to lowercase
tm_map(content_transformer(tolower)) %>% # removes URLs
tm_map(removeURL) %>% # removes German stopwords such as der/die/das etc.
tm_map(removeWords, stopwords("german")) %>% # removes punctuation like '.'
tm_map(removePunctuation)
In the end we turn the data into a TermDocumentMatrix
. It documents if and how often a term appears in each document (here in each tweet).
tdm.parteien <- TermDocumentMatrix(corpus.parteien)
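If you want to peek at the matrix itself, inspect() also works on a term-document matrix; a quick sketch:
# displays the counts of a few terms in the first five tweets
inspect(tdm.parteien[1:5, 1:5])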
But finally I want to know which words the parties used in their tweets. The term-document matrix enables us to access the most frequent terms easily. We can state a minimum frequency that words must fulfill in order to be returned by the function:
# parameter lowfreq and highfreq define the limits
findFreqTerms(tdm.parteien, lowfreq = 50)
## [1] "martinschulz" "mehr" "spd" "wählen"
## [5] "angela" "merkel" "unsere" "dielinke"
## [9] "fdp" "deutschland" "fedidwgugl" "heute"
## [13] "zeitfürmartin" "uhr" "live" "amp"
## [17] "familien" "leben" "dass" "land"
## [21] "dafür" "afd" "gute" "menschen"
## [25] "geht" "gibt" "europa" "btw17"
## [29] "denkenwirneu" "unserer" "cdu" "diegruenen"
## [33] "spdde" "dietmarbartsch" "swagenknecht" "cemoezdemir"
## [37] "beim" "schulz" "müssen" "fakt"
## [41] "brauchen" "bildung" "zukunft" "wer"
## [45] "gut" "linke" "goeringeckardt" "jahren"
## [49] "vielen" "sicherheit" "tvduell" "rente"
## [53] "deshalb" "wahl2017" "clindner" "fünfkampf"
## [57] "klartext" "illnerintensiv" "esistzeit" "wahlarena"
## [61] "bpt17" "100hcdu" "schlussrunde"
The output indicates that the parties were talking a lot about themselves during the election campaign. Maybe it would be more interesting to see how they talk about certain issues like education. The function findAssocs()
returns us all terms that have at least the specified correlation with our input term:
findAssocs(tdm.parteien, "bildung", 0.1)
## $bildung
## irgendeine zukunftspolitik bildungspolitik
## 0.31 0.31 0.21
## weltbeste frühkindlichen bildungsetat
## 0.20 0.18 0.17
## wanka bund investieren
## 0.17 0.16 0.16
## verdoppelt beste überall
## 0.15 0.15 0.15
## sinn herkunft gebührenfreie
## 0.14 0.13 0.13
## rekordinvestitionen aufstieg grundstein
## 0.12 0.12 0.12
## ausprobieren johanna anpacken
## 0.12 0.12 0.12
## schieben jugendliche länder
## 0.12 0.12 0.11
## wichtig digitalisierung weiterbildung
## 0.11 0.11 0.11
## heikomaas zeitfuermartin abgewählt
## 0.11 0.10 0.10
## nrw kita
## 0.10 0.10
Another issue that didn't get the attention it deserves was Digitalisierung. Which party has a vision of how we're going to deal with the unique challenges in this domain?
fibre.assocs <- findAssocs(tdm.parteien, "digitalisierung", 0.1)
fibre.assocs$digitalisierung[1:20]
## chefsache neugier staatsminister
## 0.32 0.29 0.21
## öpnv berufsbilder zukunftprogrammieren
## 0.20 0.20 0.14
## dekarbonisierung dezentralisierung ramonapop
## 0.14 0.14 0.14
## sichtlich meins industrie
## 0.14 0.14 0.14
## rahmen schmalspur verschlafen
## 0.14 0.14 0.14
## immoniehus handlungsvorschläge parat
## 0.14 0.14 0.14
## weltmeisterplan jochenblind
## 0.14 0.14
My guess is that instead of discussing the urgent issue of digitalization they talked a lot about migration:
findAssocs(tdm.parteien, "migration", 0.1)
## $migration
## ansatz illegalen reduzierung
## 0.71 0.71 0.71
## steuerung verfolge entwicklungspolitik
## 0.71 0.71 0.71
## handwerk illegale schleppern
## 0.71 0.71 0.71
## ordnung funktionieren legen
## 0.50 0.50 0.41
## stoppen sommerinterview
## 0.29 0.13
Of course, we can reduce the corpus further to only those tweets that were written by a particular party:
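The chunk that builds the SPD-only matrix tdm.spd used below is not shown; a minimal sketch, reusing the cleaning steps from above, could look like this:
# keeps only the tweets of the SPD account
tweets.spd <- filter(tweets.parteien, user_name == "spdde")
# builds and cleans a corpus as before
corpus.spd <- SimpleCorpus(VectorSource(tweets.spd$text), control = list(language = "ger"))
corpus.spd %<>% tm_map(stripWhitespace) %>% tm_map(content_transformer(tolower)) %>%
tm_map(removeURL) %>% tm_map(removeWords, stopwords("german")) %>% tm_map(removePunctuation)
# turns the corpus into a term-document matrix
tdm.spd <- TermDocumentMatrix(corpus.spd)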
assocs.spd <- findAssocs(tdm.spd, "rente", 0.1)
assocs.spd$rente[1:15]
## schäuble verlässliche inhalte konzepte
## 0.43 0.43 0.30 0.30
## steuern höhere rentenbeiträge redet
## 0.30 0.30 0.30 0.30
## handlungsbedarf schlussrunde schuldig manuelaschwesig
## 0.30 0.28 0.24 0.23
## beiträge droht sinkendes
## 0.23 0.21 0.21
A few last words on text analysis. You can use this approach easily for any kind of document collection. In fact, there is no difference in the process after you’ve created the corpus.
I think the package corpus provides some interesting features for text analysis as well that go beyond what tm offers. For instance, you can apply the concept of n-grams to your documents. Instead of only receiving the correlations terms share with other terms, n-grams give you frequently occurring sequences that include your word.
Exercise 4
- Load the file tweets_electionday.RData containing the tweets from the election day after 6 pm.
- Find out the most popular hashtags and most frequent words and plot your results.
- What are the word associations for AfD, CDU, SPD and Bundestagswahl?
Network Analysis
In the social sciences the analysis of networks has become quite popular in recent years. This is partly due to the rise of social network platforms where you can study relations between human beings. The technique itself, however, is quite old and not new to sociologists, who concern themselves with networks within an entity and between entities of any kind.
Throughout this notebook I consider a network to map social relations between N entities (nodes), where E edges display the relations. Some scientists also use the term vertex (Eckpunkt) instead of node. Not every node shares an edge with every other node. Transferred to Twitter, who follows whom or who mentions whom can form a social network.
On Twitter you can distinguish between verified and non-verified users. The first group, for example, does not see every tweet in which they are mentioned. One could consider communication among verified users a partly separated public sphere, even though other users can observe this communication. A 'grouped' network therefore seems to be an interesting subject, with verified users belonging to one group and regular users to a second group. It would be interesting to see how communication flows, or does not flow, between these two groups.
Selection of entities
I've decided to include every user from our example data set that has more than 10,000 followers as an entity in my network. This decision leaves us with 234 vertices in the end. It was a somewhat arbitrary decision, but it delivers entities from various fields. To speak fairly vaguely, how you select the users you include depends on the broader framework in which your network of interest is embedded.
# returns me only those users with followers >= 10000
influencers <- mostfollowers %>% filter(followers_count >= 10000)
# displays a sample containing 50 observations of those followers
sample(influencers[, "screen_name"], 50)
## [1] "MediterrNewsNet" "fdp_nrw" "ismail_kupeli"
## [4] "BlnTageszeitung" "dushanwegner" "WELT_Politik"
## [7] "SPIEGEL_24" "Ralf_Stegner" "MEEDIA"
## [10] "PortalAlemania" "gaborhalasz1" "BWBreaking"
## [13] "tourismusvideo" "OZlive" "rbb24"
## [16] "peter_simone" "aktuelle_stunde" "Beatrix_vStorch"
## [19] "rbbinforadio" "ntvde" "tazgezwitscher"
## [22] "MGrosseBroemer" "ulfposh" "WDR"
## [25] "taz_news" "WAZ_Redaktion" "LisaL80"
## [28] "SVZonline" "jungewelt" "AZ_Augsburg"
## [31] "focusonline" "FraukePetry" "focuspolitik"
## [34] "derfreitag" "aktenzeichenyx" "bpb_de"
## [37] "NZZ" "hessenschauDE" "annalist"
## [40] "Endzeitkind" "handelsblatt" "RenateKuenast"
## [43] "DerSPIEGEL" "welt" "SPIEGEL_Politik"
## [46] "frielingbailey" "natsocialist" "1LIVE"
## [49] "niggi" "inzamaus"
Create a network with igraph
With the package igraph
one can directly create and visualize networks in R. I show you the basic features of a network with igraph:
library(igraph)
# creates a network with three vertices that share one undirected edge the
# first two values of the vector give the ends of the first edge and so
# on...
graph1 <- graph(edges = c(1, 2, 2, 3, 3, 1), directed = FALSE)
plot(graph1)
If we call the graph directly, we get information about the graph’s structure:
graph1
## IGRAPH 843d5ac U--- 3 3 --
## + edges from 843d5ac:
## [1] 1--2 2--3 1--3
Of course, igraphs can have directed edges too:
graph2 <- graph(c(3, 5, 4, 3, 2, 8, 6, 3), directed = TRUE, n = 8)
plot(graph2)
Using names for the vertices might be more meaningful:
graph3 <- graph(c("Felix", "Sarah", "Sarah", "John", "John", "Stella", "Stella",
"Felix"))
plot(graph3)
Access edge and vertex attributes
To obtain information about your network’s edges or vertices:
E(graph3)
## + 4/4 edges from faa7dbe (vertex names):
## [1] Felix ->Sarah Sarah ->John John ->Stella Stella->Felix
V(graph3)
## + 4/4 vertices, named, from faa7dbe:
## [1] Felix Sarah John Stella
# to examine the network matrix directly
graph3[]
## 4 x 4 sparse Matrix of class "dgCMatrix"
## Felix Sarah John Stella
## Felix . 1 . .
## Sarah . . 1 .
## John . . . 1
## Stella 1 . . .
Add attributes to the network
igraph enables you to add attributes to your vertices, e.g. gender, or to your edges, e.g. the type of relation:
V(graph3)$gender <- c("male", "female", "male", "female")
E(graph3)$type <- "mention"
E(graph3)$weight <- c(1, 2, 2, 1)
edge_attr(graph3)
## $type
## [1] "mention" "mention" "mention" "mention"
##
## $weight
## [1] 1 2 2 1
vertex_attr(graph3)
## $name
## [1] "Felix" "Sarah" "John" "Stella"
##
## $gender
## [1] "male" "female" "male" "female"
Uses the attribute gender to color the nodes in the plot:
plot(graph3, edge.arrow.size = 0.5, vertex.label.color = "black", vertex.label.dist = 1.5,
vertex.color = c("skyblue", "pink")[1 + (V(graph3)$gender == "male")])
There are a lot more options to control the shape of your network or the layout of your plot. I'll show them to you later on, when they are necessary to obtain good-looking graphs.
Use twitter data for a network analysis
To give you an idea of how we have to reshape Twitter data for a network analysis, I'll first show you which data (format) we need in the end. After all transformations we need two data tables with the following structure:
# first data table
nodes <- data.table(id = 1:5, user_name = c("SPIEGELONLINE", "SPDDE", "Alice_Weidel",
"WWF_Deutschland", "heuteshow"), type = c("media", "party", "politician",
"ngo", "media"), followers = c(2368437, 329874, 18244, 391447, 319055))
# second data table
edges <- data.table(from = as.character(c(1, 4, 5, 3, 5, 1)), to = as.character(c(3,
1, 3, 2, 4, 3)), weight = 1, type = "mention")
glimpse(nodes)
## Observations: 5
## Variables: 4
## $ id <int> 1, 2, 3, 4, 5
## $ user_name <chr> "SPIEGELONLINE", "SPDDE", "Alice_Weidel", "WWF_Deuts...
## $ type <chr> "media", "party", "politician", "ngo", "media"
## $ followers <dbl> 2368437, 329874, 18244, 391447, 319055
glimpse(edges)
## Observations: 6
## Variables: 4
## $ from <chr> "1", "4", "5", "3", "5", "1"
## $ to <chr> "3", "1", "3", "2", "4", "3"
## $ weight <dbl> 1, 1, 1, 1, 1, 1
## $ type <chr> "mention", "mention", "mention", "mention", "mention", ...
To create a graph with the two data tables, one simply uses the function graph_from_data_frame()
:
twitter.graph <- graph_from_data_frame(d = edges, directed = TRUE, vertices = nodes)
# plots the graph with the vertex's user name as label
plot(twitter.graph, vertex.label = nodes$user_name)
Now we can think about how we get two data tables in these shapes out of our data. First, we create a data table with the selected users:
nodes2 <- data.table(id = as.character(1:nrow(influencers)), user_name = influencers$screen_name,
type = NA, followers = influencers$followers_count)
head(nodes2)
tail(nodes2)
Unfortunately, assigning the correct type is manual work. To be transparent: for more clarity I've hidden the code chunk that creates a vector named types containing the types.
But the vector exists, has the correct length and gives us thirteen types of nodes:
# checks the data integrity
length(types)
## [1] 234
head(types)
## [1] "media" "media" "media" "media" "media" "media"
head(nodes2)
table(types)
## types
## artist company government influencer journalist
## 2 5 1 2 21
## media miscellaneous newsbot ngo party
## 109 10 6 4 15
## politician privat scientist
## 22 34 3
Almost half of the accounts belong to the media. Maybe the number of followers alone is not a good criterion for choosing entities.
Create edges
To create edges out of Twitter data, I would like to find out who follows whom and use this information to create the edges. Mentions may be used to create edges as well, but we keep that for later. Our tweets do not contain the user_id, which is why I proceed as follows:
# returns me the id and more information about the user
info.nodes <- lookup_users(nodes2$user_name)
# creates a vector with the user ids
userid.nodes <- info.nodes$user_id
# determines which of the network users follow other users in the network
dt.list <- lapply(1:nrow(nodes2), function(single.id) {
print(paste(single.id, "of", nrow(nodes2)))
# retrieves the friend list
users.friends <- get_friends(nodes2[["user_name"]][single.id])
# looks for matches of network users in his friends list and saves the
# respective position(s)
vertices.ids <- which(userid.nodes %in% users.friends$user_id)
# returns nothing, if there is no single match
no.match <- if (all(!(userid.nodes %in% users.friends$user_id)))
return(NULL)
# otherwise it tabulates the relation in our desired format
users.edges <- data.table(from = as.character(single.id), to = as.character(vertices.ids),
weight = 1, type = "follows")
# tests whether single.id is divisible by 15 to respect the rate limit
if (single.id%%15 == 0) {
print("pauses")
Sys.sleep(15 * 60)
}
return(users.edges)
})
# merges the data tables of all list elements
edges2 <- rbindlist(dt.list)
With these data tables at hand we can construct a graph:
# constructs the graph from both data tables
graph4 <- graph_from_data_frame(d = edges2, directed = TRUE, vertices = nodes2)
# removes vertices with no more than one edge
graph4 <- delete.vertices(graph4, which(degree(graph4) <= 1))
# determines the types that exist
types.unique <- V(graph4)$type %>% unique
# let's color the nodes in respect to their type
colrs <- sample(colors(), length(types.unique))
# adds the attribute color to each node
for (i in 1:length(types.unique)) {
ind <- which(V(graph4)$type == types.unique[i])
V(graph4)$color[ind] <- colrs[i]
}
plot(graph4, layout = layout.auto, vertex.label = vertex_attr(graph4, "user_name"),
vertex.size = 4, edge.arrow.size = 0.1, vertex.label.dist = 0.6, vertex.label.cex = 0.7,
vertex.color = V(graph4)$color)
# creates a subgraph by taking a random sample of 30 nodes
graph4.sub <- induced_subgraph(graph4, sample.int(229, 30))
plot(graph4.sub, layout = layout.auto, vertex.label = vertex_attr(graph4.sub,
"user_name"), vertex.size = 4, edge.arrow.size = 0.1, vertex.label.dist = 0.6,
vertex.label.cex = 0.7, vertex.color = V(graph4.sub)$color)
You can see that the visualization of such huge networks, where many nodes share edges, can be quite messy. It needs a lot of fine-tuning if you want to achieve good-looking graphs. Other programs like gephi may be better suited for that task.
I can refine the plot, e.g. by using the number of followers for the vertex size or by computing centrality measures such as degree() and authority_score().
Uses the number of followers for the vertex size:
# normalizes the followers to a range between 0 and 1
for (i in 1:length(V(graph4.sub)$followers)) {
# since the range of values is very wide, taking the log is recommended
logs <- log(V(graph4.sub)$followers)
# normalizes to values between 0 and 1
V(graph4.sub)$vertex.size[i] <- (logs[i] - min(logs))/(max(logs) - min(logs))
}
plot(graph4.sub, layout = layout.auto, vertex.label = vertex_attr(graph4.sub,
"user_name"), vertex.size = V(graph4.sub)$vertex.size * 4, edge.arrow.size = 0.1,
vertex.label.dist = 0.6, vertex.label.cex = 0.5, vertex.color = V(graph4.sub)$color)
On Twitter communication usually happens directly between users. As long as a Twitter account is public, everybody can follow this account; there is no structural barrier. Therefore, centrality measures (e.g. betweenness()) that assume nodes to be a kind of gatekeeper may not be applicable for this example, where the decision to follow someone forms the edges. Secondly, through retweets the original tweets may flow between users that are not directly connected, but it is disputable whether this possibility is reason enough to give meaning to users that sit in between the network. In contrast to that, we can use the degree() and authority_score() functions here, which give weight to the nodes that are followed a lot:
# basically uses followers and friends within the network to determine the
# nodes' centrality
centrality.tot <- degree(graph4, mode = "total")
# only uses a nodes' followers to determine its centrality
centrality.in <- degree(graph4, mode = "in")
# returns the nodes with the highest in-degree centrality
data.frame(vertex.attributes(graph4)$user_name, centrality.in, centrality.tot) %>%
arrange(desc(centrality.in))
auth.score <- authority_score(graph4, weights = NA)$vector
data.frame(vertex.attributes(graph4)$user_name, auth.score) %>% arrange(desc(auth.score))
plot(graph4, layout = layout.auto, vertex.label = vertex_attr(graph4, "user_name"),
vertex.size = auth.score * 10, edge.arrow.size = 0.1, vertex.label.dist = 0.6,
vertex.label.cex = 0.5, vertex.color = V(graph4)$color, main = "authorities")
There are more options to refine your network analysis. Check out the documentation of igraph for information on additional distance and path functions, how to create subgroups and communities, and how to detect assortativity and homophily based on attributes.
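To give you a taste, here is a brief sketch of community detection and an assortativity measure on our follower graph (cluster_walktrap() is only one of several algorithms igraph provides):
# detects communities with the walktrap algorithm; the graph is treated as undirected here
communities <- cluster_walktrap(as.undirected(graph4))
# shows how many communities were found and how large they are
length(communities)
sizes(communities)
# checks whether accounts tend to follow accounts with a similar number of followers
assortativity(graph4, V(graph4)$followers, directed = TRUE)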
Mentions to create edges
For the closing part of this section about network analysis we turn to mentions to create the edges. To that end, we have to recreate the nodes data table too, since we can't know a priori which users are important and which are not. Hence, we first determine how many unique users we have in our data set of almost two million tweets:
# determines how many unique users exist in the dataset
unique.users <- tweets.sample$user_name %>% unique
length(unique.users)
## [1] 199844
With the vector containing the unique users we can create the familiar data table storing them as nodes. Given the number of nodes, I'm not going to label them with categories this time. At least not at this point.
# creates an empty table with the necessary columns and sequential ids
nodes3 <- data.table(id = 1:length(unique.users), user_name = unique.users,
type = NA, followers = 0)
# assigns the number of followers to each node
nodes3$followers <- sapply(1:nrow(nodes3), function(pos) {
# finds the tweets that were published by this user (node)
allmatches <- which(tweets.sample$user_name == nodes3[["user_name"]][pos])
# takes the last match
lastpos <- allmatches[length(allmatches)]
print(paste("completed", pos, "of", nrow(nodes3)))
# returns the number of followers that user had at the time of his last
# tweet
tweets.sample[["user_followers"]][lastpos]
})
With almost two million tweets it takes quite a while to extract every single mention:
# determines all mentions in one tweet
edges3 <- lapply(1:nrow(tweets.sample), function(tweet.no) {
# informs about the progress
message(paste("processing", tweet.no, "of", nrow(tweets.sample)))
# extracts all mentions
mentions <- str_extract_all(tweets.sample[["text"]][tweet.no], "(@[a-zA-Z0-9_]+)")[[1]] %>%
str_replace(., "@", "")
# identifies if mentions are one of our nodes
mentions.pos <- which(nodes3$user_name %in% mentions)
# returns nothing if there are no matches
no.match <- if (length(mentions.pos) == 0)
return(NULL)
# records the author of the tweet
author <- which(nodes3$user_name == tweets.sample$user_name[tweet.no])
# puts the data into a data table with the desired format
dt.list <- data.table(from = author, to = mentions.pos, weight = 1, type = "mentions")
print(dt.list)
return(dt.list)
})
# binds all data tables into one
edges3 <- rbindlist(edges3)
Now we can again create the network graph:
# same procedures as usual
graph5 <- graph_from_data_frame(d = edges3, directed = TRUE, vertices = nodes3)
Either we reduce the network’s number of nodes or we focus on numeric measures like the authority score that informs us about the network’s structure:
centrality.score.tot <- degree(graph5, mode = "total")
centrality.score.in <- degree(graph5, mode = "in")
# returns the nodes with the highest in-degree centrality
data.frame(vertex.attributes(graph5)$user_name, centrality.score.in, centrality.score.tot) %>%
arrange(desc(centrality.score.in))
auth.score.g5 <- authority_score(graph5)$vector
# assigns the authority score as attribute to the graph
vertex.attributes(graph5)$authority <- auth.score.g5
# returns us the nodes with the highest authority
data.frame(vertex.attributes(graph5)$user_name, auth.score.g5) %>% arrange(desc(auth.score.g5))
I do not trust the second set of results, and/or they are additional evidence that my data set is biased towards the political right.
Let’s see if we can create a nice plot by including only vertices with an in-degree centrality equal to or higher than 1000:
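A possible sketch keeps only the heavily mentioned accounts and plots the resulting subgraph:
# keeps only the vertices that are mentioned at least 1,000 times
graph5.sub <- induced_subgraph(graph5, which(centrality.score.in >= 1000))
# plots the reduced mention network
plot(graph5.sub, vertex.label = vertex_attr(graph5.sub, "user_name"), vertex.size = 4,
edge.arrow.size = 0.1, vertex.label.dist = 0.6, vertex.label.cex = 0.7)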
The package graphTweets encompasses functions that allow you to create edges from data frames out of the box.
Exercise 5
- Create a network with igraph using the data set containing tweets from the election day after 6 pm. Use attributes for your nodes that exist in your data. Include follows as the edge type; mentions will take too long.
- Plot your network and use your attributes to draw a meaningful graph.
- Compute the centrality once for the mode in-degree and once for total. Rank the nodes in descending order and compare your two results.
- Determine the most significant authorities in your network.
Automation
The last part of the workshop is about how to automate the collection of tweets. In this respect, we have to distinguish between three cases that require different approaches:
The first case is that you want to periodically capture future tweets of a user. In that case you can write a script that you either execute manually or, better, automatically, and that saves the tweets.
The second case is that you want to capture future (!) tweets that contain keywords that are not related to a restricted group of users. In that case you want to have a script that listens to Twitter's Streaming API constantly, i.e. runs 24/7 and saves the tweets in a database/file for you.
The third case is that you would like to retrieve historic tweets that neither the Rest API nor the Streaming API will deliver to you due to built-in limitations. In that case you need to use a setting similar to the one that Jefferson Henrique proposes with his project GetOldTweets-python. It mimics the browsing behavior of a human being and harvests the data that is delivered to your browser. I describe how this works in detail in the respective section.
Case 1: Rest API
Let's say I have two politicians whose communication on Twitter I would like to store. For this aim the Streaming API would be a bit too much, and my data would probably be messy, e.g. containing tweets from other users mentioning one of them. Regularly retrieving new tweets of both politicians via the Rest API is more appropriate here, even more so when you can plan ahead and want to cover a certain time period, e.g. an election campaign. A solution in R can look like the following:
# loads the libraries
library(rtweet)
library(data.table)
# preamble: where you maybe want to set a working directory ...
# vector with the usernames
usernames <- c("katjakipping", "peteraltmaier")
# checks if some tweets were already collected and loads them
try(load("politiciantweets.RData"), TRUE)
# first time
if (!exists("w")) {
for (i in 1:length(usernames)) {
# for the very first timeline
if (i == 1) {
# gets the tweets and create the data.table only once
tweets <- data.table(get_timeline(usernames[i]))
} else {
# gets the timeline and binds them to existing data.table
tweets <- rbind(tweets, get_timeline(usernames[i]))
}
# sets status_id as the key column
setkey(tweets, status_id)
# counter
w <- 1
save("tweets", "w", file = "politiciantweets.RData")
}
} else {
for (i in 1:length(usernames)) {
tweets <- rbind(tweets, get_timeline(usernames[i]))
}
# removes duplicates using the key column status_id
tweets <- subset(unique(tweets))
w <- w + 1
save("tweets", "w", file = "politiciantweets.RData")
}
Either use cronR (Linux/Unix) or taskscheduleR (Windows) to schedule your R script:
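A hypothetical cronR example; the script path and the schedule are placeholders:
library(cronR)
# wraps the collection script into a command that cron can execute
cmd <- cron_rscript("/home/user/collect_politicians.R")
# runs the script once every hour
cron_add(cmd, frequency = "hourly", id = "collect_politicians")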
Case 2: Streaming API
Kudos to Vik Paruchuri, who has written this beautiful collection of Python scripts to listen to Twitter's Streaming API. The advantage of his approach is that he uses sqlite, a library that contains a relational database, to save the tweets. Twitter breaks the connection if your setup does not process the incoming tweets fast enough. Sqlite is fast, which is why this should not happen as long as your computer has enough power. I've modified his code here and there according to my needs.
To ensure that the setup works fine, one needs to install the modules that are listed in the requirements.txt:
# command assumes that the directory of the file = working directory
# maybe with sudo
pip install -r requirements.txt
In essence, you only need to put your credentials in private.py and specify your desired keywords in the settings.py file, where you can also specify further restrictions like language or geo location. For the past elections I used a Linux server where I made sure that the execution of scraper.py is monitored and restarted automatically if either the connection breaks or the server has to reboot for any reason.
requirements.txt
tweepy
ipython
matplotlib
scipy
numpy
pandas
dataset
psycopg2
private.py
consumer_key="YOUR CONSUMER KEY"
consumer_secret="YOUR CONSUMER SECRET"
access_token="YOUR ACCESS TOKEN"
access_token_secret="YOUR ACCESS TOKEN SECRET"
# database name
CONNECTION_STRING = "sqlite:///tweets.db"
settings.py
# -*- coding: utf-8 -*-
# vector with terms to track
TRACK_TERMS = ["#btw17"]
# languages to filter the stream for (referenced by scraper.py; "de" is only an example value)
TRACK_LANGUAGES = ["de"]
# import this object from private.py
CONNECTION_STRING = ""
# name of the csv file, when you dump the data
CSV_NAME = "tweets_btw17.csv"
TABLE_NAME = "btw17"
try:
from private import *
except Exception:
pass
scraper.py
# -*- coding: utf-8 -*-
import settings
import tweepy
import dataset
from sqlalchemy.exc import ProgrammingError
import json
# connects to the database
db = dataset.connect(settings.CONNECTION_STRING)
# creates the class StreamListener based on tweepy's StreamListener
class StreamListener(tweepy.StreamListener):
# basically says what to do with statuses that are received
def on_status(self, status):
# prevents retweets from being stored
if (status.retweeted) or ('RT @' in status.text):
print('retweet')
return
print(status.retweeted)
# saves the following attributes of the tweet
description = status.user.description
loc = status.user.location
text = status.text
coords = status.coordinates
geo = status.geo
name = status.user.screen_name
user_created = status.user.created_at
followers = status.user.followers_count
id_str = status.id_str
created = status.created_at
retweets = status.retweet_count
bg_color = status.user.profile_background_color
lang = status.lang
# transforms into a string
if geo is not None:
geo = json.dumps(geo)
if coords is not None:
coords = json.dumps(coords)
# specifies where to store the tweet
table = db[settings.TABLE_NAME]
# tries to store the tweet in table
try:
table.insert(dict(
user_description=description,
user_location=loc,
coordinates=coords,
text=text,
geo=geo,
user_name=name,
user_created=user_created,
user_followers=followers,
id_str=id_str,
created=created,
retweet_count=retweets,
user_bg_color=bg_color,
lang=lang,
))
# says what to do when an error is encountered
except ProgrammingError as err:
print(err)
def on_error(self, status_code):
if status_code == 420:
#returning False in on_data disconnects the stream
return False
# loads authentication credentials
auth = tweepy.OAuthHandler(settings.consumer_key, settings.consumer_secret)
# sets token
auth.set_access_token(settings.access_token, settings.access_token_secret)
# creates the API object using the authentication handler
api = tweepy.API(auth)
# assigns our before written class StreamListener
stream_listener = StreamListener()
# starts listening to the stream
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
# filters the stream according to our track terms, ...
stream.filter(track=settings.TRACK_TERMS, languages=settings.TRACK_LANGUAGES)
dumpy.py
import settings
import tweepy
import dataset
# connects to database
db = dataset.connect(settings.CONNECTION_STRING)
# retrieves every tweet stored
result = db[settings.TABLE_NAME].all()
# saves the tweets in a csv
dataset.freeze(result, format='csv', filename=settings.CSV_NAME, encoding='utf-8')
You only need to save these files in the same folder and you're good to go.
Case 3: web harvesting
When you do not have the chance to use the Streaming API, e.g. because you want to access many very old tweets, the search query is your last option. It would however be very tiring to collect the search results for a query by hand. To ease this task we can use Jefferson's project, which works without problems.
Installation
The best way to install this project is to clone it to a local folder. How you clone a git repository depends on your operating system. The following command works on UNIX systems:
git clone https://github.com/Jefferson-Henrique/GetOldTweets-python
The project comes with some requirements as well:
# this works only when you're in the local directory the project was cloned to
pip install -r requirements.txt
After a successful installation you can retrieve tweets by username, keyword and time. It's sometimes advisable to divide the collection into years or to set a limit.
search by username
The following command searches for all tweets by Trump in 2009 and stores them in a CSV file:
# assumes to be in the directory where Exporter.py is located
# how you call python may be different under Windows
python Exporter.py --username 'realdonaldtrump' --since 2009-01-01 --until 2010-01-01 --output 'trump2009.csv'
search by keyword
This way you can get to know what people think about certain activities:
python Exporter.py --querysearch 'debat-o-meter' --maxtweets 100 --output 'debatometer100.csv'
That would be a great success:
“Sind die Begriffe Wahl-O-Mat und Debat-O-Meter eigentlich schon in den Duden aufgenommen worden?” – krabbl_
search for top tweets
Sometimes we want to know what our favorite celebrity is up to:
python Exporter.py --username 'justinbieber' --maxtweets 100 --toptweets
Exercise 6
- Install both python applications.
- Experiment with the available queries and parameters.
More information
I hope you’ve enjoyed this workshop and learned a lot! Feel free to contact me with any suggestions for improvement or questions regarding the content.
Below you find some more packages related to working with social media data. Besides that, I've added a few references that provide more information on the topic of this workshop.
packages
Some interesting packages that are worth a look if you plan to use data from social networks:
References
- https://github.com/pablobarbera/social-media-workshop: shows how to collect data on Facebook, Instagram, …
- tutorial on creating network graphs with igraph
- http://blogs.lse.ac.uk/impactofsocialsciences/category/digital-methodologies-series/: a series on digital methodologies in the social sciences