General Information

In my workshop you learn several ways to collect Twitter data for your own research. The second main emphasis is on descriptive analysis and on how to conduct a network analysis with the data.

We mainly use “ready-to-use” functions of existing packages, but if you conduct a research project later on, you will probably have to write customized functions to reach your goals. This may seem tough if you are new to R, but it’s the aspect that actually makes R unique compared to other statistical programs.

I’ve decided to use the opinionated tidyverse, a collection of R packages designed for data science, for some code chunks. The packages share a common philosophy and work smoothly together. This approach includes the coding practice of piping, which makes R code easier to read. Some R practitioners consider the introduction of piping the most important innovation of recent years. I will point out the first few times I use this coding principle in the code chunks. But don’t worry, you don’t need to use the pipe operators yourself to complete this workshop successfully.

The overall subject of the workshop is the Bundestagswahl 2017, which demonstrates the opportunities on an issue of political science. I recommend going through the instructions in order, as some sections build on knowledge you acquire in previous sections. My second piece of advice is to try to understand the logic behind my code chunks; you don’t necessarily have to replicate them all, and in a few cases it won’t be possible. To check your knowledge, each section finishes with a small exercise.

Happy coding!

Packages

Make sure you have the latest versions of these packages installed and loaded:

# installs packages
install.packages(c("rtweet", "tidyverse", "ggplot2", "tm", "igraph", "data.table", 
    "stringr"), repos = "http://cran.us.r-project.org")

# loads packages
library("rtweet")
library("tidyverse")
library("ggplot2")
library("tm")
library("igraph")
library("data.table")
library("stringr")

Piping

Traditional R coding forces you to either wrap functions into other functions or assign a lot of intermediate variables. The concept of piping allows you to forward the output of one function to the next one and read the sequence from left to right, which improves readability. This functionality comes with the package magrittr, which is part of the tidyverse.

Let me show you two code chunks that do the same thing; decide for yourself which one you consider more intuitive and readable.

Traditional approach:

# you have to read from the center outwards to understand what's going on
mean(mtcars[which(mtcars$mpg > 15), "mpg"])
## [1] 21.72308

Pipe approach:

# input data
mtcars %>% # filters the data for each observation that has a value greater than 15 for mpg
filter(mpg > 15) %>% # selects the column mpg
select(mpg) %>% colMeans
##      mpg 
## 21.72308

While the pipe operator %>% is the most important operator, others exist for different uses as well. Consult the vignette and the documentation of the package for more information. I sometimes use . within a function. The dot explicitly states where the output of the previous function should be placed, which is necessary in a few cases. In other words:

# the following code snippets are the same
mtcars %>% filter(mpg > 15)
# and
mtcars %>% filter(., mpg > 15)
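
Another operator worth knowing is the compound assignment pipe %<>%, which pipes the object on the left through the chain and assigns the result back to that object. Note that, unlike %>%, it is not attached by library("tidyverse"); the small sketch below assumes you attach magrittr explicitly.

# attaches magrittr to get the compound assignment pipe %<>%
library("magrittr")

# filters mtcars and overwrites mtcars with the filtered result
mtcars %<>% filter(mpg > 15)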

Data Collection

Twitter Access

  1. To retrieve data from Twitter you first have to register an account on twitter.com.

  2. You need to create an application under apps.twitter.com. The name, description, and website specifications do not matter, as you are not creating an actual application for other Twitter users. Feel free to fill in these fields creatively.

  3. You have to specify http://127.0.0.1:1410 as your callback URL to ensure that your access runs smoothly with the R packages.

  4. Save your app’s credentials in R. They encompass:
  • Consumer Key
  • Consumer Secret
  • Access Token
  • Access Secret

You’ll find the access token and the access secret on the same page, a bit further down.

consumer.key <- "CONSUMER KEY"
consumer.secret <- "CONSUMER SECRET"
access.token <- "ACCESS TOKEN"
access.secret <- "ACCESS SECRET"

Though possible, it is not recommended to use two applications at the same time, e.g. accessing the Twitter API for two different research projects simultaneously through two applications. You may find yourself running into the rate limits. In the worst case, Twitter bans your account!

rtweet

The package implements a very simple way to connect its functions with Twitter’s APIs. You only have to call create_token() with your application name, consumer key and consumer secret as inputs. Once you’ve created the token, you are ready to use the functions that come with the package.

# defines application name
application_name <- "Name of Application"

# creates token for Twitter API
rtweet_token <- create_token(app = application_name, consumer_key = consumer.key, 
    consumer_secret = consumer.secret)

The authentication was successful if a tab opens in your browser saying “Authentication complete. Please close this page and return to R”.

The author of the package explains a way to ‘preserve’ the generated token for future R sessions. We, however, stay with the simpler solution shown above.
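
If you nevertheless want to reuse a token across sessions, one simple option (a sketch, not necessarily the method described in the rtweet documentation) is to save the token object to disk and read it back in later:

# saves the token object to a file in your working directory
saveRDS(rtweet_token, file = "rtweet_token.rds")

# in a later R session: reads the token back in
rtweet_token <- readRDS("rtweet_token.rds")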

Exercise 1

  1. If you have no Twitter account yet, create one.
  2. Register one application on Twitter with the previously mentioned callback URL.
  3. Create a Twitter token in R for the rtweet package.

Obtaining user specific data

There are various kinds of data you can collect about users. Twitter roughly divides them into a user’s attributes, like the number of followers and friends, the date when the account was created, geo location etc., and lists, like which users actually follow them and whom the user follows. The documentation of the REST API lists the data you can get.

Favorites

If you want to analyze how politicians communicate, it might be wise to know whose tweets they favor. The function get_favorites() of rtweet allows you to obtain up to 3,000 favorites of the specified user.

# assigns Trump's account to the variable
politician <- "realDonaldTrump"

# requests the last 1000 favorites of Trump
favorites <- get_favorites(user = politician, n = 1000)

# displays the column screen_name of the first five favorites
head(favorites[, "screen_name"])  # displays the screen_name of the first few favorites
## [1] "realDonaldTrump" "realDonaldTrump" "realDonaldTrump" "sheba418"       
## [5] "foxandfriends"   "IvankaTrump"
# counts how many times each screen_name was favored by Trump
table(favorites[, "screen_name"])
## 
##  DonaldJTrumpJr          FLOTUS   foxandfriends     IvankaTrump 
##               2               2               1               1 
##      mike_pence realDonaldTrump        Ross_7_7        sheba418 
##               1               3               1               1 
## StrongChestwell 
##               2

Donald Trump rarely favors any tweets. Is this evidence that politicians tend to broadcast rather than interact on Twitter?

# display the date of Trump's favorites
favorites[, "created_at"]
##  [1] "2017-10-08 14:09:00 UTC" "2017-10-08 13:59:58 UTC"
##  [3] "2017-09-19 22:04:11 UTC" "2017-09-01 13:15:35 UTC"
##  [5] "2017-08-14 09:58:51 UTC" "2017-05-24 12:06:47 UTC"
##  [7] "2017-05-21 09:00:15 UTC" "2017-04-14 19:36:17 UTC"
##  [9] "2016-11-08 23:14:27 UTC" "2016-10-10 02:45:08 UTC"
## [11] "2016-04-19 13:28:26 UTC" "2013-08-24 02:46:09 UTC"
## [13] "2013-03-02 04:27:26 UTC" "2013-03-02 04:27:26 UTC"

Followers

Now assume we want to find out who follows Trump’s account, as followers are likely to be sympathetic towards him and we want to know who is in favor of him. The function get_followers() returns the followers of the specified user.

# tries to retrieve all followers of the politician
followers <- get_followers(politician, n = "all")

# shows us the first observations of the followers
head(followers[, 1])
## [1] "917747309881982976" "917746118548705280" "917747294639939584"
## [4] "917746766635847680" "2849434548"         "917734518525591552"

Note: The output only reveals the user_id. We would have to request more information from the Twitter API if we want to know more about these followers.
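
As a small sketch of how that could look, you can pass a few of the returned ids to lookup_users() (keep the rate limits in mind before looking up large numbers of users):

# looks up the profiles behind the first five follower ids
followers.info <- lookup_users(head(followers$user_id, 5))

# displays their screen names and follower counts
followers.info %>% select(screen_name, followers_count)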

Something seems wrong however, if we check the number of followers that were returned:

# informs about how long the list of followers is, i.e. the number of
# followers
length(followers$user_id)
## [1] 75000

We have only received 75,000 followers, but Trump has millions of them! The API rate limits dictate that we are not allowed to request more than this number of followers every 15 minutes. But of course we can work around this barrier.

# instructs R to pause 15 minutes
Sys.sleep(15 * 60)

# determines the last follower before we hit the rate limit
page <- next_cursor(followers)

# gets second round of followers
followers.p2 <- get_followers(politician, n = "all", page = page)

# binds the dataframes together
followers <- rbind(followers, followers.p2)

# verifies that the new 75,000 are not equal to the first 75,000 followers
unique(followers$user_id) %>% length
## [1] 149997

This approach however is not very practical for two reasons:

  1. We do not want to do this by hand for accounts that have hundreds of thousands of followers or more.
  2. We do not know yet how many followers Trump actually has.

Luckily rtweet offers the function lookup_users(), which delivers some attributes of the specified users.

# requests the available attributes for the politician
lookedUp <- lookup_users(politician)

# displays us the available attributes
names(lookedUp)[1:18]
##  [1] "user_id"              "name"                 "screen_name"         
##  [4] "location"             "description"          "protected"           
##  [7] "followers_count"      "friends_count"        "listed_count"        
## [10] "created_at"           "favourites_count"     "utc_offset"          
## [13] "time_zone"            "geo_enabled"          "verified"            
## [16] "statuses_count"       "lang"                 "contributors_enabled"

A more general solution can look like the following custom function. The function is only an example and not perfect yet, as you might experience:

# the function returns all followers of the specified users input: users
# whose followers you want output: list with a dataframe per user
# parameters: none

iterateFollowers <- function(users) {
    # executes the following on each user stated
    lapply(1:length(users), function(x) {
        # assigns the user name by position
        user <- users[x]
        # informs the R practitioner
        message(paste("Finds followers for", user))
        # requests the user's attributes
        info <- lookup_users(user)
        # determines how many followers the user has
        n.followers <- info[["followers_count"]]
        # calculates how many batches are needed to retrieve every follower
        n.batches <- ceiling(n.followers/75000)
        # different procedure for users with more than 75,000 followers
        if (n.batches > 1) {
            # requests the first 75,000 followers
            followers <- get_followers(user, n = "all")
            # informs the R practitioner
            message("Pause")
            # 'sleeps' for 15 minutes
            Sys.sleep(15 * 60)
            # executes more or less the same in a loop
            for (i in 2:n.batches) {
                message("Extracts batch ", i)
                # saves the last follower of the previous batch
                page <- next_cursor(followers)
                # requests the next 75,000 or fewer followers
                batch <- get_followers(user, n = "all", page = page)
                # binds the new batch with the followers collected so far
                followers <- rbind(followers, batch)
                # pauses unless it's the very last batch of the last user
                if (x != length(users) | i != n.batches) {
                  Sys.sleep(15 * 60)
                  message("Pause")
                }
            }
            # simpler procedure if the user has fewer than 75,000 followers
        } else {
            followers <- get_followers(user, n = "all")
        }
        return(followers)
    })
}

Friends

The function get_friends() works pretty much the same, except that it tells you whom the specified user follows. These users are referred to as friends in Twitter’s architecture.

# requests the friends of the politician
friends <- get_friends(politician)

# displays the user ids of the first five friends
head(friends)

Timeline

Assume now that we’re interested in every tweet Trump shared with the world. For this purpose rtweet provides the function get_timeline().

# requests the last 5000 tweets of Donald Trump
tweets <- get_timeline(politician, n = 5000)

tweets %>% select("created_at", "text", "retweet_count", "favorite_count") %>% 
    # takes a random sample
sample_n(10)

Again we run into the API restrictions imposed by Twitter. We can’t access more than 3,200 tweets per user through the REST API.

nrow(tweets)
## [1] 3223

There are ways to work around this restriction. The easiest one is to capture the tweets of users as they are published, though this doesn’t let you access tweets from the past. The other way is web scraping, which is not as easy.

Let’s find out which tweet has the most retweets:

# piping: the operator %>% pipes the output of the previous expression to
# the following expression

# selects the column retweet_count of the data frame 'tweets'
tweets[, "retweet_count"] %>% which.max %>% tweets[., ] %>% select(created_at, 
    text, retweet_count)

The video shows Trump pretending to hit a man whose face has been replaced by the CNN logo. In case you don’t know it yet, here you go.


Not every Trump tweet has this style, as his tweet with the most favorites shows:

# selects the column favorite_count
tweets[, "favorite_count"] %>% # determines the observation, i.e. the row number, with the most favorites
which.max %>% # selects the whole row of this observation by piping the row number
tweets[., ] %>% # displays only the interesting columns
select(created_at, text, favorite_count)

Exercise 2

  1. Find a German politician that has more than 75,000 followers.
  2. Is he favoring any tweets? Whose?
  3. How many users follow him?
  4. Extract at least 75,001 followers with a single function.
  5. Catch the politician’s last 100 tweets and assign them to the variable tweets.

Collect Tweets based on keywords

There are two ways to obtain tweets that either match certain keywords or mention specific users without having to know the authors of those tweets beforehand. One of them is using the search feature of the REST API, the other is filtering the Streaming API for these keywords or users. rtweet provides the function search_tweets() for the first and stream_tweets() for the second way.

A quick recall: Twitter does not impose any limit on the number of tweets you can get per se. But the REST API does not return tweets older than a week, and the Streaming API limits the amount of tweets to 1% of all tweets at any moment. To give you a figure of how many tweets you can capture per day: the daily volume is approximately 230 million tweets.

Search Engine

Let’s search for the last 100 tweets about the German election:

# searches for tweets that contain the hashtag #btw17
tweets.btw17 <- search_tweets(q = "#btw17", n = 100, type = "recent")

# displays only the text of those tweets
tweets.btw17 %>% select(text) %>% # random sample
sample_n(size = 10)

You can ask Twitter’s REST API for different kinds of search results. search_tweets() by default returns the most recent tweets, but you can also request mixed search results or the most popular ones.
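
For example, a mixed search, which blends recent and popular results, only requires changing the type parameter (a quick sketch):

# searches for a mix of recent and popular tweets containing #btw17
tweets.mixed <- search_tweets(q = "#btw17", n = 10, type = "mixed")

tweets.mixed %>% select(text)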

Let’s find the most popular tweets related to the past election using the hashtag #btw17.

# searches for the most popular tweets containing the keywords in the last 7
# (!) days
tweets.popular <- search_tweets(q = "#btw17", n = 10, type = "popular")

# only display certain columns of the popular tweets
tweets.popular %>% select(screen_name, text, retweet_count, favorite_count) %>% 
    # random sample
sample_n(size = 10)

We can use the common Boolean operators AND and OR to narrow the query.

# searches for tweets that contain the hashtag #btw17 and the keyword 'Petry'
tweets.btw17_Petry <- search_tweets(q = "#btw17 AND Petry", n = 100, type = "recent")

tweets.btw17_Petry %>% select(text)
# searches for tweets that contain the keyword 'Gauland' or 'Höcke' (or
# both)
tweets.btw17_GauHoeck <- search_tweets(q = "Gauland OR Höcke", n = 100, type = "recent")

tweets.btw17_GauHoeck %>% select(text)

The function accepts more or less all parameters that Twitter’s REST API offers, e.g. filtering for one language or only for tweets that are not retweets.

Who dared to tweet about our chancellor Merkel in English recently? And who referred to the famous SPIEGEL lately?

# captures the last 10 tweets about Merkel written in English
tweets_english <- search_tweets(q = "Merkel", n = 10, lang = "en")

tweets_english %>% select(text)
# returns the last 10 tweets about Spiegel that are not retweets
tweets_noRT <- search_tweets(q = "Spiegel", n = 10, include_rts = FALSE)

tweets_noRT %>% select(text)

Access Twitter’s Stream

If you plan to capture Twitter data related to a (huge) political event, the Streaming API will likely be your best choice. Thousands of users express their opinion about on-going politics on Twitter every day, but you don’t know who is going to unleash their anger about Trump’s impeachment and who will shoot off fireworks. Using the Streaming API lets you collect data instantly based on keywords. After some time you can easily end up with millions of tweets, but that’s not a problem with the computing power available nowadays. It’s easier to reduce unnecessary data in the data preparation phase later than to try to close data gaps.

Let’s see who is tweeting about Merkel, SPD, Deutschland or more broadly about the last German election. With the parameter timeout of the function stream_tweets() you can specify for how long you want to listen to Twitter’s stream. If you choose FALSE, it streams indefinitely.

# captures tweets containing one of the specified keywords for one minute
tweets.streamed <- stream_tweets(q = "Merkel,Bundestag,SPD,Bundesregierung,Bundestagswahl", 
    timeout = (1 * 60))
# shows only the text of the tweets
tweets.streamed %>% select(text)
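
If you really want to stream without a fixed end point, you can set timeout = FALSE; a sketch you should only run deliberately:

# DO NOT RUN: streams matching tweets until you interrupt R manually
tweets.endless <- stream_tweets(q = "Merkel,Bundestag,SPD", timeout = FALSE)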

You can pass a desired language to the function too.

# captures any tweets in English mentioning Merkel for two minutes
tweets.lang <- stream_tweets(q = "Merkel", timeout = (2 * 60), language = "en")
# displays the tweets
tweets.lang %>% select(text)

Moreover, Twitter permits you to track up to 5,000 user IDs via the Streaming API. In this case you have to call c() and use a separate string for each user ID.

# DO NOT RUN
tweets.users <- stream_tweets(q = c("regierungssprecher", "spiegelonline"), 
    timeout = (24 * 60 * 60))

tweets.users %>% select(text)

Michael W. Kearney, the author of rtweet, has written the convenient function users_data() to retrieve user data from collected tweets. You only need to input the data frame containing the tweets and you’ll get the information available about the tweets’ authors.

users_data(tweets.streamed)
users_data(friends)

Exercise 3

  1. Find the last 100 or fewer tweets about your politician.
  2. Can you find any tweets about him in English?
  3. Now search again, this time for tweets containing the keyword Koalition or Jamaika. Find the 10 most popular ones.
  4. Access the Streaming API for some minutes to collect tweets about your politician. If you can’t find any, search for another politician or the keywords Koalition and Jamaika.
  5. Obtain information about the tweets’ authors.

Rate Limits

A few last words about the (rate) limits of Twitter’s APIs. They are only important for the use of the REST API, as you can’t work around the volume limit of the Streaming API other than buying yourself exclusive access. rtweet includes the function rate_limit() that shows you how much of your volume you’ve used up in the current 15-minute time window. It also shows you when your access volume will be reset. You can access this information and use it to build functions that respect the rate limits.

# displays the current rate limits
rate_limit(token = rtweet_token)
# returns the remaining number of accepted request for the current time
# window
rate_limit(rtweet_token)[1, 3]
## [1] 15

Exercise 3.1

  1. How can you improve my function iterateFollowers by using rate_limit()?

Data Analysis

The number of (social) scientists using data from social networks such as Twitter has increased significantly over the last couple of years. It’s not surprising, therefore, that the number of tools to analyse this data has increased with it as well. But not only the number of available tools has changed; their capacities and the range of analyses they offer “out-of-the-box” are well beyond what was ready to use some years ago. In this chapter, you’ll learn to apply some analyses to Twitter data. First, we look at frequencies of words and create time plots. Afterwards we continue with the more complex task of creating a network with the data and conducting a network analysis.

We will work with real Twitter data that I collected during the 2017 election campaign for the Bundestag.

Data Preparation

The classic R data frame handles large data sets poorly for several reasons. But of course some R enthusiasts have already developed a solution for that problem with the package data.table, which increases performance significantly. For example, fread() is a lot faster than read.csv().

You can’t reproduce the following part for now, but I provide you with a dataset for the exercise.

# loads the package
library(data.table)

# reads in the tweets if the file resides in your home directory
tweets.sample <- fread("tweets_btw17.csv")
## Read 1977422 rows and 16 (of 16) columns from 0.572 GB file in 00:00:17

Descriptive Statistics

The public quite intensely and controversially debated the use of (social) bots for campaigning. In general, it’s interesting to see what’s happening on social networks around election days. In the following, I introduce a few descriptive statistics about a collection of tweets. Keep in mind, however, that the most fitting statistics depend on your research question.

Top Users

topusers <- tweets.sample %>% # groups the tweets by the variable user_name
group_by(user_name) %>% # counts the occurrences of each user_name
count %>% # rearranges the rows by the number of occurrences in descending order
arrange(desc(n))

topusers

Let’s take a closer look at these users. They don’t seem to belong to any major media network. lookup_users() gives us more information about them.

topusers[1:10, "user_name"] %>% lookup_users

Some of their profile background images (not reproduced here) are telling:

Clearly most of their tweets are not written by a human being.

We see a very different picture if we not only look at which users have the most tweets, but combine these figures with the number of followers these accounts have.

# selects the top 5,000 users
mostfollowers <- topusers$user_name[1:5000] %>% # request their attributes
lookup_users() %>% # puts the users with the most followers to the top
arrange(desc(followers_count))

mostfollowers

There is more to discover, as the available variables show:

glimpse(tweets.sample)
## Observations: 1,977,422
## Variables: 16
## $ id               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $ lang             <chr> "de", "de", "de", "de", "de", "de", "de", "de...
## $ polarity         <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000...
## $ created          <chr> "2017-08-24T18:11:35", "2017-08-24T18:11:36",...
## $ text             <chr> "@CSU Licht: Wird es nicht\nSchatten: Das nae...
## $ user_description <chr> "Dualitaet praegt das Leben. Gut/ Schlecht. Y...
## $ user_followers   <int> 15, 666, 25329, 5, 28, 12494, 62, 259, 2, 159...
## $ user_location    <chr> "", "", "Berlin, Deutschland", "Berlin, Deuts...
## $ coordinates      <chr> "", "", "", "", "", "", "", "", "", "", "", "...
## $ user_bg_color    <chr> "000000", "C0DEED", "ACDED6", "F5F8FA", "C0DE...
## $ id_str           <S3: integer64> 1.812680e-248, 1.812681e-248, 1.812...
## $ subjectivity     <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0...
## $ user_created     <chr> "2017-01-04T03:50:14", "2010-01-07T12:58:09",...
## $ retweet_count    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ geo              <chr> "", "", "", "", "", "", "", "", "", "", "", "...
## $ user_name        <chr> "LichtuSchatt3n", "Gekko125", "Martin_Lejeune...

Number of tweets over time

We can plot the number of tweets per day. There are two distinct peaks: one on the day of the ‘TV-Duell’, the other on election day:

# the compound assignment pipe %<>% uses the object on the left and
# assigns the output back to it
tweets.sample$created %<>% as.POSIXct

# plots the frequency of the tweets by days
ts_plot(tweets.sample, by = "days", dtname = "created")

Word Frequencies

If we want to extract patterns like mentions, we have to use regular expressions. Twitter delivers mentions as attributes via its APIs, but the sample data does not contain this attribute, which is why we have to extract them here ourselves. stringr is a package that eases working with strings.

# most convenient package
library(stringr)

# extracts all mentions in each tweet
sample.mentions <- lapply(tweets.sample$text, function(x) {
    # extracts every string that starts with @
    str_extract_all(string = x, pattern = "(@[^\\s]+)")
}) %>% # keeps the mentions just in one vector
unlist

# now we take a look
sample.mentions %>% sample(40)
##  [1] "@caranoia"          "@CDU"               "@YouTube-Video:"   
##  [4] "@MartinSchulz"      "@rbbabendschau"     "@ardmoma"          
##  [7] "@CarmenKuprat"      "@MSFTMechanics"     "@MartinSchulz"     
## [10] "@BILD"              "@PaulUmstaetter"    "@MartinSchulz"     
## [13] "@AfD"               "@AfD"               "@UdoHemmelgarn"    
## [16] "@coderboypb"        "@hubertus_heil"     "@LesVertsSuisse"   
## [19] "@zeitonline"        "@c_lindner"         "@tgd_att"          
## [22] "@ulfposh"           "@AfD"               "@heuteshow"        
## [25] "@Die_Gruenen"       "@KonstantinKuhle"   "@SPIEGELONLINE"    
## [28] "@CDU"               "@journ_online"      "@Georg_Pazderski):"
## [31] "@MartinSchulz"      "@VP"                "@Fenerinho55"      
## [34] "@welt"              "@waldruhe"          "@Ralf_Stegner"     
## [37] "@JulianRoepcke"     "@Fabian_Junge"      "@HansAlbers6"      
## [40] "@Mica4711"

This vector still contains every mention in any tweet, i.e. duplicates across tweets. Who was mentioned most often in our tweets?

sample.mentions.counted <- factor(sample.mentions) %>% table %>% sort(decreasing = TRUE)

I like to use ggplot2 for plots. plotly is a good choice when you need interactive graphs or use plots in a Shiny application. For time series, dygraphs is highly recommendable.

library(ggplot2)

# converts the vector into a data frame and renames the columns
sample.mentions.counted %<>% as.data.frame %>% rename(., User = ., Frequency = Freq)

# plots the 20 most frequent mentions
ggplot(data = sample.mentions.counted[1:20, ], aes(x = User, y = Frequency, 
    fill = User)) + geom_bar(stat = "identity") + ggtitle("number of mentions") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

# let's put the figures in relation to the total number of tweets
ggplot(data = sample.mentions.counted[1:20, ], aes(x = User, y = (Frequency/nrow(tweets.sample)), 
    fill = User)) + geom_bar(stat = "identity") + ggtitle("Percentage of mentions in relation to total tweets", 
    subtitle = paste("n=", nrow(tweets.sample))) + ylab("Frequency in Percentage") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
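
Since plotly was mentioned above, here is a minimal sketch (assuming the plotly package is installed) of how the first bar chart could be turned into an interactive graph:

library(plotly)

# stores the static ggplot in an object ...
p <- ggplot(data = sample.mentions.counted[1:20, ], aes(x = User, y = Frequency, 
    fill = User)) + geom_bar(stat = "identity") + ggtitle("number of mentions")

# ... and converts it into an interactive plotly graph
ggplotly(p)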


We can find the most used hashtags as well:

# extracts all hashtags in each tweet
sample.hashtags <- lapply(tweets.sample$text, function(x) {
    # lower case to aggregate similar/same hashtags
    tolower(x) %>% # extracts every string that starts with #
    str_extract_all(string = ., pattern = "(#[^\\s]+)")
}) %>% # keeps the hashtags just in one vector
unlist

# now we take a look
sample.hashtags %>% sample(40)
##  [1] "#btw17"                "#schulz:"             
##  [3] "#btw2017"              "#btw2017"             
##  [5] "#le0609"               "#news"                
##  [7] "#btw17"                "#btwahl2017"          
##  [9] "#btw17."               "#zschäpe"             
## [11] "#geschichte"           "#spd"                 
## [13] "#fuer"                 "#thermilindner"       
## [15] "#merkel"               "#grüne"               
## [17] "#afd"                  "#aliceweidelgeruechte"
## [19] "#bundestag"            "#btwgezwitscher"      
## [21] "#dieschmidt"           "#afd-chefin"          
## [23] "#gysi"                 "#noafd"               
## [25] "#linkspartei"          "#afd"                 
## [27] "#btw17"                "#csu"                 
## [29] "#cdu"                  "#afd"                 
## [31] "#spd"                  "#afd"                 
## [33] "#sektchen"             "#wesermarsch…"        
## [35] "#hetze."               "#merkel"              
## [37] "#spd"                  "#gehtwählen"          
## [39] "#wahl2017"             "#oezuguz"

As before we have to count the occurrences across tweets.

# counts the occurrences of each hashtag
sample.hashtags.counted <- factor(sample.hashtags) %>% table %>% sort(decreasing = TRUE)

# reshapes data to comply with plot requirements
sample.hashtags.counted %<>% as.data.frame %>% rename(., User = ., Frequency = Freq)

# plots the hashtags as bar chart

ggplot(data = sample.hashtags.counted[1:20, ], aes(x = User, y = Frequency, 
    fill = User)) + geom_bar(stat = "identity") + ggtitle("occurrences of hashtags") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

There is a bias towards the AfD in proportion to its size. Furthermore, I suspect my data set is biased a bit to the political right. My choice of track keywords is probably responsible for this outcome. On the one hand the AfD was very present in the election campaign, online and offline; on the other hand this example shows how dependent the result of my analysis is on the data I collected.


Text Corpus Analysis

Another possible analysis is to examine the occurrences of ‘regular’ words. In this case, we have to create a text corpus and clean it of stop words, punctuation and so forth. A corpus is an abstract concept that can comprise several types of text document collections. Texts of a common domain typically compose one corpus in order to analyze the use of language in this domain. For instance, the election manifestos of one party across elections can form one corpus. This collection would allow you to analyze how the party’s use of language has changed over time or to examine whether a certain kind of language is connected to electoral success.

library(tm)

# filters the data set to tweets from the official party accounts
tweets.parteien <- filter(tweets.sample, user_name == "spdde" | user_name == 
    "CDU" | user_name == "fdp" | user_name == "Die_Gruenen" | user_name == "dieLinke")

# defines a corpus; we have to specify a language
corpus.parteien <- SimpleCorpus(VectorSource(tweets.parteien$text), control = list(language = "ger"))

corpus.parteien
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3057
# enables to inspect the corpus
inspect(corpus.parteien[1:5])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] Thomas de Maizière @BMI_Bund hat die linksextremistische Internetplattform ""linksuntenindymediaorg"" und den zugehör… https://t.co/8cadG2JAwF
## [2] Damit Wohnen bezahlbar bleibt: Am 24.9. @MartinSchulz und die SPD wählen - für faire Mieten! Mehr:… https://t.co/aGEEyj4SkD                   
## [3] Zu Gast bei der @CSU in Bad Kissingen: Angela #Merkel https://t.co/5kyqeZodEv                                                                 
## [4] Unsere bayerische Schwester! https://t.co/w696nj1tzD                                                                                          
## [5] @SEENOTRETTUNG @CDUdresden @SPD_Dresden @dieLinke @DiePARTEI @fdp @Piratenpartei @_VPartei_ @Tierschutzparte… https://t.co/MbFSOM21tY

Before we can run a text analysis on the corpus, we have to apply some transformations, such as converting all characters to lower case. The package offers the function tm_map() for this purpose. It applies (maps) a function to all elements of a corpus.

# custom function to remove URLs
removeURL <- content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "", 
    x, perl = T))

corpus.parteien %<>% # strips unnecessary whitespace
tm_map(stripWhitespace) %>% # transforms all words to lowercase
tm_map(content_transformer(tolower)) %>% # removes URLs
tm_map(removeURL) %>% # removes German stopwords such as der/die/das etc.
tm_map(removeWords, stopwords("german")) %>% # removes punctuation like '.'
tm_map(removePunctuation)

In the end, we turn the data into a TermDocumentMatrix. It records if and how often a term appears in each document (here, in each tweet).

tdm.parteien <- TermDocumentMatrix(corpus.parteien)
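
A quick look at a small corner of the matrix illustrates its structure (a sketch; the exact terms you see will differ):

# shows how often the first five terms appear in the first five tweets
inspect(tdm.parteien[1:5, 1:5])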

But I finally want to know which words the parties used in their tweets. The term-document matrix enables us to access the most frequent terms easily. We can specify a minimum frequency that words must reach in order to be returned by the function:

# the parameters lowfreq and highfreq define the limits
findFreqTerms(tdm.parteien, lowfreq = 50)
##  [1] "martinschulz"   "mehr"           "spd"            "wählen"        
##  [5] "angela"         "merkel"         "unsere"         "dielinke"      
##  [9] "fdp"            "deutschland"    "fedidwgugl"     "heute"         
## [13] "zeitfürmartin"  "uhr"            "live"           "amp"           
## [17] "familien"       "leben"          "dass"           "land"          
## [21] "dafür"          "afd"            "gute"           "menschen"      
## [25] "geht"           "gibt"           "europa"         "btw17"         
## [29] "denkenwirneu"   "unserer"        "cdu"            "diegruenen"    
## [33] "spdde"          "dietmarbartsch" "swagenknecht"   "cemoezdemir"   
## [37] "beim"           "schulz"         "müssen"         "fakt"          
## [41] "brauchen"       "bildung"        "zukunft"        "wer"           
## [45] "gut"            "linke"          "goeringeckardt" "jahren"        
## [49] "vielen"         "sicherheit"     "tvduell"        "rente"         
## [53] "deshalb"        "wahl2017"       "clindner"       "fünfkampf"     
## [57] "klartext"       "illnerintensiv" "esistzeit"      "wahlarena"     
## [61] "bpt17"          "100hcdu"        "schlussrunde"

The output indicates that the parties were talking a lot about themselves during the election campaign. Maybe it would be more interesting to see how they talk about certain issues like education. The function findAssocs() returns all terms that have at least the specified correlation with our input term:

findAssocs(tdm.parteien, "bildung", 0.1)
## $bildung
##          irgendeine     zukunftspolitik     bildungspolitik 
##                0.31                0.31                0.21 
##           weltbeste      frühkindlichen        bildungsetat 
##                0.20                0.18                0.17 
##               wanka                bund         investieren 
##                0.17                0.16                0.16 
##          verdoppelt               beste             überall 
##                0.15                0.15                0.15 
##                sinn            herkunft       gebührenfreie 
##                0.14                0.13                0.13 
## rekordinvestitionen            aufstieg          grundstein 
##                0.12                0.12                0.12 
##        ausprobieren             johanna            anpacken 
##                0.12                0.12                0.12 
##            schieben         jugendliche              länder 
##                0.12                0.12                0.11 
##             wichtig     digitalisierung       weiterbildung 
##                0.11                0.11                0.11 
##           heikomaas      zeitfuermartin           abgewählt 
##                0.11                0.10                0.10 
##                 nrw                kita 
##                0.10                0.10

Another issue that didn’t get the attention it deserves is Digitalisierung. Which party has a vision of how we’re going to deal with the unique challenges in this domain?

fibre.assocs <- findAssocs(tdm.parteien, "digitalisierung", 0.1)

fibre.assocs$digitalisierung[1:20]
##            chefsache              neugier       staatsminister 
##                 0.32                 0.29                 0.21 
##                 öpnv         berufsbilder zukunftprogrammieren 
##                 0.20                 0.20                 0.14 
##     dekarbonisierung    dezentralisierung            ramonapop 
##                 0.14                 0.14                 0.14 
##            sichtlich                meins            industrie 
##                 0.14                 0.14                 0.14 
##               rahmen           schmalspur          verschlafen 
##                 0.14                 0.14                 0.14 
##           immoniehus  handlungsvorschläge                parat 
##                 0.14                 0.14                 0.14 
##      weltmeisterplan          jochenblind 
##                 0.14                 0.14

My guess is that instead of discussing the urgent issue of digitalization they talked a lot about migration:

findAssocs(tdm.parteien, "migration", 0.1)
## $migration
##              ansatz           illegalen         reduzierung 
##                0.71                0.71                0.71 
##           steuerung            verfolge entwicklungspolitik 
##                0.71                0.71                0.71 
##            handwerk            illegale          schleppern 
##                0.71                0.71                0.71 
##             ordnung       funktionieren               legen 
##                0.50                0.50                0.41 
##             stoppen     sommerinterview 
##                0.29                0.13

Of course, we can reduce the corpus further to only those tweets that were written by a particular party:
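
The object tdm.spd used below is not created in the chunks above; here is a minimal sketch of how it could be built, assuming the same preprocessing steps as for corpus.parteien:

# filters the data set to tweets from the SPD account only
tweets.spd <- filter(tweets.sample, user_name == "spdde")

# builds a corpus from these tweets and applies the same transformations as above
corpus.spd <- SimpleCorpus(VectorSource(tweets.spd$text), control = list(language = "ger")) %>% 
    tm_map(stripWhitespace) %>% tm_map(content_transformer(tolower)) %>% tm_map(removeURL) %>% 
    tm_map(removeWords, stopwords("german")) %>% tm_map(removePunctuation)

# turns the cleaned corpus into a term-document matrix
tdm.spd <- TermDocumentMatrix(corpus.spd)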

assocs.spd <- findAssocs(tdm.spd, "rente", 0.1)

assocs.spd$rente[1:15]
##        schäuble    verlässliche         inhalte        konzepte 
##            0.43            0.43            0.30            0.30 
##         steuern          höhere  rentenbeiträge           redet 
##            0.30            0.30            0.30            0.30 
## handlungsbedarf    schlussrunde        schuldig manuelaschwesig 
##            0.30            0.28            0.24            0.23 
##        beiträge           droht       sinkendes 
##            0.23            0.21            0.21

A few last words on text analysis. You can easily use this approach for any kind of document collection. In fact, the process is no different once you’ve created the corpus.
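
For example, a corpus built from a folder of plain-text documents would be processed in exactly the same way (the folder name 'manifestos' is purely hypothetical):

# reads every plain-text file in the (hypothetical) folder 'manifestos' into a corpus
corpus.manifestos <- VCorpus(DirSource("manifestos", encoding = "UTF-8"), 
    readerControl = list(language = "ger"))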

I think that the package corpus provides some interesting features for text analysis as well that go beyond what tm offers. For instance, you can apply the concept of n-grams to your documents. Instead of only receiving the correlations terms share with other terms, n-grams give you frequently occurring sequences that include your word.

Exercise 4

  1. Load the file tweets_electionday.RData containing the tweets from the election day after 6 pm.
  2. Find out the most popular hashtags and most frequent words and plot your results.
  3. What are the associated terms for the words AfD, CDU, SPD and Bundestagswahl?

Network Analysis

In the social sciences the analysis of networks has become quite popular in recent years. This is partly due to the rise of social network platforms where you can study relations between human beings. The technique itself, however, is quite old and not new to sociologists, who concern themselves with the networks within an entity and between entities of any kind.

Throughout this notebook I consider a network to map social relations between N entities (nodes), where E edges display the relations. Some scientists also use the term vertex (Eckpunkt) instead of node. Not every node shares an edge with all other nodes. To transfer this to Twitter: who follows whom or who mentions whom can form a social network.

On Twitter you can distinguish between verified and non-verified users. The first group, for example, does not see every tweet in which it is mentioned. One could consider communication among verified users a partly separated public sphere, though other users can observe this communication. To my mind, a ‘grouped’ network seems to be an interesting subject, with verified users belonging to one group and regular users to the second group. It would be interesting to see how communication flows, or does not flow, between these two groups.

Selection of entities

I’ve decided to include every user from our example data set that has at least 10,000 followers as an entity in my network. This decision leaves us with 234 vertices. This was a somewhat arbitrary decision, but it delivers entities from various fields. To put it fairly vaguely, how you select the users to include depends on the broader framework in which your network of interest is embedded.

# keeps only those users with at least 10,000 followers
influencers <- mostfollowers %>% filter(followers_count >= 10000)

# displays a random sample of 50 of those users' screen names
sample(influencers[, "screen_name"], 50)
##  [1] "MediterrNewsNet" "fdp_nrw"         "ismail_kupeli"  
##  [4] "BlnTageszeitung" "dushanwegner"    "WELT_Politik"   
##  [7] "SPIEGEL_24"      "Ralf_Stegner"    "MEEDIA"         
## [10] "PortalAlemania"  "gaborhalasz1"    "BWBreaking"     
## [13] "tourismusvideo"  "OZlive"          "rbb24"          
## [16] "peter_simone"    "aktuelle_stunde" "Beatrix_vStorch"
## [19] "rbbinforadio"    "ntvde"           "tazgezwitscher" 
## [22] "MGrosseBroemer"  "ulfposh"         "WDR"            
## [25] "taz_news"        "WAZ_Redaktion"   "LisaL80"        
## [28] "SVZonline"       "jungewelt"       "AZ_Augsburg"    
## [31] "focusonline"     "FraukePetry"     "focuspolitik"   
## [34] "derfreitag"      "aktenzeichenyx"  "bpb_de"         
## [37] "NZZ"             "hessenschauDE"   "annalist"       
## [40] "Endzeitkind"     "handelsblatt"    "RenateKuenast"  
## [43] "DerSPIEGEL"      "welt"            "SPIEGEL_Politik"
## [46] "frielingbailey"  "natsocialist"    "1LIVE"          
## [49] "niggi"           "inzamaus"

Create a network with igraph

With the package igraph one can directly create and visualize networks in R. Let me show you the basic features of a network with igraph:

library(igraph)

# creates a network with three vertices connected by undirected edges; the
# first two values of the vector give the ends of the first edge and so
# on...
graph1 <- graph(edges = c(1, 2, 2, 3, 3, 1), directed = FALSE)

plot(graph1)

If we call the graph directly, we get information about the graph’s structure:

graph1
## IGRAPH 843d5ac U--- 3 3 -- 
## + edges from 843d5ac:
## [1] 1--2 2--3 1--3

Of course, igraphs can have directed edges too:

graph2 <- graph(c(3, 5, 4, 3, 2, 8, 6, 3), directed = TRUE, n = 8)

plot(graph2)

Using names for the vertices might be more meaningful:

graph3 <- graph(c("Felix", "Sarah", "Sarah", "John", "John", "Stella", "Stella", 
    "Felix"))

plot(graph3)

Access edge and vertex attributes

To obtain information about your network’s edges or vertices:

E(graph3)
## + 4/4 edges from faa7dbe (vertex names):
## [1] Felix ->Sarah  Sarah ->John   John  ->Stella Stella->Felix
V(graph3)
## + 4/4 vertices, named, from faa7dbe:
## [1] Felix  Sarah  John   Stella
# to examine the network matrix directly
graph3[]
## 4 x 4 sparse Matrix of class "dgCMatrix"
##        Felix Sarah John Stella
## Felix      .     1    .      .
## Sarah      .     .    1      .
## John       .     .    .      1
## Stella     1     .    .      .

Add attributes to the network

igraph also enables you to add attributes to your vertices, e.g. gender, or to your edges, e.g. the type of relation:

V(graph3)$gender <- c("male", "female", "male", "female")

E(graph3)$type <- "mention"

E(graph3)$weight <- c(1, 2, 2, 1)

edge_attr(graph3)
## $type
## [1] "mention" "mention" "mention" "mention"
## 
## $weight
## [1] 1 2 2 1
vertex_attr(graph3)
## $name
## [1] "Felix"  "Sarah"  "John"   "Stella"
## 
## $gender
## [1] "male"   "female" "male"   "female"

We use the attribute gender to color the nodes in the plot:

plot(graph3, edge.arrow.size = 0.5, vertex.label.color = "black", vertex.label.dist = 1.5, 
    vertex.color = c("skyblue", "pink")[1 + (V(graph3)$gender == "male")])

There are a lot more options to control the shape of your network or the layout of your plot. I’ll show them to you later on, when they are necessary to obtain good graphs.

Use Twitter data for a network analysis

To give you an idea of how we have to reshape Twitter data for a network analysis, I’ll first show you which data (format) we need in the end. After all transformations we need two data tables with the following structure:

# first data table
nodes <- data.table(id = 1:5, user_name = c("SPIEGELONLINE", "SPDDE", "Alice_Weidel", 
    "WWF_Deutschland", "heuteshow"), type = c("media", "party", "politician", 
    "ngo", "media"), followers = c(2368437, 329874, 18244, 391447, 319055))

# second data table
edges <- data.table(from = as.character(c(1, 4, 5, 3, 5, 1)), to = as.character(c(3, 
    1, 3, 2, 4, 3)), weight = 1, type = "mention")

glimpse(nodes)
## Observations: 5
## Variables: 4
## $ id        <int> 1, 2, 3, 4, 5
## $ user_name <chr> "SPIEGELONLINE", "SPDDE", "Alice_Weidel", "WWF_Deuts...
## $ type      <chr> "media", "party", "politician", "ngo", "media"
## $ followers <dbl> 2368437, 329874, 18244, 391447, 319055
glimpse(edges)
## Observations: 6
## Variables: 4
## $ from   <chr> "1", "4", "5", "3", "5", "1"
## $ to     <chr> "3", "1", "3", "2", "4", "3"
## $ weight <dbl> 1, 1, 1, 1, 1, 1
## $ type   <chr> "mention", "mention", "mention", "mention", "mention", ...

To create a graph with the two data tables, one simply uses the function graph_from_data_frame():

twitter.graph <- graph_from_data_frame(d = edges, directed = TRUE, vertices = nodes)

# plots the graph with the vertex's user name as label
plot(twitter.graph, vertex.label = nodes$user_name)


Now we can think about how to get two data tables of this shape out of our data. First, we create a data table with the selected users:

nodes2 <- data.table(id = as.character(1:nrow(influencers)), user_name = influencers$screen_name, 
    type = NA, followers = influencers$followers_count)

head(nodes2)
tail(nodes2)

Unfortunately, assigning the correct type is manual work. To be transparent: for the sake of clarity I’ve hidden the code chunk that creates a vector named types containing the types.

But the vector exists, has the correct length and gives us thirteen types of nodes:

# checks data integrity and shows the distribution of types
length(types)
## [1] 234
head(types)
## [1] "media" "media" "media" "media" "media" "media"
head(nodes2)
table(types)
## types
##        artist       company    government    influencer    journalist 
##             2             5             1             2            21 
##         media miscellaneous       newsbot           ngo         party 
##           109            10             6             4            15 
##    politician        privat     scientist 
##            22            34             3

Almost half of the accounts belong to the media. Maybe the number of followers alone is not a good criterion for choosing entities.

Create edges

To create edges out of Twitter data, I would like to find out which users follow other users and use this information to create the edges. Mentions may be used to create edges as well, but we keep that for later. Our tweets do not contain the user_id, which is why I proceed as follows:

# returns the ids and more information about the users
info.nodes <- lookup_users(nodes2$user_name)

# creates a vector with the user ids
userid.nodes <- info.nodes$user_id

# determines which of the network users follow another network user
dt.list <- lapply(1:nrow(nodes2), function(single.id) {
    print(paste(single.id, "of", nrow(nodes2)))
    # retrieves the friend list
    users.friends <- get_friends(nodes2[["user_name"]][single.id])
    # looks for matches of network users in this user's friends list and
    # saves the respective position(s)
    vertices.ids <- which(userid.nodes %in% users.friends$user_id)
    # returns nothing, if there is no single match
    no.match <- if (all(!(userid.nodes %in% users.friends$user_id))) 
        return(NULL)
    # otherwise it tabulates the relation in our desired format
    users.edges <- data.table(from = as.character(single.id), to = as.character(vertices.ids), 
        weight = 1, type = "follows")
    # tests whether single.id is divisible by 15 to respect the rate limit
    if (single.id%%15 == 0) {
        print("pauses")
        Sys.sleep(15 * 60)
    }
    return(users.edges)
})

# merges the data tables of all list elements
edges2 <- rbindlist(dt.list)

With these data tables at hand we can construct a graph:

# constructs the graph from both data tables
graph4 <- graph_from_data_frame(d = edges2, directed = TRUE, vertices = nodes2)

# removes vertices with at most one edge
graph4 <- delete.vertices(graph4, which(degree(graph4) <= 1))

# determines the types that exist
types.unique <- V(graph4)$type %>% unique
# let's color the nodes with respect to their type
colrs <- sample(colors(), length(types.unique))

# adds the attribute color to each node
for (i in 1:length(types.unique)) {
    ind <- which(V(graph4)$type == types.unique[i])
    V(graph4)$color[ind] <- colrs[i]
}

plot(graph4, layout = layout.auto, vertex.label = vertex_attr(graph4, "user_name"), 
    vertex.size = 4, edge.arrow.size = 0.1, vertex.label.dist = 0.6, vertex.label.cex = 0.7, 
    vertex.color = V(graph4)$color)

# creates a subgraph by taking a random sample of 30 nodes
graph4.sub <- induced_subgraph(graph4, sample.int(229, 30))

plot(graph4.sub, layout = layout.auto, vertex.label = vertex_attr(graph4.sub, 
    "user_name"), vertex.size = 4, edge.arrow.size = 0.1, vertex.label.dist = 0.6, 
    vertex.label.cex = 0.7, vertex.color = V(graph4.sub)$color)

As you can see, the visualization of such huge networks, where many nodes share edges, can be quite messy. It needs a lot of fine-tuning if you want to achieve good-looking graphs. Other programs like Gephi may be better suited for that task.

I can refine the plot by, e.g., using the number of followers for the vertex size or computing centrality measures such as degree() and authority_score().

We use the number of followers for the vertex size:

# since the range of follower counts is very wide, taking the log is recommended
logs <- log(V(graph4.sub)$followers)

# normalizes the logged values to a range between 0 and 1
V(graph4.sub)$vertex.size <- (logs - min(logs))/(max(logs) - min(logs))

plot(graph4.sub, layout = layout.auto, vertex.label = vertex_attr(graph4.sub, 
    "user_name"), vertex.size = V(graph4.sub)$vertex.size * 4, edge.arrow.size = 0.1, 
    vertex.label.dist = 0.6, vertex.label.cex = 0.5, vertex.color = V(graph4.sub)$color)

On Twitter communication usually happens directly between users. As long as a Twitter account is public, everybody can follow this account; there is no structural barrier. Therefore, centrality measures (e.g. betweenness()) that assume nodes to be a kind of gatekeeper may not be applicable for this example, where the decision to follow someone forms the edges. Secondly, through retweets the original tweets may flow from one user to another user who are not directly connected, but it is disputable whether this possibility is reason enough to give meaning to users that sit in between in the network. In contrast to that, we can use the degree() and authority_score() functions here, which give weight to the nodes that are followed a lot:

# basically uses followers and friends within the network to determine the
# nodes' centrality
centrality.tot <- degree(graph4, mode = "total")
# only uses a node's followers to determine its centrality
centrality.in <- degree(graph4, mode = "in")


# returns the nodes with the highest in-degree centrality
data.frame(vertex.attributes(graph4)$user_name, centrality.in, centrality.tot) %>% 
    arrange(desc(centrality.in))
auth.score <- authority_score(graph4, weights = NA)$vector

data.frame(vertex.attributes(graph4)$user_name, auth.score) %>% arrange(desc(auth.score))
plot(graph4, layout = layout.auto, vertex.label = vertex_attr(graph4, "user_name"), 
    vertex.size = auth.score * 10, edge.arrow.size = 0.1, vertex.label.dist = 0.6, 
    vertex.label.cex = 0.5, vertex.color = V(graph4)$color, main = "authorities")

There are more options to refine your network analysis. Check out the documentation of igraph for information on additional distance and path functions, how to create subgroups and communities, and how to detect assortativity and homophily based on attributes.
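
To give you a first taste, here is a short sketch (the results will of course depend on your data) of a community detection and an assortativity measure based on the node attribute type:

# detects communities based on short random walks through the network
communities <- cluster_walktrap(graph4)
# shows which community each node was assigned to
membership(communities)

# measures whether nodes tend to connect to nodes of the same type
assortativity_nominal(graph4, types = as.integer(as.factor(V(graph4)$type)), directed = TRUE)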

Mentions to create edges

We turn to mentions to create edges for the closing part of this section on network analysis. To that end, we have to recreate the nodes data table too, since we can’t know a priori which users are important and which are not. Hence, we first determine how many unique users we have in our data set of almost two million tweets:

# determines how many unique users exist in the dataset
unique.users <- tweets.sample$user_name %>% unique

length(unique.users)
## [1] 199844

With the vector containing the unique users we can create the familiar data table storing them as nodes. Given the number of nodes, I’m not going to label them with categories this time. At least not at this point.

# creates a node table with sequential ids and the necessary columns
nodes3 <- data.table(id = 1:length(unique.users), user_name = unique.users, 
    type = NA, followers = 0)

# assigns the number of followers to each node
nodes3$followers <- sapply(1:nrow(nodes3), function(pos) {
    # finds the tweets that were published by this user (node)
    allmatches <- which(tweets.sample$user_name == nodes3[["user_name"]][pos])
    # takes the last match
    lastpos <- allmatches[length(allmatches)]
    print(paste("completed", pos, "of", nrow(nodes3)))
    # returns the number of followers that user had at the time of his last
    # tweet
    tweets.sample[["user_followers"]][lastpos]
})

With almost two million tweets it takes quite a while to extract every single mention:

# determines all mentions in one tweet
edges3 <- lapply(1:nrow(tweets.sample), function(tweet.no) {
    # informs about the progress
    message(paste("processing", tweet.no, "of", nrow(tweets.sample)))
    # extracts all mentions
    mentions <- str_extract_all(tweets.sample[["text"]][tweet.no], "(@[a-zA-Z0-9_]+)")[[1]] %>% 
        str_replace(., "@", "")
    # identifies if mentions are one of our nodes
    mentions.pos <- which(nodes3$user_name %in% mentions)
    # returns nothing if there are no matches
    no.match <- if (length(mentions.pos) == 0) 
        return(NULL)
    # records the author of the tweet
    author <- which(nodes3$user_name == tweets.sample$user_name[tweet.no])
    # puts the data into a data table with the desired format
    dt.list <- data.table(from = author, to = mentions.pos, weight = 1, type = "mentions")
    print(dt.list)
    return(dt.list)
})

# binds all data tables into one
edges3 <- rbindlist(edges3)

Now we can again create the network graph:

# same procedures as usual
graph5 <- graph_from_data_frame(d = edges3, directed = TRUE, vertices = nodes3)

Either we reduce the network’s number of nodes or we focus on numeric measures like the authority score that informs us about the network’s structure:

centrality.score.tot <- degree(graph5, mode = "total")
centrality.score.in <- degree(graph5, mode = "in")


# returns the nodes with the highest in-degree centrality
data.frame(vertex.attributes(graph5)$user_name, centrality.score.in, centrality.score.tot) %>% 
    arrange(desc(centrality.score.in))
auth.score.g5 <- authority_score(graph5)$vector

# assigns the authority score as attribute to the graph
vertex.attributes(graph5)$authority <- auth.score.g5

# returns the nodes with the highest authority score
data.frame(vertex.attributes(graph5)$user_name, auth.score.g5) %>% arrange(desc(auth.score.g5))

Either I should not trust the second set of results, or they are additional evidence that my data set is biased towards the political right (or both).

Let’s see if we can create a nice plot by including only vertices with an in-degree centrality equal to or higher than 1000:
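
One way to do this, sketched below with induced_subgraph and assuming graph5 and centrality.score.in from above (the threshold and the plotting parameters are just one possible choice):

# keeps only the vertices that are mentioned at least 1000 times
keep <- which(centrality.score.in >= 1000)
graph5.reduced <- induced_subgraph(graph5, keep)

# plots the reduced mentions network
plot(graph5.reduced, layout = layout.auto, vertex.label = vertex_attr(graph5.reduced, "user_name"), 
    vertex.size = 3, edge.arrow.size = 0.1, vertex.label.dist = 0.6, vertex.label.cex = 0.5, 
    main = "mentions with in-degree >= 1000")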

The package graphTweets encompasses functions that allow you to create edges from data frames out of the box.

Exercise 5

  1. Create a network with igraph using the data set containing tweets from the election day after 6pm. Use attributes for your nodes that exist in your data. Include follows as the edge type; mentions would take too long.
  2. Plot your network and use your attributes to draw a meaningful graph.
  3. Compute the centrality once for the mode in-degree and once for total degree. Rank the nodes in descending order and compare your two results.
  4. Determine the most significant authorities in your network.

Automation

The last part of the workshop is about how to automate the collection of tweets. In this respect, we have to distinguish between three cases that require different approaches:

  1. The first case is that you want to periodically capture future tweets of a user. In that case you can write a script that you execute either manually or, better, automatically, and that saves the tweets.

  2. The second case is that you want to capture future (!) tweets that contain certain keywords and are not related to a restricted group of users. In that case you want a script that constantly listens to Twitter’s Streaming API, i.e. runs 24/7, and saves the tweets in a database/file for you.

  3. The third case is that you would like to retrieve historic tweets that neither the Rest API nor the Streaming API will deliver to you due to built-in limitations. In that case you need a setup similar to the one that Jefferson Henrique proposes with his project GetOldTweets-python. It mimics the browsing behavior of a human being and harvests the data that is delivered to your browser. I describe in the respective section how this works in detail.

Case 1: Rest API

Let’s say I have two politicians whose communication on Twitter I would like to store. For this aim the Streaming API would be overkill, and the data would probably be messy, e.g. it would contain tweets from other users mentioning one of them. Regularly retrieving new tweets of both politicians via the Rest API is more appropriate here, even more so when you can plan ahead and want to cover a certain time period, e.g. an election campaign. A solution in R can look like the following:

# loads the libraries
library(rtweet)
library(data.table)

# preamble: where you maybe want to set a working directory ...

# vector with the usernames
usernames <- c("katjakipping", "peteraltmaier")

# checks if some tweets were already collected and loads them
try(load("politiciantweets.RData"), TRUE)

# first time
if (!exists("w")) {
    for (i in 1:length(usernames)) {
        # for the very first timeline
        if (i == 1) {
            # gets the tweets and creates the data.table only once
            tweets <- data.table(get_timeline(usernames[i]))
        } else {
            # gets the timeline and binds them to existing data.table
            tweets <- rbind(tweets, get_timeline(usernames[i]))
        }
        # sets status_id as the key column
        setkey(tweets, status_id)
        
        # counter
        w <- 1
        save("tweets", "w", file = "politiciantweets.RData")
    }
} else {
    for (i in 1:length(usernames)) {
        tweets <- rbind(tweets, get_timeline(usernames[i]))
    }
    # removes duplicates based on the column status_id
    tweets <- unique(tweets, by = "status_id")
    w <- w + 1
    save("tweets", "w", file = "politiciantweets.RData")
}

Either use cronR (Linux/Unix) or taskscheduleR (Windows) to schedule your R script:
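
As a rough sketch with cronR (the script path and the schedule are placeholders you would adapt to your own setup):

# loads the scheduling package (Linux/Unix)
library(cronR)

# builds the command that executes the collection script (placeholder path)
cmd <- cron_rscript("/home/user/collect_politician_tweets.R")

# registers a cron job that runs the script once a day at 3 am
cron_add(cmd, frequency = "daily", at = "03:00", id = "politician_tweets", 
    description = "collects new tweets via the Rest API")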

Case 2: Streaming API

Kudos to Vik Paruchuri, who has written this beautiful collection of Python scripts to listen to Twitter’s Streaming API. The advantage of his approach is that he uses SQLite, a lightweight embedded relational database, to save the tweets. Twitter breaks the connection if your setup does not process the incoming tweets fast enough. SQLite is fast, so this should not happen as long as your computer has enough power. I’ve modified his code here and there according to my needs.

To ensure that the setup works, you need to install the modules listed in requirements.txt:

# the command assumes that the directory containing the file is the working directory
# maybe with sudo
pip install -r requirements.txt

In essence, you only need to put your credentials in private.py and specify your desired keywords in settings.py, where you can also set further restrictions like language or geolocation. For the past election I used a Linux server where I made sure that the execution of scraper.py is monitored and restarted automatically if either the connection breaks or the server has to reboot for any reason.
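
One simple way to keep scraper.py running is a small wrapper script like the sketch below; a proper process supervisor such as systemd or supervisord is the more robust choice:

# restarts the scraper whenever it exits, e.g. after a dropped connection
while true; do
    python scraper.py
    echo "scraper stopped, restarting in 10 seconds"
    sleep 10
done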

requirements.txt

tweepy
ipython
matplotlib
scipy
numpy
pandas
dataset
psycopg2

private.py

consumer_key="YOUR CONSUMER KEY"
consumer_secret="YOUR CONSUMER SECRET"

access_token="YOUR ACCESS TOKEN"
access_token_secret="YOUR ACCESS TOKEN SECRET"

# connection string for the sqlite database (tweets.db)
CONNECTION_STRING = "sqlite:///tweets.db"

settings.py

# -*- coding: utf-8 -*-

# list of terms to track
TRACK_TERMS = ["#btw17"]

# list of languages the stream is filtered by (scraper.py expects this;
# "de" is an example value that restricts the stream to German tweets)
TRACK_LANGUAGES = ["de"]

# default value; overridden by the import from private.py below
CONNECTION_STRING = ""

# name of the csv file, when you dump the data
CSV_NAME = "tweets_btw17.csv"
TABLE_NAME = "btw17"

try:
    from private import *
except Exception:
    pass

scraper.py

# -*- coding: utf-8 -*-

import settings
import tweepy
import dataset
from sqlalchemy.exc import ProgrammingError
import json

# connects to the database
db = dataset.connect(settings.CONNECTION_STRING)

# creates the class StreamListener based on tweepy's StreamListener
class StreamListener(tweepy.StreamListener):

    # basically says what to do with statuses that are received
    def on_status(self, status):
        # prevents retweets from being stored
        if (status.retweeted) or ('RT @' in status.text):
            print('retweet')
            return
        print(status.retweeted)
        # saves the following attributes of the tweet
        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        lang = status.lang
        
        # transforms into a string
        if geo is not None:
            geo = json.dumps(geo)

        if coords is not None:
            coords = json.dumps(coords)

        # specifies the table in which to store the tweet
        table = db[settings.TABLE_NAME]
        # tries to store the tweet in table
        try:
            table.insert(dict(
                user_description=description,
                user_location=loc,
                coordinates=coords,
                text=text,
                geo=geo,
                user_name=name,
                user_created=user_created,
                user_followers=followers,
                id_str=id_str,
                created=created,
                retweet_count=retweets,
                user_bg_color=bg_color,
                lang=lang,
            ))
        # says what to do when an error is encountered
        except ProgrammingError as err:
            print(err)

    def on_error(self, status_code):
        if status_code == 420:
            #returning False in on_data disconnects the stream
            return False

# loads authentication credentials
auth = tweepy.OAuthHandler(settings.consumer_key, settings.consumer_secret)
# sets token
auth.set_access_token(settings.access_token, settings.access_token_secret)
# creates the API object
api = tweepy.API(auth)

# instantiates the StreamListener class defined above
stream_listener = StreamListener()
# starts listening to the stream
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
# filters the stream according to our track terms, ...
stream.filter(track=settings.TRACK_TERMS, languages=settings.TRACK_LANGUAGES)

dumpy.py

import settings
import tweepy
import dataset

# connects to database
db = dataset.connect(settings.CONNECTION_STRING)

# retrieves every tweet stored
result = db[settings.TABLE_NAME].all()

# saves the tweets in a csv
dataset.freeze(result, format='csv', filename=settings.CSV_NAME, encoding='utf-8')

You only need to save these files in the same folder and you’re good to go.
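
From that folder, starting the listener and later dumping the collected tweets looks roughly like this:

# starts listening to the stream; runs until it is interrupted
python scraper.py

# exports the collected tweets from the sqlite database to a csv
python dumpy.py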

In action:

executes python scraper

Case 3: web harvesting

When you do not have the chance to use the Streaming API, e.g. because you need to access many and very old tweets, the search query is your last option. It would however be very tiring to collect the search results for a query by hand. To ease this task we can use Jefferson’s project, which works without problems.

Installation

The best way to install this project is to clone it to a local folder. How you clone a git repository depends on your operating system. The following command works on UNIX systems:

git clone https://github.com/Jefferson-Henrique/GetOldTweets-python

The project comes with some requirements as well:

# this works only when you're in the local directory the project was cloned to
pip install -r requirements.txt

After a successful installation you can retrieve tweets by username, keyword, and time. It’s sometimes advisable to divide the collection into years or to set a limit.

search by username

The following command searches for all of Trump’s tweets from 2009 and stores them in a csv:

# assumes to be in the directory where Exporter.py is located
# how you call python may be different under windows
python Exporter.py --username 'realdonaldtrump' --since 2009-01-01 --until 2009-12-31 --output 'trump2009.csv'

executes script to harvest tweets by username

search by keyword

This way you can get to know what people think about certain activities:

python Exporter.py --querysearch 'debat-o-meter' --maxtweets 100 --output 'debatometer100.csv'

That would be a great success:

“Sind die Begriffe Wahl-O-Mat und Debat-O-Meter eigentlich schon in den Duden aufgenommen worden?” (“Have the terms Wahl-O-Mat and Debat-O-Meter actually been added to the Duden yet?”) – krabbl_

search for top tweets

Sometimes we want to know what our favorite celebrity is up to:

python Exporter.py --username 'justinbieber' --maxtweets 100 --toptweets

Exercise 6

  1. Install both Python applications.
  2. Experiment with the available queries and parameters.

More information

I hope you’ve enjoyed this workshop and learned a lot! Feel free to contact me with any suggestions for improvement or questions regarding the content.

Below you find some more packages that are linked to working with social media data. Besides that, I’ve added a few references that provide more information related to the topic of this workshop.

packages

Some interesting packages that are worth a look if you plan to use data from social networks:

References

https://github.com/pablobarbera/social-media-workshop

shows how to collect data on Facebook, Instagram, …

http://kateto.net/netscix2016

tutorial on creating network graphs with igraph

http://blogs.lse.ac.uk/impactofsocialsciences/category/digital-methodologies-series/

series on digital methodologies in the social sciences