What was Donald Trump speaking about during the elections? A LDA Analysis of Trump’s Facebook posts

On 18 June 2017 by rkuebler

I am doing some heavy econometric stuff these days that forces me to work with Stata instead of R. To relax a bit, I am using my evenings to play around with R and to try out some new things in NLP, which I have read a lot about previously and which I find interesting. I know this sounds huuugely nerdy…

A year ago, when I presented my work with Anatoli and Koen in Shanghai about using different types of sentiment extraction tools to measure and predict brand equity, someone pointed me to latent topic modeling. These people from computer science highlighted the vast potential of this type of unsupervised learning when it comes to understanding textual data.

Latent topic modeling is another interesting sort of algorithm that identifies clusters of latent topics within text documents. You basically feed the algorithm some text input (e.g. different documents, chapters, speeches, tweets, Facebook posts, whatever you can get), and the algorithm will try to find similar patterns across the different documents. The idea is that these patterns are related to topics, so the occurrence of words or the co-occurrence of word combinations will be related to specific topics.

I am not sure if it is mathematically and conceptually correct to compare LDA with cluster analysis, but basically you also try to find clusters of words that occur together and therefore indicate a specific topic. If words have similar likelihoods of occurring, they might be related to a similar topic.

Like most NLP techniques, LDA requires text to be prepared (or “translated”) into matrix form. Most commonly you have the choice between two sorts of approaches: (1) a TermDocumentMatrix or (2) a DocumentTermMatrix. Both sorts of “tables” differ only in their orientation.

The TermDocumentMatrix (TDM) is a two-dimensional matrix. Each row contains a term featured in at least one of the documents (though while preparing your matrix you may want to set rules to exclude rare and unnecessary terms). Each column represents one of the documents from your corpus (i.e. your collection of documents). Each cell then indicates whether (or, if you use term frequencies, how often) the term of the row is present in the document of the column; in the simplest case this is a dummy indicator taking 1 and 0.

The DocumentTermMatrix (DTM) is just the transposed version of the TDM. Here the columns represent all words occurring in the documents and the rows represent the documents. Again, dummies (or counts) are used to indicate the presence of a word in a document, so each document is characterized by its combination of dummies. Having started my research career with discrete choice experiments, DTMs actually reminded me a lot of experimental design plans, which helped me to get familiar with the concept quickly.

Both TDMs and DTMs are sparse matrices, as the number of zeros significantly outnumbers the number of ones.
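To make this concrete, here is a minimal toy example (three made-up mini documents, nothing to do with the actual data below) showing what a DTM looks like in R’s tm package:

library(tm)

#three made-up mini documents
docs <- c("make america great again",
          "america first",
          "great jobs great economy")
corp <- Corpus(VectorSource(docs))
dtm.toy <- DocumentTermMatrix(corp)
#rows = documents, columns = terms, cells = term frequencies (mostly zeros)
inspect(dtm.toy)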

There are different algorithms and models available for latent topic models. These days people most commonly use so-called Latent Dirichlet Allocation (LDA) models for latent topic identification. R has LDA included in its powerful topicmodels package.

LDAs use DTMs to identify topics within text. As with classic cluster analysis (e.g. k-means clustering), it is up to the researcher to determine the number of topics a priori. This might be a problem, as especially in marketing research people will criticize the choice as arbitrary or driven by the researcher’s wish to find specific things within the data.

I personally (not having too much experience with LDA and NLP) see two ways to cope with this issue. On the one hand, the choice of k (the number of topics) could be led by theory: previous findings or strong theory may guide a researcher. On the other hand, the data itself may give some indication of the optimal number of topics. I personally found this discussion on Stack Overflow very helpful.

But let’s get a bit more concrete. People are still wondering how Donald Trump could get into office. LDA may not fully answer this question, but it may be helpful to identify the topics he used most often during the elections.

I used R and Pablo Barbera’s wonderful package Rfacebook to extract all posts by Donald Trump from June 2016 to October 2016, shortly before the elections. In total, Trump posted 1,058 times on Facebook during this period.
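For completeness, here is roughly how such an extraction looks with Rfacebook. This is a sketch rather than my exact call: you need your own Facebook API token, and fb_token below is just a placeholder name for it.

library(Rfacebook)

#fb_token is a placeholder for your own Facebook API token
TrumpFacebook <- getPage("DonaldTrump", token = fb_token, n = 5000,
                         since = "2016/06/01", until = "2016/10/31")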

To clean the data and get it into a DTM, R’s tm package is the weapon of choice (as always when it comes to text data and R). Please note that the code below works wonderfully on a MacBook Pro. I do not give any guarantee that it will also work on any other computer, as text data usually confronts you with a lot of challenges because of Unicode issues. Still, I am sure it will be quite helpful.

First I extract all posts from the data frame I got from Rfacebook and create a corpus that is readable for the tm package. I don’t know why I called the corpus tweets. Just ignore this please. Also, I am sure that many people will find my code very basic and will see a lot of optimization potential. And they are right. The code is very basic, for two major reasons. First, my coding skills are basic, and second, I find it easier for beginners to follow things step by step. If I integrate things into functions or use dplyr to combine steps in one line of code, I feel it is much harder for beginners to follow up. So my apologies to everyone who thinks my code is too basic. It is like it is…

So let’s start by building up the corpus:

library(tm)
#GetTweets in Corpus Form
tweets <- Corpus(VectorSource(TrumpFacebook$message))

The tm package is well known for having a lot of issues with emojis and all other sorts of symbols, so we need to clean the data of these. Luckily, Donald is not that heavy a user of emojis. Still, there is some stuff we cannot read. So before we continue with the normal transformations, let’s make sure we get rid of all non-readable characters:

#remove all signs that are not interpretable by the tm package
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
tweets <- tm_map(tweets, toSpace, "[^[:graph:]]")

To avoid overly large matrices, one should convert all text to lowercase and remove stopwords, numbers, and other content that does not help the classification later. I borrowed some of the code from Stack Overflow and some from eight2late, who provides us with a nice extra stopword list. Word stemming also helps to significantly cut down the number of columns by reducing words to their stems (e.g. cutting “difficulties”, “difficulty”, etc. down to a common stem).

#change text to lower case
tweets <- tm_map(tweets, content_transformer(tolower))
#remove problematic signs (note: this transformer deletes the pattern rather than replacing it with a space)
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, "", x))})
tweets <- tm_map(tweets, toSpace, "-")
tweets <- tm_map(tweets, toSpace, "’")
tweets <- tm_map(tweets, toSpace, "‘")
tweets <- tm_map(tweets, toSpace, "•")
tweets <- tm_map(tweets, toSpace, "”")
tweets <- tm_map(tweets, toSpace, "“")
#remove punctuation
tweets <- tm_map(tweets, removePunctuation)
#strip digits
tweets <- tm_map(tweets, removeNumbers)
#remove stopwords
tweets <- tm_map(tweets, removeWords, stopwords("english"))
#remove whitespace
tweets <- tm_map(tweets, stripWhitespace)
#stem document
tweets <- tm_map(tweets, stemDocument)

myStopwords <- c("can", "say", "one", "way", "use",
                 "also", "howev", "tell", "will",
                 "much", "need", "take", "tend", "even",
                 "like", "particular", "rather", "said",
                 "get", "well", "make", "ask", "come", "end",
                 "first", "two", "help", "often", "may",
                 "might", "see", "someth", "thing", "point",
                 "post", "look", "right", "now", "think", "'ve",
                 "'re", "anoth", "put", "set", "new", "good",
                 "want", "sure", "kind", "larg", "yes", "day", "etc",
                 "quit", "sinc", "attempt", "lack", "seen", "awar",
                 "littl", "ever", "moreov", "though", "found", "abl",
                 "enough", "far", "earli", "away", "achiev", "draw",
                 "last", "never", "brief", "bit", "entir",
                 "great", "lot")
tweets <- tm_map(tweets, removeWords, myStopwords)

So now we have everything ready to prepare our DTM. You will be surprised how easy this is in R.

dtm <- DocumentTermMatrix(tweets)
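Before moving on, it cannot hurt to take a quick look at what we got (an optional sanity check, not strictly necessary for the analysis):

#dimensions of the DTM: number of documents x number of terms
dim(dtm)
#terms that occur at least 50 times across all posts
findFreqTerms(dtm, lowfreq = 50)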


However, it unfortunately turns out that Donald sometimes does not have much to say. In some cases he only shares content without leaving a message. This leads to some empty rows in our DTM, which will later create problems when estimating the LDA model. Therefore we have to clean the DTM of any empty observations. Still, we later want a data set where we can assign the identified topics to the original messages, so we need to do a bit more than just cleaning.

#identify empty rows in the corpus
rowTotals <- apply(dtm, 1, sum)
#document ids of all rows without any terms
empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]]
#drop the empty documents from the corpus and rebuild the DTM
tweets2 <- tweets[-as.numeric(empty.rows)]
dtm2 <- DocumentTermMatrix(tweets2)

In addition to the DTM cleaning, we also need to drop the empty messages from our raw data set so that the two still correspond to each other.

#adapt the raw data to the empty-row drop
empty <- as.numeric(empty.rows)
messages <- TrumpFacebook[-empty, ]

Now we can finally set up our LDA model. Before you just copy/paste my code, I suggest you have a look at the different options. LDA involves some heavy statistics and procedures like Gibbs sampling. In my example here I just follow the suggested starting values (as presented here). Still, you may want to play around and use other starting values, or use another LDA model approach instead of a Gibbs sampler (just check topicmodels’ wonderful help file to see the other options available). Also, Grün and Hornik (2011) give a very helpful hands-on discussion of the options you have in LDA and when to best use which one. For now, let’s use the standard parameters.

#Prepare LDA
library(topicmodels)

#prepare the Gibbs sampler
burnin <- 4000    #number of initial iterations to discard
iter <- 2000      #number of sampling iterations
thin <- 500       #keep only every 500th draw to reduce autocorrelation
seed <- list(2003, 5, 63, 100001, 765)  #one seed per independent run
nstart <- 5       #number of independent runs with different starting points
best <- TRUE      #return only the run with the highest posterior likelihood

It is time to choose the number of topics now. I read quite a bit around: FiveThirtyEight highlights up to 12 different election topics for the 2016 presidential elections, while some other sources speak of 4-8. I tried different numbers of topics and finally used the data-based approach mentioned above, which guided me to an optimal number of 4 topics (a sketch of such a comparison follows after the next code block).

#define number of topics to identify
k = 4
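In case you want to replicate the data-based part, here is a minimal sketch of how such a comparison could look. The candidate values for k and the shortened burn-in and iteration numbers are my choices to keep the runtime bearable; the Stack Overflow discussion linked above presents more refined approaches.

#compare model fit (log-likelihood) across candidate numbers of topics
candidates <- c(2, 4, 6, 8, 12)
fits <- lapply(candidates, function(k)
  LDA(dtm2, k, method = "Gibbs",
      control = list(burnin = 1000, iter = 1000, seed = 2003)))
loglik <- sapply(fits, logLik)
plot(candidates, loglik, type = "b",
     xlab = "number of topics k", ylab = "log-likelihood")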

Now we can finally run the LDA model.

#run LDA Analysis
ldaOut <- LDA(dtm2, k, method = "Gibbs", control = list(nstart = nstart, seed = seed, best = best, burnin = burnin, iter = iter, thin = thin))

It will take some time. I am using a MacBook Pro (Late 2015, with 32GB of RAM) and I had results after about 4 minutes. Increasing the size of the corpus will therefore certainly lead to longer estimation times! Keep that in mind when you play around with larger data sets. I am also quite sure that creating the DTM will take much longer if you face a larger corpus.

So now let’s see what we have found. First we can inspect the 10 most likely terms for each of the preset topics.

#Inspect Output with top 10 terms for each topic

ldaOut.terms <- as.matrix(terms(ldaOut,10))
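terms() only gives you the top words. If you want the full probability distributions instead, topicmodels’ posterior() function returns both the term distribution of each topic and the topic distribution of each document:

#full posterior distributions
post <- posterior(ldaOut)
dim(post$terms)    #k x vocabulary: term probabilities per topic
dim(post$topics)   #documents x k: topic probabilities per post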

The top-10 terms per topic look like this:

     Topic 1   Topic 2    Topic 3   Topic 4
1    clinton   american   michael   america
2    hillari   countri    john      thank
3    peopl     job        david     trump
4    state     million    jame      safe
5    presid    year       robert    djt
6    vote      america    mari      support
7    email     peopl      william   togeth
8    govern    work       mike      crook
9    system    everi      richard   win
10   corrupt   plan       thoma     presid

Topic 1 apparently is all about attacking Hillary. Topic 2 is about the US economy and bringing jobs back to the US. Topic 3 contains mainly first names; I am not quite sure what it is about, but so far I interpret it as Donald speaking about people he might hire for his later cabinet. Topic 4 is all about making America great again with DJT. The topics seem to make sense and are clearly related to what we have in mind when we think about Trump’s campaign last year.

So let’s now see how these things develop over time. I use ggplot to create some time-series graphs to observe his posting behavior. First, let’s see how all four topics move together.
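The plotting code below is a sketch of the idea rather than a verbatim copy of what I ran. It assumes that the Rfacebook output stores timestamps in a column called created_time (the package’s default) and uses the most likely topic per post from topics().

library(ggplot2)

#assign each (non-empty) post its most likely topic; rows of ldaOut match messages
messages$topic <- factor(topics(ldaOut))
#Facebook timestamps look like "2016-06-18T20:21:13+0000",
#so the first 10 characters give the date
messages$date <- as.Date(substr(messages$created_time, 1, 10))

#count posts per day and topic
daily <- as.data.frame(table(date = messages$date, topic = messages$topic))
daily$date <- as.Date(daily$date)

ggplot(daily, aes(x = date, y = Freq, colour = topic)) +
  geom_line() +
  labs(x = "date", y = "posts per day", colour = "topic")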

Well, what do we see here? Right, a lot of lines that prevent us from getting a clearer picture. Even though this looks quite messy (not to say huuuugely messy), we see some differences in magnitude as well as in posting timing across the four identified topics.

Let’s have a look at each time series separately. Let’s start with Trump posting about Hillary:

Then let’s look at the economy-related posts:

Trump talking about other (possibly future cabinet) people:

 

And finally, Trump speaking about himself:


It is quite interesting to see that Trump addresses the different topics at different times. He attacks Hillary at the beginning and at the end of the campaign, but tones it down in the middle (late August 2016 to early October). The economy seems to get more attention towards the end of the campaign, with the elections coming closer. He also seems to get more concrete about other people towards the end of the campaign. Looking at the fourth topic category, we can see that Trump’s posts related to himself even increase over time. So his ego apparently got bigger during the campaign.

Or he realized that people actually like this identification. To better understand what drives his posting behavior (and how strongly he emphasized each topic), one may e.g. combine the LDA topic results with some polling data and run some basic regression models. But keep in mind, things may be quite endogenous here: topics may not only drive polls, but polls may also have quite some impact on what Trump posted on Facebook. This, however, leaves some space for another blog post. Quite a huuuuge one…

So long! Or for this special context: Good Night and Good Luck America!

 
