Analyzing Facebook posts and activity using Topic Modelling and other NLP methods

In April, I permanently deleted my Facebook presence. I joined in 2008 and was fairly active over the years.

In the process of deleting my account, I downloaded all my data. I couldn’t resist asking myself: what does this digital time capsule tell me about a decade of my life? And what hints does it give about a decade in Facebook’s life? Time to put my data science skills to work.

For R programmers who read this, I have posted my code to RPubs. It makes use of a number of packages, including tm, topicmodels and ldatuning, with plotly and ggplot2 for visualizations. Hopefully you can use it to do something similar with little or no tweaking. I hope you give it a go and learn something fun.

1. Setting up the analysis of my timeline

I downloaded the zip file of my Facebook data and extracted it to a folder on my computer. Here’s how to do this if you are not sure how. The file I was interested in was timeline.htm, in the html subdirectory. This file contains all the posts on my timeline for the (almost) ten years since I joined. Although other people can post on your timeline, the vast majority of the comments were my own.

Next I used text analytics to extract all the text from the heavy html code in this file, until I was left with just a list of the comments from my timeline. This is not straightforward, but a few lines of the right code can get you there.
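As a rough illustration of this step (not the RPubs code, which is in R), here is a minimal Python sketch using only the standard library. It assumes each post sits inside a `<div class="comment">` element; that class name and the sample markup are assumptions, so check the structure of your own timeline.htm before relying on it.

```python
from html.parser import HTMLParser


class CommentExtractor(HTMLParser):
    """Collect the text inside every <div> carrying the target class."""

    def __init__(self, target_class="comment"):
        super().__init__()
        self.target_class = target_class  # assumed class name, check your own file
        self.depth = 0     # > 0 while we are inside a target <div>
        self.buffer = []   # text fragments of the post being read
        self.posts = []    # finished posts

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1            # nested <div> inside a post
        elif self.target_class in classes:
            self.depth = 1             # a new post starts here

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:        # the post's own <div> just closed
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.posts.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_posts(html):
    parser = CommentExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


# Made-up markup standing in for the real file.
sample = """
<div class="meta">Monday, 2 April 2018 at 20:15 UTC</div>
<div class="comment">Dropped the kids at <b>school</b>, then coffee.</div>
<div class="meta">Tuesday, 3 April 2018 at 07:05 UTC</div>
<div class="comment">Best episode of the season!</div>
"""
posts = extract_posts(sample)
print(posts)
```

Inline tags such as `<b>` are dropped while their text is kept, and runs of whitespace are collapsed to single spaces.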

I also extracted all the activity dates in a similar way, so I could analyze the patterns in the times and dates that I was posting.
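Once the raw date strings are out of the HTML, they need to become proper timestamps. The sketch below is in Python rather than R and assumes a hypothetical layout such as "Monday, 2 April 2018 at 20:15 UTC"; the real export may use a different layout, in which case only the format string needs to change.

```python
from datetime import datetime

# Hypothetical timestamp layout; adjust the format string to match your export.
DATE_FORMAT = "%A, %d %B %Y at %H:%M UTC"


def parse_activity_dates(raw_dates):
    """Turn raw timestamp strings into datetime objects, skipping bad rows."""
    parsed = []
    for raw in raw_dates:
        try:
            parsed.append(datetime.strptime(raw.strip(), DATE_FORMAT))
        except ValueError:
            continue  # not a timestamp in the assumed layout
    return parsed


raw = [
    "Monday, 2 April 2018 at 20:15 UTC",
    "not a date",
    "Tuesday, 3 April 2018 at 07:05 UTC",
]
dates = parse_activity_dates(raw)
print(dates)
```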

Overall, there were 5115 posts over 3632 days of membership, which averages out at about 1.4 posts per day, so my profile was fairly active over the ten years.

2. Analyzing my activity on Facebook

I took the data on the dates and times of my activity and used R’s nice functionality to group them so I could understand how my activity changed by year, by month of the year and by hour in the day.
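The grouping itself is simple counting. As an illustration only (the original was done in R, and the timestamps here are made up), a Python sketch of the three groupings might look like this:

```python
from collections import Counter
from datetime import datetime


def activity_counts(timestamps):
    """Count posts by calendar year, by month of the year and by hour of the day."""
    by_year = Counter(ts.year for ts in timestamps)
    by_month = Counter(ts.month for ts in timestamps)
    by_hour = Counter(ts.hour for ts in timestamps)
    return by_year, by_month, by_hour


# Made-up timestamps standing in for the real activity dates.
timestamps = [
    datetime(2013, 5, 1, 21, 30),
    datetime(2013, 5, 9, 22, 10),
    datetime(2014, 11, 3, 21, 45),
    datetime(2015, 5, 20, 8, 5),
]
by_year, by_month, by_hour = activity_counts(timestamps)
for year in sorted(by_year):
    print(year, by_year[year])
```

Each counter can then be handed to whatever plotting library you prefer to draw the bar charts.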

First I took a look at how my activity changed by year and I produced this chart, which shows that I was hardly active at all in 2010 (I remember deliberately taking a year off from Facebook!), and most active in 2013–2015.

Then I looked at the overall development of my activity month-by-month, using a scatter plot with a LOESS smooth curve to show the overall trend. This confirmed that I reached an activity peak around 2013–2015. Activity declined substantially after 2015. I couldn’t help but ask: is this the general trend of Facebook? It sure feels like it to me!

Finally, I was interested in my daily trend of posting, so I looked at the total posts by hour of the day. This looked like I was weirdly nocturnal until I realized that the timezone recorded by Facebook is GMT, but for most of this time period I lived in Australia which is 9–10 hours ahead depending on the time of year. Looking at it in this light, this activity looks relatively normal, with a peak in the (Australian) mornings and evenings, and lower levels of activity during working hours and in the night time.
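The timezone correction can be sketched as follows, again in Python with made-up timestamps. It assumes a fixed UTC+10 offset, which ignores daylight saving, so treat it as an approximation of the 9–10 hour shift described above rather than an exact conversion.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: a fixed UTC+10 offset, which ignores daylight saving changes.
LOCAL = timezone(timedelta(hours=10))


def local_hour_counts(utc_timestamps):
    """Count posts by local hour of day after shifting naive UTC timestamps."""
    hours = Counter()
    for ts in utc_timestamps:
        local = ts.replace(tzinfo=UTC).astimezone(LOCAL)
        hours[local.hour] += 1
    return hours


# Made-up UTC timestamps: two late-evening posts and one mid-morning post.
utc_posts = [
    datetime(2014, 3, 1, 21, 30),
    datetime(2014, 3, 1, 22, 15),
    datetime(2014, 3, 2, 9, 0),
]
hours = local_hour_counts(utc_posts)
print(sorted(hours.items()))
```

Posts that look like late-night activity in UTC land in the local morning, and the mid-morning UTC post lands in the local evening.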

3. Most important words in my posts

Next I tidied up the text to allow me to analyze it easily. I removed punctuation and numbers, and converted everything to lower case. I also removed stopwords: very common words like ‘and’ that carry little meaning on their own and are often dropped from internet search queries too. Finally, I replaced certain names which I knew would appear, in particular my two kids’ names, which I converted to ‘child1’ and ‘child2’. I did this to protect their privacy given that I expected to share my analysis.
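A minimal Python sketch of this clean-up is below. The stopword list is a tiny illustrative one and the names being masked are hypothetical; a real analysis would use a full stopword list from a text-mining library.

```python
import re

# A tiny illustrative stopword list; real lists contain a few hundred words.
STOPWORDS = {"a", "an", "and", "at", "the", "to", "of", "in", "is", "was", "with", "my"}
# Hypothetical names standing in for the real ones.
NAME_REPLACEMENTS = {"alice": "child1", "bob": "child2"}


def clean_post(text):
    """Lower-case, strip punctuation and digits, mask names, drop stopwords."""
    text = text.lower()
    # Anything that is not a-z or whitespace becomes a space.
    # Note this also splits contractions and drops accented letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    tokens = [NAME_REPLACEMENTS.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOPWORDS]


cleaned = clean_post("Dropped Alice and Bob at school at 8.30 with 2 bags!")
print(cleaned)
```

The names are masked after the digits are stripped, so the placeholders ‘child1’ and ‘child2’ keep their numbers.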

I turned my list of posts into a document corpus, with each post as a single document. I created a Document Term Matrix, which records how many times each word occurs in each document. I then used a measure called Term Frequency-Inverse Document Frequency (TF-IDF), which scores a word highly when it is frequent within a document but rare across the corpus, to gauge the importance of words within documents.

I averaged each word’s TF-IDF score across all documents and then ranked the words from greatest to smallest. Finally I used this data to create a wordcloud to represent the universe of my posting over the ten years. Here it is.

First and foremost this tells me that, at least on Facebook, my kids were my life. However, the value of TF-IDF over a simple word count is that it gives extra credit to slightly less frequent words and brings them into the foreground. These extra words around the edges give some color to the tone of my posts. Looking over it, I look pretty happy and pretty busy. I think that’s a fair conclusion.
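For readers who want to see the ranking behind a wordcloud like this, here is a small Python sketch on made-up posts. It is not the RPubs code. It assumes one common TF-IDF variant (a term's share of the document's tokens, times log2 of the number of documents over the number of documents containing the term), and it assumes the average is taken over all documents; the original analysis may differ on both points.

```python
import math
from collections import Counter


def mean_tfidf(docs):
    """Return {term: TF-IDF averaged over all documents}.

    TF is the term's share of the document's tokens and
    IDF is log2(N / document frequency), one common variant.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]  # rows of the document-term matrix
    doc_freq = Counter(term for c in counts for term in c)
    totals = Counter()
    for c in counts:
        length = sum(c.values())
        for term, n in c.items():
            totals[term] += (n / length) * math.log2(n_docs / doc_freq[term])
    return {term: total / n_docs for term, total in totals.items()}


# Made-up, already-cleaned posts.
docs = [
    ["child1", "school", "morning"],
    ["child1", "birthday", "party"],
    ["child1", "episode", "season", "episode"],
    ["school", "morning", "coffee"],
]
scores = mean_tfidf(docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

In this toy corpus a word that appears in most posts is pulled down by its low IDF, and a word appearing in every post would score zero. This is why TF-IDF surfaces less common words that a raw count would bury.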

4. My most common posting topics

Next I moved on to see if I could elucidate some common topics that I posted about regularly. For this I used a technique called Topic Modelling. Using my Document Term Matrix, I first used a mathematical technique to determine an optimal number of topics to split my posts into. I landed on eight topics: more than that seemed to give diminishing returns, and every extra topic adds significantly to processing time.
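To give a feel for what a topic model does under the hood, here is a toy Python sketch. It assumes the model is Latent Dirichlet Allocation fitted by collapsed Gibbs sampling, which is an assumption on my part and not the R packages' implementation. The hyperparameters `alpha` and `beta` and the sample posts are illustrative. Choosing the number of topics would typically mean fitting models for several values and comparing a quality metric, which is not shown here.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            k = rng.randrange(n_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Resample in proportion to (doc-topic count + alpha)
                # * (topic-word count + beta) / (topic size + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Made-up, already-cleaned posts.
docs = [
    ["episode", "season", "best", "episode"],
    ["season", "finale", "episode"],
    ["school", "kids", "morning"],
    ["kids", "school", "drop", "morning"],
]
doc_topic, topic_word = fit_lda(docs, n_topics=2)
for t, words in enumerate(topic_word):
    print(t, [w for w, n in words.most_common(3) if n > 0])
```

Every sweep revisits every token for every topic, which gives some intuition for why each extra topic adds to processing time.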

My topic modelling process found eight topics as requested. This bar chart shows how often each topic appeared: topic 8 was the most frequent, although every topic occurs with some frequency.

But what exactly were the topics?

I created a list of the top ten words by topic, to help me work out what the topics are, and here are the results:

So this is fun! Topic modelling can be a bit hit and miss — especially when the documents in the corpus are short, like Facebook posts or Tweets — but looking at the top ten words by topic, I see some things which make sense:

  • The most frequent topic (topic 8) is about TV, with the words best, episode and season being a giveaway for this. I am indeed a lover of great TV — more on this later.
  • Topic 1 is unsurprising — birthday greetings — likely from others to me over the years.
  • Topic 2 is also popular and is clearly about my family life. I am lucky enough to be able to drop my kids at school every morning, something I have done for many years, and it looks like I posted about this a lot.
  • Topic 3 reflects my love of football (soccer) and my favourite team, Everton FC.
  • Topic 6 takes me back to TV, and my favourite show of all time: Mad Men. I do remember posting about this show quite a lot, pretty much after every episode for many years.
  • Other topics seem a bit harder to decipher, but for example in Topic 7, the mention of ‘news’ and ‘abbott’ (Tony Abbott, Prime Minister and Opposition Leader in Australia for most of the time I lived there) indicates that this is likely a number of topical posts based on the Australian news and politics.

So my ten years on Facebook tell me that I am a family man, that I love soccer, that I comment on politics too much, and that I have spent far too long analyzing episodes of Mad Men.

There you have it. Thanks Facebook for ten fun years!
