Social Media and Automated Sentiment Analysis
Social media monitoring and analysis have become increasingly popular since the rise of Web 2.0 because they provide an easy and effective way to measure the effect of a campaign directly. This can be done with KPIs such as the number of followers, likes, shares, comments and comments per post. Besides these hard, easy-to-measure KPIs, it is also useful to determine the emotional tone of written messages. Analyzing how positively or negatively people talk about a certain product or subject is called Sentiment Analysis.
Humans are far better at judging the tone of a written message, but you will need an automated form of sentiment analysis if you’re going to evaluate thousands of posts or millions of Twitter messages. Although automated sentiment analysis is less accurate, it can be very useful for relative comparisons. The question then is how we can do automated sentiment analysis. For this we need some Machine Learning and Natural Language Processing techniques (which will be discussed in later posts).
The predictive nature of Twitter
Although many social media platforms lend themselves to descriptive analytics (what do people think about product X?), Twitter is the most logical choice for predictive analytics. It is not hard to imagine how trends and extrapolations towards the future can be extracted from the Twitter stream. By watching the live Twitter stream for messages about flu symptoms, one could predict the spread of diseases. Or by watching Twitter messages for an intent to meet somewhere, one could predict riots and even revolutions. Some claim that Twitter can even be used to predict the stock market, unemployment, TV ratings or election results.
Even though it is more difficult to perform Sentiment Analysis on Twitter messages than on longer texts (because Tweets are short and contain slang and Twitter-specific jargon), it should still be possible to predict election results with the Twitter stream. Using Twitter data and Sentiment Analysis to predict election results is still controversial (I’ll give a short summary of the research done so far in a later post), but there is a clear correlation between Twitter behaviour and voting behaviour. The number of retweets of a party’s or candidate’s Tweets, or the number of Tweets that include a candidate’s name, correlates with their popularity. More ‘news’ coverage probably means a higher popularity (see the example of Trump).
The question then is to what extent the content of Twitter is representative of voting behaviour, and how accurately we can predict the results of an election with Twitter data. If prediction is possible, the accuracy of the predictions will depend only on the cleverness of the proposed model. How accurately can it filter out ‘noise’, classify the Tweets and extract their sentiment? Does it take the demographics of Twitter users into account? How well does it handle slang and emoticons during sentiment analysis?
If you were to use Twitter to give a (hopefully accurate) prediction for the upcoming elections, where should you start? Although it seems like a huge and impossible task, it can be divided into three or four smaller tasks that do not seem huge and impossible when considered individually:
- Data collection: set up the development environment and mine for the relevant Tweets (see blog 1).
- Pre-processing of the data: remove duplicate Tweets and retrieve geo-information and identifiers for the remaining Tweets. Remove Tweets that are not written in the relevant language or not posted from the relevant country. It might also be necessary to remove Tweets from Twitter accounts with a suspiciously large number of Tweets. These accounts probably belong to Twitter bots, political fanatics or news agencies and could skew the results away from the intentions of the general public.
- Review a number of methods for modelling political sentiment and propose a prediction model. Is the volume of Tweets mentioning each candidate enough to capture voter intentions? Or is it also necessary to do Sentiment Analysis and use only the positive Tweets? Which other boundary conditions lead to the smallest mean absolute error (MAE)? Are retweet volume, Twitter user count, unique Twitter user count, non-promotion Twitter user count or non-promotion Tweet volume good parameters for prediction?
- Make a prediction about the upcoming elections (1 November 2015) and evaluate the prediction against the conventional election polls and the final election result.
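To make the pre-processing and evaluation steps above a bit more concrete, here is a minimal sketch in Python. It is not the model used later in this series. The record fields (`user`, `lang`, `text`), the function names, the party names and the threshold are all assumptions made up for illustration, and mentions are matched with a naive case-insensitive substring test.

```python
from collections import Counter

# NOTE: the tweet fields ("user", "lang", "text") are an assumed, simplified
# record layout for illustration; they are not a real Twitter API schema.

def preprocess(tweets, language, max_tweets_per_user):
    """Drop wrong-language tweets, overly active accounts and duplicate texts."""
    # Count tweets per account over the whole collection, before any filtering.
    per_user = Counter(t["user"] for t in tweets)
    seen_texts = set()
    kept = []
    for t in tweets:
        if t["lang"] != language:
            continue  # not written in the relevant language
        if per_user[t["user"]] > max_tweets_per_user:
            continue  # suspiciously active account (bot, news agency, ...)
        key = t["text"].strip().lower()
        if key in seen_texts:
            continue  # duplicate of a tweet we already kept
        seen_texts.add(key)
        kept.append(t)
    return kept

def volume_shares(tweets, parties):
    """Percentage of party mentions per party, based on tweet volume only."""
    counts = Counter()
    for t in tweets:
        text = t["text"].lower()
        for party in parties:
            if party.lower() in text:
                counts[party] += 1
    total = sum(counts.values())
    return {p: (100.0 * counts[p] / total if total else 0.0) for p in parties}

def mean_absolute_error(predicted, actual):
    """Average absolute difference (in percentage points) over all parties."""
    return sum(abs(predicted[p] - actual[p]) for p in actual) / len(actual)
```

A volume-based prediction would then be `volume_shares(preprocess(tweets, "tr", 50), parties)`, where the threshold of 50 is arbitrary, and the result can be compared with poll numbers or the final result using `mean_absolute_error`. A sentiment-based variant would count only the Tweets classified as positive.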
In the next few blog posts I will write about these points and try to predict the upcoming Turkish General Elections. Some of these posts are more technical and might not be interesting to everybody, but the final results will definitely surprise you. So stay tuned, folks!