For most people, the most interesting part of the previous post will be the final results. But for those who would like to try something similar, or who are curious about the technical side, I will explain the methods and techniques I used (mostly web scraping with BeautifulSoup4) to collect a few million tweets.
Setting up Python and its relevant packages
I used Python to collect all the relevant data because I have prior experience with it, but the same techniques should be applicable in other languages. If you do not have any experience with Python but would still like to use it, I can recommend the Coursera Python Course, the Codecademy Python Course or the book Learn Python the Hard Way. It might also be a good idea to get a basic understanding of how web APIs work.
The Python packages you will need are NumPy and SciPy for basic calculations, OAuth2 for authorization with Twitter, tweepy because it provides a user-friendly wrapper around the Twitter API, and pymongo for interacting with a MongoDB database from Python (if that is the database you will use).
If you do not have Python and/or some of its packages, the easiest way to install them is as follows. On Linux, install pip (the Python package manager) first and then install any missing package with pip install <package>. On Windows, I recommend installing Anaconda first (which comes with a lot of built-in packages, including pip) and then IPython. The missing packages can then be installed with the same command.
Getting your Twitter credentials
Twitter uses OAuth2 for authorization and authentication, so whether you are using tweepy to access the Twitter stream or some other method, make sure you have installed the OAuth2 package. After you have installed OAuth2, it is time to get your credentials from https://apps.twitter.com. Log in and click on create a new app, fill in the application details and copy your Consumer Key, Consumer Secret, Access Token and Access Token Secret to a text file. All four values are needed in the code further on.
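One way to keep these values out of your scripts is to read them from that text file at start-up. The sketch below is only an illustration: it assumes a hypothetical file format with one "name = value" pair per line, the key names are chosen to match the variable names used in the tweepy sample later in this post, and parse_credentials is an illustrative helper, not part of tweepy.

```python
REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "access_token", "access_token_secret")


def parse_credentials(text):
    """Parse 'name = value' lines into a dict (hypothetical file format)."""
    credentials = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        credentials[name.strip()] = value.strip()
    # Fail early if one of the four values is absent or empty.
    missing = [key for key in REQUIRED_KEYS if not credentials.get(key)]
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    return credentials


# Placeholder values; in practice the text would come from your own file.
example = """
# Twitter app credentials
consumer_key = AAA
consumer_secret = BBB
access_token = CCC
access_token_secret = DDD
"""
creds = parse_credentials(example)
print(sorted(creds))
```

Raising an error for a missing value makes a copy-and-paste mistake visible immediately instead of as an authentication failure later on.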
Accessing Twitter with its API
I recommend using tweepy, an open-source Twitter API wrapper that makes it easy to access Twitter. If you are using a programming language other than Python, or if you don’t feel like using tweepy, you can look at the Twitter API documentation and find other means of accessing Twitter.
There are two ways to mine for tweets: with the Streaming API or with the Search API. The main difference (for an overview, click here) between them is that with Search you can mine for tweets posted in the past, while Streaming goes forward in time and captures tweets as they are posted.
It is also important to take the rate limits of both APIs into account:
- With the Search API you can only send 180 requests per 15-minute window. With a maximum of 100 tweets per request, this means you can mine 4 x 180 x 100 = 72,000 tweets per hour. One way to increase the number of tweets is to authenticate as an application instead of as a user. This raises the rate limit from 180 to 450 requests per window, while removing some of the possibilities you had as a user.
- With the Streaming API you can collect all tweets containing your keyword(s), up to 1 % of the total tweets currently being posted on Twitter. So if your keyword is very general and more than 1 % of the tweets contain this term, you will not get all of the tweets containing this term. The obvious solution is to make your query more specific by combining multiple keywords. At the moment 500+ million tweets are posted a day, so 1 % of all tweets still gives you around 5 million tweets a day.
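The arithmetic behind these limits can be written down as a minimal sketch. The function and constant names are illustrative, and the figures (15-minute windows, 100 tweets per request, a 1 % streaming cap, 500 million tweets a day) are simply the ones quoted above.

```python
WINDOW_MINUTES = 15        # length of one rate-limit window
TWEETS_PER_REQUEST = 100   # maximum tweets returned per Search request


def search_tweets_per_hour(requests_per_window):
    """Upper bound on tweets per hour obtainable from the Search API."""
    windows_per_hour = 60 // WINDOW_MINUTES
    return windows_per_hour * requests_per_window * TWEETS_PER_REQUEST


def streaming_cap_per_day(total_tweets_per_day, cap_percent=1):
    """Upper bound on tweets per day obtainable from the Streaming API."""
    return total_tweets_per_day * cap_percent // 100


print(search_tweets_per_hour(180))          # user authentication: 72000
print(search_tweets_per_hour(450))          # application authentication: 180000
print(streaming_cap_per_day(500000000))     # 1 % of 500 million: 5000000
```

Authenticating as an application therefore raises the Search ceiling from 72,000 to 180,000 tweets per hour, which is still small compared to the daily streaming cap.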
Which one should you use?
Obviously, any prediction about the future should be based on tweets coming from the Streaming API, but if you need some data to fine-tune your model you can use the Search API to collect tweets from the past seven days; it does not go further back (Twitter documentation). Since 500+ million tweets are posted every day, the past seven days should provide enough data to get you started. However, if you need tweets older than 7 days, web scraping might be a good alternative, since a search at twitter.com does return old tweets.
Using the tweepy package to stream Twitter messages is pretty straightforward. There is even a code sample on the GitHub page of tweepy. So all you need to do is install tweepy (or clone the GitHub repository) and fill in the search terms in the relevant part of search.py.
With tweepy you can also search for Twitter messages (not older than 7 days). The code sample below shows how it is done.
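The decision described above can be summarised in a few lines. This is only a sketch of the rule of thumb: choose_collection_method and the example dates are hypothetical, and the seven-day window is the Search API limit mentioned in the text.

```python
import datetime as dt

SEARCH_WINDOW_DAYS = 7  # the Search API does not go further back


def choose_collection_method(oldest_needed, today):
    """Suggest how to collect tweets posted on or after `oldest_needed`."""
    if oldest_needed > today:
        return "streaming"       # tweets that have not been posted yet
    if today - oldest_needed <= dt.timedelta(days=SEARCH_WINDOW_DAYS):
        return "search"          # still within the seven-day window
    return "web scraping"        # too old for the Search API


today = dt.date(2015, 6, 10)     # hypothetical "current" date
print(choose_collection_method(dt.date(2015, 6, 5), today))
print(choose_collection_method(dt.date(2015, 5, 1), today))
```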
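    import tweepy

    # Fill in the credentials you copied from https://apps.twitter.com
    access_token = ""
    access_token_secret = ""
    consumer_key = ""
    consumer_secret = ""

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    # tweepy URL-encodes the query itself, so pass a plain space instead of %20.
    # Note: in tweepy 4.x, api.search was renamed to api.search_tweets.
    for tweet in tweepy.Cursor(api.search, q="Tayyip Erdogan", lang="tr").items():
        print(tweet)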
These two code samples again show the advantage of tweepy: it makes it really easy to access the Twitter API from Python, and as a result it is probably the most popular Python Twitter package. But because it uses the Twitter API, it is also subject to the limitations imposed by Twitter: the rate limit and the fact that you cannot search for Twitter messages older than 7 days. Since I needed data from the previous elections, this posed a serious problem for me, and I had to use web scraping to collect Twitter messages from May.
Using BeautifulSoup4 to scrape for tweets
There are some pros and cons to using web scraping (instead of the API) to collect Twitter data. One of the most important pros is that there is no rate limit on the website, so you can collect more tweets than the Twitter API allows. Furthermore, you can also mine for tweets older than seven days :).
If we want to scrape twitter.com with BeautifulSoup, we need to send a request and extract the relevant information from the response. The Search API documentation gives a nice overview of the relevant parameters you can use in your query.
For example, if you want to request all tweets containing ‘akparti’ from 01 May 2015 until 05 June 2015, written in Turkish, you can do that with the following URL:
https://twitter.com/search?q=akparti%20since%3A2015-05-01%20until%3A2015-06-05&lang=tr
The tweets on this page can easily be scraped with the Python module BeautifulSoup.
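    import urllib2  # Python 2; on Python 3 use urllib.request instead
    from bs4 import BeautifulSoup

    url = "https://twitter.com/search?q=akparti%20since%3A2015-05-01%20until%3A2015-06-05&lang=tr"
    response = urllib2.urlopen(url)
    html = response.read()
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(html, "html.parser")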
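The search URL used above does not have to be typed by hand; it can be assembled from the keyword, the date range and the language. The following is a minimal sketch written for Python 3 (it uses urllib.parse, which differs from the Python 2 modules used elsewhere in this post); build_search_url is an illustrative helper name.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://twitter.com/search"


def build_search_url(keyword, since, until, lang):
    """Build a twitter.com search URL for a keyword, date range and language."""
    query = "{} since:{} until:{}".format(keyword, since, until)
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    params = urlencode([("q", query), ("lang", lang)], quote_via=quote)
    return BASE_URL + "?" + params


url = build_search_url("akparti", "2015-05-01", "2015-06-05", "tr")
print(url)
```

Building the URL this way makes it easy to loop over several keywords or date ranges without worrying about percent-encoding by hand.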
‘soup’ now contains the entire contents of the HTML page. Now, let's look at how we can extract the more specific elements containing only the tweet text, the tweet timestamp or the user. With the developer tools of Chrome (right-click on the tweet and then ‘Inspect element’) you can see which elements contain the desired contents and scrape them by their class name.
We can see that an <li> element with class ‘js-stream-item’ contains the entire contents of the tweet, a <p> with class ‘tweet-text’ contains the text, and the user is contained in a <span> with class ‘username’. This gives us enough information to extract these with BeautifulSoup:
    # Assumes `import datetime as dt` (see the imports further below).
    tweets = soup.find_all('li', 'js-stream-item')
    for tweet in tweets:
        # Skip stream items that do not contain a tweet text.
        if not tweet.find('p', 'tweet-text'):
            continue
        tweet_user = tweet.find('span', 'username').text
        tweet_text = tweet.find('p', 'tweet-text').text.encode('utf8')
        tweet_id = tweet['data-item-id']
        timestamp = tweet.find('a', 'tweet-timestamp')['title']
        tweet_timestamp = dt.datetime.strptime(timestamp, '%H:%M - %d %b %Y')
- The ‘text’ after tweet.find('span','username') is necessary to extract only the visible text, excluding all HTML elements.
- Since the tweets are written in Turkish, they probably contain non-ASCII characters, so it is necessary to encode them as UTF-8.
- The date in the Twitter message is written in a human-readable format. To convert it to a datetime object that Python can work with, we need to use datetime’s strptime method. To do this we additionally need to import the datetime and locale packages:
    import datetime as dt
    import locale

    # 'turkish' is the Windows locale name; on Linux use e.g. 'tr_TR.UTF-8'.
    locale.setlocale(locale.LC_ALL, 'turkish')
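To see how the format string maps onto a timestamp, here is a minimal sketch that parses a single value. The sample string is a made-up English-language timestamp, chosen so that the example runs without changing the locale; Turkish month names would need the locale setting shown above.

```python
import datetime as dt

# %H:%M = hour and minute, %d = day, %b = abbreviated month, %Y = year
TIMESTAMP_FORMAT = "%H:%M - %d %b %Y"

# Hypothetical timestamp; with Python's default "C" locale, %b matches
# English month abbreviations such as "May".
sample = "14:05 - 3 May 2015"
parsed = dt.datetime.strptime(sample, TIMESTAMP_FORMAT)

print(parsed.isoformat())
```

Once parsed, the value is an ordinary datetime object, so tweets can be sorted or grouped by day with the usual datetime tools.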
Scraping pages with infinite scroll
In principle this should be enough to scrape all of the Twitter messages containing the keyword ‘akparti’ within the specified dates. However, Twitter's website uses infinite scroll, which means it initially shows only ~20 tweets and keeps loading more tweets as you scroll down. So a single request will only get you the initial 20 tweets.
One of the most commonly used solutions for scraping pages with infinite scroll is Selenium. Selenium can open a web browser and scroll down to the bottom of the page (see stackoverflow), after which you can scrape the page. I do not recommend this. The biggest disadvantage of Selenium is that it physically opens up a browser and loads all of the tweets. Nowadays tweets can also contain videos and images, and loading these in your web browser will be slower than simply loading the source code of the page. If you are planning on scraping thousands or millions of tweets, it will be a very time-consuming and memory-intensive process.
There must be another way!
Let's open up the Chrome developer tools again (Ctrl + Shift + I) to find a solution for this problem. Under the Network tab, you can see the GET and POST requests that are sent the instant you reach the bottom and the page is filled with more tweets.
In our case this is a GET request which looks like
The interesting parameter in this request is the parameter
Here the first number is the tweet-id of the first tweet on the page and the second number is that of the last tweet. Scrolling down to the bottom again, we can see that this parameter has become
So every time new tweets are loaded on the page, the GET request above is sent with the ids of the first and last tweet on the page.
At this point I hope it has become clear what needs to be done to scrape all tweets from the twitter page:
1. Read the response of the ‘regular’ Twitter URL with BeautifulSoup. Extract the information you need and save it to a file/database. Separately save the tweet-id of the first and last tweet on the page.
2. Construct the above GET Request where you have filled in the tweet-id of the first and last tweet in their corresponding places. Read the response of this Request with BeautifulSoup and update the tweet-id of the last tweet.
3. Repeat step 2 until you get a response with no more new tweets.
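The three steps above can be sketched as a simple loop. This is only an outline under stated assumptions: the two fetch functions are hypothetical placeholders for "request the page and parse it with BeautifulSoup", and a small fake timeline stands in for Twitter so that the sketch runs without network access. The actual GET request is not reproduced here.

```python
def scrape_all(fetch_first_page, fetch_next_page):
    """Collect tweets page by page until a page brings nothing new.

    Both arguments are hypothetical callables that return a list of
    (tweet_id, tweet_text) tuples. fetch_next_page receives the ids of
    the first and the last tweet collected so far.
    """
    tweets = list(fetch_first_page())                 # step 1
    if not tweets:
        return tweets
    first_id = tweets[0][0]
    last_id = tweets[-1][0]
    seen = {tweet_id for tweet_id, _ in tweets}
    while True:
        page = fetch_next_page(first_id, last_id)     # step 2
        new = [t for t in page if t[0] not in seen]
        if not new:                                   # step 3
            break
        tweets.extend(new)
        seen.update(tweet_id for tweet_id, _ in new)
        last_id = new[-1][0]                          # update the last id
    return tweets


# Stand-in data source: tweet ids count down from 100 to 89,
# five tweets per "page".
FAKE_TIMELINE = [(i, "tweet %d" % i) for i in range(100, 88, -1)]


def fake_first_page():
    return FAKE_TIMELINE[:5]


def fake_next_page(first_id, last_id):
    older = [t for t in FAKE_TIMELINE if t[0] < last_id]
    return older[:5]


collected = scrape_all(fake_first_page, fake_next_page)
print(len(collected))
```

Keeping the set of ids already seen guards against saving the same tweet twice if consecutive responses overlap.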
 Here are some good documents to get started with tweepy: