## Part 3, Section 2: Real-Time Data Harvesting.

### Python and Twitter.

#### What is Twitter?

If you already know about Twitter, skip this and the next section. Otherwise: Twitter is a social network where an individual can post small snippets of text (140 characters or fewer) called tweets, which usually deal with small events happening in their life, interesting things they've found online, or whatever else one can fit in 140 characters. Surprisingly, Twitter has been used to do substantial things like tracking diseases and assembling rioters; this has often been done by searching for tweets containing certain keywords and analyzing the results.

A hashtag, #, on Twitter is generally given at the end of a tweet and (theoretically) relates a tweet to some larger trend or topic. Hence, a tweet like "fever, runny nose, sore throat, and I still have to go into work :( #unfair #sick #flu #worksucks" tells us that this person is a bit sick but they still must go into work. The related hashtags are "unfair", "sick", "flu", "worksucks" of which the middle two would be useful in tracking a disease, for example.

An @ on Twitter is used to reply to or talk about a person. For example, if there were a user called "i_hate_twitter" (there is no such user; I checked!) then in order to talk about them or reply to one of their tweets, one might write something like: "@i_hate_twitter why do you hate twitter so much? #h8r #twitter #somuchhate". One may also retweet a posting; this is just sharing a tweet that someone else has made.
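Since most of the analysis later on comes down to picking hashtags and mentions out of tweet text, here is a minimal sketch of how that might look in Python. The two patterns are a deliberate simplification (a `#` or `@` followed by letters, digits, or underscores) and not Twitter's actual rules; the function names are made up for illustration.

```python
import re

# Simplified patterns: '#' or '@' followed by letters, digits, or underscores.
# Twitter's real rules are more involved; this is only an approximation.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def extract_hashtags(text):
    """Return the hashtags in a tweet, lowercased, without the leading '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def extract_mentions(text):
    """Return the usernames mentioned in a tweet, without the leading '@'."""
    return MENTION_RE.findall(text)


tweet = ("fever, runny nose, sore throat, and I still have to "
         "go into work :( #unfair #sick #flu #worksucks")
reply = "@i_hate_twitter why do you hate twitter so much? #h8r #twitter #somuchhate"

print(extract_hashtags(tweet))
print(extract_mentions(reply))
```

Filtering the first list for words like "sick" or "flu" is, roughly, the keyword-based disease tracking mentioned earlier.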

As far as social networks go, Twitter is currently one of the largest and most important. Twitter has an API which will allow us to investigate tweets and do some neat analysis with them. As a side-note, we will work with R later, using the marvelous free textbook which will also go over the Twitter API in some detail. We work with Python first only because the reader will most likely be more familiar with Python than R.

#### Will I need to sign up to Twitter?

For this course, yes; you don't have to tweet, but you will have to have an account on twitter. If you already have an account, you may just use that. If not, go to twitter and sign up.

#### Okay, I have an account; now what?

We'll now make a sample application; this'll allow us to use the Twitter API.

Here are the steps.

Setting Up.

• 2. Go to the Upper-Right and sign-in (if you aren't signed in) or roll over your corresponding twitter icon. A menu will drop down and say 'My Subscriptions', 'My Applications', 'Sign Out'. Click My Applications.

Here's an example of how my dropdown looks.

• 3. Click on the 'Create a new application' button and fill out the page with reasonable things. I named my application 'Practice1', but you can name yours anything you want.

• 4. After creating your app, click it and you'll see a bunch of tabs saying things like 'Details', 'Settings', 'OAuth tool', and so forth. Click on the 'Settings' tab, scroll down, and make sure that 'Read, Write, and Access Direct Messages' is selected.

• 5. At the bottom of the 'Details' tab, click the 'Create My Access Token' button to create the access_token and access_secret.

• 6. Click the 'OAuth Tools' tab. You should see four fields with lots of letters and numbers and maybe a dash or two. These will become extremely important soon, so just remember where they are.

And that's all you need to know about setting up in Twitter.
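The four values on the OAuth page are effectively passwords for your account, so you may not want to paste them into a file you might share. One possible approach, sketched below, is to keep them in environment variables and read them at run time. The variable names and the `load_credentials` helper are hypothetical; nothing in Twitter or Twython requires them.

```python
import os

# Hypothetical variable names; pick whatever names you like.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(environ=None):
    """Return the four OAuth values as a tuple, in the order listed above.

    Raises KeyError naming every variable that is missing or empty.
    """
    if environ is None:
        environ = os.environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in REQUIRED_VARS)


# Demonstration with a stand-in dictionary instead of the real environment.
fake_environ = {name: "value-for-" + name.lower() for name in REQUIRED_VARS}
consumer_key, consumer_secret, access_token, access_secret = load_credentials(fake_environ)
print(consumer_key)
```

In a real program you would call `load_credentials()` with no argument so that it reads from `os.environ`.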

#### Twython.

To work with the Twitter API we'll be using Twython. Depending on what OS you're running, getting Twython ranges from super-easy to slightly-irritating. We've worked with pip before, so if you've still got pip you can simply use the command

pip install twython

and you're good to go. If not, then you can choose to use Python's easy_install. If you're on Windows and these things are mystifying, try this solution. If worst comes to worst, go back and (re)-install pip.

I'll also note that I'll be using version 3.0.0 of Twython, so if you're reading this in the far-future, some commands may be slightly different.

#### Tweet with Python.

The point of this section will be to introduce OAuth and demonstrate how we can write a program which will produce a tweet on our account when we run it. Create a new Python program called updatetwitter.py; you can do this in whatever text or Python editor you'd like.

First, let's import Twython and save some variables. The four variables you'll need are the four long sequences of letters and numbers from the OAuth page:

from twython import Twython

consumer_key = "yourkey"
consumer_secret = "yoursecret"
access_token = "youraccesstoken"
access_secret = "youraccess_secret"


where you replace the strings "yourkey" with the consumer key for your twitter program, and same for the rest. Twython makes it pretty easy to authenticate and start using the twitter API.

twitter = Twython(consumer_key, consumer_secret,
access_token, access_secret)


Note that I broke the command into two lines to fit the page; you can keep it on one line when you write it. Okay, that was fun. To test whether everything worked, we will attempt to update our status from this program. If this works, perfect; if not, check for misspellings, make sure you've copied all of the variables correctly, and so forth. If you keep getting an error, google the error and see if others have had this problem; most of the time, there's a solution on a forum somewhere. Using update_status() will update our status with whatever we put inside of it. For example,

twitter.update_status(status="Twython is better than love.")

will tweet that message. Check to make sure it does; if not, then something is wrong. Either this documentation is out of date (let me know!) and there are some updated commands, or you've typed in something incorrect. Look at the error it returns, google it, and see what others have done.
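One mistake you can catch before Twitter does is a tweet that is too long. The sketch below wraps the update_status call in a small check against the 140-character limit. `post_status` and `FakeTwitter` are made-up names: the fake client only stands in for a real Twython object so the example runs without credentials, and its return value is invented.

```python
MAX_TWEET_LENGTH = 140  # the limit described at the start of this section


class FakeTwitter(object):
    """Stand-in for a Twython client so the example runs without credentials."""

    def __init__(self):
        self.sent = []

    def update_status(self, status):
        # A real client would contact Twitter; we only record the text.
        self.sent.append(status)
        return {"text": status}


def post_status(client, text):
    """Post text through client.update_status, refusing bad lengths."""
    if not text:
        raise ValueError("refusing to post an empty tweet")
    if len(text) > MAX_TWEET_LENGTH:
        raise ValueError("tweet is %d characters; the limit is %d"
                         % (len(text), MAX_TWEET_LENGTH))
    return client.update_status(status=text)


client = FakeTwitter()
post_status(client, "Twython is better than love.")
print(client.sent)
```

To use it for real, pass the `twitter` object created above in place of `client`.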

#### True Life: Terrible Documentation.

One of the major problems that I've found while browsing through some API wrappers (like Twython) is that if you search for something like "twython tutorial", you'll bring up the same two tutorials over and over and these tutorials will ordinarily tell you how to authenticate yourself and how to update your status, and perhaps how to pull some tweets from the main page — and that's about it. For the beginner this kind of thing is a bit upsetting: not only will you not know how to do more with Twython, you will not know how to go about learning more!

This section will follow the "teach a man to fish" mentality: we will think about something we'd like to do and then learn how we can learn to do this.

For the sake of choosing something arbitrary, let's look at local trends. Currently on Twitter, depending on your location, you can look at "local trends" which are common or "hot" hashtags people are using in your area. For example, right now in New Orleans there is a huge thunderstorm and one of the trending topics is "storm". Some of these are more interesting than others, of course.

So. What do we do? Well, you can try to google a tutorial or a "how do I...", but currently there is no obvious solution on the first few pages of google, except for two pages which give the solution for a previous version of Twython which will not work with our version. So, let's get our hands dirty. I'll try to make this as general as possible so that you can always follow these steps. But clicking on the following things will give the "specific" example with Twython.

Finding Commands in your API Wrapper.

• 1. Figure out what you want to do.

This step is clear; if you don't know what you can do with the API then you've got a problem. For us, we want to find a list of local trends on Twitter using Twython.

• 2. Find the wrapper's repository on github (or some related code-storing site).

This might be a scary part for those of you who are not familiar with what github is. We won't go over it now, but 99% of the tools you use will be somewhere on github. For example, searching for twython brings us to its github page. Essentially, this stores all the code that makes up Twython. There'll most likely be a ton of folders and files on the github page.

• 3. Look for a main folder; usually this will be the name of the tool, 'main', or something like that. If you aren't sure, take a guess. Note that the '..' at the top of the files will bring you back to the previous directory if you make a mistake.

For us, currently, the main folder looks like it's the 'twython' folder. Let's click on it. Currently, there's a few files in it: '__init__.py', 'api.py', and so forth.

• 4. If Possible, look for the file which contains the functions that you need. You will most likely have to guess for this one.

For us, we know that it won't be '__init__.py', since that's usually an initialization file and doesn't contain the main stuff by itself; but if you don't know, you could click on it and see that it doesn't have any obviously useful (for us!) functions. Let's go down the list. The next for me is 'advisory.py', which only contains one function. That's probably not what we want. Next. Going down to 'api.py', we have a ton of functions which look promising (remember, functions are commands which start off with 'def', so 'def post(self...)' means that this part is defining the function 'post' with the parameters in ()'s.). Just in case, let's look at the other files: compat.py has nothing interesting for us in it, helpers.py looks equally uninteresting to us, and exceptions.py contains some things, but nothing majorly related to what we want to do. Last, we get to endpoints.py and we gasp! This file seems to have everything we could ever want to do with the twitter API: get_mentions_timeline, get_retweets, show_status...these seem like things extremely related to twitter and things we'd like to do with twitter. It looks like the file we want to look at is, somewhat surprisingly, endpoints.py.

tl;dr, the file you want is endpoints.py.

• 5. Look for the relevant function in that file.

Remember, we wanted local trends. Near the end of the file, we find the function

def get_closest_trends(self, **params):
    """Returns the locations that Twitter has trending topic information
    for, closest to a specified location.

    Docs: https://dev.twitter.com/docs/api/1.1/get/trends/closest

    """
    return self.get('trends/closest', params=params)


which is exactly what we want. Good so far. We just have to figure out how to use it. Keep this stuff open just in case we need to look at it again.
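Scrolling through the file on github is not the only way to find a function. As an alternative (this is a different technique from the one in the steps), Python can list a class's methods and docstrings for you. The sketch below does this for a toy class whose method names mirror the ones we care about; `ToyWrapper`, its docstrings, and `find_methods` are all made up for illustration. If Twython is installed, passing the `Twython` class instead should behave similarly, but treat that as an assumption to check.

```python
import inspect


class ToyWrapper(object):
    """Hypothetical stand-in for an API wrapper class such as Twython."""

    def get_closest_trends(self, **params):
        """Returns the locations closest to a specified location."""

    def get_place_trends(self, **params):
        """Returns the trending topics for a specific place."""

    def update_status(self, **params):
        """Updates the authenticating user's status."""


def find_methods(cls, keyword):
    """Return {method name: first docstring line} for names containing keyword."""
    found = {}
    for name in dir(cls):
        attr = getattr(cls, name)
        if keyword in name and callable(attr) and not name.startswith("_"):
            doc = inspect.getdoc(attr) or ""
            found[name] = doc.splitlines()[0] if doc else ""
    return found


for name, summary in sorted(find_methods(ToyWrapper, "trends").items()):
    print(name + ": " + summary)
```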

• 6. Look at the documentation, or the API documentation.

In this case, the author has kindly given us a link to the Twitter API documentation, but let's pretend he did not. Look at what the function is returning: self.get('trends/closest', params=params). Don't worry about the 'self' part, but the get('trends/closest') part looks interesting. Let's google this. Search for 'twitter api get(trends/closest)' and see what comes up. For me, the page which comes up first is this page for the 1.1 API which is luckily what we need. Looking at this page, we see that there is a Parameters section which tells us the required and optional parameters. Nice. It seems to use something called the WOEID (where on earth ID) which you can look up here. For kicks, I've looked up Orleans Parish in Louisiana, USA and found the relevant geo location was latitude = 29.95 and longitude = -90.08.

• 7. Try out the function and see if it works.

For us, we can use the same file as above in the previous section, but we can get rid of the twitter.update_status line. Instead, we'll write in our function and store it as a variable (since, in this case, we expect twitter to give us back a list of some kind with all of the trends). Let's do this as follows:

results = twitter.get_closest_trends(lat = 29.95, long = -90.08)

Notice that we needed a long and a lat parameter, so we've included them in the ()'s so that the twitter API knows which is which. This is, in general, what you should do: look at the variable names you need, put those in the form 'variablename = ' and then put the value you'd like afterward. Let's make sure this worked.
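If the `**params` in the function definition looks mysterious, the toy example below shows what it does: every 'variablename = value' pair you pass is gathered into a single dictionary, which the wrapper then hands to its get method. `ToyClient` is a hypothetical stand-in that only reports what would be sent; it does not contact Twitter.

```python
class ToyClient(object):
    """Hypothetical stand-in that mimics the shape of the wrapper's methods."""

    def get(self, endpoint, params=None):
        # A real wrapper would send an HTTP request here; we just report
        # what would have been sent.
        return {"endpoint": endpoint, "params": params or {}}

    def get_closest_trends(self, **params):
        # **params gathers every keyword argument into one dictionary.
        return self.get("trends/closest", params=params)


toy = ToyClient()
request = toy.get_closest_trends(lat=29.95, long=-90.08)
print(request["endpoint"])
print(sorted(request["params"].items()))
```

This is why the parameter names have to match the ones on the Twitter API page: the wrapper passes them along without checking them.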

• 8. You may need to print the value if nothing happens.

If you get an error, something went wrong; check what you wrote, check the variable names, etc., or google the error and see if others have had that problem. If you don't get an error, you may be disappointed: nothing obvious happens. Oops. We forgot to print our variable. Just put in

print results

to see what Twitter is giving us.

• 9. See if this is the data you want.

In this case, we get a somewhat mysterious list; actually, because it is between {}'s, this is a dictionary in Python. If you aren't familiar with this structure, you ought to look it up and read a bit about it before moving on. Either way, you should see the name of the place you've looked up (or a close location) somewhere in this list. Note that the u you see everywhere means 'unicode' and, for us at this moment, isn't important, and you can just ignore all the u's.
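If dictionaries are new to you, here is a small sketch of pulling values out of one. The sample data is made up, and it assumes the response is a list holding one dictionary per location with 'name' and 'woeid' keys; if what you get back is a bare dictionary, drop the outer list. `summarize_locations` is a hypothetical helper name.

```python
# Made-up sample shaped like a trends/closest response: a list holding one
# dictionary per location. Only 'name' and 'woeid' are used below.
sample_closest = [
    {
        "name": "New Orleans",
        "country": "United States",
        "woeid": 2458833,
    }
]


def summarize_locations(locations):
    """Return a list of (name, woeid) pairs from a list of location dicts."""
    return [(place["name"], place["woeid"]) for place in locations]


for name, woeid in summarize_locations(sample_closest):
    print("%s has WOEID %d" % (name, woeid))
```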

So, is this what we wanted? No. Do we give up? Never!

• 10. If this isn't what you want, look at the related functions and repeat until you get what you want or you die trying.

On the Twitter API page, look at the tab that says 'What links here'; this will tell us most of the related topics. For this, we have a few: they're all GET trends/ but we can choose from 'place', 'available', 'closest'. Well, we've tried closest already. Let's try 'place'. Clicking on it, and reading the first sentence, we see that this returns the top ten trending topics for a specific WOEID! This is exactly what we want! Moreover, we even know what a WOEID is now! Exciting. The only required parameter is 'id', and we need to get that from the WOEID website. We find that, for example, Orleans Parish, Louisiana, USA has the WOEID = 2458833. Great. Okay, we know what we want; we want some function in Twython that has this GET trends/place command in it. Go back to the Twython documentation (that thing on github; the file was /twython/endpoints.py, if you forgot). Let's look around the 'trends' functions again: there's get_place_trends, which looks promising. We probably ought to have picked this one in the first place, but it's okay because we were able to correct our mistake before it became a catastrophic nightmare.

Okay, now that we know what we're doing, go back to the program and delete that get_closest_trends line (before you do, notice that it actually does give us the WOEID in the dictionary! How nice of it.) and replace it with the function we just found:

results = twitter.get_place_trends(id = 2458833)
print results

We get a long dictionary back, but looking through it gives us the relevant trends (these will be the things that look like #loveyourself, #hotdogs, superman, or other things with or without hashtags). Good going, we got what we wanted! Well, almost. This dictionary thing is really cramping my style. I wish we had it in a nicer format...

• 11. Format the data.

This is the part where we need to know a little Python. Look at the 'example request' part on the Twitter API trends/place that we just looked at. This looks complicated, but we notice that because it is surrounded with []'s at the beginning and end, the outer structure is a list. Then, it starts a dictionary with a few terms ('as_of', 'created_at', and so on). The relevant item we'd like to look at is 'trends'. So, we need to tell Python to look inside the list (since there's only one part of the list, this is given by results[0]), and then look at the 'trends' part. This is the part where your Python skills might be a bit weak if you've just started Python: look at some examples of lists and dictionaries, and how we can extract elements from both. It's not hard, it's just a bit strange at first. Either way, let's try:

print results[0]['trends']

Recall that results[0] will return the first element of the outer list (which is actually just the big dictionary inside), then we want to look at the value of 'trends' in this dictionary so we use ['trends'] on results[0] to do this.

We get close. We now get back a list of dictionaries (again!). Looking in the dictionaries, it looks like we'd like to take the 'name' value from each. So, we need to tell Python something like, 'for each dictionary in this list, print out the name part.' Luckily, the actual code for this is nearly the same as what I just typed:

for trend in results[0]['trends']:
    print trend['name']

Remember that the 'trend' part here is an arbitrary name; I could have said for d in results[0]['trends']: print d['name'] and it would give me back exactly the same thing. If you did everything right, you should have printed out a list of the current trends near your place-of-choice. Good going!
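The loop described above can also be wrapped in a small function, which makes it easier to reuse and to cope with an empty response. The sketch below runs on made-up sample data shaped like the 'example request' on the Twitter API page (a one-element list whose dictionary holds a 'trends' list); the timestamps and the helper names `trend_names` and `hashtags_only` are invented for illustration.

```python
# Made-up sample shaped like a trends/place response: a one-element list
# whose dictionary has a 'trends' list of dictionaries.
sample_results = [
    {
        "as_of": "2013-01-01T00:00:00Z",
        "created_at": "2013-01-01T00:00:00Z",
        "trends": [
            {"name": "#loveyourself"},
            {"name": "#hotdogs"},
            {"name": "superman"},
        ],
    }
]


def trend_names(results):
    """Return the 'name' of every trend, or [] if the response is empty."""
    if not results:
        return []
    return [trend["name"] for trend in results[0].get("trends", [])]


def hashtags_only(names):
    """Keep only the trend names that start with '#'."""
    return [name for name in names if name.startswith("#")]


names = trend_names(sample_results)
print(names)
print(hashtags_only(names))
```

With a real client you would pass the value returned by get_place_trends in place of `sample_results`.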

This may have seemed unnecessarily long; and, indeed, it was purposely a bit verbose to point out some of the things that could go wrong and some of the ways we can fix this. We'll be doing some of this function-hunting later but we will be much briefer with it. Nonetheless, as practice, you should look up how to return the user's (your) top few recent tweets with Twython using the process above and print them out nicely. My potential solution is here but don't look at it until you've tried it yourself!

