DyingLoveGrape.


Part 1, Section 3:
Introduction to Using Python for Data Analysis.





Importing Data, Manipulating Lists, and Using Numpy.

In this section, we're going to do a bit of analysis of a small-ish data set, just to get some practice in. The data we will be working with is this file on population change, which comes from the United States 2010 Census data; I've trimmed it down significantly so that we can work with it more easily.

When you click on the file's link, your browser will probably pull up a whole lot of data all separated by commas. This file is called a csv file ("Comma-Separated Values") and is common when working with data; for example, Excel can read csv files and save files in csv format. The point of csv files is that the top line without a #, if any, gives the column headers and the remaining lines fill in the data.
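To make the layout concrete, here is a minimal sketch of what a csv file looks like to Python. The three lines below are made-up stand-ins in the same shape as our file (a state name and two population counts), not the real census numbers. I've written print with parentheses so the snippet also runs on newer versions of Python.

```python
import csv

# Hypothetical lines in the same shape as the census file (made-up numbers).
lines = [
    "State,Pop1910,Pop2010",
    "Stateland,1000,2500",
    "Otherstate,300,450",
]

# csv.reader accepts any iterable that yields one line of text at a time,
# so a plain list of strings works as well as an open file.
rows = list(csv.reader(lines))

print(rows[0])  # the header row
print(rows[1])  # the first data row; every value comes back as a string
```

Notice that the numbers come back as strings like "1000"; that detail will matter once we start doing arithmetic.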

Importing CSV File Data.

Open up Spyder and get to a new file. We're going to use the csv library to start off since it's relatively transparent with what is happening. We'll still import numpy because I have a hunch that we'll need it in a bit.

import numpy as np
import csv as csv

We begin by using the csv command reader to read the csv file and change it into a more Python-friendly format. We need to use an "open(file)" command inside of the reader command to have it read our file like this:

readdata = csv.reader(open("where_you_saved_the_file"))

Make sure you put your file's location in quotes inside open(). Now, let's type in print readdata to see what reader gives us; it gives us something like this:

<_csv.reader object at 0x05397EF0>

Well, that's certainly not what we wanted. The idea here is that to print out or save this data we need to iterate over it. For example, you can try putting in:

for row in readdata:
  print row

This should give you all of the rows of the dataset. Great, neat. But that's not exactly what we want: it only prints the data, so we can't do a whole lot of manipulation. Feel free to delete this "for" statement and the "print" statement. We're going to do a little trick to keep all this data in a list. We're going to initialize a blank list called "data" by using the command data = []. We'll then append each row to the list, making a list of rows. It's almost the same as the print statement above, and goes a little something like this:

data = []

for row in readdata:
  data.append(row)

Now try to print the list data. You should see all of the rows; each row will be in the form of a list, and it'll contain a state name, the population in 1910, and the population in 2010. The first row, data[0], gives you the headings. For reasons that will become apparent in a second, we're going to want to make a list which is just our data and another list which is just our headings. We can accomplish this by using the following code:

Header = data[0]
data.pop(0)

As we noted, the headings are at data[0]. The .pop(0) command tells Python to take the list data and remove the 0-th element (the header element, in this case). After popping off that element, the list data will only contain the data without the headers. You can print these two variables out to make sure! Your entire code should look like this for now:

import numpy as np
import csv as csv

readdata = csv.reader(open('your_file_here', 'r'))

data = []

for row in readdata:
    data.append(row)

Header = data[0]
data.pop(0)

Don't delete this code yet; we're going to use it in the next section.
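One thing worth knowing about the reader object: it can only be walked through once. The sketch below shows this, along with a slightly different way to split off the headings. It is not the code we'll keep using; it swaps in a "with" block and the built-in next() in place of the loop and .pop(0) above. To let it run on its own, it first writes a tiny made-up file (the name sample.csv and its numbers are stand-ins, not the census file).

```python
import csv
import os
import tempfile

# Write a tiny stand-in file so the example runs on its own.
# The rows are made up; they are not the real census figures.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as f:
    f.write("State,Pop1910,Pop2010\nStateland,1000,2500\nOtherstate,300,450\n")

# "with" closes the file for us once the indented block ends.
with open(path, "r") as f:
    readdata = csv.reader(f)
    Header = next(readdata)    # pull off the first row (the headings)
    data = list(readdata)      # everything that is left is data
    leftover = list(readdata)  # the reader is used up now, so this is empty

print(Header)
print(data)
print(leftover)
```

If you ever loop over readdata a second time and get nothing back, this is why: you'd need to open the file and build a new reader again.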

Displaying Our Beautiful Data.

At this point, you might be anxious to see this data in a pretty table format. You could try to build one with a bunch of print statements and so forth, but it takes a lot of work and creativity to make a nice table format, especially if you try to make it work for general datasets. At this point, we need to import pandas.

Pandas is a sweet data analysis tool that we'll use much more later but, for now, it's useful as a way to make a pretty table for our data. To import pandas we just put it up with our other import statements:

import pandas as pd

Here's the beautiful part about pandas: it's super easy to do a lot of things. For example, to make a table (after looking at the documentation to see which function makes a table!) we use the DataFrame command (those of you who know some R will recognize this command). At the end of the document, put the command:

print pd.DataFrame(data, columns=Header)

This tells pandas to make a table out of our data and to use the Header list as the column names. Printing this out, we see that this is exactly what we want. Nice!
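If you'd like to try the DataFrame command without the census file on hand, here is a small self-contained sketch. The Header and data lists below are made-up stand-ins with the same shape as ours, and the variable name table is just something I picked.

```python
import pandas as pd

# Made-up stand-ins for the Header and data lists built earlier.
Header = ["State", "Pop1910", "Pop2010"]
data = [
    ["Stateland", "1000", "2500"],
    ["Otherstate", "300", "450"],
]

# Each inner list becomes one row; Header supplies the column names.
table = pd.DataFrame(data, columns=Header)

print(table)
print(table.shape)  # (number of rows, number of columns)
```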

But wait, I want to edit this data...!

So, the data that we've been using is real data. It tells us the population of a state in 1910 and 2010. Wouldn't it be neat if we could find the difference of these two columns and make it another column? Of course!

With how we've made the data (at a low-level, with lists) we need to just append this data. Start a new file and copy-paste most of the previous file:

import numpy as np
import csv as csv
import pandas as pd

readdata = csv.reader(open('your_file_here', 'r'))

data = []

for row in readdata:
    data.append(row)

Header = data[0]
data.pop(0)

At the end of this, we want to append a bit to the header. Let's append the element "Difference" which will stand for the difference column:

Header.append("Difference")

Okay, that's fine, but now we need to compute a difference entry for every row. This would be annoying if we didn't have for loops in Python. Luckily, this is pretty easy:

for i in range(len(data)):
  diff = int(data[i][2]) - int(data[i][1])
  data[i].append(diff)

Seasoned Python-ers can make this code a bit more streamlined, but this is good enough. This makes a variable diff which is the difference of the 3rd and 2nd columns of each row (remember that the third column for row i is given by data[i][2], for example). We needed to include int() around each of these because the csv reader hands us every value as a string; int() converts them to integers so that we can add and subtract them. Last, we append the difference to make a new column for the i-th row.
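For the curious, here is one way the loop could be streamlined: loop over the rows themselves instead of over their index numbers. This is only a sketch, and the data list below is a made-up stand-in so that it runs on its own.

```python
# Made-up stand-in for the data list; the values are strings,
# which is how csv.reader hands them to us.
data = [
    ["Stateland", "1000", "2500"],
    ["Otherstate", "300", "450"],
]

# Each "row" is the same list object that lives inside data,
# so appending to it changes data as well.
for row in data:
    row.append(int(row[2]) - int(row[1]))

print(data)
```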

Good job. Now print out the table again to see your masterpiece:

print pd.DataFrame(data, columns = Header)

Notice that the same command works regardless of how big the data set is or how big the header is, so long as the number of columns in the data matches the size of the header.

An Extremely Basic Statistical Analysis.

Two questions. How many people were living in the United States (according to this data) in 1910 and in 2010? What were the mean and standard deviation for the 1910 population data and the 2010 population data? These are things we should be able to answer easily.

While there are certainly more powerful methods to do this, for now we're going to just use Python's list manipulation tools. Let's start a new file in Spyder and copy most of the previous code into it:

import numpy as np
import csv as csv
import pandas as pd

readdata = csv.reader(open('your_file_here', 'r'))
data = []
for row in readdata:
    data.append(row)
Header = data[0]
data.pop(0)

Okay, we're used to all this code by now. Let's make a list that consists only of the 1910 population data and one which consists only of the 2010 data. This turns out to be fairly easy using a for statement:

pop1910 = []
pop2010 = []
for i in range(len(data)):
  pop1910.append(int(data[i][1]))  # second column: 1910 population
  pop2010.append(int(data[i][2]))  # third column: 2010 population

Feel free to print out these variables to see if it gives you the correct values. We've already done something like this before, so I won't explain what's going on this time; if you're confused, go back and look at the last section when we did something similar to this.
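As an aside, Python has a shorthand for "build a list by looping", called a list comprehension. A minimal sketch of the same column-pulling step, using a made-up stand-in for the data list:

```python
# Made-up stand-in for the data list (strings, as csv.reader returns them).
data = [
    ["Stateland", "1000", "2500"],
    ["Otherstate", "300", "450"],
]

# Read as: "the integer version of row[1], for every row in data".
pop1910 = [int(row[1]) for row in data]
pop2010 = [int(row[2]) for row in data]

print(pop1910)
print(pop2010)
```

Either style gives the same lists; use whichever one you find easier to read.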

What's nice about lists in Python is that they're easy to operate on. Especially with numpy!

The way to add all the elements in a list together is np.sum(your_list).

The way to find the mean of all the elements in a list is np.mean(your_list).

The way to find the standard deviation of all the elements in a list is np.std(your_list).
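Here's a quick sketch of all three on a small made-up list (not census data), so you can check the numbers by hand. One detail to be aware of: by default np.std divides by the number of elements (the "population" standard deviation). Passing ddof=1 makes it divide by one less than that, which is the "sample" version you may have seen in a statistics class.

```python
import numpy as np

values = [2, 4, 4, 4, 5, 5, 7, 9]  # small made-up list

total = np.sum(values)
mean = np.mean(values)
std_population = np.std(values)      # default: divide by n
std_sample = np.std(values, ddof=1)  # divide by n - 1 instead

print("Total: %d" % total)
print("Mean: %f" % mean)
print("Std (population): %f" % std_population)
print("Std (sample): %f" % std_sample)
```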

In general, typing np. and looking at Spyder's autocomplete will usually point you to the function you're looking for. If we want to print all this stuff out nicely, we might do something like:

print "Total in 1910: %d" % (np.sum(pop1910)) 
print "Average in 1910: %d" % (np.mean(pop1910))
print "Standard Deviation in 1910: %d" % (np.std(pop1910)) 

print "Total in 2010: %d" % (np.sum(pop2010)) 
print "Average in 2010: %d" % (np.mean(pop2010))
print "Standard Deviation in 2010: %d" % (np.std(pop2010))

But...if we were really clever, we might want to make a function which does this for us (this might be useful in case we had 100 years instead of just two). This is extra credit for the reader, but I encourage you to try it out!
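If you'd like something to compare your attempt against, here is one possible sketch. The function name summarize, the dictionary called columns, and the numbers in it are all made up for illustration; with the real data you would pass in pop1910 and pop2010 instead.

```python
import numpy as np

def summarize(label, values):
    """Print and return the total, mean, and standard deviation of a list."""
    stats = {
        "total": np.sum(values),
        "mean": np.mean(values),
        "std": np.std(values),  # population standard deviation (numpy's default)
    }
    print("Total in %s: %d" % (label, stats["total"]))
    print("Average in %s: %d" % (label, stats["mean"]))
    print("Standard Deviation in %s: %d" % (label, stats["std"]))
    return stats

# Made-up numbers keyed by year; one call handles any number of years.
columns = {"1910": [1000, 300], "2010": [2500, 450]}
results = {}
for year in sorted(columns):
    results[year] = summarize(year, columns[year])
```

Returning the numbers as well as printing them means we can keep using them later instead of just looking at them.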

What now?

Next, we're going to use data like this, but we're going to create some charts (because everyone likes pictures).


