
Part 2, Section 2:
Getting Deeper into NumPy and Pandas.

Introducing pandas.

pandas Series.

I'll admit, the previous parts were a bit clunky. It just doesn't feel natural to analyze data this way. Those with experience using R, SPSS, SAS, or even Excel will find themselves wondering why we would go through such difficulty to work with data when these programs make such operations nearly trivial. pandas (with a lowercase "p") was built to make data analysis in Python feel more natural and to extend the power of the existing numpy methods of analysis.

pandas was built on numpy, but instead of the arrays we're used to, pandas uses two slightly different "fundamental" data structures. Luckily, these will both seem familiar to us based on what we've already done.

Let's import a few things. Because we will be using pandas, we clearly need to import pandas (with the usual abbreviation "pd"), but we will be using Series and DataFrame so often that we will simply import them by name. (This is like importing a function from numpy directly so that we could call it without putting an "np." before it.) We do that by typing:

>>> import pandas as pd
>>> from pandas import Series, DataFrame 

Good. Once we've imported all of this, we can begin to work with pandas.

The first structure we want to work with is called a Series. Like a basic list or numpy array, the Series provides a way to enter data in list-form but extends the idea by allowing us to specify the index; this means we are no longer tied to using 0,1,2,3,... for the index of our list, but may use nearly whatever we'd like (though 0,1,2,... will be the default if nothing else is specified). Let's try this out.

>>> data1 = Series([6,77,888,9999])

If we were to look at data1[0] or data1[1] we would get 6 and 77 respectively, as usual. We could check on the index of data1 by using the index property like so:

>>> data1.index

This will return something which looks like RangeIndex(start=0, stop=4, step=1) (older versions of pandas print Int64Index([0,1,2,3]) instead), which is just telling us that data1 has the indices 0,1,2,3.
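Here is a small sketch of the same check that does not depend on how your pandas version prints the index; turning the index into a plain list is just one convenient way to look at it.

```python
from pandas import Series

data1 = Series([6, 77, 888, 9999])

# With the default index, the labels are simply 0, 1, 2, 3.
print(data1[0])            # 6
print(data1[1])            # 77

# Converting the index to a list avoids version-specific printing.
index_labels = list(data1.index)
print(index_labels)        # [0, 1, 2, 3]
```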

Now, let's specify our index. This is done by explicitly defining the index parameter when defining a Series:

>>> data2 = Series([8, 9, 11], index=["Joe", "Jane", "John"])

If we type data2 and let Python print it out, those of you who have seen R or SPSS programming will rejoice: we've made a data table! Certainly, this is a significantly nicer option than the numpy arrays we made before if we want to consider data like this. Moreover, the indexing works as we'd expect:

>>> data2["Joe"]

will return the number associated with Joe. But what happens, you may ask, if there are multiple Joes in the table? Let's try it out.

>>> data3 = Series([9,10,11], index=["Joe", "Joe", "John"])
>>> data3["Joe"]

Here, you will get a small piece of the table where each element is indexed by Joe. Indices, then, need not be unique in this data structure. But, if indices are not unique, then pandas has a slight problem when getting multiple indices. Note that for

>>> data2[["Joe", "John"]]
>>> data3[["Joe", "John"]]

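In the version of pandas this was written against, the first works but the second does not, because in data3 the index "Joe" is not unique. Recent versions handle the second call as well, returning both Joe rows followed by John. Either way, non-unique indices change what a lookup hands back (a single number versus a small table), and that is the trade-off you accept by allowing them in a Series. Ultimately, this should not come up frequently but it is something you ought to know about.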
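If you are unsure whether an index has repeats, one way to check and to select by label without relying on version-specific behavior is sketched below. Using is_unique and isin here is my own suggestion, not the only way to do it.

```python
from pandas import Series

data3 = Series([9, 10, 11], index=["Joe", "Joe", "John"])

# is_unique tells us whether any label is repeated.
print(data3.index.is_unique)      # False

# A single repeated label gives back a small Series, not one number.
joes = data3["Joe"]
print(list(joes))                 # [9, 10]

# Filtering on the index with isin keeps every row whose label matches.
picked = data3[data3.index.isin(["Joe", "John"])]
print(list(picked))               # [9, 10, 11]
```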

Remember that exercise where we increased the pay of our workers by a little bit? Let's do that again here. Let's suppose data2 above gives us our employees and their pay. We can add a dollar to each of their pay by the operation

>>> data2 += 1

Similarly, we could multiply their pay with *= and so forth. Now, pretend we are a bit curious as to who is making more than $10/hr. We can ask this sort of question in a standard way (usually called "Boolean indexing") by typing:

>>> data2[data2 > 10]

This looks strange at first, but what it is saying is: return those elements from data2 such that the element (not the index!) is greater than 10. Hence, this returns a table of index and element which has this property; in this case, it is only John.
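It may help to look at the two steps separately. The sketch below rebuilds data2, applies the raise, and stores the comparison in a variable (the name mask is just my choice) before using it to pick rows.

```python
from pandas import Series

data2 = Series([8, 9, 11], index=["Joe", "Jane", "John"])
data2 += 1                      # Joe 9, Jane 10, John 12

# The comparison alone gives a Series of True/False with the same index.
mask = data2 > 10
print(mask)

# Using that Series inside the brackets keeps only the True rows.
above_ten = data2[mask]
print(above_ten)                # only John, with 12
```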

As another example, let's recall how to make random numbers with numpy. If you haven't already, import numpy (as np) and recall that an array of 20 random numbers is given by

>>> np.random.rand(20)

If we want to make this a Series, we simply enclose it with Series() like so:

>>> data4 = Series(np.random.rand(20))

We can now mess around with a Series of random numbers. Let's list all of the ones which are less than or equal to 0.75.

>>> data4[data4 <= 0.75]

We'll go through more logical operators later, but it's nice to know this kind of thing exists.
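As a small taste, here is a sketch that combines two conditions with &. The seed is an arbitrary choice of mine so that the numbers are repeatable, and note that each condition needs its own parentheses.

```python
import numpy as np
from pandas import Series

np.random.seed(0)               # arbitrary seed, only for repeatability
data4 = Series(np.random.rand(20))

# Everything less than or equal to 0.75, as above.
small = data4[data4 <= 0.75]

# Two conditions at once: & means "and"; the parentheses are required.
middle = data4[(data4 > 0.25) & (data4 <= 0.75)]

print(len(small), len(middle))
```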

pandas DataFrames.

The DataFrame will feel much more familiar to you if you've ever worked in R or Excel. The DataFrame is a two-dimensional array with rows and columns which can have different types of data. For example, you may have a table displaying schools in the Boston area with columns that show average math score, average verbal score, and average income of families in the surrounding area. Not all of these columns need to be numerical: you may have another column which gives the Principal's name, for example.

Once you know about the Series, the idea of the DataFrame is simple: in a Series, you construct one index for your elements; in a DataFrame, you construct two.

Usually, these two indices are represented by "column headers" and "row labels". For example,

>>> data = {'school': ['Baxters', 'Racine'],'test scores': [90, 96]}  
>>> table = DataFrame(data, index =['School 1', 'School 2'])

will print out a table like this:

           school  test scores
School 1  Baxters           90
School 2   Racine           96

which is fairly neat. So, what happened here? We made this weird looking "data" variable (called a dictionary in Python) and then specified an index in the DataFrame part. Let's look at this in a bit more detail.
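To see the "two indices" idea concretely, here is a sketch that looks at the column headers and row labels of that table and then pulls out a single cell. The loc lookup (row label first, then column label) is one way to do this in current pandas.

```python
from pandas import DataFrame

data = {'school': ['Baxters', 'Racine'], 'test scores': [90, 96]}
table = DataFrame(data, index=['School 1', 'School 2'])

# The two indices: column headers and row labels.
print(list(table.columns))      # ['school', 'test scores']
print(list(table.index))        # ['School 1', 'School 2']

# One cell is addressed by a row label and a column label.
racine_score = table.loc['School 2', 'test scores']
print(racine_score)             # 96
```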

There are many ways to put data into a DataFrame (we'll look at another one later which will allow us to import CSV and Excel files like we've done in the past), but one of the simplest is the dictionary method. The idea is straightforward: you put in all your values by column. Pretend you have two columns, "name" and "sex", and you have five people with the associated sex: James (M), Jane (F), John (M), Jake (M) and Audrey (F). We'll take all the names and associate them with the name column (don't type these in yet until we make the entire dictionary):

'name' : ['James', 'Jane', 'John', 'Jake', 'Audrey']

And, we'll put in the associated sexes into the "sex" column:

'sex' : ['M', 'F', 'M', 'M', 'F']

To make sure we relate these two items (name and sex), we put them into one nice structure called a dictionary by putting them between {}'s. Now we can type this into the interpreter and save it as some variable:

>>> data = {'name': ['James', 'Jane', 'John', 'Jake', 'Audrey'],
...         'sex': ['M', 'F', 'M', 'M', 'F']}

Okay, good. Now that we have that, we can make a DataFrame from this; if we don't specify the other index (the row index) it will automatically make it 0,1,2,3,... as usual. This doesn't bother us for now, but in the future we will probably make the names of the people the index; this might make it easier to access information.

>>> table = DataFrame(data)

which, when we print table, outputs:

     name sex
0   James   M
1    Jane   F
2    John   M
3    Jake   M
4  Audrey   F

which is fantastic. At this point, some questions pop up: how would we do something like count the number of males, or to list only the females, or, if there were some test scores associated with these people, how could we take the mean, standard deviation, and so forth of everyone, just the girls, just the boys, or some other crazy combinations? We need to learn some DataFrame manipulation.
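As a preview, the first two questions can already be answered with the same Boolean trick we used on Series. This is only a sketch; value_counts is a standard pandas method, and the variable names are mine.

```python
from pandas import DataFrame

data = {'name': ['James', 'Jane', 'John', 'Jake', 'Audrey'],
        'sex': ['M', 'F', 'M', 'M', 'F']}
table = DataFrame(data)

# Count the males: True counts as 1 when summed.
n_males = (table['sex'] == 'M').sum()
print(n_males)                        # 3

# List only the females: keep the rows where the comparison is True.
females = table[table['sex'] == 'F']
print(list(females['name']))          # ['Jane', 'Audrey']

# Counts for every value in the column at once.
counts = table['sex'].value_counts()
print(counts)
```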

DataFrame Manipulation.

Let's expand on our data a bit. Let's put in the following data:

>>> data = {'name': ['James', 'Jane', 'John', 'Jake', 'Audrey'],
...         'sex': ['M', 'F', 'M', 'M', 'F'],
...         'height': [77, 56, 66, 61, 50]}

Python lets a dictionary run over several lines as long as we stay inside the braces, so I've broken this up to fit on the page nicely. Now when we make the table using the table = DataFrame(data) command and print it out, older versions of pandas put the columns in alphabetical order, while recent versions keep the order used in the dictionary. Either order might not be the one we want to show the user, so let's just note quickly how to change it. If we explicitly specify the columns we'd like to use (using the same names as we did in our dictionary) in the DataFrame command, then pandas will know to put it in that order. For example,

>>> table = DataFrame(data, columns=['name', 'age', 'sex', 'height'])

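will display a table that has the columns in that order. (The 'age' column named here assumes the dictionary also has an 'age' list, which is not shown above; a column that is named in columns but missing from the dictionary is filled with NaN.) This makes us question why we've explicitly noted them in the dictionary part as well; honestly, it isn't necessary (as we'll see when we import csv files), though it does allow us to quickly look at our data in dictionary form and see which data goes with which heading.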
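Here is a sketch of the column-ordering idea that runs on its own. The 'age' values below are hypothetical: the text never lists them, so I have picked five numbers only so that their mean matches the 17.2 shown further down. The 'weight' column is likewise made up to show what happens with a missing column.

```python
from pandas import DataFrame

data = {'name': ['James', 'Jane', 'John', 'Jake', 'Audrey'],
        'sex': ['M', 'F', 'M', 'M', 'F'],
        'height': [77, 56, 66, 61, 50],
        # Hypothetical ages, chosen only so that the mean is 17.2.
        'age': [18, 17, 16, 18, 17]}

# The columns argument fixes the left-to-right order.
table = DataFrame(data, columns=['name', 'age', 'sex', 'height'])
print(list(table.columns))      # ['name', 'age', 'sex', 'height']

# A column that is requested but not in the dictionary is filled with NaN.
table2 = DataFrame(data, columns=['name', 'weight'])
print(table2['weight'].isna().all())   # True
```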

Either way, picking a column is easy; we've done it before! As you might guess, it's done by using a command like this:

>>> table['name']

will display the column 'name'. What about if we want name and age? We just make a list out of it.

>>> table[['name', 'age']]

Note this time we've put a list inside the brackets to denote the fields we want. We could have written table[['name']] the first time as well; the difference is that table['name'] returns the column as a Series, while table[['name']] returns a DataFrame with a single column. In general,

>>> table[listOfThingsYouWant]

gives us a DataFrame with just the columns you'd like. The indexing is nearly the same as it was in the numpy arrays, which is convenient since we've seen a lot of this already. For picking rows and columns together, older versions of pandas had ix (the ix_ trick with the underscore dropped); it has since been removed, and current versions use loc for labels and iloc for integer positions. Don't worry about this stuff too much, just know that it exists for when you might need it.
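A short sketch of these selections, using only the name, sex, and height columns from above. The loc and iloc accessors are the label-based and position-based lookups in current pandas; I pass columns explicitly so the positions are the same on every version.

```python
from pandas import DataFrame

data = {'name': ['James', 'Jane', 'John', 'Jake', 'Audrey'],
        'sex': ['M', 'F', 'M', 'M', 'F'],
        'height': [77, 56, 66, 61, 50]}
table = DataFrame(data, columns=['name', 'sex', 'height'])

# One column name gives a Series; a list of names gives a DataFrame.
col = table['name']
sub = table[['name', 'height']]
print(type(col).__name__, type(sub).__name__)   # Series DataFrame

# loc uses labels: row label 1, column label 'name'.
print(table.loc[1, 'name'])     # Jane

# iloc uses positions: first row, third column (height).
print(table.iloc[0, 2])         # 77
```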

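Now, suppose we want the average age and height. pandas makes this pretty easy. The numeric_only=True argument tells pandas to skip the text columns (name and sex); older versions skipped them silently, but recent versions raise an error without it. We may use

>>> table.mean(numeric_only=True)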

which gives us

age       17.2
height    62.0

which is a nice printout. The .mean command does a few other neat things (you can average along the rows with axis=1, if that's how your data is laid out, etc.) but all of this is in the documentation. We'll work with it a bit more later.

But what if we only wanted the mean height? We can do this by specifying what column we'd like, then specifying that we'd like to take the mean:

>>> table['height'].mean()
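Combining this with the Boolean filtering from earlier answers the "just the girls, just the boys" question. The sketch below uses only the heights given above; groupby is a standard pandas method that does both filters in one go, though it is not the only way to do this.

```python
from pandas import DataFrame

data = {'name': ['James', 'Jane', 'John', 'Jake', 'Audrey'],
        'sex': ['M', 'F', 'M', 'M', 'F'],
        'height': [77, 56, 66, 61, 50]}
table = DataFrame(data)

# Mean height of everyone.
print(table['height'].mean())                   # 62.0

# Filter the rows first, then take the mean of one column.
girls = table[table['sex'] == 'F']['height'].mean()
boys = table[table['sex'] == 'M']['height'].mean()
print(girls, boys)                              # 53.0 68.0

# groupby splits the table by sex and averages each group.
by_sex = table.groupby('sex')['height'].mean()
print(by_sex)
```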


Where from here?

Note that, at this point, we've come far enough that we can start talking about real things; we do not need to work with silly little data sets that we've made up. We know the commands (max, min, standard deviation, sum, mean, etc.) because we've used them before, and, even if we have not, pandas and Spyder make it relatively easy to look up which thing we'd like. In other words, I feel that to continue doing somewhat artificial exercises may not be the best use of time here. Therefore, the next few lessons in this part will be real-world examples with real-world datasets (I call them "projects"); I will be asking a question, finding a dataset, analyzing, seeing if there are any follow-up questions, and concluding, and I will do so in a way that will let you see how the general process of data analysis works. I do not know all of the pandas commands; we will find them out together!

In the next part (following the projects) we will be looking at a few APIs which will allow us to find and use data from popular resources like Twitter, Google Maps, Facebook, and so forth. This will allow us to get extremely specialized and, at times, up-to-the-second data.

[By the way, I encourage you not only to work through the projects in the next sections, but also to find your own and work through them. If you find something neat, let me know!]
