(home) (about) (rss)

Part 2, Section 3:
Getting Deeper into NumPy and Pandas.

Project: CPS Algebra Exit Exams.


We proceed with these projects in the following organized fashion:

  1. (Ask a Question.)
  2. Find the data.
  3. Import the data.
  4. Analyze the data.
  5. Conclude.

I call this the FIAC system because, you know, the first letters of each of the statements. The first one is in ()'s since sometimes we will not know what kinds of questions we can ask until we have the data in front of us; other times, we will have a specific question in mind which we'll need to find the appropriate data for.

For this project, my question was, "How well are kids in Chicago Public Schools (CPS) learning mathematics?"

Finding the Data.

For this one, instead of Googling (which I could have done), I asked on a social networking site if anyone had any data for the CPS system; I have friends who work there and I thought maybe the CPS doesn't give data out to anyone who asks. Luckily, I was wrong: this is a hub for various datasets related to CPS. At the bottom, you'll see the algebra exit exams. For now, don't bother clicking it — I've made a simplified version of it which cuts out the unnecessary (for this project) elements. Moreover, even though the file is naturally in excel-format, I've converted it to csv (remember: you can easily convert csv and excel files back and forth by opening the file in Excel or any of the open Excel-like programs). Download the file here:


Now that we have our file, we need to do the next part of FIAC, which is to import the data.

Importing the Data.

Oh, pandas, what a delight you are to use! We will use the read_csv function that you can find by either searching through the list of pandas functions or by googling something like, "csv file in pandas". Here's what it'll look like:

db = pd.read_csv('fileLocation')

Where, of course, 'fileLocation' should be replaced by the string (in quotes!) of where the file is. If it works, great! If not, strangely, sometimes pandas will bring up a weird error and the only way I've found to remedy that error is to use two \'s right before the file name instead of one (so, for example, 'C:\MyFiles\\algexit2009.csv' with the two \'s at the end). Either way, once you've got it into the variable db we've successfully imported the data.

We now move on to the (arguably!) most important part of FIAC, which is to analyze the data.

Analyzing the Data.

First, let's look at the data. We can just print it out, right?

>>> db

Int64Index: 135 entries, 133 to 123
Data columns (total 4 columns):
school_name       135  non-null values
total_highpass    135  non-null values
tota_pass         135  non-null values
total             135  non-null values
dtypes: int64(3), object(1)

Well, that's certainly not what we want. It gives us the column headings (which was nice of it) and how many values there are, but not what those values are. Hm. Looking through the possible functions in pandas we can apply to db we find one called values. Try this one out.

>>> db.values

Well. That's kind of what we want. But it's super-ugly. Just a list of list. That's not friendly-looking. It turns out that pandas has two nice commands which allow us to work with extremely large data sets (even though this isn't too large) called head and tail; head(number) gives you the top number of elements in your dataframe while tail(number) gives you the last number of elements in your dataframe. Try them out!

>>> db.head(20)
>>> db.tail(15)

You can mess around with different numbers and get different lists. Ultimately, this will help us look at most of the data in a relatively clean and not memory-intensive way.

So, we've listed the data out. Great. Fantastic. That's not quite the analysis we were going for, though. So, let's ask three questions and get three answers.

  1. Which school(s) has the largest number of passing students? What about high-passing students?
  2. Which schools(s) had the largest percentage of passing students? What about "high pass" students?
  3. Which school had the largest number of both pass and high pass students? What about percentage?

Here, I'm assuming that "passing" and "highpassing" students are disjoint; that is, if someone "highpasses", hey count that student in the "total_highpass" column, but not the "total_pass" column.

For the first question, we could do a few things: we could find the maximim value, find out where that value was in the index, and then return the school name associated to that index — but this is a lot of work, and, really, all we need to do is sort the data. So let's do that.

We look for any kind of sorting function in pandas by typing db. and looking at the autocomplte options (or, looking at the documentation). The sort() function looks pretty promising. We'd like to sort by just the "total_pass" column first to answer part of the the first question, which would look something like this:

>>> db.sort(columns=["total_pass"])

If you look at the list again — you'll find that nothing happened. What? We'd need to save this sorted list as another list because we've done some modification to it. That's annoying, so pandas allows us to sort the list in place, which means that it will automatically save the sorting to the current list. The way you do this is, as you might have guessed by looking at the documentation,

>>> db.sort(columns=["total_pass"], inplace=True)

This gives us a delightfully sorted list. If we do db.head(20) again, we'll find a sorted list — but, with the smallest number of passes on top. Ugh. Well, looking at the sort function, we can actually sort in descending order instead by making the "ascending" option False, like this:

>>> db.sort(columns=["total_pass"], inplace=True, ascending=False)

Now when we look at the data with db.head, we find that YOUNG, W and KENWOOD are at the top. If we were to sort for "highpass

instead, we'd see that HAWTHORNE overtakes KENWOOD,

>>> db.sort(columns=["total_highpass"], inplace=True, ascending=False)

That's pretty neat. Note, though, something important: yes, YOUNG, W has the greatest number of highpass and pass students — but they also have the greatest number of students. This idea leads us to the second question regarding the percent of students. This one will take a bit more work.

First, let's make a function which, given an index, will find a percent for that school. For example, if we looked at our sorted list and put in "0", we'd get a percent for YOUNG, W since that's at the 0-th index.

The way we look at an value at the position [row index,column index] is to use the command .iat. I found this by doing a simple google search, but it's also one of the more "common" pandas commands. For example, plugging in db.iat[0,0] should give us YOUNG, W. Try plugging in different indices and see if you get what you expect!

Okay. Now that we know how to get values, something we'd want to do is to find percentages. For YOUNG, W for example, we'd have the percentage of passing students (not highpassing!) is equal to

>>> db.iat[0,2] / float(db.iat[0,3])

The command float here is just to make sure Python knows this answer should be a number with a decimal point as opposed to just rounding to the nearest whole number. Some versions of Python do this for us (without using float) but better safe than wrong.

Either way, we should get a solution like 0.64800... and that's exactly what we wanted. That's the percentage of passing (but not high passing) students for index 0. If you've got some experience programming already, you'll probably have jump the gun here: this is practically begging us to make a For loop out of it. Let's make a list for these values:

>>> passpercent = []
>>> for i in range(len(db)):
...  passpercent.append(db.iat[i,2] / float(db.iat[i,3]))

You will have to press enter a few times after the last line here to make sure it runs the For loop. When you get back to the >>> you're good. Make sure to put one or two (or more) spaces when you press enter after the for i in range line (when you start the line with append) or else Python will complain about indents; indenting is a common important theme in Python.

The above commands shouldn't seem too mysterious to you at this point. We make an empty list by using passpercent=[] and "append" elements to it using the append command. In addition, len(db) is the height of the columns in db. If you run this command, you'll fill up passpercent with values. Check to make sure.

You might think that it'll be crazy to make a new column in our database, but it's actually really simple:

>>> db['pass percent'] = passpercent

This creates the column 'pass percent' and pastes our passpercent list into it. If you check by typing db.head(10), you'll see we've created an appropriate percent of passing column! Nice. Before we sort, let's just create another column for highpass percent.

>>> highpasspercent = []
>>> for i in range(len(db)):
...  highpasspercent.append(db.iat[i,1] / float(db.iat[i,3]))

followed by

>>> db['high pass percent'] = highpasspercent

Great. Now let's sort. First, by pass percent:

>>> db.sort(columns=['pass percent'], inplace=True, ascending=False)

Looking at the data (remember, db.head()), we now see something kind of neat: SHOOP, BLACK, STOCKTON have taken the lead, since nearly all of their students passed! Interesting. But, we note, they have very few students compared to the other schools. This will have to be taken into account when making any kind of conclusion.

Now, by high pass percent:

We now see that GALILEO SCHOLASTIC, THORP O A, GRAY have taken the lead in terms of high pass percent — again, note that these schools have very few students.

[As a non-mathematical aside, we see here why taking averages and ranking schools is difficult: we've got four different ways of looking at passing rates, and the schools on "top" are dastically different depending on what we considered to be important to look at. Ideally, we would be able to appropriately weight these rankings in some way and create a "reasonable" measure of "goodness". For now we will not worry about these considerations and just focus on the data.]

Last, we look at the final question. We need to take a sum of these two columns but, by now, we've probably got a good idea of how to do that. Guess first, then check below (there's more than one right answer!):

>>> totalpass = []
>>> for i in range(len(db)):
...  totalpass.append(db.iat[i,1] + db.iat[i,2])

and then

>>> db.['total pass'] = totalpass

Sorting by the total pass number...

>>> db.sort(columns=['total pass'], inplace=True, ascending=False)

We find that YOUNG, W and KENWOOD are back on top. Doing percentages again...

>>> totalpasspercent = []
>>> for i in range(len(db)):
...  totalpasspercent.append(db.iat[i,6] / float(db.iat[i,3]))

and then...

db['total pass percent'] = totalpasspercent

and finally sorting...

>>> db.sort(columns=['total pass percent'], inplace=True, ascending=False)

And an interesting thing happens: THORP O A overtakes GALILEO.


For now, we'll leave drawing conclusions to the reader. This project was a small exercise in how to manipulate data on a basic level (indeed, there are better ways to do most of what we did — though, what we did works and is fine). You may not be impressed with this data, but it may also be instructive to see how poorly the "worst" schools have done.

We won't make any graphs for this project since I feel I've introduced enough new material. For future projects, we'll want to think about how best to represent the data we've collected and look into how to create the relevant charts and graphs using matplotlib.

Exercise for you!

Below is some data on the Dropout rates for most of the schools in the Chicago Public School System (the headings give the school name, the type of school, the drop out (DO) rate (as a percentage) for 2010, 2011, and 2012.


What can you find out about this data? What schools have the highest and lowest dropout percentages? Do the higher/lower schools tend to be in any particular network? If so, what other data might we look at to try to find a cause for this?

⇐ Back to 2.2HomeOnwards to 2.4 ⇒