(home) (about) (rss)

Part 2, Section 6:
Getting Deeper into NumPy and Pandas.

Project: Chicago Budget Appropriations.

I was curious — are Chicago's budget appropriations this year much different from last year? I decided to compare the 2012 and 2013 appropriations, available here (2012) and here (2013).

Finding the Data + Importing the Data.

It was lucky that Chicago is on a public-data kick; this data was as easy as googling "Chicago Budget".

When we get to the budget, it will come up as an in-browser database; this isn't especially useful to us. Head over to the "menu" part of the little in-browser database thing and click it — then, download the data. Click on the CSV option and it'll download to your computer. Neato.

Import this database as db in a new file. This time, we will not use the console, we will simply make a program. Recall that this means that you go to File and click New File. We will need to save the file and click the little Green Running Man to run it, but this will let us have a bit more substantial code. This table is surprisingly readable — a bit more-so than the Louisiana one since there are no strange symbols afoot. My code so-far, by the by, is:

import numpy as np
import pandas as pd
import scipy as sc
import matplotlib.pyplot as plt
from pandas import DataFrame, Series

file2012 = "(your file location)"
file2013 = "(your file location)"
db2012 = pd.read_csv(file)
db2013 = pd.read_csv(file)

Where you can put in the actual file location inside of the quotes. Note that I've imported numpy and scipy just in case we wind up using them.

Analyzing the Data.

If you type in print db.head(), it will show on the console a ton of data. In fact, probably more data than your screen can handle at once. This is not necessarily useful to us; there is simply too much going on, and too many things we don't need. We don't really care too much about the fund code or the department number.

I haven't asked a particular question yet, but let's look at the data. The first thing that comes into my head is: why don't we sort by department and see if those numbers (the total amount appropriated to each department) differ significantly.

Grouping is a big deal. So much so that I wanted to put that sentence in bold so that you remember it. Grouping is extremely important for looking at tons of itemized data like this and one of the best things about Pandas is that it makes grouping data by some particular column extremely easy.

First, we're going to take some columns we care about and make a new database from them. This is not strictly necessary, but it will make it nicer at the end. For now, let's look at the department and amount. You might notice that there is actually no DEPARTMENT or AMOUNT for some reason in the 2013 version of this report; lluckily, there is still DEPARTMENT DESCRIPTION and something called 2013 APPROPRIATION ORDINANCE which is essentially the same. So, let's make the following databases: for the 2012 database we will make a database that has DEPARTMENT, DEPARTMENT NUMBER, 2013 APPROPRIATION ORDINANCE as columns; for the 2013 databse we will make a database that has DEPARTMENT NUMBER, AMOUNT as columns. Here's how to do that, if you forgot:

slimdb2012 = db2012[["DEPARTMENT", "DEPARTMENT NUMBER", "AMOUNT"]]

I called the new variables "slim" just because they're smaller than the original databases, but you can call them whatever you'd like. Now, if you type in print slimdb2013.head(20) and run the program, you'll notice that we now get a nice, clean char — but it didn't add up anything. It just sort of gave us a cleaner list. We want to group everything together; so let's try this:

grouped2012 = slimdb2012.groupby(["DEPARTMENT", "DEPARTMENT NUMBER"])

This will group together our data by department and department number. If you print this out, you'll notice that it's still listing all of the amounts; we have not summed anything up! The way to do this is with an aggrigation command, and those of you familiar with SQL will be familiar with this type of thing. The idea is as follows:

agg_grouped2012 = grouped2012.aggregate(np.sum).reset_index()

The aggregate command will take all of the numerical columns (usually the amount or debt in a database) which are not part of the index (DEPARTMENT and DEPARTMENT NUMBER are part of the index, since we grouped by them) and apply a certain function to it. The sum becomes the new index, which may not be nice; so, we use the .reset_index() command to reset the index back to what it was before. In this case, we've used np.sum which is numpy's sum command: it takes all of the data and adds it together. As an exercise, you should see what happens when you put in another function into the aggregate command (like np.max or np.min for example; what would these tell us?).

Print out a piece of the new database just to see if it worked by using print agg_grouped2012.head(). It should have! Great. Now, let's quickly do the same for the other database, noting that there are slightly different columns. Your code should look something like this:

slimdb2012 = db2012[["DEPARTMENT", "DEPARTMENT NUMBER", "AMOUNT"]]

grouped2012 = slimdb2012.groupby(["DEPARTMENT", "DEPARTMENT NUMBER"])
grouped2013 = slimdb2013.groupby(["DEPARTMENT NUMBER"])

agg_grouped2012 = grouped2012.aggregate(np.sum).reset_index()
agg_grouped2013 = grouped2013.aggregate(np.sum).reset_index()

Just for kicks, let's print out agg_grouped2013 to see if it all is fine. Wait, what's this...? It didn't add everything together, because the values were dollars and Python apparently doesn't know that this means we need to add them as dollar amounts! We get a string of nonsense back in the 2013 APPROPRIATION ORDINANCE column. Ugh. For reference, this is what I'm talking about:

                                        2013 APPROPRIATION ORDINANCE
DEPARTMENT NUMBER                                                   
1                  $5366703.00$1200.00$18000.00$1000.00$49500.00$...
3                  $2288127.00$9736.00$1250.00$820.00$5000.00$181...
5                  $1555518.00$4022.00$37500.00$3500.00$4500.00$2...
6                  $7174047.00$8542.00$2000.00$6431518.00$6313518...
15                 $6295022.00$8824000.00$140000.00$1400.00$68000...

Notice that we should be adding these dollars together, but instead Python is just concatinating them. That's upsetting. Welp, let's give up.

Just kidding! Remember how we were talking about how we can put in whatever function we wanted to the aggregate command? Well, why don't we just be ambitious for a second and problem solve this: let's make a function that adds this stuff together! This is a great exercise in Python-ing, but I'll put two examples down here if you can't figure out a solution yourself.

def add_dollars1(dollar_list):
    new_amount_list = []
    for item in dollar_list:
        new_item = float(str(item).replace("$", ""))
    return np.sum(new_amount_list)

Of course, there are more efficient functions which do this, but this one is easy to explain. We input a list of dollar values, make an empty list, and for each of the entries we find the dollar sign and replace it with nothing (""), then append that to the new list. We then return the sum of that new list. Of course, making a new list is a bit overkill; we could have just added elements after we removed the dollar sign. We could do that as well:

def add_dollars2(dollar_list):
    total = 0
    for item in dollar_list:
        total += float(str(item).replace("$", ""))
    return total

Make your own, or use one of these, and put it at the beginning of the file, after the imports (just for aesthetics!). We can now add dollars nicely. We replace the command

agg_grouped2013 = grouped2013.aggregate(np.sum).reset_index()

with the new code:

agg_grouped2013 = grouped2013.aggregate(add_dollars2).reset_index()

where the new command would be whatever the name of your function was; here I use add_dollars2 that we made above.

Okay. Finally. We should merge these lists together into one list and the thing we'd like to have happen is that we could just "glue them together" on the DEPARTMENT NUMBER part; both have department numbers, so we could just sort of line up the amounts and paste those numbers in. Luckily, this is pretty east in Pandas:

budget_compare = agg_grouped2012.merge(agg_grouped2013, on="DEPARTMENT NUMBER")

We've made a new variable, budget_compare and we've set it equal to something strange. First, we've taken the agg_grouped2012 dataframe and we've used the merge command. It asks us: "What do you want to merge this dataframe with?" We tell it we'd like to merge with app_grouped2013. It asks: "Well, what column would you like us to match up on both of these?" They both have the department number column, so we simply use on="DEPARTMENT NUMBER" to tell it we'd like to match these up on the department number. If you're a bit confused as to what this means, you should print budget_compare.head() to see what this has done.

But now we don't need the DEPARTMENT NUMBER, so let's just get rid of it.

slimbudget_compare = budget_compare[["DEPARTMENT", "AMOUNT", "2013 APPROPRIATION ORDINANCE"]]

And now slimbudget_compare has only some columns we care about. But, which are which? If we were to give this to someone else, they'd ask: "What is AMOUNT? What's that for?" We probably should rename these columns to something a bit more meaningful. We do this with the rename command, unsurprisingly:

slimbudget_compare.rename(columns={'DEPARTMENT': 'DEPARTMENT', 'AMOUNT': '2012 Amount', '2013 APPROPRIATION ORDINANCE' : '2013 Amount'}, inplace = True)

This takes in a Python dictionary whose entries looks like 'thing' : 'what we want to change it to'. We have not changed DEPARTMENT, but we have changed the others. The inplace command just tells us that it has done the action in place and we don't need to save it as a new variable.

At this point, we're pretty much done. You should try, yourself, to make a "difference" column which gives the difference between the years.

Interpreting: Make a Graph!

A picture is worth a thousand rows. If we could make a graph with this data, it would be pretty neat, no? Well, as we've noted before, this is a thing that matplotlib does rather well and we just happen to have imported it before!

At this point, we need a bar graph. But we don't know how to make it. Luckily, there's a ton of examples of how to do this sort of thing. I've taken an example from the matplotlib website and changed some values to create a new bar chart:

labels = budget_compare["DEPARTMENT NUMBER"]
b2012 = slimbudget_compare["2012 Amount"]
b2013 = slimbudget_compare["2013 Amount"]

ind = np.arange(len(b2012))
width = 1

fig, ax = plt.subplots()

r1 = plt.bar(ind * 3, b2012, width, color ='r' )
r2 = plt.bar(ind * 3 + 1, b2013, width, color = 'b')

ax.set_ylabel('Money Appropriated')
ax.set_title('Comparison of 2012 and 2013 Budget Appropriations for Chicago')
ax.set_xticklabels( labels )

ax.legend( (r1[0], r2[0]), ('2012', '2013') )


This probably looks absolutely terrifying if you haven't seen this kind of thing before, but it's not all that bad; if you get lost, don't worry so much about this part; matplotlib can be confusing, but, for the most part, there are other methods to visualize data (and we'll go over others in future lessons). [A minor note: if you paste this in, you'll notice that I've used department number as the x-axis, since the names would just blend together.]

The first three lines are just naming the labels, and the two bars we'd like. The next two lines give me some indices to work with, which we'll use later, and the width I want each bar to be. The next part with fig, ax gives us a way to talk about the chart itself (for later).

The r1, r2 variables are where it all comes together. Here, I've done something strange. I wanted to do the following: have one bar (for 2012), then have another bar next to it (for 2013), and then a space. So, for index 0 have the first bar, for index 1 have the second, and index 2 have nothing. Then start over again: for index 3 have a 2012 bar, for index 4 have a 2013 bar, for index 5 have a blank space. Notice that the 2012 bars are at 0,3,6,9..., and 2013 bars are at 1,4,7,..., and so forth. The ind variable is equal to: 0, 1, 2, 3, 4, 5, 6..., and so forth, so we tell matplotlib to plot the r1 values (the 2012 rectangles) at 0,3,6,9,... by telling it to plot it at 3*ind which is, in fact, equal to 0,3,6,9,...; the r2 variable is similar, but we use 3*ind + 1. Next, we tell what data r1, r2 should represent, the width it should be, and the color we want it ('r' for red, 'b' for blue, and so on).

The next four lines set the y-axis label, the title, the x-axis tick length and the x-axis labels. All but the third are somewhat obvious. The third is essentially saying, "Where do you want the labels?" We tell it we'd like it at (3*ind)+width, but some other values work too. Change this and then look at the plot to see what's going on.

Last, we make the legend. The (r1[0], r2[0]) makes little colored rectangles corresponding to r1, r2 data, and the next tuple ('2012', '2013') tells us that our colored rectangles should be labeled with these names.

The last line just says, "Hey, matplotlib, let me see my plot!" Compiling, if you've made no errors, you should see something like this:

And that's that. Gives us a nice graph that we can use (and even explore: if you want to zoom in, there's a little tool at the top that lets you do this!). Note the axis on the left has small numbers, but these numbers correspond to something like Billions of Dollars; hence, 2.9 means 2.9 Billion Dollars. Figure out a way to put this in the y-label by modifying the above code!

For ease of viewing, here is my entire code (remember to put in the path to your file at the beginning if you copy+paste).


Draw your own conclusions here. It seems like Department 99, "Finance General", has taken a bit of an increase in 2013; it might be nice to look at the specifics of "Finance General" to see where this extra money is going. Hm, this may be a good idea for another project...

⇐ Back to 2.5HomeOnwards to 3.1 ⇒