Getting Deeper into NumPy and Pandas.

[Note that the dropout rate we will be using is the one-year dropout rate. The differences between the various rates are explained here.]

In the last part I gave some data detailing how many students dropped out of a particular school. The data is easy to summarize but, in the real world, if you showed someone that data, one of the first questions that would come up is: "Why is the data like that?" More specifically, why do certain schools have certain dropout rates while other schools have different ones? Of course, there is a nearly infinite number of potential reasons for this kind of behavior but, for brevity, we will pick one. You may think that one of them may be average income in the area where the students live (there have been many studies which use this factor, among others). To create a simple project, then, we will do the following:

- Create a data set which has the school name, the dropout rate from 2012, and the median income from the area surrounding the school.
- Plot the data and use something called *regression analysis* to construct a "line of best fit", which will give us an approximate correlation between median income and dropout rate.
- Test our line of best fit using a few other data points.

[Before I begin, I'd like to note a few things. Due to the political issues contained here, it seems important for me to note that *I am not claiming that any factors that I have or will state here are, in fact, relevant to the dropout rate*, nor do I claim to know *why* the dropout rate is what it is. This project will contain the analysis part using raw data we are given to answer a question that we have formulated. It is difficult to be objective when one must make decisions (what schools to include, what area the students are from, etc.) but we will do our best. Remember that this is simply a project to help you get more comfortable at using Python and pandas to analyze data, not an attempt to save the world.]

We have the dropout rates in CSV format here. Here's the problem, though: we probably won't be able to find out where each student at each school lives and, therefore, will not be able to give an accurate estimate of their median income. We will take a stab at a potential solution: we will use public schools (which already makes our results more specific than we originally wanted) and we will use the area surrounding the school (assuming that most of the children living near a public school will go to that public school — this requires us to use public schools called *neighborhood schools*). **These assumptions must be clearly stated** when we conclude; if we do not clearly state what we have assumed, the reader may accidentally think we've proven more than we have.

Going over to the CPS website's find-a-school feature, looking at high schools (which we also need to note in our conclusion) and checking off only "neighborhood school", we find that there are a number of schools listed with their addresses. They're in alphabetical order, so I will just take every other school until I have around 12 (this is usually not enough data points to prove anything, but it's fine for this practice project).

I've made part of a CSV including the names of the schools I'm using, and I've included the dropout rate from 2012, which I found by looking at the previously mentioned dropout rate data (link above). Here's what my new CSV looks like so far:

```
school,DO2012,median income
amundsen,6,
clemente,0.9,
corliss,2.3,
douglass,0.6,
eric solorio academy hs,3,
fenger,16.5,
gage park,10.6,
harlan,10.3,
hirsch,11.2,
hubbard,7.4,
juarez,5.5,
kelly,5.5,
```

Notice that the median income is missing. How do we find this information out? There are a number of resources to do this but the first way I thought of was to use city-data and check the Estimated median household income in 2010 using the *zip code that the school is located in*. This is another assumption we must note. The reader may argue that this is too large or small of a net to catch the students, or that the data in city-data is not recent; these are both potentially valid complaints. We could solve these by attempting to look for more recent, more specific data — but, for now, we will stick to this data.

After searching around a bit, we have filled in the rest of the CSV:

```
school,DO2012,median income
amundsen,6,50065
clemente,0.9,58987
corliss,2.3,40394
douglass,0.6,28059
eric solorio academy hs,3,42809
fenger,16.5,40394
gage park,10.6,37367
harlan,10.3,40394
hirsch,11.2,40176
hubbard,7.4,37367
juarez,5.5,42575
kelly,5.5,42809
```

If you'd like, you can copy-paste this into notepad and save it as "dropoutVSmedincome.csv" or you can download my copy here.

Notice, even before we start, that some of the zip-code median incomes are the same — this tells us that we've either got too small of a sample or have cast too wide a net by using zip-codes to find the median income. In a "real" analysis, we would most likely go back and try to find a way to get more exact median incomes per school area.

**Before we begin, note that we will be using the editor, not the interpreter.** The reason is that we'll need to make a few things to put on a particular graph, and this is easier to do in the editor.

We have a CSV, so we know what to do. Open up Spyder and type into the editor everything we need to import and the command to read the CSV file:

```python
import numpy as np
import scipy as sc
import scipy.stats          # needed so that sc.stats is actually available
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame, Series

db = pd.read_csv('fileLocation')
```

Next, let's make a DataFrame from this data.

```python
table = DataFrame(db, columns=['school', 'DO2012', 'median income'])
```
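Before plotting, it can help to peek at the DataFrame to check that the CSV was read the way we expect. Here's a minimal sketch; the inline CSV string is just the first few rows of our file, embedded so the snippet runs on its own without the file on disk:

```python
import io
import pandas as pd

# A few rows of our CSV, embedded as a string so this snippet is self-contained
csv_text = """school,DO2012,median income
amundsen,6,50065
clemente,0.9,58987
corliss,2.3,40394
"""

db = pd.read_csv(io.StringIO(csv_text))

print(db.columns.tolist())   # ['school', 'DO2012', 'median income']
print(db.shape)              # (3, 3) — three rows, three columns
print(db['DO2012'].mean())   # average dropout rate of these three schools
```

If a column name comes back with stray whitespace or the shape is wrong, that's your cue to fix the CSV before doing any analysis.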

Let's make a plot out of this so that we can see if there are any trends.

It's easy to make a scatter plot from our data:

```python
plt.plot(db['DO2012'], db['median income'], 'bo')
```

The `plot` command (which we've seen before!) plots the first argument (the dropout rate) on the x-axis against the second argument (the median income) on the y-axis. The "bo" part tells matplotlib to plot in blue ("b") using only dots ("o", circle markers) and no connecting line, which makes this a true scatter plot. If you want to see what this looks like, put the `plt.show()` command at the end of the program and run it. Make sure you remove the `show` command before continuing with this lesson if you tried it out.
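If you do try it out, a couple of optional labeling calls make the scatter plot much easier to read. These are my own additions (including the percent unit on the x-axis, which is my reading of the dropout figures), not part of the lesson's program, and the data below is just the first five schools from our CSV:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line to get a window
import matplotlib.pyplot as plt

dropout = [6, 0.9, 2.3, 0.6, 3]            # DO2012 for the first five schools
income = [50065, 58987, 40394, 28059, 42809]

points, = plt.plot(dropout, income, 'bo')  # same 'bo' scatter style as before
plt.xlabel('Dropout rate, 2012 (%)')
plt.ylabel('Median household income ($)')
plt.title('Dropout rate vs. median income')

print(len(points.get_xdata()))  # 5 — one marker per school
```

The trailing comma in `points, = plt.plot(...)` unpacks the single `Line2D` object that `plot` returns in a list, which is handy if you later want to tweak the markers.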

Okay, here's the first time we're going to need to remember some algebra. Recall that *lines* can be written as
\[y = mx + b\]
where $m$ is called the *slope* and $b$ is called the *y-intercept*.

A *regression line* is a special kind of line which "best fits" data. Here's a cute page that shows sort of what this means. The idea is that, even though we have a ton of disjoint dots, we can make a line that "kind of" represents what the data is telling us. Sometimes these lines are not all that great at predicting data, and sometimes they're fantastic — and, in fact, there are mathematical ways of saying just *how good* a regression line is. We won't go into all this, but know that it exists.

Given two sets of data, `DataA` and `DataB`, the way to find regression information is to use the `sc.stats.linregress()` command. This will give us back a lot of data, but we really only need the slope and y-intercept, which we can get in the following way:

```python
regressionline = sc.stats.linregress(DataA, DataB)
m = regressionline[0]   # slope
b = regressionline[1]   # y-intercept
```
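To see what `linregress` actually hands back, here's a toy example with made-up points that lie exactly on the line y = 2x + 1. The third value it returns, the r-value, is one of those mathematical measures of how good the fit is that I mentioned above: an r-value of 1 or -1 means a perfect linear fit.

```python
import scipy.stats

DataA = [0, 1, 2, 3, 4]
DataB = [1, 3, 5, 7, 9]   # exactly y = 2x + 1

result = scipy.stats.linregress(DataA, DataB)
m = result[0]   # slope
b = result[1]   # y-intercept
r = result[2]   # correlation coefficient

print(m, b, r)  # 2.0 1.0 1.0 — a perfect fit, as expected
```

Real data will never give an r-value of exactly 1, and the further it is from ±1, the less the regression line really tells you.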

In order to make a line in pyplot, we need to make a list of evenly-spaced points between two endpoints to be our value for `x`, and then we use the `plot` command. This sounds scarier than it really is:

```python
x = np.linspace(START, END, NUMBER_OF_POINTS)
plt.plot(x, m*x + b)
```
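As a quick sanity check on `linspace` itself (the endpoints here are just the ones we'll use below):

```python
import numpy as np

x = np.linspace(0, 18, 100)

print(len(x))       # 100 evenly spaced points
print(x[0])         # 0.0 — linspace includes the start point...
print(x[-1])        # 18.0 — ...and the end point
print(x[1] - x[0])  # the spacing between neighbors: 18/99
```

Because `x` is a NumPy array, the expression `m*x + b` in the `plot` call computes the height of the line at all 100 points at once — no loop needed.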

This is not so bad, right?

Now that we know how to do regression-things, let's do exactly that. We just need to put in the relevant information:

```python
regressionline = sc.stats.linregress(db['DO2012'], db['median income'])
m = regressionline[0]
b = regressionline[1]
x = np.linspace(0, 18, 100)
plt.plot(x, m*x + b)
```

Note that we started at 0 and went to 18 for our `linspace` since, after a bit of testing, that's what turned out to look the nicest with our dot plot. 100 is an arbitrary amount, but it's what I generally use for the number of points in `linspace`.

Last, we just need to write:

```python
plt.show()
```

which will display the plots we made. All in all, your program should look something like this.
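For reference, here is a sketch of what the whole program might look like. I've embedded the CSV as a string so the snippet runs on its own, and I've used a non-interactive backend with `savefig` so it works even without a display; in the lesson's version you would point `read_csv` at your saved file and end with `plt.show()` instead:

```python
import io

import numpy as np
import scipy.stats
import matplotlib
matplotlib.use('Agg')  # headless backend; remove this (and savefig) to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# The CSV from earlier, inlined; swap this for pd.read_csv('dropoutVSmedincome.csv')
csv_text = """school,DO2012,median income
amundsen,6,50065
clemente,0.9,58987
corliss,2.3,40394
douglass,0.6,28059
eric solorio academy hs,3,42809
fenger,16.5,40394
gage park,10.6,37367
harlan,10.3,40394
hirsch,11.2,40176
hubbard,7.4,37367
juarez,5.5,42575
kelly,5.5,42809
"""
db = pd.read_csv(io.StringIO(csv_text))

# Scatter plot of the raw data
plt.plot(db['DO2012'], db['median income'], 'bo')

# Regression line
regressionline = scipy.stats.linregress(db['DO2012'], db['median income'])
m = regressionline[0]
b = regressionline[1]
x = np.linspace(0, 18, 100)
plt.plot(x, m*x + b)

print(m)  # negative — income falls, slightly, as the dropout rate rises
plt.savefig('dropout_vs_income.png')
```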

When you run your program you should get something that looks like this:

We note that we've made a ton of assumptions about this data: neighborhood high schools only, only twelve data points, one-year dropout rates, only in Chicago, zip-codes used for median income, etc., but we can say we've found a *slight* negative correlation between the dropout rate and median income of the area; that is, as the dropout rate increases, median income decreases. Note that this correlation isn't all that strong (most of the dots are nowhere near the line!) and that the slope of the line is not all that extreme.

If this were a real project, and we'd used tons more data points, more exact income information, etc., and we'd gotten the same kind of regression line, what would we be able to say?

This is a tricky point: we *cannot* say that lower income implies a higher dropout rate, or that a higher dropout rate implies lower income; we can say that there exists a correlation between these two things and we can say what this correlation is. We cannot imply that this relationship is, in either direction, causal. For more on this idea, read this Wikipedia article. Keep reading until you understand this comic.

This project only looked at two variables: income and dropout rate. We can do a bit better and look at a few more variables in the future, but let's not go too quickly ahead: the point of this section was to get a bit deeper into pandas and numpy, and we've still got a bit to do to get the basics down.

The next project will be the final project in this section. We'll then move onto learning about *API*s which will allow us to get data from sources like Twitter, Google Maps, Google, and so forth. This data will most likely be a bit more interesting (and more up-to-the-second). By then, we should be fairly comfortable with using pandas DataFrames and using basic numpy and scipy analysis and we'll be ready to look at slightly more specialized tools that numpy and scipy offer.

[This is to answer the most common question I get on here: wouldn't using Excel be easier?]

Excel is great, but it does have some limitations. The primary limitation, in my mind at least, is that it costs money. Besides that, there is generally nothing wrong with using Excel. When we get further along (especially when we begin using APIs) we'll see that we're able to use standard Python programming alongside these analyses we've been doing to get lots done. Excel also has an associated language (called VBA), but I find it a bit awkward to get used to.

And, hey, there's nothing wrong with knowing *both*.

⇐ Back to 2.3 | Home | Onwards to 2.5 ⇒