DyingLoveGrape.

(home) (about) (rss)

Part 2, Section 5:
Getting Deeper into NumPy and Pandas.





Project: Ethnic Groups Moving Out of Louisiana.

A friend of mine noted that a significant number of ethnically white individuals were moving out of Louisiana, and at a much greater rate than other ethicities. Let's investigate this claim.

Finding the Data + Importing the Data.

A good resource for this would seem to be the United States Census, seeing as how the Census would probably look at migration data like this. The Census website can be intimidating: some of the more exotic data sets are done in strange ways which requires a significant investment in time to figure out. Luckily, what we'd like to know is a bit more usual and the Census Bereau has put together an easy we to get basic data like this.

For standard inquiries like the one we've made (we hope!) an easy way to find statistics on the Census website is to use their easystats page. We go here, select the state (Louisiana), choose the topic "people", and pick the table Geographical Mobility In The Past Year For Residence 1 Year Ago In The United States which seems to be the most relevant to our question. We now "get results" and this gives us a nice chart.

Usually, these charts are bigger, so let's get into the habit of importing it into Spyder. Click on the lower-right corner which says MS EXCEL to download an excel version of this file.

In Excel, I generally "trim" off the data I obviously don't need, and trim the headings and extra parts off (though this is not required; you can skip these lines and so forth in Python). My file wound up looking like this.

Import this database as db. You'll be surprised to see a somewhat readable table, despite the apparent complexity of the excel document! The first column, for some reason, is called ÿ in my Spyder for some strange reason (you can find out what yours is called by using the command db.columns which gives the names of the columns in your table). Let's restrict our view here; let's just look at the first column, the African American, the White, and the American Indian or Native American columns. We can do this with the command:

>>> newdb = db[['ÿ', 'White', 'Black or African American', 
    'American Indian and Alaska Native']]  

Note that this weird first symbol may be different on your Spyder. Now we can work with newdb to answer our question.

Analyzing the Data.

Using newdb is nice, but it's still a bit too big. We'd like to make it into a smaller database, since we really only need to know two things: how many people lived in Louisiana in the last year, and how many moved to a new state. We'd like the column headings too, so we only need rows 1, and 5. The way we do this is to use that ix command that we've used once before. This will allow us to specify what rows and columns we want from newdb.

>>> finaldb = newdb.ix[[1,5],:]

I've saved this new database as finaldb. Recall that the : means to get all of the columns from newdb which is what we wanted.

At this point, what we want to do is clear: we want to just divide these things and get a percentage, then look and see if this answers our original question. Note that we could simply plug these into a calculator (since there are only 3 columns) but since this won't teach us anything we'll do it in Python. Besides, we're going to run into trouble in a second and it'll be nice to see if we can get ourselves back in the clear.

Trouble with Commas in Integers.

First, let's make a little percentage function which will take two things and divide them. This is not strictly necessary, but it's nice to make things a bit easier.

>>> def percentage(a, b):
...   return a/float(b)

Recall that the float command makes the number into a decimal-point number as opposed to just a whole number. Now, let's test this out with two values from our database. Recall that, in addition to ix, we have the command iat which gives us the value of a database at a certain index. For example, for us,

>>> finaldb.iat[1,2]
'27,385'

This gives us the integer in the second row (index 1) and the third column (index 2). If we wanted to see the percentage of White individuals who moved, we would want to divide finaldb.iat[1,1] by finaldb.iat[0,1]. Let's try it!

>>> percentage(finaldb.iat[1,1],finaldb.iat[0,1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in percentage
ValueError: invalid literal for float(): 2,794,622

Wait, what's this? An error? How? Why? We're just dividing two numbers!

If you were to search google for this "value error", you'd find that it's because this number isn't really a number to Python since it has commas in it. In other words, to Python, 2,794,622 is a string, but cannot be natively converted into an integer. Mildly annoying.

There are two good solutions for this. One imports a module called locale, and one uses the replace command. We'll do the latter since it's a bit more straightforward. We're going to re-define our function to first make the numbers into integers (remove the commas) and then divide them. First, the way replace works is as follows:

someString.replace(stuffToReplace, replaceBy)

So, for example, if you have a string Hello = "Hello World!" and you want to replace each o with an 5? for some reason, you'd write:

Hello.replace('o', '5?')

Try this out to make sure you understand what's going on. There's a lot of neat things to do with the replace command, but we won't spend anymore time on it here.

We'd like it if we could replace the commas in a number by nothing, so our command will look something like this:

a.replace(',', '')

Where a is some variable. Let's redefine the function using this idea, and we'll parse through it afterward:

>>> def percentage(a, b):
...   return float(a.replace(',',''))/float(b.replace(',',''))

We've taken a, replaced all the commas with nothing, and then made sure Python knew it was an integer by adding a float command around it. Similar for b. Let's test this out:

>>> percentage('1,000', '2,400,770')
0.00041653302898653346

Good! We've solved this problem. Nice work.

Making a new row for percentages.

Now, to be super-fancy, we'll make a new row like we did in the previous projects. First, we need an empty list:

percentages = []

Next, let's append the first column to it, which will be "percentage moved".

percentages.append('percentage moved')

Now, we'd like to add on the percentages for the first column, the second column, and the third column. We could do this manually, but we can also do it with a nice For statement, as we've done before:

>>> for i in range(1,4):
...   percentages.append(percentage(finaldb.iat[1,i], finaldb.iat[0,i]))

This goes through the list and divides the total by the number moved for each column. Make sure you understand what's going on here, since this kind of thing pops up a lot.

Let's alter this new thing a bit more. To append it to our main dataframe, we need to make it a dataframe with the same column headings. Note that you can get the column names of the original dataframe by using finaldf.columns. This is mainly so that if you were to create a smaller list to add (say, if you wanted to add percentages to the original dataframe we starrted with) it would know which numbers go where and which columns to ignore.

>>> percentages = DataFrame(percentages)
>>> percentages.columns = ['ÿ', 'White', 'Black or African American', 
    'American Indian and Alaska Native']

Last, we append percentages using the append command.

>>> finaldb.append(percentages)

This will (finally!) give us a table that has percent moved as a last row. Now, we investigate.

Conclusions?

In this case, we don't need to do any sorting by rows or anything (though you should know how to!). It seems that there are a significant amount of White individuals who've moved out of Louisianas: around 2.2% of them. We compare this to the 2% of African Americans and 2.3% of American Indian and Alaska Natives and note that this does not seem like an unusual number.

If one were interested in seeing if this was truly a strange percent of people leaving, one would most likely look at percentages of other states and see if these were similar. Indeed, the same process we used here could be used to look at the other states; our functions would not even need to change!

The next project will introduce us to a slightly more sophisticated way to analyze data by grouping numbers together — by looking at the Chicago budget.


⇐ Back to 2.4HomeOnwards to 2.6 ⇒