
Part 1, Section 4:
Introduction to Using Python for Data Analysis.

Displaying Data.

In this section, we're going to do some chart and graph making. First, we should learn to import another extremely common type of data that you'll run into.

Importing XLS File Data.

Open up Spyder and start a new file. Remember when we had a CSV file and everything was great and we loved it? This time, download this file in the xls (Excel) format. Pretend you're a teaching assistant and the professor just sent you a list of final grades and, of course, sent it in XLS format. You want to display the grades in a nice way and do some analysis with your favorite programming language (Python!), but you only know how to import csv files. Alas! Luckily, there's an easy way to import xls data! We'll first import xlrd which, I'm guessing, means something like "Excel read". It allows us to open xls files and mess around with them.

[As a minor note, it is entirely possible to convert xls files into csv files (easily), but I'll introduce this method because it is a bit more general; you can look at multiple sheets more easily with this method than if you were to convert to CSV.]

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xlrd

At this point, you guys should have a fair idea of what we do now: we somehow open the document, we make a blank list, and then we populate the list with the data by using a For loop. Here's how this is done:

# The r prefix makes this a raw string, so the backslashes in the path are taken literally
wb = xlrd.open_workbook(r'C:\Users\james\Desktop\grades.xls')
sheet = wb.sheet_by_index(0)

grades = []

for i in range(sheet.nrows):
    grades.append(int(sheet.cell(i, 0).value))

This was a lot to throw at you at once, but it's because we've done a lot of similar things before. First, notice that we've made a variable called wb (workbook) and used the open_workbook command. This, of course, opens the workbook. In Excel, a workbook is split into pages called "Sheets"; most of the time, for small data sets like these, your data will be on the first sheet. Recall that in Python, counting starts at 0, so when we choose the sheet we want to work with via the sheet_by_index() command we must use the index 0 to mean "the first sheet".

The next lines create an empty list called grades and then populate it in the usual way. Notice that sheet.nrows returns the total number of (non-empty) rows in your sheet; in this case, this gives us the length of the list of data in Excel. We append to grades the value in each cell via the .cell().value command. Inside the cell() function, we specify which row and column we'd like: the first column is column 0, so to go through all the rows in the first column we need to look at .cell(i,0). We call the int function around the value so that Python treats these as integer values; xlrd hands back numbers from Excel as floats (87.0 rather than 87), so I have a tendency to include the conversion just in case.
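To see why the int() call matters, here is a small sketch that needs no spreadsheet at all. It assumes, as xlrd does for numeric cells, that the values come back as floats; the list raw_values is a made-up stand-in for what sheet.cell(i, 0).value would return row by row.

```python
# Made-up stand-in for the values xlrd would hand back, one per row.
# Numeric Excel cells arrive as floats, even when they look like whole numbers.
raw_values = [87.0, 92.0, 68.0, 75.0]

grades = []
for value in raw_values:
    # int() turns 87.0 into 87 (it truncates, it does not round)
    grades.append(int(value))

print(grades)
```

If your own sheet has a header row of text at the top, int() will complain about it, so you would want to start the loop at row 1 instead of row 0.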

Histograms in matplotlib.pyplot.

Okay. Print out grades. You should get a list of grades. As before, we can look at various statistics of grades with numpy (I do this by defining a function, but you can do it however you'd like):

def summary(data):
    # Print a few summary statistics for a list of numbers
    print("Max: %d" % np.amax(data))
    print("Min: %d" % np.amin(data))
    print("Mean: %.2f" % np.mean(data))    # %.2f keeps the decimals
    print("Stdev: %.2f" % np.std(data))
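Here is a quick sanity check of those numpy calls on a tiny made-up list (the values below are not from the grades file). One thing worth knowing, shown here as a sketch: np.std computes the population standard deviation by default, and passing ddof=1 gives the sample version instead.

```python
import numpy as np

# Tiny made-up data set, chosen so the answers are easy to check by hand
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean = np.mean(sample)               # 5.0
pop_std = np.std(sample)             # population standard deviation: 2.0
sample_std = np.std(sample, ddof=1)  # sample standard deviation, slightly larger

print("Mean: %.2f" % mean)
print("Stdev (population): %.2f" % pop_std)
print("Stdev (sample): %.2f" % sample_std)
```

For a whole class of final grades the population version is probably the one you want, since you have every student and not just a sample of them.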

Neato. Your professor, though, wants you to see if the data is normally distributed (bell-shaped). You get the brilliant idea to make a histogram of this data since you've seen this command before (remember?). We use the matplotlib.pyplot package that we've imported and we use the hist function.

plt.hist(grades, bins=30)

Look at this picture. It looks a bit too fine, though; we may have used too many bins. Try changing the number of bins to 20 and then 10, and see which looks best to you.

But maybe a plain number of bins is a bit too vague. You sort of wanted it to count in intervals of 5 (the way letter grades are often given). That's okay too: we can specify the endpoints of the bins in a relatively easy way:

plt.hist(grades, bins=[65,70, 75, 80, 85, 90, 95, 100])

This gives us bin endpoints of 65, 70, 75, and so on. Note that if you do not specify bins, it defaults to 10 evenly spaced bins.
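If you'd rather see the counts as numbers than squint at the bars, numpy's histogram function does the same kind of binning. The sketch below uses a short made-up list instead of the real grades, and builds the bin endpoints with range rather than typing them out.

```python
import numpy as np

# Made-up grades, just to illustrate the binning
sample_grades = [65, 69, 70, 74, 88, 95, 100]

# Endpoints 65, 70, ..., 100 (range stops before 101, so 100 is included)
edges = list(range(65, 101, 5))

counts, _ = np.histogram(sample_grades, bins=edges)

# Each bin includes its left endpoint but not its right one,
# except the last bin, which also includes 100.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print("%d to %d: %d" % (left, right, count))
```

The same edges list can be handed straight to plt.hist as the bins argument.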

At this point you can show your professor the chart and pretty confidently reply that it is not normally distributed. Sad.

Pie Charts in matplotlib.pyplot.

Suppose your professor has tenure and he hates his students. He decides that any student who got less than a 75 fails the class. He asks you to show him how many students passed and how many failed. We're going to count these values using Python (of course!). Note that there are several faster and more efficient ways to do this, but I'll stick to a basic "beginner-friendly" level to do this.

Here's the battle plan: we're going to use a For loop to cycle through the data, and an If statement to say, "If this grade is below 75, add 1 to the value of failed; if not, add 1 to the value of passed." Easy-peasy. Let's look at how this might look:

failed = 0
passed = 0

for grade in grades:
    if grade < 75:
        failed += 1
    else:
        passed += 1

Note that the only new thing here is maybe this "+= 1" business, but that just says, "Add 1 to the variable." Print out failed and passed; they should seem reasonable, and they should add to 100 (for this data).
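For the curious, one of the more compact ways mentioned earlier is to let sum do the counting. This is only a sketch on a made-up list, not the real grades, and the name cutoff is my own.

```python
# Made-up grades for illustration
sample_grades = [60, 74, 75, 80, 91]
cutoff = 75  # grades below this fail

# Count the grades under the cutoff; everyone else passed
failed = sum(1 for grade in sample_grades if grade < cutoff)
passed = len(sample_grades) - failed

print(failed, passed)
```

It does the same job as the loop; the loop is just easier to read when you're starting out.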

Now that we have these values, we can construct a pie chart. We will use the command pie to do this:

plt.pie([passed, failed])

This gives the command to make two chunks: one that stands for the fraction of passed students and the other for failed students. But if you only do this, you won't know which is which! Moreover, this plot isn't especially beautiful.

To tweak it, we're going to first put some labels on it.

plt.pie([passed, failed], labels=["passed","failed"])

Next, we're going to "explode" part of it; that is, pull a piece a little out of the pie. The best way to explain this is to just see what it does:

plt.pie([passed, failed], labels=["passed","failed"], explode = (0,0.05))

Next, a shadow won't add much to the data, but it will make it look slightly cooler. To add a shadow:

plt.pie([passed, failed], labels=["passed","failed"],explode = (0,0.05), shadow = True)

Last, we ought to title this chart, lest the professor forget what it's for. We can do this by using the "title" command after we create the pie chart:

plt.pie([passed, failed], labels=["passed","failed"],explode = (0,0.05), shadow = True)

plt.title('Class Pass/Fail.')

Note that there are a ton of options for you to customize your charts. For now, your professor will probably be content with this.
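As one example of those options, the autopct argument prints the percentage on each wedge. The sketch below uses made-up counts rather than the real ones and saves the picture to a file (the name pass_fail.png is arbitrary) so it runs even without a plotting window.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file; no window needed
import matplotlib.pyplot as plt

# Made-up counts, standing in for the values computed from grades
passed, failed = 70, 30

wedges, labels, percents = plt.pie(
    [passed, failed],
    labels=["passed", "failed"],
    explode=(0, 0.05),
    shadow=True,
    autopct="%1.1f%%",   # print each wedge's percentage with one decimal
)
plt.axis("equal")        # keep the pie circular
plt.title("Class Pass/Fail")
plt.savefig("pass_fail.png")
```

In Spyder you can leave out the matplotlib.use line and the savefig call, and the chart will show up the same way the earlier ones did.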
