
Part 1, Section 1:
Introduction to Using Python for Data Analysis.

Making and Plotting Some Data.

Getting Started.

First, open up Spyder. Spyder has three main parts: the main window (we'll be typing our program in here), the console in the lower-right (by default), and this amazing thing called the Object Inspector in the upper-right (by default). The Object Inspector looks at what you're typing and instantly pulls up the documentation for it, which is invaluable for beginners. As time goes on you may want to turn it off, but it helps a bunch in the beginning.

For now, we're going to do two mini-projects just to get your feet wet with Python: we're going to construct and plot $\sin(x)$, and we're going to make a histogram from normal (bell-shaped) data. If you're not sure what the normal distribution is, you should check it out before starting. The two main packages we'll be using are numpy, which you will see a lot (it is the scientific computing package for Python), and matplotlib, which is useful for, among other things, graphing. Let's get going!


For both files, we will need to import numpy and the pyplot part of matplotlib. We can do this by typing the following commands at the top of the file:

import numpy as np
import matplotlib.pyplot as plt

These things might look strange, but they're not bad: we're simply importing the packages, and when we want to use something in them we will "call" them by using np for numpy and plt for matplotlib.pyplot. You'll see what we mean by this in a second.

Plotting a Sine Curve.

First, let's try something out. Python, by default, doesn't know anything about the sine function; luckily, numpy knows a lot about it, and we've imported it. Hence, the line

print(np.sin(np.pi / 2))

will print out 1.0 if everything is correct (to try this, click the little green running man at the top, or click on "Run" and "Run" in the menu). What did we do in this line? We've called numpy by typing np first, then we said, "I want to use the sine function from the numpy package," so we typed np.sin(). For kicks, I evaluated this at $\frac{\pi}{2}$, so we put np.pi / 2 inside the ()'s (numpy also knows the value of $\pi$, under the name np.pi). In general, this is how using commands from numpy or matplotlib.pyplot works: you put np or plt, followed by a dot, followed by a function. Spyder has some sweet auto-complete stuff, so if you have an idea of what you want to do you may just try typing it in and see if there's a related function. If not, Google it.
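To see that the "alias, dot, function" pattern is the same for other numpy functions, here is a small sketch. The variable names are made up for illustration; np.sin, np.cos, np.sqrt and np.pi are all part of numpy.

```python
import numpy as np

# Each call is: alias, dot, function name, then the argument in parentheses.
sine_at_half_pi = np.sin(np.pi / 2)   # sine of pi/2
cosine_at_zero = np.cos(0.0)          # cosine of 0
root_of_sixteen = np.sqrt(16.0)       # square root of 16

print(sine_at_half_pi, cosine_at_zero, root_of_sixteen)  # prints 1.0 1.0 4.0
```

Note that pi on its own is not a name Python knows; going through np.pi keeps everything tied to the import at the top of the file.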

That one value of sine was great, but we want to plot sine so we need a lot of values. Let's take 100 values between 0 and $2\pi$. We could think about this like this: "Well, $2\pi = 6.28$..., so if I divide this into a hundred pieces each piece will be about 0.0628..., so I'll have a list like..." Luckily, numpy has a function which automatically splits up an interval using however many points we want! How nice of it. Let's take 100 equally spaced points on the interval $0$ to $2\pi$ and save that partition as the variable x as follows:

x = np.linspace(0, 2 * np.pi, 100)

Again, this calls numpy with np and uses the linspace function, which takes the interval $0$ to $2\pi$ and returns 100 equally spaced points on it, both endpoints included. Cool.
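If you'd like to see what linspace hands back, a tiny case is easier to read than 100 numbers. This is just a sketch; the name small is made up for illustration.

```python
import numpy as np

# Five equally spaced points from 0 to 1, endpoints included:
# the values 0, 0.25, 0.5, 0.75 and 1.
small = np.linspace(0, 1, 5)
print(small)

# The same kind of call as in the text: 100 points from 0 to 2*pi.
x = np.linspace(0, 2 * np.pi, 100)
print(len(x))        # number of points
print(x[1] - x[0])   # spacing between neighbours, which is 2*pi / 99
```

Because both endpoints are included, 100 points leave 99 gaps between them, so the spacing is $2\pi / 99$ rather than $2\pi / 100$.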

Once we have this partition, we'd like to evaluate sine at each point. Luckily, this is easy! Recalling that x is the 100 points equally spaced between 0 and $2\pi$ that we just made, we apply np.sin to our x. Here's what we type:

y = np.sin(x)

This saves our sine values (as an array) to the variable y.
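A quick way to convince yourself that np.sin really worked on every point at once is to poke at the result. The sketch below only uses things we've already met, plus the shape attribute and the max and min methods that every numpy array has.

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)            # one sine value per entry of x

print(x.shape, y.shape)  # both arrays hold 100 numbers
print(y[0])              # sine at x = 0
print(y.max(), y.min())  # close to 1 and -1, though not exactly
```

The largest and smallest values come out just shy of 1 and -1 because none of our 100 points lands exactly on $\frac{\pi}{2}$ or $\frac{3\pi}{2}$.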

Now, we will use matplotlib.pyplot to plot y against x. We use the command plt.plot(x, y), followed by plt.show() to make sure the figure appears. In total, your entire program should look like this:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

plt.plot(x, y)
plt.show()


and this will return a delightful picture showing off the sine curve. Good job!
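If you want the picture to explain itself, pyplot can add axis labels and a title, and it can write the figure to a file. Here is one possible sketch; the label text and the file name sine_curve.png are made up for illustration, and the file is written to whatever folder the program runs in.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

plt.plot(x, y)                 # x on the horizontal axis, sin(x) on the vertical
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("One period of the sine curve")
plt.savefig("sine_curve.png")  # hypothetical file name, saved in the working directory
```

Passing both x and y to plt.plot puts the real $x$ values (0 to $2\pi$) on the horizontal axis; with y alone, pyplot would use the positions 0 to 99 instead.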

Making a Histogram with Normal Data.

Save your previous program and start a new file. We'll start by importing the same two things as before.

import numpy as np
import matplotlib.pyplot as plt

Getting normal-shaped (bell-shaped) data is easy in numpy! To get, say, 100 random numbers from a normal distribution we have the random.randn command. The first "random" is because numpy has a lot of random commands that we can use; the "randn" tells it which particular random command we'd like to use. Let's use this command and save it as a variable N:

N = np.random.randn(100)

The 100 tells numpy that we want 100 of these random numbers. At this point, we can use the histogram command from matplotlib.pyplot:

plt.hist(N, bins=10)

The "bins" here tells hist how many bins to use in our histogram, that is, how many equal-width intervals the data gets sorted into before counting. If you don't know what this means, try using bins=2, bins=20, bins=200 to see the difference. For us, 10 is a reasonable number. In total, your program should look like this:

import numpy as np
import matplotlib.pyplot as plt

N = np.random.randn(100)

plt.hist(N, bins=10)
plt.show()

So run it and check out that histogram. If we want it to look a bit nicer, we can try to generate 10,000 random numbers (what would we change?).
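If you're curious what the bins are doing behind the picture, numpy has its own histogram function, np.histogram, which does the counting without drawing anything. The sketch below is only for poking around; the seed value is arbitrary and is there so the same random numbers come out on every run.

```python
import numpy as np

np.random.seed(0)                 # arbitrary seed so the run is repeatable
N = np.random.randn(100)

for bins in (2, 10, 20):
    counts, edges = np.histogram(N, bins=bins)
    # counts has one entry per bin; edges has one more entry than counts
    print(bins, counts.sum(), len(edges))
```

However many bins we ask for, the counts always add up to 100, because every number lands in exactly one bin. More bins just means each bar covers a narrower slice of the data.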

That's it.

Okay, well, that's not everything, but this is a good start: we know that numpy and matplotlib.pyplot are neat and they have a lot to offer. They're also relatively simple to use. This section was essentially to move you gently towards making bigger and more powerful programs — and, just as important, to see if Python and friends have been set up correctly.
