Getting Deeper into NumPy and Pandas.

In this post, we're going to look at NumPy's *array* type; this is a fundamental type that is a beefed-up version of Python's *list* type. The array will look like a list of lists, but we'll be able to do some fairly neat things to it with some NumPy commands.

For this part, we will be using **Spyder's Python interpreter**. Open Spyder, go to the *Interpreters* menu, and choose *Open a Python Interpreter*. It should open up *in your console* (but you should still see the big "main" window we usually use; we will not use this in this post). You should see something that looks like this in the console:

>>>

Click after the `>>>` and type in any Python command, such as `print "Hello!"`. You should see that the Python interpreter immediately processes this line and returns "Hello!" to you.

The difference between the main window we usually use and the interpreter is that the interpreter will interpret each line you put in when you press the enter key. The reason we're using the interpreter now is because if we want to look at changes made to a data set, it's a bit more natural feeling (to me, at least) than changing the main code each time and re-interpreting.

Before we do anything, import numpy by typing

>>> import numpy as np

into the interpreter. Note that you do not need to type the > signs, but I will keep them there because they'll appear in the interpreter telling you where you can type commands.

To make a NumPy array, you use the command `np.array()`, where you put your array (a list of lists) inside of the ()'s. As usual, you may save the whole array as a variable. For kicks, let's make a data set and do some things to it. We'll call this `data1`:

>>> data1 = np.array([[1,2,3],[4,5,6]])

Try printing `data1` in the interpreter. Note that in the interpreter we don't need to say `print` before a variable to print it, like this:

>>> data1

There are a few things we can do with this, but let's first find out how big this data set is. We can do this using the commands

>>> np.size(data1)
6
>>> np.shape(data1)
(2, 3)

This shows us that the size of `data1` (the number of elements the dataset has) is 6, as we'd expect, and that the shape of `data1` is `(2,3)`. This might be a bit mysterious at first, but it means that our dataset has 2 rows and 3 columns. In general, the *shape* of the data will be given by `(rows, columns)`. This will become more important later when we import data from external sources and we need to see how many rows and columns our data has.
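Here's a small sketch of the same check written as a script instead of interpreter lines. The variable names (`n_elements`, `n_rows`, `n_cols`) are just ones I've made up for illustration; the `.size` and `.shape` attributes are another way to get at the same numbers.

```python
import numpy as np

data1 = np.array([[1, 2, 3], [4, 5, 6]])

# np.size counts every element; np.shape returns a (rows, columns) tuple.
n_elements = np.size(data1)
n_rows, n_cols = np.shape(data1)

# The array also carries the same information as attributes.
same_size = data1.size
same_shape = data1.shape
```

Unpacking the shape into two names like this is handy when you need the row count and the column count separately.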

Now, let's make another data set. Let's call it `pay` and let's define it as follows:

>>> pay = np.array([[8,8,10,11],[15,16,16,16],[20,24,25,26]])

Again, we can print out `pay` in a pretty format by simply typing in the variable name:

>>> pay
array([[ 8,  8, 10, 11],
       [15, 16, 16, 16],
       [20, 24, 25, 26]])

We'll interpret this as follows. Pretend you own a small company and you want to standardize pay for your employees based on the importance (low-level, mid-level, high-level) and education (high school, bachelor's, master's, and PhD). We'll say that each row corresponds to low, mid, and high, and we'll say each column corresponds to high school, bachelor's, master's, and PhD. For example, if you are working as a mid-level employee and you have a master's, you'll be paid \$16; if you are working as a high-level employee with a high school degree only, you'll be paid \$20. The interpretation isn't super-important, but it will allow me to give a real-world interpretation to the following manipulations.

Suppose that your company gets really popular and you get a ton of extra money. Being a cool boss, you'd like to *add a dollar to each person's pay*. In essence, we'd like to just "add 1 to `pay`", which is exactly what Numpy allows us to do:

>>> pay + 1
array([[ 9,  9, 11, 12],
       [16, 17, 17, 17],
       [21, 25, 26, 27]])

Indeed, this adds 1 to `pay` and prints that value. **Note, this does not change the value of pay**; to do this, we'd need to say `pay = pay + 1` or, in Python shorthand, `pay += 1`.
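If you'd like to convince yourself of the difference, here's a minimal sketch. The names `raised`, `before`, and `after` are mine, purely for illustration.

```python
import numpy as np

pay = np.array([[8, 8, 10, 11], [15, 16, 16, 16], [20, 24, 25, 26]])

# pay + 1 builds a brand-new array; pay itself is left alone.
raised = pay + 1
before = pay[0][0]   # still the original value

# pay += 1 updates pay itself.
pay += 1
after = pay[0][0]    # one more than before
```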

But this seems a bit sloppy. In real life, you'd probably be a slightly less-cool boss (no offence, you seem like a great person and all), and you'd probably only give more money to the high-level employees. But we can't just add 1 now to `pay`. Unfortunate.

Well, we can do a few things here. First, we could just create *another array* which we can add to the first one. Arrays add elementwise (see what happens when you try `pay + pay`) so we'd just create an array that had the same shape as `pay` but only the values we want to add. So,

>>> payincrease = np.array([[0,0,0,0],[0,0,0,0],[1,1,1,1]])
>>> pay = pay + payincrease

This will increase only the last row. Neat. Notice here that we've used two rows of `[0,0,0,0]` and one row of `[1,1,1,1]`; this would be irritating to type out if the rows were longer. Luckily, Python has an easy way to make long lists that contain the same number: the multiplication operator, `*`. It's done like this:

>>> payincrease = np.array([[0]*4, [0]*4,[1]*4])

For example, `[0]*7` will produce `[0,0,0,0,0,0,0]` and `[6]*4` will produce `[6,6,6,6]`. Great. So, we've simplified this pay increase thing a little bit. Unfortunately, if we had an array that had 5000 rows, it would be a bit frustrating to do this sort of thing. Moreover, this kind of thing doesn't quite *feel right*; we're making an *additional structure* just to manipulate our original structure, which seems like a waste of resources.
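To see how this scales (and why it still feels clunky), here's a rough sketch that builds the increase array for an arbitrary number of rows. The sizes `n_rows` and `n_cols` and the list comprehension are my own choices for illustration, not something from earlier in the post.

```python
import numpy as np

n_rows, n_cols = 5, 4   # hypothetical sizes, just for this example

# Every row except the last is all zeros; the last row is all ones.
rows = [[0] * n_cols for _ in range(n_rows - 1)] + [[1] * n_cols]
payincrease = np.array(rows)
```

It works, but we're still building a whole second array the same size as the first one just to bump a single row.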

Instead of creating a new array, maybe we'd like to just edit the original array. To do this, we need to remember a bit about *indexing* for lists in Python. Some nice information can be found here, but we'll just review by example. Create this sample array (just so we're all on the same page here!):

>>> sadness = np.array([[1,2,3],[4,5,6],[7,8,9]])

Remember that all indices in Python start from 0; that means, to access the first row, `[1,2,3]`, we would write

>>> sadness[0]

Note that `sadness[0]` is, itself, an array! Hence, we can access elements from it in the same way. To access the second element (which is index 1) we can write

>>> sadness[0][1]

This says, "From `sadness[0]`, the first row of the array called `sadness`, access the element in index 1, which is the second element." Similarly, if we wanted to access the third row, first element, we would write `sadness[2][0]`.

Okay, here's the cool part. We can alter parts of this array by talking about the index of the part we'd like to change. For example, suppose, as before, we wanted to add the value 1 to each element in *only the last row*. The last row of `sadness` is given by `sadness[2]`, so this command would look like:

>>> sadness[2] += 1

Recall the `+=` part tells Python to add the value after the `+=` and then save that changed variable as the old variable. Hence, this command takes the third row of `sadness` and adds the value 1 to each element.
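As a quick sanity check, here's the same step as a self-contained snippet, with the result I'd expect noted in a comment.

```python
import numpy as np

sadness = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Add 1 to every element of the last row, in place.
sadness[2] += 1

# The first two rows are untouched; the last row is now [8, 9, 10].
last_row = sadness[2].tolist()
```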

Last, suppose we wanted to add another value of 1 to *only the last two columns in the last row* of `sadness`. We *could* try this:

>>> sadness[2] += np.array([0,1,1])

But, like before, this might not be as extendable as we'd like; when working with large data sets, this might be a huge pain. Instead, we can use *slice* notation for lists. The Python notation `SomeList[5:7]` means "return the elements from index 5 to index 7 (but *don't include index 7*) in the list `SomeList`." In general, the slice will contain the number corresponding to the first index, but not the last.

For example, in the list `A = [10,20,30,40,50]`, typing `A[1:4]` would return the list `[20,30,40]`. If you leave out a number from the slice, it will simply start from the beginning or end; for example, `A[:2]` returns `[10,20]` and `A[2:]` returns `[30,40,50]`. If these are a bit confusing, make your own list and try throwing in a few values to see what it returns. What do you think `A[:]` would return?
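Here are those slices collected in one place so you can paste them in and poke at them. The variable names are just labels I've picked.

```python
A = [10, 20, 30, 40, 50]

middle = A[1:4]      # indices 1, 2, 3 (index 4 is left out)
first_two = A[:2]    # from the beginning up to, but not including, index 2
from_third = A[2:]   # from index 2 to the end

# A handy rule of thumb: A[start:stop] has stop - start elements.
middle_length = len(middle)
```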

Getting back to the question before, we wanted to add a value of 1 to only the last two columns in the last row of `sadness`. To get the last row of `sadness`, we would use `sadness[2]`; to get the last two elements, we would use `sadness[2][1:]` since there are three elements in the last row. Later, we'll learn how to use this without worrying too much about how big the array is. For now, this will do:

>>> sadness[2][1:] += 1
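Here's that command in a self-contained sketch. I've also thrown in a version with negative indices, which count from the end; that's my own addition and only one possible way to avoid hard-coding the array's size, not necessarily the approach later posts take.

```python
import numpy as np

sadness = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Last row, from index 1 onwards: 8 and 9 become 9 and 10.
sadness[2][1:] += 1
after_first = sadness[2].tolist()

# The same region described with negative indices:
# -1 is the last row, -2: is the last two elements.
sadness[-1][-2:] += 1
after_second = sadness[2].tolist()
```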

As just a final note, slicing an array along rows and columns at the same time looks a bit different; for example, if we wanted to get just the second and third column of the first and second row of `sadness`, we would put it in like this:

>>> sadness[:2, 1:]

This will seem a bit strange, but the `:2` part says to get the first and second row and the `1:` part says to get the second and third column (remember, `1:` is talking about the *index* 1, which is the second column). We'll delve into this a bit more later when we cover something called *strides* which makes this all a bit easier, but if you're interested the documentation on this sort of thing is kept here.
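To make the comma version concrete, here's a small sketch comparing it with what you get if you chain two slices instead. The names `block` and `chained` are mine.

```python
import numpy as np

sadness = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Rows 0 and 1, columns 1 and 2, all in one go.
block = sadness[:2, 1:]

# Chaining slices does NOT do the same thing: the second slice
# is applied to the rows again, not to the columns.
chained = sadness[:2][1:]
```

Here `block` is the 2-by-2 corner `[[2, 3], [5, 6]]`, while `chained` is just the single row `[[4, 5, 6]]`, which is why the comma matters.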

The old index notation wasn't terrible, right? There were a lot of :'s and all but, for the most part, it got the job done. Suppose, though, that you have a big data set and you only want to return the first, second, and seventh row. What do you do?

To practice, let's first make a data set. Since it doesn't matter too much what our elements are (since this is just for practice) we'll make an array of random numbers. If you've started a fresh interpreter, import numpy again first:

>>> import numpy as np

Next, we wonder: where would a random array command be? You can either google this or use the auto-complete in Spyder to find that numpy has something called `random` associated with it; specifically, it looks like `np.random`. If you put another period at the end (so that it becomes `np.random.` and the auto-complete comes up) you'll see a ton of commands in the `random` section. The `rand` command looks pretty reasonable, so let's use that. The way that we create an array from this is to just put in the dimensions of the array we'd like in the form `np.random.rand(rows, cols)`. For this, let's put in:

>>> data = np.random.rand(8,3)

Now, let's suppose we want the first, second, and seventh row. These correspond to the indices `0, 1, 6`. Here's the punchline: we can simply feed in a list of the row indices we want from our array, and NumPy will spit them out for us!

>>> data[[0,1,6]]

This will return the first, second, and seventh rows from the table. Neat. You might say, "Welp, great, but what if I want the first and third columns?" Worry not, since the same kind of deal will work:

>>> data[:,[0,2]]

Remember that the `:` before the comma means "all of the rows".
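Random numbers are hard to check by eye, so here's a sketch of the same two selections on an array with predictable values. Using `np.arange(24).reshape(8, 3)` is my own substitution for the random array, chosen only so the results are easy to read.

```python
import numpy as np

# 8 rows, 3 columns, filled with 0..23 so each row is easy to recognise.
data = np.arange(24).reshape(8, 3)

# First, second, and seventh rows (indices 0, 1, 6).
rows = data[[0, 1, 6]]

# First and third columns (indices 0, 2) from every row.
cols = data[:, [0, 2]]
```

With this array, `rows` comes out as `[[0, 1, 2], [3, 4, 5], [18, 19, 20]]` and `cols` keeps all 8 rows but only 2 columns.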

At this point, suppose we want the first and last column, but only the odd indexed rows. You might expect that...

>>> data[[1,3,5,7],[0,2]]

would work, **BUT IT DOES NOT WORK**. The reason for this is a little bit beyond this post, but it is a bit of an annoyance (and it seems to be a thorn in a number of sides, judging by the forum questions regarding it). A solution to this problem is given by using the `np.ix_` function (make sure to include the underscore in `ix_`) which makes our indexes work nicely with each other. For example,

>>> data[np.ix_([1,3,5,7],[0,2])]

works exactly how we want. This might be a lot to take in now, but don't worry if it doesn't completely sink in; we'll be using the `ix_` function a lot later, and we'll also be reviewing some ways to index and slice arrays. Just play around with this a bit and you should be fine for now. Note that there are dozens of ways to manipulate arrays like we've been doing, but these are some of the more common methods.
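If you're curious what's going on, here's a rough sketch. As far as I understand it, when you hand NumPy two index lists it tries to pair them up element by element, which is why lists of different lengths fail. I'm again using a predictable `np.arange` array (my substitution) instead of random numbers, and the exception type shown is what recent NumPy versions raise.

```python
import numpy as np

data = np.arange(24).reshape(8, 3)   # 8 rows, 3 columns, values 0..23

# Four row indices and two column indices cannot be paired up.
try:
    data[[1, 3, 5, 7], [0, 2]]
    mismatch_failed = False
except IndexError:
    mismatch_failed = True

# Equal-length lists ARE paired: this picks elements (1, 0) and (3, 2).
pairs = data[[1, 3], [0, 2]]

# np.ix_ asks for every combination of the listed rows and columns.
grid = data[np.ix_([1, 3, 5, 7], [0, 2])]
```

So `pairs` holds just two numbers, while `grid` is the 4-by-2 block we were after in the first place.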

Here are a few exercises for you to practice your NumPy and Python skills. First, make the following two arrays:

happy = np.array([[1,1,1],[2,2,2],[3,3,4]])
sad = np.array([[1,3],[2,4],[5,7],[6,8],[99,100]])

Exercises.

1. Use index notation to print the 2nd column from the 1st row in `happy`.

   Answer: `happy[0][1]`

2. Can you add `happy` and `sad` together? Why or why not?

   Answer: You cannot; they're not the same shape!

3. Add 5 to each element in `sad`.

   Answer: `sad += 5`

4. Subtract 1 from the last element in the last row of `happy`.

   Answer: `happy[2][2] -= 1`

5. Challenge: Make an array using slice notation that looks like `happy` but does not include the first column of each row.

   Answer: `happy[:,1:]`. This might look a bit strange (see the last paragraph in the last section) but the first `:` says to include all of the rows and the `1:` says to only include elements from index 1 onwards; in this case that means only the second and third columns, as we wanted.
