
Part 2, Section 1:
Getting Deeper into NumPy and Pandas.

NumPy Arrays.


In this post, we're going to look at NumPy's array type; this is a fundamental type that is a beefed-up version of Python's list type. The array will look like a list of lists, but we'll be able to do some fairly neat things to it with some NumPy commands.

For this part, we will be using Spyder's Python interpreter. Open Spyder, go to the Interpreters menu, and choose Open a Python Interpreter. It should open up in your console (you should still see the big "main" window we usually use, but we will not use it in this post). You should see a >>> prompt in the console.


Click after the >>> and type in any Python command, such as print("Hello!"). You should see that the Python interpreter immediately processes this line and prints Hello! back to you.

The difference between the main window we usually use and the interpreter is that the interpreter will interpret each line you put in as soon as you press the enter key. The reason we're using the interpreter now is that, if we want to look at changes made to a data set, it feels a bit more natural (to me, at least) than changing the main code each time and re-interpreting.

Manipulating NumPy Arrays: constructing, adding, and slicing up Arrays.

Before we do anything, import numpy by typing

>>> import numpy as np

into the interpreter. Note that you do not need to type the > signs, but I will keep them there because they'll appear in the interpreter telling you where you can type commands.

To make a NumPy array, you use the command np.array(), where you put your array (a list of lists) inside of the ()'s. As usual, you may save the whole array as a variable. For kicks, let's make a data set and do some things to it. We'll call this data1:

>>> data1 = np.array([[1,2,3],[4,5,6]])

Try printing data1 in the interpreter. Note that in the interpreter we don't need to say print before a variable to print it, like this:

>>> data1

There are a few things we can do with this, but let's first find out how big this data set is. We can do this using the command size, and we can see the shape of the data using the command shape, like so:

>>> np.size(data1)
6
>>> np.shape(data1)
(2, 3)

This shows us that the size of data1 (the number of elements the dataset has) is 6, as we'd expect, and that the shape of data1 is (2,3). This might be a bit mysterious at first, but it means that our dataset has 2 rows and 3 columns. In general, the shape of the data will be given by (rows, columns). This will become more important later when we import data from external sources and we need to see how many rows and columns our data has.
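To make the size and shape calls concrete, here is a short sketch written as a script rather than typed into the interpreter. It also uses the .size and .shape attributes of the array, which the text above does not use but which report the same values.

```python
import numpy as np

data1 = np.array([[1, 2, 3], [4, 5, 6]])

# Total number of elements: 2 rows * 3 columns
total = np.size(data1)

# Shape is reported as (rows, columns)
rows, cols = np.shape(data1)

print(total)   # 6
print(rows)    # 2
print(cols)    # 3

# The same information is available as attributes on the array itself
print(data1.size, data1.shape)
```

Unpacking the shape into rows and cols like this is handy later, when the number of rows and columns is not known ahead of time.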

Now, let's make another data set. Let's call it pay and let's define it as follows:

>>> pay = np.array([[8,8,10,11],[15,16,16,16],[20,24,25,26]])

Again, we can print out pay in a pretty format by simply typing in the variable name:

>>> pay
array([[ 8,  8, 10, 11],
       [15, 16, 16, 16],
       [20, 24, 25, 26]])

We'll interpret this as follows. Pretend you own a small company and you want to standardize pay for your employees based on importance (low-level, mid-level, high-level) and education (high school, bachelor's, master's, and PhD). We'll say that the rows correspond to low, mid, and high, and the columns correspond to high school, bachelor's, master's, and PhD. For example, if you are working as a mid-level employee and you have a master's, you'll be paid $16; if you are working as a high-level employee with a high school degree only, you'll be paid $20. The interpretation isn't super-important, but it will allow me to give a real-world interpretation to the following manipulations.

Suppose that your company gets really popular and you get a ton of extra money. Being a cool boss, you'd like to add a dollar to each person's pay. In essence, we'd like to just "add 1 to pay", which is exactly what NumPy allows us to do:

>>> pay + 1
array([[ 9,  9, 11, 12],
       [16, 17, 17, 17],
       [21, 25, 26, 27]])

Indeed, this adds 1 to pay and prints that value. Note, this does not change the value of pay; to do this, we'd need to say pay = pay + 1 or, in Python shorthand, pay += 1.
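Here is a small sketch of the difference between pay + 1 and pay += 1. The name raised is just a placeholder for this example; it does not appear anywhere else in the post.

```python
import numpy as np

pay = np.array([[8, 8, 10, 11], [15, 16, 16, 16], [20, 24, 25, 26]])

# pay + 1 builds a brand new array; pay itself is left alone
raised = pay + 1
print(pay[0][0])      # 8
print(raised[0][0])   # 9

# pay += 1 updates pay itself
pay += 1
print(pay[0][0])      # 9
```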

But this seems a bit sloppy. In real life, you'd probably be a slightly less-cool boss (no offence, you seem like a great person and all), and you'd probably only give more money to the high-level employees. But we can't just add 1 now to pay. Unfortunate.

Well, we can do a few things here. First, we could just create another array which we can add to the first one. Arrays add elementwise (see what happens when you try pay+pay) so we'd just create an array that had the same shape as pay but only the values we want to add. So,

>>> payincrease = np.array([[0,0,0,0],[0,0,0,0],[1,1,1,1]])
>>> pay = pay + payincrease

This will increase only the last row. Neat. Notice here that we've used two rows of [0,0,0,0] and one row of [1,1,1,1]; this would be irritating to type out if we had more rows. Luckily, Python has an easy way to make long lists that contain the same number: the multiplication operator, *. It's done like this:

>>> payincrease = np.array([[0]*4, [0]*4,[1]*4])

For example, [0]*7 will produce [0,0,0,0,0,0,0] and [6]*4 will produce [6,6,6,6]. Great. So, we've simplified this pay increase thing a little bit. Unfortunately, if we had an array that had 5000 rows, it would be a bit frustrating to do this sort of thing. Moreover, this kind of thing doesn't quite feel right; we're making an additional structure just to manipulate our original structure, which seems like a waste of resources.
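As a rough sketch of how far the * trick can be pushed, the same idea can build the whole increase array from the shape of pay, so nothing has to be typed out by hand. The names n_rows, n_cols, rows_of_zeros, row_of_ones, and new_pay are made up for this example, and joining two lists with + is plain Python list behaviour rather than anything from NumPy.

```python
import numpy as np

pay = np.array([[8, 8, 10, 11], [15, 16, 16, 16], [20, 24, 25, 26]])

n_rows, n_cols = np.shape(pay)

# (n_rows - 1) rows of zeros, followed by a single row of ones.
# The outer * repeats the same inner list, which is harmless here
# because np.array copies the values into a new array.
rows_of_zeros = [[0] * n_cols] * (n_rows - 1)
row_of_ones = [[1] * n_cols]
payincrease = np.array(rows_of_zeros + row_of_ones)

new_pay = pay + payincrease
print(payincrease)
print(new_pay)
```

This still builds a second array the same size as pay, so the objection about wasted resources applies to it just as much.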

Instead of creating a new array, maybe we'd like to just edit the original array. To do this, we need to remember a bit about indexing for lists in Python. Some nice information can be found here, but we'll just review by example. Create this sample array (just so we're all on the same page here!):

>>> sadness = np.array([[1,2,3],[4,5,6],[7,8,9]])

Remember that all indices in Python start from 0; that means, to access the first row, [1,2,3], we would write

>>> sadness[0]

Note that sadness[0] is, itself, an array! Hence, we can access elements from it in the same way. To access the second element (which is index 1) we can write

>>> sadness[0][1]

This says, "From sadness[0], the first row of the array called sadness, access the element in index 1, which is the second element." Similarly, if we wanted to access the third row, first element, we would write sadness[2][0].

Okay, here's the cool part. We can alter parts of this array by talking about the index of the part we'd like to change. For example, suppose, as before, we wanted to add the value 1 to each element in only the last row. The last row of sadness is given by sadness[2], so this command would look like:

>>> sadness[2] += 1

Recall the += part tells Python to add the value after the += and then save that changed variable as the old variable. Hence, this command takes the third row of sadness and adds the value 1 to each element.
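Putting the indexing steps together, a minimal sketch looks like this. The variable names on the left are only there so the values can be printed and checked.

```python
import numpy as np

sadness = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

first_row = sadness[0]             # the row [1, 2, 3]
second_in_first = sadness[0][1]    # 2
first_in_third = sadness[2][0]     # 7

# Add 1 to every element of the last row, changing sadness itself
sadness[2] += 1

print(first_row)
print(second_in_first, first_in_third)
print(sadness[2])                  # the last row is now 8, 9, 10
```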

Last, suppose we wanted to add another value of 1 to only the last two columns in the last row of sadness. We could try this:

>>> sadness[2] += np.array([0,1,1])

But, like before, this might not be as extendable as we'd like; when working with large data sets, this might be a huge pain. Instead, we can use slice notation for lists. The Python notation SomeList[5:7] means "return the elements from index 5 to index 7 (but don't include index 7) in the list SomeList." In general, the slice will contain the number corresponding to the first index, but not the last.

For example, in the list A = [10,20,30,40,50], typing A[1:4] would return the list [20,30,40]. If you leave out a number from the slice, it will simply start from the beginning or go to the end; for example, A[:2] returns [10,20] and A[2:] returns [30,40,50]. If these are a bit confusing, make your own list and try throwing in a few values to see what it returns. What do you think A[:] would return?
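The slices mentioned here can be checked with a few lines of plain Python (no NumPy needed). The variable names middle, front, and back are just labels for this sketch.

```python
A = [10, 20, 30, 40, 50]

middle = A[1:4]   # indices 1, 2, 3 (index 4 is left out)
front = A[:2]     # from the beginning up to, but not including, index 2
back = A[2:]      # from index 2 to the end

print(middle)     # [20, 30, 40]
print(front)      # [10, 20]
print(back)       # [30, 40, 50]

# A slice from index i to index j holds j - i elements
print(len(middle))   # 3
```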

Getting back to the question before, we wanted to add a value of 1 to only the last two columns in the last row of sadness. To get the last row of sadness, we would use sadness[2]; to get the last two elements, we would use sadness[2][1:] since there are three elements in the last row. Later, we'll learn how to use this without worrying too much about how big the array is. For now, this will do:

>>> sadness[2][1:] += 1

As just a final note, slicing two different ways for an array is a bit strange; for example, if we wanted to get just the second and third column of the first and second row of sadness, we would put it in like this:

>>> sadness[:2, 1:]

This will seem a bit strange, but the :2 part says to get the first and second row and the 1: part says to get the second and third column (remember, 1: is talking about the index 1, which is the second column). We'll delve into this a bit more later when we cover something called strides which makes this all a bit easier, but if you're interested the documentation on this sort of thing is kept here.
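A short sketch of the comma notation, starting again from a fresh copy of sadness. It also shows that sadness[2, 1:] picks out the same elements as sadness[2][1:]; the name block is only used for this example.

```python
import numpy as np

sadness = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Rows 0 and 1, columns 1 and 2
block = sadness[:2, 1:]
print(block)          # rows [2, 3] and [5, 6]

# Row 2, columns 1 and 2: the same elements as sadness[2][1:]
sadness[2, 1:] += 1
print(sadness[2])     # the last row is now 7, 9, 10
```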

Fancy Indexing.

The old index notation wasn't terrible, right? There were a lot of :'s and all but, for the most part, it got the job done. Suppose, though, that you have a big data set and you only want to return the first, second, and seventh row. What do you do?

To practice, let's first make a data set. Since it doesn't matter too much what our elements are (since this is just for practice) we'll make an array with random elements. First, as usual, let's import numpy.

>>> import numpy as np

Next, we wonder: where would a random array command be? You can either google this or use the auto-complete in Spyder to find that numpy has something called random associated with it; specifically, it looks like np.random. If you put another period at the end (so that it becomes np.random. and the auto-complete comes up) you'll see a ton of commands in the random section. The rand command looks pretty reasonable, so let's use that. The way that we create an array from this is to just put in the dimensions of the array we'd like in the form np.random.rand(rows, cols). For this, let's put in:

>>> data = np.random.rand(8,3)

Now, let's suppose we want the first, second, and seventh row. These correspond to the indexes 0, 1, 6. Here's the punchline: we can simply feed in a list of the row indices we want from our array, and NumPy will spit them out for us!

>>> data[[0,1,6]]

This will return the first, second, and seventh rows from the table. Neat. You might say, "Welp, great, but what if I want the first and third columns?" Worry not, since the same kind of deal will work:

>>> data[:,[0,2]]

Remember that the : before the comma means "all of the rows".
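Random numbers make it hard to see which rows came back, so here is a sketch of the same two lookups on a predictable array. The array grid is a stand-in made up for this example, with 8 rows and 3 columns; np.arange and reshape are used only to fill it with the numbers 0 through 23 in order.

```python
import numpy as np

# Hypothetical 8-row, 3-column array: row r holds 3*r, 3*r + 1, 3*r + 2
grid = np.arange(24).reshape(8, 3)

# A list of row indices returns those rows, in the order listed
picked_rows = grid[[0, 1, 6]]

# A : for the rows and a list for the columns returns those columns from every row
picked_cols = grid[:, [0, 2]]

print(picked_rows)         # rows starting with 0, 3, and 18
print(picked_cols.shape)   # (8, 2)
```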

At this point, suppose we want the first and last column, but only the odd indexed rows. You might expect that...

>>> data[[1,3,5,7],[0,2]]

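would work, BUT IT DOES NOT WORK. When you hand NumPy two lists like this, it tries to pair them up element by element (row 1 with column 0, row 3 with column 2, and so on) rather than taking every combination, and here the two lists don't even have the same length. The full story is a little bit beyond this post, but it is a bit of an annoyance (and it seems to be a thorn in a number of sides, judging by the forum questions regarding it). A solution to this problem is given by using the np.ix_ function (make sure to include the underscore in ix_) which makes our indexes work nicely with each other. For example,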

>>> data[np.ix_([1,3,5,7],[0,2])]

works exactly how we want. This might be a lot to take in now, but don't worry if it doesn't completely sink in — we'll be using the ix_ function a lot later, and we'll also be reviewing some ways to index and slice arrays. Just play around with this a bit and you should be fine for now. Note that there are dozens of ways to manipulate arrays like we've been doing, but these are more common methods.
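To see both the failure and the fix side by side, here is a sketch on a predictable 8-by-3 array. The array grid and the names paired_ok and sub are made up for this example. The exact error type for the failing line has differed between NumPy versions, so the sketch catches both IndexError and ValueError.

```python
import numpy as np

# Hypothetical 8-row, 3-column array holding 0 through 23 in order
grid = np.arange(24).reshape(8, 3)

# Two plain lists of different lengths cannot be paired up, so this fails
try:
    grid[[1, 3, 5, 7], [0, 2]]
    paired_ok = True
except (IndexError, ValueError):
    paired_ok = False

# np.ix_ asks for every combination of the listed rows and columns
sub = grid[np.ix_([1, 3, 5, 7], [0, 2])]

print(paired_ok)    # False
print(sub.shape)    # (4, 2)
print(sub)
```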

A few exercises for you!

Here's a couple of exercises for you to practice your numpy and Python skills. First, make the following two arrays:

happy = np.array([[1,1,1],[2,2,2],[3,3,4]])
sad = np.array([[1,3],[2,4],[5,7],[6,8],[99,100]])

