Lab 7 - NumPy arrays, I/O and files

The purpose of this week’s lab is to:

Familiarize with rank-1 and rank-2 NumPy arrays, and with the concept of vectorization.
Examine the directory structure and file system on your computer, and understand the concept of absolute and relative paths.
Read (and write) text files in python.

Note: If you do not have time to finish all exercises (in particular, the programming problems) during the lab time, you can continue working on them later.

If you have any questions about or difficulties with any of the material in this lab, or any of the material covered in the course so far, ask your tutor for help during the lab.

NumPy arrays#

As we saw in last week’s lecture, arrays are provided by the external library NumPy, which stands for ``Numerical Python’’. NumPy is a vast library with a tremendous amount of functionality very useful for writting scientific programs. In this course (and this lab) we restrict ourselves to a small subset of this functionality. In any case, probably the next most natural (and less steep) step to explore more about this functionality is this tutorial from the NumPy documentation.

rank-1 arrays#

Arrays are sequence types. In contrast to lists, all of its elements are ALWAYS of the same type (e.g., ints or floats). Let us start working with rank-1 arrays, also known as 1-dimensional arrays or simply vectors. Recall that the defining property of rank-1 array is that you only have to use one index to access to its entries (as with lists or strings).

Exercise 0 - Warm-up#

The first thing to do is to import the numpy module. The de-facto practice (i.e., you will observe many programs doing this) is to rename it as np when importing:

>>> import numpy as np

Before working with a rank-1 array, one has to generate (create) it. NumPy offers many different ways to generate rank-1 arrays. The function calls below show some examples of these. Run the following in a Python shell and try to understand what each function call does:

In [1]: import numpy as np

In [2]: x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

In [3]: x
Out[3]: array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [4]: type(x) # what type of sequence is this?
Out[4]: ...

In [4]: y=np.zeros(5)

In [5]: y
Out[5]: ...

In [6]: z=np.linspace(0.0,10.0,5)

In [7]: z
Out[7]: ...

In [8]: t = np.arange(0.0, 5.0, 0.5)

In [8]: t
Out[8]: ...

As always, you can check the docstrings of these functions (e.g., help(np.zeros) in the Python shell) to find more about these.

Recall that arrays are sequence types. Thus, once the rank-1 array is established, we can set and retrieve individual values, and extract slices, as with lists or strings. We can also iterate over the entries using loops. Run the following in the Python shell to convince yourself that this is indeed true:

In [1]: import numpy as np

In [2]: y=np.zeros(4)

In [3]: y[0]=0.0

In [4]: y[1]=0.01

In [5]: y[2]=0.02

In [6]: y[3]=0.03

In [7]: y
Out[7]: ...

In [8]: y[1:3]
Out[8]: ...

In [9]: y[1:]
Out[9]: ...

In [10]: y[1:-2]
Out[10]: ...

In [11]: for elem in y:
             print(elem)
Out[11]: ...

In [12]: for i in range(0,len(y)):
             print(y[i])
Out[12]: ...

An important feature of NumPy is that code written in terms of arrays might be vectorized. That is, rewritten/transformed in terms of operations on entire arrays at once (without Python loops). Vectorized code can run much faster than non-vectorized code, specially with large arrays, as we saw in the lecture.

Execute the vectorized operations among arrays below in a Python shell and figure out what
they are returning:

In [1]: import numpy as np
In [2]: import math

In [3]: def double(x):
             return x*2

In [4]: x = np.linspace(0.0, 2*math.pi, 10)

In [5]: x 
Out[5]: ...

In [6]: x+10 
Out[6]: ...

In [7]: x**2 
Out[7]: ...

In [8]: np.sin(x) 
Out[8]: ...

In [9]: math.sin(x) 
Out[9]: ...

In [10]: double(x) 
Out[10]: ...

Exercise 1 - Plotting a function#

Given the mathematical function

h (x) = \frac{1}{\sqrt{2 π}} e^{- \frac{1}{2} x^{2}}

1(a). Scalar function evaluations

Write a function evaluate_h(a,b,N) that returns two rank-1 arrays of floats named, say, x and y. The array x holds $x$ values obtained from splitting the interval $[a, b]$ into N equispaced points. For example, if a=0, b=10, and N=5, then x must have the values 0.0, 2.5, 5.0, 7.5, and 10.0, in this order. On the other hand, the array y holds $h (x)$ evaluations at the corresponding values of the array x.

The function MUST first create two rank-1 arrays of the appropriate size, and then fill them by computing each element, element by element, with a for loop.

Hint: the distance among two consecutive $x$ values in the $[a, b]$ interval is given by $\frac{b - a}{N - 1}$ (Think why!)

1(b). Vectorized function evaluations

Write a version of the previous function, evaluate_h_vectorized(a,b,N), that vectorizes evaluate_h(a,b,N). This function must not use Python loops. To this end, you can use the linspace or arange functions from numpy to create the entries of the x array, and evaluate $h (x)$ for an array argument to create the corresponding ones of y (as we showed in the lecture).

1(c). Write a test function

Write a test function test_evaluate_h_function(evaluate_h_function) that performs at least two tests with the evaluate_h_function function passed as an argument. The evaluate_h_function has the same interface and semantics as the two functions written in the previous two exercises. As usual, the test_evaluate_h_function function should print the message "All tests passed!" if all tests passed, or generate an error otherwise.

Using test_evaluate_h_function, check that both evaluate_h and evaluate_h_vectorized pass the tests that you wrote.

Hint 1: look at the tests functions provided in any of the homeworks for the general pattern of a test function.

Hint 2: look at the documentation of the NumPy np.allclose function

1(d). Compare performance (i.e., time to solve the problem)

Using the %timeit IPython shell magic command (as shown in the lecture code examples), compare the performance of the functions written in 1(a) versus 1(b) with increasing powers of 10 for N, e.g., $10$ , $10^{2}$ , $\dots$ , $10^{6}$ . Accordingly to your measurements, which one is faster?

1(e). Write a function to generate a plot

Write a function, plot_h(a,b,N), that generates a plot of $h (x)$ within the interval $[a, b]$ . It can use any of the two functions written in 1(a) and 1(b).

In order to generate the plot, you can use the matplotlib library. If x, and y are two rank-1 arrays produced by the functions written in 1(a) and 1(b), then you can use the following code to plot the function:

import matplotlib.pyplot as plt
... # generate x and y rank-1 arrays
plt.plot(x,y)
plt.grid()
plt.show()

Note that the plt.grid() function call shows a grid of vertical and horizontal lines on the plot on certain x- and y-values, respectively. In general, grids are used to help you to interpret the quantitative values in a graph.

Using the plot_h function, visualize the function $h (x)$ using increasing values of N in the interval $[- 4, 4]$ . For example, you can use plot_h(-4,4,5), plot_h(-4,4,10), plot_h(-4,4,20), plot_h(-4,4,40), etc. Can you find a value of N beyond which the plot does not apparently change to your eye when you increase N?

rank-2 arrays#

Warm-up#

rank-2 arrays can be thought of as tables of numbers (also known as “matrices” by mathematicians). They are indexed using two indices, one for the rows, and another one for the columns.

As with rank-1 arrays, there are many ways one can create a rank-2 array. Two possible ways are:

By converting a list of lists (a.k.a. nested list) into a rank-2 array:

>>> import numpy as np
>>> A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
>>> A 
>>> ...

By providing the shape of the rank-2 array to the zeros function:
```
>>> A = np.zeros((2, 3))
>>> A
>>> ... 
```
Recall from the lecture that the shape of a rank-2 array is a tuple with the number of elements in each dimension. Here we have 2 rows and 3 columns.

Once created, the shape of an array can be queried like this:

>>>A.shape
>>>...

A word of caution in regards to shapes. The shape of a rank-1 array with n elements is the tuple (n,). On the other hand, rank-2 arrays might have only one element in some (or all) the dimensions. Even in this case, they are still rank-2 arrays. Execute the following instructions to double check this:

>>> import numpy as np
>>> x=np.zeros((3,))
>>> x.shape
...
>>> y=np.zeros((3,1))
>>> y.shape 
>>> ...

>>> z=np.zeros((1,3))
>>> z.shape 
>>> ...

>>> t=np.zeros((1,1))
>>> t.shape 
>>> ...

Once created, the individual entries of a rank-2 array can be accessed using A[i][j] or A[i,j], where i and j denote the row and column identifiers of the corresponding entry of A. One can also use slicing in the column dimension (e.g., A[i,:] returns the row i of A as a rank-1 array), in the row dimension (e.g., A[:,j] returns the column j of A as a rank-1 array), or both dimensions (e.g., A[0:2,1:3] returns a rank-2 array with the entries in [rows 0 or 1] and [columns 1 or 2]).

An important difference among lists and arrays is that array slices create a “view” of the original array, instead of a shallow copy (as we showed in the lectures). This means that if you modify an array view, you will be actually modifying the original array the view was created from. Execute the following Python statements and expressions, and convince yourself that this is the case:

  >>> import numpy as np
  >>> A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  >>> A 
  >>> ...
  >>> A[0,:]=np.array([12.3,44.5,22.3])
  >>> A 
  >>> ...
  >>> A[:,1]=A[:,2]
  >>> A 
  >>> ...
  >>> A[1,:]=A[0,:]
  >>> A 
  >>> ...

Exercise 2 - Apply a function to a rank-2 array#

Let $A$ be the two-dimensional array:

[\begin{array}{cccc} 0.2 & 12.4 & - 1.0 & 5.78 \\ - 1.34 & - 1.5 & - 1.5 & 0.1 \\ 1.43 & 5.23 & 5.45 & - 2.23 \end{array}]

Apply the function $h (x)$ in Exercise 1 to each entry of $A$ , element by element, using a nested for loop, and store the result on another rank-2 array $B$ of the same shape as $A$ . Then, apply the function $h (x)$ directly to $A$ and demonstrate that at the end the result of the two methods (e.g., non-vectorized versus vectorized) is the same.

Exercise 3 - Computing the diagonality of a matrix#

A diagonal matrix is a square (n x n) table of numbers M such that only elements on the diagonal (where row equals column, that is, M[i,i]) are non-zero. This can be generalised: Say a matrix M has diagonality d if M[i,j] != 0 only in positions such that the absolute difference between i and j is strictly less than d. (Thus, a diagonal matrix has diagonality 1.) For example, this matrix has diagonality 3:

a matrix with diagonality 3

3(a) Write a function diagonality(matrix) which takes as argument a matrix (represented as a rank-2 NumPy array) and returns the smallest d such that matrix has diagonality d.

3(b) (advanced) The diagonality of a matrix can sometime be decreased by rearranging the order of rows and columns. For example, the matrix

np.array([[1, 2, 0, 0],
 [0, 0, 5, 6],
 [3, 4, 0, 0],
 [0, 0, 7, 8]])

has diagonality 3, but swapping the two middle rows results in the matrix

np.array([[1, 2, 0, 0],
 [3, 4, 0, 0],
 [0, 0, 5, 6],
 [0, 0, 7, 8]])

with diagonality 2.

First, work out how to swap a pair of rows, or a pair of columns in a matrix. You probably need to use some temporary variable to hold one of the vectors/values being swapped.

Next, write a function that attempts to minimise the diagonality of a given matrix by repeatedly swapping rows or columns. A simple idea is to look at the result of every possible swap and check if has lower diagonality. However, this may not always be enough, as several swaps may be necessary to reduce the diagonality.

The file system#

Files and directories are an abstraction provided by the operating system (OS) to make it easier for programmers and users to interact with the underlying storage device (hard drive, USB key, network file server, etc).

In a unix system (like the one in the CSIT labs), or a system running MacOS, the file system is organised into a single directory tree. That means directories can contain several (sub-)directories (or none), but each directory has only one directory directly above it, the one that it is contained in. (This is often called the “parent directory”.) The top-most directory in the tree, which is called /, has no parent directory. Every file is located in some directory.

If you are working on a computer running Microsoft Windows, each “drive” will be the root of it’s own directory tree. The topmost directory within a drive is the drive letter - followed by a ‘:’, for example “C:”. Within each drive, the parent/sub-directory structure is similar to a MacOS or Unix system, where each directory has a single parent directory, but can contain many sub-directories.

Exercise 4: Navigating the directory structure#

For this exercise, you will need some directories and files to work with. As we recommended in Lab 1, create a directory called comp1730 in your home directory (if you haven’t already), and within that directory create one called lab7. You can do this using whichever tool you’re familiar with (the command-line terminal or the graphical folders tool).

Next, create a file with a few lines of text (including some empty lines). You can use the editor in the IDE, or any other text editor, to write the file. (If you don’t know what to write, just copy some text from this page.) Save your file with the name sample.txt in the lab7 directory.

When you are done, you should have something like the following:

screenshot showing folders and files

In the python3 shell, import the os module:

In [1]: import os

This module provides several useful functions for interacting with the file system; in particular, it can let you know what the current working directory is, and change that. Try the following:

In [2]: os.getcwd()
Out [2]: 'NNNNN'

The value returned will depend on your operating system and where you are running Python from. From here onwards, where we ask you to type “NNNNN” you should use whatever was returned in Out [2].

If you are running Linux or MacOs try:

In [3]: os.chdir('NNNNN/comp1730')
In [4]: os.getcwd()
Out [4]: ...

Where “NNNNN” was the value returned in Out [2]

If you are running Windows try:

In [3]: os.chdir('NNNNN\\comp1730')
In [4]: os.getcwd()
Out [4]: ...

Where again “NNNNN” was the value returned in Out [2]

You should find that the current working directory is now 'NNNNN/comp1730'. The os.chdir (“change directory”) function changes it.

The location given in the example above is absolute: it specifies the full path, from the top-level (“root”) directory or drive. A relative path specifies a location in the directory structure relative to the current working directory. Try

In [5]: os.chdir('lab7')
In [6]: os.getcwd()
Out [6]: ...
In [7]: os.chdir('..')
In [8]: os.getcwd()
Out [8]: ...

The path .. means “the parent directory”. So, for example, if your current working directory is NNNNN/comp1730/lab7 and you also have a lab1 directory in comp1730, you can change to it with

os.chdir('../lab1')

Finally, the os.listdir function returns a list of the files and subdirectories in a given directory. If you have created the text file sample.txt (and nothing else) in comp1730/lab7, then

os.listdir('NNNNN/comp1730/lab7')  # Linux or MacOS
os.listdir('NNNNN\\comp1730\\lab7')  # Windows

should return ['sample.txt'], while

os.listdir('..')

will return a list of the subdirectories and files in the parent of the current working directory.

Exercise 5 - Reading text files#

5(a). Reading a text file.

To read the sample text file that you created in python, you can do the following:

In [1]: fileobj = open("sample.txt", "r")
In [2]: fileobj.readline()
Out [2]: ...
In [3]: fileobj.readline()
Out [3]: ...

(This assumes the current working directory is where the file is located, i.e., 'NNNNN/comp1730/lab7'. If not, you need to give the (absolute or relative) path to the file as the first argument to open.) You can keep repeating fileobj.readline() as many times as you wish. Notice that each call returns the next line in the file: the file object keeps track of the next point in the file to read from. When you get to the end of the file, readline() returns an empty string. Also notice that each line has a newline character ('\n') at the end, including empty lines in the file.

When you are done reading the file (whether you have reached the end of it or not), you must always close it:

In [4]: fileobj.close()

5(b). Walking through the contents of the file with a for loop.

A more convenient way to iterate through the lines of a text file is using a for loop. The file object that is returned by the built-in function open is iterable, which means that you can use a for loop, like this:

for line in my_file_obj:
    # do something with the line

However, the file object is not a sequence, so you can’t index it, or even ask for its length.

Write a function that takes as argument the path to a file, reads the file and returns the number of non-empty lines in it. You should use a for loop to iterate through the file.

Remember to close the file before the end of the function!

Programming problems#

Note: We don’t expect everyone to finish all these problems during the lab time. If you do not have time to finish these programming problems in the lab, you should continue working on them later (at home, in the CSIT labs after teaching hours, or on one of the computers available in the university libraries or other teaching spaces).

Reading in reverse#

Files can only be read forward. When you read, for example, a line from a text file, the file position advances to the beginning of the next line.

However, you can move the file position, using the method seek(pos) on the file object. File position 0 is the beginning of the file. The default is that pos is a positive integer offset from the beginning of the file, but there are also other seek modes (see the documentation). The method tell() returns the current file position. For example:

fileobj = open("my_text_file.txt")
line1 = fileobj.readline() # reads the first line
pos_ln_2 = fileobj.tell()  # file position of beginning of line 2
line2 = fileobj.readline()
line3 = fileobj.readline()
fileobj.seek(pos_ln_2)      # go back
line2b = fileobj.readline() # reads line 2 again
fileobj.seek(0)             # go back
line1b = fileobj.readline() # reads line 1 again

You can verify that line2 and line2b are the same, as are line1 and line1b.

Write a program that reads a text file and prints its lines in reverse order. For example, if the file contents are

They sought it with thimbles, they sought it with care;
     They pursued it with forks and hope;
They threatened its life with a railway-share;
     They charmed it with smiles and soap.

then the output of your program should be

     They charmed it with smiles and soap.
They threatened its life with a railway-share;
     They pursued it with forks and hope;
They sought it with thimbles, they sought it with care;

Can you write a program that does this while reading through the file, from beginning to end, only once?
Can you write a program that does this without storing all lines in memory?

It is possible to do both, but it is not possible to do both at the same time.

Weather data#

Has the winter in Canberra been unusually mild this year?

The Bureau of Meteorology provides free access to historical weather observations, including daily min/max temperatures and rainfall, from its many measuring stations around Australia. Here are the records of daily minimum and maximum temperature from the Canberra Airport station:

Data is not always complete. Records from the Canberra airport station only go back to September of 2008, and also between then and now there are some missing entries.

What is in the files

The files are in CSV format. The columns are:

Column 0: Product code. This is a “magic” string that describes what kind of data is in this file (“IDCJAC0011” for the daily minimum temperature and “IDCJAC0010” for the daily maximum). It is the same for all rows in a file. You can ignore this column.
Column 1: Station number. This is where the measurement was taken (070351 is Canberra airport). This is the same for all rows in both files, so you can ignore this too.
Column 2: Year (an integer).
Column 3: Month (an integer, 01 to 12).
Column 4: Day (an integer, 01 to number of days in the month)
Column 5: Recorded min/max temperature (depending on which file). This is a decimal number. Note that the temperature may be missing, in which case this field is empty.
Column 6: Number of consecutive days over which the value was recorded. This number should be 1 for all (non-missing) entries in both files, so you don’t need to consider it.
Column 7: Has this entry been quality checked? ‘Y’ or ‘N’. Entries that have not been checked are not necessarily wrong, it just means they haven’t been quality checked (yet).

Note that the first line in each file is a header (i.e., lists the column names), so data starts from the second line.

Questions

The question that we are trying to answer is whether the 2021 winter has been unusually mild (short and/or warm). To arrive at any kind of answer, you first need to make the question more precise:

What is “winter” in Canberra? Traditionally, winter is said to be the months of June, July and August.
How “cold” has the winter been? You could count the number of nights with sub-zero temperature (in each month or over the whole winter), or you could compute an average (of the minimum, or of the maximum, or of the two combined somehow).
How “long” has the winter been? You could compare the aggregate values (however you define them) between the three months. For example, is the August average minimum temperature significantly higher or lower than what it was in July?
Is this “unusual”? With more than 10 years of data, you can see how the 2021 values (however you define them) rank compared to other years. Is it the shortest or warmest winter among the last ten?

Testing

Although there may not be a single, right answer to the questions above, there are many answers that are clearly wrong. How do you convince yourself, and others, that your code is correct when no one has given you a solution to check against and the problem has not been precisely specified? There are a number of ways:

First, define precisely what each part of your code should be doing. For example, if you write a function to count the number of nights with sub-zero temperature in a given month of a given year, then the value that this function should return is unambiguous, and you can check if it works correctly by counting a few cases by hand.
Second, sanity check! The count returned by the function in the example above should never be negative and never greater than the number of days in the month.
Third, cross-check. You can implement the same function in several different ways, or different functions that have a known relationship. For example, if you write a function to count the number of nights with sub-zero temperature in a given month of a given year, and another function to count the number of nights with a lowest temperature that is above zero in a given month of a given year, then the sum of the two counts should equal the number of days in that month. (Remember to adjust for leap years if the month is February!)
Fourth, ask someone else to look critically at your code, while you do the same with theirs. Ask them to try to find cases where it could fail, or give the wrong answer. After practicing this on someone else’s code, try it with your own. (What happens if the first or last night of the month has negative temperature? What if there is no record for some nights?)

Using Pandas DataFrame#

The best library to analyse data in csv file format is the Pandas DataFrame class, which builds on the numpy array class.

Unlike a numpy array which is homogeneous i.e. same data type for all columns, a DataFrame suports separate types for each column, so both categorical and numerical variables can be read in and processed. Read in a csv file as follows:

import pandas as pd df = pd.read_csv(‘filename.csv’, header=0)

header is 0 indicates column names are given in row 0, as is typically the case for csv files

df.head() # show first rows and column names

See lecture notes for additional examples on indexing DataFrames.

Search this site

Semester 1, 2024: Lab 7