Lab 7 - NumPy arrays, I/O and files
The purpose of this week’s lab is to:
- Familiarize with rank-1 and rank-2 NumPy arrays, and with the concept of vectorization.
- Examine the directory structure and file system on your computer, and understand the concept of absolute and relative paths.
- Read (and write) text files in python.
Note: If you do not have time to finish all exercises (in particular, the programming problems) during the lab time, you can continue working on them later.
If you have any questions about or difficulties with any of the material in this lab, or any of the material covered in the course so far, ask your tutor for help during the lab.
NumPy arrays#
As we saw in last week’s lecture, arrays are provided by the external library NumPy, which stands for ``Numerical Python’’. NumPy is a vast library with a tremendous amount of functionality very useful for writting scientific programs. In this course (and this lab) we restrict ourselves to a small subset of this functionality. In any case, probably the next most natural (and less steep) step to explore more about this functionality is this tutorial from the NumPy documentation.
rank-1 arrays#
Arrays are sequence types. In contrast to lists,
all of its elements are ALWAYS of the same type (e.g., int
s or float
s).
Let us start working with rank-1 arrays, also known as
1-dimensional arrays or simply vectors. Recall that the defining property of
rank-1 array is that you only have to use one index to
access to its entries (as with lists or strings).
Exercise 0 - Warm-up#
The first thing to do is to import the
numpy
module. The de-facto practice (i.e., you will observe many programs doing this) is to rename it as np
when importing:
>>> import numpy as np
Before working with a rank-1 array, one has to generate (create) it. NumPy offers many different ways to generate rank-1 arrays. The function calls below show some examples of these. Run the following in a Python shell and try to understand what each function call does:
In [1]: import numpy as np
In [2]: x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
In [3]: x
Out[3]: array([0. , 0.25, 0.5 , 0.75, 1. ])
In [4]: type(x) # what type of sequence is this?
Out[4]: ...
In [4]: y=np.zeros(5)
In [5]: y
Out[5]: ...
In [6]: z=np.linspace(0.0,10.0,5)
In [7]: z
Out[7]: ...
In [8]: t = np.arange(0.0, 5.0, 0.5)
In [8]: t
Out[8]: ...
As always, you can check the docstrings of these functions
(e.g., help(np.zeros)
in the Python shell)
to find more about these.
Recall that arrays are sequence types. Thus, once the rank-1 array is established, we can set and retrieve individual values, and extract slices, as with lists or strings. We can also iterate over the entries using loops. Run the following in the Python shell to convince yourself that this is indeed true:
In [1]: import numpy as np
In [2]: y=np.zeros(4)
In [3]: y[0]=0.0
In [4]: y[1]=0.01
In [5]: y[2]=0.02
In [6]: y[3]=0.03
In [7]: y
Out[7]: ...
In [8]: y[1:3]
Out[8]: ...
In [9]: y[1:]
Out[9]: ...
In [10]: y[1:-2]
Out[10]: ...
In [11]: for elem in y:
print(elem)
Out[11]: ...
In [12]: for i in range(0,len(y)):
print(y[i])
Out[12]: ...
An important feature of NumPy is that code written in terms of arrays might be vectorized. That is, rewritten/transformed in terms of operations on entire arrays at once (without Python loops). Vectorized code can run much faster than non-vectorized code, specially with large arrays, as we saw in the lecture.
Execute the vectorized operations among arrays
below in a Python shell and figure out what
they are returning:
In [1]: import numpy as np
In [2]: import math
In [3]: def double(x):
return x*2
In [4]: x = np.linspace(0.0, 2*math.pi, 10)
In [5]: x
Out[5]: ...
In [6]: x+10
Out[6]: ...
In [7]: x**2
Out[7]: ...
In [8]: np.sin(x)
Out[8]: ...
In [9]: math.sin(x)
Out[9]: ...
In [10]: double(x)
Out[10]: ...
Exercise 1 - Plotting a function#
Given the mathematical function
1(a). Scalar function evaluations
Write a function evaluate_h(a,b,N)
that returns
two rank-1 arrays of float
s named, say, x
and y
.
The array x
holds N
equispaced points. For example, if a=0
, b=10
, and N=5
,
then x
must have the values 0.0
, 2.5
, 5.0
, 7.5
,
and 10.0
, in this order. On the other hand, the array y
holds x
.
The function MUST first create two rank-1 arrays
of the appropriate size, and then fill them by computing each element,
element by element, with a for
loop.
Hint: the distance among two consecutive
1(b). Vectorized function evaluations
Write a version of the previous function,
evaluate_h_vectorized(a,b,N)
,
that vectorizes evaluate_h(a,b,N)
.
This function must not use Python loops.
To this end, you can use the linspace
or arange
functions from
numpy
to create the entries of the x
array, and evaluate
y
(as we showed in the lecture).
1(c). Write a test function
Write a test function test_evaluate_h_function(evaluate_h_function)
that performs
at least two tests with the evaluate_h_function
function
passed as an argument.
The evaluate_h_function
has the same interface and semantics as
the two functions written in the previous two exercises.
As usual, the test_evaluate_h_function
function should print the
message "All tests passed!"
if all tests passed, or generate
an error otherwise.
Using test_evaluate_h_function
, check that both evaluate_h
and evaluate_h_vectorized
pass the tests that you wrote.
Hint 1: look at the tests functions provided in any of the homeworks for the general pattern of a test function.
Hint 2: look at the documentation of the NumPy
np.allclose
function
1(d). Compare performance (i.e., time to solve the problem)
Using the %timeit
IPython shell magic command (as shown in the lecture
code examples), compare the performance of the functions written in
1(a) versus 1(b) with increasing powers of 10 for N
, e.g.,
1(e). Write a function to generate a plot
Write a function, plot_h(a,b,N)
, that generates a plot of
In order to generate the plot, you can use the matplotlib
library.
If x
, and y
are two rank-1 arrays produced by the functions
written in 1(a) and 1(b), then you can use the following code
to plot the function:
import matplotlib.pyplot as plt
... # generate x and y rank-1 arrays
plt.plot(x,y)
plt.grid()
plt.show()
Note that the plt.grid()
function call shows a grid of vertical and horizontal lines on the plot
on certain x- and y-values, respectively. In general, grids are used to help you to interpret
the quantitative values in a graph.
Using the plot_h
function, visualize the function N
in the interval plot_h(-4,4,5)
, plot_h(-4,4,10)
, plot_h(-4,4,20)
,
plot_h(-4,4,40)
, etc. Can you find a value of N
beyond which
the plot does not apparently change to your eye when you increase N
?
rank-2 arrays#
Warm-up#
rank-2 arrays can be thought of as tables of numbers (also known as “matrices” by mathematicians). They are indexed using two indices, one for the rows, and another one for the columns.
As with rank-1 arrays, there are many ways one can create a rank-2 array. Two possible ways are:
- By converting a list of lists (a.k.a. nested list) into a rank-2 array:
>>> import numpy as np >>> A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]) >>> A >>> ...
- By providing the shape of the rank-2 array
to the
zeros
function:>>> A = np.zeros((2, 3)) >>> A >>> ...
Recall from the lecture that the shape of a rank-2 array is a tuple with the number of elements in each dimension. Here we have 2 rows and 3 columns.
Once created, the shape of an array can be queried like this:
>>>A.shape
>>>...
A word of caution in regards to shapes. The shape of a rank-1
array with n
elements is the tuple (n,)
. On the other hand,
rank-2 arrays might have only one element in some (or all) the
dimensions. Even in this case, they are still rank-2 arrays.
Execute the following instructions to double check this:
>>> import numpy as np
>>> x=np.zeros((3,))
>>> x.shape
...
>>> y=np.zeros((3,1))
>>> y.shape
>>> ...
>>> z=np.zeros((1,3))
>>> z.shape
>>> ...
>>> t=np.zeros((1,1))
>>> t.shape
>>> ...
Once created, the individual entries of a rank-2 array can be
accessed using A[i][j]
or A[i,j]
, where i
and j
denote
the row and column identifiers of the corresponding entry of A
.
One can also use slicing in the column dimension (e.g., A[i,:]
returns
the row i
of A
as a rank-1 array), in the row dimension (e.g., A[:,j]
returns the column j
of A
as a rank-1 array), or both dimensions (e.g., A[0:2,1:3]
returns a rank-2 array with the entries in
[rows 0 or 1] and [columns 1 or 2]).
An important difference among lists and arrays is that array slices create a “view” of the original array, instead of a shallow copy (as we showed in the lectures). This means that if you modify an array view, you will be actually modifying the original array the view was created from. Execute the following Python statements and expressions, and convince yourself that this is the case:
>>> import numpy as np
>>> A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
>>> A
>>> ...
>>> A[0,:]=np.array([12.3,44.5,22.3])
>>> A
>>> ...
>>> A[:,1]=A[:,2]
>>> A
>>> ...
>>> A[1,:]=A[0,:]
>>> A
>>> ...
Exercise 2 - Apply a function to a rank-2 array#
Let
Apply the function for
loop, and store the
result on another rank-2 array
Exercise 3 - Computing the diagonality of a matrix#
A diagonal matrix is a square (n x n) table of numbers M
such that
only elements on the diagonal (where row equals column, that is,
M[i,i]
) are non-zero.
This can be generalised:
Say a matrix M
has diagonality d if M[i,j] != 0
only
in positions such that the absolute difference between
i and j is strictly less than d.
(Thus, a diagonal matrix has diagonality 1.)
For example, this matrix has diagonality 3:
3(a)
Write a function diagonality(matrix)
which takes as argument a
matrix (represented as a rank-2 NumPy array)
and returns the smallest d such that matrix
has
diagonality d.
3(b) (advanced) The diagonality of a matrix can sometime be decreased by rearranging the order of rows and columns. For example, the matrix
np.array([[1, 2, 0, 0],
[0, 0, 5, 6],
[3, 4, 0, 0],
[0, 0, 7, 8]])
has diagonality 3, but swapping the two middle rows results in the matrix
np.array([[1, 2, 0, 0],
[3, 4, 0, 0],
[0, 0, 5, 6],
[0, 0, 7, 8]])
with diagonality 2.
First, work out how to swap a pair of rows, or a pair of columns in a matrix. You probably need to use some temporary variable to hold one of the vectors/values being swapped.
Next, write a function that attempts to minimise the diagonality of a given matrix by repeatedly swapping rows or columns. A simple idea is to look at the result of every possible swap and check if has lower diagonality. However, this may not always be enough, as several swaps may be necessary to reduce the diagonality.
The file system#
Files and directories are an abstraction provided by the operating system (OS) to make it easier for programmers and users to interact with the underlying storage device (hard drive, USB key, network file server, etc).
In a unix system (like the one in the CSIT labs), or a system running
MacOS, the file system
is organised into a single directory tree. That means directories
can contain several (sub-)directories (or none), but each directory
has only one directory directly above it, the one that it is contained
in. (This is often called the “parent directory”.) The top-most
directory in the tree, which is called /
, has no parent
directory. Every file is located in some directory.
If you are working on a computer running Microsoft Windows, each “drive” will be the root of it’s own directory tree. The topmost directory within a drive is the drive letter - followed by a ‘:’, for example “C:”. Within each drive, the parent/sub-directory structure is similar to a MacOS or Unix system, where each directory has a single parent directory, but can contain many sub-directories.
Exercise 4: Navigating the directory structure#
For this exercise, you will need some directories and files to work
with. As we recommended in Lab 1, create a directory called
comp1730
in your home directory (if you haven’t already),
and within that directory create one called lab7
. You can
do this using whichever tool you’re familiar with (the command-line
terminal or the graphical folders tool).
Next, create a file with a few lines of text (including some empty
lines). You can use the editor in the IDE, or any other text editor,
to write the file. (If you don’t know what to write, just copy some
text from this page.) Save your file with the name sample.txt
in the lab7
directory.
When you are done, you should have something like the following:
In the python3 shell, import the os
module:
In [1]: import os
This module provides several useful functions for interacting with the file system; in particular, it can let you know what the current working directory is, and change that. Try the following:
In [2]: os.getcwd()
Out [2]: 'NNNNN'
The value returned will depend on your operating system and where you are running Python from. From here onwards, where we ask you to type “NNNNN” you should use whatever was returned in Out [2].
If you are running Linux or MacOs try:
In [3]: os.chdir('NNNNN/comp1730')
In [4]: os.getcwd()
Out [4]: ...
Where “NNNNN” was the value returned in Out [2]
If you are running Windows try:
In [3]: os.chdir('NNNNN\\comp1730')
In [4]: os.getcwd()
Out [4]: ...
Where again “NNNNN” was the value returned in Out [2]
You should find that the current working directory is now
'NNNNN/comp1730'
. The os.chdir
(“change directory”) function changes it.
The location given in the example above is absolute: it specifies the full path, from the top-level (“root”) directory or drive. A relative path specifies a location in the directory structure relative to the current working directory. Try
In [5]: os.chdir('lab7')
In [6]: os.getcwd()
Out [6]: ...
In [7]: os.chdir('..')
In [8]: os.getcwd()
Out [8]: ...
The path ..
means “the parent directory”. So, for example, if
your current working directory is NNNNN/comp1730/lab7
and you also have a lab1
directory in comp1730
, you can
change to it with
os.chdir('../lab1')
Finally, the os.listdir
function returns a list of the files
and subdirectories in a given directory. If you have created the text
file sample.txt
(and nothing else) in comp1730/lab7
,
then
os.listdir('NNNNN/comp1730/lab7') # Linux or MacOS
os.listdir('NNNNN\\comp1730\\lab7') # Windows
should return ['sample.txt']
, while
os.listdir('..')
will return a list of the subdirectories and files in the parent of the current working directory.
Exercise 5 - Reading text files#
5(a). Reading a text file.
To read the sample text file that you created in python, you can do the following:
In [1]: fileobj = open("sample.txt", "r")
In [2]: fileobj.readline()
Out [2]: ...
In [3]: fileobj.readline()
Out [3]: ...
(This assumes the current working directory is where the file is located,
i.e., 'NNNNN/comp1730/lab7'
. If not, you need to
give the (absolute or relative) path to the file as the first argument
to open
.) You can keep repeating fileobj.readline()
as many times as you wish. Notice that each call returns the next line
in the file: the file object keeps track of the next point in the file
to read from. When you get to the end of the file, readline()
returns an empty string. Also notice that each line has a newline
character ('\n'
) at the end, including empty lines in the file.
When you are done reading the file (whether you have reached the end of it or not), you must always close it:
In [4]: fileobj.close()
5(b). Walking through the contents of the file with a for
loop.
A more convenient way to iterate through the lines of a text file is
using a for
loop. The file object that is returned by the built-in
function open
is iterable, which means that you can use a for loop,
like this:
for line in my_file_obj:
# do something with the line
However, the file object is not a sequence, so you can’t index it, or even ask for its length.
Write a function that takes as argument the path to a file, reads the
file and returns the number of non-empty lines in it. You should use
a for
loop to iterate through the file.
Remember to close the file before the end of the function!
Programming problems#
Note: We don’t expect everyone to finish all these problems during the lab time. If you do not have time to finish these programming problems in the lab, you should continue working on them later (at home, in the CSIT labs after teaching hours, or on one of the computers available in the university libraries or other teaching spaces).
Reading in reverse#
Files can only be read forward. When you read, for example, a line from a text file, the file position advances to the beginning of the next line.
However, you can move the file position, using the method
seek(pos)
on the file object. File position 0
is the beginning of the file. The default is that pos is a
positive integer offset from the beginning of the file, but there
are also other seek modes
(see the documentation). The method tell()
returns the current file position.
For example:
fileobj = open("my_text_file.txt")
line1 = fileobj.readline() # reads the first line
pos_ln_2 = fileobj.tell() # file position of beginning of line 2
line2 = fileobj.readline()
line3 = fileobj.readline()
fileobj.seek(pos_ln_2) # go back
line2b = fileobj.readline() # reads line 2 again
fileobj.seek(0) # go back
line1b = fileobj.readline() # reads line 1 again
You can verify that line2
and line2b
are the same, as are
line1
and line1b
.
Write a program that reads a text file and prints its lines in reverse order. For example, if the file contents are
They sought it with thimbles, they sought it with care;
They pursued it with forks and hope;
They threatened its life with a railway-share;
They charmed it with smiles and soap.
then the output of your program should be
They charmed it with smiles and soap.
They threatened its life with a railway-share;
They pursued it with forks and hope;
They sought it with thimbles, they sought it with care;
- Can you write a program that does this while reading through the file, from beginning to end, only once?
- Can you write a program that does this without storing all lines in memory?
It is possible to do both, but it is not possible to do both at the same time.
Weather data#
Has the winter in Canberra been unusually mild this year?
The Bureau of Meteorology provides free access to historical weather observations, including daily min/max temperatures and rainfall, from its many measuring stations around Australia. Here are the records of daily minimum and maximum temperature from the Canberra Airport station:
Data is not always complete. Records from the Canberra airport station only go back to September of 2008, and also between then and now there are some missing entries.
What is in the files
The files are in CSV format. The columns are:
- Column 0: Product code. This is a “magic” string that describes what kind of data is in this file (“IDCJAC0011” for the daily minimum temperature and “IDCJAC0010” for the daily maximum). It is the same for all rows in a file. You can ignore this column.
- Column 1: Station number. This is where the measurement was taken (070351 is Canberra airport). This is the same for all rows in both files, so you can ignore this too.
- Column 2: Year (an integer).
- Column 3: Month (an integer, 01 to 12).
- Column 4: Day (an integer, 01 to number of days in the month)
- Column 5: Recorded min/max temperature (depending on which file). This is a decimal number. Note that the temperature may be missing, in which case this field is empty.
- Column 6: Number of consecutive days over which the value was recorded. This number should be 1 for all (non-missing) entries in both files, so you don’t need to consider it.
- Column 7: Has this entry been quality checked? ‘Y’ or ‘N’. Entries that have not been checked are not necessarily wrong, it just means they haven’t been quality checked (yet).
Note that the first line in each file is a header (i.e., lists the column names), so data starts from the second line.
Questions
The question that we are trying to answer is whether the 2021 winter has been unusually mild (short and/or warm). To arrive at any kind of answer, you first need to make the question more precise:
- What is “winter” in Canberra? Traditionally, winter is said to be the months of June, July and August.
- How “cold” has the winter been? You could count the number of nights with sub-zero temperature (in each month or over the whole winter), or you could compute an average (of the minimum, or of the maximum, or of the two combined somehow).
- How “long” has the winter been? You could compare the aggregate values (however you define them) between the three months. For example, is the August average minimum temperature significantly higher or lower than what it was in July?
- Is this “unusual”? With more than 10 years of data, you can see how the 2021 values (however you define them) rank compared to other years. Is it the shortest or warmest winter among the last ten?
Testing
Although there may not be a single, right answer to the questions above, there are many answers that are clearly wrong. How do you convince yourself, and others, that your code is correct when no one has given you a solution to check against and the problem has not been precisely specified? There are a number of ways:
-
First, define precisely what each part of your code should be doing. For example, if you write a function to count the number of nights with sub-zero temperature in a given month of a given year, then the value that this function should return is unambiguous, and you can check if it works correctly by counting a few cases by hand.
-
Second, sanity check! The count returned by the function in the example above should never be negative and never greater than the number of days in the month.
-
Third, cross-check. You can implement the same function in several different ways, or different functions that have a known relationship. For example, if you write a function to count the number of nights with sub-zero temperature in a given month of a given year, and another function to count the number of nights with a lowest temperature that is above zero in a given month of a given year, then the sum of the two counts should equal the number of days in that month. (Remember to adjust for leap years if the month is February!)
-
Fourth, ask someone else to look critically at your code, while you do the same with theirs. Ask them to try to find cases where it could fail, or give the wrong answer. After practicing this on someone else’s code, try it with your own. (What happens if the first or last night of the month has negative temperature? What if there is no record for some nights?)
Using Pandas DataFrame#
The best library to analyse data in csv file format is the Pandas DataFrame class, which builds on the numpy array class.
Unlike a numpy array which is homogeneous i.e. same data type for all columns, a DataFrame suports separate types for each column, so both categorical and numerical variables can be read in and processed. Read in a csv file as follows:
import pandas as pd df = pd.read_csv(‘filename.csv’, header=0)
header is 0 indicates column names are given in row 0, as is typically the case for csv files
df.head() # show first rows and column names
See lecture notes for additional examples on indexing DataFrames.