COMP2100/2500
Lecture 22: Shell Programming II
Summary
I work through a more substantial shell scripting example.
Aims
To demonstrate some of the processes behind developing a shell script.
To introduce more features of the Bash language that weren't covered in the previous lecture.
1. The Problem
Since student accounts have a strict quota, and students often run into trouble with it, it would be nice to be able to produce a list of the ten largest files in your account. These would be good places to start when deleting files to get back under quota. So our task is:
Write a bash script which produces a list of the ten largest files in the user's area, together with their sizes.

That's the original statement of the problem, and it's not particularly precise. It will get modified as we go. This is one of the things that happens in any software project: once you understand the problem a little, the requirements must be changed, and they keep on changing throughout the project.
[Richard's comment: Work really hard at getting the requirements right before you start designing and coding. For such a simple problem it is all too easy to start coding right away; jumping in early frequently leads to disaster.]
2. A Rough Strategy
Here's a plan for how to do it. I'm remembering that in the previous lecture we had a way to list (and then count) all files and directories, so I plan to start with that.
List all files and directories.
Weed out the directories to get a list of all files.
Add the size to each item in the list.
Sort this list into descending order of size.
Print the first ten.
Step 1 we know how to do already. We can try it at the command line before we even start writing a script. (But note the caveat in the previous lecture about the output including blank lines.)
bash$ ls -R
.:
CVS  bigfiles1  bigfiles2  bigfiles3  bigfiles4  bigfiles5  lec03.html

CVS:
Entries  Repository  Root
bash$ ls -R | grep -v ':$'
CVS
bigfiles1
bigfiles2
bigfiles3
bigfiles4
bigfiles5
lec03.html

Entries
Repository
Root

I just ran that in the directory where I'm writing this, so you can see the files I'm working on. The files bigfiles1 ... bigfiles5 are the five stages in the development of the script for this lecture. The CVS directory is where the CVS version control system keeps stuff it needs.
This is working, but it has a few problems. One is that it still lists the directories as well as the ordinary files. That's not too much of a problem, because there is a way to tell the difference; we'll see that later. The bigger problem is that for files in the CVS subdirectory it just lists their names, not the paths to them. And for this script to be useful, the user needs to know where the big files are, not just what their names are.
So we need a way to display the full paths to the files we find. I looked through the manual for the ls command (by typing man ls), but didn't find anything useful. I could imagine having the script sort through the output and, instead of throwing away all the lines ending in a colon, keep them and stick that directory path onto the start of each file name. But it sounds complicated, and there's a better way.
3. The du command
There's a very useful command du which stands for disk usage. Without any options it just lists the size (in something called blocks, which are a different size on different computers) of a directory and all its subdirectories:
bash$ du
4       ./CVS
34      .

Not particularly useful, but looking through the manual page (by typing man du) I found that there's an option -a to have it list sizes of all files as well as directories. Just what we want, and the important thing is that it lists each file with its full path (either from the current directory, or from the directory given on the command line). There's also a -k option which makes sure that the sizes are in blocks of 1024 bytes - K's.
bash$ du -ka
1       ./CVS/Root
1       ./CVS/Repository
1       ./CVS/Entries
4       ./CVS
1       ./bigfiles1
22      ./lec03.html
1       ./bigfiles2
1       ./bigfiles3
1       ./bigfiles4
2       ./bigfiles5
33      .
bash$ du -ka /home/barnes/2001/comp2100/lectures/lec03/
1       /home/barnes/2001/comp2100/lectures/lec03/CVS/Root
1       /home/barnes/2001/comp2100/lectures/lec03/CVS/Repository
1       /home/barnes/2001/comp2100/lectures/lec03/CVS/Entries
4       /home/barnes/2001/comp2100/lectures/lec03/CVS
1       /home/barnes/2001/comp2100/lectures/lec03/bigfiles1
22      /home/barnes/2001/comp2100/lectures/lec03/lec03.html
1       /home/barnes/2001/comp2100/lectures/lec03/bigfiles2
1       /home/barnes/2001/comp2100/lectures/lec03/bigfiles3
1       /home/barnes/2001/comp2100/lectures/lec03/bigfiles4
2       /home/barnes/2001/comp2100/lectures/lec03/bigfiles5
33      /home/barnes/2001/comp2100/lectures/lec03

This is the sort of list we can work with.
4. Removing the sizes
In order to weed out the directories from this list, we actually need to throw away those sizes. We'll get them back later when we only have ordinary files in our list. We can get rid of the sizes using the cut command from last lecture. A quick check on the manual page gives information on the command-line options:
-f, --fields field-list
        Print only the fields listed in field-list. Fields are
        separated by a TAB by default.

-d, --delimiter delim
        For -f, fields are separated by the first character in delim
        instead of by TAB.

(When you type man cut, the information is presented to you by the paging program less. This is useful for reading through large files: you can invoke it by typing less file. When you're inside less, you can use the space bar to move down a page, the Enter key to move down a line, the `u' key to move up a page, `g' to go to the top of the file, and `G' to go to the end. You can also search for a word by typing `/' followed by the word. This will highlight all occurrences, and you can then move from one to the next by typing `n'. You get out by typing `q'.)
OK, so it looks like the numbers are separated from the file and directory names by TABs, so we can try
bash$ du -ka | cut -f2
./CVS/Root
./CVS/Repository
./CVS/Entries
./CVS
./bigfiles1
./lec03.html
./bigfiles2
./bigfiles3
./bigfiles4
./bigfiles5
.

This is what we need for the next step.
5. Weeding out the directories
Some of the lines we have there are directories rather than files, so we want to get rid of them and leave a list of just the files. At this point I think it's time to stop playing around interactively and start writing my script. The first version, with nothing new in it, is:
#!/bin/bash

# Get a list of all files (and directories etc)
files=$(du -ka | cut -f2)

# Print the list
echo $files

When I run this (after remembering to make it executable), I get:
bash$ bigfiles1
./CVS/Root ./CVS/Repository ./CVS/Entries ./CVS ./bigfiles1 ./le
c03.html ./bigfiles2 ./bigfiles3 ./bigfiles4 ./bigfiles5 .

As you can see, the result looks a little different, but that's OK. It's a list of all the files and directories, ready for us to do some more work.
Now we want to go through that list, one item at a time, and keep only the real files. How to do this? Back to the manual pages. This time, look at the manual for bash itself. It turns out that there's a very useful builtin command called test, which evaluates an expression and returns a true or false answer as its return code (zero for true, non-zero for false). It can do comparisons similar to boolean expressions in Eiffel. It can also check whether a file exists, whether it is a directory, whether it is an ordinary file, and so on. To make it look even more like ordinary boolean expressions, it has two alternative forms: test expr and [ expr ]. (The spaces around the expression are important in the second form.)
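For example, here are two quick experiments with test at the command line. (The special variable $? holds the return code of the previous command; remember that zero means true.)

bash$ test 5 -lt 10; echo $?
0
bash$ [ "abc" = "def" ]; echo $?
1

The first is a numeric comparison (5 is less than 10, so the answer is true, i.e. zero); the second is a string comparison which fails, giving a non-zero return code.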
The command [ -d file ] returns 0 (true) if file is a directory, non-zero otherwise. Similarly [ -f file ] tells whether file is a normal file or not, and [ -e file ] simply tells whether file exists. We can use a for loop to examine the files on our list one by one, and use this test to select only the real files. Here's the second version of the script.
#!/bin/bash

# Get a list of all files (and directories etc)
files=$(du -ka | cut -f2)
# echo $files

# Write list of all regular files with sizes to a temporary file
tmp=".tmp_bigfiles"
for f in $files
do
    if [ -f $f ]
    then
        du -ka $f >> $tmp
    fi
done

# Print the list
cat $tmp

This excludes not just directories, but any other weird things that might be there. (Is this the right thing to do? Good question. That's where our requirements aren't really precise enough.) Running the new version of the script gives us:
bash$ bigfiles2
1       ./CVS/Root
1       ./CVS/Repository
1       ./CVS/Entries
1       ./bigfiles1
23      ./lec03.html
1       ./bigfiles2
1       ./bigfiles3
1       ./bigfiles4
2       ./bigfiles5

A couple of points. The usual output redirection is with >. Using >> appends output to the end of the file named, rather than overwriting it. I got the sizes back by just calling du again.
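If the difference between > and >> isn't familiar, here's a tiny experiment. (The file name demo.txt is made up purely for illustration.)

bash$ echo first > demo.txt
bash$ echo second > demo.txt
bash$ cat demo.txt
second
bash$ echo third >> demo.txt
bash$ cat demo.txt
second
third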
Now I've made a mistake here, which you'll see when I run the script again:
bash$ bigfiles2
1       ./CVS/Root
1       ./CVS/Repository
1       ./CVS/Entries
1       ./bigfiles1
23      ./lec03.html
1       ./bigfiles2
1       ./bigfiles3
1       ./bigfiles4
2       ./bigfiles5
1       ./CVS/Root
1       ./CVS/Repository
1       ./CVS/Entries
1       ./bigfiles1
1       ./.tmp_bigfiles
24      ./lec03.html
1       ./bigfiles2
1       ./bigfiles3
1       ./bigfiles4
2       ./bigfiles5

Since my script just appends to .tmp_bigfiles, that file is just going to grow and grow. I need to delete it when I'm finished. And in order to be sure that I don't clobber something important, I'd better also check that it doesn't already exist before I do anything to it. So here's version 3.
#!/bin/bash

# Get a list of all files (and directories etc)
files=$(du -ka | cut -f2)
# echo $files

# Make sure we're not about to clobber someone's data
tmp=".tmp_bigfiles"
if [ -e $tmp ]
then
    echo "$tmp already exists - exiting."
    exit 1
fi

# Write list of all regular files with sizes to $tmp
for f in $files
do
    if [ -f $f ]
    then
        du -ka $f >> $tmp
    fi
done

# Print the list
cat $tmp

# Remove the temporary file
rm -f $tmp

The [ -e $tmp ] test checks whether there is already a file with that name. If there is, we'll just stop, rather than messing up any data that might be there. (Chances are that any file with a name starting with ``.tmp'' probably isn't important, but you never know. Better to leave it up to the user to decide what to do.) The command exit 1 terminates execution of our script with return code 1, which says that the script was unsuccessful. This is important: if anyone ever calls this script from another script, they need to be able to check the return value to tell whether it succeeded or failed.
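For instance, a calling script might check that return code like this. (This is just a sketch of a hypothetical caller; $? holds the return code of the most recently executed command.)

#!/bin/bash
# Hypothetical caller: run bigfiles3 and check whether it succeeded
bigfiles3
if [ $? -ne 0 ]
then
    echo "bigfiles3 failed - giving up."
    exit 1
fi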
The other thing I added is the last line. As you know, the rm command removes files. The option -f stands for ``force'': in other words, do it anyway, even if the user has their account set up so that rm usually asks for confirmation. (You can do that by editing your .cshrc and inserting the line alias rm 'rm -i'.)
Let's try it.
bash$ bigfiles3
.tmp_bigfiles already exists - exiting.
bash$ rm .tmp_bigfiles
bash$ bigfiles3
1       ./CVS/Root
1       ./CVS/Repository
1       ./CVS/Entries
1       ./bigfiles1
24      ./lec03.html
1       ./bigfiles2
1       ./bigfiles3
1       ./bigfiles4
2       ./bigfiles5
6. Sorting the list
Our next task is to take that list of file sizes and names, sort it into descending order (of size) and print the first ten. This is much easier than you might think. We don't have to go back to our first year notes and try to code the quick-sort algorithm in bash. The philosophy of shell scripting is to always look for a quick-and-dirty way to do things, and the quickest and dirtiest is to re-use code someone else has already written. The Unix system has hundreds, perhaps thousands of well-written utility programs for all sorts of applications. For most tasks you might want to do, you're probably not the first person to want to do them.
In our case, we want to sort some data - and we're certainly not the first people to want to do that. The sort program exists just for this purpose, and has been tried and tested for many years. No matter how hard you try, you would be very unlikely to write a better general-purpose sorting program. [Richard's note: `tried and tested' doesn't mean all that much. Better to have a proof that the algorithms used are correct.] Again, we check the manual page for the command options:
-n      Restrict the sort key to an initial numeric string, consisting
        of optional blank characters, optional minus sign, and zero or
        more digits with an optional radix character and thousands
        separators (as defined in the current locale), which will be
        sorted by arithmetic value. An empty digit string is treated
        as zero. Leading zeros and signs on zeros do not affect
        ordering.

-r      Reverse the sense of comparisons.

So it looks as if we want sort -rn $tmp. This will read the lines from the file $tmp, sort them into reverse numeric order (i.e. descending order of size) and write the results to standard output. All that remains for us to do is to print the first ten lines. We can do that with a pipe to the head command. So the finished script is:
#!/bin/bash

# Get a list of all files (and directories etc)
files=$(du -ka | cut -f2)

# Make sure we're not about to clobber someone's data
tmp=".tmp_bigfiles"
if [ -e $tmp ]
then
    echo "$tmp already exists - exiting."
    exit 1
fi

# Write list of all regular files with sizes to $tmp
for f in $files
do
    if [ -f $f ]
    then
        du -ka $f >> $tmp
    fi
done

# Sort the list and print the top ten
sort -rn $tmp | head -10

# Remove the temporary file
rm -f $tmp

Here's the result of running it.
bash$ bigfiles4
24      ./lec03.html
2       ./bigfiles5
1       ./bigfiles4
1       ./bigfiles3
1       ./bigfiles2
1       ./bigfiles1
1       ./CVS/Root
1       ./CVS/Repository
1       ./CVS/Entries
bash$ cd ..
bash$ lec03/bigfiles4
29      ./lec02/lec02.html
24      ./lec03/lec03.html
17      ./lec19/lec19.html
16      ./lec26/lec26.html
15      ./lec09/lec09.html
14      ./lec17/lec17.html
14      ./lec11/lec11.html
13      ./lec12/lec12.html
13      ./lec04/lec04.html
12      ./lec23/lec23.html
7. Are we done yet?
This script does pretty much what we wanted, but not quite. The initial requirements said list the ten biggest files in the user's area, but this script lists the ten biggest in the current directory and all its subdirectories. Running it from the user's home directory will do the job required, but it's worth thinking about. Why not extend it to allow the user to specify a directory on the command line, but still use the current directory as a default?
It would also be nice to be able to change the number 10 to something else.
To do this, we're going to have to read arguments from the command line. In the previous lecture we learned about the special variables which give access to the command line arguments. Now we're going to use them in a more sophisticated way.
For this we need two new bash commands. The first is shift. This throws away the first command line argument (${1}) and moves all the others down one position, decrementing the value of ${#} at the same time. This is useful for loops which process one command argument at a time.
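As a small illustration, here's a sketch of a script (call it printargs, say - the name is made up) that prints its arguments one per line, consuming them with shift:

#!/bin/bash
# Print each command line argument in turn.
# shift discards ${1} and renumbers the rest, so ${1} is always
# the next unprocessed argument.
while [ ${#} -gt 0 ]
do
    echo "argument: ${1}"
    shift
done

Running it as printargs a b c prints three lines, one per argument.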
The second new bash thing is the case construction, which is a bit like the inspect statement in Eiffel. It tries to match a value against each of a list of patterns in turn, and carries out the action corresponding to the first successful match. The pattern syntax is the same one the shell uses for file names (which means it's not the same as the regular expressions used by grep). (By the way, this kind of pattern matching is called ``globbing''.)
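Here's a minimal sketch of the syntax, classifying a script's first argument against a few glob patterns:

case ${1} in
    -*)     # anything beginning with a minus sign
        echo "looks like an option" ;;
    *.txt)  # a glob pattern: anything ending in .txt
        echo "a text file name" ;;
    *)      # the default: * matches anything
        echo "something else" ;;
esac

Note the ;; that terminates each branch, and the esac (case backwards) that closes the whole construction.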
Here's the enhanced version of the bigfiles script.
#!/bin/bash
# bigfiles
# List the n (or 10) largest files in the directory specified (or the
# current directory by default), and all its subdirectories, together
# with their sizes.
# Version 5
# Ian Barnes, 1 February 2001

# Set default values
debug=
number="-10"

# Parse command line arguments
while [ ${#} -gt 0 ]
do
    case ${1} in
        -*)
            # Assume that what's after the - is a number
            number=${1} ;;
        *)
            # It's a directory
            dir=${1}
    esac
    shift
done
[ $debug ] && echo number=${number}
[ $debug ] && echo dir=${dir}

# Get a list of all files (and directories etc)
if [ ${dir} ]
then
    if [ -d ${dir} ]
    then
        files=$(du -ka ${dir} | cut -f2)
    else
        echo Error: ${dir} is not a directory!
        exit 1
    fi
else
    files=$(du -ka | cut -f2)
fi
[ $debug ] && echo ${files}

# Make sure we don't clobber any data
tmp=".tmp_bigfiles"
if [ -e ${tmp} ]
then
    echo Error: $tmp already exists!
    exit 1
fi

# Write list of all regular files with sizes to $tmp
for f in ${files}
do
    if [ -f ${f} ]
    then
        du -ka ${f} >> ${tmp}
    fi
done
[ $debug ] && cat ${tmp}

# Sort the list and print the top ${number}
sort -rn ${tmp} | head ${number}

# Remove the temporary file
rm -f ${tmp}

[Richard's note: can the code duplication (the du | cut line) be avoided?]
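One possible answer to Richard's question (a sketch only, not necessarily the answer he had in mind): give dir the default value . before testing it, so that a single du | cut line covers both cases.

# Sketch: default dir to the current directory, then use it
# unconditionally, so the du | cut line appears only once
dir=${dir:-.}
if [ ! -d ${dir} ]
then
    echo Error: ${dir} is not a directory!
    exit 1
fi
files=$(du -ka ${dir} | cut -f2)

(The expansion ${dir:-.} gives the value of dir if it is set and non-empty, and . otherwise.)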
There's one other new thing there. I have created a new variable debug, which is initially assigned the empty string (meaning ``False''). If I change the assignment statement at the start of the script to assign any non-empty value, then the script will print a whole lot of helpful information about its progress. The notation
statement1 && statement2

means ``Execute statement1. If the return code is `True' (= `success' = `zero') then execute statement2.'' In other words, it is shorthand for
if statement1
then
    statement2
fi

Similarly you can write
statement1 || statement2

which means ``Execute statement1. If the return code is `False' (= `failure' = `non-zero') then execute statement2.'' This is particularly useful for tasks that might fail, and which should terminate the script if they do.
do something dubious || { echo Failed; exit 1; }

[Richard's note: watch the spacing and punctuation. The space after { is absolutely necessary, as is the ; just before the }. But in this case you don't need the space before the { or the space before the }, because here they are immediately preceded by other shell punctuation. It does make sense (once you get used to it), but remember that these little things make the difference between your script working or not working.]
What I've done is simpler: the test instruction, if given a single string, returns true if the string is non-empty, false otherwise.
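So with debug empty, the test fails and the echo is skipped; assign debug any non-empty string and the messages appear:

debug=            # empty: [ $debug ] is false
[ $debug ] && echo "you won't see this"
debug=yes         # non-empty: [ $debug ] is true
[ $debug ] && echo "now you will"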
Exercises
Exercise 1: Modify the script so that debug mode can be turned on by giving the script a -d option on the command line like this: bigfiles -d -17 ~.
Exercise 2: Modify the script so that it prints a usage message and quits if the user gives the option -h or if there is more than one directory given, or if an unknown option is supplied (i.e. a minus sign followed by something which is not a number, `d' or `h').
Exercise 3*: Modify the script so that it uses sizes in bytes (not K) for sorting and output. (Hint: If you do ls -l file then you will get a long line containing all the information you want, but a lot of other stuff too. The cut program won't do it the way we've used it so far, because the separators aren't tabs or single spaces, but sequences of varying numbers of spaces, which line the output up in columns 8 spaces wide. There are two possibilities I can think of. One is to use cut, but find out about how to cut at particular character (= byte) positions rather than by fields. The other possibility is to learn enough about awk or sed to pull out just the parts you want.)
Richard's notes for those who've made it this far
Ian's notes show you the hard way to do it. Let's see the easy way. The find command does all of the hard work. The new find has options -type and -printf, where the latter allows you to print formatted output, much like the command printf (itself ``borrowed'' from C). The relevant format options which go with -printf include
%k      prints the file size in 1K blocks (rounded up).
%p      prints the file name

(The total number of format options is about 50.) The -type option allows you to specify the file type you are looking for (f is for regular file, but it can also be a directory (d), a link (l), a socket (s), etc.)
The following works on GNU/Linux:

find . -type f -printf "%k\t%p\n" | sort -nr | head -10

[Careful here: there's the `good old' version of find that you'll find on Solaris, and there's GNU find, which has lots more options. You can do the same thing with the old find -- check out the -ls option -- but GNU find has that extra printf and %k stuff that's so useful. Just be aware that using the options that are GNU-find-specific makes your script less portable. As an exercise, use find's -ls option (together with the cut command) to solve the problem in a way that works portably across Solaris and GNU/Linux.]

What are the advantages and disadvantages of doing it my way?
Copyright © 2006, Ian Barnes & Richard Walker, The Australian National University
Version 2006.3, Friday, 5 May 2006, 16:42:29 +1000
Feedback & Queries to
comp2100@cs.anu.edu.au