Alexei B Khorev
April 2013
In this part we shall consider basic algorithms and data structures (DS), including their implementation in Java Collections Framework — JCF:
Algorithms
Data types (interfaces) and data structures (implementations)
JCF: Java’s Collections Framework
Collections interfaces
Implementations of Collections interfaces
Iterator and collection traversal
Maps, hash functions and hash tables
Trees as a data structure, Binary Search Trees
Comparable objects
Generic algorithms
Copies and Views
Algorithms occupy an intermediate position between a solution and a method, both in terms of generality and detail. Their origin is mathematical (algebraic); their essence is the transformation of data (input → output) in a finite number of steps.
Writing computer code often involves solving algorithmic problems. One of these techniques is known as the top-down design (remember Block 1):
Design the algorithm that you intend to implement for each method
Test the resulting program
The top-down design is often used together with the bottom-up design — translating an existing solution (such as a mathematical one) into code and integrating separate parts into one working program. The bottom-up design may also involve data structure selection (or construction); this is called the data-driven design.
A good algorithm always has the virtue of generality: it can be used on a family of data structures, with the specific characteristics it requires reduced to a minimum. An example — sorting an array of N floats:
This problem is a particular case of a more general sorting problem: Given an array of N > 0 objects, return an array which contains these N objects in ascending (descending) order.
It is assumed that objects standing for array elements can be compared, using unspecified criterion (in Java, it means that the object class implements the Comparable interface). When implemented with such generality, the algorithm is said to be generic.
Computer algorithms are characterised by:
To quantify the algorithm efficiency, one has to establish how T(N) — the number of atomic operations (each of which does not depend on N) needed to perform the algorithm — scales with the size of the problem N. The time of execution depends on the hardware platform and the state of the execution environment; it is not an invariant measure. Computer science uses the big-\(\mathcal{O}\) notation (pronounced “big-omicron” by eggheads, “big-o” by pub frequenters) to characterise the efficiency (complexity) of an algorithm. It gives the leading term of the scaling dependency:
\[\mbox{if}\; \lim_{N\to\infty} \frac{T(N)}{f(N)} = \mbox{const},\; \mbox{then}\;\; T(N) = \mathcal{O}\,(f\,(N)),\]
when N becomes large enough.
Viewed as a problem, an algorithm is characterised by complexity
Viewed as a solution, it’s characterised by efficiency
The function f(N) usually has one of the following types of behaviour:
the power law: \(f\,(N) = N^{\beta}\), where \(\beta\) equals 2 for the Selection Sort, and 2.376 for “smart” matrix multiplication, etc. Such polynomial problems are tractable.

the logarithm law: \(f\,(N) = (\log N)^{\beta}\), where \(\beta\) is 1 for a binary search (in a sorted array, in a binary search tree). Such logarithmic problems are easy.

the exponential law: \(f\,(N) = 10^{\beta N}\), where \(\beta\) is 1/3 for the factorisation problem. Such problems with exponential behaviour are hard (intractable). Examples: the travelling salesman problem, set partitioning etc.
A combination (as factors) of the above, eg,
\[f\,(N) = N\cdot\log_{2}N,\;\mbox{for the QuickSort algorithm}\]
\[f\,(N) = N^{3}(\log\,N)^{k},\;\mbox{for the quantum Shor algorithm}\]
Some rigour
To properly understand the \(\mathcal{O}\)-notation one should remember that \(\mathcal{O}(f(N))\) denotes not a single function, but a class of functions with the designated asymptotic behaviour. If we say that a function g(N) has an asymptotic of \(\mathcal{O}(f(N))\), writing \(g(N) = \mathcal{O}(f(N))\), it means not equality, but rather the membership of g(N) in that class.
The statement \(g(N) = \mathcal{O}(f(N))\) has the meaning of an upper bound, ie that the magnitude of g(N) is at most as big as const×f(N) (when N is large). At most! — but it could also be (substantially) smaller. Hence, an algorithm should not be dismissed outright if it is characterised by the efficiency of, say, \(\mathcal{O}(N^{2})\) (like in sorting), because its actual performance can be better. Opposite to an upper-bound is a lower-bound estimate, ie that the function magnitude is at least as large as const×f(N) (N is large). This asymptotic behaviour (a different set) is notated with another symbol, \(\Omega\): \(g(N) = \Omega(f\,(N))\). The \(\Omega\)-properties of algorithms are harder to establish, but they are more valuable.
Even more valuable (and still harder to come by) are asymptotics with both lower and upper bounds: a function g(N) is such that, for sufficiently large N, there exist two constants, C1 and C2 (C1 > C2), with \(C_{2}\cdot f(N) < g(N) < C_{1}\cdot f(N)\). Such asymptotic sets are notated with the symbol \(\Theta :\; g(N) = \Theta(f(N))\).
(Figures: bounded from above, \(g(n)=\mathcal{O}(f(n))\); bounded from below, \(g(n)=\Omega(f(n))\); bounded in between, \(g(n)=\Theta(f(n))\).)
One can prove (something which is intuitively obvious) that:
\[g(n)=\mathcal{O}(f(n))\;\mbox{ and }\;g(n)=\Omega(f(n)) \iff g(n)=\Theta(f(n))\]
The reason why the \(\mathcal{O}\)-notation is used predominantly (when \(\Omega\) would often be more rigorous) — and not because “programmers are stupid” — is explained in a classic paper by Donald Knuth, “Big Omicron and Big Omega and Big Theta”, SIGACT News, Apr.–June 1976, pp. 18–24.
Problem: Given an unsorted array of N elements (doubles), find the smallest element in a subarray of this array.

Algorithm: (trivial; correctness follows from the facts that each element of the array will undergo a comparison with the tentative result, and that comparisons of integers and real numbers are transitive, ie from a > b and b > c it follows that a > c).
Take the first element of the subarray as the tentative result, and compare this number with every element in succession
When all of the index locations have been exhausted, return the tentative result
/** @param data an array of doubles
* @return value of smallest element */
public static double smallest(double[] data, int lo, int hi) {
assert (hi > lo);
int iSmallest = lo; //Initial value
for (int i=lo+1; i<=hi; i++)
if (data[i] < data[iSmallest]) iSmallest = i;
return data[iSmallest];
}
Complexity of this algorithm is \(\mathcal{O}(N)\), evidently, because we have to go from the first element to the very last one in search of the smallest (unsorted nature of the array is crucial).
The SelectionSort algorithm: iterate an index, i, over every index of the array in ascending order. At every step, do the following:

find smallestInRange (the index of the smallest element) between the present array index, i, and the maximum array index; call this index iSmallest

swap the elements at the indices i and iSmallest
Complexity: the number of iterations, ~ N, multiplied by the cost of each smallestInRange() call, also ~ N, hence \(\mathcal{O}(N^{2})\). A more accurate count gives N²/2, which is still \(\mathcal{O}(N^{2})\)! For an almost sorted array the majority of swaps are not needed, but the number of comparisons — and hence the complexity — remains \(\mathcal{O}(N^{2})\).
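A minimal Java sketch of SelectionSort as described above (illustrative; not the code of SelectionSortTest.java):

/** Sorts the array in place: at step i, the smallest element of
 *  the unsorted tail data[i..] is swapped into position i. */
public static void selectionSort(double[] data) {
    for (int i = 0; i < data.length - 1; i++) {
        int iSmallest = i;                      // index of the smallest element seen so far
        for (int j = i + 1; j < data.length; j++)
            if (data[j] < data[iSmallest]) iSmallest = j;
        double tmp = data[i];                   // swap data[i] and data[iSmallest]
        data[i] = data[iSmallest];
        data[iSmallest] = tmp;
    }
}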
Correctness and efficiency of algorithm implementations must be tested both theoretically (by examining the code) and experimentally (by running comprehensive test harnesses with all possible types of input data, and measuring the time-size dependence). The test harnesses (special testing programs — not necessarily written in Java — and input/output data) represent assets of the development as much as the code itself. Correctness is the primary goal of testing, but efficiency can also be important (eg, to check a theoretical conclusion regarding the algorithm complexity).
To check the correctness of our implementation of SelectionSort, do the following (example SelectionSortTest.java):

Create an array, initialise it with random values (between 0 and 1!), sort it, and verify that the output is in ascending order. Repeat the above for many different sizes and initialisations:
To check the efficiency one has to:
Time measurements
OK: use the Unix system utility command time -p when running the Java program (the full path is needed for some shells):
/usr/bin/time -p java SelectionSortTest
Of the three times printed, “user time” gives the time in seconds taken by the program, including “paging” interrupts.
Better: use Date objects from the java.util package, and dump out times in milliseconds. The overhead here is the creation of and calls to Date objects. Also, in a time-sharing system, there is no way to find the time for your particular process alone (in both previous approaches, the overhead effect flattens out for large N).

Best: call the system clock via System.currentTimeMillis(), or create a utility class similar to StopWatch.java to include such calls for calculating elapsed times.
Examples SelectionSortWithTiming.java, ArrayUtil.java and StopWatch.java
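A minimal sketch of such a stop-watch utility based on System.currentTimeMillis() (an assumption of what StopWatch.java roughly looks like, not the actual course class):

/** Measures elapsed wall-clock time in milliseconds. */
public class StopWatch {
    private long start;

    public StopWatch() { reset(); }

    /** Restarts the measurement. */
    public void reset() { start = System.currentTimeMillis(); }

    /** @return milliseconds elapsed since construction or the last reset() */
    public long elapsed() { return System.currentTimeMillis() - start; }
}

Typical use: create a StopWatch just before the call to the sorting method and print elapsed() right after it returns.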
These are two elegant algorithms for solving the sorting problem. Neither is as “naïvely intuitive” as the selection sort or the insertion sort in the next section. The gap sort is an amazingly fast variant of the bubble sort. The essence of the BubbleSort algorithm:
Repeatedly pass through the array, comparing each pair of adjacent elements and swapping them if they are out of order, from element 0 up to the data.length-2 element. The algorithm terminates when it makes one pass with no interchanges.

The gist of this (and of other algorithms which try to improve on the SelectionSort efficiency) is to minimise repetitive swaps. Yet, BubbleSort is still \(\mathcal{O}(N^{2})\) (demonstrate BubbleSortTest.java).
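A minimal sketch of this BubbleSort (illustrative; the course code is BubbleSortTest.java):

/** Repeatedly passes through the array, swapping adjacent elements
 *  that are out of order; stops after a pass with no interchanges. */
public static void bubbleSort(double[] data) {
    boolean swapped = true;
    while (swapped) {
        swapped = false;
        for (int i = 0; i <= data.length - 2; i++) {
            if (data[i] > data[i + 1]) {
                double tmp = data[i]; data[i] = data[i + 1]; data[i + 1] = tmp;
                swapped = true;
            }
        }
    }
}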
The analogous GapSort improves the efficiency, but the reason for this isn’t well understood. The algorithm achieves a performance intermediate between the standard \(\mathcal{O}(N^{2})\) and \(\mathcal{O}(N\cdot\log_{2}N)\) (the actual asymptotic fluctuates). GapSort compares elements which are separated by a “gap” number of elements (rather than by one element as in the bubble sort). The initial gap is taken pretty large — approximately 3/4 of the size of the array. It is reduced every outer iteration by a “magic” factor (∼1.3). The intuition is that this gapping considerably reduces the number of repetitive swaps (demo with GapSortTest.java).
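A minimal GapSort sketch under the stated assumptions (initial gap of about 3/4 of the array length, shrink factor 1.3); it is illustrative and not GapSortTest.java itself:

/** Bubble-sort variant comparing elements a "gap" apart; the gap
 *  shrinks by ~1.3 every outer pass until it reaches 1. */
public static void gapSort(double[] data) {
    int gap = (3 * data.length) / 4;          // initial gap ~ 3/4 of the size
    boolean swapped = true;
    while (gap > 1 || swapped) {
        if (gap > 1) gap = (int) (gap / 1.3); // the "magic" shrink factor
        swapped = false;
        for (int i = 0; i + gap < data.length; i++) {
            if (data[i] > data[i + gap]) {
                double tmp = data[i]; data[i] = data[i + gap]; data[i + gap] = tmp;
                swapped = true;
            }
        }
    }
}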
A simplified algorithm for the insertion sort (inspired by card playing):
For each index i (i = 0..(length-1)), set a temporary variable tmp = data[i]

The left subarray grows from 0 to length-1 and always remains sorted

Insert tmp into the appropriate place k in the left subarray, k = 0..(i-1), in such a way that it remains sorted

The insertion can be done naively, or via binary search (because the array on the left, in which the insertion is done, is sorted). The naïve version is demonstrated in InsertionSortTest.java.
The efficiency of InsertionSort is still \(\mathcal{O}(N^{2})\) for generic (random) inputs. But for almost sorted inputs, it becomes almost \(\mathcal{O}(N)\). This property of InsertionSort is valuable, since the famously efficient QuickSort (see below) suffers deterioration in such cases. By combining the two algorithms, an improved QuickSort can be implemented with much more robust performance.
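A minimal sketch of the naïve insertion (illustrative; see InsertionSortTest.java for the course version):

/** For each i, slides data[i] left into its proper place inside the
 *  already sorted subarray data[0..i-1]. */
public static void insertionSort(double[] data) {
    for (int i = 1; i < data.length; i++) {
        double tmp = data[i];
        int k = i - 1;
        while (k >= 0 && data[k] > tmp) {  // shift larger elements one slot to the right
            data[k + 1] = data[k];
            k--;
        }
        data[k + 1] = tmp;                 // insert tmp at its place
    }
}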
Demo with SortingApplet.
A useful site with major sorting algorithms (animated like the above) including both time and space complexity analyses is Sorting Algorithm Animations.
Methods which call themselves
A basic problem of calculating the factorial of N:
\[N\,! = N\cdot (N-1)\cdot (N-2)\cdot\ldots\cdot 2\cdot 1 = N\cdot (N-1)!\]
One implementation is “old fashioned” iterative:
public static int factorial(int n) {
int r = 1;
for (int i=1; i<=n; i++)
r=r*i;
return r;
}
The recursive implementation:
public static int factorial(int n) {
  assert (n >= 0) : "The factorial is tricky if the argument is negative!";
if (n == 0) return 1;
return n * factorial(n-1);
}
Recursion is a very elegant way of thinking about some algorithms and some recursive algorithms are very fast. However it is possible to write very slow recursive algorithms as well and some care is needed. (For example, you need to be aware that there is an overhead in making all of the method calls in the recursive factorial method.)
The most famous of all (and the first recursive algorithm) is QuickSort, invented by Tony Hoare in 1962, before the very concept of recursion became widely known to computer professionals. The essence of the QuickSort algorithm is divide-and-conquer:
During the algorithm execution (after every partitioning is done) the following invariant must be maintained for an array x[a..b] with the pivot value t = x[m]:

for all i, a <= i < m: x[i] <= x[m], and for all i, m < i <= b: x[m] < x[i]
m = l; t = x[m]; // selecting pivot
for i = [l+1, u] //closed interval
if x[i] < t // before swapping, increment m to give the
swap(++m, i) // left subarray a room for the new element
The “detailed” partitioning algorithm (code is QuickSort.java):
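A compact Java sketch of this partitioning scheme and the recursion around it (illustrative; not the actual QuickSort.java):

/** Sorts x[l..u] (inclusive bounds) using the simple partition above. */
public static void quickSort(double[] x, int l, int u) {
    if (l >= u) return;
    int m = l;                                    // pivot index; pivot value t = x[l]
    double t = x[l];
    for (int i = l + 1; i <= u; i++) {
        if (x[i] < t) {
            m++;                                  // grow the "< t" region,
            double tmp = x[m]; x[m] = x[i]; x[i] = tmp; // then swap x[i] into it
        }
    }
    double tmp = x[l]; x[l] = x[m]; x[m] = tmp;   // put the pivot between the two regions
    quickSort(x, l, m - 1);                       // sort the "< t" part
    quickSort(x, m + 1, u);                       // sort the ">= t" part
}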
QuickSort is fast for generic (random-like) inputs; its efficiency is \(\mathcal{O}(N\cdot\log_{2}N)\).
Yet, when an input becomes (almost) sorted, it degenerates to the “banal” \(\mathcal{O}(N^{2})\). Why?
The actual analysis of a relatively complex algorithm such as QuickSort is rather complicated. Its weak points must be taken care of to prevent performance problems. That’s why an industrial strength implementation (found in good libraries) is more than 200 LOC long (instead of puny 20–30 in an “educational” case like ours.)
Taken as-is, the above algorithm can be implemented into an efficient sorting routine with performance \(\mathcal{O}(N\,\log_{2} N)\) for random inputs. For special cases — identical elements arrays, (almost) sorted arrays etc — the performance will be worse, \(\mathcal{O}(N^{2})\). Code tuning is needed:
The main weakness of QuickSort is sorting a large number of short arrays at the end of the recursion chain. InsertionSort, on the other hand, is very effective if the input is an “array of subarrays” with each subarray unsorted, but with all values in it greater than the largest value in the preceding subarray, and not greater than the smallest value in the following subarray. Such an “array of subarrays” is what quicksort() produces if stopped “prematurely” — recursing only while subarray.length >= cutoff (with cutoff > 2). Completing the sorting with InsertionSort produces a robust routine:
QuickSort(data, 0, data.length - 1, cutoff);
InsertionSort(data);
The hybrid QuickSort was used in the java.util.Arrays.sort() methods (until JDK 7).
The elegant merging algorithm begins with… well, merging two sorted arrays, which preserves the sorted order (the resulting “big” array is also sorted). The merging starts with choosing either the first element of one array or the first element of the other. Then one looks at the next two possible elements and chooses one of them. The coding for this is straightforward but a little bit long-winded. From time to time you will be selecting several elements from one of the arrays before you select any from the other (for an implementation see MergeSortTest.java).
The idea of the merge sort is that you divide the original array into two parts, sort them and then merge them. If you use recursion, you can effectively keep dividing until the initial arrays have size 1, so that they are trivially sorted and all you need to do is merge them. Because it uses the same “divide-and-conquer” principle, the expected performance is the same as for QuickSort, \(\mathcal{O}(N\cdot\log_{2} N)\). A hybrid of MergeSort and InsertionSort — the so-called TimSort — is used in Arrays.sort() (for object arrays) since JDK 7.
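A minimal sketch of the merging step itself (illustrative; not MergeSortTest.java): two already sorted arrays are merged into one sorted result.

/** Merges two sorted arrays into a new sorted array. */
public static double[] merge(double[] a, double[] b) {
    double[] result = new double[a.length + b.length];
    int i = 0, j = 0, k = 0;
    while (i < a.length && j < b.length)          // pick the smaller of the two heads
        result[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < a.length) result[k++] = a[i++];    // copy whatever is left over
    while (j < b.length) result[k++] = b[j++];
    return result;
}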
This is another quintessential computational task — searching in a sorted database, the simplest case of which is a one-dimensional array. The algorithm seems almost trivial:
Find the middle index mid and compare the value of its element with the target:

if target < x[mid], apply the BinarySearch to the left subarray

if target == x[mid], Bingo!

if target > x[mid], apply the BinarySearch to the right subarray
Despite such an almost trivial principle, it’s notoriously hard to achieve a correct implementation. D. Knuth, in Vol. 3, reports that while the first BinarySearch algorithm was published in 1946, the first correct implementation was only achieved in 1962. The helpful concept (for getting it right) is the loop invariant — see the next slide and Ch. 4 of “Programming Pearls” by Jon Bentley.
BinarySearch is a \(\mathcal{O}(\log_{2}N)\) algorithm. It’s often used “inside” other algorithms, eg, for finding roots of nonlinear equations via the Newton’s method.
Implementing an algorithm in (pseudo-)code is often a more complex task than it may seem, even for the simplest cases, like a binary search algorithm. Programs are like mathematical formulae. The latter can be (very) complex. Their validity is ensured by the application of mathematical axioms and rules of logic at every step of formula manipulation.
The algorithm correctness is ensured in a similar way with the addition of an invariant which is a formal expression of the algorithm purpose.
Algorithm’s invariant is a formal assertion about the program state which must be true at every stage of program execution. The invariant assertion involves input, program variables and output. A (single-threaded) program usually contains:
A program A built from these elements, \(A = A_{1}A_{2}\ldots A_{i}\ldots A_{n}\), will be correct if each of them preserves the invariant, and the program itself terminates. (Termination of loops and function calls should be proved independently.)
Correctness of a program (including an algorithm) is established by demonstrating that the full computational chain \(A = A_{1}A_{2}\ldots A_{i}\ldots A_{n}\) (provided it terminates) preserves the invariant. Deduction \(\{P_{i}\}A_{i}\{Q_{i}\}\) is performed using logic and “obvious” mathematics (Hoare’s logic and Dijkstra’s Weakest Precondition).
Choice of the invariant for a particular algorithm is more like an art than science. In the case of binary search in a linear array-like DS, the invariant can be formulated as follows (as taken from Jon Bentley’s book “Programming Pearls”):
mustbe(range) is a short form of “if t is inside the array then it must be in range” (this is an implication; the premise “if t is inside” matters!).

cantbe(range) is short for “t cannot be in range”.

In the for-loop, {mustbe(range)} serves as the loop invariant — an assertion about the program which is true at the start and at the end of every loop iteration.

If the algorithm preserves the invariant (and if it terminates), we can be (“almost”) assured that the result is correct. The algorithm analysis (“walk-through”) must establish that the invariant is never broken at each execution step (including the proof that the algorithm halts at some point).
Again
mustbe(range) means that the search target t must be inside the range l...u

cantbe(range) means that t cannot be in range

Here is the BinarySearch pseudo-code with the invariant:
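A minimal Java rendering of that pseudo-code (a sketch: the invariant appears in the comments, and the overflow-safe midpoint is used):

/** @return an index of t in the sorted array x, or -1 if t is absent */
public static int binarySearch(double[] x, double t) {
    int l = 0, u = x.length - 1;
    // invariant mustbe(l..u): if t is in the array, then t is in x[l..u]
    while (l <= u) {
        int mid = l + (u - l) / 2;        // overflow-safe midpoint
        if      (x[mid] < t) l = mid + 1; // cantbe(l..mid): shrink the range from the left
        else if (x[mid] > t) u = mid - 1; // cantbe(mid..u): shrink the range from the right
        else return mid;                  // x[mid] == t — found
        // mustbe(l..u) still holds here
    }
    return -1;                            // the range is empty: t cannot be in the array
}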
Bugs can stay hidden for a long time until execution environment changes and they get “activated”. This happened to Java’s implementation of the binary search as it was used in java.util.Arrays binarySearch() methods:
1: public static int binarySearch(int[] a, int key) {
2: int low = 0; int high = a.length - 1;
3: while (low <= high) {
4: int mid = (low + high) / 2; // Where the problem lurks!
5: int midVal = a[mid];
6: if (midVal < key)
7: low = mid + 1;
8: else if (midVal > key)
9: high = mid - 1;
10: else
11: return mid; // key found
12: }
13: return -(low + 1); // key not found.
14: }
The bug struck in 2006, when the searched arrays became large enough that the sum low + high overflowed the int range. The solution is to replace line 4 with:
int mid = low + ((high - low) / 2); OR int mid = (low + high) >>> 1;
When one is better than another
Iterations are most often faster (recursions have method stack overhead), but this advantage is not crucial
Iterations require less memory (recursions have limited depth due to the finite stack size), and this may be crucial
Recursions often (not always!) result in a simpler (often, much simpler) implementation due to the algorithm recursive nature (either due to the structure of mathematical definition, like factorial or Fibonacci sequence, or due to recursive nature of the data structure on which the algorithm operates, eg, trees)
Non-recursive algorithms on recursive data structures are possible, eg, search in a binary tree (but it also involves a stack)
Recursive algorithms on non-recursive data structures (arrays, lists …) are possible (Quicksort and others)
Proof of correctness for recursive algorithms is often greatly assisted by the mathematical induction method
When using recursion, always provide the base case (cases), and ensure that the recursive calls converge to one of the base cases. Remember the stack size limitation! Never use recursion for processing user inputs!
The study of algorithms is at the core of Computer Science. “Simple fundamental algorithms is the ultimate portable software” (R. Sedgewick, J. Bentley).
Algorithm Intensive CS Fields
Few references
“The Art of Computer Programming” 3rd Edition, by Donald Knuth, Addison-Wesley, 1997–1998; volumes:
Introduction
A common computational problem is to group multiple elements into a single unit — known as a collection or container — to store and retrieve the elements, and to manipulate this group as a whole through a small set of operations defined in its interface. The most basic examples of such collection types are arrays. Classes like Vector and Hashtable (we’ve met) are separate (raw) collection types. With generics, collection types are vastly expanded and organised into the Java Collections Framework, endowed with the architecture for representing and manipulating its constituents. JCF tries to emulate the famous C++ STL. The framework is hosted in the java.util package. It has the following components:
Collection Metaphors
Kent Beck (a well known software expert, the co-author of JUnit, and one of the Extreme Programming pioneers) remarks, in his recent book “Implementation Patterns”, that collections should be regarded as a first-class language construct (alongside variables), because they are multi-valued variables, and because their cardinality (“many”, meaning two and more) is, in fact, complementary to the other two fundamental mathematical cardinalities: zero (no field) and one (a usual field variable). The collection concept combines in itself three metaphors: it’s an object, which can hold multiple values at the same time, and has properties of a set.
Collections which you can encounter in a program can be classified according to their interface:
To put it simply — Collection objects store other objects (eg, ArrayList). We shall:
re-enforce the OO idea of a separation between the interface to a collection and the implementation (DS) of that collection
introduce (or review, in the case of the array) new collection data types and the data structures that implement them
describe how collections are defined in Java Collections Framework
Before considering other implementations of the BookList.java interface, let us describe its basic operations. These basic operations (alongside the bulk operations) form the Collection<E> interface. The BookList interface is therefore a particular example of Collection<Book>. The linked-list implementation of three of its methods is illustrated:
| int size() | boolean isEmpty() | boolean contains(Book b) |
|---|---|---|
| String toString() | Book get(int p) | Iterator<Book> iterator() |
| boolean add(Book b) | boolean remove(Book b) | void insert(Book nB, int p) |
All implementations of iterator() return the implementation of Iterator given by the class BookListIterator.java.
The example BookList.java defines an interface for the Library program for filing a collection of Book objects, processing and printing them. Its two implementations use different DS for internally storing the collection elements, but their usage is exactly the same (polymorphism).
The Collections API already provides all the required DS, but we implemented the BookList interface from scratch. Versions which use the API DS are discussed next.
When a standard library offers an effective implementation of a collection type, it is more prudent to use it for implementing your own interface. We discuss two ways to use the ArrayList collection class for implementing our BookList interface. Two approaches are possible.
One approach is very convenient and allows a minimum of work because it uses
Inheritance
Make the class implement BookList and extend ArrayList, thus making the implementation class of the ArrayList type. In software design this is called an “Is-A” relationship (between the classes) — by virtue of inheritance, BookListIsAL.java is an ArrayList. The trade-offs:
Another approach, however, is often a better choice — it’s based on
Composition
Compose the class which implements BookList with a field holding an ArrayList object. In software design this is called a “Has-A” relationship — by virtue of composition (aka containment), BookListHasAL.java has an instance of the ArrayList class among its fields. The trade-offs:
the wrapped ArrayList does not know about the wrapper (it passes itself as this), and callbacks do not find the wrapper. (The “Further programming ideas” feature in lab 5 deals with such callbacks. — Elaborate?)

Queues and stacks are very popular interfaces. Both are widely used in systems-level programming (for managing memory and processes) as well as in other applications.
A Stack<E> is a “last in first out” (LIFO) collection type which can be implemented by adding (with a push method) and extracting (with a pop method) from the head of a list. The interface:
The Stack class in the Java Collections Framework has a slightly different interface. It extends the Vector class with operations which allow a Vector to be treated as a stack. Java regards Stack (as well as Vector) as a legacy class (on the way out), and suggests using ArrayList for implementing a stack interface in applications. (What could be the reason for such a policy?)
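A minimal sketch of a LIFO stack built on top of ArrayList, in line with that suggestion (push/pop/isEmpty are the classical operation names, not necessarily those of the course's Stack interface):

import java.util.ArrayList;

/** LIFO stack backed by an ArrayList; the "head" is the end of the list. */
public class ArrayStack<E> {
    private final ArrayList<E> elements = new ArrayList<E>();

    public void push(E e) { elements.add(e); }    // O(1) amortised

    public E pop() {                              // O(1): remove from the end
        if (isEmpty()) throw new IllegalStateException("empty stack");
        return elements.remove(elements.size() - 1);
    }

    public boolean isEmpty() { return elements.isEmpty(); }
}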
A Queue<E> is a “first in first out” (FIFO) type which can be implemented by adding objects (with an enqueue method) to the head of a list and by extracting them (with a dequeue method) from its tail. The interface:
In the JCF, the Queue interface has slightly different names for the above operations, plus additional operations which allow this interface to be used as a stack, as well as to impose a specific ordering (defined by a constructor with a Comparator parameter), as, for instance, in the PriorityQueue class.
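A short illustration of the JCF names for these operations (a sketch; offer/poll are the Queue methods, and PriorityQueue takes a Comparator in its constructor):

import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Queue;

public class QueueDemo {
    public static void main(String[] args) {
        Queue<String> fifo = new ArrayDeque<String>();
        fifo.offer("first");                  // the JCF name for enqueue
        fifo.offer("second");
        System.out.println(fifo.poll());      // the JCF name for dequeue -> "first"

        // a priority queue ordered by string length instead of the natural order
        Queue<String> byLength = new PriorityQueue<String>(10, new Comparator<String>() {
            public int compare(String a, String b) { return a.length() - b.length(); }
        });
        byLength.offer("longest word");
        byLength.offer("ok");
        System.out.println(byLength.poll());  // -> "ok"
    }
}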
The List interface represents an ordered collection in which the user controls where each new element is inserted and which element to access or search for (any, given its index). Lists allow duplicate elements (unlike Sets). Two major implementations are the following classes (all generic):
ArrayList, a very familiar class with the following efficiency for the major operations:

get(i), set(i, elem) — \(\mathcal{O}(1)\) (constant)

add(i, elem), remove(i) — \(\mathcal{O}(N - i)\) (requires recopying of a part of the list)

LinkedList. This is a doubly linked list, or deque: it can be traversed in both directions — every node (a deque element) has two references, to the preceding node and to the following node. The performance of the LinkedList operations is complementary to that of the ArrayList (except for adding and removing elements at either end, which is \(\mathcal{O}(1)\)):

get(i), set(i, elem) — \(\mathcal{O}(i)\) (needs i steps to get there)

add(i, elem), remove(i) — \(\mathcal{O}(1)\) (no recopying necessary)

Two legacy classes: Vector and Stack (in the early versions, Stack extended Vector)
List classes all provide (via the Iterable interface) an Iterator, which defines three operations to step through (iterate) the collection contents one element at a time. The methods:
boolean hasNext() — returns true if there are more elements left

E next() — returns the next element

void remove() — removes the last element returned by the iterator; can be called only once per call to next() (a subtle operation: requires safe removal; optional)

If your own list implements the Iterable or Iterator interfaces, the iterator() method (from Iterable, it returns an Iterator object), or the methods (listed above) of the Iterator, can be used for the list traversal. The implementation of Iterable allows traversal with the for-each loop. However, this approach (see next slide) is not very safe when one has to filter (remove by a certain criterion) the collection elements, or to traverse multiple collections simultaneously. Examples — the BookListWithArray.java and BookListWithLL.java classes (each implements the “Iter” interfaces differently).
The alternative way to traverse a collection involves the Iterator, which is the only safe way to modify the collection along the way. (The Iterator-based “plain” for-loop is also the right way to traverse more than one collection simultaneously.) A correctly implemented iterator() method, which also requires an implementation of the Iterator<E> interface, must guarantee that when the collection gets modified the iterator will properly move through the subsequent elements of the collection. The slide below, “Traversing a Collection with Iterator”, demonstrates the two traversals, including the failure of one of them.
Java 1.5 “new” language feature
To cut down on the amount of code when working with Collection types, Java 1.5 (Tiger) introduced the enhanced for-loop (borrowed from Perl):
for( type var : collection ) statement block
Two examples:
for( String arg : args ) { // args is an String[] type
System.out.println( arg );
}
for (Book book : books) { // for BookListIsAL books implemented with ArrayList
System.out.println( book );
}
A “for-each” loop can be used to iterate through most of the Java collection classes (arrays, ArrayLists, HashSets, etc) — anything that implements the Iterable interface. The example is in Library.java, the client program of the BookList.java types.
Once again, remember: The collection type which you define yourself must implement Iterable interface to be amenable for for-each traversal.
The for-each loop can also be used on arrays, but it is not suitable if the traversal process uses the index explicitly, or one has to traverse in the opposite direction.
Safe removal while traversing
When we need to modify a collection (eg, filter out its elements based on some criterion), the safest way is to use the collection’s Iterator and drive the traversal with a standard for-loop. This can be subtle. The standard Collection classes (ArrayList, LinkedList) include a proper implementation of Iterator which guarantees safe co-modification during a traversal:
BookListIsAL books = new BookListIsAL();
books.add(new Book("Java Software Solutions", false));
... ...
for (Iterator<Book> iter = books.iterator(); iter.hasNext(); ) {
Book nextBook = iter.next();
if (!nextBook.isChildren())
iter.remove();// call "remove" only through Iterator reference!
}
The remove() method may be called only once per call to next() and throws an exception if this rule is violated. An attempt to achieve the same effect with the would-be identical for-each loop — and a direct call to remove() — will result in a run-time exception:
for (Book book: books) { // for-each hides the iterator,
if (book.isChildren()) books.remove(book); // and one cannot call remove()
}
... ... ...
Exception in thread "main" java.util.ConcurrentModificationException
These are the extensions of the Collection type which disallow duplicates (identical elements): a repeated invocation of add(elem) with the same element elem (or with such an elem1 that elem.equals(elem1) returns true) returns false, and the collection remains unchanged. Set types, therefore, are sets in the mathematical sense (some computer scientists call them “bags”).
The elements of a Set are unordered. The subtype SortedSet (an extension of the Set interface) represents a collection whose elements can be compared — they implement the Comparable interface. The ordering allows additional methods to be introduced:
Set is implemented by the HashSet class with a hash table as the DS. Content modification (add, remove) and testing (contains) are \(\mathcal{O}(1)\) operations. SortedSet is implemented by the TreeSet class with a binary tree as the DS. If the implementation can maintain a balanced tree, the search and modify operations are \(\mathcal{O}(\log_{2}N)\). Example — SetTest.java.
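A short sketch in the spirit of SetTest.java (illustrative):

import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class SetDemo {
    public static void main(String[] args) {
        Set<String> names = new HashSet<String>();
        names.add("Ada");
        names.add("Alan");
        System.out.println(names.add("Ada"));  // false: duplicates are rejected
        System.out.println(names.size());      // 2

        SortedSet<String> sorted = new TreeSet<String>(names);
        System.out.println(sorted.first());    // "Ada" — smallest in the natural order
        System.out.println(sorted.last());     // "Alan"
    }
}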
Unlike all previous types from the JCF, the Map family of types does not extend the Collection interface: their contract is different — to represent not a collection of elements, but a correspondence between two collections. So, a map contains key/value pairs, with no duplicate keys (the keys form a set) and at most one value for each key. Map is a model of the mathematical abstraction called a function. The interface Map<K,V> operations are:
The SortedMap extension requires the set of keys to be sorted. Methods like firstKey() and lastKey() are added. The Map interface has two major implementation classes (similar to Set) — HashMap, which uses a hash table DS for the implementation (with a similar \(\mathcal{O}(1)\) performance for put/get operations), and TreeMap, which implements SortedMap in a way similar to TreeSet (with \(\mathcal{O}(\log_{2}N)\) efficiency). Example — MapTest.java
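A short sketch in the spirit of MapTest.java (illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapDemo {
    public static void main(String[] args) {
        Map<String, Integer> phone = new HashMap<String, Integer>();
        phone.put("Alice", 1234);
        phone.put("Bob", 5678);
        phone.put("Alice", 9999);                     // same key: the old value is replaced
        System.out.println(phone.get("Alice"));       // 9999
        System.out.println(phone.containsKey("Eve")); // false

        Map<String, Integer> sorted = new TreeMap<String, Integer>(phone); // ordered by key
        System.out.println(sorted.keySet());          // [Alice, Bob]
    }
}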
The basic Collection operations were listed above (slide Collection Interface). They allow one to examine and manipulate the collection element by element. The bulk operations allow one to manipulate a whole part of the collection in one go:
boolean containsAll(Collection<?> c);

boolean addAll(Collection<? extends E> c); // add all elements from c to the collection

boolean retainAll(Collection<?> c); // retains only those elements that are found in c

boolean removeAll(Collection<?> c); // opposite to retainAll()

void clear(); // removes all elements (“clean start”)

Another category of operations is the array operations:
Object[] toArray(); // returns an array of all the elements in this list in the correct order

<T> T[] toArray(T[] dest); // the list elements are placed in the array dest, which is returned

Arrays are objects, therefore when one array is assigned to another, the two identifiers point to the same memory location (you remember this from Block-1, don’t you?). How to copy elements of an array into a different array?
To achieve element-by-element copying, one has to use the static method System.arraycopy():
int[] smallPrimes = {2,3,5,7,11,13};
int[] luckyNumbers = {1001,1002,1003,1004,1005,1006,1007};
System.arraycopy(smallPrimes,2,luckyNumbers,3,4);
for (int i=0; i<luckyNumbers.length; i++)
System.out.println(i + ":" + luckyNumbers[i]);
The interfaces of the Collections Framework and some of their implementations:

Data structures used in the implementations (for maximum performance):

| Interfaces | Hash table | Resizable array | Tree | Linked list |
|---|---|---|---|---|
| Set | HashSet | Makes no sense | TreeSet | Makes no sense |
| List | | ArrayList | | LinkedList |
| Deque | | ArrayDeque | | LinkedList |
| Map | HashMap | Makes no sense | TreeMap | Makes no sense |
Essentially, there are only a few prime data structures which can be used for the implementation of the above interfaces (their nature is determined by the modern computer architecture, the von Neumann machine architecture):
An ordinary array can be thought of as a table or mapping between an integer index value and a data value. Arrays enable very efficient random access to the data, \(\mathcal{O}(1)\). The performance deteriorates when we want to copy in more than the original array can take — can we retain the high performance in this situation? Also, often one wants to index entries not by integers (eg, by String names) — can we do it with \(\mathcal{O}(1)\) efficiency (instead of resorting to linked lists)?
A data structure which allows us to do such things is the hash table (HT) — “the great invention of Computer Science” (Kernighan, Pike). A HT is an array of linked lists plus a hash code function, which maps the data element to the index value of the array.
The hashcode mapping can be many-to-one. Normally, different keys will compute to different table indices (if the hashcode is well implemented), but if not — a hash collision — then we add a new entry to the bucket list. It is very important to keep the HT balanced and to avoid growing bucket lists too long. For a well-balanced HT, the performance of the data entry/retrieval operations is \(\mathcal{O}(1)\); otherwise it becomes \(\mathcal{O}(N)\).
hashCode()

To achieve the optimal performance (reduce the number of collisions), one should:
implement the hashCode() function properly for the class of elements which are going to be hashed — any two elements which are equal, e1.equals(e2) == true, must have the same value e1.hashCode() == e2.hashCode(). Every class which redefines the equals() method should also redefine hashCode().

Here is an example of hashCode() for a simple Item class:
class Item {
private String description;
private int partNumber;
...
public int hashCode() {
// note use of prime number multipliers
return 13*description.hashCode() + 17*partNumber;
}
}
The hashCode() function

Hash codes are meant to generate array indices which are uniformly and randomly distributed over the size of the array. For this reason they work on the idea of “scrambling up” the values of the data being hashed, so that it is not possible to predict where a particular object will be stored. This is how the hashCode() method for the String class might have been implemented:
\[hashCode(s) = \sum\limits_{i=0}^{n-1}\,s_{i}\cdot \mbox{31}^{(n-1-i)},\]
where n = s.length() and \(s_{i}\) is the code of the character s.charAt(i) (Exercise: implement this function as a Java method.) Very similar inputs evaluate to quite different outputs:
| String | Hash value | String | Hash value |
|---|---|---|---|
| Hello | 69609650 | Harry | 69496448 |
| hello | 99162322 | Hacker | -2141031506 |
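One possible answer to the exercise — a minimal sketch of the formula as a Java method (the real String.hashCode() uses the same polynomial, computed here by Horner's rule):

/** Polynomial hash: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] */
public static int hashOf(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++)
        h = 31 * h + s.charAt(i);   // int overflow is what "scrambles" the result
    return h;
}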
To write a good hash function for every class whose instances need to be stored in a hash table would be a daunting task. Thankfully, all major Java objects that we might need to use as keys (String, Integer, Double, Date) have good hash code functions. When you override hashCode() for your own type, do it using the principle exemplified on the previous slide (class Item), and only use two- (or more) digit primes (13, 17, 31…) as factors. (To learn more, one has to take a dedicated course in algorithms and/or computer cryptology.)
Trees are recursive data structures with more than one (two for binary trees) self-references. Of all standard DS they are the most complex and interesting.
There are different types of binary trees (see next slide) which differ by how the elements are arranged and what operations are defined. The most commonly used is a balanced binary search tree. This structure is defined recursively (as it should be), and it enables a binary decision (see below) to be made at each level, which reduces the search time to an \(\mathcal{O}(\log_{2}N)\) operation (similar to the binary search in a linear ordered array).
Multivalent trees can almost always be reduced to binary ones (albeit with a more complex node structure). Of all data structures considered so far, binary trees are the most complex, diverse and interesting (for research). Their value as a DS is due to the fact that the tree traversal is a key element of every operation, and for a well-structured (balanced search, or amortised) tree the traversal consists only in a monotone descent: it takes \(\mathcal{O}(h)\) steps (h is the height), and \(h\sim\log_{2}N\).
There are many types of tree DS (all listed below are binary except for B-Trees):
Binary Search Trees (also known as Binary Ordered Trees)
Heaps — (almost) complete BT which are built to represent an array and used for sorting
Expression Trees — where every node represents an operator, and every leaf is a value
B-Trees — used in out-of-core algorithms (when data is so large that it doesn’t fit into memory, and has to be operated on while on the disk)
Many-many more-more
Binary Search Trees (BST) are the simplest case. They are used in the implementation of the SortedSet (TreeSet class) and SortedMap (TreeMap class) interfaces.
BST are defined in a simple way:
BST are the simplest and most often used type of binary trees. The example SearchTree.java defines the SearchTree class with the following operations:
get(E e) — returns the node which contains the value e (if found)

add(E e) — adds an element while preserving the tree’s ordered structure

remove(E e) — removes an element while preserving the tree’s ordered structure

height() — returns the tree height

successor(E e) — returns the node with the value next larger than e

predecessor(E e) — returns the node with the value next smaller than e

It is important for a search tree to be balanced to attain the \(\Omega(\log N)\) performance of its operations (listed on the previous slide). Obviously, this requires the tree to be everywhere “dense”: every branch (path from the root to a leaf) has approximately the same length (the maximum of which is called the height). Such “dense” trees are usually called balanced. Balancing is broken when some branches become abnormally long, with length \(\sim N\). If a search tree becomes strongly unbalanced, the efficiency of its operations deteriorates to \(\mathcal{O}(N)\).
Search trees are often built at run time, using data which streams from an outside source (as is done in the example SearchTree.java). If the source data is not skewed, the resulting search tree will likely be balanced, but this cannot be guaranteed. If unbalancing occurs, the tree can be adjusted to restore the balanced-tree performance. This can be quite complex; examples of self-adjusting balanced trees include Red-Black Trees and Splay Trees.
There is a duality between algorithms and data structures. The temporal structure of an algorithm is dual to the “spatial” structure of data. When an algorithm performs in its generic fashion (with its “generic” efficiency, eg, QuickSort running at \(\Omega(N\log N)\)), the structure of recursion calls (the temporal portrait, as it were) looks exactly like a balanced search tree — everywhere dense, without the anomaly of protruding separate branches. This is also how the “spatial” structure of a balanced tree looks, and it gives the optimal tree traversal performance \(\mathcal{O}(\log_{2}N)\).
Temporal structure of one resembles structural (“spacial”) feature of another.
There are two meanings of the term heap in Computer Science.
A part of computer memory which is unused by the OS and other running programs, and from which each program can borrow. In Java, all explicitly created objects use the heap memory. If you declare a fixed-size array with elements like literal strings, integers, enum constants etc, the required memory is allocated in advance. Otherwise, when a DS changes during the execution (when adding new elements), the memory comes from the heap. When an object is destroyed by the garbage collector, the memory is returned to the heap.
An array data structure a[i], i = 0..(N-1), in which elements are placed on an imaginary tree in accordance with the following index rules for an element with the index i: its children are the elements a[2i+1] and a[2i+2], and its parent is the element a[(i-1)/2] (integer division).
The heap element a[i] also satisfies a special condition — a so-called heap property: in a max-heap, a parent is not smaller than its children, a[i] >= a[2i+1] and a[i] >= a[2i+2] (in a min-heap the inequalities are reversed).
The array heap tree is always (almost) complete — the length of all branches is either equal to the tree height h, or h-1 (this follows from the trivial mathematical fact that any number N satisfies the inequalities: \(2^{h} \leq N \leq 2^{h+1} - 1\) for some positive integer h).
For an arbitrary array a[i] of the length N, one can always heapify it — calculate relationships between the elements as prescribed by the heap property; the effect is like placing the elements on a tree.
Why are heaps useful? Or, to put it another way, what advantage does rearranging an array into a heap give? Max-heaps are used in the sorting algorithm HeapSort, which gives \(\Omega(N\log_{2}N)\) in-place sorting performance. This performance does not deteriorate on almost sorted inputs, unlike QuickSort. Min-heaps are used in the implementation of priority queues.
(Figure: an arbitrary array placed on a heap tree.)
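A minimal sketch of heapifying an array into a max-heap, assuming the standard index rules above (the children of i are 2i+1 and 2i+2):

/** Rearranges the whole array into a max-heap. */
public static void heapify(double[] a) {
    for (int i = a.length / 2 - 1; i >= 0; i--)   // all non-leaf nodes, bottom-up
        siftDown(a, i, a.length);
}

/** Restores the heap property for the subtree rooted at i (heap size n). */
private static void siftDown(double[] a, int i, int n) {
    while (2 * i + 1 < n) {
        int child = 2 * i + 1;                                 // left child
        if (child + 1 < n && a[child + 1] > a[child]) child++; // pick the larger child
        if (a[i] >= a[child]) return;                          // heap property holds
        double tmp = a[i]; a[i] = a[child]; a[child] = tmp;
        i = child;                                             // continue sifting down
    }
}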
If writing (adding and deleting nodes) into a tree is allowed, its initially balanced structure can be compromised, with the consequence of worsened behaviour: the access cost \(\mathcal{O}(h)\) (h is the tree height), instead of being \(\mathcal{O}(\log N)\), becomes \(\mathcal{O}(N)\). How to deal with this problem?
Add additional markers to nodes for constraining the structure; when the constraint gets broken (after an add or delete), perform a local transform (called a rotation) to restore the constraint. A well-known example is Red-Black Trees, in which every node (in addition to the actual data) has a colour marker, “red” or “black”. The constraint:
As mentioned above, the constraint guarantees that the tree remains approximately balanced — the lengths of any two root-to-leaf paths differ at most by a factor of 2. The complexity of the RBT structure and performance analysis is prohibitive for us here. A valuable study of Red-Black Trees by R. Sedgewick includes Java code and animations.
As an alternative to using additional markers, one can restructure the tree by moving the most recently accessed node towards the root. Such an operation is called splaying, and the trees which support it are called Splay Trees. The case for using Splay Trees is when the same elements are accessed repeatedly in close sequence (during which no large tree modification is performed). So although an initial access can be costly, \(\mathcal{O}(N)\), after splaying it becomes \(\mathcal{O}(1)\), and the average cost of one access gets amortised. It is possible to show that the average cost can be made the “usual” \(\mathcal{O}(\log_{2}N)\), but we shall not go into details.
The above examples dealt with tree structures used to store data. Trees can also be used to represent complex constructs with well-defined semantics (ie, they can be evaluated). Examples:
A tree which represents a computer program is called a program tree graph, or a parse tree, or an Abstract Syntax Tree.
Expression trees have operators for their nodes, and values for their leaves. A process which creates an expression tree from the input string (a conventional arithmetic formula like the above, or a computer program text) is called parsing. If the input string contains (syntax) errors, the parsing fails. Once built, an expression tree can be modified, optimised, transformed and evaluated. Compilation is an example of a parsing process. Parsing is done through the application of grammar rules, which define how to substitute chunks of the input string (aka tokens) into syntactic categories (nodes of the parse tree) and terminal symbols (leaves). A grammar is a set of recursive relationships:
<expression> ::= <number> | <expression> <operator> <expression> | (<expression>)
<operator> ::= + | - | * | / | %
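A minimal sketch of evaluating such an expression tree (the node type and field names are illustrative assumptions):

/** A node is either a number (leaf) or an operator with two subtrees. */
class ExprNode {
    char op;            // '+', '-', '*', '/', or 0 for a leaf
    double value;       // used only by leaves
    ExprNode left, right;

    double eval() {
        if (op == 0) return value;                 // leaf: just the number
        double a = left.eval(), b = right.eval();  // evaluate the subtrees recursively
        switch (op) {
            case '+': return a + b;
            case '-': return a - b;
            case '*': return a * b;
            default : return a / b;
        }
    }
}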
When data is too big to fit into memory and resides on a secondary storage device (hard disk, database etc), the time to access it is 50,000–100,000 times longer (from 50 nanoseconds to milliseconds). If an algorithm involves a data structure on the “disk” — a so-called out-of-core algorithm — it must minimise read and write operations (“disk ops”) on such a DS.
A very suitable choice for an on-the-disk DS is the B-Tree. B-Trees are balanced like Red-Black Trees, but the number of children at every node can range from a few to virtually thousands. The larger the number of children (let’s say a B-Tree has t of them on average), the shorter the tree height: \(h = \mathcal{O}(\log_{t}N)\), where N is the total number of nodes. Since the disk ops involve a node access, the fewer of them performed, the faster the algorithm runs.

B-Tree Definition details
The tree node can have up to (t-1) sorted keys \(k_{i}\), which define t intervals: \(\lbrack -\infty,k_{1}\rbrack\), \(\lbrack k_{i-1},k_{i}\rbrack\) for \(i=2..(t-1)\), and \(\lbrack k_{t-1},+\infty\rbrack\).
When a value k is sought (inserted, deleted), the descent is performed into the subtree of a node which is represented by the i-th interval which contains \(k\in\lbrack k_{i-1},k_{i}\rbrack\)
The maximum number of children t is determined by the size of memory (a page) which the RW head can “scoop” in one op. The scale of \(t\sim 10^{3}\).
Standard search, add and delete operations can be defined with the performance \(\Omega(\log_{t}N)\)
Just as for other kinds of DS, some operations are quite simple when they are defined recursively. For example, the search-and-return operation (call it get(e)) is called on the whole tree with the argument e of the type E of the data which is stored in tree nodes; if the node which contains the data equal to e is found, that tree node (which has the type Tree<E>) is returned; if not found, the null value is returned. The operation add(e) must preserve the search-tree property: it should go down to one of the leaves and add a left or right subtree into which the new element e will be “deposited”. The key aspect in both cases is the tree traversal, and it is quite simple since we use the search-tree property:
get(e) — start with the root node this and examine how its element compares with e by calling this.stuff.compareTo(e) (that’s why we made E implement Comparable):

this.stuff.compareTo(e) == 0 — found it, return this

this.stuff.compareTo(e) < 0 — go left: this.left.get(e)

this.stuff.compareTo(e) > 0 — go right: this.right.get(e)

add(e) — start with the root (as always) and, using comparisons to choose a left or right turn at every node, traverse all the way down to a leaf where a new node containing the value e is attached. All new values are placed at one of the leaves at the bottom; this is the easiest way to preserve the search-tree properties while adding elements.
remove(e) — Is hardest of all since the element e can be anywhere.
printAscend(), printDescend(), height() — all recursive and simple whole-tree ops.
The example SearchTree.java presents the implementation of the first three operations, in which recursion plays a vital role — get(e) and add(e) are downright recursive, and remove(e) relies on get(e) to find the node containing e. Details of remove(e) are somewhat tricky.
One has to consider four cases — three simple and a tricky one. When a node slated for deletion is found, it is pulled out of the tree, and its place is taken by another node which is carefully chosen in such a way that the resulting tree is still a search tree.
(Case 2 is not shown — it’s similar to Case 1, just replace the right subtree with the left one.)
The most difficult case is generic one — when the deletable node z has a normal right subtree r, such that the right child is not the smallest element greater than that in the deletable node (like in Case 3). The delete operation requires a “surgery” to implant the node with smallest element y in place of z and rearranging broken links to preserve the search property.
One should study the actual code in SearchTree.java to understand all the details and appreciate the subtlety. (Worry not — this is not going to be in the final exam! -:)
Algorithms can be (non-)recursive, and DS can be (non-)recursive. Do all elements of this “\(2\times 2\)”-matrix make sense?
Recursive operation on a recursive DS? Of course — eg, get(e) on a linked list (an exercise for you)

Recursive operation on a non-recursive DS (like an array)? Of course — recursive sorting algorithms like QuickSort and MergeSort, BinarySearch
Non-recursive operation on a non-recursive DS? SelectionSort, InsertionSort
Non-recursive operation on a recursive DS? In particular, a non-recursive traversal (search) of a binary tree? Possible — but it requires an explicit stack; see the sketch below
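A minimal sketch of such a non-recursive in-order traversal, using an explicit stack (the Node type is an illustrative assumption):

import java.util.ArrayDeque;
import java.util.Deque;

class Node {
    int value;
    Node left, right;
}

class TreeWalk {
    /** Prints the values in ascending order without recursion. */
    static void inOrder(Node root) {
        Deque<Node> stack = new ArrayDeque<Node>();
        Node current = root;
        while (current != null || !stack.isEmpty()) {
            while (current != null) {      // descend as far left as possible
                stack.push(current);
                current = current.left;
            }
            current = stack.pop();         // visit the node
            System.out.println(current.value);
            current = current.right;       // then walk its right subtree
        }
    }
}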
Comparing objects (elements in a collection) can be done in two ways:
the class of stored objects implements the Comparable interface (to support the compareTo(Object o) method)
the constructor of TreeSet (or TreeMap) is passed a Comparator<E> parameter (which has the method int compare(E e1, E e2)), if the elements do not implement Comparable, or if we want to compare them differently, and so we need to specify which comparison function to use for sorting the set.
class ItemComparator implements Comparator<Item> {
public int compare(Item a, Item b) {
String descrA = a.getDescription();
String descrB = b.getDescription();
    return descrA.compareTo(descrB);
}
}
ItemComparator comp = new ItemComparator();
TreeSet<Item> sortByDescription = new TreeSet<Item>(comp);
In the example Library1.java, the same collection of Books is first sorted by their natural order, and then by the length of their title.
Generic algorithms, which operate on the collection classes are provided as static methods from Arrays and Collections (not to be confused with Collection interface) classes. The Collections methods include (method return values and parameters are bounded types, actually; consult the API)
public static <T> T min(Collection c) — returns the smallest element in the collection (another overloaded version accepts a Comparator object)

public static <T> T max(Collection coll)

public static Comparator reverseOrder() — returns a Comparator which reverses the natural ordering of the collection

public static void reverse(List list) — reverses the order of a list

public static void shuffle(List list) — randomly shuffles a list

public static void fill(List list, E elem) — replaces each element of a list with elem

public static void copy(List destination, List source)

public static List nCopies(int n, E elem) — returns an immutable list that contains n copies of elem

public static void sort(List list) and sort(List list, Comparator c)

public static int binarySearch(List list, K key) — the list must be sorted already for this method to work
public class Sort {
public static void main(String[] args) {
List<String> list = Arrays.asList(args);
Collections.sort(list);
System.out.println(list);
}
}
Another example — a list with 100 elements each initialised to an Integer of value -1:
Integer init = new Integer(-1);
List values = new ArrayList(Collections.nCopies(100,init));
One more — the method Array.asList() (used above) returns a list view of the array:
Card[] cardDeck = new Card[52];
List cardList = Arrays.asList(cardDeck);
Because cardList is only a view of an underlying array, you can set the values of elements but you cannot add or remove them. The list view is useful if you want to pass a simple array to methods of the Collections class.
Major advantages of using the JCF (repeating the Java Tutorial)
No need to invent a wheel every time, the best (“well rounded”) wheels have been already invented! Use them.
Collections can grow very large and consume large memory.
Same collection object can be used by different clients with different needs in terms of how data represented by the collection are used (read, modified).
Using different variables to stand for the same collection — aliases — does not provide an acceptable solution (aliasing achieves nothing)
Copying the data for different clients may be too expensive and undesirable (which copy must be regarded as primary when it comes to data preservation?)
Instead, different views of the same collection object offer the right solution
Views are normally created by calling a static factory method with the underlying collection object passed as a parameter. Technically speaking: “A view is an object of a class that implements one of the interface types in the JCF, and that permits restricted access to a data structure (CH).” The view object is like a shallow copy, with access to elements of the original Collection object (hence, a modifiable view can be used to modify the original).
Some operations declared in Collection interface are marked as optional (add(), clear(), remove() and some bulk operations). This is because for certain implementations of the interface, these operations are deemed “destructive” — they attempt to modify the collection, and, if this is not consistent with the collection contract, they are not supported and their invocation throws UnsupportedOperationException.
public boolean remove(Object o) { throw new UnsupportedOperationException(); }
Also, “…some Collection implementations have restrictions on the elements that they may contain, like prohibiting null elements, and some have restrictions on the types of their elements. Attempting to add an ineligible element throws an unchecked exception, typically NullPointerException or ClassCastException. Attempting to query the presence of an ineligible element may throw an exception, or it may simply return false”, depending on the implementation (Java API docs).
The reason for including optional operations in the first place is to increase re-usability and flexibility of the Collection types.
The mechanism of views increases the number of concrete JCF classes by an integer factor (the number of views). The different views — normal (unsynchronised), synchronised, unmodifiable and checked wrappers — can be obtained from a Collection (or Map) object; they behave as new sorts of collections (or maps), except that they are backed by the underlying data structure. Wrapper implementations delegate all their real work to a specified collection but add extra functionality on top of what this collection offers. This is another example of the Decorator pattern.
The wrapper views are obtained via static factory methods (methods which return newly created objects) defined in Collections class of java.util package. The views and their advantages are:
the synchronised wrappers which add automatic synchronisation (thread-safety) to an arbitrary collection:
Collections.synchronizedCollection(Collection c)
Collections.synchronizedList(List l)
... similar for Set, SortedSet, Map, SortedMap
Map map = Collections.synchronizedMap(new HashMap());
// The Vector class is synchronised by default, hence it's slower than ArrayList

the unmodifiable wrappers, obtained with unmodifiableCollection() and similar calls; all modifying methods in this view throw UnsupportedOperationException

the checked wrappers, obtained with checkedCollection() and similar calls; a checked view of the original raw collection will make a runtime check to enforce the type safety lost in raw types

the concurrent collections are not standard wrappers, but are types defined in a separate package java.util.concurrent; they provide implementations which are not only thread-safe for synchronised access, but are specially designed to support multi-threaded use, like ensuring that a collection becomes non-empty before a request to access it is carried out.
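A short illustration of the unmodifiable and checked wrappers (a sketch):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class WrapperDemo {
    public static void main(String[] args) {
        List<String> base = new ArrayList<String>();
        base.add("alpha");

        List<String> readOnly = Collections.unmodifiableList(base);
        // readOnly.add("beta");            // would throw UnsupportedOperationException

        List rawView = Collections.checkedList(new ArrayList<String>(), String.class);
        rawView.add("ok");                  // accepted: it is a String
        // rawView.add(Integer.valueOf(1)); // would throw ClassCastException immediately
    }
}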