Publications




[2004]

[D.Hawking, N.Craswell, F.Crimmins, T.Upstill]
"How valuable is external link evidence when searching enterprise webs?"
Fifteenth Australasian Database Conference (ADC'04), January 2004
Paper: [pdf]
Abstract:

Link information, especially anchor text, is known to be very useful for effective ranking of web pages, particularly in response to navigational queries. We investigated whether enterprise webs contain sufficient internal link information to adequately answer queries derived from the enterprise's site map or, alternatively, whether adding link evidence from the external Web can boost search effectiveness. Using 1266 navigational queries derived from Stanford University's A-Z site index, we found no difference between the quality of results returned by Stanford's Google appliance and those from an appropriately site-restricted search of the global Google service. Applying similar methodology to our own crawls of seven Australian organisations, we found that adding external link evidence made no significant difference to search effectiveness in five cases and a slight difference (in different directions) in the other two. We observed that external links to an organisation show very different patterns to internal links. Unlike enterprise web publishers, external web authors heavily favour directory default pages, particularly the organisation's home page and pages offering information or services likely to be useful on an ongoing basis. External links seldom reference the complex, parameterised URLs in common use in many organisations.



[2003]

[T.Upstill, N.Craswell, D.Hawking]
"Predicting Fame and Fortune: PageRank or Indegree?"
Eighth Australasian Document Computing Symposium (ADCS 2003), December 2003.
Paper: [pdf]
Abstract:

Measures based on the Link Recommendation Assumption are hypothesised to help modern Web search engines rank 'important, high quality' pages ahead of relevant but less valuable pages and to reject 'spam'. We tested these hypotheses using inlink counts and PageRank scores readily obtainable from search engines Google and Fast.

We found that the average Google-reported PageRank of websites operated by Fortune 500 companies was approximately one point higher than the average for a large selection of companies. The same was true for Fortune Most Admired companies. A substantially bigger difference was observed in favour of companies with famous brands. Investigating less desirable biases, we found a one point bias toward technology companies, and a two point bias in favour of IT companies listed in the Wired 40. We found negligible bias in favour of US companies.

Log of indegree was highly correlated with Google-reported PageRank scores, and just as effective when predicting desirable company attributes. Further, we found that PageRank scores for sites within a known spam network were no lower than would be expected on the basis of their indegree. We encountered no compelling evidence to support the use of PageRank over indegree.




[N.Craswell, D.Hawking, A.McLean, T.Upstill, R.Wilkinson, M.Wu]
"TREC12 Web Track at CSIRO"
Text Retrieval Conference 12, November 2003
Paper: [pdf]
Overview:

This year, CSIRO teams participated in all three tasks of the web track: the automatic topic distillation task, the home/named page finding task and the interactive topic distillation task. This paper describes our approaches, experiments and results.

This year we focused on tuning Okapi BM25 and Web evidence parameters. Our home/named page finding submissions use tunings computed for both home and named page finding, and evaluate two run combination methods. Our topic distillation submissions are tuned for home page finding only and test whether the Web evidence evaluated is useful, and whether the use of stemming improves performance.




[T.Upstill, N.Craswell, D.Hawking]
"Query-independent evidence in homepage finding"
Revised August 2002, Revised February 2003, ACM Transactions On Information Systems, July 2003
Paper: [pdf]
Abstract:

Hyperlink recommendation evidence, that is, evidence based on the structure of the Web's link graph, is widely exploited by commercial web search systems. However, there is little published work to support its popularity. Another form of query-independent evidence, URL-type, has been shown to be beneficial on a home page finding task. We compare the usefulness of these types of evidence on the home page finding task, combined with both content and anchor text baselines. Our experiments made use of five query sets spanning three corpora -- one enterprise crawl, and the WT10g and VLC2 web test collections.

We found that, in optimal conditions, all of the query-independent methods studied (in-degree, URL-type, and two variants of PageRank) offered a better than random improvement on a content-only baseline. However, only URL-type offered a better than random improvement on an anchor text baseline. In realistic settings, for either baseline, only URL-type offered consistent gains. In combination with URL-type the anchor text baseline was more useful for finding popular home pages, but URL-type with content was more useful for finding randomly selected home pages. We conclude that a general home page finding system should combine evidence from document content, anchor text and URL-type classification.



[2002]

[N.Craswell, D.Hawking, J.Thom, T.Upstill, R.Wilkinson, M.Wu]
"TREC11 Web and Interactive Tracks at CSIRO"
Text Retrieval Conference 11, November 2002
Paper: [pdf]
Overview:

For the 2002 round of TREC, the CSIRO teams participated and completed runs in two tracks: web and interactive.

This year's Web track participation was a preliminary exploration of forms of evidence which might be useful for named page finding and topic distillation. For this reason, we made heavy use of evidence other than page content in our runs.




[T.Upstill, N.Craswell, D.Hawking]
"Buying bestsellers online: A case study in Search & Searchability"
Australasian Document Computing Symposium, December 2002
Paper: [pdf]   Presentation: [pdf, ppt]
Abstract:

Here, we study both search effectiveness and searchability with respect to a particular type of commodity (books) which is frequently sold over the Web. We measure the relative effectiveness of a selection of search engines in finding pages from which a book, specified only by its title, may be purchased. We also compare the relative searchability of a selection of online bookstores. This is performed through an examination of the proportion of best-seller books for which a "buy-here" page from a bookseller appears in search engine listings.

The design of a Web site directly affects how well search engines can crawl, match and rank a website's pages. For this reason, searchability is an important concern in site design. We study the interaction between search engines and Web sites by means of a case study of online bookstores and general-purpose search engines. The task modelled is that of finding web pages from which a book, described by its title, may be purchased.

We first compared the relative effectiveness of search engines in finding pages matching the criterion, regardless of bookstore. Then we compared the relative searchability of the bookstore websites by observing how many times each bookstore contributed useful answers to the search results.

Large differences in the performance of both search engines and bookstores were observed. Two of the search engines performed better than their peers, and one bookstore was far more searchable than all others. To further explore these differences we tabulate the total number of pages from each bookshop which are included in the search engine indexes.

We conclude with recommendations both to bookstores on how they may improve their Web presence, and to search engines on how they may improve their performance for product searches.



[2001]

[T.Upstill, R.Naggapan, N.Craswell]
"Visual Clustering of Image Search Results"
SPIE Visual Data Exploration and Analysis VIII, January 2001.
Paper: [pdf]
Abstract:

This paper presents a novel method for visualizing the results of an image search. Current approaches to visualizing WWW image searches rank results in a linear list and present them as a sorted thumbnail grid. The method outlined in this paper visually clusters images based on the user's search terms. To accomplish this, a flexible image retrieval method which incorporates a combination of content-based and textual image matching is used. A new information visualization is used to display the search results.

In our model multiple types of partitioning and querying can occur concurrently, thereby creating a multi-dimensional display of image properties. The display groups similar images, enabling users to quickly scan for the most relevant images. This visualization allows users to exploit the location of images as their guide to what an image contains and use thumbnails to preview potentially relevant images. Through the identification of relevant images users can locate relevant areas in the visualization. It is then possible for users to focus their attention on one area of the visualization using a zooming function. The user's interaction with the system is explored using new evaluation metrics based on Information Foraging theory.



[2000]

[T.Upstill]
"Consistency, Clarity and Control: Development of a new approach to WWW image retrieval"
Honours Thesis, November 2000
Paper: [gzipped ps,pdf]
Abstract:

The number of digital images is expanding rapidly and the World-Wide Web (WWW) has become the predominant medium for their transferral. Consequently, there exists a requirement for effective WWW image retrieval. While several systems exist, they lack the facility for expressive queries and provide an uninformative and non-interactive grid interface.

This thesis surveys image retrieval techniques and identifies three problem areas in current systems: consistency, clarity and control. A novel WWW image retrieval approach is presented which addresses these problems. This approach incorporates client-side image analysis, visualisation of results and an interactive interface. The implementation of this approach, the VISR (Visualisation of Image Search Results) tool, is then discussed and evaluated using new effectiveness measures.

VISR offers several improvements over current systems. Consistency is aided through consistent image analysis and result visualisation. Clarity is improved through a visualisation that makes it clear why images were returned and how they matched the query. Control is improved by allowing users to specify expressive queries and enhancing system interaction.

The new evaluation measures include a measure of visualisation precision and visualisation entropy. The visualisation precision measure illustrates how VISR clusters images more effectively than a thumbnail grid. The visualisation entropy measure demonstrates the stability of VISR over changing data sets. In addition to these measures, a small user study is performed. It shows that the spring-based visualisation metaphor, upon which VISR's display is based, can generally be easily understood.



Last Modified 01-10-2005
© Trystan Upstill 2004
