Lecture Notes : COMP1710: Web search
Slide 1 : 1/36: Web search
COMP1710 Web Development and Design
Web search
Slide 2 : ToC : COMP1710: Web search
Table of Contents (36 slides) for the presentation :
COMP1710: Web search
Slide 3 : 3/36: New Media and Web
In this session: web search
- Search engines
- User queries
- Web spam
- User needs
Reference for slides:
CS276
Slide 4 : 4/36: Search use
Search use

Slide 5 : 5/36: Web search and web scale
Without search engines the web wouldn't scale
- No incentive in creating content unless it can be easily found; other finding methods haven't kept pace (taxonomies, bookmarks, etc.)
- The web is both a technology artifact and a social environment
  - "The Web has become the new normal in the American way of life; those who don't go online constitute an ever-shrinking minority." [Pew Foundation report, January 2005]
- Search engines make aggregation of interest possible:
  - Create incentives for very specialized niche players
  - Economical - specialized stores, providers, etc.
  - Social - narrow interests, specialized communities, etc.
- The acceptance of search interaction makes "unlimited selection" stores possible
- Search turned out to be the best mechanism for advertising on the web, a $15+ B industry.
  - Growing very fast, but the entire US advertising industry is $250B - huge room to grow
  - Sponsored search marketing is about $10B
Slide 6 : 6/36: Coarse dynamics
Coarse dynamics of Web interaction

Slide 7 : 7/36: Brief history
Brief (non-technical) history
Early keyword-based engines
- Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997
Paid placement ranking:
- Goto.com (morphed into Overture.com -> Yahoo!)
- Your search ranking depended on how much you paid
- Auction for keywords: casino was expensive!
Slide 8 : 8/36: History 2
Brief (non-technical) history
1998+: Link-based ranking pioneered by Google
- Blew away all early engines save Inktomi
- Great user experience in search of a business model
- Meanwhile Goto/Overture's annual revenues were nearing $1 billion
Result: Google added paid-placement 'ads' to the side, independent of search results
- Yahoo follows suit, acquiring Overture (for paid placement) and Inktomi (for search)
Slide 9 : 9/36: Nigritude ultramarine

Slide 10 : 10/36: Ads vs search
Adverts vs. search results
Google states that ads (based on vendors bidding for keywords) do not affect vendors' rankings in search results

Slide 11 : 11/36: Web search basics
Web search basics

Slide 12 : 12/36: User needs
User needs [Broder '02, Rose et al '04, Jansen et al '08]
(Percentages are per study, in the order cited.)
- Informational - want to learn about something (39% / 61% / 81%)
- Navigational - want to go to that page (25% / 15% / 10%)
- Transactional - want to do something (web-mediated) (36% / 24% / 9%)
  - Access a service
  - Downloads
  - Shop
- Gray areas
  - Exploratory search: "see what's there"
Broder, A. 2002. A taxonomy of web search. SIGIR Forum 36(2):3-10.
Rose, D.E. and Levinson, D. 2004. Understanding user goals in web search.
Proc 13th Int Conf on World Wide Web, New York, 13-19.
Jansen, B.J., Booth, D.L. and Spink, A. 2008. Determining the informational,
navigational, and transactional intent of Web queries. Information Processing
and Management, 44:1251-1266.
Slide 13 : 13/36: Web search users
Web search users
Make ill-defined queries
- Short
- AltaVista (AV) 2001: 2.54 terms avg (80% < 3 words)
- Not increasing over time
- Imprecise terms
- Sub-optimal syntax (most queries without operator)
- Low effort
Wide variance in
- Needs, expectations, knowledge, bandwidth
Specific behavior
- 68% look at one result screen only (increased over time)
- About 50% of queries are not modified (one query/session)
- Follow links - "the scent of information" ...
Slide 14 : 14/36: Query Distribution
Query Distribution

Power law: few popular broad queries, many rare specific queries
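
To get a feel for what a power law implies for query volume, here is a minimal sketch. It assumes an idealised Zipf distribution in which the query at popularity rank r has frequency proportional to 1/r^s. The exponent s = 1 and the number of distinct queries are illustrative assumptions, not figures from the slides, and `zipf_share` is a hypothetical helper name.

```python
def zipf_share(top_k: int, n_queries: int, s: float = 1.0) -> float:
    """Fraction of total query volume covered by the top_k most popular
    queries, ASSUMING frequency at rank r is proportional to 1 / r**s."""
    weights = [1.0 / (r ** s) for r in range(1, n_queries + 1)]
    return sum(weights[:top_k]) / sum(weights)


if __name__ == "__main__":
    n = 100_000  # assumed number of distinct queries (illustrative only)
    for k in (10, 1_000, 10_000):
        print(f"top {k:>6} queries cover {zipf_share(k, n):.1%} of volume")
```

Under these assumptions a handful of head queries account for a large share of traffic, while the long tail of rare queries still makes up a substantial part of the total, which is why an engine must handle both well.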
Slide 15 : 15/36: How far?
How far do people look for results?
When you perform a search on a search engine and don't find what you are looking for, at what point do you typically either revise your search or move on to another search engine? (Select one)

iprospect.com WhitePaper_2006_SearchEngineUserBehavior
Slide 16 : 16/36: How far?
How far do people look for results?


Slide 17 : 17/36: Users' eval
Users' empirical evaluation of results
Quality of pages varies widely
- Relevance is not enough
- Other desirable qualities
  - Content: trustworthy, new info, non-duplicates, well maintained
  - Web readability: display correctly & fast
  - No annoyances: pop-ups, etc.
Precision vs. recall
- Precision = proportion of retrieved results that are relevant
- Recall = proportion of all relevant documents that are retrieved
- On the web, recall seldom matters
What matters
- Precision at 1? Precision at 1 for first page?
- Comprehensiveness - must be able to deal with obscure queries
- Recall matters when the number of matches is very small
User perceptions may be unscientific, but are significant over a large aggregate
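
The precision and recall definitions above can be made concrete with a small sketch. The document identifiers and the function names are hypothetical. `precision_at_k` is the usual "precision in the top k results" measure, shown here because web users mostly look at the first few results.

```python
def precision(retrieved: list[str], relevant: set[str]) -> float:
    """Proportion of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)


def recall(retrieved: list[str], relevant: set[str]) -> float:
    """Proportion of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Precision computed over only the top k ranked results."""
    return precision(ranked[:k], relevant)


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4", "d5"]   # engine output, best first
    relevant = {"d1", "d3", "d9", "d10"}      # made-up ground truth
    print(precision(ranked, relevant))        # 2 of 5 retrieved are relevant
    print(recall(ranked, relevant))           # 2 of 4 relevant were found
    print(precision_at_k(ranked, relevant, 1))
```

Note that recall needs the full set of relevant documents, which is rarely known on the web; this is one practical reason precision near the top of the ranking gets most of the attention.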
Slide 18 : 18/36: Users' eval 2
Users' empirical evaluation of engines
Relevance and validity of results
Interface - Simple, no clutter, error tolerant
Trust - Results are objective
Coverage of topics for polysemic queries
Pre/Post process tools provided
- Mitigate user errors (auto spell check, syntax errors, ...)
- Explicit: Search within results, more like this, refine ...
- Anticipative: related searches
Deal with idiosyncrasies
- Web specific vocabulary
- Impact on stemming, spell-check, etc.
- Web addresses typed in the search box
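
One of the idiosyncrasies listed above, web addresses typed into the search box, can be spotted with a simple pattern check. The sketch below is a rough heuristic under stated assumptions: the regular expression and the name `looks_like_web_address` are illustrative, and it does not validate real top-level domains or reflect how any particular engine does this.

```python
import re

# Optional scheme, one or more dot-separated labels, an alphabetic final
# label of 2+ letters, and an optional path. Deliberately simplistic.
_URL_LIKE = re.compile(
    r"^(https?://)?([a-z0-9-]+\.)+[a-z]{2,}(/\S*)?$", re.IGNORECASE
)


def looks_like_web_address(query: str) -> bool:
    """Heuristic: does the whole query look like a host name or URL?"""
    return bool(_URL_LIKE.match(query.strip()))


if __name__ == "__main__":
    for q in ("www.anu.edu.au", "http://example.com/page", "maui resort"):
        print(q, "->", looks_like_web_address(q))
```

A query that passes this check is probably navigational, so an engine could reasonably put the matching site first rather than treating the text as ordinary keywords.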
Slide 19 : 19/36: Loyalty
Loyalty to a given search engine

Reference: iProspect Survey, April 2004
Slide 20 : 20/36: Web corpus
The Web corpus

No design/co-ordination
Distributed content creation, linking, democratization of publishing
Content includes truth, lies, obsolete information, contradictions ...
Unstructured (text, html, ...), semi-structured (XML, annotated photos), structured (Databases) ...
Scale much larger than previous text corpora ... but corporate records are catching up.
Growth - slowed down from initial volume doubling every few months but still expanding
Content can be dynamically generated
Slide 21 : 21/36: Dynamic content
The Web: Dynamic content
A page without a static html version
- E.g., current status of flight QF007
- Current availability of rooms at a hotel
Usually, assembled at the time of a request from a browser
Typically, URL has a '?' character in it
Most dynamic content is ignored by web spiders
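
The "URL has a '?' character" rule of thumb is easy to express in code. The following is a minimal sketch using Python's standard `urllib.parse` module; the example URLs and the helper names are made up for illustration. It is only a heuristic: many dynamically generated pages have no query string at all, and some static pages are served with one.

```python
from urllib.parse import parse_qs, urlsplit


def looks_dynamic(url: str) -> bool:
    """Heuristic from the slide: treat a URL with a non-empty query
    string (the part after '?') as dynamically generated content."""
    return bool(urlsplit(url).query)


def query_parameters(url: str) -> dict[str, list[str]]:
    """Return the query-string parameters as a dict of value lists."""
    return parse_qs(urlsplit(url).query)


if __name__ == "__main__":
    status_url = "http://example.com/status?flight=QF007"  # hypothetical URL
    print(looks_dynamic(status_url))
    print(query_parameters(status_url))
    print(looks_dynamic("http://example.com/index.html"))
```

A crawler that applied this test strictly would skip every URL carrying parameters, which is one way to picture why so much dynamic content ends up outside the index.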

Slide 22 : 22/36: Web size
Size of the web
What is being measured?
- Number of hosts
- Number of (static) html pages
Number of hosts - Netcraft survey
Number of pages - numerous estimates (will discuss later)
Slide 23 : 23/36: Server survey
Netcraft Web Server Survey

The biggest change this month was a 1.9M increase in hostnames
served using Apache. The largest contributor to this was a growth of
561K sites by Next Dimension Inc, but the majority of this consisted
of parked sites on a single IP address.
Slide 24 : 24/36: Web change
Rate of change on the Web
[Cho00] 720K pages from 270 popular sites sampled daily from Feb 17 - Jun 14, 1999
- Any changes: 40% weekly, 23% daily
[Fett02(04)] Massive study: 151M pages checked over a few months
- Significant changes - 7% weekly
- Small changes - 25% weekly
[Ntou04] 154 large sites re-crawled from scratch weekly
- 8% new pages/week
- 8% die
- 5% new content
- 25% new links/week
Cho J, Garcia-Molina H. 2000. The evolution of the Web and implications
for an incremental crawler. Proc. 26th Int. Conf. on Very Large Databases.
Morgan Kaufmann, 200-209.
Ntoulas, A., Cho, J., and Olston, C. 2004. What's new on the web?: the
evolution of the web from a search engine perspective.
Proc. 13th Int. Conf. on World Wide Web, ACM, New York, 1-12.
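
As a back-of-the-envelope illustration of the "8% die" figure above, the sketch below asks how long a set of pages would last if pages disappeared at a constant 8% per week, independently of their age. That constant-rate model is an assumption made here for the arithmetic only; the cited studies do not claim it, and the function names are hypothetical.

```python
import math


def surviving_fraction(weekly_death_rate: float, weeks: int) -> float:
    """Fraction of an initial set of pages still alive after `weeks`,
    ASSUMING the same independent death rate applies every week."""
    return (1.0 - weekly_death_rate) ** weeks


def half_life_weeks(weekly_death_rate: float) -> float:
    """Weeks until half of the initial pages are gone (same assumption)."""
    return math.log(0.5) / math.log(1.0 - weekly_death_rate)


if __name__ == "__main__":
    rate = 0.08  # the weekly figure quoted on the slide
    print(f"after 4 weeks:  {surviving_fraction(rate, 4):.1%} remain")
    print(f"after 52 weeks: {surviving_fraction(rate, 52):.1%} remain")
    print(f"half-life: {half_life_weeks(rate):.1f} weeks")
```

Even as a crude model, this suggests why a crawler cannot treat its index as a one-off snapshot and has to revisit pages continually.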
Slide 25 : 25/36: Change static pages
Static pages: rate of change
Fetterly et al. study (2004): several views of data, 150 million pages over 11 weekly crawls
- Bucketed into 85 groups by extent of change

Fetterly, D., Manasse, M., Najork, M. and Wiener, J.L. 2004. A large-scale study of the evolution of Web pages. Softw. Pract. Exper. 34:213-237.
Slide 26 : 26/36: Some web characteristics
Some web characteristics
Significant duplication
- Syntactic - about 30% (near) duplicates [Fett03]
- Semantic - ???
High linkage
- More than 8 links/page on average
Complex graph topology
Spam
Fetterly, D., Manasse, M. and Najork, M. 2003. On the evolution of clusters of near-duplicate Web pages. Procs. 1st Latin-Amer. Web Congress, 37-45.
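
The slides do not say how syntactic near-duplicates are detected. One widely used approach, shown here purely as an illustration and not as the method of the cited study, is to compare the sets of overlapping word k-grams ("shingles") of two pages using the Jaccard coefficient. The shingle size k = 3, the 0.9 threshold, and all function names are assumptions.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping k-word sequences from lower-cased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of intersection over size of union; 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(text_a: str, text_b: str,
                    threshold: float = 0.9, k: int = 3) -> bool:
    """True if shingle overlap meets the (assumed) threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


if __name__ == "__main__":
    a = "the quick brown fox jumps"
    b = "the quick brown fox leaps"
    print(jaccard(shingles(a), shingles(b)))  # 2 shared of 4 distinct shingles
```

Comparing every pair of pages this way does not scale to the web; it only shows what "syntactically near-duplicate" can mean in measurable terms.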
Slide 27 : 27/36: Spam: Paid ...
SPAM: Paid placement - the trouble with paid placement is ...
It costs money. What's the alternative?
Search Engine Optimization (SEO):
- "Tuning" your web page to rank highly in the search results for select keywords
- Alternative to paying for placement
- Thus, intrinsically a marketing function
Performed by companies, webmasters and consultants ("search engine optimizers") for their clients
Some perfectly legitimate, some very shady
Slide 28 : 28/36: Simple SEO
Simplest forms of Search Engine Optimisation
First generation engines relied heavily on tf/idf
- The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's
SEOs - dense repetitions of chosen terms
- e.g., maui resort maui resort maui resort
- Often, the repetitions would be in the same color as the background of the web page
- Repeated terms got indexed by crawlers
- But not visible to humans on browsers
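Note: tf/idf is term frequency multiplied by inverse document frequency (equivalently, term frequency scaled down by how many documents contain the term), so a word that is frequent in a page gets a high value unless the word is common in the whole document collection.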
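
To see why dense repetition worked against such engines, here is a minimal sketch of tf-idf scoring. It assumes one common formulation, raw term count multiplied by log(N / df), where N is the number of documents and df is the number containing the term; real engines used many variants. The tiny corpus and the function names are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """Raw term frequency in `doc` times log(N / document frequency)."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


def score(query: list[str], doc: list[str], corpus: list[list[str]]) -> float:
    """Sum of tf-idf weights of the query terms in the document."""
    return sum(tf_idf(t, doc, corpus) for t in query)


if __name__ == "__main__":
    honest = "maui resort with ocean view rooms".split()
    stuffed = "maui resort maui resort maui resort".split()
    corpus = [honest, stuffed,
              "the hotel rooms in sydney".split(),
              "the weather in canberra".split()]
    query = ["maui", "resort"]
    print("honest :", score(query, honest, corpus))
    print("stuffed:", score(query, stuffed, corpus))
```

With this naive scoring the stuffed page scores three times higher than the honest one simply because each query term appears three times, which is the weakness the repetition trick exploited.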
Slide 29 : 29/36: Cloaking
Cloaking
Serve fake content to search engine spider
DNS cloaking: Switch IP address. Impersonate

Slide 30 : 30/36: More spam tech
More spam techniques
Doorway pages
- Pages optimized for a single keyword that re-direct to the real target page
Link spamming
- Mutual admiration societies, hidden links, ...
- Domain flooding: numerous domains that point or re-direct to a target page
Robots
- Fake query stream - rank checking programs
- "Curve-fit" ranking programs of search engines
- Millions of submissions via Add-Url
Slide 31 : 31/36: Anti spam
The war against spam
Quality signals - Prefer authoritative pages based on:
- Votes from authors (linkage signals)
- Votes from users (usage signals)
Policing of URL submissions
Limits on meta-keywords
Robust link analysis
- Ignore statistically implausible linkage (or text)
- Use link analysis to detect spammers (guilt by association)
Spam recognition by artificial intelligence
- Training set based on known spam
Family friendly filters
- Linguistic analysis, general classification techniques, etc.
- For images: flesh tone detectors, source text analysis, etc.
Editorial intervention
- Blacklists
- Top queries audited
- Complaints addressed
- Suspect pattern detection
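
The idea of links as "votes from authors" can be illustrated with a simplified PageRank-style computation. This is a teaching sketch, not a description of any engine's production ranking: the damping factor of 0.85, the fixed iteration count, the toy link graph, and the function name are all assumptions.

```python
def pagerank(links: dict[str, list[str]],
             damping: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Simplified PageRank by repeated redistribution of rank along links.
    `links` maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}  # made-up link graph
    for page, value in sorted(pagerank(toy_web).items()):
        print(page, round(value, 3))
```

In the toy graph, page c ends up with the highest score because two pages link to it, and page b the lowest because nothing links to it. Link spamming tries to manufacture such votes, which is why the slide stresses ignoring statistically implausible linkage.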
Slide 32 : 32/36: More on spam
More on spam
Web search engines have policies on SEO practices they encourage / tolerate / block
Adversarial IR: the unending (technical) battle between SEOs and web search engines
Slide 33 : 33/36: User need
Answering the need behind the query
Query language determination
- Auto filtering
- Different ranking (if query in Japanese do not return English)
Hard & soft (partial) matches
- Personalities (triggered on names)
- Cities (travel info, maps)
- Medical info (triggered on names and/or results)
- Stock quotes, news, company info
Natural Language reformulation
Integration of Search and Text Analysis
Slide 34 : 34/36: Need - context
Answering the need behind the query: Context
Context determination
- spatial (user location/target location)
- query stream (previous queries)
- personal (user profile)
- explicit (user choice of a vertical search)
- implicit (use Google from France, use google.fr)
Context use
- Result restriction
- Eliminate inappropriate results
- Ranking modulation
- Use a "rough" generic ranking, but personalise later
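
The two uses of context listed above, result restriction and ranking modulation, can be sketched as a filter followed by a re-sort. Everything in this example is hypothetical: the `Result` fields, the `local_boost` factor of 1.5, and the sample data are assumptions chosen only to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    score: float    # rough, context-free ranking score
    language: str
    country: str


def personalise(results: list[Result], user_language: str,
                user_country: str, local_boost: float = 1.5) -> list[Result]:
    """Restrict results by language, then boost results from the user's
    country and re-sort (assumed boost factor, for illustration only)."""
    # Result restriction: drop pages not in the user's language.
    kept = [r for r in results if r.language == user_language]

    # Ranking modulation: scale the generic score using context.
    def adjusted(r: Result) -> float:
        return r.score * (local_boost if r.country == user_country else 1.0)

    return sorted(kept, key=adjusted, reverse=True)


if __name__ == "__main__":
    generic = [
        Result("a.com", 1.0, "en", "us"),
        Result("d.ca", 0.9, "fr", "ca"),
        Result("b.fr", 0.8, "fr", "fr"),
        Result("c.fr", 0.7, "fr", "fr"),
    ]
    for r in personalise(generic, user_language="fr", user_country="fr"):
        print(r.url)
```

For a French-speaking user in France, the English page is removed and the two French-hosted pages move ahead of the Canadian one, even though the Canadian page had the higher generic score.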
Slide 35 : 35/36: Query exp
Query expansion (e.g. www.ask.com)
