Lecture Notes : COMP1710: Web search


Slide 1 : 1/36: Web search

COMP1710 Web Development and Design

 

Web search



Slide 2 : ToC : COMP1710: Web search

Table of Contents (36 slides) for the presentation :

COMP1710: Web search


Slide 3 : 3/36: New Media and Web

In this session: web search

Reference for slides: CS276


Slide 4 : 4/36: Search use

Search use

(iProspect Survey, 4/04, http://198.170.122.106/premiumPDFs/iProspectSurveyComplete.pdf)

See also a more recent study: http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2716/2385

 


Slide 5 : 5/36: Web search and web scale

Without search engines the web wouldn't scale

  1. No incentive in creating content unless it can be easily found - other finding methods haven't kept pace (taxonomies, bookmarks, etc)
  2. The web is both a technology artifact and a social environment
    • "The Web has become the new normal in the American way of life; those who don't go online constitute an ever-shrinking minority." [Pew Foundation report, January 2005]
  3. Search engines make aggregation of interest possible:
    • Create incentives for very specialized niche players
      • Economical - specialized stores, providers, etc
      • Social - narrow interests, specialized communities, etc
  4. The acceptance of search interaction makes unlimited selection stores possible:
    • Amazon, Netflix, etc
  5. Search turned out to be the best mechanism for advertising on the web, a $15B+ industry.
    • Growing very fast, but the entire US advertising industry is $250B, so there is huge room to grow
    • Sponsored search marketing is about $10B

 


Slide 6 : 6/36: Coarse dynamics

Coarse dynamics of Web interaction

 


Slide 7 : 7/36: Brief history

Brief (non-technical) history

Early keyword-based engines

Paid placement ranking:

 


Slide 8 : 8/36: History 2

Brief (non-technical) history

1998+: Link-based ranking pioneered by Google

Result: Google added paid-placement 'ads' to the side, independent of search results

 


Slide 9 : 9/36: Nigritude ultramarine

 


Slide 10 : 10/36: Ads vs search

Adverts vs. search results

Google states ads (based on vendors bidding for keywords) do not affect vendors' rankings in search results

 


Slide 11 : 11/36: Web search basics

Web search basics

 


Slide 12 : 12/36: User needs

User needs [Broder '02, Rose et al '04, Jansen et al '08]

Informational - want to learn about something (39% / 61% / 81%)

Navigational - want to go to that page (25% / 15% / 10%)

Transactional - want to do something (web-mediated) (36% / 24% / 9%)

Access a service

Downloads

Shop

Gray areas

Find a good hub

Exploratory search: see what's there

Broder, A. 2002. A taxonomy of web search. SIGIR Forum 36(2):3-10.
Rose, D.E. and Levinson, D. 2004. Understanding user goals in web search.
    Proc 13th Int Conf on World Wide Web, New York, 13-19.
Jansen, B.J., Booth, D.L. and Spink, A. 2008. Determining the informational,
    navigational, and transactional intent of Web queries. Information Processing
    and Management, 44:1251-1266.


Slide 13 : 13/36: Web search users

Web search users

Make ill-defined queries

  • Short
    • AltaVista (AV) 2001: 2.54 terms on average
      (80% < 3 words)
    • Not increasing over time
  • Imprecise terms
  • Sub-optimal syntax
    (most queries without operator)
  • Low effort

Wide variance in

  • Needs, Expectations, Knowledge, Bandwidth

Specific behavior

  • 68% look at one result screen only
    (increased over time)
  • about 50% of queries are not modified
    (one query/session)
  • Follow links - the scent of information ...


Slide 14 : 14/36: Query Distribution

Query Distribution

Power law: few popular broad queries, many rare specific queries
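
A minimal sketch of this head-versus-tail shape. The query log below is made up for illustration (the queries and counts are assumptions, not data from the slides); it only shows how a few repeated queries can dominate volume while most distinct queries occur once.

```python
from collections import Counter

# Hypothetical toy query log: a few broad queries repeat, most appear once.
QUERY_LOG = (
    ["weather"] * 6
    + ["news"] * 4
    + ["cheap flights canberra"]
    + ["comp1710 lecture notes"]
    + ["fix css float bug"]
    + ["netcraft server survey"]
)


def head_and_tail(log, head_size=2):
    """Return (share of total volume taken by the top `head_size` queries,
    number of distinct queries that were seen exactly once)."""
    counts = Counter(log)
    head_volume = sum(n for _, n in counts.most_common(head_size))
    singletons = sum(1 for n in counts.values() if n == 1)
    return head_volume / len(log), singletons


head_share, rare_queries = head_and_tail(QUERY_LOG)
print(f"top 2 queries: {head_share:.0%} of volume; "
      f"{rare_queries} distinct queries seen only once")
```

On a real log the same two numbers (head share and count of one-off queries) give a quick feel for how heavy the tail is.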


Slide 15 : 15/36: How far?

How far do people look for results?

When you perform a search on a search engine and don't find what you are looking for, at what point do you typically either revise your search or move on to another search engine? (Select one)

iprospect.com WhitePaper_2006_SearchEngineUserBehavior


Slide 16 : 16/36: How far?

How far do people look for results?

When you perform a search on a search engine and don't find what you are looking for, at what point do you typically either revise your search or move on to another search engine? (Select one)


Slide 17 : 17/36: Users' eval

Users' empirical evaluation of results

Quality of pages varies widely

Precision vs. recall

What matters

User perceptions may be unscientific, but are significant over a large aggregate
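
The slide names precision and recall without defining them. As a reminder, using the standard information retrieval definitions, a small sketch follows; the document ids and relevance judgements are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved results that are relevant.
    Recall: fraction of all relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four results shown, five relevant documents exist.
first_page = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d7", "d8", "d9"]

p, r = precision_recall(first_page, relevant_docs)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

A user who reads only the first result screen experiences precision directly, but has little way to judge recall.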


Slide 18 : 18/36: Users' eval 2

Users' empirical evaluation of engines

Relevance and validity of results

Interface - Simple, no clutter, error tolerant

Trust - Results are objective

Coverage of topics for polysemic queries

Pre/Post process tools provided

Deal with idiosyncrasies


Slide 19 : 19/36: Loyalty

Loyalty to a given search engine

Reference: iProspect Survey, April 2004


Slide 20 : 20/36: Web corpus

The Web corpus

No design/co-ordination

Distributed content creation, linking, democratization of publishing

Content includes truth, lies, obsolete information, contradictions ...

Unstructured (text, html, ...), semi-structured (XML, annotated photos), structured (Databases) ...

Scale much larger than previous text corpora ... but corporate records are catching up.

Growth - slowed down from initial volume doubling every few months but still expanding

Content can be dynamically generated


Slide 21 : 21/36: Dynamic content

The Web: Dynamic content

A page without a static html version

Usually, assembled at the time of a request from a browser

Typically, URL has a '?' character in it

Most dynamic content is ignored by web spiders
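
A crawler could apply the '?' rule of thumb roughly as sketched below. This is only the heuristic from the slide, not how any particular spider works; the function name and the example URLs are illustrative assumptions. Dynamic pages without a query string would be missed.

```python
from urllib.parse import urlsplit


def looks_dynamic(url):
    """Heuristic: treat a URL with a non-empty query string ('?...') as dynamic."""
    return urlsplit(url).query != ""


# Hypothetical URLs for illustration only.
for url in ("http://example.com/about.html",
            "http://example.com/search?q=web+search"):
    print(url, "->", "dynamic" if looks_dynamic(url) else "static")
```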


Slide 22 : 22/36: Web size

Size of the web

What is being measured?

Number of hosts - netcraft survey

Number of pages - numerous estimates (will discuss later)


Slide 23 : 23/36: Server survey

Netcraft Web Server Survey

The biggest change this month was a 1.9M increase in hostnames served using Apache. The largest contributor to this was a growth of 561K sites by Next Dimension Inc, but the majority of this consisted of parked sites on a single IP address.


Slide 24 : 24/36: Web change

Rate of change on the Web

[Cho00] 720K pages from 270 popular sites sampled daily from Feb 17 - Jun 14, 1999

[Fett02(04)] Massive study: 151M pages checked over a few months

[Ntul04] 154 large sites re-crawled from scratch weekly

Cho J, Garcia-Molina H. 2000. The evolution of the Web and implications
    for an incremental crawler. Proc. 26th Int. Conf. on Very Large Databases.
    Morgan Kaufmann, 200-209.
Ntoulas, A., Cho, J., and Olston, C. 2004. What's new on the web?: the
    evolution of the web from a search engine perspective.
    Proc. 13th Int. Conf. on World Wide Web, ACM, New York, 1-12.


Slide 25 : 25/36: Change static pages

Static pages: rate of change

Fetterly et al. study (2004): several views of data, 150 million pages over 11 weekly crawls

Fetterly, D., Manasse, M., Najork, M. and Wiener, J.L. 2004. A large-scale study
    of the evolution of Web pages. Softw. Pract. Exper. 34:213-237.


Slide 26 : 26/36: Some web characteristics

Some web characteristics

Significant duplication

High linkage

Complex graph topology

Spam

Fetterly, D., Manasse, M. and Najork, M. 2003. On the evolution of clusters of near-duplicate Web pages.
    Proc. 1st Latin American Web Congress, 37-45.
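
The slide does not say how duplication is measured. One widely used approach, assumed here purely for illustration, is to compare the sets of word shingles (overlapping k-word windows) of two pages using Jaccard similarity. The page texts and the choice of k = 3 are hypothetical.

```python
def shingles(text, k=3):
    """Set of overlapping k-word windows (shingles) from the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical page texts: page_b differs from page_a by a single word.
page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox leaps over the lazy dog"
page_c = "web search engines rank pages using link analysis"

near_dup = jaccard(shingles(page_a), shingles(page_b))
unrelated = jaccard(shingles(page_a), shingles(page_c))
print(f"a vs b: {near_dup:.2f}, a vs c: {unrelated:.2f}")
```

A single changed word breaks every shingle that contains it, so short texts need a lower similarity threshold than long ones.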


Slide 27 : 27/36: Spam: Paid ...

SPAM: Paid placement - the trouble with paid placement is ...

It costs money. What's the alternative?

Search Engine Optimization (SEO):

Performed by companies, webmasters and consultants (Search engine optimizers) for their clients

Some perfectly legitimate, some very shady


Slide 28 : 28/36: Simple SEO

Simplest forms of Search Engine Optimisation

First generation engines relied heavily on tf/idf

SEOs - dense repetitions of chosen terms

Note: tf/idf (more commonly written tf-idf) is term frequency multiplied by inverse document frequency, so a word that is frequent in a document gets a high weight unless it is also common across the whole document collection.
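
To see why dense repetition worked against such engines, here is a small sketch using one common formulation, weight = tf × log(N / df), where tf is the raw count of the term in the document, N the number of documents and df the number of documents containing the term. The exact weighting used by early engines is not given in the slides, and the four toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """tf-idf weight of `term` in `doc`: raw term count times log(N / df)."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Hypothetical toy corpus; `stuffed` repeats the target terms on purpose.
honest = "cheap flights to sydney and melbourne".split()
stuffed = "cheap flights cheap flights cheap flights and cheap flights".split()
other1 = "history of the web and search engines".split()
other2 = "netcraft survey of web servers and hosts".split()
corpus = [honest, stuffed, other1, other2]

print("flights, honest page :", round(tf_idf("flights", honest, corpus), 3))
print("flights, stuffed page:", round(tf_idf("flights", stuffed, corpus), 3))
print("and, honest page     :", round(tf_idf("and", honest, corpus), 3))
```

The stuffed page scores four times higher for "flights" simply because the term is repeated four times, while "and" scores zero because it appears in every document.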


Slide 29 : 29/36: Cloaking

Cloaking

Serve fake content to search engine spider

DNS cloaking: Switch IP address. Impersonate


Slide 30 : 30/36: More spam tech

More spam techniques

Doorway pages

Link spamming

Robots


Slide 31 : 31/36: Anti spam

The war against spam

Quality signals - Prefer authoritative pages based on:

  • Votes from authors (linkage signals)
  • Votes from users (usage signals)

Policing of URL submissions

  • Anti robot test

Limits on meta-keywords

Robust link analysis

  • Ignore statistically implausible linkage (or text)
  • Use link analysis to detect spammers (guilt by association)

Spam recognition by artificial intelligence

  • Training set based on known spam

Family friendly filters

  • Linguistic analysis, general classification techniques, etc.
  • For images: flesh tone detectors, source text analysis, etc.

Editorial intervention

  • Blacklists
  • Top queries audited
  • Complaints addressed
  • Suspect pattern detection


Slide 32 : 32/36: More on spam

More on spam

Web search engines have policies on SEO practices they encourage / tolerate / block

Adversarial IR: the unending (technical) battle between SEOs and web search engines

Research: http://airweb.cse.lehigh.edu/


Slide 33 : 33/36: User need

Answering the need behind the query

Query language determination

Hard & soft (partial) matches

Natural Language reformulation

Integration of Search and Text Analysis


Slide 34 : 34/36: Need - context

Answering the need behind the query: Context

Context determination

Context use


Slide 35 : 35/36: Query exp

Query expansion (e.g. www.ask.com)

