Next: The PageGather Algorithm Up: A Case Study: Index Previous: A Case Study: Index

The Index Page Synthesis Problem

Page synthesis is the automatic creation of web pages. An index page is a page consisting of links to a set of pages that cover a particular topic (e.g., electric guitars). Given this terminology, we define the index page synthesis problem: given a web site and a visitor access log, create new index pages containing collections of links to related but currently unlinked pages. An access log is a document containing one entry for each page requested from the web server. Each entry lists at least the origin (IP address) of the request, the URL requested, and the time of the request. Related but unlinked pages are pages that share a common topic but are not currently linked at the site; two pages are considered linked if there exists a link from one to the other or if there exists a page that links to both of them.
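The definitions above can be made concrete with a small sketch. It is illustrative only and not part of the system described in this paper: the names `LogEntry`, `are_linked`, and `site_links`, the URLs, and the choice to represent the site's link structure as a dictionary from each page to the set of pages it links to are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    """One access-log entry (hypothetical record layout)."""
    origin: str      # IP address the request came from
    url: str         # URL that was requested
    time: datetime   # time of the request


def are_linked(a, b, links):
    """Return True if pages a and b count as linked under the definition above.

    `links` maps each page URL to the set of URLs that page links to
    (an assumed representation of the site's link structure).
    """
    # Case 1: a direct link from one page to the other, in either direction.
    if b in links.get(a, set()) or a in links.get(b, set()):
        return True
    # Case 2: some page links to both a and b.
    return any(a in targets and b in targets for targets in links.values())


# Hypothetical four-page site, used only for illustration.
site_links = {
    "/index.html": {"/guitars/strat.html", "/guitars/lespaul.html"},
    "/guitars/strat.html": set(),
    "/guitars/lespaul.html": set(),
    "/amps/tube.html": set(),
}
```

Under this reading, the two guitar pages are already linked because `/index.html` links to both, whereas `/guitars/strat.html` and `/amps/tube.html` are unlinked. Only pairs of the second kind are candidates for a new index page, and only if they also share a common topic.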

The problem of synthesizing a new index page can be decomposed into several subproblems.

1. What are the contents (i.e., hyperlinks) of the index page?
2. How are the hyperlinks on the page ordered?
3. How are the hyperlinks labeled?
4. What is the title of the page? Does it correspond to a coherent concept?
5. Is it appropriate to add the page to the site? If so, where?

In this paper, we focus on the first subproblem: generating the contents of the new web page. The remaining subproblems are topics for future work. We note that several of them, particularly the last one, are quite difficult and will be solved in collaboration with the site's human webmaster. Nevertheless, we show that the task of generating candidate index page contents can be automated with some success using the PageGather algorithm described below.


Mike Perkowitz
1999-03-02