Towards Automated Categorization and Abstracting of Web Sites
Giuseppe Attardi
Dipartimento di Informatica, Università di Pisa, Italy
attardi@di.unipi.it
Antonio Gullí
I-Tech, Pisa, Italy
gulli@di.unipi.it
Domenico Dato
I-Tech, Pisa, Italy
dato@di.unipi.it
Cristina Tani
IOL, Pisa, Italy
c.tani@pisa.iol.it
1. Introduction
The World Wide Web will reach more than 1 billion pages by the year 2000 [1].
This size raises serious problems in accessing information. The most useful
tools we use today to interact with Internet content are search engines (SE)
and Web directories (WD). Although they are often confused with one another,
they are significantly different.
A search engine [2] is built in a fully automated way: spiders navigate the
net, following the links extracted from Web pages, and the information they
acquire is used to build the SE index. A user consults the SE by giving keywords
relevant to the particular search he/she wants to perform. Each SE uses
its own techniques to find the documents most appropriate to the
submitted query, and these techniques are often applications of well-known theories
developed in the field of Information Retrieval (IR).
A Web directory [3] is built from reviews of Web material performed by a
relatively large group of well-trained human technicians. They examine Web sites,
submitted to the directory by their authors, and classify them according to
the reviewers’ personal tastes and opinions. We can identify at least the
following steps in the reviewing process: manual navigation through the site;
extraction of the significant information contained in the site (summed up in
a few lines called an abstract or summary); categorization of the site within a
pre-existing schema of categories, usually organized in a taxonomy.
Both SEs and WDs are starting to show some limitations due to the explosion of
material on the Web.
According to SearchEngineWatch [4], Altavista, a SE with one of the largest
indexes of the Internet, is about to give up its initial slogan "we index it
all" in favor of "we index the best". When the index is too large,
users have difficulty finding the information they are looking for, for a
simple reason: even an experienced user has trouble identifying the
relevant keywords for a given topic. Searching is more and more a frustrating process
of choosing some keywords, navigating among the results and starting over if no
relevant, or too much irrelevant, information is found. A recent survey shows that
users rarely enter more than three keywords and rarely look at items
beyond the third in the list of results [5].
For WDs, problems appear not so much on the user side as at the level of
reviewers and submitters. As submissions increase, the guide must either add
new reviewers to process listings, accept an increase in the turnaround time for
reviews, or simply accept a smaller percentage of listings than in the past. Srinija
Srinivasan (who oversees Yahoo!’s listing process) said: "the Web
grows faster than we do, and we couldn’t possibly scale personnel to match
the rate at which new Web sites are coming along". The problem is even more
difficult if we consider that a submitted site can evolve its contents, so a
categorization and a description given today may no longer be suitable tomorrow.
Another point is that a human review of a Web site may not be
systematic. As Sullivan [4] said: "In the end, the situation points out
the delicate balancing act that Yahoo’s editors face when they consider how to
list sites. They have some general rules, but there is no classification system set
in stone. Instead, the consensus of various Yahoo surfers acts as a living, ever
changing system."
During a recent Search Engine Conference [6], several people predicted that
future search tools would combine the best features of handcrafted catalogs with
the benefits of Web search engines.
We have developed ACAB, a system that performs Automated Categorization and
ABstracting of Web sites. It builds a Web directory with minimal human intervention,
limited to an initial training phase. ACAB is being successfully used in creating the
Web guide for the Arianna site [Arianna] (the Arianna service, run by
Italia On Line, is both a large national search engine and a Web directory dedicated
to the Italian Web space).
In this paper we present the techniques on which ACAB is based. These include typical
search engine technology (such as spidering, indexing, ranking), techniques from the
field of IR (such as automated categorization by content, query biased abstracting) and
specific techniques for the Web environment (such as URLs clustering). Automated
categorization, compared to manual categorization, allows:
- savings of human resources;
- more frequent updates;
- dealing with large amounts of data;
- discovery and categorization of new sites without human intervention;
- re-categorization of known sites when their content changes;
- re-categorization of known sites when the catalog taxonomy changes.
Automated categorization is also useful for search engines. In response to a query, a
SE might report the most relevant categories that contain significant URLs, combining
the available information retrieval and categorization capabilities. This helps users
resolve the ambiguity that arises from insufficiently specific query terms.
The rest of this paper is organized as follows. We review related work in the
next section. In section 3 we present the architecture of ACAB and the techniques it
exploits; the section also illustrates the benefits of a unifying approach between SE and WD.
Subsection 3.1 describes the methodology used to build the category descriptors that
are used for categorization in 3.2. Subsection 3.3 discusses the need to
identify "Web sites" among a set of pages, and 3.4 presents the query-biased
techniques we adopt to summarize information from "Web sites". In section 4
we present the experimentation conducted to verify the validity of the method
(i.e. the Web directory that was automatically built), and in subsection 4.1 we show our
results in Categorization-based Retrieval applied to WDs. Section 5 presents our suggestion
to use Categorization-based Retrieval for SEs. Section 6 is dedicated to the future
directions we plan to explore, while section 7 presents our conclusions.
2. Related Work
Several approaches have been developed recently for automated categorization, URL
clustering, and abstracting. We discuss those most closely related to our work.
Northern Light [NLsearch] uses automated categorization to dynamically organize search results,
creating Custom Search Folders™ of documents with a similar subject, source or type. Each
folder contains a more focused subset of the original results. Northern Light performs
classification of documents off-line, extracting classification metadata. Such metadata is
used to dynamically organize the results of queries into useful categories. The selection of
categories is rule-based and draws on a 20,000-term subject hierarchy developed by expert
librarians.
Inktomi [Inktomi] has developed search engine technology and a tool for automatic
categorization. They claim that proper categorization can be achieved using 300-350
basis vectors and about 20,000 terms, extracted from 1 million or more documents [6].
Lycos [Lycos] condenses the Internet into Web Guides on various topics. They claim to
use search technology to find the most relevant information on the Web. Moreover,
Lycos uses relevance feedback from users and sorts each list of recommended Web sites
accordingly. Currently, Lycos doesn’t provide any summary for the listed sites.
Vistabar [7] is a desktop assistant for Web browsing. It includes a categorization
tool that extracts a profile for each category in the Yahoo! hierarchy. In order
to categorize new documents, a similarity measure is defined based on traditional
vector space matching on the weighted word frequencies of a document relative to the corpus.
Amico [Amico] is a personalized newspaper for the Web. It maintains a database of
user profiles, each consisting of a list of keywords, and gathers information from
selected Web sites and press agencies about news, weather and job offerings. It classifies
the collected data and presents it to users according to their profiles. Many Internet sites
offer similar services, acting on behalf of users to classify information originating
from various sources. Among others we mention [LineOne], [HotWired], [MyYahoo].
Autonomy [Autonomy] has developed a knowledge management system that uses Adaptive
Probabilistic Concept Modeling (APCM) to understand large documents. The concepts are
represented by patterns of terms and contextual relationships that are most commonly recognized
as important in documents. APCM doesn’t depend on specific linguistic structures or
semantics, but uses algorithms based on Bayesian probabilities and neural networks.
APCM is exploited to extract user profiles and to categorize documents.
Verity [Verity] has developed a Knowledge Organizer (KO), which aims at providing organizations
with the ability to build their own catalog, with minimal human intervention. KO has a graphical
taxonomy editor for building a category hierarchy and associating a query for each node that
specifies a category [6].
Alexa [Alexa] has built a "surf engine", which learns and improves over time
with the collective participation of its users. The system gathers information by crawling
the Web and storing it in an archive. It then exploits both link analysis and surfer’s
usage patterns to determine which sites will be of most interest to individual users. Alexa
discovered about 20,000,000 unique "content areas". Recently Netscape has integrated
Alexa technology inside its browser in the form of "smart browsing". By clicking on
the "What’s related" button surfers receive a list of sites related to the
one currently visited.
HITS [8] exploits the Web topology for compiling a list of authoritative Web resources
on a topic. The algorithm consists of three phases. During the search and grow phase a
query related to the chosen topic is sent to a search engine and the retrieved documents
are named the root set. Then, this set is expanded with: (1) any document pointed to by
a document in the root set and (2) any document that points to a document in the
root set. HITS considers two kinds of pages: Authority pages, which contain a lot of
information about a topic, and Hub pages, which contain a lot of links about the topic.
These pages exhibit a mutually reinforcing relationship: a good hub page points to many
good authorities and a good authority is pointed to by many good hubs. Solving by
iteration the system a=Zh; h=Wa allows identifying hub pages and
authoritative pages (where W is a matrix that contains 1 in position (p,q)
if page p in the expanded root set points to page q, Z is
the transpose of W, and h is a vector initialized to all 1s). From a
mathematical point of view, this is an application of the theory of eigenvectors, and
the convergence of the algorithm is proved in [8].
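As a rough illustration of the iteration above (not taken from [8]; the three-page link graph is purely hypothetical), the mutually reinforcing updates a = Zh and h = Wa can be sketched as:

```python
# Minimal sketch of the HITS iteration a = Zh, h = Wa.
# W[p][q] = 1 if page p in the expanded root set points to page q;
# Z is the transpose of W, applied implicitly below.

def hits(W, iterations=50):
    n = len(W)
    a = [1.0] * n  # authority scores
    h = [1.0] * n  # hub scores, initialized to all 1s
    for _ in range(n and iterations):
        # authority update: a = Z h (sum of hub scores of pages pointing in)
        a = [sum(W[p][q] * h[p] for p in range(n)) for q in range(n)]
        # hub update: h = W a (sum of authority scores of pages pointed to)
        h = [sum(W[p][q] * a[q] for q in range(n)) for p in range(n)]
        # normalize so scores converge instead of growing without bound
        na = sum(x * x for x in a) ** 0.5 or 1.0
        nh = sum(x * x for x in h) ** 0.5 or 1.0
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return a, h

# hypothetical graph: page 0 links to 1 and 2; page 1 links to 2
W = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
authorities, hubs = hits(W)
# page 2 emerges as the strongest authority; page 0 as the strongest hub
```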
CLEVER [9] is an extension of HITS, which exploits contextual information in the form
of text surrounding the links between the documents, in order to focus the algorithm on
related pages. Instead of a matrix W of ones and zeros, a weight
w(p, q) is computed from the number of matches between
terms in a topic description and a window of 50 bytes of text around the hyperlink
defined by the HTML tag "<A>". Tests of CLEVER have shown its
capability to infer Web communities from link topology [10].
PageRank [11] is a method to rank each page on the Web, based on a
weight-propagation algorithm that simulates a short random walk on the directed link graph
of the Web. It is used by the SE Google [2].
Teseo [Teseo] is a tool for the automatic categorization of Web documents. It is based
on the technique of categorization by context, which exploits the context, perceivable
from the structure of HTML documents, to extract information useful for classifying the
documents they refer to. It has been successfully used to reclassify existing Web
directories according to a new hierarchical schema of categories, or to merge them [12].
Inquirus [13] is a meta search engine. It downloads the pages listed in the results of
querying other search engines. It then displays, along with each entry, the local context
in which the query terms appear, rather than an unrelated summary of the page. Displaying
the context rather than a query-insensitive abstract of the document helps the user
determine whether the document answers his or her query.
Tombros [14] performed a user survey comparing query-biased summarization techniques
with static or AI-based abstracting, concluding that the former is preferable. A query-biased
technique was also proposed by Helm [15] in his toolset for personal information management.
3. The ACAB Architecture
The task of building a catalog involves the following activities: (1) selecting or building
a schema of hierarchical categories; (2) selecting the collection of documents to categorize;
(3) identifying the most appropriate categories for each document; (4) clustering closely
related URLs which represent a single site, and (5) extracting and summarizing the most
significant information contained in each identified site.
ACAB is one of the first systems to exploit a tight integration between a search engine
and an automated document classifier. This integration has a number of benefits from the
point of view of implementation:
- performance: since the search engine already retrieves and stores documents from the
Web, the classifier has a large number of documents available locally;
- indexing: the indexes built by the search engine can be directly queried for information
useful in building the catalog. In particular, it is fairly simple to obtain information
about the number of URLs relevant to a certain category with respect to the total number
of URLs in a given site: it is sufficient to perform a query on the indexes kept by the
search engine. Similarly one can obtain useful information for ranking the relevance of
pages, for clustering URLs, and for determining Web communities based on link topology;
- history: since a search engine keeps a history of visited pages, the classifier can be
triggered when new or updated pages are detected. This enables a more timely update of
the catalog to reflect changes in the original sites.
Among other attempts at exploiting a search engine for classification we mention
the CLEVER system [9], which extracts some information by querying the Altavista
search engine, in particular for building its root set. Lycos announced the use of an
automatic classifier for building their new catalog, and they claim it is integrated
with SE techniques.
The integration of a search engine with a Web directory service can be quite beneficial
for users as well. With respect to a typical search engine, users can search within
restricted or better-specified topics, improving precision and reducing noise. With
respect to manual Web directories, users gain access to a larger collection of documents,
more up to date and with a more consistent classification.
Moreover additional services can be envisaged, for instance:
- grouping the results of a query by categories (as in Northern Light [NLsearch]);
- asking within which categories a document appears;
- providing category-dependent abstracts for the same document;
- filtering documents according to both a user profile and the categories to which the
document belongs. For example, one may prevent showing documents with adult content to
an inappropriate audience.
The ACAB architecture reflects the integration of an automated classifier with the
Arianna search engine [Arianna] as illustrated in Figure 1. At the top appear
some components of the Arianna search engine: the Spidering and Indexing subsystems.
Arianna uses Fulcrum [Fulcrum] as an IR engine for indexing and retrieval. The remaining
components perform the tasks mentioned earlier: a builder creates a descriptor
tree for the categories; a classifier interfaces to the index of the search engine; a
module clusters URLs and identifies "Web sites"; and a module produces the
summaries of documents. In the following paragraphs we analyze each subsystem in turn.
Figure 1: Macro-architecture of ACAB: it uses search engine technology (such as
spidering, indexing), techniques from the IR field (such as automated categorization
by content, query-biased abstracting), and methods typical of the Web environment
(such as URL clustering).
3.1 Creating Profiles for the Categories
A human editor "knows" how to categorize a site from his/her knowledge,
understanding and past experience in Web browsing. A reviewer has expertise in organizing
data in a hierarchical manner that derives from his/her personal interests, independently
from the Web.
A category C is typically represented by a descriptor CDC,
consisting of a set of weighted keywords and sentences. Category descriptors can be
built either manually, by expert librarians or linguists, as in Infoseek, Verity, Northern Light,
etc. [9], or automatically, exploiting machine-learning, statistical [Vistabar] or neural network
technologies [Autonomy].
In IR the common way to extract knowledge about categories is to perform a
training phase on a large set of pre-classified documents. Among such techniques,
we mention the approach by Fuhr [16], which consists of the following steps. Terms are
extracted from the pre-classified collection, where the documents are filtered against stop
words and stemmed. Then, a relevance descriptor is built for each term in the term space.
It contains statistical information, such as term frequency (tf) and inverse
document frequency (IDF), as well as HTML structure information. Documents are
indexed by weighted terms exploiting the set of relevance descriptors. For each document a
vector of weights is created by using a probabilistic technique known as "Darmstadt"
indexing, an application of the Bayesian theory of probability. Darmstadt indexing proved
very effective at the TREC [TREC] conference. The descriptor produced by this method
typically consists of a vector of over 100,000 weights.
Sebastiani has developed a refinement phase, based on previous work by Yang [17],
which reduces the size of the weight vectors associated with documents while controlling
the degradation in precision. Moreover, Sebastiani builds a single vector representing a
category by means of the Rocchio methodology [18] (a standard IR technique).
In the present implementation of ACAB, we use a set of manually built category descriptions.
These descriptions are validated through a graphical tool (the Category Builder), which helps
in evaluating the adequacy of a description by computing a statistical measure (based on
tf * IDF) for each term in a category description. Weights can be associated with
each term to represent its significance as a characterizing term for the category.
The Category Builder allows organizing the category descriptions in a tree data structure,
where each node represents a category C and is annotated with some attributes
(such as CDC and others we define in the rest of the paper).
A node can inherit the descriptors of its parents, inducing a clear taxonomy in the
catalog to be generated.
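As an illustration of the tf * IDF measure underlying the Category Builder's validation (the toy document collection and terms are hypothetical, and the actual measure used by the tool may differ in detail):

```python
import math

# Sketch of a tf * IDF score for assessing how well a candidate term
# characterizes a category. The tiny document collection is hypothetical.

def tf_idf(term, document, corpus):
    tf = document.count(term)                        # term frequency in this document
    df = sum(1 for d in corpus if term in d)         # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document frequency
    return tf * idf

corpus = [
    ["hotel", "rooms", "booking", "hotel"],
    ["flights", "booking", "airline"],
    ["recipes", "cooking"],
]
# "hotel" is rarer across the corpus than "booking", so it scores higher
score_hotel = tf_idf("hotel", corpus[0], corpus)
score_booking = tf_idf("booking", corpus[0], corpus)
```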
We plan to incorporate in ACAB the techniques by Fuhr and Sebastiani. We are also considering
the use of Natural Language Processing (NLP) tools, such as Part-of-Speech (POS) tagging, to
improve the quality of the terms extracted from documents. In particular we plan to use
TreeTagger, a POS tagger developed within the TC project [19] and optimized by Attardi [20]
for the EuroSearch project.
3.2 Categorization Task
Given a tree of category descriptors, there are two possible approaches to perform document
categorization:
- Document-centered categorization: given a document, find the most appropriate categories
it belongs to. Since the complexity is proportional to the number of documents, this
approach is feasible only when this number is not too large. KNN [21] is a typical
technique of this kind, and Amico is an example of a document-centered application.
- Category-centered categorization: given the category descriptors, search the database
to find the documents that best satisfy each descriptor. This task can be accomplished
with a standard IR system, and so it is well suited to be used with a Web SE. Another
advantage of category-centered categorization is that the complexity is proportional to
the number of categories, which is typically much smaller than the number of documents
to categorize.
ACAB adopts a category-centered categorization achieved by performing a query
directly to the Arianna index. Fulcrum, the IR engine used in Arianna, can be interrogated
through a subset of SQL. It supports various retrieval models such as strict boolean,
weighted boolean, fuzzy boolean and vector space [22]. The strict
boolean retrieval model combines individual predicates using logical boolean operators.
The weighted boolean model extends the strict boolean model by allowing a weight to be
associated with each predicate. The fuzzy boolean model does not enforce the fulfillment
of all AND operators. In the vector space model, documents and queries are viewed as
vectors and search as a matrix product, where the matrix represents the collection of
documents. Fulcrum provides several ranking algorithms that can be used to order the
results of a query.
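The vector space view (search as a matrix product) can be sketched on a toy term space; the two-term vocabulary and the weights below are hypothetical:

```python
# Sketch of the vector space model: documents and queries are term-weight
# vectors, and search is a matrix product. Vocabulary and weights hypothetical.

# rows = documents, columns = terms ("travel", "finance")
D = [[0.9, 0.1],
     [0.0, 1.0],
     [0.5, 0.5]]
q = [1.0, 0.0]  # a query about "travel"

# score of each document = dot product of its row with the query vector
scores = [sum(d_t * q_t for d_t, q_t in zip(row, q)) for row in D]
# document 0 ranks highest for this query
```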
In ACAB we currently adopt the weighted boolean model, since it offers very precise
control over what is retrieved. For each category C, we build a query QC
by transforming the category descriptor from a vector of weighted terms into a boolean
clause: the weighted terms are combined using the logical OR operator. The weights used
in queries are those determined when building the category descriptions.
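The transformation of a descriptor into a weighted OR clause can be sketched as follows; the query syntax shown is purely illustrative and does not reproduce Fulcrum's actual SQL dialect, and the descriptor is hypothetical:

```python
# Illustrative sketch: turn a category descriptor (weighted terms) into a
# weighted boolean OR clause. The "'term'[weight]" syntax is hypothetical.

def build_query(descriptor):
    # descriptor: {term: weight}, as produced by the Category Builder
    clauses = [f"'{term}'[{weight}]" for term, weight in sorted(descriptor.items())]
    return " OR ".join(clauses)

cd_travel = {"hotel": 5, "flight": 4, "itinerary": 2}
query = build_query(cd_travel)
# → "'flight'[4] OR 'hotel'[5] OR 'itinerary'[2]"
```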
A delicate issue is the ranking to apply when searching Web pages, since they exhibit
quite varied content and may purposefully contain terms intended to deceive SEs. Therefore
we must avoid that several unrelated occurrences of a term in a page skew the distribution
of weights, reducing the influence of other relevant terms that may appear in the same page.
It is necessary to balance the contributions of all terms in the descriptor. We performed
extensive experimentation to find a satisfactory solution and eventually settled on
a ranking algorithm that maximizes the number of distinct terms matched in each document.
Moreover, the ranking function increases the weight associated with a term according to
the HTML tags within which it is placed.
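The idea behind this ranking can be sketched as follows; the tag boost factors and the example documents are hypothetical, not ACAB's actual parameters:

```python
# Sketch of a ranking that favors documents matching many *distinct*
# descriptor terms, boosting a term found inside a significant HTML tag.
# Boost factors and inputs are hypothetical.

TAG_BOOST = {"title": 3.0, "h1": 2.0, "body": 1.0}

def rank(descriptor, occurrences):
    # occurrences: list of (term, tag) pairs found in the document.
    # Each distinct term contributes at most once, with its best boost,
    # so repeating one term many times cannot skew the score.
    best = {}
    for term, tag in occurrences:
        if term in descriptor:
            boost = TAG_BOOST.get(tag, 1.0)
            best[term] = max(best.get(term, 0.0), descriptor[term] * boost)
    return sum(best.values())

cd = {"hotel": 5, "flight": 4}
spammy = [("hotel", "body")] * 20                   # one term repeated 20 times
diverse = [("hotel", "title"), ("flight", "body")]  # two distinct terms
# the diverse document outranks the keyword-stuffed one
```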
Our experience shows that, in searches with over 50 boolean OR clauses, performance degrades
significantly. For this reason, we are investigating ways to reduce the search space,
for instance by extending the boolean clause with additional AND NOT expressions.
ACAB submits the query QC to the IR system, which returns a list of Web pages
ordered by decreasing similarity with the category descriptor.
A threshold TQC, defined for each category, is used to select
how many top-ranked documents should be categorized under C. It can be chosen to reflect
the percentage of documents classified in each category, as learned from the distribution in
a training sample of pre-classified documents. Otherwise, it can be considered a similarity
measure: only documents with a rank that exceeds TQC are similar enough to be
included in a category. We are currently investigating the impact of these choices. The list
of documents within the threshold TQC is called LC.
3.3 URL clustering to identify relevant Web sites
In traditional IR, one would stop once a list LC of Web pages relevant to category C has been obtained. In a Web environment we should take into account that Web pages are clustered together by their authors in complex manners. Therefore ACAB analyzes the top-ranked documents within a category and tries to cluster them in order to identify relevant "Web sites".
For example, a natural way of clustering may be inferred from the fact that a group of Web pages is situated at the same Internet site. A group of Web documents DCH ⊆ LC, characterized by URLs with the same host prefix H, could increase their mutual relevance within a category. According to our experiments, categorizing a single Web page while disregarding the hypertextual context in which it is placed often produces noise.
Grouping by site is also an option available when querying some SEs.
Such clustering is often insufficient, since there exist Internet sites (like, for example, GeoCities) that host many pages developed by different authors. These pages could be thought of as "Web sites" for all intents and purposes. Therefore, we are exploring a more effective solution.
We construct the set ACH of hypertextual links that connect pairs of documents within DCH. This information is easily obtained through direct access to the search engine index. We define a directed graph GCH = (DCH, ACH), consisting of as many connected components as there are "Web sites" inside the host H relevant for the category C. One way to find these connected components is to perform a visit starting from each node in DCH. This leads to a partition of the list LC, where each part is a cluster λCi of documents grouped by means of their hypertextual relations. Among such clusters λCi we retain those whose cardinality is greater than a threshold value THC. In other words, we consider as "Web sites" only sets of pages clustered in groups of a certain size.
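The component-finding step can be sketched as follows; the toy URLs, link set, and threshold are hypothetical:

```python
# Sketch of "Web site" identification: find connected components of the
# intra-site link graph G_CH = (D_CH, A_CH) and keep only components of
# cardinality at least TH_C. URLs, links and threshold are hypothetical.

def web_sites(pages, links, th_c):
    # pages: list of URLs on one host; links: set of (src, dst) pairs
    adj = {p: set() for p in pages}
    for src, dst in links:
        adj[src].add(dst)
        adj[dst].add(src)  # treat links as undirected for the visit
    seen, components = set(), []
    for start in pages:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first visit of one connected component
            p = stack.pop()
            if p in comp:
                continue
            comp.add(p)
            stack.extend(adj[p] - comp)
        seen |= comp
        components.append(comp)
    # retain only clusters big enough to count as a "Web site"
    return [c for c in components if len(c) >= th_c]

pages = ["/a/1", "/a/2", "/a/3", "/b/1"]
links = {("/a/1", "/a/2"), ("/a/2", "/a/3")}
sites = web_sites(pages, links, th_c=2)
# one cluster of three pages survives; the isolated page /b/1 is discarded
```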
HITS [8] addresses a similar problem: it tries to infer large "Web communities"
from link topology, while ACAB tries to identify a "Web site" from the pages on the same
Internet site relevant to a category. HITS exploits inter-site hyperlinks and performs its
task by solving a system of equations iteratively; our strategy exploits only intra-site
links and computes the connected components of a small graph GCH.
In the future, we plan to integrate some of the algorithms used by HITS and to perform an
experimental comparison.
3.4 Creating a summary for a Web page
If we sort the documents contained in each λCi by rank, we get a correspondingly ordered list γCi of documents relevant to each "Web site". Notice that the most significant documents for the cluster λCi appear at the top of γCi.
We need a method for summarizing a generic document d ∈ γCi. We adopt a query-biased methodology optimized for the Web environment. Query-biased document summarization has been intensively investigated in recent work by Tombros [14], and compared with static and AI abstracting techniques.
There have been attempts to produce coherent summaries by language generation and AI techniques, but they are capable of processing texts only within a narrow domain, where some characteristics are predictable. So far, there is not enough evidence that such systems work on domain-independent text, such as the generic Web. Query-biased document summarization follows a simpler approach based on three steps [fig. 2]: (1) sentence extraction, (2) sentence manipulation, and (3) calculation of a "query score".
Sentence extraction from a Web page is performed according to criteria derived from punctuation analysis and HTML structure (authors separate sentences by using HTML tags such as <TITLE>, <H*>, <P>, <BR>, <HR>, <TH>, <TD>, <TR>). We wrote a parser that extracts sentences according to these criteria.
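A minimal version of such a parser can be sketched with a standard HTML parser, treating the listed tags as sentence separators; ACAB's actual parser is presumably richer, and the sample page is hypothetical:

```python
from html.parser import HTMLParser

# Sketch of the sentence-extraction step: the listed HTML tags act as
# sentence separators, in addition to ordinary punctuation (omitted here).

BREAK_TAGS = {"title", "h1", "h2", "h3", "p", "br", "hr", "th", "td", "tr"}

class SentenceExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sentences, self.buffer = [], []

    def flush(self):
        # emit the accumulated text as one sentence, if non-empty
        text = " ".join("".join(self.buffer).split())
        if text:
            self.sentences.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BREAK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BREAK_TAGS:
            self.flush()

    def handle_data(self, data):
        self.buffer.append(data)

extractor = SentenceExtractor()
extractor.feed("<title>Hotel Rosa</title><p>Rooms with sea view.</p><p>Book online.</p>")
extractor.flush()
# → ["Hotel Rosa", "Rooms with sea view.", "Book online."]
```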
Figure 2: Architecture of Query Biased Summarizer
In the sentence manipulation step several abstracts are created. Each is obtained by joining groups of consecutive sentences up to a desired length. Within a group, sentences that do not satisfy some heuristic criteria, such as the presence of enough pronouns, articles and verbs, are discarded. Currently we detect this by performing a match against a predefined static set of morphological elements. We plan to exploit more sophisticated criteria in order to improve the choice of meaningful sentences. For instance, we could drop sentences that are not syntactically well formed, e.g. sentences that are just lists of URLs or short phrases without verbs. A Part-of-Speech tagger will be helpful in this task.
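The grouping of consecutive sentences can be sketched as follows; the well-formedness heuristic, function-word list, length limit and inputs are all hypothetical simplifications:

```python
# Sketch of the sentence-manipulation step: candidate abstracts are built by
# joining groups of consecutive sentences up to a desired length, after
# discarding sentences that fail a crude well-formedness heuristic.

FUNCTION_WORDS = {"the", "a", "is", "are", "we", "our", "with"}

def well_formed(sentence):
    # crude stand-in for the morphological-element match: keep sentences
    # containing at least one function word
    return any(w.lower() in FUNCTION_WORDS for w in sentence.split())

def candidate_abstracts(sentences, max_len=80):
    kept = [s for s in sentences if well_formed(s)]
    abstracts = []
    for i in range(len(kept)):
        abstract = kept[i]
        for s in kept[i + 1:]:  # extend with consecutive sentences
            if len(abstract) + len(s) + 1 > max_len:
                break
            abstract += " " + s
        abstracts.append(abstract)
    return abstracts

sentences = ["Welcome to our site.", "http://example.com/x", "We sell books."]
abstracts = candidate_abstracts(sentences)
# the bare URL is discarded; abstracts are built from the remaining sentences
```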
The next step is to rank the abstracts according to the weights of the terms of CDC, the category descriptor for category C, that they contain. One expects in fact that the more descriptor terms are present in an abstract, the more likely it is to contain a significant amount of information related to the category C. Similarly to what we did earlier in the categorization task, the combination of the weights of the terms appearing in the abstract is performed according to the following rules:
- The weight of each matched term is increased by a combination factor that keeps track of the nesting of HTML tags in which it appears, with each HTML tag having a different combination factor. Our parser keeps track of the nested HTML contexts, thereby exploiting the full HTML structure of documents.
- The weight of each term is progressively reduced when it occurs several times in an abstract. This mitigates the keyword spamming phenomenon [4] (people introducing the same hidden keyword in a Web page several times in order to increase the search engines’ rank). We use a simple function that decreases the weights according to the number of occurrences (possibly within a limited text window). This has the effect that abstracts containing several distinct terms of CDC are ranked higher.
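The weight-reduction rule can be sketched as follows; the geometric decay factor and the example abstracts are hypothetical, not the function actually used in ACAB:

```python
# Sketch of the abstract-scoring rule above: each occurrence of a descriptor
# term contributes its weight, but repeated occurrences contribute
# progressively less (here, halving each time), so one term's total
# contribution is bounded and distinct terms dominate the score.

def score_abstract(abstract, descriptor):
    score, counts = 0.0, {}
    for word in abstract.lower().split():
        if word in descriptor:
            counts[word] = counts.get(word, 0) + 1
            # geometric decay: the n-th repetition adds weight * 0.5**(n-1)
            score += descriptor[word] * 0.5 ** (counts[word] - 1)
    return score

cd = {"hotel": 5.0, "beach": 5.0}
stuffed = "hotel hotel hotel"       # 5 + 2.5 + 1.25 = 8.75
diverse = "hotel near the beach"    # 5 + 5 = 10.0
# the abstract covering two distinct descriptor terms outranks the stuffed one
```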
For a document d ∈ γCi we select the top-ranked abstract as its abstract (and its rank is the rank of the abstract).
Since we need a single abstract for each group of documents γCi, we should select the abstract with the highest rank. For performance reasons, we can exploit the ranking of documents within γCi produced by the classification step, and select the first abstract of a document in γCi that exceeds a predefined threshold TAC.
TAC can be thought of as a minimum level of quality for the abstracts. Sites whose abstract does not reach this threshold are currently discarded. For instance, this happens when the terms contained in CDC are scattered throughout the original document. Query-biased summarization tends to give a higher rank to groups of sentences in which the terms of CDC are situated in small windows of text. This behavior has a significant impact on ACAB’s performance: we can avoid an expensive "proximity" search for the terms of the CDC descriptors in the categorization task, and obtain a similar result here.
Adopting a query-biased technique has another important consequence: since a document can be classified in more than one category, ACAB produces a different abstract for each category to which a document is assigned.
Categorization and abstracting put slightly different requirements on the category descriptors CDC. For the purpose of categorization, CDC should be kept as small as possible for performance reasons. On the other hand, abstracting is performed on a much smaller set of documents, and quality is more important. Therefore we are considering an automatic expansion of the category descriptors, for instance by exploiting the hierarchies of concepts and the synset lists (i.e. groups of synonyms) provided by WordNet [23].
3.4.1 Ranking and Mirror Web Sites
Mirror Web sites duplicate the same informative content at different locations, in order to reduce the network distance to clients.
Our methodology associates a rank to each classified site. Moreover, we construct a query-biased abstract for each classified site. Mirror Web sites have the same rank and the same abstract, and so they are easy to spot. ACAB is capable of noticing this and signals the presence of mirror sites in the listing produced for the catalog.
ACAB is more effective at identifying mirror sites than the traditional technique of comparing MD5 signatures of whole Web pages. The latter method fails when mirror pages contain dynamic content that modifies the HTML code (such as advertisement banners or counters). ACAB detects similarities in the abstracts, which are less subject to such variation.
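A minimal sketch of this idea, under the assumption (stated above) that mirrors share both rank and abstract: hashing the (rank, abstract) pair groups probable mirrors together, and tolerates the dynamic-content variation that defeats page-level digests.

```python
# Sketch of abstract-based mirror detection (assumed logic, not ACAB's
# actual implementation). Two sites with identical rank and abstract hash
# to the same key, so mirrors fall into the same bucket.
import hashlib
from collections import defaultdict

def find_mirrors(sites):
    """sites: list of (url, rank, abstract); returns groups of probable mirrors."""
    groups = defaultdict(list)
    for url, rank, abstract in sites:
        key = hashlib.md5(f"{rank:.4f}|{abstract}".encode("utf-8")).hexdigest()
        groups[key].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]
```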
4. Prototype
A prototype tool set based on the described categorization and summarization techniques has been built in order to assess their validity.
A visual Category Builder was developed which enables the user to define, insert, move, or remove categories. It uses a tree data structure maintained in a database, and its graphical interface is available via the Web. The tool allows annotating the node representing a category with its name, editing the terms and weights of the category descriptor, and modifying the various thresholds.
To make the catalog built by ACAB available, we developed a generator of HTML catalog pages. The generator builds a set of hypertext pages for each category, interconnected so as to reflect the catalog structure. Each hypertext page lists the sites, ordered by rank, together with related statistical information.
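The core of such a generator can be sketched in a few lines; the HTML structure and function names below are assumptions for illustration, not the prototype's actual code.

```python
# Minimal sketch of a catalog-page generator: one page per category,
# listing its sites in descending rank order with their abstracts.
def render_category_page(category, sites):
    """sites: list of (url, rank, abstract); returns an HTML string."""
    rows = []
    for url, rank, abstract in sorted(sites, key=lambda s: s[1], reverse=True):
        rows.append(f'<li><a href="{url}">{url}</a> ({rank:.2f})<p>{abstract}</p></li>')
    return (f"<html><body><h1>{category}</h1><ol>"
            + "".join(rows) + "</ol></body></html>")
```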
ACAB has been used in connection with the Arianna Search Engine, the largest national search engine for the Italian Web space, containing over 4 million Web pages. ACAB has been used to build a catalog for a significant portion of the whole Arianna search space, automatically generating five top-level categories of its Web Directory: "Computer and Internet", "Information and News", "Entertainment", "Travel and Tourism", and "Business and Finance". Each top-level category contains ten or more lower-level categories. [Fig. 3] illustrates one of the pages of the catalog built by ACAB.
Arianna plans to apply automated categorization and abstracting to all categories of its directory, which so far have been maintained manually.
The results achieved are quite encouraging. The five top-level categories contain about 10,000 classified sites, and in most cases the prototype categorizes each document in the appropriate category. The few exceptions we encountered are caused by either poorly tuned thresholds or poorly defined category descriptors.
For example, we are evaluating the CLEVER suggestion to limit the display of hub pages, since they may have little identity of their own to help the user decide whether to visit them. Hub pages are just pages of links to relevant information: sometimes they are personal bookmark pages, sometimes pages from other Web directories. Users could therefore be perplexed to see them classified. For us, excluding them amounts to setting an appropriate threshold after the URL clustering.
Figure 3. An example of the result of automated categorization, showing the most relevant URLs within the category "Computer Graphics". In first position is a "Digital Art" site, followed by sites about VRML, ray tracing, rendering, and so on. The language of the abstracts is Italian.
We also noticed that users are satisfied with the system: since it appeared online, the number of hits to the Web directory has increased consistently. The five top-level categories built by ACAB obtained more than 500,000 accesses in one month (a significant figure for the Italian Web space).
4.1 Searching within categories
The catalog is searchable in two ways:
- within a category
- in the whole directory
In the former case, users navigate in the subject tree, which naturally limits the scope of the search, yielding higher precision.
Figure 4. Searching the whole directory for the keyword "notizie" (the Italian word for "news"). On the right side of the browser are sites about news and newspapers, teletext, real-time news, and so on. The folders on the left side represent the most appropriate categories for the search results: "press-agency", "daily news", "newspaper", "search engine", etc. Numbers in parentheses show the number of elements in each folder.
The latter method allows the user to perform a term-based search over the whole guide. This is still different from performing a standard search on the Arianna SE, which searches the whole Arianna space of several million individual pages. The result of a term-based query on the guide produces:
- a short list of the highest ranking URLs;
- a folder for each of the most appropriate categories for grouping the results.
Grouping documents by category gives users suggestions about the semantic classes in which the submitted keywords can be interpreted, in a manner similar to the Custom Search Folders of Northern Light. For example, in [fig. 4] a user submitted the keyword "notizie" (the Italian word for "news"). On the right side of the browser is the list of relevant URLs; on the left side are the folders representing the most appropriate categories for the given search terms, sorted by the number of URLs they contain. Moreover, ACAB provides some additional services, such as a category-dependent abstract for the same document, and finding the categories in which a document appears (though the latter is experimental and not available online).
Note that this categorization scheme outperforms dynamic clustering based on the supplied search terms: it works quite rapidly because the core of the work is done in advance of the search.
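Since each URL was categorized offline, the folder view reduces at query time to a simple bucket-and-sort. The sketch below illustrates this (names and data are hypothetical), with folders ordered by the number of URLs they contain, as in the figure.

```python
# Sketch of grouping term-query results into category folders, in the
# spirit of Northern Light's Custom Search Folders. Categories were
# assigned offline, so grouping is cheap at query time.
from collections import defaultdict

def group_results(results):
    """results: list of (url, category); returns folders sorted by size."""
    folders = defaultdict(list)
    for url, category in results:
        folders[category].append(url)
    return sorted(folders.items(), key=lambda f: len(f[1]), reverse=True)
```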
5. Category-based Retrieval applied to search engines
Categorization does not only concern Web directories but also search engines. The Arianna search engine plans to extend its database to retain the data produced by the categorization, URL clustering, and abstracting processes.
As we mentioned earlier, ACAB is one of the first systems to exploit a tight integration between a search engine and an automated document classifier. ACAB’s result is a categorization of the indexes of a search engine.
A first way to exploit this categorization is to build a Web Directory by clustering URLs and selecting, within each cluster, the URL with the most relevant abstract. Inside the WD, only this URL is shown for each cluster.
Another way is to retain all the data produced by categorization, clustering, and abstracting, and to exploit them when term-based queries are submitted to the search engine. Similarly to the search within the directory mentioned earlier, the SE might use these data to group results by category or to provide the other services envisaged above.
Our studies in this direction are quite encouraging, though at a preliminary stage.
6. Future Work
ACAB will evolve to integrate new functionality into its architecture. First, by using available natural language processing tools (such as POS taggers and WordNet). Second, we plan a deeper study of the link information that can be derived from the Web pages indexed by the search engine; PageRank and CLEVER seem two very interesting approaches.
An interesting field to explore is relevance feedback given by the users. We are currently monitoring which items users choose among those categorized, and we are investigating strategies to exploit this feedback in a refinement phase, either to better assign weights to the terms contained in the category descriptors or to expand them. Moreover, we want to exploit this feedback to improve the rank of the categorized documents, building a list of the most visited sites for each category. HotBot uses a similar technique in ranking pages (developed by Direct Hit [24]), as does the Lycos Web guide.
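One possible form of the envisaged refinement, sketched below under stated assumptions: descriptor terms that occur in documents users actually visited get their weights nudged upward, Rocchio-style. The learning rate and the simple substring presence test are illustrative choices, not the planned design.

```python
# Hedged sketch of feedback-driven descriptor refinement: each term's
# weight grows in proportion to how often it appears in visited documents.
def refine_descriptor(descriptor, visited_docs, alpha=0.1):
    """descriptor: {term: weight}; visited_docs: list of document texts."""
    refined = dict(descriptor)
    for term in descriptor:
        hits = sum(1 for doc in visited_docs if term in doc.lower())
        if visited_docs:
            refined[term] += alpha * hits / len(visited_docs)
    return refined
```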
We are planning to build a testbed for user evaluation in terms of precision (measuring to what extent the listed sites are focused on the category's topic) and recall (measuring how broadly the listed sites cover the category's topic). These are well-known IR measures, but they should be treated very carefully in the Web environment, which is not a traditional closed system.
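The two measures reduce to set operations once a pool of manually judged relevant sites is available (the judged pool is the assumption here; on the open Web, true recall can only be estimated against such a pool):

```python
# Per-category precision and recall against a judged set of relevant sites.
def precision_recall(listed, relevant):
    """listed: sites the system placed in a category; relevant: judged set."""
    listed, relevant = set(listed), set(relevant)
    tp = len(listed & relevant)
    precision = tp / len(listed) if listed else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall
```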
Another interesting field will be the exploitation of metadata, as soon as it is widely and correctly [25] adopted on the Web.
7. Conclusion
We presented an approach to automatically categorizing Web documents, capable of identifying relevant Web sites and extracting a summary of their content. With this method we built Arianna's Web directory with minimal human intervention, limited to the initial training phase. The preliminary results of our prototype, as well as the feedback from users, are quite encouraging.
We believe that automated categorization will be a necessary step for the World Wide Web in the coming years.
8. Acknowledgements
We want to thank Antonio Converti, who provided valuable indications for the research activities we carried out. We also thank the Italia On Line staff (in particular Roberto Bianchettin, Luigi Madella, Alberto Vaccari, and Marisa Farusi) and the Web Agents Group of the Department of Computer Science, Pisa. Our acknowledgement goes to all the people at I-Tech. Last but not least, we thank Dr. Paola Padella for her critical review of this paper.
Our work was partially funded by "EuroSearch" [7], an EEC-funded Telematics project.
9. References
[Alexa] http://www.alexa.com
[Amico] http://amico.arianna.com
[Arianna] http://www.arianna.com
[Autonomy] http://www.autonomy.com
[EuroSearch] http://eurosearch.iol.it
[Fulcrum] http://www.pcdocs.com
[HotWired] http://www.hotwired.com
[Inktomi] http://www.inktomi.com
[LineOne] http://www.lineone.com
[Lycos] http://www.lycos.com
[MyYahoo] http://my.yahoo.com
[NLsearch] http://www.nlsearch.com
[Teseo] http://medialab.di.unipi.it
[Trec] http://trec.nist.gov
[Verity] http://www.verity.com
[Yahoo] http://www.yahoo.com
[1] "Internet Donates Archive of the World Wide Web to Library of Congress",
http://www.alexa.com/company/inthenews/loc.html
[2] "The Anatomy of a Large-Scale Hypertextual Web Search Engine", S. Brin and L. Page, Computer Networks and ISDN Systems, 30 (1998), 107-117.
http://www7.dstc.edu.au
[3] "Yahoo Special Report"
http://www.searchenginewatch.com/webmasters/
[4] SearchEngineWatch, Danny Sullivan, Editor, Mecklermedia.
http://www.searchenginewatch.com
[5] "Searchers, the subjects they search, and sufficiency: a study of a large sample of
Excite searches", B. J. Jansen, A. Spink, J. Bateman, WebNet '98, 1998.
[6] "Plateaus, Peaks and Promise: The Infonortics 98 Search Engine Conference",
http://www.infotoday.com/searcher/jun/story4.html
[7] H. Marais and K. Bharat, "Supporting cooperative and personal surfing with a desktop
assistant", Proceedings of ACM UIST '97, ACM, (1997), 129-138.
[8] "Authoritative Sources in a Hyperlinked Environment", J. M. Kleinberg,
Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
Also appears as IBM Research Report RJ 10076, May 1997.
http://simon.cs.cornell.edu/home/kleinber/kleinber.html
[9] "Automatic resource compilation by analyzing hyperlink structure and associated
text", S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg,
Computer Networks and ISDN Systems, 30 (1998).
http://www7.dstc.edu.au/
http://almaden.ibm.com/cs/k53/clever.html
[10] "Inferring Web Communities from Link Topology", D. Gibson, J. Kleinberg and
P. Raghavan, Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998.
[11] "PageRank: Bringing Order to the World Wide Web: An exploratory analysis of
the intellectual structure of cyberspace", L. Page, Ann. Meeting of the American Soc.
Info. Sci., 1996.
[12] "Categorization by Context", G. Attardi, S. Di Marco and D. Salvi,
J.UCS Vol 9/4.
"Theseus: Categorization by Context", G. Attardi, A. Gullí, F. Sebastiani,
submitted for publication.
[13] "Inquirus, the NECI Meta Search Engine", S. Lawrence and C. L. Giles,
Computer Networks and ISDN Systems, 30 (1998).
http://www7.dstc.edu.au
[14] "Advantages of Query Biased Summaries in Information Retrieval"
A. Tombros and M. Sanderson, In the Proceedings of the 21st Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2-10, 1998.
http://www.dcs.gla.ac.uk/~tombrosa
[15] "A Toolset for Personal Information Management", D. J. Helm and R. J. D'Amore,
in the Proceedings of WebNet 97 World Conference of the WWW, Internet & Intranet, pp. 223-228.
http://www.aace.org/conf/Webnet/Default.htm
[16] "A probabilistic learning approach for document indexing", N. Fuhr and C.
Buckley, ACM Transactions on Information Systems, 9(3), 233-248, 1991.
http://amaunet.cs.uni-dortmund.de/ir/
[17] "A comparative study on feature selection in text categorization",
Y. Yang and J. Pedersen, Fourteenth International Conference on Machine Learning, 412-420.
[18] "Relevance feedback in information retrieval", J. J. Rocchio,
in The SMART Retrieval System – Experiments in Automatic Document Processing,
Prentice Hall, Englewood Cliffs, N.J., 1971.
[19] "TreeTagger - a language independent part-of-speech tagger",
http://www.ims.uni-stuttgart.de/Tools/DecisionTreeTagger.html
[20] The Web Agents Group, Part of Speech Tagger
http://medialab.di.unipi.it/Resource/POS
[21] "An example-based mapping method for text categorization and retrieval",
Yiming Yang and Christopher G. Chute, ACM Transactions on Information Systems,
12, 252-277.
[22] "Web Search Service in 1998: Trends and Challenges",
http://www.infotoday.com/searcher/jun/story2.htm
[23] "WordNet: A Lexical Database for English", G. A. Miller, Communications of the ACM,
38(11), 1995, 39-41.
[24] "Counting Clicks and Looking at Links", D. Sullivan, The Search Engine Report.
http://searchenginewatch.com/sereports/9808-clicks.html
http://www.directhits.com
[25] "The limits of Web Metadata, and beyond", M. Marchiori,
Computer Networks and ISDN Systems, 30, 1998.
http://www7.dstc.edu.au