Automated Web Categorization
Abstract: During a recent Search Engine Conference, several people predicted that future search tools would combine the best features of handcrafted catalogs with the benefits of Web search engines. Here we present ACAB, a system that performs Automated Categorization and ABstracting of Web sites. It builds a Web directory with minimal human intervention, limited to an initial training phase. ACAB is being successfully used to create the Web guide for the Arianna site.
Where & Why ?
The World Wide Web will reach more than 1 billion pages by the year 2000. This size raises serious problems in accessing information. The most useful tools we have today for interacting with Internet content are search engines (SEs) and Web directories (WDs). Both SEs and WDs are starting to show limitations due to the explosion of material on the Web.
According to SearchEngineWatch, Altavista, an SE with one of the largest indexes of the Internet, is about to give up its initial slogan "we index it all" in favor of "we index the best". When the index is too large, users have difficulty finding the information they are looking for. The reason is simple: even an experienced user has trouble identifying the relevant keywords for a given topic. Searching is increasingly a frustrating process of choosing some keywords, navigating among the results, and starting over if no relevant information, or too much irrelevant information, is found. A recent survey shows that no more than three keywords are ever used and that users rarely look beyond the third item in the list of results.
For WDs, the problems appear less on the user side than at the level of reviewers and submitters. As submissions increase, the guide must add new reviewers to process listings, accept a longer turnaround time for reviews, or simply accept a smaller percentage of listings than in the past. Srinija Srinivasan (who oversees Yahoo!'s listing process) said: "the Web grows faster than we do, and we couldn't possibly scale personnel to match the rate at which new Web sites are coming along". The problem becomes more difficult if we consider that a submitted site can change its contents, so a categorization and a description given today may not be suitable tomorrow. Another point is that a human review of a Web site may not be systematic. As Sullivan said: "In the end, the situation points out the delicate balancing act that Yahoo's editors face when they consider how to list sites. They have some general rules, but there is no classification system set in stone. Instead, the consensus of various Yahoo surfers acts as a living, ever changing
During a recent Search Engine Conference, several people predicted that future search tools would combine the best features of handcrafted catalogs with the benefits of Web search engines.
We have developed ACAB, a system that performs Automated Categorization and ABstracting of Web sites. It builds a Web directory with minimal human intervention, limited to an initial training phase. ACAB is being successfully used to create the Web guide for the Arianna site (Arianna is run by Italia On Line. It is both a large national search engine and a Web directory dedicated to the Italian Web space).
In this paper (submitted to the WWW8 conference), we present the techniques on which ACAB is based. These include typical search engine technology (such as spidering, indexing, ranking), techniques from the field of IR (such as automated categorization by content, query-biased abstracting) and techniques specific to the Web environment (such as URL clustering). Automated categorization, compared to manual categorization, allows:
The task of building a catalog involves the following activities: (1) selecting or building a schema of hierarchical categories; (2) selecting the collection of documents to categorize; (3) identifying the most appropriate categories for each document; (4) clustering closely related URLs which represent a single site, and (5) extracting and summarizing the most significant information contained in each identified site.
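Step (4) can be illustrated with a minimal sketch. The source does not describe how ACAB clusters URLs, so the rule used here (group by host name, ignoring a leading "www.", and take the URL with the shortest path as the site's entry page) is only an assumption, and the function names `site_key`, `cluster_urls` and `site_root` and the sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_key(url):
    """Hypothetical site key: the lowercased host name without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def cluster_urls(urls):
    """Group URLs that share a site key; each group stands for one candidate site."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[site_key(url)].append(url)
    return dict(clusters)


def site_root(cluster):
    """Pick the URL with the shortest path as the representative entry page."""
    return min(cluster, key=lambda u: (len(urlsplit(u).path), u))


# Illustrative input only; these URLs are not taken from the Arianna index.
urls = [
    "http://www.example.it/grafica/vrml.html",
    "http://example.it/",
    "http://www.example.it/grafica/",
    "http://altro.example.org/index.html",
]
clusters = cluster_urls(urls)
roots = {key: site_root(group) for key, group in clusters.items()}
```

A real system would probably need finer rules, for example for several sites hosted under one host name, which this sketch does not attempt.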
ACAB is one of the first systems to exploit a tight integration between a search engine and an automated document classifier. This integration has a number of benefits from the point of view of implementation:
The ACAB architecture reflects the integration of an automated classifier with the Arianna search engine [Arianna] as illustrated in Figure 1. At the top appear some components of the Arianna search engine: the Spidering and Indexing subsystems. Arianna uses Fulcrum as an IR engine for indexing and retrieval. The remaining components perform the tasks mentioned earlier: a builder that creates a descriptor tree for the categories; a classifier that interfaces to the index of the search engine; a module that clusters URLs and identifies a "Web site"; and a module that produces the summary of documents.
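The idea of a classifier that works against the search engine's index can be sketched as follows. This is not ACAB's algorithm, which the text above does not detail: the toy inverted index stands in for the Fulcrum index, and the flat descriptor dictionaries, their weights, and the names `build_index` and `classify` are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Toy inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def classify(index, descriptors):
    """Assign each matched document to its best-scoring category.

    descriptors maps a category name to {term: weight} (hypothetical values).
    Documents matching no descriptor term are left unclassified.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for category, terms in descriptors.items():
        for term, weight in terms.items():
            # Look up the posting list instead of rescanning the documents.
            for doc_id, tf in index.get(term, {}).items():
                scores[doc_id][category] += weight * tf
    return {doc_id: max(cats, key=cats.get) for doc_id, cats in scores.items()}


# Illustrative data only.
docs = {
    "d1": "vrml rendering ray tracing gallery",
    "d2": "football league results and football news",
}
descriptors = {
    "Computer Graphics": {"vrml": 2.0, "rendering": 1.0, "tracing": 1.0},
    "Sport": {"football": 2.0, "league": 1.0},
}
assignments = classify(build_index(docs), descriptors)
```

Because each descriptor term is resolved through the index, the classifier reuses the work already done by spidering and indexing rather than processing the documents a second time.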
ACAB has been used in connection with the Arianna Search Engine, the largest national search engine for the Italian Web space, containing over 4 million Web pages. ACAB has been used to build the whole catalog for the Arianna search space, automatically generating the top-level categories of its Web Directory. These include, among others: "Computer and Internet", "Information and News", "Entertainment", "Travel and Tourism", "Business and Finance", "Free Time", "Sport", and "Games and Lottery". Each top-level category contains ten or more lower-level categories. [Fig. 1] illustrates one of the pages of the catalog built by ACAB.
Figure 1. An example of the result of automated categorization. You can see the most relevant URLs within the category "Computer Graphics". In first position is a "Digital Art" site, followed by sites about VRML, ray tracing, rendering and so on. The language of the abstracts is Italian.
The results achieved are quite encouraging. The top-level categories contain about 30,000 classified sites, and in most cases the prototype places each document in the appropriate category. Since the adoption of ACAB and automatic classification, hits to the Web Directory have increased by about 400% with respect to manual classification.
We also notice that users are satisfied with the system. Since it went online, the number of hits to the Web directory has increased steadily. The top-level categories built by ACAB received more than 500,000 accesses in the first month and about 1,000,000 in the second month (the old manually built Web Directory had about 250,000 hits per month). We now have an access rate of more than 35,000 to 40,000 hits per day (a very significant value for an Italian Web directory), and we keep growing.
The catalog is searchable in two ways:
Note that the categorization scheme outperforms dynamic clustering based on the supplied search terms. It works quite rapidly because the core of the work is done in advance of the search.
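The point about doing the work in advance can be illustrated with a small sketch: categories are assigned offline, so a query only has to consult a precomputed table instead of clustering results on the fly. The data layout and the names `build_catalog` and `search` are assumptions for illustration, not a description of how the Arianna catalog is stored.

```python
from collections import defaultdict


def build_catalog(assignments):
    """Offline step: invert {site: category} into {category: sorted list of sites}."""
    catalog = defaultdict(list)
    for site, category in assignments.items():
        catalog[category].append(site)
    return {category: sorted(sites) for category, sites in catalog.items()}


def search(catalog, descriptions, category, term):
    """Query-time step: scan only the sites already filed under one category."""
    term = term.lower()
    return [
        site
        for site in catalog.get(category, [])
        if term in descriptions[site].lower()
    ]


# Illustrative data only.
assignments = {
    "example.it": "Computer Graphics",
    "altro.example.org": "Sport",
    "render.example.net": "Computer Graphics",
}
descriptions = {
    "example.it": "Digital art and VRML worlds",
    "altro.example.org": "Football results",
    "render.example.net": "Ray tracing tutorials",
}
catalog = build_catalog(assignments)
hits = search(catalog, descriptions, "Computer Graphics", "vrml")
```

At query time the cost depends on the size of one category rather than on the whole collection, which is the design trade-off the note above describes.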
For more information on ACAB see: