Automated Web Categorization

Abstract: During a recent Search Engine Conference, several people predicted that future search tools would combine the best features of handcrafted catalogs with the benefits of Web search engines. Here we present ACAB, a system that performs Automated Categorization and ABstracting of Web sites. It builds a Web directory with minimal human intervention, limited to an initial training phase. ACAB is being used successfully to create the Web guide for the Arianna site.
 

 

  • See a paper submitted for publication to WWW8
  • See Arianna, a search engine that uses Automated Web Categorization to build its Web Directory
  • See a FAQ about Arianna's Automated Web Categorization
 
Automated Web Categorization:
Where & Why?

The World Wide Web will reach more than 1 billion pages by the year 2000. This size raises serious problems in accessing information. The most useful tools we use today to interact with Internet content are search engines (SE) and Web directories (WD). Both SEs and WDs are starting to show limitations due to the explosion of material on the Web.

Search Engine Problems

According to SearchEngineWatch, AltaVista, an SE with one of the largest indexes of the Internet, is about to give up its initial slogan "we index it all" in favor of "we index the best". When the index is too large, users have difficulty finding the information they are looking for. The reason is simple: even an experienced user has trouble identifying the relevant keywords for a given topic. Searching is increasingly a frustrating process of choosing some keywords, navigating among the results and starting over if no relevant information, or too much irrelevant information, is found. A recent survey shows that no more than three keywords are ever used and that users rarely look beyond the third item in the list of results.

Web Directory Problems

For WDs, problems appear less on the user side than on the side of reviewers and submitters. As submissions increase, a guide must add new reviewers to process listings, accept a longer turnaround time for reviews, or accept a smaller percentage of listings than in the past. Srinija Srinivasan (who oversees Yahoo!'s listing process) said: "the Web grows faster than we do, and we couldn't possibly scale personnel to match the rate at which new Web sites are coming along". The problem becomes harder still if we consider that a submitted site can change its contents, so a categorization and a description given today may not be suitable tomorrow. Another point is that a human review of a Web site may not be systematic. As Sullivan said: "In the end, the situation points out the delicate balancing act that Yahoo's editors face when they consider how to list sites. They have some general rules, but there is no classification system set in stone. Instead, the consensus of various Yahoo surfers acts as a living, ever changing system."
 

The ACAB Solution
 
During a recent Search Engine Conference, several people predicted that future search tools would combine the best features of handcrafted catalogs with the benefits of Web search engines.

We have developed ACAB, a system that performs Automated Categorization and ABstracting of Web sites. It builds a Web directory with minimal human intervention, limited to an initial training phase. ACAB is being used successfully to create the Web guide for the Arianna site (Arianna is run by Italia On Line. It is both a large national search engine and a Web directory dedicated to the Italian Web space).

In this paper (submitted to the WWW8 conference), we present the techniques on which ACAB is based. These include typical search engine technology (such as spidering, indexing, ranking), techniques from the field of IR (such as automated categorization by content, query-biased abstracting) and techniques specific to the Web environment (such as URL clustering). Automated categorization, compared to manual categorization, allows:

  • savings of human resources;
  • more frequent updates;
  • dealing with large amounts of data;
  • discovery and categorization of new sites without human intervention;
  • re-categorization of known sites when their content changes;
  • re-categorization of known sites when the catalog taxonomy changes.
We also believe that automated categorization will be useful for Search Engines. In response to a query, a SE might report the most relevant categories that contain significant URLs, combining available information retrieval and categorization capabilities. We believe this helps users resolve the ambiguity that can arise from an insufficiently specific query. Moreover, additional services can be envisaged, for instance:
  • grouping by categories the results of a query (as in Northern Light);
  • asking within which categories a document appears;
  • providing category-dependent abstracts for the same document;
  • filtering documents according to both a user profile and the categories to which the document belongs. For example, one may prevent documents with adult content from being shown to an inappropriate audience.
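
The last service in the list above, filtering by user profile and category, can be sketched in a few lines. This is only an illustrative sketch and not ACAB's implementation: the `Document` and `UserProfile` types, the `blocked_categories` field and the "Adult" category name are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Document:
    url: str
    categories: Set[str]  # categories assigned by the classifier


@dataclass
class UserProfile:
    # Hypothetical field: categories this user must never be shown.
    blocked_categories: Set[str] = field(default_factory=set)


def filter_for_profile(docs: List[Document], profile: UserProfile) -> List[Document]:
    """Keep only documents that share no category with the blocked set."""
    return [d for d in docs if not (d.categories & profile.blocked_categories)]


docs = [
    Document("http://example.it/news", {"Information and News"}),
    Document("http://example.it/adult", {"Adult", "Entertainment"}),
]
child = UserProfile(blocked_categories={"Adult"})
visible = filter_for_profile(docs, child)
```

Because the filter works on category labels rather than on page text, it only needs the output of the categorization step and can be applied at query time.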
The ACAB Architecture
 
 The task of building a catalog involves the following activities: (1) selecting or building a schema of hierarchical categories; (2) selecting the collection of documents to categorize; (3) identifying the most appropriate categories for each document; (4) clustering closely related URLs which represent a single site, and (5) extracting and summarizing the most significant information contained in each identified site.
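
Activity (4), clustering related URLs into a single site, can be illustrated with a deliberately simple rule. The sketch below assumes that a "site" is identified by its host name, with a leading "www." ignored; the clustering actually used by ACAB is not described here and may well be more elaborate. The function name `cluster_urls_by_site` is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_urls_by_site(urls):
    """Group URLs under a site key (here simply the lower-cased host name)."""
    sites = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        # Assumption: "www.example.it" and "example.it" are the same site.
        if host.startswith("www."):
            host = host[4:]
        sites[host].append(url)
    return dict(sites)


urls = [
    "http://www.example.it/",
    "http://example.it/grafica/vrml.html",
    "http://other.example.org/index.html",
]
clusters = cluster_urls_by_site(urls)
```

Here the first two URLs end up in the same cluster, keyed by "example.it", while the third forms a cluster of its own.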

ACAB is one of the first systems to exploit a tight integration between a search engine and an automated document classifier. This integration has a number of benefits from the point of view of implementation:

  • performance: since the search engine already retrieves and stores documents from the Web, the classifier has a large number of documents available locally.
  • indexing: the indexes built by the search engine can be queried directly for information useful in building the catalog. In particular, it is fairly simple to obtain the number of URLs relevant to a certain category with respect to the total number of URLs in a given site: it is sufficient to perform a query on the indexes kept by the search engine. Similarly, one can obtain useful information for ranking the relevance of pages, for clustering URLs, and for determining Web communities based on link topologies.
  • history: since a search engine keeps a history of visited pages, the classifier can be triggered when new or updated pages are detected. This enables a more timely update of the catalog to reflect changes in the original sites.
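
The ratio mentioned under "indexing", URLs relevant to a category over all URLs of a site, can be sketched as follows. The in-memory dictionary stands in for the search engine index and is purely an assumption for illustration; in ACAB the same figures would come from queries against the real index, whose interface is not shown here.

```python
from urllib.parse import urlsplit


def category_share(index, site_host, category):
    """Fraction of a site's indexed URLs that are relevant to `category`."""
    site_urls = [u for u in index if urlsplit(u).hostname == site_host]
    if not site_urls:
        return 0.0
    relevant = [u for u in site_urls if category in index[u]]
    return len(relevant) / len(site_urls)


# Toy index (hypothetical): URL -> categories judged relevant for that page.
index = {
    "http://example.it/": {"Computer Graphics"},
    "http://example.it/vrml.html": {"Computer Graphics"},
    "http://example.it/contatti.html": set(),
    "http://example.it/news.html": {"Information and News"},
    "http://other.example.org/": {"Sport"},
}
share = category_share(index, "example.it", "Computer Graphics")
```

In this toy index two of the four pages of "example.it" are relevant to "Computer Graphics", so the share is 0.5. A site whose share for a category is high is a natural candidate for listing under that category.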

The ACAB architecture reflects the integration of an automated classifier with the Arianna search engine [Arianna] as illustrated in Figure 1. At the top appear some components of the Arianna search engine: the Spidering and Indexing subsystems. Arianna uses Fulcrum as an IR engine for indexing and retrieval. The remaining components perform the tasks mentioned earlier: a builder that creates a descriptor tree for the categories; a classifier that interfaces to the index of the search engine; a module that clusters URLs and identifies a "Web site"; and a module that produces the summary of documents.
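
To make the roles of the builder and the classifier more concrete, here is a minimal sketch of a category descriptor tree and a classifier that walks it. It assumes that each category is described by a set of keywords and that a document is assigned to the category whose keywords it matches most often; the descriptors and scoring actually used by ACAB are not specified in this section, and the names `CategoryNode`, `score_categories` and `classify`, as well as the placement of "Computer Graphics" under "Computer and Internet", are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CategoryNode:
    name: str
    descriptors: Set[str] = field(default_factory=set)  # assumed keyword descriptors
    children: List["CategoryNode"] = field(default_factory=list)


def score_categories(node, words, path=()):
    """Yield (category path, number of descriptor terms found in the document)."""
    here = path + (node.name,)
    yield here, len(node.descriptors & words)
    for child in node.children:
        yield from score_categories(child, words, here)


def classify(root, text):
    """Return the path of the best matching category, or None if nothing matches."""
    words = set(text.lower().split())
    best_path, best_score = max(score_categories(root, words), key=lambda item: item[1])
    return best_path if best_score > 0 else None


root = CategoryNode(
    "Computer and Internet",
    {"computer", "internet"},
    [CategoryNode("Computer Graphics", {"vrml", "rendering", "ray", "tracing"})],
)
result = classify(root, "A gallery of VRML worlds and rendering tutorials")
```

A real classifier would work on the search engine index rather than on raw text, and would use weighted terms rather than a plain count, but the tree traversal has the same shape.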

ACAB and Arianna

ACAB has been used in connection with the Arianna Search Engine, the largest national search engine for the Italian Web space, containing over 4 million Web pages. ACAB has been used to build the whole catalog for the Arianna search space, automatically generating the top-level categories of its Web Directory. Among others: "Computer and Internet", "Information and News", "Entertainment", "Travel and Tourism", "Business and Finance", "Free Time", "Sport", "Games and Lottery". Each top-level category contains ten or more lower-level categories. [Fig. 1] illustrates one of the pages of the catalog built by ACAB.

Figure 1. An example of the result of automated categorization. You can see the most relevant URLs within the category "Computer Graphics". In first position is a "Digital Art" site. Sites talking about VRML, ray tracing, rendering and so on follow this. The language of the abstracts is Italian.

Arianna applied automated categorization and abstracting to all categories of its directory, which until then had been maintained manually.

The results achieved are quite encouraging. The top-level categories contain about 30,000 classified sites, and in most cases the prototype assigns each document to the appropriate category. Since ACAB and automatic classification were adopted, hits to the Web Directory have increased by about 400% compared with the manually classified directory.

We also notice that users are satisfied with the system. Since it went online, the number of hits to the Web directory has increased steadily. The top-level categories built by ACAB received more than 500,000 accesses in the first month and about 1,000,000 in the second month (the old manually maintained Web Directory had about 250,000 hits per month). We now have an access rate of more than 35,000 to 40,000 hits per day (a very significant value for an Italian Web directory) and we keep growing.

Searching within categories

The catalog is searchable in two ways:

  • within a category
  • in the whole directory
In the former case, users navigate in the subject tree that naturally limits the scope of search yielding higher precision.

Figure 2. Searching in the whole directory. The user submitted the keyword "notizie" (the Italian word for "news"). On the right side of the browser there are sites about "news and newspaper", teletext, real-time news and so on. On the left side, there are the folders representing the most appropriate categories for the given search terms. They are: "press-agency", "daily news", "newspaper", "search engine" and others. The numbers count how many URLs are relevant for the search.

The latter method allows the user to perform a term-based search in the whole guide. This is still different from performing a standard search on the Arianna SE, which instead searches the whole Arianna space of several million individual pages. A term-based query on the guide produces:
  • a short list of highest ranking URLs;
  • a folder for each of the most appropriate categories for grouping the results.
Grouping documents by categories suggests to users a number of semantic classes in which the submitted keywords can be interpreted, in a manner similar to the Custom Search Folders of Northern Light. For example, in [fig. 2] a user submitted the keyword "notizie" (the Italian word for "news"). On the right side of the browser you can see the list of relevant URLs. On the left side there are the folders representing the most appropriate categories for the given search terms, sorted by the number of URLs contained within. Moreover, ACAB performs some additional services, such as providing a category-dependent abstract for the same document and finding the categories in which a document appears (though the latter is experimental and not available online).
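
The shape of such a result page, a short list of top-ranked URLs plus category folders sorted by size, can be sketched as below. The hit records, their `score` field and the function `build_result_page` are assumptions made for this example; they do not reflect the data structures used by ACAB or Arianna.

```python
from collections import Counter


def build_result_page(hits, top_n=3):
    """Return (top-ranked URLs, category folders sorted by URL count)."""
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    top_urls = [h["url"] for h in ranked[:top_n]]
    counts = Counter(cat for h in hits for cat in h["categories"])
    # Largest folders first; ties broken alphabetically for a stable order.
    folders = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return top_urls, folders


# Hypothetical hits for the query "notizie"; categories were assigned in advance.
hits = [
    {"url": "http://a.example.it/", "score": 0.9, "categories": ["daily news"]},
    {"url": "http://b.example.it/", "score": 0.7, "categories": ["daily news", "newspaper"]},
    {"url": "http://c.example.it/", "score": 0.8, "categories": ["press-agency"]},
]
top_urls, folders = build_result_page(hits, top_n=2)
```

Since each hit already carries its categories, building the folders at query time only requires counting, with no clustering step.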

Note that this categorization scheme outperforms dynamic clustering based on the supplied search terms: it responds quickly because the core of the work is done in advance of the search.

More Information

For more information on ACAB see:


Antonio Gullì
gulli@di.unipi.it