
 

Automatic Categorization of Web Content

Abstract: The World Wide Web will reach more than 1 billion pages in the year 2000. This size raises serious problems for both Web directories and search engines. Maintaining catalogues manually is increasingly difficult, because it is impossible to scale personnel to match the rate at which new Web sites appear; and when the index of a search engine is too large, users have difficulties in finding the information they are looking for. In this paper, we present ACAB, an Automatic Classifier and ABstracter of information contained in Web sites. It allows building a Web directory with minimal human intervention, limited to an initial training phase. ACAB was successfully used to create the Web guide available at the Arianna site. We also suggest that categorization could have a relevant impact on search engines.

Keywords: Information Retrieval, Automatic Text Categorization, Web Search

 
Towards Automated Categorization and Abstracting of Web Sites

Giuseppe Attardi
Dipartimento di Informatica, Università di Pisa, Italy
attardi@di.unipi.it

Antonio Gullí
I-Tech, Pisa, Italy
gulli@di.unipi.it

Domenico Dato
I-Tech, Pisa, Italy
dato@di.unipi.it

Cristina Tani
IOL, Pisa, Italy
c.tani@pisa.iol.it

1. Introduction

The World Wide Web will reach more than 1 billion pages by the year 2000 [1]. This size raises serious problems in accessing information. The most useful tools available today for interacting with Internet content are search engines (SE) and Web directories (WD). Although they are often confused with one another, they are significantly different.

A search engine [2] is built in a fully automated way: spiders navigate the net, following the links extracted from Web pages, and the information they acquire is used to build the SE index. A user consults the SE by giving keywords relevant to the particular search he/she wants to perform. Each SE uses its own techniques to find the documents most appropriate to the submitted query, and these techniques are often applications of well-known theories developed in the field of Information Retrieval (IR).

A Web directory [3] is built from reviews of Web material performed by a relatively large group of well-trained human technicians. They examine Web sites, submitted to the directory by their authors, and classify them according to the reviewers' personal tastes and opinions. We can identify at least the following steps in the reviewing process: manual navigation through the site; extraction of the significant information contained in the site (summed up in a few lines called an abstract or summary); categorization of the site within a pre-existing schema of categories, usually organized in a taxonomy.

Both SEs and WDs are starting to show some limitations due to the explosion of material on the Web.

According to SearchEngineWatch [4], Altavista, a SE with one of the largest indexes of the Internet, is about to give up its initial slogan "we index it all" in favor of "we index the best". When the index is too large, users have difficulties in finding the information they are looking for. This is true for a simple reason: even an experienced user has problems in identifying the relevant keywords for a given topic. Searching is more and more a frustrating process of choosing some keywords, navigating among the results and starting over if no relevant, or too much irrelevant, information is found. A recent survey shows that no more than three keywords are ever used and that users rarely look at items beyond the third in the list of results [5].

For WDs, problems appear not so much on the user side, but rather at the level of reviewers and submitters. As submissions increase, the guide must either add new reviewers to process listings, accept an increase in the turnaround time for reviews, or simply accept a smaller percentage of listings than in the past. Srinija Srinivasan (who oversees Yahoo!'s listing process) said: "the Web grows faster than we do, and we couldn't possibly scale personnel to match the rate at which new Web sites are coming along". The problem becomes harder if we consider that the content of a submitted site can evolve, so a categorization and a description given today may not be suitable tomorrow. Another point to note is that a human review of a Web site may not be systematic. As Sullivan [4] said: "In the end, the situation points out the delicate balancing act that Yahoo's editors face when they consider how to list sites. They have some general rules, but there is no classification system set in stone. Instead, the consensus of various Yahoo surfers acts as a living, ever changing system."

During a recent Search Engine Conference [6], several people predicted that future search tools would combine the best features of handcrafted catalogs with the benefits of Web search engines.

We have developed ACAB, a system that performs Automated Categorization and ABstracting of Web sites. It builds a Web directory with minimal human intervention, limited to an initial training phase. ACAB is being successfully used to create the Web guide for the Arianna site [Arianna] (the Arianna service is run by Italia On Line; it is both a large national search engine and a Web directory dedicated to the Italian Web space).

In this paper we present the techniques on which ACAB is based. These include typical search engine technology (such as spidering, indexing, ranking), techniques from the field of IR (such as automated categorization by content, query biased abstracting) and specific techniques for the Web environment (such as URLs clustering). Automated categorization, compared to manual categorization, allows:

  • savings of human resources;
  • more frequent updates;
  • dealing with large amounts of data;
  • discovery and categorization of new sites without human intervention;
  • re-categorization of known sites when their content changes;
  • re-categorization of known sites when the catalog taxonomy changes;

Automated categorization is also useful for Search Engines. In response to a query, a SE might report the most relevant categories that contain significant URLs, combining the available information retrieval and categorization capabilities. This helps users resolve the ambiguity that arises from insufficiently specific query terms.

The rest of this paper is organized as follows. We review related work in the next section. In section 3, we present the architecture of ACAB and the techniques it exploits; the section also illustrates the benefits of a unifying approach between SE and WD. Subsection 3.1 describes the methodology used to build the category descriptors that are used for categorization in 3.2. Subsection 3.3 discusses the need to identify "Web sites" among a set of pages, and 3.4 presents the query-biased techniques we adopt to summarize information from "Web sites". In section 4 we present the experimentation conducted to verify the validity of the method (i.e. the Web directory that was automatically built), and in subsection 4.1 we show our results in Categorization-based Retrieval applied to WDs. Section 5 presents our suggestion to use Categorization-based Retrieval for SEs. Section 6 analyzes the future directions we are planning to explore, and section 7 presents our conclusions.

2. Related Work

Several approaches have been developed recently for automated categorization, URL clustering, and abstracting. We discuss those most closely related to our work.

Northern Light [NLsearch] uses automated categorization to dynamically organize search results, creating Custom Search Folders™ of documents with similar subject, source or type. Each folder contains a more focused subset of the original results. Northern Light performs classification of documents off-line, extracting classification metadata. Such metadata is used to organize dynamically the results of queries into useful categories. The selection of categories is rule-based and draws on a 20,000-term subject hierarchy developed by expert librarians.

Inktomi [Inktomi] has developed search engine technology and a tool for automatic categorization. They claim that proper categorization can be achieved using 300-350 basis vectors and about 20,000 terms, extracted from 1 million or more documents [6].

Lycos [Lycos] condenses the Internet into Web Guides on various topics. They claim to use search technology to find the most relevant information on the Web. Moreover, Lycos uses relevance feedback from users and sorts each list of recommended Web sites accordingly. Currently, Lycos does not provide any summary for the listed sites.

Vistabar [7] is a desktop assistant for Web browsing. It includes a categorization tool that extracts a profile for each category contained in the Yahoo! hierarchy. In order to categorize new documents, a measure of similarity is defined, based on traditional vector space matching on the weighted word frequencies of a document relative to the corpus.

Amico [Amico] is a personalized newspaper for the Web. It maintains a database of users' profiles, each consisting of a list of keywords, and gathers information from selected Web sites and press agencies about news, weather and job offerings. It classifies the collected data and presents them to users according to their profiles. Many Internet sites offer similar services: they act on behalf of users in order to classify information originating from various sources. Among others we mention [LineOne], [HotWired], [MyYahoo].

Autonomy [Autonomy] has developed a knowledge management system that uses Adaptive Probabilistic Concept Modeling (APCM) to understand large documents. The concepts are represented by patterns of terms and contextual relationships that are most commonly recognized as important in documents. APCM does not depend on specific linguistic structures or semantics, but uses algorithms based on Bayesian probabilities and neural networks. APCM is exploited to extract users' profiles and to categorize documents.

Verity [Verity] has developed a Knowledge Organizer (KO), which aims at providing organizations with the ability to build their own catalog, with minimal human intervention. KO has a graphical taxonomy editor for building a category hierarchy and associating a query for each node that specifies a category [6].

Alexa [Alexa] has built a "surf engine", which learns and improves over time with the collective participation of its users. The system gathers information by crawling the Web and storing it in an archive. It then exploits both link analysis and surfers' usage patterns to determine which sites will be of most interest to individual users. Alexa has discovered about 20,000,000 unique "content areas". Recently Netscape has integrated Alexa technology inside its browser in the form of "smart browsing": by clicking on the "What's related" button, surfers receive a list of sites related to the one currently visited.

HITS [8] exploits the Web topology for compiling a list of authoritative Web resources on a topic. The algorithm consists of three phases. During the search and grow phase, a query related to the chosen topic is sent to a search engine and the retrieved documents form the root set. This set is then expanded with (1) any document pointed to by a document in the root set and (2) any document that points to a document in the root set. HITS considers two kinds of pages: Authority pages, which contain a lot of information about a topic, and Hub pages, which contain a lot of links about the topic. These pages exhibit a mutually reinforcing relationship: a good hub page points to many good authorities and a good authority is pointed to by many good hubs. Solving by iteration the system a = Zh; h = Wa allows identifying hub pages and authoritative pages (where W is a matrix that contains 1 in position (p,q) if page p, in the expanded root set, points to page q, Z is the transpose of W, and h is a vector initialized to all 1s). From a mathematical point of view, this is an application of the theory of eigenvectors, and the convergence of the algorithm is proven in [8].
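As an illustration only (this is not the code of [8], just a minimal Python sketch of the iteration described above), the mutual reinforcement can be computed as follows; the toy link matrix is invented for the example.

import numpy as np

def hits(W, iterations=50):
    """Return (authority, hub) scores for link matrix W, where W[p][q] = 1 if p points to q."""
    n = W.shape[0]
    h = np.ones(n)               # hub scores, initialized to all 1s
    a = np.ones(n)               # authority scores
    for _ in range(iterations):
        a = W.T @ h              # a = Z h: authorities are pointed to by good hubs
        h = W @ a                # h = W a: hubs point to good authorities
        a /= np.linalg.norm(a)   # normalize to keep the values bounded
        h /= np.linalg.norm(h)
    return a, h

# Toy example: page 0 links to pages 1 and 2; page 3 links to page 1.
W = np.array([[0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
authority, hub = hits(W)
print(authority, hub)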

CLEVER [9] is an extension of HITS, which exploits contextual information, in the form of the text surrounding the links between documents, in order to focus the algorithm on related pages. Instead of a W matrix of ones and zeros, a weight w(p,q) is computed from the number of matches between terms in a topic description and a window of 50 bytes of text around the hyperlink defined by the HTML tag "<A>". Tests of CLEVER have shown its capability to infer Web communities from link topology [10].

PageRank [11] is a method to rank each page on the Web, based on a weight-propagation algorithm that simulates a short random walk on the directed link graph of the Web. It is used by the SE Google [2].
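The paper does not spell out the PageRank formula, so the following is only an assumed sketch of the standard power iteration with the usual damping factor d of the random walk; the tiny link graph is hypothetical.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it points to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for p, outgoing in links.items():
            if not outgoing:                       # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += d * rank[p] / n
            else:
                for q in outgoing:
                    new_rank[q] += d * rank[p] / len(outgoing)
        rank = new_rank
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))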

Teseo [Teseo] is a tool for the automatic categorization of Web documents. It is based on the technique of categorization by context, which exploits the context perceivable from the structure of HTML documents to extract information useful for classifying the documents they refer to. It has been successfully used to reclassify existing Web directories according to a new hierarchical schema of categories, or to merge them [12].

Inquirus [13] is a meta search engine. It downloads the pages listed in the results of querying other search engines. It then displays, along with each entry, the local context in which the query terms appear, rather than an unrelated summary of the page. Displaying the context rather than a query-insensitive abstract of the document helps the user determine whether the document answers his or her query.

Tombros [14] performed a user survey to compare query-biased summarization techniques with static or AI-based abstracting, concluding that the former is preferable. A query-biased technique was also proposed by Helm [15] in his toolset for personal information management.

3. The ACAB Architecture

The task of building a catalog involves the following activities: (1) selecting or building a schema of hierarchical categories; (2) selecting the collection of documents to categorize; (3) identifying the most appropriate categories for each document; (4) clustering closely related URLs which represent a single site, and (5) extracting and summarizing the most significant information contained in each identified site.

ACAB is one of the first systems to exploit a tight integration between a search engine and an automated document classifier. This integration has a number of benefits from the point of view of implementation:

  • performance: since the search engine already retrieves and stores documents from the Web, the classifier has a large number of documents available locally.
  • indexing: the indexes built by the search engine can be directly inquired for information useful in building the catalog. In particular, it is fairly simple to obtain information about the number of URLs relevant to a certain category with respect to the total number of URLs in a given site. It is sufficient to perform a query on the indexes kept by the search engine. Similarly one can obtain useful information for ranking the relevance of pages, for clustering of URLs, and for determining Web communities based on link topologies.
  • history: since a search engine keeps a history of visited pages, the classifier can be triggered when new or updated pages are detected. This enables a more timely update of the catalog to reflect changes in the original sites.
Among other attempts at exploiting a search engine for classification, we mention the CLEVER system [9], which extracts some information by querying the Altavista search engine, in particular for building its root set. Lycos announced the use of an automatic classifier for building their new catalog, and they claim it is integrated with SE techniques.

The integration of a search engine with a Web directory service can also be quite beneficial for users. With respect to a typical search engine, users can search within restricted or better-specified topics, improving precision and reducing noise. With respect to manual Web directories, users gain access to a larger collection of documents, more up to date and with a more consistent classification.

    Moreover additional services can be envisaged, for instance:

    1. grouping by categories the results of a query (as in Northern Light [NLsearch]);
    2. asking within which categories a document appears;
    3. providing category-dependent abstracts for the same document;
4. filtering documents according to both a user profile and the categories to which the document belongs. For example, one may prevent showing documents with adult content to an inappropriate audience.

The ACAB architecture reflects the integration of an automated classifier with the Arianna search engine [Arianna], as illustrated in Figure 1. At the top appear some components of the Arianna search engine: the Spidering and Indexing subsystems. Arianna uses Fulcrum [Fulcrum] as an IR engine for indexing and retrieval. The remaining components perform the tasks mentioned earlier: a builder used to create a descriptor tree for categories; a classifier which interfaces to the index of the search engine; a module for clustering URLs and identifying "Web sites"; a module to produce the summary of documents. In the following paragraphs we analyze each subsystem in turn.

    Figure 1 Macro-architecture of ACAB: it uses search engines technology (such as spidering, indexing), techniques from IR field (such as automated categorization by content, query biased abstracting), and methods typical of the Web environment (such as URL clustering).

3.1 Creating Profiles for the Categories

A human editor "knows" how to categorize a site from his/her knowledge, understanding and past experience in Web browsing. A reviewer has expertise in organizing data in a hierarchical manner that derives from his/her personal interests, independently of the Web.

A category C can typically be represented by a descriptor CDC, consisting of a set of weighted keywords and sentences. Category descriptors can be built either manually, by expert librarians or linguists, as in Infoseek, Verity, Northern Light, etc. [9], or automatically, exploiting machine-learning, statistical [Vistabar] or neural network technologies [Autonomy].

In IR the common way to extract knowledge about categories is to perform a training phase on a large set of human pre-classified documents. Among such techniques, we mention the approach by Fuhr [16], which consists of the following steps. Terms are extracted from the pre-classified collection, where the documents are filtered against stop words and stemmed. Then a relevance descriptor is built for each term in the term space. It contains statistical information, such as term frequency (tf) and inverse document frequency (IDF), as well as HTML structure information. Documents are indexed by weighted terms exploiting the set of relevance descriptors. For each document a vector of weights is created by using a probabilistic technique known as "Darmstadt" indexing, an application of the Bayesian theory of probability. Darmstadt indexing proved very effective at the TREC [TREC] conference. The descriptor produced by this method typically consists of a vector of over 100,000 weights.

Sebastiani has developed a refinement phase, which reduces the size of the weight vectors associated with documents while controlling the degradation in precision. It is based on previous work by Yang [17]. Moreover, Sebastiani builds a single vector representing a category by means of the Rocchio [18] methodology (a standard IR technique).

In the present implementation of ACAB, we use a set of manually built category descriptions. These descriptions are validated through a graphical tool (the Category Builder), which helps in evaluating the adequacy of a description by computing a statistical measure (based on tf * IDF) for each term in the category description. Weights can be associated with each term in order to represent its significance as a characterizing term for the category. The Category Builder allows organizing the category descriptions in a tree data structure, where each node represents a category C and is annotated with some attributes (such as CDC and others we define in the rest of the paper). A node can inherit the descriptors of its parents, inducing a clear taxonomy in the catalog to be generated.
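A minimal sketch of the kind of tf * IDF measure just mentioned is shown below; it is illustrative only (not the Category Builder code), and it assumes that the collection is available as a list of token lists.

import math
from collections import Counter

def tf_idf_scores(description_terms, documents):
    """Score each term of a category description against a sample collection of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    term_freq = Counter()
    for doc in documents:
        tokens = set(doc)
        for term in description_terms:
            if term in tokens:
                doc_freq[term] += 1
        term_freq.update(t for t in doc if t in description_terms)
    scores = {}
    for term in description_terms:
        if doc_freq[term] == 0:
            scores[term] = 0.0                       # never occurs: a weak characterizing term
        else:
            idf = math.log(n_docs / doc_freq[term])
            scores[term] = term_freq[term] * idf
    return scores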

We plan to incorporate the techniques by Fuhr and Sebastiani into ACAB. We are also considering the use of Natural Language Processing (NLP) tools such as Part-of-Speech (POS) tagging to improve the quality of the terms extracted from documents. In particular we plan to use TreeTagger, a POS tagger developed within the TC project [19] and optimized by Attardi [20] for the EuroSearch project.

    3.2 Categorization Task

    Given a tree of category descriptors, there are two possible approaches to perform document categorization:

• Document-centered categorization: given a document, find the most appropriate categories it belongs to. Since the complexity is proportional to the number of documents, this approach is feasible only when this number is not too large. kNN [21] is a typical technique of this kind, and Amico is an example of a document-centered application.
• Category-centered categorization: given the category descriptors, search the database to find the documents that best satisfy each descriptor. This task is accomplished with a standard IR system, and so it is well suited for use with a Web SE. Another advantage of category-centered categorization is that the complexity is proportional to the number of categories, which is typically much smaller than the number of documents to categorize.

ACAB adopts category-centered categorization, achieved by performing a query directly on the Arianna index. Fulcrum, the IR engine used in Arianna, can be interrogated through a subset of SQL. It supports various retrieval models such as strict boolean, weighted boolean, fuzzy boolean and vector space [22]. The strict boolean retrieval model combines individual predicates using logical boolean operators. The weighted boolean model extends the strict boolean model by allowing a weight to be associated with each predicate. The fuzzy boolean model does not enforce the fulfillment of all AND operators. In the vector space model, documents and queries can be viewed as vectors and search as a matrix product, where the matrix represents the collection of documents. Fulcrum provides several ranking algorithms that can be used to order the results of a query.

In ACAB we currently adopt the weighted boolean model, since it offers very precise control over what is retrieved. For each category C, we build a query QC by transforming the category descriptor from a vector of weighted terms into a boolean clause: the weighted terms are combined using the OR logical operator. The weights used in queries are those determined when building the category descriptions.
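The actual Fulcrum SQL syntax is not reproduced here; the sketch below only illustrates, in Python, how a descriptor of weighted terms could be flattened into a weighted OR clause. The bracketed-weight notation and the sample Italian terms are made up for the example.

def build_category_query(descriptor):
    """descriptor: dict mapping term -> weight determined with the Category Builder."""
    clauses = [f"'{term}'[{weight}]" for term, weight in descriptor.items()]
    return " OR ".join(clauses)

print(build_category_query({"grafica": 5, "rendering": 3, "vrml": 4}))
# e.g. 'grafica'[5] OR 'rendering'[3] OR 'vrml'[4]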

A delicate issue is the ranking to apply when searching Web pages, since they exhibit quite varied content and may purposefully contain terms meant to deceive SEs. We must therefore prevent several unrelated occurrences of a term in a page from skewing the distribution of weights, reducing the influence of other relevant terms that may appear in the same page. It is necessary to balance the contribution of all terms in the descriptor. We performed extensive experimentation to figure out a satisfactory solution, and eventually settled on a ranking algorithm that maximizes the number of distinct terms matched in each document. Moreover, the ranking function increases the weight associated with a term, taking into account the HTML tags within which it is placed.

    Our experience shows that, in searches with over 50 boolean OR clauses, performance degrades significantly. For this reason, we are investigating ways to reduce the search space, for instance by extending the boolean clause with additional AND NOT expressions.

ACAB submits the query QC to the IR system, which returns a list of Web pages ordered by decreasing similarity to the category descriptor. A threshold TQC, defined for each category, is used to select how many top-ranked documents should be categorized under C. It can be chosen to reflect the percentage of documents classified in each category, as learned from the distribution in a training sample of pre-classified documents. Otherwise, it can be treated as a similarity measure: only documents whose rank exceeds TQC are similar enough to be included in the category. We are currently investigating the impact of these choices. The list of documents within the threshold TQC is called LC.
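The selection step can be summarized by the following sketch, where search(query) is a hypothetical stand-in for the call to the IR engine; it is not ACAB code, only an illustration of how TQC is applied per category.

def categorize(categories, search):
    """categories: dict mapping category name -> (query QC, threshold TQC)."""
    listing = {}
    for name, (query, threshold) in categories.items():
        ranked = search(query)                      # [(url, rank), ...] ordered best first
        listing[name] = [(url, rank) for url, rank in ranked if rank >= threshold]
    return listing                                  # listing[C] is the list LC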

    3.3 URLs clustering to identify relevant Web sites

In the traditional IR setting, one stops once a list LC of Web pages relevant to category C has been obtained. In a Web environment we should take into account that Web pages are clustered together by their authors in complex ways. Therefore ACAB analyzes the top-ranked documents within a category and tries to cluster them in order to identify relevant "Web sites".

For example, a natural way of clustering may be inferred from the fact that a group of Web pages is situated at the same Internet site. A group of Web documents DCH ⊆ LC, characterized by URLs with the same host prefix H, could increase their mutual relevance within a category. According to our experiments, one often obtains noise when categorizing a single Web page without considering the hypertextual context in which the page is placed.

Grouping by site is also an option available when querying some SEs.

Such clustering is often insufficient, since there exist Internet sites (like, for example, GeoCities) that host many pages developed by different authors. These pages could be thought of as "Web sites" for all intents and purposes. Therefore, we are exploring a more effective solution.

We construct the set ACH of hypertextual links that connect pairs of documents within DCH. This information is easily obtained through direct access to the search engine index. We define a directed graph GCH = (DCH, ACH), consisting of as many connected components as there are "Web sites" inside the host H relevant for the category C. One way to find these connected components is to perform a graph visit starting from each node in DCH. This leads to a partition of the list LC, where each part is a cluster λCi of documents grouped by means of their hypertextual relations. Among such clusters λCi we retain those whose cardinality is greater than a threshold value THC. In other words, we consider as "Web sites" only sets of pages clustered in groups of a certain size.
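A minimal sketch of this clustering step (illustrative only, not the ACAB implementation) is the following: documents in DCH are nodes, the intra-site links in ACH are edges, and only connected components at least as large as THC are kept as candidate "Web sites".

from collections import defaultdict

def web_site_clusters(dch, ach, thc):
    """dch: list of URLs; ach: list of (url, url) intra-site links; thc: minimum cluster size."""
    graph = defaultdict(set)
    for u, v in ach:
        graph[u].add(v)
        graph[v].add(u)              # link direction is irrelevant for connected components
    seen, clusters = set(), []
    for start in dch:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:                 # simple depth-first visit
            node = stack.pop()
            component.append(node)
            for neigh in graph[node]:
                if neigh not in seen:
                    seen.add(neigh)
                    stack.append(neigh)
        if len(component) >= thc:
            clusters.append(component)
    return clusters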

HITS [8] addresses a similar problem: it tries to infer large "Web communities" from link topology, while ACAB tries to identify a "Web site" from the pages on the same Internet site relevant to a category. HITS exploits inter-site hyperlinks and performs its task by solving a system of equations iteratively. Our strategy exploits only intra-site links and computes the connected components of a small graph GCH. In the future, we also plan to integrate some of the algorithms used by HITS and to perform an experimental comparison.

    3.4 Creating a summary for a Web page

If we sort the documents contained in each λCi by rank, we get a correspondingly ordered list γCi of documents relevant for each "Web site". Notice that the most significant documents for the cluster λCi appear at the top of γCi.

We need a method for summarizing a generic document d ∈ γCi. We adopt a query-biased methodology optimized for the Web environment. Query-biased document summarization has been intensively investigated in recent work by Tombros [14], and compared with static and AI abstracting techniques.

There have been attempts to produce coherent summaries by language generation and AI techniques, but they are capable of processing texts only within a narrow domain, where some characteristics are predictable. There is not yet enough evidence that such systems work on domain-independent text, such as the generic Web. Query-biased document summarization follows a simpler approach based on three steps [fig. 2]: (1) sentence extraction, (2) sentence manipulation, and (3) calculation of a "query score".

Sentence extraction from a Web page is performed according to criteria derived from punctuation analysis and HTML structure (authors separate sentences by using HTML tags such as <TITLE>, <H*>, <P>, <BR>, <HR>, <TH>, <TD>, <TR>). We wrote a parser that extracts sentences according to these criteria.
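A minimal illustration of this idea, using Python's standard html.parser rather than the parser we actually wrote, is the following: block-level tags and sentence-ending punctuation are both treated as sentence boundaries.

import re
from html.parser import HTMLParser

BOUNDARY_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6",
                 "p", "br", "hr", "th", "td", "tr"}

class SentenceExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sentences, self._buffer = [], []

    def _flush(self):
        text = " ".join(self._buffer).strip()
        if text:
            # split further on sentence-ending punctuation
            self.sentences.extend(s.strip() for s in re.split(r"[.!?]+\s+", text) if s.strip())
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BOUNDARY_TAGS:      # a block-level tag closes the current sentence
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        self._flush()
        super().close()

extractor = SentenceExtractor()
extractor.feed("<title>Computer Graphics</title><p>Ray tracing tutorial. VRML models.</p>")
extractor.close()
print(extractor.sentences)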

    Figure 2: Architecture of Query Biased Summarizer

In the sentence manipulation step, several abstracts are created. Each of these is obtained by joining together groups of consecutive sentences up to a desired length. Within a group, sentences that do not satisfy some heuristic criteria, such as the presence of enough pronouns, articles and verbs, are discarded. Currently we detect this by performing a match against a predefined static set of morphological elements. We plan to exploit more sophisticated criteria in order to improve the choice of meaningful sentences. For instance, we could drop sentences that are not syntactically well formed, e.g. sentences that are just lists of URLs or short phrases without verbs. A Part-of-Speech tagger will be helpful for this task.
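The following sketch illustrates this step; the function word list is a small placeholder, not the static set of morphological elements actually used, and the length bound is arbitrary.

FUNCTION_WORDS = {"il", "la", "le", "un", "una", "che", "e", "sono", "the", "a", "is"}

def looks_like_prose(sentence, min_hits=1):
    """Cheap heuristic: keep sentences containing at least a few function words."""
    words = sentence.lower().split()
    return sum(w in FUNCTION_WORDS for w in words) >= min_hits

def candidate_abstracts(sentences, max_len=200):
    """Join consecutive acceptable sentences into candidate abstracts of bounded length."""
    candidates, current = [], []
    for sentence in sentences:
        if not looks_like_prose(sentence):
            continue                              # drop bare lists of URLs, titles, etc.
        if current and sum(len(s) for s in current) + len(sentence) > max_len:
            candidates.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        candidates.append(" ".join(current))
    return candidates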

The next step is to rank the abstracts, for instance according to the weight of the terms contained in CDC, the category descriptor for category C. One expects, in fact, that the more descriptor terms are present in an abstract, the more likely it is to contain a significant amount of information related to category C. Similarly to what we did earlier in the categorization task, the weights of the terms appearing in the abstract are combined according to the following rules (a sketch follows the list):

    • The weight of each matched term is increased by a combination factor that keeps track of the nesting of HTML tags in which it appears, with each HTML tag having a different combination factor. Our parser keeps track of the nested HTML contexts, thereby exploiting the full HTML structure of documents.
• The weight of each term is progressively reduced when it occurs several times in an abstract. This mitigates the keyword spamming phenomenon [4] (people who repeatedly insert the same hidden keyword in a Web page in order to boost its search engine rank). We use a simple function that decreases the weights according to the number of occurrences (possibly within a limited text window). This has the effect that abstracts containing several distinct terms from CDC are ranked higher.
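As a sketch of these two rules (illustrative only: the tag factors and the damping schedule below are invented values, not the tuned ones), the score of an abstract could be computed as follows.

TAG_FACTOR = {"title": 2.0, "h1": 1.8, "h2": 1.5, "b": 1.2}

def score_abstract(terms_with_tags, descriptor):
    """terms_with_tags: [(term, [enclosing html tags]), ...]; descriptor: term -> weight in CDC."""
    seen = {}
    score = 0.0
    for term, tags in terms_with_tags:
        if term not in descriptor:
            continue
        boost = 1.0
        for tag in tags:                       # nested HTML contexts combine multiplicatively
            boost *= TAG_FACTOR.get(tag, 1.0)
        occurrence = seen.get(term, 0)
        damping = 1.0 / (1 + occurrence)       # each repetition of the same term counts less
        score += descriptor[term] * boost * damping
        seen[term] = occurrence + 1
    return score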

For a document d ∈ γCi we select the top-ranked abstract as its abstract (and its rank is the rank of the abstract).

Since we need a single abstract for each group of documents γCi, we should select the abstract with the highest rank. For performance reasons, we can exploit the ranking of documents within γCi produced by the classification step, and select the first abstract of a document in γCi that exceeds a predefined threshold TAC.

TAC can be thought of as a minimum level of quality for the abstracts. Sites whose abstract does not reach this threshold are currently discarded. This happens, for instance, when the terms contained in CDC are scattered throughout the original document. Query-biased summarization tends to give the highest rank to groups of sentences in which the terms of CDC appear within small windows of text. This behavior has a significant impact on ACAB's performance: we can avoid an expensive "proximity" search on the CDC terms in the categorization task, and obtain a similar result here.

Adopting a query-biased technique has another important consequence: since a document can be classified in more than one category, ACAB produces a different abstract for each category to which a document is assigned.

Categorization and abstracting place slightly different requirements on the category descriptors CDC. For the purpose of categorization, CDC should be kept as small as possible for performance reasons. On the other hand, abstracting is performed on a much smaller set of documents and quality is more important. Therefore we are considering an automatic expansion of the category descriptor, for instance by exploiting the hierarchies of concepts and the synset lists (i.e. groups of synonyms) provided by WordNet [23].
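One possible sketch of such an expansion, using the WordNet interface of the NLTK toolkit (our choice here, not a tool named in the paper; the WordNet corpus must be installed), is the following, where synonyms inherit a reduced weight.

from nltk.corpus import wordnet as wn

def expand_descriptor(descriptor, synonym_weight=0.5):
    """descriptor: term -> weight; add synset lemmas of each term with a reduced weight."""
    expanded = dict(descriptor)
    for term, weight in descriptor.items():
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                candidate = lemma.replace("_", " ").lower()
                if candidate not in expanded:
                    expanded[candidate] = weight * synonym_weight
    return expanded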

    3.4.1 Ranking and Mirror Web Sites

Mirror Web sites duplicate the same informative content in different places, in order to reduce the network distance to clients.

Our methodology associates a rank with each classified site. Moreover, we construct a query-biased abstract for each classified site. Mirror Web sites will have the same rank and the same abstract, and so they are easy to spot. ACAB is capable of noting this and signaling the presence of mirror sites in the listing produced for the catalog.

ACAB is more effective at identifying mirror sites than the traditional technique of comparing the MD5 signatures of whole Web pages. The latter method has problems when the mirror pages have dynamic content that modifies the HTML code (such as advertisement banners or counters). ACAB detects similarities in the abstracts, which are less subject to this variation.
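The contrast between the two strategies can be sketched as follows (illustrative only): hashing whole pages is fragile whenever a banner or counter changes, whereas grouping classified sites by rank and abstract flags probable mirrors.

import hashlib

def md5_signature(page_html):
    """Traditional approach: any change to the HTML (e.g. a banner) changes the signature."""
    return hashlib.md5(page_html.encode("utf-8")).hexdigest()

def mirrors_by_abstract(classified_sites):
    """classified_sites: [(url, rank, abstract), ...] for one category."""
    groups = {}
    for url, rank, abstract in classified_sites:
        key = (round(rank, 3), abstract.strip().lower())
        groups.setdefault(key, []).append(url)
    # groups with more than one URL are flagged as probable mirrors
    return [urls for urls in groups.values() if len(urls) > 1]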

    4 Prototype

    A prototype tool set based on the described categorization and summarization techniques has been built in order to assess their validity.

A visual Category Builder was developed, which enables users to define, insert, move or remove categories. It uses a tree data structure maintained in a database. The graphical interface is available via the Web. The tool allows annotating a node representing a category with its name, editing the terms and weights of the category descriptor, and modifying the various thresholds.

In order to make the catalog built by ACAB available, we developed a generator of HTML catalog pages. The generator builds a set of hypertext pages for each category, interconnected so as to reflect the catalog structure. Each hypertext page lists sites ordered by rank, together with related statistical information.

ACAB has been used in connection with the Arianna Search Engine, the largest national search engine for the Italian Web space, containing over 4 million Web pages. ACAB has been used to build a catalog for a significant portion of the whole Arianna search space, automatically generating five top-level categories of its Web Directory: "Computer and Internet", "Information and News", "Entertainment", "Travel and Tourism", and "Business and Finance". Each top-level category contains ten or more lower-level categories. [Fig. 3] illustrates one of the pages of the catalog built by ACAB.

    Arianna plans to apply automated categorization and abstracting to all categories of its directory, which so far have been maintained manually.

The results achieved are quite encouraging. The five top-level categories contain about 10,000 classified sites, and in most cases the prototype categorizes each document in the appropriate category. The few exceptions we encountered are caused either by poorly tuned thresholds or by poorly defined category descriptors.

For example, we are evaluating the CLEVER suggestion to limit the display of hub pages, because they may have little identity of their own to help the user decide whether or not to visit them. Hub pages are just pages of links to relevant information: sometimes they are personal bookmark pages, sometimes pages from other Web directories. Therefore, users could be perplexed to see them classified. For us, excluding them means setting an appropriate threshold after the URL clustering.

     

Figure 3. An example of the result of automated categorization. You can see the most relevant URLs within the category "Computer Graphics". In first position is a "Digital Art" site, followed by sites about VRML, ray tracing, rendering and so on. The language of the abstracts is Italian.

     

We also notice that users are satisfied with the system. Since it appeared online, the number of hits to the Web directory has increased consistently. The five top-level categories built by ACAB obtained more than 500,000 accesses in one month (a significant figure for the Italian Web space).

    4.1 Searching within categories

    The catalog is searchable in two ways:

    • within a category
    • in the whole directory

In the former case, users navigate the subject tree, which naturally limits the scope of the search, yielding higher precision.

Figure 4. Searching the whole directory for the keyword "notizie" (the Italian word for "news"). On the right side of the browser there are sites about news and newspapers, teletext, real-time news and so on. The folders on the left side represent the most appropriate categories for the search results: "press agency", "daily news", "newspaper", "search engine", etc. The numbers in parentheses show the number of elements in each folder.

The latter method allows the user to perform a term-based search over the whole guide. This is still different from a standard search on the Arianna SE, which searches through the whole Arianna space of several million individual pages. The result of a term-based query on the guide consists of:

    • a short list of the highest ranking URLs;
    • a folder for each of the most appropriate categories for grouping the results.

Grouping documents by category gives users suggestions about a number of semantic classes in which the submitted keywords can be interpreted, in a manner similar to the Custom Search Folders™ of Northern Light. For example, in [fig. 4] a user submitted the keyword "notizie" (the Italian word for "news"). On the right side of the browser appears the list of relevant URLs. On the left side are the folders representing the most appropriate categories for the given search terms, sorted by the number of URLs contained within. Moreover, ACAB provides some additional services, such as a category-dependent abstract for the same document, and finding within which categories a document appears (though the latter is experimental and not available online).

Note that the categorization scheme outperforms dynamic clustering based on the supplied search terms: it works quite rapidly because the core of the work is done in advance of the search.

    5. Category-based Retrieval applied to search engine

Categorization does not only concern Web directories but also search engines. The Arianna search engine plans to extend its database to retain the data produced by the categorization, URL clustering and abstracting processes.

    As we mentioned earlier, ACAB is one of the first systems to exploit a tight integration between a search engine and an automated document classifier. ACAB’s result is a categorization of the indexes of a search engine.

A first way to exploit this categorization is to build a Web Directory, clustering URLs and selecting, within each cluster, the URL with the most relevant abstract. Inside the WD, only this URL is shown for each cluster.

Another way is to retain all the data produced by categorization, clustering and abstracting, and exploit them when term-based queries are submitted to the search engine. Similarly to the search within the directory mentioned earlier, the SE might use these data to group results by category or to provide the other services we have envisaged above.

Our studies in this direction are quite encouraging, though still at a preliminary stage.

     

    6. Future Work

ACAB will evolve to integrate new functionality into its architecture. First, by using available natural language processing tools (such as POS tagging and WordNet). Then, we plan to perform a deeper study of the link information that can be derived from the Web pages indexed by the search engine. PageRank and CLEVER seem two very interesting approaches.

An interesting field to explore is that of relevance feedback provided by users. We are currently monitoring the items chosen among those categorized. We are investigating strategies to exploit this feedback in a refinement phase, to better assign weights to the terms contained in the category descriptors or to expand them. Moreover, we want to exploit this feedback to improve the ranking of the categorized documents, building a list of the most visited sites for each category. HotBot uses a similar technique for ranking pages (developed by Direct Hit [24]), as does the Lycos Web guide.

We are planning to build a testbed for user evaluation in terms of precision (measuring to what extent the listed sites are focused on the category's topic) and recall (measuring how broadly the listed sites cover the category's topic). These are well-known IR measures, but they should be treated very carefully in the Web environment, which is not a traditional closed system.

Another interesting field will be the exploitation of metadata information, as soon as it is widely and correctly [25] adopted on the Web.

    7. Conclusion

    We presented an approach to automatically categorize Web documents, capable of identifying relevant Web sites and extracting a summary of their content. With this method we built Arianna’s Web directory with minimal human intervention, limited to the initial training phase. The preliminary results of our prototype as well as the feedback from users are quite encouraging.

    We believe that automated categorization will be a necessary step for the World Wide Web in the next years.

    8. Acknowledgements

We thank Antonio Converti, who provided valuable indications for the research activities we carried out. We also thank the Italia On Line staff (in particular Roberto Bianchettin, Luigi Madella, Alberto Vaccari and Marisa Farusi) and the Web Agent Group of the Department of Computer Science, Pisa. Our acknowledgement goes to all the people at I-Tech. Last but not least, we thank Dr. Paola Padella for her critical review of this paper.

Our work was partially supported by "EuroSearch" [Eurosearch], an EEC-funded Telematics project.

    9. References

    [Alexa]http://www.alexa.com

    [Amico]http://amico.arianna.com

    [Arianna]http://www.arianna.com

    [Autonomy]http://www.autonomy.com

[Eurosearch]http://eurosearch.iol.it

    [Fulcrum]http://www.pcdocs.com

    [HotWired]http://www.hotwired.com

    [Inktomi]http://www.inktomi.com

    [LineOne]http://www.lineone.com

    [Lycos]http://www.lycos.com

    [MyYahoo]http://my.yahoo.com

    [NLsearch]http://www.nlsearch.com

    [Teseo]http://medialab.di.unipi.it

[TREC]http://trec.nist.gov

    [Verity]http://www.verity.com

    [Yahoo]http://www.yahoo.com

    [1] "Internet Donates Archive of the World Wide Web to Library of Congress", `http://www.alexa.com/company/inthenews/loc.html

    [2] "The Anatomy of a Large-Scale Hypertextual Web Search Engine" S. Brin and L. Page. Computer Network and ISDN Systems, 30(1998) 107-117 http://www7.dstc.edu.au

    [3] "Yahoo Special Report" http://www.searchenginewatch.com/webmasters/

[4] SearchEngineWatch. Danny Sullivan, Editor. Mecklermedia. http://www.searchenginewatch.com

    [5] "Searchers, the subjects they search, and sufficiency: a study of a large sample of Excite searches", B. J. Jansen, A. Spink, J, Bateman, WebNet ’98, 1998.

    [6]"Pleteaus, Peaks and Promise: The Infonortics 98 Search Engine Conference", http://www.infotoday.com/searcher/jun/story4.html

[7] "Supporting cooperative and personal surfing with a desktop assistant", H. Marais and K. Bharat, Proceedings of ACM UIST '97, ACM, 1997, 129-138.

    [8] "Authoritative Sources in a Hyperlinked Environment" J. K. Kleinberg, Proc. 9th ACM-SIAM Symposium on Discrete Algoritms, 1998. Also appear as IBM Seach report RJ 10076, May 1997

    http://simon.cs.cornell.edu/home/kleinber/kleinber.html

    [9] "Automatic resource compilation by analyzing hyperlink structure and associated text" S. Chakrabartia, B. Doma, P. Raghavana, S. Rajagopalana, D. Gibsonb, and J. Kleinbergc, Computer Network and ISDN Systems, 30(1998) http://www7.dstc.edu.au/ http://almaden.ibm.com/cs/k53/clever.html

    [10] "Inferring Web Comunities from Link Topology" D.Gibson, J.Kleinberg and P. Raghavana Proc 9thACM-SIAM on Hypertext and Hypermedia, 1998.

    [11] "PageRank: Bringing order to World Wide Web: An exploratory analysis of the intellectual structure of cyperspace",L. Page Ann. Meeting of American Soc. Info. Sci., 1996

    [12] "Categoritazion by Context" G. Attardi, S. Di Marco and D. Salvi in J.UCS Vol 9/4.

    "Theseus: Categorization by Context" G.Attardi, A. Gullí, F.Sebastiani, submitted for publication.

[13] "Inquirus, the NECI meta search engine", S. Lawrence and C. L. Giles, Computer Networks and ISDN Systems, 30 (1998). http://www7.dstc.edu.au

    [14] "Advantages of Query Biased Summaries in Information Retrieval" A. Tombros and M. Sanderson, In the Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2-10, 1998. http://www.dcs.gla.ac.uk/~tombrosa

    [15] "A Toolset for Personal Information Management", D.J. Helm and R. J. D'Amore, In the Proceedings of WebNet 97 World Conferences of the WWW, Internet & Intranet, pp. 223-228. http://www.aace.org/conf/Webnet/Default.htm

    [16] "A probabilistic learning approach for document indexing" N. Fuhr and C. Buckely ACM Transaction on Information Systems, 9(3), 233-248, 1991. http://amaunet.cs.uni-dortmund.de/ir/

    [17] "A comparative study on feature selection in text categorization" Y.Yang and J.Pederson, fourteen international conference on machine learning, 4124-420.

    [18] "Relevance feedback in information retrieval" J.J Rocchio. The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice Hall, Englewood, Cliffs, N. J. 1971

    [19] "TreeTagger - a language independent part-of-speech tagger", http://www.ims.uni-stuttgart.de/Tools/DecisionTreeTagger.html

    [20] The Web Agents Group, Part of Speech Tagger http://medialab.di.unipi.it/Resource/POS

    [21] "An example-based mapping method for text categorization and retrieval", Yiming Yang and Crhistopher G. Chute. In ACM Transaction on Information Systems, 12, 252-227.

    [22] "Web Search Service in 1998: Trends and Challenges", http://www.infotoday.com/searcher/jun/story2.htm

    [23] "A Lexical database for English" G.A. Miller Communication of the ACM, 38(11), 1995, 39–41.

    [24] "Counting Clinks and Looking at Links" D. Sullivan The Search Engine Report. http://searchenginewatch.com/sereports/9808-clicks.html http://www.directhits.com

    [25] "The limits of Web Metadata, and beyond" M. Marchiori Computer Network and ISDN Systems, 30, 1998. http://www7.dstc.edu.au

       

     

