AG's corpus of news articles

Welcome to AG's corpus of news articles.
Antonio Gulli

AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004.
The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.

You are encouraged to download this corpus for any non-commercial use. If you download the dataset, please drop me a mail to describe the academic research you are doing: gulli AT di.unipi.it.
If you use the corpus, you are requested to cite this web page in your academic publication.

You are not authorized to change the corpus or to re-distribute (part of) it with a different name.

DB version of the AG news articles corpus
XML version of the AG news articles corpus (thanks to Paolo Ferragina)

DB Table

+-------------+--------------+------+-----+-------------------+-------+
| Field       | Type         | Null | Key | Default           | Extra |
+-------------+--------------+------+-----+-------------------+-------+
| source      | varchar(32)  |      | PRI |                   |       |
| url         | varchar(255) |      | PRI |                   |       |
| title       | text         | YES  | MUL | NULL              |       |
| image       | varchar(255) | YES  |     | NULL              |       |
| category    | varchar(32)  |      | PRI |                   |       |
| description | text         | YES  |     | NULL              |       |
| rank        | int(11)      | YES  |     | NULL              |       |
| pubdate     | timestamp    | YES  |     | CURRENT_TIMESTAMP |       |
| video       | varchar(255) | YES  |     | NULL              |       |
+-------------+--------------+------+-----+-------------------+-------+
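
As a rough illustration of how the table layout above might be used, the following Python sketch recreates it in an in-memory SQLite database and counts articles per category. This is only a sketch under stated assumptions: the table name `news`, the use of SQLite in place of the original database engine, the helper names, and the sample rows are all hypothetical and not part of the corpus. Only the column names, nullability, and the composite primary key (source, url, category) are taken from the table.

```python
import sqlite3

# Assumption: SQLite stands in for the original database engine, and the
# table name "news" is hypothetical. Columns follow the DB table above.
SCHEMA = """
CREATE TABLE news (
    source      VARCHAR(32)  NOT NULL,
    url         VARCHAR(255) NOT NULL,
    title       TEXT,
    image       VARCHAR(255),
    category    VARCHAR(32)  NOT NULL,
    description TEXT,
    "rank"      INTEGER,
    pubdate     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    video       VARCHAR(255),
    PRIMARY KEY (source, url, category)
)
"""

# Made-up rows for illustration only; they are not taken from the corpus.
SAMPLE_ROWS = [
    ("example-source", "http://example.org/a", "Made-up title A", "World", "Made-up text A"),
    ("example-source", "http://example.org/b", "Made-up title B", "World", "Made-up text B"),
    ("other-source", "http://example.org/c", "Made-up title C", "Sports", "Made-up text C"),
]


def build_demo_db():
    """Create an in-memory database with the schema and the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO news (source, url, title, category, description) "
        "VALUES (?, ?, ?, ?, ?)",
        SAMPLE_ROWS,
    )
    conn.commit()
    return conn


def count_by_category(conn):
    """Return a dict mapping each category to its number of articles."""
    cur = conn.execute(
        "SELECT category, COUNT(*) FROM news GROUP BY category ORDER BY category"
    )
    return dict(cur.fetchall())


if __name__ == "__main__":
    demo = build_demo_db()
    print(count_by_category(demo))
```

Because the primary key spans source, url, and category, inserting a second row with the same three values is rejected, while the same URL may still appear under a different category.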

Publications using the corpus

— G. M. Del Corso, A. Gulli, and F. Romani. Ranking a stream of news. In Proceedings of 14th International World Wide Web Conference, pages 97–106, Chiba, Japan, 2005.
— A. Gulli. The anatomy of a news search engine. In Proceedings of 14th International World Wide Web Conference, pages 880–881, Chiba, Japan, 2005.

DISCLAIMER

THIS CORPUS IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. IN ADDITION THERE IS NO WARRANTY ABOUT THE ACCURACY OR THE COMPLETENESS OF THE INFORMATION, TEXT, GRAPHICS, LINKS, OR OTHER ITEMS CONTAINED WITHIN THESE MATERIALS. I AM NOT LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGE, INCLUDING WITHOUT LIMITATION, LOST REVENUES OR LOST PROFITS, WHICH MAY RESULT FROM THE USE OF THIS CORPUS. I DO NOT ENDORSE, RECOMMEND, OR FAVOR ANY REFERENCE TO A SPECIFIC COMMERCIAL PRODUCT OR SERVICE OR TRADEMARK. THE INFORMATION IN THIS CORPUS IS SUBJECT TO CHANGE WITHOUT NOTICE AND DOES NOT REPRESENT A COMMITMENT ON THE PART IN THE FUTURE.

The copyright of the news articles belongs to the original news sources.

Antonio Gulli
