AG's corpus of news articles
Welcome to AG's corpus of news articles.
AG is a collection of more than 1 million news articles. News articles
have been gathered from more than 2000 news sources by ComeToMyHead in
more than 1 year of activity. ComeToMyHead is an academic news search
engine which has been running since July, 2004.
The dataset is provided by the academic community for research purposes
in data mining (clustering, classification, etc.), information retrieval
(ranking, search, etc.), XML, data compression, data streaming, and any
other non-commercial activity.
You are encouraged to download this corpus for any non-commercial use. If you
download the dataset, please drop me a mail to describe the academic
research you are doing: gulli AT di.unipi.it.
If you use the corpus, you are requested to cite this web page in your
academic publication.
You are not authorized to change the corpus or to re-distribute (part of)
it with a different name.
DB Table
+-------------+--------------+------+-----+-------------------+-------+
| Field       | Type         | Null | Key | Default           | Extra |
+-------------+--------------+------+-----+-------------------+-------+
| source      | varchar(32)  |      | PRI |                   |       |
| url         | varchar(255) |      | PRI |                   |       |
| title       | text         | YES  | MUL | NULL              |       |
| image       | varchar(255) | YES  |     | NULL              |       |
| category    | varchar(32)  |      | PRI |                   |       |
| description | text         | YES  |     | NULL              |       |
| rank        | int(11)      | YES  |     | NULL              |       |
| pubdate     | timestamp    | YES  |     | CURRENT_TIMESTAMP |       |
| video       | varchar(255) | YES  |     | NULL              |       |
+-------------+--------------+------+-----+-------------------+-------+
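For readers who want to experiment with the table layout above, here is a minimal sketch in Python that mirrors it in an in-memory SQLite database. It is not part of the corpus distribution: the table name `news`, the index name, the helper functions and the sample rows are all hypothetical, and the SQLite column types only approximate the MySQL ones listed. The three columns marked PRI are assumed to form one composite primary key, and the MUL flag on `title` is modelled as a plain non-unique index.

```python
import sqlite3

# Assumption: table name "news" and index name are made up for illustration.
# SQLite types (TEXT, INTEGER) only approximate the MySQL varchar/int types.
SCHEMA = """
CREATE TABLE news (
    source      TEXT NOT NULL,
    url         TEXT NOT NULL,
    title       TEXT,
    image       TEXT,
    category    TEXT NOT NULL,
    description TEXT,
    "rank"      INTEGER,
    pubdate     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    video       TEXT,
    PRIMARY KEY (source, url, category)  -- the three PRI columns
);
CREATE INDEX idx_news_title ON news (title);  -- the MUL flag on title
"""


def open_corpus_db(path=":memory:"):
    """Create an empty database with the sketched schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def count_by_category(conn):
    """Return a {category: number of articles} dictionary."""
    rows = conn.execute(
        "SELECT category, COUNT(*) FROM news GROUP BY category"
    )
    return dict(rows.fetchall())


# Made-up sample rows, not taken from the corpus.
SAMPLE_ROWS = [
    ("Example Wire", "http://example.com/a", "Sample headline A", "World"),
    ("Example Wire", "http://example.com/b", "Sample headline B", "World"),
    ("Example Daily", "http://example.com/c", "Sample headline C", "Sports"),
]

conn = open_corpus_db()
conn.executemany(
    "INSERT INTO news (source, url, title, category) VALUES (?, ?, ?, ?)",
    SAMPLE_ROWS,
)
conn.commit()
print(count_by_category(conn))
```

Because of the composite key, inserting the same (source, url, category) triple twice raises `sqlite3.IntegrityError`, while the same URL may appear under a different category.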
Publications using the corpus
- G. M. Del Corso, A. Gulli, and F. Romani. Ranking a stream of news.
  In Proceedings of 14th International World Wide Web Conference,
  pages 97–106, Chiba, Japan, 2005.
- A. Gulli. The anatomy of a news search engine. In Proceedings of
  14th International World Wide Web Conference, pages 880–881, Chiba,
  Japan, 2005.
DISCLAIMER
THIS CORPUS IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT.
IN ADDITION THERE IS NO WARRANTY
ABOUT THE ACCURACY OR THE COMPLETENESS OF THE INFORMATION, TEXT,
GRAPHICS, LINKS, OR OTHER ITEMS CONTAINED WITHIN THESE MATERIALS. I AM
NOT LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL
DAMAGE, INCLUDING WITHOUT LIMITATION, LOST REVENUES OR LOST PROFITS,
WHICH MAY RESULT FROM THE USE OF THIS CORPUS. I DO NOT ENDORSE,
RECOMMEND, OR FAVOR ANY REFERENCE TO A SPECIFIC COMMERCIAL PRODUCT OR
SERVICE OR TRADEMARK. THE INFORMATION IN THIS CORPUS IS SUBJECT TO
CHANGE WITHOUT NOTICE AND DOES NOT REPRESENT A COMMITMENT ON MY PART
IN THE FUTURE.
The copyright of the news articles belongs to the original news
sources.
Antonio Gulli