AG's corpus of news articles

Welcome to AG's corpus of news articles.

AG is a collection of more than 1 million news articles. The articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004.
The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.

You are encouraged to download this corpus for any non-commercial use. If you download the dataset, please drop me a mail to describe the academic research you are doing: gulli AT di.unipi.it.
If you use the corpus, you are requested to cite this web page in your academic publication.

You are not authorized to change the corpus or to re-distribute (part of) it with a different name.

DB Table

+-------------+--------------+------+-----+-------------------+-------+
| Field       | Type         | Null | Key | Default           | Extra |
+-------------+--------------+------+-----+-------------------+-------+
| source      | varchar(32)  |      | PRI |                   |       |
| url         | varchar(255) |      | PRI |                   |       |
| title       | text         | YES  | MUL | NULL              |       |
| image       | varchar(255) | YES  |     | NULL              |       |
| category    | varchar(32)  |      | PRI |                   |       |
| description | text         | YES  |     | NULL              |       |
| rank        | int(11)      | YES  |     | NULL              |       |
| pubdate     | timestamp    | YES  |     | CURRENT_TIMESTAMP |       |
| video       | varchar(255) | YES  |     | NULL              |       |
+-------------+--------------+------+-----+-------------------+-------+
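
As a rough illustration of how the table above could be reproduced locally, here is a minimal sketch using Python's built-in sqlite3 module. Several things are assumptions made for this example and are not part of the corpus distribution: the table name "news", the mapping of the MySQL types to SQLite types, the reading of the three PRI columns as one composite primary key, the helper names open_corpus_db and count_by_category, and the sample row.

```python
import sqlite3

# Column names follow the table above. The SQLite types only approximate
# the MySQL types shown, and the table name "news" is an assumption.
SCHEMA = """
CREATE TABLE news (
    source      TEXT NOT NULL,
    url         TEXT NOT NULL,
    title       TEXT,
    image       TEXT,
    category    TEXT NOT NULL,
    description TEXT,
    "rank"      INTEGER,
    pubdate     TEXT DEFAULT CURRENT_TIMESTAMP,
    video       TEXT,
    -- The three columns flagged PRI are treated as one composite key.
    PRIMARY KEY (source, url, category)
)
"""


def open_corpus_db(path=":memory:"):
    """Create an SQLite database holding an empty copy of the table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def count_by_category(conn):
    """Return a dict mapping each category to its number of articles."""
    cursor = conn.execute(
        "SELECT category, COUNT(*) FROM news GROUP BY category"
    )
    return dict(cursor.fetchall())


if __name__ == "__main__":
    conn = open_corpus_db()
    # Hypothetical row, for illustration only; not taken from the corpus.
    conn.execute(
        "INSERT INTO news (source, url, title, category) VALUES (?, ?, ?, ?)",
        ("ExampleSource", "http://example.com/story", "Example title", "World"),
    )
    print(count_by_category(conn))
```

Grouping by category is shown because the category field is the natural label for the classification and clustering experiments mentioned above; any other column can be queried the same way.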

Publications using the corpus

G. M. Del Corso, A. Gulli, and F. Romani. Ranking a stream of news. In Proceedings of 14th International World Wide Web Conference, pages 97–106, Chiba, Japan, 2005.
A. Gulli. The anatomy of a news search engine. In Proceedings of 14th International World Wide Web Conference, pages 880–881, Chiba, Japan, 2005.

DISCLAIMER

THIS CORPUS IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. IN ADDITION THERE IS NO WARRANTY ABOUT THE ACCURACY OR THE COMPLETENESS OF THE INFORMATION, TEXT, GRAPHICS, LINKS, OR OTHER ITEMS CONTAINED WITHIN THESE MATERIALS. I AM NOT LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGE, INCLUDING WITHOUT LIMITATION, LOST REVENUES OR LOST PROFITS, WHICH MAY RESULT FROM THE USE OF THIS CORPUS. I DO NOT ENDORSE, RECOMMEND, OR FAVOR ANY REFERENCE TO A SPECIFIC COMMERCIAL PRODUCT OR SERVICE OR TRADEMARK. THE INFORMATION IN THIS CORPUS IS SUBJECT TO CHANGE WITHOUT NOTICE AND DOES NOT REPRESENT A COMMITMENT ON THE PART IN THE FUTURE.

The copyright of the news articles belongs to the original news sources.

Antonio Gulli