Data loading and pre-processing
Several summarization datasets are available for training. They can be loaded through the TensorFlow Datasets (tfds) package, which we also used in previous chapters. The datasets differ in length and style. The CNN/DailyMail dataset is one of the most commonly used. It was published in 2015 and contains approximately 300,000 news articles. The CNN articles were collected starting in 2007 and the Daily Mail articles starting in 2010, both up to 2015. The summaries are usually multi-sentence. The Newsroom dataset, available from https://summari.es, contains over 1.3 million news articles from 38 publications. However, this dataset requires registration to download, which is why it is not used in this book. The wikiHow dataset contains how-to articles from the wikiHow website and the summary sentences for those articles. The LCSTS dataset contains Chinese language data collected from Sina...