Visualization and preparation in pandas
As we saw in Chapter 2, Applying Machine Learning to Structured Data, it's usually a good idea to get an overview of the data before we start training. We can do this for the data we obtained from Kaggle by running the following:
import pandas as pd
train = pd.read_csv('../input/train_1.csv').fillna(0)
train.head()
Running this code will give us the following table:
| | Page | 2015-07-01 | 2015-07-02 | … | 2016-12-31 |
|---|---|---|---|---|---|
| 0 | 2NE1_zh.wikipedia.org_all-access_spider | 18.0 | 11.0 | … | 20.0 |
| 1 | 2PM_zh.wikipedia.org_all-access_spider | 11.0 | 14.0 | … | 20.0 |
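To see what the `.fillna(0)` call in the loading step does, here is a minimal sketch on a made-up two-row CSV held in memory. The CSV text, including the deliberately missing cell, is hypothetical and is not taken from the Kaggle file. It also hints at why the counts above print as `18.0` rather than `18`: pandas stores a numeric column that contains a missing value as floating point.

```python
import io
import pandas as pd

# Hypothetical CSV in the same layout as the table above.
# The last cell of the second row is left empty on purpose.
csv_text = (
    "Page,2015-07-01,2015-07-02\n"
    "2NE1_zh.wikipedia.org_all-access_spider,18,11\n"
    "2PM_zh.wikipedia.org_all-access_spider,11,\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
filled = raw.fillna(0)  # replace every missing cell with 0

print(raw.isna().sum().sum())     # count of missing cells before filling
print(filled.isna().sum().sum())  # count of missing cells after filling
print(filled.dtypes)              # the column that had a gap is float64
```

Filling with zero treats a missing measurement as "no traffic", which is a modeling choice rather than a neutral default.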
The data in the Page column contains the name of the page, the language of the Wikipedia page, the type of accessing device, and the accessing agent. The other columns contain the traffic for that page on that date.
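The components packed into a `Page` value can be pulled apart with plain string splitting. The following is a minimal sketch, assuming the layout suggested by the two sample rows: fields joined by underscores, with the site, access type, and agent as the last three fields. The helper `parse_page` and the output column names are illustrative and not part of the original code.

```python
import pandas as pd

# The two sample values shown in the table above.
pages = pd.Series([
    '2NE1_zh.wikipedia.org_all-access_spider',
    '2PM_zh.wikipedia.org_all-access_spider',
])

def parse_page(page):
    """Split a Page string into (name, language, access, agent)."""
    # Split from the right so that underscores inside the
    # article name itself are left untouched.
    name, site, access, agent = page.rsplit('_', 3)
    # Assumption: the language code is the first dot-separated
    # part of the site, e.g. 'zh' in 'zh.wikipedia.org'.
    language = site.split('.')[0]
    return name, language, access, agent

parsed = pd.DataFrame(
    [parse_page(p) for p in pages],
    columns=['name', 'language', 'access', 'agent'],
)
print(parsed)
```

Splitting from the right rather than the left matters because article titles may themselves contain underscores, while the trailing three fields have a fixed position.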
So, in the preceding table, the first row contains the page of 2NE1, a Korean pop band, on the Chinese version of Wikipedia, by all methods of access, but only for agents classified as spider traffic; that is, traffic not...