Decision trees with scikit-learn
Let's use decision trees to create software that can block banner ads on web pages. This program will predict whether each of the images on a web page is an advertisement or article content. Images that are classified as advertisements could then be removed from the page. We will train a decision tree classifier using the Internet Advertisements dataset from http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements, which contains data for 3,279 images. The classes are imbalanced; 459 of the images are advertisements and 2,820 are content. Decision tree learning algorithms can produce biased trees from data with imbalanced class proportions; we will evaluate a model on the unaltered dataset before deciding whether it is worth balancing the training data by over- or under-sampling instances. The explanatory variables are the dimensions of the image, words from the containing page's URL, words from the image's URL, the image's...
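As a rough illustration of evaluating a tree on imbalanced data before deciding whether to resample, the following sketch trains a classifier on a synthetic stand-in rather than the real Internet Advertisements file. The synthetic data, the feature counts, the class names passed to the report, and the choice of the entropy criterion are all assumptions made for the example; only the instance count and the class proportions are taken from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (an assumption, not the real dataset): 3,279 instances
# with class weights that mirror 2,820 content images and 459 advertisements.
X, y = make_classification(
    n_samples=3279,
    n_features=20,
    n_informative=5,
    weights=[2820 / 3279, 459 / 3279],
    random_state=0,
)

# Stratify so the training and test sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# The entropy criterion is an illustrative choice, not one stated in the text.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy alone can hide poor performance on the minority class,
# so report precision and recall for each class.
print(classification_report(y_test, y_pred, target_names=["content", "ad"]))
```

If recall for the minority class turns out to be much lower than for the majority class, that would be one signal that balancing the training data is worth trying.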