We will revisit the problem of detecting malicious URLs and solve it with decision trees. We start by loading the data:
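from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
import pandas as pd

# Load the URL dataset into a DataFrame
urls = pd.read_json("../data/urls.json")
print(urls.shape)
# Prepend a scheme so that urlparse can separate the host from the path
urls['string'] = "http://" + urls['string']
This prints the shape of the DataFrame:
(5000, 3)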
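The reason for prepending "http://" can be seen with a small sketch. It is written for Python 3, where urlparse lives in urllib.parse, and the URL example.com/login.php is made up for illustration; it is not taken from the dataset.

```python
from urllib.parse import urlparse

# Hypothetical URL, used only to illustrate the parsing behavior
raw = "example.com/login.php"

# Without a scheme, urlparse treats the whole string as a path
bare = urlparse(raw)
print(repr(bare.netloc), repr(bare.path))

# With a scheme, the host and the path are separated
full = urlparse("http://" + raw)
print(repr(full.netloc), repr(full.path))
```

With the scheme in place, the host ends up in netloc and the rest in path, so any feature we later derive from the parsed components sees the parts we expect.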
To inspect the data, we print the first 10 rows of urls:
urls.head(10)
The output looks as follows:
  | pred         | string | truth
0 | 1.574204e-05 |        | 0
1 | 1.840909e-05 |        | 0
2 | 1.842080e-05 |        | 0
3 | 7.954729e-07 |        | 0
4 | 3.239338e-06 |        | 0
5 | 3.043137e-04 |        | 0
6 | 4.107331e-37 |        |