We now revisit the problem of detecting malicious URLs, this time solving it with decision trees. We start by loading the data:
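# Python 3 location of urlparse (on Python 2: from urlparse import urlparse)
from urllib.parse import urlparse
import pandas as pd
# Load the URL dataset into a DataFrame
urls = pd.read_json("../data/urls.json")
print(urls.shape)
# Prepend a scheme so that every entry is a full URL
urls['string'] = "http://" + urls['string']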
(5000, 3)
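The text does not say why "http://" is prepended. One plausible reason is that urlparse only fills in the network location (the host) when the string carries a scheme or a leading "//". The following is a minimal sketch of that behaviour, assuming Python 3, where urlparse lives in urllib.parse. The bare value is the first URL from the dataset with the prefix removed, and the variable names are illustrative only.

```python
from urllib.parse import urlparse

# First URL of the dataset as it would look before the prefix is added
bare = "startbuyingstocks.com/"
prefixed = "http://" + bare

parsed_bare = urlparse(bare)
parsed_prefixed = urlparse(prefixed)

# Without a scheme, the whole string is treated as a path and the host is empty
print(repr(parsed_bare.netloc), repr(parsed_bare.path))
# '' 'startbuyingstocks.com/'

# With the scheme, the host and the path are separated as expected
print(repr(parsed_prefixed.netloc), repr(parsed_prefixed.path))
# 'startbuyingstocks.com' '/'
```

With the prefix in place, host-based features (such as the domain or its length) can be read directly from the netloc attribute of the parsed result.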
Next, we print the first ten rows of urls:
urls.head(10)
The output looks as follows:
  | pred         | string                        | truth |
0 | 1.574204e-05 | http://startbuyingstocks.com/ | 0 |
1 | 1.840909e-05 | http://qqcvk.com/             | 0 |
2 | 1.842080e-05 | http://432parkavenue.com/     | 0 |
3 | 7.954729e-07 | http://gamefoliant.ru/        | 0 |
4 | 3.239338e-06 | http://orka.cn/               | 0 |
5 | 3.043137e-04 | http://media2.mercola.com/    | 0 |
6 |
4.107331e-37 |
http://ping.chartbeat.net/ping?h=sltrib.com&p... |