Hands-On Data Science and Python Machine Learning: Perform data mining and machine learning efficiently using Python and Spark

Product type Paperback

Published in Jul 2017

Publisher Packt

ISBN-13 9781787280748

Length 420 pages

Edition 1st Edition

Languages

Python

Tools

NumPy

Concepts

Data Mining

Author (1):

Frank Kane

Table of Contents (11) Chapters

Preface

1. Getting Started

2. Statistics and Probability Refresher, and Python Practice

3. Matplotlib and Advanced Probability Concepts

4. Predictive Models

5. Machine Learning with Python

6. Recommender Systems

7. More Data Mining and Machine Learning Techniques

8. Dealing with Real-World Data

9. Apache Spark - Machine Learning on Big Data

10. Testing and Experimental Design

Implementing a spam classifier with Naïve Bayes

Let's write a spam classifier using Naïve Bayes. You may be surprised by how easy this is. In fact, most of the work is reading in and parsing the input data that we're going to train on. The spam classification itself, the machine learning part, takes just a few lines of code. That's usually how it works out: reading in, massaging, and cleaning up your data is most of the work when you're doing data science, so get used to the idea!

import os 
import io 
import numpy 
from pandas import DataFrame 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.naive_bayes import MultinomialNB 
 
def readFiles(path): 
    for root, dirnames, filenames in os.walk(path): 
        for filename in filenames: 
            path = os...
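
To give a rough sense of how short the classification step can be, here is a minimal sketch. It is not the book's own listing: it assumes a tiny hand-made DataFrame in place of the input data read from disk, and the example messages and the `message` and `class` column names are hypothetical.

```python
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the messages parsed from disk.
data = DataFrame({
    "message": [
        "Free money now, click here",
        "Win a free prize today",
        "Meeting rescheduled to Monday",
        "Lunch tomorrow with the team?",
    ],
    "class": ["spam", "spam", "ham", "ham"],
})

# Turn each message into a vector of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data["message"].values)

# Fit a multinomial Naive Bayes model on the counts and their labels.
classifier = MultinomialNB()
classifier.fit(counts, data["class"].values)

# Classify new messages using the same vocabulary learned above.
examples = ["Free prize money", "Team meeting on Monday"]
predictions = classifier.predict(vectorizer.transform(examples))
print(predictions)
```

Note that new messages go through `transform`, not `fit_transform`, so they are encoded with the vocabulary learned from the training data; words the vectorizer has never seen are simply ignored.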

Authors (1)

Frank Kane

Frank Kane spent nine years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers. He holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology and teaches others about big data analysis.
