Table of Contents
Python: Real-World Data Science
Meet Your Course Guide
What's so cool about Data Science?
Course Structure
Course Journey
The Course Roadmap and Timeline
1. Course Module 1: Python Fundamentals
1. Introduction and First Steps – Take a Deep Breath
A proper introduction
Enter the Python
About Python
Portability
Coherence
Developer productivity
An extensive library
Software quality
Software integration
Satisfaction and enjoyment
What are the drawbacks?
Who is using Python today?
Setting up the environment
Python 2 versus Python 3 – the great debate
What you need for this course
Installing Python
Installing IPython
Installing additional packages
How you can run a Python program
Running Python scripts
Running the Python interactive shell
Running Python as a service
Running Python as a GUI application
How is Python code organized
How do we use modules and packages
Python's execution model
Names and namespaces
Scopes
Guidelines on how to write good code
The Python culture
A note on the IDEs
2. Object-oriented Design
Introducing object-oriented
Objects and classes
Specifying attributes and behaviors
Data describes objects
Behaviors are actions
Hiding details and creating the public interface
Composition
Inheritance
Inheritance provides abstraction
Multiple inheritance
Case study
3. Objects in Python
Creating Python classes
Adding attributes
Making it do something
Talking to yourself
More arguments
Initializing the object
Explaining yourself
Modules and packages
Organizing the modules
Absolute imports
Relative imports
Organizing module contents
Who can access my data?
Third-party libraries
Case study
4. When Objects Are Alike
Basic inheritance
Extending built-ins
Overriding and super
Multiple inheritance
The diamond problem
Different sets of arguments
Polymorphism
Abstract base classes
Using an abstract base class
Creating an abstract base class
Demystifying the magic
Case study
5. Expecting the Unexpected
Raising exceptions
Raising an exception
The effects of an exception
Handling exceptions
The exception hierarchy
Defining our own exceptions
Case study
6. When to Use Object-oriented Programming
Treat objects as objects
Adding behavior to class data with properties
Properties in detail
Decorators – another way to create properties
Deciding when to use properties
Manager objects
Removing duplicate code
In practice
Case study
7. Python Data Structures
Empty objects
Tuples and named tuples
Named tuples
Dictionaries
Dictionary use cases
Using defaultdict
Counter
Lists
Sorting lists
Sets
Extending built-ins
Queues
FIFO queues
LIFO queues
Priority queues
Case study
8. Python Object-oriented Shortcuts
Python built-in functions
The len() function
Reversed
Enumerate
File I/O
Placing it in context
An alternative to method overloading
Default arguments
Variable argument lists
Unpacking arguments
Functions are objects too
Using functions as attributes
Callable objects
Case study
9. Strings and Serialization
Strings
String manipulation
String formatting
Escaping braces
Keyword arguments
Container lookups
Object lookups
Making it look right
Strings are Unicode
Converting bytes to text
Converting text to bytes
Mutable byte strings
Regular expressions
Matching patterns
Matching a selection of characters
Escaping characters
Matching multiple characters
Grouping patterns together
Getting information from regular expressions
Making repeated regular expressions efficient
Serializing objects
Customizing pickles
Serializing web objects
Case study
10. The Iterator Pattern
Design patterns in brief
Iterators
The iterator protocol
Comprehensions
List comprehensions
Set and dictionary comprehensions
Generator expressions
Generators
Yield items from another iterable
Coroutines
Back to log parsing
Closing coroutines and throwing exceptions
The relationship between coroutines, generators, and functions
Case study
11. Python Design Patterns I
The decorator pattern
A decorator example
Decorators in Python
The observer pattern
An observer example
The strategy pattern
A strategy example
Strategy in Python
The state pattern
A state example
State versus strategy
State transition as coroutines
The singleton pattern
Singleton implementation
The template pattern
A template example
12. Python Design Patterns II
The adapter pattern
The facade pattern
The flyweight pattern
The command pattern
The abstract factory pattern
The composite pattern
13. Testing Object-oriented Programs
Why test?
Test-driven development
Unit testing
Assertion methods
Reducing boilerplate and cleaning up
Organizing and running tests
Ignoring broken tests
Testing with py.test
One way to do setup and cleanup
A completely different way to set up variables
Skipping tests with py.test
Imitating expensive objects
How much testing is enough?
Case study
Implementing it
14. Concurrency
Threads
The many problems with threads
Shared memory
The global interpreter lock
Thread overhead
Multiprocessing
Multiprocessing pools
Queues
The problems with multiprocessing
Futures
AsyncIO
AsyncIO in action
Reading an AsyncIO future
AsyncIO for networking
Using executors to wrap blocking code
Streams
Executors
Case study
2. Course Module 2: Data Analysis
1. Introducing Data Analysis and Libraries
Data analysis and processing
An overview of the libraries in data analysis
Python libraries in data analysis
NumPy
pandas
Matplotlib
PyMongo
The scikit-learn library
2. NumPy Arrays and Vectorized Computation
NumPy arrays
Data types
Array creation
Indexing and slicing
Fancy indexing
Numerical operations on arrays
Array functions
Data processing using arrays
Loading and saving data
Saving an array
Loading an array
Linear algebra with NumPy
NumPy random numbers
3. Data Analysis with pandas
An overview of the pandas package
The pandas data structure
Series
The DataFrame
The essential basic functionality
Reindexing and altering labels
Head and tail
Binary operations
Functional statistics
Function application
Sorting
Indexing and selecting data
Computational tools
Working with missing data
Advanced uses of pandas for data analysis
Hierarchical indexing
The Panel data
4. Data Visualization
The matplotlib API primer
Line properties
Figures and subplots
Exploring plot types
Scatter plots
Bar plots
Contour plots
Histogram plots
Legends and annotations
Plotting functions with pandas
Additional Python data visualization tools
Bokeh
MayaVi
5. Time Series
Time series primer
Working with date and time objects
Resampling time series
Downsampling time series data
Upsampling time series data
Timedeltas
Time series plotting
6. Interacting with Databases
Interacting with data in text format
Reading data from text format
Writing data to text format
Interacting with data in binary format
HDF5
Interacting with data in MongoDB
Interacting with data in Redis
The simple value
List
Set
Ordered set
7. Data Analysis Application Examples
Data munging
Cleaning data
Filtering
Merging data
Reshaping data
Data aggregation
Grouping data
3. Course Module 3: Data Mining
1. Getting Started with Data Mining
Introducing data mining
A simple affinity analysis example
What is affinity analysis?
Product recommendations
Loading the dataset with NumPy
Implementing a simple ranking of rules
Ranking to find the best rules
A simple classification example
What is classification?
Loading and preparing the dataset
Implementing the OneR algorithm
Testing the algorithm
2. Classifying with scikit-learn Estimators
scikit-learn estimators
Nearest neighbors
Distance metrics
Loading the dataset
Moving towards a standard workflow
Running the algorithm
Setting parameters
Preprocessing using pipelines
An example
Standard preprocessing
Putting it all together
Pipelines
3. Predicting Sports Winners with Decision Trees
Loading the dataset
Collecting the data
Using pandas to load the dataset
Cleaning up the dataset
Extracting new features
Decision trees
Parameters in decision trees
Using decision trees
Sports outcome prediction
Putting it all together
Random forests
How do ensembles work?
Parameters in Random forests
Applying Random forests
Engineering new features
4. Recommending Movies Using Affinity Analysis
Affinity analysis
Algorithms for affinity analysis
Choosing parameters
The movie recommendation problem
Obtaining the dataset
Loading with pandas
Sparse data formats
The Apriori implementation
The Apriori algorithm
Implementation
Extracting association rules
Evaluation
5. Extracting Features with Transformers
Feature extraction
Representing reality in models
Common feature patterns
Creating good features
Feature selection
Selecting the best individual features
Feature creation
Creating your own transformer
The transformer API
Implementation details
Unit testing
Putting it all together
6. Social Media Insight Using Naive Bayes
Disambiguation
Downloading data from a social network
Loading and classifying the dataset
Creating a replicable dataset from Twitter
Text transformers
Bag-of-words
N-grams
Other features
Naive Bayes
Bayes' theorem
Naive Bayes algorithm
How it works
Application
Extracting word counts
Converting dictionaries to a matrix
Training the Naive Bayes classifier
Putting it all together
Evaluation using the F1-score
Getting useful features from models
7. Discovering Accounts to Follow Using Graph Mining
Loading the dataset
Classifying with an existing model
Getting follower information from Twitter
Building the network
Creating a graph
Creating a similarity graph
Finding subgraphs
Connected components
Optimizing criteria
8. Beating CAPTCHAs with Neural Networks
Artificial neural networks
An introduction to neural networks
Creating the dataset
Drawing basic CAPTCHAs
Splitting the image into individual letters
Creating a training dataset
Adjusting our training dataset to our methodology
Training and classifying
Back propagation
Predicting words
Improving accuracy using a dictionary
Ranking mechanisms for words
Putting it all together
9. Authorship Attribution
Attributing documents to authors
Applications and use cases
Attributing authorship
Getting the data
Function words
Counting function words
Classifying with function words
Support vector machines
Classifying with SVMs
Kernels
Character n-grams
Extracting character n-grams
Using the Enron dataset
Accessing the Enron dataset
Creating a dataset loader
Putting it all together
Evaluation
10. Clustering News Articles
Obtaining news articles
Using a Web API to get data
Reddit as a data source
Getting the data
Extracting text from arbitrary websites
Finding the stories in arbitrary websites
Putting it all together
Grouping news articles
The k-means algorithm
Evaluating the results
Extracting topic information from clusters
Using clustering algorithms as transformers
Clustering ensembles
Evidence accumulation
How it works
Implementation
Online learning
An introduction to online learning
Implementation
11. Classifying Objects in Images Using Deep Learning
Object classification
Application scenario and goals
Use cases
Deep neural networks
Intuition
Implementation
An introduction to Theano
An introduction to Lasagne
Implementing neural networks with nolearn
GPU optimization
When to use GPUs for computation
Running our code on a GPU
Setting up the environment
Application
Getting the data
Creating the neural network
Putting it all together
12. Working with Big Data
Big data
Application scenario and goals
MapReduce
Intuition
A word count example
Hadoop MapReduce
Application
Getting the data
Naive Bayes prediction
The mrjob package
Extracting the blog posts
Training Naive Bayes
Putting it all together
Training on Amazon's EMR infrastructure
13. Next Steps…
Chapter 1 – Getting Started with Data Mining
Scikit-learn tutorials
Extending the IPython Notebook
Chapter 2 – Classifying with scikit-learn Estimators
More complex pipelines
Comparing classifiers
Chapter 3: Predicting Sports Winners with Decision Trees
More on pandas
Chapter 4 – Recommending Movies Using Affinity Analysis
The Eclat algorithm
Chapter 5 – Extracting Features with Transformers
Vowpal Wabbit
Chapter 6 – Social Media Insight Using Naive Bayes
Natural language processing and part-of-speech tagging
Chapter 7 – Discovering Accounts to Follow Using Graph Mining
More complex algorithms
Chapter 8 – Beating CAPTCHAs with Neural Networks
Deeper networks
Reinforcement learning
Chapter 9 – Authorship Attribution
Local n-grams
Chapter 10 – Clustering News Articles
Real-time clusterings
Chapter 11 – Classifying Objects in Images Using Deep Learning
Keras and Pylearn2
Mahotas
Chapter 12 – Working with Big Data
Courses on Hadoop
Pydoop
Recommendation engine
More resources
4. Course Module 4: Machine Learning
1. Giving Computers the Ability to Learn from Data
How to transform data into knowledge
The three different types of machine learning
Making predictions about the future with supervised learning
Classification for predicting class labels
Regression for predicting continuous outcomes
Solving interactive problems with reinforcement learning
Discovering hidden structures with unsupervised learning
Finding subgroups with clustering
Dimensionality reduction for data compression
An introduction to the basic terminology and notations
A roadmap for building machine learning systems
Preprocessing – getting data into shape
Training and selecting a predictive model
Evaluating models and predicting unseen data instances
Using Python for machine learning
2. Training Machine Learning Algorithms for Classification
Artificial neurons – a brief glimpse into the early history of machine learning
Implementing a perceptron learning algorithm in Python
Training a perceptron model on the Iris dataset
Adaptive linear neurons and the convergence of learning
Minimizing cost functions with gradient descent
Implementing an Adaptive Linear Neuron in Python
Large scale machine learning and stochastic gradient descent
3. A Tour of Machine Learning Classifiers Using scikit-learn
Choosing a classification algorithm
First steps with scikit-learn
Training a perceptron via scikit-learn
Modeling class probabilities via logistic regression
Logistic regression intuition and conditional probabilities
Learning the weights of the logistic cost function
Training a logistic regression model with scikit-learn
Tackling overfitting via regularization
Maximum margin classification with support vector machines
Maximum margin intuition
Dealing with the nonlinearly separable case using slack variables
Alternative implementations in scikit-learn
Solving nonlinear problems using a kernel SVM
Using the kernel trick to find separating hyperplanes in higher dimensional space
Decision tree learning
Maximizing information gain – getting the most bang for the buck
Building a decision tree
Combining weak to strong learners via random forests
K-nearest neighbors – a lazy learning algorithm
4. Building Good Training Sets – Data Preprocessing
Dealing with missing data
Eliminating samples or features with missing values
Imputing missing values
Understanding the scikit-learn estimator API
Handling categorical data
Mapping ordinal features
Encoding class labels
Performing one-hot encoding on nominal features
Partitioning a dataset in training and test sets
Bringing features onto the same scale
Selecting meaningful features
Sparse solutions with L1 regularization
Sequential feature selection algorithms
Assessing feature importance with random forests
5. Compressing Data via Dimensionality Reduction
Unsupervised dimensionality reduction via principal component analysis
Total and explained variance
Feature transformation
Principal component analysis in scikit-learn
Supervised data compression via linear discriminant analysis
Computing the scatter matrices
Selecting linear discriminants for the new feature subspace
Projecting samples onto the new feature space
LDA via scikit-learn
Using kernel principal component analysis for nonlinear mappings
Kernel functions and the kernel trick
Implementing a kernel principal component analysis in Python
Example 1 – separating half-moon shapes
Example 2 – separating concentric circles
Projecting new data points
Kernel principal component analysis in scikit-learn
6. Learning Best Practices for Model Evaluation and Hyperparameter Tuning
Streamlining workflows with pipelines
Loading the Breast Cancer Wisconsin dataset
Combining transformers and estimators in a pipeline
Using k-fold cross-validation to assess model performance
The holdout method
K-fold cross-validation
Debugging algorithms with learning and validation curves
Diagnosing bias and variance problems with learning curves
Addressing overfitting and underfitting with validation curves
Fine-tuning machine learning models via grid search
Tuning hyperparameters via grid search
Algorithm selection with nested cross-validation
Looking at different performance evaluation metrics
Reading a confusion matrix
Optimizing the precision and recall of a classification model
Plotting a receiver operating characteristic
The scoring metrics for multiclass classification
7. Combining Different Models for Ensemble Learning
Learning with ensembles
Implementing a simple majority vote classifier
Combining different algorithms for classification with majority vote
Evaluating and tuning the ensemble classifier
Bagging – building an ensemble of classifiers from bootstrap samples
Leveraging weak learners via adaptive boosting
8. Predicting Continuous Target Variables with Regression Analysis
Introducing a simple linear regression model
Exploring the Housing Dataset
Visualizing the important characteristics of a dataset
Implementing an ordinary least squares linear regression model
Solving regression for regression parameters with gradient descent
Estimating the coefficient of a regression model via scikit-learn
Fitting a robust regression model using RANSAC
Evaluating the performance of linear regression models
Using regularized methods for regression
Turning a linear regression model into a curve – polynomial regression
Modeling nonlinear relationships in the Housing Dataset
Dealing with nonlinear relationships using random forests
Decision tree regression
Random forest regression
A. Reflect and Test Yourself! Answers
Module 2: Data Analysis
Chapter 1: Introducing Data Analysis and Libraries
Chapter 2: Object-oriented Design
Chapter 3: Data Analysis with pandas
Chapter 4: Data Visualization
Chapter 5: Time Series
Chapter 6: Interacting with Databases
Chapter 7: Data Analysis Application Examples
Module 3: Data Mining
Chapter 1: Getting Started with Data Mining
Chapter 2: Classifying with scikit-learn Estimators
Chapter 3: Predicting Sports Winners with Decision Trees
Chapter 4: Recommending Movies Using Affinity Analysis
Chapter 5: Extracting Features with Transformers
Chapter 6: Social Media Insight Using Naive Bayes
Chapter 7: Discovering Accounts to Follow Using Graph Mining
Chapter 8: Beating CAPTCHAs with Neural Networks
Chapter 9: Authorship Attribution
Chapter 10: Clustering News Articles
Chapter 11: Classifying Objects in Images Using Deep Learning
Chapter 12: Working with Big Data
Module 4: Machine Learning
Chapter 1: Giving Computers the Ability to Learn from Data
Chapter 2: Training Machine Learning
Chapter 3: A Tour of Machine Learning Classifiers Using scikit-learn
Chapter 4: Building Good Training Sets – Data Preprocessing
Chapter 5: Compressing Data via Dimensionality Reduction
Chapter 6: Learning Best Practices for Model Evaluation and Hyperparameter Tuning
Chapter 7: Combining Different Models for Ensemble Learning
Chapter 8: Predicting Continuous Target Variables with Regression Analysis
B. Bibliography
Index