Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Practical Data Analysis

You're reading from   Practical Data Analysis Pandas, MongoDB, Apache Spark, and more

Arrow left icon
Product type Paperback
Published in Sep 2016
Publisher
ISBN-13 9781785289712
Length 338 pages
Edition 2nd Edition
Languages
Arrow right icon
Authors (2):
Arrow left icon
Hector Cuesta Hector Cuesta
Author Profile Icon Hector Cuesta
Hector Cuesta
Dr. Sampath Kumar Dr. Sampath Kumar
Author Profile Icon Dr. Sampath Kumar
Dr. Sampath Kumar
Arrow right icon
View More author details
Toc

Table of Contents (16) Chapters Close

Preface 1. Getting Started 2. Preprocessing Data FREE CHAPTER 3. Getting to Grips with Visualization 4. Text Classification 5. Similarity-Based Image Retrieval 6. Simulation of Stock Prices 7. Predicting Gold Prices 8. Working with Support Vector Machines 9. Modeling Infectious Diseases with Cellular Automata 10. Working with Social Graphs 11. Working with Twitter Data 12. Data Processing and Aggregation with MongoDB 13. Working with MapReduce 14. Online Data Analysis with Jupyter and Wakari 15. Understanding Data Processing using Apache Spark

Tools and toys for this book

The main goal of this book is to provide the reader with self-contained projects ready to deploy, and in order to do this, as you go through the book we will use and implement tools such as Python, D3, and MongoDB. These tools will help you to program and deploy the projects. You also can download all the code from the author's GitHub repository:

https://github.com/hmcuesta

You can see a detailed installation and setup process of all the tools in Appendix, Setting Up the Infrastructure.

Why Python?

Python is a "scripting language" - an interpreted language with its own built-in memory management and good facilities for calling and co-operating with other programs. There are two popular versions, 2.7 or 3.x, and in this book we will be focusing on the 3.x version, because this is under active development and has already seen over two years of stable releases.

Python is multi-platform, runs on Windows, Linux/Unix, and Mac OS X, and has been ported to Java and .NET virtual machines. Python has powerful standard libs and a wealth of third-party packages for numerical computation and machine learning, such as NumPy, SciPy, pandas, SciKit, mlpy, and so on.

Python is excellent for beginners, yet great for experts, is highly scalable, and is also suitable for large projects as well as small ones. It is also easily extensible and object-oriented.

Python is widely used by organizations like Google, Yahoo maps, NASA, Red Hat, Raspberry Pi, IBM, and many more.

http://wiki.python.org/moin/OrganizationsUsingPython

Python has excellent documentation and examples:

http://docs.python.org/3/

The latest Python software is available for free, even for commercial products, and can be downloaded from here:

http://python.org/

Why mlpy?

mlpy (Machine Learning Python) is a module built on top of NumPy, SciPy, and the GNU scientific libraries. It is open source and supports Python 3.x. mlpy has a large number of machine learning algorithms for supervised and unsupervised problems.

Some of the features of mlpy that will be used in this book are as follows:

  • Regression: Support Vector Machines (SVM)
  • Classification: SVM, k-nearest-neighbor (k-NN), classification tree
  • Clustering: k-means, multidimensional scaling
  • Dimensionality Reduction: Principal Component Analysis (PCA)
  • Misc: Dynamic Time Warping (DTW) distance

We can download the latest version of mlpy from here here:http://mlpy.sourceforge.net/

Reference: D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, C. Furlanello. mlpy: Machine Learning Python, 2012: http://arxiv.org/abs/1202.6548.

Why D3.js?

D3.js (data-driven documents) was developed by Mike Bostock. D3 is a JavaScript library for visualizing data and manipulating the document object model that runs in a browser without a plugin. In D3.js you can manipulate all the elements of the DOM, and it is as flexible as the client-side web technology stack (HTML, CSS, and SVG).

D3.js supports large datasets and includes animation capabilities that make it a really good choice for web visualization.

D3 has excellent documentation, examples and community:

We can download the latest version of D3.js from:

https://d3js.org/

Why MongoDB?

NoSQL is a term that covers different types of data storage technology that are used when you can't fit your business model into a classical relational data model. NoSQL is mainly used in web 2.0 and in social media applications.

MongoDB is a document-based database. This means that MongoDB stores and organizes the data as a collection of documents. That gives you the possibility to store the view models almost exactly as you model them in the application. You can also perform complex searches for data and elementary data mining with MapReduce.

MongoDB is highly scalable, robust, and works perfectly with JavaScript-based web applications because you can store your data in a JSON document and implement a flexible schema, which makes it perfect for unstructured data.

MongoDB is used by well-known corporations like Foursquare, Craigslist, Firebase, SAP, and Forbes; we can see a detailed list of users at:

https://www.mongodb.com/industries

MongoDB has a big and active community, as well as well-written documentation:

http://docs.mongodb.org/manual/

MongoDB is easy to learn and it's free. We can download MongoDB from here:

http://www.mongodb.org/downloads

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at AU $24.99/month. Cancel anytime