
You're reading from Hands-On Data Science with Anaconda: Utilize the right mix of tools to create high-performance data science applications

Product type Paperback

Published in May 2018

Publisher Packt

ISBN-13 9781788831192

Length 364 pages

Edition 1st Edition

Languages

Python

Tools

Anaconda

Concepts

Data Science

Authors (2):

James Yan

Yuxing Yan


Chapter 1, Ecosystem of Anaconda, introduces some basic concepts, such as why we use Anaconda and the advantages of the full-fledged Anaconda distribution and its lightweight version, Miniconda. Then, it covers how to use Anaconda online, without installation. We also test a few simple programs written in R, Python, Julia, and Octave.

Chapter 2, Anaconda Installation, shows how to install Anaconda, how to test whether the installation was successful, how to launch Jupyter and use it to run Python, how to launch Spyder and R, and how to find help. Most of these concepts and procedures are quite basic, so readers who are already confident with them can skip this chapter and go directly to the next one.

Chapter 3, Data Basics, discusses sources of open data, which include the Bureau of Labor Statistics, the Census Bureau, Professor French’s Data Library, the Federal Reserve’s Data Library, and the UCI (University of California at Irvine) Machine Learning Repository. After that, it explains how to input data; how to deal with missing data; how to sort, slice, and dice datasets; how to merge different datasets; and how to output data. For different languages, such as Python, R, Julia, and Octave, several relevant packages for data manipulation are introduced and discussed.
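To make the data-manipulation steps concrete, here is a minimal sketch that uses only the Python standard library rather than any of the packages the chapter introduces. The records, field names, and the choice of mean imputation are all hypothetical and serve only to illustrate filling missing values, sorting, slicing, and merging.

```python
# Hypothetical toy records; None marks a missing value.
prices = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": None},
    {"ticker": "CCC", "price": 30.0},
]
sectors = [
    {"ticker": "AAA", "sector": "tech"},
    {"ticker": "CCC", "sector": "energy"},
]

# Deal with missing data: replace None with the mean of the observed prices
# (mean imputation is just one possible choice).
observed = [row["price"] for row in prices if row["price"] is not None]
mean_price = sum(observed) / len(observed)
filled = [
    {**row, "price": mean_price if row["price"] is None else row["price"]}
    for row in prices
]

# Sort by price in descending order, then slice the top two rows.
top_two = sorted(filled, key=lambda row: row["price"], reverse=True)[:2]

# Merge (inner join) the two datasets on the shared "ticker" key.
sector_by_ticker = {row["ticker"]: row["sector"] for row in sectors}
merged = [
    {**row, "sector": sector_by_ticker[row["ticker"]]}
    for row in filled
    if row["ticker"] in sector_by_ticker
]
```

A dedicated data-manipulation package would express each of these steps in a single call; the plain-Python version is shown only so that the logic of each step is visible.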

Chapter 4, Data Visualization, discusses various types of visual presentations, including simple graphs, bar charts, pie charts, and histograms, produced in different languages such as R, Python, and Julia. Visual presentations can help our audience understand our data better. For many complex concepts or theories, we can use visual presentations to help explain their logic and complexity. A typical example is the bisection method, also known as bisection search.
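For readers unfamiliar with the bisection method mentioned above, the following is a minimal sketch of the idea, not the book's own listing. The function name `bisection`, its parameters, and the example function x² − 2 are illustrative assumptions.

```python
def bisection(f, low, high, tol=1e-8, max_iter=200):
    """Find a root of f in [low, high], assuming f changes sign on the interval."""
    f_low, f_high = f(low), f(high)
    if f_low == 0:
        return low
    if f_high == 0:
        return high
    if f_low * f_high > 0:
        raise ValueError("f(low) and f(high) must have opposite signs")
    for _ in range(max_iter):
        mid = (low + high) / 2
        f_mid = f(mid)
        # Stop when we hit the root exactly or the half-interval is small enough.
        if f_mid == 0 or (high - low) / 2 < tol:
            return mid
        # Keep the half of the interval where the sign change occurs.
        if f_low * f_mid < 0:
            high = mid
        else:
            low, f_low = mid, f_mid
    return (low + high) / 2


# Example: approximate the square root of 2 as the root of x^2 - 2 on [0, 2].
root = bisection(lambda x: x * x - 2, 0.0, 2.0)
```

Each iteration halves the search interval, which is why a plot of the successive midpoints makes the method easy to explain visually.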

Chapter 5, Statistical Modeling in Anaconda, explains many important topics in statistics, such as the t-distribution, the F-distribution, the t-test, and the F-test. We also discuss linear regression, how to deal with missing data, how to treat outliers, collinearity and its treatments, and how to run a multi-variable linear regression.
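As a small illustration of linear regression, the sketch below fits a one-variable ordinary least squares line from first principles. It is a simplification of the multi-variable case the chapter covers; the function name `simple_ols` and the sample data are assumptions made for this example.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data generated exactly from y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
intercept, slope = simple_ols(x, y)
```

With several regressors, the same least-squares principle applies, but the solution is usually computed with matrix routines from a statistics package rather than by hand.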

Chapter 6, Managing Packages, explains the importance of managing packages, how to find all the packages available for R, Python, and Julia, and how to find the manual for each package. In addition, we discuss package dependencies and how to make our programming a little easier when dealing with packages.

Chapter 7, Optimization in Anaconda, discusses several optimization topics, including general optimization problems, expressing various kinds of optimization problems as linear programming problems (LPPs), and quadratic optimization. Several examples are offered to make our discussion more practice-oriented, such as how to choose an optimal stock portfolio, how to optimize wealth and resources to promote sustainable development, and how much the government should really tax people. In addition, we introduce several packages for optimization in R, Python, Julia, and Octave.
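To give a flavor of the portfolio example, the sketch below computes the minimum-variance weights for two risky assets using the well-known closed-form solution, so no optimization package is needed. The function name and the variance and covariance figures are hypothetical, and a realistic portfolio problem with many assets and constraints would call for a quadratic optimizer instead.

```python
def min_variance_weights(var1, var2, cov12):
    """Weights (w1, w2) that minimize the variance of a two-asset portfolio, w1 + w2 = 1."""
    denominator = var1 + var2 - 2 * cov12
    if denominator == 0:
        raise ValueError("assets are perfectly correlated with equal variance")
    w1 = (var2 - cov12) / denominator
    return w1, 1 - w1


def portfolio_variance(w1, w2, var1, var2, cov12):
    """Variance of the portfolio with weights w1 and w2."""
    return w1 ** 2 * var1 + w2 ** 2 * var2 + 2 * w1 * w2 * cov12


# Hypothetical inputs: two uncorrelated assets with variances 0.04 and 0.09.
w1, w2 = min_variance_weights(0.04, 0.09, 0.0)
best_variance = portfolio_variance(w1, w2, 0.04, 0.09, 0.0)
```

Here the less volatile asset receives the larger weight, and the resulting portfolio variance is lower than that of either asset held alone.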

Chapter 8, Unsupervised Learning in Anaconda, covers unsupervised learning, in particular hierarchical clustering and k-means clustering. For R and Python, several related packages are looked at in detail. For R: rattle, Rmixmod, and randomUniformForest; for Python: scipy.cluster, Contrastive, and sklearn.
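The core loop of k-means clustering can be sketched in a few lines of plain Python. This is an illustrative version, not the implementation found in sklearn or scipy.cluster; the function name `kmeans`, the naive initialization (the first k points), and the sample points are assumptions.

```python
def kmeans(points, k, max_iter=100):
    """Cluster points (tuples of floats) into k groups; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # naive init: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments no longer change
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, k=2)
```

Library implementations add smarter initialization and multiple restarts, because the result of this basic loop depends on the starting centroids.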

Chapter 9, Supervised Learning in Anaconda, discusses supervised learning, including classification, the k-nearest neighbors algorithm, Bayes' classifiers, reinforcement learning, and specific R- and Python-related modules, such as RTextTools and sklearn. In addition, you will see their implementation in R, Python, Julia, and Octave.
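As a quick illustration of the k-nearest neighbors algorithm, the following sketch classifies a query point by majority vote among its k closest training points. It is a bare-bones version for intuition only; the function name `knn_predict`, the labels, and the training data are hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; returns the majority label of the k nearest."""
    # Rank training examples by squared Euclidean distance to the query point.
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set with two classes, "a" and "b".
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
pred_a = knn_predict(train, (1.1, 1.0))
pred_b = knn_predict(train, (5.0, 4.8))
```

There is no training phase: the labeled examples themselves are the model, which is what distinguishes this method from the other classifiers in the chapter.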

Chapter 10, Predictive Data Analytics – Modelling and Validation, covers predictive data analytics, modeling and validation, some useful datasets, time series analytics, how to predict future events, seasonality, and how to visualize our data. We mention prsklearn and catwalk for Python; datarobot, LiblineaR, and eclust for R; QuantEcon for Julia; and ltfat for Octave.

Chapter 11, Anaconda Cloud, discusses Anaconda Cloud. Topics include Jupyter Notebook in depth, different formats of Jupyter notebooks, how to share notebooks with your partners, how to share different projects over different platforms, how to share your working environments, and how to replicate others' environments locally.

Chapter 12, Distributed Computing, Parallel Computing, and HPCC, covers distributed computing and Anaconda Accelerate. When our data or tasks become more complex, we need a good system or a set of tools to process data and run complex algorithms. For this purpose, distributed computing is one solution. In particular, we will explain compute nodes, project add-ons, parallel processing, and advanced Python for data parallelism.
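The data-parallel pattern behind these ideas (split the data, process the pieces concurrently, combine the partial results) can be sketched with the standard library alone. This is not Anaconda Accelerate; the function names are hypothetical, and a thread pool is used here only to keep the example simple. For CPU-bound work, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum_of_squares(chunk):
    """Work done independently on one piece of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_chunks=4):
    """Split data into chunks, process them concurrently, and combine the results."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(chunk_sum_of_squares, chunks))
    return sum(partials)


total = parallel_sum_of_squares(list(range(1000)))
```

The same split, map, and combine structure carries over when the workers are separate processes or separate compute nodes; only the mechanism that distributes the chunks changes.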