Data Analysis with Python

You're reading from Data Analysis with Python: A Modern Approach

Product type Paperback
Published in Dec 2018
Publisher Packt
ISBN-13 9781789950069
Length 490 pages
Edition 1st Edition
Author (1):
David Taieb

Table of Contents (14 chapters)

Preface
1. Programming and Data Science – A New Toolset
2. Python and Jupyter Notebooks to Power your Data Analysis
3. Accelerate your Data Analysis with Python Libraries
4. Publish your Data Analysis to the Web - the PixieApp Tool
5. Python and PixieDust Best Practices and Advanced Concepts
6. Analytics Study: AI and Image Recognition with TensorFlow
7. Analytics Study: NLP and Big Data with Twitter Sentiment Analysis
8. Analytics Study: Prediction - Financial Time Series Analysis and Forecasting
9. Analytics Study: Graph Algorithms - US Domestic Flight Data Analysis
10. The Future of Data Analysis and Where to Develop your Skills
A. PixieApp Quick-Reference
Other Books You May Enjoy
Index

Jupyter Notebooks at the center of our strategy

In essence, Notebooks are web documents composed of editable cells that let you run commands interactively against a backend engine. As their name indicates, we can think of them as the digital version of a paper scratch pad used to write notes and results about experiments. The concept is both powerful and simple: a user enters code in the language of their choice (most Notebook implementations support multiple languages, such as Python, Scala, R, and many more), runs the cell, and gets the results interactively in an output area below the cell that becomes part of the document. Results can be of various types, including text, HTML, and images, which is great for graphing data. It's like working with a traditional REPL (short for Read-Eval-Print Loop) program on steroids, since the Notebook can be connected to powerful compute engines (such as Apache Spark (https://spark.apache.org) or Python Dask (https://dask.pydata.org) clusters), allowing you to experiment with big data if needed.
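
To make the idea of rich results concrete, here is a minimal sketch of how a Python object can offer an HTML rendering to the Notebook. It relies on the `_repr_html_` hook that Jupyter's Python kernel (IPython) looks for when it displays the last expression of a cell. The `ResultTable` class and its sample data are hypothetical and only serve the illustration.

```python
from html import escape


class ResultTable:
    """Hypothetical container that renders as an HTML table in a Notebook."""

    def __init__(self, headers, rows):
        self.headers = list(headers)
        self.rows = [list(row) for row in rows]

    def __repr__(self):
        # Plain-text fallback, which is what a regular terminal REPL would show.
        return f"ResultTable({len(self.rows)} rows x {len(self.headers)} columns)"

    def _repr_html_(self):
        # IPython calls this hook, when present, to obtain an HTML rendering.
        head = "".join(f"<th>{escape(str(h))}</th>" for h in self.headers)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"


# Illustrative data only.
table = ResultTable(["language", "kernel"], [["Python", "ipykernel"], ["R", "IRkernel"]])
table  # as the last expression of a cell, this would be displayed as an HTML table
```

Outside a Notebook, the same object simply falls back to its text representation, which is one way to see the difference between a Notebook and a plain REPL.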

Within Notebooks, any classes, functions, or variables created in a cell are visible to the cells that run after it (typically the cells below), enabling you to write complex analytics piece by piece, iteratively testing your hypotheses and fixing problems before moving on to the next phase. In addition, users can write rich text using the popular Markdown language, or mathematical expressions using LaTeX (https://www.latex-project.org/), to describe their experiments for others to read.
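
The way definitions carry over from one cell to the next can be pictured with a toy model: every cell is executed against the same namespace, so a later cell sees whatever an earlier one defined. The sketch below is a simplification built on Python's `exec`; it is not how the Jupyter kernel is actually implemented, and the cell contents are made up.

```python
# One dictionary plays the role of the kernel's shared namespace.
namespace = {}

cells = [
    "values = [3, 1, 4, 1, 5, 9]",                 # cell 1: define some data
    "def mean(xs):\n    return sum(xs) / len(xs)",  # cell 2: define a function
    "result = mean(values)",                        # cell 3: use both
]

for source in cells:
    # Each "cell" runs against the same namespace, so state accumulates.
    exec(source, namespace)

print(namespace["result"])
```

Running the third cell before the first two would raise a `NameError`, which is why execution order matters when working in a Notebook.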

The following figure shows parts of a sample Jupyter Notebook with a Markdown cell explaining what the experiment is about, a code cell written in Python to create 3D plots, and the actual 3D charts results:


Sample Jupyter Notebook

Why are Notebooks so popular?

In the last few years, Notebooks have seen meteoric growth in popularity as the tool of choice for data science-related activities. There are multiple reasons that can explain it, but I believe the main one is their versatility, which makes them an indispensable tool not just for data scientists but also for most of the personas involved in building data pipelines, including business analysts and developers.

For data scientists, Notebooks are ideal for iterative experimentation because they make it possible to quickly load, explore, and visualize data. Notebooks are also an excellent collaboration tool; they can be exported as JSON files and easily shared across the team, allowing experiments to be identically repeated and debugged when needed. In addition, because Notebooks are also web applications, they can be easily integrated into a multiuser cloud-based environment, providing an even better collaborative experience.
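
Because a Notebook is stored as JSON, it can be created and inspected with nothing but the standard library. The following sketch writes a minimal two-cell Notebook and reads it back. The field names follow my understanding of version 4 of the Notebook file format, while the file name and cell contents are illustrative assumptions.

```python
import json
import os
import tempfile

# Minimal Notebook document (assumed nbformat 4 layout).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Experiment notes\n", "Load and summarize the data."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello from a code cell')"],
        },
    ],
}

# Write the Notebook to a temporary .ipynb file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "experiment.ipynb")
with open(path, "w", encoding="utf-8") as f:
    json.dump(notebook, f, indent=1)

# A teammate can load the same file and see exactly the same cells.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

for cell in loaded["cells"]:
    print(cell["cell_type"], "->", "".join(cell["source"])[:40])
```

Since the file is plain JSON, it can also be stored in version control or attached to an email like any other text document.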

These environments can also provide on-demand access to large compute resources by connecting the Notebooks with clusters of machines using frameworks such as Apache Spark. Demand for these cloud-based Notebook servers is growing rapidly and, as a result, we're seeing an increasing number of SaaS (short for Software as a Service) solutions, both commercial, such as IBM Data Science Experience (https://datascience.ibm.com) or Databricks (https://databricks.com/try-databricks), and open source, such as JupyterHub (https://jupyterhub.readthedocs.io/en/latest).

For business analysts, Notebooks can be used as presentation tools that, thanks to their Markdown support, in most cases provide enough capabilities to replace traditional PowerPoint presentations. The charts and tables they generate can be used directly to communicate the results of complex analytics; there's no need to copy and paste anymore, and changes in the algorithms are automatically reflected in the final presentation. For example, some Notebook implementations, such as Jupyter, provide an automated conversion of the cell layout to a slideshow, making the whole experience even more seamless.

Note

For reference, here are the steps to produce these slides in Jupyter Notebooks:

  • Using the View | Cell Toolbar | Slideshow menu, first annotate each cell by choosing between Slide, Sub-Slide, Fragment, Skip, or Notes.
  • Use the jupyter nbconvert command to convert the Notebook into a Reveal.js-powered HTML slideshow:

      jupyter nbconvert <pathtonotebook.ipynb> --to slides

  • Optionally, you can fire up a web application server to access these slides online:

      jupyter nbconvert <pathtonotebook.ipynb> --to slides --post serve
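
The steps in this note can also be scripted. As far as I know, the Slideshow toolbar stores its choice in each cell's metadata under `slideshow.slide_type`; the sketch below sets that field directly on cells represented as JSON-style dictionaries and then assembles the nbconvert command line. The helper names, the sample cells, and the `talk.ipynb` file name are hypothetical, and the command is only built, not executed.

```python
# Values assumed to match the toolbar's Slide, Sub-Slide, Fragment, Skip, and Notes.
SLIDE_TYPES = {"slide", "subslide", "fragment", "skip", "notes"}


def annotate(cell, slide_type):
    """Record the slide type in the cell metadata (hypothetical helper)."""
    if slide_type not in SLIDE_TYPES:
        raise ValueError(f"unknown slide type: {slide_type}")
    cell.setdefault("metadata", {})["slideshow"] = {"slide_type": slide_type}
    return cell


def nbconvert_command(notebook_path, serve=False):
    """Build the nbconvert command line as an argument list."""
    command = ["jupyter", "nbconvert", notebook_path, "--to", "slides"]
    if serve:
        # Optional step: serve the generated slides from a local web server.
        command += ["--post", "serve"]
    return command


# Illustrative cells only.
cells = [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Results"]},
    {"cell_type": "code", "metadata": {}, "source": ["1 + 1"],
     "outputs": [], "execution_count": None},
]
annotate(cells[0], "slide")
annotate(cells[1], "fragment")

print(nbconvert_command("talk.ipynb", serve=True))
```

On a machine where Jupyter is installed, the resulting list could be passed to `subprocess.run` to produce the slideshow.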

For developers, the situation is much less clear-cut. On the one hand, developers love REPL programming, and Notebooks offer all the advantages of an interactive REPL, with the added bonus that they can be connected to a remote backend. By virtue of running in a browser, results can contain graphics and, since they can be saved, all or part of the Notebook can be reused in different scenarios. So, for a developer, provided that your language of choice is available, Notebooks offer a great way to try and test things out, such as fine-tuning an algorithm or integrating a new API. On the other hand, developers have rarely adopted Notebooks for data science activities that could complement the work being done by data scientists, even though developers are ultimately responsible for operationalizing the analytics into applications that address customer needs.

To improve the software development life cycle and reduce time to value, developers need to start using the same tools, programming languages, and frameworks as data scientists, including Python with its rich ecosystem of libraries, and Notebooks, which have become such an important data science tool. Granted, developers have to meet data scientists in the middle and get up to speed on the theory and concepts behind data science. Based on my experience, I highly recommend using MOOC (short for Massive Open Online Course) platforms such as Coursera (https://www.coursera.org) or edX (http://www.edx.org), which provide a wide variety of courses for every level.

However, having used Notebooks quite extensively, I have found that, while very powerful, they are primarily designed for data scientists, leaving developers with a steep learning curve. They also lack the application development capabilities that are so critical for developers. As we've seen in the Sentiment analysis of Twitter Hashtags project, building an application or a dashboard based on the analytics created in a Notebook can be very difficult, and may require an architecture that is hard to implement and has a heavy footprint on the infrastructure.

It is to address these gaps that I decided to create the PixieDust (https://github.com/ibm-watson-data-lab/pixiedust) library and open source it. As we'll see in the next chapters, the main goal of PixieDust is to lower the cost of entry for new users (whether they are data scientists or developers) by providing simple APIs for loading and visualizing data. PixieDust also provides a developer framework with APIs for easily building applications, tools, and dashboards that can run directly in the Notebook and also be deployed as web applications.
