Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases now! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Practical Data Analysis Using Jupyter Notebook
Practical Data Analysis Using Jupyter Notebook

Practical Data Analysis Using Jupyter Notebook: Learn how to speak the language of data by extracting useful and actionable insights using Python

eBook
€15.99 €23.99
Paperback
€29.99
Subscription
Free Trial
Renews at €18.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
Table of content icon View table of contents Preview book icon Preview Book

Practical Data Analysis Using Jupyter Notebook

Fundamentals of Data Analysis

Welcome and thank you for reading my book. I'm excited to share my passion for data and I hope to provide the resources and insights to fast-track your journey into data analysis. My goal is to educate, mentor, and coach you throughout this book on the techniques used to become a top-notch data analyst. During this process, you will get hands-on experience using the latest open source technologies available such as Jupyter Notebook and Python. We will stay within that technology ecosystem throughout this book to avoid confusion. However, you can be confident the concepts and skills learned are transferable across open source and vendor solutions with a focus on all things data.

In this chapter, we will cover the following:

  • The evolution of data analysis and why it is important
  • What makes a good data analyst?
  • Understanding data types and why they are important
  • Data classifications and data attributes explained
  • Understanding data literacy

The evolution of data analysis and why it is important

To begin, we should define what data is. You will find varying definitions but I would define data as the digital persistence of facts, knowledge, and information consolidated for reference or analysis. The focus of my definition should be the word persistence because digital facts remain even after the computers used to create them are powered down and they are retrievable for future use. Rather than focus on the formal definition, let's discuss the world of data and how it impacts our daily lives. Whether you are reading a review to decide which product to buy or viewing the price of a stock, consuming information has become significantly easier to allow you to make informed data-driven decisions.

Data has been entangled into products and services across every industry from farming to smartphones. For example, America's Grow-a-Row, a New Jersey farm to food bank charity, donated over 1.5 million pounds of fresh produce to feed people in need throughout the region each year, according to their annual report. America's Grow-a-Row has thousands of volunteers and uses data to maximize production yields during the harvest season.

As the demand for being a consumer of data has increased, so has the supply side, which is characterized as the producer of data. Producing data has increased in scale as the technology innovations have evolved. I'll discuss this in more detail shortly, but this large scale consumption and production can be summarized as big data. A National Institute of Standards and Technology report defined big data as consisting of extensive datasets—primarily in the characteristics of volume, velocity, and/or variability—that require a scalable architecture for efficient storage, manipulation, and analysis.

This explosion of big data is characterized by the 3Vs, which are Volume, Velocity, and Variety,and has become a widely accepted concept among data professionals:

  • Volume is based on the quantity of data that is stored in any format such as image files, movies, and database transactions, which are measured in gigabytes, terabytes, or even zettabytes. To give context, you can store hundreds of thousands of songs or pictures on one terabyte of storage space. Even more amazing than the figures is how much it costs you. Google Drive, for example, offers up to 5 TB (terabytes) of storage for free according to their support site.
  • Velocity is the speed at which data is generated. This process covers how data is both produced and consumed. For example, batch processing is how data feeds are sent between systems where blocks of records or bundles of files are sent and received. Modern velocity approaches are real time, streams of data where the data flow is in a constant state of movement.
  • Variety is all of the different formats that data can be stored in, including text, image, database tables, and files. This variety has created both challenges and opportunities for analysis because of the different technologies and techniques required to work with the data.

Understanding the 3Vs is important for data analysis because you must become good at being both a consumer and producer of data. The simple questions of how your data is stored, when this file was produced, where the database table is located, and in what format I shouldstore the output of my analysis of the data can all be addressed by understanding the 3Vs.

There is some debate—for which I disagree—that the 3Vs should increase to include Value, Visualization, and Veracity. No worries, we will cover these concepts throughout this book.

This leads us to a formal definition of data analysis which is defined as a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusion, and supporting decision-making, as stated in Review of business intelligence through data analysis.

Xia, B. S., & Gong, P. (2015). Review of business intelligence through data analysis. Benchmarking, 21(2), 300-311. doi:10.1108/BIJ-08-2012-0050

What I like about this definition is the focus on solving problems using data without the focus on which technologies are used. To make this possible there have been some significant technological milestones, the introduction of new concepts, and people who have broken down the barriers.

To showcase the evolution of data analysis, I compiled a few tables of key events from the years of 1945 until 2018 that I feel are the most influential. The following table is comprised of innovators such as Dr. E.F. Codd, who created the concept of a database to the launch of the iPhone device that spawned the mobile analytics industry.

The following diagram was collected from multiple sources and centralized in one place as a table of columns and rows and then visualized using this dendrogram chart. I posted the CSV file in the GitHub repository for reference: https://github.com/PacktPublishing/python-data-analysis-beginners-guide. Organizing the information and conforming the data in one place made the data visualization easier to produce and enables further analysis:

That process of collecting, formatting, and storing data in this readable format demonstrates the first step of becoming a producer of data. To make this information easier to consume, I summarize these events by decades in the following table:

Decade

Count of Milestones

1940s

2

1950s

2

1960s

1

1970s

2

1980s

5

1990s

9

2000s

14

2010s

7

From the preceding summary table, you can see that the majority of these milestone events occurred in the 1990s and 2000s. What is insightful about this analysis is that recent innovations have removed the barriers of entry for individuals to work with data. Before the 1990s, the high purchasing costs of hardware and software restricted the field of data analysis to a relatively limited number of careers. Also, the costs associated with access to the underlying data for analysis were great. It typically required higher education and specialized careers in software programming or an actuary.

A visual way to look at this same data would be a trend bar chart, as shown in the following diagram. In this example, the height of the bars represents the same information as in the preceding table and the Count of Milestone events is on the left or the y axis. What is nice about this visual representation of the data is that it is a faster way for the consumer to see the upward pattern of where most events occur without scanning through the results found in the preceding diagram or table:

The evolution of data analysis is important to understand because now you know some of the pioneers who opened doors for opportunities and careers working with data, along with key technology breakthroughs, significantly reducing the time to make decisions regarding data both as consumers and producers.

What makes a good data analyst?

I will now break down the contributing factors that make up a good data analyst. From my experience, a good data analyst must be eager to learn and continue to ask questions throughout the process of working with data. The focus of those questions will vary based on the audience who are consuming the results. To be an expert in the field of data analysis, excellent communication skills are required so you can understand how to translate raw data into insights that can impact change in a positive way. To make it easier to remember, use the following acronyms to help to improve your data analyst skills.

Know Your Data (KYD)

Knowing your data is all about understanding the source technology that was used to create the data along with the business requirements and rules used to store it. Do research ahead of time to understand what the business is all about and how the data is used. For example, if you are working with a sales team, learn what drives their team's success. Do they have daily, monthly, or quarterly sales quotas? Do they do reporting for month-end/quarter-end that goes to senior management and has to be accurate because it has financial impacts on the company? Learning more about the source data by asking questions about how it will be consumed will help focus your analysis when you have to deliver results.


KYD is also about data lineage, which is understanding how the data was originally sourced including the technologies used along with the transformations that occurred before, during, and afterward. Refer back to the 3Vs so you can effectively communicate the responses from common questions about the data such as where this data is sourced from or who is responsible for maintaining the data source.

Voice of the Customer (VOC)

The concept of VOC is nothing new and has been taught at universities for years as a well-known concept applied in sales, marketing, and many other business operations. VOC is the concept of understanding customer needs by learning from or listening to their needs before, during, and after they use a company's product or service. The relevance of this concept remains important today and should be applied to every data project that you participate in. This process is where you should interview the consumers of the data analysis results before even looking at the data. If you are working with business users, listen to what their needs are by writing down the specific points on what business questions are they trying to answer.

Schedule a working session with them where you can engage in a dialog. Make sure you focus on their current pain points such as the time to curate all of the data used to make decisions. Does it take three days to complete the process every month? If you can deliver an automated data product or a dashboard that can reduce that time down to a few mouse clicks, your data analysis skills will make you look like a hero to your business users.

During a tech talk at a local university, I was asked the difference between KYD and VOC. I explained that both are important and focused on communicating and learning more about the subject area or business. The key differences are prepared versus present. KYD is all about doing your homework ahead of time to be prepared before talking to experts. VOC is all about listening to the needs of your business or consumers regarding the data.

Always Be Agile (ABA)

The agile methodology has become commonplace in the industry for application, web, and mobile development Software Development Life Cycle (SDLC). One of the reasons that makes the agile project management process successful is that it creates an interactive communication line between the business and technical teams to iteratively deliver business value through the use of data and usable features.

The agile process involves creating stories with a common theme where a development team completes tasks in 2-3 week sprints. In that process, it is important to understand the what and the why for each story including the business value/the problem you are trying to solve.

The agile approach has ceremonies where the developers and business sponsors come together to capture requirements and then deliver incremental value. That improvement in value could be anything from a new dataset available for access to a new feature added to an app.

See the following diagram for a nice visual representation of these concepts. Notice how these concepts are not linear and should require multiple iterations, which help to improve the communication between all people involved in the data analysis before, during, and after delivery of results:

Finally, I believe the most important trait of a good data analyst is a passion for working with data. If your passion can be fueled by continuously learning about all things data, it becomes a lifelong and fulfilling journey.

Understanding data types and their significance

As we have uncovered with the 3Vs, data comes in all shapes and sizes, so let's break down some key data types and better understand why they are important. To begin, let's classify data in general terms of unstructured, semi-structured, and structured.

Unstructured data

The concept behind unstructured data, which is textual in nature, has been around since the 1990s and includes the following examples: the body of an email message, tweets, books, health records, and images. A simple example of unstructured data would be an email message body that is classified as free text. Free text may have some obvious structure that a human can identify such as free space to break up paragraphs, dates, and phone numbers, but having a computer identify those elements would require programming to classify any data elements as such. What makes free text challenging for data analysis is its inconsistent nature, especially when trying to work with multiple examples.

When working with unstructured data, there will be inconsistencies because of the nature of free text including misspellings, the different classification of dates, and so on. Always have a peer review of the workflow or code used to curate the data.

Semi-structured data

Next, we have semi-structured data, which is similar to unstructured, however, the key difference is the addition of tags, which are keywords or any classification used to create a natural hierarchy. Examples of semi-structured data are XML and JSON files, as shown in the following code:

{
"First_Name": "John",
"Last_Name": "Doe",
"Age": 42,
"Home_Address": {
"Address_1": "123 Main Street",
"Address_2": [],
"City": "New York",
"State": "NY",
"Zip_Code": "10021"
},
"Phone_Number": [
{
"Type": "cell",
"Number": "212-555-1212"
},
{
"Type": "home",
"Number": "212 555-4567"
}
],
"Children": [],
"Spouse": "yes"
}

This JSON formatted code allows for free text elements such as a street address, a phone number, and age, but now has tags created to identify those fields and values, which is a concept called key-value pairs. This key-value pair concept allows for the classification of data with a structure for analysis such as filtering, but still has the flexibility to change the elements as necessary to support the unstructured/free text. The biggest advantage of semi-structured data is the flexibility to change the underlining schema of how the data is stored. The schema is a foundational concept of traditional database systems that defines how the data must be persisted (that is, stored on disk).

The disadvantage to semi-structured data is that you may still find inconsistencies with data values depending on how the data was captured. Ideally, the burden on consistency is moved to the User Interface (UI), which would have coded standards and business rules such as required fields to increase the quality but, as a data analyst who practices KYD, you should validate that during the project.

Structured data

Finally, we have structured data, which is the most common type found in databases and data created from applications (apps or software) and code. The biggest benefit with structured data is consistency and relatively high quality between each record, especially when stored in the same database table. The conformity of data and structure is the foundation for analysis, which allows both the producers and consumers of structured data to come to the same results. The topic of databases, or Database Management Systems (DBMS) and Relational Database Management Systems(RDMS) is vast and will not be covered here, but having some understanding will help you to become a better data analyst.

The following diagram is a basic Entity-Relationship (ER) diagram of three tables that would be found in a database:

In this example, each entity would represent physical tables stored in the database, named car, part, and car_part_bridge. The relationship between the car and part is defined by the table called car_part_bridge, which can be classified by multiple names such as bridge, junction, mapping, or link table. The name of each field in the table would be on the left such as part_id, name, or description found in the part table.

The pk label next to the car_id and part_idfield names helps to identify the primary keys for each table. This allows for one field to uniquely identify each record found in the table. If aprimary keyin one table exists in another table, it would be called aforeign key, which is the foundation of how the relationship between the tables is defined and ultimately joined together.

Finally, the text aligned on the right side next to the field name labeled as int or text is the data type for each field. We will cover that concept next and you should now feel comfortable with the concepts for identifying and classifying data.

Common data types

Data types are a well-known concept in programming languages and is found in many different technologies. I have simplified the definition as, the details of the data that is stored and its intended usage. A data type will also create consistency for each data value as it's stored on disk or memory.

Data types will vary depending on the software and/or database used to create the structure. Hence, we won't be covering all the different types across all of the different coding languages but let's walk through a few examples:

Common data type

Common short name

Sample value

Example usage

Integers

int

1235

Counting occurrences, summing values, or the average of values such as sum (hits)

Booleans

bit

TRUE

Conditional testing such as if sales > 1,000, true else false

Geospatial

float or spatial

40.229290, -74.936707

Geo analytics based on latitude and longitude

Characters/string

char

A

Tagging, binning, or grouping data

Floating-point numbers

float or double

2.1234

Sales, cost analysis, or stock price

Alphanumeric strings

blob or varchar

United States

Tagging, binning, encoding, or grouping data

Time

time, timestamp, date

8/19/2000

Time-series analysis or year-over-year comparison

Technologies change and legacy systems will offer opportunities to see data types that may not be common. The best advice when dealing with new data types is to validate the source systems that are created by speaking to an SME (Subject Matter Expert) or system administrator, or to ask for documentation that includes the active version used to persist the data.

In the preceding table, I've created a summary of some common data types. Getting comfortable understanding the differences between data types is important because it determines what type of analysis can be performed on each data value. Numeric data types such as integer (int), floating-point numbers (float), ordoubleare used for mathematical calculations of values such as the sum of sales, count of apples, or the average price of a stock. Ideally, the source system of the record should enforce the data type but there can be and usually are exceptions.

As you evolve your data analysis skills, helping to resolve data type issues or offer suggestions to improve them will make the quality and accuracy of reporting better throughout the organization.

String data types that are defined in the preceding table as characters (char) and alphanumeric strings (varchar or blob) can be represented as text such as a word or full sentence. Time is a special data type that can be represented and stored in multiple ways such as 12 PM EST or a date such as 08/19/2000. Consider geospatial coordinates such as latitude and longitude, which can be stored in multiple data types depending on the source system.

The goal of this chapter is to introduce you to the concept of data types and future chapters will give direct, hands-on experience of working with them. The reason why data types are important is to avoid incomplete or inaccurate information when presenting facts and insights from analysis. Invalid or inconsistent data types also restrict the ability to create accurate charts or data visualizations. Finally, good data analysis is about having confidence and trust that your conclusions are complete with defined data types that support your analysis.

Data classifications and data attributes explained

Now that we understand more about data types and why they are important, let's break down the different classifications of data and understand the different data attribute types. To begin with a visual, let's summarize all of the possible combinations in the following summary diagram:

In the preceding diagram, the boxes directly below data have the three methods to classify data, which are continuous, categorical, or discrete.

Continuous data is measurable, quantified with a numeric data type, and has a continuous range with infinite possibilities. The bottom boxes in this diagram are examples so you can easily find them for reference. Continuous data examples include a stock price, weight in pounds, and time.

Categorical (descriptive) data will have values as astringdata type. Categorical data isqualified so it would describe something specific such as a person, place, or thing. Some examples include a country of origin, a month of the year, the different types of trees, and your family designation.

Just because data is defined as categorical, don't assume the values are all alike or consistent. A month can be stored as 1, 2, 3; Jan, Feb, Mar; or January, February, March, or in any combination. You will learn more about how to clean and conform your data for consistent analysis in Chapter 7, Exploring Cleaning, Refining, and Blending Datasets.

A discrete data type can be either continuous or categorical depending on how it's used for analysis. Examples include the number of employees in a company. You must have an integer/whole number representing the count for each employee, because you can never have partial results such as half an employee. Discrete data is continuous in nature because of its numeric properties but also has limits that make it similar to categorical. Another example would be the numbers on a roulette wheel. There is a limit of whole numbers available on the wheel from 1 to 36, 0, or 00 that a player can bet on, plus the numbers can be categorized as red, black, or green depending on the value.

If only two discrete values exist, such as yes/no or true/false or 1/0, it can also be classified as binary.

Data attributes

Now that we understand how to classify data, let's break down the attribute types available to better understand how you can use them for analysis. The easiest method to break down types is to start with how you plan on using the data values for analysis:

  • Nominal data is defined as data where you can distinguish between different values but not necessarily order them. It is qualitative in nature, so think of nominal data as labels or names as stocks or bonds where math cannot be performed on them because they are string values. With nominal values, you cannot determine whether the word stocks or bonds are better or worse without additional information.
  • Ordinal data is ordered data where a ranking exists, but the distance or range between values cannot be defined. Ordinal data is qualitative using labels or names but now the values will have a natural or defined sequence. Similar to nominal data, ordinal data can be counted but not calculated with all statistical methods.

An example is assigning 1 = low, 2 = medium, and 3 = high values. This has a natural sequence but the difference between low and high cannot be quantified by itself. The data assigned to low and high values could be arbitrary or have additional business rules behind it.

Another common example of ordinal data is natural hierarchies such as state, county, and city, or grandfather, father, and son. The relationship between these values are well defined and commonly understood without any additional information to support it. So, a son will have a father but a father cannot be a son.

  • Interval data is like ordinal data, but the distance between data points is uniform. Weight on a scale in pounds is a good example because the difference between the values from 5 to 10, 10 to 15, and 20 to 25 are all the same. Note that not every arithmetic operation can be performed on interval data so understanding the context of the data and how it should be used becomes important.

Temperature is a good example to demonstrate this paradigm. You can record hourly values and even provide a daily average, but summing the values per day or week would not provide accurate information for analysis. See the following diagram, which provides an hourly temperature for a specific day. Notice the x axis breaks out the hours and the y axis provides the average, which is labeled Avg Temperature, in Fahrenheit. The values between each hour must be an average or mean because an accumulation of temperature would provide misleading results and inaccurate analysis:

  • Ratio data allows for all arithmetic operations including sum, average, median, mode, multiplication, and division. The data types of integer and float discussed earlier are classified as ratio data attributes, which in turn are also numeric/quantitative. Also, time could be classified as ratio data,however, I decided tofurther break down this attribute because of how often it is used for data analysis.
Note that there are advanced statistical details about ratio data attributes that are not covered in this book, such as having an absolute or true zero, so I encourage you to learn more about the subject.
  • Time data attributes as a rich subject that you will come across regularly during your data analysis journey. Time data covers both date and time or any combination, for example, the time as HH:MM AM/PM, such as 12:03 AM; the year as YYYY, such as 1980; a timestamp represented as YYYY-MM-DD hh:mm:ss, such as 2000-08-19 14:32:22; or even a date as MM/DD/YY, such as 08/19/00. What's important to recognize when dealing with time data is to identify the intervals between each value so you can accurately measure the difference between them.
It is common during many data analysis projects that you find gaps in the sequence of time data values. For example, you are given a dataset with a range between 08/01/2019 to 08/31/2019 but only 25 distinct date values exist versus 30 days of data. There are various reasons for this occurrence including system outages where log data was lost. How to handle those data gaps will vary depending on the type of analysis you have to perform, including the need to fill in missing results. We will cover some examples in Chapter 7, Exploring Cleaning, Refining, and Blending Datasets.

Understanding data literacy

Data literacy is defined by Rahul Bhargava and Catherine D'Ignazio as the ability to read, work with, analyze, and arguewith data. Throughout this chapter, I have pointed out how data comes in all shapes and sizes, so creating a common framework to communicate about data between different audiences becomes an important skill to master.

Data literacy becomes a common denominator for answering data questions between two or more people with different skills or experience. For example, if a sales manager wants to verify the data behind a chart in a quarterly report, having them fluent in the language of data will save time. Time is saved by asking direct questions about the data types and data attributes with the engineering team versus searching for those details aimlessly.

Let's break down the concepts of data literacy to help to identify how it can be applied to your personal and professional life.

Reading data

What does it mean to read data? Reading data is consuming information, and that information can be in any format including a chart, a table, code, or the body of an email.

Reading data may not necessarily provide the consumer with all of the answers to their questions. Having domain expertise may be required to understand how, when, and why a dataset was created to allow the consumer to fully interpret the underlying dataset.

For example, you are a data analyst and your colleague sends a file attachment to your email with the subject line as FYI and no additional information in the body of the message. We now know from the What makes a good data analyst? section that we should start asking questions about the file attachment:

  • What methods were used to create the file (human or machine)?
  • What system(s) and workflow were used to create the file?
  • Who created the file and when was it created?
  • How often does this file refresh and is it manual or automated?

Asking these questions helps you to understand the concept of data lineage, which can identify the process of how a dataset was created. This will ensure reading the data will result in understanding all aspects to focus on making decisions from it confidently.

Working with data

I define working withdata as the person or system that creates a dataset using any technology. The technologies used to create data are vastly varied and could be as simple as someone typing rows and columns in spreadsheets, to having a software developer use loops and functions in Python code to create a pipe-delimited file.

Since writing data varies by expertise and job function, a key takeaway from a data literacy perspective is that the producer of data should be conscious of how it will be consumed. Ideally, the producer should document the details of how, when, and where the data was created to include the frequency of how often it is refreshed. Publishing this information democratizes the metadata (data about the data) to improve the communication between anyone reading and working with the data.

For example, if you have a timestamp field in your dataset, is it using UTC (Coordinated Universal Time) or EST (Eastern Standard Time)? By including assumptions and reasons why the data is stored in a specific format, the person or team working with the data become good data citizens by improving the communication for analysis.

Analyzing data

Analyzing data begins with modeling and structuring it to answer business questions. Data modeling is a vast topic but for data literacy purposes, it can be boiled down to dimensions and measures. Dimensions are distinct nouns such as a person, place, or thing, and measures are verbs based on actions and then aggregated (sum, count, min, max, and average).

The foundation for building any data visualization and charts is rooted in data modeling and most modern tech solutions have it built in so you may be already modeling data without even realizing it.

One quick solution to help to classify how the data should be used for analysis would be a data dictionary, which is defined as a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.

You might be able to find a data dictionary in the help pages of source systems or from GitHub repositories. If you don't receive one from the creator of the file, you can create one for yourself and use it to ask questions about the data including assumed data types, data quality, and identifying data gaps.

Creating a data dictionary also helps to validate assumptions and is an aid to frame questions about the data when communicating with others. The easiest method to create a data dictionary would be to transpose the first few rows of the source data so the rows turn into columns. If your data has a header row, then the first row turns into a list of all fields available. Let's walk through an example of how to create your own data dictionary from data. Here, we have a sourceSalestable representingProductandCustomersales by quarter:

Product

Customer

Quarter 1

Quarter 2

Quarter 3

Quarter 4

Product 1

Customer A

$ 1,000.00

$ 2,000.00

$ 6,000.00

Product 1

Customer B

$ 1,000.00

$ 500.00

Product 2

Customer A

$ 1,000.00

Product 2

Customer C

$ 2,000.00

$ 2,500.00

$ 5,000.00

Product 3

Customer A

$ 1,000.00

$ 2,000.00

Product 4

Customer B

$ 1,000.00

$ 3,000.00

Product 5

Customer A

$ 1,000.00

In the following table, I have transposed the preceding source table to create a new table for analysis, which creates an initial data dictionary. The first column on the left becomes a list of all of the fields available from the source table. As you can see from the fields, Record 1 to Record 3 in the header row now become sample rows of data but retain the integrity of each row from the source table. The last two columns on the right in the following table, labeled Estimated Data Type and Dimension or Measure, were added to help to define the use of this data for analysis. Understanding the data type and classifying each field as a dimension or measure will help to determine what type of analysis we can perform and how each field can be used in data visualizations:

Field Name

Record 1

Record 2

Record 3

Estimated Data Type

Dimension or Measure

Product

Product 1

Product 1

Product 2

varchar

Dimension

Customer

Customer A

Customer B

Customer A

varchar

Dimension

Quarter 1

$ 1,000.00

float

Measure

Quarter 2

$ 2,000.00

$ 1,000.00

$ 1,000.00

float

Measure

Quarter 3

$ 6,000.00

$ 500.00

float

Measure

Quarter 4

float

Measure

Using this technique can help you to ask the following questions about the data to ensure you understand the results:

  • What year does this dataset represent or is it an accumulation of multiple years?
  • Does each quarter represent a calendar year or fiscal year?
  • Was Product 5 first introduced in Quarter 4, because there are no prior sales for that product by any customer in Quarter 1 to Quarter 3?

Arguing about the data

Finally, let's talk about how and why we should argue about data. Challenging and defending the numbers in charts or data tables helps to build credibility and is actually done in many cases behind the scenes. For example, most data engineering teams put in various checks and balances such as alerts during ingestion to avoid missing information. Additional checks would also include rules to look into log files for anomalies or errors in the processing of data.

From a consumer's perspective, trust and verify is a good approach. For example, when looking at a chart published in a credible news article, you can assume the data behind the story is accurate but you should also verify the accuracy of the source data. The first thing to ask would be: does the underlying chart include a source to the dataset that is publicly available? The websitefivethirtyeight.comis really good at providing access to the raw data and details of methodologies used to create analysis and charts found in news stories. Exposing the underlining dataset and the process used to collect it to the public opens up conversations about the how, what, and why behind the data and is a good method to disprove misinformation.

As a data analyst and creator of data outputs, the ability to defend your work should be received with open arms. Having documentation such as a data dictionary and GitHub repository and documenting the methodology used to produce the data will build trust with the audience and reduce the time for them to make data-driven decisions.

Hopefully, you now see the importance of data literacy and how it can be used to improve all aspects of communication of data between consumers and producers. With any language, practice will lead to improvement, so I invite you to explore some useful free datasets to improve your data literacy.

Here are a few to get started:

Let's begin with the Kagglesite, which was created to help companies to host data science competitions to solve complex problems using data. Improve your reading and working with data literacy skills by exploring these datasets and walking through the concepts learned in this chapter such as identifying the data type for each field and confirming a data dictionary exists.

Next is the supporting data from FiveThirtyEight, which is a data journalism site providing analytic content from sports to politics. What I like about their process is the offer of transparency behind the news stories published by exposing open GitHub links to their source data and discussions about their methodology behind the data.

Another important open source for data would be The World Bank, which offers a plethora of options to consume or produce data across the world to help to improve life through data. Most of the datasets are licensed under a Creative Commons license, which governs the terms of how and when data can be used, but making them freely available opens up opportunities to blend public and private data together with significant time savings.

Summary

Let's look back at what we learned in this chapter and the skills obtained before we move forward. First, we covered a brief history of data analysis and the technological evolution of data by paying homage to the people and milestone events that made working with data possible using modern tools and techniques. We walked through an example of how to summarize these events using a data visual trend chart that showed how recent technology innovations have transformed the data industry.

We focused on why data has become important to make decisions from both a consumer and producer perspective by discussing the concepts for identifying and classifying data using structured, semi-structured, and unstructured examples and the 3Vsof big data: Volume, Velocity, and Variety.

We answered the question of what makes a good data analyst using the techniques of KYD, VOC, and ABA.

Then, we went deeper into understandingdata types by walking through the differences between numbers (integer and float) versus strings (text, time, dates, and coordinates). This includedbreaking down data classifications (continuous, categorical, and discrete) and understanding data attribute types.

We wrapped up this chapter by introducing the concept of data literacyand its importance to the consumers and producers of data by improving communication between them.

In our next chapter,we will get more hands-on by installing and setting up an environment for data analysis and so begin the journey of applying the concepts learned about data.

Further reading

Here are some links that you can refer to for gathering more information about the following topics:

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Find out how to use Python code to extract insights from data using real-world examples
  • Work with structured data and free text sources to answer questions and add value using data
  • Perform data analysis from scratch with the help of clear explanations for cleaning, transforming, and visualizing data

Description

Data literacy is the ability to read, analyze, work with, and argue using data. Data analysis is the process of cleaning and modeling your data to discover useful information. This book combines these two concepts by sharing proven techniques and hands-on examples so that you can learn how to communicate effectively using data. After introducing you to the basics of data analysis using Jupyter Notebook and Python, the book will take you through the fundamentals of data. Packed with practical examples, this guide will teach you how to clean, wrangle, analyze, and visualize data to gain useful insights, and you'll discover how to answer questions using data with easy-to-follow steps. Later chapters teach you about storytelling with data using charts, such as histograms and scatter plots. As you advance, you'll understand how to work with unstructured data using natural language processing (NLP) techniques to perform sentiment analysis. All the knowledge you gain will help you discover key patterns and trends in data using real-world examples. In addition to this, you will learn how to handle data of varying complexity to perform efficient data analysis using modern Python libraries. By the end of this book, you'll have gained the practical skills you need to analyze data with confidence.

Who is this book for?

This book is for aspiring data analysts and data scientists looking for hands-on tutorials and real-world examples to understand data analysis concepts using SQL, Python, and Jupyter Notebook. Anyone looking to evolve their skills to become data-driven personally and professionally will also find this book useful. No prior knowledge of data analysis or programming is required to get started with this book.

What you will learn

  • Understand the importance of data literacy and how to communicate effectively using data
  • Find out how to use Python packages such as NumPy, pandas, Matplotlib, and the Natural Language Toolkit (NLTK) for data analysis
  • Wrangle data and create DataFrames using pandas
  • Produce charts and data visualizations using time-series datasets
  • Discover relationships and how to join data together using SQL
  • Use NLP techniques to work with unstructured data to create sentiment analysis models
  • Discover patterns in real-world datasets that provide accurate insights
Estimated delivery fee Deliver to Austria

Premium delivery 7 - 10 business days

€17.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jun 19, 2020
Length: 322 pages
Edition : 1st
Language : English
ISBN-13 : 9781838826031
Category :
Languages :
Concepts :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
Estimated delivery fee Deliver to Austria

Premium delivery 7 - 10 business days

€17.95
(Includes tracking information)

Product Details

Publication date : Jun 19, 2020
Length: 322 pages
Edition : 1st
Language : English
ISBN-13 : 9781838826031
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
€18.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
€189.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts
€264.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total 96.97
Python Data Cleaning Cookbook
€36.99
Practical Data Analysis Using Jupyter Notebook
€29.99
Python Data Analysis
€29.99
Total 96.97 Stars icon

Table of Contents

17 Chapters
Section 1: Data Analysis Essentials Chevron down icon Chevron up icon
Fundamentals of Data Analysis Chevron down icon Chevron up icon
Overview of Python and Installing Jupyter Notebook Chevron down icon Chevron up icon
Getting Started with NumPy Chevron down icon Chevron up icon
Creating Your First pandas DataFrame Chevron down icon Chevron up icon
Gathering and Loading Data in Python Chevron down icon Chevron up icon
Section 2: Solutions for Data Discovery Chevron down icon Chevron up icon
Visualizing and Working with Time Series Data Chevron down icon Chevron up icon
Exploring, Cleaning, Refining, and Blending Datasets Chevron down icon Chevron up icon
Understanding Joins, Relationships, and Aggregates Chevron down icon Chevron up icon
Plotting, Visualization, and Storytelling Chevron down icon Chevron up icon
Section 3: Working with Unstructured Big Data Chevron down icon Chevron up icon
Exploring Text Data and Unstructured Data Chevron down icon Chevron up icon
Practical Sentiment Analysis Chevron down icon Chevron up icon
Bringing It All Together Chevron down icon Chevron up icon
Works Cited Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Most Recent
Rating distribution
Full star icon Full star icon Full star icon Half star icon Empty star icon 3.9
(9 Ratings)
5 star 55.6%
4 star 11.1%
3 star 11.1%
2 star 11.1%
1 star 11.1%
Filter icon Filter
Most Recent

Filter reviews by




Amazon Customer Nov 27, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Being a novice, I found this book very well written, informative and above all, helpful. The author knows how to explain concepts in easy to understand language - I didn’t have to google every other word.
Amazon Verified review Amazon
Josef Bauer Aug 19, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Für mich genau der richtige Einstieg in die Data Analysis Welt und Jupyter Notebook. Etwas Python-Kenntnisse sind allerdings sehr von Vorteil.
Amazon Verified review Amazon
Captain Glanton Jun 08, 2021
Full star icon Full star icon Empty star icon Empty star icon Empty star icon 2
Unsuitable for engineering professionals, you’d be better off with the internet. For business types, this book might work.
Amazon Verified review Amazon
Sam2013 Mar 31, 2021
Full star icon Empty star icon Empty star icon Empty star icon Empty star icon 1
This book was lazily written and published. There are WAY TOO MANY TYPOS, and too make things worse, most of the typos are in the python commands themselves!!! I'm not even half way through the book yet because I'm constantly trying to assess if a command has a typo in it and then do an independent google search to fix the typo myself. There were also times where certain keyword arguments are left out of a function call that made me seriously question the credibility of Mark Wintjen's python skills. If you are going to write a book for a beginner in data analytics and data science, SPEND MORE TIME PROOFREADING BEFORE YOU PUBLISH.On the bright side, Marc Wintjen does a good job relaying general data analytics/science information and history to the reader. Maybe he should stick to that instead of teaching python.
Amazon Verified review Amazon
Richard Lyons Mar 15, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Step by step guides are essential!
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela