Practical Data Analysis

You're reading from Practical Data Analysis: Pandas, MongoDB, Apache Spark, and more

Product type Paperback
Published in Sep 2016
Publisher
ISBN-13 9781785289712
Length 338 pages
Edition 2nd Edition
Languages
Authors (2):
Hector Cuesta
Dr. Sampath Kumar

Table of Contents (16 chapters)

Preface
1. Getting Started
2. Preprocessing Data
3. Getting to Grips with Visualization
4. Text Classification
5. Similarity-Based Image Retrieval
6. Simulation of Stock Prices
7. Predicting Gold Prices
8. Working with Support Vector Machines
9. Modeling Infectious Diseases with Cellular Automata
10. Working with Social Graphs
11. Working with Twitter Data
12. Data Processing and Aggregation with MongoDB
13. Working with MapReduce
14. Online Data Analysis with Jupyter and Wakari
15. Understanding Data Processing using Apache Spark

Data, information, and knowledge

Data are facts about the world. A piece of data represents a fact or a statement of an event, without relation to other things. Data come in many forms and from many sources, such as web pages, sensors, devices, audio, video, networks, log files, social media, and transactional applications. Most of these data are generated in real time and on a very large scale. Although data are generally alphanumeric (text, numbers, and symbols), they can also consist of images or sound. Data consist of raw facts and figures and have no meaning until they are processed. For example, financial transactions, age, temperature, and the number of steps from my house to my office are simply numbers. Information appears when we work with those numbers and find value and meaning in them.

Information can be considered an aggregation of data. Information usually has some meaning and purpose, and it can help us make decisions more easily. After processing the data, we obtain information within a context that gives it proper meaning. In computer jargon, a relational database makes information from the data stored within it.
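As a minimal sketch of this idea, the following Python snippet aggregates a handful of raw temperature readings into a small summary. The readings and the `summarize` helper are hypothetical and are used only for illustration; they do not come from the book's datasets.

```python
from statistics import mean

# Hypothetical raw data: temperature readings in degrees Celsius.
# On their own, these are just numbers without context.
readings = [21.5, 22.0, 19.5, 23.0, 24.0]


def summarize(values):
    """Aggregate raw values into a small summary (the 'information')."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


summary = summarize(readings)
print(summary)
```

The individual readings are the data; the summary, read in a context such as "this week's office temperature", is the information that can support a decision.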

Knowledge is information with meaning. Knowledge arises only when human experience and insight are applied to data and information. We can talk about knowledge when the data and the information turn into a set of rules that assist decisions. In fact, we can't store knowledge, because it implies the theoretical or practical understanding of a subject. The ultimate purpose of knowledge is value creation.

Inter-relationship between data, information, and knowledge

The relationship between data, information, and knowledge is cyclical. The following diagram shows this relationship, including the transformation of data into information and vice versa, and likewise between information and knowledge. Processed and analyzed data give us information, and when we apply valuable information according to context and purpose, it becomes knowledge. When looking at the transformation of data into information and of information into knowledge, we should concentrate on the context, purpose, and relevance of the task.

Figure: Inter-relationship between data, information, and knowledge

Now I would like to discuss these relationships with a real-life example:

Our students conducted a survey for their project, with the purpose of collecting data on customer satisfaction with a product and reaching a conclusion about reducing the price of that product. As it was a real project, our students had to reach a final decision on how to satisfy the customers. The data collected by the survey were processed and a final report was prepared. Based on the project report, the manufacturer of that product has since reduced the cost. Let's take a look at the following:

  • Data: Facts from the survey.
    • For example: Number of customers who purchased the product, satisfaction levels, competitor information, and so on.

  • Information: Project report.
    • For example: Satisfaction level related to price, based on the competitor product.

  • Knowledge: The manufacturer learned what to do to improve customer satisfaction and increase product sales.
    • For example: The manufacturing cost of the product, transportation cost, quality of the product, and so on.

Finally, we can say that the data-information-knowledge hierarchy is a useful way to think about these concepts. Although we cannot store knowledge itself, by using predictive analytics we can simulate intelligent behavior and provide a good approximation. The following image shows an example of how to turn data into knowledge:

Figure: Inter-relationship between data, information, and knowledge

The nature of data

Data is the plural of datum, so it is traditionally treated as plural. We can find data everywhere in the world around us, structured or unstructured, continuous or discrete: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material for any kind of human activity. According to the Oxford English Dictionary, data are

"known facts or things used as basis for inference or reckoning".

As shown in the following image, we can classify data in two distinct ways, categorical and numerical:

Figure: The nature of data

Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical values, nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, housing is a nominal variable with two categories (own and rent). An ordinal variable has an established ordering. For example, age can be treated as an ordinal variable with three ordered categories (young, adult, and elder).
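The difference between nominal and ordinal values can be sketched in plain Python. The sample observations below are made up for illustration, and encoding the age categories as ranks is just one possible way to represent an ordering.

```python
from collections import Counter

# Nominal: the categories have no intrinsic order, so the meaningful
# operations are counting and testing for equality.
housing = ["own", "rent", "rent", "own", "rent"]  # hypothetical observations
housing_counts = Counter(housing)

# Ordinal: the categories have an established order, encoded here as a rank.
AGE_ORDER = ["young", "adult", "elder"]
AGE_RANK = {label: rank for rank, label in enumerate(AGE_ORDER)}

ages = ["adult", "young", "elder", "young"]  # hypothetical observations
sorted_ages = sorted(ages, key=AGE_RANK.__getitem__)
oldest = max(ages, key=AGE_RANK.__getitem__)

print(housing_counts)
print(sorted_ages, oldest)
```

Sorting or taking a maximum only makes sense for the ordinal variable; for the nominal one, any ordering of "own" and "rent" would be arbitrary.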

Numerical data are values or observations that can be measured. There are two kinds of numerical values, discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate, for example, the number of lines in a piece of code. Continuous data are values or observations that may take on any value within a finite or infinite interval, for example, an economic time series such as historic gold prices.
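A rough illustration of the two kinds of numerical values, using the examples above. The source snippet and the prices are invented for this sketch and are not real gold prices.

```python
# Discrete: a count of distinct, separate items, naturally an integer.
source = "import math\n\nprint(math.pi)\n"  # hypothetical piece of code
line_count = len(source.splitlines())

# Continuous: measurements that may take any value within an interval,
# naturally represented as floating-point numbers.
prices = [1250.75, 1262.40, 1248.10]  # hypothetical prices, not real data

# Any value between two observations is itself a meaningful price,
# for example the midpoint of the first two.
midpoint = (prices[0] + prices[1]) / 2

print(line_count, midpoint)
```

There is no meaningful "half a line" between two line counts, whereas a price halfway between two observed prices is a perfectly valid value.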

The kinds of datasets used in this book are the following:

  • E-mails (unstructured, discrete)
  • Digital images (unstructured, discrete)
  • Stock market logs (structured, continuous)
  • Historic gold prices (structured, continuous)
  • Credit approval records (structured, discrete)
  • Social media friend relationships (unstructured, discrete)
  • Tweets and trending topics (unstructured, continuous)
  • Sales records (structured, continuous)

For each of the projects in this book, we try to use a different kind of data. The book aims to give the reader the ability to address different kinds of data problems.
