Statistics for Data Science

Statistics for Data Science: Leverage the power of statistics for Data Analysis, Classification, Regression, Machine Learning, and Neural Networks

Transitioning from Data Developer to Data Scientist

In this chapter (and throughout all of the chapters of this book), we will chart your course for starting and continuing the journey from thinking like a data developer to thinking like a data scientist.

Using developer terminologies and analogies, we will discuss a developer's objectives, what a typical developer mindset might be like, how it differs from a data scientist's mindset, why there are important differences (as well as similarities) between the two and suggest how to transition yourself into thinking like a data scientist. Finally, we will suggest certain advantages of understanding statistics and data science, taking a data perspective, as well as simply thinking like a data scientist.

In this chapter, we've broken things into the following topics:

  • The objectives of the data developer role
  • How a data developer thinks
  • The differences between a data developer and a data scientist
  • Advantages of thinking like a data scientist
  • The steps for transitioning into a data scientist mindset

So, let's get started!

Data developer thinking

Having spent plenty of years wearing the hat of a data developer, it makes sense to start out here with a few quick comments about data developers.

In some circles, a database developer is the equivalent of a data developer. But whether data or database, both would usually be labeled as information technology (IT) professionals. Both spend their time working on or with data and database technologies.

We may see a split between those database (data) developers who focus more on support and routine maintenance (such as administrators) and those who focus more on improving, expanding, and otherwise developing access to data (such as developers).

Your typical data developer will primarily be involved with creating and maintaining access to data rather than consuming that data. He or she will have input into, or may make decisions on, the choice of programming languages for accessing or manipulating data. Data developers also make sure that new data projects adhere to rules on how databases store and handle data, and they create interfaces between data sources.

In addition, some data developers are involved with reviewing and tuning queries written by others and, therefore, must be proficient in the latest tuning techniques, various query languages such as Structured Query Language (SQL), as well as how the data being accessed is stored and structured.

In summary, at least strictly from a data developer's perspective, the focus is all about access to valuable data resources rather than the consumption of those valuable data resources.

Objectives of a data developer

Every role, position, or job post will have its own list of objectives, responsibilities, or initiatives.

As such, in the role of a data developer, one may be charged with some of the following responsibilities:

  • Maintaining the integrity of a database and infrastructure
  • Monitoring and optimizing to maintain levels of responsiveness
  • Ensuring quality and integrity of data resources
  • Providing appropriate levels of support to communities of users
  • Enforcing security policies on data resources

As a data scientist, you will note somewhat different objectives. This role will typically include some of the objectives listed here:

  • Mining data from disparate sources
  • Identifying patterns or trending
  • Creating statistical models—modeling
  • Learning and assessing
  • Identifying insights and predicting

Do you perhaps notice a theme beginning here?

Note the keywords:

  • Maintaining
  • Monitoring
  • Ensuring
  • Providing
  • Enforcing

These terms imply different notions than those terms that may be more associated with the role of a data scientist, such as the following:

  • Mining
  • Trending
  • Modeling
  • Learning
  • Predicting

There are also, of course, some activities that may seem common to both a data developer and a data scientist; these are examined here.

Querying or mining

As a data developer, you will almost always be in the habit of querying data. Indeed, a data scientist will query data as well. So, what is data mining? Well, when one queries data, one expects to ask a specific question. For example, you might ask, "What was the total number of daffodils sold in April?" expecting to receive back a known, relevant answer such as "In April, daffodil sales totaled 269 plants."

With data mining, one is usually more absorbed in the data relationships (or the potential relationships between points of data, sometimes referred to as variables) and cognitive analysis. A simple example might be: how does the average daily temperature during the month affect the total number of daffodils sold in April?

Another important distinction between data querying and data mining is that queries are typically historic in nature in that they are used to report past results (total sales in April), while data mining techniques can be forward thinking in that through the use of appropriate statistical methods, they can infer a future result or provide the probability that a result or event will occur. For example, using our earlier example, we might predict higher daffodil sales when the average temperature rises within the selling area.
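
The contrast between querying and mining can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the daily records below are made up, and the hand-written pearson helper is a hypothetical name introduced here. The query answers one specific question, while the mining step asks how two variables relate.

```python
import math

# Hypothetical daily records: (month, average temperature in C, daffodils sold).
# All values are made up for illustration.
sales = [
    ("April", 8, 5),
    ("April", 10, 7),
    ("April", 12, 9),
    ("April", 14, 12),
    ("April", 16, 13),
    ("March", 6, 3),
]

# Querying: a specific question with a single, known answer.
april = [row for row in sales if row[0] == "April"]
total_april = sum(sold for _, _, sold in april)

# Mining: look for a relationship between two variables.
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

temps = [temp for _, temp, _ in april]
sold = [count for _, _, count in april]
r = pearson(temps, sold)

print(f"Total daffodils sold in April: {total_april}")
print(f"Correlation between temperature and sales: {r:.2f}")
```

A correlation close to 1 in data like this would suggest that warmer days go with higher sales, which is the kind of relationship a plain query would never surface on its own.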

Data quality or data cleansing

Do you think a data developer is interested in the quality of data in a database? Of course, a data developer needs to care about the level of quality of the data they support or provide access to. For a data developer, the process of data quality assurance (DQA) within an organization is more mechanical in nature, such as ensuring data is current and complete and stored in the correct format.

With data cleansing, you see the data scientist put more emphasis on the concept of statistical data quality. This includes using relationships found within the data to improve the levels of data quality. As an example, an individual whose age is nine should not be labeled or shown as part of a group of legal drivers in the United States; that would be incorrectly labeled data.
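
A relationship-based check of this kind might look like the following minimal sketch. The records, the field names, and the age threshold are all assumptions made for illustration (minimum driving ages actually vary by state).

```python
# Hypothetical customer records; names, fields, and values are made up.
customers = [
    {"name": "Ana", "age": 34, "group": "legal_driver"},
    {"name": "Ben", "age": 9, "group": "legal_driver"},
    {"name": "Cy", "age": 15, "group": "student"},
]

# Assumed threshold for this sketch only; real minimum ages vary by state.
MIN_DRIVING_AGE = 16

def find_mislabeled(records, min_age=MIN_DRIVING_AGE):
    """Return records labeled as legal drivers whose age contradicts the label."""
    return [
        record for record in records
        if record["group"] == "legal_driver" and record["age"] < min_age
    ]

suspect = find_mislabeled(customers)
for record in suspect:
    print(f"Check record: {record['name']} (age {record['age']}) "
          f"is labeled {record['group']}")
```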

You may be familiar with the term munging data. Munging may be sometimes defined as the act of tying together systems and interfaces that were not specifically designed to interoperate. Munging can also be defined as the processing or filtering of raw data into another form for a particular use or need.

Data modeling

Data developers create designs (or models) for data by working closely with key stakeholders based on given requirements such as the ability to rapidly enter sales transactions into an organization's online order entry system. During model design, there are three kinds of data models the data developer must be familiar with (conceptual, logical, and physical), each relatively independent of the others.

Data scientists create models with the intention of training with data samples or populations to identify previously unknown insights or validate current assumptions.

Modeling data can become complex, and therefore, it is common to see a distinction between the role of data development and data modeling. In these cases, a data developer concentrates on evaluating the data itself, creating meaningful reports, while data modelers evaluate how to collect, maintain, and use the data.

Issue or insights

A lot of a data developer's time may be spent monitoring data, users, and environments, looking for any indications of emerging issues such as unexpected levels of usage that may cause performance bottlenecks or outages. Other common duties include auditing, application integrations, disaster planning and recovery, capacity planning, change management, database software version updating, load balancing, and so on.

Data scientists spend their time evaluating and analyzing data and information in an effort to discover valuable new insights. Hopefully, once established, insights can then be used to make better business decisions.

There is a related concept to grasp; through the use of analytics, one can identify patterns and trends within data, while an insight is a value obtained through the use of the analytical outputs.

Thought process

Someone's mental procedures or cognitive activity based on interpretations, past experiences, reasoning, problem-solving, imagining, and decision making make up their way of thinking or their thought process.

One can only guess how particular individuals will actually think, or their exact thoughts at a given point of time or during an activity, or what thought process they will use to accomplish their objectives, but in general terms, a data developer may spend more time thinking about data convenience (making the data available as per the requirements), while data scientists are all about data consumption (devising new ways to leverage the data to find insights into existing issues or new opportunities).

To paint a clearer picture, you might use the analogy of the auto mechanic and the school counselor.

An auto mechanic will use his skills along with appropriate tools to keep an automobile available to its owner and running well, or if there has been an issue identified with a vehicle, the mechanic will perform diagnosis for the symptoms presented and rectify the problem. This is much like the activities of a data developer.

A counselor, by contrast, might examine a vast amount of information regarding a student's past performance, personality traits, as well as economic statistics to determine what opportunities may exist in a particular student's future. In addition, multiple scenarios may be studied to predict what the best outcomes might be, based on this individual student's resources.

Clearly, both aforementioned individuals provide valuable services but use (maybe very) different approaches and individual thought processes to produce the desired results.

Although there is some overlapping, when you are a data developer, your thoughts are normally around maintaining convenient access to appropriate data resources but not particularly around the data's substance, that is, you may care about data types, data volumes, and accessibility paths but not about whether or what cognitive relationships exist or the powerful potential uses for the data.

In the next section, we will explore some simple circumstances in an effort to show various contrasts between the data developer and the data scientist.

Developer versus scientist

To better understand the differences between a data developer and a data scientist, let's take a little time here and consider just a few hypothetical (yet still realistic) situations that may occur during your day.

New data, new source

What happens when new data or a new data source becomes available or is presented?

Here, new data usually means that more current or more up-to-date data has become available. An example of this might be receiving a file each morning of the latest month-to-date sales transactions, usually referred to as an actual update.

In the business world, data can be either real (actual) as in the case of an authenticated sale, or sale transaction entered in an order processing system, or supposed as in the case of an organization forecasting a future (not yet actually occurred) sale or transaction.

You may receive files of data periodically from an online transactions processing system, which provide the daily sales or sales figures from the first of the month to the current date. You'd want your business reports to show the total sales numbers that include the most recent sales transactions.

The idea of a new data source is different. If we use the same sort of analogy as we used previously, an example of this might be a file of sales transactions from a company that a parent company newly acquired. Perhaps another example would be receiving data reporting the results of a recent online survey. This is information that's collected with a specific purpose in mind and typically is not (but could be) a routine event.

Machine (and otherwise) data is accumulating even as you are reading this, providing new and interesting data sources creating a market for data to be consumed. One interesting example might be Amazon Web Services (https://aws.amazon.com/datasets/). Here, you can find massive resources of public data, including the 1000 Genomes Project (the attempt to build the most comprehensive database of human genetic information) as well as NASA's database of satellite imagery of the Earth.

In the previous scenarios, a data developer would most likely be (should be) expecting updated files and have implemented the Extract, Transform, and Load (ETL) processes to automatically process the data, handle any exceptions, and ensure that all the appropriate reports reflect the latest, correct information. Data developers would also deal with transitioning a sales file from a newly acquired company but probably would not be a primary resource for dealing with survey results (or the 1000 Genomes Project).
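
As a rough illustration of such an ETL process, the sketch below extracts rows from a small in-memory stand-in for the morning sales file, sets aside rows that fail conversion, and loads the rest into a SQLite table. The file contents, column names, and table layout are all assumptions; a production job would add logging, scheduling, and far richer exception handling.

```python
import csv
import io
import sqlite3

# Stand-in for the morning file of month-to-date sales; contents are made up.
RAW_FILE = """order_id,sale_date,amount
1001,2017-04-01,25.00
1002,2017-04-02,not_a_number
1003,2017-04-02,40.50
"""

def extract(text):
    """Extract: read the delimited file into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert types and set aside rows that cannot be converted."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append(
                (int(row["order_id"]), row["sale_date"], float(row["amount"]))
            )
        except ValueError:
            rejected.append(row)  # exception handling: keep for later review
    return clean, rejected

def load(conn, rows):
    """Load: insert (or refresh) the rows in the reporting table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
clean_rows, rejected_rows = transform(extract(RAW_FILE))
load(conn, clean_rows)

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(f"Loaded {len(clean_rows)} rows, rejected {len(rejected_rows)}, "
      f"total sales {total:.2f}")
```

Using INSERT OR REPLACE keyed on order_id is one possible design choice here: it lets the same month-to-date file be reprocessed each morning without double-counting earlier transactions.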

Data scientists are not involved in the daily processing of data (such as sales) but will be directly responsible for a survey results project. That is, the data scientist is almost always hands-on with initiatives such as researching and acquiring new sources of information for projects involving surveying. Data scientists most likely would have input even in the designing of surveys as they are the ones who will be using that data in their analysis.

Quality questions

Suppose there are concerns about the quality of the data to be, or being, consumed by the organization. As we alluded to earlier in this chapter, there are different types of data quality concerns such as what we called mechanical issues as well as statistical issues (and there are others).

Current trending examples of the most common statistical quality concerns include duplicate entries and misspellings, misclassification and aggregation, and changing meanings.

If management is questioning the validity of the total sales listed on a daily report, or perhaps doesn't trust the data because the majority of your customers are shown as not legally able to drive in the United States, or because the number of the organization's repeat customers appears to be declining, you have a quality issue.

Quality is a concern to both the data developer and the data scientist. A data developer focuses more on timing and formatting (the mechanics of the data), while the data scientist is more interested in the data's statistical quality (with priority given to issues with the data that may potentially impact the reliability of a particular study).
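
Two of the statistical quality concerns mentioned above, duplicate entries and misspellings, can be checked with a short sketch like the following. The entries and the reference list of known products are made up, and the 0.8 similarity cutoff is an arbitrary assumption that would need tuning on real data.

```python
import difflib
from collections import Counter

# Made-up product names as they might arrive from several source systems.
entries = ["Daffodil", "daffodil ", "Tulip", "Daffodill", "Tulip"]
KNOWN_PRODUCTS = ["Daffodil", "Tulip"]  # hypothetical reference list

def normalize(name):
    """Remove case and surrounding-whitespace differences before comparing."""
    return name.strip().lower()

# Duplicates: values that are identical once normalized.
counts = Counter(normalize(entry) for entry in entries)
duplicates = sorted(name for name, n in counts.items() if n > 1)

# Possible misspellings: not an exact match, but close to a known product.
known = {normalize(product): product for product in KNOWN_PRODUCTS}
suggestions = {}
for entry in entries:
    key = normalize(entry)
    if key not in known:
        close = difflib.get_close_matches(key, list(known), n=1, cutoff=0.8)
        if close:
            suggestions[entry] = known[close[0]]

print("Duplicated values:", duplicates)
print("Suggested corrections:", suggestions)
```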

Querying and mining

Historically, the information technology group or department has been beseeched by a variety of business users to produce and provide reports showing information stored in databases and systems that are of interest.

These ad hoc reporting requests have evolved into requests for on-demand raw data extracts (rather than formatted or pretty printed reports) so that business users could then import the extracted data into a tool such as MS Excel (or others), where they could then perform their own formatting and reporting, or perform further analysis and modeling. In today's world, business users demand more self-service (even mobile) abilities to meet their organization's (or an individual's) analytical and reporting needs, expecting to have access to the updated raw data stores, directly or through smaller, focus-oriented data pools.

If business applications cannot supply the necessary reporting on their own, business users often will continue their self-service journey.
- Christina Wong (www.datainformed.com)

Creating ad hoc reports and performing extracts based on specific on-demand needs or providing self-service access to data falls solely to the role of the organization's data developer. However, take note that a data scientist will want to periodically perform his or her own querying and extracting—usually as part of a project they are working on. They may use these query results to determine the viability and availability of the data they need or as part of the process to create a sampling or population for specific statistical projects. This form of querying may be considered to be a form of data mining and goes much deeper into the data than queries might. This work effort is typically performed by a data scientist rather than a data developer.

Performance

You can bet that pretty much everyone is, or will be, concerned with the topic of performance. Some forms (of performance) are perhaps a bit more quantifiable, such as what is an acceptable response time for an ad hoc query or extract to complete? Or perhaps what are the total number of mouse-clicks or keystrokes required to enter a sales order? Others may be a bit more difficult to answer or address, such as why does it appear that there is a downward trend in the number of repeat customers?

It is the responsibility of the data developer to create and support data designs (even be involved with infrastructure configuration options) that consistently produce swift response times and are easy to understand and use.

One area of performance responsibility that may be confusing is in the area of website performance. For example, if an organization's website is underperforming, is it because certain pages are slow to load or uninteresting and/or irrelevant to the targeted audience or customer? In this example, both a data developer and a data scientist may be directed to address the problem.

These individuals—data developers—would not play a part in survey projects. The data scientist, on the other hand, will not be included in day-to-day transactional (or similar) performance concerns but would be the key responsible person to work with the organization's stakeholders by defining and leading a statistical project in an effort to answer a question such as the one concerning repeat-customer counts.

Financial reporting

In every organization, there is a need to produce regular financial statements (such as an Income Statement, Balance Sheet, or Cash Flow statement). Financial reporting (or Fin reporting) is looking to answer key questions regarding the business, such as the following:

  • Are we making a profit or losing money?
  • How do assets compare to liabilities?
  • How much free cash do we have or need?

The process of creating, updating, and validating regular financial statements is a mandatory task for any business—profit or non-profit based—of just about any size, whether public or private. Organizations, still today, are not all using fully automated reporting solutions. This means that even the task of updating a single report with the latest data could be a daunting ordeal.

Financial reporting is one area that is (pretty) clearly defined within the industry as far as responsibilities go. A data developer would be the one to create and support the processing and systems that make the data available, ensure its correctness, and even (in some cases) create and distribute reports.

Over 83 percent of businesses in the world today utilize MS Excel for Month End close and reporting
- https://venasolutions.com/

Typically, a data developer would work to provide and maintain the data to feed these efforts.

Data scientists typically do not support an organization's routine processing and (financial) reporting efforts. A data scientist would, however, perform analysis of the produced financial information (and supporting data) to produce reports and visualizations indicating insights around management performance in profitability, efficiency, and risk (to name a few).

One particularly interesting area of statistics and data science is when a data scientist performs a vertical analysis to identify relationships of variables to a base amount within an organization's financial statement.
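
A vertical analysis expresses each line of a statement as a percentage of a chosen base amount. The sketch below assumes revenue as the base for an income statement; the line items and figures are made up, and vertical_analysis is a hypothetical helper name.

```python
# Hypothetical income statement lines; all figures are made up.
income_statement = {
    "Revenue": 200_000,
    "Cost of goods sold": 120_000,
    "Operating expenses": 50_000,
    "Net income": 30_000,
}

def vertical_analysis(statement, base_item):
    """Express every line item as a percentage of the chosen base amount."""
    base = statement[base_item]
    return {
        item: round(100 * amount / base, 1)
        for item, amount in statement.items()
    }

percentages = vertical_analysis(income_statement, "Revenue")
for item, pct in percentages.items():
    print(f"{item:<20} {pct:>6.1f}% of revenue")
```

Because every line is scaled to the same base, statements from different periods (or companies of different sizes) can be compared directly.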

Visualizing

It is a common practice today to produce visualizations in a dashboard format that can show updated individual key performance indicators (KPI). Moreover, communicating a particular point or simplifying the complexities of mountains of data does not require the use of data visualization techniques, but in some ways, today's world may demand it.

Most would likely agree that scanning numerous worksheets, spreadsheets, or reports is mundane and tedious at best while looking at charts and graphs (such as a visualization) is typically much easier on the eyes. To that point, both the data developer and the data scientist will equally be found designing, creating, and using data visualizations. The difference will be found in the types of visualizations being created. Data developers usually focus on the visualization of repetitive data points (forecast versus actuals, to name a common example), while data scientists use visualizations to make a point as part of a statistical project.

Again, a data developer most likely will leverage visualizations to illustrate or highlight, for example, sales volumes, month-to-month for the year, while a data scientist may use visualizations to predict potential sales volumes, month-to-month for next year, given seasonality (and other) statistics.

Tools of the trade

The tools and technologies used by individuals to access and consume data can vary significantly depending upon an assortment of factors such as the following:

  • The type of business
  • The type of business problem (or opportunity)
  • Security or legal requirements
  • Hardware and software compatibilities and/or prerequisites
  • The type and use of data
  • The specifics around the user communities
  • Corporate policies
  • Price

In an ever-changing technology climate, the data developer and data scientist have ever more, and perhaps overwhelming, choices including very viable open source options.

Open source software is software developed by and for the user community. The good news is that open source software is used in the vast majority, or 78 percent, of worldwide businesses today—Vaughan-Nichols, http://www.zdnet.com/. Open source is playing a continually important role in data science.

When we talk about tools and technologies, both the data developer and the data scientist will be equally involved in choosing the correct tool or technology that best fits their individual likes and dislikes and meets the requirements of the project or objective.

Advantages of thinking like a data scientist

So why should you, a data developer, endeavor to think like (or more like) a data scientist? What is the significance of gaining an understanding of the whys and hows of statistics? Specifically, what might be the advantages of thinking like a data scientist?

The following are just a few notions supporting the effort for making the move into data science:

  • Developing a better approach to understanding data
  • Using statistical thinking during the process of program or database designing
  • Adding to your personal toolbox
  • Increased marketability
  • Perpetual learning
  • Seeing the future

Developing a better approach to understanding data

Whether you are a data developer, systems analyst, programmer/developer, or data scientist, or other business or technology professional, you need to be able to develop a comprehensive relationship with the data you are working with or designing an application or database schema for.

Some might rely on the data specifications provided as part of the overall project plan or requirements, and some (usually those with more experience) may supplement their understanding by performing some generic queries on the data. Either way, this is seldom enough.

In fact, in industry case studies, unclear, misunderstood, or incomplete requirements or specifications consistently rank in the top five as reasons for project failure or added risk.

Profiling data is a process, characteristic of data science, aimed at establishing data intimacy (or a clearer and more concise grasp of the data and its inward relationships). Profiling data also establishes context, for which there are several general categories that can be used to augment or increase the value and understanding of data for any purpose or project.

These categories include the following:

  • Definitions and explanations: These help gain additional information or attributes about data points within your data
  • Comparisons: These help add a comparable value to a data point within your data
  • Contrasts: These help add an opposite to a data point to see whether it perhaps determines a different perspective
  • Tendencies: These are typical mathematical calculations, summaries, or aggregations
  • Dispersion: This includes mathematical calculations (or summaries) such as range, variance, and standard deviation, describing the spread of a dataset (or group within the data)

Think of data profiling as the process you may have used for examining data in a data file and collecting statistics and information about that data. Those statistics most likely drove the logic implemented in a program or how you related data in tables of a database.
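
The tendencies and dispersion categories above can be computed with Python's standard statistics module. This is a minimal profiling sketch over a made-up list of order quantities; a real profile would also cover missing values, data types, and the other contextual categories.

```python
import statistics

# Made-up order quantities pulled from a hypothetical data file.
quantities = [2, 4, 4, 4, 5, 5, 7, 9]

profile = {
    # Tendencies: where the values cluster
    "count": len(quantities),
    "mean": statistics.mean(quantities),
    "median": statistics.median(quantities),
    "mode": statistics.mode(quantities),
    # Dispersion: how widely the values spread
    "range": max(quantities) - min(quantities),
    "variance": statistics.pvariance(quantities),  # population variance
    "std_dev": statistics.pstdev(quantities),      # population std deviation
}

for name, value in profile.items():
    print(f"{name:>8}: {value}")
```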

Using statistical thinking during program or database designing

The process of creating a database design commonly involves several tasks that will be carried out by the database designer (or data developer). Usually, the designer will perform the following:

  1. Identify what data will be kept in the database.
  2. Establish the relationships between the different data points.
  3. Create a logical data structure to be used on the basis of steps 1 and 2.

Even during the act of application program designing, a thorough understanding of how the data works is essential. Without understanding average or default values, relationships between data points and grouping, and so on, the created application is at risk of failing.

One idea for applying statistical thinking to help with data designing is in the case where there is limited real data available. If enough data cannot be collected, one could create sample (test) data by a variety of sampling methods, such as probability sampling.

A probability-based sample is created by constructing a list of the target population values, called a sample frame, and then applying a randomized process for selecting records from the sample frame, which is called a selection procedure. Think of this as creating a script to generate records of sample data based on your knowledge of actual data as well as some statistical logic to be used for testing your designs.
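
One such script might look like the sketch below, which uses simple random sampling (one kind of probability sampling) as the selection procedure. The sample frame of order records is generated here purely for illustration, and the field names and value ranges are assumptions.

```python
import random

# Fixed seed so the sketch is reproducible; any seed would do.
rng = random.Random(42)

# Sample frame: a list standing in for the target population.
# Order IDs and amounts are made up.
sample_frame = [
    {"order_id": i, "amount": round(rng.uniform(5, 500), 2)}
    for i in range(1, 1001)
]

def simple_random_sample(frame, n, rng):
    """Selection procedure: each record has an equal chance of selection,
    and no record is selected twice (sampling without replacement)."""
    return rng.sample(frame, n)

test_records = simple_random_sample(sample_frame, 50, rng)
print(f"Selected {len(test_records)} of {len(sample_frame)} records")
print("First selected record:", test_records[0])
```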

Finally, approach any problem with scientific or statistical methods, and odds are you'll produce better results.

Adding to your personal toolbox

In my experience, most data developers tend to lock on to a technology or tool based upon a variety of factors (some of which we mentioned earlier in this chapter), becoming increasingly familiar with and (hopefully) more proficient with the product, tool, or technology, even its continuously released newer versions. One might suspect (and probably would be correct) that the more the developer uses the tool, the higher the skill level that he or she establishes. Data scientists, however, seem to lock on to methodologies, practices, or concepts more than the actual tools and technologies they use to implement them.

This turning of focus (from tool to technique) changes one's mindset to thinking "what tool best serves my objective?" rather than "how does this tool serve my objective?"

The more tools you are exposed to, the broader your thinking will become as a developer or data scientist. The open source community provides outstanding tools you can download, learn, and use freely. One should adopt a mindset of what's next or new to learn, even if it's in an attempt to compare features and functions of a new tool to your preferred tool. We'll talk more about this in the perpetual learning section of this chapter.

An exciting example of a currently popular data developer or data enabling tool is MarkLogic (http://www.marklogic.com/). This is an operational and transactional enterprise NoSQL database that is designed to integrate, store, manage, and search more data than ever before. MarkLogic received the 2017 DAVIES Award for best Data Development Tools. R and Python seem to be at the top as options for the data scientists.

It would not be appropriate to end this section without the mention of IBM Watson Analytics (https://www.ibm.com/watson/), currently transforming the way the industry thinks about statistical or cognitive thinking.

Increased marketability

Data science is clearly an ever-evolving field, with exponentially growing popularity. In fact, I'd guess that if you ask a dozen professionals, you'll most likely receive a dozen different definitions of what a data scientist is (and their place within a project or organization), but most likely, all would agree on the role's level of importance and that vast numbers of opportunities exist within the industry and the world today.

Data scientist face an unprecedented demand for more models, more insights...there's only one way to do that: They have to dramatically speed up the insights to action. In the future data Scientists, must become more productive. That's the only way they're going to get more value from the data.
- Gualtieri (https://www.datanami.com/2015/09/18/the-future-of-data-science/)

Data scientists are relatively hard to find today. If you do your research, you will find that today's data scientists may have a mixed background consisting of mathematics, programming, software design, experimental design, engineering, communication, and management skills. In practice, you'll see that most data scientists you find aren't specialists in any one aspect; rather, they possess varying levels of proficiency in several areas or backgrounds.

The role of the data scientist has unequivocally evolved since the field of statistics of over 1200 years ago. Despite the term only existing since the turn of this century, it has already been labeled The Sexiest Job of the 21st Century, which understandably, has created a queue of applicants stretched around the block
- Pearson (https://www.linkedin.com/pulse/evolution-data-scientist-chris-pearson)

Currently, there is no official data scientist job description (or prerequisite list for that matter). This presents you with the opportunity to create your own flavour of the data scientist, delivering value in new ways to your organization.

Perpetual learning

The idea of continued assessment or perpetual learning is an important statistical concept to grasp. A common definition of learning is the enhancement of one's skills of perception. For example, in statistics, we can refer to the idea of cross-validation. This is a statistical approach for measuring (assessing) a statistical model's performance. The practice involves setting aside a set of validation values and then running the model for a set number of rounds, each time using a different sample of the data, and then averaging the results of the rounds to ultimately see how good a model (or approach) might be at solving a particular problem or meeting an objective.

The expectation here is that, given the performance results, adjustments can be made to tweak the model so that it can identify insights when used with a real or full population of data. Not only is this a practice the data developer should use for refining or fine-tuning a data design or data-driven application process, but it is also great life advice in the form of try, learn, adjust, and repeat.
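The rounds-and-averaging idea behind cross-validation can be sketched in a few lines. This is a minimal illustration in Python rather than a definitive implementation: the function names, the made-up values, and the trivial "predict the training mean" model are all assumptions introduced here, not something taken from the book.

```python
import random
from statistics import mean

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k folds (k should not exceed n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_mean_model(train_values):
    """Hypothetical stand-in 'model': always predict the mean of the training values."""
    return mean(train_values)

def cross_validate(values, k=5, seed=0):
    """Run k rounds; each round holds out one fold for validation.

    Returns the mean absolute error of every round and the average over all rounds.
    """
    round_errors = []
    for held_out in k_fold_indices(len(values), k, seed):
        held_out_set = set(held_out)
        train = [v for i, v in enumerate(values) if i not in held_out_set]
        validation = [values[i] for i in held_out]
        prediction = fit_mean_model(train)
        round_errors.append(mean(abs(v - prediction) for v in validation))
    return round_errors, mean(round_errors)

# Made-up sample values, for illustration only.
values = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
errors, average_error = cross_validate(values, k=5)
print("error per round:", errors)
print("average error:", average_error)
```

Swapping in a different model and rerunning gives a comparable average error, which is the "try, learn, adjust, and repeat" loop in miniature.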

The idea of model assessment is not unique to statistics. Data developers might consider this similar to the act of predicting SQL performance or perhaps the practice of an application walkthrough where an application is validated against the intent and purpose stated within its documented requirements.

Seeing the future

Predictive modeling uses the statistics of data science to predict or foresee a result (actually, a probable result). This may sound a lot like fortune telling, but it is more about putting to use cognitive reasoning to interpret information (mined from data) to draw a conclusion. In the way that a scientist might be described as someone who acts in a methodical way, attempting to obtain knowledge or to learn, a data scientist might be thought of as trying to make predictions, using statistics and (machine) learning.

When we talk about predicting a result, it's really all about the probability of seeing a certain result. Probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events.
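To make the distinction concrete, here is a small Python sketch under stated assumptions: the list of past outcomes is made up, and the function name is hypothetical. Counting how often each outcome occurred is the statistics step; reading those relative frequencies as the likelihood of the next outcome is the probability step, which only holds if the future resembles the past.

```python
from collections import Counter

def empirical_probabilities(past_events):
    """Turn observed frequencies of past events into estimated probabilities."""
    counts = Counter(past_events)          # statistics: frequency of past events
    total = len(past_events)
    return {event: count / total for event, count in counts.items()}

# Made-up history of ten deliveries, for illustration only.
past_outcomes = ["on_time", "late", "on_time", "on_time", "late",
                 "on_time", "on_time", "on_time", "late", "on_time"]

probabilities = empirical_probabilities(past_outcomes)
print(probabilities)  # estimated likelihood of each outcome for a future delivery
```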

If you are a data developer who has perhaps worked on projects serving an organization's office of finance, you may understand why a business leader would find it valuable not just to report on the organization's financial results (even the most accurate of results are really still historical events) but also to be able to make educated assumptions about future performance.

Perhaps you can understand that if you have a background in, and are responsible for, financial reporting, you can now take the step towards adding statistical predictions to those reports!
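As one possible illustration of adding a prediction to a historical report, the following Python sketch fits a straight-line trend to past results by ordinary least squares and extends it one period ahead. The quarterly revenue figures are made up, the function name is hypothetical, and a straight-line trend is only one simple choice of model; it is not presented as the book's method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up historical results (quarter number, revenue), for illustration only.
quarters = [1, 2, 3, 4, 5, 6]
revenue = [100.0, 104.0, 109.0, 113.0, 118.0, 121.0]

slope, intercept = fit_line(quarters, revenue)
forecast_q7 = intercept + slope * 7   # a probable result, not a certainty
print("estimated revenue for quarter 7:", round(forecast_q7, 2))
```

The historical rows are what a report already contains; the fitted line is the extra statistical step that turns them into an educated assumption about the next period.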

Statistical modeling techniques can also be applied to any type of unknown event, regardless of when it occurred, such as in the case of crime detection and suspect identification.

Transitioning to a data scientist

Let's start this section by taking a moment to state what I consider to be a few generally accepted facts about transitioning to a data scientist. We'll reaffirm these beliefs as we continue through this book:

  • Academia: Data scientists are not all from one academic background. They are not all computer science or statistics/mathematics majors. They do not all possess an advanced degree (in fact, you can use statistics and data science with a bachelor's degree or even less).
  • It's not magic-based: Data scientists can use machine learning and other accepted statistical methods to identify insights from data, not magic.
  • They are not all tech or computer geeks: You don't need years of programming experience or expensive statistical software to be effective.
  • You don't need to be experienced to get started. You can start today, right now. (Well, you already did when you bought this book!)

Okay, having made the previous declarations, let's also be realistic. As always, there is an entry point for everything in life, and, to give credit where it is due, the more credentials you can acquire to begin with, the better off you will most likely be. Nonetheless (as we'll see later in this chapter), there is absolutely no valid reason why you cannot begin understanding, using, and being productive with data science and statistics immediately.

As with any profession, certifications and degrees carry weight that may open doors, while experience, as always, might be considered the best teacher. There are, however, no fake data scientists, only those who currently have more desire than practical experience.

If you are seriously interested in not only understanding statistics and data science but eventually working as a full-time data scientist, you should consider the following common themes (which you're likely to find in job postings for data scientists) as areas to focus on:

  • Education: Common fields of study are Mathematics and Statistics, followed by Computer Science and Engineering (also Economics and Operations research). Once more, there is no strict requirement to have an advanced or even related degree. In addition, typically, the idea of a degree or an equivalent experience will also apply here.
  • Technology: You will hear SAS and R (actually, you will hear quite a lot about R) as well as Python, Hadoop, and SQL mentioned as key or preferable for a data scientist to be comfortable with. However, tools and technologies change all the time, so, as mentioned several times throughout this chapter, data developers can begin to be productive as soon as they understand the objectives of data science and various statistical methodologies, without having to learn a new tool or language.
Basic business skills such as Omniture, Google Analytics, SPSS, Excel, or any other Microsoft Office tool are assumed pretty much everywhere and don't really count as an advantage, but experience with programming languages (such as Java, Perl, or C++) or databases (such as MySQL, NoSQL, Oracle, and so on) does help!
  • Data: The ability to understand data and deal with the challenges specific to the various types of data, such as unstructured, machine-generated, and big data (including organizing and structuring large datasets).
Unstructured data is a key area of interest in statistics and for a data scientist. It is usually described as data that has no predefined model or is not organized in a predefined manner. Unstructured information is characteristically text-heavy but may also contain dates, numbers, and various other facts as well.
  • Intellectual curiosity: I love this. This is perhaps well defined as a character trait that comes in handy (if not required) if you want to be a data scientist. This means that you have a continuing need to know more than the basics or want to go beyond the common knowledge about a topic (you don't need a degree on the wall for this!).
  • Business acumen: To be a data developer or a data scientist you need a deep understanding of the industry you're working in, and you also need to know what business problems your organization needs to unravel. In terms of data science, being able to discern which problems are the most important to solve is critical in addition to identifying new ways the business should be leveraging its data.
  • Communication skills: All companies look for individuals who can clearly and fluently translate their findings for a non-technical team, such as the marketing or sales departments. A data scientist must enable the business to make decisions by arming it with quantified insights, and must also understand the needs of non-technical colleagues in order to add value and be successful.

Let's move ahead

So, let's finish up this chapter with some casual (if not common sense) advice for the data developer who wants to learn statistics and transition into the world of data science.

The following are several resources you should consider for familiarizing yourself with the topic of statistics and data science:

  • Books: Still the best way to learn! You can get very practical and detailed information (with examples) and advice from books. It's great you started with this book, but there is a staggering (and constantly growing) number of written resources just waiting for you to consume.
  • Google: I'm a big fan of doing internet research. You will be surprised at the quantity and quality of open source and otherwise free software libraries, utilities, models, sample data, white papers, blogs, and so on that you can find out there. A lot of it can be downloaded and used directly to educate you or even as part of an actual project or deliverable.
  • LinkedIn: A very large percentage of corporate and independent recruiters use social media, and most use LinkedIn. This is an opportunity to see what types of positions are in demand and exactly what skills and experiences they require. When you see something you don't recognize, do the research to educate yourself on the topic. In addition, LinkedIn has an enormous number of groups that focus on statistics and data science. Join them all! Network with the members--even ask them direct questions. For the most part, the community is happy to help you (even if it's only to show how much they know).
  • Volunteer: A great way to build skills, continue learning, and expand your statistics network is to volunteer. Check out http://www.datakind.org/get-involved. If you sign up to volunteer, they will review your skills and keep in touch about upcoming projects that fit your background or interests.
  • Internship: Experienced professionals may re-enlist as interns to test a new profession or break into a new industry (www.Wetfeet.com). Although perhaps unrealistic for anyone other than a recent college graduate, internships are available if you can afford to cut your pay (or even take no pay) for a period of time to gain some practical experience in statistics and data science. What might be more practical is interning within your own company in a data scientist apprentice role for a short period or for a particular project.
  • Side projects: This is one of my favorites. Look for opportunities within your organization where statistics may be in use, and ask to sit in on meetings or join calls on your own time. If that isn't possible, look for scenarios where statistics and data science might solve a problem or address an issue, and make it a pet project you work on in your spare time. These kinds of projects are low risk as there will be no deadlines, and if they don't work out at first, it's not the end of the world.
  • Data: Probably one of the easiest things you can do to help your transition into statistics and data science is to get your hands on more types of data, especially unstructured data and big data. Additionally, it's always helpful to explore data from other industries or applications.
  • Coursera and Kaggle: Coursera is an online platform where you can take Massive Open Online Courses (MOOCs) for a fee and earn a certification, while Kaggle hosts data science contests where you can not only evaluate your abilities against other members as you transition but also get access to large, unstructured big data files that may be more like the ones you might use on an actual statistical project.
  • Diversify: Many companies are adopting new tools all the time, so to add credibility to your analytic skills, you will have a significant advantage if you spend time acquiring knowledge of as many tools and technologies as you can, such as R, Python, SAS, Scala, (of course) SQL, and so on. In addition to those mainstream data science tools, you may want to investigate some of the up-and-comers such as Paxata, MATLAB, Trifacta, Google Cloud Prediction API, or Logical Glue.
  • Ask a recruiter: Taking the time to develop a relationship with a recruiter early in your transformation will provide many advantages. A trusted recruiter can pass on a list of skills that are currently in demand as well as which statistical practices are most popular. In addition, as you gain experience and confidence, a recruiter can help you focus or fine-tune your experiences towards specific opportunities that may be further out on the horizon, potentially giving you an advantage over other candidates.
  • Online videos: Check out webinars and how-to videos on YouTube. There are endless resources from both amateurs and professionals that you can view whenever your schedule allows.

Summary

In this chapter, we sketched how a database (or data) developer thinks on a day-to-day, problem-solving basis, comparing the mindsets of a data developer and a data scientist, using various practical examples.

We also listed some of the advantages of thinking like a data scientist and finally discussed common themes for you to focus on as you gain an understanding of statistics and transition into the world of data science.

In the next chapter, we will introduce and explain (again, from a developer's perspective) the basic objectives behind statistics for data science and introduce you to the important terms and key concepts (with easily understood explanations and examples) that are used throughout the book.


Key benefits

  • No need to take a degree in statistics; read this book and get a strong statistics base for data science and real-world programs
  • Implement statistics in data science tasks such as data cleaning, mining, and analysis
  • Learn all about probability, statistics, numerical computations, and more with the help of R programs

Description

Data science is an ever-evolving field, which is growing in popularity at an exponential rate. Data science includes techniques and theories extracted from the fields of statistics, computer science, and, most importantly, machine learning, databases, data visualization, and so on. This book takes you through an entire journey of statistics, from knowing very little to becoming comfortable in using various statistical methods for data science tasks. It starts off with simple statistics and then moves on to statistical methods that are used in data science algorithms. The R programs for statistical computation are clearly explained along with their logic. You will come across various mathematical concepts, such as variance, standard deviation, probability, matrix calculations, and more. You will learn only what is required to implement statistics in data science tasks such as data cleaning, mining, and analysis. You will learn the statistical techniques required to perform tasks such as linear regression, regularization, model assessment, boosting, SVMs, and working with neural networks. By the end of the book, you will be comfortable with performing various statistical computations for data science programmatically.

Who is this book for?

This book is intended for developers who want to enter the field of data science and are looking for concise information on statistics, with the help of insightful programs and simple explanations. Some basic hands-on experience with R will be useful.

What you will learn

  • Analyze the transition from a data developer to a data scientist mindset
  • Get acquainted with the R programs and the logic used for statistical computations
  • Understand mathematical concepts such as variance, standard deviation, probability, matrix calculations, and more
  • Learn to implement statistics in data science tasks such as data cleaning, mining, and analysis
  • Learn the statistical techniques required to perform tasks such as linear regression, regularization, model assessment, boosting, SVMs, and working with neural networks
  • Get comfortable with performing various statistical computations for data science programmatically

Product Details

Publication date: Nov 17, 2017
Length: 286 pages
Edition: 1st
Language: English
ISBN-13: 9781788290678



Table of Contents

12 Chapters
Transitioning from Data Developer to Data Scientist
Declaring the Objectives
A Developer's Approach to Data Cleaning
Data Mining and the Database Developer
Statistical Analysis for the Database Developer
Database Progression to Database Regression
Regularization for Database Improvement
Database Development and Assessment
Databases and Neural Networks
Boosting your Database
Database Classification using Support Vector Machines
Database Structures and Machine Learning

Customer reviews

Rating distribution
3.6 out of 5 (5 Ratings)
5 star 60%
4 star 0%
3 star 0%
2 star 20%
1 star 20%

Adi, Mar 02, 2018 (5 stars)
Statistics is the main concept in data science , this book helps in analyzing from data developer to a data scientist , R programming logic's for stats and many more concepts. Useful for anyone who is interested in data science
Amazon Verified review

Vivek V., Oct 06, 2018 (5 stars)
Nice book
Amazon Verified review

Deepak Singh, Nov 10, 2019 (5 stars)
Imprssive work
Amazon Verified review

Alexander, Jul 31, 2018 (2 stars)
A lot of blank space on approx. 200pages thus covering topics superficially. I would not recommend this book except to those who are looking for a quick intro.
Amazon Verified review

Chrisfs, Jan 29, 2019 (1 star)
I am very disappointed by the book. The contents don't match the title at all. There is very little statistics in the book. It covers the basics of preparing data for analysis and covers the dictionary meaning of some machine learning and statistical terms but it doesn't explain anything in any sort of detail. If you buy this book to learn about statistics, then it's very disappointing and a complete waste of money
Amazon Verified review
