Mastering Predictive Analytics with Python

Chapter 1. From Data to Decisions – Getting Started with Analytic Applications

From quarterly financial projections to customer surveys, analytics helps businesses make decisions and plan for the future. While data visualizations such as pie charts and trend lines built in spreadsheet programs have been used for decades, recent years have seen growth in both the volume and diversity of data sources available to the business analyst and in the sophistication of the tools used to interpret this information.

The rapid growth of the Internet, through e-commerce and social media platforms, has generated a wealth of data, which is available faster than ever before for analysis. Photographs, search queries, and online forum posts are all examples of unstructured data that can't be easily examined in a traditional spreadsheet program. With the proper tools, these kinds of data offer new insights, in conjunction with or beyond traditional data sources.

Traditionally, data such as historical customer records appear in a structured, tabular form that is stored in an electronic data warehouse and easily imported into a spreadsheet program. Even in the case of such tabular data, the volume of records and the rate at which they are available are increasing in many industries. While the analyst might have historically transformed raw data through interactive manipulation, robust analytics increasingly requires automated processing that can scale with the volume and velocity of data being received by a business.

Along with the data itself, the methods used to examine it have become more powerful and complex. Beyond summarizing historical patterns or projecting future events using trend lines derived from a few key input variables, advanced analytics emphasizes the use of sophisticated predictive modeling (see The goals of predictive analytics, in the tip that follows) to understand the present and forecast near- and long-term outcomes.

Diverse methods for generating such predictions typically require the following common elements (a minimal sketch combining them follows this list):

  • An outcome or target that we are trying to predict, such as a purchase or the click-through rate (CTR) on a search result.
  • A set of columns that comprise features, also known as predictors (for example, a customer's demographic information, past transactions on a sales account, or click behavior on a type of ad) describing individual properties of each record in our dataset (for example, an account or ad).
  • A procedure that finds the model or set of models which best maps these features to the outcome of interest on a given sample of data.
  • A way to evaluate the performance of the model on new data.
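
As a minimal sketch of how these four elements fit together, the following Python snippet (using scikit-learn on entirely synthetic data, so every value and column here is hypothetical) builds a small feature matrix, fits a model on a training sample, and evaluates it on held-out records:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Features (predictors): two hypothetical columns describing each record.
    X = rng.normal(size=(500, 2))
    # Outcome (target): a synthetic purchase flag related to the features.
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    # A given sample of data for fitting, and new data for evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # A procedure that maps the features to the outcome on the training sample.
    model = LogisticRegression().fit(X_train, y_train)

    # A way to evaluate performance on data the model has not seen.
    print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))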

While predictive modeling techniques can be used in powerful analytic applications to discover complex relationships between seemingly unrelated inputs, they also present a new set of challenges to the business analyst:

  • Which method is best suited to a particular problem?
  • How does one correctly evaluate the performance of these techniques on historical and new data?
  • What are the preferred strategies for tuning the performance of a given method?
  • How does one robustly scale these techniques for both one-off analysis and ongoing insight?

In this book, we will show you how to address these challenges by developing analytic solutions that transform data into powerful insights for you and your business. The main tasks involved in building these applications are:

  • Transforming raw data into a sanitized form that can be used for modeling. This may involve both cleaning anomalous data and converting unstructured data into a structured format.
  • Feature engineering, by transforming these sanitized inputs into the format that is used to develop a predictive model.
  • Calibrating a predictive model on a subset of this data and assessing its performance.
  • Scoring new data while evaluating the ongoing performance of the model.
  • Automating the transformation and modeling steps for regular updates.
  • Exposing the output of the model to other systems and users, usually through a web application.
  • Generating reports for the analyst and business user that distill the data and model into regular and robust insights.

Throughout this volume, we will use open-source tools written in the Python programming language to build these sorts of applications. Why Python? The Python language strikes an attractive balance between robust compiled languages such as Java, C++, and Scala, and pure statistical packages such as R, SAS, or MATLAB. We can work interactively with Python using the command line (or, as we will in subsequent chapters, browser-based notebook environments), plotting data and prototyping commands as we go. Python also provides extensive libraries that allow us to transform this exploratory work into web applications (using frameworks such as Flask, CherryPy, and Celery, as we will see in Chapter 8, Sharing Models with Prediction Services) or scale it to large datasets (using PySpark, as we will explore in future chapters). Thus we can both analyze data and develop software applications within the same language.

Before diving into the technical details of these tools, let's take a high-level look at the concepts behind these applications and how they are structured. In this chapter, we will:

  • Define the elements of an analytic pipeline: data transformation, sanity checking, preprocessing, model development, scoring, automation, deployment, and reporting.
  • Explain the differences between batch-oriented and stream processing and their implications at each step of the pipeline.
  • Examine how batch and stream processing can be jointly accommodated within the Lambda Architecture for data processing.
  • Explore an example stream-processing pipeline to perform sentiment analysis of social media feeds.
  • Explore an example of a batch-processing pipeline to generate targeted e-mail marketing campaigns.

Tip

The goals of predictive analytics

The term predictive analytics, along with others such as data mining and machine learning, is often used to describe the techniques used in this book to build analytic solutions. However, it is important to keep in mind that there are two distinct goals these methods can address. Inference involves building models in order to evaluate the significance of a parameter's effect on an outcome, and emphasizes interpretation and transparency over predictive performance. For example, the coefficients of a regression model (Chapter 4, Connecting the Dots with Models – Regression Methods) can be used to estimate the effect of variation in a particular model input (for example, customer age or income) on an output variable (for example, sales). The predictions from a model developed for inference may be less accurate than those of other techniques, but they provide valuable conceptual insights that may guide business decisions. Conversely, prediction emphasizes the accuracy of the estimated outcome, even if the model itself is a black box where the connection between an input and the resulting output is not always clear. For example, Deep Learning (Chapter 7, Learning from the Bottom Up – Deep Networks and Unsupervised Features) can produce state-of-the-art models and extremely accurate predictions from complex sets of inputs, but the connection between the input parameters and the prediction may be hard to interpret.
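
To make the contrast concrete, here is a small, hedged sketch using scikit-learn on synthetic data (the two inputs merely stand in for quantities such as customer age and income): the linear model exposes coefficients that can be read as per-unit effects, while the more flexible model is used only through its predictions.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 2))                     # hypothetical inputs
    y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

    # Inference: the fitted coefficients estimate each input's effect on the outcome.
    linear = LinearRegression().fit(X, y)
    print("estimated per-unit effects:", linear.coef_)

    # Prediction: a more flexible model is queried only for its outputs and
    # offers no comparably simple summary of how the inputs drive them.
    black_box = GradientBoostingRegressor(random_state=1).fit(X, y)
    print("prediction for a new record:", black_box.predict([[0.2, -1.0]]))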

Designing an advanced analytic solution

What are the essential components of an analytic solution? While the exact design can vary between applications, most consist of the following pieces (Figure 1):

Figure 1: Reference architecture for an analytic pipeline

  • Data layer: This stage deals with the storage, processing, and persistence of the data, and how it is served to downstream applications such as the analytical applications we will build in this volume. As indicated in Figure 1, data serves as the glue that binds together the other pieces of our application, all of which rely on the data layer to store and update information about their state. This also reflects the separation of concerns that we will discuss in more detail in Chapter 8, Sharing Models with Prediction Services, and Chapter 9, Reporting and Testing – Iterating on Analytic Systems, where the other three components of our application can be designed independently since they interact only through the data layer.
  • Modeling layer: At this point, the data has been turned into a form that may be ingested by our modeling code in Python. Further feature engineering tasks may be involved to convert this sanitized data into model inputs, along with splitting data into subsets and performing iterative rounds of optimization and tuning. It will also be necessary to prepare the model in a way that can be persisted and deployed to downstream users. This stage is also involved with scoring new data as it is received or performing audits of model health over time.
  • Deployment layer: The algorithm development and performance components in the modeling layer are usually exposed to either human users or other software systems through web services. These consumers interact with the services through a server layer by means of network calls, both to trigger new rounds of model development and to query the results of previous analyses.
  • Reporting layer: Predictions, model parameters, and insights can all be visualized and automated using reporting services.

With these broad components in mind, let's delve more deeply into the details of each of these pieces.

Data layer: warehouses, lakes, and streams

The beginning of any analytic pipeline is the data itself, which serves as the basis for predictive modeling. This input can vary both in the rate at which updates are available and the amount of transformation that needs to be applied to form the final set of features used in the predictive model. The data layer serves as the repository for this information.

Traditionally, data used for analytics might simply be stored on disk in flat files, such as a spreadsheet or document. As the diversity and scale of data have increased, so have the scale and complexity of resources needed to house and process them. Indeed, a modern view of the data layer encompasses both real-time (stream) data and batch data in the context of many potential downstream uses. This combined system, known as Lambda Architecture (Marz, Nathan, and James Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications Co., 2015.), is diagrammed in the following figure:

Figure 2: Data layer as a Lambda Architecture

The components of this data layer are:

  • Data sources: These could be either real-time data received in streams, or batch updates received on a periodic or discontinuous basis.
  • Data lake: Both real-time and batch data are commonly saved in a data lake model, in which a distributed file system such as the Hadoop File System (HDFS) or Amazon Web Services (AWS) Simple Storage Service (S3) is used as a common storage medium for data received both in batch and in streams. This data can be stored with either a fixed-lifetime (transient) or permanent (persisted) retention policy. It may then be processed in ongoing batch transformations such as Extract, Transform, and Load (ETL) jobs running in frameworks such as MapReduce or Spark. ETL processes might involve cleaning the data, aggregating it into metrics of interest, or reshaping it into a tabular form from raw inputs. This processing forms the batch layer of the Lambda Architecture, where real-time availability is not expected and latency of minutes to days is acceptable in surfacing views of the data for downstream consumption.
  • Data river: While the data lake accumulates all types of raw data in a central location, the data river forms an ongoing message queue where real-time data is dispatched to stream processing tasks. This is also termed the speed layer (Marz, Nathan, and James Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications Co., 2015.) of the architecture, as it operates on data as soon as it is available and real-time availability is expected.
  • Merged view: Both real-time and batch views of the raw data may be merged into a common persistence layer, such as a data warehouse in structured tables, where they can be queried using Structured Query Language (SQL) and utilized in either transactional (for example, updating a bank balance in real time) or analytic (for example, running analyses or reports) applications. Examples of such warehouse systems include traditional relational systems such as MySQL and PostgreSQL (which usually store data with tabular schema in rows and columns), and NoSQL systems such as MongoDB or Redis (which arrange data more flexibly in key-value systems, where values can take on numerous formats outside the traditional rows and columns). This merged system is also referred to as the serving layer (Marz, Nathan, and James Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications Co., 2015.), and can either be directly queried using the database system, or surfaced to downstream applications.
  • Downstream applications: Systems such as our advanced analytic pipelines can either directly consume the outputs of the batch and real-time processing layers, or interact with one or both of these sources through the merged view in the warehousing system.

How might streaming and batch data be processed differently in the data layer? In batch pipelines, the allowed delay between receiving and processing the data permits potentially complex transformations of the source data: elements may be aggregated (such as calculating a user's or product's average properties over a period of time), joined to other sources (for example, indexing additional website metadata onto search logs), and filtered (for example, many web logging systems need to remove bot activity that would otherwise skew the results of predictive models). The source data could be obtained, for example, from simple text files posted to a server, a relational database system, or a mixture of different storage formats.
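
As a simplified illustration of such a batch transformation, the sketch below uses pandas on a small, hypothetical web log: it filters suspected bot sessions, joins page metadata, and aggregates an average per user and category. The column names and the bot heuristic are assumptions for illustration only; in a real pipeline the inputs would be read from the data lake rather than constructed inline.

    import pandas as pd

    # Hypothetical raw web log and page metadata.
    logs = pd.DataFrame({
        "user_id": [1, 1, 2, 2, 3],
        "page_id": [10, 11, 10, 12, 10],
        "dwell_seconds": [30, 5, 120, 60, 1],
        "requests_per_minute": [2, 2, 1, 1, 500],   # crude bot signal (assumed)
    })
    pages = pd.DataFrame({"page_id": [10, 11, 12],
                          "category": ["home", "product", "checkout"]})

    # Filter: drop sessions that look like bot traffic.
    clean = logs[logs["requests_per_minute"] < 100]

    # Join: enrich the log rows with page metadata.
    enriched = clean.merge(pages, on="page_id", how="left")

    # Aggregate: average dwell time per user and category.
    summary = (enriched.groupby(["user_id", "category"])["dwell_seconds"]
                       .mean()
                       .reset_index(name="avg_dwell_seconds"))
    print(summary)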

Conversely, due to the speed at which incoming data must often be consumed, streaming processes typically involve less complex processing of inputs than batch jobs, and instead use simple filters or transformations. The sources for such applications are typically continuously updated streams from web services (such as social media or news feeds), events (such as geo-locations of vehicles and mobile phones), or customer activities (such as searches or clicks).
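
A streaming counterpart typically applies only lightweight, per-record logic as events arrive. The generator below is a stand-in for a real message-queue consumer; the event fields and the brand keyword are hypothetical.

    def event_stream():
        # Stand-in for a consumer reading messages from a queue in real time.
        yield {"user": "a", "text": "Loving the new AcmeCo headphones!"}
        yield {"user": "b", "text": "Traffic was terrible this morning"}
        yield {"user": "c", "text": "AcmeCo support was slow to respond"}

    def process(stream, keyword="acmeco"):
        # Simple per-record filter and transformation, applied on arrival.
        for event in stream:
            text = event["text"].lower()
            if keyword in text:
                yield {"user": event["user"], "tokens": text.split()}

    for record in process(event_stream()):
        print(record)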

The choice between batch and stream processing at this stage is largely determined by the data source, which is available either as a continuously updated series of events (streaming) or as larger, periodically available chunks (batch). In some cases, the nature of the data will also determine the form of the subsequent pipeline and an emphasis on real-time or higher-latency processing. In others, the use of the application will take precedence in downstream choices. The normalized view surfaced in the data layer is used downstream in the next stage of the analytic pipeline, the modeling layer.

Modeling layer

The modeling layer involves a number of interconnected tasks, diagrammed in the following figure (Figure 3). As the data layer accommodates both real-time and batch data, we can imagine two main kinds of modeling systems:

  • Streaming pipelines act upon a continuous source of data (such as instant messages or a news feed) as soon as it becomes available, potentially allowing real-time model updates or scoring. However, the ability to update the model in real time varies by algorithm (for example, it works for models using stochastic updates, described in Chapter 5, Putting Data in its Place – Classification Methods and Analysis), and some models can only be developed in an offline process. The potential volume of streaming data may also mean that it cannot be stored in its raw form, but only transformed into a more manageable format before the original record is discarded.
  • Batch pipelines process data sources that are updated on a periodic basis (often daily) using a batch-oriented framework. The input does not need to be used the moment it is available, and a latency of hours or days between updates is usually acceptable, meaning that data processing and model development typically do not occur in real time.

Note

On the surface, the choice between the two classes of pipelines seems to involve the tradeoff between real-time (streaming) or offline (batch) analysis. In practice, the two classes can have real-time and non-real-time components intermingled within a single application.

If both types of pipeline are viable for a given problem (for example, if the streams are stock prices, a dataset whose volume and simple format – a set of numbers – should allow it to be readily stored offline and processed in its entirety at a later date), the choice between the two frameworks may be dictated by technical or business concerns. For example, sometimes the method used in a predictive model allows only for batch updates, meaning that continuously processing a stream as it is received does not add additional value. In other cases, the importance of the business decisions informed by the predictive model necessitates real-time updates and so would benefit from stream processing.

Figure 3: Overview of the modeling layer

The details of the generic components of each type of pipeline as shown in Figure 3 are as follows:

In the Model Input step, the source data is loaded and potentially transformed by the pipeline into the inputs required for a predictive model. This can be as simple as exposing a subset of columns in a database table, or as involved as transforming an unstructured source such as text into a form that may be input to a predictive model. If we are fortunate, the kinds of features we wish to use in a model are already in the form in which they appear in the raw data. In this case, model fitting proceeds directly on the inputs. More often, the input data merely contains the base information we might want to use as inputs to our model, and needs to be processed into a form that can be utilized in prediction.

In the case of numerical data, this might take the form of discretization or transformation. Discretization involves taking a continuous number (such as consumer tenure on a subscription service) and dividing it into bins (such as users with <30 or >=30 days of subscription) that either reduce the variation in the dataset (by thresholding an outlier on a continuous scale to a reasonable bin number) or turn a numerical range into a set of values that have more direct business implications. Another example of discretization is turning a continuous value into a rank, in cases where we don't care as much about the actual number as its relative value compared to others. Similarly, values that vary over exponential scales might be transformed using a natural logarithm to reduce the influence of large values on the modeling process.
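
The snippet below sketches these operations with pandas and NumPy on a hypothetical tenure and revenue table; the 30-day threshold mirrors the example above, and all values are invented.

    import numpy as np
    import pandas as pd

    customers = pd.DataFrame({
        "tenure_days": [3, 15, 29, 30, 180, 4000],
        "lifetime_revenue": [1.0, 20.0, 55.0, 90.0, 1200.0, 250000.0],
    })

    # Discretization: bin a continuous tenure into business-relevant groups.
    customers["tenure_bucket"] = pd.cut(customers["tenure_days"],
                                        bins=[0, 30, np.inf],
                                        labels=["<30 days", ">=30 days"],
                                        right=False)

    # Rank transformation: keep only the relative ordering of revenue.
    customers["revenue_rank"] = customers["lifetime_revenue"].rank()

    # Log transformation: damp the influence of extreme values.
    customers["log_revenue"] = np.log1p(customers["lifetime_revenue"])
    print(customers)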

In addition to these sorts of transformations, numerical features might be combined in ratios, sums, products, or other combinations, yielding a potential combinatorial explosion of features from even a few basic inputs. In some models, these sorts of interactions need to be explicitly represented by generating such combined features between inputs (such as the regression models we discuss in Chapter 4, Connecting the Dots with Models – Regression Methods). Other models have some ability to decipher these interactions in datasets without our direct creation of the feature (such as random forest algorithms in Chapter 5, Putting Data in its Place – Classification Methods and Analysis or gradient boosted decision trees in Chapter 6, Words and Pixels – Working with Unstructured Data).
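
Both routes are sketched below on a two-column toy table: hand-crafted ratio and product features, and automatically generated pairwise interaction terms via scikit-learn's PolynomialFeatures (a recent scikit-learn is assumed for get_feature_names_out).

    import pandas as pd
    from sklearn.preprocessing import PolynomialFeatures

    df = pd.DataFrame({"visits": [10, 4, 25], "purchases": [2, 1, 5]})

    # Hand-crafted combinations: a ratio and a product of the base inputs.
    df["purchases_per_visit"] = df["purchases"] / df["visits"]
    df["visits_x_purchases"] = df["visits"] * df["purchases"]

    # Automatic pairwise interactions (no squared terms, no bias column).
    poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    interactions = poly.fit_transform(df[["visits", "purchases"]])
    print(poly.get_feature_names_out(["visits", "purchases"]))
    print(interactions)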

In the case of categorical data, such as country codes or days of the week, we may need to transform the category into a numerical descriptor. This could be a number (if the data is ordinal, meaning for example that a value of 2 has an interpretation of being larger than another record with value 1 for that feature) or a vector with one or more non-zero entries indicating the class to which a categorical feature belongs (for example, a document could be represented by a vector the same length as the English vocabulary, with a number indicating how many times each word represented by a particular vector position appears in the document).
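
Both encodings are easy to sketch: one-hot (dummy) columns for a day-of-week feature using pandas, and word-count vectors for short documents using scikit-learn's CountVectorizer. The toy data is hypothetical.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # One-hot encoding: one indicator column per category level.
    days = pd.DataFrame({"day_of_week": ["Mon", "Tue", "Mon", "Sun"]})
    print(pd.get_dummies(days, columns=["day_of_week"]))

    # Bag-of-words encoding: each document becomes a vector of word counts
    # over the vocabulary observed in the corpus.
    docs = ["the product arrived late", "great product great price"]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())
    print(counts.toarray())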

Finally, we might find cases where we wish to discover the hidden features represented by a particular set of inputs. For example, income, occupation, and age might all be correlated with the zip code in which a customer lives. If geographic variables aren't part of our dataset, we could still discover these common underlying patterns using dimensionality reduction, as we will discuss in Chapter 6, Words and Pixels – Working with Unstructured Data.
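
A minimal sketch of this idea with scikit-learn's PCA: three correlated, observed variables (standing in for quantities such as income, occupation level, and age, all synthetic here) are driven by one hidden factor and compressed into a smaller set of components.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    # A hidden factor drives three observed, correlated variables.
    latent = rng.normal(size=(200, 1))
    observed = (np.hstack([2.0 * latent, 1.5 * latent, -1.0 * latent])
                + rng.normal(scale=0.1, size=(200, 3)))

    pca = PCA(n_components=2).fit(observed)
    print("variance explained per component:", pca.explained_variance_ratio_)
    reduced = pca.transform(observed)   # three correlated columns -> two components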

Sanity checking may also be performed at this stage, as it is crucial to spot data anomalies when they appear, such as outliers that might degrade the performance of the model. In the first phase of quality checks, the input data is evaluated to prevent outliers or incorrect data from impacting the quality of models in the following stages. These sanity checks could take many forms: for categorical data (for example, a state or country), there are only a fixed number of allowable values, making it easy to rule out incorrect inputs. In other cases, this quality check is based on an empirical distribution, such as variation from an average value, or a sensible minimum or maximum range. More complex scenarios usually arise from business rules (such as a product being unavailable in a given territory, or a particular combination of IP addresses in web sessions being illogical).
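
A few of these input checks can be expressed directly in pandas, as in the sketch below; the allowed country codes, the order-total range, and the column names are illustrative assumptions rather than rules from any particular business.

    import pandas as pd

    orders = pd.DataFrame({
        "country": ["US", "DE", "XX"],          # "XX" is not an allowed code
        "order_total": [25.0, -10.0, 90.0],     # a negative total is suspect
    })

    ALLOWED_COUNTRIES = {"US", "DE", "FR", "GB"}

    # Categorical check: only a fixed set of values is permissible.
    bad_country = ~orders["country"].isin(ALLOWED_COUNTRIES)

    # Range check: totals should fall within a sensible empirical range.
    bad_total = (orders["order_total"] < 0) | (orders["order_total"] > 10_000)

    flagged = orders[bad_country | bad_total]
    print("rows failing sanity checks:")
    print(flagged)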

Such quality checks serve as more than safeguards for the modeling process: they can also serve as warnings of events such as bot traffic on websites that may indicate malicious activity. Consequently, these audit rules may also be incorporated as part of the visualization and reporting layer at the conclusion of the pipeline.

In the second round of quality checks following model development, we want to evaluate whether the parameters of the model make sense and whether the performance on the test data is in an acceptable range for deployment. The former might involve plotting the important parameters of a model if the technique permits, visualizations that can then also be utilized by the reporting step downstream. Similarly, the second class of checks can involve looking at accuracy statistics such as precision, recall, or squared error, or the similarity of the test set to data used in model generation in order to determine if the reported performance is reasonable.
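
For the performance side of these checks, standard scikit-learn metrics can be compared against an acceptable range before a model is promoted; the labels and the 0.7 thresholds below are placeholders.

    from sklearn.metrics import precision_score, recall_score

    # Hypothetical held-out labels and model predictions.
    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]

    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    print(f"precision={precision:.2f}, recall={recall:.2f}")

    # Placeholder deployment gate: flag the run if performance drops too far.
    if precision < 0.7 or recall < 0.7:
        print("warning: performance below acceptable range; hold back deployment")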

As with the first round of sanity checks, not only can these quality control measures monitor the health of the model development process, but they can also highlight changes in the actual modeling code itself (especially if this code is expected to be regularly updated).

There isn't inherently much difference between streaming and batch-oriented processing in the sanity checking process, just the latency at which the application can uncover anomalies in the source data or modeling process and deliver them to the reporting layer. The complexity of the sanity checks may guide this decision: simple checks that can be done in real time are well suited to stream processing, while evaluation of the properties of a predictive model could potentially take longer than the training of the algorithm itself, and is thus more suited to a batch process.

In the model development or update step, once the input data has undergone any necessary processing or transformation steps and passed the quality checks described above, it is ready to be used in developing a predictive model. This phase of the analytic pipeline can have several steps, with the exact form depending upon the application:

  • Data splitting: At this stage we typically split the data into disjoint sets: the training data (from which we will tune the parameters of the algorithm) and the test data (which is used for evaluation purposes). The important reason for making this split is to check that the model generalizes to data beyond its initial inputs (the training data), which we do by evaluating its performance on the test set (see the combined sketch after this list).
  • Parameter tuning: As we will examine in more detail in subsequent chapters, many predictive models have a number of hyperparameters: variables that need to be set before the parameters of the model can be optimized for a training set. Examples include the number of groups in a clustering application (Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning), the number of trees used in a random forest (Chapter 4, Connecting the Dots with Models – Regression Methods), or the learning rate and number of layers in a neural network (Chapter 7, Learning from the Bottom Up – Deep Networks and Unsupervised Features). These hyperparameters frequently need to be calibrated for optimal performance of a predictive model, through grid search (Chapter 5, Putting Data in its Place – Classification Methods and Analysis) or other methods. This tuning can occur only during the initial phase of model development, or as part of a regular retraining cycle. Following or jointly with hyperparameter tuning, the parameters, such as regression coefficients or decision splits in a tree model (Chapter 4, Connecting the Dots with Models – Regression Methods), are optimized for a given set of training data. Depending upon the method, this step may also involve variable selection: the process of pruning uninformative features from the input data. Finally, we may perform the above tasks for multiple algorithms and choose the best performing technique.

    Batch-oriented and streaming processes could differ at this stage depending upon the algorithm. For example, in models that allow for incremental updates through stochastic learning (Chapter 5, Putting Data in its Place – Classification Methods and Analysis), new data may be processed in a stream as each new training example can individually tune the model parameters. Conversely, data may arrive in a stream but be aggregated until a sufficient size is reached, at which point a batch process is launched to retrain the model. Some models allow for both kinds of training, and the choice depends more on the expected volatility of the input data. For example, rapidly trending signals in social media posts may suggest updating a model as soon as events are available, while models based on longer-term events such as household buying patterns may not justify such continuous updates.

  • Model performance: Using either the test data split off during model development or an entirely new set of observations, the modeling layer is also responsible for scoring new data, surfacing important features in the model, and providing information about its ongoing performance. Once the model has been trained on a set of input data, it can be applied to new data either in real-time computations or through offline, batch processing to generate a predicted outcome or behavior.

    Depending upon the extent of initial data processing, new records may also need to be transformed to generate the appropriate features for evaluation by a model. The extent of such transformations may dictate whether scoring is best accomplished through a streaming or batch framework.

    Similarly, the use of the resulting prediction may guide the choice between streaming and batch-oriented processing. When such scores are used as inputs to other, responsive systems (such as reordering search results or the ads presented on a webpage), real-time updates from streaming pipelines allow for immediate use of the new scores and so may be valuable. When the scores are primarily used for internal decision-making (such as prioritizing sales leads for follow-up), real-time updates may not be necessary and a batch-oriented framework can be used instead. This difference in latency tends to correlate with whether the downstream consumer is another application (machine-to-machine interaction) or a human user relying upon the model for insight (machine-to-human).

  • Model persistence: Once we have tuned the parameters of the predictive model, the result may also need to be packaged, or serialized, into a format that allows deployment within a production environment. We will examine this in greater depth in Chapter 8, Sharing Models with Prediction Services, but in brief this process involves transforming the model output into a form usable by downstream systems and saving it back to the data layer, both for disaster recovery and for potential use by the reporting layer described below.
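
The following sketch strings several of these steps together with scikit-learn on synthetic data: a train/test split, a small grid search over one hyperparameter, scoring of the held-out set, and serialization of the winning model with joblib. The parameter grid and the output file name are illustrative choices, not recommendations.

    import joblib
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    rng = np.random.default_rng(3)
    X = rng.normal(size=(400, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Data splitting: disjoint training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Parameter tuning: cross-validated grid search over one hyperparameter.
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid={"n_estimators": [50, 100]},
                          cv=3)
    search.fit(X_train, y_train)

    # Model performance: evaluate the tuned model on unseen data.
    print("best parameters:", search.best_params_)
    print("test accuracy:", accuracy_score(y_test, search.predict(X_test)))

    # Model persistence: serialize the result for deployment and recovery.
    joblib.dump(search.best_estimator_, "model.joblib")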

Deployment layer

The output of our predictive modeling can be made broadly available to both individual users and other software services through a deployment layer, which encapsulates the modeling, scoring, and evaluation functions of the previous layer inside web applications, as shown in Figure 4:

Figure 4: Deployment layer components

This application layer receives network calls over the web, transmitted either through a web browser or from a programmatic request generated by another software system. As we will describe in Chapter 8, Sharing Models with Prediction Services, these applications usually provide a standard set of commands to initiate an action, get a result, save new information, or delete unwanted information. They also typically interact with the data layer, both to store results and, in the case of long-running tasks, to record the progress of modeling computations.
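
A minimal Flask sketch of such an application is shown below. It assumes a model serialized as model.joblib by an earlier modeling step; that file name, the /predict route, and the expected JSON fields are all assumptions made for illustration.

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("model.joblib")   # produced by the modeling layer (assumed)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expect a JSON body such as {"features": [[0.1, -0.2, 0.3, 0.4, 0.5]]}.
        payload = request.get_json()
        scores = model.predict_proba(payload["features"])[:, 1]
        return jsonify({"scores": scores.tolist()})

    if __name__ == "__main__":
        # Development server only; in production this sits behind the server layer.
        app.run(port=5000)

A client, whether a dashboard or another service, would then POST feature vectors to this endpoint over HTTP and read the scores from the JSON response.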

The network calls received by these applications are brokered by the server layer, which serves to route traffic between applications (usually based on URL patterns). As we will cover in Chapter 8, Sharing Models with Prediction Services, this separation between the server and the application allows us to scale our application by adding more machines, and to independently add more servers to balance incoming requests.

The client layer, which initiates the requests received by the server, could be an interactive system, such as a dashboard, or an independent system, such as an e-mail server that uses the output of a model to schedule outgoing messages.

Reporting layer

The output of the analytical pipeline may be surfaced by the reporting layer, which involves a number of distinct tasks, as shown in Figure 5:

Figure 5: Reporting applications for prediction services

  • Visualizations: This can allow interactive querying of the source data along with model data such as parameters and feature importance. It can also be used to visualize the output of a model, such as the set of recommendations that would be provided to a user on an e-commerce site, or the risk score assigned to a particular bank account. Because it is frequently used in interactive mode, we may also consider aggregating large model inputs into summarized datasets for lower latency during exploratory sessions. Additionally, visualizations can be either an ad hoc process (such as the interactive notebooks we will examine in future chapters), or a fixed series of graphics (such as the dashboards we will construct in Chapter 9, Reporting and Testing – Iterating on Analytic Systems).
  • Audit/Healthcheck: The reporting service involves ongoing monitoring of the application. Indeed, an important factor in developing robust analytic pipelines is regular assessment to ensure that the model is performing as expected. Combining outputs from many previous steps, such as quality control checks and scores for new data, a reporting framework visualizes these statistics and compares them to previous values or a gold standard. This sort of reporting can be used both by the analyst, to monitor the application, and as a way to surface insights uncovered by the modeling process to the larger business organization.
  • Comparison reports: This might be used as we iterate on model development through the process of experimentation, as we will discuss in Chapter 9, Reporting and Testing – Iterating on Analytic Systems. Because this analysis may involve statistical measurements, the visualizations might be combined with a service in the deployment layer to calculate significance metrics.

    The choice of batch versus streaming processes will often determine whether such reports can be provided in real time, but just because they are available immediately doesn't imply that such frequency is valuable to the user. For example, even if user response rates to an ad campaign can be collected in real time, decisions about future advertising programs based on these results may be constrained by quarterly business planning. In contrast, trending interest in particular search queries might allow us to quickly tune the results of a recommendation algorithm, and thus this low-latency signal has value. Again, judgment based on the particular use case is required.

    To conclude this introduction, let's examine a pair of hypothetical applications that illustrate many of the components we've described above. Don't worry too much about the exact meaning of all the terminology, which will be expanded upon in the following chapters.

Case study: sentiment analysis of social media feeds

Consider a marketing department that wants to evaluate the effectiveness of its campaigns by monitoring brand sentiment on social media sites. Because changes in sentiment could have negative effects on the larger company, this analysis is performed in real time. An overview of this example is shown in Figure 6.

Figure 6: Diagram of social media sentiment analysis case study

Data input and transformation

The input data to this application are social media posts. This data is available in real time, but a number of steps need to be applied to make it usable by the sentiment-scoring model. Common words (such as "and" and "the") need to be filtered out, messages that actually refer to the company need to be selected, and misspellings and word capitalization need to be normalized. Once this cleaning is done, further transformations may turn each message into a vector of counts over the model's allowed vocabulary, or hash its words to populate a fixed-length vector.
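
A rough sketch of this preparation step, combining a simple brand filter with scikit-learn's CountVectorizer (lowercasing and English stop-word removal) and HashingVectorizer (fixed-length hashed vectors); the brand keyword and the example messages are invented.

    from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

    messages = [
        "Loving the new AcmeCo phone!!",
        "the weather and the traffic today...",
        "acmeco support was SLOW and unhelpful",
    ]

    # Keep only messages that actually mention the brand (hypothetical keyword).
    brand_messages = [m for m in messages if "acmeco" in m.lower()]

    # Count vectors: lowercase and drop common words such as 'and' and 'the'.
    count_vec = CountVectorizer(lowercase=True, stop_words="english")
    counts = count_vec.fit_transform(brand_messages)
    print(count_vec.get_feature_names_out())

    # Hashed vectors: fixed length regardless of the vocabulary size.
    hash_vec = HashingVectorizer(n_features=32)
    hashed = hash_vec.fit_transform(brand_messages)
    print(hashed.shape)   # (number of brand messages, 32)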

Sanity checking

The outputs of the preceding transformations need to be sanity checked – are there any users who account for an unusually large number of messages (which might indicate bot spam)? Are there unexpected words in the input (which could be due to character encoding issues)? Are any of the input messages longer than the allowed message size for the service (which could indicate incorrect separation of messages in the input stream)?

Once the model is developed, sanity checking involves some human guidance. Do the sentiments predicted by the model correlate with the judgment of human readers? Do the words that correspond to high probability for a given sentiment in the model make intuitive sense?

These and other sanity checks can be visualized as a webpage or document summary that can be utilized by both the modeler, to evaluate model health, and the rest of the marketing staff to understand new topics that may correspond to positive or negative brand sentiment.

Model development

The model used in this pipeline is a multinomial logistic regression (Chapter 5, Putting Data in its Place – Classification Methods and Analysis) that takes as input counts of the words in each social media message and outputs a predicted probability that the message belongs to a given sentiment category: VERY POSITIVE, POSITIVE, NEUTRAL, NEGATIVE, and VERY NEGATIVE. While in theory we could perform model training online (because multinomial logistic regression can be trained using stochastic gradient updates), in practice this is not possible because the labels (sentiments) need to be assigned by a human expert. Therefore, our model is developed in an offline batch process each week, as a sufficient set of social media messages labeled by an expert becomes available. The hyperparameters of this model (the regularization weight and learning rate) have been estimated previously, so the batch retraining calculates the regression coefficient weights for a set of training messages and evaluates the performance on a separate batch of test messages.
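
The weekly batch retraining described here might look roughly like the pipeline below; the handful of labeled messages, the regularization value, and the class names are placeholders standing in for the expert-labeled training set and previously tuned settings.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Expert-labeled messages (tiny placeholder sample; real sets are far larger).
    texts = ["love this brand", "great product", "it is ok", "bad service",
             "terrible, never again", "really happy", "not impressed",
             "awful quality"]
    labels = ["VERY POSITIVE", "POSITIVE", "NEUTRAL", "NEGATIVE",
              "VERY NEGATIVE", "VERY POSITIVE", "NEGATIVE", "VERY NEGATIVE"]

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=0)

    # Multinomial logistic regression on word counts; C plays the role of the
    # previously estimated (inverse) regularization weight.
    model = make_pipeline(CountVectorizer(),
                          LogisticRegression(C=1.0, max_iter=1000))
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))
    print(model.predict(["the service was terrible"]))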

Scoring

Incoming messages processed by this pipeline can be scored by the existing model and assigned to one of the five sentiment classes, and the volume of each category is updated in real time to allow monitoring of brand sentiment and immediate action if there is an extremely negative response to one of the marketing department's campaigns.

Visualization and reporting

As the model scores new social media messages, it updates a real-time dashboard with the volume of messages in each category compared to yesterday, the preceding week, and the preceding month, along with the words that are given the most weight in this week's model for the different classes. It also monitors the appearance of new words that were not present in the model's vocabulary, which could indicate features the model cannot appropriately score and suggest the need for retraining between the regular weekly cycles. In addition to this real-time dashboard, which the marketing department uses to monitor the response to its campaigns, the analyst develops a more detailed report covering model parameters and performance along with summary statistics of the input dataset, which they use to determine whether the weekly model training process is performing as expected, or whether the quality of the model is degrading over time.

Case study: targeted e-mail campaigns

In our next example, our same marketing department wants to promote new items on its website to the users who are most likely to be interested in purchasing them. Using a predictive model that includes features from both the users and these new items, customers are sent e-mails containing a list of their most probable purchases. Unlike the real-time sentiment-monitoring example, these e-mails are sent in batches and use data accumulated over a customer's whole transaction history as inputs to the model, which is a better fit for batch processing.

An overview of the processes used in this example is shown in Figure 7.

Figure 7: Diagram of e-mail targeting case study

Data input and transformation

During the initial data ingestion step, customer records stored in the company's data warehouse (a relational database system) are aggregated to generate features such as the average amount spent per week, the frequency with which a customer visits the company's website, and the number of items purchased in a number of categories, such as furniture, electronics, clothing, and media. These are combined with a set of features for the items that are potentially promoted in the e-mail campaign, such as price, brand, and the average rating of similar items on the site. These features are constructed through a batch process that runs once per week, on Mondays, before the e-mails are sent to customers.

Sanity checking

The inputs to the model are checked for reasonable values: are the average purchase behaviors or transaction volumes of a customer far outside the expected range? These could indicate errors in the data warehouse processing, or bot traffic on the website. Because the transformation logic involved in constructing features for the model is complex and may change over time as the model evolves, its outputs are also checked. For example, purchase counts and average prices should never be less than zero, and no category of merchandise should have zero records.

Following scoring of potential items prior to e-mail messaging, the top-scoring items per customer are sanity checked by comparing them to either the customer's historical transactions (to determine if they are sensible), or if no history is available, to the purchases of customers most similar in demographics.

Model development

In this example, the model is a random forest regression (Chapter 4, Connecting the Dots with Models – Regression Methods) that divides historical customer–item pairs into purchases (labeled 1) and non-purchases (labeled 0) and produces a scored probability that customer A purchases item X. One complexity in this model is that items which haven't been purchased might simply not have been seen by the customer yet, so a restriction is imposed in which the negative examples must be drawn from items already available for a month or more on the website. The hyperparameters of this model (the number of trees and the size of each tree) are calibrated during weekly retraining, along with the influence of individual variables on the resulting predictions.
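
A rough sketch of this step is shown below: a random forest fit on hypothetical customer–item feature pairs labeled as purchase (1) or non-purchase (0), with predicted probabilities then used to rank this week's new items for a single customer. The text describes a random forest regression over these labels; the sketch uses a classifier's predicted probability, which plays the same scoring role, and every feature and value here is synthetic.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(4)

    # Hypothetical historical pairs: [avg_weekly_spend, item_price, item_avg_rating];
    # label 1 = purchased, 0 = not purchased (drawn from items listed >= 1 month).
    X_hist = rng.normal(size=(500, 3))
    y_hist = (0.8 * X_hist[:, 0] - 0.5 * X_hist[:, 1] + 0.3 * X_hist[:, 2]
              + rng.normal(scale=0.5, size=500) > 0).astype(int)

    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_hist, y_hist)
    print("feature importances:", forest.feature_importances_)

    # Score this week's new items for one customer and keep the top three.
    new_item_features = rng.normal(size=(10, 3))
    scores = forest.predict_proba(new_item_features)[:, 1]
    top_three = np.argsort(scores)[::-1][:3]
    print("recommended item indices:", top_three, "scores:", scores[top_three])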

Scoring

After the model is retrained each week using historical data, the new items on the website are scored with this model for each customer, and the top three are sent in the e-mail campaign.

Visualization and reporting

Both classes of sanity checks (of the input data and of model performance) can be part of a regular diagnostics report on the model. Because the random forest model is more complex than other approaches, it is particularly important to monitor changes in feature importance and model accuracy, as problems may require more time to debug and resolve.

Because the predictions are used in a production system rather than delivering insights themselves, this reporting is primarily used by the analyst who developed the pipeline rather than the other members of the marketing department.

The success of these promotional e-mails will typically be monitored over the next month, and updates on the accuracy (for example, how many e-mails led to purchases above expected levels) can form the basis of a longer-term report that can help guide both the structure of the campaign itself (for example, varying the number of items in the messages) and the model (perhaps training should be performed more frequently if the predictions seem to become significantly worse between weeks).

Tip

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  • Log in or register to our website using your e-mail address and password.
  • Hover the mouse pointer on the SUPPORT tab at the top.
  • Click on Code Downloads & Errata.
  • Enter the name of the book in the Search box.
  • Select the book for which you're looking to download the code files.
  • Choose from the drop-down menu where you purchased this book from.
  • Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

Summary

After finishing this chapter, you should be able to describe the core components of an analytic pipeline and the ways in which they interact. We've examined the differences between batch and streaming processes, along with some of the use cases to which each type of application is well suited, and we've walked through examples using both paradigms and the design decisions needed at each step.

In the following chapters we will develop the concepts described above and go into greater detail on some of the technical terms brought up in the case studies. In Chapter 2, Exploratory Data Analysis and Visualization in Python, we will introduce interactive data visualization and exploration using open source Python tools. Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, describes how to identify groups of related objects in a dataset using clustering methods, also known as unsupervised learning. In contrast, Chapter 4, Connecting the Dots with Models – Regression Methods, and Chapter 5, Putting Data in its Place – Classification Methods and Analysis, explore supervised learning, whether for continuous outcomes such as prices (using the regression techniques in Chapter 4, Connecting the Dots with Models – Regression Methods) or categorical responses such as user sentiment (using the classification models described in Chapter 5, Putting Data in its Place – Classification Methods and Analysis). Given a large number of features, or complex data such as text or images, we may benefit from performing dimensionality reduction, as described in Chapter 6, Words and Pixels – Working with Unstructured Data. Alternatively, we may fit textual or image data using more sophisticated models such as the deep neural networks covered in Chapter 7, Learning from the Bottom Up – Deep Networks and Unsupervised Features, which can capture complex interactions between input variables. In order to use these models in business applications, we will develop a web framework to deploy analytical solutions in Chapter 8, Sharing Models with Prediction Services, and describe ongoing monitoring and refinement of the system in Chapter 9, Reporting and Testing – Iterating on Analytic Systems.

Throughout, we will emphasize both how these methods work and practical tips for choosing between different approaches for various problems. Working through the code examples will illustrate the required components for building and maintaining an application for your own use case. With these preliminaries, let's dive next into some exploratory data analysis using notebooks: a powerful way to document and share analysis.


Key benefits

  • Master open source Python tools to build sophisticated predictive models
  • Learn to identify the right machine learning algorithm for your problem with this forward-thinking guide
  • Grasp the major methods of predictive modeling and move beyond the basics to a deeper level of understanding

Description

The volume, diversity, and speed of data available has never been greater. Powerful machine learning methods can unlock the value in this information by finding complex relationships and unanticipated trends. Using the Python programming language, analysts can apply these sophisticated methods to build scalable analytic applications that deliver insights of tremendous value to their organizations. In Mastering Predictive Analytics with Python, you will learn the process of turning raw data into powerful insights. Through case studies and code examples using popular open-source Python libraries, this book illustrates the complete development process for analytic applications and how to quickly apply these methods to your own data to create robust and scalable prediction services. Covering a wide range of algorithms for classification, regression, and clustering, as well as cutting-edge techniques such as deep learning, this book illustrates not only how these methods work, but how to implement them in practice. You will learn to choose the right approach for your problem and how to develop engaging visualizations to bring the insights of predictive modeling to life.

Who is this book for?

This book is designed for business analysts, BI analysts, data scientists, or junior-level data analysts who are ready to move from a conceptual understanding of advanced analytics to becoming experts in designing and building advanced analytics solutions using Python. You're expected to have basic development experience with Python.

What you will learn

  • Gain an insight into components and design decisions for an analytical application
  • Master the use of Python notebooks for exploratory data analysis and rapid prototyping
  • Get to grips with applying regression, classification, clustering, and deep learning algorithms
  • Discover the advanced methods to analyze structured and unstructured data
  • Find out how to deploy a machine learning model in a production environment
  • Visualize the performance of models and the insights they produce
  • Scale your solutions as your data grows using Python
  • Ensure the robustness of your analytic applications by mastering the best practices of predictive analysis

Product Details

Publication date: Aug 31, 2016
Length: 334 pages
Edition: 1st
Language: English
ISBN-13: 9781785882715



Table of Contents

  1. From Data to Decisions – Getting Started with Analytic Applications
  2. Exploratory Data Analysis and Visualization in Python
  3. Finding Patterns in the Noise – Clustering and Unsupervised Learning
  4. Connecting the Dots with Models – Regression Methods
  5. Putting Data in its Place – Classification Methods and Analysis
  6. Words and Pixels – Working with Unstructured Data
  7. Learning from the Bottom Up – Deep Networks and Unsupervised Features
  8. Sharing Models with Prediction Services
  9. Reporting and Testing – Iterating on Analytic Systems
  Index

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Empty star icon Empty star icon 3
(2 Ratings)
5 star 50%
4 star 0%
3 star 0%
2 star 0%
1 star 50%
Amazon Customer Oct 05, 2019
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Walks through cross-section of different modeling problems and full implementation of scalable service on pyspark; lots of code examples and practical advice. Equation formatting could use some work.
Amazon Verified review Amazon
AMGAustin Jun 23, 2019
Full star icon Empty star icon Empty star icon Empty star icon Empty star icon 1
Very frustrating book to try to learn from. You often have to go searching for the data because it is not included in what can be downloaded and the book provides no clear instructions about how to get it. The author also has a habit of providing instructions for how to run a model, but then gives no discussion about what the point is. I would not recommend this book.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is a customs duty/charge?

Customs duties are charges levied on goods when they cross international borders; they are taxes imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to the countries listed under the EU27 will not bear customs charges; these are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea

For shipments to countries outside the EU27, a customs duty or localized taxes may be applicable. These are charged by the recipient country, must be paid by the customer, and are not included in the shipping charges on the order.

How do I find out my customs duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin, and several other factors such as the total invoice amount, the weight and dimensions of the package, and other criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $50, then to receive your package you will have to pay an additional import tax of 19% (for example, $9.50 on a $50 order) to the courier service.
  • If you live in Turkey, and the declared value of your ordered items is over €22, then to receive your package you will have to pay an additional import tax of 18% (for example, €3.96 on a €22 order) to the courier service.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing it. Simply contact customercare@packt.com with your order details or payment transaction ID. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on its way to you, then once you receive it you can contact us at customercare@packt.com to use the returns and refunds process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (for example, where Packt Publishing agrees to replace your printed book because it arrived damaged or with a material defect); otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work, or is unacceptably late, please contact our Customer Relations Team at customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered an item (eBook, Video, or Print Book) incorrectly or accidentally, please contact the Customer Relations Team at customercare@packt.com within one hour of placing the order and we will replace or refund you the item cost.
  2. If your eBook or Video file is faulty, or a fault occurs while the eBook or Video is being made available to you (i.e., during download), then you should contact the Customer Relations Team within 14 days of purchase at customercare@packt.com, and they will be able to resolve this issue for you.
  3. You will have a choice of a replacement or a refund for the problem items (damaged, defective, or incorrect).
  4. Once the Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are requesting a refund for only one book from a multi-item order, we will refund the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

In the unlikely event that your printed book arrives damaged or with a material defect, contact our Customer Relations Team at customercare@packt.com within 14 days of receipt of the book, with appropriate evidence of the damage, and we will work with you to secure a replacement copy if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on applicable laws and regulations). A localized VAT fee is charged only to our European and U.K. customers on the eBooks, Videos, and subscriptions that they buy. GST is charged to Indian customers for eBook and Video purchases.

What payment methods can I use?

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal