
Tech Guides - Big Data

50 Articles
Guest Contributor
20 Dec 2017
8 min read

Healthcare Analytics: Logistic Regression to Reduce Patient Readmissions

We bring you another guest post by Benjamin Rogojan, this time on how logistic regression can help the healthcare sector reduce patient readmissions. Ben's previous post on ensemble methods to optimize machine learning models is also available for a quick read.

ER visits are not cheap for any party involved, whether that is the patient or the insurance company. However, this does not stop some patients from being regular repeat visitors. These recurring visits are often due to a lack of intervention for problems such as substance abuse, chronic diseases and mental illness. This increases costs for everybody in the healthcare system and reduces quality of care by contributing to overcrowded Emergency Departments (EDs).

Research teams at UW and other universities are partnering with companies like KenSci to figure out how to reduce readmission rates. The ability to predict the likelihood of a patient's readmission allows for targeted intervention, which in turn helps reduce the frequency of readmissions. This makes the population healthier and hopefully reduces the estimated 41.3 billion USD in healthcare costs for the entire system.

How do they plan to do it? With big data and statistics, of course. A plethora of algorithms is available to data scientists approaching this problem. Many possible variables could affect readmission and medical costs, and there are many different ways researchers might pose their questions. However, the researchers at UW and many other institutions have focused heavily on reducing the readmission rate simply by trying to predict whether or not a person will be readmitted. In particular, this team of researchers was curious about chronic ailments. Patients with chronic ailments are likely to have unpredictable flare-ups that require immediate attention.
Being able to predict if a patient will have an ER visit can lead to managing the cause more effectively. One approach taken by the data science team at UW, as well as the Department of Family and Community Medicine at the University of Toronto, was to use logistic regression to predict whether or not a patient would be readmitted. Patient readmission can be broken down into a binary output: either the patient is readmitted or not. As such, logistic regression has been a useful model in my experience for approaching this problem.

Logistic Regression to predict patient readmissions

Why do data scientists like to use logistic regression? Where is it used? And how does it compare to other algorithms? Logistic regression is a statistical method that statisticians and data scientists use to classify people, products, entities, etc. It is used for analyzing data that produces a binary classification based on one or many independent variables. This means it produces two clear classifications (Yes or No, 1 or 0, etc.). With the example above, the binary classification would be: is the patient readmitted or not? Other examples could be whether to give a customer a loan or not, whether a medical claim is fraudulent or not, or whether a patient has diabetes or not.

Despite its name, logistic regression does not produce the same kind of output as linear regression. There are some similarities: the right-hand side of the equation below is a linear model, very similar to a linear equation. But the final output is based on the log odds:

     log(π / (1 − π)) = β0 + x · β

Linear regression and multivariate regression both take one to many independent variables and produce some form of continuous function. Linear regression could be used to predict the price of a house, a person's age or the price an e-commerce site should display to each customer. The output is not limited to only a few discrete classifications.
Logistic regression, by contrast, produces discrete classifications. For instance, an algorithm using logistic regression could be used to classify whether a certain stock price would be above or below $50 a share. Linear regression would be used to predict whether a share would be worth $50.01, $50.02, etc.

Logistic regression is a calculation that uses the odds of a certain classification. In the equation above, the symbol you might know as pi (π) actually represents the probability of the outcome, and π / (1 − π) is the corresponding odds. To reduce the error rate, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5. This creates a linear classifier: the decision boundary lies where β0 + x · β = 0, which corresponds to p = 0.5, and wherever β0 + x · β < 0 we have p < 0.5 and predict Y = 0. By estimating coefficients that predict the logit transformation, the method lets us classify the characteristic of interest. Now that is a lot of complex math mumbo jumbo. Let's try to break it down into simpler terms.

Probability vs. Odds

Let's start with probability. Say a patient has a probability of 0.6 of being readmitted. Then the probability that the patient won't be readmitted is 0.4. Now we want to convert this into odds, which is what the formula above is doing. You take 0.6/0.4 and get odds of 1.5. That means the odds of the patient being readmitted are 1.5 to 1. If instead the probability was 0.5 for both being readmitted and not being readmitted, the odds would be 1:1.

The next step in the logistic regression model is to take the odds and get the "log odds". You do this by putting the 1.5 into the log portion of the equation. With the natural log that logistic regression uses, you get about 0.41 (a base-10 log would give about 0.18). In logistic regression, we don't actually know p. That is what we are essentially trying to find and model using various coefficients and input variables. Each input provides a value that changes how likely an event is to occur. All of these coefficients are used to calculate the log odds.
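The probability, odds and log-odds steps above can be checked with a few lines of Python. This is a minimal sketch and not the researchers' model: the function names, the two features (age and a prior-visit flag) and the coefficient values are all made up for illustration.

```python
import math


def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1 - p)


def log_odds(p):
    """Natural log of the odds (the logit transformation)."""
    return math.log(odds(p))


def probability(z):
    """Invert the logit: turn log odds back into a probability."""
    return 1 / (1 + math.exp(-z))


def predict(intercept, coefficients, features, threshold=0.5):
    """Return (probability, class) for one patient.

    The log odds are modeled as intercept + sum(coefficient * feature).
    """
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    p = probability(z)
    return p, int(p >= threshold)


# Worked example from the text: p = 0.6 gives odds of about 1.5.
print(round(odds(0.6), 2), round(log_odds(0.6), 2))

# Hypothetical coefficients for (age, has_prior_visit); not fitted to any data.
p, label = predict(-2.0, [0.03, 0.8], [50, 1])
print(round(p, 2), label)
```

In practice the coefficients would be estimated from historical patient data by a statistics or machine learning library rather than typed in by hand; the sketch only shows how a fitted model turns inputs into a probability and then into a yes/no prediction.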
This model can take multiple variables like age, sex, height, etc. and specify how much of an effect each variable has on the odds that an event will occur.

Once the initial model is developed, then comes the work of deciding its value. How does a business go from creating an algorithm inside a computer to translating it into action? Some of us like to say the "computers" are the easy part. Personally, I find the hard part to be the "people". After all, at the end of the day, it comes down to business value. Will an algorithm save money or not? That means it has to be applied in real life. This could take the form of a new initiative, strategy, product recommendation, etc. You need to find the outliers that are worth going after! For instance, go back to the patient readmission example. The algorithm points out patients with high probabilities of being readmitted. However, if the readmission costs are low, they will probably be ignored, sadly. That is how businesses (including hospitals) look at problems.

Logistic regression is a great tool for binary classification. It is unlike many other algorithms that estimate continuous variables or estimate distributions. This statistical method can be used to classify whether a person is likely to get cancer because of environmental variables like proximity to a highway, smoking habits, etc. It has been used effectively in the medical, financial and insurance industries for a while. Knowing when to use which algorithm takes time. However, the more problems a data scientist faces, the faster they will recognize whether to use logistic regression or decision trees. Using logistic regression gives healthcare institutions the opportunity to accurately target at-risk individuals who should receive a more tailored behavioral health plan to help improve their daily health habits. This in turn opens the opportunity for better health for patients and lower costs for hospitals.
About the Author

Benjamin Rogojan

Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems, both solo and with a company called Acheron Analytics. He has experience both working hands-on with technical problems and helping leadership teams develop strategies to maximize their data.

Sunith Shetty
20 Dec 2017
12 min read

Understanding Sentiment Analysis and other key NLP concepts

This article is an excerpt from the book Big Data Analytics with Java by Rajat Mehta. The book will help you learn to perform big data analytics tasks using machine learning concepts such as clustering, recommending products, data segmentation and more.

In this post, you will learn what sentiment analysis is and how it is used to analyze the emotions associated with text. You will also learn key NLP concepts such as tokenization and stemming, among others, and how they are used for sentiment analysis.

What is sentiment analysis?

One form of text analysis is sentiment analysis. As the name suggests, this technique is used to figure out the sentiment or emotion associated with the underlying text. So if you have a piece of text and you want to understand what kind of emotion it conveys (for example, anger, love, hate, positive, negative, and so on), you can use sentiment analysis. It is used in various places, for example:

  • To analyze whether the reviews of a product are positive or negative. This can be especially useful for predicting how successful your new product is by analyzing user feedback.
  • To analyze the reviews of a movie to check if it's a hit or a flop
  • To detect the use of bad language (such as heated language, negative remarks, and so on) in forums, emails, and social media
  • To analyze the content of tweets or information on other social media to check if a political party campaign was successful or not

Thus, sentiment analysis is a useful technique, but before we see the code for our sample sentiment analysis example, let's understand some of the concepts needed to solve this problem.
For working on a sentiment analysis problem we will be using some techniques from natural language processing, and we will be explaining some of those concepts.

Concepts for sentiment analysis

Before we dive into the fully fledged problem of analyzing the sentiment behind text, we must understand some concepts from the NLP (Natural Language Processing) perspective. We will explain these concepts now.

Tokenization

From the perspective of machine learning, one of the most important tasks is feature extraction and feature selection. When the data is plain text, we need some way to extract the information out of it. We use a technique called tokenization, where the text content is pulled and tokens or words are extracted from it. A token can be a single word or a group of words. There are various ways to extract the tokens, as follows:

  • By using regular expressions: Regular expressions can be applied to textual content to extract words or tokens from it.
  • By using a ready-made tokenizer: Apache Spark ships with a tokenizer in its machine learning feature package. You can apply it to a piece of text and it will return the results as a set of tokens.

To understand a tokenizer using an example, let's look at a simple sentence:

Sentence: "The movie was awesome with nice songs"

Once you extract tokens from it you will get an array of strings as follows:

Tokens: ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs']

The type of tokens you extract depends on the type of tokens you are interested in. Here we extracted single tokens, but tokens can also be a group of words, for example, 'very nice', 'not good', 'too bad', and so on.

Stop words removal

Not all the words present in the text are important.
Some words are common words in the English language that are important for keeping the grammar correct, but from the perspective of conveying information or emotion they might not be important at all, for example, common words such as is, was, were, the, and so. To remove these words there are again some common techniques that you can use from natural language processing, such as:

  • Store stop words in a file or dictionary and compare your extracted tokens with the words in this dictionary or file. If they match, simply ignore them.
  • Use a ready-made stop words remover. Apache Spark ships with one in the Spark feature package.

Let's try to understand stop words removal using an example:

Sentence: "The movie was awesome with nice songs"

From the sentence we can see that the common words with no special meaning to convey are the, was, and with. So after applying the stop words removal program to this data you will get:

After stop words removal: ['movie', 'awesome', 'nice', 'songs']

In the preceding sentence, the stop words the, was, and with are removed.

Stemming

Stemming is the process of reducing a word to its base or root form. For example, look at the set of words shown here:

car, cars, car's, cars'

From the perspective of sentiment analysis, we are only interested in the main word that these refer to. The reason is that the underlying meaning of the word is the same in every case. So whether we pick car's or cars we are referring to a car only. Hence the stem or root word for the previous set of words will be:

car, cars, car's, cars' => car (stem or root word)

For English words, you can again use a pre-trained model and apply it to a set of data to figure out the stem word.
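The three preprocessing steps covered so far (tokenization, stop words removal and stemming) can be sketched in plain Python. This is an illustrative toy and not the book's Spark code: the stop word list is a small made-up sample, and `naive_stem` is a hypothetical helper that only strips apostrophes and a trailing "s", which is far cruder than a real stemmer.

```python
import re

# A tiny, hypothetical stop word list; real lists are much longer.
STOP_WORDS = {"the", "is", "was", "were", "with", "and", "so"}


def tokenize(text):
    """Extract word tokens with a regular expression."""
    return re.findall(r"[A-Za-z']+", text)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def naive_stem(word):
    """Toy stemmer: lowercase, drop apostrophes, strip one trailing 's'."""
    stem = word.lower().replace("'", "")
    if stem.endswith("s") and len(stem) > 3:
        stem = stem[:-1]
    return stem


tokens = tokenize("The movie was awesome with nice songs")
print(tokens)
print(remove_stop_words(tokens))
print({naive_stem(w) for w in ["car", "cars", "car's", "cars'"]})
```

A rule this simple will mangle many English words (for example, it would turn "glass" into "glas"), which is one reason established stemming algorithms are preferred over hand-written rules.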
Of course, there are more complex and better ways (for example, you can retrain the model with more data), or you may have to use a totally different model or technique if you are dealing with languages other than English. Diving into stemming in detail is beyond the scope of this book, and we would encourage readers to check out documentation on natural language processing from Wikipedia and the Stanford NLP website.

To keep the sentiment analysis example in this book simple we will not be stemming our tokens, but we urge readers to try it to get better predictive results.

N-grams

Sometimes a single word conveys the meaning of the context; other times a group of words conveys the meaning better. For example, 'happy' is a word that in itself conveys happiness, but 'not happy' changes the picture completely: it is the exact opposite of 'happy'. If we extract only single words, then in this example 'not' and 'happy' would be two separate tokens and the entire sentence might be classified as positive. However, if the classifier picks up bi-grams (that is, two words in one token), it would be trained with 'not happy' and would classify similar sentences containing 'not happy' as negative. Therefore, for training our models we can use uni-grams, bi-grams (two words per token) or, as the name suggests, n-grams (n words per token). It all depends on which token set trains our model well and improves the accuracy of its predictions.
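As a rough illustration of the idea, n-grams can be generated by sliding a window of n words across the token list. The `ngrams` function below is a hypothetical helper written for this sketch. Note that a sliding window produces overlapping tokens ('The movie', 'movie was', and so on), which differs from the non-overlapping grouping shown in the example table for this section.

```python
def ngrams(tokens, n):
    """Return all n-word tokens obtained by sliding a window over tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["The", "movie", "was", "awesome", "with", "nice", "songs"]

print(ngrams(tokens, 1))  # uni-grams: the original tokens
print(ngrams(tokens, 2))  # bi-grams: overlapping two-word tokens
print(ngrams(["not", "happy"], 2))
```

With overlapping windows, a phrase such as 'not happy' is captured wherever it occurs in the sentence, not only when it happens to fall on a pair boundary.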
To see examples of n-grams refer to the following table:

Sentence: The movie was awesome with nice songs
Uni-grams: ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs']
Bi-grams: ['The movie', 'was awesome', 'with nice', 'songs']
Tri-grams: ['The movie was', 'awesome with nice', 'songs']

For the purpose of this case study we will only be looking at uni-grams to keep our example simple.

By now we know how to extract words from text and remove the unwanted words, but how do we measure the importance of words or the sentiment that originates from them? There are a few popular approaches, and we will now discuss two of them.

Term presence and term frequency

Term presence just means that if the term is present we mark the value as 1, or else 0. Later we build a matrix out of it where the rows represent the words and the columns represent each sentence. This matrix is later used for text analysis by feeding its content to a classifier.

Term frequency, as the name suggests, is just the count of occurrences of the word or token within the document. Let's refer to the example in the following table, where we find the term frequency:

Sentence: The movie was awesome with nice songs and nice dialogues.
Tokens (uni-grams only for now): ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs', 'and', 'dialogues']
Term frequency: ['The = 1', 'movie = 1', 'was = 1', 'awesome = 1', 'with = 1', 'nice = 2', 'songs = 1', 'and = 1', 'dialogues = 1']

As seen in the preceding table, the word 'nice' occurs twice in the sentence and hence it will get more weight in determining the opinion expressed by the sentence.
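Both measures are easy to compute by hand. The sketch below uses Python's standard `collections.Counter`; the function names and the three-word vocabulary are assumptions chosen only for illustration.

```python
from collections import Counter


def term_frequency(tokens):
    """Count how many times each token occurs in the document."""
    return Counter(tokens)


def term_presence(tokens, vocabulary):
    """Mark 1 if a vocabulary term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


tokens = "The movie was awesome with nice songs and nice dialogues".split()

tf = term_frequency(tokens)
print(tf["nice"], tf["movie"])

# A hypothetical vocabulary; 'boring' does not occur in the sentence.
vocabulary = ["awesome", "boring", "nice"]
print(term_presence(tokens, vocabulary))
```

Computing one presence vector per sentence against a shared vocabulary and stacking those vectors gives the term presence matrix described above.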
Plain term frequency is not a precise approach for the following reasons:

  • There could be some redundant, irrelevant words (for example, the, it, and they) that have a high frequency or count and might skew the training of the model.
  • There could be some important rare words that convey the sentiment of the document, yet their frequency might be low and hence they might not have much impact on the training of the model.

For this reason, a better approach, TF-IDF, is used, as described next.

TF-IDF

TF-IDF stands for Term Frequency and Inverse Document Frequency, and in simple terms it measures the importance of a term to a document. It works in two simple steps:

  • It counts the number of times a term occurs in the document; the higher the count, the greater the importance of this term to the document.
  • Counting just the frequency of words in a document is not a very precise way to find the importance of the words. The simple reason is that there could be many stop words whose count is high, so their importance might get elevated above that of the words that really matter. To fix this, TF-IDF checks for the presence of these stop words in other documents as well. If the words appear in other documents in large numbers, that suggests they are grammatical words such as they, for, is, and so on, and TF-IDF decreases the importance or weight of such words.

Let's try to understand TF-IDF using a figure in which doc-1, doc-2, and so on are the documents from which we extract the tokens or words, and from those words we calculate the TF-IDFs. Words that are stop words or regular words such as for, is, and so on have low TF-IDFs, while words that are rare, such as 'awesome movie', have higher TF-IDFs. TF-IDF is the product of term frequency and inverse document frequency.
Both of them are explained here:

  • Term Frequency: This is nothing but the count of the occurrences of the word in the document. There are other ways of measuring this, but the simplest approach is to just count the occurrences of the tokens. The simple formula for its calculation is:

     Term Frequency = Frequency count of the tokens

  • Inverse Document Frequency: This is the measure of how much information the word provides. It scales up the weight of words that are rare and scales down the weight of frequently occurring words. A common form of the formula for inverse document frequency is:

     Inverse Document Frequency = log(Total number of documents / Number of documents containing the term)

  • TF-IDF: TF-IDF is a simple multiplication of the term frequency and the inverse document frequency. Hence:

     TF-IDF = Term Frequency × Inverse Document Frequency

This simple technique is very popular and is used in a lot of places for text analysis. Next, let's look at another simple approach, called bag of words, that is also used in text analytics.

Bag of words

As the name suggests, bag of words uses a simple approach whereby we first extract the words or tokens from the text and then push them into a bag (an imaginary set). The main point is that the words are stored in the bag without any particular order. Thus the mere presence of a word in the bag is what matters, and the order in which the word occurs in the sentence, as well as its grammatical context, carries no value. Since bag of words gives no importance to the order of words, you can take the TF-IDFs of all the words in the bag, put them in a vector and later train a classifier (Naive Bayes or any other model) with it. Once trained, the model can be fed vectors of new data to predict their sentiment.

Summing up, you are now familiar with the sentiment analysis techniques and NLP concepts needed to apply sentiment analysis. If you want to implement machine learning algorithms to carry out predictive analytics and real-time streaming analytics, you can refer to the book Big Data Analytics with Java.
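To make the TF-IDF and bag-of-words ideas concrete, here is a minimal sketch in plain Python. It assumes the unsmoothed textbook form IDF = log(N / document frequency); libraries commonly use smoothed variants, so their numbers will differ. The function names and the three toy documents are made up for illustration and are not from the book.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Weight = raw term count * log(number of documents / documents containing the term).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def to_vector(doc_weights, vocabulary):
    """Bag of words: place weights in a fixed vocabulary order, ignoring word order."""
    return [doc_weights.get(term, 0.0) for term in vocabulary]


docs = [
    ["the", "movie", "was", "awesome"],
    ["the", "movie", "was", "boring"],
    ["the", "songs", "were", "nice"],
]

weights = tf_idf(docs)
vocabulary = sorted({term for tokens in docs for term in tokens})
vectors = [to_vector(w, vocabulary) for w in weights]

print(weights[0])
print(vectors[0])
```

In this toy corpus, 'the' appears in every document and so receives a weight of zero, while 'awesome' appears in only one document and receives the highest weight in the first document. The resulting vectors are the kind of input that could then be passed to a classifier.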

Aarthi Kumaraswamy
16 Dec 2017
2 min read

Handpicked for your Weekend Reading - 15th Dec, 2017

As you gear up for the holiday season and the year-end celebrations, make a resolution to spend a fraction of your weekends in self-reflection and in honing your skills for the coming year. Here is the best of the DataHub for your reading this weekend. Watch out for our year-end special edition in the last week of 2017!

NIPS Special Coverage

  • A deep dive into Deep Bayesian and Bayesian Deep Learning with Yee Whye Teh
  • How machine learning for genomics is bridging the gap between research and clinical trial success by Brendan Frey
  • 6 Key Challenges in Deep Learning for Robotics by Pieter Abbeel

For the complete coverage, visit here.

Experts in Focus

  • Ganapati Hegde and Kaushik Solanki, Qlik experts from Predoole Analytics on How Qlik Sense is driving self-service Business Intelligence

3 things you should know that happened this week

  • Generative Adversarial Networks: Google open sources TensorFlow-GAN (TFGAN)
  • “The future is quantum”: Are you excited to write your first quantum computing code using Microsoft’s Q#?
  • “The Blockchain to Fix All Blockchains” – Overledger, the meta blockchain, will connect all existing blockchains

Try learning/exploring these tutorials this weekend

  • Implementing a simple Generative Adversarial Network (GANs)
  • How Google’s MapReduce works and why it matters for Big Data projects
  • How to write effective Stored Procedures in PostgreSQL
  • How to build a cold-start friendly content-based recommender using Apache Spark SQL

Do you agree with these insights/opinions?

  • Deep Learning is all set to revolutionize the music industry
  • 5 reasons to learn Generative Adversarial Networks (GANs) in 2018
  • CapsNet: Are Capsule networks the antidote for CNNs kryptonite?
  • How AI is transforming the manufacturing Industry

Aaron Lazar
05 Dec 2017
8 min read

Stitch Fix: Full Stack Data Science and other winning strategies

Last week, a company in San Francisco was popping bottles of champagne to celebrate its achievements. And trust me, they're not small at all. Not even a couple of weeks have gone by since it was listed on the stock market, and it has already soared by over 50%.

Stitch Fix is an apparel company run by co-founder and CEO Katrina Lake. In a span of just 6 years, she has built the company up to an annual revenue of a whopping $977 million or so. The company has been disrupting traditional retail and aims to bridge the gap of personalised shopping that traditional retail can't close. Stitch Fix is more of a personalized stylist than a traditional apparel company. It works in 3 basic steps:

  • Filling a Style Profile: Clients are prompted to fill out a style profile, where they share their style, price and size preferences.
  • Setting a Delivery Date: The clients set a delivery date as per their availability. Stitch Fix mixes and matches various clothes from their warehouses and comes up with the top 5 items that they feel would best suit the client, based on the initial style profile as well as years of experience in styling.
  • Keep or Send Back: The clothes reach the customer on the selected date, and the customer can try them on, keep whatever they like and send back what they don't.

The aim of Stitch Fix is to bring a personal touch to clothes shopping. According to Lake, “There are millions and millions of products out there. You can look at eBay and Amazon. You can look at every product on the planet, but trying to figure out which one is best for you is really the challenge”, and that's the tear Stitch Fix aims to sew up. In an interview with eMarketer, Julie Bornstein, COO of Stitch Fix, said, “Over a third of our customers now spend more than half of their apparel wallet share with Stitch Fix. They are replacing their former shopping habits with our service.”

So what makes Stitch Fix stand out among its competitors? How do they do it?
You see, Stitch Fix is not just any apparel company. It has created the perfect formula by blending human expertise with just the right amount of Data Science to serve its customers. When we talk about the kind of Data Science that Stitch Fix does, we're talking about a relatively new and exciting term that's on the rise: Full Stack Data Science.

Hello Full Stack Data Science!

For those of you who've heard of this before, cheers! I hope you've had the opportunity to experience its benefits. For those of you who haven't heard of the term, Full Stack Data Science basically means a single data scientist does their own work (mining data, cleaning it, writing an algorithm to model it and then visualizing the results) while also stepping into the shoes of an engineer, implementing the model, and of a project manager, tracking the entire process and ensuring it's on track. Now while this might sound like a lot for one person to do, it's quite possible and practical. It's practical because when these roles are performed by different individuals, the hand-offs introduce a lot of latency into the project. Moreover, synchronizing the priorities of each individual is close to impossible, which creates differences within the team.

The Data (Science) team at Stitch Fix is broadly categorized based on the area they work on. Because most of the team focuses on full stack, there are over 80 Data Scientists on board. That's a lot of smart people in one company! On a serious note, although unique, this kind of team structure has been doing well for them, mainly because it gives each one the freedom to work independently.

Tech Treasure Trove

When you open up Stitch Fix's tech toolbox, you won't find Aladdin's lamp glowing before you. Their magic lies in having a simple tech stack that works wonders when implemented the right way. They work with Ruby on Rails and Bootstrap for their web applications, which are hosted on Heroku.
Their data platform relies on a robust Postgres implementation. Among programming languages, we found Python, Go, Java and JavaScript also being used. For an ML framework, we're pretty sure they're playing with TensorFlow. But just working with these tools isn't enough to get to the level they're at. There's something more under the hood. And believe it or not, it's not some gigantic artificial intelligence system running on a zillion cores! Rather, it's all about the smaller, simpler things in life. For example, if you have 3 different kinds of data and you need to find a relationship between them, instead of bringing in the big guns (read: deep learning frameworks), a simple tensor decomposition using word vectors would do the deed quite well.

Advantages galore: Food for the algorithms

One of the main advantages Stitch Fix has is almost 5 years' worth of client data. This data is obtained from clients in several ways, such as through a Client Profile, After-Delivery Feedback, Pinterest photos, etc. All this data is put through algorithms that learn more about the likes and dislikes of clients. Some interesting algorithms that feed on this sumptuous data include collaborative filtering recommenders to group clients based on their likes, mixed-effects modeling to learn about a client's interests over time, neural networks to derive vector descriptions of the Pinterest images and compare them with in-house designs, NLP to process customer feedback and Markov chain models to predict demand, among several others.

A human Touch: When science meets art

While the machines do all the calculations and come up with recommendations on what designs customers would appreciate, they still lack the human touch. Stitch Fix employs over 3000 stylists. Each client is assigned a stylist who can see all of the client's preferences at a glance through a custom-built interface.
The stylist finalizes the selections from the inventory list and adds a personal note that describes how the client can accessorize the purchased items for a particular occasion and how they can pair them with other pieces of clothing in their closet. This truly bears out the idea that “Humans are much better with the machines, and the machines are much better with the humans”. Cool, ain't it?

Data Platform

Apart from the Heroku platform, Stitch Fix seems to have internal SaaS platforms where the data scientists carry out analysis, write algorithms and put them into production. The platforms exhibit properties like data distribution, parallelization, auto-scaling, failover, etc. This lets the data scientists focus on the science aspect while still enjoying the benefits of a scalable system.

The good, the bad and the ugly: Microservices, Monoliths and Scalability

Scalability is one of the most important aspects a new company needs to take into account before taking the plunge. Using a microservice architecture helps with this by allowing small independent services/mini applications to run on their own. Stitch Fix uses this architecture to improve scalability, although their database is a monolith. They are now breaking the monolithic database up into microservices. This is a takeaway for all entrepreneurs just starting out with their app.

Data Driven Applications

Data-driven applications ensure that the right solutions are built for customers. If you're a customer-centric organisation, there's something you can learn from Stitch Fix. Data-driven apps seamlessly combine the operational and analytic capabilities of the organisation, thus breaking down the traditional silos.

TDD + CD = DevOps Simplified

Test Driven Development and Continuous Delivery go hand in hand, and it's always better to adopt this culture right from the very start.

In the end, it's really great to see such creative and technology-driven start-ups succeed and sail to the top.
If you’re on the journey to building that dream startup of yours and you need resources for your team, here are a few books you’ll want to pick up to get started:
Hands-On Data Science and Python Machine Learning by Frank Kane
Data Science Algorithms in a Week by Dávid Natingga
Continuous Delivery and DevOps: A Quickstart Guide - Second Edition by Paul Swartout
Practical DevOps by Joakim Verona

Amarabha Banerjee
30 Nov 2017
5 min read

Points to consider while prepping your data for your data science project

[box type="note" align="" class="" width=""]In this article by Jen Stirrup & Ruben Oliva Ramos, from their book Advanced Analytics with R and Tableau, we look at the steps involved in prepping for any data science project, taking the example of a data classification project using R and Tableau.[/box]

Business Understanding

When we are modeling data, it is crucial to keep the original business objectives in mind. These business objectives will direct the subsequent work in the data understanding, preparation and modeling steps, and the final evaluation and selection (after revisiting earlier steps if necessary) of a classification model or models. At later stages, this will help to streamline the project because we will be able to keep the model's performance in line with the original requirement while retaining a focus on ensuring a return on investment from the project. The main business objective is to identify individuals who are higher earners so that they can be targeted by a marketing campaign. For this purpose, we will mine demographic data in order to create a classification model in R. The model should determine whether individuals earn a salary that is above or below $50K per annum.

Working with Data

In this section, we will use Tableau as a visual data preparation tool in order to prepare the data for further analysis. Here is a summary of some of the things we will explore:
Columns that do not add any value to the model
Columns that have so many missing categorical values that they do not predict the outcome reliably
Missing values in the columns

The dataset used in this project has around 49,000 records. You can see from the files that the data has been divided into a training dataset and a test set. The training dataset contains approximately 32,000 records and the test dataset around 16,000 records.
It's helpful to note that there is a column that indicates the salary level, that is, whether it is greater than or less than fifty thousand dollars per annum. This can be called a binomial label, which basically means that it can hold one of two possible values. When we import the data, we can filter for records where no income is specified. There is one record that has a NULL, and we can exclude it. Here is the filter: Let's explore the binomial label in more detail. How many records belong to each label? Let's visualize the finding. Quickly, we can see that 76 percent of the records in the dataset have a class label of <50K. Let's browse the data in Tableau in order to see what it looks like. From the grid, it's easy to see that there are 14 attributes in total. We can see the characteristics of the data:
Seven polynomials: workclass, education, marital-status, occupation, relationship, race, native-country
One binomial: sex
Six continuous attributes: age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week

From the preceding chart, we can see that nearly 2 percent of the records are missing a country, and the vast majority of individuals are from the United States. This means that we could consider the native-country feature as a candidate for removal from the model creation, because the lack of variation means that it isn't going to add anything interesting to the analysis.

Data Exploration

We can now visualize the data in boxplots, so we can see the range of the data. In the first example, let's look at the age column, visualized as a boxplot in Tableau: We can see that the age values follow a different pattern for each income level. When we look at education, we can also see a difference between the two groups: We can focus on age and education, while discarding other attributes that do not add value, such as native-country.
The fnlwgt column does not add value because it is specific to the census collection process. When we visualize the race feature, we note that the White value appears in 85 percent of all cases. This means that it is not likely to add much value to the predictor: Now, we can look at the number of years that people spend in education. When the education number attribute is plotted, we can see that the lower values predominate in the <50K class and the higher values, meaning more time spent in education, are more common in the >50K class. We can see this finding in the following figure: This finding may indicate some predictive capability in the education feature. The visualization suggests that there is a difference between the two groups, since the group that earns over $50K per annum does not appear much in the lower education levels.

To summarize, we will focus on age and education as providing some predictive capability in determining the income level. The purpose of the model is to classify people by their earning level. Now that we have visualized the data in Tableau, we can use this information to model and analyze the data in R.

If you liked this article, please be sure to check out Advanced Analytics with R and Tableau, which contains this article and many useful analytics techniques with R and Tableau.
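The article performs its checks visually in Tableau before modeling in R. Purely as an illustration of the same three checks (class balance of the label, share of missing values, and how strongly one value dominates a column), here is a minimal Python sketch. The records are made up and far smaller than the real dataset, and the helper names `class_share`, `missing_share` and `dominant_share` are hypothetical, not part of the book's code.

```python
from collections import Counter

LABEL = "income"

# Made-up mini-sample shaped like the census data; None marks a missing value.
records = [
    {"age": 39, "education-num": 13, "native-country": "United-States", "income": "<50K"},
    {"age": 50, "education-num": 13, "native-country": "United-States", "income": "<50K"},
    {"age": 38, "education-num": 9, "native-country": "United-States", "income": "<50K"},
    {"age": 53, "education-num": 7, "native-country": "United-States", "income": "<50K"},
    {"age": 28, "education-num": 13, "native-country": "Mexico", "income": "<50K"},
    {"age": 37, "education-num": 14, "native-country": "United-States", "income": ">50K"},
    {"age": 52, "education-num": 9, "native-country": None, "income": ">50K"},
    {"age": 31, "education-num": 14, "native-country": "United-States", "income": None},
]

# Exclude records with no income label, as the article does with its NULL record.
labelled = [r for r in records if r[LABEL] is not None]


def class_share(rows, label):
    """Share of each label value among the given rows."""
    counts = Counter(r[label] for r in rows)
    return {value: n / len(rows) for value, n in counts.items()}


def missing_share(rows, column):
    """Fraction of rows where the column has no value."""
    return sum(r[column] is None for r in rows) / len(rows)


def dominant_share(rows, column):
    """Most frequent non-missing value and its share; a share near 1.0 means little variation."""
    values = [r[column] for r in rows if r[column] is not None]
    top_value, top_count = Counter(values).most_common(1)[0]
    return top_value, top_count / len(values)


shares = class_share(labelled, LABEL)
missing = missing_share(labelled, "native-country")
top_value, dominant = dominant_share(labelled, "native-country")
print(shares, missing, top_value, dominant)
```

A column with a high dominant share, like native-country here, is a candidate for removal on the same reasoning the article gives.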

Savia Lobo
24 Nov 2017
7 min read

Cyber Monday Special: Can Walmart beat Amazon at its own game? Probably, if they go full-throttle AI

Long gone are the days when people would go from one store to another to buy that beautiful pink dress or a particular pair of shoes. The e-commerce revolution has boosted online shopping drastically. How many times have we heard that physical stores are dying? And yet they seem to have a cat-like hold on their lifespan. Wonder why? Because not everyone likes shopping online. There is a group of people who still prefer to buy from a brick and mortar store. They are like doubting Thomas: remember the touch-and-believe concept? For customers who love shopping physically in a store, retailers strive to create a fascinating shopping experience in a way that an online platform cannot offer. This is especially important for increasing sales and generating profits in peak festive seasons such as Black Friday or New Year’s. A lot has been said about the wonders retail analytics data can do for e-commerce sites; read this article for instance. But not as much is said about traditional stores. So, here we have listed 10 retail analytics options for offline retailers to capture maximum customer attention and retention.

1. Location analytics and proximity marketing

A large number of retail stores collect data to analyze the volume of customers buying online and offline. They use this data for internal retail analytics, which helps them track merchandise, adjust staffing levels, monitor promotions and so on. Retailers use location analytics to detect the sections of the store with high customer traffic. Proximity marketing uses location-based technology to communicate with customers through their smartphones. Customers receive targeted offers and discounts based on their proximity to the product. For instance, 20% off on the floral dress to your right. Such on-the-go deals have a higher likelihood of resulting in a sale.
Euclid Analytics provides solutions to track the buying experience of every visitor in the store. This helps retailers retarget and rethink their strategies to influence sales at an individual customer level.

2. Music systems

Nowadays, most large retail formats have music systems set up for the customers. A playlist with a mix of genres is an ideal fit for the variety of customers visiting the store. Used well, the right tempo, volume, and genre can lift a customer’s mood and lead to a purchase. Retailers also have to think carefully about the choice of playlist, as music preferences differ with the generation, behavior, and mood of the customer. Store owners opt for music services with a customized playlist to create an influential buying factor. Atmos Select provides a customized music service to retailers. It takes customer demographics into consideration to draft an audio branding strategy, which is then used to design an audio solution for retail outlets and stores.

3. Guest WiFi

Guest WiFi benefits customers by giving them a free internet connection while they shop. Who would not want that? However, retailers benefit from such an offering too. In-store WiFi provides them with detailed customer analytics and enables them to track various shopping patterns. Cloud4Wi offers Volare guest Wi-Fi, which provides free WiFi service to customers within the retail store. It provides a faster and easier login option to connect to the WiFi. It also collects customer data so that retailers can build unique and selective marketing lists.

4. Workforce tools

Unison among the staff members within the work environment creates positivity in store. To increase communication between the staff, workforce tools are put to use. These are various messaging applications and work-planner platforms that help in maintaining a rapport among the staff members.
They help empower employees to manage their work-life, check overtime details, attendance, and more. Branch, a tool to improve workforce productivity, supports internal messaging networks and also notifies employees about their shift timings and other details.

5. Omnichannel retail analytics

Omnichannel retail gives customers an interactive and seamless shopping experience across platforms. Additionally, with the data collected from different digital channels, retailers get an overview of a customer’s shopping journey and the choices they made over time. Omnichannel analytics also helps them show personalized shopping ads based on a customer’s social media habits. Intel offers solutions for omnichannel analytics which help retailers increase customer loyalty and generate substantial revenue growth.

6. Dressing Room Technology

The mirror within the trial room knows it all! Retailers can attract maximum customer traffic with mirror technology. It is an interactive, touch screen mirror that allows customers to request new items and adjust the lights in the trial room. The mirror can also sense products that the customer brings in, using RFID technology, and recommend similar products. It also lets customers save products to their online accounts, in case they decide to purchase them later, or digitally seek assistance from a store associate. Oak Labs has created one such mirror, which transforms the customer's trial room experience while bridging the gap between technology and retail.

7. Pop-ups and kiosks

Pop-ups are mini-outlets for large retail formats, set up to sell a seasonal product, whereas kiosks are temporary alternatives for retailers to attract a high number of footfalls in store. Both pop-ups and kiosks give shoppers the choice of self-service. They get an option to shop from the store’s physical as well as online product offering. They not only enable secure purchases but also deliver orders to your doorstep.
Such techniques attract customers to choose retail shopping over online shopping. Withme is a startup firm that offers a platform to set up pop-ups for retail outlets and brands.

8. Inventory management

Managing the inventory is a major task for a store manager: placing the right product in the right place at the right time. Predictive analytics helps optimize inventory management for proper allocation and replenishment. It also equips retailers to mark down inventory for clearance to make room for a new batch. Celect, an inventory management startup, helps retailers analyze customer preferences and simultaneously map future demand for the product. It also helps in extracting existing data from the inventory to gain meaningful insights. Such insights can then be used to sell inventory faster and to produce a detailed, retail-analytics-based sales report.

9. Smart receipts and ratings

Retailers continuously aim to provide better quality service to the customer. Receiving a 5-star rating for their service in return is like the cherry on the cake. For higher customer engagement, retailers offer smart receipts, which help them collect customer email addresses to send promotional offers or festive sale discounts. Retailers also provide customers with personalized offerings and incentives in order to encourage repeat visits. To know how well they have fared in providing services, retailers set up a digital kiosk at the checkout area, where in-store customers can rate them based on the shopping experience. Startup firms such as TruRating provide retailers with a rating mechanism for shoppers at the checkout. FlexReceipts helps retailers set up a smart receipt application for their customers.

10. Shopping cart tech

Retailers can now provide a next-gen shopping cart to their customers: a tablet-equipped shopping cart that can guide the customer’s in-store shopping journey.
The tablet uses machine vision to keep track of the shelves as the cart moves within the store. It also displays digital ads to promote each product the shopping cart passes. Focal Systems builds this kind of technical assistance for retailers, which can give tough competition to their online counterparts.

Online shopping is convenient, but more often than not we still crave the look and feel of a product and the immersive shopping experience, especially during holidays and festive occasions. And that’s the USP of a brick and mortar shop. Offline retailers who know their data and know how to leverage retail analytics using advances in machine learning and retail tech stand a chance to provide their customers with a shopping experience superior to that of their online counterparts.
Sugandha Lahoti
24 Nov 2017
10 min read

Black Friday Special: 17 ways in 2017 that online retailers use machine learning

Black Friday sales are just around the corner. Both online and traditional retailers have geared up to race past each other in the ultimate shopping frenzy of the year. Although both brick and mortar retailers and online platforms will generate high sales, online retailers will sweep past the offline platforms. Why? The best part of shopping online is that customers don’t have to deal with pushy crowds, traffic, salespeople, and long queues. Online shoppers have access to a much larger array of products. They can also switch between stores by just switching between tabs on their smart devices. Considering the surge of shoppers expected in such peak seasons, Big Data analytics is a helpful tool for online retailers. With the advances in machine learning, Big Data analytics is no longer confined to the technology landscape; it also represents a way for retailers to connect with consumers in a purposeful way. For retailers, both big and small, adopting the right ML powered Big Data analytics strategy would help in increasing their sales, retaining their customers and generating high revenues. Here are 17 reasons why data is an important asset for retailers, especially on the 24th, this month. A. Improving site infrastructure The first thing that a customer sees when landing on an e-commerce website is the UI, ease of access, product classification, number of filters, etc. Hence building an easy to use website is paramount. Here’s how ML powered Big Data analytics can help: [toggle title="" state="close"] 1. E-commerce site analysis A complete site analysis is one of the ways to increase sales and retain customers. By analyzing page views and actual purchases, bounce rates, and least popular products, the e-commerce website can be altered for better usability. Data mining techniques can also be used to enhance website features. 
This includes web mining, which is used to extract information from the web, and log files, which contain information about the user. For time-bound sales like Black Friday and Cyber Monday, this is quite helpful for better product placement, removing unnecessary products and showcasing products which cater to a particular user base. 2. Generating test data Generation of test data helps in a deeper analysis which helps in increasing sales. Big data analytics can give a helping hand here by organizing products based upon the type, shopper gender and age group, brands, pricing, number of views of each product page, and the information provided for that product. During peak seasons such as Black Friday,  ML powered data analytics can analyze most visited pages and shopper traffic flow for better product placements and personalized recommendations.[/toggle] B. Enhancing Products and Categories Every retailer in the world is looking for ways to reduce costs without sacrificing the quality of their products. Big data analytics in combination with machine learning is of great help here. [toggle title="" state="close"] 3. Category development Big Data analytics can help in building up of new product categories, or in eliminating or enhancing old ones. This is possible by using machine learning techniques to analyze patterns in the marketing data as well as other external factors such as product niches. ML powered assortment planning can help in selecting and planning products for a specified period of time, such as the Thanksgiving week,  so as to maximize sales and profit. Data analytics can also help in defining Category roles in order to clearly define the purpose of each category in the total business lifecycle. This is done to ensure that efforts made around a particular category, actually contribute to category development. It also helps to identify key categories, which are the featured products that specifically meet an objective for e.g. 
Healthy food items, Cheap electronics, etc. 4. Range selection An optimum and dynamic product range is essential to retain customers. Big data analytics can utilize sales data and shopper history to measure a product range for maximum profitability. This is especially important for Black Friday and Cyber Monday deals where products are sold at heavily discounted rates. 5. Inventory management Data analytics can give an overview of best selling products, non-performing or slow moving products, seasonal products and so on. These data pointers can help retailers manage their inventory and reduce the associated costs. Machine learning powered Big data analytics are also helpful in making product localization strategies i.e. which product sells well in what areas. In order to localize for China, Amazon changed its China branding to Amazon.cn. To make it easy for Chinese to pay, Amazon China introduced portable POS so users can pay the delivery guy via credit card at their doorstep. 6. Waste reduction Big Data analytics can analyze sales and reviews to identify products which don’t do well, and either eliminate the product or combine them with a companion well-doing product to increase its sales. Analysing data can also help in listing products that were returned due to damages and defects. Generating insights from this data using machine learning models can be helpful to retailers in many ways. Some examples are: they can modify their stocking methods, improve on their packaging and logistic support for those kinds of products. 7. Supply chain optimization Big Data analytics also have a role to play in Supply chain optimization. This includes using sales and forecast data to plan and manage goods from retailers to warehouses to transport, onto the doorstep of customers. Top retailers like Amazon, are offering deals under the Black Friday space for the entire week. 
Expanding the sale window is a great supply chain optimization technique for a more manageable selling.[/toggle] C. Upgrading the Customer experience Customers are the most important assets for any retailer. Big Data analytics is here to help you retain, acquire, and attract your customers. [toggle title="" state="close"] 8. Shopper segmentation Machine learning techniques can link and analyze granular data such as behavioral, transactional and interaction data to identify and classify customers who behave in similar ways. This eliminates the guesswork associated and helps in creating rich and highly dynamic consumer profiles. According to a report by Research Methodology, Walmart uses a mono-segment type of positioning targeted to single customer segment. Walmart also pays attention to young consumers due to the strategic importance of achieving the loyalty of young consumers for long-term perspectives. 9. Promotional analytics An important factor for better sales is analyzing how customers respond to promotions and discount. Analyzing data on an hour-to-hour basis on special days such as Black Friday or Cyber Monday, which have high customer traffic, can help retailers plan for better promotions and lead to brand penetration. The Boston consulting group uses data analytics to accurately gauge the performance of promotions and predict promotion performance in advance. 10. Product affinity models By analyzing a shopper’s past transaction history, product affinity models can track customers with the highest propensity of buying a particular product. Retailers can then use this for attracting more customers or providing the existing ones with better personalizations. Product affinity models can also cluster products that are mostly bought together, which can be used to improve recommendation systems. 11. Customer churn prediction The massive quantity of customer data being collected can be used for predicting customer churn rate. 
Customer churn prediction is helpful in retaining customers, attracting new ones, and also acquiring the right type of customers in the first place. Classification models such as Logistic regression can be used to predict customers most likely to churn. As part of the Azure Machine Learning offering, Microsoft has a Retail Customer Churn Prediction Template to help retail companies predict customer churns.[/toggle] D. Formulating and aligning business strategies Every retailer is in need of tools and strategies for a product or a service to reach and influence the consumers, generate profits, and contribute to the long-term success of the business. Below are some pointers depicting how ML powered Big Data Analytics can help retailers do just that. [toggle title="" state="close"] 12. Building dynamic pricing models Pricing models can be designed by looking at the customer’s purchasing habits and surfing history. This descriptive analytics can be fed into a predictive model to obtain an optimal pricing model such as price sensitivity scores, and price to demand elasticity.  For example, Amazon uses a dynamic price optimization technique by offering its biggest discounts on its most popular products, while making profits on less popular ones. IBM’s Predictive Customer Intelligence can dynamically adjust the price of a product based on customer’s purchase decision. 13. Time series analysis Time series analysis can be used to identify patterns and trends in customer purchases, or a product’s lifecycle by observing information in a sequential fashion. It can also be used to predict future values based on the sequence so generated. For online retailers this means using historical sales data to forecast future sales, analyzing time-dependent patterns to list new arrivals, mark up prices or lower them down depending events such as Black Friday or Cyber Monday sales etc. 14. 
Demand forecasting Machine learning powered Big Data analytics can learn demand levels from a wide array of factors such as product nature, characteristics, seasonality, relationships with other associated products, relationships with other market factors, etc. It can then forecast the demand associated with a particular product using a simulation model. Such predictive analytics can be quite accurate and can also reduce costs, especially for events like Black Friday, where there is a high surge of shoppers. 15. Strategy Adjustment Predictive Big Data analytics can help shorten the go-to-market time for product launches, allowing marketers to adjust their strategy midcourse if needed. For Black Friday or Cyber Monday deals, an online retailer can predict the demand for a particular product and amend strategies along the way, such as increasing the discount, or keeping a product at the discounted rate for a longer time, etc. 16. Reporting and sales analysis Big Data analytics tools can analyze large quantities of retail data quickly. Also, most such tools have a simple UI dashboard which gives retailers detailed answers to their queries in a single click. This saves a lot of time that was previously spent creating reports or sales summaries. Reports generated from a data analytics tool are fast and easy to understand. 17. Marketing mix spend optimization Forecasting sales and proving ROI of marketing activities are two pain points faced by most retailers. Marketing Mix Modelling is a big data statistical analysis which uses historical data to show the impact of marketing activities on sales and then forecasts the impact of future marketing tactics. Insights derived from such tools can be used to enhance marketing strategies and optimize the costs.[/toggle] By adopting the strategies mentioned above, retailers can maximize their gains this holiday season, starting with Black Friday, which begins as the clock chimes 12 today.  
Machine learning powered Big Data analytics is there to help retailers attract new shoppers, retain them, enhance their product line, define new categories, and formulate and align business strategies. Gear up for a Big Data Black Friday this 2017!
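As a rough illustration of the product affinity idea from point 10 (clustering products that are mostly bought together), the sketch below simply counts how often pairs of products share a basket. The baskets are invented, and `pair_counts` and `bought_with` are hypothetical helper names; a production recommender would use far more data and a proper association or recommendation model.

```python
from collections import Counter
from itertools import combinations

# Invented transaction history: each set is one customer's basket.
baskets = [
    {"tv", "hdmi cable", "soundbar"},
    {"tv", "hdmi cable"},
    {"laptop", "mouse"},
    {"tv", "soundbar"},
    {"laptop", "mouse", "hdmi cable"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("a", "b") and ("b", "a") the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def bought_with(product, baskets, top_n=3):
    """Products most often bought together with the given product."""
    related = Counter()
    for (a, b), n in pair_counts(baskets).items():
        if a == product:
            related[b] += n
        elif b == product:
            related[a] += n
    return related.most_common(top_n)


print(bought_with("laptop", baskets))
print(bought_with("tv", baskets))
```

Raw co-occurrence counts favor popular products, so a real system would usually normalize them, for example by each product's overall frequency.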

Guest Contributor
23 Nov 2017
8 min read

Why you should learn Scikit-learn

Today, machine learning in Python has become almost synonymous with scikit-learn. The "Big Bang" moment for scikit-learn was in 2007, when a gentleman named David Cournapeau decided to write this project as part of Google Summer of Code 2007. Let's take a moment to thank him. Matthieu Brucher later came on board and developed it further as part of his thesis. From that point on, sklearn never looked back. In 2010, the prestigious French research organization INRIA took ownership of the project, with great developers like Gael Varoquaux, Alexandre Gramfort et al. starting work on it. Here's the oldest pull request I could find in sklearn’s repository. The title says "we're getting there"! From there to today, where sklearn receives funding and support from Google, Telecom ParisTech and Columbia University among others, it surely must’ve been quite a journey.

Sklearn is an open source library which uses the BSD license. It is widely used in industry as well as in academia. It is built on Numpy, Scipy and Matplotlib while also having wrappers around various popular libraries such as LIBSVM. Sklearn can be used “out of the box” after installation.

Can I trust scikit-learn?

Scikit-learn, or sklearn, is a very active open source project with brilliant maintainers. It is used worldwide by top companies such as Spotify, booking.com and the like. That it is open source, where anyone can contribute, might make you question the integrity of the code, but from the little experience I have contributing to sklearn, let me tell you that only very high-quality code gets merged. All pull requests have to be approved by at least two core maintainers of the project. Every piece of code goes through multiple iterations. While this can be time-consuming for all the parties involved, such rules keep sklearn in line with industry standards at all times. You don’t just build a library that’s been awarded the “best open source library” overnight!

How can I use scikit-learn?
Sklearn can be used for a wide variety of use-cases ranging from image classification to music recommendation to classical data modeling.

Scikit-learn in various industries:

In the image classification domain, Sklearn’s implementation of K-Means along with PCA has been applied to handwritten digit data very successfully in the past. Sklearn has also been used for face recognition using SVM with PCA. Image segmentation tasks such as detecting Red Blood Corpuscles or segmenting the popular Lena image into sections can be done using sklearn.

A lot of us here use Spotify or Netflix and are awestruck by their recommendations. Recommendation engines started off with the collaborative filtering algorithm. It basically says “if people like me like something, I’ll also most probably like that.” To find users with similar tastes, a KNN algorithm can be used, which is available in sklearn. You can find a good demonstration of how it is used for music recommendation here.

Classical data modeling can be bolstered using sklearn. Most people start their Kaggle competitive journeys with the Titanic challenge. One of the better tutorials on starting out is by dataquest and acts as a good introduction to using pandas and sklearn (a lethal combination!) for data science. It uses the robust Logistic Regression, Random Forest and ensembling modules to guide the user. You will be able to experience the user-friendliness of sklearn first hand while completing this tutorial. Sklearn has made machine learning literally a matter of importing a package.

Sklearn also helps in anomaly detection for highly imbalanced datasets (99.9% to 0.1% in credit card fraud detection) through a host of tools like EllipticEnvelope and OneClassSVM. In this regard, the recently merged IsolationForest algorithm works especially well on higher dimensional datasets and performs very well.
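To make the anomaly detection point concrete, here is a minimal sketch using IsolationForest on synthetic data. The data is randomly generated rather than real transaction data, and the `contamination` value of 0.01 is an assumed anomaly share chosen for this toy example, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic stand-in for transaction features: 1000 "normal" points near the origin...
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))
# ...and 10 obvious outliers placed far away from them.
outliers = rng.normal(loc=8.0, scale=1.0, size=(10, 5))
X = np.vstack([normal, outliers])

# contamination is our assumed share of anomalies in the data.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)  # +1 = inlier, -1 = flagged as anomaly

n_flagged = int((labels == -1).sum())
flagged_outliers = int((labels[-10:] == -1).sum())
print("flagged:", n_flagged, "of which planted outliers:", flagged_outliers)
```

On real, heavily imbalanced data the anomalies are rarely this well separated, so the flagged points are best treated as candidates for review.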
Other than that, sklearn has implementations of some widely used algorithms such as linear regression, decision trees, SVM and Multi-Layer Perceptrons (neural networks) to name a few. It has around 39 models in the “linear models” module itself! Happy scrolling here! Most of these algorithms can run very fast compared to raw Python code since they are implemented in Cython and use Numpy and Scipy (which in turn use C) for low-level computations.

How is sklearn different from TensorFlow/MLlib?

TensorFlow is a popular library for implementing deep learning algorithms (since it can utilize GPUs). But while it can also be used to implement machine learning algorithms, the process can be arduous. To implement logistic regression in TensorFlow, you first have to “build” the logistic regression algorithm using a computational graph approach. Scikit-learn, on the other hand, provides the same algorithm out of the box, with the limitation that it has to be done in memory. Here's a good example of how LogisticRegression is done in TensorFlow.

Apache Spark’s MLlib, on the other hand, consists of algorithms which can be used out of the box just like in Sklearn; however, it is generally used when the ML task is to be performed in a distributed setting. If your dataset fits into RAM, Sklearn would be a better choice for the task. If the dataset is massive, most people prototype on a small subset of the dataset locally using Sklearn. Once prototyping and experimentation are done, they deploy in the cluster using MLlib.

Some sklearn must-knows

Machine learning problems are commonly grouped into three kinds: supervised learning, unsupervised learning and reinforcement learning (ahem AlphaGo). Scikit-learn covers the first two; it does not ship reinforcement learning algorithms. Unsupervised learning happens when one doesn’t have ‘y’ labels in their dataset. Dimensionality reduction and clustering are typical examples.
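As a small illustration of both unsupervised techniques just mentioned, the sketch below reduces scikit-learn's bundled digits dataset to two dimensions with PCA and then clusters the result with K-Means, without ever touching the labels. The choice of 2 components and 10 clusters is an assumption made for this example, not a tuned setting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1797 handwritten digit images, each flattened to 64 pixel features.
# The labels in digits.target are deliberately ignored here.
digits = load_digits()
X = digits.data

# Dimensionality reduction: project the 64 features onto 2 principal components.
pca = PCA(n_components=2, random_state=0)
X_reduced = pca.fit_transform(X)

# Clustering: group the projected points into 10 clusters.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_reduced)

print(X_reduced.shape, len(set(cluster_ids.tolist())))
print("variance kept by 2 components:", pca.explained_variance_ratio_.sum())
```

Two components keep only part of the variance, so the clusters will not line up cleanly with the ten digit classes; more components usually help.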
Scikit-learn has implementations of variations of Principal Component Analysis such as SparsePCA, KernelPCA, and IncrementalPCA, among others. Supervised learning covers problems such as spam detection, rent prediction etc. In these problems, the ‘y’ label for the dataset is present. Models such as linear regression, random forest, AdaBoost etc. are implemented in sklearn.

    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression().fit(train_X, train_y)
    preds = clf.predict(test_X)

Model evaluation and analysis

Cross-validation, grid search for parameter selection and prediction evaluation can be done using the model_selection and metrics modules, which implement functions such as cross_val_score and f1_score respectively, among others. They can be used as such:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import f1_score
    cross_val_avg = np.mean(cross_val_score(clf, train_X, train_y, scoring='f1'))
    # tune your parameters for a better cross_val_score
    # for model results on a certain classification problem
    f_measure = f1_score(test_y, preds)

Model Saving

Simply pickle your model using pickle.dump and it is ready to be distributed and deployed! Hence a whole machine learning pipeline can be built easily using sklearn.

Finishing Remarks

There are many good books out there about machine learning, but in the context of Python, Sebastian Raschka (one of the core developers on sklearn) recently released his book titled “Python Machine Learning”, and it’s in great demand. Another great blog you could follow is Erik Bernhardsson’s blog. Along with writing about machine learning, he also discusses software development and other interesting ideas. Do subscribe to the scikit-learn mailing list as well. There are some very interesting questions posted there and a lot of learnings to take home.
The machine learning subreddit also collates information from a lot of different sources and is thus a good place to find useful information. Scikit-learn has revolutionized the machine learning world by making it accessible to everyone. Machine learning is not like black magic anymore. If you use scikit-learn and like it, do consider contributing to sklearn. There is a huge clutter of open issues and PRs on the sklearn GitHub page. Scikit-learn needs contributors! Have a look at this page to start contributing. Contributing to a library is easily the best way to learn it! [author title="About the Author"]Devashish Deshpande started his foray into data science and machine learning in 2015 with an online course when the question of how machines can learn started intriguing him. He pursued more online courses as well as courses in data science during his undergrad. In order to gain practical knowledge he started contributing to open source projects beginning with a small pull request in Scikit-Learn. He then did a summer project with Gensim and delivered workshops and talks at PyCon France and India in 2016. Currently, Devashish works in the data science team at belong.co, India. Here's the link to his GitHub profile.[/author]

Savia Lobo
21 Nov 2017
8 min read

Data science folks have 12 reasons to be thankful for this Thanksgiving

We are nearing the end of 2017, and with each closing chapter we have remarkable achievements to be thankful for. For the data science community, this year was filled with new technologies, tools, version updates and more. 2017 saw blockbuster releases such as PyTorch, TensorFlow 1.0 and Caffe 2, among many others. We invite data scientists, machine learning experts, and other data science professionals to come together this Thanksgiving Day and thank the organizations that made our interactions with AI easier, faster, better and generally more fun. Let us recall our blessings in 2017, one month at a time...

[dropcap]Jan[/dropcap] Thank you, Facebook and friends, for handing us PyTorch

Hola 2017! While the world was still in the New Year mood, a brand new deep learning framework was released. Facebook, along with a few other partners, launched PyTorch. PyTorch came as an improvement to the popular Torch framework, supporting Python instead of the less popular Lua. Because PyTorch code behaves like ordinary Python, it is easier to debug and to extend. Another notable change was the adoption of a dynamic computational graph, which is built on the fly with high speed and flexibility.

[dropcap]Feb[/dropcap] Thanks, Google, for TensorFlow 1.0

February brought data scientists a Valentine’s gift with the release of TensorFlow 1.0. Announced at the first annual TensorFlow Developer Summit, TensorFlow 1.0 was faster, more flexible, and production-ready. Here’s what the TensorFlow box of chocolates contained:

  • Full compatibility with Keras
  • Experimental APIs for Java and Go
  • New Android demos for object and image detection, localization, and stylization
  • A brand new TensorFlow debugger
  • An introductory glance at XLA--a domain-specific compiler for TensorFlow graphs

[dropcap]Mar[/dropcap] We thank Francois Chollet for making Keras 2 a production-ready API

Congratulations! Keras 2 is here.
This was great news for data science developers, as Keras 2, a high-level neural network API, allowed faster prototyping. It supported both CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks). Keras has a user-friendly API designed specifically for humans. It also allowed easy creation of modules, which makes it well suited to advanced research. Developers can code in Python, a compact, easy-to-debug language.

[dropcap]Apr[/dropcap] We like Facebook for brewing us Caffe 2

Data scientists were greeted by a fresh aroma of coffee this April, as Facebook released the second version of its popular deep learning framework, Caffe. Caffe 2 arrived as an easy-to-use deep learning framework for building DL applications and leveraging community contributions of new models and algorithms. It came with first-class support for large-scale distributed training, new hardware support, mobile deployment, and the flexibility for future high-level computational approaches. It also provided easy methods to convert DL models built in the original Caffe to the new version. Caffe 2 also came with over 400 different operators--the basic units of computation in Caffe 2.

[dropcap]May[/dropcap] Thank you, Amazon, for supporting Apache MXNet on AWS, and Google for your TPU

May brought some exciting launches from two tech giants, Amazon and Google. Amazon Web Services brought Apache MXNet on board, and Google’s second-generation TPU chips were announced. Apache MXNet, now available on AWS, allowed developers to build machine learning applications that train quickly and run anywhere, making it a scalable approach for developers. Next up were Google’s second-generation TPU (Tensor Processing Unit) chips, designed to speed up machine learning tasks. These chips were supposed to be (and are) more capable than CPUs and even GPUs.
[dropcap]Jun[/dropcap] We thank Microsoft for CNTK v2

The middle of the month arrived with Microsoft’s announcement of version 2 of its Cognitive Toolkit. The new Cognitive Toolkit was enterprise-ready, offered production-grade AI, and allowed users to create, train, and evaluate their own neural networks, scaling to multiple GPUs. It also included Keras API support, faster model compression, Java bindings, and Spark support, along with a number of new tools for running trained models on low-powered devices such as smartphones.

[dropcap]Jul[/dropcap] Thank you, Elastic.co, for bringing ML to the Elastic Stack

July made machine learning generally available to Elastic Stack users with version 5.5. With ML, anomaly detection on Elasticsearch time series data became possible. This allows users to analyze the root cause of problems in a workflow and thus reduce false positives. To know about the changes or highlights of this version visit here.

[dropcap]Aug[/dropcap] Thank you, Google, for your Deeplearn.js

August announced the arrival of Google’s Deeplearn.js, an initiative that allowed machine learning models to run entirely in a browser. Deeplearn.js was an open source, WebGL-accelerated JS library. It offered an interactive client-side platform that helped developers carry out rapid prototyping and visualizations. Developers could now use hardware accelerators such as the GPU via WebGL to perform faster computations with 2D and 3D graphics. Deeplearn.js also allowed TensorFlow models to be imported into the browser. Surely something to be thankful for!

[dropcap]Sep[/dropcap] Thanks, Splunk and SQL, for your upgrades

September’s surprises came with the release of Splunk 7.0, which helps bring machine learning to the masses with an added Machine Learning Toolkit that is scalable, extensible, and accessible. It includes native support for metrics, which speeds up query processing performance by 200x.
Other features include seamless event annotations, improved visualization, faster data model acceleration, and a cloud-based self-service application. September also brought the release of MySQL 8.0, which included first-class support for Unicode 9.0. Other features included:

  • Extended support for native JSON data
  • Window functions and recursive SQL syntax for queries that were previously impossible or difficult to write
  • Added document-store functionality

So, big thanks for the Splunk and SQL upgrades.

[dropcap]Oct[/dropcap] Thank you, Oracle, for the Autonomous Database Cloud, and Microsoft for SQL Server 2017

As fall arrived, Oracle unveiled the world’s first Autonomous Database Cloud. It provided full automation of tuning, patching, updating and maintaining the database. It was self-scaling, i.e., it instantly resized compute and storage without downtime and with low manual administration costs. It was also self-repairing and guaranteed 99.995 percent reliability and availability. That’s a lot of reduction in workload! Next, developers were greeted with the release of SQL Server 2017, a major step towards making SQL Server a platform. It included multiple enhancements to the Database Engine, such as adaptive query processing, automatic database tuning, graph database capabilities, new Availability Groups, Database Tuning Advisor (DTA) etc. It also had a new Scale Out feature in SQL Server 2017 Integration Services (SSIS), and SQL Server Machine Learning Services to reflect support for the Python language.

[dropcap]Nov[/dropcap] A humble thank you to Google for TensorFlow Lite and Elastic.co for Elasticsearch 6.0

Just a month more for the year to end! The data science community has had a busy November, with too many releases to keep an eye on and Microsoft Connect(); to spill the beans. So, November, thank you for TensorFlow Lite and Elastic 6.
TensorFlow Lite, a lightweight product for mobile and embedded devices, is designed to be:

  • Lightweight: It allows inference of on-device machine learning models with a small binary size, allowing faster initialization/startup.
  • Fast: Model loading time is dramatically improved, with support for hardware acceleration.
  • Cross-platform: It includes a runtime tailor-made to run on various platforms, starting with Android and iOS.

And now for Elasticsearch 6.0, which was made generally available with features such as easy upgrades, index sorting, better shard recovery, and support for sparse doc values. There are other new features spread across the rest of the Elastic Stack, which includes Kibana, Beats and Logstash. These are Elasticsearch’s solutions for visualization and dashboards, data ingestion and log storage.

[dropcap]Dec[/dropcap] Thanks in advance, Apache, for Hadoop 3.0

Christmas gifts may arrive for data scientists in the form of the general availability of Hadoop 3.0. The new version is expected to include support for erasure coding in HDFS, version 2 of the YARN Timeline Service, shaded client jars, support for more than two NameNodes, MapReduce task-level native optimization, and support for opportunistic containers and distributed scheduling, to name a few. It would also include a rewritten version of the Hadoop shell scripts with bug fixes, improved compatibility and many changes to some existing installation procedures.

Phew! That was a long list of tools for data scientists and developers to be thankful for this year. Whether it be new frameworks, libraries or a new set of software, each one of them is unique and helpful for creating data-driven applications. Hopefully, you have used some of them in your projects. If not, be sure to give them a try, because 2018 is all set to overload you with new and even more amazing tools, frameworks, libraries, and releases.

Amey Varangaonkar
20 Nov 2017
6 min read

How self-service analytics is changing modern-day businesses

To stay competitive in today’s economic environment, organizations can no longer be reliant on just their IT team for all their data consumption needs. At the same time, the need to get quick insights to make smarter and more accurate business decisions is now stronger than ever. As a result, there has been a sharp rise in a new kind of analytics where the information seekers can themselves create and access a specific set of reports and dashboards - without IT intervention. This is popularly termed as Self-service Analytics. Per Gartner, Self-service analytics is defined as: “A  form of business intelligence (BI) in which line-of-business professionals are enabled and encouraged to perform queries and generate reports on their own, with nominal IT support.” Expected to become a $10 billion market by 2022, self-service analytics is characterized by simple, intuitive and interactive BI tools that have basic analytic and reporting capabilities with a focus on easy data access. It empowers business users to access relevant data and extract insights from it without needing to be an expert in statistical analysis or data mining. Today, many tools and platforms for self-service analytics are already on the market - Tableau, Microsoft Power BI, IBM Watson, Qlikview and Qlik Sense being some of the major ones. Not only have these empowered users to perform all kinds of analytics with accuracy, but their reasonable pricing, in-tool guidance and the sheer ease of use have also made them very popular among business users. Rise of the Citizen Data Scientist The rise in popularity of self-service analytics has led to the coining of a media-favored term - ‘Citizen Data Scientist’. But what does the term mean? Citizen data scientists are business users and other professionals who can perform less intensive data-related tasks such as data exploration, visualization and reporting on their own using just the self-service BI tools. 
If Gartner’s predictions are to be believed, there will be more citizen data scientists in 2019 than the traditional data scientists who will be performing a variety of analytics-related tasks. How Self-service Analytics benefits businesses Allowing the end-users within a business to perform their own analysis has some important advantages as compared to using the traditional BI platforms: The time taken to arrive at crucial business insights is drastically reduced. This is because teams don’t have to rely on the IT team to deliver specific reports and dashboards based on the organizational data. Quicker insights from self-service BI tools mean businesses can take decisions faster with higher confidence and deploy appropriate strategies to maximize business goals. Because of the relative ease of use, business users can get up to speed with the self-service BI tools/platform in no time and with very little training as compared to being trained on complex BI solutions. This means relatively lower training costs and democratization of BI analytics which in turn reduces the workload on the IT team and allows them to focus on their own core tasks. Self-service analytics helps the users to manage the data from disparate sources more efficiently, thus allowing organizations to be agiler in terms of handling new business requirements. Challenges in Self-service analytics While the self-service analytics platforms offer many benefits, they come with their own set of challenges too.  Let’s see some of them: Defining a clear role for the IT team within the business by addressing concerns such as: Identifying the right BI tool for the business - Among the many tools out there, identifying which tool is the best fit is very important. 
  • Identifying which processes and business groups can make the best use of self-service BI and who may require assistance from IT
  • Setting up the right infrastructure and support system for data analysis and reporting
  • Answering questions like - who will design complex models and perform high-level data analysis?

Thus, rather than becoming secondary to the business, the role of the IT team becomes even more important when adopting a self-service business intelligence solution.

Defining a strict data governance policy - This is a critical task, as unauthorized access to organizational data can be detrimental to the business. Identifying the right ‘power users’, i.e., the users who need access to the data and the tools, deciding the level of access to give them, and ensuring the integrity and security of the data are some of the key factors to keep in mind. The IT team plays a major role in establishing strict data governance policies and ensuring the data is safe, secure and shared with the right users for self-service analytics.

Asking the right kind of questions on the data - When users who aren’t analysts get access to data and self-service tools, asking the right questions of the data in order to get useful, actionable insights becomes highly important. Failure to perform correct analysis can result in incorrect or insufficient findings, which might lead to wrong decision-making. Regular training sessions and support systems can help a business overcome this challenge.

To read more about the limitations of self-service BI, check out this interesting article.

In Conclusion

IDC has predicted that spending on self-service BI tools will grow 2.5 times faster than spending on traditional IT-controlled BI tools by 2020. This is an indicator that many organizations worldwide, of all sizes, will increasingly see self-service analytics as a feasible and profitable way forward.
Today mainstream adoption of self-service analytics still appears to be in the early stages due to a general lack of awareness among businesses. Many organizations still depend on the IT team or an internal analytics team for all their data-driven decision-making tasks. As we have already seen, this comes with a lot of limitations - limitations that can easily be overcome by the adoption of a self-service culture in analytics, and thus boost the speed, ease of use and quality of the analytics. By shifting most of the reporting work to the power users,  and by establishing the right data governance policies, businesses with a self-service BI strategy can grow a culture that fuels agile thinking, innovation and thus is ready for success in the marketplace. If you’re interested in learning more about popular self-service BI tools, these are some of our premium products to help you get started:   Learning Tableau 10 Tableau 10 Business Intelligence Cookbook Learning IBM Watson Analytics QlikView 11 for Developers Microsoft Power BI Cookbook    
Savia Lobo
20 Nov 2017
6 min read

Looking at the different types of Lookup cache

[box type="note" align="" class="" width=""]The following is an excerpt from a book by Rahul Malewar titled Learning Informatica PowerCenter 10.x. We walk through the various types of lookup cache based on how a cache is defined in this article.[/box] Cache is the temporary memory that is created when you execute a process. It is created automatically when a process starts and is deleted automatically once the process is complete. The amount of cache memory is decided based on the property you define at the transformation level or session level. You usually set the property as default, so as required, it can increase the size of the cache. If the size required for caching the data is more than the cache size defined, the process fails with the overflow error. There are different types of caches available. Building the Lookup Cache - Sequential or Concurrent You can define the session property to create the cache either sequentially or concurrently. Sequential cache When you select to create the cache sequentially, Integration Service caches the data in a row-wise manner as the records enter the lookup transformation. When the first record enters the lookup transformation, lookup cache gets created and stores the matching record from the lookup table or file in the cache. This way, the cache stores only the matching data. It helps in saving the cache space by not storing unnecessary data. Concurrent cache When you select to create cache concurrently, Integration service does not wait for the data to flow from the source; it first caches complete data. Once the caching is complete, it allows the data to flow from the source. When you select a concurrent cache, the performance enhances as compared to sequential cache since the scanning happens internally using the data stored in the cache. Persistent cache - the permanent one You can configure the cache to permanently save the data. 
By default, the cache is created as non-persistent, that is, the cache is deleted once the session run is complete. If the lookup table or file does not change across session runs, you can reuse an existing persistent cache. Suppose you have a process that is scheduled to run every day, and you use a lookup transformation to look up a reference table that is not supposed to change for six months. With a non-persistent cache, the same data is loaded into the cache every day, which wastes time and space on each run. If you choose to create a persistent cache, the Integration Service makes the cache permanent in the form of a file in the $PMCacheDir location. You thus save the time otherwise spent creating and deleting the cache every day. When the data in the lookup table changes, you need to rebuild the cache. You can define the condition in the session task to rebuild the cache by overwriting the existing cache. To rebuild the cache, you need to check the rebuild option in the session properties.

Sharing the cache - named or unnamed

You can enhance performance and save cache memory by sharing the cache when multiple lookup transformations are used in a mapping. If both lookup transformations have the same structure, sharing lets the cache be created only once rather than multiple times, which in turn enhances performance. You can share the cache either named or unnamed.

Sharing unnamed cache

If you have multiple lookup transformations in a single mapping, you can share the unnamed cache. Since the lookup transformations are present in the same mapping, naming the cache is not mandatory. The Integration Service creates the cache while processing the first record in the first lookup transformation and shares the cache with the other lookups in the mapping.
Sharing named cache

You can share the named cache with multiple lookup transformations in the same mapping or in another mapping. Since the cache is named, you can assign the same cache using the name in the other mapping. When you process the first mapping with a lookup transformation, it saves the cache in the defined cache directory under the defined cache file name. When you process the second mapping, it searches the same location for the cache file and uses the data. If the Integration Service does not find the mentioned cache file, it creates a new cache. If you simultaneously run multiple sessions that use the same cache file, the Integration Service processes both sessions successfully only if the lookup transformations are configured to read from the cache without modifying it. If both lookup transformations try to update the cache file, or if one lookup tries to read the cache file while the other tries to update it, the session fails because of the conflict in processing. Sharing the cache helps enhance performance by reusing the cache already created. This way we save processing time and repository space by not storing the same data multiple times for lookup transformations.

Modifying cache - static or dynamic

When you create a cache, you can configure it to be static or dynamic.

Static cache

A cache is said to be static if it does not change with the changes happening in the lookup table. The static cache is not synchronized with the lookup table. By default, the Integration Service creates a static cache. The lookup cache is created as soon as the first record enters the lookup transformation. The Integration Service does not update the cache while it is processing the data.

Dynamic cache

A cache is said to be dynamic if it changes with the changes happening in the lookup table. The dynamic cache is synchronized with the lookup table.
You can choose from the lookup transformation properties to make the cache dynamic. Lookup cache is created as soon as the first record enters the lookup transformation. Integration service keeps on updating the cache while it is processing the data. The Integration service marks the record as an insert for the new row inserted in the dynamic cache. For the record that is updated, it marks the record as an update in the cache. For every record that doesn't change, the Integration service marks it as unchanged. You use the dynamic cache while you process the slowly changing dimension tables. For every record inserted in the target, the record will be inserted in the cache. For every record updated in the target, the record will be updated in the cache. A similar process happens for the deleted and rejected records.
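The difference between the two modes can be sketched in a few lines of Python. This is a conceptual model only, not Informatica code: the function names, the dictionary-based cache and the sample rows are all hypothetical, and the row flags simply mirror the insert, update and unchanged markers described above.

```python
def build_cache(lookup_rows, key):
    """Snapshot the lookup table into an in-memory dict keyed by `key`."""
    return {row[key]: dict(row) for row in lookup_rows}


def run_static(source_rows, cache, key):
    """Static cache: read-only, so keys missing at build time stay missing."""
    return ["found" if row[key] in cache else "not found" for row in source_rows]


def run_dynamic(source_rows, cache, key):
    """Dynamic cache: updated while rows are processed, one flag per row."""
    flags = []
    for row in source_rows:
        cached = cache.get(row[key])
        if cached is None:
            cache[row[key]] = dict(row)   # new key: add it to the cache
            flags.append("insert")
        elif cached != row:
            cache[row[key]] = dict(row)   # known key, changed values
            flags.append("update")
        else:
            flags.append("unchanged")
    return flags


lookup_table = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Mumbai"}]
source = [
    {"id": 1, "city": "Pune"},    # identical to the cached row
    {"id": 2, "city": "Nagpur"},  # same key, changed attribute
    {"id": 3, "city": "Delhi"},   # new key
    {"id": 3, "city": "Delhi"},   # repeat of the row just inserted
]

static_result = run_static(source, build_cache(lookup_table, "id"), "id")
dynamic_result = run_dynamic(source, build_cache(lookup_table, "id"), "id")
print(static_result)
print(dynamic_result)
```

The repeated row with id 3 shows why a dynamic cache matters for slowly changing dimensions: with a static cache the second occurrence is still reported as missing, while the dynamic cache has already absorbed the first occurrence and treats the repeat as unchanged.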

Amey Varangaonkar
14 Nov 2017
7 min read

Of perfect strikes, tackles and touchdowns: how analytics is changing sports

The rise of Big Data and Analytics is drastically changing the landscape of many businesses - and the sports industry is one of them. In today’s age of cut-throat competition, data-based strategies are slowly taking the front seat when it comes to crucial decision making - helping teams gain that decisive edge over their competition. Sports Analytics is slowly becoming the next big thing! In the past, many believed that the key to conquering the opponent in any professional sport was to make the player or the team better - be it making them stronger, faster, or more intelligent. ‘Analysis’ then was limited to mere ‘clipboard statistics’ and the intuition built by coaches on the basis of raw video footage of games. This is not the case anymore. From handling media contracts and merchandising to evaluating individual or team performance on matchday, analytics is slowly changing the landscape of sports.

The explosion of data in sports

The amount and quality of information available to decision-makers within sports organizations have increased exponentially over the last two decades. Several factors contribute to this:

  • Innovation in sports science over the last decade, which has been incredible, to say the least
  • In-depth records maintained by trainers, coaches, medical staff, nutritionists and even the sales and marketing departments
  • Improved processing power and lower cost of storage, allowing large amounts of historical data to be maintained

Of late, the adoption of motion capture technology and wearable devices has proved to be a real game-changer in sports, as every movement on the field can be tracked and recorded. Today, many teams in a variety of sports, such as the Boston Red Sox and Houston Astros in Major League Baseball (MLB), the San Antonio Spurs in the NBA, and Arsenal, Manchester City and Liverpool FC in football (soccer), are adopting analytics in different capacities.
Turning sports data into insights

Needless to say, all the crucial sports data being generated today needs equally good analytics techniques to extract the most value from it. This is where sports analytics comes into the picture. Sports analytics is the use of analytics on current as well as historical sports-related data to identify useful patterns, which can be used to gain a competitive advantage on the field of play.

Several techniques and algorithms fall under the umbrella of sports analytics. Machine learning is one of the most widely used sets of techniques that sports analysts rely on to derive insights. It is a popular form of artificial intelligence in which systems are trained on large datasets so that they give reliable predictions on previously unseen data. With the help of a variety of classification and recommendation algorithms, analysts can now identify patterns within the existing attributes of a player, and work out how those attributes can best be optimized to improve the player’s performance. Cross-validation techniques are then used to check that the models are not biased toward the data they were trained on, so that the predictions generalize to unknown datasets.

Analytics is being put to use by a lot of sports teams today, in many different ways. Here are some key use cases of sports analytics:

Pushing the limit: Optimizing player performance

Right from tracking an athlete’s heartbeats per minute to finding injury patterns, analytics can play a crucial role in understanding how an individual performs on the field. With the help of video, wearables and sensor data, it is possible to identify exactly when an athlete’s performance drops, so that corrective steps can be taken accordingly. It is now possible to assess a player’s physiological and technical attributes and work on specific drills in training to push them to an optimal level. Developing search-powered data intelligence platforms seems to be the way forward. The best example of this is Tellius, a search-based data intelligence tool which allows you to determine a player’s efficiency in terms of fitness and performance through search-powered analytics.

Smells like team spirit: Better team and athlete management

Analytics also helps coaches manage their team better. For example, Adidas has developed a system called miCoach, which works by having the players use wearables during games and training sessions. The data obtained from the devices highlights the top performers and the ones who need rest. It is also possible to identify and improve patterns in a team’s playing style, and to develop a ‘system’ that improves the efficiency of gameplay. For individual athletes, real-time stats such as speed, heart rate, and acceleration can help the trainers plan the training and conditioning sessions accordingly. Getting intelligent responses regarding player and team performances and real-time in-game tactics is something that will make the coaches’ and management’s lives a lot easier, going forward.

All in the game: Improving game-day strategy

By analyzing real-time training data, it is possible to identify the fitter, in-form players to be picked for the game. Not just that, analyzing the opposition and picking the right strategy to beat them becomes easier once you have the relevant data insights with you. Different data visualization techniques can be used not just with historical data but also with real-time data, while the game is in progress.

Splashing the cash: Boosting merchandising

What are fans buying once they’re inside the stadium? Is it the home team’s shirt, or is it their scarves and posters? What food are they eating in the stadium eateries? By analyzing all this data, retailers and club merchandise stores can stock the fan-favorite merchandise and other items in adequate quantities, so that they never run out of stock. Analyzing sales via online portals and e-stores also helps the teams identify the countries or areas where the buyers live. This is a good indicator for them to concentrate sales and marketing efforts in those regions.

Analytics also plays a key role in product endorsements and sponsorships. Determining which brands to endorse, identifying the best possible sponsor, the ideal duration of a sponsorship and the sponsorship fee: these are some key decisions that can be taken by analyzing current trends along with historical data.

Challenges in sports analytics

Although the advantages offered by analytics are there for all to see, many sports teams have still not incorporated analytics into their day-to-day operations. Lack of awareness seems to be the biggest factor here. Many teams underestimate, or still don’t understand, the power of analytics.

Choosing the right Big Data and analytics tool is another challenge. Especially when it comes to humongous amounts of data, the time investment needed to clean and format the data for effective analysis is problematic, and it is something many teams aren’t interested in.

Another challenge is that the demand for analytics skills is rising while supply is in sharp deficit, which drives salaries higher. Add to that the need for a thorough understanding of the sport in order to find effective insights from data, and it becomes even more difficult to get the right data experts.

What next for sports analytics?

Understanding data and how it can be used in sports, to improve performance and maximize profits, is now deemed by many teams to be the key differentiator between success and failure. And it’s not just success that teams are after; it’s sustained success, and analytics goes a long way in helping teams achieve that.

Gone are the days when traditional ways of finding insights were enough. Sports have evolved, and teams are now digging harder into data to get that slightest edge over the competition, which can prove to be massive in the long run.

If you found the article to be insightful, make sure you check out our interview on sports analytics with ESPN Senior Stats Analyst Gaurav Sundararaman.
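The cross-validation step mentioned in the section on turning sports data into insights can be illustrated with a small sketch. This is only a minimal example under stated assumptions, not how any particular team or tool does it: the data (training minutes versus distance covered) is made up, the model is a plain least-squares line, and the helper names `k_fold_indices`, `fit_line` and `cross_validate` are hypothetical. The idea it shows is that each fold is held out in turn, the model is fitted on the rest, and the error is measured only on data the model has not seen.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def cross_validate(xs, ys, k=5):
    """Return the mean absolute error measured on each held-out fold."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        errors.append(
            mean(abs(ys[i] - (slope * xs[i] + intercept)) for i in test_idx)
        )
    return errors


# Made-up data: weekly training minutes vs. sprint distance in a match (metres).
minutes = [30, 45, 60, 75, 90, 105, 120, 135, 150, 165]
distance = [410, 455, 498, 540, 590, 628, 671, 720, 758, 805]

fold_errors = cross_validate(minutes, distance, k=5)
print([round(e, 1) for e in fold_errors])
```

If the per-fold errors are similar to each other and to the error on the training data, that is some evidence the model generalizes; a large gap suggests it has been fitted too closely to the data it was trained on.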

Sugandha Lahoti
13 Nov 2017
6 min read

Know Your Customer: Envisaging customer sentiments using Behavioral Analytics

“All the world’s a stage, and all the men and women merely players.” Shakespeare may have considered men and women mere players, but now that a large number of users are connected through smart devices and the online world, these men and women, your customers, have become your most important assets. Knowing your customers and anticipating their sentiments using behavioral analytics has therefore become paramount.

Behavioral analytics: Tracking user events

Say you order a pizza through an app on your phone. After customizing it and choosing the crust size, type and ingredients, you land in the payment section. Suppose that, instead of paying, you abandon the order altogether. Immediately you get an SMS and an email, alerting you that you are just a step away from buying your choice of pizza.

So how does this happen? Behavioral analytics runs in the background here. By tracking user navigation, it prompts the user to complete an order, or offers a suggestion.

The rise of smart devices has enabled almost everything to transmit data. Most of this data is captured across sessions of user activity and is in raw form. By user activity we mean social media interactions, the amount of time spent on a site, the user’s navigation path, click activity, responses to changes in the market, purchasing history and much more. Some form of understanding is therefore required to make sense of this raw and scrambled data and to find definite patterns in it. Here’s where behavioral analytics steps in. It goes through a user’s entire e-commerce journey and focuses on understanding the what and how of their activities. Based on this, it predicts their future moves. This, in turn, helps to generate opportunities for businesses to become more customer-centric.

Why Behavioral analytics over traditional analytics

Earlier analytical tools lacked a single architecture and a simple workflow. Although they assisted with tracking clicks and page loads, they required a separate data warehouse and separate visualization tools, which resulted in an unstructured workflow. Behavioral analytics goes a step beyond standard analytics by combining rule-based models with deep machine learning. Where the former tells what the users do, the latter reveals the how and why of their actions. Together, they keep track of where customers click, which pages are viewed, how many users continue down the process, and who abandons a website at which step, among other things.

Unlike traditional analytics, behavioral analytics aggregates data from diverse sources (websites, mobile apps, CRM, email marketing campaigns etc.) collected across various sessions. Cloud-based behavioral analytics platforms can intelligently integrate and unify all sources of digital communication into a complete picture, offering a seamless and structured view of the entire customer journey.

Such platforms typically capture real-time data in raw format. They then automatically filter and aggregate this data into a structured dataset. They also provide visualization tools to observe this data, all the while predicting trends. The data is aggregated in such a way that the business can query it in many different ways. These platforms are therefore helpful for analyzing retention and churn trends, tracing abnormalities, performing multidimensional funnel analysis and much more. Let’s look at some specific use cases across industries where behavioral analytics is widely used.

Analysing customer behavior in E-commerce

E-commerce platforms are at the top of the list of sectors that can benefit from mapping their digital customer journey. Analytic strategies can track whether a customer spends more time on product page X than on product page Y by displaying views and data points of customer activity in a structured format. This enables businesses to resolve issues that may hinder a page’s popularity, such as slow-loading pages or expensive products.
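The funnel analysis mentioned above (how many users continue down the process, and at which step they leave) can be sketched in a few lines. This is a minimal illustration under assumptions, not the method of any tool named in this article: the step names in `FUNNEL_STEPS`, the event tuples and the helper functions are all hypothetical, and the sketch ignores event timestamps, counting a user at a step only if they also performed every earlier step.

```python
FUNNEL_STEPS = ["view_product", "add_to_cart", "checkout", "payment"]


def funnel_counts(events, steps=FUNNEL_STEPS):
    """Count users who reached each step, requiring all earlier steps too."""
    seen = {}
    for user, step in events:
        seen.setdefault(user, set()).add(step)
    counts = []
    for i in range(len(steps)):
        required = set(steps[: i + 1])
        counts.append(sum(1 for done in seen.values() if required <= done))
    return counts


def drop_off_rates(counts):
    """Fraction of users lost between each pair of consecutive steps."""
    return [
        (prev - cur) / prev if prev else 0.0
        for prev, cur in zip(counts, counts[1:])
    ]


# Made-up event log: (user id, event name).
events = [
    ("u1", "view_product"), ("u1", "add_to_cart"), ("u1", "checkout"), ("u1", "payment"),
    ("u2", "view_product"), ("u2", "add_to_cart"), ("u2", "checkout"),
    ("u3", "view_product"), ("u3", "add_to_cart"),
    ("u4", "view_product"),
]

counts = funnel_counts(events)
for step, n in zip(FUNNEL_STEPS, counts):
    print(f"{step:<13} {n}")
print(drop_off_rates(counts))
```

A user such as `u2`, who reaches checkout but never pays, is the kind of case the pizza example describes: the step with the largest drop-off is where a reminder or a page fix is most likely to pay off.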
By tracking a user session from the moment the user enters the platform to the point a sale is made, behavioral analytics predicts future customer behavior and business trends. Some of the parameters considered include the number of customers who view reviews and ratings before adding an item to their cart, which similar products the customer looks at, and how often items are added to or deleted from the cart. Behavioral analytics can also identify top-performing products and help in building powerful recommendation engines. Analyzing changes in customer behavior across different demographic conditions or regions helps achieve customer-to-customer personalization.

KISSmetrics is a powerful analytics tool that provides detailed customer behavior reports for businesses to slice through and find meaningful insights. RetentionGrid provides color-coded visualizations as well as multiple strategies tailor-made for customers, based on customer segmentation and demographics.

How can online gaming benefit from behavioral analysis

Online gaming is a surging community with millions of daily active users. Marketers are always looking for ways to acquire customers and retain users. Monetization is another important focal point. This means getting more users not only to play but also to pay.

Behavioral analytics keeps track of a user’s gaming session: skill levels, the amount of time spent at different stages, favorite features and activities within gameplay, and drop-off points from the game. At an overall level, it tracks the active users, game logs, demographic data and social interaction between players over various community channels. On the basis of this data, a visualization graph is generated which can be used to drive marketing strategies, such as identifying features that work, how to bring in additional players, or how to keep existing players engaged. This helps increase player retention and assists game developers and marketers in implementing new versions based on players’ reactions.

Behavioral analytics can also identify common characteristics of users. It helps in understanding what gets a user to play longer and in identifying the group of users most likely to pay. All this helps gaming companies target advertising and content placement at the right users.

Mr Green’s casino launched a Green Gaming tool that predicts a person’s playing behavior and, on the basis of the gamer’s risk-taking behavior, generates personalized insights regarding their gaming. Nektan PLC has partnered with the ‘machine learning’ customer insights firm Newlette. Newlette’s models analyze player behavior based on individual playing styles. They help increase player engagement and reduce bonus costs by providing players with optimum offers and bonuses.

The applications of behavioral analytics are not limited to e-commerce or gaming. The security and surveillance domain uses behavioral analytics to conduct risk assessments of organizational resources and to raise alerts about individual entities that are a potential threat. It does so by sifting through large amounts of company data and identifying patterns that show irregularity or change.

End-to-end monitoring of customers also helps app developers track customer adoption of newly developed features. It can also provide reports on the exact point where customers drop off and help in avoiding expensive technical issues. All these benefits highlight how customer tracking and knowing user behavior is an essential tool to drive a business forward. As Leo Burnett, the founder of a prominent advertising agency, said: “What helps people, helps business.”
Sugandha Lahoti
13 Nov 2017
7 min read

13 reasons why Exit Polls get it wrong sometimes

An exit poll, as the name suggests, is a poll taken immediately after voters exit the polling booth. These polls are conducted by private companies, popularly known as pollsters, working for newspapers or media organizations. Once the data is collected, data analysis and estimation are used to predict the winning party and the number of seats captured. Turnout models, built using logistic regression or random forest techniques, are used to predict turnout in the exit poll results.

Exit polls depend on sampling, so a margin of error does exist. It describes how close the pollsters’ estimate can be expected to be to the true population value. Normally, a margin of error of plus or minus 3 percentage points is acceptable. In recent times, however, there have been instances where the poll average was off by a larger margin. Let us analyze some of the reasons why exit polls can get their predictions wrong.

1. Sampling inaccuracy/quality

Exit polls depend on the sample size, i.e. the number of respondents or the number of precincts chosen. Incorrect estimation of this may widen the margin of error. The quality of the sample data also matters. This includes factors such as whether the selected precincts are representative of the state and whether the people polled in each precinct represent the whole precinct.

2. Model did not consider multiple turnout scenarios

Voter turnout refers to the percentage of eligible voters who cast a vote during an election. Pollsters may misjudge the number of people who actually vote when they estimate it from the total population eligible to vote. They also often base their turnout prediction on past trends. However, voter turnout depends on many factors. For example, some voters might not turn up because of indifference or a perception that their vote might not count, which is not true. In such cases, the pollsters adjust the weighting to reflect high or low turnout conditions while keeping the total turnout count in mind. Observations taken during a low turnout are also considered, and the weights are adjusted accordingly. In short, pollsters try their best to stay true to the original data.

3. Model did not consider past patterns

Pollsters may make the mistake of not delving into the past. They can gauge the current turnout rates by taking into account turnout in the presidential election or in previous midterm elections. Although one may assume that the turnout percentage has been stable over the years, a check on past voter turnout is a must.

4. Model was not recalibrated for year and time of election such as odd-year midterms

Timing is a crucial factor in getting people to vote. At times, some social issues are much more hyped and talked about than the elections. For instance, the news of the Ebola virus outbreak in Texas was more prominent than news about the candidates standing in the 2014 midterm elections. Another example would be an election day set on a Friday versus any other weekday.

5. Number of contestants

Everyone has a personal favorite. In cases where there are just two candidates, it is straightforward to arrive at a clear winner. For pollsters, it is easier to predict votes when the whole world is talking about the election and they know which candidate is most talked about. As the number of candidates increases, carrying out an accurate survey becomes more challenging. Pollsters have to reach out to more respondents to carry out the survey effectively.

6. Swing voters/undecided respondents

Another possible explanation for discrepancies between poll predictions and the outcome is a large proportion of undecided voters in the poll samples. Possible solutions include asking relative questions instead of absolute ones, and allotting undecided voters in proportion to party support levels while making estimates.

7. Number of down-ballot races

Sometimes a popular party leader helps attract votes to another, less popular candidate of the same party. This is the down-ballot effect. At times, down-ballot candidates may receive more votes than party leader candidates, even when third-party candidates are included. Also, down-ballot outcomes tend to be influenced by the turnout for the races at the top of the ballot. So the number of down-ballot races needs to be taken into account.

8. The cost incurred to commission a quality poll

A huge capital investment is required to commission a quality poll. The cost of a poll depends on factors such as the sample size (the number of people interviewed), the length of the questionnaire (the longer the interview, the more expensive it becomes) and the time within which the interviews must be conducted. Hiring a polling firm or including cell phones in the survey also adds to the expense.

9. Over-relying on historical precedence

Historical precedence is an estimate of the type of people who have shown up previously for a similar type of election. This precedent should be taken into consideration for better estimation of election results. However, care should be taken not to over-rely on it.

10. Effect of statewide ballot measures

Poll estimates also depend on state and local governments. Certain issues are pushed by local ballot measures. However, some voters feel that power over specific issues should belong exclusively to state governments. This causes opposition to local ballot measures in some states. These issues should be taken into account during estimation for better prediction of results.

11. Oversampling due to various factors such as faulty survey design, respondents’ willingness/unwillingness to participate etc

Exit polls may also sometimes oversample certain voters, for many reasons. One example relates to people in the US with cultural ties to Latin America. Although more than one-fourth of Latino voters prefer speaking Spanish to English, exit polls are almost never offered in Spanish. This might oversample English-speaking Latinos.

12. Social desirability bias in respondents

People may not always tell the truth about who they voted for. In other words, when asked by pollsters they are likely to place themselves on the safer side, as how one voted is a sensitive topic. Voters may tell pollsters that they voted for a minority candidate when they actually voted against that candidate. Social desirability bias is not limited to issues of race or gender. It is just that people like to be liked and like to be seen as doing what everyone else is doing, or what the “right” thing to do is; that is, they play safe. Brexit polling, for instance, showed strong signs of social desirability bias.

13. The spiral of silence theory

People may not reveal their true thoughts to news reporters because they believe the media has an inherent bias. Voters may not declare their stand publicly for fear of reprisal or of isolation. They choose to remain silent. This may also hinder pollsters’ estimates.

The above is just a shortlist from a long list of reasons why exit poll results must be taken with a pinch of salt. However, even with all its shortcomings, the striking feature of an exit poll is that, rather than predicting a future action, it records an action that has just happened. So you rely on present indicators rather than ambiguous historical data. Exit polls are also cost-effective in obtaining very large samples. If these exit polls are conducted properly, keeping in consideration the points described above, they can predict election results with greater reliability.
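Two of the ideas above lend themselves to a short worked example: the margin of error of plus or minus 3 percentage points, and the allotment of undecided voters in proportion to party support. The sketch below is only an illustration under assumptions. It uses the textbook normal-approximation formula for a simple random sample at a 95% confidence level (z = 1.96); real exit polls sample clusters of precincts, so their actual margins are typically wider. The function names and the party shares are made up for illustration.

```python
from math import ceil, sqrt


def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion p
    estimated from a simple random sample of n respondents."""
    return z * sqrt(p * (1 - p) / n)


def sample_size_for(moe, p=0.5, z=1.96):
    """Smallest simple random sample that keeps the margin within moe."""
    return ceil(z ** 2 * p * (1 - p) / moe ** 2)


def allocate_undecided(shares, undecided):
    """Split the undecided share among parties in proportion to their support."""
    decided = sum(shares.values())
    return {
        party: share + undecided * share / decided
        for party, share in shares.items()
    }


# Respondents needed for +/- 3 points in the worst case (p = 0.5).
n_needed = sample_size_for(0.03)
print(n_needed, round(margin_of_error(0.5, n_needed), 4))

# Made-up poll: 45% for A, 40% for B, 15% undecided.
adjusted = allocate_undecided({"A": 0.45, "B": 0.40}, undecided=0.15)
print({party: round(share, 3) for party, share in adjusted.items()})
```

Proportional allotment leaves the ratio between the parties unchanged, which is exactly why it can mislead when undecided voters break disproportionately toward one side.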

Savia Lobo
08 Nov 2017
9 min read

8 Myths about RPA (Robotic Process Automation)

Many say we are on the cusp of the fourth industrial revolution, which promises to blur the lines between the real, virtual and biological worlds. Among many trends, Robotic Process Automation (RPA) is one of the buzzwords surrounding the hype of the fourth industrial revolution. Although poised to be a $6.7 trillion industry by 2025, RPA is shrouded in just as much fear as it is brimming with potential.

We have heard time and again how automation can improve productivity, efficiency, and effectiveness while conducting business in transformative ways. We have also heard how automation, and machine-driven automation in particular, can displace humans and thereby lead to a dystopian world. As humans, we make assumptions based on what we see and understand. But sometimes those assumptions become so ingrained that they evolve into myths which many start accepting as facts. Here is a closer look at some of the myths surrounding RPA.

1. RPA means robots will automate processes

The term robot evokes in our minds a picture of a metal humanoid with stiff joints that speaks in a monotone. RPA does mean robotic process automation. But the robot doing the automation is nothing like the ones we are used to seeing in the movies. These are software robots that perform routine processes within organizations. They are often referred to as virtual workers or a digital workforce, complete with their own identity and credentials. They essentially consist of algorithms programmed by RPA developers with the aim of automating mundane business processes. These processes are repetitive, highly structured, fall within a well-defined workflow, consist of a finite set of tasks or steps, and may often be monotonous and labor intensive.

Let us consider a real-world example here: automating the invoice generation process. The RPA system will run through all the emails in the system and download the PDF files containing details of the relevant transactions. Then, it will fill a spreadsheet with the details and maintain all the records there. Later, it will log on to the enterprise system and generate appropriate invoice reports for each entry in the spreadsheet. Once the invoices are created, the system will send a confirmation mail to the relevant stakeholders. Here, the RPA user only specifies the individual tasks that are to be automated, and the system takes care of the rest of the process.

So, yes, while it is true that RPA involves robots automating processes, it is a myth that these robots are physical entities or that they can automate all processes.

2. RPA is useful only in industries that rely heavily on software

“Almost anything that a human can do on a PC, the robot can take over without the need for IT department support.” - Richard Bell, former Procurement Director at Averda

RPA is software that can be introduced into a business process. Traditional industries such as banking and finance, healthcare, and manufacturing, which have significant routine tasks and depend on software for some of their functioning, can benefit from RPA. Loan processing and patient data processing are some examples. RPA, however, cannot help with automating the assembly line in a manufacturing unit or with performing regular tests on patients. Even in industries that maintain daily essential utilities such as cooking gas, electricity and telephone services, RPA can be put to use for generating automated bills, invoices, meter readings etc. By adopting RPA, businesses, irrespective of the industry they belong to, can achieve significant cost savings, operational efficiency, and higher productivity.

To leverage the benefits of RPA, it is more important that users have a clear understanding of business workflow processes and domain knowledge than that they understand the SDLC process. Industry professionals can be easily trained on how to put RPA into practice.

The bottom line: RPA is not limited to industries that rely heavily on software to exist. But it is true that RPA can be used only in situations where some form of software is used to perform tasks manually.

3. RPA will replace humans in most frontline jobs

Many organizations employ a large workforce in frontline roles to do routine tasks such as data entry operations, managing processes, customer support, and IT support. But frontline jobs are just as diverse as the people performing them.

Take sales reps, for example. They bring in new business through their expert understanding of the company’s products and its potential customer base, coupled with the associated soft skills. Currently, they spend significant time on administrative tasks such as developing and finalizing business contracts, updating the CRM database, and making daily status reports. Imagine the spike in productivity if these tasks could be taken off the plates of sales reps so that they could just focus on cultivating relationships and converting leads. By replacing human effort in mundane tasks within frontline roles, RPA can help employees focus on higher-value tasks.

In conclusion, RPA will not replace humans in most frontline jobs. It will, however, replace humans in a few roles that are very rule-based and narrow in scope, such as simple data entry operators or basic invoice processing executives. In most frontline roles, like sales or customer support, RPA is quite likely to change, at least in some ways, how people see their job responsibilities. Also, the adoption of RPA will generate new job opportunities around the development, maintenance, and sale of RPA-based software.

4. Only large enterprises can afford to deploy RPA

The cost of implementing and maintaining RPA software and training employees to use it can be quite high. This can make it an unfavorable business proposition for SMBs with fairly simple organizational processes and cross-departmental considerations. On the other hand, large organizations with higher revenue generation capacity, complex business processes, and a large army of workers can deploy an RPA system to automate high-volume tasks quite easily and recover that cost within a few months.

It is obvious that large enterprises will benefit from RPA systems due to the economies of scale offered and the faster recovery of the investments made. SMBs (small to medium-sized businesses) can also benefit from RPA to automate their business processes. But this is possible only if they look at RPA as a strategic investment whose cost will be recovered over a longer time period of, say, 2-4 years.

5. RPA adoption should be owned and driven by the organization’s IT department

The RPA team handling the automation process need not be from the IT department. The main role of the IT department is to provide the resources necessary for the software to function smoothly. An RPA reliability team, which is trained in using RPA tools, typically consists of business operations professionals rather than IT professionals. In simple terms, RPA is not owned by the IT department but by the whole business, and it is driven by the RPA team.

6. RPA is an AI virtual assistant specialized to do a narrow set of tasks

An RPA bot performs a narrow set of tasks based on the given data and instructions. It is a system of rule-based algorithms which can be used to capture, process and interpret streams of data, trigger appropriate responses and communicate with other processes. However, it cannot learn on its own, which is a key trait of an AI system. Advanced AI concepts such as reinforcement learning and deep learning are yet to be incorporated in robotic process automation systems. Thus, an RPA bot is not an AI virtual assistant like Apple’s Siri, for example. That said, it is not impractical to think that in the future these systems will be able to think on their own, decide the best possible way to execute a business process, and learn from their own actions to improve the system.

7. To use the RPA software, one needs to have basic programming skills

Surprisingly, this is not true. Associates who use the RPA system need not have any programming knowledge. They only need to understand how the software works on the front end, and how they can assign tasks to the RPA worker for automation. On the other hand, RPA system developers do require some programming skills, such as knowledge of scripting languages. Today, there are various platforms for developing RPA tools, such as UiPath, Blue Prism and more, which empower RPA developers to build these systems without any hassle, reducing their coding responsibilities even more.

8. RPA software is fully automated and does not require human supervision

This is a big myth. RPA is often misunderstood as a completely automated system. Humans are indeed required to program the RPA bots, to feed them tasks for automation and to manage them. The automation here lies in aggregating and performing various tasks which would otherwise require more than one human to complete. There is also the efficiency factor: RPA systems are fast, and they almost completely avoid the faults in the system or the process that are otherwise caused by human error. For such tasks, having a digital workforce in place can be far more profitable than recruiting a human workforce.

Conclusion

One of the most talked about areas of technological innovation, RPA is clearly still in its early days and is surrounded by a lot of myths. However, there’s little doubt that its adoption will take off rapidly as RPA systems become more scalable, more accurate and faster to deploy. AI-, cognitive- and analytics-driven RPA will take it up a notch or two, and help businesses improve their processes even more by taking dull, repetitive tasks away from people. Hype can get ahead of reality, as we’ve seen quite a few times, but RPA is an area definitely worth keeping an eye on despite all the hype.
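The invoice generation example under the first myth describes a fixed, rule-based sequence of steps, which can be sketched as a small pipeline. This is a toy illustration under heavy assumptions, not how UiPath, Blue Prism or any other RPA product works: real tools drive actual mailboxes, PDF files and enterprise systems, whereas here the inbox is a list of dictionaries, the spreadsheet is CSV text held in memory, and every function name and field is hypothetical.

```python
import csv
import io


def collect_transactions(inbox):
    """Step 1: scan emails and pull details from PDF attachments (simulated)."""
    return [
        att["details"]
        for mail in inbox
        for att in mail.get("attachments", [])
        if att["name"].lower().endswith(".pdf")
    ]


def fill_spreadsheet(transactions):
    """Step 2: record each transaction as a row; CSV text stands in for the sheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["customer", "amount"])
    writer.writeheader()
    writer.writerows(transactions)
    return buffer.getvalue()


def generate_invoices(sheet_text):
    """Step 3: create one invoice record per spreadsheet row."""
    rows = csv.DictReader(io.StringIO(sheet_text))
    return [
        {
            "invoice_no": f"INV-{i:04d}",
            "customer": row["customer"],
            "amount": float(row["amount"]),
        }
        for i, row in enumerate(rows, start=1)
    ]


def send_confirmations(invoices, outbox):
    """Step 4: queue a confirmation message for each invoice."""
    for inv in invoices:
        outbox.append(
            f"Invoice {inv['invoice_no']} for {inv['customer']}: {inv['amount']:.2f}"
        )


# Made-up inbox; the newsletter has no PDF and is skipped by the rule in step 1.
inbox = [
    {"subject": "Order 1", "attachments": [
        {"name": "order1.pdf", "details": {"customer": "Acme", "amount": 120.5}}]},
    {"subject": "Newsletter", "attachments": []},
    {"subject": "Order 2", "attachments": [
        {"name": "order2.PDF", "details": {"customer": "Globex", "amount": 80}}]},
]

outbox = []
invoices = generate_invoices(fill_spreadsheet(collect_transactions(inbox)))
send_confirmations(invoices, outbox)
print("\n".join(outbox))
```

Every decision in the sketch is a fixed rule written in advance (for instance, "only attachments ending in .pdf count"). Nothing is learned from the data, which matches the distinction drawn under the sixth myth between a rule-based bot and an AI system.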