Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
IBM SPSS Modeler Cookbook

You're reading from   IBM SPSS Modeler Cookbook If you've already had some experience with IBM SPSS Modeler this cookbook will help you delve deeper and exploit the incredible potential of this data mining workbench. The recipes come from some of the best brains in the business.

Arrow left icon
Product type Paperback
Published in Oct 2013
Publisher Packt
ISBN-13 9781849685467
Length 382 pages
Edition 1st Edition
Languages
Concepts
Arrow right icon
Toc

Table of Contents (11) Chapters Close

Preface 1. Data Understanding FREE CHAPTER 2. Data Preparation – Select 3. Data Preparation – Clean 4. Data Preparation – Construct 5. Data Preparation – Integrate and Format 6. Selecting and Building a Model 7. Modeling – Assessment, Evaluation, Deployment, and Monitoring 8. CLEM Scripting A. Business Understanding Index

Creating an Outlier report to give to SMEs

It is quite common that the data miner has to rely on others to either provide data or interpret data, or both. Even when the data miner is working with data from their own organization there will be input variables that they don't have direct access to, or that are outside their day-to-day experience.

Are zero values normal? What about negative values? Null values? Are 1500 balance inquiries in a month even possible? How could a wallet cost $19,500? The concept of outliers is something that all analysts are familiar with. Even novice users of Modeler could easily find a dozen ways of identifying some. This recipe is about identifying outliers systematically and quickly so that you can produce a report designed to inspire curiosity.

There is no presumption that the data is in error, or that they should be removed. It is simply an attempt to put the information in the hands of Subject Matter Experts, so quirky values can be discussed in the earliest phases of the projects. It is important to provide whichever primary keys are necessary for the SMEs to look up the records. On one of the author's recent projects, the team started calling these reports quirk reports.

Getting ready

We will start with the Outlier Report.str stream that uses the TELE_CHURN_preprep data set.

How to do it...

To create an Outlier report:

  1. Open the stream Outlier Report.str.
    How to do it...
  2. Add a Data Audit node and examine the results.
    How to do it...
  3. Adjust the stream options to allow for 25 rows to be shown in a data preview. We will be using the preview feature later in the recipe.
    How to do it...
  4. Add a Statistics node. Choose Mean, Min, Max, and Median for the variables DATA_gb, PEAK_mins, and TEXT_count. These three have either unusually high maximums or surprising negative values as shown in the Data Audit node.
  5. Consider taking a screenshot of the Statistics node for later use.
    How to do it...
  6. Add a Sort node. Starting with the first variable, DATA_gb, sort in ascending order.
    How to do it...
  7. Add a Filter node downstream of the Sort node dropping CHURN, DROPPED_CALLS, and LATE_PAYMENTS. It is important to work with your SME to know which variables put quirky values into context.
    How to do it...
  8. Preview the Filter node. Consider the following screenshot:
    How to do it...
  9. Reverse the sort, now choosing descending order, and preview the Filter node. Consider the following screenshot for later use:
    How to do it...
  10. Sort in descending order on the next variable, PEAK_mins. Preview the Filter node.
    How to do it...
  11. Finally sort the variable, TEXT_count, in descending order and preview the Filter node.
    How to do it...
  12. Examine Outliers.docx to see an example of what this might look like in Word.

How it works...

There is no deep theoretical foundation to this recipe; it is as straightforward as it seems. It is simply a way of quickly getting information to an SME. They will not be frequent Modeler users. Also summary statistics only give them a part of the story. Providing the min, max, mean and median alone will not allow an SME to give you the information that you need. If there is a usual min such as a negative value, you need to know how many negatives there are, and need at least a handful of actual examples with IDs. An SME might look up to values in their own resources and the net result could be the addition of more variables to the analysis. Alternatively, negative values might be turned into nulls or zeros. Negative values might be deemed out of scope and removed from the analysis. There is no way to know until you assess why they are negative. Sometimes values that are exactly zero are of interest. High values, NULL values, and rare categories are all of potential interest. The most important thing is to be curious (and pleasantly persistent) and to inspire collaborators to be curious as well.

See also

  • The Selecting variables using the CHAID Modeling node recipe in Chapter 2, Data Preparation – Select
  • The Removing redundant variables using correlation matrices recipe in Chapter 2, Data Preparation – Select
You have been reading a chapter from
IBM SPSS Modeler Cookbook
Published in: Oct 2013
Publisher: Packt
ISBN-13: 9781849685467
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image