You're reading from R Data Mining Implement data mining techniques through practical use cases and real-world datasets

Product type Paperback

Published in Nov 2017

Publisher Packt

ISBN-13 9781787124462

Length 442 pages

Edition 1st Edition

Languages

Tools

ggplot

Concepts

Data Mining

Author (1):

Andrea Cirillo

Table of Contents (16) Chapters

Preface

1. Why to Choose R for Your Data Mining and Where to Start

2. A First Primer on Data Mining – Analysing Your Bank Account Data

3. The Data Mining Process - CRISP-DM Methodology

4. Keeping the House Clean – The Data Mining Architecture

5. How to Address a Data Mining Problem – Data Cleaning and Validation

6. Looking into Your Data Eyes – Exploratory Data Analysis

7. Our First Guess – a Linear Regression

8. A Gentle Introduction to Model Performance Evaluation

9. Don't Give up – Power up Your Regression Including Multiple Variables

10. A Different Outlook to Problems with Classification Models

11. The Final Clash – Random Forests and Ensemble Learning

12. Looking for the Culprit – Text Data Mining with R

13. Sharing Your Stories with Your Stakeholders through R Markdown

14. Epilogue

15. Dealing with Dates, Relative Paths and Functions

Looking for context in text – analyzing document n-grams

What was the main limitation of our wordclouds? As we said, the absence of context. We were looking at isolated words, which convey no meaning beyond what each word carries on its own.

This is where n-gram analysis techniques come in. These techniques tokenize the text into groups of consecutive words rather than into single words. These groups are called n-grams, where n is the number of words in each group.
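To make the idea concrete, here is a minimal base R sketch of how word n-grams can be built by sliding a window over a sentence. The helper name make_ngrams and the sample sentence are assumptions made for illustration; they do not come from the book, and the function is not how tidytext does it internally.

```r
# Hypothetical helper (not from the book): build word n-grams from one string.
make_ngrams <- function(text, n = 2) {
  # Lowercase the text and split it on runs of whitespace
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  # A text shorter than n words yields no n-grams
  if (length(words) < n) return(character(0))
  # Slide a window of n words across the text, one word at a time
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("the service was not good", n = 2)
# "the service" "service was" "was not" "not good"
```

Note how the bigram "not good" keeps the negation attached to the adjective, which is exactly the context a single-word wordcloud loses.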

We can obtain n-grams from our comments dataset by applying the unnest_tokens function again, this time passing "ngrams" as the value of the token argument and 2 as the value of the n argument:

# Split each comment into bigrams (pairs of consecutive words), one per row
comments %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) -> bigram_comments

Since we specified 2 as the value for the...
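As a rough, self-contained sketch of the call above, the following runs unnest_tokens on a toy data frame and then counts the resulting bigrams. The two sample comments are invented for illustration and stand in for the book's comments dataset; the counting step is one common follow-up, not necessarily the one the chapter takes next. It assumes the dplyr and tidytext packages are installed.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the book's comments data (invented for illustration)
comments <- tibble(text = c("the service was not good",
                            "the service was quick"))

# Same call as in the text: one row per bigram, stored in a column named bigram
comments %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) -> bigram_comments

# Count how often each bigram occurs, most frequent first
bigram_comments %>%
  count(bigram, sort = TRUE)
```

With this toy input, "the service" and "service was" each occur twice, while the remaining bigrams occur once.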

Authors (1)

Andrea Cirillo

Andrea Cirillo is currently working as an audit quantitative analyst at Intesa Sanpaolo Banking Group. He gained financial and external audit experience at Deloitte Touche Tohmatsu and internal audit experience at FNM, a listed Italian company. His main responsibilities involve the evaluation of credit risk management models and their enhancement, mainly within the field of the Basel III capital agreement. He is married to Francesca and is the father of Tommaso, Gianna, Zaccaria, and Filippo. Andrea has written and contributed to a few useful R packages such as updateR, ramazon, and paletteR, and regularly shares insightful advice and tutorials on R programming. His research and work mainly focus on the use of R in the fields of risk management and fraud detection, largely by modeling custom algorithms and developing interactive applications. Andrea has previously authored RStudio for R Statistical Computing Cookbook for Packt Publishing.