Mastering Data Analysis with R: Gain sharp insights into your data and solve real-world data science problems with R, from data munging to modeling and visualization

Product type Paperback

Published in Sep 2015

Publisher Packt

ISBN-13 9781783982028

Length 396 pages

Edition 1st Edition

Concepts

Data Analysis

Author (1):

Gergely Daróczi

Table of Contents (17) Chapters

Preface

1. Hello, Data!

2. Getting Data from the Web

3. Filtering and Summarizing Data

4. Restructuring Data

5. Building Models (authored by Renata Nemeth and Gergely Toth)

6. Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth)

7. Unstructured Data

8. Polishing Data

9. From Big to Small Data

10. Classification and Clustering

11. Social Network Analysis of the R Ecosystem

12. Analyzing Time-series

13. Data Around Us

14. Analyzing the R Community

A. References

Index

What this book covers

Chapter 1, Hello, Data!, starts with the first important step of every data-related task: loading data from text files and databases. This chapter covers some of the problems of loading larger amounts of data into R, using improved CSV parsers and pre-filtering the data, and compares the support for various database backends.

Chapter 2, Getting Data from the Web, extends your knowledge of importing data with packages designed to communicate with Web services and APIs, shows how to scrape and extract data from home pages, and gives a general overview of dealing with the XML and JSON data formats.

Chapter 3, Filtering and Summarizing Data, continues with the basics of data processing by introducing multiple methods and ways of filtering and aggregating data, with a performance and syntax comparison of the deservedly popular data.table and dplyr packages.

Chapter 4, Restructuring Data, covers more complex data transformations, such as applying functions on subsets of a dataset, merging data, and transforming to and from long and wide table formats, to perfectly fit your source data with your desired data workflow.

Chapter 5, Building Models (authored by Renata Nemeth and Gergely Toth), is the first chapter that deals with real statistical models, and it introduces the concepts of regression and models in general. This short chapter explains how to test the assumptions of a model and interpret the results by building a multivariate linear regression model on a real-life dataset.

Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth), builds on the previous chapter, but covers the problems of non-linear associations between predictor variables and the response, and provides further examples of generalized linear models, such as logistic and Poisson regression.

Chapter 7, Unstructured Data, introduces new data types that might not hold any information in a structured way. Here, you learn how to use statistical methods to process such unstructured data through some hands-on examples of text mining algorithms, and how to visualize the results.

Chapter 8, Polishing Data, covers another common issue with raw data sources. Most of the time, data scientists handle dirty-data problems, such as cleansing data of errors, outliers, and other anomalies. It is also very important to impute missing values or minimize their effects.

Chapter 9, From Big to Small Data, assumes that your data is already loaded, clean, and transformed into the right format. Now you can start analyzing what is usually a high number of variables. To that end, the chapter covers some statistical methods for dimension reduction and other transformations of continuous variables, such as principal component analysis, factor analysis, and multidimensional scaling.

Chapter 10, Classification and Clustering, discusses several ways of grouping observations in a sample using supervised and unsupervised statistical and machine learning methods, such as hierarchical and k-means clustering, latent class models, discriminant analysis, logistic regression and the k-nearest neighbors algorithm, and classification and regression trees.

Chapter 11, Social Network Analysis of the R Ecosystem, concentrates on a special data structure and introduces the basic concepts and visualization techniques of network analysis, with a special focus on the igraph package.

Chapter 12, Analyzing Time-series, shows you how to handle date-time objects and analyze related values by smoothing, seasonal decomposition, and ARIMA, including some forecasting and outlier detection as well.

Chapter 13, Data Around Us, covers another important dimension of data, with a primary focus on visualizing spatial data with thematic, interactive, contour, and Voronoi maps.

Chapter 14, Analyzing the R Community, provides a more complete case study that combines many different methods from the previous chapters to highlight what you have learned in this book and what kind of questions and problems you might face in future projects.

Appendix, References, lists the R packages used in the book and suggests further reading for each of the preceding chapters.