You're reading from R Bioinformatics Cookbook Utilize R packages for bioinformatics, genomics, data science, and machine learning

Product type Paperback

Published in Oct 2023

Publisher Packt

ISBN-13 9781837634279

Length 396 pages

Edition 2nd Edition

Languages

Tools

ChatGPT

Concepts

Bioinformatics

Author (1):

Dan MacLean

View More author details

Table of Contents (16) Chapters

Preface

1. Chapter 1: Setting Up Your R Bioinformatics Working Environment

2. Chapter 2: Loading, Tidying, and Cleaning Data in the tidyverse FREE CHAPTER

3. Chapter 3: ggplot2 and Extensions for Publication Quality Plots

4. Chapter 4: Using Quarto to Make Data-Rich Reports, Presentations, and Websites

5. Chapter 5: Easily Performing Statistical Tests Using Linear Models

6. Chapter 6: Performing Quantitative RNA-seq

7. Chapter 7: Finding Genetic Variants with HTS Data

8. Chapter 8: Searching Gene and Protein Sequences for Domains and Motifs

9. Chapter 9: Phylogenetic Analysis and Visualization

10. Chapter 10: Analyzing Gene Annotations

11. Chapter 11: Machine Learning with mlr3

12. Chapter 12: Functional Programming with purrr and base R

13. Chapter 13: Turbo-Charging Development in R with ChatGPT

14. Index

Why subscribe?

15. Other Books You May Enjoy

Loading data from files with readr

The readr R package is a package that provides functions for reading and writing tabular data in a variety of formats, including comma-separated values (CSV), tab-separated values (TSV), and delimiter-separated files. It is designed to be flexible and stop helpfully when data changes or unexpected items appear in the input. The two main advantages over base R functions include consistency in interface and output and the ability to be explicit about types and inspect those types.

This latter advantage can help to avoid errors when reading data, as well as make data cleaning and manipulation easier. readr functions can also automatically infer the data types of each column, which can be useful for a preliminary inspection of large datasets or when the data types are not known.

Consistency in interface and output is one of the main advantages of readr functions. readr functions provide a consistent interface for reading different types of data, which can make it easier to work with multiple types of files. For example, the read_csv() function can be used to read CSV files, while the read_tsv() function can be used to read TSV files. Additionally, readr functions return a tibble, a modern version of a data frame that is more consistent in its output and easier to read than the base R data frame.

Getting ready

For this recipe, we’ll need the readr and rbioinfcookbook packages. The latter contains a census_2021.csv file that carries UK census data from 2021, from the UK Office for National Statistics (https://www.ons.gov.uk/). You will need to inspect it, especially its header, to understand the process in this recipe. The first step shows you how to find where the file is on your filesystem.

Note the delimiters in the file are commas (,) and that the first seven lines contain metadata that isn’t part of the main data. Also, look at the messy column headings and note that the numbers themselves are internally delimited by commas.

How to do it…

We begin by getting a filename for the sample in the package:

Load the package and get a filename:

library(readr)filename <- fs::path_package("extdata",                              "census_2021.csv",                              package="rbioinfcookbook"                             )

Specify a vector of new names for columns:

col_names = c(            c("area_code", "country", "region",               "area_name", "all_persons", "under_4"),            paste0(seq(5, 85, by = 5),"_to_",seq(9, 89, by =5)),c("over_90"))

Set the column types based on new names and contents:

col_types = cols(  area_code = col_character(),  country = col_factor(levels = c("England", "Wales")),  region = col_factor(levels = c("North East",                                 "Yorkshire Humber",                                 "East Midlands",                                 "West Midlands",                                 "East of England",                                          "London",                                 "South East",                                 "South West",                                 "Wales"), ordered = TRUE                      ),  area_name = col_character(),  .default = col_number())

Put it together and read the file:

df <- read_csv(filename,               skip = 8,               col_names = col_names,               col_types = col_types )

And with that, we’ve loaded in a file with careful checking of the data types.

How it works…

Step 1 loads the library(readr) package. This package contains functions for reading and writing tabular data in a variety of formats, including CSV. The fs::path_package("extdata", "census_2021.csv", package="rbioinfcookbook") function is used to create a file path to the census_2021.csv file. It simply finds the place where the rbioinfcookbook package was installed and looks inside the extdata directory for the file, then it returns the full file path that leads to the file. Quite often, we would see the system.file() function used for this purpose. system.file() is a fine choice when everything works, but when it can’t find the file, it returns a blank string, which can be hard to debug. fs::path_package() is nicer to work with and will return an error when it can’t find the file.

In step 2, a vector of new column names is specified by the code. The vector contains several strings for the first few column names, and then a sequence of strings is created and a long list of age-related columns is created by concatenating two sequences of numbers. The resulting vector is stored in the col_names variable.

In step 3, we specify the R type we want each column to be. The categorical columns are set explicitly to factors, with the region being ordered explicitly in a rough geographical northeast to southwest way. The area_name column contains over 300 names, so we won’t make them an explicit factor and stick with it as a general text containing character type. The rest of the columns contain numeric data, so we make that the default with .default.

Finally, the read_csv() function is used to read the file specified in step 1 and create a data frame. The skip argument is used to skip the first eight rows, which include the metadata in the file and the messy header, the col_names argument is used to specify the new column names stored in col_names, and the col_types argument is used to specify the column types stored in col_types.

There’s more…

We used the read_csv() function for comma-separated data, but many more functions are available for different delimiters:

Function	Delimiter
`read_csv()`	CSV
`read_tsv()`	TSV
`read_delim()`	User-specified delimited files
`read_fwf()`	Fixed-width files
`read_table()`	Whitespace-separated files
`read_log()`	Web log files

Table 2.1 – Parser functions and the type of input file delimiter they work on in readr

For different local conventions on—for example—decimal separators and grouping marks, you can use the locale functions.