This article is an excerpt from the book, R Bioinformatics Cookbook - Second Edition, by Dan MacLean. Discover over 80 recipes for modeling and handling real-life biological data using modern libraries from the R ecosystem.
Hey there, fellow code explorers! Ever found yourself staring at a chunk of complex R code, feeling lost in its jumble of symbols and functions? Well, fear not! This article dives into the exciting realm of decoding such cryptic code with the help of an unexpected ally: ChatGPT. Join me on this journey as we uncover the secrets behind intricate code snippets, aiming to demystify and explain them in plain, everyday language.
ChatGPT can be used to interpret R code by leveraging its natural language processing capabilities and its understanding of the R programming language. It’s important to note that while ChatGPT can assist in interpreting and explaining R code, it’s still just an AI language model and may not know about the most recent updates or domain-specific intricacies. Double-checking information with official documentation and consulting experienced developers or experts is always necessary. Even so, it is very useful for simplifying code and getting a good first-level understanding in most cases.
In this recipe, we’ll look at how to take an arbitrary bit of complex code and explain it in plain English.
We need some code to interpret – we’ll use a function from the besthr GitHub-hosted package at https://github.com/TeamMacLean/besthr/blob/master/R/functions.R and an account with ChatGPT from https://chat.openai.com/.
Here’s how to work in a ChatGPT session and get it to explain some code to you:
1. Paste the following code into the ChatGPT prompt box:
explain this R code to me:
#' Convert named columns to factors
#'
#' \code{factorise_cols} returns a copy of the passed data frame in which all
#' explicitly named columns are converted to factors with \code{as.factor}
#' All columns with only NA values are ommitted.
#'
#' @param df data frame
#' @param col_list vector of quoted column names
#' @keywords internal
#'
factorise_cols <- function(df, col_list ){
  not_all_na <- function(x) any(!is.na(x))
  for ( i in col_list){
    df <- df %>% dplyr::mutate(!!(i) := as.factor(!!(i)) )
  }
  df %>% dplyr::select_if(not_all_na)
}
2. Read the response:
This R code defines a function called `factorise_cols` that converts explicitly named columns of a data frame into factors using the `as.factor` function. The function takes two parameters:
1. `df`: This is the data frame that contains the columns to be converted.
2. `col_list`: This is a vector of quoted column names. It specifies which columns in the data frame should be converted to factors.
The function uses a helper function called `not_all_na` to filter out columns that have only `NA` values. This helper function checks if there are any non-NA values in a column using the `any` and `is.na` functions.
Inside the main function, a loop iterates over each column name in the `col_list`. For each column, the function uses the `mutate` function from the `dplyr` package to create a new column in the data frame (`df`) with the same name, but with the values converted to factors using the `as.factor` function. The `!!(i)` syntax is used to unquote the column name.
After the loop, the function uses the `select_if` function from `dplyr` to select only the columns that pass the `not_all_na` condition, effectively removing columns that contain only `NA` values. The resulting data frame is then returned as the output of the function.
Overall, this function allows you to convert specific columns of a data frame into factors while excluding columns that contain only `NA` values.
3. Ask for a clarification on a certain point:
You said 'The !!(i) syntax is used to unquote the column name.' Can you go into more depth?
4. Read the response:
Certainly! In the given code, the `!!` (bang-bang) syntax is used for unquoting the column name within the context of the `mutate` function. Let's break it down step by step:
1. In the loop, the variable `i` represents each element (column name) of the `col_list` vector.
2. Within the `mutate` function, the column name `i` needs to be unquoted and evaluated as an expression. This is necessary because `mutate` expects column names to be provided as expressions rather than strings.
3. The `!!` (bang-bang) operator is used to perform this unquoting. It is part of the tidy evaluation framework provided by the `dplyr` package.
4. By wrapping `i` with `!!`, it instructs `mutate` to evaluate `i` as an expression, resulting in an unquoted column name.
To illustrate with an example, suppose `col_list` contains the column names `col1` and `col2`. During the loop iteration, when `i` is `"col1"`, the line `!!(i)` would be evaluated as `!!("col1")`, which unquotes the string to yield `col1`. Similarly, when `i` is `"col2"`, `!!(i)` would be evaluated as `!!("col2")`, resulting in `col2`.
This unquoting operation is necessary in this context because the `mutate` function expects column names to be provided as unevaluated expressions so that it can perform the necessary data manipulation operations. The `!!` operator allows us to achieve this unquoting within the `mutate` function call.
Note that the response from ChatGPT will be somewhat different depending on context and you may not get exactly what was generated here. That is the nature of the thing, but you should get something with the same meaning.
The code in this example comes from a package called besthr that creates estimation statistic plots for visual scoring data. The tool must take in arbitrary column names and work within the tidyverse, so it uses some advanced coding techniques to allow that to happen. We first go and get the code from the repository on GitHub (https://github.com/TeamMacLean/besthr/blob/master/R/functions.R) and paste that into ChatGPT’s prompt box asking it for an explanation.
In step 2, we can see the explanation provided (note that the one you get if you try may be different as the model is not guaranteed to reproduce its predictions). The detail is largely correct; certainly, it is sufficient to give us a clear idea of what the code attempts to do and how it does it.
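One quick way to check an explanation like this is to run the function on a small data frame and see whether it behaves as described. The following is a minimal sketch rather than part of the original recipe: the toy data frame `toy` and its column names are made up, and we assume `col_list` holds symbols created with `rlang::syms()`, which is our reading of "quoted column names" (the excerpt does not show how besthr itself calls the function).

```r
library(dplyr)  # provides mutate(), select_if() and the %>% pipe

# Copy of the function from the prompt in step 1
factorise_cols <- function(df, col_list) {
  not_all_na <- function(x) any(!is.na(x))
  for (i in col_list) {
    df <- df %>% dplyr::mutate(!!(i) := as.factor(!!(i)))
  }
  df %>% dplyr::select_if(not_all_na)
}

# Hypothetical toy data: a grouping column, a score column, and an all-NA column
toy <- data.frame(
  group = c("a", "b", "a"),
  score = c(1, 2, 3),
  empty = c(NA, NA, NA)
)

# Assumption: column names are passed as symbols, not plain strings
result <- factorise_cols(toy, rlang::syms("group"))

str(result)
is.factor(result$group)      # was `group` converted to a factor?
"empty" %in% names(result)   # is the all-NA column still present?
```

If the output matches what ChatGPT described (named columns become factors, all-NA columns disappear), we have some evidence that the plain English summary can be trusted for this function.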
Some parts of the explanation aren’t clear, so in step 3, we ask for clarification of a tricky bit, again by typing into the prompt box. And in step 4, we see a more in-depth description of that part.
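We can also probe the `!!` explanation ourselves by looking at what unquoting produces, without touching any data. This sketch is not from the original recipe; it uses `rlang::expr()`, which returns an expression after `!!` has been processed, and borrows the column name `col1` from the example in the response. It points at a subtlety worth double-checking: unquoting a symbol inserts a bare column name, whereas unquoting a plain string inserts the string itself.

```r
library(rlang)

col_string <- "col1"        # a plain character string
col_symbol <- sym("col1")   # the same name as a symbol

# expr() shows the expression that results once `!!` has been applied
from_string <- expr(as.factor(!!col_string))
from_symbol <- expr(as.factor(!!col_symbol))

print(from_string)  # the argument is still the string "col1"
print(from_symbol)  # the argument is now the bare name col1

# Compare with expressions written out by hand
identical(from_string, quote(as.factor("col1")))
identical(from_symbol, quote(as.factor(col1)))
```

A difference like this is a good prompt for a follow-up question to ChatGPT, or for a look at how the calling code builds `col_list`, before relying on the statement that a string is turned into a column name.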
In this way, we can get a clear and readable, plain English description of the job done by a particular piece of code very quickly.
Other services, such as Google’s Bard, can do this too. ChatGPT Plus, a subscription service, also has special plug-ins that help make working with code much easier.
Who knew cracking code could be this fun and straightforward? With ChatGPT as our trusty sidekick, we've peeked behind the curtains of intricate R code, unraveling its mysteries piece by piece. Remember, while this AI wizardry is fantastic, a mix of human expertise and official documentation remains your ultimate guide through the coding labyrinth. So, armed with newfound knowledge and a reliable AI companion, let's keep exploring, learning, and demystifying the captivating world of programming together!
Professor Dan MacLean has a Ph.D. in molecular biology from the University of Cambridge and gained postdoctoral experience in genomics and bioinformatics at Stanford University in California. Dan is now Head of Bioinformatics at the world-leading Sainsbury Laboratory in Norwich, UK, where he works on bioinformatics, genomics, and machine learning. He teaches undergraduates, post-graduates, and post-doctoral students in data science and computational biology. His research group has developed numerous new methods and software in R, Python, and other languages with over 100,000 downloads combined.