Importing large datasets with R
The same scalability limitations illustrated for Python packages used to manipulate data also exist for R packages in the Tidyverse ecosystem. In R, too, you cannot load a dataset larger than the RAM available on the machine. Here again, the first solution usually adopted is to switch to Spark-based distributed systems, which provide the SparkR package. It offers a distributed implementation of the DataFrame you are used to in R, supporting filtering, aggregation, and selection operations much as the dplyr package does. For those of us who are fans of the Tidyverse world, RStudio is actively developing the sparklyr package, which lets you use the familiar dplyr functionality on distributed DataFrames as well. However, using Spark-based systems to process CSVs that together take up little more than the RAM available on your machine may be overkill, given the overhead of all the Java infrastructure needed to run them. In the...