Import large datasets with Python
In Chapter 3, Configuring Python with Power BI, we suggested that you install some of the most commonly used data management packages in your environment, including NumPy, pandas, and scikit-learn. The biggest limitation of these packages is that they cannot handle datasets larger than the RAM of the machine on which they run, and they cannot scale out across more than one machine. To overcome this limitation, distributed systems based on Spark, which has become a dominant tool in the big data analysis landscape, are often used. However, moving to these systems forces developers to rewrite existing code using PySpark, the API created to work with Spark objects from Python. This process is generally seen as delaying project delivery and frustrating developers, who are far more confident with the libraries available for standard Python.
In response to the preceding issues, the community developed a...