Importing large datasets with Python
In Chapter 3, Configuring Python with Power BI, we suggested that you install some of the most commonly used data management packages in your environment, including NumPy, pandas, and scikit-learn. The biggest limitation of these packages is that they work in memory on a single machine: they cannot handle datasets larger than the RAM of the machine on which they run, and they cannot scale out to more than one machine. To overcome this limitation, distributed systems based on Spark, which has become a dominant tool in the big data analytics landscape, are often used. However, moving to these systems forces developers to rewrite code they have already written so that it uses PySpark, the API created to work with Spark objects from Python. This rewrite is generally seen as delaying project delivery and frustrating developers who are much more comfortable with the libraries available for standard Python.
In response to the above issues, the community has developed a new library...