Transforming data
Raw data in real-world applications is often unstructured and noisy, so it cannot be fed directly to machine learning algorithms. We usually need to apply several transformations to the raw data to convert it into a format that machine learning algorithms can work with. In this section, we will look at several options for transforming data in a scalable and efficient way on Google Cloud.
Here are three common options for data transformation in the GCP environment:
- Ad hoc transformations within Jupyter Notebooks
- Cloud Data Fusion
- Dataflow pipelines for scalable data transformations
Let’s learn about these three methods in more detail.
Ad hoc transformations within Jupyter Notebooks
Machine learning algorithms are mathematical in nature and can operate only on numeric data. For example, in computer vision problems, images are converted into numerical pixel values before they’re fed into a model. Similarly, in the...