Data preparation
The next step is a data transformation tier that processes the raw data. Typical tasks in this tier include:
- Data Cleansing
- Filtration
- Aggregation
- Augmentation
- Consolidation
- Storage
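As a rough illustration, four of the steps above (cleansing, filtration, aggregation and augmentation) can be sketched in plain Python. The sensor readings, field names, plausibility bounds and the `sensor_sites` lookup table are all hypothetical, invented only for this example; a real pipeline would more likely run on an engine such as Spark than on in-memory lists.

```python
from collections import defaultdict

# Hypothetical raw records; field names and values are illustrative only.
raw_readings = [
    {"sensor": "a1", "temp_c": "21.5"},
    {"sensor": "a1", "temp_c": "22.5"},
    {"sensor": "b2", "temp_c": ""},      # missing value
    {"sensor": "b2", "temp_c": "19.0"},
    {"sensor": "c3", "temp_c": "999"},   # implausible reading
]

# Hypothetical reference data used to enrich the aggregated result.
sensor_sites = {"a1": "north", "b2": "south", "c3": "north"}


def cleanse(rows):
    """Data cleansing: drop rows with a missing value, parse the rest."""
    return [
        {"sensor": r["sensor"], "temp_c": float(r["temp_c"])}
        for r in rows
        if r["temp_c"].strip()
    ]


def filter_plausible(rows, low=-50.0, high=60.0):
    """Filtration: keep only readings inside an assumed plausible range."""
    return [r for r in rows if low <= r["temp_c"] <= high]


def aggregate(rows):
    """Aggregation: mean temperature per sensor."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r["sensor"]].append(r["temp_c"])
    return {sensor: sum(vals) / len(vals) for sensor, vals in grouped.items()}


def augment(means, sites):
    """Augmentation: attach the site name from the reference table."""
    return [
        {"sensor": sensor, "mean_temp_c": mean, "site": sites[sensor]}
        for sensor, mean in sorted(means.items())
    ]


prepared = augment(
    aggregate(filter_plausible(cleanse(raw_readings))), sensor_sites
)
print(prepared)
```

Each step is a small function that takes and returns plain data, so the steps can be reordered or replaced independently. Consolidation and storage are left out to keep the sketch short.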
Cloud providers have become major data science platforms, alongside established analytics vendors. Some of the most popular stacks are built around:
- Azure ML service
- AWS SageMaker
- GCP Cloud ML Engine
- SAS
- RapidMiner
- KNIME
Apache Spark is one of the most popular tools for performing these transformations, but it is a processing engine and still needs a separate data store. For persistence, the most common solutions are:
- Hadoop Distributed File System (HDFS)
- HBase
- Apache Cassandra
- Amazon S3
- Azure Blob Storage
It's also possible to process data for machine learning in place, inside the database. Databases such as SQL Server and Azure SQL Database are adding machine learning functionality to support such pipelines. Spark has that...