Building data ingestion pipelines in batch and real time
An end-to-end data ingestion pipeline reads data from one or more data sources and ingests it into a data sink. In the context of big data and data lakes, ingestion typically spans a large number of data sources and therefore requires a highly scalable data processing engine. Several third-party tools are purpose-built for handling data ingestion at scale, such as StreamSets, Qlik, Fivetran, and Infoworks. Cloud providers also have their own native offerings, such as AWS Data Migration Service, Microsoft Azure Data Factory, and Google Dataflow. There are also free and open source data ingestion tools you could consider, such as Apache Sqoop, Apache Flume, and Apache NiFi, to name a few.
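To make the source-to-sink shape of such a pipeline concrete, here is a minimal batch ingestion sketch using PySpark. The landing-zone path, the CSV source format, and the Parquet sink path are assumptions chosen purely for illustration, not a prescribed layout.

# A minimal sketch of a batch ingestion pipeline with PySpark.
# Paths and formats below are hypothetical, for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-ingest").getOrCreate()

# Read raw CSV files from a hypothetical landing zone (the data source).
raw_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/landing/orders/"))

# Append the data to the data lake in Parquet format (the data sink).
(raw_df.write
       .mode("append")
       .parquet("/data/lake/orders/"))

spark.stop()

In practice, you would replace the inferred schema with an explicit one and parameterize the source and sink paths, but the read-transform-write structure stays the same.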
Tip
Apache Spark is good enough for ad hoc data ingestion, but it is not a common industry practice to use Apache Spark as a dedicated data ingestion tool.
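For the real-time side of the pipeline, the sketch below uses Spark Structured Streaming to continuously ingest events into the lake. The Kafka broker address, topic name, and checkpoint and sink paths are all hypothetical, and the sketch assumes the spark-sql-kafka connector package is available on the classpath.

# A minimal Structured Streaming sketch for near-real-time ingestion.
# Broker, topic, and paths are hypothetical; requires the
# spark-sql-kafka connector package on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-ingest").getOrCreate()

# Read a stream of events from a hypothetical Kafka topic (the source).
events_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "orders")
             .load())

# Continuously append the decoded events to the lake (the sink).
query = (events_df
         .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
         .writeStream
         .format("parquet")
         .option("path", "/data/lake/orders_stream/")
         .option("checkpointLocation", "/data/checkpoints/orders_stream/")
         .start())

query.awaitTermination()

The checkpoint location is what lets the stream recover exactly where it left off after a failure or restart, so it should live on durable storage alongside the sink.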