There are a lot of ML tutorials on the internet, and their sample datasets are usually clean, formatted, and ready to be fed to algorithms, because the aim of many tutorials is to showcase the capabilities of a certain tool, library, or Software as a Service (SaaS) offering.
In reality, datasets come in different types and sizes. A recent industry survey conducted by Kaggle in 2017, titled The State of Data Science and Machine Learning, with over 16,000 responses, shows that the three most commonly used data types are relational data, text data, and image data.
Moreover, messy data sits at the top of the list of problems that people have to deal with, again based on the Kaggle survey. When a dataset is messy and needs a lot of special treatment before ML models can use it, you spend a considerable amount of time cleaning and manipulating the data and hand-crafting features to get it into the right shape. This is the most time-consuming part of any data science project.
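To make this concrete, the short pandas sketch below shows the kind of cleanup and hand-crafted feature this typically involves; the file name and column names (customers.csv, churned, age, country, and so on) are made up purely for illustration, so adapt them to your own data.

```python
import pandas as pd

# Hypothetical file and column names, used purely for illustration.
df = pd.read_csv("customers.csv")

# Remove exact duplicates and rows that are missing the target label.
df = df.drop_duplicates()
df = df.dropna(subset=["churned"])

# Fill missing numeric values and normalize an inconsistently formatted text field.
df["age"] = df["age"].fillna(df["age"].median())
df["country"] = df["country"].str.strip().str.lower()

# Hand-craft a simple feature from two raw columns.
df["spend_per_visit"] = df["total_spend"] / df["visits"].clip(lower=1)
```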
What about selecting performant ML models and optimizing their hyperparameters across the training, validation, and testing phases? These are also crucially important steps that can be carried out in many ways.
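As a minimal sketch of what those steps can look like in code, the scikit-learn example below tunes a random forest with a cross-validated grid search and then checks it on a held-out split; the synthetic dataset, estimator choice, and parameter grid are arbitrary examples rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Cross-validated grid search over a small, illustrative hyperparameter grid.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# Evaluate the best configuration on the held-out test split.
print(search.best_params_, search.score(X_test, y_test))
```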
When the right combination of components comes together to process and model your dataset, that combination represents an ML pipeline, and the number of pipelines to experiment with can grow very quickly.
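In scikit-learn terms, for instance, such a combination can be expressed as a Pipeline object; the particular steps below (median imputation, standard scaling, logistic regression) are just one of many possible combinations.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One candidate pipeline: impute missing values, scale features, then classify.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
# Every alternative imputer, scaler, or model multiplies the number of candidate pipelines.
```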
To build high-performing ML pipelines successfully, you should systematically work through the available options for every step, taking into account your limits in terms of time and hardware/software resources.
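One pragmatic way to run such a systematic sweep within a budget is a randomized search over both the pipeline steps and their hyperparameters, capping the number of candidates with n_iter to match your time and hardware limits; the pipeline and search space below are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each dictionary varies a pipeline step and its hyperparameters;
# n_iter caps how many candidates are tried, acting as the budget.
param_distributions = [
    {
        "scale": [StandardScaler(), MinMaxScaler()],
        "model": [LogisticRegression(max_iter=1000)],
        "model__C": [0.1, 1.0, 10.0],
    },
    {
        "scale": [StandardScaler(), MinMaxScaler()],
        "model": [RandomForestClassifier(random_state=0)],
        "model__n_estimators": [100, 300],
    },
]
search = RandomizedSearchCV(pipe, param_distributions, n_iter=8, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

Raising or lowering n_iter (or the cross-validation folds) is one simple lever for trading search thoroughness against the time and compute you can afford.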