Designing a workflow
A data scientist has many options when selecting and implementing a classification or clustering algorithm.
First, a mathematical or statistical model has to be selected to extract knowledge from the raw input data or from the output of an upstream data transformation. The choice of model is constrained by the following parameters:
- Business requirements such as accuracy of results
- Availability of training data and algorithms
- Access to a domain or subject-matter expert
Second, the engineer has to select a computational and deployment framework suited to the amount of data to be processed. The computational context is defined by the following parameters:
- Available resources such as machines, CPU, memory, or I/O bandwidth
- Implementation strategy such as iterative versus recursive computation or caching
- Requirements for the responsiveness of the overall process such as duration of computation or display of intermediate results
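The two sets of parameters listed above can be captured as plain data and passed to a small decision function. The following Python sketch is only an illustration: the class names, the field names, the `Framework` categories, and the 50 percent memory threshold are all assumptions introduced here, not rules stated in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Framework(Enum):
    """Hypothetical deployment categories, for illustration only."""
    SINGLE_HOST = "single host, in memory"
    SINGLE_HOST_CACHED = "single host with caching"
    DISTRIBUTED = "distributed cluster"


@dataclass(frozen=True)
class ModelConstraints:
    """Parameters that constrain the choice of model."""
    required_accuracy: float   # business requirement, between 0.0 and 1.0
    has_labeled_data: bool     # availability of training data
    has_domain_expert: bool    # access to a subject-matter expert


@dataclass(frozen=True)
class ComputeContext:
    """Parameters that define the computational context."""
    machines: int                     # available hosts
    memory_gb_per_machine: float      # memory available on each host
    dataset_gb: float                 # amount of data to process
    needs_intermediate_results: bool  # responsiveness requirement


def select_model_family(constraints: ModelConstraints) -> str:
    # Supervised classification needs labeled training data;
    # without labels, clustering is the remaining option.
    return "classification" if constraints.has_labeled_data else "clustering"


def select_framework(ctx: ComputeContext) -> Framework:
    # Assumed rule of thumb: keep the data in memory only if it
    # uses at most half of the memory of a single machine.
    if ctx.dataset_gb <= 0.5 * ctx.memory_gb_per_machine:
        return Framework.SINGLE_HOST
    if ctx.machines == 1:
        return Framework.SINGLE_HOST_CACHED
    return Framework.DISTRIBUTED


def select_execution_mode(ctx: ComputeContext) -> str:
    # Displaying intermediate results implies incremental processing.
    return "incremental" if ctx.needs_intermediate_results else "batch"


if __name__ == "__main__":
    ctx = ComputeContext(machines=8, memory_gb_per_machine=16.0,
                         dataset_gb=400.0, needs_intermediate_results=True)
    print(select_framework(ctx).value, "/", select_execution_mode(ctx))
```

Keeping the constraints in immutable records separates the description of the problem from the decision logic, so the thresholds can be revised without touching the code that gathers the requirements.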
The following diagram illustrates the selection...