Data Acquisition, Data Quality, and Noise
Data for machine learning systems can come directly from humans or from software systems, usually called source systems. Where the data comes from affects what it looks like, what quality it has, and how it should be processed.
Data that originates from humans is usually noisier than data that originates from software systems. We humans are prone to small inconsistencies, and we can also interpret the same thing differently. For example, two people reporting the same defect may describe it very differently; the same is true for requirements, designs, and source code.
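The defect-report example can be made concrete with a rough sketch. The snippet below compares two human-written descriptions of the same defect and two machine-generated log lines, using word-level Jaccard similarity as a crude stand-in for "how consistent is the wording". All four strings, the variable names, and the `jaccard_similarity` helper are hypothetical and chosen only for illustration; a real project would likely use a more robust text-similarity measure.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Return the Jaccard similarity of the lowercased word sets of two texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical reports of the same defect, written by two different people.
human_report_a = "App crashes when I click the save button on the settings page"
human_report_b = "Saving preferences causes the program to close unexpectedly"

# Hypothetical log lines for the same defect, emitted by a software system.
system_log_a = "ERROR 2023-01-05 12:00:01 NullPointerException in SettingsController.save"
system_log_b = "ERROR 2023-01-05 12:03:47 NullPointerException in SettingsController.save"

human_similarity = jaccard_similarity(human_report_a, human_report_b)
system_similarity = jaccard_similarity(system_log_a, system_log_b)

print(f"human reports: {human_similarity:.2f}")
print(f"system logs:   {system_similarity:.2f}")
```

In this toy case the two human reports share almost no vocabulary, while the two log lines differ only in their timestamps. That regularity is what makes noise in system-generated data easier to detect and filter.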
Data that originates from software systems is often more consistent: it contains less noise, or its noise is more regular than the noise in human-generated data. This data is generated by source systems. Therefore, controlling and monitoring the quality of the data that’s generated automatically is different...