Dirty data, damaged models – how data quantity and quality impact ML
When training or using ML and artificial intelligence models, data is not only an asset but also the foundation of success. Without high-quality, representative data, even the most sophisticated ML model is useless. But what happens when you don’t have enough data, or when the data you have is biased or inaccurate?
To consider one hypothetical example, many banks use ML to flag potentially fraudulent transactions and block accounts based on information about the transaction. Imagine the model was trained only on a subset of account types, such as current accounts, which tend to have regular, lower-value transactions. Suppose the bank then decides to apply the same model to savings accounts, which may have larger, less frequent transactions. Because the model has never seen this pattern, it may now incorrectly flag most typical savings account transactions as fraudulent. These false positives lead to frustrated customers and stressed customer service...
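To make the scenario concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the transaction amounts are synthetic, the "model" is a simple threshold rule (flag anything more than three standard deviations above the training mean) rather than a real fraud model, and names such as `false_positive_rate` are hypothetical. It only shows how a rule fitted to current accounts can misfire on savings accounts.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic, hypothetical amounts. Every transaction here is legitimate.
current_train = [random.gauss(50, 15) for _ in range(5000)]   # regular, lower-value
current_new = [random.gauss(50, 15) for _ in range(5000)]     # same account type, unseen data
savings_new = [random.gauss(2000, 500) for _ in range(5000)]  # larger transactions

# Toy "model": flag any amount more than 3 standard deviations
# above the mean of the training data (current accounts only).
threshold = statistics.mean(current_train) + 3 * statistics.stdev(current_train)


def false_positive_rate(amounts, threshold):
    """Share of legitimate transactions that the rule flags as fraud."""
    return sum(amount > threshold for amount in amounts) / len(amounts)


fpr_current = false_positive_rate(current_new, threshold)
fpr_savings = false_positive_rate(savings_new, threshold)

print(f"Flagging threshold: {threshold:.2f}")
print(f"Current accounts flagged: {fpr_current:.1%}")
print(f"Savings accounts flagged: {fpr_savings:.1%}")
```

On the account type it was fitted to, the rule flags only a small share of legitimate transactions. On savings accounts it flags nearly all of them, because ordinary savings transactions sit far outside the range the rule learned. A real fraud model is far more complex, but the failure has the same cause: the training data did not represent the accounts the model was later applied to.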