Building idempotent data pipelines
A crucial feature of a production data pipeline is that it is idempotent. In mathematics, an element of a set is idempotent if its value is unchanged when it is multiplied or otherwise operated on by itself.
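To make the definition concrete, here is a minimal sketch in Python. The helper names `is_idempotent_element` and `is_idempotent_function` are hypothetical and exist only for this illustration. The second helper checks the closely related notion for functions, where applying a function twice gives the same result as applying it once.

```python
import operator


def is_idempotent_element(x, op):
    """Return True if x is unchanged when combined with itself under op."""
    return op(x, x) == x


def is_idempotent_function(f, x):
    """Return True if applying f twice to x matches applying it once."""
    return f(f(x)) == f(x)


# Under multiplication, 0 and 1 are idempotent elements; 2 is not (2 * 2 == 4).
print(is_idempotent_element(1, operator.mul))  # True
print(is_idempotent_element(2, operator.mul))  # False

# abs() is an idempotent function; adding 1 is not.
print(is_idempotent_function(abs, -3))              # True
print(is_idempotent_function(lambda n: n + 1, -3))  # False
```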
In data science, this means that when your pipeline fails, which is not a matter of if but when, it can be rerun and the results are the same. Or, if you accidentally click run on your pipeline three times in a row, there are no duplicate records.
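The difference can be sketched with plain Python data structures standing in for a real data store. Everything here is an assumption made for illustration: the function names `load_append` and `load_upsert`, the `id` field, and the sample records are all hypothetical. The first loader appends on every run, while the second writes each record under a stable key, so a rerun overwrites rather than duplicates.

```python
def load_append(store, records):
    """Non-idempotent load: every run appends, so reruns create duplicates."""
    store.extend(records)


def load_upsert(store, records, key="id"):
    """Idempotent load: each record is stored under its key, so reruns overwrite."""
    for record in records:
        store[record[key]] = record


# Hypothetical sample data with a stable identifier per record.
records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

appended = []   # list standing in for an append-only target
upserted = {}   # dict standing in for a keyed target

# Simulate clicking run three times in a row.
for _ in range(3):
    load_append(appended, records)
    load_upsert(upserted, records)

print(len(appended))  # 6
print(len(upserted))  # 2
```

The keyed version depends on every record carrying an identifier that stays the same between runs; records with freshly generated random keys would still pile up on each run.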
In Chapter 3, Reading and Writing Files, you created a data pipeline that generated 1,000 records of people and put that data in an Elasticsearch database. If you let that pipeline run every 5 minutes, you would have 2,000 records after 10 minutes. In this example, the records are all random and you may be OK. But what if the records were rows queried from another system?
Every time the pipeline runs, it would insert...