PII redaction pipeline in Elasticsearch
The PII redaction pipeline in Elasticsearch aims to automatically redact sensitive information from data as it’s ingested into the Elasticsearch cluster. This process ensures that sensitive data is protected, which is particularly important when handling personal information that could be used to identify an individual, such as names, addresses, phone numbers, and social security numbers.
In this section, we will discuss the steps users can take to configure the PII redaction pipeline in Elasticsearch.
For the complete code, open the Jupyter Notebook in the chapter 6 folder of the book’s GitHub repository: https://github.com/PacktPublishing/Vector-Search-for-Practitioners-with-Elastic/tree/main/chapter6.
We will review the key points of the pipeline.
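Before looking at the pipeline itself, it may help to see what redaction means at the level of a single document. The following is a minimal sketch in plain Python, not the Elasticsearch pipeline from the notebook: it assumes two hypothetical regular expressions (US-style social security and phone numbers) and a hypothetical `message` field, and it replaces each match with a labeled placeholder. Names and addresses are deliberately left out, since they generally cannot be caught reliably with patterns alone.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs
# much broader coverage than these two US-style formats.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a <LABEL> placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


def redact_document(doc: dict, field: str = "message") -> dict:
    """Return a copy of doc with the given text field redacted."""
    redacted = dict(doc)  # shallow copy so the caller's document is untouched
    if isinstance(redacted.get(field), str):
        redacted[field] = redact(redacted[field])
    return redacted


# Assumed sample document; the field name "message" is an illustration.
doc = {"id": 1, "message": "Call 555-123-4567 about SSN 123-45-6789."}
print(redact_document(doc)["message"])
```

The placeholder keeps the type of the removed value visible, so a redacted document still shows that a phone number or social security number was present without exposing it.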
Generating synthetic PII
To run our pipeline, we will need a dataset. Thankfully, we have faker, the Python library for generating fake data of a given type. Our task...