DrivenData has come out with a new tool named Deon, which allows you to easily add an ethics checklist to your data science projects. Deon aims to push forward the conversation about ethics in data science, machine learning, and artificial intelligence by providing actionable reminders to data scientists.
According to the Deon team, “it's not up to data scientists alone to decide what the ethical course of action is. This has always been a responsibility of organizations that are part of civil society. This checklist is designed to provoke conversations around issues where data scientists have particular responsibility and perspective”.
Deon comes with a default checklist, but you can also develop your own custom checklists by removing items and sections, or marking items as N/A, depending on the needs of the project. Each item in the default checklist also links to real-world examples.
To run Deon on your data science projects, you need Python 3 or greater. Let’s now discuss the two types of checklists that come with Deon: default and custom.
Default checklist
The default checklist comprises sections on Data Collection, Data Storage, Analysis, Modeling, and Deployment.
Data Collection
This section covers Informed consent, Collection bias, and Limit PII exposure.
- Informed consent includes a mechanism for gathering consent where users have a clear understanding of what they are consenting to.
- Collection Bias checks on sources of bias introduced during data collection and survey design.
- Lastly, Limit PII exposure discusses ways to minimize the exposure of personally identifiable information (PII).
Data Storage
This section covers items such as Data security, Right to be forgotten, and Data retention plan.
- Data Security refers to a plan to protect and secure data.
- Right to be forgotten includes a mechanism by which an individual can request that his/her personal information be removed.
- Data retention plan consists of a plan for deleting data once it is no longer needed.
Analysis
This section comprises information on Missing perspectives, Dataset bias, Honest representation, Privacy in analysis, and Auditability.
- Missing perspectives addresses blind spots in data analysis via engagement with relevant stakeholders.
- Dataset bias discusses examining the data for possible sources of bias and consists of steps to mitigate or address them.
- Honest representation checks whether visualizations, summary statistics, and reports are designed to honestly represent the underlying data.
- Privacy in analysis ensures that data containing PII is not used or displayed unless necessary for the analysis.
- Auditability refers to producing an analysis that is well documented and reproducible.
Modeling
This section offers information on Proxy discrimination, Fairness across groups, Metric selection, Explainability, and Communicate bias.
- Proxy discrimination talks about ensuring that the model does not rely on variables or proxies that are discriminatory.
- Fairness across groups is a section that cross-checks whether the model results have been tested for fairness with respect to different affected groups.
- Metric selection considers the effects of optimizing for the defined metrics and other additional metrics.
- Explainability talks about explaining the model’s decisions in understandable terms.
- Communicate bias makes sure that the shortcomings, limitations, and biases of the model have been properly communicated to relevant stakeholders.
Deployment
This covers topics such as Redress, Roll back, Concept drift, and Unintended use.
- Redress covers discussing with your organization a plan for responding in case users are harmed by the results.
- Roll back talks about a way to turn off or roll back the model in production when required.
- Concept drift refers to the changing relationship between input and output data in a problem over time. This checklist item reminds the user to test for and monitor concept drift, to ensure that the model remains fair over time.
- Unintended use prompts the user about the steps to be taken to identify and prevent unintended uses and abuse of the model.
Custom checklists
For projects with particular concerns, it is recommended that you create your own checklist.yml file. Custom checklists are required to follow the same schema as the default checklist.yml: a top-level title, which is a string, and sections, which are a list. Each section in the list must have a title, a section_id, and a list of lines. Each line must include a line_id, a line_summary, and a line string containing the content.
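Following that schema, a minimal custom checklist might look like this (the titles and IDs below are illustrative, not part of Deon itself):

```yaml
title: My Project's Ethics Checklist
sections:
  - title: Data Collection
    section_id: A
    lines:
      - line_id: A.1
        line_summary: Informed consent
        line: >-
          If there are human subjects, have they given informed consent,
          and do they clearly understand what they are consenting to?
```

You can then point Deon at this file when generating your project's checklist, for example (assuming Deon's `--checklist` command-line option) with `deon --checklist checklist.yml --output ETHICS.md`.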
When changing the default checklist, it is necessary to keep in mind that Deon’s goal is to have checklist items that are actionable. This is why users are advised to avoid suggesting items that are vague (e.g., "do no harm") or extremely specific (e.g., "remove social security numbers from data").
For more information, be sure to check out the official DrivenData blog post.