Securing code and artifacts
As we mentioned earlier, one of the significant differences between AI and traditional systems is that AI depends on data for its development. It also introduces a new type of artifact – models – which are critical and sensitive assets. This difference brings security risks, even at development time. This section will walk you through the defenses we can introduce to secure the confidentiality and integrity of our AI solution artifacts before they reach production.
Secure code
Before deploying our Flask application, it is essential to ensure that the Python code has no security vulnerabilities. This is known as source code analysis, and it is the core of static application security testing (SAST). There are many SAST tools available; you can find out more at https://owasp.org/www-community/Source_Code_Analysis_Tools.
Bandit is a popular open source SAST tool for Python that’s designed to find common security issues in Python code.
We can install Bandit using pip:
pip install bandit
Then, we can navigate to each source code directory and run the following command:
bandit -r .
Bandit will scan the Python files in the directory and report any security issues. The -r parameter makes the scan recursive, covering all subdirectories. Our sample ImRecS doesn’t have any significant vulnerabilities to address:
Figure 3.3 – Bandit static source code analysis summary
You should always review and address reported vulnerabilities, especially those rated high or critical. Scans should be run regularly and before every deployment.
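To illustrate the kind of issue Bandit flags, consider parsing user-supplied input. Bandit reports the use of eval() (rule B307) because it executes arbitrary expressions; ast.literal_eval is the safe alternative for parsing Python literals. This snippet is illustrative and not taken from the ImRecS codebase:

```python
import ast

def parse_literal(text: str):
    # ast.literal_eval only evaluates Python literals (lists, dicts,
    # numbers, strings); unlike eval(), it cannot execute code.
    return ast.literal_eval(text)

# eval(text) would also parse "[1, 2, 3]", but Bandit flags it (B307)
# because an attacker could pass "__import__('os').system(...)" instead.
```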
Securing dependencies with vulnerability scanning
While our Python code might be secure, the libraries and the container images we use might have vulnerabilities we need to detect and remediate. This is known as component analysis or software composition analysis (SCA). You can learn more at https://owasp.org/www-community/Component_Analysis.
Trivy is a popular open source vulnerability scanner for third-party containers, libraries, and other artifacts.
You can install Trivy using the instructions at https://aquasecurity.github.io/trivy/v0.18.3/installation/.
You can use the following command to scan your project for third-party vulnerabilities:
trivy fs .
The preceding command scans the current directory and produces a report, as shown in the following screenshot:
Figure 3.4 – Trivy vulnerabilities scan summary
Trivy relies on requirements.txt files to analyze Python dependencies. For ImRecS, we can see some vulnerabilities in the Pillow image library, which we can address by installing the latest version. In the requirements.txt file, we had the following:
Pillow==9.2.0
Removing the version number and installing the latest version solves the problem.
This may not always be possible. For instance, there might be no fix yet, or the dependency might be introduced indirectly (a transitive dependency) by another package that only works with the vulnerable version. In that case, you will need to understand the vulnerability and any other mitigations that may make it not exploitable. You can exclude issues that have no fix yet with the --ignore-unfixed flag, and allow-list issues you have already evaluated by using a .trivyignore file containing vulnerability IDs such as CVE-2022-45199.
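A .trivyignore file is simply a list of vulnerability IDs, one per line, with optional comments recording why each finding was accepted. A sketch (the justification comment is illustrative):

```
# Accepted risk: no fix released yet; we do not process untrusted input
# through the affected code path. Re-evaluate at the next release.
CVE-2022-45199
```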
Similarly, before pushing your container images, scan them with Trivy:
trivy image <YOUR_IMAGE_NAME:TAG>
To find the image name and tag, run the following command:
docker images
Address any vulnerabilities Trivy identifies before deploying the containers. There are some high and critical vulnerabilities, as depicted in the following screenshot:
Figure 3.5 – open-ssh critical vulnerabilities detected by Trivy
The critical ones are in the openssh OS package. None have a fixed version (at the time of writing).
Since we don’t use SSH clients from the host, we can mitigate the critical one by removing the package in the Dockerfile by adding the following:
RUN apt-get remove -y openssh-client
Even better, you can use the slimmed-down version of the base image in your Dockerfile that does not have it installed:
FROM python:3.10-slim
Since this is a learning exercise, removing openssh-client is sufficient to demonstrate this approach.
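Putting the two mitigations together, the relevant Dockerfile lines might look as follows. This is a sketch; the file names and the application entry point are assumptions, not the actual ImRecS Dockerfile:

```dockerfile
# Slim base image: far fewer OS packages, so a smaller attack surface
FROM python:3.10-slim

WORKDIR /app

# Install pinned dependencies first to benefit from layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Belt and braces: drop openssh-client in case the base image ships it
RUN apt-get remove -y openssh-client || true

COPY . .
CMD ["python", "app.py"]
```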
For real-life scenarios, you would need to spend time assessing and mitigating findings rated with high severity. For instance, some are related to Perl, and we don’t use Perl. Removing all the unnecessary components would be a part of this exercise.
Secret scanning
You may have noticed that Trivy also reported that we store a private key in our source code folders:
Figure 3.6 – Trivy detecting a secret leak
Secret scanning is essential since we don’t want to leak secrets such as passwords, API tokens, or private keys via our GitHub repository, especially for production systems. This would hand over the keys to the castle to attackers. We have mitigated this finding by excluding the SSL key files via the .gitignore file, which prevents them from being committed to the Git repository.
There are more sophisticated ways of doing secrets management. A good starting point is the OWASP Secrets Management Cheatsheet at https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html.
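A simple first step toward proper secrets management is to read secrets from environment variables instead of files in the repository. The following sketch assumes a hypothetical IMRECS_API_TOKEN variable; the name is illustrative, not part of the project:

```python
import os

def get_api_token() -> str:
    # Read the secret from the environment rather than hardcoding it in
    # source; fail fast so a missing secret is caught at startup, not
    # at the first authenticated request.
    token = os.environ.get("IMRECS_API_TOKEN")
    if not token:
        raise RuntimeError("IMRECS_API_TOKEN is not set")
    return token
```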
Securing Jupyter Notebooks
So far, we’ve used vulnerability scanning, which you will find in traditional application security. However, AI introduces new tools, notably Jupyter Notebooks, which contain code and have library dependencies. Because we’re using requirements.txt, those dependencies are covered by Trivy. Sometimes, however, data scientists install packages with Notebook magic commands, which run external commands inline, such as the following:
!pip install <package name>
Packages installed this way are not captured by Trivy scans. Notebooks are stored in JSON format, with the Python code held as code cells inside the Notebook JSON. We can apply Bandit or other static code analysis tools by exporting a Notebook to a Python file, like so:
jupyter nbconvert --to script YourNotebook.ipynb
Then, we can run Bandit. This can get messy if there are many notebooks to scan; we have written a script file to automate Bandit scanning of notebooks in a folder:
$ ./bandit-notebook-scan.sh -r notebooks -k
NBDefense is another helpful tool that’s dedicated to Notebook security. For dependencies, it uses Trivy under the hood.
We can install it using pip:
pip install nbdefense
Then, we can use it to scan all our Notebooks under the notebooks folder. NBDefense can use the open source spaCy library to detect PII in your Notebooks. Although it installs the spacy package, you will still need to download the en_core_web_trf model. You can do this with the following command:
python -m spacy download en_core_web_trf
You can scan an individual Notebook or all the Notebooks under a folder. For instance, in our case, we can scan the Notebooks we developed in Chapter 2:
nbdefense scan -r notebooks
It is reassuring to see that the CIFAR-10 CNN Notebook we used in the previous chapter has no issues:
Figure 3.7 – CIFAR-10 CNN Notebook without issues
Now, let’s learn how to secure models from malicious code.
Securing models from malicious code
So far, we’ve looked at securing models from physical theft or tampering. Models, especially those serialized for deployment, can be vulnerable to arbitrary code execution. This means that if an attacker can tamper with the serialized model file, they might be able to execute malicious code when the model is loaded into memory.
This is especially true for models saved using Python’s pickle module, which has historically been very popular. This is why we use the hierarchical data format (H5) offered by Keras, but it is worth delving a bit deeper into the serialization risks of the pickle format.
Note
The H5 format is Keras-specific. Safetensors offers a framework-independent alternative to pickles. For more information, see the Safetensors repository at https://github.com/huggingface/safetensors.
If a malicious actor can modify a pickled file, they can insert code that runs arbitrary commands when the file is unpickled. Consider the following scenario. Here, an attacker modifies a pickled model file to include malicious code. The unsuspecting data scientist or engineer loads this tampered model using pickle.load(). The malicious code executes automatically, potentially causing harm, stealing data, or compromising the system:
# Malicious code example:
import pickle
import os

# This is a simple representation and not an actual malicious payload.
class MaliciousPayload:
    def __reduce__(self):
        return (os.system, ('echo You have been compromised!',))

# Save the malicious payload
with open('malicious_model.pkl', 'wb') as file:
    pickle.dump(MaliciousPayload(), file)

# Loading the tampered model will execute the malicious command.
with open('malicious_model.pkl', 'rb') as file:
    model = pickle.load(file)
Given these risks, it’s crucial to ensure that serialized models are stored securely, their integrity is maintained, and they are scanned for vulnerabilities before being loaded. Since they aren’t libraries or packages, traditional vulnerability scanners will not detect malicious code in a model file.
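If you must load pickled files at all, the Python pickle documentation suggests restricting which globals may be resolved during unpickling by overriding Unpickler.find_class. A minimal sketch with an illustrative allow-list:

```python
import io
import pickle

# Only these (module, name) pairs may be resolved during unpickling;
# anything else, such as os.system, is rejected before it can run.
ALLOWED_GLOBALS = {
    ("builtins", "list"),
    ("builtins", "dict"),
    ("builtins", "set"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is forbidden"
        )

def safe_load(data: bytes):
    # Drop-in replacement for pickle.loads() with the allow-list applied
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Note that this reduces the attack surface but is not a substitute for loading models only from trusted, integrity-checked sources, or for safer formats such as Safetensors.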
This is where tools such as ModelScan come into play.
ModelScan is a tool that scans serialized models, including H5 files and pickles, for malicious code. We can install ModelScan with pip:
pip install modelscan
To scan our model, we can use the following command:
modelscan -p <PATH_TO_YOUR_MODEL(S)>
In our case, this will look as follows. Here, we’re scanning all models in the models folder:
modelscan -p models
Here’s the summary screen of the scan results:
Figure 3.8 – Model scan results for our model
As expected, our model doesn’t report any vulnerabilities.
Integrating with DevSecOps and MLOps pipelines
We talked about DevSecOps and MLOps earlier in this chapter. All the security controls and deployment steps we described have been presented as manual steps so that we could focus on the concepts and techniques. For real-life scenarios, we should integrate these controls and steps into our deployment pipelines. Some steps were also simplified – for instance, deploying a model by copying it to a folder. MLOps can automate this as part of more sophisticated pipelines and provide a model registry and governance. We will delve into this in more detail in Chapter 15, once we’ve covered this book’s core subject, adversarial AI.
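As an illustration of such integration, the scans from this chapter could be wired into a CI pipeline. Here is a sketch of a GitHub Actions job; the file paths and step layout are assumptions, not an actual pipeline from this book’s repository:

```yaml
# .github/workflows/security-scans.yml (illustrative)
name: security-scans
on: [push, pull_request]
jobs:
  scans:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install scanners
        run: pip install bandit modelscan
      - name: SAST with Bandit
        run: bandit -r .
      - name: Dependency and secret scan with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          scan-ref: .
      - name: Scan serialized models with ModelScan
        run: modelscan -p models
```

Failing the build on high or critical findings keeps vulnerable code, dependencies, and models from reaching production automatically rather than relying on manual checks.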
The following section will evaluate how adequate our traditional security controls are against adversarial AI.