Data architecture for AI
Data is an essential ingredient for any AI scenario. One of the key elements for the success of an AI project is the data architecture – how to bring data together, store and process it, and integrate the resulting insights and actions back into the applications. The following are some of the typical challenges of a data architecture for AI:
- Data is spread across different data sources, in different formats and systems, and in structured and unstructured data types.
- Data integration and consolidation require a common data model.
- Replicating data raises data privacy and data protection concerns, as well as other compliance requirements.
- The data platform provides data for the AI execution engine and also needs to address data ingestion, data storage, and data lifecycle management.
- AI requires metadata such as labeling for supervised learning.
- AI lifecycle events can be tightly coupled with the data lifecycle, such as an inference based on a live data change.
- The insights and AI-based recommendations need to be served by the data platform and integrated back into the applications.
- There are performance requirements for the initial load and event-based data integration for incremental changes.
In this section, we will describe the data architecture for AI. Most of the time, we have SAP AI Core as the AI execution engine, but SAP also supports other AI execution engines from partners, such as the relevant services from AWS, Azure, and GCP. HANA ML, as described earlier, provides in-database AI and doesn't necessarily require an extended data architecture, so we will not discuss it further here.
The availability of data is critical for the success of AI projects, for both the design time and run time in the production phase.
During design time, for exploration and experimentation, data acquisition and access to training data is the first important step. Data integration tools such as SAP Data Intelligence can be leveraged to extract data from source systems, in batch or through events, as well as to manage the lifecycle of data pipelines. In many use cases, access to and management of third-party datasets will be required. For that, the data architecture should be able to leverage scalable cloud storage, such as object stores like AWS S3. AI runtimes such as AI Core can work directly with object stores. In addition, content creation tasks such as labeling can be labor-intensive and should be automated and coordinated as much as possible.
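As a minimal sketch of this object store pattern, the following Python snippet uploads a training dataset to an S3 bucket that an AI runtime such as SAP AI Core could later reference as a data artifact. The bucket name, object key, and local file are illustrative assumptions; in practice, AI Core accesses the bucket through a registered object store secret.

```python
import boto3

# Hypothetical names for illustration only: replace with your own bucket,
# prefix, and credential setup (e.g., an AWS profile or IAM role).
BUCKET = "my-training-data"          # assumed bucket name
KEY = "churn-model/data/train.csv"   # assumed object key

# boto3 picks up credentials from the environment, a profile, or an IAM role
s3 = boto3.client("s3")

# Upload the local training dataset to the object store; an AI runtime such
# as SAP AI Core can then consume it via an artifact registered against this
# bucket/prefix.
s3.upload_file("train.csv", BUCKET, KEY)
print(f"Uploaded to s3://{BUCKET}/{KEY}")
```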
In production, the data architecture needs to support the serving, monitoring, and retraining of the models. Data needs to be continuously available from various data sources, including SAP applications, for the re-training and inference processes. A frequent challenge is ensuring data governance and compliance across heterogeneous data sources; a common approach for data governance and policy enforcement is required. It is important to trace the data lineage to understand where data comes from and how it is processed, especially when data is transported and transformed over several stages, which is often required in modern AI use cases.
At the same time, it is important to meet the non-functional requirements specific to the business needs. Such requirements include, but are not limited to, multi-tenancy support to isolate the needs of different customers and the ability to monitor data quality and ensure consistency with the training data. AI in production also requires operational readiness. For that, SAP AI Launchpad can serve as the central operational tool for lifecycle management in production, supporting flexible AI runtime options including SAP AI Core.
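To make the data quality and consistency requirement more concrete, here is a small sketch of one possible check, using pandas and SciPy, that compares the distribution of a production feature against the training data. The file names, feature names, and significance threshold are assumptions for illustration, not an SAP-prescribed mechanism.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical datasets: the data used at training time versus a fresh
# sample taken from production.
train = pd.read_csv("train.csv")
production = pd.read_csv("production_sample.csv")

def check_drift(feature: str, alpha: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test: flags the feature if the
    production distribution differs significantly from training."""
    stat, p_value = ks_2samp(train[feature].dropna(),
                             production[feature].dropna())
    drifted = p_value < alpha
    print(f"{feature}: KS={stat:.3f}, p={p_value:.4f}, drifted={drifted}")
    return drifted

# Check a few assumed numeric features; flag re-training if any drift.
if any(check_drift(f) for f in ["amount", "tenure_months"]):
    print("Drift detected - consider re-training the model.")
```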
Based on the design time needs and runtime requirements described previously, we can draw a data architecture diagram covering three major components: a data platform, an AI platform, and the business applications. We will describe the roles and responsibilities of each of them in detail.
The data platform will be responsible for the following:
- Ingest and store data from SAP applications and third-party data sources
- Keep data up to date through streaming or delta loads from various data sources
- Expose data and make it consumable for AI runtimes
- Store and expose inference results and make them consumable for the applications and downstream consumers (see the sketch after this list)
- Store and manage the metadata such as domain models, labeling, connectivity, data cataloging, and data lineage
- Handle data privacy, data protection, and compliance requirements, such as data encryption at rest, data segregation, and audit logging
- Store the AI model and intermediate artifacts such as derived datasets and models
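As a sketch of how the data platform could store and expose inference results, the following Python snippet writes predictions into an SAP HANA Cloud table using the hdbcli driver. The connection details, table, and columns are illustrative assumptions.

```python
from hdbcli import dbapi

# Hypothetical connection details - replace with your HANA Cloud instance.
conn = dbapi.connect(
    address="<hana-host>.hanacloud.ondemand.com",
    port=443,
    user="ML_WRITER",
    password="<secret>",
    encrypt=True,
)
cursor = conn.cursor()

# Assumed result table; applications and downstream consumers can read
# predictions from here (for example, through a view or an exposed API).
cursor.execute(
    """CREATE TABLE IF NOT EXISTS INFERENCE_RESULTS (
           ENTITY_ID  NVARCHAR(64),
           SCORE      DOUBLE,
           MODEL_VER  NVARCHAR(32),
           CREATED_AT TIMESTAMP DEFAULT CURRENT_TIMESTAMP
       )"""
)

# Store a batch of predictions produced by the AI runtime.
predictions = [("CUST-001", 0.87, "v3"), ("CUST-002", 0.12, "v3")]
cursor.executemany(
    "INSERT INTO INFERENCE_RESULTS (ENTITY_ID, SCORE, MODEL_VER) "
    "VALUES (?, ?, ?)",
    predictions,
)
conn.commit()
conn.close()
```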
The AI platform will focus on the following responsibilities:
- Manage the AI scenarios and their lifecycle, such as training, deployments, and inference
- Manage the artifacts such as datasets and models as well as the intermediate artifacts
- Expose the AI lifecycle via AI APIs across different runtimes if necessary (see the sketch after this list)
- Provide capabilities for data anonymization, labeling, a data annotation workflow, and data quality management
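As an illustration of the AI API, the following sketch lists the deployments of a resource group with plain REST calls. The host and token are placeholders that in practice come from a service key and an OAuth client-credentials flow; SAP also provides SDKs that wrap these calls, so treat this as a sketch of the underlying interface.

```python
import requests

# Placeholders for illustration - taken from the AI Core service key in practice.
AI_API_URL = "https://<ai-api-host>"   # hypothetical host
TOKEN = "<oauth-access-token>"         # obtained via client-credentials flow

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "AI-Resource-Group": "default",  # resource groups isolate tenants/teams
}

# List the deployments managed by the AI runtime through the AI API.
resp = requests.get(f"{AI_API_URL}/v2/lm/deployments", headers=headers)
resp.raise_for_status()
for deployment in resp.json().get("resources", []):
    print(deployment["id"], deployment["status"])
```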
While most of the heavy lifting is performed by the data platform and AI platform, the applications are still responsible for the following aspects:
- Provide and integrate data into the data platform by exposing APIs or data events
- Consume the inference results and integrate them back into the application for the end users (see the sketch after this list)
- Control the AI lifecycle using specific clients such as S/4HANA ISLM (optional)
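Continuing the earlier data platform sketch, an application could consume the stored inference results as shown below; the table and connection details remain the same illustrative assumptions.

```python
from hdbcli import dbapi

# Same hypothetical HANA Cloud table as in the earlier sketch.
conn = dbapi.connect(address="<hana-host>.hanacloud.ondemand.com", port=443,
                     user="APP_READER", password="<secret>", encrypt=True)
cursor = conn.cursor()

# Fetch the highest-risk entities so the application can surface them to
# end users, for example as a ranked work list.
cursor.execute(
    "SELECT ENTITY_ID, SCORE FROM INFERENCE_RESULTS "
    "WHERE MODEL_VER = ? ORDER BY SCORE DESC LIMIT 10",
    ("v3",),
)
for entity_id, score in cursor.fetchall():
    print(f"{entity_id}: churn risk {score:.0%}")
conn.close()
```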
This kind of data architecture is depicted in Figure 14.8, leveraging some of the offerings from SAP BTP described in this chapter and previous chapters:
Figure 14.8: Data architecture for AI
In this data architecture, as part of the described data platform, SAP HANA Cloud, data lake provides the cold storage and processing layer for relational data, with object store support through SAP HANA Cloud, data lake files. SAP HANA Cloud, data lake is a disk-based relational database optimized for OLAP workloads and provides native integration with cloud object stores through SAP HANA Cloud, data lake files, which can be leveraged as cheaper data storage with SQL on Files capabilities. SAP HANA Cloud, as the in-memory HTAP database, provides the hot and warm storage and processing of relational and multi-model data such as spatial, document, and graph data, and includes multi-tier storage with in-memory and disk-based extended storage. SAP Data Warehouse Cloud focuses on the data modeling and governance of the data architecture and provides the semantic layer and data access governance. Besides this, SAP Data Intelligence provides the data integration and ingestion capabilities for extracting data from source systems and managing the lifecycle of data pipelines. Next, the AI platform, comprised of SAP AI Core and SAP AI Launchpad, is responsible for managing the AI scenarios across training, inference, monitoring, and the operational readiness of AI projects. Finally, the applications serve as both the data sources and the consumers in an AI scenario.
While this data architecture illustrates how different kinds of technologies can be put together to serve AI needs, it is not a necessary architecture for all kinds of use cases. Depending on the complexity and integration needs of your use cases, the data architecture can be dramatically simplified, and not all the mentioned technology components will be required. For example, SAP AI Core can directly consume data from a cloud object store bucket and manage the AI lifecycle together with SAP AI Launchpad. In the simplest scenario, you may even just call an API from SAP AI Business Services without needing to worry about the operational complexity of AI models.
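For that simplest scenario, the sketch below obtains an OAuth token via the client-credentials flow and posts a single request to a service endpoint. The URLs, credentials, and payload shape are placeholders; the concrete values come from the service key of the subscribed service, and the request body differs per service.

```python
import requests

# Placeholder values for illustration; in practice they come from the
# service key of the subscribed SAP AI Business Services instance.
AUTH_URL = "https://<subaccount>.authentication.<region>.hana.ondemand.com/oauth/token"
SERVICE_URL = "https://<service-host>/v1/<service-endpoint>"  # hypothetical endpoint
CLIENT_ID = "<client-id>"
CLIENT_SECRET = "<client-secret>"

# Client-credentials flow against the XSUAA instance bound to the service.
token = requests.post(
    AUTH_URL,
    data={"grant_type": "client_credentials"},
    auth=(CLIENT_ID, CLIENT_SECRET),
).json()["access_token"]

# A single REST call - no model training or operations to take care of.
response = requests.post(
    SERVICE_URL,
    headers={"Authorization": f"Bearer {token}"},
    json={"text": "Example business document content"},  # payload varies per service
)
print(response.status_code, response.json())
```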
Data Privacy and Protection
Data Privacy and Protection (DPP) in the handling of customer data in AI scenarios is a critical enterprise quality that must be prioritized in the data architecture. For an ML/AI use case, data that comes from different sources may contain Personally Identifiable Information (PII) and therefore must be handled properly.
When PII is involved, no human interaction with the data is allowed, whether the data is in a structured or unstructured format. This applies to data extraction and integration, the inspection of data, model training, and the deletion of data. Concretely, the data architecture needs to consider the following concepts and mechanisms to support DPP-compliant data processing:
- Authorization management: Only authorized data can be processed, and the user’s data authorization in the applications needs to be translated into the data authorization of the replicated data. Authorization management is based on DPP purposes and can be achieved by providing the DPP context of the user from the leading source system to the data platform. For example, a manager should only see the salary information of members of their team.
- Data anonymization: An anonymization function translates a raw dataset into an anonymized dataset while preserving the information necessary for the AI use cases. HANA supports in-database anonymization; alternatively, anonymization can be applied in transit as part of the data pipeline, enabled by the K-Anonymity operator of SAP Data Intelligence (see the sketch after this list).
- Data access and change logging: If sensitive DPP data is stored, the data platform must provide read-access logging. Besides this, any change to personal data should be logged.
- Data deletion: Data is usually replicated to the data platform. If the data is erased in the source system, whether because the retention period has expired or when requested by the user in the context of the GDPR (General Data Protection Regulation) in the EU, the deletion also needs to be propagated to the data replicated in the data platform.
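To make the in-database anonymization option more concrete, the following sketch creates a K-anonymity view in HANA over an assumed salary table and refreshes it before use. The connection details, table, k value, and the embedded generalization hierarchy are deliberately simplified assumptions; real hierarchies are more elaborate and enumerate all distinct values.

```python
from hdbcli import dbapi

# Hypothetical connection and table names for illustration.
conn = dbapi.connect(address="<hana-host>", port=443,
                     user="DPP_ADMIN", password="<secret>", encrypt=True)
cursor = conn.cursor()

# K-anonymity view: every row becomes indistinguishable from at least k-1
# others with respect to the quasi-identifier column(s). The embedded
# hierarchy generalizes AGE into coarser bands and is deliberately simplified.
cursor.execute("""
CREATE VIEW SALARIES_ANON AS
SELECT ID, AGE, SALARY FROM SALARIES
WITH ANONYMIZATION (ALGORITHM 'K-ANONYMITY'
  PARAMETERS '{"k": 8}'
  COLUMN ID PARAMETERS '{"is_sequence": true}'
  COLUMN AGE PARAMETERS '{"is_quasi_identifier": true,
    "hierarchy": {"embedded": [["25", "20-29"], ["33", "30-39"], ["47", "40-49"]]}}')
""")

# Anonymized views must be refreshed before they can be queried.
cursor.execute("REFRESH VIEW SALARIES_ANON ANONYMIZATION")

# Consumers (e.g., a training pipeline) read the anonymized view, never the raw table.
cursor.execute("SELECT * FROM SALARIES_ANON LIMIT 5")
print(cursor.fetchall())
conn.close()
```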