How-To Tutorials


03 Jun 2026

10 min read

How to Build Supervised Anomaly Detection Models for Time Series Data

DataPro is a weekly, expert-curated newsletter trusted by 120k+ global data professionals. Built by data practitioners, it blends first-hand industry experience with practical insights and peer-driven learning. Make sure to subscribe here so you never miss a key update in the data world.

Introduction

Supervised anomaly detection helps turn labelled time series data into practical models that can identify abnormal patterns, such as stockouts, sensor failures, or fraud events. Unlike standard classification tasks, anomaly detection often involves severe class imbalance and imperfect labels, making accuracy a poor measure of success. In this article, you'll learn how to build a more reliable supervised anomaly detection workflow using residual features, CatBoost, time-aware validation, class weighting, and threshold tuning to improve precision and recall.

Supervised anomaly detection

Supervised anomaly detection is simply classification, as any good academic will tell you. You have observations and labels, on which you train a model that distinguishes anomalous from non-anomalous observations. Chapters 14 and 15 covered some aspects of classification methodology, so we won't repeat that here. What is worth discussing is what makes the anomaly detection variant of classification distinct from the general case, and what a practical pipeline looks like.

Two things distinguish supervised anomaly detection from general classification problems: severe class imbalance and label quality. Both are structural rather than incidental, and both directly affect how you build and evaluate supervised detectors.

The label quality problem

Labels in anomaly detection are rarely clean. They come from one of three sources, each with its own contamination issue.

Operational incident logs are accurate for large, impactful events (a week-long stockout, a confirmed sensor failure) but can miss subtle anomalies that don't rise to the threshold of a ticket. The positive class is high-precision but recall can be low.

Rules-based labelling pipelines are the most common source, and the most dangerous. Rules are written at a point in time, reflecting a data distribution at that moment. As the underlying process drifts, rules generate stale labels. Flagged observations are no longer anomalous (false positive labels) and new anomaly patterns are missed (false negatives).

Prior unsupervised detector output is used when no other labels exist. If your profiling model drifted before retraining, or was itself trained on contaminated data, the labels it generated carry that error forward into your supervised model. This can result in compounded contamination, which makes you wonder if foundational anomaly detection is ever possible. Regardless, a bad unsupervised model produces bad labels; a supervised model trained on bad labels produces confidently wrong predictions.

The practical response is to audit before training. But I have three million time series, you say. We should then inspect a random sample of our positive-class (anomaly) labels and verify them against raw data. If more than ~20% of your positive labels are wrong, a supervised model will likely underperform a well-tuned unsupervised approach. You're better off fixing labels first.

Class imbalance

With anomalies comprising well under 5% of observations, often under 1%, a classifier that predicts "normal" everywhere achieves very high accuracy while being completely useless. This is why accuracy is a meaningless metric for anomaly classification, and why you should never use it as your optimization target. Some practical remedies are:

• class_weight='balanced' (or auto_class_weights='Balanced' in CatBoost). This reweights the loss function by inverse class frequency, making a missed anomaly proportionally more costly than a missed normal observation. It is almost always worth doing and costs nothing.
• Threshold adjustment: most classifiers produce a probability score, and a default threshold of 0.5 implicitly assumes balanced classes. Lowering it (say, to 0.2) increases recall at the cost of precision. Use a precision-recall curve on a validation set to find the threshold that reflects your actual cost tradeoff.
• Oversampling the minority class: DO NOT DO THIS. Once widely used (and researched!) but now considered poor practice. SMOTE creates synthetic points by interpolating in feature space, producing examples that are statistically plausible but temporally incoherent. Better alternatives are to augment using domain knowledge (jitter known anomaly windows, vary their duration) or simply collect more real labels.

Gradient boosting on residual features

The most effective supervised approach in our experience, and the one that connects most naturally to the profiling work earlier in this chapter, is to build a feature matrix from profile model residuals and train a gradient boosting classifier on it. A profile model removes known structure. What remains in the residuals is the unexplained variation. A supervised classifier trained on residual features learns to distinguish residual patterns associated with confirmed anomalies from patterns that are just normal noise, things like seasonal peaks that happened to breach a threshold, measurement jitter, or genuine promotions that the model underestimated.

Below we have made a few specific design choices.

• We use TimeSeriesSplit rather than random cross-validation. This is because our rolling features (resid_roll_std, sales_roll_mean) are computed from past observations and shifted by one step to avoid look-ahead.
Random shuffling would be fine if we didn't include these kinds of features; as it stands they would leak information about the temporal neighborhood of test points into training.
• We use CatBoost with auto_class_weights='Balanced', which automatically reweights the loss function inversely proportional to class frequency. With anomalies representing roughly 4% of observations, an unweighted classifier would achieve 96% accuracy by predicting normal for every row. Balanced weighting forces the model to treat a missed anomaly as roughly 24x more costly than a false alarm during training, which prevents it from ignoring the minority class (outliers). The threshold adjustment step after training lets us fine-tune the actual precision-recall tradeoff.

CatBoost was selected for our classifier, but you could easily use another algorithm, e.g. XGBoost, LightGBM, or a regularized logistic regression. What is important here is having good labels, a strong temporal structure and optionally a well-explained model leaving us with good residual signal.

    from catboost import CatBoostClassifier
    from sklearn.model_selection import TimeSeriesSplit
    from sklearn.metrics import classification_report, precision_recall_curve
    import pandas as pd
    import numpy as np

    def build_supervised_features(df, window=14):
        """
        Feature matrix for supervised residual profiling.
        All rolling features are shifted by 1 to avoid lookahead bias.
        """
        feats = pd.DataFrame(index=df.index)

        # Core residual signal
        feats['residual'] = df['residual']
        feats['residual_z'] = df['residual_z']
        feats['residual_abs'] = df['residual'].abs()

        # Local residual statistics
        for w in [7, 14, 28]:
            feats[f'resid_roll_mean_{w}'] = (
                df['residual'].rolling(w, min_periods=3).mean().shift(1)
            )
            feats[f'resid_roll_std_{w}'] = (
                df['residual'].rolling(w, min_periods=3).std().shift(1)
            )
            feats[f'resid_roll_max_{w}'] = (
                df['residual'].abs().rolling(w, min_periods=3).max().shift(1)
            )

        # Scores from unsupervised detectors as features
        for col in ['if_unsup_score', 'if_score', 'eif_score', 'lof_score']:
            if col in df.columns:
                feats[col] = df[col]

        # Context (window defaults to 14, matching the original hard-coded value)
        feats['price'] = df['price']
        feats['promotion'] = df['promotion']
        feats['sales_roll_mean'] = (
            df['sales'].rolling(window, min_periods=3).mean().shift(1)
        )
        feats['day_of_week'] = df.index.dayofweek
        feats['month'] = df.index.month

        return feats.fillna(0)

    feature_df = build_supervised_features(stockouts_data)
    labels = stockouts_data['is_anomaly'].fillna(0).astype(int)

    # Time-series split
    tscv = TimeSeriesSplit(n_splits=4)
    all_preds = np.zeros(len(labels))
    all_probs = np.zeros(len(labels))
    test_mask = np.zeros(len(labels), dtype=bool)

    for train_idx, test_idx in tscv.split(feature_df):
        X_tr = feature_df.iloc[train_idx].values
        X_te = feature_df.iloc[test_idx].values
        y_tr = labels.iloc[train_idx].values
        y_te = labels.iloc[test_idx].values

        clf = CatBoostClassifier(
            iterations=300,
            learning_rate=0.05,
            depth=4,
            auto_class_weights='Balanced',
            eval_metric='F1',
            random_seed=42,
            verbose=0
        )
        # NOTE: the test fold doubles as eval_set here. By default CatBoost keeps
        # the best iteration on eval_set, so scores from this loop are likely
        # somewhat optimistic; a separate validation slice would avoid that.
        clf.fit(X_tr, y_tr, eval_set=(X_te, y_te))

        all_preds[test_idx] = clf.predict(X_te)
        all_probs[test_idx] = clf.predict_proba(X_te)[:, 1]
        test_mask[test_idx] = True

    # Score only the rows that appeared in some test fold
    y_test_all = labels.values[test_mask]
    y_pred_all = all_preds[test_mask].astype(int)
    y_prob_all = all_probs[test_mask]

    print(classification_report(y_test_all, y_pred_all,
                                target_names=['Normal', 'Anomaly']))

    # Output:
    #               precision    recall  f1-score   support
    #       Normal       1.00      0.99      1.00       562
    #      Anomaly       0.81      1.00      0.90        22
    #     accuracy                           0.99       584
    #    macro avg       0.91      1.00      0.95       584
    # weighted avg       0.99      0.99      0.99       584

The four blocks in build_supervised_features each give our classifier a different view on the same observation. The core residual signal (residual, residual_z, residual_abs) is the raw 'what the profile couldn't explain'; this is our primary anomaly evidence. The local residual statistics (rolling mean, std, and max at 7, 14, and 28 days, all shifted by one) provide context, because a residual of +20 means very different things depending on whether the surrounding fortnight has been quiet or noisy. The unsupervised detector scores (if_unsup_score, if_score, eif_score, lof_score) are second opinions from earlier in the chapter; our classifier learns when to trust them and when to override them. Finally the context features (price, promotion, sales_roll_mean, day_of_week, month) let our model learn that a large negative residual in late December under a deep promotion is structurally different from the same residual on a quiet Tuesday in February. Every rolling feature is shifted by one step so that no feature at time t contains y_t itself.

Figure 17.19: CatBoost precision-recall curve and feature importance

The feature importance panel tells us which signals the model is actually using. If if_unsup_score or if_score dominate, the supervised model is largely deferring to unsupervised detectors and the supervised wrapper adds little beyond threshold calibration. Some influence from resid_roll_std suggests the classifier has learned temporal patterns from residuals that no single unsupervised method captures on its own. We could instead build more features and labels from the raw data, which would avoid the need for ensembled output.

Threshold adjustment

CatBoost's default threshold of 0.5 assumes equal misclassification costs. In anomaly detection this is rarely appropriate, as the cost of missing real stockouts often exceeds false alarms.
The precision-recall curve in the figure above shows the full tradeoff surface; here we select the threshold that maximizes F1 as a reasonable default.

    from sklearn.metrics import f1_score

    # Scan candidate thresholds and keep the one with the best F1
    thresholds = np.arange(0.1, 0.9, 0.01)
    f1_scores = [
        f1_score(y_test_all, (y_prob_all >= t).astype(int), zero_division=0)
        for t in thresholds
    ]
    best_thresh = thresholds[np.argmax(f1_scores)]
    best_f1 = max(f1_scores)

    # Confusion counts at the tuned threshold
    y_pred_tuned = (y_prob_all >= best_thresh).astype(int)
    tp_sup = ((y_pred_tuned == 1) & (y_test_all == 1)).sum()
    fp_sup = ((y_pred_tuned == 1) & (y_test_all == 0)).sum()
    fn_sup = ((y_pred_tuned == 0) & (y_test_all == 1)).sum()
    prec_sup = tp_sup / (tp_sup + fp_sup) if (tp_sup + fp_sup) else 0.0
    rec_sup = tp_sup / (tp_sup + fn_sup) if (tp_sup + fn_sup) else 0.0

    print(f"Best threshold: {best_thresh:.2f}")
    print(f"Supervised (tuned): Precision: {prec_sup:.2f} Recall: {rec_sup:.2f} F1: {best_f1:.2f}")

    # Output:
    # Best threshold: 0.55
    # Supervised (tuned): Precision: 0.91 Recall: 0.95 F1: 0.93

If false negatives are expensive (missed equipment failures, missed fraud) you want a lower threshold to maximize recall. If false positives are expensive (wasted analyst time, unnecessary operational interventions) you want a higher threshold to maximize precision. There is no universally correct answer; the threshold is a business decision, not a model decision.

17.19: CatBoost supervised detections with tuned threshold

    Method                                   Type           Precision   Recall   F1
    Z-score Profile                          Unsupervised   0.74        1.00     0.85
    IF: Raw Features                         Unsupervised   0.51        0.97     0.67
    IF: Residuals                            Unsupervised   0.25        0.48     0.33
    INNE: Residuals                          Unsupervised   0.05        0.10     0.07
    LOF: Residuals                           Unsupervised   0.83        0.97     0.90
    CatBoost (supervised, tuned threshold)   Supervised     0.91        0.95     0.93

Table 17.4: Comparing all approaches

CatBoost wins clearly here, but this result is an artifact of the task. Our anomalies were generated programmatically and we have certainty in them.
We have a strong temporal structure and relationships in our data, even where they don't repeat in calendar position, and noise is low. With clean labels, repeated anomaly patterns, and a profile model that already explains most of the variance, a supervised classifier on residual features has very little left to do. It is an easy task that shows us how critical structure and labelling are.

We should think of this as a calibration layer that sits on top of unsupervised detection. In production, labels come from unsupervised detectors and human-in-the-loop review (the contamination problem returns); novel anomalies by definition should not repeat in your training data. We will return to this self-supervised detection in Chapter 18.

Some practical considerations:

• Hyperparameter sensitivity: contamination has the largest effect in isolation-based methods; it sets the decision threshold on anomaly scores, not the model structure. Halving max_samples and max_features to 0.5 while raising contamination to 0.08 lifted the unsupervised IF's recall from 0.61 to 0.97. For supervised models, threshold adjustment on the probability output typically matters more than tree hyperparameters themselves (Soenen, Wolputte, and Perini 2021).
• The contamination-imputation loop: Start with a generous contamination rate (high recall, accept false positives), review flagged observations, then impute confirmed anomalies with interpolated values. Some values were already wrong; replacing them should improve signal rather than fabricate it. Normal data is abundant enough to anchor any reasonable interpolation. Retrain a profile on the imputed series and re-score. Each iteration sharpens the residuals and tightens detection boundaries. Human confirmation at each step is essential; uncertain flags stay in the data untouched. Depending on your case, you could pair this with forecast accuracy, being sure to leave validation sets clean; otherwise, you're marking your own homework.

Conclusion

Supervised anomaly detection works best when labels are trustworthy, temporal structure is preserved, and model outputs are tuned around real business costs. By using residual-based features, balanced class weighting, TimeSeriesSplit, and precision-recall-based threshold adjustment, you can build detectors that perform better than simple unsupervised baselines in well-labelled scenarios. However, the real challenge in production is not just model selection; it is maintaining label quality, managing class imbalance, and deciding the right trade-off between missed anomalies and false alarms.

This article is an excerpt from Time Series with PyTorch: Modern Deep Learning Toolkit for Real-World Forecasting Challenges, published by Packt.

Author Bio

Graeme Davidson is a Lead Data Scientist at Retail Express, where he redesigned the company's demand forecasting framework in line with contemporary statistical learning practices. His background spans cognitive neuroscience, researching implicit reward processing and human decision-making, through advertising analytics to research-focused demand forecasting. He is an active contributor to several data science Slack and Discord communities, an occasional competitor in forecasting competitions, and was approached by Packt in late 2022 to write the book he wished had existed when he first fell down an ARIMA rabbit hole chasing answers about how supermarkets actually forecast demand, and how a quantitative researcher models financial markets.

Lei Ma is a physicist-turned data scientist specializing in time series forecasting. He is a theorist but has tackled real-world forecasting challenges across a variety of industries like housing, logistics, ecommerce, and manufacturing. Lei has led and delivered numerous forecasting projects where he combines deep expertise in building advanced time series models with a strategic approach to delivering holistic business insights. Lei creates time series forecasting tutorials online and joined the venture when Graeme approached him to collaborate on this book.


Subramanya Vajiraya, Noritaka Sekiyama, Tomohiro Tanaka, Ishan Gaur, Albert Quiroga, Akira Ajisaka

28 May 2026

10 min read

How to Prepare Data Using AWS Glue

Our Data Engineering Byte Newsletter gives data engineers and practitioners what they often lack today: clear, real-world insights, where every byte tells a story. Subscribe here to stay ahead in data engineering.

Introduction

Preparing data for analytics can become challenging as organizations deal with growing data volumes, varied data sources, and increasingly complex transformation requirements. AWS Glue helps simplify this process by offering both visual and code-based approaches to data preparation. In this article, we explore how AWS Glue Studio enables users to build ETL workflows visually, apply transformations, configure data quality checks, and prepare datasets for downstream analytics without needing to manage infrastructure.

Data preparation using AWS Glue

It is normal for data to grow continuously over time in terms of volume and complexity, considering the huge number of applications and devices generating data in a typical organization. With this ever-growing data, a tremendous amount of resources is required to ingest and prepare this data, both in terms of manpower and compute resources.

AWS Glue makes it easy for individuals with varying levels of skill to collaborate on data preparation tasks. For instance, novice users with no programming skills can take advantage of AWS Glue Studio (https://docs.aws.amazon.com/glue/latest/dg/author-job-glue.html), a visual interface that allows novice data professionals to interact with and prepare the data using a variety of pre-built transformations and filtering mechanisms even without writing any code. AWS Glue Studio also allows advanced users to author custom transformations to achieve desired outcomes.

While AWS Glue Studio is a great tool for preparing data using a graphical user interface (GUI), there are some use cases where the built-in transformations may not be flexible enough or the user may prefer a programmatic approach to prepare data over using the GUI-based approach. In such cases, AWS Glue enables users to prepare data using AWS Glue ETL. Users can leverage AWS Glue Studio to author, execute, and monitor ETL workloads. Although Glue Studio offers a GUI, users may still require programmatic knowledge of AWS Glue's transformation extensions and APIs to implement data preparation workloads, especially when implementing custom transformations using SQL or source code.

Now that we know about the different data preparation options that are available in AWS Glue, let's dive deep into each of them while looking at practical examples to understand them.

Visual data preparation using AWS Glue Studio

AWS Glue makes it possible to prepare data using a visual interface through AWS Glue Studio. Previously, the preferred approach was to make use of AWS Glue DataBrew for visual data preparation as highlighted in the previous edition of this book. However, the features available in AWS Glue DataBrew have been implemented in AWS Glue Studio and as such, all users regardless of their skill level can make use of AWS Glue Studio to build their ETL workflows through a unified interface. AWS Glue Studio allows us to author recipes similar to AWS Glue DataBrew and even allows us to import any recipes that were built using AWS Glue DataBrew.

Getting started with AWS Glue Studio is quite simple. To author a new job using visual ETL, you can use the Visual ETL option in the AWS Glue Studio UI to open the visual job editor. In this interface, you can start designing the ETL workflow based on your requirements by adding required data sources, transformations, and target nodes with a simple drag-and-drop.

As mentioned in the previous chapters, we can also ingest data from a wide range of external Software-as-a-Service (SaaS) providers via Amazon AppFlow. Several external SaaS providers are supported through native connectors, including Adobe Analytics, Asana, Datadog, Google Analytics, Dynatrace, Marketo, Salesforce, ServiceNow, Slack, and Zendesk, to name a few.
This data can be further integrated with datasets from other data stores or SaaS applications. This helps the users take a holistic approach to analyzing and gathering insights from their datasets, which have been spread across different data stores or SaaS platforms. The following screenshot shows the visual editor interface available in AWS Glue Studio, which allows users to drag and drop different components of an ETL job:

Figure 4.1: AWS Glue Studio Visual Job Editor Interface

In the visual editor interface (Figure 4.1), we can start by dragging and dropping the data source(s) we wish to read from, then add transformations and configure each node as per our requirements. Adding transformations is optional if we are moving the data from one location to another without applying transformations. Once we are happy with the transformations, we can add data target(s) to our job to write the output. We have to provide a name for the job and select an IAM role for execution under the job details tab, and the job is then ready to be saved. If our job requires an AWS Glue connection, for instance a relational database, on-premises data source, or SaaS data source, we can specify the connections to be included under the job details tab.

Let's build an ETL job using the visual ETL editor which reads from an Amazon Redshift data warehouse, applies simple transformations, and writes the data to an Amazon S3 location. Before we begin authoring the job, let's set up an AWS Glue Connection to connect to our data warehouse cluster. To create an AWS Glue Connection, use the Connections option under Data Catalog in the sidebar and click Create connection. In the data source selection page, search for Amazon Redshift and click Next. If your Amazon Redshift cluster is in a different AWS account, use a JDBC data source instead. Configure the connection by specifying the cluster details and IAM role. Review the connection details and save the connection. The following screenshot (Figure 4.2) shows a sample AWS Glue connection configuration used to connect to an Amazon Redshift Cluster:

Figure 4.2: AWS Glue Connection Configuration page

Once the connection is saved, let's head back to the visual editor and add an Amazon Redshift Source node from the Node picker. Click on the added node and select the connection that you created in the previous step. You can choose to select a single table or enter a custom query. In the Data preview section, you can select an IAM role and start a session to preview your data. The following screenshot (Figure 4.3) shows how data can be previewed at each node in the graph to get an idea of the resultant dataset:

Figure 4.3: Data Preview of a node in visual editor

Use the node picker to add the necessary transformations. To transform data using the AWS Glue DataBrew style editor, you can click the plus button, add the Data Preparation Recipe transform from the node picker, and click on the added node to display the node properties. Click Author Recipe to open the familiar grid interface and begin creating the data preparation recipe.

Figure 4.4: Author Recipe button available in Data Preparation Recipe transform

In our example, we are adding a filter on quantity of items sold to fetch sales records with a quantity of 3 or higher. The following screenshot (Figure 4.5) shows the filter transformation in the data preparation recipe authoring window:

Figure 4.5: Applying transformations in Data Preparation Recipe interface

Once all the necessary transformations are applied, you can use the Done authoring recipe button to exit the grid interface and return to the AWS Glue Studio visual editor. After all necessary transformations have been applied, we can add target node(s) to the graph to ensure the output is saved to the target data store.
In our example, we will configure the target node to write output in Parquet format by selecting Parquet in the Format field of the target node properties, save it to an Amazon S3 location, and create a table in the AWS Glue Data Catalog. The following screenshot (Figure 4.6) depicts the same:

Figure 4.6: Amazon S3 target data store configuration

Now that we have defined the data source, transformations, and a data target, we can save the ETL job and it is ready to be executed.

Note: The Data Preparation Recipe transformation can be used only with certain versions of AWS Glue. As of writing this book, we can use AWS Glue version 4.0 or higher to execute ETL jobs with such transformations.

While the Data Preparation Recipe transformation offers a familiar grid interface, if you are not particularly interested in that experience, you can choose to add transformations directly from the node picker. For example, to achieve the same outcome as the job we created, we can directly add the Filter transformation after we add the Amazon Redshift data source in the graph instead.

The following screenshot (Figure 4.7) shows how we can configure the Filter transformation to achieve the same outcome as the data preparation recipe:

Figure 4.7: Visual ETL without using Data preparation recipe transformation

While the ETL job we designed will work as expected, it is always a good idea to protect our workload from noisy data. For example, if the sales quantity column doesn't exist in our source data, or if the column is not an integer as we expected, or it contains null values, we should be able to detect such anomalies and take necessary actions. AWS Glue enables users to use Evaluate Data Quality to create data quality rules to evaluate the output. During this process, it emits events to Amazon EventBridge, which can be captured and acted upon. The configuration is quite flexible. You can enrich the existing dataset with data quality information or choose to write the quality information to a separate destination. You can also emit Amazon CloudWatch metrics for data quality checks to track how a job is trending with regard to data quality. Additionally, you can decide what to do with the job if the evaluation fails. For example, you can choose to fail the job before or after loading the data into the target data store, or continue with the job without failing it.

To implement the data quality rule described in the example above, you can add the Evaluate Data Quality transform from the node picker (or use the Edit Data Quality Configuration button in the target node).

Figure 4.8: Evaluate Data Quality transform available in the node picker

Data Quality rules can be defined using Data Quality Definition Language (DQDL) syntax. Detailed documentation on DQDL can be found at https://docs.aws.amazon.com/glue/latest/dg/dqdl.html. The rule we described earlier will look like this:

    Rules = [
        # check if col exists
        ColumnExists "qtysold",
        # check for null values
        IsComplete "qtysold",
        # Make sure the values are integers
        ColumnDataType "qtysold" = "Integer"
    ]

The following screenshot (Figure 4.9) shows how we can configure the Evaluate Data Quality transform to implement the above data quality rules:

Figure 4.9: Evaluate Data Quality Transformation

Now that we have seen how we can use the visual editor to build an ETL job from scratch, in the next section we will explore how we can use AWS Glue to build ETL jobs using source code, and we will also see some of the built-in transformations available in AWS Glue with examples.

Source code-based approach to data preparation using AWS Glue

While AWS Glue Studio primarily offers a visual interface-based approach to tackle data preparation tasks in a data integration workflow, it can also be used to author complex ETL workflows using advanced (and even custom) transformations.
AWS Glue ETL in general requires us to have some level of Glue/Spark programming knowledge to implement ETL jobs. In return, we get a much higher level of flexibility for data preparation compared to using just the grid interface in the Data Preparation Recipe. With the data preparation recipe approach, we can only use pre-built transformations to prepare data. Since there are no such restrictions in AWS Glue ETL, we can design and develop custom transformations based on our requirements using existing Glue/Spark ETL APIs and extensions.

Conclusion

AWS Glue provides a flexible and scalable way to prepare data, whether users prefer a visual interface through AWS Glue Studio or a more programmatic approach using AWS Glue ETL. With support for drag-and-drop job creation, data preparation recipes, built-in transformations, data quality rules, and integration with services such as Amazon S3 and Amazon Redshift, AWS Glue helps teams streamline data preparation across different skill levels and use cases.

This article is an excerpt from the book Serverless ETL and Analytics with AWS Glue, Second Edition, which offers a deeper look at building, managing, and optimizing data integration workflows using AWS Glue.

Author Bio

Subramanya Vajiraya is a Senior Cloud Engineer at AWS Sydney specializing in AWS Glue. He obtained his Bachelor of Engineering degree focused on Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India in 2015 and obtained his Master of Information Technology degree focused on Internetworking from University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workload and implement scalable data integration and analytics pipelines on AWS.

Noritaka Sekiyama is an experienced big data engineer working at a Data and AI company. He is responsible for building a scalable data platform with unified governance on Cloud. He is passionate about software engineering, cloud computing, big data technologies, distributed systems, data platform, system monitoring and automation.

Tomohiro Tanaka is a Senior Cloud Support Engineer at Amazon Web Services (AWS). He specializes in data infrastructure with hands-on customer engagement experience including migrations, performance tuning, and production troubleshooting. His areas of expertise include Apache Spark, Apache Iceberg, and AWS Analytics services such as AWS Glue, Amazon EMR and Amazon Athena. He actively contributes to the Apache Iceberg open-source project and speaks at community events and conferences to help customers adopt Iceberg in practice.

Ishan Gaur is a Principal Big Data Cloud Engineer at Amazon Web Services (AWS) with over 16 years of experience architecting and building distributed systems and scalable data integration pipelines. As a subject matter expert in AWS Glue and Apache Spark, he specializes in helping enterprise customers design and implement large-scale data processing solutions across the AWS ecosystem, including Amazon EMR, AWS Glue, and Amazon Athena.

Throughout his career, Ishan has worked extensively with distributed computing frameworks and ETL technologies including Apache Spark, Scala, Ab Initio, and DataStage. His expertise spans the full lifecycle of data engineering, from architecture design and pipeline development to performance optimization and troubleshooting at scale. At AWS, he partners with customers to modernize their data platforms, optimize workloads, and leverage cloud-native services to achieve operational excellence and cost efficiency in their data processing environments.

Albert Quiroga is a Senior Solutions Architect at Amazon, where he creates solutions and architectural designs for one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR, Athena, Glue and SageMaker. His 11 years of experience in the industry have empowered him to work with several Fortune 500 companies to overcome large-scale data and analytics challenges, and he has helped launch and develop features for several AWS services.

Akira Ajisaka is a software engineer and has more than 10 years of engineering experience in big data. He likes troubleshooting and contributing to OSS.


article-image-microsoft-fabric-data-agents-building-ai-powered-conversational-analytics-for-enterprise-data

Christopher Maneu, Frederic Gisbert, Emilie Beau, Jean-Pierre Riehl, Romain Casteres

20 May 2026

10 min read

Microsoft Fabric Data Agents: Building AI-Powered, Conversational Analytics for Enterprise Data


Introduction

Data agents in Microsoft Fabric are redefining how organizations interact with enterprise data. Instead of relying only on dashboards, predefined reports, or technical query skills, users can ask questions in natural language and receive contextual, data-driven answers powered by generative AI.

This article explores how Fabric data agents connect to lakehouses, warehouses, semantic models, mirrored databases, Power BI, Microsoft Copilot Studio, AI Foundry, and real-time intelligence systems, turning organizational data into a conversational, governed, and actionable intelligence layer.

Data agents in Microsoft Fabric

Before diving into the hands-on steps, let’s move from the general overview of data agents to a practical example. The next section walks you through creating your first data agent, showing how to connect it to your data and begin building a customized generative AI experience within Microsoft Fabric.

Microsoft Fabric now brings full CI/CD, ALM flow, and Git integration to Data Agents, providing a more structured and collaborative way to manage, version, and deploy Data Agent artifacts. These capabilities ensure better governance and scalability by introducing controlled development stages, change tracking, and auditability across the lifecycle. Git integration allows teams to branch, experiment independently, review code, and merge updates safely, with the ability to revert quickly if needed.
Together, these enhancements make developing and maintaining Fabric Data Agents more reliable, transparent, and aligned with modern software engineering practices.

Creating your first data agent

The data agent in Microsoft Fabric is a new type of artifact for creating customized generative AI experiences based on the organization’s data. On top of their data, users can activate an AI assistant that translates natural language questions directly into code and immediately returns the answers.

Once created, the data agent must be associated with a data source. You will then need to select the specific tables from which the system will retrieve data. Currently supported sources include lakehouses, warehouses, KQL databases, and semantic models. The user can enhance the assistant by providing additional information about the data (instructions), the underlying models, or simply by detailing the context in which the assistant should operate.

Finally, it is possible to load a set of SQL queries linked to natural language questions to help the model find the information.
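As a rough illustration of what such question-to-query pairs could look like, the sketch below keeps them as plain Python data and runs a basic sanity check before they are loaded. The table and column names (sales, order_date, amount, region) and the validate_examples helper are hypothetical, and this is not the Fabric interface itself, only one possible way to organize the examples beforehand.

```python
# Hypothetical example pairs: table and column names are illustrative only.
EXAMPLE_QUERIES = [
    {
        "question": "What were total sales per month in 2024?",
        "sql": (
            "SELECT MONTH(order_date) AS sales_month, SUM(amount) AS total_sales "
            "FROM sales WHERE YEAR(order_date) = 2024 "
            "GROUP BY MONTH(order_date) ORDER BY sales_month"
        ),
    },
    {
        "question": "Which region had the highest sales?",
        "sql": (
            "SELECT TOP 1 region, SUM(amount) AS total_sales "
            "FROM sales GROUP BY region ORDER BY total_sales DESC"
        ),
    },
]


def validate_examples(examples):
    """Return a list of problems found in the example pairs (empty if none)."""
    problems = []
    for i, ex in enumerate(examples):
        question = ex.get("question", "").strip()
        sql = ex.get("sql", "").strip()
        if not question:
            problems.append(f"example {i}: missing question")
        # Assumption for this sketch: example queries should be read-only SELECTs.
        if not sql.upper().startswith("SELECT"):
            problems.append(f"example {i}: query is not a SELECT statement")
    return problems


problems = validate_examples(EXAMPLE_QUERIES)
```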
These queries are, for example, created by enterprise data analysts who are well acquainted with the business models as well as the data.

In the following example, we ask our conversational agent: What are the key insights or most notable patterns in sales seasonality over time?

Figure 11.20: A data agent in Microsoft Fabric analyzes seasonal sales trends, identifying monthly patterns in total sales, variance, and margins to support strategic planning and forecasting

Once the tables are selected, the context is validated, the examples are loaded, and the interaction is tested, it is possible to publish this data agent in order to share it and retrieve an API to integrate it into enterprise applications.

Unlocking LLM-Driven Intelligence from Mirrored Databases in Microsoft Fabric

Microsoft Fabric introduces a new capability that connects large language models (LLMs) directly to mirrored databases through a Data Agent. This feature allows organizations to use their synchronized, real-time data without the need for complex duplication or manual integration. The Data Agent acts as an intelligent bridge between the mirrored databases and AI models, enabling natural language queries and automated insights based on live information.

The supported mirrored databases include:

• Mirrored Azure Cosmos DB
• Mirrored Azure Database for PostgreSQL
• Mirrored Azure Databricks catalog
• Mirrored Azure SQL Database
• Mirrored Azure SQL Managed Instance
• Mirrored Oracle
• Mirrored Snowflake
• Mirrored SQL Server Database

By integrating these sources, Fabric allows businesses to maintain data consistency, security, and freshness while making their information accessible to AI-driven analysis. Users can interact with their data conversationally, generate real-time insights, and make faster, more informed decisions.
This seamless connection between enterprise databases and LLMs reduces latency, minimizes operational overhead, and strengthens confidence in AI-powered outcomes, turning mirrored databases into a live, intelligent foundation for modern analytics and decision-making.

Agent data source instructions

Microsoft Fabric has introduced a new feature within its data agent framework called data source instructions. This feature allows data owners and developers to provide specific guidance on how AI agents should interpret and use each dataset. Instead of relying solely on the schema or raw content of a data source, agents now have access to structured context, such as explanations of tables, columns, relationships, and intended use cases. These instructions improve the agent’s ability to generate relevant, precise, and business-aligned responses when interacting with enterprise data.

Data source instructions can include natural language descriptions, key business definitions, column-level explanations, usage constraints, and examples of valid questions. This metadata acts as a knowledge layer that shapes how the agent interprets queries and formulates answers. For instance, when working with multiple datasets, the agent can prioritize certain fields, avoid unsupported queries, or follow business-specific logic as defined in the instructions.

The benefit is a noticeable improvement in response quality, consistency, and reliability, particularly in use cases such as semantic search, natural language querying, and generative AI experiences in dashboards or notebooks.
By embedding domain expertise directly into the agent’s environment, data source instructions help bridge the gap between raw data and user intent, empowering AI systems to deliver answers that are not only technically correct but also contextually appropriate.

Figure 11.21: The interface provides a dedicated space for adding custom instructions to guide the AI agent when using the LakeDBIA data source, covering table structures, column descriptions, metrics, and relationships

This development reflects a broader trend in enterprise AI: pairing large language models with curated, human-authored context to increase trust, precision, and usability in data-driven environments.

To begin configuring instructions for your data sources, please refer to the setup guide: https://go.fabricbook.net/ch11-12. For tips on crafting clear and effective instructions, see the recommended guidelines: http://go.fabricbook.net/ch11-13.

The data agent SDK

Microsoft Fabric now offers a Python SDK for data agents, enabling users to evaluate their agents programmatically and at scale. With this SDK, developers can define a set of test questions paired with expected answers (ground truth) and run structured evaluations directly from notebooks or automation pipelines. The evaluation routine compares the agent’s responses against the expected results, logging detailed metrics and step-by-step reasoning in output tables. This enables validation of accuracy, error diagnosis, and confidence building before deploying the agent into production.

Setting up the evaluation is straightforward. You begin by installing the fabric-data-agent-sdk library, then prepare a pandas DataFrame containing your test queries and expected outcomes. After calling the evaluate_data_agent() function, the SDK returns a unique evaluation ID and writes summary and step-level data to tables in your workspace.
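The idea behind this kind of ground-truth evaluation can be sketched in plain Python. The example below is not the fabric-data-agent-sdk API: ask_agent is a hypothetical stand-in for a call to a published data agent, the questions and answers are made up, and the substring match is a deliberately simple scoring rule chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_answer: str


def ask_agent(question: str) -> str:
    """Hypothetical stand-in for a data agent call; returns canned answers."""
    canned = {
        "How many orders were placed in 2024?": "There were 1250 orders in 2024.",
        "Which region had the highest sales?": "The West region had the highest sales.",
    }
    return canned.get(question, "I don't know.")


def evaluate(cases, agent=ask_agent):
    """Compare each response with the expected answer and summarize accuracy."""
    rows = []
    for case in cases:
        actual = agent(case.question)
        # Simplistic rule: the expected answer must appear in the response.
        passed = case.expected_answer.lower() in actual.lower()
        rows.append({"question": case.question, "actual": actual, "passed": passed})
    accuracy = sum(r["passed"] for r in rows) / len(rows) if rows else 0.0
    return {"accuracy": accuracy, "rows": rows}


cases = [
    EvalCase("How many orders were placed in 2024?", "1250"),
    EvalCase("Which region had the highest sales?", "West"),
    EvalCase("What was the average margin in Q3?", "18%"),
]
summary = evaluate(cases)
print(f"accuracy: {summary['accuracy']:.2f}")
```

Keeping the per-question rows next to the overall score mirrors the summary and step-level split described above, and makes it easier to see which questions regress after a prompt or instruction change.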
You can then retrieve overall accuracy and performance insights using the get_evaluation_summary() function.

This capability simplifies both quality control and continuous improvement of data agents. It supports tasks such as prompt tuning, regression tracking, and performance monitoring, helping ensure that AI-powered experiences remain accurate and reliable as they evolve.

Integrating data agents with Microsoft Copilot Studio and Microsoft AI Foundry

Integrating data agents with AI Foundry and Copilot Studio enables organizations to leverage enterprise data seamlessly within AI-driven workflows. This integration empowers conversational AI agents to dynamically access relevant data, enhancing both decision-making and user productivity.

Integration with AI Foundry occurs through the deployment of Fabric data agents as knowledge sources within Foundry’s agent framework. Foundry orchestrates the interaction, enabling AI agents to query these data agents directly. User identity and permissions flow securely through this integration, ensuring compliance and data governance at scale.

Figure 11.22: This panel allows users to expand an AI agent’s knowledge by connecting external data sources such as Microsoft Fabric, Azure AI Search, SharePoint, Bing Search, Tripadvisor, and others for grounding responses

When an AI Foundry agent receives a request, it intelligently delegates appropriate queries to a connected Fabric data agent, retrieves precise responses, and generates coherent outputs for end users.

In the following screenshot, we’ve connected SalesAgent from Microsoft Fabric to a broader agent called Market Agent.
This connection allows us to cross-reference sales data with market insights, CRM information, and other external sources.

Figure 11.23: Existing Microsoft Fabric SalesAgent connection with custom authentication is ready for use

We can then interact with our agent in the playground and, within our context, ask the following question: What are the main takeaways or most significant trends in sales seasonality over time? Summarize in three key points. The result is then displayed in the agent’s interface.

Figure 11.24: The Market Agent in Azure AI Foundry summarizes key sales seasonality trends, using the connected SalesAgent knowledge source for context

It is then possible to view the reasoning behind the model’s response, as well as the structure of its answer. In our example, we can clearly see that the agent is interacting with the data agent, allowing it to retrieve as much relevant information as possible.

Figure 11.25: The run trace confirms a successful call to the fabric_dataagent tool, executed in 3 seconds as part of a thread session

AI agents integrated within Foundry can be seamlessly connected to Copilot Studio. Through this connection, agents created in Foundry become accessible within Microsoft Copilot applications and Microsoft 365. Users benefit from streamlined, multi-agent interactions, enabling AI assistants within familiar productivity tools to interact fluidly with enterprise data sources.

This integration leverages standardized protocols, allowing Copilot Studio agents to transparently discover, communicate, and collaborate with Foundry-based data agents, creating a unified conversational AI experience across various platforms.

By integrating Fabric data agents with AI Foundry and Copilot Studio, organizations can build sophisticated, interconnected AI agents that effortlessly access and deliver enterprise knowledge.
The result is a powerful, cohesive system enabling intuitive and secure interactions between users, AI agents, and organizational data.

Integrating data agents with Power BI

In the evolving landscape of business analytics, the integration of Fabric data agents into Power BI Copilot marks a significant leap forward. Microsoft is transforming the way users interact with enterprise data, moving beyond dashboards and static reports to a more fluid, conversational experience.

Traditionally, Power BI relied on structured datasets and semantic models to deliver insights. While powerful, this approach required users to know what data was available and how it was organized. With the arrival of Fabric data agents, that barrier begins to dissolve. These agents act as intelligent bridges, capable of querying data across Microsoft Fabric (including lakehouses, data warehouses, and real-time sources) without the user needing to understand their structure.

Once connected to Copilot, these agents enable a more dynamic and flexible form of analysis. Users can ask open-ended questions in natural language, and the system intelligently determines which data sources to consult. Rather than extracting predefined metrics, the agent interprets the user’s intent, formulates relevant queries, and returns synthesized answers. The interaction becomes adaptive, responsive, and deeply contextual.

The user experience has also evolved. Power BI now offers a full-screen Copilot interface that supports an ongoing dialogue with data. Users no longer jump between visualizations and filters; instead, they engage in natural, iterative conversation. Each question refines the context. Each response adds new understanding. The Fabric data agent is not just retrieving numbers; it’s helping users think.

This integration brings real advantages. First, it broadens analytical reach, allowing users to tap into previously disconnected datasets.
Second, it lowers the technical barrier to insight, making advanced analysis accessible to users with no background in data modeling. And third, it shortens the path from question to answer, accelerating decision-making at every level of the business.

The Power BI home page offers a few suggested prompts, along with customization options such as adding agents, semantic models, or reports to better address the user’s query.

Figure 11.26: New Power BI Copilot experience

Once the agent is registered, you can submit the question to Copilot, which will rely on this agent to gather the data and generate a response for the user. In our example, the question asked is: What are the main takeaways or most significant trends in sales seasonality over time? Summarize in three key points.

Figure 11.27: Interaction between the new Power BI Copilot and Sales data agent

The most striking shift is not technical but conceptual. Business intelligence is becoming less about pulling data and more about interacting with it. With data agents embedded in Copilot, Power BI becomes a space where data and intent meet, where the analyst’s curiosity is answered by a system that listens, understands, and responds. It’s not just an evolution of tooling; it’s a new way of working with information.

MCP support for Real-Time Intelligence (RTI)

The promise of real-time intelligence has always been about immediacy: turning raw, fast-moving data into insight and action, as events unfold. With the introduction of Model Context Protocol (MCP) support for Real-Time Intelligence (RTI) in Microsoft Fabric, that promise takes a tangible form.
AI agents, which once relied on static data or scheduled refreshes, can now engage directly with live event streams, unlocking entirely new use cases for operational awareness, anomaly detection, and conversational analytics.

At the heart of this evolution lies the concept of the MCP RTI server, an open source server built to act as a translator between natural language prompts and real-time query execution. This server supports queries against platforms such as Eventhouse and Azure Data Explorer, two foundational components of Fabric’s real-time architecture. The MCP server receives a user or agent’s query, phrased in everyday language, and transforms it into executable Kusto Query Language (KQL) or JSON-based expressions that access the most up-to-date data available.

The technical design is particularly elegant: the server offers features such as schema introspection, allowing agents to understand the structure of real-time tables and streams. It supports autocomplete, query optimization, query validation, and even natural language error explanations, enhancing usability for both developers and end users. In more advanced scenarios, it also supports anomaly detection patterns, vector-based semantic searches, and custom parameter bindings, all critical capabilities when working with high-volume, high-velocity datasets.

From the user’s point of view, the experience is remarkably fluid. Imagine a business analyst interacting with Copilot in Microsoft Teams or Power BI, asking: What unexpected traffic spikes occurred in the past five minutes? Instead of routing the question to a stale dataset, the AI agent communicates with the MCP RTI server, which accesses live telemetry from connected streams. The response is both immediate and context-aware: an intelligent snapshot of a moment in time, tailored to the user’s query.

This advancement is not just a technical milestone; it marks a shift in how we think about agents and data systems.
MCP becomes the connective tissue between real-time data infrastructure and generative AI. It abstracts away the complexity of query languages and schema logic, giving agents the ability to operate as intuitive interfaces to complex real-time systems. These agents no longer need to be pre-programmed with rigid logic; they can now adapt, respond, and learn from live data in the flow of conversation.

Moreover, because MCP is standardized and extensible, this model is not confined to Microsoft tools. Any system or agent that speaks MCP can now tap into the RTI capabilities, whether it’s a third-party chatbot, an autonomous monitoring system, or a custom enterprise app. This composability ensures that real-time intelligence is not a siloed feature but part of a broader, interoperable ecosystem.

In the following example, after deploying the MCP RTI server, we have an MCP server connected to sample New York taxi data. The GitHub Copilot agent is linked to the Claude Sonnet 4 reasoning model, which responds to the following question: Identify the periods of high activity in the taxi data you have access to.

The agent begins its reasoning by listing the available databases and tables, discovers the taxi dataset, learns the schema, and then extracts a subset of the data to better understand its structure and content.

Figure 11.28: MCP RTI server configured as tools in the GitHub Copilot agent interface

The agent then continues its reasoning by extracting specific information from the database that will enable it to produce the final result of its analysis.
A sequence of queries is sent to the database as part of this process.

• Now let me analyze monthly patterns
• Let me also look at specific high-activity periods and combine hour and day analysis
• Let me also analyze seasonal patterns by looking at different years
• Let me get a broader view of the recent data
• Now let me get the peak activity patterns for different time periods

For this final query, here is an example of the code generated by the agent and sent to the underlying database.

{
  "cluster_uri": "https://clustername.kusto.fabric.microsoft.com",
  "database": "NycTaxiDB",
  "query": "nyctaxitrips\n| where tpep_pickup_datetime >= datetime(2022-01-01) and tpep_pickup_datetime < datetime(2023-01-01)\n| extend pickup_date = bin(tpep_pickup_datetime, 1d)\n| summarize daily_trips = count() by pickup_date\n| order by daily_trips desc\n| take 10"
}

At the end of its reasoning, the agent returns its analysis in the form of a formatted text, enriched with calculations derived from the knowledge base.

Figure 11.29: The setup in VS Code uses a custom Kusto query configuration to analyze NYC taxi data, revealing peak activity times: Wednesdays at noon lead with over 23 million trips

In a world increasingly driven by the speed of information, the ability to act on data as it happens is no longer a luxury; it’s a necessity. With MCP support for RTI, Microsoft is providing the infrastructure for that immediacy: a conversational layer over a real-time backbone, where AI agents don’t just analyze the past but live in the present.

You can access the official Fabric RTI server at the following link: https://go.fabricbook.net/ch11-14

Conclusion

Microsoft Fabric data agents represent a major step toward AI-first analytics, where users can query, reason over, and act on enterprise data through natural language.
From creating a first data agent to integrating with mirrored databases, Power BI Copilot, Microsoft AI Foundry, Copilot Studio, and Real-Time Intelligence through MCP, the article shows how Fabric is evolving into a unified platform for governed, conversational, and real-time decision-making.

By combining structured data access, business context, SDK-based evaluation, and secure AI integrations, Fabric data agents help organizations move beyond static reporting toward intelligent, interactive analytics experiences. This article is an excerpt from the book The Definitive Guide to Microsoft Fabric by Packt.

Author Bio

Christopher Maneu is a Principal Data Engineering Advocate at Microsoft, where he focuses on data and analytics within the Azure platform. Christopher is part of the Azure Engineering team and worked on Microsoft Fabric well before its launch. This early involvement has provided him with in-depth knowledge of the platform's development and capabilities. He has authored multiple books, including a reference book about Microsoft Fabric in French. Additionally, he has contributed to open-source projects related to Microsoft Fabric, such as the 'fabricnotes' repository on GitHub, which offers simple drawings illustrating the main concepts of Microsoft Fabric to empower users to build on the platform.

Emilie Beau is a technical specialist in data processing technologies. She has been at Microsoft for 15 years, enjoys sharing her knowledge, and engages in discussions about how advances in BI, Big Data, and Artificial Intelligence can help industries address new challenges. She spent years in the Microsoft Technology Center, addressing CxOs at the intersection of business and technical needs.

Jean-Pierre is a technology leader who combines data, innovation, and business value.
Passionate about data, a fan of Power BI and Fabric, guided by innovation yet business-oriented, and "artificially intelligent," he has worked for more than 25 years on exciting projects, from the web to IoT, encompassing vast amounts of data and AI. A recognized Microsoft Most Valuable Professional (MVP) for the Data Platform since 2008, he has always been deeply involved in communities in France, both as an organizer and speaker. He leads the Microsoft Data community in France, notably through the Power BI Club and the Fabric Club.


article-image-semantic-search-in-postgresql-with-pgvector-how-to-build-ai-powered-vector-search-and-recommendations

Vibhor Kumar, Marc Linster

23 Apr 2026

10 min read

Semantic Search in PostgreSQL with pgvector: How to Build AI-Powered Vector Search and Recommendations


article-image-what-is-a-system-of-action-building-an-ai-ready-data-foundation-for-real-time-decision-intelligence

Boris Bialek, Sebastian Rojas Arbulu, Taylor Hedgecock

31 Mar 2026

10 min read

What Is a System of Action? Building an AI-Ready Data Foundation for Real-Time Decision Intelligence


Introduction

Modern enterprises are no longer competing on data alone; they’re competing on how quickly and intelligently they can act on it. As AI systems evolve from passive analytics to autonomous decision-makers, traditional data architectures are becoming a critical bottleneck. The rise of systems of action marks a fundamental shift: from storing and analyzing the past to driving real-time decisions, workflows, and outcomes.

This article explores how organizations can modernize their data foundations to support agentic AI, real-time context, and scalable intelligence. It breaks down the architectural principles required to unify fragmented data, ensure quality and trust, and enable continuous learning systems that operate at enterprise scale. From unified data access to real-time signal processing and governance, this is a practical guide to building the AI-ready backbone that powers next-generation applications.

Building an AI-ready data foundation

Delivering on the promise of systems of action requires a new kind of data foundation, one built for speed, context, and adaptability.

Agentic AI systems fundamentally differ from traditional systems of record in their operational demands. Where legacy systems focus on capturing and storing historical transactions, systems of action powered by agentic AI require real-time decision-making, dynamic data synthesis, and immediate response capabilities. This shift demands that our data architecture choices move beyond the rigid, siloed structures of traditional enterprise systems.

A unified view of core enterprise data is essential.
It must bring together the diverse data types that autonomous agents rely on (real-time operational signals, contextual documents, vector embeddings) into a single, coherent platform. That platform must be built on flexible data structures that can adapt as agent behaviors evolve.

The transition from supporting passive systems of record to enabling active systems of action introduces six critical architectural requirements that distinguish agentic AI infrastructure from legacy approaches:

• Unified data access to eliminate the complexity of managing multiple disparate datastores
• Data quality and consistency mechanisms that reduce hallucinations and errors from systems out of sync
• Real-time context capabilities that enable immediate signal processing for RAG applications
• Scalability and performance characteristics that support operational AI rather than only backward-looking analytics
• Governance and security frameworks that protect sensitive information while enabling innovation
• Efficient model training workflows that optimize data preparation for GenAI applications

Together, these elements form the data foundation for autonomous, intelligent systems. As we examine each in the sections ahead, we’ll see how a system of action database departs from traditional data management and enables more intelligent, responsive, and scalable AI applications.

What is a system of action?

Systems of action are a new class of enterprise application, designed to execute decisions and drive workflows in real time. They enable collaboration between people, AI-assisted users, and AI agents, supporting everything from assisted decision-making to fully autonomous execution.

Unlike systems of record, which passively store historical transactions, or systems of insight, which analyze data retrospectively, systems of action operate in the moment. They process dynamic context, trigger decisions, and execute tasks through AI agents.
For instance, they might reroute a delayed flight in real time or automatically adjust hospital staffing during a sudden surge.

Building systems of action requires more than analytical capabilities. They must ingest streaming signals, reason across unstructured and structured sources, and respond in real time. They require specialized database architectures capable of managing high-velocity, multimodal data streams and supporting complex state transitions over time. Most legacy systems, designed for static, batch-oriented workflows, simply cannot support this kind of continuous intelligence.

Figure 3.1: Enterprise system landscape: from system of record to system of action

Figure 3.1 illustrates this evolution across the enterprise landscape. Unlike traditional systems that passively store or retrospectively analyze data, systems of action enable real-time interaction between users, applications, and agents, all powered by a live, adaptable data layer.

Unified data access architecture

The foundation of any GenAI system begins with access to diverse, multimodal data, at speed, in formats AI can reason with. Unfortunately, this is also where most enterprises struggle. Traditional enterprise data architectures are fragmented across dozens of incompatible systems, each optimized for narrow use cases. The result is integration pain, access friction, and massive overhead.

Modern AI applications demand a fundamental departure: unified access must be treated not as a convenience but as a prerequisite.

Today’s models must navigate a wide variety of inputs: text documents, application logs, product catalogs, support transcripts, and streaming sensor data. Relational and legacy systems often store semi-structured data (like JSON or XML) as binary large objects (BLOBs) or character large objects (CLOBs), limiting their usability for AI systems.
In these cases, the actual data is hidden inside a single entry and must be extracted and interpreted before it can be reasoned over or acted upon. This was tolerable when the goal was to store and retrieve files. But for GenAI systems, where models need immediate access to both structured and semi-structured data, often in the same query, this format becomes a bottleneck. Even a video can have its own addressable metadata structure rather than existing solely as an opaque BLOB, which illustrates the shift needed to support AI-native reasoning.

Beyond the format problem lies a more urgent challenge: fragmentation.

An AI application might need to stitch together context from a CRM (customer profiles and account hierarchies), a product catalog (SKU-level details, pricing, availability), a data warehouse (historical transactions), a streaming platform (real-time behavioral signals), and a document store (contracts, support transcripts, policy documents). Each source has its own schema, access pattern, and often its own API. This complexity creates two persistent challenges:

• Developer integration friction: Each layer introduces its own headaches, from authentication and authorization to schema mismatches, brittle connectors, and inconsistent formats
• System fragility/maintenance drag: Over time, these integrations accumulate, introducing silent failures, versioning issues, and downstream reliability risks that make innovation slower and more expensive

MongoDB’s document model takes a fundamentally different approach. Instead of forcing diverse data into rigid schemas or hiding it in unreadable blobs, it enables rich, hierarchical data structures that mirror how businesses actually operate [1]. Developers can model a full customer, order, or event in a single document, including nested context, version history, and behavioral attributes.
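To make this concrete, here is a minimal sketch of a customer modeled as one nested document, written as a plain Python dictionary so that it runs without a database. All field names and values are hypothetical, and a real deployment would store the document in a collection rather than in memory.

```python
import json

# Hypothetical customer document: every field name and value is illustrative.
customer = {
    "_id": "cust-1001",
    "name": "Example Customer",
    "address": {"street": "1 Example Road", "city": "Exampleville"},
    "orders": [
        {"order_id": "ord-1", "total": 120.50, "items": [{"sku": "A-100", "qty": 2}]},
        {"order_id": "ord-2", "total": 35.00, "items": [{"sku": "B-200", "qty": 1}]},
    ],
    "behavior": {"last_login": "2026-03-01", "preferred_channel": "email"},
    "version_history": [
        {"version": 1, "changed": "address", "at": "2025-11-12"},
    ],
}


def lifetime_value(doc):
    """Sum order totals directly from the nested orders, with no join step."""
    return sum(order["total"] for order in doc["orders"])


print(lifetime_value(customer))
print(json.dumps(customer["address"]))
```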
This eliminates the need for complex joins while preserving the relationships critical for effective agentic reasoning.

Even more critically, flexible schema design, meaning the ability to store and query data without locking into a rigid blueprint, allows fields and document shapes to adapt as requirements change. This lets data evolve: new attributes can be added without downtime, and new types of signals can be integrated without costly migrations. For AI systems (especially those that learn, adapt, and extend themselves), this agility is essential.

This architectural convergence brings structured transactions, real-time signals, and unstructured content together in a single query or operation. Model updates, enrichment jobs, or downstream agent actions can all be triggered directly from the same data platform [2]. That unified model lays the groundwork for sophisticated, AI-native workflows.

Perhaps more importantly, unified data access transforms developer productivity. Instead of spending cycles reconciling formats or debugging brittle connectors, teams can focus on building intelligent systems. And, as we’ll see in the sections ahead, everything from data quality and governance to real-time orchestration builds on this foundation.

Ensuring data quality and consistency

Data quality and consistency are non-negotiable for GenAI solutions. Unlike traditional analytics, where data quality issues might simply yield incorrect reports or delayed insights, poor data quality in AI systems can cause hallucinations, biased outputs, and fundamentally unreliable behavior that undermines user trust and business value.

Legacy quality approaches tried to solve this through normalization, deduplication, and validation against external sources. Consider a familiar failure mode: a system validates Joe Miller, 12 High Street, through postal APIs and credit checks, yet fails to distinguish between three different Joe Millers (grandfather, father, and son) at the same address.
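The failure mode is easy to reproduce. In the sketch below, which uses made-up records and hypothetical field names, deduplicating on name and address alone collapses three people into one entry, while including a distinguishing attribute (here a birth year, chosen only for illustration) keeps all three apart.

```python
# Made-up records for illustration; field names are hypothetical.
records = [
    {"name": "Joe Miller", "address": "12 High Street", "birth_year": 1948},
    {"name": "Joe Miller", "address": "12 High Street", "birth_year": 1975},
    {"name": "Joe Miller", "address": "12 High Street", "birth_year": 2001},
]


def dedupe(rows, key_fields):
    """Keep the first record seen for each distinct combination of key fields."""
    seen = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        seen.setdefault(key, row)
    return list(seen.values())


by_name_address = dedupe(records, ["name", "address"])
by_person = dedupe(records, ["name", "address", "birth_year"])

print(len(by_name_address))  # the three individuals collapse into one entry
print(len(by_person))        # all three individuals remain distinct
```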
For entity analytics, where precise relationship mapping matters, this is a critical flaw.

In this scenario, an online store might unknowingly treat all three individuals as the same customer, losing the ability to tailor interactions or offers. Relational star schemas exacerbate this problem by fragmenting contextual information across multiple tables. When customer data is split between fact tables, dimension tables, and lookup tables, the rich context that enables accurate entity resolution becomes scattered and difficult to reconstruct.

In our Joe Miller example, a document-based approach would maintain separate documents for each individual, complete with detailed demographic information, purchase history, behavioral patterns, and relationship data that enable clear differentiation.

Within a document, you can store original values alongside enrichments and enhancements in the same dataset. When an AI system generates an output, the complete chain of data sources, transformations, and reasoning steps can be traced back through the document structure, enabling both debugging and compliance reporting.

This lineage capability proves essential for improving output reliability and reducing hallucinations or contradictory results. When AI models can access not just the current state of data but also its provenance and transformation history, they can make more informed decisions about data reliability and confidence levels.
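To make the Joe Miller example concrete, here is a minimal sketch that uses plain Python dicts as stand-ins for documents. The field names (`original`, `source`, `enrichments`, `produced_by`, `derived_from`), the source labels, and both helper functions are assumptions made for illustration; none of them is part of a MongoDB API.

```python
def make_customer(customer_id, birth_year, segment):
    """Build one hypothetical customer document (all field names are assumptions)."""
    return {
        "_id": customer_id,
        # Original values are stored together with where they came from.
        "name": {"original": "Joe Miller", "source": "web_signup"},
        "address": {"original": "12 High Street", "source": "postal_api"},
        "birth_year": {"original": birth_year, "source": "loyalty_form"},
        # Enrichments sit next to the originals and record their inputs.
        "enrichments": [
            {
                "field": "segment",
                "value": segment,
                "produced_by": "rule:age_band",
                "derived_from": ["birth_year"],
            }
        ],
    }


customers = [
    make_customer("cust-1", 1948, "retiree"),
    make_customer("cust-2", 1975, "family_shopper"),
    make_customer("cust-3", 2003, "student"),
]


def find_by_name_and_address(docs, name, address):
    """Return every document that matches on name and address alone."""
    return [
        d for d in docs
        if d["name"]["original"] == name and d["address"]["original"] == address
    ]


def lineage(doc, enriched_field):
    """Trace an enriched value back to the sources of the fields it was derived from."""
    for enrichment in doc["enrichments"]:
        if enrichment["field"] == enriched_field:
            inputs = [(f, doc[f]["source"]) for f in enrichment["derived_from"]]
            return {
                "value": enrichment["value"],
                "produced_by": enrichment["produced_by"],
                "inputs": inputs,
            }
    return None


matches = find_by_name_and_address(customers, "Joe Miller", "12 High Street")
print(len(matches))  # three separate documents, not one merged customer
print([d["birth_year"]["original"] for d in matches])  # the field that tells them apart
print(lineage(customers[2], "segment"))
```

Name and address alone match all three documents, but each individual keeps its own record, and the enriched `segment` value can be traced back to the field and source it was derived from.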
For example, a customer service AI might weigh recent direct customer interactions more heavily than older inferred preferences, or flag potential inconsistencies when multiple data sources provide conflicting information.

For organizations implementing document-based data quality strategies, MongoDB offers comprehensive best practices, as well as compatibility with industry-leading tooling for data modeling and cataloging that makes advanced quality management achievable at scale [3]. When high-quality, lineage-aware data becomes the default, AI systems can deliver results that are accurate, explainable, and trustworthy.

Real-time context and RAG

The definition of real-time varies significantly by use case and industry, but the importance of real-time data for GenAI is hard to overstate. Hedge fund trading systems, for example, require millisecond responses, while life insurance underwriting processes measure time in days. While application response times continue to decrease, many architectures use caching layers that create an illusion of real-time performance at the expense of data freshness.

A typical real-time environment follows a simple pattern where an interaction generates a signal that enables immediate interpretation. These signals may originate from diverse sources, such as a retail website recording shopping cart additions, a smart meter transmitting electricity usage, or a pathology lab completing cancer analysis data. All signals, when combined with existing datasets, enable text search, vector search, and LLM processing for reasoning and causal analysis. This applies equally to interactive systems, such as retail shopping carts, and autonomous agentic systems, such as automated insurance claim processing.

Real-time integration of signals with metadata, reference data, and historical information generates new knowledge instantaneously. Consider how this has evolved.
Traditional rule-based systems might suggest "You ordered a burger, would you like fries?" In contrast, an AI-powered system recognizes patterns such as "You order cat food bi-weekly, always the same brand", and reasons contextually with suggestions such as "Based on your purchase history, you might be interested in our new, healthier formula. Would you like us to send you a free sample?" The system identifies repeat customers and enhances their experience through reasoning that connects purchase patterns with product recommendations, which requires deeper knowledge about customer preferences and pet characteristics.

Figure 3.2: Real-time AI data flow

The architectural flow in Figure 3.2 demonstrates how modern AI applications process real-time signals through a system of action database, using an airline passenger assistance scenario. The flow begins with diverse signal sources on the left: Passenger Check-In, Booking Systems, Weather Services, and Flight Status Updates, which feed into Signal Processing and Format Conversion/Real-Time Capture components. These signals are then ingested into the central MongoDB Document Store, which contains Flight Documents, Passenger Vectors, Booking Metadata, and Historical Patterns with Direct Vector Access capabilities.

The system processes this data through Atlas Vector Search (finding similar flight disruptions) and LLM Augmentation (generating personalized responses with flight context) to produce three types of intelligent outputs: Re-Booking Confirmations, Personalized Options, and Automated Responses. At the foundation sits the Operational Data Layer (ODL), an architectural pattern that centrally integrates and organizes siloed enterprise data, serving as an intermediary between existing data sources and consuming applications.
In this case, the ODL enriches signals with contextual information from passenger records, alternative flights, weather data, and rebooking history.

A continuous learning and enrichment feedback loop ensures that every interaction outcome, whether accepted re-bookings or user preferences, flows back to improve future recommendations. The document model enables continuous enhancement without requiring system restructuring, creating a system that grows smarter with each passenger interaction while delivering real-time, context-aware responses vital for modern AI applications.

Critically, the feedback loop means that every interaction outcome enriches the system of action database, making future responses more accurate and contextual. This circular flow embodies the key advantage of document-based architectures: the ability to evolve and improve without the schema rigidity that constrains traditional relational systems.

Scalability, availability, and performance

Historically, enterprise data warehouses represented the largest database implementations, with denormalized, column-oriented star schemas designed for analytical queries. These systems perform well with queries such as "Display yogurt sales by region", where large datasets are filtered by specific criteria (region, store, price) to generate insights. The integration of multiple sources led to the development of extract, transform, load (ETL) processes and master data management systems. While these platforms have added machine learning features and now claim to support GenAI capabilities, they remain primarily designed for backward-looking analytical tools, unsuited to real-time, agentic, and causal AI applications.

Consider the contrast.
A chatbot assisting an airline passenger who missed a connection requires fundamentally different capabilities than a system answering "How many passengers experienced day-long delays in Frankfurt last year?" The chatbot and its underlying agentic system must address immediate needs: finding available seats, offering mitigation services, and responding empathetically to frustrated passengers. The required data is real-time, context-sensitive, and simply not available from a historic warehouse.

To serve the passenger successfully, the system needs real-time access to seat information (easy to achieve with an API to the usual booking systems) and, more importantly, detailed context about the passenger and their situation. Is it a family stranded, or a single adult? What other ticket dependencies exist? Can the passenger be rerouted via a different track, or is the best option to stay overnight?

This scenario demands that all passenger data reside in an up-to-date system of action database, as real-time interactions fail without current information. As these systems achieve global coverage, non-functional requirements mandate not only 24/7/365 availability but also the ability to handle transaction volume fluctuations from quiet periods to peak travel seasons such as Thanksgiving. Even minimal outages become unacceptable, and caching solutions that address data availability do so at the cost of accuracy, because they introduce stale data.

Document-based architectures, such as those provided by MongoDB, offer advantages in specific scenarios for this type of data availability and scalability. Rather than requiring complex joins across multiple tables to reconstruct user context, document models can store complete contextual information in a single, efficiently retrievable record.
This approach reduces the computational overhead of context reconstruction while enabling more sophisticated caching and optimization strategies.

The performance characteristics of AI workloads also differ significantly from traditional analytical patterns. While analytical queries typically process large volumes of data to generate aggregate results, AI applications often require rapid access to specific, contextually relevant information. This pattern favors architectures optimized for high-concurrency, low-latency access to individual records, rather than bulk processing of large datasets.

Governance, security, and compliance

Governance and compliance requirements stem from a fundamental need to protect individuals from flawed decision-making in systems that lack adequate self-regulation. These safeguards exist to prevent real harms, from biased loan approvals to unsafe product recommendations.

GenAI faces intense scrutiny regarding accuracy, with media coverage of hallucinations bringing this concern to the forefront. Therefore, transparency in data lineage, reasoning processes, and result interpretation becomes critical for any GenAI solution. The document model in a system of action database enables tracking of all changes, transformations, and actions related to specific datasets. Unlike legacy relational databases, documents offer the flexibility for enhancement and enrichment throughout the process without requiring upfront planning.

From a governance perspective, this enables precise and comprehensive tracking of communication and decision-making processes. It facilitates decision auditing and corrective actions when compliance challenges arise, often due to gradual shifts in decision criteria requiring adjustment.

Security represents an additional critical dimension. MongoDB’s Queryable Encryption is designed to keep sensitive data encrypted, even while it is being queried, to protect it from unauthorized access.
While passenger data may have moderate sensitivity, healthcare provider consultations about potential illnesses require the highest security levels. The system of action database enables transparent security implementation, which is significantly more challenging when coordinating multiple data sources with potentially incompatible security and policy systems [4].

Model training and fine-tuning

Training or fine-tuning models requires large volumes of clean, labeled, and diverse data. The system of action database ensures efficient data curation, sampling, and preprocessing for training pipelines. Data enrichment becomes key, as features such as MongoDB’s aggregation pipeline enable data annotation and continuous analysis of criteria such as minimum or maximum values and moving averages to validate reasoning processes.

The subject of data preparation for GenAI is often misunderstood, a confusion that stems from the evolution of early AI solutions supporting ML systems (systems that were derived from business intelligence (BI) architectures). This sometimes leads to the mistaken assumption that all data for AI usage and interaction must first be prepared, or readied, in lakes, warehouses, or marts, requiring extensive transformation and data pipeline processing. The resulting data objects are often stored as star schemas with fact tables, each containing hundreds of columns and accompanying dimension tables. Star schemas, a data modeling format originally designed to solve the problem of performant analytics queries executed against relational database objects, introduce the need for complex queries and join operations to extract insight, an architecture still employed by platforms such as Snowflake.

Apache Spark object-storage implementations, such as Databricks, offer more complex query capabilities through distributed computing frameworks and in-memory processing, representing a significant advancement over traditional batch processing systems.
Both approaches, star schemas and Spark-manipulated object storage files, share a foundation in backward-looking data warehousing, regardless of contemporary terminology such as data lake or lakehouse.

These systems are optimized for processing large volumes of homogeneous data aligned along dimensional axes. Real-time access to individual datasets for operational processing falls outside their design parameters. Historically, this was the realm of online transaction processing (OLTP) systems. While transactional logging isn’t central to GenAI data structures, the access patterns remain similar.

Often, the example of building models for embeddings is cited as justification for why the data warehouse must be the source of data for GenAI, but this is misleading. Firstly, many business solutions successfully deploy standard embedding models for PDFs, images, and audio, without the need for custom development. Secondly, and more importantly, the comparison doesn’t hold, as warehouses analyzing quarterly sales have no relevance to point-of-sale operations and transaction booking.

Conclusion

To unlock the full potential of AI, enterprises must rethink their data architecture from the ground up. Systems of action represent this new paradigm, where data is not just stored or analyzed, but continuously activated to drive intelligent decisions in real time. Achieving this requires six foundational capabilities: unified data access, high data quality and consistency, real-time context integration, scalable performance, robust governance and security, and efficient model training pipelines.

By adopting flexible, document-based architectures and eliminating data fragmentation, organizations can build systems that are not only faster and more responsive but also more trustworthy and adaptable.
The result is a living data ecosystem, one that evolves with every interaction, improves decision accuracy, and enables truly autonomous, AI-driven operations.

This article is an excerpt from the book Architectures for the Intelligent AI-Ready Enterprise. To explore these concepts in greater depth and learn how to implement them in real-world enterprise environments, readers can turn to the full book.

Author Bio

Boris Bialek has worked in the IT industry since the 1990s and was one of the initial drivers of Linux in Europe, delivering the first SAP port to Linux, conducting the first benchmarks, and securing the first clients. Since then, he has led product and development teams across IBM and FIS, driving innovation for both the end product and development productivity. Boris Bialek joined MongoDB in 2019, igniting a focus on industry solutions based on MongoDB's document model. Promoted to global field CTO and VP of industries, he drives technical design. He works directly with numerous clients, helping them gain the benefits of the MongoDB Atlas data platform. Boris holds a master's in computer science from the Karlsruhe Institute of Technology.

Sebastian Rojas Arbulu is an industry solutions specialist at MongoDB, where he collaborates with numerous stakeholders across diverse industries to help customers realize the transformative value of MongoDB through tailored, data-driven solutions, particularly for AI integration. Sebastian also leads his team's content strategy, including numerous additions such as blogs, white papers, magazines, and other thought leadership pieces. With a background in IT consulting, marketing, and digital transformation, among other areas, he has extensive experience in identifying customer needs and developing innovative solutions that prepare data for intelligent applications and unlock new possibilities.
He holds a bachelor of business administration degree.

Taylor Hedgecock is a strategic program leader and transformation partner who turns vision into velocity. With a career spanning startups to multinationals, she brings a mix of operational rigor, narrative clarity, and cross-functional orchestration. At MongoDB, she has led high-impact programs across AI, partner ecosystems, and services modernization, often serving as the connective tissue between vision and execution. Her work has guided C-level priorities, enabled go-to-market readiness, and driven large-scale change, establishing her as a trusted leader in aligning stakeholders, translating strategy into story, and driving outcomes that last. Taylor currently serves as senior program manager on the industry solutions team, partnering with ISVs and AI innovators to bring next-generation solutions to market. Previously, she was chief of staff for professional services leadership, where she helped launch new offerings and guided modernization strategy, shaping MongoDB's vision for applying AI to its hardest problems.



Aldo Marzullo, Enrico Deusebio, Claudio Stamile

24 Mar 2026

10 min read

Large Language Models and Graph Machine Learning


Introduction

Large Language Models (LLMs) have transformed the landscape of artificial intelligence, redefining how machines understand and generate human language. From early statistical methods to the breakthrough of Transformer architectures, LLMs such as GPT, BERT, and T5 have unlocked unprecedented capabilities in natural language processing (NLP). As organizations increasingly rely on both structured and unstructured data, a powerful new paradigm is emerging: the integration of LLMs with Graph Machine Learning (GraphML). This combination enables systems to leverage both deep contextual language understanding and rich relational data, paving the way for more accurate, scalable, and intelligent AI applications across domains like search, recommendation systems, and knowledge graphs.

LLMs: an overview

In the rapidly evolving field of artificial intelligence, LLMs have significantly advanced natural language processing (NLP) and understanding. These models, characterized by their extensive number of parameters and trained on large datasets, have demonstrated remarkable capabilities across a wide set of language-related tasks.

The journey of language models began with statistical approaches that relied on probabilistic methods to predict word sequences. These early models, while laying the foundations, were limited by their reliance on fixed-size context windows and their inability to capture long-range dependencies. However, as we also discussed in Chapter 4, Unsupervised Graph Learning, with the advent of neural networks, the field underwent a significant shift, introducing models capable of learning word embeddings.
To improve the ability to capture long-range dependencies, the initial neural network models were based on the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures, which are forms of Recurrent Neural Networks (RNNs). However, a pivotal moment occurred with the introduction of the Transformer architecture by Vaswani et al. in 2017. Unlike its predecessors, the Transformer model utilized self-attention mechanisms, enabling it to consider the entire context of a sentence without the sequential constraints inherent in RNNs. This innovation facilitated the development of models capable of processing and generating text in a more coherent and fluent way.

Building upon the Transformer architecture, researchers scaled models to unprecedented sizes, leading to the emergence of LLMs such as Google’s BERT and T5 and OpenAI’s GPT series, including, more recently, GPT-3 and GPT-4.

In a nutshell, training LLMs involves optimizing a large number of parameters on very large datasets. This process, known as pretraining, typically employs unsupervised learning objectives, such as predicting missing words in a sentence (masked language modeling) or forecasting subsequent words (causal language modeling). As a side effect, the pretraining phase lets the model learn and “understand” a language, resulting in a remarkable ability to generalize across various tasks, often achieving state-of-the-art performance. LLMs have demonstrated proficiency in a diverse array of applications, reflecting their versatility and depth of language understanding. Key areas include text generation, language translation, question answering, and summarization, among many others.

Given the strengths of LLMs in unstructured text processing and generative tasks, an exciting frontier emerges when we consider their integration with graphs.
While LLMs excel in understanding and generating natural language, graphs are particularly powerful for representing and analyzing structured relationships between entities. In the rest of the book, we will see examples of how we can take advantage of both.

Why combine GraphML with LLMs?

As we have learned throughout this book, GraphML excels at representing and analyzing structured data such as knowledge graphs, social networks, chemical structures, and so on. It is extremely useful for situations where exploiting relationships between entities is crucial for achieving good performance. LLMs, by contrast, are particularly good at interpreting unstructured text, offering generative skills, reasoning, and profound contextual awareness. They excel at language-based activities such as content creation, question answering, and summarization.

Despite their impressive capabilities, LLMs are not without limitations. One of the most significant challenges is the problem of hallucination, where an LLM generates factually incorrect or misleading information that appears plausible. This is particularly problematic in domains requiring high factual accuracy, such as healthcare, finance, and legal applications. To mitigate hallucinations and enhance the reliability of LLM outputs, Retrieval-Augmented Generation (RAG) has emerged as a powerful technique. RAG works by dynamically retrieving relevant information from an external knowledge source (such as a knowledge graph) at inference time, rather than just relying on pre-trained knowledge. This approach gives the model access to up-to-date and accurate data, grounding answers in verified information rather than generating content purely from its internal representations.

Recent advancements highlight how integrating GraphML with LLMs can drive significant innovation, enabling the development of applications that require both rich semantic understanding and relational analysis.
For instance:

Graph-Augmented Question Answering: LLMs can leverage knowledge graphs to answer domain-specific questions with factual accuracy.

Node Embedding Generation: State-of-the-art frameworks such as GraphGPT use LLMs to generate node embeddings directly from textual data, enabling seamless integration with graph structures.

Knowledge Graph Construction and Enhancement: LLMs can be used to extract semantic relationships and entities from text in order to enrich existing knowledge graphs.

Therefore, by bridging the gap between structured knowledge and natural language understanding, the synergy between GraphML and LLMs paves the way for more accurate, explainable, and intelligent systems.

In the next section, we will explore the state-of-the-art trends in combining GraphML and LLMs, as well as the current challenges.

State-of-the-art trends and challenges

Before diving into specific examples, it is crucial to understand the current landscape of GraphML and LLM integration. According to a recent survey by Jin et al. (https://arxiv.org/abs/2312.02783, 2024), application scenarios can be grouped into three main categories:

Pure Graphs: These are graphs that lack associated textual information. Examples include social networks, traffic networks, and protein interaction networks. In such cases, the focus is on leveraging LLMs to process and analyze the structural aspects of the graph data.

Text-Attributed Graphs: In these graphs, nodes or edges are enriched with textual attributes. For instance, in academic networks, papers (nodes) come with titles and abstracts, while authors (nodes) have profiles. E-commerce networks also fall into this category, where products (nodes) have descriptions, and user interactions (edges) may include reviews.
The challenge here is to effectively combine the textual content with the graph’s structural information.

Text-Paired Graphs: This scenario involves graphs that are paired with separate textual descriptions or documents. Unlike text-attributed graphs, where text is embedded within the graph as attributes, text-paired graphs treat the graph and text as distinct but related entities. A pertinent example is molecular graphs accompanied by detailed textual descriptions of their properties. The objective is to align and integrate the information from both the graph structure and the associated text to enhance understanding and analysis.

To effectively utilize LLMs in these scenarios, three primary techniques can be used: LLMs as predictors, LLMs as encoders, and LLMs as aligners. Let’s look at these approaches one by one.

LLMs as predictors

The simplest and most direct approach is to use LLMs as predictors. In this paradigm, the LLM operates as a tool to infer outcomes directly from graph data. Imagine a scenario where textual information is either minimal or entirely absent (pure graphs). In this case, you can transform the graph data into a format that the LLM can process, such as converting graph structures into sequences or textual descriptions.

For instance, consider a simple social network graph where nodes represent people and edges indicate friendships (Figure 12.1). This structure can be converted into a textual narrative, such as "Alice is linked with Bob". An LLM can then process this narrative to predict new relationships or infer additional attributes about the nodes, such as professional interests or potential connections.

Figure 12.1: Examples of how graphs can be converted to text narratives

Once the data is prepared, the LLM can be fine-tuned or prompted to perform specific tasks.
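The graph-to-text conversion described above can be sketched in a few lines of plain Python. The helper names (`graph_to_text`, `link_prediction_prompt`), the example edge list, and the prompt wording are hypothetical choices for this sketch; the resulting string would still have to be sent to whichever LLM you use, which is not shown here.

```python
def graph_to_text(edges, relation="is linked with"):
    """Turn an undirected edge list into one sentence per edge."""
    return [f"{u} {relation} {v}." for u, v in edges]


def link_prediction_prompt(edges, source, target):
    """Build a prompt asking whether two nodes are likely to be connected."""
    narrative = " ".join(graph_to_text(edges))
    question = (
        f"Question: Is {source} likely to be linked with {target}? "
        "Answer yes or no."
    )
    return f"{narrative}\n{question}"


# A tiny friendship graph: nodes are people, edges are friendships.
edges = [("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Dave")]

prompt = link_prediction_prompt(edges, "Alice", "Carol")
print(prompt)
```

Each edge becomes one sentence, so the prompt grows linearly with the number of edges. That is easy to reason about for small graphs, but it is also why this encoding becomes costly for large ones.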
These tasks might include node classification, such as identifying the role of individuals in a social network, or link prediction, such as forecasting interactions between entities. In molecular research, LLMs as predictors can help determine the properties of chemical compounds based solely on their structural representations.

One advantage of this approach is its simplicity: LLMs can be applied directly to graph data without requiring extensive preprocessing or specialized models. However, this simplicity can also be a limitation. Purely structural information might not always be sufficient for complex tasks, particularly when additional contextual or textual data is available but not leveraged. Moreover, scalability and cost must be considered: encoding entire graphs as text can lead to an explosion of sentences, making inference expensive, potentially inefficient, and sometimes impossible (for example, if the LLM’s context window, the maximum amount of text it can process at once, is too small to contain the whole graph). Performance may also be limited, as this approach is similar to providing an LLM with a structured dataset and expecting accurate predictions without tailored adaptations. For this reason, more complex graph2text formalisms can be designed, incorporating node/edge descriptions into textual narratives while balancing efficiency and accuracy.

LLMs as encoders

When graphs are enriched with textual attributes, the LLM as encoder approach becomes particularly powerful. Here, the LLM is tasked with processing and encoding the textual information associated with nodes or edges, producing meaningful representations that can be integrated with the graph’s structural features.
These embeddings are then integrated into the graph through suitable algorithms such as graph neural networks, which process the combined representation to perform downstream tasks.

This hybrid representation combines the strengths of both modalities, capturing the nuances of text alongside the relationships encoded in the graph. As depicted in Figure 12.2, each node could have attributes, such as a name and a brief bio for a node representing a person, while the edges might be annotated with information about the nature of the relation, for example, close friend or colleague in a graph representing a social network. These features can be converted into a textual narrative, such as "Alice, a software engineer, is close friends with Bob, a data scientist."

Figure 12.2: Examples of how LLMs can be used as encoders for node attributes

Other examples include academic citation networks, where papers (nodes) come with titles, abstracts, and keywords. An LLM can process these textual attributes to generate embeddings that encapsulate their semantic content. These embeddings are then combined with graph-specific features, such as the citation relationships between papers, to create a unified representation. Similarly, in e-commerce platforms, product descriptions and user reviews can be encoded by LLMs to enhance product similarity graphs or user behavior analysis.

It is worth noting that the process of using LLMs as encoders often involves fine-tuning the LLM on domain-specific textual data to ensure that the embeddings accurately reflect the requirements of the task.

This encoder approach offers significant benefits. By leveraging textual data, it captures context and nuances that purely structural methods might miss.
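As a rough illustration of the encoder pipeline, the sketch below turns node text into vectors and then mixes each node's vector with those of its neighbours. Note the assumptions: `toy_text_encoder` is a hashed bag-of-words stand-in used only so the example runs without a model (a real pipeline would call an LLM embedding model at that point), and `mean_aggregate` is a single, weight-free round of neighbourhood averaging, far simpler than a trained graph neural network. All names and example texts are hypothetical.

```python
import hashlib


def toy_text_encoder(text, dim=8):
    """Stand-in for an LLM encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else vec


def mean_aggregate(features, edges):
    """Average each node's vector with its neighbours' vectors (undirected edges)."""
    neighbours = {node: {node} for node in features}  # every node includes itself
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return {
        node: [
            sum(features[m][i] for m in group) / len(group)
            for i in range(len(features[node]))
        ]
        for node, group in neighbours.items()
    }


node_text = {
    "Alice": "software engineer who likes graph databases",
    "Bob": "data scientist working on graph neural networks",
    "Carol": "pastry chef",
}
edges = [("Alice", "Bob")]  # Alice and Bob are close friends; Carol is isolated

features = {node: toy_text_encoder(text) for node, text in node_text.items()}
hidden = mean_aggregate(features, edges)

print(hidden["Alice"] == hidden["Bob"])      # connected nodes share one neighbourhood
print(hidden["Carol"] == features["Carol"])  # an isolated node keeps its text vector
```

After aggregation, each node's representation reflects both its own text and the text of the nodes it is connected to, which is the hybrid representation the section describes.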
The encoder approach is particularly effective in scenarios where textual attributes can provide critical insights, such as identifying the themes of academic papers or understanding user preferences in recommendation systems.

LLMs as aligners

The goal here is to align and integrate the information from both structure and textual descriptions (or accompanying documents, in the case of text-paired graphs), enabling a comprehensive analysis that leverages the strengths of each. This can be achieved, for example, by finding a shared latent space or a semantic mapping that connects the two modalities. Such an approach might involve designing models that jointly optimize both modalities or using attention mechanisms to focus on the most relevant parts of each input.

In more detail, the synergy between textual encoding (handled by the LLM) and graph structure encoding (handled by, for example, a GNN) can typically be achieved in two ways:

1. Prediction Alignment: Iterative training where LLMs and GNNs generate pseudo-labels to guide each other’s learning
2. Latent Space Alignment: Contrastive learning to align the latent representations of the text and the graph structure in a shared space (e.g., Figure 12.3)

Figure 12.3: Graphs and associated texts can be embedded in a shared latent space

For example, in molecular research, a molecular graph might represent the structure of a compound, while a textual description provides information about its properties, synthesis, or applications. In this context, an LLM can be used to process the text to extract relevant features and align these with the structural characteristics of the graph, enabling tasks such as property prediction or drug discovery.

As you can imagine, this approach is particularly powerful in interdisciplinary fields where graphs and text provide complementary points of view.
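To give a feel for latent space alignment, the following sketch computes a symmetric contrastive (InfoNCE-style) loss between graph embeddings and text embeddings, where the i-th graph and the i-th text form a matching pair. This is one common way to write such an objective, not necessarily the formulation used by any specific method; the two-dimensional vectors, the temperature of 0.1, and the function names are assumptions chosen to keep the example small.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy where candidates[i] is the positive for anchors[i]."""
    loss = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / temperature for c in candidates]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_sum - logits[i]
    return loss / len(anchors)


def alignment_loss(graph_emb, text_emb, temperature=0.1):
    """Symmetric loss: graph-to-text plus text-to-graph, averaged."""
    g = [normalise(v) for v in graph_emb]
    t = [normalise(v) for v in text_emb]
    return 0.5 * (info_nce(g, t, temperature) + info_nce(t, g, temperature))


graph_emb = [[1.0, 0.0], [0.0, 1.0]]     # e.g. two molecular graphs
aligned_text = [[0.9, 0.1], [0.1, 0.9]]  # descriptions close to their own graph
swapped_text = [[0.1, 0.9], [0.9, 0.1]]  # descriptions close to the wrong graph

print(alignment_loss(graph_emb, aligned_text))
print(alignment_loss(graph_emb, swapped_text))
```

The loss is small when each text sits near its own graph in the shared space and large when the pairs are mismatched. In training, minimising this quantity with respect to the encoder parameters is what pulls matching graph and text embeddings together.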
In computational social science, for instance, social graphs representing interactions between individuals can be aligned with news articles, social media posts, or other textual data to study the spread of information or public sentiment. Similarly, in e-commerce, user behavior graphs can be integrated with textual reviews to improve personalized recommendations.

Now that we have a clearer understanding of the LLM and graph landscape, let’s dive into a practical example of how this integration works. We will explore this in the next section.

Conclusion

In summary, the convergence of LLMs and GraphML represents a major step forward in building next-generation AI systems that combine natural language intelligence with graph-based relational reasoning. By addressing challenges such as hallucination through techniques like Retrieval-Augmented Generation (RAG), and by applying frameworks where LLMs act as predictors, encoders, and aligners, this hybrid approach opens new possibilities for question answering, recommendation systems, knowledge graph enhancement, and explainable AI.

This article is an excerpt from the book Graph Machine Learning – Second Edition. Readers who want to explore these concepts in greater depth, along with practical examples and broader GraphML applications, can continue reading in the book.

Author Bio

Aldo Marzullo received an M.Sc. degree in computer science from the University of Calabria (Cosenza, Italy) in September 2016. During his studies, he developed a solid background in several areas, including algorithm design, graph theory, and machine learning. In January 2020, he received his joint Ph.D. from the University of Calabria and Université Claude Bernard Lyon 1 (Lyon, France), with a thesis titled Deep Learning and Graph Theory for Brain Connectivity Analysis in Multiple Sclerosis.
He is currently a postdoctoral researcher and collaborates with several international institutions.

Enrico Deusebio is currently working as an engineering manager at Canonical, the publisher of Ubuntu, to promote open source technologies in the data and AI space and to make them more accessible to everyone. He has been working with data and distributed computing for over 15 years, both in an academic and industrial context, helping organizations implement data-driven strategies and build AI-powered solutions. He has collaborated and worked with top-tier universities, such as the University of Cambridge, the University of Turin, and the Royal Institute of Technology (KTH) in Stockholm, where he obtained a Ph.D. in 2014. He holds a B.Sc. and an M.Sc. degree in aerospace engineering from Politecnico di Torino.

Claudio Stamile received an M.Sc. degree in computer science from the University of Calabria (Cosenza, Italy) in September 2013 and, in September 2017, he received his joint Ph.D. from KU Leuven (Leuven, Belgium) and Université Claude Bernard Lyon 1 (Lyon, France). During his career, he developed a solid background in AI, graph theory, and machine learning with a focus on the biomedical field.


Rachelle Palmer, Jeffrey Allen, Parker Faucher, Alison Huh, Lander Kerbey, Maya Raman, and Lauren Tran

16 Mar 2026

10 min read

MongoDB CRUD Operations Explained: Create, Read, Update, Delete with mongosh and Python

Our Data Engineering Byte Newsletter gives data engineers and practitioners what they often lack today: clear, real-world insights, where every byte tells a story. Subscribe here to stay ahead in data engineering.

Introduction

CRUD operations form the backbone of how developers interact with MongoDB databases. Whether you're building a small prototype or scaling a production application, understanding how to create, read, update, and delete data is essential. These four operations allow developers to insert new documents, retrieve existing information, modify stored data, and remove records when they are no longer needed.

In MongoDB, CRUD operations can be performed using different tools and programming environments. Developers often begin with mongosh (MongoDB Shell) for direct database interaction and testing queries, while application development typically uses official drivers such as PyMongo for Python. This excerpt introduces the core CRUD concepts and demonstrates how to perform them using both mongosh and the MongoDB Python driver, helping developers understand the fundamental workflows for managing data in MongoDB.

MongoDB CRUD operations

CRUD operations are the foundation of interacting with a MongoDB deployment. These operations require you to connect to the MongoDB server before you can query the relevant documents, adjust the specified properties, and then send the data back to the database. Each CRUD operation serves a distinct purpose:

- The create operation creates and inserts new documents into the database
- The read operation queries a document or documents in the database
- The update operation modifies existing documents in the database
- The delete operation removes documents from the database

Basic CRUD with mongosh

mongosh, also known as the MongoDB Shell, is an environment for interacting with MongoDB deployments. It is equivalent to the administration console that relational databases use.
You can download mongosh at https://www.mongodb.com/docs/mongodb-shell/install/.

This section explains how to connect to your deployment with mongosh, insert a document into a database, query your database for a specified document, update a document, and delete a document. For more in-depth information and examples, you can also refer to the CRUD operations section of the MongoDB database manual at https://www.mongodb.com/docs/manual/crud/.

Connecting to MongoDB

Before you can perform CRUD operations, you must connect to your MongoDB deployment. To connect with mongosh, you must specify your deployment connection string as well as any required parameters.

For example, enter a variation of the following command in your terminal to connect to a MongoDB deployment with mongosh:

    mongosh "mongodb+srv://mycluster.packt.mongodb.net/myDatabase" --username myUsername --password myPassword

Creating documents

To create a document in mongosh, you can use the db.collection.insertOne() command. This mongosh command creates a new document and inserts the created document into the specified collection.

For example, the following db.collection.insertOne() command creates and inserts a new document into the library collection of the database.
The new document has a title field set to Mastering MongoDB 8.0 and an isbn field set to 101:

    db.library.insertOne( { title: 'Mastering MongoDB 8.0', isbn: '101' } )

When you successfully create and insert a document, MongoDB returns a confirmation output that contains the ObjectId value of the new document:

    { acknowledged: true, insertedId: ObjectId("652024e7ab44f3bf77788a3d") }

You can also use the following mongosh commands to create documents:

- db.collection.insertMany()
- db.collection.updateOne() with the upsert: true option
- db.collection.updateMany() with the upsert: true option
- db.collection.findAndModify() with the upsert: true option
- db.collection.findOneAndUpdate() with the upsert: true option
- db.collection.findOneAndReplace() with the upsert: true option

Reading documents

Read operations in MongoDB are also called queries. To perform a basic read operation, or query, use the db.collection.find() method and specify the selection criteria of the documents that you want to read.

For example, the following operation queries for a document in the library collection that contains the isbn field value of 101:

    db.library.find( { isbn: '101' } )

If the library collection contains a document that matches your specified selection criteria, MongoDB returns an array that contains the matching document:

    [ { _id: ObjectId("652024e7ab44f3bf77788a3d"), title: 'Mastering MongoDB 8.0', isbn: '101' } ]

You can also use the following mongosh command to read a single document:

- db.collection.findOne()

Updating documents

To update a document, you can use the db.collection.updateOne() command.
This command finds a document that matches the specified criteria and updates the found document.

For example, the following code finds the document in the library collection that has an isbn field value of 101 and updates that document so that it contains a price field with a value of 30:

    db.library.updateOne( { isbn: '101' }, { $set: { price: 30 } } )

When you successfully update a document, MongoDB returns the following confirmation summary:

    { acknowledged: true, insertedId: null, matchedCount: 1, modifiedCount: 1, upsertedCount: 0 }

You can also use the following mongosh commands to update documents:

- db.collection.updateMany()
- db.collection.replaceOne()
- db.collection.findOneAndReplace()
- db.collection.findOneAndUpdate()
- db.collection.findAndModify()

MongoDB 8.0 also introduces the ability to sort documents within an update operation. If you specify a sort parameter in an operation that uses db.collection.updateOne(), db.collection.replaceOne(), or the update command, MongoDB sorts the matching documents before updating, so that you can select a specific document for the operation when multiple documents match the query.

For example, the following update operation sorts all books by price and updates the lowest-priced book to be on sale for 25% off:

    db.library.updateOne( { }, [ { $set: { price: { $multiply: ["$price", 0.75] } } } ], { sort: { price: 1 } } )

Deleting documents

To delete a document, you can use the db.collection.deleteOne() command.
This command queries for a document that matches specified criteria and deletes the found document.

For example, the following operation deletes a document in the library collection that has an isbn field value of 101:

    db.library.deleteOne( { isbn: '101' } )

When you successfully delete a document, MongoDB returns the following output:

    { acknowledged: true, deletedCount: 1 }

You can also use the following mongosh commands to delete documents:

- db.collection.deleteMany()
- db.collection.remove()
- db.collection.findOneAndDelete()
- db.collection.findAndModify()

Basic CRUD with the Python driver

As an alternative to mongosh, MongoDB provides official drivers to interact with your MongoDB deployment. For a full list of official MongoDB driver libraries, see https://www.mongodb.com/docs/drivers/.

The following sections walk you through performing CRUD operations with the MongoDB Python driver, PyMongo. PyMongo provides a bridge between Python programs and MongoDB's document-oriented NoSQL database.

Installing and connecting to PyMongo

You can use pip to install PyMongo. You must have Python installed prior to installing PyMongo.

For example, the following operation installs PyMongo through your terminal:

    python -m pip install pymongo

To connect to your MongoDB deployment with PyMongo, you must use your connection string.

For example, the following code snippet in a .py file connects to a MongoDB deployment:

    from pymongo import MongoClient

    uri = "<connection string>"
    client = MongoClient(uri)

    try:
        # A successful ping confirms that the deployment is reachable
        client.admin.command('ping')
        print("Pinged your deployment. You successfully connected to MongoDB!")
    except Exception as e:
        print(e)

If you run the preceding code in a .py file and successfully connect to your deployment, the script prints the following confirmation:

    Pinged your deployment. You successfully connected to MongoDB!

You can test your connection to your MongoDB deployment on Atlas with the following code block in a .py file.
This code block uses the asyncio asynchronous framework together with the Motor driver (the motor package, which is installed separately from PyMongo):

    import asyncio
    from motor.motor_asyncio import AsyncIOMotorClient
    from pymongo.server_api import ServerApi

    async def ping_server():
        uri = "<connection string>"
        client = AsyncIOMotorClient(uri, server_api=ServerApi('1'))
        try:
            await client.admin.command('ping')
            print("Pinged your deployment. You successfully connected to MongoDB!")
        except Exception as e:
            print(e)

    asyncio.run(ping_server())

If you run the preceding code in a .py file and successfully connect to your deployment, the script prints the following confirmation:

    Pinged your deployment. You successfully connected to MongoDB!

Creating documents

To create and insert documents with PyMongo, use the insert_one command.

For example, the code block that follows performs the following operations:

- Specifies the library collection within the resources database
- Defines a new document that represents a book titled Python and MongoDB
- Inserts the new document into the library collection
- Prints the ObjectId value of the new document

    library = client.resources.library
    book = {
        'isbn': '301',
        'name': 'Python and MongoDB',
        'meta': {'version': 'MongoDB 8.0'},
        'price': 60
    }
    insert_result = library.insert_one(book)
    # inserted_id holds the ObjectId that MongoDB assigned to the new document
    print(insert_result.inserted_id)

If you successfully insert a document with this script, it prints the ObjectId value of the inserted document.

Reading documents

You can read documents from your database with queries.

To read a document, specify the condition(s) of the document you want to read with a find_one command.

For example, the code block that follows performs the following actions:

- Specifies the library collection in the resources database
- Finds a document with the name value of Python and MongoDB
- Prints the found document

    library = client.resources.library
    result = library.find_one( {"name": "Python and MongoDB"} )
    print(result)

Updating documents

To update a document with PyMongo, use the update_one command.
This command finds a document based on specified criteria and updates the found document.

For example, the code block that follows performs the following operations:

- Finds the document where the name field is Advanced MongoDB Techniques
- Sets the price field of the found document to the value of 75
- Prints the result of the update_one operation
- Prints the updated document

    update_result = library.update_one(
        {"name": "Advanced MongoDB Techniques"},
        {"$set": {"price": 75}}
    )
    print(update_result.raw_result)
    updated_document = library.find_one( {"name": "Advanced MongoDB Techniques"} )
    print(updated_document)

Deleting a document

To delete a document with PyMongo, use the delete_one command. For example, the code block that follows performs the following operations:

- Specifies the isbn value of the book to delete as 303
- Uses the delete_one command to delete a book from the library collection with the desired isbn value
- Prints the result of the delete_one operation

    isbn_to_delete = '303'
    delete_result = library.delete_one({"isbn": isbn_to_delete})
    print(delete_result.raw_result)

If the operation successfully deleted the desired document, the script prints the following confirmation:

    {'n': 1, 'ok': 1.0}

The n value indicates the number of documents that MongoDB deleted. In this case, MongoDB deleted one document. The ok value indicates whether the operation caused any errors. The 1.0 value in the preceding confirmation signifies that MongoDB did not encounter an error with this operation.

Conclusion

CRUD operations are the fundamental building blocks for working with MongoDB. By learning how to connect to a MongoDB deployment and perform create, read, update, and delete operations, developers gain the ability to manage application data efficiently. Tools such as mongosh provide a powerful command-line environment for direct database interaction, while drivers like PyMongo allow seamless integration between MongoDB and programming languages such as Python.
Together, these tools enable developers to prototype, test, and build robust data-driven applications.

This section provides a practical overview of MongoDB CRUD workflows, illustrating how documents can be inserted, queried, modified, and removed using both the MongoDB Shell and the Python driver.

This content is an excerpt from The Official MongoDB Guide, published by Packt and MongoDB, and written by Rachelle Palmer, Jeffrey Allen, Parker Faucher, Alison Huh, Lander Kerbey, Maya Raman, and Lauren Tran.

Author Bio

Rachelle Palmer is the Product Leader for Developer Database Experience and Developer Education at MongoDB, overseeing the driver client libraries, documentation, framework integrations, and MongoDB University. She has built sample applications for MongoDB in Java, PHP, Rust, Python, Node.js, and Ruby. Rachelle joined MongoDB in 2013 and was previously the Director of the Technical Services Engineering team, creating and managing the team that provided support and CloudOps to MongoDB Atlas.

Jeffrey Allen is a Technical Writer at MongoDB, based in the New York City area. He focuses on server documentation and works closely with Product and Engineering teams to develop examples that reflect real-world use cases, especially pertaining to data modeling and the MongoDB query language. Before joining MongoDB, Jeff worked as a full-stack web developer. Jeff also enjoys playing guitar and piano and produces electronic music in his spare time.

Parker Faucher is a self-taught Software Engineer with over six years of experience in technical education. He has authored more than 100 educational videos for MongoDB, establishing himself as a knowledgeable resource in database technologies. Currently, Parker focuses on Artificial Intelligence and Search technologies, exploring innovative solutions in these rapidly evolving fields.
When not advancing his technical expertise, Parker enjoys spending quality time with his family and maintains an avid interest in collecting comic books.

Lander Kerbey is a Technical Writer at MongoDB, specializing in Atlas Stream Processing. Prior to MongoDB, he worked at InterSystems, documenting various parts of the HealthShare suite, along with their various data analytics tools. His 11 years of experience as an educator informs his approach to documentation, as he strives to create “Aha!” moments for users.

Maya Raman is a Technical Writer and sometimes-software engineer at MongoDB. She is passionate about the intersections between the environment, art and literature, technology, and people. She is based in New York City and likes to spend her time hanging out in Prospect Park, even in winter.

Lauren Tran is a Technical Writer at MongoDB with a background in Communications and Computer Science. She is on the Server Documentation team and mainly specializes in Information Architecture, Time Series data, and Search. Lauren is passionate about creating accessible and inclusive documentation that caters to diverse audiences. She is based in Chicago following four years in New York City and in her free time, enjoys reading on the beach at Lake Michigan and listening to Taylor Swift music.


Luca Massaron, Bojan Tunguz, Konrad Banachewicz

27 Feb 2026

10 min read

How to Fine-Tune Gemma LLM for Low-Resource Languages (Kaggle Winning Strategies)


Serge Gershkovich

17 Feb 2026

10 min read

Mastering Slowly Changing Dimensions in Snowflake Data Warehouses

Introduction

In modern analytics environments, few modeling challenges are as important or as misunderstood as Slowly Changing Dimensions (SCDs). Business attributes do not stand still. Customers change status, relocate, upgrade tiers, and evolve over time. The real challenge is not just recording those changes. It is preserving the right version of history so that facts remain analytically correct.

In Snowflake-powered data warehouses, modeling dimensions properly means balancing performance, maintainability, and historical accuracy. Should attributes be overwritten, versioned, split into mini-dimensions, or tracked with effective dates and surrogate keys?

This article explores the full spectrum of SCD strategies from Type 0 through Type 7 using a practical CUSTOMER example. You will see how different change-tracking techniques impact reporting, how surrogate keys and validity intervals maintain point-in-time accuracy, and how Snowflake-native features can be used to implement these patterns efficiently.

If you would like to follow along, the scripts used in these examples are available here: https://github.com/PacktPublishing/Data-Modeling-with-Snowflake2E/tree/main/ch14
Dimensions overview

A dimension unifies (or conforms) similar attributes from one or various source systems into a single table under a common unique identifier known as a business key. A single surrogate key can also be used in place of multi-column business or primary keys. The unique key of a dimension table plays a critical role in identifying dimension records and allows the database team to track and maintain changes over time. A dimension table can be structured in predetermined ways to allow for different types of change tracking depending on the business requirement.

SCD types

Attributes within a dimension have differing requirements for durability and change tracking. Some attributes are updated directly, others require historical snapshots, and yet others cannot change at all. This section will cover the types of SCDs, or update profiles, that a given attribute in a dimension can have.

It’s important to note that the dimension type may not necessarily apply across all dimension attributes equally. Within the same dimension table, some attributes may be overwritten while others may not. By understanding SCD types and when to use them, database developers can implement the proper table structure and update techniques to satisfy the organization’s reporting and analytics requirements.

Example scenario

To explain the various SCD types, we will use a simplified CUSTOMER dimension as an example and track the change as it would appear under each configuration.

Suppose our fact table stores order details from customer X, made on the first of every month in 2022. Thanks to X’s patronage, their customer status went from new to active midway through the year.
Not only do we want to track when the change occurred, but we also want to tie the correct status to the recorded sales facts (that is, the customer is active today, but half their orders were made with a status of new).

The change in customer status is displayed here as it currently appears in the source system and data warehouse landing area:

Figure: A changed record in the source system and data warehouse raw schema

With this scenario in mind, let’s explore the SCD types.

Type 0: Maintain original

Ironically, the first SCD, Type 0, does not change. Type 0 dimensions are intended for durable attributes that cannot change due to their business nature. Examples of Type 0 attributes include birth dates, calendar dates, and any attribute recorded at record creation that needs to be tracked as a baseline, such as original price, weight, or date of first login.

Type 1: Overwrite

Type 1 attributes do not require historical tracking and may be directly overwritten with an UPDATE statement. Sometimes, the latest attribute value is all that the business cares about. For example, our organization demands the latest customer status, and previous values are irrelevant. Maintaining a Type 1 dimension is relatively simple: if the status changes, it is updated directly in the customer dimension, as illustrated here:

Figure: Record updated for a STATUS change in a Type 1 SCD

However, overwriting values is often not enough; a historical value must also be preserved.

Type 2: Add a new row

For some attributes, an organization must register the latest value and maintain prior historical records. Type 2 attributes generate a new row every time a change is recorded. Generating new rows for a given business key means that uniqueness is violated unless a time dimension (the effective date) is added to the primary key.
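As a minimal illustration of the add-a-row behavior, the following Python sketch keeps a customer dimension as a list of dictionaries whose key is the pair (customer_id, from_date). The function name apply_type2_change, the column names, the sample dates, and the choice to close the old row one day before the change are assumptions made for this example; the high date of 9999-12-31 is a common convention for marking the current row. It is a sketch of the idea, not a Snowflake implementation.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # conventional "still current" end date


def apply_type2_change(dimension, customer_id, new_status, change_date):
    """Close the current row for customer_id and append a new version."""
    for row in dimension:
        if row["customer_id"] == customer_id and row["to_date"] == HIGH_DATE:
            if row["status"] == new_status:
                return dimension  # nothing changed, keep the current row
            # End the old version the day before the new one starts
            # (assumed convention, so inclusive date ranges never overlap).
            row["to_date"] = change_date - timedelta(days=1)
    dimension.append(
        {
            "customer_id": customer_id,
            "status": new_status,
            "from_date": change_date,
            "to_date": HIGH_DATE,
        }
    )
    return dimension


# Hypothetical starting point: customer X is "new" from the start of 2022.
customer_dim = [
    {"customer_id": "X", "status": "new",
     "from_date": date(2022, 1, 1), "to_date": HIGH_DATE},
]
apply_type2_change(customer_dim, "X", "active", date(2022, 7, 1))

# The business key alone is no longer unique, but (customer_id, from_date) is.
keys = [(r["customer_id"], r["from_date"]) for r in customer_dim]
assert len(keys) == len(set(keys))
for r in customer_dim:
    print(r["customer_id"], r["status"], r["from_date"], r["to_date"])
```

After the call, the dimension holds two rows for customer X: the closed "new" version and the current "active" version.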
The effective date of a Type 2 SCD not only separates historical values for a given business key but also allows those records to be tied to fact tables at a given point in time.

Maintaining a Type 2 SCD requires creating new rows when record changes are detected and additional metadata columns to track them. A single record change in our example would generate the following change in a Type 2 table:

Figure: New row generated for a change in a Type 2 SCD

The following metadata fields make working with Type 2 attributes easier:

- Validity intervals: Because the business key is being duplicated with each change, another column must be added to the primary key to maintain uniqueness. Validity intervals (also named valid_from/to, start/end_date) provide the additional unique value for the primary key and timestamp when the change occurred, allowing facts to be linked with the correct point-in-time dimension value. The TO_DATE column also provides a flag for identifying the latest record using the standard surrogate high date of 9999-12-31.
- Hash: Using a hashing function, such as MD5, provides a quick and standard way to identify when record changes occur. This concept is borrowed from Data Vault (discussed in Chapter 18, Scaling Data Models Through Modern Techniques). When there are many Type 2 attributes in a table, instead of checking for changes one by one, hash all of them into a single column and compare them in a single go, as follows:
  Create the hash field: SELECT MD5(Col1 || Col2 || ... || ColN) AS hash
  Compare the hash field: IFF(hash_new = hash_old, 'same', 'changed')

Type 3: Add a new column

Type 3 dimensions track changes by adding a new column to store the previous value when a change occurs. The original column is updated and not renamed to avoid breaking any existing downstream references.
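The column shuffle described here can be sketched in a few lines of Python. The names apply_type3_change, previous_status, and status_changed_on are hypothetical, chosen only for this illustration, and a dictionary stands in for a single dimension row.

```python
from datetime import date


def apply_type3_change(row, new_status, change_date):
    """Shift the current value into the 'previous' column, then overwrite."""
    if row["status"] == new_status:
        return row  # no change, nothing to track
    row["previous_status"] = row["status"]   # keeps exactly one prior value
    row["status"] = new_status               # original column keeps its name
    row["status_changed_on"] = change_date   # when the switch happened
    return row


# Hypothetical row for customer X before the status change.
customer = {"customer_id": "X", "status": "new",
            "previous_status": None, "status_changed_on": None}
apply_type3_change(customer, "active", date(2022, 7, 1))
print(customer)
```

Queries that read the status column keep working unchanged, because only its value is replaced.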
An effective date metadata column records the time of the change, allowing analytics processes to use the new or historical value based on their validity period.

An example of a status update in a Type 3 attribute is given here:

Figure: New column created for a change in a Type 3 SCD

Although Type 3 is easier to maintain than Type 2, its limitation is storing multiple changes. While Type 2 attributes can change as often as needed, generating new rows each time, Type 3 can only show one change without creating additional columns, which is not a scalable design if regular changes occur.

Type 4: Add a mini dimension

When SCDs become quickly changing dimensions, due to rapidly changing attributes, the number of records that Type 2 dimensions generate can cause performance issues. This is especially true with dimensions containing many records (millions of rows or more).

In a Type 4 scenario, the solution is to split the frequently changing attributes into a separate mini dimension. To further curtail the number of records, the values in the mini dimension can be banded within business-agreed value ranges that provide a meaningful breakdown for analysis. The mini dimension has its own surrogate key and does not contain the main dimension foreign key, allowing both to retain a relatively low cardinality. However, to tie the main dimension to the mini, the mini dimension foreign key must be included in the fact table (as the main dimension appears at the time of the generated fact).

On a diagram, the arrangement of a Type 4 dimension would look like this:

Figure: A Type 4 SCD on a relational diagram

For our example, the business wants to track the length in months for how long a customer has been active, as well as their total yearly spend at the time of each sale.
To avoid generating a record for each month and order placed, the business teams have agreed to group the MONTHS_ACTIVE attribute into two categories (less than or greater than five months) and band the sales volume into three groups. The mini dimension would need to contain every possible (or allowable by existing business rules) combination of groupings.

Our example would look like this (notice how the profile ID changes throughout the year as a function of the customer’s attributes):

Figure: Mini dimension and foreign key in a fact table in a Type 4 SCD

While this arrangement satisfies the reporting requirement, bridging dimension tables via a fact encumbers analysis on the dimension itself. To unify the main and mini dimensions into one, a Type 5 SCD is used.

Type 5: Type 4 mini dimension + Type 1

A Type 5 SCD is an extension of the Type 4 mini-dimension technique, adding the mini-dimension key as a Type 1 attribute in the main dimension (hence the name, 4+1 = 5). This approach affords the performance gains of a Type 4 dimension by avoiding the explosive growth of rapidly changing Type 2 records and gives users a simple way to unify the main dimension with the mini dimension through a common join column.

On a diagram, the arrangement of a Type 5 dimension would look like this:

Figure: A Type 5 SCD and related view on a relational diagram

Notice that to further simplify the user experience, a view is created over the main and mini dimensions to give the users a single entity to work with. Analysis of the fact table becomes more versatile by allowing users to join on one entity (the view) instead of the main and mini dimensions if historical values are not required.

The same scenario described in the section on Type 4 would look like this under Type 5:

Figure: Mini-dimension and a related view in a Type 5 SCD

Unfortunately, Type 4, and by extension, Type 5, suffer from the inconvenience of calculating the mini-dimension value to include it as part of each fact.
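To make that per-fact calculation concrete, here is a minimal Python sketch that bands the two attributes and looks up a mini-dimension key. The five-month split comes from the Type 4 example; the spend thresholds, the band labels, the treatment of exactly five months, and names such as profile_id_for are assumptions for illustration only.

```python
from itertools import product

MONTHS_BANDS = ["under 5 months", "5 months or more"]
SPEND_BANDS = ["low", "medium", "high"]  # assumed labels for the three groups

# The mini dimension holds every combination of bands, each with its own key.
MINI_DIMENSION = {
    combo: profile_id
    for profile_id, combo in enumerate(product(MONTHS_BANDS, SPEND_BANDS), start=1)
}


def months_band(months_active):
    """Band MONTHS_ACTIVE; exactly five months falls in the upper band here."""
    return MONTHS_BANDS[0] if months_active < 5 else MONTHS_BANDS[1]


def spend_band(yearly_spend):
    """Band yearly spend using hypothetical thresholds."""
    if yearly_spend < 1000:
        return SPEND_BANDS[0]
    if yearly_spend < 5000:
        return SPEND_BANDS[1]
    return SPEND_BANDS[2]


def profile_id_for(months_active, yearly_spend):
    """Return the mini-dimension key to store on a fact row."""
    return MINI_DIMENSION[(months_band(months_active), spend_band(yearly_spend))]


print(len(MINI_DIMENSION))       # 2 month bands x 3 spend bands = 6 profiles
print(profile_id_for(2, 300))    # early in the year, low spend
print(profile_id_for(8, 6200))   # later in the year, high spend
```

The lookup has to run for every fact row at load time, which is the extra cost this design introduces.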
The performance implications involved in adding the mini-dimension foreign key to the fact table should be weighed against the performance gain in reducing the number of dimension records through the use of the mini dimension.

Type 6: The Type 1, 2, and 3 hybrid

A Type 6 SCD is so named because it combines the techniques of Type 1, 2, and 3 (1+2+3 = 6) dimensions into one table. Based on business needs, users will demand different levels of historical values to achieve a balance of detail and flexibility in their analytics.

Suppose our customer X from previous examples began to relocate, moving headquarters to Mexico in 2023, then to Brazil in 2024. A Type 6 approach yields a dimension table that gives analysts every possible temporal attribute value in every snapshot: a Type 1 current value, a Type 2 effective dated value, and a Type 3 previous value.

To recap the status and country changes mentioned in this example, a snapshot of the source system over time is presented here:

Figure: Source system showing changes for customer X

In a business scenario where the customer status needed Type 2 and the country was presented as Type 1, 2, and 3, the resulting table would look like this (the HASH column is now calculated as a function of STATUS and COUNTRY):

Figure: Type 1, 2, and 3 columns combine in a Type 6 SCD

Type 7: Complete as-at flexibility

Business users across all cultures and industries have a penchant for changing their minds.
The Type 7 approach gives database modelers a way to deliver the needed historical attribute no matter the criteria or temporal reference point requested.

A Type 7 dimension (unimaginatively named as the number that follows 6) includes a natural key and a surrogate key in a Type 2 table structure and embeds both in the fact table.

A method for generating surrogate keys

An efficient, Data Vault-inspired way to generate a surrogate key for Type 2 records is to use an MD5 hash on the compound primary key (in this example, CUSTOMER_ID and FROM_DATE):

    SELECT MD5(customer_id || from_date) AS customer_skey

In a Type 7 configuration, a surrogate key is added to an otherwise Type 2 structure and is embedded in the fact (the latest SKEY as of the creation of each fact record). Based on the example scenario from the Type 6 section, the tables would look like this:

Figure: A Type 7 SCD offers complete analytical flexibility

A Type 7 SCD allows business users to select the appropriate customer attributes based on the following criteria:

- The most recent or current information (that is, TO_DATE = '9999-12-31')
- The primary effective date on the fact record (that is, LOAD_DATE)
- When the user changes their mind, any date associated with the fact record (that is, ORDER_DATE or SHIPMENT_DATE)

Here is how those queries might look:

    --get current
    SELECT <fact and attribute fields>
    FROM order o
    INNER JOIN customer c USING(customer_id)
    WHERE c.to_date = '9999-12-31'

    --get dimension values as at the primary effective date on the fact record
    SELECT <fact and attribute fields>
    FROM order o
    INNER JOIN customer c USING(customer_skey)

    --get dimension values as-at any date on the fact record
    --example will use SHIPMENT_DATE
    SELECT <fact and attribute fields>
    FROM order o
    INNER JOIN customer c USING(customer_id)
    WHERE o.shipment_date BETWEEN c.from_date AND c.to_date

Now that you have a general understanding of the different SCD types, let’s recap before detailing the Snowflake recipes used to construct them.

Overview of SCD types

The following figure summarizes the SCD types covered in the previous section (Type 0 through Type 7), including their maintenance strategy and usage. While eight (including Type 0) SCDs may seem like a lot, most database designs rarely go beyond Type 3, as the first four SCD types strike an acceptable balance of performance, maintainability, and historical reporting needs:

Figure: A comparison of SCD types

Now, let’s see how to build SCDs with maximal efficiency using Snowflake-specific features.

Conclusion

Slowly Changing Dimensions are not a one-size-fits-all solution. Some attributes should never change (Type 0), others can be safely overwritten (Type 1), and many require full historical tracking (Type 2). As business complexity grows, hybrid approaches such as Types 4, 5, 6, and 7 offer increasing levels of analytical flexibility. They help organizations answer not only what is true now, but also what was true at the time of each transaction.

Choosing the right SCD type is ultimately a business decision supported by sound modeling principles. The goal remains consistent: preserve the integrity of historical analysis while keeping data structures scalable and performant.

For a deeper dive into implementing SCDs in Snowflake, including practical SQL recipes, hashing strategies, dynamic approaches, and performance considerations, you can learn more in the book Data Modeling with Snowflake, Second Edition by Serge Gershkovich. Modeling guides are often steeped in theory. This book’s approach combines practical modeling concepts with Snowflake best practices and unique features, allowing you to create efficient designs that leverage the power of the Data Cloud.

Author Bio

Serge Gershkovich is a seasoned data architect with decades of experience designing and maintaining enterprise-scale data warehouse platforms and reporting solutions. He is a leading subject matter expert, speaker, content creator, and Snowflake Data Superhero.
Serge earned a bachelor of science degree in information systems from the State University of New York (SUNY) Stony Brook. Throughout his career, Serge has worked in model-driven development from SAP BW/HANA to dashboard design to cost-effective cloud analytics with Snowflake. He currently serves as product success lead at SqlDBM, an online database modeling tool.


Greg Deckler

12 Feb 2026

10 min read

Creating a Semantic Model in Power BI


Make sure to subscribe to our BIPro Newsletter so you never miss a key update in the data world. Join 35K+ BI lovers and get tips from those who’ve cracked tough data challenges!

Introduction

A well-designed semantic model is the foundation of every effective Power BI report. It defines how data tables connect, how calculations behave, and how insights are ultimately presented to users. Without a solid semantic model, even the most visually appealing reports can produce misleading or inconsistent results. In this article, you’ll learn how to create and explore a semantic model in Power BI, understand the Model view, and build relationships that allow data to aggregate correctly across tables, setting the stage for accurate analysis and meaningful business insights.

Technical requirements

The following are needed in order to successfully complete the instructions provided in this chapter:

- An internet connection
- Microsoft Power BI Desktop
- Chapter 5 Start.pbix downloaded from GitHub at https://github.com/PacktPublishing/Learn-Microsoft-Power-BI_3E

Creating a semantic model

The concept of a semantic model, data model, or dataset is fundamental to Power BI. In short, a semantic model is defined by the tables that are created from Power Query queries, the metadata (data about data) regarding the columns within the tables, and finally, the relationships that are defined between tables. Relationships are needed to connect individual tables to one another. In Power BI, the semantic model is stored within an Analysis Services tabular cube. It is the creation of this semantic model that enables self-service analytics and reporting.

In Chapter 4, Connecting to and Transforming Data, we connected to various sources of data (three different Excel files) that in turn created seven different queries, which ultimately resulted in four queries that loaded data tables into our semantic model.
We can now stitch those individual tables, along with our previously created data table, into a cohesive semantic model that is used for further analysis.

If you are continuing from Chapter 4, continue to use the same Power BI Desktop file after completing that chapter. Otherwise, download and load Chapter 5 Start.pbix as specified in the Technical requirements section of this chapter.

Touring the Model view

So far, we have explored the overall architecture of Power BI Desktop and Power Query Editor. We will now explore the Model view within Power BI Desktop. To switch to the Model view, click the third icon from the top in the Views bar of Power BI Desktop.

The Model view is shown in the following screenshot:

Figure 5.1 – Model view

The Model view provides an interface for building our semantic model. It does this by creating relationships between tables, as well as defining metadata for tables and columns. We can even create multiple layouts for our semantic model. The Model view interface is similar to other Desktop views, as shown in Figure 5.1:

- Header: As shown in Figure 5.1, the Header area is identical to what was described in the Touring the Desktop section of Chapter 3, Up and Running with Power BI Desktop.
- Ribbon: The Ribbon is nearly identical to what we described in the Touring the Desktop section of Chapter 3, Up and Running with Power BI Desktop, with the notable exception that only three tabs are available: File, Home, and Help. If third-party extensions are installed, the External tools tab is also present.
- Views: The Views area is identical to what was described in the Touring the Desktop section of Chapter 3, Up and Running with Power BI Desktop.
- Canvas: As mentioned in the Touring the Desktop section of Chapter 3, Up and Running with Power BI Desktop, when in the Model view, this area displays layouts of tables within the semantic model, as well as their relationships to one another.
A default All tables layout is created automatically.
- Panes: As described in the Touring the Desktop section of Chapter 3, Up and Running with Power BI Desktop, only two panes are present: the omnipresent Data pane and the Properties pane. The Properties pane is exclusive to the Model view and allows us to associate metadata with various fields or columns within the tables of the semantic model. This includes the ability to specify synonyms and descriptions, as well as data types, data categories, and default aggregations or summarizations.
- Layouts: The Pages area from the Report view is replaced by Layouts. The Layouts area allows us to create multiple layouts or views of the data tables within the model, as well as to rename and reorder the layouts.
- Footer: As described in the Touring the Desktop section of Chapter 3, Up and Running with Power BI Desktop, in the Model view, the Footer area provides various viewing controls, such as the ability to zoom in and out, reset the layout, and fit the model to the current display area.

This completes our tour of the Model view. Let’s next see how we can change the layout of our semantic model.

Modifying the layout

In the Model view, we have a default All tables layout, which was created for us automatically by Power BI.

To modify this layout, follow these steps:

1. Minimize the Data and Properties panes by clicking on the arrow icon in the pane headers (>).
2. Click on the tables and drag them closer together. Use the Fit to screen icon at the far right of the Footer to zoom in on the table layout. You should now be able to clearly see the table names and columns in the tables.
3. Move the Calendar and People tables next to one another in the top center of the Canvas by clicking on and then dragging and dropping. Place the Budgets and Forecasts table and the Hours table underneath these two tables. Use the Fit to screen icon in the footer to zoom in on the tables.
Note that we cannot see all of the columns in the Hours table or the Calendar table. A right-hand scroll bar is present on both tables.
4. Use the sizing handle in the bottom-right corner of the table to adjust the size so that we can see all of the columns in the table. When finished, your Canvas should look similar to this:

Figure 5.2 – Data tables in the model

Now that we have modified the layout such that we can easily see all tables on the screen, let’s continue building our semantic model by defining relationships between the individual tables.

Creating and understanding relationships

Now that we can clearly see our tables and columns, we can create relationships between our tables. Creating relationships between tables allows calculations and aggregations to work across tables so that multiple columns can be used from separate, related tables.

For example, once the People table and the Hours table are related to one another, we can use the ID column from the People table and the TotalHoursBilled column from the Hours table. By doing this, TotalHoursBilled aggregates correctly for each user Identifier (ID) in the People table.

To create this relationship between the People table and the Hours table, click on the ID column in the People table and drag and drop it onto the EmployeeID column in the Hours table. The New relationship dialog is displayed, as follows:

Figure 5.3 – New relationship dialog

Click the Save button and note that a line appears on the canvas that connects the People table to the Hours table. This creates a relationship between the ID column in the People table and the EmployeeID column in the Hours table.

You can check the columns that are involved in a relationship by hovering your mouse over the relationship line.
The line changes slightly and the columns involved in the relationship become shaded.

Figure 5.4 – Relationship between two tables

If you notice that ID and EmployeeID are not the columns associated with the relationship, right-click the relationship line and choose Delete. Then, try again.

Note that the line has 1 next to the People table and * next to the Hours table. This means that this relationship is one-to-many or many-to-one.

In other words, there are unique row values in the People table that match multiple rows in the Hours table. This makes sense since each employee submits an Hours report for every day. The designation of 1 (unique) or * (many) defines the cardinality of the relationship between the tables.

There are actually four different cardinalities for relationships in Power BI, as outlined here:

- One-to-one: This means that there are unique values in each table.
- Many-to-one: This means that there are unique values in one table that match multiple rows in another table.
- One-to-many: This means that there are unique values in one table that match multiple rows in another table.
- Many-to-many: This means that neither table has unique values for rows. It is generally good practice to avoid these types of relationships because of their complexity and the amount of processing and resources required.

The designation of many-to-one versus one-to-many is simply a matter of which table is defined first within a relationship. In other words, the relationship between our People table and Hours table could be either many-to-one or one-to-many, depending on which table we defined first in our relationship.

Note that there is also an arrow icon in the middle of the line that points from People to Hours. This indicates that the People table filters the Hours table, but not vice versa. This is known as the cross-filter direction. Cross-filter directions can be either Single or Both, meaning that the filtering occurs only one way or bidirectionally.
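The cardinality labels above come down to whether the key values on each side of a relationship are unique. The following minimal Python sketch illustrates that idea outside Power BI. It is not how Power BI detects cardinality internally; the `infer_cardinality` helper, the employee names, and the sample rows are all hypothetical. It also mimics a Single cross-filter direction by letting each People row filter the matching Hours rows.

```python
def infer_cardinality(first_keys, second_keys):
    """Label a relationship by checking key uniqueness on each side.

    Hypothetical helper for illustration only; the order of the two
    arguments plays the role of "which table is defined first".
    """
    first_unique = len(set(first_keys)) == len(first_keys)
    second_unique = len(set(second_keys)) == len(second_keys)
    if first_unique and second_unique:
        return "one-to-one"
    if first_unique:
        return "one-to-many"
    if second_unique:
        return "many-to-one"
    return "many-to-many"


# Hypothetical sample rows standing in for the People and Hours tables.
people = [
    {"ID": 1, "Name": "Ana"},
    {"ID": 2, "Name": "Ben"},
]
hours = [
    {"EmployeeID": 1, "Hours": 8.0},
    {"EmployeeID": 1, "Hours": 6.5},
    {"EmployeeID": 2, "Hours": 7.0},
]

people_ids = [p["ID"] for p in people]
hours_ids = [h["EmployeeID"] for h in hours]

# The same relationship gets a different label depending on table order.
people_first = infer_cardinality(people_ids, hours_ids)
hours_first = infer_cardinality(hours_ids, people_ids)

# Single-direction filtering: each People row selects its Hours rows,
# so the hours aggregate per person.
totals = {
    p["Name"]: sum(h["Hours"] for h in hours if h["EmployeeID"] == p["ID"])
    for p in people
}

print(people_first, hours_first, totals)
```

With People listed first the helper reports one-to-many, and with Hours listed first it reports many-to-one, which mirrors the point that the two labels describe the same relationship.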
In simple semantic models, you generally don’t have to worry about cross-filter direction, but it can become very important if the complexity of the semantic model increases.

Finally, note that the line that forms the relationship between the tables is solid. A solid line indicates an active relationship. Inactive relationships can be created between tables, and these are represented as a dashed line. In Power BI, there can only be a single active path between tables.

As models become more complex, multiple paths from one table to another can be created. In those cases, one or more of the relationships or pathways become inactive. However, even though the relationship is inactive by default, it can be used within calculations that specify using specific relationships.

We can view and modify the relationship definition by double-clicking on the relationship line. This brings up the Edit relationship dialog, which looks nearly identical to the New relationship dialog shown in Figure 5.3.

Note that since the Hours table is defined first and the People table is defined second, the Cardinality value of our relationship is Many to one (*:1). Also, note that the relationship is active and that the Cross filter direction value is Single.

In many-to-one and one-to-many relationships, a Cross filter direction value of Single means that the one side of the relationship filters the many side. Finally, note that the EmployeeID column in the Hours table and the ID column in the People table are highlighted in gray. This shows us the columns that form a relationship between the two tables. Close the Edit relationship dialog by clicking the Cancel button.

Now, let’s create another relationship, this time between our Calendar table and the Hours table. To do this, we can use the Manage relationships functionality, as follows:

1. Click on the Home tab of the ribbon. Then, in the Relationships section, choose the Manage relationships button.
The Manage relationships dialog is displayed, as shown in the following screenshot:

Figure 5.5 – Manage relationships dialog

Here, we can see our existing Active relationship between the Hours table and the People table, including the columns involved in the relationship in parentheses. From this dialog, we can Edit or Delete the relationship or have Power BI attempt to Autodetect relationships between tables. Power BI can sometimes autodetect relationships between tables based on the column names and row values.

2. Select the + New relationship button. This displays the New relationship dialog, as shown in the following screenshot:

Figure 5.6 – Create relationship dialog

3. In the first drop-down menu, choose the Hours table. A preview of the table will be displayed.
4. Click the Date column.
5. Choose Calendar in the second drop-down menu. A preview of this table is displayed.
6. Choose the Date column from this table. Power BI detects the appropriate Cardinality and Cross filter direction values. Since there are no other relationships between these tables, Power BI checks the Make this relationship active checkbox.
7. Click the Save button to create this relationship. Note that this new relationship now appears in the Manage relationships dialog.
8. Click the Close button to close the Manage relationships dialog. In the canvas, there will now be a relationship line linking the Calendar table to the Hours table.

Congratulations! You have successfully linked the three separate tables to create a semantic model.

For now, we won’t link the Budgets and Forecasts table within our semantic model. This is a good time to save your work.

Now that our semantic model is created, let’s explore it a little more closely by creating some visuals.

Exploring the semantic model

Before we move on, let’s do some exploration of our data in order to understand our data, our semantic model, and how this data can ultimately be viewed by end users. To do this, follow these steps:

1. Start by clicking on the Report view in the Views bar.
2. At the bottom of the report canvas, click the plus (+) icon next to Page 1. This creates a new blank page, Page 2. Double-click Page 2 and rename it Utilization, and then press the Enter key. This changes the name of the page to Utilization.
3. Expand the People table in the Data pane by clicking the small arrow to the left of the People table. Check the box next to the Name field. This creates a Table visualization on our report canvas with the names of employees. Use the sizing handles for this visualization to resize the table to take up the entire page.
4. Expand the Hours table. Make sure that the table visual is selected and then check the box next to the Hours field. The number of hours reported by each employee is shown next to their name. This occurs because of the relationship between our People table and our Hours table. Because these tables are joined by a relationship based on the ID of each employee, we can use fields from both tables in visualizations. Due to this, the rows in the Hours table are automatically filtered based on this relationship.

Note that when the Table visualization is selected, the Hours and People tables have small checkmarks next to them. This indicates that the visualization contains fields from these tables.

5. We know that the Category field of the hours is important, regardless of whether the hours are billable or not. With the table selected, click on the Matrix visualization in the Visualizations pane. This icon is to the immediate right of the highlighted Table visualization. Note that the Build visual pane changes from just displaying a Values field well to now containing field wells for Rows, Columns, and Values. Our Name field is now under Rows, while our Hours field is under Values and shows up as Sum of Hours.
6. From the Hours table, click on Category and drag and drop this field into the Columns field well.
We can now see a breakdown of the hours for each employee by category.

It is now obvious that we can calculate a simple version of our utilization metric by taking the number of Billable hours for each employee and dividing them by the Total number of hours for each employee. Save your work before continuing.

Conclusion

Building a semantic model in Power BI is a critical step in transforming raw data into reliable, self-service analytics. By organizing tables, defining relationships, and understanding how filtering and cardinality work, you enable calculations and visuals to behave as expected across your reports. This foundation not only improves accuracy and performance but also makes your Power BI solutions easier to scale and maintain over time.

If you’d like to explore this topic in greater depth with guided examples, best practices, and real-world scenarios, you can learn more in Learn Microsoft Power BI - Third Edition by Greg Deckler, where semantic modeling and calculations are covered as part of a complete, end-to-end Power BI learning journey. This book helps you master self-service BI and become your team’s data hero or launch a new career. You’ll learn how to clean data, build models, and use advanced analytics in Power BI to unlock valuable insights and drive better business decisions.

Author Bio

Greg Deckler is a Vice President at a global consulting services firm. In addition, he is a 7-time Microsoft MVP for Data Platform. As an active member of the Columbus IT community, he founded the Columbus Azure ML and Power BI User Group (CAMLPUG) and has presented at many different conferences and events, including Dog Food, SharePoint Saturday, CloudDevelop, and M3. He is also the author of the Microsoft Hates Greg's Quick Measures (MSHGQM) tool and the associated YouTube channels Microsoft Hates Greg and DAX For Humans.


Dipankar Mazumdar, Vinoth Govindarajan

29 Jan 2026

10 min read

ACID Transactions in Lakehouse Architectures: MVCC, OCC, and Conflict Resolution with Iceberg, Hudi, and Delta Lake



Jon Handler, Soujanya Konka, Prashant Agrawal

12 Jan 2026

15 min read

Semantic Search in OpenSearch



Keith Bourne

13 Dec 2024

15 min read

Building Trust in AI: The Role of RAG in Data Security and Transparency


This article is an excerpt from the book, "Unlocking Data with Generative AI and RAG", by Keith Bourne. Master Retrieval-Augmented Generation (RAG), the most popular generative AI tool, to unlock the full potential of your data. This book enables you to develop highly sought-after skills as corporate investment in generative AI soars.

Introduction

As the adoption of Retrieval-Augmented Generation (RAG) continues to grow, its potential to address key security challenges in AI-driven applications is becoming evident. Far from merely introducing risks, RAG offers a robust framework to enhance data protection, ensure accuracy, and maintain transparency in content generation. This article delves into the multifaceted security benefits of RAG, while also addressing the unique challenges it poses and strategies to mitigate them.

How RAG can be leveraged as a security solution

Let’s start with the most positive security aspect of RAG. RAG can actually be considered a solution to mitigate security concerns, rather than cause them. If done right, you can limit data access by user, ensure more reliable responses, and provide more transparency of sources.

Limiting data

RAG applications may be a relatively new concept, but you can still apply the same authentication and database-based access approaches you would use with web and similar types of applications. This provides the same level of security you can apply in these other types of applications. By implementing user-based access controls, you can restrict the data that each user or user group can retrieve through the RAG system. This ensures that sensitive information is only accessible to authorized individuals. Additionally, by leveraging secure database connections and encryption techniques, you can safeguard the data at rest and in transit, preventing unauthorized access or data breaches.

Ensuring the reliability of generated content

One of the key benefits of RAG is its ability to mitigate inaccuracies in generated content.
By allowing applications to retrieve proprietary data at the point of generation, the risk of producing misleading or incorrect responses is substantially reduced. Feeding the most current data available through your RAG system helps to mitigate inaccuracies that might otherwise occur.

With RAG, you have control over the data sources used for retrieval. By carefully curating and maintaining high-quality, up-to-date datasets, you can ensure that the information used to generate responses is accurate and reliable. This is particularly important in domains where precision and correctness are critical, such as healthcare, finance, or legal applications.

Maintaining transparency

RAG makes it easier to provide transparency in the generated content. By incorporating data such as citations and references to the retrieved data sources, you can increase the credibility and trustworthiness of the generated responses.

When a RAG system generates a response, it can include links or references to the specific data points or documents used in the generation process. This allows users to verify the information and trace it back to its original sources. By providing this level of transparency, you can build trust with your users and demonstrate the reliability of the generated content.

Transparency in RAG can also help with accountability and auditing. If there are any concerns or disputes regarding the generated content, having clear citations and references makes it easier to investigate and resolve any issues. This transparency also facilitates compliance with regulatory requirements or industry standards that may require traceability of information.

That covers many of the security-related benefits you can achieve with RAG. However, there are some security challenges associated with RAG as well. Let’s discuss these challenges next.

RAG security challenges

RAG applications face unique security challenges due to their reliance on large language models (LLMs) and external data sources.
Let’s start with the black box challenge, highlighting the relative difficulty in understanding how an LLM determines its response.

LLMs as black boxes

When something is in a dark, black box with the lid closed, you cannot see what is going on in there! That is the idea behind the black box when discussing LLMs, meaning there is a lack of transparency and interpretability in how these complex AI models process input and generate output. The most popular LLMs are also some of the largest, meaning they can have more than 100 billion parameters. The intricate interconnections and weights of these parameters make it difficult to understand how the model arrives at a particular output.

While the black box aspects of LLMs do not directly create a security problem, they do make it more difficult to identify solutions to problems when they occur. This makes it difficult to trust LLM outputs, which is a critical factor in most of the applications for LLMs, including RAG applications. This lack of transparency makes it more difficult to debug issues you might have in building a RAG application, which increases the risk of having more security issues.

There is a lot of research and effort in the academic field to build models that are more transparent and interpretable, called explainable AI. Explainable AI aims at making the operations of AI systems transparent and understandable. It can involve tools, frameworks, and anything else that, when applied to RAG, helps us understand how the language models that we use produce the content they are generating. This is a big movement in the field, but this technology may not be immediately available as you read this. It will hopefully play a larger role in the future to help mitigate black box risk, but right now, none of the most popular LLMs are using explainable models.
So, in the meantime, we will talk about other ways to address this issue.

You can use human-in-the-loop, where you involve humans at different stages of the process to provide an added line of defense against unexpected outputs. This can often help to reduce the impact of the black box aspect of LLMs. If your response time is not as critical, you can also use an additional LLM to perform a review of the response before it is returned to the user, looking for issues. We will review how to add a second LLM call in code lab 5.3, but with a focus on preventing prompt attacks. The concept is similar, in that you can add additional LLMs to do a number of extra tasks and improve the security of your application.

The black box isn’t the only security issue you face when using RAG applications though; another very important topic is privacy protection.

Privacy concerns and protecting user data

Personally identifiable information (PII) is a key topic in the generative AI space, with governments around the world trying to determine the best path to balance user privacy with the data-hungry needs of these LLMs. As this gets worked out, it is important to pay attention to the laws and regulations that are taking shape where your company is doing business and make sure all of the technologies you are integrating into your RAG applications adhere to them. Many companies, such as Google and Microsoft, are taking these efforts into their own hands, establishing their own standards of protection for their user data and emphasizing them in training literature for their platforms.

At the corporate level, there is another challenge related to PII and sensitive information. As we have said many times, the nature of the RAG application is to give it access to the company data and combine that with the power of the LLM.
For example, for financial institutions, RAG represents a way to give their customers unprecedented access to their own data in ways that allow them to speak naturally with technologies such as chatbots and get near-instant access to hard-to-find answers buried deep in their customer data.

In many ways, this can be a huge benefit if implemented properly. But given that this is a security discussion, you may already see where I am going with this. We are giving unprecedented access to customer data using a technology that has artificial intelligence, and as we said previously in the black box discussion, we don’t completely understand how it works! If not implemented properly, this could be a recipe for disaster with massive negative repercussions for companies that get it wrong. Of course, it could be argued that the databases that contain the data are also a potential security risk. Having the data anywhere is a risk! But without taking on this risk, we also cannot provide the significant benefits they represent.

As with other IT applications that contain sensitive data, you can forge forward, but you need to have a healthy fear of what can happen to data and proactively take measures to protect that data. The more you understand how RAG works, the better job you can do in preventing a potentially disastrous data leak. These steps can help you protect your company as well as the people who trusted your company with their data.

This section was about protecting data that exists. However, a new risk that has risen with LLMs has been the generation of data that isn’t real, called hallucinations. Let’s discuss how this presents a new risk not common in the IT world.

Hallucinations

We have discussed this in previous chapters, but LLMs can, at times, generate responses that sound coherent and factual but can be very wrong.
These are called hallucinations and there have been many shocking examples provided in the news, especially in late 2022 and 2023, when LLMs became everyday tools for many users.

Some are just funny with little consequence other than a good laugh, such as when ChatGPT was asked by a writer for The Economist, “When was the Golden Gate Bridge transported for the second time across Egypt?” ChatGPT responded, “The Golden Gate Bridge was transported for the second time across Egypt in October of 2016” (https://www.economist.com/by-invitation/2022/09/02/artificial-neural-networks-today-are-not-conscious-according-to-douglas-hofstadter).

Other hallucinations are more nefarious, such as when a New York lawyer used ChatGPT for legal research in a client’s personal injury case against Avianca Airlines, where he submitted six cases that had been completely made up by the chatbot, leading to court sanctions (https://www.courthousenews.com/sanctions-ordered-for-lawyers-who-relied-on-chatgpt-artificial-intelligence-to-prepare-court-brief/). Even worse, generative AI has been known to give biased, racist, and bigoted perspectives, particularly when prompted in a manipulative way.

When combined with the black box nature of these LLMs, where we are not always certain how and why a response is generated, this can be a genuine issue for companies wanting to use these LLMs in their RAG applications.

From what we know though, hallucinations are primarily a result of the probabilistic nature of LLMs. For all responses that an LLM generates, it typically uses a probability distribution to determine what token it is going to provide next. In situations where it has a strong knowledge base of a certain subject, these probabilities for the next word/token can be 99% or higher. But in situations where the knowledge base is not as strong, the highest probability could be low, such as 20% or even lower.
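To make the 99% versus 20% contrast concrete, here is a minimal Python sketch of picking the next token from a probability distribution. Everything in it is an assumption for illustration: the two distributions are invented, the `greedy_pick` helper is hypothetical, and it uses greedy selection (always taking the top token) for simplicity, whereas real deployments often sample from the distribution instead.

```python
def greedy_pick(probs):
    """Return the most probable token and its probability.

    Hypothetical helper: a stand-in for one decoding step, assuming
    greedy selection rather than sampling.
    """
    token = max(probs, key=probs.get)
    return token, probs[token]


# Invented next-token distributions (each sums to 1.0).
# Strong knowledge: one candidate dominates.
confident = {"Paris": 0.99, "Lyon": 0.006, "Nice": 0.004}

# Weak knowledge: the mass is spread thin, yet one token still "wins".
uncertain = {
    "2016": 0.20,
    "2014": 0.19,
    "2011": 0.18,
    "2009": 0.17,
    "1998": 0.14,
    "never": 0.12,
}

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    token, p = greedy_pick(dist)
    print(f"{name}: picked {token!r} with probability {p:.2f}")
```

Both calls return an answer with the same apparent certainty in the output text. The second distribution is the situation of interest here: the top candidate wins while holding only about a fifth of the probability mass.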
In these cases, it is still the highest probability and, therefore, that is the token that has the highest probability to be selected. The LLM has been trained on stringing tokens together in a very natural language way while using this probabilistic approach to select which tokens to display. As it strings together words with low probability, it forms sentences, and then paragraphs that sound natural and factual but are not based on high probability data. Ultimately, this results in a response that sounds very plausible but is, in fact, based on very loose facts that are incorrect.For a company, this poses a risk that goes beyond the embarrassment of your chatbot saying something wrong. What is said wrong could ruin your relationship(s) with your customer(s), or it could lead to the LLM offering your customer something that you did not intend to offer, or worse, cannot afford to offer. For example, when Microsoft released a chatbot named Tay on Twitter in 2016 with the intention of learning from interactions with Twitter users, users manipulated this spongy personality trait to get it to say numerous racist and bigoted remarks. This reflected poorly on Microsoft, which was promoting its expertise in the AI area with Tay, causing significant damage to its reputation at the time (https://www.theguardian.com/technology/2016/mar/26/microsoftdeeply-sorry-for-offensive-tweets-by-ai-chatbot).Hallucinations, threats related to black box aspects, and protecting user data can all be addressed through red teaming.ConclusionRAG represents a promising avenue for enhancing security in AI applications, offering tools to limit data access, ensure reliable outputs, and promote transparency. However, challenges such as the black box nature of LLMs, privacy concerns, and the risk of hallucinations demand proactive measures. By employing strategies like user-based access controls, explainable AI, and red teaming, organizations can harness the advantages of RAG while mitigating risks. 
As the technology evolves, a thoughtful approach to its implementation will be crucial for maintaining trust, compliance, and the integrity of data-driven solutions.Author BioKeith Bourne is a senior Generative AI data scientist at Johnson & Johnson. He has over a decade of experience in machine learning and AI working across diverse projects in companies that range in size from start-ups to Fortune 500 companies. With an MBA from Babson College and a master’s in applied data science from the University of Michigan, he has developed several sophisticated modular Generative AI platforms from the ground up, using numerous advanced techniques, including RAG, AI agents, and foundational model fine-tuning. Keith seeks to share his knowledge with a broader audience, aiming to demystify the complexities of RAG for organizations looking to leverage this promising technology.



Gus Frazer

11 Dec 2024

10 min read

Revolutionize Power BI Queries with OpenAI


This article is an excerpt from the book, Data Cleaning with Power BI, by Gus Frazer. Unlock the full potential of your data by mastering the art of cleaning, preparing, and transforming data with Power BI for smarter insights and data visualizations.

Introduction

Discover the transformative potential of leveraging Azure OpenAI, integrated with ChatGPT functionality, to enhance Power BI’s M query capabilities. In this article, we delve into how this powerful combination offers expert guidance, efficient solutions, and insightful recommendations for optimizing data transformation tasks. From generating M queries to streamlining complex transformations, explore how Azure OpenAI with ChatGPT empowers users to boost productivity and efficiency in Power BI.

Using OpenAI for M queries

Azure OpenAI, with ChatGPT functionality within it, can be a helpful tool for generating M queries in Power BI by providing suggestions, helping with syntax, and offering insights into data transformation tasks. In the following example, you will learn how you can leverage the chat playground within OpenAI to improve your productivity and efficiency when writing M queries. We will do this by asking a series of questions directly within Azure OpenAI.

Complete the next steps to follow along with the example in your own environment:

1. Click on Deployment on the left-hand side and then select Create new deployment to get started.
2. Select a model from the base models, in this case, select gpt-35-turbo, and then name your deployment. In this example, name it CleaningDataOpenAI.
3. Select Chat playground from the Azure OpenAI Studio home screen or from the Playground tab on the left of your screen. This should open the following view:

Figure 15.4 – Screenshot of the chat playground from within Azure OpenAI

Now that you have set up an instance of OpenAI, here’s an example conversation taken from OpenAI, where the chat functionality has been used to assist with writing M queries in Power BI:

User: Hi, I’m working on a Power BI project, and I need to transform some data using M queries. I have a table with sales data, and I want to calculate the total sales for each product category. Can you help me with the M query for that?

ChatGPT: Of course! To calculate the total sales for each product category, you can use the Group By transformation in Power BI’s Power Query Editor. Here’s a sample M snippet for your scenario:

    let
        Source = YourDataSource,
        GroupedTable = Table.Group(Source, {"ProductCategory"}, {{"TotalSales", each List.Sum([Sales]), type number}})
    in
        GroupedTable

Replace YourDataSource with the actual name of your data source. This code groups the data by the ProductCategory column and calculates the total sales for each group.

User: Thanks! That’s a great starting point. What if I want to filter out products with sales less than $1,000 before calculating the total sales?

ChatGPT: You can add a filtering step before the Group By transformation. Here’s an updated version of the M:

    let
        Source = YourDataSource,
        FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000),
        GroupedTable = Table.Group(FilteredTable, {"ProductCategory"}, {{"TotalSales", each List.Sum([Sales]), type number}})
    in
        GroupedTable

This code filters out rows where sales are less than $1,000 before grouping the data by product category and calculating the total sales.

User: Perfect! This is really helpful. One more thing – what if I want to sort the results in descending order based on total sales?

ChatGPT: You can add a sorting step using the Table.Sort function. Here’s the modified M:

    let
        Source = YourDataSource,
        FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000),
        GroupedTable = Table.Group(FilteredTable, {"ProductCategory"}, {{"TotalSales", each List.Sum([Sales]), type number}}),
        SortedTable = Table.Sort(GroupedTable, {{"TotalSales", Order.Descending}})
    in
        SortedTable

This code sorts the grouped table in descending order based on the “TotalSales” column.

This interaction represents a simple example of how those who can leverage the power of tools such as OpenAI and ChatGPT will be able to quickly upskill in areas such as coding. It has to be said, though, that because this technology is still in its infancy, it’s important to always test and validate the answers provided before implementing them in production. Also, ensure that you take precautions when using the publicly available ChatGPT model to avoid sharing sensitive data publicly. If you would like to use sensitive data, or you want to ensure that requests are made within a secured, governed environment, make sure to use the ChatGPT model within your own Azure OpenAI instance.

In more complex examples, optimizing Power Query transformations could involve efficient interaction with Azure OpenAI. This includes streamlining API calls, managing large datasets, and incorporating caching mechanisms for repetitive queries, ensuring a seamless and performant data cleaning process.

As we begin to explore the use cases where this technology can be most effective, there are a number of clear early winners:

Optimizing query plans: ChatGPT’s natural language understanding can assist in formulating more efficient Power Query plans. By describing the desired transformations in natural language, users can interact with ChatGPT to generate optimized query plans. This involves selecting the most suitable Power Query functions and structuring transformations for performance gains.

Caching strategies for repetitive queries: ChatGPT can guide users in devising effective caching strategies. By understanding the context of data transformations, it can recommend where to implement caching mechanisms to store and reuse intermediate results, minimizing redundant API calls and computations. The following is an example of just this, where I have asked Azure OpenAI to verify and optimize my query from the Power Query Advanced Editor. The model suggested I use the Table.Buffer function to help cache the table in memory and optimize the query.

Figure – An example request to OpenAI to help optimize my query for Power Query

Figure – An example response from OpenAI to help optimize my query for Power Query

Now, as we highlighted in Chapter 11, M Query Optimization, Table.Buffer can indeed improve the performance of your queries and refreshes, but this really depends on the data you are working with. In the previous example, the model doesn’t take the characteristics, size, or complexity of your data into consideration, as it isn’t plugged into your data at this stage. Also, linking back to the example you walked through in Chapter 11, where you place Table.Buffer can really impact how your query performs. In the previous example, if you were connecting to a small dataset, you would likely cause it to run slower by adding the Table.Buffer function as the second variable in the query.

Lastly, it’s worth mentioning that how you prompt these models is crucially important. In the previous example, we didn’t specify what type of data source we were using in our query. As such, the model didn’t warn that using Table.Buffer on a data source that supports query folding will break the fold. Again, this is not so much of a problem if Table.Buffer is placed at the end of your query for smaller datasets, but it is a problem if you add it nearer to the beginning of the query, like in the previous example.

Handling large datasets: Dealing with large datasets often poses a challenge in Power Query. OpenAI models, including ChatGPT, can provide insights into dividing and conquering large datasets. This includes strategies for parallel processing, filtering data early in the transformation pipeline, and using aggregations to reduce computational load.

Dynamic query adjustments: ChatGPT’s interactive nature allows users to dynamically adjust queries based on evolving requirements. It can assist in crafting queries that adapt to changing data scenarios, ensuring that Power Query transformations remain flexible and responsive to varied datasets.

Guidance on complex transformations: Power Query often involves intricate transformations. ChatGPT can act as a virtual assistant, guiding users through the process of complex transformations. It can suggest optimal function compositions, advise on conditional logic placement, and assist in structuring transformations to enhance efficiency. The best example of this can be seen in the following two screenshots of an active use case seen in many businesses. The example begins with a user asking the model for a description of what the query is doing. OpenAI then provides a breakdown of what the query is doing in each step to help the user interpret the code. It helps to break down the barriers to coding and also helps to decipher code that has not been documented well by previous employees.

Figure – An example request to OpenAI to help translate my query

Figure – An example response from OpenAI to help describe my query

Error handling strategies: Optimizing Power Query also entails robust error handling. ChatGPT can provide recommendations for anticipating and handling errors gracefully within a query. This includes strategies for logging errors, implementing fallback mechanisms, and ensuring the stability of the overall data preparation process.

In this section, you learned how to optimize Power Query transformations efficiently with Azure OpenAI. Key takeaways include using ChatGPT for natural-language-based query planning and effective caching strategies. Insights include handling large datasets through parallel processing, early filtering, and aggregations. This knowledge equips you to streamline and enhance your Power Query processes effectively.

In the next section, you will learn about Microsoft Copilot, how to set up a Power BI instance with Copilot activated, and also how you can use this new AI technology to help clean and prepare your data.

Conclusion

In conclusion, Azure OpenAI with ChatGPT presents a game-changing solution for maximizing Power BI’s potential. From query optimization to error-handling strategies, this integration streamlines processes and enhances productivity. As users navigate complex data transformations, the guidance provided fosters efficient decision-making and empowers users to tackle challenges with confidence. With Azure OpenAI and ChatGPT, the possibilities for revolutionizing Power BI workflows are endless, offering a glimpse into the future of data transformation and analytics.

Author Bio

Gus Frazer is a seasoned Analytics Consultant focused on Business Intelligence solutions. With over 7 years of experience working for the two market-leading platforms, Power BI and Tableau, he has amassed a wealth of knowledge and expertise. Gus has helped hundreds of customers to drive their digital and data transformations, scope data requirements, drive actionable insights, and, most important of all, cleanse data ready for analysis. Most recently, he has helped to set up, organize, and run the Power BI UK community at Microsoft. He holds 6 Azure and Power BI certifications, including the PL-300 and DP-500 certifications. In this book, Gus offers readers invaluable guidance on ingesting, preparing, and cleansing data for analysis in Power BI.


Marco Castorina, Gabriel Sassone

09 Dec 2024

15 min read

Optimizing Graphics Pipelines with Meshlets: A Guide to Efficient Geometry Processing


This article is an excerpt from the book, "Mastering Graphics Programming with Vulkan", by Marco Castorina and Gabriel Sassone. Mastering Graphics Programming with Vulkan starts by familiarizing you with the foundations of a modern rendering engine. This book will guide you through GPU-driven rendering and show you how to drive culling and rendering from the GPU to minimize CPU overhead. Finally, you’ll explore advanced rendering techniques like temporal anti-aliasing and ray tracing.

Introduction

In modern graphics pipelines, optimizing the geometry stage can have a significant impact on overall rendering performance. This article delves into the concept of meshlets, an approach to breaking down large meshes into smaller, more manageable chunks for efficient GPU processing. By subdividing meshes into meshlets, we can enhance culling techniques, reduce unnecessary shading, and better handle complex geometry. Join us as we explore how meshlets work, their benefits, and practical steps to implement them.

Breaking down large meshes into meshlets

In this article, we are going to focus primarily on the geometry stage of the pipeline, the one before the shading stage. Adding some complexity to the geometry stage of the pipeline will pay dividends in later stages as we’ll reduce the number of pixels that need to be shaded.

Note
When we refer to the geometry stage of the graphics pipeline, we don’t mean geometry shaders. The geometry stage of the pipeline refers to input assembly (IA), vertex processing, and primitive assembly (PA). Vertex processing can, in turn, run one or more of the following shaders: vertex, geometry, tessellation, task, and mesh shaders.

Content geometry comes in many shapes, sizes, and complexity. A rendering engine must be able to deal with meshes from small, detailed objects to large terrains. Large meshes (think terrain or buildings) are usually broken down by artists so that the rendering engine can pick out the different levels of detail based on each object’s distance from the camera.

Breaking down meshes into smaller chunks can help cull geometry that is not visible, but some of these meshes are still large enough that we need to process them in full, even if only a small portion is visible.

Meshlets have been developed to address these problems. Each mesh is subdivided into groups of vertices (usually 64) that can be more easily processed on the GPU.

The following image illustrates how meshes can be broken down into meshlets:

Figure 6.1 – A meshlet subdivision example

These vertices can make up an arbitrary number of triangles, but we usually tune this value according to the hardware we are running on. In Vulkan, the recommended value is 126 (as written in https://developer.nvidia.com/blog/introduction-turing-mesh-shaders/, the number is needed to reserve some memory for writing the primitive count with each meshlet).

Note
At the time of writing, mesh and task shaders are only available on Nvidia hardware through its extension. While some of the APIs described in this chapter are specific to this extension, the concepts can be generally applied and implemented using generic compute shaders. A more generic version of this extension is currently being worked on by the Khronos committee so that mesh and task shaders should soon be available from other vendors!

Now that we have a much smaller number of triangles per chunk, we can exercise much finer-grained control by culling meshlets that are not visible or are being occluded by other objects.

Together with the list of vertices and triangles, we also generate some additional data for each meshlet that will be very useful later on to perform back-face, frustum, and occlusion culling.

One additional possibility (that will be added in the future) is to choose the level of detail (LOD) of a mesh and, thus, a different subset of meshlets based on any wanted heuristic.

The first piece of additional data represents the bounding sphere of a meshlet, as shown in the following screenshot:

Figure 6.2 – A meshlet bounding spheres example; some of the larger spheres have been hidden for clarity

Some of you might ask: why not AABBs? AABBs require at least two vec3 of data: one for the center and one for the half-size vector. Another encoding could be to store the minimum and maximum corners. Instead, spheres can be encoded with a single vec4: a vec3 for the center plus the radius.

Given that we might need to process millions of meshlets, each saved byte counts!
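To make the size argument concrete, here is a minimal C++ sketch that compares the two encodings. The type and function names (Vec3, AabbCenterExtent, BoundingSphere, sphere_from_aabb, bounds_buffer_bytes) are hypothetical and are not part of the book’s code or of MeshOptimizer, which computes its own meshlet bounds. The sketch assumes 4-byte floats and tightly packed structs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical minimal vector type; a real engine would use its own math library.
struct Vec3 { float x, y, z; };

// AABB encoded as center + half-size: two vec3 values.
struct AabbCenterExtent {
    Vec3 center;
    Vec3 half_size;
};

// Bounding sphere encoded as a single vec4: xyz = center, w = radius.
struct BoundingSphere {
    Vec3  center;
    float radius;
};

// Assumption: no padding is inserted, which holds on mainstream compilers.
static_assert( sizeof( AabbCenterExtent ) == 6 * sizeof( float ), "AABB takes two vec3" );
static_assert( sizeof( BoundingSphere ) == 4 * sizeof( float ), "sphere takes one vec4" );

// Illustrative helper only: the smallest sphere that shares the box's center
// and still encloses the whole box (radius = length of the half-size vector).
inline BoundingSphere sphere_from_aabb( const AabbCenterExtent& box ) {
    assert( box.half_size.x >= 0.0f && box.half_size.y >= 0.0f && box.half_size.z >= 0.0f );
    const Vec3& h = box.half_size;
    return { box.center, std::sqrt( h.x * h.x + h.y * h.y + h.z * h.z ) };
}

// Total bytes needed to store one bound of type T per meshlet.
template <typename T>
constexpr std::size_t bounds_buffer_bytes( std::size_t meshlet_count ) {
    return meshlet_count * sizeof( T );
}
```

Under these assumptions, one million meshlets need 24,000,000 bytes of bounds data with the AABB encoding and 16,000,000 bytes with spheres, a saving of 8 bytes per meshlet.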
Spheres can also be more easily tested for frustum and occlusion culling, as we will describe later in the chapter.

The next additional piece of data that we’re going to use is the meshlet cone, as shown in the following screenshot:

Figure 6.3 – A meshlet cone example; not all cones are displayed for clarity

The cone indicates the direction a meshlet is facing and will be used for back-face culling.

Now that we have a better understanding of why meshlets are useful and how we can use them to improve the culling of larger meshes, let’s see how we generate them in code!

Generating meshlets

We are using an open source library called MeshOptimizer (https://github.com/zeux/meshoptimizer) to generate the meshlets. An alternative library is meshlete (https://github.com/JarkkoPFC/meshlete), and we encourage you to try both to find the one that best suits your needs.

After we have loaded the data (vertices and indices) for a given mesh, we are going to generate the list of meshlets. First, we determine the maximum number of meshlets that could be generated for our mesh and allocate memory for the vertices and indices arrays that will describe the meshlets:

    const sizet max_meshlets = meshopt_buildMeshletsBound( indices_accessor.count, max_vertices, max_triangles );

    Array<meshopt_Meshlet> local_meshlets;
    local_meshlets.init( temp_allocator, max_meshlets, max_meshlets );

    Array<u32> meshlet_vertex_indices;
    meshlet_vertex_indices.init( temp_allocator, max_meshlets * max_vertices, max_meshlets * max_vertices );

    Array<u8> meshlet_triangles;
    meshlet_triangles.init( temp_allocator, max_meshlets * max_triangles * 3, max_meshlets * max_triangles * 3 );

Notice the types for the indices and triangle arrays. We are not modifying the original vertex or index buffer, but only generating a list of indices into the original buffers. Another interesting aspect is that we only need 1 byte to store the triangle indices. Again, saving memory is very important to keep meshlet processing efficient!

The next step is to generate our meshlets:

    const sizet max_vertices = 64;
    const sizet max_triangles = 124;
    const f32 cone_weight = 0.0f;

    sizet meshlet_count = meshopt_buildMeshlets( local_meshlets.data, meshlet_vertex_indices.data, meshlet_triangles.data, indices, indices_accessor.count, vertices, position_buffer_accessor.count, sizeof( vec3s ), max_vertices, max_triangles, cone_weight );

As mentioned in the preceding step, we need to tell the library the maximum number of vertices and triangles that a meshlet can contain. In our case, we are using the recommended values for the Vulkan API. The other parameters include the original vertex and index buffer, and the arrays we have just created that will contain the data for the meshlets.

Let’s have a better look at the data structure of each meshlet:

    struct meshopt_Meshlet {
        unsigned int vertex_offset;
        unsigned int triangle_offset;
        unsigned int vertex_count;
        unsigned int triangle_count;
    };

Each meshlet is described by two offsets and two counts, one for the vertex indices and one for the indices of the triangles. Note that these offsets refer to meshlet_vertex_indices and meshlet_triangles, which are populated by the library, not the original vertex and index buffers of the mesh.

Now that we have the meshlet data, we need to upload it to the GPU. To keep the data size to a minimum, we store the positions at full resolution while we compress the normals to 1 byte for each dimension and UV coordinates to half-float for each dimension. In pseudocode, this is as follows:

    meshlet_vertex_data.normal = ( normal + 1.0 ) * 127.0;
    meshlet_vertex_data.uv_coords = quantize_half( uv_coords );

The next step is to extract the additional data (bounding sphere and cone) for each meshlet:

    for ( u32 m = 0; m < meshlet_count; ++m ) {
        meshopt_Meshlet& local_meshlet = local_meshlets[ m ];
        meshopt_Bounds meshlet_bounds = meshopt_computeMeshletBounds( meshlet_vertex_indices.data + local_meshlet.vertex_offset, meshlet_triangles.data + local_meshlet.triangle_offset, local_meshlet.triangle_count, vertices, position_buffer_accessor.count, sizeof( vec3s ) );
        ...
    }

We loop over all the meshlets and call the MeshOptimizer API that computes the bounds for each meshlet. Let’s see in more detail the structure of the data that is returned:

    struct meshopt_Bounds {
        float center[3];
        float radius;
        float cone_apex[3];
        float cone_axis[3];
        float cone_cutoff;
        signed char cone_axis_s8[3];
        signed char cone_cutoff_s8;
    };

The first four floats represent the bounding sphere. Next, we have the cone definition, which consists of the cone direction (cone_axis) and the cone angle (cone_cutoff). We are not using the cone_apex value as it makes the back-face culling computation more expensive. However, it can lead to better results.

Once again, notice that quantized values (cone_axis_s8 and cone_cutoff_s8) help us reduce the size of the data required for each meshlet.

Finally, meshlet data is copied into GPU buffers, and it will be used during the execution of task and mesh shaders.

For each processed mesh, we will also save an offset and count of meshlets to add a coarse culling based on the parent mesh: if the mesh is visible, then its meshlets will be added.

In this article, we have described what meshlets are and why they are useful to improve the culling of geometry on the GPU.

Conclusion

Meshlets represent a powerful tool for optimizing the rendering of complex geometries. By subdividing meshes into small, efficient chunks and incorporating additional data like bounding spheres and cones, we can achieve finer-grained control over visibility and culling processes. Whether you’re leveraging advanced shader technologies or applying these concepts with compute shaders, adopting meshlets can lead to significant performance improvements in your graphics pipeline. With libraries like MeshOptimizer at your disposal, implementing this technique has never been more accessible.

Author Bio

Marco Castorina first became familiar with Vulkan while working as a driver developer at Samsung. Later, he developed a 2D and 3D renderer in Vulkan from scratch for a leading media server company. He recently joined the games graphics performance team at AMD. In his spare time, he keeps up to date with the latest techniques in real-time graphics. He also likes cooking and playing guitar.

Gabriel Sassone is a rendering enthusiast currently working as a principal rendering engineer at The Multiplayer Group. He previously worked for Avalanche Studios, where he first encountered Vulkan and developed the Vulkan layer for the proprietary Apex Engine and its Google Stadia port. Before that, he worked at ReadyAtDawn, Codemasters, FrameStudios, and some other non-gaming tech companies. His spare time is filled with music and rendering, gaming, and outdoor activities.


