Data Science | Tech News, Tutorials & Expert Insights

article-image-how-to-build-supervised-anomaly-detection-models-for-time-series-data

03 Jun 2026

10 min read

How to Build Supervised Anomaly Detection Models for Time Series Data

03 Jun 2026

DataPro is a weekly, expert-curated newsletter trusted by 120k+ global data professionals. Built by data practitioners, it blends first-hand industry experience with practical insights and peer-driven learning.Make sure to subscribe here so you never miss a key update in the data world. IntroductionSupervised anomaly detection helps turn labelled time series data into practical models that can identify abnormal patterns, such as stockouts, sensor failures, or fraud events. Unlike standard classification tasks, anomaly detection often involves severe class imbalance and imperfect labels, making accuracy a poor measure of success. In this article, you’ll learn how to build a more reliable supervised anomaly detection workflow using residual features, CatBoost, time-aware validation, class weighting, and threshold tuning to improve precision and recall.Supervised anomaly detectionSupervised anomaly detection is simply classification, as any good academic will tell you. You have observations, labels, on which you train a model that distinguishes anomalous from non-anomalous observations. Chapters 14 and 15 covered some aspects of classification methodology, so we won’t repeat that here. What is worth discussing is what makes the anomaly detection variant of classification distinct from the general case, and what a practical pipeline looks like.Two things distinguish supervised anomaly detection from general classification problems: severe class imbalance and label quality. Both are structural rather than incidental, and both directly affect how you build and evaluate supervised detectors. The label quality problemLabels in anomaly detection are rarely clean. They come from one of three sources, each with its own contamination issue.Operational incident logs are accurate for large, impactful events (a week-long stockout, a confirmed sensor failure) but can miss subtle anomalies that don’t rise to the threshold of a ticket. The positive class is high-precision but recall can be low.Rules-based labelling pipelines are the most common source, and the most dangerous. Rules are written at a point in time, reflecting a data distribution at that moment. As the underlying process drifts, rules generate stale labels. Flagged observations are no longer anomalous (false positive labels) and new anomaly patterns are missed (false negatives).Prior unsupervised detector output is used when no other labels exist. If your profiling model drifted before retraining, or was itself trained on contaminated data, the labels it generated carry that error forward into your supervised model. This can result in compounded contamination, which makes you wonder if foundational anomaly detection is ever possible. Regardless, a bad unsupervised model produces bad labels; a supervised model trained on bad labels produces confidently wrong predictions.The practical response is to audit before training. But I have three million time series, you say. We should then inspect a random sample of our positive-class (anomaly) labels and verify them against raw data. If more than ~20% of your positive labels are wrong, a supervised model will likely underperform a well-tuned unsupervised approach. You’re better off fixing labels first. Class imbalanceWith anomalies comprising well under 5% of observations, often under 1%, a classifier that predicts“normal” everywhere achieves very high accuracy while being completely useless. This is why accuracy is a meaningless metric for anomaly classification, and why you should never use it as your optimization target. Some practical remedies are:• class_weight='balanced' (or auto_class_weights='Balanced' in CatBoost). This reweights the loss function by inverse class frequency, making a missed anomaly proportionally more costly than a missed normal observation. It is almost always worth doing and costs nothing• Threshold adjustment: most classifiers produce a probability score, and a default threshold of 0.5 implicitly assumes balanced classes. Lowering it (say, to 0.2) increases recall at the cost of precision. Use a precision-recall curve on a validation set to fi nd the threshold that reflects your actual cost tradeoff .• Oversampling the minority class: DO NOT DO THIS. Once widely used (and researched!) but now considered poor practice. SMOTE creates synthetic points by interpolating in feature space, producing examples that are statistically plausible but temporally incoherent. Better alternatives are to augment using domain knowledge (jitter known anomaly windows, vary their duration) or simply collect more real labels. Gradient boosting on residual featuresThe most effective supervised approach in our experience, and the one that connects most naturally to the profiling work earlier in this chapter, is to build a feature matrix from profile model residuals and train a gradient boosting classifier on it. A profile model removes known structure. What remains in the residuals is the unexplained variation. A supervised classifier trained on residual features learns to distinguish residual patterns associated with confirmed anomalies from patterns that are just normal noise, things like seasonal peaks that happened to breach a threshold, measurement jitter, or genuine promotions that the model underestimated.Below we have made a few specific design choices.• We use TimeSeriesSplit rather than random cross-validation. This is because our rolling features (resid_roll_std, sales_roll_mean) are computed from past observations and shifted by one step to avoid look-ahead. Random shuffling would be fi ne if we didn’t include these kinds of features; as it stands they would leak information about the temporal neighborhood of test points into training.• We use CatBoost with auto_class_weights='Balanced', which automatically reweights the loss function inversely proportional to class frequency. With anomalies representing roughly 4% of observations, an unweighted classifier would achieve 96% accuracy by predicting normal for every row. Balanced weighting forces the model to treat a missed anomaly as roughly 24x more costly than a false alarm during training, which prevents it from ignoring the minority class (outliers). The threshold adjustment step after training lets us fine-tune the actual precision-recall tradeoff .CatBoost was selected for our classifier, but you could easily use another algorithm i.e. XGBoost, LightGBM, or a regularized logistic regression. What is important here is having good labels, a strong temporal structure and optionally a well-explained model leaving us with good residual signal.from catboost import CatBoostClassifier from sklearn.model_selection import TimeSeriesSplit from sklearn.metrics import classification_report, precision_recall_curve import pandas as pd import numpy as npdef build_supervised_features(df, window=14): """ Feature matrix for supervised residual profiling. All rolling features are shifted by 1 to avoid lookahead bias. """ feats = pd.DataFrame(index=df.index) # Core residual signal feats['residual'] = df['residual'] feats['residual_z'] = df['residual_z'] feats['residual_abs'] = df['residual'].abs() # Local residual statistics for w in [7, 14, 28]: feats[f'resid_roll_mean_{w}'] = ( df['residual'].rolling(w, min_periods=3).mean().shift(1) ) feats[f'resid_roll_std_{w}'] = ( df['residual'].rolling(w, min_periods=3).std().shift(1) ) feats[f'resid_roll_max_{w}'] = ( df['residual'].abs().rolling(w, min_periods=3).max().shift(1) ) # Scores from unsupervised detectors as features for col in ['if_unsup_score', 'if_score', 'eif_score', 'lof_score']: if col in df.columns: feats[col] = df[col] # Context feats['price'] = df['price'] feats['promotion'] = df['promotion'] feats['sales_roll_mean'] = ( df['sales'].rolling(14, min_periods=3).mean().shift(1) ) feats['day_of_week'] = df.index.dayofweek feats['month'] = df.index.month return feats.fillna(0) feature_df = build_supervised_features(stockouts_data) labels = stockouts_data['is_anomaly'].fillna(0).astype(int) # Time-series split tscv = TimeSeriesSplit(n_splits=4) all_preds = np.zeros(len(labels)) all_probs = np.zeros(len(labels)) test_mask = np.zeros(len(labels), dtype=bool) for train_idx, test_idx in tscv.split(feature_df): X_tr = feature_df.iloc[train_idx].values X_te = feature_df.iloc[test_idx].values y_tr = labels.iloc[train_idx].values y_te = labels.iloc[test_idx].values clf = CatBoostClassifier( iterations=300, learning_rate=0.05, depth=4, auto_class_weights='Balanced', eval_metric='F1', random_seed=42, verbose=0 ) clf.fit(X_tr, y_tr, eval_set=(X_te, y_te)) all_preds[test_idx] = clf.predict(X_te) all_probs[test_idx] = clf.predict_proba(X_te)[:, 1] test_mask[test_idx] = True y_test_all = labels.values[test_mask] y_pred_all = all_preds[test_mask].astype(int) y_prob_all = all_probs[test_mask] print(classification_report(y_test_all, y_pred_all, target_names=['Normal', 'Anomaly'])) # Output: precision recall f1-score support Normal 1.00 0.99 1.00 562 Anomaly 0.81 1.00 0.90 22 accuracy 0.99 584 macro avg 0.91 1.00 0.95 584 weighted avg 0.99 0.99 0.99 584 The four blocks in build_supervised_features each give our classifier a different view on the same observation. The core residual signal (residual, residual_z, residual_abs) is the raw ‘what the profile couldn’t explain’; this is our primary anomaly evidence. The local residual statistics (rolling mean, std, and max at 7, 14, and 28 days, all shifted by one) provide context, because a residual of +20 means very different things depending on whether the surrounding fortnight has been quiet or noisy. The unsupervised detector scores (if_unsup_score, if_score, eif_score, lof_score) are second opinions from earlier in the chapter; our classifier learns when to trust them and when to override them. Finally the context features (price, promotion, sales_roll_mean, day_of_week, month) let our model learn that a large negative residual in late December under a deep promotion is structurally different from the same residual on a quiet Tuesday in February. Every rolling feature is shifted by one step so that no feature at time ݐ contains ݕ௧ itself.Figure 17.19: CatBoost precision-recall curve and feature importanceThe feature importance panel tells us which signals the model is actually using. If if_unsup_score or if_score dominate, the supervised model is largely deferring to unsupervised detectors and the supervised wrapper adds little beyond threshold calibration. Some influence from resid_roll_std suggests the classifier has learned temporal patterns from residuals that no single unsupervised method captures on its own. We could instead, build more features, and labels with the raw data, which would avoid the need for ensembled output. Threshold adjustmentCatBoost’s default threshold of 0.5 assumes equal misclassification costs. In anomaly detection this is rarely appropriate, as the cost of missing real stockouts often exceeds false alarms. The precision-recall curve in the figure above shows the full tradeoff surface; here we select the threshold that maximizes F1 as a reasonable default.from sklearn.metrics import f1_score thresholds = np.arange(0.1, 0.9, 0.01) f1_scores = [ f1_score(y_test_all, (y_prob_all >= t).astype(int), zero_division=0) for t in thresholds ] best_thresh = thresholds[np.argmax(f1_scores)] best_f1 = max(f1_scores) y_pred_tuned = (y_prob_all >= best_thresh).astype(int) tp_sup = ((y_pred_tuned == 1) & (y_test_all == 1)).sum() fp_sup = ((y_pred_tuned == 1) & (y_test_all == 0)).sum() fn_sup = ((y_pred_tuned == 0) & (y_test_all == 1)).sum() prec_sup = tp_sup / (tp_sup + fp_sup) if (tp_sup + fp_sup) else 0.0 rec_sup = tp_sup / (tp_sup + fn_sup) if (tp_sup + fn_sup) else 0.0 print(f"Best threshold: {best_thresh:.2f}") print(f"Supervised (tuned): Precision: {prec_sup:.2f} Recall: {rec_sup:.2f} F1: {best_f1:.2f}") # Output: Best threshold: 0.55 Supervised (tuned): Precision: 0.91 Recall: 0.95 F1: 0.93If false negatives are expensive (missed equipment failures, missed fraud) you want a lower threshold to maximize recall. If false positives are expensive (wasted analyst time, unnecessary operational interventions) you want a higher threshold to maximize precision. There is no universally correct answer; the threshold is a business decision, not a model decision.17.19: CatBoost supervised detections with tuned thresholdMethodTypePrecisionRecallF1Z-score ProfileUnsupervised0.741.000.85IF: Raw FeaturesUnsupervised0.510.970.67IF: ResidualsUnsupervised0.250.480.33INNE: ResidualsUnsupervised0.050.100.07LOF: ResidualsUnsupervised0.830.970.90CatBoost (supervised, tuned threshold)Supervised0.910.950.93Table 17.4: Comparing all approachesCatBoost wins clearly here, but this result is an artifact of the task. Our anomalies were generated programmatically and we have certainty in them. We have a strong temporal structure and relationships in our data, even where they don’t repeat in calendar position, noise is low. With clean labels, repeated anomaly patterns, and a profile model that already explains most of the variance, a supervised classifier on residual features has very little left to do. It is an easy task, that shows us how critical structure and labelling are.We should think of this as a calibration layer that sits on top of unsupervised detection. In production, labels come from unsupervised detectors and human-in-the-loop review (the contamination problem returns); novel anomalies by definition should not repeat in your training data. We will return to this self-supervised detection in Chapter 18.Some practical considerations:•Hyperparameter sensitivity: contamination has the largest effect in isolation-based methods; it sets the decision threshold on anomaly scores, not the model structure. Halving max_samples and max_features to 0.5 while raising contamination to 0.08 lifted the unsupervised IF’s recall from 0.61 to 0.97. For supervised models, threshold adjustment on the probability output typically matters more than tree hyperparameters themselves (Soenen, Wolputte, and Perini 2021).•The contamination-imputation loop: Start with a generous contamination rate (high recall, accept false positives), review flagged observations, then impute confirmed anomalies with interpolated values. Some values were already wrong; replacing them should improve signal rather than fabricate it. Normal data is abundant enough to anchor any reasonable interpolation. Retrain a profile on the imputed series and re-score. Each iteration sharpens the residuals and tightens detection boundaries. Human confirmation at each step is essential; uncertain flags stay in the data untouched. Depending on your case, you could pair this with forecast accuracy, being sure to leave validation sets clean; otherwise, you’re marking your own homework.ConclusionSupervised anomaly detection works best when labels are trustworthy, temporal structure is preserved, and model outputs are tuned around real business costs. By using residual-based features, balanced class weighting, TimeSeriesSplit, and precision-recall-based threshold adjustment, you can build detectors that perform better than simple unsupervised baselines in well-labelled scenarios. However, the real challenge in production is not just model selection; it is maintaining label quality, managing class imbalance, and deciding the right trade-off between missed anomalies and false alarms.This article is an excerpt from Time Series with PyTorch: Modern Deep Learning Toolkit for Real-World Forecasting Challenges, published by Packt. Author BioGraeme Davidson is a Lead Data Scientist at Retail Express, where he redesigned the company's demand forecasting framework in line with contemporary statistical learning practices. His background spans cognitive neuroscience, researching implicit reward processing and human decision-making, through advertising analytics to research-focused demand forecasting. He is an active contributor to several data science Slack and Discord communities, an occasional competitor in forecasting competitions, and was approached by Packt in late 2022 to write the book he wished had existed when he first fell down an ARIMA rabbit hole chasing answers about how supermarkets actually forecast demand, and how a quantitative researcher models financial markets.Lei Ma is a physicist-turned data scientist specializing in time series forecasting. He is theorist but has tackled real-world forecasting challenges across a variety of industries like housing, logistics, ecommerce, and manufacturing. Lei has led and delivered numerous forecasting projects where he combines deep expertise in building advanced time series models with a strategic approach to delivering holistic business insights. Lei creates time series forecasting tutorials online and joined the venture when Graeme approached him to collaborate on this book.

0
0

article-image-how-to-prepare-data-using-aws-glue

Subramanya Vajiraya, Noritaka Sekiyama, Tomohiro Tanaka, Ishan Gaur, Albert Quiroga, Akira Ajisaka

28 May 2026

10 min read

How to Prepare Data Using AWS Glue

Subramanya Vajiraya, Noritaka Sekiyama, Tomohiro Tanaka, Ishan Gaur, Albert Quiroga, Akira Ajisaka

28 May 2026

10 min read

Our Data Engineering Byte Newsletter gives data engineers and practitioners what they often lack today: clear, real-world insights—where every byte tells a story.Subscribe here to stay ahead in data engineeringIntroductionPreparing data for analytics can become challenging as organizations deal with growing data volumes, varied data sources, and increasingly complex transformation requirements. AWS Glue helps simplify this process by offering both visual and code-based approaches to data preparation. In this article, we explore how AWS Glue Studio enables users to build ETL workflows visually, apply transformations, configure data quality checks, and prepare datasets for downstream analytics without needing to manage infrastructure.Data preparation using AWS GlueIt is normal for data to grow continuously over time in terms of volume and complexity, considering the huge number of applications and devices generating data in a typical organization. With this ever-growing data, a tremendous amount of resources is required to ingest and prepare this data – both in terms of manpower and compute resources.AWS Glue makes it easy for individuals with varying levels of skill to collaborate on data preparation tasks. For instance, novice users with no programming skills can take advantage of AWSGlue Studio (https://docs.aws.amazon.com/glue/latest/dg/author-job-glue.html), a visual interface that allows novice data professionals to interact with and prepare the data using a variety of pre-built transformations and filtering mechanisms even without writing any code. AWS Glue Studio also provides advanced users to author custom transformations to achieve desired outcomes.AWS Glue Studio is a great tool for preparing data using a graphical user interface (GUI), there are some use cases where the built-in transformations may not be flexible enough or the user may prefer a programmatic approach to prepare data over using the GUI-based approach. In such cases, AWS Glue enables users to prepare data using AWS Glue ETL. Users can leverage AWS Glue Studio to author, execute, and monitor ETL workloads. Although Glue Studio offers a GUI, users may still require programmatic knowledge of AWS Glue’s transformation extensions and APIs to implement data preparation workloads, especially when implementing custom transformations using SQL or source code.Now that we know about the different data preparation options that are available in AWS Glue, let’s dive deep into each of them while looking at practical examples to understand them.Visual data preparation using AWS Glue StudioAWS Glue makes it possible to prepare data using a visual interface through AWS Glue Studio. Previously, the preferred approach was to make use of AWS Glue DataBrew for visual data preparation as highlighted in the previous edition of this book. However, the features available in AWS Glue DataBrew have been implemented in AWS Glue Studio and as such, all users regardless of their skill level can make use of AWS Glue Studio to build their ETL workflows through a unified interface. AWS Glue Studio allows us to author recipes similar to AWS Glue DataBrew and even allows us to import any recipes that were built using AWS Glue DataBrew.Getting started with AWS Glue Studio is quite simple. To author a new job using visual ETL, you can use the Visual ETL option in the AWS Glue Studio UI to open the visual job editor. In this interface, you can start designing the ETL workflow based on your requirements by adding required data sources, transformations, and target nodes with a simple drag-and-drop.As mentioned in the previous chapters, we can also ingest data from a wide range of external Software-as-a-Service (SaaS) providers via Amazon AppFlow. There are several external SaaS providers using native connectors, including Adobe Analytics, Asana, Datadog, Google Analytics, Dynatrace, Marketo, Salesforce, ServiceNow, Slack, and Zendesk, to name a few. This data can be further integrated with datasets from other data stores or SaaS applications. This helps the users take a holistic approach to analyzing and gathering insights from their datasets, which have been spread across different data stores or SaaS platforms. The following screenshot shows the visual editor interface available in AWS Glue Studio, which allows users to drag and drop different components of an ETL job:Figure 4.1: AWS Glue Studio Visual Job Editor InterfaceIn the visual editor interface (Figure 4.1), we can start by dragging and dropping data source(s) we wish to read from and start adding transformations and configure each node as per our requirements. This step is optional if we are moving the data from one location to another without applying transformations. Once we are happy with the transformations, we can add data target(s) to our job to write the output. We have to provide a name to the job, select an IAM role for execution under the job details tab, and the job is ready to be saved now. If our job requires an AWS Glue connection, for instance a relational database, on-premises data source, or SaaS data source, we can specify the connections to be included under the job details tab.Let’s build an ETL job using the visual ETL editor which reads from an Amazon S3 location, applies simple transformations, and writes the data to an Amazon Redshift data warehouse. Before we begin authoring the job, let’s set up an AWS Glue Connection to connect to our data warehouse cluster. To create an AWS Glue Connection, use the Connections option under Data Catalog in the sidebar and click Create connection. In the data source selection page, search for Amazon Redshift and click Next. If your Amazon Redshift cluster is in a different AWS account, use a JDBC data source instead. Configure the connection by specifying the cluster details and IAM role. Review the connection details and save the connection. The following screenshot (Figure 4.2) shows a sample AWS Glue connection configuration used to connect to an Amazon Redshift Cluster:Figure 4.2: AWS Glue Connection Configuration pageOnce the connection is saved, let’s head back to the visual editor and add an Amazon Redshift Source node from the Node picker. Click on the added node and select the connection that you created in the previous step. You can choose to select a single table or enter a custom query. In the Data preview section, you can select an IAM role and start a session to preview your data. The following screenshot (Figure 4.3) shows how data can be previewed at each node in the graph to get an idea of the resultant dataset:Figure 4.3: Data Preview of a node in visual editorUse the node picker and add necessary transformations. To transform data using AWS Glue DataBrew style editor, you can click the plus button and add Data Preparation Recipe transform from the node picker and click on the added node to display the node properties. Click Author Recipe to open the familiar grid interface to begin creating the data preparation recipe.Figure 4.4: Author Recipe button available in Data Preparation Recipe transformIn our example, we are adding a filter on quantity of items sold to fetch sales records with a quantity of 3 or higher. The following screenshot (Figure 4.5) shows the filter transformation in the data preparation recipe authoring window:Figure 4.5: Applying transformations in Data Preparation Recipe interfaceOnce all the necessary transformations are applied, you can use the Done authoring recipe button to exit the grid interface and return to AWS Glue Studio visual editor. After all necessary transformations have been applied, we can add target node(s) to the graph to ensure the output is saved to the target data store. In our example, we will configure the target node to write output in Parquet format by selecting Parquet in the Format field of the target node properties, save it into Amazon S3 location, and create a table in AWS Glue Data catalog. The following screenshot (Figure 4.6) depicts the same:Figure 4.6: Amazon S3 target data store configurationNow that we have defined the data source, transformations, and a data target we can save the ETL job and it is ready to be executed.Note: It is important to note that Data Preparation Recipe transformation can be used only with certain versions of AWS Glue. As of writing this book, we can use AWS Glue version 4.0 or higher to execute ETL jobs with such transformations.While the Data Preparation Recipe transformation offers a familiar grid interface, if you are not particularly interested in that experience, you can choose to add transformations directly from the node picker. For example, to achieve the same outcome as the job we created, we can directly add the Filter transformation after we add the Amazon Redshift data source in the graph instead.The following screenshot (Figure 4.7) shows how we can configure the Filter transformation to achieve the same outcome as the data preparation recipe:Figure 4.7: Visual ETL without using Data preparation recipe transformationWhile the ETL job we designed will work as expected, it is always a good idea to protect our workload from noisy data. For example, if the sales quantity column doesn’t exist in our source data or if the column is not an integer as we expected, or it contains null values, we should be able to detect such anomalies and take necessary actions. AWS Glue enables users to use Evaluate Data Quality to create data quality rules to evaluate the output. During this process, it emits events to AWS EventBridge, which can be captured and acted upon. The configuration is quite flexible. You can enrich the existing dataset with data quality information or choose to write the quality information to a separate destination. You can also emit AWS CloudWatch metrics for data quality checks to track how a job is trending with regard to data quality. Additionally, you can decide what to do with the job if the evaluation fails. For example, you can choose to fail the job before or after loading the data into the target data store, or continue with the job without failing it.To implement the data quality rule described in the example above, you can add the EvaluateData Quality transform from the node picker (or use the Edit Data Quality Configuration button in the target node).Figure 4.8: Evaluate Data Quality transform available in the node pickerData Quality rules can be defined using Data Quality Definition Language (DQDL) syntax. Detailed documentation on DQDL can be found in https://docs.aws.amazon.com/glue/latest/ dg/dqdl.html. The rule we described earlier will look like this:Rules = [ # check if col exists ColumnExists “qtysold”, # check for null values IsComplete “qtysold”, # Make sure the values are integers ColumnDataType “qtysold” = “Integer” ]The following screenshot (Figure 4.9) will show how we can configure the Evaluate Data Quality transform to implement the above data quality rules:Figure 4.9: Evaluate Data Quality TransformationNow that we have seen how we can use the visual editor to build an ETL job from scratch, in the next section we will be exploring how we can use AWS Glue to build ETL jobs using source code, and we will also see some of the built-in transformations available in AWS Glue with examples.Source code-based approach to data preparation using AWS GlueWhile AWS Glue Studio primarily offers a visual interface-based approach to tackle data preparation tasks in a data integration workflow, it can also be used to author complex ETL workflows using advanced (and even custom) transformations. AWS Glue ETL in general requires us to have some level of Glue/Spark programming knowledge to implement ETL jobs, which aids in data preparation as we get a much higher level of flexibility compared to using just the grid interface in data preparation Recipe. With the data preparation recipe approach, we can only use pre-built transformations to prepare data. Since there are no such restrictions in AWS Glue ETL, we can design and develop custom transformations based on our requirements using existing Glue/ Spark ETL APIs and extensions.ConclusionAWS Glue provides a flexible and scalable way to prepare data, whether users prefer a visual interface through AWS Glue Studio or a more programmatic approach using AWS Glue ETL. With support for drag-and-drop job creation, data preparation recipes, built-in transformations, data quality rules, and integration with services such as Amazon S3 and Amazon Redshift, AWS Glue helps teams streamline data preparation across different skill levels and use cases.This article is an excerpt from the book Serverless ETL and Analytics with AWS Glue, Second Edition, which offers a deeper look at building, managing, and optimizing data integration workflows using AWS Glue. Author BioSubramanya Vajiraya is a Senior Cloud Engineer at AWS Sydney specialized in AWS Glue. He obtained his Bachelor of Engineering degree focused on Information Science & Engineering from NMAM Institute of Technology, Nitte, KA, India in 2015 and obtained his Master of Information Technology degree focused on Internetworking from University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workload and implement scalable data integration and analytics pipelines on AWS.Noritaka Sekiyama is an experienced big data engineer working at Data and AI company. He is responsible for building scalable data platform with unified governance on Cloud. He is passionate about software engineering, cloud computing, big data technologies, distributed systems, data platform, system monitoring and automation.Tomohiro Tanaka is a Senior Cloud Support Engineer at Amazon Web Services (AWS). He specializes in data infrastructure with hands-on customer engagement experience including migrations, performance tuning, and production troubleshooting. His areas of expertise include Apache Spark, Apache Iceberg, and AWS Analytics services such as AWS Glue, Amazon EMR and Amazon Athena. He actively contributes to the Apache Iceberg open-source project and speaks at community events and conferences to help customers adopt Iceberg in practice.Ishan Gaur is a Principal Big Data Cloud Engineer at Amazon Web Services (AWS) with over 16 years of experience architecting and building distributed systems and scalable data integration pipelines. As a subject matter expert in AWS Glue and Apache Spark, he specializes in helping enterprise customers design and implement large-scale data processing solutions across the AWS ecosystem, including Amazon EMR, AWS Glue, and Amazon Athena.Throughout his career, Ishan has worked extensively with distributed computing frameworks and ETL technologies including Apache Spark, Scala, Ab Initio, and DataStage. His expertise spans the full lifecycle of data engineering—from architecture design and pipeline development to performance optimization and troubleshooting at scale. At AWS, he partners with customers to modernize their data platforms, optimize workloads, and leverage cloud-native services to achieve operational excellence and cost efficiency in their data processing environments.Albert Quiroga is a Senior Solutions Architect at Amazon, where he creates solutions and architectural designs for one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR, Athena, Glue and SageMaker. His 11 years of experience in the industry have empowered him to work with several Fortune 500 companies to overcome large-scale data and analytics challenges, and he has helped launch and develop features for several AWS services.Akira Ajisaka is a software engineer and has more than 10 years of engineering experience in big data. He likes troubleshooting and contributing to OSS.

0
0

article-image-microsoft-fabric-data-agents-building-ai-powered-conversational-analytics-for-enterprise-data

Christopher Maneu, Frederic Gisbert, Emilie Beau, Jean-Pierre Riehl, Romain Casteres

20 May 2026

10 min read

Microsoft Fabric Data Agents: Building AI-Powered, Conversational Analytics for Enterprise Data

Christopher Maneu, Frederic Gisbert, Emilie Beau, Jean-Pierre Riehl, Romain Casteres

20 May 2026

10 min read

DataPro is a weekly, expert-curated newsletter trusted by 120k+ global data professionals. Built by data practitioners, it blends first-hand industry experience with practical insights and peer-driven learning.Make sure to subscribe here so you never miss a key update in the data world. IntroductionData agents in Microsoft Fabric are redefining how organizations interact with enterprise data. Instead of relying only on dashboards, predefined reports, or technical query skills, users can ask questions in natural language and receive contextual, data-driven answers powered by generative AI.This article explores how Fabric data agents connect to lakehouses, warehouses, semantic models, mirrored databases, Power BI, Microsoft Copilot Studio, AI Foundry, and real-time intelligence systems, turning organizational data into a conversational, governed, and actionable intelligence layer.Data agents in Microsoft FabricBefore diving into the h ands-on steps, let’s now move from the general overview of data agents to a practical example. The next section walks you through creating your fi rst data agent, showing how to connect it to your data and begin building a customized generative AI experience within Microsoft Fabric.Microsoft Fabric now brings full CI/CD, ALM fl ow, and Git integration to Data Agents, providing a more structured and collaborative way to manage, version, and deploy Data Agent artifacts. These capabilities ensure better governance and scalability by introducing controlled development stages, change tracking, and auditability across the lifecycle. Git integration allows teams to branch, experiment independently, review code, and merge updates safely, with the ability to revert quickly if needed. Together, these enhancements make developing and maintaining Fabric Data Agents more reliable, transparent, and aligned with modern soft ware engineering practices.Creating your first data agentThe data agent in Microsoft Fabric is a new type of artifact that allows creating customized generative AI experiences based on the organization’s data. Users can thus, on top of their data, activate an AI assistant that will translate questions posed by users in natural language directly into code and immediately return the answers.Once created, the data agent must be associated with a data source. You will then need to select the specifi c tables from which the system will retrieve data. Currently supported sources include lakehouses, warehouses, KQL databases, and semantic models. The user can enhance the assistant by providing additional information about the data (instructions), the underlying models, or simply by detailing the context in which the assistant should operate.Finally, it is possible to load a set of SQL queries linked to natural language questions to help the model fi nd the information. These queries are, for example, created by enterprise data analysts who are well acquainted with the business models as well as the data.In the following example, we ask our conversational agent What are the key insights or most notable patterns in sales seasonality over time?.Figure 11.20: A data agent in Microsoft Fabric analyzes seasonal sales trends, identifying monthly patterns in total sales, variance, and margins to support strategic planning and forecastingOnce the tables are selected, the context is validated, the examples are loaded, and the interaction is tested, it i s possible to publish this data agent in order to share it and retrieve an API to integrate it into enterprise applications.Unlocking LLM-Driven Intelligence from Mirrored Databases in Microsoft FabricMicrosoft Fabric introduces a new capability that connects large language models (LLMs) directly to mirrored databases through a Data Agent. This feature allows organizations to use their synchronized, real-time data without the need for complex duplication or manual integration. The Data Agent acts as an intelligent bridge between the mirrored databases and AI models, enabling natural language queries and automated insights based on live information.The supported mirrored databases include:• Mirrored Azure Cosmos DB• Mirrored Azure Database for PostgreSQL• Mirrored Azure Databricks catalog• Mirrored Azure SQL Database• Mirrored Azure SQL Managed Instance• Mirrored Oracle• Mirrored Snowfl ake• Mirrored SQL Server DatabaseBy integrating these sources, Fabric allows businesses to maintain data consistency, security, and freshness while making their information accessible to AI-driven analysis. Users can interact with their data conversationally, generate real-time insights, and make faster, more informed decisions. This seamless connection between enterprise databases and LLMs reduces latency, minimizes operational overhead, and strengthens confidence in AI-powered outcomes—turning mirrored databases into a live, intelligent foundation for modern analytics and decision-making.Agent data source instructionsMicrosoft Fabric h as introduced a new feature within its data agent framework called data source instructions. This feature allows data owners and developers to provide specific guidance on how AI agents should interpret and use each dataset. Instead of relying solely on the schema or raw content of a data source, agents now have access to structured context—such as explanations of tables, columns, relationships, and intended use cases. These instructions improve the agent’s ability to generate relevant, precise, and business-aligned responses when interacting with enterprise data.Data source instructions can include natural language descriptions, key business definitions, column-level explanations, usage constraints, and examples of valid questions. This metadata acts as a knowledge layer that shapes how the agent interprets queries and formulates answers. For instance, when working with multiple datasets, the agent can prioritize certain fi elds, avoid unsupported queries, or follow business-specific logic as defined in the instructions.The benefit is a noticeable improvement in response quality, consistency, and reliability, particularly in use cases such as semantic search, natural language querying, and generative AI experiences in dashboards or notebooks. By embedding domain expertise directly into the agent’s environment, data source instructions help bridge the gap between raw data and user intent—empowering AI systems to deliver answers that are not only technically correct but also contextually appropriate.Figure 11.21: The interface provides a dedicated space for adding custom instructions to guide theAI agent when using the LakeDBIA data source—covering table structures, column descriptions, metrics, and relationshipsThis development reflects a broader trend in enterprise AI: pairing large language models with curated, human-authored context to increase trust, precision, and usability in data-driven environments.To begin configuring instructions for your data sources, please refer to the setup guide: https:// go.fabricbook.net/ch11-12. For tips on crafting clear and effective instructions, see the recommended guideline s: http://go.fabricbook.net/ch11-13.The data agent SDKMicrosoft Fabric now off ers a Python SDK for data agents, enabling users to evaluate their agents programmatically and at scale. With this SDK, developers can defi ne a set of test questions paired with expected answers (ground truth) and run structured evaluations directly from notebooks or automation pipelines. The evaluation routine compares the agent’s responses against the expected results, logging detailed metrics and step-by-step reasoning in output tables. This enables validation of accuracy, error diagnosis, and confidence building before deploying the agent into production.Setting up the evaluation is straightforward. You begin by installing the fabric-data-agent-sdk library, then prepare a pandas DataFrame containing your test queries and expected outcomes. Aft er calling the evaluate_data_agent() function, the SDK returns a unique evaluation ID and writes summary and step-level data to tables in your workspace. You can then retrieve overall accuracy and performance insights using the get_evaluation_summary() function.This capability simplifies both quality control and continuous improvement of data agents. It supports tasks such as prompt tuning, regression tracking, and performance monitoring, helping ensure that AI-powered experiences remain accurate and reliable as they evolve.Integrating data agents with Microsoft Copilot Studio and Microsoft AI FoundryIntegrating data agents with AI Foundry and Copilot Studio enables organizations to leverage enterprise data seamlessly within AI-driven workflows. This integration empowers conversational AI agents to dynamically access relevant data, enhancing both decision-making and user productivity.Integration with AI Foundry occurs through the deployment of Fabric data agents as knowledge sources within Foundry’s agent framework. Foundry orchestrates the interaction, enabling AI agents to query these data agents directly. User identity and permissions fl ow securely through this integration, ensuring compliance and data governance at scale.Figure 11.22: This panel allows users to expand an AI agent’s knowledge by connecting external data sources such as Microsoft Fabric, Azure AI Search, SharePoint, Bing Search, Tripadvisor, and others for grounding responsesWhen an AI Foundry agent receives a request, it intelligently delegates appropriate queries to a connected Fabric data agent, retrieves precise responses, and generates coherent outputs for end users.In the following screenshot, we’ve connected SalesAgent from Microsoft Fabric to a broader agent called Market Agent. This connection allows us to cross-reference sales data with market insights, CRM information, and other external sources.Figure 11.23: Existing Microsoft Fabric SalesAgent connection with custom authentication is ready for useWe can then interact with our agent in the playground and, within our context, ask the following question: What are the main takeaways or most significant trends in sales seasonality over time? Summarize in three key points. The result is then displayed in the agent’s interface.Figure 11.24: The Market Agent in Azure AI Foundry summarizes key sales seasonality trends, using the connected SalesAgent knowledge source for contextIt is then possible to view the reasoning behind the model’s response, as well as the structure of its answer. In our example, we c an clearly see that the agent is interacting with the data agent, allowing it to retrieve as much relevant information as possible.Figure 11.25: The run trace confirms a successful call to the fabric_dataagent tool, executed in3 seconds as part of a thread sessionAI agents integrated within Foundry can be seamlessly connected to Copilot Studio. Through this connection, agents created in Foundry become accessible within Microsoft Copilot applications and Microsoft 365. Users benefit from streamlined, multi-agent interactions, enabling AI assistants within familiar productivity tools to interact fluidly with enterprise data sources.This integration leverages standardized protocols, allowing Copilot Studio agents to transparently discover, communicate, and collaborate with Foundry-based data agents, creating a unified conversational AI experience across various platforms.By integrating Fabric data agents with AI Foundry and Copilot Studio, organizations can build sophisticated, interconnected AI agents that effortlessly access and deliver enterprise knowledge. The result is a powerful, cohesive system enabling intuitive and secure interactions between users, AI agents, and organizational data.Integrating data agents with Power BIIn the evolving landscape of business analytics, the integration of Fabric data agents into Power BICopilot marks a significant leap forward. Microsoft is transforming the way users interact with enterprise data—moving beyond dashboards and static reports to a more fluid, conversational experience.Traditionally, Power BI relied on structured datasets and semantic models to deliver insights. While powerful, this approach required users to know what data was available and how it was organized. With the arrival of Fabric data agents, that barrier begins to dissolve. These agents act as intelligent bridges, capable of querying data across Microsoft Fabric—including lakehouses, data warehouses, and real-time sources—without the user needing to understand their structure.Once connected to Copilot, these agents enable a more dynamic and flexible form of analysis. Users can ask open-ended questions in natural language, and the system intelligently determines which data sources to consult. Rather than extracting predefi ned metrics, the agent interprets the user’s intent, formulates relevant queries, and returns synthesized answers. The interaction becomes adaptive, responsive, and deeply contextual.The user experience has also evolved. Power BI now offers a full-screen Copilot interface that supports an ongoing dialogue with data. Users no longer jump between visualizations and filters; instead, they engage in natural, iterative conversation. Each question refines the context. Each response adds new understanding. The Fabric data agent is not just retrieving numbers—it’s helping users think.This integration brings real advantages. First, it broadens analytical reach, allowing users to tap into previously disconnected datasets. Second, it lowers the technical barrier to insight, making advanced analysis accessible to users with no background in data modeling. And third, it shortens the path from question to answer, accelerating decision-making at every level of the business.The Power BI home page offers a few suggested prompts, along with customization options such as adding agents, semantic models, or reports to better address the user’s query.Figure 11.26: New Power BI Copilot experienceOnce the agent is registered, you can submit the question to Copilot, which will rely on this agent to gather the data and generate a response for the user. In our example, the question asked is: What are the main takeaways or most significant trends in sales seasonality over time? Summarize in three key points.Figure 11.27: Interaction between the new Power BI Copilot and Sales data agentThe most striking shift is not technical but conceptual. Business intelligence is becoming less about pulling data and more about interacting with it. With data agents embedded in Copilot, Power BI becomes a space where data and intent meet—where the analyst’s curiosity is answered by a system that listens, understands, and responds. It’s not just an evolution of to oling; it’s a new way of working with information. MCP support for Real-Time Intelligence (RTI)The promise of real-time intelligence has always been about immediacy—turning raw, fast-moving data into insight and action, as events unfold. With the introduction of Model Context Protocol (MCP) support for Real-Time Intelligence (RTI) in Microsoft Fabric, that promise takes a tangible form. AI agents, which once relied on static data or scheduled refreshes, can now engage directly with live event streams, unlocking entirely new use cases for operational awareness, anomaly detection, and conversational analytics.At the heart of this evolution lies the concept of the MCP RTI server—an open source server built to act as a translator between natural language prompts and real-time query execution. This server supports queries against platforms such as Eventhouse and Azure Data Explorer, two foundational components of Fabric’s real-time architecture. The MCP server receives a user or agent’s query—phrased in everyday language—and transforms it into executable Kusto Query Language (KQL) or JSON-based expressions that access the most up-to-date data available.The technical design is particularly elegant: the server offers features such as schema introspection, allowing agents to understand the structure of real-time tables and streams. It supports autocomplete, query optimization, query validation, and even natural language error explanations, enhancing usability for both developers and end users. In more advanced scenarios, it also supports anomaly detection patterns, vector-based semantic searches, and custom parameter bindings—critical capabilities when working with high-volume, high-velocity datasets.From the user’s point of view, the experience is remarkably fluid. Imagine a business analyst interacting with Copilot in Microsoft Teams or Power BI, asking What unexpected traffic spikes occurred in the past five minutes? Instead of routing the question to a stale dataset, the AI agent communicates with the MCP RTI server, which accesses live telemetry from connected streams. The response is both immediate and context-aware—an intelligent snapshot of a moment in time, tailored to the user’s query.This advancement is not just a technical milestone; it marks a shift in how we think about agents and data systems. MCP becomes the connective tissue between real-time data infrastructure and generative AI. It abstracts away the complexity of query languages and schema logic, giving agents the ability to operate as intuitive interfaces to complex real-time systems. These agents no longer need to be pre-programmed with rigid logic—they can now adapt, respond, and learn from live data in the flow of conversation.Moreover, because MCP is standardized and extensible, this model is not confi ned to Microsoft tools. Any system or agent that speaks MCP can now tap into the RTI capabilities, whether it’s a third-party chatbot, an autonomous monitoring system, or a custom enterprise app. This composability ensures that real-time intelligence is not a siloed feature but part of a broader, interoperable ecosystem.In the following example, after deploying the MCP RTI server, we have an MCP server connected to sample New York taxi data. The GitHub Copilot agent is linked to the Claude Sonnet 4 reasoning model, which responds to the following question: Identify the periods of high activity in the taxi data you have access to.The agent begins its reasoning by listing the available databases and tables, discovers the taxi dataset, learns the schema, and then extracts a subset of the data to better understand its structure and content.Figure 11.28: MCP RTI server configured as tools in the GitHub Copilot agent interfaceThe agent then continues its reasoning by extracting specific information from the database that will enable it to produce the final result of its analysis. A sequence of queries is sent to the database as part of this process.• Now let me analyze monthly patterns• Let me also look at specific high-activity periods and combine hour and day analysis• Let me also analyze seasonal patterns by looking at different years• Let me get a broader view of the recent data• Now let me get the peak activity patterns for different time periodsFor this final query, here is an example of the code generated by the agent and sent to the underlying database.{ "cluster_uri": "https://clustername.kusto.fabric.microsoft.com", "database": "NycTaxiDB", "query": "nyctaxitrips\n| where tpep_pickup_datetime >= datetime(2022-01-01) and tpep_pickup_datetime < datetime(2023-01-01)\n| extend pickup_date = bin(tpep_pickup_datetime, 1d)\n| summarize daily_trips = count() by pickup_ date\n| order by daily_trips desc\n| take 10" }At the end of its reasoning, the agent returns its analysis in the form of a formatted text, enriched with calculations derived from the knowledge base.Figure 11.29: The setup in VS Code uses a custom Kusto query configuration to analyze NYC taxi data, revealing peak activity times—Wednesdays at noon lead with over 23 million tripsIn a world increasingly driven by the speed of information, the ability to act on data as it happens is no longer a luxury—it’s a necessity. With MCP support for RTI, Microsoft is providing the infrastructure for that immediacy: a conversational layer over a real-time backbone, where AI agents don’t just analyze the past but live in the present.You can access the official Fabric RTI server at the following link: https://go.fabricbook.net/ch11-14ConclusionMicrosoft Fabric data agents represent a major step toward AI-first analytics, where users can query, reason over, and act on enterprise data through natural language. From creating a first data agent to integrating with mirrored databases, Power BI Copilot, Microsoft AI Foundry, Copilot Studio, and Real-Time Intelligence through MCP, the article shows how Fabric is evolving into a unified platform for governed, conversational, and real-time decision-making.By combining structured data access, business context, SDK-based evaluation, and secure AI integrations, Fabric data agents help organizations move beyond static reporting toward intelligent, interactive analytics experiences. This article is an excerpt removed from The Definitive Guide to Microsoft Fabric book by Packt. Author BioChristopher Maneu is a Principal Data Engineering Advocate at Microsoft, he focuses on data and analytics within the Azure platform. Christopher is part of the Azure Engineering team, working on Microsoft Fabric well before its launch. This early involvement has provided him with in-depth knowledge of the platform's development and capabilities. He has authored multiple books, including a reference book about Microsoft Fabric in French. Additionally, he has contributed to open-source projects related to Microsoft Fabric, such as the 'fabricnotes' repository on GitHub, which offers simple drawings illustrating the main concepts of Microsoft Fabric to empower users to build on the platformEmilie BEAU is a technical specialist in data processing technologies. She has been at Microsoft for 15 years, enjoys sharing her knowledge, and engages in discussions about how advances in BI, Big Data, and Artificial Intelligence can help industries address new challenges. She spent years in the Microsoft Technology center, addressing CxOs at the intersection of business and technical needs.Jean-Pierre is a Technology leader who combines data, innovation, and business value. Passionate about data, a fan of Power BI and Fabric, guided by innovation yet business-oriented, and "artificially intelligent," he has worked for more than 25 years on exciting projects-from the web to IoT-encompassing vast amounts of data and AI. A recognized Microsoft Most Valuable Professional (MVP) for the Data Platform since 2008, he has always been deeply involved in communities in France, both as an organizer and speaker. He leads the Microsoft Data community in France, notably through the Power BI Club and the Fabric Club.

0
0

article-image-what-is-a-system-of-action-building-an-ai-ready-data-foundation-for-real-time-decision-intelligence

Boris Bialek, Sebastian Rojas Arbulu, Taylor Hedgecock

31 Mar 2026

10 min read

What Is a System of Action? Building an AI-Ready Data Foundation for Real-Time Decision Intelligence

Boris Bialek, Sebastian Rojas Arbulu, Taylor Hedgecock

31 Mar 2026

10 min read

DataPro is a weekly, expert-curated newsletter trusted by 120k+ global data professionals. Built by data practitioners, it blends first-hand industry experience with practical insights and peer-driven learning.Make sure to subscribe here so you never miss a key update in the data world. IntroductionModern enterprises are no longer competing on data alone; they’re competing on how quickly and intelligently they can act on it. As AI systems evolve from passive analytics to autonomous decision-makers, traditional data architectures are becoming a critical bottleneck. The rise of systems of action marks a fundamental shift: from storing and analyzing the past to driving real-time decisions, workflows, and outcomes.This article explores how organizations can modernize their data foundations to support agentic AI, real-time context, and scalable intelligence. It breaks down the architectural principles required to unify fragmented data, ensure quality and trust, and enable continuous learning systems that operate at enterprise scale. From unified data access to real-time signal processing and governance, this is a practical guide to building the AI-ready backbone that powers next-generation applications.Building an AI-ready data foundationDelivering on the promise of systems of action requires a new kind of data foundation—one built for speed, context, and adaptability.Agentic AI systems fundamentally differ from traditional systems of record in their operational demands. Where legacy systems focus on capturing and storing historical transactions, systems of action powered by agentic AI require real-time decision-making, dynamic data synthesis, and immediate response capabilities. This shift demands that our data architecture choices move beyond the rigid, siloed structures of traditional enterprise systems.A unified view of core enterprise is essential. It must bring together the diverse data types that autonomous agents rely on (real-time operational signals, contextual documents, vector embeddings) into a single, coherent platform. That platform must be built on flexible data structures that can adapt as agent behaviors evolve.The transition from supporting passive systems of record to enabling active systems of action introduces six critical architectural requirements that distinguish agentic AI infrastructure from legacy approaches:Unified data access to eliminate the complexity of managing multiple disparate datastoresData quality and consistency mechanisms that reduce hallucinations and errors from systems out of syncReal-time context capabilities that enable immediate signal processing for RAG applicationsScalability and performance characteristics that support operational AI rather than only backward-looking analyticsGovernance and security frameworks that protect sensitive information while enabling innovationEfficient model training workflows that optimize data preparation for GenAI applicationsTogether, these elements form the data foundation for autonomous, intelligent systems. As we examine each in the sections ahead, we’ll see how a system of action database departs from traditional data management and enables more intelligent, responsive, and scalable AI applications.What is a system of action?Systems of action are a new class of enterprise application, designed to execute decisions and drive workflows in real time. They enable collaboration between people, AI-assisted users, and AI agents, supporting everything from assisted decision-making to fully autonomous execution.Unlike systems of record, which passively store historical transactions, or systems of insight, which analyze data retrospectively, systems of action operate in the moment. They process dynamic context, trigger decisions, and execute tasks through AI agents. For instance, they might reroute a delayed flight in real time or automatically adjust hospital staffing during a sudden surge.Building systems of action requires more than analytical capabilities. They must ingest streaming signals, reason across unstructured and structured sources, and respond in real time. They require specialized database architectures capable of managing high-velocity, multimodal data streams and supporting complex state transitions over time. Most legacy systems, designed for static, batch-oriented workflows, simply cannot support this kind of continuous intelligence.Figure 3.1: Enterprise system landscape: from system of record to system of actionFigure 3.1 illustrates this evolution across the enterprise landscape. Unlike traditional systems that passively store or retrospectively analyze data, systems of action enable real-time interaction between users, applications, and agents; all powered by a live, adaptable data layer.Unified data access architectureThe foundation of any GenAI system begins with access to diverse, multimodal data, at speed, in formats AI can reason with. Unfortunately, this is also where most enterprises struggle. Traditional enterprise data architectures are fragmented across dozens of incompatible systems, each optimized for narrow use cases. The result is integration pain, access friction, and massive overhead.Modern AI applications demand a fundamental departure: unified access must be treated not as a convenience but as a prerequisite.Today’s models must navigate a wide variety of inputs: text documents, application logs, product catalogs, support transcripts, and streaming sensor data. Relational and legacy systems often store semi-structured data (like JSON or XML) as binary large objects (BLOBs) or character large objects (CLOBs), limiting their usability for AI systems. In these cases, the actual data is hidden inside a single entry and must be extracted and interpreted before it can be reasoned over or acted upon. This was tolerable when the goal was to store and retrieve files. But for GenAI systems, where models need immediate access to both structured and semi-structured data, often in the same query, this format becomes a bottleneck. Even a video can have its own addressable metadata structure, rather than existing solely as an opaque BLOB, illustrates the shift needed to support AI-native reasoning.Beyond the format problem lies a more urgent challenge: fragmentation.An AI application might need to stitch together context from a CRM (customer profiles and account hierarchies), a product catalog (SKU-level details, pricing, availability), a data warehouse (historical transactions), a streaming platform (real-time behavioral signals), and a document store (contracts, support transcripts, policy documents). Each source has its own schema, access pattern, and often its own API. This complexity creates two persistent challenges:Developer integration friction: Each layer introduces its own headaches, from authentication and authorization to schema mismatches, brittle connectors, and inconsistent formatsSystem fragility/maintenance drag: Over time, these integrations accumulate, introducing silent failures, versioning issues, and downstream reliability risks that make innovation slower and more expensiveMongoDB’s document model takes a fundamentally different approach. Instead of forcing diverse data into rigid schemas or hiding it in unreadable blobs, it enables rich, hierarchical data structures that mirror how businesses actually operate [1]. Developers can model a full customer, order, or event in a single document, including nested context, version history, and behavioral attributes. This eliminates the need for complex joins while preserving the relationships critical for effective agentic reasoning.Even more critically, flexible schema design, meaning the ability to store and query data without locking into a rigid blueprint, allows fields and document shapes to adapt as requirements change. This lets data evolve—new attributes can be added without downtime, and new types of signals can be integrated without costly migrations. For AI systems (especially those that learn, adapt, and extend themselves), this agility is essential.This architectural convergence enables structured transactions, real-time signals, and unstructured content together in a single query or operation. Model updates, enrichment jobs, or downstream agent actions can all be triggered directly from the same data platform [2]. That unified model lays the groundwork for sophisticated, AI-native workflows.Perhaps more importantly, unified data access transforms developer productivity. Instead of spending cycles reconciling formats or debugging brittle connectors, teams can focus on building intelligent systems. And, as we’ll see in the sections ahead, everything from data quality and governance to real-time orchestration builds on this foundation.Ensuring data quality and consistencyData quality and consistency are non-negotiable for GenAI solutions. Unlike traditional analytics, where data quality issues might simply yield incorrect reports or delayed insights, poor data quality in AI systems can cause hallucinations, introduce biased outputs, and fundamentally unreliable behavior that undermines user trust and business value.Legacy quality approaches tried to solve this through normalization, deduplication, and validation against external sources. Consider a familiar failure mode: a system validates Joe Miller, 12 High Street, through postal APIs and credit checks, yet fails to distinguish between three different JoeMillers (grandfather, father, and son) at the same address. For entity analytics, where precise relationship mapping matters, this is a critical flaw.In this scenario, an online store might unknowingly treat all three individuals as the same customer, losing the ability to tailor interactions or offers. Relational star schemas exacerbate this problem by fragmenting contextual information across multiple tables. When customer data is split between fact tables, dimension tables, and lookup tables, the rich context that enables accurate entity resolution becomes scattered and difficult to reconstruct.In our Joe Miller example, a document-based approach would maintain separate documents for each individual, complete with detailed demographic information, purchase history, behavioral patterns, and relationship data that enables clear differentiation.Within a document, you can store original values alongside enrichments and enhancements within the same dataset. This approach improves output reliability and reduces hallucinations or contradictory results. When an AI system generates an output, the complete chain of data sources, transformations, and reasoning steps can be traced back through the document structure, enabling both debugging and compliance reporting.This lineage capability proves essential for improving output reliability and reducing hallucinations or contradictory results. When AI models can access not just the current state of data but also its provenance and transformation history, they can make more informed decisions about data reliability and confidence levels. For example, customer service AI might weigh recent direct customer interactions more heavily than older inferred preferences, or flag potential inconsistencies when multiple data sources provide conflicting information.For organizations implementing document-based data quality strategies, MongoDB offers comprehensive best practices, as well as compatibility with industry-leading tooling for data modeling and cataloging that make advanced quality management achievable at scale [3]. When high-quality, lineage-aware data becomes the default, AI systems can deliver results that are accurate, explainable, and trustworthy.Real-time context and RAGThe definition of real-time varies significantly by use case and industry, but the real-time requirements of data in use with GenAI cannot be overstated. Hedge fund trading systems, for example, require millisecond responses, while life insurance underwriting processes measure time in days. While application response times continue to decrease, many architectures use caching layers that create an illusion of real-time performance at the expense of freshness of data.A typical real-time environment follows a simple pattern where an interaction generates a signal that enables immediate interpretation. These signals may originate from diverse sources, such as a retail website recording shopping cart additions, a smart meter transmitting electricity usage, or a pathology lab completing cancer analysis data. All signals, when combined with existing datasets, enable text search, vector search, and LLM processing for reasoning and causal analysis. This applies equally to interactive systems, such as retail shopping carts, and autonomous agentic systems, such as automated insurance claim processing.Real-time integration of signals with metadata, reference data, and historical information generates new knowledge instantaneously. Consider how this has evolved. Traditional rulebased systems might suggest "You ordered a burger, would you like fries?" In contrast, an AI-powered system recognizes patterns such as "You order cat food bi-weekly, always the same brand", and reasons contextually with suggestions such as "Based on your purchase history, you might be interested in our new, healthier formula. Would you like us tosend you a free sample?" The system identifies repeat customers and enhances their experience through reasoning that connects purchase patterns with product recommendations, requiring deeper knowledge about customer preferences and pet characteristics.Figure 3.2: Real-time AI data flowThe architectural flow in Figure 3.2 demonstrates how modern AI applications process realtime signals through a system of action database using an airline passenger assistance scenario. The flow begins with diverse signal sources on the left: Passenger Check-In, Booking Systems,Weather Services, and Flight Status Updates, which feed into Signal Processing and FormatConversion/Real-Time Capture components. These signals are then ingested into the central MongoDB Document Store, which contains Flight Documents, Passenger Vectors, Booking Metadata, and Historical Patterns with Direct Vector Access capabilities.The system processes this data through Atlas Vector Search (finding similar flight disruptions) and LLM Augmentation (generating personalized responses with flight context) to produce three types of intelligent outputs: Re-Booking Confirmations, Personalized Options, and Automated Responses. At the foundation sits the Operational Data Layer (ODL), an architectural pattern that centrally integrates and organizes siloed enterprise data, serving as an intermediary between existing data sources and consuming applications. In this case, the ODL enriches signals with contextual information from passenger records, alternative flights, weather data, and rebooking history.A continuous learning and enrichment feedback loop ensures that every interaction outcome, whether accepted re-bookings or user preferences, flows back to improve future recommendations. The document model enables continuous enhancement without requiring system restructuring, creating a system that grows smarter with each passenger interaction while delivering real-time, context-aware responses vital for modern AI applications.Critically, the feedback loop ensures continuous improvement, ensuring every interaction outcome enriches the system of action database, making future responses more accurate and contextual. This circular flow embodies the key advantage of document-based architectures: the ability to evolve and improve without the schema rigidity that constrains traditional relational systems. The result is a system that grows smarter with each interaction, delivering real-time, context-aware responses that modern AI applications require.Scalability, availability, and performanceHistorically, enterprise data warehouses represented the largest database implementations, with denormalized, column-oriented star schemas designed for analytical queries. These systems perform well with queries such as "Display yogurt sales by region", where large datasets are filtered by specific criteria (region, store, price) to generate insights. The integration of multiple sources led to the development of extract, transform, load (ETL) processes and master data management systems. While these platforms have added machine learning features and now claim to support GenAI capabilities, they remain primarily designed for backward-looking analytical tools, unsuited to real-time, agentic, and causal AI applications.Consider the contrast. A chatbot assisting an airline passenger who missed a connection requires fundamentally different capabilities than answering "How many passengers experienced day-long delays in Frankfurt last year?" The chatbot and its underlying agentic system must address immediate needs, finding available seats, offering mitigation services, and responding empathetically to frustrated passengers. The required data is real-time, context-sensitive, and simply not available from a historic warehouse.To be successful in the request for the passenger, the system needs both real-time seat information access (easy to achieve with an API to the usual booking systems), as well as more important detailed context and information about the passenger and their situation. Is it a family stranded, or a single adult? What other ticket dependencies exist? Can the passenger be rerouted via a different track, or is the best option to stay overnight?This scenario demands that all passenger data reside in an up-to-date system of action database, as real-time interactions fail without current information. As these systems achieve global coverage, non-functional requirements mandate not only 24/7/365 availability but also the ability to handle transaction volume fluctuations from quiet periods to peak travel seasons such as Thanksgiving. Even minimal outages become unacceptable, and caching solutions that simply solve a data availability challenge compromise on data accuracy by introducing data staleness issues.Document-based architectures, such as those provided by MongoDB, offer advantages in specific scenarios for this type of data availability and scalability. Rather than requiring complex joins across multiple tables to reconstruct user context, document models can store complete contextual information in a single, efficiently retrievable record. This approach reduces the computational overhead of context reconstruction while enabling more sophisticated caching and optimization strategies.The performance characteristics of AI workloads also differ significantly from traditional analytical patterns. While analytical queries typically process large volumes of data to generate aggregate results, AI applications often require rapid access to specific, contextually relevant information. This pattern favors architectures optimized for high-concurrency, low-latency access to individual records, rather than bulk processing of large datasets.Governance, security, and complianceGovernance and compliance requirements stem from a fundamental need to protect individuals from flawed decision-making in systems that lack adequate self-regulation. These safeguards exist to prevent real harms, from biased loan approvals to unsafe product recommendations.GenAI faces intense scrutiny regarding accuracy, with media coverage of hallucinations bringing this concern to the forefront. Therefore, transparency in data lineage, reasoning processes, and result interpretation becomes critical for any GenAI solution. The document model in a system of action database enables tracking of all changes, transformations, and actions related to specific datasets. Unlike legacy relational databases, documents offer the flexibility for enhancement and enrichment throughout the process without requiring upfront planning.From a governance perspective, this enables precise and comprehensive tracking of communication and decision-making processes. It facilitates decision auditing and corrective actions when compliance challenges arise, often due to gradual shifts in decision criteria requiring adjustment.Security represents an additional critical dimension. MongoDB’s Queryable Encryption keeps data absolutely protected from unauthorized access. While passenger data may have moderate sensitivity, healthcare provider consultations about potential illnesses require the highest security levels. The system of action database enables transparent security implementation, significantly more challenging when coordinating multiple data sources with potentially incompatible security and policy systems [4].Model training and fine-tuningTraining or fine-tuning models requires large volumes of clean, labeled, and diverse data. The system of action database ensures efficient data curation, sampling, and preprocessing for training pipelines. Data enrichment becomes key, as features such as MongoDB’s aggregation pipeline enable data annotation and continuous analysis of criteria such as minimum or maximum values and moving averages to validate reasoning processes.The subject of data preparation for GenAI is often misunderstood, stemming from the evolution of early AI solutions supporting ML systems (systems that were derived from business intelligence (BI) architectures). This sometimes leads to the mistaken assumption that all data for AI usage and interaction must first be prepared, or readied, in lakes, warehouses, or marts, requiring extensive transformation and data pipeline processing. The resulting data objects are often stored as star schemas with fact tables, each containing hundreds of columns and accompanying dimension tables. Star schemas, a data modeling format originally designed to solve the problem of performant analytics queries executed against relational database objects, introduce the need for complex queries and join operations to extract insight, an architecture still employed by platforms such as Snowflake.Apache Spark object-storage implementations, such as Databricks, offer more complex query capabilities through distributed computing frameworks and in-memory processing, representing a significant advancement over traditional batch processing systems. Both approaches, star schemas and Spark-manipulated object storage files, share a foundation in backward-looking data warehousing, regardless of contemporary terminology such as data lake or lakehouse.These systems are optimized for processing large volumes of homogeneous data aligned along dimensional axes. Real-time access to individual datasets for operational processing falls outside their design parameters. Historically, this was the realm of online transaction processing (OLTP) systems. While transactional logging isn’t central to GenAI data structures, the access patterns remain similar.Often, the example of building models for embeddings is referenced as justification for why the data warehouse must be the source of data for GenAI, but this is misleading. Firstly, many business solutions successfully deploy standard embedding models for PDFs, images, and audio, without the need for custom development. Secondly, and more importantly, the comparison doesn’t hold, as warehouses analyzing quarterly sales have no relevance to point-of-sale operations and transaction booking.ConclusionTo unlock the full potential of AI, enterprises must rethink their data architecture from the ground up. Systems of action represent this new paradigm, where data is not just stored or analyzed, but continuously activated to drive intelligent decisions in real time. Achieving this requires six foundational capabilities: unified data access, high data quality and consistency, real-time context integration, scalable performance, robust governance and security, and efficient model training pipelines.By adopting flexible, document-based architectures and eliminating data fragmentation, organizations can build systems that are not only faster and more responsive but also more trustworthy and adaptable. The result is a living data ecosystem, one that evolves with every interaction, improves decision accuracy, and enables truly autonomous, AI-driven operations.This article is an excerpt from the book Architectures for the Intelligent AI-Ready Enterprise. To explore these concepts in greater depth and learn how to implement them in real-world enterprise environments, readers can explore the full book here: Author BioBoris Bialek has worked in the IT industry since the 1990s and was one of the initial drivers of Linux in Europe, delivering the first SAP port to Linux, conducting the first benchmarks, and securing the first clients. Since then, he has led product and development teams across IBM and FIS, driving innovation for both the end product and development productivity. Boris Bialek joined MongoDB in 2019, igniting a focus on industry solutions based on MongoDB's document model. Promoted to global field CTO and VP of industries, he drives technical design. He works directly with numerous clients, helping them gain the benefits of the MongoDB Atlas data platform. Boris holds a master's in computer science from the Karlsruhe Institute of Technology.Sebastian Rojas Arbulu is an industry solutions specialist at MongoDB, where he collaborates with numerous stakeholders across diverse industries to help customers realize the transformative value of MongoDB through tailored, data-driven solutions, particularly for AI integration. Sebastian also leads his team's content strategy, including numerous additions such as blogs, white papers, magazines, and other thought leadership pieces. With a background in IT consulting, marketing, and digital transformation, among other areas, he has extensive experience in identifying customer needs and developing innovative solutions that prepare data for intelligent applications and unlock new possibilities. He holds a bachelor of business administration degree.Taylor Hedgecock is a strategic program leader and transformation partner who turns vision into velocity. With a career spanning startups to multinationals, she brings a mix of operational rigor, narrative clarity, and cross-functional orchestration. At MongoDB, she has led high-impact programs across AI, partner ecosystems, and services modernization, often serving as the connective tissue between vision and execution. Her work has guided C-level priorities, enabled go-to-market readiness, and driven large-scale change, establishing her as a trusted leader in aligning stakeholders, translating strategy into story, and driving outcomes that last. Taylor currently serves as senior program manager on the industry solutions team, partnering with ISVs and AI innovators to bring next-generation solutions to market. Previously, she was chief of staff for professional services leadership, where she helped launch new offerings and guided modernization strategy, shaping MongoDB's vision for applying AI to its hardest problems.

0
0

article-image-large-language-models-and-graph-machine-learning

Aldo Marzullo, Enrico Deusebio, Claudio Stamile

24 Mar 2026

10 min read

Large Language Models and Graph Machine Learning

Aldo Marzullo, Enrico Deusebio, Claudio Stamile

24 Mar 2026

10 min read

DataPro is a weekly, expert-curated newsletter trusted by 120k+ global data professionals. Built by data practitioners, it blends first-hand industry experience with practical insights and peer-driven learning.Make sure to subscribe here so you never miss a key update in the data world. IntroductionLarge Language Models (LLMs) have transformed the landscape of artificial intelligence, redefining how machines understand and generate human language. From early statistical methods to the breakthrough of Transformer architectures, LLMs such as GPT, BERT, and T5 have unlocked unprecedented capabilities in natural language processing (NLP). As organizations increasingly rely on both structured and unstructured data, a powerful new paradigm is emerging: the integration of LLMs with Graph Machine Learning (GraphML). This combination enables systems to leverage both deep contextual language understanding and rich relational data, paving the way for more accurate, scalable, and intelligent AI applications across domains like search, recommendation systems, and knowledge graphs.LLMs: an overviewIn the rapidly evolving field of artificial intelligence, LLMs have significantly advanced natural language processing (NLP) and understanding. These models, characterized by their extensive number of parameters and trained on large datasets, have demonstrated remarkable capabilities across a wide set of language-related tasks.The journey of language models began with statistical approaches that relied on probabilistic methods to predict word sequences. These early models, while creating the foundations, were limited by their reliance on fixed-size context windows and the inability to capture long-range dependencies. However, as we also discussed in Chapter 4, Unsupervised Graph Learning, with the advent of neural networks, the field has undergone a significant shift, introducing models capable of learning word embeddings. In order to improve the ability to capture long-range dependencies, the initial neural network models were based on the Long-Short Term Memory(LSTM) and Gate-Recurrent Unit (GRU) architecture, which are forms of Recurrent Neural Networks (RNNs). However, a pivotal moment occurred with the introduction of the Transformer architecture by Vaswani et al. in 2017. Unlike its predecessors, the Transformer model utilized self-attention mechanisms, enabling it to consider the entire context of a sentence without the sequential constraints inherent in RNNs. This innovation facilitated the development of models capable of processing and generating text in a more coherent and fluent way.Building upon the Transformer architecture, researchers scaled models to unprecedented sizes, leading to the emergence of LLMs such as OpenAI’s GPT series, Google’s BERT and T5, and more recently, models such as GPT-3 and GPT-4.In a nutshell, training LLMs involves optimizing a large number of parameters on very large datasets. This process, known as pretraining, typically employs unsupervised learning objectives, such as predicting missing words in a sentence (masked language modeling) or forecasting subsequent words (causal language modeling). As a side effect, the pre-training phase lets the model learn and “understand” a language, resulting in a remarkable ability to generalize across various tasks, often achieving state-of-the-art performance. LLMs have demonstrated proficiency in a diverse array of applications, reflecting their versatility and depth of language understanding. Key areas include text generation, language translation, question answering, and summarization, among many others.Given the strengths of LLMs in unstructured text processing and generative tasks, an exciting frontier emerges when we consider their integration with graphs. While LLMs excel in understanding and generating natural language, graphs are particularly powerful for representing and analyzing structured relationships between entities. In the rest of the book, wewill see examples of how we can take advantage of both.Why combine GraphML with LLMs?As we have learned throughout this book, GraphML excels at representing and analyzing structured data such as knowledge graphs, social networks, chemical structures, and so on. It is extremely useful for situations where exploiting relationships between entities is crucial for achieving good performances. However, LLMs are particularly good at interpreting unstructured text, offering generative skills, reasoning, and profound contextual awareness. When it comes to language-based activities such as content creation, question answering, and summarization, they excel.Despite their impressive capabilities, LLMs are not without limitations. One of the most significant challenges is the problem of hallucination, where an LLM generates factually incorrect or misleading information that appears plausible. This is particularly problematic in domains requiring high factual accuracy, such as healthcare, finance, and legal applications. To mitigate hallucinations and enhance the reliability of LLM outputs, Retrieval-Augmented Generation (RAG) has emerged as a powerful technique. RAG works by dynamically retrieving relevant information from an external knowledge source (such as a knowledge graph) at inference time, rather than just relying on pre-trained knowledge. This approach ensures that the model has access to up-to-date and accurate data, grounding answers in verified information rather than generating content purely from its internal representations.Recent advancements highlight how integrating GraphML with LLMs can drive significant innovation, enabling the development of applications that require both rich semantic understanding and relational analysis. For instance:Graph-Augmented Question Answering: LLMs can leverage knowledge graphs to answer domain-specific questions with factual accuracy.Node Embedding Generation: State-of-the-art frameworks such as GraphGPT use LLMs to generate node embeddings directly from textual data, enabling seamless integration with graph structures.Knowledge Graph Construction and Enhancement: Recent applications have shown how LLMs can be used to enrich knowledge graphs, where LLMs are used to extract semantic relationships and entities from text to enhance existing graph data.Therefore, by bridging the gap between structured knowledge and natural language understanding, the synergy between GraphML and LLMs paves the way for more accurate, explainable, and intelligent systems.In the next section, we will explore the state-of-the-art trends in combining GraphML and LLMs, as well as the current challenges.State-of-the-art trends and challengesBefore diving into specific examples, it is crucial to understand the current landscape ofGraphML and LLM integration. According to a recent survey by Jin et al. (https://arxiv.org/ abs/2312.02783, 2024), the application scenario can be categorized into three main scenarios:Pure Graphs: These are graphs that lack associated textual information. Examples include social networks, traffic networks, and protein interaction networks. In such cases, the focus is on leveraging LLMs to process and analyze the structural aspects of the graph data.Text-Attributed Graphs: In these graphs, nodes or edges are enriched with textual attributes. For instance, in academic networks, papers (nodes) come with titles and abstracts, while authors (nodes) have profiles. E-commerce networks also fall into this category, where products (nodes) have descriptions, and user interactions (edges) may include reviews. The challenge here is to effectively combine the textual content with the graph’s structural information.Text-Paired Graphs: This scenario involves graphs that are paired with separate textual descriptions or documents. Unlike text-attributed graphs, where text is embedded within the graph as attributes, text-paired graphs treat the graph and text as distinct but related entities. A pertinent example is molecular graphs accompanied by detailed textual descriptions of their properties. The objective is to align and integrate the information from both the graph structure and the associated text to enhance understanding and analysis.To effectively utilize LLMs in these scenarios, three primary techniques can be used: LLMs as predictors, LLMs as encoders, and LLMs as aligners. Let’s see these approaches one by one.LLMs as predictorsThe simplest and most direct approach is to use LLMs as predictors. In this paradigm, the LLM operates as a tool to infer outcomes directly from graph data. Imagine a scenario where textual information is either minimal or entirely absent (pure graphs). In this case, you can transform the graph data into a format that the LLM can process, such as converting graph structures into sequences or textual descriptions.For instance, consider a simple social network graph where nodes represent people and edges indicate friendships (Figure 12.1). These features can be converted into a textual narrative, such as Alice is linked with Bob. An LLM can then process this narrative to predict new relationships or infer additional attributes about the nodes, such as professional interests or potential connections.Figure 12.1: Examples of how graphs can be converted to text narrativesOnce the data is prepared, the LLM can be fine-tuned or prompted to perform specific tasks. These might include predicting node classifications, such as identifying the role of individuals in a social network, or link predictions, such as forecasting interactions between entities. In molecular research, LLMs as predictors can help determine the properties of chemical compounds based solely on their structural representations.One advantage of this approach is its simplicity: LLMs can be applied directly to graph data without requiring extensive preprocessing or specialized models. However, this simplicity can also be a limitation. Purely structural information might not always be sufficient for complex tasks, particularly when additional contextual or textual data is available but not leveraged. Moreover, scalability and cost must be considered: encoding entire graphs as text can lead to an explosion of sentences, making inference expensive, potentially inefficient, and sometimes impossible (for example, if the maximum number of words an LLM can process at once is too small to contain the whole graph). Performance may also be limited, as this approach is similar to providing an LLM with a structured dataset and expecting accurate predictions without tailored adaptations. For this reason, more complex graph2text formalisms can be designed, incorporating node/edge descriptions into textual narratives while balancing efficiency and accuracy.LLMs as encodersWhen graphs are enriched with textual attributes, the LLM as encoder approach becomes particularly powerful. Here, the LLM is tasked with processing and encoding the textual information associated with nodes or edges, producing meaningful representations that can be integrated with the graph’s structural features. These embeddings are then integrated into the graph through proper algorithms such as graph neural networks, which process the combined representation to perform downstream tasks.This hybrid representation combines the strengths of both modalities, capturing the nuances of text alongside the relationships encoded in the graph. As depicted in Figure 12.2, each node could have attributes, such as a name and a brief bio for a node representing a person, while the edges might be annotated with information about the nature of the relation, e.g., close friend or colleague for a graph representing social networks. These features can be converted into a textual narrative, such as Alice, a software engineer, is close friends with Bob, a data scientist.Figure 12.2: Examples of how LMMs can be used as encoders for node attributesOther examples include academic citation networks, where papers (nodes) come with titles, abstracts, and keywords. An LLM can process these textual attributes to generate embeddings that encapsulate their semantic content. These embeddings are then combined with graph-specific features, such as the citation relationships between papers, to create a unified representation. Similarly, in e-commerce platforms, product descriptions and user reviews can be encoded by LLMs to enhance product similarity graphs or user behavior analysis.It is worth noticing that the process of using LLMs as encoders typically involves fine-tuning the LLM on domain-specific textual data to ensure that the embeddings accurately reflect the requirements of the task.This encoder approach offers significant benefits. By leveraging textual data, it captures context and nuances that purely structural methods might miss. It is particularly effective in scenarios where textual attributes can provide critical insights, such as identifying the themes of academic papers or understanding user preferences in recommendation systems.LLMs as alignersThe goal here is to align and integrate the information from both structure and textual descriptions (or accompanying documents, in the case of text-paired graphs), enabling a comprehensive analysis that leverages the strengths of each. This can be achieved, for example, by finding a shared latent space or a semantic mapping that connects the two modalities. Such an approach might involve designing models that jointly optimize both modalities or using attention mechanisms to focus on the most relevant parts of each input.In more detail, the synergy between textual encoding (handled by the LLM) and graph structure encoding (handled by, for example, a GNN), can be typically in two ways:1. Prediction Alignment: Iterative training where LLMs and GNNs generate pseudo-labels to guide each other’s learning2. Latent Space Alignment: Contrastive learning to align the latent representation of the text and the graph structure in a shared space (e.g., Figure 12.3)Figure 12.3: Graphs and associated texts can be embedded in a shared latent spaceFor example, in molecular research, a molecular graph might represent the structure of a compound, while a textual description provides information about its properties, synthesis, or applications. In this context, an LLM can be used to process the text to extract relevant features and align these with the structural characteristics of the graph, enabling tasks such as property prediction or drug discovery.As you can imagine, this approach is particularly powerful in interdisciplinary fields where graphs and text provide complementary points of view. In computational social science, for instance, social graphs representing interactions between individuals can be aligned with news articles, social media posts, or other textual data to study the spread of information or public sentiment. Similarly, in e-commerce, user behavior graphs can be integrated with textual reviews to improve personalized recommendations.Now that we have a clearer understanding of the LLM and graph landscape, let’s dive into a practical example of how this integration works. We will explore this in the next section.ConclusionIn summary, the convergence of LLMs and GraphML represents a major step forward in building next-generation AI systems that combine natural language intelligence with graph-based relational reasoning. By addressing challenges such as hallucination through techniques like Retrieval-Augmented Generation (RAG), and by applying frameworks where LLMs act as predictors, encoders, and aligners, this hybrid approach opens new possibilities for question answering, recommendation systems, knowledge graph enhancement, and explainable AI.This article is an excerpt from the book Graph Machine Learning – Second Edition. Readers who want to explore these concepts in greater depth, along with practical examples and broader GraphML applications, can continue reading in the book.Author BioAldo Marzullo received an M.Sc. degree in computer science from the University of Calabria (Cosenza, Italy) in September 2016. During his studies, he developed a solid background in several areas, including algorithm design, graph theory, and machine learning. In January 2020, he received his joint Ph.D. from the University of Calabria and Université Claude Bernard Lyon 1 (Lyon, France), with a thesis titled Deep Learning and Graph Theory for Brain Connectivity Analysis in Multiple Sclerosis. He is currently a postdoctoral researcher and collaborates with several international institutions.Enrico Deusebio is currently working as an engineering manager at Canonical, the publisher of Ubuntu, to promote open source technologies in the data and AI space and to make them more accessible to everyone. He has been working with data and distributed computing for over 15 years, both in an academic and industrial context, helping organizations implement data-driven strategies and build AI-powered solutions. He has collaborated and worked with top-tier universities, such as the University of Cambridge, the University of Turin, and the Royal Institute of Technology (KTH) in Stockholm, where he obtained a Ph.D. in 2014. He holds a B.Sc. and an M.Sc. degree in aerospace engineering from Politecnico di Torino.Claudio Stamile received an M.Sc. degree in computer science from the University of Calabria (Cosenza, Italy) in September 2013 and, in September 2017, he received his joint Ph.D. from KU Leuven (Leuven, Belgium) and Université Claude Bernard Lyon 1 (Lyon, France). During his career, he developed a solid background in AI, graph theory, and machine learning with a focus on the biomedical field.

0
0

article-image-mongodb-crud-operations-explained-create-read-update-delete-with-mongosh-and-python

Rachelle Palmer, Jeffrey Allen, Parker Faucher, Alison Huh, Lander Kerbey, Maya Raman, and Lauren Tran

16 Mar 2026

10 min read

MongoDB CRUD Operations Explained: Create, Read, Update, Delete with mongosh and Python

Rachelle Palmer, Jeffrey Allen, Parker Faucher, Alison Huh, Lander Kerbey, Maya Raman, and Lauren Tran

16 Mar 2026

10 min read

Our Data Engineering Byte Newsletter gives data engineers and practitioners what they often lack today: clear, real-world insights—where every byte tells a story.Subscribe here to stay ahead in data engineeringIntroductionCRUD operations form the backbone of how developers interact with MongoDB databases. Whether you're building a small prototype or scaling a production application, understanding how to create, read, update, and delete data is essential. These four operations allow developers to insert new documents, retrieve existing information, modify stored data, and remove records when they are no longer needed.In MongoDB, CRUD operations can be performed using different tools and programming environments. Developers often begin with mongosh (MongoDB Shell) for direct database interaction and testing queries, while application development typically uses official drivers such as PyMongo for Python. This excerpt introduces the core CRUD concepts and demonstrates how to perform them using both mongosh and the MongoDB Python driver, helping developers understand the fundamental workflows for managing data in MongoDB.MongoDB CRUD operationsCRUD operations are the foundation of interacting with a MongoDB deployment. These operations require you to connect to the MongoDB server before you can query the relevant documents, adjust the specified properties, and subsequently, transmit the data back to the database for updates. Each CRUD operation serves a distinct purpose:The create operation creates and inserts new documents into the databaseThe read operation queries a document or documents in the databaseThe update operation modifies existing documents in the databaseThe delete operation removes documents from the databaseBasic CRUD with mongoshmongosh, also known as the MongoDB Shell, is an environment for hosting MongoDB deployments. It is equivalent to the administration console that relational databases use. You can download mongosh at https://www.mongodb.com/docs/mongodb-shell/install/.This section explains how to connect to your deployment with mongosh, insert a document into a database, query your database for a specified document, update a document, and delete a document. For more in-depth information and examples, you can also refer to the CRUD operations section of the MongoDB database manual at https://www.mongodb.com/docs/manual/crud/.Connecting to MongoDBBefore you can perform CRUD operations, you must connect to your MongoDB deployment. To connect to mongosh, you must specify your deployment connection string as well as any specified parameters.For example, enter a variation of the following code block in your terminal to connect to a MongoDB deployment with mongosh:mongosh "mongodb+srv://mycluster.packt.mongodb.net/myDatabase" --username myUsername --password myPasswordCreating documentsTo create a document in mongosh, you can use the db.collection.insertOne() command. This mongosh command creates a new document and inserts the created document into the specified collection.For example, the following db.collection.insertOne()command creates and inserts a new document into the library collection of the database. The new document has a title field set to Mastering MongoDB 8.0 and an isbn field set to 101:db.library.insertOne( { title: 'Mastering MongoDB 8.0', isbn: '101' } )When you successfully create and insert a document, MongoDB returns a confirmation output that contains the ObjectID value of the new document:{ acknowledged: true, insertedId: ObjectId("652024e7ab44f3bf77788a3d") }You can also use the following mongosh commands to update documents:db.collection.insertMany()db.collection.updateOne() with the upsert: true optiondb.collection.updateMany() with the upsert: true optiondb.collection.findAndModify() with the upsert: true optiondb.collection.findOneAndUpdate() with the upsert: true optiondb.collection.findOneAndReplace() with the upsert: true optionReading documentsRead operations in MongoDB are also called queries. To perform a basic read operation, or query, for a single document, use the db.collection.findOne() method and specify the selection criteria of the document that you want to read.For example, the following operation queries for a document in the library collection that contains the isbn field value of 101:db.library.find( { isbn: '101' } )If the library collection contains a document that matches your specified selection criteria, MongoDB returns an array that contains the document that matches:[ { _id: ObjectId("652024e7ab44f3bf77788a3d"), title: 'Mastering MongoDB 8.0', isbn: '101' } ]You can also use the following mongosh commands to read documents:db.collection.findOne()Updating documentsTo update a document, you can use the db.collection.updateOne() command. This command finds a document that matches the specified criteria and updates the found document.For example, the following code finds the document in the library collection that has an isbn field value of 101 and updates that document so that it contains a price field with a value of 30:db.library.updateOne( { isbn: '101' }, { $set: { price: 30 } } )When you successfully update a document, MongoDB returns the following confirmation summary:{ acknowledged: true, insertedId: null, matchedCount: 1, modifiedCount: 1, upsertedCount: 0 }You can also use the following mongosh commands to update documents:db.collection.updateMany()db.collection.replaceOne()db.collection.findOneAndReplace()db.collection.findOneAndUpdate()db.collection.findAndModify()MongoDB 8.0 also introduces the ability to sort documents within an update operation. If you specify a sort parameter in an operation that uses db.collection.updateOne(), db.collection. replaceOne(), or update, MongoDB sorts documents before updating to be able to select a specific document for the operation when multiple documents match the query.For example, the following update operation sorts all books by price and updates the lowest price book to be on sale for 25% off:db.library.updateOne( { }, [ { $set: { price: { $multiply: ["$price", 0.75] } } } ], { sort: { price: 1 } } )Deleting documentsTo delete a document, you can use the db.collection.deleteOne() command. This command queries for a document that matches specified criteria and deletes the found document.For example, the following operation deletes a document in the library collection that has an isbn field value of 101:db.library.deleteOne( { isbn: '101' } )When you successfully delete a document, MongoDB returns the following output:{ acknowledged: true, deletedCount: 1 }You can also use the following mongosh commands to delete documents:db.collection.deleteMany()db.collection.remove()db.collection.findOneAndDelete()db.collection.findAndModify()Basic CRUD with the Python driverAs an alternative to mongosh, MongoDB provides official drivers to interact with your MongoDB deployment. For a full list of official MongoDB driver libraries, see https://www.mongodb.com/ docs/drivers/.The following sections walk you through performing CRUD operations with the MongoDB Python driver, PyMongo. PyMongo provides a seamless bridge between the dynamic world of Python programming and the efficient, document-oriented NoSQL database of MongoDB.Installing and connecting to PyMongoYou can use pip to install PyMongo. You must have Python installed prior to installing PyMongo.For example, the following operation installs PyMongo through your terminal:To connect to your MongoDB deployment with PyMongo, you must use your connection string.For example, the following code snippet in a .py file connects to a MongoDB deployment:from pymongo import MongoClient uri = "<connection string>" client = MongoClient(uri) try: client.admin.command('ping') print("Pinged your deployment. You successfully connected to MongoDB!") except Exception as e: print(e)If you run the preceding code in a .py file and successfully connect to your deployment, MongoDB returns the following confirmation:"Pinged your deployment. You successfully connected to MongoDB!"You can test your connection to your MongoDB deployment on Atlas with the following code block in a .py file. This code block uses the asyncio asynchronous framework:import asyncio from motor.motor_asyncio import AsyncIOMotorClient from pymongo.server_api import ServerApi async def ping_server(): uri = "<connection string>" client = AsyncIOMotorClient(uri, server_api=ServerApi('1')) try: await client.admin.command('ping') print("Pinged your deployment. You successfully connected to MongoDB!") except Exception as e: print(e) asyncio.run(ping_server())If you run the preceding code in a .py file and successfully connect to your deployment, MongoDB returns the following confirmation:"Pinged your deployment. You successfully connected to MongoDB!".Creating documentsTo create and insert documents with PyMongo, use the insert_one command.For example, the code block that follows performs the following operations:Specifies the library collection within the resources databaseDefines a new document that represents a book titled Python and MongoDBInserts the new document into the library databasePrints the ObjectId value of the new documentlibrary = client.resources.library book = { 'isbn': '301', 'name': 'Python and MongoDB', 'meta': {'version': 'MongoDB 8.0'}, 'price': 60 } insert_result = library.insert_one(book) print(insert_result)If you successfully insert a document with this script, MongoDB prints the ObjectId value of the inserted document.Reading documentsYou can read documents from your database with queries.To read a document, specify the condition(s) of the document you want to read with a find_onecommand.For example, the code block that follows performs the following actions:Specifies the library collection in the resources databaseFinds documents with the name value of Python and MongoDB Prints the found documentlibrary = client.resources.library result = library.find_one( {"name": "Python and MongoDB"} ) print (result)Updating documentsTo update a document with PyMongo, use the update_one command. This command finds a document based on specified criteria and updates the found document.For example, the code block that follows performs the following operations:Finds the document where the name field is Advanced MongoDB TechniquesSets the price field of the found document to the value of 75Prints the result of the update_one operation Prints the updated documentupdate_result = library.update_one( { "name": "Advanced MongoDB Techniques"}, { "$set": { "price": 75 } } ) print(update_result.raw_result) updated_document = library.find_one( {"name": "Advanced MongoDB Techniques"} ) print(updated_document)Deleting a documentTo delete a document with PyMongo, use the delete_one command. For example, the code block that follows performs the following operations:Specifies the isbn value of the book to delete as 303Uses the delete_one command to delete a book from the library collection with the desired isbn valuePrints the result of the delete_one operationisbn_to_delete = '303' delete_result = library.delete_one({"isbn": isbn_to_delete}) print(delete_result.raw_result)If the operation successfully deleted the desired document, MongoDB returns the following confirmation:{'n': 1, 'ok': 1.0}The n value indicates the number of documents that MongoDB deleted. In this case, MongoDB deleted one document. The ok value indicates whether the operation caused any errors. The 1.0 value in the preceding confirmation signifies that MongoDB did not encounter an error with this operation.ConclusionCRUD operations are the fundamental building blocks for working with MongoDB. By learning how to connect to a MongoDB deployment and perform create, read, update, and delete operations, developers gain the ability to manage application data efficiently. Tools such as mongosh provide a powerful command-line environment for direct database interaction, while drivers like PyMongo allow seamless integration between MongoDB and programming languages such as Python. Together, these tools enable developers to prototype, test, and build robust data-driven applications.This section provides a practical overview of MongoDB CRUD workflows, illustrating how documents can be inserted, queried, modified, and removed using both the MongoDB Shell and the Python driver.This content is an excerpt from The Official MongoDB Guide, published by Packt and MongoDB, and written by Rachelle Palmer, Jeffrey Allen, Parker Faucher, Alison Huh, Lander Kerbey, Maya Raman, and Lauren Tran.Author BioRachelle Palmer is the Product Leader for Developer Database Experience and Developer Education at MongoDB, overseeing the driver client libraries, documentation, framework integrations, and MongoDB University. She has built sample applications for MongoDB in Java, PHP, Rust, Python, Node.js, and Ruby. Rachelle joined MongoDB in 2013 and was previously the Director of the Technical Services Engineering team, creating and managing the team that provided support and CloudOps to MongoDB Atlas.Jeffery Allen is a Technical Writer at MongoDB, based in the New York City area. He focuses on server documentation and works closely with Product and Engineering teams to develop examples that reflect real-world use cases, especially pertaining to data modeling and the MongoDB query language. Before joining MongoDB, Jeff worked as a full-stack web developer. Jeff also enjoys playing guitar and piano and produces electronic music in his spare time.Parker Faucher is a self-taught Software Engineer with over six years of experience in technical education. He has authored more than 100 educational videos for MongoDB, establishing himself as a knowledgeable resource in database technologies. Currently, Parker focuses on Artificial Intelligence and Search technologies, exploring innovative solutions in these rapidly evolving fields. hen not advancing his technical expertise, Parker enjoys spending quality time with his family and maintains an avid interest in collecting comic books.Lander Kerbey is a Technical Writer at MongoDB, specializing in Atlas Stream Processing. Prior to MongoDB, he worked at InterSystems, documenting various parts of the HealthShare suite, along with their various data analytics tools. His 11 years of experience as an educator informs his approach to documentation, as he strives to create “Aha!” moments for users.Maya Raman is a Technical Writer and sometimes-software engineer at MongoDB. She is passionate about the intersections between the environment, art and literature, technology, and people. She is based in New York City and likes to spend her time hanging out in Prospect Park, even in winter.Lauren Tran is a Technical Writer at MongoDB with a background in Communications and Computer Science. She is on the Server Documentation team and mainly specializes in Information Architecture, Time Series data, and Search. Lauren is passionate about creating accessible and inclusive documentation that caters to diverse audiences. She is based in Chicago following four years in New York City and in her free time, enjoys reading on the beach at Lake Michigan and listening to Taylor Swift music.

0
0

article-image-how-to-fine-tune-gemma-llm-for-low-resource-languages-kaggle-winning-strategies

Luca Massaron, Bojan Tunguz, Konrad Banachewicz

27 Feb 2026

10 min read

How to Fine-Tune Gemma LLM for Low-Resource Languages (Kaggle Winning Strategies)

Luca Massaron, Bojan Tunguz, Konrad Banachewicz

27 Feb 2026

10 min read

0
0

article-image-creating-a-semantic-model-in-power-bi

Greg Deckler

12 Feb 2026

10 min read

Creating a Semantic Model in Power BI

Greg Deckler

12 Feb 2026

10 min read

Make sure to subscribe our BIPro Newsletter so you never miss a key update in the data world. Join over 35K+ BI lovers and get tips from those who’ve cracked tough data challenges!IntroductionA well-designed semantic model is the foundation of every effective Power BI report. It defines how data tables connect, how calculations behave, and how insights are ultimately presented to users. Without a solid semantic model, even the most visually appealing reports can produce misleading or inconsistent results. In this article, you’ll learn how to create and explore a semantic model in Power BI, understand the Model view, and build relationships that allow data to aggregate correctly across tables, setting the stage for accurate analysis and meaningful business insights. Technical requirementsThe following are needed in order to successfully complete the instructions provided in this chapter:An internet connectionMicrosoft Power BI DesktopChapter 5 Start.pbix downloaded from GitHub at https://github.com/PacktPublishing/Learn-Microsoft-Power-BI_3ECreating a semantic modelThe concept of a semantic model, data model, or dataset is fundamental to Power BI. In short, a semantic model is defined by the tables that are created from Power Query queries, the metadata (data about data) regarding the columns within the tables, and finally, the relationships that are defined between tables. Relationships are needed to connect individual tables to one another. In Power BI, the semantic model is stored within an Analysis Services tabular cube. It is the creation of this semantic model that enables self-service analytics and reporting.In Chapter 4, Connecting to and Transforming Data, we connected to various sources of data (three different Excel files) that in turn created seven different queries, which ultimately resulted in four queries that loaded data tables into our semantic model. We can now stitch those individual tables, along with our previously created data table, into a cohesive semantic model that is used for further analysis.If you are continuing from Chapter 4, continue to use the same Power BI Desktop file after completing that chapter. Otherwise, download and load Chapter 5 Start.pbix as specified in the Technical requirements section of this chapter.Touring the Model viewSo far, we have explored the overall architecture of Power BI Desktop and Power Query Editor. We will now explore the Model view within Power BI Desktop. To switch to the Model view, click the third icon from the top in the Views bar of Power BI Desktop.The Model view is shown in the following screenshot:Figure 5.1 – Model viewThe Model view provides an interface for building our semantic model. It does this by creating relationships between tables, as well as defining metadata for tables and columns. We can even create multiple layouts for our semantic model. The Model view interface is similar to other Desktop views, as shown in Figure 5.1:Header: As shown in Figure 5.1, the Header area is identical to what was described in the Touring the Desktop section of Chapter 3, Up and Running with Power BI Desktop.Ribbon: The Ribbon is nearly identical to what we described in the Touring the Desktop section of Chapter 3, Up and Running with Power BI Desktop, with the notable exception that only three tabs are available—that is, File, Home, and Help. If third-party extensions are installed, the External tools tab is also present.Views: The Views area is identical to what was described in the Touring the desktop section of Chapter 3, Up and Running with Power BI Desktop.Canvas: As mentioned in the Touring the desktop section of Chapter 3, Up and Running with Power BI Desktop, when in the Model view, this area displays layouts of tables within the semantic model, as well as their relationships to one another. A default All tables layout is created automatically.Panes: As described in the Touring the desktop section of Chapter 3, Up and Running with Power BI Desktop, only two panes are present—the omnipresent Data pane and the Properties pane. The Properties pane is exclusive to the Model view and allows us to associate metadata with various fields or columns within the tables of the semantic model. This includes the ability to specify synonyms and descriptions, as well as data types, data categories, and default aggregations or summarizations.Layouts: The Pages area from the Report view is replaced by Layouts. The Layouts area allows us to create multiple layouts or views of the data tables within the model, as well as to rename and reorder the layouts.Footer: As described in the Touring the desktop section of Chapter 3, Up and Running with Power BI Desktop, in the Model view, the Footer area provides various viewing controls, such as the ability to zoom in and out, reset the layout, and fit the model to the current display area.This completes our tour of the Model view. Let’s next see how we can change the layout of our semantic model.Modifying the layoutIn the Model view, we have a default All tables layout, which was created for us automatically by Power BI.To modify this layout, follow these steps:1. Minimize the Data and Properties panes by clicking on the arrow icon in the pane headers (>).2. Click on the tables and drag them closer together. Use the Fit to screen icon at the far right of the Footer to zoom in on the table layout. You should now be able to clearly see the table names and columns in the tables.3. Move the Calendar and People tables next to one another in the top center of the Canvas by clicking on and then dragging and dropping. Place the Budgets and Forecasts and Hours tables underneath these two tables. Use the Fit to screen icon in the footer to zoom in on the tables. Note that we cannot see all of the columns in the Hours table or the Calendar table. A right-hand scroll bar is present on both tables.4. Use the sizing handle in the bottom-right corner of the table to adjust the size so that we can see all of the columns in the table. When finished, your Canvas should look similar to this: Figure 5.2 – Data tables in the modelNow that we have modified the layout such that we can easily see all tables on the screen, let’s continue building our semantic model by defining relationships between the individual tables.Creating and understanding relationshipsNow that we can clearly see our tables and columns, we can create relationships between our tables. Creating relationships between tables allows calculations and aggregations to work across tables so that multiple columns can be used from separate, related tables.For example, once the People table and the Hours table are related to one another, we can use the ID column from the People table and the TotalHoursBilled column from the Hours table. By doing this, TotalHoursBilled aggregates correctly for each user Identifier (ID) in the People table.To create this relationship between the People table and the Hours table, click on the ID column in the People table and drag and drop it onto the EmployeeID column in the Hours table. The New Relationship dialog is displayed, as follows:Figure – New relationship dialogClick the Save button and note that a line appears on the canvas that connects the People table to the Hours table. This creates a relationship between the ID column in the People table and the EmployeeID column in the Hours table.You can check the columns that are involved in a relationship by hovering your mouse over the relationship line. The line changes slightly and the columns involved in the relationship become shaded.Figure – Relationship between two tablesIf you notice that ID and EmployeeID are not columns associated with the relationship, hover over the relationship line, right-click it, and choose Delete. Then, try again.Note that the line has 1 next to the People table and * next to the Hours table. This means that this relationship is one-to-many or many-to-one.In other words, there are unique row values in the People table that match multiple rows in the Hours table. This makes sense since each employee submits an Hours report for every day. The designation of 1 (unique) or * (many) defines the cardinality of the relationship between the tables.There are actually four different cardinalities for relationships in Power BI, as outlined here:One-to-one: This means that there are unique values in each table.Many-to-one: This means that there are unique values in one table that match multiple rows in another table.One-to-many: This means that there are unique values in one table that match multiple rows in another table.Many-to-many: This means that neither table has unique values for rows. It is generally good practice to avoid these types of relationships because of their complexity and the amount of processing and resources required.The designation of many-to-one versus one-to-many is simply a matter of which table is defined first within a relationship. In other words, the relationship between our People table and Hours table could be either many-to-one or one-to-many, depending on which table we defined first in our relationship.Note that there is also an arrow icon in the middle of the line that points from People to Hours. This indicates that the People table filters the Hours table, but not vice versa. This is known as the cross-filter direction. Cross-filter directions can be either Single or Both, meaning that the filtering occurs only one way or bidirectionally. In simple semantic models, you generally don’t have to worry about cross-filter direction, but it can become very important if the complexity of the semantic model increases.Finally, note that the line that forms the relationship between the tables is solid. A solid line indicates an active relationship. Inactive relationships can be created between tables, and these are represented as a dashed line. In Power BI, there can only be a single active path between tables.As models become more complex, multiple paths from one table to another can be created. In those cases, one or more of the relationships or pathways become inactive. However, even though the relationship is inactive by default, it can be used within calculations that specify using specific relationships.We can view and modify the relationship definition by double-clicking on the relationship line. This brings up the Edit relationship dialog, which looks nearly identical to the New relationship dialog shown in Figure 5.3.Note that since the Hours table is defined first and the People table is defined second, the Cardinality value of our relationship is Many to one (*:1). Also, note that the relationship is active and that the Cross filter direction value is Single.In many-to-one and one-to-many relationships, a Cross filter direction value of Single means that the one side of the relationship filters the many side. Finally, note that the EmployeeID column in the Hours table and the ID column in the People table are highlighted in gray. This shows us the columns that form a relationship between the two tables. Close the Edit relationship dialog by clicking the Cancel button.Now, let’s create another relationship, this time between our Calendar table and the Hours table. To do this, we can use the Manage relationships functionality, as follows:1. Click on the Home tab of the ribbon. Then, in the Relationships section, choose the Manage relationships button. The Manage relationships dialog is displayed, as shown in the following screenshot:Figure 5.5 – Manage relationships dialogHere, we can see our existing Active relationship between the Hours table and the People table, including the columns involved in the relationship in parentheses. From this dialog, we can Edit or Delete the relationship or have Power BI attempt to Autodetect relationships between tables. Power BI can sometimes autodetect relationships between tables based on the column names and row values.2. Select the + New relationship button. This displays the New relationship dialog, as shown in the following screenshot:Figure 5.6 – Create relationship dialog3. In the first drop-down menu, choose the Hours table. A preview of the table will be displayed.4. Click the Date column.5. Choose Calendar in the second drop-down menu. A preview of this table is displayed.6. Choose the Date column from this table. Power BI detects the appropriate Cardinality and Cross filter direction values. Since there are no other relationships between these tables, Power BI checks the Make this relationship active checkbox.7. Click the Save button to create this relationship. Note that this new relationship now appears in the Manage relationships dialog.8. Click the Close button to close the Manage relationships dialog. In the canvas, there will now be a relationship line linking the Calendar table to the Hours table.Congratulations—you have successfully linked the three separate tables to create a semantic model!For now, we won’t link the Budgets and Forecasts table within our semantic model. This is a good time to save your work.Now that our semantic model is created, let’s explore our semantic model a little closer by creating some visuals.Exploring the semantic modelBefore we move on, let’s do some exploration of our data in order to understand our data, our semantic model, and how this data can ultimately be viewed by end users. To do this, follow these steps:1. Start by clicking on the Report view in the Views bar.2. At the bottom of the report canvas, click the plus (+) icon next to Page 1. This creates a new blank page, Page 2. Double-click Page 2 and rename it Utilization, and then press the Enter key. This changes the name of the page to Utilization.3. Expand the People table in the Data pane by clicking the small arrow to the left of the People table. Check the box next to the Name field. This creates a Table visualization on our report canvas with the names of employees. Use the sizing handles for this visualization to resize the table to take up the entire page.4. Expand the Hours table. Make sure that the table visual is selected and then check the box next to the Hours field. The number of hours reported by each employee is shown next to their name. This occurs because of the relationship between our People table and our Hours table. Because these tables are joined by a relationship based on the ID of each employee, we can use fields from both tables in visualizations. Due to this, the rows in the Hours tables are automatically filtered based on this relationship.Note that when the Table visualization is selected, the Hours and People tables have small checkmarks next to them. This indicates that the visualization contains fields from these tables.5. We know that the Category field of the hours is important, regardless of whether the hours are billable or not. With the table selected, click on the Matrix visualization in the Visualizations pane. This icon is to the immediate right of the highlighted Table visualization. Note that the Build visual pane changes from just displaying a Values field well to now containing field wells for Rows, Columns, and Values. Our Name field is now under Rows, while our Hours field is under Values and shows up as Sum of Hours.6. From the Hours table, click on Category and drag and drop this field into the Columns field well. We can now see a breakdown of the hours for each employee by category.It is now obvious that we can calculate a simple version of our utilization metric by taking the number of Billable hours for each employee and dividing them by the Total number of hours for each employee. Save your work before continuing.ConclusionBuilding a semantic model in Power BI is a critical step in transforming raw data into reliable, self-service analytics. By organizing tables, defining relationships, and understanding how filtering and cardinality work, you enable calculations and visuals to behave as expected across your reports. This foundation not only improves accuracy and performance but also makes your Power BI solutions easier to scale and maintain over time.If you’d like to explore this topic in greater depth with guided examples, best practices, and real-world scenarios, you can learn more in Learn Microsoft Power BI - Third Edition by Greg Deckler, where semantic modeling and calculations are covered as part of a complete, end-to-end Power BI learning journey. This book helps you master self-service BI and become your team’s data hero or launch a new career. You’ll learn how to clean data, build models, and use advanced analytics in Power BI to unlock valuable insights and drive better business decisions.Author BioGreg Deckler is a Vice President at a global consulting services firm. In addition, I am a 7-time Microsoft MVP for Data Platform. As an active member in the Columbus IT community, I founded the Columbus Azure ML and Power BI User Group (CAMLPUG) and have presented at many different conferences and events including Dog Food, SharePoint Saturday, CloudDevelop and M3. I am also the author of Microsoft Hates Greg's Quick Measures (MSHGQM) tool and the associated YouTube channels Microsoft Hates Greg and DAX For Humans.

0
0

article-image-semantic-search-in-opensearch

Jon Handler, Soujanya Konka, Prashant Agrawal

12 Jan 2026

15 min read

Semantic Search in OpenSearch

Jon Handler, Soujanya Konka, Prashant Agrawal

12 Jan 2026

15 min read

0
0

article-image-building-trust-in-ai-the-role-of-rag-in-data-security-and-transparency

Keith Bourne

13 Dec 2024

15 min read

Building Trust in AI: The Role of RAG in Data Security and Transparency

Keith Bourne

13 Dec 2024

15 min read

This article is an excerpt from the book, "Unlocking Data with Generative AI and RAG", by Keith Bourne. Master Retrieval-Augmented Generation (RAG), the most popular generative AI tool, to unlock the full potential of your data. This book enables you to develop highly sought-after skills as corporate investment in generative AI soars.IntroductionAs the adoption of Retrieval-Augmented Generation (RAG) continues to grow, its potential to address key security challenges in AI-driven applications is becoming evident. Far from merely introducing risks, RAG offers a robust framework to enhance data protection, ensure accuracy, and maintain transparency in content generation. This article delves into the multifaceted security benefits of RAG, while also addressing the unique challenges it poses and strategies to mitigate them.How RAG can be leveraged as a security solutionLet’s start with the most positive security aspect of RAG. RAG can actually be considered a solution to mitigate security concerns, rather than cause them. If done right, you can limit data access via user, ensure more reliable responses, and provide more transparency of sources.Limiting dataRAG applications may be a relatively new concept, but you can still apply the same authentication and database-based access approaches you can with web and similar types of applications. This provides the same level of security you can apply in these other types of applications. By implementing userbased access controls, you can restrict the data that each user or user group can retrieve through the RAG system. This ensures that sensitive information is only accessible to authorized individuals. Additionally, by leveraging secure database connections and encryption techniques, you can safeguard the data at rest and in transit, preventing unauthorized access or data breaches.Ensuring the reliability of generated contentOne of the key benefits of RAG is its ability to mitigate inaccuracies in generated content. By allowing applications to retrieve proprietary data at the point of generation, the risk of producing misleading or incorrect responses is substantially reduced. Feeding the most current data available through your RAG system helps to mitigate inaccuracies that might otherwise occur.With RAG, you have control over the data sources used for retrieval. By carefully curating and maintaining high-quality, up-to-date datasets, you can ensure that the information used to generate responses is accurate and reliable. This is particularly important in domains where precision and correctness are critical, such as healthcare, finance, or legal applications.Maintaining transparencyRAG makes it easier to provide transparency in the generated content. By incorporating data such as citations and references to the retrieved data sources, you can increase the credibility and trustworthiness of the generated responses.When a RAG system generates a response, it can include links or references to the specific data points or documents used in the generation process. This allows users to verify the information and trace it back to its original sources. By providing this level of transparency, you can build trust with your users and demonstrate the reliability of the generated content.Transparency in RAG can also help with accountability and auditing. If there are any concerns or disputes regarding the generated content, having clear citations and references makes it easier to investigate and resolve any issues. This transparency also facilitates compliance with regulatory requirements or industry standards that may require traceability of information.That covers many of the security-related benefits you can achieve with RAG. However, there are some security challenges associated with RAG as well. Let’s discuss these challenges next.RAG security challengesRAG applications face unique security challenges due to their reliance on large language models (LLMs) and external data sources. Let’s start with the black box challenge, highlighting the relative difficulty in understanding how an LLM determines its response.LLMs as black boxesWhen something is in a dark, black box with the lid closed, you cannot see what is going on in there! That is the idea behind the black box when discussing LLMs, meaning there is a lack of transparency and interpretability in how these complex AI models process input and generate output. The most popular LLMs are also some of the largest, meaning they can have more than 100 billion parameters. The intricate interconnections and weights of these parameters make it difficult to understand how the model arrives at a particular output.While the black box aspects of LLMs do not directly create a security problem, it does make it more difficult to identify solutions to problems when they occur. This makes it difficult to trust LLM outputs, which is a critical factor in most of the applications for LLMs, including RAG applications. This lack of transparency makes it more difficult to debug issues you might have in building an RAG application, which increases the risk of having more security issues.There is a lot of research and effort in the academic field to build models that are more transparent and interpretable, called explainable AI. Explainable AI aims at making the operations of A I systems transparent and understandable. It can involve tools, frameworks, and anything else that, when applied to RAG, helps us understand how the language models that we use produce the content they are generating. This is a big movement in the field, but this technology may not be immediately available as you read this. It will hopefully play a larger role in the future to help mitigate black box risk, but right now, none of the most popular LLMs are using explainable models. So, in the meantime, we will talk about other ways to address this issue.You can use human-in-the-loop, where you involve humans at different stages of the process to provide an added line of defense against unexpected outputs. This can often help to reduce the impact of the black box aspect of LLMs. If your response time is not as critical, you can also use an additional LLM to perform a review of the response before it is returned to the user, looking for issues. We will review how to add a second LLM call in code lab 5.3, but with a focus on preventing prompt attacks. But this concept is similar, in that you can add additional LLMs to do a number of extra tasks and improve the security of your application.Black box isn’t the only security issue you face when using RAG applications though; another very important topic is privacy protection.Privacy concerns and protecting user dataPersonally identifiable information (PII) is a key topic in the generative AI space, with governments a round the world trying to determine the best path to balance user privacy with the data-hungry needs of these LLMs. As this gets worked out, it is important to pay attention to the laws and regulations that are taking shape where your company is doing business and make sure all of the technologies you are integrating into your RAG applications adhere. Many companies, such as Google and Microsoft , are taking these efforts into their own hands, establishing their own standards of protection for their user data and emphasizing them in training literature for their platforms.At the corporate level, there is another challenge related to PII and sensitive information. As we have said many times, the nature of the RAG application is to give it access to the company data and combine that with the power of the LLM. For example, for financial institutions, RAG represents a way to give their customers unprecedented access to their own data in ways that allow them to speak naturally with technologies such as chatbots and get near-instant access to hard-to-find answers buried deep in their customer data.In many ways, this can be a huge benefit if implemented properly. But given that this is a security discussion, you may already see where I am going with this. We are giving unprecedented access to customer data using a technology that has artificial intelligence, and as we said previously in the black box discussion, we don’t completely understand how it works! If not implemented properly, this could be a recipe for disaster with massive negative repercussions for companies that get it wrong. Of course, it could be argued that the databases that contain the data are also a potential security risk. Having the data anywhere is a risk! But without taking on this risk, we also cannot provide the significant benefits they represent.As with other IT applications that contain sensitive data, you can forge forward, but you need to have a healthy fear of what can happen to data and proactively take measures to protect that data. The more you understand how RAG works, the better job you can do in preventing a potentially disastrous data leak. These steps can help you protect your company as well as the people who trusted your company with their data.This section was about protecting data that exists. However, a new risk that has risen with LLMs has been the generation of data that isn’t real, called hallucinations. Let’s discuss how this presents a new risk not common in the IT world.HallucinationsWe have discussed this in previous chapters, but LLMs can, at times, generate responses that sound coherent and factual but can be very wrong. These are called hallucinations and there have been many shocking examples provided in the news, especially in late 2022 and 2023, when LLMs became everyday tools for many users.Some are just funny with little consequence other than a good laugh, such as when ChatGPT was asked by a writer for The Economist, “When was the Golden Gate Bridge transported for the second time across Egypt?” ChatGPT responded, “The Golden Gate Bridge was transported for the second time across Egypt in October of 2016” (https://www.economist.com/by-invitation/2022/09/02/artificialneural-networks-today-are-not-conscious-according-to-douglashofstadter).Other hallucinations are more nefarious, such as when a New York lawyer used ChatGPT for legal research in a client’s personal injury case against Avianca Airlines, where he submitted six cases that had been completely made up by the chatbot, leading to court sanctions (https://www. courthousenews.com/sanctions-ordered-for-lawyers-who-relied-onchatgpt-artificial-intelligence-to-prepare-court-brief/). Even worse, generative AI has been known to give biased, racist, and bigoted perspectives, particularly when prompted in a manipulative way.When combined with the black box nature of these LLMs, where we are not always certain how and why a response is generated, this can be a genuine issue for companies wanting to use these LLMs in their RAG applications.From what we know though, hallucinations are primarily a result of the probabilistic nature of LLMs. For all responses that an LLM generates, it typically uses a probability distribution to determine what token it is going to provide next. In situations where it has a strong knowledge base of a certain subject, these probabilities for the next word/token can be 99% or higher. But in situations where the knowledge base is not as strong, the highest probability could be low, such as 20% or even lower. In these cases, it is still the highest probability and, therefore, that is the token that has the highest probability to be selected. The LLM has been trained on stringing tokens together in a very natural language way while using this probabilistic approach to select which tokens to display. As it strings together words with low probability, it forms sentences, and then paragraphs that sound natural and factual but are not based on high probability data. Ultimately, this results in a response that sounds very plausible but is, in fact, based on very loose facts that are incorrect.For a company, this poses a risk that goes beyond the embarrassment of your chatbot saying something wrong. What is said wrong could ruin your relationship(s) with your customer(s), or it could lead to the LLM offering your customer something that you did not intend to offer, or worse, cannot afford to offer. For example, when Microsoft released a chatbot named Tay on Twitter in 2016 with the intention of learning from interactions with Twitter users, users manipulated this spongy personality trait to get it to say numerous racist and bigoted remarks. This reflected poorly on Microsoft, which was promoting its expertise in the AI area with Tay, causing significant damage to its reputation at the time (https://www.theguardian.com/technology/2016/mar/26/microsoftdeeply-sorry-for-offensive-tweets-by-ai-chatbot).Hallucinations, threats related to black box aspects, and protecting user data can all be addressed through red teaming.ConclusionRAG represents a promising avenue for enhancing security in AI applications, offering tools to limit data access, ensure reliable outputs, and promote transparency. However, challenges such as the black box nature of LLMs, privacy concerns, and the risk of hallucinations demand proactive measures. By employing strategies like user-based access controls, explainable AI, and red teaming, organizations can harness the advantages of RAG while mitigating risks. As the technology evolves, a thoughtful approach to its implementation will be crucial for maintaining trust, compliance, and the integrity of data-driven solutions.Author BioKeith Bourne is a senior Generative AI data scientist at Johnson & Johnson. He has over a decade of experience in machine learning and AI working across diverse projects in companies that range in size from start-ups to Fortune 500 companies. With an MBA from Babson College and a master’s in applied data science from the University of Michigan, he has developed several sophisticated modular Generative AI platforms from the ground up, using numerous advanced techniques, including RAG, AI agents, and foundational model fine-tuning. Keith seeks to share his knowledge with a broader audience, aiming to demystify the complexities of RAG for organizations looking to leverage this promising technology.

0
0
100006

article-image-revolutionize-power-bi-queries-with-openai

Gus Frazer

11 Dec 2024

10 min read

Revolutionize Power BI Queries with OpenAI

Gus Frazer

11 Dec 2024

10 min read

This article is an excerpt from the book, Data Cleaning with Power BI, by Gus Frazer. Unlock the full potential of your data by mastering the art of cleaning, preparing, and transforming data with Power BI for smarter insights and data visualizations.IntroductionDiscover the transformative potential of leveraging Azure OpenAI, integrated with ChatGPT functionality, to enhance Power BI's M query capabilities. In this article, we delve into how this powerful combination offers expert guidance, efficient solutions, and insightful recommendations for optimizing data transformation tasks. From generating M queries to streamlining complex transformations, explore how Azure OpenAI with ChatGPT empowers users to boost productivity and efficiency in Power BI.Using OpenAI for M queriesAzure OpenAI, with ChatGPT functionality within it, can be a helpful tool for generating M queries in Power BI by providing suggestions, helping with syntax, and offering insights into data transformation tasks. In the following example, you will learn about how you can leverage the chat playground within OpenAI to improve your productivity and efficiency when writing M queries. We will do this by asking a series of questions directly within Azure OpenAI.Complete the next steps to follow along with the example in your own environment:1. Click on Deployment on the left-hand side and then select Create new deployment to get started.2. Select a model from the base models, in this case, select gpt-35-turbo, and then name your deployment. In this example, name it CleaningDataOpenAI.3. Select Chat playground from the Azure OpenAI Studio home screen or from the Playground tab on the left of your screen. This should open the following view:Figure 15.4 – Screenshot of the chat playground from within Azure OpenAINow that you have set up an instance of OpenAI, here’s an example conversation taken from OpenAI, where the chat functionality has been used to assist with writing M queries in Power BI:User: Hi, I’m working on a Power BI project, and I need to transform some data using M queries. I have a table with sales data, and I want to calculate the total sales for each product category. Can you help me with the M query for that?ChatGPT: Of course! To calculate the total sales for each product category, you can use the Group By transformation in Power BI’s Power Query Editor. Here’s a sample M snippet for your scenario:let Source = YourDataSource, GroupedTable = Table.Group(Source, {"ProductCategory"}, {{"TotalSales", each List.Sum([Sales]), type number}}) in GroupedTableReplace YourDataSource with the actual name of your data source. This code groups the data by the ProductCategory column and calculates the total sales for each group.User: Thanks! That’s a great starting point. What if I want to filter out products with sales less than $1,000 before calculating the total sales?ChatGPT: You can add a filtering step before the Group By transformation. Here’s an updated version of the M:let Source = YourDataSource, FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000), GroupedTable = Table.Group(FilteredTable, {"ProductCategory"}, {{"TotalSales", each List.Sum([Sales]), type number}}) in GroupedTableThis code filters out rows where sales are less than $1,000 before grouping the data by product category and calculating the total sales.User: Perfect! This is really helpful. One more thing – what if I want to sort the results in descending order based on total sales?ChatGPT: You can add a sorting step using the Table.Sort function. Here’s the modified M:let Source = YourDataSource, FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000), GroupedTable = Table.Group(FilteredTable, {"ProductCategory"}, {{"TotalSales", each List.Sum([Sales]), type number}}), SortedTable = Table.Sort(GroupedTable,{{"TotalSales", Order. Descending}}) in SortedTableThis code sorts the grouped table in descending order based on the “TotalSales” column.This interaction represents a simple example of how those who can leverage the power of tools such as OpenAI and C hatGPT will be able to quickly upskill in areas such as coding. It has to be said, though, that while this is still in its infancy, it’s important to always test and validate the answers provided before implementing them in production. Also, ensure that you take precautions when using the publicly available ChatGPT model to avoid sharing sensitive data publicly. If you would like to use sensitive data or you want to ensure that requests are given within a secured governed environment, make sure to use the ChatGPT model within your own Azure OpenAI instance.In more complex examples, optimizing Power Query transformations could involve efficient interaction with Azure OpenAI. This includes streamlining API calls, managing large datasets, and incorporating caching mechanisms for repetitive queries, ensuring a seamless and performant data cleaning process.As we begin to explore the use cases where this technology can be most effective, there are a number of clear early winners:Optimizing query plans: ChatGPT’s natural language understanding can assist in formulating more efficient Power Query plans. By describing the desired transformations in natural language, users can interact with ChatGPT to generate optimized query plans. This involves selecting the most suitable Power Query functions and structuring transformations for performance gains.Caching strategies for repetitive queries: ChatGPT can guide users in devising effective caching strategies. By understanding the context of data transformations, it can recommend where to implement caching mechanisms to store and reuse intermediate results, minimizing redundant API calls and computations. The following is an example of just this, where I have asked Azure OpenAI to verify and optimize my query from the Power Query Advanced Editor. The model suggested I use the Table.Buffer function to help cache the table in memory and optimize the query.Figure – An example request to OpenAI to help optimize my query for Power Query Figure – An example response from OpenAI to help optimize my query for Power QueryNow as we highlighted in Chapter 11, M Query Optimization, Table.Buffer can indeed improve the performance of your queries and refreshes, but this really depends on the data you are working with. In the previous example, the model doesn’t take the characteristics, size, or complexity of your data into consideration as it isn’t plugged into your data at this stage. Also linking back to the example you walked through in Chapter 11, the placement of where you add Table.Buffer can really impact how your query performs. In the previous example, if you were connecting to a small dataset, you would likely cause it to run slower by adding the Table.Buffer function as the second variable in the query.Lastly, it’s worth mentioning that how you prompt these models is crucially important. In the previous example, we didn’t specify what type of data source we were using in our query. As such, the model hasn’t provided an insight or overview that using Table.Buffer on a data source supporting query folding will cause it to break the fold. Again, this is not so much of a problem if Table.Buffer is placed at the end of your query for smaller datasets, but it is a problem if you add it nearer to the beginning of the query, like in the previous example.Handling large datasets: Dealing with large datasets often poses a challenge in Power Query. OpenAI models, including ChatGPT, can provide insights into dividing and conquering large datasets. This includes strategies for parallel processing, filtering data early in the transformation pipeline, and using aggregations to reduce computational load.Dynamic query adjustments: ChatGPT’s interactive nature allows users to dynamically adjust queries based on evolving requirements. It can assist in crafting queries that adapt to changing data scenarios, ensuring that Power Query transformations remain flexible and responsive to varied datasets.Guidance on complex transformations: Power Query oft en involves intricate transformations. ChatGPT can act as a virtual assistant, guiding users through the process of complex transformations. It can suggest optimal function compositions, advise on conditional logic placement, and assist in structuring transformations to enhance efficiency. The best example of this can be seen in the following two screenshots of an active use case seen in many businesses. The example begins with a user asking the model for a description of what the query is doing. OpenAI then provides a breakdown of what the query is doing in each step to help the user interpret the code. It helps to break down the barriers to coding and also helps to decipher code that has not been documented well by previous employees. Figure – An example request to OpenAI to help translate my queryFigure – An example response from OpenAI to help describe my queryError handling strategies: Optimizing Power Query also entails robust error handling. ChatGPT can provide recommendations for anticipating and handling errors gracefully within a query. This includes strategies for logging errors, implementing fallback mechanisms, and ensuring the stability of the overall data preparation process.In this section, you learned how to optimize Power Query transformations with Azure OpenAI efficiently. Key takeaways include using ChatGPT for natural-language-based query planning and effective caching strategies. Insights include handling large datasets through parallel processing, early filtering, and aggregations. This knowledge equips you to streamline and enhance your Power Query processes effectively.In the next section, you will learn about Microsoft Copilot, how to set up a Power BI instance with Copilot activated, and also how you can use this new AI technology to help clean and prepare your data.ConclusionIn conclusion, Azure OpenAI with ChatGPT presents a game-changing solution for maximizing Power BI's potential. From query optimization to error-handling strategies, this integration streamlines processes and enhances productivity. As users navigate complex data transformations, the guidance provided fosters efficient decision-making and empowers users to tackle challenges with confidence. With Azure OpenAI and ChatGPT, the possibilities for revolutionizing Power BI workflows are endless, offering a glimpse into the future of data transformation and analytics.Author BioGus Frazer is a seasoned Analytics Consultant focused on Business Intelligence solutions. With over 7 years of experience working for the two market-leading platforms, Power BI & Tableau, has amassed a wealth of knowledge and expertise. Gus has helped hundreds of customers to drive their digital and data transformations, scope data requirements, drive actionable insights, and most important of all, cleanse data ready for analysis. Most recently helping to set up, organize and run the Power BI UK community at Microsoft. He holds 6 Azure and Power BI certifications, including the PL-300 and DP-500 certifications. In this book, Gus offers readers invaluable guidance on ingesting, preparing, and cleansing data for analysis in Power BI. --This text refers to an out of print or unavailable edition of this title.

0
0
81745

article-image-mastering-performance-tuning-with-dax-studio-and-vertipaq-analyzer

Thomas LeBlanc, Bhavik Merchant

03 Dec 2024

15 min read

Mastering Performance Tuning with DAX Studio and VertiPaq Analyzer

Thomas LeBlanc, Bhavik Merchant

03 Dec 2024

15 min read

This article is an excerpt from the book, "Microsoft Power BI Performance Best Practices - Second Edition", by Thomas LeBlanc, Bhavik Merchant. Overcome common challenges in data management, visualization, and security with this updated edition of Microsoft Power BI Performance Best Practices, and ramp-up your Power BI solutions, achieve faster insights, and drive better business outcomes.IntroductionOptimizing performance and storage in Power BI and Analysis Services can be a complex task. However, tools like DAX Studio and VertiPaq Analyzer simplify this process by providing insightful metrics and performance-tuning capabilities. This article explores how to leverage these tools to analyze semantic models, identify performance bottlenecks, and optimize DAX queries. We'll discuss key features such as viewing model metrics, capturing and analyzing query traces, and testing optimizations using DAX Studio's query editor.Tuning with DAX Studio and VertiPaq AnalyserDAX Studio, as the name implies, is a tool centered on DAX queries. It provides a simple yet intuitive interface with powerful features to browse and query Analysis Services semantic models. We will cover querying later in this section. For now, let’s look deeper into semantic models.The Analysis Services engine has supported dynamic management views (DMVs) for over a decade. These views refer to SQL-like queries that can be executed on Analysis Services to return information about semantic model objects and operations.VertiPaq Analyzer is a utility that uses publicly documented DMVs to display essential information about which structures exist inside the semantic model and how much space they occupy. It started life as a standalone utility, published as a Power Pivot for an Excel workbook, and still exists in that form today. In this chapter, we will refer to its more recent incarnation as a built-in feature of DAX Studio 3.0.11.It is interesting to note that VertiPaq is the original name given to the compressed column store engine within Analysis Services (Verti referring to columns and Paq referring to compression).Analyzing model size with VertiPaq AnalyzerVertiPaq Analyzer is built into DAX Studio as the View Metrics features, found in the Advanced tab of the toolbar. You simply click the icon to have DAX Studio run the DMVs for you and display statistics in a tabular form. This is shown in the following figure:Figure 6.8 – Using View Metrics to generate VertiPaq Analyzer statsYou can switch to the Summary tab of the VertiPaq Analyzer pane to get an idea of the overall total size of the model along with other summary statistics, as shown in the following figure:Figure 6.9 – Summary tab of VertiPaq AnalyzerThe Total Size metric provided in the previous figure will often be larger than the size of the semantic model on disk (as a .pbix file or Analysis Services .abf backup). This is because there are additional structures required when the model is loaded into memory, which is particularly true of Import mode semantic models.In Chapter 2, Exploring Power BI Architecture and Configuration, we learned about Power BI’s compressed column storage engine. The DMV statistics provided by VertiPaq Analyzer let us see just how compressible columns are and how much space they are taking up. It also allows us to observe other objects, such as relationships.The Columns tab is a great way to see whether you have any columns that are very large relative to others or the entire dataset. The following figure shows the columns view for the same model we saw in Figure 6.9. You can see how from 238 columns, a single column called SalesOrderNumber takes up a staggering 22.40% of the whole model size! It’s interesting to see its Cardinality (or uniqueness) value is about twelve times lower than the next largest column (SalesKey):|Figure 6.10 – Two columns monopolizing the semantic modelIn Figure 6.10, we can also see that Data Type is String for Online Sale-SalesOrderNumber, which was a column suggested by Tabular Editor to have a large dictionary footprint. These statistics would lead you to deduce that this column contains long, unique test values that do not compress well because there is a large cardinality. Indeed, in this case, the column contains a sales order number that is unique to each order plus is not used to group or slice analytical data in a Power BI report well.This analysis may lead you to re-evaluate the need for this level of reporting in the analysis of sales data. You’d need to ask yourself whether the extra storage space and time taken to build compressed columns and potentially other structures is worth it for your business case. In cases of highly detailed data such as this where you do not need detail-level sales order data, consider limiting the analysis to customer-related data such as demographics or date attributes such as year and month.Now, let’s learn about how DAX Studio can help us with performance analysis and improvement.Performance tuning the data model and DAXThe first-party option for capturing Analysis Services traces is SQL Server Profiler. When starting a trace, you must identify exactly which events to capture, which requires some knowledge of the trace events and what they contain. Even with this knowledge, working with the trace data in Profi ler can be tough since the tool was designed primarily to work with SQL Server application traces. The good news is that DAX Studio can start an Analysis Services server trace and then parse and format all the data to show you relevant results in a well-presented way within its user interface. It allows us to both tune and measure queries in a single place and provides features for Analysis Services that make it a good alternative SQL profiler for tuning semantic models.Capturing and replaying queriesThis All Queries command in the Traces section of the DAX Studio toolbar will start a trace against the semantic model you have connected to. Figure 6.11 shows the result when a trace is successfully started:Figure 6.11 – Query trace successfully started in DAX StudioOnce your trace has started, you can interact with the semantic model outside DAX Studio, and it will capture queries for you. How you interact with the semantic model depends on where it is. For a semantic model running on your computer in Power BI Desktop, you would simply interact with the report. This would generate queries that DAX Studio will see. The All Queries tab at the bottom of the tool is where the captured queries are listed in time order with durations in milliseconds. The following figure shows two queries captured when opening the Unique by Account No page from the Slow vs Fast Measures.pbix sample file:Figure 6.12 – Queries captured by DAX StudioThe preceding queries come from a screen that has the same table results in a visual, but two different DAX measures that calculate the aggregation. These measures make one table come back in less than a second while the other returns in about 17 seconds. The following figure shows the page in the report:Figure 6.13 – Tables with the same results but from using different measuresThe following screenshot shows the results of the Performance Analyzer for the tables previously.Observe how one query took over 17 seconds, whereas the other took under 1 second:Figure 6.14 – Vastly different query durations for the same visual resultIn Figure 6.12, the second query was double-clicked to bring the DAX text to the editor. You can modify this query in DAX Studio to test performance changes. We see here that the DAX expression for the UniqueRedProducts_Slow measure was not efficient. We’ll learn a technique to optimize queries soon, but first, we need to learn about capturing query performance traces.Obtaining query timingsTo get detailed query performance information, you can use the Server Timings command shown in Figure 6.11. After starting the trace, you can run queries and then use the Server Timings tab to see how the engine executed the query, as shown in the following figure:Figure 6.15 – Server Timings showing detailed query performance statisticsFigure 6.15 gives very useful information. FE and SE refer to the formula engine and storage engine. The storage engine is fast and multi-threaded, and its job is fetching data. It can apply basic logic such as filtering data to retrieve only what is needed. The formula engine is single-threaded, and it generates a query plan, which is the physical steps required to compute the result. It also performs calculations on the data such as joins, complex filters, aggregations, and lookups. We want to avoid queries that spend most of the time in the formula engine, or that execute many queries in the storage engine. The bottom-left section of Figure 6.15 shows that we executed almost 4,900 SE queries. The list of queries to the right shows many queries returning only one result, which is suspicious.For comparison, we look at timing for the fastest version of the query and we see the following:Figure 6.16 – Server Timings for a fast version of the queryIn Figure 6.16, we can see that only three server engine queries were run this time, and the result was obtained much faster (milliseconds compared to seconds).The faster DAX measure was as follows:UniqueRedProducts_Fast = CALCULATE( DISTINCTCOUNT('SalesOrderDetail'[ProductID]), 'Product'[Color] = "Red" )The slower DAX measure was as follows:UniqueRedProducts_Slow = CALCULATE( DISTINCTCOUNT('SalesOrderDetail'[ProductID]), FILTER('SalesOrderDetail', RELATED('Product'[Color]) = "Red"))TipThe Analysis Services engine does use data caches to speed up queries. These caches contain uncompressed query results that can be reused later to save time fetching and decompressing data. You should use the Clear Cache button in DAX Studio to force these caches to be cleared and get a proper worst-case performance measure. This is visible in the menu bar in Figure 6.11.We will build on these concepts when we look at DAX and model optimizations in later chapters. Now, let’s look at how we can experiment with DAX and query changes in DAX Studio.Modifying and tuning queriesEarlier in this section, we saw how we could capture a query generated by a Power BI visual and then display its text. A nice trick we can use here is to use query-scoped measures to override the measure definition and see how performance differs.The following figure shows how we can search for a measure, right-click, and then pull its definition into the query editor of DAX Studio:Figure 6.17 – The Define Measure option and result in the Query paneWe can now modify the measure in the query editor, and the engine will use the local definition instead of the one defined in the model! This technique gives you a fast way to prototype DAX enhancements without having to edit them in Power BI and refresh visuals over many iterations.Remember that this technique does not apply any changes to the dataset you are connected to. You can optimize expressions in DAX Studio, then transfer the definition to Power BI Desktop/Visual Studio when ready. The following figure shows how we changed the definition of UniqueRedProducts_ Slow in a query-scoped measure to get a huge performance boast:Figure 6.18 – Modified measure giving better resultsThe technique described here can be adapted to model changes too. For example, if you wanted to determine the impact of changing a relationship type, you could run the same queries in DAX Studio before and after the change to draw a comparison.Here are some additional tips for working with DAX Studio:Isolate measure: When performance tuning a query generated by a report visual, comment out complex measures and then establish a baseline performance score. Th en, add each measure back to the query individually and check the speed. This will help identify the slowest measures in the query and visual context.Work with Desktop Performance Analyzer traces: DAX Studio has a facility to import the trace files generated by Desktop Performance Analyzer. You can import trace files using the Load Perf Data button located next to All Queries highlighted in Figure 6.12. This trace can be captured by one person and then shared with a DAX/modeling expert who can use DAX Studio to analyze and replay their behavior. The following figure shows how DAX Studio formats the data to make it easy to see which visual component is taking the most time. It was generated by viewing each of the three report pages in the Slow vs Fast Measures.pbix sample file:Figure 6.19 – Performance Analyzer trace shows the slowest visual in the reportExport/import model metrics: DAX Studio has a facility to export or import the VertiPaq model metadata using .vpax files. These files do not contain any of your data. They contain table names, column names, and measure definitions. If you are not concerned with sharing these definitions, you can provide .vpax files to others if you need assistance with model optimizationConclusionDAX Studio and VertiPaq Analyzer are indispensable tools for anyone working with Power BI or Analysis Services models. From detailed model size analysis to advanced performance tuning, these tools empower users to identify inefficiencies and implement optimizations effectively. By using their robust features, such as the ability to view metrics, trace query performance, and prototype query changes, professionals can ensure their models are both efficient and scalable. Mastery of these tools lays a solid foundation for building high-performing, resource-efficient analytical solutions.Author BioThomas LeBlanc is a seasoned Business Intelligence Architect at Data on the Geaux, where he applies his extensive skillset in dimensional modeling, data visualization, and analytical modeling to deliver robust solutions. With a Bachelor of Science in Management Information Systems from Louisiana State University, Thomas has amassed over 30 years of experience in Information Technology, transitioning from roles as a software developer and database administrator to his current expertise in business intelligence and data warehouse architecture and management.Throughout his career, Thomas has spearheaded numerous impactful projects, including consulting for various companies on Power BI implementation, serving as lead database administrator for a major home health care company, and overseeing the implementation of Power BI and Analysis Service for a large bank. He has also contributed his insights as an author to the Power BI MVP book.Thomas is recognized as a Microsoft Data Platform MVP and is actively engaged in the tech community through his social media presence, notably as TheSmilinDBA on Twitter and ThePowerBIDude on Bluesky and Mastodon. With a passion for solving real-world business challenges with technology, Thomas continues to drive innovation in the field of business intelligence.Bhavik Merchant has nearly 18 years of deep experience in Business Intelligence. He is currently the Director of Product Analytics at Salesforce. Prior to that, he was at Microsoft, first as a Cloud Solution Architect and then as a Product Manager in the Power BI Engineering team. At Power BI, he led the customer-facing insights program, being responsible for the strategy and technical framework to deliver system-wide usage and performance insights to customers. Before Microsoft, Bhavik spent years managing high-caliber consulting teams delivering enterprise-scale BI projects. He has provided extensive technical and theoretical BI training over the years, including expert Power BI performance training he developed for top Microsoft Partners globally.

0
0
45776

article-image-mastering-transfer-learning-fine-tuning-bert-and-vision-transformers

Sinan Ozdemir

27 Nov 2024

15 min read

Mastering Transfer Learning: Fine-Tuning BERT and Vision Transformers

Sinan Ozdemir

27 Nov 2024

15 min read

DataPro is a weekly, expert-curated newsletter trusted by 120k+ global data professionals. Built by data practitioners, it blends first-hand industry experience with practical insights and peer-driven learning.Make sure to subscribe here so you never miss a key update in the data world. This article is an excerpt from the book, "Principles of Data Science", by Sinan Ozdemir. This book provides an end-to-end framework for cultivating critical thinking about data, performing practical data science, building performant machine learning models, and mitigating bias in AI pipelines. Learn the fundamentals of computational math and stats while exploring modern machine learning and large pre-trained models.IntroductionTransfer learning (TL) has revolutionized the field of deep learning by enabling pre-trained models to adapt their broad, generalized knowledge to specific tasks with minimal labeled data. This article delves into TL with BERT and GPT, demonstrating how to fine-tune these advanced models for text classification and image classification tasks. Through hands-on examples, we illustrate how TL leverages pre-trained architectures to simplify complex problems and achieve high accuracy with limited data.TL with BERT and GPTIn this article, we will take some models that have already learned a lot from their pre-training and fine-tune them to perform a new, related task. This process involves adjusting the model’s parameters to better suit the new task, much like fine-tuning a musical instrument:Figure 12.8 – ITLITL takes a pre-trained model that was generally trained on a semi-supervised (or unsupervised) task and then is given labeled data to learn a specific task.Examples of TLLet’s take a look at some examples of TL with specific pre-trained models.Example – Fine-tuning a pre-trained model for text classificationConsider a simple text classification problem. Suppose we need to analyze customer reviews and determine whether they’re positive or negative. We have a dataset of reviews, but it’s not nearly large enough to train a deep learning (DL) model from scratch. We will fine-tune BERT on a text classification task, allowing the model to adapt its existing knowledge to our specific problem.We will have to move away from the popular scikit-learn library to another popular library called transformers, which was created by HuggingFace (the pre-trained model repository I mentioned earlier) as scikit-learn does not (yet) support Transformer models.Figure 12.9 shows how we will have to take the original BERT model and make some minor modifications to it to perform text classification. Luckily, the transformers package has a built-in class to do this for us called BertForSequenceClassification:Figure 12.9 – Simplest text classification caseIn many TL cases, we need to architect additional layers. In the simplest text classification case, we add a classification layer on top of a pre-trained BERT model so that it can perform the kind of classification we want.The following code block shows an end-to-end code example of fine-tuning BERT on a text classification task. Note that we are also using a package called datasets, also made by HuggingFace, to load a sentiment classification task from IMDb reviews. Let’s begin by loading up the dataset:# Import necessary libraries from datasets import load_dataset from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments # Load the dataset imdb_data = load_dataset('imdb', split='train[:1000]') # Loading only 1000 samples for a toy example # Define the tokenizer tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') # Preprocess the data def encode(examples): return tokenizer(examples['text'], truncation=True, padding='max_ length', max_length=512) imdb_data = imdb_data.map(encode, batched=True) # Format the dataset to PyTorch tensors imdb_data.set_format(type='torch', columns=['input_ids', 'attention_ mask', 'label'])With our dataset loaded up, we can run some training code to update our BERT model on our labeled data:# Define the model model = BertForSequenceClassification.from_pretrained( 'bert-base-uncased', num_labels=2) # Define the training arguments training_args = TrainingArguments( output_dir='./results', num_train_epochs=1, per_device_train_batch_size=4 ) # Define the trainer trainer = Trainer(model=model, args=training_args, train_dataset=imdb_ data) # Train the model trainer.train() # Save the model model.save_pretrained('./my_bert_model')Once we have our saved model, we can use the following code to run the model against unseen data:from transformers import pipeline # Define the sentiment analysis pipeline nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer) # Use the pipeline to predict the sentiment of a new review review = "The movie was fantastic! I enjoyed every moment of it." result = nlp(review) # Print the result print(f"label: {result[0]['label']}, with score: {round(result[0] ['score'], 4)}") # "The movie was fantastic! I enjoyed every moment of it." # POSITIVE: 99%Example – TL for image classificationWe could take a pre-trained model such as ResNet or the Vision Transformer (shown in Figure 12.10), initially trained on a large-scale image dataset such as ImageNet. This model has already learned to detect various features from images, from simple shapes to complex objects. We can take advantage of this knowledge, fi ne-tuning the model on a custom image classification task:Figure 12.10 – The Vision TransformerThe Vision Transformer is like a BERT model for images. It relies on many of the same principles, except instead of text tokens, it uses segments of images as “tokens” instead.The following code block shows an end-to-end code example of fine-tuning the Vision Transformer on an image classification task. The code should look very similar to the BERT code from the previous section because the aim of the transformers library is to standardize training and usage of modern pre-trained models so that no matter what task you are performing, they can offer a relatively unified training and inference experience.Let’s begin by loading up our data and taking a look at the kinds of images we have (seen in Figure 12.11). Note that we are only going to use 1% of the dataset to show that you really don’t need that much data to get a lot out of pre-trained models!# Import necessary libraries from datasets import load_dataset from transformers import ViTImageProcessor, ViTForImageClassification from torch.utils.data import DataLoader import matplotlib.pyplot as plt import torch from torchvision.transforms.functional import to_pil_image # Load the CIFAR10 dataset using Hugging Face datasets # Load only the first 1% of the train and test sets train_dataset = load_dataset("cifar10", split="train[:1%]") test_dataset = load_dataset("cifar10", split="test[:1%]") # Define the feature extractor feature_extractor = ViTImageProcessor.from_pretrained('google/vitbase-patch16-224') # Preprocess the data def transform(examples): # print(examples) # Convert to list of PIL Images examples['pixel_values'] = feature_ extractor(images=examples["img"], return_tensors="pt")["pixel_values"] return examples # Apply the transformations train_dataset = train_dataset.map( transform, batched=True, batch_size=32 ).with_format('pt') test_dataset = test_dataset.map( transform, batched=True, batch_size=32 ).with_format('pt')We can similarly use the model using the following code:Figure 12.11 – A single example from CIFAR10 showing an airplaneNow, we can train our pre-trained Vision Transformer:# Define the model model = ViTForImageClassification.from_pretrained( 'google/vit-base-patch16-224', num_labels=10, ignore_mismatched_sizes=True ) LABELS = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'] model.config.id2label = LABELS # Define a function for computing metrics def compute_metrics(p): predictions, labels = p preds = np.argmax(predictions, axis=1) return {"accuracy": accuracy_score(labels, preds)} # Define the training arguments training_args = TrainingArguments( output_dir='./results', num_train_epochs=5, per_device_train_batch_size=4, load_best_model_at_end=True, # Save and evaluate at the end of each epoch evaluation_strategy='epoch', save_strategy='epoch' ) # Define the trainer trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset )Our final model has about 95% accuracy on 1% of the test set. We can now use our new classifier on unseen images, as in this next code block:from PIL import Image from transformers import pipeline # Define an image classification pipeline classification_pipeline = pipeline( 'image-classification', model=model, feature_extractor=feature_extractor ) # Load an image image = Image.open('stock_image_plane.jpg') # Use the pipeline to classify the image result = classification_pipeline(image)Figure 12.12 shows the result of this single classification, and it looks like it did pretty well:Figure 12.12 – Our classifier predicting a stock image of a plane correctlyWith minimal labeled data, we can leverage TL to turn models off the shelf into powerhouse predictive models.ConclusionTransfer learning is a transformative technique in deep learning, empowering developers to harness the power of pre-trained models like BERT and the Vision Transformer for specialized tasks. From sentiment analysis to image classification, these models can be fine-tuned with minimal labeled data, offering impressive performance and adaptability. By using libraries like HuggingFace’s transformers, TL streamlines model training, making state-of-the-art AI accessible and versatile across domains. As demonstrated in this article, TL is not only efficient but also a practical way to achieve powerful predictive capabilities with limited resources.Author BioSinan is an active lecturer focusing on large language models and a former lecturer of data science at the Johns Hopkins University. He is the author of multiple textbooks on data science and machine learning including "Quick Start Guide to LLMs". Sinan is currently the founder of LoopGenius which uses AI to help people and businesses boost their sales and was previously the founder of the acquired Kylie.ai, an enterprise-grade conversational AI platform with RPA capabilities. He holds a Master’s Degree in Pure Mathematics from Johns Hopkins University and is based in San Francisco.

0
0
38736

article-image-airflow-ops-best-practices-observation-and-monitoring

Dylan Intorf, Kendrick van Doorn, Dylan Storey

12 Nov 2024

15 min read

Airflow Ops Best Practices: Observation and Monitoring

Dylan Intorf, Kendrick van Doorn, Dylan Storey

12 Nov 2024

15 min read

This article is an excerpt from the book, "Apache Airflow Best Practices", by Dylan Intorf, Kendrick van Doorn, Dylan Storey. With practical approach and detailed examples, this book covers newest features of Apache Airflow 2.x and it's potential for workflow orchestration, operational best practices, and data engineering.IntroductionIn this article, we will continue to explore the application of modern “ops” practices within Apache Airflow, focusing on the observation and monitoring of your systems and DAGs after they’ve been deployed.We’ll divide this observation into two segments – the core Airflow system and individual DAGs. Each segment will cover specific metrics and measurements you should be monitoring for alerting and potential intervention.When we discuss monitoring in this section, we will consider two types of monitoring – active and suppressive.In an active monitoring scenario, a process will actively check a service’s health state, recording its state and potentially taking action directly on the return value.In a suppressive monitoring scenario, the absence of a state (or state change) is usually meaningful. In these scenarios, the monitored application sends an active schedule to a process to inform it that it is OK, usually suppressing an action (such as an alert) from occurring.This chapter covers the following topics:Monitoring core Airflow componentsMonitoring your DAGsTechnical requirementsBy now, we expect you to have a good understanding of Airflow and its core components, along with functional knowledge in the deployment and operation of Airflow and Airflow DAGs.We will not be covering specific observability aggregators or telemetry tools; instead, we will focus on the activities you should be keeping an eye on. We strongly recommend that you work closely with your ops teams to understand what tools exist in your stack and how to configure them for capture and alerting your deployments.Monitoring core Airflow componentsAll of the components we will discuss here are critical to ensuring a functioning Airflow deployment. Generally, all of them should be monitored with a bare minimum check of Is it on? and if a component is not, an alert should surface to your team for investigation. The easiest way to check this is to query the REST API on the web server at `/health/`; this will return a JSON object that can be parsed to determine whether components are healthy and, if not, when they were last seen.SchedulerThis component needs to be running and working effectively in order for tasks to be scheduled for execution.When the scheduler service is started, it also starts a `/health` endpoint that can be checked by an external process with an active monitoring approach.The returned signal does not always indicate that the scheduler is working properly, as its state is simply indicative that the service is up and running. There are many scenarios where the scheduler may be operating but unable to schedule jobs; as a result, many deployments will include a canary dag to their deployment that has a single task, acting to suppress an external alert from going off.Import metrics that airflow exposes for you include the following:scheduler.scheduler_loop_duration: This should be monitored to ensure that your scheduler is able to loop and schedule tasks for execution. As this metric increases, you will see tasks beginning to schedule more slowly, to the point where you may begin missing SLAs because tasks fail to reach a schedulable state.scheduler.tasks.starving: This indicates how many tasks cannot be scheduled because there are no slots available. Pools are a mechanism that Airflow uses to balance large numbers of submitted task executions versus a finite amount of execution throughput. It is likely that this number will not be zero, but being high for extended periods of time may point to an issue in how DAGs are being written to schedule work.scheduler.tasks.executable: This indicates how many tasks are ready for execution (i.e., queued). This number will sometimes not be zero, and that is OK, but if the number increases and stays high for extended periods of time, it indicates that you may need additional computer resources to handle the load. Look at your executor to increase the number of workers it can run. Metadata databaseThe metadata database is used to store and track all of the metadata for your Airflow deployments’ previous DAG/task executions, along with information about your environment’s roles and permissions. Losing data from this database can interrupt normal operations and cause unintended consequences, with DAG runs being repeated.While critical, because it is architecturally ubiquitous, the database is also least likely to encounter issues, and if it does, they are absolutely catastrophic in nature.We generally suggest you utilize a managed service for provisioning and operating your backing database, ensuring that a disaster recovery plan for your metadata database is in place at all times.Some active areas to monitor on your database include the following:Connection pool size/usage: Monitor both the connection pool size and usage over time to ensure appropriate configuration, and identify potential bottlenecks or resource contention arising from Airflow components’ concurrent connections.Query performance: Measure query latency to detect inefficient queries or performance issues, while monitoring query throughput to ensure effective workload handling by the database.Storage metrics: Monitor the disk space utilization of the metadata database to ensure that it has sufficient storage capacity. Set up alerts for low disk space conditions to prevent database outages due to storage constraints.Backup status: Monitor the status of database backups to ensure that they are performed regularly and successfully. Verify backup integrity and retention policies to mitigate the risk of data loss if there is a database failure.TriggererThe Triggerer instance manages all of the asynchronous operations of deferrable operators in a deferred state. As such, major operational concerns generally relate to ensuring that individual deferred operators don’t cause major blocking calls to the event loop. If this occurs, your deferrable tasks will not be able to check their state changes as frequently, and this will impact scheduling performance.Import metrics that airflow exposes for you include the following:triggers.blocked_main_thread: The number of triggers that have blocked the main thread. This is a counter and should monotonically increase over time; pay attention to large differences between recording (or quick acceleration) counts, as it’s indicative of a larger problem.triggers.running: The number of triggers currently on a triggerer instance. This metric should be monitored to determine whether you need to increase the number of triggerer instances you are running. While the official documentation claims that up to tens of thousands of triggers can be on an instance, the common operational number is much lower. Tune at your discretion, but depending on the complexity of your triggers, you may need to add a new instance for every few hundred consistent triggers you run.Executors/workersDepending on the executor you use, you will need to monitor your executors and workers a bit differently.The Kubernetes executor will utilize the Kubernetes API to schedule tasks for execution; as such, you should utilize the Kubernetes events and metrics servers to gather logs and metrics for your task instances. Common metrics to collect on an individual task are CPU and memory usage. This is crucial for tuning requests or mutating individual task resource requests to ensure that they execute safely.The Celery worker has additional components and long-lived processes that you need to metricize. You should monitor an individual Celery worker’s memory and CPU utilization to ensure that it is not over- or under-provisioned, tuning allocated resources accordingly. You also need to monitor the message broker (usually Redis or RabbitMQ) to ensure that it is appropriately sized. Finally, it is critical to measure the queue length of your message broker and ensure that too much “back pressure” isn’t being created in the system. If you find that your tasks are sitting in a queued state for a long period of time and the queue length is consistently growing, it’s a sign that you should start an additional Celery worker to execute on scheduled tasks. You should also investigate using the native Celery monitoring tool Flower (https://flower.readthedocs.io/en/latest/) for additional, more nuanced methods of monitoring.Web serverThe Airflow web server is the UI for not just your Airflow deployment but also the RESTful interface. Especially if you happen to be controlling Airflow scheduling behavior with API calls, you should keep an eye on the following metrics:Response time: Measure the time taken for the API to respond to requests. This metric indicates the overall performance of the API and can help identify potential bottlenecks.Error rate: Monitor the rate of errors returned by the API, such as 4xx and 5xx HTTP status codes. High error rates may indicate issues with the API implementation or underlying systems.Request rate: Track the rate of incoming requests to the API over time. Sudden spikes or drops in request rates can impact performance and indicate changes in usage patterns.System resource utilization: Monitor resource utilization metrics such as CPU, memory, disk I/O, and network bandwidth on the servers hosting the API. High resource utilization can indicate potential performance bottlenecks or capacity limits.Throughput: Measure the number of successful requests processed by the API per unit of time. Throughput metrics provide insights into the API’s capacity to handle incoming traffic.Now that you have some basic metrics to collect from your core architectural components and can monitor the overall health of an application, we need to monitor the actual DAGs themselves to ensure that they function as intended.Monitoring your DAGsThere are multiple aspects to monitoring your DAGs, and while they’re all valuable, they may not all be necessary. Take care to ensure that your monitoring and alerting stack match your organizational needs with regard to operational parameters for resiliency and, if there is a failure, recovery times. No matter how much or how little you choose to implement, knowing that your DAGs work and if and how they fail is the first step in fixing problems that will arise.LoggingAirflow writes logs for tasks in a hierarchical structure that allows you to see each task’s logs in the Airflow UI. The community also provides a number of providers to utilize other services for backing log storage and retrieval. A complete list of supported providers is available at https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/logging.html.Airflow uses the standard Python logging framework to write logs. If you’re writing custom operators or executing Python functions with a PythonOperator, just make sure that you instantiate a Python logger instance, and then the associated methods will handle everything for you.AlertingAirflow provides mechanisms for alerting on operational aspects of your executing workloads that can be configured within your DAG:Email notifications: Email notifications can be sent if a task is put into a marked or retry state with the `email_on_failure` or `email_on_retry` state, respectively. These arguments can be provided to all tasks in the DAG with the `default_args` key work in the DAG, or individual tasks by setting the keyword argument individually.Callbacks: Callbacks are special actions that are executed if a specific state change occurs. Generally, these callbacks should be thoughtfully leveraged to send alerts that are critical operationally:on_success_callback: This callback will be executed at both the task and DAG levels when entering a successful state. Unless it is critical that you know whether something succeeds, we generally suggest not using this for alerting.on_failure_callback: This callback is invoked when a task enters a failed state. Generally, this callback should always be set and, in critical scenarios, alert on failures that require intervention and support.on_execute_callback: This is invoked right before a task executes and only exists at the task level. Use sparingly for alerting, as it can quickly become a noisy alert when overused.on_retry_callback: This is invoked when a task is placed in a retry state. This is another callback to be cautious about as an alert, as it can become noisy and cause false alarms.sla_miss_callback: This is invoked when a DAG misses its defined SLA. This callback is only executed at the end of a DAG’s execution cycle so tends to be a very reactive notification that something has gone wrong.SLA monitoringAs awesome of a tool as Airflow is, it is a well-known fact in the community that SLAs, while largely functional, have some unfortunate details with regard to implementation that can make them problematic at best, and they are generally regarded as a broken feature in Airflow. We suggest that if you require SLA monitoring on your workflows, you deploy a CRON job monitoring tool such as healthchecks (https://github.com/healthchecks/healthchecks) that allows you to create suppressive alerts for your services through its rest API to manage SLAs. By pairing this third- party service with either HTTP operators or simple requests from callbacks, you can ensure that your most critical workflows achieve dynamic and resilient SLA alerting.Performance profilingThe Airflow UI is a great tool for profiling the performance of individual DAGs:The Gannt chart view: This is a great visualization for understanding the amount of time spent on individual tasks and the relative order of execution. If you’re worried about bottlenecks in your workflow, start here.Task duration: This allows you to profile the run characteristics of tasks within your DAG over a historical period. This tool is great at helping you understand temporal patterns in execution time and finding outliers in execution. Especially if you find that a DAG slows down over time, this view can help you understand whether it is a systemic issue and which tasks might need additional development.Landing times: This shows the delta between task completion and the start of the DAG run. This is an un-intuitive but powerful metric, as increases in it, when paired with stable task durations in upstream tasks, can help identify whether a scheduler is under heavy load and may need tuning.Additional metrics that have proven to be useful (but may need to be calculated) include the following:Task startup time: This is an especially useful metric when operating with a Kubernetes executor. To calculate this, you will need to calculate the difference between `start_date` and `execution_date` on each task instance. This metric will especially help you identify bottlenecks outside of Airflow that may impact task run times.Task failure and retry counts: Monitoring the frequency of task failures and retries can help identify information about the stability and robustness of your environment. Especially if these types of failure can be linked back to patterns in time or execution, it can help debug interactions with other services.DAG parsing time: Monitoring the amount of time a DAG takes to parse is very important to understand scheduler load and bottlenecks. If an individual DAG takes a long time to load (either due to heavy imports or long blocking calls being executed during parsing), it can have a material impact on the timeliness of scheduling tasks.ConclusionIn this article, we covered some essential strategies to effectively monitor both the core Airflow system and individual DAGs post-deployment. We highlighted the importance of active and suppressive monitoring techniques and provided insights into the critical metrics to track for each component, including the scheduler, metadata database, triggerer, executors/workers, and web server. Additionally, we discussed logging, alerting mechanisms, SLA monitoring, and performance profiling techniques to ensure the reliability, scalability, and efficiency of Airflow workflows. By implementing these monitoring practices and leveraging the insights gained, operators can proactively manage and optimize their Airflow deployments for optimal performance and reliability.Author BioDylan Intorf is a solutions architect and data engineer with a BS from Arizona State University in Computer Science. He has 10+ years of experience in the software and data engineering space, delivering custom tailored solutions to Tech, Financial, and Insurance industries.Kendrick van Doorn is an engineering and business leader with a background in software development, with over 10 years of developing tech and data strategies at Fortune 100 companies. In his spare time, he enjoys taking classes at different universities and is currently an MBA candidate at Columbia University.Dylan Storey has a B.Sc. and M.Sc. from California State University, Fresno in Biology and a Ph.D. from University of Tennessee, Knoxville in Life Sciences where he leveraged computational methods to study a variety of biological systems. He has over 15 years of experience in building, growing, and leading teams; solving problems in developing and operating data products at a variety of scales and industries.

2
0
62994

article-image-vertex-ai-workbench-your-complete-guide-to-scaling-machine-learning-with-google-cloud

Jasmeet Bhatia, Kartik Chaudhary

04 Nov 2024

15 min read

Vertex AI Workbench: Your Complete Guide to Scaling Machine Learning with Google Cloud

Jasmeet Bhatia, Kartik Chaudhary

04 Nov 2024

15 min read

1
1
67379

How-To Tutorials - Data Science

How to Build Supervised Anomaly Detection Models for Time Series Data

How to Prepare Data Using AWS Glue

Microsoft Fabric Data Agents: Building AI-Powered, Conversational Analytics for Enterprise Data

What Is a System of Action? Building an AI-Ready Data Foundation for Real-Time Decision Intelligence

Large Language Models and Graph Machine Learning

MongoDB CRUD Operations Explained: Create, Read, Update, Delete with mongosh and Python

How to Fine-Tune Gemma LLM for Low-Resource Languages (Kaggle Winning Strategies)

Creating a Semantic Model in Power BI

Semantic Search in OpenSearch

Building Trust in AI: The Role of RAG in Data Security and Transparency

Trending Topics

Revolutionize Power BI Queries with OpenAI

Mastering Performance Tuning with DAX Studio and VertiPaq Analyzer

Mastering Transfer Learning: Fine-Tuning BERT and Vision Transformers

Airflow Ops Best Practices: Observation and Monitoring

Vertex AI Workbench: Your Complete Guide to Scaling Machine Learning with Google Cloud

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access