
How-To Tutorials - AI Tools

89 Articles

Automating Data Enrichment with Snowflake and AI

Shankar Narayanan
30 Oct 2023
9 min read
Introduction

In today's data-driven world, businesses constantly seek ways to extract more value from their data. One of the key strategies for accomplishing this is data enrichment: enhancing your existing datasets with additional information, which can lead to improved decision-making, customer engagement, and personalized experiences. In this blog, we'll explore how to automate data enrichment using Snowflake, a powerful data warehousing platform, and Generative AI techniques.

Understanding Data Enrichment

Data enrichment is simply the practice of enhancing your existing datasets with additional, relevant information. This supplementary data can include demographic data, geographic data, social media profiles, and much more. The primary goal is to improve the quality and depth of your data, making it more valuable for analytics, reporting, and decision-making.

Why Automate Data Enrichment?

Automating data enrichment not only saves time and resources but also improves data quality, supports real-time updates, and helps organizations stay competitive in an increasingly data-centric world. Whether in e-commerce, finance, healthcare, marketing, or any other industry, automation can be a strategic advantage that allows you to extract greater value from your data.

Efficiency: Manual data enrichment is time-consuming and resource-intensive. Automation allows you to process large volumes of data rapidly, reducing the time and effort required.

Consistency: Human errors are common when manually enriching data. Automation keeps the process consistent and accurate, reducing the risk of errors affecting decision-making.

Scalability: As your organization grows and accumulates more data, automating data enrichment ensures you can handle larger datasets without a proportional increase in human resources.

Enhanced data quality: Automated processes can validate and cleanse data, leading to higher data quality. High-quality data is essential for meaningful analytics and reporting.

Competitive advantage: In a competitive business landscape, access to enriched and up-to-date data can give you a significant advantage. It allows for more accurate market analysis, better customer understanding, and smarter decision-making.

Personalization: Automated data enrichment can support personalized customer experiences, which are increasingly crucial for businesses. It allows you to tailor content, product recommendations, and marketing efforts to individual preferences and behaviors.
Cost-efficiency: While there are costs associated with setting up and maintaining automated data enrichment processes, these costs can be significantly lower in the long run compared to manual efforts, especially as the volume of data grows.

Compliance and data security: Automated processes can be designed to adhere to data privacy regulations and security standards, reducing the risk of data breaches and compliance issues.

Reproducibility: Automated data enrichment processes can be documented, version-controlled, and easily reproduced, making it easier to audit and track changes over time.

Data variety: As the sources and formats of data continue to expand, automation allows you to efficiently handle various data types, whether structured, semi-structured, or unstructured.

Snowflake for Data Enrichment

Snowflake, a cloud-based data warehousing platform, provides powerful features for data manipulation and automation. At a basic level, Snowflake can be used to:

- Create tables for raw data and enrichment data.
- Load data into these tables using the COPY INTO command.
- Create views that combine raw and enrichment data based on common keys.

Code Examples: Snowflake Data Enrichment

Create tables: In Snowflake, create tables for your raw data and enrichment data:

```sql
-- Create a table for raw data
CREATE OR REPLACE TABLE raw_data (
    id    INT,
    name  STRING,
    email STRING
);

-- Create a table for enrichment data
CREATE OR REPLACE TABLE enrichment_data (
    email    STRING,
    location STRING,
    age      INT
);
```

Load data: Load the raw and enrichment data into their respective tables:

```sql
-- Load raw data
COPY INTO raw_data (id, name, email)
FROM @<raw_data_stage>/raw_data.csv
FILE_FORMAT = (TYPE = CSV);

-- Load enrichment data
COPY INTO enrichment_data (email, location, age)
FROM @<enrichment_data_stage>/enrichment_data.csv
FILE_FORMAT = (TYPE = CSV);
```

Automate data enrichment: Create a view that combines the raw and enrichment data:

```sql
-- Create a view that enriches the raw data
CREATE OR REPLACE VIEW enriched_data AS
SELECT
    rd.id,
    rd.name,
    ed.location,
    ed.age,
    -- Use generative AI to generate a description for the enriched data
    <Generative_AI_function>(ed.location, ed.age) AS description
FROM raw_data rd
JOIN enrichment_data ed
  ON rd.email = ed.email;
```
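Note that `<Generative_AI_function>` above is a placeholder rather than a built-in Snowflake function. As one hedged sketch of how that gap could be filled, the description could be generated outside Snowflake in Python and written back to a helper table; the connection parameters are placeholders, and `generate_description` is only a stand-in for a real generative-AI call.

```python
# Hedged sketch: generate the "description" column in Python and write it back
# to Snowflake. Assumes the snowflake-connector-python package; connection
# details are placeholders, and generate_description() stands in for a real
# generative-AI call (e.g., an LLM API).
import snowflake.connector

def generate_description(location: str, age: int) -> str:
    # Stand-in for a generative model call.
    return f"Customer based in {location}, approximately {age} years old."

conn = snowflake.connector.connect(
    account="<your_account>", user="<your_user>", password="<your_password>",
    warehouse="<your_warehouse>", database="<your_database>", schema="<your_schema>",
)
cur = conn.cursor()

# Read the joined raw + enrichment rows.
rows = cur.execute("""
    SELECT rd.id, ed.location, ed.age
    FROM raw_data rd
    JOIN enrichment_data ed ON rd.email = ed.email
""").fetchall()

# Store one generated description per customer in a helper table.
cur.execute("CREATE TABLE IF NOT EXISTS enriched_descriptions (id INT, description STRING)")
cur.executemany(
    "INSERT INTO enriched_descriptions (id, description) VALUES (%s, %s)",
    [(row_id, generate_description(location, age)) for row_id, location, age in rows],
)
conn.close()
```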
Leveraging Snowflake for Data Enrichment

Using Snowflake for data enrichment is a smart choice, especially if your organization already relies on this cloud-based data warehousing platform. Snowflake provides a robust set of features for data manipulation and automation, making it an ideal environment for enhancing the value of your data. Here are a few examples of how you can use Snowflake for data enrichment:

Data storage and management: Snowflake lets you store and manage your data efficiently by separating storage and compute resources, which provides a scalable and cost-effective way to manage large datasets. You can store both your raw and enriched data within Snowflake, making it readily accessible for enrichment processes.

Data enrichment: You can perform data enrichment by combining data from your raw and enrichment tables, using SQL JOIN operations to bring together related data based on common keys such as email addresses:

```sql
-- Create a view that enriches the raw data
CREATE OR REPLACE VIEW enriched_data AS
SELECT
    rd.id,
    rd.name,
    ed.location,
    ed.age
FROM raw_data rd
JOIN enrichment_data ed
  ON rd.email = ed.email;
```

Schedule updates: Automate data enrichment by creating scheduled tasks within Snowflake. You can set up tasks to run at regular intervals, ensuring that your enriched data stays up to date.

```sql
-- Example: a scheduled task to update the enriched data
-- (assumes enriched_data is materialized as a table rather than the view above)
CREATE OR REPLACE TASK update_enriched_data
  WAREHOUSE = <your_warehouse>
  SCHEDULE = '1440 MINUTE'  -- run once a day
AS
INSERT INTO enriched_data (id, name, location, age)
SELECT
    rd.id,
    rd.name,
    ed.location,
    ed.age
FROM raw_data rd
JOIN enrichment_data ed
  ON rd.email = ed.email;
```

Security and compliance: Snowflake provides robust security features and complies with various data privacy regulations. Ensure that your data enrichment processes adhere to the necessary security and compliance standards to protect sensitive information.

Monitoring and optimization: Regularly monitor the performance of your data enrichment processes. Snowflake offers tools for monitoring query execution, so you can identify and address any performance bottlenecks. Optimization is crucial for efficient data enrichment.

Real-World Applications

Data enrichment is a versatile tool. Organizations across various sectors use it to improve their data quality, decision-making processes, customer experiences, and overall operational efficiency. By augmenting their datasets with additional information, these organizations gain a competitive edge and drive innovation in their respective industries:

E-commerce and retail

- Product recommendations: E-commerce platforms use data enrichment to analyze customer browsing and purchase history. These enriched customer profiles help generate personalized product recommendations, increasing sales and customer satisfaction.
- Inventory management: Retailers leverage enriched supplier data to optimize inventory management, ensuring they have the right products in stock at the right time.

Marketing and advertising

- Customer segmentation: Marketers use enriched customer data to create more refined customer segments. This enables them to tailor advertising campaigns and messaging for specific audiences, leading to higher engagement rates.
- Ad targeting: Enriched demographic and behavioral data supports precise ad targeting. Advertisers can serve ads to the audiences most likely to convert, reducing wasted ad spend.

Financial services

- Credit scoring: Financial institutions augment customer data with credit scores, employment history, and other financial information to assess credit risk more accurately.
- Fraud detection: Banks use data enrichment to detect suspicious activities by analyzing transaction data enriched with historical fraud patterns.

Healthcare

- Patient records: Healthcare providers enhance electronic health records (EHR) with patient demographics, medical histories, and test results. This results in comprehensive and up-to-date patient profiles, leading to better care decisions.
- Drug discovery: Enriching molecular and clinical trial data accelerates drug discovery and research, potentially leading to breakthroughs in medical treatments.

Social media and customer support

- Social media insights: Social media platforms use data enrichment to provide businesses with detailed insights into their followers and engagement metrics, helping them refine their social media strategies.
- Customer support: Enriched customer profiles enable support teams to offer more personalized assistance, increasing customer satisfaction and loyalty.

Conclusion

Automating data enrichment with Snowflake and Generative AI is a powerful approach for businesses seeking to gain a competitive edge through data-driven insights. By combining a robust data warehousing platform with advanced AI techniques, you can efficiently and effectively enhance your datasets. Embrace automation, follow best practices, and unlock the full potential of your enriched data.

Author Bio

Shankar Narayanan (aka Shanky) has worked on numerous cloud and emerging technologies such as Azure, AWS, Google Cloud, IoT, Industry 4.0, and DevOps, to name a few. He has led architecture design and implementation for many enterprise customers and helped them break the barrier and take the first step towards a long and successful cloud journey. He was one of the early adopters of Microsoft Azure and Snowflake Data Cloud. Shanky likes to contribute back to the community: he contributes to open source, is a frequently sought-after speaker, and has delivered numerous talks on Microsoft technologies and Snowflake. He is recognized as a Data Superhero by Snowflake and an SAP Community Topic Leader by SAP.


Efficient Data Caching with Vector Datastore for LLMs

Karthik Narayanan Venkatesh
25 Oct 2023
9 min read
Introduction

In the ever-evolving landscape of artificial intelligence, Large Language Models (LLMs) have taken center stage, transforming the way we interact with and understand vast amounts of textual data. With the proliferation of these advanced models, the need for efficient data management and retrieval has become paramount. Enter Vector Datastore, a game-changer in the realm of data caching for LLMs. This article explores how Vector Datastore's approach, based on vector representations and similarity search, empowers LLMs to swiftly access and process data, improving their performance and capabilities.

How does a vector datastore enable a data cache for LLMs?

Every online source you scan mentions chatbots, LLMs, or GPT. Large language models are everywhere, and new ones are released almost every week. Before seeing how vector databases enable data caches for large language models, it helps to understand what they are and why they matter to these models.

Vector databases: what are they?

To understand vector databases, you first need an idea of vector embeddings. An embedding is a data representation that carries semantic information, helping an AI system better understand a dataset while also serving as a form of long-term memory. Understanding and remembering are the critical elements, especially when learning something new.

AI models usually generate these embeddings. Every large language model works with a wide variety of features, which makes their representations difficult to manage. Embeddings represent the many dimensions of the data, so AI models can capture patterns, relationships, and hidden structures.

Storing vector embeddings in traditional scalar-based databases is challenging: such databases cannot keep up with the scale and complexity of this kind of data. The complexities that come with vector embeddings call for a specialized database, which is why vector databases exist. A vector database provides storage and query capabilities optimized for the unique structure of vector embeddings, delivering high performance, easy search, data retrieval, and scalability by comparing similarities between vectors.

Though vector databases are difficult to implement, several large technology companies are not only developing them but also making them manageable. Since they are expensive to operate, proper calibration is needed to achieve high performance.

How does it work?

Take the simple example of an application built on a large language model, such as ChatGPT: it sits on top of a large volume of data and content, while the user interacts only with the application. As the user, you enter a query in the application. The query is passed to the embedding model, which creates vector embeddings based on the content that needs to be indexed. Once this is done, the vector embeddings are stored in the vector database.
The embedding happens for whatever content is meant to be stored. The vector database then produces a result, which the system sends back to the user. As the user continues making different queries, each query goes through the same embedding model, and the resulting embedding is used to query the database for similar vector embeddings.

Let's look at the whole process in more detail. A vector database incorporates diverse algorithms dedicated to Approximate Nearest Neighbor (ANN) search. These algorithms encompass techniques such as hashing, graph-based search, and quantization, which are combined into a structured pipeline for retrieving neighboring vectors for a queried input.

The outcome of this search depends on how close, or approximate, the retrieved vectors are to the original query. The pivotal factors are therefore accuracy and speed: a trade-off exists between query speed and result precision, where slower output generally corresponds to a more accurate outcome.

The process of querying a vector database unfolds in three fundamental stages:

1. Indexing. When a vector embedding enters the vector database, a variety of algorithms map it onto specific data structures that optimize the search process. This preparatory step significantly enhances the speed and efficiency of subsequent searches.

2. Querying. The vector database systematically compares the queried vector with the previously indexed vectors. This comparison applies a similarity metric, a crucial determinant in identifying the nearest neighbor among the indexed vectors. The precision and efficacy of this phase are paramount to the overall accuracy of the search results.

3. Post-processing. Once the nearest neighbor is found, the vector database initiates a post-processing stage, whose specifics vary with the particular vector database in use. Post-processing may refine the final output of the query, ensuring it aligns with the user's requirements. The database might also re-rank the nearest neighbors to enhance its future search capabilities. This step guarantees that the delivered results are accurate and optimized for subsequent reference, elevating the overall utility of the vector database in complex search scenarios.

Implementing a vector datastore in an LLM

Let's consider an example of how a vector datastore can be installed and used with a large language model. Before starting the implementation, install the vector datastore library:

```bash
pip install vectordatastore
```

Assuming you have a dataset containing text snippets, the code looks like the following:

```python
from vectordatastore import VectorDatastore

# Sample dataset
dataset = {
    "1": "Text snippet 1",
    "2": "Text snippet 2",
    # ... more data points ...
}

# Initialize Vector Datastore
vector_datastore = VectorDatastore()

# Index data into Vector Datastore
for key, text in dataset.items():
    vector_datastore.index(key, text)

# Query Vector Datastore from the LLM
query = "Query text snippet"
similar_texts = vector_datastore.query(query)

# Process similar_texts in the LLM
# ...
```
In this example, the vector datastore indexes the dataset using vector representations. When the large language model needs to retrieve data similar to the query text, it uses the vector datastore to obtain the relevant snippets quickly.

How a vector datastore enables data caching in LLMs

A vector datastore enables efficient data caching for Large Language Models through its approach to handling data. Traditional caching mechanisms store data under keys, and retrieval means matching those keys. However, LLMs often work with complex, high-dimensional data, such as text embeddings, which are not easily indexed or retrieved with traditional key-value pairs. A vector datastore addresses this challenge by leveraging vector representations of data points:

1. Vector representation: The datastore stores data points in vectorized form. Each data point, whether a text snippet or any other type of information, is transformed into a high-dimensional numerical vector. This vectorization captures the semantic meaning of, and relationships between, data points.

2. Similarity search: Instead of relying on exact key matches, the datastore performs similarity searches based on vector representations. When an LLM needs specific data, its query is translated into a vector representation using the same method employed during storage. The query vector is then compared against the stored vectors using similarity metrics like cosine similarity or Euclidean distance.

3. Efficient retrieval: By organizing data as vectors and employing similarity search, the datastore can quickly identify the vectors most similar to the query vector. This efficient retrieval allows LLMs to access relevant data points without scanning the entire dataset, significantly reducing retrieval time.

4. Adaptive indexing: The datastore dynamically adjusts its indexing strategy based on the data and queries it receives. As the dataset grows or query patterns change, it adapts its indexing structures to maintain optimal search efficiency, so the cache remains efficient even as data and query patterns evolve.

5. Scalability: The datastore is designed to handle the large-scale datasets common in LLM applications. Its architecture allows horizontal scaling, distributing the workload efficiently across multiple nodes or servers, so it can accommodate the vast amount of data processed by LLMs without compromising performance.

The ability to work with vectorized data and perform similarity searches based on vector representations is what lets a vector datastore serve as an efficient data cache for Large Language Models. By avoiding the limitations of traditional key-based caching mechanisms, it significantly enhances the speed and responsiveness of LLMs, making it a valuable tool in natural language processing.
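The snippets above are illustrative. As a minimal, hedged sketch of the vectorize, similarity-search, and retrieval loop described in this section, the same idea can be expressed with the sentence-transformers package and NumPy; the model name and sample snippets here are assumptions, not part of the original example.

```python
# Hedged sketch of the vectorize -> cosine-similarity -> top-k retrieval loop
# described above, using sentence-transformers and NumPy. The model name is an
# assumption; any sentence-embedding model would do.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = {
    "1": "Text snippet 1",
    "2": "Text snippet 2",
    "3": "Text snippet 3",
}

# "Indexing": embed every document once and cache the vectors.
ids = list(docs)
doc_vecs = model.encode([docs[i] for i in ids], normalize_embeddings=True)

def query(text: str, top_k: int = 2):
    # "Querying": embed the query and rank documents by cosine similarity.
    q = model.encode([text], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity, since vectors are L2-normalized
    best = np.argsort(-scores)[:top_k]
    # "Post-processing": return ids, scores, and the original snippets.
    return [(ids[i], float(scores[i]), docs[ids[i]]) for i in best]

print(query("Query text snippet"))
```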
Conclusion

The development of LLMs is one of the crucial technological advancements of our time. Not only do they have the potential to revolutionize various aspects of our lives, but it is also imperative that we use them ethically and responsibly in order to reap their benefits.

Author Bio

Karthik Narayanan Venkatesh (aka Kaptain), founder of WisdomSchema, has multifaceted experience in the data analytics arena. He has been associated with the data analytics domain since the early 2000s, with a ringside view of transformations in this industry. He has led teams that architected and built scalable data platform solutions across the technology spectrum.

As a niche consulting provider, he bridged the gap between business and technology and drove BI adoption through innovative approaches in an agnostic manner. He is a sought-after speaker who has presented many lectures on SAP, Analytics, Snowflake, AWS, and GCP technologies.


Getting Started with AI Builder

Adeel Khan
23 Oct 2023
9 min read
Introduction

AI is transforming the way businesses operate, enabling them to improve efficiency, reduce costs, and enhance customer satisfaction. However, building and deploying AI solutions can be challenging, at times even for pro developers, due to the inherent complexity of traditional tools. That's where Microsoft AI Builder comes in. AI Builder is a low-code AI platform that empowers users to infuse AI into business workflows without writing a single line of code.

AI Builder is integrated with Microsoft Power Platform, a suite of tools that allows users to build apps, automate processes, and analyze data. With AI Builder, users can leverage pre-built or custom AI models to enhance their Power Apps and Power Automate solutions.

One of the most powerful features of AI Builder is the prediction model, which allows users to create AI models that predict outcomes based on historical data. The prediction model can predict the following kinds of outcomes:

- Binary outcome: a choice between two values. An example would be booking status, canceled/redeemed.
- Multiple outcomes: a choice between multiple yet fixed values. An example would be stage of delivery, early/on-time/delayed/escalated.
- Numeric outcome: a number value. An example would be revenue per customer.

In this blog post, we will show you how to create and use a prediction model with AI Builder using our business data. We will focus on numeric outcomes and use the example mentioned above: we will attempt to predict the possible lifetime revenue we can generate from customers. Let's get started!

Getting Data Ready

The process of building a model begins with data. We will not cover the AI Builder prerequisites here, but you can easily find them on Microsoft Learn. The data in focus is sample customer-profile data from a retailer system. It includes basic profile details (education, marital status, customer since, kids at home, teens at home), interaction data (participation in campaigns), and a transaction summary (purchases both online and offline, product categories).

The data needs to be either imported into Dataverse or already existing there. In this case, we will import the file "Customer_profile_sample.xls". To import the data, perform the following actions:

1. Open http://make.powerapps.com and log in to your Power Platform environment.
2. Select the right environment; we recommend performing these actions in a development environment.
3. From the left menu pane, select Tables.
4. Select the option to upload from Excel. This will start a data import process. (Figure 1: Upload data in Dataverse from an Excel file)
5. Upload the Excel file mentioned above, "Customer_profile_sample.xls". The system will read the file content and give a summary of the data in the file. Note that if your environment has the Copilot feature on, you will see GPT in action: it will not only read the details of the file but also choose the table name and add descriptions to the columns. (Figure 2: Copilot in action with file summary)
6. Verify the details, making sure the table is named "Customer Profile" and the primary column is "ID." Once verified, click Create and let the system upload the data into this new table. The system will then move you to the table view screen. (Figure 3: Table view screen)
7. On this screen, click Columns under the Schema section. This takes you to the column list. Scroll down and find the column called "Revenue." Right-click the column and select Edit. (Figure 4: Updating column information)
8. Check the Searchable feature and save the changes.
9. Move back to the table list by clicking Tables in the left navigation. Select the "Customer Profile" table and choose Publish from the top menu. This applies the change made in step 8. Wait until a green bar appears with the message "Publish completed."

This concludes the first part: getting the sample data imported.

Creating a Model

Now that our data is ready and available in Dataverse, let's start building the model. We will follow the next set of actions to deliver the model with this low-code/no-code tool.

1. The first step is to open AI Builder. To open AI Builder Studio, go to http://make.powerapps.com.
2. From the left navigation, click AI Models. This opens the AI model studio.
3. Browse the models in the top navigation bar. There are many out-of-the-box models for various business use cases, but this time we will select the prediction model from the options. (Figure 5: Prediction model icon)
4. The next pop-up screen provides details about the prediction model feature and how it can be used. Select it to begin the model creation process. The model creation process is a step-by-step journey that we will explain one step at a time.
5. The first action is to select the historical outcome. Here we select the table we created in the section above, "Customer Profile," and the column (label) we want the model to predict, in this case "Revenue." (Figure 6: Step one, historical outcome selection)
6. The next step is the critical step in any prediction model: feature selection. In this step, we select the columns that give the AI model enough information to assess the impact and influence of these features and train itself. The table now has 33 columns (27 imported from the sample file and the rest added as part of the Dataverse process). We will select 27 columns as the most important features for this model. The ones we will not select are:
   - Created On: a date column created by Dataverse to track the record creation date; not relevant for predicting revenue.
   - ID: a sequential number, so we can confidently decide it is not relevant for predicting our label, "Revenue."
   - Record Created On: a Dataverse-added column.
   - Revenue (base): a base-currency value.
   - UTC Conversion Time Zone: a Dataverse-added column.
   Before moving to the next step, make sure you can see 27 columns selected. (Figure 7: Selecting features/columns)
7. The next step is to choose the training data with business logic. As you may have noticed, our original imported data contains some rows where the revenue field is empty. Such data would not help train the model, so we want the model to train only on rows that have revenue information available. We can do so by selecting "Filter the data" and then adding the condition row shown in the figure. (Figure 8: Selecting the right dataset)
8. Finally, we reach the last verification step. Here we perform one last action before training the model: giving the model a proper name. Click the icon to change the name of the model. We shall name the model "Prediction – Revenue." (Figure 9: Renaming the model)
9. Begin the model training.

Evaluation of the Model

The ultimate step of any model creation is the assessment of the model. Once our model is ready and trained, the system generates model performance details, which can be accessed by clicking on the model from AI Studio. Let's evaluate and read into our model. (Figure 10: Model performance summary)

Performance: AI Builder grades models based on R-squared (goodness of fit). An R-squared value of 88% means that 88% of the variation in revenue can be explained by the model's inputs; the remaining 12% could be due to other factors not included in the model. For the set of information provided, it is a good start and, in some cases, an acceptable outcome as well.
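To make the R-squared reading concrete, here is a small, hedged illustration in Python on made-up numbers (not the actual AI Builder output); it only shows how the metric is computed and interpreted.

```python
# Hedged illustration of R-squared on made-up numbers (not AI Builder's output):
# R-squared is the share of the variation in revenue that the predictions explain.
from sklearn.metrics import r2_score

actual_revenue    = [120, 340, 560, 80, 910, 150, 430, 275]
predicted_revenue = [130, 310, 540, 95, 870, 190, 400, 300]

r2 = r2_score(actual_revenue, predicted_revenue)
print(f"R-squared: {r2:.2f}")  # an R-squared of 0.88 would mean 88% of the variation is explained
```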
Most influential data: The model also reports the features most influential on our outcome, "Revenue." In this case, monthly wine purchases (MntWines) carries the highest weight and suggests the strongest association with the revenue an organization can make from a customer. These weights can trigger a lot of business ideation and further improve business KPIs.

Warnings: In the details section, you can also view the warnings the system has generated. In this case, it has identified a few columns, which we intentionally selected in our earlier steps, as having no association with revenue. This information can be used to fine-tune the model further and remove unnecessary features from the training and feature selection explained earlier. (Figure 11: Warnings tab in details)

Conclusion

This marks the completion of our model preparation. Once we are satisfied with the model's performance, we can choose to publish the model. The model can then be used through Power Apps or Power Automate to predict revenue and reflect it in Dataverse. This feature of AI Builder opens the door to many possibilities, and the ability to deliver it in a short time makes it extremely useful. Keep experimenting and keep learning.

Author Bio

Mohammad Adeel Khan is a Senior Technical Specialist at Microsoft. A seasoned professional with over 19 years of experience with various technologies and digital transformation projects, he engages with enterprise customers across geographies and helps them accelerate digital transformation using Microsoft Business Applications, Data, and AI solutions. In his spare time, he collaborates with like-minded people and helps solve business problems for nonprofit organizations using technology.

Adeel is also known for his unique approach to learning and development. During the COVID-19 lockdown, he introduced his 10-year-old twins to Microsoft Learn. The twins not only developed their first Microsoft Power Platform app—an expense tracker—but also became one of the youngest twins to earn the Microsoft Power Platform certification.


AI-Powered Data Visualization with Snowflake

Shankar Narayanan
19 Oct 2023
8 min read
Introduction

Large language models (LLMs) and generative Artificial Intelligence (AI) are transforming the productivity of enterprises and businesses. They promise automation, fast generation of insights, and relief from repetitive tasks across large pools of data. The pursuit of insights has also produced cutting-edge data storage solutions, including the Snowflake Data Cloud, which can be paired with artificial intelligence for visualizing data. Let us explore the synergy between Snowflake and AI, which facilitates data exploration while empowering businesses to acquire deeper insights.

Snowflake Data Cloud: the foundation for modern data warehousing

Before we start our exploration, it is worth understanding the significant role Snowflake plays in modern data warehousing. It is a cloud-based data warehousing platform known for performance, ease of use, and scalability. Because it provides a flexible and secure environment for analyzing and storing data, it is an ideal choice for enterprises that deal with diverse and large datasets.

Key features

Some of the critical elements of the Snowflake Data Cloud are:

- Separation of compute and storage: Snowflake's architecture scales an organization's computing resources independently of its storage, which helps optimize performance and cost.
- Data sharing: With seamless data sharing, Snowflake lets enterprises share data between organizations, fostering collaboration and data monetization opportunities.
- Multi-cloud support: Snowflake is compatible with the major cloud providers, allowing businesses to leverage their preferred cloud infrastructure.

Unleashing the potential of AI-powered data visualization

Once you understand Snowflake, it is time to meet a game changer: AI-powered data visualization. AI algorithms have evolved considerably. They assist in the analysis and exploration of complex datasets while revealing insights and patterns that can be challenging to discover through traditional methods.

The role of AI in data visualization

AI plays a significant role in data visualization, including:

- Predictive analytics: Machine learning models help forecast anomalies and trends, enabling businesses to make proactive decisions.
- Automated insights: AI can analyze datasets quickly, reducing the time required for manual analysis and extracting meaningful insights.
- Natural language processing: NLP algorithms can turn textual data into visual representations, making unstructured data readily accessible.

Harnessing the power of AI and Snowflake

Let us explore how Snowflake and artificial intelligence can work together to help a business gain deeper insights.

Data integration: The ease of integration offered by Snowflake allows an organization to centralize its data, whether it is consolidated from IoT devices, external partners, or internal sources. The unified data repository eventually becomes the foundation for AI-powered exploration.
Example: creating a Snowflake database and warehouse:

```sql
-- Create a new Snowflake database
CREATE DATABASE my_database;

-- Create a virtual warehouse for query processing
CREATE WAREHOUSE my_warehouse
  WITH WAREHOUSE_SIZE = 'X-SMALL'
  AUTO_SUSPEND = 600
  AUTO_RESUME = TRUE;
```

Loading data into Snowflake:

```sql
-- Create an external stage for data loading
CREATE OR REPLACE STAGE my_stage
URL = 's3://my-bucket/data/'
CREDENTIALS = (AWS_KEY_ID = 'your_key_id' AWS_SECRET_KEY = 'your_secret_key');

-- Copy data from the stage into a Snowflake table
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = CSV)
ON_ERROR = 'CONTINUE';
```

AI-driven code generation: One of the exciting aspects of combining AI and Snowflake is the ability of artificial intelligence to generate code for data visualization. Here is how the process works:

- Data preprocessing: AI algorithms can prepare data for visualization, cleaning and transforming it while reducing the burden on data engineers.
- Visualization suggestions: AI analyzes the data and suggests appropriate visualization types, such as scatter plots, charts, and bars, based on the characteristics of the dataset.
- Automated code generation: After a visualization type is chosen, AI generates the code needed to create the interactive visualization, making the process accessible to non-technical users.

Let us see this with the help of the example below:

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Using AI to determine the optimal number of clusters (K) in K-means
# (assumes scaled_data is a feature matrix prepared earlier)
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2, 10))
visualizer.fit(scaled_data)
visualizer.show()
```

Interactive data exploration: With AI-generated visualizations, you can interact with the data effortlessly. The business can drill down, explore, and filter its data dynamically, gaining deeper insight in real time. This level of interactivity empowers business users to make informed, data-driven decisions without relying heavily on IT teams or data analysts.

Example:

```python
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.express as px

# (assumes scaled_data is a DataFrame with feature1, feature2, and target columns)
app = dash.Dash(__name__)

# Define the layout of the web app
app.layout = html.Div([
    dcc.Graph(id='scatter-plot'),
    dcc.Dropdown(
        id='x-axis',
        options=[
            {'label': 'Feature 1', 'value': 'feature1'},
            {'label': 'Feature 2', 'value': 'feature2'}
        ],
        value='feature1'
    )
])

# Define callback to update the scatter plot
@app.callback(
    Output('scatter-plot', 'figure'),
    [Input('x-axis', 'value')]
)
def update_scatter_plot(selected_feature):
    fig = px.scatter(data_frame=scaled_data, x=selected_feature, y='target', title='Scatter Plot')
    fig.update_traces(marker=dict(size=5))
    return fig

if __name__ == '__main__':
    app.run_server(debug=True)
```

From this web application, users can interactively explore the data.
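As a loose, hedged illustration of the "visualization suggestions" idea mentioned above, where AI proposes a chart type from the characteristics of the data, here is a toy heuristic in Python; it is only a stand-in for a real AI-driven assistant, and the column names are invented for the example.

```python
# Toy stand-in for the "visualization suggestions" step: inspect column types
# and propose a chart. A real AI-driven assistant would be far richer.
import pandas as pd

def suggest_chart(df: pd.DataFrame, x: str, y: str) -> str:
    numeric = df.select_dtypes(include="number").columns
    if pd.api.types.is_datetime64_any_dtype(df[x]) and y in numeric:
        return "line chart"        # time series
    if x in numeric and y in numeric:
        return "scatter plot"      # two continuous variables
    if x not in numeric and y in numeric:
        return "bar chart"         # category vs. measure
    return "table"                 # fall back to a plain table

df = pd.DataFrame({
    "region": ["NA", "EU", "APAC"],
    "revenue": [1200, 950, 780],
    "orders": [310, 280, 240],
})
print(suggest_chart(df, "region", "revenue"))  # -> bar chart
print(suggest_chart(df, "orders", "revenue"))  # -> scatter plot
```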
Benefits of AI and Snowflake for enterprises

- Faster decision-making: Automating code generation and data preprocessing enables faster decision-making, and real-time interactive exploration reduces the time it takes to derive specific insights from data.
- Democratized data access: AI-generated visualizations help non-technical users explore data, democratizing access to insights and reducing the bottleneck on data science teams and data analysts.
- Enhanced predictive capabilities: AI-powered predictive analytics within Snowflake helps uncover hidden patterns and trends, enabling enterprises to stay ahead of the competition and make proactive decisions.
- Cost efficiency and scalability: AI-driven automation and Snowflake's scalability ensure that a business can handle large datasets without breaking the bank.

Conclusion

In summary, the combination of the Snowflake Data Cloud and AI-powered data visualization is a game changer for enterprises looking to gain insights from their data. By automating code creation, simplifying data integration, and facilitating exploration, this collaboration empowers companies to make informed, data-driven decisions. As the field of data analytics progresses, it will be crucial for organizations to embrace these technologies to remain competitive and unlock the potential of their data. With Snowflake and AI working together, exploring data evolves from complicated and time-consuming to interactive, enlightening, and accessible for everyone. Ultimately, this transformation revolutionizes how enterprises harness the power of their data.

Author Bio

Shankar Narayanan (aka Shanky) has worked on numerous cloud and emerging technologies such as Azure, AWS, Google Cloud, IoT, Industry 4.0, and DevOps, to name a few. He has led architecture design and implementation for many enterprise customers and helped them break the barrier and take the first step towards a long and successful cloud journey. He was one of the early adopters of Microsoft Azure and Snowflake Data Cloud. Shanky likes to contribute back to the community: he contributes to open source, is a frequently sought-after speaker, and has delivered numerous talks on Microsoft technologies and Snowflake. He is recognized as a Data Superhero by Snowflake and an SAP Community Topic Leader by SAP.


Pinecone 101: Anything You Need to Know

Louis Owen
16 Oct 2023
10 min read
Introduction

The ability to harness vast amounts of data is essential for building an AI-based application. Whether you're building a search engine, creating tabular question-answering systems, or developing a frequently asked questions (FAQ) search, you need efficient tools to manage and retrieve information. Vector databases have emerged as an invaluable resource for these tasks. In this article, we'll delve into the world of vector databases, focusing on Pinecone, a cloud-based option with high performance. We'll also discuss other noteworthy choices like Milvus and Qdrant. So buckle up as we explore the world of vector databases and their applications.

Vector databases serve two primary purposes: facilitating semantic search and acting as long-term memory for Large Language Models (LLMs). Semantic search is widely employed across applications, including search engines, tabular question answering, and FAQ-based question answering. This search method relies on embedding-based similarity search: the core idea is to find the embedding from the source documents that is closest to the embedding of the query. Essentially, it's about matching the question to the answer by comparing their embeddings.

Vector databases are pivotal in the context of LLMs, particularly in architectures like Retrieval-Augmented Generation (RAG). In RAG, a knowledge base is essential, and vector databases step in to store all the sources of truth. They convert this information into embeddings and perform similarity searches to retrieve the most relevant documents. In a nutshell, vector databases become the knowledge base for your LLM, acting as an external long-term memory.

Now that we've established the importance of vector databases, let's explore your options. The market offers a variety of vector databases to suit different use cases. Among the prominent contenders, three stand out: Milvus, Pinecone, and Qdrant.

Milvus and Pinecone are renowned for their exceptional performance. Milvus, based on the Faiss library, offers a highly optimized vector similarity search; it's a powerhouse for demanding applications. Pinecone, on the other hand, is a cloud-based vector database designed for real-time similarity searches. Both options excel in speed and reliability, making them ideal for intensive use cases.

If you're on the lookout for a free and open-source vector storage database, Qdrant is a compelling choice. However, it's not as fast or scalable as Milvus and Pinecone. Qdrant is a valuable option when you need an economical solution without sacrificing core functionality. You can check another article about Qdrant here.

Scalability is a crucial factor when considering vector databases, especially for large-scale applications. Milvus stands out for its ability to scale horizontally to handle billions of vectors and thousands of queries per second. Pinecone, being cloud-based, automatically scales as your needs grow. Qdrant, as mentioned earlier, may not be the go-to option for extreme scalability.

In this article, we'll dive deeper into Pinecone. We'll discuss everything you need to know about Pinecone, from its architecture and how to set it up to its pricing and several examples of how to use it.
Without wasting any more time, let's take a deep breath, make ourselves comfortable, and be ready to learn all you need to know about Pinecone!

Getting to Know Pinecone

Indexes are at the core of Pinecone's functionality. They serve as the repositories for your vector embeddings and metadata. Each project can have one or more indexes. The structure of an index is highly flexible and can be tailored to your specific use case and resource requirements.

Pods are the units of cloud resources that provide storage and compute for each index. They are equipped with vCPU, RAM, and disk space to handle the processing and storage needs of your data. The choice of pod type is crucial, as it directly impacts the performance and scalability of your Pinecone index.

When selecting the appropriate pod type, align your choice with your specific use case. Pinecone offers various pod types designed to cater to different resource needs; whether you require substantial storage capacity, high computational power, or a balance between the two, there is a pod type that suits your requirements.

As your data grows, you have the flexibility to scale your storage capacity. This can be achieved by increasing the size and number of pods allocated to your index, ensuring you have the resources to manage and access your expanding dataset efficiently.

In addition to scaling storage, Pinecone also lets you control throughput. You can fine-tune the performance of your index by adding replicas. These replicas help distribute the workload and handle increased query traffic, providing a seamless and responsive experience for your end users.

Setting Up Pinecone

Pinecone is not only available in Python; there are also TypeScript/Node clients. In this article, we'll focus only on the Python client. Setting up Pinecone in Python is very straightforward; we just need to install it via pip:

```bash
pip3 install pinecone-client
```

Once it's installed, we can directly exploit the power of Pinecone. However, remember that Pinecone is a commercial product, so we need to supply our API key when using it.

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY",
              environment="us-west1-gcp")
```

Get Started with Pinecone

First things first, we need to create an index in Pinecone to be able to exploit the power of the vector database. Remember that a vector database basically stores embeddings, or vectors, inside it; thus, configuring the dimension of the vectors and the distance metric to be used is important when creating an index. The command below creates an index named "hello_pinecone" for approximate nearest-neighbor search using the cosine distance metric over 10-dimensional vectors. Creating an index usually takes around 60 seconds.

```python
pinecone.create_index("hello_pinecone", dimension=10, metric="cosine")
```

Once the index is created, we can get all the information about it by calling the `.describe_index()` method. This includes configuration information and the deployment status of the index. The operation requires the name of the index as a parameter, and the response includes details about the database and its status.

```python
index_description = pinecone.describe_index("hello_pinecone")
```

You can also check which indexes have been created by calling the `.list_indexes()` method.

```python
active_indexes = pinecone.list_indexes()
```

Creating an index is just the first step before you can insert or query the data.
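Before moving on, note that the pod sizing and replica scaling described earlier can also be expressed at index-creation time. A hedged sketch, assuming the classic pinecone-client 2.x API used throughout this article; the pod type, counts, and index name are examples only:

```python
# Hedged sketch, assuming the classic pinecone-client 2.x API used in this article:
# pod type, pod count, and replicas can be set when creating an index, and replicas
# can be increased later to scale query throughput (paid plans only).
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

pinecone.create_index(
    "hello-pinecone-pods",
    dimension=10,
    metric="cosine",
    pod_type="p1.x1",  # storage/compute profile of each pod
    pods=1,            # number of pods backing the index
    replicas=1,        # copies of each pod, for query throughput
)

# Later, scale read throughput by adding replicas.
pinecone.configure_index("hello-pinecone-pods", replicas=2)
```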
The next step is to create a client instance that targets the index you just created.

```python
index = pinecone.Index("hello_pinecone")
```

Once the client instance is created, we can start inserting any relevant data that we want to store. To ingest vectors into your index, use the "upsert" operation, which inserts new records into the index or updates existing records if a record with the same ID is already present. Below are the commands to upsert five 10-dimensional vectors into the index.

```python
index.upsert([
    ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]),
    ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]),
    ("C", [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]),
    ("D", [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]),
    ("E", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
])
```

We can also check the statistics of our index by running the following command.

```python
index.describe_index_stats()
# Returns:
# {'dimension': 10, 'index_fullness': 0.0, 'namespaces': {'': {'vector_count': 5}}}
```

Now we can start interacting with our index. Let's perform a query operation, giving a 10-dimensional vector as the query and returning the top-3 most similar vectors from the index. Note that Pinecone judges the similarity between the query vector and each vector in the index using the similarity metric provided during index creation, in this case the cosine metric.

```python
index.query(
    vector=[0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6],
    top_k=3,
    include_values=True
)
# Returns:
# {'matches': [{'id': 'E',
#               'score': 0.0,
#               'values': [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]},
#              {'id': 'D',
#               'score': 0.0799999237,
#               'values': [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]},
#              {'id': 'C',
#               'score': 0.0800000429,
#               'values': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]}],
#  'namespace': ''}
```

Finally, if you are done using the index, you can delete it by running the following command.

```python
pinecone.delete_index("hello_pinecone")
```
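The examples above use toy 10-dimensional vectors. As a hedged, end-to-end sketch of the semantic-search use case discussed at the start of this article, real text embeddings could be stored instead; this assumes the classic pinecone-client 2.x API and the sentence-transformers package, and the model name, index name, and documents are examples only.

```python
# Hedged sketch: semantic search over real text embeddings instead of toy vectors.
# Assumes the classic pinecone-client 2.x API and sentence-transformers; names are examples.
import pinecone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
pinecone.create_index("semantic-demo", dimension=384, metric="cosine")
index = pinecone.Index("semantic-demo")  # wait until the index is ready before upserting

docs = {
    "doc1": "Snowflake separates storage and compute.",
    "doc2": "Pinecone performs vector similarity search.",
    "doc3": "LangChain chains LLM calls together.",
}

# Upsert (id, vector, metadata) triples so the original text comes back with each match.
index.upsert([
    (doc_id, model.encode(text).tolist(), {"text": text})
    for doc_id, text in docs.items()
])

# Embed the question and retrieve the closest document.
query_vec = model.encode("Which tool searches vectors by similarity?").tolist()
result = index.query(vector=query_vec, top_k=1, include_metadata=True)
print(result)  # the top match should carry {"text": "Pinecone performs vector similarity search."}
```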
Pinecone Pricing

Pinecone offers both paid plans and a free tier. The free tier is an excellent starting point if you want to dip your toes into Pinecone's capabilities without an immediate financial commitment, but it comes with certain limitations: users are restricted to one index and one project.

For users opting for the paid plans, hourly billing is an important aspect to consider. The bill is determined by the per-hour price of a pod multiplied by the number of pods that your index uses. Essentially, you are charged based on the resources your index consumes, and this billing structure ensures that you only pay for what you use.

It's important to note that, regardless of the activity on your indexes, you will be sent an invoice at the end of the month. This invoice is generated based on the total minutes your indexes have been running. Pinecone's billing approach is transparent and aligns with your actual resource consumption.

Conclusion

Congratulations on making it this far! Throughout this article, you have learned all you need to know about Pinecone, from its architecture and how to set it up to its pricing and several examples of how to use it. I wish you the best in your experiments creating your own vector database with Pinecone, and see you in the next article!

Author Bio

Louis Owen is a data scientist/AI engineer from Indonesia who is always hungry for new knowledge. Throughout his career journey, he has worked in various industries, including NGOs, e-commerce, conversational AI, OTA, Smart City, and FinTech. Outside of work, he loves to spend his time helping data science enthusiasts become data scientists, either through his articles or through mentoring sessions. He also loves to spend his spare time on his hobbies: watching movies and working on side projects.

Currently, Louis is an NLP Research Engineer at Yellow.ai, the world's leading CX automation platform. Check out Louis' website to learn more about him! Lastly, if you have any queries or any topics to be discussed, please reach out to Louis via LinkedIn.


Build your First RAG with Qdrant

Louis Owen
12 Oct 2023
10 min read
Introduction

Large Language Models (LLMs) have emerged as powerful tools for various tasks, including question-answering. However, as many are now aware, LLMs alone may not be suitable for the task of question-answering, primarily due to their limited access to up-to-date information, often resulting in incorrect or hallucinated responses. To overcome this limitation, one approach involves providing these models with verified facts and data. In this article, we'll explore a solution to this challenge and delve into the scalability aspect of improving question-answering using Qdrant, a vector similarity search engine and vector database.

To address the limitations of LLMs, one approach is to provide known facts alongside queries. By doing so, LLMs can utilize the actual, verifiable information and generate more accurate responses. One of the latest breakthroughs in this field is the RAG model, a tripartite approach that seamlessly combines Retrieval, Augmentation, and Generation to enhance the quality and relevance of responses generated by AI systems.

At the core of the RAG model lies the retrieval step. This initial phase involves the model searching external sources to gather relevant information. These sources can span a wide spectrum, encompassing databases, knowledge bases, sets of documents, or even search engine results. The primary objective here is to find valuable snippets or passages of text that contain information related to the given input or prompt.

The retrieval process is a vital foundation upon which RAG's capabilities are built. It allows the model to extend its knowledge beyond what is hardcoded or pre-trained, tapping into a vast reservoir of real-time or context-specific information. By accessing external sources, the model ensures that it remains up-to-date and informed, a critical aspect in a world where information changes rapidly.

Once the retrieval step is complete, the RAG model takes a critical leap forward by moving to the augmentation phase. During this step, the retrieved information is seamlessly integrated with the original input or prompt. This fusion of external knowledge with the initial context enriches the pool of information available to the model for generating responses.

Augmentation plays a pivotal role in enhancing the quality and depth of the generated responses. By incorporating external knowledge, the model becomes capable of providing more informed and accurate answers. This augmentation also aids in making the model's responses more contextually appropriate and relevant, as it now possesses a broader understanding of the topic at hand.

The final step in the RAG model's process is the generation phase. Armed with both the retrieved external information and the original input, the model sets out to craft a response that is not only accurate but also contextually rich. This last step ensures that the model can produce responses that are deeply rooted in the information it has acquired.

By drawing on this additional context, the model can generate responses that are more contextually appropriate and relevant. This is a significant departure from traditional AI models that rely solely on pre-trained data and fixed knowledge. The generation phase of RAG represents a crucial advance in AI capabilities, resulting in more informed and human-like responses.
The generation phase of RAG represents a crucial advance in AI capabilities, resulting in more informed and human-like responses.To summarize, RAG can be utilized for the question-answering task by following the multi-step pipeline that starts with a set of documentation. These documents are converted into embeddings, essentially numerical representations, and then subjected to similarity search when a query is presented. The top N most similar document embeddings are retrieved, and the corresponding documents are selected. These documents, along with the query, are then passed to the LLM, which generates a comprehensive answer.This approach improves the quality of question-answering but depends on two crucial variables: the quality of embeddings and the quality of the LLM itself. In this article, our focus will be on the former - enhancing the scalability of the embedding search process, with Qdrant.Qdrant, pronounced "quadrant," is a vector similarity search engine and vector database designed to address these challenges. It provides a production-ready service with a user-friendly API for storing, searching, and managing vectors. However, what sets Qdrant apart is its enhanced filtering support, making it a versatile tool for neural-network or semantic-based matching, faceted search, and various other applications. It is built using Rust, a programming language known for its speed and reliability even under high loads, making it an ideal choice for demanding applications. The benchmarks speak for themselves, showcasing Qdrant's impressive performance.In the quest for improving the accuracy and scalability of question-answering systems, Qdrant stands out as a valuable ally. Its capabilities in vector similarity search, coupled with the power of Rust, make it a formidable tool for any application that demands efficient and accurate search operations. Without wasting any more time, let’s take a deep breath, make yourselves comfortable, and be ready to learn how to build your first RAG with Qdrant!Setting Up QdrantTo get started with Qdrant, you have several installation options, each tailored to different preferences and use cases. In this guide, we'll explore the various installation methods, including Docker, building from source, the Python client, and deploying on Kubernetes.Docker InstallationDocker is known for its simplicity and ease of use when it comes to deploying software, and Qdrant is no exception. Here's how you can get Qdrant up and running using Docker:1. First, ensure that the Docker daemon is installed and running on your system. You can verify this with the following command:sudo docker infoIf the Docker daemon is not listed, start it to proceed. On Linux, running Docker commands typically requires sudo privileges. To run Docker commands without sudo, you can create a Docker group and add your users to it.2. Pull the Qdrant Docker image from DockerHub:docker pull qdrant/qdrant3. Run the container, exposing port 6333 and specifying a directory for data storage:docker run -p 6333:6333 -v $(pwd)/path/to/data:/qdrant/storage qdrant/qdrantBuilding from SourceBuilding Qdrant from source is an option if you have specific requirements or prefer not to use Docker. Here's how to build Qdrant using Cargo, the Rust package manager:Before compiling, make sure you have the necessary libraries and the Rust toolchain installed. 
The current list of required libraries can be found in the Dockerfile. Build Qdrant with Cargo:

cargo build --release --bin qdrant

After a successful build, you can find the binary at ./target/release/qdrant.

Python Client
In addition to the Qdrant service itself, there is a Python client that provides additional features compared to clients generated directly from OpenAPI. To install the Python client, you can use pip:

pip install qdrant-client

This client allows you to interact with Qdrant from your Python applications, enabling seamless integration and control.

Kubernetes Deployment
If you prefer to run Qdrant in a Kubernetes cluster, you can utilize a ready-made Helm Chart. Here's how you can deploy Qdrant using Helm:

helm repo add qdrant https://qdrant.to/helm
helm install qdrant-release qdrant/qdrant

Building RAG with Qdrant and LangChain
Qdrant works seamlessly with LangChain; in fact, you can use Qdrant directly in LangChain through the `VectorDBQA` class! The first thing we need to do is gather all the documents that we want to use as the source of truth for our LLM. Let's say we store them in a list variable named `docs`. This `docs` variable is a list of strings, where each element of the list consists of a chunk of paragraphs.

The next thing we need to do is generate the embeddings from the docs. For the sake of an example, we'll use a small model provided by the `sentence-transformers` package.

from langchain.vectorstores import Qdrant
from langchain.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
qdrant_vec_store = Qdrant.from_texts(docs, embedding_model, host=QDRANT_HOST, api_key=QDRANT_API_KEY)

Once we have set up the embedding model and Qdrant, we can move to the next part of RAG, which is augmentation and generation. To do that, we'll utilize the `VectorDBQA` class. This class loads relevant docs from Qdrant and then passes them into the LLM. Once the docs are passed, or augmented, the LLM does its job of analyzing them to generate the answer to the given query. In this example, we'll use the GPT-3.5-turbo model provided by OpenAI.

from langchain import OpenAI, VectorDBQA

llm = OpenAI(openai_api_key=OPENAI_API_KEY)
rag = VectorDBQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    vectorstore=qdrant_vec_store,
    return_source_documents=False)

The final thing to do is to test the pipeline by passing a query to the `rag` variable, and LangChain, backed by Qdrant, will handle the rest!

rag.run(question)

Below are some examples of the answers generated by the LLM based on the provided documents using the Natural Questions datasets.

Conclusion
Congratulations on keeping up to this point! Throughout this article, you have learned what RAG is, how it can improve the quality of your question-answering model, how to scale the embedding search part of the pipeline with Qdrant, and how to build your first RAG with Qdrant and LangChain. Hope the best for your experiment in creating your first RAG and see you in the next article!

Author Bio
Louis Owen is a data scientist/AI engineer from Indonesia who is always hungry for new knowledge. Throughout his career journey, he has worked in various fields of industry, including NGOs, e-commerce, conversational AI, OTA, Smart City, and FinTech.
Outside of work, he loves to spend his time helping data science enthusiasts become data scientists, either through his articles or through mentoring sessions. He also loves to spend his spare time on his hobbies: watching movies and conducting side projects.

Currently, Louis is an NLP Research Engineer at Yellow.ai, the world's leading CX automation platform. Check out Louis' website to learn more about him! Lastly, if you have any queries or any topics to be discussed, please reach out to Louis via LinkedIn.
article-image-creating-openai-and-azure-openai-functions-in-power-bi-dataflows
Greg Beaumont
09 Oct 2023
7 min read
Save for later

Creating OpenAI and Azure OpenAI functions in Power BI dataflows

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!This article is an excerpt from the book, Power BI Machine Learning and OpenAI, by Greg Beaumont. Master core data architecture design concepts and Azure Data & AI services to gain a cloud data and AI architect’s perspective to developing end-to-end solutions IntroductionAs noted earlier, integrating OpenAI and Azure OpenAI with Power Query or dataflows currently requires custom M code. To facilitate this process, we have provided M code for both OpenAI and Azure OpenAI, giving you the flexibility to choose which version to use based on your specific needs and requirements.By leveraging this provided M code, you can seamlessly integrate OpenAI or Azure OpenAI with your existing Power BI solutions. This will allow you to take advantage of the unique features and capabilities offered by these powerful AI technologies, while also gaining insights and generating new content from your data with ease.OpenAI and Azure OpenAI functionsOpenAI offers a user-friendly API that can be easily accessed and utilized from within Power Query or dataflows in Power BI. For further information regarding the specifics of the API, we refer you to the official OpenAI documentation, available at this link: https://platform.openai.com/ docs/introduction/overview.It is worth noting that optimizing and tuning the OpenAI API will likely be a popular topic in the coming year. Various concepts, including prompt engineering, optimal token usage, fine-tuning, embeddings, plugins, and parameters that modify response creativity (such as temperature and top p), can all be tested and fine-tuned for optimal results.While these topics are complex and may be explored in greater detail in future works, this book will focus primarily on establishing connectivity between OpenAI and Power BI. Specifically, we will explore prompt engineering and token limits, which are key considerations that will be incorporated into the API call to ensure optimal performance:Prompts: Prompt engineering, in basic terms, is the English-language text that will be used to preface every API call. For example, instead of sending [Operator] and [Airplane] as values without context, text was added to the request in the previous chapter such that the API will receive Tell me about the airplane model [Aircraft] operated by [Operator] in three sentences:. The prompt adds context to the values passed to the OpenAI model.Tokens: Words sent to the OpenAI model get broken into chunks called tokens. Per the OpenAI website, a token contains about four English language characters. Reviewing the Remarks column in the Power BI dataset reveals that most entries have up to 2,000 characters. (2000 / 4) = 500, so you will specify 500 as the token limit. Is that the right number? You’d need to do extensive testing to answer that question, which goes beyond the scope of this book.Let’s get started with building your OpenAI and Azure OpenAI API calls for Power BI dataflows!Creating OpenAI and Azure OpenAI functions for Power BI dataflowsYou will create two functions for OpenAI in your dataflow named OpenAI. The only difference between the two will be the token limits. The purpose of having different token limits is primarily cost savings, since larger token limits could potentially run up a bigger bill. Follow these steps to create a new function named OpenAIshort:1.      Select Get data | Blank query.2.  
Paste in the following M code and select Next. Be sure to replace abc123xyz with your OpenAI API key. Here is the code for the function; it can also be found as 01 OpenAIshortFunction.M in the Packt GitHub repository at https://github.com/PacktPublishing/Unleashing-Your-Data-with-Power-BI-Machine-Learning-and-OpenAI/tree/main/Chapter-13:

let
    callOpenAI = (prompt as text) as text =>
    let
        jsonPayload = "{""prompt"": """ & prompt & """, ""max_tokens"": " & Text.From(120) & "}",
        url = "https://api.openai.com/v1/engines/text-davinci-003/completions",
        headers = [#"Content-Type"="application/json", #"Authorization"="Bearer abc123xyz"],
        response = Web.Contents(url, [Headers=headers, Content=Text.ToBinary(jsonPayload)]),
        jsonResponse = Json.Document(response),
        choices = jsonResponse[choices],
        text = choices{0}[text]
    in
        text
in
    callOpenAI

3. Now, you can rename the function OpenAIshort. Right-click on the function in the Queries panel and duplicate it. The new function will have a larger token limit.
4. Rename this new function OpenAIlong.
5. Right-click on OpenAIlong and select Advanced editor.
6. Change the section of code reading Text.From(120) to Text.From(500).
7. Click OK.

Your screen should now look like this:

Figure 13.1 – OpenAI functions added to a Power BI dataflow

These two functions can be used to complete the workshop for the remainder of this chapter. If you'd prefer to use Azure OpenAI, the M code for OpenAIshort would be as follows. Remember to replace PBI_OpenAI_project with your Azure resource name, davinci-PBIML with your deployment name, and abc123xyz with your API key:

let
    callAzureOpenAI = (prompt as text) as text =>
    let
        jsonPayload = "{""prompt"": """ & prompt & """, ""max_tokens"": " & Text.From(120) & "}",
        url = "https://" & "PBI_OpenAI_project" & ".openai.azure.com" & "/openai/deployments/" & "davinci-PBIML" & "/completions?api-version=2022-12-01",
        headers = [#"Content-Type"="application/json", #"api-key"="abc123xyz"],
        response = Web.Contents(url, [Headers=headers, Content=Text.ToBinary(jsonPayload)]),
        jsonResponse = Json.Document(response),
        choices = jsonResponse[choices],
        text = choices{0}[text]
    in
        text
in
    callAzureOpenAI

As with the previous example, changing the token limit from Text.From(120) to Text.From(500) is all you need to do to create an Azure OpenAI function for 500 tokens instead of 120. The M code to create the dataflows for your OpenAI functions can also be found on the Packt GitHub site at this link: https://github.com/PacktPublishing/Unleashing-Your-Data-with-Power-BI-Machine-Learning-and-OpenAI/tree/main/Chapter-13.

Now that you have your OpenAI and Azure OpenAI functions ready to go in a Power BI dataflow, you can test them out on the FAA Wildlife Strike data!

Conclusion
This article has provided valuable insights into integrating OpenAI and Azure OpenAI with Power BI dataflows using custom M code. By offering M code for both OpenAI and Azure OpenAI, it allows users to seamlessly incorporate these powerful AI technologies into their Power BI solutions. The article emphasizes the significance of prompt engineering and token limits in optimizing the OpenAI API. It also provides step-by-step instructions for creating functions with different token limits, enabling cost-effective customization. With these functions in place, users can harness the capabilities of OpenAI and Azure OpenAI within Power BI, enhancing data analysis and content generation.
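As a quick illustration of how these functions might be invoked once they exist in your dataflow (this snippet is not from the book; the source query name and the Aircraft and Operator column names are assumptions based on the prompt-engineering example earlier in this excerpt):

let
    // Assumed: a query named FAA_Wildlife_Strikes with Aircraft and Operator columns
    Source = FAA_Wildlife_Strikes,
    AddedRemarks = Table.AddColumn(
        Source,
        "AI Description",
        each OpenAIshort("Tell me about the airplane model " & [Aircraft] & " operated by " & [Operator] & " in three sentences: ")
    )
in
    AddedRemarks

Swapping OpenAIshort for OpenAIlong (or the Azure variants) changes only the token budget, not the invocation pattern.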
For further details and code references, you can explore the provided GitHub repository. Now, armed with these tools, you are ready to explore the potential of OpenAI and Azure OpenAI in your Power BI data projects.

Author Bio
Greg Beaumont is a Data Architect at Microsoft and an expert in solving complex problems and creating value for customers. With a focus on the healthcare industry, Greg works closely with customers to plan enterprise analytics strategies, evaluate new tools and products, conduct training sessions and hackathons, and architect solutions that improve the quality of care and reduce costs. With years of experience in data architecture and a passion for innovation, Greg has a unique ability to identify and solve complex challenges. He is a trusted advisor to his customers and is always seeking new ways to drive progress and help organizations thrive. For more than 15 years, Greg has worked with healthcare customers who strive to improve patient outcomes and find opportunities for efficiencies. He is a veteran of the Microsoft data speaker network and has worked with hundreds of customers on their data management and analytics strategies.

article-image-the-future-of-data-analysis-with-pandasai
Gabriele Venturi
06 Oct 2023
6 min read
Save for later

The Future of Data Analysis with PandasAI

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!

Introduction
Data analysis often involves complex, tedious coding tasks that make it seem reserved only for experts. But imagine a future where anyone could gain insights through natural conversations - where your data speaks plainly instead of through cryptic tables. PandasAI makes this future a reality. In this comprehensive guide, we'll walk through all aspects of adding conversational capabilities to data analysis workflows using this powerful new library. You'll learn:

● Installing and configuring PandasAI
● Querying data and generating visualizations in plain English
● Connecting to databases, cloud storage, APIs, and more
● Customizing the PandasAI config
● Integrating PandasAI into production workflows
● Use cases across industries like finance, marketing, science, and more

Follow along to master conversational data analysis with PandasAI!

Installation and Configuration
Install PandasAI
Let's start by installing PandasAI using pip or poetry.

To install with pip:

pip install pandasai

Make sure you are using an up-to-date version of pip to avoid any installation issues.

For managing dependencies, we recommend using poetry:

# Install poetry
pip install --user poetry

# Install pandasai
poetry add pandasai

This will install PandasAI and all its dependencies for you.

For advanced usage, install all optional extras:

poetry add pandasai --all-extras

This includes dependencies for additional capabilities you may need later, like connecting to databases, using different NLP models, advanced visualization, etc. With PandasAI installed, we are ready to start importing it and exploring its conversational interface!

Import and Initialize PandasAI
Let's initialize a PandasAI DataFrame from a CSV file:

from pandasai import SmartDataframe

df = SmartDataframe("sales.csv")

This creates a SmartDataframe that wraps the underlying Pandas DataFrame but adds conversational capabilities.

We can customize initialization through configuration options:

from pandasai.llm import OpenAI

llm = OpenAI("<your api key>")
config = {"llm": llm}
df = SmartDataframe("sales.csv", config=config)

This initializes the DataFrame using the OpenAI model.

For easy multi-table analysis, use SmartDatalake:

from pandasai import SmartDatalake

dl = SmartDatalake(["sales.csv", "inventory.csv"])

SmartDatalake lets you converse across multiple related data sources.

We can also connect to live data sources like databases during initialization:

from pandasai.connectors import MySQLConnector

mysql_conn = MySQLConnector(config={
    "host": "localhost",
    "port": 3306,
    "database": "mydb",
    "username": "root",
    "password": "root",
    "table": "loans",
})

df = SmartDataframe(mysql_conn)

This connects to a MySQL database so we can analyze the live data interactively.

Conversational Data Exploration
Ask Questions in Plain English
The most exciting part of PandasAI is exploring data through natural language.
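Before walking through the individual examples below, here is a compact end-to-end sketch that ties the installation and initialization steps together (the CSV file name, its columns, and the API key are placeholders, and the exact shape of the returned answer depends on the query and the model):

from pandasai import SmartDataframe
from pandasai.llm import OpenAI

# Placeholder key and file; sales.csv is assumed to contain columns such as product, revenue, and month
llm = OpenAI("<your api key>")
df = SmartDataframe("sales.csv", config={"llm": llm})

# Ask a question in plain English and print whatever PandasAI returns
answer = df.chat("Which product generated the most revenue last month?")
print(answer)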
Let's go through some examples!

Calculate totals:
df.chat("What is the total revenue for 2022?")  # Prints revenue total

Filter data:
df.chat("Show revenue for electronics category")  # Filters and prints electronics revenue

Aggregate by groups:
df.chat("Break down revenue by product category and segment")  # Prints table with revenue aggregated by category and segment

Visualize data:
df.chat("Plot monthly revenue over time")  # Plots interactive line chart

Ask for insights:
df.chat("Which segment has fastest revenue growth?")  # Prints segments sorted by revenue growth

PandasAI understands the user's questions in plain English and automatically generates relevant answers, tables, and charts. We can ask endless questions and immediately get data-driven insights without writing any SQL queries or analysis code!

Connect to Data Sources
A key strength of PandasAI is its broad range of built-in data connectors. This enables conversational analytics on diverse data sources.

Databases

from pandasai.connectors import PostgreSQLConnector

pg_conn = PostgreSQLConnector(config={
    "host": "localhost",
    "port": 5432,
    "database": "mydb",
    "username": "root",
    "password": "root",
    "table": "payments",
})

df = SmartDataframe(pg_conn)
df.chat("Which products had the most orders last month?")

Finance Data

from pandasai.connectors import YahooFinanceConnector

yf_conn = YahooFinanceConnector("AAPL")
df = SmartDataframe(yf_conn)
df.chat("How did Apple stock perform last quarter?")

The connectors provide out-of-the-box access to data across domains for easy conversational analytics.

Advanced Usage
Customize Configuration
While PandasAI is designed for simplicity, its architecture is customizable and extensible. We can configure aspects like:

Language Model
Use different NLP models:

from pandasai.llm import OpenAI, VertexAI

df = SmartDataframe(data, config={"llm": VertexAI()})

Custom Instructions
Add data preparation logic:

config["custom_instructions"] = """
Prepare data:
- Filter outliers
- Impute missing values
"""

These options provide advanced control for tailored workflows.

Integration into Pipelines
Since PandasAI is built on top of Pandas, it integrates smoothly into data pipelines:

import pandas as pd
from pandasai import SmartDataframe

# Load raw data
data = pd.read_csv("sales.csv")

# Clean data
clean_data = clean_data(data)

# PandasAI for analysis
df = SmartDataframe(clean_data)
df.chat("Which products have trending sales?")

# Further processing
final_data = process_data(df)

PandasAI's conversational interface can power the interactive analysis stage in ETL pipelines.

Use Cases Across Industries
Thanks to its versatile conversational interface, PandasAI can adapt to workflows across multiple industries. Here are a few examples:

Sales Analytics - Analyze sales numbers, find growth opportunities, and predict future performance.
df.chat("How do sales for women's footwear compare to last summer?")

Financial Analysis - Conduct investment research, portfolio optimization, and risk analysis.
df.chat("Which stocks have the highest expected returns given acceptable risk?")

Scientific Research - Explore and analyze the results of experiments and simulations.
df.chat("Compare the effects of the three drug doses on tumor size.")

Marketing Analytics - Measure campaign effectiveness, analyze customer journeys, and optimize spending.
df.chat("Which marketing channels give the highest ROI for millennial customers?")

And many more!
PandasAI fits into any field that leverages data analysis, unlocking the power of conversational analytics for all.ConclusionThis guide covered a comprehensive overview of PandasAI's capabilities for effortless conversational data analysis. We walked through:● Installation and configuration● Asking questions in plain English● Connecting to databases, cloud storage, APIs● Customizing NLP and visualization● Integration into production pipelinesPandasAI makes data analysis intuitive and accessible to all. By providing a natural language interface, it opens up insights from data to a broad range of users.Start adding a conversational layer to your workflows with PandasAI today! Democratize data science and transform how your business extracts value from data through the power of AI.Author BioGabriele Venturi is a software engineer and entrepreneur who started coding at the young age of 12. Since then, he has launched several projects across gaming, travel, finance, and other spaces - contributing his technical skills to various startups across Europe over the past decade.Gabriele's true passion lies in leveraging AI advancements to simplify data analysis. This mission led him to create PandasAI, released open source in April 2023. PandasAI integrates large language models into the popular Python data analysis library Pandas. This enables an intuitive conversational interface for exploring data through natural language queries.By open-sourcing PandasAI, Gabriele aims to share the power of AI with the community and push boundaries in conversational data analytics. He actively contributes as an open-source developer dedicated to advancing what's possible with generative AI.

article-image-getting-started-with-langchain
Sangita Mahala
27 Sep 2023
7 min read
Save for later

Getting Started with LangChain

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights and books. Don't miss out – sign up today!IntroductionLangChain was launched in October 2022 as an open-source project by Harrison Chase. It is a Python framework that makes it easy to work with large language models (LLMs) such as the OpenAI GPT-3 language model. LangChain provides an easy-to-use API that makes it simple to interact with LLMs. You can use the API to generate text, translate languages, and answer the questions.Why to use LangChain?As we know, LangChain is a powerful tool that can be used to build a wide variety of applications and improve the productivity and quality of tasks. There are many reasons to use LangChain , including :Simplicity: LangChain provides a simple and easy interface for interacting with GPT-3. You don't need to worry about the details of the OpenAI API.Flexibility: LangChain allows you to customize the way you interact with GPT-3. You can use LangChain to build your own custom applications.Reduced costs: LangChain can help you to reduce costs by eliminating the need to hire human experts to perform LLM-related tasks.Increased productivity: LangChain can help you to increase your productivity by making it easy to generate high-quality text, translate languages, write creative content, and answer questions in an informative way.Getting Started with LangChain LLMIn order to completely understand LangChain and how to apply it in a practical use-case situation. Firstly, you have to set up the development environment.InstallationTo get started with LangChain, you have to:Step-1: Install the LangChain Python library:pip install langchain Step-2: Install the the openai package:pip install openaiStep-3: Obtain an OpenAI API key:In order to be able to use OpenAI’s models through LangChain you need to fetch an API key from OpenAI as well. So you have to follow these steps:Go to the OpenAI website by clicking this link: https://platform.openai.com/ Go to the top right corner of your screen and then click on the “Sign up” or “Sign in” if you already have an account. After signing in, you’ll be directed to the OpenAI Dashboard.Now navigate to the right corner of your OpenAI dashboard and click on the Personal button and then click on the “View API keys” section. Once you click “View API keys”, you will be redirected to the API keys section page. Then click on “+ Create new secret key”.Now provide a name for creating a secret key. For example : LangChain Once you click the create secret key button you will redirected to the secret key prompt then copy the API key and click done.The API key should look like a long alphanumeric string (for example: “sk-12345abcdeABCDEfghijKLMNZC”).Note- Please save this secret key safe and accessible. For security reasons, you won’t be able to view it again through your OpenAI account. 
If you lose this secret key, you’ll need to generate a new one.Step-4After getting the API key, you should execute the following command to add it as an environment variable: export OPENAI_API_KEY="..."If you'd prefer not to set an environment variable you can pass the key in directly via the openai_api_key named parameter when initiating the OpenAI LLM class:from langchain.llms import OpenAI llm = OpenAI(openai_api_key="...")For Example:Here are some of the best hands-on examples of LangChain applications:Content generationLangChain can also be used to generate text content, such as blog posts, marketing materials, and code. This can help businesses to save time and produce high-quality content.Output:Oh, feathered friend, so free and light, You dance across the azure sky, A symphony of colors bright, A song of joy that never dies. Your wings outstretched, you soar above, A glimpse of heaven from on high, Your spirit wild, your spirit love, A symbol of the endless sky.Translating LanguagesLangChain can also be used to translate languages accurately and efficiently. This can make it easier for people to interact with people around the world and for businesses to function in different nations.Example:Output:Question answeringLangChain can also be used to build question answering systems that can provide comprehensive and informative answers to users' questions. Question answering can be used for educational, research, and customer support tools.Example:Output:Check out LangChain’s official documentation to explore various toolkits available and to get access to their free guides and example use cases.How LangChain can be used to build the future of AIThere are several ways that LangChain can be utilized to build the AI of the future.Creating LLMs that are more effective and accurate By giving LLMs access to more information and resources, LangChain can help them perform better. LangChain, for example, can be used to link LLMs to knowledge databases or to other LLMs. LLMs can provide us with a better understanding of the world as a result, and their replies may be more accurate and insightful.Making LLMs more accessibleRegardless of a user's level of technical proficiency, LangChain makes using LLMs simpler. This may provide more equitable access to LLMs and enable individuals to use them to develop new, cutting-edge applications. For example, LangChain may be used to create web-based or mobile applications that enable users to communicate with LLMs without writing any code.Developing a new LLM applicationIt is simple with LangChain due to its chatbot, content generator, and translation systems. This could accelerate the deployment of LLMs across several businesses. For example, LangChain may be utilized for building chatbots that can assist doctors in illness diagnosis or to generate content-generating systems that can assist companies in developing personalized marketing materials.ConclusionIn this article, we've explored LangChain's main capabilities, given some interesting examples of its uses, and provided a step-by-step guide to help you start your AI adventure. LangChain is not just a tool; it's a gateway to the future of AI.  The adoption of LLMs in a variety of industries is accelerated by making it simpler to design and deploy LLM-powered applications.It will provide lots of advantages, such as higher production, enhanced quality, lower prices, simplicity of use, and flexibility. 
The ability of LangChain, as an entire system, to revolutionize how we interface with computers makes it a tremendous instrument. It assists in the development of the AI of the future by making it simpler to create and deploy LLM-powered applications. Now, it's your turn to unlock the full potential of AI with LangChain. The future is waiting for you, and it starts with you.Author BioSangita Mahala is a passionate IT professional with an outstanding track record, having an impressive array of certifications, including 12x Microsoft, 11x GCP, 2x Oracle, and LinkedIn Marketing Insider Certified. She is a Google Crowdsource Influencer and IBM champion learner gold. She also possesses extensive experience as a technical content writer and accomplished book blogger. She is always Committed to staying with emerging trends and technologies in the IT sector.

article-image-personalization-at-scale-using-snowflake-and-generative-ai
Shankar Narayanan
27 Sep 2023
9 min read
Save for later

Personalization at Scale: Using Snowflake and Generative AI

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights and books. Don't miss out – sign up today!IntroductionImagine your customers still picking your competitors even after your incredible offerings. It sounds strange.We live in an era where customers want tailored offerings or solutions from businesses. Anticipating and meeting customer needs is no longer enough. Companies must exceed their expectations and create more authentic and long-term customer interactions. At every touchpoint of the customer journey, whether online or on your website, they want a tailored experience.However, exceeding these expectations can be daunting. So, how should businesses do it? Organizations need robust data management solutions and cutting-edge technologies to meet customer needs and offer tailored experiences. One such powerful combination is the use of Snowflake Data Cloud for Generative AI, which allows businesses to craft tailored customer experiences at scale.In this blog, we'll explore how Snowflake's Data Cloud and Generative AI can help achieve this and the importance of Snowflake Industry Cloud and Marketplace Data in this context.Before we hop on to it, let’s understand Generative AI and the importance of personalization.Why is Hyper-Personalization Critical for Businesses?Personalization significantly impacts customer satisfaction, loyalty, and retention rates. Tailored experiences are about more than addressing by their name several times in a message. But it's more about understanding customer needs and preferences and personalizing communication and marketing efforts that lead to higher click-through rates, conversations, and customer satisfaction.Personalization creates a more profound customer connection, drives engagement, and increases conversion rates. However, achieving personalization at scale presents significant challenges, primarily because it relies on lots of data and the ability to process and analyze it quickly.How is Generative AI Powering Personalization?Generative AI is an artificial intelligence technology capable of effectively producing various types of content like text, images, and other media. These generative models usually learn the patterns and structure of their input data and then generate new data or results with similar characteristics.This technology has undoubtedly revolutionized many businesses. And hyper-personalization is one of the reasons behind it. Generative AI doesn't just analyze data but also has the potential to achieve unprecedented levels of personalization.In this ever-evolving landscape of businesses, Generative AI models can constantly seek ways to improve customer engagement and sales through personalization. It tailors every aspect of customer experience and leaves room for real-time interactions.Here’s how it can be helpful to businesses in many ways:1. Dynamic content: It can produce different types of content like emailers, newsletters, social media copy, website content, marketing materials, and more.2. Recommendations: These models can understand and analyze customer behavior and preferences to offer personalized recommendations.3. Chatbots and virtual assistants: If businesses want real-time customer assistance, generative AI-powered virtual assistants can come to the rescue.4. 
Pricing strategy: Generative AI also helps optimize pricing strategies for customers by understanding and analyzing their browsing history, purchasing behavior, market pricing, and overall customer journey.5. Natural Language Processing (NLP): NLP models can understand and respond to customer inquiries and feedback in a personalized manner, enhancing customer service.Snowflake Data Cloud: A Game-Changer for Data ManagementSnowflake isn't just another technology company. Snowflake is a cloud-based data platform that has revolutionized how organizations manage and utilize their data. It offers several key advantages that are critical for personalization at scale: 1. Data IntegrationSnowflake enables seamless data integration from various sources, including structured and semi-structured data. This data consolidation is crucial for creating a holistic view of customer behavior and preferences. 2. ScalabilitySnowflake's architecture allows for elastic scalability, meaning you can effortlessly handle growing datasets and workloads, making it ideal for personalization efforts that need to accommodate a large user base.3. Data SharingSnowflake's data-sharing capabilities make it easy to collaborate with partners and share data securely, which can be valuable for personalization initiatives involving third-party data.4. SecuritySecurity is paramount when dealing with customer data. Snowflake offers robust security features to protect sensitive information and comply with data privacy regulations.5. Real-time Data ProcessingSnowflake's cloud-native architecture supports real-time data processing, a fundamental requirement for delivering personalized experiences in real-time or near-real-time.Also, to measure the effectiveness of personalization, one can conduct the A/B tests. Let us see an example to understand the same.import numpy as np from scipy import stats # A/B test data (conversion rates) group_a = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0] group_b = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0] # Perform a t-test to compare conversion rates t_stat, p_value = stats.ttest_ind(group_a, group_b) if p_value < 0.05:    print("Personalization is statistically significant.") else:    print("No significant difference observed.") In this way, you can analyze the results of A/B tests. It would help to determine if personalization efforts are statistically significant in improving customer experiences.Snowflake Industry Cloud and Marketplace DataWhile Snowflake's core features make it a powerful platform for data management, its Industry Cloud and Marketplace Data offerings take personalization to the next level:1. Industry CloudSnowflake's Industry Cloud solutions provide industry-specific data models and best practices. This means organizations can quickly adopt personalized solutions tailored to their specific sector, whether healthcare, retail, finance, or any other domain.2. Marketplace DataThe Snowflake Marketplace offers many data sources and tools to augment personalization efforts. This includes third-party data, pre-built machine learning models, and analytics solutions, making enriching customer profiles easier and driving better personalization.Personalization at Scale with Snowflake Data CloudGenerative AI and Snowflake Data Cloud play a pivotal role in revolutionizing businesses. And leveraging the capabilities of the Snowflake cloud data platform, along with Generative AI to scale personalization, can be a game-changer for many industries. Here's how you can accomplish this seamlessly and effectively.3. 
Data Ingestion Snowflake allows you to ingest data from various sources, including your CRM, website, mobile app, and third-party data providers. This data is stored in a central repository, ready for analysis.4. Data Storage Snowflake's data integration capabilities enable you to consolidate this data, creating a comprehensive customer profile that includes historical interactions, purchase history, preferences, and more. It can handle massive amounts of data, so businesses can easily store and collect data per their needs and preferences.Snowflake’s scalability helps one to handle large databases efficiently. In the SQL snippet, let us see how to create a table for storing and loading data from a CSV file.-- Create a table to store customer data CREATE TABLE customers (    customer_id INT,    first_name VARCHAR,    last_name VARCHAR,    email VARCHAR,    purchase_history ARRAY,    last_visit_date DATE ); -- Load customer data into the table COPY INTO customers FROM 's3://your-data-bucket/customer_data.csv' FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);5. Machine Learning and Generative AI With your data in Snowflake, you can leverage Generative AI models to analyze customer behavior and generate personalized content, recommendations, and predictions.We can understand this using a Python code.import openai # Your OpenAI API key api_key = "your_api_key" # Customer behavior data customer_history = "Customer recently purchased a laptop and smartphone." # Generate personalized recommendations response = openai.Completion.create(    engine="text-davinci-002",    prompt=f"Based on customer data: {customer_history}, recommend products: ",    max_tokens=50,    n = 5,  # Number of recommendations    stop=None,    temperature=0.7,    api_key=api_key ) recommendations = [choice["text"] for choice in response["choices"]] print(recommendations) Using the OpenAI GPT-3 model, we can generate personalized product recommendations, considering customers' purchase history.6. Real-time Processing Snowflake's real-time data processing capabilities ensure that these personalized experiences are delivered in real-time or with minimal latency, enhancing customer engagement. Let us see how we can utilize Snowflake and a hypothetical real-time recommendation engine:-- Create a view to fetch real-time recommendations using a stored procedure CREATE OR REPLACE VIEW real_time_recommendations AS SELECT    c.customer_id,    c.first_name,    c.last_name,    r.recommendation_text FROM    customers c JOIN    real_time_recommendations_function(c.customer_id) r ON    c.customer_id = r.customer_id; 7. Iterative Improvement Personalization is an ongoing process. Snowflake's scalability and flexibility allow you to continuously refine your personalization strategies based on customer feedback and changing preferences.ConclusionCompetition is kicking off in every industry. Businesses can't just focus on creating products that solve specific problems. Instead, the focus has shifted to personalization in this competitive landscape. It's a must-have and non-negotiable. Businesses cannot afford to avoid it if they want to create excellent customer experiences.This is where Snowflake Data Cloud comes to the rescue. Leveraging this platform along with Generative AI can empower organizations in many ways and help craft tailored customer experiences at scale. 
If appropriately leveraged, businesses can gain a competitive edge and cater to customers' unique demands by delivering personalized solutions.In today's competitive era, only those businesses that invest in personalized technology and marketing efforts will survive. Thus, it's time to embrace these technologies and use them to your advantage to gain long-term success.Author BioShankar Narayanan (aka Shanky) has worked on numerous different cloud and emerging technologies like Azure, AWS, Google Cloud, IoT, Industry 4.0, and DevOps to name a few. He has led the architecture design and implementation for many Enterprise customers and helped enable them to break the barrier and take the first step towards a long and successful cloud journey. He was one of the early adopters of Microsoft Azure and Snowflake Data Cloud. Shanky likes to contribute back to the community. He contributes to open source is a frequently sought-after speaker and has delivered numerous talks on Microsoft Technologies and Snowflake. He is recognized as a Data Superhero by Snowflake and SAP Community Topic leader by SAP.
article-image-duet-ai-for-google-workspace
Aryan Irani
22 Sep 2023
6 min read
Save for later

Duet AI for Google Workspace

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!IntroductionDuet AI was announced at Google Cloud Next 23 as a powerful AI collaborator that can help you get more done in Google Workspace. It can help you write better emails, sort tables, create presentations, and more. Duet AI is still under development, but it has already learned to perform many kinds of tasks, including:Helping you write better in Google Docs.Generate images for better presentations in Google SlidesOrganizing and analyzing data in Google SheetsThere is so much more that Duet AI provides and Google will be announcing more updates to it. In this blog post, we will be taking a look at these features that Duet AI provides in detail with some interesting examples.Help me write in Google DocsThe help me write feature in Google Docs helps you to write better content, faster. It can help you generate new text, rewrite existing content or even improve your writing style.Generate new text: You can use the Help Me Write feature to generate new text for your document, such as a blog post, social media campaign and more. All you have to do is type in a prompt and it will generate text for you according to your instructions.Rewrite Existing text: You can use the help me write feature to rewrite existing text in the document. For example, you can use it to make your writing more concise, more formal, and creative.Improve your writing style: This allows you to improve your writing style by suggesting edits and improvements you should make. It can even tell you to correct your grammar, improve your sentence structure, and make your writing more engaging.Now that we have understood what the capabilities of the Help Me Write feature in Google Docs is, let's take a look at it in action.On opening the new Google Doc, you can see the Help Me Write feature pops up.On clicking the button, it allows you to enter a prompt that you want. For this example, we are going to tell it to write an advertisement for men’s soap bars.On structuring the prompt, to generate the text just go ahead and click on Create. In just a few seconds you will be able to see that Duet AI has generated a complete new advertisement.Here you can see we have successfully generated an advertisement for the soap bars. On reviewing the advertisement, let’s say you do not like the advertisement and maybe want to refine it and change the tone of it. You can do that by clicking on Refine.On clicking Refine, you will be allowed to choose from a variety of options on how you want to refine the paragraph Duet AI just generated for you. Additionally, you can manually design another prompt for how you want to refine the paragraph by typing it in the custom section.For this example, we are going to move forward and change the tone of the advertisement to Casual.On refining the paragraph, just in a few seconds, we can see that it has given me a new informal version of it. Once you like the paragraph Duet AI has generated for you, go ahead and click on insert, the paragraph will be inserted inside your Google Doc.Here you can see the paragraph has been pasted in the Google Doc and we have now successfully generated a new advertisement using Duet AI.Generate Images in SlidesThere have been so many times I have spent time trying to find the right photo to fit my slide and have been unsuccessful. 
With the new feature that Duet AI provides for Google Slides, I can generate images inside of slides and integrate them at the click of a button.Now that we have understood what the capabilities of this feature are, let’s take a look at it in action.When you open up your Google Slides, you will see something like this called Help me visualize. Once you click on this a new sidebar will open up on the right side of the screen.In this sidebar, you have to enter the prompt for the image you want to generate. Once you enter the prompt you have an option to select a style for the image.Once you select the style of the image, go ahead and click on Create.On clicking Create, in about 15–20 seconds you will see multiple photos generated according to the prompt we entered.Here you can see on successful execution we have been able to generate images inside of your Google Slides.Organizing and analyzing data in Google SheetsWe looked at how we can generate new images in Google Slides followed by the Help Me Write feature in Google Docs. All these features helped us understand the power of Duet AI inside of Google Workspace Tools.The next feature that we will be taking a look at is inside of Google Sheets, which allows us to turn ideas into actions and data into insights.Once you open up your Google Sheet, you will see a sidebar on the right side of the screen saying help me organize.Once you have your Google Sheet ready and the sidebar ready, it's time to enter a prompt for which you want to create a custom template. For this example, I am going to ask it to generate a template for the following prompt. On clicking create, in a few seconds you will see that it has generated some data inside of your Google Sheet.On successful execution, it has generated data according to the prompt we designed. If you are comfortable with this template it has generated go ahead and click on insert.On clicking Insert, the data will be inserted into the Google Sheet and you can start using it like a normal Google Sheet.ConclusionCurrently, all these features are not available for everybody and it is on a waitlist. If you want to grab the power of AI inside of Google Workspace Tools like Google Sheets, Google Docs, Google Slides and more, apply for the waitlist by clicking here.In this blog, we looked at how we can use AI inside of our Google Docs to help us write better. Later, we looked at how we can generate images inside of our Google Slides to make our presentations more engaging, and in the end, we looked at how we can generate templates inside of Google Sheets. I hope you have understood how to get the basics done with Duet AI for Google Workspace.Feel free to reach out if you have any issues/feedback at aryanirani123@gmail.com.Author BioAryan Irani is a Google Developer Expert for Google Workspace. He is a writer and content creator who has been working in the Google Workspace domain for three years. He has extensive experience in the area, having published 100 technical articles on Google Apps Script, Google Workspace Tools, and Google APIs.Website

article-image-ai-distilled-18-oracles-clinical-digital-assistant-google-deepminds-alphamissense-ai-powered-stable-audio-prompt-lifecycle-3d-gaussian-splatting
Merlyn Shelley
21 Sep 2023
12 min read
Save for later

AI_Distilled #18: Oracle’s Clinical Digital Assistant, Google DeepMind's AlphaMissense, AI-Powered Stable Audio, Prompt Lifecycle, 3D Gaussian Splatting

👋 Hello,“A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.” - Alan Turing, Visionary Computer Scientist.This week, we begin by spotlighting Turing's test, a crucial concept in computer science. It sparks discussions about how AI emulates human intelligence, ultimately elevating productivity and creativity. A recent Hardvard study revealed how AI improves worker productivity and reduces task completion time by 25% while also improving quality by 40%. A study with 758 Boston Consulting Group consultants revealed that GPT-4 boosted productivity by 12.2% on tasks it could handle. Welcome to AI_Distilled #18, your ultimate source for everything related to AI, GPT, and LLMs.  In this edition, we’ll talk about OpenAI expanding to EU with Dublin office and key hires, AI-Powered Stable Audio transforming text into high-quality music, a Bain study predicting how generative AI will dominate game development in 5-10 years, and Oracle introducing AI-powered clinical digital assistant for healthcare. A fresh batch of AI secret knowledge and tutorials is here too! Look out for a comprehensive guide to prompt lifecycle, exploring LLM selection and evaluation, a primer on 3D gaussian splatting: rasterization and its future in graphics, and a step-by-step guide to text generation with GPT using Hugging Face transformers library in Python.In addition, we're showcasing an article by our author Ben Auffarth about Langchain, offering a sneak peek into our upcoming virtual conference. Writer’s Credit: Special shout-out to Vidhu Jain for their valuable contribution to this week’s newsletter content!  Cheers,  Merlyn Shelley  Editor-in-Chief, Packt  ⚡ TechWave: AI/GPT News & Analysis OpenAI Expands to EU with Dublin Office and Key Hires: The ChatGPT creator is opening its first European Union office in Dublin, signaling its readiness for upcoming AI regulatory challenges. This move follows OpenAI's announcement of its third office, with locations in San Francisco and London. The expansion into Ireland is strategically significant, as many tech companies choose it as a hub to engage with European regulators and clients while benefiting from favorable tax rates. OpenAI is actively hiring for positions in Dublin, including an associate general counsel, policy and partnerships lead, privacy program manager, software engineer focused on privacy, and a media relations lead. This expansion highlights OpenAI's commitment to addressing privacy concerns, especially in the EU, where ChatGPT faced scrutiny and regulatory actions related to data protection. AI-Powered Stable Audio Transforms Text into High-Quality Music: Stability AI has unveiled Stable Audio, an AI model capable of converting text descriptions into stereo 44.1 kHz music and sound effects. This breakthrough technology raises the potential of AI-generated audio rivaling human-made compositions. Stability AI collaborated with AudioSparx, incorporating over 800,000 audio files and text metadata into the model, enabling it to mimic specific sounds based on text commands. Stable Audio operates efficiently, rendering 95 seconds of 16-bit stereo audio at 44.1 kHz in under a second using Nvidia A100 GPUs. It comes with free and Pro plans, offering users the ability to generate music with varying lengths and quantities, marking a significant advancement in AI-generated audio quality. 
Oracle Introduces AI-Powered Clinical Digital Assistant for Healthcare: Oracle has unveiled its AI-powered Clinical Digital Assistant to enhance electronic health record (EHR) solutions in healthcare. This innovation aims to automate administrative tasks for caregivers, allowing them to focus on patient care. It addresses concerns related to the adoption of generative AI technologies in healthcare. The assistant offers multimodal support, responding to both text and voice commands, streamlining tasks such as accessing patient data and prescriptions. It remains active during appointments, providing relevant information and suggesting actions. Patients can also interact with it for appointment scheduling and medical queries. Oracle plans a full rollout of capabilities over the next year.  Generative AI to Dominate Game Development in 5-10 Years, Says Bain Study: A study by global consulting firm Bain & Company predicts that generative AI will account for more than 50% of game development in the next 5 to 10 years, up from less than 5% currently. The research surveyed 25 gaming executives worldwide, revealing that most believe generative AI will enhance game quality and expedite development, but only 20% think it will reduce costs. Additionally, 60% don't expect generative AI to significantly alleviate the talent shortage in the gaming industry, emphasizing the importance of human creativity. The study highlights that generative AI should complement human creativity rather than replace it.  Google DeepMind's AI Program, AlphaMissense, Predicts Harmful DNA Mutations: Researchers at Google DeepMind have developed AlphaMissense, an artificial intelligence program that can predict whether genetic mutations are harmless or likely to cause diseases, with a focus on missense mutations, where a single letter is misspelled in the DNA code. AlphaMissense assessed 71 million single-letter mutations affecting human proteins, determining 57% were likely harmless, 32% likely harmful, and uncertain about the rest. The program's predictions have been made available to geneticists and clinicians to aid research and diagnosis. AlphaMissense performs better than current programs, potentially helping identify disease-causing mutations and guiding treatment.  📥 Feedback on the Weekly EditionWhat do you think of this issue and our newsletter?Please consider taking the short survey below to share your thoughts and you will get a free PDF of the “The Applied Artificial Intelligence Workshop” eBook upon completion. Complete the Survey. Get a Packt eBook for Free! 🔮 Looking for a New Book from Packt’s Expert Community? Splunk 9.x Enterprise Certified Admin Guide - By Srikanth Yarlagadda If Splunk is a part of your professional toolkit, consider exploring the Splunk 9.x Enterprise Certified Admin Guide. In an era where the IT sector's demand for Splunk expertise is consistently increasing, this resource proves invaluable. It comprehensively addresses essential aspects of Splunk Enterprise, encompassing installation, license management, user and forwarder administration, index creation, configuration file setup, data input handling, field extraction, and beyond. Moreover, the inclusion of self-assessment questions facilitates a thorough understanding, rendering it an indispensable guide for Splunk Enterprise administrators aiming to excel in their field. Interested in getting a sneak peek of Chapter 1 without any commitment? Simply click the button below to access it. Read through the Chapter 1 unlocked here...  
🌟 Secret Knowledge: AI/LLM Resources Understanding the Prompt Lifecycle: A Comprehensive Guide: A step-by-step guide to the prompt lifecycle, which is crucial for effective prompt engineering in AI applications. The guide covers four main stages: Design & Experiment, Differentiate & Personalize, Serve & Operate, and Analyze Feedback & Adapt. In each stage, you'll learn how to design, differentiate, serve, and adapt prompts effectively, along with the specific tools required. Additionally, the post addresses the current state of tooling solutions for prompt lifecycle management and highlights the existing gaps in prompt engineering tooling.  Exploring LLM Selection and Evaluation: A Comprehensive Guide: In this post, you'll discover a comprehensive guide to selecting and evaluating LLMs. The guide delves into the intricate process of choosing the right LLM for your specific task and provides valuable insights into evaluating their performance effectively. By reading this post, you can expect to gain a thorough understanding of the criteria for LLM selection, the importance of evaluation metrics, and practical tips to make informed decisions when working with these powerful language models. A Primer on 3D Gaussian Splatting: Rasterization and Its Future in Graphics: In this post, you'll delve into the world of 3D Gaussian Splatting, a rasterization technique with promising implications for graphics. You'll explore the core concept of 3D Gaussian Splatting, which involves representing scenes using gaussians instead of triangles. The post guides you through the entire process, from Structure from Motion (SfM) to converting points to gaussians and training the model for optimal results. It also touches on the importance of differentiable Gaussian rasterization.  How to Build a Multi-GPU System for Deep Learning in 2023: A Step-by-Step Guide: Learn how to construct a multi-GPU system tailored for deep learning while staying within budget constraints. The guide begins by delving into crucial GPU considerations, emphasizing the importance of VRAM, performance (evaluated via FLOPS and tensor cores), slot width, and power consumption. It offers practical advice on choosing the right GPU for your budget. The post then moves on to selecting a compatible motherboard and CPU, paying special attention to PCIe lanes and slot spacing. The guide also covers RAM, disk space, power supply, and PC case considerations, offering insights into building an efficient multi-GPU system.  ✨ Expert Insights from Packt Community  This week’s featured article is written by Ben Auffarth, the Head of Data Science at loveholidays. LangChain provides an intuitive framework that makes it easier for AI developers, data scientists, and even those new to NLP technology to create applications using LLMs. What can I build with LangChain? LangChain empowers various NLP use cases such as virtual assistants, content generation models for summaries or translations, question answering systems, and more. It has been used to solve a variety of real-world problems.  For example, LangChain has been used to build chatbots, question answering systems, and data analysis tools. It has also been used in a number of different domains, including healthcare, finance, and education. You can build a wide variety of applications with LangChain, including: Chatbots: It can be used to build chatbots that can interact with users in a natural way. 
Question answering: LangChain can be used to build question answering systems that can answer questions about a variety of topics. Data analysis: You can use it for automated data analysis and visualization to extract insights. Code generation: You can set up software pair programming assistants that can help to solve business problems. And much more! This is an excerpt from the Author’s upcoming book Generative AI with LangChain with Packt. If you're intrigued by this, we invite you to join us at our upcoming virtual conference for an in-depth exploration of LangChain and gain a better understanding of how to responsibly apply Large Language Models (LLMs) and move beyond merely producing statistically driven responses. The author will then take you on the practical journey of crafting your own chatbot, akin to the capabilities of ChatGPT. Missed the Early Bird Special offer for the big event? No worries! You can still save 40% by booking your seat now. Reserve your seat at 40%OFF 💡 Masterclass: AI/LLM TutorialsLearn How to Orchestrate Ray-Based ML Workflows with Amazon SageMaker Pipelines: Discover the benefits of combining Ray and Amazon SageMaker for distributed ML in this comprehensive guide. Understand how Ray, an open-source distributed computing framework, simplifies distributed ML tasks, and how SageMaker seamlessly integrates with it. This post provides a step-by-step tutorial on building and deploying a scalable ML workflow using these tools, covering data ingestion, data preprocessing with Ray Dataset, model training, hyperparameter tuning with XGBoost-Ray, and more. You'll also explore how to orchestrate these steps using SageMaker Pipelines, enabling efficient and automated ML workflows. Dive into the detailed code snippets and unleash the potential of your ML projects. Building and Deploying Tool-Using LLM Agents with AWS SageMaker JumpStart Foundation Models: Discover how to create and deploy LLM agents with extended capabilities, including access to external tools and self-directed task execution. This post introduces LLM agents and guides you through building and deploying an e-commerce LLM agent using Amazon SageMaker JumpStart and AWS Lambda. This agent leverages tools to enhance its functionality, such as answering queries about returns and order updates. The architecture involves a Flan-UL2 model deployed as a SageMaker endpoint, data retrieval tools with AWS Lambda, and integration with Amazon Lex for use as a chatbot.  Step-by-Step Guide to Text Generation with GPT using Hugging Face Transformers Library in Python: In this post, you'll learn how to utilize the Hugging Face Transformers library for text generation and natural language processing without the need for OpenAI API keys. The Hugging Face Transformers library offers a range of models, including GPT-2, GPT-3, GPT-4, T5, BERT, and more, each with unique characteristics and use cases. You'll explore how to install the required libraries, choose a pretrained language model, and generate text based on a prompt or context using Python and the Flask framework. This comprehensive guide will enable you to implement text generation applications with ease, making AI-powered interactions accessible to users.  💬 AI_Distilled User Insights Space Would you like to participate in our user feedback interview to shape AI_Distilled's content and address your professional challenges?Share your content requirements and ideas in 15 simple questions. 
Plus, be among the first 25 respondents to receive a free Packt credit for claiming a book of your choice from our vast digital library. Don't miss this chance to improve the newsletter and expand your knowledge. Join us today! Share Your Insights Now! 🚀 HackHub: Trending AI Toolsise-uiuc/Repilot: Patch generation tool designed for Java and based on large language models and code completion engines. turboderp/exllamav2: Early release of an inference library for local LLMs on consumer GPUs, requiring further testing and development.  liuyuan-pal/SyncDreamer: Focuses on creating multiview-consistent images from single-view images. FL33TW00D/whisper-turbo: Fast, cross-platform Whisper implementation running in your browser or electron app offering real-time streaming and privacy. OpenBMB/ChatDev: Virtual software company run by intelligent agents with various roles aiming to revolutionize programming and study collective intelligence. 

Revolutionizing Data Analysis with PandasAI

Rohan Chikorde
18 Sep 2023
7 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights and books. Don't miss out – sign up today!

Introduction

Data analysis plays a crucial role in extracting meaningful insights from raw data, driving informed decision-making in various fields. Python's Pandas library has long been a go-to tool for data manipulation and analysis. Now, imagine enhancing Pandas with the power of Generative AI, enabling data analysis to become conversational and intuitive. Enter PandasAI, a Python library that seamlessly integrates Generative AI capabilities into Pandas, revolutionizing the way we interact with data.

PandasAI is designed to bridge the gap between traditional data analysis workflows and the realm of artificial intelligence. By combining the strengths of Pandas and Generative AI, PandasAI empowers users to engage in natural language conversations with their data. This innovative library brings a new level of interactivity and flexibility to the data analysis process.

With PandasAI, you can effortlessly pose questions to your dataset using human-like language, transforming complex queries into simple conversational statements. The library leverages machine learning models to interpret and understand these queries, intelligently extracting the desired insights from the data. This conversational approach eliminates the need for complex syntax and allows users, regardless of their technical background, to interact with data in a more intuitive and user-friendly way.

Under the hood, PandasAI combines the power of natural language processing (NLP) and machine learning techniques. By leveraging pre-trained models, it infers user intent, identifies relevant data patterns, and generates insightful responses. Furthermore, PandasAI supports a wide range of data analysis operations, including data cleaning, aggregation, visualization, and more. It seamlessly integrates with existing Pandas workflows, making it a versatile and valuable addition to any data scientist or analyst's toolkit.

In this comprehensive blog post, we will first cover how to install and configure PandasAI, followed by detailed usage examples to demonstrate its capabilities.

Installing and Configuring PandasAI

PandasAI can be easily installed using pip, Python's package manager:

pip install pandasai

This will download and install the latest version of the PandasAI package along with any required dependencies.

Next, you need to configure credentials for the AI engine that will power PandasAI's NLP capabilities:

from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

openai_api_key = "sk-..."
llm = OpenAI(api_token=openai_api_key)
ai = PandasAI(llm)

PandasAI offers detailed documentation on how to get API keys for services like OpenAI and Anthropic.

Once configured, PandasAI is ready to supercharge your data tasks through the power of language. Let's now see it in action through some examples.

Intuitive Data Exploration Using Natural Language

A key strength of PandasAI is enabling intuitive data exploration using plain English. 
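To appreciate what this saves you, here is a minimal, purely illustrative sketch of the manual pandas code that a single conversational question replaces. The DataFrame and column names are hypothetical and simply mirror the sample data introduced next:

import pandas as pd

# Hypothetical sales data, mirroring the sample introduced below
data = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Sales': [100, 200, 50],
    'Region': ['East', 'West', 'West']
})

# Manual equivalent of asking "Which region had the highest sales?"
sales_by_region = data.groupby('Region')['Sales'].sum()
print(sales_by_region.idxmax())

With PandasAI, this aggregation-and-lookup logic is inferred from the question itself, as the following example shows.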
Consider this sample data:

import pandas as pd

data = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Sales': [100, 200, 50],
    'Region': ['East', 'West', 'West']
})
ai.init(data)

You can now ask questions about this data conversationally:

ai.run("Which region had the highest sales?")
ai.run("Plot sales by product as a bar chart ordered by sales")

PandasAI will automatically generate relevant summaries, plots, and insights from the data based on the natural language prompts.

Automating Complex Multi-Step Data Pipelines

PandasAI also excels at automating relatively complex multi-step analytical data workflows:

ai.run("""
    Load sales and inventory data
    Join tables on product_id
    Impute missing values
    Remove outliers
    Calculate inventory turnover ratio
    Segment products into ABC categories
""")

This eliminates tedious manual coding effort with Pandas.

Unified Analysis across Multiple Datasets

For real-world analysis, PandasAI can work seamlessly across multiple datasets:

sales = pd.read_csv("sales.csv")
product = pd.read_csv("product.csv")
customer = pd.read_csv("customer.csv")

ai.add_frames(sales, product, customer)
ai.run("Join the datasets. Show average order size by customer city.")

This enables deriving unified insights across disconnected data sources.

Building Data-Driven Analytics Applications

Beyond exploration, PandasAI can power analytics apps via Python integration. For instance:

region = input("Enter region: ")
ai.run(f"Compare {region} sales to national average")

This allows creating customizable analytics tools for business users tailored to their needs. PandasAI can also enable production apps using Streamlit for the UI:

import streamlit as st
from pandasai import PandasAI

region = st.text_input("Enter region:")
… … …
if region:
    insight = ai.run(f"Analyze {region} sales")
    st.write(insight)

Democratizing Data-Driven Decisions

A key promise of PandasAI is democratizing data analysis by removing coding complexity. This allows non-technical users to independently extract insights through natural language.

Data-driven decisions can become decentralized rather than relying on centralized analytics teams. Domain experts can get tailored insights on demand without coding expertise.

Real-World Applications

Let's explore some real-world applications of PandasAI to understand how it can benefit various industries:

Finance

Financial analysts can use PandasAI to quickly analyze stock market data, generate investment insights, and create financial reports. They can ask questions like, "What are the top-performing stocks in the last quarter?" and receive instant answers. For example:

import pandas as pd
from pandasai import PandasAI

stocks = pd.read_csv("stocks.csv")
ai = PandasAI(model="codex")
ai.init(stocks)

ai.run("What were the top 5 performing stocks last quarter?")
ai.run("Compare revenue growth across technology and healthcare stocks")
ai.run("Which sectors saw the most upside surprises in earnings last quarter?")

Healthcare

Healthcare professionals can leverage PandasAI to analyze patient data, track disease trends, and make informed decisions about patient care. They can ask questions like, "What are the common risk factors for a particular disease?" and gain valuable insights.

Marketing

Marketers can use PandasAI to analyze customer data, segment audiences, and optimize marketing strategies. They can ask questions like, "Which marketing channels have the highest conversion rates?" 
and fine-tune their campaigns accordingly.

E-commerce

E-commerce businesses can benefit from PandasAI by analyzing sales data, predicting customer behavior, and optimizing inventory management. They can ask questions like, "What products are likely to be popular next month?" and plan their stock accordingly.

Conclusion

PandasAI represents an exciting glimpse into the future of data analysis driven by AI advancement. By automating the tedious parts of data preparation and manipulation, PandasAI allows data professionals to focus on high-value tasks - framing the right questions, interpreting insights, and telling impactful data stories.

Its natural language interface also promises to open up data exploration and analysis to non-technical domain experts. Rather than writing code, anyone can derive tailored insights from data by simply asking questions in plain English.

As AI continues progressing, we can expect PandasAI to become even more powerful and nuanced in its analytical abilities over time. It paves the path for taking data science from simple pattern recognition to deeper knowledge generation using machines that learn, reason and connect concepts.

While early in its development, PandasAI offers a taste of what is possible when the foundations of data analysis are reimagined using AI. It will be fascinating to see how this library helps shape and transform the analytics landscape in the coming years. For forward-thinking data professionals, the time to embrace its possibilities is now.

In summary, by synergizing the strengths of Pandas and large language models, PandasAI promises to push the boundaries of what is possible in data analysis today. It represents an important milestone in the AI-driven evolution of the field.

Author Bio

Rohan Chikorde is an accomplished AI Architect professional with a post-graduate in Machine Learning and Artificial Intelligence. With almost a decade of experience, he has successfully developed deep learning and machine learning models for various business applications. Rohan's expertise spans multiple domains, and he excels in programming languages such as R and Python, as well as analytics techniques like regression analysis and data mining. In addition to his technical prowess, he is an effective communicator, mentor, and team leader. Rohan's passion lies in machine learning, deep learning, and computer vision.

LinkedIn

Generative AI: Building a Strong Data Foundation

Shankar Narayanan
15 Sep 2023
7 min read
Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!

Introduction

Generative AI has become increasingly popular among businesses and researchers, which has led to a growing interest in how data supports generative models. Generative AI relies heavily on the quality and diversity of its foundational data to generate new data samples from existing ones. In this blog post, I will explain why a strong data foundation is essential for Generative AI and explore the various methods used to build and prepare data systems.

Why Data is Vital for Generative AI

Generative AI models can generate various outputs, from images to text to music. However, the accuracy and performance of these models depend primarily on the quality of the data they are trained on. The models will produce incorrect, biased, or unimpressive results if the foundation data is inadequate. The adage "garbage in, garbage out" is quite relevant here. The quality, diversity, and volume of data used will determine how well the AI system understands patterns and nuances.

Methods of Building a Data Foundation for Generative AI

To harness the potential of generative AI, enterprises need to establish a strong data foundation. But building a data foundation isn't a piece of cake. Like a killer marketing strategy, building a solid data foundation for generative AI involves a systematic collection, preparation, and management approach. Building a robust data foundation involves the following phases:

Data Collection

Collecting data from diverse sources ensures variety. For example, a generative model that trains on human faces should include faces from different ethnicities, ages, and expressions. For example, you can run the following code to collect data from a CSV file in Python:

import pandas as pd

data = pd.read_csv('path_to_file.csv')
print(data.head())  # prints first 5 rows

To read data from a database, you can use Python code like this:

import sqlite3

DATABASE_PATH = 'path_to_database.db'
conn = sqlite3.connect(DATABASE_PATH)
cursor = conn.cursor()
cursor.execute("SELECT * FROM table_name")
rows = cursor.fetchall()
for row in rows:
    print(row)
conn.close()

Time-Series Data

Time-series data is invaluable for generative models focusing on sequences or temporal patterns (like stock prices). Various operations can be performed with time-series data, such as the ones below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data (assuming a CSV file with 'date' and 'value' columns)
df = pd.read_csv('time_series_data.csv', parse_dates=['date'], index_col='date')

# Making the time series stationary
# Differencing
df['first_difference'] = df['value'] - df['value'].shift(1)

# Log transformation (if data is non-stationary after differencing)
df['log_value'] = np.log(df['value'])
df['log_first_difference'] = df['log_value'] - df['log_value'].shift(1)

# Smoothing with a moving average
window_size = 5  # e.g., using a window size of 5
df['moving_avg'] = df['first_difference'].rolling(window=window_size).mean()

Data Cleaning

Detecting and managing outliers appropriately is crucial as they can drastically skew AI predictions. Let's see an example of data cleaning using Python. 
import pandas as pd
import numpy as np

# Sample data for demonstration
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice'],
    'Age': [25, 30, np.nan, 29, 25],
    'Salary': [50000, 55000, 52000, 60000, 50000],
    'Department': ['HR', 'Finance', 'Finance', 'IT', None]
}
df = pd.DataFrame(data)

# Removing duplicates
df.drop_duplicates(inplace=True)

Handling Missing Values

Accuracy can only be achieved with complete data sets. Techniques like imputation can be used to address gaps. The missing values can be handled for the data, as in the following example.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data (assuming a CSV file with 'date' and 'value' columns)
df = pd.read_csv('time_series_data.csv', parse_dates=['date'], index_col='date')

# Handle missing values: interpolation is one method
df['value'].interpolate(method='linear', inplace=True)

Data Augmentation

Transformations such as rotating, scaling, or flipping images can increase the volume and diversity of visual data. Sometimes, a little noise (random variations) is added to the data for robustness. Continuing with the same data from the cleaning example above, we can correct data types and remove outliers before augmenting further:

# Fill the missing age before casting to an integer type
df['Age'] = df['Age'].fillna(df['Age'].median())

# Correcting data types
df['Age'] = df['Age'].astype(int)  # Convert float Age to integer

# Removing outliers (using Z-score for Age as an example)
from scipy import stats
z_scores = np.abs(stats.zscore(df['Age']))
df = df[(z_scores < 3)]

Data Annotation

Adding descriptions or tags helps AI understand the context. For example, in image datasets, metadata can describe the scene, objects, or emotions present. Having domain experts review and annotate data ensures high fidelity.

Data Partitioning

Segregating data ensures that models are not evaluated on the same data they are trained on. This technique uses multiple training and test sets to ensure generalized and balanced models.

Data Storage & Accessibility

Storing data in structured or semi-structured databases makes it easily retrievable. For scalability and accessibility, many organizations opt for cloud-based storage solutions.

Generative AI's Need for Data

Different Generative AI models require diverse types of data:

Images: GANs, used to create synthetic images, rely heavily on large, diverse image datasets. They can generate artwork, fashion designs, or even medical images.

Text: Models like OpenAI's GPT series require vast text corpora to generate human-like text. These models can produce news articles, stories, or technical manuals.

Audio: Generative models can produce music or speech. They need extensive audio samples to capture nuances.

Mixed Modalities: Some models integrate text, image, and audio data to generate multimedia content.

Conclusion

We all know the capabilities and potential of generative AI models in various industries and roles like content creation, designing, and problem-solving. But to let it continuously evolve, improve, and generate better results, it's essential to recognize and leverage the correct data.

Enterprises that recognize the importance of data and invest in building a solid data foundation will be well-positioned to harness the creative power of generative AI in future years. As Generative AI advances, the role of data becomes even more critical. Just as a building requires a strong foundation to withstand the test of time, Generative AI requires a solid data foundation to produce meaningful, accurate, and valuable outputs. 
Building and preparing this foundation is essential, and investing time and resources into it will pave the way for breakthroughs and innovations in the realm of Generative AI.

Author Bio

Shankar Narayanan (aka Shanky) has worked on numerous different cloud and emerging technologies like Azure, AWS, Google Cloud, IoT, Industry 4.0, and DevOps to name a few. He has led the architecture design and implementation for many Enterprise customers and helped enable them to break the barrier and take the first step towards a long and successful cloud journey. He was one of the early adopters of Microsoft Azure and Snowflake Data Cloud. Shanky likes to contribute back to the community. He contributes to open source, is a frequently sought-after speaker, and has delivered numerous talks on Microsoft Technologies and Snowflake. He is recognized as a Data Superhero by Snowflake and as an SAP Community Topic Leader by SAP.

Exploring the Roles in Building Azure AI Solutions

Olivier Mertens, Breght Van Baelen
13 Sep 2023
19 min read
This article is an excerpt from the book, Azure Data and AI Architect Handbook, by Olivier Mertens and Breght Van Baelen. Master core data architecture design concepts and Azure Data & AI services to gain a cloud data and AI architect's perspective to developing end-to-end solutions.

Introduction

Artificial Intelligence (AI) is transforming businesses across various industries rapidly. Especially with the surge in popularity of large language models such as ChatGPT, AI adoption is increasing exponentially. Microsoft Azure provides a wide range of AI services to help organizations build powerful AI solutions. In this chapter, we will explore the different AI services available on Azure, as well as the roles involved in building AI solutions, and the steps required to design, develop, and deploy AI models on Azure.

Specifically, we will cover the following:

The different roles involved in building AI solutions
The questions a data architect should ask when designing an AI solution

By the end of this article, you will have a good understanding of the role of the data architect in the world of data science. Additionally, you will have a high-level overview of what the data scientists and machine learning engineers are responsible for.

Knowing the roles in data science

The Azure cloud offers an extensive range of services for use in advanced analytics and data science. Before we dive into these, it is crucial to understand the different roles in the data science ecosystem. In previous chapters, while always looking through the lens of a data architect, we saw workloads that are typically operationalized by data engineers, database administrators, and data analysts.

Up until now, the chapters followed the journey of data through a data platform, from ingestion to raw storage to transformation, data warehousing, and eventually, visualization and dashboarding. The advanced analytics component is more separated from the entire solution, in the sense that most data architectures can perform perfectly without it. This does not take away from the fact that adding advanced analytics such as machine learning predictions can be a valuable enhancement to a solution.

The environment for advanced analytics introduces some new roles. The most prominent are the data scientist and the machine learning engineer, which we will look at in a bit more detail, starting with the following figure. Other profiles include roles such as data labelers and citizen data scientists.

Figure 9.1 – An overview of the core components that each data role works with

Figure 9.1 shows a very simplified data solution with a machine learning component attached to it. This consists of a workspace to build and train machine learning models and virtual machine clusters to deploy them in production.

The data scientist is responsible for building and training the machine learning model. This is done through experimenting with data, most of the time stemming from the data lake. The data scientist will often use data from the bronze or silver tier in the data lake (i.e., the raw or semi-processed data). Data in the gold tier or the data warehouse is often transformed and aggregated in ways that make it convenient for business users to build reports with. However, the data scientist might want to perform different kinds of transformations, which focus more on the statistical relevance of certain features within the data to optimize the training performance of a machine learning model. 
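To make that concrete, the sketch below shows the kind of statistically motivated feature preparation a data scientist might apply to silver-tier data before training. It is a minimal illustration with hypothetical column names, not an example taken from the book:

import numpy as np
import pandas as pd

# Hypothetical silver-tier extract with a heavily skewed numeric feature
df = pd.DataFrame({'order_value': [12.0, 15.5, 14.2, 980.0, 13.7, 16.1]})

# Log-transform to reduce skew, then standardize to zero mean and unit variance
df['order_value_log'] = np.log1p(df['order_value'])
df['order_value_scaled'] = (
    df['order_value_log'] - df['order_value_log'].mean()
) / df['order_value_log'].std()

print(df)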
Regardless, in some cases, data scientists will still interact with the gold layer and the data warehouse to pull clean data for experimentation.

Using this data, data scientists will perform exploratory data analysis (EDA) to get initial insights into the dataset. This is followed by data cleaning and feature engineering, where features are transformed or new features are derived to serve as input for the machine learning model. Next up, a model is trained and evaluated, resulting in a first prototype. The experimentation does not stop here, however, as machine learning models have hyperparameters that can be adjusted, which might lead to increased performance, while still using the same dataset. This last process is called hyperparameter tuning. Once this is completed, we will arrive at the cutoff point between the responsibilities of a data scientist and a machine learning engineer.

The machine learning engineer is responsible for the machine learning operations, often referred to as MLOps. Depending on the exact definition, this usually encompasses the later stages of the machine learning model life cycle. The machine learning engineer receives the finished model from the data scientist and creates a deployment for it. This will make the model available through an API so that it can be consumed by applications and users. In later stages, the model will need to be monitored and periodically retrained, until the end of its life cycle. This is a brief summary, but the MLOps process will be explained in more detail further in this chapter.

Next, Figure 9.2 provides an overview of the processes that take place in the MLOps cycle and who the primary contributor to each step is.

Figure 9.2 – The steps of the data science workflow and their executors

Finally, what we are most interested in is the role of the cloud data architect in this environment. First, the architect has to think about the overall AI approach, part of which is deciding whether to go for custom development or not. We will dive deeper into strategy soon.

If custom machine learning model development is involved, the architect will have to decide on a data science environment, or workspace, where the data scientists can experiment.

However, the architect will have more involvement in the work of a machine learning engineer. The optimal working of MLOps is considerably more dependent on good architectural design than the typical prototyping done by data scientists. Here, the architect is responsible for deciding on deployment infrastructure, choosing the right monitoring solutions, version control for models, datasets, code, retraining strategies, and so on.

A lot of the value that an architect brings to machine learning projects comes from design choices outside of the data science suite. The data architect can greatly facilitate the work of data scientists by envisioning efficient data storing structures at the data lake level, with a strong focus on silver (and bronze) tiers with good data quality. Often, extra pipelines are required to get labeled data ready to be picked up by the data scientists.

Designing AI solutions

In this part, we will talk about the design of AI solutions, including qualification, strategy, and the responsible use of AI. Infusing AI into architecture has to be the result of some strategic consideration. The data architect should ask themself a series of questions, and find a substantiated answer, to end up with an optimal architecture.

The first set of questions is regarding the qualification of a use case. 
Is AI the right solution?

This can be further related to the necessity of an inductive solution, compared to a deductive one. Business rulesets are deductive; machine learning is inductive. Business rules will provide you with a solid answer if the condition for that rule is met. Machine learning models will provide you with answers that have a high probability but not certain ones.

The big advantage of machine learning is its ability to cover cases in a much more granular manner, whereas business rules must group various cases within a single condition so as to not end up with an absurd or even impossible number of rules. Look at image recognition, for example. Trying to make a rule set for every possible combination of pixels that might represent a human is simply impossible. Knowing this, evaluate the proposed use case and confirm that the usage (and correlating costs) of AI is justified for this solution.

Do we opt for pre-trained models or a custom model?

Although this question is more focused on implementation than qualification, it is crucial to answer it first, as this will directly impact the following two questions. As with most things in the broader field of IT, it comes down to not reinventing the wheel. Does your use case sound like something generic or industry-agnostic? Then there are probably existing machine learning models, often with far superior performance (general knowledge-wise) than your own data could train a model to have. Companies such as Microsoft and partners such as OpenAI invest heavily in getting these pre-trained models to cutting-edge standards.

It may be that the solution you want to create is fairly generic, but there are certain aspects that make it a bit more niche. An example could be a text analytics model in the medical industry. Text analytics models are great at the general skill of language understanding, but they might have some issues with grasping the essence of industry-specific language out of the box. In this case, an organization can provide some of its own data to fine-tune the model to increase its performance on niche tasks, while maintaining most of the general knowledge from its initial training dataset. Most of the pre-trained AI models on Azure, which reside in Azure Cognitive Services and Azure OpenAI Service, are fine-tunable. When out-of-the-box models are not an option, then we need to look at custom development.

Is data available?

If we opt for custom development, we will need to bring our own data. The same goes for wanting to fine-tune an existing model, yet to a lesser extent. Is the data that we need available? Does an organization have a significant volume of historical data stored already in a central location? If this data is still spread across multiple platforms or sources, then this might indicate it is not the right time to implement AI. It would be more valuable to focus on increased data engineering efforts in this situation. In the case of machine learning on Azure, data is ideally stored in tiers in Azure Data Lake Storage.

Keep in mind that machine learning model training does not stop after putting it into production. The performance of the production model will be constantly monitored, and if it starts to drift over time, retraining will take place. Do the sources of our current historical data still generate an adequate volume of data to carry out retraining?

In terms of data volume, there is still a common misunderstanding that large volumes of data are a necessity for any high-performant model. 
It's key to know here that even though the performance of a model still scales with the amount of training data, more and more new techniques have been developed to allow for valuable performance levels to be reached with a limited data volume.

Is the data of acceptable quality?

Just like the last question, this only counts for custom development or fine-tuning. Data quality between sources can differ immensely. There are different ways in which data can be of bad quality. Some issues can be solved easily; others can be astonishingly hard. Some examples of poor data quality are as follows:

Inaccurate data: This occurs when data is incorrect or contains errors, such as typos or missing values. This is not easy to solve and will often result in fixes required at the source.

Incomplete data: This occurs when data is missing important information or lacks the necessary details to be useful. In some cases, data scientists can use statistics to impute missing data. In other cases, it might depend on the specific model that is being developed. Certain algorithms can perform well with sparse data, while others are heavily affected by it. Knowing which exact algorithms these are should not be in the scope of the architect but, rather, the data scientists.

Outdated data: This occurs when data is no longer relevant or useful due to changes in circumstances or the passage of time. If this data is statistically dissimilar to data generated in the present, it is better to remove this data from the training dataset.

Duplicated data: This occurs when the same data is entered multiple times in different places, leading to inconsistencies and confusion. Luckily, this is one of the easiest data quality issues to solve.

Biased data: This occurs when data is influenced by personal biases or prejudices, leading to inaccurate or unfair conclusions. This can be notoriously hard to solve and is a well-known issue in the data science world. We will come back to this later when discussing responsible AI.

This concludes the qualifying questions on whether to implement AI or not. There is one more important topic, namely the return on investment (ROI) of the addition, but to calculate the investment, we need to have more knowledge on the exact implementation. This will be the focus of the next set of questions.

Low code or code first?

The answer to which approach should be chosen depends on people, their skill sets, and the complexity of the use case. In the vast majority of cases, code-first solutions are preferred, as they come with considerably more flexibility and versatility. Low code simplifies development a lot, often by providing drag and drop interfaces to create workflows (or, in this case, machine learning pipelines). While low-code solutions often benefit from rapid development, this advantage in speed is slowly shrinking. Due to advancements in libraries and packages, generic code-first models are also being developed in a shorter amount of time than before.

While code-first solutions cover a much broader set of use cases, they are simply not possible for every organization. Data scientists tend to be an expensive resource and are often fought over, with competition due to a lack of them in the labor market. Luckily, low-code platforms are advancing fast to address this issue. 
This allows citizen data scientists (non-professionals) to create and train machine learning models easily, although it will still yield inferior performance compared to professional code-first development.

As a rule of thumb, if a professional data science team is present and it has already been decided that custom development is the way forward, choose a code-first solution.

What are the requirements for the AI model?

Now, we will dive deeper into the technicalities of machine learning models. Note that not all answers here must come from the data architect. It is certainly a plus if the architect can think about things such as model selection with the data scientists, but it is not expected of the role. Leave it to the data science and machine learning team to have a clear understanding of the technical requirements for the AI model and allow them to leverage their expertise.

The minimum accepted performance is probably the most straightforward. This is a defined threshold on the primary metric of a model, based on what is justifiable for the use case to progress. For instance, a model might need to have a minimum accuracy of 95% to be economically viable and continue toward production.

Next, latency is an important requirement when the model is used to make real-time predictions. The larger the model and the more calculations that need to happen (not counting parallelism), the longer it will take to make a prediction. Some use cases will require a prediction latency within milliseconds, which can be solved with lightweight model selection and specialized infrastructure.

Another requirement is the size of the model, which directly relates to the hosting costs when deployed into production, as the model will have to be loaded in RAM while the deployment runs. This is mostly a very binding requirement for IoT Edge use cases, where AI models are deployed on a small IoT device and make predictions locally before sending their results to the cloud. These devices often have very limited memory, and the data science team will have to figure out what the most efficient model is to fit on the device.

With the recently growing adoption of large language models (LLMs), such as the GPT-model family, power consumption has started to become an increasingly important topic as well. Years ago, this was a negligible topic in most use cases, but with the massive size of today's cutting-edge models, it is unavoidable. Whether these models are hosted privately or in the cloud, power consumption will be an incurred cost directly or indirectly. For natural language use cases specifically, consider whether the traditional (and significantly cheaper) text analytics models in Azure Cognitive Services can do the job at an acceptable level before heading straight for LLMs.

Batch or real-time inferencing?

When a model is finished and ready for deployment, the architect will have to decide on the type of deployment. On a high level, we should decide whether the model will be used for either batch scoring or predicting in real-time.

Typically, when machine learning predictions are used to enrich data, which is already being batch processed in an OLAP scenario, the machine learning model can do periodical inferencing on large batches. The model will then be incorporated as an extra transformation step in the ETL pipeline. 
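As a rough illustration of that batch pattern, the sketch below scores one batch of records as an extra transformation step. It is a minimal, hypothetical example (a generic scikit-learn model with made-up feature names), not an Azure-specific implementation:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical model trained earlier in the workflow
train_X = pd.DataFrame({'feature_a': [0.1, 0.9, 0.4, 0.8],
                        'feature_b': [1.0, 0.2, 0.7, 0.1]})
train_y = [0, 1, 0, 1]
model = LogisticRegression().fit(train_X, train_y)

def score_batch(batch: pd.DataFrame) -> pd.DataFrame:
    # Enrich the incoming batch with predictions before it is written back
    scored = batch.copy()
    scored['prediction'] = model.predict(scored[['feature_a', 'feature_b']])
    return scored

new_batch = pd.DataFrame({'feature_a': [0.3, 0.85], 'feature_b': [0.9, 0.15]})
print(score_batch(new_batch))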
When using machine learning models in applications, for example, where users expect an instant prediction, real-time endpoints are required.

When deploying our model to an endpoint, the architecture might differ based on the type of inferencing, which we will look into in more depth later in this chapter.

Is explainability required?

Explainable AI, often referred to as XAI, has been on the rise for quite a while now. For traditional machine learning models, it was straightforward to figure out why a model came to which conclusion, through statistical methods such as feature importance. With the rise of deep learning models, which are essentially black-box models, we come across more and more predictions that cannot be explained.

Techniques have been developed to make an approximation of the decision-making process of a black box model. For instance, in the case of the mimic explainer, a traditional (and by nature interpretable) machine learning model is trained to mimic the black-box model and extract things, such as feature importance, from the mimic model. However, this is still an approximation and no guarantee.

Therefore, it is key to figure out how crucial explainability is for the use case. In cases that (heavily) affect humans, such as predicting credit scoring using AI, interpretability is a must. In cases with minimal or no impact on human lives, interpretability is more of a nice-to-have. In this instance, we can opt for a black-box model if this provides increased predictive performance.

What is the expected ROI?

When the qualifying questions have been answered and decisions have been made to fulfill technical requirements, we should have sufficient information to calculate an estimated ROI. This will be the final exercise before giving the green light to start implementation, or at least the development of a proof of concept.

If we know what approach to use, what kind of models to train, and which type of deployment to leverage, we can start mapping it to the right Azure service and perform a cost calculation. This is compared to the expected added value of a machine learning model.

Optimal performance of a machine learning model

As a side note to calculating the ROI, we need to have an idea of what the optimal performance level of a machine learning model is. This is where the academic and corporate worlds tend to differ. Academics focus on reaching the highest performance levels possible, whereas businesses will focus on the most efficient ratio between costs and performance. It might not make sense for a business to invest largely in a few percent increase in performance if this marginal increase is not justified by bringing adequate value to compensate.

Conclusion

This article is focused on data science and AI on Azure. We started by outlining the different roles involved in a data science team, including the responsibilities of data architects, engineers, scientists, and machine learning engineers, and how the collaboration between these roles is key to building successful AI solutions.

We then focused on the role of the data architect when designing an AI solution, outlining the questions they should ask themselves for a well-architected design.

Author Bio

Olivier Mertens is a cloud solution architect for Azure data and AI at Microsoft, based in Dublin, Ireland. In this role, he assisted organizations in designing their enterprise-scale data platforms and analytical workloads. 
Next to his role as an architect, Olivier leads the technical AI expertise for Microsoft EMEA in the corporate market. This includes leading knowledge sharing and internal upskilling, as well as solving highly complex or strategic customer AI cases. Before his time at Microsoft, he worked as a data scientist at a Microsoft partner in Belgium.

Olivier is a lecturer for generative AI and AI solution architectures, a keynote speaker for AI, and holds a master's degree in information management, a postgraduate degree as an AI business architect, and a bachelor's degree in business management.

Breght Van Baelen is a Microsoft employee based in Dublin, Ireland, and works as a cloud solution architect for the data and AI pillar in Azure. He provides guidance to organizations building large-scale analytical platforms and data solutions. In addition, Breght was chosen as an advanced cloud expert for Power BI and is responsible for providing technical expertise in Europe, the Middle East, and Africa. Before his time at Microsoft, he worked as a data consultant at Microsoft Gold Partners in Belgium.

Breght led a team of eight data and AI consultants as a data science lead. Breght holds a master's degree in computer science from KU Leuven, specializing in AI. He also holds a bachelor's degree in computer science from the University of Hasselt.