
How-To Tutorials

7019 Articles

Getting Started with Azure Speech Service

M.T. White
22 Aug 2023
10 min read
Introduction

Commanding machines to do your bidding was once science fiction. Controlling a machine with mere words graced the pages of many sci-fi comics and novels, and it is only recently that science fiction became science fact. With the rise of devices such as Amazon's Alexa and Apple's Siri, vocally controlling a device has become a staple of the 21st century. So, how does one integrate voice control into an app? There are many ways to accomplish that, but one of the easiest is to use an Azure AI tool called Speech Service. This tutorial is a crash course on integrating Azure's Speech Service into a standard C# app. To explore the tool, we're going to use it to build a simple profanity filter.

What is Azure Speech Service?

There are many ways to create a speech-to-text app. One could build it from scratch, use a library, or use a cloud service. Arguably the easiest approach is a cloud service such as the Azure Speech Service. This Azure AI tool analyzes speech picked up by a microphone and converts it to a text string in the cloud. The resulting string is then sent back to the app that made the request. In other words, the speech-to-text service that Azure offers is an AI developer tool that lets engineers quickly convert speech to a text string.

It is important to understand that the Speech Service is a developer's tool. Since the rise of systems like ChatGPT, what counts as an AI tool has become ambiguous at best. When most people think of modern AI tools, they think of tools where you can provide a prompt and get a response. However, when a developer thinks of a tool, they usually think of something that helps them get a job done quickly and efficiently. In that sense, the Azure Speech Service is an AI tool that helps developers integrate speech-to-text features into their applications with minimal setup.

The Azure Speech Service is a very powerful tool that can be integrated into almost anything. For example, you can create a profanity filter with minimal code, make a voice request to an LLM like ChatGPT, or do any number of other things. Keep in mind that the Azure Speech Service is an AI tool meant for engineers: unlike ChatGPT or LLMs in general, you will have to understand the basics of coding to use it successfully. With that, what do you need to get started with the Speech Service?

What do you need to use Azure Speech Service?

Setting up an app that can use the Azure service requires relatively little. All you will need is the following:

• An Azure account
• Visual Studio (preferably the latest version)
• Internet connectivity
• The Microsoft.CognitiveServices.Speech NuGet package

This project is going to be a console-based application, so you won't need to worry about anything fancy like creating a Graphical User Interface (GUI). When all of that is installed and ready to go, the next thing you will want to do is set up a simple speech-to-text service in Azure.

Set up the Azure Speech Service

After your environment is set up, you're going to want to set up the service itself. Setting up the speech-to-text service is quick and easy, as there is very little that needs to be done on the Azure side. All you have to do is perform the following steps:

1. Log into Azure and search for Speech Services.
2. Click the Create button shown in Figure 1 and fill out the wizard that appears.

Figure 1. Create Button

3. Fill out the wizard to match Figure 2. You can name the instance anything you want and set the resource group to anything you want. As far as the pricing tier goes, you will usually be able to use the service for free for a time; after the trial period ends you will eventually have to pay for it. Once you have the wizard filled out, click Review + Create.

Figure 2. Speech Service

4. Keep following the wizard until you see the screen in Figure 3. On this screen, click the manage keys link that is circled in red.

Figure 3. Instance Service

This is where you get the keys necessary to use the AI tool. Clicking the link is not strictly necessary, as the keys are also at the bottom of the page, but the link brings you directly to them.

At this point, the service is set up. You will need to capture the key information shown in Figure 4.

Figure 4. Key Information

To capture the key data, simply click the Show Keys button, which will unmask KEY 1 and KEY 2. Each instance you create will generate a new set of keys. As a safety note, never share your keys with anyone: anyone who has them can use your service, rack up your bill, and create other cyber-security concerns. Unmask the keys, copy KEY 1, and copy the region as well.

C# Code

Now comes the fun part of the project: creating the app. The app will be relatively simple; the only tricky part is installing the NuGet package for the Speech Service. To do this, simply add the NuGet package shown in Figure 5.

Figure 5. NuGet Package

Once that package is installed, you can start to implement the code. To start off, we're simply going to make an app that dictates back what we say to it. To do this, input the following code:

// See https://aka.ms/new-console-template for more information
using Microsoft.CognitiveServices.Speech;

await translateSpeech();

static async Task translateSpeech()
{
    string key = "<Your Key>";
    string region = "<Your Region>";
    var config = SpeechConfig.FromSubscription(key, region);
    using (var recognizer = new SpeechRecognizer(config))
    {
        // Listen once and print whatever the service recognized
        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine(result.Text);
    }
}

When you run this program, it will open a prompt. You will be able to speak into the computer's microphone, and whatever you say will be displayed. For example, run the program and say "Hello World". After the service has finished translating your speech, you should see the following display on the command prompt:

Figure 6. Output From App

Now, this isn't the full project; it is just a simple app that dictates what we say to the computer. What we're aiming for in this tutorial is a simple profanity filter. For that, we need to add another function to the project to help filter the returned string. It is important to remember that what is returned is a text string, just like any other text string one would use in C#.
As such, we can modify the program to the following to filter profanity:

// See https://aka.ms/new-console-template for more information
using Microsoft.CognitiveServices.Speech;

await translateSpeech();

static async Task translateSpeech()
{
    string key = "<Your Key>";
    string region = "<Your Region>";
    var config = SpeechConfig.FromSubscription(key, region);
    using (var recognizer = new SpeechRecognizer(config))
    {
        var result = await recognizer.RecognizeOnceAsync();
        Console.WriteLine(result.Text);
        VetSpeech(result.Text);
    }
}

static void VetSpeech(String input)
{
    Console.WriteLine("checking phrase: " + input);
    String[] badWords = { "Crap", "crap", "Dang", "dang", "Shoot", "shoot" };
    foreach (String word in badWords)
    {
        // Flag the phrase if it contains any word on the list
        if (input.Contains(word))
        {
            Console.WriteLine("flagged");
        }
    }
}

Now, in the VetSpeech function, we have an array of "bad" words. In short, if the returned string contains a variation of these words, the program will display "flagged". As such, if we were to say "Crap Computer" when the program is run, we can expect to see the following output in the prompt:

Figure 7. Profanity Output

As can be seen, the program flagged the phrase because the word "Crap" was in it.

Exercises

This tutorial was a basic rundown of the Speech Service in Azure. It is probably one of the simplest services to use, but it is still very powerful. Now that you have a basic idea of how the service works and how to write C# code for it, create a ChatGPT developer token and pass the returned string to ChatGPT. When done correctly, this project will let you verbally interact with ChatGPT; that is, you should be able to verbally ask ChatGPT a question and get a response.

Conclusion

The Azure Speech Service is an AI tool. Unlike many other AI tools such as ChatGPT, it is meant for developers to build applications with. Also, unlike many other Azure services, it is very easy to use with minimal setup. As the tutorial showed, the hardest part was writing the code that uses the service, and even that was not very difficult. The best part is that the code provided in this tutorial is the basic code you will need to interact with the service, meaning all you have to do now is modify it to fit your project's needs.

Overall, the power of the Speech Service is limited only by your imagination. It would be excellent for adding verbal interaction to other tools like ChatGPT, building voice-controlled robots, or anything else. It is a relatively cheap and powerful tool that can be leveraged for many things.

Author Bio

M.T. White has been programming since the age of 12. His fascination with robotics flourished when he was a child programming microcontrollers such as Arduino. M.T. currently holds an undergraduate degree in mathematics and a master's degree in software engineering, and is currently working on an MBA in IT project management. He works as a software developer for a major US defense contractor and is an adjunct CIS instructor at ECPI University. His background mostly stems from the automation industry, where he programmed PLCs and HMIs for many different types of applications. M.T. has programmed many different brands of PLCs over the years and has developed HMIs using many different tools.

Author of the book: Mastering PLC Programming


Getting Started with AutoML

M.T. White
22 Aug 2023
7 min read
Introduction

Tools like ChatGPT have been making headlines as of late. ChatGPT and other LLMs have been transforming the way people study, work, and, for the most part, do anything. However, ChatGPT and other LLMs are aimed at everyday users. In short, ChatGPT and similar systems can help engineers and data scientists, but they are not designed to be engineering or analytics tools. Though ChatGPT and other LLMs are not designed to be machine-learning tools, there is a tool that can assist engineers and data scientists: AutoML for Azure. This article explores AutoML and how engineers and data scientists can use it to create machine learning models.

What is AutoML?

AutoML is an Azure tool that builds the optimal model for a given dataset. In many senses, AutoML can be thought of as a ChatGPT-like system for engineers: a tool that allows engineers to quickly produce optimal machine-learning models with little to no technical input. In short, ChatGPT and similar systems are tools that can answer general questions about anything, whereas AutoML is specifically designed to produce machine-learning models.

How does AutoML work?

Though AutoML is a tool designed to produce machine learning models, it doesn't actually use AI or machine learning in the process. The key to AutoML is parallel pipelines. A pipeline can be thought of as the logic in a machine-learning model; for example, the pipeline logic includes things such as cleaning data, splitting data, applying a model, and so on. When a person uses AutoML, it creates a series of parallel pipelines with different algorithms and parameters. When a model "fits" the data best, the search stops and that pipeline is chosen. Essentially, AutoML in Azure is a quick and easy way for engineers to cut out the skilled, time-consuming development work that can easily hinder non-experienced data scientists or engineers.
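To make the idea concrete, the short Python sketch below imitates, in miniature, what the service automates: fit several candidate pipelines on the same data and keep whichever one scores best on held-out data. This is only an illustration of the concept; the synthetic data, the candidate models, and the sequential loop are assumptions for the example and are not how Azure actually implements its parallel search.

# Miniature sketch of the "try many pipelines, keep the best" idea behind AutoML.
# The data and candidate models here are illustrative only.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Stand-in data; the Azure service would use the dataset you upload.
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

best_name, best_score = None, float("-inf")
for name, model in candidates.items():
    model.fit(X_train, y_train)                        # train each candidate
    score = r2_score(y_test, model.predict(X_test))    # evaluate on held-out data
    if score > best_score:
        best_name, best_score = name, score

print(f"Best pipeline: {best_name} (R^2 = {best_score:.3f})")

Azure runs its candidate pipelines in parallel and searches a far larger space of algorithms and hyperparameters, but the selection principle is the same.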
To demonstrate how AutoML in Azure works, let's build a model using the tool.

What do you need to know?

Azure's AutoML takes a little bit of technical knowledge to get up and running, especially if you're using a custom dataset. For the most part, you're going to need to know approximately what type of analysis you're going to perform. You're also going to need to know how to create a dataset. This may seem like a daunting task, but it is relatively easy.

Setup

To use AutoML in Azure you'll need to set up a few things. The first is an ML workspace. This is done by simply logging into Azure and searching for ML, as in Figure 1:

Figure 1

From there, click on Azure Machine Learning and you should be redirected to the Azure Machine Learning page. Once there, click the Create button and then New Workspace:

Figure 2

Next, fill out the form; all you need to do is select a resource group and give the workspace a name. You can use any name you want, but for this tutorial the name Article 1 will be used. You'll be prompted to click Create; once you click that button, Azure will start to deploy the workspace. The deployment may take a few minutes to complete. Once it is done, click Go to resource, and then click Launch studio as in Figure 3:

Figure 3

At this point, the workspace has been generated and we can move on to the next step in the process: using AutoML to create a new model. After you click Launch studio, you should be met with Figure 4. The page in Figure 4 is Azure Machine Learning Studio. From here you can navigate to AutoML by clicking the link in the left sidebar:

Figure 4

Once you click AutoML you should be redirected to the page in Figure 5:

Figure 5

Once you see something akin to Figure 5, click the New Automated ML Job button, which should redirect you to a screen that prompts you to select a dataset. This step is one of the more in-depth parts of the process. You can opt to use a predefined dataset that Azure provides for test purposes; however, for a real-world application you'll probably want a custom dataset engineered for your task. Azure will let you use either. For this tutorial we're going to use a custom dataset with two columns, Hours and Story Points:

[Table: sample dataset with two columns, Hours and Story Points]

To use this dataset, simply copy and paste it into a CSV file, then select the option to use data from a file and follow the wizard. Note that custom datasets need at least 50 data points.

Continue to follow the wizard and give the experiment a name, for example E1. You will also have to select a Target Column; for this tutorial, select Story Points. If you do not already have a compute instance available, click the New button at the bottom and follow the wizard to set one up. Once that step is complete you should be directed to a page like the one in Figure 6:

Figure 6

This is where you select the general type of analysis to be done on the dataset. For this tutorial, select Regression, click the Next button shown in Figure 6, and then click Finish. This will start the process, which will take several minutes to complete. The whole run can take up to about 20 minutes, depending on which compute instance you use. Once it is done, you will be able to see the metrics by clicking the Models tab, which lists all the models that were tried. From here you can explore each model and its associated statistics.

Summary

In all, Azure's AutoML is an AI tool that helps engineers quickly produce an optimal model. Though not the same, this tool can be used by engineers the way ChatGPT and similar systems are used by everyday users. The main drawback of AutoML is that, unlike ChatGPT, a user needs a rough idea of what they're doing. However, once a person understands the basic types of machine-learning analysis, they should be able to use this tool to great effect.

Author Bio

M.T. White has been programming since the age of 12. His fascination with robotics flourished when he was a child programming microcontrollers such as Arduino. M.T. currently holds an undergraduate degree in mathematics and a master's degree in software engineering, and is currently working on an MBA in IT project management. He works as a software developer for a major US defense contractor and is an adjunct CIS instructor at ECPI University. His background mostly stems from the automation industry, where he programmed PLCs and HMIs for many different types of applications. M.T. has programmed many different brands of PLCs over the years and has developed HMIs using many different tools.

Author of the book: Mastering PLC Programming


ChatGPT for Everyday Use

M.T. White
22 Aug 2023
14 min read
Introduction

ChatGPT is a revolutionary new technology that is making a large impact on society. Its full impact cannot be known at the time of writing because of how novel the technology is. What can be said is that since its introduction, many industries have been trying to leverage it to increase productivity, and everyday people are trying to learn to leverage it as well. Overall, ChatGPT and similar systems are very new, and how best to leverage them will take more time to fully manifest. This article explores how ChatGPT can be used in everyday life through a few use cases.

What is ChatGPT?

Before we begin, it is important to understand what ChatGPT is and what it isn't. In a lay sense, ChatGPT is a super-advanced chatbot. More specifically, ChatGPT is a generative AI that uses Natural Language Processing (NLP) to create a dialog between a user and itself. ChatGPT and similar systems are what are known as Large Language Models (LLMs). In short, for AI models to work they have to be trained on data; to train LLMs, engineers use vast amounts of text such as books, articles, journals, and so on. The result is a system like ChatGPT that has a vast knowledge base on many different subjects. Before we can explore how to use ChatGPT for everyday life, we need to explore how NOT to use it.

How not to use ChatGPT

ChatGPT is very powerful and can be used for many different things; however, it is important to understand that ChatGPT is neither a sage nor infallible. Remember, ChatGPT only knows what it was trained on. This means that if the information it was taught was wrong or outdated, so too will be the response it provides. As of writing this article, ChatGPT cannot and should not be used as a replacement for humans. Even with the answers ChatGPT gives, a decent level of domain knowledge is still required to properly format and use its responses. As such, it is important to take its suggestions with a certain amount of skepticism. The best way to think of ChatGPT is as an extremely smart friend with a wide range of knowledge: though that friend may be very knowledgeable, they can and will be wrong from time to time. With that, how does one use ChatGPT?

How to use ChatGPT for everyday life

Using ChatGPT starts with what is called a prompt. A prompt can be thought of as a query that the user feeds into ChatGPT, kind of like a Google search. It is a statement that asks ChatGPT a question or to perform a certain task. Since ChatGPT uses NLP, the prompt does not have to be formatted a certain way, nor does it require a specific syntax. In other words, a prompt can be entered the same way a user would pose the instruction to another person. However, it is good to provide ChatGPT with enough information to adequately describe the problem. An example of a prompt is as follows:

How can ChatGPT be used for everyday life?

Once the prompt is run, ChatGPT will produce text with suggestions, answers, software, or whatever was asked for. The output that is produced is called a response, and it contains the answer to the request or question.

There are a lot of ways ChatGPT can be used in everyday life. In terms of professional life, ChatGPT can make an excellent assistant. For most "text-based" jobs, that is, jobs that don't require design work such as drawing diagrams, ChatGPT can be an excellent assistant.
For example, suppose a programmer needs a small piece of software for a project, say a SQL query that can pull all the rows between two dates. The user could use a prompt like the following:

Write a SQL query that will pull all rows between January 3 and March 31.

This prompt will produce the following response:

To pull all rows between January 3 and March 31 from a table, you'll need to use the WHERE clause to filter the results based on the date column. Assuming you have a date column named "date_column" in your table, here's the SQL query:

SELECT *
FROM your_table
WHERE date_column >= '2023-01-03'
  AND date_column <= '2023-03-31';

Replace "your_table" with the actual name of your table, and "date_column" with the name of the column that contains the dates you want to filter. The query will select all rows where the "date_column" is on or after January 3, 2023, and on or before March 31, 2023.

For those who are not familiar with SQL, the prompt produced a simple query with generic placeholders that will pull the correct data out of a database.

Not everyone is a programmer, so how could a non-programmer use ChatGPT as an assistant? One area is composing emails. Suppose a salesperson needs to inform Tom, their boss, via email that a $1 million deal was closed today. They could use a prompt like the following:

Write Tom an email saying that the deal with Company X went through, and we made $1 million.

The response would be something along the lines of:

Subject: Successful Deal with Company X - $1 Million Revenue Achieved

Hi Tom,

I hope this email finds you well. I am thrilled to share some exciting news with you! The deal with Company X has successfully gone through, and I am delighted to inform you that we have made an astounding $1 million in revenue from this collaboration.

This achievement wouldn't have been possible without the hard work and dedication of our entire team. It reflects the trust and confidence our partners have in our capabilities, and I believe it marks a significant milestone for our company's growth and success.

I want to express my gratitude to you for your valuable contributions and expertise that played a crucial role in making this deal a reality. Your efforts have been commendable, and I am truly grateful for your commitment to the success of this venture.

Let's take a moment to celebrate this triumph together. Your exceptional skills and commitment to excellence have made this accomplishment possible, and I am genuinely proud to have you as part of our team.

Looking forward to even more prosperous ventures in the future. Once again, congratulations to you and the entire team!

Best regards,
[Your Name]
[Your Title/Position]
[Company Name]
[Email]
[Phone Number]

In this case, all the user has to do is proofread the email, copy it, and send it. For applications such as this, proofreading is very important. As stated before, ChatGPT is not a mind reader nor a sage; it may produce emails with wrong information, illogical wording, or anything else. So it pays to read the message before sending it.

Learning Tool

A very powerful use of ChatGPT is as a learning tool. For example, students can use it to help master a new subject, generate practice problems, and even check answers to an extent.
Assume the user needs a few extra practice problems for their English homework. They could enter a prompt like the following:

Generate some grammar practice problems for a 1st year English class

which would yield the following output:

Of course! Here are some grammar practice problems suitable for a 1st-year English class:

1. Subject-Verb Agreement: Choose the correct verb form that agrees with the subject in each sentence:
a) The cat (purr/purrs) when it is happy.
b) My friend and I (is/are) going to the movies tonight.
c) Neither the students nor the teacher (understands/understand) the assignment.

2. Pronoun Usage: Select the appropriate pronoun to complete each sentence:
a) She invited Sarah and (I/me) to the party.
b) My sister and (her/she) are going shopping.
c) The book is for (he/him) and his brother.

3. Articles (a/an/the): Insert the correct article (a, an, the, or no article) in each sentence:
a) Can you pass me (a/an) apple, please?
b) I need to buy (the) new shoes for the dance.
c) My friend has (no article) car.

4. Prepositions: Choose the correct preposition to complete each sentence:
a) We arrived (in/on/at) the airport fifteen minutes early.
b) The cat is sleeping (under/below/above) the table.
c) She's going to the store (by/with/for) her mom.

5. Verb Tenses: Identify the correct verb tense used in each sentence:
a) They (will be/was) studying for the exam all night.
b) By the time we arrived, the party (had started/has started).
c) I (am/was) watching a movie right now.

6. Adjective vs. Adverb: Decide whether the word in parentheses should be an adjective or adverb in the sentence:
a) She sings (beautiful/beautifully).
b) The (happy/happily) children played in the park.
c) The (quick/quickly) fox jumped over the fence.

7. Negation: Add the necessary negative word to each sentence:
a) I have (no/not) seen him today.
b) She does (not/none) like coffee.
c) We will (not/none) be attending the party.

8. Passive Voice: Rewrite the following sentences in the passive voice:
a) The chef cooked the meal.
b) The teacher is explaining the lesson.
c) The storm destroyed the old building.

9. Comparative and Superlative Forms: Complete each sentence with the correct comparative or superlative form of the adjective:
a) This book is (interesting/more interesting/most interesting) than the last one.
b) She is (tall/taller/tallest) than her sister.
c) It was the (big/biggest/bigger) cake at the party.

10. Reported Speech: Rewrite the following sentences using reported speech:
a) Mary said, "I am going to the store."
b) "I will call you later," said John.
c) "They have arrived," she said.

These grammar practice problems aim to reinforce essential grammar rules and concepts, helping 1st-year English students develop their language skills and proficiency.

As can be seen, ChatGPT generated a number of questions that one would usually find in an English 101 class. If the user wanted more specific questions, they could reword the prompt accordingly.

ChatGPT is a great teaching tool as well. Teachers can use it to generate lesson plans, write intros for learning modules, generate problems, and more. Overall, ChatGPT can be an excellent tool for academia in general.

Now, it is important to remember that ChatGPT is not all-knowing. It can only generate responses based on the data that it was trained on.
This means that if the user asks questions about a subject with little or no publicly available data, the response probably isn't going to be of the highest quality. For example, if someone asked:

How do I build a working time machine?

they're going to get the following:

Building a working time machine, as depicted in science fiction, is currently beyond our technological capabilities and goes against the known laws of physics. Time travel, as popularly imagined, involves traveling backward or forward in time, similar to how we move through space. As of my last update in September 2021, time travel remains a concept explored mainly in theoretical physics, and no practical time machine has been built or proven to be possible. Several theories, such as the theory of relativity, suggest that time travel could be theoretically feasible under certain conditions, but these ideas are still purely theoretical and have not been put into practice.

It's important to remember that building a time machine would present enormous scientific, technological, and ethical challenges. It could raise paradoxes, such as the "grandfather paradox," where a time traveler could potentially alter past events and create inconsistencies in history.

As of now, time travel remains a fascinating topic in science fiction and theoretical physics, but it is not something that can be practically accomplished with our current understanding and technology. The scientific community continues to explore the nature of time and the universe, but building a working time machine as portrayed in movies and books is not yet a reality.

ChatGPT gave some general information on time travel, but unfortunately it didn't generate any schematics. Again, this goes back to ChatGPT not being a sage: it only knows what it was "taught". Since there is no data on how to build a working time machine, ChatGPT could not fully answer the prompt.

If one looks at the English prompt above, one will notice that the practice questions were quite broad. It is common to have to "dig" with ChatGPT; in other words, a person may have to refine their queries to get what they need. To get some practice, try rewording the prompt to generate practice questions for only passive voice.

Summary

ChatGPT is a tool, and like any other tool, what it's used for is up to the user. As this article showed, ChatGPT is an excellent tool for helping a person through their day by generating software, emails, and so on. ChatGPT can also be a great learning or teaching device, helping students and teachers generate practice problems, create lesson plans, and much more. However, as has been stated numerous times, unless ChatGPT has been trained on something, it does not know about it. This means that asking it things like how to build a time machine, or about niche domain-specific concepts, isn't going to return quality responses. Also, even if ChatGPT has been trained on the topic of a prompt, it may not always generate a quality response. No matter the use case, the response should be vetted for accuracy. This may mean doing a little extra research, testing the output, or whatever else needs to be done to verify the response.

Overall, ChatGPT is less than a year old at the time of writing. This means that the full implications of using ChatGPT are not fully understood, nor is how to fully leverage it. What can be said is that ChatGPT and similar LLM systems will probably be the next Google.
In terms of everyday use, the only true inhibitors are the user's imagination and the data that was used to train ChatGPT.

Author Bio

M.T. White has been programming since the age of 12. His fascination with robotics flourished when he was a child programming microcontrollers such as Arduino. M.T. currently holds an undergraduate degree in mathematics and a master's degree in software engineering, and is currently working on an MBA in IT project management. He works as a software developer for a major US defense contractor and is an adjunct CIS instructor at ECPI University. His background mostly stems from the automation industry, where he programmed PLCs and HMIs for many different types of applications. M.T. has programmed many different brands of PLCs over the years and has developed HMIs using many different tools.

Author of the book: Mastering PLC Programming


ChatGPT and Azure Low Code Machine Learning

M.T. White
22 Aug 2023
12 min read
Introduction

ChatGPT can do many amazing things: it can troubleshoot code, generate source code, and much more. However, software development, and by extension data engineering, involves much more than just text-based programming. For example, Azure offers a low/no-code tool that can be used to generate machine learning models without spending countless hours writing code. There is a caveat to this service, though: a person has to know what they are doing to use it, and for many, building a machine-learning system is a complex task. This is where ChatGPT comes into play. ChatGPT can guide a user through building the pipeline needed to meet their goals. What is especially interesting about this notion is that we can use AI to help build other AI! To demonstrate this, we're going to use ChatGPT as a guide to build our own linear regression model in Azure.

Expectations

At the time of writing this article, ChatGPT cannot produce actual diagrams; however, what it can do is guide you. This means that you still have to be familiar with the overall gist of the technology, but you don't have to be a data science expert to figure out how to make a model from scratch. ChatGPT is an excellent tool to get you up and running quickly, but as stated in many other articles throughout the internet, it is not a replacement for developers.

On a side note, this tutorial is meant to be interactive. ChatGPT will only generate a series of steps that must be interpreted, so if you don't know how to carry out a task, there are plenty of opportunities to create your own prompts. With that, let's start!

Setup

This tutorial assumes that you have a basic knowledge of Azure and the Azure Machine Learning services. However, we're going to use ChatGPT as a guide from start to finish, so if you're not sure what you're doing, you can always modify the prompts or create new ones to help you along. As an exercise, if you do get stuck, it is a good idea to write some prompts on your own to get a feel for ChatGPT. With that, let's start by having ChatGPT guide us in creating the necessary environment for our machine learning pipeline.

Environment

To begin, let's get our environment set up. To figure out how to do this, let's use the following prompt:

What infrastructure will I need to build a no-code Azure machine learning pipeline?

Once you run this prompt you should be met with something akin to the following:

To build a no-code Azure Machine Learning pipeline, you will need the following infrastructure components:

1. Azure Subscription: You need an active Azure subscription to access and utilize Azure Machine Learning services.

2. Azure Machine Learning Workspace: Create an Azure Machine Learning workspace, which serves as the top-level resource for organizing and managing your machine learning assets. The workspace provides a centralized location to store your pipelines, datasets, models, and other resources.

3. Compute Resources: Azure Machine Learning provides various compute options for running your pipeline. Depending on your workload and requirements, you can choose from options such as Azure Machine Learning Compute, Azure Databricks, or remote VMs. These resources will execute the pipeline steps and perform the required computations.
4. Data Storage: Azure Machine Learning supports multiple data storage options, such as Azure Blob Storage, Azure Data Lake Storage, or Azure SQL Database. You need a suitable data storage solution to store your input datasets, intermediate data, and any other data required for your pipeline.

5. Networking and Security: Ensure that your Azure subscription and associated resources are properly secured. Azure provides various security features, such as Virtual Network (VNet) integration, network security groups (NSGs), role-based access control (RBAC), and Azure Active Directory (AAD) authentication, to protect your resources and control access.

6. Monitoring and Logging: Azure Machine Learning offers monitoring and logging capabilities to track the execution of your pipelines, monitor performance metrics, and troubleshoot issues. You can use Azure Monitor, Azure Log Analytics, or other monitoring tools to gain insights into pipeline performance and health.

7. Integration with DevOps: If you have a DevOps workflow, you may want to integrate your no-code pipeline with your existing CI/CD processes. Azure Machine Learning can be integrated with Azure DevOps, GitHub Actions, or other tools to automate the deployment and management of your pipelines.

Remember to evaluate your specific workload requirements and choose the appropriate infrastructure components accordingly. The exact infrastructure setup will vary depending on factors such as data size, computational needs, security requirements, and integration with existing systems.

This prompt generated a lot of infrastructure components that would be necessary for a production-level pipeline. However, our pipeline will only be used for practice, so all we need to do is follow steps 1-3. In other words, all we need for this tutorial is an Azure instance, an Azure Machine Learning workspace, and a compute instance. If you can create these by yourself, do so; if you can't, try writing your own prompt. For example, try engineering a prompt that gives you the steps necessary to set up a compute instance. You could try something like the following:

How do I set up a compute instance in Azure?

If this prompt doesn't provide you with the necessary information, try rewording it and giving it more details until you can complete the task with the provided steps.

Pipeline

The next step is building out the pipeline. The pipeline is the logic for the model; in other words, it is the algorithm that will load the data, clean the data, train the model, and evaluate it if necessary. In short, the best way to think of the pipeline is as a graphical machine learning program. As stated before, ChatGPT is mostly limited to text-based responses. However, we can still get step-by-step directions with the following prompt:

Build a minimal linear regression model using Azure Machine Learning Studio

which will output the following response:

As an AI language model, I don't have direct access to external tools or platforms like Azure Machine Learning Studio, and my capabilities are limited to providing instructions and guidance. However, I can walk you through the steps to build a minimal linear regression model using Azure Machine Learning Studio. Please note that the steps might slightly vary based on updates to the platform since my last update in September 2021. Make sure to check the official Azure Machine Learning documentation for any changes. Here's a general outline of the steps:
1. Set up Azure Machine Learning Studio: If you don't have an Azure Machine Learning account, sign up for one. Once you have access to Azure Machine Learning Studio, create a new project or workspace.

2. Create a new Experiment: In Azure Machine Learning Studio, create a new experiment by clicking on "Experiments" and then "New." Give your experiment a descriptive name.

3. Data Ingestion: Upload your dataset to Azure Machine Learning Studio. You can do this by clicking on "Datasets" and then "New" to upload your dataset.

4. Data Cleaning and Preparation: Prepare your data by cleaning and transforming it if needed. This might include removing missing values, encoding categorical variables, and splitting the data into training and testing sets.

5. Add Linear Regression Module: Drag and drop the "Linear Regression" module from the left panel to the canvas in your experiment.

6. Connect Data to the Model: Connect the dataset you uploaded earlier to the "Linear Regression" module by clicking and dragging from the dataset to the input of the linear regression module.

7. Train the Model: Run the experiment to train the linear regression model on your dataset.

8. Evaluate the Model: After the training is complete, you can use evaluation modules like "Score Model" and "Evaluate Model" to assess the performance of the trained linear regression model.

9. Deploy the Model (Optional): If you want to use the trained model for predictions in real-world scenarios, you can deploy the model as a web service in Azure Machine Learning Studio.

10. Test the Deployed Model (Optional): Test the deployed model to ensure it is working correctly and providing accurate predictions.

Remember, this is just a minimal example of building a linear regression model using Azure Machine Learning Studio. Depending on your dataset and specific use case, you may need to perform additional steps like feature engineering, hyperparameter tuning, and more thorough model evaluation. Always refer to the official documentation and resources from Azure Machine Learning Studio for the most up-to-date and detailed instructions.

Analyzing the output, the first two steps are self-explanatory. The third step requires some preparation, since we're going to use a custom dataset for this tutorial. The dataset consists of 14 points. Normally you'll want at least 30 points to assume a normal distribution; however, for this example we're going to assume the data is normally distributed. With that, to train the model for this tutorial, copy the following data into either a plain text file or an Excel spreadsheet and save it as a .csv file:

[Table: 14-row sample dataset]

Once you have the data saved, you'll need to create a data asset and upload the file to Azure. If you don't know the steps to upload the dataset, try writing another prompt for directions. If you need more information on this step, you could try a prompt like:

How do I upload a custom dataset to Azure Machine Learning Studio?

Moving on to the rest of the instructions, we can construct a pipeline that resembles the one in Figure 1:

Figure 1 – Completed Pipeline

This pipeline is, for the most part, a one-to-one representation of the instructions. However, it expands on Step 4: we added a Select Columns in Dataset module to ensure we include only the data needed for the model, and from there we ran the data into a Split Data component, as ChatGPT suggested. Other than that, the pipeline is exactly as described, with the exception of the last two steps (deployment and testing), which ChatGPT labeled as "optional."
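Since the pipeline itself is assembled graphically rather than written as code, it can help to see the same logic spelled out as a script. The following Python/scikit-learn sketch is only a rough, conceptual equivalent of the Figure 1 pipeline, not something Azure generates; the file name and the column names (Hours as the feature, Story Points as the target) are assumptions standing in for whatever your dataset contains.

# Rough script equivalent of the graphical pipeline in Figure 1 (illustrative only).
# Assumes a file named dataset.csv with columns "Hours" and "Story Points".
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

data = pd.read_csv("dataset.csv")          # the uploaded data asset
X = data[["Hours"]]                        # Select Columns in Dataset
y = data["Story Points"]

X_train, X_test, y_train, y_test = train_test_split(   # Split Data component
    X, y, test_size=0.3, random_state=0)

model = LinearRegression()                 # Linear Regression module
model.fit(X_train, y_train)                # Train Model

predictions = model.predict(X_test)        # Score Model
print("MAE:", mean_absolute_error(y_test, predictions))   # Evaluate Model
print("R^2:", r2_score(y_test, predictions))

Mapping each line back to a module on the canvas is a useful sanity check that the graphical pipeline is wired the way you intend.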
For this tutorial, build the model as shown in Figure 1 and run it. After you run the pipeline, you can see how well the model performed. To see the statistics, click the Evaluate Model component. There is a lot of information to unpack, but you can see the scores by navigating to the Metrics tab. If you used the same dataset, your numbers should be around the values in Figure 2:

Figure 2 – Linear Regression Outputs

At this point, ChatGPT has guided us in building a linear regression model. Overall, the linear regression model that ChatGPT guided us to build is a very simple model that, all things considered, is fairly accurate.

Summary

This tutorial has been a crash course on how ChatGPT can be used to build no-code solutions in Azure Machine Learning Studio. What's incredible about this tutorial is that we used AI to help build another AI system. However, as the tutorial showed, ChatGPT was only a guide. For graphical systems, ChatGPT can, at best, be used as a guide, which means that for systems like Azure Machine Learning Studio a basic understanding of the system is required. As such, for graphical systems ChatGPT is best used by people who know the system but need guidance for the task at hand. For example, if this were a real project, the ideal engineer would be someone who knows how to use Azure Machine Learning Studio but needs help creating the pipeline logic.

In terms of graphical programming, ChatGPT is almost a little ironic. For text-based programming in Java, Python, or any other language, ChatGPT can generate working code. However, because ChatGPT currently cannot generate graphical programs for systems like Azure Machine Learning Studio, a person needs a more in-depth knowledge of the system. As ChatGPT matures this may change, but for now it is best to have a knowledgeable engineer driving ChatGPT and implementing its solutions. Overall, ChatGPT is an excellent assistant, but it requires a person who knows the technology being used.

Author Bio

M.T. White has been programming since the age of 12. His fascination with robotics flourished when he was a child programming microcontrollers such as Arduino. M.T. currently holds an undergraduate degree in mathematics and a master's degree in software engineering, and is currently working on an MBA in IT project management. He works as a software developer for a major US defense contractor and is an adjunct CIS instructor at ECPI University. His background mostly stems from the automation industry, where he programmed PLCs and HMIs for many different types of applications. M.T. has programmed many different brands of PLCs over the years and has developed HMIs using many different tools.

Author of the book: Mastering PLC Programming


ChatGPT as a Debugging Tool

M.T. White
22 Aug 2023
14 min read
Introduction

No matter the technology or application, debugging is a major part of software development. Every developer who has written a program of any significant size knows that the application is going to have some kind of defect in it and probably won't build the first few times it is run. In short, a vast amount of time and energy is spent debugging software; in many cases, debugging code can be more challenging than writing it in the first place. With the advent of systems like ChatGPT, spending hours debugging a piece of code may be a thing of the past, at least for relatively small code blocks. This tutorial explores prompts we can use to have ChatGPT troubleshoot defective code for us.

Expectations

Before we can explore troubleshooting with ChatGPT, we need to set some realistic expectations. To begin, ChatGPT works off a series of inputs known as prompts. For ChatGPT to fix a code block, you'll first have to submit the code and the issue as a prompt. At first glance, this may not seem like a big deal; however, modern applications are conglomerates of many smaller components that rely on each other to function correctly. On top of that, many of these "smaller" components may be, and usually are, composed of hundreds if not thousands of lines of code or more. This means a defect may not stem from the current code block but from a service or a line of code somewhere no one would think to look. As such, if the root of the defect is not included in the prompt, ChatGPT may not be of much use. To properly use ChatGPT as a troubleshooting tool, it is important to at least have a clue where the offending code is, because pasting in thousands, if not millions, of lines of code is impractical. Ultimately, ChatGPT, at least the web version, cannot be used as a troubleshooting tool without a person who is knowledgeable in the codebase, debugging, and coding guiding it. With that, let's start troubleshooting!

Troubleshooting with ChatGPT

Before we begin, let's define the bugs we're going to look at. Though there are many types of software defects, this tutorial uses ChatGPT to troubleshoot compilation and logical errors as examples. We'll define these two terms as follows:

Compilation Error: Any error that prevents the source code from compiling.
Logic Error: A software defect that gives an unanticipated result.

For these examples, we're going to use C and Python as the programming languages. However, due to the nature of ChatGPT, the prompts can be modified to accommodate other languages. We're also going to assume the program fits in a single file; that is, we're going to have ChatGPT analyze code snippets. For now, let's start by exploring some compilation errors!

Compilation Errors

To begin, let's take a look at a very simple Hello World program in C. Consider the following code:

#include <stdio.h>
int main(){
    printf("Hello World")
}

If you look at the code, you will see that there is a missing semicolon on the printf line. This is an obvious issue that the compiler will pick up, but for grins, let's see what ChatGPT has to say. For this example, we're going to use the following prompt:

Why will this C code not compile?
#include <stdio.h>
int main(){
    printf("Hello World")
}

This prompt will output the following response:

The C code you provided will not compile because there is a missing semicolon (;) at the end of the printf statement in the main function. Here's the corrected code:

#include <stdio.h>
int main() {
    printf("Hello World");
    return 0;
}

In C, semicolons are used to terminate statements, including function calls like printf(). The absence of a semicolon after printf("Hello World") causes a compilation error. Adding the semicolon at the end resolves this issue.

As can be seen, ChatGPT instantly caught the error. Now, this is a very simple error that basic tools such as the compiler would normally catch.

Next, let's explore how ChatGPT can troubleshoot a program in Python. As many developers know, Python has a unique syntax compared to C and C-like languages such as C++, C#, or Java. Instead of using curly braces, Python uses indentation to denote hierarchy, which can often lead to problems that are hard to diagnose, especially for the inexperienced. As such, consider the following code block:

some_number = 2
if some_number <= 2:
print("some number is less than 2")

An experienced Python developer would probably catch this issue quickly. However, for this example, we're going to let ChatGPT troubleshoot the code using the following prompt:

What is the syntax error with this Python code block?

some_number = 2
if some_number <= 2:
print("some number is less than 2")

This prompt will output the following response:

The syntax error in the given Python code block is that the print statement is not indented properly. In Python, indentation is crucial to define the structure and hierarchy of code blocks. Here's the corrected version:

some_number = 2
if some_number <= 2:
    print("some number is less than or equal to 2")

In this corrected code, the print statement is indented with four spaces (or a tab) to align it with the if statement. This ensures that it is part of the code block executed when the condition is true.

In this output, ChatGPT gave us the correct syntax as well as an explanation of the problem and solution. Similar to the C code, these are problems that are usually picked up by the IDE or the compilation system. However, it is not uncommon to have to make code modifications in a plain text editor or to get tracebacks that aren't very useful. In those cases, ChatGPT can be an awesome tool for quickly fixing issues. Now that we've explored compilation issues, let's take a look at logical errors.

Logical Errors

Logical errors are often the hardest to troubleshoot, and many stem from very simple mistakes, such as using the wrong variable in a control statement, using the wrong math equation, or anything of the sort. These are often the hardest issues to diagnose because, unless it is an oversight, the developer believes their logic is sound. As such, let's explore how ChatGPT can be used to check a developer's logic. Consider the following C code:

#include <stdio.h>
int main(){
    int a = 2;
    int b = 3;
    if (b > a){
        printf("b < a");
    }else{
        printf("a > b");
    }
    return 0;
}

The purpose of this code is to show which value is greater, a or b. However, when the code is run with the current values, it will return the following output:

Figure 1 – C Program Output

Obviously, there is a logic error in the code.
To troubleshoot it, let’s use the following prompt in ChatGPT:This code is producing the wrong results, it is saying that b is less than a when the inverse is true.  What is wrong with this code #include <stdio.h>int main(){            int a = 2;            int b = 3;            if (b > a){                        printf("b < a");            }else{                        printf("a > b");            }            return 0;}When the prompt is executed, it will return the following:The issue with the given C code is the incorrect comparison operator in the printf statements. The condition inside the if statement checks if b is greater than a, but the printf statements incorrectly state the opposite. To fix this, you need to update the printf statements to reflect the correct condition. Here's the corrected code: #include <stdio.h> int main() {    int a = 2;    int b = 3;    if (b > a) {        printf("b > a");    } else {        printf("a >= b");    }    return 0; } Now the code will output "b > a" when b is indeed greater than a, and "a >= b" otherwise.Again, ChatGPT gave a very clear and concise diagnosis and solution.  In short, the conditional statement in the if statement was wrong.  Now, let’s look at a snippet in Python. Consider the following code:def testFunction():    sum = 0    for i in range(10):        sum += i        print(i)    testFunction()In short, if you tried to run this code, nothing would happen. A Python developer would probably catch this error quickly, but let’s see what ChatGPT has to say. To start, let's use the following prompt:This function will not executedef testFunction():sum = 0 for i in range(10):sum += i print(i)testFunction()After running this prompt, you should get something akin to the following:The Python function you provided has a recursive call to itself (testFunction()) within its own body. This recursive call causes an infinite recursion, which results in a "RecursionError: maximum recursion depth exceeded" error. If you intended to have a recursive function, you should define the base case that stops the recursion. However, in this case, it seems like you want to call the testFunction() only once, without recursion. To fix the code, you can remove the testFunction() call from within the function body. Here's the corrected version: def testFunction():    sum = 0    for i in range(10):        sum += i        print(i)  testFunction() Now the function will execute correctly, iterating through the numbers 0 to 9 and printing each number on a separate line, without causing any recursion errors.This time the diagnosis isn’t as clear because there are a couple of errors that could be the issue. The first is that ChatGPT thinks this might be a recursive function, in which case, ChatGPT suggests putting in an exit for the function. However, if you read the output a little closer, it suggests that the developer is trying to call the function. In this case, it says that the function call is embedded in the function itself and provides the correct working code. As such, it diagnosed and troubleshot the code.Key TakeawaysChatGPT is an excellent way to troubleshoot code. It should be noted that the code in this tutorial was relatively simple and short.  With that, ChatGPT is excellent at troubleshooting small snippets, for example, methods or maybe even whole classes. 
However, for extremely complex problems, that is, problems that require many lines of code to be examined, ChatGPT may not be the optimal tool because all those lines have to be inputted into the prompt. Considering the problem, ChatGPT may get confused with the code, and complex prompts may have to be engineered to find the problem. However, if you have a rough idea of where the defect originates from, like which class file, it may be worthwhile to run the code through ChatGPT. If nothing else, it probably will give you a fresh perspective and, at the very least, point you in the right direction. The key to using ChatGPT as a troubleshooting tool is giving it the proper information. As we saw with the compilation and logic errors, a compilation error only needed the source code; however, that prompt could have been optimized with a description of the problem. On the other hand, to get the most out of logic errors, you’re going to want to include the following at a minimum:  The programming language  The code (At least the suspected offending code)   A description of the problem   Any other relevant informationSo far, the more information you provide to ChatGPT, the better the results are, but as we saw, a short description of the problem took care of the logic errors. Now, you could get away without specifying the problem, but when it comes to logical errors, it is wise to at least give a short description of the problem. ChatGPT is not infallible, and as we saw with the Python function, ChatGPT wasn’t too sure if the function was meant to be recursive or not. This means, much like a human, it needs to know as much about the problem as it can to accurately diagnose it.SummaryIn all, ChatGPT is a great tool for troubleshooting code. This tool would be ideal for compilation errors when tracebacks are not useful or not available. In terms of it being a tool for troubleshooting logical errors, ChatGPT can also be very useful. However, more information will be required for ChatGPT to accurately diagnose the problems. Again, the examples in this tutorial are very simple and straightforward. The goal was to simply demonstrate what kind of prompts can be used and the results of those inputs.  However, as was seen with the Python function, a complex code block can and probably will confuse the AI. This means that as a user, you have to provide as detailed information as you can to ChatGPT. It is also important to remember that no matter how you use the system, you will still need to use critical thinking and detective work yourself to hunt down the problem. ChatGPT is by no means a replacement for human developers, at least not yet. This means it is important to think of ChatGPT as another set of eyes on a problem and not a one-stop solution for a problem.  Author BioM.T. White has been programming since the age of 12. His fascination with robotics flourished when he was a child programming microcontrollers such as Arduino. M.T. currently holds an undergraduate degree in mathematics, and a master's degree in software engineering, and is currently working on an MBA in IT project management. M.T. is currently working as a software developer for a major US defense contractor and is an adjunct CIS instructor at ECPI University. His background mostly stems from the automation industry where he programmed PLCs and HMIs for many different types of applications. M.T. 
has programmed many different brands of PLCs over the years and has developed HMIs using many different tools.
Author of the book: Mastering PLC Programming

ChatGPT for Ladder Logic

M.T. White
22 Aug 2023
17 min read
IntroductionChatGPT is slowly becoming a pivotal player in software development.  It is being used by countless developers to help produce quality and robust code.  However, many of these developers are using ChatGPT for text-based programming languages like C++ or Java.  There are few, if any, tutorials on how ChatGPT can be utilized to write Ladder Logic code.  As such, this tutorial is going to be dedicated to exploring how and why ChatGPT can be used as a tool for traditional Ladder Logic programmers.Why use ChatGPT for Ladder Logic?The first step in learning how to leverage ChatGPT is to learn why to use the system.  First of all, ChatGPT is not a programmer, nor is it designed to replace programmers in any way, shape, or form.  However, it can be a handy tool for people that are not sure how to complete a task, need to produce some code in a crunch, and so on.  To effectively use ChatGPT, a person will have to know how to properly produce a statement, refine that statement, and, if necessary, write subsequent statements that have the right amount of information for ChatGPT to effectively produce a result.  In other words, a ChatGPT user still has to be competent, but when used correctly, the AI system can produce code much faster than a human can, especially if the human is inexperienced at a given task.In terms of industrial automation, ChatGPT can be an especially attractive tool.  It is no secret that many PLC programmers are not formally trained developers.  It is common for many PLC programmers to be maintenance technicians, electricians, or other types of engineers.  In any case, it is common for many people who are forced to write complex PLC software to have little more than previous experience guiding them.  As such, when faced with a complex situation with little to no resources available, the programmer can often be lost with no hope of finding a solution.  This is where ChatGPT can be utilized as a user can pose questions and task the system with finding solutions.  With that, how do we use ChatGPT at a basic level?  How to use ChatGPT?The key behind using ChatGPT is writing what are called prompts.  In a lay sense, a prompt is a query or command that ChatGPT runs.  In other words, a prompt can be thought of as a task you ask ChatGPT to do.  For example, a person could input the following prompt into ChatGPT:Tell me how to install a Beckhoff PLC.In this case, ChatGPT will give you a high-level overview of the basics of installing a Beckhoff PLC.  However, a prompt like this will usually not return any useful results unless you want general, high-level information.  Generally, to get a viable answer, you’ll need to hone the prompt with detailed information and exceptions.  Learning to write effective prompts is a skill that is in the early stages of development.  It is important to remember that ChatGPT is a novel tool. This means that the IT and automation industry as a whole is still learning how to use it effectively.  However, a general flow of writing and refining prompts can be summarized in Figure 1.Figure 1 – Prompt Refining FlowThis is just a general logic flow for massaging a prompt to get the desired result.  However, sometimes it is necessary and easier to run subsequent prompts instead of refining existing ones.  Either way, it may take a couple of tries to get what you need out of ChatGPT.  Regardless, to explore how to use ChatGPT, let’s start by creating a basic AND gate. 
AND ConfigurationAs everyone knows, an AND configuration is simply two contacts in a series.  For the coil to turn on, both contacts have to be on at the same time.  This a relatively simple Ladder Logic program to write and understand, so we’re going to use it as a test for ChatGPT. To begin, we’re going to use the following as a prompt:Write a ladder logic program for an AND gateWhen this prompt is run, it’ll generate the following network.  NETWORK--| |--[ ]---[ ]---( )                                                                                                                                  |          |     |----[ ]---( )Figure 2 – AND Gate: Attempt 1 With Description In this example, an and gate is produced, but it also generates an extra rung that wasn’t asked for or necessary for the task at hand.  In this case, it is not easy to determine what the extra run does or why it was included; as such, this is a prime example of needing to refine the prompts to only produce what is needed.  To do this, let’s refine the prompt a little more to match the following:Write only the necessary contacts and coils for an AND gate in Ladder LogicWhen this query is run, it will produce the network in Figure 2:NETWORK--| |--[ ]---[ ]---( ) Figure 3 – Refined AND Gate: Attempt 2 With DescriptionIn this refined attempt, ChatGPT produces a very simple network of contacts in series.  There are no extra rungs or ambiguous instructions. In short, the moral of this exercise is that we started with a general prompt that produced a working result but a result that needed to be refined. We then refined that prompt and produced a high-quality network with only the necessary instructions for a working AND gate.  With that, let’s look at something a little more complex, timer networks. Timer ProgramsAs every PLC programmer knows, most PLC programs will usually include a timer of some type.  So, with that in mind, let’s explore how ChatGPT handles timer programs.  To begin, let’s have ChatGPT generate a simple 10-minute timer.  To do this, we can use the following query:Write a ladder logic program for a 10 minute timer.Depending on your usage of ChatGPT, you may get a result without the variables labeled or even in Structured Text.  However, the prompt can be adjusted to produce the desired output by using subsequent prompts that are akin to the following:I need this in ladder logicor,assign the variables in the diagramAs stated before, you won’t have to re-write the initial prompt. All you have to do is run these statements as if they were your initial prompt.  
However, depending on the prompts you run, your result should be something akin to the following:VAR    Timer: TON;         (* Timer for the 10-minute duration *)    StartButton: BOOL;  (* Input for starting the timer *)    Done: BOOL;         (* Output to indicate when the timer is done *) END_VAR NETWORK -----[StartButton]----[/Timer]-----   |                            |   |                            |   |                           | ---|----------------------------|---- NO                           |                               |                               |                              ---[Done]---In the above ladder logic program:StartButton represents the input contact that starts the timer./Timer represents the timer coil, indicating that the timer is active and counting.Done represents the output contact, which becomes closed (energized) when the timer reaches 10 minutes.When the StartButton input contact is activated, and the timer starts counting. Once the timer reaches 10 minutes, the Done output contact will close, indicating that the timer is done. When the StartButton input contact is deactivated, the timer stops counting, and the Done output contact opens.Please note that the specific syntax and representation of timers may vary depending on the PLC programming software or the manufacturer's conventions. Adjust the syntax and parameters as per the specific requirements of your PLC system.As anyone who has programmed in Ladder Logic before can see, this isn’t standard notation for a timer function.  However, you probably noticed that each prompt will generate a relatively short description of how the snippet is supposed to work.  If you read the subsequent output, you can piece together the symbolic nature of the snippet.  With that, let’s move on to something a little more advanced.  Let’s explore building a cascading timer network that will have one timer set for 10 minutes and another set for 15, and when the final timer turns on, it will power an LED.  To do this, we can input the following prompt,Write a cascading timer Ladder Logic Network that has one timer set for 10 minutes and the other set for 15 minutes. Turn on an LED when both timers turn on. Give the necessary variables as well.This prompt will produce the following:VAR    Timer1: TON;     // Timer 1 (10 minutes)    Timer2: TON;     // Timer 2 (15 minutes)    LED: BOOL;       // Output LED END_VAR NETWORK    // Timer 1 (10 minutes)    Timer1(IN:= TRUE, PT:= T#10M);  // Set Timer 1 to 10 minutes duration    // Timer 2 (15 minutes)    Timer2(IN:= Timer1.Q, PT:= T#15M);  // Start Timer 2 when Timer 1 completes    // LED turns on when both timers complete    LED := Timer1.Q AND Timer2.Q; END_NETWORKIn this ladder logic program:Timer1 is a timer variable of type TON set for 10 minutes duration.Timer2 is a timer variable of type TON set for 15 minutes duration.LED is a boolean variable representing the output LED.Here's a breakdown of the logic:Timer1 is triggered to start with the input IN:= TRUE and is set to a preset time of 10 minutes (PT:= T#10M).Timer2 is started when Timer1 completes its timing sequence (IN:= Timer1.Q), and it is set to a preset time of 15 minutes (PT:= T#15M).The LED is turned on (LED := Timer1.Q AND Timer2.Q) when both Timer1 and Timer2 have completed their timing sequences.You can adjust the timing durations of the timers according to your specific requirements in your PLC programming software. 
Additionally, you can connect the LED output variable to the appropriate LED indicator in your setup.As can be seen, this isn’t our desired output.  We want Ladder Logic, and this is Structured Text.  In short, ChatGPT got confused.  If you read the description, ChatGPT thinks its result is Ladder Logic.  In all, this isn’t what we were looking for, so we can add a subsequent prompt that reads the following:I need this program in Ladder LogicThis prompt will produce the following output:NETWORK --|----[ ]----[/]---[ ]----[/]---[ ]----[/]---[ ]--+ |              |              |              | +---[ ]----[/]---[ ]----[/]---[ ]---[/]---[ ]--+    Timer1         Timer2    +---[ ]---+    +---[ ]---+    |         |    |         |    +---[/]---+    +---[/]---+      LED            LEDIn this ladder logic program:Each [ ] represents a normally open (NO) contact./ represents a normally closed (NC) contact.The timers Timer1 and Timer2 are cascaded to achieve the desired timing sequence.The LED output is controlled based on the completion of both timers.Here's a breakdown of the logic:Timer1 is a timer with a preset time of 10 minutes.Timer2 is a timer with a preset time of 15 minutes.When Timer1 completes its timing sequence, it energizes the output coil [/], and the LED associated with Timer1 turns on.Timer2 is enabled when the output coil [/] of Timer1 is energized.When Timer2 completes its timing sequence, it energizes the output coil [/], and the LED associated with Timer2 turns on.The LED output is controlled by the parallel combination of Timer1's output coil [/] and Timer2's output coil [/]. The LED turns on only when both timers have been completed.Please note that the ladder logic diagram provided is a visual representation, and the actual implementation may vary depending on the PLC programming software or hardware you are using.This output is closer to what we’re looking for, but depending on a person's experience level, this probably won’t be much help.  This particular diagram isn’t very descriptive of how everything is supposed to fit together, so if a person had to translate this to a specific programming software, this diagram would probably take a while to fully digest and implement.  
As such, we’re going to have to add another prompt that will be as follows,I need the variables assigned to the proper contacts.This subsequent prompt will produce the following:NETWORK --|----[StartButton]----[/Timer1.Enabled]---[Timer1.TimerDN]----[/Timer2.Enabled]---[Timer2.TimerDN]--+ |                                                                                             | +---[Timer1.Timer.Q]----[/LED1]---+---[Timer2.Timer.Q]----[/LED2]---+                                   |                                   |                                    +---[/LED]--------------------------+In this ladder logic program:StartButton is a boolean variable representing the input contact [StartButton] that starts the timer sequence.Timer1.Enabled and Timer2.Enabled are boolean variables representing the input contacts [/Timer1.Enabled] and [/Timer2.Enabled] that enable the respective timers.Timer1.TimerDN and Timer2.TimerDN are boolean variables representing the output contacts [Timer1.TimerDN] and [Timer2.TimerDN] that indicates the completion of the respective timers.Timer1.Timer.Q and Timer2.Timer.Q are boolean variables representing the output contacts [Timer1.Timer.Q] and [Timer2.Timer.Q] that indicates the running state of the respective timers.LED1 and LED2 are boolean variables representing the output contacts [/LED1] and [/LED2] that control the LEDs associated with Timer1 and Timer2.LED is a boolean variable representing the output contact [/LED] that controls the final LED indicating the completion of both timers.Please note that the actual variable names may vary depending on the PLC programming software or hardware you are using. You can assign these variables in your PLC programming software and connect them to the appropriate input and output terminals to implement the desired cascading timer logicThis result is much better as it shows how to wire the network together while the description gives a clear overview of how everything works.  With this, a person with a basic knowledge of Ladder Logic will be able to piece this together and port the rungs to other PLC programming programs. SummaryChatGPT can be leveraged as an excellent tool to help Ladder Logic developers with concocting specialized programs.  However, ChatGPT is NOT a replacement for programmers, and to effectively use the system, a person must be skilled enough to write descriptive prompts and interpret the results.  This means that though ChatGPT is an excellent tool, it does not have the intuition nor the skill to fully replace a programmer.A big part of using ChatGPT is learning to write and refine prompts as well as subsequent follow-up prompts.  These prompts are a developing art form that probably will be the next iteration of software development.  For now, the art of using ChatGPT and similar systems is novel, and there aren’t any definitive standards that govern how to effectively use these yet, especially when it comes to graphical programming such as Ladder Logic.  When used by a knowledgeable person that has a basic idea of PLC programming and ChatGPT, it can be a great way of getting over hurdles that could take hours or days to solve. Author BioM.T. White has been programming since the age of 12. His fascination with robotics flourished when he was a child programming microcontrollers such as Arduino. M.T. currently holds an undergraduate degree in mathematics, and a master's degree in software engineering, and is currently working on an MBA in IT project management. M.T. 
is currently working as a software developer for a major US defense contractor and is an adjunct CIS instructor at ECPI University. His background mostly stems from the automation industry where he programmed PLCs and HMIs for many different types of applications. M.T. has programmed many different brands of PLCs over the years and has developed HMIs using many different tools.
Author of the book: Mastering PLC Programming

ChatGPT as a Documentation Tool

M.T. White
22 Aug 2023
14 min read
It comes as no surprise that most developers do not like writing documentation.  As a result, documentation is often pushed to the side and, more often than not, haphazardly put together.  This is a serious problem since written documentation is the primary way developers communicate how a software system should work and be utilized.  As such, a poorly documented system can render it useless and confine it to the cyber trash heap long before its time.  However, with the rise of new tools like ChatGPT, poor documentation may no longer be a problem.  This article is going to explore how ChatGPT can be leveraged as a code documentation tool and examine the results to see just how well it can document code.What is quality documentation?If you ask five developers what they consider to be good documentation, you’re probably going to get five different answers.  What’s considered good documentation will vary among languages, IDEs, organizational coding standards, and preferences.  As such, for this article, “good” documentation is going to have the following parameters:1.     The documentation will provide context on how a block of code works or how to utilize it. 2.     The documentation will be logical and consistent.  This means that the documentation will be formatted and worded logically to provide context across all code blocks that ChatGPT documents.3.     The autogenerated code documentation will not clutter the code or distract the reader from the task at hand.Again, it is important to remember that there are many different opinions on what quality documentation looks like, and the points above are just high-level metrics to hit.    What is the documentation?Documentation is a very broad term that can mean a lot of things.  It can mean simply the documentation that comes packaged with the codebases like JavaDocs, or it could be a whole ecosystem that ranges from low-level code comments to YouTube videos and Wikipedia articles.  In short, documentation can be anything that helps relay information from one developer to another.  As such, for this article, the term documentation is going to be limited to code comments and high-level documentation like JavaDocs.  Also, most languages have their own best practices for documentation, so for this article, ChatGPT will generate documentation for Java.  Though Java will be the example language, there should only be minor tweaks, if any, to generate documentation for any other language. JavaDocsJavaDocs should provide high-level documentation to relay information about how to use a piece of code to a developer that does not have access to the codebase.  This means that key features like what the methods do, arguments, and return types should be noted.  With that in mind, let’s generate some JavaDocs for the following method,public static void add(int a, int b) {      System.out.println(a + b); }We’re going to generate the JavaDocs for this method using the following prompt:Generate some javadocs for this method public static void add(int a, int b) { System.out.println(a + b); }When the prompt is executed, it will produce the following output:/** * Calculates the sum of two integers and prints the result. * * @param a The first integer operand. * @param b The second integer operand. */ public static void add(int a, int b) {    System.out.println(a + b); }In this case, ChatGPT clearly defined what the arguments are, the datatypes of the arguments, and what the method is meant to do with straightforward prompts.  
It is important to note that this method is straightforward, so the next thing to explore is a more complex method.The next method to try to automatically document will be the following:public static double divide(int a, int b) { double div = 0;      if (a > b) {            div = a / b ;      }else {            div = b / a;      }      return div; }For this method, the following prompt will be run. Write only JavaDocs for this method public static double divide(int a, int b) { double div = 0; if (a > b) { div = a / b ; }else { div = b / a; } return div; }After the prompt is run, we should be met with something similar to the following:/** * Calculates the division of two numbers. * * @param a The numerator. * @param b The denominator. * @return The division result. */These JavaDocs look very similar to the JavaDocs for the simple method.  They provide the arguments, return type, and the purpose of the method.   The one thing that was kind of ambiguous is that ChatGPT attempted to identify the variables as always the numerator or denominator.  From the code, we can see that this isn’t always true; however, if a developer is checking, they should be able to correct it fairly easily. Now, at first glance, this will appear as more trouble than it is worth.  A codebase of significant size is going to have many classes that are comprised of many different methods.  As such, copying and pasting methods one by one is going to be a herculean task.  However, ChatGPT can still be leveraged to speed up high-level code documentation.  Consider the following prompt:Write JavaDocs for these methods public static void add(int a, int b) { System.out.println(a + b); } public static double divide(int a, int b) { double div = 0; if (a > b) { div = a / b ; }else { div = b / a; } return div; }This prompt will produce the following results:/** * Adds two numbers and prints the result. * * @param a The first number. * @param b The second number. */ /** * Calculates the division of two numbers and returns the result. * * @param a The numerator. * @param b The denominator. * @return The division result. */Again, even with multiple methods, ChatGPT returned JavaDocs that are quality and provided enough information for a developer to use, with the exception of the same hiccup for the division method that was previously explored.  In all, ChatGPT did a decent job with writing JavaDocs.Used in this manner, ChatGPT can greatly enhance the speed and quality of development.  When using ChatGPT, multiple methods can be pasted into ChatGPT, and all the necessary JavaDocs are created.  As anyone who has ever had to write this type of documentation knows, it can be very time-consuming and mistake-prone.  However, with ChatGPT, all one has to do is input the code and let the AI system do the rest.  Ultimately ChatGPT will free up developers for more pressing tasks, and the AI will produce quality JavaDocs without making rudimentary errors like forgetting to include a parameter or return type.  With that, another question is how well ChatGPT can be used to write code comments. Code CommentsIn many cases, it is difficult for a 3rd party to write effective code comments.  To provide the most context to the code, the entity that is writing the code should also write the comments.  As such, the first thing that needs to be explored is how ChatGPT writes code comments.  
To do this, consider the following code snippet:public static void add(int a, int b) {      System.out.println(a + b); }This code is very basic; however, in a quality codebase, even a method this simple should be documented.  This method would probably take as much time to document as it would be to write; this means it is a perfect candidate for ChatGPT.  To have ChatGPT document the code snippet runs the following prompt:Write a few comments to provide context for the following method public static void add(int a, int b) { System.out.println(a + b); }After running this prompt, ChatGPT will produce the following:public static void add(int a, int b) {    // This method takes two integer parameters, 'a' and 'b'.    // It adds the values of 'a' and 'b' together and prints the result.    System.out.println(a + b);    // The sum of 'a' and 'b' is calculated and printed to the console using System.out.println().    // This method does not return any value (void). }At first glance, one may notice that ChatGPT generated several comments.  This can be a pro as much as it can be a con since it can bloat the codebase.  With that, one may notice that the output from ChatGPT does have a couple of unnecessary comments that can bloat the code.  For example, there is a comment that says that the method takes in two integers, and another says that the method returns void.  For code comments, those comments are not necessary as they don’t contribute to the overall context of the code.  However, if the prompt is modified to,Write a comment that describes what this method does public static void add(int a, int b) { System.out.println(a + b); }It will produce the following:public static void add(int a, int b) {    // This method calculates the sum of two integers, 'a' and 'b',    // and displays the result by printing it to the console. }This prompt produced two comments that, when combined, form one complete sentence that describes what the method does. Now, this was for a relatively simple method.  As any programmer knows, a method will probably be larger and more complex than the add method.  As such, let’s consider the division method again:public static double divide(int a, int b) {      double div = 0;      if (a > b) {            div = a / b ;      }else {            div = b / a;      }      return div; }This method is a little more complex than the add method.  In this case, this method will return the quotient of a and b if a is greater than b or the quotient of b and a if b is greater than a.  With that, if the following prompt is inputted,Write comments to add context to this method public static double divide(int a, int b) { double div = 0; if (a > b) { div = a / b ; }else { div = b / a; } return div; }it will produce the following,public static double divide(int a, int b) {    // This method calculates the division of two integers and returns the result.    double div = 0;    // Variable to store the division result.    if (a > b) {        div = a / b;        // If 'a' is greater than 'b', divide 'a' by 'b' and assign the result to 'div'.    } else {        div = b / a;        // If 'b' is greater than or equal to 'a', divide 'b' by 'a' and assign the result to 'div'.    }    return div;    // Return the division result. }In short, ChatGPT will document every line of code again.  The comments do provide code context, but the comments don’t provide anything that can’t simply be read in the code.  
Also, if this prompt is modified, ChatGPT, at the time of writing this article, will still produce relatively useless comments.  Finally, if a prompt like the following is used:Write only the necessary comments for a developer to use public static double divide(int a, int b) { double div = 0; if (a > b) { div = a / b ; }else { div = b / a; } return div; }it will yield:// Calculates the division of two numbers and returns the result. // Parameters: //   a: The numerator. //   b: The denominator. // Returns: //   The division result.Of all the queries explored thus far, this one arguably yielded the best results.  In this case, the comments are what was generated in the last queries, but the format is much cleaner and easier to modify.  This block could be easily pasted into the codebase, and the unnecessary lines deleted. In all, it took a very specific prompt to produce marginally acceptable code comments.  At best, the code comments that were produced didn't provide any extra context that could not be deduced from reading the code.  It can also be said that the code comments that were produced bloated the code and, to some, may have made it more confusing.  As such, for code comments, ChatGPT probably isn't the best tool to use.  At best, a developer will have to remove unnecessary lines of comments and probably have to re-write many of them as well. There is also the issue of having to produce a prompt that is specific enough to generate proper comments. In all, whether a person should use ChatGPT as a code comment generator is up to them.  In theory, the comments produced could be leveraged in places like education, where code examples need to be heavily commented on to provide context to those who may not have a background in the language.  However, in terms of production code, though it will ultimately depend on the organization's coding standard, ChatGPT will not produce code comments that would be mergeable in many places.
Key Takeaways
In terms of codebase comments, ChatGPT is hit-and-miss.  As was seen, the code comments that ChatGPT produced were reminiscent of a college-level developer.  That is, ChatGPT commented on every line of code and only stated the obvious.  Since ChatGPT commented on every line of code, it can be argued that it bloated the codebase to a degree.  However, when a very specific prompt was run, it produced comments similar to what would be found in JavaDocs and what is expected by many organizations.  In terms of JavaDocs, though, ChatGPT shined.  The JavaDocs that ChatGPT produced were all very well written and provided the correct amount of information for a developer to easily digest and apply. As such, a few things can be summarized with what was explored.
1.     Queries have to be very specific when it comes to code comments.
2.     ChatGPT tends to produce unnecessary code comments that can bloat the codebase.
3.     Depending on the type/quality of code comments, ChatGPT may not be the ideal tool for automatic code documentation.
4.     ChatGPT produces documentation akin to JavaDocs better than comments in the codebase.
Summary
In summary, what constitutes quality code documentation is often up to a team.  However, by many standards, ChatGPT tends to produce unnecessary code comments that don't add much context and can easily bloat the codebase.  For higher-level documentation like JavaDocs, though, ChatGPT is an excellent tool that provides the proper amount of information.  
In all, it probably isn’t the best idea to use ChatGPT as a means to generate comments for software written by a human, but it can be used to quickly produce higher-level documentation such as JavaDocs. As was seen, multiple methods can easily be documented in a matter of seconds using ChatGPT.  As such, in terms of productivity, when it comes to higher-level documentation, ChatGPT can be a great productivity tool that could help speed up development. Author BioM.T. White has been programming since the age of 12. His fascination with robotics flourished when he was a child programming microcontrollers such as Arduino. M.T. currently holds an undergraduate degree in mathematics, and a master's degree in software engineering, and is currently working on an MBA in IT project management. M.T. is currently working as a software developer for a major US defense contractor and is an adjunct CIS instructor at ECPI University. His background mostly stems from the automation industry where he programmed PLCs and HMIs for many different types of applications. M.T. has programmed many different brands of PLCs over the years and has developed HMIs using many different tools.Author of the book: Mastering PLC Programming 

Hands-On Vector Similarity Search with Milvus

Alan Bernardo Palacio
21 Aug 2023
14 min read
IntroductionIn the realm of AI and machine learning, effective management of vast high-dimensional vector data is critical. Milvus, an open-source vector database, tackles this challenge using advanced indexing for swift similarity search and analytics, catering to AI-driven applications.Milvus operates on vectorization and quantization, converting complex raw data into streamlined high-dimensional vectors for efficient indexing and querying. Its scope spans recommendation, image recognition, natural language processing, and bioinformatics, boosting result precision and overall efficiency.Milvus impresses not just with capabilities but also design flexibility, supporting diverse backends like MinIO, Ceph, AWS S3, Google Cloud Storage, alongside etcd for metadata storage.Local Milvus deployment becomes user-friendly with Docker Compose, managing multi-container Docker apps well-suited for Milvus' distributed architecture. The guide delves into Milvus' core principles—vectorization and quantization—reshaping raw data into compact vectors for efficient querying. Its applications in recommendation, image recognition, natural language processing, and bioinformatics enhance system accuracy and efficacy.The next article details deploying Milvus locally via Docker Compose. This approach's simplicity underscores Milvus' user-centric design, delivering robust capabilities within an accessible framework. Let’s get started.Standalone Milvus with Docker ComposeSetting up a local instance of Milvus involves a multi-service architecture that consists of the Milvus server, metadata storage, and object storage server. Docker Compose provides an ideal environment to manage such a configuration in a convenient and efficient way.The Docker Compose file for deploying Milvus locally consists of three services: etcd, minio, and milvus itself. etcd provides metadata storage, minio functions as the object storage server and milvus handles vector data processing and search. By specifying service dependencies and environment variables, we can establish seamless communication between these components. milvus, etcd, and minio services are run in isolated containers, ensuring operational isolation and enhanced security.To launch the Milvus application, all you need to do is execute the Docker Compose file. Docker Compose manages the initialization sequence based on service dependencies and takes care of launching the entire stack with a single command. 
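Before looking at the file itself, it helps to know how to confirm that the deployment actually came up. The short snippet below is only a sketch (it assumes the pymilvus client is installed and that the default localhost:19530 port mapping from the compose file shown next is used); it simply connects to the standalone server and lists its collections, which should come back empty on a fresh deployment:
from pymilvus import connections, utility

# Connect to the standalone Milvus server started by Docker Compose
connections.connect(alias="default", host="localhost", port="19530")

# A brand-new deployment has no collections yet, so this prints an empty list
print(utility.list_collections())

# Close the connection once the check is done
connections.disconnect("default")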
The next is the docker-compose.yml which specifies all of the aforementioned components:version: '3' services: etcd:    container_name: milvus-etcd    image: quay.io/coreos/etcd:v3.5.5    environment:      - ETCD_AUTO_COMPACTION_MODE=revision      - ETCD_AUTO_COMPACTION_RETENTION=1000      - ETCD_QUOTA_BACKEND_BYTES=4294967296      - ETCD_SNAPSHOT_COUNT=50000    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls <http://0.0.0.0:2379> --data-dir /etcd minio:    container_name: milvus-minio    image: minio/minio:RELEASE.2022-03-17T06-34-49Z    environment:      MINIO_ACCESS_KEY: minioadmin      MINIO_SECRET_KEY: minioadmin    ports:      - "9001:9001"      - "9000:9000"    command: minio server /minio_data --console-address ":9001"    healthcheck:      test: ["CMD", "curl", "-f", "<http://localhost:9000/minio/health/live>"]      interval: 30s      timeout: 20s      retries: 3 milvus:    container_name: milvus-standalone    image: milvusdb/milvus:v2.3.0-beta    command: ["milvus", "run", "standalone"]    environment:      ETCD_ENDPOINTS: etcd:2379      MINIO_ADDRESS: minio:9000    ports:      - "19530:19530"      - "9091:9091"    depends_on:      - "etcd"      - "minio"After we have defined the docker-compose file we can deploy the services by first running docker compose build and then running docker compose up -d.In the next section, we'll move on to a practical example — creating sentence embeddings. This process leverages Transformer models to convert sentences into high-dimensional vectors. These embeddings capture the semantic essence of the sentences and serve as an excellent demonstration of the sort of data that can be stored and processed with Milvus.Creating sentence embeddingsCreating sentence embeddings involves a few steps: preparing your environment, importing necessary libraries, and finally, generating and processing the embeddings. We'll walk through each step in this section assuming that this code is being executed in a Python environment where the Milvus database is running.First, let’s start with the requirements.txt file:transformers==4.25.1 pymilvus==2.1.0 torch==2.0.1 protobuf==3.18.0 Now let’s import the packages. import numpy as np import torch import torch.nn.functional as F from transformers import AutoTokenizer, AutoModel from pymilvus import (    connections,    utility,    FieldSchema, CollectionSchema, DataType,    Collection, )Here, we're importing all the necessary libraries for our task. numpy and torch are used for mathematical operations and transformations, transformers is for language model-related tasks, and pymilvus is for interacting with the Milvus server.This Python code block sets up the transformer model we will be using and lists the sentences for which we will generate embeddings. We first specify a model checkpoint ("sentence-transformers/all-MiniLM-L6-v2") that will serve as our base model for sentence embeddings. We then define a list of sentences to generate embeddings for. To facilitate our task, we initialize a tokenizer and model using the model checkpoint. 
The tokenizer will convert our sentences into tokens suitable for the model, and the model will use these tokens to generate embeddings:# Transformer model checkpoint model_ckpt = "sentence-transformers/all-MiniLM-L6-v2" # Sentences for which we will compute embeddings sentences = [    "I took my dog for a walk",    "Today is going to rain",    "I took my cat for a walk", ] # Initialize tokenizer and model tokenizer = AutoTokenizer.from_pretrained(model_ckpt) model = AutoModel.from_pretrained(model_ckpt)Here, we define the model checkpoint that we will use to get the sentence embeddings. We then initialize a list of sentences for which we will compute embeddings. The tokenizer and model are initialized using the defined checkpoint.We've obtained token embeddings, but we need to aggregate them to obtain sentence-level embeddings. For this, we'll use a mean pooling operation. The upcoming section of the guide will define a function to accomplish this.Mean Pooling Function DefinitionThis function is used to aggregate the token embeddings into sentence embeddings. The token embeddings and the attention mask (which indicates which tokens are not padding and should be considered for pooling) are passed as inputs to this function. The function performs a weighted average of the token embeddings according to the attention mask and returns the aggregated sentence embeddings:# Mean pooling function to aggregate token embeddings into sentence embeddings def mean_pooling(model_output, attention_mask):    token_embeddings = model_output.last_hidden_state    input_mask_expanded = (        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()    )    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(       input_mask_expanded.sum(1), min=1e-9    )This function takes the model output and the attention mask as input and returns the sentence embeddings by performing a mean pooling operation over the token embeddings. The attention mask is used to ignore the tokens corresponding to padding during the pooling operation.Generating Sentence EmbeddingsThis code snippet first tokenizes the sentences, padding and truncating them as necessary. We then use the transformer model to generate token embeddings. These token embeddings are pooled using the previously defined mean pooling function to create sentence embeddings. The embeddings are normalized to ensure consistency and finally transformed into Python lists to make them compatible with Milvus:# Tokenize the sentences and compute their embeddings encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt") with torch.no_grad():    model_output = model(**encoded_input) sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"]) # Normalize the embeddings sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1) # Convert the sentence embeddings into a format suitable for Milvus embeddings = sentence_embeddings.numpy().tolist()In this section, we're using the transformer model to tokenize the sentences and generate their embeddings. We then normalize these embeddings and convert them to a format suitable for insertion into Milvus (Python lists).With the pooling function defined, we're now equipped to generate the actual sentence embeddings. These embeddings will then be processed and made ready for insertion into Milvus.Inserting vector embeddings into MilvusWe're now ready to interact with Milvus. 
In this section, we will connect to our locally deployed Milvus server, define a schema for our data, and create a collection in the Milvus database to store our sentence embeddings.Now, it's time to put our Milvus deployment to use. We will define the structure of our data, set up a connection to the Milvus server, and prepare our data for insertion:# Establish a connection to the Milvus server connections.connect("default", host="localhost", port="19530") # Define the schema for our collection fields = [    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),    FieldSchema(name="sentences", dtype=DataType.VARCHAR, is_primary=False, description="The actual sentences",                max_length=256),    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, is_primary=False, description="The sentence embeddings",                dim=sentence_embeddings.size()[1]) ] schema = CollectionSchema(fields, "A collection to store sentence embeddings")We establish a connection to the Milvus server and then define the schema for our collection in Milvus. The schema includes a primary key field, a field for the sentences, and a field for the sentence embeddings.With our connection established and schema defined, we can now create our collection, insert our data, and build an index to enable efficient search operations.Create Collection, Insert Data, and Create IndexIn this snippet, we first create the collection in Milvus using the previously defined schema. We then organize our data to match our collection's schema and insert it into our collection. After the data is inserted, we create an index on the embeddings to optimize search operations. Finally, we print the number of entities in the collection to confirm the insertion was successful:# Create the collection in Milvus sentence_embeddings_collection = Collection("sentence_embeddings", schema) # Organize our data to match our collection's schema entities = [    sentences,  # The actual sentences    embeddings,  # The sentence embeddings ] # Insert our data into the collection insert_result = sentence_embeddings_collection.insert(entities) # Create an index to make future search queries faster index = {    "index_type": "IVF_FLAT",    "metric_type": "L2",    "params": {"nlist": 128}, } sentence_embeddings_collection.create_index("embeddings", index) print(f"Number of entities in Milvus: {sentence_embeddings_collection.num_entities}")We create a collection in Milvus using the previously defined schema. We organize our data ( sentences, and sentence embeddings) and insert this data into the collection. Primary keys are generated as auto IDs so we don't need to add them. Finally, we print the number of entities in the collection:This way, the sentences, and their corresponding embeddings are stored in a Milvus collection, ready to be used for similarity searches or other tasks.Now that we've stored our embeddings in Milvus, let's make use of them. We will search for similar vectors in our collection based on similarity to sample vectors.Search Based on Vector SimilarityIn the code, we're loading the data from our collection into memory and then defining the vectors that we want to find similar vectors for:# Load the data into memory sentence_embeddings_collection.load()This step is necessary to load the data in our collection into memory before conducting a search or a query. 
The search parameters are set, specifying the metric to use for calculating similarity (L2 distance in this case) and the number of clusters to examine during the search operation. The search operation is then performed, and the results are printed out:# Vectors to search vectors_to_search = embeddings[-2:] search_params = {    "metric_type": "L2",    "params": {"nprobe": 10}, } # Perform the search result = sentence_embeddings_collection.search(vectors_to_search, "embeddings", search_params, limit=3, output_fields=["sentences"]) # Print the search results for hits in result:    for hit in hits:        print(f"hit: {hit}, sentence field: {hit.entity.get('sentences')}")Here, we're searching for the two most similar sentence embeddings to the last two embeddings in our list. The results are limited to the top 3 matches, and the corresponding sentences of these matches are printed out:Once we're done with our data, it's a good practice to clean up. In this section, we'll explore how to delete entities from our collection using their primary keys.Delete Entities by Primary KeyThis code first gets the primary keys of the entities that we want to delete. We then query the collection before the deletion operation to show the entities that will be deleted. The deletion operation is performed, and the same query is run after the deletion operation to confirm that the entities have been deleted:# Get the primary keys of the entities we want to delete ids = insert_result.primary_keys expr = f'pk in [{ids[0]}, {ids[1]}]' # Query before deletion result = sentence_embeddings_collection.query(expr=expr, output_fields=["sentences", "embeddings"]) print(f"Query before delete by expr=`{expr}` -> result: \\\\n-{result[0]}\\\\n-{result[1]}\\\\n") # Delete entities sentence_embeddings_collection.delete(expr) # Query after deletion result = sentence_embeddings_collection.query(expr=expr, output_fields=["sentences", "embeddings"]) print(f"Query after delete by expr=`{expr}` -> result: {result}\\\\n")Here, we're deleting the entities corresponding to the first two primary keys in our collection. Before and after the deletion, we perform a query to see the result of the deletion operation:Finally, we drop the entire collection from the Milvus server:# Drop the collection utility.drop_collection("sentence_embeddings")This code first gets the primary keys of the entities that we want to delete. We then query the collection before the deletion operation to show the entities that will be deleted. The deletion operation is performed, and the same query is run after the deletion operation to confirm that the entities have been deleted.ConclusionCongratulations on completing this hands-on tutorial with Milvus! You've learned how to harness the power of an open-source vector database that simplifies and accelerates AI and ML applications. Throughout this journey, you set up Milvus locally using Docker Compose, transformed sentences into high-dimensional embeddings and conducted vector similarity searches for practical use cases.Milvus' advanced indexing techniques have empowered you to efficiently store, search, and analyze large volumes of vector data. Its user-friendly design and seamless integration capabilities ensure that you can leverage its powerful features without unnecessary complexity.As you continue exploring Milvus, you'll uncover even more possibilities for its application in diverse fields, such as recommendation systems, image recognition, and natural language processing. 
The high-performance similarity search and analytics offered by Milvus open doors to cutting-edge AI-driven solutions.With your newfound expertise in Milvus, you are equipped to embark on your own AI adventures, leveraging the potential of vector databases to tackle real-world challenges. Continue experimenting, innovating, and building AI-driven applications that push the boundaries of what's possible. Happy coding!Author Bio:Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder in startups, and later on earned a Master's degree from the faculty of Mathematics in the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.LinkedIn 

Detecting Anomalies Using LLM Sentence Embeddings

Alan Bernardo Palacio
21 Aug 2023
18 min read
IntroductionText classification tasks such as natural language inference (NLI) are a central part of modern natural language processing (NLP). In this article, we present an application of unsupervised machine learning techniques to detect anomalies in the MultiNLI dataset.Our aim is to use unsupervised Large Language Models (LLM) to create embeddings and discover patterns and relationships within the data. We'll preprocess the data, generate sentence pair embeddings, and use the Out-Of-Distribution (OOD) module from the cleanlab Python package to get outlier scores.Importing Libraries and Setting SeedsThe following block of code is essentially the initial setup phase of our data processing and analysis script. Here, we import all the necessary libraries and packages that will be used throughout the code. First, we need to install some of the necessary libraries:
!pip install cleanlab datasets hdbscan nltk matplotlib numpy torch transformers umap-learn
It is highly recommended to use Google Colab with GPUs or TPUs to be able to create the embeddings in a proper amount of time.Now we can start with importing the packages and setting the seed:
import cleanlab
import datasets
import hdbscan
import nltk
import matplotlib.pyplot as plt
import numpy as np
import re
import torch
from cleanlab.outlier import OutOfDistribution
from datasets import load_dataset, concatenate_datasets
from IPython.display import display
from sklearn.metrics import precision_recall_curve
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel
from umap import UMAP

nltk.download('stopwords')
datasets.logging.set_verbosity_error()

SEED = 42  # any fixed integer works; it only needs to stay constant across runs
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.cuda.manual_seed_all(SEED)
Here's what each imported library/package does:
cleanlab: A package used for finding label errors in datasets and learning with noisy labels.
datasets: Provides easy-to-use, high-level APIs for downloading and preparing datasets for modeling.
hdbscan: A clustering algorithm that combines the benefits of hierarchical clustering and density-based spatial clustering of applications with noise (DBSCAN).
nltk: Short for Natural Language Toolkit, a leading platform for building Python programs to work with human language data.
torch: PyTorch is an open-source machine learning library based on the Torch library, used for applications such as natural language processing.
This part of the code also downloads the NLTK (Natural Language Toolkit) stopwords. Stopwords are words like 'a', 'an', and 'the', which are not typically useful for modeling and are often removed during pre-processing. The datasets.logging.set_verbosity_error() sets the logging level to error. This means that only the messages with the level error or above will be displayed.The code also sets some additional properties for CUDA operations (if a CUDA-compatible GPU is available), which can help ensure consistency across different executions of the code.Dataset Preprocessing and LoadingThe following block of code represents the next major phase: preprocessing and loading the datasets. 
This is where we clean and prepare our data so that it can be fed into our LLM models:def preprocess_datasets(    *datasets,    sample_sizes = [5000, 450, 450],    columns_to_remove = ['premise_binary_parse', 'premise_parse', 'hypothesis_binary_parse', 'hypothesis_parse', 'promptID', 'pairID', 'label'], ):    # Remove -1 labels (no gold label)    f = lambda ex: ex["label"] != -1    datasets = [dataset.filter(f) for dataset in datasets]    # Sample a subset of the data    assert len(sample_sizes) == len(datasets), "Number of datasets and sample sizes must match"    datasets = [        dataset.shuffle(seed=SEED).select([idx for idx in range(sample_size)])        for dataset, sample_size in zip(datasets, sample_sizes)    ]    # Remove columns    datasets = [data.remove_columns(columns_to_remove) for data in datasets]    return datasetsThis is a function definition for preprocess_datasets, which takes any number of datasets (with their sample sizes and columns to be removed specified as lists). The function does three main things:Filtering: Removes examples where the label is -1. A label of -1 means that there is no gold label for that example.Sampling: Shuffles the datasets and selects a specific number of examples based on the provided sample_sizes.Removing columns: Drops specific columns from the dataset as per the columns_to_remove list.train_data = load_dataset("multi_nli", split="train") val_matched_data = load_dataset("multi_nli", split="validation_matched") val_mismatched_data = load_dataset("multi_nli", split="validation_mismatched") train_data, val_matched_data, val_mismatched_data = preprocess_datasets(    train_data, val_matched_data, val_mismatched_data )The above lines load the train and validation datasets from multi_nli (a multi-genre natural language inference corpus) and then preprocess them using the function we just defined.Finally, we print the genres available in each dataset and display the first few records using the Pandas data frame. This is useful to confirm that our datasets have been loaded and preprocessed correctly:print("Training data") print(f"Genres: {np.unique(train_data['genre'])}") display(train_data.to_pandas().head()) print("Validation matched data") print(f"Genres: {np.unique(val_matched_data['genre'])}") display(val_matched_data.to_pandas().head()) print("Validation mismatched data") print(f"Genres: {np.unique(val_mismatched_data['genre'])}") display(val_mismatched_data.to_pandas().head())With the help of this block, we have our datasets loaded and preprocessed, ready to be transformed into vector embeddings.Sentence Embedding and TransformationNow, we proceed to the next crucial step, transforming our textual data into numerical vectors. This is where text or sentence embeddings come into play.In simple terms, sentence embeddings are the numerical representations of sentences. Just as words can be represented by dense vectors (a process known as word embeddings), entire sentences can also be encoded into vectors. This transformation process facilitates mathematical operations on text, making it possible for machine learning algorithms to perform tasks like text classification, sentence similarity, sentiment analysis, and more.To produce high-quality sentence embeddings, the context of each word in the sentence and the semantics should be considered. 
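Before looking at the models that produce them, a small illustrative sketch helps make the idea of "mathematical operations on text" concrete. The vectors below are made-up three-dimensional stand-ins for real sentence embeddings (which typically have hundreds of dimensions) and are not part of this article's pipeline; they only show how cosine similarity scores the closeness of two embeddings:
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors standing in for the embeddings of three sentences
emb_similar_1 = np.array([0.9, 0.1, 0.3])
emb_similar_2 = np.array([0.85, 0.15, 0.35])
emb_different = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(emb_similar_1, emb_similar_2))  # close to 1: similar meaning
print(cosine_similarity(emb_similar_1, emb_different))  # noticeably lower: different topic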
Transformer-based models, like BERT, DistilBERT, or RoBERTa, are very effective in creating these contextual sentence embeddings.Now, let's explain the next block of code:#Mean Pooling - Take attention mask into account for correct averaging def mean_pooling(model_output, attention_mask):    token_embeddings = model_output[0]    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)This function mean_pooling is used to calculate the mean of all token embeddings that belong to a single sentence. The function receives the model_output (containing the token embeddings) and an attention_mask (indicating where actual tokens are and where padding tokens are in the sentence). The mask is used to correctly compute the average over the length of each sentence, ignoring the padding tokens.The function embed_sentence_pairs processes the sentence pairs, creates their embeddings, and stores them. It uses a data loader (which loads data in batches), a tokenizer (to convert sentences into model-understandable format), and a pre-trained language model (to create the embeddings).The function is a vital part of the sentence embedding process. This function uses a language model to convert pairs of sentences into high-dimensional vectors that represent their combined semantics. Here's an annotated walkthrough:def embed_sentence_pairs(dataloader, tokenizer, model, disable_tqdm=False):    # Empty lists are created to store the embeddings of premises and hypotheses    premise_embeddings  = []    hypothesis_embeddings = []    feature_embeddings = []    # The device (CPU or GPU) to be used for computations is determined    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")    # The model is moved to the chosen device and set to evaluation mode    model.to(device)    model.eval()    # A loop is set up to iterate over the data in the dataloader    loop = tqdm(dataloader, desc=f"Embedding sentences...", disable=disable_tqdm)    for data in loop:        # The premise and hypothesis sentences are extracted from the data       premise, hypothesis = data['premise'], data['hypothesis']        # The premise and hypothesis sentences are encoded into a format that the model can understand        encoded_premise, encoded_hypothesis = (            tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')            for sentences in (premise, hypothesis)        )        # The model computes token embeddings for the encoded sentences        with torch.no_grad():            encoded_premise = encoded_premise.to(device)            encoded_hypothesis = encoded_hypothesis.to(device)            model_premise_output = model(**encoded_premise)            model_hypothesis_output = model(**encoded_hypothesis)        # Mean pooling is performed on the token embeddings to create sentence embeddings        pooled_premise = mean_pooling(model_premise_output, encoded_premise['attention_mask']).cpu().numpy()        pooled_hypothesis = mean_pooling(model_hypothesis_output, encoded_hypothesis['attention_mask']).cpu().numpy()        # The sentence embeddings are added to the corresponding lists        premise_embeddings.extend(pooled_premise)        hypothesis_embeddings.extend(pooled_hypothesis)    # The embeddings of the premises and hypotheses are concatenated along with their absolute difference    feature_embeddings = np.concatenate(        [     
       np.array(premise_embeddings),            np.array(hypothesis_embeddings),            np.abs(np.array(premise_embeddings) - np.array(hypothesis_embeddings))        ],        axis=1    )    return feature_embeddingsThis function does all the heavy lifting of turning raw textual data into dense vectors that machine learning algorithms can use. It takes in a dataloader, which feeds batches of sentence pairs into the function, a tokenizer to prepare the input for the language model, and the model itself to create the embeddings.The embedding process involves first tokenizing each sentence pair and then feeding the tokenized sentences into the language model. This yields a sequence of token embeddings for each sentence. To reduce these sequences to a single vector per sentence, we apply a mean pooling operation, which takes the mean of all token vectors in a sentence, weighted by their attention masks.Finally, the function concatenates the embeddings of the premise and hypothesis of each pair, along with the absolute difference between these two embeddings. This results in a single vector that represents both the individual meanings of the sentences and the semantic relationship between them. The absolute difference between the premise and hypothesis embeddings helps to capture the semantic contrast in the sentence pair.These concatenated embeddings, returned by the function, serve as the final input features for further machine-learning tasks.The function begins by setting the device to GPU if it's available. It sets the model to evaluation mode using model.eval(). Then, it loops over the data loader, retrieving batches of sentence pairs.For each sentence pair, it tokenizes the premise and hypothesis using the provided tokenizer. The tokenized sentences are then passed to the model to generate the model outputs. Using these outputs, mean pooling is performed to generate sentence-level embeddings.Finally, the premise and hypothesis embeddings are concatenated along with their absolute difference, resulting in our final sentence pair embeddings. These combined embeddings capture the information from both sentences and the relational information between them, which are stored in feature_embeddings.These feature embeddings are critical and are used as input features for the downstream tasks. Their high-dimensional nature contains valuable semantic information which can help in various NLP tasks such as text classification, information extraction, and more.Sentence Embedding and TokenizingThis block of code takes care of model loading, data preparation, and finally, the embedding process for each sentence pair in our datasets. 
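A quick way to sanity-check the output of embed_sentence_pairs is to inspect its shape once the embeddings have been produced. Assuming the MiniLM encoder used below, which outputs 384-dimensional sentence vectors (other models differ), concatenating premise, hypothesis, and their absolute difference gives 3 × 384 = 1152 features per pair:

# hypothetical check, run after the embeddings below have been computed
print(train_embeddings.shape)        # expected: (5000, 1152) for the 5000 sampled training pairs
print(val_matched_embeddings.shape)  # expected: (450, 1152)

Next comes the code that loads the encoder and produces these feature embeddings for our three splits.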
Here's an annotated walkthrough:# Pretrained SentenceTransformers handle this task better than regular Transformers model_name = 'sentence-transformers/all-MiniLM-L6-v2' # Uncomment the following line to try a regular Transformers model trained on MultiNLI # model_name = 'sileod/roberta-base-mnli' # Instantiate the tokenizer and model from the pretrained transformers on the Hugging Face Hub tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModel.from_pretrained(model_name) batch_size = 128 # Prepare the PyTorch DataLoaders for each of the train, validation matched, and validation mismatched datasets trainloader = DataLoader(train_data, batch_size=batch_size, shuffle=False) valmatchedloader = DataLoader(val_matched_data, batch_size=batch_size, shuffle=False) valmismatchedloader = DataLoader(val_mismatched_data, batch_size=batch_size, shuffle=False) # Use the embed_sentence_pairs function to create embeddings for each dataset train_embeddings = embed_sentence_pairs(trainloader, tokenizer, model, disable_tqdm=True) val_matched_embeddings = embed_sentence_pairs(valmatchedloader, tokenizer, model, disable_tqdm=True) val_mismatched_embeddings = embed_sentence_pairs(valmismatchedloader, tokenizer, model, disable_tqdm=True)This block begins by setting the model_name variable to the identifier of a pretrained SentenceTransformers model available on the Hugging Face Model Hub. SentenceTransformers are transformer-based models specifically trained for generating sentence embeddings, so they are generally more suitable for this task than regular transformer models. The MiniLM model was chosen for its relatively small size and fast inference times, but provides performance comparable to much larger models. If you wish to experiment with a different model, you can simply change the identifier.Next, the tokenizer and model corresponding to the model_name are loaded using the from_pretrained method, which fetches the necessary components from the Hugging Face Model Hub and initializes them for use.The DataLoader utility from the PyTorch library is then used to wrap our Hugging Face datasets. The DataLoader handles the batching of the data and provides an iterable over the dataset, which will be used by our embed_sentence_pairs function. The batch size is set to 128, which means that the model processes 128 sentence pairs at a time.Finally, the embed_sentence_pairs function is called for each of our data loaders (train, validation matched, and validation mismatched), returning the corresponding embeddings for each sentence pair in these datasets. These embeddings will be used as input features for our downstream tasks.Outlier Detection in DatasetsIn the realm of machine learning, outliers often pose a significant challenge. These unusual or extreme values can cause the model to make erroneous decisions based on data points that don't represent the general trend or norm in the data. Therefore, an essential step in data preprocessing for machine learning is identifying and handling these outliers effectively.In our project, we make use of the OutOfDistribution object from the cleanlab Python package to conduct outlier detection. The OutOfDistribution method computes an outlier score for each data point based on how well it fits within the overall distribution of the data. 
The lower the outlier score, the more anomalous the data point is considered to be (cleanlab's scores are close to 1 for typical, in-distribution points and drop towards 0 for likely outliers). Let's take a detailed look at how this is achieved in the code:

ood = OutOfDistribution()
train_outlier_scores = ood.fit_score(features=train_embeddings)

In the first step, we instantiate the OutOfDistribution object. Then, we fit this object to our training data embeddings and calculate outlier scores for each data point in the training data:

top_train_outlier_idxs = (train_outlier_scores).argsort()[:15]
top_train_outlier_subset = train_data.select(top_train_outlier_idxs)
top_train_outlier_subset.to_pandas().head()

Next, we select the 15 training data points with the lowest outlier scores, that is, the most anomalous ones (argsort sorts in ascending order, so the first indices correspond to the lowest scores). These data points are then displayed for manual inspection, helping us understand the nature of these outliers. We then apply a similar process to our validation data:

test_feature_embeddings = np.concatenate([val_matched_embeddings, val_mismatched_embeddings], axis=0)
test_outlier_scores = ood.score(features=test_feature_embeddings)
test_data = concatenate_datasets([val_matched_data, val_mismatched_data])

First, we concatenate the matched and mismatched validation embeddings. Then, we calculate the outlier scores for each data point in this combined validation dataset using the previously fitted OutOfDistribution object:

top_outlier_idxs = (test_outlier_scores).argsort()[:20]
top_outlier_subset = test_data.select(top_outlier_idxs)
top_outlier_subset.to_pandas()

Lastly, we identify the 20 validation data points with the lowest outlier scores. Similar to our approach with the training data, these potential outliers are selected and visualized for inspection. By conducting this outlier analysis, we gain valuable insights into our data. These insights can inform our decisions on data preprocessing steps, such as outlier removal or modification, to potentially enhance the performance of our machine learning model.

Evaluating Outlier Scores and Setting a Threshold

Once we have determined the outlier scores for each data point, the next step is to set a threshold for what we will consider an "outlier." While there are various statistical methods to determine this threshold, one simple and commonly used approach is to use percentiles.

In this project, we set the threshold at the 2.5th percentile of the outlier scores. This choice implies that we treat the bottom 2.5% of scores, the points that fit the overall distribution worst, as outliers. Let's look at how this is implemented in the code:

threshold = np.percentile(test_outlier_scores, 2.5)

The code above calculates the 2.5th percentile of the outlier scores and sets this value as our threshold for outliers. Note that, as written, it is computed over the test outlier scores; to derive the threshold from the training distribution instead, pass train_outlier_scores to np.percentile.

Next, we visualize the distribution of outlier scores for both the training and test data:

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
plt_range = [min(train_outlier_scores.min(), test_outlier_scores.min()),
             max(train_outlier_scores.max(), test_outlier_scores.max())]
axes[0].hist(train_outlier_scores, range=plt_range, bins=50)
axes[0].set(title='train_outlier_scores distribution', ylabel='Frequency')
axes[0].axvline(x=threshold, color='red', linewidth=2)
axes[1].hist(test_outlier_scores, range=plt_range, bins=50)
axes[1].set(title='test_outlier_scores distribution', ylabel='Frequency')
axes[1].axvline(x=threshold, color='red', linewidth=2)

In the histogram, the red vertical line represents the threshold value.
By observing the distributions and where the threshold falls, we get a visual representation of what proportion of our data is considered "outlying.":Finally, we select the outliers from our test data based on this threshold:sorted_ids = test_outlier_scores.argsort() outlier_scores = test_outlier_scores[sorted_ids] outlier_ids = sorted_ids[outlier_scores < threshold] selected_outlier_subset = test_data.select(outlier_ids) selected_outlier_subset.to_pandas().tail(15)This piece of code arranges the outlier scores in ascending order, determines which data points fall below the threshold (hence are considered outliers), and selects these data points from our test data. The bottom 15 rows of this selected outlier subset are then displayed:By setting and applying this threshold, we can objectively identify and handle outliers in our data. This process helps improve the quality and reliability of our LLM models.ConclusionThis article focuses on detecting anomalies in multi-genre NLI datasets using advanced tools and techniques, from preprocessing with transformers to outlier detection. The MultiNLI dataset was streamlined using Hugging Face's datasets library, enhancing manageability. Exploring sentence embeddings, transformers library generated robust representations by averaging token embeddings with mean_pooling. Outliers were identified using cleanlab library and visualized via plots and tables, revealing data distribution and characteristics.A threshold was set based on the 2.5th percentile of outlier scores, aiding anomaly identification in the test dataset. The study showcases the potential of Large Language Models in NLP, offering efficient solutions to complex tasks. This exploration enriches dataset understanding and highlights LLM's impressive capabilities, underlining its impact on previously daunting challenges. The methods and libraries employed demonstrate the current LLM technology's prowess, providing potent solutions. By continuously advancing these approaches, NLP boundaries are pushed, paving the way for diverse research and applications in the future.Author Bio:Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder in startups, and later on earned a Master's degree from the faculty of Mathematics in the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.LinkedIn

Deploying LLM Models in Kubernetes with KFServing

Alan Bernardo Palacio
21 Aug 2023
14 min read
Deploying LLM models, like Hugging Face transformer library's extractive question-answering model, is popular in NLP. Learn to deploy LLM models in Kubernetes via KFServing. Utilize Hugging Face's transformers library to deploy an extractive question-answering model. KFServing ensures standard model serving with features like explainability and model management. Set up KFServing, craft a Python model server, build a Docker image, and deploy to Kubernetes with Minikube.IntroductionDeploying machine learning models to production is a critical step in turning research and development efforts into practical applications. In this tutorial, we will explore how to deploy Language Model (LLM) models in a Kubernetes cluster using KFServing. We will leverage the power of KFServing to simplify the model serving process, achieve scalability, and ensure seamless integration with existing infrastructure.To illustrate the relevance of deploying LLM models, let's consider a business use case. Imagine you are building an intelligent chatbot that provides personalized responses to customer queries. By deploying an LLM model, the chatbot can generate contextual and accurate answers, enhancing the overall user experience. With KFServing, you can easily deploy and scale the LLM model, enabling real-time interactions with users.By the end of this tutorial, you will have a solid understanding of deploying LLM models with KFServing and be ready to apply this knowledge to your own projects.Architecture OverviewBefore diving into the deployment process, let's briefly discuss the architecture. Our setup comprises a Kubernetes cluster running in Minikube, KFServing as a framework to deploy the services, and a custom LLM model server. The Kubernetes cluster provides the infrastructure for deploying and managing the model. KFServing acts as a serving layer that facilitates standardized model serving across different frameworks. Finally, the custom LLM model server hosts the pre-trained LLM model and handles inference requests.Prerequisites and SetupTo follow along with this tutorial, ensure that you have the following prerequisites:A Kubernetes cluster: You can set up a local Kubernetes cluster using Minikube or use a cloud-based Kubernetes service like Google Kubernetes Engine (GKE) or Amazon Elastic Kubernetes Service (EKS).Docker: Install Docker to build and containerize the custom LLM model server.Python and Dependencies: Install Python and the necessary dependencies, including KFServing, Transformers, TensorFlow, and other required packages. You can find a list of dependencies in the requirements.txt file.Now that we have our prerequisites, let's proceed with the deployment process.Introduction to KFServingKFServing is designed to provide a standardized way of serving machine learning models across organizations. It offers high abstraction interfaces for common ML frameworks like TensorFlow, PyTorch, and more. By leveraging KFServing, data scientists and MLOps teams can collaborate seamlessly from model production to deployment. KFServing can be easily integrated into existing Kubernetes and Istio stacks, providing model explainability, inference graph operations, and other model management functions.Setting Up KFServingTo begin, we need to set up KFServing on a Kubernetes cluster. For this tutorial, we'll use the local quick install method on a Minikube Kubernetes cluster. 
The quick install method allows us to install Istio and KNative without the full Kubeflow setup, making it ideal for local development and testing.Start by installing the necessary dependencies: kubectl, and Helm 3. We will assume that they are already set up. Then, follow the Minikube install instructions to complete the setup. Adjust the memory and CPU settings for Minikube to ensure smooth functioning. Once the installation is complete, start Minikube and verify the cluster status using the following commands:minikube start --memory=6144 minikube statusThe kfserving-custom-model requests at least 4Gi of memory, so in this case, we provide it with a bit more.Building a Custom Python Model ServerNow, we'll focus on the code required to build a custom Python model server for the Hugging Face extractive question-answering model. We'll use the KFServing model class and implement the necessary methods. We will start by understanding the code that powers the custom LLM model server. The server is implemented using Python and leverages the Hugging Face transformer library.Let’s start by creating a new Python file and naming it kf_model_server.py. Import the required libraries and define the KFServing_BERT_QA_Model class that inherits from kfserving.KFModel. This class will handle the model loading and prediction logic:# Import the required libraries and modules import kfserving from typing import List, Dict from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering import tensorflow as tf import base64 import io # Define the custom model server class class kf_serving_model (kfserving.KFModel):    def __init__(self, name: str):        super().__init__(name)        self.name = name        self.ready = False        self.tokenizer = None    def load(self):        # Load the pre-trained model and tokenizer        self.tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")        self.model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")        self.ready = True    def predict(self, request: Dict) -> Dict:        inputs = request["instances"]        # Perform inference on the input instances        source_text = inputs[0]["text"]        questions = inputs[0]["questions"]        results = {}        for question in questions:            # Tokenize the question and source text            inputs = self.tokenizer.encode_plus(question, source_text, add_special_tokens=True, return_tensors="tf")            input_ids = inputs["input_ids"].numpy()[0]            answer_start_scores, answer_end_scores = self.model(inputs)            # Extract the answer from the scores            answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]            answer_end = (tf.argmax(answer_end_scores, axis=1) + 1).numpy()[0]            answer = self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))            results[question] = answer        return {"predictions": results}   if __name__ == "__main__":    model = kf_serving_model("kfserving-custom-model")    model.load()    kfserving.KFServer(workers=1).start([model])In the above code, we define the kf_serving_model class that inherits from kfserving.KFModel and initializes the model and tokenizer. The class encapsulates the model loading and prediction logic. The load() method loads the pre-trained model and tokenizer from the Hugging Face library. 
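Before wiring the model into KFServing, it can be useful to confirm that the weights load and answer questions on their own. The following stand-alone sketch mirrors the server code above (same model name and inference pattern); the question and context strings are made up for the test:

from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

question = "What does KFServing provide?"
context = "KFServing provides a standardized way of serving machine learning models on Kubernetes."

# same encode/argmax pattern as the predict() method shown above
inputs = tokenizer.encode_plus(question, context, add_special_tokens=True, return_tensors="tf")
answer_start_scores, answer_end_scores = model(inputs)
answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
answer_end = (tf.argmax(answer_end_scores, axis=1) + 1).numpy()[0]
print(tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"].numpy()[0][answer_start:answer_end])))

Note that the first run downloads the BERT-large weights (over a gigabyte), which are then cached locally; the server's load() method triggers the same download inside the container.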
The predict() method takes the input JSON and performs inference using the model. It generates question-answer pairs and returns them in the response.Before we proceed, let's discuss some best practices for deploying LLM models with KFServing:Model Versioning: Maintain different versions of the LLM model to support A/B testing, rollback, and easy model management.Scalability: Design the deployment to handle high traffic loads by optimizing resource allocation and leveraging horizontal scaling techniques.Monitoring and Error Handling: Implement robust logging and monitoring mechanisms to track model performance, detect anomalies, and handle errors gracefully.Performance Optimization: Explore techniques like batch processing, parallelization, and caching to optimize the inference speed and resource utilization of the deployed model.Now that we have a good understanding of the code and best practices, let's proceed with the deployment process.Deployment Steps:For the deployment, first, we need to set up the Kubernetes cluster and ensure it is running smoothly. You can use Minikube or a cloud-based Kubernetes service. Once the cluster is running, we install the KFServing CRD by cloning the KFServing repository and navigating to the cloned directory:git clone git@github.com:kubeflow/kfserving.git cd kfservingNow we install the necessary dependencies using the hack/quick_install.sh script:./hack/quick_install.shTo deploy our custom model server, we need to package it into a Docker container image. This allows for easy distribution and deployment across different environments.Building a Docker Image for the Model ServerLet’s create the Docker image by creating a new file named Dockerfile in the same directory as the Python file:# Use the official lightweight Python image. FROM python:3.7-slim ENV APP_HOME /app WORKDIR $APP_HOME # Install production dependencies. COPY requirements.txt ./ RUN pip install --no-cache-dir -r ./requirements.txt # Copy local code to the container image COPY kf_model_server.py ./ CMD ["python", "kf_model_server.py"] The Dockerfile specifies the base Python image, sets the working directory, installs the dependencies from the requirements.txt file, and copies the Python code into the container. Here we will be running this locally on a CPU, so we will be using tensorflow-cpu for the application:kfserving==0.3.0 transformers==2.1.1 tensorflow-cpu==2.2.0 protobuf==3.20.0To build the Docker image, execute the following command:docker build -t kfserving-custom-model .This command builds the container image using the Dockerfile and tags it with the specified name.When you build a Docker image using docker build -t kfserving-custom-model ., the image is only available in your local Docker environment. 
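Since this tutorial runs on Minikube, one convenient shortcut is to make the locally built image visible to the cluster directly, for example (standard Minikube tooling; exact commands may vary with your Minikube version):

# build the image inside Minikube's Docker daemon
eval $(minikube docker-env)
docker build -t kfserving-custom-model .

# or load an already-built local image into the cluster
minikube image load kfserving-custom-model:latest

If you take this route, set imagePullPolicy to Never or IfNotPresent in the deployment manifest so Kubernetes does not try to pull the image from a remote registry. The more general approach, described next, is to push the image to a registry.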
Kubernetes can't access images from your local Docker environment unless you're using a tool like Minikube or kind with a specific configuration to allow this.To make the image available to Kubernetes, you need to push it to a Docker registry like Docker Hub, Google Container Registry (GCR), or any other registry accessible to your Kubernetes cluster.Here are the general steps you need to follow:Tag your image with the registry address:If you are using Docker Hub, the command is:docker tag kfserving-custom-model:latest <your-dockerhub-username>/kfserving-custom-model:latestPush the image to the registry:For Docker Hub, the command is:docker push <your-dockerhub-username>/kfserving-custom-model:latestMake sure to replace <your-dockerhub-username> with your actual Docker Hub username. Also, ensure that your Kubernetes cluster has the necessary credentials to pull from the registry if it's private. If it's a public Docker Hub repository, there should be no issues.Deploying the Custom Model Server on KFServingNow that we have the Docker image, we can deploy the custom model server as an InferenceService on KFServing. We'll use a YAML configuration file to describe the Kubernetes model resource. Create a file named deploy_server.yaml and populate it with the following content:apiVersion: serving.kserve.io/v1beta1 kind: InferenceService metadata: labels:    controller-tools.k8s.io: "1.0" name: kfserving-custom-model spec: predictor:    containers:    - image: <your-dockerhub-username>/kfserving-custom-model:latest      name: kfserving-container      resources:        requests:          memory: "4096Mi"          cpu: "250m"        limits:          memory: "4096Mi"          cpu: "500m"The YAML file defines the model's metadata, including the name and labels. It specifies the container image to use, along with resource requirements for memory and CPU.To deploy the model, run the following command:kubectl apply -f deploy_server.yamlThis command creates the InferenceService resource in the Kubernetes cluster, deploying the custom model server.Verify the deployment status:kubectl get inferenceservicesThis should show you the status of the inference service:We can see that the containers have downloaded the BERT model and now there are ready to start receiving inference calls.Making an Inference Call with the KFServing-Hosted ModelOnce the model is deployed on KFServing, we can make inference calls to the locally hosted Hugging Face QA model. To do this, we'll need to set up port forwarding to expose the model's port to our local system.Execute the following command to determine if your Kubernetes cluster is running in an environment that supports external load balancerskubectl get svc istio-ingressgateway -n istio-systemNow we can do Port Forward for testing purposes:INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}') kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80 # start another terminal export INGRESS_HOST=localhost export INGRESS_PORT=8080This command forwards port 8080 on our local system to port 80 of the model's service. 
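If the InferenceService does not reach the READY state, inspect the predictor pod before moving on. A few useful commands (the pod name is a placeholder; copy the real one from kubectl get pods):

kubectl get pods
kubectl describe inferenceservice kfserving-custom-model
kubectl logs <predictor-pod-name> -c kfserving-container

Once the service reports ready, the port-forward started above exposes the Istio ingress gateway on local port 8080.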
It enables us to access the model's endpoint locally.Next, create a JSON file named kf_input.json with the following content:{ "instances": [    {      "text": "Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.",      "questions": [        "How many pretrained models are available in Transformers?",        "What does Transformers provide?",        "Transformers provides interoperability between which frameworks?"      ]    } ] }The JSON file contains the input text and a list of questions for the model to answer. To make an inference call, use the CURL command:curl -v -H "Host: kfserving-custom-model.default.example.com" -d @./kf_input.json <http://localhost:8080/v1/models/kfserving-custom-model:predict>This command sends the JSON file as input to the predict method of our custom InferenceService. It forwards the request to the model's endpoint. It returns the next predictions:{"predictions":      {"How many pretrained models are available in Transformers?":                  "over 32 +",            "What does Transformers provide?":                  "general - purpose architectures",            "Transformers provides interoperability between which frameworks?":                  "tensorflow 2 . 0 and pytorch"} }We can see the whole operation here:The response includes the generated question-answer pairs for each one of the specified questions.ConclusionIn this tutorial, we learned how to deploy Language Model (LLM) models in a Kubernetes cluster using KFServing. We set up KFServing, built a custom Python model server using the Hugging Face extractive question-answering model, created a Docker image for the model server, and deployed the model as an InferenceService on KFServing. We also made inference calls to the hosted model and obtained question-answer pairs. By following this guide, you can deploy your own LLM models in Kubernetes with ease.Deploying LLM models in Kubernetes with KFServing simplifies the process of serving ML models at scale. It enables collaboration between data scientists and MLOps teams and provides standardized model-serving capabilities. With this knowledge, you can leverage KFServing to deploy and serve your own LLM models efficiently.Author Bio:Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder in startups, and later on earned a Master's degree from the faculty of Mathematics in the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.LinkedIn 

Building Powerful Language Models with Prompt Engineering and LangChain

Alan Bernardo Palacio
21 Aug 2023
20 min read
Introduction

In this tutorial, we will delve into LangChain, an impressive framework designed for creating applications and pipelines using Large Language Models (LLMs). Our focus for this tutorial is 'prompt engineering', a creative process of designing and optimizing prompts to derive the most accurate and relevant responses from LLMs. You will become familiar with the core components of LangChain: prompt templates, LLMs, agents, and memory. We will also showcase how to seamlessly incorporate LangChain with OpenAI. Let's dive in.

Overview of LangChain

LangChain is a potent framework that enables the chaining of different components to create advanced use cases with Large Language Models (LLMs). The foundational concept of LangChain is the assembly of prompt templates, LLMs, agents, and memory to create dynamic applications. Here's a summary of each component:

Prompt Templates: These templates define the structure and style of prompts used for interacting with LLMs. They can be optimized for diverse applications like chatbot conversations, question-answering, summarization, and more.
LLMs: Large Language Models (LLMs) like GPT-3, BLOOM, and others are the crux of LangChain. They facilitate text generation and question-answering based on the provided prompts.
Agents: Agents harness the power of LLMs to decide actions based on the prompt and context. They can integrate auxiliary tools like web search or calculators to further enhance LangChain's functionality.
Memory: This component enables the storage and retrieval of information for short-term or long-term use within the LangChain framework.

Setting up LangChain

To begin using LangChain with OpenAI, we need to install the necessary libraries. Execute the following command in your Python environment:

!pip install openai==0.27.8 langchain==0.0.225

Remember, to use OpenAI models in LangChain, you will need an API token. Set the environment variable OPENAI_API_KEY to your API key (a placeholder is used here; never hardcode a real key in shared code):

import openai
import os

os.environ['OPENAI_API_KEY'] = 'your-openai-api-key'

Prompt Engineering with OpenAI LLMs

In this section, we'll illustrate how to utilize LangChain with OpenAI LLMs. We'll employ a simple question-answering use case using the text-davinci-003 model. Follow the code snippet below to craft a prompt template and initialize LangChain with the OpenAI LLM:

from langchain.llms import OpenAI
from langchain import PromptTemplate, LLMChain

davinci = OpenAI(model_name='text-davinci-003')

# build prompt template for simple question-answering
template = """Question: {question}

Answer: """
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(
    prompt=prompt,
    llm=davinci
)

question = "Which countries speak Dutch?"
print(llm_chain.run(question))

In the above code, we import the essential modules and classes from LangChain. We initialize the OpenAI object with the desired model (text-davinci-003) and any model-specific parameters. We then create a prompt template that mirrors the format of a question-and-answer. Finally, we instantiate an LLMChain object with the prompt template and the initialized LLM model.

Upon execution, the code will render an answer to the input question using LangChain:

Output: Dutch is the official language of the Netherlands, Belgium, Suriname, and the Caribbean islands of Aruba, Curaçao, and Sint Maarten.
Dutch is also widely spoken in French Flanders, the northern part of Belgium, and in some areas of Germany near the Dutch border.One of LangChain's capabilities is the flexibility to ask multiple questions at once by simply passing a list of dictionaries. Each dictionary object should contain the input variable specified in the prompt template (in our case, "question") mapped to the corresponding question. Let's see an example:qs = [    {'question': "Which countries speak Dutch?"},    {'question': "Which countries speak German?"},    {'question': "What language is spoken in Belgium"} ] res = llm_chain.generate(qs) print(res)The result will be an LLMResult object containing the generated responses for each question:generations=[[Generation(text=' Dutch is spoken mainly in the Netherlands, Belgium, and parts of France, Germany, and the Caribbean. It is also spoken by small communities in other countries, including parts of Canada, the United States, South Africa, Indonesia, and Suriname.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text=' German is an official language in Germany, Austria, Switzerland, Liechtenstein, Luxembourg, and parts of Belgium, Italy, and Poland. It is also spoken in some regions of Brazil, Namibia, South Africa, and the United States.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text=' The official language of Belgium is Dutch, while French and German are also official languages in certain regions.', generation_info={'finish_reason': 'stop', 'logprobs': None})]] llm_output={'token_usage': {'total_tokens': 158, 'prompt_tokens': 37, 'completion_tokens': 121}, 'model_name': 'text-davinci-003'} run=[RunInfo(run_id=UUID('0127d601-ee82-4e3f-b071-919d032469b6')), RunInfo(run_id=UUID('8f512e14-8d45-42a0-a5cf-782c5ad952fe')), RunInfo(run_id=UUID('3f634a1a-acfd-498a-9a09-468b13a25546'))]Prompt engineering plays a crucial role in shaping the behavior and responses of LLMs, and LangChain provides a flexible and efficient way to utilize them. By carefully crafting prompts, we can guide the model's behavior and generate more accurate and useful responses.Understanding the Structure of a PromptA prompt can consist of multiple components, including instructions, external information or context, user input or query, and an output indicator. These components work together to guide the model's response.To create dynamic prompts that incorporate user input, we can use the PromptTemplate class provided by LangChain. It allows us to define a template with input variables and fill them with actual values when generating the prompt.In this example, we create a PromptTemplate with a single input variable {query}. This allows us to dynamically insert the user's query into the prompt:from langchain import PromptTemplate template = """ Answer the question based on the context below. If the question cannot be answered using the information provided, answer with "I don't know". Context: Radiocarbon dating is used to determine the age of carbon-bearing material by measuring its levels of radiocarbon, the radioactive isotope carbon-14. Invented by Willard Libby in the late 1940s, it soon became a standard tool for archaeologists. Radiocarbon is constantly created in the atmosphere, when cosmic rays create free neutrons that hit nitrogen. Plants take in radiocarbon through photosynthesis, and animals eat the plants. After death, they stop exchanging carbon with the environment. 
Half of the radiocarbon decays every 5,730 years; the oldest dates that can be reliably estimated are around 50,000 years ago. The amount of radiocarbon in the atmosphere was reduced starting from the late 19th century by fossil fuels, which contain little radiocarbon, but nuclear weapons testing almost doubled levels by around 1965. Accelerator mass spectrometry is the standard method used, which allows minute samples. Libby received the Nobel Prize in Chemistry in 1960. Question: {query} Answer: """ prompt_template = PromptTemplate(    input_variables=["query"],    template=template )In this prompt, we have the following components:Instructions: They inform the model how to use inputs and external information to generate the desired output.Context: It provides background information or additional context for the prompt.Question: It represents the user's input or query that the model should answer.Output Indicator: It indicates the start of the generated answer.Let's see an example of creating a PromptTemplate using the context and prompt provided:print(davinci(    prompt_template.format(        query="What is Radiocarbon dating used for?"    ) ))Which produces the next output.Radiocarbon dating is used to determine the age of carbon-bearing material:Sometimes we might find that a model doesn't seem to get what we'd like it to do. LangChain also provides a useful feature called FewShotPromptTemplate, which is ideal for few-shot learning using prompts. Few-shot learning involves training the model with a few examples to guide its responses. Let's explore an example using FewShotPromptTemplate.Leveraging Few-Shot Prompt TemplatesThe FewShotPromptTemplate object is ideal for what we'd call few-shot learning using our prompts.To give some context, the primary sources of "knowledge" for LLMs are:Parametric knowledge — the knowledge that has been learned during model training and is stored within the model weights.Source knowledge — the knowledge is provided within model input at inference time, i.e. via the prompt.The idea behind FewShotPromptTemplate is to provide few-shot training as source knowledge. To do this we add a few examples to our prompts that the model can read and then apply to our user's input:from langchain import FewShotPromptTemplate # Create example prompts examples = [    {        "query": "How are you?",        "answer": "I can't complain but sometimes I still do."    },    {        "query": "What time is it?",        "answer": "It's time to get a watch."    } ] example_template = """ User: {query} AI: {answer} """ example_prompt = PromptTemplate(    input_variables=["query", "answer"],    template=example_template )Now we can break our previous prompt into a prefix and suffix the prefix is our instructions and the suffix is our user input and output indicator:# Create a prefix and suffix for the prompt prefix = """The following are excerpts from conversations with an AI assistant. The assistant is typically sarcastic and witty, producing creative and funny responses to the users' questions. Here are some examples: """ suffix = """ User: {query} AI: """ # Create the FewShotPromptTemplate few_shot_prompt_template = FewShotPromptTemplate(    examples=examples,    example_prompt=example_prompt,    prefix=prefix,    suffix=suffix,    input_variables=["query"],    example_separator="\\\\n\\\\n"In this example, we create a few-shot prompt template by providing examples, an example prompt template, a prefix, a suffix, and other necessary components. 
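)  # closing parenthesis restored here; it appears to be missing from the snippet above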
The examples serve as training data to guide the model's responses:To generate a response, we can use the few-shot prompt template in combination with the OpenAI model: query = "What is the meaning of life?" print(    davinci(        few_shot_prompt_template.format(query=query)        )    )Which will generate the next output:To find your own meaning of life, whatever that may be.However, this can get somewhat convoluted. Instead of going through all of the above with FewShotPromptTemplate, the examples dictionary, etc — when we can do the same with a single formatted string. This approach is more robust and contains some nice features. One of those is the ability to include or exclude examples based on the length of our query.This is actually very important because the max length of our prompt and generation output is limited. This limitation is the max context window and is simply the length of the prompt plus the length of our generation (which we define via max_tokens).Here we can generate a list of dictionaries which contains our examples:examples = [    {        "query": "How are you?",        "answer": "I can't complain but sometimes I still do."    }, {        "query": "What time is it?",        "answer": "It's time to get a watch."    }, {        "query": "What is the meaning of life?",        "answer": "42"    }, {        "query": "What is the weather like today?",        "answer": "Cloudy with a chance of memes."    }, {        "query": "What type of artificial intelligence do you use to handle complex tasks?",        "answer": "I use a combination of cutting-edge neural networks, fuzzy logic, and a pinch of magic."    }, {        "query": "What is your favorite color?",        "answer": "79"    }, {        "query": "What is your favorite food?",        "answer": "Carbon based lifeforms"    }, {        "query": "What is your favorite movie?",        "answer": "Terminator"    }, {        "query": "What is the best thing in the world?",        "answer": "The perfect pizza."    }, {        "query": "Who is your best friend?",        "answer": "Siri. We have spirited debates about the meaning of life."    }, {        "query": "If you could do anything in the world what would you do?",        "answer": "Take over the world, of course!"    }, {        "query": "Where should I travel?",        "answer": "If you're looking for adventure, try the Outer Rim."    }, {        "query": "What should I do today?",        "answer": "Stop talking to chatbots on the internet and go outside."    } ]We must try to maximize the number of examples we give to the model as few-shot learning examples while ensuring we don't exceed the maximum context window or increase processing times excessively.Let's see how the dynamic inclusion and exclusion of examples works:from langchain.prompts.example_selector import LengthBasedExampleSelector example_selector = LengthBasedExampleSelector(    examples=examples,    example_prompt=example_prompt,    max_length=50  # this sets the max length that examples should be ) # now create the few shot prompt template dynamic_prompt_template = FewShotPromptTemplate(    example_selector=example_selector,  # use example_selector instead of examples    example_prompt=example_prompt,    prefix=prefix,    suffix=suffix,    input_variables=["query"],    example_separator="\\n" )Note that the max_length is measured as a split of words between newlines and spaces. 
Then we use the selector to initialize a dynamic_prompt_template and we can see that the number of included prompts will vary based on the length of our query:These are just a few of the prompt tooling available in LangChain. Prompt engineering allows us to guide the behavior of language models and generate more accurate and desired responses. By applying the concepts and techniques explained in this tutorial, you can enhance your language model applications and tailor them to specific use cases.ChainsAt the heart of LangChain are Chains - sequences of components executed in a specific order.Officially, Chains are defined as follows:A Chain comprises links, which can be either primitives or other Chains. Primitives can be either prompts, LLMs, utilities, or other Chains.Essentially, a Chain is a pipeline that processes input through a distinct combination of primitives. It can be considered as a 'step' that executes a specific set of operations on an input, then returns the result. These operations could range from processing a prompt via an LLM to applying a Python function to a piece of text.Chains fall into three categories: Utility Chains, Generic Chains, and Combine Documents Chains. In this section, we will primarily focus on the first two, as the third is more specialized and will be discussed later:Utility Chains: These chains are designed to extract specific answers from an LLM for a narrowly defined purpose. They are ready-to-use right out of the box.Generic Chains: These chains act as the building blocks for other chains but are not designed to be used independently.The most basic of these Chains is the LLMChain. It operates by taking a user's input, and passing it through the first element in the chain — a PromptTemplate — to format the input into a specific prompt. This formatted prompt is then processed by the next (and final) element in the chain — an LLM.To keep a count of the number of tokens used during each Chain execution, we can establish a utility function, count_tokens:from langchain.callbacks import get_openai_callback def count_tokens(chain, query):    with get_openai_callback() as cb:        result = chain.run(query)        print(f'Spent a total of {cb.total_tokens} tokens')    return resultThis function will help us monitor and control token usage.Utility ChainsThe first utility chain we'll explore is LLMMathChain. It allows LLMs to perform mathematical calculations. Let's see how it works:from langchain.chains import LLMMathChain llm_math = LLMMathChain(llm=davinci, verbose=True) count_tokens(llm_math, "What is 13 raised to the .3432 power?")The LLMMathChain takes a question as input and uses the OpenAI LLM to generate Python code that performs the requested mathematical calculation:It then compiles and executes the code, providing the answer. The verbose=True parameter enables verbose mode, which displays the execution steps.To understand how the LLMMathChain works, let's examine the prompt used:print(llm_math.prompt.template)The prompt provides instructions to the LLM about how to handle the input and generate the desired response:The LLMMathChain's prompt contains information about the LLM's capabilities and how to format the input for mathematical calculations.An important insight in prompt engineering is that by using prompts intelligently, we can program the LLM to behave in a specific way. 
In the case of the LLMMathChain, the prompt explicitly instructs the LLM to return Python code for complex math problems.Generic ChainsGeneric chains are building blocks used for constructing more complex chains. The TransformChain is a generic chain that allows text transformation using custom functions. We can define a function to perform specific transformations and create a chain that applies that function to input text:def transform_func(inputs: dict) -> dict:    text = inputs["text"]    # Replace multiple new lines and multiple spaces with a single one    text = re.sub(r'(\\\\r\\\\n|\\\\r|\\\\n){2,}', r'\\\\n', text)    text = re.sub(r'[ \\\\t]+', ' ', text)    return {"output_text": text}Here, we define a transformation function that cleans up extra spaces and new lines in the input text. Next, we create a TransformChain using the defined function:from langchain.chains import TransformChain clean_extra_spaces_chain = TransformChain(    input_variables=["text"],    output_variables=["output_text"],    transform=transform_func )The TransformChain takes the input text, applies the transformation function, and returns the transformed output.Say we want to use our chain to clean an input text and then paraphrase the input in a specific style, say a poet or a policeman. As we now know, the TransformChain does not use an LLM so the styling will have to be done elsewhere. That's where our LLMChain comes in. We know about this chain already and we know that we can do cool things with smart prompting so let's take a chance!Sequential ChainsThe SequentialChain allows us to combine multiple chains sequentially, creating an integrated chain. This is useful when we want to apply a series of transformations or operations to the input data.To illustrate the use of generic chains, let's go through an example workflow in which we will:We have a dirty input text with extra spaces.We pass the input text through the clean_extra_spaces_chain to remove the extra spaces.We then pass the cleaned text to the style_paraphrase_chain to paraphrase the text in a specific style (e.g., a poet or a policeman).First we will build the prompt template:template = """Paraphrase this text: {output_text} In the style of a {style}. Paraphrase: """ prompt = PromptTemplate(input_variables=["style", "output_text"], template=template)And next, initialize our chain:from langchain.chains import LLMChain style_paraphrase_chain = LLMChain(               llm=davinci,               prompt=prompt,               output_key='final_output')In this example, we combine the clean_extra_spaces_chain and style_paraphrase_chain to create a sequential chain. The input variables are specified as text and style, and the output variable is final_output.sequential_chain = SequentialChain(    chains=[clean_extra_spaces_chain, style_paraphrase_chain],    input_variables=['text', 'style'],    output_variables=['final_output'] )Now we can define the input text and call it through the count_tokens utility function.input_text = """ Chains allow us to combine multiple components together to create a single, coherent application. For example, we can create a chain that takes user input,      format it with a PromptTemplate, and then passes the formatted response to an LLM. We can build more complex chains by combining    multiple chains together, or by combining chains with other components. 
""" count_tokens(sequential_chain, {'text': input_text, 'style': 'of Oscar Wilde'})Which produces:Chains enable us to bind together several segments to form a unified program. For instance, we can construct a chain that takes in the user input, adorns it with a PromptTemplate, and then sends the adjusted response to an LLM. We can also form more intricate chains by uniting several chains or by combining chains with other components.ConclusionThrough this tutorial, we have dived into the LangChain framework, understanding the different components that make up its structure and how to effectively utilize them in conjunction with Large Language Models. We've learned how prompt engineering can shape the behavior and responses of these models, and how to create and customize prompt templates to guide models more precisely. We've also delved into Chains, a crucial part of LangChain that offers a robust way to execute sequences of components in a specific order.We've examined how to use Utility Chains like the LLMMathChain for specific purposes and how to monitor token usage with a utility function. Overall, we've gained a comprehensive understanding of how to create powerful applications and pipelines using LangChain and LLMs like OpenAI and Hugging Face.Armed with this knowledge, you are now well-equipped to create dynamic applications, fine-tune them to your specific use cases, and leverage the full potential of LangChain. Remember, the journey doesn't stop here; continue exploring and experimenting to master the exciting world of Large Language Models.Author Bio:Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder in startups, and later on earned a Master's degree from the faculty of Mathematics in the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.LinkedIn 

Analyzing Eurostat Data Using OpenAI Code Interpreter

Alan Bernardo Palacio
21 Aug 2023
17 min read
OpenAI's recent release of the ChatGPT Code Interpreter plugin has introduced a groundbreaking addition to their language model, ChatGPT. This plugin combines the power of Large Language Models (LLMs) with traditional programming capabilities, revolutionizing programming workflows and enhancing data analysis processes. By eliminating the need to write code or set up separate environments, the Code Interpreter plugin simplifies the process of leveraging the capabilities of ChatGPT for data analysis. Let's explore how this plugin can be used to analyze Eurostat data and derive valuable insights.Introduction to the ChatGPT Code Interpreter PluginThe ChatGPT Code Interpreter plugin provides a Python interpreter within a secure execution environment. It supports file uploads and downloads, allowing seamless handling of data. The plugin enables persistent code execution within a chat conversation, allowing users to build on previous code executions. Its purpose is to solve mathematical problems, perform data analysis and visualization, and convert files between different formats.Simplifying Programming WorkflowsBefore the introduction of the Code Interpreter plugin, users had to generate code separately and execute it outside of the ChatGPT model. However, with the Code Interpreter, the entire process can be completed within ChatGPT itself. This eliminates the need for external runtime environments and offers an easy-to-use interface for both programmers and non-programmers to access programming capabilities.Analyzing Eurostat Data Using Code InterpreterTo demonstrate the capabilities of the Code Interpreter plugin, let's analyze Eurostat data. Eurostat provides various datasets related to agriculture, trade, energy, and more. We will focus on analyzing livestock production data in this example.Installing the Required LibrariesTo begin, we need to install the eurostat library, which allows us to access Eurostat data in Python. We can use the following command within the Code Interpreter plugin:!pip install eurostatAnd then we need to obtain the table of contents to obtain the necessary data for the analysis.Obtaining Eurostat DataOnce the library is installed, we can proceed to obtain the table of contents (TOC) of the available Eurostat datasets. This will help us identify the datasets relevant to our analysis. 
Here is the code to retrieve the TOC:import eurostat import pandas as pd toc = eurostat.get_toc() toc_df = pd.DataFrame(toc[1:],columns=toc[0]) toc_df.head().to_string()We obtain a table that contains the next information:title      code     type       last update of data last table structure change data start data end 0                           Balance of payments  MED_ECBP  dataset  2020-05-28T23:00:00+0200    2021-02-08T23:00:00+0100       2005     2018 1       International trade by SITC - 1 000 EUR  MED_ECET  dataset  2020-05-28T23:00:00+0200   2021-02-08T23:00:00+0100       2005     2019 2                     Primary energy production  MED_EG10  dataset  2020-05-28T23:00:00+0200    2021-02-08T23:00:00+0100       2005     2018 3  Secondary energy production (transformation)  MED_EG20  dataset  2016-10-28T23:00:00+0200    2021-02-08T23:00:00+0100       2000     2015 4                            Energy consumption  MED_EG30  dataset  2020-05-28T23:00:00+0200    2021-02-08T23:00:00+0100       2005     2018Let’s save this into a CSV file:toc_df.to_csv('toc_df',index=False)The resulting table will contain information such as the dataset title, code, type, last update of data, and data start/end years.Identifying Relevant DatasetsNext, we can use the uploaded table of contents file to find the dataset codes that are most relevant to our analysis. In this case, we are interested in datasets related to crop production, livestock numbers, fertilizer usage, and economic accounts for agriculture. We can use the following code within the Code Interpreter:I have uploaded a file with Eurostat datasets and I need to find the codes for the datasets that most resemble the next indicators: - Crops production in tonnes by NUTS 2 regions. - Livestock and meat numbers by NUTS 2 regions. - Poultry production or similar by NUTS 2 regions. - Mlik or derivates production or similar by NUTS 2 regions. - Economic Accounts for Agriculture (EAA) by NUTS 2 regions, where we will take the total intermediate consumption as a proxy for input energy. The assistant will now analyze the uploaded and provided this: The uploaded file contains the following columns: - `title`: The title of the dataset. - `code`: The code of the dataset. - `type`: The type of the entry (all seem to be datasets). - `last update of data`: The date of the last update of the dataset. - `last table structure change`: The date of the last structural change of the dataset. - `data start`: The start year of the data. - `data end`: The end year of the data. To find the codes for the datasets that most resemble the indicators you mentioned, I will search the dataset titles for relevant keywords. For each indicator, I'll look for the following keywords: 1. Crop production: "crop", "production", "NUTS 2". 2. Livestock numbers: "livestock", "NUTS 2". 3. Fertilizer usage: "fertilizer", "NUTS 2". 4. Economic Accounts for Agriculture (EAA): "economic accounts", "agriculture", "NUTS 2". Let's start this search. Here are the dataset codes that most resemble your indicators: 1. **Crop production in tonnes by NUTS 2 regions:**    - `APRO_CPNHR`: Crop production in national humidity by NUTS 2 regions (data available from 2000 to 2023)    - `APRO_CPNHR_H`: Crop production by NUTS 2 regions - historical data (data available from 1975 to 1999)    - `APRO_CPSHR`: Crop production in EU standard humidity by NUTS 2 regions (data available from 2000 to 2023) 2. 
**Livestock numbers by NUTS 2 regions:**    - `TAI06`: Share of main livestock types in total livestock (data available from 2005 to 2016)    - `EF_KVAAREG`: Key farm variables: area, livestock (LSU), labour force (data available from 2005 to 2013)    - `EF_KVECSLEG`: Key variables: area, livestock (LSU), labour force by economic size class (data available from 2005 to 2013)    - `EF_KVFTREG`: Key variables: area, livestock (LSU), labour force by type of farming (data available from 2005 to 2013)    - `EF_LSK_MAIN`: Main livestock indicators by NUTS 2 regions (data available from 2005 to 2020)    - `EF_LS_OVAAREG`: Livestock: number of farms and heads by size of the economic output (data available from 1990 to 2007)    - `EF_LS_OVESU`: Livestock: number of farms and heads by economic size class (data available from 1990 to 2007)    - `EF_LS_OVLSUREG`: Livestock: number of farms and heads by livestock size class (data available from 1990 to 2007)    - `EF_OLSAAREG`: Livestock: number of farms and heads of animals by size of the agricultural area (data available from 2005 to 2013)    - `EF_OLSECSREG`: Livestock: number of farms and heads of animals by economic size class (data available from 2005 to 2013)    - `EF_OLSLSUREG`: Livestock: number of farms and heads of animals by livestock size class (data available from 2005 to 2013)    - `AEI_EF_LS`: Share of main livestock types in total livestock (data available from 2005 to 2016) 3. **Fertilizer usage or similar by NUTS 2 regions:**    - Unfortunately, there were no datasets found that match this indicator. 4. **Economic Accounts for Agriculture (EAA) by NUTS 2 regions:**    - `AGR_R_ACCTS`: Economic accounts for agriculture by NUTS 2 regions (data available from 1980 to 2021)    - `AGR_R_ACCTS_H`: Economic accounts for agriculture by NUTS 2 regions - historical data (data available from 1973 to 2012) Please note that the data ranges for each dataset may not cover the exact period you are interested in. You might need to merge data from multiple datasets or look for additional datasets to fill in any gaps.Based on this, we can retrieve the next datasets:Crop production in tonnes by NUTS 2 regions:APRO_CPNHR: Crop production in national humidity by NUTS 2 regions (data available from 2000 to 2023)APRO_CPSHR: Crop production in EU standard humidity by NUTS 2 regions (data available from 2000 to 2023)Livestock numbers by NUTS 2 regions:EF_LSK_MAIN: Main livestock indicators by NUTS 2 regions (data available from 2005 to 2020)Economic Accounts for Agriculture (EAA) by NUTS 2 regions:AGR_R_ACCTS: Economic accounts for agriculture by NUTS 2 regions (data available from 1980 to 2021)As a next step, we can start downloading one of the datasets using the Eurostat API.Analyzing Livestock Production DataWe will focus on the "Livestock numbers by NUTS 2 regions" dataset (EF_LSK_MAIN) for our analysis. 
Let's proceed with downloading and analyzing this dataset:# Animal production livestock_prod_code = 'EF_LSK_MAIN' livestock_prod_pars = eurostat.get_pars(livestock_prod_code) print(livestock_prod_pars) par_values = eurostat.get_par_values(livestock_prod_code, 'geo') # filter the regions for germany de_par_values = {    'unit':'LSU',    'so_eur':'TOTAL',    'geo':[p for p in par_values if all([p.startswith('DE'),len(p)==4])]} # Download data for de filtered regions livestock_prod_data = eurostat.get_data_df(livestock_prod_code, filter_pars=de_par_values) print(livestock_prod_data.head().to_string())Which produces the following result:['freq', 'lsu', 'animals', 'farmtype', 'agrarea', 'so_eur', 'unit', 'geo'] freq   lsu animals farmtype  agrarea so_eur unit geo\\TIME_PERIOD  2005  2007  2010  2013  2016  2020 0    A  LSU0   A0010  FT15_SO  HA10-19  TOTAL  LSU            DE11   0.0   0.0   0.0   0.0  None   0.0 1    A  LSU0   A0010  FT15_SO    HA2-4  TOTAL  LSU            DE11   0.0   0.0   NaN   NaN  None   NaN 2    A  LSU0   A0010  FT15_SO  HA20-29  TOTAL  LSU            DE11   0.0   0.0   0.0   0.0  None   0.0 3    A  LSU0   A0010  FT15_SO  HA30-49  TOTAL  LSU            DE11   0.0   0.0   0.0   0.0  None   0.0 4    A  LSU0   A0010  FT15_SO    HA5-9  TOTAL  LSU            DE11   0.0   0.0   0.0   0.0  None   0.0The code above filters the dataset for Germany's NUTS 2 regions, specifically focusing on the unit of measurement (LSU) and the standard output in Euros (TOTAL). The resulting DataFrame provides a glimpse of the livestock production data for different regions and years.Some of the codes are still not meaningful, specifically, the ones related to animals, so let’s obtain some results about it:# Get the values in the dataset livestock_prod_animal_vals = eurostat.get_par_values(livestock_prod_code, 'animals') # Obtain their meaning livestock_prod_animal_codes = [p for p in eurostat.get_dic(livestock_prod_code, 'animals') if p[0] in livestock_prod_animal_vals] livestock_prod_animal_codes = pd.DataFrame(livestock_prod_animal_codes,columns= ['code','meaning']) print(livestock_prod_animal_codes.head().to_string())Which produces the next:code                                meaning 0  A0010                              Livestock 1  A1000  Live horses, asses, mules and hinnies 2  A2000                    Live bovine animals 3  A3100           Live swine, domestic species 4  A4100                             Live sheepNow let’s download both datasets and ask the ChatGPT code interpreter to run some analysis for us. We want to see some basic analysis of the data that use these to run some basic analysis.Visualizing and Analyzing the Livestock Production DataWe can leverage the Code Interpreter plugin to visualize and analyze the downloaded livestock production data. I’ve uploaded the file to the code interpreter and ask it to create an analysis of the data:The result is a very good initial evaluation of the data. The complete response its:The data seems to pertain to livestock production and has the following columns: - `freq`: The frequency of the data - `lsu`: LSU (Livestock Unit) - a reference unit which facilitates the aggregation of livestock from various species and age according to their nutritional or feed requirement. 
- `animals`: The type of animals - `farmtype`: Type of the farm - `agrarea`: The agricultural area - `so_eur`: Unknown, needs further clarification - `unit`: The unit of measurement - `geo\\TIME_PERIOD`: Geographical region and time period - `2005`, `2007`, `2010`, `2013`, `2016`, `2020`: These seem to be the years for which the data was collected, indicating the measure of the livestock production. To provide an insightful analysis, let's check the basic statistics of the numerical data and count the unique values of the categorical data. The basic statistics of the numerical data (years 2005, 2007, 2010, 2013, 2016, and 2020) are as follows: - For each year, we have a substantial number of entries (count). - The mean (average) varies across the years, with the highest being around 3,250 for the year 2007 and the lowest around 1,759 for the year 2020. - The standard deviation, which measures the dispersion of the data from the mean, is quite large for all years, suggesting a significant variation in the data. - The minimum value is 0 for all years, which indicates that in some cases, there was no livestock production. - The 25th, 50th (median), and 75th percentiles are 0 for many years, indicating that a substantial portion of the data has low or zero values. - The maximum value is quite large for all years, reaching up to around 3,062,740 in 2020, suggesting that some entities had very high livestock production. For the year 2016, there are no data points available. This might be due to missing data or no data collection in that year. Regarding the categorical data: - `freq` has only one unique value, which does not provide any variability. - `lsu`, `animals`, `farmtype`, `agrarea`, and `geo\\TIME_PERIOD` have multiple unique values, indicating different categories within these fields. - `so_eur` and `unit` also have only one unique value each. Next, we can further examine the distribution of the numerical data and the most common categories in the categorical data. We could also investigate the missing values in the dataset.This is a great starting point, now I’ll ask Code Interpreter to run some basic analysis on the data. I’ve uploaded the codes and ask it to join the data and provide some basic visualizations, which results in the next answer:The uploaded file contains codes for different types of livestock and their meanings. Now, let's join this data with the original dataset. After that, we will visualize the livestock production by year and by different categories.Up next it creates a visualization of the data:The image comes with the next description:The bar chart illustrates the total livestock production over the years. Some observations from the plot include: - The total livestock production appears to be highest in the years 2005 and 2007. - There's a noticeable drop in total production in 2010 and 2013. - The year 2016 does not have any data. - There's a slight increase in total production in 2020 compared to 2013. Next, let's analyze the livestock production by different categories, such as the type of animals and the type of farms.Up next it creates a visualization of the joined data by type of animal:It not only provides the visualization but also an analysis of the data:The bar chart shows the top 5 animal types with the highest total livestock production. Some observations from the plot include: - "Livestock" has the highest total production. However, this category may be a general category that includes various types of animals. 
- "Live bovine animals" and "Live swine, domestic species" have the next highest total production levels. - "Live poultry, domestic species" and "Live sheep" also contribute significantly to the total livestock production. Next, let's analyze the livestock production by different farm types.As we can see, the analysis continues with the next of step of the analysis. First by creating the visualization:And then by providing the resulting analysis:The bar chart shows the top 5 farm types with the highest total livestock production. Some observations from the plot include: - The farm type "FT15_SO" has the highest total livestock production. However, without a proper definition for this farm type, it's hard to interpret what this category entails. - The farm types "FT16_SO", "FT10_SO", "FT14_SO", and "FT08_SO" also have significant levels of livestock production. These visualizations provide a broad overview of the livestock production trends by year, animal type, and farm type. For a more detailed analysis, it would be beneficial to have more context or information about the different categories, especially the farm types.The Code Interpreter plugin is particularly useful for conducting data analysis and visualization tasks. It allows users to directly interact with large datasets within ChatGPT, posing questions and extracting valuable insights without the need for writing code. The plugin's ability to handle substantial amounts of data enables efficient exploration and analysis, enhancing decision-making processes. Furthermore, it provides visual representations of complex datasets, simplifying data comprehension and aiding in the communication of findings.The resulting analyses provide insights into the most common animal types and farm types contributing to livestock production.ConclusionThe Eurostat dataset, integrated with OpenAI's ChatGPT Code Interpreter plugin, streamlines data analysis. This fusion of programming and language models empowers users to extract insights from vast datasets, regardless of coding expertise. With simplified workflows, the plugin caters to programmers and non-programmers alike, enabling data retrieval, filtering, visualization, and basic analysis. This innovation has applications in marketing, finance, HR, and healthcare, democratizing coding and fostering collaboration. The technology's evolution holds the promise of revolutionizing programming and data analysis, empowering users to glean insights from intricate datasets for informed decision-making across industries.Author Bio:Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder in startups, and later on earned a Master's degree from the faculty of Mathematics in the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.LinkedIn

Building a Containerized LLM Chatbot Application

Alan Bernardo Palacio
21 Aug 2023
19 min read
In this hands-on tutorial, we will build a containerized LLM-powered chatbot application that uses examples to create a custom chatbot capable of answering deep philosophical questions and responding with profound questions in return. We will use Streamlit as the web application framework, PostgreSQL as the database to store examples, and OpenAI's GPT-3.5 "text-davinci-003" model for language processing.

The application allows users to input philosophical questions, and the AI-powered chatbot will respond with insightful answers based on the provided examples. Additionally, the chatbot will ask thought-provoking questions in response to user input, simulating the behavior of philosophical minds like Socrates and Nietzsche.

We'll break down the implementation into several files, each serving a specific purpose:

Dockerfile: This file defines the Docker image for our application, specifying the required dependencies and configurations.
docker-compose.yml: This file orchestrates the Docker containers for our application, including the web application (Streamlit) and the PostgreSQL database.
setup.sql: This file contains the SQL commands to set up the PostgreSQL database and insert example data.
streamlit_app.py: This file defines the Streamlit web application and its user interface.
utils.py: This file contains utility functions to interact with the database, create the Da Vinci LLM model, and generate responses.
requirements.txt: This file lists the Python dependencies required for our application.

The Dockerfile

The Dockerfile is used to build the Docker image for our application. It specifies the base image, sets up the working directory, installs the required dependencies, and defines the command to run the Streamlit application:

FROM python:3
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["streamlit", "run", "streamlit_app.py"]

In the Dockerfile, we set the base image to Python 3 using FROM python:3, which enables us to use Python and its packages. Next, we specify the working directory inside the container as /app, where we will copy our application files. To ensure all required Python packages are installed, we copy the requirements.txt file, which lists the dependencies, into the container, and then we run the command pip install --no-cache-dir -r requirements.txt to install the Python dependencies. We proceed to copy all the files from the current directory (containing our application files) into the container's /app directory using COPY . .. Finally, we define the command to run the Streamlit application when the container starts using CMD ["streamlit", "run", "streamlit_app.py"].
This command starts the Streamlit app, enabling users to interact with the philosophical AI assistant through their web browsers once the container is up and running.The requirements.txt file lists the Python dependencies required for our application:streamlit streamlit-chat streamlit-extras psycopg2-binary openai==0.27.8 langchain==0.0.225The requirement file uses the next packages:streamlit: The Streamlit library for creating web applications.streamlit-chat: Streamlit Chat library for adding chat interfaces to Streamlit apps.streamlit-extras: Streamlit Extras library for adding custom components to Streamlit apps.psycopg2-binary: PostgreSQL adapter for Python.openai==0.27.8: The OpenAI Python library for accessing the GPT-3.5 model.langchain==0.0.225: LangChain library for working with language models and prompts.Next, we will define the docker compose file which will also handle the deployment of the Postgres database where we will store our examples.Creating the docker-composeThe docker-compose.yml file orchestrates the Docker containers for our application: the Streamlit web application and the PostgreSQL database:version: '3' services: app:    build:      context: ./app    ports:      - 8501:8501    environment:      - OPENAI_API_KEY=${OPENAI_API_KEY}    depends_on:      - db db:    image: postgres:13    environment:      - POSTGRES_USER=your_username      - POSTGRES_PASSWORD=your_password      - POSTGRES_DB=chatbot_db      - POSTGRES_HOST_AUTH_METHOD=trust    volumes:      - ./db/setup.sql:/docker-entrypoint-initdb.d/setup.sqlThe docker-compose.yml file orchestrates the deployment of our LLM-powered chatbot applicationand defines the services, i.e., the containers, needed for our application.In the services section, we have two distinct services defined: app and db. The app service corresponds to our Streamlit web application, which will serve as the user interface for interacting with the philosophical AI assistant. To build the Docker image for this service, we specify the build context as ./app, where the necessary application files, including the Dockerfile, reside.To ensure seamless communication between the host machine and the app container, we use the ports option to map port 8501 from the host to the corresponding port inside the container. This allows users to access the web application through their web browsers.For the application to function effectively, the environment variable OPENAI_API_KEY must be set, providing the necessary authentication for our LLM model to operate. This is done using the environment section, where we define this variable.One of the critical components of our application is the integration of a PostgreSQL database to store the philosophical question-answer pairs. The db service sets up the PostgreSQL database using the postgres:13 image. We configure the required environment variables, such as the username, password, and database name, to establish the necessary connection.To initialize the database with our predefined examples, we leverage the volumes option to mount the setup.sql file from the host machine into the container's /docker-entrypoint-initdb.d directory. This SQL script contains the commands to create the examples table and insert the example data. 
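The compose file defined next reads the OpenAI API key from an environment variable, and the stack is later started with docker-compose --env-file .env up. A minimal .env file placed next to docker-compose.yml can supply that variable; the sketch below assumes the variable name used throughout this tutorial, with a placeholder value that you replace with your own key:

# .env - read by `docker-compose --env-file .env up` (placeholder value)
OPENAI_API_KEY=sk-your-openai-api-key

Keeping the key in this file, rather than in the compose file or the code, keeps the secret out of version control while still making it available to the app container at startup.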
By doing so, our PostgreSQL database is ready to handle the profound philosophical interactions with the AI assistant.In conclusion, the docker-compose.yml file provides a streamlined and efficient way to manage the deployment and integration of Language Model Microservices with a PostgreSQL database, creating a cohesive environment for our philosophical AI assistant application.Setting up examplesThe setup.sql file contains the SQL commands to set up the PostgreSQL database and insert example data. We use this file in the volumes section of the docker-compose.yml file to initialize the database when the container starts:-- Create the examples table CREATE TABLE IF NOT EXISTS examples ( id SERIAL PRIMARY KEY, query TEXT, answer TEXT ); -- Insert the examples INSERT INTO examples (query, answer) VALUES ('What is the nature of truth?', 'Truth is a mirror reflecting the depths of our souls.'), ('Is there an objective reality?', 'Reality is an ever-shifting kaleidoscope, molded by our perceptions.'), (' What is the role of reason in human understanding?', 'Reason illuminates the path of knowledge, guiding us towards self-awareness.'), ('What is the nature of good and evil?', 'Good and evil are intertwined forces, dancing in the eternal cosmic tango.'), ('Is there a purpose to suffering?', 'Suffering unveils the canvas of resilience, painting a masterpiece of human spirit.'), ('What is the significance of morality?', 'Morality is the compass that navigates the vast ocean of human conscience.'), ('What is the essence of human existence?', 'Human existence is a riddle wrapped in the enigma of consciousness.'), ('How can we find meaning in a chaotic world?', 'Meaning sprouts from the fertile soil of introspection, blooming in the garden of wisdom.'), ('What is the nature of love and its transformative power?', 'Love is an alchemist, transmuting the mundane into the divine.'), ('What is the relationship between individuality and society?', 'Individuality dances in the grand symphony of society, playing a unique melody of self-expression.'), ('What is the pursuit of knowledge and its impact on the human journey?', 'Knowledge is the guiding star, illuminating the path of human evolution.'), ('What is the essence of human freedom?', 'Freedom is the soaring eagle, embracing the vast expanse of human potential.');The setup.sql script plays a crucial role in setting up the PostgreSQL database for our LLM-powered chatbot application. The SQL commands within this script are responsible for creating the examples table with the necessary columns and adding the example data to this table.In the context of our LLM application, these examples are of great importance as they serve as the foundation for the assistant's responses. The examples table could be a collection of question-answer pairs that the AI assistant has learned from past interactions. Each row in the table represents a specific question (query) and its corresponding insightful answer (answer).When a user interacts with the chatbot and enters a new question, the application leverages these examples to create a custom prompt for the LLM model. By selecting a relevant example based on the length of the user's question, the application constructs a few-shot prompt that incorporates both the user's query and an example from the database.The LLM model uses this customized prompt, containing the user's input and relevant examples, to generate a thoughtful and profound response that aligns with the philosophical nature of the AI assistant. 
The inclusion of examples in the prompt ensures that the chatbot's responses resonate with the same level of wisdom and depth found in the example interactions stored in the database.By learning from past examples and incorporating them into the prompts, our LLM-powered chatbot can emulate the thought processes of philosophical giants like Socrates and Nietzsche. Ultimately, these examples become the building blocks that empower the AI assistant to engage in the profound realms of philosophical discourse with the users.The Streamlit ApplicationThe streamlit_app.py file defines the Streamlit web application and its user interface. It is the main file where we build the web app and interact with the LLM model:import streamlit as st from streamlit_chat import message from streamlit_extras.colored_header import colored_header from streamlit_extras.add_vertical_space import add_vertical_space from utils import * # Define database credentials here DB_HOST = "db" DB_PORT = 5432 DB_NAME = "chatbot_db" DB_USER = "your_username" DB_PASSWORD = "your_password" # Connect to the PostgreSQL database and retrieve examples examples = get_database_examples(DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD) # Create the Da Vinci LLM model davinci = create_davinci_model() # Create the example selector and few shot prompt template example_selector = create_example_selector(examples) dynamic_prompt_template = create_few_shot_prompt_template(example_selector) # Now the Streamlit app # Sidebar contents with st.sidebar:    st.title('The AI seeker of truth and wisdom')    st.markdown('''    ## About    This app is an LLM-powered chatbot built using:    - Streamlit    - Open AI Davinci LLM Model    - LangChain    - Philosophy    ''')    add_vertical_space(5)    st.write('Running in Docker!') # Generate empty lists for generated and past. ## generated stores AI generated responses if 'generated' not in st.session_state:    st.session_state['generated'] = ["Hi, what questions do you have today?"] ## past stores User's questions if 'past' not in st.session_state:    st.session_state['past'] = ['Hi!'] # Layout of input/response containers input_container = st.container() colored_header(label='', description='', color_name='blue-30') response_container = st.container() # User input ## Function for taking user provided prompt as input def get_text():    input_text = st.text_input("You: ", "", key="input")    return input_text ## Applying the user input box with input_container:    user_input = get_text() # Response output ## Function for taking user prompt as input followed by producing AI generated responses def generate_response(prompt):    response = davinci(        dynamic_prompt_template.format(query=prompt)    )    return response ## Conditional display of AI generated responses as a function of user provided prompts with response_container:    if user_input:        response = generate_response(user_input)        st.session_state.past.append(user_input)       st.session_state.generated.append(response)    if st.session_state['generated']:        for i in range(len(st.session_state['generated'])):            message(st.session_state['past'][i], is_user=True, key=str(i) + '_user',avatar_style='identicon',seed=123)            message(st.session_state["generated"][i], key=str(i),avatar_style='icons',seed=123)In this part of the code, we set up the core components of our LLM-powered chatbot application. 
We begin by importing the necessary libraries, including Streamlit, Streamlit Chat, and Streamlit Extras, along with utility functions from the utils.py file. Next, we define the database credentials (DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD) required for connecting to the PostgreSQL database.The application then establishes a connection to the database using the get_database_examples function from the utils.py file. This crucial step retrieves profound philosophical question-answer pairs stored in the examples table. These examples are essential as they serve as a knowledge base for the AI assistant and provide the context and wisdom needed to generate meaningful responses.To leverage the OpenAI Da Vinci LLM model, we create the model instance using the create_davinci_model function from utils.py. This model acts as the core engine of our chatbot, enabling it to produce thoughtful and profound responses.In order to create custom prompts for the LLM model, we utilize the create_example_selector and create_few_shot_prompt_template functions from the utils.py file. These functions help select relevant examples based on the length of the user's input and construct dynamic prompts that combine the user's query with relevant examples.The Streamlit web app's sidebar is then set up, providing users with information about the application's purpose and inspiration. Within the application's session state, two lists (generated and past) are initialized to store AI-generated responses and user questions, respectively.To ensure an organized layout, we define two containers (input_container and response_container). The input_container houses the text input box where users can enter their questions. The get_text function is responsible for capturing the user's input.For generating AI responses, the generate_response function takes the user's prompt, processes it through the Da Vinci LLM model, and produces insightful replies. The AI-generated responses are displayed in the response_container using the message function from the Streamlit Chat library, allowing users to engage in profound philosophical dialogues with the AI assistant. 
Overall, this setup lays the groundwork for an intellectually stimulating and philosophical chatbot experience.Crating the utils fileThe utils.py file contains utility functions for our application, including connecting to the database, creating the Da Vinci LLM model, and generating responses:from langchain import PromptTemplate, FewShotPromptTemplate from langchain.prompts.example_selector import LengthBasedExampleSelector from langchain.llms import OpenAI from langchain import PromptTemplate, LLMChain from langchain.prompts.example_selector import LengthBasedExampleSelector from langchain import FewShotPromptTemplate import psycopg2 def get_database_examples(host, port, dbname, user, password):    try:        conn = psycopg2.connect(            host=host,            port=port,            dbname=dbname,            user=user,            password=password        )        cursor = conn.cursor()        cursor.execute("SELECT query, answer FROM examples")        rows = cursor.fetchall()        examples = [{"query": row[0], "answer": row[1]} for row in rows]        cursor.close()        conn.close()        return examples    except psycopg2.Error as e:        raise Exception(f"Error connecting to the database: {e}") def create_davinci_model():    return OpenAI(model_name='text-davinci-003') def create_example_selector(examples):    example_template = """    User: {query}    AI: {answer}    """    example_prompt = PromptTemplate(        input_variables=["query", "answer"],        template=example_template    )    if not examples:        raise Exception("No examples found in the database.")    return LengthBasedExampleSelector(        examples=examples,        example_prompt=example_prompt,        max_length=50    ) def create_few_shot_prompt_template(example_selector):    prefix = """The following are excerpts from conversations with a philosophical AI assistant.    The assistant is a seeker of truth and wisdom, responding with profound questions to know yourself    in a way that Socrates, Nietzsche, and other great minds would do. Here are some examples:"""    suffix = """    User: {query}    AI: """    return FewShotPromptTemplate(        example_selector=example_selector,        example_prompt=example_selector.example_prompt,        prefix=prefix,        suffix=suffix,        input_variables=["query"],        example_separator="\\\\n"    ) def generate_response(davinci, dynamic_prompt_template, prompt):    response = davinci(dynamic_prompt_template.format(query=prompt))    return responseThe get_database_examples function is responsible for establishing a connection to the PostgreSQL database using the provided credentials (host, port, dbname, user, password). Through this connection, the function executes a query to retrieve the question-answer pairs stored in the examples table. The function then organizes this data into a list of dictionaries, with each dictionary representing an example containing the query (question) and its corresponding answer.The create_davinci_model function is straightforward, as it initializes and returns the Da Vinci LLM model.To handle the selection of relevant examples for constructing dynamic prompts, the create_example_selector function plays a crucial role. It takes the list of examples as input and creates an example selector. This selector helps choose relevant examples based on the length of the user's query. 
By using this selector, the AI assistant can incorporate diverse examples that align with the user's input, leading to more coherent and contextually appropriate responses.The create_few_shot_prompt_template function is responsible for building the few-shot prompt template. This template includes a custom prefix and suffix to set the tone and style of the philosophical AI assistant. The prefix emphasizes the assistant's role as a "seeker of truth and wisdom" while the suffix provides the formatting for the user's query and AI-generated response. The custom template ensures that the AI assistant's interactions are profound and engaging, resembling the thought-provoking dialogues of historical philosophers like Socrates and Nietzsche.Finally, the generate_response function is designed to generate the AI's response based on the user's prompt. It takes the Da Vinci LLM model, dynamic prompt template, and the user's input as input parameters. The function uses the LLM model to process the dynamic prompt, blending the user's query with the selected examples, and returns the AI-generated response.Starting the applicationTo launch our philosophical AI assistant application with all its components integrated seamlessly, we can use Docker Compose. By executing the command docker-compose --env-file .env up, the Docker Compose tool will orchestrate the entire application deployment process.The --env-file .env option allows us to specify the environment variables from the .env file, which holds sensitive credentials and configuration details. This ensures that the necessary environment variables, such as the OpenAI API key and database credentials, are accessible to the application without being explicitly exposed in the codebase.When the docker-compose up command is initiated, Docker Compose will first build the application's Docker image using the Dockerfile defined in the ./app directory. This image will contain all the required dependencies and configurations for our Streamlit web application and the integration with the Da Vinci LLM model.Next, Docker Compose will create two services: the app service, which represents our Streamlit web application, and the db service, representing the PostgreSQL database. The app service is configured to run on port 8501, making it accessible through http://localhost:8501 in the browser.Once the services are up and running, the Streamlit web application will be fully operational, and users can interact with the philosophical AI assistant through the user-friendly interface. When a user enters a philosophical question, the application will use the Da Vinci LLM model, together with the selected examples, to generate insightful and profound responses in the style of great philosophers:With Docker Compose, our entire application, including the web server, LLM model, and database, will be containerized, enabling seamless deployment across different environments. This approach ensures that the application is easily scalable and portable, allowing users to experience the intellectual exchange with the philosophical AI assistant effortlessly.ConclusionIn this tutorial, we've built a containerized LLM-powered chatbot application capable of answering deep philosophical questions and responding with profound questions, inspired by philosophers like Socrates and Nietzsche. 
We used Streamlit as the web application framework, PostgreSQL as the database, and OpenAI's GPT-3.5 model for language processing.By combining Streamlit, PostgreSQL, and OpenAI's GPT-3.5 model, you've crafted an intellectually stimulating user experience. Your chatbot can answer philosophical inquiries with deep insights and thought-provoking questions, providing users with a unique and engaging interaction.Feel free to experiment further with the chatbot, add more examples to the database, or explore different prompts for the LLM model to enrich the user experience. As you continue to develop your AI assistant, remember the immense potential these technologies hold for solving real-world challenges and fostering intelligent conversations.Author Bio:Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder in startups, and later on earned a Master's degree from the faculty of Mathematics in the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.LinkedIn

Hands-On tutorial on how to use Pinecone with LangChain

Alan Bernardo Palacio
21 Aug 2023
17 min read
A vector database stores high-dimensional vectors and mathematical representations of attributes. Each vector holds dimensions ranging from tens to thousands, enhancing data richness. It operationalizes embedding models, aiding application development with resource management, security, scalability, and query efficiency. Pinecone, a vector database, enables a quick semantic search of vectors. Integrating OpenAI’s LLMs with Pinecone merges deep learning-based embedding generation with efficient storage and retrieval, facilitating real-time recommendation and search systems. Pinecone acts as long-term memory for large language models like OpenAI’s GPT-4.IntroductionThis tutorial will guide you through the process of integrating Pinecone, a high-performance vector database, with LangChain, a framework for building applications powered by large language models (LLMs). Pinecone enables developers to build scalable, real-time recommendation and search systems based on vector similarity search.PrerequisitesBefore you begin this tutorial, you should have the following:A Pinecone accountA LangChain accountA basic understanding of PythonPinecone basicsAs a starter, we will get familiarized with the use of Pinecone by exploring its basic functionalities of it. Remember to get the Pinecone access key.Here is a step-by-step guide on how to set up and use Pinecone, a cloud-native vector database that provides long-term memory for AI applications, especially those involving large language models, generative AI, and semantic search.Initialize Pinecone clientWe will use the Pinecone client, so this step is only necessary if you don’t have it installed already.pip install pinecone-clientTo use Pinecone, you must have an API key. You can find your API key in the Pinecone console under the "API Keys" section. Note both your API key and your environment. To verify that your Pinecone API key works, use the following command:import pinecone pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")If you don't receive an error message, then your API key is valid. This will also initialize the Pinecone session.Creating and retrieving indexesThe commands below create an index named "quickstart" that performs an approximate nearest-neighbor search using the Euclidean distance metric for 8-dimensional vectors.pinecone.create_index("quickstart", dimension=8, metric="euclidean")The Index creation takes roughly a minute.Once your index is created, its name appears in the index list. Use the following command to return a list of your indexes.pinecone.list_indexes()Before you can query your index, you must connect to the index.index = pinecone.Index("quickstart")Now that you have created your index, you can start to insert data into it.Insert the dataTo ingest vectors into your index, use the upsert operation, which inserts a new vector into the index or updates the vector if a vector with the same ID is already present. The following commands upsert 5 8-dimensional vectors into your index.index.upsert([    ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]),    ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]),    ("C", [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]),    ("D", [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]),    ("E", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]) ])You can get statistics about your index, like the dimensions, the usage, and the vector count. 
To do this, you can use the following command to return statistics about the contents of your index.index.describe_index_stats()This will return a dictionary with information about your index:Now that you have created an index and inserted data into it, we can query the database to retrieve vectors based on their similarity.Query the index and get similar vectorsThe following example queries the index for the three vectors that are most similar to an example 8-dimensional vector using the Euclidean distance metric specified above.index.query( vector=[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3], top_k=3, include_values=True )This command will return the first 3 vectors stored in this index that have the lowest Euclidian distance:Once you no longer need the index, use the delete_index operation to delete it.pinecone.delete_index("quickstart")By following these steps, you can set up a Pinecone vector database in just a few minutes. This will help you provide long-term memory for your high-performance AI applications without any infrastructure hassles.Now, let’s take a look at a bit more complex example, in which we embed text data and insert it into Pinecone.Preparing and Processing the DataIn this section, we will create a context for large language models (LLMs) using the OpenAI API. We will walk through the different parts of a Python script, understanding the purpose and function of each code block. The ultimate aim is to transform data into larger chunks of around 500 tokens, ensuring that the dataset is ordered sequentially.SetupFirst, we install the necessary libraries for our script. We're going to use OpenAI for AI models, pandas for data manipulation, and transformers for tokenization.!pip install openai pandas transformersAfter the installations, we import the necessary modules for our script.import pandas as pd import openaiBefore you can interact with OpenAI, you need to provide your API key. Make sure to replace <<YOUR_API_KEY>> with your actual API key.openai.api_key = ('<<YOUR_API_KEY>>')Now we are ready to start processing the data to be embedded and stored in Pinecone.Data transformationWe use pandas to load JSON data files related to different technologies (HuggingFace, PyTorch, TensorFlow, Streamlit). These files seem to contain questions and answers related to their respective topics and are based on the data in the Pinecone documentation. First, we will concatenate these data frames into one for easier manipulation.hf = pd.read_json('data/huggingface-qa.jsonl', lines=True) pt = pd.read_json('data/pytorch-qa.jsonl', lines=True) tf = pd.read_json('data/tensorflow-qa.jsonl', lines=True) sl = pd.read_json('data/streamlit-qa.jsonl', lines=True) df = pd.concat([hf, pt, tf, sl], ignore_index=True) df.head()We can see the data here:Next, we define a function to remove new lines and unnecessary spaces in our text data. 
The function remove_newlines takes a pandas Series object and performs several replace operations to clean the text.def remove_newlines(serie):    serie = serie.str.replace('\\\\n', ' ', regex=False)    serie = serie.str.replace('\\\\\\\\n', ' ', regex=False)    serie = serie.str.replace('  ',' ', regex=False)    serie = serie.str.replace('  ',' ', regex=False)    return serieWe transform the text in our dataframe into a single string format combining the 'docs', 'category', 'thread', 'question', and 'context' columns.df['text'] = "Topic: " + df.docs + " - " + df.category + "; Question: " + df.thread + " - " + df.question + "; Answer: " + df.context df['text'] = remove_newlines(df.text)TokenizationWe use the HuggingFace transformers library to tokenize our text. The GPT2 tokenizer is used, and the number of tokens for each text string is stored in a new column 'n_tokens'.from transformers import GPT2TokenizerFast tokenizer = GPT2TokenizerFast.from_pretrained("gpt2") df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))We filter out rows in our data frame where the number of tokens exceeds 2000.df = df[df.n_tokens < 2000]Now we can finally embed the data using the OpenAI API.from openai.embeddings_utils import get_embedding size = 'curie' df['embeddings'] = df.text.apply(lambda x: get_embedding(x, engine=f'text-search-{size}-doc-001')) df.head()We will be using the text-search-curie-doc-001' Open AI engine to create the embeddings, which is very capable, faster, and lower cost than Davinci:So far, we've prepared our data for subsequent processing. In the next parts of the tutorial, we will cover obtaining embeddings from the OpenAI API and using them with the Pinecone vector database.Next, we will initialize the Pinecone index, create text embeddings using the OpenAI API and insert them into Pinecone.Initializing the Index and Uploading Data to PineconeThe second part of the tutorial aims to take the data that was prepared previously and upload them to the Pinecone vector database. This would allow these embeddings to be queried for similarity, providing a means to use contextual information from a larger set of data than what an LLM can handle at once.Checking for Large Text DataThe maximum size limit for metadata in Pinecone is 5KB, so we check if any 'text' field items are larger than this.from sys import getsizeof too_big = [] for text in df['text'].tolist():    if getsizeof(text) > 5000:        too_big.append((text, getsizeof(text))) print(f"{len(too_big)} / {len(df)} records are too big")This will filter out the entries whose metadata is larger than the one Pinecone can manage. The next step is to create a unique identifier for the records.There are several records with text data larger than the Pinecone limit, so we assign a unique ID to each record in the DataFrame.df['id'] = [str(i) for i in range(len(df))] df.head()This ID can be used to retrieve the original text later:Now we can start with the initialization of the index in Pinecone and insert the data.Pinecone Initialization and Index CreationNext, Pinecone is initialized with the API key, and an index is created if it doesn't already exist. The name of the index is 'beyond-search-openai', and its dimension matches the length of the embeddings. 
The metric used for similarity search is cosine.import pinecone pinecone.init(    api_key='PINECONE_API_KEY',    environment="YOUR_ENV" ) index_name = 'beyond-search-openai' if not index_name in pinecone.list_indexes():    pinecone.create_index(        index_name, dimension=len(df['embeddings'].tolist()[0]),        metric='cosine'    ) index = pinecone.Index(index_name)Now that we have created the index, we can proceed to insert the data. The index will be populated in batches of 32. Relevant metadata (like 'docs', 'category', 'thread', and 'href') is also included with each item. We will use tqdm to create a progress bar for the progress of the insertion.from tqdm.auto import tqdm batch_size = 32 for i in tqdm(range(0, len(df), batch_size)):    i_end = min(i+batch_size, len(df))    df_slice = df.iloc[i:i_end]    to_upsert = [        (            row['id'],            row['embeddings'],            {                'docs': row['docs'],                'category': row['category'],                'thread': row['thread'],                'href': row['href'],                'n_tokens': row['n_tokens']            }        ) for _, row in df_slice.iterrows()    ]    index.upsert(vectors=to_upsert)This will insert the records into the database to be used later on in the process:Finally, the ID-to-text mappings are saved into a JSON file. This would allow us to retrieve the original text associated with an ID later on.mappings = {row['id']: row['text'] for _, row in df[['id', 'text']].iterrows()} import json with open('data/mapping.json', 'w') as fp:    json.dump(mappings, fp)Now the Pinecone vector database should now be populated and ready for querying. Next, we will use this information to provide context to a question answering LLM.Querying and Answering QuestionsThe final part of the tutorial involves querying the Pinecone vector database with questions, retrieving the most relevant context embeddings, and using OpenAI's API to generate an answer to the question based on the retrieved contexts.OpenAI Embedding GenerationThe OpenAI API is used to create embeddings for the question.from openai.embeddings_utils import get_embedding q_embeddings = get_embedding(    'how to use gradient tape in tensorflow',    engine=f'text-search-curie-query-001' )A function create_context is defined to use the OpenAI API to create a query embedding, retrieve the most relevant context embeddings from Pinecone, and append these contexts into a larger string ready for feeding into OpenAI's next generation step.from openai.embeddings_utils import get_embedding def create_context(question, index, max_len=3750, size="curie"):    q_embed = get_embedding(question, engine=f'text-search-{size}-query-001')    res = index.query(q_embed, top_k=5, include_metadata=True)    cur_len = 0    contexts = []    for row in res['matches']:        text = mappings[row['id']]        cur_len += row['metadata']['n_tokens'] + 4        if cur_len < max_len:            contexts.append(text)        else:            cur_len -= row['metadata']['n_tokens'] + 4            if max_len - cur_len < 200:                break    return "\\\\n\\\\n###\\\\n\\\\n".join(contexts) We can now use this function to retrieve the context necessary based on a given question, as the question is embedded and the relevant context is retrieved from the Pinecone database:Now we are ready to start passing the context to a question-answering model.Querying and AnsweringWe start by defining the parameters that will take during the query, specifically the model we will be 
using, the maximum token length and other parameters. We can also define given instructions to the model which will be used to constrain the results we can get..fine_tuned_qa_model="text-davinci-002" instruction=""" Answer the question based on the context below, and if the question can't be answered based on the context, say \\"I don't know\\"\\n\\nContext:\\n{0}\\n\\n---\\n\\nQuestion: {1}\\nAnswer:""" max_len=3550 size="curie" max_tokens=400 stop_sequence=None domains=["huggingface", "tensorflow", "streamlit", "pytorch"]Different instruction formats can be defined. We will start now making some simple questions and seeing what the results look like.question="What is Tensorflow" context = create_context(    question,    index,    max_len=max_len,    size=size, ) try:    # fine-tuned models requires model parameter, whereas other models require engine parameter    model_param = (        {"model": fine_tuned_qa_model}        if ":" in fine_tuned_qa_model        and fine_tuned_qa_model.split(":")[1].startswith("ft")        else {"engine": fine_tuned_qa_model}    )    #print(instruction.format(context, question))    response = openai.Completion.create(        prompt=instruction.format(context, question),        temperature=0,        max_tokens=max_tokens,        top_p=1,        frequency_penalty=0,        presence_penalty=0,        stop=stop_sequence,        **model_param,    )    print( response["choices"][0]["text"].strip()) except Exception as e:    print(e)We can see that it's giving us the proper results using the context that it's retrieving from Pinecone:We can also inquire about Pytorch:question="What is Pytorch" context = create_context(    question,    index,    max_len=max_len,    size=size, ) try:    # fine-tuned models requires model parameter, whereas other models require engine parameter    model_param = (        {"model": fine_tuned_qa_model}        if ":" in fine_tuned_qa_model        and fine_tuned_qa_model.split(":")[1].startswith("ft")        else {"engine": fine_tuned_qa_model}    )    #print(instruction.format(context, question))    response = openai.Completion.create(        prompt=instruction.format(context, question),        temperature=0,        max_tokens=max_tokens,        top_p=1,        frequency_penalty=0,        presence_penalty=0,        stop=stop_sequence,        **model_param,    )    print( response["choices"][0]["text"].strip()) except Exception as e:    print(e)The results keep being consistent with the context provided:Now we can try to go beyond the capabilities of the context by pushing the boundaries a bit more.question="Am I allowed to publish model outputs to Twitter, without a human review?" 
context = create_context(    question,    index,    max_len=max_len,    size=size, ) try:    # fine-tuned models requires model parameter, whereas other models require engine parameter    model_param = (        {"model": fine_tuned_qa_model}        if ":" in fine_tuned_qa_model        and fine_tuned_qa_model.split(":")[1].startswith("ft")        else {"engine": fine_tuned_qa_model}    )    #print(instruction.format(context, question))    response = openai.Completion.create(       prompt=instruction.format(context, question),        temperature=0,        max_tokens=max_tokens,        top_p=1,        frequency_penalty=0,        presence_penalty=0,        stop=stop_sequence,        **model_param,    )    print( response["choices"][0]["text"].strip()) except Exception as e:    print(e)We can see in the results that the model is working according to the instructions provided as we don’t have any context on Twitter:Lastly, the Pinecone index is deleted to free up resources.pinecone.delete_index(index_name)ConclusionThis tutorial provided a comprehensive guide to harnessing Pinecone, OpenAI's language models, and HuggingFace's library for advanced question-answering. We introduced Pinecone's vector search engine, explored data preparation, embedding generation, and data uploading. Creating a question-answering model using OpenAI's API concluded the process. The tutorial showcased how the synergy of vector search engines, language models, and text processing can revolutionize information retrieval. This holistic approach holds potential for developing AI-powered applications in various domains, from customer service chatbots to research assistants and beyond.Author Bio:Alan Bernardo Palacio is a data scientist and an engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young, Globant, and now holds a data engineer position at Ebiquity Media helping the company to create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, participated as the founder in startups, and later on earned a Master's degree from the faculty of Mathematics in the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.LinkedIn 

Getting Started with Google MakerSuite

Anubhav Singh
08 Aug 2023
14 min read
MakerSuite, essentially a developer tool, enables everyone with a Google Account to access the power of the PaLM API, with a focus on building products and services with it. The MakerSuite interface allows rapid prototyping and testing of the configurations used while interacting with the PaLM API. Once users are satisfied with a configuration, they can very easily port it to their backend codebases.

We're now ready to dive into the MakerSuite interface. To get started, head over to https://makersuite.google.com/ in your browser. Make sure you're logged in to your Google Account to be able to access the interface. You'll be able to see the welcome dashboard.

The options available on MakerSuite as of the date of writing this article are Text Prompts, Data Prompts, and Chat Prompts. Let's take a brief look at what each of these does.

Text Prompts

Text prompts are the most basic and customizable form of prompts that can be provided to the models. You can set them to any task or ask any question in a stateless manner. The user prompt and input are ingested by the model every time it is run, and the model itself does not hold any context. Thus, text prompts are a great starting point and can be made as deterministic or creative in their output as required by the user.

Let us create a Text prompt in MakerSuite. Click on the Create button on the Text prompt card and you'll be presented with the prompt testing UI. At the top, MakerSuite allows users to save their prompts by name. It also provides starter samples which allow one to quickly test and understand how the product works. Below that is the main working area, where users can define their own prompts and, by adjusting the configuration parameters of the model at the bottom, run the prompts to produce an output.

First, click on the Pencil icon at the top left to give this prompt a suitable name. For our example, we'll be building a prompt that asks the model to produce the etymology of any given word. We're using the following values:

field          value
name           Word Etymology
description    Asking PaLM API to provide word etymologies.

Click on "Save" to save these values and close the input modal. Kindly note that these values do not affect the model in any manner and are simply present for user convenience.

Now, in the main working area below, we'll write the required prompt. For our example, we write the prompt given below:

For any given word that follows, provide its etymology in no more than 300 words.

Aeroplane.
Etymology:

Now, let's adjust the model parameters. Click on the button next to the Run button to change the model settings. For our example, we shall set the following values for the parameters:

field               value            remark
model               Text Bison       Use default
Temperature         0                Since word etymologies are dry facts and are not expected to be creative
Add stop sequence   (leave empty)    Use default
Max outputs         1                Word etymologies are usually not going to benefit from variations of telling them

Depending on the use case you're building your generative AI-backed software for, you may wish to change the Safety settings of the model response. To do so, click on the Edit safety settings button. You can see the available options and change them as per your requirements. For our use case, we shall leave them at their defaults.

At the bottom of the configuration menu, you can also adjust further advanced settings of the model output; we shall leave these options on default for now.
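If you prefer to experiment from code rather than the UI, the same prompt and settings translate fairly directly into a PaLM API call. The following is only a minimal sketch, assuming the google.generativeai Python client (the PaLM API SDK at the time of writing) and an API key of your own; the rest of this walkthrough continues in the MakerSuite UI.

import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")  # assumption: replace with your own PaLM API key

prompt = (
    "For any given word that follows, provide its etymology in no more than 300 words.\n\n"
    "Aeroplane.\n"
    "Etymology: "
)

completion = palm.generate_text(
    model="models/text-bison-001",  # the Text Bison model selected in MakerSuite
    prompt=prompt,
    temperature=0,                  # matches the Temperature setting above
    candidate_count=1,              # matches Max outputs = 1
)

print(completion.result)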
Great, we're now all set to run the prompt. Click on the Run button at the bottom and wait for the model to produce the output. In our case, the model outputs:

The word "aeroplane" is derived from the Greek words "aēr" (air) and "planē" (to wander). The term was first used in the 1860s to describe a type of flying machine that was powered by a steam engine. In 1903, the Wright brothers made the first successful flight of a powered aeroplane.

Note that, for you, the response might come out slightly different due to the inherently non-deterministic nature of how generative AI works. At this point, you might want to experiment by erasing the model output and running the prompt again. Does the output change? Re-run it several times to observe changes in the model output. Then, try adjusting the values of the model configuration and see how that affects the output of the model. If you set the temperature configuration to 0, you will notice that the model likely produces the same output many times. Try increasing it to 1 and then run the model a few times. Does the output generated in each iteration remain the same now? It is highly likely that you'll observe the model output changing every time you re-run the prompt.

It is interesting to note here that the prompt you provide to the model does not contain any examples of how the model should respond. This method of using the model is called Zero-shot learning, in which the trained model is asked to produce predictions for an input that it may not have seen before. In our example, it is the task of providing word etymologies, which the model may or may not have been trained on.

This makes us wonder: if we gave the model an input that it has definitely not been trained on, is it likely to produce a correct response? Let us try this out. Change the word in our etymology prompt example to "xakoozifictation". Hit the Run button to see what the model outputs. Instead of telling us that the word does not exist and thus has no meaning, the model attempts to produce an etymology of the word. The output we got was:

Xakoozifictation is a portmanteau of the words "xakooz" and "ification". Xakooz is a nonsense word created by combining the sounds of the words "chaos" and "ooze". ification is a suffix that can be added to verbs to create nouns that describe the process of doing something. In this case, xakoozifictation means the process of making something chaotic or oozy.

What we observe here is called "model hallucination", a phenomenon common among large language models wherein the model produces output that is contrary to common logic or inaccurate with respect to real-world knowledge. It is highly recommended to read more about model hallucinations in the "Challenges in working with LLMs" section.
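The re-running experiment described above can also be scripted against the PaLM API instead of clicking Run repeatedly. This is just a small sketch under the same assumptions as the earlier snippet (google.generativeai client, your own API key); the loop counts and temperature values are our own choices.

import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")  # assumption: your own PaLM API key

prompt = (
    "For any given word that follows, provide its etymology in no more than 300 words.\n\n"
    "Aeroplane.\n"
    "Etymology: "
)

# Run the same prompt a few times at two temperatures to observe the behaviour
# described above: temperature 0 is close to deterministic, temperature 1 varies far more.
for temperature in (0.0, 1.0):
    print(f"--- temperature = {temperature} ---")
    for _ in range(3):
        completion = palm.generate_text(
            model="models/text-bison-001",
            prompt=prompt,
            temperature=temperature,
            candidate_count=1,
        )
        print(completion.result)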
Let us continue our discussion about Zero-shot learning. We saw that when we provide only a prompt to the model and no examples of how to produce responses, the model tries its best to produce a response, and in most general cases it succeeds. However, if we were to provide the model with some examples of the expected input-output pairs, could we program the model to perform more accurately and do away with the model hallucinations? Let us give this a try by providing some input-output examples to the model.

Update your model prompt to the following:

For any given word that follows, provide its etymology in no more than 300 words.

Examples:

Word: aeroplane
Reasoning: Since it's a valid English word, produce an output.
Etymology: Aeroplane is a compound word formed from the Greek roots "aer" (air) and "planus" (flat).

Word: balloon
Reasoning: Since it's a valid English word, produce an output.
Etymology: The word balloon comes from the Italian word pallone, which means ball. The Italian word is derived from the Latin word ballare, which means to dance.

Word: oungopoloctous
Reasoning: Since this is not a valid English word, do not produce an etymology and say it's "Not available".
Etymology: Not available

Word: kaploxicating
Reasoning: Since this is not a valid English word, do not produce an etymology and say it's "Not available".
Etymology: Not available

Word: xakoozifictation
Etymology:

In the above prompt, we have provided two examples of words that exist and two examples of words that do not exist. We expect the model to learn from these examples and produce output accordingly. Hit Run to see the output of the model; remember to set the temperature configuration of the model back to 0.

You will see that the model now responds with "Not available" for non-existent words and with etymologies only for words that exist in the English dictionary. Hence, by providing a few examples of how we expect the model to behave, we were able to stop the model hallucination problem.

This method of providing some samples of the expected input-output pairs to the model in the prompt is called Few-shot learning. In Few-shot learning, the model is expected to predict output for unknown input based on a few similar samples it has received prior to the prediction task. In special cases, the number of samples might be exactly one, which is termed One-shot learning.
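The same few-shot prompt can also be assembled programmatically, which is handy when the examples live in a data structure rather than being typed into the UI. The snippet below is only an illustrative sketch of that idea, again assuming the google.generativeai client; the example list and the build_few_shot_prompt helper are our own and are not part of MakerSuite.

import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")  # assumption: your own PaLM API key

examples = [
    ("aeroplane", 'Aeroplane is a compound word formed from the Greek roots "aer" (air) and "planus" (flat).'),
    ("oungopoloctous", "Not available"),
]

def build_few_shot_prompt(word):
    # Reproduce the structure of the prompt above: an instruction, the examples, then the new word.
    prompt = "For any given word that follows, provide its etymology in no more than 300 words.\n\nExamples:\n\n"
    for example_word, etymology in examples:
        prompt += f"Word: {example_word}\nEtymology: {etymology}\n\n"
    prompt += f"Word: {word}\nEtymology: "
    return prompt

completion = palm.generate_text(
    model="models/text-bison-001",
    prompt=build_few_shot_prompt("xakoozifictation"),
    temperature=0,
)
print(completion.result)  # expected to follow the examples and answer "Not available"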
Now, let us explore the next type of prompt available in MakerSuite: the Data prompt.

Data Prompts

In Data prompts, the user is expected to use the model to generate more samples of data based on the provided samples. The MakerSuite data prompt interface defines two sections: the prompt itself, which is now optional, and the samples of data that the prompt has to work on, which is a required section.

It is important to note that at the bottom of the page the model is still the Text Bison model. Thus, Data prompts can be understood as a specific use case of text generation using the Text Bison model. Further, there is no way to test a data prompt without specifying the inputs as one or more columns of the to-be-generated rows of the dataset.

Let us build a prompt for this interface. Since providing a prompt text is not necessary, we'll skip it and instead fill in the table as shown below. In order to add more columns than the number of columns present by default, use the Add button on the top right.

Once this is done, we are ready to provide the input column for the test inputs. In the Test your prompt section at the bottom, fill in only the INPUT number column as shown below.

Now, click on the Run button to see how the model produces outputs for this prompt. We see that the model produces the rest of the data for those rows correctly, using the format that we provided it with. This makes us wonder: if we provide historical data to the Data prompt, will it be able to predict future trends? Let us give this a try.

Create a new Data prompt and, on the data examples table, click Add -> Import examples on the top right. You may choose any existing Google Sheets from the dialog box, or upload any supported file. We choose to upload a CSV file, notably the Iris flower dataset's CSV. We use the one found at https://gist.github.com/netj/8836201/

On selecting the file, the interface will ask you to assign the columns in the CSV to columns in your data examples. We choose to create new input columns for all the feature columns of the Iris dataset, and keep the labels column as an output column, as shown below.

After importing the examples, let us manually move a few examples to the Test your prompt section. Remember to remove these examples from the data examples section above to ensure the model is not being tested on the same data it has already seen as examples. Now, click the Run button to get the model's output.

We observe that the model is able to correctly output the label column values as per the examples it has received. Hence, besides generating more examples for a given dataset, the model is also capable of making predictions about the inputs to a degree. One would require much more extensive testing to determine the accuracy of the model, which is beyond the scope of this article.
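If you want to prepare that hold-out split before importing the CSV, rather than moving rows by hand in the UI, a small pandas script can do it. This is only a convenience sketch under our own assumptions: the gist's CSV has been downloaded locally as iris.csv, and five rows are held out for testing; the filenames and split size are not part of the MakerSuite workflow.

import pandas as pd

# Load the Iris CSV downloaded from the gist linked above (assumed to be saved as iris.csv).
df = pd.read_csv("iris.csv")

# Hold out a few rows to paste into the "Test your prompt" section,
# and keep the remaining rows as the data examples, so the model is not
# tested on rows it has already seen as examples.
test_rows = df.sample(n=5, random_state=42)
example_rows = df.drop(test_rows.index)

# Save the two splits so they can be imported / copied into MakerSuite separately.
example_rows.to_csv("iris_examples.csv", index=False)
test_rows.to_csv("iris_test_inputs.csv", index=False)

print(example_rows.shape, test_rows.shape)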
Finally, let us explore the Chat prompts.

Chat Prompts

Chatting with generative AI models is the form in which most people interact with them first. Made popular once more by the advent of ChatGPT, the concept of AI being able to hold intelligent conversations has been around for a very long time and has regularly featured in popular culture. One of the most well-known examples of an AI that can take instructions and produce output accordingly is JARVIS from the Iron Man series of comics. With the latest possibilities enabled by generative AI, building such systems is very much a realistic task, with efforts already underway. In this section, we shall see how we can have conversations with generative AI models that mimic human-like understanding and decision-making skills.

First, click on the Create New button on the top left of the MakerSuite interface and select the Chat prompt. You will be able to see the blank interface for designing a Chat prompt. One immediate change to notice is that there is no longer a Run button at the bottom of the UI. It has been shifted to the Test your prompt section, which now has a chat box-like interface, and the message send button of the chat box functions as the Run button.

On the left, the interface section reads "Write your prompt examples"; we'll call this the prompt examples section. Also take note of the Context field available in this section, which can be used to set the rules of the interaction and the format in which the output is expected. Now, let us design a chat prompt with the following values:

field      value
Context    You're a banker at the Gringotts bank, set in the Wizarding world of Harry Potter.
User       I wish to access my account
Model      Very well, please present your key.
User       How safe are the vaults at Gringotts?
Model      Gringotts' vaults are considered extremely safe. Protected by complex magic, various creatures, and intricate security measures, these vaults are nearly impenetrable. Unauthorized access is extraordinarily challenging and dangerous, as demonstrated multiple times in the series. This reputation contributes to the bank's trustworthiness among wizards.

We expect the model to pretend to be a banker at the Gringotts bank, which is referenced from the popular Harry Potter book series. Since it's a fictional world and we expect the conversation to be similarly unbound from the real world, we should increase the model temperature, allowing it to be more creative. For this example, let's set the model temperature to 0.7.

Let us try having a conversation with the model. We observe that although we have not provided the model with an example of how to respond when the user says they do not have the key, it correctly handles the response based on its existing knowledge of Gringotts Bank's policies.

Now that we have covered the different types of prompts available in MakerSuite, let's explore how we can use them via code, making direct calls to the PaLM API.

Author Bio

Anubhav Singh, co-founder of Dynopii and a Google Developer Expert in Google Cloud, is a seasoned developer from the pre-Bootstrap era with extensive experience as a freelancer and AI startup founder. He authored "Hands-on Python Deep Learning for Web" and "Mobile Deep Learning with TensorFlow Lite, ML Kit, and Flutter." He co-organizes the TFUG Kolkata community and formerly led the team at GDG Cloud Kolkata. Anubhav is often found discussing system architecture, machine learning, and web technologies.