Bard is a large language model (LLM) from Google AI, trained on a massive dataset of text and code. Bard can be used to generate Python code for data science projects. This can be extremely helpful for data scientists who want to save time on coding or are unfamiliar with Python. It also empowers those of us who are not full-time data scientists but have an interest in leveraging machine learning (ML) technologies.
The first step is to define the problem you are trying to solve. In this article, we will use Bard to create a binary text classifier. It will take a news story as input and classify it as either fake or real. Given a problem to solve, you can brainstorm solutions. If you are familiar with machine learning technologies, you are able to do this yourself. Alternatively, you can ask Bard for help in finding an appropriate algorithm that meets your requirements. The classification of text documents often uses term frequency techniques. We don’t need to know more than that at this point, as we can have Bard help us with the details of the implementation.
The overall design of your project could also involve feature engineering and visualization methods. As in most software engineering efforts, you will likely need to iterate on the design and implementation. However, Bard can help you do this much faster.
All of the code from this article can be found on GitHub. Bard will guide you, but to run this code, you will need to install a few packages using the following commands.
python -m pip install pandas
python -m pip install scikit-learn
To train our model, we can use the news.csv data set within this project found here, originally sourced from a Data Flair training exercise. It contains the title and text of almost 8,000 news articles labeled as REAL or FAKE.
To get started, Bard can help us write code to parse and read this data file. Pandas is a popular open-source data analysis and manipulation tool for Python. We can prompt Bard to use this library to read the file.
Image 1: Using Pandas to read the file
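As a rough idea of what such code looks like, here is a minimal sketch that reads CSV data with pandas. It uses a two-row in-memory stand-in (hypothetical rows with the same column layout as news.csv) so it runs on its own; the code Bard returns may differ.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in with the same column layout as news.csv.
sample_csv = io.StringIO(
    ",title,text,label\n"
    '1,Example headline one,"Body text of the first story.",FAKE\n'
    '2,Example headline two,"Body text of the second story.",REAL\n'
)

# With the real file this would be pd.read_csv("news.csv").
df = pd.read_csv(sample_csv)

# Show the first rows and the column names pandas inferred.
print(df.head())
print(df.columns.tolist())
```

Because the first header cell is empty, pandas assigns that column the name "Unnamed: 0".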
Running the code shows the format of the CSV file and the first few data rows, just as Bard described. It has an unnamed article id in the first column followed by the title, text, and classification label.
broemmerd$ python test.py
Unnamed: 0 title text label
0 8476 You Can Smell Hillary’s Fear Daniel Greenfield, a Shillman Journalism Fello... FAKE
1 10294 Watch The Exact Moment Paul Ryan Committed Pol... Google Pinterest Digg Linkedin Reddit Stumbleu... FAKE
2 3608 Kerry to go to Paris in gesture of sympathy U.S. Secretary of State John F. Kerry said Mon... REAL
3 10142 Bernie supporters on Twitter erupt in anger ag... — Kaydee King (@KaydeeKing) November 9, 2016 T... FAKE
4 875 The Battle of New York: Why This Primary Matters It's primary day in New York and front-runners... REAL
Now that we can read and understand our training data, we can prompt Bard to write code to train an ML model using this data. Our prompt is specific about the input columns used for training, but it names only a general ML technique (term frequency) that we believe fits the problem.
The text column in the news.csv contains a string with the content from a news article. The label column contains a classifier label of either REAL or FAKE. Modify the Python code to train a machine-learning model using term frequency based on these two columns of data.
Image 2: Bard for training ML models
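A minimal sketch of what term-frequency training code can look like is shown here. The choice of LogisticRegression, the 80/20 split, and the train_model helper name are assumptions made for illustration, and the rows are hypothetical stand-ins for news.csv; Bard's generated code may use a different classifier or structure.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_model(df):
    """Train a TF-IDF text classifier on the 'text' and 'label' columns."""
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42
    )

    # Learn the vocabulary and term weights from the training split only.
    vectorizer = TfidfVectorizer(stop_words="english")
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)

    # Assumed classifier; any scikit-learn classifier could be swapped in.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_tfidf, y_train)

    # Fraction of held-out stories labeled correctly.
    accuracy = accuracy_score(y_test, model.predict(X_test_tfidf))
    return vectorizer, model, accuracy


# Hypothetical stand-in rows; with the real data use pd.read_csv("news.csv").
df = pd.DataFrame({
    "text": ["aliens secretly control the senate"] * 10
    + ["senate passes annual budget bill"] * 10,
    "label": ["FAKE"] * 10 + ["REAL"] * 10,
})

vectorizer, model, accuracy = train_model(df)
print("Accuracy:", accuracy)
```

Holding back part of the data for testing is what makes the reported accuracy meaningful, since the model is scored on stories it did not see during training.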
We can now train our model. The output of this code is shown below:
broemmerd$ python test.py
Accuracy: 0.9521704814522494
We have our model working. Now we just need a function that will apply it to a given input text. We use the following prompt.
Modify this code to include a function called classify_news that takes a text string as input and returns the classifier, either REAL or FAKE.
Bard generates the following code for this function. Note that it also refactored the previous code to use the TfidfVectorizer in order to support this function.
Image 3: Including classify_news function
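A minimal sketch of such a function follows, assuming a fitted TfidfVectorizer and classifier are kept at module level. The training rows are hypothetical stand-ins and LogisticRegression is an assumed choice, so treat this as an illustration rather than Bard's exact output.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in training data; the article trains on news.csv instead.
train_texts = [
    "aliens secretly control the senate",
    "miracle cure hidden by doctors",
    "senate passes annual budget bill",
    "city council approves new bus routes",
]
train_labels = ["FAKE", "FAKE", "REAL", "REAL"]

# Fit the vectorizer and classifier once so classify_news can reuse them.
vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(train_texts), train_labels)


def classify_news(text):
    """Return 'REAL' or 'FAKE' for a single news story."""
    # Reuse the fitted vectorizer so the text maps onto the training features.
    features = vectorizer.transform([text])
    return str(model.predict(features)[0])


print(classify_news("Senate votes on the budget bill"))
```

The key point is that new text must pass through the same fitted vectorizer used in training; fitting a fresh one would produce features the model does not recognize.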
To test the classifier with a fake story, we will use an Onion article entitled “Chill Juror Good With Whatever Group Wants To Do For Verdict.” The Onion is a satirical news website known for its humorous and fictional content. Articles in The Onion are intentionally crafted to appear as genuine news stories, but they contain fictional, absurd elements for comedic purposes.
Our real news story is a USA Today article entitled “House blocks push to Censure Adam Schiff for alleging collusion between Donald Trump and Russia.”
Here is the code that reads the two articles and uses our new function to classify each one. The results are shown below.
# Read the saved Onion article and classify it.
with open("article_the_onion.txt", "r") as f:
    article_text = f.read()
print("The Onion article: " + classify_news(article_text))

# Read the saved USA Today article and classify it.
with open("article_usa_today.txt", "r") as f:
    article_text = f.read()
print("USA Today article: " + classify_news(article_text))
broemmerd$ python news.py
Accuracy: 0.9521704814522494
The Onion article: FAKE
USA Today article: REAL
Our classifier worked well on these two test cases.
Bard can be a helpful tool for data scientists who want to save time on coding or who are not familiar with Python. By following a process similar to the one outlined above, you can use Bard to generate Python code for data science projects.
When using Bard to generate Python code for data science projects, be sure to use clear and concise prompts. Provide the necessary detail regarding the inputs and the desired outputs. Where possible, use specific examples. This can help Bard generate more accurate code. Be patient and expect to go through a few iterations until you get the desired result. Test the generated code at each step in the process. It will be difficult to determine the cause of errors if you wait until the end to start testing.
Once you get familiar with the process, you can use Bard to generate Python code that can help you solve data science problems more quickly and easily.
Darren Broemmer is an author and software engineer with extensive experience in Big Tech and Fortune 500. He writes on topics at the intersection of technology, science, and innovation.