Prompt strategy for data science
Let’s run the same thought experiment for data science that we ran for web development. We’ll apply the two guidelines presented earlier, “problem breakdown” and “generate prompts,” and, just as in the web development section, we’ll draw some general conclusions about the domain and present them as a prompt strategy for data science.
Problem breakdown: predict sales
Let’s say we’re building a machine-learning model to predict sales. At a high level, we understand what the system should do. To solve the problem though, we need to divide it into smaller parts, which in data science usually entails the following components:
- Data: The data is the part of the system that stores information. The data can come from many places like databases, web endpoints, static files, and more.
- Model: The model is responsible for learning from the data and producing a prediction that’s as accurate as possible. To predict, the model takes an input and produces one or more outputs as its prediction.
- Training: The training is the part of the system that trains the model. Here, you typically use one part of your data for training and hold back another part as sample data (often called test data) for evaluation.
- Evaluation: To ensure your model works as intended, you need to evaluate it. Evaluation means taking the data and model and producing a score that indicates how well the model performs.
- Visualization: Visualization is the part where you can gain insights valuable for the business via graphs. This part is very important, as it’s the part that’s most visible to the business.
Further breakdown into features/steps for data science
At this point, you’re at too high a level to start writing prompts. We can break it down further by looking at each step:
- Data: The data part has many steps, including collecting the data, cleaning it, and transforming it. Here’s how you can break it down:
  - Collect data: The data needs to be collected from somewhere. It could be a database, a web endpoint, a static file, and so on.
  - Clean data: The data needs to be cleaned. Cleaning means removing data that’s not relevant, removing duplicates, and so on.
  - Transform data: The data needs to be transformed. Transformation means changing the data to a format that’s useful for the model.
- Training: Just like the data part, the training part has many steps to it. Here’s how you can break it down:
  - Split data: The data needs to be split into training and sample data. The training data is used to train the model and the sample data is used to evaluate the model.
  - Train model: The model needs to be trained. Training means taking the training data and learning from it.
- Evaluation: The evaluation part is usually a single step but can be broken down further.
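To make the breakdown concrete, here is a minimal sketch of the Clean data, Transform data, and Split data steps as one small function each. The column names (date, ad_spend, sales), the sample values, and the 80/20 split are assumptions made up for illustration; they do not come from a real dataset.

```python
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean data: drop duplicate rows and rows with missing values."""
    return df.drop_duplicates().dropna().reset_index(drop=True)


def transform_data(df: pd.DataFrame) -> pd.DataFrame:
    """Transform data: turn the date into a numeric feature a model can use."""
    out = df.copy()
    out["month"] = pd.to_datetime(out["date"]).dt.month
    return out.drop(columns=["date"])


def split_data(df: pd.DataFrame, train_fraction: float = 0.8, seed: int = 0):
    """Split data: shuffle, then cut into a training part and a sample part."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cut = int(len(shuffled) * train_fraction)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]


# Hypothetical raw data containing one duplicate row and one missing value.
raw = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-05", "2024-02-10",
             "2024-03-15", "2024-04-20", "2024-05-25"],
    "ad_spend": [100.0, 100.0, 150.0, None, 200.0, 250.0],
    "sales": [1000.0, 1000.0, 1400.0, 1600.0, 1900.0, 2300.0],
})

prepared = transform_data(clean_data(raw))
train, sample = split_data(prepared)
```

Each function maps to one feature in the breakdown, which is also roughly the amount of work a single prompt at this level would ask for.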
Generate prompts for each step
Note how our breakdown for data science looks a bit different from web development. Rather than identifying features like Add inventory, we have features like Collect data.
Still, we’re now at the right level to author a prompt, so let’s use the Collect data feature as our example:
[Prompt]
Collect data from data.xls and read it into a DataFrame using the Pandas library.
[End of prompt]
The preceding prompt is both general and specific at the same time. It’s general in that it simply says to “collect data,” but specific in that it names the library to use and even the data structure (DataFrame). It’s entirely possible that a simpler prompt would have worked for this step, like so:
[Prompt]
Collect data from data.xls.
[End of prompt]
Whether the simpler prompt is enough may vary depending on whether you use a tool like ChatGPT or GitHub Copilot.
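For either version of the prompt, the generated code might resemble the following sketch. The function name collect_data and the choice to pick a reader by file extension are assumptions for illustration, not the output of any particular tool.

```python
from pathlib import Path

import pandas as pd


def collect_data(path: str) -> pd.DataFrame:
    """Read a data file into a DataFrame, choosing the reader by extension."""
    suffix = Path(path).suffix.lower()
    if suffix in {".xls", ".xlsx"}:
        # Reading Excel files relies on an optional engine being installed
        # (for example xlrd for .xls or openpyxl for .xlsx).
        return pd.read_excel(path)
    if suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file type: {suffix}")
```

With a file named data.xls in the working directory, calling collect_data("data.xls") would return its first sheet as a DataFrame.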
Identify some basic principles for data science, “a prompt strategy for data science”
Here, we’ve identified principles similar to those in the web development example:
- Provide context – filename: A data file can have any name, so it’s important to specify the name of the file.
- Specify how – libraries: There are many ways to load a data file, and even though the Pandas library is a common choice in Python, it’s important to specify it. There are other libraries to work with, and you might need a solution for Java, C#, or Rust, for example, where libraries are named differently.
- Iterate: It’s worth iterating on the prompt, rephrasing it, and adding separators like a comma, a colon, and so on.
- Be context-aware: Here too, context matters a lot: if you’re working in a notebook, previous cells will be available to GitHub Copilot, previous conversations will be available to ChatGPT, and so on.
As you can see from the preceding guidance, the strategy is very similar to the one for web development. Here too, we list “Provide context,” “Specify how,” “Iterate,” and “Be context-aware.” The big difference lies in the details.
However, there’s an alternate strategy that works in data science: lengthy prompts. Even though we’ve broken the data science problem down into features, we don’t need to write a prompt per feature. Another way of solving it is to express everything you want carried out in one large prompt. Such a prompt could look like so:
[Prompt]
You want to predict sales on the file data.xls. Use Python and the Pandas library. Here are the steps that you should carry out:
- Collect data
- Clean data
- Transform data
- Split data
- Train model
- Evaluation
[End of prompt]
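A lengthy prompt like this one could yield a single script that covers every step in order. The sketch below is one possible shape for such a script, not the output of any particular tool. It rests on several assumptions: a small in-memory table with made-up columns (ad_spend, sales) stands in for data.xls so the code runs as-is, a straight-line fit with NumPy stands in for whatever model you would actually choose, and mean absolute error stands in for the evaluation score. Because the made-up data is perfectly linear, the reported error will be close to zero; real data would not behave this way.

```python
import numpy as np
import pandas as pd

# Collect data: in practice this could be pd.read_excel("data.xls"); a small
# in-memory table with hypothetical columns is used so the sketch is runnable.
df = pd.DataFrame({
    "ad_spend": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0,
                 6000.0, 7000.0, 8000.0, 8000.0, None],
    "sales": [25.0, 45.0, 65.0, 85.0, 105.0,
              125.0, 145.0, 165.0, 165.0, 90.0],
})

# Clean data: remove duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna().reset_index(drop=True)

# Transform data: express spend in thousands so the feature is easier to read.
df["ad_spend_k"] = df["ad_spend"] / 1000.0

# Split data: shuffle, then keep 75% for training and the rest as sample data.
shuffled = df.sample(frac=1.0, random_state=0).reset_index(drop=True)
cut = int(len(shuffled) * 0.75)
train, sample = shuffled.iloc[:cut], shuffled.iloc[cut:]

# Train model: fit sales = slope * ad_spend_k + intercept by least squares.
slope, intercept = np.polyfit(train["ad_spend_k"], train["sales"], deg=1)

# Evaluation: mean absolute error on the held-out sample data.
predictions = slope * sample["ad_spend_k"] + intercept
mae = (predictions - sample["sales"]).abs().mean()
print(f"Mean absolute error: {mae:.4f}")
```

The comments mirror the six steps listed in the prompt, which makes it easy to check that nothing was skipped.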
You will see examples in future chapters on data science and machine learning where both smaller and lengthier prompts are used. Which approach to use is up to you.