Downloading and preparing the dataset
We will use the SciQ dataset, which Welbl, Liu, and Gardner (2017) created with a crowdsourcing method for generating high-quality, domain-specific multiple-choice science questions. SciQ consists of 13,679 multiple-choice questions crafted to aid the training of NLP models for science exams. The creation process involves two main steps: selecting relevant passages and generating questions with plausible distractors.
To use this dataset for augmented generation of questions through a Chroma collection, we will keep the question, correct_answer, and support columns. The dataset also contains distractor columns with wrong answers, which we will drop.
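The column selection described above can be sketched as follows. This is a minimal illustration, not necessarily the exact code used later: it assumes the Hugging Face datasets library is installed and that SciQ is available on the Hub under the ID allenai/sciq; the helper names select_columns and load_sciq are hypothetical.

```python
# Columns kept for the Chroma collection; the distractor columns are dropped.
KEEP_COLUMNS = ("question", "correct_answer", "support")


def select_columns(rows, columns=KEEP_COLUMNS):
    """Return new row dicts that contain only the requested columns."""
    return [{name: row[name] for name in columns} for row in rows]


def load_sciq(split="train"):
    """Download one SciQ split and keep only the columns we need.

    Assumptions: the `datasets` package is installed and the dataset
    is published on the Hugging Face Hub as "allenai/sciq".
    """
    # Imported lazily so that select_columns works without the dependency.
    from datasets import load_dataset

    dataset = load_dataset("allenai/sciq", split=split)
    # Iterating over a Dataset yields one dict per row.
    return select_columns(dataset)
```

Calling load_sciq() would return a list of dictionaries, each holding only the question, its correct answer, and the supporting passage. The original rows are left unmodified, so the distractors remain available if they are needed later.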
We will then integrate the prepared dataset into a retrieval system that uses query augmentation to improve the retrieval of relevant questions, based on specific scientific topics or question formats, for Hugging Face’s Llama model. This will...