Inference
After constructing the prompt with the retrieved context and the user’s query, the next step is to submit the formatted prompt to Vertex AI’s API endpoint to be processed by Gemini 1.5 Flash. This is where the actual generation of the response takes place. In the following code snippet, the generate() function is responsible for sending the prompt to the Gemini 1.5 Flash model and obtaining the generated response:
#This is the section where we submit the full prompt and
#context to the LLM
result = generate(prompt)
The generate() function encapsulates the configuration and settings required for the generation process. It relies on two main components: generation_config and safety_settings.
The generation_config dictionary specifies the parameters that control the behavior of the language model during the generation process. In this example, the following settings are provided:
generation_config = {
    "max_output_tokens": 8192,
    "temperature": 0,
    "top_p": 0.95,
}
From Google Gemini’s documentation:
max_output_tokens: Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60–80 words.
temperature: The temperature is used for sampling during response generation, which occurs when top_p and top_k are applied. temperature controls the degree of randomness in token selection. Lower temperature values are good for prompts that require a less open-ended or creative response, while higher temperature values can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.
In this example, temperature is set to 0, so the model always picks the most likely next token; higher values would introduce more randomness and diversity in the output.
top_p: This parameter changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the top-p value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1, respectively, and the top-p value is 0.5, then the model will select either A or B as the next token by using temperature and will exclude C as a candidate.
In this case, it is set to 0.95, meaning that the model samples only from the smallest set of most probable tokens whose combined probability reaches 0.95; the remaining low-probability tail is excluded.
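The interaction between temperature and top_p described above can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not Gemini's actual decoder: the function names apply_temperature and top_p_candidates, the filler tokens t0 to t7, and the small floating-point tolerance are all assumptions made for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature.

    A temperature of 0 is treated as greedy decoding (illustrative choice).
    """
    if temperature == 0:
        # All probability mass goes to the single highest-scoring token.
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_candidates(probs, top_p):
    """Keep the most probable tokens until their cumulative probability reaches top_p."""
    kept, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(tok)
        cumulative += p
        if cumulative >= top_p - 1e-12:  # tolerance for float rounding
            break
    return kept


# The A/B/C example from the text, padded with hypothetical filler tokens
# so that the probabilities sum to 1.
probs = {"A": 0.3, "B": 0.2, "C": 0.1}
probs.update({f"t{i}": 0.05 for i in range(8)})

print(top_p_candidates(probs, 0.5))   # A and B survive, C is excluded
print(top_p_candidates(probs, 0.95))  # only the least likely tail is dropped
```

With top_p at 0.5 the candidate set shrinks to A and B, matching the documentation's example; at 0.95 almost every token remains a candidate, so the setting mainly trims the unlikely tail.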
Beyond the above, the safety_settings dictionary specifies the harm categories and corresponding thresholds for filtering potentially harmful or inappropriate content from the generated output. In this example, the following settings are provided:
safety_settings = {
    generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
    generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
}
These settings instruct the Gemini 1.5 Flash model to block content only when it is rated as having a high probability of being harmful in one of four categories: hate speech, dangerous content, sexually explicit content, and harassment. Content rated below the “high” threshold for these categories is allowed in the generated output.
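To make the effect of a threshold concrete, the following is a simplified toy model of the decision, written in plain Python rather than with the Vertex AI SDK. The threshold and rating names mirror the SDK's enum members, but the BLOCKS_FROM mapping and the is_blocked function are assumptions made for illustration; the real scoring and filtering happen inside the service.

```python
from enum import IntEnum


class HarmProbability(IntEnum):
    """Ordered harm ratings, from least to most likely harmful."""
    NEGLIGIBLE = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4


# Lowest rating that each threshold blocks (illustrative mapping).
BLOCKS_FROM = {
    "BLOCK_LOW_AND_ABOVE": HarmProbability.LOW,
    "BLOCK_MEDIUM_AND_ABOVE": HarmProbability.MEDIUM,
    "BLOCK_ONLY_HIGH": HarmProbability.HIGH,
}


def is_blocked(rating, threshold):
    """Return True if content with this rating would be filtered out."""
    if threshold == "BLOCK_NONE":
        return False
    return rating >= BLOCKS_FROM[threshold]


# With BLOCK_ONLY_HIGH, medium-rated content passes and high-rated content is blocked.
print(is_blocked(HarmProbability.MEDIUM, "BLOCK_ONLY_HIGH"))
print(is_blocked(HarmProbability.HIGH, "BLOCK_ONLY_HIGH"))
```

Seen this way, BLOCK_ONLY_HIGH is the most permissive of the blocking thresholds: a stricter setting such as BLOCK_MEDIUM_AND_ABOVE would also filter the medium-rated content that passes here.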
The generate() function creates an instance of the GenerativeModel class, passing the MODEL parameter (in this example, Gemini 1.5 Flash). It then calls the generate_content() method on the model instance, providing the prompt, generation configuration, and safety settings. The stream=False parameter indicates that the generation should happen in a non-streaming mode, meaning the entire response will be generated and returned at once:
def generate(prompt):
    model = GenerativeModel(MODEL)
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    return responses
The generated response is stored in the responses variable, which is then returned by the generate() function.
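Once generate() returns, the calling code still has to pull the text out of the response object. The sketch below shows one way this might be done, together with a streaming variant for comparison. The helper names response_text and stream_text are hypothetical, and the sketch assumes that the response exposes a .text attribute that raises ValueError when no usable text is available (for example, when the output was filtered), as the Vertex AI Python SDK is understood to do.

```python
def response_text(response, fallback=""):
    """Return the generated text, or `fallback` if the response carries none."""
    try:
        return response.text
    except ValueError:
        # Assumption: `.text` raises ValueError when the response has no
        # usable candidate text, e.g. because the safety filter blocked it.
        return fallback


def stream_text(model, prompt, **kwargs):
    """Streaming counterpart: request chunks and join their text as they arrive."""
    chunks = model.generate_content([prompt], stream=True, **kwargs)
    return "".join(response_text(chunk) for chunk in chunks)


# Hypothetical usage with the generate() function from this section:
# answer = response_text(generate(prompt), fallback="No answer was generated.")
```

The non-streaming path is simpler and fits a pipeline that post-processes the whole answer; streaming mainly helps when partial output should be shown to the user while generation is still in progress.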
By submitting the formatted prompt to Vertex AI’s API endpoint for Gemini 1.5 Flash with the generation configuration and safety settings shown above, this RAG pipeline obtains a response that is grounded in the retrieved context, tailored to the user’s query, and subject to the specified generation parameters and content filtering rules.