Inference
After constructing the prompt with the retrieved context and the user's query, the next step is to submit the formatted prompt to Vertex AI's API endpoint, where it is processed by Gemini 1.5 Flash. This is where the actual generation of the response takes place. In the following code snippet, the generate() function is responsible for sending the prompt to the Gemini 1.5 Flash model and obtaining the generated response:
# This is the section where we submit the full prompt and
# context to the LLM
result = generate(prompt)
The generate() function encapsulates the configuration and settings required for the generation process. It includes two main components: generation_config and safety_settings.
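A minimal sketch of such a generate() helper, assuming the Vertex AI Python SDK (the vertexai package), might look like the following. The project ID, region, and parameter values shown here are illustrative placeholders rather than the exact configuration used in this example:

import vertexai
from vertexai.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
)

def generate(prompt: str) -> str:
    # Assumed project and region; replace with your own values.
    vertexai.init(project="your-project-id", location="us-central1")
    model = GenerativeModel("gemini-1.5-flash")

    # Parameters controlling output length and sampling behavior
    # (illustrative values).
    generation_config = {
        "max_output_tokens": 8192,
        "temperature": 0.2,
        "top_p": 0.95,
    }

    # Thresholds that filter potentially harmful content in the response
    # (illustrative values).
    safety_settings = {
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    }

    # Send the prompt to Gemini 1.5 Flash and return the generated text.
    response = model.generate_content(
        prompt,
        generation_config=generation_config,
        safety_settings=safety_settings,
    )
    return response.text

With a helper along these lines, the earlier call result = generate(prompt) returns the model's text response directly.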
The generation_config dictionary specifies the parameters that control the behavior of the language model during the generation process. In this example, the following settings are provided:
generation_config = {
"max_output_tokens"...