Data generation
QAPairGenerator ¶
Bases: BaseModel
Generates questions and answers for sections and subsections of a document.
Parameters:

Name | Type | Description | Default
---|---|---|---
`openai_key` | `str` | The API key for OpenAI. | required
`openai_model` | `str` | The name of the OpenAI model to use. | required
`seed` | `int` | The seed for the random number generator. Defaults to 42. | required
Source code in docqa/core/data_generation.py
sanitize_output_format staticmethod ¶

This static method takes an output of type `dict` or `list` and returns a sanitized `list[dict]`.
Parameters:

Name | Type | Description | Default
---|---|---|---
`output` | `dict \| list` | The input. | required
Returns:

Type | Description
---|---
`list[dict]` | The sanitized output as a list of dictionaries.
Raises:

Type | Description
---|---
`ValueError` | If the `output` cannot be sanitized into a `list[dict]`.
Source code in docqa/core/data_generation.py
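The exact sanitization rule lives in docqa/core/data_generation.py; a minimal sketch of the likely behavior, assuming a single `dict` is wrapped in a one-element list and a `list` must already contain dictionaries:

```python
def sanitize_output_format(output):
    """Illustrative sketch: normalize a dict or list into list[dict]."""
    if isinstance(output, dict):
        # A single mapping becomes a one-element list.
        return [output]
    if isinstance(output, list) and all(isinstance(item, dict) for item in output):
        return output
    raise ValueError(f"Cannot sanitize output of type {type(output).__name__}")
```

Under these assumptions, `sanitize_output_format({"q": "…"})` yields `[{"q": "…"}]`, while a string input raises `ValueError`.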
process ¶
Process the given document to generate a list of questions and answers.
Parameters:

Name | Type | Description | Default
---|---|---|---
`document` | `str` | The text document to process. | required
`temperature` | `float` | The temperature parameter for controlling the randomness of the output. Defaults to 1.0. | `1.0`
`question_type` | `str` | The type of questions to generate. Defaults to "dense". | `'dense'`
`num_questions` | `int` | The number of questions to generate. Defaults to 5. | `5`
Returns:

Type | Description
---|---
`tuple[list[dict[str, str]], list[dict]]` | A tuple containing a list of questions and answers and a list of metadata.
Raises:

Type | Description
---|---
`ValueError` | If an invalid question type is provided.
Source code in docqa/core/data_generation.py
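The return value pairs each QA dict with a metadata dict, so the two lists can be walked in lockstep. A sketch with hand-written data in the documented shape — the keys inside each QA dict (`question`, `answer`) and inside the metadata are assumptions, not taken from the source:

```python
# Hypothetical return value mirroring tuple[list[dict[str, str]], list[dict]].
qa_pairs = [
    {"question": "What does the document describe?", "answer": "A QA pipeline."},
]
metadata = [{"finish_reason": "stop"}]

# Walk the QA pairs and their per-item metadata together.
for qa, meta in zip(qa_pairs, metadata):
    print(f"{qa['question']} -> {qa['answer']} ({meta['finish_reason']})")
```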
AnswerGenerator ¶
Bases: BaseModel
Generate an answer to a question based on a reference.
Parameters:

Name | Type | Description | Default
---|---|---|---
`openai_key` | `str` | The OpenAI API key. | required
`openai_model` | `str` | The name of the OpenAI model to use. | required
`seed` | `int` | The seed for the random number generator. Defaults to 42. | required
Source code in docqa/core/data_generation.py
process ¶
Process the given question and generate a response using the OpenAI model.
Parameters:

Name | Type | Description | Default
---|---|---|---
`question` | `str` | The question to be processed. | required
`reference` | `str` | The reference string for the instruction. | required
`temperature` | `float` | The temperature parameter for generating the response. Higher values (e.g., 1.0) make the output more random, while lower values (e.g., 0.2) make it more focused and deterministic. Defaults to 1.0. | `1.0`
Returns:

Type | Description
---|---
`tuple[str, dict]` | A tuple containing the generated answer and metadata.
Output dict structure

- answer (str): The generated answer as a string.
- metadata (dict): Additional metadata about the response.
  - finish_reason (str): The reason why the completion finished.
  - usage (dict): Usage statistics of the completion.
    - completed_tokens (int): The number of tokens used for completion.
    - prompt_tokens (int): The number of tokens used for the prompt.
    - total_tokens (int): The total number of tokens used.
Source code in docqa/core/data_generation.py
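Given the documented metadata layout, the usage fields can be sanity-checked against each other. The literal values below are made up for illustration; only the key names come from the structure above:

```python
# Example metadata in the documented shape (token counts are illustrative).
metadata = {
    "finish_reason": "stop",
    "usage": {
        "completed_tokens": 120,
        "prompt_tokens": 80,
        "total_tokens": 200,
    },
}

# The total should equal prompt tokens plus completion tokens.
usage = metadata["usage"]
assert usage["total_tokens"] == usage["completed_tokens"] + usage["prompt_tokens"]
```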
generate_top_sections_questions ¶
Generate the top sections with questions based on the provided document tree.
Parameters:

Name | Type | Description | Default
---|---|---|---
`doc_tree` | `dict` | The document tree representing the sections of the document. | required
`output_file` | `Path` | The path to the output file where the top sections with questions will be saved. | required
`openai_key` | `str` | The OpenAI API key. Defaults to an empty string. | `''`
`openai_model` | `str` | The OpenAI model to use for question generation. Defaults to an empty string. | `''`
`seed` | `int` | The seed value for random number generation. Defaults to 42. | `42`
`temperature` | `float` | The temperature parameter for question generation. Defaults to 1.0. | `1.0`
Returns:

Name | Type | Description
---|---|---
`dict` | `dict` | The top sections with questions.
Source code in docqa/core/data_generation.py
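The exact `doc_tree` schema is not documented here; one plausible nested-sections layout, with per-section text and subsections, might look like the following. All field names are assumptions for illustration:

```python
# Hypothetical doc_tree shape: sections keyed by title, each carrying its
# text and a dict of subsections. Field names are assumptions.
doc_tree = {
    "Introduction": {"text": "…", "subsections": {}},
    "Methods": {
        "text": "…",
        "subsections": {"Data": {"text": "…", "subsections": {}}},
    },
}

def count_sections(tree):
    """Count sections at every depth of the hypothetical tree."""
    total = 0
    for node in tree.values():
        total += 1 + count_sections(node.get("subsections", {}))
    return total

print(count_sections(doc_tree))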
generate_long_answers_for_sections_questions ¶
Generate long answers for sections' questions.
Parameters:

Name | Type | Description | Default
---|---|---|---
`sections_with_questions` | `dict` | A dictionary containing sections with their corresponding questions. | required
`output_file` | `Path` | The path to the output file where the generated long answers will be stored. | required
`openai_key` | `str` | The API key for OpenAI. Defaults to an empty string. | `''`
`openai_model` | `str` | The name of the OpenAI model to use. Defaults to an empty string. | `''`
`seed` | `int` | The seed value for random number generation. Defaults to 42. | `42`
`temperature` | `float` | The temperature parameter for generating answers. Defaults to 1.0. | `1.0`
Returns:

Name | Type | Description
---|---|---
`dict` | `dict` | A dictionary containing sections with their corresponding questions and generated long answers.
Source code in docqa/core/data_generation.py
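The `sections_with_questions` layout is likewise not spelled out; a plausible shape, with the answer list filled in alongside each section's questions, could be sketched as below. All field names are assumptions:

```python
# Hypothetical sections_with_questions shape; field names are assumptions.
sections_with_questions = {
    "Methods": {
        "questions": ["How is the data collected?"],
        "answers": [],
    }
}

# After answer generation, each question would gain a matching long answer.
for section in sections_with_questions.values():
    for question in section["questions"]:
        section["answers"].append(f"(long answer for: {question})")
```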
make_simple_sample_for_openai ¶
Generates a simple sample for OpenAI chat conversation.
Parameters:

Name | Type | Description | Default
---|---|---|---
`question` | `str` | The user's question. | required
`answer` | `str` | The assistant's answer. | required
Returns:

Name | Type | Description
---|---|---
`dict` | `dict` | A dictionary containing the chat conversation sample.
Example

    make_simple_sample_for_openai("What is the capital of France?", "Paris")
Source code in docqa/core/data_generation.py
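The returned dictionary's structure is not shown here; a sketch assuming the standard OpenAI chat-messages format (the real template is in docqa/core/data_generation.py):

```python
def make_simple_sample_for_openai(question, answer):
    """Sketch: wrap a QA pair in the OpenAI chat-completion message format."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

sample = make_simple_sample_for_openai("What is the capital of France?", "Paris")
```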
make_instruction_sample_for_openai ¶
Generates an instruction sample for OpenAI chat conversation.
Parameters:

Name | Type | Description | Default
---|---|---|---
`question` | `str` | The question to be used in the instruction. | required
`answer` | `str` | The answer to be used in the instruction. | required
`references` | `list[str]` | A list of reference texts to be included in the instruction. | required
Returns:

Name | Type | Description
---|---|---
`dict` | `dict` | A dictionary containing the chat conversation sample.
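A sketch of how the references might be folded into the user turn of the chat sample. The prompt wording below is an assumption for illustration; the real template lives in docqa/core/data_generation.py:

```python
def make_instruction_sample_for_openai(question, answer, references):
    """Sketch: build a chat sample whose user turn embeds reference texts."""
    context = "\n\n".join(references)
    user_content = (
        f"Answer using the references below.\n\n{context}\n\nQuestion: {question}"
    )
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": answer},
        ]
    }

sample = make_instruction_sample_for_openai(
    "What is the boiling point of water?",
    "100 °C at sea level.",
    ["Water boils at 100 °C at standard atmospheric pressure."],
)
```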