Create dataset
create_openai_dataset
Generate a dataset for OpenAI based on the given sections QA data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sections_qa_data_flatten` | `dict` | A dictionary containing the flattened sections QA data. | *required* |
| `section_type` | `Literal['main', 'summary', 'metadata', 'extra']` | The type of section to include in the dataset. Defaults to `"main"`. | `'main'` |
| `question_type` | `Literal['dense', 'sparse']` | The type of question to include in the dataset. Defaults to `"dense"`. | `'dense'` |
| `answer_type` | `Literal['long', 'short']` | The type of answer to include in the dataset. Defaults to `"long"`. | `'long'` |
| `prompt_type` | `Literal['instruction', 'simple']` | The type of prompt to use in the dataset. Defaults to `"instruction"`. | `'instruction'` |
Returns:

| Type | Description |
|---|---|
| `list[dict]` | The generated dataset for OpenAI. |
Note
- Only sections that exist in the sections QA data are included in the dataset.
- For each included section, questions and answers are selected according to the specified question type and answer type.
- Depending on the prompt type, a different sample generation function is used to build each sample.
- The resulting dataset is a list of dictionaries, where each dictionary represents one sample (see the usage sketch below).
Source code in docqa/demo/create_dataset.py
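A minimal usage sketch for `create_openai_dataset`, assuming the import path implied by the source file above; the internal layout of `sections_qa_data_flatten` shown here is an illustrative assumption, not the actual schema.

```python
# Usage sketch (assumed import path and assumed input layout).
from docqa.demo.create_dataset import create_openai_dataset

# Hypothetical flattened QA data: keyed by section type, with questions and
# answers grouped by the documented question/answer variants.
sections_qa_data_flatten = {
    "main": {
        "questions": {"dense": ["What problem does the document address?"]},
        "answers": {"long": ["The document addresses ..."]},
    },
}

dataset = create_openai_dataset(
    sections_qa_data_flatten,
    section_type="main",
    question_type="dense",
    answer_type="long",
    prompt_type="instruction",
)

# Each sample is a dict produced by the "instruction" sample generator.
print(dataset[0])
```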
pdf_to_qa_data
Generates a QA data dictionary from a PDF file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `output_dir` | `Path` | The directory where the output files will be saved. | *required* |
| `pdf_file` | `Path` | The path to the PDF file. | *required* |
Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `dict` | The generated QA data dictionary. |
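A minimal usage sketch for `pdf_to_qa_data`, assuming it lives in the same module as `create_openai_dataset`; the paths below are placeholders.

```python
# Usage sketch (assumed import path; placeholder paths).
from pathlib import Path

from docqa.demo.create_dataset import pdf_to_qa_data

output_dir = Path("outputs")           # where output files will be written
pdf_file = Path("papers/example.pdf")  # hypothetical input PDF

qa_data = pdf_to_qa_data(output_dir=output_dir, pdf_file=pdf_file)

# The return value is a plain dict of QA data; inspect its top-level keys.
print(sorted(qa_data))
```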