Markdown
MarkdownTidier ¶
Bases: BaseModel
Tidies the given markdown text using OpenAI's model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
openai_key |
str
|
The OpenAI API key. |
required |
openai_model |
str
|
The OpenAI model to use. |
required |
seed |
int
|
The seed for the random number generator. Defaults to 42. |
required |
system_message |
str
|
The system message for the OpenAI model. |
required |
instruction |
str
|
The instruction for the OpenAI model. |
required |
api_client |
OpenAI
|
The OpenAI client. |
required |
Source code in docqa/core/markdown.py
process ¶
Generates a response to a given markdown text using the OpenAI chat model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
markdown_text |
str
|
The input markdown text to generate a response for. |
required |
temperature |
float
|
The temperature of the model's output. Higher values make the output more random, while lower values make it more focused and deterministic. Defaults to 0.7. |
0.7
|
Returns:
Name | Type | Description |
---|---|---|
str |
str
|
The generated response text. |
dict |
dict
|
Metadata about the completion process, including the finish reason and token usage. |
Example
Source code in docqa/core/markdown.py
find_highest_markdown_heading_level ¶
Takes a list of lines representing a markdown file as input. Finds the highest level of heading and returns it as an integer. Returns None if the text contains no headings.
Source
https://github.com/nestordemeure/question_extractor/blob/main/question_extractor/markdown.py
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lines |
list of str
|
A list of lines in the markdown file. |
required |
Returns:
Type | Description |
---|---|
int | None
|
int | None: The highest heading level as an integer, or None if no headings are found. |
Source code in docqa/core/markdown.py
pdf_to_markdown ¶
Converts a PDF file to Markdown format and saves the result to an output file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pdf_file |
Path
|
The path to the PDF file to be converted. |
required |
output_file |
Path
|
The path to the output file where the converted Markdown will be saved. |
required |
max_pages |
int | None
|
The maximum number of pages to convert. Defaults to None. |
None
|
parallel_factor |
int
|
The number of parallel processes to use for conversion. Defaults to 1. |
1
|
cache_dir |
Path
|
The directory to use for caching the conversion |
Path('.cache/pdf_to_markdown/')
|
Returns:
Name | Type | Description |
---|---|---|
str |
str
|
The converted Markdown text. |
Source code in docqa/core/markdown.py
filter_empty_sections ¶
Filters out empty sections from a list of tuples.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sections |
list[tuple[str, str]]
|
A list of tuples representing sections, where each tuple contains a heading (str) and content (str). |
required |
Returns:
Type | Description |
---|---|
list[tuple[str, str]]
|
list[tuple[str, str]]: A list of tuples representing non-empty sections, where each tuple contains a heading (str) and content (str). |
Source code in docqa/core/markdown.py
merge_abstract_with_previous_sections ¶
If found an Abstract section then assume it's a research paper and merge it with all previous sections, this is because the authors section might have more column thus messes up the parsed order
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sections |
list[tuple[str, str]]
|
A list of tuples representing sections, where each tuple contains a heading (str) and content (str). |
required |
Returns:
Type | Description |
---|---|
list[tuple[str, str]]: A list of tuples representing merged sections, where each tuple contains a heading (str) and content (str). |
Source code in docqa/core/markdown.py
preprocess_sections ¶
Preprocesses the given list of sections by filtering out any empty sections and merging any abstract sections with their previous sections.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sections |
List[Tuple[str, str]]
|
A list of tuples representing sections. Each tuple contains two strings: the title of the section and the content of the section. |
required |
Returns:
Type | Description |
---|---|
list[tuple[str, str]]
|
List[Tuple[str, str]]: A list of tuples representing the preprocessed sections. Each tuple contains two strings: the title of the section and the content of the section. |
Source code in docqa/core/markdown.py
text_similarity_score ¶
Compute the similarity score between two texts.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text1 |
str
|
The first text. |
required |
text2 |
str
|
The second text. |
required |
Returns:
Name | Type | Description |
---|---|---|
float |
float
|
The similarity score between the two texts. |
Source code in docqa/core/markdown.py
heading_similarity_score ¶
Calculate the similarity score between two headings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
heading1 |
str
|
The first heading. |
required |
heading2 |
str
|
The second heading. |
required |
Returns:
Name | Type | Description |
---|---|---|
float |
float
|
The similarity score between the two headings. |
Source code in docqa/core/markdown.py
preserve_content ¶
Calculate the similarity between the given heading and new heading using a threshold. If the similarity score is above the threshold, the new text still contains the heading, so the content after the heading is extracted as the new content. If the similarity score is below the threshold, the new text does not contain the heading, so the entire new text is considered as the new content. Calculate the similarity between the old content and new content using a threshold. If the similarity score is above the threshold, the content is considered preserved and the new content along with its similarity score is returned. If the similarity score is below the threshold, the content has been modified too much and the old content along with its similarity score is returned.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
heading |
str
|
The heading of the old text. |
required |
old_content |
str
|
The content of the old text. |
required |
new_text |
str
|
The new text. |
required |
heading_similarity_threshold |
float
|
The threshold for heading similarity. Defaults to 0.7. |
0.7
|
content_similarity_threshold |
float
|
The threshold for content similarity. Defaults to 0.8. |
0.8
|
Returns:
Type | Description |
---|---|
tuple[str, float]
|
tuple[str, float]: A tuple containing the new content and its similarity score. |
Source code in docqa/core/markdown.py
tidy_markdown_sections ¶
Tidies up sections of markdown text by splitting them into heading and content, and then processing each section using the MarkdownTidier class. It takes a list of tuples representing the sections, where each tuple contains a heading and content. The function also accepts optional parameters such as the maximum length of the tidied sections, the OpenAI API key, the OpenAI model to use, a seed value for reproducibility, and thresholds for heading and content similarity.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sections |
list[tuple[str, str]]
|
A list of tuples representing the sections of markdown text. Each tuple contains a heading and content. |
required |
max_length |
int
|
The maximum length of the tidied sections. Defaults to 4096. |
4096
|
openai_key |
str
|
The OpenAI API key. Defaults to "". |
''
|
openai_model |
str
|
The OpenAI model to use. Defaults to "". |
''
|
seed |
int
|
A seed value for reproducibility. Defaults to 42. |
42
|
heading_similarity_threshold |
float
|
The threshold for heading similarity. Defaults to 0.7. |
0.7
|
content_similarity_threshold |
float
|
The threshold for content similarity. Defaults to 0.8. |
0.8
|
Returns:
Type | Description |
---|---|
tuple[list[tuple[str, str]], list[dict]]
|
tuple[list[tuple[str, str]], list[dict]]: A tuple containing the tidied sections and a list of metadata for each section. |