Doc tree
build_doc_tree_from_markdown
Takes a string representation of a markdown file as input. Finds the highest level of heading and splits the text into sections accordingly. Returns a list of tuples, each containing the section title and section content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The content of a markdown file as a single string. | required |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | A dictionary containing the tree structure of the markdown file. |
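The splitting step described above (find the highest heading level, then cut the text at each heading of that level) can be illustrated with a minimal sketch. This is not the docqa implementation: the name `split_by_top_heading` and the regular expression are assumptions made for illustration, and the sketch ignores `#` lines that occur inside fenced code blocks.

```python
import re

# ATX-style markdown heading: 1 to 6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_top_heading(text: str) -> list[tuple[str, str]]:
    """Split markdown into (title, content) pairs at the highest heading level.

    Hypothetical helper for illustration only; not part of docqa.
    """
    lines = text.splitlines()
    levels = [len(m.group(1)) for line in lines if (m := HEADING_RE.match(line))]
    if not levels:
        # No headings: treat the whole text as a single untitled section.
        return [("", text.strip())]
    top = min(levels)  # fewest '#' characters means the highest level

    # The first entry collects any text that precedes the first top-level heading.
    sections: list[tuple[str, list[str]]] = [("", [])]
    for line in lines:
        m = HEADING_RE.match(line)
        if m and len(m.group(1)) == top:
            sections.append((m.group(2), []))
        else:
            # Lower-level headings stay inside the current section's content.
            sections[-1][1].append(line)

    return [
        (title, "\n".join(body).strip())
        for title, body in sections
        if title or any(part.strip() for part in body)
    ]
```

Applying the same split recursively to each section's content would be one way to obtain a nested tree rather than a flat list.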
Source code in docqa/core/doc_tree.py
build_doc_tree_from_pdf
Generate a document tree from a PDF file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_file | Path | The path to the input PDF file. | required |
output_dir | Path | The directory where the output files will be saved. | required |
Notes
- The function first checks if the marker output file exists in the output directory.
- If the marker output file exists, it reads the content of the file.
- If the marker output file does not exist, it converts the input PDF file to markdown using the `pdf_to_markdown` function.
- The function then checks if the tidy text sections file exists in the output directory.
- If the tidy text sections file exists, it reads the content of the file.
- If the tidy text sections file does not exist, it builds a document tree from the marker markdown content using the `build_doc_tree_from_markdown` function.
- The function flattens the document tree using the `flatten_doc_tree` function.
- It preprocesses the sections using the `preprocess_sections` function.
- The function then tidies the markdown sections and retrieves the metadata using the `tidy_markdown_sections` function.
- Finally, it saves the tidy text sections to a file, writes the tidy markdown content to a file, and saves the metadata to a file.
Returns:
Name | Type | Description |
---|---|---|
dict | dict | The final document tree generated from the PDF. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If the marker output file or tidy text sections file does not exist. |
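The notes describe a two-stage "reuse the file if it exists, otherwise compute and save" flow. A minimal sketch of that pattern follows. It is not the docqa code: `load_or_build`, `run_pipeline`, the file names, and the use of JSON are all assumptions, and the conversion and tree-building steps are passed in as plain callables rather than calling the real `pdf_to_markdown` or `build_doc_tree_from_markdown`.

```python
import json
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_file: Path, build: Callable[[], Any]) -> Any:
    """Return the cached JSON value if cache_file exists; otherwise build and cache it.

    Hypothetical helper for illustration only; not part of docqa.
    """
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    value = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(value), encoding="utf-8")
    return value


def run_pipeline(
    output_dir: Path,
    convert: Callable[[], str],
    build_sections: Callable[[str], Any],
) -> Any:
    """Chain the two cached stages described in the notes.

    The file names below are placeholders, not the names docqa uses.
    """
    # Stage 1: PDF to markdown, skipped when the marker output already exists.
    markdown = load_or_build(output_dir / "marker_output.json", convert)
    # Stage 2: markdown to tidy sections, skipped when the sections file exists.
    return load_or_build(
        output_dir / "tidy_sections.json",
        lambda: build_sections(markdown),
    )
```

With this shape, a second call on the same output directory reads both files and performs no conversion work, which is the behavior the notes describe.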
Source code in docqa/core/doc_tree.py
flatten_doc_tree
Recursively flattens a nested dictionary representing a document tree.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | dict | The root node of the document tree. | required |
Returns:
Name | Type | Description |
---|---|---|
list | list | A list of tuples representing the flattened document tree. Each tuple contains a heading and its corresponding text. |
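As an illustration of recursive flattening, the sketch below walks a nested dictionary depth-first and emits one (heading, text) tuple per node. The node layout is an assumption: the keys `title`, `text`, and `children` are hypothetical and may not match the structure docqa actually uses, and `flatten_tree` is not the library function.

```python
def flatten_tree(node: dict) -> list[tuple[str, str]]:
    """Depth-first flatten of a nested section dict into (heading, text) pairs.

    Assumes hypothetical keys "title", "text", and "children".
    """
    flat = [(node.get("title", ""), node.get("text", ""))]
    for child in node.get("children", []):
        # Children follow their parent, preserving document order.
        flat.extend(flatten_tree(child))
    return flat


tree = {
    "title": "Guide",
    "text": "Overview.",
    "children": [
        {"title": "Install", "text": "Run the installer.", "children": []},
        {
            "title": "Usage",
            "text": "Basic usage.",
            "children": [{"title": "Flags", "text": "All flags.", "children": []}],
        },
    ],
}
flat_sections = flatten_tree(tree)
```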
Source code in docqa/core/doc_tree.py
get_section_full_text
Retrieves the full text of a section from a given dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
section | dict | The section to retrieve the full text from. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The full text of the section. |
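One plausible reading of "full text" is the section's own text joined with the text of all nested subsections. The sketch below shows that reading only. The keys `title`, `text`, and `children` are hypothetical, as is the choice to include headings in the output; `section_full_text` is not the docqa function.

```python
def section_full_text(section: dict) -> str:
    """Join a section's heading and text with those of its nested subsections.

    Assumes hypothetical keys "title", "text", and "children".
    """
    parts = [section.get("title", ""), section.get("text", "")]
    parts.extend(section_full_text(child) for child in section.get("children", []))
    # Skip empty pieces so missing titles or bodies leave no blank gaps.
    return "\n\n".join(part for part in parts if part)


example_section = {
    "title": "Intro",
    "text": "Hello.",
    "children": [{"title": "Scope", "text": "Details.", "children": []}],
}
full_text = section_full_text(example_section)
```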