Doc tree

build_doc_tree_from_markdown

build_doc_tree_from_markdown(text)

Takes the content of a markdown file as a single string. Finds the highest heading level present and splits the text into sections at that level, then recurses into each section body to build nested subsections. Any text before the first heading becomes the node's own text. Returns a nested dictionary of the following form:

{
    "heading": "Section 1",
    "text": "Section 1 opening text",
    "child_sections": [
        {
            "heading": "Section 1.1",
            "text": "Section 1.1 opening text",
            "child_sections": [
                ...
            ]
        },
        ...
    ]
}
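
Because every node has the same shape, the tree can be walked with a short recursive function. The sketch below is illustrative only: `iter_headings` is a hypothetical helper, not part of docqa, and the tree is written by hand rather than produced by the function. The headings carry their `#` prefix because the source listing stores the stripped heading line as written.

```python
from typing import Iterator


def iter_headings(node: dict, depth: int = 0) -> Iterator[tuple[int, str]]:
    """Yield (depth, heading) pairs in document order (hypothetical helper)."""
    if node.get("heading"):
        yield depth, node["heading"]
    # Nodes built from heading-less text may have no "child_sections" key.
    for child in node.get("child_sections", []):
        yield from iter_headings(child, depth + 1)


# Hand-written tree in the documented shape (not real output).
tree = {
    "heading": "",
    "text": "Preamble",
    "child_sections": [
        {
            "heading": "# Section 1",
            "text": "Section 1 opening text",
            "child_sections": [
                {"heading": "## Section 1.1", "text": "Section 1.1 opening text"},
            ],
        },
    ],
}

outline = list(iter_headings(tree))
for depth, heading in outline:
    print("  " * depth + heading)
```

Using `node.get("child_sections", [])` rather than indexing keeps the walk safe for leaf nodes that lack the key.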

Parameters:

Name	Type	Description	Default
`text`	`str`	The content of a markdown file as a single string.	required

Returns:

Name	Type	Description
`dict`	`dict`	A dictionary containing the tree structure of the markdown file.

Source code in docqa/core/doc_tree.py

def build_doc_tree_from_markdown(
    text: str,
) -> dict:
    """
    Takes a string representation of a markdown file as input.
    Finds the highest level of heading and splits the text into sections accordingly.
    Returns a nested dictionary of the following form:

    ```python
    {
        "heading": "Section 1",
        "text": "Section 1 opening text",
        "child_sections": [
            {
                "heading": "Section 1.1",
                "text": "Section 1.1 opening text",
                "child_sections": [
                    ...
                ]
            },
            ...
        ]
    }
    ```

    Args:
        text (str): The content of a markdown file as a single string.

    Returns:
        dict: A dictionary containing the tree structure of the markdown file.

    """
    lines = text.strip().split("\n")

    # Find the highest heading level
    highest_heading_level = find_highest_markdown_heading_level(lines)

    # If there are no headings, return the text as a single section
    if highest_heading_level is None:
        return {"heading": "", "text": text}

    # Construct the heading prefix for splitting
    headings_prefix = ("#" * highest_heading_level) + " "

    n = len(lines)
    i = 0
    opening_text_lines = []
    while i < n and not lines[i].startswith(headings_prefix):
        opening_text_lines.append(lines[i])
        i += 1

    root = {
        "heading": "",
        "text": "\n".join(opening_text_lines).strip(),
        "child_sections": [],
    }

    current_section_title = ""
    current_section_lines: list[str] = []

    # Split the text at the highest heading level
    while i < n:
        line = lines[i]
        # Check if the line starts with the highest heading level prefix
        if line.startswith(headings_prefix):
            # If the current_section is not empty, add it to the sections list
            if len(current_section_lines) > 0:
                current_section_body = "\n".join(current_section_lines).strip()
                child_section = build_doc_tree_from_markdown(current_section_body)
                child_section["heading"] = current_section_title
                root["child_sections"].append(child_section)  # type: ignore

            # Update the current_section_title and clear the current_section
            current_section_title = line.strip()
            current_section_lines = []
        else:
            # Add the line to the current_section
            current_section_lines.append(line)
        i += 1

    # Add the last section to the sections list (if not empty)
    if len(current_section_lines) > 0:
        current_section_body = "\n".join(current_section_lines).strip()
        child_section = build_doc_tree_from_markdown(current_section_body)
        child_section["heading"] = current_section_title
        root["child_sections"].append(child_section)  # type: ignore[attr-defined]

    return root
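
The listing calls `find_highest_markdown_heading_level`, which is not shown on this page. The following is a minimal sketch of what such a helper might look like, assuming it returns the smallest number of leading `#` characters among the heading lines, or `None` when there are no headings. The regular expression and the handling of edge cases are assumptions; the real helper in docqa may differ.

```python
import re
from typing import Optional

# Assumption: a heading is one to six '#' characters followed by a space.
_HEADING_RE = re.compile(r"^(#{1,6}) ")


def find_highest_markdown_heading_level(lines: list[str]) -> Optional[int]:
    """Return the smallest heading depth found in `lines`, or None if none exist.

    Sketch only: the real helper may differ, for example in how it treats
    '#' lines inside fenced code blocks.
    """
    levels = [len(m.group(1)) for line in lines if (m := _HEADING_RE.match(line))]
    return min(levels) if levels else None


print(find_highest_markdown_heading_level(["intro", "## A", "### A.1", "## B"]))  # 2
print(find_highest_markdown_heading_level(["plain text only"]))  # None
```

With a result of 2, the listing builds the prefix `"## "`, so only level-2 lines split the text; deeper headings stay inside each section body and are handled by the recursive call.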

build_doc_tree_from_pdf

build_doc_tree_from_pdf(input_file, output_dir)

Generate a document tree from a PDF file.

Parameters:

Name	Type	Description	Default
`input_file`	`Path`	The path to the input PDF file.	required
`output_dir`	`Path`	The directory where the output files will be saved.	required

Notes

- The function first looks for the marker output file (`marker_output.md`) in the output directory. If it exists, its content is read; otherwise the input PDF is converted to markdown with `pdf_to_markdown`.
- It then looks for the tidy text sections file (`tidy_text_sections.json`). If it exists, the sections are loaded from it and joined into the tidy markdown.
- Otherwise, it builds a document tree from the marker markdown with `build_doc_tree_from_markdown`, flattens it with `flatten_doc_tree`, preprocesses the sections with `preprocess_sections`, and tidies them with `tidy_markdown_sections`, which also returns metadata. The tidy text sections, tidy markdown, and metadata are each saved to a file.
- Finally, it builds the document tree from the tidy markdown, saves it to `doc_tree.json`, and returns it.
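
The notes above describe a read-the-file-or-recompute pattern applied to intermediate outputs. A minimal, self-contained sketch of that pattern follows; `load_or_compute` and `expensive_step` are hypothetical names used for illustration and do not exist in docqa.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_compute(cache_file: Path, compute: Callable[[], list]) -> list:
    """Return the cached JSON value if present, else compute and cache it."""
    if cache_file.exists():
        with open(cache_file, "r", encoding="utf-8") as f:
            return json.load(f)
    value = compute()
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(value, f, indent=4)
    return value


calls = []


def expensive_step() -> list:
    calls.append(1)  # record each real computation
    return ["# Title\n\nBody"]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "tidy_text_sections.json"
    first = load_or_compute(cache, expensive_step)   # computes and writes the file
    second = load_or_compute(cache, expensive_step)  # reads the cached file

print(first == second, len(calls))  # True 1
```

One consequence of this design is that deleting an intermediate file from the output directory forces that stage to run again on the next call.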

Returns:

Name	Type	Description
`dict`	`dict`	The final document tree generated from the PDF.

Raises:

Type	Description
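`FileNotFoundError`	May be raised if a file under `output_dir` cannot be opened, for example because the directory does not exist. A missing marker output file or tidy text sections file does not raise; both are regenerated.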

Source code in docqa/core/doc_tree.py

def build_doc_tree_from_pdf(input_file: Path, output_dir: Path) -> dict:
    """
    Generate a document tree from a PDF file.

    Args:
        input_file (Path): The path to the input PDF file.
        output_dir (Path): The directory where the output files will be saved.

    Notes:
        - The function first looks for the marker output file (`marker_output.md`)
            in the output directory. If it exists, its content is read; otherwise
            the input PDF is converted to markdown with `pdf_to_markdown`.
        - It then looks for the tidy text sections file
            (`tidy_text_sections.json`). If it exists, the sections are loaded
            from it and joined into the tidy markdown.
        - Otherwise, it builds a document tree from the marker markdown with
            `build_doc_tree_from_markdown`, flattens it with `flatten_doc_tree`,
            preprocesses the sections with `preprocess_sections`, and tidies them
            with `tidy_markdown_sections`, which also returns metadata. The tidy
            text sections, tidy markdown, and metadata are each saved to a file.
        - Finally, it builds the document tree from the tidy markdown, saves it to
            `doc_tree.json`, and returns it.

    Returns:
        dict: The final document tree generated from the PDF.

    Raises:
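        FileNotFoundError: May be raised if a file under `output_dir` cannot be
            opened, for example because the directory does not exist. A missing
            marker output file or tidy text sections file does not raise; both
            are regenerated.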
    """
    marker_output_file = output_dir / "marker_output.md"

    if marker_output_file.exists():
        with open(marker_output_file, "r", encoding="utf-8") as f:
            marker_markdown = f.read()
    else:
        cache_dir = output_dir / "pdf_to_markdown_cache/"
        marker_markdown = pdf_to_markdown(
            input_file, marker_output_file, cache_dir=cache_dir
        )

    tidy_text_sections_file = output_dir / "tidy_text_sections.json"
    tidy_markdown_file = output_dir / "tidy_output.md"

    if tidy_text_sections_file.exists():
        with open(tidy_text_sections_file, "r", encoding="utf-8") as f:
            tidy_text_sections = json.load(f)
        tidy_markdown = "\n\n".join(tidy_text_sections)
    else:
        doc_tree = build_doc_tree_from_markdown(marker_markdown)
        sections = flatten_doc_tree(doc_tree)
        sections = preprocess_sections(sections)
        tidy_sections, all_metadata = tidy_markdown_sections(
            sections,
            openai_key=os.getenv("OPENAI_API_KEY", ""),
            openai_model=os.getenv("OPENAI_MODEL", ""),
            seed=int(os.getenv("SEED", "42")),
        )

        tidy_text_sections = [
            f"{heading.strip()}\n\n{content.strip()}".strip()
            for heading, content in tidy_sections
        ]
        tidy_markdown = "\n\n".join(tidy_text_sections)

        with open(tidy_text_sections_file, "w", encoding="utf-8") as f:
            json.dump(tidy_text_sections, f, indent=4)

        with open(tidy_markdown_file, "w", encoding="utf-8") as f:
            f.write(tidy_markdown)

        with open(output_dir / "tidy_metadata.json", "w", encoding="utf-8") as f:
            json.dump(all_metadata, f, indent=4)

        # Report token usage; the key names assume OpenAI-style usage metadata.
        print(
            "total tokens:",
            sum(m.get("usage", {}).get("total_tokens", 0) for m in all_metadata),
        )
        print(
            "total prompt tokens:",
            sum(m.get("usage", {}).get("prompt_tokens", 0) for m in all_metadata),
        )
        print(
            "total completion tokens:",
            sum(m.get("usage", {}).get("completion_tokens", 0) for m in all_metadata),
        )

    final_doc_tree = build_doc_tree_from_markdown(tidy_markdown)
    doc_tree_file = output_dir / "doc_tree.json"
    with open(doc_tree_file, "w", encoding="utf-8") as f:
        json.dump(final_doc_tree, f, indent=4)

    return final_doc_tree

flatten_doc_tree

flatten_doc_tree(root)

Recursively flattens a nested dictionary representing a document tree.

Parameters:

Name	Type	Description	Default
`root`	`dict`	The root node of the document tree.	required

Returns:

Name	Type	Description
`list`	`list`	A list of tuples representing the flattened document tree. Each tuple contains a heading and its corresponding text.

Source code in docqa/core/doc_tree.py

def flatten_doc_tree(root: dict) -> list:
    """
    Recursively flattens a nested dictionary representing a document tree.

    Args:
        root (dict): The root node of the document tree.

    Returns:
        list: A list of tuples representing the flattened document tree. Each tuple
            contains a heading and its corresponding text.
    """
    if root["heading"] or root["text"]:
        sections = [(root["heading"], root["text"])]
    else:
        sections = []
    for section in root.get("child_sections", []):
        sections.extend(flatten_doc_tree(section))
    return sections
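
A short worked example may help show the traversal order. The function body is copied from the listing above so that the snippet runs on its own; the input tree is hand-written for illustration.

```python
def flatten_doc_tree(root: dict) -> list:
    # Copied from the listing above so that the snippet is self-contained.
    if root["heading"] or root["text"]:
        sections = [(root["heading"], root["text"])]
    else:
        sections = []
    for section in root.get("child_sections", []):
        sections.extend(flatten_doc_tree(section))
    return sections


# Hand-written tree: an empty root with one section and two subsections.
tree = {
    "heading": "",
    "text": "",
    "child_sections": [
        {
            "heading": "# Guide",
            "text": "Intro",
            "child_sections": [
                {"heading": "## Setup", "text": "Install it."},
                {"heading": "## Usage", "text": "Run it."},
            ],
        },
    ],
}

flat = flatten_doc_tree(tree)
print(flat)
# [('# Guide', 'Intro'), ('## Setup', 'Install it.'), ('## Usage', 'Run it.')]
```

The root is skipped because both its heading and its text are empty, and each parent appears before its children, so the list preserves document order.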

get_section_full_text

get_section_full_text(section)

Retrieves the full text of a section from a given dictionary.

Parameters:

Name	Type	Description	Default
`section`	`dict`	The section to retrieve the full text from.	required

Returns:

Name	Type	Description
`str`	`str`	The full text of the section.

Source code in docqa/core/doc_tree.py

def get_section_full_text(section: dict) -> str:
    """
    Retrieves the full text of a section from a given dictionary.

    Args:
        section (dict): The section to retrieve the full text from.

    Returns:
        str: The full text of the section.
    """
    flatten_sections = flatten_doc_tree(section)
    text_sections = [
        f"{heading.strip()}\n\n{content.strip()}".strip()
        for heading, content in flatten_sections
    ]

    full_text = "\n\n".join(text_sections)

    return full_text
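
To illustrate the output format, the sketch below runs compact copies of the two listed functions on a hand-written section. The copies are condensed for brevity but are intended to behave the same as the listings; the sample section is an assumption made for the example.

```python
def flatten_doc_tree(root: dict) -> list:
    # Condensed copy of the listed function.
    sections = [(root["heading"], root["text"])] if root["heading"] or root["text"] else []
    for section in root.get("child_sections", []):
        sections.extend(flatten_doc_tree(section))
    return sections


def get_section_full_text(section: dict) -> str:
    # Condensed copy of the listed function.
    return "\n\n".join(
        f"{heading.strip()}\n\n{content.strip()}".strip()
        for heading, content in flatten_doc_tree(section)
    )


# Hand-written section with one subsection that has no text of its own.
section = {
    "heading": "# Guide",
    "text": "Intro",
    "child_sections": [
        {"heading": "## Setup", "text": "Install it."},
        {"heading": "## Empty", "text": ""},
    ],
}

full_text = get_section_full_text(section)
print(full_text)
```

Each heading is followed by a blank line and its text, and sections are separated by blank lines. The outer `strip()` removes the trailing blank line that would otherwise follow a heading with no text, such as `## Empty` here.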