Chunking
chunk_content
Generate a list of content chunks from a given string. The function splits only at line breaks and never in the middle of a sentence, so it preserves the structure of the original text as far as possible.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`content` | `str` | The input string to be chunked. | required |
`single_threshold` | `int` | The minimum length of a single chunk. Defaults to 100. | `100` |
`composite_threshold` | `int` | The maximum length of a composite chunk. Defaults to 200. | `200` |
Returns:
Type | Description |
---|---|
`list[str]` | A list of content chunks. |
Description
- This function takes a string `content` and splits it into smaller chunks based on the specified thresholds.
- It first splits the string into parts using the newline and carriage return characters as delimiters.
- It then iterates over each part and checks if the length of the part exceeds the `single_threshold`.
- If it does, it is considered a paragraph and added as a separate chunk.
- If the length of the current chunk exceeds the `composite_threshold`, it is also added as a separate chunk.
- Finally, the function returns a list of all the generated chunks.
Example
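A minimal sketch of the steps listed in the description, written as a standalone function rather than taken from the docqa source. The name `chunk_content_sketch` is hypothetical, and several details are assumptions: lengths are measured in characters, short parts are joined with a newline, and pending short parts are flushed before a long paragraph so that the original order is kept.

```python
import re


def chunk_content_sketch(
    content: str,
    single_threshold: int = 100,
    composite_threshold: int = 200,
) -> list[str]:
    """Hypothetical re-implementation of the described steps (not the docqa code)."""
    chunks: list[str] = []
    current = ""  # composite chunk accumulated from short parts

    # Split only at line boundaries (newline / carriage return).
    for part in re.split(r"[\r\n]+", content):
        part = part.strip()
        if not part:
            continue

        if len(part) > single_threshold:
            # Long part: treated as a paragraph and emitted on its own.
            # Assumption: flush pending short parts first to preserve order.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(part)
            continue

        # Short part: append it to the composite chunk being built.
        current = f"{current}\n{part}" if current else part
        if len(current) > composite_threshold:
            chunks.append(current)
            current = ""

    # Emit whatever is left over at the end of the input.
    if current:
        chunks.append(current)
    return chunks


text = "Title\nshort line\n" + "x" * 150 + "\ntail"
for chunk in chunk_content_sketch(text):
    print(len(chunk), repr(chunk[:20]))
```

With this input the two short lines are merged into one composite chunk, the 150-character line becomes its own chunk, and the trailing short line is emitted last.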
Source code in docqa/core/chunking.py
chunk_size_stats
Calculates the statistics of the chunk sizes in the given list of sections.
Description
This function calculates the statistics of the chunk sizes in the given list of sections. It iterates through each section and splits the content into paragraphs using `"\n\n"` as the delimiter. It then calculates the length of each paragraph by splitting it into words and stores the lengths in the `paragraph_lengths` list. After that, it filters out the paragraph lengths that are less than or equal to 100.
Next, it prints the average paragraph length by calculating the sum of all paragraph lengths and dividing it by the number of paragraph lengths. It then prints the 90th percentile paragraph length by sorting the paragraph lengths in ascending order and selecting the index that corresponds to 90% of the length of the list.
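The paragraph statistics described above can be sketched as follows. This is an illustrative helper, not the docqa implementation: the name `paragraph_length_stats` and the `min_words` parameter are hypothetical, the values are returned instead of printed, and "filters out" is read as discarding lengths of 100 words or fewer.

```python
def paragraph_length_stats(
    sections: list[tuple[str, str]],
    min_words: int = 100,
) -> tuple[float, int]:
    """Return (average, 90th percentile) of paragraph lengths in words."""
    # Length of every paragraph, measured in whitespace-separated words.
    paragraph_lengths = [
        len(paragraph.split())
        for _heading, content in sections
        for paragraph in content.split("\n\n")
    ]

    # Assumption: lengths <= min_words are dropped before computing statistics.
    paragraph_lengths = [n for n in paragraph_lengths if n > min_words]
    if not paragraph_lengths:
        raise ValueError("no paragraph is longer than min_words")

    average = sum(paragraph_lengths) / len(paragraph_lengths)

    # 90th percentile: sort ascending and index at 90% of the list length.
    ordered = sorted(paragraph_lengths)
    p90 = ordered[int(len(ordered) * 0.9)]
    return average, p90
```

Indexing at `int(len(ordered) * 0.9)` always stays inside the list, because 90% of the length is strictly smaller than the length itself.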
The function then initializes an empty dictionary `sections_details` to store the details of each section. It iterates through each section and checks if the heading matches any of the predefined keywords. If it does, it initializes an empty list `chunks`; otherwise it calls the `chunk_content` function to chunk the content and assigns the result to `chunks`. It then adds the details of the section to the `sections_details` dictionary.
Finally, it prints the total number of chunks by summing the lengths of the `chunks` list for each section in `sections_details`.
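The per-section bookkeeping and the final chunk count can be sketched as below. Everything beyond the description is an assumption: the function name `count_chunks` is hypothetical, the chunker is passed in as a parameter so the example stays self-contained, the keywords are supplied by the caller because the predefined list is not given here, a "match" is taken to be a case-insensitive substring test, and `sections_details` is assumed to map each heading to its list of chunks.

```python
from typing import Callable


def count_chunks(
    sections: list[tuple[str, str]],
    chunker: Callable[[str], list[str]],
    skip_keywords: tuple[str, ...] = (),
) -> tuple[dict[str, list[str]], int]:
    """Chunk each section unless its heading matches a keyword; return details and total."""
    sections_details: dict[str, list[str]] = {}

    for heading, content in sections:
        # Assumption: matching is a case-insensitive substring check.
        if any(keyword.lower() in heading.lower() for keyword in skip_keywords):
            chunks: list[str] = []
        else:
            chunks = chunker(content)
        sections_details[heading] = chunks

    # Total number of chunks across all sections.
    total = sum(len(chunks) for chunks in sections_details.values())
    return sections_details, total


# Usage with a stand-in chunker (one chunk per line) and a hypothetical keyword.
sections = [("Introduction", "first\nsecond"), ("References", "[1] ...\n[2] ...")]
details, total = count_chunks(sections, str.splitlines, skip_keywords=("references",))
print(total)
```

Passing the chunker as an argument is only a convenience for the sketch; the documented function calls `chunk_content` directly.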
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`sections` | `list[tuple[str, str]]` | A list of tuples containing a heading and content for each section. The content is a string. | required |