dllmforge.rag_preprocess_documents
This module provides document preprocessing functionality for RAG (Retrieval-Augmented Generation) pipelines. It includes document loading and text chunking for PDF files.
Classes
- DocumentLoader: Abstract base class for document loaders.
- Loader for PDF documents using PyPDF2.
- TextChunker: Class for chunking text into smaller segments with overlap.
- class dllmforge.rag_preprocess_documents.DocumentLoader
Abstract base class for document loaders.
- class dllmforge.rag_preprocess_documents.TextChunker(chunk_size: int = 1000, overlap_size: int = 200)
Class for chunking text into smaller segments with overlap.
For detailed information about chunking strategies in RAG applications, including:
  - Why chunking is important
  - How to choose chunk size and overlap
  - Different splitting techniques
  - Evaluation methods
See: https://www.mongodb.com/developer/products/atlas/choosing-chunking-strategy-rag/
Initialize the TextChunker.
:param chunk_size: Maximum size of each chunk in characters
:param overlap_size: Number of characters to overlap between chunks (recommended: 5-20% of chunk_size)
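To make the relationship between the two parameters concrete, here is a minimal sketch of a fixed-size sliding window in which consecutive chunks share `overlap_size` characters. The function `sliding_window_chunks` is a hypothetical helper written for this illustration. It is not part of dllmforge and does not reproduce how `TextChunker` splits text (in particular, it ignores sentence boundaries).

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000, overlap_size: int = 200) -> List[str]:
    """Cut text into fixed-size windows that share overlap_size characters.

    Hypothetical helper for illustration only; not dllmforge's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must satisfy 0 <= overlap_size < chunk_size")

    step = chunk_size - overlap_size  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 2500 characters of repeating digits stand in for a document.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample, chunk_size=1000, overlap_size=200)
```

With the defaults shown, each window advances by 800 characters, so the last 200 characters of one chunk reappear at the start of the next. A larger overlap keeps more context across chunk borders at the cost of storing and embedding more duplicated text.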
- __init__(chunk_size: int = 1000, overlap_size: int = 200)
- chunk_text(pages_with_text: List[Tuple[int, str]], file_name: str = None, metadata: dict = None) -> List[Dict[str, Any]]
Split text into chunks while preserving sentence boundaries.
:param pages_with_text: List of tuples containing (page_number, text) pairs
:param file_name: Name of the source file (optional)
:param metadata: Metadata information extracted from the document (optional)
- Returns:
List of dictionaries containing chunks with metadata, each of the form:
{
    'text': str,          # The chunk text
    'page_number': int,   # Source page number
    'chunk_index': int,   # Index of the chunk
    'total_chunks': int,  # Total number of chunks from this document
    'file_name': str      # Name of the source file
}
- Return type:
List[Dict[str, Any]]
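The following sketch shows one way sentence-preserving chunking can produce records with the fields listed above. It is an illustration under stated assumptions, not dllmforge's code. The function `chunk_pages_sketch` is hypothetical, the sentence splitter is a naive regular expression, overlap is left out for brevity, and `chunk_index` is assumed to be zero-based (the documentation above does not say).

```python
import re
from typing import Any, Dict, List, Optional, Tuple


def chunk_pages_sketch(
    pages_with_text: List[Tuple[int, str]],
    file_name: Optional[str] = None,
    chunk_size: int = 1000,
) -> List[Dict[str, Any]]:
    """Greedy sentence packing; hypothetical helper, not dllmforge's implementation."""
    chunks: List[Dict[str, Any]] = []
    for page_number, text in pages_with_text:
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        current = ""
        for sentence in sentences:
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > chunk_size:
                # Adding this sentence would exceed the limit: close the chunk.
                chunks.append({"text": current, "page_number": page_number})
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append({"text": current, "page_number": page_number})

    # Fill in the document-level fields once the total is known.
    for index, chunk in enumerate(chunks):
        chunk["chunk_index"] = index  # assumption: zero-based
        chunk["total_chunks"] = len(chunks)
        chunk["file_name"] = file_name
    return chunks


pages = [(1, "First sentence. Second sentence."), (2, "Third sentence.")]
records = chunk_pages_sketch(pages, file_name="example.pdf", chunk_size=20)
```

Because sentences are never cut, a single sentence longer than `chunk_size` ends up as one oversized chunk in this sketch. Chunks also never span pages here, which keeps `page_number` unambiguous.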