dllmforge.rag_preprocess_documents

This module provides document preprocessing functionality for RAG (Retrieval-Augmented Generation) pipelines. It includes document loading for PDF files and text chunking with overlap.

Classes

DocumentLoader()

Abstract base class for document loaders.

PDFLoader()

Loader for PDF documents using PyPDF2.

TextChunker([chunk_size, overlap_size])

Class for chunking text into smaller segments with overlap.

class dllmforge.rag_preprocess_documents.DocumentLoader[source]

Abstract base class for document loaders.

abstract load(file_path: Path) List[Tuple[int, str]][source]

Load a document and return its contents as a list of (page_number, text) tuples.

Parameters:

file_path: Path to the document file

Returns:

List of tuples containing (page_number, text) pairs
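A custom loader is written by subclassing DocumentLoader and implementing load(). The sketch below mirrors the documented abstract interface with a local stand-in class (so it runs without importing dllmforge) and adds a hypothetical PlainTextLoader that treats a .txt file as a single page; the class name PlainTextLoader and the one-page convention are illustrative assumptions, not part of the library.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Tuple


class DocumentLoader(ABC):
    """Local stand-in mirroring the documented abstract interface of
    dllmforge.rag_preprocess_documents.DocumentLoader."""

    @abstractmethod
    def load(self, file_path: Path) -> List[Tuple[int, str]]:
        """Return the document as a list of (page_number, text) tuples."""


class PlainTextLoader(DocumentLoader):
    """Hypothetical loader: a plain-text file becomes a single page."""

    def load(self, file_path: Path) -> List[Tuple[int, str]]:
        # Page numbers start at 1, matching the usual PDF convention.
        return [(1, file_path.read_text(encoding="utf-8"))]
```

Because the return shape matches the abstract contract, such a loader can feed its output straight into a TextChunker.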

class dllmforge.rag_preprocess_documents.PDFLoader[source]

Loader for PDF documents using PyPDF2.

load(file_path: Path) Tuple[List[Tuple[int, str]], str][source]

Load a PDF document and extract text from its pages.

Parameters:

file_path: Path to the PDF file

Returns:

Tuple containing (pages_with_text, file_name), where pages_with_text is a list of (page_number, text) pairs and file_name is the name of the source file

class dllmforge.rag_preprocess_documents.TextChunker(chunk_size: int = 1000, overlap_size: int = 200)[source]

Class for chunking text into smaller segments with overlap.

For detailed information about chunking strategies in RAG applications, including:

- Why chunking is important
- How to choose chunk size and overlap
- Different splitting techniques
- Evaluation methods

see: https://www.mongodb.com/developer/products/atlas/choosing-chunking-strategy-rag/

Initialize the TextChunker.

Parameters:

chunk_size: Maximum size of each chunk in characters

overlap_size: Number of characters to overlap between chunks (recommended: 5-20% of chunk_size)

__init__(chunk_size: int = 1000, overlap_size: int = 200)[source]

Initialize the TextChunker.

Parameters:

chunk_size: Maximum size of each chunk in characters

overlap_size: Number of characters to overlap between chunks (recommended: 5-20% of chunk_size)

chunk_text(pages_with_text: List[Tuple[int, str]], file_name: str = None, metadata: dict = None) List[Dict[str, Any]][source]

Split text into chunks while preserving sentence boundaries.

Parameters:

pages_with_text: List of tuples containing (page_number, text) pairs

file_name: Name of the source file (optional)

metadata: Metadata information extracted from the document (optional)

Returns:

{
    'text': str,          # The chunk text
    'page_number': int,   # Source page number
    'chunk_index': int,   # Index of the chunk
    'total_chunks': int,  # Total number of chunks from this document
    'file_name': str      # Name of the source file
}

Return type:

List of dictionaries containing chunks with metadata
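The chunking behavior can be sketched as a greedy, sentence-boundary splitter that emits the documented dict shape. This is an illustrative reimplementation under stated assumptions (a regex sentence split and a character-tail overlap), not TextChunker's actual algorithm, and it omits the optional metadata handling.

```python
import re
from typing import Any, Dict, List, Tuple


def chunk_text(pages_with_text: List[Tuple[int, str]],
               file_name: str = None,
               chunk_size: int = 1000,
               overlap_size: int = 200) -> List[Dict[str, Any]]:
    """Illustrative sentence-boundary chunker with character overlap."""
    chunks: List[Dict[str, Any]] = []
    for page_number, text in pages_with_text:
        # Naive sentence split on terminal punctuation (an assumption).
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        current = ""
        for sentence in sentences:
            if current and len(current) + len(sentence) + 1 > chunk_size:
                chunks.append({"text": current, "page_number": page_number})
                # Carry the tail of the previous chunk forward as overlap.
                tail = current[-overlap_size:] if overlap_size else ""
                current = (tail + " " + sentence).strip()
            else:
                current = (current + " " + sentence).strip()
        if current:
            chunks.append({"text": current, "page_number": page_number})
    # Fill in the remaining documented fields once the total is known.
    total = len(chunks)
    for index, chunk in enumerate(chunks):
        chunk.update({"chunk_index": index,
                      "total_chunks": total,
                      "file_name": file_name})
    return chunks
```

Usage: feed it the (page_number, text) pairs produced by a loader, e.g. chunk_text(pages, file_name="report.pdf", chunk_size=1000, overlap_size=200).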