dllmforge.rag_preprocess_documents

This module provides document preprocessing functionality for RAG (Retrieval-Augmented Generation) pipelines. It includes document loading for PDF files and text chunking with overlap.

Classes

DocumentLoader()

Abstract base class for document loaders.

PDFLoader()

Loader for PDF documents using PyPDF2.

TextChunker([chunk_size, overlap_size])

Class for chunking text into smaller segments with overlap.

class dllmforge.rag_preprocess_documents.DocumentLoader[source]

Abstract base class for document loaders.

abstract load(file_path: Path) List[Tuple[int, str]][source]

Load a document and return its contents as a list of (page_number, text) tuples.

Parameters:

file_path: Path to the document file

Returns:

List of tuples containing (page_number, text) pairs
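A custom loader is written by subclassing DocumentLoader and implementing load(). The sketch below mirrors the documented abstract interface with a local stand-in class (so it runs without importing dllmforge) and adds a hypothetical PlainTextLoader that treats a .txt file as a single page; the class name PlainTextLoader and the one-page convention are illustrative assumptions, not part of the library.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Tuple


class DocumentLoader(ABC):
    """Local stand-in mirroring the documented abstract interface of
    dllmforge.rag_preprocess_documents.DocumentLoader."""

    @abstractmethod
    def load(self, file_path: Path) -> List[Tuple[int, str]]:
        """Return the document as a list of (page_number, text) tuples."""


class PlainTextLoader(DocumentLoader):
    """Hypothetical loader: a plain-text file becomes a single page."""

    def load(self, file_path: Path) -> List[Tuple[int, str]]:
        # Page numbers start at 1, matching the usual PDF convention.
        return [(1, file_path.read_text(encoding="utf-8"))]
```

Because the return shape matches the abstract contract, such a loader can feed its output straight into a TextChunker.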

class dllmforge.rag_preprocess_documents.PDFLoader[source]

Loader for PDF documents using PyPDF2.

load(file_path: Path) Tuple[List[Tuple[int, str]], str][source]

Load a PDF document and extract text from its pages.

Parameters:

file_path: Path to the PDF file

Returns:

Tuple containing (pages_with_text, file_name), where pages_with_text is a list of (page_number, text) pairs and file_name is the name of the source file

class dllmforge.rag_preprocess_documents.TextChunker(chunk_size: int = 1000, overlap_size: int = 200)[source]

Class for chunking text into smaller segments with overlap.

For detailed information about chunking strategies in RAG applications, including:

- Why chunking is important
- How to choose chunk size and overlap
- Different splitting techniques
- Evaluation methods

see: https://www.mongodb.com/developer/products/atlas/choosing-chunking-strategy-rag/

Initialize the TextChunker.

Parameters:

chunk_size: Maximum size of each chunk in characters

overlap_size: Number of characters to overlap between chunks (recommended: 5-20% of chunk_size)

__init__(chunk_size: int = 1000, overlap_size: int = 200)[source]

Initialize the TextChunker.

Parameters:

chunk_size: Maximum size of each chunk in characters

overlap_size: Number of characters to overlap between chunks (recommended: 5-20% of chunk_size)

chunk_text(pages_with_text: List[Tuple[int, str]], file_name: str = None, metadata: dict = None) List[Dict[str, Any]][source]

Split text into chunks while preserving sentence boundaries.

Parameters:

pages_with_text: List of tuples containing (page_number, text) pairs

file_name: Name of the source file (optional)

metadata: Metadata information extracted from the document (optional)

Returns:

{
    'text': str,          # The chunk text
    'page_number': int,   # Source page number
    'chunk_index': int,   # Index of the chunk
    'total_chunks': int,  # Total number of chunks from this document
    'file_name': str      # Name of the source file
}

Return type:

List of dictionaries containing chunks with metadata
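The chunking behavior can be sketched as a greedy, sentence-boundary splitter that emits the documented dict shape. This is an illustrative reimplementation under stated assumptions (a regex sentence split and a character-tail overlap), not TextChunker's actual algorithm, and it omits the optional metadata handling.

```python
import re
from typing import Any, Dict, List, Tuple


def chunk_text(pages_with_text: List[Tuple[int, str]],
               file_name: str = None,
               chunk_size: int = 1000,
               overlap_size: int = 200) -> List[Dict[str, Any]]:
    """Illustrative sentence-boundary chunker with character overlap."""
    chunks: List[Dict[str, Any]] = []
    for page_number, text in pages_with_text:
        # Naive sentence split on terminal punctuation (an assumption).
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        current = ""
        for sentence in sentences:
            if current and len(current) + len(sentence) + 1 > chunk_size:
                chunks.append({"text": current, "page_number": page_number})
                # Carry the tail of the previous chunk forward as overlap.
                tail = current[-overlap_size:] if overlap_size else ""
                current = (tail + " " + sentence).strip()
            else:
                current = (current + " " + sentence).strip()
        if current:
            chunks.append({"text": current, "page_number": page_number})
    # Fill in the remaining documented fields once the total is known.
    total = len(chunks)
    for index, chunk in enumerate(chunks):
        chunk.update({"chunk_index": index,
                      "total_chunks": total,
                      "file_name": file_name})
    return chunks
```

Usage: feed it the (page_number, text) pairs produced by a loader, e.g. chunk_text(pages, file_name="report.pdf", chunk_size=1000, overlap_size=200).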