dllmforge.IE_agent_document_processor
Document Processor module for preprocessing documents into text or images for LLM processing.
Classes
DocumentProcessor | Class for preprocessing documents into text or images
ProcessedDocument | Class representing processed document content
- class dllmforge.IE_agent_document_processor.ProcessedDocument(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None)
Class representing processed document content
Initialize processed document
- Parameters:
content – The document content (text string or image bytes)
content_type – Type of content (‘text’ or ‘image’)
metadata – Additional metadata about the document
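As a rough illustration of the constructor arguments described above, the sketch below uses a hypothetical stand-in dataclass, `ProcessedDocumentSketch`, rather than the real class. It mirrors only the documented parameter names and types; the `describe` helper and the sample metadata key are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union


@dataclass
class ProcessedDocumentSketch:
    """Hypothetical stand-in mirroring the documented constructor arguments."""
    content: Union[str, bytes]                 # text string or image bytes
    content_type: str                          # 'text' or 'image'
    metadata: Optional[Dict[str, Any]] = None  # optional extra information


def describe(doc: ProcessedDocumentSketch) -> str:
    """Return a short summary that branches on the content type."""
    if doc.content_type == "text":
        return f"text, {len(doc.content)} characters"
    return f"image, {len(doc.content)} bytes"


text_doc = ProcessedDocumentSketch("Hello world", "text", {"source": "a.pdf"})
image_doc = ProcessedDocumentSketch(b"\x89PNG", "image")
print(describe(text_doc))
print(describe(image_doc))
```

Checking `content_type` before touching `content` is one way a consumer might avoid treating image bytes as text.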
- class dllmforge.IE_agent_document_processor.DocumentProcessor(config: DocumentConfig | None = None, input_dir: str | Path | None = None, file_pattern: str | None = None, output_type: str | None = None, output_dir: str | Path | None = None)
Class for preprocessing documents into text or images
Initialize document processor
- Parameters:
config – Document processing configuration (DocumentConfig)
input_dir – Input directory (overrides config if given)
file_pattern – File pattern (overrides config if given)
output_type – Processing type (overrides config if given)
output_dir – Output directory (overrides config if given)
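The "overrides config if given" behaviour can be pictured as a merge in which any argument that is not `None` replaces the corresponding configuration value. The sketch below is not the library's implementation: `ConfigSketch`, its field names, its default values, and the `resolve` helper are all assumptions chosen to match the documented parameter names.

```python
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Optional, Union


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for DocumentConfig; fields and defaults are assumed."""
    input_dir: Union[str, Path] = "docs"
    file_pattern: str = "*.pdf"
    output_type: str = "text"
    output_dir: Optional[Union[str, Path]] = None


def resolve(config: Optional[ConfigSketch] = None, **overrides) -> ConfigSketch:
    """Return a new config in which every non-None override wins."""
    base = config if config is not None else ConfigSketch()
    merged = {f.name: getattr(base, f.name) for f in fields(base)}
    # Arguments left as None fall back to the value stored in the config.
    merged.update({k: v for k, v in overrides.items() if v is not None})
    return ConfigSketch(**merged)


settings = resolve(
    ConfigSketch(input_dir="reports"),
    output_type="image",   # explicit argument: replaces the config value
    file_pattern=None,     # not given: the config value is kept
)
print(settings)
```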
- process_to_text(file_path: str | Path) → ProcessedDocument
Process document to text using DocumentLoader
- process_to_image(file_path: str | Path) → List[ProcessedDocument]
Process document to list of page images
- process_file(file_path: str | Path) → ProcessedDocument | List[ProcessedDocument]
Process a single file based on configuration (text/image)
- Parameters:
file_path – Path to document
- Returns:
Single ProcessedDocument for text or list of ProcessedDocument for images
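Because the return value is either a single object or a list, calling code may find it convenient to normalize both shapes to a list before iterating. The helper below is a hypothetical, caller-side sketch (the name `as_list` is not part of the module) and uses plain strings in place of ProcessedDocument instances.

```python
from typing import Any, List, Union


def as_list(result: Union[Any, List[Any]]) -> List[Any]:
    """Wrap a single result in a list; return an existing list unchanged."""
    return result if isinstance(result, list) else [result]


# Plain strings stand in for ProcessedDocument objects here.
text_result = "whole-document text"
image_result = ["page 1 image", "page 2 image"]

for item in as_list(text_result) + as_list(image_result):
    print(item)
```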
- process_directory() → List[ProcessedDocument | List[ProcessedDocument]]
Process all matching files in the configured directory
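How "matching files" are selected is not spelled out here. One plausible reading, assuming the file pattern is a glob such as `*.pdf`, is a `pathlib` glob over the input directory. The sketch below illustrates that assumption only: `matching_files` is a hypothetical helper, and the library may select files differently.

```python
import tempfile
from pathlib import Path
from typing import List, Union


def matching_files(input_dir: Union[str, Path], file_pattern: str) -> List[Path]:
    """List regular files in input_dir whose names match a glob pattern, sorted."""
    return sorted(p for p in Path(input_dir).glob(file_pattern) if p.is_file())


# Build a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in matching_files(tmp, "*.pdf")]

print(found)
```

Sorting the matches keeps the processing order stable across runs, which makes the resulting list easier to line up with the input files.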