dllmforge.IE_agent_document_processor

Document Processor module for preprocessing documents into text or images for LLM processing.

Classes

DocumentProcessor([config, input_dir, ...])

Class for preprocessing documents into text or images

ProcessedDocument(content, content_type[, ...])

Class representing processed document content

class dllmforge.IE_agent_document_processor.ProcessedDocument(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None)

Class representing processed document content

__init__(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None)

Initialize processed document

Parameters:
  • content – The document content (text string or image bytes)

  • content_type – Type of content (‘text’ or ‘image’)

  • metadata – Additional metadata about the document
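
As a rough illustration of the constructor parameters above, the sketch below uses a stand-in dataclass that mirrors the documented signature. It is not the dllmforge implementation: the class name `ProcessedDocumentSketch` and the metadata key `source` are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union


@dataclass
class ProcessedDocumentSketch:
    """Hypothetical stand-in mirroring the documented signature."""
    content: Union[str, bytes]
    content_type: str  # 'text' or 'image'
    metadata: Optional[Dict[str, Any]] = None


# Text content is passed as a string.
text_doc = ProcessedDocumentSketch(
    content="Extracted text of the report.",
    content_type="text",
    metadata={"source": "report.pdf"},  # illustrative key only
)

# Image content is passed as raw bytes; metadata may be omitted.
image_doc = ProcessedDocumentSketch(
    content=b"\x89PNG\r\n\x1a\n",
    content_type="image",
)
```

Because `content` accepts either `str` or `bytes`, callers would presumably check `content_type` before deciding how to handle it.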

class dllmforge.IE_agent_document_processor.DocumentProcessor(config: DocumentConfig | None = None, input_dir: str | Path | None = None, file_pattern: str | None = None, output_type: str | None = None, output_dir: str | Path | None = None)

Class for preprocessing documents into text or images

__init__(config: DocumentConfig | None = None, input_dir: str | Path | None = None, file_pattern: str | None = None, output_type: str | None = None, output_dir: str | Path | None = None)

Initialize document processor

Parameters:
  • config – Document processing configuration (DocumentConfig)

  • input_dir – Input directory (overrides config if given)

  • file_pattern – File pattern (overrides config if given)

  • output_type – Processing type (overrides config if given)

  • output_dir – Output directory (overrides config if given)
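
The "overrides config if given" behaviour can be pictured with a minimal sketch. It assumes that an argument left as `None` falls back to the configuration value, which the signature's `None` defaults suggest but the documentation does not state outright. `DocumentConfigSketch`, its field defaults, and the `resolve` helper are hypothetical; the real `DocumentConfig` may look different.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass
class DocumentConfigSketch:
    """Hypothetical stand-in; the real DocumentConfig fields may differ."""
    input_dir: Union[str, Path] = "."
    file_pattern: str = "*.pdf"
    output_type: str = "text"
    output_dir: Optional[Union[str, Path]] = None


def resolve(explicit, configured):
    """Return the explicit argument if given, otherwise the config value."""
    return explicit if explicit is not None else configured


config = DocumentConfigSketch(input_dir="docs", output_type="text")

# output_type is passed explicitly, so it overrides the config value;
# input_dir is left as None, so the config value is kept.
output_type = resolve("image", config.output_type)
input_dir = Path(resolve(None, config.input_dir))
```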

process_to_text(file_path: str | Path) → ProcessedDocument

Process document to text using DocumentLoader

process_to_image(file_path: str | Path) → List[ProcessedDocument]

Process document to list of page images

encode_image_base64(image_bytes: bytes) → str

Encode image bytes to base64 string
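
A minimal sketch of this kind of encoding with the standard library is shown below. It illustrates what base64-encoding image bytes involves and is not necessarily how the method is implemented; the function name `encode_image_base64_sketch` is hypothetical.

```python
import base64


def encode_image_base64_sketch(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


png_signature = b"\x89PNG\r\n\x1a\n"  # the 8-byte signature that starts every PNG file
encoded = encode_image_base64_sketch(png_signature)

# Decoding recovers the original bytes exactly.
decoded = base64.b64decode(encoded)
```

A string in this form is what many multimodal LLM APIs expect for inline image payloads.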

process_file(file_path: str | Path) → ProcessedDocument | List[ProcessedDocument]

Process a single file based on configuration (text/image)

Parameters:
  • file_path – Path to document

Returns:

Single ProcessedDocument for text or list of ProcessedDocument for images
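
Since the return value is either a single object or a list, calling code may want to normalise it before iterating. The sketch below shows one way to do that. `PageSketch` is a hypothetical stand-in for `ProcessedDocument`, and `as_list` is an illustrative helper, not part of dllmforge.

```python
from typing import List, NamedTuple, Union


class PageSketch(NamedTuple):
    """Hypothetical stand-in for ProcessedDocument."""
    content: Union[str, bytes]
    content_type: str


def as_list(result: Union[PageSketch, List[PageSketch]]) -> List[PageSketch]:
    """Wrap a single document in a list; return a list unchanged."""
    return result if isinstance(result, list) else [result]


# Text mode yields one object; image mode yields one object per page.
text_result = PageSketch("full text", "text")
image_result = [PageSketch(b"page-1", "image"), PageSketch(b"page-2", "image")]

text_pages = as_list(text_result)
image_pages = as_list(image_result)
```

After normalisation, both modes can be handled by the same loop over pages.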

process_directory() → List[ProcessedDocument | List[ProcessedDocument]]

Process all matching files in the configured directory
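
The documentation does not say how files are matched against the file pattern. As a sketch under the assumption that the pattern is a shell-style glob applied to the input directory, file selection could look like this. `matching_files` is a hypothetical helper, not a dllmforge function.

```python
import tempfile
from pathlib import Path
from typing import List


def matching_files(input_dir: Path, file_pattern: str) -> List[Path]:
    """Return files in input_dir whose names match file_pattern, sorted."""
    return sorted(p for p in input_dir.glob(file_pattern) if p.is_file())


# Build a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        (root / name).write_text("placeholder")
    names = [p.name for p in matching_files(root, "*.pdf")]
```

Each selected path would then be handed to `process_file`, which is why the result list can mix single documents (text mode) and per-page lists (image mode).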