dllmforge.IE_agent_extractor_docling

Synchronous information extractor module for extracting structured information from documents using an LLM with Docling preprocessing.

Classes

DoclingDocumentProcessor(config)

Document processor using Docling for advanced PDF processing

DoclingInfoExtractor(config, output_schema)

Class for extracting information from documents using an LLM with Docling preprocessing

DoclingProcessedDocument(content, content_type)

Class representing a document processed by Docling

DocumentChunk(content, content_type[, ...])

Class representing a chunk of document content

DocumentConverter(*args, **kwargs)

Fallback stub used when the real docling package isn't available.

class dllmforge.IE_agent_extractor_docling.DocumentConverter(*args, **kwargs)

Fallback stub used when the real docling package isn’t available.

The stub is intentionally minimal: it can be instantiated safely but its convert method raises a RuntimeError. Test suites can still patch DocumentConverter where they need to simulate conversions.
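The behaviour described above can be illustrated with a minimal, self-contained sketch. The class below is a stand-in written for this example, not the module's actual code, and the error message text is an assumption. It shows a stub that instantiates safely, raises on convert, and can be patched in a test with unittest.mock.

```python
from unittest import mock


class DocumentConverter:
    """Illustrative stand-in that mirrors the fallback stub's described behaviour."""

    def __init__(self, *args, **kwargs):
        # Accept and ignore any arguments so instantiation never fails.
        pass

    def convert(self, *args, **kwargs):
        # The message text is an assumption made for this sketch.
        raise RuntimeError("docling is not installed; conversion is unavailable")


# Instantiation succeeds, but calling convert raises RuntimeError.
converter = DocumentConverter()
try:
    converter.convert("report.pdf")
    stub_raised = False
except RuntimeError:
    stub_raised = True

# A test can swap convert for a mock to simulate a successful conversion.
with mock.patch.object(DocumentConverter, "convert", return_value="fake result") as fake_convert:
    patched_result = DocumentConverter().convert("report.pdf")
    fake_convert.assert_called_once_with("report.pdf")
```

Patching the method rather than the whole class keeps the constructor behaviour intact, which is usually enough when a test only needs a canned conversion result.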

__init__(*args, **kwargs)

convert(*args, **kwargs)

class dllmforge.IE_agent_extractor_docling.DoclingProcessedDocument(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None, docling_result=None)

Class representing a document processed by Docling

__init__(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None, docling_result=None)

class dllmforge.IE_agent_extractor_docling.DocumentChunk(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None, docling_elements: List | None = None)

Class representing a chunk of document content

__init__(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None, docling_elements: List | None = None)

class dllmforge.IE_agent_extractor_docling.DoclingDocumentProcessor(config)

Document processor using Docling for advanced PDF processing

__init__(config)

encode_image_base64(image_data: bytes) → str

Encode image data to base64 string
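As a rough sketch of what such an encoding step typically looks like, the standalone function below uses the standard library's base64 module. It is an illustration under the assumption that the result is a UTF-8 string of standard base64, not a copy of the method's implementation. The data URL at the end is a common way to embed the result in a multimodal prompt and is likewise only an example.

```python
import base64


def encode_image_base64(image_data: bytes) -> str:
    """Return the standard base64 encoding of raw image bytes as text."""
    return base64.b64encode(image_data).decode("utf-8")


# The 8-byte PNG file signature serves as a tiny sample payload.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = encode_image_base64(png_signature)

# Illustrative only: wrap the string in a data URL for an image message.
data_url = f"data:image/png;base64,{encoded}"
```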

process_document(file_path: Path) → DoclingProcessedDocument | None

Process a single document using Docling

process_directory() → List[DoclingProcessedDocument]

Process all documents in the configured directory

class dllmforge.IE_agent_extractor_docling.DoclingInfoExtractor(config: IEAgentConfig, output_schema: type[BaseModel], llm_api: LangchainAPI | None = None)

Class for extracting information from documents using an LLM with Docling preprocessing

__init__(config: IEAgentConfig, output_schema: type[BaseModel], llm_api: LangchainAPI | None = None)

Initialize the information extractor

refine_system_prompt(task_description: str) → str

Use LLM to refine user’s task description into a proper system prompt

chunk_document(doc: DoclingProcessedDocument) → Generator[DocumentChunk, None, None]

Split document into chunks based on Docling structure if needed
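The "if needed" behaviour can be pictured with a simplified generator. This is a sketch only: it works on plain text, uses blank-line paragraph breaks as a stand-in for Docling's structural elements, and the names chunk_text and max_chars are hypothetical. The module's real chunking criteria are not documented here.

```python
from typing import Generator


def chunk_text(text: str, max_chars: int = 2000) -> Generator[str, None, None]:
    """Yield the text whole if it fits, otherwise yield groups of paragraphs."""
    if len(text) <= max_chars:
        # Small documents need no splitting.
        yield text
        return

    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) > max_chars and current:
            # Adding this paragraph would overflow, so emit what we have.
            yield current
            current = paragraph
        else:
            current = candidate
    if current:
        # A single paragraph longer than max_chars is emitted unsplit.
        yield current


sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
chunks = list(chunk_text(sample, max_chars=25))
```

Using a generator means chunks are produced one at a time, so a caller can send each to the LLM without holding every chunk in memory.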

create_text_extraction_prompt() → ChatPromptTemplate

Create prompt template for text-based information extraction with Docling awareness

process_text_chunk(chunk: DocumentChunk) → Dict[str, Any] | None

Process a text document chunk with Docling enhancements

create_multimodal_extraction_prompt() → ChatPromptTemplate

Create prompt template for multimodal extraction with Docling structure

process_multimodal_chunk(chunk: DocumentChunk, doc: DoclingProcessedDocument) → Dict[str, Any] | None

Process chunk with access to original Docling result for multimodal content

process_chunk(chunk: DocumentChunk, doc: DoclingProcessedDocument) → Dict[str, Any] | None

Process a document chunk with Docling context

process_document(doc: DoclingProcessedDocument | List[DoclingProcessedDocument]) → List[Dict[str, Any]]

Process document and extract information

save_results(results: List[Any], output_path: Path) → None

Save extraction results to JSON file
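A minimal sketch of saving a list of results as JSON, using only the standard library. The formatting choices (indentation, UTF-8, creating parent directories, falling back to str for non-serializable values) are assumptions made for this example rather than documented behaviour of the method.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, List


def save_results(results: List[Any], output_path: Path) -> None:
    """Write results to output_path as pretty-printed UTF-8 JSON."""
    # Assumption: create missing parent directories instead of failing.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as f:
        # default=str is a fallback for values json cannot serialize natively.
        json.dump(results, f, indent=2, ensure_ascii=False, default=str)


# Usage: write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "nested" / "results.json"
    save_results([{"title": "Report", "pages": 3}], out)
    loaded = json.loads(out.read_text(encoding="utf-8"))
```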

process_all() → None

Process all documents in configured directory