dllmforge.IE_agent_extractor

Synchronous information extractor module for extracting structured information from documents using an LLM.

Classes

DocumentChunk(content, content_type[, metadata])

Class representing a chunk of document content

InfoExtractor([config, output_schema, ...])

Class for extracting information from documents using an LLM

class dllmforge.IE_agent_extractor.DocumentChunk(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None)[source]

Class representing a chunk of document content

__init__(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None)[source]
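The documented signature can be mirrored as a small dataclass. This is a minimal sketch, not the library's actual implementation; only the field names and types come from the signature above, while the dataclass form and the example `content_type` value are assumptions:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union

# Minimal sketch mirroring the documented DocumentChunk signature.
# The dataclass form and the "text" content_type are assumptions.
@dataclass
class DocumentChunk:
    content: Union[str, bytes]                 # raw text or image bytes
    content_type: str                          # e.g. "text" (assumed value)
    metadata: Optional[Dict[str, Any]] = None  # optional per-chunk metadata

chunk = DocumentChunk("Invoice total: $42", "text", {"page": 1})
```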
class dllmforge.IE_agent_extractor.InfoExtractor(config: IEAgentConfig | None = None, output_schema: type[BaseModel] | None = None, llm_api: LangchainAPI | None = None, system_prompt: str | None = None, chunk_size: int | None = None, chunk_overlap: int | None = None, doc_processor: DocumentProcessor | None = None, document_output_type: str = 'text')[source]

Class for extracting information from documents using an LLM

Initialize the information extractor.

You can either pass a config (IEAgentConfig) or supply the individual parameters directly.

__init__(config: IEAgentConfig | None = None, output_schema: type[BaseModel] | None = None, llm_api: LangchainAPI | None = None, system_prompt: str | None = None, chunk_size: int | None = None, chunk_overlap: int | None = None, doc_processor: DocumentProcessor | None = None, document_output_type: str = 'text')[source]

Initialize the information extractor.

You can either pass a config (IEAgentConfig) or supply the individual parameters directly.
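One plausible way a constructor could reconcile a config object with explicit keyword arguments is sketched below. The precedence rule shown (an explicitly passed argument overrides the config value) is an assumption, not documented behavior:

```python
# Hypothetical resolution of config vs. explicit parameters; the
# "explicit argument wins over config" precedence is an assumption.
def resolve_settings(config: dict, **overrides) -> dict:
    settings = dict(config)
    for key, value in overrides.items():
        if value is not None:  # None means "not supplied": keep config value
            settings[key] = value
    return settings

resolve_settings({"chunk_size": 1000, "chunk_overlap": 100}, chunk_size=500)
```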

refine_system_prompt(task_description: str) → str[source]

Use the LLM to refine the user’s task description into a proper system prompt

chunk_document(doc: ProcessedDocument) → Generator[DocumentChunk, None, None][source]

Split the document into chunks if needed, based on the configured chunk_size and chunk_overlap thresholds
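A hedged sketch of sliding-window splitting in the spirit of chunk_document is shown below. Character-based windowing, with chunk_overlap characters carried between consecutive windows, is an assumption; the real implementation may split by tokens or by document structure:

```python
from typing import Generator

# Assumed character-based sliding window: each chunk is chunk_size
# characters, and consecutive chunks share chunk_overlap characters.
def chunk_text(text: str, chunk_size: int,
               chunk_overlap: int) -> Generator[str, None, None]:
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        yield text[start:start + chunk_size]
        if start + chunk_size >= len(text):
            break

chunks = list(chunk_text("0123456789" * 3, chunk_size=10, chunk_overlap=5))
```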

create_text_extraction_prompt() → ChatPromptTemplate[source]

Create a prompt template for text-based information extraction

process_text_chunk(chunk: DocumentChunk) → Dict[str, Any] | None[source]

Process a text document chunk

create_image_extraction_prompt() → ChatPromptTemplate[source]

Create a prompt template for image-based information extraction

process_image_chunk(chunk: DocumentChunk) → Dict[str, Any] | None[source]

Process an image document chunk

process_chunk(chunk: DocumentChunk) → Dict[str, Any] | None[source]

Process a document chunk based on its type
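Routing a chunk by its type can be sketched as a simple dispatch table. The "text"/"image" type strings and the handler bodies below are assumptions; only the shape (dispatch on content_type, returning None for unsupported types) follows the docstring above:

```python
from collections import namedtuple

# Stand-in for a DocumentChunk, for illustration only.
Chunk = namedtuple("Chunk", ["content", "content_type"])

# Hypothetical dispatch: route a chunk to a handler by content_type,
# returning None when the type is unsupported.
def process_chunk(chunk):
    handlers = {
        "text": lambda c: {"kind": "text", "length": len(c.content)},
        "image": lambda c: {"kind": "image", "size": len(c.content)},
    }
    handler = handlers.get(chunk.content_type)
    return handler(chunk) if handler else None
```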

process_document(doc: ProcessedDocument | List[ProcessedDocument]) → List[Dict[str, Any]][source]

Process document and extract information, merging in chunk metadata.
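"Merging in chunk metadata" might combine each extracted record with its chunk's metadata dict, as sketched here. The merge direction (extracted fields win on key collisions) is an assumption:

```python
from typing import Any, Dict, Optional

# Assumed merge: chunk metadata is folded into each extracted record,
# with extracted fields taking precedence on key collisions.
def merge_metadata(extracted: Dict[str, Any],
                   metadata: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    return {**(metadata or {}), **extracted}
```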

save_results(results: List[Any], output_path: str | Path) → None[source]

Save extraction results to a JSON file
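A minimal sketch matching the documented save_results signature: serialize the results list as JSON at output_path. Creating parent directories and the two-space indent are assumptions:

```python
import json
from pathlib import Path
from typing import Any, List, Union

# Sketch of save_results: write the results list as JSON to output_path.
def save_results(results: List[Any], output_path: Union[str, Path]) -> None:
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # assumed convenience
    path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
                    encoding="utf-8")
```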

process_all(save_individual: bool = False, combined_output_name: str = 'all_extracted.json') → None[source]

Process all documents in the configured directory

Parameters:
  • save_individual – If True, save each document’s results to a separate JSON file (old behavior)

  • combined_output_name – Name of the combined output file (default: “all_extracted.json”)
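The save_individual switch described above can be sketched as follows: either one combined JSON file named combined_output_name, or one file per document. The per-document file naming ("<name>.json") is an assumption:

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Hypothetical sketch of the save_individual switch: combined output
# writes a single JSON file; save_individual writes one file per
# document (the "<name>.json" naming is an assumption).
def write_outputs(results_by_doc: Dict[str, List[Any]], out_dir: Path,
                  save_individual: bool = False,
                  combined_output_name: str = "all_extracted.json") -> List[Path]:
    out_dir.mkdir(parents=True, exist_ok=True)
    if save_individual:
        paths = []
        for name, results in results_by_doc.items():
            path = out_dir / f"{name}.json"
            path.write_text(json.dumps(results))
            paths.append(path)
        return paths
    combined = out_dir / combined_output_name
    combined.write_text(json.dumps(results_by_doc))
    return [combined]
```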