dllmforge.IE_agent_extractor

Synchronous information extractor module for extracting structured information from documents using an LLM.

Classes

DocumentChunk(content, content_type[, metadata])

Class representing a chunk of document content

InfoExtractor([config, output_schema, ...])

Class for extracting information from documents using an LLM

class dllmforge.IE_agent_extractor.DocumentChunk(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None)[source]

Class representing a chunk of document content

__init__(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None)[source]
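The documented signature can be mirrored as a small dataclass. This is a minimal sketch, not the library's actual implementation; only the field names and types come from the signature above, while the dataclass form and the example `content_type` value are assumptions:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union

# Minimal sketch mirroring the documented DocumentChunk signature.
# The dataclass form and the "text" content_type are assumptions.
@dataclass
class DocumentChunk:
    content: Union[str, bytes]                 # raw text or image bytes
    content_type: str                          # e.g. "text" (assumed value)
    metadata: Optional[Dict[str, Any]] = None  # optional per-chunk metadata

chunk = DocumentChunk("Invoice total: $42", "text", {"page": 1})
```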
class dllmforge.IE_agent_extractor.InfoExtractor(config: IEAgentConfig | None = None, output_schema: type[BaseModel] | None = None, llm_api: LangchainAPI | None = None, system_prompt: str | None = None, chunk_size: int | None = None, chunk_overlap: int | None = None, doc_processor: DocumentProcessor | None = None, document_output_type: str = 'text')[source]

Class for extracting information from documents using an LLM

Initialize the information extractor.

You can either pass a config (IEAgentConfig) or supply the individual parameters directly.

__init__(config: IEAgentConfig | None = None, output_schema: type[BaseModel] | None = None, llm_api: LangchainAPI | None = None, system_prompt: str | None = None, chunk_size: int | None = None, chunk_overlap: int | None = None, doc_processor: DocumentProcessor | None = None, document_output_type: str = 'text')[source]

Initialize the information extractor.

You can either pass a config (IEAgentConfig) or supply the individual parameters directly.
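One plausible way a constructor could reconcile a config object with explicit keyword arguments is sketched below. The precedence rule shown (an explicitly passed argument overrides the config value) is an assumption, not documented behavior:

```python
# Hypothetical resolution of config vs. explicit parameters; the
# "explicit argument wins over config" precedence is an assumption.
def resolve_settings(config: dict, **overrides) -> dict:
    settings = dict(config)
    for key, value in overrides.items():
        if value is not None:  # None means "not supplied": keep config value
            settings[key] = value
    return settings

resolve_settings({"chunk_size": 1000, "chunk_overlap": 100}, chunk_size=500)
```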

refine_system_prompt(task_description: str) → str[source]

Use the LLM to refine the user’s task description into a proper system prompt

chunk_document(doc: ProcessedDocument) → Generator[DocumentChunk, None, None][source]

Split the document into chunks if needed, based on the configured chunk_size and chunk_overlap thresholds
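A hedged sketch of sliding-window splitting in the spirit of chunk_document is shown below. Character-based windowing, with chunk_overlap characters carried between consecutive windows, is an assumption; the real implementation may split by tokens or by document structure:

```python
from typing import Generator

# Assumed character-based sliding window: each chunk is chunk_size
# characters, and consecutive chunks share chunk_overlap characters.
def chunk_text(text: str, chunk_size: int,
               chunk_overlap: int) -> Generator[str, None, None]:
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        yield text[start:start + chunk_size]
        if start + chunk_size >= len(text):
            break

chunks = list(chunk_text("0123456789" * 3, chunk_size=10, chunk_overlap=5))
```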

create_text_extraction_prompt() → ChatPromptTemplate[source]

Create a prompt template for text-based information extraction

process_text_chunk(chunk: DocumentChunk) → Dict[str, Any] | None[source]

Process a text document chunk

create_image_extraction_prompt() → ChatPromptTemplate[source]

Create a prompt template for image-based information extraction

process_image_chunk(chunk: DocumentChunk) → Dict[str, Any] | None[source]

Process an image document chunk

process_chunk(chunk: DocumentChunk) → Dict[str, Any] | None[source]

Process a document chunk based on its type
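Routing a chunk by its type can be sketched as a simple dispatch table. The "text"/"image" type strings and the handler bodies below are assumptions; only the shape (dispatch on content_type, returning None for unsupported types) follows the docstring above:

```python
from collections import namedtuple

# Stand-in for a DocumentChunk, for illustration only.
Chunk = namedtuple("Chunk", ["content", "content_type"])

# Hypothetical dispatch: route a chunk to a handler by content_type,
# returning None when the type is unsupported.
def process_chunk(chunk):
    handlers = {
        "text": lambda c: {"kind": "text", "length": len(c.content)},
        "image": lambda c: {"kind": "image", "size": len(c.content)},
    }
    handler = handlers.get(chunk.content_type)
    return handler(chunk) if handler else None
```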

process_document(doc: ProcessedDocument | List[ProcessedDocument]) → List[Dict[str, Any]][source]

Process document and extract information, merging in chunk metadata.
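"Merging in chunk metadata" might combine each extracted record with its chunk's metadata dict, as sketched here. The merge direction (extracted fields win on key collisions) is an assumption:

```python
from typing import Any, Dict, Optional

# Assumed merge: chunk metadata is folded into each extracted record,
# with extracted fields taking precedence on key collisions.
def merge_metadata(extracted: Dict[str, Any],
                   metadata: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    return {**(metadata or {}), **extracted}
```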

save_results(results: List[Any], output_path: str | Path) → None[source]

Save extraction results to a JSON file
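A minimal sketch matching the documented save_results signature: serialize the results list as JSON at output_path. Creating parent directories and the two-space indent are assumptions:

```python
import json
from pathlib import Path
from typing import Any, List, Union

# Sketch of save_results: write the results list as JSON to output_path.
def save_results(results: List[Any], output_path: Union[str, Path]) -> None:
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # assumed convenience
    path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
                    encoding="utf-8")
```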

process_all(save_individual: bool = False, combined_output_name: str = 'all_extracted.json') → None[source]

Process all documents in the configured directory

Parameters:
  • save_individual – If True, save each document’s results to a separate JSON file (old behavior)

  • combined_output_name – Name of the combined output file (default: “all_extracted.json”)
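The save_individual switch described above can be sketched as follows: either one combined JSON file named combined_output_name, or one file per document. The per-document file naming ("<name>.json") is an assumption:

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Hypothetical sketch of the save_individual switch: combined output
# writes a single JSON file; save_individual writes one file per
# document (the "<name>.json" naming is an assumption).
def write_outputs(results_by_doc: Dict[str, List[Any]], out_dir: Path,
                  save_individual: bool = False,
                  combined_output_name: str = "all_extracted.json") -> List[Path]:
    out_dir.mkdir(parents=True, exist_ok=True)
    if save_individual:
        paths = []
        for name, results in results_by_doc.items():
            path = out_dir / f"{name}.json"
            path.write_text(json.dumps(results))
            paths.append(path)
        return paths
    combined = out_dir / combined_output_name
    combined.write_text(json.dumps(results_by_doc))
    return [combined]
```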