dllmforge.IE_agent_extractor¶
Synchronous information extractor module for extracting structured information from documents using an LLM.
Classes

| DocumentChunk | Class representing a chunk of document content |
| InfoExtractor | Class for extracting information from documents using an LLM |
- class dllmforge.IE_agent_extractor.DocumentChunk(content: str | bytes, content_type: str, metadata: Dict[str, Any] | None = None)[source]¶
Class representing a chunk of document content
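Judging from the constructor signature above, a `DocumentChunk` is a small container pairing content with its type and optional metadata. A minimal stand-in sketch (the dataclass below is illustrative, not the library's actual implementation; the example `content_type` values and metadata keys are assumptions):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union

@dataclass
class DocumentChunk:
    """Stand-in for dllmforge.IE_agent_extractor.DocumentChunk (illustrative only)."""
    content: Union[str, bytes]                 # chunk text, or raw bytes for an image
    content_type: str                          # e.g. "text" or "image" (assumed values)
    metadata: Optional[Dict[str, Any]] = None  # e.g. source file, chunk index (assumed keys)

chunk = DocumentChunk(
    content="First page of the report...",
    content_type="text",
    metadata={"source": "report.pdf", "chunk": 0},
)
```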
- class dllmforge.IE_agent_extractor.InfoExtractor(config: IEAgentConfig | None = None, output_schema: type[BaseModel] | None = None, llm_api: LangchainAPI | None = None, system_prompt: str | None = None, chunk_size: int | None = None, chunk_overlap: int | None = None, doc_processor: DocumentProcessor | None = None, document_output_type: str = 'text')[source]¶
Class for extracting information from documents using an LLM
Initialize the information extractor.
You can either pass a config object (IEAgentConfig) or the individual parameters directly.
- __init__(config: IEAgentConfig | None = None, output_schema: type[BaseModel] | None = None, llm_api: LangchainAPI | None = None, system_prompt: str | None = None, chunk_size: int | None = None, chunk_overlap: int | None = None, doc_processor: DocumentProcessor | None = None, document_output_type: str = 'text')[source]¶
Initialize the information extractor.
You can either pass a config object (IEAgentConfig) or the individual parameters directly.
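The "config object or individual parameters" pattern described above can be sketched as follows. This is a simplified illustration, not dllmforge's code: the stand-in `IEAgentConfig` fields, default values, and the precedence rule (explicit parameters override the config) are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IEAgentConfig:
    """Stand-in for dllmforge's IEAgentConfig (fields and defaults assumed)."""
    chunk_size: int = 2000
    chunk_overlap: int = 200

class InfoExtractor:
    """Sketch of constructor parameter resolution only."""
    def __init__(self,
                 config: Optional[IEAgentConfig] = None,
                 chunk_size: Optional[int] = None,
                 chunk_overlap: Optional[int] = None):
        # Fall back to a default config, then let explicit parameters win
        # (this precedence order is an assumption).
        cfg = config or IEAgentConfig()
        self.chunk_size = chunk_size if chunk_size is not None else cfg.chunk_size
        self.chunk_overlap = chunk_overlap if chunk_overlap is not None else cfg.chunk_overlap

extractor = InfoExtractor(config=IEAgentConfig(chunk_size=4000))
```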
- refine_system_prompt(task_description: str) → str[source]¶
Use the LLM to refine the user's task description into a proper system prompt
- chunk_document(doc: ProcessedDocument) → Generator[DocumentChunk, None, None][source]¶
Split document into chunks if needed based on thresholds
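The threshold-based splitting described above can be sketched in pure Python. This is a character-based illustration of chunking with overlap; dllmforge's actual splitting strategy (token- vs character-based, its thresholds, and boundary handling) may differ.

```python
from typing import Generator

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> Generator[str, None, None]:
    """Illustrative fixed-size chunking with overlap (not dllmforge's algorithm)."""
    if len(text) <= chunk_size:
        # Below the threshold: yield the document as a single chunk
        yield text
        return
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    for start in range(0, len(text), step):
        yield text[start:start + chunk_size]
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text

chunks = list(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Each chunk shares its last `chunk_overlap` characters with the start of the next one, so information spanning a boundary appears intact in at least one chunk.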
- create_text_extraction_prompt() → ChatPromptTemplate[source]¶
Create prompt template for text-based information extraction
- process_text_chunk(chunk: DocumentChunk) → Dict[str, Any] | None[source]¶
Process a text document chunk
- create_image_extraction_prompt() → ChatPromptTemplate[source]¶
Create prompt template for image-based information extraction
- process_image_chunk(chunk: DocumentChunk) → Dict[str, Any] | None[source]¶
Process an image document chunk
- process_chunk(chunk: DocumentChunk) → Dict[str, Any] | None[source]¶
Process a document chunk based on its type
- process_document(doc: ProcessedDocument | List[ProcessedDocument]) → List[Dict[str, Any]][source]¶
Process document and extract information, merging in chunk metadata.
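The "merging in chunk metadata" step can be sketched as folding each chunk's metadata into its extraction result. The key layout below (a nested `"metadata"` entry) is an assumption, not dllmforge's actual output format.

```python
from typing import Any, Dict, List

def merge_chunk_metadata(extracted: Dict[str, Any],
                         chunk_metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Illustrative merge of a chunk's metadata into its extraction result."""
    merged = dict(extracted)               # copy so the LLM output isn't mutated
    merged["metadata"] = dict(chunk_metadata)
    return merged

# One merged result per processed chunk, as in the List[Dict[str, Any]] return type
results: List[Dict[str, Any]] = [
    merge_chunk_metadata({"title": "Q3 Report"}, {"source": "report.pdf", "chunk": 0}),
]
```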
- save_results(results: List[Any], output_path: str | Path) → None[source]¶
Save extraction results to JSON file
- process_all(save_individual: bool = False, combined_output_name: str = 'all_extracted.json') → None[source]¶
Process all documents in configured directory
- Parameters:
save_individual – If True, save each document to a separate JSON file (old behavior)
combined_output_name – Name of the combined output file (default: “all_extracted.json”)