package extract

v1.80.0
Published: Mar 10, 2026 License: Apache-2.0 Imports: 23 Imported by: 0

Documentation

Constants

const DefaultMaxPages = 0

DefaultMaxPages is the default page limit for extraction. 0 means no limit (all pages are processed).

const DefaultOCRConfThreshold = 70

DefaultOCRConfThreshold is the default confidence threshold below which OCR confidence annotations are included in spatial output. Lines with min confidence >= this threshold omit the confidence score to save tokens.

const DefaultTextTimeout = 30 * time.Second

DefaultTextTimeout is the default timeout for pdftotext.

const MIMEApplicationPDF = "application/pdf"

MIMEApplicationPDF is the MIME type for PDF documents.

Variables

var ExtractionAllowedOps = func() map[string]AllowedOps {
	m := make(map[string]AllowedOps)
	for _, op := range ExtractionOps {
		a := m[op.Table]
		switch op.Action {
		case ActionCreate:
			a.Insert = true
		case ActionUpdate:
			a.Update = true
		}
		m[op.Table] = a
	}
	return m
}()

ExtractionAllowedOps is derived from ExtractionOps for use by ValidateOperations.

var ExtractionOps = func() []TableOp {
	var ops []TableOp
	for _, td := range ExtractionTableDefs {
		for _, ad := range td.Actions {
			ops = append(ops, expandTableOp(td, ad))
		}
	}
	return ops
}()

ExtractionOps expands ExtractionTableDefs into flat TableOp entries.

var ExtractionTableDefs = []TableDef{
	{
		Table:   data.TableVendors,
		Columns: columnsFromMeta(data.TableVendors),
		Actions: []ActionDef{
			{Action: ActionCreate, Required: []string{"name"}},
			{Action: ActionUpdate, Required: []string{"id"}, Extra: []ColumnDef{
				{Name: "id", Type: ColTypeInteger},
			}},
		},
	},
	{
		Table:   data.TableAppliances,
		Columns: columnsFromMeta(data.TableAppliances),
		Actions: []ActionDef{
			{Action: ActionCreate, Required: []string{"name"}},
			{Action: ActionUpdate, Required: []string{"id"}, Extra: []ColumnDef{
				{Name: "id", Type: ColTypeInteger},
			}},
		},
	},
	{
		Table: data.TableProjects,
		Columns: withEnum(
			columnsFromMeta(data.TableProjects),
			"status", []any{
				data.ProjectStatusIdeating,
				data.ProjectStatusPlanned,
				data.ProjectStatusQuoted,
				data.ProjectStatusInProgress,
				data.ProjectStatusDelayed,
				data.ProjectStatusCompleted,
				data.ProjectStatusAbandoned,
			},
		),
		Actions: []ActionDef{
			{Action: ActionCreate, Required: []string{"title"}},
		},
	},
	{
		Table: data.TableQuotes,
		Columns: append(
			columnsFromMeta(data.TableQuotes),
			ColumnDef{Name: "vendor_name", Type: ColTypeString},
		),
		Actions: []ActionDef{
			{Action: ActionCreate, Required: []string{"project_id", "total_cents"}},
			{Action: ActionUpdate, Required: []string{"id"}, Extra: []ColumnDef{
				{Name: "id", Type: ColTypeInteger},
			}},
		},
	},
	{
		Table:   data.TableMaintenanceItems,
		Columns: columnsFromMeta(data.TableMaintenanceItems),
		Actions: []ActionDef{
			{Action: ActionCreate, Required: []string{"name"}},
			{Action: ActionUpdate, Required: []string{"id"}, Extra: []ColumnDef{
				{Name: "id", Type: ColTypeInteger},
			}},
		},
	},
	{
		Table: data.TableIncidents,
		Columns: append(
			withEnum(
				withEnum(
					columnsFromMeta(data.TableIncidents),
					"status", []any{
						data.IncidentStatusOpen,
						data.IncidentStatusInProgress,
						data.IncidentStatusResolved,
					},
				),
				"severity", []any{
					data.IncidentSeverityUrgent,
					data.IncidentSeveritySoon,
					data.IncidentSeverityWhenever,
				},
			),
			ColumnDef{Name: "vendor_name", Type: ColTypeString},
		),
		Actions: []ActionDef{
			{Action: ActionCreate, Required: []string{"title"}},
		},
	},
	{
		Table: data.TableServiceLogEntries,
		Columns: append(
			columnsFromMeta(data.TableServiceLogEntries),
			ColumnDef{Name: "vendor_name", Type: ColTypeString},
		),
		Actions: []ActionDef{
			{Action: ActionCreate, Required: []string{"maintenance_item_id"}},
		},
	},
	{
		Table: data.TableDocuments,
		Columns: withEnum(
			columnsFromMeta(data.TableDocuments),
			"entity_kind", []any{
				data.DocumentEntityProject,
				data.DocumentEntityQuote,
				data.DocumentEntityMaintenance,
				data.DocumentEntityAppliance,
				data.DocumentEntityServiceLog,
				data.DocumentEntityVendor,
				data.DocumentEntityIncident,
			},
		),
		Actions: []ActionDef{
			{Action: ActionCreate},
			{Action: ActionUpdate, Required: []string{"id"}, Extra: []ColumnDef{
				{Name: "id", Type: ColTypeInteger},
			}, Omit: []string{"file_name"}},
		},
	},
}

ExtractionTableDefs is the single source of truth for extraction table metadata. Column lists are derived from generated model metadata via columnsFromMeta; only policy annotations (Actions, Required, Enum, Omit, synthetic columns) are hand-maintained.

ExtractionTables is the set of tables the LLM receives DDL for and may reference in its output. Includes both writable and read-only reference tables.

Functions

func BuildExtractionPrompt

func BuildExtractionPrompt(in ExtractionPromptInput) []llm.Message

BuildExtractionPrompt creates the system and user messages for document extraction. The system prompt includes the database DDL and existing entity rows; the LLM outputs a JSON array of operations.

func ExtractText

func ExtractText(data []byte, mime string, timeout time.Duration) (string, error)

ExtractText pulls plain text from document content based on MIME type. It returns an empty string (not an error) for unsupported MIME types. PDF extraction uses pdftotext (poppler-utils) when available and returns an empty string for PDFs when the tool is missing. The timeout parameter caps how long pdftotext can run (0 = DefaultTextTimeout).

This is a convenience wrapper that delegates to PDFTextExtractor and PlainTextExtractor. For full pipeline extraction, use Pipeline.Run.

func ExtractWithProgress added in v1.47.0

func ExtractWithProgress(
	ctx context.Context,
	data []byte,
	mime string,
	extractors []Extractor,
) <-chan ExtractProgress

ExtractWithProgress runs async extraction with per-page progress updates sent on the returned channel. The channel closes when processing completes. The extractors list is consulted to determine whether to run image or PDF OCR. Unsupported types produce a single Done message with empty text.
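
A minimal sketch of how a caller might drain such a progress channel. The progress type mirrors only a subset of the documented ExtractProgress fields, and fakeExtract is a hypothetical stand-in for ExtractWithProgress (it takes no context or extractors), so the example runs without this package.

```go
package main

import "fmt"

// progress mirrors a subset of the documented ExtractProgress fields.
type progress struct {
	Page  int
	Total int
	Done  bool
	Text  string
	Err   error
}

// fakeExtract is a hypothetical producer: it sends one update per page,
// then a final Done message carrying the accumulated text, then closes
// the channel.
func fakeExtract(pages []string) <-chan progress {
	ch := make(chan progress)
	go func() {
		defer close(ch)
		text := ""
		for i, p := range pages {
			text += p
			ch <- progress{Page: i + 1, Total: len(pages)}
		}
		ch <- progress{Done: true, Text: text}
	}()
	return ch
}

// drain reads until the channel closes. It returns the final text, the
// number of per-page updates seen, and the first error reported.
func drain(ch <-chan progress) (string, int, error) {
	var (
		text     string
		pages    int
		firstErr error
	)
	for p := range ch {
		switch {
		case p.Err != nil:
			if firstErr == nil {
				firstErr = p.Err
			}
		case p.Done:
			text = p.Text
		default:
			pages++
		}
	}
	return text, pages, firstErr
}

func main() {
	text, pages, err := drain(fakeExtract([]string{"page one. ", "page two."}))
	fmt.Println(text, pages, err)
}
```

Ranging over the channel relies on the documented guarantee that it closes when processing completes, so no separate done signal is needed.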

func ExtractorMaxPages added in v1.47.0

func ExtractorMaxPages(extractors []Extractor) int

ExtractorMaxPages returns the max pages from the first PDFOCRExtractor in the list, or 0 (meaning "no limit") if none is found.

func ExtractorTimeout added in v1.47.0

func ExtractorTimeout(extractors []Extractor) time.Duration

ExtractorTimeout returns the timeout from the first PDFTextExtractor in the list, or 0 (meaning "use default") if none is found.

func FormatDDLBlock added in v1.49.0

func FormatDDLBlock(ddl map[string]string, tables []string) string

FormatDDLBlock formats the DDL map as a SQL comment block for inclusion in the LLM system prompt.

func FormatEntityRows added in v1.49.0

func FormatEntityRows(label string, rows []EntityRow) string

FormatEntityRows formats a named set of entity rows as SQL comments for inclusion in the LLM system prompt.

func HasMatchingExtractor added in v1.47.0

func HasMatchingExtractor(extractors []Extractor, tool string, mime string) bool

HasMatchingExtractor reports whether any extractor in the list with the given tool name matches the MIME type and is available.

func HasPDFInfo added in v1.65.1

func HasPDFInfo() bool

HasPDFInfo reports whether the pdfinfo binary (from poppler-utils) is on PATH. The result is cached for the process lifetime.

func HasPDFToCairo added in v1.65.1

func HasPDFToCairo() bool

HasPDFToCairo reports whether the pdftocairo binary (from poppler-utils) is on PATH. The result is cached for the process lifetime.

func HasPDFToText

func HasPDFToText() bool

HasPDFToText reports whether the pdftotext binary (from poppler-utils) is on PATH. The result is cached for the process lifetime.

func HasTesseract

func HasTesseract() bool

HasTesseract reports whether the tesseract binary is on PATH. The result is cached for the process lifetime.

func ImageOCRAvailable

func ImageOCRAvailable() bool

ImageOCRAvailable reports whether tesseract is available for direct image OCR (no PDF tools needed for image files).

func IsImageMIME

func IsImageMIME(mime string) bool

IsImageMIME reports whether the MIME type is an image format that tesseract can process.

func IsScanned

func IsScanned(extractedText string) bool

IsScanned returns true if the extracted text is empty or whitespace-only, indicating the document likely needs OCR.
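
The documented rule can be restated in a few lines. looksScanned is a hypothetical name used for illustration; it is a sketch of the described behavior, not the package's source.

```go
package main

import (
	"fmt"
	"strings"
)

// looksScanned restates the documented rule: extracted text that is empty
// or whitespace-only suggests the document needs OCR.
func looksScanned(extractedText string) bool {
	return strings.TrimSpace(extractedText) == ""
}

func main() {
	fmt.Println(looksScanned(" \n\t"))      // true
	fmt.Println(looksScanned("Invoice 42")) // false
}
```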

func NeedsOCR added in v1.47.0

func NeedsOCR(extractors []Extractor, mime string) bool

NeedsOCR reports whether any OCR-capable extractor in the list matches the MIME type and is available. Use this instead of checking tool names directly so callers don't couple to extractor internals.

func OCRAvailable

func OCRAvailable() bool

OCRAvailable reports whether tesseract and pdftocairo (with pdfinfo for page count discovery) are available.

func OperationsSchema added in v1.49.0

func OperationsSchema() map[string]any

OperationsSchema returns the JSON Schema for structured extraction output. The schema uses anyOf to define precise per-table column schemas, so the LLM is constrained to produce only valid column names and types for each {action, table} combination. Document operations live in a separate top-level "document" field (singular object) rather than the array.

func ParseInt64 added in v1.61.0

func ParseInt64(v any) int64

ParseInt64 extracts an int64 from an arbitrary value. Handles concrete numeric types (from GORM/SQLite map queries), json.Number, and string representations. Returns 0 for nil or unparsable values.

func ParseUint added in v1.54.1

func ParseUint(v any) uint

ParseUint extracts a uint from an arbitrary value. Handles concrete numeric types (from GORM/SQLite map queries), json.Number, and string representations. Returns 0 for nil, negative, or unparsable values.
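
As an illustration of this kind of lenient conversion, the sketch below handles a handful of input types. toUint is a hypothetical helper, and the set of types it covers is an assumption; the real function may handle more cases or differ in detail.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// toUint converts a loosely typed value to uint. It returns 0 for nil,
// negative, unparsable, or unsupported input, matching the documented
// contract. Only a few input types are covered in this sketch.
func toUint(v any) uint {
	switch n := v.(type) {
	case uint:
		return n
	case int:
		if n < 0 {
			return 0
		}
		return uint(n)
	case int64:
		if n < 0 {
			return 0
		}
		return uint(n)
	case float64:
		if n < 0 {
			return 0
		}
		return uint(n) // truncates any fractional part
	case json.Number:
		return toUint(n.String())
	case string:
		// bitSize 0 means "fits in a uint"; a leading minus sign fails.
		u, err := strconv.ParseUint(n, 10, 0)
		if err != nil {
			return 0
		}
		return uint(u)
	default:
		return 0
	}
}

func main() {
	fmt.Println(toUint("17"), toUint(-5), toUint(json.Number("8")), toUint(nil))
}
```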

func SpatialTextFromTSV added in v1.79.0

func SpatialTextFromTSV(tsv []byte, confThreshold int) string

SpatialTextFromTSV converts tesseract TSV output into a compact spatial format with line-level bounding boxes. Each output line has the form:

[left,top,width] word1 word2 ...

When the minimum confidence for a line falls below confThreshold, the confidence is appended:

[left,top,width;minConf] word1 word2 ...

Lines within the same block/paragraph are separated by newlines; block or paragraph breaks produce a blank line.
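
The per-line layout above can be sketched as follows. formatSpatialLine is a hypothetical helper that renders a single line from already-aggregated values; it does not parse tesseract TSV.

```go
package main

import (
	"fmt"
	"strings"
)

// formatSpatialLine renders one line in the documented layout. The
// ";minConf" suffix is added only when minConf is below confThreshold.
func formatSpatialLine(left, top, width, minConf, confThreshold int, words []string) string {
	box := fmt.Sprintf("[%d,%d,%d]", left, top, width)
	if minConf < confThreshold {
		box = fmt.Sprintf("[%d,%d,%d;%d]", left, top, width, minConf)
	}
	return box + " " + strings.Join(words, " ")
}

func main() {
	// Confident line: prints "[10,20,300] Total due".
	fmt.Println(formatSpatialLine(10, 20, 300, 95, 70, []string{"Total", "due"}))
	// Low-confidence line: prints "[10,48,120;41] $1,2O0".
	fmt.Println(formatSpatialLine(10, 48, 120, 41, 70, []string{"$1,2O0"}))
}
```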

func StripCodeFences

func StripCodeFences(s string) string

StripCodeFences removes markdown code fences that LLMs sometimes wrap around JSON output. Handles fences anywhere in the text (not just at the start), since LLMs may produce commentary before the fenced block.
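
One way such stripping could work is sketched below. stripFences is a hypothetical helper that keeps only the contents of the first fenced block; how the real function treats multiple blocks or unterminated fences is not stated here, so that behavior is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// fence is the three-backtick markdown code fence marker.
var fence = strings.Repeat("`", 3)

// stripFences returns the contents of the first fenced block in s, or the
// trimmed input when no fence is present. Text before the opening fence
// (such as LLM commentary) is discarded.
func stripFences(s string) string {
	start := strings.Index(s, fence)
	if start < 0 {
		return strings.TrimSpace(s)
	}
	rest := s[start+len(fence):]
	// Drop the rest of the opening fence line, e.g. a "json" language tag.
	if nl := strings.IndexByte(rest, '\n'); nl >= 0 {
		rest = rest[nl+1:]
	}
	// Cut at the closing fence if there is one.
	if end := strings.Index(rest, fence); end >= 0 {
		rest = rest[:end]
	}
	return strings.TrimSpace(rest)
}

func main() {
	raw := "Here is the result:\n" + fence + "json\n[1, 2]\n" + fence + "\nDone."
	fmt.Println(stripFences(raw)) // [1, 2]
}
```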

func ValidateOperations added in v1.49.0

func ValidateOperations(ops []Operation, allowed map[string]AllowedOps) error

ValidateOperations checks each operation against the allowed tables and action types. Returns an error describing the first violation found.
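
A minimal sketch of this kind of check, using local types that mirror the documented AllowedOps and Operation shapes. The validate function and its error wording are hypothetical.

```go
package main

import "fmt"

// allowedOps mirrors the documented AllowedOps struct.
type allowedOps struct {
	Insert bool
	Update bool
}

// operation mirrors the Action and Table fields of Operation.
type operation struct {
	Action string
	Table  string
}

// validate returns an error describing the first operation that targets
// an unknown table or uses an action the table does not permit.
func validate(ops []operation, allowed map[string]allowedOps) error {
	for i, op := range ops {
		a, ok := allowed[op.Table]
		if !ok {
			return fmt.Errorf("operation %d: table %q is not allowed", i, op.Table)
		}
		switch op.Action {
		case "create":
			if !a.Insert {
				return fmt.Errorf("operation %d: create not allowed on %q", i, op.Table)
			}
		case "update":
			if !a.Update {
				return fmt.Errorf("operation %d: update not allowed on %q", i, op.Table)
			}
		default:
			return fmt.Errorf("operation %d: unknown action %q", i, op.Action)
		}
	}
	return nil
}

func main() {
	// Mirrors ExtractionTableDefs: projects permit create only.
	allowed := map[string]allowedOps{
		"vendors":  {Insert: true, Update: true},
		"projects": {Insert: true},
	}
	fmt.Println(validate([]operation{{Action: "create", Table: "projects"}}, allowed))
	fmt.Println(validate([]operation{{Action: "update", Table: "projects"}}, allowed))
}
```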

Types

type AcquireToolState added in v1.59.0

type AcquireToolState struct {
	Tool    string
	Running bool // true while the tool is executing
	Count   int  // pages completed (valid when !Running, or incremental while Running)
	Err     error
}

AcquireToolState tracks a single image extraction tool during acquisition.

type Action added in v1.63.0

type Action string

Action is a typed string enum for extraction operations.

const (
	ActionCreate Action = "create"
	ActionUpdate Action = "update"
)

type ActionDef added in v1.63.0

type ActionDef struct {
	Action   Action
	Required []string    // columns required for this action
	Extra    []ColumnDef // columns only present for this action (e.g. id for update)
	Omit     []string    // columns from the table to exclude for this action
}

ActionDef describes what an action can do on a table's columns.

type AllowedOps added in v1.49.0

type AllowedOps struct {
	Insert bool
	Update bool
}

AllowedOps specifies which operations are permitted on a table. Insert maps to "create", Update maps to "update".

type ColType added in v1.63.0

type ColType string

ColType is a JSON Schema type for a column.

const (
	ColTypeString  ColType = "string"
	ColTypeInteger ColType = "integer"
)

type ColumnDef added in v1.63.0

type ColumnDef struct {
	Name string
	Type ColType
	Enum []any // optional enum constraint (e.g. entity_kind values)
}

ColumnDef describes a single column the LLM may write.

type EntityRow added in v1.49.0

type EntityRow struct {
	ID   uint
	Name string
}

EntityRow is a lightweight (id, name) pair for FK context in LLM prompts.

type ExtractProgress added in v1.47.0

type ExtractProgress struct {
	Tool     string // extractor tool name (set on Done)
	Desc     string // human description (set on Done)
	Phase    string // e.g. "extract"
	Page     int    // current page (1-indexed)
	Total    int    // total pages (0 until known)
	DocPages int    // total pages in the PDF (0 when uncapped)
	Done     bool   // all phases finished
	Text     string // accumulated text (set on Done)
	Data     []byte // structured data (set on Done)
	Err      error  // set on failure

	// AcquireTools carries per-tool state during the rasterization+OCR
	// phase. Non-nil while pages are being processed.
	AcquireTools []AcquireToolState
}

ExtractProgress reports incremental progress from ExtractWithProgress.

type ExtractionPromptInput

type ExtractionPromptInput struct {
	DocID         uint
	Filename      string
	MIME          string
	SizeBytes     int64
	Schema        SchemaContext
	Sources       []TextSource
	SendTSV       bool // send spatial layout annotations from tesseract OCR
	ConfThreshold int  // confidence threshold for spatial annotations
}

ExtractionPromptInput holds the inputs for building an extraction prompt.

type Extractor added in v1.47.0

type Extractor interface {
	Tool() string
	Matches(mime string) bool
	Available() bool
	Extract(ctx context.Context, data []byte) (TextSource, error)
}

Extractor extracts text from document bytes.
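
To show what satisfying this interface might look like, the sketch below declares local copies of the documented Extractor and TextSource shapes and adds a hypothetical markdownExtractor. The run helper, which picks the first extractor that matches and is available, is also an assumption made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// textSource and extractor mirror the documented TextSource and Extractor.
type textSource struct {
	Tool string
	Desc string
	Text string
	Data []byte
}

type extractor interface {
	Tool() string
	Matches(mime string) bool
	Available() bool
	Extract(ctx context.Context, data []byte) (textSource, error)
}

// markdownExtractor is a hypothetical extractor with no external tools.
type markdownExtractor struct{}

func (markdownExtractor) Tool() string             { return "markdown" }
func (markdownExtractor) Matches(mime string) bool { return mime == "text/markdown" }
func (markdownExtractor) Available() bool          { return true }

func (markdownExtractor) Extract(_ context.Context, data []byte) (textSource, error) {
	return textSource{
		Tool: "markdown",
		Desc: "raw markdown text",
		Text: strings.TrimSpace(string(data)),
	}, nil
}

// run uses the first extractor that matches the MIME type and is
// available. With no match it returns an empty source and no error.
func run(list []extractor, mime string, data []byte) (textSource, error) {
	for _, e := range list {
		if e.Matches(mime) && e.Available() {
			return e.Extract(context.Background(), data)
		}
	}
	return textSource{}, nil
}

func main() {
	src, err := run([]extractor{markdownExtractor{}}, "text/markdown", []byte("  # Title \n"))
	fmt.Printf("%s %q %v\n", src.Tool, src.Text, err)
}
```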

func DefaultExtractors added in v1.47.0

func DefaultExtractors(
	maxPages int,
	timeout time.Duration,
	ocrEnabled bool,
) []Extractor

DefaultExtractors returns the standard extractors in priority order: pdftotext, plaintext, PDF OCR, image OCR. maxPages of 0 means no limit (all pages). Zero timeout causes the concrete extractor to use its default. ocrEnabled controls whether OCR extractors are included (default true).

type ImageOCRExtractor added in v1.47.0

type ImageOCRExtractor struct{}

ImageOCRExtractor wraps ocrImage for direct image OCR.

func (*ImageOCRExtractor) Available added in v1.47.0

func (e *ImageOCRExtractor) Available() bool

func (*ImageOCRExtractor) Extract added in v1.47.0

func (e *ImageOCRExtractor) Extract(ctx context.Context, data []byte) (TextSource, error)

func (*ImageOCRExtractor) Matches added in v1.47.0

func (e *ImageOCRExtractor) Matches(mime string) bool

func (*ImageOCRExtractor) Tool added in v1.47.0

func (e *ImageOCRExtractor) Tool() string

type Operation added in v1.49.0

type Operation struct {
	Action Action         `json:"action"`
	Table  string         `json:"table"`
	Data   map[string]any `json:"data"`
}

Operation is a single create/update action the LLM wants to perform.

func ParseOperations added in v1.49.0

func ParseOperations(raw string) ([]Operation, error)

ParseOperations unmarshals the schema-constrained {"operations": [...], "document": {...}} response from the LLM. The optional "document" field is synthesized into a regular Operation with Table "documents" so downstream consumers see a uniform slice.
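
A sketch of the envelope handling described above. The parse function is hypothetical, and the exact shape of the "document" object is an assumption: here it is taken to carry "action" and "data" keys like a regular operation, with the table filled in during parsing.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// operation mirrors the documented Operation struct.
type operation struct {
	Action string         `json:"action"`
	Table  string         `json:"table"`
	Data   map[string]any `json:"data"`
}

// envelope is the assumed top-level response shape.
type envelope struct {
	Operations []operation `json:"operations"`
	Document   *operation  `json:"document"` // optional; assumed to hold action + data
}

// parse decodes the envelope and appends the optional document object as a
// regular operation on the "documents" table.
func parse(raw string) ([]operation, error) {
	var env envelope
	if err := json.Unmarshal([]byte(raw), &env); err != nil {
		return nil, fmt.Errorf("parse operations: %w", err)
	}
	ops := env.Operations
	if env.Document != nil {
		doc := *env.Document
		doc.Table = "documents"
		ops = append(ops, doc)
	}
	return ops, nil
}

func main() {
	raw := `{
		"operations": [{"action": "create", "table": "vendors", "data": {"name": "Acme"}}],
		"document": {"action": "update", "data": {"id": 7}}
	}`
	ops, err := parse(raw)
	fmt.Println(len(ops), err)
	for _, op := range ops {
		fmt.Println(op.Action, op.Table)
	}
}
```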

type PDFOCRExtractor added in v1.47.0

type PDFOCRExtractor struct {
	MaxPages int
}

PDFOCRExtractor wraps ocrPDF for scanned PDF pages.

func (*PDFOCRExtractor) Available added in v1.47.0

func (e *PDFOCRExtractor) Available() bool

func (*PDFOCRExtractor) Extract added in v1.47.0

func (e *PDFOCRExtractor) Extract(ctx context.Context, data []byte) (TextSource, error)

func (*PDFOCRExtractor) Matches added in v1.47.0

func (e *PDFOCRExtractor) Matches(mime string) bool

func (*PDFOCRExtractor) Tool added in v1.47.0

func (e *PDFOCRExtractor) Tool() string

type PDFTextExtractor added in v1.47.0

type PDFTextExtractor struct {
	Timeout time.Duration
}

PDFTextExtractor wraps pdftotext for digital PDF text extraction.

func (*PDFTextExtractor) Available added in v1.47.0

func (e *PDFTextExtractor) Available() bool

func (*PDFTextExtractor) Extract added in v1.47.0

func (e *PDFTextExtractor) Extract(ctx context.Context, data []byte) (TextSource, error)

func (*PDFTextExtractor) Matches added in v1.47.0

func (e *PDFTextExtractor) Matches(mime string) bool

func (*PDFTextExtractor) Tool added in v1.47.0

func (e *PDFTextExtractor) Tool() string

type Pipeline

type Pipeline struct {
	LLMClient     *llm.Client   // nil = skip LLM extraction
	Extractors    []Extractor   // nil = DefaultExtractors(0, 0, true)
	Schema        SchemaContext // DDL + entity rows for prompt
	DocID         uint          // document ID for UPDATE operations
	SendTSV       bool          // send spatial layout annotations to LLM
	ConfThreshold int           // confidence threshold for spatial annotations
}

Pipeline orchestrates the document extraction layers: text extraction, OCR, and LLM-powered structured extraction. Each layer is independent and gracefully degrades when its dependencies are unavailable.

func (*Pipeline) Run

func (p *Pipeline) Run(
	ctx context.Context,
	data []byte,
	filename string,
	mime string,
) *Result

Run executes the extraction pipeline on the given document data. It never returns a Go error -- all failures are captured in Result.Err so the caller can save the document regardless.

type PlainTextExtractor added in v1.47.0

type PlainTextExtractor struct{}

PlainTextExtractor normalizes whitespace from text/* content.

func (*PlainTextExtractor) Available added in v1.47.0

func (e *PlainTextExtractor) Available() bool

func (*PlainTextExtractor) Extract added in v1.47.0

func (e *PlainTextExtractor) Extract(_ context.Context, data []byte) (TextSource, error)

func (*PlainTextExtractor) Matches added in v1.47.0

func (e *PlainTextExtractor) Matches(mime string) bool

func (*PlainTextExtractor) Tool added in v1.47.0

func (e *PlainTextExtractor) Tool() string

type Result

type Result struct {
	Sources    []TextSource // text from each extraction method
	Operations []Operation  // nil if LLM unavailable or failed
	LLMRaw     string       // raw LLM output (for display)
	LLMUsed    bool
	Err        error // non-fatal extraction error; document still saves
}

Result holds the output of a pipeline run.

func (*Result) HasSource added in v1.47.0

func (r *Result) HasSource(tool string) bool

HasSource reports whether any source matches the given tool name.

func (*Result) SourceByTool added in v1.47.0

func (r *Result) SourceByTool(tool string) *TextSource

SourceByTool returns the first source matching the given tool name, or nil if not found.

func (*Result) Text added in v1.47.0

func (r *Result) Text() string

Text returns the first non-empty text from the extraction sources.
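
The selection rule can be sketched in a few lines. firstText is a hypothetical helper, and treating only the empty string as "empty" (rather than also whitespace-only text) is an assumption.

```go
package main

import "fmt"

// textSource mirrors the Tool and Text fields of TextSource.
type textSource struct {
	Tool string
	Text string
}

// firstText returns the text of the first source whose Text is not the
// empty string, or "" when every source is empty.
func firstText(sources []textSource) string {
	for _, s := range sources {
		if s.Text != "" {
			return s.Text
		}
	}
	return ""
}

func main() {
	sources := []textSource{
		{Tool: "pdftotext", Text: ""},
		{Tool: "tesseract", Text: "Invoice 42"},
	}
	fmt.Println(firstText(sources)) // Invoice 42
}
```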

type SchemaContext added in v1.49.0

type SchemaContext struct {
	DDL                   map[string]string // table name -> CREATE TABLE SQL
	Vendors               []EntityRow
	Projects              []EntityRow
	Appliances            []EntityRow
	MaintenanceItems      []EntityRow
	MaintenanceCategories []EntityRow
	ProjectTypes          []EntityRow
}

SchemaContext provides the schema and entity data the LLM needs to generate correct operations against the database.

type ShadowDB added in v1.61.0

type ShadowDB struct {
	// contains filtered or unexported fields
}

ShadowDB stages LLM extraction operations in an in-memory SQLite database so that cross-references between batch-created entities (e.g. a quote referencing a just-created vendor) resolve correctly via auto-increment IDs.

The shadow DB has FK constraints OFF -- it is a staging area, not a validator. Validation happens during commit against the real DB. Auto-increment IDs are seeded from the real DB's max IDs so shadow IDs occupy a disjoint range (max_real_id+1, ...), eliminating ambiguity between references to existing entities and batch-created ones.
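
The disjoint-range idea can be illustrated without a database. In this sketch, isShadowID and remapID are hypothetical helpers, and the assigned map stands in for whatever bookkeeping the commit step keeps when it replaces shadow IDs with real ones.

```go
package main

import "fmt"

// isShadowID reports whether id belongs to a row created in the current
// batch. Shadow IDs start at maxRealID+1, so any larger ID is batch-created.
func isShadowID(id, maxRealID uint) bool {
	return id > maxRealID
}

// remapID translates a reference to the ID it should have in the real
// database. References to existing rows pass through unchanged; shadow
// IDs are looked up in assigned (shadow ID -> real ID). The boolean is
// false when a shadow ID has no assigned real ID.
func remapID(id, maxRealID uint, assigned map[uint]uint) (uint, bool) {
	if !isShadowID(id, maxRealID) {
		return id, true
	}
	realID, ok := assigned[id]
	return realID, ok
}

func main() {
	const maxRealID = 10
	assigned := map[uint]uint{11: 42} // shadow vendor 11 became real vendor 42

	fmt.Println(remapID(3, maxRealID, assigned))  // 3 true
	fmt.Println(remapID(11, maxRealID, assigned)) // 42 true
	fmt.Println(remapID(12, maxRealID, assigned)) // 0 false
}
```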

func NewShadowDB added in v1.61.0

func NewShadowDB(store *data.Store) (*ShadowDB, error)

NewShadowDB creates an in-memory SQLite database and migrates the extraction-relevant tables. Auto-increment IDs are seeded from the real DB so shadow IDs occupy a disjoint range from existing real IDs, making cross-references unambiguous. FK constraints are OFF -- the shadow DB is a staging area; validation happens during commit.

func (*ShadowDB) Close added in v1.61.0

func (s *ShadowDB) Close() error

Close closes the underlying in-memory database connection.

func (*ShadowDB) Commit added in v1.61.0

func (s *ShadowDB) Commit(store *data.Store, ops []Operation) error

Commit copies staged shadow rows to the real database inside a single transaction, remapping shadow auto-increment IDs to real IDs. If any operation fails the entire batch is rolled back. Tables are processed in dependency order; updates are applied after all creates.

func (*ShadowDB) CreatedIDs added in v1.61.0

func (s *ShadowDB) CreatedIDs(table string) []uint

CreatedIDs returns the shadow auto-increment IDs for a given table in insertion order.

func (*ShadowDB) Stage added in v1.61.0

func (s *ShadowDB) Stage(ops []Operation) error

Stage inserts all operations into the shadow database in order. Create operations are inserted into the appropriate shadow table; update operations are recorded but not applied to the shadow DB (they target real DB rows and are handled during commit).

type TableDef added in v1.63.0

type TableDef struct {
	Table   string
	Columns []ColumnDef // shared columns across all actions
	Actions []ActionDef
}

TableDef defines a table's columns and which actions are allowed. Columns are derived from generated model metadata via columnsFromMeta; each ActionDef specifies required fields and any action-specific extras. Table-wide column exclusions are controlled by the extract:"-" struct tag on model fields, which causes genmeta to omit them from TableExtractColumns.

type TableOp added in v1.63.0

type TableOp struct {
	Action  Action
	Table   string
	Columns []flatColumn
}

TableOp is a flattened {action, table, columns} triple expanded from TableDef. Used by the schema builder and derived maps.

type TextSource added in v1.47.0

type TextSource struct {
	Tool string // "pdftotext", "plaintext", "tesseract"
	Desc string // human description for LLM context
	Text string
	Data []byte // optional structured data (e.g. OCR TSV)
}

TextSource holds text from a single extraction method.
