Data Extraction
KARLI-hosted data-extraction models turn uploaded files into structured text that downstream components can consume. They are selected from the Read File component when its Extraction Backend is set to karli.
KARLI is currently the only provider offering this category of model.
Available Models
| Model | Accepts | Notes |
|---|---|---|
| `karli/default-data-extraction` | Any | KARLI-managed default; routes the file to a sensible extractor. |
| `docling-project/docling` | Documents | Docling, run server-side by KARLI. |
| `datalab-to/marker` | Documents | Marker. |
| `opendatalab/MinerU` | Documents | MinerU. |
| `karli/multimodal-data-extraction` | Documents | Multimodal hybrid pipeline. |
| `openai/whisper-large-v3` | Audio | Audio transcription via Whisper. |
The Read File component validates the uploaded file against the chosen model's accepted type before uploading. For example, submitting a PDF to the Whisper model produces an error rather than an upload.
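The pre-upload check described above can be sketched roughly as follows. This is an illustration only, not the component's actual code: the `ACCEPTS` mapping mirrors the table, but the `classify` and `validate` helpers, the MIME-based audio/document split, and the use of `ValueError` are all assumptions made for the example.

```python
import mimetypes

# Accepted input category per model, copied from the table above.
ACCEPTS = {
    "karli/default-data-extraction": "any",
    "docling-project/docling": "documents",
    "datalab-to/marker": "documents",
    "opendatalab/MinerU": "documents",
    "karli/multimodal-data-extraction": "documents",
    "openai/whisper-large-v3": "audio",
}


def classify(filename: str) -> str:
    """Hypothetical classifier: audio/* MIME types are audio, everything else is a document."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is not None and mime.startswith("audio/"):
        return "audio"
    return "documents"


def validate(model: str, filename: str) -> None:
    """Raise before any upload if the file does not match the model's accepted type."""
    accepted = ACCEPTS[model]
    if accepted != "any" and classify(filename) != accepted:
        raise ValueError(f"{model} accepts {accepted}, but {filename!r} is not of that type")
```

With this sketch, `validate("openai/whisper-large-v3", "report.pdf")` raises, while the same file passes for `karli/default-data-extraction`.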
Request Shape
When a file is sent for extraction, the component issues a POST to {KARLI_BASE_URL}/data-extraction/extract as a multipart upload:
- Form field `extractorModel` carries the selected model (mapped to its KARLI identifier).
- The file part carries the document or audio file.
- `Authorization: Bearer <JWT>` uses the session JWT injected by the KARLI proxy.
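As a rough illustration of the request described above, the following sketch builds (but does not send) an equivalent multipart POST using only the Python standard library. Several details are assumptions: the file part's field name (`file` here) is not stated in this page, the model string is passed through without the KARLI identifier mapping, and the JWT is set explicitly even though in practice the KARLI proxy injects it. `build_extraction_request` is a hypothetical helper name.

```python
import urllib.request
import uuid


def build_extraction_request(base_url: str, jwt: str, model: str,
                             filename: str, file_bytes: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST for {KARLI_BASE_URL}/data-extraction/extract."""
    boundary = uuid.uuid4().hex
    head = "\r\n".join([
        f"--{boundary}",
        'Content-Disposition: form-data; name="extractorModel"',
        "",
        model,  # assumption: already mapped to its KARLI identifier
        f"--{boundary}",
        # assumption: the file part is named "file"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"',
        "Content-Type: application/octet-stream",
        "",
    ]).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = head + b"\r\n" + file_bytes + tail

    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/data-extraction/extract",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {jwt}",  # normally injected by the KARLI proxy
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; building the request separately keeps the shape easy to inspect.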
The response is a JSON object whose segments are concatenated into a single text payload; segments with a title are emitted as ## <title> Markdown headers.
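The flattening step can be sketched as below. The exact response schema is not given on this page, so the `segments`, `title`, and `text` keys, the blank-line separator, and the `segments_to_text` name are assumptions chosen for illustration.

```python
def segments_to_text(response: dict) -> str:
    """Concatenate response segments into one Markdown-flavoured text payload."""
    parts = []
    for segment in response.get("segments", []):  # assumed key name
        title = segment.get("title")              # assumed key name
        if title:
            # Titled segments get a level-2 Markdown header.
            parts.append(f"## {title}")
        text = segment.get("text", "")            # assumed key name
        if text:
            parts.append(text)
    # Assumption: pieces are separated by a blank line.
    return "\n\n".join(parts)
```

Untitled segments contribute only their text, so a response that mixes titled and untitled segments still produces a single continuous document.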
See Document Extraction for how the Read File component uses these models in practice, including the downstream Data shape.