SmolDocling-256M-preview
SmolDocling-256M-preview is a multimodal image-to-text model designed for efficient document conversion. It retains the main features of Docling and is fully compatible with Docling, achieved through seamless support for DoclingDocuments. Key features include:
- DocTags: Uses DocTags labels, an efficient and minimal document representation method, fully compatible with DoclingDocuments, clearly separating text and document structure.
- OCR: Accurately extracts text from images.
- Layout and Positioning: Preserves document structure and element bounding boxes.
- Code Recognition: Detects and formats code blocks, including indentation.
- Formula Recognition: Identifies and processes mathematical expressions.
- Chart Recognition: Extracts and interprets chart data.
- Table Recognition: Supports column and row headers for structured table extraction.
- Image Classification: Distinguishes graphic elements.
- Title Correspondence: Links titles to related images and graphics.
- List Grouping: Correctly organizes and structures list elements.
- Full Page Conversion: Processes entire pages, including all page elements (code, formulas, tables, charts, etc.).
- OCR with Bounding Boxes: Uses bounding boxes for OCR region identification.
- General Document Processing: Trained on scientific and non-scientific documents.
- Seamless Docling Integration: Imports Docling and exports in multiple formats (MD, HTML, etc.).
- Fast Inference: Uses VLLM, averaging 0.35 seconds per page on an A100 GPU.
This model is fine-tuned based on Idefics3, using DocTags for efficient tokenization, and will provide enhanced chart recognition, multi-page inference support, and chemical recognition functions. The developers also provide code examples for inference using transformers or vllm, and converting results into multiple output formats using Docling.