Surya OCR
Surya OCR is an AI-powered optical character recognition toolkit developed by Vik Paruchuri that supports over 90 languages across diverse document types. Its recognition model is a Vision Transformer-based encoder-decoder inspired by the Donut framework, and it works together with a line-level text detection module rather than relying on rule-based preprocessing. The toolkit returns text along with bounding box coordinates, enabling both full-text extraction and position-aware document understanding. Beyond character recognition, Surya includes a layout analysis module that identifies structural elements such as headers, paragraphs, tables, figures, lists, and captions. It handles complex layouts including multi-column pages, academic papers with equations, invoices with tabular data, and historical documents with non-standard typography.
Surya reports accuracy competitive with or better than commercial OCR services on many benchmarks while running locally with no cloud API calls, which makes it suitable for privacy-sensitive document processing. It is open source under the GPL-3.0 license, actively maintained, and provides a Python API and a command-line interface for batch processing.
Key applications include digitizing printed documents, extracting structured data from invoices and receipts, converting scanned books and academic papers to searchable text, processing legal and medical documents, archival preservation, and building document understanding pipelines for enterprise content management systems. Surya is particularly valued for its multilingual support, which covers Latin, Cyrillic, CJK, Arabic, Devanagari, and many other scripts.
Key Highlights
Support for 90+ Languages
Detects and recognizes text in over 90 languages, covering multilingual document processing needs.
Advanced Layout Analysis
Automatically detects document structure (columns, headings, paragraphs) and determines the correct reading order.
Table Detection and Extraction
Automatically detects tables in documents and extracts them as structured data.
High Speed with GPU
Processes large document collections quickly thanks to GPU optimization, which suits batch OCR workloads.
About
Surya OCR is a modern AI model developed for document-level multilingual optical character recognition, supporting over 90 languages with impressive accuracy across diverse document types. This high-performance model uses an encoder-decoder architecture based on the Donut framework, extracting image features with a Swin Transformer encoder and generating text with an mBART decoder. Unlike traditional OCR systems, Surya employs an end-to-end deep learning architecture that excels in complex document layouts where conventional rule-based approaches struggle significantly.
Surya's architecture includes a transformer-based text recognition module and an advanced layout analysis module that work in concert to understand document structure. The layout analysis automatically detects and classifies different elements in the document such as text blocks, tables, headers, footnotes, captions, and images with high precision. This provides users with rich structural information about the document, ensuring that text output faithfully reflects the original document format and reading order. Multi-column newspaper pages, complex nested table structures, interleaved lists, and mixed-layout academic papers are processed successfully with high fidelity. The line detection module can correctly identify and process skewed and rotated text segments.
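To make the reading-order problem concrete, here is a minimal sketch of a naive geometric heuristic for a page with at most two columns. It is not Surya's method (the layout module described above is a learned model); the `Block` type, the `reading_order` function, and the midline rule are all hypothetical and serve only to show why column structure must be known before lines can be ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A detected text block with its bounding box (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks, page_width):
    """Order blocks on a page with at most two columns (naive heuristic).

    Blocks that straddle the vertical midline are treated as full-width
    (titles, footers). They split the page into horizontal bands; inside
    each band the left column is read before the right column.
    """
    mid = page_width / 2.0
    ordered = []
    band = []  # column blocks seen since the last full-width block

    def flush():
        left = [b for b in band if (b.x0 + b.x1) / 2.0 < mid]
        right = [b for b in band if (b.x0 + b.x1) / 2.0 >= mid]
        ordered.extend(sorted(left, key=lambda b: b.y0))
        ordered.extend(sorted(right, key=lambda b: b.y0))
        band.clear()

    for block in sorted(blocks, key=lambda b: b.y0):
        if block.x0 < mid < block.x1:  # straddles the midline
            flush()
            ordered.append(block)
        else:
            band.append(block)
    flush()
    return ordered


page = [
    Block("Title", 50, 20, 550, 60),
    Block("L1", 50, 80, 290, 200),
    Block("R1", 310, 80, 550, 150),
    Block("R2", 310, 160, 550, 300),
    Block("L2", 50, 210, 290, 300),
    Block("Footer", 50, 320, 550, 350),
]
order = [b.text for b in reading_order(page, page_width=600)]
print(order)  # ['Title', 'L1', 'L2', 'R1', 'R2', 'Footer']
```

A plain top-to-bottom sort of the same blocks would interleave the columns (L1, R1, R2, L2). A fixed midline also fails on three-column or irregular pages, which is where a learned layout model has the advantage.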
The model can recognize multiple writing systems including Latin, Cyrillic, Arabic, Chinese, Japanese, Korean, and Indic scripts. This language and script coverage is an advantage in international document processing, archive digitization, and multilingual content management projects. Turkish text is recognized with high accuracy, including all special characters (ç, ğ, ı, ö, ş, ü), which makes the model suitable for Turkish-language document processing workflows. Historical Ottoman documents, which are written in Arabic script, depend instead on the model's Arabic-script support.
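Per-character accuracy claims like the one for Turkish are usually checked with a character error rate (CER). The sketch below shows that metric; it is not taken from Surya's benchmark code. The function names `levenshtein` and `cer` and the sample strings are illustrative assumptions. Both strings are NFC-normalized first so that a precomposed `ğ` and `g` plus a combining breve count as the same character.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical OCR output that drops Turkish diacritics: ı->i, ş->s, ğ->g.
reference = "ışık gözlüğü"
hypothesis = "isik gözlügü"
print(f"CER = {cer(reference, hypothesis):.3f}")  # CER = 0.333
```

Here four of the twelve reference characters are wrong, so the CER is 4/12. A system that handles the special characters correctly would score 0 on this pair.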
Surya OCR achieves competitive results on ICDAR benchmarks and demonstrates performance comparable to commercial solutions such as Google Cloud Vision and AWS Textract, while remaining free and open source without usage limits. Compared to traditional OCR tools like Tesseract, it performs notably better on low-resolution scans, degraded documents, and complex layouts; handwriting is outside its intended scope. It accepts PDFs, image files (JPEG, PNG, TIFF, WebP), and scanned documents.
Available as open source on GitHub, Surya OCR can be installed via pip and used programmatically through its Python API. The CLI tool supports batch processing for automatic digitization of large archives and document collections. It produces structured output in JSON and hOCR formats, which simplifies integration with search engines, document management systems, and downstream indexing and retrieval applications. Inference is fastest on GPU; CPU performance is adequate only for smaller workloads.
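As an illustration of how structured JSON output can feed a search index, the following sketch turns per-page OCR results into flat records. The schema is an assumption made for this example: the field names `page`, `text_lines`, `text`, `confidence`, and `bbox`, the sample document, and the `to_index_records` helper are hypothetical, so check the output of your installed Surya version before relying on any of them.

```python
import json

# Hypothetical OCR output: one document with one page and three lines.
SAMPLE = """
{
  "invoice_001": [
    {
      "page": 1,
      "text_lines": [
        {"text": "Invoice #1042", "confidence": 0.98, "bbox": [40, 30, 300, 60]},
        {"text": "Total: 250.00 EUR", "confidence": 0.95, "bbox": [40, 400, 320, 430]},
        {"text": "~~smudge~~", "confidence": 0.21, "bbox": [40, 700, 200, 720]}
      ]
    }
  ]
}
"""


def to_index_records(raw_json, min_confidence=0.5):
    """Flatten OCR JSON into one searchable record per page.

    Lines below `min_confidence` are dropped so that noise from stains
    or smudges does not pollute the index.
    """
    data = json.loads(raw_json)
    records = []
    for doc_id, pages in data.items():
        for page in pages:
            kept = [line for line in page["text_lines"]
                    if line["confidence"] >= min_confidence]
            records.append({
                "doc_id": doc_id,
                "page": page["page"],
                "text": "\n".join(line["text"] for line in kept),
                "boxes": [line["bbox"] for line in kept],
            })
    return records


records = to_index_records(SAMPLE)
for record in records:
    print(record["doc_id"], record["page"], repr(record["text"]))
```

Keeping the bounding boxes next to the text lets a search front end highlight hits on the original page image. The confidence threshold is a tuning choice, not a recommended value.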
Surya OCR suits document digitization, archive scanning, invoice processing, contract analysis, medical record transcription, legal document processing, and accessibility applications, and offers a free alternative for researchers, developers, and organizations that need document processing automation. Its active developer community and regular updates continue to improve recognition accuracy, language and script coverage, and processing speed.
Use Cases
Document Digitization
Converting paper documents, archives, and books to digital text format to make them searchable.
Academic Paper Processing
Digitizing academic papers with their structure preserved, using layout analysis and text extraction.
Invoice and Form Processing
Automating data entry by extracting table and form data from business documents.
Multilingual Content Processing
Meeting the needs of multilingual organizations by batch processing documents in different languages.
Pros & Cons
Pros
- Versatile document OCR toolkit supporting 90+ languages
- Line-level text detection, layout analysis, and reading order detection
- Structured data extraction with table recognition
- Faster and more accurate results compared to Tesseract
Cons
- Specialized for document OCR; weak on photos and natural scene text
- Handwritten text recognition not supported
- Falls behind some newer vision-language models in certain tests
- GPU recommended; processing is slow on CPU
Technical Details
Parameters
Unknown
Architecture
Vision Transformer
Training Data
Proprietary multilingual dataset
License
GPL-3.0
Features
- 90+ languages
- Layout analysis
- Table detection
- Reading order
- Fast
- Line-level detection
- GPU optimized
Benchmark Results
| Metric | Value | Compared To | Source |
|---|---|---|---|
| Accuracy (General Benchmark) | 93.2% (avg across scripts) | Tesseract: 80.1% | Surya GitHub Benchmarks |
| Supported Languages | 90+ languages and writing systems | PaddleOCR: 80+ languages | GitHub Repository |
| Line Detection F1 | 0.957 | DocTR: 0.921 | Surya Benchmark Suite |
| Processing Speed (A100) | ~200 ms/page (GPU) | PaddleOCR: ~150 ms/page | Surya GitHub Benchmarks |
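
The table figures are easier to compare once converted to throughput and relative error. The sketch below does that arithmetic, taking the listed values at face value. It assumes strictly sequential processing on a single GPU and ignores model loading, batching, and I/O, so the results are rough upper bounds rather than measured numbers. The helper names are illustrative.

```python
def pages_per_hour(ms_per_page):
    """Sequential throughput implied by a per-page latency in milliseconds."""
    return 3_600_000 / ms_per_page


def relative_error_reduction(acc_new, acc_old):
    """Fraction of the old system's errors removed by the new system.

    Accuracies are percentages, so the error rate is 100 minus accuracy.
    """
    err_new = 100.0 - acc_new
    err_old = 100.0 - acc_old
    return (err_old - err_new) / err_old


surya_pph = pages_per_hour(200)   # ~200 ms/page from the table
paddle_pph = pages_per_hour(150)  # ~150 ms/page from the table
reduction = relative_error_reduction(93.2, 80.1)

print(f"Surya:     {surya_pph:,.0f} pages/hour")
print(f"PaddleOCR: {paddle_pph:,.0f} pages/hour")
print(f"Error reduction vs Tesseract: {reduction:.1%}")
```

On these figures Surya processes about 18,000 pages per hour against roughly 24,000 for PaddleOCR, and it removes about two thirds of Tesseract's errors (19.9% error down to 6.8%).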