Surya OCR

Open Source
4.5
VikParuchuri

Surya OCR is an AI-powered optical character recognition toolkit developed by Vik Paruchuri that supports over 90 languages across diverse document types. Built on a Vision Transformer architecture inspired by the Donut framework, Surya uses an encoder-decoder model for recognition, paired with line-level text detection. It returns text content along with bounding box coordinates, enabling both full-text extraction and position-aware document understanding. A layout analysis module identifies structural elements such as headers, paragraphs, tables, figures, lists, and captions. The model handles complex layouts including multi-column pages, academic papers with equations, invoices with tabular data, and historical documents with non-standard typography.

Surya reports accuracy competitive with, and on some benchmarks better than, commercial OCR services while running locally with no cloud API calls, which makes it suitable for privacy-sensitive document processing. It is released under the GPL-3.0 license, is actively maintained, and provides a Python API and a command-line interface for batch processing.

Key applications include digitizing printed documents, extracting structured data from invoices and receipts, converting scanned books and academic papers to searchable text, processing legal and medical documents, archival preservation, and building document understanding pipelines for enterprise content management systems. Surya is particularly valued for its multilingual support covering Latin, Cyrillic, CJK, Arabic, Devanagari, and many other scripts.

OCR

Key Highlights

Support for 90+ Languages

Detects and recognizes text in over 90 languages, covering multilingual document processing needs.

Advanced Layout Analysis

Automatically detects document structure (columns, headings, paragraphs) and determines the correct reading order.

Table Detection and Extraction

Automatically detects tables in documents and extracts them as structured data.

High Speed with GPU

Processes large document collections quickly in batch mode thanks to GPU optimization.

About

Surya OCR is a model for document-level multilingual optical character recognition, supporting over 90 languages across diverse document types. It uses an encoder-decoder architecture based on the Donut framework: a Swin Transformer encoder extracts image features and an mBART decoder generates text. Unlike traditional rule-based OCR systems, this learned approach holds up on complex document layouts.

Surya's architecture includes a transformer-based text recognition module and an advanced layout analysis module that work in concert to understand document structure. The layout analysis automatically detects and classifies different elements in the document such as text blocks, tables, headers, footnotes, captions, and images with high precision. This provides users with rich structural information about the document, ensuring that text output faithfully reflects the original document format and reading order. Multi-column newspaper pages, complex nested table structures, interleaved lists, and mixed-layout academic papers are processed successfully with high fidelity. The line detection module can correctly identify and process skewed and rotated text segments.
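To make the idea of reading order on a multi-column page concrete, here is a minimal sketch of a purely geometric heuristic: assign each box to a column by its horizontal center, then read each column top to bottom. This is not Surya's method, which is a learned model; the function name `naive_reading_order`, the fixed column count, and the box format are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, origin at the top-left.
Box = Tuple[float, float, float, float]


def naive_reading_order(boxes: List[Box], page_width: float, n_columns: int = 2) -> List[Box]:
    """Order boxes column by column (left to right), top to bottom within a column."""
    column_width = page_width / n_columns

    def sort_key(box: Box) -> Tuple[int, float, float]:
        x0, y0, x1, _ = box
        center_x = (x0 + x1) / 2
        # Clamp so a box touching the right edge still lands in the last column.
        column = min(int(center_x // column_width), n_columns - 1)
        return (column, y0, x0)

    return sorted(boxes, key=sort_key)


page = [
    (320, 50, 580, 90),    # right column, first block
    (20, 200, 280, 240),   # left column, second block
    (20, 50, 280, 90),     # left column, first block
    (320, 200, 580, 240),  # right column, second block
]
ordered = naive_reading_order(page, page_width=600)
for box in ordered:
    print(box)
```

A heuristic like this breaks on full-width headings, uneven columns, and nested tables, which is where a learned layout and reading-order model is expected to help.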

The model recognizes multiple writing systems, including Latin, Cyrillic, Arabic, Chinese, Japanese, Korean, and Indic scripts. This breadth is an advantage in international document processing, archive digitization, and multilingual content management. Turkish text is recognized accurately, including the special characters ç, ğ, ı, ö, ş, and ü, which makes Surya dependable for Turkish-language document workflows. Historical Ottoman documents, which are written in Arabic script, would instead rely on its Arabic-script support.

Surya OCR reports competitive results on ICDAR benchmarks and performance comparable to commercial solutions such as Google Cloud Vision and AWS Textract, while remaining free and open source with no usage limits. Compared to traditional OCR tools like Tesseract, it performs notably better on low-resolution scans, degraded documents, and complex layouts. It is specialized for printed documents, so handwriting and natural scene text are outside its strengths. It accepts PDFs, image files (JPEG, PNG, TIFF, WebP), and scanned documents.
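Accuracy comparisons between OCR engines are often scored with character error rate (CER): the edit distance between the recognized text and a ground-truth transcription, divided by the length of the ground truth. The text above does not say which metric the cited benchmarks use, so the following is only a sketch of one common choice, with hypothetical function names.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, after NFC normalization."""
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# An engine that drops Turkish diacritics gets three characters wrong here.
cer = character_error_rate("ağaç gölgesi", "agac golgesi")
print(f"CER: {cer:.2f}")
```

The NFC normalization step matters for scripts with diacritics: without it, a precomposed "ğ" and a "g" followed by a combining breve would count as different strings.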

Available as open source on GitHub, Surya OCR is installed via pip and used programmatically through its Python API. The CLI tool supports batch processing for digitizing large archives and document collections. It produces structured output in JSON and hOCR formats, which eases integration with search engines, document management systems, and downstream indexing and retrieval applications. Inference is fast on GPU; CPU inference is slower but workable for smaller workloads.
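As an illustration of feeding JSON output into a search index, the sketch below flattens per-page text lines into plain text and drops low-confidence lines. The JSON shape and the field names (`page`, `text_lines`, `text`, `bbox`, `confidence`) are assumptions made for this example, as is the helper `pages_to_text`; check the output of the installed version before relying on them.

```python
import json
from typing import Dict, List

# Hypothetical sample shaped like a per-document list of pages; not real output.
SAMPLE_RESULTS = """
{
  "invoice_001": [
    {
      "page": 1,
      "text_lines": [
        {"text": "Invoice #1042", "bbox": [40, 30, 300, 60], "confidence": 0.99},
        {"text": "Total: 118.00", "bbox": [40, 400, 260, 430], "confidence": 0.97},
        {"text": "~~~", "bbox": [500, 780, 540, 800], "confidence": 0.21}
      ]
    }
  ]
}
"""


def pages_to_text(results: Dict[str, list], min_confidence: float = 0.5) -> Dict[str, List[str]]:
    """Return, per document, one plain-text string per page, skipping low-confidence lines."""
    output: Dict[str, List[str]] = {}
    for doc_name, pages in results.items():
        page_texts = []
        for page in sorted(pages, key=lambda p: p["page"]):
            lines = [
                line["text"]
                for line in page["text_lines"]
                # Lines without a confidence field are kept.
                if line.get("confidence", 1.0) >= min_confidence
            ]
            page_texts.append("\n".join(lines))
        output[doc_name] = page_texts
    return output


indexed = pages_to_text(json.loads(SAMPLE_RESULTS))
print(indexed["invoice_001"][0])
```

Keeping the bounding boxes alongside the text, rather than discarding them as this sketch does, would allow a search interface to highlight hits on the original page image.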

Surya OCR suits document digitization, archive scanning, invoice processing, contract analysis, medical record transcription, legal document processing, and accessibility applications. It gives researchers, developers, and organizations a free alternative for document processing automation. An active developer community and regular updates continue to improve recognition accuracy, language and script coverage, and processing speed for production deployments.

Use Cases

1. Document Digitization

Converting paper documents, archives, and books into searchable digital text.

2. Academic Paper Processing

Extracting text from academic papers while preserving their structure through layout analysis.

3. Invoice and Form Processing

Automating data entry by extracting table and form data from business documents.

4. Multilingual Content Processing

Batch processing documents in different languages for multilingual organizations.

Pros & Cons

Pros

  • Versatile document OCR toolkit supporting 90+ languages
  • Line-level text detection, layout analysis, and reading order detection
  • Structured data extraction with table recognition
  • Faster and more accurate results compared to Tesseract

Cons

  • Specialized for document OCR; weak on photos and natural scene text
  • Handwritten text recognition not supported
  • Falls behind some newer vision-language models in certain tests
  • Effectively requires a GPU; processing on CPU is slow

Technical Details

Parameters

Unknown

Architecture

Vision Transformer

Training Data

Proprietary multilingual dataset

License

GPL-3.0

Features

  • 90+ languages
  • Layout analysis
  • Table detection
  • Reading order
  • Fast
  • Line-level detection
  • GPU optimized

Benchmark Results

Metric | Value | Compared To | Source
Accuracy (General Benchmark) | 93.2% (avg across scripts) | Tesseract: 80.1% | Surya GitHub Benchmarks
Supported Languages | 90+ languages and scripts | PaddleOCR: 80+ languages | GitHub Repository
Line Detection F1 | 0.957 | DocTR: 0.921 | Surya Benchmark Suite
Processing Speed (A100) | ~200 ms/page (GPU) | PaddleOCR: ~150 ms/page | Surya GitHub Benchmarks
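The figures in the table can be turned into more practical quantities with simple arithmetic. The sketch below converts per-page latency into pages per hour and compares error rates. It assumes a single GPU processing pages back to back with no I/O overhead, and it treats "accuracy" as 100% minus the error rate, which the table does not define, so the results are rough estimates only.

```python
def pages_per_hour(ms_per_page: float) -> float:
    """Throughput for one worker processing pages sequentially."""
    return 3_600_000 / ms_per_page


def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Fraction of the baseline's errors removed, with accuracies given in percent."""
    baseline_error = 100 - baseline_accuracy
    new_error = 100 - new_accuracy
    return (baseline_error - new_error) / baseline_error


surya_rate = pages_per_hour(200)    # ~200 ms/page from the table
paddle_rate = pages_per_hour(150)   # ~150 ms/page from the table
reduction = relative_error_reduction(80.1, 93.2)

print(f"Surya: {surya_rate:.0f} pages/hour, PaddleOCR: {paddle_rate:.0f} pages/hour")
print(f"Error reduction vs. Tesseract: {reduction:.1%}")
```

Under these assumptions, ~200 ms/page works out to about 18,000 pages per hour, and moving from 80.1% to 93.2% accuracy removes roughly two thirds of the errors.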

Available Platforms

GitHub
PyPI

Related Models

PaddleOCR

Baidu | 15M parameters

PaddleOCR is a comprehensive optical character recognition system developed by Baidu on the PaddlePaddle deep learning framework, supporting over 80 languages with industry-grade accuracy and speed. The latest PP-OCRv4 architecture employs a three-stage pipeline consisting of text detection, direction classification, and text recognition, each optimized independently for maximum performance. With approximately 15 million parameters in its lightweight configuration, PaddleOCR achieves an exceptional balance between accuracy and inference speed, running efficiently on both server GPUs and edge devices including mobile phones and embedded systems. The system excels at recognizing text in complex real-world scenarios including curved text, rotated text, dense multi-line layouts, and text overlaid on textured backgrounds. PaddleOCR supports Latin, Chinese, Japanese, Korean, Arabic, Cyrillic, and dozens of other scripts with dedicated recognition models for each language family. Beyond basic OCR, the toolkit includes document structure analysis for extracting tables, headers, and paragraphs from scanned documents, as well as key information extraction capabilities for invoices, receipts, and forms. Released under the Apache 2.0 license, PaddleOCR is fully open source and has become one of the most starred OCR repositories on GitHub. It provides pre-trained models, training scripts, and deployment tools for ONNX, TensorRT, and OpenVINO formats. Common applications include document digitization, license plate recognition, receipt processing, handwriting recognition, and industrial text inspection in manufacturing quality control.

Open Source
4.6

Quick Info

Parameters: Unknown
Type: Transformer
License: GPL-3.0
Released: 2024-01
Architecture: Vision Transformer
Rating: 4.5 / 5
Creator: VikParuchuri

Tags

ocr
document
layout
multilingual