
GroundingDINO

Open Source
4.6
IDEA Research

Grounding DINO is a powerful open-set object detection model developed by IDEA Research that locates and identifies any object in an image based on natural language text descriptions, representing a paradigm shift from fixed-category detection to language-guided visual understanding. With 172 million parameters, the model combines the DINO detection architecture with text grounding capabilities, enabling it to detect objects that were never seen during training simply by describing them in words.

Unlike traditional object detectors trained on fixed categories like COCO's 80 classes, Grounding DINO can find arbitrary objects, parts, materials, or visual concepts by accepting free-form text queries such as 'red shoes on the shelf' or 'cracked window in the building.' The architecture fuses visual features from the image encoder with textual features from a text encoder through cross-modality attention layers, learning to align visual regions with their semantic descriptions.

Grounding DINO achieves state-of-the-art results on zero-shot object detection benchmarks, and when combined with SAM (Segment Anything Model) it creates a powerful pipeline for text-prompted segmentation of any visual concept. Released under the Apache 2.0 license, the model is fully open source and widely used in computer vision research and production systems. Key applications include automated image annotation and labeling, visual search engines, robotic manipulation systems that understand verbal commands, visual question answering pipelines, content moderation systems, accessibility tools that describe image contents, and custom quality inspection systems that can be configured with natural language descriptions of defects rather than extensive training data.

Object Detection
Segmentation

Key Highlights

Text-Based Object Detection

Open-set object detection technology that can detect any object using natural language descriptions.

SAM Integration

Grounded-SAM pipeline that combines the model with the Segment Anything Model for text-guided, pixel-level segmentation.

Zero-Shot Detection

Detects object categories never seen during training, with no additional training required.

Simultaneous Multi-Object Detection

Detects multiple different objects in a single pass from one combined text prompt, enabling efficient batch processing.

About

Grounding DINO is a powerful AI model developed for text-based open-set object detection, representing a fundamental paradigm shift in how computer vision systems understand and locate objects in images. Developed by IDEA Research, it is not constrained to the fixed class vocabularies of traditional object detection models: it can take any text description as input and detect and locate the matching objects in an image with high precision and recall. This open-vocabulary approach is widely regarded as a transformative innovation in visual understanding and reasoning.

At the foundation of Grounding DINO lies a sophisticated architecture based on the combination of the DINO (DETR with Improved deNOising anchOr boxes) detector and natural language processing modules working in tight integration. Built on a Swin Transformer backbone, the model processes visual and linguistic information together through cross-attention mechanisms that enable deep multimodal reasoning. This enables object detection with detailed natural language descriptions such as "woman in red dress" or "coffee cup on the table," going beyond simple object class names to support relational and descriptive queries that capture spatial relationships, attributes, and contextual information.
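In practice, such descriptive queries are usually passed to the model as a single caption of lowercase phrases separated by periods (the convention used by the official repository and the Hugging Face integration). A minimal sketch of that formatting step, with a hypothetical helper name:

```python
def build_caption(phrases):
    """Join free-form phrases into a single Grounding DINO text prompt.

    The model is typically prompted with lowercase phrases separated by
    " . " and a trailing period, e.g. "woman in red dress . coffee cup .".
    (Helper name and exact formatting are illustrative, not an official API.)
    """
    cleaned = [p.strip().lower().rstrip(".") for p in phrases if p.strip()]
    return " . ".join(cleaned) + " ."

print(build_caption(["Woman in red dress", "coffee cup on the table"]))
# -> woman in red dress . coffee cup on the table .
```

Feeding several phrases in one caption is also what enables the multi-object batch detection highlighted above: every phrase in the prompt becomes a detection target in a single forward pass.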

Achieving strong results in zero-shot object detection on COCO and LVIS benchmarks, Grounding DINO can find any object expressed in natural language without being bound to predefined class lists. This flexibility provides significant time savings in applications such as automated data labeling, content moderation, and visual search systems. Unlike traditional detection models, it requires no retraining for new object categories, dramatically reducing deployment time and cost for new use cases and domains. Domain-specific performance can be further enhanced through fine-tuning on custom datasets.

When combined with Segment Anything (SAM), the model enables text-based segmentation through the powerful combination known as Grounded-SAM. This collaboration creates a comprehensive workflow for autonomous driving perception, robotic manipulation planning, image editing, automated data labeling, and video analytics applications. Users can perform text-based object detection followed by pixel-level segmentation, using these segmentations for downstream tasks including inpainting, tracking, and scene understanding.
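The Grounded-SAM workflow is a simple two-stage composition: text prompt to boxes (Grounding DINO), then boxes to masks (SAM's box-prompt mode). The sketch below shows only that composition; `detect` and `segment` are placeholder callables standing in for real model wrappers, not actual library APIs:

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]          # (x0, y0, x1, y1) in pixels
Mask = List[List[bool]]                          # per-pixel boolean mask

def grounded_segmentation(
    image,
    prompt: str,
    detect: Callable[[object, str], List[Box]],  # e.g. a Grounding DINO wrapper
    segment: Callable[[object, Box], Mask],      # e.g. a SAM box-prompt wrapper
) -> List[Tuple[Box, Mask]]:
    """Text -> boxes -> masks: the Grounded-SAM pattern.

    Each detected box is passed to the segmenter as a prompt, yielding one
    pixel-level mask per detection.
    """
    boxes = detect(image, prompt)
    return [(box, segment(image, box)) for box in boxes]

# Stub models standing in for Grounding DINO and SAM:
fake_detect = lambda img, prompt: [(10.0, 10.0, 50.0, 50.0)]
fake_segment = lambda img, box: [[True]]

results = grounded_segmentation(None, "red shoes .", fake_detect, fake_segment)
print(len(results))  # one (box, mask) pair per detection
```

The resulting masks are what downstream tasks such as inpainting or tracking consume.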

Grounding DINO possesses zero-shot detection capability and can recognize entirely new object categories without any specialized training. This feature provides critical advantages in industrial applications such as quality control in rapidly changing production lines, retail shelf analysis and compliance, security camera footage evaluation, agricultural crop monitoring, and medical image screening where target object categories may not be known in advance. Object counting and localization accuracy is comparable to many fully supervised models despite requiring no task-specific training.
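Counting and localization reduce to simple post-processing of the model's raw predictions: Grounding DINO emits boxes in normalized center format (cx, cy, w, h) with per-box confidence scores, which are converted to pixel coordinates and filtered by a threshold (the official demo uses a box threshold around 0.35). A self-contained sketch of those two steps:

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box, Grounding DINO's raw output
    format, to absolute (x0, y0, x1, y1) pixel coordinates."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )

def count_objects(scores, threshold=0.35):
    """Count detections above a confidence threshold (0.35 mirrors the
    default box threshold used in the official demo scripts)."""
    return sum(1 for s in scores if s >= threshold)

boxes = [(0.5, 0.5, 0.5, 0.25), (0.1, 0.1, 0.05, 0.05)]
scores = [0.82, 0.21]
print(count_objects(scores))                     # 1
print(cxcywh_to_xyxy(boxes[0], 640, 480))        # (160.0, 180.0, 480.0, 300.0)
```

Because the same thresholding applies to any phrase, a counting system can be retargeted to a new object category by changing only the text prompt.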

Available as open source on GitHub, the model can be easily installed via pip and used through its comprehensive Python API with detailed documentation. Model weights are downloadable from Hugging Face and can be converted to ONNX format for deployment across different hardware platforms and configurations. TensorRT can accelerate GPU inference for production deployments. Serving a broad range of applications including autonomous vehicles, robotics, video analytics, content moderation, medical imaging, agricultural monitoring, and accessibility applications, Grounding DINO stands as one of the most versatile and impactful models in the computer vision ecosystem, continuously evolving through active community contributions and ongoing research.

Use Cases

1

Automatic Data Labeling

Reducing manual labeling time by automatically labeling objects in machine learning datasets.

2

Smart Image Editing

Performing automatic editing and manipulation by selecting specific objects in images with text.

3

Robotic Vision Systems

Enabling robots to recognize and interact with objects in their environment through natural language commands.

4

Content Moderation

Automatically detecting and filtering inappropriate content on social media and web platforms.

Pros & Cons

Pros

  • Achieves 52.5 AP on COCO detection zero-shot transfer without any COCO training data
  • Open-set detection: can localize any user-specified phrase including natural language queries zero-shot
  • Tight cross-modality fusion with feature enhancer, language-guided query selection, and cross-modality decoder
  • Sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP across diverse domains

Cons

  • Early fusion architecture can increase model hallucinations, predicting objects not present in images
  • Performs worse than GLIP on rare and uncommon object categories due to less pretraining data
  • Produces only bounding boxes without instance segmentation masks
  • Heavy computational feature enhancer makes it impractical for real-time edge applications
  • Can inherit biases from underlying language models and web-sourced training data

Technical Details

Parameters

172M

Architecture

DINO + Text Grounding

Training Data

O365, GoldG, Cap4M

License

Apache 2.0

Features

  • Open-set detection
  • Text-prompted
  • SAM compatible
  • Zero-shot
  • Multi-object detection
  • Bounding box output

Benchmark Results

Metric | Value | Compared To | Source
Zero-Shot AP (COCO val2017) | 52.5 | GLIP-L: 49.8 | Grounding DINO Paper (ECCV 2024)
Zero-Shot AP (LVIS minival) | 27.4 | GLIP-L: 26.9 | Grounding DINO Paper
Processing Speed (A100) | ~12 FPS (Swin-T backbone) | GLIP: ~8 FPS | GitHub Repository
Parameter Count | 172M (Swin-T), 341M (Swin-B) | — | Hugging Face Model Card

Available Platforms

GitHub
HuggingFace


Related Models


YOLOv10

Tsinghua University | 8M–68M parameters

YOLOv10 is the tenth major iteration of the YOLO (You Only Look Once) real-time object detection series, developed by researchers at Tsinghua University. The model introduces a fundamentally redesigned NMS-free (Non-Maximum Suppression free) architecture that eliminates the post-processing bottleneck present in all previous YOLO versions, enabling true end-to-end object detection with consistent latency. YOLOv10 employs a dual-assignment training strategy that combines one-to-many and one-to-one label assignments during training, achieving rich supervision signals while maintaining efficient inference without redundant predictions. Built on a CSPNet backbone with enhanced feature aggregation, the model comes in six scale variants ranging from Nano (8M parameters) to Extra-Large (68M parameters), allowing deployment across edge devices, mobile platforms, and high-performance servers. Each variant is optimized for its target hardware profile, delivering the best accuracy-latency trade-off in its class. YOLOv10 achieves state-of-the-art performance on the COCO benchmark, outperforming previous YOLO versions and competing models like RT-DETR with significantly lower computational cost. Released under the AGPL-3.0 license, the model is open source and integrates seamlessly with the Ultralytics ecosystem for training, validation, and deployment. Common applications include autonomous driving perception, industrial quality inspection, security surveillance, retail analytics, robotics, and drone-based monitoring. The model supports ONNX and TensorRT export for optimized production deployment.

Open Source
4.7

Quick Info

Parameters: 172M
Type: Transformer
License: Apache 2.0
Released: 2023-03
Architecture: DINO + Text Grounding
Rating: 4.6 / 5
Creator: IDEA Research


Tags

detection
grounding
text-prompted
open-set