Segmentation Models

Explore the best AI models for segmentation

2 models found

Segment Anything 2 (SAM 2)

Meta|300M

Segment Anything 2 (SAM 2) is a universal segmentation model developed by Meta AI that unifies image and video segmentation within a single Transformer-based architecture enhanced with a streaming memory module. Building on the success of the original SAM, SAM 2 extends promptable segmentation to the video domain, allowing users to segment any object across video frames by providing simple prompts such as points, bounding boxes, or masks on a single frame. The model automatically propagates the segmentation through the entire video using its memory attention mechanism, which maintains temporal consistency even through occlusions and object reappearances.

With approximately 300 million parameters, SAM 2 achieves real-time performance while delivering state-of-the-art accuracy across diverse segmentation benchmarks. The architecture processes both images and individual video frames through a shared image encoder, making it versatile for static and dynamic content alike. SAM 2 was trained on the SA-V dataset, the largest video segmentation dataset to date, containing over 600,000 masklet annotations across 50,000 videos. Released under the Apache 2.0 license, the model is fully open source and available on GitHub with pre-trained weights.

It serves applications ranging from video editing and visual effects to autonomous driving perception, medical imaging, augmented reality, and robotics. Professional video editors, computer vision researchers, and developers building interactive segmentation tools rely on SAM 2 for its combination of accuracy, speed, and ease of use.

Open Source
4.8
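The prompt-and-propagate workflow described above (a single point on one frame selects a mask that is then carried through the video) can be sketched with a deliberately simplified toy: flood fill on integer label grids, with re-seeding from the previous mask standing in for SAM 2's memory attention. This is an illustration of the idea only, not the SAM 2 model or its API; all names and grids below are made up.

```python
from collections import deque

def segment_from_point(grid, point):
    """Toy 'point prompt': flood-fill the 4-connected region of `grid`
    containing `point`, returned as a set of (row, col) pixels."""
    rows, cols = len(grid), len(grid[0])
    r0, c0 = point
    target = grid[r0][c0]
    mask, frontier = {(r0, c0)}, deque([(r0, c0)])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and (nr, nc) not in mask and grid[nr][nc] == target:
                mask.add((nr, nc))
                frontier.append((nr, nc))
    return mask

def propagate(frames, point):
    """Toy propagation: segment the prompted object in frame 0, then
    re-seed each later frame from a pixel the previous mask still
    covers -- a crude stand-in for memory-guided tracking."""
    masks = [segment_from_point(frames[0], point)]
    obj = frames[0][point[0]][point[1]]
    for frame in frames[1:]:
        seed = next(((r, c) for r, c in masks[-1] if frame[r][c] == obj), None)
        masks.append(segment_from_point(frame, seed) if seed else set())
    return masks

# A two-frame "video" in which object 1 shifts one column to the right.
video = [
    [[0, 1, 1, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 0]],
    [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 0]],
]
masks = propagate(video, (0, 1))  # one click on frame 0 segments both frames
```

Note that the prompt is given only once: the second frame's mask comes entirely from the carried-over "memory" of the first, which is the property that makes single-frame prompting practical for whole videos.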

GroundingDINO

IDEA Research|172M

Grounding DINO is a powerful open-set object detection model developed by IDEA Research that locates and identifies any object in an image based on natural language text descriptions, representing a paradigm shift from fixed-category detection to language-guided visual understanding. With 172 million parameters, the model combines the DINO detection architecture with text grounding capabilities, enabling it to detect objects that were never seen during training simply by describing them in words. Unlike traditional object detectors trained on fixed categories like COCO's 80 classes, Grounding DINO can find arbitrary objects, parts, materials, or visual concepts by accepting free-form text queries such as "red shoes on the shelf" or "cracked window in the building."

The architecture fuses visual features from the image encoder with textual features from a text encoder through cross-modality attention layers, learning to align visual regions with their semantic descriptions. Grounding DINO achieves state-of-the-art results on zero-shot object detection benchmarks, and when combined with SAM (Segment Anything Model) it creates a powerful pipeline for text-prompted segmentation of any visual concept. Released under the Apache 2.0 license, the model is fully open source and widely used in computer vision research and production systems.

Key applications include automated image annotation and labeling, visual search engines, robotic manipulation systems that understand verbal commands, visual question answering pipelines, content moderation systems, accessibility tools that describe image contents, and custom quality inspection systems that can be configured with natural language descriptions of defects rather than extensive training data.

Open Source
4.6
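The open-set idea described above, matching free-form text queries against image regions rather than fixed class IDs, can be caricatured with a token-overlap matcher. The real model aligns visual and textual features through learned cross-modality attention, not word overlap, so treat this strictly as an illustration of the interface; the regions, boxes, and scores below are made up.

```python
def ground(query, regions, threshold=0.5):
    """Toy open-vocabulary 'grounding': score each candidate region's text
    description against a free-form query by the fraction of query words it
    contains, and return matches above `threshold`, best first. A crude
    stand-in for Grounding DINO's learned vision-language alignment."""
    query_words = set(query.lower().split())
    hits = []
    for box, description in regions:
        desc_words = set(description.lower().split())
        score = len(query_words & desc_words) / len(query_words)
        if score >= threshold:
            hits.append((box, score))
    return sorted(hits, key=lambda hit: -hit[1])

# Hypothetical candidate regions: (x0, y0, x1, y1) boxes with descriptions.
regions = [
    ((10, 20, 50, 80), "red shoes on a shelf"),
    ((60, 5, 120, 90), "blue ceramic vase"),
    ((0, 0, 200, 40), "cracked window in a wall"),
]

matches = ground("red shoes", regions)  # query is not a fixed category
```

A query like "red shoes" was never a training category, yet it selects the right box; a query with no match (say "green bicycle") simply returns nothing, which is the behavior that lets the same detector be reconfigured with plain language instead of retraining.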