
Segment Anything 2 (SAM 2)

Open Source
4.8
Meta

Segment Anything 2 (SAM 2) is a universal segmentation model developed by Meta AI that unifies image and video segmentation within a single Transformer-based architecture enhanced with a streaming memory module. Building on the groundbreaking success of the original SAM, SAM 2 extends promptable segmentation to the video domain, allowing users to segment any object across video frames by providing simple prompts such as points, bounding boxes, or masks on a single frame. The model automatically propagates the segmentation through the entire video using its memory attention mechanism, which maintains temporal consistency even through occlusions and object reappearances. With approximately 300 million parameters, SAM 2 achieves real-time performance while delivering state-of-the-art accuracy across diverse segmentation benchmarks. The architecture processes both images and individual video frames through a shared image encoder, making it versatile for static and dynamic content alike. SAM 2 was trained on the SA-V dataset, the largest video segmentation dataset to date, containing over 600,000 masklet annotations across approximately 51,000 videos. Released under the Apache 2.0 license, the model is fully open source and available on GitHub with pre-trained weights. It serves applications ranging from video editing and visual effects to autonomous driving perception, medical imaging, augmented reality, and robotics. Professional video editors, computer vision researchers, and developers building interactive segmentation tools rely on SAM 2 for its combination of accuracy, speed, and ease of use.

Segmentation

Key Highlights

Video Segmentation

Tracks objects throughout video with a single prompt, automatically creating segmentation masks across all frames

Real-Time Performance

Efficient architecture optimized to perform video segmentation at real-time speed

Memory Mechanism

Maintains information from previous frames in memory for consistent object tracking and re-detection during occlusion

Universal Segmentation

A single unified model for both images and video that can segment any object without additional training

About

Segment Anything Model 2 (SAM 2) is a universal segmentation model developed by Meta AI. Building on the major success of the original SAM, SAM 2 offers both image and video segmentation in a single unified architecture. It can precisely isolate objects in any image or video from simple prompts such as points, boxes, or masks. This unified approach represents a significant advance in segmentation technology, bridging the gap between static images and dynamic video content.

SAM 2's most revolutionary innovation is its video segmentation support. When you mark an object in a video frame, the model automatically tracks that object throughout the entire video and produces consistent segmentation masks in every frame. This capability is a game-changer for video editing, object tracking, and augmented reality applications that previously required manual frame-by-frame annotation. The model successfully handles challenging video scenarios such as object occlusion, size changes, rapid motion, and scene transitions with remarkable robustness. Its memory mechanism preserves information from previous frames to ensure consistent tracking even in long videos with complex motion patterns, and can re-identify objects even after they temporarily leave the field of view.
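The memory mechanism described above can be pictured as a small bank that always retains the prompted frame while keeping a rolling window of recent frames. The following is a toy sketch of that idea in plain Python; class and method names are illustrative, not SAM 2's actual implementation, which stores learned spatial features and object pointers rather than raw frames.

```python
from collections import deque

class MemoryBank:
    """Toy sketch of SAM 2's frame memory: always keep the prompted
    frame, plus a rolling window of the most recent frames, and expose
    both as context when segmenting a new frame. Names and sizes are
    illustrative only."""

    def __init__(self, max_recent=6):
        self.prompt_memory = None               # features from the prompted frame
        self.recent = deque(maxlen=max_recent)  # bounded window of recent frames

    def add_prompt_frame(self, features):
        self.prompt_memory = features

    def add_frame(self, features):
        self.recent.append(features)            # oldest entry is evicted automatically

    def context(self):
        # The real model cross-attends over stored features; here we just collect them.
        mems = [] if self.prompt_memory is None else [self.prompt_memory]
        return mems + list(self.recent)

bank = MemoryBank(max_recent=2)
bank.add_prompt_frame("frame0")
for f in ["frame1", "frame2", "frame3"]:
    bank.add_frame(f)
print(bank.context())  # prompted frame survives; recent window rolls forward
```

Keeping the prompted frame permanently in memory is what lets a tracker of this shape re-anchor on the original target after an occlusion, while the bounded recent window keeps cost flat for long videos.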

The model's promptable design makes it extremely flexible and intuitive to use. Segmentation can be initiated by clicking a point, drawing a box, or supplying a mask. It can segment object types it has never seen in training data through zero-shot generalization, making it applicable to virtually any domain. It has been successfully applied in specialized fields such as medical imaging, satellite photography, industrial quality control, and robotics. Additionally, it can track multiple objects simultaneously, enabling analysis of complex multi-object scenes and monitoring of interactions between different elements in dynamic environments.
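The prompt types mentioned above (clicked points with foreground/background labels, and bounding boxes) can be sketched as simple data structures. This is a conceptual illustration only; the field names are assumptions and do not match the library's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Conceptual sketch of promptable-segmentation inputs. Field names are
# illustrative, not the sam2 package's real interface.

@dataclass
class PointPrompt:
    coords: List[Tuple[float, float]]  # (x, y) click positions
    labels: List[int]                  # 1 = foreground click, 0 = background click

@dataclass
class BoxPrompt:
    box: Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max

def describe(prompt) -> str:
    """Summarize a prompt, standing in for the model's prompt encoder."""
    if isinstance(prompt, PointPrompt):
        fg = sum(prompt.labels)
        return f"{len(prompt.coords)} point(s), {fg} foreground"
    if isinstance(prompt, BoxPrompt):
        return "box prompt"
    raise TypeError("unsupported prompt type")

print(describe(PointPrompt(coords=[(120.0, 80.0)], labels=[1])))
print(describe(BoxPrompt(box=(10.0, 20.0, 200.0, 160.0))))
```

Background clicks (label 0) are what let a user carve away regions the model wrongly included, which is why points carry labels rather than being positions alone.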

Architecturally, SAM 2 extends the original SAM's image encoder with a memory module and temporal attention mechanism designed for video understanding. Its streaming architecture processes video frames sequentially while preserving information from previous frames through a memory bank that efficiently stores and retrieves relevant context. This design enables real-time video segmentation while keeping memory usage bounded even for extended video sequences. The hierarchical image encoder captures features at multiple scales, supporting accurate segmentation of both small and large objects across varying resolutions. The attention mechanism models temporal dependencies to maintain consistency throughout the video duration.
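The streaming flow described above follows a fixed per-frame order: encode the frame once, condition on the memory bank, predict a mask, then write the result back to memory. A minimal sketch of that loop, with stand-in functions rather than SAM 2's real modules:

```python
# Minimal sketch of the streaming video loop: encode -> attend to memory
# -> predict mask -> update memory. All functions are illustrative
# stand-ins, not SAM 2 code.

def encode(frame):
    # Real model: hierarchical image encoder, run once per frame.
    return f"feat({frame})"

def segment(features, memory):
    # Real model: memory attention + mask decoder. Here: a string record
    # showing how much context was available.
    return f"mask[{features} | {len(memory)} mems]"

def update_memory(memory, features, mask, max_len=3):
    # Bounded memory is what keeps long videos cheap to process.
    memory.append((features, mask))
    return memory[-max_len:]

memory = []
masks = []
for frame in ["f0", "f1", "f2", "f3"]:
    feats = encode(frame)
    mask = segment(feats, memory)
    memory = update_memory(memory, feats, mask)
    masks.append(mask)

print(masks[-1])  # the final frame is segmented with a full memory window
```

Because each frame is encoded exactly once and memory is bounded, per-frame cost stays constant regardless of video length, which is the property that makes real-time streaming segmentation feasible.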

Trained on the SA-V (Segment Anything Video) dataset, SAM 2 utilizes over 600,000 masklet annotations across approximately 51,000 real-world videos spanning diverse scenarios. This comprehensive training data enables the model to generalize robustly across diverse video scenarios including sports, nature, urban environments, and indoor settings. The training process relies on a unified strategy across both image and video data, allowing the model to achieve high performance in both modalities without compromising quality in either domain. In image segmentation, it produces results comparable to or better than the original SAM, while in video segmentation it clearly surpasses previous specialized methods.

Released as open source under the Apache 2.0 license, SAM 2 is accessible via Hugging Face and GitHub for immediate use. It is being rapidly adapted to various applications by the research community worldwide. SAM 2-based solutions are being developed across a wide range of use cases including video editing software, autonomous driving systems, robotics applications, medical video analysis, and interactive media tools. The model is establishing the next-generation segmentation standard in computer vision, creating a transformative impact from industrial applications to creative workflows across the technology landscape.

Use Cases

1

Video Object Tracking

Selecting objects in video and performing automatic tracking and segmentation throughout the entire video

2

Video Editing

Precise segmentation for masking, replacing, or applying effects to specific objects within video

3

Autonomous Driving Perception

Environmental perception system with real-time video segmentation of vehicles, pedestrians, and road elements

4

Sports Analysis

Performance analysis and tactical visualization through player and ball tracking in sports videos

Pros & Cons

Pros

  • Meta's universal model that can segment any object
  • Combines both image and video segmentation in a single model
  • Zero-shot performance — segment new object types without training
  • Real-time interactive segmentation support
  • Trained on SA-V dataset — 51K+ videos, 600K+ masklet annotations

Cons

  • No semantic understanding — segments without knowing what it is
  • Errors can still occur at fine and complex boundaries
  • Drift issues in long-term tracking in video segmentation
  • High GPU requirement — 16GB+ VRAM for large models

Technical Details

Parameters

300M

Architecture

Transformer + Streaming Memory

Training Data

SA-V dataset

License

Apache 2.0

Features

  • Video Segmentation
  • Real-Time Tracking
  • Memory Bank
  • Occlusion Handling
  • Promptable Interface
  • Unified Image-Video Model

Benchmark Results

Metric | Value | Notes | Source
DAVIS 2017 J&F | 90.7% | +2.6% over previous best | SAM 2 Paper (Meta AI)
Video Processing Speed (Hiera-B+) | 43.8 FPS | On A100 GPU | SAM 2 Paper / Ultralytics Docs
Speed vs SAM | 6x faster | vs. original SAM | Meta AI SAM 2 Announcement
Training Dataset (SA-V) | 51K+ videos, 600K+ masklet annotations | | SAM 2 Paper (Meta AI)

Available Platforms

GitHub
HuggingFace
PyPI


Related Models


GroundingDINO

IDEA Research | 172M

Grounding DINO is a powerful open-set object detection model developed by IDEA Research that locates and identifies any object in an image based on natural language text descriptions, representing a paradigm shift from fixed-category detection to language-guided visual understanding. With 172 million parameters, the model combines the DINO detection architecture with text grounding capabilities, enabling it to detect objects that were never seen during training simply by describing them in words. Unlike traditional object detectors trained on fixed categories like COCO's 80 classes, Grounding DINO can find arbitrary objects, parts, materials, or visual concepts by accepting free-form text queries such as 'red shoes on the shelf' or 'cracked window in the building.' The architecture fuses visual features from the image encoder with textual features from a text encoder through cross-modality attention layers, learning to align visual regions with their semantic descriptions. Grounding DINO achieves state-of-the-art results on zero-shot object detection benchmarks and when combined with SAM (Segment Anything Model) creates a powerful pipeline for text-prompted segmentation of any visual concept. Released under the Apache 2.0 license, the model is fully open source and widely used in computer vision research and production systems. Key applications include automated image annotation and labeling, visual search engines, robotic manipulation systems that understand verbal commands, visual question answering pipelines, content moderation systems, accessibility tools that describe image contents, and custom quality inspection systems that can be configured with natural language descriptions of defects rather than extensive training data.

Open Source
4.6

Quick Info

Parameters: 300M
Type: Transformer
License: Apache 2.0
Released: 2024-07
Architecture: Transformer + Streaming Memory
Version: 2.0
Rating: 4.8 / 5
Creator: Meta

Tags

segmentation
meta
video
real-time