
Segment Anything 2 (SAM 2)

Open Source
4.8
Meta

Segment Anything 2 (SAM 2) is a universal segmentation model developed by Meta AI that unifies image and video segmentation within a single Transformer-based architecture enhanced with a streaming memory module. Building on the groundbreaking success of the original SAM, SAM 2 extends promptable segmentation to the video domain, allowing users to segment any object across video frames by providing simple prompts such as points, bounding boxes, or masks on a single frame. The model automatically propagates the segmentation through the entire video using its memory attention mechanism, which maintains temporal consistency even through occlusions and object reappearances. With approximately 300 million parameters, SAM 2 achieves real-time performance while delivering state-of-the-art accuracy across diverse segmentation benchmarks. The architecture processes both images and individual video frames through a shared image encoder, making it versatile for static and dynamic content alike. SAM 2 was trained on the SA-V dataset, the largest video segmentation dataset to date, containing over 600,000 masklet annotations across approximately 51,000 videos. Released under the Apache 2.0 license, the model is fully open source and available on GitHub with pre-trained weights. It serves applications ranging from video editing and visual effects to autonomous driving perception, medical imaging, augmented reality, and robotics. Professional video editors, computer vision researchers, and developers building interactive segmentation tools rely on SAM 2 for its combination of accuracy, speed, and ease of use.

Segmentation

Key Highlights

Video Segmentation

Tracks objects throughout video with a single prompt, automatically creating segmentation masks across all frames

Real-Time Performance

Efficient architecture optimized to perform video segmentation at real-time speed

Memory Mechanism

Maintains information from previous frames in memory for consistent object tracking and re-detection during occlusion

Universal Segmentation

A single unified model for both images and video that can segment any object without additional training

About

Segment Anything Model 2 (SAM 2) is a universal segmentation model developed by Meta AI. Building on the major success of the original SAM, SAM 2 offers both image and video segmentation in a single unified architecture. It can precisely isolate objects in any image or video from simple prompts such as points, boxes, or masks. This unified approach represents a significant advance in segmentation technology, bridging the gap between static images and dynamic video content.

SAM 2's most revolutionary innovation is its video segmentation support. When you mark an object in a video frame, the model automatically tracks that object throughout the entire video and produces consistent segmentation masks in every frame. This capability is a game-changer for video editing, object tracking, and augmented reality applications that previously required manual frame-by-frame annotation. The model successfully handles challenging video scenarios such as object occlusion, size changes, rapid motion, and scene transitions with remarkable robustness. Its memory mechanism preserves information from previous frames to ensure consistent tracking even in long videos with complex motion patterns, and can re-identify objects even after they temporarily leave the field of view.
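The memory mechanism described above can be pictured as a small bank that always retains the prompted frame while keeping a rolling window of recent frames. The following is a toy sketch of that idea in plain Python; class and method names are illustrative, not SAM 2's actual implementation, which stores learned spatial features and object pointers rather than raw frames.

```python
from collections import deque

class MemoryBank:
    """Toy sketch of SAM 2's frame memory: always keep the prompted
    frame, plus a rolling window of the most recent frames, and expose
    both as context when segmenting a new frame. Names and sizes are
    illustrative only."""

    def __init__(self, max_recent=6):
        self.prompt_memory = None               # features from the prompted frame
        self.recent = deque(maxlen=max_recent)  # bounded window of recent frames

    def add_prompt_frame(self, features):
        self.prompt_memory = features

    def add_frame(self, features):
        self.recent.append(features)            # oldest entry is evicted automatically

    def context(self):
        # The real model cross-attends over stored features; here we just collect them.
        mems = [] if self.prompt_memory is None else [self.prompt_memory]
        return mems + list(self.recent)

bank = MemoryBank(max_recent=2)
bank.add_prompt_frame("frame0")
for f in ["frame1", "frame2", "frame3"]:
    bank.add_frame(f)
print(bank.context())  # prompted frame survives; recent window rolls forward
```

Keeping the prompted frame permanently in memory is what lets a tracker of this shape re-anchor on the original target after an occlusion, while the bounded recent window keeps cost flat for long videos.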

The model's promptable design makes it extremely flexible and intuitive to use. Segmentation can be initiated by clicking a point, drawing a box, or supplying a mask. It can segment object types it has never seen in training data through zero-shot generalization, making it applicable to virtually any domain. It has been successfully applied in specialized fields such as medical imaging, satellite photography, industrial quality control, and robotics. Additionally, it can track multiple objects simultaneously, enabling analysis of complex multi-object scenes and monitoring of interactions between different elements in dynamic environments.
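The prompt types mentioned above (clicked points with foreground/background labels, and bounding boxes) can be sketched as simple data structures. This is a conceptual illustration only; the field names are assumptions and do not match the library's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Conceptual sketch of promptable-segmentation inputs. Field names are
# illustrative, not the sam2 package's real interface.

@dataclass
class PointPrompt:
    coords: List[Tuple[float, float]]  # (x, y) click positions
    labels: List[int]                  # 1 = foreground click, 0 = background click

@dataclass
class BoxPrompt:
    box: Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max

def describe(prompt) -> str:
    """Summarize a prompt, standing in for the model's prompt encoder."""
    if isinstance(prompt, PointPrompt):
        fg = sum(prompt.labels)
        return f"{len(prompt.coords)} point(s), {fg} foreground"
    if isinstance(prompt, BoxPrompt):
        return "box prompt"
    raise TypeError("unsupported prompt type")

print(describe(PointPrompt(coords=[(120.0, 80.0)], labels=[1])))
print(describe(BoxPrompt(box=(10.0, 20.0, 200.0, 160.0))))
```

Background clicks (label 0) are what let a user carve away regions the model wrongly included, which is why points carry labels rather than being positions alone.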

Architecturally, SAM 2 extends the original SAM's image encoder with a memory module and temporal attention mechanism designed for video understanding. Its streaming architecture processes video frames sequentially while preserving information from previous frames through a memory bank that efficiently stores and retrieves relevant context. This design enables real-time video segmentation while keeping memory usage bounded even for extended video sequences. The hierarchical image encoder captures features at multiple scales, supporting accurate segmentation of both small and large objects across varying resolutions. The attention mechanism models temporal dependencies to maintain consistency throughout the video duration.
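The streaming flow described above follows a fixed per-frame order: encode the frame once, condition on the memory bank, predict a mask, then write the result back to memory. A minimal sketch of that loop, with stand-in functions rather than SAM 2's real modules:

```python
# Minimal sketch of the streaming video loop: encode -> attend to memory
# -> predict mask -> update memory. All functions are illustrative
# stand-ins, not SAM 2 code.

def encode(frame):
    # Real model: hierarchical image encoder, run once per frame.
    return f"feat({frame})"

def segment(features, memory):
    # Real model: memory attention + mask decoder. Here: a string record
    # showing how much context was available.
    return f"mask[{features} | {len(memory)} mems]"

def update_memory(memory, features, mask, max_len=3):
    # Bounded memory is what keeps long videos cheap to process.
    memory.append((features, mask))
    return memory[-max_len:]

memory = []
masks = []
for frame in ["f0", "f1", "f2", "f3"]:
    feats = encode(frame)
    mask = segment(feats, memory)
    memory = update_memory(memory, feats, mask)
    masks.append(mask)

print(masks[-1])  # the final frame is segmented with a full memory window
```

Because each frame is encoded exactly once and memory is bounded, per-frame cost stays constant regardless of video length, which is the property that makes real-time streaming segmentation feasible.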

Trained on the SA-V (Segment Anything Video) dataset, SAM 2 utilizes over 600,000 masklet annotations across approximately 51,000 real-world videos spanning diverse scenarios. This comprehensive training data enables the model to generalize robustly across diverse video scenarios including sports, nature, urban environments, and indoor settings. The training process relies on a unified strategy across both image and video data, allowing the model to achieve high performance in both modalities without compromising quality in either domain. In image segmentation, it produces results comparable to or better than the original SAM, while in video segmentation it clearly surpasses previous specialized methods.

Released as open source under the Apache 2.0 license, SAM 2 is accessible via Hugging Face and GitHub for immediate use. It is being rapidly adapted to various applications by the research community worldwide. SAM 2-based solutions are being developed across a wide range of use cases including video editing software, autonomous driving systems, robotics applications, medical video analysis, and interactive media tools. The model is establishing the next-generation segmentation standard in computer vision, creating a transformative impact from industrial applications to creative workflows across the technology landscape.

Use Cases

1

Video Object Tracking

Selecting objects in video and performing automatic tracking and segmentation throughout the entire video

2

Video Editing

Precise segmentation for masking, replacing, or applying effects to specific objects within video

3

Autonomous Driving Perception

Environmental perception system with real-time video segmentation of vehicles, pedestrians, and road elements

4

Sports Analysis

Performance analysis and tactical visualization through player and ball tracking in sports videos

Pros & Cons

Pros

  • Meta's universal model that can segment any object
  • Combines both image and video segmentation in a single model
  • Zero-shot performance — segment new object types without training
  • Real-time interactive segmentation support
  • Trained on SA-V dataset — 51K+ videos, 600K+ masklet annotations

Cons

  • No semantic understanding — segments without knowing what it is
  • Errors can still occur at fine and complex boundaries
  • Drift issues in long-term tracking in video segmentation
  • High GPU requirement — 16GB+ VRAM for large models

Technical Details

Parameters

300M

Architecture

Transformer + Streaming Memory

Training Data

SA-V dataset

License

Apache 2.0

Features

  • Video Segmentation
  • Real-Time Tracking
  • Memory Bank
  • Occlusion Handling
  • Promptable Interface
  • Unified Image-Video Model

Benchmark Results

Metric | Value | Notes | Source
DAVIS 2017 J&F | 90.7% | +2.6% over previous best | SAM 2 Paper (Meta AI)
Video Processing Speed (Hiera-B+) | 43.8 FPS | On A100 GPU | SAM 2 Paper / Ultralytics Docs
Speed vs SAM | 6x faster | vs. original SAM | Meta AI SAM 2 Announcement
Training Dataset (SA-V) | 51K+ videos, 600K+ masklet annotations | | SAM 2 Paper (Meta AI)

Available Platforms

GitHub
HuggingFace
PyPI


Related Models


GroundingDINO

IDEA Research | 172M

Grounding DINO is a powerful open-set object detection model developed by IDEA Research that locates and identifies any object in an image based on natural language text descriptions, representing a paradigm shift from fixed-category detection to language-guided visual understanding. With 172 million parameters, the model combines the DINO detection architecture with text grounding capabilities, enabling it to detect objects that were never seen during training simply by describing them in words. Unlike traditional object detectors trained on fixed categories like COCO's 80 classes, Grounding DINO can find arbitrary objects, parts, materials, or visual concepts by accepting free-form text queries such as 'red shoes on the shelf' or 'cracked window in the building.' The architecture fuses visual features from the image encoder with textual features from a text encoder through cross-modality attention layers, learning to align visual regions with their semantic descriptions. Grounding DINO achieves state-of-the-art results on zero-shot object detection benchmarks and when combined with SAM (Segment Anything Model) creates a powerful pipeline for text-prompted segmentation of any visual concept. Released under the Apache 2.0 license, the model is fully open source and widely used in computer vision research and production systems. Key applications include automated image annotation and labeling, visual search engines, robotic manipulation systems that understand verbal commands, visual question answering pipelines, content moderation systems, accessibility tools that describe image contents, and custom quality inspection systems that can be configured with natural language descriptions of defects rather than extensive training data.

Open Source
4.6

Quick Info

Parameters: 300M
Type: Transformer
License: Apache 2.0
Released: 2024-07
Architecture: Transformer + Streaming Memory
Version: 2.0
Rating: 4.8 / 5
Creator: Meta

Tags

segmentation
meta
video
real-time