Segment Anything (SAM)
Segment Anything Model (SAM) is Meta AI's foundation model for promptable image segmentation, designed to segment any object in any image from input prompts such as points, bounding boxes, and rough masks (text prompts were explored in the paper but not included in the released model). Released in April 2023 alongside the SA-1B dataset containing over 1.1 billion masks from 11 million images, SAM serves as a general-purpose segmentation model that handles diverse tasks without task-specific fine-tuning. The architecture consists of three components: a Vision Transformer image encoder that turns input images into embeddings, a flexible prompt encoder handling the different prompt types, and a lightweight mask decoder that produces segmentation masks in real time. SAM's zero-shot transfer capability means it can segment objects never seen during training, making it applicable across visual domains from medical imaging to satellite photography to creative content editing. The model supports automatic mask generation for segmenting everything in an image, interactive point-based segmentation for precise object selection, and box-prompted segmentation for region targeting. SAM has spawned derivative works including SAM 2 with video support, EfficientSAM for edge deployment, and FastSAM for faster inference. Practical applications span background removal, medical image annotation, autonomous driving perception, agricultural monitoring, GIS mapping, and interactive editing tools. SAM is fully open source under the Apache 2.0 license, with PyTorch implementations, model weights, and the dataset freely available through Meta's repositories. It has become one of the most influential computer vision models, fundamentally changing how segmentation tasks are approached across industries.
Key Highlights
Universal Segmentation
Zero-shot, task-agnostic segmentation capability that can segment any object in any image
Promptable Interface
Offers a flexible, user-friendly segmentation experience through point clicks, box drawing, or rough mask input
Real-Time Mask Generation
Generates segmentation masks within milliseconds for each new prompt after the image is processed once
Massive Training Data
Trained on over 1.1 billion masks across 11 million images, achieving success across virtually any visual domain
About
Segment Anything Model (SAM) is a foundational AI model for image segmentation developed by Meta AI Research, released in April 2023. SAM introduced the concept of a promptable segmentation system that can segment any object in any image using points, boxes, or masks as input prompts. This zero-shot capability marked a paradigm shift in computer vision, similar to what large language models achieved for natural language processing, establishing SAM as one of the most influential recent publications in computer vision research.
SAM was trained on the SA-1B dataset, the largest segmentation dataset ever created, containing over 1.1 billion masks across 11 million images—a scale that surpasses all previous segmentation datasets by orders of magnitude. The training data was collected through an innovative data engine that combined model-assisted annotation with human verification in an iterative loop. This massive scale of training data enables SAM to generalize across virtually any visual domain without task-specific fine-tuning, allowing the model to successfully segment a vast variety of objects, textures, and scene types never encountered during training. The iterative design of the data engine continuously improved model quality, producing better annotations and stronger generalization capabilities with each cycle.
Architecturally, SAM consists of three components: an image encoder based on a Vision Transformer (ViT) pretrained with MAE, a flexible prompt encoder that handles point, box, and mask inputs (with text prompting explored experimentally), and a lightweight mask decoder that produces segmentation masks in real time. The image encoder processes each image only once, after which multiple prompts can generate different masks nearly instantaneously, making interactive segmentation practical for real-world applications. Available in three sizes—ViT-H, ViT-L, and ViT-B—the model offers flexibility to balance performance against computational cost depending on application requirements and hardware configurations available for deployment.
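The encode-once, prompt-many design described above can be illustrated with a minimal schematic. This is a NumPy stand-in, not the real `segment_anything` API: the class, the fake "embedding", and the circle-shaped decoder are all hypothetical, chosen only to show why repeated prompting stays cheap once the heavy encoder has run.

```python
import numpy as np

class TinyPromptableSegmenter:
    """Schematic of SAM's split: a heavy image encoder runs once per image,
    and a cheap mask decoder runs once per prompt against the cached embedding."""

    def __init__(self):
        self.embedding = None
        self.encoder_calls = 0  # tracks cost: should stay at 1 per image

    def set_image(self, image: np.ndarray) -> None:
        # Stand-in for the ViT encoder: the one expensive pass per image.
        self.encoder_calls += 1
        self.embedding = image.astype(np.float32).mean(axis=2)  # fake embedding

    def predict(self, point_xy: tuple, radius: int = 10) -> np.ndarray:
        # Stand-in for the lightweight decoder: cheap, executed per prompt.
        h, w = self.embedding.shape
        ys, xs = np.mgrid[0:h, 0:w]
        x, y = point_xy
        return (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2  # boolean mask

seg = TinyPromptableSegmenter()
seg.set_image(np.zeros((64, 64, 3), dtype=np.uint8))
masks = [seg.predict(p) for p in [(10, 10), (32, 32), (50, 20)]]
print(seg.encoder_calls)  # the encoder ran once despite three prompts
```

In the real library the same shape appears as `SamPredictor.set_image(...)` followed by repeated `predict(...)` calls, which is what makes interactive, click-driven segmentation feel instantaneous.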
One of SAM's most notable features is its ambiguity awareness. When a single prompt could correspond to multiple valid segmentation possibilities, the model produces multiple mask proposals and provides a confidence score for each. This feature makes it easier to handle challenging situations such as overlapping objects, complex scenes, and ambiguous boundaries where human annotators would also disagree. Additionally, SAM can segment all objects in an image without any prompts in its automatic mask generation mode, making it invaluable for exhaustive scene analysis, dataset creation, and comprehensive visual understanding tasks.
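When SAM returns multiple proposals for an ambiguous prompt, downstream code usually keeps the highest-scoring one. The sketch below shows only that selection logic; the mask and score arrays are fabricated stand-ins for what SAM's multimask output (e.g. whole object vs. part vs. sub-part) would provide.

```python
import numpy as np

def pick_best_mask(masks: np.ndarray, scores: np.ndarray):
    """Given N candidate masks (N, H, W) and N confidence scores,
    return the highest-scoring proposal — a common way to resolve
    whole/part/sub-part ambiguity when only one mask is needed."""
    best = int(np.argmax(scores))
    return masks[best], float(scores[best])

# Fabricated stand-ins for three proposals from one point click
# (e.g. whole person vs. shirt vs. shirt pocket).
masks = np.stack([
    np.ones((4, 4), bool),             # whole object
    np.pad(np.ones((2, 2), bool), 1),  # a part
    np.zeros((4, 4), bool),            # a sub-part (empty here)
])
scores = np.array([0.93, 0.81, 0.55])

best_mask, best_score = pick_best_mask(masks, scores)
print(best_score)  # 0.93
```

Interactive tools often do the opposite and show all proposals, letting the user pick; the per-proposal confidence scores support either workflow.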
SAM has been widely adopted across industries including medical imaging, autonomous driving, agriculture, satellite imagery analysis, augmented reality, and creative applications. In the medical field, it has been fine-tuned for tasks such as tumor segmentation, organ delineation, and cell counting with impressive domain-specific accuracy. In agriculture, it is applied for plant disease detection and crop analysis, while in remote sensing it handles building and road segmentation with remarkable precision. In robotics, it serves as a foundation model for object grasping and scene understanding, and in autonomous vehicles it is adapted for road element and obstacle segmentation.
Its open-source release under Apache 2.0 license has spawned an extensive ecosystem of derivative works, fine-tuned variants, and integrated applications. Available on GitHub, Hugging Face, and through various cloud platforms, SAM is one of the most accessible and impactful computer vision models ever released. The research community has published hundreds of papers building on SAM and developed numerous derivative projects extending the model's capabilities. Lightweight variants such as FastSAM, MobileSAM, and EfficientSAM have made the model deployable on mobile and edge devices, bringing powerful segmentation to resource-constrained environments.
Use Cases
Medical Imaging
Medical research and diagnostic support for organ and lesion segmentation in X-ray, MRI, and CT scans
E-Commerce Image Processing
Object isolation, background removal, and automatic product masking from product images
Autonomous Driving
Driver assistance systems for real-time segmentation of roads, vehicles, pedestrians, and traffic signs
Creative Design
Precise segmentation for extracting objects from photos, creating compositions, and image editing workflows
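Several of the use cases above (e-commerce isolation, creative cutouts) boil down to turning a predicted mask into a transparent image. A minimal NumPy sketch of that final step; in practice the boolean mask would come from a segmentation model rather than being hand-written:

```python
import numpy as np

def mask_to_cutout(rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Turn an (H, W, 3) uint8 image and an (H, W) boolean mask into
    an (H, W, 4) RGBA cutout: background pixels become fully transparent."""
    alpha = (mask.astype(np.uint8) * 255)[..., None]  # 255 = keep, 0 = drop
    return np.concatenate([rgb, alpha], axis=2)

rgb = np.full((2, 2, 3), 200, dtype=np.uint8)          # tiny dummy image
mask = np.array([[True, False], [False, True]])        # dummy foreground mask
cutout = mask_to_cutout(rgb, mask)
print(cutout[0, 0, 3], cutout[0, 1, 3])  # 255 0
```

The RGBA array can then be written out as a transparent PNG with any imaging library.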
Pros & Cons
Pros
- Zero-shot capability works on a wide variety of images right out of the box (SAM 2 extends this to video)
- Dramatically reduces time and cost for data annotation and rapid prototyping of vision applications
- Delivers high-quality masks for common objects and scenes; fast decoder enables interactive applications
- Cross-domain generalizability without extensive retraining; backed by Meta with strong community evolution (SAM → SAM 2 → SAM 3)
Cons
- Lacks semantic understanding; it segments but doesn't classify, so other models are needed to identify what has been segmented
- Poor performance on specialized data (medical, industrial defects) without fine-tuning
- Low-quality prompts can yield masks biased toward the background or confined to parts of the target object
- Complex scenes require more manual prompts with prior knowledge, potentially degrading user experience
- SAM 2: loses track in extended sequences, confuses similar objects in crowds, and degrades with multiple simultaneous objects
Technical Details
Parameters
636M
Architecture
ViT-based image encoder + prompt encoder + lightweight mask decoder
Training Data
SA-1B dataset (11M images, 1.1B masks, largest segmentation dataset)
License
Apache 2.0
Features
- Zero-Shot Segmentation
- Point Prompt
- Box Prompt
- Text Prompt (exploratory, not in the released model)
- Real-Time Inference
- SA-1B Dataset
Benchmark Results
| Metric | Value | Compared To | Source |
|---|---|---|---|
| Training Dataset | SA-1B: 11M images, 1.1B masks | — | Meta AI / SAM Paper (ICCV 2023) |
| Mask Quality (IoU > 90% vs. expert annotation) | 94% of masks | — | SAM Paper (ICCV 2023) |
| Mask Quality (IoU > 75% vs. expert annotation) | 97% of masks | — | SAM Paper (ICCV 2023) |
| Zero-Shot Single-Point Segmentation | Best on 16 of 23 datasets | RITM (interactive baseline) | SAM Paper (ICCV 2023) |
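The IoU thresholds in the table measure how closely a predicted mask overlaps a reference mask. The metric itself is simple to compute; an illustrative implementation:

```python
import numpy as np

def mask_iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks:
    |pred AND ref| / |pred OR ref|."""
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return float(inter / union) if union else 1.0  # two empty masks agree

pred = np.array([[1, 1, 0], [1, 0, 0]], bool)
ref  = np.array([[1, 1, 0], [0, 0, 0]], bool)
print(mask_iou(pred, ref))  # 2 / 3 ≈ 0.667
```

An "IoU > 90%" row then means that for the stated fraction of masks, this score exceeded 0.9 against the reference annotation.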
Related Models
RemBG
RemBG is a popular open-source tool developed by Daniel Gatis for automatic background removal from images, providing a simple and efficient solution for isolating foreground subjects without manual selection or professional editing skills. The tool leverages multiple pre-trained segmentation models including U2-Net, IS-Net, SAM, and specialized variants optimized for different use cases such as general objects, human subjects, anime characters, and clothing items. RemBG processes images through semantic segmentation to identify foreground elements and generates precise alpha matte masks that cleanly separate subjects from backgrounds, producing transparent PNG outputs ready for immediate use. The tool excels at handling complex edge cases including wispy hair, translucent fabrics, intricate jewelry, and objects with irregular boundaries. RemBG is available as a Python library via pip, a command-line interface for batch processing, and through API integrations for production deployment. It processes images locally without sending data to external servers, making it suitable for privacy-sensitive applications. Common use cases include e-commerce product photography, social media content creation, passport photo processing, graphic design compositing, real estate photography, and marketing materials. The tool supports JPEG, PNG, and WebP formats and handles both single images and batch directory operations. RemBG has become one of the most starred background removal repositories on GitHub with millions of downloads, and its models are integrated into numerous other AI tools. Released under the MIT license, it provides a free and commercially viable alternative to paid background removal services.
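RemBG's Python API centers on a single `remove()` call (bytes in, transparent PNG bytes out). A sketch of a small batch script under that assumption — it requires `pip install rembg` to actually run, the directory layout is hypothetical, and the file-filtering helper is plain stdlib:

```python
from pathlib import Path

SUPPORTED = {".jpg", ".jpeg", ".png", ".webp"}  # formats the tool handles

def list_images(folder: str) -> list:
    """Collect supported image files from a directory (non-recursive)."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.suffix.lower() in SUPPORTED)

def strip_backgrounds(folder: str, out_folder: str) -> None:
    """Run RemBG over every image in `folder` (needs `pip install rembg`)."""
    from rembg import remove  # bytes in, PNG bytes with alpha channel out
    out = Path(out_folder)
    out.mkdir(exist_ok=True)
    for img in list_images(folder):
        (out / f"{img.stem}.png").write_bytes(remove(img.read_bytes()))
```

For one-off files, RemBG also ships a CLI (`rembg i input.png output.png`), which the paragraph above refers to as its batch-processing interface.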
BRIA RMBG
BRIA RMBG is a state-of-the-art background removal model developed by BRIA AI, an Israeli startup specializing in responsible and commercially licensed generative AI. The model delivers exceptional accuracy in separating foreground subjects from backgrounds, handling complex scenarios including fine hair details, transparent objects, intricate edges, smoke, and glass with remarkable precision. BRIA RMBG is built on a proprietary architecture trained on exclusively licensed and ethically sourced data, ensuring full commercial safety and IP compliance that distinguishes it from models trained on scraped internet data. It produces high-quality alpha mattes preserving fine edge details and natural transparency gradients for clean cutouts suitable for professional workflows. Available in versions including RMBG 1.4 and RMBG 2.0, the model consistently ranks among top performers on background removal benchmarks including the DIS5K and HRS10K datasets. BRIA RMBG is accessible through Hugging Face under BRIA's source-available license (free for non-commercial use, with commercial use requiring an agreement with BRIA), and through BRIA's commercial API for scalable cloud processing. Integration options include a Python SDK, REST API, and compatibility with popular image processing pipelines. Applications span e-commerce product photography, graphic design compositing, video conferencing virtual backgrounds, automotive and real estate photography, social media content creation, and document digitization. The model processes images in milliseconds on modern GPUs, suitable for real-time and high-volume batch processing. BRIA RMBG has established itself as one of the most commercially trusted and technically advanced background removal solutions available.
BiRefNet
BiRefNet (Bilateral Reference Network) is an advanced open-source segmentation model developed by ZhengPeng7 for high-resolution dichotomous image segmentation, precisely separating foreground objects from backgrounds with pixel-level accuracy at fine structural details. The model introduces a bilateral reference framework leveraging both global semantic information and local detail features through a dual-branch architecture, enabling superior edge quality compared to traditional segmentation approaches. BiRefNet processes images through a backbone encoder to extract multi-scale features, then applies bilateral reference modules that cross-reference global context with local boundary information to produce crisp segmentation masks with clean edges around complex structures like hair strands, lace patterns, chain links, and transparent materials. The model achieves state-of-the-art results on multiple benchmarks including DIS5K, demonstrating strength in handling objects with intricate boundaries that challenge conventional models. BiRefNet has gained significant popularity as a background removal solution due to its exceptional edge quality, outperforming many dedicated background removal tools on challenging images. It supports high-resolution input processing and produces alpha mattes suitable for professional compositing. Available through Hugging Face with multiple model variants optimized for different quality-speed tradeoffs, BiRefNet integrates easily into Python-based pipelines and has been adopted by several popular AI platforms. Common applications include precision background removal for product photography, fine-grained object isolation for graphic design, medical image segmentation, and creating high-quality cutouts for visual effects. Released under an open-source license, BiRefNet provides a free and technically sophisticated alternative to commercial segmentation services.
MODNet
MODNet (Matting Objective Decomposition Network) is an open-source portrait matting model developed by ZHKKKe, designed for real-time human portrait background removal without requiring a pre-defined trimap or additional user input. Unlike traditional matting approaches needing manually drawn trimaps, MODNet achieves fully automatic portrait matting by decomposing the complex matting objective into three sub-tasks: semantic estimation for identifying the person region, detail prediction for refining edge quality around hair and clothing boundaries, and semantic-detail fusion for combining both signals into a high-quality alpha matte. This decomposition enables efficient single-pass inference at real-time speeds, making it practical for video conferencing, live streaming, and mobile photography where latency is critical. The model produces smooth and accurate alpha mattes with particular strength in handling hair strands, fabric edges, and other fine boundary details challenging for segmentation-based approaches. MODNet supports both image and video input with temporal consistency optimizations for stable video matting without flickering. The model is lightweight enough for mobile devices and edge hardware, with ONNX export supporting deployment across iOS, Android, and web browsers through WebAssembly. Common applications include video call background replacement, portrait mode photography, social media content creation, virtual try-on systems, and film post-production green screen alternatives. Released under Apache 2.0, MODNet provides a free and efficient solution widely adopted in both research and production portrait matting applications.
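An alpha matte differs from a binary mask in that it takes fractional values on soft boundaries like hair. Compositing a matted subject over a new background follows the standard matting equation I = αF + (1 − α)B. A minimal NumPy illustration with toy values (not MODNet output):

```python
import numpy as np

def composite(alpha: np.ndarray, fg: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Blend foreground over background with a fractional alpha matte:
    I = alpha * F + (1 - alpha) * B, applied per pixel."""
    a = alpha[..., None]  # broadcast the (H, W) matte over the RGB channels
    return (a * fg + (1.0 - a) * bg).astype(np.uint8)

fg = np.full((1, 3, 3), 255, np.float32)  # white subject
bg = np.zeros((1, 3, 3), np.float32)      # black replacement background
alpha = np.array([[1.0, 0.5, 0.0]])       # solid, hair-like edge, background
print(composite(alpha, fg, bg)[0, :, 0])  # [255 127 0]
```

The 0.5 pixel blends evenly between subject and background, which is exactly the smooth edge behavior that trimap-free matting models like MODNet are trained to predict.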