Segment Anything (SAM)
Segment Anything Model (SAM) is Meta AI's foundation model for promptable image segmentation, designed to segment any object in any image from input prompts such as points, bounding boxes, and rough masks (text prompts were explored in the paper but not included in the released model). Released in April 2023 alongside the SA-1B dataset containing over 1.1 billion masks from 11 million images, SAM serves as a general-purpose segmentation model that handles diverse tasks without task-specific fine-tuning. The architecture consists of three components: a Vision Transformer image encoder that turns input images into embeddings, a flexible prompt encoder handling the different prompt types, and a lightweight mask decoder that produces segmentation masks in real time. SAM's zero-shot transfer capability means it can segment objects never seen during training, making it applicable across visual domains from medical imaging to satellite photography to creative content editing. The model supports automatic mask generation for segmenting everything in an image, interactive point-based segmentation for precise object selection, and box-prompted segmentation for region targeting. SAM has spawned derivative works including SAM 2 with video support, EfficientSAM for edge deployment, and FastSAM for faster inference. Practical applications span background removal, medical image annotation, autonomous driving perception, agricultural monitoring, GIS mapping, and interactive editing tools. SAM is fully open source under the Apache 2.0 license, with PyTorch implementations, model weights, and the dataset freely available through Meta's repositories. It has become one of the most influential computer vision models, fundamentally changing how segmentation tasks are approached across industries.
Key Highlights
Universal Segmentation
Zero-shot, task-agnostic segmentation capability that can segment any object in any image
Promptable Interface
Offers a flexible, user-friendly segmentation experience through point clicks, box drawing, or rough mask input
Real-Time Mask Generation
Generates segmentation masks within milliseconds for each new prompt after the image is processed once
Massive Training Data
Trained on over 1.1 billion masks across 11 million images, achieving success across virtually any visual domain
About
Segment Anything Model (SAM) is a foundational AI model for image segmentation developed by Meta AI Research, released in April 2023. SAM introduced the concept of a promptable segmentation system that can segment any object in any image using points, boxes, or masks as input prompts. This zero-shot capability marked a paradigm shift in computer vision, similar to what large language models achieved for natural language processing, establishing SAM as one of the most influential recent publications in computer vision research.
SAM was trained on the SA-1B dataset, the largest segmentation dataset ever created, containing over 1.1 billion masks across 11 million images—a scale that surpasses all previous segmentation datasets by orders of magnitude. The training data was collected through an innovative data engine that combined model-assisted annotation with human verification in an iterative loop. This massive scale of training data enables SAM to generalize across virtually any visual domain without task-specific fine-tuning, allowing the model to successfully segment a vast variety of objects, textures, and scene types never encountered during training. The iterative design of the data engine continuously improved model quality, producing better annotations and stronger generalization capabilities with each cycle.
Architecturally, SAM consists of three components: an image encoder based on a Vision Transformer (ViT) pretrained with MAE, a flexible prompt encoder that handles point, box, and mask inputs (with text prompting explored experimentally), and a lightweight mask decoder that produces segmentation masks in real time. The image encoder processes each image only once, after which multiple prompts can generate different masks nearly instantaneously, making interactive segmentation practical for real-world applications. Available in three sizes—ViT-H, ViT-L, and ViT-B—the model offers flexibility to balance performance against computational cost depending on application requirements and hardware configurations available for deployment.
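The encode-once, prompt-many design described above can be illustrated with a minimal schematic. This is a NumPy stand-in, not the real `segment_anything` API: the class, the fake "embedding", and the circle-shaped decoder are all hypothetical, chosen only to show why repeated prompting stays cheap once the heavy encoder has run.

```python
import numpy as np

class TinyPromptableSegmenter:
    """Schematic of SAM's split: a heavy image encoder runs once per image,
    and a cheap mask decoder runs once per prompt against the cached embedding."""

    def __init__(self):
        self.embedding = None
        self.encoder_calls = 0  # tracks cost: should stay at 1 per image

    def set_image(self, image: np.ndarray) -> None:
        # Stand-in for the ViT encoder: the one expensive pass per image.
        self.encoder_calls += 1
        self.embedding = image.astype(np.float32).mean(axis=2)  # fake embedding

    def predict(self, point_xy: tuple, radius: int = 10) -> np.ndarray:
        # Stand-in for the lightweight decoder: cheap, executed per prompt.
        h, w = self.embedding.shape
        ys, xs = np.mgrid[0:h, 0:w]
        x, y = point_xy
        return (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2  # boolean mask

seg = TinyPromptableSegmenter()
seg.set_image(np.zeros((64, 64, 3), dtype=np.uint8))
masks = [seg.predict(p) for p in [(10, 10), (32, 32), (50, 20)]]
print(seg.encoder_calls)  # the encoder ran once despite three prompts
```

In the real library the same shape appears as `SamPredictor.set_image(...)` followed by repeated `predict(...)` calls, which is what makes interactive, click-driven segmentation feel instantaneous.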
One of SAM's most notable features is its ambiguity awareness. When a single prompt could correspond to multiple valid segmentation possibilities, the model produces multiple mask proposals and provides a confidence score for each. This feature makes it easier to handle challenging situations such as overlapping objects, complex scenes, and ambiguous boundaries where human annotators would also disagree. Additionally, SAM can segment all objects in an image without any prompts in its automatic mask generation mode, making it invaluable for exhaustive scene analysis, dataset creation, and comprehensive visual understanding tasks.
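When SAM returns multiple proposals for an ambiguous prompt, downstream code usually keeps the highest-scoring one. The sketch below shows only that selection logic; the mask and score arrays are fabricated stand-ins for what SAM's multimask output (e.g. whole object vs. part vs. sub-part) would provide.

```python
import numpy as np

def pick_best_mask(masks: np.ndarray, scores: np.ndarray):
    """Given N candidate masks (N, H, W) and N confidence scores,
    return the highest-scoring proposal — a common way to resolve
    whole/part/sub-part ambiguity when only one mask is needed."""
    best = int(np.argmax(scores))
    return masks[best], float(scores[best])

# Fabricated stand-ins for three proposals from one point click
# (e.g. whole person vs. shirt vs. shirt pocket).
masks = np.stack([
    np.ones((4, 4), bool),             # whole object
    np.pad(np.ones((2, 2), bool), 1),  # a part
    np.zeros((4, 4), bool),            # a sub-part (empty here)
])
scores = np.array([0.93, 0.81, 0.55])

best_mask, best_score = pick_best_mask(masks, scores)
print(best_score)  # 0.93
```

Interactive tools often do the opposite and show all proposals, letting the user pick; the per-proposal confidence scores support either workflow.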
SAM has been widely adopted across industries including medical imaging, autonomous driving, agriculture, satellite imagery analysis, augmented reality, and creative applications. In the medical field, it has been fine-tuned for tasks such as tumor segmentation, organ delineation, and cell counting with impressive domain-specific accuracy. In agriculture, it is applied for plant disease detection and crop analysis, while in remote sensing it handles building and road segmentation with remarkable precision. In robotics, it serves as a foundation model for object grasping and scene understanding, and in autonomous vehicles it is adapted for road element and obstacle segmentation.
Its open-source release under Apache 2.0 license has spawned an extensive ecosystem of derivative works, fine-tuned variants, and integrated applications. Available on GitHub, Hugging Face, and through various cloud platforms, SAM is one of the most accessible and impactful computer vision models ever released. The research community has published hundreds of papers building on SAM and developed numerous derivative projects extending the model's capabilities. Lightweight variants such as FastSAM, MobileSAM, and EfficientSAM have made the model deployable on mobile and edge devices, bringing powerful segmentation to resource-constrained environments.
Use Cases
Medical Imaging
Medical research and diagnostic support for organ and lesion segmentation in X-ray, MRI, and CT scans
E-Commerce Image Processing
Object isolation, background removal, and automatic product masking from product images
Autonomous Driving
Driver assistance systems for real-time segmentation of roads, vehicles, pedestrians, and traffic signs
Creative Design
Precise segmentation for extracting objects from photos, creating compositions, and image editing workflows
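Several of the use cases above (e-commerce isolation, creative cutouts) boil down to turning a predicted mask into a transparent image. A minimal NumPy sketch of that final step; in practice the boolean mask would come from a segmentation model rather than being hand-written:

```python
import numpy as np

def mask_to_cutout(rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Turn an (H, W, 3) uint8 image and an (H, W) boolean mask into
    an (H, W, 4) RGBA cutout: background pixels become fully transparent."""
    alpha = (mask.astype(np.uint8) * 255)[..., None]  # 255 = keep, 0 = drop
    return np.concatenate([rgb, alpha], axis=2)

rgb = np.full((2, 2, 3), 200, dtype=np.uint8)          # tiny dummy image
mask = np.array([[True, False], [False, True]])        # dummy foreground mask
cutout = mask_to_cutout(rgb, mask)
print(cutout[0, 0, 3], cutout[0, 1, 3])  # 255 0
```

The RGBA array can then be written out as a transparent PNG with any imaging library.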
Pros & Cons
Pros
- Zero-shot capability works on a wide variety of images right out of the box (SAM 2 extends this to video)
- Dramatically reduces time and cost for data annotation and rapid prototyping of vision applications
- Delivers high-quality masks for common objects and scenes; fast decoder enables interactive applications
- Cross-domain generalizability without extensive retraining; backed by Meta with strong community evolution (SAM → SAM 2 → SAM 3)
Cons
- Lacks semantic understanding; it segments but doesn't classify, so other models are needed to identify what has been segmented
- Poor performance on specialized data (medical, industrial defects) without fine-tuning
- Low-quality prompts can yield masks biased toward the background or confined to parts of the target object
- Complex scenes require more manual prompts with prior knowledge, potentially degrading user experience
- SAM 2: loses track in extended sequences, confuses similar objects in crowds, and degrades with multiple simultaneous objects
Technical Details
Parameters
636M
Architecture
ViT-based image encoder + prompt encoder + lightweight mask decoder
Training Data
SA-1B dataset (11M images, 1.1B masks, largest segmentation dataset)
License
Apache 2.0
Features
- Zero-Shot Segmentation
- Point Prompt
- Box Prompt
- Text Prompt (exploratory, not in the released model)
- Real-Time Inference
- SA-1B Dataset
Benchmark Results
| Metric | Value | Compared To | Source |
|---|---|---|---|
| Training Dataset | SA-1B: 11M images, 1.1B masks | — | Meta AI / SAM Paper (ICCV 2023) |
| Mask Quality (IoU > 90% vs. expert annotation) | 94% of masks | — | SAM Paper (ICCV 2023) |
| Mask Quality (IoU > 75% vs. expert annotation) | 97% of masks | — | SAM Paper (ICCV 2023) |
| Zero-Shot Single-Point Segmentation | Best on 16 of 23 datasets | RITM (interactive baseline) | SAM Paper (ICCV 2023) |
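The IoU thresholds in the table measure how closely a predicted mask overlaps a reference mask. The metric itself is simple to compute; an illustrative implementation:

```python
import numpy as np

def mask_iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks:
    |pred AND ref| / |pred OR ref|."""
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return float(inter / union) if union else 1.0  # two empty masks agree

pred = np.array([[1, 1, 0], [1, 0, 0]], bool)
ref  = np.array([[1, 1, 0], [0, 0, 0]], bool)
print(mask_iou(pred, ref))  # 2 / 3 ≈ 0.667
```

An "IoU > 90%" row then means that for the stated fraction of masks, this score exceeded 0.9 against the reference annotation.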
Related Models
RemBG
RemBG is a popular open-source tool developed by Daniel Gatis for automatic background removal from images, providing a simple and efficient solution for isolating foreground subjects without manual selection or professional editing skills. The tool leverages multiple pre-trained segmentation models including U2-Net, IS-Net, SAM, and specialized variants optimized for different use cases such as general objects, human subjects, anime characters, and clothing items. RemBG processes images through semantic segmentation to identify foreground elements and generates precise alpha matte masks that cleanly separate subjects from backgrounds, producing transparent PNG outputs ready for immediate use. The tool excels at handling complex edge cases including wispy hair, translucent fabrics, intricate jewelry, and objects with irregular boundaries. RemBG is available as a Python library via pip, a command-line interface for batch processing, and through API integrations for production deployment. It processes images locally without sending data to external servers, making it suitable for privacy-sensitive applications. Common use cases include e-commerce product photography, social media content creation, passport photo processing, graphic design compositing, real estate photography, and marketing materials. The tool supports JPEG, PNG, and WebP formats and handles both single images and batch directory operations. RemBG has become one of the most starred background removal repositories on GitHub with millions of downloads, and its models are integrated into numerous other AI tools. Released under the MIT license, it provides a free and commercially viable alternative to paid background removal services.
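RemBG's Python API centers on a single `remove()` call (bytes in, transparent PNG bytes out). A sketch of a small batch script under that assumption — it requires `pip install rembg` to actually run, the directory layout is hypothetical, and the file-filtering helper is plain stdlib:

```python
from pathlib import Path

SUPPORTED = {".jpg", ".jpeg", ".png", ".webp"}  # formats the tool handles

def list_images(folder: str) -> list:
    """Collect supported image files from a directory (non-recursive)."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.suffix.lower() in SUPPORTED)

def strip_backgrounds(folder: str, out_folder: str) -> None:
    """Run RemBG over every image in `folder` (needs `pip install rembg`)."""
    from rembg import remove  # bytes in, PNG bytes with alpha channel out
    out = Path(out_folder)
    out.mkdir(exist_ok=True)
    for img in list_images(folder):
        (out / f"{img.stem}.png").write_bytes(remove(img.read_bytes()))
```

For one-off files, RemBG also ships a CLI (`rembg i input.png output.png`), which the paragraph above refers to as its batch-processing interface.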
BRIA RMBG
BRIA RMBG is a state-of-the-art background removal model developed by BRIA AI, an Israeli startup specializing in responsible and commercially licensed generative AI. The model delivers exceptional accuracy in separating foreground subjects from backgrounds, handling complex scenarios including fine hair details, transparent objects, intricate edges, smoke, and glass with remarkable precision. BRIA RMBG is built on a proprietary architecture trained on exclusively licensed and ethically sourced data, ensuring full commercial safety and IP compliance that distinguishes it from models trained on scraped internet data. It produces high-quality alpha mattes preserving fine edge details and natural transparency gradients for clean cutouts suitable for professional workflows. Available in versions including RMBG 1.4 and RMBG 2.0, the model consistently ranks among top performers on background removal benchmarks including the DIS5K and HRS10K datasets. BRIA RMBG is accessible through Hugging Face under BRIA's source-available license (free for non-commercial use, with commercial use requiring an agreement with BRIA), and through BRIA's commercial API for scalable cloud processing. Integration options include a Python SDK, REST API, and compatibility with popular image processing pipelines. Applications span e-commerce product photography, graphic design compositing, video conferencing virtual backgrounds, automotive and real estate photography, social media content creation, and document digitization. The model processes images in milliseconds on modern GPUs, suitable for real-time and high-volume batch processing. BRIA RMBG has established itself as one of the most commercially trusted and technically advanced background removal solutions available.
BiRefNet
BiRefNet (Bilateral Reference Network) is an advanced open-source segmentation model developed by ZhengPeng7 for high-resolution dichotomous image segmentation, precisely separating foreground objects from backgrounds with pixel-level accuracy at fine structural details. The model introduces a bilateral reference framework leveraging both global semantic information and local detail features through a dual-branch architecture, enabling superior edge quality compared to traditional segmentation approaches. BiRefNet processes images through a backbone encoder to extract multi-scale features, then applies bilateral reference modules that cross-reference global context with local boundary information to produce crisp segmentation masks with clean edges around complex structures like hair strands, lace patterns, chain links, and transparent materials. The model achieves state-of-the-art results on multiple benchmarks including DIS5K, demonstrating strength in handling objects with intricate boundaries that challenge conventional models. BiRefNet has gained significant popularity as a background removal solution due to its exceptional edge quality, outperforming many dedicated background removal tools on challenging images. It supports high-resolution input processing and produces alpha mattes suitable for professional compositing. Available through Hugging Face with multiple model variants optimized for different quality-speed tradeoffs, BiRefNet integrates easily into Python-based pipelines and has been adopted by several popular AI platforms. Common applications include precision background removal for product photography, fine-grained object isolation for graphic design, medical image segmentation, and creating high-quality cutouts for visual effects. Released under an open-source license, BiRefNet provides a free and technically sophisticated alternative to commercial segmentation services.
MODNet
MODNet (Matting Objective Decomposition Network) is an open-source portrait matting model developed by ZHKKKe, designed for real-time human portrait background removal without requiring a pre-defined trimap or additional user input. Unlike traditional matting approaches needing manually drawn trimaps, MODNet achieves fully automatic portrait matting by decomposing the complex matting objective into three sub-tasks: semantic estimation for identifying the person region, detail prediction for refining edge quality around hair and clothing boundaries, and semantic-detail fusion for combining both signals into a high-quality alpha matte. This decomposition enables efficient single-pass inference at real-time speeds, making it practical for video conferencing, live streaming, and mobile photography where latency is critical. The model produces smooth and accurate alpha mattes with particular strength in handling hair strands, fabric edges, and other fine boundary details challenging for segmentation-based approaches. MODNet supports both image and video input with temporal consistency optimizations for stable video matting without flickering. The model is lightweight enough for mobile devices and edge hardware, with ONNX export supporting deployment across iOS, Android, and web browsers through WebAssembly. Common applications include video call background replacement, portrait mode photography, social media content creation, virtual try-on systems, and film post-production green screen alternatives. Released under Apache 2.0, MODNet provides a free and efficient solution widely adopted in both research and production portrait matting applications.
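An alpha matte differs from a binary mask in that it takes fractional values on soft boundaries like hair. Compositing a matted subject over a new background follows the standard matting equation I = αF + (1 − α)B. A minimal NumPy illustration with toy values (not MODNet output):

```python
import numpy as np

def composite(alpha: np.ndarray, fg: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Blend foreground over background with a fractional alpha matte:
    I = alpha * F + (1 - alpha) * B, applied per pixel."""
    a = alpha[..., None]  # broadcast the (H, W) matte over the RGB channels
    return (a * fg + (1.0 - a) * bg).astype(np.uint8)

fg = np.full((1, 3, 3), 255, np.float32)  # white subject
bg = np.zeros((1, 3, 3), np.float32)      # black replacement background
alpha = np.array([[1.0, 0.5, 0.0]])       # solid, hair-like edge, background
print(composite(alpha, fg, bg)[0, :, 0])  # [255 127 0]
```

The 0.5 pixel blends evenly between subject and background, which is exactly the smooth edge behavior that trimap-free matting models like MODNet are trained to predict.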