YOLOv10
YOLOv10 is the tenth major iteration of the YOLO (You Only Look Once) real-time object detection series, developed by researchers at Tsinghua University. The model introduces a fundamentally redesigned NMS-free (Non-Maximum Suppression free) architecture that eliminates the post-processing bottleneck present in all previous YOLO versions, enabling true end-to-end object detection with consistent latency. YOLOv10 employs a dual-assignment training strategy that combines one-to-many and one-to-one label assignments during training, providing rich supervision signals while keeping inference efficient and free of redundant predictions.

Built on a CSPNet backbone with enhanced feature aggregation, the model comes in six scale variants ranging from Nano (about 2.3M parameters) to Extra-Large (about 29.5M parameters), allowing deployment across edge devices, mobile platforms, and high-performance servers. Each variant is optimized for its target hardware profile, delivering a strong accuracy-latency trade-off in its class. YOLOv10 achieves state-of-the-art performance on the COCO benchmark, outperforming previous YOLO versions and competing models such as RT-DETR at significantly lower computational cost.

Released under the AGPL-3.0 license, the model is open source and integrates seamlessly with the Ultralytics ecosystem for training, validation, and deployment. Common applications include autonomous driving perception, industrial quality inspection, security surveillance, retail analytics, robotics, and drone-based monitoring. The model supports ONNX and TensorRT export for optimized production deployment.
Key Highlights
NMS-Free Detection
Faster and more efficient inference with direct object detection without requiring Non-Maximum Suppression
Real-Time Performance
Architecture optimized for real-time applications, capable of object detection within milliseconds
Scalable Model Family
Adapts to different hardware needs by offering models in various sizes from Nano to Extra-Large
Superior Accuracy-Speed Trade-off
Achieves the same or higher accuracy with less computation than previous YOLO versions
About
YOLOv10 is the tenth major version of the YOLO (You Only Look Once) series in real-time object detection. Developed by researchers at Tsinghua University, the model departs fundamentally from its predecessors with an NMS-free (Non-Maximum Suppression free) architecture, setting new standards in both speed and accuracy. In the YOLO series' evolution spanning more than a decade, YOLOv10 marks a critical milestone: it delivers truly end-to-end object detection, reflecting the maturation of the design and overcoming limitations that constrained earlier versions.
The elimination of NMS is YOLOv10's most important innovation and represents a fundamental rethinking of the detection pipeline. In traditional object detectors, the NMS step that filters overlapping detection boxes adds latency and complicates end-to-end training. YOLOv10 removes this step entirely through a consistent dual assignment strategy, offering truly end-to-end object detection for the first time in the YOLO family. This simplifies training, significantly increases inference speed, and eliminates the post-processing latency that bottlenecked previous versions. It also resolves the training-inference inconsistency, yielding more stable and predictable performance.
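To make the removed step concrete, the sketch below implements the classic greedy NMS pass that every earlier YOLO version needed after the network's forward pass. The box coordinates, scores, and 0.5 IoU threshold are illustrative, not taken from the paper.

```python
# Minimal greedy NMS sketch: the post-processing step YOLOv10 eliminates.
# Boxes are (x1, y1, x2, y2) tuples; values here are illustrative.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping rivals."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two near-duplicate detections of one object, plus one distinct object:
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the duplicate at index 1 is suppressed
```

Because YOLOv10's one-to-one head emits at most one confident prediction per object, this entire loop, along with its detection-count-dependent runtime, disappears from the deployment pipeline.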
The model is offered in six sizes from Nano to Extra-Large: YOLOv10-N, S, M, B, L, and X. This range provides a suitable option for every scenario, from real-time applications on mobile devices to high-accuracy server-side analysis. The smallest variant can process 100+ frames per second on suitable hardware, while the largest achieves the family's highest accuracy on the COCO dataset. The Nano and Small variants are optimized for embedded systems and IoT devices, while the Large and Extra-Large variants target server applications requiring maximum accuracy and comprehensive detection coverage. This wide model range lets a single architecture family cover all deployment targets.
Architecturally, YOLOv10 incorporates an advanced backbone network, feature pyramid network (FPN), and a dual label assignment strategy for optimal training efficiency. During training, both one-to-one and one-to-many label assignments are used: the one-to-many assignment provides rich supervisory signals while the one-to-one assignment produces results directly without needing NMS. This dual strategy improves training efficiency while maintaining inference simplicity. Additionally, large-kernel convolutions and self-attention mechanisms enable the model to better detect objects requiring a wide receptive field. Efficient channel expansion and partial self-attention mechanisms keep computational costs under control.
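The dual assignment can be sketched as follows. In the paper, both heads rank predictions for each ground-truth box with a matching metric of the form m = s · p^α · IoU^β (spatial prior s, classification score p); the sketch below drops the spatial prior for brevity and uses hypothetical α, β, and top-k values, to show why sharing one metric keeps the one-to-one pick consistent with the one-to-many set.

```python
# Illustrative sketch of consistent dual assignment: both heads rank
# predictions with the same metric m = p**alpha * iou**beta (the spatial
# prior is omitted). alpha, beta, and top_k here are hypothetical values.

def matching_metric(p, iou, alpha=0.5, beta=6.0):
    """Combined classification (p) and localization (iou) quality."""
    return (p ** alpha) * (iou ** beta)

def assign(preds, top_k):
    """Rank predictions for one ground-truth box; top_k=1 gives the
    one-to-one assignment, top_k > 1 the one-to-many assignment."""
    ranked = sorted(range(len(preds)),
                    key=lambda i: matching_metric(*preds[i]),
                    reverse=True)
    return ranked[:top_k]

# (classification score, IoU with the ground-truth box) per prediction:
preds = [(0.9, 0.85), (0.6, 0.90), (0.8, 0.40)]

one_to_many = assign(preds, top_k=2)  # rich supervision during training
one_to_one = assign(preds, top_k=1)   # NMS-free selection at inference

# Because both heads share the metric, the one-to-one pick is always the
# top-ranked sample of the one-to-many set:
assert one_to_one[0] == one_to_many[0]
print(one_to_one, one_to_many)  # → [1] [1, 0]
```

If the two heads used different ranking criteria, the inference-time head could select a prediction that training never strongly supervised; the shared metric is what removes that train-inference gap.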
YOLOv10 is widely used in autonomous driving, security camera analysis, industrial inspection, retail counting, and sports analytics. Typical real-world applications include vehicle and pedestrian detection in traffic management, quality control and defect detection on production lines, inventory management and shelf tracking in retail stores, and player tracking and motion analysis in sports competitions. It is also frequently used for real-time object detection on drones and unmanned aerial vehicles in surveillance and monitoring scenarios.
Available in PyTorch and ONNX formats, it can be easily deployed to edge devices with minimal configuration. TensorRT optimization achieves maximum performance on NVIDIA GPUs for latency-critical applications. OpenVINO support enables efficient operation on Intel hardware. CoreML conversion allows direct deployment on iOS devices, while TFLite enables running on Android devices natively. Training, evaluation, and deployment processes are standardized through the Ultralytics library, and this broad platform support makes YOLOv10 deployable in every environment from cloud to edge across the entire computing spectrum.
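A minimal export sketch using the Ultralytics API (this assumes the `ultralytics` package is installed and downloads the pretrained `yolov10n.pt` weights on first run):

```python
from ultralytics import YOLO

# Load the pretrained Nano variant (weights are fetched automatically).
model = YOLO("yolov10n.pt")

# Export to ONNX for deployment; other targets such as "engine" (TensorRT),
# "openvino", "coreml", and "tflite" use the same call with a different format.
model.export(format="onnx")
```

The same `YOLO` object also drives training (`model.train(...)`) and validation (`model.val(...)`), which is what standardizes the workflow across all of these deployment formats.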
Use Cases
Security Camera Analysis
Real-time human, vehicle, and object detection and tracking system in security camera footage
Autonomous Driving
Real-time object recognition system for vehicle, pedestrian, traffic sign, and obstacle detection
Quality Control
Visual inspection system for automatic detection of defective products on production lines
Retail Analytics
Visual perception for customer movement, shelf status, and product placement analysis in stores
Pros & Cons
Pros
- Significant speed advantage from the NMS-free approach, which removes post-processing entirely
- Higher COCO mAP with fewer parameters and FLOPs than YOLOv8; the paper reports, for example, 25% fewer parameters and 46% lower latency for YOLOv10-B versus YOLOv9-C at comparable accuracy
- Distinct advantage in small object detection; leverages parameters more efficiently
- Particularly well-suited for crowded scene analysis and deployment to low-power edge devices
- Real-time performance of 120+ fps on suitable hardware for instant object detection
Cons
- May trail YOLOv8's accuracy on some datasets despite its smaller model sizes
- Lacks multi-task support like YOLOv8's native instance segmentation, pose estimation, and OBB
- Less mature ecosystem compared to YOLOv8's massive open-source community and comprehensive documentation
- Performance varies by use case; superiority is not guaranteed in every scenario
Technical Details
Parameters
~2.3M-29.5M
Architecture
CNN (CSPNet backbone)
Training Data
COCO
License
AGPL-3.0
Features
- NMS-Free Detection
- Real-Time Inference
- Scalable Architecture
- Multi-Size Models
- ONNX Export
- Edge Deployment
Benchmark Results
| Metric | Value | Compared To | Source |
|---|---|---|---|
| mAP (COCO val, YOLOv10-X) | 54.4% | YOLOv9-E: 55.6%, YOLOv8-X: 53.9% | YOLOv10 Paper (Tsinghua, 2024) |
| Speed (T4 GPU, YOLOv10-S) | 2.49ms (FP16) | YOLOv8-S: 4.03ms | YOLOv10 Paper (Tsinghua, 2024) |
| Parameter Count (YOLOv10-S) | 7.2M | YOLOv8-S: 11.2M | YOLOv10 Paper (Tsinghua, 2024) |
| Supported Classes (COCO) | 80 classes | — | COCO Dataset / YOLOv10 GitHub |