AudioCraft
AudioCraft is Meta AI's comprehensive open-source framework for generative audio research and applications, bringing together three specialized models under a single integrated platform: MusicGen for music generation, AudioGen for sound effect synthesis, and EnCodec for neural audio compression. Released in August 2023 under the MIT license, AudioCraft provides a unified codebase that simplifies working with state-of-the-art audio generation models through consistent APIs and shared infrastructure. The framework is built on a transformer-based architecture where audio signals are first compressed into discrete tokens by EnCodec, then generated autoregressively by task-specific language models. MusicGen handles text-to-music generation with melody conditioning support, while AudioGen specializes in environmental sounds, sound effects, and non-musical audio from text descriptions. EnCodec serves as the neural audio codec backbone, compressing audio at various bitrates while maintaining high perceptual quality. AudioCraft supports multiple model sizes, stereo generation, and provides extensive training and inference utilities. The framework includes pre-trained models for immediate use and tools for training custom models on user-provided datasets. As a Python library installable via pip, AudioCraft integrates seamlessly into existing machine learning and audio processing pipelines. It is widely used by researchers studying audio generation, developers building creative audio tools, content creators needing original music and sound effects, and game studios requiring dynamic audio systems. AudioCraft represents Meta's most significant contribution to open-source audio AI and has become the foundation for numerous community projects and commercial applications in the rapidly growing AI audio generation space.
Key Highlights
Unified Audio Framework
Unifies MusicGen, AudioGen and EnCodec under a single consistent codebase, offering a complete toolkit for audio AI research
Modular Architecture
Separates audio tokenization from sequence modeling, enabling flexible experimentation with different model sizes and training strategies
EnCodec Neural Codec
Advanced neural audio codec technology that compresses audio into discrete tokens at various bitrates while maintaining high perceptual quality
Complete Training Infrastructure
Provides researchers with a comprehensive development environment including training scripts, evaluation tools and pre-trained model weights
About
AudioCraft is Meta AI's comprehensive open-source framework for generative audio research and applications. Released in 2023, this framework brings together three specialized models under a single integrated platform: MusicGen for music generation, AudioGen for sound effect synthesis, and EnCodec for neural audio compression. AudioCraft aims to provide researchers and developers with a standardized infrastructure for rapid experimentation in the audio AI domain.
AudioCraft's technical architecture is built on the EnCodec neural audio codec. EnCodec compresses raw 32 kHz waveforms into discrete tokens drawn from 4 codebooks at a frame rate of 50 Hz, yielding compact representations that preserve high perceptual quality. These token sequences serve as the shared foundation for both MusicGen and AudioGen. The two models share an autoregressive transformer language-model architecture but are specialized through different training data and target domains: MusicGen is trained on licensed music datasets, while AudioGen is trained on environmental sound and sound-effect datasets. This modular approach lets specialized models for different audio types run on the same infrastructure.
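To make the tokenization step concrete, here is a minimal encode/decode sketch using audiocraft's CompressionModel wrapper; the checkpoint name and the encode/decode signatures follow the public GitHub repo and may differ across versions, so treat them as assumptions.

```python
# Minimal encode/decode sketch via audiocraft's CompressionModel wrapper.
# Checkpoint name and method signatures follow the public repo and are
# assumptions for other versions.
import torch
from audiocraft.models import CompressionModel

codec = CompressionModel.get_pretrained('facebook/encodec_32khz')  # 4 codebooks @ 50 Hz
wav = torch.randn(1, 1, 32_000)  # one second of placeholder mono audio at 32 kHz

with torch.no_grad():
    codes, scale = codec.encode(wav)    # codes: [batch, codebooks, frames]
    recon = codec.decode(codes, scale)  # tokens back to a waveform

print(codes.shape)  # expected: torch.Size([1, 4, 50])
```

A one-second 32 kHz clip thus maps to a 4 x 50 grid of discrete tokens, which is exactly the sequence that MusicGen and AudioGen learn to predict.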
Performance metrics across AudioCraft's models are noteworthy. MusicGen-Large achieves a Fréchet Audio Distance (FAD, lower is better) of 3.80 on the MusicCaps benchmark, while AudioGen produces competitive results on the AudioCaps dataset. EnCodec maintains high audio quality even at bitrates as low as 6 kbps, a significant advance over traditional codecs. The framework supports several control mechanisms, including text conditioning, melody conditioning, and style transfer.
In terms of applications, AudioCraft spans a wide range from academic research to commercial use cases. Researchers can leverage the framework's modular structure to develop new audio generation techniques. Game developers can create dynamic audio environments and adaptive music systems. Content creators can produce background music and sound effects for podcasts and videos. Telecommunications companies can benefit from EnCodec's low-bitrate audio compression capabilities for bandwidth optimization.
AudioCraft's codebase is fully open source under the MIT license and accessible via GitHub, while the pre-trained model weights carry a separate CC-BY-NC 4.0 license. The Python API installs via pip and works in Jupyter notebook environments. The framework is built on PyTorch and optimized for NVIDIA GPUs. Through Hugging Face integration, pre-trained models can be downloaded and deployed for immediate use.
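As an illustration of the Python API, the snippet below follows the text-to-music usage pattern from the AudioCraft README; the checkpoint name, prompts, and generation parameters are illustrative choices, not requirements.

```python
# Text-to-music sketch following the usage pattern in the AudioCraft README.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-small')  # smallest checkpoint, fast to test
model.set_generation_params(duration=8)  # seconds of audio per sample

descriptions = ['lo-fi hip hop beat with warm piano', 'energetic synthwave with driving bass']
wav = model.generate(descriptions)  # [batch, channels, samples] tensor at 32 kHz

for idx, one_wav in enumerate(wav):
    # Writes idx.wav with loudness normalization to avoid clipping.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```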
AudioCraft's position in the audio AI ecosystem is unique. Designed not as a standalone model but as a comprehensive research and application framework, it stands apart from competitors. Compared to Google's MusicLM or Stability AI's Stable Audio, AudioCraft distinguishes itself through open-source accessibility, modular architecture, and multi-model support. This approach aims to democratize research and development processes in audio AI, and its community-contribution-friendly structure ensures continuous evolution and improvement.
A closer look at AudioCraft's technical infrastructure shows how the shared codebase and common API across the framework's models enhance research efficiency. Researchers can reuse the EnCodec tokenization infrastructure directly when developing new audio generation models and adapt the transformer architecture to their specific needs. AudioCraft also includes auxiliary components such as training pipelines, evaluation metrics, and data preprocessing tools, and it supports multi-GPU and distributed training scenarios. Meta's FAIR (Fundamental AI Research) team continues active development on AudioCraft, regularly releasing new model versions and improvements. This continuous development keeps AudioCraft among the most dynamic open-source projects in the audio AI field and at the forefront of generative audio research.
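As a small illustration of this modularity, the sketch below pulls the codec and the language model out of a loaded MusicGen wrapper so each can be inspected or swapped independently; the attribute names follow the public repo and should be treated as assumptions for other versions.

```python
# The generation wrapper holds the frozen EnCodec tokenizer and the
# transformer LM as separate components; attribute names are assumptions
# based on the public repo.
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')
codec = model.compression_model  # EnCodec tokenizer, frozen during LM training
lm = model.lm                    # autoregressive transformer over codec tokens
print(type(codec).__name__, type(lm).__name__)
```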
Use Cases
Audio AI Research
Conducting experiments on audio generation models in academic and industrial research laboratories
Music Production Tools
Building and integrating the AI engine behind professional music production software
Sound Design Applications
Developing tools for generating sound effects and ambient sounds for film, game and media projects
Interactive Audio Systems
Creating dynamic audio generation systems that respond in real-time to user input
Pros & Cons
Pros
- Comprehensive audio generation framework including MusicGen, AudioGen, and EnCodec in a unified library
- Multi-Band Diffusion decoder reduces audio artifacts, producing clearer and more natural output than the standard EnCodec decoder
- Melody-guided generation via chromagrams allows steering the music to follow an extracted melody while staying faithful to the text prompt (see the sketch after this list)
- Trained on 20,000 hours of licensed music with vocals removed to prevent artist voice replication
- Open-source research framework with pre-trained models available on HuggingFace
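Below is a hedged sketch of the melody-guided generation mentioned above, following the generate_with_chroma pattern documented in the AudioCraft README; the reference audio path is a placeholder.

```python
# Melody-conditioned generation sketch; 'reference_melody.wav' is a placeholder.
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)

melody, sr = torchaudio.load('reference_melody.wav')
# A chromagram extracted from the melody guides the generated music,
# while the text prompt controls style and instrumentation.
wav = model.generate_with_chroma(
    descriptions=['upbeat acoustic folk'],
    melody_wavs=melody[None],  # add a batch dimension: [1, channels, samples]
    melody_sample_rate=sr,
)
```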
Cons
- Requires a GPU with at least 16 GB of VRAM for comfortable local use, limiting accessibility
- Training dataset lacks diversity — contains mostly Western-style music with English text pairs only
- Pre-trained model weights are released under a non-commercial CC-BY-NC 4.0 license, restricting business applications
- Generated music lacks long-term structural coherence beyond short musical phrases
- Limited genre diversity in output due to dataset bias toward specific musical styles
Technical Details
Parameters
Varies by model (MusicGen checkpoints range from 300M to 3.3B parameters)
Architecture
Transformer-based framework with EnCodec neural codec
Training Data
Combination of licensed music (Shutterstock, Pond5) and environmental audio datasets
License
MIT (code); pre-trained model weights released under CC-BY-NC 4.0
Features
- MusicGen Text-to-Music Model
- AudioGen Sound Effect Synthesis
- EnCodec Neural Audio Compression
- Melody Conditioning Support
- Multi-Scale Transformer Architecture
- Pre-trained Model Weights Library
Benchmark Results
| Metric | Value | Compared To | Source |
|---|---|---|---|
| Sampling Rate | 32 kHz (EnCodec) | — | GitHub facebookresearch/audiocraft |
| Codebook Count | 4 codebooks @ 50 Hz | — | arXiv 2306.05284 |
| Maximum Duration | 30 seconds | Stable Audio: 180 seconds | GitHub facebookresearch/audiocraft |
| FAD (MusicCaps) | 3.80 (MusicGen-Large) | Riffusion: 11.50 | arXiv 2306.05284 |
Related Models
Suno AI
Suno AI is a commercial AI music generation platform that creates complete songs with vocals, lyrics, and instrumental arrangements from text descriptions. Founded in 2023 by a team of former Kensho Technologies engineers, Suno AI offers an accessible web interface that enables users to generate professional-sounding songs by simply describing the desired genre, mood, topic, and style in natural language. The platform uses a proprietary transformer-based architecture that generates all components of a song including melody, harmony, rhythm, instrumentation, vocal performance, and lyrics in a single integrated process. Suno AI supports a remarkably wide range of musical genres from pop and rock to hip-hop, country, classical, electronic, jazz, and experimental styles, producing outputs that often sound indistinguishable from human-created music to casual listeners. Generated songs can be up to several minutes in duration and include realistic singing voices with proper pronunciation, emotional expression, and musical phrasing. The platform allows users to provide custom lyrics or let the AI generate lyrics based on a theme or concept. Suno AI operates on a freemium subscription model with limited free generations and paid tiers for higher volume and commercial usage rights. The platform has gained significant attention for democratizing music creation, enabling people without musical training to produce complete songs. Suno AI is particularly popular among content creators, social media marketers, hobbyist musicians, and anyone needing original music for videos, podcasts, or personal projects without the cost and complexity of traditional music production.
MusicGen
MusicGen is a single-stage transformer-based music generation model developed by Meta AI Research as part of the AudioCraft framework. Released in June 2023 under the MIT license, MusicGen uses a single autoregressive language model operating over compressed discrete audio representations from EnCodec, unlike cascading approaches that require multiple models. The model comes in multiple sizes ranging from 300M to 3.3B parameters, allowing users to balance quality against computational requirements. MusicGen generates high-quality mono and stereo music at 32 kHz from text descriptions, supporting a wide range of genres, instruments, moods, and musical styles. Users can describe desired music using natural language prompts specifying genre, tempo, instrumentation, and atmosphere, and the model produces coherent musical compositions that follow the specified characteristics. Beyond text-to-music generation, MusicGen supports melody conditioning where an existing audio clip guides the melodic structure of the generated output, enabling more controlled music creation. The model achieves strong results across both objective metrics and subjective listening evaluations, producing music that sounds natural and musically coherent for durations up to 30 seconds. As a fully open-source model with code and weights available on GitHub and Hugging Face, MusicGen has become one of the most widely adopted AI music generation tools in both research and creative communities. It integrates easily into existing audio production workflows through the Audiocraft Python library and various community-built interfaces. MusicGen is particularly popular among content creators, game developers, and musicians who need royalty-free background music generated on demand.
Udio
Udio is an AI music generation platform developed by former Google DeepMind researchers that creates high-quality songs with vocals, lyrics, and instrumentals from text prompts. Launched in April 2024, Udio quickly gained attention for producing remarkably realistic and musically coherent outputs that rival professional studio recordings in audio fidelity. The platform uses a proprietary transformer-based architecture that generates all aspects of a musical composition including vocal performances, instrumental arrangements, harmonies, and production effects in a unified process. Udio supports an extensive range of musical genres and styles from mainstream pop and rock to niche genres like lo-fi, synthwave, Afrobeat, and traditional folk music from various cultures. Generated songs feature studio-quality audio at high sample rates with realistic vocal timbres, proper musical dynamics, and professional-sounding mixing and mastering. The platform allows users to provide custom lyrics, specify song structure, and control various musical parameters through text descriptions. Udio also supports audio extensions where users can generate additional sections to extend existing songs, enabling the creation of full-length tracks through iterative generation. The platform operates on a freemium model with free daily generations and paid subscription tiers for commercial use and higher generation limits. Udio is particularly notable for its vocal quality, which includes natural-sounding vibrato, breath sounds, and emotional expressiveness that many competing platforms struggle to achieve. The platform is popular among content creators, independent musicians exploring AI-assisted composition, marketing teams needing original music, and hobbyists who want to create professional-sounding songs without musical training or expensive production equipment.
Bark
Bark is a transformer-based text-to-audio generation model developed by Suno AI that converts text into natural-sounding speech, music, and sound effects. Released as open source under the MIT license in April 2023, Bark goes far beyond traditional text-to-speech systems by generating not only spoken words but also laughter, sighs, music, and ambient sounds from text descriptions. The model uses a GPT-style autoregressive transformer architecture with an EnCodec audio tokenizer to generate audio tokens that are then decoded into waveforms. Bark supports multiple languages including English, Chinese, French, German, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, and Turkish, making it one of the most multilingual open-source audio generation models available. The model can clone voice characteristics from short audio samples, allowing users to generate speech in specific voices or speaking styles. Bark operates in a zero-shot manner, meaning it can produce diverse outputs without task-specific fine-tuning. Generation includes natural prosody, emotion, and intonation that closely mimics human speech patterns. The model generates audio at 24 kHz sample rate with reasonable quality for most applications. As a fully open-source project with pre-trained weights available on Hugging Face and GitHub, Bark is widely used by developers building voice applications, content creators producing multilingual audio, and researchers exploring generative audio models. The model is particularly valued for its versatility in handling diverse audio types within a single unified architecture and its accessibility for rapid prototyping of audio generation applications.