yyf/ai-audio-tools

ai-audio-tools

Community list of open-source AI tools for audio, music, and speech applications

To contribute to the list

Edit the README and make a PR
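Before opening a PR, it can help to check that the tool you are adding is not already listed. The sketch below is a hypothetical helper, not part of this repository: it assumes entries are bullet lines of the form `Name: description` (optionally with a Markdown link around the name), a shape inferred from the list itself rather than from any stated guideline.

```python
import re
from collections import Counter

# Assumed entry shape: a bullet, then either "[Name](url)" or a plain name,
# then ": description". Lines without a bullet (headings) are ignored.
ENTRY = re.compile(
    r"^\s*[•*-]\s+(?:\[(?P<linked>[^\]]+)\]\([^)]*\)|(?P<plain>[^:\[]+)):\s+\S"
)

def entry_names(readme_text):
    """Return the tool name of every line that looks like a list entry."""
    names = []
    for line in readme_text.splitlines():
        match = ENTRY.match(line)
        if match:
            name = match.group("linked") or match.group("plain")
            names.append(name.strip())
    return names

def duplicate_names(readme_text):
    """Return names listed more than once, compared case-insensitively."""
    counts = Counter(name.lower() for name in entry_names(readme_text))
    return sorted(name for name, count in counts.items() if count > 1)

# Made-up sample text that mimics the list format.
sample = """\
Synthesis

  • piper: A fast, local neural text to speech system
  • OpenVoice: Instant voice cloning by MyShell
  • Piper: A fast, local neural text to speech system
"""
print(duplicate_names(sample))
```

To check a local copy, pass the README contents to `duplicate_names`; any name it returns is already in the list.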

Audio

DAW

  • OpenVINO: OpenVINO AI effects for Audacity
  • TuneFlow: TuneFlow is a next-gen DAW that aims to boost music making productivity through the power of AI

Music

Analysis

  • Essentia: open-source C++ library for audio analysis and audio-based music information retrieval
  • Librosa: Python library for audio and music analysis
  • DDSP: DDSP is a library of differentiable versions of common DSP functions (such as synthesizers, waveshapers, and filters). This allows these interpretable elements to be used as part of a deep learning model, especially as the output layers for audio generation
  • MIDI-DDSP: MIDI-DDSP is a hierarchical audio generation model for synthesizing MIDI expanded from DDSP
  • TorchAudio: Data manipulation and transformation for audio signal processing, powered by PyTorch
  • nnAudio: Audio processing using PyTorch 1D convolutional networks
  • pyAudioAnalysis: Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications
  • mutagen: a Python module to handle audio metadata
  • dejavu: Audio fingerprinting and recognition in Python
  • audiomentations: A Python library for audio data augmentation. Inspired by albumentations. Useful for machine learning
  • soundata: Python library for downloading, loading, and working with sound datasets
  • EfficientAT: Efficient CNNs for audio tagging, with AudioSet-pretrained models ready for downstream training and extraction of audio embeddings
  • AugLy: A data augmentations library for audio, image, text, and video
  • Pedalboard: A Python library for working with audio
  • TinyTag: a Python library for reading audio file metadata
  • OpenSmile: The Munich Open-Source Large-Scale Multimedia Feature Extractor
  • Madmom: Python audio and music signal processing library
  • Beets: a music library manager and MusicBrainz tagger
  • Mirdata: Python library for working with Music Information Retrieval datasets
  • Partitura: A Python package for handling modern staff notation of music
  • msaf: a Python package for the analysis of music structural segmentation algorithms
  • basic-pitch: A lightweight yet powerful audio-to-MIDI converter with pitch bend detection
  • jams: A JSON Annotated Music Specification for Reproducible MIR Research

Production

  • Spleeter: Deezer source separation library including pretrained models
  • DeepAFx: Third-party audio effects plugins as differentiable layers within deep neural networks
  • matchering: open source audio matching and mastering
  • AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec
  • USS: PyTorch implementation of Universal Source Separation with Weakly Labelled Data
  • FAST-RIR: Official implementation of a neural-network-based fast diffuse room impulse response generator (FAST-RIR), which generates room impulse responses (RIRs) for a given rectangular acoustic environment

Generation

  • StableAudio: Generative models for conditional audio generation
  • AudioCraft: a PyTorch library for deep learning research on audio generation. AudioCraft contains inference and training code for two state-of-the-art AI generative models producing high-quality audio: AudioGen and MusicGen.
  • Jukebox: A generative model for music
  • Magenta: symbolic music generation with diffusion models
  • TorchSynth: A GPU-optional modular synthesizer in PyTorch, 16200x faster than realtime, for audio ML researchers
  • audiobox: Audiobox is Meta’s new foundation research model for audio generation. It can generate voices and sound effects using a combination of voice inputs and natural language text prompts
  • Amphion: Amphion is a toolkit for Audio, Music, and Speech Generation
  • AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  • WaveGAN: Learn to synthesize raw audio with generative adversarial networks
  • RAVE: Official implementation of the RAVE model: a Realtime Audio Variational autoEncoder
  • AudioLDM: This toolbox aims to unify audio generation model evaluation for easier comparison
  • Make-An-Audio: a conditional diffusion probabilistic model capable of generating high fidelity audio efficiently from X modality
  • Diffusers: the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules
  • stable-audio-tools: Generative models for conditional audio generation
  • MidiTok: MIDI / symbolic music tokenizers for Deep Learning models
  • muspy: an open source Python library for symbolic music generation
  • MusicLM (https://google-research.github.io/seanet/musiclm/examples/): a model generating high-fidelity music from text descriptions
  • riffusion: Stable diffusion for real-time music generation
  • muzic: Music Understanding and Generation with Artificial Intelligence
  • midi-lm: Generative modeling of MIDI files
  • UniAudio: The Open Source Code of UniAudio
  • MuseGAN: An AI for Music Generation

Speech

Recognition

  • Whisper: a multitasking model that can perform multilingual speech recognition, speech translation, and language identification
  • Deep Speech: Mozilla's open-source speech-to-text engine
  • Kaldi ASR: open-source speech recognition toolkit written in C++
  • PaddleSpeech: Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting
  • NeMo: a framework for generative AI
  • julius: Open-Source Large Vocabulary Continuous Speech Recognition Engine
  • speechbrain: an open-source and all-in-one conversational AI toolkit based on PyTorch
  • pocketsphinx: A small speech recognizer
  • FunASR: A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models
  • NeuralSpeech: a research project at Microsoft Research Asia, which focuses on neural network based speech processing, including automatic speech recognition (ASR), text-to-speech synthesis (TTS), spatial audio synthesis, video dubbing, etc
  • espnet: End-to-End Speech Processing Toolkit

Production

  • Descript audio codec: State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio
  • Descript audio tools: Object-oriented handling of audio data, with GPU-powered augmentations, and more
  • Meta encodec: State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio
  • audino: Open source audio annotation tool for humans
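To put the 90x compression factor quoted above for the Descript audio codec in perspective, here is a rough bitrate calculation. It is only a sketch: the 16-bit mono PCM baseline is an assumption, since the list does not say what the factor is measured against, and the function names are illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz, bit_depth=16, channels=1):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000

def compressed_bitrate_kbps(sample_rate_hz, compression_factor,
                            bit_depth=16, channels=1):
    """Bitrate left after dividing the PCM bitrate by a compression factor."""
    raw = pcm_bitrate_kbps(sample_rate_hz, bit_depth, channels)
    return raw / compression_factor

# Assumed baseline: 44.1 kHz, 16-bit, mono PCM.
raw = pcm_bitrate_kbps(44_100)
coded = compressed_bitrate_kbps(44_100, compression_factor=90)
print(f"{raw:.1f} kbps uncompressed, about {coded:.2f} kbps after 90x compression")
```

Under these assumptions, 44.1 kHz audio drops from 705.6 kbps to roughly 7.84 kbps.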

Synthesis

  • Coqui TTS: a deep learning toolkit for Text-to-Speech, battle-tested in research and production
  • DiffSinger: singing voice synthesis via shallow diffusion mechanism
  • Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time
  • wavenet: A TensorFlow implementation of DeepMind's WaveNet paper
  • FastSpeech2: An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
  • MelGAN: Unofficial PyTorch implementation of MelGAN vocoder
  • hifi-gan: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
  • elevenlabs-python: The official Python API for ElevenLabs Text to Speech
  • tortoise-tts: A multi-voice TTS system trained with an emphasis on quality
  • lyrebird: Simple and powerful voice changer for Linux, written with Python & GTK
  • piper: A fast, local neural text to speech system
  • tts-generation-webui: TTS Generation Web UI (Bark, MusicGen + AudioGen, Tortoise, RVC, Vocos, Demucs, SeamlessM4T, MAGNet)
  • GPT-SoVITS: Few-shot voice cloning; one minute of voice data is enough to train a good TTS model
  • metavoice-src: Foundational model for human-like, expressive TTS
  • Retrieval-based-Voice-Conversion-WebUI: Ten minutes or less of voice data is enough to train a good voice conversion (VC) model
  • midi2voice: Singing synthesis from MIDI file
  • OpenVoice: Instant voice cloning by MyShell
