Music2Latent2: Audio Compression with Summary Embeddings and Autoregressive Decoding
Marco Pasini¹, Stefan Lattner², George Fazekas¹
¹ Queen Mary University of London
² Sony Computer Science Laboratories Paris
Abstract
Efficiently compressing high-dimensional audio signals into a compact and informative latent space is crucial for various tasks, including generative modeling and music information retrieval (MIR). Existing audio autoencoders, however, often struggle to achieve high compression ratios while preserving audio fidelity and facilitating efficient downstream applications. We introduce Music2Latent2, a novel audio autoencoder that addresses these limitations by leveraging consistency models and a new approach to representation learning based on unordered latent embeddings, which we call summary embeddings. Unlike conventional methods that encode local audio features into ordered sequences, Music2Latent2 compresses audio signals into sets of summary embeddings, where each embedding can capture distinct global features of the input sample. This enables higher reconstruction quality at the same compression ratio. To handle arbitrary audio lengths, Music2Latent2 employs an autoregressive consistency model trained on two consecutive audio chunks with causal masking, ensuring coherent reconstruction across segment boundaries. Additionally, we propose a two-step decoding procedure that leverages the denoising capabilities of consistency models to further refine the generated audio at no additional cost. Our experiments demonstrate that Music2Latent2 outperforms existing continuous audio autoencoders in both audio quality and performance on downstream tasks. Music2Latent2 paves the way for new possibilities in audio compression.
Architecture
Convolutional patchifiers and de-patchifiers are indicated with P, transformer modules with T. Audio embeddings are illustrated as A, learned/summary embeddings as L, and mask embeddings as M. We represent chunked causal masking with a curved arrow.
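
The chunked causal masking indicated by the curved arrow can be illustrated with a small sketch. This is not the authors' implementation: the function name `chunked_causal_mask`, the NumPy representation, and the toy chunk lengths are assumptions made for illustration only. It assumes that tokens attend freely within their own chunk and to every earlier chunk, but never to a later one.

```python
import numpy as np


def chunked_causal_mask(chunk_lengths):
    """Return a boolean mask where entry [i, j] is True if token i may attend to token j.

    Hypothetical helper: tokens see their own chunk and all earlier chunks,
    but no later chunk.
    """
    # Chunk index of every token, e.g. lengths [2, 3] -> [0, 0, 1, 1, 1].
    chunk_ids = np.repeat(np.arange(len(chunk_lengths)), chunk_lengths)
    # Rows are query tokens, columns are key tokens.
    # A query may attend to a key whose chunk is the same or earlier.
    return chunk_ids[:, None] >= chunk_ids[None, :]


# Two consecutive chunks with 2 and 3 tokens (toy sizes).
mask = chunked_causal_mask([2, 3])
print(mask.astype(int))
```

With two chunks, tokens of the first chunk see only the first chunk, while tokens of the second chunk see both. The second chunk can therefore be conditioned on the first, which is the property the abstract relies on for coherent reconstruction across segment boundaries.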

Inference/Decoding

Audio Examples
We compare the reconstructions of Music2Latent2 and Music2Latent2_stereo against baselines on MusicCaps evaluation samples. We also include reconstructions from Descript Audio Codec (DAC). Although DAC is not directly comparable, since it encodes audio into discrete tokens instead of continuous embeddings at a much higher sampling rate, we believe a comparison between the two models may still be valuable.
| Original | Music2Latent2 | Music2Latent2 Stereo | Musika | LatMusic |
|---|---|---|---|---|
| Mousaiv2 | Mousaiv3 | Music2Latent | StableAudio | DAC |