Bradlee Speice, Tue 01 November 2016, Blog
I'm going to be working with some audio data for a while as I get prepared for a term project this semester. I'll be working (with a partner) to design a system for separating voices from music. Given my total lack of experience with Digital Signal Processing I figured that now was as good a time as ever to work on a couple of fun projects that would get me back up to speed.
The first project I want to work on: Designing a new compression scheme for audio data.
Audio files, when uncompressed (files ending with .wav), are huge. Like, 10.5 megabytes per minute huge (44,100 samples per second × 2 channels × 2 bytes per sample × 60 seconds ≈ 10.5 MB). Storage is cheap these days, but that's still an incredible amount of data that we don't really need. Instead, we'd like to compress that data so that it's not taking up so much space. There are broadly two ways to accomplish this:
Lossless compression - Formats like FLAC, ALAC, and Monkey's Audio (.ape) all go down this route. The idea is that when you compress and uncompress a file, you get exactly the same as what you started with.
Lossy compression - Formats like MP3, Ogg, and AAC (.m4a) are far more popular, but make a crucial tradeoff: we can reduce the file size even more during compression, but the decompressed file won't be the same.
There is a fundamental tradeoff at stake: Using lossy compression sacrifices some of the integrity of the resulting file to save on storage space. Most people (I personally believe it's everybody) can't hear the difference, so this is an acceptable tradeoff. You have files that take up a 10th of the space, and nobody can tell there's a difference in audio quality.
What I want to try out is a PCA approach to encoding audio. The PCA technique comes from Machine Learning, where it is used for a process called Dimensionality Reduction. Put simply, the idea is the same as lossy compression: if we can find a way that represents the data well enough, we can save on space. There are a lot of theoretical concerns that lead me to believe this compression style will not end well, but I'm interested to try it nonetheless.
PCA works as follows: Given a dataset with a number of features, I find a way to approximate those original features using some "new features" that are statistically as close as possible to the original ones. This is comparable to a scheme like MP3: Given an original signal, I want to find a way of representing it that gets approximately close to what the original was. The difference is that PCA is designed for statistical data, and not signal data. But we won't let that stop us.
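To make the dimensionality-reduction idea concrete before we touch any audio, here's a minimal sketch on made-up data (this toy example is my own illustration, not part of the compression scheme itself): 100 samples with 4 features each get squeezed down to 2 "new features," then expanded back into an approximation of the original.

from sklearn.decomposition import PCA
import numpy as np

# Toy data: 100 samples, 4 features (entirely made up)
rng = np.random.RandomState(0)
data = rng.randn(100, 4)

# Approximate the 4 original features using 2 components
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)          # shape (100, 2): the compact form
restored = pca.inverse_transform(reduced)  # shape (100, 4): the approximation

print(np.abs(data - restored).mean())      # average reconstruction error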
The idea is as follows: Given a signal, reshape it into 1024 columns by however many rows are needed (zero-padded if necessary). Run the PCA algorithm, and do dimensionality reduction with a couple different settings. The number of components I choose determines the quality: If I use 1024 components, I will essentially be using the original signal. If I use a smaller number of components, I start losing some of the data that was in the original file. This will give me an idea of whether it's possible to actually build an encoding scheme off of this, or whether I'm wasting my time.
The audio I will be using comes from the song Tabulasa, by Broke for Free. I'll be loading the audio signal into Python and using Scikit-Learn to actually run the PCA algorithm.
We first need to convert the FLAC file I have to a WAV:
!ffmpeg -hide_banner -loglevel panic -i "Broke For Free/XXVII/01 Tabulasa.flac" -c:a pcm_s16le "Tabulasa.wav"
Then, let's go ahead and load a small sample so you can hear what is going on.
from IPython.display import Audio
from scipy.io import wavfile
samplerate, tabulasa = wavfile.read('Tabulasa.wav')
start = samplerate * 14  # start 14 seconds in
end = start + samplerate * 10  # 10 second duration
Audio(data=tabulasa[start:end, 0], rate=samplerate)
Next, we'll define the code we will be using to do PCA. It's quite short, since the PCA algorithm itself is simple.
from sklearn.decomposition import PCA
import numpy as np
def pca_reduce(signal, n_components, block_size=1024):
    # First, zero-pad the signal so that its length is divisible by block_size
    # (the modulo keeps us from adding a full extra block when it already is)
    samples = len(signal)
    hanging = (block_size - np.mod(samples, block_size)) % block_size
    padded = np.pad(signal, (0, hanging), 'constant', constant_values=0)

    # Reshape the signal into rows of block_size samples each
    reshaped = padded.reshape((len(padded) // block_size, block_size))

    # Second, do the actual PCA process
    pca = PCA(n_components=n_components)
    pca.fit(reshaped)

    transformed = pca.transform(reshaped)
    reconstructed = pca.inverse_transform(transformed).reshape(len(padded))
    return pca, transformed, reconstructed
Now that we've got our functions set up, let's try actually running something. First, we'll use n_components == block_size, which implies that we should end up with the same signal we started with.
tabulasa_left = tabulasa[:,0]
_, _, reconstructed = pca_reduce(tabulasa_left, 1024, 1024)
Audio(data=reconstructed[start:end], rate=samplerate)
OK, that does indeed sound like what we originally had. As a sanity check, let's drastically cut down the number of components: the audio quality should become incredibly poor.
_, _, reconstructed = pca_reduce(tabulasa_left, 32, 1024)
Audio(data=reconstructed[start:end], rate=samplerate)
As expected, our reconstructed audio does sound incredibly poor! But there's something else very interesting going on here under the hood. Did you notice that the bassline comes across very well, but that there's no midrange or treble? The drums are almost entirely gone.
A fuller understanding of PCA will help when reading this part, but I'll do my best to break it down. PCA tries to find a way to best represent the dataset using "components." Think of each "component" as containing some of the information you need in order to reconstruct the full audio. For example, you might have a "low frequency" component that contains all the information you need in order to hear the bassline. There might be other components that explain the high frequency things like singers, or melodies, that you also need.
What makes PCA interesting is that it attempts to find the "most important" components in explaining the signal. In a signal processing world, this means that PCA is trying to find the signal amongst the noise in your data. In our case, this means that PCA, when forced to work with small numbers of components, will chuck out the noisy components first. It's doing its best to reconstruct the signal, but it has to make sacrifices somewhere.
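You can actually see this ranking for yourself. Since pca_reduce hands back the fitted pca object, a quick peek (my addition, not part of the original analysis) at its explained_variance_ratio_ shows how lopsided the components are:

# Each entry is the fraction of the dataset's variance that a component
# explains; scikit-learn sorts them from most important to least.
pca, _, _ = pca_reduce(tabulasa_left, 32, 1024)
print(pca.explained_variance_ratio_[:5])    # the first few dominate
print(pca.explained_variance_ratio_.sum())  # total captured by all 32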
So I've mentioned that PCA identifies the "noisy" components in our dataset. This is equivalent to saying that PCA removes the "high frequency" components in this case: it's very easy to represent a low-frequency signal like a bassline. It's far more difficult to represent a high-frequency signal because it's changing all the time. When you force PCA to make a tradeoff by using a small number of components, the best it can hope to do is replicate the low-frequency sections and skip the high-frequency things.
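If you'd rather verify the low-pass claim than take my word for it, here's a rough check (again my addition), comparing how much high-frequency energy survives in the 32-component reconstruction from above:

# Compare high-frequency energy in the original vs. the 32-component
# reconstruction over the same 10-second window.
orig_spectrum = np.abs(np.fft.rfft(tabulasa_left[start:end]))
recon_spectrum = np.abs(np.fft.rfft(reconstructed[start:end]))

freqs = np.fft.rfftfreq(end - start, d=1 / samplerate)
high = freqs > 2000  # everything above ~2 kHz

# A ratio well below 1 means the reconstruction dropped most of the treble
print(recon_spectrum[high].sum() / orig_spectrum[high].sum())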
This is a very interesting insight, and it also has echoes (pardon the pun) of how humans understand music in general. Other encoding schemes (like MP3, etc.) typically chop off a lot of the high-frequency range as well. There is usually a lot of high-frequency noise in audio that is nearly impossible to hear, so it's easy to remove without anyone noticing. PCA ends up doing something similar, and while that certainly wasn't the intention, it is an interesting effect.
So we've seen the edge cases so far: Using a large number of components results in audio very close to the original, and using a small number of components acts as a low-pass filter. How about we develop something that sounds "good enough" in practice, that we can use as a benchmark for size? We'll use ourselves as judges of audio quality, and build another function to help us estimate how much space we need to store everything in.
from bz2 import compress
import pandas as pd
def raw_estimate(transformed, pca):
    # The transformed signal comes out of PCA as 64-bit floats,
    # meaning 8 bytes per value
    signal_bytes = transformed.tobytes()
    # PCA stores the components as 64-bit floats too,
    # so 8 bytes per element there as well
    component_bytes = pca.components_.tobytes()

    # Return a result in megabytes
    return (len(signal_bytes) + len(component_bytes)) / (2**20)
# Do an estimate for lossless compression applied on top of our
# PCA reduction
def bz2_estimate(transformed, pca):
    bytestring = transformed.tobytes() + b';' + pca.components_.tobytes()
    compressed = compress(bytestring)
    return len(compressed) / (2**20)
# (n_components, block_size) pairs to try
compression_attempts = [
    (1, 1),
    (1, 2),
    (1, 4),
    (4, 32),
    (16, 256),
    (32, 256),
    (64, 256),
    (128, 1024),
    (256, 1024),
    (512, 1024),
    (128, 2048),
    (256, 2048),
    (512, 2048),
    (1024, 2048)
]
def build_estimates(signal, n_components, block_size):
    pca, transformed, recon = pca_reduce(signal, n_components, block_size)
    raw_pca_estimate = raw_estimate(transformed, pca)
    bz2_pca_estimate = bz2_estimate(transformed, pca)
    raw_size = len(recon.tobytes()) / (2**20)
    return raw_size, raw_pca_estimate, bz2_pca_estimate
pca_compression_results = pd.DataFrame([
    build_estimates(tabulasa_left, n, bs)
    for n, bs in compression_attempts
])

pca_compression_results.columns = ["Raw", "PCA", "PCA w/ BZ2"]
pca_compression_results.index = compression_attempts
pca_compression_results
As we can see, there are a couple of instances where we do nearly 20 times better on storage space than the uncompressed file. Let's hear what that sounds like:
_, _, reconstructed = pca_reduce(tabulasa_left, 16, 256)
Audio(data=reconstructed[start:end], rate=samplerate)
That one sounds pretty awful, though. Let's try something that's a bit more realistic:
_, _, reconstructed = pca_reduce(tabulasa_left, 1, 4)
Audio(data=reconstructed[start:end], rate=samplerate)
And just out of curiosity, we can try something that has the same ratio of components to block size. This should be close to an apples-to-apples comparison.
_, _, reconstructed = pca_reduce(tabulasa_left, 64, 256)
Audio(data=reconstructed[start:end], rate=samplerate)
The smaller block size definitely has better high-end response, but I personally think the larger block size sounds better overall.
So, what do I think about audio compression using PCA?
Strangely enough, it actually works pretty well relative to what I expected. That said, it's a terrible idea in general.
First off, you don't really save any space. The component matrix needed to actually run the PCA algorithm takes up a lot of space on its own, so it's very difficult to save space without sacrificing a huge amount of audio quality. And even then, codecs like AAC sound very nice even at bitrates that this PCA method could only dream of.
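To put a number on that component-matrix overhead (my own back-of-the-envelope arithmetic, not a measurement from the table above), consider the (512, 1024) setting:

# The component matrix alone, stored as 64-bit floats:
# 512 components x 1024 samples per block x 8 bytes each
overhead_mb = 512 * 1024 * 8 / 2**20
print(overhead_mb)  # 4.0 MB of overhead before storing a single sample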
Second, there's the issue of audio streaming. PCA relies on two pieces: the data stream, and the matrix used to reconstruct the original signal. While it is easy to stream the data, you can't stream that matrix. And even if you divided the stream up into small blocks to give you a small matrix, you must guarantee that the matrix arrives; if you don't have that matrix, the data stream will make no sense whatsoever.
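To make that concrete: reconstruction is essentially a matrix product against the stored components (plus the per-feature mean), so a receiver missing pca.components_ has nothing to multiply by. A quick demonstration of the equivalence (my addition, using scikit-learn's default non-whitened PCA):

# inverse_transform is effectively `transformed @ pca.components_ + pca.mean_`;
# lose the matrix and the coefficient stream is undecodable.
pca, transformed, _ = pca_reduce(tabulasa_left, 64, 256)
manual = transformed @ pca.components_ + pca.mean_
print(np.allclose(manual, pca.inverse_transform(transformed)))  # True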
All said, this was an interesting experiment. It's really cool seeing PCA used for signal analysis where I haven't seen it applied before, but I don't think it will lead to any practical results. Look forward to more signal processing stuff in the future!