Towards a new (and pretty poor) compression scheme
--------------------------------------------------

I'm going to be working with some audio data for a while as I get prepared for a term project this semester. I'll be working (with a partner) to design a system for separating voices from music. Given my total lack of experience with [Digital Signal Processing][1], I figured that now was as good a time as ever to work on a couple of fun projects that would get me back up to speed.

The first project I want to work on: designing a new compression scheme for audio data.

A Brief Introduction to Audio Compression
-----------------------------------------

Uncompressed audio files (files ending with `.wav`) are huge. Like, 10.5 megabytes per minute huge (CD-quality audio is 44,100 samples per second × 2 channels × 2 bytes per sample, which works out to roughly 10.5 MB per minute). Storage is cheap these days, but that's still far more data than we really need. Instead, we'd like to compress that data so that it's not taking up so much space. There are broadly two ways to accomplish this:

1. Lossless compression - Formats like [FLAC][2], [ALAC][3], and [Monkey's Audio (`.ape`)][4] all go down this route. The idea is that when you compress and decompress a file, you get back exactly what you started with.

2. Lossy compression - Formats like [MP3][5], [Ogg][6], and [AAC (`.m4a`)][7] are far more popular, but make a crucial tradeoff: we can reduce the file size even more during compression, but the decompressed file won't be identical to the original.

There is a fundamental tradeoff at stake: using lossy compression sacrifices some of the integrity of the resulting file to save on storage space. Most people (I personally believe it's everybody) can't hear the difference, so this is an acceptable tradeoff. You have files that take up a tenth of the space, and nobody can tell there's a difference in audio quality.

A PCA-based Compression Scheme
------------------------------

What I want to try out is a [PCA][8] approach to encoding audio. The PCA technique comes from Machine Learning, where it is used for a process called [Dimensionality Reduction][9]. Put simply, the idea is the same as lossy compression: if we can find a way to represent the data well enough, we can save on space. There are a lot of theoretical concerns that lead me to believe this compression style will not end well, but I'm interested to try it nonetheless.

PCA works as follows: given a dataset with a number of features, I find a way to approximate those original features using some "new features" that are statistically as close as possible to the original ones. This is comparable to a scheme like MP3: given an original signal, I want to find a way of representing it that gets approximately close to what the original was. The difference is that PCA is designed for statistical data, not signal data. But we won't let that stop us.

The idea is as follows: given a signal, reshape it into 1024 columns by however many rows are needed, zero-padded if necessary (a small toy example of this reshaping appears just before we load the real audio below). Run the PCA algorithm, and do dimensionality reduction with a couple of different settings. The number of components I choose determines the quality: if I use 1024 components, I will essentially be using the original signal. If I use a smaller number of components, I start losing some of the data that was in the original file. This will give me an idea of whether it's possible to actually build an encoding scheme off of this, or whether I'm wasting my time.

Running the Algorithm
---------------------

The audio I will be using comes from the song [Tabulasa][10], by [Broke for Free][11].
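To make the reshaping step concrete, here's a quick toy sketch with a made-up 10-sample signal and a block size of 4 (both chosen purely for illustration, not values used anywhere else in this post):

```python
import numpy as np

# A made-up "signal" of 10 samples and a toy block size of 4,
# used only to illustrate the zero-pad-and-reshape step.
signal = np.arange(10)
block_size = 4

# Zero-pad so the length is a multiple of the block size, then fold
# the 1-D signal into rows of block_size samples each.
hanging = block_size - len(signal) % block_size
padded = np.pad(signal, (0, hanging), 'constant')
print(padded.reshape(-1, block_size))
# [[0 1 2 3]
#  [4 5 6 7]
#  [8 9 0 0]]
```

Each row then becomes one observation for PCA, and each of the `block_size` columns becomes one feature.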
I'll be loading the audio signal into Python and using [Scikit-Learn][12] to actually run the PCA algorithm. We first need to convert the FLAC file I have to a WAV:

[1]: https://en.wikipedia.org/wiki/Digital_signal_processing
[2]: https://en.wikipedia.org/wiki/FLAC
[3]: https://en.wikipedia.org/wiki/Apple_Lossless
[4]: https://en.wikipedia.org/wiki/Monkey%27s_Audio
[5]: https://en.wikipedia.org/wiki/MP3
[6]: https://en.wikipedia.org/wiki/Vorbis
[7]: https://en.wikipedia.org/wiki/Advanced_Audio_Coding
[8]: https://en.wikipedia.org/wiki/Principal_component_analysis
[9]: https://en.wikipedia.org/wiki/Dimensionality_reduction
[10]: https://brokeforfree.bandcamp.com/track/tabulasa
[11]: https://brokeforfree.bandcamp.com/album/xxvii
[12]: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

```python
!ffmpeg -hide_banner -loglevel panic -i "Broke For Free/XXVII/01 Tabulasa.flac" "Tabulasa.wav"
```

Then, let's go ahead and load a small sample so you can hear what is going on.

```python
from IPython.display import Audio
from scipy.io import wavfile

samplerate, tabulasa = wavfile.read('Tabulasa.wav')

start = samplerate * 14  # start 14 seconds in
end = start + samplerate * 10  # 10-second clip
Audio(data=tabulasa[start:end, 0], rate=samplerate)
```

Next, we'll define the code we will be using to do PCA. It's very short, as the PCA algorithm is very simple.

```python
from sklearn.decomposition import PCA
import numpy as np

def pca_reduce(signal, n_components, block_size=1024):
    # First, zero-pad the signal so that its length is divisible by block_size
    samples = len(signal)
    hanging = block_size - np.mod(samples, block_size)
    padded = np.lib.pad(signal, (0, hanging), 'constant', constant_values=0)

    # Reshape the signal into rows of block_size samples each
    reshaped = padded.reshape((len(padded) // block_size, block_size))

    # Second, do the actual PCA process
    pca = PCA(n_components=n_components)
    pca.fit(reshaped)

    transformed = pca.transform(reshaped)
    reconstructed = pca.inverse_transform(transformed).reshape((len(padded)))
    return pca, transformed, reconstructed
```

Now that we've got our functions set up, let's try actually running something. First, we'll use `n_components == block_size`, which implies that we should end up with the same signal we started with.

```python
tabulasa_left = tabulasa[:, 0]

_, _, reconstructed = pca_reduce(tabulasa_left, 1024, 1024)

Audio(data=reconstructed[start:end], rate=samplerate)
```

OK, that does indeed sound like what we originally had. Let's drastically cut down the number of components as a sanity check: the audio quality should become incredibly poor.

```python
_, _, reconstructed = pca_reduce(tabulasa_left, 32, 1024)

Audio(data=reconstructed[start:end], rate=samplerate)
```

As expected, our reconstructed audio does sound incredibly poor! But there's something else very interesting going on here under the hood. Did you notice that the bassline comes across very well, but that there's no midrange or treble? The drums are almost entirely gone.

[Drop the (Treble)][13]
-----------------------

It will help to understand PCA a bit more fully when reading this part, but I'll do my best to break it down. PCA tries to find a way to best represent the dataset using "components." Think of each "component" as containing some of the information you need in order to reconstruct the full audio. For example, you might have a "low frequency" component that contains all the information you need in order to hear the bassline.
There might be other components that explain the high-frequency things like singers or melodies that you also need. What makes PCA interesting is that it attempts to find the "most important" components for explaining the signal. In a signal processing world, this means that PCA is trying to find the signal amongst the noise in your data.

In our case, this means that PCA, when forced to work with a small number of components, will chuck out the noisy components first. It's doing its best to reconstruct the signal, but it has to make sacrifices somewhere.

So I've mentioned that PCA identifies the "noisy" components in our dataset. This is equivalent to saying that PCA removes the "high frequency" components in this case: it's very easy to represent a low-frequency signal like a bassline. It's far more difficult to represent a high-frequency signal because it's changing all the time. When you force PCA to make a tradeoff by using a small number of components, the best it can hope to do is replicate the low-frequency sections and skip the high-frequency things.

This is a very interesting insight, and it also has echoes (pardon the pun) of how humans understand music in general. Other encoding schemes (like MP3, etc.) typically chop off a lot of the high-frequency range as well. There is typically a lot of high-frequency noise in audio that is nearly impossible to hear, so it's easy to remove it without anyone noticing. PCA ends up doing something similar, and while that certainly wasn't the intention, it is an interesting effect.

A More Realistic Example
------------------------

So we've seen the edge cases so far: using a large number of components results in audio very close to the original, and using a small number of components acts as a low-pass filter. How about we develop something that sounds "good enough" in practice, that we can use as a benchmark for size? We'll use ourselves as judges of audio quality, and build another function to help us estimate how much space we need to store everything in.

[13]: https://youtu.be/Ua0KpfJsxKo?t=1m17s

```python
from bz2 import compress
import pandas as pd

def raw_estimate(transformed, pca):
    # The transformed blocks are stored as 64-bit floats,
    # meaning eight bytes per value
    signal_bytes = transformed.tobytes()
    # The PCA components are also stored as 64-bit floats,
    # so eight bytes per element
    component_bytes = pca.components_.tobytes()

    # Return a result in megabytes
    return (len(signal_bytes) + len(component_bytes)) / (2**20)

# Do an estimate for lossless compression applied on top of our
# PCA reduction
def bz2_estimate(transformed, pca):
    bytestring = transformed.tobytes() + b';' + pca.components_.tobytes()
    compressed = compress(bytestring)
    return len(compressed) / (2**20)

compression_attempts = [
    (1, 1),
    (1, 2),
    (1, 4),
    (4, 32),
    (16, 256),
    (32, 256),
    (64, 256),
    (128, 1024),
    (256, 1024),
    (512, 1024),
    (128, 2048),
    (256, 2048),
    (512, 2048),
    (1024, 2048)
]

def build_estimates(signal, n_components, block_size):
    pca, transformed, recon = pca_reduce(signal, n_components, block_size)

    raw_pca_estimate = raw_estimate(transformed, pca)
    bz2_pca_estimate = bz2_estimate(transformed, pca)
    raw_size = len(recon.tobytes()) / (2**20)

    return raw_size, raw_pca_estimate, bz2_pca_estimate

pca_compression_results = pd.DataFrame([
    build_estimates(tabulasa_left, n, bs)
    for n, bs in compression_attempts
])

pca_compression_results.columns = ["Raw", "PCA", "PCA w/ BZ2"]
pca_compression_results.index = compression_attempts
pca_compression_results
```
(components, block size) | Raw (MB) | PCA (MB) | PCA w/ BZ2 (MB) |
---|---|---|---|
(1, 1) | 69.054298 | 138.108597 | 16.431797 |
(1, 2) | 69.054306 | 69.054306 | 32.981380 |
(1, 4) | 69.054321 | 34.527161 | 16.715032 |
(4, 32) | 69.054443 | 17.263611 | 8.481735 |
(16, 256) | 69.054688 | 8.631836 | 4.274846 |
(32, 256) | 69.054688 | 17.263672 | 8.542909 |
(64, 256) | 69.054688 | 34.527344 | 17.097543 |
(128, 1024) | 69.054688 | 17.263672 | 9.430644 |
(256, 1024) | 69.054688 | 34.527344 | 18.870387 |
(512, 1024) | 69.054688 | 69.054688 | 37.800940 |
(128, 2048) | 69.062500 | 8.632812 | 6.185015 |
(256, 2048) | 69.062500 | 17.265625 | 12.366942 |
(512, 2048) | 69.062500 | 34.531250 | 24.736506 |
(1024, 2048) | 69.062500 | 69.062500 | 49.517493 |
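To make those numbers a little easier to compare, here's a small optional sketch that just divides the two estimate columns by the raw size; it performs no new measurements, only arithmetic on the table above:

```python
# Express each size estimate as a fraction of the raw, uncompressed size.
# This only rearranges the numbers already shown in the table above.
ratios = pca_compression_results[["PCA", "PCA w/ BZ2"]].div(
    pca_compression_results["Raw"], axis=0)
ratios.columns = ["PCA / Raw", "PCA w/ BZ2 / Raw"]
ratios
```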