Not long ago I was reading about t-Distributed Stochastic Neighbor Embedding (t-SNE), a very interesting dimension reduction technique, and about the Mel-frequency cepstrum, a sound processing technique. Details of both techniques can be found here and here. Can we combine the two in a data analysis exercise? Yes, and with not too much R code you can quickly create some visuals to get ‘musical’ insights.
Where can you get some sample audio files? Spotify! There is a Spotify API which allows you to get information on playlists, artists, tracks, etc. Moreover, for many songs (though not all) Spotify provides downloadable 30-second preview mp3’s. The link to the preview mp3 can be retrieved from the API. I am going to use some of these mp3’s for analysis.
In the Spotify web interface you can look for interesting playlists. In the search field, type for example ‘Bach’ (my favorite classical composer). In the search results, go to the playlists tab; you’ll find many ‘Bach’ playlists from different users, including the ‘user’ Spotify itself. Now, given the user_id (spotify) and the specific playlist_id (37i9dQZF1DWZnzwzLBft6A for the Bach playlist from Spotify), we can extract all the songs using the API:
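A minimal sketch of such a call in R, assuming the httr package is installed, a valid OAuth access token is at hand, and the current /v1/playlists/{playlist_id}/tracks endpoint of the Spotify Web API is used (it does not need the user_id). The helper names tracks_to_df and get_playlist_tracks are hypothetical, and paging through long playlists is ignored:

```r
# Hypothetical helper: reduce a parsed playlist response to one row per track.
# `parsed` is the nested list that httr::content() returns for the JSON body.
tracks_to_df = function(parsed) {
  rows = lapply(parsed$items, function(item) {
    tr = item$track
    preview = tr$preview_url
    data.frame(
      artist      = tr$artists[[1]]$name,   # first listed artist only
      song        = tr$name,
      # not every track has a preview, so a missing URL becomes NA
      preview_url = if (is.null(preview)) NA_character_ else preview,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Hypothetical helper: fetch the tracks of one playlist.
# `token` is assumed to be an OAuth access token obtained beforehand.
get_playlist_tracks = function(playlist_id, token) {
  url  = paste0("https://api.spotify.com/v1/playlists/", playlist_id, "/tracks")
  resp = httr::GET(url, httr::add_headers(Authorization = paste("Bearer", token)))
  httr::stop_for_status(resp)
  tracks_to_df(httr::content(resp))
}

# Usage (needs a real token):
# bach = get_playlist_tracks("37i9dQZF1DWZnzwzLBft6A", token)
# bach = bach[!is.na(bach$preview_url), ]
# download.file(bach$preview_url[1], "mp3songs/bach1.mp3", mode = "wb")
```

Keeping the parsing in its own function makes it easy to check against a small hand-made response without calling the API.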
You will get the 50 Bach songs from the playlist; most of them have a preview mp3. Let’s also get the songs from a Heavy Metal playlist and a Michael Jackson playlist. In total I have 146 songs with preview mp3’s in three ‘categories’:
- Bach,
- Heavy Metal,
- Michael Jackson.
Transforming audio mp3’s to features
The mp3 files need to be transformed into data that I can use for machine learning. I am going to use the Python librosa package to do this; it is easy to call from R using the reticulate package.
    library(reticulate)

    # Point reticulate at a Python environment with the librosa module installed.
    # This has to happen before the first import.
    use_python(python = "/usr/bin/python3")
    librosa = import("librosa")
Loaded with librosa’s default settings, the preview mp3’s have a sample rate of 22,050 Hz. So a 30-second audio file has 661,500 raw audio data points in total.
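    onemp3 = librosa$load("mp3songs/bach1.mp3")

    # librosa$load returns a list: [[1]] is the audio signal, [[2]] the sample rate
    length(onemp3[[1]])
    length(onemp3[[1]]) / onemp3[[2]]  # ~30 seconds of sound

    ## plot the first 5 seconds
    pp = 5 * onemp3[[2]]
    plot(onemp3[[1]][1:pp], type = "l")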
A line plot of the raw audio values looks like this.
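For sound processing, feature extraction on the raw audio signal is often applied first. A commonly used feature extraction method is Mel-Frequency Cepstral Coefficients (MFCC). We can calculate these features for a song with librosa. (Strictly speaking, the code below computes a log-scaled Mel spectrogram; true MFCCs would add a discrete cosine transform on top of it.)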
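    ff = librosa$feature

    # Mel spectrogram with 96 Mel bands, converted to a log (dB) scale.
    # librosa versions before 0.6 used librosa$logamplitude(..., ref_power = 1.0) here.
    mel = librosa$power_to_db(
      ff$melspectrogram(y = onemp3[[1]], sr = onemp3[[2]], n_mels = 96L),
      ref = 1.0
    )
    image(mel)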
Each mp3 is now a matrix of MFC coefficients, as shown in the figure above. We have fewer data points than the original 661,500, but still quite a lot. In our example the MFCC form a 96 by 1292 matrix, so 124,032 values. We apply t-SNE dimension reduction to the MFCC values.
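These sizes can be checked with a little arithmetic. A minimal sketch, assuming librosa’s default hop length of 512 samples and its default centered framing (neither is set explicitly in the code above, so both are assumptions about the defaults):

```r
sample_rate = 22050   # samples per second after loading
seconds     = 30      # length of a preview
hop_length  = 512     # assumed: librosa's default hop between frames
n_mels      = 96      # number of Mel bands requested

n_samples = sample_rate * seconds          # 661500 raw audio values
n_frames  = 1 + n_samples %/% hop_length   # 1292 columns with centered frames
n_values  = n_mels * n_frames              # 124032 values per song

c(n_samples, n_frames, n_values)
```

So the feature matrix is roughly five times smaller than the raw signal, but it is still far too wide to plot directly.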
The approach is simple and easy: each matrix is just flattened, so a song becomes a vector of length 124,032. The data set on which we apply t-SNE consists of 146 records with 124,032 columns, which we will reduce to 3 columns with the Rtsne package:
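A minimal sketch of the flattening step, assuming all per-song matrices have exactly the same shape (previews of slightly different length would need trimming or padding first). The helper name flatten_songs and the toy matrices are hypothetical stand-ins for the real 96 by 1292 matrices:

```r
# Hypothetical helper: turn a list of equally sized matrices into
# one matrix with a row per song and a column per matrix cell.
flatten_songs = function(mfcc_list) {
  # as.vector() unrolls each matrix column by column
  do.call(rbind, lapply(mfcc_list, as.vector))
}

# Two toy "songs" of 2 by 3 values each
toy = list(
  matrix(1:6,  nrow = 2),
  matrix(7:12, nrow = 2)
)

AllSongs = flatten_songs(toy)
dim(AllSongs)   # 2 rows (songs), 6 columns (flattened values)
```

With the real data the same call would give the 146 by 124,032 matrix that goes into t-SNE.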
    library(Rtsne)
    tsne_out = Rtsne(AllSongsMFCCMatrix, dims = 3)
The output object contains the 3 columns. I have joined them back with the data on the artists and song names so that I can create an interactive 3D scatter plot with R plotly. Below is a screenshot; the interactive one can be found here.
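A minimal sketch of that join and plot, assuming the plotly package is installed. The matrix Y stands in for tsne_out$Y, and the songs data frame with its column names is a hypothetical placeholder for the Spotify metadata:

```r
# Stand-in data: in the real analysis Y would be tsne_out$Y
set.seed(1)
Y = matrix(rnorm(6 * 3), ncol = 3)
songs = data.frame(
  artist = rep(c("Bach", "Heavy Metal", "Michael Jackson"), each = 2),
  song   = paste("song", 1:6),
  stringsAsFactors = FALSE
)

# Rtsne keeps the row order of its input, so the three embedding
# columns can simply be bound to the metadata row by row.
plotdata = cbind(songs, data.frame(x = Y[, 1], y = Y[, 2], z = Y[, 3]))

# Build the interactive 3D scatter plot only if plotly is available
if (requireNamespace("plotly", quietly = TRUE)) {
  p = plotly::plot_ly(
    plotdata, x = ~x, y = ~y, z = ~z,
    color = ~artist, text = ~song,
    type = "scatter3d", mode = "markers"
  )
  # typing `p` at the console displays the plot
}
```

Coloring by artist makes the three categories easy to tell apart, and the song name shows up when hovering over a point.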
It is obvious that Bach music, heavy metal and Michael Jackson are different; you don’t need machine learning to hear that. So, as expected, a straightforward dimension reduction on these songs with MFCC and t-SNE clearly shows the differences in a 3D space. Some Michael Jackson songs are very close to heavy metal 🙂 The complete R code can be found here.