Show HN: An interpretable "taste map" of my Spotify library



Over the last semester, I wanted to see whether I could formalize my music preferences numerically: model the distinctions I felt but couldn’t articulate, and use them to automatically sort my liked songs into playlists. Could I build something that surfaces patterns in my library I sense but can’t describe?

Spotify Wrapped dropped around then and, as usual, offered nothing beyond top-5 artists and genre percentages. So I looked into it.

I figured Spotify’s API would give me what I needed. I remembered from an older project that it had useful endpoints: audio-features (danceability, valence, energy) and audio-analysis. Turns out both were deprecated in late 2024. So I took this as an opportunity to brush up on some ML and data science skills.

P.S. This post’s timing coincides with Anna’s Archive’s Spotify scrape. It included the features and analyses of 99.9% of Spotify’s library. This happened 3-4 days after I started this project. Had the leak dropped a week earlier, I wouldn’t have built this, and I’d have missed the temporal analysis that made the project worthwhile.



Part 0: Getting the Data

Before any analysis, I needed the raw materials:

Songs: I used Spotify’s API to get all my liked songs plus their metadata (1,488 songs; 1,452 after removing duplicates). A minimal fetch sketch follows below.

Audio: I used spotdl to download MP3s by matching tracks to YouTube. This worked: 100% match rate, with only a few quality misses on live recordings.

Lyrics: I fetched lyrics from Genius and Musixmatch. Got 1,148 (79%) matches. The misses were mostly instrumental tracks, live music/remixes, niche music, leaks, or very new releases.
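
For reference, the liked-songs pull is a short loop over Spotify’s paginated saved-tracks endpoint. A minimal sketch using spotipy (the dedup key and field selection here are my assumptions, not the exact script I ran):

# Sketch: fetch all liked songs via the Spotify Web API (spotipy).
# Assumes SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET / SPOTIPY_REDIRECT_URI are set.
import spotipy
from spotipy.oauth2 import SpotifyOAuth

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope="user-library-read"))

tracks, offset = [], 0
while True:
    page = sp.current_user_saved_tracks(limit=50, offset=offset)
    for item in page["items"]:
        t = item["track"]
        tracks.append({
            "id": t["id"],
            "name": t["name"],
            "artists": [a["name"] for a in t["artists"]],
            "popularity": t["popularity"],
            "release_date": t["album"]["release_date"],
            "added_at": item["added_at"],   # used later for the temporal analysis
        })
    if page["next"] is None:
        break
    offset += 50

# Deduplicate by (title, first artist); roughly how 1,488 became 1,452.
seen, deduped = set(), []
for t in tracks:
    key = (t["name"].lower(), t["artists"][0].lower())
    if key not in seen:
        seen.add(key)
        deduped.append(t)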

Final dataset: I wanted to keep instrumental tracks even though they have no lyrics. The logic:

for song in library:
    if has_lyrics(song):
        include(song)
    elif instrumentalness(song) >= 0.5:
        include(song)   # instrumental, lyrics not expected
    else:
        pass    # no lyrics, but should have had them

Note: instrumentalness (see Essentia Classifiers below) is derived from a classification model run over audio embeddings, not from Spotify metadata.

This gave me 1,253 (86%) songs with complete data. I could have manually searched for the remaining 14%, but the coverage was sufficient for clustering.

Part 1: Embedding the Audio

My first instinct was to use existing music embedding models. If vision has CLIP and language has BERT, surely music has something similar.

I first tried MERT, a late-2024 music representation transformer that’s (AFAIK) the state of the art. It produced 768-dimensional vectors, which I then PCA’d down to 128-D. I clustered those with KNN (N = 2-8), and the clusters mixed mellow jazz with aggressive EDM. I couldn’t tell what each playlist (cluster) was about except at N = 2, where one playlist was clearly slow/mellow and the other fast/aggressive. (Once I later switched to interpretable dimensions, the clusters became legible.) I then tried Essentia’s Discogs-EffNet embedding model (1,280-D) and CLAP-Music (512-D). Both scattered sonically similar tracks across 4+ clusters, i.e. they were terrible.

I couldn’t even debug what was going wrong. I tried PCA to reduce dimensionality, but that didn’t help me understand the failure. PCA finds directions of maximum variance, but variance != semantics. The top principal components captured “loudness” and “bass presence,” but this was all vibes-based and wasn’t going to take me very far. The whole pipeline was unsupervised: high-dimensional embedding → PCA → clustering. At no point could I inspect individual dimensions and say “this one needs work.”

The failure of embedding → clustering made sense: embeddings optimize for their training objective. MERT for masked prediction, EffNet for Discogs (genre) tags. The geometry of the space encodes their goals, not mine. And general-purpose embeddings are trained on everything from Beethoven to Beyoncé. My library, given my limited taste, occupies a tiny corner of that space, where the distinctions I care about (sad rap vs. narrative rap, warm jazz vs. cold EDM) are smaller than the noise floor.

I started thinking it’d be better to at least understand the dimensions one by one. Instead of letting the embedding define the space, I’d define the space myself and project songs onto it. This approach is called supervised dimensionality reduction: you choose the dimensions that matter to you, then train or use classifiers to score data along those dimensions. I stumbled upon this out of necessity rather than “a blog told me to do so.”

Essentia Classifiers

Essentia doesn’t just provide embeddings. It has dozens of pre-trained classifiers that output interpretable features: mood_*, danceability, instrumentalness, genre_*, and more. I first had to embed all the MP3s with Discogs-EffNet and MusiCNN, two different feature-extraction/embedding models, then run those embeddings through these classifiers:

Dim | Name | Model | Output | Description
0 | bpm | RhythmExtractor (DSP) | Float | Tempo in BPM, normalized to [0,1] via min-max
1 | danceability | Discogs-EffNet | [0,1] | Rhythm regularity, tempo stability
2 | instrumentalness | Discogs-EffNet | [0,1] | 0 = vocals present, 1 = instrumental
3 | valence | MusiCNN + DEAM | [1-9] → [0,1] | Emotional positivity
4 | arousal | MusiCNN + DEAM | [1-9] → [0,1] | Emotional energy/intensity
5 | engagement | Discogs-EffNet | [0,1] | Background music vs. active listening
6 | approachability | Discogs-EffNet | [0,1] | Mainstream accessibility vs. niche
7 | mood_happy | Discogs-EffNet | [0,1] | Joy/celebration presence
8 | mood_sad | Discogs-EffNet | [0,1] | Melancholy/grief presence
9 | mood_aggressive | Discogs-EffNet | [0,1] | Anger/intensity presence
10 | mood_relaxed | Discogs-EffNet | [0,1] | Calm/peace presence
11 | mood_party | Discogs-EffNet | [0,1] | Upbeat/celebratory energy
12 | voice_gender | Discogs-EffNet | [0,1] | Female (0) vs. male (1) vocals; 0.5 for instrumentals
13 | genre_fusion* | Discogs-EffNet | Entropy → [0,1] | Low = pure genre, high = genre fusion
14 | electronic_acoustic* | Discogs-EffNet | Computed | 0 = electronic, 1 = acoustic
15 | timbre_brightness | Discogs-EffNet | [0,1] | 0 = dark/mellow, 1 = bright/crisp production
16 | key_sin* | KeyExtractor (DSP) | [-0.33, 0.33] | Circular pitch encoding (sin component)
17 | key_cos* | KeyExtractor (DSP) | [-0.33, 0.33] | Circular pitch encoding (cos component)
18 | key_scale* | KeyExtractor (DSP) | 0 or 0.33 | Minor (0) vs. major (0.33)
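
To make the pipeline concrete, here is roughly what one embedding-plus-classifier pass looks like with Essentia’s TensorFlow models. The graph filenames and output node names below follow the naming on Essentia’s pre-trained model pages as I recall them; treat them as assumptions and check the current docs:

# Sketch: Discogs-EffNet embeddings -> danceability head, for one track.
from essentia.standard import MonoLoader, TensorflowPredictEffnetDiscogs, TensorflowPredict2D

audio = MonoLoader(filename="track.mp3", sampleRate=16000, resampleQuality=4)()

# Frame-level 1,280-D Discogs-EffNet embeddings.
embed_model = TensorflowPredictEffnetDiscogs(
    graphFilename="discogs-effnet-bs64-1.pb", output="PartitionedCall:1"
)
embeddings = embed_model(audio)

# Classifier head that consumes those embeddings (here: danceability).
clf = TensorflowPredict2D(
    graphFilename="danceability-discogs-effnet-1.pb", output="model/Softmax"
)
danceability = float(clf(embeddings).mean(axis=0)[0])  # average over frames, take P(danceable)

The MusiCNN-based heads (valence/arousal via DEAM) follow the same pattern with a MusiCNN embedding model in place of Discogs-EffNet.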

Most of these are straightforward [0,1] classifiers. But a few (marked with *) came from interesting empirical notes or needed a custom implementation.

What These Dimensions Actually Capture

Most dimensions do what you’d expect. BPM is just BPM. Approachability separates TikTok indie from rhythm game EDM. Engagement distinguishes background music from foreground music. These confirm intuitions rather than reveal anything new.

Two dimensions surprised me.

Danceability

Essentia’s danceability measures rhythm regularity, not “would people dance to this.” Drake averages 0.073, 59% below my library average of 0.178. The man who made One Dance scores lower than Marty Robbins.

Track | Artist | Danceability | Why?
Praise The Lord (Da Shine) | A$AP Rocky, Skepta | 0.000 | Flow changes, beat switches
Crazy | Doechii | 0.000 | Tempo shifts, varied samples
American Pie | Don McLean | 0.993 | Steady strum, metronomic tempo
A Town with an Ocean View | Joe Hisaishi | 0.992 | Consistent orchestral pulse

The model clearly penalizes the beat changes and varied samples in modern hip-hop. Praise The Lord has flow changes in each verse; the model reads that as rhythmic instability. Meanwhile, American Pie’s steady acoustic strum reads as perfectly “danceable.” The name is misleading: it’s really measuring metronome-likeness.

Valence x Arousal

These two dimensions together map emotional space better than either alone.

Quadrant | Example Track | Artist | Valence | Arousal
High/High (euphoric) | Anomaly | Camellia | 1.00 | 0.89
High/Low (peaceful) | A Town with an Ocean View | Joe Hisaishi | 0.71 | 0.34
Low/High (aggressive) | Hard Times | Paramore | 0.52 | 1.00
Low/Low (melancholic) | The Breaking of the Fellowship | Echoes of Old Stories | 0.00 | 0.00

The Breaking of the Fellowship is the absolute minimum for both (0.00, 0.00), the most melancholic and calm track in my library. Interestingly, t+pazolite’s jarring speedcore averages 0.78 valence, nearly 2x my library average. The model reads frenetic major-key energy as “happy,” which makes sense once you stop expecting valence to mean lyrical sentiment.

Custom Audio Dimensions

Genre Fusion (Dim 13)

I differentiate my music as genre soup (jazz fusion, experimental hip hop) vs. one-trick ponies (pure trap, straightforward pop). I wanted to capture that.

Essentia’s genre classifier outputs probabilities across 400 Discogs genres. I computed the entropy of this distribution:

  • Low entropy → pure genre. The model is confident the song belongs to one category. Think Playboi Carti’s “New Tank”: obviously trap.
  • High entropy → genre fusion. Probability is spread across many categories. Think a track like “Raksit Leila” that blends Arabic pop, electronic, and folk.
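
A minimal sketch of that computation, assuming genre_probs is the 400-way probability vector from the Discogs tagger:

import numpy as np

def genre_fusion(genre_probs: np.ndarray) -> float:
    """Normalized entropy of the genre distribution: 0 = pure genre, 1 = maximal fusion."""
    p = genre_probs / genre_probs.sum()        # make sure it is a proper distribution
    entropy = -np.sum(p * np.log(p + 1e-12))   # epsilon guards against log(0)
    return float(entropy / np.log(len(p)))     # divide by max entropy (log 400) to land in [0,1]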

To validate this worked, I looked at high-popularity tracks at both extremes:

Track | Artist | Genre fusion | Top Genre
The Box | Roddy Ricch | 0.164 | Hip Hop / Cloud Rap
Drip Too Hard | Lil Baby, Gunna | 0.337 | Hip Hop / Trap
Big Dawgs | Hanumankind, Kalmi | 0.376 | Hip Hop / Horrorcore
Flex Up | Lil Yachty, Future, Playboi Carti | 0.075 | Hip Hop / Cloud Rap

Track | Artist | Genre fusion | Top 3 Genres
Self Control | Frank Ocean | 0.947 | Alt Rock, Indie Rock, Pop Rock
Bad Habit | Steve Lacy | 0.850 | Neo Soul, Funk, Contemporary R&B
POWER | Kanye West | 0.839 | Conscious, Pop Rap, Crunk
LA FAMA | ROSALÍA, The Weeknd | 0.833 | House, Dance-pop, UK Garage
Feel No Ways | Drake | 0.753 | Contemporary R&B, Experimental, Vaporwave

The pure-genre table is almost all trap/cloud rap: genres with distinctive sonic signatures the model recognizes confidently. Genre-bending artists (Frank Ocean, Steve Lacy, ROSALÍA, Kanye) cluster in the fusion table; their sound draws from multiple traditions, so the model spreads probability across many categories. Some may disagree that this is an important dimension, but it was useful for clustering my personal library.

Genre fusion histogram showing distribution

Acoustic/Electronic (Dim 14)

I expected jazz and electronic game music (osu! music) might cluster together. Both are instrumental with complex rhythms, and often have high genre_fusion. But while one uses saxophones and pianos, the other uses synthesizers. I wanted a dimension to separate them.

Essentia has separate mood_acoustic and mood_electronic classifiers. I initially dismissed them, but empirically they turned out to be exactly the separating dimension I needed. I combined them as (mood_acoustic - mood_electronic + 1) / 2. This separated jazz from EDM cleanly: tracks that shared high genre_fusion now landed in distinct clusters.

Circular Key Encoding (Dims 16-18)

Musical keys are cyclical: C is close to B, not twelve steps away. Standard one-hot encoding would treat them as unrelated categories.

import math

# pitch: semitone index 0-11 from KeyExtractor; minor: True if the scale is minor
w = 0.33
key_sin = math.sin(2 * math.pi * pitch / 12) * w
key_cos = math.cos(2 * math.pi * pitch / 12) * w
key_scale = 0 if minor else w

The weight w = 0.33 ensures the three key dimensions contribute roughly one effective dimension of clustering influence, not three.

Linear key encoding visualization

Circular key encoding visualization

That gives me 19 audio dimensions total (more like 17, effectively).

Part 2: Embedding the Lyrics

Audio features alone weren’t enough. Two songs can sound similar but have completely different lyrical content. Given that 60%+ of my library is rap, lyric features needed significant weight.

Sentence Transformers Failed

I tried embedding lyrics with sentence transformers (BGE-M3, E5-large, BERT). They clustered by language, not emotion. Japanese songs grouped together regardless of mood. The embedding space captures linguistic meaning, not affective meaning.

GPT as an Annotator

The proper approach here would mirror what I did for audio: supervised dimensionality reduction. I’d embed lyrics with something like BERT or GPT embeddings, then run those embeddings through pre-trained classifiers for each dimension (valence, arousal, mood, narrative density, etc.). Each classifier would be trained on labeled data, outputting interpretable scores. This is exactly how the Essentia pipeline worked: embedding → classifier → interpretable feature.

Frankly, I was too lazy to do that. But I figured prompting GPT directly would be good enough: LLMs have been trained on so much text about emotions, sentiment, and meaning that they’ve essentially internalized how humans label these things. So instead of hunting for classifiers, I made a bunch of GPT-5 mini calls with structured prompts explaining the dimensions I wanted. Sue me.

Instead of embedding lyrics and hoping the right structure emerges, I defined the dimensions I wanted and had GPT-5 mini score directly into them:

Dim | Name | Source | Output | Description
19 | lyric_valence | GPT-5 mini | [0,1] | Emotional tone (0 = negative, 1 = positive)
20 | lyric_arousal | GPT-5 mini | [0,1] | Energy level of lyric content
21 | lyric_mood_happy | GPT-5 mini | [0,1] | Joy/celebration (non-mutually exclusive)
22 | lyric_mood_sad | GPT-5 mini | [0,1] | Grief/melancholy
23 | lyric_mood_aggressive | GPT-5 mini | [0,1] | Anger/confrontation
24 | lyric_mood_relaxed | GPT-5 mini | [0,1] | Peace/calm
25 | lyric_explicit | GPT-5 mini | [0,1] | Profanity, sexual content, violence, drugs (combined)
26 | lyric_narrative | GPT-5 mini | [0,1] | 0 = vibes/hooks, 1 = story with arc
27 | lyric_vocabulary | Local | [0,1] | Type-token ratio (vocabulary richness)
28 | lyric_repetition | Local | [0,1] | 1 − (unique lines / total lines)
29 | theme* | GPT-5 mini | Ordinal | Primary theme (love, party, flex, struggle, etc.)
30 | language* | GPT-5 mini | Ordinal | Primary language, grouped by musical tradition
31 | popularity | Spotify API | [0,1] | Spotify recency-weighted popularity (0-100), normalized via min-max
32 | release_year | Spotify API | [0,1] | Decade bucket encoding: 1950s = 0.0, …, 2020s = 1.0

A few interesting design choices I made for the prompt:

  • Non-mutually exclusive moods. A song can be both happy (0.7) and sad (0.4) if it’s bittersweet.
  • Single theme choice. Forces picking one primary theme to avoid multi-label ambiguity.
  • Detailed anchor points. Not just “rate 0-1” but specific guidance: “0.0-0.2: deeply negative, despair, nihilism.”
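
For illustration, a single scoring call might look something like this; the model string, prompt, and key list here are placeholders, and the real prompts carried the detailed anchor points described above:

import json
from openai import OpenAI

client = OpenAI()

PROMPT = """Score these lyrics on each dimension from 0 to 1.
Moods are not mutually exclusive. Pick exactly one primary theme.
Return JSON with keys: valence, arousal, mood_happy, mood_sad,
mood_aggressive, mood_relaxed, explicit, narrative, theme, language."""

def score_lyrics(lyrics: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-5-mini",                          # model named in the post
        response_format={"type": "json_object"},     # force parseable JSON back
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": lyrics[:6000]},   # truncate extremely long lyrics
        ],
    )
    return json.loads(resp.choices[0].message.content)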

Similar to the audio dimensions, a few (marked with *) needed custom implementation. They’re underwhelming, though.

Theme and Language Encoding

GPT extracted theme (one of: party, flex, love, struggle, introspection, etc.) and language (English, Arabic, Japanese, etc.). But these are categorical; I needed to encode them as numbers for clustering.

I used ordinal scales designed around musical similarity:

theme:    party=1.0, flex=0.89, love=0.78, social=0.67, spirituality=0.56,
          introspection=0.44, street=0.33, heartbreak=0.22, struggle=0.11, other=0.0

language: English=1.0, Romance=0.86, Germanic=0.71, Slavic=0.57,
          Middle Eastern=0.43, South Asian=0.29, East Asian=0.14, African=0.0

Love it or hate it, this took 5 minutes to write in Claude Code. I’d been debugging for days over break, and the marginal gain from a principled categorical encoding didn’t feel worth another rabbit hole.

A Note on Ordinal Encoding

This encoding is arbitrary. Party isn’t objectively “closer” to flex than to struggle. The distances are made up. A cleaner approach would be learned embeddings or one-hot encoding with dimensionality reduction, but that would bloat the vector with 20+ sparse dimensions, drowning out the interpretable features I’d carefully constructed.

In practice, the arbitrariness matters less than it looks. The audio dimensions do the heavy lifting for separation. The instrumentalness dimension alone has massive effect sizes (Cohen’s d = -6.37 between Hard-Rap and Jazz-Fusion), and the instrumentalness weighting pulls lyric features toward neutral values. So even if instrumentals land at an arbitrary 0.5 on the theme/language scales, they still cluster correctly because the audio dimensions handle the real separation work.

Instrumentalness Weighting

Here’s where it got tricky. What do you do with instrumental tracks? How about tracks where lyrics have very little value (e.g. in EDM)?

Problem 1: Zero isn’t always neutral

My first attempt: zero out lyric features for instrumentals. An instrumental has no sad lyrics, so lyric_mood_sad = 0. But this caused instrumentals to cluster with happy music, because happy songs also have low sadness.

The issue is that some features are bipolar (valence: negative ↔ positive) while others are presence-based (sadness: absent ↔ present). For bipolar features, zero means “negative,” not “absent.” An instrumental isn’t lyrically negative; it’s lyrically absent. The neutral point is 0.5, not 0.

Feature Type | Neutral Value | Rationale
Bipolar (valence, arousal) | 0.5 | Neutral, not negative
Presence (moods, explicit, narrative) | 0 | Absence is absence
Categorical (theme, language) | 0.5 | Centered

Problem 2: Instrumentalness is a spectrum

Tracks like Fred Again’s leavemealone technically have lyrics, but (with all due respect to Keem) these are textural at best, not semantic. The words don’t carry emotional weight the way Kendrick’s verses do. When GPT classified these lyrics, it’d return real values (arousal 0.8, narrative 0.1), and suddenly my EDM was clustering with lyrically similar pop instead of with other electronic music.

To be clear, GPT’s classifications are valid: leavemealone does have high-arousal, low-narrative lyrics. But that doesn’t mean those dimensions should swing the vector as much as they do, because lyrics just aren’t an important component of the song as a whole.

Basically, instrumentalness is a spectrum, not a switch. Another Fred Again track like adore u should have some lyric influence, just dampened proportionally.

I needed lyrics to matter less as a track becomes more instrumental.

The fix: weight each lyric dimension by (1 - instrumentalness), pulling toward the appropriate neutral value:

# For bipolar features (neutral = 0.5):
weighted = 0.5 + (raw - 0.5) * (1 - instrumentalness)

# For presence features (neutral = 0):
weighted = raw * (1 - instrumentalness)

At instrumentalness = 0 (pure vocals), the raw lyric value passes through unchanged. At instrumentalness = 1 (pure instrumental), the value collapses to neutral. In between, it’s a smooth blend: Fred Again’s vocal samples contribute a little, Kendrick’s verses contribute fully.

This fixed the scattering problem. After this change, Fred Again’s leavemealone stopped clustering with lyrically similar pop tracks and landed with other electronic music where it belonged.
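
Applied across the whole lyric block, the weighting is only a few lines. A sketch, with feature names following the table above and theme/language treated as centered (neutral = 0.5) per the earlier table:

BIPOLAR = {"lyric_valence", "lyric_arousal", "theme", "language"}        # neutral = 0.5
PRESENCE = {"lyric_mood_happy", "lyric_mood_sad", "lyric_mood_aggressive",
            "lyric_mood_relaxed", "lyric_explicit", "lyric_narrative"}   # neutral = 0

def weight_lyric_features(features: dict, instrumentalness: float) -> dict:
    """Pull lyric dimensions toward their neutral value as instrumentalness rises."""
    w = 1.0 - instrumentalness
    out = dict(features)
    for name in BIPOLAR:
        out[name] = 0.5 + (features[name] - 0.5) * w
    for name in PRESENCE:
        out[name] = features[name] * w
    return out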

That gives me 33 dimensions total: 19 audio, 12 lyrics, 2 metadata. Each song is now a point in 33-dimensional space. The next step: group them.

Part 3: Clustering

With 33 interpretable dimensions, I needed a clustering algorithm. I tried three. Note that I (mostly) ignored silhouette scores (the data wasn’t well separated enough) and elbow methods; most of my work here was based on observation.

HDBSCAN

HDBSCAN was my first choice. It had worked well for me on biological data before: density-based, no need to specify k, finds arbitrary shapes. On music, HDBSCAN labeled 90% of tracks as noise.

HDBSCAN assumes clusters are “regions of the data that are denser than the surrounding space”; the mental model is “trying to separate the islands from the sea.” But music taste barely has any density gaps. Chill pop gradates into bedroom pop gradates into lo-fi. There are no valleys to cut.

KNN

KNN-based clustering (building a neighbor graph, then running community detection) was better because it captures local structure. But a song might neighbor five chill tracks and one jazz track, and that single edge pulls it the wrong way. Shuffling through the results, ~20% of tracks felt misplaced. Miles better than HDBSCAN, but I kept exploring.

Hierarchical Agglomerative Clustering (HAC)

HAC worked great. It builds a tree: every song starts as its own cluster, and the algorithm repeatedly merges the two most similar. You cut wherever you want k clusters. HAC dropped misplacement rate from ~20% to ~5%.

Interestingly, I felt like HAC struggled a little with sub-clusters. This was especially clear when I was exploring a ~100-song cluster and trying to split it further because I sensed a clear distinction in mood. My intuition: HAC’s greedy merges compound at higher resolution, while KNN’s local focus, a liability globally, becomes an asset once “nearby” already means “similar vibe.”

Final approach: HAC for main clustering (k=5), KNN for subclustering within each. Good enough: 90% of clusters matched my intuition on shuffle.
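
In code, the main clustering step is standard scikit-learn; a sketch, assuming X is the 1,253 × 33 feature matrix:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

X_std = StandardScaler().fit_transform(X)   # z-score each of the 33 dimensions

# Main clusters: Ward-linkage HAC, cut at k = 5.
labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X_std)

# Sub-clustering (KNN graph / k-means, depending on the cluster) is then run
# separately on the rows of each of the five clusters.

Ward is scikit-learn’s default linkage; the post above doesn’t name which linkage was used, so treat that choice as an assumption.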

Part 4: Analysis

The Clusters

3D UMAP visualization with cluster colors

After standardizing features, running HAC, and listening to a lot of my questionable music taste, I landed on 5 clusters.

Hard-Rap (Cluster 0, Playlist) — 733 songs, 58.5%

High-energy rap dominated by trap and cloud rap. The cluster that confirms I am, at my core, a “guy music” typa guy.

Top 3 defining features: High lyric_arousal, high lyric_explicit, high lyric_mood_aggressive. This is music that’s loud, profane, and confrontational.

Top artists: Kanye West, JPEGMAFIA, Drake, Kendrick Lamar, Travis Scott, Paris Texas.

What it sounds like: Main character energy.

3D UMAP visualization of Hard-Rap sub-clusters

This cluster was huge, and shuffling through it, there felt like a Lil Baby rap side and a more upbeat 2010s-Kanye rap side. Sub-clustering with k-means (k=2) revealed two modes:

  • Hard-Rap-Aggro (Cluster 0.0, Playlist) — 472 songs, 64.4%: Pure mood_aggressive. JPEGMAFIA, Carnival-era Kanye, (rapping) Drake, Paris Texas. The defining split from 0.1: lower mood_sad, lower electronic_acoustic production, more hype. Big “gym playlist” energy.
  • Hard-Rap-Acoustic (Cluster 0.1, Playlist) — 261 songs, 35.6%: More laid-back. Ultralight Beam Kanye, DUCKWORTH Kendrick, Heaven to Me Tyler. Higher mood_sad, higher electronic_acoustic production, more relaxed. I was surprised by the “sad” label. Listening back, I hear acoustic warmth more than melancholy: fewer 808s, more soul samples. The acoustic production style might be tricking the model into thinking the audio is sadder.

Narrative-Rap (Cluster 1, Playlist) — 226 songs, 18.0%

Songs that tell stories. Lyrically dense, emotionally heavy.

Top 3 defining features: High lyric_mood_sad, low lyric_valence, high lyric_narrative. Feel more like essays set to music.

Top artists: Kendrick Lamar, Kanye West, JID, Drake, ZUTOMAYO, BROCKHAMPTON, Earl Sweatshirt.

What it sounds like: Kendrick’s Duckworth type stories, BROCKHAMPTON’s confessionals, Earl’s introspection. Music you have to whip out the Genius article for.

Interesting note: J-pop was the second-largest genre here. This reminded me of those memes about people dancing to depressing Japanese music without understanding the lyrics. Same pattern with my Arabic rap: nearly all Shabjdeed and Daboor (Palestinian rappers) landed here.

This isn’t merely sad music; it’s introspective music. Frankly, I’ve been listening to this cluster for a while and it’s almost perfectly cohesive, so no sub-clustering was necessary.

Jazz-Fusion (Cluster 2, Playlist) — 91 songs, 7.3%

Instrumental, relaxed, head-nodding music.

Top 3 defining features: High instrumentalness (0.80 average), low lyric vocabulary (because there are no lyrics), high relaxation. The “gentle” cluster.

Top artists: Joe Hisaishi (Ghibli’s music director), Uyama Hiroto, SOIL & “PIMP” SESSIONS, Masayoshi Takanaka, Khruangbin.

What it sounds like: Soundtracks, ambient, downtempo, Japanese jazz fusion. Energetic but never aggressive. The kind of music that makes you nod along while working.

This cluster barely existed until summer 2024, then surged. More on that in the temporal section.

Rhythm-Game-EDM (Cluster 3, Playlist) — 47 songs, 3.8%

EDM, osu! music. Breakcore, hardcore, chiptune.

Top 3 defining features: High instrumentalness, low approachability, high engagement.

Top artists: Camellia, t+pazolite, Morimori Atsushi, Seiryu, VDYCD.

What it sounds like: 180+ BPM, complex time signatures, very electronic.

This cluster was near-zero until October 2024; easily the fastest growing cluster. More on that timing later.

Mellow (Cluster 4, Playlist) — 156 songs, 12.5%

Soft, reflective, acoustic-leaning.

Top 3 defining features: High electronic_acoustic production, high mood_sad, low mood_party. The opposite of Hard-Rap in almost every way.

Top artists: JANNABI, Kanye West (24, Come to Life), Travis Scott (Bad Apple, sdp interlude), Glass Animals, Yumi Arai, Marty Robbins.

What it sounds like: Late night drives. Rain on windows. The playlist you put on when you’re in your feelings but not trying to wallow.

3D UMAP visualization of Mellow sub-clusters

Despite the high mood_sad, shuffling through the playlist made it feel like mellow with two distinct moods. Sub-clustering with KNN (k=2) revealed:

Mellow-Hopecore (Cluster 4.0, Playlist) — 97 songs, 62.2%: Hopeful calm. On God Kanye, SDP Interlude Travis, Latin folk. Lower mood_sad, lower danceability (in the Essentia sense), less electronic_acoustic than 4.1. The “things will be okay” playlist. One of my favorite (sub)clusters.

Mellow-Sadcore (Cluster 4.1, Playlist) — 59 songs, 37.8%: Sad calm. JANNABI, Glass Animals, Mustafa. Higher mood_sad, higher danceability, more electronic_acoustic. The “things are not okay but at least the music is pretty” playlist.

Overview

The dissimilarity matrix validates that the clusters are fairly different. Narrative-Rap and Mellow are the closest pair (d = 0.59), both introspective and emotionally weighted. Hard-Rap and Narrative-Rap are also close (d = 0.67), sharing rap conventions despite diverging in lyrical tone. This matters because the clustering respects the gradients I actually perceive: rap stays near rap, reflective stays near reflective. Mellow-Sadcore is even closer to Narrative-Rap.

Hard-Rap and Jazz-Fusion are the most distant pair (d = 1.51). The features driving this are instrumentalness (Cohen’s d = -6.37), language (3.63), lyric explicitness (3.34), and lyric arousal (2.95). One cluster is vocal-heavy, English-dominant, explicit, and high-energy. The other is instrumental, language-neutral, clean, and calm. They represent opposite ends of the feature space, which is exactly what I’d expect given how differently they function in my listening habits.

Cluster dissimilarity matrix
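
For reference, the per-feature effect sizes quoted here and earlier are plain Cohen’s d; a sketch, assuming a and b are one feature’s values across two clusters:

import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardized mean difference between two clusters on one feature."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)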

Audio vs. Lyrics Covariance

When I cluster my library using only audio features, then separately cluster using only lyric features, the two clusterings barely agree. The Adjusted Rand Index between audio-only and lyric-only clusterings is 0.092, nearly indistinguishable from random chance (0.0).

The contingency matrix below shows this. Each cell counts how many songs fall into a given audio cluster (row) and lyric cluster (column). If the two modalities agreed, you’d see high counts along the diagonal. Instead, songs scatter everywhere. A track in Audio Cluster 2 could land in any Lyric Cluster.

Audio cluster vs lyric cluster overlap
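
Both numbers come straight from scikit-learn; a sketch, assuming audio_labels and lyric_labels are the two cluster assignments over the same songs:

from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

ari = adjusted_rand_score(audio_labels, lyric_labels)     # ~0.09 here: near-chance agreement
overlap = contingency_matrix(audio_labels, lyric_labels)  # rows = audio clusters, cols = lyric clusters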

This might sound like a problem at first, but it’s actually the point. If audio and lyrics produced the same clusters, one of them would be redundant. The low agreement means they capture different information.

What each modality is good at

Lyrics produce tighter clusters. The silhouette score for lyric-only clustering is 0.322 (higher means better separated), compared to 0.096 for audio-only. My guess for why lyrics separate songs into more discrete groups: lyrical content is more categorical (a song is either about heartbreak or it isn’t), while audio features are continuous and overlapping (the boundary between “chill” and “sad” is fuzzy).

Audio captures sonic texture that lyrics miss entirely. Two songs can have identical themes (love, loss, flex) but sound completely different. Lyrics can’t distinguish lo-fi from trap. Audio can.

Genre matters

Hip-hop tracks have 34.4% agreement between audio and lyric clustering. Electronic tracks have 16.4%. In hip-hop, production style and lyrical content are tightly coupled: trap beats go with trap lyrics, conscious production goes with conscious bars. In electronic music, the lyrics (if any) are often textural rather than semantic. A Fred Again track and a Camellia track are both “electronic” sonically but have nothing in common lyrically.

Instrumental tracks validate the design

In audio-only clustering, instrumental tracks scatter across multiple clusters based on their sonic properties: jazz ends up separate from EDM ends up separate from ambient. In lyric-only clustering, 88% of instrumentals converge into a single cluster. This is the intended behavior. The neutral lyric values (0.5 for bipolar features, 0 for presence features) collapse instrumental tracks in lyric space while letting audio features differentiate them.

Example movements

Some tracks stay put across both clusterings: Lil Baby’s “Grace,” Tyler’s “SMUCKERS,” Clipse’s “Intro.” These are songs where the audio mood matches the lyric mood.

Other tracks shift dramatically. Take Kendrick’s “United In Grief” or JID’s “Better Days.” In audio-only clustering, they land in a cluster I’d call “Introspective Rap”: relaxed but aggressive, lower valence, contemplative production. Top artists there are Kanye, Tyler, Kendrick, Drake. In lyric-only clustering, they land in a cluster I’d call “Sad Narratives”: high lyric sadness (+170% above library average), storytelling-heavy, low explicitness.

The audio captures how it sounds: chill, introspective production. The lyrics capture what it’s about: sad, narrative-driven content. Both clusterings agree these tracks are low-energy and contemplative, but they diverge on the specific emotional flavor. The audio says “relaxed,” the lyrics say “sad.” That distinction matters.

The takeaway

The 33-dimensional space captures complementary information. Songs that sound similar can have wildly different lyrics. Lyrical themes cross audio boundaries. Neither modality alone explains why I saved a track. Both together get closer.

Genre Analysis

Quick caveat: this “genre” isn’t Spotify metadata, since access to that metadata has been deprecated. It’s from Essentia’s Discogs tagger, which outputs a 400-way P(genre) distribution per track. So what follows is “what the model thinks this sounds like,” not a canonical ground truth.

Generally, the distribution confirms the obvious: I’m hip-hop heavy. Trap + Cloud Rap dominate, then Electronic, then Rock. Hip Hop is steadily losing share quarter-over-quarter (still #1, but no longer the entire identity), while Electronic is the big climber, going from “barely present” to competing for dominance by late 2025.

Genre trends by quarter

Temporal Analysis

With clusters defined and features validated, I wanted to see how my taste evolved over time. Note: timestamps are only reliable post-July 2024.

First, the baseline. My library skews heavily toward 2010s and 2020s releases. The older tail (1960s jazz, 1970s Latin folk) is almost entirely from the summer.

Songs by release decade

The extremes tell the story. The oldest songs are jazz standards and cowboy ballads; the newest are whatever dropped last week.

Oldest | Artist | Year
Big Iron | Marty Robbins | 1959
Take Five | The Dave Brubeck Quartet | 1959
Giant Steps | John Coltrane | 1960
Saudade Vem Correndo | Stan Getz, Luiz Bonfá | 1963
Tall Handsome Stranger | Marty Robbins | 1963

Newest | Artist | Year
solo | Fred again.., Blanco | 2025
Shock The World | That Boy Franco | 2025
Muevelou (Pocoyo Song) | Vic | 2025
I Run | 4dawgs | 2025
California Games | Armand Hammer, The Alchemist | 2025

Cluster share of additions by quarter

The overall mood profile stays fairly stable over time, but clear changes in my library corresponded to specific people and moments. I could trace two of them immediately.

Summer 2024: Internship, California

At my internship, Connor kept putting me on Latin folk and jazz-adjacent stuff. Berni was a big EDM guy back in the day and reminded me of the European EDM heads I used to know from osu!. Christian introduced me to Masayoshi Takanaka on our ride to Yosemite that summer, and I’ve been hooked on Japanese jazz fusion since. In my head, all of it was just “work music.”

In the cluster chart, you can see Jazz-Fusion spawn. It went from 0.83% of additions in Q2 to 8.63% in Q3, then 14.70% in Q4. Instrumental, relaxed, interesting enough to keep me locked in while working. The mood profile shifted too: mood_relaxed climbed through July and August as I added more of this stuff. Once that slot exists in your day, it stays.

Fall 2024: Blue Lock edits at 2am

Around fall 2024, my brother started sending me Blue Lock anime edits. I really liked “l’etoile d’afrique – #18” by VDYCD from one of them, and that sent me down a phonk and opium rabbit hole. I was listening to this stuff super hard. The aggressive, high-energy production fit right into the Rhythm-Game-EDM cluster, but the lyrical content (when there was any) was darker, more confrontational. Another person, another slot carved out.

Fall 2025: Fall break in the common room

Over fall break, a few friends and I ended up on a Spotify playlist of popular osu! beatmaps. I thought it’d be a one-night nostalgia trip.

Rhythm-Game-EDM went from 1.65% of additions in Q2 to 9.10% in Q4. That weekend reminded me of how fire osu! music was. You can see mood_party and engagement spike around October because of it in the mood profile. High valence, high energy, very electronic. The cluster is loud in feature space, so it’s easy for the model to catch. But the real reason it stuck is that it already had a place in my head. That weekend just reminded me it was there.

Cumulative mood profile over time

Connor gave me Latin folk. Christian gave me Takanaka on the aux. My brother’s Blue Lock edits at 2am sent me down the phonk hole. Claire and that night in the Cube reopened a dormant folder. Each person carved out a slot, and the slots stuck.

There are moments I can’t timestamp. That March I quit premed. The 2010s Kanye era that built the foundation for my taste. I wish I could trace those too. But what I can say: my taste isn’t some coherent aesthetic I curated. It’s a messy mosaic of people and moments, and I was happy to reminisce while looking at the data.

The clusters are just statistics. But the people who built them – Connor’s Latin folk, Christian’s aux on the drive to Yosemite, my brother’s Blue Lock edits at 2am – they’re the actual structure underneath.

Future Work

Two extensions would make this practical:

The obvious next step is auto-sorting. I could pipe new liked songs through the model and automatically assign them to cluster-based playlists.

The more interesting direction is interpretable recommendations. Current systems are black boxes; you don’t know why Spotify suggested something, not to mention that it’s usually pretty ass in my experience. With named dimensions, you could show: “Recommended because: high energy, similar lyric themes, acoustic production.” Given the Anna’s Archive scrape, anyone can now study how Spotify’s internal features relate to user behavior. These 33 dimensions are a starting point for building something more transparent.

It’d be worth tuning the weights of each dimension to see if I can squeeze out better clusters as well. A labeled validation set of 50 songs would replace shuffle-and-check.

Honestly though, the temporal analysis alone was worth the project. Seeing which clusters grew, which shrank, which appeared from nowhere.

Part 5: Afterword

What I Learned About My Taste

59% of my library is Hard-Rap.

But the more interesting findings were the phases I’d forgotten I was having. Remembering my summer days listening to Connor’s Latin folk playlist in the office, the lock-in I had that fall break with friends, that March where I quit premed. The genre trends tell the same story at a higher level: trap declining, electronic rising, jazz appearing from nowhere. My taste isn’t some coherent aesthetic I curated. It’s a messy mosaic of people and moments, and I was happy to reminisce while looking at the data.

Technical Notes

I have a lot of trouble with silent errors when using LLMs for coding. There were multiple times where Claude would insert a fallback value “for safety,” and a simple bug would cause it to always default to that fallback. I’d only catch these bugs empirically, by shuffling through clusters and reading through often-sloppy code. I added a note to my .CLAUDE file: never put fallback values unless explicitly specified.

From a data science perspective, this whole project was “vibey” in a way that made me uncomfortable at first. I didn’t optimize silhouette scores or run formal ablations. Earlier I said vibes-based debugging was insufficient. That’s true for feature selection, but for cluster validation, my intuition is the only ground truth: if a cluster doesn’t match my mental model, it’s fair to call it wrong. At the end of the day, this is an exploration of a pipeline that mimics my mental model of music.

The full codebase, including data processing pipelines, clustering implementations, and visualization generation, is available on GitHub. You can follow the README.md to run this on your own library, though it might take 6-9 hours on a MacBook.

Some readers informed me that they had to refresh to view the figures/embeds, so please try a hard refresh (Ctrl + Shift + R on Windows, ⌘ + Shift + R on Mac) if you can’t see them while I figure out what’s wrong. My guess is the large (~500KB cumulative) file sizes.

The State of Music + Tech

The embedding models were disappointing. MERT, CLAP, and Discogs-EffNet all produced garbage clusters when I used them as-is. They’re optimized for their training objectives (masked prediction, genre tagging), not for the distinctions I care about. More broadly: so much AI research focuses on language and video. Music feels neglected. The classification models disagreed with my intuition 30-40% of the time, the embeddings don’t transfer well to personal-scale analysis, and there’s so much low-hanging fruit for anyone willing to build in the open.

I’m also heavily disappointed in Spotify and other music streaming platforms. As a self-proclaimed power-user (read: time waster), I’m disappointed by the lack of power tools in this space. You have Spotify, a React Native application with a (near) monopoly over music streaming, that lacks batch operations, custom sorting, or any programmatic access to your own data. You have Apple Music constantly stuck at “Syncing Your Library…” in my experience. TIDAL might be the closest thing to a power-user (audiophile) experience in music streaming, and its UX still feels clunkier than iOS 26. I do think Last.fm is very cool, but after using it, it still feels like a metadata junk drawer more than anything.

We’ve come a long way from glazing 2010s Kanye. Alas, the data was never really the point. The people were.


