Analysing music habits with Spotify API and Python



I’ve been using Spotify as my main source of music since 2013. Back then the app automatically created a playlist for songs that I liked from artists’ radios, and by inertia I still use that playlist to save songs that I like. As the playlist has become a bit big and a bit old (6 years, huh), I decided to try to analyze it.

Boring preparation

To get the data I used the Spotify API with spotipy as a Python client. I created an application in the Spotify Dashboard and gathered the credentials. Then I was able to initialize and authorize the client:

import spotipy
import spotipy.util as util

token = util.prompt_for_user_token(user_id,
                                   'playlist-read-collaborative',
                                   client_id=client_id,
                                   client_secret=client_secret,
                                   redirect_uri='http://localhost:8000/')
sp = spotipy.Spotify(auth=token)

Tracks metadata

As everything is inside just one playlist, it was easy to gather. The only problem was that the user_playlist method in spotipy doesn’t support pagination and can only return the first 100 tracks, but that was easily solved by going down to the private and undocumented _get:

import pandas as pd
# Assumption: parse_date is dateutil's parser; the original import isn't shown.
from dateutil.parser import parse as parse_date

playlist = sp.user_playlist(user_id, playlist_id)
tracks = playlist['tracks']['items']
next_uri = playlist['tracks']['next']
# Follow the `next` links until the API stops returning one.
while next_uri:
    response = sp._get(next_uri)
    tracks += response['items']
    next_uri = response['next']

tracks_df = pd.DataFrame([(track['track']['id'],
                           track['track']['artists'][0]['name'],
                           track['track']['name'],
                           parse_date(track['track']['album']['release_date']) if track['track']['album']['release_date'] else None,
                           parse_date(track['added_at']))
                          for track in tracks],
                         columns=['id', 'artist', 'name', 'release_date', 'added_at'])
tracks_df.head(10)
   id                      artist                    name                                 release_date  added_at
0  1MLtdVIDLdupSO1PzNNIQg  Lindstrøm & Christabelle  Looking For What                     2009-12-11    2013-06-19 08:28:56+00:00
1  1gWsh0T1gi55K45TMGZxT0  Au Revoir Simone          Knight Of Wands - Dam Mantle Remix   2010-07-04    2013-06-19 08:48:30+00:00
2  0LE3YWM0W9OWputCB8Z3qt  Fever Ray                 When I Grow Up - D. Lissvik Version  2010-10-02    2013-06-19 22:09:15+00:00
3  5FyiyLzbZt41IpWyMuiiQy  Holy Ghost!               Dumb Disco Ideas                     2013-05-14    2013-06-19 22:12:42+00:00
4  5cgfva649kw89xznFpWCFd  Nouvelle Vague            Too Drunk To Fuck                    2004-11-01    2013-06-19 22:22:54+00:00
5  3IVc3QK63DngBdW7eVker2  TR/ST                     F.T.F.                               2012-11-16    2013-06-20 11:50:58+00:00
6  0mbpEDdZHNMEDll6woEy8W  Art Brut                  My Little Brother                    2005-10-02    2013-06-20 13:58:19+00:00
7  2y8IhUDSpvsuuEePNLjGg5  Niki & The Dove           Somebody (drum machine version)      2011-06-14    2013-06-21 09:28:40+00:00
8  1X4RqFAShNL8aHfUIpjIVr  Gorillaz                  Kids with Guns - Hot Chip Remix      2007-11-19    2013-06-23 19:00:57+00:00
9  1cV4DVeAM5AstrDlXgvzJ7  Lykke Li                  I'm Good, I'm Gone                   2008-01-28    2013-06-23 22:31:52+00:00
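
The pagination loop above boils down to following `next` links until the API stops returning one. Here is a minimal sketch of that pattern, with an in-memory lookup standing in for the Spotify client. The `fetch_all_items` helper, the `FAKE_PAGES` data and the page names are all hypothetical, made up for illustration; only the `items`/`next` shape comes from the responses used above.

```python
def fetch_all_items(first_page, fetch):
    """Collect items from a paginated response by following 'next' links.

    `first_page` is a dict with 'items' and 'next' keys; `fetch` is any
    callable that maps a URL to the next page (a hypothetical stand-in
    for an API client call).
    """
    items = list(first_page['items'])
    next_url = first_page['next']
    while next_url:  # the last page carries next == None
        page = fetch(next_url)
        items.extend(page['items'])
        next_url = page['next']
    return items


# Fake pages keyed by URL, for illustration only.
FAKE_PAGES = {
    'page2': {'items': [3, 4], 'next': 'page3'},
    'page3': {'items': [5], 'next': None},
}
first = {'items': [1, 2], 'next': 'page2'}

all_items = fetch_all_items(first, FAKE_PAGES.__getitem__)
print(all_items)
```

Looping on the `next` link instead of a precomputed page count also covers the case where the total is an exact multiple of the page size.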

The first naive idea was to get the list of the artists that appear most often:

tracks_df \
    .groupby('artist') \
    .count()['id'] \
    .reset_index() \
    .sort_values('id', ascending=False) \
    .rename(columns={'id': 'amount'}) \
    .head(10)
     artist                 amount
260  Pet Shop Boys              12
334  The Knife                  11
213  Metronomy                   9
303  Soulwax                     8
284  Röyksopp                    7
180  Ladytron                    7
94   Depeche Mode                7
113  Fever Ray                   6
324  The Chemical Brothers       6
233  New Order                   6
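
The same count can be written more compactly with `value_counts`. This is just a sketch of an alternative idiom on a made-up `toy_df` (the rows are illustrative, not taken from the playlist):

```python
import pandas as pd

# Toy data for illustration only.
toy_df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'artist': ['The Knife', 'Metronomy', 'The Knife', 'Soulwax', 'The Knife'],
})

top_artists = (
    toy_df['artist']
    .value_counts()                # counts per artist, sorted descending
    .rename_axis('artist')         # name the index so it becomes a column
    .reset_index(name='amount')    # turn the counts into an 'amount' column
)
print(top_artists.head(10))
```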

But as taste can change, I decided to get the top five artists for each year and check whether I was adding them to the playlist in other years:

counted_year_df = tracks_df \
    .assign(year_added=tracks_df.added_at.dt.year) \
    .groupby(['artist', 'year_added']) \
    .count()['id'] \
    .reset_index() \
    .rename(columns={'id': 'amount'}) \
    .sort_values('amount', ascending=False)

in_top_5_year_artist = counted_year_df \
    .groupby('year_added') \
    .head(5) \
    .artist \
    .unique()

counted_year_df \
    [counted_year_df.artist.isin(in_top_5_year_artist)] \
    .pivot(index='artist', columns='year_added', values='amount') \
    .fillna(0) \
    .style.background_gradient()
year_added             2013  2014  2015  2016  2017  2018  2019
artist
Arcade Fire               2     0     0     1     3     0     0
Clinic                    1     0     0     2     0     0     1
Crystal Castles           0     0     2     2     0     0     0
Depeche Mode              1     0     3     1     0     2     0
Die Antwoord              1     4     0     0     0     1     0
FM Belfast                3     3     0     0     0     0     0
Factory Floor             3     0     0     0     0     0     0
Fever Ray                 3     1     1     0     1     0     0
Grimes                    1     0     3     1     0     0     0
Holy Ghost!               1     0     0     0     3     1     1
Joe Goddard               0     0     0     0     3     1     0
John Maus                 0     0     4     0     0     0     1
KOMPROMAT                 0     0     0     0     0     0     2
LCD Soundsystem           0     0     1     0     3     0     0
Ladytron                  5     1     0     0     0     1     0
Lindstrøm                 0     0     0     0     0     0     2
Marie Davidson            0     0     0     0     0     0     2
Metronomy                 0     1     0     6     0     1     1
Midnight Magic            0     4     0     0     1     0     0
Mr. Oizo                  0     0     0     1     0     3     0
New Order                 1     5     0     0     0     0     0
Pet Shop Boys             0    12     0     0     0     0     0
Röyksopp                  0     4     0     3     0     0     0
Schwefelgelb              0     0     0     0     1     0     4
Soulwax                   0     0     0     0     5     3     0
Talking Heads             0     0     3     0     0     0     0
The Chemical Brothers     0     0     2     0     1     0     3
The Fall                  0     0     0     0     0     2     0
The Knife                 5     1     3     1     0     0     1
The Normal                0     0     0     2     0     0     0
The Prodigy               0     0     0     0     0     2     0
Vitalic                   0     0     0     0     2     2     0
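
The "top N per group" step is the part that is easiest to get wrong, so here is a small sketch of it in isolation. It assumes made-up counts in `counts_df` and uses a top two instead of a top five to keep the toy data short:

```python
import pandas as pd

# Toy per-artist, per-year counts (illustrative values only).
counts_df = pd.DataFrame({
    'artist': ['A', 'B', 'A', 'B', 'C'],
    'year_added': [2013, 2013, 2014, 2014, 2014],
    'amount': [5, 3, 1, 2, 4],
})

# Sort once by amount, then keep the first two rows of every year.
top_2_per_year = (
    counts_df
    .sort_values('amount', ascending=False)
    .groupby('year_added')
    .head(2)
)
top_artists = set(top_2_per_year['artist'])

# Show every year for the artists that made the top in at least one year.
table = (
    counts_df[counts_df['artist'].isin(top_artists)]
    .pivot(index='artist', columns='year_added', values='amount')
    .fillna(0)
)
print(table)
```

Because the frame is sorted before grouping, `head(2)` returns the two largest counts of each year. Missing artist/year combinations come out of the pivot as NaN, which is why the `fillna(0)` is needed.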

As a bunch of artists were reappearing in different years, I decided to check whether that correlates with new releases, so I looked at the last ten release years:

counted_release_year_df = tracks_df \
    .assign(year_added=tracks_df.added_at.dt.year,
            year_released=tracks_df.release_date.dt.year) \
    .groupby(['year_released', 'year_added']) \
    .count()['id'] \
    .reset_index() \
    .rename(columns={'id': 'amount'}) \
    .sort_values('amount', ascending=False)

counted_release_year_df \
    [counted_release_year_df.year_released.isin(
        sorted(tracks_df.release_date.dt.year.unique())[-11:]
    )] \
    .pivot(index='year_released', columns='year_added', values='amount') \
    .fillna(0) \
    .style.background_gradient()
year_added     2013  2014  2015  2016  2017  2018  2019
year_released
2010.0           19     8     2    10     6     5    10
2011.0           14    10     4     6     5     5     5
2012.0           11    15     6     5     8     2     0
2013.0           28    17     3     6     5     4     2
2014.0            0    30     2     1     0    10     1
2015.0            0     0    15     5     8     7     9
2016.0            0     0     0     8     7     4     5
2017.0            0     0     0     0    23     5     5
2018.0            0     0     0     0     0     4     8
2019.0            0     0     0     0     0     0    14

Audio features

The Spotify API has an endpoint that provides features like danceability, energy, loudness, etc. for tracks. So I gathered the features for all tracks in the playlist:

import numpy as np

features = []
# Request audio features in batches of 50 track ids.
for n, chunk_series in tracks_df.groupby(np.arange(len(tracks_df)) // 50).id:
    features += sp.audio_features([*map(str, chunk_series)])
# Some entries can be None (no features for that track), so drop them.
features_df = pd.DataFrame([f for f in features if f])
tracks_with_features_df = tracks_df.merge(features_df, on=['id'], how='inner')
tracks_with_features_df.head()
id artist name release_date added_at danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature
0 1MLtdVIDLdupSO1PzNNIQg Lindstrøm & Christabelle Looking For What 2009-12-11 2013-06-19 08:28:56+00:00 0.566 0.726 0 -11.294 1 0.1120 0.04190 0.494000 0.282 0.345 120.055 359091 4
1 1gWsh0T1gi55K45TMGZxT0 Au Revoir Simone Knight Of Wands - Dam Mantle Remix 2010-07-04 2013-06-19 08:48:30+00:00 0.563 0.588 4 -7.205 0 0.0637 0.00573 0.932000 0.104 0.467 89.445 237387 4
2 0LE3YWM0W9OWputCB8Z3qt Fever Ray When I Grow Up - D. Lissvik Version 2010-10-02 2013-06-19 22:09:15+00:00 0.687 0.760 5 -6.236 1 0.0479 0.01160 0.007680 0.417 0.818 92.007 270120 4
3 5FyiyLzbZt41IpWyMuiiQy Holy Ghost! Dumb Disco Ideas 2013-05-14 2013-06-19 22:12:42+00:00 0.752 0.831 10 -4.407 1 0.0401 0.00327 0.729000 0.105 0.845 124.234 483707 4
4 5cgfva649kw89xznFpWCFd Nouvelle Vague Too Drunk To Fuck 2004-11-01 2013-06-19 22:22:54+00:00 0.461 0.786 7 -6.950 1 0.0467 0.47600 0.000003 0.495 0.808 159.882 136160 4
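
The `groupby(np.arange(...) // 50)` trick above is just a way to split the ids into batches. A plain-Python sketch of the same batching, with a hypothetical `chunked` helper and made-up track ids:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up ids, for illustration only.
track_ids = [f'track-{n}' for n in range(120)]
batches = list(chunked(track_ids, 50))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Each batch could then be passed to a call like `sp.audio_features(batch)`.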

After that I checked how the features changed over time; only instrumentalness showed a visible difference:

import seaborn as sns

sns.boxplot(x=tracks_with_features_df.added_at.dt.year,
            y=tracks_with_features_df.instrumentalness)

Instrumentalness over time

Then I had an idea to check seasonality and valence, and it kind of showed that in depressing months valence is a bit lower:

sns.boxplot(x=tracks_with_features_df.added_at.dt.month,
            y=tracks_with_features_df.valence)

Valence seasonality

To play a bit more with the data, I decided to check whether danceability and valence correlate:

tracks_with_features_df.plot(kind='scatter', x='danceability', y='valence')

Danceability vs valence
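
A scatter plot only shows the relationship visually; a correlation coefficient puts a number on it. A minimal sketch with made-up values in `toy_features_df` (not the playlist data), using pandas' default Pearson correlation:

```python
import pandas as pd

# Made-up feature values, for illustration only.
toy_features_df = pd.DataFrame({
    'danceability': [0.2, 0.4, 0.5, 0.7, 0.9],
    'valence': [0.1, 0.5, 0.4, 0.6, 0.8],
})

# Pearson correlation between the two columns, a value between -1 and 1.
pearson = toy_features_df['danceability'].corr(toy_features_df['valence'])
print(round(pearson, 3))

# The full matrix for all numeric columns at once.
print(toy_features_df.corr())
```

On the real data the same call would be `tracks_with_features_df['danceability'].corr(tracks_with_features_df['valence'])`.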

And to check that the data is meaningful, I compared instrumentalness with speechiness, and those features looked mutually exclusive, as expected:

tracks_with_features_df.plot(kind='scatter', x='instrumentalness', y='speechiness')

Speechiness vs instrumentalness

Tracks difference and similarity

As I already had a bunch of features classifying tracks, it was hard not to make vectors out of them:

encode_fields = [
    'danceability',
    'energy',
    'key',
    'loudness',
    'mode',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness',
    'valence',
    'tempo',
    'duration_ms',
    'time_signature',
]

def encode(row):
    return np.array([
        (row[k] - tracks_with_features_df[k].min())
        / (tracks_with_features_df[k].max() - tracks_with_features_df[k].min())
        for k in encode_fields])

tracks_with_features_encoded_df = tracks_with_features_df.assign(
    encoded=tracks_with_features_df.apply(encode, axis=1))
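
The `encode` function above is min-max scaling: every feature is mapped to the range from 0 to 1, so that features with large raw values like duration_ms don't dominate the distance. The same scaling can be done for the whole frame at once. A sketch on a made-up `toy_df` with two features:

```python
import numpy as np
import pandas as pd

# Made-up feature values, for illustration only.
toy_df = pd.DataFrame({
    'tempo': [90.0, 120.0, 150.0],
    'loudness': [-12.0, -6.0, -9.0],
})
fields = ['tempo', 'loudness']

col_min = toy_df[fields].min()
col_range = toy_df[fields].max() - col_min
# Note: a constant column would have a range of 0 and produce NaN here.
scaled = (toy_df[fields] - col_min) / col_range

# One row per track, one column per scaled feature.
vectors = scaled.to_numpy()
print(vectors)
```

Computing the minimum and range once, instead of inside a per-row function, avoids rescanning every column for every track.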

Then I just calculated distance between every two tracks:

tracks_with_features_encoded_product_df = tracks_with_features_encoded_df \
    .assign(temp=0) \
    .merge(tracks_with_features_encoded_df.assign(temp=0), on='temp', how='left') \
    .drop(columns='temp')
tracks_with_features_encoded_product_df = tracks_with_features_encoded_product_df[
    tracks_with_features_encoded_product_df.id_x != tracks_with_features_encoded_product_df.id_y
]
tracks_with_features_encoded_product_df['merge_id'] = tracks_with_features_encoded_product_df \
    .apply(lambda row: ''.join(sorted([row['id_x'], row['id_y']])), axis=1)
tracks_with_features_encoded_product_df['distance'] = tracks_with_features_encoded_product_df \
    .apply(lambda row: np.linalg.norm(row['encoded_x'] - row['encoded_y']), axis=1)
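
The distance here is the Euclidean distance between two feature vectors. For a rough idea of what the cross join computes, the same pairwise distances can be sketched with NumPy broadcasting. The `vectors` array below is made up for illustration:

```python
import numpy as np

# Three made-up feature vectors, one row per track.
vectors = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [3.0, 5.0],
])

# diff[i, j] is vectors[i] - vectors[j]; the norm over the last axis
# gives an n x n matrix of Euclidean distances.
diff = vectors[:, None, :] - vectors[None, :, :]
distances = np.linalg.norm(diff, axis=-1)

# Each pair appears twice in the matrix, so look only above the diagonal.
rows, cols = np.triu_indices(len(vectors), k=1)
closest = distances[rows, cols].argmin()
closest_pair = (int(rows[closest]), int(cols[closest]))
print(distances)
print(closest_pair)
```

Taking only the upper triangle plays the same role as the merge_id column: it keeps one copy of every unordered pair.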

After that I was able to get the most similar songs, i.e. the pairs with the minimal distance, and it picked kind of similar songs:

tracks_with_features_encoded_product_df \
    .sort_values('distance') \
    .drop_duplicates('merge_id') \
    [['artist_x', 'name_x', 'release_date_x', 'artist_y', 'name_y', 'release_date_y', 'distance']] \
    .head(10)
artist_x name_x release_date_x artist_y name_y release_date_y distance
84370 Labyrinth Ear Wild Flowers 2010-11-21 Labyrinth Ear Navy Light 2010-11-21 0.000000
446773 YACHT I Thought the Future Would Be Cooler 2015-09-11 ADULT. Love Lies 2013-05-13 0.111393
21963 Ladytron Seventeen 2011-03-29 The Juan Maclean Give Me Every Little Thing 2005-07-04 0.125358
11480 Class Actress Careful What You Say 2010-02-09 MGMT Little Dark Age 2017-10-17 0.128865
261780 Queen of Japan I Was Made For Loving You 2001-10-02 Midnight Juggernauts Devil Within 2007-10-02 0.131304
63257 Pixies Bagboy 2013-09-09 Kindness That's Alright 2012-03-16 0.146897
265792 Datarock Computer Camp Love 2005-10-02 Chromeo Night By Night 2010-09-21 0.147235
75359 Midnight Juggernauts Devil Within 2007-10-02 Lykke Li I'm Good, I'm Gone 2008-01-28 0.152680
105246 ADULT. Love Lies 2013-05-13 Dr. Alban Sing Hallelujah! 1992-05-04 0.154475
285180 Gigamesh Don't Stop 2012-05-28 Pet Shop Boys Paninaro 95 - 2003 Remaster 2003-10-02 0.156469

The most different songs weren’t that fun, as two songs were too different from the rest:

tracks_with_features_encoded_product_df \
    .sort_values('distance', ascending=False) \
    .drop_duplicates('merge_id') \
    [['artist_x', 'name_x', 'release_date_x', 'artist_y', 'name_y', 'release_date_y', 'distance']] \
    .head(10)
artist_x name_x release_date_x artist_y name_y release_date_y distance
79324 Labyrinth Ear Navy Light 2010-11-21 Boy Harsher Modulations 2014-10-01 2.480206
84804 Labyrinth Ear Wild Flowers 2010-11-21 Boy Harsher Modulations 2014-10-01 2.480206
400840 Charlotte Gainsbourg Deadly Valentine - Soulwax Remix 2017-11-10 Labyrinth Ear Navy Light 2010-11-21 2.478183
84840 Labyrinth Ear Wild Flowers 2010-11-21 Charlotte Gainsbourg Deadly Valentine - Soulwax Remix 2017-11-10 2.478183
388510 Ladytron Paco! 2001-10-02 Labyrinth Ear Navy Light 2010-11-21 2.444927
388518 Ladytron Paco! 2001-10-02 Labyrinth Ear Wild Flowers 2010-11-21 2.444927
20665 Factory Floor Fall Back 2013-01-15 Labyrinth Ear Navy Light 2010-11-21 2.439136
20673 Factory Floor Fall Back 2013-01-15 Labyrinth Ear Wild Flowers 2010-11-21 2.439136
79448 Labyrinth Ear Navy Light 2010-11-21 La Femme Runway 2018-10-01 2.423574
84928 Labyrinth Ear Wild Flowers 2010-11-21 La Femme Runway 2018-10-01 2.423574

Then I calculated the most average songs, i.e. the songs with the smallest total distance to every other song:

tracks_with_features_encoded_product_df \
    .groupby(['artist_x', 'name_x', 'release_date_x'])['distance'] \
    .sum() \
    .reset_index() \
    .sort_values('distance') \
    .head(10)
     artist_x                  name_x                      release_date_x    distance
48   Beirut                    No Dice                     2009-02-17      638.331257
591  The Juan McLean           A Place Called Space        2014-09-15      643.436523
347  MGMT                      Little Dark Age             2017-10-17      645.959770
101  Class Actress             Careful What You Say        2010-02-09      646.488998
31   Architecture In Helsinki  2 Time                      2014-04-01      648.692344
588  The Juan Maclean          Give Me Every Little Thing  2005-07-04      648.878463
323  Lindstrøm                 Baby Can't Stop             2009-10-26      652.212858
307  Ladytron                  Seventeen                   2011-03-29      652.759843
310  Lauer                     Mirrors (feat. Jasnau)      2018-11-16      655.498535
451  Pet Shop Boys             Always on My Mind           1998-03-31      656.437048
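
Summing the distances per song is the same as summing the rows of a pairwise distance matrix: the row with the smallest sum is the most "central" item, and the row with the largest sum is the biggest outlier. A sketch with made-up one-dimensional `points` and hypothetical `names`:

```python
import numpy as np

# Made-up items with a single feature each, for illustration only.
names = ['a', 'b', 'c', 'd', 'e']
points = np.array([[0.0], [1.0], [2.0], [4.0], [10.0]])

# Pairwise Euclidean distances, then the total distance per item.
diff = points[:, None, :] - points[None, :, :]
distances = np.linalg.norm(diff, axis=-1)
totals = distances.sum(axis=1)

most_average = names[int(totals.argmin())]
most_outstanding = names[int(totals.argmax())]
print(totals)
print(most_average, most_outstanding)
```

The distance of an item to itself is zero, so including the diagonal in the sum doesn't change the ranking.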

And the total opposite, the songs that stand out the most:

tracks_with_features_encoded_product_df \
    .groupby(['artist_x', 'name_x', 'release_date_x'])['distance'] \
    .sum() \
    .reset_index() \
    .sort_values('distance', ascending=False) \
    .head(10)
     artist_x                name_x                     release_date_x     distance
665  YACHT                   Le Goudron - Long Version  2012-05-25      2823.572387
300  Labyrinth Ear           Navy Light                 2010-11-21      1329.234390
301  Labyrinth Ear           Wild Flowers               2010-11-21      1329.234390
57   Blonde Redhead          For the Damaged Coda       2000-06-06      1095.393120
616  The Velvet Underground  After Hours                1969-03-02      1080.491779
593  The Knife               Forest Families            2006-02-17      1040.114214
615  The Space Lady          Major Tom                  2013-11-18      1016.881467
107  CocoRosie               By Your Side               2004-03-09      1015.970860
170  El Perro Del Mar        Party                      2015-02-13      1012.163212
403  Mr.Kitty                XIII                       2014-10-06      1010.115117

Conclusion

Although the dataset is a bit small, it was still fun to have a look at the data.

The gist with a Jupyter notebook contains even more boring stuff and can be reused by changing the credentials.


