I’ve been using Spotify since 2013 as my main source of music, and back then the app automatically created a playlist for songs that I liked from artists’ radios. Out of inertia I’m still using that playlist to save songs I like. As the playlist has become a bit big and a bit old (6 years, huh), I’ve decided to try to analyze it.
Boring preparation
To get the data I used the Spotify API with spotipy as a Python client. I created an application in the Spotify Dashboard and gathered the credentials. Then I was able to initialize and authorize the client:
import spotipy
import spotipy.util as util

token = util.prompt_for_user_token(user_id,
                                   'playlist-read-collaborative',
                                   client_id=client_id,
                                   client_secret=client_secret,
                                   redirect_uri='http://localhost:8000/')
sp = spotipy.Spotify(auth=token)
Tracks metadata
As everything is inside just one playlist, it was easy to gather. The only problem was that the `user_playlist` method in spotipy doesn’t support pagination and can only return the first 100 tracks, but that was easily solved by going down to the private and undocumented `_get`:
import numpy as np
import pandas as pd
from dateutil.parser import parse as parse_date

playlist = sp.user_playlist(user_id, playlist_id)
tracks = playlist['tracks']['items']
next_uri = playlist['tracks']['next']
for _ in range(int(playlist['tracks']['total'] / playlist['tracks']['limit'])):
    response = sp._get(next_uri)
    tracks += response['items']
    next_uri = response['next']

tracks_df = pd.DataFrame([(track['track']['id'],
                           track['track']['artists'][0]['name'],
                           track['track']['name'],
                           parse_date(track['track']['album']['release_date']) if track['track']['album']['release_date'] else None,
                           parse_date(track['added_at']))
                          for track in tracks],
                         columns=['id', 'artist', 'name', 'release_date', 'added_at'])
tracks_df.head(10)
 | id | artist | name | release_date | added_at |
---|---|---|---|---|---|
0 | 1MLtdVIDLdupSO1PzNNIQg | Lindstrøm & Christabelle | Looking For What | 2009-12-11 | 2013-06-19 08:28:56+00:00 |
1 | 1gWsh0T1gi55K45TMGZxT0 | Au Revoir Simone | Knight Of Wands - Dam Mantle Remix | 2010-07-04 | 2013-06-19 08:48:30+00:00 |
2 | 0LE3YWM0W9OWputCB8Z3qt | Fever Ray | When I Grow Up - D. Lissvik Version | 2010-10-02 | 2013-06-19 22:09:15+00:00 |
3 | 5FyiyLzbZt41IpWyMuiiQy | Holy Ghost! | Dumb Disco Ideas | 2013-05-14 | 2013-06-19 22:12:42+00:00 |
4 | 5cgfva649kw89xznFpWCFd | Nouvelle Vague | Too Drunk To Fuck | 2004-11-01 | 2013-06-19 22:22:54+00:00 |
5 | 3IVc3QK63DngBdW7eVker2 | TR/ST | F.T.F. | 2012-11-16 | 2013-06-20 11:50:58+00:00 |
6 | 0mbpEDdZHNMEDll6woEy8W | Art Brut | My Little Brother | 2005-10-02 | 2013-06-20 13:58:19+00:00 |
7 | 2y8IhUDSpvsuuEePNLjGg5 | Niki & The Dove | Somebody (drum machine version) | 2011-06-14 | 2013-06-21 09:28:40+00:00 |
8 | 1X4RqFAShNL8aHfUIpjIVr | Gorillaz | Kids with Guns - Hot Chip Remix | 2007-11-19 | 2013-06-23 19:00:57+00:00 |
9 | 1cV4DVeAM5AstrDlXgvzJ7 | Lykke Li | I'm Good, I'm Gone | 2008-01-28 | 2013-06-23 22:31:52+00:00 |
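As an aside, spotipy also exposes a public `next()` method on the client that follows a paginated response’s `next` link, so the same loop can be written without reaching for the private `_get`. A minimal sketch, where `FakeSpotify` is a hypothetical stand-in client invented here so the logic runs offline:

```python
def collect_tracks(sp, user_id, playlist_id):
    """Gather all tracks of a playlist by following 'next' links."""
    playlist = sp.user_playlist(user_id, playlist_id)
    page = playlist['tracks']
    tracks = list(page['items'])
    while page['next']:          # Spotify sets 'next' to None on the last page
        page = sp.next(page)     # public spotipy helper that follows the link
        tracks += page['items']
    return tracks


class FakeSpotify:
    """Hypothetical stand-in serving two pages of two tracks each."""
    def __init__(self):
        self.pages = [
            {'items': ['t1', 't2'], 'next': 'https://api.spotify.com/...'},
            {'items': ['t3', 't4'], 'next': None},
        ]

    def user_playlist(self, user_id, playlist_id):
        return {'tracks': self.pages[0]}

    def next(self, page):
        return self.pages[1]


tracks = collect_tracks(FakeSpotify(), 'me', 'playlist')
# tracks now holds all four items across both pages
```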
The first naive idea was to look at the most frequently appearing artists:
tracks_df \
.groupby('artist') \
.count()['id'] \
.reset_index() \
.sort_values('id', ascending=False) \
.rename(columns={'id': 'amount'}) \
.head(10)
 | artist | amount |
---|---|---|
260 | Pet Shop Boys | 12 |
334 | The Knife | 11 |
213 | Metronomy | 9 |
303 | Soulwax | 8 |
284 | Röyksopp | 7 |
180 | Ladytron | 7 |
94 | Depeche Mode | 7 |
113 | Fever Ray | 6 |
324 | The Chemical Brothers | 6 |
233 | New Order | 6 |
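For comparison, the same ranking can be computed a bit more directly with `value_counts`; a sketch on a tiny hypothetical frame, not the real playlist data:

```python
import pandas as pd

# Toy frame standing in for tracks_df; value_counts() collapses the
# groupby/count/sort chain into a single call.
toy_df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'artist': ['Pet Shop Boys', 'The Knife', 'Pet Shop Boys', 'Metronomy'],
})
top = toy_df.artist.value_counts() \
    .rename_axis('artist') \
    .reset_index(name='amount')
```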
But as tastes change, I decided to get the top five artists from each year and check whether I was adding them to the playlist in other years as well:
counted_year_df = tracks_df \
.assign(year_added=tracks_df.added_at.dt.year) \
.groupby(['artist', 'year_added']) \
.count()['id'] \
.reset_index() \
.rename(columns={'id': 'amount'}) \
.sort_values('amount', ascending=False)
in_top_5_year_artist = counted_year_df \
.groupby('year_added') \
.head(5) \
.artist \
.unique()
counted_year_df \
[counted_year_df.artist.isin(in_top_5_year_artist)] \
.pivot('artist', 'year_added', 'amount') \
.fillna(0) \
.style.background_gradient()
artist / year_added | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|---|
Arcade Fire | 2 | 0 | 0 | 1 | 3 | 0 | 0 |
Clinic | 1 | 0 | 0 | 2 | 0 | 0 | 1 |
Crystal Castles | 0 | 0 | 2 | 2 | 0 | 0 | 0 |
Depeche Mode | 1 | 0 | 3 | 1 | 0 | 2 | 0 |
Die Antwoord | 1 | 4 | 0 | 0 | 0 | 1 | 0 |
FM Belfast | 3 | 3 | 0 | 0 | 0 | 0 | 0 |
Factory Floor | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
Fever Ray | 3 | 1 | 1 | 0 | 1 | 0 | 0 |
Grimes | 1 | 0 | 3 | 1 | 0 | 0 | 0 |
Holy Ghost! | 1 | 0 | 0 | 0 | 3 | 1 | 1 |
Joe Goddard | 0 | 0 | 0 | 0 | 3 | 1 | 0 |
John Maus | 0 | 0 | 4 | 0 | 0 | 0 | 1 |
KOMPROMAT | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
LCD Soundsystem | 0 | 0 | 1 | 0 | 3 | 0 | 0 |
Ladytron | 5 | 1 | 0 | 0 | 0 | 1 | 0 |
Lindstrøm | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
Marie Davidson | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
Metronomy | 0 | 1 | 0 | 6 | 0 | 1 | 1 |
Midnight Magic | 0 | 4 | 0 | 0 | 1 | 0 | 0 |
Mr. Oizo | 0 | 0 | 0 | 1 | 0 | 3 | 0 |
New Order | 1 | 5 | 0 | 0 | 0 | 0 | 0 |
Pet Shop Boys | 0 | 12 | 0 | 0 | 0 | 0 | 0 |
Röyksopp | 0 | 4 | 0 | 3 | 0 | 0 | 0 |
Schwefelgelb | 0 | 0 | 0 | 0 | 1 | 0 | 4 |
Soulwax | 0 | 0 | 0 | 0 | 5 | 3 | 0 |
Talking Heads | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
The Chemical Brothers | 0 | 0 | 2 | 0 | 1 | 0 | 3 |
The Fall | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
The Knife | 5 | 1 | 3 | 1 | 0 | 0 | 1 |
The Normal | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
The Prodigy | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
Vitalic | 0 | 0 | 0 | 0 | 2 | 2 | 0 |
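The pivot step that produces a table like this can be sketched on toy data; the keyword form of `pivot` is also safer on newer pandas versions, where the positional form used above no longer works:

```python
import pandas as pd

# Hypothetical per-year counts standing in for counted_year_df.
counted = pd.DataFrame({
    'artist': ['A', 'A', 'B'],
    'year_added': [2013, 2014, 2013],
    'amount': [2, 1, 3],
})
# Artists become rows, years become columns, missing cells become 0.
table = counted.pivot(index='artist', columns='year_added', values='amount').fillna(0)
```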
As a bunch of artists reappeared in different years, I decided to check whether that correlates with new releases, so I looked at the last ten years:
counted_release_year_df = tracks_df \
.assign(year_added=tracks_df.added_at.dt.year,
year_released=tracks_df.release_date.dt.year) \
.groupby(['year_released', 'year_added']) \
.count()['id'] \
.reset_index() \
.rename(columns={'id': 'amount'}) \
.sort_values('amount', ascending=False)
counted_release_year_df \
[counted_release_year_df.year_released.isin(
sorted(tracks_df.release_date.dt.year.unique())[-11:]
)] \
.pivot('year_released', 'year_added', 'amount') \
.fillna(0) \
.style.background_gradient()
year_released / year_added | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|---|
2010.0 | 19 | 8 | 2 | 10 | 6 | 5 | 10 |
2011.0 | 14 | 10 | 4 | 6 | 5 | 5 | 5 |
2012.0 | 11 | 15 | 6 | 5 | 8 | 2 | 0 |
2013.0 | 28 | 17 | 3 | 6 | 5 | 4 | 2 |
2014.0 | 0 | 30 | 2 | 1 | 0 | 10 | 1 |
2015.0 | 0 | 0 | 15 | 5 | 8 | 7 | 9 |
2016.0 | 0 | 0 | 0 | 8 | 7 | 4 | 5 |
2017.0 | 0 | 0 | 0 | 0 | 23 | 5 | 5 |
2018.0 | 0 | 0 | 0 | 0 | 0 | 4 | 8 |
2019.0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 |
Audio features
The Spotify API has an endpoint that provides features like danceability, energy, and loudness for tracks. So I gathered the features for all tracks from the playlist:
features = []
for n, chunk_series in tracks_df.groupby(np.arange(len(tracks_df)) // 50).id:
    features += sp.audio_features([*map(str, chunk_series)])

features_df = pd.DataFrame.from_dict(filter(None, features))
tracks_with_features_df = tracks_df.merge(features_df, on=['id'], how='inner')
tracks_with_features_df.head()
 | id | artist | name | release_date | added_at | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1MLtdVIDLdupSO1PzNNIQg | Lindstrøm & Christabelle | Looking For What | 2009-12-11 | 2013-06-19 08:28:56+00:00 | 0.566 | 0.726 | 0 | -11.294 | 1 | 0.1120 | 0.04190 | 0.494000 | 0.282 | 0.345 | 120.055 | 359091 | 4 |
1 | 1gWsh0T1gi55K45TMGZxT0 | Au Revoir Simone | Knight Of Wands - Dam Mantle Remix | 2010-07-04 | 2013-06-19 08:48:30+00:00 | 0.563 | 0.588 | 4 | -7.205 | 0 | 0.0637 | 0.00573 | 0.932000 | 0.104 | 0.467 | 89.445 | 237387 | 4 |
2 | 0LE3YWM0W9OWputCB8Z3qt | Fever Ray | When I Grow Up - D. Lissvik Version | 2010-10-02 | 2013-06-19 22:09:15+00:00 | 0.687 | 0.760 | 5 | -6.236 | 1 | 0.0479 | 0.01160 | 0.007680 | 0.417 | 0.818 | 92.007 | 270120 | 4 |
3 | 5FyiyLzbZt41IpWyMuiiQy | Holy Ghost! | Dumb Disco Ideas | 2013-05-14 | 2013-06-19 22:12:42+00:00 | 0.752 | 0.831 | 10 | -4.407 | 1 | 0.0401 | 0.00327 | 0.729000 | 0.105 | 0.845 | 124.234 | 483707 | 4 |
4 | 5cgfva649kw89xznFpWCFd | Nouvelle Vague | Too Drunk To Fuck | 2004-11-01 | 2013-06-19 22:22:54+00:00 | 0.461 | 0.786 | 7 | -6.950 | 1 | 0.0467 | 0.47600 | 0.000003 | 0.495 | 0.808 | 159.882 | 136160 | 4 |
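The loop above splits the ids into batches of 50 because the audio-features endpoint accepts only a limited number of ids per call; the integer-division groupby trick can be sketched on its own (chunk size 3 and hypothetical ids here):

```python
import numpy as np
import pandas as pd

# np.arange(len(ids)) // 3 maps positions 0..6 to group labels
# [0, 0, 0, 1, 1, 1, 2], so groupby yields consecutive chunks of <= 3.
ids = pd.Series([f'id{i}' for i in range(7)])
chunks = [list(chunk) for _, chunk in ids.groupby(np.arange(len(ids)) // 3)]
```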
After that I checked how the features changed over time; only instrumentalness showed a visible difference:
sns.boxplot(x=tracks_with_features_df.added_at.dt.year,
y=tracks_with_features_df.instrumentalness)
Then I had an idea to check seasonality against valence, and it kind of showed that valence is a bit lower in the more depressing months:
sns.boxplot(x=tracks_with_features_df.added_at.dt.month,
y=tracks_with_features_df.valence)
To play a bit more with the data, I decided to check whether danceability and valence correlate:
tracks_with_features_df.plot(kind='scatter', x='danceability', y='valence')
And to check that the data is meaningful, I compared instrumentalness with speechiness; those features looked mutually exclusive, as expected:
tracks_with_features_df.plot(kind='scatter', x='instrumentalness', y='speechiness')
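The scatter plots give a visual impression; a quick numeric complement is the Pearson correlation via `Series.corr` (toy numbers below, not the real features):

```python
import pandas as pd

# Hypothetical feature values; corr() computes Pearson's r by default.
toy = pd.DataFrame({'danceability': [0.2, 0.5, 0.8, 0.9],
                    'valence': [0.1, 0.4, 0.7, 0.95]})
r = toy.danceability.corr(toy.valence)
```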
Tracks difference and similarity
As I already had a bunch of features classifying tracks, it was hard not to make vectors out of them:
encode_fields = [
'danceability',
'energy',
'key',
'loudness',
'mode',
'speechiness',
'acousticness',
'instrumentalness',
'liveness',
'valence',
'tempo',
'duration_ms',
'time_signature',
]
def encode(row):
    return np.array([
        (row[k] - tracks_with_features_df[k].min())
        / (tracks_with_features_df[k].max() - tracks_with_features_df[k].min())
        for k in encode_fields])

tracks_with_features_encoded_df = tracks_with_features_df.assign(
    encoded=tracks_with_features_df.apply(encode, axis=1))
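The `encode` function is plain per-column min-max normalization; the same idea on a tiny hypothetical frame, so features on very different scales (tempo vs. energy) end up comparable:

```python
import numpy as np
import pandas as pd

# Toy frame with two features on different scales.
toy = pd.DataFrame({'tempo': [100.0, 120.0, 140.0],
                    'energy': [0.2, 0.5, 0.8]})

def encode_toy(row):
    # Rescale each feature to [0, 1] using its column's min and max.
    return np.array([(row[k] - toy[k].min()) / (toy[k].max() - toy[k].min())
                     for k in ['tempo', 'energy']])

encoded = toy.apply(encode_toy, axis=1)  # Series of per-row vectors
```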
Then I just calculated the distance between every pair of tracks:
tracks_with_features_encoded_product_df = tracks_with_features_encoded_df \
.assign(temp=0) \
.merge(tracks_with_features_encoded_df.assign(temp=0), on='temp', how='left') \
.drop(columns='temp')
tracks_with_features_encoded_product_df = tracks_with_features_encoded_product_df[
tracks_with_features_encoded_product_df.id_x != tracks_with_features_encoded_product_df.id_y
]
tracks_with_features_encoded_product_df['merge_id'] = tracks_with_features_encoded_product_df \
.apply(lambda row: ''.join(sorted([row['id_x'], row['id_y']])), axis=1)
tracks_with_features_encoded_product_df['distance'] = tracks_with_features_encoded_product_df \
.apply(lambda row: np.linalg.norm(row['encoded_x'] - row['encoded_y']), axis=1)
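Building a full cross join just for distances is O(n²) in both rows and memory; for what it’s worth, the same Euclidean distances can be computed in one shot with numpy broadcasting (toy vectors below):

```python
import numpy as np

# Three toy encoded vectors; broadcasting builds all pairwise differences,
# and the norm along the last axis gives the full distance matrix,
# equivalent to np.linalg.norm over the cross-joined rows.
vectors = np.array([[0.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 1.0]])
diff = vectors[:, None, :] - vectors[None, :, :]
distances = np.linalg.norm(diff, axis=-1)  # shape (3, 3), zero diagonal
```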
After that I was able to get the most similar songs, i.e. the pairs with the minimal distance, and they did turn out kind of similar:
tracks_with_features_encoded_product_df \
.sort_values('distance') \
.drop_duplicates('merge_id') \
[['artist_x', 'name_x', 'release_date_x', 'artist_y', 'name_y', 'release_date_y', 'distance']] \
.head(10)
 | artist_x | name_x | release_date_x | artist_y | name_y | release_date_y | distance |
---|---|---|---|---|---|---|---|
84370 | Labyrinth Ear | Wild Flowers | 2010-11-21 | Labyrinth Ear | Navy Light | 2010-11-21 | 0.000000 |
446773 | YACHT | I Thought the Future Would Be Cooler | 2015-09-11 | ADULT. | Love Lies | 2013-05-13 | 0.111393 |
21963 | Ladytron | Seventeen | 2011-03-29 | The Juan Maclean | Give Me Every Little Thing | 2005-07-04 | 0.125358 |
11480 | Class Actress | Careful What You Say | 2010-02-09 | MGMT | Little Dark Age | 2017-10-17 | 0.128865 |
261780 | Queen of Japan | I Was Made For Loving You | 2001-10-02 | Midnight Juggernauts | Devil Within | 2007-10-02 | 0.131304 |
63257 | Pixies | Bagboy | 2013-09-09 | Kindness | That's Alright | 2012-03-16 | 0.146897 |
265792 | Datarock | Computer Camp Love | 2005-10-02 | Chromeo | Night By Night | 2010-09-21 | 0.147235 |
75359 | Midnight Juggernauts | Devil Within | 2007-10-02 | Lykke Li | I'm Good, I'm Gone | 2008-01-28 | 0.152680 |
105246 | ADULT. | Love Lies | 2013-05-13 | Dr. Alban | Sing Hallelujah! | 1992-05-04 | 0.154475 |
285180 | Gigamesh | Don't Stop | 2012-05-28 | Pet Shop Boys | Paninaro 95 - 2003 Remaster | 2003-10-02 | 0.156469 |
The most different songs weren’t that fun, as two songs were too different from the rest:
tracks_with_features_encoded_product_df \
.sort_values('distance', ascending=False) \
.drop_duplicates('merge_id') \
[['artist_x', 'name_x', 'release_date_x', 'artist_y', 'name_y', 'release_date_y', 'distance']] \
.head(10)
 | artist_x | name_x | release_date_x | artist_y | name_y | release_date_y | distance |
---|---|---|---|---|---|---|---|
79324 | Labyrinth Ear | Navy Light | 2010-11-21 | Boy Harsher | Modulations | 2014-10-01 | 2.480206 |
84804 | Labyrinth Ear | Wild Flowers | 2010-11-21 | Boy Harsher | Modulations | 2014-10-01 | 2.480206 |
400840 | Charlotte Gainsbourg | Deadly Valentine - Soulwax Remix | 2017-11-10 | Labyrinth Ear | Navy Light | 2010-11-21 | 2.478183 |
84840 | Labyrinth Ear | Wild Flowers | 2010-11-21 | Charlotte Gainsbourg | Deadly Valentine - Soulwax Remix | 2017-11-10 | 2.478183 |
388510 | Ladytron | Paco! | 2001-10-02 | Labyrinth Ear | Navy Light | 2010-11-21 | 2.444927 |
388518 | Ladytron | Paco! | 2001-10-02 | Labyrinth Ear | Wild Flowers | 2010-11-21 | 2.444927 |
20665 | Factory Floor | Fall Back | 2013-01-15 | Labyrinth Ear | Navy Light | 2010-11-21 | 2.439136 |
20673 | Factory Floor | Fall Back | 2013-01-15 | Labyrinth Ear | Wild Flowers | 2010-11-21 | 2.439136 |
79448 | Labyrinth Ear | Navy Light | 2010-11-21 | La Femme | Runway | 2018-10-01 | 2.423574 |
84928 | Labyrinth Ear | Wild Flowers | 2010-11-21 | La Femme | Runway | 2018-10-01 | 2.423574 |
Then I calculated the most average songs, i.e. the songs with the smallest total distance to every other song:
tracks_with_features_encoded_product_df \
.groupby(['artist_x', 'name_x', 'release_date_x']) \
.sum()['distance'] \
.reset_index() \
.sort_values('distance') \
.head(10)
 | artist_x | name_x | release_date_x | distance |
---|---|---|---|---|
48 | Beirut | No Dice | 2009-02-17 | 638.331257 |
591 | The Juan McLean | A Place Called Space | 2014-09-15 | 643.436523 |
347 | MGMT | Little Dark Age | 2017-10-17 | 645.959770 |
101 | Class Actress | Careful What You Say | 2010-02-09 | 646.488998 |
31 | Architecture In Helsinki | 2 Time | 2014-04-01 | 648.692344 |
588 | The Juan Maclean | Give Me Every Little Thing | 2005-07-04 | 648.878463 |
323 | Lindstrøm | Baby Can't Stop | 2009-10-26 | 652.212858 |
307 | Ladytron | Seventeen | 2011-03-29 | 652.759843 |
310 | Lauer | Mirrors (feat. Jasnau) | 2018-11-16 | 655.498535 |
451 | Pet Shop Boys | Always on My Mind | 1998-03-31 | 656.437048 |
And the total opposite, the most outstanding songs:
tracks_with_features_encoded_product_df \
.groupby(['artist_x', 'name_x', 'release_date_x']) \
.sum()['distance'] \
.reset_index() \
.sort_values('distance', ascending=False) \
.head(10)
 | artist_x | name_x | release_date_x | distance |
---|---|---|---|---|
665 | YACHT | Le Goudron - Long Version | 2012-05-25 | 2823.572387 |
300 | Labyrinth Ear | Navy Light | 2010-11-21 | 1329.234390 |
301 | Labyrinth Ear | Wild Flowers | 2010-11-21 | 1329.234390 |
57 | Blonde Redhead | For the Damaged Coda | 2000-06-06 | 1095.393120 |
616 | The Velvet Underground | After Hours | 1969-03-02 | 1080.491779 |
593 | The Knife | Forest Families | 2006-02-17 | 1040.114214 |
615 | The Space Lady | Major Tom | 2013-11-18 | 1016.881467 |
107 | CocoRosie | By Your Side | 2004-03-09 | 1015.970860 |
170 | El Perro Del Mar | Party | 2015-02-13 | 1012.163212 |
403 | Mr.Kitty | XIII | 2014-10-06 | 1010.115117 |
Conclusion
Although the dataset is a bit small, it was still fun to have a look at the data.
The gist with the Jupyter notebook (with even more boring stuff) can be reused by modifying the credentials.