The Last.fm 1K User Dataset is a collection of 1000 users' music listening habits on Last.fm, originally collected and released by Òscar Celma in 2010. The official release and README of the project can be found at http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html.
This repository hosts the Last.fm Dataset - 1K users under the same license and terms as in the offical README, copied for convenience below.
This is a fork of eifuentes' repository and builds on their additions to the original dataset.
My changes are based on eifuentes' parquet files of the dataset. Please see their README about the changes they made.
My use case required album infomation for the listened tracks. For every listened track with a musicbrainz ID, I queried musicbrainz for the corresponding album information. Note that a track may appear on more than one album. I always chose the first album returned. This may not be identical to the album the user actually listened to.
The following example rows show the new table schema and possible values for each column.
Length is in milliseconds.
user_id | timestamp | artist_id | artist_name | track_id | track_name | album_id | album_name | album_artist_name | length | tracknumber |
---|---|---|---|---|---|---|---|---|---|---|
user_000001 | 2006-08-13 13:59:20+00:00 | 09a114d9-7723-4e14-b524-379697f6d2b5 | Plaid & Bob Jaroc | c4633ab1-e715-477f-8685-afa5f2058e42 | The Launching Of Big Face | b5d8192f-6924-4e2c-952b-b10b69eb2cfa | Greedy Baby | Plaid & Bob Jaroc | 254173 | 3 |
user_000001 | 2006-08-21 15:14:58+00:00 | 7b42e4cf-650b-4fe0-bc7d-c95e31247729 | Peven Everett | f548d49d-4239-4066-b867-f88419c14e3c | I Can Give It | 9fd11527-90ac-4c2d-ae37-40e1570d5011 | GP03 | Various Artists | 235360 | 4 |
user_000001 | 2006-08-21 15:18:49+00:00 | e757253a-85cc-4892-a2e4-c88c3948b670 | Shaun Escoffery | 413ee492-9d5e-488f-94c4-993d6a54b55e | Let It Go | 9fd11527-90ac-4c2d-ae37-40e1570d5011 | GP03 | Various Artists | 265240 | 5 |
user_000587 | 2009-01-28 12:20:14+00:00 | Nujabes + Fat Jon | 5. Mystline | |||||||
user_000647 | 2007-06-20 20:06:50+00:00 | 03ad1736-b7c9-412a-b442-82536d63a5c4 | Elliott Smith | 8a8b9728-b745-4054-81e4-5b1fcdcd13cd | L.A. | 0 | 0 |
- Row 1: musicbrainz id for the track was available and an album was found
- Row 2 and 3: MBID was available and a compilation album was found for both tracks
- Row 4: No MBID was available so no query for album information was made
- Row 5: MBID was available but no official release contains this track so no album information was saved
- As mentioned above, the album information is not entirely accurate. Since a track may appear on several albums and it's not possible to reconstruct with the available data which album the user was listening to, I just chose any album this track appears on. This is sufficent for my use case. It may not be for yours.
- I only queried album information for tracks that had a musicbrainz ID. It's certainly possible to query musicbrainz for all other tracks as well and a good number of them are probably in their database. I decided against it because incomplete data is to be expected for user-submitted information.
- So far I have only processed the smaller, educational subset of 50 users.
The provided source code may be sufficient to process the larger dataset
of 1000 users, though two issues kept popping up:
- I ran into serious memory issues while trying my code with the larger dataset. Since I only needed the smaller dataset, I have not investigated the memory issues further. Any help with this would be much appreciated!
- The runtime is atrocious. Processing the small subset took about 44 hours. Thanks to caching of musicbrainz queries, the runtime should be less than linear as long as tracks are repeated throughout the dataset. Parallelization and async calls would probably help a lot.
If you only need the data, just grab one of the files from the release.
If you want to extend a dataset yourself, you can run the extender.
The extender reads in a parquet file of the original data without album information, queries musicbrainz for information and outputs a new parquet file with the additional information.
Install the requirements:
$ pip install -r requirements.txt
Then run the main module (from the root of the repository):
$ python -m lastfm-dataset-1K-extended.main --help
Usage: python -m lastfm-dataset-1K-extended.main [OPTIONS]
Options:
-i, --infile TEXT Input dataset. [required]
-o, --outfile TEXT Output file.
-s, --skip INTEGER Number of rows to skip.
--help Show this message and exit.
-i
/--infile
is required and should point to the parquet file
you want to extend.
-o
/--outfile
is optional and will be the filename of the extended dataset.
-s
/--skip
is useful to continue the extension run after an interruption
(the new parquet file is auto-saved after every 100 entries, so you might want
to round down to the last multiple of 100).
Install parquet-tools
:
$ pip install parquet-tools
And run:
$ parquet-tools csv <parquet file>
Install sqlite-utils
:
$ pip install sqlite-utils
And run:
$ sqlite-utils insert <output database file> <table name> <input csv file> --csv
For ease of use, I decided to create a normalized sqlite database.
sqlite-utils
can be used for that, see
here
for documentation.
Table listens
user_id | timestamp | artist_id | artist_name | track_id | track_name | album_id | album_name | album_artist_name | length | tracknumber |
---|---|---|---|---|---|---|---|---|---|---|
user_000001 | 2006-08-21 15:14:58+00:00 | 7b42e4cf-650b-4fe0-bc7d-c95e31247729 | Peven Everett | f548d49d-4239-4066-b867-f88419c14e3c | I Can Give It | 9fd11527-90ac-4c2d-ae37-40e1570d5011 | GP03 | Various Artists | 235360 | 4 |
id | name |
---|---|
1 | user_000001 |
CREATE TABLE [users] (
[id] INTEGER PRIMARY KEY,
[name] TEXT
)
id | mbid | name |
---|---|---|
1 | 7b42e4cf-650b-4fe0-bc7d-c95e31247729 | Peven Everett |
2 | Various Artists |
CREATE TABLE [artists] (
[id] INTEGER PRIMARY KEY,
[mbid] TEXT,
[name] TEXT
)
id | mbid | title | artist_id |
---|---|---|---|
1 | 9fd11527-90ac-4c2d-ae37-40e1570d5011 | GP03 | 2 |
CREATE TABLE [albums] (
[id] INTEGER PRIMARY KEY,
[mbid] TEXT,
[title] TEXT,
[artist_id] TEXT,
FOREIGN KEY([artist_id]) REFERENCES [artists]([id])
)
id | artist_id | mbid | title | album_id | length | tracknumber |
---|---|---|---|---|---|---|
1 | 1 | f548d49d-4239-4066-b867-f88419c14e3c | I Can Give It | 1 | 235360 | 4 |
CREATE TABLE [tracks] (
[id] INTEGER PRIMARY KEY,
[artist_id] INTEGER,
[mbid] TEXT,
[title] TEXT,
[album_id] INTEGER,
[length] INTEGER,
[tracknumber] INTEGER,
FOREIGN KEY([artist_id]) REFERENCES [artists]([id]),
FOREIGN KEY([album_id]) REFERENCES [albums]([id])
)
user_id | timestamp | track_id |
---|---|---|
1 | 2006-08-21 15:14:58+00:00 | 1 |
CREATE TABLE "listens" (
[user_id] INTEGER REFERENCES [users]([id]),
[timestamp] TEXT,
[track_id] INTEGER,
FOREIGN KEY([track_id]) REFERENCES [tracks]([id])
)
Version 1.0, May 2010
This dataset contains user, timestamp, artist, song tuples collected from Last.fm API, using the user.getRecentTracks() method.
This dataset represents the whole listening habits (till May, 5th 2009) for nearly 1,000 users.
Filename MD5 userid-timestamp-artid-artname-traid-traname.tsv 64747b21563e3d2aa95751e0ddc46b68 userid-profile.tsv c53608b6b445db201098c1489ea497df
Element Statistic Total Lines 19,150,868 Unique Users 992 Artists with MBID 107,528 Artists without MBDID 69,420 The data is formatted one entry per line as follows (tab separated,
\t
)userid \t timestamp \t musicbrainz-artist-id \t artist-name \t musicbrainz-track-id \t track-name
userid \t gender ('m'|'f'|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)
user_000639 \t 2009-04-08T01:57:47Z \t MBID \t The Dogs D'Amour \t MBID \t Fall in Love Again? user_000639 \t 2009-04-08T01:53:56Z \t MBID \t The Dogs D'Amour \t MBID \t Wait Until I'm Dead ...
user_000639 \t m \t Mexico \t Apr 27, 2005 ...
The data contained in
lastfm-dataset-1K.tar.gz
is distributed with permission of Last.fm.The data is made available for non-commercial use.
Those interested in using the data or web services in a commercial context should contact:
partners [at] last [dot] fm.
For more information see Last.fm terms of service.
Thanks to Last.fm for providing the access to this data via their web services.
Special thanks to Norman Casagrande.
When using this dataset you must reference the Last.fm webpage.
Optionally (not mandatory at all!), you can cite Chapter 3 of this book
@book{Celma:Springer2010, author = {Celma, O.}, title = {{Music Recommendation and Discovery in the Long Tail}}, publisher = {Springer}, year = {2010} }
This data was collected by Òscar Celma @ MTG/UPF