Last.fm 1K User Dataset Extended

What is this

The Last.fm 1K User Dataset is a collection of 1000 users' music listening habits on Last.fm, originally collected and released by Òscar Celma in 2010. The official release and README of the project can be found at http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html.

This repository hosts the Last.fm Dataset - 1K users under the same license and terms as in the offical README, copied for convenience below.

This is a fork of eifuentes' repository and builds on their additions to the original dataset.

Changes to the Dataset

My changes are based on eifuentes' parquet files of the dataset. Please see their README about the changes they made.

My use case required album infomation for the listened tracks. For every listened track with a musicbrainz ID, I queried musicbrainz for the corresponding album information. Note that a track may appear on more than one album. I always chose the first album returned. This may not be identical to the album the user actually listened to.

Examples

The following example rows show the new table schema and possible values for each column.

Length is in milliseconds.

user_id	timestamp	artist_id	artist_name	track_id	track_name	album_id	album_name	album_artist_name	length	tracknumber
user_000001	2006-08-13 13:59:20+00:00	09a114d9-7723-4e14-b524-379697f6d2b5	Plaid & Bob Jaroc	c4633ab1-e715-477f-8685-afa5f2058e42	The Launching Of Big Face	b5d8192f-6924-4e2c-952b-b10b69eb2cfa	Greedy Baby	Plaid & Bob Jaroc	254173	3
user_000001	2006-08-21 15:14:58+00:00	7b42e4cf-650b-4fe0-bc7d-c95e31247729	Peven Everett	f548d49d-4239-4066-b867-f88419c14e3c	I Can Give It	9fd11527-90ac-4c2d-ae37-40e1570d5011	GP03	Various Artists	235360	4
user_000001	2006-08-21 15:18:49+00:00	e757253a-85cc-4892-a2e4-c88c3948b670	Shaun Escoffery	413ee492-9d5e-488f-94c4-993d6a54b55e	Let It Go	9fd11527-90ac-4c2d-ae37-40e1570d5011	GP03	Various Artists	265240	5
user_000587	2009-01-28 12:20:14+00:00		Nujabes + Fat Jon		5. Mystline
user_000647	2007-06-20 20:06:50+00:00	03ad1736-b7c9-412a-b442-82536d63a5c4	Elliott Smith	8a8b9728-b745-4054-81e4-5b1fcdcd13cd	L.A.				0	0

Row 1: musicbrainz id for the track was available and an album was found
Row 2 and 3: MBID was available and a compilation album was found for both tracks
Row 4: No MBID was available so no query for album information was made
Row 5: MBID was available but no official release contains this track so no album information was saved

Caveats

As mentioned above, the album information is not entirely accurate. Since a track may appear on several albums and it's not possible to reconstruct with the available data which album the user was listening to, I just chose any album this track appears on. This is sufficent for my use case. It may not be for yours.
I only queried album information for tracks that had a musicbrainz ID. It's certainly possible to query musicbrainz for all other tracks as well and a good number of them are probably in their database. I decided against it because incomplete data is to be expected for user-submitted information.
So far I have only processed the smaller, educational subset of 50 users. The provided source code may be sufficient to process the larger dataset of 1000 users, though two issues kept popping up:
- I ran into serious memory issues while trying my code with the larger dataset. Since I only needed the smaller dataset, I have not investigated the memory issues further. Any help with this would be much appreciated!
- The runtime is atrocious. Processing the small subset took about 44 hours. Thanks to caching of musicbrainz queries, the runtime should be less than linear as long as tracks are repeated throughout the dataset. Parallelization and async calls would probably help a lot.

How to use

If you only need the data, just grab one of the files from the release.

If you want to extend a dataset yourself, you can run the extender.

The extender reads in a parquet file of the original data without album information, queries musicbrainz for information and outputs a new parquet file with the additional information.

Install the requirements:

$ pip install -r requirements.txt

Then run the main module (from the root of the repository):

$ python -m lastfm-dataset-1K-extended.main --help
Usage: python -m lastfm-dataset-1K-extended.main [OPTIONS]

Options:
  -i, --infile TEXT   Input dataset.  [required]
  -o, --outfile TEXT  Output file.
  -s, --skip INTEGER  Number of rows to skip.
  --help              Show this message and exit.

-i/--infile is required and should point to the parquet file you want to extend.

-o/--outfile is optional and will be the filename of the extended dataset.

-s/--skip is useful to continue the extension run after an interruption (the new parquet file is auto-saved after every 100 entries, so you might want to round down to the last multiple of 100).

Converting file types

parquet to csv

Install parquet-tools:

$ pip install parquet-tools

And run:

$ parquet-tools csv <parquet file>

csv to sqlite

Install sqlite-utils:

$ pip install sqlite-utils

And run:

$ sqlite-utils insert <output database file> <table name> <input csv file> --csv

Normalizing the database

For ease of use, I decided to create a normalized sqlite database. sqlite-utils can be used for that, see here for documentation.

Table layout before normalization:

Table listens

user_id	timestamp	artist_id	artist_name	track_id	track_name	album_id	album_name	album_artist_name	length	tracknumber
user_000001	2006-08-21 15:14:58+00:00	7b42e4cf-650b-4fe0-bc7d-c95e31247729	Peven Everett	f548d49d-4239-4066-b867-f88419c14e3c	I Can Give It	9fd11527-90ac-4c2d-ae37-40e1570d5011	GP03	Various Artists	235360	4

Table layout after normalization:

Table `users`

id	name
1	user_000001

CREATE TABLE [users] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT
)

Table `artists`

id	mbid	name
1	7b42e4cf-650b-4fe0-bc7d-c95e31247729	Peven Everett
2		Various Artists

CREATE TABLE [artists] (
   [id] INTEGER PRIMARY KEY,
   [mbid] TEXT,
   [name] TEXT
)

Table `albums`

id	mbid	title	artist_id
1	9fd11527-90ac-4c2d-ae37-40e1570d5011	GP03	2

CREATE TABLE [albums] (
   [id] INTEGER PRIMARY KEY,
   [mbid] TEXT,
   [title] TEXT,
   [artist_id] TEXT,
   FOREIGN KEY([artist_id]) REFERENCES [artists]([id])
)

Table `tracks`

id	artist_id	mbid	title	album_id	length	tracknumber
1	1	f548d49d-4239-4066-b867-f88419c14e3c	I Can Give It	1	235360	4

CREATE TABLE [tracks] (
   [id] INTEGER PRIMARY KEY,
   [artist_id] INTEGER,
   [mbid] TEXT,
   [title] TEXT,
   [album_id] INTEGER,
   [length] INTEGER,
   [tracknumber] INTEGER,
   FOREIGN KEY([artist_id]) REFERENCES [artists]([id]),
   FOREIGN KEY([album_id]) REFERENCES [albums]([id])
)

Table `listens`

user_id	timestamp	track_id
1	2006-08-21 15:14:58+00:00	1

CREATE TABLE "listens" (
   [user_id] INTEGER REFERENCES [users]([id]),
   [timestamp] TEXT,
   [track_id] INTEGER,
   FOREIGN KEY([track_id]) REFERENCES [tracks]([id])
)

Original README

README

Version 1.0, May 2010

What is this?

This dataset contains user, timestamp, artist, song tuples collected from Last.fm API, using the user.getRecentTracks() method.

This dataset represents the whole listening habits (till May, 5th 2009) for nearly 1,000 users.

Files

Filename MD5

userid-timestamp-artid-artname-traid-traname.tsv 64747b21563e3d2aa95751e0ddc46b68

userid-profile.tsv c53608b6b445db201098c1489ea497df

Data Statistics

userid-timestamp-artid-artname-traid-traname.tsv

Element Statistic

Total Lines 19,150,868

Unique Users 992

Artists with MBID 107,528

Artists without MBDID 69,420

Data Format

The data is formatted one entry per line as follows (tab separated, \t)

userid-timestamp-artid-artname-traid-traname.tsv
userid \t timestamp \t musicbrainz-artist-id \t artist-name \t musicbrainz-track-id \t track-name
userid-profile.tsv
userid \t gender ('m'|'f'|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)
Example:

userid-timestamp-artid-artname-traid-traname.tsv
user_000639 \t 2009-04-08T01:57:47Z \t MBID \t The Dogs D'Amour \t MBID \t Fall in Love Again?
user_000639 \t 2009-04-08T01:53:56Z \t MBID \t The Dogs D'Amour \t MBID \t Wait Until I'm Dead
...
userid-profile.tsv
user_000639 \t m \t Mexico \t Apr 27, 2005
...
License

The data contained in lastfm-dataset-1K.tar.gz is distributed with permission of Last.fm.

The data is made available for non-commercial use.

Those interested in using the data or web services in a commercial context should contact:

partners [at] last [dot] fm.

For more information see Last.fm terms of service.

Acknowledgements

Thanks to Last.fm for providing the access to this data via their web services.

Special thanks to Norman Casagrande.

References

When using this dataset you must reference the Last.fm webpage.

Optionally (not mandatory at all!), you can cite Chapter 3 of this book
@book{Celma:Springer2010,
  author = {Celma, O.},
  title = {{Music Recommendation and Discovery in the Long Tail}},
  publisher = {Springer},
  year = {2010}
}
Contact

This data was collected by Òscar Celma @ MTG/UPF

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
data		data
lastfm-dataset-1K-extended		lastfm-dataset-1K-extended
scripts		scripts
.gitignore		.gitignore
README.md		README.md
preprocessing.ipynb		preprocessing.ipynb
requirements.txt		requirements.txt

Filename	MD5
userid-timestamp-artid-artname-traid-traname.tsv	64747b21563e3d2aa95751e0ddc46b68
userid-profile.tsv	c53608b6b445db201098c1489ea497df

Element	Statistic
Total Lines	19,150,868
Unique Users	992
Artists with MBID	107,528
Artists without MBDID	69,420

jlieth/lastfm-dataset-1K-extended

Folders and files

Latest commit

History

Repository files navigation

Last.fm 1K User Dataset Extended

What is this

Changes to the Dataset

Examples

Caveats

How to use

Converting file types

parquet to csv

csv to sqlite

Normalizing the database

Table layout before normalization:

Table layout after normalization:

Table users

Table artists

Table albums

Table tracks

Table listens

Original README

README

What is this?

Files

Data Statistics

userid-timestamp-artid-artname-traid-traname.tsv

Data Format

userid-timestamp-artid-artname-traid-traname.tsv

userid-profile.tsv

Example:

userid-timestamp-artid-artname-traid-traname.tsv

userid-profile.tsv

License

Acknowledgements

References

Contact

About

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Table `users`

Table `artists`

Table `albums`

Table `tracks`

Table `listens`

Packages