The music classifying nightmare
While I accumulated music files over the years, the main issue I came across
wasn’t how to not get caught or where to find the data, but how to classify and
organize the whole tree of music files. There is actually no such thing as a
perfect naming/tagging convention, and this blog post has only one goal: to
share with you the nightmare of trying to find one anyway.
If you have ever considered organizing your music and don’t know where to start, I
hope this will discourage you so you won’t waste the amount of time I did
over the last few years.
However, if you are trying to create a nice universal software solution for
this problem, you might be interested in the different issues I faced (but
well, you will fail anyway).
The following list (chronologically ordered) gives the reasons that made me
change my way of sorting out my music collection:
- “I need to quickly find the music I want to listen to”
- “I want to be able to really identify a song, even out of its place”
- “I want all the music of the universe and to be a better provider than the commercial ones”
- “I don’t care anymore”
The first point led me to put all the MP3s in one single directory, with names
like <artist> - <title>.mp3. That worked, for a while. Then I
started to really have a lot of files and thus started creating directories for
artists, albums, and eventually a bazaar directory (the name was actually more
crude). At that time, I wasn’t aware of good systems making use of the
The second point was all about extracting one song out of the tree while still
keeping a full reference (for instance in order to share the file with friends, or
copy it to an MP3 player). So I started renaming files like
<tracknum> - into something like
<artist> - <album> - <tracknum> -. The files were in
<artist>/<album>. And everything was fine,
until I hit several kinds of music such as rap, soundtracks, or even classical
music.
This is where the third point came into play. The madness began when I started
thinking about how I could organize my music so it could store and provide
anyone with any music ever made. The first step was to get the most complete
collection possible, so I took every orphan MP3 I had, one by one, and grabbed
the discography of the artist. The second step was to fix all the tags, and
have a homogeneous naming convention. So I started playing with
EasyTAG a lot. This was a nightmare.
And then I reached the terminal phase of this madness, where I just stopped
caring anymore, for various reasons. Here is a short list:
- my naming system changed all the time because I couldn’t bear the
imperfections, or because I hit a new classification problem in the current
- eventually, everything I did needed to be redone because I found some of it
in better quality, or in a lossless codec (a 1:1 copy of the CD)
- it is time consuming
- people don’t need me anymore to be their source of music, so as long as it’s
fine for myself, that’s all that matters
- there are still a lot of problems where I don’t see any sane solution
I strongly advise you to pick a few reasons too if you don’t want to become
As I mentioned earlier, MP3 (or any lossy codec) might not be the perfect
solution: this is not what you would get if you had the original CD. Having
lossless data instead of lossy avoids the need to re-download the same music
several times in order to get a higher quality (or data closer to the
original release). Also, there is no point in having a perfect organization
system if it’s just to sort junk files.
Still, some music is only distributed as MP3 and never on CD; this often happens
with independent artists selling or sharing their music online, so you
will likely end up with different formats in your collection.
An MP3 is just a file of stacked MPEG packets, generally with ID3v2 tags at the
top of the file, or ID3v1 tags at the end, sometimes both, and
sometimes well… nothing. This tag system has several limitations. The main
one is not having a method to handle multiple artists, which is a real
problem. You have to select a separator and stick with it (Comma? Semicolon?
Slash? Something else?).
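To make the tag layout above concrete, here is a minimal Python sketch that checks which ID3 flavours a file seems to carry. It only relies on the two well-known markers (an ID3v2 tag starts with the bytes "ID3" at the top of the file, an ID3v1 tag is the last 128 bytes and starts with "TAG"). The function names are made up for this example, and it does not validate the tag contents in any way.

```python
import os


def detect_id3(head: bytes, tail: bytes):
    """Return (has_id3v2, has_id3v1) from the first and last bytes of a file."""
    # ID3v2: the file starts with the 3 bytes "ID3".
    has_v2 = head[:3] == b"ID3"
    # ID3v1: the last 128 bytes of the file start with "TAG".
    has_v1 = len(tail) == 128 and tail[:3] == b"TAG"
    return has_v2, has_v1


def detect_id3_in_file(path):
    """Read only the head and the tail of the file, then check the markers."""
    with open(path, "rb") as f:
        head = f.read(3)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Files shorter than 128 bytes cannot hold an ID3v1 tag.
        f.seek(max(size - 128, 0))
        tail = f.read(128)
    return detect_id3(head, tail)
```

A file with neither marker is the "well… nothing" case: you only have the file name to work with.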
Also, different audio formats mean different tag systems, so you can’t tag a
flac file the same way you would tag an MP3. If you want to do some
correlation between track metadata, you need some kind of common mapping, and
heuristics (to split the artists from ID3 for example).
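A rough sketch of what such a common mapping could look like in Python. The frame IDs are the usual ID3v2.4 ones and the field names are the conventional Vorbis comment names (used by FLAC and Ogg Vorbis). Everything else is an assumption made for the example: the input dictionaries, the separator set, and the exception list. The splitting heuristic is deliberately naive.

```python
import re

# ID3v2.4 frame IDs -> common field names.
ID3_TO_COMMON = {
    "TPE1": "artist", "TALB": "album", "TIT2": "title",
    "TRCK": "tracknumber", "TDRC": "date", "TCON": "genre",
}
# Conventional Vorbis comment field names -> common field names.
VORBIS_TO_COMMON = {
    "ARTIST": "artist", "ALBUM": "album", "TITLE": "title",
    "TRACKNUMBER": "tracknumber", "DATE": "date", "GENRE": "genre",
}

# Hypothetical choices: which separators to split on, and which names to protect.
ARTIST_SEPARATORS = re.compile(r"\s*[;/]\s*")
NEVER_SPLIT = {"AC/DC"}


def split_artists(value):
    """Split a single artist string into a list (naive heuristic)."""
    if value in NEVER_SPLIT:
        return [value]
    return [part for part in ARTIST_SEPARATORS.split(value) if part]


def from_id3(frames):
    """frames: dict of frame ID -> text, e.g. {"TPE1": "A; B"}."""
    common = {ID3_TO_COMMON[k]: v for k, v in frames.items() if k in ID3_TO_COMMON}
    if "artist" in common:
        common["artist"] = split_artists(common["artist"])
    return common


def from_vorbis(comments):
    """comments: dict of field name -> list of values (fields can repeat)."""
    common = {}
    for key, values in comments.items():
        name = VORBIS_TO_COMMON.get(key.upper())  # field names are case-insensitive
        if name is None:
            continue
        # Only the artist keeps all its values; other fields keep the first one.
        common[name] = list(values) if name == "artist" else values[0]
    return common
```

The exception list only protects an artist string that is exactly a protected name; something like "AC/DC; X" would still be split on the slash, which is the kind of corner case that makes this a nightmare.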
The multi-language problem
Fortunately, not all music comes from the USA. But this obviously means
some localisation issues. Generally you have an international title, so you can
somehow manage to keep only ASCII data, but this is not always the case,
especially with marginal tracks, so you need to keep the original titles. But
how do we know if an album of a given artist will always be internationalized
(you could grab one or two worldwide-licensed and renamed albums and later find an
old local mixtape)? The best solution here is to keep both the original and the
international name (if available) for the sake of homogeneity; you can’t have two
different names for the same artist because of this, or else you might lose
Look at Miyuki Nakajima for instance. How would you name the
artist (in directories and tags)? Miyuki Nakajima (中島 みゆき)? More issues
come up now, purely because of language:
- should we prefer
Miyuki is the first name. Also, the space
みゆき could (should?) be removed.
- there are sometimes different ways of romanizing Japanese.
- how do you put the original title, the romanization and the
internationalized version in the same title?
This list is only a subset of the problems you will have with Japanese material; what
about Russian or traditional Chinese albums?
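One way to picture the "keep both names" idea is a tiny record type plus an index that maps every known spelling to the same entry. This is only a sketch: `ArtistName`, `display` and `build_index` are names invented for the example, and the "International (Original)" display format is just one arbitrary convention among others.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArtistName:
    original: str                        # name as written in its own language
    international: Optional[str] = None  # romanized/international name, if any

    def display(self):
        """Arbitrary convention for this sketch: 'International (Original)'."""
        if self.international and self.international != self.original:
            return f"{self.international} ({self.original})"
        return self.original


def build_index(names):
    """Map every known spelling to one single ArtistName entry."""
    index = {}
    for name in names:
        index[name.original] = name
        if name.international:
            index[name.international] = name
    return index


nakajima = ArtistName("中島みゆき", "Miyuki Nakajima")
index = build_index([nakajima, ArtistName("Dire Straits")])
```

Both `index["中島みゆき"]` and `index["Miyuki Nakajima"]` lead to the same entry, so the two spellings no longer produce two different artists. It does not answer the name order or romanization questions above; someone still has to decide what goes in each field.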
Extra info in the titles
Sometimes you get various extra information about the album or the track, such
Skit, … And
- where does this information actually belong? In a specific tag?
- if not, what markup should be used?
(feat. X), simply use a
- long (
cover…) or short (
- what if the artist used a different naming convention than you on their album
track list? (for example if they used
instr. and you prefer and use
Instrumental all over your collection).
If you just keep the original/first naming the artist gave, and want that
information to be exploitable by software, you will need a lot of complex
heuristics to make it match all the different conventions. Extracting this
information can lead to various issues, especially when splitting the artist
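To give an idea of what these heuristics look like, here is a small Python sketch that peels trailing parenthesized annotations off a title. The alias table and the list of recognized "featuring" spellings are assumptions made for the example; a real collection would need far more of both, and it would still get some titles wrong.

```python
import re

# Hypothetical alias table: every spelling maps to one preferred form.
ANNOTATION_ALIASES = {
    "instr.": "Instrumental", "instr": "Instrumental",
    "instrumental": "Instrumental", "live": "Live", "skit": "Skit",
}
TRAILING_PARENS = re.compile(r"\s*\(([^()]*)\)\s*$")
FEATURING = re.compile(r"^(?:feat\.?|featuring|ft\.?)\s+(.+)$", re.IGNORECASE)


def parse_title(raw):
    """Return (title, annotations, featuring) extracted from a raw title."""
    title, annotations, featuring = raw.strip(), [], []
    while True:
        match = TRAILING_PARENS.search(title)
        if not match:
            break
        content = match.group(1).strip()
        feat = FEATURING.match(content)
        if feat:
            featuring.insert(0, feat.group(1))
        elif content.lower() in ANNOTATION_ALIASES:
            annotations.insert(0, ANNOTATION_ALIASES[content.lower()])
        else:
            # Unknown parenthesis: assume it is really part of the title.
            break
        title = title[:match.start()]
    return title, annotations, featuring
```

For instance, `parse_title("Foo (feat. X) (instr.)")` gives the bare title "Foo", the annotation "Instrumental" and the featured artist "X". A featuring string like "Z and X" is kept as one piece, because splitting it is the artist separator problem all over again.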
Album annotations sometimes refer to an album type, or a category:
Where is that supposed to be stored in the tags? For the file system, you could
go with a pattern like
<artist>/<album-style>/<year> - <album>/<tracknum>..
This can sometimes get mixed up with the track-specific information (see
previous section), like a live track from a soundtrack being released as a
How are we supposed to deal with multiple artists? What is supposed to be the
separator? You can use the tag system’s built-in support if available (Ogg tags
support this, for example), or you need to define a separator as mentioned earlier.
If the album is made of various artists, you have different patterns; here are
some random ones I came across:
1. A, B, C, D
2. B, D, X, Y, Z
3. A, X, Y
...
Here you generally define the artist as Various artists, or you could also
have a real artist list in the tags. But what about the artist directory?
Various artists can be a solution here; that’s generally what I do.
Soundtrack common pattern:
1. X
2. Main
3. Main
4. Main
...
10. Main
11. X
Soundtracks are really horrible. I personally have a dedicated directory,
because I’m actually using the file system (I’m much more confident with it
than with any tag system). You generally have a main artist and one or two special
tracks from a different artist. The common way to handle that situation is to
use the main artist name to classify the whole album, sometimes even for the
tracks where they aren’t the author. I don’t want to lose that “grain” of
information, so I keep the real reference names in the tags, and find the whole
soundtrack using the file system (my Soundtracks/ directory) instead of
relying on the main author name. I have a dedicated section on soundtracks
later in this post, covering more of their issues.
Rap case 2:
1. Main (feat. X)
2. Main
3. Main (feat. Y)
4. Main (feat. Z and X)
...
Here the artists are “secondary” most of the time: basically, they don’t get
the same amount of time as the main artist. The featuring might even be just a
sample in the background. So it might not be wise to split the artist list like in
the first case (or they will be at the same “level”, which is wrong). But
sometimes they share the time roughly 50/50, or worse, the featuring artist
might actually monopolize the whole track. You need to know the artists and
songs pretty well to be able to sort this out.
Singles common pattern:
1. Main - track foo
2. Main - track foo (Remixed by X)
3. Main - track foo (Y Remix)
4. Main - track foo (foobar mix)
5. Main - track foo (radio edit)
This runs into various issues I talked about previously (naming convention, and
authorship). You will also note there is sometimes a distinction between the
artist and the composer (hi classical music lovers!), and you might want to keep this
information in some cases (even if it is likely to be ignored by most players).
This is yet another thing I had a really hard time dealing with. Let’s take Aphex
Twin for instance, who is one of the most insane cases:
Aphex Twin has also recorded music under the aliases AFX, Blue Calx, Bradley
Strider, Caustic Window, DJ Smojphace, GAK, Martin Tressider, Polygon Window,
Power-Pill, Prichard D. Jams, Q-Chastic, Tahnaiya Russell, The Dice Man,
Soit-P.P., and speculatively The Tuss.
Oh, and his real name is Richard David James. What are you supposed to use
for the file system directory and file names? His name? The most common
nickname? Both? One file system solution is to have symbolic links (do you
link Richard David James to Aphex Twin, or vice versa?). For tags, if you
don’t want to lose information, this is another story…
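For the symbolic link idea, a minimal Python sketch could look like this. It assumes one directory per artist under a single root and arbitrarily picks the most common nickname as the real directory, which is exactly the decision the question above leaves open. `link_aliases` is a name made up for the example.

```python
import os


def link_aliases(root, canonical, aliases):
    """Create <root>/<alias> symlinks pointing at <root>/<canonical>.

    Returns the list of aliases for which a link was actually created.
    """
    os.makedirs(os.path.join(root, canonical), exist_ok=True)
    created = []
    for alias in aliases:
        link = os.path.join(root, alias)
        # Never overwrite something that already exists under that name.
        if os.path.lexists(link):
            continue
        # Relative target, so the whole tree can be moved around.
        os.symlink(canonical, link, target_is_directory=True)
        created.append(alias)
    return created
```

Calling `link_aliases("/music", "Aphex Twin", ["AFX", "Polygon Window"])` (with a hypothetical `/music` root) would make every alias directory resolve to the same place. Symbolic links may need extra privileges on some platforms, and they only help on the file system side; the tags are left untouched.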
I will start this section with one of the worst cases I encountered in
soundtracks: Compilation album by Anna Tsuchiya Inspi’ Nana (Black Stones)
Olivia Inspi’ Reira (Trapnest)
The real artists are Anna Tsuchiya (see the romanization issues I mentioned
above, by the way) and Olivia Lufkin.
Nana is the anime’s name, and Black Stones and Trapnest are the bands in
the anime. So far, it just looks like a lot of information, but it’s more like
multiple different names. Olivia is for example spelled in a few different
ways in the albums:
- Olivia Lufkin
- Olivia Inspi’ Reira (Trapnest)
The second alias is pretty interesting, because it is common with Japanese
(nick)names to use capitalization, while a lot of “tag oriented music viewers”
use some kind of generic formatting, changing this name into “Olivia” (like what
you get with the .title() method in
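Python’s `str.title()` is one example of this kind of generic formatting. The short sketch below shows how it flattens stylized names, and one possible workaround based on an explicit list of names to leave alone. The `PRESERVE` set and both function names are assumptions made for the example.

```python
def naive_display(name):
    """Generic formatting: capitalize each word, lowercase the rest."""
    return name.title()


# Hypothetical list of stylized names that must be kept exactly as written.
PRESERVE = {"OLIVIA", "AC/DC"}


def careful_display(name, preserve=PRESERVE):
    """Only apply the generic formatting to names not explicitly protected."""
    return name if name in preserve else name.title()


print(naive_display("OLIVIA"))    # Olivia
print(naive_display("AC/DC"))     # Ac/Dc
print(careful_display("OLIVIA"))  # OLIVIA
```

Maintaining such a list by hand obviously doesn’t scale, which is why leaving the capitalization alone is often the safer default.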
What if you now also want to store some other songs from Olivia: where and how
should the files be located and tagged so we can easily find everything she did
(and also get the related music if she worked with different artists in the
Also, what is the genre of the soundtracks?
Rock? I’ll come back to the genre issue.
What should we do with the “soundtrack” mention in the title, by the way?
There are various solutions, each with its own problems:
- you can stick with what’s basically written on the cover: but of course, not
all covers mention the “Soundtrack” word (or anything similar)
- if you decide to not mention anything and just keep that information in the
Genre tag (or similar), you will need to be very careful not to mix up things
in case there is a soundtrack of the same name for the video game for
example. Also, you need to decide on the common term to use.
You will also notice various separators other than the dash ‘-’ in the track
titles, like tildes ‘~’ or special dots ‘・’ (common Japanese “markups”),
which are actually not really part of the title. Or cases where the title
contains the artist’s nickname, but you have to store the real artist name (in
the Artist tag for instance), a different name for the lyrics, and also a
different name for the arranger(s), which by the way are all under the same
Assuming you managed to tag everything in a somewhat consistent way, how about
the file system? Keep in mind you might want to group all the albums under an
arbitrary pack name such as “Nana” (like you would put all the “Lord of the
Rings” soundtracks under a directory of that name), which might not even appear
in the tags. You will certainly come up with an arbitrary path such as:
Soundtracks/Animes/Nana/<album name>/<tracknum>. <title>
This is a somewhat intuitive way of storing most of the “important” information
when looking for a song, but certainly in total contradiction with the other
Sometimes, you hit yet another problem, like with the Arrietty
soundtrack, where you have a US (international) version, a French version
(with additional tracks, somehow flagged as “premium”), and maybe a Japanese
one I’m not aware of.
You may have noticed that I used “Soundtracks” as the root directory in the
previous example, which somehow sounds like a musical genre. If you want to do
that for all artists, you just can’t, for the simple reason that genres are almost
song-specific, pretty subjective, generally mixed, or never defined by the author.
What you can do on the other hand is to have genre tags (like arbitrary
annotations), if your system allows it. But this doesn’t solve the file system
It is sometimes important to keep track of the date of an album, for instance to
follow the chronological evolution of the artist. But if the artist released
multiple albums in the same year, you will need to stick with a convention such as:
- 2012-0, 2012-1, 2012-2
- 2012-1, 2012-2, 2012-3
- 2012a, 2012b, 2012c
- 2012-a, 2012-b, 2012-c
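As an illustration, here is a short Python sketch of the third convention (2012a, 2012b, …) applied to the `<year> - <album>` directory pattern. It assumes the albums are given in release order, only adds a letter when a year is actually shared, and makes no attempt to handle more than 26 albums in a year. `album_dirs` is a made-up name.

```python
import string
from collections import Counter


def album_dirs(albums):
    """albums: list of (year, title) in release order -> list of directory names."""
    per_year = Counter(year for year, _ in albums)
    seen = Counter()
    names = []
    for year, title in albums:
        if per_year[year] == 1:
            prefix = str(year)
        else:
            # 2012a, 2012b, ... in the order the albums were given.
            prefix = f"{year}{string.ascii_lowercase[seen[year]]}"
            seen[year] += 1
        names.append(f"{prefix} - {title}")
    return names
```

Note the built-in trap: as soon as a second album from an already known year shows up, the first one has to be renamed from `2012 - …` to `2012a - …`.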
But sometimes, a “same” album has been released at different times…
A lot of albums are released in different versions, see for example the album
Brothers in Arms by Dire Straits:
One of the issues is that you can’t easily keep up with the following file
<artist>/<year> - <album>/
And you will have a lot of tag collisions. You need to find a way to
differentiate between these albums and their tracks.
Here are some more issues I didn’t have the courage to elaborate on:
- artists with special characters such as AC/DC, which can’t be used on the
file system, and prevent you from using ‘/’ as an artist separator
- totally insane track titles, such as a mathematical formula
- different artists with the exact same name
- music not related to any album, or commercial medium, like Golden Age of
Video by Ricardo Autobahn
- different kinds of music, such as keygen music, have a
totally different organization, and should not be mixed with a “normal”
- different media, such as multiple CDs (sometimes named, sometimes not),
vinyl, maxi CD, …
- a cover sometimes includes a list of multiple artists
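For the AC/DC problem, the usual workaround is to sanitize names only when building paths and to keep the real name in the tags. A minimal Python sketch, where the replacement character and the function names are arbitrary choices made for the example:

```python
def safe_component(name, replacement="-"):
    """Make a name usable as a single path component on a POSIX file system."""
    # '/' is the path separator and NUL is forbidden in file names.
    cleaned = name.replace("/", replacement).replace("\0", "")
    # Avoid names that have a special meaning for the file system.
    if cleaned in ("", ".", ".."):
        cleaned = "_"
    return cleaned


def album_path(artist, year, album):
    """Build an <artist>/<year> - <album> relative path."""
    return "/".join([safe_component(artist), f"{year} - {safe_component(album)}"])
```

With this, `album_path("AC/DC", 1980, "Back in Black")` becomes `AC-DC/1980 - Back in Black`. The mapping is lossy (an artist really named "AC-DC" would collide), so the directory name can’t be trusted to recover the artist name.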
Some potential solutions exist. The first one is to have a virtual file system
(with FUSE for instance), dedicated to music. For instance the path
genre/soundtracks/animes/Nana would point to data similar to:
This solution may need a lot of thinking, and will likely hit the same issues
as the tag system, and certainly a lot of others. If you are working on
something similar to this, I’m interested.
Musical content retrieval
This is actually a more promising solution in my opinion. The goal here is to
remove the textual content issue. Anita Lillie wrote a thesis on this:
MusicBox. I encourage you to read it if you are
interested in the topic.
I actually worked on this for two years as an experiment at my school with a
few fellow students: we basically re-implemented what Anita proposed, did the
analysis ourselves (instead of using an existing engine like she did), and implemented a
way of communicating with various players instead of taking on the burden of
proposing yet another player. There is nothing really releasable, so
here is some feedback on that experience if you want to do something like
- music analysis is hard, slow and a pretty young science (but where all the
fun is in my opinion)
- mapping all the songs graphically is a complex scaling issue
- you still need to make the text dimension available for some specific
- providing a playlist creator and not an audio player is nice, but might be
more complicated than you expect
Since this is going slightly off topic, I won’t go into detail here; I’m just
mentioning the idea.
A fine hack
Despite all my attempts to get a well classified music collection, it is
essentially like everyone else’s: a giant mess. But the tags are somehow good
enough to give interesting results with the suggestion API of
Last.FM (and actually they need to be as messy as everyone
else’s so the matching algorithm can work), so I basically start with a song I like,
and run DynaMPD in order to get some related content from my
And about the mess in my files well… I just don’t give a shit anymore. I’m
still able to find what I’m looking for in a relatively short amount of time,
and trying to remove the small overhead isn’t worth wasting my life on it.
I have resigned myself: music is a form of art, thus it is not, and must never
be, limited to a binary mind.