Thought this was cool: The music classifying nightmare


While I accumulated music files over the years, the main issue I came across
wasn’t how to not get caught or where to find the data, but how to classify and
organize the whole tree of music files. There is actually no such thing as a
perfect naming/tagging convention, and this blog post has only one goal: to
share with you my nightmare of trying to find one anyway.

If you have ever considered organizing your music and don’t know where to
start, I hope this will discourage you so you won’t waste the amount of time I
did over the last few years.

However, if you are trying to create a nice universal software solution for
this problem, you might be interested in the different issues I faced (but
well, you will fail anyway).

The following list (in chronological order) gives the reasons that made me
change the way I sorted my music collection:

1. “I need to quickly find the music I want to listen to”
2. “I want to be able to really identify a song, even out of its place”
3. “I want all the music of the universe and to be a better provider than commercial ones”
4. “I don’t care anymore”

The first point led me to put all the MP3s in one single directory, with a
simple pattern: <artist> - <title>.mp3. That worked, for a while. Then I
really started to have a lot of files and thus started creating directories for
artists, albums, and eventually a bazaar directory (the name was actually more
crude). At that time, I wasn’t aware of good systems making use of the

The second point was all about extracting one song out of the tree while still
keeping a full reference (for instance to share the file with friends, or to
copy it to an MP3 player). So I started renaming files like <tracknum> -
into something like <artist> - <album> - <tracknum> -
. The files were in <artist>/<album>. And everything was fine,
until I hit several kinds of music such as rap, soundtracks, or even classical

This is where the third point came into play. The madness began when I started
thinking about how I could organize my music so it could store any music ever
made and provide it to everyone. The first step was to get the most complete
collection possible, so I took every orphan MP3 I had, one by one, and grabbed
the discography of the artist. The second step was to fix all the tags and
settle on a homogeneous naming convention. So I started playing with
EasyTAG a lot. This was a nightmare.

And then I reached the terminal phase of this madness, where I just stopped
caring anymore, for various reasons. Here is a short list:

  • my naming system changed all the time because I couldn’t bear the
    imperfections, or because I hit a new classification problem in the current
  • eventually, everything I did needed to be re-done because I found some MP3
    in better quality, or in a lossless codec (copy 1:1 from the CD)
  • it was time-consuming
  • people don’t need me anymore to be their source of music, so as long as it’s
    fine for myself, that’s all that matters
  • there are still a lot of problems for which I don’t see any sane solution

I strongly advise you to pick a few reasons too if you don’t want to become


As I mentioned earlier, MP3 (or any lossy codec) might not be the perfect
solution: it is not what you would get if you had the original CD. Having
lossless data instead of lossy data avoids having to re-download the same music
several times to get a higher quality (or any data closer to the original
release). Also, there is no point in having a perfect organization system if
it’s just to sort junk files.

Still, some music is only distributed as MP3 and never on CD; this often
happens with independent artists selling or sharing their music online, so you
will likely end up with different formats in your collection.


An MP3 is just a file of stacked MPEG packets, generally with ID3v2 tags at
the top of the file, or ID3v1 tags at the end, sometimes both, and sometimes,
well… nothing. This tag system has several limitations. The main one is that
it has no method for handling multiple artists, which is a real problem. You
have to select a separator and stick with it (Comma? Semicolon? Slash?
Something else?).

Also, different audio formats mean different tag systems, so you can’t tag a
FLAC file the same way you would tag an MP3. If you want to correlate track
metadata across formats, you need some kind of common mapping, and heuristics
(to split the artists from ID3 for example).
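
To make that a bit more concrete, here is a minimal sketch of such a common
mapping in Python. The frame IDs (TPE1, TALB, ...) are the standard ID3v2.4
ones and the upper-case keys are the usual Vorbis comment names found in FLAC
and Ogg files; everything else (the neutral field names, the separator set, the
`known` parameter) is my own assumption, not something a library provides.

```python
import re

# neutral field -> (ID3v2.4 frame ID, Vorbis comment key)
COMMON_FIELDS = {
    "artist": ("TPE1", "ARTIST"),
    "album": ("TALB", "ALBUM"),
    "title": ("TIT2", "TITLE"),
    "tracknum": ("TRCK", "TRACKNUMBER"),
    "date": ("TDRC", "DATE"),
    "genre": ("TCON", "GENRE"),
}

# Assumed separator convention: comma, semicolon or slash.
SEPARATORS = re.compile(r"\s*[;,/]\s*")


def split_artists(raw, known=frozenset()):
    """Split one artist string into a list, keeping known names whole."""
    raw = raw.strip()
    if not raw:
        return []
    if raw in known:  # e.g. a band whose name contains a separator
        return [raw]
    return [part for part in SEPARATORS.split(raw) if part]


def normalize(tags, fmt, known=frozenset()):
    """Convert format-specific tags (fmt is "id3" or "vorbis") to neutral fields."""
    index = 0 if fmt == "id3" else 1
    out = {}
    for field, keys in COMMON_FIELDS.items():
        value = tags.get(keys[index])
        if value is None:
            continue
        if field == "artist" and not isinstance(value, list):
            # Single string: fall back to the separator heuristic.
            value = split_artists(value, known)
        out[field] = value
    return out


print(normalize({"TPE1": "A; B", "TIT2": "foo"}, "id3"))
print(normalize({"ARTIST": ["A", "B"], "TITLE": "foo"}, "vorbis"))
```

Note how fragile the heuristic is: a name containing a separator only survives
if it is the whole string and listed in `known`, which is exactly the kind of
special case that keeps piling up.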

The multi-language problem

Fortunately, not all music comes from the USA. But this obviously means some
localisation issues. Generally you have an international title, so you can
somehow manage to keep only ASCII data, but this is not always the case,
especially with marginal tracks, so you need to keep the original titles. But
how do we know if an album of a given artist will always be internationalized
(you could grab one or two internationally licensed and renamed releases and
later find an old local mixtape)? The best solution here is to keep both the
original and the international name (if available) for the sake of consistency;
you can’t have two different names for the same artist because of this, or else
you might lose some references.

Look at Miyuki Nakajima for instance. How would you name the
artist (in directories and tags)? Miyuki Nakajima (中島 みゆき)? More problems
now arise, purely because of the language:

  • should we prefer Miyuki Nakajima or Nakajima Miyuki? 中島 is Nakajima
    and みゆき is Miyuki, but Miyuki is the first name. Also, the space
    between 中島 and みゆき could (should?) be removed.
  • there are sometimes several ways to romanize Japanese.
  • how do you put the original title, the romanization and the
    internationalized version in the same title?

This list is only a subset of the problems you will have with Japanese
material; what about Russian or traditional Chinese albums?

Extra info in the titles

Sometimes you get various extra information about the album or the track, such
as: Cover, Bonus CD, Feat./Ft./Featuring, Remix/rmx,
Single, Edit, Radio edit, Instrumental, Karaoke, Skit, … And

  • where does this information actually belong? A specific tag?
  • if not, what should be the markup used? [feat. X], (feat. X), simply use a
  • long (remix, featuring, cover…) or short (rmx, ft., cv)
  • what if the artist used a different naming convention than yours on the
    album track list (for example, if they used instr. and you prefer
    Instrumental all over your collection)?

If you just keep the original/first naming the artist gave, and want that
information to be usable by software, you will need a lot of complex
heuristics to make it match all the different conventions. Extracting this
information can lead to various issues, especially when splitting the artist
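
As an illustration of what such heuristics look like, here is a minimal Python
sketch that pulls bracketed annotations out of a title. The `KINDS` table and
the choice of canonical labels are assumptions of mine; a real collection would
need a much longer table and still miss cases.

```python
import re

# Anything in (...) or [...], with optional leading whitespace.
BRACKETED = re.compile(r"\s*[\(\[]([^\)\]]*)[\)\]]")
FEAT = re.compile(r"^(?:feat\.?|ft\.?|featuring)\s+(.+)$", re.IGNORECASE)

# Assumed spelling variants -> canonical label.
KINDS = {
    "remix": "remix", "rmx": "remix",
    "instrumental": "instrumental", "instr.": "instrumental",
    "radio edit": "radio edit", "edit": "edit",
    "karaoke": "karaoke", "skit": "skit", "cover": "cover",
}


def parse_title(raw):
    """Return (bare_title, featured_artists_or_None, [canonical kinds])."""
    featuring = None
    kinds = []

    def classify(match):
        nonlocal featuring
        content = match.group(1).strip()
        feat = FEAT.match(content)
        if feat:
            featuring = feat.group(1)
            return ""
        kind = KINDS.get(content.lower())
        if kind:
            kinds.append(kind)
            return ""
        return match.group(0)  # unknown annotation: leave it in the title

    title = BRACKETED.sub(classify, raw).strip()
    return title, featuring, kinds


print(parse_title("Track foo (feat. X) [Instr.]"))
print(parse_title("Track foo (Y Remix)"))
```

The second call shows the limit right away: "(Y Remix)" carries an artist name,
so it does not match the table and stays in the title untouched.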

Album types

Album annotations sometimes refer to an album type or category:
Albums, Compilations, Singles, ED, Soundtracks, Anthologies,
Lives, Remixes, Bootlegs, Mixtapes, …

Where is that supposed to be stored in the tags? For the file system, you could
go with a pattern like <artist>/<album-style>/<year> - <album>/<tracknum>.

Doing this can sometimes get mixed up with the track-specific information (see
previous section), like a live track from a soundtrack being released as a

Multiple artists

How are we supposed to deal with multiple artists? What is supposed to be the
separator? You can use the tag format’s built-in support if available (Ogg tags
support this, for example); otherwise you need to define a separator as
mentioned earlier.

If the album is made by various artists, you will find different patterns; here
are some random ones I came across:

Compilations/Rap/crazy mashups:

1. A, B, C, D
2. B, D, X, Y, Z
3. A, X, Y

Here you generally define the artist as Various artists, or you could also
have a real artist list in the tags. But what about the artist directory?
Various artists can be a solution here; that’s generally what I do.

Soundtrack common pattern:

1. X
2. Main
3. Main
4. Main
10. Main
11. X

Soundtracks are really horrible. I personally have a dedicated directory,
because I’m actually using the file system (I’m much more confident with it
than with any tag system). You generally have a main artist and one or two
special tracks from a different artist. The common way to handle that situation
is to use the main artist name to classify the whole album, sometimes even the
tracks where they aren’t the author. I don’t want to lose that “grain” of
information, so I keep the real reference names in the tags, and find the whole
soundtrack using the file system (my Soundtracks/ directory) instead of
relying on the main author name. There is a dedicated section on soundtracks
later in this post, with more issues about them.

Rap case 2:

1. Main (feat. X)
2. Main
3. Main (feat. Y)
4. Main (feat. Z and X)

Here the other artists are “secondary” most of the time: basically, they don’t
get the same amount of time as the main artist. The featuring might even be
just a sample in the background. So it might not be wise to split the artist
list like in the first case (or they will all be at the same “level”, which is
wrong). But sometimes they share the time roughly 50/50, or worse, the featured
artist might actually monopolize the whole track. You need to know the artists
and songs pretty well to be able to sort this out.

Singles common pattern:

1. Main - track foo
2. Main - track foo (Remixed by X)
3. Main - track foo (Y Remix)
4. Main - track foo (foobar mix)
5. Main - track foo (radio edit)

This runs into various issues I talked about previously (naming convention and
authorship). You will also note there is sometimes a distinction made with the
composer (hi classical music lovers!), and you might want to keep this
information in some cases (even if it is likely to be ignored by most players).

Artist aliases

This is yet another thing I had a really hard time dealing with. Let’s take
Aphex Twin for instance, which is one of the most insane cases:

Aphex Twin has also recorded music under the aliases AFX, Blue Calx, Bradley
Strider, Caustic Window, DJ Smojphace, GAK, Martin Tressider, Polygon Window,
Power-Pill, Prichard D. Jams, Q-Chastic, Tahnaiya Russell, The Dice Man,
Soit-P.P., and speculatively The Tuss.
— Wikipedia

Oh, and his real name is Richard David James. What are you supposed to use
for the directory and file names? His name? The most common nickname? Both?
One file system solution is to have symbolic links (do you link Richard David
James to Aphex Twin, or vice versa?). For tags, if you don’t want to lose
information, this is another story…

Other examples: SNoW, M/Matthieu Chedid
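
For the symbolic link idea, a minimal Python sketch could look like the
following. It assumes you have already decided which name is the canonical
directory (here Aphex Twin, arbitrarily), and `link_aliases` is a hypothetical
helper of mine, not part of any tool.

```python
import os


def link_aliases(root, canonical, aliases):
    """Create <root>/<alias> -> <canonical> symlinks; return the names created."""
    os.makedirs(os.path.join(root, canonical), exist_ok=True)
    created = []
    for alias in aliases:
        link = os.path.join(root, alias)
        if os.path.lexists(link):
            continue  # never overwrite an existing directory or link
        # Relative target, so the whole tree can be moved around.
        os.symlink(canonical, link, target_is_directory=True)
        created.append(alias)
    return created


# Usage (the path is only an example):
# link_aliases("/srv/music", "Aphex Twin", ["AFX", "Polygon Window", "GAK"])
```

This only helps the file system side; the tags inside the files still carry
whatever name was used on each release.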


I will start this section by selecting one of the worst cases I experienced in
soundtracks: Compilation album by Anna Tsuchiya Inspi’ Nana (Black Stones)
Olivia Inspi’ Reira (Trapnest)

The real artists are Anna Tsuchiya (see issues with
romanization I mentioned above by the way) and Olivia Lufkin.
Nana is the anime’s name, and Black Stones and Trapnest are the bands in
the anime. So far, it just looks like a lot of information, but it’s more like
multiple different names. Olivia, for example, is spelled in a few different
ways on the albums:

  • Olivia Lufkin
  • Olivia Inspi’ Reira (Trapnest)

The second alias is pretty interesting, because it is common for Japanese
(nick)names to use capitalization, while a lot of “tag-oriented music viewers”
apply some kind of generic formatting that changes this name into “Olivia”
(like what you get with the .title() method in Python).
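
Here is that effect in a few lines of Python, assuming an all-caps stylization
such as OLIVIA (an illustrative spelling I picked, since the capitalized forms
are not reproduced above). `careful_display` is a hypothetical alternative, not
what any given player does.

```python
def naive_display(name):
    """What many tag viewers effectively do: force Title Case."""
    return name.title()


def careful_display(name):
    """Only recase values that are entirely lower-case; keep stylized names."""
    return name.title() if name.islower() else name


stylized = "OLIVIA"  # assumed all-caps spelling, for illustration only
print(naive_display(stylized))            # Olivia
print(careful_display(stylized))          # OLIVIA
print(careful_display("olivia lufkin"))   # Olivia Lufkin
```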

What if you now also want to store some other songs from Olivia? Where and how
should the files be located and tagged so we can easily find everything she did
(and also get the related music if she worked with different artists in the
same scope)?

Also, what is the Genre of the soundtracks? Soundtrack? Anime Soundtrack?
Anime? Original Soundtrack? OST? Rock? I’ll come back to the genre issue

What should we do with the “soundtrack” mention in the title by the way?
Various solutions with their own problems for this:

  • you can stick with what’s basically written on the cover: but of course, not
    all covers mention the “Soundtrack” word (or anything similar)
  • if you decide to not mention anything and just keep that information in the
    Genre tag (or similar), you will need to be very careful not to mix up things
    in case there is a soundtrack of the same name for the video game for
    example. Also, you need to decide on the common term to use.

You will also notice separators other than the dash ‘-’ in track titles, such
as tildes ‘~’ or special dots (common Japanese “markups”), which are not
really part of the title. There are also cases where the title contains the
artist’s nickname, but you have to store the real artist name (in the Artist
tag for instance), a different name for the lyrics, and yet another name for
the arranger(s), all of which, by the way, suffer from the same artist-split
issue.

Assuming you managed to tag everything in a somewhat consistent way, how about
the file system? Keep in mind you might want to group all the albums under an
arbitrary pack name such as “Nana” (like you would put all the “Lord of the
Rings” soundtracks under a directory of that name), which might not even appear
in the tags. You will certainly come up with an arbitrary path such as:
Soundtracks/Animes/Nana/<album name>/<tracknum>. <title>

This is a somewhat intuitive way of storing most of the “important” information
when looking for the song, but certainly in total contradiction with the other

Sometimes you hit yet another problem, like with the Arrietty
soundtrack, where you have a US (international) version, a French version
(with additional tracks, somehow flagged as “premium”), and maybe a Japanese
one I’m not aware of.


You may have noticed that in the previous example I used “Soundtracks” as the
root directory, which sounds somewhat like a musical genre. If you want to do
that for all the artists, you just can’t, for the simple reason that genres are
almost song-specific, pretty subjective, generally mixed, or never defined by
the author. What you can do, on the other hand, is to have genre tags (like
arbitrary annotations), if your system allows it. But this doesn’t solve the
file system


It is sometimes important to keep track of the date of an album, for instance
to follow the chronological evolution of the artist. But if the artist released
multiple albums in the same year, you will need to stick with a convention such as:

  • 2012-0, 2012-1, 2012-2
  • 2012-1, 2012-2, 2012-3
  • 2012a, 2012b, 2012c
  • 2012-a, 2012-b, 2012-c
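
Whichever convention you pick, generating it mechanically is the easy part.
Here is a minimal Python sketch for the 2012-1, 2012-2 variant; the
(year, name) input format and the function name are assumptions for the
example.

```python
from collections import Counter


def date_prefixes(albums):
    """albums: list of (year, name) in release order -> list of (prefix, name)."""
    totals = Counter(year for year, _ in albums)
    seen = Counter()
    out = []
    for year, name in albums:
        if totals[year] == 1:
            prefix = str(year)  # a single release that year needs no suffix
        else:
            seen[year] += 1
            prefix = f"{year}-{seen[year]}"
        out.append((prefix, name))
    return out


print(date_prefixes([(2011, "A"), (2012, "B"), (2012, "C")]))
```

Keep in mind that with ten or more releases in a year, plain string sorting
puts 2012-10 before 2012-2, so you would have to zero-pad the suffix.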

But sometimes, a “same” album has been released at different times…

Multiple editions

A lot of albums are released in different versions, see for example the album
Brothers in Arms by Dire Straits:

One of the issues is that you can’t easily stick to the following file
system format: <artist>/<year> - <album>/

And you will have a lot of tag collisions. You need to find a way of
differentiating these albums and their tracks.

Random issues

Here are some remaining issues I didn’t have the courage to elaborate on:


Some potential solutions exist. The first one is to have a virtual file system
(with FUSE for instance), dedicated to music. For instance the path
genre/soundtracks/animes/Nana would focus on similar data than:

This solution may need a lot of thinking, and will likely hit the same issues
as the tag system, and certainly a lot of others. If you are working on
something similar to this, I’m interested.
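
To give a rough idea of the lookup such a virtual file system would have to do,
here is a tiny Python sketch. It involves no FUSE binding at all, and the
semantics are entirely my assumption: the first path component names a tag
field, and every following component must be among the track's values for that
field.

```python
def resolve(virtual_path, library):
    """Return the real paths of tracks matching <field>/<value>/<value>/..."""
    field, *wanted = virtual_path.strip("/").split("/")
    wanted = {value.lower() for value in wanted}
    return [
        track["path"]
        for track in library
        # Set inclusion: every requested value must be present on the track.
        if wanted <= {value.lower() for value in track["tags"].get(field, ())}
    ]


# Hypothetical in-memory library; a real one would be read from the tags.
library = [
    {"path": "a.flac", "tags": {"genre": ["Soundtracks", "Animes", "Nana"]}},
    {"path": "b.mp3", "tags": {"genre": ["Rock"]}},
]
print(resolve("genre/soundtracks/animes/Nana", library))
```

Even this toy version inherits every tagging problem described above: the
result is only as good as the annotations on each track.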

Musical content retrieval

This is actually a more promising solution in my opinion. The goal here is to
remove the textual content issue. Anita Lillie wrote a thesis on this:
MusicBox. I encourage you to read it if you are interested in the topic.

I actually worked on this for two years as an experiment at my school with a
few fellow students: we basically re-implemented what Anita proposed, did the
analysis ourselves (instead of using an existing engine like she did), and
implemented a way of communicating with various players instead of taking on
the burden of proposing yet another player. There is nothing really releasable,
so here is some feedback on that experience if you want to do something like

  • music analysis is hard, slow and a pretty young science (but that is where
    all the fun is, in my opinion)
  • mapping all the songs graphically is a complex scaling issue
  • you still need to make the text dimension available for some specific
  • providing a playlist creator and not an audio player is nice, but might be
    more complicated than you expect

Since this is going slightly off topic, I won’t go into detail here; I’m just
mentioning the idea.

A fine hack

Despite all my attempts to get a well-classified music collection, it is
essentially like everyone else’s: a giant mess. But the tags are somehow good
enough to give interesting results with the suggestion API of
(and actually they need to be as messy as everyone
else so the match algorithm can work), so I basically start with a song I like,
and run DynaMPD in order to get some related content from my

And as for the mess in my files, well… I just don’t give a shit anymore. I’m
still able to find what I’m looking for in a relatively short amount of time,
and trying to remove the small overhead isn’t worth wasting my life on.

I have resigned myself: music is a form of art, so it is not, and must never
be, limited to a binary mind.

from Hacker News 50:


Written by cwyalpha

September 9, 2012 at 3:38 pm

Posted in Uncategorized

