Wednesday, January 2, 2008

The Similar Albums Feature

Every Backstage album page has a list of similar albums. The feature does not work all that well, and, of course, there's no real documentation on how it's generated. I can make some guesses, however.

First, we know that some sort of averaging of the genomes is done in at least one place in Pandora. We know that the list of thumbed-up tracks for any station is averaged to create a single seed. I suspect that the averaging is done as simply as possible. I suspect that the genomes are, in the end, long binary strings: does this song have this quality (yes/no). [Pandora is currently playing DeGarmo's "Boy Like You" just to taunt me right now.] I suspect that when Pandora needs an average, it simply takes the straight numerical average of these binary strings.
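If that guess is right, the averaging could be nothing fancier than an element-wise mean of 0/1 vectors. Here is a minimal sketch of what I have in mind; the vector length, the example genomes, and the function name are all invented for illustration:

```python
# Speculative sketch: average several binary genomes into a single "seed".
# Assumes each genome is a fixed-length list of 0/1 flags, one per attribute.

def average_genomes(genomes):
    """Element-wise mean of equal-length 0/1 vectors."""
    n = len(genomes)
    length = len(genomes[0])
    return [sum(g[i] for g in genomes) / n for i in range(length)]

# Three hypothetical thumbed-up tracks on a station, 8 attributes each.
thumbed_up = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 0],
]
seed = average_genomes(thumbed_up)
print(seed)  # roughly [0.67, 0.33, 1.0, 0.67, 0.0, 0.33, 1.0, 0.0]
```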

Second, the secret sauce of Pandora is the distance metrics that it has developed on the genomes to assess "nearness". The creation of the metric on a binary vector space will have been done with a bit of statistics and a lot of tweaking. I suspect that all the genomes have been adjusted to the same length, with a lot of entries zeroed out for attributes that were not assessed at the time of analysis. (That is, when analyzing a harpsichord performance of a Scarlatti sonata, the question of whether the track has "trip-hop roots" will not be considered.) Distance metrics would then be constructed on that binary state-space so that nearness can be measured across genomes.
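Building that common layout might look something like the following sketch, which rests entirely on my assumptions; the attribute catalogue, names, and helper function are made up:

```python
# Speculative sketch: project genomes analyzed against different attribute sets
# onto one "uber-genome" layout, zero-filling attributes that were never assessed.

UBER_ATTRIBUTES = ["minor key", "syncopation", "trip-hop roots",
                   "thin synth textures", "acoustic piano"]  # invented names

def to_uber_genome(assessed):
    """Map a dict of {attribute: 0/1} onto the full attribute list."""
    return [assessed.get(attr, 0) for attr in UBER_ATTRIBUTES]

# A harpsichord Scarlatti sonata: "trip-hop roots" simply wasn't considered.
scarlatti = {"minor key": 1, "syncopation": 0, "acoustic piano": 0}
print(to_uber_genome(scarlatti))  # [1, 0, 0, 0, 0]
```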

The metric (if all the previous speculation is true) would be defined by a simple weight vector in the n-dimensional state space of the n-digit uber-genome, with the weights indicating which characteristics (like tempo) are more important than other characteristics (like the presence or absence of "thin synth textures").
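Such a metric could be as simple as a weighted Hamming-style distance. The weights and genomes below are invented, purely to show the shape of the calculation:

```python
# Speculative sketch: weighted distance between two uber-genomes.
# A higher weight means the attribute matters more when judging "nearness".

def weighted_distance(g1, g2, weights):
    return sum(w * abs(a - b) for a, b, w in zip(g1, g2, weights))

weights = [3.0, 1.0, 0.5, 0.2, 1.5]   # e.g. a tempo-like trait weighted heavily,
                                      # "thin synth textures" weighted lightly
song_a = [1, 0, 1, 0, 1]
song_b = [1, 1, 1, 1, 0]
print(weighted_distance(song_a, song_b, weights))  # 2.7
```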

There might be only one such distance metric used throughout Pandora, or there might be several. The most important one is the one that selects songs. I suspect that they calculate a list of a hundred or so nearest neighbors for each song in the database, and that the Player picks a track, chooses a random set of three other songs from that track's list to make a set, and then applies all sorts of mandated constraints, replacing tracks as necessary.
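A rough sketch of that selection step, under my assumptions only; the function names, the plain Hamming distance, and the `is_allowed` constraint check are all stand-ins, not anything Pandora has described:

```python
# Speculative sketch: precompute a nearest-neighbor list per song, then build a
# playlist set from one track plus random picks off its list, subject to constraints.
import random

def distance(g1, g2):
    """Plain Hamming distance; a weighted metric would slot in here instead."""
    return sum(a != b for a, b in zip(g1, g2))

def nearest_neighbors(song_id, genomes, k=100):
    """The k song ids closest to song_id (computed once and stored per song)."""
    others = [sid for sid in genomes if sid != song_id]
    others.sort(key=lambda sid: distance(genomes[song_id], genomes[sid]))
    return others[:k]

def pick_set(seed_id, neighbor_lists, is_allowed):
    """One seed track plus three neighbors, skipping any that fail a constraint."""
    picks = []
    candidates = list(neighbor_lists[seed_id])
    random.shuffle(candidates)
    for sid in candidates:
        if is_allowed(sid):        # stand-in for Pandora's mandated constraints
            picks.append(sid)
        if len(picks) == 3:
            break
    return [seed_id] + picks
```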

I would imagine that the lists of similar artists and similar albums are generated in much the same fashion as the play-lists. Each Artist and Album will have been reduced to an average of its respective tracks; the distance between albums or artists may be computed using a different distance metric than the one used in the Player, but the nearest-neighbor algorithm applied is likely the same.
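If albums really are collapsed to the average of their tracks, the Backstage lists could be produced with something like the following sketch; again, the function names and the plain Euclidean distance are my inventions:

```python
# Speculative sketch: collapse each album to the average of its tracks' genomes,
# then rank the other albums by distance to that centroid.

def album_centroid(track_genomes):
    """Element-wise mean of the album's track genomes."""
    n = len(track_genomes)
    return [sum(col) / n for col in zip(*track_genomes)]

def similar_albums(album_id, centroids, k=10):
    """The k albums whose centroids sit closest to this album's centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    others = [aid for aid in centroids if aid != album_id]
    others.sort(key=lambda aid: dist(centroids[album_id], centroids[aid]))
    return others[:k]
```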

The big difference between the Player and the Backstage lists is, I suspect, that the Player databases are updated frequently, while the Backstage lists are probably generated when the album or the artist is initially published to the Backstage database and are likely never updated after that. They might update the similar-artists lists whenever a new album for that artist is analyzed, but I've seen no evidence that they do (not that I've observed the artist lists that closely over time). Thus, there might be "nearer" artists now available in the database, but the pages are never updated to reflect that fact.

Of course, the real problem with this method is the initial averaging of all the tracks for the artist or album. A diverse artist (or album) gets mushed to a single point in space, even though the very fact that they are diverse is probably a key characteristic in many people's minds when comparing artists. Tom Conrad is pretty smart, so they probably could measure the width of the footprint of the tracks in n-space via some second-order statistics beyond simple averages and fold that into their nearness calculation, but the fact is that I don't think anyone cares all that much about the similar artist and album lists.
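For what it's worth, that second-order fix might be as simple as tracking each album's spread around its own centroid and penalizing mismatched spreads. This is only a sketch of the idea; the `spread_weight` knob and all the function names are hypothetical:

```python
# Speculative sketch: measure how "diverse" an album is as the mean distance of
# its tracks from the album centroid, and compare that alongside the centroids.

def centroid(track_genomes):
    n = len(track_genomes)
    return [sum(col) / n for col in zip(*track_genomes)]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def footprint_width(track_genomes):
    """Average distance of an album's tracks from its own centroid."""
    c = centroid(track_genomes)
    return sum(euclidean(g, c) for g in track_genomes) / len(track_genomes)

def album_distance(tracks_a, tracks_b, spread_weight=1.0):
    """Centroid distance plus a penalty for mismatched diversity."""
    d = euclidean(centroid(tracks_a), centroid(tracks_b))
    return d + spread_weight * abs(footprint_width(tracks_a) - footprint_width(tracks_b))
```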