Sound Index Algorithm
Thanks for your comments on Beth's previous post about the Sound Index.
The Sound Index is not meant to be a definitive chart (like a sales chart). Rather, it's a gauge of who and what is currently driving conversation and interaction about music online. It's a great tool for music discovery, and for finding out who's currently hot in the music world of teenagers. However, we have taken steps to ensure that our data collection is as accurate as possible, and have implemented an algorithm to help us create the most editorially relevant and robust Index.
The Sound Index is currently in a four-month trial stage, so that (among other things) the technical, editorial and cost implications of various algorithm options can be assessed.
After viewing an Index based on the raw data, we felt an algorithm was needed to allow all the sources to contribute to the Index, and for all forms of activity on the internet - plays, comments and downloads - to affect the rankings. Without an algorithm, the large volumes of the more dominant forms of interaction - mainly plays and downloads - drowned out the smaller numbers of comments, which we felt were important to reflect in the Index.
Therefore, my team has developed the following algorithm, which I feel gives an editorially relevant and justified chart, without any bespoke manipulation or input, meaning that the Sound Index can be viewed as an accurate gauge of online buzz.
For each type of interaction (play, comment, download), all the data for each artist on each individual site is added up. Then each artist is given a score depending on how popular they are on each site. This score is directly related to how many artists are on that site. For example, if a site tracked 200 artists, the Number One artist (with the most counts) would have a score of 200, whereas the artist with the least would have a score of 1.
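The rank-scoring step above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Sound Index code; the site data and artist names are hypothetical, and ties are not specially handled:

```python
def rank_scores(counts):
    """Convert raw per-site counts into rank scores: on a site with N
    artists, the artist with the highest count scores N and the artist
    with the lowest count scores 1."""
    # Sort artists by count, lowest first, then assign 1..N by position.
    ordered = sorted(counts, key=counts.get)
    return {artist: pos + 1 for pos, artist in enumerate(ordered)}

# Hypothetical play counts for one site tracking four artists:
plays = {"Artist A": 90000, "Artist B": 120, "Artist C": 4500, "Artist D": 31}
scores = rank_scores(plays)  # Artist A scores 4, Artist D scores 1
```

Note how the rank scores depend only on ordering, not magnitude: Artist A's 90,000 plays earn a score of 4, not 90,000, so one huge site cannot swamp the others.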
We didn't want sites with massive amounts of only one type of data to totally dominate the Sound Index. So each type of data - play, download, comment - is limited to make up a set proportion of each artist's popularity. This is determined by how many different sources there are for each type of data. So, if there were ten sources in total, made up of five play counts, three downloads and two comments, we would multiply the ranks from each source in the following way: 5/10 for plays, 3/10 for downloads and 2/10 for comments.
Each artist's rank scores, multiplied by these fixed proportions, are then added together to give a total buzz score, which is used to create the Sound Index. The same method is applied separately for individual tracks. We have also put processes in place with our data collection methods to reduce the impact of gaming. Our partner has implemented spam filtering, porn filtering, multiple post detection and verification to help keep the data as clean as possible.
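The weighting and summing steps above can be sketched as follows. Again this is a hedged illustration of the described method, not the production code; the sources, rank scores and artist names are all hypothetical:

```python
from collections import Counter, defaultdict

def buzz_scores(sources):
    """Combine per-site rank scores into a total buzz score.

    sources: list of (data_type, {artist: rank_score}) pairs, one per site.
    Each source's ranks are multiplied by that type's share of the sources
    (e.g. five play sources out of ten total -> 5/10), then summed per artist.
    """
    type_counts = Counter(dtype for dtype, _ in sources)
    total = len(sources)
    scores = defaultdict(float)
    for dtype, ranks in sources:
        weight = type_counts[dtype] / total
        for artist, rank in ranks.items():
            scores[artist] += weight * rank
    return dict(scores)

# Hypothetical example: two play sources and one comment source,
# so play ranks are weighted 2/3 and comment ranks 1/3.
sources = [
    ("play",    {"Artist A": 2, "Artist B": 1}),
    ("play",    {"Artist A": 2, "Artist B": 1}),
    ("comment", {"Artist A": 1, "Artist B": 2}),
]
result = buzz_scores(sources)  # Artist A ~3.0, Artist B ~2.0
```

In this toy example Artist A tops both play sources and so finishes ahead overall, but Artist B's comment lead still contributes a third of the total, reflecting the stated goal of letting smaller data types affect the rankings.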
The Sound Index is a project based on trialing new technology. I think that in its current form it's been successful in achieving an exciting way of discovering which bands and artists are creating the most buzz. We are not using it to define any charts or create any definitive lists. Anything editorial around the Sound Index should not use it as an absolute measure: it's a gauge of what is hot. It's a great example of innovation and collaboration with major music sites. We're still learning about what the Sound Index can do.
How would you like the Sound Index to develop? Please do leave a comment.
Geoffrey Goodwin is Head of ´óÏó´«Ã½ Switch.
Comment number 1.
At 9th Jun 2008, Briantist wrote: Interesting and good work so far.
Just a little thing... I think your weighting system needs some justification, rather than being reverse-engineered to provide an 'interesting' ongoing editorial item.
Under normal web circumstances, you have two issues. The first is the 'long tail' and the logarithmic nature of web content.
I think you might like to consider using the log of the counts, rather than the counts themselves, as it would support your stated goals. In particular, it stops massive numbers of one type of event drowning out modest events in another, but allows small amounts of data to be visible too.
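To illustrate the commenter's suggestion (a hypothetical sketch, not part of the Sound Index algorithm; the counts below are made up): taking the log of the counts compresses the range, so a data type with millions of events no longer dwarfs one with hundreds.

```python
import math

# Hypothetical raw counts spanning four orders of magnitude.
counts = {"plays": 2_500_000, "downloads": 40_000, "comments": 800}

# log10 compresses that range to roughly 3..6.5, so comments
# remain visible alongside plays when the values are combined.
logged = {k: round(math.log10(v), 2) for k, v in counts.items()}
# plays ~6.4, downloads ~4.6, comments ~2.9
```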
Comment number 2.
At 10th Jun 2008, SteveFarr wrote: Geoffrey, further to my comments on Beth's post, thanks for coming back with this more detailed explanation.
Based on the information supplied here we can all now draw our own conclusions as to how useful or how reliable the results are. Just the way it ought to be!
Thanks again for keeping us in the loop. Good work, all, and I hope the blogging community responds constructively.