Thursday, April 24, 2008

Complex data query requirements.

I have recently participated in a discussion about complex data query requirements. Jonathan Holland, who wrote a post about the suitability of relational databases, asked me to write a blog entry about the topic, in particular in relation to a shop I wrote. So here goes:

Architecture Outline
The shop is written in Python and JavaScript. It is a multi-tier MVC architecture; from the bottom up it consists of:
  1. One read-only slave database on each machine (PostgreSQL)
  2. A variety of web request daemons on each machine: one for general data (accesses the databases, load balanced but favoring the local database), one for browsing (uses in-memory data), one for searching (uses a Python full-text search engine)
  3. A load balancer/web server (nginx) that dispatches requests to the web request daemons.
  4. Client business/presentation logic implemented in JavaScript, which queries the server farm, sends data to it, and updates the client-side representation accordingly.
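As an illustration of step 3, here is a minimal nginx sketch of that dispatching. The upstream names, ports and URL prefixes are made up for the example, not the shop's actual configuration:

```nginx
# Hypothetical daemon pool for the general-data daemons.
upstream data_daemons {
    server 127.0.0.1:8001;  # local daemon
    server 10.0.0.2:8001;   # peer machine
}

server {
    listen 80;

    # Dispatch by URL prefix to the matching daemon.
    location /data/   { proxy_pass http://data_daemons; }
    location /browse/ { proxy_pass http://127.0.0.1:8002; }
    location /search/ { proxy_pass http://127.0.0.1:8003; }
}
```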
The Problem
As a music shop it features about 800'000 songs with about 4'000'000 metadata associations. How does one make the bulk of the catalog browsable for a user?

The Solution
A user interface pattern called "drill-down". It works by starting from a summary of all content and offering a way to select criteria, stacking more and more conditions as you go along.

The Problem continued
Obviously a user might want to select metadata as filters: for instance, all songs associated with the type (year) and the value (2006); of that result set, all songs associated with the type (genre) and the value (Pop); of that result, all songs associated with the type (artist) and the value (+44); and of that final result set, all associated unique metadata rows with the type (album), everything sorted alphabetically, with a count aggregate of how many songs each album has.
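In code, such a stacked request boils down to an ordered list of (type, value) filter pairs plus a final grouping. The names below are illustrative only, not the shop's actual request format:

```python
# Hypothetical representation of the drill-down request described above.
filters = [
    ("year", "2006"),
    ("genre", "Pop"),
    ("artist", "+44"),
]
# Final view: unique albums of the filtered songs, sorted, with song counts.
group_by = "album"
```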

It is easy to imagine that this becomes quite a complex query in SQL. In fact, the query would be a series of HAVING conditions with a subquery against songs and a WHERE clause for each filter, all wrapped in a subquery with an ORDER BY clause around it; and that's just the start, because you still need to compute the aggregates correctly. I've written multiple such queries, and if you put a real-life amount of music data in the database, indexes or not, the performance is not acceptable (upwards of 30 seconds on a quad-core Xeon machine with 4 GB of RAM).

The Solution Continued
I chose to step around the problem and put the data for retrieval somewhere other than a database.
Some preliminary tests showed that the maximum amount of expected data (around 2'000'000 songs) shouldn't exceed 1 GB when kept in RAM.
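As a quick sanity check of that estimate (my own arithmetic here, not the original tests): a 1 GB budget over two million songs still leaves roughly half a kilobyte per song for its title, integer key, and share of the precomputed filter sets:

```python
budget_bytes = 1 * 1024 ** 3   # 1 GB RAM budget
songs = 2_000_000              # maximum expected catalog size
per_song = budget_bytes // songs  # ~536 bytes available per song
```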

The Technical Solution
It uses Python sets (an unordered data type that is fast at building intersections). All songs, types and values are sorted alphabetically, and the index offset of each item's position in the sorted collection holding it becomes its "key".
From there on it's a pretty straightforward algorithm: the result set of songs is reduced by retrieving pre-computed sets of songs for each filter, stored in associative collections (dicts); the resulting set of integer IDs is sorted numerically, and the desired slice is translated back (using the sorted collection of all songs) into titles.
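A minimal sketch of that algorithm with made-up data (the real shop precomputes these structures from the label catalogs; the dict and function names here are mine):

```python
# Alphabetically sorted master list of titles; a song's index is its key.
songs = ["Baby Come On", "Lycanthrope", "When Your Heart Stops Beating"]

# Precomputed: (metadata type, value) -> set of song keys.
by_filter = {
    ("year", "2006"): {0, 1, 2},
    ("genre", "Pop"): {0, 2},
    ("artist", "+44"): {0, 2},
}

def drilldown(filters, offset=0, limit=50):
    """Intersect the precomputed set for each stacked filter, then
    translate the sorted slice of integer keys back into titles."""
    result = set(range(len(songs)))       # start from the full catalog
    for type_value in filters:
        result &= by_filter[type_value]   # fast C-level set intersection
    keys = sorted(result)                 # numeric order == alphabetic order
    return [songs[i] for i in keys[offset:offset + limit]]
```

Each added filter can only shrink the working set, which is what keeps arbitrarily stacked conditions cheap compared to the nested-subquery SQL version.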
Of course one needs to get the data into RAM somehow. The process is simple and works as follows:
  1. Read all data from the database into RAM (takes about 10 minutes)
  2. Dump the resulting data structure to disk using a Python object serialization module called "marshal" (occupies around 300 MB on disk; takes about 1 minute)
  3. At startup of the web daemon, load the marshal file into RAM (takes about 1 minute)
  4. Ready to answer requests for metadata.
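The dump/load steps above can be sketched with the standard-library marshal module (using a tiny made-up catalog; the real structures are of course far larger):

```python
import marshal

# Step 1 in the real system reads this structure from PostgreSQL;
# here it is a hypothetical two-song catalog.
catalog = {
    "songs": ["Baby Come On", "Lycanthrope"],
    "by_filter": {("genre", "Pop"): {0, 1}},
}

# Step 2: dump the structure to disk.
with open("catalog.marshal", "wb") as f:
    marshal.dump(catalog, f)

# Step 3: at web daemon startup, load it back into RAM.
with open("catalog.marshal", "rb") as f:
    loaded = marshal.load(f)
```

marshal is fast precisely because it is a dumb, Python-version-specific format; that's acceptable here because the file is only a local cache that can always be regenerated from the database.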
Performance
A single machine (quad-core 3 GHz Xeon) under load can serve most requests for such data views in under 300 ms end to end (the algorithm itself mostly takes between 10 and 30 ms; the remainder is lost in HTTP and I/O waits).


Final thoughts
It took quite a bit of development effort to come up with this. However, I think it is quite valuable because it lets the end user explore the data. In addition to searching and browsing editorial content, it's another way to discover music you might like. And since the goal of a music shop is to sell music, discovering music you like translates to profit.
It is also a testament to Python that you can do things like this with it: it has some core components offering superb performance, and if you assemble them correctly, the fact that Python isn't compiled to native code doesn't matter.

5 comments:

Jonathan Holland said...

Hi Florian,

I am intrigued by your solution and impressed that Python was able to handle that.

Is it possible for me to get my hands on your original dataset? I'd like to throw it into SQL Server and run some ad hoc queries on it.

Let me know, JonEHollandATGmail.com

Dotan Dimet said...

Jonathan, you might try playing with this:

http://www.freedb.org/en/download__database.10.html

(no idea what Florian is using, but it seems like a big database of songs/discs/artists/genres).

Florian said...

I was using aggregated data by the big four (sony bmg, universal, emi and warner), for which the company I worked for had contracts.

Ian said...

Your solution does not scale. What happens when you have 8M songs and 40M metadata entries? The post you referred to was about RDBMS scaling, but your solution was designed for a fixed-size dataset and would likely fall flat when the dataset increases.
