Thursday, April 24, 2008

Complex data query requirements.

I have recently participated in a discussion about complex data query requirements. Jonathan Holland, who wrote a post about the suitability of relational databases, asked me to write a blog entry about the topic, in particular in relation to a shop I wrote. So here goes:

Architecture Outline
The shop is written in Python and JavaScript. It is a multi-tier MVC architecture; from the bottom up it consists of:
  1. One read-only slave database on each machine (Postgres)
  2. A variety of web request daemons on each machine: one for general data (accesses the databases, load balanced but favoring the local database), one for browsing (uses in-memory data), one for searching (uses a Python full-text search engine)
  3. A load balancer/web server (nginx) that dispatches requests to the web request daemons
  4. A client-side business/presentation logic implementation in JavaScript that queries and sends data to the server farm and updates the client representation accordingly
The Problem
As a music shop it features about 800'000 songs with about 4'000'000 metadata associations. How do you make the bulk of the catalog browsable for a user?

The Solution
A user interface pattern called "drilldown". It starts at a summary of all content and offers a way to select criteria, stacking more and more conditions as you go along.

The Problem Continued
Obviously a user might want to select metadata as filters: for instance, all songs associated with type(year) and value(2006); of that result set, all songs associated with type(genre) and value(Pop); of that result set, all songs associated with type(artist) and value(+44); and of that final result set, all associated unique metadata rows with type(album), everything alphabetically sorted, with a count aggregate of how many songs each album has.

It is easy to imagine that this becomes quite a complex query in SQL. In fact the query would be a series of HAVING conditions with a subquery against songs and a WHERE clause for each filter, all wrapped in a subquery with an ORDER BY clause around it, and that's just the start, because you still need to compute the aggregates correctly. I've written multiple such queries, and with a real-life amount of music data in the database, indexes or not, the performance is not acceptable (upwards of 30 seconds on a quad-core Xeon machine with 4 GB of RAM).

The Solution Continued
I chose to step around the problem and put the data for retrieval somewhere other than in a database.
Some preliminary tests showed that the maximum amount of expected data (around 2'000'000 songs) shouldn't exceed 1 GB when kept in RAM.

The Technical Solution
The solution uses Python sets (an unordered collection type that is fast at computing intersections). All songs, types, and values are alphabetically sorted, and the index offset of each item's position in the sorted collection holding it becomes its "key".
From there on it's a pretty straightforward algorithm: it reduces the result set of songs by intersecting pre-computed sets of song IDs for each filter, stored in associative collections (dicts); the resulting set of integer IDs is numerically sorted again, and the desired slice is translated back (using the sorted collection of all songs) to titles.
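
Here is a minimal sketch of that algorithm with illustrative data; the names (songs, filter_index, drilldown) are mine, not the shop's actual code:

    # Minimal sketch of the set-intersection drilldown (illustrative data).
    # The real shop pre-computes these structures from the database.

    # All song titles, alphabetically sorted; a song's list index is its ID.
    songs = sorted(["Baby Come On", "Lycanthrope",
                    "When Your Heart Stops Beating"])

    # Pre-computed index: (metadata type, value) -> set of song IDs.
    filter_index = {
        ("year", "2006"):  set([0, 1, 2]),
        ("genre", "Pop"):  set([0, 2]),
        ("artist", "+44"): set([0, 1, 2]),
    }

    def drilldown(filters, start=0, stop=20):
        """Intersect the pre-computed ID sets for each filter and return
        the requested slice of matching titles, alphabetically sorted."""
        result = set(range(len(songs)))      # start with all songs
        for key in filters:
            result &= filter_index[key]      # cheap set intersection
        ordered = sorted(result)             # integer IDs sort like titles
        return [songs[i] for i in ordered[start:stop]]

    print(drilldown([("year", "2006"), ("genre", "Pop")]))
    # -> ['Baby Come On', 'When Your Heart Stops Beating']

Each additional filter is just one more set intersection, which is why stacking conditions stays cheap.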
Of course one needs to get the data into RAM somehow; the process is simple (see the sketch after this list):
  1. Read all data from the database into RAM (takes about 10 minutes)
  2. Dump the resulting data structure to disk using a Python object serialization module called "marshal" (occupies around 300 MB on disk, takes about 1 minute)
  3. At start of the web daemon, load the marshal file into RAM (takes about 1 minute)
  4. Ready to answer requests for metadata.
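
For completeness, a minimal sketch of the marshal dump-and-load cycle; the file name and the exact shape of the data structure are illustrative assumptions, not the shop's real layout:

    # Sketch: persist the in-memory index with marshal and reload it at
    # daemon start-up. marshal handles simple built-in types (dicts,
    # lists, tuples, ints, strings), which is all the index consists of.
    import marshal

    def dump_index(songs, filter_index, path="catalog.marshal"):
        # Store the ID sets as plain lists to stay safe across Python
        # versions; the sets are rebuilt on load.
        data = (songs, dict((k, list(v)) for k, v in filter_index.items()))
        with open(path, "wb") as f:
            marshal.dump(data, f)

    def load_index(path="catalog.marshal"):
        with open(path, "rb") as f:
            songs, raw = marshal.load(f)
        return songs, dict((k, set(v)) for k, v in raw.items())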
Performance
A single machine (quad-core 3 GHz Xeon) under load can serve most requests for such data views in under 300 ms end to end (the algorithm itself takes between 10 and 30 ms; the remainder is lost in HTTP and I/O waits).


Final thoughts
It took quite a bit of development effort to come up with this. However, I think it is quite valuable because it lets the end user explore the data. In addition to searching and browsing editorial content, it's another way to discover music you might like. And since the goal of a music shop is to sell music, discovering music you like translates into profit.
It is also a testament to Python that you can do things like this with it: it has some core components offering superb performance, and if you assemble them correctly, the fact that Python isn't compiled to native code doesn't matter.

Tuesday, July 10, 2007

Rene talks about multithreading and multiple CPUs

Rene's talk is about how to use Python with multiple CPUs; he has written a paper about it. The core libraries you use for game programming are already threaded, and he has written a library to manage threads in games. The core idiom is "map", which in the context of threads becomes tmap.
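
A minimal sketch of what a threaded map might look like, assuming the idea is simply to run the function on worker threads while keeping results in input order; this sketch is mine, not Rene's actual library:

    # Sketch of a threaded map ("tmap"): apply func to every item on its
    # own worker thread and return the results in input order.
    import threading

    def tmap(func, items):
        results = [None] * len(items)

        def worker(i, item):
            results[i] = func(item)

        threads = [threading.Thread(target=worker, args=(i, item))
                   for i, item in enumerate(items)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

    print(tmap(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

Note that for CPU-bound work CPython's GIL limits how much this buys you; it helps most when the mapped functions release the GIL (I/O, C extensions).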

Monday, July 09, 2007

py.test

I was a little late to this talk (took a nap in the break). py.test is an alternative to unittest, but it seems specifically geared to the needs of PyPy.

ThanCad

I'm intrigued that somebody went and developed a whole CAD application in pure Python and released it under the GPL.

The gist of why ThanCad is better is that it has some fancy features basically all other CAD apps lack; they were easy for the author to implement because he built the application open from the ground up. It's the lesson of simplicity, really. It makes sense: in AutoCAD, for instance, when you want to tack on some fancy features, you mostly can't because you don't have the code, and even if you did have the code, you couldn't because it's big and bloated.

The architecture seems pretty much an exercise in patterns (as in Gamma et al.), but I suspect it works because it's born of practicality (rather than academic blindness to the real world).

On the downside, the choice of toolkit and technique leads to many problems and workarounds. While the speaker enumerates the problems he met, ideas pop into my head of doing it with OpenGL, some small GUI toolkit, and greenlets.

It seems a highly interesting project; I'll definitely check it out.

KSS, javascript with style

KSS is a client framework, so there's no intrinsic server-side requirement. It looks interesting: it sort of encapsulates event handling and view changes into a domain-specific language. I wonder if it works outside a sandbox/wiki environment, with real interaction requirements.

It is an interesting thing, and it certainly is worth trying out.

Pylons

The Pylons talk is pretty much what you can find out about Pylons yourself. On an interesting note, supervisor2 received some praise in this talk as a tool that can restart dead apps; seems nice. The rest of the talk was about the development of Pylons itself.

Pythonic interfaces?

This talk is about (the Java understanding of) interfaces. The argument is that if you work with many, many people on large-scale projects, you basically need interfaces. I tend to agree, but I think large teams are a bad idea :)

So, the speaker also mentions that the other solutions out there (Zope interfaces, PyProtocols, etc.) are overly complex. Yes, I agree; let's see...

Ah-ha, he uses decorators and inheritance to implement the functionality of an interface. And that seems to be it, mostly.
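
As an illustration, here's a minimal sketch of how inheritance plus a decorator could enforce an interface; the names (Interface, implements) are my guesses at the approach, not the speaker's actual API:

    # Sketch: enforce an interface with inheritance plus a class decorator.
    # Any public method on the interface class counts as part of the contract.
    class Interface(object):
        """Base class for interface definitions."""

    def implements(interface):
        """Class decorator that verifies all interface methods exist."""
        required = [name for name, attr in vars(interface).items()
                    if callable(attr) and not name.startswith("_")]

        def check(cls):
            for name in required:
                if not callable(getattr(cls, name, None)):
                    raise TypeError("%s is missing %s()" % (cls.__name__, name))
            return cls
        return check

    class Playable(Interface):
        def play(self): pass
        def stop(self): pass

    @implements(Playable)
    class Song(object):
        def play(self): print("playing")
        def stop(self): print("stopped")

The appeal is that the check runs once, at class creation time, rather than on every call.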

Well, I sort of like that the author goes out of his way to make it easy to use. The whole thing is contained in a single file. Speed seems to have been a focus, since it caches and optimizes for instantiation time.

Well, I don't know; it might help, it might not. In any case, there's no up-to-date version of it at http://www.mikeware.com/index.php

Sunday, July 08, 2007

mxTextTools

At its core mxTextTools is a state machine (a tagging engine) written in C. The reason for introducing it was that in order to write a parser, you need a matcher (at both the tokenizer and parser level). Writing matchers usually involves re, which is a pain in the butt. I like that it has a JIT compiler for the tagging commands. This seems to be a pretty powerful replacement for the usual lexers/parsers.
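
From memory, a tag table is a sequence of assembly-like command tuples that the C engine runs against the text; the sketch below is hedged, as I may be misremembering details of the mx.TextTools API:

    # Hedged sketch of an mxTextTools tag table; each entry is roughly
    # (tag object, command, argument). Details may differ from the real API.
    from mx.TextTools import tag, AllIn, a2z, A2Z

    # Tag one leading run of letters as a 'word'.
    word_table = (
        ("word", AllIn, a2z + A2Z),
    )

    success, taglist, next_pos = tag("hello world", word_table)
    print(success, taglist, next_pos)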

Have a look at mxTextTools; it seems an interesting alternative to writing parsers with re/EBNF machines. I'm just not sure that editing this list of assembly-like commands is really easier than a few bits of REs and a nicely written-down EBNF.

EasyExtend

EasyExtend seems to be about creating DSLs in Python. It can parse a language from an EBNF grammar and extend the Python grammar. You can download EasyExtend.

It compiles these defined languages to the Python parse tree, which is then compiled to Python bytecode by Python. It can run multiple different languages within one Python run, via a concept called fibres.

I don't know if it's really appealing, since it seems pretty much tied to the Python runtime. For instance, can somebody tell me if I can create new syntax, say for an anonymous function? It seems like that's possible; the hard thing is changing not just syntax but semantics.

gearing up for EuroPython

I've arrived at the Hotel Reva in Vilnius, had a good night's sleep, and I'm headed down for breakfast and registration. The hotel is nice, and Thomas Waldmann (with whom I share a room) doesn't snore (sigh of relief).

so long, __doc__

Friday, February 09, 2007

Linux and Mac users may be in the minority, but they are vocal

I found the phrase "Linux and Mac users may be in the minority, but they are vocal" in this article about Java and Ajax. It struck me as odd to even remark upon it, but then I figured that maybe it's an ignored phenomenon.

Being an early adopter is a way of living, not some marketing category. Either you do it in most aspects of your life, or you don't. In order to take up Linux or a Mac, or use Firefox instead of IE, you have to get active. Purely by crossing that line you've placed yourself in the group of people that care about their digital tools and seek to improve them.

Funnily, this kind of description would perfectly fit the bill of early adopters too. It's no coincidence there's a similarity: Linux and Mac users are deep early adopters, at the very end of the scale of how early-adopterish you can get.

Thus the remark becomes:
Early adopters may be in the minority, but they are vocal.
And now it's you who'll go "duh!" at me, because it's so glaringly obvious.
But consider: most applications are written for Windows, and a lot of websites out there only work satisfactorily in IE. Thus the deep early adopters will avoid these products and projects like the plague, because they don't run with what they prefer from the depths of their hearts.

We all know early adopters are important because they give you a kick-start market, free marketing, and a clear indication of whether what you did rocks or sucks. If you throw a good percentage of early adopters away simply because you feel their market is too small to be considered, you should consider that decision very, very carefully. It may prove fatal for your product.

Thursday, February 08, 2007

More thoughts on DRM

Because DRM requires a secret that may not be revealed, yet the place where the secret is kept/produced/defined is in the hands of an untrusted party, it cannot be "secure". Thus obscurity is important to "secure" a DRM scheme, and in this FairPlay is not alone. Other companies can proudly claim to have the most obscure DRM, for whatever it's worth.

DRM advocates may cry foul now; after all, they have invented all those shiny toys they like to brandish. However, consider these two simple facts:
  • Laws protecting DRM against circumvention are lobbied for all around the globe by industry lobbyists
  • No DRM company puts its specification up for open public review.

The issue to be dealt with is trust. The consumer has been ruled out as someone to be trusted, so nobody is left to trust. If you trust nobody, you're paranoid, and the world is a scary place to live in. Oddly, that's just how the RIAA/MPAA seem to be behaving.
  • A talk about DRM by Cory Doctorow
  • A video illustrating problems with trusted computing. DRM runs on the concept of trusted hardware, and trust is an important topic in the discussion.
Contrary to popular belief, copyright law is not only about who owns the content and may restrict copying. It is also about:
  • Fair use
  • Right for private copy
  • The public domain (after the copyright expires)
Unless DRM is implemented in a fashion that accommodates these (still existing and well, thank you) laws, it's in a legal grey zone.

The future of the DRM industry may seem rosy for now. Today, and for a while to come, demand for the technology exists, but nobody guarantees that demand will continue. A downturn in this market will emerge, and as usual, those who had a solid business foundation and delivered real and lasting value to their customers will survive it.

Wednesday, February 07, 2007

Basic inalienable digital rights

I was reading a piece about the RIAA and their convoluted understanding of Steve Jobs' open letter when I stumbled upon the term "basic inalienable digital rights".

There a question occurred to me: do I want to found the society our children are going to live in on the basis of the DMCA, patent law, copyright, and an RIAA/MPAA mafia?

Every society traces its founding back to documents laying out basic rights, such as the Declaration of Independence and the Bill of Rights. Today we are in the midst of a fast-changing civilization. In a mere 15 years the Internet and widespread computing hardware have started to influence and change our lives completely.

A new substrate, a basic building block for civilization as we know it, is right now being invented. We find new ways to partake in culture, express ourselves, communicate, discover, share and consume. I have no doubt that the last 15 years of history have been as important as the invention of law, the written word, and the printing of books.

But who forges rights in our brave new society? We have copyright (which mostly serves a content industry these days), patent law, the DMCA, and any number of other legal paraphernalia that usually accompany enterprises, none of which are concerned with ensuring the rights of digital citizens.

These are times of change and times of definition. We as the children of a digital age must come together and define the basic inalienable digital rights we want to live by. These are personal rights, and they will be as important to the founding of our future society as the Declaration of Independence has been.

We are in dire need of defining ourselves before the rifts between opposing forces of our society become too great to mend. We're in dire need of basic inalienable digital rights. It is a great blunder of our time that we haven't gotten around to thinking about this sooner.