migrating this blog
I'm migrating this blog over to msdn. The new URL is at the following
http://blogs.msdn.com/reuvenlax/default.aspx
I'm migrating this blog over to msdn. The new URL is at the following
http://blogs.msdn.com/reuvenlax/default.aspx
In my list of afterthoughts, the third item is performance.
Why don't developers like thinking about performance from the start? It's clearly not for the same reasons as backup and security, since performance is about a product working well. Nobody will be fooled into thinking performance tuning is easy, so why leave it till the end?
Quite simply, it's been beaten into us. Years ago, when processors were slow and memory expensive, people did spend a lot of time thinking about performance. Today, there's been a strong movement to move away from those practices. I can still hear my professor Joe Warren in an intro class telling us never to think about performance "because computers these days are so fast and have more than enough memory." These attitudes have been driven into us, so failure to think about performance has become cultural.
The truth is, that Joe Warren wasn't fully truthful. I took his class in 1997, and computers today are many times faster and have much more memory. Yet, I'm still struck by how slow many applications are. Simple operations such as opening a windows can take seconds to complete. If computers are getting so efficient, why are applications getting so slow?
Part of the problem is that this assertion that "computers are getting exponentially faster " contains a very subtle lie. Raw processor speed does increase at an incredible rate, as does available memory. What hasn't been increasing that fast is I/O speeds. In fact while disk densities have been doubling about each year for the past 15 years, disk access speeds have only gotten about 10 times faster. Programmers who don't think about performance often end up with programs that are I/O bound. 5 years later the CPU is blistering fast, but their programs are still chugging along waiting for a disk head to move.
Another problem is that performance is often hard to retrofit. Joe Warren wasn't telling a complete lie. Spending a lot of time tuning your code, optimizing inner loops, etc. is best left for later; get a correct, working application before worrying about that stuff. However the overall design of a system limits how much performance can be pushed through. It's very tempting to come up with the most elegant general design possible, and often it's not until much later when the realization hits that the design is the performance bottleneck. Then the application has to be rearchitected for v. 2.
There was another thing they told us in school. They told us that the only thing that really matters is asymptotic performance, and basically that only stupid programmers optimize for constant factors. Once again, this wasn't a complete lie. When you want something to be fast, improving asymptotic complexity is your low-hanging fruit. Improving an O(n^3) routine to run in O(n^2) is a win, and these optimization are the first ones you look for. The problem with stopping at this point is that asymptotic performance measures hide two important factors. For starters, constant factors are ignored. Making something run 10 times faster means nothing for asymptotic measures, but in the real world this can be a huge win. The other problem is a little more subtle. Remember back in high school when you studied function graphs? You graphed the straight-line function y=x as well as the parbola y=x^2. For a small interval, from 0 to 1, the straight line actually beats the parabola. You didn't care about this though, because asymptotically the quadratic function rules. This is what asymptotic performance measures. The algorithm that is asymptotically slower might actually beat the faster algorithm for small data sets, but when the data set gets larger than a threshold (and any finite number, no matter how large, can be this threshold) the asymptotically-faster algorithm starts winning. As an engineer, I want the algorithm that works best for me. If algorithm A is asymptotically slower than algorithm B yet still outperforms algorithm B on 90% of my data sets, guess which one I'm choosing? That's right, I'm choosing the slow one.
When people develop software, there are three things that aren't thought about until the final stages of development.
As a corollary, some might add Manageability.
Now all of these topics will probably have some token representation in any project's design specification, but few really think about them until the end, and sometimes not even then. Often a program won't be really performant and secure until at least the second version, and some projects go for years without reasonable backup solutions.
Why does this happen? Well as far as Security and Backup are concerned, people like to think of their software working, not about how their software might fail. It's uncomfortable to think about your data corrupted because of a crash or a virus, so you push those things off till the end. There's also a lot of naivety. How hard can backup be? Before we ship, I'll throw together a command-line utility that dumps all my data to a file. The customer can just reimport this file. It'll probably never even be needed, since my data store is rock solid.
I work at Microsoft on the Volume Shadow Copy Service, a service for creating point-in-time images of a volume. As such, I work very closely with many backup vendors (who are the main consumers of Volume Shadow Copies) and I'm very sensitive to people who don't design for backup. Windows contains a very general infrastructure that allows all data stores to publish the location of their data, how to backup the data, and provides hooks so the store can get the data into a consistent state before backup and after restore. Yet people persist in creating dozens of proprietary backup APIs and, even worse, command-line tools for backup. This creates a nightmare for admistrators: my backup application backs up some of his data, but he then needs scripts that automate all of the other command lines. It creates an even bigger nightmare for backup vendors: they have to track down all of these strange backup APIs with all their strange and varied semantics, and make it all consistent somehow. After Windows 2000 shipped, it was close to a year before any backup vendor shipped a solution. This was a huge deployment blocker, and cost Microsof t a lot of money.
Hooking the infrastructure to provide a consistent API is a start, but it's not enough. Your data store needs to have been designed with backup in mind. Case in point: I recently dealt with a distributed application, when talking about backup/restore they realized that they couldn't safely restore any one of their databases. All of their databases had references to each other, so restoring one meant restoring all of them. This application can be deployed across a server farm of dozens of machines. Their story to administrators was that if one machine failed, you have to do a full restore on every machine in your farm! If I was an administrator, I'd refuse to deploy this product! Once again, backup/restore was an afterthought.
While discussing the weaknesses of splay trees, I realized that they have lots of problems in muli-threaded environments.
Maintaining large trees in multi-threaded environments is always a challenge due to lock contention. The naive way of implementing this is to throw a big old reader-writer lock that guards access to the entire tree. One you writes start coming down at a decent rate though, you'll find lock contention will start killing your performance.
You could always create a lock for each node in the tree, but this wastes valuable resources. You could compramise and divide your tree into regions and have a seperate lock for each region. Balanced trees can make this more difficult since an add/delete can trigger rotations that will move nodes into new regions. When local operation have non-local effects, this regional approach will always be more difficult.
Splay trees throw a much bigger wrench into these gears. Firstly, operations on a splay tree have drastically non-local effects, all the way up to the root of the tree. To make matters worse, the tree is restructured on every operation, even reads! At least with the naive reader-writer lock solution, we supported parallel reads. Now it seems we're forced to serialize reads as well! Something's gotta give.
The first basic point to notice is that parallel splays are useless. If thread1 starts a splay operation on the tree and we figure out how to allow reads to still come through, none of those other reads should trigger a splay. Only the next operation after the current splay finishes should trigger a splay. Now we just have to figure out how to allow those other reads to still come through.
The first idea I had was to use a copy-on-write mechanism. As we move up the tree with our rotations make a copy of every node that changes, building a new subtree off to the side (this subtree will partially consist of copied nodes and partially of old nodes). Once we reach the root, we grab our reader-writer lock and quickly swap out the root pointer to point to this node. Lock contention is still a problem, but we've more-or-less reduced this to the problem of balanced binary trees to which solutions exist.
The above approach has one major flaw. A minor flaw is that we're hitting the memory allocator a lot with all the overhead and potentially-fragmented heaps that implies. However, in an application like this were you're anyway allocating very large numbers of evenly-sized packets, you should really consider using a simple slab allocator. Really, you'll find it can save quite a bit over a general heap. A bigger flaw is freeing the old memory. The easiest way to handle this would be to use a GC language. Unfortunately, if you're writing a high-performance, back-end data store, that's not bloody likely. This leaves you implementing a crufty ref-counting solution.
A little more thought gives a better solution. Define a global generation counter, and add a few fields to each node.
unsigned int generation = 0; // global generation
struct Node {
... key storage ...
int m_current = 0
unsigned int m_generation = 0
Node* m_left[2]
Node* m_right[2]
}
Each node now contains two sets of child pointers. Node.m_current takes on values from {0, 1} to indicate which set of pointers is valid. Node.m_generation contains the earliest generation for which Node.m_current is valid. Traversal now uses the following algorithm:
left(Node):
if (generation >= Node.m_generation)
return Node.m_left[m_current % 2]
else
return Node.m_left[(m_current - 1) % 2]
Now whenever we perform a splay, we first have to grab the reader/writer lock as a reader. when modifying a node's pointers , check the node to see which pair is in use and alter the other pair. Then we perform the following operation:
Node.m_generation = generation + 1
Node.m_current = Node.m_current + 1
Once we're done with the whole splay, grab the lock as a writer, increment the global generation, and release the lock.
The end result is that anybody traversing the tree while a splay is in progress will never see the tree topology changing. All the changes will be marked with a generation higher than the current one, and so will be ignored. Once we increment the generation under the lock, the new topology will appear by magic.
A couple of issues with this approach: the first one is generational rollover. If the generation conter ever rolls over, we have to walk the entire tree and reset all of the counters. Using a 64-bit counter can alleviate this, but that adds yet another word to each node.
Another issue is that once a splay operation is done, we cannot start a new splay operation until the generation counter is successfully incremented. If reads are coming in at a good clip this could take a while, so ensuring a fair reader/writer lock is important. We can help by storing longer lists of pointers in each node (if we store n left and n right pointers, that allows n-1 splays to happen before we update the counter), and a regional locking scheme could also help.
A couple of weeks ago I started a personal blog on livejournal, http:\\www.livejournal.com\users\groovinsax, to chronical a trip to Argentina. That blog will remain my personal blog, while this blog will focus on my technical interests and my job at Microsoft.
Today, Adi Oltean and I got into a discussion about splay trees. Balanced binary trees a fundamental tool of computer scientists and programmers everywhere, but they can be a pain in the ass to implement; I still have awful memories of being forced to implement a red-black tree as a part of a CS project freshman year in college. What makes balanced binary trees so useful, is the fact that any operation (add, delete, lookup) on an element in the tree takes place in O(lg n) time. The idea behind splay trees is that often the worst case performance of a single operation doesn't matter. What really matters is how well a data structure performs over time. If we can come up with a data structure that performs m operations in O(m * lg n) time, where n is the size of our data, we might not care that one or two of those operation were really slow. Simply, we care about throughput more than latency.
Splay trees are a fairly simple data structure due to Sleator and Tarjan (the source of all things cool). A bring-to-front heuristic is used. Every time an element is added, deleted, or read, a carefully defined sequence of rotations bring that element to the root of the tree. Unlike balanced trees, the state of the tree after an element is promoted is not guaranteed. The tree can become completely unbalanced, and the next operation'll be really slow. The amortized time of any operation (time for m operations / m) is always guaranteed, however, to be O(lg n).
The obvious advantage of splay trees is that they often give you just what you want out of a balanced tree, with a fraction of the development cost. Another hidden advantage is that splay trees give you a little bit of temporal locality, however to really take advantage you'll probably be better off adding a 5-10 element cache to your data structure. One big disadvantage is that latency often does matter. Whenever you have an actual human on the other end of a query waiting for it to finish, latency will be important. As anyone who's optimized a process scheduler for interactive can testify, people notice large delays more than they notice lots of small delays. The tasks with very low latency will seem a hell of a lot faster than tasks with high latency, even if they end up taking longer. As always, picking the proper data structure requires understanding of what the underlying problem really is.