Architecture problem... - Journal of Omnifarious
Oct. 20th, 2011
03:43 pm - Architecture problem...
I used to have a really good idea of what the architecture of a system that had to respond to multiple different possible sources of input or other reasons to do things (such as some interval of time expiring). My idea was basically to make everything purely event-driven and have big event loops at the heart of the program that dispatched events and got things done.
This solves the vexing problem of how to deal with all these asynchronous occurrences without incurring excessively complex synchronization logic. Nothing gives up control to process another event until the data structures its working with are in a consistent state.
But there are two problems with this model. One is old, and one is relatively new.
The old problem is that such event-driven systems typically exhibit inversion of control, and that makes them confusing and hard to follow. There are ways to structure your program to give people a lot of hints as to what's supposed to happen next when you give up control in the middle of an important operation only to recapture it again at some later point in time in a completely different function. But it's still not the easiest thing in the world to follow.
The 'new' problem is that silicon-based CPUs have not been getting especially faster recently. They've instead been getting more numerous. This is a fairly predictable result. CPUs have a clock. This clock needs to stay synchronized across the entire CPU. Once clock speeds exceed a certain frequency, the clock signal takes longer to propagate across the entire chip than the amount of time before the next pulse is supposed to happen. This means that in order to have an effectively faster CPU on a single chip you need to break it up into independent units that do not need to be strictly synchronized with each other. It's a state horizon problem.
But most programs are not designed to take advantage of several CPUs. If you want a program that's a cohesive whole, but still gets faster as the hardware advances, you need to break it up into several threads.
It seems like maybe it would be simple to do this with a program that had multiple threads. You just have multiple event loops. But then you end up with several interesting problems. How do you decide what things happen in which event loop? What happens if you need to have data shared between things running on different event loops? You run the risk of re-introducing the synchronization issues you avoided when you added the event loops in the first place, all with the cost of inversion of control. It doesn't seem worth it.
Additionally, if you have inter-thread synchronization, what happens if it takes awhile for the other thread to free up the resource you need? How do you prevent deadlocks? Most event systems do allow you to treat the release of a mutex or a semaphore as an event, so you can't just fold waiting for the mutex back into the system as just another event without doing some trick like spawning a thread that waits for the mutex and writes into some sort of IPC mechanism once it's acquired.
And splitting up your program into multiple event threads is not trivial either. How do you detect and prevent the case of one thread being overworked? Also, there is 'state kiting' to consider. Preferably you would prefer one CPU to be handling the same modifiable state for long periods of time. You want to avoid situations where first one CPU cache, then the next have to load up the contents of a particular memory region. Typically, each core will have its own cache. If for no reason other than efficient use of space, it would be good if each core had a disjoint set of memory locations in cache. And to avoid the latency of main memory access, it would be good if that set was relatively static. This means that a single event loop should be working with a fairly small and unchanging set of memory locations.
So simply having several threads, each with its own event loop seems a solution fraught with peril, and it seems like you're throwing away a lot of the advantages you went to an event driven system (with the unpleasant inversion of control side-effect) for in the first place.
So the original idea needs modification, or perhaps a completely new idea is needed.
One modification is embodied in the language Erlang. Erlang still has an event loop and inversion of control. You waiting for messages that come in on a queue. Any other loop can add messages to any queue it knows about. These messages are roughly analogous to events. But the messages themselves convey only information that is immutable. Since it is immutable, shared or not, no synchronization is required since it cannot change.
Erlang also encourages the creation of many such event loops, each of which does a very small job. Hopefully, no individual loop is too overloaded. Modern operating systems are adept at scheduling many jobs, and so this offloads the scheduling of all of these small tasks onto the OS.
I do not think Erlang does overly much to solve the locality of reference problem.
Another approach is the approach taken by the E programming language. It makes extensive use of a concept called a 'future' or 'promise'. This is a promise to deliver the result of some operation at some future point in time. It allows these promises to be chained, so you can build up an elaborate structure of dependencies between promises. In a sense, the programming language handles the inversion of control for you. You specify the program as if control flow were normal, but the language environment automatically launches as many concurrent requests as possible and suspends execution until the results are available.
It is possible to build a set of library-level tools in C++11 to implement this kind of thing somewhat transparently in that language.
I am unsure if there are any major tradeoffs in this approach. Certainly in C++ there is a great deal of implementation complexity, and that complexity cannot be completely hidden from the user as it is in E. I wonder if that implementation complexity introduces unacceptable overhead.
I also suspect that it may be difficult to debug programs that use this sort of a model. They appear to execute sequentially, but in truth they do not. It is possible, for example, to have two outstanding promises for bytes from a file descriptor, but which order those promises will be fulfilled in will not be readily apparent from reading the code. And error conditions can crop up at strange times and propagate to non-obvious places in the control flow of your program.
I also suspect this model will not exhibit the best locality of reference semantics. There will be a tendency to frequently spawn and join threads to handle asynchronous requests. And it will not be immediately apparent to the OS CPU scheduler which threads need to work with which memory objects. And this may lead to active state kiting between CPUs.
Also, those calls to create and destroy threads have a cost, even if that cost is fairly small, it's still likely much more expensive than acquiring an unowned mutex, and probably even more expensive than the call to wait for a file descriptor readability event or waiting for a briefly held mutex to become available.
Of course, it may be possible to implement all of this without creating many threads given a sufficiently clever runtime environment that implements its own queue that folds IO state and semaphore/mutex state events into a single queue. Such an environment would still need a lot of help from the application programmer though to divide up the application to maximize locality of reference within a single thread.
This is a fairly long ramble, and I'm still not really sure what the best approach is. I think I may try to set up some kind of 'smart queue'. This queue will have a priority queue of runnable tasks, and a queue of tasks that could potentially execute given a set of conditions. When a condition is met, the queue will be informed, and if that conditions enables one or more tasks to be run, these tasks will be added to the priority queue.
I envision that the primary thing on which the priority queue will be prioritized is length of time since the task was added to the 'wait for condition' list.
I can then write a C++11 library that will allow you to automatically turn any function that returns a promise into a function that uses these conditions to split up its execution. At least, if you use sufficient care in writing the function.
The conditions (since fulfilling a promise will be a possible condition) will have data associated with them. If this data involves shared mutable state, that will require a great deal of extra care.
Please reply to my original post on Dreamwidth. If you don't have an account there you can log in using your LiveJournal account. Just login using OpenID and give
http://<LJ account name>.livejournal.com/ as your OpenID. For example, for the LJ user rosencrantz319 that would be
http://rosencrantz319.livejournal.com/ as their OpenID.