and a bunch of ideas that have been banging around in my head for awhile.
I have always been very leery of multithreaded programming. I understand the concepts well enough. And sometimes that's really and truly the only thing you can do. This is particularly true at the OS level where you really only have one ethernet card and you need to mediate access between several different things running at the same time that all share the same underlying OS.
But it seems to me that having several threads of execution share memory in user-space processes is a mixed bag, and possibly not really worth it. I have several different reasons for thinking this. And a few of those reasons arise directly out of experience with modern multiprocessor system design.
My first thought on encountering the idea of threads was that they were totally ridiculous to use on anything but a multiprocessor system if you had an adequate OS interface for dealing with asynchronous events. Processing in discrete event-sized chunks that voluntarily give up the processor when they're done often results in better throughput because the processor's context is not randomly whipped around by context switches. Locality is everything.
And it's an easier model to get things right in. You don't have to worry that your program might be randomly interrupted at any point and some random thing will come along and change the bit of stuff you're working with unless you and the other random thing have an explicit agreement not to do that.
The only time I think threads are even a little bit justified is when you're writing a library that needs to do significant processing but doesn't want to block the thing that called it while it gets the processing done. And even then I think there are better ways to do things.
Your library should either let the calling program handle the concurrency issue or provide explicit hooks to allow an event oriented framework to call into the library to accomplish the task in small, discrete chunks.
If the library takes the first option, the program could conceivably even fork (create a whole new process) to accomplish the task and provide the result to the original process on a pipe. If your task really takes that long, the extra overhead of writing the result onto a pipe should contribute to the overall time only very minimally.
If your library takes the second option, each chunk should take a fairly small amount of time so the calling program can be effectively interactive if it needs to be. This can be hard for some problems, so for those problems I suspect the fork and write the result back on a pipe option is likely better.
If the library is pausing to do IO, the only polite way to write it (IMNSHO) is to provide hooks to allow an event driven application to call back into your library so it can handle the IO. And your library should never block, but simply inform the calling application of which event will allow your library to continue.
It's even better if your library can provide a way for the calling application to do all the IO on behalf of the library, though insisting that the calling application do this is going to make your library really hard to use in the common case. This is so that the application can virtualize the IO if need be. This future proofs your library against a whole class of possible changes in the underlying system.
First, I'm going to define a hardware spectrum....
First, two terms... I realize that these categories have a somewhat fuzzy boundary.
Nodes that may or may not share the same overall goal that are talking to eachother so that each might achieve its own individual goal.
Cooperating nodes may be actively trying to subvert other nodes. They are cooperating only in the sense that they are taking the rules by which the other nodes communicate into account.
Each node has its own discrete state that makes sense in the context of its goal, but may not really make any sense when considered in the context of the overall state of the system. It would be quite questionable as to whether it made any sense at all to consider the overall state of the system comprising all cooperating nodes as being something coherent and describable.
Nodes that inherently and always share the same goal.
Nodes in a cluster doing a weather simulation fall into this category. You can consider the state of the weather simulation to be fully spread out among all the nodes. And while each node may have its own state that is discrete, those states really don't make sense unless they are considered within the context of all the nodes making up the shared state of the system.
Now that those two terms are defined, here are some more fuzzy categories.
Cooperating nodes separated by arbitrary amounts of latency.
The fact that nodes are merely cooperating and not communal is of the utmost importance here. Nodes must successfully cooperate while still making sure they aren't compromising their own goals. In order to do this they have to subject all incoming communications to a great deal of scrutiny and have sophisticated systems for dealing with 'gaming' and/or fraud by other nodes. This means that any kind of implicit sharing of state is almost certainly a horrible idea.
This is also one of the reasons (though not so explicitly stated in the referenced paper) that I think RPC is a bad idea.
Communal nodes that are potentially separated by more than 2 msecs of latency. There may be useful distinctions to make in terms of latency above 2 msecs. But I definitely think that anything above about 20 msecs is essentially the same from a system design standpoint.
One notable cutoff here is the range for playing games that are heavily based on human reaction times. The minimum reaction time for humans is about 90ms. This means that for latency to not provide so much noise that it swamps reaction time, you want at most 15ms of latency. The maximum physically possible radius in which this level of latency can be achieved is about 1300 miles, and in current practice is probably about 60 miles.
At about 150ms of latency it is so high that people notice in almost every interaction. The radius for this is 13500 miles, which is conveniently small enough that the entire Earth is covered. But in current practice, the radius for this is more like 3000 miles, which is only enough for a continent.
Of course, most systems in which human reaction times are a factor are also ones in which the nodes should be considered to be cooperating and not communal.
Communal nodes that are separated by a latency of between 100-200 µsecs and never more than 2 msecs.
Due to hard restrictions on the propagation of information through space, LANs will never be larger than about 180 miles in radius. With current technology a LAN the maximum size is probably around 2 miles.
Communal nodes that are separated by fewer than 200 µsecs of latency, but still more than 500-1500 nsecs. NUMA architectures are also distinguished from SMP in that though all of the state of the system is ostensibly able to be freely access by any node, state 'owned' by a node is significantly cheaper for that node to access than it is for any other node to access it. Latency is still an important part of the distinction though since current SMP systems do not actually conform to the SMP ideal of all nodes having equal access to the state of the system.
Some might say that I'm missing that in NUMA architectures one node can treat the memory of other nodes as part of the same memory space. My response to this is that with proper OS support you can use virtual memory to make this the case in most situations. You can even do this on a LAN, though it's of dubious usefulness with such high latency. But even on a LAN the latency of fetching swapped out memory from a hard-drive is much higher than the latency to fetch a memory block from another node.
On an interesting note, up until clock speeds went into the 100s of MHz most SMP architectures would be classified as NUMA by this definition.
The maximum physically possible size of a NUMA system is approximately a 17 mile radius. With current technology the maximum size is probably closer to 20 meters.
Communal nodes separated by less than 1500 nsecs of latency in which all state is able to be accessed equally by all nodes. This latter requirement isn't actually true of most current SMP systems because of the extensive use of caching.
The theoretical maximum size of an SMP system is about a 200 meter radius. With current technology the limit is about a 10 meter radius.
I think latency is the biggest difference here. The only reason you don't see NUMA architectures on a LAN is because the nodes on the LAN are too far apart in terms of latency to make algorithms implemented on top of LAN based NUMA even remotely efficient.
And latency is a hard difference that won't go away. It is based on a theoretical physical limitation of the universe for which there is no data at all that contradicts the theory. It is unlike our current limitations on processor speed in that the theoretically physically possible maximum processor speeds are still many orders of magnitude from current practice while the current theoretical minimum latencies are only 1-3 orders of magnitude from current practice. This means we can expect the disconnect between raw processor speed and latency to get worse over time rather than better.
The biggest point I had in making up that spectrum is that latency is the biggest distinction. But in each of those systems there is a core of state that is 'local' and has very little (L1 cache) or no (registers) latency associated with access. And in each of those systems latency can only be expected to become more of a factor as processor speeds increase. Even in an SMP system there are now instructions (memory barriers) that tell processors that supposedly shared memory state needs to be communicated to other processors in the system, and those instructions are considered very expensive to execute.
Therefor, after I carefully plot out these distinctions, I'm going to tell you that I think they are all essentially the same in terms of how you should program for them. The only distinction is between cooperating and communal nodes.
So I think it's likely better now, and in the future, to design your program to have a number of threads of execution that each have their own local state. If they need to share any state, that sharing should be sparingly and explicitly communicated to other threads.
The reason is that all methods of implicit sharing are going to introduce potentially unpredictable delays into your program, even if you're sharing in an SMP environment. Additionally implicit sharing requires laboriously instrumenting your code with expensive instructions to explicitly declare when you need some sort of exclusive access to something.
And since your programs share so little anyway, it makes little sense for them to share their entire space of addressable memory with each other. It's better for them to share only well defined sections, or communicate through the OS using pipes or other forms of RPC.
I would really like Unix developers to add these system facilities to support this model...
First, a sort of reverse mmap that allows you to turn a set of pages into a file descriptor that you can than hand to another process via the existing mechanism for passing file descriptors between processes. Then the receiving process can mmap the memory itself. This allows processes to relatively easily create a shared memory channel to communicate on in order to bypass the few extraneous copies between kernel memory and userspace inherent in using pipes. The OS could also provide a communication channel (like a UNIX domain socket) where it promised that if you wrote whole memory pages that those pages would be marked as COW pages and the receiving process would read them by basically having its page table updated if it read them into a page aligned data area. If you wanted to make sure they were never copied you could simply munmap them after you wrote them.
Secondly, fully supported asynchronous IO. This is sort of implemented, but seems to be generally an afterthought. It's obviously existed and been very important for serial and network device IO. But it needs to work for disk IO as well and be considered so important that nobody would even think shipping a kernel for which it didn't even vaguely acceptable. It would be really nice if disk IO could also be implemented underneath by mapping pages directly and using COW if things were magically aligned as well.
The existence of these facilities would provide even further motivation for library developers to allow the IO of their libraries to be 'virtualized'. This is because then their library could take advantage of details of how the application could communicate with other processes that only the application developer would know because the application developer is aware of all the environmental details for that particular application.