So, I'm doing some protocol design for CAKE, and I'm wondering about an issue involving very long messages, verification and MAC.
I have a message format that would allow gigantically huge messages, on the order of a tebibyte in size. It's highly unlikely that there will ever be any messages that large. But I feel that in having limits of that nature, it's better to go for overkill in the extreme.
But I also recognize that since messages might be fairly large, it might also be that they are generated on the fly without any idea of how long they will eventually be. HTTP solves this problem with chunked-transfer encoding, and I have a very similar idea. But having this means that the length of the message can't be specified fully at the start.
In previous renditions of the protocol, I decided that the last message chunk was the last chunk less than 1000 bytes. I also had a MAC at the end of each chunk that included the chunk length field in the MAC. This prevented chunks from being maliciously reordered or modified. It also prevented the message from being maliciously shortened. The receiver could easily detect if any of these things had been done and reject the message and request that the sender send it again in the hopes of getting a good copy.
But this time I would also like parties in the middle to be able to collect all the chunks together. Basically if the message consists of a 50000 byte chunk and a 3000 byte chunk and a 5 byte chunk, I would like someone in the middle to be able to modify the message to be a single 53005 byte chunk.
But, I was thinking that it might be nice to have these little checks along the way that a message was still good that were provided by the original versions MACs at the end of the chunks. The only problem is if the length of the chunk isn't included in the MAC it's possible that someone in the middle could modify the message maliciously to be shorter without the receiver being able to detect this.
I can't think of a good way to preserve both properties, and I'm not sure which properties are the most important. So that's my question to all of you. Is there a good way to preserve both things? If there isn't, which is more important? Is it more important that the receiver be able to check each chunk as it's received so it can have a good running idea of the health of the message as it's coming in, or is it more important to allow intermediaries (who may be of the store it for minutes to days variety) to be able to coalesce chunks?
What's the goal of coalescing chunks?
If it's not good enough...what's a typical MAC, 8 bytes? And if you said that all non-final chunks had to be 16,384 bytes, then you could do something like using 5 bytes for both the offset and the length - mask off the top 26 bits for the offset, mask off the bottom 14 bits for the length. I guess that's not quite right, as you'd have trouble terminating messages that are a multiple of the chunk size; you might have to steal another bit from the MAC for a 'final' flag (leaving a 63-bit MAC). That's 13 bytes of headers in a 16,384 byte chunk. 0.08% overhead.
Actually, you could do even more if you always check MACs serially. Have it be a running total, not just of the one chunk. 64-bit MAC, 15-bit length, 1-bit terminal flag means 10 bytes overhead. At maximum chunk size of 32,768 bytes, that's 0.03%.
You could have stingy MAC on each chunk and a really strong one at the end.
Re: What's the goal of coalescing chunks?
I'm using the count type. So the per-message overhead for the length field is very small. The MAC OTOH is much larger, but your other idea of using much smaller MAC for the intermediate values is a really interesting one. That would solve the "The MAC can be used to make it seem that the message ends early." problem since the final MAC would be required to be a full 256 bits.
It isn't exactly to allow intermediaries to save overhead. It's more to allow intermediaries more flexibility in dealing with messages in whatever way is efficient for them.
And to answer your other question, my main concern is a DoS attack on the recipient. But really nothing I can do will makes this even a tiny bit harder. It's just better for a recipient to advertise a maximum message size that it can deal with without using too many resources.
Re: What's the goal of coalescing chunks?
Re: What's the goal of coalescing chunks?
Yeah, you're right. *thumps head* Sometimes you just get caught in stupid thought traps.
Re: What's the goal of coalescing chunks?
Oh, I just remembered the real reason... :-) I don't want people who use the protocol to be able to make assumptions about chunk lengths being preserved. If the chunk lengths are forced to be preserved by the nature of the protocol, then it would be very tempting to use that fact.
I think I will be using your solution with short MACs. I'll just use a 16 bit MAC as a health check along the way, and a full MAC at the end. The MACs will not include message lengths, but truncation will be prevented because an intermediary will not be able to extend a short MAC into a full MAC to use at the end.
I don't know whether or not the intermediate health checks are useful. But I feel better putting them in, and they shouldn't be that big an efficiency burden given that the message header will typically be more than a hundred bytes anyway.
Re: What's the goal of coalescing chunks?
I'm not sure which is more efficient. Seems like if you came up with an idea of the distribution of numbers used as input, you could calculate the expected value of the length. My coding theory is not good, but Benford's law suggests maybe a power-law distribution like the Pareto?
Efficiency aside, I like Scarab's elegance (simpler, no length limit, no values with two representations), and it's apparently common enough that Perl's pack and unpack support it natively. All three formats screw over the alignment that makes simple and fast hardware possible, but I guess that's just the nature of the game.
Apparently there's also ASN.1 PER, supposedly more efficient and complex, with an option to not even align to byte boundaries.
Re: What's the goal of coalescing chunks?
A quick Python program calculates with some distributions I found interesting:
Re: What's the goal of coalescing chunks?
I'm not overly concerned with efficiency beyond a certain point. And it looks like a CAKE count has a slight advantage there in most situations anyway.
ASN.1has a reputation for being horribly obscure and painful to use. I don't know enough about Scarab to make a judgement. I would like to use something that was relatively nice and commonly accepted. Especially if lots of parsers existed for it already. Thinking about it, this makes D-BUS vaguely attractive, though that seems to have more other baggage than I want.OTOH, I don't need the data format to be self-describing. Having type information encoded in the data isn't a negative if it doesn't impact efficiency, but I think it's largely superfluous for CAKE.
Re: What's the goal of coalescing chunks?
Also, your two responses were really helpful. :-) How did you find this post? How did you find my journal at all? :-)
Re: What's the goal of coalescing chunks?
What's the goal of running checks?
One approach would be to just not allow coalescing of MACs, even if you allow coalescing of chunks. The originator would generate a MAC for every N bytes of the message, with the MAC being computed over the data plus a header (start-offset and length of this chunk). This means the message could end up with arbitrarily many MACs, but the overhead can be kept below a fixed fraction of the total message size, so it still scales well. This is kind of how Bittorrent does it.
Intermediaries can still coalesce chunks, as long as the coalesced chunk still carries along with it the multiple MACs associated with it. Intermediaries can subdivide chunks, but the subdivided chunk is not checkable until the rest of that MAC's data has been collected.
For whole-message authentication, the originator could generate a final MAC over the whole message, or over the collection of all the message's MACs (or more generally over a pyramidal tree of M(AC)^n s), or the MACs could be chained (the implicit header hashed into each MAC would include the previous chunk's MAC as well as a flag indicating whether this is the last chunk). The tradeoff here is between allowing parallelization of message generation (maybe the message is a response from a distributed data store, and it would be nice to be able to generate each chunk independently, followed by computation of the final MAC) or allowing message streaming with O(1) storage (my CAKE-enabled webcam, for example, may not have an arbitrary amount of storage available to it).
On the third hand, maybe the right answer is to try to disentangle fragmentation-and-reassembly from the CAKE layer entirely (or at least from the integrity-protection mechanisms). CAKE messages could be nonfragmentable and limited to less than 2^32 bytes, but CAKE could define a standard protocol for communicating longer messages, much like IP datagrams are limited to 2^16 bytes but almost everybody uses TCP to handle longer messages.
Allowing messages to be transported in a bittorrenty sort of way is an interesting idea. I think the operative construct here would be a Merkle Hash Tree (http://open-content.net/specs/draft-jchapweske-thex-02.html#anchor2). This allows to you pick a fixed small block size and to allow arbitrary coalescing and splitting of message data on block-size boundaries. I believe that what a .torrent file largely consists of is all the hash data for just such a construct for a 64kibibyte blocksize.
It will be possible for recipients to advertise a maximum message size to the world. So forcing messages to be small and then be split up isn't really necessary. A recipient can make this happen by advertising a maximum message size that's fairly small.
But your ideas give me some other interesting ideas, which I'll try to detail in another reply. I want to catch a bus that is coming soon though.
Bittorrent doesn't use a hash tree. The file is divided into chunks (of arbitrary size, but it has to be uniform for any given .torrent) and the hashes of all the chunks are in the .torrent file.
Oh, it doesn't? I'm mildly disappointed. Those seemed so nice. :-)
There is a way to handle a hash tree where only the top level hash needs to be handed out centrally. Then any other node can send you data for a block plus at most logb n hashes where b is the block size and n is the total file size in order for you to verify the block it sent. And usually it's a lot less.
OpenID and links
I hate what LJ does to links posted by OpenID users.
Merkle Hash Trees
Even better link
Hash tree from Wikipedia.
My other idea that's related is to implement certain services in CAKE using a peculiar underlying model.
Anytime static data is sent back, it's never sent back within the CAKE protocol that's requested. All that's sent back is the hash of the data that was requested. Then a separate step is done to fetch the data the hash represents. This allows all kinds of arbitrary caching mechanisms to be created to handle things. It solves a couple of what I consider a major problem with the web. First that a server always has to be reachable. Secondly, that server has to have the bandwidth to send everybody the data they're asking for.
Always using the root hash of a Merkle tree hash allows a LOT more flexibility in how caching schemes might be implemented.
Hmm, at first glance it's missing one of the more important ideas which is to identify recipients first and foremost by a hash of their public key, and use various other names for them as either locations or pet names.
for single chunk messages, number-of-chunks is set to 1. Routers can concatenate messages by simply reading the final word, and traversing back up the message by the appropriate number of fixed length bytes (single chunk message tail size * number-of-chunks) to grab the "tailer", concatenate the message portion, and then concatenate the "tailer" portion, then increment the number-of-chunks field.
This could be done without sigs, but you could still have a man-in-the-middle attack that way, as size alone isn't going to save you. It decreases bandwidth requirements by 1 field length per concatenated message (in addition to gains by using bigger frames in the first place of course.)
Am I missing something here? Is the problem more complex than this?
The messages are datagrams, but it's unlikely that they will ever be sent directly over Ethernet. So the size of the frame of the medium they are sent with isn't very important. Each message would likely be a 'transactional unit' for whatever protocol is being implemented. For example, in the case of email each message would likely be a complete email message.
Now, message chunks are something completely different. For example in HTTP a web server will run a script that does not have an output length that's known in advance. So what the webserver does is capture the output of the script and periodically send the output its collected so far to the client. It prefixes this chunk with a length. There is a special tag (in HTTPs case, a 0 length chunk) indicating that no more chunks are to follow.
Someone else pointed out, there is no reason for intermediate nodes to be able to combine or split message chunks. It buys you very little in the way of storage or bandwidth efficiency, especially if you use a very compact encoding for the chunk lengths and intermediate hashes.
So, having all the lengths as trailers defeats the purpose in several ways. First, it requires that there be an EOM delimeter that's outside the protocol. Secondly, it is still vulnerable to message truncation, since a router can just drop trailing message chunks without invalidating things. Lastly it would be nice if the thing generating the message didn't have to keep track of all the chunks its sent so it can send a trailer with all that data.
Also, I will be using MACs, not signatures. This is because MACs can be spoofed by the recipient, and I explicitly want this. If you sign every single message you send then there is a log by which the person who gets your messages can later prove to the whole world that you sent them. One of the things about having a conversation that supposedly includes you outed on the Internet is that you can claim that you didn't necessarily say all that and that someone edited the conversation. With signatures instead of MACs, that can be proven to be a lie.