I've been a bit inspired, but also sick - Journal of Omnifarious

Feb. 25th, 2008

08:57 pm - I've been a bit inspired, but also sick


I've been pretty sick, but in the few productive hours I've had today and over the weekend, I've been working on a project I was suddenly inspired to start, mostly because of the investigation required to write this post on Thrift, D-Bus and RPC.

I've been wanting a self-describing binary data format that's simpler and less ugly than ASN.1. I also wanted one for which an extremely fast parser could be built for fixed-length data structures known at compile time, either through the use of an IDML (which would be optional) to generate code, or through crazier techniques like template meta-programming in C++.

Anyway, I've finished a very preliminary first pass at a parser in Python. I've called the whole concept InBus, after D-Bus, from which it borrows a lot of ideas.

There are several things I would change about this parser, and a few features I would like to add.

Also, that parser is inefficient. Ideally it would build up the parse as one or more calls to Python's struct.unpack, each of which would unpack multiple values. Right now, though struct.unpack is used fairly heavily, it only ever unpacks one element per call (my fancy arbitrary-precision integer parser notwithstanding).
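
As a sketch of what that batching could look like (the record layout and field names here are hypothetical, just to illustrate the idea):

    import struct

    # Hypothetical fixed-length record: a 32-bit id, a 64-bit timestamp,
    # and a 16-bit flags field, all big-endian.
    RECORD = struct.Struct('>IQH')

    def parse_records(data):
        # One unpack_from call decodes all three fields at once,
        # instead of one struct.unpack call per field.
        for offset in range(0, len(data), RECORD.size):
            yield RECORD.unpack_from(data, offset)

    payload = struct.pack('>IQH', 7, 1203980220, 1) * 2
    print(list(parse_records(payload)))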

Lastly, right now the parser expects each value to be immediately preceded by its type spec. That's a design mistake. The type spec and the values should be handled separately, except for the 'variant' type.
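
To illustrate the distinction (a sketch only; these tag characters are made up and are not the actual InBus encoding): with the two separated, a type spec can be read once and then applied to a stream of bare values, rather than a tag being re-read before every value:

    import struct

    # Hypothetical tag characters: 'i' = 32-bit int, 'd' = 64-bit float.
    SPEC_TO_FMT = {'i': '>i', 'd': '>d'}

    def parse_against_spec(spec, data):
        values, offset = [], 0
        for tag in spec:
            fmt = SPEC_TO_FMT[tag]
            values.append(struct.unpack_from(fmt, data, offset)[0])
            offset += struct.calcsize(fmt)
        return values

    payload = struct.pack('>i', 42) + struct.pack('>d', 2.5)
    print(parse_against_spec('id', payload))  # [42, 2.5]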

This brings me to another couple of features I think would be interesting, but very tricky. It would be nice if 'variant' types could refer to a previously used 'variant' type, partly for efficiency reasons and partly for clarity, since one use of the variant type is to record information present in various derived classes of some base class. It would also enable encoding recursive data structures in a saner way. Additionally, it might be nice to be able to refer to previously decoded values in some way for data structures that don't fit into a strict tree.

One interesting thing: I think you could conceivably use this type tag system to describe IP packet layouts or other binary formats that already exist.
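
For instance, the 20-byte fixed portion of an IPv4 header is exactly the kind of fixed-length structure such a description would need to cover; in Python's struct notation it parses in a single call:

    import struct

    # Version/IHL, TOS, total length, identification, flags/fragment
    # offset, TTL, protocol, checksum, source and destination address.
    IPV4_HEADER = struct.Struct('!BBHHHBBH4s4s')

    def parse_ipv4(packet):
        (ver_ihl, tos, total_len, ident, flags_frag,
         ttl, proto, checksum, src, dst) = IPV4_HEADER.unpack_from(packet)
        return {'version': ver_ihl >> 4,
                'header_bytes': (ver_ihl & 0x0f) * 4,
                'total_len': total_len,
                'ttl': ttl,
                'protocol': proto,
                'src': '.'.join(str(b) for b in bytearray(src)),
                'dst': '.'.join(str(b) for b in bytearray(dst))}

    sample = struct.pack('!BBHHHBBH4s4s', 0x45, 0, 40, 1, 0, 64, 6, 0,
                         b'\x7f\x00\x00\x01', b'\x7f\x00\x00\x01')
    print(parse_ipv4(sample))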

Current Mood: contemplative

Comments:

From: lumiere
Date: February 26th, 2008 03:24 pm (UTC)
More for the wishlist in a general binary data-format specification language:
* rather than just fixed-length counts, I'd expect independent min and max lengths.
* similar for integer value ranges (which makes integers encodable as offsets)
* parameters to range/length limits to be either constants or expressions over references to earlier fields.
* time formats with precisions other than one second. I've worked with systems that communicate timestamps in microseconds; other applications might need less precision (e.g., days)
* string types labeled with the character set encoding

The spec should say something about endianness.

From: omnifarious
Date: February 26th, 2008 06:10 pm (UTC)
  • rather than just fixed-length counts, I'd expect independent min and max lengths.
    similar for integer value ranges (which makes integers encodable as offsets)

    I'm considering having ranges or allowed values as a feature of the IDML, but I don't think that level of detail belongs in the type tag list for describing the binary format. The same would go for length ranges. Fixed lengths do belong because they can aid greatly in constructing a fast parser and are also useful for describing fixed-length record formats.

    Encoding integer values as offsets is an interesting idea... I'll have to think about whether or not I believe the extra parser complexity is worth it.

  • time formats with precisions other than one second. I've worked with systems that communicate timestamps in microseconds; other applications might need less precision (e.g., days)

    I've definitely thought about this. But I think that resolutions finer than a second can be handled by adding a field for the microseconds after the second. For coarser resolutions, you can just do t = t - (t mod resolution). See the sketch after this list.

  • string types labeled with the character set encoding

    Maybe... I was sort of hoping to stamp out the encoding madness and just have one encoding and everybody could count on that one being it.
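
Here is the sketch referred to in the timestamp item above. The field widths (64-bit seconds, 32-bit microseconds) are my own choice, purely for illustration:

    import struct

    # Finer than a second: pack an extra microseconds field after the
    # seconds. Coarser than a second: truncate with t - (t mod resolution).
    def pack_fine(seconds, microseconds):
        assert 0 <= microseconds < 1000000
        return struct.pack('>qI', seconds, microseconds)

    def truncate(t, resolution):
        # e.g. resolution=86400 rounds a Unix timestamp down to the day
        return t - (t % resolution)

    print(len(pack_fine(1203980220, 123456)))  # 12 bytes
    print(truncate(1203980220, 86400))         # 1203897600, midnight UTC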

And yes, the standard should definitely say something about endianness. A super-efficient scheme would allow both sides to negotiate endianness. Most systems are little-endian nowadays, which I don't like, but it's still true. Network byte order is big-endian, which means little-endian systems have lots of byte swapping to do.

I should also explicitly state that signed integers are always in two's complement notation.
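
A quick illustration of both points, again in Python's struct notation:

    import struct

    # The same 32-bit two's complement value in network (big-endian)
    # byte order and in little-endian byte order.
    big = struct.pack('>i', -2)     # b'\xff\xff\xff\xfe'
    little = struct.pack('<i', -2)  # b'\xfe\xff\xff\xff'

    # A little-endian machine reading network byte order has to swap:
    assert struct.unpack('<i', big[::-1])[0] == -2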

From: lumiere
Date: February 27th, 2008 04:42 am (UTC)
So suppose I want to encode milliseconds since the Java UTC epoch. First I write an integer for the seconds, and then what? I write a 16-bit int for the milliseconds and waste 6 bits. Even if you provided bit-length rather than byte-length integers (and I'm not sure you should), this wastes space, though just a fraction of a bit.
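
The arithmetic behind "waste 6 bits": milliseconds within a second range over 0..999, so they need ceil(log2(1000)) = 10 bits, and a 16-bit field leaves 6 unused:

    import math

    bits_needed = int(math.ceil(math.log(1000, 2)))  # 10
    print(16 - bits_needed)                          # 6 bits wasted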

Defining a common time type on the one hand, but disallowing custom precisions on the other, makes the format wasteful.
From: omnifarious
Date: February 27th, 2008 07:31 am (UTC)

Well, it's meant as a timestamp. You could easily encode logging data by declaring a starting timestamp and then having every entry be an offset in milliseconds.

I couldn't allow smooth scaling in base 10 or 2 because it would annoy people; I would have to provide a small, fixed set of scales: microseconds, milliseconds, seconds, minutes, or hours. Smaller than microseconds and you might as well just start adding the extra bits for custom offsets from seconds. Larger than an hour and you start running into definition problems. How long a day is will change over time because, while a second is precisely defined in terms of cesium atom oscillations, a day is defined by how long it takes the Earth to rotate once, and that changes.

And I would want to define these all as offsets from a universal base time because I want them to be unambiguous time stamps. So there's a whole ton of bits wasted there. The most convenient universal base time is likely Jan 1, 1970.

Really, people who want to be careful about how much space they're using should record time events as offsets to some precision from their own base time and define that base time using the time type provided in the protocol. The main purpose of that type is to declare UTC as the timezone for all protocol messages and to declare a particular universal base time and provide a convenient way of transmitting the most common sort of timestamp.
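
A sketch of that pattern (the field widths are hypothetical): transmit one full base timestamp up front, then record each event as a small fixed-width offset from it.

    import struct

    def encode_log(base_seconds, event_times_ms):
        # One full 64-bit timestamp declares the base time...
        header = struct.pack('>q', base_seconds)
        base_ms = base_seconds * 1000
        # ...then each event costs only a 32-bit millisecond offset.
        offsets = [struct.pack('>I', t - base_ms) for t in event_times_ms]
        return header + b''.join(offsets)

    base = 1203980220
    events = [base * 1000 + 15, base * 1000 + 230, base * 1000 + 1450]
    print(len(encode_log(base, events)))  # 8 + 3*4 = 20 bytes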
