New approach to CVS changesets - Journal of Omnifarious

Aug. 14th, 2007

11:15 pm - New approach to CVS changesets

I am quite pleased with a new approach I've started working on for converting a CVS repository to a set of changesets. I believe it is somewhat novel, and I think it will yield better results than most others. I'm calling my project cvs2cg (CVS to Change Graph) and this is a link to the repository for cvs2cg.

There are two basic assumptions at the heart of it. One is more important than the other.

The first (less important) assumption is that, since CVS is client/server based, dates are monotonically increasing and largely consistent. They don't mysteriously leap back in time for a new revision.

The second (more important) assumption is that CVS version numbers form a meaningful tree on an individual file basis. This means that version 1.2 is definitely the 'child' of version 1.1 in all cases. Well, except where 1.1 is what I call a 'fake' revision put there for bookkeeping purposes, but fake revisions are easy to detect because of the limited number of situations in which they occur. It also means that revision 1.3 is a child of 1.2 and 1.2.2.1 is a child of 1.2. 1.3 cannot be a fake revision, but 1.2.2.1 might be, in which case 1.2.2.2 is actually the root of a new revision ancestry.
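To make the parent rule concrete, here is a minimal Python sketch of it. This is not code from cvs2cg; the names `parse_revision`, `parent_of` and `build_tree` are made up for illustration. It does not try to detect fake revisions, and it treats any `x.1` trunk revision as a root.

```python
def parse_revision(text):
    """Turn a CVS revision string such as '1.2.2.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def parent_of(revision):
    """Return the parent implied by the revision number alone, or None.

    Assumption: fake bookkeeping revisions are NOT handled here.
    """
    *prefix, last = revision
    if last > 1:
        # 1.3 -> 1.2, and 1.2.2.2 -> 1.2.2.1
        return (*prefix, last - 1)
    if len(revision) > 2:
        # First revision on a branch: 1.2.2.1 -> 1.2 (the branch point)
        return revision[:-2]
    # Something like 1.1: treated as the root of the file's history
    return None


def build_tree(revision_strings):
    """Map every revision of one file to its parent (None for roots)."""
    known = {parse_revision(text) for text in revision_strings}
    tree = {}
    for revision in sorted(known):
        parent = parent_of(revision)
        # A parent missing from the file also makes this revision a root.
        tree[revision] = parent if parent in known else None
    return tree
```

Calling `build_tree(["1.1", "1.2", "1.3", "1.2.2.1", "1.2.2.2"])` gives one parent per revision, with both `(1, 3)` and `(1, 2, 2, 1)` pointing back at `(1, 2)`.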

Anyway, these rules can be consistently applied to generate a revision changegraph much like you get with a modern DVCS such as Monotone or Mercurial on a per-file basis.

The second assumption can be used to test the first by asserting that every revision has a date after its ancestor's.
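A rough Python sketch of that check, assuming you already have a per-file mapping from each revision to its parent and another from each revision to its commit date. The function name and both data shapes are my own invention for illustration, not anything CVS or cvs2cg provides.

```python
from datetime import datetime


def find_date_violations(parents, dates):
    """Return (parent, child) pairs where the child is dated before its parent.

    parents: hypothetical dict mapping revision -> parent revision (or None)
    dates:   hypothetical dict mapping revision -> datetime of the commit
    """
    violations = []
    for child, parent in parents.items():
        if parent is not None and dates[child] < dates[parent]:
            violations.append((parent, child))
    return violations


# Example data for one file; 1.3 claims to be older than its parent 1.2.
parents = {"1.1": None, "1.2": "1.1", "1.3": "1.2"}
dates = {
    "1.1": datetime(2007, 8, 1, 12, 0),
    "1.2": datetime(2007, 8, 2, 9, 30),
    "1.3": datetime(2007, 8, 1, 18, 0),
}
bad_pairs = find_date_violations(parents, dates)
```

An empty result means the date assumption holds for that file. Anything else is a place where either the dates or the guessed ancestry need a closer look.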

After you have these trees on a per-file basis and know that your assumptions about dates are correct, you can use a merge algorithm to merge together all of the per-file trees into one big tree for the entire repository. You will use the first assumption in combination with mysteriously coinciding authors and log messages to accomplish this. This will allow you to make much shrewder guesses about which file revisions are part of the same global repository revision. And it will also allow you to recover from classification errors more easily.
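Here is a small Python sketch of just the guessing part: grouping file revisions that share an author and log message and sit close together in time. It is not the merge algorithm itself and it ignores the per-file trees entirely. The `FileRevision` class, the `group_candidates` function and the five minute window are all assumptions I picked for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FileRevision:
    path: str
    revision: str
    author: str
    log: str
    date: datetime


def group_candidates(revisions, window=timedelta(minutes=5)):
    """Group file revisions that plausibly belong to one repository revision.

    Assumption: the five minute window is an arbitrary illustrative value.
    """
    open_groups = {}  # (author, log) -> most recent group with that key
    groups = []
    for rev in sorted(revisions, key=lambda r: r.date):
        key = (rev.author, rev.log)
        group = open_groups.get(key)
        same_commit = (
            group is not None
            # Dates only move forward, so compare with the latest member.
            and rev.date - group[-1].date <= window
            # One commit cannot contain two revisions of the same file.
            and all(member.path != rev.path for member in group)
        )
        if not same_commit:
            group = []
            groups.append(group)
            open_groups[key] = group
        group.append(rev)
    return groups


start = datetime(2007, 8, 14, 23, 0, 0)
example = [
    FileRevision("a.c", "1.2", "alice", "fix parser", start),
    FileRevision("b.c", "1.5", "alice", "fix parser", start + timedelta(seconds=2)),
    FileRevision("a.c", "1.3", "alice", "fix parser", start + timedelta(seconds=10)),
    FileRevision("c.c", "1.1", "bob", "add tests", start + timedelta(seconds=3)),
]
candidate_groups = group_candidates(example)
```

The same-file check is one place where knowing the per-file history helps: two revisions of `a.c` cannot be part of the same repository revision, no matter how well the author, log message and dates line up.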

I think most current CVS converters try to discover changesets that are the same by comparing dates, authors and changelog messages first and then try to build up a change ancestry afterwards. I think this is done because finding file revisions that are part of the same repository revision is seen as the more vexing and pressing problem.

In reality, that problem is not nearly so important as making sure your revision graph is consistent and reasonably accurate.

Current Mood: accomplished