New approach to CVS changesets - Journal of Omnifarious
Aug. 14th, 2007
11:15 pm - New approach to CVS changesets
I am quite pleased with a new approach I've started working on for converting a CVS repository to a set of changesets. I believe it is somewhat novel, and I think it will yield better results than most others. I'm calling my project cvs2cg (CVS to Change Graph) and this is a link to the repository for cvs2cg.
There are two basic assumptions at the heart of it. One is more important than the other.
The first (less important) assumption is that, since CVS is client/server based, dates are monotonically increasing and largely consistent. They don't mysteriously leap back in time for a new revision.
The second (more important) assumption is that CVS version numbers form a meaningful tree on an individual file basis. This means that version 1.2 is definitely the 'child' of version 1.1 in all cases. Well, except where 1.1 is what I call a 'fake' revision put there for bookkeeping purposes, but fake revisions are easy to detect because of the limited number of situations in which they occur. It also means that revision 1.3 is a child of 1.2 and 1.2.2.1 is a child of 1.2. 1.3 cannot be a fake revision, but 1.2.2.1 might be, in which case 1.2.2.2 is actually the root of a new revision ancestry.
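As a rough illustration of how a parent can be read straight off a CVS revision number, here is a minimal Python sketch. The function names are made up for this example and are not taken from cvs2cg, and it deliberately ignores fake revisions, which would need the extra detection step described above.

```python
def parent_revision(rev):
    """Return the parent of a CVS revision number, or None if it has none.

    Hypothetical helper: works from the number alone, no fake-revision handling.
    """
    parts = [int(p) for p in rev.split(".")]
    if len(parts) < 2 or len(parts) % 2 != 0:
        raise ValueError("not a revision number: %r" % rev)
    if parts[-1] > 1:
        # Same line of development: 1.3 -> 1.2, 1.2.2.2 -> 1.2.2.1
        return ".".join(str(p) for p in parts[:-1] + [parts[-1] - 1])
    if len(parts) == 2:
        # 1.1 is a root. For N.1 with N > 1 the parent cannot be
        # worked out from the number alone, so it is treated as a root too.
        return None
    # First revision on a branch: 1.2.2.1 -> 1.2
    return ".".join(str(p) for p in parts[:-2])


def build_children(revs):
    """Map each revision to the list of its children among the known revisions."""
    known = set(revs)
    children = {}
    for rev in revs:
        parent = parent_revision(rev)
        if parent in known:
            children.setdefault(parent, []).append(rev)
    return children
```

For a file with revisions 1.1, 1.2, 1.3 and 1.2.2.1, `build_children` gives 1.2 two children: 1.3 on the trunk and 1.2.2.1 on the branch.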
The second assumption can be used to test the first by asserting that every change has a date after its ancestor's.
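A check like that could be sketched as follows. If dates really do increase monotonically, no revision should be dated earlier than its parent. The input layout (a dict from revision number to a parent/date pair) is an assumption made for this example only.

```python
from datetime import datetime


def find_date_violations(revisions):
    """Return (revision, parent) pairs where the revision is older than its parent.

    `revisions` maps a revision number to (parent number or None, commit datetime).
    This layout is hypothetical and chosen just for the sketch.
    """
    violations = []
    for rev, (parent, date) in revisions.items():
        if parent is None or parent not in revisions:
            continue  # roots have nothing to be compared against
        parent_date = revisions[parent][1]
        if date < parent_date:
            violations.append((rev, parent))
    return violations
```

An empty result for every file means the date assumption holds for that repository. Anything else is a list of places to look at by hand.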
After you have these trees on a per-file basis and know that your assumptions about dates are correct, you can use a merge algorithm to merge together all of the per-file trees into one big tree for the entire repository. You will use the first assumption in combination with mysteriously coinciding authors and log messages to accomplish this. This will allow you to make much shrewder guesses about which file revisions are part of the same global repository revision. And it will also allow you to recover from classification errors more easily.
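The matching part of that step might look roughly like the sketch below. It covers only the "coinciding authors and log messages" heuristic, not the tree merge itself. The `FileRev` record, the function name and the five-minute window are all assumptions for illustration. It also assumes that one repository revision never holds two revisions of the same file.

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Hypothetical record for one revision of one file.
FileRev = namedtuple("FileRev", "path rev author log date")


def group_candidates(file_revs, window=timedelta(minutes=5)):
    """Group file revisions that plausibly belong to one repository revision."""
    open_groups = {}  # (author, log) -> most recent group for that key
    groups = []
    for fr in sorted(file_revs, key=lambda r: r.date):
        key = (fr.author, fr.log)
        group = open_groups.get(key)
        if (group is None
                or fr.date - group[-1].date > window
                or any(g.path == fr.path for g in group)):
            # Start a new group: nothing to join, too long a gap, or the
            # file already has a revision in the current group.
            group = []
            groups.append(group)
            open_groups[key] = group
        group.append(fr)
    return groups
```

With the per-file trees already built, a grouping like this is only a first guess: a group that would put a revision before its own ancestor can be split again, which is where the easier recovery from classification errors comes from.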
I think most current CVS converters first try to discover which file revisions belong to the same changeset by comparing dates, authors and changelog messages, and only afterwards try to build up a change ancestry. I think this is done because finding file revisions that are part of the same repository revision is seen as the more vexing and pressing problem.
In reality, that problem is not nearly so important as making sure your revision graph is consistent and reasonably accurate.