Monday, January 15, 2007

Convention is better than cure

In a pragmatic world, programmers working in a team adopt a common coding convention because it reduces the amount of mental acrobatics that members of the team have to perform when maintaining each other's code. In an ideal world however, it doesn't matter what coding style you adopt, because it doesn't make a tangible difference to maintaining the code base. Everyone just mindlessly uses whatever convention was drilled into him by well-meaning but short-sighted professors, with no regard for the nuances of good style.

It appears therefore that I'm living in an ideal world - which ironically, is far from ideal. I was the only one working on the project for the first year of its development, and so naturally all the code exhibited a single style - bar minor differences where my style changed over the course of that year (I was still developing my C++ knowledge to a large extent), and even those minor differences I was dilligent about reconciling back to The One Style.

In that first year, I managed to write a large portion of the system - which I'd inherited in the crudest of forms (a couple of hundred lines of shaky (C++)-- code) from a colleague who has since gone on to hopefully better things. However, about 18 months ago, my department finished it's other major project and the entire crew was put full force onto my project, and I had to contend with three other people maintaining my beautiful code.

What made matters worse was that my boss was busy implementing a separate piece of the system (the money-maker component, you could say) from a Delphi spike implementation (almost line for line) - basically independently of the application it was going to have to fit into. I didn't think much of it at the time, but looking back, I'd take continuous integration over the nightmare that resulted from that integration any day - the interface between the two components still feels like a house of cards. His coding conventions were obviously different to mine, but because we weren't even working in the same repository, let alone the same solution, neither of us realised the headache it was going to cause.

So now we're at the point where the system is nearly two and a half years old, and convention is basically non-existent. The particular style element that's got a bee in my bonnet of late is the declaration of pointer variables, and specifically, whether the * should accompany the type, or the variable name. I'll give you an example:

  1. int* myInt = &someInt;
  2. int *myInt = &someOtherInt;

As silly as it sounds, this one character difference has caused a bunch of problems in our team and for our project. Obviously this convention clash hasn't caused bugs - semantically the two statements are equivalent, but we have had problems related to day-to-day feature addition and maintenance programming. For example, when I need to fix a bug in a class, and I want to browse subversion's history for the class's header file, I see that the file has about 20 different changesets and it takes about 10 minutes just to isolate which of those changesets don't alternatively have me changing the convention within the file to style 1 and one of my colleagues changing it to style 2.

This is the kind of crap that can derail your train of thought and kill your productivity for the rest of the day. What makes matters worse is that the "tidying up" is usually done by the programmer in exactly this situation - when they come across the file looking for a bug. This means, when the bug is fixed and the code is checked in, the "tidying up" code is hidden amongst all the bug-fixes, or more commonly and much more dangerously, the bug fix (which is important) is hidden amongst all the "tidying up" (which isn't) - that is, there's a bad signal to noise ratio.

I could go into a 3 page discussion on why style 1 is better than style 2 (which it clearly is), but let's just say for now that if I were asked to join a team working on a code-base with 250K lines of code, and every line used style 2, I'd hopefully be pragmatic (or extreme) about it and just adopt style 2 while working on that project - a common convention makes it easier for everybody.

Or alternatively, being the stick in the mud that I am, I'd write a little script to report all the places where the one instance was used, and I'd either manually go and change all of them (I'm also a sucker for punishment), or write another script to do that. In fact, that's just what I did do to our code today. I threw together a little regular expression (first using Windows' findstr command, but quickly switching to a ruby script) which would tell me all the places in the code where style 2 was used, so I could change them all to style 1.

What I should have done first I guess was write the script and run it for both combinations - so that I could figure out which style was used more. That's what I should have done. What I actually did was assume that my preferred style was more widespread (given that myself and two of the other three maintainers use it exclusively, and the fact that I'd written so much of the code in the first place), and so I jumped straight in and started changing the bad style to the good.

To cut a long story short, about six hours (this is all in my spare time, how sad is that?) and a number of exceptional cases(1) later, my ruby script was reporting that there were no more occurrences of the offending pattern. Feeling very satisfied with myself, I integrated the ruby script into our nant build script and set the build to fail if a violation of the convention occurs again. Luckily, I'd had the foresight to make these changes in a separate working copy of the code, and so I still had the original code to refer to. I made some modifications to the script and ran it to count how many instances there were of style 1 and style 2 on the original code, and the results very gratifying. Approximately 1800 instances of style 1, and 300 instances of style 2 - which just proved my suspicions nicely.

So all that's left to do is check the code into the repository tomorrow when I get back to work, and also the rather distasteful task of informing my team that there is now a strict convention regarding the placement of the * when declaring pointer types, and any instances of going against the convention will cause the build to fail (I believe there's a joke aboyt aryan variables and my coding-naziness hiding in there).

I've toyed with the idea of rather hooking the script into subversion's pre-commit hooks, so that no bad code even gets into the repository, and the checks don't add onto our already straining build time, but unfortunately I haven't had the time to play with such an idea. I can see great benefits though if a coding convention is enforced at this level across all our projects in all the languages we use (mostly C# lately), so perhaps I'll spend some more spare time over the coming weeks identifying little violations like this to build up a comprehensive library of checks and integrating them one by one. I've already installed a pre-commit hook to check that comments always accompany checkins, but even that just gets subverted by people typing in crappy, useless comments.

Thankfully though, it doesn't appear that the script is going to have any major effect on our build-time, which Riaan and I reduced by 15 minutes to 30 minutes total last week (I know, I know, 10 minute build :( ). The script runs on my laptop in about 1.6 seconds, so I think we're OK. For now, I'll just have to wait and see whether this has a positive or negative influence - and the next item on my hitlist is curly braces - both ones missing from single-line conditionals, and ones hiding on the right hand side of the expression, instead of on their own line below. Hopefully that won't meet with too much resistance either.

1. The exception cases included instances where a variable or iterator was being dereferenced, or where multiplication was taking place, or where a function call in a dll was declared.

Wednesday, January 3, 2007

The key to fixing problems quickly

In an ideal world, problems are solved instantaneously, or don't even arise in the first place. Here in the real world though, the key to fixing problems quickly is having enough information. I'm particularly frustrated with developers who, when an error occurs and they need to inform the user, come up with a really useless error message, something about as useful as

"error MSB3152: The install location for prerequisites has not been set to 'component vendor's web site' and the file 'DotNetFX\instmsia.exe' in item '.NET Framework 2.0' can not be located on disk. See Help for more information."

See help for more information.

To cap it all off, the online help is about as useful as matches to an astronaut. It actually says: "To correct this error - Determine whether the file exists on disk". The file does indeed exist on disk. It exists on disks all over the world. It even exists on both disks in the machine in question. It just doesn't appear to exist in the place where the program is looking for it.

I wouldn't need to see help for more information if the stupid program would just tell me where it was looking for the files. This is happening on my build server - the machine I'm setting up to monitor our subversion repository and republish this particular program every time someone checks something into the source tree. Up until now, this has been a manual process triggered by my colleague Karin using an existing Nant script which I threw together a couple of months ago. What this basically means is that all the hard work has been done already, I just need to hook up the monitor part (CruiseControl.NET) to the source control part (Subversion) and the build automation part (NAnt) and the ClickOnce technolgy (MSBuild). But somewhere along the line, the build machine is configured differently to Karin's machine, and MSBuild can't find the prerequisite files.

No problem, easy to fix, right? I'll just copy the files off Karin's machine onto the build server, presumably to the same path (they're just install files for the .Net Framework 2.0 after all), and away we go. Not so. Apparently there's some other place on this particular machine that those files need to be for MSBuild to pick them up - which in theory I'm fine with. There could be any one of a number of reasons for this - the source files and the MSBuild files are on different drive letters, Windows Server 2003 vs. Windows XP, etc, and I don't really care where the files are stored on the build server, as long as I get my automated build published every time someone changes something.

What does bug me is that when MSBuild fails inside of my NAnt task, it fails with that wonderful error message that I copy/pasted above. This is literally a 5 minute fix if they just put all the locations where they search for the files into the error message. Then, all I have to do is go and paste the files into one of those locations and I'm good to go. Or, at least then I can start to hunt around and see where those locations are stored and try and figure out why they're different on the build server and Karin's machine. Instead, I've spent the last two hours moving files all over my hard drives, trying to get MSBuild and ClickOnce to pick them up, to no avail - and I'm still no closer to a solution than I was when I started.

So my advice to programmers: Don't be scared to put as much information as you have into error messages - especially stack traces if you have them - but any other useful information as well. At least if you provide a decent message, then a competent user (which I know probably only accounts for about 3% of all your users) can at least do basic troubleshooting himself, without having to either:

  • Trouble your tech support with a problem that they will probably be no better equipped to solve than he is (because tech support also might not know where this particular configuration is looking for files).
  • Spend hour after frustrating hour scouring the web for solutions - this is NOT a great way to build customer relations and loyalty.

As an aside (and proof that I'm not overly anti-Microsoft), the system we're currently replacing (that we wrote) has an error message which pops up when something goes wrong on login and it looks like this:

"INCORRECT PASSWORD NO RIGHTS OR DATE"

This is an exact quote, even down to SHOUTING AT OUR USERS in ALL-CAPS. What this excellent piece of information basically means is either you typed the password in incorrectly, you don't have rights to send in a login message, or the date you passed through in your login message doesn't correspond with the server's date. Why all three error conditions are reported together I'll never know, but clearly providing too much information is almost as bad as providing too little (at least the problem can be accurately troubleshooted).

Every time this error comes up in our test environment, we need to open the management system and check the password, check the rights and then log into the server (which is Novell, not Windows) check that the server's date is the same as the client's date. Why not, as the developer who wrote this code, just take 5 minutes out of your busy schedule to return three different error codes (don't even get me started on error codes vs. exceptions {I'm pro-exceptions}, but this is an old C system), for the three different errors? That simple, once off 5 minute task would have saved 5 minutes of troubleshooting for probably 200 logins over the course of the last 2 years, which equates to around 800 minutes or almost 14 hours of time saved.

I attribute such techniques partly to laziness, but also partly to something I come across all the time - thinking strictly in terms of very short-term goals. "I have to get this login routine finished today" as opposed to "This needs to be as easy to use and troubleshoot as possible, because every single person who logs into the system in its entire lifetime is going to have to pass these checks, and if their login fails, it's going to frustrate every single one of them."

However, a discussion of short-term goals - of thinking too narrowly - is a discussion for another day.