strcmp4humans

March 2002

This is not one of the super-technical articles on the site, and that's on purpose.

You see, this article is about one of those problems that no one in the computer industry really wants to address, and it's annoying the heck out of me.

Evil Machines Part 1

Have you ever noticed how you name some files:

1.jpg
2.jpg
3.jpg
...
10.jpg
11.jpg

And when you ask a computer to sort them, it does this?

1.jpg
10.jpg
11.jpg
2.jpg
3.jpg
...
9.jpg

So of course, being frustrated, you rename all the files:

0000001.jpg
0000002.jpg
0000003.jpg
...
0000010.jpg
0000011.jpg

Which is just ugly, but of course you put up with it, because everyone puts up with it.
And I'm asking you why? Why really?
I mean, we're smart people here, why?

Wasn't that cathartic? And now that we're all upset together, how about let's fix it?

So, the way we can fix it is you take my free code, use it in your projects, print it out and give it to your programmer friends, and tell me any glaring bugs you find. And then we'll all be nice happy people again.

Squashing The Bug

What's the culprit? Well, everybody calls the C library's strcmp, a badly abused function, and that is what's behind this mess.

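And strcmp is well-intentioned. It even knows how to compare simple strings. But the creators of strcmp assumed that our deepest desire was to do a binary-level compare of two pieces of raw data, one byte at a time. That is why "10.jpg" lands before "2.jpg": the character '1' is smaller than the character '2', and strcmp never looks at the number as a whole. In reality, well, that's not what we want, because we invented our own language, among other reasons.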
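If you'd like to watch the machine misbehave, here is a minimal sketch that sorts the file names from the top of the article using nothing but strcmp. The helper name sortWithStrcmp is made up for this illustration; only strcmp and std::sort are real library calls.

```cpp
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not from the article): sorts names the way most
// tools do, by handing strcmp's byte-by-byte verdict to std::sort.
std::vector<std::string> sortWithStrcmp(std::vector<std::string> names)
{
	std::sort(names.begin(), names.end(),
		[](const std::string &x, const std::string &y) {
			// strcmp compares raw bytes, so '1' < '2' decides
			// the order before the rest of "10" is ever seen.
			return std::strcmp(x.c_str(), y.c_str()) < 0;
		});
	return names;
}
```

Feed it 1.jpg, 2.jpg, 3.jpg, 10.jpg and 11.jpg and it hands back 1.jpg, 10.jpg, 11.jpg, 2.jpg, 3.jpg.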

So, anyway, what we do now is define some rules, The Cardinal Rules of Comparing Strings:

  1. Numbers are numbers, not bits.
  2. Case doesn't matter. People type random Case.
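  3. Numbers are bigger than words.

Then we just write the code:

#include <stddef.h>	// for NULL

// ASCII-only lowercase, so the result does not depend on the locale.
inline char tlower(char b)
{
	if (b >= 'A' && b <= 'Z') return b - 'A' + 'a';
	return b;
}

inline bool isnum(char b)
{
	return b >= '0' && b <= '9';
}

// Reads the run of digits starting at *a and leaves a on the last digit,
// so the caller's ++a steps past the whole number.
// Assumes the value fits in an int; a very long run of digits overflows.
inline int parsenum(const char *&a)
{
	int result = *a - '0';
	++a;

	while (isnum(*a)) {
		result *= 10;
		result += *a - '0';
		++a;
	}

	--a;
	return result;
}

// Returns -1, 0 or 1. Takes const pointers so string literals work too.
inline int StringCompare(const char *a, const char *b)
{
	if (a == b) return 0;

	if (a == NULL) return -1;
	if (b == NULL) return 1;

	while (*a && *b) {

		int a0, b0;	// will contain either a number or a letter

		// Rule 3: adding 256 puts every number above every char value.
		if (isnum(*a)) {
			a0 = parsenum(a) + 256;
		} else {
			a0 = tlower(*a);
		}
		if (isnum(*b)) {
			b0 = parsenum(b) + 256;
		} else {
			b0 = tlower(*b);
		}

		if (a0 < b0) return -1;
		if (a0 > b0) return 1;

		++a;
		++b;
	}

	// One string ran out first: the shorter one sorts first.
	if (*a) return 1;
	if (*b) return -1;

	return 0;
}

And now, will you please stop typing 00000008 when you really want to type 8? C'mon, I know you want to.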
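To see the three rules earn their keep, here is a rough sketch that sorts a list of file names with them. So that the snippet compiles by itself, it carries a condensed restatement of the comparison written against std::string rather than the article's StringCompare; nextToken, naturalCompare and sortForHumans are made-up names, and it assumes every run of digits is short enough to fit in a long.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Reads one token starting at s[i] and advances i past it.
// A run of digits becomes its value plus 256 (numbers beat letters);
// anything else becomes a single lowercased character.
// Assumption: digit runs are short enough to fit in a long.
static long nextToken(const std::string &s, std::size_t &i)
{
	if (std::isdigit(static_cast<unsigned char>(s[i]))) {
		long value = 0;
		while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) {
			value = value * 10 + (s[i] - '0');
			++i;
		}
		return value + 256;
	}
	return std::tolower(static_cast<unsigned char>(s[i++]));
}

// Returns -1, 0 or 1, following the three rules token by token.
int naturalCompare(const std::string &a, const std::string &b)
{
	std::size_t i = 0, j = 0;
	while (i < a.size() && j < b.size()) {
		long a0 = nextToken(a, i);
		long b0 = nextToken(b, j);
		if (a0 < b0) return -1;
		if (a0 > b0) return 1;
	}
	if (i < a.size()) return 1;	// a has leftovers, so it sorts last
	if (j < b.size()) return -1;
	return 0;
}

// Sorts a copy of the names in the order a person would expect.
std::vector<std::string> sortForHumans(std::vector<std::string> names)
{
	std::sort(names.begin(), names.end(),
		[](const std::string &x, const std::string &y) {
			return naturalCompare(x, y) < 0;
		});
	return names;
}
```

Hand sortForHumans the names 10.jpg, 2.jpg, 1.jpg, 11.jpg and 3.jpg and they come back as 1.jpg, 2.jpg, 3.jpg, 10.jpg, 11.jpg. As a bonus, naturalCompare("8.JPG", "00000008.jpg") returns 0, because the padding zeros and the random case no longer matter.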

And spread this code everywhere you find a programmer.


From Alexei Lebedev:

I would add this as the first line:

if (a == b) return 0;

First, it is a reasonable optimization when comparing a long string to itself. Second, it handles the case a==NULL && b==NULL correctly, making StringCompare antisymmetric (StringCompare(a,b) == -StringCompare(b,a)).
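Alexei's second point is easy to check. The sketch below keeps only the pointer checks and leaves the rest to strcmp, so it is a stand-in for StringCompare rather than the real thing; signOf, compareWithoutShortcut and compareWithShortcut are hypothetical names used only here.

```cpp
#include <cstring>

// Collapses any int to -1, 0 or 1.
int signOf(int v) { return (v > 0) - (v < 0); }

// Stand-in without the extra line: two NULLs compare as "less than",
// no matter which way round you pass them.
int compareWithoutShortcut(const char *a, const char *b)
{
	if (a == NULL) return -1;
	if (b == NULL) return 1;
	return signOf(std::strcmp(a, b));
}

// Stand-in with the extra line first.
int compareWithShortcut(const char *a, const char *b)
{
	if (a == b) return 0;	// same pointer, which includes NULL and NULL
	if (a == NULL) return -1;
	if (b == NULL) return 1;
	return signOf(std::strcmp(a, b));
}
```

Without the extra line, comparing NULL with NULL gives -1 in both directions, so swapping the arguments does not flip the sign. With it, the answer is 0 both ways.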

See also this article in the forums. (Not incorporated yet.)