Assigning typo-safe club member IDs

(2012-09-22)

Our local computer club NoName e.V. has about 20 members. Since we introduced membership fees to pay for server and domain costs a while ago, we have to deal with bank transfers.

To automate confirmation and reminder mails about the payment situation of each member, it’d be helpful to have a unique identifier for each member so that we can build a script which can reliable match a bank transfer to a member (without using fuzzy matching on the name).

Of course, being a computer club, we want to do it the best way possible, that is, we want to use some redundancy to ensure that the script still works even when somebody makes a typing error (or handwriting error for written bank transfers, though I suppose nobody uses them anymore) in the transfer subject.

First of all, let’s see which letters and numbers we can use in a bank transfer. Banks normally allow A-Z and 0-9. Whether umlauts are allowed is bank-specific, so we can’t use them. Note that bank transfer subjects are in all upper-case, so our IDs cannot be case-sensitive either. After ruling out symbols which are prone to being confused with each other, we are left with the following symbols:

A C D E H K L P T W X Y 3 4 7 9

Since that are 16 symbols, we can use 4 bits to represent them (so 24 = 16 possibilities). Considering that we have about 20 members, with three 4-bit symbols we can have 163 = 4096 different IDs, so that should be enough :-). A nice side effect is that this directly corresponds to the hexadecimal number system, which will be useful in the code below.

An acceptable length for the bank transfer subject is 6 symbols plus a static prefix, let’s say MF-xx-yy-zz (MF means membership fees). Note that adding dashes to point out that the ID consists of 6 symbols should lead to fewer forgotten symbols. That leaves us with 3 symbols plus 3 symbols of redundancy. But how do we actually encode the redundant information?

Reed-Solomon?

The first idea was to use Reed-Solomon error correction. This worked, but it’s relatively hard to implement or understand (and simple solutions should be preferred to complex ones). Also, Reed-Solomon only allows you to correct ⌊t/2⌋ symbols when you have t redundant symbols. In our case, that is just 1 symbol which we can correct.

Levenshtein distance

A better approach is to consider the whole number of possible IDs (166 = 16777216) and then chose IDs with a big distance to each other. When the member makes a typo in the subject, it should still be nearer to the real ID than to any other ID.

A good way to measure the distance between two pieces of text is the Levenshtein distance (or Damerau-Levenshtein distance). As an example, the Levenshtein distance between "kitten" and "sitting" is 3 ('k' to 's', 'e' to 'i', add 'g').

When using 6 characters, there are only a few strings which have a Levenshtein distance of 6 between each other. There are a lot more when you only require a Levenshtein distance of 5. As an example, consider X3-EH-PT and 9X-EK-7C. Only the 'E' is in the same place, the other 5 characters need to be edited.

Generating IDs with a specific Levenshtein distance

A simple way to generate IDs with a specific Levenshtein distance is to use a hash function like SHA-1. Hash functions feature the avalanche effect, meaning that even small changes in the input should lead to big changes in the output. That sounds pretty good, since we want to have a lot of changes between ID 1 and ID 2. And indeed, with the following Perl code, you can easily generate enough IDs for our purpose:

use Digest::SHA qw(sha1_hex);
use Text::Levenshtein::Damerau qw(edistance);
use v5.10;

my @okay = qw(A C D E H K L P T W X Y 3 4 7 9);
my @existing;

# Generate a 6-character identifier by using the first 6 characters of the
# SHA1-hash of the argument. Since hash functions generate a very different
# output for slightly different inputs, this will quickly result in
# identifiers with a big-enough levenshtein-damerau-distance.
sub generate_num {
    join('', map { $okay[hex($_)] } split //, substr(sha1_hex($_[0]), 0, 6));
}

# Returns the minimum levenshtein-damerau-distance to between the given
# identifier and all existing identifiers.
sub min_distance {
    my ($cur) = @_;
    my $min = length($cur);
    for my $other (@existing) {
        my $dist = edistance($cur, $other);
        $min = $dist if $dist < $min;
    }
    return $min;
}

# Initial seed, should be unique to each club to avoid overlapping IDs.
my $src = 2342;

for (1 .. 20) {
    # Search until we have a distance of at least 5 to all other identifiers.
    $src++ while min_distance(generate_num($src)) < 5;
    my $identifier = generate_num($src);
    push @existing, $identifier;
    say "member ID $_: $identifier";
}

Matching IDs

Now, matching IDs is simple: You just calculate the Levenshtein distance of the input to all existing identifiers and then sort ascendingly. If the difference between the first and second result is big enough, we have a clear winner:

# Original: Y7XYPL
my $input = 'Y1XPL';

my ($first, $second) =
    sort { $a->[0] <=> $b->[0] }
    map  { [ edistance($input, $_), $_ ] }
    @existing;
if ($first->[0] == $second->[0]) {
    say "Input ambiguous, could be " . $first->[1] . " or " . $second->[1];
    exit;
}
say "Corrected $input to " . $first->[1];

In comparison to Reed-Solomon, we can fix any two errors guaranteed (typos and erasures) using this technique, and the chance that we will fix even more is high.