Randomize

Richard Tallent’s occasional blog

Optimizing Character Replacement in C#

For some time, I’ve had a utility function that cleans strings of Unicode characters that can cause issues for the users of the applications I manage.

It runs through a series of regular expression replacements that swap problematic Unicode characters for safer strings. In the examples below, I'm showing only the replacements related to single-quote characters. In the full code (which has 10 separate RegEx calls), some characters are simply stripped out, while others are replaced with two-character strings. Here's how the old code worked:

public static string FixBadUnicode1(string s) {
  // Regex is System.Text.RegularExpressions.Regex
  return Regex.Replace(s,
    "[\u2018\u2019\u201A\u201B\u0091\u0092\u2032\u2035\u05FE]",
    "'");
}

This performs well, but due to the number of times it is called, the total execution time is non-trivial, so I decided to test it against a character loop.

To perform my tests, I ran the replacements over a corpus of 38,000 representative strings. Most *don't* have problematic characters, so the overhead of *looking* for the characters outweighs the cost of the actual replacements. For timing, I used the native QueryPerformanceCounter method (called via P/Invoke), which provides a higher-resolution counter than the .NET library offers.
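For reference, a minimal P/Invoke wrapper for the counter looks something like this (the class and method names here are my own, not from the production code):

```csharp
using System.Runtime.InteropServices;

static class HiResTimer {
    [DllImport("kernel32.dll")]
    private static extern bool QueryPerformanceCounter(out long value);

    [DllImport("kernel32.dll")]
    private static extern bool QueryPerformanceFrequency(out long value);

    // Current counter value, in counter-specific ticks.
    public static long Ticks() {
        long t;
        QueryPerformanceCounter(out t);
        return t;
    }

    // Convert a tick delta to seconds using the counter frequency.
    public static double ToSeconds(long ticks) {
        long freq;
        QueryPerformanceFrequency(out freq);
        return ticks / (double)freq;
    }
}
```

Timing a run is then just a matter of capturing Ticks() before and after and subtracting.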

The baseline (RegEx) code ran consistently in around 20 million counter cycles, or about 9 seconds.

My recent experience has been that Dictionary usually beats switch{} when there are more than a handful of cases, so I decided to create a static Dictionary to look up the problem characters. Here’s the first iteration:

public static string FixUnicodeChars(string s) {
    var mappers = UnicodeMapper;
    var sb = new StringBuilder(s.Length);
    string fix = null;
    foreach(var c in s) {
        if(mappers.TryGetValue(c, out fix)) {
            if(fix!=null) sb.Append(fix);
        } else {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

If the “fix” string is null (i.e., I just want to strip the character), it skips the StringBuilder.Append() call.

The dictionary UnicodeMapper looks like this (again, only the single-quote portion):

private static Dictionary<char, string> _unicodeMapper;
private static readonly object _mapperLock = new object();

private static Dictionary<char, string> UnicodeMapper {
    get {
        if(_unicodeMapper == null) {
            lock(_mapperLock) {
                if(_unicodeMapper == null) {
                    var map = new Dictionary<char, string>();
                    // Fix all single quotes    [\u2018\u2019\u201A\u201B\u0091\u0092\u2032\u2035\u05FE]
                    map.Add('\u2018', "'");
                    map.Add('\u2019', "'");
                    map.Add('\u201A', "'");
                    map.Add('\u201B', "'");
                    map.Add('\u0091', "'");
                    map.Add('\u0092', "'");
                    map.Add('\u2032', "'");
                    map.Add('\u2035', "'");
                    map.Add('\u05FE', "'");
                    // Publish the field only after the map is fully populated,
                    // so no thread can observe a half-built dictionary.
                    _unicodeMapper = map;
                }
            }
        }
        return _unicodeMapper;
    }
}

This runs in around 600,000 cycles, a more than 30x improvement over RegEx. This surprised me quite a bit; I expected RegEx to use unsafe pointers and other tricks to loop through the strings more quickly than I could enumerate the characters. Part of this performance is due to the fact that Char.GetHashCode() is far cheaper than String.GetHashCode(), so the Dictionary lookups are very fast.

On reviewing the code, I determined that since most strings don’t have any replacements, it made no sense to call sb.ToString() unless a replacement actually occurred. I added a flag, with this result:

public static string FixUnicodeChars2(string s) {
    var mappers = UnicodeMapper;
    var sb = new StringBuilder(s.Length);
    string fix = null;
    var hadChanges = false;
    foreach(var c in s) {
        if(mappers.TryGetValue(c, out fix)) {
            if(fix!=null) sb.Append(fix);
            hadChanges = true;
        } else {
            sb.Append(c);
        }
    }
    return hadChanges ? sb.ToString() : s;
}

This provided an additional 12% improvement… not as much as the first change, but worth the effort.

Next, I noted that until the first replacement occurs, I don't need to instantiate the StringBuilder at all; I can just create it when the first replacement happens and populate it with the substring before the replacement:

public static string FixUnicodeChars3(string s) {
    var mappers = UnicodeMapper;
    StringBuilder sb = null;
    string fix = null;
    var hadChanges = false;
    int pos = 0;
    foreach(var c in s) {
        if(mappers.TryGetValue(c, out fix)) {
            if(sb == null) {
                sb = new StringBuilder(s.Length);
                if(pos > 0) sb.Append(s.Substring(0, pos));
                hadChanges = true;
            }
            if(fix!=null) sb.Append(fix);
        } else if(hadChanges) {
            sb.Append(c);
        } else {
            pos++;
        }
    }
    return hadChanges ? sb.ToString() : s;
}

This final version requires us to keep track of the current position in the string until the first replacement (if any) is found, at which point we create and start using the StringBuilder. We’re also able to avoid assigning hadChanges more than once. This provides an additional 7% improvement over version 2.

As an aside, I tested using a for{} loop (since it would give us the position for “free”), but it was actually about 10% slower than using foreach{}. I also tried removing the null check before the StringBuilder.Append() call, since Append() already performs a null check. However, the overhead of calling Append() unnecessarily resulted in a 2% increase in execution time.

Another tweak would be to unroll the foreach{} loop into two loops — one until the first match is found, the other for subsequent tests and appends. That would remove the need for the “sb==null” and “hadChanges” checks during each character iteration. But it would also require either using for{} or creating a substring for foreach{} to work on, and I’m reasonably confident that these would eat up any savings.
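For the curious, the unrolled version might look something like the sketch below. This is untested, the name FixUnicodeChars4 is my own, and I'd expect the extra bookkeeping to offset the per-character savings:

```csharp
public static string FixUnicodeChars4(string s) {
    var mappers = UnicodeMapper;
    int i = 0;
    // Loop 1: scan forward until the first problematic character, if any.
    while(i < s.Length && !mappers.ContainsKey(s[i])) i++;
    if(i == s.Length) return s; // nothing to fix; return the original string

    var sb = new StringBuilder(s.Length);
    if(i > 0) sb.Append(s, 0, i); // copy the clean prefix
    string fix;
    // Loop 2: no "sb==null" or "hadChanges" checks needed per character.
    for(; i < s.Length; i++) {
        if(mappers.TryGetValue(s[i], out fix)) {
            if(fix != null) sb.Append(fix);
        } else {
            sb.Append(s[i]);
        }
    }
    return sb.ToString();
}
```

Note that the first loop pays for a ContainsKey() lookup per character, so it trades one check for another.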

The final potential improvement I can think of would be to use char[] instead of StringBuilder. Unfortunately, since some replacements result in a longer string and I can’t predict the replacements that will occur, there would be some complexity in bounds-checking and reallocating the array that could eat up all of the savings. It’s tempting to try, but maybe some other day. For now, I’m quite satisfied with a 42x performance improvement. 🙂
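If I ever do try it, the bookkeeping would look roughly like this untested sketch (the name FixUnicodeChars5 is mine, and it leans on the fact that my replacement strings are at most two characters, so a buffer of twice the input length rarely needs to grow):

```csharp
public static string FixUnicodeChars5(string s) {
    var mappers = UnicodeMapper;
    char[] buf = null; // allocated lazily, like the StringBuilder in version 3
    int len = 0;       // number of characters written to buf
    string fix;
    for(int i = 0; i < s.Length; i++) {
        if(mappers.TryGetValue(s[i], out fix)) {
            if(buf == null) {
                buf = new char[s.Length * 2]; // headroom for two-char fixes
                s.CopyTo(0, buf, 0, i);       // copy the clean prefix
                len = i;
            }
            if(fix != null) {
                if(len + fix.Length > buf.Length) Array.Resize(ref buf, buf.Length * 2);
                fix.CopyTo(0, buf, len, fix.Length);
                len += fix.Length;
            }
        } else if(buf != null) {
            if(len == buf.Length) Array.Resize(ref buf, buf.Length * 2);
            buf[len++] = s[i];
        }
    }
    return buf == null ? s : new string(buf, 0, len);
}
```

Whether skipping StringBuilder's internal overhead beats the cost of the explicit bounds checks is exactly the question I haven't measured.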

