A project I started recently centered on improving the performance of some existing code. My first inclination was to scan the code for obvious low-hanging fruit, and one of the first methods I hit upon was this:
public static string CleanString(string Text)
{
    // cleans a string for fuzzy matching:
    // trims white space, and removes numbers

    Text = Text.Trim();
    Text = Text.Replace("0", "");
    Text = Text.Replace("1", "");
    Text = Text.Replace("2", "");
    Text = Text.Replace("3", "");
    Text = Text.Replace("4", "");
    Text = Text.Replace("5", "");
    Text = Text.Replace("6", "");
    Text = Text.Replace("7", "");
    Text = Text.Replace("8", "");
    Text = Text.Replace("9", "");
    //Text = Text.Replace(".","");

    return Text;
}
This was pretty obviously bad. Strings are immutable, so every Trim and Replace call that actually changes something hands back a brand-new string; the author, in effect, allocated up to 11 new strings per call. This method is called quite often, so it seemed like a worthwhile place to optimize.
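To make the waste concrete, here is a quick illustration (mine, not from the project) of Replace handing back a fresh instance whenever it finds a match:

string before = " abc123 ";
string after = before.Replace("1", "");
// Replace did not modify 'before'; it allocated and returned a new string
Console.WriteLine(object.ReferenceEquals(before, after)); // False
Console.WriteLine(before); // still " abc123 "

My first inclination was to throw a RegEx at it, as follows: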
using System.Text.RegularExpressions;

/// <summary>
/// Summary description for Test.
/// </summary>
public sealed class Test
{
    private static Regex _scrubExpression = new Regex(@"\d*");

    /// <summary>
    /// Cleans a string: trims white space, and removes numbers.
    /// </summary>
    /// <param name="source">The string to clean.</param>
    /// <returns>The cleaned string.</returns>
    public static string CleanStringOptimized1(string source)
    {
        return _scrubExpression.Replace(source.Trim(), "");
    }
}
You'd expect that to just work. Think again. Over 100,000 runs, the RegEx code turned out to be nearly thirty times slower. Ok, so maybe the RegEx engine's overhead just kills this.
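As an aside, the pattern \d* matches a zero-length string at every position in the input, so Replace has work to do even where there are no digits at all. A variant using \d+ and RegexOptions.Compiled might behave quite differently; here is a sketch along those lines (hypothetical, never run against this benchmark):

using System.Text.RegularExpressions;

public sealed class TestCompiled // hypothetical variant, not the benchmarked code
{
    // \d+ matches only actual runs of digits, and RegexOptions.Compiled
    // trades one-time startup cost for faster matching on later calls
    private static Regex _scrubExpression = new Regex(@"\d+", RegexOptions.Compiled);

    public static string CleanStringCompiled(string source)
    {
        return _scrubExpression.Replace(source.Trim(), "");
    }
}

Fine. Let's try something simpler, like the good old StringBuilder: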
public static string CleanStringOptimized3(string source)
{
    StringBuilder output = new StringBuilder(source.Trim());

    output.Replace("0", "")
          .Replace("1", "")
          .Replace("2", "")
          .Replace("3", "")
          .Replace("4", "")
          .Replace("5", "")
          .Replace("6", "")
          .Replace("7", "")
          .Replace("8", "")
          .Replace("9", "");

    return output.ToString();
}
Yup. This was slower too; each of those ten Replace calls makes its own pass over the builder's buffer. So maybe the answer was to go really simple: a single pass over the characters, copying only the ones we want to keep. Let's try this:
public static string CleanStringOptimized2(string source)
{
    // space-filled buffer: any unused tail is stripped by the final Trim()
    char[] result = new string(' ', source.Length).ToCharArray();
    char[] sourceChars = source.ToCharArray();

    int resultIndex = 0;
    for (int count = 0; count < source.Length; count++)
    {
        char c = sourceChars[count];
        if (!char.IsDigit(c))
        {
            result[resultIndex] = c;
            resultIndex++;
        }
    }

    return new string(result).Trim();
}
Yes, you guessed it. This, too, runs slower than the original. In fact, the results look like this:
Original comparison took 359.375ms over 100000 runs.
Optimized 1 comparison took 10203.125ms over 100000 runs.
Optimized 2 comparison took 421.875ms over 100000 runs.
Optimized 3 comparison took 718.75ms over 100000 runs.
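For what it's worth, a leaner cut of the char-array idea (hypothetical, not one of the benchmarked methods) would trim first, size the buffer exactly, and build the result straight from the filled portion instead of space-padding and trimming again:

public static string CleanStringLean(string source)
{
    string trimmed = source.Trim();
    char[] result = new char[trimmed.Length];

    int resultIndex = 0;
    for (int i = 0; i < trimmed.Length; i++)
    {
        if (!char.IsDigit(trimmed[i]))
        {
            result[resultIndex++] = trimmed[i];
        }
    }

    // construct the string from only the characters we actually kept
    return new string(result, 0, resultIndex);
}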
If anyone has any insight into why my optimizations aren't optimizing, I would love to hear it! Incidentally, dropping down to IL revealed nothing out of the ordinary.
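In the meantime, anyone who wants to poke at this can time the methods with a harness along these lines (my sketch, assuming .NET 3.5+ for Func; the original test code isn't shown here), warming each method once so JIT compilation stays out of the loop:

using System;
using System.Diagnostics;

public static class Harness
{
    public static void Time(string label, Func<string, string> clean,
                            string input, int runs)
    {
        clean(input); // warm-up call so JIT compilation isn't timed

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            clean(input);
        }
        watch.Stop();

        Console.WriteLine("{0} comparison took {1}ms over {2} runs.",
                          label, watch.Elapsed.TotalMilliseconds, runs);
    }
}

// e.g. Harness.Time("Original", CleanString, " abc123 def456 ", 100000);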