Tuesday, November 12, 2013

Nullify the Whitespace

Way back in the old days of .NET 1.1 if you wanted to check whether a string was empty or null you had to perform two explicit checks:

if (input == null || input == "")

Then in .NET 2.0 Microsoft gave us IsNullOrEmpty, a static method on the System.String class, which combined the two checks and made your life that much easier.

if (string.IsNullOrEmpty(input))

But what if you also needed to treat a string made up of nothing but blanks as empty?  What about the dreaded whitespace "character"?  You had to implement your own solution that checked each character in the string to see if it qualified as a whitespace character (space, tab character, linefeed character, etc.).  This isn't exactly hard, but it always left a bad taste in my mouth.  I mean, how was I supposed to know what characters to look for?  What if they changed?  Where was I supposed to put this method?
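
For the record, a home-grown version might have looked something like the sketch below.  The names StringChecks and IsBlank are made up for illustration and are not part of the framework.  It leans on char.IsWhiteSpace so that the framework, not you, decides which characters count:

```csharp
using System;

public static class StringChecks
{
    // Hypothetical helper (not a framework method): true when the value is
    // null, empty, or made up entirely of whitespace characters.
    public static bool IsBlank(string value)
    {
        if (value == null)
            return true;

        foreach (char c in value)
        {
            // Let the framework decide what counts as whitespace instead of
            // maintaining our own list of characters.
            if (!char.IsWhiteSpace(c))
                return false;
        }

        // Either the string was empty or every character was whitespace.
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(IsBlank(null));      // True
        Console.WriteLine(IsBlank(""));        // True
        Console.WriteLine(IsBlank(" \t\r\n")); // True
        Console.WriteLine(IsBlank("  a"));     // False
    }
}
```

That still leaves the "where do I put it" question open, which is usually how a Utilities class gets born.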

In .NET 4.0, Microsoft gave us a solution: IsNullOrWhiteSpace.  This new method checks whether the string is empty (length of 0), null, or contains only whitespace characters (more on this later).  When I looked up this new feature originally I made a mental note along the lines of "performs IsNullOrEmpty, then checks if string is blank" and went on my way, using IsNullOrWhiteSpace wherever I had previously used IsNullOrEmpty.  Of course that made sense.  IsNullOrWhiteSpace must be better because it's newer, right?

The answer is yes and no.  The obvious point that I missed was these methods check for different things.  When you only care whether the string has a value, regardless of the contents of the string then you can safely use IsNullOrEmpty.  If you absolutely have to know whether the string contains nothing but whitespace then you'll want to use IsNullOrWhiteSpace.  You're probably asking yourself why you read this far to read something you already knew, and the answer to that is because I have performance metrics to show you!
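
To make the difference concrete, here is a small sketch that runs both methods over the same handful of inputs.  The class name and the sample strings are my own picks for illustration:

```csharp
using System;

public static class CompareChecks
{
    public static void Main()
    {
        // null, empty, spaces only, control whitespace only, and real content.
        string[] inputs = { null, "", "   ", "\t\r\n", " a " };

        foreach (string input in inputs)
        {
            Console.WriteLine(
                "IsNullOrEmpty: {0,-5}  IsNullOrWhiteSpace: {1,-5}",
                string.IsNullOrEmpty(input),
                string.IsNullOrWhiteSpace(input));
        }
    }
}
```

The two methods agree on null, "" and " a ".  They only disagree on the whitespace-only inputs, where IsNullOrEmpty returns false and IsNullOrWhiteSpace returns true.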

Both IsNullOrEmpty and IsNullOrWhiteSpace are static methods on System.String (not extension methods), so their code is pretty straightforward.  I got to wondering how each was implemented and found that they're pretty much what you'd expect:

public static bool IsNullOrEmpty(string value)
{
    if (value != null)
        return value.Length == 0;
    else
        return true;
}

public static bool IsNullOrWhiteSpace(string value)
{
    if (value == null)
        return true;
    for (int index = 0; index < value.Length; ++index)
    {
        if (!char.IsWhiteSpace(value[index]))
            return false;
    }
    return true;
}

public static bool IsWhiteSpace(char c)
{
    if (char.IsLatin1(c))
        return char.IsWhiteSpaceLatin1(c);
    else
        return CharUnicodeInfo.IsWhiteSpace(c);
}

private static bool IsWhiteSpaceLatin1(char c)
{
    return (int) c == 32 || (int) c >= 9 && (int) c <= 13 || ((int) c == 160 || (int) c == 133);
}

IsNullOrEmpty is about as simple as it gets.  It essentially just encapsulates what we were doing on our own before.  Now it's a static method that's part of the framework.

IsNullOrWhiteSpace is slightly more complex, but still pretty easy to understand.  If the argument is null, return true.  If not (there's some value in it), then iterate through each character in the string and check whether that character is a whitespace character.  If it isn't, return false.  If it is, keep going.  If the loop runs off the end of the string without finding anything else, return true.  Since the for loop starts at the 0 index and counts up, the method will exit as soon as a non-whitespace character is encountered.  This is what got me thinking of performance.
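
Here is a quick sketch of what that means in practice.  The class name and sample characters are my own choices; the character codes in the comments match the ones in IsWhiteSpaceLatin1 above, plus one character from outside the Latin-1 range:

```csharp
using System;

public static class WhitespaceExamples
{
    public static void Main()
    {
        // Characters handled by the Latin-1 fast path.
        Console.WriteLine(char.IsWhiteSpace(' '));       // True  (32, space)
        Console.WriteLine(char.IsWhiteSpace('\t'));      // True  (9, tab)
        Console.WriteLine(char.IsWhiteSpace('\n'));      // True  (10, line feed)
        Console.WriteLine(char.IsWhiteSpace('\u0085'));  // True  (133, next line)
        Console.WriteLine(char.IsWhiteSpace('\u00A0'));  // True  (160, no-break space)
        Console.WriteLine(char.IsWhiteSpace('a'));       // False

        // A character that falls through to the Unicode lookup.
        Console.WriteLine(char.IsWhiteSpace('\u2003'));  // True  (em space)

        // Same answer, very different amounts of work: the first call stops
        // at index 0, the second has to walk 100,000 spaces before it finds 'a'.
        string contentFirst = "a" + new string(' ', 100000);
        string contentLast = new string(' ', 100000) + "a";
        Console.WriteLine(string.IsNullOrWhiteSpace(contentFirst)); // False
        Console.WriteLine(string.IsNullOrWhiteSpace(contentLast));  // False
    }
}
```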

At first glance, this looks like it could end up being a monster.  Fortunately, that's only true if your input starts with a bunch of whitespace characters.  So I did what any diligent developer would do.  I coded it and ran a test:

var stopwatch = new Stopwatch();
 
var shortStringToCheck = new string('a', 100);
var longStringToCheck = new string('a', 100000);
var shortWhitespaceString = new string(' ', 100) + 'a';
var longWhitespaceString = new string(' ', 100000) + 'a';

IsNullOrEmpty test code:
stopwatch.Start();

for (var i = 1000000000; --i >= 0;)
{
    var isEmpty = string.IsNullOrEmpty("");
}

stopwatch.Stop();
Console.WriteLine("1 billion iterations of Null/empty string: " + stopwatch.ElapsedMilliseconds);

stopwatch.Reset();
stopwatch.Start();

for (var i = 1000000000; --i >= 0;)
{
    var isEmpty = string.IsNullOrEmpty(shortStringToCheck);
}

stopwatch.Stop();
Console.WriteLine("1 billion iterations of Null/100 character string: " + stopwatch.ElapsedMilliseconds);

stopwatch.Reset();
stopwatch.Start();

for (var i = 1000000000; --i >= 0;)
{
    var isEmpty = string.IsNullOrEmpty(longStringToCheck);
}

stopwatch.Stop();
Console.WriteLine("1 billion iterations of Null/100,000 character string: " + stopwatch.ElapsedMilliseconds);

IsNullOrWhiteSpace test code:
// Clear any time left on the stopwatch by the IsNullOrEmpty tests.
stopwatch.Reset();
stopwatch.Start();

for (var i = 1000000000; --i >= 0;)
{
    var isEmpty = string.IsNullOrWhiteSpace("");
}

stopwatch.Stop();
Console.WriteLine("1 billion iterations of Whitespace/empty string: " + stopwatch.ElapsedMilliseconds);

stopwatch.Reset();
stopwatch.Start();

for (var i = 1000000000; --i >= 0;)
{
    var isEmpty = string.IsNullOrWhiteSpace(shortStringToCheck);
}

stopwatch.Stop();
Console.WriteLine("1 billion iterations of Whitespace/100 character string: " +
                    stopwatch.ElapsedMilliseconds);

stopwatch.Reset();
stopwatch.Start();

for (var i = 1000000000; --i >= 0;)
{
    var isEmpty = string.IsNullOrWhiteSpace(longStringToCheck);
}

stopwatch.Stop();
Console.WriteLine("1 billion iterations of Whitespace/100,000 character string: " +
                    stopwatch.ElapsedMilliseconds);

stopwatch.Reset();
stopwatch.Start();

// The whitespace-heavy strings only get 1 million iterations, not 1 billion.
for (var i = 1000000; --i >= 0;)
{
    var isEmpty = string.IsNullOrWhiteSpace(shortWhitespaceString);
}

stopwatch.Stop();
Console.WriteLine("1 million iterations of Whitespace/100 character whitespace string: " +
                    stopwatch.ElapsedMilliseconds);

stopwatch.Reset();
stopwatch.Start();

for (var i = 1000000; --i >= 0;)
{
    var isEmpty = string.IsNullOrWhiteSpace(longWhitespaceString);
}

stopwatch.Stop();
Console.WriteLine("1 million iterations of Whitespace/100,000 character whitespace string: " + stopwatch.ElapsedMilliseconds);
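
If you'd rather not copy and paste the same Stopwatch dance for every case, the pattern could be folded into a small helper along these lines.  This is only a sketch: Timing and Measure are hypothetical names of mine, and calling the check through a delegate adds a little overhead to each iteration, so the numbers won't line up exactly with the inline loops above.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Hypothetical helper: runs the check the given number of times and
    // returns the elapsed wall-clock time in milliseconds.
    public static long Measure(int iterations, Func<bool> check)
    {
        var stopwatch = Stopwatch.StartNew();

        for (var i = 0; i < iterations; i++)
        {
            check();
        }

        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        var longWhitespaceString = new string(' ', 100000) + 'a';

        // A small iteration count so the sample finishes quickly.
        long elapsed = Measure(1000, () => string.IsNullOrWhiteSpace(longWhitespaceString));

        Console.WriteLine("1,000 iterations of Whitespace/100,000 character whitespace string: " + elapsed);
    }
}
```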

Results:

[The timing results shown here in the original post are not available.]

As you can see from the results, there can be a performance hit for using IsNullOrWhiteSpace when IsNullOrEmpty will suffice, depending on how much leading whitespace you have.  Keep in mind this test performed a million iterations of string.IsNullOrWhiteSpace on a string that contained 100,000 whitespace characters followed by the letter "a" and that still only took 5.7148 minutes.  That means that on average we were able to check a 100,001 character string in about 0.34 milliseconds.
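
The per-call average is easy to double-check with a few lines of arithmetic.  In this sketch the helper name is my own, and the inputs are the total time and iteration count quoted above:

```csharp
using System;

public static class AverageTime
{
    // Converts a total run time in minutes into an average per-call time
    // in milliseconds.
    public static double AverageMilliseconds(double totalMinutes, int iterations)
    {
        double totalMilliseconds = totalMinutes * 60.0 * 1000.0;
        return totalMilliseconds / iterations;
    }

    public static void Main()
    {
        // 5.7148 minutes spread over 1,000,000 calls: roughly 0.343 ms each.
        Console.WriteLine(AverageMilliseconds(5.7148, 1000000));
    }
}
```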

In the end, it's all about using the right method for the job.
