Professional C# 2008
by Christian Nagel, Bill Evjen, Jay Glynn, Morgan Skinner, Karli Watson
March 2008, Paperback


Excerpt from Professional C# 2008

Regular Expressions in C# System.Text.RegularExpressions

by Bill Evjen

Regular expressions are one of those small technology areas that are incredibly useful in a wide range of programs, yet rarely used by developers. You can think of regular expressions as a mini-programming language with one specific purpose: to locate substrings within a larger string. It is not a new technology; it originated in the Unix environment and is commonly used with the Perl programming language. Microsoft ported it onto Windows, where until recently it was used mostly with scripting languages. Today, however, regular expressions are supported by a number of .NET classes in the System.Text.RegularExpressions namespace. Regular expressions are also used in various parts of the .NET Framework itself; for instance, the ASP.NET validation server controls rely on them.

If you are not familiar with the regular expressions language, this section introduces both regular expressions and their related .NET classes. If you are already familiar with regular expressions, you will probably want to just skim through this section to pick out the references to the .NET base classes. You might like to know that the .NET regular expression engine is designed to be mostly compatible with Perl 5 regular expressions, although it has a few extra features.

Introduction to Regular Expressions

The regular expressions language is designed specifically for string processing. It contains two features:

  • A set of escape codes for identifying specific types of characters. You will be familiar with the use of the * character to represent any substring in DOS expressions. (For example, the DOS command Dir Re* lists the files with names beginning with Re.) Regular expressions use many sequences like this to represent items such as any one character, a word break, one optional character, and so on.
  • A system for grouping parts of substrings and intermediate results during a search operation.

With regular expressions, you can perform quite sophisticated and high-level operations on strings. For example, you can:

  • Identify (and perhaps either flag or remove) all repeated words in a string (for example, "The computer books books" to "The computer books")
  • Convert all words to title case (for example, "this is a Title" to "This Is A Title")
  • Convert all words longer than three characters to title case (for example, "this is a Title" to "This is a Title")
  • Ensure that sentences are properly capitalized
  • Separate the various elements of a URI (for example, given http://www.wrox.com, extract the protocol, computer name, file name, and so on)

Of course, all of these tasks can be performed in C# using the various methods on System.String and System.Text.StringBuilder. However, in some cases, this would require writing a fair amount of C# code. If you use regular expressions, this code can normally be compressed to just a couple of lines. Essentially, you instantiate a System.Text.RegularExpressions.Regex object (or, even simpler, invoke one of the static methods of the Regex class), pass it the string to be processed, and pass in a regular expression (a string containing the instructions in the regular expressions language), and you're done.

A regular expression string looks at first sight rather like a regular string, but interspersed with escape sequences and other characters that have a special meaning. For example, the sequence \b indicates the beginning or end of a word (a word boundary), so if you wanted to indicate you were looking for the characters th at the beginning of a word, you would search for the regular expression \bth (that is, the sequence word boundary-t-h). If you wanted to search for all occurrences of th at the end of a word, you would write th\b (the sequence t-h-word boundary). However, regular expressions are much more sophisticated than that and include, for example, facilities to store portions of text that are found in a search operation. This section merely scratches the surface of the power of regular expressions.
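As a rough illustration of storing a portion of the matched text and reusing it, the following sketch collapses repeated words such as "books books" into a single word. The class name RepeatedWordsSketch and the pattern are illustrative choices, not code from the book; the pattern handles only exact, case-sensitive repeats separated by whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

class RepeatedWordsSketch   // hypothetical name, for illustration only
{
    static void Main()
    {
        string input = "The computer books books";

        // \b(\w+)      a word, stored as group 1
        // (\s+\1\b)+   one or more repeats of that same word (\1),
        //              each preceded by whitespace
        string pattern = @"\b(\w+)(\s+\1\b)+";

        // Replace each run of repeated words with one copy ($1 is group 1).
        string result = Regex.Replace(input, pattern, "$1");

        Console.WriteLine(result);   // The computer books
    }
}
```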

For more on regular expressions, please review the book, Beginning Regular Expressions (ISBN 978-0-7645-7489-4).

Suppose your application needed to convert U.S. phone numbers to an international format. In the United States, phone numbers have the format 314-123-1234, which is often written as (314) 123-1234. When converting this national format to an international format, you have to add +1 (the country code of the United States) and put parentheses around the area code: +1 (314) 123-1234. As find-and-replace operations go, that's not too complicated, but it would still require some coding effort if you were restricted to the methods available on System.String. The regular expressions language allows you to construct a short string that achieves the same result.
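One way the phone number conversion might look is sketched below. It assumes the input uses only the hyphenated 314-123-1234 form; the class name PhoneFormatSketch and the sample sentence are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class PhoneFormatSketch   // hypothetical name, for illustration only
{
    static void Main()
    {
        string input = "Call 314-123-1234 before noon.";

        // Three groups of digits separated by hyphens; each group is
        // captured in parentheses so it can be reused in the replacement.
        string pattern = @"\b(\d{3})-(\d{3})-(\d{4})\b";

        // $1, $2, and $3 refer to the three captured groups.
        string result = Regex.Replace(input, pattern, "+1 ($1) $2-$3");

        Console.WriteLine(result);   // Call +1 (314) 123-1234 before noon.
    }
}
```

Numbers already written as (314) 123-1234 would need a second pattern or a more permissive one; that case is left out here to keep the sketch short.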

This section is intended only as a very simple example, so it concentrates on searching strings to identify certain substrings, not on modifying them.

The RegularExpressionsPlayaround Example

For the rest of this section, you develop a short example, called RegularExpressionsPlayaround, that illustrates some of the features of regular expressions and how to use the .NET regular expressions engine in C# by performing and displaying the results of some searches. The text you are going to use as your sample document is an introduction to a Wrox Press book on ASP.NET (Professional ASP.NET 3.5: in C# and VB, ISBN 9780470187579).

string Text = 
@"This comprehensive compendium provides a broad and thorough investigation of all 
aspects of programming with ASP.NET. Entirely revised and updated for the 3.5 
Release of .NET, this book will give you the information you need to master ASP.NET 
and build a dynamic, successful, enterprise Web application.";

Note: This code is valid C# code, despite all the line breaks. It nicely illustrates the utility of verbatim strings that are prefixed by the @ symbol.

This text is referred to as the input string. To get your bearings and get used to the regular expressions .NET classes, you start with a basic plain text search that does not feature any escape sequences or regular expression commands. Suppose that you want to find all occurrences of the string ion. This search string is referred to as the pattern. Using regular expressions and the Text variable declared previously, you can write this:

string Pattern = "ion";
MatchCollection Matches = Regex.Matches(Text, Pattern,
                                        RegexOptions.IgnoreCase |
                                        RegexOptions.ExplicitCapture);
foreach (Match NextMatch in Matches)
{
   Console.WriteLine(NextMatch.Index);
}

This code uses the static method Matches() of the Regex class in the System.Text.RegularExpressions namespace. This method takes as parameters some input text, a pattern, and a set of optional flags taken from the RegexOptions enumeration. In this case, you have specified that all searching should be case-insensitive. The other flag, ExplicitCapture, modifies the way that the match is collected in a way that, for your purposes, makes the search a bit more efficient; you see why later (it also has other uses that are not explored here). Matches() returns a reference to a MatchCollection object. A match is the technical term for the result of finding an instance of the pattern in the input text. It is represented by the class System.Text.RegularExpressions.Match. Therefore, you get back a MatchCollection that contains all the matches, each represented by a Match object. In the preceding code, you simply iterate over the collection and use the Index property of the Match class, which returns the index in the input text at which the match was found. Running this code results in three matches (in the words investigation, information, and application). The following table details some of the members of the RegexOptions enumeration.

Member Name              Description
CultureInvariant         Specifies that the culture of the string is ignored
ExplicitCapture          Modifies the way the match is collected so that the only valid captures are groups that are explicitly named
IgnoreCase               Ignores the case of the input string when matching
IgnorePatternWhitespace  Removes unescaped whitespace from the pattern and enables comments that are introduced with the pound or hash sign (#)
Multiline                Changes the characters ^ and $ so that they apply to the beginning and end of each line, not just to the beginning and end of the entire string
RightToLeft              Causes the search to proceed through the input string from right to left instead of the default left to right
Singleline               Specifies a single-line mode where the meaning of the dot (.) is changed to match every character, including the newline character
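The effect of a few of these options can be seen in a short sketch. The input strings and the class name RegexOptionsSketch are invented for illustration; the sample uses \n alone as its line break.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexOptionsSketch   // hypothetical name, for illustration only
{
    static void Main()
    {
        string input = "first line\nsecond line";

        // Without Multiline, ^ anchors only at the start of the whole string.
        Console.WriteLine(Regex.Matches(input, @"^\w+").Count);          // 1

        // With Multiline, ^ also anchors at the start of each line.
        Console.WriteLine(
            Regex.Matches(input, @"^\w+", RegexOptions.Multiline).Count); // 2

        // By default the dot does not match \n; Singleline lets it.
        Console.WriteLine(Regex.IsMatch(input, "line.second"));          // False
        Console.WriteLine(
            Regex.IsMatch(input, "line.second", RegexOptions.Singleline)); // True

        // IgnorePatternWhitespace allows spacing and # comments in the pattern.
        string commented = @"
            \b a      # a word starting with a
            \S* ion   # any run of non-whitespace, then ion
            \b        # end of the word";
        Match m = Regex.Match("Web application.", commented,
                              RegexOptions.IgnorePatternWhitespace);
        Console.WriteLine(m.Value);                                      // application
    }
}
```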

So far, the plain-text search has shown nothing really new apart from some .NET base classes. However, the power of regular expressions really comes from the pattern string. The reason is that the pattern string does not have to contain only plain text. As hinted earlier, it can also contain what are known as meta-characters, which are special characters that give commands, as well as escape sequences, which work in much the same way as C# escape sequences. They are characters preceded by a backslash (\) and have special meanings.

For example, suppose that you wanted to find words beginning with n. You could use the escape sequence \b, which indicates a word boundary (a word boundary is just a point where an alphanumeric character sits next to a whitespace character, a punctuation symbol, or the start or end of the text). You would write this:

string Pattern = @"\bn";
MatchCollection Matches = Regex.Matches(Text, Pattern,
                                        RegexOptions.IgnoreCase | 
                                        RegexOptions.ExplicitCapture);

Notice the @ character in front of the string. You want the \b to be passed to the .NET regular expressions engine at runtime — you don't want the backslash intercepted by a well-meaning C# compiler that thinks it's an escape sequence intended for itself! If you want to find words ending with the sequence ion, you write this:

string Pattern = @"ion\b";

If you want to find all words beginning with the letter a and ending with the sequence ion (which has as its only match the word application in the example), you will have to put a bit more thought into your code. You clearly need a pattern that begins with \ba and ends with ion\b, but what goes in the middle? You need to somehow tell the regular expressions engine that between the a and the ion there can be any number of characters as long as none of them are whitespace. In fact, the correct pattern looks like this:

string Pattern = @"\ba\S*ion\b";

Eventually you will get used to seeing weird sequences of characters like this when working with regular expressions. It actually works quite logically. The escape sequence \S indicates any character that is not a whitespace character. The * is called a quantifier. It means that the preceding character can be repeated any number of times, including zero times. The sequence \S* means any number of characters as long as they are not whitespace characters. The preceding pattern will, therefore, match any single word that begins with a and ends with ion.
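To see the three patterns side by side, the following sketch runs them against a short sentence. The sentence, the Show helper, and the class name are illustrative assumptions rather than part of the book's example.

```csharp
using System;
using System.Text.RegularExpressions;

class WordPatternSketch   // hypothetical name, for illustration only
{
    // Prints the pattern, the number of matches, and each matched string.
    static void Show(string text, string pattern)
    {
        MatchCollection matches =
            Regex.Matches(text, pattern, RegexOptions.IgnoreCase);
        Console.Write("{0} : {1} match(es):", pattern, matches.Count);
        foreach (Match m in matches)
        {
            Console.Write(" [{0}]", m.Value);
        }
        Console.WriteLine();
    }

    static void Main()
    {
        string text = "You need the information to build a Web application.";

        Show(text, @"\bn");          // \bn : 1 match(es): [n]
        Show(text, @"ion\b");        // ion\b : 2 match(es): [ion] [ion]
        Show(text, @"\ba\S*ion\b");  // \ba\S*ion\b : 1 match(es): [application]
    }
}
```

Note that \bn and ion\b match only the letters named in the pattern, not the whole word; only the third pattern spans the word from its first letter to its last.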

The following table lists some of the main special characters or escape sequences that you can use. It is not comprehensive, but a fuller list is available in the MSDN documentation.

Symbol  Meaning                                                 Example   Matches
^       Beginning of input text                                 ^B        B, but only if first character in text
$       End of input text                                       X$        X, but only if last character in text
.       Any single character except the newline character (\n)  i.ation   isation, ization
*       Preceding character may be repeated 0 or more times     ra*t      rt, rat, raat, raaat, and so on
+       Preceding character may be repeated 1 or more times     ra+t      rat, raat, raaat, and so on (but not rt)
?       Preceding character may be repeated 0 or 1 times        ra?t      rt and rat only
\s      Any whitespace character                                \sa       [space]a, \ta, \na (\t and \n have the same meanings as in C#)
\S      Any character that isn't whitespace                     \SF       aF, rF, cF, but not \tF
\b      Word boundary                                           ion\b     Any word ending in ion
\B      Any position that isn't a word boundary                 \BX\B     Any X in the middle of a word
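The quantifier rows of the table can be checked with a small sketch like the one below. The input string, the Show helper, and the class name are chosen purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class QuantifierSketch   // hypothetical name, for illustration only
{
    // Prints the pattern followed by every string it matches in the input.
    static void Show(string input, string pattern)
    {
        Console.Write(pattern + " ->");
        foreach (Match m in Regex.Matches(input, pattern))
        {
            Console.Write(" " + m.Value);
        }
        Console.WriteLine();
    }

    static void Main()
    {
        string input = "rt rat raat raaat";

        Show(input, "ra*t");   // ra*t -> rt rat raat raaat
        Show(input, "ra+t");   // ra+t -> rat raat raaat
        Show(input, "ra?t");   // ra?t -> rt rat

        // The dot stands for any one character other than \n.
        Show("isation ization", "i.ation");   // i.ation -> isation ization
    }
}
```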

If you want to search for one of the meta-characters, you can do so by escaping the corresponding character with a backslash. For example, . (a single period) means any single character other than the newline character, whereas \. means a dot.

You can request a match that contains alternative characters by enclosing them in square brackets. For example, [1c] means one character that can be either 1 or c. If you wanted to search for any occurrence of the words map or man, you would use the sequence ma[np]. Note that no separator is needed between the alternatives: inside square brackets the | character is treated as a literal, so [n|p] would also match the | character itself. Within the square brackets, you can also indicate a range, for example [a-z] to indicate any single lowercase letter, [A-E] to indicate any uppercase letter between A and E (including the letters A and E themselves), or [0-9] to represent a single digit. If you want to search for an integer (that is, a sequence that contains only the characters 0 through 9), you could write [0-9]+ (note the use of the + character to indicate there must be at least one such digit, but there may be more than one, so this would match 9, 83, 854, and so on).
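A brief sketch of escaping and of square-bracket alternatives follows. The input strings, the Show helper, and the class name are assumptions made for the sake of the example.

```csharp
using System;
using System.Text.RegularExpressions;

class CharacterClassSketch   // hypothetical name, for illustration only
{
    // Prints the pattern followed by every string it matches in the input.
    static void Show(string input, string pattern)
    {
        Console.Write(pattern + " ->");
        foreach (Match m in Regex.Matches(input, pattern))
        {
            Console.Write(" " + m.Value);
        }
        Console.WriteLine();
    }

    static void Main()
    {
        // An unescaped dot matches any character; \. matches only a dot.
        Show("version 3.5, build 345", "3.5");     // 3.5 -> 3.5 345
        Show("version 3.5, build 345", @"3\.5");   // 3\.5 -> 3.5

        // Alternatives and ranges inside square brackets.
        Show("map man mat", "ma[np]");             // ma[np] -> map man
        Show("9, 83, and 854", "[0-9]+");          // [0-9]+ -> 9 83 854

        // Inside brackets, | is an ordinary character, not "or".
        Show("a|b", "[1|c]");                      // [1|c] -> |
    }
}
```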

Displaying Results

In this section, you code the RegularExpressionsPlayaround example, so you can get a feel for how the regular expressions work.

The core of the example is a method called WriteMatches(), which writes out all the matches from a MatchCollection in a more detailed format. For each match, it displays the index of where the match was found in the input string, the string of the match, and a slightly longer string, which consists of the match plus up to ten surrounding characters from the input text: up to five characters before the match and up to five afterward. (It is fewer than five characters if the match occurred within five characters of the beginning or end of the input text.) In other words, a match on the word dynamic in the input text quoted earlier would display ld a dynamic, suc (five characters before and after the match), but a match on the final word application would display Web application. (only one character after the match), because after that you get to the end of the string. This longer string lets you see more clearly where the regular expression locates the match:

static void WriteMatches(string text, MatchCollection matches)
{
   Console.WriteLine("Original text was: \n\n" + text + "\n");
   Console.WriteLine("No. of matches: " + matches.Count);
   foreach (Match nextMatch in matches)
   {
      int index = nextMatch.Index;
      string result = nextMatch.Value;   // the text that was matched

      // Show up to five characters of context on either side of the match,
      // clamped so the window never runs past the start or end of the text.
      int charsBefore = (index < 5) ? index : 5;
      int fromEnd = text.Length - index - result.Length;
      int charsAfter = (fromEnd < 5) ? fromEnd : 5;
      int charsToDisplay = charsBefore + charsAfter + result.Length;

      Console.WriteLine("Index: {0}, \tString: {1}, \t{2}",
         index, result,
         text.Substring(index - charsBefore, charsToDisplay));
   }
}

The bulk of the processing in this method is devoted to the logic of figuring out how many characters in the longer substring it can display without overrunning the beginning or end of the input text. Note that you use another property on the Match object, Value, which contains the string identified for the match. Other than that, RegularExpressionsPlayaround simply contains a number of methods with names like Find1, Find2, and so on, which perform some of the searches based on the examples in this section. For example, Find2 looks for any string that contains a at the beginning of a word:

static void Find2()
{
   string text = @"This comprehensive compendium provides a broad and thorough 
     investigation of all aspects of programming with ASP.NET. Entirely revised and 
     updated for the 3.5 Release of .NET, this book will give you the information    
     you need to master ASP.NET and build a dynamic, successful, enterprise Web 
     application.";
   string pattern = @"\ba";
   MatchCollection matches = Regex.Matches(text, pattern, 
     RegexOptions.IgnoreCase);
   WriteMatches(text, matches);
}

Along with this comes a simple Main() method that you can edit to select one of the Find<n>() methods:

static void Main()
{
   Find1();
   Console.ReadLine();
}

The code also needs to make use of the RegularExpressions namespace:

using System;
using System.Text.RegularExpressions;

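The Find1() listing is not reproduced in this excerpt, but its single match on application is consistent with the \ba\S*ion\b pattern discussed earlier. Running the example with Find1() selected in Main() gives these results: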

RegularExpressionsPlayaround
Original text was:


This comprehensive compendium provides a broad and thorough investigation of all 
aspects of programming with ASP.NET. Entirely revised and updated for the 3.5 
Release of .NET, this book will give you the information you need to master ASP.NET 
and build a dynamic, successful, enterprise Web application.


No. of matches: 1
Index: 291,     String: application,     Web application.

This article is excerpted from Chapter 8, "Strings and Regular Expressions," of the upcoming Professional C# 2008 (Wrox, March-2008, ISBN: 978-0-470-19137-8). Bill Evjen (St. Louis, MO) is one of the most active proponents of the .NET technologies. He has been involved with .NET since 2000 and has since become the founder of the International .NET Association, representing more than 500,000 members worldwide. In addition to working in the .NET world, Bill is a Technical Director serving in the office of the Chief Scientist for the international news and financial services company Reuters. Bill is the lead co-author of the upcoming (March 2008) Professional C# 2008. Other related articles of interest by Bill and his co-authors Christian Nagel, Scott Hanselman, and Devin Rader include Windows Presentation Foundation (WPF) Data Binding with C# 2005, Generics, Connecting to Oracle or Access from ASP.NET 2.0, Using the ASP.NET 2.0 SQL Server Cache Dependency, and ASP.NET 2.0 FileUpload Server Control.