This article originally appeared in the Dec 1997 issue of The C/C++ Users Journal.

Filename Pattern Matching
by David R. Tribble

Regular expressions are a handy way to specify sets of strings, such as filenames. The code for matching regular expressions can be handy for many similar tasks.


Most operating systems that provide a command-line interface also provide the capability to specify multiple filenames using patterns (also known as filenames with wildcards). This is true of such operating systems as MS-DOS, Unix, VMS, and others. This article discusses the theory behind filename patterns and presents library functions that perform pattern matching like that provided by Unix.

Regular Expressions

Filename patterns are a special kind of regular expression. Regular expressions are used extensively in compiler technology, lexical analysis, and text editors. The theory of regular expressions originates in linguistics. A generalized regular expression is an expression, or pattern, that matches strings in a given language context. Filename patterns are a simplified kind of regular expression designed explicitly for matching filenames within the context of a file system or operating system. Filename patterns do not have the power or flexibility of full regular expressions, but they don't need to since they are used only to match filenames.

Filename patterns are composed of normal and special characters. Normal characters are letters, digits, and other characters that can be used in legitimate file names. Each normal character matches one and only one character - the character itself. For example, the normal character a in a pattern matches the letter a in a filename. Some operating systems (such as Unix) support case-sensitive filenames, so that the names abc, ABC, and Abc all refer to different files. Others (such as DOS and Windows) support case-insensitive filenames, so that the previous names all refer to the same file. A pattern composed entirely of normal characters is a special case since it matches only one filename, the same filename as the whole pattern. The empty string is also a special case, which matches only empty filename strings.

Special pattern characters are characters that have special meanings in the way they match characters in filenames. They may match several characters, or none, or only a selected set of characters. Or they may change the sense of the matching of the rest of the pattern. The functions described here are modeled after the pattern matching capabilities of Unix. Other operating systems, such as VMS, have a completely different syntax for specifying filename patterns, so they are not considered here. There are a few differences in the Unix and DOS/Windows implementations of these filename pattern matching functions. These differences are noted parenthetically.

The special pattern characters are:

    ?             Any
    *             Closure
    [abc]         Set
    [a-z]         Range set
    [!abc]        Excluded set
    \x            Quoted character (Unix only)
    `x            Quoted character (DOS and Windows only)
    !pat          Not
The Any pattern ? matches any single character (including dot).

The Closure pattern * matches zero or more characters (including dot).

The Set pattern [abc] matches a single character in the set {a, b, c}. On systems supporting case-insensitive filenames, this will also match any character in the set {A, B, C}. If the set contains the ] character, it must be quoted. If the set contains the hyphen character -, it must be the first character in the set, be quoted, or be specified as the range ---.

The Range pattern [a-z] matches a single character in the range a through z. On case-insensitive systems, this also matches any character in the range A through Z. The character before the hyphen must sort lexicographically before the character after the hyphen. Sets and ranges can be combined within the same set of brackets (e.g., the pattern [a-c123] matches any character in the set {a, b, c, 1, 2, 3}).

The Excluded set pattern [!abc] matches any single character not in the set {a, b, c}. Case-insensitive systems also exclude characters in the set {A, B, C}. If the set contains the hyphen character, it must immediately follow the ! character, be quoted, or be specified as the range ---.

The Not pattern !pat matches any filename that is not matched by pattern pat. (This is not supported by Unix but is added as an enhancement to the pattern-matching capabilities of the functions here.)

A quote character \ removes any special meaning from the next character. To match the quote character itself, it must be quoted \\. The backslash is used as the quote character just like Unix systems, but this character is problematic for DOS and Windows systems because they use it as a directory separator within pathnames. So for such systems, a different quote character, the grave accent `, is used.

Some patterns, such as *, will match empty filename strings. This is generally undesirable, so empty filenames are handled as a special case. It is assumed arbitrarily that empty filenames can be matched only by an empty pattern. A few patterns should be noted in particular:

    *        Matches any filename.
    !*       Does not match any filename.
    !!x      The same as x since the not patterns cancel out.
    *.*      Matches any filename with a dot in it.
Notice that dots have no special meaning and are treated just like other normal characters. This reflects the way that Unix and Windows 95 behave, but differs from the way DOS and Windows 3.1 behave.

Older releases of DOS and Windows 3.1 assumed that every filename contained one to eight characters, followed by a period, followed by up to three characters. (This is the so-called 8.3 naming rule.) The pattern *.* is actually expanded internally by DOS into the pattern ????????.???. Modern operating systems, such as Unix and Windows 95, support long filenames, which allow much longer names and do not consider dots to have any special meaning. In fact, almost any ASCII character can be used in a filename, including control characters. Most versions of Unix allow any eight-bit character as well. The only limitations on long filenames are the maximum size (which is typically 255 or more characters) and the disallowing of a few characters that have special meaning to the operating system. In particular, these characters may be disallowed in filenames:

    /        Directory (pathname) separator character
    \        Directory (pathname) separator character (DOS and Windows only)
    '\0'     Filename terminator character
Library Functions

Listing 1 shows the header file for a small set of functions that perform filename pattern matching. Listing 2 is the code for the functions that do the actual matching. (The remaining code, with test cases, is available on the CUJ ftp site. See p. 3 for downloading instructions.) The functions are written in C for maximum portability and are designed to operate on all flavors of Unix, DOS, and Windows operating systems.

To match a filename pattern against a candidate filename, the code attempts to match each subpattern in the pattern against individual characters in the candidate filename. Matching proceeds from the leftmost to the rightmost subpattern. If at any point a subpattern fails to match the corresponding portion of the candidate filename, the entire match fails. Otherwise, the pointer is advanced to the next subpattern and to the next filename character, and the matching process continues. If every subpattern is successfully matched against every candidate filename character, the whole pattern match succeeds. Unlike other regular-expression functions, the functions here do not require a separate precompile pass. Each invocation of the matching functions evaluates the pattern string in place.

How each subpattern attempts to match its corresponding portion of the candidate filename is explained in detail below.

Normal characters are easy, since they either match the corresponding filename character or they don't. The only complication arises when dealing with case-insensitive filenames, which requires that both the subpattern character and the candidate filename character be converted to the same case before they are compared. It is safer to convert characters to lowercase rather than to uppercase. Depending on the character set (or code page) being used, some national character sets have fewer uppercase letters than lowercase letters, so more than one lowercase letter maps to the same uppercase letter (usually as a result of the way accented characters are allocated in the character set). All of this is taken into account by the tolower function, since it is aware of any locale settings that may have been established prior to calling the pattern-matching functions.

A quoted character is also a simple subpattern. It is handled simply by treating the quoted pattern character like a normal character and then attempting to match it against the next filename character. Case insensitivity must also be handled.

The Any subpattern ? matches any single filename character, so it is easy to handle too. It matches any single character other than a null terminator.

The Set subpattern [abc] matches a single character. Care must be taken when parsing the characters within the brackets so that Ranges of characters (specified with a hyphen), Excluded sets (specified with a !), and quoted characters are all handled properly. The candidate filename character must be compared against each character or range of character in the set. Case insensitivity must also be handled.

The Closure subpattern * is a little trickier, since it can match any number of characters in the candidate filename. This is accomplished by assuming that the closure subpattern initially matches zero characters in the filename (starting at the current position in the filename), and then recursively attempting to match the rest of the pattern against the rest of the candidate filename. If the recursive submatch succeeds, the Closure subpattern succeeds. Otherwise, it is then assumed that the Closure subpattern matches one filename character, then two characters, and so on, each time attempting to match the rest of the pattern against the rest of the filename recursively. If all of the possible numbers of filename characters that could be matched are exhausted without encountering a successful recursive submatch, the Closure subpattern match fails.

For efficiency, the algorithm for matching the Closure subpattern actually operates in the opposite direction. That is, it initially assumes that it matches all of the remaining filename characters, then all but one character, then all but two, and so on, until it assumes that it matches zero filename characters. This style of matching is called maximal munch, because it tries to munch (match) the maximum number of characters that it can at each step. It's a little quicker when matching short patterns against long filenames, a fairly common situation (e.g., the pattern a*b matched against the filename abracadabra.b). It is also more efficient for patterns containing multiple Closure subpatterns.

For example, the pattern a*b*c matches the filename AnBxxxBzC in two different ways. The subpattern *b can match either the nB portion of the filename or the nBxxxB portion. Maximal munching causes the second filename portion to be matched, resulting in a faster match for the whole pattern since less backtracking is required. Another example is the pattern a**b matched against the filename abracadabra.b. The first * requires only two attempts to successfully match the filename since it munches all of the characters except the final b. This results in less time being spent attempting to match the second * of the pattern.

Finally, the Not subpattern ! attempts to match the rest of the pattern against the rest of the filename. If the submatch succeeds then the subpattern fails, otherwise the subpattern succeeds. That is, it reverses the sense of the rest of the pattern.

Acknowlegements

The library functions are copyrighted by the author, but permission is hereby granted for their unlimited use provided that the original copyright notice is retained unaltered. If you use any of the code in your software, drop me a note. I'm curious to learn who finds it useful.

About the Author

David R. Tribble has been programming for almost 20 years since he wrote his first BASIC program in high school, way back in the dark ages before Apples and IBM PCs. He still remembers the first computer store that opened in his old hometown of Arlington, Texas. He currently lives with his new wife in Plano, Texas, and is a Senior Software Engineer for BEA Systems, Inc. He can be reached via email at dtribble@technologist.com. His home page is www.flash.net/~dtribble.


© 1997 Miller Freeman, Inc. All Rights Reserved.
All source code copyright © 1997 by David R. Tribble, all rights reserved.
All trademarks are held by their respective owners.