This article originally appeared in the
Dec 1997 issue of
The C/C++ Users Journal.
Filename Pattern Matching
by David R. Tribble
|
Regular expressions are a handy way to specify sets of strings, such as
filenames.
The code for matching regular expressions can be handy for many similar tasks.
Most operating systems that provide a command-line interface also provide
the capability to specify multiple filenames using patterns (also known as
filenames with wildcards).
This is true of such operating systems as MS-DOS, Unix, VMS, and others.
This article discusses the theory behind filename patterns and presents library
functions that perform pattern matching like that provided by Unix.
Regular Expressions
Filename patterns are a special kind of regular expression.
Regular expressions are used extensively in compiler technology, lexical
analysis, and text editors.
The theory of regular expressions originates in linguistics.
A generalized regular expression is an expression, or pattern, that matches
strings in a given language context. Filename patterns are a simplified kind of
regular expression designed explicitly for matching filenames within the context
of a file system or operating system. Filename patterns do not have the power
or flexibility of full regular expressions, but they don't need to since they
are used only to match filenames.
Filename patterns are composed of normal and special characters.
Normal characters are letters, digits, and other characters that can be used
in legitimate file names.
Each normal character matches one and only one character - the character itself.
For example, the normal character a in a pattern matches the letter
a in a filename.
Some operating systems (such as Unix) support case-sensitive filenames, so that
the names abc, ABC, and Abc all refer to different files.
Others (such as DOS and Windows) support case-insensitive filenames, so that
the previous names all refer to the same file.
A pattern composed entirely of normal characters is a special case since it
matches only one filename, the same filename as the whole pattern.
The empty string is also a special case, which matches only empty filename
strings.
Special pattern characters are characters that have special meanings in the way
they match characters in filenames.
They may match several characters, or none, or only a selected set of
characters.
Or they may change the sense of the matching of the rest of the pattern.
The functions described here are modeled after the pattern matching
capabilities of Unix.
Other operating systems, such as VMS, have a completely different syntax for
specifying filename patterns, so they are not considered here.
There are a few differences in the Unix and DOS/Windows implementations of these
filename pattern matching functions.
These differences are noted parenthetically.
The special pattern characters are:
? Any
* Closure
[abc] Set
[a-z] Range set
[!abc] Excluded set
\x Quoted character (Unix only)
`x Quoted character (DOS and Windows only)
!pat Not
The Any pattern ? matches any single character (including dot).
The Closure pattern * matches zero or more characters (including dot).
The Set pattern [abc] matches a single character in the set
{a, b, c}.
On systems supporting case-insensitive filenames, this will also match any
character in the set {A, B, C}.
If the set contains the ] character, it must be quoted.
If the set contains the hyphen character -, it must be the first
character in the set, be quoted, or be specified as the range ---.
The Range pattern [a-z] matches a single character in the range
a through z.
On case-insensitive systems, this also matches any character in the range
A through Z.
The character before the hyphen must sort lexicographically before the
character after the hyphen.
Sets and ranges can be combined within the same set of brackets (e.g., the
pattern [a-c123] matches any character in the set
{a, b, c, 1, 2, 3}).
The Excluded set pattern [!abc] matches any single character not in the
set {a, b, c}.
Case-insensitive systems also exclude characters in the set
{A, B, C}.
If the set contains the hyphen character, it must immediately follow the
! character, be quoted, or be specified as the range ---.
The Not pattern !pat matches any filename that is not matched by pattern
pat.
(This is not supported by Unix but is added as an enhancement to the
pattern-matching capabilities of the functions here.)
A quote character \ removes any special meaning from the next character.
To match the quote character itself, it must be quoted \\.
The backslash is used as the quote character just like Unix systems, but this
character is problematic for DOS and Windows systems because they use it as a
directory separator within pathnames.
So for such systems, a different quote character, the grave accent `,
is used.
Some patterns, such as *, will match empty filename strings.
This is generally undesirable, so empty filenames are handled as a special case.
It is assumed arbitrarily that empty filenames can be matched only by an empty
pattern.
A few patterns should be noted in particular:
* Matches any filename.
!* Does not match any filename.
!!x The same as x since the not patterns cancel out.
*.* Matches any filename with a dot in it.
Notice that dots have no special meaning and are treated just like other normal
characters.
This reflects the way that Unix and Windows 95 behave, but differs from the way
DOS and Windows 3.1 behave.
Older releases of DOS and Windows 3.1 assumed that every filename contained one
to eight characters, followed by a period, followed by up to three characters.
(This is the so-called 8.3 naming rule.)
The pattern *.* is actually expanded internally by DOS into the pattern
????????.???.
Modern operating systems, such as Unix and Windows 95, support long filenames,
which allow much longer names and do not consider dots to have any special
meaning.
In fact, almost any ASCII character can be used in a filename, including control
characters.
Most versions of Unix allow any eight-bit character as well.
The only limitations on long filenames are the maximum size (which is typically
255 or more characters) and the disallowing of a few characters that have
special meaning to the operating system.
In particular, these characters may be disallowed in filenames:
/ Directory (pathname) separator character
\ Directory (pathname) separator character (DOS and Windows only)
'\0' Filename terminator character
Library Functions
Listing 1
shows the header file for a small set of functions that perform filename
pattern matching.
Listing 2
is the code for the functions that do the actual matching.
(The remaining code, with test cases, is available on the CUJ ftp site.
See p. 3 for downloading instructions.)
The functions are written in C for maximum portability and are designed to
operate on all flavors of Unix, DOS, and Windows operating systems.
To match a filename pattern against a candidate filename, the code attempts to
match each subpattern in the pattern against individual characters in the
candidate filename.
Matching proceeds from the leftmost to the rightmost subpattern.
If at any point a subpattern fails to match the corresponding portion of the
candidate filename, the entire match fails. Otherwise, the pointer is advanced
to the next subpattern and to the next filename character, and the matching
process continues.
If every subpattern is successfully matched against every candidate filename
character, the whole pattern match succeeds.
Unlike other regular-expression functions, the functions here do not require a
separate precompile pass.
Each invocation of the matching functions evaluates the pattern string in place.
How each subpattern attempts to match its corresponding portion of the
candidate filename is explained in detail below.
Normal characters are easy, since they either match the corresponding filename
character or they don't.
The only complication arises when dealing with case-insensitive filenames,
which requires that both the subpattern character and the candidate filename
character be converted to the same case before they are compared.
It is safer to convert characters to lowercase rather than to uppercase.
Depending on the character set (or code page) being used, some national
character sets have fewer uppercase letters than lowercase letters, so more
than one lowercase letter maps to the same uppercase letter (usually as a
result of the way accented characters are allocated in the character set).
All of this is taken into account by the tolower function, since it is
aware of any locale settings that may have been established prior to calling
the pattern-matching functions.
A quoted character is also a simple subpattern.
It is handled simply by treating the quoted pattern character like a normal
character and then attempting to match it against the next filename character.
Case insensitivity must also be handled.
The Any subpattern ? matches any single filename character, so it is
easy to handle too.
It matches any single character other than a null terminator.
The Set subpattern [abc] matches a single character.
Care must be taken when parsing the characters within the brackets so that
Ranges of characters (specified with a hyphen), Excluded sets (specified with
a !), and quoted characters are all handled properly.
The candidate filename character must be compared against each character or
range of character in the set.
Case insensitivity must also be handled.
The Closure subpattern * is a little trickier, since it can match any
number of characters in the candidate filename.
This is accomplished by assuming that the closure subpattern initially matches
zero characters in the filename (starting at the current position in the
filename), and then recursively attempting to match the rest of the pattern
against the rest of the candidate filename.
If the recursive submatch succeeds, the Closure subpattern succeeds.
Otherwise, it is then assumed that the Closure subpattern matches one filename
character, then two characters, and so on, each time attempting to match the
rest of the pattern against the rest of the filename recursively.
If all of the possible numbers of filename characters that could be matched are
exhausted without encountering a successful recursive submatch, the Closure
subpattern match fails.
For efficiency, the algorithm for matching the Closure subpattern actually
operates in the opposite direction.
That is, it initially assumes that it matches all of the remaining filename
characters, then all but one character, then all but two, and so on, until it
assumes that it matches zero filename characters.
This style of matching is called maximal munch, because it tries to
munch (match) the maximum number of characters that it can at each step.
It's a little quicker when matching short patterns against long filenames, a
fairly common situation (e.g., the pattern a*b matched against the
filename abracadabra.b).
It is also more efficient for patterns containing multiple Closure subpatterns.
For example, the pattern a*b*c matches the filename AnBxxxBzC in
two different ways.
The subpattern *b can match either the nB portion of the
filename or the nBxxxB portion.
Maximal munching causes the second filename portion to be matched, resulting in
a faster match for the whole pattern since less backtracking is required.
Another example is the pattern a**b matched against the filename
abracadabra.b.
The first * requires only two attempts to successfully match the
filename since it munches all of the characters except the final b.
This results in less time being spent attempting to match the second *
of the pattern.
Finally, the Not subpattern ! attempts to match the rest of the pattern
against the rest of the filename.
If the submatch succeeds then the subpattern fails, otherwise the subpattern
succeeds.
That is, it reverses the sense of the rest of the pattern.
Acknowlegements
The library functions are copyrighted by the author, but permission is
hereby granted for their unlimited use provided that the original copyright
notice is retained unaltered.
If you use any of the code in your software, drop me a note.
I'm curious to learn who finds it useful.
About the Author
David R. Tribble has been programming for almost 20 years since he wrote his
first BASIC program in high school, way back in the dark ages before Apples
and IBM PCs.
He still remembers the first computer store that opened in his old hometown of
Arlington, Texas.
He currently lives with his new wife in Plano, Texas, and is a Senior Software
Engineer for BEA Systems, Inc.
He can be reached via email at
dtribble@technologist.com.
His home page is
www.flash.net/~dtribble.
© 1997 Miller Freeman, Inc. All Rights Reserved.
All source code copyright © 1997 by David R. Tribble, all rights reserved.
All trademarks are held by their respective owners.
|