The procedures in this section provide access to POSIX regular expression matching. The regular expression syntax and semantics are far too complex to be described here.
Note: Because the C interface uses ASCII NUL bytes to mark the ends of strings, patterns & strings that contain NUL characters will not work correctly.
The first interface to regular expressions is a thin layer over the
interface that POSIX provides. It is exported by the structures
posix-regexps & posix.
Make-regexp creates a regular expression with the given string pattern. The arguments after string specify various options for the regular expression; see regexp-option below. The regular expression is not compiled until it is matched against a string, so any errors in the pattern string will not be reported until that point. Regexp? is the disjoint type predicate for regular expression objects.
Regexp-option evaluates to a regular expression option with the given name, suitable to be passed to make-regexp. The possible option names are:
extended - use the extended patterns
ignore-case - ignore case differences when matching
submatches - report submatches
newline - treat newlines specially
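As a rough illustration of make-regexp, regexp? & the option syntax, the sketch below builds two regular expressions. It assumes a Scheme48 session in which the posix-regexps structure has been opened; the names basic-re & word-re and the patterns themselves are made up for the example.

```scheme
;; Assumes a Scheme48 session with the structure opened: ,open posix-regexps
;; basic-re & word-re are hypothetical names used only in this sketch.

;; No options: a basic (non-extended) POSIX pattern.
(define basic-re (make-regexp "ab*c"))

;; Extended syntax, matched without regard to case.
(define word-re
  (make-regexp "(foo|bar)+"
               (regexp-option extended)
               (regexp-option ignore-case)))

(regexp? word-re)         ; => #t
(regexp? "(foo|bar)+")    ; => #f, a plain string is not a regexp object
```

Since compilation is deferred, a malformed pattern passed to make-regexp would only be reported once the object is first matched against a string.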
Regexp-match matches regexp against the characters in string, starting at position start. If the string does not match the regular expression, regexp-match returns #f. If the string does match, then a list of match records is returned if submatches? is true, or #t if submatches? is false. The first match record gives the location of the substring that matched regexp. If the pattern in regexp contained submatches, then the submatches are returned in order, with match records in the positions where submatches succeeded and #f in the positions where submatches failed.
Starts-line? should be true if string starts at the beginning of a line, and ends-line? should be true if it ends one.
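A minimal sketch of a call with submatches? false, where only a boolean comes back. The argument order used here (regexp, string, start, submatches?, starts-line?, ends-line?) is an assumption taken from the order in which the description names the arguments; digits-re is a hypothetical name.

```scheme
;; Assumes a Scheme48 session with the structure opened: ,open posix-regexps
;; digits-re is a hypothetical name used only in this sketch.
(define digits-re
  (make-regexp "[0-9]+-[0-9]+" (regexp-option extended)))

;; Assumed argument order:
;;   regexp string start submatches? starts-line? ends-line?
(regexp-match digits-re "10-20"     0 #f #t #t)   ; => #t
(regexp-match digits-re "no digits" 0 #f #t #t)   ; => #f
```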
Match? is the disjoint type predicate for match records. Match records contain three values: the beginning & end of the substring that matched the pattern and an association list of submatch keys and corresponding match records for any named submatches that also matched. Match-start returns the index of the first character in the matching substring, and match-end gives the index of the first character after the matching substring. Match-submatches returns the alist of submatches.
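The accessors can be tried on the first record of a regexp-match result, as in the sketch below. It assumes the posix-regexps structure is open and the same argument order for regexp-match as the description suggests; key-value-re is a hypothetical name.

```scheme
;; Assumes a Scheme48 session with the structure opened: ,open posix-regexps
;; key-value-re is a hypothetical name used only in this sketch.
(define key-value-re
  (make-regexp "([a-z]+)=([0-9]+)"
               (regexp-option extended)
               (regexp-option submatches)))

;; With submatches? true, a successful match yields a list of match
;; records; the first one covers the whole matched substring.
(let* ((records (regexp-match key-value-re "width=80" 0 #t #t #t))
       (whole   (car records)))
  (list (match? whole)
        (match-start whole)     ; index of the first matched character
        (match-end whole)))     ; index just past the last matched character
;; => (#t 0 8)
```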
This section describes a functional interface for building regular
expressions and matching them against strings, higher-level than the
direct POSIX interface. The matching is done using the POSIX regular
expression package. Regular expressions constructed by procedures
listed here are compatible with those in the previous section; that is,
they satisfy the predicate regexp? from the posix-regexps structure.
These names are exported by the structure regexps.
Character sets may be defined using a list of characters and strings, using a range or ranges of characters, or by using set operations on existing character sets.
Set returns a character set that contains all of the character arguments and all of the characters in all of the string arguments. Range returns a character set that contains all characters between low-char and high-char, inclusive. Ranges returns a set that contains all of the characters in the given set of ranges. Range & ranges use the ordering imposed by char->integer. Ascii-range & ascii-ranges are like range & ranges, but they use the ASCII ordering. Ranges & ascii-ranges must be given an even number of arguments. It is an error for a high-char to be less than the preceding low-char in the appropriate ordering.
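The constructors described above might be used as follows. This is only a sketch, assuming the regexps structure is open; all defined names are hypothetical, and the sets are placed inside sequence because that is how the examples at the end of this section use them.

```scheme
;; Assumes a Scheme48 session with the structure opened: ,open regexps
;; All names defined here are hypothetical.
(define vowel     (set "aeiou"))                      ; from a string
(define sign      (set #\+ #\-))                      ; from characters
(define hex-low   (range #\a #\f))                    ; one inclusive range
(define hex-digit (ranges #\0 #\9 #\a #\f #\A #\F))   ; low/high pairs

(define b-vowel-t (sequence (text "b") vowel (text "t")))
(exact-match? b-vowel-t "bat")    ; => #t
(exact-match? b-vowel-t "bxt")    ; => #f

(define signed-hex (sequence sign hex-digit hex-digit))
(exact-match? signed-hex "-7F")   ; => #t
(exact-match? signed-hex "-7G")   ; => #f
```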
Set operations on character sets. Negate returns a character set of all characters that are not in char-set. Union returns a character set that contains all of the characters in char-seta and all of the characters in char-setb. Intersection returns a character set of all of the characters that are in both char-seta and char-setb. Subtract returns a character set of all the characters in char-seta that are not also in char-setb.
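A short sketch of the set operations, assuming the regexps structure is open. The names letter, digit, consonant, non-digit, lower-hex & identifier are hypothetical and exist only for this example.

```scheme
;; Assumes a Scheme48 session with the structure opened: ,open regexps
;; All names defined here are hypothetical.
(define letter    (union (range #\a #\z) (range #\A #\Z)))
(define digit     (range #\0 #\9))
(define consonant (subtract (range #\a #\z) (set "aeiou")))
(define non-digit (negate digit))       ; every character except 0-9
(define lower-hex (intersection (range #\a #\z)
                                (set "0123456789abcdef")))   ; a through f

;; A letter followed by any number of letters or digits.
(define identifier (sequence letter (repeat (union letter digit))))

(exact-match? identifier "x25")                      ; => #t
(exact-match? identifier "25x")                      ; => #f
(exact-match? (sequence consonant consonant) "st")   ; => #t
(exact-match? (sequence consonant consonant) "so")   ; => #f
```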
(set "abcdefghijklmnopqrstuvwxyz")
(set "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
(union lower-case upper-case)
(set "0123456789")
(union alphabetic numeric)
(set "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")
(union alphanumeric punctuation)
(union graphic (set #\space))
(negate printing)
(set #\space (ascii->char 9)) ; ASCII 9 = TAB
(union (set #\space) (ascii-range 9 13))
(set "0123456789ABCDEF")
Predefined character sets.
String-start returns a regular expression that matches the beginning of the string being matched against; string-end returns one that matches the end.
Sequence returns a regular expression that matches the concatenation of all of its arguments; one-of returns a regular expression that matches any one of its arguments.
Text returns a regular expression that matches exactly the characters in string, in order.
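The anchors and combinators compose as in the sketch below, which assumes the regexps structure is open and that string-start & string-end are called with no arguments. The name greeting is hypothetical.

```scheme
;; Assumes a Scheme48 session with the structure opened: ,open regexps
;; greeting is a hypothetical name used only in this sketch.
(define greeting
  (sequence (string-start)
            (one-of (text "hello") (text "goodbye"))
            (text ", world")
            (string-end)))

(any-match? greeting "hello, world")      ; => #t
(any-match? greeting "goodbye, world")    ; => #t
(any-match? greeting "oh hello, world")   ; => #f, not at the string start
(any-match? greeting "hello, world!")     ; => #f, not at the string end
```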
Repeat returns a regular expression that matches repeated occurrences of its regexp argument. With only one argument, the result will match regexp any number of times, including zero. With two arguments, i.e. one count argument, the returned regular expression will match regexp exactly that number of times. With three arguments, it will match from min to max repetitions, inclusive. Max may be #f, in which case there is no maximum number of matches. Count & min must be exact, non-negative integers; max should be either #f or an exact, non-negative integer.
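The three forms of repeat can be compared side by side. In this sketch the count arguments are assumed to come before the regexp argument, which the text here does not state explicitly; the regexps structure is assumed to be open, and all defined names are hypothetical.

```scheme
;; Assumes a Scheme48 session with the structure opened: ,open regexps
;; Assumed argument order: counts first, regexp last.
;; All names defined here are hypothetical.
(define digit (range #\0 #\9))

(define any-digits   (repeat digit))        ; zero or more
(define three-digits (repeat 3 digit))      ; exactly three
(define two-to-four  (repeat 2 4 digit))    ; between two and four
(define one-or-more  (repeat 1 #f digit))   ; at least one, no upper bound

(exact-match? any-digits   "2024")    ; => #t
(exact-match? three-digits "123")     ; => #t
(exact-match? three-digits "12")      ; => #f
(exact-match? two-to-four  "12345")   ; => #f
(any-match?   one-or-more  "abc")     ; => #f
```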
Regular expressions are normally case-sensitive, but case sensitivity can be manipulated simply.
The regular expression returned by ignore-case is identical to its argument except that the case will be ignored when matching. The value returned by use-case is protected from future applications of ignore-case. The expressions returned by use-case and ignore-case are unaffected by any enclosing uses of these procedures.
By way of example, the following matches "ab", but not "aB", "Ab", or "AB":
    (text "ab")
while
    (ignore-case (text "ab"))
matches all of those, and
    (ignore-case (sequence (text "a") (use-case (text "b"))))
matches "ab" or "Ab", but not "aB" or "AB".
A subexpression within a larger expression can be marked as a submatch. When an expression is matched against a string, the success or failure of each submatch within that expression is reported, as well as the location of the substring matched by each successful submatch.
Submatch returns a regular expression that is equivalent to regexp in every way except that the regular expression returned by submatch will produce a submatch record in the output for the part of the string matched by regexp. No-submatches returns a regular expression that is equivalent to regexp in every respect except that all submatches generated by regexp will be ignored & removed from the output.
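A sketch of named submatches, assuming the regexps structure is open and that submatch takes a key followed by a regexp, as in (submatch 'foo (text "cd")). It also assumes the match record accessors are visible in the session (opening posix-regexps as well may be needed). The names two-digits & date are hypothetical.

```scheme
;; Assumes a Scheme48 session with the structure opened: ,open regexps
;; (and posix-regexps too, if match-start & match-end are not visible).
;; two-digits & date are hypothetical names.
(define two-digits (repeat 2 (range #\0 #\9)))

(define date
  (sequence (submatch 'year (sequence two-digits two-digits))
            (text "-")
            (submatch 'month two-digits)))

;; Look up the record for the 'year submatch by its key.
(let* ((m    (match date "on 2024-05"))
       (year (cdr (assq 'year (match-submatches m)))))
  (list (match-start year) (match-end year)))
;; => (3 7)

;; Wrapping the expression in no-submatches drops the named
;; submatches from the result of a later match.
(define date-only (no-submatches date))
```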
Any-match? returns #t if string matches regexp or contains a substring that does, or #f otherwise. Exact-match? returns #t if string matches regexp exactly, or #f if it does not.
Match returns #f if string does not match regexp, or a match record if it does, as described in the previous section. Matching occurs according to POSIX. The match returned is the one with the lowest starting index in string. If there is more than one such match, the longest is returned. Within that match, the longest possible submatches are returned.
All three matching procedures cache a compiled version of regexp. Subsequent calls with the same input regular expression will be more efficient.
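The leftmost-then-longest rule can be seen with two alternatives that both match at the same position. This is a sketch under the semantics stated above, assuming the regexps structure is open and the match record accessors are visible; a-or-ab is a hypothetical name.

```scheme
;; Assumes a Scheme48 session with the structure opened: ,open regexps
;; (and posix-regexps too, if match-start & match-end are not visible).
;; a-or-ab is a hypothetical name used only in this sketch.
(define a-or-ab (one-of (text "a") (text "ab")))

;; Both alternatives can start at index 2; the longer one wins, and the
;; later occurrence at index 4 is not considered.
(let ((m (match a-or-ab "xxabab")))
  (list (match-start m) (match-end m)))
;; => (2 4)

(match a-or-ab "zzz")    ; => #f
```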
Here are some examples of the high-level regular expression interface:
    (define pattern (text "abc"))
    (any-match? pattern "abc")         => #t
    (any-match? pattern "abx")         => #f
    (any-match? pattern "xxabcxx")     => #t

    (exact-match? pattern "abc")       => #t
    (exact-match? pattern "abx")       => #f
    (exact-match? pattern "xxabcxx")   => #f

    (let ((m (match (sequence (text "ab")
                              (submatch 'foo (text "cd"))
                              (text "ef"))
                    "xxxabcdefxx")))
      (list m (match-submatches m)))
      => (#{Match 3 9} ((foo . #{Match 5 7})))

    (match-submatches
      (match (sequence (set "a")
                       (one-of (submatch 'foo (text "bc"))
                               (submatch 'bar (text "BC"))))
             "xxxaBCd"))
      => ((bar . #{Match 4 6}))