
%Regex.Matcher

Class %Regex.Matcher Extends (%RegisteredObject, %SYSTEM.Help) [ Final ]

The Class %Regex.Matcher creates an object that does pattern matching using regular expressions. The regular expressions come from the International Components for Unicode (ICU). The ICU maintains web pages at https://icu.unicode.org.

The definition and features of the ICU regular expression package can be found in https://unicode-org.github.io/icu/userguide/strings/regexp.html.

On most platforms, installing InterSystems IRIS will also install an appropriate version of the ICU libraries. On platforms that do not have an ICU library available, evaluating any regular expression function or method will result in an error.

A %Regex.Matcher object can be created by evaluating
##class(%Regex.Matcher).%New(pattern) or
##class(%Regex.Matcher).%New(pattern,text).
The first parameter to %New becomes the initial value of the property Pattern. The optional second parameter to %New becomes the initial value of the property Text. Setting property Pattern to a regular expression pattern string causes that pattern to be compiled into a Matcher object, where it can be used for multiple matching operations without being recompiled. The property Text contains the subject text string that is searched by a regular expression match. Note that an empty string is considered to be an illegal regular expression, so the first parameter to %New can be neither missing nor the empty string.

If x is a %Regex.Matcher object then the built-in method %ConstructClone can be used to copy x (Set xnew = x.%ConstructClone()). The state of the most recent match and any error value in the Status property are not cloned. The %ConstructClone method can be faster than creating a new Matcher with the same Pattern because it can just copy instructions for the matching engine rather than recompiling the original pattern string. On 8-bit systems %ConstructClone can just copy the Unicode versions of the Pattern and Text properties without needing to do the character-by-character conversion from the NLS 8-bit character set into Unicode.
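The cloning described above might be used as in the following minimal sketch. The pattern and subject strings are illustrative assumptions, not taken from the class documentation.

```objectscript
 ; Compile the pattern once, then clone the matcher for a second subject string
 set x = ##class(%Regex.Matcher).%New("\d+", "order 66 shipped")
 set xnew = x.%ConstructClone()
 ; The clone has its own Text, so changing it does not affect x
 set xnew.Text = "invoice 1234 paid"
 if x.Locate() { write "x found ", x.Group, ! }
 if xnew.Locate() { write "xnew found ", xnew.Group, ! }
```

Because match state is not cloned, each matcher starts its own search from the beginning of its Text.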

None of the methods or operations in the %Regex.Matcher package return a %Status value. When an error is detected, these operations always throw the system exception thrown by the kernel code that interfaces to the ICU library. If a program wants to recover from a regular expression error then it is recommended that the code doing regular expression operations be surrounded with a TRY {...} block and that the error recovery be done in the corresponding CATCH {...} block. Note that a TRY block imposes no run-time performance overhead in situations where no error occurs.
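The recommended TRY/CATCH arrangement could look like the sketch below. It assumes that the unbalanced pattern "(" is rejected when the pattern is compiled; the variable names are illustrative.

```objectscript
 try {
     ; "(" has an unbalanced parenthesis, so compiling it is expected to fail
     set matcher = ##class(%Regex.Matcher).%New("(")
     write "Pattern compiled without error", !
 } catch ex {
     ; ex is the system exception thrown by the regular expression code
     write "Regex error: ", ex.DisplayString(), !
 }
```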

The methods and operations in a %Regex.Matcher object will catch any system error and will generate a %Status value that may better describe that error. That %Status value will be stored in the Status property of the %Regex.Matcher object and in the variable %objlasterror. After saving the %Status value, the original unmodified system exception will be rethrown. You may examine that %Status value by executing the following InterSystems IRIS ObjectScript command:
do $system.Status.DisplayError(%objlasterror)

Some other system errors, like , are passed through the %Regex.Matcher methods without modification.

Note that some ICU operation errors are not considered errors by the %Regex.Matcher package. Examples are evaluating the Start and End properties when the previous matching operation failed. In these cases Start and End have value -2 as a character position rather than throwing an error.

Examples:

Regular expression that finds titles M., Mr., Mrs. and Ms. in a string: "\bMr?s?\."
"\b" matches a break at the beginning (or ending) of a word
"M" matches an upper-case letter-M
"r?" matches 0 or 1 occurrences of a lower-case letter-r
"s?" matches 0 or 1 occurrences of a lower-case letter-s
"\." matches a period character

USER>set matcher=##class(%Regex.Matcher).%New("\bMr?s?\.")
USER>set matcher.Text="Mrs. Sally Jones, Mr. Mike McMurry, Ms. Amy Johnson, M. Maurice LaFrance"
USER>while matcher.Locate() {write "Found ",matcher.Group," at position ",matcher.Start,!}
Found Mrs. at position 1
Found Mr. at position 19
Found Ms. at position 37
Found M. at position 54
USER>write matcher.ReplaceAll("Dr.")
Dr. Sally Jones, Dr. Mike McMurry, Dr. Amy Johnson, Dr. Maurice LaFrance
USER>write matcher.ReplaceFirst("Dr.")
Dr. Sally Jones, Mr. Mike McMurry, Ms. Amy Johnson, M. Maurice LaFrance

Regular expression that matches phone numbers of the form "(aaa) bbb-cccc" or of the form "aaa-bbb-cccc": (\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b

(\((\d{3})\)\s*|\b(\d{3})-) matches either prefix "(aaa) " or prefix "aaa-". The outer parentheses capture this entire prefix as Group(1) and limit the range of the two prefix subpatterns joined in alternation by the | operator.

\((\d{3})\)\s* matches prefix "(aaa) "
\( and \) and \s* match "(" and ")" and zero or more spaces, respectively
\d{3} matches exactly 3 digits
(\d{3}) the parentheses capture these 3 digits as Group(2)

\b(\d{3})- matches prefix "aaa-"
\b this "break" allows no other digit or letter immediately before the 3 digits
(\d{3}) captures these 3 digits as Group(3)

(\d{3})- matches "bbb-" and captures these 3 digits as Group(4)

(\d{4}) matches "cccc" and captures these 4 digits as Group(5)

\b this final "break" makes sure the match is not immediately followed by another digit or a letter

ListPhones(s,a) PUBLIC {
    ; a is a reference variable.  On return
    ;   a contains the number of phone numbers in string s
    ;   a(i) contains just the digits of the i'th phone number
    kill a
    set a = 0
    set m=##class(%Regex.Matcher).%New("(\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b")
    set m.Text = s
    while m.Locate() {
        ; Get first three digits from Group(2) or Group(3)
        if m.Start(2)>0 { set n=m.Group(2) } else { set n=m.Group(3) }
        ; Concatenate middle 3 digits and final 4 digits
        set n = n_m.Group(4) _ m.Group(5)
        ; Insert digit string into array a
        set a($increment(a)) = n
    }
}

ListPhones2(s,a) PUBLIC {
    ; a is a reference variable.  On return
    ;   a contains the number of phone numbers in string s
    ;   a(i) is i'th phone number formatted as "(aaa)bbb-cccc"
    ;   Note, no blank after "(aaa)"
    kill a
    set a = 0
    set m=##class(%Regex.Matcher).%New("(\((\d{3})\)\s*|\b(\d{3})-)(\d{3})-(\d{4})\b")
    set m.Text = s
    while m.Locate() {
        ; Digits are concatenation of Capture groups 2,3,4,5
        ; One of group 2 or 3 is the empty string when group is not used
        set a($increment(a)) = m.SubstituteIn("($2$3)$4-$5")
    }
}

USER>write ^t2
Call 617-555-1212 about item number 61773-333-4569
USER>do ListPhones^ListPhones(^t2,.a)
USER>zwrite a
a=1
a(1)=6175551212

USER>write ^t3
Phone (212) 334-5397, (321)770-2121 and 603-646-0110
USER>do ListPhones^ListPhones(^t3,.a)
USER>zwrite a
a=3
a(1)=2123345397
a(2)=3217702121
a(3)=6036460110

USER>write ^t3
Phone (212) 334-5397, (321)770-2121 and 603-646-0110
USER>do ListPhones2^ListPhones(^t3,.a)
USER>zwrite a
a=3
a(1)="(212)334-5397"
a(2)="(321)770-2121"
a(3)="(603)646-0110"

Properties

Pattern

Property Pattern As %String;

The property Pattern is the string representation of the regular expression of the Matcher. Assigning to Pattern resets all saved state concerning the last matching operation.

On an installation using an NLS 8-bit character set different from Latin-1, you must be careful with patterns using a character class of the form [x-y] where x or y are national usage characters not in Latin-1. All regular expression matching is done in Unicode, so characters x and y are converted to Unicode. The character class [x-y] represents all characters between the Unicode translations of x and y and not the NLS 8-bit characters between x and y.

RegexId

Property RegexId [ Internal, Private ];

RegexId is an internal value that is mapped to the regular expression matcher object supported by the ICU libraries.

Text

Property Text As %String;

The property Text is the string to which the regular expression will be applied. Assigning to Text resets all saved state resulting from the most recent match operation. On installations using an 8-bit character code, the internal representation of Text is converted to Unicode. Therefore, on an installation using 8-bit characters the maximum length of the Text property is only half the maximum string length supported by that installation.

TextBuffer

Property TextBuffer As %String [ Internal, Private ];

TextBuffer is used only on 8-bit systems. It is a copy of Text as Unicode bytes.

Start

Property Start As %Integer [ MultiDimensional, ReadOnly ];

The property Start without a subscript contains the character position in property Text of the first character of the string found by the last match. If the matched string is the empty string then Start is the character position one beyond where the empty string was located (and the property Start equals the property End.)

The value of Start(i) when subscripted with an integer i between 1 and GroupCount is the character position of the first character of the last string successfully captured by capture group i. If the captured string is the empty string then Start(i) is the character position one beyond where the empty string was captured (and the property Start(i) equals the property End(i).)

The value of Start(i) is -1 if capture group i did not participate in the last match. The values of Start and Start(i) are -2 if the last match attempt failed.

Note: In addition to integer subscripts between 1 and GroupCount, the value of Start(0) is identical to the value of Start without a subscript. When the property Start(...) is subscripted with values not described above, the result of evaluating Start(...) is undefined.

End

Property End As %Integer [ MultiDimensional, ReadOnly ];

The property End without a subscript contains the character position in property Text one beyond the final character of the string found by the last match.

The value of End(i) when subscripted with an integer i between 1 and GroupCount is the character position one beyond the last character of the last string successfully captured by capture group i.

The value of End(i) is -1 if capture group i did not participate in the last match. The values of End and End(i) are -2 if the last match attempt failed.

Note: In addition to integer subscripts between 1 and GroupCount, the value of End(0) is identical to the value of End without a subscript. When the property End(...) is subscripted with values not described above, the result of evaluating End(...) is undefined.

Group

Property Group As %String [ MultiDimensional, ReadOnly ];

The property Group without a subscript contains the string found by the last match.

The value of Group(i) when subscripted with an integer i between 1 and GroupCount is the last string successfully captured by capture group i.

If the last match operation was unsuccessful or if the specified capture group was not used during the last match operation then Group and Group(i) contain the empty string. Note that End and End(i) have negative values when the last match operation did not use the specified capture group or did not succeed in matching.

Note: In addition to integer subscripts between 1 and GroupCount, the value of Group(0) is identical to the value of Group without a subscript. When the property Group(...) is subscripted with values not described above, the result of evaluating Group(...) is undefined.
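The relationship between Start, End, Group and GroupCount can be explored with a short sketch such as the one below. The pattern and text are illustrative assumptions. Because End(i) is one beyond the last captured character, extracting from Start(i) through End(i)-1 should reproduce Group(i).

```objectscript
 set m = ##class(%Regex.Matcher).%New("(\d{4})-(\d{2})", "Date: 2024-05")
 if m.Locate() {
     ; Subscript 0 refers to the whole match; 1..GroupCount are the capture groups
     for i=0:1:m.GroupCount {
         write "Group(", i, ")=", m.Group(i)
         write "  Start=", m.Start(i), "  End=", m.End(i)
         ; End(i)-1 is the position of the last character of the group
         write "  extracted=", $extract(m.Text, m.Start(i), m.End(i)-1), !
     }
 }
```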

HitEnd

Property HitEnd As %Boolean [ Calculated, ReadOnly ];

The property HitEnd is true if the most recent matching operation touched the end of property Text at any point during its processing. In this case, appending additional input characters to the Text property could change the result of that match attempt.

PreviousMatchEnd

Property PreviousMatchEnd As %Integer [ Private ];

PreviousMatchEnd is the End value of the previous match. It has value -1 if there is no current match and value 1 if there is a current match but no previous match.

GroupCount

Property GroupCount As %Integer [ ReadOnly ];

The property GroupCount contains the number of capturing groups in the regular expression Pattern.

RequiredPrefix

Property RequiredPrefix As %String [ Deprecated, Internal, ReadOnly ];

This property is DEPRECATED and is always the empty string.

The property RequiredPrefix contains a string which, if nonempty, is a sequence of characters which must occur at the start of any string which matches the Pattern. A nonempty RequiredPrefix can be used to search a long string for a favorable position to start a Regular Expression matching operation.

In many cases the heuristics used by the ICU library to determine the RequiredPrefix do not include all possible characters of such a prefix. When a prefix cannot be determined, RequiredPrefix will contain the empty string. RequiredPrefix will also contain the empty string if the ICU library used by InterSystems IRIS does not support the RequiredPrefix feature.

OperationLimit

Property OperationLimit As %Integer;

The property OperationLimit provides a way to limit the time taken by a regular expression match. The default value for OperationLimit is 0 which indicates that there is no limit. Setting OperationLimit to a positive integer will cause a match operation to signal a TimeOut error after the specified number of clusters of steps by the match engine.

Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but cluster size is chosen such that each cluster's execution time will typically be on the order of milliseconds.
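As a hedged illustration, the sketch below sets a small OperationLimit before running a pattern that is prone to heavy backtracking. The limit value, the pattern and the subject string are assumptions chosen for illustration; whether the limit is actually reached depends on the ICU version and the platform.

```objectscript
 ; "(a+)+b" can backtrack heavily on a long run of "a" with no "b"
 set m = ##class(%Regex.Matcher).%New("(a+)+b")
 set m.OperationLimit = 1                           ; arbitrary illustrative limit
 set m.Text = $translate($justify("", 30), " ", "a") ; a string of 30 "a" characters
 try {
     set found = m.Match()
     write "Match completed and returned ", found, !
 } catch ex {
     ; Reached only if the match engine signals an error such as a time out
     write "Match abandoned: ", ex.DisplayString(), !
     do $system.Status.DisplayError(m.Status)
 }
```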

Status

Property Status As %Status;

The property Status contains a %Status value which may provide more information about the last system exception thrown by this object. It is initially $$$OK. Its value remains unchanged by any successful operation. The Status property is changed only when an error is thrown by the kernel functions implementing %Regex.Matcher or by a COS Set assignment to the Status property done by the user.

Methods

Error

Method Error(excep As %Exception.AbstractException) As %ObjectHandle [ Internal, Private ]

Creates %Regex application exception from system exception

%OnNew

Method %OnNew(pattern As %String = "", text As %String = "") As %Status [ Internal, Private ]

The class method %New creates a new Matcher.

The argument pattern contains the regular expression. The property Pattern is set to the value of this argument.

The argument text is optional. If defined, it contains the new value of the property Text, which is the string to which the regular expression Pattern will be applied.

%OnClose

Method %OnClose() As %Status [ Internal, Private ]

The %OnClose() method frees the URegularExpression handle stored in i%RegexID when a %Regex.Matcher object is deleted.

PatternSet

Method PatternSet(pattern As %String) As %Status

The PatternSet method implements Set assignments to the Pattern property.

TextSet

Method TextSet(text As %String) As %Status

The TextSet method implements Set assignments to the Text property.

StartGet

Method StartGet(group As %Integer = 0) As %Integer

The StartGet method implements the Start property.

EndGet

Method EndGet(group As %Integer = 0) As %Integer

The EndGet method implements the End property.

GroupGet

Method GroupGet(group As %Integer = 0) As %String

The GroupGet method implements the Group property.

HitEndGet

Method HitEndGet() As %Boolean

The HitEndGet method implements the HitEnd property.

GroupCountGet

Method GroupCountGet() As %Integer

The GroupCountGet method implements the GroupCount property.

RequiredPrefixGet

Method RequiredPrefixGet() As %String

The RequiredPrefixGet method implements the RequiredPrefix property.

OperationLimitSet

Method OperationLimitSet(limit) As %Status

The OperationLimitSet method implements the side effects of doing a Set assignment to change the value of the OperationLimit property.

LastStatus

ClassMethod LastStatus() As %Status

The class method LastStatus returns the %Status value containing additional details about the most recent system error. If a %Regex.Matcher object encounters an error then this status is already available in the Status property of the object. Executing
Do $SYSTEM.Status.DisplayError(##class(%Regex.Matcher).LastStatus())
is useful when debugging an error following a call on $MATCH, $LOCATE or ##class(%Regex.Matcher).%New(x) where a %Regex.Matcher oref value is not available.
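A possible debugging arrangement is sketched below. It assumes that the incomplete character class "[a-" is rejected as an invalid pattern by $LOCATE; the strings are illustrative only.

```objectscript
 try {
     ; "[a-" is an unterminated character class, so an error is expected
     write $locate("abc", "[a-"), !
 } catch ex {
     ; No matcher oref exists here, so ask the class for the last status
     do $SYSTEM.Status.DisplayError(##class(%Regex.Matcher).LastStatus())
 }
```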

Match

Method Match(text As %String) As %Boolean

The method Match returns true if the entire string Text is matched by Pattern; it returns false if it does not match.

The argument text is optional. If the argument text is defined then the property Text is set to its value before the match is executed.

LookingAt

Method LookingAt(position As %Integer = 1) As %Boolean

The method LookingAt attempts to find a match in the property Text that must start at a particular character position. The match need not extend to the end of Text.

The argument position gives the starting character position of the attempted match.

LookingAt returns 1 if the match is found; 0 otherwise.
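The difference between Match and LookingAt can be seen in a sketch like the following, which uses illustrative strings. Match must consume the whole of Text, while LookingAt only requires a match that begins at the given position.

```objectscript
 set m = ##class(%Regex.Matcher).%New("\d+")
 ; The entire text consists of digits, so a full match is possible
 write "Match(""12345""): ", m.Match("12345"), !
 ; Trailing non-digit text prevents a match of the entire string
 write "Match(""123 apples""): ", m.Match("123 apples"), !
 ; Text is still "123 apples"; digits begin at position 1
 write "LookingAt(1): ", m.LookingAt(1), !
 ; Position 4 holds a space, which \d+ cannot match
 write "LookingAt(4): ", m.LookingAt(4), !
```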

Locate

Method Locate(position As %Integer) As %Boolean

The method Locate finds a match for the regular expression Pattern in the text string Text.

If the optional argument position is defined as an integer 1 or greater then the search for a match begins at that character position of Text.

If the argument position is not defined then the search for the match begins at the character position following the previous match.

Locate returns 1 if the match is found; 0 otherwise.

ResetPosition

Method ResetPosition(position As %Integer = 1)

The method ResetPosition resets any saved state from the previous match. It also causes the next call to the method Locate() without an argument to begin at the specified character position.

The argument position is the character position from which the next call to Locate() without an argument will begin match attempts.
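A minimal sketch of rescanning with ResetPosition follows; the pattern and text are illustrative assumptions. The first loop exhausts the matches, and ResetPosition then lets Locate() without an argument start again from a chosen position.

```objectscript
 set m = ##class(%Regex.Matcher).%New("[a-z]+", "one two three")
 ; First pass: each Locate() continues after the previous match
 while m.Locate() { write m.Group, " " }
 write !
 ; Second pass: begin at character position 5, the "t" of "two"
 do m.ResetPosition(5)
 while m.Locate() { write m.Group, " " }
 write !
```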

ReplaceAll

Method ReplaceAll(replacement As %String) As %String

The method ReplaceAll returns a modified copy of the property Text. It replaces every substring of Text that matches the Pattern with a replacement string. Portions of Text that are not matched are copied without change. The value of ReplaceAll is the resulting string. The property Text is not modified.

The argument replacement supplies the string to replace each matched region. The replacement string may contain references to capture groups which take the form of $1, $2, etc. The replacement string may reference the entire matched region with $0.
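Capture group references in a replacement string might be used as in this sketch. The addresses are made-up sample data, and the pattern is deliberately simplistic.

```objectscript
 set m = ##class(%Regex.Matcher).%New("(\w+)@(\w+)\.com", "ann@example.com, bob@test.com")
 ; $1 is the text before the @ sign and $2 is the text between @ and .com
 write m.ReplaceAll("$1 at $2"), !
 ; The Text property itself is left unmodified
 write m.Text, !
```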

ReplaceFirst

Method ReplaceFirst(replacement As %String) As %String

The method ReplaceFirst returns a modified copy of the property Text. It replaces the first substring of Text that matches the Pattern with a replacement string. Portions of Text that are not matched are copied without change. The value of ReplaceFirst is the resulting string. The property Text is not modified.

The argument replacement supplies the string to replace the matched region. The replacement string may contain references to capture groups which take the form of $1, $2, etc. The replacement string may reference the entire matched region with $0.

SubstituteIn

Method SubstituteIn(text As %String) As %String

The method SubstituteIn returns the string that results from substituting capturing groups from the most recent regular expression match into components of the argument text. This method is undefined if the most recent regular expression match operation was not successful.

This method can be used as a low level step in regular expression replacement. It does not modify the property Text. For example, the method ..ReplaceFirst(x) is equivalent to:

Quit:'..Locate(1) ..Text
Quit $Extract(..Text,1,..Start-1)_..SubstituteIn(x)_$Extract(..Text,..End,*)

The argument text supplies the string that will be modified by the matched region and then returned. The string may contain references to capture groups which take the form of $1, $2, etc. The string may reference the entire matched region with $0.
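As an illustration, SubstituteIn can reorder captured fields after a successful Locate. The date format and strings below are assumptions for the sake of the example.

```objectscript
 set m = ##class(%Regex.Matcher).%New("(\d{2})/(\d{2})/(\d{4})", "Due 12/31/2024")
 ; SubstituteIn is only meaningful after a successful match
 if m.Locate() {
     ; Rearrange month/day/year into year-month-day using the capture groups
     write m.SubstituteIn("$3-$1-$2"), !
 }
```

Unlike ReplaceFirst, this returns only the substituted template, not the surrounding portions of Text.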

SplitIntoList

Method SplitIntoList() As %List [ Internal, Private ]

The method SplitIntoList separates the Text string into fields. Matches by the regular expression Pattern identify delimiters that separate the fields. The contents of Text between the matches become fields. The return value of the method is a $LIST where each List element is a field.

SplitIntoArray

Method SplitIntoArray(ByRef array) As %Integer [ Internal, Private ]

The method SplitIntoArray separates the Text string into fields. Matches by the regular expression Pattern identify delimiters that separate the fields. The contents of Text between the matches become fields. The return value of the method is an integer which is a count of the number of fields.

The argument array is a reference to a local variable which is an array to contain the values of the fields. array(1) is assigned the first field, array(2) is assigned the second field, etc.

%OnConstructClone

Method %OnConstructClone(obj As Matcher, deep As %Boolean, ByRef cloned As %String) As %Status [ Internal, Private ]

The %OnConstructClone method clones the ICU library specific values in a %Regex.Matcher object. It also resets the Status and the state of the last match attempt.