Regular Expressions

Not all regular expressions are alike: in many cases the syntax depends on the program or on your operating system. A good article about regular expressions and pattern searching can be found (in German) at iX.

A regular expression defines a set of one or more strings of characters. Several of the UNIX utility programs, including ed, vi, grep, awk, and sed, use regular expressions to search for and replace strings.

A simple string of characters is a regular expression that defines one string of characters: itself.

A more complex regular expression uses letters, numbers, and special characters to define many different strings of characters. A regular expression is said to match any string it defines.

Characters and Delimiters

A character is any character except a NEWLINE. Most characters represent themselves within a regular expression. A special character is one that does not represent itself.

If you need to use a special character to represent itself, see "Quoting Special Characters".

A character, called a delimiter, usually marks the beginning and end of a regular expression. The delimiter is always a special character for the regular expression it delimits (that is, it does not represent itself but marks the beginning and end of the expression). You can use any character as a delimiter, as long as you use the same character at both ends of the regular expression. For simplicity, all the regular expressions in this section use a forward slash (/) as a delimiter. In some unambiguous cases, the second delimiter is not required. You can usually omit the second delimiter when it would be followed immediately by a <Return>.

Simple Strings

The most basic regular expression is a simple string that contains no special characters except the delimiters. A simple string matches only itself.

Regular expression    Meaning             Examples
/ring/                matches ring        ring, spring, ringing, stringing
/Thursday/            matches Thursday    Thursday, Thursday's
/or not/              matches or not      or not, poor nothing
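
The table above can be checked with a minimal sketch in Python. Python is not one of the utilities discussed here, and its re module takes the pattern as a plain string without delimiters; the helper name contains is an assumption made for this illustration.

```python
import re

def contains(pattern, text):
    """Return True if the pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# A simple string matches itself wherever it occurs, even inside a longer word.
ring_hits = [w for w in ["ring", "spring", "ringing", "stringing", "rang"]
             if contains("ring", w)]

# The <Space> in the pattern is an ordinary character and must be present in the text.
or_not_hit = contains("or not", "poor nothing")

print(ring_hits)
print(or_not_hit)
```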

Special Characters

You can use special characters within a regular expression to cause it to match more than one string. A regular expression that includes a special character always matches the longest possible string starting as far toward the beginning (left) of the line as possible.

Period

A period (.) matches any character.

Regular expression    Meaning                                                                               Examples
/ .alk/               matches all strings that contain a <Space> followed by any character followed by alk  will talk, may balk
/.ing/                matches all strings with any character preceding ing                                  singing, ping, before inglenook
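
As a rough illustration, the same patterns can be tried with Python's re module (an assumption of this sketch; the utilities named earlier would be given the pattern between delimiters instead). Note that the period must match exactly one character, so ing alone is not matched by the second pattern.

```python
import re

# A <Space>, then any one character, then "alk".
m1 = re.search(r" .alk", "will talk")

# Any one character followed by "ing"; here that character is the <Space>.
m2 = re.search(r".ing", "before inglenook")

# Nothing precedes "ing", so the period has nothing to match.
m3 = re.search(r".ing", "ing")

print(repr(m1.group()))
print(repr(m2.group()))
print(m3)
```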

Square Brackets

Square brackets ([ ]) define a character class that matches any single character within the brackets. If the first character following the left square bracket is a caret (^), the square brackets define a character class that matches any single character not within the brackets. You can use a hyphen (-) to indicate a range of characters.

Within a character class definition, backslashes, asterisks, and dollar signs (all described in the following sections) lose their special meanings. A right square bracket (appearing as a member of the character class) can only appear as the first character following the left square bracket, and a caret is only special if it is the first character following the left bracket.


Regular expression    Meaning                                                                             Examples
/[bB]ill/             matches a member of the character class (b or B) followed by ill                    bill, Bill, billed
/t[aeiou].k/          matches t followed by a lower-case vowel, any character, and a k                    teak, talkative, took, tanker
/number [6-9]/        matches number followed by a <Space> and a member of the specified character class  number 6, number 8, number 9
/[^a-zA-Z]/           matches any character that is not a letter                                          1, 7, @, ., }
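
A short sketch of character classes, assuming Python's re module as a stand-in for the UNIX utilities (the word lists are made up for the illustration):

```python
import re

# [bB] accepts either case of the first letter.
bill_hits = [w for w in ["bill", "Bill", "billed", "will"]
             if re.search(r"[bB]ill", w)]

# t, a lower-case vowel, any one character, then k.
vowel_hits = [w for w in ["teak", "talkative", "took", "tanker", "trek"]
              if re.search(r"t[aeiou].k", w)]

# A range: only the digits 6 through 9 are members of the class.
in_range = re.search(r"number [6-9]", "number 8") is not None
out_of_range = re.search(r"number [6-9]", "number 5") is not None

# A leading caret inverts the class: everything that is not a letter.
non_letters = re.findall(r"[^a-zA-Z]", "a1b7@c.}")

print(bill_hits, vowel_hits, in_range, out_of_range, non_letters)
```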

Asterisk

An asterisk (*) can follow a regular expression that represents a single character. The asterisk represents zero or more occurrences of a match of the regular expression. An asterisk following a period matches any string of characters. (A period . matches any character, and an asterisk matches zero or more occurrences of the preceding regular expression.) A character class definition followed by an asterisk matches any string of characters that are members of the character class.

Regular expression    Meaning                                                                  Examples
/ab*c/                matches a followed by zero or more b's followed by a c                   ac, abc, abbc, abbbc
/ab.*c/               matches ab followed by zero or more other characters followed by c       abc, abxc, ab45c, xab 756.345 x cat
/t.*ing/              matches t followed by zero or more characters followed by ing            thing, ting, thought of going
/[a-zA-Z ]*/          matches a string composed only of letters and <Space>s                   any string without numbers or punctuation
/(.*)/                matches as long a string as possible between ( and )                     (this) and (that)
/([^)]*)/             matches the shortest string possible that starts with ( and ends with )  (this), (this and that)
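
The following sketch tries a few of these patterns in Python. It assumes Python's re module, where unquoted parentheses are special, so the literal ( and ) are written \( and \) here; in the utilities described in this section they are literal as written. The sample strings are illustrative.

```python
import re

# fullmatch requires the whole string to match, which makes the cases easy to compare.
candidates = ["ac", "abc", "abbc", "abbbc", "adc"]
ab_star_c = [s for s in candidates if re.fullmatch(r"ab*c", s)]

text = "Get (this) and (that);"

# .* runs as far as it can, so the match ends at the last ")".
longest = re.search(r"\(.*\)", text).group()

# [^)]* cannot cross a ")", so the match ends at the first ")".
shortest = re.search(r"\([^)]*\)", text).group()

print(ab_star_c)
print(longest)
print(shortest)
```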

Caret and Dollar Sign

A regular expression that begins with a caret (^) can only match a string at the beginning of a line. In a similar manner, a dollar sign at the end of a regular expression matches the end of a line.

Regular expression    Meaning                                                             Examples
/^T/                  matches a T at the beginning of a line                              That time ..., This line ...
/^+[0-9]/             matches a plus sign followed by a digit at the beginning of a line  +5 45.72, +759 Keep this ...
/:$/                  matches a colon that ends a line                                    ...below:
/^$/                  matches an empty line
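
A minimal sketch of anchoring, assuming Python's re module and a made-up list of lines. Python treats + as a special character, so the plus sign is quoted in this sketch, unlike in the table above.

```python
import re

lines = ["That time ...", "at That time", "+759 Keep this ...", "...below:", ""]

# ^ ties the match to the beginning of the line.
starts_with_T = [l for l in lines if re.search(r"^T", l)]
plus_digit = [l for l in lines if re.search(r"^\+[0-9]", l)]

# $ ties the match to the end of the line.
ends_with_colon = [l for l in lines if re.search(r":$", l)]

# Beginning immediately followed by end: an empty line.
empty_lines = [l for l in lines if re.search(r"^$", l)]

print(starts_with_T, plus_digit, ends_with_colon, empty_lines)
```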

Quoting Special Characters

You can quote any special character (but not a digit or a parenthesis) by preceding it with a backslash (\). Quoting a special character makes it represent itself.

Regular expression    Meaning                                                    Examples
/end\./               matches all strings that contain end followed by a period  pretend.mail, The end., send.
/\\/                  matches a single backslash                                 \
/\*/                  matches an asterisk                                        *.c, an asterisk (*)
/\[5\]/               matches the string [5]                                     it was five [5]
/and\/or/             matches and/or                                             and/or
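
Quoting can be sketched in Python as follows. This assumes Python's re module, which has no delimiters, so the forward slash in and/or needs no quoting there; the other quoted characters behave as in the table. The sample strings are illustrative.

```python
import re

# A quoted period matches only a literal period.
end_dot = re.search(r"end\.", "pretend.mail").group()
no_dot = re.search(r"end\.", "endless")

# A quoted backslash matches one literal backslash.
backslash = re.search(r"\\", r"C:\temp").group()

# Quoted asterisk and quoted square brackets represent themselves.
star = re.search(r"\*", "*.c").group()
brackets = re.search(r"\[5\]", "it was five [5]").group()

# re.escape adds the backslashes for you.
escaped = re.escape("[5]")

print(end_dot, no_dot, backslash, star, brackets, escaped)
```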

Rules

The following rules govern the application of regular expressions.

Longest Match Possible

As stated previously, a regular expression always matches the longest possible string starting as far toward the beginning of the line as possible. For example, given the following string:

    This (rug) is not what it once was (a long time ago), is it?

The expression:

    /Th.*is/

matches:

    This (rug) is not what it once was (a long time ago), is

and:

    /(.*)/

matches:

    (rug) is not what it once was (a long time ago)

however:

    /([^)]*)/

matches:

    (rug)
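
The three results above can be reproduced with a short sketch in Python. It assumes Python's re module, whose * is also greedy; the literal parentheses are quoted because Python treats unquoted parentheses as special.

```python
import re

text = "This (rug) is not what it once was (a long time ago), is it?"

# .* extends to the last "is" on the line.
th_is = re.search(r"Th.*is", text).group()

# .* extends from the first "(" to the last ")".
paren_long = re.search(r"\(.*\)", text).group()

# [^)]* stops before the first ")".
paren_short = re.search(r"\([^)]*\)", text).group()

print(th_is)
print(paren_long)
print(paren_short)
```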

One Regular Expression Does Not Exclude Another

If a regular expression is composed of two regular expressions, the first will match as long a string as possible but will not exclude a match of the second. Given the following string:

    singing songs, singing more and more

The expression:

    /s.*ing/

matches:

    singing songs, singing

and:

    /s.*ing song/

matches:

    singing song
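
A brief sketch of this rule, assuming Python's re module behaves like the utilities for these two patterns: the .* first takes as much as it can, then gives characters back until the rest of the expression can match.

```python
import re

text = "singing songs, singing more and more"

# Alone, s.*ing reaches the last "ing" on the line.
first = re.search(r"s.*ing", text).group()

# With " song" appended, .* must stop early enough for " song" to follow.
second = re.search(r"s.*ing song", text).group()

print(first)
print(second)
```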

Empty Regular Expressions

An empty regular expression (//) always represents the last regular expression that you used. For example, if you give vi the following Substitute command:

s/mike/robert/

and then you want to make the same substitution again, you can use the command:

s//robert/

The empty regular expression represents the last regular expression you used (/mike/).
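
Python's re module has no equivalent of the empty regular expression, so the following is only a sketch of the idea. The class Substituter and its method s are hypothetical names invented for this illustration; they simply remember the last pattern the way vi does.

```python
import re

class Substituter:
    """Hypothetical helper that remembers the last regular expression used."""

    def __init__(self):
        self.last_pattern = None

    def s(self, pattern, replacement, line):
        # An empty pattern stands for the most recently used pattern.
        if pattern == "":
            if self.last_pattern is None:
                raise ValueError("no previous regular expression")
            pattern = self.last_pattern
        self.last_pattern = pattern
        # Replace only the first match on the line, like s/.../.../ without a g flag.
        return re.sub(pattern, replacement, line, count=1)

editor = Substituter()
first = editor.s("mike", "robert", "mike called mike")
second = editor.s("", "robert", first)

print(first)
print(second)
```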

Bracketing Expressions

You can use quoted parentheses, \( and \), to bracket a regular expression. The string that the bracketed regular expression matches can subsequently be used, as explained below in "Quoted Digits". A regular expression does not attempt to match quoted parentheses. Thus, a regular expression enclosed within quoted parentheses matches what the same regular expression without the parentheses would match.

The expression:

    /\(rexp\)/

matches what /rexp/ would match, and:

    /a\(b*\)c/

matches what /ab*c/ would match.

You can nest quoted parentheses. The following expression consists of two bracketed expressions, one within the other.

/\([a-z]\([A-Z]*\)\)/

The bracketed expressions are identified only by the opening \('s, so there is no ambiguity in identifying them.
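
Nested bracketing can be sketched in Python, with one important difference that this example assumes throughout: in Python's re module the roles are reversed, so plain ( and ) bracket an expression and \( and \) match literal parentheses. The sample string is made up.

```python
import re

text = "12xABCd"

# Python spelling of /\([a-z]\([A-Z]*\)\)/ : groups are numbered by their opening parenthesis.
m = re.search(r"([a-z]([A-Z]*))", text)
outer = m.group(1)   # bracketed expression beginning with the 1st opening parenthesis
inner = m.group(2)   # bracketed expression beginning with the 2nd opening parenthesis

# Bracketing does not change what the whole expression matches.
plain = re.search(r"[a-z][A-Z]*", text).group()

print(outer, inner, plain)
```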

The Replacement String

The vi and sed editors use regular expressions as Search Strings within Substitute commands. You can use two special characters, ampersands (&) and quoted digits (\n), to represent the matched strings within the corresponding Replacement String.

Ampersands

Within a Replacement String, an ampersand (&) takes on the value of the string that the Search String (regular expression) matched.

For example, the following Substitute command surrounds a string of one or more digits with NN. The ampersand in the Replacement String represents whatever string of digits the regular expression (Search String) matched. Two character class definitions are required because the regular expression [0-9]* matches zero or more occurrences of a digit, so on its own it would match the empty string at the beginning of any line.

s/[0-9][0-9]*/NN&NN/
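
The effect of this command can be sketched in Python, with two assumptions to note: Python's re.sub writes the whole match as \g<0> rather than &, and count=1 is used to imitate a Substitute command that changes only the first match on a line. The sample line is made up.

```python
import re

line = "call 5551212 now"

# [0-9][0-9]* needs at least one digit, so it finds the number.
good = re.sub(r"[0-9][0-9]*", r"NN\g<0>NN", line, count=1)

# [0-9]* is satisfied by zero digits, so it matches the empty string at the start of the line.
bad = re.sub(r"[0-9]*", r"NN\g<0>NN", line, count=1)

print(good)
print(bad)
```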

Quoted Digits

Within the regular expression itself, a quoted digit (\n) takes on the value of the string that the bracketed regular expression beginning with the nth \( matched.

Within a Replacement String, a quoted digit represents the string that the bracketed regular expression (portion of the Search String) beginning with the nth \( matched.

For example, you can take a list of people in the form

    last-name, first-name initial

and put it in the following format

    first-name initial last-name

with the following vi command:

    1,$s/\([^,]*\), \(.*\)/\2 \1/

First the command addresses all the lines (1,$) in the file. The Substitute command (s) uses a Search String and a Replacement String delimited by forward slashes. The first bracketed regular expression within the Search String, \([^,]*\), matches what the same unbracketed regular expression, [^,]*, would match. This regular expression matches a string of zero or more characters not containing a comma (the last-name). Following the first bracketed regular expression is a comma and a <Space> that match themselves. The second bracketed expression \(.*\) matches any string of characters (the first-name and initial).

The Replacement String consists of what the second bracketed regular expression matched (\2) followed by a <Space> and what the first bracketed regular expression matched (\1).
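
The same rearrangement can be sketched in Python, assuming its re module, where bracketed expressions are written with plain ( and ) while \1 and \2 keep their meaning. The names in the list are made up for the illustration.

```python
import re

names = ["Smith, John Q", "Doe, Jane A"]

# Group 1 is the last-name (everything up to the comma); group 2 is the rest of the line.
swapped = [re.sub(r"([^,]*), (.*)", r"\2 \1", n, count=1) for n in names]

# A quoted digit inside the pattern itself: any lower-case letter followed by the same letter.
doubled = re.search(r"([a-z])\1", "book").group()

print(swapped)
print(doubled)
```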

Summary of Regular Expressions

A regular expression defines a set of one or more strings of characters. A regular expression is said to match any string it defines.
The following characters are special within a regular expression.

Special character    Function
.                    matches any single character
[xyz]                defines a character class that matches x, y, or z
[^xyz]               defines a character class that matches any character except x, y, or z
[x-z]                defines a character class that matches any character x through z inclusive
*                    matches zero or more occurrences of a match of the preceding character
^                    forces a match to the beginning of a line
$                    forces a match to the end of a line
\                    used to quote special characters
\(xyz\)              matches what xyz matches (a bracketed regular expression)

The following characters are special within a Replacement String.

Character    Function
&            represents what the regular expression (Search String) matched
\n           a quoted number, n, represents the nth bracketed regular expression in the Search String