Abstract
| Overview | |
|---|---|
| Goal | To write regular expressions using grep to isolate or locate content in text files. |
Write regular expressions to match data.
After completing this section, students should be able to:
Create regular expressions that match desired data.
Use grep to apply regular expressions to text files.
Regular expressions provide a pattern-matching language that lets applications sift through data looking for specific content. In addition to vim, grep, and less, programming languages such as Perl, Python, and C support regular expressions for pattern matching, either natively or through libraries.
Regular expressions are a language of their own, which means they have their own syntax and rules. This section looks at the syntax used to create regular expressions and shows some examples of their use.
A simple regular expression
The simplest regular expression is an exact match. An exact match is when the characters in the regular expression match the type and order in the data that is being searched.
Suppose that a user was looking through the following file of data for all occurrences of the pattern cat:

    cat
    dog
    concatenate
    dogma
    category
    educated
    boondoggle
    vindication
    chilidog

cat is an exact match of a c, followed by an a, followed by a t. Using cat as the regular expression while searching the previous file gives the following matches:

    cat
    concatenate
    category
    educated
    vindication
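The exact match above can be reproduced with grep. This is a minimal sketch: the temporary file created with mktemp and the variable names are assumptions made for illustration, not part of the original material.

```shell
# Recreate the sample file from the text, one word per line.
# The temporary path is illustrative; any file name would do.
datafile=$(mktemp)
printf '%s\n' cat dog concatenate dogma category educated \
    boondoggle vindication chilidog > "$datafile"

# Print every line that contains the sequence c, a, t anywhere in it.
matches=$(grep 'cat' "$datafile")
printf '%s\n' "$matches"

rm -f "$datafile"
```

grep prints each whole line that contains a match, so the five words listed above appear one per line.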
Using line anchors
The previous section used an exact match regular expression on a file of data. Note that the regular expression would match the data no matter where on the line it occurred: beginning, end, or middle of the word or line. One way to control where the regular expression looks for a match is a line anchor.
Use a ^, a beginning of line anchor, or $, an end of line anchor. Using the file from earlier:

    cat
    dog
    concatenate
    dogma
    category
    educated
    boondoggle
    vindication
    chilidog

To have the regular expression match cat, but only if it occurs at the beginning of the line in the file, use ^cat. Applying the regular expression ^cat to the data would yield the following matches:

    cat
    category
If users only wanted to locate lines in the file that ended with dog, use that exact expression and an end of line anchor to create the regular expression dog$. Applying dog$ to the file would find two matches:

    dog
    chilidog
If users wanted to make sure that the pattern was the only thing on a line, use both the beginning and end of line anchors. ^cat$ would locate only one line in the file: one with a beginning of line, a c, followed by an a, followed by a t, and then an end of line.
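The three anchored patterns can be tried side by side with grep. The sketch below assumes a temporary file built with mktemp; the file and variable names are illustrative only.

```shell
# Recreate the sample file from the text, one word per line.
datafile=$(mktemp)
printf '%s\n' cat dog concatenate dogma category educated \
    boondoggle vindication chilidog > "$datafile"

# Single quotes keep the shell from interpreting $ or ^ itself.
starts_with_cat=$(grep '^cat' "$datafile")   # lines beginning with cat
ends_with_dog=$(grep 'dog$' "$datafile")     # lines ending with dog
only_cat=$(grep '^cat$' "$datafile")         # lines containing only cat

printf 'starts with cat:\n%s\n' "$starts_with_cat"
printf 'ends with dog:\n%s\n' "$ends_with_dog"
printf 'only cat:\n%s\n' "$only_cat"

rm -f "$datafile"
```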
Another type of anchor is the word boundary. \< matches the beginning of a word and \> matches the end of a word.
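Word boundaries are easier to see on lines that hold more than one word. The sketch below uses four made-up sample lines (they are not from the original file) and assumes GNU grep, where \< and \> are supported as extensions to the basic syntax.

```shell
# Hypothetical sample lines, chosen so that cat appears at the start,
# middle, and end of different words.
datafile=$(mktemp)
printf '%s\n' 'the cat sat' 'concatenate files' 'a category' \
    'bobcat tracks' > "$datafile"

word_start=$(grep '\<cat' "$datafile")    # cat at the start of a word
word_end=$(grep 'cat\>' "$datafile")      # cat at the end of a word
whole_word=$(grep '\<cat\>' "$datafile")  # cat as a complete word

printf 'word start:\n%s\n' "$word_start"
printf 'word end:\n%s\n' "$word_end"
printf 'whole word:\n%s\n' "$whole_word"

rm -f "$datafile"
```

Only the first line contains cat as a complete word, so it is the only line the last pattern selects.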
Wildcards and multipliers
Regular expressions use a . as the unrestricted wildcard
character. A regular expression of c.t will look for data
containing a c, followed by any one character, followed by a
t. Examples of data that would match this regular expression's
pattern are cat, cot, and cut, but also c5t and cQt.
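A quick way to check the unrestricted wildcard is to run it against the examples just listed. In this sketch, ct and coat are extra words added as assumptions to show what the pattern does not match.

```shell
# Words from the text, plus ct and coat as illustrative non-matches.
datafile=$(mktemp)
printf '%s\n' cat cot cut c5t cQt ct coat > "$datafile"

# The . stands for exactly one character of any kind.
matches=$(grep 'c.t' "$datafile")
printf '%s\n' "$matches"

rm -f "$datafile"
```

ct is skipped because nothing sits between the c and the t, and coat is skipped because two characters do.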
Another type of wildcard used in regular expressions is a set of acceptable
characters at a specific character position. When using an unrestricted
wildcard, users could not predict the character that would match the wildcard;
however, if users wanted to only match the words cat, cot, and cut, but not
odd items like c5t or cQt, replace the unrestricted wildcard with
one where acceptable characters are specified. If the
regular expression was changed to c[aou]t, it would be specifying that the
regular expression should match patterns that start with a c,
are followed by an a or an o or a u,
followed by a t.
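The same sample words show how the bracketed set narrows the match. This is a sketch with an illustrative temporary file, not part of the original material.

```shell
# Same words as the wildcard discussion in the text.
datafile=$(mktemp)
printf '%s\n' cat cot cut c5t cQt > "$datafile"

# [aou] accepts exactly one character, and only a, o, or u.
matches=$(grep 'c[aou]t' "$datafile")
printf '%s\n' "$matches"

rm -f "$datafile"
```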
Multipliers are a mechanism used often with wildcards. Multipliers apply
to the previous character in the regular expression. One of the more common
multipliers used is *. A *, when used in a
regular expression, modifies the previous character to mean zero to
infinitely many of that character. If a regular expression
of c.*t was used, it would match ct, cat, coat, culvert, etc.; any
data that started with a c, then zero to infinitely many
characters, ending with a t.
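The c.*t pattern can be checked against the words named above. In this sketch, cow and dog are added as assumed non-matching examples.

```shell
# Words from the text, plus cow and dog as illustrative non-matches.
datafile=$(mktemp)
printf '%s\n' ct cat coat culvert cow dog > "$datafile"

# .* means zero or more of any character between the c and the t.
matches=$(grep 'c.*t' "$datafile")
printf '%s\n' "$matches"

rm -f "$datafile"
```

cow has a c but no later t, and dog has neither, so both are left out.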
Another type of multiplier specifies how many times the previous character must occur in the pattern. An example of using an explicit multiplier would be c.\{2\}t. Using this regular expression, users are looking for data that contains a c, followed by exactly two characters of any kind, followed by a t.
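The explicit multiplier can be tried the same way. The sample words in this sketch are assumptions chosen to include both matching and non-matching cases.

```shell
# Illustrative words with zero, one, two, and more characters
# between the c and the t.
datafile=$(mktemp)
printf '%s\n' ct cat coat cost culvert > "$datafile"

# .\{2\} requires exactly two characters between the c and the t.
matches=$(grep 'c.\{2\}t' "$datafile")
printf '%s\n' "$matches"

rm -f "$datafile"
```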
The previous examples use the basic regular expression syntax that grep applies by default. The syntax differs slightly between regular expression implementations (Bash, Python, Perl, etc.).
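One such difference can be seen within grep itself, which accepts both basic and extended regular expressions (the latter with the -E option). The sketch below uses assumed sample words, including the literal string cx{2}t, to show that the braces need backslashes in the basic syntax but not in the extended one.

```shell
# Illustrative data; the last line contains literal brace characters.
datafile=$(mktemp)
printf '%s\n' coat cost cat 'cx{2}t' > "$datafile"

basic=$(grep 'c.\{2\}t' "$datafile")     # basic syntax: escaped braces
extended=$(grep -E 'c.{2}t' "$datafile") # extended syntax: bare braces
literal=$(grep 'c.{2}t' "$datafile")     # basic syntax: braces are literal

printf 'basic:\n%s\n' "$basic"
printf 'extended:\n%s\n' "$extended"
printf 'literal braces:\n%s\n' "$literal"

rm -f "$datafile"
```

The first two commands select the same lines, while the third treats {2} as ordinary text and only finds the line that contains it literally.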
regex(7) man page