Chapter 2.  Using Regular Expressions with grep

Abstract

Overview
Goal To write regular expressions using grep to isolate or locate content in text files.
Objectives

  • Create regular expressions to match text patterns.

  • Use grep to locate content in files.

Sections
  • Regular Expression Fundamentals (and Practice)

  • Matching Text with grep (and Practice)

  • Using grep with Logs (and Practice)

Lab
  • Using Regular Expressions with grep

Regular Expression Fundamentals

Write regular expressions to match data.

Objectives

After completing this section, students should be able to:

  • Create regular expressions that match desired data.

  • Use grep to apply regular expressions to text files.

Writing regular expressions

Regular expression fundamentals

Regular expressions are a pattern-matching language that lets applications sift through data looking for specific content. Tools such as vim, grep, and less use regular expressions, and so do programming languages such as Perl, Python, and C (the latter through library routines) when they apply pattern-matching criteria.

Regular expressions are a language of their own, which means they have their own syntax and rules. This section covers the syntax used to create regular expressions and shows some examples of their use.

A simple regular expression

The simplest regular expression is an exact match. An exact match occurs when the characters in the regular expression appear, in the same order, in the data that is being searched.

Suppose that a user is searching the following file of data for all occurrences of the pattern cat:

cat
dog
concatenate
dogma
category
educated
boondoggle
vindication
chilidog

cat is an exact match of a c, followed by an a, followed by a t. Using cat as the regular expression while searching the previous file gives the following matches:

cat
concatenate
category
educated
vindication
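
The search above can be reproduced with grep. The following is a minimal sketch; sample_words is a hypothetical helper (not part of the course material) that stands in for the data file by printing the same nine words, one per line.

```shell
# Hypothetical helper: prints the sample data from the text, one word per line.
sample_words() {
  printf '%s\n' cat dog concatenate dogma category educated \
    boondoggle vindication chilidog
}

# Keep every line that contains the literal sequence c, a, t anywhere on it.
matches=$(sample_words | grep 'cat')
printf '%s\n' "$matches"
```

With a real file, the equivalent command would take the file name as its last argument instead of reading from a pipe.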

Using line anchors

The previous section used an exact match regular expression on a file of data. Note that the regular expression matched the data no matter where on the line it occurred: at the beginning, the end, or the middle of the word or line. One way to control where the regular expression looks for a match is a line anchor.

Use ^ as a beginning of line anchor, or $ as an end of line anchor. Using the file from earlier:

cat
dog
concatenate
dogma
category
educated
boondoggle
vindication
chilidog

To have the regular expression match cat, but only if it occurs at the beginning of the line in the file, use ^cat. Applying the regular expression ^cat to the data would yield the following matches:

cat
category

To locate only the lines in the file that end with dog, combine that exact expression with an end of line anchor to create the regular expression dog$. Applying dog$ to the file would find two matches:

dog
chilidog

To make sure that the pattern is the only thing on a line, use both the beginning and end of line anchors. ^cat$ would locate only one line in the file: one with a beginning of line, a c, followed by an a, followed by a t, and then an end of line.
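
The three anchored patterns can be tried side by side. This is a sketch only; sample_words is a hypothetical helper that prints the sample data in place of a real file. The patterns are single-quoted so that the shell does not treat $ as the start of a variable name.

```shell
# Hypothetical helper: prints the sample data from the text, one word per line.
sample_words() {
  printf '%s\n' cat dog concatenate dogma category educated \
    boondoggle vindication chilidog
}

# ^ anchors the match to the beginning of the line.
starts_with_cat=$(sample_words | grep '^cat')

# $ anchors the match to the end of the line.
ends_with_dog=$(sample_words | grep 'dog$')

# Both anchors together: the line must contain cat and nothing else.
only_cat=$(sample_words | grep '^cat$')

printf '^cat  matches:\n%s\n' "$starts_with_cat"
printf 'dog$  matches:\n%s\n' "$ends_with_dog"
printf '^cat$ matches:\n%s\n' "$only_cat"
```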

Another type of anchor is the word boundary. \< and \> match the beginning and the end of a word, respectively.
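
Word boundaries are easier to see on lines that hold more than one word. The sketch below uses a few made-up lines (they are illustrative assumptions, not part of the sample file), and it assumes GNU grep, where \< and \> are supported as extensions to the standard syntax.

```shell
# Hypothetical helper: prints a few illustrative lines containing several words.
sample_lines() {
  printf '%s\n' 'the cat sat' 'concatenate files' 'a bobcat' 'cat' 'category list'
}

# \< requires cat to begin a word.
word_start=$(sample_lines | grep '\<cat')

# \> requires cat to end a word.
word_end=$(sample_lines | grep 'cat\>')

# Both together: cat must be a whole word on its own.
whole_word=$(sample_lines | grep '\<cat\>')

printf 'begins a word:\n%s\n' "$word_start"
printf 'ends a word:\n%s\n' "$word_end"
printf 'whole word:\n%s\n' "$whole_word"
```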

Wildcards and multipliers

Regular expressions use a . as the unrestricted wildcard character. A regular expression of c.t will look for data containing a c, followed by any one character, followed by a t. Examples of data that would match this regular expression's pattern are cat, cot, and cut, but also c5t and cQt.
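
A short sketch of the unrestricted wildcard follows. The input includes the words from the paragraph above plus two assumed extras, ct and coat, to show that the . stands for exactly one character.

```shell
# c.t needs exactly one character, of any kind, between the c and the t.
wildcard_matches=$(printf '%s\n' cat cot cut c5t cQt ct coat | grep 'c.t')
printf '%s\n' "$wildcard_matches"
```

Here ct is rejected because nothing sits between the c and the t, and coat is rejected because two characters do.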

Another type of wildcard used in regular expressions is a set of acceptable characters at a specific character position. With an unrestricted wildcard, users cannot predict which character will match the wildcard. To match only the words cat, cot, and cut, but not odd items like c5t or cQt, replace the unrestricted wildcard with one that lists the acceptable characters. The regular expression c[aou]t matches a c, followed by an a or an o or a u, followed by a t.
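
The character set can be checked against the same kind of input. In this sketch, cet is an assumed extra word added to show that a letter outside the set is rejected just like a digit is.

```shell
# [aou] accepts exactly one character, and only if it is a, o, or u.
set_matches=$(printf '%s\n' cat cot cut c5t cQt cet | grep 'c[aou]t')
printf '%s\n' "$set_matches"
```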

Multipliers are a mechanism often used with wildcards. A multiplier applies to the previous character in the regular expression. One of the more common multipliers is *. A *, when used in a regular expression, modifies the previous character to mean zero or more of that character. The regular expression c.*t would match ct, cat, coat, culvert, and so on: any data that contains a c, then zero or more characters of any kind, then a t.
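
Because * acts on whatever precedes it, the same multiplier behaves differently after a wildcard than after a literal character. The following sketch compares the two; the words cab and tac are assumed extras included as non-matching input.

```shell
# Hypothetical helper: prints a small word list for the comparison.
star_words() {
  printf '%s\n' ct cat coat culvert cab tac
}

# .* means zero or more of any character between the c and the t.
any_between=$(star_words | grep 'c.*t')

# a* means zero or more of the letter a, and nothing else, between the c and the t.
only_a_between=$(star_words | grep 'ca*t')

printf 'c.*t matches:\n%s\n' "$any_between"
printf 'ca*t matches:\n%s\n' "$only_a_between"
```

In this input, tac fails both patterns because its t comes before the c, and coat fails ca*t because an o separates the c from the rest.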

Another type of multiplier indicates exactly how many of the previous character are desired in the pattern. An example of an explicit multiplier is c.\{2\}t. This regular expression looks for data that contains a c, followed by exactly two characters of any kind, followed by a t.
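
A sketch of the explicit multiplier, using an assumed word list, shows that neither fewer nor more than two characters between the c and the t will do:

```shell
# .\{2\} means exactly two characters of any kind.
exactly_two=$(printf '%s\n' ct cat coat cart culvert | grep 'c.\{2\}t')
printf '%s\n' "$exactly_two"
```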

Note

The previous examples use the basic regular expression syntax that grep applies by default. There are some slight differences in the syntax used for regular expressions between different implementations (Bash, Python, Perl, etc.).
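
One such difference can be seen within grep itself. As a sketch, using an assumed word list: the default syntax writes the explicit multiplier with backslashes, while the extended syntax selected by the -E option writes it without them.

```shell
# Hypothetical helper: prints a small word list for the comparison.
interval_words() {
  printf '%s\n' ct cat coat cart culvert
}

# Default (basic) syntax: the braces are escaped.
basic_result=$(interval_words | grep 'c.\{2\}t')

# Extended syntax (-E): the braces are written bare.
extended_result=$(interval_words | grep -E 'c.{2}t')

printf 'basic:\n%s\n' "$basic_result"
printf 'extended:\n%s\n' "$extended_result"
```

Both commands select the same lines; only the notation differs.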

References

regex(7) man page

Revision: rh134-7-c643331