With the Terminal: Using regular expressions

One of the things I have always loved about the Linux terminal is what you can achieve using regular expressions. Whether we need to find complicated text or replace it with something else, regular expressions can greatly simplify the job. Let's start at the beginning:

WARNING: This post is a pain in the ass. Reading it all in one go can cause loss of consciousness. Take breaks in between, or consult your doctor or pharmacist before reading the entire post.

What is a regular expression?

A regular expression is a series of special characters that allow us to describe a text that we want to find. For example, if we wanted to find the word "linux" it would be enough to put that word in the program we are using. The word itself is a regular expression. So far it seems very simple, but what if we want to find all the numbers in a certain file? Or all the lines that start with a capital letter? In those cases you can no longer put a simple word. The solution is to use a regular expression.

Regular expressions vs. file patterns.

Before we get into the subject, I want to clear up a common misunderstanding. A regular expression is not what we pass as a parameter to commands like rm, cp, etc. to refer to several files on the hard drive. That would be a file pattern. Regular expressions, although similar in that they share some characters, are different. A file pattern is matched against the file names on disk and returns the names that match the pattern in full, while a regular expression is matched against text and returns the lines that contain the searched text. For example, the regular expression corresponding to the file pattern *.* would be something like ^.*\..*$
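The difference is easy to see in the shell itself. This is only a minimal sketch, assuming a scratch directory with three made-up file names (notes.txt, archive.tar.gz and README are purely illustrative): the shell expands the file pattern on its own, while grep applies the regular expression to the lines of text printed by ls.

```shell
# Work in a throwaway directory so no real files are touched.
dir=$(mktemp -d)
touch "$dir/notes.txt" "$dir/archive.tar.gz" "$dir/README"   # hypothetical names

# File pattern: the shell expands *.* into the matching names before echo runs.
glob_result=$(cd "$dir" && echo *.*)

# Regular expression: grep filters the text that ls prints, one name per line.
regex_result=$(ls "$dir" | grep '^.*\..*$')

echo "$glob_result"
echo "$regex_result"
rm -rf "$dir"
```

Both approaches leave out README, because it has no "." in its name, but they get there by completely different routes.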

Types of regular expressions.

Not all programs use the same regular expressions. Far from it. There are several more or less standard types, but some programs change the syntax slightly, add their own extensions, or even use completely different characters. Therefore, when you want to use regular expressions with a program you do not know well, the first thing to do is check its manual or documentation to see what kind of regular expressions it recognizes.

First, there are the two main types defined in the POSIX standard, which is what Linux tools follow: basic and extended regular expressions. Many of the commands that work with regular expressions, such as grep or sed, allow you to use both types. I will talk about them below. There are also Perl-style regular expressions, and then there are programs like vim or emacs that use variants of these. Depending on what we want to do, one or the other may be more appropriate.

Testing regular expressions.

The syntax of regular expressions is far from trivial. When we have to write a complicated regular expression we end up facing a string of special characters that is impossible to understand at first glance, so to learn how to use them it is essential to have a way to run all the tests we want and see the results easily. That is why I am now going to list several commands with which we can test and experiment as much as we need until we have mastered regular expressions.

The first one is the grep command. This is the command we will use most frequently to do searches. The syntax is as follows:

grep [-E] 'REGEX' FILE
COMMAND | grep [-E] 'REGEX'

I recommend always putting regular expressions in single quotes so that the shell doesn't mess with them. The first form searches for a regular expression in a file. The second filters the output of a command through a regular expression. By default, grep uses basic regular expressions. The -E option is for using extended regular expressions.

A trick that helps to see how regular expressions work is to enable color in the grep command. That way, the part of the text that matches the regular expression we are using will be highlighted. To activate color in grep, just make sure that the environment variable GREP_OPTIONS contains the value --color, which can be done with this command:

export GREP_OPTIONS=--color

We can put it in .bashrc to have it always active. Recent versions of GNU grep treat GREP_OPTIONS as obsolete and may ignore it; in that case an alias such as alias grep='grep --color=auto' in .bashrc does the same job.

Another way to use regular expressions is by using the sed command. This is more suitable for replacing text, but can also be used for searching. The syntax for it would be like this:

sed -n[r] '/REGEX/p' FILE
COMMAND | sed -n[r] '/REGEX/p'

The sed command also uses basic regular expressions by default; you can use extended regular expressions with the -r option.

Another command I want to mention is awk. This command can be used for many things, as it allows you to write scripts in its own programming language. If what we want is to find a regular expression in a file or in the output of a command, the way to use it would be the following:

awk '/REGEX/' FILE
COMMAND | awk '/REGEX/'

This command always uses extended regular expressions.
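To check that the three tools really behave the same way for a simple search, here is a small sketch. It uses two lines borrowed from the sample text shown further down, written to a temporary file; the variable names are just my own choice for illustration.

```shell
# Two lines taken from the sample text used in the rest of the post.
sample=$(mktemp)
printf '%s\n' 'Debian: 16/08/1993' 'Ubuntu: 20/10/2004' > "$sample"

# The same search with each tool; all three print only the matching line.
with_grep=$(grep 'Debian' "$sample")
with_sed=$(sed -n '/Debian/p' "$sample")
with_awk=$(awk '/Debian/' "$sample")

printf '%s\n' "$with_grep" "$with_sed" "$with_awk"
rm -f "$sample"
```

Each of the three variables ends up holding the line "Debian: 16/08/1993", so for plain searches the choice of tool is mostly a matter of taste.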

To do our tests we will also need some sample text to search in. We can use the following:

- Lista de páginas wiki:

ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/

- Fechas de lanzamiento:

Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004

Desde Linux Rulez.

This is the text that I will use for the examples in the rest of the post, so I recommend that you copy it into a file to have it handy from the terminal. You can give it any name you want. I have called it regex.

The lesson begins.

Now we have everything we need to start testing regular expressions. Let's go little by little. I am going to show several example searches with regular expressions, explaining what each character is for. They are not very good examples, but since this post is already going to be very long, I don't want to complicate it any further. And I'm only going to scratch the surface of what can be done with regular expressions.

The simplest case of all is searching for a specific word. For example, suppose we want to find all the lines that contain the word "Linux". This is the easiest, since we only have to write:

grep 'Linux' regex

And we can see the result:

ArchLinux: https://wiki.archlinux.org/
Arch Linux: 11-03-2002
Desde Linux Rulez.

These are the three lines that contain the word "Linux" which, if we have used the color trick, will appear highlighted. Note that it recognizes the word we are looking for even if it is part of a longer word as in "ArchLinux". However, it does not highlight the word "linux" that appears in the URL "https://wiki.archlinux.org/". That's because it appears there with the lowercase "l" and we have looked for it in uppercase. The grep command has options for this, but I'm not going to talk about them in an article on regular expressions.

With this simple test we can draw the first conclusion:

A normal character put into a regular expression matches itself.

Which is to say that if you put the letter "a" it will look for the letter "a". It seems logical, right? 🙂

Now suppose we want to search for the word "CentO" followed by any character, but only a single character. For this we can use the "." character, which is a wildcard that matches any character, but only one:

grep 'CentO.' regex

And the result is:

CentOS: http://wiki.centos.org/
CentOs: 14-05-2004 03:32:38

Which means that it accepts the character after "CentO" whether it is the uppercase "S" of "CentOS" or the lowercase "s" of "CentOs". If any other character appeared in that place, it would match as well. We already have the second rule:

The character "." matches any single character.

It is no longer as trivial as it seemed, but we still cannot do much with this. Let's go a little further. Suppose we want to find the lines in which the year 2002 or 2004 appears. That looks like two searches, but they can be done at once like this:

grep '200[24]' regex

Which means that we want to find the number 200 followed by 2 or 4. And the result is this:

Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Ubuntu: 20/10/2004

Which brings us to the third rule:

Multiple characters enclosed in brackets match any of the characters within the brackets.

Brackets can do more: they can also be used to exclude characters. For example, suppose we want to find the places where the ":" character appears but is not followed by "/". The command would be like this:

grep ':[^/]' regex

It is simply a matter of putting a "^" as the first character inside the brackets. You can put as many characters as you want after it. The result of this last command is the following:

ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/
Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004

Now the ":" after the distro names is highlighted, but not the ones in the URLs, because those are followed by "/".

Putting the "^" character as the first thing inside the brackets matches any character except the ones listed in the brackets.

Another thing we can do is specify a range of characters. For example, to search for any digit followed by a "-" it would look like this:

grep '[0-9]-' regex

With this we are specifying a character between 0 and 9 and then a minus sign. Let's see the result:

Arch Linux: 11-03-2002
CentOs: 14-05-2004 03:32:38

Multiple ranges can be specified within the brackets, and ranges can even be mixed with single characters.

Placing two characters separated by "-" within the brackets matches any character within the range.
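As a quick illustration of mixing single characters inside the brackets, here is a sketch using three lines from the sample text. One detail I am relying on: to include a literal "-" in the brackets it has to go first or last, otherwise it is read as part of a range.

```shell
# Three lines taken from the sample text.
sample=$(mktemp)
printf '%s\n' 'Arch Linux: 11-03-2002' 'Gentoo: 31/03/2002' 'Desde Linux Rulez.' > "$sample"

# A digit, then a dash, then a digit: only the Arch Linux date matches.
dash_only=$(grep '[0-9]-[0-9]' "$sample")

# [/-] is a set of two single characters; the "-" goes last so it is literal.
# Now both date styles match.
either_separator=$(grep '[0-9][/-][0-9]' "$sample")

echo "$dash_only"
echo "$either_separator"
rm -f "$sample"
```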

Let's see now if we can select the first part of the URLs. The one that says "http" or "https". They only differ in the final "s", so let's do it as follows:

grep -E 'https?' regex

The question mark is used to make the character to its left optional. But now we have added the -E option to the command. This is because the question mark is a feature of extended regular expressions. Until now we were using basic regular expressions, so no option was needed. Let's see the result:

ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/

So we already have a new rule:

A character followed by "?" matches that character or none. This is only valid for extended regular expressions.
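To see what happens if we forget the -E option, here is a small sketch with two lines from the sample text. It uses grep's -c option, which I have not covered in this post and which simply counts the matching lines.

```shell
# Two lines taken from the sample text.
sample=$(mktemp)
printf '%s\n' 'CentOS: http://wiki.centos.org/' 'Debian: https://wiki.debian.org/' > "$sample"

# Extended syntax: "s?" is an optional "s", so both lines match.
extended_count=$(grep -E -c 'https?:' "$sample")

# Basic syntax: "?" is an ordinary character, so grep looks for the literal
# text "https?:" and finds nothing. grep exits non-zero when nothing matches,
# hence the "|| true".
basic_count=$(grep -c 'https?:' "$sample" || true)

echo "extended: $extended_count, basic: $basic_count"   # extended: 2, basic: 0
rm -f "$sample"
```

So a missing -E does not produce an error, just a search for something different, which makes it an easy mistake to overlook.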

Now we are going to look for two completely different words. Let's see how to find the lines that contain either the word "Debian" or the word "Ubuntu".

grep -E 'Debian|Ubuntu' regex

With the vertical bar we can separate two or more different regular expressions and find the lines that match any of them:

Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/
Debian: 16/08/1993
Ubuntu: 20/10/2004

The "|" character serves to separate several regular expressions and matches with any of them. It is also specific to extended regular expressions.

Let's continue. Now we are going to look for the word "Linux", but only where it is not attached to another word on its left. We can do it like this:

grep '\<Linux' regex

Here the important character is "<", but it needs to be escaped by putting "\" in front of it so that grep interprets it as a special character. The result is as follows:

Arch Linux: 11-03-2002
Desde Linux Rulez.

You can also use "\>" to match the end of a word, that is, to find words that are not attached to anything on their right. Let's go with an example. Let's try this command:

grep 'http\>' regex

The output it produces is this:

CentOS: http://wiki.centos.org/

"http" was matched, but not "https", because in "https" there is still a character to the right of the "p" that can be part of a word.

The characters "<" and ">" match the beginning and end of a word, respectively. These characters must be escaped so that they are not interpreted as literal characters.
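Both markers can be combined to search for a whole word. The following is just a sketch with three lines from the sample text, assuming a grep that supports "\<" and "\>" (GNU grep does), and again using -c to count the matching lines.

```shell
# Three lines taken from the sample text.
sample=$(mktemp)
printf '%s\n' 'ArchLinux: https://wiki.archlinux.org/' 'Arch Linux: 11-03-2002' 'Desde Linux Rulez.' > "$sample"

# "Linux" anywhere, even glued to another word: all three lines.
anywhere=$(grep -c 'Linux' "$sample")

# "Linux" as a whole word: "ArchLinux" no longer counts.
whole_word=$(grep -c '\<Linux\>' "$sample")

# "Arch" as a whole word: only the line where it is followed by a space.
arch_word=$(grep '\<Arch\>' "$sample")

echo "$anywhere $whole_word"   # 3 2
echo "$arch_word"              # Arch Linux: 11-03-2002
rm -f "$sample"
```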

Let's move on to slightly more complicated things. The "+" character matches the character to its left, repeated at least once. This character is only available with extended regular expressions. With it we can search, for example, for sequences of digits that come right after a ":".

grep -E ':[0-9]+' regex

Result:

CentOs: 14-05-2004 03:32:38

The number 38 is also highlighted because it also comes right after a ":".

The "+" character matches the character to its left, repeated at least once.

You can also control the number of repetitions using "{" and "}". The idea is to put in braces a number that indicates the exact number of repetitions we want. You can also put a range. Let's see examples of the two cases.

First we are going to find all the four-digit sequences that there are:

grep '[0-9]\{4\}' regex

Note that the curly braces must be escaped if we are using basic regular expressions, but not if we use extended ones. With extended it would be like this:

grep -E '[0-9]{4}' regex

And the result in both cases would be this:

Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004

The characters "{" and "}" with a number between them match the previous character repeated the specified number of times.

Now the other example with the braces. Suppose we want to find words that have between 3 and 6 lowercase letters. We could do the following:

grep '[a-z]\{3,6\}' regex

And the result would be this:

- Lista de páginas wiki:
ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/
- Fechas de lanzamiento:
Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004
Desde Linux Rulez.

Which, as you can see, does not look much like what we wanted. That's because the regular expression finds the letters within other words that are longer. Let's try this other version:

grep '\<[a-z]\{3,6\}\>' regex

Result:

- Lista de páginas wiki:
ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/

This already looks more like what we wanted. What we have done is require that the word start just before the first letter and end just after the last.

The characters "{" and "}" with two numbers between them separated by a comma match the previous character repeated a number of times between the first number and the second.
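The braces are handy for describing fixed formats such as dates and times. This is only a sketch with three lines from the sample text; it uses grep's -o option, not covered in this post, which prints just the part of each line that matched instead of the whole line.

```shell
# Three lines taken from the sample text.
sample=$(mktemp)
printf '%s\n' 'Arch Linux: 11-03-2002' 'Gentoo: 31/03/2002' 'CentOs: 14-05-2004 03:32:38' > "$sample"

# Two digits, a dash, two digits, a dash, four digits.
dashed_dates=$(grep -E -o '[0-9]{2}-[0-9]{2}-[0-9]{4}' "$sample")

# Three groups of two digits separated by ":".
times=$(grep -E -o '[0-9]{2}:[0-9]{2}:[0-9]{2}' "$sample")

echo "$dashed_dates"   # 11-03-2002 and 14-05-2004, one per line
echo "$times"          # 03:32:38
rm -f "$sample"
```

The Gentoo date is left out of the first search because it uses "/" as a separator.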

Let's now look at a character that is a cousin of "+". It is "*", and it works in a very similar way, except that it matches any number of repetitions, including zero. That is, it does the same as "+" but does not require the character to its left to appear in the text. For example, let's try looking for the addresses that start with wiki and end with org:

grep 'wiki.*org' regex

Let's see the result:

ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/

Perfect.
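To compare "*" and "+" side by side, here is a small sketch with two lines from the sample text. It uses grep's -o option, not covered in this post, to print only the matched part of each line.

```shell
# Two lines taken from the sample text.
sample=$(mktemp)
printf '%s\n' 'CentOS: http://wiki.centos.org/' 'Debian: https://wiki.debian.org/' > "$sample"

# "s*" allows zero or more "s", so both "http:" and "https:" match.
# "*" is available in basic regular expressions, so -E is not needed.
star_matches=$(grep -o 'https*:' "$sample")

# "s+" requires at least one "s", so only "https:" matches.
plus_matches=$(grep -E -o 'https+:' "$sample")

echo "$star_matches"   # http: and https:, one per line
echo "$plus_matches"   # https:
rm -f "$sample"
```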

Now the last character that we are going to see. The "\" character is used to escape the character to its right so that it loses its special meaning. For example, suppose we want to locate the lines that end with a period. For that we also need the "$" character, which matches the end of a line. The first thing that might occur to us could be this:

grep '.$' regex

The result is not what we are looking for:

- Lista de páginas wiki:
ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/
- Fechas de lanzamiento:
Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004
Desde Linux Rulez.

This is because the "." matches anything, so that regular expression matches the last character of each line whatever it is. The solution is this:

grep '\.$' regex

Now the result is what we want:

Desde Linux Rulez.
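The same problem with the unescaped "." shows up in the middle of a pattern, not only at the end of a line. Here is a minimal sketch; the second line of the test data is made up on purpose to show the difference.

```shell
sample=$(mktemp)
# The first line comes from the sample URLs; the second is a hypothetical variant.
printf '%s\n' 'wiki.centos.org' 'wiki-centos-org' > "$sample"

# Unescaped ".": any character is accepted between "wiki" and "centos".
unescaped=$(grep -c 'wiki.centos' "$sample")

# Escaped "\.": only a literal period is accepted.
escaped=$(grep -c 'wiki\.centos' "$sample")

echo "unescaped: $unescaped, escaped: $escaped"   # unescaped: 2, escaped: 1
rm -f "$sample"
```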

Game over

Although the subject of regular expressions is complex enough to fill a series of articles, I think I have already given you enough pain. If you have made it this far, congratulations. And if you've read all this in one sitting, take an aspirin or something, because it can't be good for you.

For now that's all. If you like this article, maybe I can write another. In the meantime, I recommend you try all the regular expressions in the terminal to see clearly how they work. And remember: Only Chuck Norris can parse HTML using regular expressions.

DesdeLinux