With the terminal: Using regular expressions

One of the things that I have always loved about the Linux terminal is what you can achieve with regular expressions. Whether we need to find complicated text or replace it with something else, regular expressions can greatly simplify the job. Let's start at the beginning:

WARNING: This post is a pain in the ass. Reading it all in one go can cause loss of consciousness. Take breaks in between or consult your doctor or pharmacist before reading the entire post.

What is a regular expression?

A regular expression is a series of special characters that allow us to describe a text that we want to find. For example, if we wanted to find the word "linux" it would be enough to put that word in the program we are using. The word itself is a regular expression. So far it seems very simple, but what if we want to find all the numbers in a certain file? Or all the lines that start with a capital letter? In those cases you can no longer put a simple word. The solution is to use a regular expression.

Regular expressions vs. file patterns.

Before we get into the subject, I want to clear up a common misunderstanding. A regular expression is not what we pass as a parameter to commands like rm, cp, etc. to refer to several files on the hard drive. That is a file pattern. Regular expressions, although similar in that they share some characters, are different. A file pattern is applied to the names of the files on disk and returns the ones that match the pattern in full, while a regular expression is applied to a text and returns the lines that contain a match. For example, the regular expression corresponding to the pattern *.* would be something like ^.*\..*$
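To make the difference concrete, here is a minimal sketch that applies the file pattern *.* and the regular expression ^.*\..*$ to the same list of names. The three names are made up for the example, and the case statement is only used because it relies on the shell's own pattern matching.

```shell
# Hypothetical names, used only for this comparison.
names='notes.txt
README
archive.tar.gz'

# File pattern: case compares the whole name against *.*
glob_hits=$(printf '%s\n' "$names" | while IFS= read -r name; do
  case $name in
    (*.*) printf '%s\n' "$name" ;;
  esac
done)

# Regular expression: grep prints the lines that match ^.*\..*$
regex_hits=$(printf '%s\n' "$names" | grep '^.*\..*$')

printf 'file pattern:\n%s\nregular expression:\n%s\n' "$glob_hits" "$regex_hits"
```

Both approaches select notes.txt and archive.tar.gz and skip README, but they get there with different syntax.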

Types of regular expressions.

Not all programs use the same regular expressions. Far from it. There are several more or less standard types of regular expressions, but some programs change the syntax slightly, include their own extensions, or even use completely different characters. Therefore, when you want to use regular expressions with a program that you do not know well, the first thing to do is check its manual or documentation to see what kind of regular expressions it recognizes.

First, there are two main types of regular expressions, both defined in the POSIX standard, which is the one Linux tools follow: basic and extended regular expressions. Many of the commands that work with regular expressions, such as grep or sed, let you use either type. I will talk about them below. There are also Perl-style regular expressions, and then there are programs like vim or emacs that use variants of these. Depending on what we want to do, one or the other may be more appropriate.

Testing regular expressions.

The syntax of regular expressions is far from trivial. When we have to write a complicated regular expression, we end up facing a string of special characters that is impossible to understand at first glance, so to learn how to use them it is essential to have a way to run all the tests we want and see the results easily. That is why I am now going to show several commands with which we can test and experiment as much as we need until we have mastered regular expressions.

The first one is the grep command. This is the command we will use most frequently to do searches. The syntax is as follows:

grep [-E] 'REGEX' FILE
COMMAND | grep [-E] 'REGEX'

I recommend always putting regular expressions in single quotes so that the shell does not interpret their special characters. The first form searches for a regular expression in a file. The second filters the output of a command through a regular expression. By default, grep uses basic regular expressions; the -E option switches to extended regular expressions.

A trick that helps us see how regular expressions work is to enable color in the grep command. That way, the part of the text that matches the regular expression will be highlighted. To activate it, make sure that the environment variable GREP_OPTIONS contains --color, which can be done with this command:

export GREP_OPTIONS=--color

We can put it in .bashrc to have it always activated. Be aware that recent versions of GNU grep have deprecated GREP_OPTIONS; with those, pass --color directly on the command line or define an alias instead.

Another way to use regular expressions is by using the sed command. This is more suitable for replacing text, but can also be used for searching. The syntax for it would be like this:

sed -n[r] '/REGEX/p' FILE
COMMAND | sed -n[r] '/REGEX/p'

The sed command also uses basic regular expressions by default. You can use extended regular expressions with the -r option.

Another command that I want to mention is awk. It can be used for many things, since it lets you write scripts in its own programming language. If what we want is to find a regular expression in a file or in the output of a command, the way to use it is the following:

awk '/REGEX/' FICHERO
COMANDO | awk '/REGEX/'

This command always uses extended regular expressions.
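As a quick sanity check, the three commands can be run on the same input to confirm that they select the same line. This is only a sketch: the two input lines are fed through a pipe instead of a file, and the variable names are arbitrary.

```shell
# Two input lines, passed through a pipe instead of a file.
sample='CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/'

with_grep=$(printf '%s\n' "$sample" | grep 'Debian')
with_sed=$(printf '%s\n' "$sample" | sed -n '/Debian/p')
with_awk=$(printf '%s\n' "$sample" | awk '/Debian/')

# Each variable should hold the Debian line and nothing else.
printf '%s\n' "$with_grep" "$with_sed" "$with_awk"
```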

To do our tests we will also need a sample text to search in. We can use the following:

- Lista de páginas wiki:

ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/

- Fechas de lanzamiento:

Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004

Desde Linux Rulez.

This is the text that I will use for the examples in the rest of the post, so I recommend that you copy it into a file to have it handy from the terminal. You can give it any name you want. I have called it regex.
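One possible way to create the file without opening an editor is a here-document. The sketch below assumes you want the file in the current directory and that nothing important is already called regex there, because it will be overwritten.

```shell
# Write the sample text to a file named "regex" in the current directory.
cat > regex <<'EOF'
- Lista de páginas wiki:

ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/

- Fechas de lanzamiento:

Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004

Desde Linux Rulez.
EOF

# Quick check: count the lines that contain at least one character.
non_empty=$(grep -c '.' regex)
echo "$non_empty"
```

If the file was written correctly, the count is 13, one for each line that is not blank.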

Beginning lesson.

Now we have everything we need to start testing regular expressions. Let's go little by little. I am going to show several example searches and explain what each character is for. They are not very good examples, but since this post is going to be very long anyway, I don't want to complicate it any further. And I'm only going to scratch the surface of what can be done with regular expressions.

The simplest case is to search for a specific word. Suppose we want to find all the lines that contain the word "Linux". This is the easiest, since we only have to write:

grep 'Linux' regex

And we can see the result:

ArchLinux: https://wiki.archlinux.org/
Arch Linux: 11-03-2002
Desde Linux Rulez.

These are the three lines that contain the word "Linux" which, if we have used the color trick, will appear highlighted. Note that it recognizes the word we are looking for even if it is part of a longer word as in "ArchLinux". However, it does not highlight the word "linux" that appears in the URL "https://wiki.archlinux.org/". That's because it appears there with the lowercase "l" and we have looked for it in uppercase. The grep command has options for this, but I'm not going to talk about them in an article on regular expressions.

With this simple test we can draw the first conclusion:

  • A normal character put into a regular expression matches itself.

Which is to say that if you put the letter "a" it will look for the letter "a". It seems logical, right? 🙂

Now suppose we want to search for the word "CentO" followed by any character, but only a single character. For this we can use the "." character, which is a wildcard that matches any character, but only one:

grep 'CentO.' regex

And the result is:

CentOS: http://wiki.centos.org/
CentOs: 14-05-2004 03:32:38

Which means that it matches the character after "CentO" whether it is an uppercase "S", as in the first line, or a lowercase "s", as in the second. If any other character appeared in that place, it would match it too. We already have the second rule:

  • The character "." matches any single character.

It is no longer as trivial as it seemed, but we cannot do much with this yet. Let's go a little further. Suppose we want to find the lines in which the year 2002 or 2004 appears. They look like two searches, but they can be done at once like this:

grep '200[24]' regex

Which means that we want to find the number 200 followed by 2 or 4. And the result is this:

Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Ubuntu: 20/10/2004

Which brings us to the third rule:

  • Multiple characters enclosed in brackets match any of the characters within the brackets.
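If color is not available, grep's -o option is another way to see exactly what a bracket expression matched, since it prints only the matched text. A small sketch, using three of the date lines as piped input (the variable names are arbitrary):

```shell
dates='Arch Linux: 11-03-2002
Debian: 16/08/1993
Ubuntu: 20/10/2004'

# -o prints each match on its own line instead of the whole line.
years=$(printf '%s\n' "$dates" | grep -o '200[24]')
printf '%s\n' "$years"
```

Only 2002 and 2004 are printed; 1993 does not start with 200, so the Debian line produces nothing.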

Brackets offer more possibilities. They can also be used to exclude characters. For example, suppose we want to find the places where the ":" character appears but is not followed by "/". The command would be like this:

grep ':[^/]' regex

It is simply a matter of putting a "^" as the first character inside the brackets. After it you can put all the characters you want. The result of this last command is the following:

ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/
Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004

Now the ":" after the distro names are highlighted, but not the ones in the URLs, because those are followed by "/".

  • Putting the "^" character at the beginning of a bracket matches any character except the other characters in the bracket.
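To check what a negated bracket really consumes, this sketch runs the same expression on a single line with grep's -o option. The square brackets in the printf are only there to make the trailing space visible.

```shell
line='Debian: https://wiki.debian.org/'

# The match is the colon plus the character after it, as long as it is not "/".
kept=$(printf '%s\n' "$line" | grep -o ':[^/]')
printf '[%s]\n' "$kept"
```

The only match is the colon after "Debian" together with the space that follows it. The colon in "https://" is skipped because the next character is "/".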

Another thing we can do is specify a range of characters. For example, to search for any number followed by a "-" it would look like this:

grep '[0-9]-' regex

With this we are specifying a character between 0 and 9 and then a minus sign. Let's see the result:

Arch Linux: 11-03-2002
CentOs: 14-05-2004 03:32:38

Multiple ranges can be specified within the brackets, and ranges can even be mixed with single characters.

  • Placing two characters separated by "-" within the brackets matches any character within the range.
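As an illustration of mixing ranges with single characters, the sketch below accepts a digit, a lowercase letter from a to f, or an underscore after "x=". The input values are invented, and LC_ALL=C is set as a precaution because the meaning of ranges such as a-f can depend on the locale.

```shell
values='x=7
x=c
x=Z
x=_'

# Two ranges (0-9 and a-f) and one single character (_) in the same brackets.
matched=$(printf '%s\n' "$values" | LC_ALL=C grep 'x=[0-9a-f_]')
printf '%s\n' "$matched"
```

The line x=Z is left out because "Z" is not in either range and is not the underscore.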

Let's see now if we can select the first part of the URLs. The one that says "http" or "https". They only differ in the final "s", so let's do it as follows:

grep -E 'https?' regex

The question mark makes the character to its left optional. But note that we have now added the -E option to the command. This is because the question mark is a feature of extended regular expressions. So far we were using basic regular expressions, so we didn't need to add anything. Let's see the result:

ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/

So we already have a new rule:

  • A character followed by "?" matches that character or none. This is only valid for extended regular expressions.
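A short sketch to see the optional character at work, printing only the matched text with -o. The two lines come from the sample text and the variable names are arbitrary.

```shell
urls='CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/'

# The "s" is optional, so both spellings match.
schemes=$(printf '%s\n' "$urls" | grep -oE 'https?')
printf '%s\n' "$schemes"
```

The first line yields http and the second yields https: when the "s" is present it is included in the match.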

Now we are going to look for two completely different words. Let's see how to find the lines that contain either the word "Debian" or the word "Ubuntu".

grep -E 'Debian|Ubuntu' regex

With the vertical bar we can separate two or more different regular expressions and find the lines that match any of them:

Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/
Debian: 16/08/1993
Ubuntu: 20/10/2004

  • The "|" character serves to separate several regular expressions and matches with any of them. It is also specific to extended regular expressions.

Let's continue. Now we are going to look for the word "Linux", but only where it is not attached to another word on its left. We can do it like this:

grep '\<Linux' regex

Here the important character is "<", but it needs to be escaped by putting "\" in front of it so that grep interprets it as a special character. The result is as follows:

Arch Linux: 11-03-2002
Desde Linux Rulez.

You can also use "\>" to match words that are not attached to another word on their right. Let's go with an example. Let's try this command:

grep 'http\>' regex

The output it produces is this:

CentOS: http://wiki.centos.org/

"http" came out, but not "https", because in "https" there is still a character to the right of the "p" that can be part of a word.

  • The sequences "\<" and "\>" match the beginning and end of a word, respectively. The "<" and ">" characters must be escaped so that they are not interpreted as literal characters.
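The effect of the word boundary is easy to measure by counting matching lines with and without it. This sketch assumes GNU grep, since "\<" and "\>" are extensions that the POSIX standard does not define.

```shell
lines='ArchLinux: https://wiki.archlinux.org/
Arch Linux: 11-03-2002'

# -c prints the number of matching lines.
anywhere=$(printf '%s\n' "$lines" | grep -c 'Linux')
word_start=$(printf '%s\n' "$lines" | grep -c '\<Linux')

echo "Linux anywhere: $anywhere, Linux at the start of a word: $word_start"
```

Without the boundary both lines match; with it, only "Arch Linux" does, because in "ArchLinux" the word does not start at the "L".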

Let's move on to slightly more complicated things. The "+" character matches the character to its left, repeated at least once. It is only available with extended regular expressions. With it we can search, for example, for sequences of several digits in a row preceded by ":".

grep -E ':[0-9]+' regex

Result:

CentOs: 14-05-2004 03:32:38

Both ":32" and ":38" are highlighted, because each of them starts with ":".

  • The "+" character matches the character to its left, repeated at least once.
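The same search can be repeated with -o to list the matches one by one. A minimal sketch on the only line of the sample that matches:

```shell
line='CentOs: 14-05-2004 03:32:38'

# Each match is a colon followed by one or more digits.
parts=$(printf '%s\n' "$line" | grep -oE ':[0-9]+')
printf '%s\n' "$parts"
```

The colon after "CentOs" does not appear in the output because it is followed by a space, not by a digit.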

You can also control the number of repetitions using "{" and "}". The idea is to put in braces a number that indicates the exact number of repetitions we want. You can also put a range. Let's see examples of the two cases.

First we are going to find all the four-digit sequences that there are:

grep '[0-9]\{4\}' regex

Note that the curly braces must be escaped if we are using basic regular expressions, but not if we use extended ones. With extended it would be like this:

grep -E '[0-9]{4}' regex

And the result in both cases would be this:

Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004

  • The characters "{" and "}" with a number between them match the previous character repeated the specified number of times.

Now the other example with the braces. Suppose we want to find words that have between 3 and 6 lowercase letters. We could do the following:

grep '[a-z]\{3,6\}' regex

And the result would be this:

- Lista de páginas wiki:
ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/
- Fechas de lanzamiento:
Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004
Desde Linux Rulez.

Which, as you can see, does not look much like what we wanted. That's because the regular expression finds the letters within other words that are longer. Let's try this other version:

grep '\<[a-z]\{3,6\}\>' regex

Result:

- Lista de páginas wiki:
ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/

This already looks more like what we wanted. What we have done is require that the word start just before the first letter and end just after the last.

  • The characters "{" and "}" with two numbers between them separated by a comma match the previous character repeated at least as many times as the first number and at most as many times as the second.
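To compare an exact count with a range, this sketch applies both forms to the CentOs line and prints only the matches. It also repeats the exact count with basic syntax to show that the escaped braces give the same result.

```shell
line='CentOs: 14-05-2004 03:32:38'

# Exactly four digits, extended and basic syntax.
exact_ere=$(printf '%s\n' "$line" | grep -oE '[0-9]{4}')
exact_bre=$(printf '%s\n' "$line" | grep -o '[0-9]\{4\}')

# Between two and four digits: grep takes as many as it can each time.
ranged=$(printf '%s\n' "$line" | grep -oE '[0-9]{2,4}')

printf 'exact: %s %s\n' "$exact_ere" "$exact_bre"
printf '%s\n' "$ranged"
```

The exact form finds only 2004, while the range finds every group of digits on the line: 14, 05, 2004, 03, 32 and 38.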

Let's now look at a character that is a cousin of "+". It is "*", and it works in a very similar way, except that it matches any number of repetitions, including zero. That is, it does the same as "+" but does not require the character to its left to appear in the text. For example, let's look for the addresses that start with wiki and end in org:

grep 'wiki.*org' regex

Let's see the result:

ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/

Perfect.
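The difference between "*" and "+" is easiest to see on a tiny input. In this sketch the three test strings are invented and the counts are the number of matching lines.

```shell
words='ac
abc
abbc'

# "b*" accepts zero or more "b"; "b+" needs at least one.
star=$(printf '%s\n' "$words" | grep -c 'ab*c')
plus=$(printf '%s\n' "$words" | grep -cE 'ab+c')

echo "ab*c matches $star lines, ab+c matches $plus lines"
```

The line "ac" matches only the version with "*", because it has no "b" at all.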

Now for the last character we are going to see. The "\" character escapes the character to its right so that it loses its special meaning. For example, suppose we want to locate the lines that end with a period. The first thing that might occur to us is this:

grep '.$' regex

The result is not what we are looking for:

- Lista de páginas wiki:
ArchLinux: https://wiki.archlinux.org/
Gentoo: https://wiki.gentoo.org/wiki/Main_Page
CentOS: http://wiki.centos.org/
Debian: https://wiki.debian.org/
Ubuntu: https://wiki.ubuntu.com/
- Fechas de lanzamiento:
Arch Linux: 11-03-2002
Gentoo: 31/03/2002
CentOs: 14-05-2004 03:32:38
Debian: 16/08/1993
Ubuntu: 20/10/2004
Desde Linux Rulez.

This is because the "." matches anything, so that regular expression matches the last character of each line whatever it is. The solution is this:

grep '\.$' regex

Now the result is what we want:

Desde Linux Rulez.
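The same comparison can be done by counting lines. The sketch below also tries a third spelling, "[.]", which works because a period inside brackets is taken literally. In all three expressions the "$" anchors the match at the end of the line.

```shell
lines='Debian: 16/08/1993
Desde Linux Rulez.'

unescaped=$(printf '%s\n' "$lines" | grep -c '.$')
escaped=$(printf '%s\n' "$lines" | grep -c '\.$')
bracketed=$(printf '%s\n' "$lines" | grep -c '[.]$')

echo "unescaped: $unescaped, escaped: $escaped, bracketed: $bracketed"
```

The unescaped version matches both lines, while the other two match only the line that really ends with a period.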

Game over

Although the subject of regular expressions is complex enough to fill a series of articles, I think I have already given you enough pain. If you have made it this far, congratulations. And if you've read all this in one sitting, take an aspirin or something, because it can't be good for you.

For now that's all. If you like this article, maybe I can write another. In the meantime, I recommend you try all the regular expressions in the terminal to see clearly how they work. And remember: Only Chuck Norris can parse HTML using regular expressions.


  1.   Ezekiel said

    What would our life be without the regex?
    The article is very useful, but I will read it little by little. Thanks a lot.

    1.    hexborg said

      Thank you for the comment. I still can't believe my article has been published. 🙂 It came out with a few errors, but I hope it is useful. 🙂

  2.   Scalibur said

    Thank youssssssss! ..

    Some time ago I had to study a little about regular expressions .. ..I thank you for teaching .. and the step-by-step guide to learn each one of them ..

    Very good! .. .. I'm going to get that aspirin .. ee

    1.    hexborg said

      You're welcome. Hang in there, and don't let regular expressions get the better of you. 🙂

  3.   tanrax said

    Fantastic post! Great job. I wonder how many hours it took you 😀

    1.    hexborg said

      LOL!! The question is: How many hours would it have taken me if I had said everything I intended to say? Infinite !! 🙂

  4.   Tammuz said

    one thing I did not know, good article!

    1.    hexborg said

      Thank you. It is a pleasure to share it with you.

  5.   helena_ryuu said

    great explanation. congratulations! really useful!

    1.    hexborg said

      I'm glad you found it useful. So it is a pleasure to write.

  6.   anti said

    This should go somewhere special. Like the Featured section, but for posts with a very specific use. Quite useful, although I would like to see it applied to Vim.

    1.    hexborg said

      That is just a matter of setting my mind to it. I have a few more articles on regular expressions in mind, and I could talk about vim in them. It has some differences from what I have explained in this article. It's a matter of getting on with it. 🙂

  7.   Fernando said

    Good!

    Your article is very good. It is curious: just recently (right now, in fact) I published an entry on my website that I had been preparing for a few days, where I collected a list of metacharacters for regular expressions and some examples. And the moment I opened DesdeLinux I saw an entry on the same topic!

    If it's any consolation, mine is MUCH MORE OF A PAIN 😀

    Regexes are certainly one of the most useful things. I normally use them to trim the output of commands and keep the part that interests me, and then work with it in a bash script, for example. I also used them a lot at university, and they are of vital importance in the construction of compilers (in the definition of lexers and parsers). In short, a whole world.

    Greetings and very very good work.

    1.    hexborg said

      Thank you very much.

      I also liked your article. It is more concise than mine. It can serve as a quick reference. It is a coincidence that we have written them at the same time. You can see that people are interested in the subject. 🙂

  8.   Ellery said

    Regular expressions for dummies =). Now it is clearer to me. By the way, another way to get colored output from grep is to create an alias in .bashrc: alias grep='grep --color=always', in case it helps someone.

    regards

    1.    hexborg said

      True. That is another way to do it. Thanks for the input. 🙂

  9.   KZKG ^ Gaara said

    O_O… piece of contribution !!! O_O ...
    Thank you very much for the post, I was waiting for something like that for a while hahaha, I already leave it open to read it calmly at home with zero hassle to concentrate hahaha.

    Thanks for the article, I really do 😀

    1.    hexborg said

      I knew you would like it. LOL!! The truth is that many things are missing, but I already have a second part in mind. 🙂

  10.   Eliecer Tates said

    Great article, if only I had read it yesterday, the class I gave today would have been even easier for my students!

    1.    hexborg said

      LOL!! Too bad I was late, but glad it's helpful. 🙂

  11.   LeoToro said

    Finally !!!, the post is super good… I finally found something that clearly explains regular expressions… ..

    1.    hexborg said

      There is a lot of information out there, but it is more difficult to find something that is easy to understand. I'm glad I filled that gap. 🙂

      Greetings.

  12.   Shakespeare Rhodes said

    Hey, I need help. I have to do a search in /var/logs with the format yymmdd, and the logs come like 130901.log -130901.log. I have to find all those that are between September 1 and October 11. The only thing I managed to do was get all of September, but I don't know how to write the complete expression:

    e.g.: 1309[0-3] returns the logs from September 1 to 30, but I don't know how to also get October 1 to 11 in the same expression.

    1.    hexborg said

      To do it using regular expressions is a bit complicated. It occurs to me that something like this might work:

      13(09[0-3]|10(0|1[01]))

      It is an extended regular expression. You don't say which tool you are using, so I can't give you more details.

      Anyway, I think that in this case, instead of using regular expressions, it is better to do it with find. You can try something like this:

      find . -newermt '01 sep' -a ! -newermt '11 oct' -print

      Good luck. I hope this helps you.

  13.   chipo said

    Hello! First of all, I wanted to thank you for your work since this page is among my "top 3" of the best Linux sites.
    I was practicing and I didn't know why a RegExp on a phone number didn't work for me and it was that I was missing the "-E" (which I realized thanks to this post).
    I wanted to ask you if you don't know any good pdf or site where there are exercises on RegExp, although with a little imagination you can practice inventing them yourself.

    Greetings, Pablo.

  14.   cally said

    Very good, I just read it all, and yes now I need an aspirin 🙂

  15.   Oscar said

    The best explanation I've seen of regular expressions. My thanks to the author for sharing this work.

    A greeting.

  16.   alexander said

    I really liked it, a very good explanation