With the Terminal: Using regular expressions II: Replacements

In my previous article I explained, at a basic level, how each of the most commonly used special characters of regular expressions works. With these regular expressions it is possible to do complex searches in text files or in the output of other commands. In this article I am going to explain how to use the sed command to find and replace text in a much more powerful way than simply swapping one fixed text for another.

A little more about the grep command

Before I start talking about sed, I would like to say a little more about the grep command to round off what was explained in the previous article. Everything I say here will be relevant to sed as well. Later we will see how it relates to searches.

Combining regular expressions

Many of the special characters that I have talked about in the previous article can be combined, not only with other characters, but with whole regular expressions. The way to do this is to use parentheses to form a subexpression. Let's see an example of this. Let's start by downloading a text that we can use for testing. It is a list of phrases. For that we are going to use the following command:

curl http://artigoo.com/lista-de-frases-comparativas-comicas 2>/dev/null | sed -n 's/.*<p>\(.*\.\)<\/p>/\1/gp' > frases

This will leave a file named «frases» in the directory where you run the command. You can open it up to take a look and have a little laugh. 🙂

Now let's suppose that we want to find the phrases that have exactly 6 words. The difficulty is in forming a regular expression that matches each word. A word is a sequence of letters, either uppercase or lowercase, which would be something like '[a-zA-Z]+', but you also have to specify that the words are separated by characters other than letters, that is, it would be something like '[a-zA-Z]+[^a-zA-Z]+'. Remember: the "^" as the first character inside the brackets indicates that we want to match characters that are not in the ranges, and the "+" indicates 1 or more characters.

We already have a regular expression that can match a word. To match 6 of them, it has to be repeated 6 times. For that we use braces, but it is useless to write '[a-zA-Z]+[^a-zA-Z]+{6}', because the 6 would repeat only the last part of the regular expression and what we want is to repeat all of it, so what you have to write is this: '([a-zA-Z]+[^a-zA-Z]+){6}'. With the parentheses we form a subexpression and with the braces we repeat it 6 times. Now you just need to add a "^" at the front and a "$" at the end to match the entire line. The command is as follows:

grep -E '^([a-zA-Z]+[^a-zA-Z]+){6}$' frases

And the result is just what we wanted (the phrases are shown here in English translation; each of the Spanish originals has exactly six words):

It is more sung than the Macarena.
You are more finished than Luis Aguilé.
You have less culture than a stone.
You know more languages than Cañita Brava.
He has more wrinkles than Tutan Khamón.
You know less than Rambo about childcare.

Notice that we put the -E parameter because we want to use extended regular expressions to make the "+" work. If we used the basic ones, we would have to escape the parentheses and the braces.
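As a quick check that does not depend on the downloaded file, here is a minimal sketch that runs the same expression over three made-up lines. The sample text and the variable names are my own, not part of the article's data. It also shows the basic-syntax spelling mentioned above, with the parentheses and braces escaped, assuming a grep that follows POSIX basic regular expressions:

```shell
# Made-up sample lines; only the first has exactly six words.
sample=$(printf '%s\n' \
  'one two three four five six.' \
  'only three words.' \
  'this one has seven words in it.')

# Extended syntax: the group (word + separator) is repeated six times.
six_ere=$(printf '%s\n' "$sample" | grep -E '^([a-zA-Z]+[^a-zA-Z]+){6}$')

# Basic syntax: same pattern with \( \) and \{ \} escaped;
# \{1,\} stands in for "+", which basic syntax does not define.
six_bre=$(printf '%s\n' "$sample" | grep '^\([a-zA-Z]\{1,\}[^a-zA-Z]\{1,\}\)\{6\}$')

printf '%s\n' "$six_ere"
printf '%s\n' "$six_bre"
```

Both commands should print only the six-word line. Note that the pattern requires at least one non-letter after the last word, so a line with no final period would not match.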

Back references or backreferences

If you have a spell checker installed, you will probably have a list of words in /usr/share/dict/words. If not, you can install it on Arch with:

sudo pacman -S words

Or on Debian with:

sudo aptitude install dictionaries-common

If you want you can take a look at the file to see what words it has. It is actually a link to the word file for the language your distro is in. You can have several word files installed at the same time.

We are going to use that file. Suppose we are very curious to know all the seven-letter palindromes out there. For those who do not know: a palindrome is a word that reads the same from left to right as from right to left. Let's try the following command:

grep '^\(.\)\(.\)\(.\).\3\2\1$' /usr/share/dict/words

It looks a bit strange, right? If we try it, the result will depend on the language of your distro and on the words in your list. In my case, with the Spanish word list, the result is this (the words appear here in English translation, so they no longer read as palindromes):

aniline aniline rolling

Let's see how this regular expression works.

Apart from the "^" and the "$", whose purpose we already know, the first thing we see on the left are three dots, each enclosed in parentheses. Don't be confused by the backslashes in front of each parenthesis. They are there to escape the parentheses because we are using basic regular expressions, and they have no other meaning. The important thing is that we are asking for any three characters with the dots, and each of those dots is enclosed in parentheses. This saves the characters matched by those dots so that they can be referenced again from within the regular expression. This is another use of parentheses that will come in handy later for making replacements.

This is where the three numbers that follow come in, each with a backslash in front of it. In this case, the backslash is important. It indicates that the number after it is a backreference and refers to one of the previous parentheses. For example: \1 refers to the first parenthesis, \2 to the second, and so on.

That is, with the regular expression that we have put, what we are looking for are all the words that start with any four letters and then have a letter that is the same as the third, another that is the same as the second and another that is the same as the first. The result is the seven letter palindromes that are in the word list. Just as we wanted.

If we were using extended regular expressions, we wouldn't have to escape the parentheses, but with extended regular expressions, backreferences don't work in all programs because they are not standardized. However, with grep they work, so that may be another way to do the same. You can try it if you want.
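If you do not have a word list at hand, the following sketch tries both spellings on a small made-up list. The words and variable names are my own choice, and the extended version relies on grep accepting backreferences with -E, which GNU grep does but which is not guaranteed everywhere:

```shell
# Made-up word list standing in for /usr/share/dict/words.
words=$(printf '%s\n' rotator regular racecar level pattern reviver)

# Basic syntax, as in the article: escaped groups, then \3 \2 \1.
pal_bre=$(printf '%s\n' "$words" | grep '^\(.\)\(.\)\(.\).\3\2\1$')

# Extended syntax: same idea without escaping the parentheses.
# Backreferences in -E mode are an extension (assumed available here).
pal_ere=$(printf '%s\n' "$words" | grep -E '^(.)(.)(.).\3\2\1$')

printf '%s\n' "$pal_bre"
```

Only the seven-letter palindromes should come out. «level» is a palindrome too, but it has five letters, so the expression, which describes exactly seven characters, skips it.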

Replacement expressions: the sed command

In addition to searching, one of the best uses of regular expressions is to make complex text replacements. One way to do this is with the sed command. The power of sed goes far beyond replacing text, but here I am going to use it only for that. The syntax that I am going to use with this command is the following:

sed [-r] 's/REGEX/REPL/g' FILE

Or also:

COMMAND | sed [-r] 's/REGEX/REPL/g'

Where REGEX will be the search regular expression and REPL the replacement one. Keep in mind that this command does not really replace anything in the file that we indicate, but what it does is show us the result of the replacement in the terminal, so do not be scared by the commands that I am going to put next. None of them are going to modify any files on your system.
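If you want to convince yourself of this before going on, here is a small sketch that uses a throwaway file created with mktemp. The file contents and the variable names are just an illustration of mine:

```shell
# Throwaway file so nothing real is touched (mktemp picks the name).
tmp=$(mktemp)
printf 'hello world\n' > "$tmp"

# sed prints the substituted text on standard output...
shown=$(sed 's/world/there/g' "$tmp")

# ...while the file itself still holds the original text.
on_disk=$(cat "$tmp")

printf 'sed printed: %s\n' "$shown"
printf 'file holds:  %s\n' "$on_disk"
rm -f "$tmp"
```

To keep the result, you can redirect the output to a different file with ">", just as was done earlier with the «frases» file.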

Let's start with a simple example. We all have various configuration files in the /etc directory that usually have comments beginning with "#". Suppose we want to see one of these files without the comments. For example, I'm going to do it with fstab. You can try it with whichever file you want.

sed 's/#.*//g' /etc/fstab

I am not going to put here the result of the command because it depends on what you have in your fstab, but if you compare the output of the command with the content of the file you will see that all the comments have disappeared.

In this command the search expression is «#.*», that is, a "#" followed by any number of characters: in other words, the comments. As for the replacement expression, if you look at the two consecutive slashes you will see that there is nothing between them, so the command replaces the comments with nothing, that is, it deletes them. It could not be simpler.
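Here is a sketch of the same replacement on made-up fstab-like text, so the effect is visible without depending on your own file. The sample lines and variable names are mine. As an extra step that is not part of the command above, it pipes the result through grep -v to drop the lines that end up empty:

```shell
# Made-up fstab-like content: one full-line comment, one trailing comment.
conf=$(printf '%s\n' \
  '# static file system information' \
  '/dev/sda1 / ext4 defaults 0 1 # root' \
  '/dev/sda2 none swap sw 0 0')

# Delete everything from the first "#" to the end of each line.
no_comments=$(printf '%s\n' "$conf" | sed 's/#.*//')

# Lines that were only a comment are now empty; grep -v drops them.
compact=$(printf '%s\n' "$no_comments" | grep -v '^[[:space:]]*$')

printf '%s\n' "$compact"
```

Keep in mind that this expression cuts at the first "#" wherever it appears, so a "#" that is part of a value rather than a comment would be cut off as well.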

Now we are going to do the opposite. Suppose that what we want is to comment all the lines of the file. Let's try like this:

sed 's/^/# /g' /etc/fstab

You will see that, in the output of the command, all the lines begin with a hash mark and a blank space. What we have done is replace the beginning of the line with «# ». This is also a fairly simple example where the replacement text is always the same, but now we are going to complicate it a bit more.
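As a small sketch of this, the following comments two made-up lines and then undoes the change with the mirror-image replacement. The sample text and variable names are my own:

```shell
# Two made-up lines to play with.
lines=$(printf '%s\n' 'first line' 'second line')

# "^" matches the empty position at the start of each line,
# so the replacement text is simply inserted there.
commented=$(printf '%s\n' "$lines" | sed 's/^/# /')

# The reverse: remove a leading "# " from each line.
restored=$(printf '%s\n' "$commented" | sed 's/^# //')

printf '%s\n' "$commented"
printf '%s\n' "$restored"
```

Since "^" can only match once per line, the "g" flag makes no difference in this particular case, which is why the sketch leaves it out.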

The nice thing about replacements is that in the replacement expression you can use backreferences like the ones I described earlier. Let's go back to the phrase file that we downloaded at the beginning of the article. We are going to put every capital letter in it between parentheses, and we will do it with a single command:

sed 's/\([A-Z]\)/(\1)/g' frases

What we have here is a backreference in the replacement expression that refers to the parentheses in the search expression. The parentheses in the replacement expression are normal parentheses. In the replacement expression they have no special meaning, they are put as is. The result is that all capital letters are replaced by that same letter, whatever it is, with parentheses around it.

There is another character that can also be used in the replacement expression: "&", which is replaced by all the text matched by the search expression. An example of this could be putting all the phrases in the file in quotes. This can be achieved with this command:

sed 's/.*/"&"/g' frases

The operation of this command is very similar to the previous one, only now what we replace is the entire line with the same line with quotes around it. Since we are using "&", we don't need to put parentheses.
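To compare the two mechanisms side by side, here is a sketch on a made-up phrase (the phrase and the variable names are mine). It also shows that a backslash in front of "&" makes it a plain ampersand in the replacement:

```shell
phrase='Hello World'

# Backreference: the group captures the capital letter, \1 puts it back.
by_group=$(printf '%s\n' "$phrase" | sed 's/\([A-Z]\)/(\1)/g')

# "&" stands for the whole match, so no group is needed here.
by_amp=$(printf '%s\n' "$phrase" | sed 's/[A-Z]/(&)/g')

# The whole line wrapped in quotes.
quoted=$(printf '%s\n' "$phrase" | sed 's/.*/"&"/')

# An escaped "&" is inserted literally.
literal=$(printf '%s\n' 'salt' | sed 's/salt/& \& pepper/')

printf '%s\n' "$by_group" "$by_amp" "$quoted" "$literal"
```

The first two commands should give the same result. "&" is the shorter choice when you want the whole match, while numbered groups are needed as soon as you want only part of it.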

Some useful commands with regular expressions

Here are a few commands that I find useful or curious and that use regular expressions. With these commands the usefulness of regular expressions is much more evident than with the examples I have given you so far, but it seemed important to me to explain something about how regular expressions work so that you can understand them.

Show sections of a man page:

man bash | grep '^[A-Z][A-Z ]*$'

Of course, you can change bash to whatever command you want. And then, from within man, you can go directly to the section that interests you using, of course, a regular expression. Press «/» to start searching and type «^ALIASES$» to go to the ALIASES section, for example. I think this is the first use I started to make of regular expressions a few years ago. Moving through some manual pages is almost impossible without a trick like this.

Show the names of all users of the machine including special ones:

sed 's/\([^:]*\).*/\1/' /etc/passwd

Show user names, but only those with shell:

grep -vE '(/false|/nologin)$' /etc/passwd | sed 's/\([^:]*\).*/\1/g'

It can really be done with a single regular expression, but the way to do it goes beyond what I have told you in these articles, so I have done it by combining two commands.
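Here is a sketch of both commands on a few made-up lines in the passwd format, so the result does not depend on the accounts on your machine. The accounts and variable names are invented for the example:

```shell
# Made-up entries in /etc/passwd format (name is the first ":" field).
passwd_sample=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' \
  'alice:x:1000:1000:Alice:/home/alice:/bin/zsh' \
  'sync:x:4:65534:sync:/bin:/bin/false')

# Keep everything up to the first ":" and drop the rest of the line.
all_users=$(printf '%s\n' "$passwd_sample" | sed 's/\([^:]*\).*/\1/')

# First discard lines ending in /false or /nologin, then extract the name.
shell_users=$(printf '%s\n' "$passwd_sample" |
  grep -vE '(/false|/nologin)$' |
  sed 's/\([^:]*\).*/\1/')

printf '%s\n' "$shell_users"
```

The group '[^:]*' stops at the first colon because it cannot match one, which is what makes it capture exactly the user name.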

Insert a comma before the last three digits of all the numbers in the numbers file:

sed 's/\(^\|[^0-9.]\)\([0-9]\+\)\([0-9]\{3\}\)/\1\2,\3/g' numbers

It only works with numbers up to 6 digits, but it can be called more than once to place separators in the other groups of three digits.
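The following sketch shows what "calling it more than once" looks like in practice, on a made-up line of numbers. The sample and the variable names are mine, and the expression uses \| and \+ in basic syntax, which I am assuming your sed accepts (GNU sed does):

```shell
# The substitution from the article, stored once so it can be reused.
sep='s/\(^\|[^0-9.]\)\([0-9]\+\)\([0-9]\{3\}\)/\1\2,\3/g'

numbers='1234 56789 1234567 12.5'

# One pass: a comma before the last three digits of each number.
once=$(printf '%s\n' "$numbers" | sed "$sep")

# Second pass: handles the next group of three in the 7-digit number.
twice=$(printf '%s\n' "$numbers" | sed "$sep" | sed "$sep")

printf '%s\n' "$once"
printf '%s\n' "$twice"
```

After the first pass the seven-digit number still has four digits in a row at the front. The second pass splits those, and running it again after that changes nothing.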

Extract all email addresses from a file:

grep -E '\<[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\>' FILE
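Used like that, grep prints the whole lines that contain an address. If you want only the addresses themselves, one option is the -o flag, which both GNU and BSD grep provide and which prints just the matched text, one match per line. A sketch on a made-up sentence (the addresses and variable names are invented):

```shell
text='Write to alice@example.com or bob.smith@mail.example.org today.'

# -o prints only the part of the line that matched, one match per line.
emails=$(printf '%s\n' "$text" |
  grep -oE '\<[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\>')

printf '%s\n' "$emails"
```

Bear in mind that the final '{2,4}' limits the last part of the domain to between two and four letters, so addresses with longer endings would not be matched in full.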

Separate the day, month and year of all the dates that appear in a file:

sed -r 's/([0-9]{2})[/-]([0-9]{2})[/-]([0-9]{4})/Day: \1, Month: \2, Year: \3/g' FILE
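Because each part of the date ends up in its own group, the groups can also be written back in a different order. This sketch, on a made-up sentence, turns day/month/year dates into year-month-day. It assumes the dates really are in day, month, year order, uses "#" as the delimiter of the s command so that the slashes in the pattern cannot be confused with it, and uses -E, which GNU sed accepts as an equivalent of -r:

```shell
text='Born 25/12/2011, moved 01-02-2013.'

# \1 = day, \2 = month, \3 = year; the replacement reorders them.
iso=$(printf '%s\n' "$text" |
  sed -E 's#([0-9]{2})[/-]([0-9]{2})[/-]([0-9]{4})#\3-\2-\1#g')

printf '%s\n' "$iso"
```

Any character can serve as the delimiter of the s command, which is handy whenever the pattern or the replacement is full of slashes.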

Find out our local IP:

/sbin/ifconfig | grep 'inet .*broadcast' | sed -r 's/[^0-9]*(([0-9]+\.){3}[0-9]+).*/\1/g'

This can also be done with a single sed command, but I preferred to split it into a grep and a sed for simplicity.
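For the curious, here is one possible way to write the single-command version, as a sketch of my own rather than the definitive form. It runs on made-up text imitating ifconfig output (the interface data is invented) and uses sed's -n option together with the "p" flag so that only the lines where the replacement happened are printed:

```shell
# Made-up text imitating ifconfig output.
ifconfig_sample=$(printf '%s\n' \
  'eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500' \
  '        inet 192.168.1.42  netmask 255.255.255.0  broadcast 192.168.1.255' \
  '        inet6 fe80::1  prefixlen 64  scopeid 0x20<link>' \
  'lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536' \
  '        inet 127.0.0.1  netmask 255.0.0.0')

# Two steps, as in the article: grep selects the line, sed extracts the IP.
two_step=$(printf '%s\n' "$ifconfig_sample" |
  grep 'inet .*broadcast' |
  sed -E 's/[^0-9]*(([0-9]+\.){3}[0-9]+).*/\1/g')

# One step: -n silences the default output, "p" prints only replaced lines.
one_step=$(printf '%s\n' "$ifconfig_sample" |
  sed -n -E 's/^[^0-9]*inet (([0-9]+\.){3}[0-9]+).*broadcast.*/\1/p')

printf '%s\n' "$two_step" "$one_step"
```

The single expression has to do the selecting and the extracting at once, which is why it is longer and harder to read than the two-command version.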

Some useful addresses

Here are some addresses that may be useful related to regular expressions:

Regular expression library: A library of regular expressions in which you can search for expressions related to the topic that interests you, whether that is matching web addresses, ID numbers or anything else.
RegExr: An online regular expression checker. It allows you to enter a text and apply a regular expression to it, either to search or to replace. It gives information about the regular expression and offers a few options to change its behavior.
Regular Expressions Tester: A Firefox add-on that allows you to check regular expressions from the browser.

Conclusion

That's all for now. Regular expressions are complex but useful. It takes time to learn them, but if you are like me, playing with them will seem fun and, little by little, you will master them. It is a whole world. There would still be a lot to say about lazy quantifiers, Perl-style regular expressions, multiline matching, etc. And then each program has its own characteristics and variants, so the best advice I can give you is to always look at the documentation of the program you are using whenever you have to write a regular expression in a new program.

Hey! …HEY! … WAKE UP! … WHAT ARE YOU ALL DOING SLEEPING? 🙂

Sources

Some of the ideas and examples for regular expressions in this article I have taken from here:

http://sed.sourceforge.net/sed1line.txt
http://www.thegeekstuff.com/2009/10/unix-sed-tutorial-advanced-sed-substitution-examples/

DesdeLinux