What is wget?
There is no better source than Wikipedia to explain what this tool is:
GNU Wget is a free software tool that makes it easy to download content from web servers. Its name derives from World Wide Web and "get", that is: get from the WWW.
Currently it supports downloads using the HTTP, HTTPS and FTP protocols.
Among wget's most notable features are easy recursive downloading of complex mirrors, conversion of links so HTML content can be viewed locally, proxy support, and more.
We have already talked plenty about wget here on DesdeLinux. In fact, we had already seen how to download a complete website with wget; the problem is that nowadays administrators do not always let just anyone download their entire site, it is not something they really like ... and, obviously, I understand. The site is there on the internet to be consulted: the reader accesses content of interest, and the site administrator benefits financially (through advertising), through visits, and so on. If the reader downloads the site to his computer, he no longer has to go online to consult an old post.
Downloading a site with wget is as simple as:
wget -r -k http://www.sitio.com
- -r : This tells wget to download recursively, following links until it has the entire site.
- -k : This converts the links in the downloaded pages so the copy can be browsed on a computer without internet.
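By the way, if you only want one section of a site rather than the whole thing, wget's standard recursion options can keep the download contained. A minimal sketch (the URL and its /tutoriales/ path are just placeholders):

wget -r -k -np -l 3 http://www.sitio.com/tutoriales/

- -np : do not climb to the parent directory, so only that section is downloaded.
- -l 3 : limit recursion to three levels of links instead of wandering across the whole site.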
Now, things get complicated when the site administrator makes it difficult for us ...
What restrictions might exist?
The most common one is that the site only allows access to recognized User-Agents. In other words, the site notices that the User-Agent downloading so many pages is not one of the "normal" browsers and therefore cuts off access.
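By default, wget announces itself honestly: every request carries a User-Agent header along the lines of Wget/1.21 (the exact version varies by install), which is trivial for a server to match and reject. If you are curious about the exact headers your copy sends, the debug output shows them; a quick sketch, with www.site.com as a stand-in (and assuming your wget build includes debug support, as distro packages usually do):

wget -d -O /dev/null http://www.site.com 2>&1 | grep "User-Agent"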
Also, through the robots.txt file a site can declare that wget (like plenty of similar apps) may not download whatever the client wishes; well ... the site administrator wants it that way, period 😀
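The robots.txt file is just plain text at the root of the site, so it is easy to check what it actually forbids before working around it. For example (www.site.com is a placeholder):

wget -qO- http://www.site.com/robots.txt

A typical entry that locks out mirroring tools looks something like this:

User-agent: Wget
Disallow: /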
How to circumvent these restrictions?
For the first case, we set a User-Agent for wget; we can do this with the --user-agent option, and here I show you how:
wget --user-agent="Mozilla/5.0 (X11; Linux amd64; rv:32.0b4) Gecko/20140804164216 ArchLinux KDE Firefox/32.0b4" -r http://www.site.com -k
Now, to get around robots.txt, simply tell wget to ignore that file, that is, let it download the site without caring what robots.txt says:
wget --user-agent="Mozilla/5.0 (X11; Linux amd64; rv:32.0b4) Gecko/20140804164216 ArchLinux KDE Firefox/32.0b4" -r http://www.site.com -k -e robots=off
Now ... there are other options or parameters we can use to deceive the site further, for example, pretending that we arrived at the site from Google. Here is the final line with everything:
wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (X11; Linux amd64; rv:32.0b4) Gecko/20140804164216 ArchLinux KDE Firefox/32.0b4" --referer=http://www.google.com -r http://www.site.com -e robots=off -k
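With this line, --referer makes each request claim it was reached from Google, and the Accept: text/html header mimics what a real browser asks for, so from the server's point of view the traffic looks like an ordinary Firefox visitor.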
Is it ok to do this?
That depends ... you always have to see it from both points of view: the site administrator's, but also the reader's.
On the one hand, as an administrator, I would not like people taking an HTML copy of my site just like that; it is online not for our pleasure alone but for everyone's enjoyment ... our goal is to have interesting content available, content you can learn from.
But, on the other hand ... there are users who have no internet at home and who would love to have the entire Tutorials section we have put here ... I put myself in their place (in fact I am in it, because at home I have no internet), and it is not pleasant to be at the computer, have a problem or want to do something, and not be able to because you have no access to the network of networks.
Whether it is right or wrong is up to each administrator and each one's reality ... what would concern me most is the resource consumption that wget causes on the server, but with a good cache system that should be enough to keep the server from suffering.
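On that note, if you do decide to mirror something, wget's standard throttling flags let you be gentle with the server. A considerate sketch (the values are arbitrary examples):

wget -r -k --wait=2 --random-wait --limit-rate=100k http://www.site.com

Here --wait pauses a couple of seconds between requests, --random-wait varies that pause, and --limit-rate caps the bandwidth, so the download looks less like an assault and more like a patient reader.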
I ask you, please, not to start downloading DesdeLinux now, HAHAHA! For example, my girlfriend asked me to download some Geometry Dash cheats; I will not download the entire website, I will just open the desired page and save it as PDF or HTML or something like that, which is what I would recommend to you.
If there is a DesdeLinux tutorial you want to keep, save it in your bookmarks, or as HTML or PDF ... for one or two tutorials there is no need to generate excessive traffic and load on the server 😉
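For that single-tutorial case, wget can also grab just one page with everything it needs to display offline, rather than a whole mirror. A sketch, with the URL as a placeholder:

wget -p -k -E http://www.site.com/some-tutorial

Here -p also fetches the page's images and stylesheets, -k rewrites the links for offline viewing, and -E saves the result with an .html extension so it opens cleanly in a browser.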
Well then, I hope this is useful ... Greetings