Download an entire site with wget even if there are restrictions

What is wget?

Nothing better than Wikipedia to explain what this tool is:

GNU Wget is a free software tool that makes it simple to download content from web servers. Its name derives from World Wide Web (w) and "get", that is: get from the WWW.

Currently it supports downloads using the HTTP, HTTPS and FTP protocols.

Among wget's most notable features are easy recursive downloading of complex mirrors, conversion of links so the HTML content can be viewed locally, support for proxies, and more.

We have already talked plenty about wget here at DesdeLinux. In fact, we had already seen how to download a complete website with wget. The problem is that nowadays administrators do not always allow just anyone to download their entire site like that; it is not something they really like, and obviously I understand that. The site is on the internet to be consulted: the reader accesses content of interest and the site administrator benefits (through advertising, visits, etc.). If the reader downloads the site to their computer, they will not have to go online to look up an old post.

Downloading a site with wget is as simple as:

wget -r -k http://www.sitio.com

  • -r : tells wget to download the entire website recursively.
  • -k : tells wget to convert the links so the downloaded site can be viewed on a computer without internet access (a slightly fuller variant is sketched just below).
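A slightly fuller variant of the same idea (these extra options are my own suggestion, not part of the original command): -p also downloads the images and CSS each page needs, -E saves pages with an .html extension, and --no-parent keeps wget from climbing above the starting URL.

# recursive mirror plus page requisites, .html extensions, and no parent directories
wget -r -k -p -E --no-parent http://www.sitio.com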

Now, things get complicated when the site administrator makes it difficult for us ...

What restrictions might exist?

The most common one is that the site only allows access to clients with a recognized User-Agent. In other words, the site notices that the User-Agent downloading so many pages is not one of the "normal" ones and closes off access.

Also, through the robots.txt file the site can specify that wget (like a bunch of similar apps) may not download whatever the client wishes... well, whatever the site administrator wishes, period 😀
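To make that concrete, you can use wget itself to look at what a site's robots.txt forbids; a small sketch (the output file name is just an example), plus a hypothetical rule set that would block wget by name:

# download the site's robots.txt to see what it disallows
wget -O robots.txt http://www.sitio.com/robots.txt
# a hypothetical robots.txt that singles out wget and blocks everything would look like:
#   User-agent: Wget
#   Disallow: /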

How to circumvent these restrictions?

For the first case we will set a User-Agent for wget, which we can do with the --user-agent option. Here is how:

wget --user-agent="Mozilla/5.0 (X11; Linux amd64; rv:32.0b4) Gecko/20140804164216 ArchLinux KDE Firefox/32.0b4" -r http://www.site.com -k

Now, to get around robots.txt, just tell wget to ignore that file, that is, let it download the site without caring what robots.txt says:

wget --user-agent="Mozilla/5.0 (X11; Linux amd64; rv:32.0b4) Gecko/20140804164216 ArchLinux KDE Firefox/32.0b4" -r http://www.site.com -k -e robots=off

Now... there are other options or parameters we can use to deceive the site further, for example, indicating that we arrived at the site from Google. Here is the final line with everything:

wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (X11; Linux amd64; rv:32.0b4) Gecko/20140804164216 ArchLinux KDE Firefox/32.0b4" --referer=http://www.google.com -r http://www.site.com -e robots=off -k

The site's address does not have to start with http://www; it can also be a plain http:// address, like this Geometry Dash one for example.

Is it ok to do this?

That depends... you always have to look at it from both points of view: the site administrator's, but also the reader's.

On the one hand, as an administrator, I would not like someone taking an HTML copy of my site just like that; it is not online for nothing, it is here for everyone's enjoyment... our goal is to have interesting content available that you can learn from.

But, on the other hand... there are users who do not have internet at home and who would like to have the entire Tutorials section we have put together here... I put myself in their place (in fact I am in their place, because I don't have internet at home), and it is not pleasant to be at the computer, having a problem or wanting to do something, and not being able to because you have no access to the network of networks.

Whether it is right or wrong is up to each administrator and each one's reality... what would concern me most is the resource consumption wget causes on the server, but a good caching system should be enough to keep the server from suffering.
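On that note, if you do mirror a site, wget itself has options to soften the load on the server; a hedged sketch with arbitrary example values:

# wait about 2 seconds between requests (randomized), cap bandwidth at 200 KB/s,
# and limit recursion to 3 levels so the server suffers less
wget -r -k -l 3 --wait=2 --random-wait --limit-rate=200k http://www.sitio.com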


Conclusions

Please don't all start downloading DesdeLinux now, HAHAHA! For example, my girlfriend asked me to download some Geometry Dash cheats (something like Geometry Dash Cheats); I will not download the entire website, I will just open the desired page and save it as PDF or HTML or something like that, which is what I would recommend to you.

If there is a DesdeLinux tutorial you want to keep, save it in your bookmarks, as HTML, or as PDF... for one or two tutorials there is no need to generate excessive traffic and load on the server 😉

Well, that's all. I hope it is useful... Greetings



23 comments


  1.   eliotime3000 said

    Interesting tip. I didn't know that you could do that.

  2.   Emmanuel said

    That is exactly what had happened to me a couple of times, and it was certainly because of this. Although in my case it was for speed reasons (home vs. university) that I wanted to access content that way. 😛
    Thanks for the advice. Regards.

  3.   Gerardo said

    Great for those of us who don't have internet. Certainly good tutorials.

  4.   Quinotto said

    Very interesting article.
    Question: how can it be done for https sites?
    That is, where you are required to authenticate with a username and password, and where a large part of the site is also written in Java?
    Greetings and Thanks
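    For what it's worth, wget handles HTTPS natively, and for sites protected by simple HTTP authentication it has dedicated options; a sketch with placeholder credentials and URL (wget only fetches files, it cannot execute the Java/JavaScript parts of a site):

    # username and password for HTTP authentication (values are placeholders)
    wget --user=myuser --password=mypass -r -k https://www.site.com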

  5.   Gelibassium said

    and where are the downloads saved?

    1.    Gelibassium said

      I answer myself: in the home folder. But now the question is... can you somehow tell it where to download the content?

      thanks

      1.    Daniel said

        I guess you first go into the folder where you want to save it and then run wget
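        For reference, wget can also be told where to save things directly; a small sketch (the folder name is just an example):

        # -P / --directory-prefix sets the destination directory for everything wget downloads
        wget -r -k -P ~/Downloads/sitio http://www.sitio.com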

  6.   cristian said

    A question... is there something like this to "clone" a database?

  7.   xphnx said

    Out of curiosity, do you get paid for placing those links to micro-niche websites?

  8.   Rupert said

    Blessed wget ... that's how I downloaded a lot of porn in my pig days xD

  9.   moony said

    good tip. thanks

  10.   NULL said

    Very good, I liked the part about circumventing the restrictions.

  11.   Franz said

    Thanks for that gem:
    wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (X11; Linux i686; rv:31) Gecko/20100101 Firefox/31" --referer=http://www.google.com -r https://launchpad.net/~linux-libre/+archive/ubuntu/rt-ppa/+files/linux-image-3.6.11-gnu-3-generic_3.6.11-gnu-3.rt25.precise1_i386.deb -k -e robots=off

    wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (X11; Linux i686; rv:31) Gecko/20100101 Firefox/31" --referer=http://www.google.com -r https://launchpad.net/~linux-libre/+archive/ubuntu/rt-ppa/+files/linux-headers-3.6.11-gnu-3_3.6.11-gnu-3.rt25.precise1_all.deb -k -e robots=off

    wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (X11; Linux i686; rv:31) Gecko/20100101 Firefox/31" --referer=http://www.google.com -r https://launchpad.net/~linux-libre/+archive/ubuntu/rt-ppa/+files/linux-headers-3.6.11-gnu-3-generic_3.6.11-gnu-3.rt25.precise1_i386.deb -k -e robots=off

  12.   Palomares said

    Very interesting.

  13.   oscar meza said

    wget is one of those ultra-powerful tools; with a little shell scripting you can make your own Google-style robot to start downloading the content of pages and store it in your own database, and later do whatever you want with that data.
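    As a rough illustration of that idea, a minimal sketch (the urls.txt file and the pages/ folder are assumptions for the example, not something from the comment):

    # fetch every URL listed in urls.txt and store each page under pages/, named by a hash of its URL
    mkdir -p pages
    while read -r url; do
        wget -q -O "pages/$(echo -n "$url" | md5sum | cut -d' ' -f1).html" "$url"
    done < urls.txt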

  14.   Charles G. said

    I find this tool very interesting; I had never paid attention to its parameters. I would like to know if it is possible to download content from an «X» page that you need to be logged into, and if somewhere on that «X» site there is a video, would it also be downloaded even if it belongs to a CDN different from the «X» site?

    If this were possible, how does a site protect against such a tool?

    Regards!

  15.   Erick zanardi said

    Good evening:

    I am writing to you with a question. Using the last command in this article, I downloaded almost 300 MB of information (.swf, .js and .html files) from the page http://www.netacad.com/es with my user account from a small course I did in Maracay, Venezuela.

    My question is… will it be possible to see the Flash animations?

    I go into "Global Configuration" and none of the options it shows lets me configure this.

    I appreciate any response.

    Thanks in advance!

    1.    ADX said

      I have the same issue: the .swf files only download halfway. If you manage to get around it, share the info with me. What I did last time was use a spider to get all the netacad links, but the .swf files still don't finish downloading as they should.

  16.   alejandro.hernandez said

    very good !!! thanks.

  17.   Ana said

    Hello, thanks for your tutorial. I am trying to download a blog I have been invited to, which requires a password, so that I can read it offline from home. I use this program and, obviously, I have the blog's password (WordPress), but I don't know how to proceed. Could you show me?
    Thanks in advance and best regards!
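    In case it helps, a password-protected WordPress blog can sometimes be mirrored by logging in with wget first and reusing the session cookie; a sketch under the assumption that the blog uses the standard wp-login.php form (URL and credentials are placeholders, and it may not work if the login flow is more elaborate):

    # log in once and keep the session cookie ("log" and "pwd" are WordPress's usual form field names)
    wget --save-cookies cookies.txt --keep-session-cookies --post-data='log=myuser&pwd=mypass&wp-submit=Log+In' https://www.blog.com/wp-login.php
    # then mirror the blog reusing that cookie
    wget --load-cookies cookies.txt -r -k https://www.blog.com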

  18.   Fran said

    what a great post !!!

  19.   Santiago said

    excellent, it has helped me a lot

  20.   Fran said

    I am logged in to a website with embedded Vimeo videos and there is no way to download them... it seems as if Vimeo has them protected. Any ideas??