With the terminal: Download a complete website with Wget

There is nothing better than Wikipedia to explain what this tool is:

GNU Wget is a free software tool that makes it simple to download content from web servers. Its name derives from World Wide Web (w) and "get", meaning: get from the WWW.

Currently it supports downloads using the HTTP, HTTPS and FTP protocols.

Among its most outstanding features are easy recursive downloading of complex mirrors, conversion of links so that HTML content can be browsed locally, proxy support, and more.

It's true that there are other applications that help with this kind of job, such as HTTrack, or even Firefox extensions like ScrapBook, but nothing beats the simplicity of a terminal 😀

Doing the magic

I was struck by the movie The Social Network, where the Mark Zuckerberg character says «a bit of wget magic» as he is about to download the photos for Facemash 😀 And it's true: wget lets you do magic with the right parameters.

Let's look at a couple of examples, starting with the simplest use of the tool.

To download a single page:

$ wget https://blog.desdelinux.net/con-el-terminal-bajar-un-sitio-web-completo-con-wget

To download the entire site recursively, including images and other types of data:

$ wget -r https://blog.desdelinux.net/

And here comes the magic. As the Humans article explains, many sites check the browser's identity (the user agent) in order to apply various restrictions. With wget we can get around this as follows:

$ wget -r -p -U Mozilla https://blog.desdelinux.net/

We can also pause between pages and limit the download rate; otherwise the site owner may notice that we are downloading the whole site with wget:

$ wget --wait=20 --limit-rate=20K -r -p -U Mozilla https://blog.desdelinux.net/
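
The link conversion mentioned earlier is what makes the copy browsable offline. A minimal sketch combining the standard mirroring options (the URL is just the same example blog):

$ wget -m -k -p --no-parent -U Mozilla https://blog.desdelinux.net/

Here -m (--mirror) enables recursive downloading with timestamping, -k (--convert-links) rewrites the links to point to the local copies, and -p (--page-requisites) pulls in the CSS, JavaScript and images each page needs.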



  1.   pandev92 said

    Is there something to download only the images xd?

    1.    Courage said
      1.    pandev92 said

        lol oo xd

    2.    KZKG ^ Gaara said

      man wget 😉

      1.    pandev92 said

        Life is too short to read mans.

        1.    KZKG ^ Gaara said

          Life is too short to fill your brain with information, but it's still worth trying 🙂

          1.    pandev92 said

            Information is only worth so much; I prefer to fill it with women, games and money if possible XD.

          2.    Courage said

            You're always fucking thinking about women. From now on you will be listening to Daddy Yankee, Don Omar and Wisin y Yandel like KZKG ^ Gaara does.

            Better dedicate yourself to money, which is the most important thing in this life.

            1.    KZKG ^ Gaara said

              There are things that are worth much more than money ... for example, going down in history, making a difference, being remembered for how much you managed to contribute to the world, and not for how much money you had when you died 😉

              Try not to become a man of success but rather a man of value, Albert Einstein.


          3.    Courage said

            And can a beggar living under a bridge do that without having a penny?

            Well, no

          4.    Courage said

            *to have

          5.    pandev92 said

            Courage, I had my reggaeton phase, but not anymore, that was years ago; I only listen to Japanese music and classical music now, and as for the money… we're working on it :).

          6.    pandev92 said

            I don't care about being remembered, Gaara; when I'm dead I'll be dead and to hell with everyone else, since I won't even be able to know what they think of me. What good is being remembered if you can't even be proud of it xD.

    3.    hypersayan_x said

      To download only specific file types you can use filters (see the sketch just below):

      https://www.gnu.org/software/wget/manual/html_node/Types-of-Files.html

      And a tip: if you're going to clone a very large site, it's advisable to do it through a proxy such as Tor, because otherwise some sites will block your IP for several hours or days once you exceed a certain number of consecutive requests.
      That happened to me once when I wanted to clone a wiki.
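
      A rough sketch of such a filter, accepting only common image types (the URL and the extension list are placeholders to adjust to your case):

      $ wget -r -nd -A jpg,jpeg,png,gif https://example.com/

      -A takes a comma-separated list of suffixes to accept (-R rejects the listed suffixes instead), and -nd keeps wget from recreating the remote directory tree locally.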

    4.    mdir said

      An extension, which I use in Firefox, downloads only images; it's called "Save Images 0.94"

  2.   Brown said

    Hey, a question hehe, where do the files I download get saved? You're going to want to kill me, right? LOL

    1.    KZKG ^ Gaara said

      The files are downloaded to the folder where you are located in the terminal when executing wget 😉

  3.   auroszx said

    Ahh, I didn't imagine that wget could have such an interesting use… Now, regarding the use that Courage mentions… No words 😉

  4.   Carlos-Xfce said

    Does anyone know if there is a WordPress plug-in that prevents Wget from downloading your blog?

  5.   darzee said

    Well, that comes in really handy!! Thank you

  6.   piolavski said

    Very good, I'll try it and see how it goes, thanks for the contribution.

  7.   lyairmg said

    Although I consider myself a beginner, this is easy for me; now I'll try to combine it with other things and see what comes out…

  8.   oswaldo said

    I hope you can help me because it's due on Monday, December 3, 2012.

    The project to be developed is the following:

    Relocation of a website, adjusting the href references.
    1.- Given a Web site, download the complete site to a local directory using the wget command, and then, with a script of your own, perform the following operations:

    1.1.- Create a separate directory for each type of content: images (gif, jpeg, etc.), videos (avi, mpg, etc.), audio (mp3, wav, etc.) and web content (HTML, JavaScript, etc.).

    1.2.- Once this content has been relocated, adjust the references so they point to the local location of each resource on the site.

    1.3.- Start a Web server and configure the directory holding the Web site backup as the root directory of the local Web server.

    1.4.- Note: the wget command may only be used with the following options:
    --recursive
    --domains
    --page-requisites
    If for some reason more options are needed, use whatever is necessary.
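
    A minimal sketch of the download step using only those options (the domain and URL are placeholders):

    $ wget --recursive --domains=example.com --page-requisites http://example.com/

    Here --domains limits recursion to the listed domain(s) and --page-requisites also fetches the CSS, images and scripts each page needs.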

    1.    KZKG ^ Gaara said

      For the download I think you have the solution in the post; as for moving the files and replacing the paths, I had to do something similar a while ago at work. Here is the script I used: http://paste.desdelinux.net/4670

      Adapt it to your file types and paths, that is, to how the .HTML files of your site are structured and so on.

      It's not a 100% solution, since you will have to make a few adjustments, but I guarantee it covers 70 or 80% of the work 😉

      1.    oswaldo said

        Thanks KZKG ^ Gaara, it has been a great help to me.

  9.   debt said

    I have always used HTTrack. I'm going to try ScrapBook for Firefox, but I love wget. Thank you!

  10.   Daniel PZ said

    Man, the command did not work for me ... this one did work well for me:

    wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com

    1.    Daniel said

      Thanks a lot! I used it with the parameters proposed by Daniel PZ and I had no problems 🙂

  11.   Ruben Almaguer said

    Thanks, man. I did that with wget on my Puppy Linux but I didn't know how to do it from the terminal. Greetings.

  12.   stubborn said

    Where are the downloaded pages saved?

    1.    Hache said

      In the directory where you have the terminal open. By default, your user's home folder, unless you specify another path.

  13.   fernando said

    Does it also download the links? So if there is a link to a PDF or another document, does it download that too?

  14.   river said

    What can I do to download my entire blog? I tried it, but what I downloaded seems to be in code or blocked; despite taking many hours to download, only the initial page can be read. What do you recommend for downloading my blog? Thanks, Raul.

  15.   leo said

    Hello, one question: is it possible to replace the links inside the HTML so that you can later browse the downloaded page as if it were the original?

    What happens is that when I download the page and open it from the downloaded files, it doesn't pick up the .css or .js, and the links on the page take me back to the page on the Internet.