Download Website’s Textual Content In One File With Wget

Concatenating a site's pages into one huge text glob can be a handy strategy for web-scraping analysis. In this tutorial we will use wget, the classic command-line download utility. The following command downloads all of a site's HTML content into a single concatenated file called www.txt (a short worked example follows the parameter breakdown below).

wget --mirror --wait=1.5 --random-wait -e robots=off --reject '*.js,*.css,*.ico,*.txt,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso' --ignore-tags=img,link,script --header="Accept: text/html" --tries=8 --output-document=www.txt [URL]

PARAMETER BREAKDOWN

  • --mirror = Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings.
  • --wait = Wait the specified number of seconds (here 1.5) between retrievals, so the crawl doesn't hammer the server.
  • --random-wait = Causes the time between requests to vary between 0.5 and 1.5 × the --wait value; with --wait=1.5 that means 0.75 to 2.25 seconds.
  • --execute robots=off = Ignore the website's robots.txt (written with the shorthand -e in the command above).
  • --reject = Specify a comma-separated list of file-name suffixes or patterns to skip; here it filters out scripts, stylesheets, and media so only HTML is fetched.
  • --ignore-tags = Skip certain HTML tags when recursively looking for documents to download; img, link, and script are listed so their targets are never fetched.
  • --header = Send the given header line along with the rest of the headers in each HTTP request; here Accept: text/html tells servers we want HTML.
  • --tries=8 = Set the number of retries per file to 8 (wget's default is 20).
  • --output-document=www.txt = Instead of writing each page to its own file, concatenate all downloaded documents into the single file www.txt. Note that wget will typically warn that time-stamping (part of --mirror) does nothing in combination with -O; the download still proceeds.
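
To make this concrete, here is what a run might look like against the placeholder domain example.com (substitute your own target), followed by a few standard Unix one-liners for sanity-checking the result. The sed expression is a deliberately crude tag strip that misses tags spanning multiple lines, so treat it as a first pass rather than a robust HTML-to-text converter; the file name www-plain.txt is just an illustration.

wget --mirror --wait=1.5 --random-wait -e robots=off \
  --reject '*.js,*.css,*.ico,*.txt,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso' \
  --ignore-tags=img,link,script --header="Accept: text/html" \
  --tries=8 --output-document=www.txt https://example.com/

wc -c www.txt                                # total size of the concatenated dump
grep -c '<title' www.txt                     # rough count of pages captured
sed 's/<[^>]*>//g' www.txt > www-plain.txt   # crude strip of HTML tags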

