http - Mirror a website and download a specific file type with BASH


I am trying to archive collections from several websites. I want to be able to maintain them with some sort of organization; ideally I would store them in a mirrored directory structure. Below is my attempt:

wget -m -x -e robots=off --no-parent --accept "*.ext" http://example.com 

while using "-m" option have limit on how far goes? (will wander off the web-site? go forever?) if so, better use

wget -r -x -e robots=off --no-parent --accept "*.ext" --level 2 http://example.com 

Is this a reasonable way to do this? I know "wget" has a --spider option; is it stable?

Edit:

This is the solution I have found.

The files I am looking for are tagged and stored in a single directory on the server side. While trying variations of wget I was able to get the structure of links and the various files, but I kept having trouble with links running in loops, so I came up with a workaround. It works, but it is slow. Any advice on how to increase efficiency?

The structure of the website and the files I am trying to get:

home
├──foo
│  ├──paul.mp3
│  ├──saul.mp3
│  ├──micheal.mp3
│  ├──ring.mp3
├──bar
   ├──nancy.mp3
   ├──jan.mp3
   ├──mary.mp3

So first I created the file with the tags of the files I want:

taglist.txt:
foo
bar

The script:

#!/bin/bash
# This script seems to work until the download part

url="http://www.example.com"
link_file=taglist.txt

while read tag; do
    mkdir "$tag"
    cd "$tag"

    # get the URLs from the tag page
    wget -q "$url/$tag" -O - | \
        tr "\t\r\n'" '   "' | \
        grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
        sed -e 's/^.*"\([^"]\+\)".*$/\1/g' > tmp.urls.txt

    # clean and sort the URLs
    grep -i 'http://www.example.com/storage_dir/*' tmp.urls.txt | sort -u > tmp.curls.txt

    # download each page URL
    while read tape_url; do
        #wget -r -A.mp3 "$tape_url"
        wget -O tmp.$RANDOM "$tape_url"
    done < tmp.curls.txt

    # find the .mp3 links in the downloaded files
    grep -r -o -e 'href="[^"#]*\.mp3"' * | cut -d'"' -f2 | sort | uniq > "$tag".mp3.list

    # clean up
    rm tmp.*

    # download the collected URLs
    wget -i "$tag".mp3.list

    cd ..
done < "$link_file"
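One possible simplification, given as a sketch only: if the layout really is tag page -> /storage_dir/ page -> .mp3 file (two hops, as assumed above with the placeholder host www.example.com), much of the manual link scraping could probably be handed to wget's own recursion and filtering. This is untested against the real site and the flags may need adjusting:

#!/bin/bash
# Sketch: one recursive wget per tag instead of scraping links by hand.
# Assumes the same placeholder host and directory names as the script above.
url="http://www.example.com"
link_file=taglist.txt

while read -r tag; do
    # -r -l 2                 : tag page -> storage_dir page -> .mp3 is at most two hops
    # -I "/$tag,/storage_dir" : only descend into these directories
    # -A mp3                  : keep only .mp3 files (HTML is fetched for parsing, then deleted)
    # -nd -P "$tag"           : flatten the remote tree into one local directory per tag
    wget -r -l 2 -e robots=off -I "/$tag,/storage_dir" -A mp3 -nd -P "$tag" "$url/$tag"
done < "$link_file"

Whether the include/accept flags catch everything depends on how the real pages link to the files, so the grep-based approach above may still be needed for unusual links.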

By reading the man page for wget, you'll see the following answers to your questions:

  • -m is equivalent to -r -N -l inf --no-remove-listing, which means (a) recurse, (b) only download a file from the server if it is newer than the version you have, (c) do not limit the recursion depth, and (d) keep the directory listing files along the way so you can be sure all the files have been fetched.

  • Yes, the recursion will follow links wherever they may go, which is why there is a default recursion depth of 5. By using -m, however, you are turning off that depth limit, so you could potentially download the entire internet onto your computer. That's why you should read the Recursive Accept/Reject Options section of the man page. It tells you how to limit the recursion; for example, you can specify that only links within the domain are followed (see the sketch after this list).

  • -r --level 2 will restrict the recursion, but it (a) does not guarantee that you don't visit other sites, and (b) will miss a large amount of the site you want to mirror.

  • --spider does not download the files; it only visits the pages.
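For example (a sketch only, reusing the placeholder example.com and *.ext from the question), the accept/reject options let you mirror without a depth limit while confining the recursion:

wget -m -e robots=off --no-parent --domains example.com --accept "*.ext" http://example.com/

Here --domains limits which hosts the recursion may follow, --no-parent keeps wget from climbing above the starting directory, and --accept controls which files are kept.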

Note that the -m directive alone still will not capture all the files you need to mirror the entire site. You'll also need to use the -p option to get the page requisites for every page you visit.
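Putting that note together with the original command, a minimal sketch (host and extension are still the placeholders from the question):

wget -m -p -e robots=off --no-parent --accept "*.ext" http://example.com

-p/--page-requisites pulls in the images, stylesheets, and other files each visited page needs to display properly; keep in mind that a narrow --accept pattern can filter some of those requisites back out, so the pattern may need to be widened.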

