Mirror a website and download a specific file type with BASH
I am trying to archive collections from several websites. I want to be able to maintain them with some sort of organization; ideally I would store them in a mirrored directory structure. Below is my attempt:

    wget -m -x -e robots=off --no-parent --accept "*.ext" http://example.com

While using the "-m" option, is there a limit on how far it goes? (Will it wander off the web-site? Will it go on forever?) If so, would it be better to use

    wget -r -x -e robots=off --no-parent --accept "*.ext" --level 2 http://example.com

Is this a reasonable way to do this? I know "wget" has a --spider option; is it stable?
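If you want to see what a recursive run would fetch before committing to a download, one approach (a sketch, not a confirmed recipe for your site; example.com is a placeholder) is to use --spider and pull the visited URLs out of the log:

    # Dry run: list the URLs wget would visit at depth 2, without saving anything.
    wget --spider -r --level 2 --no-parent -e robots=off http://example.com 2>&1 \
        | grep '^--' | awk '{ print $3 }' | sort -u > urls-to-fetch.txt

Each page wget visits is logged on a line starting with "--<timestamp>--  <url>", so the grep/awk pair just extracts the URL column.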
Edit:

This is the solution I have found.

The files I am looking for are tagged and stored in a single directory on the server side. When trying variations of wget I was able to get the structure of the links and the various files, but I kept having trouble with links running in loops, so I came up with a work-around. It works, but it is slow. Any advice on how to increase its efficiency?

The structure of the website and files I am trying to get:
    home
    ├── foo
    │   ├── paul.mp3
    │   ├── saul.mp3
    │   ├── micheal.mp3
    │   ├── ring.mp3
    ├── bar
        ├── nancy.mp3
        ├── jan.mp3
        ├── mary.mp3

So first I created a file containing the tags of the files I want:

    taglist.txt:
    foo
    bar

The script:
    #!/bin/bash
    # this script seems to work until the download part

    url="http://www.example.com"
    link_file=taglist.txt

    while read tag; do
        mkdir "$tag"
        cd "$tag"

        # get the URLs from the tag page
        wget -q "$url/$tag" -O - | \
            tr "\t\r\n'" '   "' | \
            grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
            sed -e 's/^.*"\([^"]\+\)".*$/\1/g' > tmp.urls.txt

        # clean and sort the URLs
        grep -i 'http://www.example.com/storage_dir/' tmp.urls.txt | sort -u > tmp.curls.txt

        # download each page URL
        while read tape_url; do
            #wget -r -A.mp3 "$tape_url"
            wget -O "tmp.$RANDOM" "$tape_url"
        done < tmp.curls.txt

        # find the .mp3 links in the downloaded pages
        grep -r -o -E 'href="[^"#]+\.mp3"' * | cut -d'"' -f2 | sort | uniq > "$tag.mp3.list"

        # clean up
        rm tmp.*

        # download the collected URLs
        wget -i "$tag.mp3.list"

        cd ..
    done < "$link_file"
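One way to speed this up (a sketch built on the same assumed URL layout, http://www.example.com/<tag>) is to let wget do the link extraction and filtering itself, recursing one level from each tag page, instead of saving every page to a temp file and grepping it afterwards:

    #!/bin/bash
    # For each tag, recurse one level from the tag page and keep only .mp3 files.
    url="http://www.example.com"
    while read tag; do
        # -P "$tag": save into the tag's directory
        # -nd: don't recreate the server's directory tree inside it
        # -np: don't ascend to the parent directory
        wget -r --level 1 -np -nd -A '*.mp3' -e robots=off \
             -P "$tag" "$url/$tag"
    done < taglist.txt

The trade-off is that you lose the intermediate $tag.mp3.list files; whether that matters depends on whether you use them for anything else.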
By reading the man page for wget, you'll see the following answers to your questions:

-m is equivalent to -r -N -l inf --no-remove-listing, which means it will (a) recurse, (b) download a file from the server if it is newer than the version you already have, (c) not limit the recursion depth, and (d) keep the listing files along the way to make sure all files have been fetched.

Yes, recursion will follow links wherever they may go, which is why there is a default recursion depth of 5. By using -m, however, you are turning off the depth limit, so you could potentially download the entire internet to your computer. That's why you should read the "Recursive Accept/Reject Options" section of the man page. It tells you how to limit the recursion; for example, you can specify that only links within a given domain are followed.

-r with --level 2 will restrict the recursion, but (a) it does not guarantee that you won't visit other sites, and (b) you will miss a large amount of the site you want to mirror.

--spider does not download files; it only visits pages.

Note that the -m directive alone will still not capture all of the files you need to mirror the entire site. You'll need to use the -p option to get the page prerequisites for every page you visit.
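Putting those points together, something along these lines (a sketch; example.com and *.ext are placeholders) mirrors without a depth limit while keeping wget on the one domain and fetching page prerequisites:

    wget -m -p -x -np -e robots=off \
         --accept '*.ext' --domains example.com \
         http://example.com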