http - Mirror a website and download a specific file type with BASH
I am trying to archive collections from several websites. I want to be able to maintain them with some sort of organization; ideally, I would store them in a mirrored directory structure. Below is my attempt:
wget -m -x -e robots=off --no-parent --accept "*.ext" http://example.com
While using the "-m" option, is there a limit on how far it goes? (Will it wander off the website? Will it go forever?) If so, would it be better to use:
wget -r -x -e robots=off --no-parent --accept "*.ext" --level 2 http://example.com
Is this a reasonable way to do this? I know "wget" has a --spider option; is it stable?
EDIT:
This is the solution I have found.
The files I am looking for are tagged and stored in a single directory on the server side. When trying variations of wget, I was able to get the structure of links and the various files, but I kept having trouble with links running in loops. I came up with a workaround. It works, but it is slow. Any advice on how to increase efficiency?
The structure of the website and the files I am trying to get:
home
├──foo
│  ├──paul.mp3
│  ├──saul.mp3
│  ├──micheal.mp3
│  ├──ring.mp3
├──bar
   ├──nancy.mp3
   ├──jan.mp3
   ├──mary.mp3
So first I created a file with the tags of the files I want:
taglist.txt
foo
bar
The script:
#!/bin/bash
# This script seems to work until the download part
url="http://www.example.com"
link_file=taglist.txt

while read tag; do
    mkdir "$tag"
    cd "$tag"

    # Get the URLs from the tag page
    wget -q $url/$tag -O - | \
        tr "\t\r\n'" ' "' | \
        grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
        sed -e 's/^.*"\([^"]\+\)".*$/\1/g' > tmp.urls.txt

    # Clean and sort the URLs
    grep -i 'http://www.example.com/storage_dir/*' tmp.urls.txt | sort -u > tmp.curls.txt

    # Download each page URL
    while read tape_url; do
        #wget -r -A.mp3 $tape_url
        wget -O tmp.$RANDOM $tape_url
    done < tmp.curls.txt

    # Find the .mp3 links in the downloaded pages
    grep -r -o -E 'href="([^"#]+)\.mp3"' * | cut -d'"' -f2 | sort | uniq > $tag.mp3.list

    # Clean up
    rm tmp.*

    # Download the collected URLs
    wget -i $tag.mp3.list

    cd ..
done < "$link_file"
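For comparison, a possibly faster variant would be to let wget do the recursion, link extraction, and .mp3 filtering itself, one pass per tag. This is only a hedged sketch; the URL is the question's placeholder, and the depth and filtering flags are assumptions based on the layout described above (tag page, then the linked storage pages, then the .mp3 files):

#!/bin/bash
url="http://www.example.com"
link_file=taglist.txt

while read -r tag; do
    # -r -l2: recurse two levels (tag page -> linked pages -> .mp3 files)
    # -A '*.mp3': keep only .mp3 files; HTML is fetched for link parsing, then deleted
    # -nd -P "$tag": flatten the server's directory layout and save everything into ./$tag
    wget -r -l2 -nd -e robots=off -A '*.mp3' -P "$tag" "$url/$tag"
done < "$link_file"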
By reading the man page for wget, you'll see the following answers to your questions:

-m is equivalent to -r -N -l inf --no-remove-listing, which means (a) recurse, (b) only download a file from the server if it is newer than the version you already have, (c) do not limit the recursion depth, and (d) keep placeholder files along the way to make sure all files have been fetched.

Yes, recursion will follow links wherever they may go, which is why there is a default recursion depth of 5. By using -m, however, you are turning off that depth limit, so you could potentially download the entire Internet to your computer. That's why you should read the "Recursive Accept/Reject Options" section of the man page: it tells you how to limit recursion. For example, you can specify that only links within a given domain should be followed.

-r --level 2 will restrict recursion, but it (a) does not guarantee that you won't visit other sites, and (b) will miss a large part of the site you want to mirror.

--spider does not download files; it only visits pages.
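For example, the accept/reject options can keep an otherwise unbounded -m crawl on a single host and a single file type. A hedged sketch, with the URL and extension taken from the question's placeholders:

wget -m -e robots=off --no-parent --domains=example.com --accept "*.mp3" http://example.com/foo/

Without -H, wget will not span hosts during recursion anyway, so --domains mainly serves as an explicit safety net here.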
Note that -m alone still will not capture all of the files you need to mirror the entire site. You'll need to add the -p option to fetch the page requisites for every page you visit.
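Putting the two notes together, a mirror pass that also pulls in the page requisites could look like the following hedged sketch (placeholder URL; -k, which rewrites links for offline browsing, is an optional extra rather than something the man page requires):

wget -m -p -k -e robots=off --no-parent http://example.com/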