path: root/man/1/webgrab
author: Charles.Forsyth <devnull@localhost> 2006-12-22 20:52:35 +0000
committer: Charles.Forsyth <devnull@localhost> 2006-12-22 20:52:35 +0000
commit: 46439007cf417cbd9ac8049bb4122c890097a0fa
tree: 6fdb25e5f3a2b6d5657eb23b35774b631d4d97e4 /man/1/webgrab
parent: 37da2899f40661e3e9631e497da8dc59b971cbd0
20060303-partial
Diffstat (limited to 'man/1/webgrab')
-rw-r--r-- man/1/webgrab 130
1 files changed, 130 insertions, 0 deletions
diff --git a/man/1/webgrab b/man/1/webgrab
new file mode 100644
index 00000000..a50d9d6e
--- /dev/null
+++ b/man/1/webgrab
@@ -0,0 +1,130 @@
+.TH WEBGRAB 1
+.SH NAME
+webgrab \- fetch web page content as files
+.SH SYNOPSIS
+.B webgrab
+[
+.B -r
+] [
+.B -v
+] [
+.BI -o " stem"
+]
+.I url
+.SH DESCRIPTION
+.I Webgrab
+connects to the web server named in the
+.IR url .
+It fetches the content of the web page also determined by the
+.IR url ,
+and stores it locally in a file.
+If the page is written in HTML,
+.I webgrab
+reads it to build a list of sub-component pages (eg, frames) and images.
+It fetches those, saving the content in separate files.
+It adds a comment to the end of each HTML file giving the time and the file's origin.
+It automatically follows redirections offered by the server.
+.PP
+The
+.I stem
+of the names of the output files is normally derived from a component of the
+.IR url .
+If the
+.I url
+contains a path name, the
+.I stem
+is the final component of that path, less any dot-separated suffix and prefix.
+For example, given
+.IP
+.BR http://www.vitanuova.com/inferno/old.index.html
+.PP
+the stem would be
+.BR index .
+If there is no path name, but the
+.I url
+contains a domain name, the
+.I stem
+is the penultimate component of the domain name (eg, excluding
+trailing
+.BR .com ,
+and initial
+.BR www ,
+etc).
+For example, given
+.IP
+.B www.innerhost.vitanuova.com
+.PP
+the stem would be
+.BR vitanuova .
+If all else fails,
+.I webgrab
+uses the
+.I stem
+.BR webgrab .
+.PP
+Given a
+.IR stem ,
+the initial page is stored in
+.IB stem . suffix
+where
+.I suffix
+is the suffix (eg,
+.BR .html )
+of the name of the original page.
+Subordinate pages are saved in a similar way in files named
+.IB stem _1. suffix1,
+.IB stem _2. suffix2,
+\&... .
+.PP
+The options are:
+.TP
+.B -r
+do not fetch subcomponents (just the `raw' source of
+.I url
+itself)
+.TP
+.B -v
+print a progress report
+.TP
+.B -vv
+print a chatty progress report
+.TP
+.BI -o " stem"
+use the
+.I stem
+as given
+.PP
+.I Webgrab
+reads the
+configuration file
+.B /services/webget/config
+(if it exists),
+to look for the address of an optional HTTP proxy
+(in the
+.B httpproxy
+entry), and a list of domains for which a proxy should not be used
+(in the
+.B noproxy
+or
+.B noproxydoms
+entry). If symbolic network and service names might be involved, the
+connection server
+.B lib/cs
+needs to be already running.
+.SH FILES
+.B /services/webget/config
+.SH SOURCE
+.B /appl/cmd/webgrab.b
+.SH BUGS
+It should read the proxy name from the
+.IR charon (1)
+configuration file and not the
+.I webget
+configuration file.
+.br
+It cannot do `secure' transfers
+.RB ( https ).
+.br
+Its HTML parsing is naive, but on the other hand, it is less likely to trip over HTML novelties.
+.SH "SEE ALSO"
+.IR cs (8)
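
The stem and file-naming rules described in the man page above can be illustrated with a short sketch. This is not the Limbo code in /appl/cmd/webgrab.b; it is a Python approximation written only from the man page text. The function names derive_stem and output_names are hypothetical, and several details are assumptions: that "the component of that path" means the final one, that a name with several dots yields the piece just before the suffix, that query strings and port numbers are ignored, and that a host with fewer than two components falls back to "webgrab".

```python
def derive_stem(url: str) -> str:
    """Approximate the stem rules from webgrab(1); a sketch, not the real code."""
    # Drop a leading scheme such as "http://", if there is one.
    rest = url.split("://", 1)[-1]
    host, _, path = rest.partition("/")

    # Assumption: query strings and fragments play no part in the stem.
    path = path.split("?", 1)[0].split("#", 1)[0]

    # Rule 1: use the final path component, less suffix and prefix.
    last = path.rstrip("/").rsplit("/", 1)[-1]
    parts = [p for p in last.split(".") if p]
    if len(parts) == 1:
        return parts[0]
    if len(parts) >= 2:
        # Assumption: keep the piece just before the dot-separated suffix.
        return parts[-2]

    # Rule 2: no path, so use the penultimate component of the domain name.
    labels = [p for p in host.split(":", 1)[0].split(".") if p]
    if len(labels) >= 2:
        return labels[-2]

    # Rule 3: if all else fails, use the fixed stem.
    return "webgrab"


def output_names(stem: str, suffixes: list[str]) -> list[str]:
    """Name the initial page stem.suffix and later pages stem_N.suffixN."""
    names = []
    for i, suffix in enumerate(suffixes):
        base = stem if i == 0 else f"{stem}_{i}"
        names.append(f"{base}.{suffix}")
    return names


if __name__ == "__main__":
    print(derive_stem("http://www.vitanuova.com/inferno/old.index.html"))
    print(derive_stem("www.innerhost.vitanuova.com"))
    print(output_names("index", ["html", "gif", "jpg"]))
```

Both examples given in the man page (index and vitanuova) come out as documented under these assumptions; inputs that the man page does not cover may be handled differently by the real program.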