From ad5b7acc19290ff91e0f42a0de448a26760fcf99 Mon Sep 17 00:00:00 2001 From: Xavier Roche Date: Mon, 19 Mar 2012 12:36:11 +0000 Subject: Imported httrack 3.20.2 --- HelpHtml/filters.html | 261 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 261 insertions(+) create mode 100644 HelpHtml/filters.html (limited to 'HelpHtml/filters.html') diff --git a/HelpHtml/filters.html b/HelpHtml/filters.html new file mode 100644 index 0000000..6438dab --- /dev/null +++ b/HelpHtml/filters.html @@ -0,0 +1,261 @@ + + + + + + + HTTrack Website Copier - Offline Browser + + + + + + + + + +
HTTrack Website Copier
+ + + + +
Open Source offline browser
+ + + + +
+ + + + +
+ + + + +
+ + +

Filters: Advanced

+ +
+ +See also: The FAQ
+ +
+ Once you have defined the start links, the default mode is to mirror these links: if one of your start pages is www.someweb.com/test/index.html, all links starting with www.someweb.com/test/ will be accepted. Links directly under www.someweb.com/ will not be accepted, because they are at a higher level in the site structure. This prevents HTTrack from mirroring the whole site. (All files at the same level as the start links, or below it, will be retrieved.)
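The default rule described above can be sketched as a simple prefix test. This is only an illustration of the stated behaviour, not HTTrack's actual code; the function names `start_directory` and `in_default_scope` are hypothetical, and the sketch assumes URLs are written without a scheme, as in the text.

```python
def start_directory(start_url: str) -> str:
    """Return the start link up to and including its last '/'.

    Assumption: the URL has no scheme prefix (e.g. 'www.someweb.com/test/index.html').
    """
    if "/" not in start_url:
        # A bare host name: its scope is the whole host.
        return start_url + "/"
    return start_url[: start_url.rfind("/") + 1]


def in_default_scope(start_url: str, link: str) -> bool:
    """True if 'link' lies at the same level as the start link, or below it."""
    return link.startswith(start_directory(start_url))


if __name__ == "__main__":
    start = "www.someweb.com/test/index.html"
    for candidate in ("www.someweb.com/test/sub/page.html",
                      "www.someweb.com/other.html"):
        print(candidate, in_default_scope(start, candidate))
```

With the start link from the text, a link under www.someweb.com/test/ is inside the default scope, while a link directly under www.someweb.com/ is not; filters are what change that decision.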
+
+
+ However, you may want to download files that are outside these subfolders or, conversely, refuse files of a particular type. That is the purpose of filters.
+ +
+ To accept a family of links (for example, all links with a specific name or type), you just have to add an authorization filter, like +*.gif. The filter is a plus sign (+) followed by a pattern composed of letters and wildcards (*).

+ To forbid a family of links, define an exclusion filter, like -*.gif. The filter is a dash (-) followed by the same kind of pattern as for the authorization filter.

+ Example: +*.gif will accept all files ending in .gif
+ Example: -*.gif will refuse all files ending in .gif
+
+ +
+ Let's talk a little more about patterns: + +
+ Filters are analyzed by HTTrack from the first filter to the last one. The complete URL is compared to the filters defined by the user or added automatically by HTTrack.

+ A filter has a higher priority than the ones before it, so the order matters:
+ +
+ + + +
+ +*.gif -image*.gif    Accepts all gif files EXCEPT image1.gif, imageblue.gif, imagery.gif and so on
+ -image*.gif +*.gif    Accepts all gif files, because the second filter takes priority (it is defined AFTER the first one)
+
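The ordering rule above (a later filter overrides an earlier one) can be sketched as follows. This is a minimal illustration, not HTTrack's implementation: the names `star_to_regex` and `decide` are hypothetical, only the plain * wildcard is handled, and the sketch assumes each pattern must cover the complete URL string. Under that assumption a file inside a host is reached with the */image*.gif form rather than image*.gif.

```python
import re
from typing import List, Optional


def star_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern that only uses '*' into a compiled regex."""
    # Every literal piece is escaped; each '*' becomes 'any characters'.
    return re.compile(".*".join(re.escape(piece) for piece in pattern.split("*")))


def decide(filters: List[str], url: str) -> Optional[bool]:
    """Apply filters first to last; the last one that matches wins.

    Returns True (accept), False (refuse), or None when no filter matches,
    in which case the default mirroring rule would apply.
    """
    verdict = None
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if sign not in ("+", "-"):
            raise ValueError("a filter must start with '+' or '-': " + rule)
        if star_to_regex(pattern).fullmatch(url):
            verdict = (sign == "+")  # a later match overwrites an earlier one
    return verdict


if __name__ == "__main__":
    url = "www.someweb.com/image1.gif"
    print(decide(["+*.gif", "-*/image*.gif"], url))
    print(decide(["-*/image*.gif", "+*.gif"], url))
```

Swapping the two filters changes the verdict for the same URL, which is the point of the two rows above.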
+ +
+ We saw that patterns are composed of letters and wildcards (*), as in */image*.gif + +


+ Special wildcards of the form *[..] can be used to match specific characters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
*    any characters (the most commonly used)
*[file] or *[name]    any filename or name, i.e. no /, ? or ; characters
*[path]    any path (and filename), i.e. no ? or ; characters
*[a,z,e,r,t,y]    any letters among a,z,e,r,t,y
*[a-z]    any letters
*[0-9,a,z,e,r,t,y]    any characters among 0..9 and a,z,e,r,t,y
*[]    no characters may follow
+ + +
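One way to read the table above is as a translation from filter wildcards to regular expressions. The sketch below is an interpretation under stated assumptions, not HTTrack's matcher: `filter_to_regex` and `matches` are hypothetical names, each *[..] form is assumed to match zero or more characters, and a pattern that does not end in *[] is assumed to tolerate a trailing ?query part (inferred from the description of *.html*[] in this page).

```python
import re


def _class_item(item: str) -> str:
    """Turn one comma-separated item ('a' or 'a-z') into regex class text."""
    if len(item) == 3 and item[1] == "-":
        return re.escape(item[0]) + "-" + re.escape(item[2])
    return re.escape(item)


def filter_to_regex(pattern: str) -> str:
    """Translate a filter pattern (without its +/- sign) into a regex string."""
    out = []
    strict_end = False  # set when the pattern contains *[]
    i = 0
    while i < len(pattern):
        if pattern.startswith("*[", i):
            close = pattern.find("]", i + 2)
            if close == -1:
                raise ValueError("unterminated *[..] wildcard")
            body = pattern[i + 2:close]
            if body == "":
                out.append(r"\Z")          # *[]: nothing may follow
                strict_end = True
            elif body in ("file", "name"):
                out.append("[^/?;]*")      # no '/', '?' or ';'
            elif body == "path":
                out.append("[^?;]*")       # no '?' or ';'
            else:
                items = body.split(",")    # e.g. '0-9,a,z,e,r,t,y'
                out.append("[" + "".join(_class_item(x) for x in items) + "]*")
            i = close + 1
        elif pattern[i] == "*":
            out.append(".*")               # plain '*': any characters
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    if not strict_end:
        # Assumption: without *[], a trailing '?query' is tolerated.
        out.append(r"(?:\?.*)?")
    return "".join(out)


def matches(pattern: str, url: str) -> bool:
    """True if the whole URL is covered by the translated pattern."""
    return re.fullmatch(filter_to_regex(pattern), url) is not None


if __name__ == "__main__":
    print(filter_to_regex("www.*.com/*[path].zip"))
    print(matches("*.html*[]", "www.someweb.com/index.html?page=10"))
```

Under these assumptions, *.html covers www.someweb.com/index.html?page=10 while *.html*[] does not, and *[name] stops at the next / whereas *[path] does not.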


+ Here are some examples of filters (these can be generated automatically using the interface):

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
www.thisweb.com*    This will refuse/accept this web site (all links located in it will be refused/accepted)
*.com/*    This will refuse/accept all links that contain .com in them
*cgi-bin*    This will refuse/accept all links that contain cgi-bin in them
www.*.com/*[path].zip    This will refuse/accept all zip files in .com addresses
*someweb*/*.tar*    This will refuse/accept all tar (or tar.gz etc.) files on hosts containing someweb
*/*somepage*    This will refuse/accept all links containing somepage after the host name (but not in the host name itself)
*.html    This will refuse/accept all html files.
+ Warning! With this filter you will accept ALL html files, even those at other addresses, which can cause a global (!) web mirror. Use www.someweb.com/*.html to accept all html files from a single web site.
*.html*[]    Identical to *.html, but the link must not have any additional characters at the end (links with parameters, like www.someweb.com/index.html?page=10, will be refused)
+ +
+ + +
+
+
+ + + + + +
+ + + + + + -- cgit v1.2.3