From 64cc4a88da8887ef1f7f4d90be0158d2cc76222d Mon Sep 17 00:00:00 2001
From: Xavier Roche <xroche@users.noreply.github.com>
Date: Mon, 19 Mar 2012 12:57:43 +0000
Subject: httrack 3.40.4

---
 html/filters.html | 261 +++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 231 insertions(+), 30 deletions(-)

(limited to 'html/filters.html')
diff --git a/html/filters.html b/html/filters.html
index d058296..dac8545 100644
--- a/html/filters.html
+++ b/html/filters.html
@@ -119,7 +119,10 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
     contrary refuse files of a particular type. That is the purpose of filters.
     <br>
 
-    <br>
+<p>    
+    <h4>Scan rules based on URL or extension (e.g. accept or refuse all .zip or .gif files)</h4>
+</p>    
+    
     To accept a family of links (for example, all links with a specific name or type), you just have to add 
     an authorization filter, like <b><tt>+*.gif</tt></b>. The pattern is a plus (this one: <b><tt>+</tt></b>),
     followed by a pattern composed of letters and wildcards (this one: <b><tt>*</tt></b>). 
@@ -131,23 +134,71 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
     Example: +*.gif will accept all files finished by .gif<br>
     Example: -*.gif will refuse all files finished by .gif<br>
     <br>
+    
+<p>    
+    <h4>Scan rules based on size (e.g. accept or refuse files bigger/smaller than a certain size)</h4>
+</p>    
+
+    Once a link is scheduled for download, you can still refuse it (i.e. abort the download) by checking its 
+    size to ensure that you won't reach a defined limit.
+    
+    Example: You may want to accept all files on the domain www.example.com, using '+www.example.com/*', 
+    including gif files inside this domain and outside (external images), but skip images that are too large
+    or too small (thumbnails).<br>
+    Excluding gif images smaller than 5KB and gif images larger than 100KB is therefore a good option:
+    +www.example.com/* +*.gif -*.gif*[&lt;5] -*.gif*[&gt;100]
+    
+    <br>
+    
+    Important notice: size scan rules are checked <font color=red><b>after</b></font> the link has been scheduled for download,
+    which allows the connection to be aborted.
+
+
+<p>    
+    <h4>Scan rules based on MIME types (e.g. accept or refuse all files of type audio/mp3)</h4>
+</p>    
+    
+    Once a link is scheduled for download, you can still refuse it (i.e. abort the download) by matching its MIME 
+    type against certain patterns.
+    
+    Example: You may want to accept all files on the domain www.example.com, using '+www.example.com/*', and
+    exclude all gif files, using '-*.gif'. But some dynamic scripts (such as www.example.com/dynamic.php) can
+    generate either html content or image data, depending on the context. Excluding this script, using
+    the scan rule '-www.example.com/dynamic.php', is therefore not a good solution.
+    
+    <br>
+    The only reliable way in such cases is to exclude the specific mime type 'image/gif', using the scan rule
+    syntax:<br>
+    -mime:image/gif
+    <br>
+    
+    Important notice: MIME types scan rules are <font color=red><b>only</b></font> checked against links that were
+    scheduled for download, i.e. links <b>already authorized</b> by url scan rules. 
+    Hence, using '+mime:image/gif' will only be a hint to accept images that were already authorized, 
+    if previous MIME scan rules excluded them - such as in '-mime:*/* +mime:text/html +mime:image/gif'
+    
+    <br>
 
     <br>
-    <u>Let's talk a little more about patterns:</u>
+    <h3>Scan rules patterns:</h3>
+
+<p>    
+    <h4>1.a. Scan rules based on URL or extension</h4>
+</p>    
 
     <br>
     Filters are analyzed by HTTrack from the first filter to the last one. The complete URL
     name is compared to filters defined by the user or added automatically by HTTrack. <br><br>
-    A link has an higher priority than the one before it - hierarchy is important: <br>
+    A scan rule has a higher priority if it is declared later - hierarchy is important: <br>
 
     <br>
     <table BORDER="1" CELLPADDING="2">
-    <tr><td>
+    <tr><td nowrap>
     <tt>+*.gif -image*.gif</tt>
     </td><td>
     Will accept all gif files BUT image1.gif,imageblue.gif,imagery.gif and so on
     </tr>
-    <tr><td>
+    <tr><td nowrap>
     <tt>-image*.gif +*.gif</tt>
     </td><td>
     Will accept all gif files, because the second pattern is prioritary (because it is defined AFTER the first one)
@@ -155,6 +206,8 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
     </table>
     <br>
 
+    Note: these scan rules can be mixed with scan rules based on size (see 1.b)<br>
+
     <br>
     We saw that patterns are composed of letters and wildcards (<b><tt>*</tt></b>), as in */image*.gif
 
@@ -162,47 +215,44 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
     Special wild cards can be used for specific characters: (*[..])</p>
     <table BORDER="1" CELLPADDING="2">
       <tr>
-        <td><tt>*</tt></td>
+        <td nowrap><tt>*</tt></td>
         <td>any characters (the most commonly used)</td>
       </tr>
       <tr>
-        <td><tt>*[file] or *[name]</tt></td>
+        <td nowrap><tt>*[file] or *[name]</tt></td>
         <td>any filename or name, e.g. not /,? and ; characters</td>
       </tr>
       <tr>
-        <td><tt>*[path]</tt></td>
+        <td nowrap><tt>*[path]</tt></td>
         <td>any path (and filename), e.g. not ? and ; characters</td>
       </tr>
       <tr>
-        <td><tt>*[a,z,e,r,t,y]</tt></td>
+        <td nowrap><tt>*[a,z,e,r,t,y]</tt></td>
         <td>any letters among a,z,e,r,t,y</td>
       </tr>
       <tr>
-        <td><tt>*[a-z]</tt></td>
+        <td nowrap><tt>*[a-z]</tt></td>
         <td>any letters</td>
       </tr>
       <tr>
-        <td><tt>*[0-9,a,z,e,r,t,y]</tt></td>
+        <td nowrap><tt>*[0-9,a,z,e,r,t,y]</tt></td>
         <td>any characters among 0..9 and a,z,e,r,t,y</td>
       </tr>
       <tr>
-        <td><tt>*[]</tt></td>
-        <td>no characters must be present after</a></td>
+        <td nowrap><tt>*[\*]</tt></td>
+        <td>the * character</td>
       </tr>
       <tr>
-        <td><tt>*[&lt;NN]</tt></td>
-        <td>the file size must be smaller than NN KB
-        <br>(note: this may cause broken files during the download)</td>
+        <td nowrap><tt>*[\\]</tt></td>
+        <td>the \ character</td>
       </tr>
       <tr>
-        <td><tt>*[&gt;NN]</tt></td>
-        <td>the file size must be greater than NN KB
-        <br>(note: this may cause broken files during the download)</td>
+        <td nowrap><tt>*[\[\]]</tt></td>
+        <td>the [ or ] character</td>
       </tr>
       <tr>
-        <td><tt>*[&lt;NN&gt;MM]</tt></td>
-        <td>the file size must be smaller than NN KB and greater than MM KB
-        <br>(note: this may cause broken files during the download)</td>
+        <td nowrap><tt>*[]</tt></td>
+        <td>no characters must be present after</td>
       </tr>
     </table>
 
@@ -212,44 +262,195 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
     interface)</p>
     <table BORDER="1" CELLPADDING="2">
       <tr>
-        <td><tt>www.thisweb.com* </tt></td>
+        <td nowrap><tt>www.thisweb.com* </tt></td>
         <td>This will refuse/accept this web site (all links located in it will be rejected)</td>
       </tr>
       <tr>
-        <td><tt>*.com/*</tt></td>
+        <td nowrap><tt>*.com/*</tt></td>
         <td>This will refuse/accept all links that contains .com in them</td>
       </tr>
       <tr>
-        <td><tt>*cgi-bin* </tt></td>
+        <td nowrap><tt>*cgi-bin* </tt></td>
         <td>This will refuse/accept all links that contains cgi-bin in them</td>
       </tr>
       <tr>
-        <td><tt>www.*[path].com/*[path].zip </tt></td>
+        <td nowrap><tt>www.*[path].com/*[path].zip </tt></td>
         <td>This will refuse/accept all zip files in .com addresses</td>
       </tr>
       <tr>
-        <td><tt>*someweb*/*.tar*</tt></td>
+        <td nowrap><tt>*someweb*/*.tar*</tt></td>
         <td>This will refuse/accept all tar (or tar.gz etc.) files in hosts containing someweb</td>
       </tr>
       <tr>
-        <td><tt>*/*somepage*</tt></td>
+        <td nowrap><tt>*/*somepage*</tt></td>
         <td>This will refuse/accept all links containing somepage (but not in the address)</td>
       </tr>
       <tr>
-        <td><tt>*.html</tt></td>
+        <td nowrap><tt>*.html</tt></td>
         <td>This will refuse/accept all html files. <br>
         Warning! With this filter you will accept ALL html files, even those in other addresses.
         (causing a global (!) web mirror..) Use www.someweb.com/*.html to accept all html files from
         a web.</td>
       </tr>
       <tr>
-        <td><tt>*.html*[]</tt></td>
+        <td nowrap><tt>*.html*[]</tt></td>
         <td>Identical to <tt>*.html</tt>, but the link must not have any supplemental characters
         at the end (links with parameters, like <tt>www.someweb.com/index.html?page=10</tt>, will be
         refused)</td>
       </tr>
     </table>
 
+<p>    
+    <h4>1.b. Scan rules based on size</h4>
+</p>    
+
+    <br>
+    Filters are analyzed by HTTrack from the first filter to the last one. The sizes
+    are compared against scan rules defined by the user.<br><br>
+    A scan rule has a higher priority if it is declared later - hierarchy is important.<br>
+    
+    Note: scan rules based on size can be mixed with regular URL patterns<br>
+
+    <p align="JUSTIFY"><br>
+    Size patterns:</p>
+    <table BORDER="1" CELLPADDING="2">
+      <tr>
+        <td nowrap><tt>*[&lt;NN]</tt></td>
+        <td>the file size must be smaller than NN KB
+        <br>(note: this may cause broken files during the download)</td>
+      </tr>
+      <tr>
+        <td nowrap><tt>*[&gt;NN]</tt></td>
+        <td>the file size must be greater than NN KB
+        <br>(note: this may cause broken files during the download)</td>
+      </tr>
+      <tr>
+        <td nowrap><tt>*[&lt;NN&gt;MM]</tt></td>
+        <td>the file size must be smaller than NN KB and greater than MM KB
+        <br>(note: this may cause broken files during the download)</td>
+      </tr>
+    </table>
+
+    <p align="JUSTIFY"><br>
+    Here are some examples of filters: (that can be generated automatically using the
+    interface)</p>
+    <table BORDER="1" CELLPADDING="2">
+      <tr>
+        <td nowrap><tt>-*[&lt;10]</tt></td>
+        <td>the file will be forbidden if its size is smaller than 10 KB</td>
+      </tr>
+      <tr>
+        <td nowrap><tt>-*[&gt;50]</tt></td>
+        <td>the file will be forbidden if its size is greater than 50 KB</td>
+      </tr>
+      <tr>
+        <td nowrap><tt>-*[&lt;10] -*[&gt;50]</tt></td>
+        <td>the file will be forbidden if its size is smaller than 10 KB <b>or</b> greater than 50 KB</td>
+      </tr>
+      <tr>
+        <td nowrap><tt>+*[&lt;80&gt;1]</tt></td>
+        <td>the file will be accepted if its size is smaller than 80 KB <b>and</b> greater than 1 KB</td>
+      </tr>
+    </table>
+
+
+<p>    
+    <h4>2. Scan rules based on MIME types</h4>
+</p>    
+
+    <br>
+    Filters are analyzed by HTTrack from the first filter to the last one. The complete MIME
+    type is compared against scan rules defined by the user.<br><br>
+    A scan rule has a higher priority if it is declared later - hierarchy is important.<br>
+
+    Note: scan rules based on MIME types can <b>NOT</b> be mixed with regular URL patterns or size patterns within the same rule, but you can use both of them in distinct ones<br>
+
+    <p align="JUSTIFY"><br>
+    Here are some examples of filters: (that can be generated automatically using the
+    interface)</p>
+    <table BORDER="1" CELLPADDING="2">
+      <tr>
+        <td nowrap><tt>-mime:application/octet-stream</tt></td>
+        <td>This will refuse all links of type 'application/octet-stream' that were already scheduled for download
+        (i.e. the download will be aborted)</td>
+      </tr>
+      <tr>
+        <td nowrap><tt>-mime:application/*</tt></td>
+        <td>This will refuse all links whose type begins with 'application/' that were already scheduled for download
+        (i.e. the download will be aborted)</td>
+      </tr>
+      <tr>
+        <td nowrap><tt>-mime:application/* +mime:application/pdf</tt></td>
+        <td>This will refuse all links whose type begins with 'application/' that were already scheduled for download, except for 'application/pdf' ones
+        (i.e. the download of all other 'application/' links will be aborted)</td>
+      </tr>
+      <tr>
+        <td nowrap><tt>-mime:video/*</tt></td>
+        <td>This will refuse all video links that were already scheduled for download
+        (i.e. the download will be aborted)</td>
+      </tr>
+      <tr>
+        <td nowrap><tt>-mime:video/* -mime:audio/*</tt></td>
+        <td>This will refuse all audio and video links that were already scheduled for download
+        (i.e. the download will be aborted)</td>
+      </tr>
+      <tr>
+        <td nowrap><tt>-mime:*/* +mime:text/html +mime:image/*</tt></td>
+        <td>This will refuse all links that were already scheduled for download, except html pages, and images
+        (i.e. the download of all other links will be aborted). Note that this is a very inefficient way of filtering
+        files, as aborted downloads will generate useless requests to the server. You are strongly advised to
+        use additional URL scan rules</td>
+      </tr>
+    </table>
+      
+<p>    
+    <h4>3. Interactions between scan rules based on URL or size and scan rules based on MIME types</h4>
+</p>    
+
+    You must use scan rules based on MIME types very carefully, or you will end up with an incomplete
+    mirror, or create an inefficient download session (generating costly and useless requests to the server)
+    <br>
+
+    <p align="JUSTIFY"><br>
+    Here are some examples of good/bad scan rules interactions:</p>
+    <table BORDER="1" CELLPADDING="1">
+      <tr>
+        <td>Purpose</td>
+        <td>Method</td>
+        <td>Result</td>
+      </tr>
+      <!-- -->
+      <tr>
+        <td rowspan=2>Download all html and images on www.example.com</td>
+        <td bgcolor="#55ff55"><tt>-*<br /> +www.example.com/*.html<br /> +www.example.com/*.php<br /> +www.example.com/*.asp<br /> +www.example.com/*.gif <br />+www.example.com/*.jpg <br />+www.example.com/*.png<br /> -mime:*/* +mime:text/html +mime:image/*</tt></td>
+        <td>Good: efficient download</td>
+      </tr>
+      <tr>
+        <td bgcolor="#FF5555"><tt>-*<br />+www.example.com/*<br />-mime:*/* +mime:text/html +mime:image/*</tt></td>    
+        <td>Bad: many aborted downloads, leading to poor performance and extra server load</td>
+      </tr>
+      <!-- -->
+      <tr>
+        <td rowspan=2>Download only html on www.example.com, plus ZIP files</td>
+        <td bgcolor="#55ff55"><tt>-*<br /> +www.example.com/*.html<br />+www.example.com/somedynamicscript.php<br />+www.example.com/*.zip<br>-mime:* +mime:text/html +mime:application/zip</tt></td>
+        <td>Good: ZIP files will be downloaded, even those generated by 'somedynamicscript.php'</td>
+      </tr>
+      <tr>
+        <td bgcolor="#FF5555"><tt>-*<br /> +www.example.com/*.html<br>-mime:* +mime:text/html +mime:application/zip</tt></td>
+        <td>Bad: ZIP files will never be scheduled for download, and hence the zip mime scan rule will never be used</td>
+      </tr>
+      <!-- -->
+      <tr>
+        <td rowspan=2>Download all html, and images smaller than 100KB on www.example.com</td>
+        <td bgcolor="#55ff55"><tt>-*<br /> +www.example.com/*.html<br /> +www.example.com/*.php<br /> +www.example.com/*.asp<br /> +www.example.com/*.gif*[&lt;100] <br />+www.example.com/*.jpg*[&lt;100] <br />+www.example.com/*.png*[&lt;100]<br /> -mime:*/* +mime:text/html +mime:image/*</tt></td>
+        <td>Good: efficient download</td>
+      </tr>
+      <tr>
+        <td bgcolor="#FF5555"><tt>-*<br />+www.example.com/**[&lt;100]<br />-mime:*/* +mime:text/html +mime:image/*</tt></td>
+        <td>Bad: many aborted downloads, leading to poor performance and extra server load</td>
+      </tr>
+    </table>
+
 <br>
 
 <!-- ==================== Start epilogue ==================== -->
-- 
cgit v1.2.3
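
Note on the rule ordering documented in section 1.a of the patched page ("a scan rule has a higher priority if it is declared later"): the behaviour can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HTTrack's implementation. The function name `is_accepted` and its `default` verdict are hypothetical, Python's `fnmatchcase` stands in for HTTrack's own matcher, and only the plain `*` wildcard is handled (bracket forms such as `*[file]` or `*[<NN]` are not).

```python
from fnmatch import fnmatchcase


def is_accepted(name, rules, default=False):
    """Return the verdict of the LAST rule whose pattern matches `name`.

    `rules` is a whitespace-separated string such as "+*.gif -image*.gif".
    `default` (an assumption of this sketch) is returned when nothing matches.
    """
    verdict = default
    for rule in rules.split():
        sign, pattern = rule[0], rule[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {rule!r}")
        # Later rules overwrite earlier verdicts, so the last match wins.
        if fnmatchcase(name, pattern):
            verdict = (sign == "+")
    return verdict


# The two orderings from the documentation table give different results.
print(is_accepted("image1.gif", "+*.gif -image*.gif"))
print(is_accepted("image1.gif", "-image*.gif +*.gif"))
```

With the first ordering `image1.gif` is refused, because `-image*.gif` is declared after `+*.gif`; with the second it is accepted, because `+*.gif` comes last.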

