Diffstat (limited to 'html/filters.html')
-rw-r--r-- | html/filters.html | 261 |
1 files changed, 231 insertions, 30 deletions
diff --git a/html/filters.html b/html/filters.html index d058296..dac8545 100644 --- a/html/filters.html +++ b/html/filters.html @@ -119,7 +119,10 @@ See also: The <a href="faq.html#VF1">FAQ</a><br> contrary refuse files of a particular type. That is the purpose of filters. <br> - <br> +<p> + <h4>Scan rules based on URL or extension (e.g. accept or refuse all .zip or .gif files)</h4> +</p> + To accept a family of links (for example, all links with a specific name or type), you just have to add an authorization filter, like <b><tt>+*.gif</tt></b>. The pattern is a plus (this one: <b><tt>+</tt></b>), followed by a pattern composed of letters and wildcards (this one: <b><tt>*</tt></b>). @@ -131,23 +134,71 @@ See also: The <a href="faq.html#VF1">FAQ</a><br> Example: +*.gif will accept all files finished by .gif<br> Example: -*.gif will refuse all files finished by .gif<br> <br> + +<p> + <h4>Scan rules based on size (e.g. accept or refuse files bigger/smaller than a certain size)</h4> +</p> + + Once a link is scheduled for download, you can still refuse it (i.e. abort the download) by checking its + size to ensure that you won't reach a defined limit. + + Example: You may want to accept all files on the domain www.example.com, using '+www.example.com/*', + including gif files inside this domain and outside (eternal images), but not take to large images, + or too small ones (thumbnails)<br> + Excluding gif images smaller than 5KB and images larger than 100KB is therefore a good option; + +www.example.com +*.gif -*.gif*[<5] -*.gif*[>100] + + <br> + + Important notice: size scan rules are checked <font color=red><b>after</b></font> the link was scheduled for download, + allowing to abort the connection. + + +<p> + <h4>Scan rules based on MIME types (e.g. accept or refuse all files of type audio/mp3)</h4> +</p> + + Once a link is scheduled for download, you can still refuse it (i.e. abort the download) by matching its MIME + type against certain patterns. 
+ + Example: You may want to accept all files on the domain www.example.com, using '+www.example.com/*', and + exclude all gif files, using '-*.gif'. But some dynamic scripts (such as www.example.com/dynamic.php) can + both generate html content, or image data content, depending on the context. Excluding this script, using + the scan rule '-www.example.com/dynamic.php', is therefore not a good solution. + + <br> + The only reliable way in such cases is to exclude the specific mime type 'image/gif', using the scan rule + syntax:<br> + -mime:image/gif + <br> + + Important notice: MIME types scan rules are <font color=red><b>only</b></font> checked against links that were + scheduled for download, i.e. links <b>already authorized</b> by url scan rules. + Hence, using '+mime:image/gif' will only be a hint to accept images that were already authorized, + if previous MIME scan rules excluded them - such as in '-mime:*/* +mime:text/html +mime:image/gif' + + <br> <br> - <u>Let's talk a little more about patterns:</u> + <h3>Scan rules patterns:</h3> + +<p> + <h4>1.a. Scan rules based on URL or extension</h4> +</p> <br> Filters are analyzed by HTTrack from the first filter to the last one. The complete URL name is compared to filters defined by the user or added automatically by HTTrack. 
<br><br> - A link has an higher priority than the one before it - hierarchy is important: <br> + A scan rule has an higher priority is it is declared later - hierarchy is important: <br> <br> <table BORDER="1" CELLPADDING="2"> - <tr><td> + <tr><td nowrap> <tt>+*.gif -image*.gif</tt> </td><td> Will accept all gif files BUT image1.gif,imageblue.gif,imagery.gif and so on </tr> - <tr><td> + <tr><td nowrap> <tt>-image*.gif +*.gif</tt> </td><td> Will accept all gif files, because the second pattern is prioritary (because it is defined AFTER the first one) @@ -155,6 +206,8 @@ See also: The <a href="faq.html#VF1">FAQ</a><br> </table> <br> + Note: these scan rules can be mixed with scan rules based on size (see 1.b)<br> + <br> We saw that patterns are composed of letters and wildcards (<b><tt>*</tt></b>), as in */image*.gif @@ -162,47 +215,44 @@ See also: The <a href="faq.html#VF1">FAQ</a><br> Special wild cards can be used for specific characters: (*[..])</p> <table BORDER="1" CELLPADDING="2"> <tr> - <td><tt>*</tt></td> + <td nowrap><tt>*</tt></td> <td>any characters (the most commonly used)</td> </tr> <tr> - <td><tt>*[file] or *[name]</tt></td> + <td nowrap><tt>*[file] or *[name]</tt></td> <td>any filename or name, e.g. not /,? and ; characters</td> </tr> <tr> - <td><tt>*[path]</tt></td> + <td nowrap><tt>*[path]</tt></td> <td>any path (and filename), e.g. not ? 
and ; characters</td> </tr> <tr> - <td><tt>*[a,z,e,r,t,y]</tt></td> + <td nowrap><tt>*[a,z,e,r,t,y]</tt></td> <td>any letters among a,z,e,r,t,y</td> </tr> <tr> - <td><tt>*[a-z]</tt></td> + <td nowrap><tt>*[a-z]</tt></td> <td>any letters</td> </tr> <tr> - <td><tt>*[0-9,a,z,e,r,t,y]</tt></td> + <td nowrap><tt>*[0-9,a,z,e,r,t,y]</tt></td> <td>any characters among 0..9 and a,z,e,r,t,y</td> </tr> <tr> - <td><tt>*[]</tt></td> - <td>no characters must be present after</a></td> + <td nowrap><tt>*[\*]</tt></td> + <td>the * character</td> </tr> <tr> - <td><tt>*[<NN]</tt></td> - <td>the file size must be smaller than NN KB - <br>(note: this may cause broken files during the download)</td> + <td nowrap><tt>*[\\]</tt></td> + <td>the \ character</td> </tr> <tr> - <td><tt>*[>NN]</tt></td> - <td>the file size must be greater than NN KB - <br>(note: this may cause broken files during the download)</td> + <td nowrap><tt>*[\[\]]</tt></td> + <td>the [ or ] character</td> </tr> <tr> - <td><tt>*[<NN>MM]</tt></td> - <td>the file size must be smaller than NN KB and greater than MM KB - <br>(note: this may cause broken files during the download)</td> + <td nowrap><tt>*[]</tt></td> + <td>no characters must be present after</a></td> </tr> </table> @@ -212,44 +262,195 @@ See also: The <a href="faq.html#VF1">FAQ</a><br> interface)</p> <table BORDER="1" CELLPADDING="2"> <tr> - <td><tt>www.thisweb.com* </tt></td> + <td nowrap><tt>www.thisweb.com* </tt></td> <td>This will refuse/accept this web site (all links located in it will be rejected)</td> </tr> <tr> - <td><tt>*.com/*</tt></td> + <td nowrap><tt>*.com/*</tt></td> <td>This will refuse/accept all links that contains .com in them</td> </tr> <tr> - <td><tt>*cgi-bin* </tt></td> + <td nowrap><tt>*cgi-bin* </tt></td> <td>This will refuse/accept all links that contains cgi-bin in them</td> </tr> <tr> - <td><tt>www.*[path].com/*[path].zip </tt></td> + <td nowrap><tt>www.*[path].com/*[path].zip </tt></td> <td>This will refuse/accept all zip files in 
.com addresses</td> </tr> <tr> - <td><tt>*someweb*/*.tar*</tt></td> + <td nowrap><tt>*someweb*/*.tar*</tt></td> <td>This will refuse/accept all tar (or tar.gz etc.) files in hosts containing someweb</td> </tr> <tr> - <td><tt>*/*somepage*</tt></td> + <td nowrap><tt>*/*somepage*</tt></td> <td>This will refuse/accept all links containing somepage (but not in the address)</td> </tr> <tr> - <td><tt>*.html</tt></td> + <td nowrap><tt>*.html</tt></td> <td>This will refuse/accept all html files. <br> Warning! With this filter you will accept ALL html files, even those in other addresses. (causing a global (!) web mirror..) Use www.someweb.com/*.html to accept all html files from a web.</td> </tr> <tr> - <td><tt>*.html*[]</tt></td> + <td nowrap><tt>*.html*[]</tt></td> <td>Identical to <tt>*.html</tt>, but the link must not have any supplemental characters at the end (links with parameters, like <tt>www.someweb.com/index.html?page=10</tt>, will be refused)</td> </tr> </table> +<p> + <h4>1.b. Scan rules based on size</h4> +</p> + + <br> + Filters are analyzed by HTTrack from the first filter to the last one. 
The sizes + are compared against scan rules defined by the user.<br><br> + A scan rule has an higher priority is it is declared later - hierarchy is important.<br> + + Note: scan rules based on size can be mixed with regular URL patterns<br> + + <p align="JUSTIFY"><br> + Size patterns:</p> + <table BORDER="1" CELLPADDING="2"> + <tr> + <td nowrap><tt>*[<NN]</tt></td> + <td>the file size must be smaller than NN KB + <br>(note: this may cause broken files during the download)</td> + </tr> + <tr> + <td nowrap><tt>*[>NN]</tt></td> + <td>the file size must be greater than NN KB + <br>(note: this may cause broken files during the download)</td> + </tr> + <tr> + <td nowrap><tt>*[<NN>MM]</tt></td> + <td>the file size must be smaller than NN KB and greater than MM KB + <br>(note: this may cause broken files during the download)</td> + </tr> + </table> + + <p align="JUSTIFY"><br> + Here are some examples of filters: (that can be generated automatically using the + interface)</p> + <table BORDER="1" CELLPADDING="2"> + <tr> + <td nowrap><tt>-*[<10]</tt></td> + <td>the file will be forbidden if its size is smaller than 10 KB</td> + </tr> + <tr> + <td nowrap><tt>-*[>50]</tt></td> + <td>the file will be forbidden if its size is greater than 50 KB</td> + </tr> + <tr> + <td nowrap><tt>-*[<10] -*[>50]</tt></td> + <td>the file will be forbidden if if its size is smaller than 10 KB <b>or</b> greater than 50 KB</td> + </tr> + <tr> + <td nowrap><tt>+*[<80>1]</tt></td> + <td>the file will be accepted if if its size is smaller than 80 KB <b>and</b> greater than 1 KB</td> + </tr> + </table> + + +<p> + <h4>2. Scan rules based on MIME types</h4> +</p> + + <br> + Filters are analyzed by HTTrack from the first filter to the last one. 
The complete MIME + type is compared against scan rules defined by the user.<br><br> + A scan rule has an higher priority is it is declared later - hierarchy is important<br> + + Note: scan rules based on MIME types can <b>NOT</b> be mixed with regular URL patterns or size patterns within the same rule, but you can use both of them in distinct ones<br> + + <p align="JUSTIFY"><br> + Here are some examples of filters: (that can be generated automatically using the + interface)</p> + <table BORDER="1" CELLPADDING="2"> + <tr> + <td nowrap><tt>-mime:application/octet-stream</tt></td> + <td>This will refuse all links of type 'application/octet-stream' that were already scheduled for download + (i.e. the download will be aborted)</td> + </tr> + <tr> + <td nowrap><tt>-mime:application/*</tt></td> + <td>This will refuse all links of type begining with 'application/' that were already scheduled for download + (i.e. the download will be aborted)</td> + </tr> + <tr> + <td nowrap><tt>-mime:application/* +mime:application/pdf</tt></td> + <td>This will refuse all links of type begining with 'application/' that were already scheduled for download, except for 'application/pdf' ones + (i.e. all other 'application/' link download will be aborted)</td> + </tr> + <tr> + <td nowrap><tt>-mime:video/*</tt></td> + <td>This will refuse all video links that were already scheduled for download + (i.e. all other 'application/' link download will be aborted)</td> + </tr> + <tr> + <td nowrap><tt>-mime:video/* -mime:audio/*</tt></td> + <td>This will refuse all audio and video links that were already scheduled for download + (i.e. all other 'application/' link download will be aborted)</td> + </tr> + <tr> + <td nowrap><tt>-mime:*/* +mime:text/html +mime:image/*</tt></td> + <td>This will refuse all links that were already scheduled for download, except html pages, and images + (i.e. all other link download will be aborted). 
Note that this is a very unefficient way of filtering + files, as aborted downloads will generate useless requests to the server. You are strongly advised to + use additional URL scan rules</td> + </tr> + </table> + +<p> + <h4>2. Scan rules based on URL or size, and scan rules based on MIME types interactions</h4> +</p> + + You must use scan rules based on MIME types very carefully, or you will end up with an imcomplete + mirror, or create an unefficient download session (generating costly and useless requests to the server) + <br> + + <p align="JUSTIFY"><br> + Here are some examples of good/bad scan rules interactions:</p> + <table BORDER="1" CELLPADDING="1"> + <tr> + <td>Purpose</td> + <td>Method</td> + <td>Result</td> + </tr> + <!-- --> + <tr> + <td rowspan=2>Download all html and images on www.example.com</td> + <td bgcolor="#55ff55"><tt>-*<br /> +www.example.com/*.html<br /> +www.example.com/*.php<br /> +www.example.com/*.asp<br /> +www.example.com/*.gif <br />+www.example.com/*.jpg <br />+www.example.com/*.png<br /> -mime:*/* +mime:text/html +mime:image/*</tt></td> + <td>Good: efficient download</td> + </tr> + <tr> + <td bgcolor="#FF5555"><tt>-*<br />+www.example.com/*<br />-mime:*/* +mime:text/html +mime:image/*</tt></td> + <td>Bad: many aborted downloads, leading to poor performances and server load</td> + </tr> + <!-- --> + <tr> + <td rowspan=2>Download only html on www.example.com, plus ZIP files</td> + <td bgcolor="#55ff55"><tt>-*<br /> +www.example.com/*.html<br />+www.example.com/somedynamicscript.php<br />+www.example.com/*.zip<br>-mime:* +mime:text/html +mime:application/zip</tt></td> + <td>Good: ZIP files will be downloaded, even those generated by 'somedynamicscript.php'</td> + </tr> + <tr> + <td bgcolor="#FF5555"><tt>-*<br /> +www.example.com/*.html<br>-mime:* +mime:text/html +mime:application/zip</tt></td> + <td>Bad: ZIP files will never be scheduled for download, and hence the zip mime scan rule will never be used</td> + </tr> + <!-- --> + <tr> 
+ <td rowspan=2>Download all html, and images smaller than 100KB on www.example.com</td> + <td bgcolor="#55ff55"><tt>-*<br /> +www.example.com/*.html<br /> +www.example.com/*.php<br /> +www.example.com/*.asp<br /> +www.example.com/*.gif*[<100] <br />+www.example.com/*.jpg*[<100] <br />+www.example.com/*.png*[<100]<br /> -mime:*/* +mime:text/html +mime:image/*</tt></td> + <td>Good: efficient download</td> + </tr> + <tr> + <td bgcolor="#FF5555"><tt>-*<br />+www.example.com/**[<100]<br />-mime:*/* +mime:text/html +mime:image/*</tt></td> + <td>Bad: many aborted downloads, leading to poor performances and server load</td> + </tr> + </table> + <br> <!-- ==================== Start epilogue ==================== --> |
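
Note on rule ordering: the documentation added in this diff states that scan rules are analyzed from the first to the last, and that a rule declared later has higher priority. The following is a minimal Python sketch of that "last matching rule wins" behaviour, not HTTrack's actual implementation. The function name `decide` is hypothetical, and Python's `fnmatchcase` is only assumed to be a close enough stand-in for the plain `*` wildcard; the `*[...]` forms, size rules and `mime:` rules are not modelled.

```python
from fnmatch import fnmatchcase


def decide(url, rules):
    """Return True (accept), False (refuse) or None (no rule matched).

    `rules` is a whitespace-separated string such as "+*.gif -image*.gif".
    Only the plain `*` wildcard is handled here (an assumption of this sketch).
    """
    verdict = None
    for rule in rules.split():
        sign, pattern = rule[0], rule[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: %r" % rule)
        if fnmatchcase(url, pattern):
            # A later matching rule overrides any earlier verdict.
            verdict = (sign == "+")
    return verdict


if __name__ == "__main__":
    # The two rows of the priority table in the documentation:
    print(decide("image1.gif", "+*.gif -image*.gif"))  # refused by the later rule
    print(decide("image1.gif", "-image*.gif +*.gif"))  # accepted by the later rule
```

Returning `None` when no rule matches is a choice made for this sketch: what HTTrack does with a link that no scan rule mentions depends on its other settings, which the diff does not describe.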