From ad5b7acc19290ff91e0f42a0de448a26760fcf99 Mon Sep 17 00:00:00 2001 From: Xavier Roche Date: Mon, 19 Mar 2012 12:36:11 +0000 Subject: Imported httrack 3.20.2 --- HelpHtml/abuse.html | 580 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 580 insertions(+) create mode 100644 HelpHtml/abuse.html (limited to 'HelpHtml/abuse.html') diff --git a/HelpHtml/abuse.html b/HelpHtml/abuse.html new file mode 100644 index 0000000..4be36a1 --- /dev/null +++ b/HelpHtml/abuse.html @@ -0,0 +1,580 @@ + + + + + + + HTTrack Website Copier - Offline Browser + + + + + + + + + +
HTTrack Website Copier
+ + + + +
Open Source offline browser

+ + + + + + + + + +

Advice & what not to do

+ +

Please follow these common sense rules to avoid any network abuse

+ +
+ +
  • Do not overload the websites!
    Downloading a site can overload it if you have a fast connection, or if you capture too many CGI (dynamically generated) pages simultaneously.
    • Do not download overly large websites: use filters
    • Do not use too many simultaneous connections
    • Use bandwidth limits
    • Use connection limits
    • Use size limits
    • Use time limits
    • Only disable robots.txt rules with great care
    • Try not to download during working hours
    • Check your mirror transfer rate/size
    • For large mirrors, first ask the webmaster of the site
  • Ensure that you can copy the website
    • Are the pages copyrighted?
    • Can you copy them only for private purposes?
    • Do not make online mirrors unless you are authorized to do so
  • Do not overload your network
    • Is your (corporate, private..) network connected through a dial-up ISP?
    • Is your network bandwidth limited (and expensive)?
    • Are you slowing down the traffic?
  • Do not steal private information
    • Do not grab emails
    • Do not grab private information
+ +
+

+
+ + + + + + + + + +

Abuse FAQ for webmasters

+ +

How to limit network abuse +
+HTTrack Website Copier FAQ (updated - DRAFT) +

+ +
+Q: How to block offline browsers, like HTTrack?
+
+A: This is a complex question, let's study it
+
+First, there are several possible reasons for wanting to do that.
+Why do you want to block offline browsers?
+
+
    +
  1. Because a large part of your bandwidth is used by some users, who are slowing down the rest
  2. Because of copyright questions (you do not want people to copy parts of your website)
  3. Because of privacy (you do not want email grabbers to steal all your users' emails)
+
+
+
    + + +
  1. Bandwidth abuse:
    +
    +Many webmasters are concerned about bandwidth abuse, even if this problem is caused by a minority of people. Offline browser tools, like HTTrack, can be used in a WRONG way, and are therefore sometimes considered a potential danger.
    +But before deciding that all offline browsers are BAD, consider this: students, teachers, IT consultants, websurfers and many other people who like your website may want to copy parts of it for their work or their studies, or to teach or demonstrate to people in class or at shows. They might do that because they are connected through an expensive modem connection, or because they would like to consult pages while travelling, archive sites that may be removed one day, do some data mining, or compile information ("if only I could find this website I saw one day..").
    +There are many good reasons to mirror websites, and this helps many good people.
    +As a webmaster, you might be interested in using such tools, too: to test broken links, move a website to another location, check which external links are put on your website for legal/content control, test the webserver response and performance, index it..
    +
    +Anyway, bandwidth abuse can be a problem. If your site is regularly "clobbered" by evil downloaders, you have
    +various solutions. There are radical solutions, and intermediate solutions. I strongly recommend not using
    +radical solutions, because of the previous remarks (good people often mirror websites).
    +
    +In general, for all solutions,
    +the good thing: it will limit the bandwidth abuse
    +the bad thing: depending on the solution, it will be either a small constraint, or a fatal +nuisance (you'll get 0 visitors)
    +or, to be extreme: if you unplug the wire, there will be no bandwidth abuse
    +
    +
      + +
    1. Inform people, explain why ("please do not clobber the bandwidth")
      +Good: Will work with good people. Many good people just don't KNOW that they can slow down +a network.
      +Bad: Will **only** work with good people
      +How to do: Obvious - place a note, a warning, an article, a drawing, a poem or whatever you want
      +
      +
    2. Use "robots.txt" file
      +Good: Easy to setup
      +Bad: Easy to override
      +How to do: Create a robots.txt file in the top-level directory, with the appropriate rules
      +Example:
      +    User-agent: *
      +
      +    Disallow: /bigfolder
      +
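To see what a rule like the one above covers, here is a minimal sketch of how a well-behaved crawler might test a URL path against a list of Disallow values. The function name `isDisallowed` and the sample paths are assumptions made for illustration; real crawlers also handle Allow lines, wildcards and per-agent groups, which are left out here.

```javascript
// Minimal sketch (assumption: only plain "Disallow" prefixes are handled).
// A path is refused when it starts with one of the Disallow values.
// An empty Disallow value means "nothing is disallowed".
function isDisallowed(path, disallowPrefixes) {
  return disallowPrefixes.some(
    (prefix) => prefix !== "" && path.startsWith(prefix)
  );
}

// Rules taken from the robots.txt example above.
const rules = ["/bigfolder"];

console.log(isDisallowed("/bigfolder/archive.zip", rules));
console.log(isDisallowed("/index.html", rules));
```

Because the match is a simple prefix test, "Disallow: /bigfolder" also covers a path such as /bigfolder2/; write "/bigfolder/" if only that directory is meant.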
      +
    3. Ban registered offline-browsers User-agents
      +Good: Easy to setup
      +Bad: Radical, and easy to override
      +How to do: Filter the "User-agent" HTTP header field
      +
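As a rough illustration of the User-agent filter, the check itself can be a simple substring test. This is only a sketch: the pattern list and the function name `isBlockedUserAgent` are hypothetical, and how the result is wired into your webserver depends on the server you use.

```javascript
// Hypothetical list of substrings to refuse; adjust it to your own policy.
const BLOCKED_AGENT_PATTERNS = ["httrack", "webcopier"];

// Returns true when the User-agent header contains one of the blocked
// substrings (case-insensitive). A missing header is not treated as an
// offline browser in this sketch.
function isBlockedUserAgent(userAgent, patterns = BLOCKED_AGENT_PATTERNS) {
  if (typeof userAgent !== "string" || userAgent === "") {
    return false;
  }
  const lowered = userAgent.toLowerCase();
  return patterns.some((pattern) => lowered.includes(pattern.toLowerCase()));
}

// Example header values, made up for the demonstration.
console.log(isBlockedUserAgent("Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"));
console.log(isBlockedUserAgent("Mozilla/5.0 (X11; Linux x86_64)"));
```

Remember the "Bad" point above: the header is chosen by the client, so anyone can change it and walk past this test.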
      +
    4. Limit the bandwidth per IP (or by folders)
      +Good: Efficient
      +Bad: Multiple users behind proxies will be slowed down, not really easy to setup
      +How to do: Depends on webserver. Might be done with low-level IP rules (QoS)
      +
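If your webserver has no built-in option, the idea behind a per-IP limit can be sketched as a token bucket: each address may send a burst of bytes, then is refilled at a fixed rate. The class name `BandwidthLimiter`, its parameters and the sample addresses are assumptions for illustration, not an HTTrack or webserver API.

```javascript
// Per-IP token bucket (illustrative sketch). Each address may receive up to
// `capacityBytes` in a burst; the bucket refills at `refillBytesPerSecond`.
class BandwidthLimiter {
  constructor(capacityBytes, refillBytesPerSecond) {
    this.capacityBytes = capacityBytes;
    this.refillBytesPerSecond = refillBytesPerSecond;
    this.buckets = new Map(); // ip -> { tokens, lastRefillMs }
  }

  // Returns true if `bytes` may be sent to `ip` at time `nowMs`,
  // and charges the bucket when it does.
  allow(ip, bytes, nowMs = Date.now()) {
    let bucket = this.buckets.get(ip);
    if (!bucket) {
      bucket = { tokens: this.capacityBytes, lastRefillMs: nowMs };
      this.buckets.set(ip, bucket);
    }
    // Refill according to the time elapsed since the last call.
    const elapsedSeconds = Math.max(0, nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(
      this.capacityBytes,
      bucket.tokens + elapsedSeconds * this.refillBytesPerSecond
    );
    bucket.lastRefillMs = nowMs;
    if (bytes > bucket.tokens) {
      return false; // over the limit: delay or refuse this response
    }
    bucket.tokens -= bytes;
    return true;
  }
}
```

For instance, `new BandwidthLimiter(1000000, 50000)` would allow a 1 MB burst per address and then about 50 KB per second. As noted above, all users behind one proxy share the same bucket.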
      +
    5. Prioritize small files over large files
      +Good: Efficient if large files are the cause of abuse
      +Bad: Not always efficient
      +How to do: Depends on the webserver
      +
      +
    6. Ban abuser IPs
      +Good: Immediate solution
      +Bad: Annoying to do, useless for dynamic IPs, and not very user friendly
      +How to do: Either ban IPs on the firewall, or on the webserver (see ACLs)
      +
      +
    7. Limit abusers' IPs
      +Good: Intermediate and immediate solution
      +Bad: Annoying to do, useless for dynamic IPs, and annoying to maintain..
      +How to do: Use routing QoS (fair queuing), or webserver options
      +
      +
    8. Use technical tricks (like javascript) to hide URLs
      +Good: Efficient
      +Bad: The most efficient tricks will also make your website heavy and not user-friendly (and therefore less attractive, even for surfing users). Remember: clients or visitors might want to consult your website offline. Advanced users will also still be able to note the URLs and fetch them. It will not work on non-javascript browsers. It will not help if the user clicks 50 times and puts downloads in the background with a standard browser
      +How to do: Most offline browsers (I would say all, but let's say most) are unable to "understand" javascript/java properly. Reason: very tricky to handle!
      +Example:
      +You can replace:
      + + +    <a href="bigfile.zip">Foo</a>
      +
      + +by:
      + +    <script language="javascript">
      +    <!--
      +    document.write('<a h' + 're' + 'f="');
      +    document.write('bigfile' + '.' + 'zip">');
      +    // -->
      +    </script>
      +    Foo
      +    </a>
      +
      +
      +You can also use java-based applets. I would say that it is the "best of the +horrors". A big, fat, slow, bogus java applet. Avoid!
      +
      +
    9. Use technical tricks to lag offline browsers
      +Good: Efficient
      +Bad: Can be avoided by advanced users, annoying to maintain, AND potentially worse than the illness (cgi's often use some CPU). It will not help if the user clicks 50 times and puts downloads in the background with a standard browser
      +How to do: Create fake empty links that point to cgi's, with long delays
      +Example: Use things like <a href="slow.cgi?p=12786549"><nothing></a> (example in php:)
      +    <?php
      +    for($i=0;$i<10;$i++) {
      +        sleep(6);
      +        echo " ";
      +    }
      +    ?>
      +
      + +
      +
    10. Use technical tricks to temporarily ban IPs
      +Good: Efficient
      +Bad: Radical (your site will only be available online for all users), not easy to setup
      +How to do: Create fake links with "killing" targets
      +Example: Use things like <a href="killme.cgi"><nothing></a> +(again an example in php:)
      + +    <?php
      +        // Of course, "add_temp_firewall_rule" has to be written..
      +        add_temp_firewall_rule($_SERVER['REMOTE_ADDR'],"30s");
      +    ?>
      +
      +
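The "add_temp_firewall_rule" helper is left to the reader in the example above. As a rough idea of the bookkeeping it needs, here is an in-memory sketch of a temporary ban list with expiry. The class name `TemporaryBanList` is hypothetical, and a real version would add and remove firewall or webserver rules instead of keeping a table in memory.

```javascript
// In-memory stand-in for a temporary firewall rule table (illustrative only).
class TemporaryBanList {
  constructor() {
    this.expiryByIp = new Map(); // ip -> time (ms) at which the ban ends
  }

  // Ban `ip` for `durationMs` milliseconds, starting at `nowMs`.
  ban(ip, durationMs, nowMs = Date.now()) {
    this.expiryByIp.set(ip, nowMs + durationMs);
  }

  // Returns true while the ban is active; expired entries are dropped.
  isBanned(ip, nowMs = Date.now()) {
    const expiry = this.expiryByIp.get(ip);
    if (expiry === undefined) {
      return false;
    }
    if (nowMs >= expiry) {
      this.expiryByIp.delete(ip);
      return false;
    }
    return true;
  }
}
```

The "killing" target would call `ban(ip, 30000)` for the requesting address, and every other request would first check `isBanned(ip)`. Keeping the ban short limits the damage when an ordinary visitor follows the fake link by accident.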
      +
      +
  2. Copyright issues
    +
    +You do not want people to "steal" your website, or even copy parts of it. First, stealing a website does not
    +require an offline browser. Second, direct (and credited) copy is sometimes better than disguised
    +plagiarism. Besides, several previous remarks also apply here: the more protected your website is,
    +the less attractive it may be. There is no perfect solution, either. A webmaster asked me one day
    +to give him a solution to prevent any website copy. Not only for offline browsers, but also against "save as",
    +cut and paste, print.. and print screen. I replied that it was not possible, especially for the print screen - and
    +that another potential threat was the evil photographer. Maybe with a "this document will self-destruct in 5 seconds.."
    +or by shooting users after consulting the document.
    +More seriously, once a document is placed on a website, there will always be a risk of copy (or plagiarism)
    +
    +To limit the risk, the previous solutions 1 and 8, in the "bandwidth abuse" section, can be used
    +
    +
    +
  3. Privacy
    +
    +This might be related to section 2.
    +But the greatest risk is probably email grabbers.
    +
    +
      +
    1. A solution can be to use javascript to hide emails.
      +Good: Efficient
      +Bad: Will not work on non-javascript browsers
      +How to do: Use javascript to build mailto: links
      +Example: (in javascript)
      + +    <script language="javascript">
      +    <!--
      +    function FOS(host,nom,info) {
      +      var s;
      +      if (info == "") info=nom+"@"+host;
      +      s="mail";
      +      document.write("<a href='"+s+"to:"+nom+"@"+host+"'>"+info+"</a>");
      +    }
      +    FOS('mycompany.com','smith?subject=Hi, John','Click here to email me!')
      +    // -->
      +    </script>
      +
      +
      +
    2. Another one is to create images of emails
      +Good: Efficient, does not require javascript
      +Bad: There is still the problem of the link (mailto:), images are bigger than text
      +How to do: Not so obvious if you do not want to create images by yourself
      +Example: (php, Unix)
      + + +<?php
      +/*
      +Email contact displayer
      +Usage: email.php3?id=<4 bytes of user's md5>
      +The <4 bytes of user's md5> can be calculated using the 2nd script (see below)
      +Example: http://yourhost/email.php3?id=91ff1a48
      +*/
      +$domain="mycompany.com";
      +$size=12;
      +
      +/* Find the user in the system database */
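      +/* Read the id from the query string (register_globals is no longer available) */
      +$id = isset($_GET['id']) ? $_GET['id'] : '';
      +if (!$id)
      +  exit;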
      +unset($email);
      +unset($name);
      +unset($pwd);
      +unset($apwd);
      +$email="";
      +$name="";
      +$fp=@fopen("/etc/passwd","r");
      +if ($fp) {
      +  $pwd=@fread($fp,filesize("/etc/passwd"));
      +  @fclose($fp);
      +}
      +$apwd=explode("\n",$pwd);
      +foreach($apwd as $line) {
      +  $fld=explode(":",$line);
      +  if (substr(md5($fld[0]),0,8) == $id) {
      +    $email=$fld[0]."@".$domain;
      +    $nm=substr($fld[4],0,strpos($fld[4],","));
      +    $name=$email;
      +    if ($nm)
      +      $name="\"".$nm."\" <".$email.">";
      +  }
      +}
      +if (!$name)
      +  exit;
      +
      +/* Create and show the image */
      +Header ("Content-type: image/gif");
      +$im = imagecreate ($size*strlen($name), $size*1.5);
      +$white = ImageColorAllocate ($im, 255, 255, 255); /* first allocated color: background */
      +$black = ImageColorAllocate ($im, 0,0,0);
      +ImageTTFText($im, $size, 0, 0, $size , $black, "/usr/share/enlightenment/E-docs/aircut3.ttf",$name);
      +ImageGif ($im);
      +ImageDestroy ($im);
      +?>
      +
      +
      + +The script to find the id:
      +
      + + +#!/bin/sh
      +
      +# small script for email.php3
      +echo "Enter login:"
      +read login
      +echo "The URL is:"
      +printf "http://yourhost/email.php3?id="
      +printf '%s' "$login"|md5sum|cut -c1-8
      +echo
      +
      +
      + +
    3. You can also create temporary email aliases, each week, for all users
      +Good: Efficient, and you can give your real email in your reply-to address
      +Bad: Temporary emails
      +How to do: Not so hard to do
      +Example: (script & php, Unix)
      + + +#!/bin/sh
      +#
      +# Anonymous random aliases for all users
      +# changed each week, to avoid spam problems
      +# on websites
      +# (to put into /etc/cron.weekly/)
      +
      +# Each alias is regenerated each week, and valid for 2 weeks
      +
      +# prefix for all users
      +# must not be the prefix of another alias!
      +USER_PREFIX="user-"
      +
      +# valid for 2 weeks
      +ALIAS_VALID=2
      +
      +# random string
      +SECRET="my secret string `hostname -f`"
      +
      +# build
      +grep -vE "^$USER_PREFIX" /etc/aliases > /etc/aliases.new
      +for i in `cut -f1 -d':' /etc/passwd`; do
      +  if test `id -u $i` -ge 500; then
      +    off=0
      +    while test "$off" -lt $ALIAS_VALID; do
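      +      # strip the leading zero so that "08" is not read as an octal number
      +      WEEK=`date +'%U' | sed 's/^0//'`
      +      THISWEEK="`date +'%Y'` $(($WEEK-$off))"
      +      # keep SECRET unchanged, so that last week's alias is computed the same way again
      +      HASH="`echo \"$SECRET $i $THISWEEK\" | md5sum | cut -c1-4`"
      +      FIRST=`echo $i | cut -c1-3`
      +      NAME="$USER_PREFIX$FIRST$HASH"
      +      echo "$NAME : $i" >> /etc/aliases.new
      +      #
      +      off=$(($off+1))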
      +    done
      +  fi
      +done
      +
      +# move file
      +mv -f /etc/aliases /etc/aliases.old
      +mv -f /etc/aliases.new /etc/aliases
      +
      +# update aliases
      +newaliases
      +
      +
      + +And then, put the email address in your pages through: +
      +
      + + +<a href="mailto:<?php
      +    $user="smith";
      +    $alias=exec("grep ".$user." /etc/aliases | cut -f1 -d' ' | head -n1");
      +    print $alias;
      +?>@mycompany.com">
      + +