From 25adbdabb47499fe641c7bd9595024ff82667058 Mon Sep 17 00:00:00 2001 From: Xavier Roche Date: Mon, 19 Mar 2012 12:51:31 +0000 Subject: httrack 3.30.1 --- HelpHtml/abuse.html | 580 ---------------------------------------------------- 1 file changed, 580 deletions(-) delete mode 100644 HelpHtml/abuse.html (limited to 'HelpHtml/abuse.html') diff --git a/HelpHtml/abuse.html b/HelpHtml/abuse.html deleted file mode 100644 index 4be36a1..0000000 --- a/HelpHtml/abuse.html +++ /dev/null @@ -1,580 +0,0 @@ - - - - - - - HTTrack Website Copier - Offline Browser - - - - - - - - - -
HTTrack Website Copier
- - - - -
Open Source offline browser
- - - - -
- - - - -
- - - - -
- - -

For HTTrack users:

- -
- -

For webmasters having problems with bandwidth abuse / other abuses related to HTTrack:

- - -

-
- - - - - - - - - -

Advice & what not to do

- -

Please follow these common sense rules to avoid any network abuse

- -
- -
    -
  • Do not overload the websites!
  • -
    -Downloading a site can overload it if you have a fast connection, or if you fetch too many CGI pages (dynamically generated pages) simultaneously.
    -
      -
    • Do not download too large websites: use filters
    • -
    • Do not use too many simultaneous connections
    • -
    • Use bandwidth limits
    • -
    • Use connection limits
    • -
    • Use size limits
    • -
    • Use time limits
    • -
    • Only disable robots.txt rules with great care
    • -
    • Try not to download during working hours
    • -
    • Check your mirror transfer rate/size
    • -
    • For large mirrors, first ask the webmaster of the site
    • -
    -
    -
  • Ensure that you can copy the website
  • -
      -
    • Are the pages copyrighted?
    • -
    • Can you copy them for private purposes only?
    • -
    • Do not make online mirrors unless you are authorized to do so
    • -
    -
    -
  • Do not overload your network
  • -
      -
    • Is your (corporate, private..) network connected through a dial-up ISP?
    • -
    • Is your network bandwidth limited (and expensive)?
    • -
    • Are you slowing down the traffic?
    • -
    -
    -
  • Do not steal private information
  • -
      -
    • Do not grab emails
    • -
    • Do not grab private information
    • -
    -
- -
-

-
- - - - - - - - - -

Abuse FAQ for webmasters

- -

How to limit network abuse -
-HTTrack Website Copier FAQ (updated - DRAFT) -

- -
-Q: How to block offline browsers, like HTTrack?
-
-A: This is a complex question, let's study it
-
-First, there can be several different reasons for wanting to do that.
-Why do you want to block offline browsers?
-
-
    -
  1. Because a large part of your bandwidth is used by a few users, who are slowing down the rest
  2. Because of copyright questions (you do not want people to copy parts of your website)
  3. Because of privacy (you do not want email grabbers to steal all your users' email addresses)
-
-
-
    - - -
  1. Bandwidth abuse:
    -
    -Many webmasters are concerned about bandwidth abuse, even though the problem is caused by a minority of people. Offline browser tools, like HTTrack, can be used in the WRONG way, and are therefore sometimes considered a potential danger.
    -But before concluding that all offline browsers are BAD, consider this: students, teachers, IT consultants, web surfers and many other people who like your website may want to copy parts of it for their work or their studies, or to teach or demonstrate to people during classes or shows. They might do so because they are connected through an expensive modem connection, because they would like to consult pages while travelling, to archive sites that may be removed one day, to do some data mining, or to compile information ("if only I could find this website I saw one day..").
    -There are many good reasons to mirror websites, and this helps many good people.
    -As a webmaster, you might be interested in using such tools too: to test for broken links, move a website to another location, check which external links appear on your website for legal/content control, test the webserver's response and performance, index it..
    -
    -Anyway, bandwidth abuse can be a problem. If your site is regularly "clobbered" by evil downloaders, you have
    -various solutions. There are radical solutions and intermediate solutions. I strongly recommend not using
    -radical solutions, because of the previous remarks (good people often mirror websites).
    -
    -In general, for all solutions,
    -the good thing: it will limit the bandwidth abuse
    -the bad thing: depending on the solution, it will be either a small constraint, or a fatal -nuisance (you'll get 0 visitors)
    -or, to be extreme: if you unplug the wire, there will be no bandwidth abuse
    -
    -
      - -
    1. Inform people, explain why ("please do not clobber the bandwidth")
      -Good: Will work with good people. Many good people just don't KNOW that they can slow down -a network.
      -Bad: Will **only** work with good people
      -How to do: Obvious - place a note, a warning, an article, a drawing, a poem or whatever you want
      -
      -
    2. Use "robots.txt" file
      -Good: Easy to setup
      -Bad: Easy to override
      -How to do: Create a robots.txt file in the top-level directory of the site, with the proper rules
      -Example:
      -    User-agent: *
      -    Disallow: /bigfolder
      -
      -
    3. Ban registered offline-browsers User-agents
      -Good: Easy to setup
      -Bad: Radical, and easy to override
      -How to do: Filter the "User-agent" HTTP header field
      -
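Example: a minimal sketch of such a filter, written in JavaScript as it might run in a Node.js-style server. The pattern list and the function names `isBlockedUserAgent` and `statusForRequest` are assumptions made up for illustration, not part of HTTrack or of any webserver. As noted above, the client can simply send another User-agent value, so this only stops default configurations.

```javascript
// Hypothetical blocklist: lowercase substrings of User-Agent values to refuse.
// The entries are illustrative assumptions, not a complete list.
const BLOCKED_AGENT_PATTERNS = ["httrack", "wget", "webcopier"];

// Returns true when the User-Agent header contains one of the blocked patterns.
function isBlockedUserAgent(userAgent) {
  if (typeof userAgent !== "string" || userAgent === "") {
    return false; // no header sent: let the request through rather than guess
  }
  const ua = userAgent.toLowerCase();
  return BLOCKED_AGENT_PATTERNS.some((pattern) => ua.includes(pattern));
}

// Picks the HTTP status to answer with: 403 for blocked agents, 200 otherwise.
// `headers` is assumed to be an object with lowercase header names.
function statusForRequest(headers) {
  return isBlockedUserAgent(headers["user-agent"]) ? 403 : 200;
}
```

A request handler would call `statusForRequest(request.headers)` before serving the page and stop early on 403.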
      -
    4. Limit the bandwidth per IP (or by folders)
      -Good: Efficient
      -Bad: Multiple users behind the same proxy will be slowed down, not really easy to set up
      -How to do: Depends on webserver. Might be done with low-level IP rules (QoS)
      -
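Example: when the webserver offers no such option, the idea can be sketched at application level with one token bucket per IP. This is only an illustration in JavaScript; the class name `PerIpByteLimiter` and its parameters are hypothetical, and real setups would more likely rely on webserver or QoS features as said above.

```javascript
// One token bucket per client IP: each IP may send `burstBytes` at once,
// then is refilled at `bytesPerSecond`. All names here are hypothetical.
class PerIpByteLimiter {
  constructor(bytesPerSecond, burstBytes) {
    this.rate = bytesPerSecond;
    this.burst = burstBytes;
    this.buckets = new Map(); // ip -> { tokens, last }
  }

  // Returns how many of the `wanted` bytes may be sent to `ip` at time `nowMs`.
  allow(ip, wanted, nowMs) {
    let bucket = this.buckets.get(ip);
    if (!bucket) {
      bucket = { tokens: this.burst, last: nowMs };
      this.buckets.set(ip, bucket);
    }
    // Refill according to the time elapsed since the last call, capped at the burst size.
    const elapsedSeconds = Math.max(0, nowMs - bucket.last) / 1000;
    bucket.tokens = Math.min(this.burst, bucket.tokens + elapsedSeconds * this.rate);
    bucket.last = nowMs;
    const granted = Math.min(wanted, Math.floor(bucket.tokens));
    bucket.tokens -= granted;
    return granted;
  }
}
```

The caller sends only the granted number of bytes and retries later for the rest. Users sharing a proxy share one bucket, which is the drawback listed above.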
      -
    5. Prioritize small files over large files
      -Good: Efficient if large files are the cause of abuse
      -Bad: Not always efficient
      -How to do: Depends on the webserver
      -
      -
    6. Ban abuser IPs
      -Good: Immediate solution
      -Bad: Annoying to do, useless for dynamic IPs, and not very user friendly
      -How to do: Either ban IPs on the firewall, or on the webserver (see ACLs)
      -
      -
    7. Limit abusers' IPs
      -Good: Intermediate and immediate solution
      -Bad: Annoying to do, useless for dynamic IPs, and annoying to maintain..
      -How to do: Use routing QoS (fair queuing), or webserver options
      -
      -
    8. Use technical tricks (like javascript) to hide URLs
      -Good: Efficient
      -Bad: The most efficient tricks will also make your website heavy and not user-friendly (and therefore less attractive, even for surfing users). Remember: clients or visitors might want to consult your website offline. Advanced users will still be able to note the URLs and fetch them. It will not work on non-javascript browsers. It will not work if the user clicks 50 times and puts the downloads in the background with a standard browser
      -How to do: Most offline browsers (I would say all, but let's say most) are unable to -"understand" javascript/java properly. Reason: very tricky to handle!
      -Example:
      -You can replace:
      - - -    <a href="bigfile.zip">Foo</a>
      -
      - -by:
      - -    <script language="javascript">
      -    <!--
      -    document.write('<a h' + 're' + 'f="');
      -    document.write('bigfile' + '.' + 'zip">');
      -    // -->
      -    </script>
      -    Foo
      -    </a>
      -
      -
      -You can also use java-based applets. I would say that it is the "best of the -horrors". A big, fat, slow, bogus java applet. Avoid!
      -
      -
    9. Use technical tricks to lag offline browsers
      -Good: Efficient
      -Bad: Can be avoided by advanced users, annoying to maintain, AND potentially worse than the disease (cgi's often use some CPU time). It will not work if the user clicks 50 times and puts the downloads in the background with a standard browser
      -How to do: Create fake empty links that point to cgi's, with long delays
      -Example: Use things like <a href="slow.cgi?p=12786549"><nothing></a> (example in php:)
      -    <?php
      -    for($i=0;$i<10;$i++) {
      -        sleep(6);
      -        echo " ";
      -    }
      -    ?>
      -
      - -
      -
    10. Use technical tricks to temporarily ban IPs
      -Good: Efficient
      -Bad: Radical (your site will only be available online for all users), not easy to set up
      -How to do: Create fake links with "killing" targets
      -Example: Use things like <a href="killme.cgi"><nothing></a> -(again an example in php:)
      - -    <?php
      -        // Of course, -"add_temp_firewall_rule" has to be written..
      -        add_temp_firewall_rule($_SERVER['REMOTE_ADDR'],"30s");
      -    ?>
      -
      -
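Example: the bookkeeping behind a temporary ban can be sketched as follows, in JavaScript. This keeps the ban list in memory instead of writing a firewall rule, so it is only an assumption about how "add_temp_firewall_rule" might be approximated inside the application; the class name `TemporaryBanList` is hypothetical.

```javascript
// In-memory list of temporarily banned IPs. Each ban expires after `banMs`
// milliseconds. Hypothetical helper, not a real firewall interface.
class TemporaryBanList {
  constructor(banMs) {
    this.banMs = banMs;
    this.expiry = new Map(); // ip -> timestamp (ms) at which the ban ends
  }

  // Called when the "killing" link is fetched by `ip` at time `nowMs`.
  ban(ip, nowMs) {
    this.expiry.set(ip, nowMs + this.banMs);
  }

  // Returns true while the ban is active; expired entries are forgotten.
  isBanned(ip, nowMs) {
    const until = this.expiry.get(ip);
    if (until === undefined) {
      return false;
    }
    if (nowMs >= until) {
      this.expiry.delete(ip);
      return false;
    }
    return true;
  }
}
```

With `new TemporaryBanList(30000)` the ban lasts 30 seconds, like the "30s" used in the php example. The server would check `isBanned` before answering each request.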
      -
      -
    11. -
    - - -
  2. Copyright issues
    -
    -You do not want people to "steal" your website, or even copy parts of it. First, stealing a website does not
    -require an offline browser. Second, direct (and credited) copying is sometimes better than disguised
    -plagiarism. Besides, several of the previous remarks also apply here: the more protected your website is,
    -the less attractive it may become. There is no perfect solution, either. A webmaster once asked me
    -for a solution to prevent any copying of his website: not only by offline browsers, but also through "save as",
    -cut and paste, print.. and print screen. I replied that it was not possible, especially for print screen, and
    -that another potential threat was the evil photographer. Maybe with a "this document will self-destruct in 5 seconds.."
    -or by shooting users after consulting the document.
    -More seriously, once a document has been placed on a website, there will always be a risk of copying (or plagiarism)
    -
    -To limit the risk, solutions 1 and 8 from the "Bandwidth abuse" section above can be used
    -
    -
    -
  3. Privacy
    -
    -This might be related to section 2.
    -But the greatest risk is probably email grabbers.
    -
    -
      -
    1. A solution can be to use javascript to hide emails.
      -Good: Efficient
      -Bad: Will not work on non-javascript browsers
      -How to do: Use javascript to build mailto: links
      -Example: (in javascript)
      - -    <script language="javascript">
      -    <!--
      -    function FOS(host,nom,info) {
      -      var s;
      -      if (info == "") info=nom+"@"+host;
      -      s="mail";
      -      document.write("<a href='"+s+"to:"+nom+"@"+host+"'>"+info+"</a>");
      -    }
      -    FOS('mycompany.com','smith?subject=Hi, John','Click here to email me!')
      -    // -->
      -    </script>
      -
      -
      -
    2. Another one is to create images of emails
      -Good: Efficient, does not require javascript
      -Bad: There is still the problem of the link (mailto:), images are bigger than text
      -How to do: Not so obvious if you do not want to create the images yourself
      -Example: (php, Unix)
      - - -<?php
      -/*
      -Email contact displayer
      -Usage: email.php3?id=<4 bytes of user's md5>
      -The <4 bytes of user's md5> can be calculated using the 2nd script (see below)
      -Example: http://yourhost/email.php3?id=91ff1a48
      -*/
      -$domain="mycompany.com";
      -$size=12;
      -
      -/* Find the user in the system database */
      -$id = isset($_GET['id']) ? $_GET['id'] : ""; /* do not rely on register_globals */
      -if (!$id)
      -  exit;
      -unset($email);
      -unset($name);
      -unset($pwd);
      -unset($apwd);
      -$email="";
      -$name="";
      -$fp=@fopen("/etc/passwd","r");
      -if ($fp) {
      -  $pwd=@fread($fp,filesize("/etc/passwd"));
      -  @fclose($fp);
      -}
      -$apwd=explode("\n",$pwd); /* split() was removed from PHP; explode() does the same here */
      -foreach($apwd as $line) {
      -  $fld=explode(":",$line);
      -  if (substr(md5($fld[0]),0,8) == $id) {
      -    $email=$fld[0]."@".$domain;
      -    $nm=substr($fld[4],0,strpos($fld[4],","));
      -    $name=$email;
      -    if ($nm)
      -      $name="\"".$nm."\" <".$email.">";
      -  }
      -}
      -if (!$name)
      -  exit;
      -
      -/* Create and show the image */
      -Header ("Content-type: image/gif");
      -$im = imagecreate ($size*strlen($name), (int)($size*1.5));
      -$white = ImageColorAllocate ($im, 255, 255, 255); /* first allocated color is the background */
      -$black = ImageColorAllocate ($im, 0,0,0);
      -ImageTTFText($im, $size, 0, 0, $size , $black, "/usr/share/enlightenment/E-docs/aircut3.ttf",$name);
      -ImageGif ($im);
      -ImageDestroy ($im);
      -?>
      -
      -
      - -The script to find the id:
      -
      - - -#!/bin/sh
      -
      -# small script for email.php3
      -echo "Enter login:"
      -read login
      -echo "The URL is:"
      -printf "http://yourhost/email.php3?id="
      -printf '%s' "$login"|md5sum|cut -c1-8
      -echo
      -
      -
      - -
    3. You can also create temporary email aliases, each week, for all users
      -Good: Efficient, and you can give your real email in your reply-to address
      -Bad: Temporary emails
      -How to do: Not so hard to do
      -Example: (script & php, Unix)
      - - -#!/bin/sh
      -#
      -# Anonymous random aliases for all users
      -# changed each week, to avoid spam problems
      -# on websites
      -# (to put into /etc/cron.weekly/)
      -
      -# Each alias is regenerated each week, and valid for 2 weeks
      -
      -# prefix for all users
      -# must not be the prefix of another alias!
      -USER_PREFIX="user-"
      -
      -# valid for 2 weeks
      -ALIAS_VALID=2
      -
      -# random string
      -SECRET="my secret string `hostname -f`"
      -
      -# build
      -grep -vE "^$USER_PREFIX" /etc/aliases > /etc/aliases.new
      -for i in `cut -f1 -d':' /etc/passwd`; do
      -  if test `id -u $i` -ge 500; then
      -    off=0
      -    while test "$off" -lt $ALIAS_VALID; do
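      -      # %-U (GNU date) drops the leading zero, which would otherwise be read as octal
      -      THISWEEK="`date +'%Y'` $((`date +'%-U'`-$off))"
      -      # keep SECRET unchanged so that last week's alias is regenerated identically
      -      HASH="`echo \"$SECRET $i $THISWEEK\" | md5sum | cut -c1-4`"
      -      FIRST=`echo $i | cut -c1-3`
      -      NAME="$USER_PREFIX$FIRST$HASH"
      -      echo "$NAME : $i" >> /etc/aliases.new
      -      #
      -      off=$(($off+1))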
      -    done
      -  fi
      -done
      -
      -# move file
      -mv -f /etc/aliases /etc/aliases.old
      -mv -f /etc/aliases.new /etc/aliases
      -
      -# update aliases
      -newaliases
      -
      -
      - -And then, put the email address in your pages through: -
      -
      - - -<a href="mailto:<?php
      -    $user="smith";
      -    $alias=exec("grep ".$user." /etc/aliases | cut -f1 -d' ' | head -n1");
      -    print $alias;
      -?>@mycompany.com>> -
      - -
    4. -
    - -
  5. - - - - - - -
- -
- - -
-
-
- - - - - -
- - - - - - - -- cgit v1.2.3