From 844ecc37072d515513177c65a8c9dc35c9cdfc1a Mon Sep 17 00:00:00 2001 From: Xavier Roche Date: Mon, 19 Mar 2012 12:55:42 +0000 Subject: httrack 3.33.16 --- html/cache.html | 293 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 293 insertions(+) create mode 100755 html/cache.html (limited to 'html/cache.html') diff --git a/html/cache.html b/html/cache.html new file mode 100755 index 0000000..df28dc3 --- /dev/null +++ b/html/cache.html @@ -0,0 +1,293 @@ + + + + + + + HTTrack Website Copier - Cache format specification + + + + + + + + + +
HTTrack Website Copier
+ + + + +
Open Source offline browser
+ + + + +
+ + + + +
+ + + + +
+ + +

Cache format specification

+ +
+ +To support updates, HTTrack stores the original (untouched) HTML data, +references to downloaded files, and other meta-data (especially parts of the HTTP headers) in a cache +located in the hts-cache directory. Because local HTML pages are always modified to "fit" the local +filesystem structure, and because meta-data such as the Last-Modified date and Etag cannot be stored +with the associated files, the cache is mandatory for the reprocessing (update/continue) phases. + +

+ +

The (new) cache.zip format

+ +The 3.31 release of HTTrack introduces a new cache format, more extensible and efficient than the previous one (ndx/dat format). + +The main advantages of this cache are: + +
    +
  • One single file for a complete website cache archive
  • +
  • Standard ZIP format, which can be easily reused on most platforms and from most languages
  • +
  • Compressed data with the efficient and open zlib format
  • +
+ +The cache is made of ZIP file entries, with one entry per fetched URL (whether fetched successfully or not; errors are also stored).
+For each entry: +
    +
  • The ZIP file name is the original URL [see notes below]
  • +
  • The ZIP file contents, if available, are the original data (compressed using the deflate algorithm)
  • +
  • The ZIP file extra field (in the local file header) contains a list of meta-fields, very similar to HTTP header fields. See also RFC.

  • +
  • The ZIP file timestamp follows the "Last-Modified" field given for this URL, if any
  • +
+ +Example of cache file: +
+ +
+$ unzip -l hts-cache/new.zip
+Archive:  hts-cache/new.zip
+HTTrack Website Copier/3.31-ALPHA-4 mirror complete in 3 seconds : 5 links scanned, 
+3 files written (16109 bytes overall) [17690 bytes received at 5896 bytes/sec]
+(1 errors, 0 warnings, 0 messages)
+  Length     Date   Time    Name
+ --------    ----   ----    ----
+       94  07-18-03 08:59   http://www.httrack.com/robots.txt
+     9866  01-17-04 01:09   http://www.httrack.com/html/cache.html
+        0  05-11-03 13:31   http://www.httrack.com/html/images/bg_rings.gif
+      207  01-19-04 05:49   http://www.httrack.com/html/fade.gif
+        0  05-11-03 13:31   http://www.httrack.com/html/images/header_title_4.gif
+ --------                   -------
+    10167                   5 files
+
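The listing above can also be produced programmatically. The following is a minimal sketch using Python's standard `zipfile` module; the function names `list_cache_entries` and `read_cached_body` are illustrative and not part of HTTrack. It assumes the cache opens as an ordinary ZIP archive, with each entry name being the original URL.

```python
import zipfile


def list_cache_entries(zip_source):
    """Return (size, timestamp, url) for every entry in a cache archive.

    zip_source may be a path such as "hts-cache/new.zip" or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return [
            (info.file_size, info.date_time, info.filename)
            for info in archive.infolist()
        ]


def read_cached_body(zip_source, url):
    """Return the original (decompressed) data stored for one URL."""
    with zipfile.ZipFile(zip_source) as archive:
        # The entry name is the URL itself, so it is used as the lookup key.
        return archive.read(url)
```

Entries with a length of 0 in the listing have no data in the archive, so `read_cached_body` returns an empty byte string for them.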
+ +Example of cache file meta-data: +
+ +
+HTTP/1.1 200 OK
+X-In-Cache: 1
+X-StatusCode: 200
+X-StatusMessage: OK
+X-Size: 94
+Content-Type: text/plain
+Last-Modified: Fri, 18 Jul 2003 08:59:11 GMT
+Etag: "40ebb5-5e-3f17b6df"
+X-Addr: www.httrack.com
+X-Fil: /robots.txt
+
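A meta-data block like the one above can be split into its status line and fields with a few lines of code. This is a hedged sketch, not HTTrack's own parser: the name `parse_cache_metadata` is hypothetical, and it assumes the text has already been decoded to a string and that field values carry no significant leading or trailing spaces.

```python
def parse_cache_metadata(text):
    """Split cache meta-data into (status, fields).

    Anything after the first blank line (double CR/LF) is ignored.
    """
    # Normalise line endings, then keep only the zone before the first blank line.
    zone = text.replace("\r\n", "\n").split("\n\n", 1)[0]
    lines = [line for line in zone.split("\n") if line.strip()]
    if not lines:
        raise ValueError("empty meta-data")

    # Status line: HTTP-Version SP Status-Code SP X-Reason-Phrase
    parts = lines[0].split(" ", 2)
    parts += [""] * (3 - len(parts))
    status = {"version": parts[0], "code": int(parts[1]), "reason": parts[2]}

    # Remaining lines are "Name: value" pairs; a dict keeps their stored order.
    fields = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return status, fields
```

Splitting on the first colon only keeps values such as the `Last-Modified` date, which contain colons themselves, intact.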
+ +There are also specific issues regarding this format: + +
    +
  • The data in the central directory (such as CD extra field, and CD comments) are not used
  • +
  • The ZIP archive is allowed to contain more than 65535 (2^16 - 1) files; in such a case the total number of entries in the 32-bit central directory is 65535 (0xffff), but the presence of the 64-bit central directory is not mandatory
  • +
  • The ZIP archive is allowed to contain more than 2^32 bytes (4GiB); in such a case the 64-bit central directory must be present (not currently supported)
  • +
+ +
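As an illustration of the entry-count issue, the sketch below reads the total-entries field from the standard ZIP "end of central directory" record and reports whether it sits at 0xffff, in which case the real number of entries may be larger. The function name `declared_entry_count` is hypothetical, and locating the record with a simple backwards search is an assumption that can be fooled by an archive comment containing the same signature bytes.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
# signature, disk number, disk with central directory, entries on this disk,
# total entries, central directory size, central directory offset, comment length
EOCD = struct.Struct("<4s4H2LH")


def declared_entry_count(data):
    """Return (total_entries, saturated) from the end of central directory record.

    data is the full archive content as bytes. saturated is True when the
    16-bit counter holds 0xffff, meaning "65535 entries or more".
    """
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD.size > len(data):
        raise ValueError("end of central directory record not found")
    fields = EOCD.unpack_from(data, pos)
    total_entries = fields[4]
    return total_entries, total_entries == 0xFFFF
```

When the counter is saturated and no 64-bit central directory is present, a reader has to walk the central directory itself to find every entry rather than trust this field.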
+Meta-data stored in the "extra field" of the local file headers
+ +The extra field is composed of text data, and this text data is composed of distinct header lines. +The end of the text, or a double CR/LF, marks the end of this zone. +This allows the original HTTP headers to be optionally stored just after the "meta-data" headers for informational use.
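Because the meta-data lives in the local file header and not in the central directory, a reader has to parse the local header itself. In Python, for instance, `zipfile.ZipInfo.extra` is filled from the central directory, but `ZipInfo.header_offset` gives the position of the local header. The sketch below parses the standard 30-byte local file header at a given offset; the function name `read_local_extra` is hypothetical, and it assumes the extra field holds raw text as described above.

```python
import struct

LOCAL_SIGNATURE = b"PK\x03\x04"
# signature, version needed, flags, compression method, mod time, mod date,
# CRC-32, compressed size, uncompressed size, name length, extra length
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def read_local_extra(stream, header_offset):
    """Return (name, extra) as bytes from the local file header at header_offset.

    stream is a seekable binary file object opened on the ZIP archive.
    """
    stream.seek(header_offset)
    fields = LOCAL_HEADER.unpack(stream.read(LOCAL_HEADER.size))
    if fields[0] != LOCAL_SIGNATURE:
        raise ValueError("not a ZIP local file header")
    name_length, extra_length = fields[9], fields[10]
    # The file name and the extra field directly follow the fixed-size header.
    name = stream.read(name_length)
    extra = stream.read(extra_length)
    return name, extra
```

With a real cache, the offset would typically come from iterating `zipfile.ZipFile(path).infolist()` and passing each `info.header_offset` along with the archive opened in binary mode.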
+ +
+The status line (the first header line)
+ +Status-Line = HTTP-Version SP Status-Code SP X-Reason-Phrase CRLF
+ +
+Other lines:
+ +
+Specific fields:
+
    +
  • X-In-Cache

  • +Indicates if the data are present (value=1) in the cache (that is, as ZIP data), or in an external file (value=0). +This field MUST be the first field. + +
  • X-StatusCode

  • +The status code as modified by HTTrack after processing. For example, 304 ("Not Modified") status codes are transformed into "200" codes after processing. + +
  • X-StatusMessage

  • +The modified (by httrack) status message. + +
  • X-Size

  • +The stored (either in cache, or in an external file) data size. + +
  • X-Charset

  • +The original charset. + +
  • X-Addr

  • +The original URL address part. + +
  • X-Fil

  • +The original URL path part. + +
  • X-Save

  • +The local filename, depending on user's "build structure" preferences. + +
+ +
+Standard (RFC 2616) "useful" fields:
+
    +
  • Content-Type
  • +
  • Last-Modified
  • +
  • Etag
  • +
  • Location
  • +
  • Content-Disposition
  • +
+ +
+Specific fields in "BNF-like" grammar:
+ +
+X-In-Cache          = "X-In-Cache" ":" 1*DIGIT
+X-StatusCode        = "X-StatusCode" ":" 1*DIGIT
+X-StatusMessage     = "X-StatusMessage" ":" *<TEXT, excluding CR, LF>
+X-Size              = "X-Size" ":" 1*DIGIT
+X-Charset           = "X-Charset" ":" value
+X-Addr              = "X-Addr" ":" scheme ":" "//" authority
+X-Fil               = "X-Fil" ":" rel_path
+X-Save              = "X-Save" ":" rel_path
+
+ +RFC standard fields:
+ +
+Content-Type        = "Content-Type" ":" media-type
+Last-Modified       = "Last-Modified" ":" HTTP-date
+Etag                = "ETag" ":" entity-tag
+Location            = "Location" ":" absoluteURI
+Content-Disposition = "Content-Disposition" ":" disposition-type *( ";" disposition-parm )
+
+ +
+And, for your information, +
+X-Reason-Phrase     = *<TEXT, with a maximum of 32 characters, and excluding CR, LF>
+
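The constraints stated in this section can be checked mechanically. The following sketch is only an illustration under stated assumptions: the name `check_metadata` is hypothetical, field values are assumed to be already stripped of surrounding whitespace, and only the rules given in this document are covered (the numeric fields, the position of X-In-Cache, and the reason phrase limit).

```python
import re

DIGITS = re.compile(r"[0-9]+")  # 1*DIGIT
NUMERIC_FIELDS = ("X-In-Cache", "X-StatusCode", "X-Size")


def check_metadata(reason_phrase, fields):
    """Return a list of problems found; an empty list means no problem was seen.

    fields is a list of (name, value) pairs in their stored order.
    """
    problems = []
    if len(reason_phrase) > 32 or "\r" in reason_phrase or "\n" in reason_phrase:
        problems.append("reason phrase must be at most 32 characters without CR or LF")
    if not fields or fields[0][0] != "X-In-Cache":
        problems.append("X-In-Cache must be the first field")
    for name, value in fields:
        if name in NUMERIC_FIELDS and not DIGITS.fullmatch(value):
            problems.append(f"{name} must be 1*DIGIT, got {value!r}")
    return problems
```

Returning a list of problems, instead of raising on the first one, makes it easier to report every issue in a damaged cache entry at once.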
+ + +Note: Because the URLs used as entry names may have an unexpected format, especially embedded double "/" and other reserved characters ("?", "&", ...), +various ZIP extraction tools may have trouble accessing or decompressing the data. +Libraries should generally handle this peculiar format, however. + + +

+ + +
+
+
+ + + + + +
+ + + + + + -- cgit v1.2.3