Age | Commit message (Collapse) | Author |
|
bug is still out there.
|
the current page being parsed, NOT the parent page. (alexei dot co at gmail dot com)
* closes: issue #20
|
related to the way non-ascii characters are being decoded
Rationale:
* inside URI
  * non-ascii characters are read with the page encoding, and transformed into UTF-8
  * url-escaped %xx are considered utf-8 sequences to be decoded, unless they form invalid sequences (in which case we leave them as-is)
  * html entities (names, or decimal/hex) are decoded as utf-8 characters
* inside query string
  * non-ascii characters are read as binary, and escaped using %xx
  * url-escaped %xx are left as-is unless decoding them is harmless (alphanumerics, for example)
  * html entities (names, or decimal/hex) are decoded as utf-8 characters and encoded back to the page encoding (possibly using %xx)
* inside hostnames
  * non-ascii characters are encoded using IDNA
Example:
* the following are equivalent in an iso-8859-1 page: http://foo/café.html http://foo/caf%c3%a9.html http://caf&#a9;.html
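The URI rules above can be sketched in a few lines of Python. This is only an illustration, not the code from htsparse.c: the function names `decode_path_escapes` and `path_from_page` are hypothetical, and the sketch simplifies by leaving a whole %xx run untouched as soon as any part of it is invalid UTF-8.

```python
import html
import re


def decode_path_escapes(path: str) -> str:
    """Decode runs of %xx as UTF-8; leave a run as-is if it is not valid UTF-8."""
    def repl(match):
        run = match.group(0)
        raw = bytes.fromhex(run.replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return run  # invalid sequence: keep the original escapes
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", repl, path)


def path_from_page(raw: bytes, page_charset: str) -> str:
    """Hypothetical helper: normalize a URI path found in a page."""
    text = raw.decode(page_charset)   # read non-ascii with the page encoding
    text = html.unescape(text)        # decode named and decimal/hex entities
    return decode_path_escapes(text)  # then decode %xx as UTF-8 where valid


# The raw, escaped and entity forms of the path normalize to the same string.
print(path_from_page(b"/caf\xe9.html", "iso-8859-1"))
print(path_from_page(b"/caf%c3%a9.html", "iso-8859-1"))
print(path_from_page(b"/caf&#xe9;.html", "iso-8859-1"))
```

A real implementation would also have to keep reserved characters such as %2F escaped; the sketch ignores that.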
|
Rationale:
* hostname is ASCII, non-ascii characters shall be encoded with IDNA
* URI filenames may embed non-ascii characters, which MUST be UTF-8 encoded
* query string may embed non-ascii characters, which are encoded with the page charset into %xx codes
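As a rough illustration of the three rules above, the following Python sketch encodes each URL part with a different scheme. The function name `encode_url_parts` and the host `café.example` are assumptions made for the example; Python's built-in `idna` codec and `urllib.parse.quote` stand in for whatever the C code actually does.

```python
from urllib.parse import quote


def encode_url_parts(host: str, path: str, query: str, page_charset: str) -> str:
    """Hypothetical helper: encode each URL part according to its own rule."""
    host_ascii = host.encode("idna").decode("ascii")              # hostname: IDNA
    path_escaped = quote(path, safe="/", encoding="utf-8")        # path: UTF-8, then %xx
    query_escaped = quote(query, safe="=&", encoding=page_charset)  # query: page charset, then %xx
    return f"http://{host_ascii}{path_escaped}?{query_escaped}"


print(encode_url_parts("café.example", "/café.html", "q=café", "iso-8859-1"))
# prints http://xn--caf-dma.example/caf%C3%A9.html?q=caf%E9
```

The same character ends up as three different byte sequences, which is why the part of the URL it appears in has to be known before encoding.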
|
Fixed HTML entities decoding which was done before charset decoding.
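One way the ordering can matter is sketched below in Python. It is an assumption about the failure mode, not a reproduction of the original bug: if entities are expanded into UTF-8 bytes before the buffer is decoded with the page charset, the expanded characters get decoded a second time and come out garbled. The function names are hypothetical.

```python
import html
import re


def decode_charset_then_entities(raw: bytes, charset: str) -> str:
    """Decode the page charset first, then expand HTML entities."""
    return html.unescape(raw.decode(charset))


def decode_entities_then_charset(raw: bytes, charset: str) -> str:
    """Wrong order: expand entities to UTF-8 bytes, then decode with the page charset."""
    def expand(match):
        return html.unescape(match.group(0).decode("ascii")).encode("utf-8")
    expanded = re.sub(rb"&#?\w+;", expand, raw)
    return expanded.decode(charset)


raw = b"caf&eacute; cr\xe8me"  # iso-8859-1 bytes, one entity and one raw accented byte
print(decode_charset_then_entities(raw, "iso-8859-1"))  # café crème
print(decode_entities_then_charset(raw, "iso-8859-1"))  # cafÃ© crème
```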
|
(RFC 3986)" (http://code.google.com/p/httrack/issues/detail?id=12)
|
(http://code.google.com/p/httrack/issues/detail?id=11)
|
javascript issues (http://code.google.com/p/httrack/issues/detail?id=4)
|
(http://code.google.com/p/httrack/issues/detail?id=2)
|
setup:
indent -l80 -lc80 -nhnl -nut -bad -bap -bbo -br -brf -bli2 -brs -bls -br -ss
-sai -pmt -nsaw -nsaf -nprs -i2 -ce -npsl -npcs -cs -sob -cdw -nbc -lp
logs:
indent: htsparse.c:364: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:366: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:368: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:370: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:387: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:738: Warning:old style assignment ambiguity in "=*". Assuming "= *"
indent: htsparse.c:907: Warning:old style assignment ambiguity in "=*". Assuming "= *"
indent: htsparse.c:925: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:970: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:971: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:1261: Warning:old style assignment ambiguity in "=*". Assuming "= *"
indent: htsparse.c:1277: Warning:old style assignment ambiguity in "=*". Assuming "= *"
indent: htsparse.c:1410: Warning:old style assignment ambiguity in "=*". Assuming "= *"
indent: htsparse.c:1459: Warning:old style assignment ambiguity in "=*". Assuming "= *"
indent: htsparse.c:1494: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:1504: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:1541: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:1583: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:1597: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:1625: Warning:old style assignment ambiguity in "=-". Assuming "= -"
indent: htsparse.c:2975: Warning:old style assignment ambiguity in "=-". Assuming "= -"
|
* cleaned up logging
|
conditions in downloaded files, leading to the same file being downloaded several times, possibly ending with "Unexpected 412/416 error" errors.
|
* introduced SOClen type (aka. socklen_t)
|
html page
|
* fixed lintian shlib-calls-exit
|