Field | Value | Date |
---|---|---|
author | Xavier Roche <xroche@users.noreply.github.com> | 2013-06-01 09:39:09 +0000 |
committer | Xavier Roche <xroche@users.noreply.github.com> | 2013-06-01 09:39:09 +0000 |
commit | 2d4f6880c1f9fe436d5dd4286fa584503c18c98d | |
tree | 07fdf1a70871e323c8e280c313449146ae9865c3 /src/proxy | |
parent | 7b5c1c5a8487fe9dfcd2799359a5395ccf797372 | |
Fixed issue 14 (http://code.google.com/p/httrack/issues/detail?id=14) related to the way non-ascii characters are being decoded
Rationale:
* inside URI
  * non-ASCII characters are read with the page encoding and transformed into UTF-8
  * url-escaped %xx are treated as UTF-8 sequences and decoded, unless they form invalid sequences (in which case they are left as-is)
  * HTML entities (named, or decimal/hex) are decoded as UTF-8 characters
* inside query string
  * non-ASCII characters are read as binary and escaped using %xx
  * url-escaped %xx are left as-is unless decoding them is harmless (alphanumerics, for example)
  * HTML entities (named, or decimal/hex) are decoded as UTF-8 characters and encoded back to the page encoding (possibly using %xx)
* inside hostnames
  * non-ASCII characters are encoded using IDNA
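
The query-string and hostname rules can be sketched roughly as follows. This is an illustration in Python, not the code from this commit: the helper names `normalize_query` and `encode_hostname` are hypothetical, and keeping an entity untouched when it cannot be represented in the page encoding is an assumption the commit message does not cover.

```python
import html
import re

# Named, decimal, or hexadecimal HTML entities terminated by ';'.
ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_query(raw: bytes, page_encoding: str) -> str:
    """Hypothetical helper: normalize a query string given as raw page bytes."""
    # Latin-1 maps bytes 0..255 to code points 0..255, so this keeps the
    # input "binary": one character per original byte.
    text = raw.decode("latin-1")

    def entity_to_page_bytes(match):
        try:
            encoded = html.unescape(match.group(0)).encode(page_encoding)
        except UnicodeEncodeError:
            return match.group(0)  # assumption: keep unrepresentable entities
        return encoded.decode("latin-1")

    text = ENTITY.sub(entity_to_page_bytes, text)

    def unescape_harmless(match):
        ch = chr(int(match.group(1), 16))
        # Only ASCII alphanumerics are treated as harmless in this sketch.
        return ch if ch.isascii() and ch.isalnum() else match.group(0)

    text = ESCAPE.sub(unescape_harmless, text)

    # Any remaining non-ASCII byte is escaped as %XX.
    return "".join(c if ord(c) < 0x80 else "%%%02X" % ord(c) for c in text)


def encode_hostname(host: str) -> str:
    """Hypothetical helper: IDNA-encode a hostname with Python's built-in codec."""
    return host.encode("idna").decode("ascii")
```

With these definitions, `normalize_query(b"q=caf&eacute;&x=%41%2F", "iso-8859-1")` gives `q=caf%E9&x=A%2F`: the entity is re-encoded in the page encoding, `%41` is unescaped because it is alphanumeric, and `%2F` is left alone.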
Example:
* the following are equivalent in an ISO-8859-1 page: http://foo/café.html http://foo/caf%c3%a9.html http://caf&#a9;.html
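
A minimal sketch of the path rules, showing how different spellings of the same path can collapse to one string. It is written in Python for illustration only. `normalize_path` is a hypothetical name, the entity spelling `&#xe9;` used in the usage note is an assumed form, and decoding only runs of high-byte escapes (so that ASCII escapes such as `%2F` stay untouched) is a simplification chosen for this example.

```python
import html
import re
from urllib.parse import unquote_to_bytes

# Runs of %xx escapes whose bytes are >= 0x80 (candidate UTF-8 sequences).
HIGH_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def normalize_path(raw: bytes, page_encoding: str) -> str:
    """Hypothetical helper: return the path as a Unicode string."""
    # Non-ASCII bytes are interpreted using the page encoding.
    text = raw.decode(page_encoding)
    # HTML entities (named, decimal, hex) become characters.
    text = html.unescape(text)

    def decode_run(match):
        try:
            return unquote_to_bytes(match.group(0)).decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # invalid UTF-8: leave the escapes as-is

    return HIGH_ESCAPES.sub(decode_run, text)
```

Under these assumptions, the ISO-8859-1 bytes of `/café.html`, the escaped form `/caf%c3%a9.html`, and the entity form `/caf&#xe9;.html` all normalize to `/café.html`, while `/caf%e9.html` is returned unchanged because `%e9` alone is not valid UTF-8.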
Diffstat (limited to 'src/proxy')
0 files changed, 0 insertions, 0 deletions