author Xavier Roche <xroche@users.noreply.github.com> 2013-06-01 09:39:09 +0000
committer Xavier Roche <xroche@users.noreply.github.com> 2013-06-01 09:39:09 +0000
commit 2d4f6880c1f9fe436d5dd4286fa584503c18c98d
tree 07fdf1a70871e323c8e280c313449146ae9865c3 /src/proxy
parent 7b5c1c5a8487fe9dfcd2799359a5395ccf797372
Fixed issue 14 (http://code.google.com/p/httrack/issues/detail?id=14), related to the way non-ascii characters are decoded

Rationale:
* inside URI
  * non-ascii characters are read with the page encoding, and transformed into UTF-8
  * url-escaped %xx are considered UTF-8 sequences to be decoded, unless they form invalid sequences (in which case they are left as-is)
  * html entities (names, or decimal/hex) are decoded as UTF-8 characters
* inside query string
  * non-ascii characters are read as binary, and escaped using %xx
  * url-escaped %xx are left as-is, unless decoding them is harmless (alphanumeric characters, for example)
  * html entities (names, or decimal/hex) are decoded as UTF-8 characters and encoded back to the page encoding (possibly using %xx)
* inside hostnames
  * non-ascii characters are encoded using IDNA

Example:
* are equivalent in an iso-8859-1 page:
  http://foo/café.html
  http://foo/caf%c3%a9.html
  http://caf&#a9;.html
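
The "inside URI" rules above can be sketched roughly as follows. This is a minimal Python illustration, not the HTTrack implementation (which is written in C): the function names `decode_path_escapes` and `normalize_path` are hypothetical, entity decoding is omitted, and each run of consecutive %xx escapes is treated as a single unit, which is a simplifying assumption.

```python
import re

# One or more consecutive %xx escapes, treated as a single byte sequence.
_ESCAPED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_path_escapes(path: str) -> str:
    """Decode %xx runs as UTF-8; leave a run as-is if it is not valid UTF-8."""

    def _decode(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Invalid UTF-8 sequence: keep the original escapes untouched.
            return match.group(0)

    return _ESCAPED_RUN.sub(_decode, path)


def normalize_path(raw: bytes, page_encoding: str) -> str:
    """Read raw path bytes with the page encoding, then decode %xx escapes."""
    text = raw.decode(page_encoding)
    return decode_path_escapes(text)


if __name__ == "__main__":
    # Both spellings of the path from an iso-8859-1 page give the same text.
    print(normalize_path("/café.html".encode("iso-8859-1"), "iso-8859-1"))
    print(normalize_path(b"/caf%c3%a9.html", "iso-8859-1"))
    # %e9 alone is not a valid UTF-8 sequence, so it is left as-is.
    print(decode_path_escapes("/caf%e9.html"))
```

Note that this sketch also decodes escaped ASCII characters such as %2F, which a real crawler would probably want to keep escaped; the commit message does not say how that case is handled.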
Diffstat (limited to 'src/proxy')
0 files changed, 0 insertions, 0 deletions