Commit fd1e8f08 authored by Nicolas Delaby

Work around a bug in HTMLParser (2.5 <= v <= 2.7) that is impossible to fix
because the HTMLParser API does not accept an encoding parameter,
so decoding strings on the fly cannot be ensured in all cases.
Python 3 solves the problem by accepting only unicode strings.

The fix consists of passing unicode content to the parser.



git-svn-id: https://svn.erp5.org/repos/public/erp5/trunk@41979 20353a03-c40f-0410-a6d1-a30d3c3de9de
parent 71ac64a1
@@ -279,6 +279,16 @@ def scrubHTML(html, valid=VALID_TAGS, nasty=NASTY_TAGS,
                              remove_javascript=remove_javascript,
                              raise_error=raise_error,
                              default_encoding=default_encoding)
+    # HTMLParser is affected by a known bug referenced
+    # by http://bugs.python.org/issue3932
+    # As suggested by python developers:
+    # "Python 3.0 implicitly rejects non-unicode strings"
+    # We try to decode strings against provided codec first
+    if isinstance(html, str):
+        try:
+            html = html.decode(default_encoding)
+        except UnicodeDecodeError:
+            pass
     parser.feed(html)
     parser.close()
     result = parser.getResult()
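
For readers on Python 3, the same decode-before-feed idea can be sketched as follows. This is only an illustration, not the code from this commit: `TextCollector`, `to_text` and `extract_text` are hypothetical names, and `TextCollector` is a stand-in for the real parser class used by `scrubHTML`. One deliberate difference: the commit passes the undecoded string through when decoding fails, whereas Python 3's `HTMLParser.feed` cannot accept bytes, so this sketch falls back to replacing undecodable bytes instead.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical stand-in for the real parser: collects text nodes only."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def to_text(html, default_encoding="utf-8"):
    """Return html as a text string, decoding byte input first."""
    if isinstance(html, bytes):
        try:
            return html.decode(default_encoding)
        except UnicodeDecodeError:
            # Assumption of this sketch (not in the commit): substitute
            # U+FFFD for bad bytes, since feed() needs a text string.
            return html.decode(default_encoding, errors="replace")
    return html


def extract_text(html, default_encoding="utf-8"):
    """Feed text (never bytes) to the parser and return the collected text."""
    parser = TextCollector()
    parser.feed(to_text(html, default_encoding))
    parser.close()
    return "".join(parser.chunks)
```

Calling `extract_text` with either a text string or UTF-8 encoded bytes goes through the same path, because the input is normalised to text before it ever reaches `feed`.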