Commit 46495182, authored Jun 24, 2012 by Ezio Melotti
#15156: HTMLParser now uses the new "html.entities.html5" dictionary.
Parent: a504a7a7
Showing 4 changed files with 23 additions and 22 deletions:
Doc/library/html.entities.rst  +0 -4
Lib/html/parser.py  +15 -17
Lib/test/test_htmlparser.py  +6 -1
Misc/NEWS  +2 -0
Doc/library/html.entities.rst
@@ -11,10 +11,6 @@
 This module defines four dictionaries, :data:`html5`,
 :data:`name2codepoint`, :data:`codepoint2name`, and :data:`entitydefs`.
-:data:`entitydefs` is used to provide the :attr:`entitydefs`
-attribute of the :class:`html.parser.HTMLParser` class. The definition provided
-here contains all the entities defined by XHTML 1.0 that can be handled using
-simple textual substitution in the Latin-1 character set (ISO-8859-1).

 .. data:: html5
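For orientation, here is a small sketch of how the `html5` dictionary differs from `name2codepoint`. It assumes a Python version that ships `html.entities.html5` (3.3 or later). Keys carry no leading `&`, most end with `;`, and only some names also have a semicolon-less form.

```python
from html.entities import html5, name2codepoint

# Keys have no leading '&'; most of them include the trailing ';'.
alpha = html5["alpha;"]

# Only some names also exist without the semicolon:
# 'Eacute' does, 'alpha' does not.
eacute_forms = ("Eacute" in html5, "Eacute;" in html5)
alpha_bare = "alpha" in html5

# 'apos' is not an HTML 4 entity, so name2codepoint lacks it,
# while html5 provides it (with the semicolon).
apos_forms = ("apos" in name2codepoint, "apos;" in html5)

print(alpha, eacute_forms, alpha_bare, apos_forms)
```

This is why the parser change below no longer needs a hand-built table with a special case for `apos`.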
Lib/html/parser.py
@@ -500,7 +500,6 @@ class HTMLParser(_markupbase.ParserBase):
             self.error("unknown declaration: %r" % (data,))

     # Internal -- helper to remove special character quoting
-    entitydefs = None
     def unescape(self, s):
         if '&' not in s:
             return s
@@ -510,24 +509,23 @@ class HTMLParser(_markupbase.ParserBase):
                 if s[0] == "#":
                     s = s[1:]
                     if s[0] in ['x', 'X']:
-                        c = int(s[1:], 16)
+                        c = int(s[1:].rstrip(';'), 16)
                     else:
-                        c = int(s)
+                        c = int(s.rstrip(';'))
                     return chr(c)
             except ValueError:
-                return '&#' + s + ';'
+                return '&#' + s
             else:
-                # Cannot use name2codepoint directly, because HTMLParser
-                # supports apos, which is not part of HTML 4
-                import html.entities
-                if HTMLParser.entitydefs is None:
-                    entitydefs = HTMLParser.entitydefs = {'apos': "'"}
-                    for k, v in html.entities.name2codepoint.items():
-                        entitydefs[k] = chr(v)
-                try:
-                    return self.entitydefs[s]
-                except KeyError:
-                    return '&' + s + ';'
-
-        return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));",
+                from html.entities import html5
+                if s in html5:
+                    return html5[s]
+                elif s.endswith(';'):
+                    return '&' + s
+                for x in range(2, len(s)):
+                    if s[:x] in html5:
+                        return html5[s[:x]] + s[x:]
+                else:
+                    return '&' + s
+
+        return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+;|\w{1,32};?))",
                       replaceEntities, s, flags=re.ASCII)
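The new lookup can be tried outside the parser. The following is a standalone sketch that mirrors the added lines of this hunk, not the method itself. The names `unescape_sketch`, `_replace_entity` and `_CHARREF` are hypothetical and chosen for illustration; the pattern and the branch order are copied from the diff.

```python
import re
from html.entities import html5

# Pattern from the diff: the ';' is now optional and captured in the group.
_CHARREF = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+;|\w{1,32};?))", flags=re.ASCII)


def _replace_entity(match):
    s = match.group(1)
    if s[0] == "#":
        # Numeric reference: strip the captured ';' before converting.
        s = s[1:]
        try:
            if s[0] in ("x", "X"):
                return chr(int(s[1:].rstrip(";"), 16))
            return chr(int(s.rstrip(";")))
        except ValueError:
            # Not a valid number: put the text back unchanged.
            return "&#" + s
    if s in html5:
        return html5[s]
    if s.endswith(";"):
        # Terminated but unknown name: leave it alone.
        return "&" + s
    # Unterminated name: use the first prefix that is a known entity.
    for x in range(2, len(s)):
        if s[:x] in html5:
            return html5[s[:x]] + s[x:]
    return "&" + s


def unescape_sketch(text):
    if "&" not in text:
        return text
    return _CHARREF.sub(_replace_entity, text)


print(unescape_sketch("&Eacuteric &alphacentauri &co;"))
```

The prefix loop is what lets `&Eacuteric` decode (`Eacute` exists in `html5` without a semicolon) while `&alphacentauri` stays as written (`alpha` exists only as `alpha;`). Moving the `;` inside the captured group is the reason the numeric branch now calls `rstrip(';')`.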
Lib/test/test_htmlparser.py
@@ -456,7 +456,7 @@ class HTMLParserTolerantTestCase(HTMLParserStrictTestCase):
         self._run_check('<form action="/xxx.php?a=1&amp;b=2&amp", '
                         'method="post">', [
                             ('starttag', 'form',
-                                [('action', '/xxx.php?a=1&b=2&amp'),
+                                [('action', '/xxx.php?a=1&b=2&'),
                                  (',', None), ('method', 'post')])])

     def test_weird_chars_in_unquoted_attribute_values(self):
@@ -541,6 +541,11 @@ class HTMLParserTolerantTestCase(HTMLParserStrictTestCase):
         self.assertEqual(p.unescape('&#0038;'), '&')
         # see #12888
         self.assertEqual(p.unescape('&#123; ' * 1050), '{ ' * 1050)
+        # see #15156
+        self.assertEqual(p.unescape('&Eacuteric&Eacute;ric'
+                                    '&alphacentauri&alpha;centauri'),
+                                    'ÉricÉric&alphacentauriαcentauri')
+        self.assertEqual(p.unescape('&co;'), '&co;')

     def test_broken_comments(self):
         html = ('<! not really a comment >'
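The changed expectation in the first test hunk (a trailing `&amp` without a semicolon now decodes to `&` inside an attribute value) can be observed through the public parser API. This is a minimal sketch using a simplified `<a>` tag rather than the test's `<form>` tag; `AttrCollector` is a hypothetical helper, and the behaviour shown is assumed for a Python version that includes this change.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records every start tag together with its decoded attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


collector = AttrCollector()
# The second reference lacks its ';' and sits at the end of the value.
collector.feed('<a href="/xxx.php?a=1&amp;b=2&amp">')
collector.close()
print(collector.seen)
```

Both references in the `href` value come back as a plain `&`, which matches the updated expectation for the `action` attribute in the test.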
Misc/NEWS
@@ -76,6 +76,8 @@ Library
   It is used automatically on platforms supporting the necessary os.openat()
   and os.unlinkat() functions. Main code by Martin von Löwis.

+- Issue #15156: HTMLParser now uses the new "html.entities.html5" dictionary.
+
 - Issue #11113: add a new "html5" dictionary containing the named character
   references defined by the HTML5 standard and the equivalent Unicode
   character(s) to the html.entities module.