Commit d4d7db31 authored by Raymond Hettinger

Issue 21469: Mitigate risk of false positives with robotparser.

* Repair the broken link to norobots-rfc.txt.

* HTTP response codes >= 500 are now treated as a failed read rather than as a
"not found".  "Not found" means we can assume the entire site is allowed; a 5xx
server error tells us nothing.

* A successful read() or parse() updates the mtime (which is defined to be "the
  time the robots.txt file was last fetched").

* The can_fetch() method returns False unless we've had a read() with a 2xx or
4xx response.  This avoids false positives in the case where a user calls
can_fetch() before calling read() (see the usage sketch below).

* I don't see any easy way to test this patch without hitting internet
resources that might change or without use of mock objects that wouldn't
provide much reassurance.
parent 74d5e1a6
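To make the last two bullets concrete, here is a minimal usage sketch in Python 2
(the branch this patch targets).  The host, user-agent string, and path are
hypothetical placeholders, and the second half assumes the server answers with a
200 and a non-empty robots.txt; it illustrates the new behaviour and is not part
of the patch.

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # hypothetical host

    # Before this patch, a parser that had never fetched robots.txt could
    # still answer True; with the last_checked guard it refuses instead.
    print rp.can_fetch("ExampleBot", "http://www.example.com/private/")   # False
    print rp.mtime()                                                      # 0 -- never fetched

    rp.read()   # fetch robots.txt; on a 200 response the rules are parsed
    # parse() now calls modified(), so mtime() records when the file was fetched
    print rp.mtime()
    print rp.can_fetch("ExampleBot", "http://www.example.com/private/")   # per the parsed rules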
Lib/robotparser.py

@@ -7,7 +7,8 @@
  2) PSF license for Python 2.2
 The robots.txt Exclusion Protocol is implemented as specified in
-http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html
+http://www.robotstxt.org/norobots-rfc.txt
 """
 import urlparse
 import urllib
@@ -60,7 +61,7 @@ class RobotFileParser:
         self.errcode = opener.errcode
         if self.errcode in (401, 403):
             self.disallow_all = True
-        elif self.errcode >= 400:
+        elif self.errcode >= 400 and self.errcode < 500:
             self.allow_all = True
         elif self.errcode == 200 and lines:
             self.parse(lines)
@@ -86,6 +87,7 @@ class RobotFileParser:
         linenumber = 0
         entry = Entry()
+        self.modified()
         for line in lines:
             linenumber += 1
             if not line:
@@ -131,6 +133,14 @@ class RobotFileParser:
             return False
         if self.allow_all:
             return True
+        # Until the robots.txt file has been read or found not
+        # to exist, we must assume that no url is allowable.
+        # This prevents false positives when a user erronenously
+        # calls can_fetch() before calling read().
+        if not self.last_checked:
+            return False
         # search for given user agent matches
         # the first match counts
         parsed_url = urlparse.urlparse(urllib.unquote(url))
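The response-code handling in the read() hunk above can be summarised as a small
decision table.  The helper below is a condensed restatement of that logic for
illustration only; the real method sets disallow_all / allow_all or calls parse()
as side effects rather than returning a label.

    def classify(errcode):
        # `errcode` is the HTTP status received when fetching robots.txt.
        if errcode in (401, 403):
            return "disallow_all"   # access restricted: assume nothing may be fetched
        elif 400 <= errcode < 500:
            return "allow_all"      # no robots.txt: the whole site is assumed allowed
        elif errcode == 200:
            return "parse"          # parse the rules; parse() also records the mtime
        else:
            return "unchanged"      # 5xx and other codes tell us nothing, so no state change

    # Under the old test (errcode >= 400), a transient 503 would have landed in the
    # allow_all branch and produced exactly the false positives this patch removes.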
Misc/NEWS

@@ -52,6 +52,10 @@ Library
 - Issue #21306: Backport hmac.compare_digest from Python 3. This is part of PEP
   466.
+- Issue #21469: Reduced the risk of false positives in robotparser by
+  checking to make sure that robots.txt has been read or does not exist
+  prior to returning True in can_fetch().
 - Issue #21321: itertools.islice() now releases the reference to the source
   iterator when the slice is exhausted.  Patch by Anton Afanasyev.