Allow patched pandas.read_* in restricted Python
(please see wendelin!99 (closed))
This merge request aims to add restricted (secure) versions of selected
pandas.read_* functions, so that they can be used in the zope sandbox.
It does so by monkey-patching the respective functions as suggested by @tatuya.
The monkey-patched versions only allow str inputs; instances of any other data type (e.g. pathlib.Path, bytes) are prohibited.
If the passed argument is a str, it is converted to a StringIO instance and then passed on to the original function.
In this way the restriction prevents users from passing URLs, file paths, etc., which pandas would otherwise download or open.
The implementation is heuristic, since it assumes that pandas won't search for URLs etc. in StringIO instances (which is true in 0.19.x, as the tests showed).
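A minimal sketch of the wrapping idea described above, not the actual patch. The names `restrict_to_str_input` and `fake_read_csv` are hypothetical; `fake_read_csv` is only a dependency-free stand-in for a pandas.read_* function, which accepts either a location string or a buffer. The sketch assumes Python 3 string semantics.

```python
import functools
import io


def restrict_to_str_input(original_reader):
    """Wrap a reader so that it only ever receives an in-memory text buffer."""

    @functools.wraps(original_reader)
    def restricted_reader(data, *args, **kwargs):
        # Reject everything except str: no pathlib.Path, bytes or file objects.
        if not isinstance(data, str):
            raise TypeError(
                "only str input is allowed, got %s" % type(data).__name__
            )
        # The reader now sees a buffer instead of a string that it could
        # interpret as a URL or a file path.
        return original_reader(io.StringIO(data), *args, **kwargs)

    return restricted_reader


def fake_read_csv(filepath_or_buffer):
    """Hypothetical stand-in for pandas.read_csv (assumption, not pandas code)."""
    if isinstance(filepath_or_buffer, str):
        # This is the behaviour the restriction is meant to make unreachable.
        raise RuntimeError("would open a path or URL: %s" % filepath_or_buffer)
    return [line.split(",") for line in filepath_or_buffer.read().splitlines()]


safe_read_csv = restrict_to_str_input(fake_read_csv)
rows = safe_read_csv("a,b\n1,2")  # parsed as data: [['a', 'b'], ['1', '2']]
```

With real pandas the same wrapper would presumably be applied as `pandas.read_csv = restrict_to_str_input(pandas.read_csv)`. Whether that is secure still depends on the assumption stated above, namely that pandas never looks for locations inside a StringIO instance.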
Besides general code issues, my main question is: can the given solution be considered secure, or should I move to another solution (see also the solution in Wendelin SR and the Note below)?
Also, I haven't added read_html yet, because it has an additional unresolved dependency (html5lib). Should we add this dependency and also add
Regarding the tests: as a "proof of security", the tests need to use working links from which pandas could actually initialize DataFrame instances if the function were not restricted correctly. For now I simply added random real-world links, which should be replaced.
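One possible way to replace the real-world links is sketched below: a temporary local file plays the role of the "working" location. This is only an illustration under assumptions. `naive_read` and `restricted_read` are hypothetical stand-ins for an unrestricted and a restricted pandas.read_* function, not the code of this merge request.

```python
import io
import os
import tempfile
import unittest


def naive_read(source):
    """Hypothetical unrestricted reader: a str is treated as a location."""
    if isinstance(source, str):
        with open(source) as f:
            return f.read()
    return source.read()


def restricted_read(data):
    """Hypothetical restricted reader: str only, always wrapped in StringIO."""
    if not isinstance(data, str):
        raise TypeError("only str input is allowed")
    return naive_read(io.StringIO(data))


class TestRestrictedRead(unittest.TestCase):
    def setUp(self):
        # A real, readable file guarantees the location actually "works".
        handle, self.path = tempfile.mkstemp(suffix=".csv")
        with os.fdopen(handle, "w") as f:
            f.write("secret,data\n")

    def tearDown(self):
        os.remove(self.path)

    def test_unrestricted_reader_follows_the_path(self):
        # Sanity check: without the restriction the content is reachable.
        self.assertEqual(naive_read(self.path), "secret,data\n")

    def test_restricted_reader_treats_path_as_data(self):
        # The path string itself comes back; the file is never opened.
        self.assertEqual(restricted_read(self.path), self.path)

    def test_non_str_input_is_rejected(self):
        with self.assertRaises(TypeError):
            restricted_read(self.path.encode())
```

The first test matters for the argument: it shows that the location would really be loaded without the restriction, so the second test is not passing trivially. URLs could be covered in a similar way with a local test server, which is left out here.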
Initially I wanted to patch read_json more explicitly, e.g. by passing the given argument more or less directly to pandas' internal SeriesParser and simply bypassing all of pandas' URL and path handling. But this solution has other disadvantages:
Since we would have to rely on code that doesn't belong to the public API of pandas, it could break when switching pandas versions (in fact, in the current pandas version the mentioned classes have been moved to other modules).
We may have to duplicate code in order to provide users with mostly the same API (a rich set of different args and kwargs), and we would have to adapt it when moving to newer pandas versions in order to avoid unexpected behaviour.
This patch wouldn't be generic for the selected read_* functions, because the code structure of pandas' different read functions differs widely.