Allow patched pandas.read_* in restricted Python (!1615) · Merge requests · nexedi / erp5

Merged Levin Zimmermann requested to merge levin.zimmermann/erp5:add-restricted-pandas-read-functions into master May 03, 2022

(please see wendelin!99 (closed))

This merge request aims to add restricted (secure) versions of selected pandas.read_* functions, so that they can be used in the Zope sandbox. It does so by monkey-patching the respective functions, as suggested by @tatuya.

The monkey-patched versions accept only str inputs; instances of any other data type (e.g. pathlib.Path, bytes) are rejected. If the passed argument is a str, the patched function converts it to a StringIO instance and passes it on to the original function.

This restriction prevents users from passing URLs, file paths, etc., which pandas would otherwise download or open. The implementation is heuristic, since it assumes that pandas won't look for URLs etc. in StringIO instances (which holds in 0.19.x, as the tests showed).
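
The wrapping described above can be sketched roughly as follows. This is an illustrative Python 3 sketch, not the code from this merge request: the names `restrict_to_str` and `demo_read_csv` are hypothetical, and `demo_read_csv` is a stdlib-only stand-in for a pandas reader so that the example runs without pandas installed.

```python
import csv
import functools
from io import StringIO


def restrict_to_str(read_func):
    """Wrap a reader so it only ever receives an in-memory text buffer.

    Hypothetical helper name; the merge request may structure this differently.
    """

    @functools.wraps(read_func)
    def restricted(data, *args, **kwargs):
        # Reject everything except str (pathlib.Path, bytes, open files, ...).
        if not isinstance(data, str):
            raise TypeError(
                "%s only accepts str input, got %s"
                % (read_func.__name__, type(data).__name__)
            )
        # The string is handed over as file content, never as a location.
        # That the wrapped reader does not look for URLs or paths inside a
        # StringIO is the assumption stated above for pandas 0.19.x.
        return read_func(StringIO(data), *args, **kwargs)

    return restricted


def demo_read_csv(filepath_or_buffer):
    """Hypothetical stand-in for a pandas reader; parses a text buffer as CSV."""
    return list(csv.reader(filepath_or_buffer))


safe_read_csv = restrict_to_str(demo_read_csv)

# A plain CSV string is parsed as data.
rows = safe_read_csv("a,b\n1,2\n")

# A URL is also treated as data: it becomes a single cell, nothing is fetched.
url_rows = safe_read_csv("http://example.com/data.csv")
```

With pandas, the same wrapper would presumably be applied to each selected function and the result assigned back to the module attribute; whether that is safe still depends on the StringIO assumption holding for the pandas version in use.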

Besides general code issues, my main question is: can the given solution be considered secure, or should I move to another solution (see also the solution in the Wendelin SR and the Note below)?

Then: I didn't add read_html yet, because it has an additional unresolved dependency (html5lib). Should we add this dependency and also add read_html?

Regarding the tests: as a "proof of security", the tests need to use working links from which pandas could actually initialize DataFrame instances if the function weren't restricted correctly. For now I simply added random real-world links, which should be replaced.

Note:

Initially I wanted to patch read_json more explicitly, e.g. by passing the given argument more or less directly to pandas' internal FrameParser or SeriesParser and simply bypassing all of pandas' URL etc. parsing parts. But this solution has other disadvantages:

Since we would have to rely on code that doesn't belong to the public API of pandas, it could break when switching pandas versions (in fact, in the up-to-date pandas version the mentioned classes have been moved to other modules).
We might have to duplicate code in order to provide users with mostly the same API (a rich set of different args and kwargs), and we would have to adapt it when moving to newer pandas versions in order to avoid unexpected behaviour.
This patch wouldn't be generic for the selected read_* functions, because the code structure of pandas' different read functions differs widely.

Edited May 18, 2022 by Levin Zimmermann