Skip to content

GitLab

  • Projects
  • Groups
  • Snippets
  • Help
    • Loading...
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in / Register
erp5 erp5
  • Project overview
    • Project overview
    • Details
    • Activity
    • Releases
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
  • Labels
    • Labels
  • Merge requests 139
    • Merge requests 139
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
  • Operations
    • Operations
    • Environments
  • Analytics
    • Analytics
    • CI/CD
    • Repository
    • Value Stream
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Members
    • Members
  • Activity
  • Graph
  • Jobs
  • Commits
Collapse sidebar
  • nexedi
  • erp5erp5
  • Merge requests
  • !1615

Merged
Created May 03, 2022 by Levin Zimmermann@levin.zimmermannMaintainer

Allow patched pandas.read_* in restricted Python

  • Overview 52
  • Commits 1
  • Changes 3

(please see wendelin!99 (closed))

This merge request aims to add restricted (secure) versions of selected pandas.read_* functions, so that they can be used in the zope sandbox. It does so by monkey-patching the respective functions as suggested by @tatuya.

The monkey-patched versions only allow str inputs, instances of any other data type (e.g. pathlib.Path, bytes) are prohibited. If the parsed argument is a str, it will convert it to a StringIO instance and further parse it to the original function.

In this way the restriction prevents that users can parse urls, file paths, etc. which would be downloaded by pandas. The implementation is heuristic, since it makes the assumption that pandas won't search for urls etc. in StringIO instances (which is true in 0.19.x, as the tests showed).

Besides general code issues, my main question is: can the given solution be considered to be secure or should I move to other solutions (see also solution in Wendelin SR and Note below)?

Then: I didn't add read_html yet, because it has an additional unresolved dependency (html5lib). Should we add this dependency and also add read_html?

Regarding the tests: For "the proof of security" the tests need to use working links from which pandas can actually initialize DataFrame instances if the function wouldn't be restricted in correct way. For now I simply added random real-word links, which should be replaced.

Note:

Initially I wanted to patch read_json more explicitly, e.g. more or less directly parse the given argument to pandas internal FrameParser or SeriesParser and simply bypass all of pandas url-etc. parsing parts. But this solution has other disadvantages:

  1. Since we would have to rely on code which doesn't belong to the public api of pandas it would potentially break when switching pandas version (in fact in pandas up-to-date version the mentioned classes have been moved to other modules).

  2. We may have to duplicate code in order to provide users with mostly the same api (rich set of different kwargs and args) and we would have to adapt it when moving to newer pandas versions in order to avoid unexpected behaviour.

  3. This patch wouldn't be generic for the selected read_* functions, because the code structure between pandas different read functions differ widely.

Edited May 18, 2022 by Levin Zimmermann
Assignee
Assign to
Reviewer
Request review from
None
Milestone
None
Assign milestone
Time tracking
Source branch: add-restricted-pandas-read-functions
GitLab Nexedi Edition | About GitLab | About Nexedi | 沪ICP备2021021310号-2 | 沪ICP备2021021310号-7