zodbdump: Start to stabilize output format

Since zodbdump was introduced (c0a6299f "zodbdump - Tool to dump content of a
ZODB database (draft)") and up to now, its output format has not been well
defined. For example, the user and description transaction properties were
output without proper quoting, so unusual characters in them (a newline, for
instance) would break the output.

So start the format stabilization:

- user and description are output quoted, so they are now guaranteed to be on
  one line. The quoting character is always " (instead of e.g. smartly quoting
  with either ' or " as Python does) for easier compatibility with ZODB
  implementations in other languages.

- transaction extension is now printed as raw bytes, not as a dict.
  The idea here is that `zodb dump`

  * should dump raw data as stored inside ZODB so that a later `zodb restore`
    could restore the database to an identical state.

  * should dump raw data rather than unpickled data because, in general, the
    on-disk extension can be any raw bytes, and this information should be
    preserved.

- transaction status is now also output quoted, to prevent line breaks on
  unusual status codes.

- it is documented that sha1 is not the only hash function that may be used.

- in hashonly mode we add a trailing " -" to the obj string so that it is
  possible to distinguish the outputs of `zodb dump` and `zodb dump -hashonly`
  without knowing a priori how they were produced.

  The reason is that it would be bad to feed, e.g. by accident, hashonly
  output to a (future) `zodb restore`. Without a way to see that it should not
  consume object data, it would read the following transaction information as
  raw object data, with confusing later errors (and a small chance of
  restoring a completely different database without reporting any error).

Because the ZODB iteration API gives us an already unpickled extension and
only that, for now, to dump it as raw we have to go the long way and pickle it
back, also taking care to pickle in a stable order.

Hopefully this will be only a fallback because of

https://github.com/zopefoundation/ZODB/pull/183

and the next zodbtools patch.

~~~~

For testing purposes (currently only a unit test for the quoting function)
py.test usage is introduced.

The code is also generally polished here and there.
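
To illustrate the reasoning behind the trailing " -" marker, here is a minimal sketch of how a reader of the dump could classify an `obj` line before deciding whether raw object data follows. It is not part of this commit: `ObjHeader` and `parse_obj_header` are hypothetical names, the sketch targets Python 3, and it assumes fields are separated by single spaces as in the dump format documented in zodbdump.py.

```python
from typing import NamedTuple, Optional


class ObjHeader(NamedTuple):
    """Parsed `obj ...` header line (hypothetical helper type)."""
    oid: str
    kind: str                       # "delete", "from", "hashonly" or "data"
    from_tid: Optional[str] = None
    size: Optional[int] = None
    hashfunc: Optional[str] = None
    digest: Optional[str] = None


def parse_obj_header(line: str) -> ObjHeader:
    """Classify one `obj` header line, given without its trailing LF."""
    fields = line.split(" ")
    if len(fields) < 3 or fields[0] != "obj":
        raise ValueError("not an obj line: %r" % line)
    oid, rest = fields[1], fields[2:]

    if rest == ["delete"]:
        return ObjHeader(oid, "delete")
    if len(rest) == 2 and rest[0] == "from":
        return ObjHeader(oid, "from", from_tid=rest[1])

    # <size> <hashfunc>:<hash>, optionally followed by "-" in hashonly mode
    if len(rest) in (2, 3) and ":" in rest[1]:
        if len(rest) == 3 and rest[2] != "-":
            raise ValueError("unexpected trailer in obj line: %r" % line)
        hashfunc, digest = rest[1].split(":", 1)
        # without "-", <size> bytes of raw content follow on the next line
        kind = "hashonly" if len(rest) == 3 else "data"
        return ObjHeader(oid, kind, size=int(rest[0]),
                         hashfunc=hashfunc, digest=digest)

    raise ValueError("malformed obj line: %r" % line)


# Example: the same object entry with and without the hashonly marker.
full = parse_obj_header("obj 0000000000000001 5 sha1:abcd")
short = parse_obj_header("obj 0000000000000001 5 sha1:abcd -")
print(full.kind, short.kind)
```

With the marker present, a restore tool can reject `hashonly` entries up front instead of misreading the next transaction header as object data.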

Commit 75c03368 authored Nov 02, 2017 by Kirill Smelkov
Showing with 217 additions and 26 deletions

setup.py +4 -0
zodbtools/test/test_util.py +37 -0
zodbtools/util.py +17 -0
zodbtools/zodbdump.py +159 -26

--- a/setup.py
+++ b/setup.py
@@ -21,6 +21,10 @@ setup(
    packages    = find_packages(),
    install_requires = ['ZODB', 'zodburi', 'six'],

+    extras_require = {
+                  'test': ['pytest'],
+    },
+
    entry_points= {'console_scripts': ['zodb = zodbtools.zodb:main']},

    classifiers = [_.strip() for _ in """\

--- a/zodbtools/test/test_util.py
+++ b/zodbtools/test/test_util.py
+# -*- coding: utf-8 -*-
+# Copyright (C) 2017  Nexedi SA and Contributors.
+#                     Kirill Smelkov <kirr@nexedi.com>
+#
+# This program is free software: you can Use, Study, Modify and Redistribute
+# it under the terms of the GNU General Public License version 3, or (at your
+# option) any later version, as published by the Free Software Foundation.
+#
+# You can also Link and Combine this program with other software covered by
+# the terms of any of the Free Software licenses or any of the Open Source
+# Initiative approved licenses and Convey the resulting work. Corresponding
+# source of such a combination shall include the source code for all other
+# software used.
+#
+# This program is distributed WITHOUT ANY WARRANTY; without even the implied
+# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+#
+# See COPYING file for full licensing terms.
+# See https://www.nexedi.com/licensing for rationale and options.
+
+from zodbtools.util import escapeqq
+
+def test_escapeqq():
+    testv = (
+        # in            want without leading/trailing "
+        ('',            r""),
+        ('\'',          r"'"),
+        ('"',           r"\""),
+        ('abc\ndef',    r"abc\ndef"),
+        ('a\'c\ndef',   r"a'c\ndef"),
+        ('a\"c\ndef',   r"a\"c\ndef"),
+        # ('привет',      r"привет"),       TODO
+    )
+
+    for tin, twant in testv:
+        twant = '"' + twant + '"'   # add lead/trail "
+        assert escapeqq(tin) == twant
--- a/zodbtools/util.py
+++ b/zodbtools/util.py
@@ -36,6 +36,23 @@ class Inf:
        return +1
 inf = Inf()

+# escapeqq escapes string into valid "..." string always quoted with ".
+#
+# (python's automatic escape uses smartquotes quoting with either ' or ")
+#
+# TODO also accept unicode as input.
+# TODO output printable UTF-8 characters as-is, but escape non-printable UTF-8 and invalid UTF-8 bytes.
+def escapeqq(s):
+    outv = []
+    # we don't want ' to be escaped
+    for _ in s.split("'"):
+        # this escapes almost everything except the " character
+        # NOTE string_escape does not do smartquotes and always uses ' for quoting
+        # (repr(str) is the same except it does smartquoting picking ' or " automatically)
+        q = _.encode("string_escape")
+        q = q.replace('"', r'\"')
+        outv.append(q)
+    return '"' + "'".join(outv) + '"'

 # get next item from iter -> (item, !stop)
 def nextitem(it):
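
The `string_escape` codec used by `escapeqq` exists only in Python 2. As a rough sketch (an assumption about how a Python 3 port might look, not code from this commit), the same always-double-quote rule can be written over bytes directly. `quote_dq` is a hypothetical name; the escapes chosen mirror what `string_escape` produces, except that `'` is left alone and `"` is escaped.

```python
def quote_dq(data: bytes) -> str:
    """Quote raw bytes as a one-line "..." string, always using double quotes.

    Hypothetical Python 3 counterpart of escapeqq: printable ASCII is kept
    as-is, while backslash, double quote, control and non-ASCII bytes are
    backslash-escaped.
    """
    simple = {
        ord('"'): '\\"',
        ord("\\"): "\\\\",
        ord("\n"): "\\n",
        ord("\r"): "\\r",
        ord("\t"): "\\t",
    }
    out = []
    for b in data:
        if b in simple:
            out.append(simple[b])
        elif 0x20 <= b < 0x7f:
            out.append(chr(b))          # printable ASCII, including '
        else:
            out.append("\\x%02x" % b)   # everything else as \xNN
    return '"' + "".join(out) + '"'


print(quote_dq(b'a"c\ndef'))   # prints "a\"c\ndef"
```

Because every newline becomes the two characters `\n`, the quoted value always stays on a single line, which is the property the dump format relies on.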

--- a/zodbtools/zodbdump.py
+++ b/zodbtools/zodbdump.py
@@ -18,55 +18,188 @@
 # See https://www.nexedi.com/licensing for rationale and options.
 """Zodbdump - Tool to dump content of a ZODB database

-TODO format     (WARNING dump format is not yet stable)
+This program dumps content of a ZODB database.
+It uses ZODB Storage iteration API to get list of transactions and for every
+transaction prints transaction's header and information about changed objects.

-txn <tid> (<status>)
-user <user|encode?>
-description <description|encode?>
-extension <extension|encode?>
-obj <oid> (delete | from <tid> | <sha1> <size> (LF <content>)?) LF     XXX do we really need back <tid>
---- // ----
-LF
-txn ...
+The information dumped is complete raw information as stored in ZODB storage
+and should be suitable for restoring the database from the dump file bit-to-bit
+identical to its original(*). It is dumped in semi text-binary format where
+object data is output as raw binary and everything else is text.

+There is also shortened mode activated via --hashonly where only hash of object
+data is printed without content.
+
+Dump format:
+
+    txn <tid> <status|quote>
+    user <user|quote>
+    description <description|quote>
+    extension <raw_extension|quote>
+    obj <oid> (delete | from <tid> | <size> <hashfunc>:<hash> (-|LF <raw-content>)) LF
+    obj ...
+    ...
+    obj ...
+    LF
+    txn ...
+
+quote:      quote string with " with non-printable and control characters \-escaped
+hashfunc:   one of sha1, sha256, sha512 ...
+
+(*) On best-effort basis as it is not generally possible to obtain transaction
+    metadata in raw form.
+
+TODO also protect txn record by hash.
 """

 from __future__ import print_function
 from zodbtools.util import ashex, sha1, txnobjv, parse_tidrange, TidRangeInvalid,   \
-        storageFromURL
+        storageFromURL, escapeqq
+from ZODB._compat import loads, _protocol, BytesIO
+from zodbpickle.slowpickle import Pickler as pyPickler
+#import pickletools
+
+import sys
+import logging

-def zodbdump(stor, tidmin, tidmax, hashonly=False):
+# zodbdump dumps content of a ZODB storage to a file.
+# please see module doc-string for dump format and details
+def zodbdump(stor, tidmin, tidmax, hashonly=False, out=sys.stdout):
    first = True

    for txn in stor.iterator(tidmin, tidmax):
-        if not first:
-            print()
-        first = False
+        vskip = "\n"
+        if first:
+            vskip = ""
+            first = False

-        print('txn %s (%s)' % (ashex(txn.tid), txn.status))
-        print('user: %r' % (txn.user,))                    # XXX encode
-        print('description:', txn.description)      # XXX encode
-        print('extension:', txn.extension)          # XXX dict, encode
+        # XXX .status not covered by IStorageTransactionInformation
+        # XXX but covered by BaseStorage.TransactionRecord
+        out.write("%stxn %s %s\nuser %s\ndescription %s\nextension %s\n" % (
+            vskip, ashex(txn.tid), escapeqq(txn.status),
+            escapeqq(txn.user),
+            escapeqq(txn.description),
+            escapeqq(serializeext(txn.extension)) ))

        objv = txnobjv(txn)

        for obj in objv:
-            entry = 'obj %s ' % ashex(obj.oid)
+            entry = "obj %s " % ashex(obj.oid)
+            write_data = False
+
            if obj.data is None:
-                entry += 'delete'
+                entry += "delete"

            # was undo and data taken from obj.data_txn
            elif obj.data_txn is not None:
-                entry += 'from %s' % ashex(obj.data_txn)
+                entry += "from %s" % ashex(obj.data_txn)

            else:
-                entry += '%s %i' % (ashex(sha1(obj.data)), len(obj.data))
-                if not hashonly:
-                    entry += '\n'
-                    entry += obj.data
+                # XXX sha1 is hardcoded for now. Dump format allows other hashes.
+                entry += "%i sha1:%s" % (len(obj.data), ashex(sha1(obj.data)))
+                write_data = True
+
+            out.write(entry)
+
+            if write_data:
+                if hashonly:
+                    out.write(" -")
+                else:
+                    out.write("\n")
+                    out.write(obj.data)
+
+            out.write("\n")
+
+# ----------------------------------------
+# XPickler is Pickler that tries to save objects stably
+# in other words dicts/sets/... are pickled with items emitted always in the same order.
+#
+# NOTE we order objects with the regular Python "<", and in the general case
+# Python falls back to comparing objects by their addresses, so the comparison
+# result is not in general stable from run to run. The following program
+# prints True/False randomly with probability 50%:
+# ---- 8< ----
+# from random import choice
+# class A: pass
+# class B: pass
+# if choice([True, False]):
+#     a = A()
+#     b = B()
+# else:
+#     b = B()
+#     a = A()
+# print a < b
+# ---- 8< ----
+#
+# ( related reference: https://pythonhosted.org/BTrees/#total-ordering-and-persistence )
+#
+# We are ok with this semi-working solution(*) because it is only a fallback:
+# for proper zodbdump usage it is advised for storage to provide
+# IStorageTransactionInformationRaw with all raw metadata directly accessible.
+#
+# (*) but 100% working e.g. for keys = only strings or integers
+#
+# NOTE cannot use C pickler because hooking into internal machinery is not possible there.
+class XPickler(pyPickler):
+
+    dispatch = pyPickler.dispatch.copy()
+
+    def save_dict(self, obj):
+        # original pickler emits items taken from obj.iteritems()
+        # let's prepare something that has .iteritems() but emits obj's items ordered
+
+        items = obj.items()
+        items.sort()   # sorts by key
+        xitems = asiteritems(items)
+
+        pyPickler.save_dict(self, xitems)
+
+    def save_set(self, obj):
+        # set's reduce always returns 3 values
+        # https://github.com/python/cpython/blob/309fb90f/Objects/setobject.c#L1954
+        typ, keyv, dict_ = obj.__reduce_ex__(self.proto)
+        keyv.sort()
+
+        rv = (typ, keyv, dict_)
+        self.save_reduce(obj=obj, *rv)
+
+    dispatch[set] = save_set
+
+# asiteritems creates object that emits prepared items via .iteritems()
+# see save_dict() above for why/where it is needed.
+class asiteritems(object):
+
+    def __init__(self, items):
+        self._items = items
+
+    def iteritems(self):
+        return iter(self._items)
+
+
+# serializeext canonically serializes transaction's metadata "extension" dict
+def serializeext(ext):
+    # ZODB iteration API gives us depickled extensions and only that.
+    # So for dumping in raw form we need to pickle it back hopefully getting
+    # something close to original raw data.

-            print(entry)
+    if not ext:
+        # ZODB usually does this: encode {} as empty "", not as "}."
+        # https://github.com/zopefoundation/ZODB/blob/2490ae09/src/ZODB/BaseStorage.py#L194
+        #
+        # and here are decoders:
+        # https://github.com/zopefoundation/ZODB/blob/2490ae09/src/ZODB/FileStorage/FileStorage.py#L1145
+        # https://github.com/zopefoundation/ZODB/blob/2490ae09/src/ZODB/FileStorage/FileStorage.py#L1990
+        # https://github.com/zopefoundation/ZODB/blob/2490ae09/src/ZODB/fstools.py#L66
+        # ...
+        return b""

+    buf = BytesIO()
+    p = XPickler(buf, _protocol)
+    p.dump(ext)
+    out = buf.getvalue()
+    #out = pickletools.optimize(out) # remove unneeded PUT opcodes
+    assert loads(out) == ext
+    return out

 # ----------------------------------------
 import sys, getopt