Commit 75c03368 authored by Kirill Smelkov

zodbdump: Start to stabilize output format

Since zodbdump was started (c0a6299f "zodbdump - Tool to dump content of a
ZODB database   (draft)") its output format has not been well defined.
For example, the user and description transaction properties were output
without proper quoting, so special characters in them (e.g. a newline)
would break the output.
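To illustrate the problem, here is a minimal Python 3 sketch (not the
Python 2 code of this commit). The helper name `quote` and its exact escape
rules are assumptions made for illustration; it only mimics the idea of
always quoting with " and \-escaping everything that could break a line.

```python
def quote(data: bytes) -> str:
    """Return data as a one-line "..." string, always quoted with a double quote.

    Hypothetical helper for illustration; not zodbtools' escapeqq.
    """
    out = []
    for byte in data:
        ch = chr(byte)
        if ch == '"':
            out.append('\\"')           # escape the quoting character itself
        elif ch == "\\":
            out.append("\\\\")          # escape the escape character
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        elif 0x20 <= byte < 0x7f:
            out.append(ch)              # printable ASCII (including ') goes as-is
        else:
            out.append("\\x%02x" % byte)
    return '"' + "".join(out) + '"'


description = b'fix "bug"\nuser root'

# Unquoted, the embedded newline forges an extra "user" line in the dump.
unquoted = "description " + description.decode("ascii")
assert len(unquoted.splitlines()) == 2

# Quoted, the whole value is guaranteed to stay on one line.
quoted = "description " + quote(description)
assert len(quoted.splitlines()) == 1
```

With quoting, a line-oriented reader can rely on every header field
occupying exactly one line.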

So start the format stabilization:

- user and description are output as quoted, so now they are guaranteed
  to be on one line. The quoting character is always " (instead of e.g.
  smartly quoting either by ' or " as python does) for easier
  compatibility with ZODB implementations in other languages.

- transaction extension is now printed as raw bytes, not as dict.
  The idea here is that

  * `zodb dump` should dump raw data as stored inside ZODB, so that later
    `zodb restore` can restore the database to an identical state.

  * raw data should be dumped instead of unpickled data because, in
    general, the on-disk extension format can be any raw bytes, and this
    information should be preserved.

- transaction status is now also output as quoted, so that unusual status
  codes cannot break the line structure.

- it is documented that sha1 is not the only allowed hash function that
  might be used.

- in hashonly mode we add trailing " -" to obj string so that it is
  possible to distinguish outputs of `zodb dump` and `zodb dump --hashonly`
  without knowing a priori how the output was produced.

  The reason is that it would be bad to, e.g., accidentally feed hashonly
  output to (future) `zodb restore`. Without a way to see that it should
  not consume object data, it would read the following transaction
  information as raw object data, leading to confusing later errors (and
  a small chance of restoring a completely different database without
  reporting any error at all).
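As a sketch of how a reader could use the trailing " -" marker, the
following Python 3 snippet parses the data form of an obj line. The regular
expression and the function name `parse_obj_data_line` are hypothetical and
only follow the format described in this commit; this is not code from a
(future) `zodb restore`.

```python
import re

# Hypothetical pattern for the data form of an obj line:
#   obj <oid> <size> <hashfunc>:<hash>[ -]
# The "delete" and "from <tid>" forms intentionally do not match.
OBJ_DATA_RE = re.compile(
    r"^obj (?P<oid>[0-9a-f]+) (?P<size>\d+) "
    r"(?P<hashfunc>\w+):(?P<hash>[0-9a-f]+)(?P<hashonly> -)?$"
)


def parse_obj_data_line(line):
    """Parse an obj line that carries <size> and <hashfunc>:<hash>.

    Returns a dict, or None when the line is not in the data form.
    """
    m = OBJ_DATA_RE.match(line)
    if m is None:
        return None
    return {
        "oid": m.group("oid"),
        "size": int(m.group("size")),
        "hashfunc": m.group("hashfunc"),
        "hash": m.group("hash"),
        # True: no raw content follows; False: <size> raw bytes follow after LF
        "hashonly": m.group("hashonly") is not None,
    }


full = parse_obj_data_line("obj 0000000000000001 4 sha1:" + "ab" * 20)
short = parse_obj_data_line("obj 0000000000000001 4 sha1:" + "ab" * 20 + " -")
assert full["hashonly"] is False
assert short["hashonly"] is True
```

A consumer that sees `hashonly` set knows not to read the next `size` bytes
as object data, so the following txn header is never misread as content.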

Because the ZODB iteration API gives us only the already-unpickled
extension, for now, to dump it as raw, we have to take the long road of
pickling it back, also taking care to pickle in a stable order.
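The idea of pickling in a stable order can be sketched in Python 3 as
follows. This is not the XPickler approach of this commit: it is a
simplified illustration that assumes Python 3.7+ (where dicts keep insertion
order), sorts only the top-level keys, and requires the keys to be mutually
comparable. The function name and the default protocol are assumptions.

```python
import pickle


def serialize_ext_sketch(ext, protocol=2):
    """Serialize an extension dict so that equal dicts give equal bytes.

    Hypothetical simplification: only top-level keys are ordered; nested
    dicts and sets are left as they are.
    """
    if not ext:
        # an empty extension is stored as empty bytes, not as a pickled {}
        return b""
    # dicts iterate in insertion order, so rebuilding the dict from sorted
    # items fixes the order in which the pickler emits them
    ordered = dict(sorted(ext.items()))
    return pickle.dumps(ordered, protocol=protocol)


a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}    # equal to a, but built in a different order

# Plain pickling follows insertion order, so equal dicts can differ in bytes.
assert pickle.dumps(a, 2) != pickle.dumps(b, 2)

# After ordering, both serialize identically and still round-trip.
assert serialize_ext_sketch(a) == serialize_ext_sketch(b)
assert pickle.loads(serialize_ext_sketch(a)) == a
```

Even with stable ordering, the result is only close to the original raw
bytes when the original writer used the same pickle protocol and layout,
which is why this remains a fallback.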

Hopefully this will be only a fallback because of

https://github.com/zopefoundation/ZODB/pull/183

and next zodbtools patch.

~~~~

For testing purposes (currently only quoting function unit test) py.test
usage is introduced.

The code is also generally polished here and there.
parent 79cf177a
@@ -21,6 +21,10 @@ setup(
packages = find_packages(),
install_requires = ['ZODB', 'zodburi', 'six'],
extras_require = {
'test': ['pytest'],
},
entry_points= {'console_scripts': ['zodb = zodbtools.zodb:main']},
classifiers = [_.strip() for _ in """\
......
# -*- coding: utf-8 -*-
# Copyright (C) 2017 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
from zodbtools.util import escapeqq
def test_escapeqq():
testv = (
# in want without leading/trailing "
('', r""),
('\'', r"'"),
('"', r"\""),
('abc\ndef', r"abc\ndef"),
('a\'c\ndef', r"a'c\ndef"),
('a\"c\ndef', r"a\"c\ndef"),
# ('привет', r"привет"), TODO
)
for tin, twant in testv:
twant = '"' + twant + '"' # add lead/trail "
assert escapeqq(tin) == twant
@@ -36,6 +36,23 @@ class Inf:
return +1
inf = Inf()
# escapeqq escapes string into valid "..." string always quoted with ".
#
# (python's automatic escape uses smartquotes quoting with either ' or ")
#
# TODO also accept unicode as input.
# TODO output printable UTF-8 characters as-is, but escape non-printable UTF-8 and invalid UTF-8 bytes.
def escapeqq(s):
outv = []
# we don't want ' to be escaped
for _ in s.split("'"):
# this escapes almost everything except the " character
# NOTE string_escape does not do smartquotes and always uses ' for quoting
# (repr(str) is the same except it does smartquoting picking ' or " automatically)
q = _.encode("string_escape")
q = q.replace('"', r'\"')
outv.append(q)
return '"' + "'".join(outv) + '"'
# get next item from iter -> (item, !stop)
def nextitem(it):
......
@@ -18,55 +18,188 @@
# See https://www.nexedi.com/licensing for rationale and options.
"""Zodbdump - Tool to dump content of a ZODB database
-TODO format (WARNING dump format is not yet stable)
+This program dumps content of a ZODB database.
+It uses ZODB Storage iteration API to get list of transactions and for every
+transaction prints transaction's header and information about changed objects.
-    txn <tid> (<status>)
-    user <user|encode?>
-    description <description|encode?>
-    extension <extension|encode?>
-    obj <oid> (delete | from <tid> | <sha1> <size> (LF <content>)?) LF XXX do we really need back <tid>
-    ---- // ----
-    LF
-    txn ...
+The information dumped is complete raw information as stored in ZODB storage
+and should be suitable for restoring the database from the dump file bit-to-bit
+identical to its original(*). It is dumped in semi text-binary format where
+object data is output as raw binary and everything else is text.
+There is also shortened mode activated via --hashonly where only hash of object
+data is printed without content.
+Dump format:
+    txn <tid> <status|quote>
+    user <user|quote>
+    description <description|quote>
+    extension <raw_extension|quote>
+    obj <oid> (delete | from <tid> | <size> <hashfunc>:<hash> (-|LF <raw-content>)) LF
+    obj ...
+    ...
+    obj ...
+    LF
+    txn ...
+quote: quote string with " with non-printable and control characters \-escaped
+hashfunc: one of sha1, sha256, sha512 ...
+(*) On best-effort basis as it is not generally possible to obtain transaction
+    metadata in raw form.
+TODO also protect txn record by hash.
"""
from __future__ import print_function
 from zodbtools.util import ashex, sha1, txnobjv, parse_tidrange, TidRangeInvalid, \
-        storageFromURL
+        storageFromURL, escapeqq
+from ZODB._compat import loads, _protocol, BytesIO
+from zodbpickle.slowpickle import Pickler as pyPickler
+#import pickletools
import sys
import logging
-def zodbdump(stor, tidmin, tidmax, hashonly=False):
+# zodbdump dumps content of a ZODB storage to a file.
+# please see module doc-string for dump format and details
+def zodbdump(stor, tidmin, tidmax, hashonly=False, out=sys.stdout):
     first = True
     for txn in stor.iterator(tidmin, tidmax):
-        if not first:
-            print()
+        vskip = "\n"
+        if first:
+            vskip = ""
         first = False
-        print('txn %s (%s)' % (ashex(txn.tid), txn.status))
-        print('user: %r' % (txn.user,)) # XXX encode
-        print('description:', txn.description) # XXX encode
-        print('extension:', txn.extension) # XXX dict, encode
+        # XXX .status not covered by IStorageTransactionInformation
+        # XXX but covered by BaseStorage.TransactionRecord
+        out.write("%stxn %s %s\nuser %s\ndescription %s\nextension %s\n" % (
+            vskip, ashex(txn.tid), escapeqq(txn.status),
+            escapeqq(txn.user),
+            escapeqq(txn.description),
+            escapeqq(serializeext(txn.extension)) ))
         objv = txnobjv(txn)
         for obj in objv:
-            entry = 'obj %s ' % ashex(obj.oid)
+            entry = "obj %s " % ashex(obj.oid)
+            write_data = False
             if obj.data is None:
-                entry += 'delete'
+                entry += "delete"
             # was undo and data taken from obj.data_txn
             elif obj.data_txn is not None:
-                entry += 'from %s' % ashex(obj.data_txn)
+                entry += "from %s" % ashex(obj.data_txn)
             else:
-                entry += '%s %i' % (ashex(sha1(obj.data)), len(obj.data))
-                if not hashonly:
-                    entry += '\n'
-                    entry += obj.data
+                # XXX sha1 is hardcoded for now. Dump format allows other hashes.
+                entry += "%i sha1:%s" % (len(obj.data), ashex(sha1(obj.data)))
+                write_data = True
+            out.write(entry)
+            if write_data:
+                if hashonly:
+                    out.write(" -")
+                else:
+                    out.write("\n")
+                    out.write(obj.data)
+            out.write("\n")
# ----------------------------------------
# XPickler is Pickler that tries to save objects stably
# in other words dicts/sets/... are pickled with items emitted always in the same order.
#
# NOTE we order objects by regular python objects "<", and in the general case
# python falls back to comparing objects by their addresses, so the comparison
# result is not in general stable from run to run. The following program
# prints True/False randomly with p. 50%:
# ---- 8< ----
# from random import choice
# class A: pass
# class B: pass
# if choice([True, False]):
# a = A()
# b = B()
# else:
# b = B()
# a = A()
# print a < b
# ---- 8< ----
#
# ( related reference: https://pythonhosted.org/BTrees/#total-ordering-and-persistence )
#
# We are ok with this semi-working solution(*) because it is only a fallback:
# for proper zodbdump usage it is advised for storage to provide
# IStorageTransactionInformationRaw with all raw metadata directly accessible.
#
# (*) but 100% working e.g. for keys = only strings or integers
#
# NOTE cannot use C pickler because hooking into internal machinery is not possible there.
class XPickler(pyPickler):
dispatch = pyPickler.dispatch.copy()
def save_dict(self, obj):
# the original pickler emits items taken from obj.iteritems();
# wrap the sorted items in an object whose .iteritems() yields them in order
items = obj.items()
items.sort() # sorts by key
xitems = asiteritems(items)
# explicit base-class call: also works if pyPickler is an old-style class
pyPickler.save_dict(self, xitems)
def save_set(self, obj):
# set's reduce always returns 3 values
# https://github.com/python/cpython/blob/309fb90f/Objects/setobject.c#L1954
typ, keyv, dict_ = obj.__reduce_ex__(self.proto)
keyv.sort()
rv = (typ, keyv, dict_)
self.save_reduce(obj=obj, *rv)
dispatch[set] = save_set
# asiteritems creates object that emits prepared items via .iteritems()
# see save_dict() above for why/where it is needed.
class asiteritems(object):
def __init__(self, items):
self._items = items
def iteritems(self):
return iter(self._items)
# serializeext canonically serializes transaction's metadata "extension" dict
def serializeext(ext):
# ZODB iteration API gives us depickled extensions and only that.
# So for dumping in raw form we need to pickle it back hopefully getting
# something close to original raw data.
-            print(entry)
if not ext:
# ZODB usually does this: encode {} as empty "", not as "}."
# https://github.com/zopefoundation/ZODB/blob/2490ae09/src/ZODB/BaseStorage.py#L194
#
# and here are decoders:
# https://github.com/zopefoundation/ZODB/blob/2490ae09/src/ZODB/FileStorage/FileStorage.py#L1145
# https://github.com/zopefoundation/ZODB/blob/2490ae09/src/ZODB/FileStorage/FileStorage.py#L1990
# https://github.com/zopefoundation/ZODB/blob/2490ae09/src/ZODB/fstools.py#L66
# ...
return b""
buf = BytesIO()
p = XPickler(buf, _protocol)
p.dump(ext)
out = buf.getvalue()
#out = pickletools.optimize(out) # remove unneeded PUT opcodes
assert loads(out) == ext
return out
# ----------------------------------------
import sys, getopt
......