Commits · 010fbd2ec25996def9c1dcc1bff09c0d96711a9d · Kirill Smelkov / og-rek

16 Jul, 2024 4 commits

Add AsInt64 helper to help programs handle unpickled data be it int or long · 010fbd2e

Kirill Smelkov authored Jul 15, 2024

On Python two different objects with different types can represent
essentially the same entity. For example 1 (int) and 1L (long) represent
integer number one via two different types and are decoded by ogórek into Go
types int64 and big.Int correspondingly. However on the Python side those
two representations are often used interchangeably and programs are usually
expected to handle both with the same effect.

For example I hit this situation for real with FileStorage index, ZEO
protocol, BTrees and Wendelin.core where data with integer fields, for
the same field, are sometimes represented as int and sometimes as long.
As the result ZODB/go and Wendelin.core had private XInt64 helper used
as follows:

https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/pycompat/pycompat.go
https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/zdata/zblk.go#L385
https://lab.nexedi.com/search?utf8=%E2%9C%93&search=XInt64&group_id=&project_id=73&search_code=true&repository_ref=f7776fc1&nav_source=navbar

Besides int/long we are also hitting similar situation with
bytes / str(py2) / str(py3) which will also need to use corresponding
helpers to be able to load pickled data from existing databases.

In the next patch we will add StrictUnicode mode to handle that strings,
and it will naturally come with AsBytes and AsString helpers. Thus it
also feels appropriate to keep the other useful helpers to handle
unpickled data inside ogórek itself.

This patch adds AsInt64 helper with top-level description and tests.

010fbd2e

Drop support for Go 1.7 and Go 1.8 · 17fed0ed

Kirill Smelkov authored Jul 15, 2024

In the next patch I will need to use big.Int.IsInt64, which was added in
Go 1.9 in 2017. Go 1.7 and Go 1.8 are very old nowdays - their last
point releases were done in 2017 (go1.7.6) and 2018 (go1.8.7) - 7 and 6
years ago correspondingly. Since then those releases are EOL with many
further Go releases done after them. We even specify go1.12 as the
requirement in our go.mod, but that is probably because go1.12 is the
first version to support modules. Anyway dropping support for those very
old Go1.7 and Go1.8 should be almost certainly ok as I doubt if there
are current ogórek users still stuck at those old Go versions.

17fed0ed

encoder: Fix protocol 0 UNICODE emission for invalid UTF-8 · 09fec8bd

Kirill Smelkov authored Jul 12, 2024

In 9daf6a2a (Fix UNICODE decoding) I fixed protocol 0 UNICODE decoding
by implementing "raw-unicode-escape" decoder and switching unpickler to
use that to match 1-to-1 what Python unpickler does.

In 57f875fd (encoder: Fix protocol 0 UNICODE emission) I further fixed
protocol 0 UNICODE encoding by implementing "raw-unicode-escape" encoder
and switching pickler to use that, trying to match 1-to-1 what Python
pickler does.

However there is a difference in between Python unicode and unicode on
Go side: in Python unicode is immutable array of UCS code points. On Go
side, unicode is immutable array of bytes, that, similarly to Go string
are treated as being mostly UTF-8. We did not pick e.g. []rune to
represent unicode on Go side because []rune is not immutable and so
cannot be used as map keys.

So Go unicode can be either valid UTF-8 or invalid UTF-8. For valid
UTF-8 case everything is working as intended: our raw-unicode-escape
decoder, because it works in terms of UCS, can produce only valid UTF-8,
while our raw-unicode-escape encoder also matches 1-to-1 what python
does because for valid UTF-8 utf8.DecodeRuneInString gives good rune and
the encoder further works in terms of UCS.

However for the case of invalid UTF-8, there is a difference in between
what Python and Go raw-unicode-escape encoders do: for Python this case
is simply impossible because there input is []UCS. For the Go case, in
57f875fd I extended the encoder to do:

	// invalid UTF-8 -> emit byte as is
	case r == utf8.RuneError:
		out = append(out, s[0])

which turned out to be not very thoughtful and wrong because
original raw-unicode-escape also emits UCS < 0x100 as those plain bytes
and so if we also emit invalid UTF-8 as is then the following two inputs
would be encoded into the same representation "\x93":

	unicode("\x93")     // invalid UTF-8
	unicode("\xc2\x93)  // UTF-8 of \u93

which would break `decode/encode = identity` invariant and corrupt the
data because when loaded back it will be "\xc2\x93" instead of original "\x93".

-> Fix it by rejecting to encode such invalid UTF-8 via protocol 0 UNICODE.

Unfortunately rejection is the only reasonable choice because
raw-unicode-escape codec does not allow \xAA to be present in the output
stream and so there is simply no way to represent arbitrary bytes there.

Better to give explicit error instead of corrupting the data.

For protocols ≥ 1 arbitrary unicode - both valid and invalid UTF-8 - can
still be loaded and saved because *BINUNICODE opcodes come with
bytestring argument and there is no need to decode/encode those
bytestrings.

/reviewed-by @kisielk
/reviewed-on https://github.com/kisielk/og-rek/pull/67

09fec8bd

Make Bytes and unicode to be distinguishable from string in test errors · 1fea98e4

Kirill Smelkov authored Jul 12, 2024

For example currently if result of decoding is expected to be unicode,
but is actually string the following confusing error is generated with
exactly the same have and want:

    --- FAIL: TestDecode/unicode(non-utf8)/"X\x01\x00\x00\x00\x93." (0.00s)
        ogorek_test.go:504: decode:
            have: "\x93"
            want: "\x93"

This happens because by default Printf("%#v") does not emit object type
even if it was derived type from builtin string - either our Bytes or
unicode, and emits the same string representation irregardless of which
type it is.

I think out-of-the box behaviour of %#v is unreasonable, but it is there
for a long time and so instead of trying to argue with Go upstream,
which after many years of arguing I think is not practical, teach our
Bytes and unicode types how to represent themselves under %#v with
indicating their type in the output.

After the fix the error from above becomes

    --- FAIL: TestDecode/unicode(non-utf8)/"X\x01\x00\x00\x00\x93." (0.00s)
        ogorek_test.go:504: decode:
            have: "\x93"
            want: ogórek.unicode("\x93")

which is more clear because want is different from have.

Adjusted behaviour is not test-only, because user applications might
want to do the same %#v at application level and there it would be also
confusing not to see a type.

/reviewed-by @kisielk
/reviewed-on https://github.com/kisielk/og-rek/pull/67

1fea98e4

12 Jul, 2024 5 commits

Merge pull request #66 from navytux/y/fuzz-sha1 · 9913a56c
Kamil Kisiel authored Jul 12, 2024
```
fuzz: Export tests pickles to fuzz/corpus with stable filenames
```
9913a56c

fuzz: Export tests pickles to fuzz/corpus with stable filenames · 1e108391

Kirill Smelkov authored Jul 12, 2024

In 9d1344ba (fuzz: Automatically export all tests pickles into
fuzz/corpus) I added code that automatically exports all test pickles
to fuzz corpus, so that fuzzing starts from all the tricky cases we
already put into tests. However exported files were named as
test-i-j.pickle, where i is sequential number of test entry in global
test vector. This way if we insert some new entry in the middle, or
change entries order it causes lots of renaming inside fuzz/corpus/ on
next `go generate`.

-> Fix that by emitting vectors as test-<sha1-of-pickle>.pickle so that
file name depends only on pickle content are remains stable
independently of entry position in the list of tests.

Besides changes to fuzz_test.go this patch brings:

1) rm fuzz/corpus/test-*.pickle
2) `go generate`

The result inside fuzz/corpus/ is pure renaming + removal of duplicates.
For example test-15-4.pickle is removed because the same data was in test-14-4.pickle
and now resides in test-ee4459c0c165c4cbc3672ee5ef5438e00121b229.pickle.

1e108391

Merge pull request #65 from navytux/y/bytearray8-fix · 9ca4a70a
Kamil Kisiel authored Jul 12, 2024
```
decoder: Fix integer overflow in BYTEARRAY8 handling
```
9ca4a70a
fuzz/corpus: Update · 4f485784
Kirill Smelkov authored Jul 12, 2024
```
Rerunning recent go-fuze resulted in adding some new test vectors.
```
4f485784

decoder: Fix integer overflow in BYTEARRAY8 handling · 45fb14a6

Kirill Smelkov authored Jul 12, 2024

Rerunning fuzzing afresh found the following crash because uint64 len > max(int64)
was cast to int resulting in negative number:

    "\x960000000\xbd"

    panic: bytes.Buffer.Grow: negative count

    goroutine 1 [running]:
    bytes.(*Buffer).Grow(0xc000012140?, 0xc00010abf8?)
            /home/kirr/src/tools/go/go1.22/src/bytes/buffer.go:166 +0xb4
    github.com/kisielk/og-rek.(*Decoder).bufLoadBytesData(0xc000072000, 0xbd30303030303030)
            /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:788 +0xd5
    github.com/kisielk/og-rek.(*Decoder).bufLoadBinData8(0xc000072000)
            /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:776 +0x105
    github.com/kisielk/og-rek.(*Decoder).loadBytearray8(0xc000072000)
            /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:1270 +0x34
    github.com/kisielk/og-rek.(*Decoder).Decode(0xc000072000)
            /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:311 +0x1a2c
    github.com/kisielk/og-rek.Fuzz({0x7fd052187000, 0x9, 0x9})
            /home/kirr/src/neo/src/github.com/kisielk/og-rek/fuzz.go:15 +0xf6
    go-fuzz-dep.Main({0xc00010af38, 0x1, 0x5bc338?})
            go-fuzz-dep/main.go:36 +0x14c
    main.main()
            github.com/kisielk/og-rek/go.fuzz.main/main.go:15 +0x35
    exit status 2

-> Fix it by first comparing uint64 len without casting and further casting to
int only after we cap the len to be in safe range.

45fb14a6

07 Jul, 2024 1 commit

Switch from Travis CI to GitHub Actions (#64) · 72a53b61

Kirill Smelkov authored Jul 07, 2024

* Switch from Travis CI to GitHub Actions

Travis CI no longer works for ogórek and https://travis-ci.org/kisielk/og-rek
returns "not found".

-> Switch to GitHub Actions to fix CI.

We loose CI coverage for "go tip" as specifying "tip" Go version in
actions/setup-go is not supported out of the box:

https://github.com/actions/setup-go/issues/21

Maybe in 201x it was relevant to test wrt go tip, but nowdays I think
this is no longer much needed.

* CI += go1.16 - go1.22

Add CI coverage for all releases past go1.15 that we had before.

72a53b61

10 Mar, 2021 1 commit

Preliminary support for protocol 5 · 8def3d02

Kirill Smelkov authored Mar 09, 2021

Pickle protocol 5 adds out-of-band data and BYTEARRAY8 opcode (see [1]
and [2]). We add support for BYTEARRAY8 here. Handling out-of-band data
would require more setup from users - on both decoding and encoding
side, and it is likely that currently no practical use-case exists to
work with pickles with out-of-band data on Go side. This way we
behave as if no out-of-band data was provided when seeing all protocol 5
opcodes besides BYTEARRAY8.

Hopefully fixes: https://github.com/kisielk/og-rek/issues/62

[1] https://www.python.org/dev/peps/pep-0574
[2] https://github.com/python/cpython/commit/91f4380cedba

/reviewed-by @kisielk
/reviewed-on https://github.com/kisielk/og-rek/pull/63

8def3d02

16 Sep, 2020 2 commits

Add module support · 24bb08c2

Kirill Smelkov authored Sep 15, 2020

Ogórek can be already used from another module (but at outdated
version), but let's be explicit about that. Go.mod was created by `go
mod init github.com/kisielk/og-rek`. There is no external dependencies.

24bb08c2

.travis += go1.12, go1.13, go1.14 and go1.15 · 4cf717d2

Kirill Smelkov authored Sep 15, 2020

Go1.14 and Go1.15 are the current stable Go releases.
Go1.12 and Go1.14 fill in the gap in testing coverage in between -go1.11
and current go releases.

Maybe we can eventually drop older go release from travis coverage.

4cf717d2

28 Sep, 2018 8 commits

Package-level documentation · 8b25c4ce

Kirill Smelkov authored Sep 28, 2018

Add brief overview on to how to use ogórek, and how Python and Go types
are mapped with each other. For completeness give overview of pickle
protocols and overview of persistent references.

(Documentation put into separate doc.go as suggested by Kamil)

8b25c4ce

fuzz/corpus: Update · e5126a53

Kirill Smelkov authored Sep 27, 2018

Rerun `go generate` since new tests vectors have been added in recent
bytes/bytearray patches.

e5126a53

Fix/Add support for []byte (= bytearray on Python side) · 30ebda02

Kirill Smelkov authored Sep 27, 2018

Starting from 2013 (from 555efd8f "first draft of dumb pickle encoder")
wrt []byte ogórek state was:

1. []byte was encoded as string
2. there was no way to decode a pickle object into []byte

then, as

- []byte encoding was never explicitly tested,
- nor I could find any usage of such encodings via searching through all Free /
  Open-Source software ogórek uses - I searched via "Uses" of NewEncoder on godoc:

  https://sourcegraph.com/github.com/kisielk/og-rek/-/blob/encode.go#L48:6=&tab=references:external

it is likely that []byte encoding support was added just for the sake of
it and convenience and then never used. It is also likely that the
original author does not use ogórek encoder anymore:

	https://github.com/kisielk/og-rek/pull/52#issuecomment-423639026

For those reasons I tend to think that it should be relatively safe to
change how []byte is handled:

- the need to change []byte handling is that currently []byte is a kind of
  exception: we can only encode it and not decode something into it.
  Currently encode/decode roundtrip for []byte gives string, which breaks
  the property of encode/decode being identity for all other basic types.

- on the similar topic, on encoding strings are assumed UTF-8 and are
  encoded as UNICODE opcodes for protocol >= 3. Passing arbitrary bytes
  there seems to be not good.

- on to how change []byte - sadly it cannot be used as Python's bytes
  counterpart. In fact in the previous patch we actually just added
  ogórek.Bytes as a counterpart for Python bytes. We did not used []byte
  for that because - contrary to Python bytes - []byte cannot be used as a
  dict key.

- the most natural counterpart for Go's []byte is thus Python's
  bytearray:

	https://docs.python.org/3/library/stdtypes.html#bytearray-objects

  which is "a mutable counterpart to bytes objects"

So add Python's bytearray decoding into []byte, and correspondingly
change []byte encoding to be encoded as bytearray.

P.S.

This changes encoder semantic wrt []byte. If some ogórek use breaks
somewhere because of it, we could add an encoder option to restore
backward compatible behaviour. However since I suspect noone was
actually encoding []byte into pickles, I prefer we wait for someone to
speak-up first instead of loading EncoderConfig with confusion options
that nobody will ever use.

30ebda02

decoder: Remember protocol version as last seen in a PROTO opcode · 2c359fc1

Kirill Smelkov authored Sep 27, 2018

In the next patch decoding a pickle will depend on whether current
protocol is <= 2 or >= 3. For this let's teach PROTO opcode handler to
recall last seen protocol version.

For testing - prepare tests infrastructure for cases where protocol
version affects decoding semantic: if a test pickle in tests table comes
with \x80\xff prefix, on decode tests this prefix will be changed to
concrete

	\x80 ver

with all versions protperly iterated as specified in TestPickle.protov.

We will also need this functionality in the next patch.

2c359fc1

Add support for Python bytes · 514123d4

Kirill Smelkov authored Sep 27, 2018

In Python bytes is immutable and read-only array of bytes. It is
also hashable and so is different from go []byte in that it can be
used as a dict key. Thus the closes approximation for Python bytes in Go
is some type derived from Go's string - it will be different from string
and at the same time will inherit from string it immutability property
and being able to be used as map key. So

- add ogórek.Bytes type to represent Python bytes
- add support to decode BINBYTES* pickle opcodes (these are protocol 3 opcodes)
- add support to encode ogórek.Bytes via those BINBYTES* opcodes
- for protocols <= 2, where there is no opcodes to directly represent
  bytes, adopt the same approach as Python - by pickling bytes as

	_codecs.encode(byt.decode('latin1'), 'latin1')

  this way unpickling it on Python3 will give bytes, while unpickling it
  on Python2 will give str:

	In [1]: sys.version
	Out[1]: '3.6.6 (default, Jun 27 2018, 14:44:17) \n[GCC 8.1.0]'

	In [2]: byt = b'\x01\x02\x03'

	In [3]: _codecs.encode(byt.decode('latin1'), 'latin1')
	Out[3]: b'\x01\x02\x03'

  ---

	In [1]: sys.version
	Out[1]: '2.7.15+ (default, Aug 31 2018, 11:56:52) \n[GCC 8.2.0]'

	In [2]: byt = b'\x01\x02\x03'

	In [3]: _codecs.encode(byt.decode('latin1'), 'latin1')
	Out[3]: '\x01\x02\x03'

- correspondingly teach decoder to recognize particular calls to
  _codecs.encode as being representation for bytes and decode it
  appropriately.

- since we now have to emit byt.decode('latin1') as UNICODE - add, so
  far internal, `type unicode(string)` that instructs ogórek encoder to
  always emit the string with UNICODE opcodes (regular string is encoded
  to unicode pickle object only for protocol >= 3).

- For []byte encoding preserve the current status - even though
  dispatching in Encoder.encode changes, the end result is the same -
  []byte was and stays currently encoded as just regular string.

  This was added in 555efd8f "first draft of dumb pickle encoder", and
  even though that might be not a good choice, changing it is a topic for
  another patch.

514123d4

encoder: Fix protocol 0 UNICODE emission · 57f875fd

Kirill Smelkov authored Sep 27, 2018

Previously, we were quoting UNICODE opcode argument with strconv.QuoteToASCII().
However that function, in addition to \u and \U escapes, can produce
e.g. \n, \r, \xAA etc escapes. And all of the latter variants are not
treated as special escapes of a unicode literal by Python, thus leading
to data being wrongly received.

Fix it by doing exactly the same that Python pickle encoder does - the
UNICODE argument comes are "raw-unicode-escape" encoded.

This patch contains only codec tests - not end-to-end pickle tests,
because currently Encoder.encodeUnicode() is called only from under
Encoder.encodeString(), and there only from under

	if e.config.Protocol >= 3

We will indirectly add tests for encodeUnicode @ protocol=0 in the next
patches, while adding support for Python bytes.

57f875fd

pyquote: Add tests · 9936cf9d

Kirill Smelkov authored Sep 27, 2018

I initially added pyquote only as debugging tool (in b429839d "tests:
Show pickles in a way that can be copy-pasted into Python"), and later
it started to be used in the Encoder (see 18004fbd "Move pyquote into
main codebase"). However there is no explicit tests for pyquote.

Add some pyquote tests.

9936cf9d

tests/testEncode: Fix potential panic · a15630df

Kirill Smelkov authored Sep 27, 2018

Both object and objectDecodedBack are interace{}. So it could turn out,
if e.g. they are both of the same non-comparable type (e.g. []byte),
comparing them directly will panic.

a15630df

26 Sep, 2018 7 commits

fuzz/corpus: Update · 3ab92142

Kirill Smelkov authored Sep 26, 2018

I was (re-)running fuzz tests again, and in addition to what was already
known (e.g. self-referencing structures) found two new issues:

- long as dict keys (https://github.com/kisielk/og-rek/issues/55), and
- encoding global with '\n' in module name (fixed in previous patch).

Update the corpus with points go-fuzz found, so that next runs could
restart with having all the previous interesting input vectors available.

3ab92142

encoder: Fix GLOBAL emission wrt module/name with \n · 6e5e403e

Kirill Smelkov authored Sep 26, 2018

Caught via fuzzing:

	"\x8c\x030\n02\x93."

        0: \x8c SHORT_BINUNICODE '0\n0'
        5: 2    DUP
        6: \x93 STACK_GLOBAL
        7: .    STOP

	panic: protocol 0: decode back error: err
	pickle: "c0\n0\n0\n0\n."

	goroutine 1 [running]:
	github.com/kisielk/og-rek.Fuzz(0x7f2f1009a000, 0x8, 0x200000, 0x3)
	        /tmp/go-fuzz-build645492341/gopath/src/github.com/kisielk/og-rek/fuzz.go:47 +0x8b8
	go-fuzz-dep.Main(0x525e10)
	        /tmp/go-fuzz-build645492341/goroot/src/go-fuzz-dep/main.go:49 +0xad
	main.main()
	        /tmp/go-fuzz-build645492341/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
	exit status 2

i.e. '0\n0' module name was emitted as-is as part ot text-based GLOBAL which
completely broke pickle stream.

For the reference Python decodes such globals with \n in name just ok:

	In [10]: s = b"S'decimal\\nq'\nS'Decimal'\n\x93."

	In [11]: pickle.loads(s)
	---------------------------------------------------------------------------
	ModuleNotFoundError                       Traceback (most recent call last)
	<ipython-input-11-764e4625bc41> in <module>()
	----> 1 pickle.loads(s)

	ModuleNotFoundError: No module named 'decimal\nq'

	In [12]: import sys

	In [15]: d = sys.modules['decimal']

	In [16]: sys.modules['decimal\nq'] = d

	In [17]: pickle.loads(s)
	Out[17]: decimal.Decimal

6e5e403e

fuzz: Fix error printing thinko · f2f59c50
Kirill Smelkov authored Sep 26, 2018
```
I put "err" into format string instead of argument. Apologize for that...
```
f2f59c50

fuzz/corpus: Update · 192173dc

Kirill Smelkov authored Sep 25, 2018

While fixing https://github.com/kisielk/og-rek/issues/48 there were
corresponding test-vector additions. Regenerate the fuzz test corpus.

192173dc

Fix UNICODE decoding · 9daf6a2a

Kirill Smelkov authored Sep 25, 2018

UNICODE is text-based opcode, which is used at protocol 0 and follows
'raw-unicode-escape' encoded argument till EOL.

- for decoding we must explicitly implement Python's
  'raw-unicode-escape' codec decoding, which is used by Python's pickle
  for UNICODE argument.

Updates and hopefully fixes: https://github.com/kisielk/og-rek/issues/48

9daf6a2a

Fix STRING encoding/decoding · f62fe97f

Kirill Smelkov authored Sep 25, 2018

STRING is text-based opcode which is used at protocol 0 and follows
\-escaped argument till EOL.

- for encoding we must not use Go's %q, since that will use \u and \U
  when seeing corresponding bytes, and since Python does not interpret
  \u or \U in string literals, the data received at Python side will be
  different.

- for decoding we must explicitly implement Python's 'string-escape'
  codec decoding which is used by Python's pickle for STRING opcode
  argument.

Updates: https://github.com/kisielk/og-rek/issues/48

f62fe97f

Move pyquote into main codebase · 18004fbd

Kirill Smelkov authored Sep 25, 2018

We added pyquote in b429839d (tests: Show pickles in a way that can be
copy-pasted into Python). However in the next patch this functionality
will be needed for the encoder to fix encoding of strings at protocol 0.
Thus we need this function to be not test-only.

Plain code movement here, no semantic change.

18004fbd

25 Sep, 2018 8 commits

decoder: Fix panic on dict with non-comparable keys · cdd8269c

Kirill Smelkov authored Sep 24, 2018

Even though we tried to catch whether dict keys are ok to be used via
reflect.TypeOf(key).Comparable() (see da5f0342 "decoder: Fix crashes
found by fuzzer (#32)"), that turned out to be not enough. For example
if key is a struct, e.g. of the following type

	type Ref struct {
		Pid interface{}
	}

it will be comparable. But the comparision, depending on dynamic .Pid
type, might panic. This is what was actually cauht by fuzz-testing
recently:

	https://github.com/kisielk/og-rek/issues/50 (second part of the report)

So instead of recursively walking a key type and checking each subfield
with reflect.TypeOf().Comparable(), switch for using panic/recover for
detecting the "unhashable key" situation.

This slows down decoding a bit (only cumulative figure for all-test-vectors decoding):

	name          old time/op    new time/op    delta
	DecodeLong-4     361ns ± 0%     362ns ± 0%    ~     (p=0.238 n=5+4)
	Decode-4        93.2µs ± 0%    95.6µs ± 0%  +2.54%  (p=0.008 n=5+5)
	Encode-4        16.5µs ± 0%    16.6µs ± 0%    ~     (p=0.841 n=5+5)

but that is the price of correctness. And with manually recursively walking key
type I doubt it would be faster.

The defer overhead should be less once https://github.com/golang/go/issues/14939 is fixed.

Updates: https://github.com/kisielk/og-rek/issues/30

cdd8269c

tests/BenchmarkDecode: Skip empty pickles from tests table · e3797bb1

Kirill Smelkov authored Sep 24, 2018

In 06e06939 (encoder: Allow to specify pickle protocol version) I added
ability to add to tests error cases - object inputs that on encoding
should produce error.

For decoding we should skip those cases as there pickle.data = "", and
if not skipped it leads to

	--- FAIL: BenchmarkDecode
	    ogorek_test.go:803: unexpected # of decode steps: got 100  ; want 102

e3797bb1

fuzz/corpus: Update · 6daa7b2c
Kirill Smelkov authored Sep 21, 2018
```
Updates coming via `go generate` since main tests table was amended.
```
6daa7b2c
fixup! fuzz: Add more details when reporting failures · c54a5739
Kirill Smelkov authored Sep 21, 2018
```
Appologize for the breakage there.
```
c54a5739

tests: Show pickles in a way that can be copy-pasted into Python · b429839d

Kirill Smelkov authored Sep 21, 2018

When encoding tests fails, the "want" and "have" pickles are printed. It
is handy to copy-paste those pickles into Python console and check them
further there.

Pickle printing currently uses %q. However in Go fmt's %q can use \u and
\U if byte sequence form a valid UTF-8 character. That poses a problem:
in Python str (py2) or bytes (py3) literal \uXXXX are not processed as
unicode-escapes and enter the string as is. This result in different
pickle data pasted into Python and further confusion.

Entering data into Python as unicode literals (where \u works) and then
adding .encode('utf-8') also does not generally work - as pickle data is
generally arbitrary it can be a not valid UTF-8, for example:

	"\x80\u043c\u0438\u0440"	(= "\x80мир"   = "\x80\xd0\xbc\xd0\xb8\xd1\x80")

end unicode-encoding them in python also gives different data:

	In [1]: u"\x80\u043c\u0438\u0440".encode('utf-8')
	Out[1]: '\xc2\x80\xd0\xbc\xd0\xb8\xd1\x80'

(note leading extra \xc2)

For this reason let's implement quoting - that Python can understand -
ourselves. This dumping functionality was very handy during recent
encoder fixes debugging.

b429839d

encoder: Fix class wrt protocol version · 656974e2

Kirill Smelkov authored Sep 21, 2018

- we can use STACK_GLOBAL only if protocol >= 4.
- for earlier protocols we have to use text-based GLOBAL.

656974e2

encoder: Fix struct encoding wrt protocol version · 95f42543

Kirill Smelkov authored Sep 21, 2018

Similarly to dict, for struct encoding switch from protocol 1 opcodes
into always using protocol 0 opcodes, which is by the way 1 byte
shorter.

For the reference - for structs, unlike maps, the order of emitted keys
is well-defined - it is the order of fields as they are defined in the
struct. This way we can precisely test encoder output on structs with
more than 1 field.

95f42543

encoder: Fix dict wrt protocol version · bb7b117b

Kirill Smelkov authored Sep 21, 2018

- we can use EMPTY_DICT only if protocol >= 1

Also: similarly to list (33d1926f), since we are now using EMPTY_DICT
only optionally, it is logical to swit to

	MARK + ... + DICT

from

	EMPTY_DICT (or MARK + DICT @proto=0) + MARK + ... + SETITEMS

which is at least 1 byte longer.

For the reference - SETITEMS is also from protocol 1, while DICT is from
protocol 0.

bb7b117b

21 Sep, 2018 4 commits

fuzz/corpus: Update · 89930c10

Kirill Smelkov authored Sep 21, 2018

More corpus files appeared while running fuzz testing today for ~ 1 hour.

89930c10

fuzz: Add more details when reporting failures · 65d6314e

Kirill Smelkov authored Sep 21, 2018

Should be better in 302c79ea (fuzz: Hook encoder into the loop), but it
is hopefully never too late.

65d6314e

decoder: Fix BININT decoding for negative values · bd5a7fd4

Kirill Smelkov authored Sep 21, 2018

Found via fuzzing:

	"I-7\n."

	panic: protocol 1: decode·encode != identity:
	have: 4294967289
	want: -7

	goroutine 1 [running]:
	github.com/kisielk/og-rek.Fuzz(0x7f99bd8b4000, 0x5, 0x200000, 0x3)
	        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/fuzz.go:50 +0x604
	go-fuzz-dep.Main(0x524df8)
	        /tmp/go-fuzz-build914098789/goroot/src/go-fuzz-dep/main.go:49 +0xad
	main.main()
	        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
	exit status 2

I've checked other handlers, like BININT1 and BININT2, and since there
everywhere argument is unsigned, there is no similar problem.

We needed previous patch on proper readLine EOF detection, because else
the testcase for P0("I-7\n.") would be breaking:

    --- FAIL: TestDecode/int(-7)/"I-7\n." (0.00s)
        ogorek_test.go:401: no ErrUnexpectedEOF on [:2] truncated stream: v = <nil>  err = &strconv.NumError{Func:"ParseInt", Num:"-", Err:(*errors.errorString)(0xc00000e1b0)}

bd5a7fd4

decoder: Don't treat \r\n as combined EOL · 57137139

Kirill Smelkov authored Sep 21, 2018

Currently we use bufio.Reader.ReadLine which accepts either \n or \r\n
as line ending. That is however not correct:

- we should not accept e.g. "S'abc'\r\n." pickle, because it is
  invalid:

	In [32]: pickle.loads(b"S'abc'\r\n.")
	---------------------------------------------------------------------------
	UnpicklingError                           Traceback (most recent call last)
	<ipython-input-32-b1da1988bae1> in <module>()
	----> 1 pickle.loads(b"S'abc'\r\n.")

	UnpicklingError: the STRING opcode argument must be quoted

- we should not accept e.g. "L123L\r\n.", because it is also invalid:

	In [33]: pickle.loads(b"L123L\r\n.")
	---------------------------------------------------------------------------
	ValueError                                Traceback (most recent call last)
	<ipython-input-33-7231ec07f5c4> in <module>()
	----> 1 pickle.loads(b"L123L\r\n.")

	ValueError: invalid literal for int() with base 10: '123L\r\n'

- treating \r as part of EOL in e.g. UNICODE pickle would just drop encoded
  information:

	# python
	In [34]: pickle.loads(b"Vabc\r\n.")
	Out[34]: 'abc\r'

  while ogórek currently decodes it as just 'abc' (no trailing \r).

For this reason let's fix Decoder.readLine to treat only \n as EOL.

Besides this fix, we now get another property: previously, when internally
using bufio.Reader.ReadLine we were not able to distinguish two situations:

- a line was abruptly ended without any EOL characters at all,
- a line was properly ended with EOL character.

Now after we switched to internally using bufio.Reader.ReadSlice, we will be
able to properly detect EOF and return that as error. This property will be
needed in the following patch.

57137139