1. 21 Sep, 2018 4 commits
    • decoder: Fix BININT decoding for negative values · bd5a7fd4
      Kirill Smelkov authored
      Found via fuzzing:
      
      	"I-7\n."
      
      	panic: protocol 1: decode·encode != identity:
      	have: 4294967289
      	want: -7
      
      	goroutine 1 [running]:
      	github.com/kisielk/og-rek.Fuzz(0x7f99bd8b4000, 0x5, 0x200000, 0x3)
      	        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/fuzz.go:50 +0x604
      	go-fuzz-dep.Main(0x524df8)
      	        /tmp/go-fuzz-build914098789/goroot/src/go-fuzz-dep/main.go:49 +0xad
      	main.main()
      	        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
      	exit status 2
      
      I've checked other handlers, like BININT1 and BININT2, and since the
      argument is unsigned everywhere there, there is no similar problem.
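
      The gist of the fix, as a minimal sketch (the helper name here is
      illustrative, not necessarily the one used in ogorek.go): the 4-byte
      BININT argument is little-endian and signed, so it must go through int32
      before widening, otherwise -7 turns into 4294967289:

      	// binInt4 decodes the 4-byte little-endian signed argument of BININT.
      	func binInt4(arg [4]byte) int64 {
      		u := uint32(arg[0]) | uint32(arg[1])<<8 | uint32(arg[2])<<16 | uint32(arg[3])<<24
      		return int64(int32(u)) // sign-extend via int32, not via uint32
      	}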
      
      We needed the previous patch on proper readLine EOF detection, because
      otherwise the test case for P0("I-7\n.") would break:
      
          --- FAIL: TestDecode/int(-7)/"I-7\n." (0.00s)
              ogorek_test.go:401: no ErrUnexpectedEOF on [:2] truncated stream: v = <nil>  err = &strconv.NumError{Func:"ParseInt", Num:"-", Err:(*errors.errorString)(0xc00000e1b0)}
    • decoder: Don't treat \r\n as combined EOL · 57137139
      Kirill Smelkov authored
      Currently we use bufio.Reader.ReadLine which accepts either \n or \r\n
      as line ending. That is however not correct:
      
      - we should not accept e.g. "S'abc'\r\n." pickle, because it is
        invalid:
      
      	In [32]: pickle.loads(b"S'abc'\r\n.")
      	---------------------------------------------------------------------------
      	UnpicklingError                           Traceback (most recent call last)
      	<ipython-input-32-b1da1988bae1> in <module>()
      	----> 1 pickle.loads(b"S'abc'\r\n.")
      
      	UnpicklingError: the STRING opcode argument must be quoted
      
      - we should not accept e.g. "L123L\r\n.", because it is also invalid:
      
      	In [33]: pickle.loads(b"L123L\r\n.")
      	---------------------------------------------------------------------------
      	ValueError                                Traceback (most recent call last)
      	<ipython-input-33-7231ec07f5c4> in <module>()
      	----> 1 pickle.loads(b"L123L\r\n.")
      
      	ValueError: invalid literal for int() with base 10: '123L\r\n'
      
      - treating \r as part of EOL in e.g. UNICODE pickle would just drop encoded
        information:
      
      	# python
      	In [34]: pickle.loads(b"Vabc\r\n.")
      	Out[34]: 'abc\r'
      
        while ogórek currently decodes it as just 'abc' (no trailing \r).
      
      For this reason let's fix Decoder.readLine to treat only \n as EOL.
      
      Besides this fix, we now get another property: previously, when internally
      using bufio.Reader.ReadLine we were not able to distinguish two situations:
      
      - a line was abruptly ended without any EOL characters at all,
      - a line was properly ended with EOL character.
      
      Now that we have switched to using bufio.Reader.ReadSlice internally, we
      can properly detect EOF and return it as an error. This property will be
      needed in the following patch.
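
      For reference, a minimal sketch of such a readLine, assuming the Decoder
      keeps a *bufio.Reader in d.r and an accumulation bytes.Buffer in d.line
      (field names are illustrative):

      	// readLine reads data up to and including '\n'; only '\n' is EOL,
      	// so a preceding '\r' stays part of the data. EOF before '\n' is
      	// reported as io.ErrUnexpectedEOF.
      	func (d *Decoder) readLine() ([]byte, error) {
      		d.line.Reset()
      		for {
      			data, err := d.r.ReadSlice('\n')
      			d.line.Write(data)
      			if err == nil {
      				line := d.line.Bytes()
      				return line[:len(line)-1], nil // strip only the '\n'
      			}
      			if err == bufio.ErrBufferFull {
      				continue // line longer than the bufio buffer - keep reading
      			}
      			if err == io.EOF {
      				err = io.ErrUnexpectedEOF
      			}
      			return d.line.Bytes(), err
      		}
      	}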
    • fuzz: Automatically export all tests pickles into fuzz/corpus · 9d1344ba
      Kirill Smelkov authored
      This way, whenever we add a tricky test pickle to the main tests table,
      it will automatically also be present as a starting point in the fuzz
      corpus. This should hopefully improve fuzzing coverage.
    • decoder: More mark exposing fixes · 7aeda71a
      Kirill Smelkov authored
      Continuing 5dbc8a1b (decoder: Don't allow mark to be returned as pickle
      result), I discovered that the mark object can still be exposed to the
      user, though not directly. For example the following pickle:
      
      	"(\x85." // MARK + TUPLE1
      
      was creating Tuple{mark} and returning it to the user without error.

      As the marker must be used only internally, it is invalid to do so.
      Python also forbids this:
      
              In [3]: s = b"(\x85."
      
              In [4]: dis(s)
                  0: (    MARK
                  1: \x85     TUPLE1
                  2: .        STOP
              highest protocol among opcodes = 2
      
              In [5]: pickle.loads(s)
              ---------------------------------------------------------------------------
              UnpicklingError                           Traceback (most recent call last)
              <ipython-input-5-764e4625bc41> in <module>()
              ----> 1 pickle.loads(s)
      
              UnpicklingError: unexpected MARK found
      
      So let's close (hopefully) all the holes where the mark object could be
      returned to the user in a similar way.
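
      A sketch of the kind of guard this adds at the places where stack values
      are combined into user-visible objects (mark is ogórek's unexported
      marker type; the helper name is made up for illustration):

      	// errNoMark rejects values that still contain the internal marker,
      	// e.g. the arguments popped for TUPLE1/TUPLE2/TUPLE3, SETITEM, etc.
      	func errNoMark(values ...interface{}) error {
      		for _, v := range values {
      			if _, ok := v.(mark); ok {
      				return errors.New("pickle: unexpected MARK")
      			}
      		}
      		return nil
      	}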
  2. 20 Sep, 2018 5 commits
    • fuzz/corpus: Update · b7c2a34e
      Kirill Smelkov authored
      New test vectors caught by decode·encode idempotency fuzzing.
    • decoder: Don't allow mark to be returned as pickle result · 5dbc8a1b
      Kirill Smelkov authored
      Both Python2 and Python3 consider a pickle invalid if it tries to return
      MARK as the result, e.g.:
      
              In [2]: pickle.loads(b"(.")
              ---------------------------------------------------------------------------
              UnpicklingError                           Traceback (most recent call last)
              <ipython-input-2-0c142c82b126> in <module>()
              ----> 1 pickle.loads(b"(.")
      
              UnpicklingError: unexpected MARK found
      
      However, until now, even though mark is an unexported ogórek type, we
      were allowing it to be returned without complaint.
      
      The problem was caught by decode/encode roundtrip fuzz tests, e.g.
      
      	"(Q."
      
      panic: protocol 1: decode·encode != identity:
      have: ogórek.Ref{Pid:map[interface {}]interface {}{}}
      want: ogórek.Ref{Pid:ogórek.mark{}}
      
      goroutine 1 [running]:
      github.com/kisielk/og-rek.Fuzz(0x7fbe6c15f000, 0x3, 0x200000, 0x3)
              /tmp/go-fuzz-build697921479/gopath/src/github.com/kisielk/og-rek/fuzz.go:87 +0x604
      go-fuzz-dep.Main(0x524d78)
              /tmp/go-fuzz-build697921479/goroot/src/go-fuzz-dep/main.go:49 +0xad
      main.main()
              /tmp/go-fuzz-build697921479/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
      exit status 2
    • fuzz: Hook encoder into the loop · 302c79ea
      Kirill Smelkov authored
      We can enhance our fuzz-testing coverage by also hooking the Encoder into
      the loop: if the input data is successfully decoded, we have an object
      that can be passed back to the Encoder to generate a pickle. We can't
      require that this pickle be the same as the original input data, since
      the pickle machine allows multiple representations of the same data.
      However, we can assert that decoding that pickle back yields the same
      object that was encoded (sketched below).
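
      Condensed, the loop is roughly the following (a sketch; details such as
      iterating over protocol versions are omitted):

      	func Fuzz(data []byte) int {
      		// 1. decode the input; uninteresting if it is not a valid pickle
      		obj, err := ogórek.NewDecoder(bytes.NewReader(data)).Decode()
      		if err != nil {
      			return 0
      		}

      		// 2. re-encode the decoded object; the bytes may differ from the
      		// input, since the same data has many pickle representations
      		var buf bytes.Buffer
      		if err := ogórek.NewEncoder(&buf).Encode(obj); err != nil {
      			panic(err)
      		}

      		// 3. decoding our own output must reproduce the same object
      		obj2, err := ogórek.NewDecoder(&buf).Decode()
      		if err != nil || !reflect.DeepEqual(obj, obj2) {
      			panic("decode·encode != identity")
      		}
      		return 1
      	}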
      
      This catches several problems:
      
      - marker is currently returned as pickle result (see next patch).
      - text-based STRING and UNICODE are not properly decoded (no fix yet).
      - self-referencing data structures kill Encoder (no fix yet).
    • fuzz/corups: Update · 230ffba9
      Kirill Smelkov authored
      Add more files that go-fuzz put into the corpus while discovering the
      crash fixed in the previous commit.
    • decoder: Fix DOS in BINSTRING decoding · dbc3bd9a
      Kirill Smelkov authored
      The BINSTRING opcode is followed by a 4-byte length of the data and then
      the data itself.
      
      Upon seeing the BINSTRING len header, we were preallocating the
      destination buffer with len as an optimization (see 14aaa14f "decoder:
      Preallocate .buf capacity when we know (approximate) size of output to
      it"). However, this way it is easy for a malicious pickle to specify

      	BINSTRING biglength (4GB)

      and no data, and then the goroutine that processes such input allocates
      4GB immediately, which in turn might cause an out-of-memory DOS.
      
      The other places where we currently grow Decoder.buf all use either a 1-
      or 2-byte length (thus limited to 64K), or the length of data that was
      already read from the input stream.
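
      One way to mitigate this - sketched here, not necessarily exactly what
      ogorek.go does - is to cap the upfront preallocation and let the buffer
      grow only as data is actually read (the cap value is illustrative):

      	// prealloc grows buf for an expected payload, but never trusts an
      	// attacker-controlled length header for more than a modest chunk.
      	func prealloc(buf *bytes.Buffer, expected int64) {
      		const maxPrealloc = 1 << 20 // 1 MiB
      		if expected > maxPrealloc {
      			expected = maxPrealloc
      		}
      		buf.Grow(int(expected))
      	}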
      
      The problem was found by rerunning fuzz tests:
      
      	"T0000"
      
      fatal error: runtime: out of memory
      
      runtime stack:
      runtime.throw(0x514fff, 0x16)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/panic.go:608 +0x72
      runtime.sysMap(0xc004000000, 0x34000000, 0x5f83d8)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mem_linux.go:156 +0xc7
      runtime.(*mheap).sysAlloc(0x5dfd40, 0x34000000, 0x43c0e3, 0x7ffc94fe56b8)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:619 +0x1c7
      runtime.(*mheap).grow(0x5dfd40, 0x18182, 0x0)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:920 +0x42
      runtime.(*mheap).allocSpanLocked(0x5dfd40, 0x18182, 0x5f83e8, 0x203000)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:848 +0x337
      runtime.(*mheap).alloc_m(0x5dfd40, 0x18182, 0x7ffc94fe0101, 0x40a87f)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:692 +0x119
      runtime.(*mheap).alloc.func1()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:759 +0x4c
      runtime.(*mheap).alloc(0x5dfd40, 0x18182, 0x7ffc94010101, 0x4135f5)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:758 +0x8a
      runtime.largeAlloc(0x30303030, 0x440101, 0xc00005e240)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:1019 +0x97
      runtime.mallocgc.func1()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:914 +0x46
      runtime.systemstack(0x44eea9)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:351 +0x66
      runtime.mstart()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/proc.go:1229
      
      goroutine 1 [running]:
      runtime.systemstack_switch()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:311 fp=0xc000034310 sp=0xc000034308 pc=0x44efa0
      runtime.mallocgc(0x30303030, 0x4edc00, 0x1, 0xc0000343e8)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:913 +0x896 fp=0xc0000343b0 sp=0xc000034310 pc=0x40ae26
      runtime.makeslice(0x4edc00, 0x30303030, 0x30303030, 0xc000018108, 0x4, 0x4)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/slice.go:70 +0x77 fp=0xc0000343e0 sp=0xc0000343b0 pc=0x43b1d7
      bytes.makeSlice(0x30303030, 0x0, 0x0, 0x0)
              /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:231 +0x9d fp=0xc000034420 sp=0xc0000343e0 pc=0x4b204d
      bytes.(*Buffer).grow(0xc000084030, 0x30303030, 0x0)
              /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:144 +0x2e4 fp=0xc000034470 sp=0xc000034420 pc=0x4b1604
      bytes.(*Buffer).Grow(0xc000084030, 0x30303030)
              /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:163 +0x86 fp=0xc000034498 sp=0xc000034470 pc=0x4b18b6
      github.com/kisielk/og-rek.(*Decoder).loadBinString(0xc000084000, 0x203054, 0x0)
              /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/ogorek.go:646 +0x143 fp=0xc000034520 sp=0xc000034498 pc=0x4d5103
      github.com/kisielk/og-rek.(*Decoder).Decode(0xc000084000, 0xc000080000, 0xc000084000, 0xf7eb9fa, 0x586dc)
              /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/ogorek.go:227 +0xf94 fp=0xc0000346d8 sp=0xc000034520 pc=0x4d09b4
      github.com/kisielk/og-rek.Fuzz(0x7fbccd400000, 0x5, 0x200000, 0xc000034748)
              /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/fuzz.go:12 +0xb0 fp=0xc000034710 sp=0xc0000346d8 pc=0x4cf640
      go-fuzz-dep.Main(0x519cf0)
              /tmp/go-fuzz-build515164548/goroot/src/go-fuzz-dep/main.go:49 +0xad fp=0xc000034780 sp=0xc000034710 pc=0x4642fd
      main.main()
              /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d fp=0xc000034798 sp=0xc000034780 pc=0x4db1ad
      runtime.main()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/proc.go:201 +0x207 fp=0xc0000347e0 sp=0xc000034798 pc=0x428ec7
      runtime.goexit()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0000347e8 sp=0xc0000347e0 pc=0x450f01
      exit status 2
  3. 19 Sep, 2018 16 commits
    • encoder: Fix string wrt protocol version · e7d96969
      Kirill Smelkov authored
      - we can use BINSTRING* only if protocol >= 1;
      - at protocol 0 we thus have to use text STRING;
      - if protocol >= 3 we have to emit the string as a unicode pickle object,
        the same way Python3 does. If we don't, Python3 generally won't be
        able to load our pickle:
      
      	In [1]: s = b'U\x06\xd0\xbc\xd0\xb8\xd1\x80q\x00.'
      
        	In [2]: from pickletools import dis
      
        	In [3]: dis(s)
        	    0: U    SHORT_BINSTRING 'миÑ\x80'
        	    8: q    BINPUT     0
        	   10: .    STOP
        	highest protocol among opcodes = 1
      
        	In [4]: import pickle
      
        	In [5]: pickle.loads(s)
        	---------------------------------------------------------------------------
        	UnicodeDecodeError                        Traceback (most recent call last)
        	<ipython-input-5-764e4625bc41> in <module>()
        	----> 1 pickle.loads(s)
      
        	UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
      
      We already decode unicode pickle objects into string; this way
      decode(encode(string)) always remains idempotent.
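
      A sketch of the resulting protocol switch (opcode and helper names here
      are assumptions, and Go's %q quoting only approximates Python's string
      escaping):

      	func (e *Encoder) encodeString(s string) error {
      		switch {
      		case e.config.Protocol >= 3:
      			// Python3 can only load this reliably as a unicode object
      			return e.encodeUnicode(s)
      		case e.config.Protocol >= 1:
      			// binary length header, no quoting needed
      			return e.encodeBinString(s)
      		default:
      			// protocol 0: text STRING with a quoted argument
      			return e.emitf("%c%q\n", opString, s)
      		}
      	}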
    • encoder: Fix Ref wrt protocol version · 98fb1987
      Kirill Smelkov authored
      - we can use BINPERSID only if protocol >= 1
      - we can use text PERSID only if protocol < 4 and the Ref string does not
        contain \n - otherwise encoding has to fail.
      
      Tests for Ref encoding at protocols 3 & 4 will follow after string
      encoding is fixed.
    • encoder: Fix list wrt protocol version · 33d1926f
      Kirill Smelkov authored
      - we can use EMPTY_LIST only if protocol >= 1
      
      Also: since we are now using EMPTY_LIST only optionally, it is logical
      to switch to
      
      	MARK + ... + LIST
      
      instead of
      
      	EMPTY_LIST (or MARK + LIST @proto=0) + MARK + ... + APPENDS
      
      which is at least 1 byte longer.
      
      For reference: APPENDS is also from protocol 1, while LIST is from
      protocol 0.
    • encoder: Fix tuple wrt protocol version · 24695efa
      Kirill Smelkov authored
      - we can use EMPTY_TUPLE only if protocol >= 1
      - also: if protocol >= 2 we can now use TUPLE{1,2,3} opcodes.
    • encoder: Add tests todo to use LONG1 and LONG4 · 08fb378e
      Kirill Smelkov authored
      Corresponding TODO in the code was added in 4fd6be93 (encoder: Adjust it
      so that decode(encode(v)) == v)
    • encoder: Fix float wrt protocol version · 606e9f5a
      Kirill Smelkov authored
      We can use BINFLOAT opcode only starting from protocol >= 1.
      At protocol 0 we must use ASCII FLOAT.
    • encoder: Fix int wrt protocol version · 9053359b
      Kirill Smelkov authored
      We can use BININT* opcodes only starting from protocol >= 1.
      At protocol 0 we must use ASCII INT.
    • encoder: Fix bool wrt protocol version · 6e6a8aa3
      Kirill Smelkov authored
      Starting from protocol 2, the 1-byte NEWTRUE/NEWFALSE opcodes are more
      efficient than the ASCII form used in e.g. the 5-byte pickle "I01\n.".

      It is not only about efficiency: protocol 4 _forbids_ the use of
      variable-length ASCII-only opcodes - whose data length is determined by a
      forward scan for '\n'.
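
      A sketch of the protocol-aware encodeBool (field and opcode names as
      assumed here may differ from the actual source):

      	func (e *Encoder) encodeBool(b bool) error {
      		if e.config.Protocol >= 2 {
      			// 1-byte NEWTRUE / NEWFALSE
      			if b {
      				return e.emit(opNewtrue)
      			}
      			return e.emit(opNewfalse)
      		}
      		// protocols 0 and 1: text INT with the 01/00 convention
      		if b {
      			return e.emitf("%c01\n", opInt)
      		}
      		return e.emitf("%c00\n", opInt)
      	}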
      
      With only the tests added and without the encodeBool changes, it would
      fail this way:
      
          --- FAIL: TestEncode/True/proto=2 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x02I01\n."
                  want: "\x80\x02\x88."
          --- FAIL: TestEncode/True/proto=3 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x03I01\n."
                  want: "\x80\x03\x88."
          --- FAIL: TestEncode/True/proto=4 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x04I01\n."
                  want: "\x80\x04\x88."
          --- FAIL: TestEncode/False/proto=2 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x02I00\n."
                  want: "\x80\x02\x89."
          --- FAIL: TestEncode/False/proto=3 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x03I00\n."
                  want: "\x80\x03\x89."
          --- FAIL: TestEncode/False/proto=4 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x04I00\n."
                  want: "\x80\x04\x89."
    • encoder: Allow to specify pickle protocol version · 06e06939
      Kirill Smelkov authored
      There are many pickle protocol versions - 0 to 4. Python2, for example,
      understands only versions 0 - 2. However, we currently unconditionally
      emit opcodes from higher versions, for example STACK_GLOBAL - from
      version 4 - when encoding a Class, which makes pickles generated by
      ogórek impossible to decode on Python2.
      
      Similarly protocol 0 states that only text opcodes should be used,
      however we currently unconditionally emit e.g. BININT (from protocol 1)
      when encoding integers.
      
      Changing to always use protocol 0 opcodes would not be good either, since
      many opcodes for efficiently encoding integers, booleans, unicode, etc.
      are available only in protocol versions 2 and 4.
      
      For this reason, similarly to Python[1], let's allow users to specify the
      desired pickle protocol when creating an Encoder with a config. For
      backward compatibility and common sense, the protocol version that plain
      NewEncoder selects is 2.
      
      This commit adds only the above-described user interface and the testing
      infrastructure for verifying the result of encoding an object at a
      particular protocol version.

      For now only a few of the pickle test vectors are right with respect to
      what the encoder should generate (or currently generates). Thus in the
      next patches we'll be fixing the encoder step by step on this topic.
      
      [1] https://docs.python.org/3/library/pickle.html#pickle.dump
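
      Usage then looks roughly like this (a sketch, assuming the new knob is
      the Protocol field of EncoderConfig):

      	// encodeForPython2 produces a pickle loadable by Python2 and other
      	// consumers by explicitly requesting protocol 0; plain NewEncoder
      	// would default to protocol 2.
      	func encodeForPython2(v interface{}) ([]byte, error) {
      		var buf bytes.Buffer
      		e := ogórek.NewEncoderWithConfig(&buf, &ogórek.EncoderConfig{Protocol: 0})
      		if err := e.Encode(v); err != nil {
      			return nil, err
      		}
      		return buf.Bytes(), nil
      	}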
    • tests: Allow to specify several pickles for one test case · 93075d82
      Kirill Smelkov authored
      For now, decoding all of those pickles is tested to give the same object,
      as in e.g.
      
      	"(I1\nI2\ntp0\n.", // MARK + TUPLE + INT
      
      and
      
      	"I1\nI2\n\x86."),  // TUPLE2 + INT
      
      But having a way to specify several pickles to a test case will become
      even more handy when later adding precise tests for Encoder - there we
      will need to assert that at such and such protocol encoding gives such
      and such pickles. And there are 5 protocol versions to test...
    • tests: Merge Encode and Decode tests data · e8189e5f
      Kirill Smelkov authored
      Previously there were two separate tables - one for decode and one for
      encode tests.

      The table for encode tests was very small. TestDecode, which was
      operating on data from the decode table, was also checking that
      decode(encode(object)) is idempotent - work which is already done by
      TestEncode. However, the set of input objects in the decode table was not
      a strict superset of the encode table objects.
      
      For the reasons above let's stop this divergence. Let's have a common
      table that defines tests, where every test case can have:
      
      	- an "in" object,
      	- a pickle, and
      	- an "out" object
      
      where
      
      	1. pickle must decode to "out" object, and
      	2. encoding "in" object must give some pickle that decodes to "out" object.
      
      	NOTE: In the usual case "in" object == "out" object and they can only
      	differ if "in" object contains a Go struct.
      
      This allows us to cover all the existing decode and encode test logic.

      Moreover, the coverage of each is now higher - for example, Encoder tests
      are now run on every object from the main table, not only for 3 cases as
      before.
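
      A sketch of what an entry in such a common table can look like (field
      names here are illustrative, not necessarily the ones used in
      ogorek_test.go):

      	type TestEntry struct {
      		name      string
      		objectIn  interface{} // object handed to the Encoder
      		pickles   []string    // pickles that must all decode to objectOut
      		objectOut interface{} // usually the same as objectIn
      	}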
    • tests: Split TestEncode into main driver + worker that tests 1 particular case · 05c5233f
      Kirill Smelkov authored
      Similarly to TestDecode. No integration with main tests table yet.
    • tests: Move encoding tests to ogorek_test.go · b291b470
      Kirill Smelkov authored
      We are going to integrate encoding tests with the main tests data table
      in several steps.
      
      Only code movement here, no other change.
    • tests: Split TestDecode into main driver + worker that tests 1 particular pickle · 89d4f00e
      Kirill Smelkov authored
      We are going to modify the main tests table and, for every case (e.g.
      int(1)), add pickles with various encodings that all represent the same
      object when decoded. The main TestDecode driver will iterate over all
      those input data and hand each over to the testDecode worker for the
      actual testing.
      
      Use t.Run for spawning each case - it is less typing (we do not need to
      manually print test.name on errors, etc.), and it allows selecting which
      subtests to run via e.g. `go test -run Decode/int`.
    • tests: Move Graphite and "long line" data definition out from main test table · 81e5b5e7
      Kirill Smelkov authored
      Having that long data makes the table clumsy and harder to understand.
      By moving such data definition out of the table, we make it a bit easier
      to understand.
      
      In the case of "long line", the pickle input and the line itself were
      almost duplicating each other, so instead of having two long lines
      explicitly pasted, let's have the test input be defined as
      
      +	{"too long line", "V" + longLine + "\n.", longLine},
      
      In the case of Graphite messages, one long Graphite object was also
      duplicated in encode_test.go .
      
      Just a cleanup, no semantic change.
    • Drop support for Go1.6 · 283146f0
      Kirill Smelkov authored
      It will be handy to use testing.T.Run in the upcoming patches that add
      more decode/encode tests. However, since testing.T.Run was added in
      Go1.7, the code won't work with Go1.6.
      
      Since Go1.6 was released in the beginning of 2016 - more than 2.5 years
      ago, and upstream Go policy is to support only current stable (currently
      Go1.11) and two previous releases (currently Go1.10 and Go1.9), as of
      today Go1.6 is both not supported and outdated. So let's drop explicit
      support for it in the name of progress.
      
      By above logic we could also drop support for 1.7 and 1.8, but since
      there is no pressing particular need to do so I'm not dropping them for
      now.
  4. 18 Sep, 2018 6 commits
    • encoder: Raise signal/noise ratio when writing data out · 423fe587
      Kirill Smelkov authored
      Currently we often use constructs like
      
      	_, err := e.w.Write([]byte{opNone})
      	return err
      
      however with an added Encoder.emit helper it can become just
      
      	return e.emit(opNone)
      
      which is more clearly readable.
      
      We can do similarly for formatted output:
      
      -	_, err := fmt.Fprintf(e.w, "%c%dL\n", opLong, b)
      -	return err
      +	return e.emitf("%c%dL\n", opLong, b)
      
      Due to the amount of boilerplate in one place (encodeInt), the return
      value of fmt.Fprintf was not even checked. We change the code there to
      
      -		_, err = e.w.Write([]byte{opInt})
      -		if err != nil {
      -			return err
      -		}
      -		fmt.Fprintf(e.w, "%d\n", i)
      +		return e.emitf("%c%d\n", opInt, i)
      
      which hopefully makes it more visible what is going on, and easier to
      catch potential mistakes by eye.
      
      A corresponding test for "int64 encoded as string" will be added later.
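
      For reference, the two helpers are essentially thin wrappers (a sketch,
      assuming the Encoder holds its io.Writer in e.w):

      	// emit writes raw opcode/data bytes to the output.
      	func (e *Encoder) emit(bv ...byte) error {
      		_, err := e.w.Write(bv)
      		return err
      	}

      	// emitf writes formatted output, e.g. opcodes with ASCII arguments.
      	func (e *Encoder) emitf(format string, argv ...interface{}) error {
      		_, err := fmt.Fprintf(e.w, format, argv...)
      		return err
      	}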
    • Add definition for Protocol 3 and missing from Protocol 4 opcodes · b21a5f61
      Kirill Smelkov authored
      Protocol 3 adds opcodes to represent Python bytes.
      For Protocol 4 we were missing definitions of several of the opcodes,
      such as BINUNICODE8, EMPTY_SET etc.
      
      Add definitions for them all.
    • Split "Protocol 1" opcodes out from "Protocol 0" opcodes · da584909
      Kirill Smelkov authored
      We already keep "Protocol 2" and "Protocol 4" opcodes separately, and for
      a good reason - it must be clear, just by looking at the opcode
      definitions, which protocol version an opcode comes from. However, at
      present the Protocol 0 and Protocol 1 opcodes are all intermixed with
      each other.
      
      In the upcoming patches we'll be teaching Encoder to take desired pickle
      protocol version into account (similarly to how pickle.dumps accepts desired
      protocol version in Python), and then the encoder will have to use
      different opcodes for different versions: for example if user asks to
      produce "protocol 0" pickle stream it will be invalid to use "protocol
      1" opcodes - e.g. BININT, EMPTY_TUPLE, etc.
      
      So I went through
      
      	https://github.com/python/cpython/blob/master/Lib/pickletools.py
      
      for each opcode searching its protocol version and separated the opcodes
      with proto=1 from protocol 0 opcodes.
    • Make all opcode constants to have type byte · f6ba06e8
      Kirill Smelkov authored
      It was probably a thinko in d00e99e7 (Add more opcodes so we can read
      Graphite's Carbon stream), which changed opcodes from string ("X") to
      character ('X') constants and marked the first opcode with the byte type,
      probably with the idea that the following opcodes would get the same
      type.

      Unfortunately that is not so, as the following program demonstrates:
      
      	package main
      
      	const (
      	        opAAA byte = 'a'
      	        opBBB      = 'b'
      	)
      
      	func main() {
      	        op := opBBB
      	        if true {
      	                op = opAAA	// <-- line 11
      	        }
      	        println(op)
      	}
      
      	--> ./bc.go:11:6: cannot use opAAA (type byte) as type rune in assignment
      
      Similarly if we try comparing opcodes, it also results in compile error:
      
      	func main() {
      	        op := opBBB
      	        if op == opAAA {	// <-- line 10
      	                panic(0)
      	        }
      	        println(op)
      	}
      
      	--> ./bc.go:10:8: invalid operation: op == opAAA (mismatched types rune and byte)
      
      Since in the following patches it will be handy for the encoder to e.g.
      set a default opcode first and then, under some conditions, change it to
      another opcode, exactly the same compile error(s) would pop up.

      So let's make all opcode constants have type byte to avoid such
      unexpected errors.
    • decoder: Use .push instead of open-coded .stack append where appropriate · 0d8e88fa
      Kirill Smelkov authored
      For example
      
      	d.push(math.Float64frombits(u))
      
      looks more clear compared to
      
      	d.stack = append(d.stack, math.Float64frombits(u))
      
      and we already use push in many other places.
    • decoder/loadFloat: No need to explicitly cast v to interface{} · f04563c4
      Kirill Smelkov authored
      v is concrete type here (float64) and Go automatically converts it to
      interface{} on call to Decoder.push().
  5. 17 Sep, 2018 1 commit
    • decoder: Teach loadUnicode to return ErrUnexpectedEOF, not paper it over in tests (#44) · 49c79ab9
      Kirill Smelkov authored
      While doing a5094338 (encoder: More unexpected EOF handling) I was a bit
      disappointed that https://golang.org/cl/37052 was not going to be
      accepted, and somehow added a hack to the tests that was doing the
      
      	strconv.ErrSyntax -> io.ErrUnexpectedEOF
      
      error conversion if the test name contained "unicode".
      
      Today I was refactoring tests and noticed that it is better to teach
      loadUnicode to return io.ErrUnexpectedEOF in the first place, and the
      next easiest way to do this (short of fixing strconv.UnquoteChar itself)
      is to append many "0"s to the string and see whether strconv.UnquoteChar
      still returns ErrSyntax.
      
      So do it in our custom unquoteChar wrapper.
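
      A sketch of that trick, shown here with an assumed wrapper signature
      mirroring strconv.UnquoteChar:

      	// unquoteChar is like strconv.UnquoteChar, but reports an escape
      	// sequence truncated at end of input as io.ErrUnexpectedEOF instead
      	// of strconv.ErrSyntax.
      	func unquoteChar(s string, quote byte) (value rune, multibyte bool, tail string, err error) {
      		value, multibyte, tail, err = strconv.UnquoteChar(s, quote)
      		if err == strconv.ErrSyntax {
      			// if padding with zeros makes UnquoteChar happy, the original
      			// failure came from truncation, not from bad syntax
      			if _, _, _, err2 := strconv.UnquoteChar(s+"00000000", quote); err2 == nil {
      				err = io.ErrUnexpectedEOF
      			}
      		}
      		return value, multibyte, tail, err
      	}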
  6. 11 Sep, 2018 1 commit
    • decoder: Cleanup decodeLong tests (#43) · 9eb8c4c6
      Kirill Smelkov authored
      Instead of copy-pasting the full test driver for each value, make it one
      test with many values in a table, each tested by a common driver.

      TestZeroLengthData was checking output.BitLen() == 0, which is another
      way to test for equality with big.Int(0) according to

      	https://golang.org/pkg/math/big/#Int.BitLen

      So join it to the rest of the decodeLong tests as well.

      While on the decodeLong topic, rename BenchmarkSpeed -> BenchmarkDecodeLong,
      as this benchmark checks the speed of decodeLong only, nothing else.
  7. 10 Sep, 2018 1 commit
  8. 05 Sep, 2018 1 commit
    • decoder: Fix thinko in PROTO opcode handling (#41) · 5c420f85
      Kirill Smelkov authored
      Due to := in
      
      	v, err := d.r.ReadByte()
      
      the newly declared err was shadowing the outer err, and thus, even though
      the code's intent was to treat all protocols != 2 as
      ErrInvalidPickleVersion, after leaving the switch err was always nil and
      no error was reported.

      Fix it by explicitly declaring v and not using :=, so that err is not
      shadowed.
      
      We also change the condition for supported protocol version to be in [0, 4] because:
      
      - we already have messages in our testsuite with PROTO opcode and
        version 1, and if we don't adjust the version condition the tests will fail.
      
      - the highest protocol version we support opcodes for is 4.
      - even though the documentation for PROTO opcode says version must be >= 2,
        in practice CPython decodes pickles with PROTO and lower version just ok:
      
      	In [18]: pickle.loads("\x80\x02I5\n.")
      	Out[18]: 5
      
      	In [19]: pickle.loads("\x80\x01I5\n.")
      	Out[19]: 5
      
      	In [20]: pickle.loads("\x80\x00I5\n.")
      	Out[20]: 5
      
        so we don't complain about those lower versions either.
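
      In sketch form, the fixed handler reads like this (in the real code the
      check is inline in Decode's opcode switch, where the ":=" shadowed the
      function-level err):

      	func (d *Decoder) loadProto() error {
      		v, err := d.r.ReadByte()
      		if err != nil {
      			return err
      		}
      		// accept protocols 0..4: we have opcodes up to protocol 4, and
      		// CPython itself loads PROTO with versions below 2 just fine
      		if v > 4 {
      			return ErrInvalidPickleVersion
      		}
      		return nil
      	}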
  9. 29 Aug, 2018 1 commit
  10. 17 Jul, 2018 2 commits
    • Allow to hook-in application-level logic to handle persistent references (#38) · d51cc146
      Kirill Smelkov authored
      Continuing the 84598fe1 (Add support for persistent references) theme,
      let's develop support for persistent references further: the usual
      situation (at least in the ZODB context, which is actually the original
      reason persistent references exist) is that decoding converts a reference
      in (type, oid) tuple form into the corresponding "ghost" object - an
      instance of type, but not yet loaded with the database object's data.
      This way the resulting object tree is created right away with the types
      the application expects (and ZODB/py further takes care to automatically
      load object data when those ghost objects are accessed).
      
      To mimic this on the Go side, in the current state one would need, after
      decoding, to make a full reflect pass over the decoded objects, find the
      Refs, and convert them to instances of other types, also taking care to
      consistently patch pointers in other objects that were pointing to such
      Refs. In other words it is some work, and it coincides almost 100% with
      what the Decoder already does by itself.
      
      Thus, to avoid duplicating that work, and drawing a parallel with the
      Python world, let's add the ability to configure Decoder and Encoder so
      that persistent references can be handled by user-specified application
      logic right in the process, and e.g. the Decoder can return the resulting
      object with references already converted to the corresponding Go types.
      
      To do so we introduce DecoderConfig and EncoderConfig - configurations to
      tune the decode/encode process. To maintain backward compatibility, the
      NewDecoder and NewEncoder signatures are not adjusted; instead
      NewDecoderWithConfig and NewEncoderWithConfig are introduced. We should
      not need e.g. NewDecoderWithConfigAndXXX in the future, since from now on
      we can add fields to the configuration structs without breaking backward
      compatibility.
      
      For decoding there is DecoderConfig.PersistentLoad which mimics
      Unpickler.persistent_load in Python:
      
      https://docs.python.org/3/library/pickle.html#pickle.Unpickler.persistent_load
      
      For encoding there is EncoderConfig.PersistentRef which mimics
      Pickler.persistent_id in Python:
      
      https://docs.python.org/3/library/pickle.html#pickle.Pickler.persistent_id
      
      ( for Persistent{Load,Ref}, following suggestion from @kisielk, the
        choice was made to explicitly pass in/out Ref, instead of raw
        interface{} pid, because that makes the API cleaner. )
      
      Then both Decoder and Encoder are adjusted correspondingly with tests
      added.
      
      About tests: I was contemplating patching support for decoder and encoder
      configurations, and error handling, into our main test TestDecode, but
      for now decided not to go that way and to put the test into a separate
      TestPersistentRefs.
      
      By the way, with configurations in place we can start to add other things
      there - e.g. the protocol version to use for the Encoder. Currently we
      have several places where we either don't use opcodes from higher
      protocol versions (fearing to break compatibility), or always use
      higher-version opcodes without checking whether the allowed protocol
      version permits it.
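
      Wiring the hooks in looks roughly like this (a sketch: the callback
      signatures are assumptions based on the description above, newGhost and
      Ghost are hypothetical application types, and r/w are an io.Reader and
      io.Writer in scope):

      	// decoding: turn every persistent reference into an application object
      	d := ogórek.NewDecoderWithConfig(r, &ogórek.DecoderConfig{
      		PersistentLoad: func(ref ogórek.Ref) (interface{}, error) {
      			return newGhost(ref.Pid), nil // e.g. resolve (type, oid)
      		},
      	})

      	// encoding: turn recognized application objects back into references
      	e := ogórek.NewEncoderWithConfig(w, &ogórek.EncoderConfig{
      		PersistentRef: func(obj interface{}) *ogórek.Ref {
      			if g, ok := obj.(*Ghost); ok {
      				return &ogórek.Ref{Pid: g.pid}
      			}
      			return nil // not persistent - encode inline as usual
      		},
      	})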
    • .travis.yml += go1.9, go1.10 (#39) · 56e0ff77
      Kirill Smelkov authored
      "1.10" is put into quotes because otherwise Travis parses that as 1.1 -
      see e.g. https://github.com/travis-ci/travis-ci/issues/9247.
  11. 01 Jul, 2018 1 commit
    • Add support for persistent references (#37) · 84598fe1
      Kirill Smelkov authored
      * Add docstrings to types exposed in public API

      I was refreshing my knowledge of ogórek and added them along the way.
      
      * Add support for persistent references
      
      Python has a mechanism for objects in one pickle to reference objects in
      another pickle. This mechanism is primarily used in the ZODB object
      database for one object to "persistently" reference another object(*).

      I needed a way to load and understand objects from a ZODB database in Go,
      hence this patch for ogórek that teaches it to understand such persistent
      references.
      
      As is already done with Call and Class, we add a dedicated Ref type to
      represent persistent references in the pickle stream.
      
      (*) In ZODB, when one object is changed, there is no need to re-save an
      unchanged object that references it, because its reference stays valid
      and keeps pointing to the changed object: the "persistent ID" in ZODB is
      only the object ID, and which revision of the referenced object it points
      to implicitly follows from the current transaction-number view of the
      database.
  12. 25 Apr, 2017 1 commit