1. 28 Sep, 2018 6 commits
    • Kirill Smelkov's avatar
      Fix/Add support for []byte (= bytearray on Python side) · 30ebda02
      Kirill Smelkov authored
      Starting from 2013 (from 555efd8f "first draft of dumb pickle encoder")
      wrt []byte ogórek state was:
      
      1. []byte was encoded as string
      2. there was no way to decode a pickle object into []byte
      
      then, as
      
      - []byte encoding was never explicitly tested,
      - nor I could find any usage of such encodings via searching through all Free /
        Open-Source software ogórek uses - I searched via "Uses" of NewEncoder on godoc:
      
        https://sourcegraph.com/github.com/kisielk/og-rek/-/blob/encode.go#L48:6=&tab=references:external
      
      it is likely that []byte encoding support was added just for the sake of
      it and convenience and then never used. It is also likely that the
      original author does not use ogórek encoder anymore:
      
      	https://github.com/kisielk/og-rek/pull/52#issuecomment-423639026
      
      For those reasons I tend to think that it should be relatively safe to
      change how []byte is handled:
      
      - the need to change []byte handling is that currently []byte is a kind of
        exception: we can only encode it and not decode something into it.
        Currently encode/decode roundtrip for []byte gives string, which breaks
        the property of encode/decode being identity for all other basic types.
      
      - on the similar topic, on encoding strings are assumed UTF-8 and are
        encoded as UNICODE opcodes for protocol >= 3. Passing arbitrary bytes
        there seems to be not good.
      
      - on to how change []byte - sadly it cannot be used as Python's bytes
        counterpart. In fact in the previous patch we actually just added
        ogórek.Bytes as a counterpart for Python bytes. We did not used []byte
        for that because - contrary to Python bytes - []byte cannot be used as a
        dict key.
      
      - the most natural counterpart for Go's []byte is thus Python's
        bytearray:
      
      	https://docs.python.org/3/library/stdtypes.html#bytearray-objects
      
        which is "a mutable counterpart to bytes objects"
      
      So add Python's bytearray decoding into []byte, and correspondingly
      change []byte encoding to be encoded as bytearray.
      
      P.S.
      
      This changes encoder semantic wrt []byte. If some ogórek use breaks
      somewhere because of it, we could add an encoder option to restore
      backward compatible behaviour. However since I suspect noone was
      actually encoding []byte into pickles, I prefer we wait for someone to
      speak-up first instead of loading EncoderConfig with confusion options
      that nobody will ever use.
      30ebda02
    • Kirill Smelkov's avatar
      decoder: Remember protocol version as last seen in a PROTO opcode · 2c359fc1
      Kirill Smelkov authored
      In the next patch decoding a pickle will depend on whether current
      protocol is <= 2 or >= 3. For this let's teach PROTO opcode handler to
      recall last seen protocol version.
      
      For testing - prepare tests infrastructure for cases where protocol
      version affects decoding semantic: if a test pickle in tests table comes
      with \x80\xff prefix, on decode tests this prefix will be changed to
      concrete
      
      	\x80 ver
      
      with all versions protperly iterated as specified in TestPickle.protov.
      
      We will also need this functionality in the next patch.
      2c359fc1
    • Kirill Smelkov's avatar
      Add support for Python bytes · 514123d4
      Kirill Smelkov authored
      In Python bytes is immutable and read-only array of bytes. It is
      also hashable and so is different from go []byte in that it can be
      used as a dict key. Thus the closes approximation for Python bytes in Go
      is some type derived from Go's string - it will be different from string
      and at the same time will inherit from string it immutability property
      and being able to be used as map key. So
      
      - add ogórek.Bytes type to represent Python bytes
      - add support to decode BINBYTES* pickle opcodes (these are protocol 3 opcodes)
      - add support to encode ogórek.Bytes via those BINBYTES* opcodes
      - for protocols <= 2, where there is no opcodes to directly represent
        bytes, adopt the same approach as Python - by pickling bytes as
      
      	_codecs.encode(byt.decode('latin1'), 'latin1')
      
        this way unpickling it on Python3 will give bytes, while unpickling it
        on Python2 will give str:
      
      	In [1]: sys.version
      	Out[1]: '3.6.6 (default, Jun 27 2018, 14:44:17) \n[GCC 8.1.0]'
      
      	In [2]: byt = b'\x01\x02\x03'
      
      	In [3]: _codecs.encode(byt.decode('latin1'), 'latin1')
      	Out[3]: b'\x01\x02\x03'
      
        ---
      
      	In [1]: sys.version
      	Out[1]: '2.7.15+ (default, Aug 31 2018, 11:56:52) \n[GCC 8.2.0]'
      
      	In [2]: byt = b'\x01\x02\x03'
      
      	In [3]: _codecs.encode(byt.decode('latin1'), 'latin1')
      	Out[3]: '\x01\x02\x03'
      
      - correspondingly teach decoder to recognize particular calls to
        _codecs.encode as being representation for bytes and decode it
        appropriately.
      
      - since we now have to emit byt.decode('latin1') as UNICODE - add, so
        far internal, `type unicode(string)` that instructs ogórek encoder to
        always emit the string with UNICODE opcodes (regular string is encoded
        to unicode pickle object only for protocol >= 3).
      
      - For []byte encoding preserve the current status - even though
        dispatching in Encoder.encode changes, the end result is the same -
        []byte was and stays currently encoded as just regular string.
      
        This was added in 555efd8f "first draft of dumb pickle encoder", and
        even though that might be not a good choice, changing it is a topic for
        another patch.
      514123d4
    • Kirill Smelkov's avatar
      encoder: Fix protocol 0 UNICODE emission · 57f875fd
      Kirill Smelkov authored
      Previously, we were quoting UNICODE opcode argument with strconv.QuoteToASCII().
      However that function, in addition to \u and \U escapes, can produce
      e.g. \n, \r, \xAA etc escapes. And all of the latter variants are not
      treated as special escapes of a unicode literal by Python, thus leading
      to data being wrongly received.
      
      Fix it by doing exactly the same that Python pickle encoder does - the
      UNICODE argument comes are "raw-unicode-escape" encoded.
      
      This patch contains only codec tests - not end-to-end pickle tests,
      because currently Encoder.encodeUnicode() is called only from under
      Encoder.encodeString(), and there only from under
      
      	if e.config.Protocol >= 3
      
      We will indirectly add tests for encodeUnicode @ protocol=0 in the next
      patches, while adding support for Python bytes.
      57f875fd
    • Kirill Smelkov's avatar
      pyquote: Add tests · 9936cf9d
      Kirill Smelkov authored
      I initially added pyquote only as debugging tool (in b429839d "tests:
      Show pickles in a way that can be copy-pasted into Python"), and later
      it started to be used in the Encoder (see 18004fbd "Move pyquote into
      main codebase"). However there is no explicit tests for pyquote.
      
      Add some pyquote tests.
      9936cf9d
    • Kirill Smelkov's avatar
      tests/testEncode: Fix potential panic · a15630df
      Kirill Smelkov authored
      Both object and objectDecodedBack are interace{}. So it could turn out,
      if e.g. they are both of the same non-comparable type (e.g. []byte),
      comparing them directly will panic.
      a15630df
  2. 26 Sep, 2018 7 commits
    • Kirill Smelkov's avatar
      fuzz/corpus: Update · 3ab92142
      Kirill Smelkov authored
      I was (re-)running fuzz tests again, and in addition to what was already
      known (e.g. self-referencing structures) found two new issues:
      
      - long as dict keys (https://github.com/kisielk/og-rek/issues/55), and
      - encoding global with '\n' in module name (fixed in previous patch).
      
      Update the corpus with points go-fuzz found, so that next runs could
      restart with having all the previous interesting input vectors available.
      3ab92142
    • Kirill Smelkov's avatar
      encoder: Fix GLOBAL emission wrt module/name with \n · 6e5e403e
      Kirill Smelkov authored
      Caught via fuzzing:
      
      	"\x8c\x030\n02\x93."
      
              0: \x8c SHORT_BINUNICODE '0\n0'
              5: 2    DUP
              6: \x93 STACK_GLOBAL
              7: .    STOP
      
      	panic: protocol 0: decode back error: err
      	pickle: "c0\n0\n0\n0\n."
      
      	goroutine 1 [running]:
      	github.com/kisielk/og-rek.Fuzz(0x7f2f1009a000, 0x8, 0x200000, 0x3)
      	        /tmp/go-fuzz-build645492341/gopath/src/github.com/kisielk/og-rek/fuzz.go:47 +0x8b8
      	go-fuzz-dep.Main(0x525e10)
      	        /tmp/go-fuzz-build645492341/goroot/src/go-fuzz-dep/main.go:49 +0xad
      	main.main()
      	        /tmp/go-fuzz-build645492341/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
      	exit status 2
      
      i.e. '0\n0' module name was emitted as-is as part ot text-based GLOBAL which
      completely broke pickle stream.
      
      For the reference Python decodes such globals with \n in name just ok:
      
      	In [10]: s = b"S'decimal\\nq'\nS'Decimal'\n\x93."
      
      	In [11]: pickle.loads(s)
      	---------------------------------------------------------------------------
      	ModuleNotFoundError                       Traceback (most recent call last)
      	<ipython-input-11-764e4625bc41> in <module>()
      	----> 1 pickle.loads(s)
      
      	ModuleNotFoundError: No module named 'decimal\nq'
      
      	In [12]: import sys
      
      	In [15]: d = sys.modules['decimal']
      
      	In [16]: sys.modules['decimal\nq'] = d
      
      	In [17]: pickle.loads(s)
      	Out[17]: decimal.Decimal
      6e5e403e
    • Kirill Smelkov's avatar
      fuzz: Fix error printing thinko · f2f59c50
      Kirill Smelkov authored
      I put "err" into format string instead of argument. Apologize for that...
      f2f59c50
    • Kirill Smelkov's avatar
      fuzz/corpus: Update · 192173dc
      Kirill Smelkov authored
      While fixing https://github.com/kisielk/og-rek/issues/48 there were
      corresponding test-vector additions. Regenerate the fuzz test corpus.
      192173dc
    • Kirill Smelkov's avatar
      Fix UNICODE decoding · 9daf6a2a
      Kirill Smelkov authored
      UNICODE is text-based opcode, which is used at protocol 0 and follows
      'raw-unicode-escape' encoded argument till EOL.
      
      - for decoding we must explicitly implement Python's
        'raw-unicode-escape' codec decoding, which is used by Python's pickle
        for UNICODE argument.
      
      Updates and hopefully fixes: https://github.com/kisielk/og-rek/issues/48
      9daf6a2a
    • Kirill Smelkov's avatar
      Fix STRING encoding/decoding · f62fe97f
      Kirill Smelkov authored
      STRING is text-based opcode which is used at protocol 0 and follows
      \-escaped argument till EOL.
      
      - for encoding we must not use Go's %q, since that will use \u and \U
        when seeing corresponding bytes, and since Python does not interpret
        \u or \U in string literals, the data received at Python side will be
        different.
      
      - for decoding we must explicitly implement Python's 'string-escape'
        codec decoding which is used by Python's pickle for STRING opcode
        argument.
      
      Updates: https://github.com/kisielk/og-rek/issues/48
      f62fe97f
    • Kirill Smelkov's avatar
      Move pyquote into main codebase · 18004fbd
      Kirill Smelkov authored
      We added pyquote in b429839d (tests: Show pickles in a way that can be
      copy-pasted into Python). However in the next patch this functionality
      will be needed for the encoder to fix encoding of strings at protocol 0.
      Thus we need this function to be not test-only.
      
      Plain code movement here, no semantic change.
      18004fbd
  3. 25 Sep, 2018 8 commits
    • Kirill Smelkov's avatar
      decoder: Fix panic on dict with non-comparable keys · cdd8269c
      Kirill Smelkov authored
      Even though we tried to catch whether dict keys are ok to be used via
      reflect.TypeOf(key).Comparable() (see da5f0342 "decoder: Fix crashes
      found by fuzzer (#32)"), that turned out to be not enough. For example
      if key is a struct, e.g. of the following type
      
      	type Ref struct {
      		Pid interface{}
      	}
      
      it will be comparable. But the comparision, depending on dynamic .Pid
      type, might panic. This is what was actually cauht by fuzz-testing
      recently:
      
      	https://github.com/kisielk/og-rek/issues/50 (second part of the report)
      
      So instead of recursively walking a key type and checking each subfield
      with reflect.TypeOf().Comparable(), switch for using panic/recover for
      detecting the "unhashable key" situation.
      
      This slows down decoding a bit (only cumulative figure for all-test-vectors decoding):
      
      	name          old time/op    new time/op    delta
      	DecodeLong-4     361ns ± 0%     362ns ± 0%    ~     (p=0.238 n=5+4)
      	Decode-4        93.2µs ± 0%    95.6µs ± 0%  +2.54%  (p=0.008 n=5+5)
      	Encode-4        16.5µs ± 0%    16.6µs ± 0%    ~     (p=0.841 n=5+5)
      
      but that is the price of correctness. And with manually recursively walking key
      type I doubt it would be faster.
      
      The defer overhead should be less once https://github.com/golang/go/issues/14939 is fixed.
      
      Updates: https://github.com/kisielk/og-rek/issues/30
      cdd8269c
    • Kirill Smelkov's avatar
      tests/BenchmarkDecode: Skip empty pickles from tests table · e3797bb1
      Kirill Smelkov authored
      In 06e06939 (encoder: Allow to specify pickle protocol version) I added
      ability to add to tests error cases - object inputs that on encoding
      should produce error.
      
      For decoding we should skip those cases as there pickle.data = "", and
      if not skipped it leads to
      
      	--- FAIL: BenchmarkDecode
      	    ogorek_test.go:803: unexpected # of decode steps: got 100  ; want 102
      e3797bb1
    • Kirill Smelkov's avatar
      fuzz/corpus: Update · 6daa7b2c
      Kirill Smelkov authored
      Updates coming via `go generate` since main tests table was amended.
      6daa7b2c
    • Kirill Smelkov's avatar
      fixup! fuzz: Add more details when reporting failures · c54a5739
      Kirill Smelkov authored
      Appologize for the breakage there.
      c54a5739
    • Kirill Smelkov's avatar
      tests: Show pickles in a way that can be copy-pasted into Python · b429839d
      Kirill Smelkov authored
      When encoding tests fails, the "want" and "have" pickles are printed. It
      is handy to copy-paste those pickles into Python console and check them
      further there.
      
      Pickle printing currently uses %q. However in Go fmt's %q can use \u and
      \U if byte sequence form a valid UTF-8 character. That poses a problem:
      in Python str (py2) or bytes (py3) literal \uXXXX are not processed as
      unicode-escapes and enter the string as is. This result in different
      pickle data pasted into Python and further confusion.
      
      Entering data into Python as unicode literals (where \u works) and then
      adding .encode('utf-8') also does not generally work - as pickle data is
      generally arbitrary it can be a not valid UTF-8, for example:
      
      	"\x80\u043c\u0438\u0440"	(= "\x80мир"   = "\x80\xd0\xbc\xd0\xb8\xd1\x80")
      
      end unicode-encoding them in python also gives different data:
      
      	In [1]: u"\x80\u043c\u0438\u0440".encode('utf-8')
      	Out[1]: '\xc2\x80\xd0\xbc\xd0\xb8\xd1\x80'
      
      (note leading extra \xc2)
      
      For this reason let's implement quoting - that Python can understand -
      ourselves. This dumping functionality was very handy during recent
      encoder fixes debugging.
      b429839d
    • Kirill Smelkov's avatar
      encoder: Fix class wrt protocol version · 656974e2
      Kirill Smelkov authored
      - we can use STACK_GLOBAL only if protocol >= 4.
      - for earlier protocols we have to use text-based GLOBAL.
      656974e2
    • Kirill Smelkov's avatar
      encoder: Fix struct encoding wrt protocol version · 95f42543
      Kirill Smelkov authored
      Similarly to dict, for struct encoding switch from protocol 1 opcodes
      into always using protocol 0 opcodes, which is by the way 1 byte
      shorter.
      
      For the reference - for structs, unlike maps, the order of emitted keys
      is well-defined - it is the order of fields as they are defined in the
      struct. This way we can precisely test encoder output on structs with
      more than 1 field.
      95f42543
    • Kirill Smelkov's avatar
      encoder: Fix dict wrt protocol version · bb7b117b
      Kirill Smelkov authored
      - we can use EMPTY_DICT only if protocol >= 1
      
      Also: similarly to list (33d1926f), since we are now using EMPTY_DICT
      only optionally, it is logical to swit to
      
      	MARK + ... + DICT
      
      from
      
      	EMPTY_DICT (or MARK + DICT @proto=0) + MARK + ... + SETITEMS
      
      which is at least 1 byte longer.
      
      For the reference - SETITEMS is also from protocol 1, while DICT is from
      protocol 0.
      bb7b117b
  4. 21 Sep, 2018 6 commits
    • Kirill Smelkov's avatar
      fuzz/corpus: Update · 89930c10
      Kirill Smelkov authored
      More corpus files appeared while running fuzz testing today for ~ 1 hour.
      89930c10
    • Kirill Smelkov's avatar
      fuzz: Add more details when reporting failures · 65d6314e
      Kirill Smelkov authored
      Should be better in 302c79ea (fuzz: Hook encoder into the loop), but it
      is hopefully never too late.
      65d6314e
    • Kirill Smelkov's avatar
      decoder: Fix BININT decoding for negative values · bd5a7fd4
      Kirill Smelkov authored
      Found via fuzzing:
      
      	"I-7\n."
      
      	panic: protocol 1: decode·encode != identity:
      	have: 4294967289
      	want: -7
      
      	goroutine 1 [running]:
      	github.com/kisielk/og-rek.Fuzz(0x7f99bd8b4000, 0x5, 0x200000, 0x3)
      	        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/fuzz.go:50 +0x604
      	go-fuzz-dep.Main(0x524df8)
      	        /tmp/go-fuzz-build914098789/goroot/src/go-fuzz-dep/main.go:49 +0xad
      	main.main()
      	        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
      	exit status 2
      
      I've checked other handlers, like BININT1 and BININT2, and since there
      everywhere argument is unsigned, there is no similar problem.
      
      We needed previous patch on proper readLine EOF detection, because else
      the testcase for P0("I-7\n.") would be breaking:
      
          --- FAIL: TestDecode/int(-7)/"I-7\n." (0.00s)
              ogorek_test.go:401: no ErrUnexpectedEOF on [:2] truncated stream: v = <nil>  err = &strconv.NumError{Func:"ParseInt", Num:"-", Err:(*errors.errorString)(0xc00000e1b0)}
      bd5a7fd4
    • Kirill Smelkov's avatar
      decoder: Don't treat \r\n as combined EOL · 57137139
      Kirill Smelkov authored
      Currently we use bufio.Reader.ReadLine which accepts either \n or \r\n
      as line ending. That is however not correct:
      
      - we should not accept e.g. "S'abc'\r\n." pickle, because it is
        invalid:
      
      	In [32]: pickle.loads(b"S'abc'\r\n.")
      	---------------------------------------------------------------------------
      	UnpicklingError                           Traceback (most recent call last)
      	<ipython-input-32-b1da1988bae1> in <module>()
      	----> 1 pickle.loads(b"S'abc'\r\n.")
      
      	UnpicklingError: the STRING opcode argument must be quoted
      
      - we should not accept e.g. "L123L\r\n.", because it is also invalid:
      
      	In [33]: pickle.loads(b"L123L\r\n.")
      	---------------------------------------------------------------------------
      	ValueError                                Traceback (most recent call last)
      	<ipython-input-33-7231ec07f5c4> in <module>()
      	----> 1 pickle.loads(b"L123L\r\n.")
      
      	ValueError: invalid literal for int() with base 10: '123L\r\n'
      
      - treating \r as part of EOL in e.g. UNICODE pickle would just drop encoded
        information:
      
      	# python
      	In [34]: pickle.loads(b"Vabc\r\n.")
      	Out[34]: 'abc\r'
      
        while ogórek currently decodes it as just 'abc' (no trailing \r).
      
      For this reason let's fix Decoder.readLine to treat only \n as EOL.
      
      Besides this fix, we now get another property: previously, when internally
      using bufio.Reader.ReadLine we were not able to distinguish two situations:
      
      - a line was abruptly ended without any EOL characters at all,
      - a line was properly ended with EOL character.
      
      Now after we switched to internally using bufio.Reader.ReadSlice, we will be
      able to properly detect EOF and return that as error. This property will be
      needed in the following patch.
      57137139
    • Kirill Smelkov's avatar
      fuzz: Automatically export all tests pickles into fuzz/corpus · 9d1344ba
      Kirill Smelkov authored
      This way whatever/whenever we add a tricky test pickle into main tests
      table, it should be automatically also be present as a starting point in
      the fuzz corpus. This should hopefully improve fuzzing coverage.
      9d1344ba
    • Kirill Smelkov's avatar
      decoder: More mark exposing fixes · 7aeda71a
      Kirill Smelkov authored
      Continuing 5dbc8a1b (decoder: Don't allow mark to be returned as pickle
      result) I discovered that the mark object can be still exposed to user,
      but not directly. For example the following pickle:
      
      	"(\x85." // MARK + TUPLE1
      
      was creating Tuple{mark} and returning it just ok to the user.
      
      As marker must be used only internally it is invalid to do so. Python
      also forbids this:
      
              In [3]: s = b"(\x85."
      
              In [4]: dis(s)
                  0: (    MARK
                  1: \x85     TUPLE1
                  2: .        STOP
              highest protocol among opcodes = 2
      
              In [5]: pickle.loads(s)
              ---------------------------------------------------------------------------
              UnpicklingError                           Traceback (most recent call last)
              <ipython-input-5-764e4625bc41> in <module>()
              ----> 1 pickle.loads(s)
      
              UnpicklingError: unexpected MARK found
      
      So let's close all (hopefully) holes where mark object could be returned to
      user in a similar way.
      7aeda71a
  5. 20 Sep, 2018 5 commits
    • Kirill Smelkov's avatar
      fuzz/corpus: Update · b7c2a34e
      Kirill Smelkov authored
      New test vectors caught by decode·encode idempotency fuzzing.
      b7c2a34e
    • Kirill Smelkov's avatar
      decoder: Don't allow mark to be returned as pickle result · 5dbc8a1b
      Kirill Smelkov authored
      A pickle is considered as invalid if it tries to return MARK as the
      result by both Python2 and Python3, e.g.:
      
              In [2]: pickle.loads(b"(.")
              ---------------------------------------------------------------------------
              UnpicklingError                           Traceback (most recent call last)
              <ipython-input-2-0c142c82b126> in <module>()
              ----> 1 pickle.loads(b"(.")
      
              UnpicklingError: unexpected MARK found
      
      However until now, despite mark is unexported ogórek type, we were
      allowing for it to be returned just ok.
      
      The problem was caught by decode/encode roundtrip fuzz tests, e.g.
      
      	"(Q."
      
      panic: protocol 1: decode·encode != identity:
      have: ogórek.Ref{Pid:map[interface {}]interface {}{}}
      want: ogórek.Ref{Pid:ogórek.mark{}}
      
      goroutine 1 [running]:
      github.com/kisielk/og-rek.Fuzz(0x7fbe6c15f000, 0x3, 0x200000, 0x3)
              /tmp/go-fuzz-build697921479/gopath/src/github.com/kisielk/og-rek/fuzz.go:87 +0x604
      go-fuzz-dep.Main(0x524d78)
              /tmp/go-fuzz-build697921479/goroot/src/go-fuzz-dep/main.go:49 +0xad
      main.main()
              /tmp/go-fuzz-build697921479/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
      exit status 2
      5dbc8a1b
    • Kirill Smelkov's avatar
      fuzz: Hook encoder into the loop · 302c79ea
      Kirill Smelkov authored
      We can enhance our fuzz-testing coverage by hooking Encoder also into
      the loop: if input data is suceessfully decoded, we have an object that
      can be passed back to Encoder to generate a pickle. We can't tell that
      that pickle must be the same as original input data, since pickle
      machine allows multiple representations of the same data. However we can
      assert that when that pickle is decoded back it should be the same as
      encoded object.
      
      This catches several problems:
      
      - marker is currently returned as pickle result (see next patch).
      - text-based STRING and UNICODE are not properly decoded (no fix yet).
      - self-referencing data structures kill Encoder (no fix yet).
      302c79ea
    • Kirill Smelkov's avatar
      fuzz/corups: Update · 230ffba9
      Kirill Smelkov authored
      Add more files go-fuzz put to corpus while discovering the crash fixed
      in previous commit.
      230ffba9
    • Kirill Smelkov's avatar
      decoder: Fix DOS in BINSTRING decoding · dbc3bd9a
      Kirill Smelkov authored
      BINSTRING opcode follows 4-byte length of the data and the data itself.
      
      Upon seeing the BINSTRING len header we were preallocating destination
      buffer with len as optimization (see 14aaa14f "decoder: Preallocate .buf
      capacity when we know (approximate) size of output to it"). However this
      way it is easy for a malicious pickle to specify
      
      	BINSTRING biglength (4GB)
      
      and no data, and then the goroutine that processes such input will
      allocate 4GB immediately, which in turn might cause out-of-memory DOS.
      
      The other places where we currently grow Decoder.buf all either have 1
      or 2 byte length (thus limited to 64K), or the length of the data that
      was already read from input stream.
      
      The problem was found by rerunning fuzz tests:
      
      	"T0000"
      
      fatal error: runtime: out of memory
      
      runtime stack:
      runtime.throw(0x514fff, 0x16)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/panic.go:608 +0x72
      runtime.sysMap(0xc004000000, 0x34000000, 0x5f83d8)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mem_linux.go:156 +0xc7
      runtime.(*mheap).sysAlloc(0x5dfd40, 0x34000000, 0x43c0e3, 0x7ffc94fe56b8)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:619 +0x1c7
      runtime.(*mheap).grow(0x5dfd40, 0x18182, 0x0)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:920 +0x42
      runtime.(*mheap).allocSpanLocked(0x5dfd40, 0x18182, 0x5f83e8, 0x203000)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:848 +0x337
      runtime.(*mheap).alloc_m(0x5dfd40, 0x18182, 0x7ffc94fe0101, 0x40a87f)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:692 +0x119
      runtime.(*mheap).alloc.func1()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:759 +0x4c
      runtime.(*mheap).alloc(0x5dfd40, 0x18182, 0x7ffc94010101, 0x4135f5)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/mheap.go:758 +0x8a
      runtime.largeAlloc(0x30303030, 0x440101, 0xc00005e240)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:1019 +0x97
      runtime.mallocgc.func1()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:914 +0x46
      runtime.systemstack(0x44eea9)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:351 +0x66
      runtime.mstart()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/proc.go:1229
      
      goroutine 1 [running]:
      runtime.systemstack_switch()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:311 fp=0xc000034310 sp=0xc000034308 pc=0x44efa0
      runtime.mallocgc(0x30303030, 0x4edc00, 0x1, 0xc0000343e8)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/malloc.go:913 +0x896 fp=0xc0000343b0 sp=0xc000034310 pc=0x40ae26
      runtime.makeslice(0x4edc00, 0x30303030, 0x30303030, 0xc000018108, 0x4, 0x4)
              /tmp/go-fuzz-build515164548/goroot/src/runtime/slice.go:70 +0x77 fp=0xc0000343e0 sp=0xc0000343b0 pc=0x43b1d7
      bytes.makeSlice(0x30303030, 0x0, 0x0, 0x0)
              /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:231 +0x9d fp=0xc000034420 sp=0xc0000343e0 pc=0x4b204d
      bytes.(*Buffer).grow(0xc000084030, 0x30303030, 0x0)
              /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:144 +0x2e4 fp=0xc000034470 sp=0xc000034420 pc=0x4b1604
      bytes.(*Buffer).Grow(0xc000084030, 0x30303030)
              /tmp/go-fuzz-build515164548/goroot/src/bytes/buffer.go:163 +0x86 fp=0xc000034498 sp=0xc000034470 pc=0x4b18b6
      github.com/kisielk/og-rek.(*Decoder).loadBinString(0xc000084000, 0x203054, 0x0)
              /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/ogorek.go:646 +0x143 fp=0xc000034520 sp=0xc000034498 pc=0x4d5103
      github.com/kisielk/og-rek.(*Decoder).Decode(0xc000084000, 0xc000080000, 0xc000084000, 0xf7eb9fa, 0x586dc)
              /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/ogorek.go:227 +0xf94 fp=0xc0000346d8 sp=0xc000034520 pc=0x4d09b4
      github.com/kisielk/og-rek.Fuzz(0x7fbccd400000, 0x5, 0x200000, 0xc000034748)
              /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/fuzz.go:12 +0xb0 fp=0xc000034710 sp=0xc0000346d8 pc=0x4cf640
      go-fuzz-dep.Main(0x519cf0)
              /tmp/go-fuzz-build515164548/goroot/src/go-fuzz-dep/main.go:49 +0xad fp=0xc000034780 sp=0xc000034710 pc=0x4642fd
      main.main()
              /tmp/go-fuzz-build515164548/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d fp=0xc000034798 sp=0xc000034780 pc=0x4db1ad
      runtime.main()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/proc.go:201 +0x207 fp=0xc0000347e0 sp=0xc000034798 pc=0x428ec7
      runtime.goexit()
              /tmp/go-fuzz-build515164548/goroot/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0000347e8 sp=0xc0000347e0 pc=0x450f01
      exit status 2
      dbc3bd9a
  6. 19 Sep, 2018 8 commits
    • Kirill Smelkov's avatar
      encoder: Fix string wrt protocol version · e7d96969
      Kirill Smelkov authored
      - we can use BINSTRING* only if protocol >= 1;
      - at protocol 0 we thus have to use text STRING;
      - if protocol >= 3 we have to emit the string as unicode pickle object
        the same way as Python3 does. If we don't do - Python3 won't be
        generally able to load our pickle:
      
      	In [1]: s = b'U\x06\xd0\xbc\xd0\xb8\xd1\x80q\x00.'
      
        	In [2]: from pickletools import dis
      
        	In [3]: dis(s)
        	    0: U    SHORT_BINSTRING 'миÑ\x80'
        	    8: q    BINPUT     0
        	   10: .    STOP
        	highest protocol among opcodes = 1
      
        	In [4]: import pickle
      
        	In [5]: pickle.loads(s)
        	---------------------------------------------------------------------------
        	UnicodeDecodeError                        Traceback (most recent call last)
        	<ipython-input-5-764e4625bc41> in <module>()
        	----> 1 pickle.loads(s)
      
        	UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
      
      We already decode unicode pickle objects into string, this way
      decode(encode(string)) remains always idempotent.
      e7d96969
    • Kirill Smelkov's avatar
      encoder: Fix Ref wrt protocol version · 98fb1987
      Kirill Smelkov authored
      - we can use BINPERSID only if protocol >= 1
      - we can use text PERSID only if protocol < 4 and Ref string does not
        contain \n - else encoding have to fail.
      
      Tests for Ref encoding at protocols 3 & 4 will follow after string
      encoding is fixed.
      98fb1987
    • Kirill Smelkov's avatar
      encoder: Fix list wrt protocol version · 33d1926f
      Kirill Smelkov authored
      - we can use EMPTY_LIST only if protocol >= 1
      
      Also: since we are now using EMPTY_LIST only optionally, it is logical
      to switch to
      
      	MARK + ... + LIST
      
      instead of
      
      	EMPTY_LIST (or MARK + LIST @proto=0) + MARK + ... + APPENDS
      
      which is at least 1 byte longer.
      
      For the reference - APPENDS is also from protocol 1, while LIST is from
      protocol 0.
      33d1926f
    • Kirill Smelkov's avatar
      encoder: Fix tuple wrt protocol version · 24695efa
      Kirill Smelkov authored
      - we can use EMPTY_TUPLE only if protocol >= 1
      - also: if protocol >= 2 we can now use TUPLE{1,2,3} opcodes.
      24695efa
    • Kirill Smelkov's avatar
      encoder: Add tests todo to use LONG1 and LONG4 · 08fb378e
      Kirill Smelkov authored
      Corresponding TODO in the code was added in 4fd6be93 (encoder: Adjust it
      so that decode(encode(v)) == v)
      08fb378e
    • Kirill Smelkov's avatar
      encoder: Fix float wrt protocol version · 606e9f5a
      Kirill Smelkov authored
      We can use BINFLOAT opcode only starting from protocol >= 1.
      At protocol 0 we must use ASCII FLOAT.
      606e9f5a
    • Kirill Smelkov's avatar
      encoder: Fix int wrt protocol version · 9053359b
      Kirill Smelkov authored
      We can use BININT* opcodes only starting from protocol >= 1.
      At protocol 0 we must use ASCII INT.
      9053359b
    • Kirill Smelkov's avatar
      encoder: Fix bool wrt protocol version · 6e6a8aa3
      Kirill Smelkov authored
      Starting from protocol 2 1-byte NEWTRUE/NEWFALSE opcodes are more
      efficient compared to 5-bytes e.g. "I01\n.".
      
      It is not only about efficiency, as protocol 4 _forbids_ use of variable
      length ASCII-only opcodes - whose data length is determined by doing
      forward scan for '\n'.
      
      Without encodeBool changes and only with the tests added it would fail
      this way:
      
          --- FAIL: TestEncode/True/proto=2 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x02I01\n."
                  want: "\x80\x02\x88."
          --- FAIL: TestEncode/True/proto=3 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x03I01\n."
                  want: "\x80\x03\x88."
          --- FAIL: TestEncode/True/proto=4 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x04I01\n."
                  want: "\x80\x04\x88."
          --- FAIL: TestEncode/False/proto=2 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x02I00\n."
                  want: "\x80\x02\x89."
          --- FAIL: TestEncode/False/proto=3 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x03I00\n."
                  want: "\x80\x03\x89."
          --- FAIL: TestEncode/False/proto=4 (0.00s)
              ogorek_test.go:383: encode:
                  have: "\x80\x04I00\n."
                  want: "\x80\x04\x89."
      6e6a8aa3