- 16 Aug, 2024 3 commits
-
-
Kirill Smelkov authored
After 6e5e403e (encoder: Fix GLOBAL emission wrt module/name with \n) BenchmarkEncode started to fail with --- FAIL: BenchmarkEncode ogorek_test.go:1003: protocol 0-3: global: module & name must be string without \n -> Fix it by taking as input only those testdata objects that are marked to encode without errors.
-
Kirill Smelkov authored
BININT1 emits unsigned integer that fits into 1 byte. BININT2 emits unsigned integer that fits into 2 bytes. So they cover [0, 0xff] and [0, 0xffff] ranges including edges. However we were emitting BININTX opcodes only for numbers being strictly inside those ranges - i.e. (0, 0xff) and (0, 0xffff) - without edges. As the result what could be emitted as "K\x00." was emitted as "J\x00\x00\x00\x00." and so on. Fix it by emitting BININT1 for [0, 0xff] inclusive and BININT for [0, 0xffff] also inclusive.
-
Kirill Smelkov authored
BINUNICODE opcode comes with 4-byte little-endian unsigned int length after it. However we were decoding it as signed int32 instead of unsigned uint32. As the result if last byte of this length is >= 0x80 the length was decoded as negative: y := int32(0x81) println(y << (3*8)) Output: -2130706432 which was leading to empty unicode string returned instead of an error. -> Fix it by decoding BINUNICODE lenght as uint32. Without the fix added test fails as --- FAIL: TestDecodeError (0.00s) ogorek_test.go:812: "X\xff\xff\xff\xff.": no decode error ; got "", <nil> And for that "X\xff\xff\xff\xff." Python also rejects to load this pickle: In [1]: p = b"X\xff\xff\xff\xff." In [2]: pickle.loads(p) --------------------------------------------------------------------------- UnpicklingError Traceback (most recent call last) Cell In [2], line 1 ----> 1 pickle.loads(p) UnpicklingError: pickle data was truncated In [3]: dis(p) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In [3], line 1 ----> 1 dis(p) File /usr/lib/python3.11/pickletools.py:2448, in dis(pickle, out, memo, indentlevel, annotate) 2446 errormsg = None 2447 annocol = annotate # column hint for annotations -> 2448 for opcode, arg, pos in genops(pickle): 2449 if pos is not None: 2450 print("%5d:" % pos, end=' ', file=out) File /usr/lib/python3.11/pickletools.py:2291, in _genops(data, yield_end_pos) 2289 arg = None 2290 else: -> 2291 arg = opcode.arg.reader(data) 2292 if yield_end_pos: 2293 yield opcode, arg, pos, getpos() File /usr/lib/python3.11/pickletools.py:693, in read_unicodestring4(f) 691 if len(data) == n: 692 return str(data, 'utf-8', 'surrogatepass') --> 693 raise ValueError("expected %d bytes in a unicodestring4, but only %d " 694 "remain" % (n, len(data))) ValueError: expected 4294967295 bytes in a unicodestring4, but only 1 remain
-
- 17 Jul, 2024 1 commit
-
-
Kirill Smelkov authored
Fix noticed typos, clarify things in a couple of places. Only minor and documentation changes. /helped-by @kisielk
-
- 16 Jul, 2024 6 commits
-
-
Kirill Smelkov authored
Rerunning go-fuzz after recent patches resulted in some new fuzz test vectors.
-
Kirill Smelkov authored
Up till now ogórek works relatively well for me but experience gained with ZODB/go and Wendelin.core highlighted two problems with strings: 1) loading, then re-saving string data is not generally identity, and 2) there is no way to distinguish binary data saved via py2 str from unicode strings saved via either py2 unicode or py3 str. Let me explain those problems: 1) Loading, then re-saving string data is not generally identity ---------------------------------------------------------------- On decoding ogórek currently loads both byte-string (*STRING opcodes) and unicode string (*UNICODE opcodes) into the same Go type string. And on encoding that string type is encoded as *STRING for protocol <= 2 and as *UNICODE for protocol >= 3. It was me to implement that encoding logic in 2018 in e7d96969 (encoder: Fix string wrt protocol version) where in particular I did - if protocol >= 3 we have to emit the string as unicode pickle object the same way as Python3 does. If we don't do - Python3 won't be generally able to load our pickle ... with the idea that protocol=3 always means that the pickle is intended to be for Python3. But there I missed that zodbpickle can use and generate pickles with protocol=3 even on py2 and that ZODB/py2 actually uses this protocol=3 mode since https://github.com/zopefoundation/ZODB/commit/12ee41c4 authored in the same 2018. So, there can be pickles saved under protocol=3 that do contain both *STRING and *UNICODE opcodes, and when ogórek sees those it loads that *STRING and *UNICODE data as just string loosing the information about type of particular variant. And then on encoding the data is saved as all UNICODE, or all STRING, thus breaking decode/encode=identity property. This breakage is there even with plain pickle and protocol=2 - when both *STRING and *UNICODE are in the database, ogórek loads them both as the same Go type string, and then saving back under the same protocol=2 goes as all STRING resulting in unicode objects becoming str on resave without any intended change. 2) there is no way to distinguish binary data saved via py2 str from unicode strings saved via either py2 unicode or py3 str ---------------------------------------------------------------------------------------------------------------------------- Continuing above example of py2 database with both *STRING and *UNICODE opcodes present there is currently no way for application to distinguish those from each other. In other words there is currently no way for the application to distinguish whether it is binary data coming from py2 protocol <= 2 era, from unicode text. The latter problem I hit for real: with Wendelin.core we have lots of data saved from under Python2 as just str. And the Go part of Wendlin.core, upon loading block of data, wants to accept only binary - either bytes from py3 or bytestring from py2, but not unicode, because it indicates a mistake if e.g. a ZBlk object would come with unicode data https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/bigfile/file_zodb.py#L267-300 https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/zdata/zblk.go#L31-32 https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/zdata/zblk.go#L98-107 but there is currently no way to distinguish whether it was unicode or bytestring saved into the database becase they both are represented as the same Go string type after decoding. ---------------------------------------- So to solve those problems I thought it over and understood that the issues start to appear becase we let *STRING and *UNICODE to become mixed into the same entity on loading. This behaviour is there from ogórek beginning and the intention, it seems, was to get the data from the pickle stream in an easily-accepted form on the Go side. However the convenience turned out to come with cost of loosing some correctness in the general case as explained above. So if we are to fix the correctness we need to change that behaviour and load *STRING and *UNICODE opcodes into distinct types, so that the information about what was what is preserved and it becomes possible to distinguish bytestring from unicode strings and resave the data in exactly the same form as loaded. Though we can do this only under opt-in option with default behaviour staying as it was before to preserve backward compatibility. -> Do it. Below is excerpt from doc.go and DecoderConfig and EncoderConfig changes that describe the new system: For strings there are two modes. In the first, default, mode both py2/py3 str and py2 unicode are decoded into string with py2 str being considered as UTF-8 encoded. Correspondingly for protocol ≤ 2 Go string is encoded as UTF-8 encoded py2 str, and for protocol ≥ 3 as py3 str / py2 unicode. ogórek.ByteString can be used to produce bytestring objects after encoding even for protocol ≥ 3. This mode tries to match Go string with str type of target Python depending on protocol version, but looses information after decoding/encoding cycle: py2/py3 str
↔ string StrictUnicode=n mode, default py2 unicode → string py2 str ← ogórek.ByteString However with StrictUnicode=y mode there is 1-1 mapping in between py2 unicode / py3 str vs Go string, and between py2 str vs ogórek.ByteString. In this mode decoding/encoding and encoding/decoding operations are always identity with respect to strings: py2 unicode / py3 str↔ string StrictUnicode=y mode py2 str↔ ogórek.ByteString For bytes, unconditionally to string mode, there is direct 1-1 mapping in between Python and Go types: bytes↔ ogórek.Bytes (~) bytearray↔ []byte -------- type DecoderConfig struct { // StrictUnicode, when true, requests to decode to Go string only // Python unicode objects. Python2 bytestrings (py2 str type) are // decoded into ByteString in this mode... StrictUnicode bool } type EncoderConfig struct { // StrictUnicode, when true, requests to always encode Go string // objects as Python unicode independently of used pickle protocol... StrictUnicode bool } Since strings are now split into two types, string and ByteString, and ByteString can either mean text or binary data, new AsString and AsBytes helpers are also added to handle string and binary data in uniform way supporting both py2 and py3 databases. Corresponding excerpts from doc.go and typeconv.go changes with the description of those helpers come below: On Python3 strings are unicode strings and binary data is represented by bytes type. However on Python2 strings are bytestrings and could contain both text and binary data. In the default mode py2 strings, the same way as py2 unicode, are decoded into Go strings. However in StrictUnicode mode py2 strings are decoded into ByteString - the type specially dedicated to represent them on Go side. There are two utilities to help programs handle all those bytes/string data in the pickle stream in uniform way: - the program should use AsString if it expects text data - either unicode string, or byte string. - the program should use AsBytes if it expects binary data - either bytes, or byte string. Using the helpers fits into Python3 strings/bytes model but also allows to handle the data generated from under Python2. -------- // AsString tries to represent unpickled value as string. // // It succeeds only if the value is either string, or ByteString. // It does not succeed if the value is Bytes or any other type. // // ByteString is treated related to string because ByteString represents str // type from py2 which can contain both string and binary data. func AsString(x interface{}) (string, error) { // AsBytes tries to represent unpickled value as Bytes. // // It succeeds only if the value is either Bytes, or ByteString. // It does not succeed if the value is string or any other type. // // ByteString is treated related to Bytes because ByteString represents str // type from py2 which can contain both string and binary data. func AsBytes(x interface{}) (Bytes, error) { ZODB/go and Wendelin.core intend to switch to using StrictUnicode mode while leaving ogórek to remain 100% backward-compatible in its default mode for other users. -
Kirill Smelkov authored
On Python two different objects with different types can represent essentially the same entity. For example 1 (int) and 1L (long) represent integer number one via two different types and are decoded by ogórek into Go types int64 and big.Int correspondingly. However on the Python side those two representations are often used interchangeably and programs are usually expected to handle both with the same effect. For example I hit this situation for real with FileStorage index, ZEO protocol, BTrees and Wendelin.core where data with integer fields, for the same field, are sometimes represented as int and sometimes as long. As the result ZODB/go and Wendelin.core had private XInt64 helper used as follows: https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/pycompat/pycompat.go https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/zdata/zblk.go#L385 https://lab.nexedi.com/search?utf8=%E2%9C%93&search=XInt64&group_id=&project_id=73&search_code=true&repository_ref=f7776fc1&nav_source=navbar Besides int/long we are also hitting similar situation with bytes / str(py2) / str(py3) which will also need to use corresponding helpers to be able to load pickled data from existing databases. In the next patch we will add StrictUnicode mode to handle that strings, and it will naturally come with AsBytes and AsString helpers. Thus it also feels appropriate to keep the other useful helpers to handle unpickled data inside ogórek itself. This patch adds AsInt64 helper with top-level description and tests.
-
Kirill Smelkov authored
In the next patch I will need to use big.Int.IsInt64, which was added in Go 1.9 in 2017. Go 1.7 and Go 1.8 are very old nowdays - their last point releases were done in 2017 (go1.7.6) and 2018 (go1.8.7) - 7 and 6 years ago correspondingly. Since then those releases are EOL with many further Go releases done after them. We even specify go1.12 as the requirement in our go.mod, but that is probably because go1.12 is the first version to support modules. Anyway dropping support for those very old Go1.7 and Go1.8 should be almost certainly ok as I doubt if there are current ogórek users still stuck at those old Go versions.
-
Kirill Smelkov authored
In 9daf6a2a (Fix UNICODE decoding) I fixed protocol 0 UNICODE decoding by implementing "raw-unicode-escape" decoder and switching unpickler to use that to match 1-to-1 what Python unpickler does. In 57f875fd (encoder: Fix protocol 0 UNICODE emission) I further fixed protocol 0 UNICODE encoding by implementing "raw-unicode-escape" encoder and switching pickler to use that, trying to match 1-to-1 what Python pickler does. However there is a difference in between Python unicode and unicode on Go side: in Python unicode is immutable array of UCS code points. On Go side, unicode is immutable array of bytes, that, similarly to Go string are treated as being mostly UTF-8. We did not pick e.g. []rune to represent unicode on Go side because []rune is not immutable and so cannot be used as map keys. So Go unicode can be either valid UTF-8 or invalid UTF-8. For valid UTF-8 case everything is working as intended: our raw-unicode-escape decoder, because it works in terms of UCS, can produce only valid UTF-8, while our raw-unicode-escape encoder also matches 1-to-1 what python does because for valid UTF-8 utf8.DecodeRuneInString gives good rune and the encoder further works in terms of UCS. However for the case of invalid UTF-8, there is a difference in between what Python and Go raw-unicode-escape encoders do: for Python this case is simply impossible because there input is []UCS. For the Go case, in 57f875fd I extended the encoder to do: // invalid UTF-8 -> emit byte as is case r == utf8.RuneError: out = append(out, s[0]) which turned out to be not very thoughtful and wrong because original raw-unicode-escape also emits UCS < 0x100 as those plain bytes and so if we also emit invalid UTF-8 as is then the following two inputs would be encoded into the same representation "\x93": unicode("\x93") // invalid UTF-8 unicode("\xc2\x93) // UTF-8 of \u93 which would break `decode/encode = identity` invariant and corrupt the data because when loaded back it will be "\xc2\x93" instead of original "\x93". -> Fix it by rejecting to encode such invalid UTF-8 via protocol 0 UNICODE. Unfortunately rejection is the only reasonable choice because raw-unicode-escape codec does not allow \xAA to be present in the output stream and so there is simply no way to represent arbitrary bytes there. Better to give explicit error instead of corrupting the data. For protocols ≥ 1 arbitrary unicode - both valid and invalid UTF-8 - can still be loaded and saved because *BINUNICODE opcodes come with bytestring argument and there is no need to decode/encode those bytestrings. /reviewed-by @kisielk /reviewed-on https://github.com/kisielk/og-rek/pull/67
-
Kirill Smelkov authored
For example currently if result of decoding is expected to be unicode, but is actually string the following confusing error is generated with exactly the same have and want: --- FAIL: TestDecode/unicode(non-utf8)/"X\x01\x00\x00\x00\x93." (0.00s) ogorek_test.go:504: decode: have: "\x93" want: "\x93" This happens because by default Printf("%#v") does not emit object type even if it was derived type from builtin string - either our Bytes or unicode, and emits the same string representation irregardless of which type it is. I think out-of-the box behaviour of %#v is unreasonable, but it is there for a long time and so instead of trying to argue with Go upstream, which after many years of arguing I think is not practical, teach our Bytes and unicode types how to represent themselves under %#v with indicating their type in the output. After the fix the error from above becomes --- FAIL: TestDecode/unicode(non-utf8)/"X\x01\x00\x00\x00\x93." (0.00s) ogorek_test.go:504: decode: have: "\x93" want: ogórek.unicode("\x93") which is more clear because want is different from have. Adjusted behaviour is not test-only, because user applications might want to do the same %#v at application level and there it would be also confusing not to see a type. /reviewed-by @kisielk /reviewed-on https://github.com/kisielk/og-rek/pull/67
-
- 12 Jul, 2024 5 commits
-
-
Kamil Kisiel authored
fuzz: Export tests pickles to fuzz/corpus with stable filenames
-
Kirill Smelkov authored
In 9d1344ba (fuzz: Automatically export all tests pickles into fuzz/corpus) I added code that automatically exports all test pickles to fuzz corpus, so that fuzzing starts from all the tricky cases we already put into tests. However exported files were named as test-i-j.pickle, where i is sequential number of test entry in global test vector. This way if we insert some new entry in the middle, or change entries order it causes lots of renaming inside fuzz/corpus/ on next `go generate`. -> Fix that by emitting vectors as test-<sha1-of-pickle>.pickle so that file name depends only on pickle content are remains stable independently of entry position in the list of tests. Besides changes to fuzz_test.go this patch brings: 1) rm fuzz/corpus/test-*.pickle 2) `go generate` The result inside fuzz/corpus/ is pure renaming + removal of duplicates. For example test-15-4.pickle is removed because the same data was in test-14-4.pickle and now resides in test-ee4459c0c165c4cbc3672ee5ef5438e00121b229.pickle.
-
Kamil Kisiel authored
decoder: Fix integer overflow in BYTEARRAY8 handling
-
Kirill Smelkov authored
Rerunning recent go-fuze resulted in adding some new test vectors.
-
Kirill Smelkov authored
Rerunning fuzzing afresh found the following crash because uint64 len > max(int64) was cast to int resulting in negative number: "\x960000000\xbd" panic: bytes.Buffer.Grow: negative count goroutine 1 [running]: bytes.(*Buffer).Grow(0xc000012140?, 0xc00010abf8?) /home/kirr/src/tools/go/go1.22/src/bytes/buffer.go:166 +0xb4 github.com/kisielk/og-rek.(*Decoder).bufLoadBytesData(0xc000072000, 0xbd30303030303030) /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:788 +0xd5 github.com/kisielk/og-rek.(*Decoder).bufLoadBinData8(0xc000072000) /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:776 +0x105 github.com/kisielk/og-rek.(*Decoder).loadBytearray8(0xc000072000) /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:1270 +0x34 github.com/kisielk/og-rek.(*Decoder).Decode(0xc000072000) /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:311 +0x1a2c github.com/kisielk/og-rek.Fuzz({0x7fd052187000, 0x9, 0x9}) /home/kirr/src/neo/src/github.com/kisielk/og-rek/fuzz.go:15 +0xf6 go-fuzz-dep.Main({0xc00010af38, 0x1, 0x5bc338?}) go-fuzz-dep/main.go:36 +0x14c main.main() github.com/kisielk/og-rek/go.fuzz.main/main.go:15 +0x35 exit status 2 -> Fix it by first comparing uint64 len without casting and further casting to int only after we cap the len to be in safe range.
-
- 07 Jul, 2024 1 commit
-
-
Kirill Smelkov authored
* Switch from Travis CI to GitHub Actions Travis CI no longer works for ogórek and https://travis-ci.org/kisielk/og-rek returns "not found". -> Switch to GitHub Actions to fix CI. We loose CI coverage for "go tip" as specifying "tip" Go version in actions/setup-go is not supported out of the box: https://github.com/actions/setup-go/issues/21 Maybe in 201x it was relevant to test wrt go tip, but nowdays I think this is no longer much needed. * CI += go1.16 - go1.22 Add CI coverage for all releases past go1.15 that we had before.
-
- 10 Mar, 2021 1 commit
-
-
Kirill Smelkov authored
Pickle protocol 5 adds out-of-band data and BYTEARRAY8 opcode (see [1] and [2]). We add support for BYTEARRAY8 here. Handling out-of-band data would require more setup from users - on both decoding and encoding side, and it is likely that currently no practical use-case exists to work with pickles with out-of-band data on Go side. This way we behave as if no out-of-band data was provided when seeing all protocol 5 opcodes besides BYTEARRAY8. Hopefully fixes: https://github.com/kisielk/og-rek/issues/62 [1] https://www.python.org/dev/peps/pep-0574 [2] https://github.com/python/cpython/commit/91f4380cedba /reviewed-by @kisielk /reviewed-on https://github.com/kisielk/og-rek/pull/63
-
- 16 Sep, 2020 2 commits
-
-
Kirill Smelkov authored
Ogórek can be already used from another module (but at outdated version), but let's be explicit about that. Go.mod was created by `go mod init github.com/kisielk/og-rek`. There is no external dependencies.
-
Kirill Smelkov authored
Go1.14 and Go1.15 are the current stable Go releases. Go1.12 and Go1.14 fill in the gap in testing coverage in between -go1.11 and current go releases. Maybe we can eventually drop older go release from travis coverage.
-
- 28 Sep, 2018 8 commits
-
-
Kirill Smelkov authored
Add brief overview on to how to use ogórek, and how Python and Go types are mapped with each other. For completeness give overview of pickle protocols and overview of persistent references. (Documentation put into separate doc.go as suggested by Kamil)
-
Kirill Smelkov authored
Rerun `go generate` since new tests vectors have been added in recent bytes/bytearray patches.
-
Kirill Smelkov authored
Starting from 2013 (from 555efd8f "first draft of dumb pickle encoder") wrt []byte ogórek state was: 1. []byte was encoded as string 2. there was no way to decode a pickle object into []byte then, as - []byte encoding was never explicitly tested, - nor I could find any usage of such encodings via searching through all Free / Open-Source software ogórek uses - I searched via "Uses" of NewEncoder on godoc: https://sourcegraph.com/github.com/kisielk/og-rek/-/blob/encode.go#L48:6=&tab=references:external it is likely that []byte encoding support was added just for the sake of it and convenience and then never used. It is also likely that the original author does not use ogórek encoder anymore: https://github.com/kisielk/og-rek/pull/52#issuecomment-423639026 For those reasons I tend to think that it should be relatively safe to change how []byte is handled: - the need to change []byte handling is that currently []byte is a kind of exception: we can only encode it and not decode something into it. Currently encode/decode roundtrip for []byte gives string, which breaks the property of encode/decode being identity for all other basic types. - on the similar topic, on encoding strings are assumed UTF-8 and are encoded as UNICODE opcodes for protocol >= 3. Passing arbitrary bytes there seems to be not good. - on to how change []byte - sadly it cannot be used as Python's bytes counterpart. In fact in the previous patch we actually just added ogórek.Bytes as a counterpart for Python bytes. We did not used []byte for that because - contrary to Python bytes - []byte cannot be used as a dict key. - the most natural counterpart for Go's []byte is thus Python's bytearray: https://docs.python.org/3/library/stdtypes.html#bytearray-objects which is "a mutable counterpart to bytes objects" So add Python's bytearray decoding into []byte, and correspondingly change []byte encoding to be encoded as bytearray. P.S. This changes encoder semantic wrt []byte. If some ogórek use breaks somewhere because of it, we could add an encoder option to restore backward compatible behaviour. However since I suspect noone was actually encoding []byte into pickles, I prefer we wait for someone to speak-up first instead of loading EncoderConfig with confusion options that nobody will ever use.
-
Kirill Smelkov authored
In the next patch decoding a pickle will depend on whether current protocol is <= 2 or >= 3. For this let's teach PROTO opcode handler to recall last seen protocol version. For testing - prepare tests infrastructure for cases where protocol version affects decoding semantic: if a test pickle in tests table comes with \x80\xff prefix, on decode tests this prefix will be changed to concrete \x80 ver with all versions protperly iterated as specified in TestPickle.protov. We will also need this functionality in the next patch.
-
Kirill Smelkov authored
In Python bytes is immutable and read-only array of bytes. It is also hashable and so is different from go []byte in that it can be used as a dict key. Thus the closes approximation for Python bytes in Go is some type derived from Go's string - it will be different from string and at the same time will inherit from string it immutability property and being able to be used as map key. So - add ogórek.Bytes type to represent Python bytes - add support to decode BINBYTES* pickle opcodes (these are protocol 3 opcodes) - add support to encode ogórek.Bytes via those BINBYTES* opcodes - for protocols <= 2, where there is no opcodes to directly represent bytes, adopt the same approach as Python - by pickling bytes as _codecs.encode(byt.decode('latin1'), 'latin1') this way unpickling it on Python3 will give bytes, while unpickling it on Python2 will give str: In [1]: sys.version Out[1]: '3.6.6 (default, Jun 27 2018, 14:44:17) \n[GCC 8.1.0]' In [2]: byt = b'\x01\x02\x03' In [3]: _codecs.encode(byt.decode('latin1'), 'latin1') Out[3]: b'\x01\x02\x03' --- In [1]: sys.version Out[1]: '2.7.15+ (default, Aug 31 2018, 11:56:52) \n[GCC 8.2.0]' In [2]: byt = b'\x01\x02\x03' In [3]: _codecs.encode(byt.decode('latin1'), 'latin1') Out[3]: '\x01\x02\x03' - correspondingly teach decoder to recognize particular calls to _codecs.encode as being representation for bytes and decode it appropriately. - since we now have to emit byt.decode('latin1') as UNICODE - add, so far internal, `type unicode(string)` that instructs ogórek encoder to always emit the string with UNICODE opcodes (regular string is encoded to unicode pickle object only for protocol >= 3). - For []byte encoding preserve the current status - even though dispatching in Encoder.encode changes, the end result is the same - []byte was and stays currently encoded as just regular string. This was added in 555efd8f "first draft of dumb pickle encoder", and even though that might be not a good choice, changing it is a topic for another patch.
-
Kirill Smelkov authored
Previously, we were quoting UNICODE opcode argument with strconv.QuoteToASCII(). However that function, in addition to \u and \U escapes, can produce e.g. \n, \r, \xAA etc escapes. And all of the latter variants are not treated as special escapes of a unicode literal by Python, thus leading to data being wrongly received. Fix it by doing exactly the same that Python pickle encoder does - the UNICODE argument comes are "raw-unicode-escape" encoded. This patch contains only codec tests - not end-to-end pickle tests, because currently Encoder.encodeUnicode() is called only from under Encoder.encodeString(), and there only from under if e.config.Protocol >= 3 We will indirectly add tests for encodeUnicode @ protocol=0 in the next patches, while adding support for Python bytes.
-
Kirill Smelkov authored
I initially added pyquote only as debugging tool (in b429839d "tests: Show pickles in a way that can be copy-pasted into Python"), and later it started to be used in the Encoder (see 18004fbd "Move pyquote into main codebase"). However there is no explicit tests for pyquote. Add some pyquote tests.
-
Kirill Smelkov authored
Both object and objectDecodedBack are interace{}. So it could turn out, if e.g. they are both of the same non-comparable type (e.g. []byte), comparing them directly will panic.
-
- 26 Sep, 2018 7 commits
-
-
Kirill Smelkov authored
I was (re-)running fuzz tests again, and in addition to what was already known (e.g. self-referencing structures) found two new issues: - long as dict keys (https://github.com/kisielk/og-rek/issues/55), and - encoding global with '\n' in module name (fixed in previous patch). Update the corpus with points go-fuzz found, so that next runs could restart with having all the previous interesting input vectors available.
-
Kirill Smelkov authored
Caught via fuzzing: "\x8c\x030\n02\x93." 0: \x8c SHORT_BINUNICODE '0\n0' 5: 2 DUP 6: \x93 STACK_GLOBAL 7: . STOP panic: protocol 0: decode back error: err pickle: "c0\n0\n0\n0\n." goroutine 1 [running]: github.com/kisielk/og-rek.Fuzz(0x7f2f1009a000, 0x8, 0x200000, 0x3) /tmp/go-fuzz-build645492341/gopath/src/github.com/kisielk/og-rek/fuzz.go:47 +0x8b8 go-fuzz-dep.Main(0x525e10) /tmp/go-fuzz-build645492341/goroot/src/go-fuzz-dep/main.go:49 +0xad main.main() /tmp/go-fuzz-build645492341/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d exit status 2 i.e. '0\n0' module name was emitted as-is as part ot text-based GLOBAL which completely broke pickle stream. For the reference Python decodes such globals with \n in name just ok: In [10]: s = b"S'decimal\\nq'\nS'Decimal'\n\x93." In [11]: pickle.loads(s) --------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) <ipython-input-11-764e4625bc41> in <module>() ----> 1 pickle.loads(s) ModuleNotFoundError: No module named 'decimal\nq' In [12]: import sys In [15]: d = sys.modules['decimal'] In [16]: sys.modules['decimal\nq'] = d In [17]: pickle.loads(s) Out[17]: decimal.Decimal
-
Kirill Smelkov authored
I put "err" into format string instead of argument. Apologize for that...
-
Kirill Smelkov authored
While fixing https://github.com/kisielk/og-rek/issues/48 there were corresponding test-vector additions. Regenerate the fuzz test corpus.
-
Kirill Smelkov authored
UNICODE is text-based opcode, which is used at protocol 0 and follows 'raw-unicode-escape' encoded argument till EOL. - for decoding we must explicitly implement Python's 'raw-unicode-escape' codec decoding, which is used by Python's pickle for UNICODE argument. Updates and hopefully fixes: https://github.com/kisielk/og-rek/issues/48
-
Kirill Smelkov authored
STRING is text-based opcode which is used at protocol 0 and follows \-escaped argument till EOL. - for encoding we must not use Go's %q, since that will use \u and \U when seeing corresponding bytes, and since Python does not interpret \u or \U in string literals, the data received at Python side will be different. - for decoding we must explicitly implement Python's 'string-escape' codec decoding which is used by Python's pickle for STRING opcode argument. Updates: https://github.com/kisielk/og-rek/issues/48
-
Kirill Smelkov authored
We added pyquote in b429839d (tests: Show pickles in a way that can be copy-pasted into Python). However in the next patch this functionality will be needed for the encoder to fix encoding of strings at protocol 0. Thus we need this function to be not test-only. Plain code movement here, no semantic change.
-
- 25 Sep, 2018 6 commits
-
-
Kirill Smelkov authored
Even though we tried to catch whether dict keys are ok to be used via reflect.TypeOf(key).Comparable() (see da5f0342 "decoder: Fix crashes found by fuzzer (#32)"), that turned out to be not enough. For example if key is a struct, e.g. of the following type type Ref struct { Pid interface{} } it will be comparable. But the comparision, depending on dynamic .Pid type, might panic. This is what was actually cauht by fuzz-testing recently: https://github.com/kisielk/og-rek/issues/50 (second part of the report) So instead of recursively walking a key type and checking each subfield with reflect.TypeOf().Comparable(), switch for using panic/recover for detecting the "unhashable key" situation. This slows down decoding a bit (only cumulative figure for all-test-vectors decoding): name old time/op new time/op delta DecodeLong-4 361ns ± 0% 362ns ± 0% ~ (p=0.238 n=5+4) Decode-4 93.2µs ± 0% 95.6µs ± 0% +2.54% (p=0.008 n=5+5) Encode-4 16.5µs ± 0% 16.6µs ± 0% ~ (p=0.841 n=5+5) but that is the price of correctness. And with manually recursively walking key type I doubt it would be faster. The defer overhead should be less once https://github.com/golang/go/issues/14939 is fixed. Updates: https://github.com/kisielk/og-rek/issues/30
-
Kirill Smelkov authored
In 06e06939 (encoder: Allow to specify pickle protocol version) I added ability to add to tests error cases - object inputs that on encoding should produce error. For decoding we should skip those cases as there pickle.data = "", and if not skipped it leads to --- FAIL: BenchmarkDecode ogorek_test.go:803: unexpected # of decode steps: got 100 ; want 102
-
Kirill Smelkov authored
Updates coming via `go generate` since main tests table was amended.
-
Kirill Smelkov authored
Appologize for the breakage there.
-
Kirill Smelkov authored
When encoding tests fails, the "want" and "have" pickles are printed. It is handy to copy-paste those pickles into Python console and check them further there. Pickle printing currently uses %q. However in Go fmt's %q can use \u and \U if byte sequence form a valid UTF-8 character. That poses a problem: in Python str (py2) or bytes (py3) literal \uXXXX are not processed as unicode-escapes and enter the string as is. This result in different pickle data pasted into Python and further confusion. Entering data into Python as unicode literals (where \u works) and then adding .encode('utf-8') also does not generally work - as pickle data is generally arbitrary it can be a not valid UTF-8, for example: "\x80\u043c\u0438\u0440" (= "\x80мир" = "\x80\xd0\xbc\xd0\xb8\xd1\x80") end unicode-encoding them in python also gives different data: In [1]: u"\x80\u043c\u0438\u0440".encode('utf-8') Out[1]: '\xc2\x80\xd0\xbc\xd0\xb8\xd1\x80' (note leading extra \xc2) For this reason let's implement quoting - that Python can understand - ourselves. This dumping functionality was very handy during recent encoder fixes debugging.
-
Kirill Smelkov authored
- we can use STACK_GLOBAL only if protocol >= 4. - for earlier protocols we have to use text-based GLOBAL.
-