Commits · 2e8a0084bb1aa86b27877bfb117cbdad023bbdd9 · Gwenaël Samain / cython

15 Aug, 2008 1 commit

Rewrite of the string literal handling code · 2e8a0084

Stefan Behnel authored Aug 15, 2008

String literals pass through the compiler as follows:
- unicode string literals are stored as unicode strings and encoded to UTF-8 on the way out
- byte string literals are stored as correctly encoded byte strings by unescaping the source string literal into the corresponding byte sequence. No further encoding is done later on!
- char literals are stored as byte strings of length 1. This can be verified by the parser now, e.g. a non-ASCII char literal in UTF-8 source code will result in an error, as it would end up as two or more bytes in the C code, which can no longer be represented as a C char.
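
The char literal rule above can be illustrated with a small Python sketch. It is not the actual Cython parser code: `check_char_literal` and its `source_encoding` parameter are hypothetical names chosen for illustration, and the sketch assumes the source file is UTF-8 encoded.

```python
def check_char_literal(text, source_encoding="utf-8"):
    """Return the single byte of a char literal, or raise ValueError.

    Hypothetical helper: a char literal must end up as exactly one byte,
    otherwise it cannot be represented as a C char.
    """
    data = text.encode(source_encoding)
    if len(data) != 1:
        raise ValueError(
            "char literal %r encodes to %d bytes and does not fit a C char"
            % (text, len(data)))
    return data


# A unicode string literal stays unicode internally and is only
# encoded to UTF-8 when it is written out.
unicode_literal = "caf\u00e9"
c_payload = unicode_literal.encode("utf-8")  # U+00E9 becomes two bytes

ascii_char = check_char_literal("a")         # one byte, accepted
try:
    check_char_literal("\u00e9")             # two bytes in UTF-8, rejected
    non_ascii_rejected = False
except ValueError:
    non_ascii_rejected = True
```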

Storing byte strings is necessary because we would otherwise lose the ability to encode byte string literals on the way out. They do not necessarily contain only bytes that fit into the source code encoding, as the source can use escape sequences to represent them. Previously, ASCII-encoded source code could not contain byte string literals with properly escaped non-ASCII bytes.

Another bug that was fixed: in Python, escape sequences behave differently in unicode strings (where they represent the character code) and byte strings (where they represent a byte value). Previously, they resulted in the same byte value in Cython code. This is only a problem for non-ASCII escapes, since the character code and the byte value of ASCII characters are identical.
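
The difference between the two kinds of escapes can be seen in plain Python 3, used here only as an illustration of the semantics the commit message describes. The variable names are arbitrary.

```python
# The same escape spelling means different things in the two literal kinds.
text = "\xe9"    # unicode string: the character with code point U+00E9
data = b"\xe9"   # byte string: the single byte 0xE9

# Encoding the unicode string to UTF-8 yields two bytes,
# so it must not be treated as the byte 0xE9.
text_utf8 = text.encode("utf-8")

# For ASCII escapes the character code and the byte value coincide,
# which is why the old behaviour only broke non-ASCII escapes.
ascii_text_utf8 = "\x41".encode("utf-8")
ascii_data = b"\x41"
```

Here `text_utf8` is `b"\xc3\xa9"` while `data` is `b"\xe9"`, whereas the two ASCII values are both `b"A"`.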

2e8a0084

14 Aug, 2008 6 commits
- test runner fix · 7bc8549a
  Stefan Behnel authored Aug 14, 2008
  
  7bc8549a
- merge of 0.9.8.1 beta2 · f968f684
  Stefan Behnel authored Aug 14, 2008
  
  f968f684
- Fix annotation. · e751a61a
  Robert Bradshaw authored Aug 14, 2008
  
  e751a61a
- Partial revert of 1001, use builtin unicode type. · 44e76ee6
  Robert Bradshaw authored Aug 14, 2008
  
  44e76ee6
- better test output · 857e7852
  Stefan Behnel authored Aug 14, 2008
  
  857e7852
- Fix embed_position encoding bug. · b79430c5
  Robert Bradshaw authored Aug 13, 2008
  
  b79430c5
13 Aug, 2008 12 commits
- Fixed buffer [] syntax yet another time · 3aad1ec5
  Dag Sverre Seljebotn authored Aug 13, 2008
  
  3aad1ec5
- Buffers: Fix for Python 2.6 beta compatibility · 5a472980
  Dag Sverre Seljebotn authored Aug 13, 2008
  
  5a472980
- merge · e6fe1a89
  Dag Sverre Seljebotn authored Aug 13, 2008
  
  e6fe1a89
- Made bufaccess.pyx testcase Py3-compatible · 7077456c
  Dag Sverre Seljebotn authored Aug 13, 2008
  
  7077456c
- Added --cython-only switch to runtests.py · 1192e11d
  Dag Sverre Seljebotn authored Aug 13, 2008
  
  1192e11d
- Sage compiles · 6a774aad
  Robert Bradshaw authored Aug 13, 2008
  
  6a774aad
- embed positions fix · 296cc2df
  Robert Bradshaw authored Aug 13, 2008
  
  296cc2df
- change include to import for python.pxd · f6ccc40b
  Robert Bradshaw authored Aug 13, 2008
  
  f6ccc40b
- string escaping bugs · 85fee25b
  Robert Bradshaw authored Aug 13, 2008
  
  85fee25b
- Minor fixes for bufaccess. · 309e3a2d
  Robert Bradshaw authored Aug 13, 2008
  
  309e3a2d
- Merge fixes, fix constant unicode, string literal indexing. · 205ea59d
  Robert Bradshaw authored Aug 12, 2008
```
All tests pass except bufaccess, tnumpy, and r_mang1.
```
  205ea59d
- merge dag and devel branches · 82acc1f8
  Robert Bradshaw authored Aug 12, 2008
  
  82acc1f8
12 Aug, 2008 11 commits
- parsetuple format fix · 2fbc1d54
  Stefan Behnel authored Aug 12, 2008
  
  2fbc1d54
- use a dedicated UnicodeType and UnicodeNode to represent unicode literals · 14986aea
  Stefan Behnel authored Aug 12, 2008
```
fixes the unicode literal indexing problem (only for unicode strings, not for byte strings!)
```
  14986aea
- applied Py3 exception format patch by Lisandro · 0227fc22
  Stefan Behnel authored Aug 12, 2008
  
  0227fc22
- slight test change · fa568114
  Stefan Behnel authored Aug 12, 2008
  
  fa568114
- new test case that shows broken string literal slicing behaviour · 4e113f32
  Stefan Behnel authored Aug 12, 2008
  
  4e113f32
- allow unicode values up to 1114111, even if they are not portable to two-byte unicode systems · 8b5a6441
  Stefan Behnel authored Aug 12, 2008
  
  8b5a6441
- use correct byte encoding for char values, some escaping on char literals · efb4bde9
  Stefan Behnel authored Aug 12, 2008
  
  efb4bde9
- fix raw string escapes · c3d75bde
  Stefan Behnel authored Aug 12, 2008
  
  c3d75bde
- the module docstring didn't get escaped · 55957b2e
  Stefan Behnel authored Aug 12, 2008
  
  55957b2e
- unescape all string content in the parser and escape it on the way out · 62fc87e0
  Stefan Behnel authored Aug 12, 2008
```
otherwise, different ways of spelling special characters can end up being correctly escaped or not in the C file
```
  62fc87e0
- docstrings in classes were neither escaped nor byte encoded · f6df9115
  Stefan Behnel authored Aug 12, 2008
  
  f6df9115
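
The idea behind "unescape in the parser, escape on the way out" (62fc87e0) is that every spelling of a character collapses to one value before any output escaping is chosen. A minimal sketch of that normalisation follows. It uses Python's standard `unicode_escape` codec as a stand-in for the parser's own unescaping, which is an assumption made for illustration, and `unescape` is a hypothetical helper name.

```python
import codecs


def unescape(spelling):
    """Turn the raw source spelling of a literal into its actual value.

    Stand-in for the parser step; uses Python's 'unicode_escape' codec.
    """
    return codecs.decode(spelling, "unicode_escape")


# Four different source spellings of the same character.
spellings = [rb"A", rb"\x41", rb"\101", rb"\u0041"]
values = {unescape(s) for s in spellings}

# Three spellings of a newline.
newlines = {unescape(s) for s in (rb"\n", rb"\x0a", rb"\012")}
```

Once every spelling has been reduced to the same value, a single escaping routine can decide how that value is written into the C file, so the result no longer depends on how the author happened to spell it.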
11 Aug, 2008 5 commits
- cleanup: removed special cases from string escaping code · 78576564
  Stefan Behnel authored Aug 11, 2008
  
  78576564
- better test output: everything else goes through sys.stderr, so divert the normal output there, too · 83a9dd2f
  Stefan Behnel authored Aug 11, 2008
  
  83a9dd2f
- Py2.6/3.0 import fixes · b469f471
  Stefan Behnel authored Aug 11, 2008
  
  b469f471
- Py2.6/3.0 import fixes · 6b760cb3
  Stefan Behnel authored Aug 11, 2008
  
  6b760cb3
- escape C digraphs, trigraphs and other special characters in strings · 39389bdc
  Stefan Behnel authored Aug 11, 2008
  
  39389bdc
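
As background for 39389bdc: a C compiler that supports trigraphs replaces sequences such as `??=` or `??/` before it processes string literals, so two adjacent question marks in generated C code are unsafe. The sketch below shows one possible way to avoid this. It is not the code from the commit: `escape_question_marks` is a hypothetical helper, and it assumes backslashes and quotes are handled by a separate escaping step.

```python
def escape_question_marks(s):
    """Write every '?' as the C escape '\\?'.

    After this, no two question marks are adjacent in the output,
    so no trigraph can form. Hypothetical helper for illustration.
    """
    return s.replace("?", "\\?")


escaped = escape_question_marks("what??!")
triple = escape_question_marks("???")
```

Escaping every question mark is more than strictly needed, but it avoids the corner cases that arise when only some of them are escaped in runs of three or more.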
10 Aug, 2008 5 commits
- Py3 test fixes · f40f8078
  Stefan Behnel authored Aug 10, 2008
  
  f40f8078
- Py2.3 test fixes · 1d03a4b0
  Stefan Behnel authored Aug 10, 2008
  
  1d03a4b0
- new test runner option to run the regression tests of the running Python installation · c3897fcc
  Stefan Behnel authored Aug 10, 2008
  
  c3897fcc
- more readable test output · 1b0ffcb1
  Stefan Behnel authored Aug 10, 2008
  
  1b0ffcb1
- support for long unicode escapes ('\U...') · 71a59940
  Stefan Behnel authored Aug 10, 2008
```
fixed unicode escape handling in byte strings
unescape \xXY in string literals, since in C a hex escape absorbs any hex digits that follow it - output string escaping will do the right thing
```
  71a59940
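
The hex escape problem mentioned in 71a59940 comes from C itself: a `\x` escape in a C string literal consumes all hex digits that follow, so `"\x01bc"` is read as one out-of-range escape rather than the byte 0x01 followed by `bc`. One common way around this, sketched below, is to write non-printable bytes as three-digit octal escapes, which C limits to at most three digits. This is an illustrative assumption about the output step, not the code from the commit, and `c_string_literal` is a hypothetical name.

```python
def c_string_literal(data):
    """Render a byte string as a C string literal.

    Non-printable and non-ASCII bytes become three-digit octal escapes,
    which cannot absorb the characters that follow them.
    Hypothetical helper for illustration.
    """
    out = []
    for byte in data:
        ch = chr(byte)
        if ch in ('\\', '"'):
            out.append("\\" + ch)        # escape backslash and quote
        elif 32 <= byte < 127:
            out.append(ch)               # printable ASCII passes through
        else:
            out.append("\\%03o" % byte)  # fixed-width octal escape
    return '"' + "".join(out) + '"'


followed_by_hex = c_string_literal(b"\x01bc")
followed_by_digit = c_string_literal(b"\xe91")
```

With fixed-width octal escapes, the `bc` after byte 0x01 and the `1` after byte 0xE9 remain ordinary characters in the C source.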