• Greg Price's avatar
    closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558) · 2f094139
    Greg Price authored
    The purpose of the `unicodedata.is_normalized` function is to answer
    the question `str == unicodedata.normalized(form, str)` more
    efficiently than writing just that, by using the "quick check"
    optimization described in the Unicode standard in UAX #15.
    
    However, it turns out the code doesn't implement the full algorithm
    from the standard, and as a result we often miss the optimization and
    end up having to compute the whole normalized string after all.
    
    Implement the standard's algorithm.  This greatly speeds up
    `unicodedata.is_normalized` in many cases where our partial variant
    of quick-check had been returning MAYBE and the standard algorithm
    returns NO.
    
    At a quick test on my desktop, the existing code takes about 4.4 ms/MB
    (so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
    has to do the slow normalize-and-compare:
    
      $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
          -- 'unicodedata.is_normalized("NFD", s)'
      50 loops, best of 5: 4.39 msec per loop
    
    With this patch, it gets the answer instantly (58 ns) on the same 1 MB
    string:
    
      $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
          -- 'unicodedata.is_normalized("NFD", s)'
      5000000 loops, best of 5: 58.2 nsec per loop
    
    This restores a small optimization that the original version of this
    code had for the `unicodedata.normalize` use case.
    
    With this, that case is actually faster than in master!
    
    $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
        -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 561 usec per loop
    
    $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
        -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 512 usec per loop
    2f094139
2019-08-27-21-21-36.bpo-37966.5OBLez.rst 205 Bytes