    Changes to string.split/rsplit on whitespace to preallocate space in the
    results list.
    Andrew Dalke authored · 525eab37
    
    Originally it allocated 0 items and used the list growth during append.  Now
    it preallocates 12 items so the first few appends don't need list reallocs.
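
    The preallocation idea can be modelled in pure Python.  This is only a
    sketch: the real change is C code in stringobject.c, and the names
    PREALLOCATE_SIZE and split_whitespace_prealloc are assumptions made up
    for illustration, not identifiers from the commit.

```python
PREALLOCATE_SIZE = 12  # slot count from the commit message; the name is hypothetical


def split_whitespace_prealloc(s):
    """Pure-Python model of whitespace splitting into a preallocated list."""
    result = [None] * PREALLOCATE_SIZE  # reserve slots up front
    count = 0  # number of words stored so far
    i, n = 0, len(s)
    while i < n:
        # Skip the run of whitespace before the next word.
        while i < n and s[i].isspace():
            i += 1
        if i == n:
            break
        # Scan to the end of the word.
        j = i
        while j < n and not s[j].isspace():
            j += 1
        word = s[i:j]
        if count < PREALLOCATE_SIZE:
            result[count] = word  # fast path: fill a reserved slot
        else:
            result.append(word)  # slow path: ordinary list growth
        count += 1
        i = j
    del result[count:]  # drop any reserved slots that were never used
    return result
```

    The branch on count is the per-append check whose overhead shows up
    once a string has more words than reserved slots.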
    
    ("Here are some words ."*2).split(None, 1) is 7% faster
    ("Here are some words ."*2).split() is 15% faster
    
      (Your mileage may vary, see dealership for details.)
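
    A timing comparison along these lines can be run with the standard
    timeit module.  This is a sketch for a current Python 3 interpreter,
    so it will not reproduce the percentages above; the function name
    time_split and the iteration counts are assumptions.

```python
import timeit

TEXT = "Here are some words ." * 2


def time_split(number=200_000, repeat=5):
    """Return the best-of-`repeat` time in seconds for each split variant."""
    variants = {
        "split(None, 1)": lambda: TEXT.split(None, 1),
        "split()": lambda: TEXT.split(),
    }
    return {
        label: min(timeit.repeat(fn, number=number, repeat=repeat))
        for label, fn in variants.items()
    }


if __name__ == "__main__":
    for label, seconds in time_split().items():
        print(f"{label:16s} {seconds:.4f} s")
```

    Running it once on a build with the change and once on a build
    without gives the two numbers needed for a percentage.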
    
    File parsing like this
    
        for line in f:
            count += len(line.split())
    
    is also about 15% faster.  There is a slowdown of about 3% for large
    strings because of the additional overhead of checking if the append is
    to a preallocated region of the list or not.  This will be the rare case.
    It could be improved with special case code but we decided it was not
    useful enough.
    
    There is a cost of 12*sizeof(PyObject *) bytes per list.  For the normal
    case of file parsing this is not a problem because the lists have
    a short lifetime.  We have not come up with cases where this is a problem
    in real life.
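
    The per-list cost can be worked out for a given platform.  The sketch
    below uses ctypes to get the pointer size, on the assumption that
    sizeof(PyObject *) equals the size of a void pointer; the helper name
    is hypothetical.

```python
import ctypes

PREALLOCATE_SIZE = 12  # slot count from the commit message


def preallocation_cost_bytes(slots=PREALLOCATE_SIZE):
    """Bytes reserved up front: one object pointer per preallocated slot."""
    return slots * ctypes.sizeof(ctypes.c_void_p)


if __name__ == "__main__":
    print(preallocation_cost_bytes(), "bytes per result list")
```

    With 8-byte pointers that is 96 bytes per list, and 48 bytes with
    4-byte pointers.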
    
    I chose 12 because human text averages about 11 words per line in books,
    one of my data sets averages 6.2 words with a final peak at 11 words per
    line, and I work with a tab delimited data set with 8 tabs per line (or
    9 words per line).  12 encompasses all of these.
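
    A similar check can be made against any other data set.  The sketch
    below computes the mean words per line and the fraction of non-blank
    lines that fit in the preallocated slots; the function name and the
    sample lines are made up for illustration.

```python
def words_per_line_stats(lines, slots=12):
    """Return (mean words per non-blank line, fraction of lines with <= slots words)."""
    counts = [len(line.split()) for line in lines if line.strip()]
    if not counts:
        return 0.0, 0.0
    mean = sum(counts) / len(counts)
    covered = sum(1 for c in counts if c <= slots) / len(counts)
    return mean, covered


if __name__ == "__main__":
    sample = ["Here are some words .", "a\tb\tc\td\te\tf\tg\th\ti", ""]
    print(words_per_line_stats(sample))
```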
    
    Also changed the last rsplit code to append then reverse, rather than
    doing insert(0).  The split() and rsplit() times are now comparable.
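
    The append-then-reverse approach can be modelled for a right-to-left
    whitespace split.  This is a pure-Python sketch, not the C code from
    the commit, and the function name is an assumption.

```python
def rsplit_whitespace(s):
    """Split on whitespace scanning right to left, then reverse once."""
    result = []
    j = len(s)
    while j > 0:
        # Skip the run of whitespace to the right of the next word.
        while j > 0 and s[j - 1].isspace():
            j -= 1
        if j == 0:
            break
        # Scan back to the start of the word.
        i = j
        while i > 0 and not s[i - 1].isspace():
            i -= 1
        # append is amortized O(1); insert(0, ...) would shift every
        # existing element on each call.
        result.append(s[i:j])
        j = i
    result.reverse()  # one pass restores left-to-right order
    return result
```

    A single reverse at the end costs one pass over the list, whereas
    repeated insert(0) costs one pass per word.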
stringobject.c 122 KB