- 19 Feb, 2015 2 commits
-
-
Kevin Modzelewski authored
We seem to be spending a fair amount of time doing unnecessary work for simple calls like boxInt and createList, which are generated by irgen and reduce to calling new BoxedInt / BoxedList. The operator new calls tp_alloc, so we get some indirect function calls, and then tp_alloc does some checking about its caller, and then we check to see what size object to create, and how to initialize it. I created a DEFAULT_CLASS_SIMPLE macro to go with DEFAULT_CLASS, that should help with these things. I (manually) inlined all of those functions into the operator new. I also moved the small arena bucket selection function (SmallArena::alloc) into the header file so that it can get inlined, since the allocation size is often known at compile time and we can statically resolve to a bucket. Putting these together means that boxInt and createList are much tighter.
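The bucket-selection point can be illustrated with a small sketch. The size classes and the names `kBucketSizes`, `bucketIndexFor`, and `bucketSizeFor` are assumptions made up for this example, not the allocator's real values. The idea is only that a header-visible `constexpr` function lets the compiler resolve the bucket when the allocation size is a compile-time constant (requires C++14 or later).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical size classes; the real bucket sizes are not given in the commit message.
constexpr std::size_t kBucketSizes[] = {16, 32, 48, 64, 128, 256, 512};
constexpr std::size_t kNumBuckets = sizeof(kBucketSizes) / sizeof(kBucketSizes[0]);

// Returns the index of the smallest bucket that fits `bytes`,
// or kNumBuckets if the request is too large for any bucket.
constexpr std::size_t bucketIndexFor(std::size_t bytes) {
    for (std::size_t i = 0; i < kNumBuckets; ++i) {
        if (bytes <= kBucketSizes[i])
            return i;
    }
    return kNumBuckets;
}

// Runtime helper for requests that are known to fit in a bucket.
inline std::size_t bucketSizeFor(std::size_t bytes) {
    std::size_t idx = bucketIndexFor(bytes);
    assert(idx < kNumBuckets);
    return kBucketSizes[idx];
}

// With a constant argument the lookup is evaluated entirely at compile time.
static_assert(bucketIndexFor(16) == 0, "16 bytes fits the first bucket");
static_assert(bucketIndexFor(17) == 1, "17 bytes needs the second bucket");
```

Because the function body is visible at the call site, a call such as `bucketIndexFor(sizeof(T))` can fold to a constant, which is the effect the header move is after.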
-
Kevin Modzelewski authored
__thread seems quite a bit faster than pthread_getspecific, so if we give up on having multiple Heap objects, then we can store a reference to the current thread's ThreadBlockCache in a static __thread variable. It looks like this ends up mattering (5% average speedup) since SmallArena::_alloc() is so hot.
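A minimal sketch of the thread-local approach, assuming a stand-in `ThreadBlockCache` struct and hypothetical helpers `currentCache` and `recordAllocation`. It uses standard C++11 `thread_local` rather than the GCC-specific `__thread` spelling; the variable holds a plain pointer, which is also what `__thread` would require.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the real per-thread cache; its fields are invented for this sketch.
struct ThreadBlockCache {
    std::size_t allocations = 0;
};

// One pointer per thread. Reading it is a cheap thread-local load,
// with no library call on the hot path.
static thread_local ThreadBlockCache* tls_cache = nullptr;

// Returns the calling thread's cache, creating it on first use.
// The cache is never freed in this sketch.
inline ThreadBlockCache& currentCache() {
    if (tls_cache == nullptr)
        tls_cache = new ThreadBlockCache();
    assert(tls_cache != nullptr);
    return *tls_cache;
}

// Hypothetical hot-path operation that touches the per-thread cache.
inline void recordAllocation() {
    ++currentCache().allocations;
}
```

The trade-off named in the commit message shows up here: a single static variable means a single implicit heap, so supporting several Heap objects would need a different lookup.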
-
- 18 Feb, 2015 21 commits
-
-
Kevin Modzelewski authored
Teach len() how to rewrite itself
-
Marius Wachtler authored
-15% for fasta.py
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
At some point I'm sure we'll start paying for our 2KB+ inline caches, but it doesn't seem to be now!
-
Kevin Modzelewski authored
It's a pretty crude heuristic, but it stops us from endlessly rewriting "megamorphic" IC sites.
pyston interp2.py : 6.7s baseline: 6.5 (+3.0%)
pyston raytrace.py : 8.3s baseline: 7.9 (+4.3%)
pyston nbody.py : 10.6s baseline: 10.3 (+3.1%)
pyston fannkuch.py : 7.4s baseline: 7.4 (+0.8%)
pyston chaos.py : 24.2s baseline: 24.6 (-1.5%)
pyston spectral_norm.py : 22.7s baseline: 30.4 (-25.4%)
pyston fasta.py : 9.0s baseline: 8.4 (+7.6%)
pyston pidigits.py : 4.4s baseline: 4.3 (+1.7%)
pyston richards.py : 2.7s baseline: 12.5 (-78.7%)
pyston deltablue.py : 2.7s baseline: 2.6 (+0.9%)
pyston (geomean-0b9f) : 7.6s baseline: 9.0 (-15.2%)
There are a number of regressions; I feel like this is something we'll be tuning a lot.
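One way such a heuristic could look is a per-site rewrite budget. This is only a sketch: the commit message does not say what the actual rule or limit is, so `ICSite`, `kMaxRewritesPerSite`, and `simulateGuardFailures` are all assumptions for illustration.

```cpp
#include <cassert>

// Hypothetical budget; the real limit is not stated in the commit message.
constexpr int kMaxRewritesPerSite = 100;

// Minimal model of an inline-cache site that tracks how often it was rewritten.
struct ICSite {
    int times_rewritten = 0;

    // Once the budget is used up the site is treated as megamorphic
    // and left on the generic slow path.
    bool shouldAttemptRewrite() const {
        return times_rewritten < kMaxRewritesPerSite;
    }

    void noteRewrite() {
        assert(shouldAttemptRewrite());
        ++times_rewritten;
    }
};

// Simulates `n` guard failures at one site and returns how many rewrites happened.
inline int simulateGuardFailures(ICSite& site, int n) {
    int performed = 0;
    for (int i = 0; i < n; ++i) {
        if (site.shouldAttemptRewrite()) {
            site.noteRewrite();
            ++performed;
        }
    }
    return performed;
}
```

A fixed budget is crude in the same way the message describes: a site that settles down after many early misses is penalized along with a truly megamorphic one, which may be part of why tuning is expected.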
-
Kevin Modzelewski authored
Limit the number of generator stacks that we save, and register them as additional GC pressure.
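A bounded free list is one plausible shape for this. The sketch below assumes a hypothetical `StackPool` class with made-up constants `kStackSize` and `kMaxSavedStacks`. The `savedBytes` accessor stands in for whatever figure would be reported to the collector as extra pressure; the real GC hook is not shown in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

constexpr std::size_t kStackSize = 64 * 1024;  // hypothetical stack size
constexpr std::size_t kMaxSavedStacks = 4;     // hypothetical cap on cached stacks

// Caches released generator stacks for reuse, but only up to a fixed count.
class StackPool {
public:
    // Reuses a saved stack if one is available, otherwise allocates a new one.
    std::unique_ptr<char[]> acquire() {
        if (!saved_.empty()) {
            std::unique_ptr<char[]> stack = std::move(saved_.back());
            saved_.pop_back();
            return stack;
        }
        return std::unique_ptr<char[]>(new char[kStackSize]);
    }

    // Keeps the stack for reuse if under the cap; otherwise it is freed
    // when `stack` goes out of scope.
    void release(std::unique_ptr<char[]> stack) {
        assert(stack != nullptr);
        if (saved_.size() < kMaxSavedStacks)
            saved_.push_back(std::move(stack));
    }

    std::size_t savedCount() const { return saved_.size(); }

    // Bytes held by cached stacks: the amount one might report as GC pressure.
    std::size_t savedBytes() const { return saved_.size() * kStackSize; }

private:
    std::vector<std::unique_ptr<char[]>> saved_;
};
```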
-
https://github.com/toshok/pyston
Kevin Modzelewski authored
Conflicts: src/runtime/generator.cpp Closes #307
-
Kevin Modzelewski authored
New context switching code for generators
-
Kevin Modzelewski authored
Smaller performance improvements for fasta
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
remove the larger buckets, and hoist some math out of loops.
-
Kevin Modzelewski authored
Python exposes the finding part through the 'imp' module.
-
Chris Toshok authored
-
Kevin Modzelewski authored
It uses the buffer protocol, so make str support that better.
-
Chris Toshok authored
-
Chris Toshok authored
For some reason the larger bucket sizes are causing a large perf hit in spectral_norm. It's unclear exactly why this is happening, but theories are legion. More investigation is warranted, but this gets us back from the perf regression. Also hoist the atom_idx calculation out of a couple of loops that were iterating over object indices.
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
-
- 17 Feb, 2015 3 commits
-
-
Kevin Modzelewski authored
Only gets hit when there are >=3 !is_defined names also set (other fake names might also count towards this).
-
Kevin Modzelewski authored
Don't emit duplicate attr guards
-
Marius Wachtler authored
pyston (calibration) : 0.8s stock2: 0.8 (+2.5%)
pyston interp2.py : 5.9s stock2: 6.2 (-4.5%)
pyston raytrace.py : 6.9s stock2: 7.0 (-1.6%)
pyston nbody.py : 9.8s stock2: 9.6 (+1.9%)
pyston fannkuch.py : 7.0s stock2: 6.9 (+2.6%)
pyston chaos.py : 20.6s stock2: 21.6 (-4.6%)
pyston spectral_norm.py : 27.9s stock2: 34.2 (-18.6%)
pyston fasta.py : 17.1s stock2: 17.8 (-4.5%)
pyston pidigits.py : 4.4s stock2: 4.5 (-1.0%)
pyston richards.py : 10.4s stock2: 10.2 (+2.2%)
pyston deltablue.py : 2.2s stock2: 2.2 (-1.9%)
pyston (geomean-0b9f) : 8.8s stock2: 9.1 (-3.2%)
-
- 16 Feb, 2015 3 commits
-
-
Marius Wachtler authored
reduces the generator yield overhead
-
Marius Wachtler authored
This is a huge speed improvement for generators, fasta.py takes 8secs now instead of 18secs
-
Marius Wachtler authored
reduces strJoin runtime from 0.8sec to 0.5sec when executing fasta.py
-
- 14 Feb, 2015 7 commits
-
-
Kevin Modzelewski authored
We should do a more comprehensive investigation. Removing t2 caused regressions on a number of benchmarks since we lost chances to do speculations, but making t3 easier to get to caused regressions due to the cost of our LLVM optimization set (which is pretty hefty since it's supposed to be hard to activate).
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
Fully switch to the new deopt system, and clean up a lot of stuff.
-
Kevin Modzelewski authored
A "FunctionSpecialization" object really makes no sense in the context of an OSR compile, since the FunctionSpecialization talks about the types of the input arguments, which no longer matter for OSR compiles. Now, their type information comes (almost) entirely from the OSREntryDescriptor, so in most places assert that we get exactly one or the other.
-
Kevin Modzelewski authored
We only needed that for supporting the old deopt system
-
Kevin Modzelewski authored
Long live new-deopt!
-
Kevin Modzelewski authored
Before, we would do type analysis starting from the function entry (using the specialization of the previous function). This makes things pretty complicated because we can infer different types than we are OSRing with! For example, if the type analysis determines that we should speculate in an earlier BB, the types we have now might not reflect that speculation. So instead, start the type analysis from the BB that the OSR starts at. This should also have the side benefit of requiring less type analysis work. But this should let us get rid of the OSR-entry guarding, and the rest of the old deopt system!
-
- 13 Feb, 2015 4 commits
-
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
add a simple_destructor for file objects
-
Kevin Modzelewski authored
Add third GC arena
-
Kevin Modzelewski authored
Previously it was:
tier 0: ast interpreter
tier 1: llvm, no speculations, no llvm opts
tier 2: llvm, w/ speculations, no llvm opts
tier 3: llvm, w/ speculations, w/ llvm opts
tier 2 seemed pretty useless, and very little would stay in it. Also, OSR would always skip from tier 1 to tier 3.
Separately, add configurable OSR/reopt thresholds. This is mostly for the sake of tests, where we can set lower limits and force OSR/reopts to happen.
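The configurable-threshold part might be sketched as below. The struct `TierThresholds`, its field names, its default values, and the helpers `shouldReopt` and `shouldOSR` are all assumptions for illustration; the commit message does not give the real option names or numbers.

```cpp
#include <cassert>

// Hypothetical knobs; the defaults are placeholders, not the real values.
struct TierThresholds {
    int reopt_calls = 10000;    // calls before a function is recompiled at a higher tier
    int osr_backedges = 10000;  // loop back-edges before an on-stack replacement
};

// True once a function has been called often enough to be reoptimized.
inline bool shouldReopt(int call_count, const TierThresholds& t) {
    assert(call_count >= 0);
    return call_count >= t.reopt_calls;
}

// True once a loop has iterated often enough to trigger OSR.
inline bool shouldOSR(int backedge_count, const TierThresholds& t) {
    assert(backedge_count >= 0);
    return backedge_count >= t.osr_backedges;
}
```

A test can then lower both fields to a handful of iterations so that the OSR and reopt paths run in a short program, which is the use case the message mentions.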
-