- 19 Feb, 2015 6 commits
-
-
Kevin Modzelewski authored
Update gcc-4.8.2 tarball URL to the generic GNU FTP mirror.
-
Kevin Modzelewski authored
compvar: add int <op> float handling
-
Marius Wachtler authored
Convert the integer to a float and then let the float code handle the operation. With this change the type analysis is also able to comprehend that e.g. '1 - <float>' will return a float. This means that the math operations in the 'linear_combination' function in chaos.py get completely inlined. Improves chaos.py by 5%.
-
Kaiwen Xu authored
-
Kevin Modzelewski authored
We seem to be spending a fair amount of time doing unnecessary work for simple calls like boxInt and createList, which are generated by irgen and reduce to calling new BoxedInt / BoxedList. The operator new calls tp_alloc, so we get some indirect function calls; then tp_alloc does some checking about its caller, and then we check what size object to create and how to initialize it.

I created a DEFAULT_CLASS_SIMPLE macro to go with DEFAULT_CLASS that should help with these things: I (manually) inlined all of those functions into operator new. I also moved the small-arena bucket selection function (SmallArena::alloc) into the header file so that it can get inlined, since the allocation size is often known at compile time and we can statically resolve to a bucket.

Putting these together means that boxInt and createList are much tighter.
-
Kevin Modzelewski authored
__thread seems quite a bit faster than pthread_getspecific, so if we give up on having multiple Heap objects, then we can store a reference to the current thread's ThreadBlockCache in a static __thread variable. It looks like this ends up mattering (5% average speedup), since SmallArena::_alloc() is so hot.
-
- 18 Feb, 2015 21 commits
-
-
Kevin Modzelewski authored
Teach len() how to rewrite itself
-
Marius Wachtler authored
-15% for fasta.py
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
At some point I'm sure we'll start paying for our 2KB+ inline caches, but it doesn't seem to be now!
-
Kevin Modzelewski authored
It's a pretty crude heuristic, but it stops us from endlessly rewriting "megamorphic" IC sites.

pyston interp2.py       :  6.7s   baseline:  6.5  (+3.0%)
pyston raytrace.py      :  8.3s   baseline:  7.9  (+4.3%)
pyston nbody.py         : 10.6s   baseline: 10.3  (+3.1%)
pyston fannkuch.py      :  7.4s   baseline:  7.4  (+0.8%)
pyston chaos.py         : 24.2s   baseline: 24.6  (-1.5%)
pyston spectral_norm.py : 22.7s   baseline: 30.4  (-25.4%)
pyston fasta.py         :  9.0s   baseline:  8.4  (+7.6%)
pyston pidigits.py      :  4.4s   baseline:  4.3  (+1.7%)
pyston richards.py      :  2.7s   baseline: 12.5  (-78.7%)
pyston deltablue.py     :  2.7s   baseline:  2.6  (+0.9%)
pyston (geomean-0b9f)   :  7.6s   baseline:  9.0  (-15.2%)

There are a number of regressions; I feel like this is something we'll be tuning a lot.
-
Kevin Modzelewski authored
Limit the number of generator stacks that we save, and register them as additional GC pressure.
-
https://github.com/toshok/pyston
Kevin Modzelewski authored
Conflicts: src/runtime/generator.cpp
Closes #307
-
Kevin Modzelewski authored
New context switching code for generators
-
Kevin Modzelewski authored
Smaller performance improvements for fasta
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
remove the larger buckets, and hoist some math out of loops.
-
Kevin Modzelewski authored
Python exposes the finding part through the 'imp' module.
-
Chris Toshok authored
-
Kevin Modzelewski authored
It uses the buffer protocol, so make str support that better.
-
Chris Toshok authored
-
Chris Toshok authored
For some reason the larger bucket sizes are causing a large perf hit in spectral_norm. It's unclear exactly why this is happening, but theories are legion. More investigation is warranted, but this gets us back from the perf regression. Also hoist the atom_idx calculation out of a couple of loops that were iterating over object indices.
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
-
- 17 Feb, 2015 3 commits
-
-
Kevin Modzelewski authored
Only gets hit when there are >=3 !is_defined names also set (other fake names might also count towards this).
-
Kevin Modzelewski authored
Don't emit duplicate attr guards
-
Marius Wachtler authored
pyston (calibration)    :  0.8s   stock2:  0.8  (+2.5%)
pyston interp2.py       :  5.9s   stock2:  6.2  (-4.5%)
pyston raytrace.py      :  6.9s   stock2:  7.0  (-1.6%)
pyston nbody.py         :  9.8s   stock2:  9.6  (+1.9%)
pyston fannkuch.py      :  7.0s   stock2:  6.9  (+2.6%)
pyston chaos.py         : 20.6s   stock2: 21.6  (-4.6%)
pyston spectral_norm.py : 27.9s   stock2: 34.2  (-18.6%)
pyston fasta.py         : 17.1s   stock2: 17.8  (-4.5%)
pyston pidigits.py      :  4.4s   stock2:  4.5  (-1.0%)
pyston richards.py      : 10.4s   stock2: 10.2  (+2.2%)
pyston deltablue.py     :  2.2s   stock2:  2.2  (-1.9%)
pyston (geomean-0b9f)   :  8.8s   stock2:  9.1  (-3.2%)
-
- 16 Feb, 2015 3 commits
-
-
Marius Wachtler authored
reduces the generator yield overhead
-
Marius Wachtler authored
This is a huge speed improvement for generators: fasta.py now takes 8 seconds instead of 18.
-
Marius Wachtler authored
reduces strJoin runtime from 0.8sec to 0.5sec when executing fasta.py
-
- 14 Feb, 2015 7 commits
-
-
Kevin Modzelewski authored
We should do a more comprehensive investigation. Removing t2 caused regressions on a number of benchmarks since we lost chances to do speculations, but making t3 easier to get to caused regressions due to the cost of our LLVM optimization set (which is pretty hefty since it's supposed to be hard to activate).
-
Kevin Modzelewski authored
-
Kevin Modzelewski authored
Fully switch to the new deopt system, and clean up a lot of stuff.
-
Kevin Modzelewski authored
A "FunctionSpecialization" object really makes no sense in the context of an OSR compile, since the FunctionSpecialization talks about the types of the input arguments, which no longer matter for OSR compiles. Now their type information comes (almost) entirely from the OSREntryDescriptor, so in most places we assert that we get exactly one or the other.
-
Kevin Modzelewski authored
We only needed that for supporting the old deopt system
-
Kevin Modzelewski authored
Long live new-deopt!
-
Kevin Modzelewski authored
Before, we would do type analysis starting from the function entry (using the specialization of the previous function). This makes things pretty complicated because we can infer different types than we are OSRing with! E.g., if the type analysis determines that we should speculate in an earlier BB, the types we have now might not reflect that speculation. So instead, start the type analysis from the BB that the OSR starts at. This should also have the side benefit of requiring less type analysis work. But most importantly, it should let us get rid of the OSR-entry guarding, and the rest of the old deopt system!
-