[Pdl-porters] PDL3::PP and JIT via TCC

Discussion:

Chris Marshall

2013-12-19 13:36:53 UTC

There has been some interesting development of modules for
runtime compilation of C routines for/from Perl via the TinyCC
compiler:

http://search.cpan.org/perldoc?Alien::TinyCC
http://search.cpan.org/perldoc?C::TinyCompiler
http://search.cpan.org/perldoc?C::TCC
http://search.cpan.org/perldoc?XS::TCC

The tcc compiler supports code generation for both x86
(32-bit and 64-bit) and ARM platforms, and it compiles an order
of magnitude faster than gcc: small routines can compile
in milliseconds!

This JIT compiling capability could enable the PDL3
build process to be streamlined by compiling only a
limited number of generic versions of the various code
loops by default. Say, a version that casts values to
double, processes them, and then converts the result back.
That would reduce the size of the static code base and
build time.
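
As a rough illustration of the "cast to double, process, convert back" idea, here is a minimal C sketch. The `elem_type` tags, the helper names and the sample operation (x*x + 1) are made up for this example; they are not PDL's actual type codes or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element-type tags; not PDL's actual type codes. */
typedef enum { ELEM_INT16, ELEM_INT32, ELEM_FLOAT, ELEM_DOUBLE } elem_type;

/* Read element i of an array of the given type, widened to double. */
static double load_as_double(const void *data, elem_type t, size_t i)
{
    switch (t) {
    case ELEM_INT16:  return (double)((const int16_t *)data)[i];
    case ELEM_INT32:  return (double)((const int32_t *)data)[i];
    case ELEM_FLOAT:  return (double)((const float *)data)[i];
    case ELEM_DOUBLE: return ((const double *)data)[i];
    }
    assert(!"unknown element type");
    return 0.0;
}

/* Write a double back to element i, narrowing to the array's own type. */
static void store_from_double(void *data, elem_type t, size_t i, double v)
{
    switch (t) {
    case ELEM_INT16:  ((int16_t *)data)[i] = (int16_t)v; break;
    case ELEM_INT32:  ((int32_t *)data)[i] = (int32_t)v; break;
    case ELEM_FLOAT:  ((float *)data)[i]   = (float)v;   break;
    case ELEM_DOUBLE: ((double *)data)[i]  = v;          break;
    }
}

/* The single statically compiled loop: out[i] = in[i]*in[i] + 1 for any type. */
void square_plus_one_generic(const void *in, void *out, elem_type t, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double x = load_as_double(in, t, i);   /* cast to double */
        double y = x * x + 1.0;                /* process */
        store_from_double(out, t, i, y);       /* convert back */
    }
}
```

The per-element switch is the price of having only one compiled loop; a type-specific version generated at runtime would drop it.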

Then, the code could be generated at runtime for any
non-generic routines. This also opens up runtime
optimization strategies that could improve cache and
memory bandwidth usage.

The sky's the limit! :-)

--Chris

David Mertens

2013-12-19 14:22:04 UTC

Chris -

You are reading my mind! There's one major optimization I'm working on with
my TCC stuff at the moment: the ability to share declaration information
between tcc compiler states. If I can get that to work, then we can reduce
the amount of code actually processed by an individual compiler state to
the absolute, bare minimum. I'll be happy to expand on those details once
things are actually working.

More to come, hopefully in the next couple of days. I'm working on this as
we speak. :-)

David

_______________________________________________
PDL-porters mailing list
http://mailman.jach.hawaii.edu/mailman/listinfo/pdl-porters

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." -- Brian Kernighan

Henning Glawe

2013-12-19 19:03:55 UTC

Post by Chris Marshall
This JIT compiling capability could enable the PDL3
build process to be streamlined by compiling only a
limited number of generic versions of the various code
loops by default.

With my Debian developer hat on: this would limit PDL's portability to the
architectures supported by TCC...

--
c u
henning

David Mertens

2013-12-19 21:11:14 UTC

Indeed, though I don't know that PDL is getting any use on architectures
other than Intel-based. I would be *happy* to hear that somebody is running
it on ARM.

But this suggests that PDL needs some sort of fallback mechanism, with JIT
stuff only as an enhancement when on supported architectures.
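
One way to picture such a fallback, as a minimal C sketch: every routine has a statically compiled baseline, and a JIT backend is consulted only if one is present and succeeds. All names here (`sum_fn`, `jit_backend`, `resolve_sum`, the two demo backends) are hypothetical, and no real tcc calls are made; the "JIT" result is just a precompiled stand-in.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by every implementation of the routine. */
typedef double (*sum_fn)(const double *x, size_t n);

/* A backend tries to produce a faster routine; NULL means "not available". */
typedef sum_fn (*jit_backend)(void);

/* Statically compiled baseline: works anywhere a C compiler does. */
static double sum_static(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += x[i];
    return total;
}

/* Stand-in for what a JIT might emit: same sum, two accumulators. */
static double sum_unrolled(const double *x, size_t n)
{
    double even = 0.0, odd = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        even += x[i];
        odd  += x[i + 1];
    }
    if (i < n)
        even += x[i];
    return even + odd;
}

/* Hypothetical backend on an architecture the JIT does not target. */
sum_fn backend_unsupported(void) { return NULL; }

/* Hypothetical backend that "succeeds"; a real one would compile C source. */
sum_fn backend_demo(void) { return sum_unrolled; }

/* Use the JIT result when there is one, otherwise fall back to static. */
sum_fn resolve_sum(jit_backend backend)
{
    if (backend != NULL) {
        sum_fn fast = backend();
        if (fast != NULL)
            return fast;
    }
    return sum_static;
}
```

Callers only ever see a `sum_fn`, so an unsupported architecture costs nothing beyond the missed speedup.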

David

Craig DeForest

2013-12-19 22:13:23 UTC

I haven't done so in years, but I ran it on an i-opener back in the day :-). Also I've used it on a Gumstix (which is ARM).

Chris Marshall

2013-12-19 22:50:44 UTC

Post by David Mertens
Indeed, though I don't know that PDL is getting any use on architectures
other than Intel-based. I would be *happy* to hear that somebody is running
it on ARM.

Raspberry Pi has an ARM processor...

Post by David Mertens
But this suggests that PDL needs some sort of fallback mechanism, with JIT
stuff only as an enhancement when on supported architectures.

My thought was to develop a continuum of PP engines/engineering
to support PDL3:

- Static config and compile a la PDL::PP currently
- Inline::Pdlpp (an out-of-core, slow, JIT framework)
- JIT compilers
  - TCC (x86 and ARM)
  - LLVM/clang (many)
  - maybe even Java...

Portability and robustness are two of the primary goals for PDL
and I wish to see them maintained. The nice thing about moving
to a more dynamic, modern OO framework for PDL is that it allows
us to provide alternate methods for compilation and code
optimization that are abstracted away from the baseline, generic
implementation.

--Chris

Post by Henning Glawe

With my Debian developer hat on: this would limit PDL's portability to the
architectures supported by TCC...

Only the tcc-based JIT would be affected; other options are
already available in principle, and optimization of the standby
static build and/or Inline::C could take place incrementally.

Of course, all would happen much more easily if we had a
good-sized set of developers on all the major PDL platforms.

--Chris

Craig DeForest

2013-12-19 22:54:00 UTC

I like the idea of keeping the core (whatever that eventually means) static and allowing dynamic compilation as a per-host compilation feature (maybe on by default, maybe not). Even with tinycc, there's significant overhead to compiling and linking new pieces of code, and I worry that overall performance could suffer if *everything* were JIT-compiled to machine code.

Chris Marshall

2013-12-19 23:05:44 UTC

Post by Craig DeForest
I like the idea of keeping the core (whatever that eventually means) static
and allowing dynamic compilation as a per-host compilation feature (maybe
on by default, maybe not). Even with tinycc, there's significant overhead to
compiling and linking new pieces of code, and I worry that overall
performance could suffer if *everything* were JIT-compiled to machine code.

I wouldn't plan on *everything* being JIT compiled although that
would work. In that case, the key would be that you wouldn't
have to compile an entire big-honking.pd. You could JIT compile
on a method-by-method basis with the existing static default
available as well.

Additionally, tcc can compile and link in the traditional fashion so that
JIT-ed routines could be saved to disk, avoiding re-JIT compiles.
Of course, there is a wide range of tradeoffs in where, when and why
PDL routines are compiled. The nice thing is that a common framework
could support many variations in a seamless/transparent fashion.

One example of interest to me is related to the need for multiple
PDL routines doing the same thing except for signature, for example
index and index1d. It would be possible to have the default
implementation be index and have the "optimized" version be
essentially index1d.
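
To make the index/index1d contrast concrete, here is a small C sketch of the two shapes. It is only an analogy under stated assumptions: the function names and the function-pointer "driver" are invented for the example and do not reflect how PDL::PP actually generates or threads these routines.

```c
#include <assert.h>
#include <stddef.h>

/* Default form, in the spirit of index: one lookup, c = a[ind]. */
double index_scalar(const double *a, size_t n, size_t ind)
{
    assert(ind < n);
    return a[ind];
}

/* Type of a scalar lookup routine that a generic driver can call. */
typedef double (*lookup_fn)(const double *a, size_t n, size_t ind);

/* Generic driver: loops over the index vector and calls the scalar
 * routine through a pointer for every element. */
void index_threaded(lookup_fn f, const double *a, size_t n,
                    const size_t *ind, size_t m, double *c)
{
    for (size_t j = 0; j < m; j++)
        c[j] = f(a, n, ind[j]);
}

/* Specialized form, in the spirit of index1d: the loop over the m
 * indices is part of the routine itself, so the lookup is inlined. */
void index1d_fused(const double *a, size_t n,
                   const size_t *ind, size_t m, double *c)
{
    for (size_t j = 0; j < m; j++) {
        assert(ind[j] < n);
        c[j] = a[ind[j]];
    }
}
```

Both produce the same output; the fused form is the kind of variant that could be generated on demand rather than maintained by hand as a second routine.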

Cheers,
Chris

Craig DeForest

2013-12-19 23:08:51 UTC

Yep, I like that vision a lot too.

David, have you benchmarked object code from your TCC stuff against a -O3 heavyweight compiler? Does it even make a difference for the kind of stuff we do (mostly RAM-limited vector operations)?

David Mertens

2013-12-26 16:54:28 UTC

Craig -

I have not benchmarked the object code generated by tcc against that of a
traditional C compiler, but tcc's output is known to be slower than gcc
-O3. The comparison will be pretty easy once I am ready, but I've focused
on a feature to share declarations between compiler contexts. This will
make one-off compilation units about as cheap as possible. Once I have
that working, I'll work on some benchmarks. I had hoped to get all of
this done before the New Year, but it may not happen until January.

Stay tuned!
David
