Hi everyone,
The following are some suggestions. I'd love to help work on them. Sorry for
the length.
- Zaki Mughal
---
I announced adding data frames for PDL several months back
<http://comments.gmane.org/gmane.comp.lang.perl.pdl.general/8335> and my
intention to embed R in Perl. Embedding R in Perl is actually complete now and
just about ready for CPAN <https://github.com/zmughal/embedding-r-in-perl-experiment>
thanks to the help of the gang on #inline <http://inline.ouistreet.com/node/zfp7.html>.
In order to build the data frames and match R types, I created several
subclasses of PDL that handle a subset of PDL functions, but I haven't figured
out a way to wrap all of PDL's functionality systematically. I have several
thoughts on this.
## Levels of measurement
When using R, one of the nice things it does is warn or give an error
when you try an operation that would be invalid on a certain type of
data. One such type is categorical data, which R calls factors and for
which I made a subclass of PDL called PDL::Factor. Some of this
behaviour is inspired by the statistical methodology of levels of
measurement <https://en.wikipedia.org/wiki/Level_of_measurement>. I
believe SAS even explicitly allows assigning levels of measurement to
variables.
For example, if I try to apply the mean() function to all the columns of
the Iris data set, I get this warning:
```r
lapply( iris, mean )
#> $Sepal.Length
#> [1] 5.843333
#>
#> $Sepal.Width
#> [1] 3.057333
#>
#> $Petal.Length
#> [1] 3.758
#>
#> $Petal.Width
#> [1] 1.199333
#>
#> $Species
#> [1] NA
#>
#> Warning message:
#> In mean.default(X[[5L]], ...) :
#> argument is not numeric or logical: returning NA
```
`NA` is R's equivalent of `BAD` values, and returning it from `mean()`
makes sense for categorical data. For logical vectors, `mean()` does
something else:
```r
which_setosa <- iris$Species == 'setosa' # this is a logical
mean( which_setosa )
#> [1] 0.3333333
```
This says that one third of the logical values were true, which is a
useful thing for `mean()` to return in that case.
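PDL already behaves the same way for boolean masks, since comparisons
produce a mask of 0s and 1s; `avg()` then gives the fraction of true
values. A quick demonstration:
```perl
use PDL;

my $x    = sequence(6);     # [0 1 2 3 4 5]
my $mask = $x > 3;          # mask of 0s and 1s: [0 0 0 0 1 1]
print $mask->avg, "\n";     # 0.333... -- the fraction of true values
```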
Thinking in terms of levels of measurement can help with another
experiment I'm doing, which is based around tracking the units of
measure used for numerical things in Perl. Code is here
<https://github.com/zmughal/units-experiment/blob/master/overload_override.pl>.
What I do there is use Moo roles to add a unit attribute to numerical
types (Perl scalars, Number::Fraction, PDL, etc.), and whenever a value
goes through an operation, whether by operator overloading or by a
function such as `sum()`, the unit is carried along with it and
manipulated appropriately (you can take the mean of Kelvin, but not of
degrees Celsius). I know that units of measure are messy to implement,
but being able to support auxiliary operations like this will go a long
way toward making PDL flexible.
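Here is a minimal sketch of the idea, not the actual code from the
linked experiment: a small Moo class (rather than a role) that carries
a unit string through addition and refuses to add mismatched units.
```perl
package Quantity;
use Moo;
use overload
    '+'  => \&_add,
    '""' => sub { my ($self) = @_; $self->value . ' ' . $self->unit };

has value => ( is => 'ro', required => 1 );
has unit  => ( is => 'ro', required => 1 );

# Adding two quantities only makes sense when the units agree.
sub _add {
    my ( $self, $other ) = @_;
    die 'unit mismatch: ' . $self->unit . ' vs ' . $other->unit
        unless $other->can('unit') && $self->unit eq $other->unit;
    return Quantity->new(
        value => $self->value + $other->value,
        unit  => $self->unit,
    );
}

package main;
my $t1 = Quantity->new( value => 273.15, unit => 'K' );
my $t2 = Quantity->new( value => 10,     unit => 'K' );
print $t1 + $t2, "\n";    # prints "283.15 K"
```
The real experiment hangs this off numerical types as a role, so that
PDL objects and plain scalars can share the behaviour.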
[Has anyone used udunits2? I made an Alien package for it. It's on CPAN.]
## DataShape and Blaze
I think it would be beneficial to look at the work being done by the Blaze
project <http://blaze.pydata.org/> with its DataShape specification
<http://datashape.pydata.org/>. The idea behind it is to be able to use the
various array-like APIs without having to worry what is going on in the
backend — be it with a CPU-based, GPU-based, SciDB, or even a SQL server.
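For a flavour of the notation, here is my reading of the spec linked
above; the first line describes a fixed 5-by-3 array of doubles, the
second a table with a variable number of rows:
```
5 * 3 * float64
var * { name: string, balance: float64 }
```
A layer like this over PDL could let the same expression dispatch to
different storage backends.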
## Julia
Julia has been doing some amazing things with how they've grown out their
language. I was looking to see if they have anything similar to the dataflow
in PDL and I came across ArrayViews <https://github.com/JuliaLang/ArrayViews.jl>.
It may be enlightening to see how they compose this feature onto already
existing n-d arrays as opposed to how PDL does it.
I do not know what tradeoffs that brings, but it is a starting point to think
about. I think similar approaches can be made to support sparse arrays.
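For reference, this is the PDL behaviour I mean by dataflow: a slice is
a view whose writes flow back to its parent.
```perl
use PDL;

my $parent = sequence(5);          # [0 1 2 3 4]
my $view   = $parent->slice('1:3');
$view .= 0;                        # writes flow back to the parent
print $parent, "\n";               # [0 0 0 0 4]
```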
In fact, one of Julia's strengths is how they use multimethods to handle new
types with ease. See "The Design Impact of Multiple Dispatch" <http://nbviewer.ipython.org/gist/StefanKarpinski/b8fe9dbb36c1427b9f22>
for examples. [Perl 6 has built-in multimethods]
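In case the aside is unfamiliar, this is what built-in multiple
dispatch looks like in Perl 6; each `multi` candidate is selected by
the argument types at the call site:
```perl6
multi sub describe(Int $n) { "an integer: $n" }
multi sub describe(Str $s) { "a string: $s" }

say describe(42);      # an integer: 42
say describe('hi');    # a string: hi
```
Adding a new type means adding a new candidate rather than editing a
central dispatch table, which is how Julia makes new types easy to
handle.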
## MATLAB subclassing
I use MATLAB daily. I came across this area of the documentation that talks
about how to subclass. <http://www.mathworks.com/help/matlab/matlab_oop/subclassing-matlab-built-in-classes.html>
Some of the information in there is good for knowing how *not* to implement
things, but there is also some discussion on what is necessary for the
storage types that might be worth looking at.
[By the way, I have downloaded all of the code on MATLAB Central's File
Exchange, and I could do some analysis of the functions used there if
that would be helpful.]
## GPU and threading
I think it would be best to offload GPU support to other libraries, so it
would be good to extract what is common between libraries like
- MAGMA <http://icl.cs.utk.edu/magma/>,
- ViennaCL <http://viennacl.sourceforge.net/>,
- Blaze-lib <https://code.google.com/p/blaze-lib/>,
- VXL <http://vxl.sourceforge.net/>,
- Spark <http://spark.apache.org/>,
- Torch <http://torch.ch/>,
- Theano <http://www.deeplearning.net/software/theano/>,
- Eigen <http://eigen.tuxfamily.org/>, and
- Armadillo <http://arma.sourceforge.net/>.
Eigen is particularly interesting because it supports storing data in
both row-major and column-major order <http://eigen.tuxfamily.org/dox-devel/group__TopicStorageOrders.html>.
Another source of inspiration would be the VSIPL spec
<http://www.omgwiki.org/hpec/vsipl>. It is a standard for signal
processing routines in the embedded DSP world, and it comes with "Core"
and "Core Lite" profiles which might help decide what should be
included in a smaller subset of PDL.
Also on my wishlist is interoperability with libraries like ITK
<http://www.itk.org/>, VTK <http://www.vtk.org/>, and yt
<http://yt-project.org/>. They have interesting architectures,
especially for computation. Unfortunately, the first two are C++-based
and I don't have experience combining C++ and XS.
## Better testing
PDL should make more guarantees about how types flow through the system. This
might be accomplished by adding assertions in the style of Design-by-Contract
which can act as both a testable spec and documentation. I'm working on the
test suite right now on a branch and I hope to create a proof-of-concept of
this soon.
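As a sketch of what one such assertion could look like as an ordinary
test (a hypothetical contract, not a guarantee PDL makes today):
```perl
use PDL;
use Test::More;

# Contract: elementwise arithmetic preserves the input type
# (a float in, a float out).
my $in = sequence( float, 3 );
is( ( $in + 1 )->type, $in->type, 'elementwise ops preserve type' );

done_testing;
```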
I hope that this can help make PDL more consistent and easily
testable. There are still small inconsistencies that shouldn't be
there, and they can be weeded out with testing. For example, what type
would you expect from this code?
```perl
use PDL;
print stretcher( sequence(float, 3) )->type;
```
I would expect 'float', but it is actually 'double' under PDL v2.007_04.
## Incremental computation
I find that the way I grow my code is to slowly add modules that work
together in a pipeline. Running and rerunning this code through all the
modules is slow. To avoid that, I create multiple small programs that read
and write files to pass from one script to the next. I was looking for a
solution and came across IncPy <http://www.pgbovine.net/incpy.html>. It
modifies the Python interpreter to support automatic persistent memoization.
I don't think the idea has caught on, but I think it should and perhaps Perl
and PDL is flexible enough to herald it as a CPAN module.
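A crude approximation is already possible with the core Memoize module
and a persistent tied hash, though unlike IncPy this won't notice when
the function's code changes (the cache file name below is made up):
```perl
use Memoize;
use Memoize::Storable;

# Persist the memo table across runs of the program.
tie my %cache => 'Memoize::Storable', 'expensive_step.cache';
memoize( 'expensive_step', SCALAR_CACHE => [ HASH => \%cache ] );

sub expensive_step {
    my ($input) = @_;
    # ... imagine a slow pipeline stage here ...
    return $input**2;
}

print expensive_step(42), "\n";   # slow on the first run, cached after
```
Making this automatic and PDL-aware (hashing piddle contents as part of
the memo key) is the part that would need real work.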