[Pdl-porters] PDL::IO::HDF5 patches from [rcb]

Discussion:

Richard Balint

2014-08-20 03:57:16 UTC

Hello! I'm Richard, and back in 2008, I did a lot of updates for
PDL::IO::HDF5 when I was with Ball Aerospace. Well, I've finally gotten
around to try to get them properly upstream. :)

I used to be in contact with John Cerney back then, and he did move some of
the patches into the tree back then, but there are lots more:

* multi-dimension attributes,
* attributes of varying datatypes,
* compound data types,
* HDF5 constants are dynamically computed at compile time,
* split hdf5.pd into an XS (hdf5.pd) and a perl/POD module (HDF5.template.pm)
for less hair pulling
* Additional test cases
* Added ->close() methods.
* Group.pm/Dataset.pm use inherited class routines when possible for code
consolidation (mostly attribute stuff)
* getGroupsByAttr() now uses regex
* Added pthread compilation option to Makefile.PL

At least that's what my old release notes tell me. ;)

What I wanted to do was raise awareness that I wish to merge request some
or all of my code upstream if possible. But I had a few questions on
interfaces, style, etc.

* Can we use v5.10 :) Consensus on this list seems that PDL is going to
v5.10 in the next few iterations. Can PDL::IO::HDF5 move this way as well?

* The current attribute interface mostly assumes that attributes are text,
and tries to send back perl text if at all possible. Wouldn't it be more
standard to always assume that this returns a PDL, rather than a perl
variable? I know that John mentioned that this may break some older code,
and as a mitigation, maybe have some flags on HDF5 open to force conversion
of text PDLs to Perl text scalars? Thoughts on this? This is probably my
biggest concern. Users pulling attributes and having to test if it is a PDL
versus scalar seems less friendly than PDL text values. :/

* When was the groupIndex cache added? Is this heavily used? If so, I
would need to rework some code to put it back in, due to some slight
behind-the-scenes changes. (The fileObj requirement for groups and datasets
just seems very ugly to me... there's got to be a better way to implement
the caching... but no idea on how to do it yet.)

* Tests are in a random order, would it make sense to have tests run in a
more logical order like file, group, dataset, attributes, etc? So that if
there's a problem in say group, testing fails quicker and doesn't test the
stuff that depends on it?

* Does anyone know if users are still using HDF5 1.4.x? [2007] The code
would be a lot cleaner if we were able to just support HDF5 1.6.x and
1.8.x, which are very similar.

If anyone is interested in perusing my changed so far (only have a few test
files to get re-working) (up through attributes works now), my fork is at:

https://sourceforge.net/u/hackswell/pdl/ci/reorganization/tree/
(which is following the official master.)

Responses, thought-out criticism, ideas, peer-review, etc are all welcome!

-Ricardo!

Richard Balint

2014-08-20 18:11:27 UTC

Permalink

Hello! I'm Richard, and back in 2008, I did a lot of updates for
PDL::IO::HDF5 when I was with Ball Aerospace. Well, I've finally gotten
around to try to get them properly upstream. :)

I used to be in contact with John Cerney back then, and he did move some of
the patches into the tree back then, but there are lots more:

* multi-dimension attributes,
* attributes of varying datatypes,
* compound data types,
* HDF5 constants are dynamically computed at compile time,
* split hdf5.pd into an XS (hdf5.pd) and a perl/POD module (HDF5.template.pm)
for less hair pulling
* Additional test cases
* Added ->close() methods.
* Group.pm/Dataset.pm use inherited class routines when possible for code
consolidation (mostly attribute stuff)
* getGroupsByAttr() now uses regex
* Added pthread compilation option to Makefile.PL

At least that's what my old release notes tell me. ;)

What I wanted to do was raise awareness that I wish to merge request some
or all of my code upstream if possible. But I had a few questions on
interfaces, style, etc.

* What is the style guide concerning cuddled braces, et al? Indentation
levels are with tabs or spaces? If the latter, how many spaces?

* Can we use v5.10 :) Consensus on this list seems that PDL is going to
v5.10 in the next few iterations. Can PDL::IO::HDF5 move this way as well?

* The current attribute interface mostly assumes that attributes are text,
and tries to send back perl text if at all possible. Wouldn't it be more
standard to always assume that this returns a PDL, rather than a perl
variable? I know that John mentioned that this may break some older code,
and as a mitigation, maybe have some flags on HDF5 open to force conversion
of text PDLs to Perl text scalars? Thoughts on this? This is probably my
biggest concern. Users pulling attributes and having to test if it is a PDL
versus scalar seems less friendly than PDL text values. :/

* When was the groupIndex cache added? Is this heavily used? If so, I
would need to rework some code to put it back in, due to some slight
behind-the-scenes changes. (The fileObj requirement for groups and datasets
just seems very ugly to me... there's got to be a better way to implement
the caching... but no idea on how to do it yet.)

* Tests are in a random order, would it make sense to have tests run in a
more logical order like file, group, dataset, attributes, etc? So that if
there's a problem in say group, testing fails quicker and doesn't test the
stuff that depends on it?

* Does anyone know if users are still using HDF5 1.4.x? [2007] The code
would be a lot cleaner if we were able to just support HDF5 1.6.x and
1.8.x, which are very similar.

As a side note, the varlen.hdf5 is incorrect for testing. The attribute
"attr2" on /Dataset has the proper value "dude", but it is a 4 byte
null-terminated string, rather than a variable-length string.

If anyone is interested in perusing my changed so far, my fork is at:
https://sourceforge.net/u/hackswell/pdl/ci/reorganization/tree/

All tests pass, but tests are a mess, and I had to change a few, so peer
review would be welcomed. :) the NN_testname.t tests are the ones I wrote,
the others are the existing tests.

Responses, thought-out criticism, ideas, peer-review, etc are all welcome!

-Ricardo!

John Cerney

2014-08-20 20:04:37 UTC

Permalink

Post by Richard Balint
Hello! I'm Richard, and back in 2008, I did a lot of updates for
PDL::IO::HDF5 when I was with Ball Aerospace. Well, I've finally
gotten around to try to get them properly upstream. :)
I used to be in contact with John Cerney back then, and he did move
* multi-dimension attributes,
* attributes of varying datatypes,
* compound data types,
* HDF5 constants are dynamically computed at compile time,
* split hdf5.pd into an XS (hdf5.pd) and a perl/POD module
(HDF5.template.pm <http://HDF5.template.pm>) for less hair pulling
* Additional test cases
* Added ->close() methods.
* Group.pm/Dataset.pm use inherited class routines when possible for
code consolidation (mostly attribute stuff)
* getGroupsByAttr() now uses regex
* Added pthread compilation option to Makefile.PL
At least that's what my old release notes tell me. ;)
What I wanted to do was raise awareness that I wish to merge request
some or all of my code upstream if possible. But I had a few
questions on interfaces, style, etc.
* Can we use v5.10 :) Consensus on this list seems that PDL is
going to v5.10 in the next few iterations. Can PDL::IO::HDF5 move this
way as well?

I have no issues with using 5.10.

Post by Richard Balint
* The current attribute interface mostly assumes that attributes are
text, and tries to send back perl text if at all possible. Wouldn't
it be more standard to always assume that this returns a PDL, rather
than a perl variable? I know that John mentioned that this may break
some older code, and as a mitigation, maybe have some flags on HDF5
open to force conversion of text PDLs to Perl text scalars? Thoughts
on this? This is probably my biggest concern. Users pulling
attributes and having to test if it is a PDL versus scalar seems less
friendly than PDL text values. :/

This would definitely break older code. That is my biggest concern with
any changes. Changes should add functionality, but not break old code.
The existing test cases should still pass with any updated changes.

Post by Richard Balint
* When was the groupIndex cache added? Is this heavily used? If so,
I would need to rework some code to put it back in, due to some slight
behind-the-scenes changes. (The fileObj requirement for groups and
datasets just seems very ugly to me... there's got to be a better way
to implement the caching... but no idea on how to do it yet.)

groupindex is used heavily here.

Post by Richard Balint
* Tests are in a random order, would it make sense to have tests run
in a more logical order like file, group, dataset, attributes, etc?
So that if there's a problem in say group, testing fails quicker and
doesn't test the stuff that depends on it?

Re-arranging tests shouldn't be a problem.

Post by Richard Balint
* Does anyone know if users are still using HDF5 1.4.x? [2007] The
code would be a lot cleaner if we were able to just support HDF5 1.6.x
and 1.8.x, which are very similar.

I agree, it is probably time to drop 1.4 support.

Post by Richard Balint
If anyone is interested in perusing my changed so far (only have a few
test files to get re-working) (up through attributes works now), my
https://sourceforge.net/u/hackswell/pdl/ci/reorganization/tree/
(which is following the official master.)
Responses, thought-out criticism, ideas, peer-review, etc are all welcome!
-Ricardo!
_______________________________________________
PDL-porters mailing list
http://mailman.jach.hawaii.edu/mailman/listinfo/pdl-porters