Rework the .hdf5/.npz model format #99

Closed
superbock opened this issue Feb 29, 2016 · 5 comments · Fixed by #110

@superbock
Collaborator

Right now the model format is suited only to RNNs (for historical reasons), but we should extend/rework it so that it can handle everything. I hereby propose the following:

  • we keep .hdf5/.npz as our primary formats, since this works quite well
  • to flatten a .hdf5 file to .npz (a flat format) we use the names (HDF5_object.name) as keys
  • objects and functions are stored as a group (arbitrary names are allowed)
  • the arguments needed to instantiate an object are stored as datasets with the name of the argument
  • in case a list of objects is needed as an argument, these get nested in a group again (with the argument as the group name)
  • to determine the class/type of an object/function, we save it as a "type" attribute
  • HDF5 attributes are stored with key "attrs/attribute_name" in the NPZ dictionary
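
A minimal sketch of the flattening rule proposed above, assuming a plain nested dict as a stand-in for the HDF5 tree (the `flatten` helper and the dict layout are hypothetical, not madmom code). It also assumes that attribute keys are prefixed with the name of the object they belong to, which the proposal does not spell out:

```python
# Hypothetical stand-in for an HDF5 tree: each group is a dict with optional
# "attrs", "datasets" and "groups" entries. Real code would walk an h5py.File.
def flatten(group, name=""):
    """Flatten a nested group into a flat {key: value} dict (NPZ-like)."""
    flat = {}
    # Assumption: attributes are stored as "<object name>/attrs/<attribute>".
    for attr, value in group.get("attrs", {}).items():
        flat[f"{name}/attrs/{attr}"] = value
    # Datasets are stored under their full, HDF5-style name.
    for dataset, value in group.get("datasets", {}).items():
        flat[f"{name}/{dataset}"] = value
    # Recurse into nested groups, extending the name like HDF5_object.name.
    for child, subgroup in group.get("groups", {}).items():
        flat.update(flatten(subgroup, f"{name}/{child}"))
    return flat


# Tiny example tree, loosely following the layout in this issue.
model = {
    "groups": {
        "rnn": {
            "attrs": {"type": "madmom.ml.rnn.RecurrentNeuralNetwork"},
            "groups": {
                "layers": {
                    "groups": {
                        "hidden_layer_0": {
                            "attrs": {"type": "madmom.ml.rnn.RecurrentLayer",
                                      "id": 0},
                            "datasets": {"weights": [[0.1, 0.2]],
                                         "bias": [0.0]},
                        },
                    },
                },
            },
        },
    },
}

flat = flatten(model)
for key in sorted(flat):
    print(key)
```

The resulting flat dict could then be passed to something like `numpy.savez(path, **flat)`, with array values in place of the lists used here.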

Questions:

  • how do we store the order of the items in a list? Should we attach an "id" attribute to the group or rather encode the order in the group name (making it sortable)?
  • how do we determine if an instance of the class is wanted or just the class/function itself?

Example:

h5py.File
> Group: rnn
  > Attribute: type='madmom.ml.rnn.RecurrentNeuralNetwork'
  > Group: layers
    > Group: hidden_layer_0
      > Attribute: type='madmom.ml.rnn.BidirectionalLayer'
      > Attribute: id=0
      > Group: fwd_layer
        > Attribute: type='madmom.ml.rnn.RecurrentLayer'
        > Dataset: weights
        > Dataset: bias
        > Dataset: recurrent_weights
        > Group: transfer_fn
          > Attribute: type='madmom.ml.rnn.tanh'
      > Group: bwd_layer
        > Attribute: type='madmom.ml.rnn.RecurrentLayer'
        > Dataset: weights
        > Dataset: bias
        > Dataset: recurrent_weights
        > Group: transfer_fn
          > Attribute: type='madmom.ml.rnn.tanh'
    > Group: hidden_layer_1
      > Attribute: type='madmom.ml.rnn.BidirectionalLayer'
      > Attribute: id=1
      > Group: fwd_layer
        > Attribute: type='madmom.ml.rnn.RecurrentLayer'
        > Dataset: weights
        > Dataset: bias
        > Dataset: recurrent_weights
        > Group: transfer_fn
          > Attribute: type='madmom.ml.rnn.tanh'
      > Group: bwd_layer
        > Attribute: type='madmom.ml.rnn.RecurrentLayer'
        > Dataset: weights
        > Dataset: bias
        > Dataset: recurrent_weights
        > Group: transfer_fn
          > Attribute: type='madmom.ml.rnn.tanh'
@superbock superbock added this to the v0.14 milestone Feb 29, 2016
@fdlm
Contributor

fdlm commented Mar 3, 2016

Seems fine to me. Some thoughts:

  • If a group represents a list, we will need an attribute type='list' so we can handle it appropriately
  • In this case, the names of the nested groups do not matter (they are just items in a list) and we could use them to define the order. However, we need to keep in mind that the group names are strings, and therefore the ordering of numbers is not "natural" (e.g., '0', '1', '10', '2', ...). The creator of the model file would thus be responsible for formatting the names accordingly (e.g. '00', '01', '02', '10'). We get the sorting for free, and thus our code is simpler. However, we are dependent on the order in which h5py returns the groups, which might change (although it probably won't)
  • If we define an 'id' attribute, we would need to first go through all the groups, get the id attributes, and sort the groups accordingly. This is easy: l = [(g_.attrs['id'], g_) for g_ in g.itervalues()]; l.sort(). It seems to me that this is the clearer and more future-proof solution.
  • how do we determine if an instance of the class is wanted or just the class/function itself? By adding an attribute instantiate=False if you want the class/function itself
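
The points above could be sketched roughly as follows. This is not the linked prototype; it again uses plain dicts in place of h5py groups, and the `resolve` and `build` helpers as well as the dict layout are hypothetical. Standard-library classes and functions stand in for the madmom types so that the sketch runs on its own:

```python
import importlib


def resolve(dotted_name):
    """Import 'package.module.name' and return the named object."""
    module_name, _, attr = dotted_name.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def build(group):
    """Rebuild an object from a group given as a plain dict (assumed layout)."""
    attrs = group.get("attrs", {})
    if attrs.get("type") == "list":
        # Order the items by their 'id' attribute; the group names are ignored.
        items = sorted(group["groups"].values(),
                       key=lambda g: g["attrs"]["id"])
        return [build(item) for item in items]
    obj = resolve(attrs["type"])
    if not attrs.get("instantiate", True):
        # instantiate=False: return the class/function itself.
        return obj
    # Datasets and nested groups become keyword arguments of the constructor.
    kwargs = dict(group.get("datasets", {}))
    for name, subgroup in group.get("groups", {}).items():
        kwargs[name] = build(subgroup)
    return obj(**kwargs)


spec = {
    "attrs": {"type": "list"},
    "groups": {
        # The names are deliberately out of order; only the ids matter.
        "b": {"attrs": {"type": "fractions.Fraction", "id": 0},
              "datasets": {"numerator": 1, "denominator": 2}},
        "a": {"attrs": {"type": "math.tanh", "id": 1, "instantiate": False}},
    },
}

result = build(spec)
print(result)
```

Sorting on the 'id' attribute alone (via `key=`) avoids comparing the groups themselves, which sorting a list of `(id, group)` tuples would do if two ids were ever equal.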

@fdlm
Contributor

fdlm commented Mar 3, 2016

Here's a prototype (seems to work, didn't check if the produced model is correct):

Code: https://gist.github.com/fdlm/b4be1190af0bfc9f2e7a
HDF5-File: https://drive.google.com/file/d/0B0gBhdh1fIPKT3BDR2NJY2JVRkU/view?usp=sharing

@superbock
Collaborator Author

Regarding your points:

  • type='list' is straightforward and in line with my suggestions
  • I prefer 3) (an 'id' attribute) over 2) (encoding the order in the group names), since the ids can be generated automatically when saving a model to HDF5/NPZ format
  • I'm fine with the instantiate attribute as well

Let's do it this way.

@superbock
Collaborator Author

As outlined in #102, I think the right way of doing this is to add the loading functionality to the Processor class and adapt the load() method to be able to handle not only pickled but also .hdf5 and .npz files -- preferably by adding a dedicated method for each format and load() just acting as a wrapper.

The dump() method should be adapted to be able to save the processor in the desired format.
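
A rough sketch of what such a wrapper could look like, assuming dispatch on the file extension. The class below is a minimal stand-in and not madmom's actual Processor; the `_load_*` method names are hypothetical, and the HDF5/NPZ loaders are left as stubs:

```python
import os
import pickle


class Processor:
    """Minimal stand-in for a processor; not madmom's actual class."""

    def __init__(self, weights=None):
        self.weights = weights

    @classmethod
    def load(cls, path):
        """Dispatch on the file extension to a format-specific loader."""
        loaders = {".hdf5": cls._load_hdf5, ".npz": cls._load_npz}
        ext = os.path.splitext(path)[1].lower()
        # Anything that is not .hdf5/.npz is treated as a pickle.
        return loaders.get(ext, cls._load_pickle)(path)

    @staticmethod
    def _load_pickle(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    @staticmethod
    def _load_hdf5(path):
        raise NotImplementedError("HDF5 loading is left out of this sketch")

    @staticmethod
    def _load_npz(path):
        raise NotImplementedError("NPZ loading is left out of this sketch")

    def dump(self, path):
        """Save the processor; only the pickle format is sketched here."""
        with open(path, "wb") as f:
            pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)
```

With this layout, `Processor.load("model.pkl")` goes through the pickle loader, while `Processor.load("model.hdf5")` reaches the HDF5 stub, so each format can be filled in independently.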

Anything I have missed?

@superbock
Collaborator Author

The whole thing is basically a reimplementation of pickle, so I propose to just use pickle and we're done.

superbock pushed a commit that referenced this issue Mar 9, 2016
Refactor the neural network stuff into ml.nn.
Additionally, the models are simple pickles now; fixes #99.