Feature request: Add a "pandas field selection layer" to allow saving a specification of what inputs are needed in what order / what the output of the model corresponds to #34
Comments
Hi, thanks for opening the feature request. The standard way of handling a pandas dataframe in TensorFlow is to load the dataframe, convert its data to a tensor type, and then feed it to the Keras model. This process is explained in detail in the document below. Please refer to: https://www.tensorflow.org/tutorials/load_data/pandas_dataframe
Thanks for pointing to this. I am well aware of this, and I already manage to do this just fine in a way quite similar to what you point to. What I suggest is that the spec and selection of the columns to use for the predictors, i.e. in this example the 2 lines:
could be done by a custom Keras layer, so that this becomes part of the model and is saved and loaded automatically. This is a bit like the mean and std in the case of the Normalization layer: it is simple to do by hand, and this is what I used to do before the Normalization layer was implemented, but it is so much simpler to have it be part of the model spec rather than needing to save and load a data structure holding the normalization coefficients. The rationale is that when training many model variants using different sets of features, it becomes heavy and error prone to keep track of this column spec (additional information must be saved alongside the Keras network dump). Let me know if this is unclear. As a user who has to test many model variants, 'playing around' with different sets of predictors, having this selection be part of the model would make my life much simpler and less error prone, even though I already manage to save and load this information by writing additional code.
@jerabaul29 Thank you for the suggestion!
Thanks for the pointers to resources to look into :) . So if I understand correctly, I could get inspiration from https://github.com/keras-team/keras/blob/master/keras/layers/preprocessing/normalization.py + https://keras.io/guides/serialization_and_saving/ to build a custom layer that holds the list of columns to use as its state, and uses this to select data in a pandas dataframe / transform the NN output into a pandas dataframe? And this should work out of the box, since a list of strings (describing the column spec) is a native Python type, right? If this is correct, I think I could actually put together a small snippet of code that does this. Then the question is: do you think this need is common enough, and useful to enough users, that it is worth sharing a layer of this kind through Keras directly, as you do with the Normalization layer? On this last point, I understand the argument for not having a data-type-specific layer. But at the same time, using pandas for preparing the input + Keras to work with it is becoming nearly an "industry standard", so I am wondering if this could be useful to a wide group of users, and whether standardizing how to do this could improve usability :) .
I (naively) tried to start playing around from something along the lines of:

    import numpy as np
    import tensorflow as tf
    import pandas as pd
    import keras


    class PandasSelectionLayer(keras.layers.Layer):
        def __init__(self, list_columns, **kwargs):
            super().__init__(**kwargs)
            self.list_columns = list_columns

        def call(self, input_data):
            assert isinstance(input_data, pd.DataFrame)
            return tf.convert_to_tensor(input_data[:, self.list_columns].to_numpy().astype(np.float32))

        def get_config(self):
            config = super().get_config()
            config.update(
                {
                    "list_columns": self.list_columns,
                }
            )
            return config


    list_cols = ["col_1", "col_2"]
    pandas_input = PandasSelectionLayer(list_columns=list_cols)
    fully_connected = keras.layers.Dense(60, activation="relu")(pandas_input)
    output = keras.layers.Dense(1)(fully_connected)
    keras_model = keras.Model(inputs=pandas_input, outputs=output)

but "of course" this does not work, as the
Another option I could use as a short-term quick fix is to define a wrapping class instead of a custom layer. The wrapping class could combine the pandas column spec + the trained model, and take care of saving / loading both at the same time. This requires a bit of extra boilerplate and would be quite a bit less convenient (for example, it could only take care of an already trained network; otherwise a lot more logic and methods would be needed), so a native layer would be better, but it can be an option. Something like:

    class PandasKeras:
        def __init__(self, trained_keras_network, columns_spec):
            ...

        def load(self):
            ...

        def save(self):
            ...

        def predict(self, pandas_in):
            ...
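The wrapper idea above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration, not code from this thread or from Keras: the class name `PandasModelWrapper`, the `save_spec` / `from_spec` helpers, and the JSON file layout are hypothetical. The model is assumed to be any object with a `predict` method that maps a 2-D array of shape (rows, features) to a 2-D array of shape (rows, outputs), which is how a trained Keras model is typically used; saving the network itself is left to the model's own API.

```python
import json
from pathlib import Path

import numpy as np
import pandas as pd


class PandasModelWrapper:
    """Hypothetical helper: bundles a trained model with its column spec."""

    def __init__(self, model, input_columns, output_columns):
        # Assumption: `model` exposes predict(2-D array) -> 2-D array.
        self.model = model
        self.input_columns = list(input_columns)
        self.output_columns = list(output_columns)

    def predict(self, pandas_in: pd.DataFrame) -> pd.DataFrame:
        missing = [c for c in self.input_columns if c not in pandas_in.columns]
        if missing:
            raise KeyError(f"missing input columns: {missing}")
        # Select columns in the stored order, whatever their order in the dataframe.
        features = pandas_in.loc[:, self.input_columns].to_numpy(dtype=np.float32)
        raw = np.asarray(self.model.predict(features))
        # Label the numeric output and keep the row index of the input.
        return pd.DataFrame(raw, columns=self.output_columns, index=pandas_in.index)

    def save_spec(self, path):
        """Write only the column spec; the model is saved through its own API."""
        spec = {
            "input_columns": self.input_columns,
            "output_columns": self.output_columns,
        }
        Path(path).write_text(json.dumps(spec))

    @classmethod
    def from_spec(cls, model, path):
        """Rebuild a wrapper from an already loaded model and a saved spec file."""
        spec = json.loads(Path(path).read_text())
        return cls(model, spec["input_columns"], spec["output_columns"])


class SumModel:
    """Stand-in for a trained network: returns col_1 + 2 * col_2 for each row."""

    def predict(self, x):
        return x[:, :1] + 2.0 * x[:, 1:2]


pandas_in = pd.DataFrame(
    {"unused": [9, 9], "col_2": [1.0, 2.0], "col_1": [10.0, 20.0]}
)
wrapper = PandasModelWrapper(SumModel(), ["col_1", "col_2"], ["prediction"])
pandas_out = wrapper.predict(pandas_in)
```

This keeps the spec and the model together at prediction time, but the spec still lives in a separate file, which is exactly the bookkeeping a native layer would remove.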
System information.
TensorFlow version (you are using): 2.11.0 (though it should not play a role)
Are you willing to contribute it (Yes/No) : No (I am not familiar enough with the Keras internals, it would take me too much time to get familiar with these)
Describe the feature and the current behavior/state.
I regularly perform error correction / post processing of data, where the data are available as a big pandas dataframe, with each potential "entry to process" as a row in the dataframe, and each column in the dataframe as a data field present in each entry. What usually happens is that the input dataframe has many more columns than I end up using; effectively, I use only a few of the features in the end. To remember which columns I use, and in which order, I then need to save, in addition to the Keras model, a specification of the ordered list of feature columns I use as input to my model. Of course, this is a bit tiring and error prone to do by hand: I must make sure to keep the correct spec alongside the correct Keras model dump, etc.

This reminds me a bit of the "problem" faced when normalizing / denormalizing the data input / output. This used to be a pain (need to save the means and stds separately, and manage them by hand), but it is now super easy to manage thanks to the Normalization layers: since these are part of the network, the user does not need to worry about storing, restoring, and applying these coefficients by hand, and does not need to manage additional files that must be kept alongside the Keras model dump (a process that is simple in theory but error prone in practice). For me, these simple Normalization layers are a huge gain, and I would like to leverage this in the same way for the feature selection / output labeling.

Therefore, my question is the following: could we make the specification of which columns to use from a pandas dataframe, and in what order, part of the specification of the Keras model, by adding a new "pandas field selection layer"? This would remove the tiring and error-prone overhead of bookkeeping, saving, restoring, etc. this spec, which users now have to do by hand.

This could also be used "in reverse" to automatically turn the Keras model output into a pandas dataframe with named column(s). This makes it possible to better / implicitly document the model as a whole (things like "what is it producing and in what units" can now be embedded in the network, through the name of the output column(s)).
I am not an expert, but an API something like the following could be useful, partially copied from https://keras.io/api/layers/preprocessing_layers/numerical/normalization/ (open to discussions / suggestions of improvements of course :) ):
with arguments:
list_columns: the list of pandas columns, like ["column_name_feature_1", "column_name_feature_2", ...]
invert: if False, the layer can be used as the input layer to the model; it takes in a pandas dataframe and generates the individual samples with the ordered features corresponding to list_columns. If True, the layer can be used as the output of the model, and transforms the purely numeric output of the Keras model into a pandas dataframe with column names for each output as specified by list_columns.

The layer would generate a runtime exception if the list of columns cannot be found in the input pandas dataframe. The layer would also have a couple of attributes: .list_columns would return the list_columns list, and a .reverse method would return the "reversed" version of the layer.

So my models would now look like (of course, there could be something other than fully connected layers at the start and end of the "real" neural network layers):
Now calling:
would work out of the box, and pandas_out is a pandas dataframe with the same number of rows as pandas_in, and the set of columns defined in pandas_inv_labeling_layer.list_columns, and all of this metadata is saved / restored with the save and load_model API.
Will this change the current api? How?
This will not change any existing API. It will only add an extra layer that users can opt into, and that provides "automagic" management of metadata and of input and output specs by leveraging pandas dataframe labeling.
Who will benefit from this feature?
Potentially, all users who use pandas as an input to their Keras model, and use a given subset / ordering of the pandas dataframe columns as an input. These users will no longer need to implement the bookkeeping themselves, and can delegate it to a Keras layer that is part of the model spec, dump, load, etc.
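To make the requested behaviour concrete, here is a minimal sketch of the selection / labeling semantics described above (list_columns, invert, the runtime exception on missing columns, and reverse). It is deliberately a plain Python class rather than a Keras layer, so it only illustrates what such a layer would compute. The class name `PandasFieldSelection` is hypothetical, and `get_config` merely mirrors the Keras convention of returning a JSON-serializable dict; no such layer exists in Keras.

```python
import numpy as np
import pandas as pd


class PandasFieldSelection:
    """Hypothetical plain-Python stand-in for the proposed layer."""

    def __init__(self, list_columns, invert=False):
        self.list_columns = list(list_columns)
        self.invert = invert

    def __call__(self, data):
        if self.invert:
            # Numeric model output -> dataframe with one named column per output.
            values = np.asarray(data)
            if values.ndim != 2 or values.shape[1] != len(self.list_columns):
                raise ValueError(
                    f"expected shape (rows, {len(self.list_columns)}), got {values.shape}"
                )
            return pd.DataFrame(values, columns=self.list_columns)
        # Dataframe -> numeric array holding the selected columns, in spec order.
        missing = [c for c in self.list_columns if c not in data.columns]
        if missing:
            raise KeyError(f"columns not found in input dataframe: {missing}")
        return data.loc[:, self.list_columns].to_numpy(dtype=np.float32)

    def reverse(self):
        """Return the same column spec with the direction flipped."""
        return PandasFieldSelection(self.list_columns, invert=not self.invert)

    def get_config(self):
        # A list of strings and a bool are JSON-serializable as they are.
        return {"list_columns": self.list_columns, "invert": self.invert}


pandas_in = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
select = PandasFieldSelection(["b", "a"])
features = select(pandas_in)            # float32 array, columns ordered b, a
relabelled = select.reverse()(features)  # back to a dataframe with named columns
```

Because the whole state is a list of strings and a flag, it would fit in a layer config without any custom serialization logic, which is the point of the request.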