new procedure: filter_item #14

semio · 2016-08-05T00:32:28Z

Reason

Running the recipe's cooking procedures produces some temporary data. We need to remove it from the final result, so we need a procedure for this task.

API

filter_item takes two parameters:

ingredient: the ingredient to filter
items: a list of the items to keep. All items not in this list will be dropped.


jheeffer · 2016-10-03T10:56:34Z

Isn't this better called filter_columns with a list of columns to keep? Or possibly option to give columns to remove. Seems more consistent with the table terms (e.g. filter_rows) we're using.

semio · 2016-10-10T02:43:57Z

No, this is not filtering columns in datapoints. It's filtering keys in the ingredient.

Internally, the data of an ingredient is stored as a dictionary. If we have an ingredient like this:

- id: example-datapoint-ingredient
  dataset:  example-dataset 
  key: geo,time
  value: [concept1, concept2]

When we try to get data from this ingredient, the result will be a dictionary

{
    "concept1": DataFrame_of_concept1,
    "concept2": DataFrame_of_concept2
}

Then we create a new ingredient where concept3 = concept1 + concept2. Now the data in new ingredient will be

{
    "concept1": DataFrame_of_concept1,
    "concept2": DataFrame_of_concept2,
    "concept3": DataFrame_of_concept3
}

But in the final DDF output we only want concept2 and concept3, so we need to filter the ingredient and keep only the concepts we need. That's why I created this procedure.
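A minimal sketch of the filtering described above, assuming the ingredient data is a plain dict keyed by concept name. The function body and the placeholder strings standing in for DataFrames are illustrative assumptions, not the actual Chef implementation.

```python
def filter_item(data, items):
    """Keep only the entries of `data` whose key is listed in `items`."""
    keep = set(items)
    return {key: value for key, value in data.items() if key in keep}


# Placeholder strings stand in for the per-concept DataFrames.
ingredient_data = {
    "concept1": "DataFrame_of_concept1",
    "concept2": "DataFrame_of_concept2",
    "concept3": "DataFrame_of_concept3",
}

filtered = filter_item(ingredient_data, ["concept2", "concept3"])
print(sorted(filtered))  # ['concept2', 'concept3']
```

The input dict is left untouched; the function returns a new dict, so earlier steps in a recipe could still refer to the unfiltered data.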

Another way is to add a filter option to each procedure, so that in each step we can drop the concepts we don't need later. I think either way is OK, but I prefer to do this in a procedure.

P.S. Thanks for asking about filter_columns. For now there is no filter_columns because I made filter_row drop those columns with unique values, and there are no other use cases for filter_columns in SG so far. It would be useful to create this procedure and make filter_row not drop columns by default. #33 #2

jheeffer · 2016-10-10T06:37:23Z

Ok, help me out, cause I'm not getting the difference. If we take your example:

- id: example-datapoint-ingredient
  dataset:  example-dataset 
  key: geo,time
  value: [concept1, concept2]

and put it in a table:

geo	time	concept1	concept2
swe	2015	4	6
swe	2014	3	3

Then do the sum you said

geo	time	concept1	concept2	concept3
swe	2015	4	6	10
swe	2014	3	3	6

And then remove concept1 because we only want concept2 and concept3

geo	time	concept2	concept3
swe	2015	6	10
swe	2014	3	6

Didn't we just remove a column? I'm not sure how this is not removing a column. Maybe it's safe if we say you can only remove columns which are values, not keys? Is that why you don't call it remove a column, cause it's too broad?
You say

It's filtering keys in the ingredient

I don't see how that follows from your example. I just see a column/(concept-used-as-value) be removed.

Not sure what you meant about filter_row dropping columns with unique values. Can you explain in #2?

semio · 2016-10-10T07:56:08Z

Oh, I see your point. There is one difference between the table you used above and the actual representation of the data in the Chef module. In Chef, I am using a dictionary:

{
concept1:

geo	time	concept1
swe	2014	10
swe	2015	16

concept2:

geo	time	concept2
chn	2014	6
swe	2015	3

}

We usually call a (key, value) pair an item of a dictionary, so I named this procedure filter_item. Its usage is to filter keys in this data dictionary. A concept used as a value is a key of this dictionary, which is why you only see columns (concepts used as values) being removed in the example. Am I clear?

In your table representation, filter_item only removes value columns, so we still need a procedure to remove dimension columns, hence I created #33. I understand both are just parts of a broader column-removing procedure, and it's OK to merge them into one procedure instead of two without changing the current dictionary data structure.

I chose the dictionary structure because:

A dictionary is more memory efficient than one big DataFrame. If one concept has many datapoints that the others lack, for example a concept with data for 1800-2016 while the other concepts have no 1800-1900 datapoints, combining them in one DataFrame creates a lot of N/A values.
In our datasets, datapoints are stored in separate files; reading them in and concatenating them into one large DataFrame takes a long time.
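To make the first point concrete, here is a rough count of stored cells in each layout. The concept names and year coverage are made-up assumptions for a single geo, and plain Python ranges stand in for the real DataFrames.

```python
# Years covered by each concept (hypothetical coverage, one geo only).
coverage = {
    "concept1": range(1800, 2017),  # 1800-2016
    "concept2": range(1901, 2017),  # no 1800-1900 datapoints
    "concept3": range(1901, 2017),  # no 1800-1900 datapoints
}

# Dictionary layout: each concept stores only the rows it actually has.
cells_in_dict = sum(len(years) for years in coverage.values())

# Single wide table: one row per year in the union, one cell per concept.
all_years = set().union(*coverage.values())
cells_in_wide_table = len(all_years) * len(coverage)

na_cells = cells_in_wide_table - cells_in_dict
print(cells_in_dict, cells_in_wide_table, na_cells)  # 449 651 202
```

In this toy case almost a third of the wide table would be N/A, and the gap grows with every concept that lacks the early years.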

jheeffer · 2016-10-10T08:19:43Z

Okay, so I understand your implementation details, thanks for explaining! Always good to know the rationale behind it (and have it documented here at least).

I don't think we should let implementation choices influence what the API design is like. It can give us a direction, but sometimes it's better to abstract away from implementation details.
It seems unsafe to be able to remove a dimension column without changing the values. I can't imagine a case where that would lead to valid data? Can you?
With groupby you can 'remove' a column while applying the needed aggregate functions to make the values work for the new dimensions.
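As a sketch of that groupby idea, assuming datapoints are rows with key (geo, time): dropping geo only gives valid data if the values are aggregated over it. The helper name `drop_dimension`, the sample rows, and the choice of `sum` as aggregate are illustrative assumptions, with plain lists of dicts standing in for DataFrames.

```python
from collections import defaultdict

# Hypothetical datapoints with key (geo, time) and one value column.
rows = [
    {"geo": "swe", "time": 2014, "concept2": 3},
    {"geo": "swe", "time": 2015, "concept2": 6},
    {"geo": "chn", "time": 2014, "concept2": 5},
]


def drop_dimension(rows, keep_dims, value, aggregate=sum):
    """Group rows by the remaining dimensions and aggregate the value."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[d] for d in keep_dims)].append(row[value])
    return [
        {**dict(zip(keep_dims, key)), value: aggregate(values)}
        for key, values in groups.items()
    ]


by_time = drop_dimension(rows, ["time"], "concept2")
print(by_time)
# [{'time': 2014, 'concept2': 8}, {'time': 2015, 'concept2': 6}]
```

Simply deleting the geo column instead would leave two rows for 2014 with the same key, which is the invalid result the comment above warns about.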

So: I'd have just this function, we don't need both this one and #33.

Because they do the same thing (select a subset of columns). If you like theory: https://en.wikipedia.org/wiki/Projection_(relational_algebra)
Because removing dimensions without changing values does not seem to be a use case, unless we can find one or decide to allow potentially dangerous operations.

Depending also on the outcome of #2 , we can see what the right naming of this function is.

semio · 2016-10-17T08:23:02Z

Thanks for the theory link, always good to know more about theories :)

I think we can safely remove a dimension column only when it has just one unique value. That's what filter_row will do, and I will explain in #2.
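A small sketch of that safety check, assuming rows are plain dicts rather than a DataFrame. The helper name `droppable_dimensions` is hypothetical and is not part of Chef.

```python
def droppable_dimensions(rows, dims):
    """Return the dimensions that hold exactly one unique value."""
    return [d for d in dims if len({row[d] for row in rows}) == 1]


# After filtering rows down to a single geo, only geo is constant.
rows = [
    {"geo": "swe", "time": 2014, "concept2": 3},
    {"geo": "swe", "time": 2015, "concept2": 3},
]

print(droppable_dimensions(rows, ["geo", "time"]))  # ['geo']
```

Dropping geo here loses no information, because every remaining row is still uniquely identified by time.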

semio added the recipe label Aug 5, 2016

semio closed this as completed Aug 10, 2016

semio reopened this Oct 10, 2016

jheeffer mentioned this issue Oct 10, 2016

recipe procedure: filter_row #2

Closed

semio closed this as completed Jul 18, 2017
