new procedure: filter_item #14

Closed
semio opened this issue Aug 5, 2016 · 6 comments
@semio
Owner

semio commented Aug 5, 2016

Reason

Running the recipe's cooking procedures produces some temporary data. This data should not appear in the final result, so we need a procedure to remove it.

API

filter_item takes two parameters:

  • ingredient: the ingredient to filter
  • items: a list of the items to keep. All items not in this list will be dropped.
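
A minimal sketch of the behaviour described above, assuming the ingredient's data is a mapping from item name to that item's data (as explained later in this thread). The function body and the placeholder values are illustrative assumptions, not the actual Chef implementation.

```python
def filter_item(ingredient_data, items):
    """Return a new dict holding only the entries whose key is in `items`.

    `ingredient_data` is assumed to be a dict of item name -> data.
    """
    wanted = set(items)
    return {name: data for name, data in ingredient_data.items() if name in wanted}


# Placeholder strings stand in for the real per-concept data.
data = {"concept1": "data1", "concept2": "data2", "concept3": "data3"}
result = filter_item(data, ["concept2", "concept3"])
print(list(result))  # ['concept2', 'concept3']
```

Returning a new dictionary instead of mutating the input keeps the original ingredient usable by later steps; that is a design choice of this sketch only.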
@semio semio added the recipe label Aug 5, 2016
@semio semio closed this as completed Aug 10, 2016
@jheeffer

jheeffer commented Oct 3, 2016

Isn't this better called filter_columns, with a list of columns to keep? Or possibly with an option to give the columns to remove. That seems more consistent with the table terms (e.g. filter_rows) we're using.

@semio
Owner Author

semio commented Oct 10, 2016

No, this is not filtering columns in datapoints. It's filtering keys in the ingredient.

Internally, the data of an ingredient is stored as a dictionary. If we have an ingredient like this:

- id: example-datapoint-ingredient
  dataset:  example-dataset 
  key: geo,time
  value: [concept1, concept2]

When we try to get data from this ingredient, the result will be a dictionary:

{
    "concept1": DataFrame_of_concept1,
    "concept2": DataFrame_of_concept2
}

Then we create a new ingredient where concept3 = concept1 + concept2. Now the data in the new ingredient will be:

{
    "concept1": DataFrame_of_concept1,
    "concept2": DataFrame_of_concept2,
    "concept3": DataFrame_of_concept3
}

But in the final DDF output we only want concept2 and concept3, so we need to filter the ingredient and keep only the concepts we need. That's why I created this procedure.
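
The steps above can be sketched with pandas. This is only an illustration under assumptions: the values are made up, and the merge-based way of computing concept3 is one possible approach, not necessarily what Chef does.

```python
import pandas as pd

key = ["geo", "time"]

# Made-up values, for illustration only.
concept1 = pd.DataFrame({"geo": ["swe", "swe"], "time": [2014, 2015], "concept1": [3, 4]})
concept2 = pd.DataFrame({"geo": ["swe", "swe"], "time": [2014, 2015], "concept2": [3, 6]})

# The ingredient data: one DataFrame per concept.
data = {"concept1": concept1, "concept2": concept2}

# concept3 = concept1 + concept2, aligned on the key columns.
merged = concept1.merge(concept2, on=key)
merged["concept3"] = merged["concept1"] + merged["concept2"]
data["concept3"] = merged[key + ["concept3"]]

# Keep only the concepts wanted in the final output.
keep = ["concept2", "concept3"]
final = {name: df for name, df in data.items() if name in keep}
print(list(final))  # ['concept2', 'concept3']
```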

Another way is to add a filter option to each procedure, so that in each step we can drop the concepts we don't need later. I think either way is OK, but I prefer to do this in a procedure.

P.S. Thanks for asking about filter_columns. For now there is no filter_columns, because I made filter_row drop columns that have only one unique value, and there are no other use cases for filter_columns in SG so far. It would be useful to create this procedure and make filter_row not drop columns by default. #33 #2

@semio semio reopened this Oct 10, 2016
@jheeffer

jheeffer commented Oct 10, 2016

Ok, help me out, cause I'm not getting the difference. If we take your example:

- id: example-datapoint-ingredient
  dataset:  example-dataset 
  key: geo,time
  value: [concept1, concept2]

and put it in a table:

geo  time  concept1  concept2
swe  2015  4         6
swe  2014  3         3

Then do the sum you said

geo  time  concept1  concept2  concept3
swe  2015  4         6         10
swe  2014  3         3         6

And then remove concept1 because we only want concept2 and concept3

geo  time  concept2  concept3
swe  2015  6         10
swe  2014  3         6

Didn't we just remove a column? I'm not sure how this is not removing a column. Maybe it's safe if we say you can only remove columns which are values, not keys? Is that why you don't call it remove a column, cause it's too broad?
You say:

"It's filtering keys in the ingredient"

I don't see how that follows from your example. I just see a column (a concept used as value) being removed.
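
To make the two readings concrete, here is a small pandas sketch that uses the table from this comment. It assumes the dictionary view is simply one DataFrame per value column; under that assumption, selecting columns in the table and keeping entries in the dictionary leave the same value concepts.

```python
import pandas as pd

key = ["geo", "time"]
table = pd.DataFrame({
    "geo": ["swe", "swe"],
    "time": [2015, 2014],
    "concept1": [4, 3],
    "concept2": [6, 3],
})
table["concept3"] = table["concept1"] + table["concept2"]

# Table view: select a subset of columns.
projected = table[key + ["concept2", "concept3"]]

# Dictionary view (assumed): one DataFrame per value column, then keep some entries.
as_dict = {c: table[key + [c]] for c in table.columns if c not in key}
kept = {c: df for c, df in as_dict.items() if c in ["concept2", "concept3"]}

# Both views end up with the same value concepts.
value_columns = [c for c in projected.columns if c not in key]
print(value_columns == list(kept))  # True
```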

Not sure what you meant about filter_row dropping columns with unique values. Can you explain in #2?

@semio
Owner Author

semio commented Oct 10, 2016

Oh, I see your point. There is one difference between the table you used above and the actual representation of the data in the Chef module. In Chef, I am using a dictionary:

{
concept1:

    geo  time  concept1
    swe  2014  10
    swe  2015  16

concept2:

    geo  time  concept2
    chn  2014  6
    swe  2015  3

}

Because a (key, value) pair is usually called an item of a dictionary, I named this procedure filter_item. Its purpose is to filter keys in this data dictionary. Each concept used as a value is a key of this dictionary, which is why you only see columns (concepts used as values) being removed in the example. Am I clear?

In your table representation, filter_item only removes value columns, so we still need a procedure to remove dimension columns; hence I created #33. I understand that both are just parts of a broader column-removing procedure, and it's OK to make them one procedure instead of two without changing the current dictionary data structure.

I chose the dictionary structure because:

  • A dictionary is more memory efficient than one big DataFrame. If one concept has many datapoints that the others don't have, for example a concept with data from 1800-2016 while other concepts have no 1800-1900 datapoints, combining them in one DataFrame creates a lot of N/A values.
  • In our datasets, datapoints are stored in separate files; reading them in and concatenating them into one large DataFrame takes a long time.
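
The N/A effect from the first point can be sketched as follows. The concept names and values are hypothetical; the sketch only counts stored cells and missing values, and makes no measured claim about memory use.

```python
import pandas as pd

# Hypothetical concepts: long_series covers 1800-2016, short_series only 1901-2016.
long_series = pd.DataFrame(
    {"geo": "swe", "time": list(range(1800, 2017)), "long_series": 1.0}
)
short_series = pd.DataFrame(
    {"geo": "swe", "time": list(range(1901, 2017)), "short_series": 2.0}
)

# Dictionary representation: each concept keeps only the rows it has.
as_dict = {"long_series": long_series, "short_series": short_series}
stored_values = sum(len(df) for df in as_dict.values())

# Single wide DataFrame: an outer merge pads the missing years with NaN.
combined = long_series.merge(short_series, on=["geo", "time"], how="outer")
missing = int(combined["short_series"].isna().sum())

print(stored_values)  # 333 values stored across the dictionary
print(len(combined))  # 217 rows in the wide table
print(missing)        # 101 NaN cells, one per year in 1800-1900
```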

@jheeffer

jheeffer commented Oct 10, 2016

Okay, so I understand your implementation details, thanks for explaining! Always good to know the rationale behind it (and have it documented here at least).

  1. I don't think we should let implementation choices influence what the API design is like. It can give us a direction, but sometimes it's better to abstract away from implementation details.
  2. It seems unsafe to be able to remove a dimension column without changing the values. I can't imagine a case where that would lead to valid data? Can you?
    With groupby you can 'remove' a column while applying the needed aggregate functions to make the values work for the new dimensions.
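
A small pandas sketch of the second point, using hypothetical datapoints with a gender dimension. Simply dropping the dimension column leaves duplicate keys, while a groupby removes it and aggregates the values for the remaining dimensions.

```python
import pandas as pd

# Hypothetical datapoints with three dimensions: geo, gender, time.
df = pd.DataFrame({
    "geo": ["swe", "swe", "swe", "swe"],
    "gender": ["female", "male", "female", "male"],
    "time": [2014, 2014, 2015, 2015],
    "population": [10, 11, 12, 13],
})

# Dropping the dimension without touching the values gives duplicate (geo, time) keys.
dropped = df.drop(columns="gender")
has_duplicate_keys = bool(dropped.duplicated(subset=["geo", "time"]).any())
print(has_duplicate_keys)  # True

# Grouping by the remaining dimensions removes "gender" and sums over it.
aggregated = df.groupby(["geo", "time"], as_index=False)["population"].sum()
print(aggregated["population"].tolist())  # [21, 25]
```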

So: I'd have just this function, we don't need both this one and #33.

  1. Because they do the same thing (select a subset of columns). If you like theory: https://en.wikipedia.org/wiki/Projection_(relational_algebra)
  2. Because removing dimensions without changing values doesn't seem to be a use case, unless we can find one or decide to allow potentially dangerous operations.

Depending also on the outcome of #2, we can see what the right name for this function is.

@semio
Owner Author

semio commented Oct 17, 2016

Thanks for the theory link, always good to know more about theories :)

I think we can safely remove a dimension column only when it has just one unique value. That's what filter_row will do, and I will explain it in #2.
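
The rule above can be sketched like this. The helper name drop_constant_dimensions and the sample data are hypothetical; this is not the actual filter_row code, only an illustration of dropping dimension columns that hold a single unique value.

```python
import pandas as pd


def drop_constant_dimensions(df, dimensions):
    """Drop the dimension columns that contain exactly one unique value.

    Hypothetical helper for illustration; not part of Chef.
    """
    constant = [d for d in dimensions if df[d].nunique() == 1]
    return df.drop(columns=constant)


# Hypothetical datapoints, e.g. after rows were filtered to gender == "female".
df = pd.DataFrame({
    "geo": ["swe", "chn"],
    "gender": ["female", "female"],
    "time": [2014, 2015],
    "population": [10, 12],
})

result = drop_constant_dimensions(df, ["geo", "gender", "time"])
print(result.columns.tolist())  # ['geo', 'time', 'population']
```

Because every remaining row shared the same gender value, the (geo, time) pairs still identify each datapoint after the column is gone.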

@semio semio closed this as completed Jul 18, 2017