recipe procedure: align #3

Closed
semio opened this issue Jul 18, 2016 · 7 comments

semio commented Jul 18, 2016

Reason

The same entity may have different keys in different datasets. For example, the key for a country entity may be the ISO code in one dataset, while in another it may be the alphanumeric form of the country name. When creating a new dataset, we may want to use the datapoints from one dataset and the country keys from the other, so we need a procedure that changes the country values in the datapoints to match the other dataset.

EDIT: I found that this is an automated version of translate_column. translate_column takes a dictionary as input, whereas align generates the dictionary on the fly. I'm not sure whether it would be good to combine this function into translate_column.

API

The align procedure translates the index column in an ingredient so that it aligns with another ingredient.

The align procedure accepts the following options:

  • ingredients: the target ingredient to translate
  • search_cols: the columns in the base ingredient to search for values
  • to_find: the column in the target ingredient that contains the values to search for
  • to_replace: the column whose values are replaced with the new values once the search finishes
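
The options above could be read roughly as follows. This is only a minimal sketch in plain Python, not the actual implementation: ingredients are modelled as lists of dicts, and the `base_key` parameter (the column of the base ingredient that holds the key to align to) is a hypothetical addition, since the option list does not say where the new value comes from.

```python
def align(target_rows, base_rows, search_cols, to_find, to_replace, base_key):
    """Return copies of target_rows with `to_replace` set to the matching base key.

    Assumption: `base_key` is a hypothetical parameter naming the base column
    that holds the key we want to align to.
    """
    # Map every searchable value in the base ingredient to that row's key.
    lookup = {}
    for row in base_rows:
        for col in search_cols:
            value = row.get(col)
            if value is not None:
                # Assumption: the first base row that mentions a value wins.
                lookup.setdefault(value, row[base_key])

    aligned = []
    for row in target_rows:
        new_row = dict(row)  # never mutate the input rows
        match = lookup.get(row[to_find])
        if match is not None:
            new_row[to_replace] = match
        aligned.append(new_row)
    return aligned


# Illustrative data only: one dataset uses names, the other ISO codes.
base = [
    {"geo": "chn", "name": "China", "iso3": "CHN"},
    {"geo": "swe", "name": "Sweden", "iso3": "SWE"},
]
datapoints = [
    {"country": "China", "year": 2000},
    {"country": "SWE", "year": 2000},
]
aligned = align(datapoints, base, search_cols=["name", "iso3"],
                to_find="country", to_replace="country", base_key="geo")
```

Rows whose value is not found in any search column are left unchanged here; whether they should instead be dropped or raise an error is an open design choice.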
semio added the recipe label Jul 18, 2016
semio added a commit that referenced this issue Aug 2, 2016
related issue: #3

function working in some cases, need more testing

semio commented Aug 3, 2016

example:

https://github.com/semio/ddf--gapminder--systema_globalis/blob/feature/autogenerated/etl/recipes/recipe_gapminder.yaml

This recipe replaces the country column in the datapoints so that it aligns with the GW geo entity.

semio closed this as completed Aug 10, 2016

jheeffer commented Oct 2, 2016

How could we clean up the API? I think align can be merged into translate_column under one clear API.

{
  procedure: 'translate_column',
  ingredient: 'wdi-countries',              // ingredient which will be translated
  column: 'country',                        // column in ingredient which will be translated
  target_column: 'geo',                     // column where translations will be written to. 
                                            // Overwrites when target_column already exists. 
                                            // Creates new column when target_column does not exist. 
                                            // Defaults to 'column' value.
  dictionary: { 
    ingredient: 'gw-entities',              // ingredient from which to build dictionary
    key/from: 'name' || ['name','alt1','alt2'],  // columns which form dictionary keys. May default to ingredient value columns?
    value/to: 'geo'                            // column which forms dictionary value. May default to ingredient key column?
  },
  dictionary: 'path/to/dictionary.json',    // path to a json file containing a dictionary
  dictionary: {                             // inline dictionary
    China: 'chn',
    Sweden: 'swe'
  },
  not_found: 'error' || 'drop' || 'include',  // action on row when column contains value not in dictionary keys. Defaults to include.
}

Of course, only one of the three dictionary forms can be given.
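
To make the proposed semantics concrete, here is a rough Python sketch of the inline-dictionary case with target_column and not_found. It is an illustration under assumptions, not the real procedure: rows are plain dicts, and the reading of 'include' (keep the row and carry the untranslated value over) is a guess, since the proposal only says the row is kept.

```python
def translate_column(rows, column, dictionary, target_column=None, not_found="include"):
    """Translate `column` through `dictionary`, writing into `target_column`."""
    if not_found not in ("error", "drop", "include"):
        raise ValueError(f"unknown not_found action: {not_found!r}")
    # target_column defaults to the source column, as in the proposal.
    target_column = target_column or column

    result = []
    for row in rows:
        value = row[column]
        if value in dictionary:
            new_row = dict(row)
            new_row[target_column] = dictionary[value]
            result.append(new_row)
        elif not_found == "error":
            raise KeyError(f"{value!r} not found in dictionary")
        elif not_found == "include":
            # Assumption: keep the row and carry the untranslated value over.
            new_row = dict(row)
            new_row[target_column] = value
            result.append(new_row)
        # not_found == "drop": skip the row entirely.
    return result


inline_dictionary = {"China": "chn", "Sweden": "swe"}
rows = [{"country": "China"}, {"country": "Sweden"}, {"country": "Atlantis"}]

dropped = translate_column(rows, "country", inline_dictionary,
                           target_column="geo", not_found="drop")
included = translate_column(rows, "country", inline_dictionary, target_column="geo")
```

With 'drop' the unmatched third row disappears; with the default 'include' it is kept with its original value copied into geo.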

Questions:

  1. Does this seem a clear API to you? Does it allow for the same functionality as translate_column and align have now? Are there other problems with it?
    1. Use key/value or from/to for dictionary? key/value is technical, from/to is a bit more descriptive.
  2. How to resolve the ambiguity between dictionaries that have an object value (inline vs. ingredient)?
    1. Maybe three different dictionary properties? dictionary_from_file, dictionary_from_ingredient, dictionary_inline?
    2. dictionary: { type: 'inline' || 'file' || 'ingredient', .. }
    3. dictionary_type: 'inline' || 'file' || 'ingredient', dictionary: { }
    4. Assume an ingredient dictionary when the object contains only the ingredient, key and value properties. Have option 3 as a fallback for conflicting cases.
    5. Something else?


jheeffer commented Oct 2, 2016

P.S.

In #28 and in the recipe code I saw the base option for translate_column, which seems to overlap with align. Is that correct?


semio commented Oct 3, 2016

Yes, @jheeffer, your suggestion looks clear. I agree we can merge this function into translate_column, and this will also cover what we need for #28.

My suggestions:

  1. I think we should put the options inside options, as we do in other functions.
  2. Using a type option is better than adding 3 different options for the 3 kinds of dictionary, because we can add other types later and keep using the dictionary option.
{
  procedure: 'translate_column',
  ingredient: 'wdi-countries',              // ingredient which will be translated
  options: {
    column: 'country',                      // column in ingredient which will be translated
    target_column: 'geo',                   // column where translations will be written to.
                                            // Overwrites when target_column already exists.
                                            // Creates new column when target_column does not exist.
                                            // Defaults to 'column' value.
    type: 'ingredient',                     // type of dictionary
    dictionary: {
      ingredient: 'gw-entities',              // ingredient from which to build dictionary
      key: 'name' || ['name','alt1','alt2'],  // columns which form dictionary keys
      value: 'geo'                            // column which forms dictionary value
    }
  }
}
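
For the ingredient-type dictionary, building the lookup from key and value might look like the sketch below. It assumes the base ingredient is a list of dicts and that key may be either one column name or a list of them, as in the example above. The helper name `build_dictionary`, the sample alternative names, and the choice to raise on conflicting mappings are all illustrative assumptions.

```python
def build_dictionary(base_rows, key, value):
    """Build {key-column value: value-column value} from a base ingredient."""
    # `key` may be a single column name or a list of column names.
    key_cols = [key] if isinstance(key, str) else list(key)

    dictionary = {}
    for row in base_rows:
        for col in key_cols:
            k = row.get(col)
            if k is None or k == "":
                continue  # skip empty alternative-name cells
            # Assumption: a key pointing at two different values is an error.
            if k in dictionary and dictionary[k] != row[value]:
                raise ValueError(f"ambiguous dictionary key: {k!r}")
            dictionary[k] = row[value]
    return dictionary


# Illustrative stand-in for the 'gw-entities' ingredient.
gw_entities = [
    {"geo": "chn", "name": "China", "alt1": "PRC", "alt2": ""},
    {"geo": "swe", "name": "Sweden", "alt1": "", "alt2": ""},
]

by_name = build_dictionary(gw_entities, key="name", value="geo")
by_any = build_dictionary(gw_entities, key=["name", "alt1", "alt2"], value="geo")
```

The resulting dict has the same shape as an inline dictionary, so the rest of translate_column would not need to know which form was given.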


jheeffer commented Oct 3, 2016

Alright, options is good then. I see you prefer key/value over from/to. I would call type dictionary_type, to make the link to dictionary clear.
Is dictionary_type mandatory, or does it have a default / is it inferred from the dictionary object's properties?


semio commented Dec 8, 2016

Yes, dictionary_type can be inferred from the dictionary object, so we can skip this parameter.
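
One way the inference could work, following the rule suggested earlier in this thread (a string is a file path; an object with only ingredient/key/value properties is an ingredient dictionary; anything else is inline). This is a sketch with a hypothetical helper name, not the committed code.

```python
INGREDIENT_PROPS = {"ingredient", "key", "value"}


def infer_dictionary_type(dictionary):
    """Guess whether `dictionary` is a file path, an ingredient spec or inline."""
    if isinstance(dictionary, str):
        return "file"
    if isinstance(dictionary, dict):
        # Only ingredient/key/value properties, and `ingredient` is present.
        if "ingredient" in dictionary and set(dictionary) <= INGREDIENT_PROPS:
            return "ingredient"
        return "inline"
    raise TypeError(f"unsupported dictionary: {type(dictionary).__name__}")


kinds = [
    infer_dictionary_type("path/to/dictionary.json"),
    infer_dictionary_type({"ingredient": "gw-entities", "key": "name", "value": "geo"}),
    infer_dictionary_type({"China": "chn", "Sweden": "swe"}),
]
```

An inline dictionary that happens to translate the literal value "ingredient" would be misread by this rule; that is the conflicting case for which an explicit dictionary_type would still be needed.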


jheeffer commented Dec 8, 2016

I think this is ready for implementation!

semio added a commit that referenced this issue Dec 9, 2016
for now we will parse the options object as described in
#3 (comment)
semio closed this as completed Dec 19, 2016