Catalog by smart classes #1038

Closed
alasarr opened this issue Sep 29, 2019 · 12 comments · Fixed by #1086 or #1093

@alasarr
Contributor

alasarr commented Sep 29, 2019

The pandas implementation of the catalog doesn't work very well because of these two main issues:

Thus, let's move to an approach where classes are smarter.

At the definition, we're just including the properties that work as methods. The rest of the properties are not defined but must appear with the same name as in the metadata DB.

[] Denotes a list that will work as an Entity List (see below)

Methods are replaced by properties. I think it's better for a catalog, i.e., Catalog.countries 👍 instead of Catalog.countries()
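As a rough illustration of the property-style access proposed here: in Python 3, a property that can be read on the class itself (the entries marked #Static) can be defined on a metaclass. This is only a sketch under assumptions; CatalogMeta, _fetch and _FAKE_DB are names invented for the example and are not part of the library.

```python
class CatalogMeta(type):
    """Metaclass so that properties can be read on the class itself."""

    @property
    def countries(cls):
        # Hypothetical loader; a real one would query the metadata DB.
        return cls._fetch('countries')


class Catalog(metaclass=CatalogMeta):
    # Stand-in data, invented for this example.
    _FAKE_DB = {'countries': ['esp', 'usa']}

    @classmethod
    def _fetch(cls, entity):
        # Return a copy so callers cannot mutate the stand-in data.
        return list(cls._FAKE_DB[entity])


print(Catalog.countries)
```

With this shape, Catalog.countries works without parentheses, while Catalog.countries() would fail because a list is not callable.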

EntityList

  1. get: It will allow finding by id: Catalog.countries.get('es')

  2. to_dataframe: returns a pandas dataframe of the list.
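A minimal sketch of what such an Entity List could look like, assuming each entity exposes an id attribute. The Country class here is a stand-in written for the example, and returning None for an unknown id is an assumption, not something decided in this issue.

```python
class EntityList(list):
    """List of catalog entities with lookup by id (illustrative sketch)."""

    def get(self, item_id):
        # Return the first entity whose id matches, or None if there is none.
        return next((item for item in self if item.id == item_id), None)

    def to_dataframe(self):
        # pandas is imported lazily so the class can be used without it.
        import pandas as pd
        return pd.DataFrame([vars(item) for item in self])


class Country:
    # Stand-in entity with only an id; real entities would carry metadata.
    def __init__(self, entity_id):
        self.id = entity_id


countries = EntityList([Country('es'), Country('usa')])
print(countries.get('es').id)
```

Because EntityList subclasses list, indexing and iteration keep working as usual, so countries[0] and countries.get('es') refer to the same object.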

Classes

Catalog

Catalog.countries => [Country] #Static
Catalog.datasets => [Dataset] #Static
Catalog.categories => [Category] #Static

Country

Country.get(<country_id>) => Country #Static
Country.id => String
Country.categories => [Category]
Country.datasets => [Dataset]
Country.geographies => [Geography]

Category

Category.get(<category_id>)
Category.id => String
Category.datasets => [Dataset]
Category.geographies => [Geography] # It returns all the geographies with datasets for this category (country and category). This instance of Category must be created with the optional parameter category_id
Category.countries => [Country]

Dataset

Dataset.get(<dataset_id>) #Static
Dataset.id => String
Dataset.variables => [Variable]
Dataset.variables_groups => { 'group_1': [Variable], 'group_2': [Variable] } # It removes the concept of Variables Groups!
Dataset.geography => Geography

Variable
Variable.get(<variable_id>) #Static
Variable.id => String
Variable.dataset => Dataset

Geography

Geography.get(<geography_id>)
Geography.datasets => [Dataset]
Geography.support => String (admin|quadgrid|postalcodes)
Geography.support_level => 1,2,3,4
Geography.country => Country

If the Geography class is instantiated with a category_id, the datasets property will return only the datasets that belong to that category.

Usage

Get all categories of a country

Country.get('usa').categories

Convert a list to pandas

Country.get('usa').categories.to_dataframe().head()
Country.get('usa').geographies.to_dataframe().head()
Country.get('usa').datasets.to_dataframe().head()

Get all datasets of a category

Country.get('usa').categories.get('demographics').datasets

Get all datasets of a category

Category.get('demographics').countries.get('usa').datasets

Get all boundaries with demographics datasets

Country.get('usa').categories.get('demographics').geographies

Get all demographics datasets for block groups of a country

Country.get('usa').categories.get('demographics').geographies.get('block_groups').datasets

cc: @alrocar @cmongut

@esloho
Contributor

esloho commented Sep 30, 2019

Since there are several things to change here, I'll be creating different PRs to solve the following points in an incremental manner:

@esloho
Contributor

esloho commented Sep 30, 2019

Regarding the navigation by categories, I would like to better understand the user's goal and expectation on this so I have a little more context :)

So, leaving aside the implementation details for now, what would be the data flow for final users? For example: starting from a particular country, they would need to obtain the datasets from that country that also belong to a particular category. Is that the case? Are there any restrictions when crossing different entities, or should any combination be allowed? For example, country-variable-datasets or provider-category-datasets could also represent a user data flow.

@alasarr
Contributor Author

alasarr commented Sep 30, 2019

So, leaving aside the implementation details for now, what would be the data flow for final users? For example: starting from a particular country, they would need to obtain the datasets from that country that also belong to a particular category. Is that the case?

The canonical use case should be: country->category->geography->datasets. But we will also allow them to access data with fewer filters: country->datasets (get all datasets for one country) or country->category->datasets (get all datasets of one category for one country).

Are there any restrictions when crossing different entities, or should any combination be allowed? For example, country-variable-datasets or provider-category-datasets could also represent a user data flow.

Only the relations defined in the description need to be implemented. Provider is not included.

@alasarr
Contributor Author

alasarr commented Sep 30, 2019

I've just added a couple of examples to clarify the doc

@simon-contreras-deel
Contributor

Catalog().countries.get('es').categories.get('financial').datasets 

I am not sure about it. I don't know if the user will expect:

  • datasets from financial category
  • datasets from Spain and financial category

I mean, the second option makes sense and is powerful, but maybe I have a preconceived idea and I would expect the first one.

@alrocar
Contributor

alrocar commented Oct 7, 2019

Yep, the idea is to nest filters, so you can filter out datasets during the hierarchy search.

So the second option would be the result for that specific search.
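To make the nesting concrete, here is a small sketch in which each entity inherits the filters of the entity it was reached from. Everything in it is an assumption made for illustration: the DATASETS rows are invented, and the category() helper is a shortcut standing in for categories.get(...).

```python
# Invented sample rows standing in for the metadata DB.
DATASETS = [
    {'id': 'es_fin', 'country': 'es', 'category': 'financial'},
    {'id': 'us_fin', 'country': 'usa', 'category': 'financial'},
    {'id': 'es_demo', 'country': 'es', 'category': 'demographics'},
]


class Entity:
    filter_key = None  # name of the field this entity filters on

    def __init__(self, entity_id, filters=None):
        self.id = entity_id
        # Copy the filters of the parent, then add this entity's own.
        self.filters = dict(filters or {})
        self.filters[self.filter_key] = entity_id

    @property
    def datasets(self):
        # Keep only the rows that satisfy every accumulated filter.
        return [row['id'] for row in DATASETS
                if all(row[key] == value
                       for key, value in self.filters.items())]


class Category(Entity):
    filter_key = 'category'


class Country(Entity):
    filter_key = 'country'

    def category(self, category_id):
        # The child entity starts from this country's filters.
        return Category(category_id, self.filters)


print(Country('es').category('financial').datasets)
print(Category('financial').datasets)
```

Reached through a country, the financial category only sees Spanish datasets; created on its own, it sees the financial datasets of every country.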

@esloho
Contributor

esloho commented Oct 8, 2019

While implementing the nested filters I came across some doubts regarding the navigation API.

Let's take for example:

catalog.countries.get('usa').categories.get('demographics').datasets

The call catalog.countries.get('usa') (or its equivalent, Country.get('usa')) returns an instance of Country. Asking it for categories is hard to understand conceptually, since a country has no categories at all. Also, each instance is responsible for keeping track of the full chain of calls up to that moment. This makes the implementation harder (besides the conceptual issue mentioned above).

As I understand, what we want here is, starting from a catalog with all available info (datasets, categories, geographies...), we want to apply different filters so we can narrow the catalog search. Each filter narrows the search a bit more, and when we ask for the list of datasets we obtain the result of that narrower search.

We could easily have that behavior with a fluent API for that part, in which the methods applying the filters return the catalog instance so we can add more filters. The catalog instance keeps track of all the filters and can then initiate the query process passing them down.

The above example would then look as follows:

catalog.country('usa').category('demographics').datasets

Conceptually it makes more sense to me, aligned with the idea of searching the catalog and narrowing the search incrementally.
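A minimal sketch of that fluent shape, assuming an in-memory list in place of the real query layer. The _DATASETS rows and the _filters attribute are invented for the example and do not reflect the implementation in the PR.

```python
class Catalog:
    # Invented sample rows standing in for the metadata DB.
    _DATASETS = [
        {'id': 'us_demo', 'country': 'usa', 'category': 'demographics'},
        {'id': 'us_fin', 'country': 'usa', 'category': 'financial'},
        {'id': 'es_demo', 'country': 'esp', 'category': 'demographics'},
    ]

    def __init__(self):
        self._filters = {}

    def country(self, country_id):
        self._filters['country'] = country_id
        return self  # returning the catalog lets calls be chained

    def category(self, category_id):
        self._filters['category'] = category_id
        return self

    @property
    def datasets(self):
        # Run the search with every filter collected so far.
        return [row['id'] for row in self._DATASETS
                if all(row[key] == value
                       for key, value in self._filters.items())]


catalog = Catalog()
print(catalog.country('usa').category('demographics').datasets)
```

One consequence of returning the same instance is that the filters stay set on it, so a later catalog.datasets on that object is still narrowed; a real implementation would probably need a way to reset them or would return a fresh object per call.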

On the other hand, we still have direct access to the entities so we can do the same we could before:

Dataset.get('some.super.cool.dataset').to_dataframe()

TL;DR: After considering the problem with @alrocar, I implemented this version (it was easier to achieve) so we can have something working (see #1069). However, I marked the PR as a draft until we can discuss it.

@alasarr
Contributor Author

alasarr commented Oct 9, 2019

I like the new approach, but I think we should be consistent over the hierarchy.

1) You can do: catalog.country('usa')
2) Also: catalog.country('usa').category('demographics')
3) Also: catalog.country('usa').category('demographics').geographies
4) Also: catalog.country('usa').category('demographics').geography().datasets
5) You should be allowed to do: catalog.country('usa').category('demographics').dataset()
6) You should be allowed to do: catalog.country('usa').category('demographics').dataset().variables

Slug filter has more priority. It's not too important at this moment.

@esloho
Contributor

esloho commented Oct 10, 2019

  1. You should be allowed to catalog.country('usa').category('demographics').dataset()
  2. You should be allowed to catalog.country('usa').category('demographics').dataset().variables

catalog needs to be seen as a search tool and it returns a list as search result. In these examples, the user would call the datasets property and obtain a CatalogList of Dataset instances. The user can then access the first element and "play" with it.

So the equivalent calls would be:

  • catalog.country('usa').category('demographics').datasets[0]
  • catalog.country('usa').category('demographics').datasets[0].variables

I think it is consistent with the API, but if it is not clear enough maybe we could rename Catalog to CatalogSearch or something like that. Would that be helpful?

@esloho
Contributor

esloho commented Oct 10, 2019

After talking with @alasarr we agreed on that notation and decided to allow the entities to keep the filters of their creation so following queries will take them into account :)

@alasarr
Contributor Author

alasarr commented Oct 11, 2019

I've noticed that some methods are missing.

We can get the same results by filtering the catalog, but I think we should have these methods in the classes.

Country

Missing categories methods: cannot get categories available for a country

Country.get('esp').categories

Category

Missing methods:

Category.get('points_of_interest').countries => Returns countries with data at category points_of_interest
Category.get('points_of_interest').geographies => Returns geographies with data at category points_of_interest

@esloho
Contributor

esloho commented Oct 11, 2019

Missing methods have been added in PR #1093

4 participants