Catalog by smart classes #1038
Comments
Since there are several things to change here, I'll be creating different PRs to solve the following points in an incremental manner:
|
Regarding the navigation by categories, I would like to better understand the user's goal and expectations so I have a little more context :) So, leaving aside the implementation details for now, what would be the data flow for final users? For example: starting from a particular country, they would need to obtain the datasets from that country that also belong to a particular category. Is that the case? Are there any restrictions when crossing different entities, or should any combination be allowed? For instance, country-variable-datasets or provider-category-datasets could also represent a user data flow. |
The canonical use case should be:
Only the relations defined at the description need to be implemented. Provider is not included |
I've just added a couple of examples to clarify the doc |
I am not sure about it. I don't know if the user will expect:
I mean, the second option makes sense and is powerful, but maybe I have a preconceived idea and I would expect the first one |
Yep, the idea is to nest filters, so you can filter out datasets during the hierarchy search. So the second option would be the result for that specific search. |
While implementing the nested filters I came across some doubts regarding the navigation API. Let's take for example:
The call As I understand it, starting from a catalog with all available info (datasets, categories, geographies...), we want to apply different filters to narrow the catalog search. Each filter narrows the search a bit more, and when we ask for the list of datasets we obtain the result of that narrower search. We could easily get that behavior with a fluent API, in which the methods that apply the filters return the catalog instance so we can add more filters. The catalog instance keeps track of all the filters and can then start the query process, passing them down. The above example would then look as follows:
Conceptually it makes more sense to me, and it is aligned with the idea of searching the catalog and narrowing the search incrementally. On the other hand, we still have direct access to the entities, so we can do the same as before:
TL;DR: After considering the problem with @alrocar I implemented this version (it was easier to achieve) so we can have something working (see #1069). However, I marked the PR as draft until we can discuss it. |
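To make the fluent idea above concrete, here is a minimal sketch. It is not the code from #1069: the `DATASETS` rows, the `country` and `category` method names and the in-memory filtering are assumptions made only for illustration, and the real catalog would pass the collected filters down to its query layer.

```python
# Hypothetical in-memory metadata; all ids are made up for this sketch.
DATASETS = [
    {"id": "ds_pop_usa", "country": "usa", "category": "demographics"},
    {"id": "ds_fin_usa", "country": "usa", "category": "financial"},
    {"id": "ds_pop_esp", "country": "esp", "category": "demographics"},
]


class Catalog:
    """Collects filters and applies them only when results are requested."""

    def __init__(self):
        self._filters = {}

    def country(self, country_id):
        self._filters["country"] = country_id
        return self  # returning self is what makes the calls chainable

    def category(self, category_id):
        self._filters["category"] = category_id
        return self

    @property
    def datasets(self):
        # A dataset is kept only if it matches every filter collected so far.
        return [
            row["id"]
            for row in DATASETS
            if all(row[key] == value for key, value in self._filters.items())
        ]


usa_demographics = Catalog().country("usa").category("demographics").datasets
usa_all = Catalog().country("usa").datasets
everything = Catalog().datasets
```

Each filter call narrows the search a bit more, and nothing is evaluated until `datasets` is read. In this sketch the filters live on the catalog instance, so a fresh `Catalog()` is needed to start a new search.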
I like the new approach, but I think we should be consistent over the hierarchy. 1 ) You can do: catalog.country('usa') Slug filter has more priority. It's not too important at this moment. |
The catalog needs to be seen as a search tool that returns a list as its search result. In these examples, the user would call the So the equivalent calls would be:
I think it is consistent with the API, but if it is not clear enough maybe we could rename Catalog to CatalogSearch or something like that. Would that be helpful? |
After talking with @alasarr we agreed on that notation and decided to let the entities keep the filters they were created with, so that subsequent queries take them into account :) |
I've detected that some methods are missing. We can get the same results by filtering the catalog, but I think we should have these methods in the classes.
Country
Missing categories methods: cannot get categories available for a country
Category
Missing methods:
|
Missing methods have been added in PR #1093 |
The pandas implementation of the catalog doesn't work very well because of two main issues:
We cannot implement a full extension of the classes without corner cases: Cannot filter catalog using standard Pandas #1032
The logic of the classes is delegated to the user, which gets quite complicated as the amount of data in the catalog increases.
Thus, let's move to an approach where classes are smarter.
In the definition below, we only include the properties that work as methods. The rest of the properties are not defined here but must appear with the same name as in the metadata DB.
[] denotes a list that will work as an Entity List (see below).
Methods are replaced by properties. I think it's better for a catalog, i.e., Catalog.countries 👍 instead of Catalog.countries()
EntityList
get: It will allow finding by id: Catalog.countries.get('es')
to_dataframe: returns a pandas dataframe of the list.
Classes
Catalog
Catalog.countries => [Country] #Static
Catalog.datasets => [Dataset] #Static
Catalog.categories => [Category] #Static
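The list-valued properties above return an Entity List with `get` and `to_dataframe`. A minimal sketch of such a list type could look like the following. The `Country` stand-in with its `name` attribute, and the choice of returning `None` for an unknown id, are assumptions for illustration only; pandas is imported lazily so that only `to_dataframe` depends on it.

```python
class EntityList(list):
    """A plain list of entities with lookup by id and a pandas export."""

    def get(self, entity_id):
        # Linear scan; returns None when no entity has the given id (assumed behavior).
        for entity in self:
            if entity.id == entity_id:
                return entity
        return None

    def to_dataframe(self):
        # Imported here so the rest of the class works without pandas installed.
        import pandas as pd

        # One row per entity, one column per instance attribute.
        return pd.DataFrame([vars(entity) for entity in self])


class Country:
    """Hypothetical stand-in entity with an id and one metadata property."""

    def __init__(self, country_id, name):
        self.id = country_id
        self.name = name


countries = EntityList([Country("es", "Spain"), Country("us", "United States")])
spain = countries.get("es")
missing = countries.get("xx")
```

Because `EntityList` subclasses `list`, iteration, indexing and `len` keep working as users would expect, and `countries.to_dataframe()` would give a two-row dataframe with `id` and `name` columns.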
Country
Country.get(<country_id>) => Country #Static
Country.id => String
Country.categories => [Category]
Country.datasets => [Dataset]
Country.geographies => [Geography]
Category
Category.get(<category_id>)
Category.id => String
Category.datasets => [Dataset]
Category.geographies => [Geography] # It returns all the geographies with datasets for this category (country and category). This instance of category must be created with the optional parameter category_id
Category.countries => [Country]
Dataset
Dataset.get(<dataset_id>) #Static
Dataset.id => String
Dataset.variables => [Variable]
Dataset.variables_groups => { 'group_1': [Variable], 'group_2': [Variable] } # It removes the concept of Variables Groups!
Dataset.geography => Geography
Variable
Variable.get(<variable_id>) #static
Variable.id => String
Variable.dataset => Dataset
Geography
Geography.get(<geography_id>)
Geography.datasets => [Dataset]
Geography.support => String (admin|quadgrid|postalcodes)
Geography.support_level => 1,2,3,4
Geography.country => Country
If the Geography class is instantiated with a category_id, datasets will return only the datasets that belong to that category.
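The class definitions above can be sketched over a small in-memory table. This is only an illustration of the proposed shape: the `DATASETS` rows, the `_select` helper and the `Entity` base class are hypothetical, plain lists are returned instead of Entity Lists to keep it short, and the real classes would read from the metadata DB.

```python
# Hypothetical rows standing in for the metadata DB (all ids are made up).
DATASETS = [
    {"id": "ds_demo_bg", "country": "usa", "category": "demographics", "geography": "usa_blockgroup"},
    {"id": "ds_fin_bg", "country": "usa", "category": "financial", "geography": "usa_blockgroup"},
    {"id": "ds_demo_muni", "country": "esp", "category": "demographics", "geography": "esp_municipality"},
]


def _select(**filters):
    """Return the rows matching every given filter, skipping filters set to None."""
    active = {key: value for key, value in filters.items() if value is not None}
    return [row for row in DATASETS if all(row[k] == v for k, v in active.items())]


class Entity:
    def __init__(self, entity_id):
        self.id = entity_id

    @classmethod
    def get(cls, entity_id):
        return cls(entity_id)


class Dataset(Entity):
    pass


class Geography(Entity):
    def __init__(self, geography_id, category_id=None):
        super().__init__(geography_id)
        self._category_id = category_id  # optional filter kept from creation

    @property
    def datasets(self):
        rows = _select(geography=self.id, category=self._category_id)
        return [Dataset(row["id"]) for row in rows]


class Category(Entity):
    @property
    def datasets(self):
        return [Dataset(row["id"]) for row in _select(category=self.id)]

    @property
    def geographies(self):
        ids = sorted({row["geography"] for row in _select(category=self.id)})
        # Each geography remembers the category it was reached through.
        return [Geography(geo_id, category_id=self.id) for geo_id in ids]


class Country(Entity):
    @property
    def categories(self):
        ids = sorted({row["category"] for row in _select(country=self.id)})
        return [Category(cat_id) for cat_id in ids]

    @property
    def datasets(self):
        return [Dataset(row["id"]) for row in _select(country=self.id)]


usa_categories = [c.id for c in Country.get("usa").categories]
demo_geographies = Category.get("demographics").geographies
filtered = [d.id for d in demo_geographies[1].datasets]
unfiltered = [d.id for d in Geography.get("usa_blockgroup").datasets]
```

With these rows, a geography reached through `Category.get("demographics").geographies` lists only demographics datasets, while the same geography obtained directly with `Geography.get` lists all of its datasets.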
Usage
Get all categories of a country
Convert a list to pandas
Get all datasets of a category
Get all datasets of a category
Get all boundaries with demographics datasets
Get all demographics datasets for block groups of a country
cc: @alrocar @cmongut