[Discussion] Better aggregate metric for dataset comparison #324

Closed
patcon opened this issue Nov 28, 2016 · 1 comment

patcon commented Nov 28, 2016

Context

The City of Toronto led the way in Canada with the initial creation of its data portal and its dataset offerings. In recent years, the open data community has felt that there's been some stagnation in the City's open data policies, and in the departmental embrace of the underlying principles. There is renewed debate in Toronto about how the City can do better.

During these conversations, City staff stakeholders (in particular Harvey Low) have repeatedly expressed frustration at the metric the community uses to shallowly compare progress between cities -- often dataset counts. They've rightfully pointed out that dataset organization greatly colours any comparison. For example, it's frequently mentioned that the City of Toronto packages city-wide data together as a single dataset, whereas NYC often releases borough-specific datasets.

To be clear, the community critique of City of Toronto open data policy is more nuanced than criticism of the dataset count (e.g. it concerns the value of datasets to citizens, rather than numerical criteria). But the city staff definitely have a point: having dataset count as the only overall metric for comparing cities does a disservice to the conversation.

It would be great to use the Data Package Spec as a launch point to discuss better metrics, so that criticism can be accounted for in the comparison of open data policy between cities.


Solution

I feel the following would resolve the above concerns for Tabular Data Packages:

  1. Add a boolean dataColumn property to field metadata, marking the specific columns that hold countable data.
  2. Add an integer dataPointCount property to resource metadata (and perhaps a summed total in the overall data package metadata).

Since the columns that contain significantly countable data are labelled as such, we can easily script the generation of the data point count. At the portal level, we could then have a much better basis of comparison both within cities (e.g. between city departments, districts, or stewards) and between cities themselves.
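As a rough illustration of how that scripting might look, here is a minimal Python sketch. It assumes the proposed `dataColumn` flag sits on each field in the resource's `schema`, and that a "data point" is any non-empty cell in a flagged column; neither assumption is part of the current spec. The `count_data_points` helper, the `ward-stats` resource, and the sample CSV values are all made up for the example.

```python
import csv
import io


def count_data_points(resource, csv_text):
    """Count non-empty cells in columns flagged with the proposed dataColumn property."""
    # Collect the names of fields that opt in via the hypothetical dataColumn flag.
    data_columns = [
        field["name"]
        for field in resource["schema"]["fields"]
        if field.get("dataColumn", False)
    ]
    count = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in data_columns:
            value = row.get(name)
            # Assumption: empty or missing cells do not count as data points.
            if value is not None and value.strip() != "":
                count += 1
    return count


# Hypothetical resource descriptor; "ward" is an identifier, not countable data.
resource = {
    "name": "ward-stats",
    "schema": {
        "fields": [
            {"name": "ward", "type": "string"},
            {"name": "population", "type": "integer", "dataColumn": True},
            {"name": "median_income", "type": "integer", "dataColumn": True},
        ]
    },
}

# Made-up sample data; the second row is missing median_income.
csv_text = "ward,population,median_income\n1,50000,62000\n2,48000,\n3,51000,58000\n"

# Store the result under the proposed dataPointCount property.
resource["dataPointCount"] = count_data_points(resource, csv_text)
print(resource["dataPointCount"])  # prints 5
```

Under this definition, a city-wide dataset and the equivalent set of borough-level datasets would report the same total, since the count follows the cells rather than the way the files are split up.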

Would the above suggestion be something we'd consider adding to the spec? Obviously, I'm interested in further conversation and other ideas :)

pwalsh (Member) commented Feb 5, 2017

Hey @patcon

We are doing lots of work on data quality tooling and specs, which I know you know as you are using goodtables.

I'm super interested in codifying data points other than the raw count of published datasets, as part of a much wider discussion around open data portals and so on.

In terms of what can be specified in these specs, let's continue this discussion over at #364 and I'll close this for now as a duplicate.
