-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More json schema #1
More json schema #1
Conversation
common definitions.
First of all, this is amazing - thank you for refining this @ian-r-rose! I'll leave a few comments in-line to keep them in context and out of the main thread. |
provider/common.json
Outdated
"$id": "https://github.com/CityofLosAngeles/mobility-data-specification/blob/master/common", | ||
"$schema": "http://json-schema.org/draft-06/schema#", | ||
"title": "The MDS Provider Schema, common definitions", | ||
"definitions": {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a duplicate "definitions"
key here (the empty one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
provider/common.json
Outdated
"examples": [ | ||
"0.1.0" | ||
], | ||
"pattern": "^0\.1\.[0-9]+$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like making the version explicit with the pattern, nice!
provider/status_change.json
Outdated
"$id": "#/properties/data", | ||
"type": "object", | ||
"title": "The page of data", | ||
"required": [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm missing something, shouldn't "status_changes"
be inside "required"
here, since we are saying this is the spec for that payload specifically?
provider/status_change.json
Outdated
"description": "The type of propulsion; allows multiple values", | ||
"items": { | ||
"type": "string", | ||
"enum": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add electric_assist
to this enum, openmobilityfoundation#48
provider/status_change.json
Outdated
"rebalance_pick_up", | ||
"maintenance_pick_up" | ||
] | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For event_type
and event_type_reason
, it would be great if we could get even more granular with this schema to indicate that some reasons
apply to some event_types
, and others do not. I'm not sure how to represent that or if it's possible... we have like a hierarchical enum (taxonomy?) here 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that should be possible. Let me look more closely at how to use some of these more complex anyOf
/allOf
stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I might punt on this for now -- I don't quite see a way to make event_type_reason
contextually a different type. Maybe we can think on this for a follow-up?
provider/trips.json
Outdated
"$id": "#/properties/data", | ||
"type": "object", | ||
"title": "The page of data", | ||
"required": [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, "trips"
should be "required"
for this schema, right?
provider/trips.json
Outdated
"description": "The type of propulsion; allows multiple values", | ||
"items": { | ||
"type": "string", | ||
"enum": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, "electric_assist"
was added to this list.
Is there a way to refactor this (and vehicle_type
) out into common.json
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, happy to try to refactor that into the common definitions.
provider/FeatureCollectionMDS.json
Outdated
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a quick glance this looks good too... it covers the case of 1 single point, right?
Came across this discussion on SO, which is also related to geometry as it turns out. I guess I hoped there would be an "official" GeoJSON
spec somewhere that we could just reference in, rather than re-implement it like this. Is that hoping for too much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reading of the spec was that it should have at least two points, so I tried to have that reflected here. Happy to change it to one if that is more correct.
There is an official GeoJSON spec for this, to be found here. Unfortunately, JSON schemas don't have a good story for extending interfaces (rather than composing them). So for this, I wrote a python script that does pull the official schema, and then patches it up to conform to the MDS narrative.
I waffled on whether that script was appropriate to include here. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooooo I like that... interesting! I'm curious to see the script in any case. So the issue here is -- we can't use the official spec and require the accuracy
properties or other aspects custom for MDS? Hence the script to combine.
On the question of one point vs. two -- I was hoping we could address openmobilityfoundation#74 via JSON Schema, and so if we had Point as a thing on its own (with a single set of coordinates) it could also be used for the event_location
field on status_change
events instead of the wonky "GPS string" we have now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the script I used for generating the spin on FeatureCollection
. It's short, so I'll inline it:
import json
import requests
# Get the canonical FeatureCollection
r = requests.get("http://geojson.org/schema/FeatureCollection.json")
feature_collection = r.json()
# Modify some metadata
feature_collection["$schema"] = "http://json-schema.org/draft-06/schema#"
feature_collection["$id"] = "https://github.com/CityofLosAngeles/mobility-data-specification/blob/master/FeatureCollectionMDS.json"
feature_collection["title"] = "GeoJSON FeatureCollection (MDS spin)"
features = feature_collection["properties"]["features"]
# Only accept Points in the FeatureCollection
geometry = features["items"]["properties"]["geometry"]
point = next(g for g in geometry["oneOf"] if (g.get("title") == "GeoJSON Point"))
features["items"]["properties"]["geometry"] = point
# Force the FeatureCollection to at least have a start and an end.
feature_collection["properties"]["features"]["minItems"] = 2
# Add a required timestamp property
properties = features["items"]["properties"]["properties"]
properties["required"] = ["timestamp"]
properties["properties"] = {
"timestamp": {
"type": "number",
"minimum": 0.0
}
}
with open('FeatureCollectionMDS.json', 'w') as out:
out.write(json.dumps(feature_collection, indent=2))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really cool, I love how straightforward this is in python!
How about the question of addressing that issue I link above - and making event_location
more fully specified as a GeoJSON Point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's reasonable. Should event_location
be a straight GeoJSON point, or do you think we should modify it in some way? If the former, we could directly reference the URL (though maybe we would want validation to also work in a setting without internet access).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding differences with the actual spec: it seems that the GeoJSON spec for primitive points, polygons, etc is intentially somewhat limited. If we want to add additional properties then it becomes a Feature
(hence the reason that FeatureCollection
is used here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I have made event_location
a Point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you were right, it should be a GeoJSON Feature
so we can add the additional properties (e.g. accuracy
) if needed. Then the route
is a FeatureCollection
that is an array of this MDS-specific Feature
, does that make sense?
Okay, I think I have addressed most of your comments: at least as a first pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good, can you also refactor vehicle_type
since that one is also shared between status_change
and trip
?
Done! |
Do you have an easy way to do this? Here are a few data files (had to rename them to |
Great, that's exactly what I was looking for! |
Still figuring out a good validation story, but some early reviews on your synthetic trips data: jsonschema.exceptions.ValidationError: 'accuracy' is a required property jsonschema.exceptions.ValidationError: 845.9042534919543 is not of type 'integer'
Failed validating 'type' in schema['properties']['data']['properties']['trips']['items']['properties']['trip_duration']: jsonschema.exceptions.ValidationError: 2798.1217587696774 is not of type 'integer'
Failed validating 'type' in schema['properties']['data']['properties']['trips']['items']['properties']['trip_distance']: |
After spending some time with the referencing of external schemas (e.g. Instead, what would you think about each schema being stand-alone? This would mean de-factoring of some of the common definitions. However, we could write a script so that the schemas are generated automatically from something similar to what we already have (similar to what is already done for getting the |
Makes total sense, and I think this is the key insight and should be the overriding concern for this effort. If we can't use the schema to easily validate data, it is worthless |
Yep, it has to be a super smooth process. I'm going to experiment with defactoring of the definitions. Do you think code should live in this repo, or some side-repo? Could be something like
|
I like your folder layout, within this repo. It seems that if a change in the spec were being made in this repo, it might necessitate a corresponding change to said code -- makes sense if it's all contained. Then we might be able to look towards a more formalized "release" process as a series of steps that includes generating the new schema.
-- On The Other Hand -- Just to think about another point of view, since I'm the loser on Windows that can never play nice with all the cool kids on *nix, it might make sense to wrap this up as a container for easier sharing. In that case, a separate repo is probably the way to go... Tagging @hunterowens for your thoughts. TLDR; we're looking to generate a JSON Schema that covers objects for |
Okay, what do you think of these changes? |
My validation script for your sample data looks like import json
import jsonschema
# Test trips
instance = {"version": "0.1.0", "data": {"trips": json.load(open("trips.txt"))}}
schema = json.load(open('provider/trips.json'))
jsonschema.Draft6Validator(schema).validate(instance)
# Test status changes
instance = {"version": "0.1.0", "data": {"status_changes": json.load(open("status_changes.txt"))}}
schema = json.load(open('provider/status_changes.json'))
jsonschema.Draft6Validator(schema).validate(instance) |
🤦♂️ 💥 too easy mate |
500 | ||
] | ||
}, | ||
"actual_cost": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have international currencies but we will probably need to extend this at some point in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good point. We could add an additional currency
field and use the ISO 4217 standard codes.
Status checking this. |
I think we're waiting for some movement on both openmobilityfoundation#74 and openmobilityfoundation#87, both of which I marked for There's also the question of whether the python code that generates the schema files belongs in the MDS repo or elsewhere. It's a fairly simple script. |
We could also merge this so that the PR against the official repo (openmobilityfoundation#53) is updated. |
Ok, let's merge this one and continue the discussion in the main repo, that probably makes more sense and is more visible. |
schema.json
intocommon.json
, for common definitions,trips.json
for the trips endpoint, andstatus_change.json
for the status_change endpoint.version
,links
, anduuid
.FeatureCollection
schema, which has only points in the collection, and those points must have a timestamp property.I haven't really tested it on synthetic data yet.