Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More json schema #1

Merged
merged 18 commits into from
Sep 25, 2018

Conversation

ian-r-rose
Copy link

  • Splits schema.json into common.json, for common definitions, trips.json for the trips endpoint, and status_change.json for the status_change endpoint.
  • Adds definitions for version, links, and uuid.
  • Uses a modified FeatureCollection schema, which has only points in the collection, and those points must have a timestamp property.

I haven't really tested it on synthetic data yet.

@thekaveman
Copy link

First of all, this is amazing - thank you for refining this @ian-r-rose! I'll leave a few comments in-line to keep them in context and out of the main thread.

"$id": "https://github.com/CityofLosAngeles/mobility-data-specification/blob/master/common",
"$schema": "http://json-schema.org/draft-06/schema#",
"title": "The MDS Provider Schema, common definitions",
"definitions": {},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a duplicate "definitions" key here (the empty one)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

"examples": [
"0.1.0"
],
"pattern": "^0\.1\.[0-9]+$"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like making the version explicit with the pattern, nice!

"$id": "#/properties/data",
"type": "object",
"title": "The page of data",
"required": [],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, shouldn't "status_changes" be inside "required" here, since we are saying this is the spec for that payload specifically?

"description": "The type of propulsion; allows multiple values",
"items": {
"type": "string",
"enum": [

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add electric_assist to this enum, openmobilityfoundation#48

"rebalance_pick_up",
"maintenance_pick_up"
]
},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For event_type and event_type_reason, it would be great if we could get even more granular with this schema to indicate that some reasons apply to some event_types, and others do not. I'm not sure how to represent that or if it's possible... we have like a hierarchical enum (taxonomy?) here 😄

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that should be possible. Let me look more closely at how to use some of these more complex anyOf/allOf stuff.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might punt on this for now -- I don't quite see a way to make event_type_reason contextually a different type. Maybe we can think on this for a follow-up?

"$id": "#/properties/data",
"type": "object",
"title": "The page of data",
"required": [],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, "trips" should be "required" for this schema, right?

"description": "The type of propulsion; allows multiple values",
"items": {
"type": "string",
"enum": [

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, "electric_assist" was added to this list.

Is there a way to refactor this (and vehicle_type) out into common.json?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, happy to try to refactor that into the common definitions.

}
}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a quick glance this looks good too... it covers the case of 1 single point, right?

Came across this discussion on SO, which is also related to geometry as it turns out. I guess I hoped there would be an "official" GeoJSON spec somewhere that we could just reference in, rather than re-implement it like this. Is that hoping for too much?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reading of the spec was that it should have at least two points, so I tried to have that reflected here. Happy to change it to one if that is more correct.

There is an official GeoJSON spec for this, to be found here. Unfortunately, JSON schemas don't have a good story for extending interfaces (rather than composing them). So for this, I wrote a python script that does pull the official schema, and then patches it up to conform to the MDS narrative.

I waffled on whether that script was appropriate to include here. What do you think?

Copy link

@thekaveman thekaveman Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooooo I like that... interesting! I'm curious to see the script in any case. So the issue here is -- we can't use the official spec and require the accuracy properties or other aspects custom for MDS? Hence the script to combine.

On the question of one point vs. two -- I was hoping we could address openmobilityfoundation#74 via JSON Schema, and so if we had Point as a thing on its own (with a single set of coordinates) it could also be used for the event_location field on status_change events instead of the wonky "GPS string" we have now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the script I used for generating the spin on FeatureCollection. It's short, so I'll inline it:

import json
import requests

# Get the canonical FeatureCollection
r = requests.get("http://geojson.org/schema/FeatureCollection.json")
feature_collection = r.json()

# Modify some metadata
feature_collection["$schema"] = "http://json-schema.org/draft-06/schema#"
feature_collection["$id"] = "https://github.com/CityofLosAngeles/mobility-data-specification/blob/master/FeatureCollectionMDS.json"
feature_collection["title"] = "GeoJSON FeatureCollection (MDS spin)"

features = feature_collection["properties"]["features"]

# Only accept Points in the FeatureCollection
geometry = features["items"]["properties"]["geometry"]
point = next(g for g in geometry["oneOf"] if (g.get("title") == "GeoJSON Point"))
features["items"]["properties"]["geometry"] = point

# Force the FeatureCollection to at least have a start and an end.
feature_collection["properties"]["features"]["minItems"] = 2

# Add a required timestamp property
properties = features["items"]["properties"]["properties"]
properties["required"] = ["timestamp"]
properties["properties"] = {
        "timestamp": {
            "type": "number",
            "minimum": 0.0
            }
        }

with open('FeatureCollectionMDS.json', 'w') as out:
    out.write(json.dumps(feature_collection, indent=2))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool, I love how straightforward this is in python!

How about the question of addressing that issue I link above - and making event_location more fully specified as a GeoJSON Point?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's reasonable. Should event_location be a straight GeoJSON point, or do you think we should modify it in some way? If the former, we could directly reference the URL (though maybe we would want validation to also work in a setting without internet access).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding differences with the actual spec: it seems that the GeoJSON spec for primitive points, polygons, etc is intentially somewhat limited. If we want to add additional properties then it becomes a Feature (hence the reason that FeatureCollection is used here).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I have made event_location a Point.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you were right, it should be a GeoJSON Feature so we can add the additional properties (e.g. accuracy) if needed. Then the route is a FeatureCollection that is an array of this MDS-specific Feature, does that make sense?

@ian-r-rose
Copy link
Author

Okay, I think I have addressed most of your comments: at least as a first pass.

Copy link

@thekaveman thekaveman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, can you also refactor vehicle_type since that one is also shared between status_change and trip?

@ian-r-rose
Copy link
Author

Done!

@thekaveman
Copy link

I haven't really tested it on synthetic data yet.

Do you have an easy way to do this? Here are a few data files (had to rename them to .txt to attach) generated by my code (releasing in the next day or so I promise...)

status_changes.txt
trips.txt

@ian-r-rose
Copy link
Author

Great, that's exactly what I was looking for!

@ian-r-rose
Copy link
Author

Still figuring out a good validation story, but some early reviews on your synthetic trips data:

jsonschema.exceptions.ValidationError: 'accuracy' is a required property
jsonschema.exceptions.ValidationError: 845.9042534919543 is not of type 'integer'

Failed validating 'type' in schema['properties']['data']['properties']['trips']['items']['properties']['trip_duration']:
jsonschema.exceptions.ValidationError: 2798.1217587696774 is not of type 'integer'

Failed validating 'type' in schema['properties']['data']['properties']['trips']['items']['properties']['trip_distance']:

@ian-r-rose
Copy link
Author

After spending some time with the referencing of external schemas (e.g. common.json) with the $ref keyword, I am starting to think that it is more trouble than it is worth. The $refs are supposed to be URIs, which relative paths are not. Actually getting schema validators to correctly resolve these references is a bit tricky, and may be too much to ask for consumers of the schemas. This hinders their usefulness as a validation tool.

Instead, what would you think about each schema being stand-alone? This would mean de-factoring of some of the common definitions. However, we could write a script so that the schemas are generated automatically from something similar to what we already have (similar to what is already done for getting the FeatureCollectionMDS.json). This is, in fact, the approach taken by the GeoJSON schemas: each of them is stand-alone, with the definitions for things like Point repeated in several places. The code for generating it, however, only has a single definition for Point: https://github.com/geojson/schema

@thekaveman
Copy link

Actually getting schema validators to correctly resolve these references is a bit tricky, and may be too much to ask for consumers of the schemas

Makes total sense, and I think this is the key insight and should be the overriding concern for this effort. If we can't use the schema to easily validate data, it is worthless

@ian-r-rose
Copy link
Author

Yep, it has to be a super smooth process. I'm going to experiment with defactoring of the definitions. Do you think code should live in this repo, or some side-repo? Could be something like

README.md
    provider/
        trips.json
        status_changes.json
        README.md
    agency/
        README.md
        agency.json
    code/
        generate_schema.py
        common.json
        provider/
            trips.json
            status_changes.json 
        agency/
            agency.json

@thekaveman
Copy link

thekaveman commented Sep 21, 2018

Do you think code should live in this repo, or some side-repo?

I like your folder layout, within this repo. It seems that if a change in the spec were being made in this repo, it might necessitate a corresponding change to said code -- makes sense if it's all contained. Then we might be able to look towards a more formalized "release" process as a series of steps that includes generating the new schema.

code/ may be a bit too generic of a name however. Folks may assume it contains code that implements either one of the specs (although your python file is very clearly generate_schema.py)

-- On The Other Hand --

Just to think about another point of view, since I'm the loser on Windows that can never play nice with all the cool kids on *nix, it might make sense to wrap this up as a container for easier sharing. In that case, a separate repo is probably the way to go...

Tagging @hunterowens for your thoughts. TLDR; we're looking to generate a JSON Schema that covers objects for agency and provider. Does that code make sense in this repo or a side-repo? I'm leaning side-repo given my inability to play nice (see above - last 3 comments are most relevant).

@ian-r-rose
Copy link
Author

Okay, what do you think of these changes?
The generate_provider_schema.py script just depends on the json standard library package, as well as the requests package. We could probably drop the latter and just rely on the standard library.

@ian-r-rose
Copy link
Author

My validation script for your sample data looks like

import json
import jsonschema

# Test trips
instance = {"version": "0.1.0", "data": {"trips": json.load(open("trips.txt"))}}
schema = json.load(open('provider/trips.json'))

jsonschema.Draft6Validator(schema).validate(instance)

# Test status changes
instance = {"version": "0.1.0", "data": {"status_changes": json.load(open("status_changes.txt"))}}
schema = json.load(open('provider/status_changes.json'))

jsonschema.Draft6Validator(schema).validate(instance)

@thekaveman
Copy link

🤦‍♂️ 💥 too easy mate

500
]
},
"actual_cost": {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have international currencies but we will probably need to extend this at some point in the future.

Copy link

@thekaveman thekaveman Sep 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point. We could add an additional currency field and use the ISO 4217 standard codes.

@hunterowens
Copy link

Status checking this.

@thekaveman
Copy link

I think we're waiting for some movement on both openmobilityfoundation#74 and openmobilityfoundation#87, both of which I marked for 0.1.1 as they are more clarifications than additions/changes.

There's also the question of whether the python code that generates the schema files belongs in the MDS repo or elsewhere. It's a fairly simple script.

@ian-r-rose
Copy link
Author

We could also merge this so that the PR against the official repo (openmobilityfoundation#53) is updated.

@thekaveman
Copy link

Ok, let's merge this one and continue the discussion in the main repo, that probably makes more sense and is more visible.

@thekaveman thekaveman merged commit 3ed4bbd into CityofSantaMonica:json-schema Sep 25, 2018
thekaveman pushed a commit that referenced this pull request Oct 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants