I am using py-ocsf-models in combination with svdimchenko/pydantic-glue to generate the schema for tables in AWS Glue, and currently I need to subclass and override field serialization for Pydantic to generate a compatible schema for fields with an arbitrary type.
Would you be interested in a PR to bring support for this upstream?
The OCSF Schema defines several attributes whose type is simply Object, described as an unordered collection of attributes.
It also defines the type of a few attributes as JSON, as is the case with an Enrichment object.
Currently, py-ocsf-models handles these types by annotating the fields with the Python built-in type object.
Pydantic will generate a JSON Schema element such as {"type": "object"} for such fields, and the only way to represent this in AWS Glue type definitions would be struct<>, which is not valid.
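To make the problem concrete, here is a minimal sketch of a JSON Schema to Glue type mapping. The function json_schema_to_glue_type is hypothetical and deliberately simplified; it is not pydantic-glue's actual implementation, only an illustration of why an object schema without a "properties" key ends up as an empty struct.

```python
from typing import Any


def json_schema_to_glue_type(schema: dict[str, Any]) -> str:
    """Hypothetical, simplified mapping from a JSON Schema fragment to a Glue type string."""
    schema_type = schema.get("type")
    if schema_type == "string":
        return "string"
    if schema_type == "integer":
        return "int"
    if schema_type == "boolean":
        return "boolean"
    if schema_type == "array":
        return f"array<{json_schema_to_glue_type(schema.get('items', {}))}>"
    if schema_type == "object":
        # With no "properties" key there are no struct members to emit.
        fields = ",".join(
            f"{name}:{json_schema_to_glue_type(sub)}"
            for name, sub in schema.get("properties", {}).items()
        )
        return f"struct<{fields}>"
    raise ValueError(f"unsupported schema: {schema!r}")


# An untyped object collapses to an empty struct, which Glue rejects.
print(json_schema_to_glue_type({"type": "object"}))  # struct<>
# Once the field is serialized as a string, the column type is unambiguous.
print(json_schema_to_glue_type({"type": "string"}))  # string
```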
This is an example of how I add this serialization typing in a subclass for BaseEvent:
```python
import json
from typing import Optional

from pydantic import field_serializer

from py_ocsf_models.events.base_event import BaseEvent
from py_ocsf_models.objects.enrichment import Enrichment


def jsonable_object_serializer(value: object) -> str:
    """
    Serialize a JSON-able field that contains a type of `object` to `str` by dumping as JSON.

    For fields that contain a type of 'object' a reasonable schema for conversion to
    AWS Glue column type definitions cannot be provided by Pydantic, which when
    providing a JSON schema will use an entry of type "object" but with no
    "properties" key, which if we convert to Glue schema, will type as `struct<>`
    which is not valid.

    The workaround is to use a string typed column, and store as a string, and then
    parse and query the JSON in the query engine you use, such as the AWS Athena
    support for querying JSON data.

    References:
    - https://docs.aws.amazon.com/athena/latest/ug/querying-JSON.html
    - https://repost.aws/questions/QU0CQ6q_tkSwGCd_vQ36M0TA/best-glue-catalog-table-column-type-to-store-variable-json-docs
    """
    return json.dumps(value)


class OCSFEnrichment(Enrichment):
    @field_serializer("data")
    def data_serializer(self, value: object) -> str:
        return jsonable_object_serializer(value)


class OCSFBaseEvent(BaseEvent):
    enrichments: Optional[list[OCSFEnrichment]]

    @field_serializer("unmapped")
    def unmapped_serializer(self, value: object) -> str:
        return jsonable_object_serializer(value)
```
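As a small illustration of what ends up in the string-typed column, the following stdlib-only sketch dumps an arbitrary JSON-able value and parses it back, which is roughly the step the query engine performs at read time. The sample payload is made up for the example.

```python
import json

# Made-up example of an arbitrary, schema-less payload such as an "unmapped" value.
unmapped = {"vendor_field": "abc", "nested": {"count": 3}}

# This is the string that would be stored in the Glue `string` column.
stored = json.dumps(unmapped)

# Parsing the column value recovers the original structure.
restored = json.loads(stored)
print(restored["nested"]["count"])  # 3
```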
With these serializers in place, the JSON Schema that is generated, dumped, and processed by pydantic-glue yields the desired Glue column types.
Is this something that you'd consider supporting in py-ocsf-models?
Further note: Pydantic's field_serializer supports a return_type argument, so it should be possible to make this behaviour optional and controlled by an environment variable (say PY_OCSF_MODELS_OBJECT_SERIALIZATION_FORMAT=dumped_json) set before import, in case the existing behaviour is relied upon.
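The opt-in toggle could look roughly like the sketch below. It only shows the environment-variable logic and uses no Pydantic API; the helper names and the "native" default value are assumptions made for illustration. A real implementation would probably read the variable once at import time, since the serializer's return type has to be fixed when the model classes are defined.

```python
import json
import os

# Variable name taken from the proposal above; "native" is an assumed default label.
ENV_VAR = "PY_OCSF_MODELS_OBJECT_SERIALIZATION_FORMAT"


def object_serialization_format() -> str:
    """Return the configured format, falling back to the existing behaviour."""
    return os.environ.get(ENV_VAR, "native")


def serialize_object_field(value: object) -> object:
    """Dump to a JSON string only when the user has opted in."""
    if object_serialization_format() == "dumped_json":
        return json.dumps(value)
    return value
```

With the variable unset, values pass through unchanged, so existing users would see no difference.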
For reference, the fields currently typed as object are at the following locations in py-ocsf-models (commit b366488):
- py_ocsf_models/events/base_event.py, line 85
- py_ocsf_models/objects/enrichment.py, line 19
- py_ocsf_models/objects/evidence_artifacts.py, line 31
- py_ocsf_models/objects/request_elements.py, line 22
- py_ocsf_models/objects/resource_details.py, line 33
- py_ocsf_models/objects/response_elements.py, line 24