Serialization Type for Fields Containing Python Type object #138

Open
robodair opened this issue Nov 7, 2024 · 0 comments

I am using py-ocsf-models together with svdimchenko/pydantic-glue to generate the schema for tables in AWS Glue. Currently I have to subclass the models and override field serialization so that Pydantic generates a Glue-compatible schema for fields with an arbitrary type.

Would you be interested in a PR to bring support for this upstream?

The OCSF Schema defines several attributes with the bare type Object, described as an unordered collection of attributes.

It also defines the type of a few attributes as JSON, as is the case with an Enrichment object.

Currently in py-ocsf-models these types are handled by using the Python type object:

unmapped: Optional[object]

data: Optional[dict[str, object]]

Pydantic will generate a JSON Schema element such as {"type": "object"} for such fields, with no "properties" key. The only way to represent this in AWS Glue type definitions would be struct<>, which is not valid.
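To make the failure mode concrete, here is a minimal sketch of a toy JSON Schema to Glue type converter. The function to_glue_type and its type mapping are hypothetical and are not pydantic-glue's actual implementation; the point is only to show why an object schema without "properties" collapses to an empty struct.

```python
# Toy converter for illustration only; this is NOT pydantic-glue's real code.
PRIMITIVES = {
    "string": "string",
    "integer": "int",
    "number": "float",
    "boolean": "boolean",
}


def to_glue_type(schema: dict) -> str:
    """Map a small subset of JSON Schema to a Glue column type string."""
    kind = schema.get("type")
    if kind in PRIMITIVES:
        return PRIMITIVES[kind]
    if kind == "array":
        return f"array<{to_glue_type(schema['items'])}>"
    if kind == "object":
        # With no "properties" key there is nothing to list inside the struct.
        fields = ",".join(
            f"{name}:{to_glue_type(sub)}"
            for name, sub in schema.get("properties", {}).items()
        )
        return f"struct<{fields}>"
    raise ValueError(f"unsupported schema: {schema!r}")


typed = {"type": "object", "properties": {"name": {"type": "string"}}}
untyped = {"type": "object"}

print(to_glue_type(typed))    # struct<name:string>
print(to_glue_type(untyped))  # struct<>
```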

The workaround is to use a string-typed column, store the value as a JSON string, and then parse and query the JSON in whichever query engine you use, for example with AWS Athena's support for querying JSON data.
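As a rough sketch of that round trip using only the standard library: the payload below is made up, and json.loads stands in for whatever JSON function the query engine provides.

```python
import json

# Hypothetical payload for an `unmapped` field; the keys are invented.
unmapped = {"vendor_field": {"retries": 3, "tags": ["a", "b"]}}

# Write side: the column holds one JSON document as a plain string.
column_value = json.dumps(unmapped)

# Read side: the string is parsed back into structured data at query time.
# json.loads plays the role of the query engine's JSON functions here.
parsed = json.loads(column_value)
print(parsed["vendor_field"]["retries"])
```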

This is an example of how I add this serialization typing in a subclass for BaseEvent:

import json
from typing import Optional

from pydantic import field_serializer

from py_ocsf_models.events.base_event import BaseEvent
from py_ocsf_models.objects.enrichment import Enrichment


def jsonable_object_serializer(value: object) -> str:
    """
    Serialize a JSON-able field that contains a type of `object` to `str` by dumping as JSON.

    For fields that contain a type of 'object' a reasonable schema for conversion to AWS Glue column
    type definitions cannot be provided by Pydantic, which when providing a JSON schema will use an
    entry of type "object" but with no "properties" key, which if we convert to Glue schema, will
    type as `struct<>` which is not valid.

    The workaround is to use a string typed column, and store as a string, and then parse and query
    the JSON in the query engine you use, such as the AWS Athena support for querying JSON data.

    References:
    - https://docs.aws.amazon.com/athena/latest/ug/querying-JSON.html
    - https://repost.aws/questions/QU0CQ6q_tkSwGCd_vQ36M0TA/best-glue-catalog-table-column-type-to-store-variable-json-docs
    """
    return json.dumps(value)


class OCSFEnrichment(Enrichment):

    @field_serializer("data")
    def data_serializer(self, value: object) -> str:
        return jsonable_object_serializer(value)


class OCSFBaseEvent(BaseEvent):
    enrichments: Optional[list[OCSFEnrichment]]

    @field_serializer("unmapped")
    def unmapped_serializer(self, value: object) -> str:
        return jsonable_object_serializer(value)

When the JSON Schema is then generated, dumped, and processed by pydantic-glue, the Glue column types come out as desired:

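import json
import pprint

import pydantic_glue

# Generate the schema in serialization mode so the field serializers'
# return types (str) are reflected in the output.
schema = json.dumps(
    OCSFBaseEvent.model_json_schema(
        mode="serialization",
    ),
)

pprint.pprint(pydantic_glue.convert(schema))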

[('enrichments',
  'array<struct<data:string,name:string,provider:string,type:string,value:string>>'),
 ('message', 'string'),
 ('metadata',
  'struct<correlation_uid:string,event_code:string,uid:string,labels:array<string>,log_level:string,log_name:string,log_provider:string,log_version:string,logged_time:timestamp,loggers:array<struct<device:struct<uid_alt:string,autoscale_uid:string,is_compliant:boolean,created_time:timestamp,desc:string,domain:string,first_seen_time:timestamp,location:struct<city:string,continent:string,coordinates:array<float>,country:string,desc:string,isp:string,is_on_premises:boolean,postal_code:string,provider:string,region:string>,groups:array<struct<type:string,desc:string,domain:string,name:string,privileges:array<string>,uid:string>>,hw_info:struct<bios_date:string,bios_manufacturer:string,bios_ver:string,cpu_bits:int,cpu_cores:int,cpu_count:int,chassis:string,desktop_display:string,keyboard_info:string,cpu_speed:int,cpu_type:string,ram_size:int,serial_number:string>,hostname:string,hypervisor:string,imei:string,ip:string,image:struct<tag:string,labels:array<string>,name:string,path:string,uid:string>,instance_uid:string,last_seen_time:timestamp,mac:string,is_managed:boolean,modified_time:timestamp,name:string,interface_uid:string,interface_name:string,network_interfaces:array<struct<hostname:string,ip:string,mac:string,name:string,namespace:string,subnet_prefix:int,type:string,type_id:int,uid:string>>,zone:string,os:struct<cpu_bits:int,country:string,lang:string,name:string,build:string,edition:string,sp_name:string,sp_ver:int,cpe_name:string,type:string,type_id:int,version:string>,org:struct<name:string,ou_uid:string,ou_name:string,uid:string>,is_personal:boolean,region:string,risk_level:string,risk_level_id:int,risk_score:int,subnet:string,subnet_uid:string,is_trusted:boolean,type:string,type_id:int,uid:string,vlan_uid:string,vpc_uid:string>,log_level:string,log_name:string,log_provider:string,log_version:string,logged_time:timestamp,name:string,product:struct<feature:struct<name:string,uid:string,version:string>,lang:string,name:string,path:string,cpe_name:string,url_string:string,uid:string,vendor_name:string,version:string>,transmit_time:timestamp,uid:string,version:string>>,modified_time:timestamp,original_time:string,processed_time:timestamp,product:struct<feature:struct<name:string,uid:string,version:string>,lang:string,name:string,path:string,cpe_name:string,url_string:string,uid:string,vendor_name:string,version:string>,profiles:array<string>,extensions:array<struct<name:string,uid:string,version:string>>,sequence:int,tenant_uid:string,version:string>'),
 ('observables',
  'array<struct<name:string,reputation:struct<provider:string,base_score:float,score:string,score_id:int>,type:string,type_id:int,value:string>>'),
 ('raw_data', 'string'),
 ('severity_id', 'int'),
 ('severity', 'string'),
 ('status', 'string'),
 ('status_code', 'string'),
 ('status_detail', 'string'),
 ('status_id', 'int'),
 ('unmapped', 'string')]

Is this something that you'd consider supporting in py-ocsf-models?

Further note: Pydantic's field_serializer supports a return_type argument, so it should be possible to make this behaviour optional and controlled by an environment variable (say PY_OCSF_MODELS_OBJECT_SERIALIZATION_FORMAT=dumped_json) that is set before import, in case the existing behaviour is relied upon.
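A minimal sketch of how such a toggle might be read, using only the standard library. The helper object_serializer_config is hypothetical, and nothing in py-ocsf-models reads this variable today; the variable name and value are taken from the suggestion above.

```python
import json
import os
from typing import Any, Callable, Mapping, Tuple

# Variable name and value follow the suggestion above; this is an assumption,
# not an existing py-ocsf-models setting.
ENV_VAR = "PY_OCSF_MODELS_OBJECT_SERIALIZATION_FORMAT"


def object_serializer_config(
    environ: Mapping[str, str] = os.environ,
) -> Tuple[type, Callable[[Any], Any]]:
    """Return (return_type, serializer) for object-typed fields."""
    if environ.get(ENV_VAR) == "dumped_json":
        # Opt-in: values are dumped to a JSON string, so the type is str.
        return str, json.dumps
    # Default: leave the value untouched, preserving existing behaviour.
    return object, lambda value: value


return_type, serializer = object_serializer_config({ENV_VAR: "dumped_json"})
print(return_type.__name__, serializer({"a": 1}))
```

The pair would be computed once when the models module is imported, with the first element passed as field_serializer's return_type, which is why the variable would have to be set before import.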
