Document all readers
criccomini committed Jul 25, 2023
1 parent f1e8d32 commit 395a925
Showing 9 changed files with 278 additions and 3 deletions.
4 changes: 2 additions & 2 deletions docs/converters/avro.md
@@ -80,7 +80,7 @@ recap_schema = AvroConverter().to_recap(avro_schema)

### From Recap to Avro

-| Recap Type (with attribute limits) | Avro Type |
+| Recap Type | Avro Type |
|------------------------------------|-----------|
| NullType | null |
| BoolType | boolean |
@@ -98,7 +98,7 @@ recap_schema = AvroConverter().to_recap(avro_schema)

### From Avro to Recap

-| Avro Type | Recap Type (with attribute limits) |
+| Avro Type | Recap Type |
|-----------|------------------------------------|
| null | NullType |
| boolean | BoolType |
File renamed without changes.
2 changes: 1 addition & 1 deletion docs/converters/protobuf.md
@@ -81,7 +81,7 @@ recap_schema = ProtobufConverter().to_recap(protobuf_schema)

This table shows the corresponding Protobuf types for each Recap type.

-| Recap Type (with attribute limits) | Protobuf Type |
+| Recap Type | Protobuf Type |
|------------------------------------|---------------|
| NullType | google.protobuf.NullValue |
| BoolType | bool |
55 changes: 55 additions & 0 deletions docs/readers/bigquery.md
@@ -0,0 +1,55 @@
---
layout: default
title: "BigQuery"
parent: "Readers"
---

# BigQuery
{: .no_toc }

1. TOC
{:toc}

The `BigQueryReader` class is used to convert BigQuery table schemas to Recap types. The main method in this class is `to_recap`.

## `to_recap`

```python
def to_recap(self, dataset: str, table: str) -> StructType
```

The `to_recap` method takes in the name of a BigQuery dataset and table, and returns a Recap `StructType` that represents the BigQuery table schema.

### Example

```python
from google.cloud import bigquery
from recap.readers.bigquery import BigQueryReader

client = bigquery.Client()
recap_schema = BigQueryReader(client).to_recap("my_dataset", "my_table")
```

In this example, `recap_schema` will be a `StructType` that represents the schema of `my_table` in `my_dataset`.

## Type Conversion

This table shows the corresponding Recap types for each BigQuery type, along with the associated attributes:

| BigQuery Type | Recap Type |
|---------------|------------------------------------|
| STRING, JSON | StringType (bytes <= 65_536) |
| BYTES | BytesType (bytes <= 65_536) |
| INT64, INTEGER, INT, SMALLINT, TINYINT, BYTEINT | IntType (bits=64, signed=True) |
| FLOAT, FLOAT64 | FloatType (bits=64) |
| BOOLEAN | BoolType |
| TIMESTAMP, DATETIME | IntType (logical="build.recap.Timestamp", bits=64, unit="microsecond") |
| TIME | IntType (logical="build.recap.Time", bits=32, unit="microsecond") |
| DATE | IntType (logical="build.recap.Date", bits=32, unit="day") |
| RECORD, STRUCT | StructType |
| NUMERIC, DECIMAL | BytesType (logical="build.recap.Decimal", bytes=16, variable=False, precision <= 38, scale <= 0) |
| BIGNUMERIC, BIGDECIMAL | BytesType (logical="build.recap.Decimal", bytes=32, variable=False, precision <= 76, scale <= 0) |

## Limitations and Constraints

The conversion functions raise a `ValueError` exception if the conversion is not possible.
49 changes: 49 additions & 0 deletions docs/readers/confluent-schema-registry.md
@@ -0,0 +1,49 @@
---
layout: default
title: "Confluent Schema Registry"
parent: "Readers"
---

# Confluent Schema Registry
{: .no_toc }

1. TOC
{:toc}

The `ConfluentRegistryReader` class is used to convert schemas registered in a Confluent Schema Registry to Recap types. The main method in this class is `to_recap`.

## `to_recap`

```python
def to_recap(self, topic: str) -> StructType
```

The `to_recap` method takes in the name of a Kafka topic, fetches the associated schema from the Confluent Schema Registry, and converts it to a Recap `StructType`. The method supports Avro, JSON, and Protobuf schemas.

### Example

```python
from confluent_kafka.schema_registry import SchemaRegistryClient
from recap.readers.confluent_registry import ConfluentRegistryReader

registry = SchemaRegistryClient({"url": "http://my-registry:8081"})
recap_schema = ConfluentRegistryReader(registry).to_recap("my_topic")
```

In this example, `recap_schema` will be a `StructType` that represents the schema of the value of messages in `my_topic`.
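
Since `to_recap` reads the schema for a topic's message *values*, the topic name has to be mapped to a registry subject. Confluent's default `TopicNameStrategy` does this by appending `-value` to the topic name. A minimal sketch of that convention, assuming the reader uses the default strategy (the helper below is illustrative, not part of the recap API):

```python
# Sketch of Confluent's default TopicNameStrategy: the subject holding a
# topic's value schema is "<topic>-value". Illustrative only, not recap API.
def value_subject(topic: str) -> str:
    """Return the registry subject for a topic's value schema."""
    return f"{topic}-value"

# For example, "my_topic" resolves to the subject "my_topic-value".
```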

## Type Conversion

The `to_recap` method uses the `AvroConverter`, `JSONSchemaConverter`, and `ProtobufConverter` classes to convert schemas, based on their type.

Please see the individual documentation for these classes for information on how they convert types:

- Avro: [AvroConverter]({{site.baseurl}}/docs/converters/avro)
- JSON schema: [JSONSchemaConverter]({{site.baseurl}}/docs/converters/json-schema)
- Protocol Buffers: [ProtobufConverter]({{site.baseurl}}/docs/converters/protobuf)

## Limitations and Constraints

1. `ConfluentRegistryReader` does not support [schema references](https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/index.html#schema-references).

The conversion functions raise a `ValueError` exception if the conversion is not possible.
69 changes: 69 additions & 0 deletions docs/readers/hive-metastore.md
@@ -0,0 +1,69 @@
---
layout: default
title: "Hive Metastore"
parent: "Readers"
---

# Hive Metastore
{: .no_toc }

1. TOC
{:toc}

The `HiveMetastoreReader` class is used to convert Hive table schemas into Recap types. This class can also be used to fetch and convert table statistics from Hive Metastore.

## `to_recap`

```python
def to_recap(
self,
database_name: str,
table_name: str,
include_stats: bool = False,
) -> StructType
```

The `to_recap` method takes in the name of a database and a table within that database, retrieves the associated schema from the Hive Metastore, and converts it into a Recap `StructType`. If `include_stats` is set to True, the method will also fetch table statistics from the Hive Metastore and include them in the returned `StructType`.

### Example

```python
from pymetastore.metastore import HMS
from recap.readers.hive_metastore import HiveMetastoreReader

with HMS.create("localhost", 9093) as client:
recap_schema = HiveMetastoreReader(client).to_recap("my_database", "my_table")
```

In this example, `recap_schema` will be a `StructType` that represents the schema of the `my_table` table in the `my_database` database.

## Type Conversion

| Hive Type | Recap Type |
|------------------------------------|------------------------------------|
| BOOLEAN | BoolType |
| BYTE | IntType (bits=8) |
| SHORT | IntType (bits=16) |
| INT | IntType (bits=32) |
| LONG | IntType (bits=64) |
| FLOAT | FloatType (bits=32) |
| DOUBLE | FloatType (bits=64) |
| VOID | NullType |
| STRING | StringType (bytes <= 9_223_372_036_854_775_807) |
| BINARY | BytesType (bytes <= 2_147_483_647) |
| DECIMAL | BytesType (logical="build.recap.Decimal", bytes=16, variable=False, precision, scale) |
| VARCHAR | StringType (bytes=length) |
| CHAR | StringType (bytes=length, variable=False) |
| DATE | IntType (logical="build.recap.Date", bits=32, signed=True, unit="day") |
| TIMESTAMP | IntType (logical="build.recap.Timestamp", bits=64, signed=True, unit="nanosecond", timezone="UTC") |
| TIMESTAMPLOCALTZ | IntType (logical="build.recap.Timestamp", bits=64, signed=True, unit="nanosecond", timezone=None) |
| INTERVAL_YEAR_MONTH | BytesType (logical="build.recap.Interval", bytes=12, signed=True, unit="month") |
| INTERVAL_DAY_TIME | BytesType (logical="build.recap.Interval", bytes=12, signed=True, unit="second") |
| MAP | MapType |
| ARRAY | ListType |
| UNIONTYPE | UnionType |
| STRUCT | StructType |

## Limitations and Constraints

The conversion functions raise a `ValueError` exception if the conversion is not possible.
5 changes: 5 additions & 0 deletions docs/readers/index.md
@@ -0,0 +1,5 @@
---
layout: default
title: "Readers"
has_children: true
---
56 changes: 56 additions & 0 deletions docs/readers/postgresql.md
@@ -0,0 +1,56 @@
---
layout: default
title: "PostgreSQL"
parent: "Readers"
---

# PostgreSQL
{: .no_toc }

1. TOC
{:toc}

The `PostgresqlReader` class is used to convert PostgreSQL table schemas to Recap types. The main method in this class is `to_recap`.

## `to_recap`

```python
def to_recap(self, table: str, schema: str, catalog: str) -> StructType
```

The `to_recap` method takes in the name of a PostgreSQL table, schema, and catalog, and returns a Recap `StructType` that represents the PostgreSQL table schema.

### Example

```python
from psycopg2 import connect
from recap.readers.postgresql import PostgresqlReader

connection = connect(database="my_database", user="my_user", password="my_password")
recap_schema = PostgresqlReader(connection).to_recap("my_table", "my_schema", "my_catalog")
```

In this example, `recap_schema` will be a `StructType` that represents the schema of `my_table` in `my_schema` within `my_catalog`.

## Type Conversion

This table shows the corresponding Recap types for each PostgreSQL type, along with the associated attributes:

| PostgreSQL Type | Recap Type |
|-----------------|------------------------------------|
| bigint, int8, bigserial, serial8 | IntType (bits=64, signed=True) |
| integer, int, int4, serial, serial4 | IntType (bits=32, signed=True) |
| smallint, smallserial, serial2 | IntType (bits=16, signed=True) |
| double precision, float8 | FloatType (bits=64) |
| real, float4 | FloatType (bits=32) |
| boolean | BoolType |
| text, json, jsonb, character varying, varchar | StringType (bytes_=OCTET_LENGTH, variable=True) |
| char | StringType (bytes_=OCTET_LENGTH, variable=False) |
| bytea, bit varying | BytesType (bytes_=MAX_FIELD_SIZE, variable=True) |
| bit | BytesType (bytes_=ceil(BIT_LENGTH / 8), variable=False) |
| timestamp | IntType(bits=64, logical="build.recap.Timestamp", unit=unit) |
| decimal, numeric | BytesType(logical="build.recap.Decimal", bytes_=32, variable=False, precision=NUMERIC_PRECISION, scale=NUMERIC_SCALE) |
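
For the fixed-width `bit` type, the byte width is derived from the declared bit length, as the row above shows. A minimal sketch of that calculation (illustrative only, not part of the recap API):

```python
# Sketch of the bytes_ computation for a fixed-width bit(n) column,
# per the table above: bytes_ = ceil(BIT_LENGTH / 8). Illustrative only.
import math

def bit_column_bytes(bit_length: int) -> int:
    """Byte width needed to hold a bit(n) column's value."""
    return math.ceil(bit_length / 8)
```

For example, a `bit(12)` column needs 2 bytes, while `bit(8)` fits in 1.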

## Limitations and Constraints

The conversion functions raise a `ValueError` exception if the conversion is not possible due to the PostgreSQL data type being unknown.
41 changes: 41 additions & 0 deletions docs/readers/snowflake.md
@@ -0,0 +1,41 @@
---
layout: default
title: "Snowflake"
parent: "Readers"
---

# Snowflake
{: .no_toc }

1. TOC
{:toc}

The `SnowflakeReader` class is used to convert Snowflake table schemas to Recap types. The main method in this class is `to_recap`.

## `to_recap`

```python
def to_recap(self, table: str, schema: str, catalog: str) -> StructType
```

The `to_recap` method is used to translate a specific Snowflake table to a `StructType` (a Recap type). The method takes the table name, schema, and catalog as arguments and uses these to query the Snowflake `information_schema.columns` view for the metadata of the specified table. It constructs a `StructType` from these column definitions, converting each column to the corresponding Recap type.

## Type Conversion

This table shows the corresponding Recap types for each Snowflake type, along with the associated attributes:

| Snowflake Type | Recap Type |
|-----------------|------------------------------------|
| float, float4, float8, double, double precision, real | FloatType (bits=64) |
| boolean | BoolType |
| number, decimal, numeric, int, integer, bigint, smallint, tinyint, byteint | BytesType (logical="build.recap.Decimal", bytes_=16, variable=False, precision=NUMERIC_PRECISION, scale=NUMERIC_SCALE) |
| varchar, string, text, nvarchar, nvarchar2, char varying, nchar varying | StringType (bytes_=OCTET_LENGTH, variable=True) |
| char, nchar, character | StringType (bytes_=OCTET_LENGTH, variable=True) |
| binary, varbinary, blob | BytesType (bytes_=OCTET_LENGTH) |
| date | IntType(bits=32, logical="build.recap.Date", unit="day") |
| timestamp, datetime | IntType(bits=64, logical="build.recap.Timestamp", unit=unit) |
| time | IntType(bits=32, logical="build.recap.Time", unit=unit) |
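
The mapping above can be sketched as a plain Python lookup. This is an illustrative toy, not part of the recap API: it returns only the Recap type name and ignores the attributes (bits, precision, scale, unit) that the real `SnowflakeReader` derives from `information_schema.columns`.

```python
# Toy sketch of the Snowflake-to-Recap type mapping in the table above.
# Illustrative only: the real SnowflakeReader also attaches attributes
# (bits, precision, scale, unit) read from information_schema.columns.

FLOAT_TYPES = {"float", "float4", "float8", "double", "double precision", "real"}
DECIMAL_TYPES = {
    "number", "decimal", "numeric", "int", "integer",
    "bigint", "smallint", "tinyint", "byteint",
}
STRING_TYPES = {
    "varchar", "string", "text", "nvarchar", "nvarchar2",
    "char varying", "nchar varying", "char", "nchar", "character",
}
BYTES_TYPES = {"binary", "varbinary", "blob"}

def snowflake_to_recap(snowflake_type: str) -> str:
    """Return the Recap type name for a Snowflake column type."""
    t = snowflake_type.strip().lower()
    if t in FLOAT_TYPES:
        return "FloatType"
    if t == "boolean":
        return "BoolType"
    if t in DECIMAL_TYPES:
        return "BytesType"  # logical="build.recap.Decimal"
    if t in STRING_TYPES:
        return "StringType"
    if t in BYTES_TYPES:
        return "BytesType"
    if t == "date":
        return "IntType"  # logical="build.recap.Date"
    if t in {"timestamp", "datetime"}:
        return "IntType"  # logical="build.recap.Timestamp"
    if t == "time":
        return "IntType"  # logical="build.recap.Time"
    raise ValueError(f"Unknown Snowflake type: {snowflake_type}")
```

The final `raise` mirrors the behavior described below: unknown Snowflake types are rejected with a `ValueError`.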

## Limitations and Constraints

The conversion functions raise a `ValueError` exception if the conversion is not possible due to the Snowflake data type being unknown.
