Document all readers
criccomini committed Jul 25, 2023
1 parent f1e8d32 commit 395a925
Showing 9 changed files with 278 additions and 3 deletions.
4 changes: 2 additions & 2 deletions docs/converters/avro.md
@@ -80,7 +80,7 @@ recap_schema = AvroConverter().to_recap(avro_schema)

### From Recap to Avro

-| Recap Type (with attribute limits) | Avro Type |
+| Recap Type | Avro Type |
|------------------------------------|-----------|
| NullType | null |
| BoolType | boolean |
@@ -98,7 +98,7 @@ recap_schema = AvroConverter().to_recap(avro_schema)

### From Avro to Recap

-| Avro Type | Recap Type (with attribute limits) |
+| Avro Type | Recap Type |
|-----------|------------------------------------|
| null | NullType |
| boolean | BoolType |
File renamed without changes.
2 changes: 1 addition & 1 deletion docs/converters/protobuf.md
@@ -81,7 +81,7 @@ recap_schema = ProtobufConverter().to_recap(protobuf_schema)

This table shows the corresponding Protobuf types for each Recap type.

-| Recap Type (with attribute limits) | Protobuf Type |
+| Recap Type | Protobuf Type |
|------------------------------------|---------------|
| NullType | google.protobuf.NullValue |
| BoolType | bool |
55 changes: 55 additions & 0 deletions docs/readers/bigquery.md
@@ -0,0 +1,55 @@
---
layout: default
title: "BigQuery"
parent: "Readers"
---

# BigQuery
{: .no_toc }

1. TOC
{:toc}

The `BigQueryReader` class is used to convert BigQuery table schemas to Recap types. The main method in this class is `to_recap`.

## `to_recap`

```python
def to_recap(self, dataset: str, table: str) -> StructType
```

The `to_recap` method takes in the name of a BigQuery dataset and table, and returns a Recap `StructType` that represents the BigQuery table schema.

### Example

```python
from google.cloud import bigquery
from recap.readers.bigquery import BigQueryReader

client = bigquery.Client()
recap_schema = BigQueryReader(client).to_recap("my_dataset", "my_table")
```

In this example, `recap_schema` will be a `StructType` that represents the schema of `my_table` in `my_dataset`.

## Type Conversion

This table shows the corresponding Recap types for each BigQuery type, along with the associated attributes:

| BigQuery Type | Recap Type |
|---------------|------------------------------------|
| STRING, JSON | StringType (bytes <= 65_536) |
| BYTES | BytesType (bytes <= 65_536) |
| INT64, INTEGER, INT, SMALLINT, TINYINT, BYTEINT | IntType (bits=64, signed=True) |
| FLOAT, FLOAT64 | FloatType (bits=64) |
| BOOLEAN | BoolType |
| TIMESTAMP, DATETIME | IntType (logical="build.recap.Timestamp", bits=64, unit="microsecond") |
| TIME | IntType (logical="build.recap.Time", bits=32, unit="microsecond") |
| DATE | IntType (logical="build.recap.Date", bits=32, unit="day") |
| RECORD, STRUCT | StructType |
| NUMERIC, DECIMAL | BytesType (logical="build.recap.Decimal", bytes=16, variable=False, precision <= 38, scale <= 0) |
| BIGNUMERIC, BIGDECIMAL | BytesType (logical="build.recap.Decimal", bytes=32, variable=False, precision <= 76, scale <= 0) |

## Limitations and Constraints

The conversion functions raise a `ValueError` exception if the conversion is not possible.
49 changes: 49 additions & 0 deletions docs/readers/confluent-schema-registry.md
@@ -0,0 +1,49 @@
---
layout: default
title: "Confluent Schema Registry"
parent: "Readers"
---

# Confluent Schema Registry
{: .no_toc }

1. TOC
{:toc}

The `ConfluentRegistryReader` class is used to convert schemas registered in a Confluent Schema Registry to Recap types. The main method in this class is `to_recap`.

## `to_recap`

```python
def to_recap(self, topic: str) -> StructType
```

The `to_recap` method takes in the name of a Kafka topic, fetches the associated schema from the Confluent Schema Registry, and converts it to a Recap `StructType`. The method supports Avro, JSON, and Protobuf schemas.

### Example

```python
from confluent_kafka.schema_registry import SchemaRegistryClient
from recap.readers.confluent_registry import ConfluentRegistryReader

registry = SchemaRegistryClient({"url": "http://my-registry:8081"})
recap_schema = ConfluentRegistryReader(registry).to_recap("my_topic")
```

In this example, `recap_schema` will be a `StructType` that represents the schema of the value of messages in `my_topic`.
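
Since `to_recap` reads the schema for a topic's message *values*, the topic name has to be mapped to a registry subject. Confluent's default `TopicNameStrategy` does this by appending `-value` to the topic name. A minimal sketch of that convention, assuming the reader uses the default strategy (the helper below is illustrative, not part of the recap API):

```python
# Sketch of Confluent's default TopicNameStrategy: the subject holding a
# topic's value schema is "<topic>-value". Illustrative only, not recap API.
def value_subject(topic: str) -> str:
    """Return the registry subject for a topic's value schema."""
    return f"{topic}-value"

# For example, "my_topic" resolves to the subject "my_topic-value".
```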

## Type Conversion

The `to_recap` method uses the `AvroConverter`, `JSONSchemaConverter`, and `ProtobufConverter` classes to convert schemas, based on their type.

Please see the individual documentation for these classes for information on how they convert types:

- Avro: [AvroConverter]({{site.baseurl}}/docs/converters/avro)
- JSON schema: [JSONSchemaConverter]({{site.baseurl}}/docs/converters/json-schema)
- Protocol Buffers: [ProtobufConverter]({{site.baseurl}}/docs/converters/protobuf)

## Limitations and Constraints

1. `ConfluentRegistryReader` does not support [schema references](https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/index.html#schema-references).

The conversion functions raise a `ValueError` exception if the conversion is not possible.
69 changes: 69 additions & 0 deletions docs/readers/hive-metastore.md
@@ -0,0 +1,69 @@
---
layout: default
title: "Hive Metastore"
parent: "Readers"
---

# Hive Metastore
{: .no_toc }

1. TOC
{:toc}

The `HiveMetastoreReader` class is used to convert Hive table schemas into Recap types. This class can also be used to fetch and convert table statistics from Hive Metastore.

## `to_recap`

```python
def to_recap(
self,
database_name: str,
table_name: str,
include_stats: bool = False,
) -> StructType
```

The `to_recap` method takes in the name of a database and a table within that database, retrieves the associated schema from the Hive Metastore, and converts it into a Recap `StructType`. If `include_stats` is set to True, the method will also fetch table statistics from the Hive Metastore and include them in the returned `StructType`.

### Example

```python
from pymetastore.metastore import HMS
from recap.readers.hive_metastore import HiveMetastoreReader

with HMS.create("localhost", 9093) as client:
recap_schema = HiveMetastoreReader(client).to_recap("my_database", "my_table")
```

In this example, `recap_schema` will be a `StructType` that represents the schema of the `my_table` table in the `my_database` database.

## Type Conversion

| Hive Type | Recap Type |
|------------------------------------|------------------------------------|
| BOOLEAN | BoolType |
| BYTE | IntType (bits=8) |
| SHORT | IntType (bits=16) |
| INT | IntType (bits=32) |
| LONG | IntType (bits=64) |
| FLOAT | FloatType (bits=32) |
| DOUBLE | FloatType (bits=64) |
| VOID | NullType |
| STRING | StringType (bytes <= 9_223_372_036_854_775_807) |
| BINARY | BytesType (bytes <= 2_147_483_647) |
| DECIMAL | BytesType (logical="build.recap.Decimal", bytes=16, variable=False, precision, scale) |
| VARCHAR | StringType (bytes=length) |
| CHAR | StringType (bytes=length, variable=False) |
| DATE | IntType (logical="build.recap.Date", bits=32, signed=True, unit="day") |
| TIMESTAMP | IntType (logical="build.recap.Timestamp", bits=64, signed=True, unit="nanosecond", timezone="UTC") |
| TIMESTAMPLOCALTZ | IntType (logical="build.recap.Timestamp", bits=64, signed=True, unit="nanosecond", timezone=None) |
| INTERVAL_YEAR_MONTH | BytesType (logical="build.recap.Interval", bytes=12, signed=True, unit="month") |
| INTERVAL_DAY_TIME | BytesType (logical="build.recap.Interval", bytes=12, signed=True, unit="second") |
| MAP | MapType |
| ARRAY | ListType |
| UNIONTYPE | UnionType |
| STRUCT | StructType |

## Limitations and Constraints

The conversion functions raise a `ValueError` exception if the conversion is not possible.
5 changes: 5 additions & 0 deletions docs/readers/index.md
@@ -0,0 +1,5 @@
---
layout: default
title: "Readers"
has_children: true
---
56 changes: 56 additions & 0 deletions docs/readers/postgresql.md
@@ -0,0 +1,56 @@
---
layout: default
title: "PostgreSQL"
parent: "Readers"
---

# PostgreSQL
{: .no_toc }

1. TOC
{:toc}

The `PostgresqlReader` class is used to convert PostgreSQL table schemas to Recap types. The main method in this class is `to_recap`.

## `to_recap`

```python
def to_recap(self, table: str, schema: str, catalog: str) -> StructType
```

The `to_recap` method takes in the name of a PostgreSQL table, schema, and catalog, and returns a Recap `StructType` that represents the PostgreSQL table schema.

### Example

```python
from psycopg2 import connect
from recap.readers.postgresql import PostgresqlReader

connection = connect(database="my_database", user="my_user", password="my_password")
recap_schema = PostgresqlReader(connection).to_recap("my_table", "my_schema", "my_catalog")
```

In this example, `recap_schema` will be a `StructType` that represents the schema of `my_table` in `my_schema` within `my_catalog`.

## Type Conversion

This table shows the corresponding Recap types for each PostgreSQL type, along with the associated attributes:

| PostgreSQL Type | Recap Type |
|-----------------|------------------------------------|
| bigint, int8, bigserial, serial8 | IntType (bits=64, signed=True) |
| integer, int, int4, serial, serial4 | IntType (bits=32, signed=True) |
| smallint, smallserial, serial2 | IntType (bits=16, signed=True) |
| double precision, float8 | FloatType (bits=64) |
| real, float4 | FloatType (bits=32) |
| boolean | BoolType |
| text, json, jsonb, character varying, varchar | StringType (bytes_=OCTET_LENGTH, variable=True) |
| char | StringType (bytes_=OCTET_LENGTH, variable=False) |
| bytea, bit varying | BytesType (bytes_=MAX_FIELD_SIZE, variable=True) |
| bit | BytesType (bytes_=ceil(BIT_LENGTH / 8), variable=False) |
| timestamp | IntType(bits=64, logical="build.recap.Timestamp", unit=unit) |
| decimal, numeric | BytesType(logical="build.recap.Decimal", bytes_=32, variable=False, precision=NUMERIC_PRECISION, scale=NUMERIC_SCALE) |
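
For the fixed-width `bit` type, the byte width is derived from the declared bit length, as the row above shows. A minimal sketch of that calculation (illustrative only, not part of the recap API):

```python
# Sketch of the bytes_ computation for a fixed-width bit(n) column,
# per the table above: bytes_ = ceil(BIT_LENGTH / 8). Illustrative only.
import math

def bit_column_bytes(bit_length: int) -> int:
    """Byte width needed to hold a bit(n) column's value."""
    return math.ceil(bit_length / 8)
```

For example, a `bit(12)` column needs 2 bytes, while `bit(8)` fits in 1.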

## Limitations and Constraints

The conversion functions raise a `ValueError` exception if the conversion is not possible due to the PostgreSQL data type being unknown.
41 changes: 41 additions & 0 deletions docs/readers/snowflake.md
@@ -0,0 +1,41 @@
---
layout: default
title: "Snowflake"
parent: "Readers"
---

# Snowflake
{: .no_toc }

1. TOC
{:toc}

The `SnowflakeReader` class is used to convert Snowflake table schemas to Recap types. The main method in this class is `to_recap`.

## `to_recap`

```python
def to_recap(self, table: str, schema: str, catalog: str) -> StructType
```

The `to_recap` method is used to translate a specific Snowflake table to a `StructType` (a Recap type). The method takes the table name, schema, and catalog as arguments and uses these to query the Snowflake `information_schema.columns` view for the metadata of the specified table. It constructs a `StructType` from these column definitions, converting each column to the corresponding Recap type.

## Type Conversion

This table shows the corresponding Recap types for each Snowflake type, along with the associated attributes:

| Snowflake Type | Recap Type |
|-----------------|------------------------------------|
| float, float4, float8, double, double precision, real | FloatType (bits=64) |
| boolean | BoolType |
| number, decimal, numeric, int, integer, bigint, smallint, tinyint, byteint | BytesType (logical="build.recap.Decimal", bytes_=16, variable=False, precision=NUMERIC_PRECISION, scale=NUMERIC_SCALE) |
| varchar, string, text, nvarchar, nvarchar2, char varying, nchar varying | StringType (bytes_=OCTET_LENGTH, variable=True) |
| char, nchar, character | StringType (bytes_=OCTET_LENGTH, variable=True) |
| binary, varbinary, blob | BytesType (bytes_=OCTET_LENGTH) |
| date | IntType(bits=32, logical="build.recap.Date", unit="day") |
| timestamp, datetime | IntType(bits=64, logical="build.recap.Timestamp", unit=unit) |
| time | IntType(bits=32, logical="build.recap.Time", unit=unit) |
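
The mapping above can be sketched as a plain Python lookup. This is an illustrative toy, not part of the recap API: it returns only the Recap type name and ignores the attributes (bits, precision, scale, unit) that the real `SnowflakeReader` derives from `information_schema.columns`.

```python
# Toy sketch of the Snowflake-to-Recap type mapping in the table above.
# Illustrative only: the real SnowflakeReader also attaches attributes
# (bits, precision, scale, unit) read from information_schema.columns.

FLOAT_TYPES = {"float", "float4", "float8", "double", "double precision", "real"}
DECIMAL_TYPES = {
    "number", "decimal", "numeric", "int", "integer",
    "bigint", "smallint", "tinyint", "byteint",
}
STRING_TYPES = {
    "varchar", "string", "text", "nvarchar", "nvarchar2",
    "char varying", "nchar varying", "char", "nchar", "character",
}
BYTES_TYPES = {"binary", "varbinary", "blob"}

def snowflake_to_recap(snowflake_type: str) -> str:
    """Return the Recap type name for a Snowflake column type."""
    t = snowflake_type.strip().lower()
    if t in FLOAT_TYPES:
        return "FloatType"
    if t == "boolean":
        return "BoolType"
    if t in DECIMAL_TYPES:
        return "BytesType"  # logical="build.recap.Decimal"
    if t in STRING_TYPES:
        return "StringType"
    if t in BYTES_TYPES:
        return "BytesType"
    if t == "date":
        return "IntType"  # logical="build.recap.Date"
    if t in {"timestamp", "datetime"}:
        return "IntType"  # logical="build.recap.Timestamp"
    if t == "time":
        return "IntType"  # logical="build.recap.Time"
    raise ValueError(f"Unknown Snowflake type: {snowflake_type}")
```

The final `raise` mirrors the behavior described below: unknown Snowflake types are rejected with a `ValueError`.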

## Limitations and Constraints

The conversion functions raise a `ValueError` exception if the conversion is not possible due to the Snowflake data type being unknown.
