1 parent f1e8d32, commit 395a925. Showing 9 changed files with 278 additions and 3 deletions.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
--- | ||
layout: default | ||
title: "BigQuery" | ||
parent: "Readers" | ||
--- | ||
|
||
# BigQuery | ||
{: .no_toc } | ||
|
||
1. TOC | ||
{:toc} | ||
|
||
The `BigQueryReader` class is used to convert BigQuery table schemas to Recap types. The main method in this class is `to_recap`. | ||
|
||
## `to_recap` | ||
|
||
```python | ||
def to_recap(self, dataset: str, table: str) -> StructType | ||
``` | ||
|
||
The `to_recap` method takes in the name of a BigQuery dataset and table, and returns a Recap `StructType` that represents the BigQuery table schema. | ||
|
||
### Example | ||
|
||
```python | ||
from google.cloud import bigquery | ||
from recap.readers.bigquery import BigQueryReader | ||
|
||
client = bigquery.Client() | ||
recap_schema = BigQueryReader(client).to_recap("my_dataset", "my_table") | ||
``` | ||
|
||
In this example, `recap_schema` will be a `StructType` that represents the schema of `my_table` in `my_dataset`. | ||
|
||
## Type Conversion | ||
|
||
This table shows the corresponding Recap types for each BigQuery type, along with the associated attributes: | ||
|
||
| BigQuery Type | Recap Type | | ||
|---------------|------------------------------------| | ||
| STRING, JSON | StringType (bytes <= 65_536) | | ||
| BYTES | BytesType (bytes <= 65_536) | | ||
| INT64, INTEGER, INT, SMALLINT, TINYINT, BYTEINT | IntType (bits=64, signed=True) | | ||
| FLOAT, FLOAT64 | FloatType (bits=64) | | ||
| BOOLEAN | BoolType | | ||
| TIMESTAMP, DATETIME | IntType (logical="build.recap.Timestamp", bits=64, unit="microsecond") | | ||
| TIME | IntType (logical="build.recap.Time", bits=32, unit="microsecond") | | ||
| DATE | IntType (logical="build.recap.Date", bits=32, unit="day") | | ||
| RECORD, STRUCT | StructType | | ||
| NUMERIC, DECIMAL | BytesType (logical="build.recap.Decimal", bytes=16, variable=False, precision <= 38, scale <= 0) | | ||
| BIGNUMERIC, BIGDECIMAL | BytesType (logical="build.recap.Decimal", bytes=32, variable=False, precision <= 76, scale <= 0) | | ||
|
||
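The fixed widths in the last two rows can be checked with a little arithmetic. The sketch below is not part of Recap; `min_decimal_bytes` is a hypothetical helper, and it assumes the unscaled decimal value is stored as a signed two's-complement integer, which this page does not state.

```python
import math


def min_decimal_bytes(precision: int) -> int:
    """Smallest byte width that holds any signed integer with `precision` digits.

    Assumption: the unscaled value is a signed two's-complement integer.
    """
    largest = 10**precision - 1          # largest magnitude with that many digits
    bits = largest.bit_length() + 1      # one extra bit for the sign
    return math.ceil(bits / 8)


print(min_decimal_bytes(38))  # NUMERIC / DECIMAL
print(min_decimal_bytes(76))  # BIGNUMERIC / BIGDECIMAL
```

Under that assumption, a precision of 38 needs 16 bytes and a precision of 76 needs 32 bytes, which matches the `bytes` values in the table.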
## Limitations and Constraints | ||
|
||
The conversion functions raise a `ValueError` exception if the conversion is not possible. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
--- | ||
layout: default | ||
title: "Confluent Schema Registry" | ||
parent: "Readers" | ||
--- | ||
|
||
# Confluent Schema Registry | ||
{: .no_toc } | ||
|
||
1. TOC | ||
{:toc} | ||
|
||
The `ConfluentRegistryReader` class is used to convert schemas registered in a Confluent Schema Registry to Recap types. The main method in this class is `to_recap`. | ||
|
||
## `to_recap` | ||
|
||
```python | ||
def to_recap(self, topic: str) -> StructType | ||
``` | ||
|
||
The `to_recap` method takes in the name of a Kafka topic, fetches the associated schema from the Confluent Schema Registry, and converts it to a Recap `StructType`. The method supports Avro, JSON Schema, and Protobuf schemas. | ||
|
||
### Example | ||
|
||
```python | ||
from confluent_kafka.schema_registry import SchemaRegistryClient | ||
from recap.readers.confluent_registry import ConfluentRegistryReader | ||
|
||
registry = SchemaRegistryClient({"url": "http://my-registry:8081"}) | ||
recap_schema = ConfluentRegistryReader(registry).to_recap("my_topic") | ||
``` | ||
|
||
In this example, `recap_schema` will be a `StructType` that represents the schema of the value of messages in `my_topic`. | ||
|
||
## Type Conversion | ||
|
||
The `to_recap` method uses the `AvroConverter`, `JSONSchemaConverter`, and `ProtobufConverter` classes to convert schemas, based on their type. | ||
|
||
Please see the individual documentation for these classes for information on how they convert types: | ||
|
||
- Avro: [AvroConverter]({{site.baseurl}}/docs/converters/avro) | ||
- JSON schema: [JSONSchemaConverter]({{site.baseurl}}/docs/converters/json-schema) | ||
- Protocol Buffers: [ProtobufConverter]({{site.baseurl}}/docs/converters/protobuf) | ||
|
||
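The dispatch described above can be sketched in plain Python. This is an illustration only, not the reader's actual code: `value_subject` and `converter_name` are hypothetical helpers, the `<topic>-value` subject name assumes Confluent's default `TopicNameStrategy`, and the keys are the schema type strings the registry reports (`AVRO`, `JSON`, `PROTOBUF`).

```python
# Hypothetical mapping from registry schema type to the converter class name.
CONVERTER_BY_SCHEMA_TYPE = {
    "AVRO": "AvroConverter",
    "JSON": "JSONSchemaConverter",
    "PROTOBUF": "ProtobufConverter",
}


def value_subject(topic: str) -> str:
    """Subject for a topic's message values (assumes TopicNameStrategy)."""
    return f"{topic}-value"


def converter_name(schema_type: str) -> str:
    """Pick a converter by schema type, failing loudly on anything else."""
    try:
        return CONVERTER_BY_SCHEMA_TYPE[schema_type]
    except KeyError:
        raise ValueError(f"Unsupported schema type: {schema_type}") from None


print(value_subject("my_topic"))
print(converter_name("AVRO"))
```

If the registry uses a different subject naming strategy, the subject lookup in this sketch would not apply.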
## Limitations and Constraints | ||
|
||
1. ConfluentRegistryReader does not support [schema references](https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/index.html#schema-references). | ||
|
||
The conversion functions raise a `ValueError` exception if the conversion is not possible. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
--- | ||
layout: default | ||
title: "Hive Metastore" | ||
parent: "Readers" | ||
--- | ||
|
||
# Hive Metastore | ||
{: .no_toc } | ||
|
||
1. TOC | ||
{:toc} | ||
|
||
The `HiveMetastoreReader` class is used to convert Hive table schemas into Recap types. This class can also be used to fetch and convert table statistics from Hive Metastore. | ||
|
||
## `to_recap` | ||
|
||
```python | ||
def to_recap( | ||
    self, | ||
    database_name: str, | ||
    table_name: str, | ||
    include_stats: bool = False, | ||
) -> StructType | ||
``` | ||
|
||
The `to_recap` method takes in the name of a database and a table within that database, retrieves the associated schema from the Hive Metastore, and converts it into a Recap `StructType`. If `include_stats` is `True`, the method also fetches table statistics from the Hive Metastore and includes them in the returned `StructType`. | ||
|
||
### Example | ||
|
||
```python | ||
from pymetastore.metastore import HMS | ||
from recap.readers.hive_metastore import HiveMetastoreReader | ||
|
||
with HMS.create("localhost", 9093) as client: | ||
    recap_schema = HiveMetastoreReader(client).to_recap("my_database", "my_table") | ||
``` | ||
|
||
In this example, `recap_schema` will be a `StructType` that represents the schema of the `my_table` table in the `my_database` database. | ||
|
||
## Type Conversion | ||
|
||
| Hive Type | Recap Type | | ||
|------------------------------------|------------------------------------| | ||
| BOOLEAN | BoolType | | ||
| BYTE | IntType (bits=8) | | ||
| SHORT | IntType (bits=16) | | ||
| INT | IntType (bits=32) | | ||
| LONG | IntType (bits=64) | | ||
| FLOAT | FloatType (bits=32) | | ||
| DOUBLE | FloatType (bits=64) | | ||
| VOID | NullType | | ||
| STRING | StringType (bytes <= 9_223_372_036_854_775_807) | | ||
| BINARY | BytesType (bytes <= 2_147_483_647) | | ||
| DECIMAL | BytesType (logical="build.recap.Decimal", bytes=16, variable=False, precision, scale) | | ||
| VARCHAR | StringType (bytes=length) | | ||
| CHAR | StringType (bytes=length, variable=False) | | ||
| DATE | IntType (logical="build.recap.Date", bits=32, signed=True, unit="day") | | ||
| TIMESTAMP | IntType (logical="build.recap.Timestamp", bits=64, signed=True, unit="nanosecond", timezone="UTC") | | ||
| TIMESTAMPLOCALTZ | IntType (logical="build.recap.Timestamp", bits=64, signed=True, unit="nanosecond", timezone=None) | | ||
| INTERVAL_YEAR_MONTH | BytesType (logical="build.recap.Interval", bytes=12, signed=True, unit="month") | | ||
| INTERVAL_DAY_TIME | BytesType (logical="build.recap.Interval", bytes=12, signed=True, unit="second") | | ||
| MAP | MapType | | ||
| ARRAY | ListType | | ||
| UNIONTYPE | UnionType | | ||
| STRUCT | StructType | | ||
|
||
## Limitations and Constraints | ||
|
||
The conversion functions raise a `ValueError` exception if the conversion is not possible. |
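The failure mode described above can be illustrated with a small lookup that mirrors the integer rows of the type table. This is a sketch only: `HIVE_INT_BITS` and `int_bits` are hypothetical names and are not part of `HiveMetastoreReader`.

```python
# Integer rows of the Hive type table, as a plain dictionary (illustrative only).
HIVE_INT_BITS = {"BYTE": 8, "SHORT": 16, "INT": 32, "LONG": 64}


def int_bits(hive_type: str) -> int:
    """Return the IntType bit width for a Hive integer type name."""
    key = hive_type.upper()
    if key not in HIVE_INT_BITS:
        # Mirrors the documented behavior: unknown types raise ValueError.
        raise ValueError(f"Unsupported Hive type: {hive_type}")
    return HIVE_INT_BITS[key]


try:
    int_bits("GEOMETRY")
except ValueError as err:
    print(f"conversion failed: {err}")
```

Callers that process many tables may want to catch `ValueError` per table so that one unsupported column does not stop the whole run.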
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
--- | ||
layout: default | ||
title: "Readers" | ||
has_children: true | ||
--- |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
--- | ||
layout: default | ||
title: "PostgreSQL" | ||
parent: "Readers" | ||
--- | ||
|
||
# PostgreSQL | ||
{: .no_toc } | ||
|
||
1. TOC | ||
{:toc} | ||
|
||
The `PostgresqlReader` class is used to convert PostgreSQL table schemas to Recap types. The main method in this class is `to_recap`. | ||
|
||
## `to_recap` | ||
|
||
```python | ||
def to_recap(self, table: str, schema: str, catalog: str) -> StructType | ||
``` | ||
|
||
The `to_recap` method takes in the name of a PostgreSQL table, schema, and catalog, and returns a Recap `StructType` that represents the PostgreSQL table schema. | ||
|
||
### Example | ||
|
||
```python | ||
from psycopg2 import connect | ||
from recap.readers.postgresql import PostgresqlReader | ||
|
||
connection = connect(database="my_database", user="my_user", password="my_password") | ||
recap_schema = PostgresqlReader(connection).to_recap("my_table", "my_schema", "my_catalog") | ||
``` | ||
|
||
In this example, `recap_schema` will be a `StructType` that represents the schema of `my_table` in `my_schema` within `my_catalog`. | ||
|
||
## Type Conversion | ||
|
||
This table shows the corresponding Recap types for each PostgreSQL type, along with the associated attributes: | ||
|
||
| PostgreSQL Type | Recap Type | | ||
|-----------------|------------------------------------| | ||
| bigint, int8, bigserial, serial8 | IntType (bits=64, signed=True) | | ||
| integer, int, int4, serial, serial4 | IntType (bits=32, signed=True) | | ||
| smallint, smallserial, serial2 | IntType (bits=16, signed=True) | | ||
| double precision, float8 | FloatType (bits=64) | | ||
| real, float4 | FloatType (bits=32) | | ||
| boolean | BoolType | | ||
| text, json, jsonb, character varying, varchar | StringType (bytes_=OCTET_LENGTH, variable=True) | | ||
| char | StringType (bytes_=OCTET_LENGTH, variable=False) | | ||
| bytea, bit varying | BytesType (bytes_=MAX_FIELD_SIZE, variable=True) | | ||
| bit | BytesType (bytes_=ceil(BIT_LENGTH / 8), variable=False) | | ||
| timestamp | IntType (bits=64, logical="build.recap.Timestamp", unit=unit) | | ||
| decimal, numeric | BytesType (logical="build.recap.Decimal", bytes_=32, variable=False, precision=NUMERIC_PRECISION, scale=NUMERIC_SCALE) | | ||
|
||
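The `bit` row rounds the column's bit length up to whole bytes. A minimal sketch of that arithmetic follows; `bit_column_bytes` is a hypothetical helper name, not a Recap function.

```python
import math


def bit_column_bytes(bit_length: int) -> int:
    """Bytes needed for a fixed-width bit(n) column: ceil(n / 8)."""
    return math.ceil(bit_length / 8)


print(bit_column_bytes(10))  # a bit(10) column
```

For example, a `bit(10)` column does not fit in one byte, so it maps to a `BytesType` with `bytes_=2`.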
## Limitations and Constraints | ||
|
||
The conversion functions raise a `ValueError` if they encounter an unknown PostgreSQL data type. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
--- | ||
layout: default | ||
title: "Snowflake" | ||
parent: "Readers" | ||
--- | ||
|
||
# Snowflake | ||
{: .no_toc } | ||
|
||
1. TOC | ||
{:toc} | ||
|
||
The `SnowflakeReader` class is used to convert Snowflake table schemas to Recap types. The main method in this class is `to_recap`. | ||
|
||
## `to_recap` | ||
|
||
```python | ||
def to_recap(self, table: str, schema: str, catalog: str) -> StructType | ||
``` | ||
|
||
The `to_recap` method converts a Snowflake table to a Recap `StructType`. It takes the table name, schema, and catalog as arguments, queries the Snowflake `information_schema.columns` view for the table's column metadata, and converts each column definition to the corresponding Recap type. | ||
|
||
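The metadata lookup described above might look roughly like the query below. This is a sketch under assumptions: the exact query Recap issues is not shown on this page, `columns_query` is a hypothetical helper, the selected columns are standard `information_schema.columns` fields, and the `%s` placeholder style depends on the database driver in use.

```python
from typing import Tuple


def columns_query(table: str, schema: str, catalog: str) -> Tuple[str, Tuple[str, str, str]]:
    """Build a parameterized lookup against information_schema.columns."""
    sql = (
        "SELECT column_name, data_type, is_nullable, "
        "character_octet_length, numeric_precision, numeric_scale "
        "FROM information_schema.columns "
        "WHERE table_name = %s AND table_schema = %s AND table_catalog = %s "
        "ORDER BY ordinal_position"
    )
    # Values are passed separately so the driver handles quoting.
    return sql, (table, schema, catalog)


sql, params = columns_query("my_table", "my_schema", "my_catalog")
print(sql)
print(params)
```

Each returned row would then be mapped to a Recap type using the conversion table in the next section.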
## Type Conversion | ||
|
||
This table shows the corresponding Recap types for each Snowflake type, along with the associated attributes: | ||
|
||
| Snowflake Type | Recap Type | | ||
|-----------------|------------------------------------| | ||
| float, float4, float8, double, double precision, real | FloatType (bits=64) | | ||
| boolean | BoolType | | ||
| number, decimal, numeric, int, integer, bigint, smallint, tinyint, byteint | BytesType (logical="build.recap.Decimal", bytes_=16, variable=False, precision=NUMERIC_PRECISION, scale=NUMERIC_SCALE) | | ||
| varchar, string, text, nvarchar, nvarchar2, char varying, nchar varying | StringType (bytes_=OCTET_LENGTH, variable=True) | | ||
| char, nchar, character | StringType (bytes_=OCTET_LENGTH, variable=True) | | ||
| binary, varbinary, blob | BytesType (bytes_=OCTET_LENGTH) | | ||
| date | IntType (bits=32, logical="build.recap.Date", unit="day") | | ||
| timestamp, datetime | IntType (bits=64, logical="build.recap.Timestamp", unit=unit) | | ||
| time | IntType (bits=32, logical="build.recap.Time", unit=unit) | | ||
|
||
## Limitations and Constraints | ||
|
||
The conversion functions raise a `ValueError` if they encounter an unknown Snowflake data type. |