Skip to content

Latest commit

 

History

History
201 lines (140 loc) · 11 KB

README.md

File metadata and controls

201 lines (140 loc) · 11 KB

Mara DB

Build Status PyPI - License PyPI version Slack Status

Mini package for configuring and accessing multiple databases in a single project. Decouples the use of databases and their configuration by using "aliases" for databases.

The file mara_db/dbs.py contains abstract database configurations for PostgreSQL, Mysql, SQL Server, Oracle, SQLite and Big Query. The database connections of a project are configured by overwriting the databases function in mara_db/config.py:

import mara_db.config
import mara_db.dbs

## configure database connections for different aliases
mara_db.config.databases = lambda: {
    'mara': mara_db.dbs.PostgreSQLDB(host='localhost', user='root', database='mara'),
    'dwh': mara_db.dbs.PostgreSQLDB(database='dwh'),
    'source-1': mara_db.dbs.MysqlDB(host='some-localhost', database='my_app', user='dwh'),
    'source-2': mara_db.dbs.SQLServerDB(user='dwh_read', password='123abc', database='db1', host='some-sql-server')
}

## access individual database configurations with `dbs.db`:
print(mara_db.dbs.db('mara'))
# -> <PostgreSQLDB: host=localhost, database=mara>

 

Visualization of (PostgreSQL, MySQL, SQL Server) database schemas

mara_db/views.py contains a schema visualization for all configured databases using graphviz (currently PostgreSQL, Mysql and SQL Server only). It basically show tables of selected schemas together with the foreign key relations between them.

Schema visualization

For finding missing foreign key constraints, columns that follow a specific naming pattern (configurable via config.schema_ui_foreign_key_column_regex, default *_fk) and that are not part of foreign key constraints are drawn in pink.

 

Fast batch processing: Accessing databases with shell commands

The file mara_db/shell.py contains functions that create commands for accessing databases via their command line clients.

For example, the query_command function creates a shell command that can receive an SQL query from stdin and execute it:

import mara_db.shell

print(mara_db.shell.query_command('source-1'))
# -> mysql --default-character-set=utf8mb4 --user=dwh --host=some-localhost my_app

print(mara_db.shell.query_command('dwh', timezone='Europe/Lisbon', echo_queries=False))
# -> PGTZ=Europe/Lisbon PGOPTIONS=--client-min-messages=warning psql  --no-psqlrc --set ON_ERROR_STOP=on dwh

The function copy_to_stdout_command creates a shell command that receives a query on stdin and writes the result to stdout in tabular form:

print(mara_db.shell.copy_to_stdout_command('source-1'))
# -> mysql --default-character-set=utf8mb4 --user=dwh --host=some-localhost my_app --skip-column-names

Similarly, copy_from_stdin_command creates a client command that receives tabular data from stdin and and writes it to a target table:

print(mara_db.shell.copy_from_stdin_command('dwh', target_table='some_table', delimiter_char=';'))
# -> PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning psql --echo-all --no-psqlrc --set ON_ERROR_STOP=on dwh \
#      --command="COPY some_table FROM STDIN WITH DELIMITER AS ';'"

Finally, copy_command creates a shell command that receives a sql query from stdin, executes the query in source_db and then writes the result of to target_table in target_db:

print(mara_db.shell.copy_command('source-2', 'dwh', target_table='some_table'))
# -> sed 's/\\\\$/\$/g;s/\$/\\\\$/g' \
#   | sqsh  -U dwh_read -P 123abc -S some-sql-server -D db1 -m csv \
#   | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning psql --echo-all --no-psqlrc --set ON_ERROR_STOP=on dwh \
#         --command = "COPY some_table FROM STDIN WITH CSV HEADER"

 

The following command line clients are used to access the various databases:

Database Client binary Comments
Postgresql / Redshift psql Included in standard distributions.
MariaDB / Mysql mysql Included in standard distributions.
SQL Server sqsh
- or -
sqlcmd
sqsh: From https://sourceforge.net/projects/sqsh/, usually messy to get working. On ubuntu, use http://ppa.launchpad.net/jasc/sqsh/ubuntu/ backport. On Mac, try the homebrew version or install from source.
sqlcmd: Official Microsoft Utility for SQL Server. See sqlcmd Utility
Oracle sqlplus64 See the Oracle Instant Client homepage for details. On Mac, follow these instructions. Then sudo ln -s /usr/local/bin/sqlplus /usr/local/bin/sqlplus64 to make the binary accessible as sqlplus64.
SQLite sqlite3 Available in standard distributions. Version >3.20.x required (not the case on Ubuntu 14.04).
Big Query bq See the Google Cloud SDK page for details.
Snowflake snowsql See SnowSQL (CLI Client)
Databricks dbsqlcli Included when using package extra databricks via package databricks-sql-cli. See Databricks SQL CLI

 

Make it so! Auto-migration of SQLAlchemy models

Alembic has a feature that can create a diff between the state of a database and the ORM models of an application. This feature is used in mara_db/auto_migrate.py to automatically perform all necessary database transformations, without intermediate migration files:

# define a model / table
class MyTable(sqlalchemy.ext.declarative.declarative_base()):
    __tablename__ = 'my_table'
    my_table_id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    column_1 = sqlalchemy.Column(sqlalchemy.TEXT, nullable=False, index=True)


db = mara_db.dbs.SQLiteDB(file_name='/tmp/test.sqlite')

# create database and table
mara_db.auto_migration.auto_migrate(engine=mara_db.auto_migration.engine(db), models=[MyTable])
# ->
# Created database "sqlite:////tmp/test.sqlite"
#
# CREATE TABLE my_table (
#     my_table_id SERIAL NOT NULL,
#     column_1 TEXT NOT NULL,
#     PRIMARY KEY (my_table_id)
# );
#
# CREATE INDEX ix_my_table_column_1 ON my_table (column_1);

When the model is changed later, then auto_migrate creates a diff against the existing database and applies it:

# remove index and add another column
class MyTable(sqlalchemy.ext.declarative.declarative_base()):
    __tablename__ = 'my_table'
    my_table_id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    column_1 = sqlalchemy.Column(sqlalchemy.TEXT, nullable=False)
    column_2 = sqlalchemy.Column(sqlalchemy.Integer)

auto_migrate(engine=engine(db), models=[MyTable])
# ->
# ALTER TABLE my_table ADD COLUMN column_2 INTEGER;
#
# DROP INDEX ix_my_table_text_column_1;

Use with care! The are lot of changes that alembic auto-generate can not detect. We recommend testing each aut-migration on a staging system first before deploying to production. Sometimes manual migration scripts will be necessary.

Installation

pip install mara-db

or

pip install git+https://github.com/mara/mara-db.git

Optional: Installation of requirements for SQL Server

For usage with SQL Server, the python module pyodbc and a odbc driver (e.g. Microsoft ODBC Driver 17 for SQL Server) is required which is not included in the general requirement.

To see how to install pyodbc, take a look into this install guide. To see how to install ODBC 17, take a look into Installing the Microsoft ODBC Driver for SQL Server on Linux and macOS.

On Linux, you most likely will have to deal with an SSL issue, see this issue. A quick, dirty option in a test/development environment could be to disable the requirement for TLS 1.2.

Optional: Installation of requirements for BigQuery

For usage with BigQuery, the official bq and gcloud clients are required. See the Google Cloud SDK page for installation details.

Enabling the BigQuery API and Service account JSON credentials are also required as listed in the official documentation here.

One time authentication of the service-account used:

gcloud auth activate-service-account --key-file='path-to/service-account.json'

Optionally, for loading data from files into BigQuery, the gcloud_gcs_bucket_name can be specified in the database initialization. This will use the Google Cloud Storage bucket specified as cache for loading data and over-coming potential limitations. For more see loading-data. By default, files will directly loaded locally as described in loading-local-data.

A BigQuery context with a python cursor is also available on demand for easy access to BigQuery databases. In order to use, install the official Google python client library: google-cloud-bigquery.

Links