Merge pull request #86 from coursera/sb_docs: Documentation
Showing 30 changed files with 1,851 additions and 264 deletions.
# Changes in dataduct

### 0.2.0
- Travis integration for continuous builds
- QA steps and logging to S3
- Visualizing pipeline
- Dataduct CLI updated as a single entry point
- RDS connections for scripts
- Bootstrap step for pipelines
- Backfill or delay activation
- Output path and input path options
- Script directory for transform step
- SQL sanitization for DBA actions
- SQL parser for select and create table statements
- Logging across the library
- Support for custom steps
- Pipeline dependency step
- Reduce verbosity of imports
- Step parsing is isolated in steps
- More examples for steps
- QA step functions added
- Visualization of pipelines
- Sync config with S3
- Config overrides with modes
- Rename keywords and safe config failure handling
- MySQL and Redshift connection support
- EMR streaming support with Hadoop 2
- Custom EMR job step
- Support for input_path to steps to directly create S3Nodes
- Transform step to support directory based installs
- Exceptions cleanup
- Read the Docs support
- Creating tables automatically for various steps
- History table support
- EC2 and EMR config control from YAML
- Slack integration
- Support for Regions in DP

### 0.1.0
- Initial version of the dataduct library released
- Support for the following steps:
  - emr_streaming
  - extract-local
  - extract-s3
  - extract-rds
  - extract-redshift
  - load-redshift
  - sql-command
  - transform
- Examples and documentation added for all the steps

Config
======

All the dataduct settings are controlled from a single config file that
stores credentials as well as the various options.

The config file is read from the following locations, in the specified
order of priority (a small resolution sketch follows the list):

1. ``/etc/dataduct.cfg``
2. ``~/.dataduct``
3. ``DATADUCT_CONFIG_PATH`` environment variable
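
A minimal sketch of how this lookup could be resolved, assuming the first
existing path in the list above takes precedence (illustrative only, not
dataduct's actual loader):

.. code:: python

    import os

    # Candidate locations, highest priority first (an assumption based on
    # the list above); the environment variable entry is skipped when unset.
    CANDIDATE_PATHS = [
        '/etc/dataduct.cfg',
        os.path.expanduser('~/.dataduct'),
        os.environ.get('DATADUCT_CONFIG_PATH'),
    ]

    def find_config_path():
        """Return the first candidate config path that exists."""
        for path in CANDIDATE_PATHS:
            if path and os.path.exists(path):
                return path
        raise IOError('No dataduct config file found')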

Minimum example config:

.. code:: YAML

    ec2:
      INSTANCE_TYPE: m1.large
      ETL_AMI: ami-05355a6c # Default AMI used by data pipeline - Python 2.6
      SECURITY_GROUP: FILL_ME_IN
    emr:
      MASTER_INSTANCE_TYPE: m1.large
      NUM_CORE_INSTANCES: 1
      CORE_INSTANCE_TYPE: m1.large
      CLUSTER_AMI: 3.1.0
    etl:
      S3_ETL_BUCKET: FILL_ME_IN
      ROLE: FILL_ME_IN
      RESOURCE_ROLE: FILL_ME_IN

Config Parameters
-----------------

Bootstrap
~~~~~~~~~

.. code:: YAML

    bootstrap:
      ec2:
      - step_type: transform
        command: echo "Welcome to dataduct"
        no_output: true
      emr:
      - step_type: transform
        command: echo "Welcome to dataduct"
        no_output: true

Bootstrap steps are a chain of steps that are executed before any other
step in the data pipeline. They can be used to copy files from S3 or to
install libraries on the resource. At Coursera we use this to download
some binaries from S3 that are required by some of the transformations.

Note that the EMR bootstrap is only executed on the master node. If you
want to install something on the task nodes, use the bootstrap parameter
in the ``emr_cluster_config`` of your pipeline definition instead.

Custom Steps
~~~~~~~~~~~~

::

    custom_steps:
    - class_name: CustomExtractLocalStep
      file_path: custom_extract_local.py
      step_type: custom-extract-local

Custom steps are steps that are not part of dataduct but are created to
augment the functionality it provides. At Coursera these are often steps
that inherit from an existing step class but abstract away some of the
functionality so that multiple pipelines don't have to repeat the same
thing.

The ``file_path`` can be an absolute path or a path relative to the
``CUSTOM_STEPS_PATH`` defined in the ETL parameters. The step classes are
imported dynamically based on the config, and the ``step_type`` field is
the one that is matched when parsing a pipeline definition.
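
A rough sketch of the kind of dynamic import this implies (Python 3
``importlib``; an illustration of the mechanism, not dataduct's actual
registry code):

.. code:: python

    import importlib.util
    import os

    def load_custom_step(class_name, file_path, step_type,
                         custom_steps_path='~/dataduct/examples/steps'):
        """Import a step class from a python file and key it by step_type."""
        if not os.path.isabs(file_path):
            # Relative paths are resolved against CUSTOM_STEPS_PATH
            file_path = os.path.join(
                os.path.expanduser(custom_steps_path), file_path)
        spec = importlib.util.spec_from_file_location(class_name, file_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return step_type, getattr(module, class_name)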

Database
~~~~~~~~

::

    database:
      permissions:
      - user: admin
        permission: all
      - group: consumer_group
        permission: select

Some steps such as ``upsert`` or ``create-load-redshift`` create tables
and grant the appropriate permissions on them, so that one does not have
to create tables prior to running the ETL. ``permission`` is the
privilege granted on the table or view to the ``user`` or ``group``. If
both are specified then both grant statements are executed.
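
A hedged sketch of the SQL this configuration expands to for a freshly
created table (illustrative only; the table name and helper are made up):

.. code:: python

    def grant_statements(table_name, permissions):
        """Build one GRANT per user/group entry in the permissions config."""
        statements = []
        for entry in permissions:
            privilege = entry['permission'].upper()
            if 'user' in entry:
                statements.append('GRANT %s ON %s TO %s'
                                  % (privilege, table_name, entry['user']))
            if 'group' in entry:
                statements.append('GRANT %s ON %s TO GROUP %s'
                                  % (privilege, table_name, entry['group']))
        return statements

    # e.g. grant_statements('daily_loads', config['database']['permissions'])
    # -> ['GRANT ALL ON daily_loads TO admin',
    #     'GRANT SELECT ON daily_loads TO GROUP consumer_group']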

EC2
~~~

::

    ec2:
      INSTANCE_TYPE: m1.small
      ETL_AMI: ami-05355a6c # Default AMI used by data pipeline - Python 2.6
      SECURITY_GROUP: FILL_ME_IN

The ec2 config controls the configuration of the ec2-resource started by
the data pipeline. You can override these settings for a specific
pipeline with ``ec2_resource_config`` in its pipeline definition.

EMR
~~~

::

    emr:
      CLUSTER_AMI: 3.1.0
      CLUSTER_TIMEOUT: 6 Hours
      CORE_INSTANCE_TYPE: m1.large
      NUM_CORE_INSTANCES: 1
      HADOOP_VERSION: 2.4.0
      HIVE_VERSION: null
      MASTER_INSTANCE_TYPE: m3.xlarge
      PIG_VERSION: null
      TASK_INSTANCE_BID_PRICE: null
      TASK_INSTANCE_TYPE: m1.large

The emr config controls the configuration of the emr-resource started by
the data pipeline.

ETL
~~~

::

    etl:
      CONNECTION_RETRIES: 2
      CUSTOM_STEPS_PATH: ~/dataduct/examples/steps
      DAILY_LOAD_TIME: 1
      KEY_PAIR: FILL_ME_IN
      MAX_RETRIES: 2
      NAME_PREFIX: dev
      QA_LOG_PATH: qa
      DP_INSTANCE_LOG_PATH: dp_instances
      DP_PIPELINE_LOG_PATH: dp_pipelines
      DP_QA_TESTS_LOG_PATH: dba_table_qa_tests
      RESOURCE_BASE_PATH: ~/dataduct/examples/resources
      RESOURCE_ROLE: FILL_ME_IN
      RETRY_DELAY: 10 Minutes
      REGION: us-east-1
      ROLE: FILL_ME_IN
      S3_BASE_PATH: dev
      S3_ETL_BUCKET: FILL_ME_IN
      SNS_TOPIC_ARN_FAILURE: null
      SNS_TOPIC_ARN_WARNING: null
      FREQUENCY_OVERRIDE: one-time
      DEPENDENCY_OVERRIDE: false
      slack:
        api_token: FILL_ME_IN
        channel_name: "#dataduct"
        username: FILL_ME_IN
        bot_username: Dataduct Bot
      TAGS:
        env:
          string: dev
        Name:
          variable: name

This is the core parameter object which controls the ETL at a high
level. The parameters are explained below:

- ``CONNECTION_RETRIES``: Number of retries for database connections,
  used to paper over transient connection errors.
- ``CUSTOM_STEPS_PATH``: Directory used to resolve custom steps that are
  specified with a relative path.
- ``DAILY_LOAD_TIME``: Default time of day at which pipelines run.
- ``KEY_PAIR``: SSH key pair used on both the ec2 and the emr resource.
- ``MAX_RETRIES``: Number of retries for pipeline activities.
- ``NAME_PREFIX``: Prefix added to all pipeline names.
- ``QA_LOG_PATH``: Path prefix for QA steps when logging output to S3.
- ``DP_INSTANCE_LOG_PATH``: Path prefix under which DP instances are
  logged before being destroyed.
- ``DP_PIPELINE_LOG_PATH``: Path prefix under which DP pipelines are
  logged.
- ``DP_QA_TESTS_LOG_PATH``: Path prefix under which QA tests are logged.
- ``RESOURCE_BASE_PATH``: Directory used to resolve relative resource
  paths.
- ``RESOURCE_ROLE``: Resource role needed for DP.
- ``RETRY_DELAY``: Delay between activity retries.
- ``REGION``: Region to run the data pipeline in.
- ``ROLE``: Role needed for DP.
- ``S3_BASE_PATH``: Prefix used for all S3 paths that are created
  anywhere. This is used to split logs across multiple developers or
  across production and dev.
- ``S3_ETL_BUCKET``: S3 bucket used for DP data, logs, source code etc.
- ``SNS_TOPIC_ARN_FAILURE``: SNS topic to trigger for failed steps or
  pipelines.
- ``SNS_TOPIC_ARN_WARNING``: SNS topic to trigger for failed QA checks.
- ``FREQUENCY_OVERRIDE``: Overrides every frequency given in a pipeline
  definition unless overridden from the CLI.
- ``DEPENDENCY_OVERRIDE``: Ignores the dependency step if set to true.
- ``slack``: Configuration for posting a message on Slack whenever a
  pipeline is run.
- ``TAGS``: Tags to be added to the pipeline. The first key is the tag
  name and the nested key is the type: if the type is ``string`` the
  value is passed through directly, and if the type is ``variable`` it
  is looked up on the pipeline object (see the sketch after this list).
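
A small sketch of that tag resolution (assumed behaviour, with a
hypothetical pipeline object exposing a ``name`` attribute):

.. code:: python

    def resolve_tags(tag_config, pipeline):
        """Turn the TAGS config into concrete key/value pairs."""
        tags = {}
        for tag_name, spec in tag_config.items():
            if 'string' in spec:
                tags[tag_name] = spec['string']      # literal value
            elif 'variable' in spec:
                # look the variable up on the pipeline object
                tags[tag_name] = getattr(pipeline, spec['variable'])
        return tags

    # With the config above: {'env': 'dev', 'Name': pipeline.name}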

Logging
~~~~~~~

::

    logging:
      CONSOLE_DEBUG_LEVEL: INFO
      FILE_DEBUG_LEVEL: DEBUG
      LOG_DIR: ~/.dataduct
      LOG_FILE: dataduct.log

Settings that specify where the logs should be written and the debug
levels used for console and file output during library execution.
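
For illustration, a minimal stand-alone logger wired up the way these
keys suggest (not dataduct's actual logging module):

.. code:: python

    import logging
    import os

    def setup_logging(log_dir='~/.dataduct', log_file='dataduct.log',
                      console_level='INFO', file_level='DEBUG'):
        """Console handler at one level, file handler at another."""
        logger = logging.getLogger('dataduct')
        logger.setLevel(logging.DEBUG)

        console = logging.StreamHandler()
        console.setLevel(getattr(logging, console_level))
        logger.addHandler(console)

        log_dir = os.path.expanduser(log_dir)
        if not os.path.isdir(log_dir):
            os.makedirs(log_dir)
        file_handler = logging.FileHandler(os.path.join(log_dir, log_file))
        file_handler.setLevel(getattr(logging, file_level))
        logger.addHandler(file_handler)
        return logger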

MySQL
~~~~~

::

    mysql:
      host_alias_1:
        HOST: FILL_ME_IN
        PASSWORD: FILL_ME_IN
        USERNAME: FILL_ME_IN
      host_alias_2:
        HOST: FILL_ME_IN
        PASSWORD: FILL_ME_IN
        USERNAME: FILL_ME_IN

RDS (MySQL) database connections are stored in this parameter. Pipeline
definitions can refer to a host by its host alias. ``HOST`` refers to
the full database hostname inside AWS.
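
A trivial sketch of what the alias lookup amounts to (an illustrative
helper, not part of dataduct):

.. code:: python

    def mysql_credentials(config, host_alias):
        """Return (host, username, password) for a configured alias."""
        creds = config['mysql'][host_alias]
        return creds['HOST'], creds['USERNAME'], creds['PASSWORD']

    # host, user, password = mysql_credentials(config, 'host_alias_1')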

Redshift
~~~~~~~~

::

    redshift:
      CLUSTER_ID: FILL_ME_IN
      DATABASE_NAME: FILL_ME_IN
      HOST: FILL_ME_IN
      PASSWORD: FILL_ME_IN
      USERNAME: FILL_ME_IN
      PORT: FILL_ME_IN

Redshift database credentials used by all the steps that interact with
the warehouse. ``CLUSTER_ID`` is the first component of ``HOST``, as
``RedshiftNode`` uses it in a few places to identify the cluster.
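
In other words (a one-line illustration, with a made-up hostname):

.. code:: python

    def cluster_id_from_host(host):
        """'my-cluster.abc123.us-east-1.redshift.amazonaws.com' -> 'my-cluster'"""
        return host.split('.')[0]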

Modes
~~~~~

::

    production:
      etl:
        S3_BASE_PATH: prod

Modes define override settings for running a pipeline. As the config is
a singleton, we can declare the overrides once and the updated settings
take effect across all use cases.

In this example we have a mode called ``production`` in which
``S3_BASE_PATH`` is overridden to ``prod`` instead of whatever value was
specified in the defaults.
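
A minimal sketch of that override semantics, assuming a plain nested
dict for the config (not dataduct's actual Config class):

.. code:: python

    def apply_mode(config, overrides):
        """Recursively merge a mode's overrides into the base config."""
        for key, value in overrides.items():
            if isinstance(value, dict) and isinstance(config.get(key), dict):
                apply_mode(config[key], value)
            else:
                config[key] = value
        return config

    # apply_mode(config, {'etl': {'S3_BASE_PATH': 'prod'}})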

At Coursera, one of the uses for modes is to switch from the dev
Redshift cluster to the production one when we deploy a new ETL.