
Merge pull request #86 from coursera/sb_docs
Documentation
sb2nov committed Mar 20, 2015
2 parents 9b7319a + a2bd9f6 commit a9e65bb
Showing 30 changed files with 1,851 additions and 264 deletions.
49 changes: 30 additions & 19 deletions CHANGES.md
@@ -1,33 +1,44 @@
# Changes in dataduct

### 0.1.0
- Initial version of the dataduct library released
- Support for the following steps:
- emr_streaming
- extract-local
- extract-s3
- extract-rds
- extract-redshift
- load-redshift
- sql-command
- transform
- Examples and documentation added for all the steps

### 0.2.0
- Travis integration for continuous builds
- QA steps and logging to S3
- Visualizing pipeline
- Dataduct CLI updated as a single entry point
- RDS connections for scripts
- Bootstrap step for pipelines
- Backfill or delay activation
- Output path and input path options
- Script directory for transform step
- SQL sanitization for DBA actions
- SQL parser for select and create table statements
- Logging across the library
- Support for custom steps
- Pipeline dependency step
- Reduce verbosity of imports
- Step parsing is isolated in steps
- More examples for steps
- QA step functions added
- Visualization of pipelines
- Sync config with S3
- Config overrides with modes
- Rename keywords and safe config failure handling
- MySQL and Redshift connection support
- EMR Streaming support with hadoop 2
- Custom EMR job step
- Support for input_path to steps to directly create S3Nodes
- Transform step to support directory based installs
- Exceptions cleanup
- Read the docs support
- Creating tables automatically for various steps
- History table support
- EC2 and EMR config control from YAML
- Slack integration
- Support for Regions in DP

### 0.1.0
- Initial version of the dataduct library released
- Support for the following steps:
- emr_streaming
- extract-local
- extract-s3
- extract-rds
- extract-redshift
- load-redshift
- sql-command
- transform
- Examples and documentation added for all the steps
4 changes: 2 additions & 2 deletions dataduct/config/credentials.py
@@ -10,8 +10,8 @@ def get_aws_credentials_from_iam():
"""Get aws credentials using the IAM api
Note: this script only runs on an EC2 instance with the appropriate
resource roles. For more information, see the following:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/\
AESDG-chapter-instancedata.html
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/\
AESDG-chapter-instancedata.html
Returns:
access_key(str): AWS access key
8 changes: 4 additions & 4 deletions dataduct/pipeline/utils.py
@@ -47,16 +47,16 @@ def get_response_from_boto(fn, *args, **kwargs):
Args:
func(function): Function to call
*args(optional): arguments
**kwargs(optional): keyword arguments
args(optional): arguments
kwargs(optional): keyword arguments
Returns:
response(json): request response.
Input:
func(function): Function to call
*args(optional): arguments
**kwargs(optional): keyword arguments
args(optional): arguments
kwargs(optional): keyword arguments
"""

response = None
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -260,7 +260,7 @@
# dir menu entry, description, category)
texinfo_documents = [
('index', 'dataduct', u'dataduct Documentation',
u'Coursera', 'dataduct', 'One line description of project.',
u'Coursera', 'dataduct', 'DataPipeline for Humans.',
'Miscellaneous'),
]

288 changes: 288 additions & 0 deletions docs/config.rst
@@ -0,0 +1,288 @@
Config
======

All the dataduct settings are controlled from a single config file that
stores credentials as well as various other settings.

The config file is read from the following places, in the specified order
of priority (a minimal lookup sketch follows the list).

1. ``/etc/dataduct.cfg``
2. ``~/.dataduct``
3. ``DATADUCT_CONFIG_PATH`` environment variable
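
A minimal lookup sketch for these locations, assuming the first existing
entry in the order above is the one used; dataduct's own config loader may
resolve precedence differently:

.. code:: python

    # Hypothetical sketch of the documented lookup order; not dataduct's
    # actual config loader.
    import os

    def find_config_path():
        """Return the first existing config location in the listed order."""
        candidates = [
            '/etc/dataduct.cfg',
            os.path.expanduser('~/.dataduct'),
            os.environ.get('DATADUCT_CONFIG_PATH'),
        ]
        for path in candidates:
            if path and os.path.exists(path):
                return path
        return None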

Minimum example config:

.. code:: YAML

    ec2:
      INSTANCE_TYPE: m1.large
      ETL_AMI: ami-05355a6c  # Default AMI used by data pipeline - Python 2.6
      SECURITY_GROUP: FILL_ME_IN
    emr:
      MASTER_INSTANCE_TYPE: m1.large
      NUM_CORE_INSTANCES: 1
      CORE_INSTANCE_TYPE: m1.large
      CLUSTER_AMI: 3.1.0
    etl:
      S3_ETL_BUCKET: FILL_ME_IN
      ROLE: FILL_ME_IN
      RESOURCE_ROLE: FILL_ME_IN

Config Parameters
-----------------

Bootstrap
~~~~~~~~~

.. code:: YAML

    bootstrap:
      ec2:
        - step_type: transform
          command: echo "Welcome to dataduct"
          no_output: true
      emr:
        - step_type: transform
          command: echo "Welcome to dataduct"
          no_output: true

Bootstrap steps are a chain of steps that should be executed before any
other step in the datapipeline. This can be used to copy files from S3
or install libraries on the resource. At Coursera we use this to
download some binaries from S3 that are required for some of the
transformations.

Note that the EMR bootstrap is only executed on the master node. If you
want to install something on the task nodes then you should use the
bootstrap parameter in the ``emr_cluster_config`` in your datapipeline.

Custom Steps
~~~~~~~~~~~~

::

custom_steps:
- class_name: CustomExtractLocalStep
file_path: custom_extract_local.py
step_type: custom-extract-local

Custom steps are steps that are not part of dataduct itself but are created
to augment the functionality it provides. At Coursera these are often steps
that inherit from an existing step class but abstract out some of the
functionality so that multiple pipelines don't have to write the same thing
twice.

The ``file_path`` can be an absolute path or a path relative to the
``CUSTOM_STEPS_PATH`` directory defined in the ETL parameter section.
The step classes are dynamically imported based on this config, and the
``step_type`` field is the value that is matched when parsing the pipeline
definition.
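
For illustration, the custom step file referenced above might look roughly
like the sketch below; the ``dataduct.steps`` import path, the
``ExtractLocalStep`` base class, and the ``path`` keyword are assumptions
for illustration rather than documented API:

.. code:: python

    # custom_extract_local.py -- hypothetical sketch of a custom step.
    # Import path and constructor keyword are assumptions.
    from dataduct.steps import ExtractLocalStep


    class CustomExtractLocalStep(ExtractLocalStep):
        """Extract-local step pinned to a shared input file."""

        def __init__(self, **kwargs):
            # Fix the input path so individual pipelines don't repeat it.
            super(CustomExtractLocalStep, self).__init__(
                path='/data/shared/input.tsv', **kwargs)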

Database
~~~~~~~~

::

database:
permissions:
- user: admin
permission: all
- group: consumer_group
permission: select

Some steps, such as ``upsert`` or ``create-load-redshift``, create tables
and grant appropriate permissions on them so that one does not have to
create tables prior to running the ETL. Each entry grants the given
``permission`` on the table or view to the specified ``user`` or ``group``.
If both a user and a group are specified then both grant statements are
executed.
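
As a rough illustration (not dataduct's actual implementation), the
permissions above translate into grant statements along these lines; the
table name is hypothetical:

.. code:: python

    # Hypothetical sketch: turn `database: permissions` entries into
    # Redshift GRANT statements. Structure mirrors the YAML above.
    permissions = [
        {'user': 'admin', 'permission': 'all'},
        {'group': 'consumer_group', 'permission': 'select'},
    ]

    def grant_statements(relation, permissions):
        """Yield one GRANT per user/group entry for a table or view."""
        for entry in permissions:
            if 'user' in entry:
                yield 'GRANT %s ON %s TO %s' % (
                    entry['permission'].upper(), relation, entry['user'])
            if 'group' in entry:
                yield 'GRANT %s ON %s TO GROUP %s' % (
                    entry['permission'].upper(), relation, entry['group'])

    for stmt in grant_statements('analytics.events', permissions):
        print(stmt)
    # GRANT ALL ON analytics.events TO admin
    # GRANT SELECT ON analytics.events TO GROUP consumer_group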

EC2
~~~

::

ec2:
INSTANCE_TYPE: m1.small
ETL_AMI: ami-05355a6c # Default AMI used by data pipeline - Python 2.6
SECURITY_GROUP: FILL_ME_IN

The ec2 config controls the configuration for the ec2-resource started
by the datapipeline. You can override these with ``ec2_resouce_config``
in your pipeline definition for specific pipelines.

EMR
~~~

::

emr:
CLUSTER_AMI: 3.1.0
CLUSTER_TIMEOUT: 6 Hours
CORE_INSTANCE_TYPE: m1.large
NUM_CORE_INSTANCES: 1
HADOOP_VERSION: 2.4.0
HIVE_VERSION: null
MASTER_INSTANCE_TYPE: m3.xlarge
PIG_VERSION: null
TASK_INSTANCE_BID_PRICE: null
TASK_INSTANCE_TYPE: m1.large

The emr config controls the configuration for the emr-resource started
by the datapipeline.

ETL
~~~

::

etl:
CONNECTION_RETRIES: 2
CUSTOM_STEPS_PATH: ~/dataduct/examples/steps
DAILY_LOAD_TIME: 1
KEY_PAIR: FILL_ME_IN
MAX_RETRIES: 2
NAME_PREFIX: dev
QA_LOG_PATH: qa
DP_INSTANCE_LOG_PATH: dp_instances
DP_PIPELINE_LOG_PATH: dp_pipelines
DP_QA_TESTS_LOG_PATH: dba_table_qa_tests
RESOURCE_BASE_PATH: ~/dataduct/examples/resources
RESOURCE_ROLE: FILL_ME_IN
RETRY_DELAY: 10 Minutes
REGION: us-east-1
ROLE: FILL_ME_IN
S3_BASE_PATH: dev
S3_ETL_BUCKET: FILL_ME_IN
SNS_TOPIC_ARN_FAILURE: null
SNS_TOPIC_ARN_WARNING: null
FREQUENCY_OVERRIDE: one-time
DEPENDENCY_OVERRIDE: false
slack:
api_token: FILL_ME_IN
channel_name: "#dataduct"
username: FILL_ME_IN
bot_username: Dataduct Bot
TAGS:
env:
string: dev
Name:
variable: name

This is the core parameter object which controls the ETL at the high
level. The parameters are explained below:

- ``CONNECTION_RETRIES``: Number of retries for the database
connections. This is used to eliminate some of the transient errors
that might occur.
- ``CUSTOM_STEPS_PATH``: Path to the directory to be used for custom
steps that are specified using a relative path.
- ``DAILY_LOAD_TIME``: Default time to be used for running pipelines
- ``KEY_PAIR``: SSH key pair to be used in both the ec2 and the emr
resource.
- ``MAX_RETRIES``: Number of retries for the pipeline activities
- ``NAME_PREFIX``: Prefix all the pipeline names with this string
- ``QA_LOG_PATH``: Path prefix for all the QA steps when logging output
to S3
- ``DP_INSTANCE_LOG_PATH``: Path prefix for DP instances to be logged
before destroying
- ``DP_PIPELINE_LOG_PATH``: Path prefix for DP pipelines to be logged
- ``DP_QA_TESTS_LOG_PATH``: Path prefix for QA tests to be logged
- ``RESOURCE_BASE_PATH``: Path to the directory used to resolve relative
  resource paths
- ``RESOURCE_ROLE``: Resource role needed for DP
- ``RETRY_DELAY``: Delay between activity retries
- ``REGION``: Region to run the datapipeline from
- ``ROLE``: Role needed for DP
- ``S3_BASE_PATH``: Prefix to be used for all S3 paths that are created
  anywhere. This is used for splitting logs across multiple developers or
  across production and dev
- ``S3_ETL_BUCKET``: S3 bucket to use for DP data, logs, source code
etc.
- ``SNS_TOPIC_ARN_FAILURE``: SNS to trigger for failed steps or
pipelines
- ``SNS_TOPIC_ARN_WARNING``: SNS to trigger for failed QA checks
- ``FREQUENCY_OVERRIDE``: Override every frequency given in a pipeline
with this unless overridden by CLI
- ``DEPENDENCY_OVERRIDE``: Will ignore the dependency step if set to
true.
- ``slack``: Configuration for posting messages on slack whenever a
pipeline is run
- ``TAGS``: Tags to be added to the pipeline. The first key is the tag to
  be used, the second key is the type. If the type is ``string`` the value
  is passed directly. If the type is ``variable`` then the value is looked
  up on the pipeline object. A small resolution sketch follows this list.
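
A small sketch of how the ``TAGS`` entries could be resolved against a
pipeline object (hypothetical helper, not dataduct's implementation):

.. code:: python

    # Hypothetical resolution of TAGS: 'string' values pass through,
    # 'variable' values are looked up as attributes on the pipeline.
    from collections import namedtuple

    def resolve_tags(tags_config, pipeline):
        tags = {}
        for tag_key, spec in tags_config.items():
            if 'string' in spec:
                tags[tag_key] = spec['string']
            elif 'variable' in spec:
                tags[tag_key] = getattr(pipeline, spec['variable'])
        return tags

    Pipeline = namedtuple('Pipeline', ['name'])
    tags_config = {'env': {'string': 'dev'}, 'Name': {'variable': 'name'}}
    print(resolve_tags(tags_config, Pipeline(name='daily_load')))
    # {'env': 'dev', 'Name': 'daily_load'}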

Logging
~~~~~~~

::

logging:
CONSOLE_DEBUG_LEVEL: INFO
FILE_DEBUG_LEVEL: DEBUG
LOG_DIR: ~/.dataduct
LOG_FILE: dataduct.log

Settings specifying where the logs should be written and the debug levels
to be used during library code execution.
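
As a rough sketch (not dataduct's actual logging setup), these four settings
map naturally onto a console handler and a file handler from the standard
library:

.. code:: python

    # Hypothetical wiring of the logging settings with the standard library.
    import logging
    import os

    CONSOLE_DEBUG_LEVEL = 'INFO'
    FILE_DEBUG_LEVEL = 'DEBUG'
    LOG_DIR = os.path.expanduser('~/.dataduct')
    LOG_FILE = 'dataduct.log'

    if not os.path.isdir(LOG_DIR):
        os.makedirs(LOG_DIR)

    logger = logging.getLogger('dataduct')
    logger.setLevel(logging.DEBUG)

    console = logging.StreamHandler()
    console.setLevel(getattr(logging, CONSOLE_DEBUG_LEVEL))
    logger.addHandler(console)

    file_handler = logging.FileHandler(os.path.join(LOG_DIR, LOG_FILE))
    file_handler.setLevel(getattr(logging, FILE_DEBUG_LEVEL))
    logger.addHandler(file_handler)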

MySQL
~~~~~

::

mysql:
host_alias_1:
HOST: FILL_ME_IN
PASSWORD: FILL_ME_IN
USERNAME: FILL_ME_IN
host_alias_2:
HOST: FILL_ME_IN
PASSWORD: FILL_ME_IN
USERNAME: FILL_ME_IN

RDS (MySQL) database connections are stored in this parameter. Pipeline
definitions can refer to a host by its ``host_alias``. ``HOST`` refers to
the full database hostname inside AWS.

Redshift
~~~~~~~~

::

redshift:
CLUSTER_ID: FILL_ME_IN
DATABASE_NAME: FILL_ME_IN
HOST: FILL_ME_IN
PASSWORD: FILL_ME_IN
USERNAME: FILL_ME_IN
PORT: FILL_ME_IN

Redshift database credentials used by all the steps that interact with the
warehouse. ``CLUSTER_ID`` is the first segment of the ``HOST`` name, as
``RedshiftNode`` uses it in a few places to identify the cluster.
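
For example, assuming a typical Redshift endpoint, the cluster id is just
the hostname up to the first dot (hypothetical helper, not dataduct code):

.. code:: python

    # Hypothetical helper: derive CLUSTER_ID from a Redshift HOST endpoint.
    def cluster_id_from_host(host):
        """Return the first segment of the Redshift endpoint."""
        return host.split('.')[0]

    # 'warehouse.abc123xyz.us-east-1.redshift.amazonaws.com' -> 'warehouse'
    print(cluster_id_from_host(
        'warehouse.abc123xyz.us-east-1.redshift.amazonaws.com'))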

Modes
~~~~~

::

production:
etl:
S3_BASE_PATH: prod

Modes define override settings for running a pipeline. As the config is a
singleton, the overrides are declared once and the updated settings apply
everywhere the config is used.

In the example we have a mode called ``production`` in which the
``S3_BASE_PATH`` is overridden to ``prod`` instead of whatever value was
specified in the defaults.
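
Conceptually (a sketch, not dataduct's actual implementation), applying a
mode is a recursive merge of the mode's sections over the base config:

.. code:: python

    # Hypothetical sketch of mode overrides: deep-merge the selected mode's
    # settings over the base config dictionary.
    def apply_mode(config, mode_overrides):
        """Recursively overlay mode_overrides onto config and return it."""
        for key, value in mode_overrides.items():
            if isinstance(value, dict) and isinstance(config.get(key), dict):
                apply_mode(config[key], value)
            else:
                config[key] = value
        return config

    base = {'etl': {'S3_BASE_PATH': 'dev', 'NAME_PREFIX': 'dev'}}
    production = {'etl': {'S3_BASE_PATH': 'prod'}}

    print(apply_mode(base, production))
    # {'etl': {'S3_BASE_PATH': 'prod', 'NAME_PREFIX': 'dev'}}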

At Coursera, one of the uses for modes is to switch from the dev Redshift
cluster to the production one when we deploy a new ETL.