We'd love to accept your patches! Before we can take them, we have to jump a couple of legal hurdles.
Please fill out either the individual or corporate Contributor License Agreement (CLA).
- If you are an individual writing original source code and you're sure you own the intellectual property, then you'll need to sign an [individual CLA](https://developers.google.com/open-source/cla/individual).
- If you work for a company that wants to allow you to contribute your work, then you'll need to sign a [corporate CLA](https://developers.google.com/open-source/cla/corporate).
Follow either of the two links above to access the appropriate CLA and instructions for how to sign and return it. Once we receive it, we'll be able to accept your pull requests.
- Submit an issue describing your proposed change to the repo in question.
- The repo owner will respond to your issue promptly.
- If your proposed change is accepted, and you haven't already done so, sign a Contributor License Agreement (see details above).
- Fork the desired repo, then develop and test your code changes.
- Ensure that your code adheres to the existing style in the sample to which you are contributing. Refer to the [Google Cloud Platform Samples Style Guide](https://github.com/GoogleCloudPlatform/Template/wiki/style.html) for the recommended coding standards for this organization.
- Ensure that your code has an appropriate set of unit tests which all pass.
- Submit a pull request.
The following best-practice guidelines will help ensure your initialization actions are less likely to break from one Dataproc version to another, and more likely to support different single-node, high-availability, and standard cluster modes.
- Where possible, use `apt-get install` to install from Dataproc's prebuilt Debian packages (built using Apache Bigtop) instead of installing from tarballs. The list of Dataproc packages can be found under `/var/lib/apt/lists/*dataproc-bigtop-repo_*Packages` on any Dataproc cluster.
- Do not string-replace or string-grep fields out of Hadoop XML files; instead, use `bdconfig`, a Python utility available on Dataproc clusters for interacting with the XML files.
- If it's not possible to make the additional software inherit Hadoop classpaths and configuration via `/etc/hadoop/conf`, then where possible use symlinks to the necessary jarfiles and conf files instead of copying them into directories used by your software. This preserves a single source of truth for jarfile versions and configuration files rather than letting them diverge in the face of further customization.
- Use the `dataproc-role` metadata key to distinguish behavior between workers, masters, etc.
- Do not assume node names are always a suffix on the `${CLUSTER_NAME}`; for example, do not assume that `${CLUSTER_NAME}-m` is the HDFS namenode. Instead, use things like `fs.default.name` from `/etc/hadoop/conf/core-site.xml` to determine a default filesystem URI, `/etc/hadoop/conf/hdfs-site.xml` for other HDFS settings, `/etc/zookeeper/conf/zoo.cfg` for ZooKeeper nodes, etc. See the Apache Drill initialization action for examples.
- Instead of directly launching any long-running daemon services, create a systemd config to ensure your daemon service automatically restarts on reboot or crash. See #111 and #113 for an example of integrating Jupyter with systemd.
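
Several of these guidelines can be combined in one script. The following is a minimal sketch, not a reference implementation: it branches on the `dataproc-role` metadata key, symlinks a Hadoop configuration file instead of copying it, and writes a systemd unit that restarts the daemon after a crash or reboot. The `my-daemon` name, its `/opt/my-daemon` paths, and the `DATAPROC_ROLE` override variable are hypothetical, and the metadata helper path is assumed to be the one present on Dataproc images.

```shell
#!/usr/bin/env bash
# Sketch of an initialization action: role-based branching, symlinked
# configuration, and a systemd unit. All "my-daemon" names are hypothetical.
set -eu

# Metadata helper path assumed to exist on Dataproc images.
METADATA_TOOL='/usr/share/google/get_metadata_value'

# Print the node role. DATAPROC_ROLE is a hypothetical override that lets
# the script be exercised outside of a cluster.
get_role() {
  if [ -n "${DATAPROC_ROLE:-}" ]; then
    echo "${DATAPROC_ROLE}"
  elif [ -x "${METADATA_TOOL}" ]; then
    "${METADATA_TOOL}" attributes/dataproc-role
  else
    echo 'Unknown'
  fi
}

# Symlink a shared config file into the software's own directory instead of
# copying it, so there is still a single source of truth.
link_conf() {
  local source_file="$1"
  local target_dir="$2"
  mkdir -p "${target_dir}"
  ln -sfn "${source_file}" "${target_dir}/$(basename "${source_file}")"
}

# Write a unit file that restarts the daemon on exit and starts it on boot
# once the unit is enabled.
write_unit() {
  local unit_path="$1"
  local exec_start="$2"
  cat >"${unit_path}" <<EOF
[Unit]
Description=Hypothetical daemon installed by an initialization action
After=network.target

[Service]
ExecStart=${exec_start}
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

main() {
  local role
  role="$(get_role)"
  case "${role}" in
    Master)
      link_conf /etc/hadoop/conf/core-site.xml /opt/my-daemon/conf
      write_unit /etc/systemd/system/my-daemon.service /opt/my-daemon/bin/start
      systemctl daemon-reload
      systemctl enable --now my-daemon.service
      ;;
    Worker)
      echo 'Nothing to install on workers in this sketch.'
      ;;
    *)
      echo "Unrecognized role '${role}'; skipping." >&2
      ;;
  esac
}

main "$@"
```

Keeping the role lookup, the symlink step, and the unit file in separate functions makes each one easy to try against a temporary directory before running the script on a real cluster.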