
Ruby gem that provides utilities (ls, find, cat, and others) for HDFS (Hadoop Distributed File System).


davidchaiken/hdfsutils

 
 


HdfsUtils

Ruby gem that provides utilities (ls, find, and eventually others) for HDFS (Hadoop Distributed File System).

This gem uses the webhdfs interface, which provides fast, compatible, remote access to files and directories stored in HDFS.

Settings

The precedence order of sources of settings, from lowest to highest, is:

  1. Defaults in this repository.
  2. Standard Hadoop configuration files.
  3. Environment variables.
  4. Command-line options.
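
The precedence order above can be illustrated with a minimal Ruby sketch. This is not the gem's own code: the `resolve_settings` helper, the setting keys, and the example host names are all hypothetical. It only shows the general idea that a later (higher-precedence) source overwrites an earlier one.

```ruby
# Hypothetical helper (not part of hdfsutils): merge setting sources that are
# passed from lowest to highest precedence. Hash#merge lets each later source
# overwrite keys defined by earlier ones.
def resolve_settings(*sources)
  sources.reduce({}) { |merged, source| merged.merge(source) }
end

# Illustrative values only; the keys and host names are assumptions.
defaults     = { host: 'localhost', port: 50070 }
hadoop_conf  = { host: 'namenode.example.com' }
environment  = { port: 14000 }
command_line = { host: 'gateway.example.com' }

settings = resolve_settings(defaults, hadoop_conf, environment, command_line)
puts settings.inspect
```

In this sketch the command-line host wins over the Hadoop configuration host, and the environment port wins over the default port.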

Environment Variables

The following environment variables may be used to configure the utilities.

| Variable | Description | Default |
| --- | --- | --- |
| HDFS_HOST | The hostname of the webhdfs server. | localhost |
| HDFS_PORT | The port number of the webhdfs service. | 50070 |
| HDFS_USERNAME | The username used to access HDFS. | The value of the HADOOP_USER_NAME or USER shell environment variable. |
| HDFS_URI | The location of the webhdfs service: [webhdfs://]hostname[:port] | webhdfs://localhost:50070 |
| HDFS_DOAS | HTTP doas username to use with webhdfs. | none |
| HDFS_PROXYHOST | HTTP proxy host to use with webhdfs. | none |
| HDFS_PROXYPORT | HTTP proxy port to use with webhdfs. | none |
| HADOOP_CONF_DIR | The directory that contains Hadoop configuration files. | /etc/hadoop |
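
As an illustration of the `[webhdfs://]hostname[:port]` form accepted by HDFS_URI, the following Ruby sketch splits such a value into a host and a port. The `parse_hdfs_uri` helper is hypothetical and is not taken from the gem. It assumes that a missing port falls back to 50070, the documented default for HDFS_PORT, and it does not handle IPv6 literals.

```ruby
# Hypothetical helper (not part of hdfsutils): split a value of the form
# "[webhdfs://]hostname[:port]" into a host and a port.
def parse_hdfs_uri(value, default_port = 50070)
  match = %r{\A(?:webhdfs://)?([^:/]+)(?::(\d+))?/?\z}.match(value)
  raise ArgumentError, "invalid HDFS URI: #{value}" unless match

  # Assumption: when no port is given, use the documented default port.
  { host: match[1], port: (match[2] || default_port).to_i }
end

# Use the documented default location when HDFS_URI is not set.
uri = ENV.fetch('HDFS_URI', 'webhdfs://localhost:50070')
puts parse_hdfs_uri(uri).inspect
```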

Common Command-Line Options

All of the utilities take the following options, which override the environment variables when specified.

| Option | Description | Default |
| --- | --- | --- |
| --hdfsuri=[webhdfs://]hostname[:port] | The location of the webhdfs service. | webhdfs://localhost:50070 |
| --log-level=[debug\|info\|warn\|error\|fatal] | Logging level. When debug is specified, failures will generate a stack trace. | fatal |
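
For contributors, here is a rough sketch of how the two common options could be parsed with Ruby's standard `optparse` library, with the environment supplying the starting value for the URI. The `parse_common_options` helper and the option hash keys are hypothetical; the gem's actual options.rb files may be structured differently.

```ruby
require 'optparse'

# Hypothetical helper (not the gem's own code): parse the two common options.
# Returns the resolved options and the remaining non-option arguments.
def parse_common_options(argv)
  # Start from the environment, falling back to the documented defaults.
  options = {
    hdfsuri: ENV.fetch('HDFS_URI', 'webhdfs://localhost:50070'),
    log_level: 'fatal'
  }
  parser = OptionParser.new do |opts|
    opts.on('--hdfsuri=URI', 'Location of the webhdfs service') do |uri|
      options[:hdfsuri] = uri
    end
    # Passing an array restricts the accepted values; anything else raises
    # OptionParser::InvalidArgument.
    opts.on('--log-level=LEVEL', %w[debug info warn error fatal],
            'Logging level') do |level|
      options[:log_level] = level
    end
  end
  # parse returns the arguments that are not options (for example, paths).
  remaining = parser.parse(argv)
  [options, remaining]
end

options, paths = parse_common_options(%w[--log-level=debug /tmp])
puts options.inspect
puts paths.inspect
```

Because the command-line values are written into the hash last, they override whatever the environment supplied, which matches the precedence described under Settings.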

Contributing

Altiscale has just started developing hdfsutils. We're focusing on delivering a specific use case for one of our customers, but intend to build a much more complete set of utilities. Contributions are welcome.

To add new functionality to an existing utility, you'll probably want to edit the utility's options.rb file and the utility implementation.

To develop a completely new utility, find, copy, and modify the template code. These were the template code files at the time this documentation was written:

$ find . -path '*template*'
./bin/hdtemplate
./lib/hdfsutils/utils/hdtemplate
./lib/hdfsutils/utils/hdtemplate/implementation.rb
./lib/hdfsutils/utils/hdtemplate/options.rb
./lib/hdfsutils/utils/hdtemplate/template.rb
./spec/utils/hdtemplate_spec.rb

The code in all pull requests must pass the rubocop and rspec tests. New functionality should be submitted with corresponding rspec unit tests. The best way to run rubocop and rspec is to use rvm, bundler, and rake. Assuming that rvm is already installed with bundler in the default gemset, run rake as follows:

$ rvm use @hdfsutils-devel --create
ruby-2.0.0-p353 - #gemset created /Users/chaiken/.rvm/gems/ruby-2.0.0-p353@hdfsutils-devel
ruby-2.0.0-p353 - #generating hdfsutils-devel wrappers..........
Using /Users/chaiken/.rvm/gems/ruby-2.0.0-p353 with gemset hdfsutils-devel
bash-3.2$ bundle install
Fetching gem metadata from https://rubygems.org/............
Fetching version metadata from https://rubygems.org/..
Resolving dependencies...
<installs the development dependencies in hdfsutils.gemspec>
Bundle complete! <D> Gemfile dependencies, <G> gems now installed.
Use `bundle show [gemname]` to see where a bundled gem is installed.
bash-3.2$ rake
Running RuboCop...
Inspecting <F> files
..............................

<F> files inspected, no offenses detected
<path>/ruby <path>/rspec --pattern spec/\*\*\{,/\*/\*\*\}/\*_spec.rb

HdfsUtils::Ls
<ls utility tests>

HdfsUtils::Template
<template utility tests>

Finished in <N> seconds (files took <M> seconds to load)
<X> examples, <F> failures

Release Notes

0.0.4

  • support HADOOP_USER_NAME shell environment variable

0.0.3

  • hdmv implementation
  • reuse webhdfs connections, if possible

0.0.2

  • Unix, SI, and IEC file size units, plus a human-readable option
  • help formatting fix
  • improvements based on Altiscale customer feedback

0.0.1

Original Release

Authors

License

Apache License Version 2.0 (See LICENSE.txt)
