Related Projects

This page lists related projects that also provide ETL or data processing tools.

selected formats

Cocoon - Apache Cocoon XML pipeline
csvkit - Csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.
Datamash - performs calculations (e.g. sum, count, min, max, skewness, standard deviation) on input files.
DNB-Conv-Tools - Java conversion tools for MARC, ONIX, MAB, Pica and others
easyM2R - https://github.com/cKlee/easyM2R
ETL-Yertl
jq
Librisxl - Tools for conversion of libris.kb.se data
MABLE - MABLE+ is a Java-based software tool for automatic data and error analysis of library catalogs.
MABTools - MAB tools created by the Deutsche Nationalbibliothek
MARCEdit - http://marcedit.reeset.net/
MARCgrep.pl - MARCgrep.pl is a Perl script to filter or count bibliographic records based on conditions built on tag name, indicators, subfield, and field value (or tag, positions, and value for control fields 00x).
marc2rdf - https://github.com/digibib/marc2rdf (uses JSON mappings such as this)
MARCspec - http://cklee.github.io/marc-spec/marc-spec.html (mapping language for MARC)
marctools - https://github.com/ubleipzig/marctools (various MARC command line utilities)
MARiMbA - a command-line tool, designed with librarians in mind, to transform MARC (MAchine-Readable Cataloging) records to RDF
miller - like sed, awk, cut, join, and sort for name-indexed data such as CSV and tabular JSON
pymarc - pymarc is a Python library for working with bibliographic data encoded in MARC21
rml - RML Generic Mapping Language (RDF)
solrmarc - https://code.google.com/p/solrmarc/
TARQL - a SPARQL-based data mapping language to convert CSV, XML, JSON to RDF
Traject - an easy to use, high-performance, flexible and extensible MARC to Solr indexer.

general frameworks

Akara - Akara is a platform for developing data services available on the Web, using REST architecture. Akara is open source software written in Python and C.
App::RecordStream - recs, a system for command-line analysis of data.
ATTX - Putting Linked Data to Work (University of Helsinki)
bibcat - Engineering toolkit for building semantic web and bibliographic applications
Conduit - Haskell framework for dealing with streaming data
COMSODE - The project COMSODE is an SME-driven RTD project aimed at progressing the capabilities in the field of Open Data re-use.
DNet
d:swarm - data management platform for enrichment, normalization and linkage of knowledge data structures.
ETL::Yertl - ETL with a Shell
Flink - Apache Flink® - Stateful Computations over Data Streams
Heiðrún - Heiðrún is the DPLA metadata ingestion and QA system, and is an implementation of the Krikri Rails engine.
JAQL - Query Language for JavaScript(r) Object Notation (JSON)
KNIME - Open source Analytics Platform
Krikri - DPLA Ruby on Rails engine for metadata aggregation, enhancement and quality control.
Luwak - A Lucene extension to search data streams. See also this blog entry.
Metadata Interoperability Framework (MIF) - http://elag2014.org/programme/elag-workshops-list-page/11-5/ PPT
MINT - Metadata Interoperability Services
Meresco - Under the Meresco name Dutch public institutions share quality software components related to metadata management and search.
Metacrunch
metafacture - used in culturegraph
Metadata Services Toolkit - part of the eXtensible Catalog (XC)
Metadata & Object Repository (MoRe)
MUPD8 - Data stream processing from WalmartLabs.
OpenRefine - (formerly Google Refine) a toolkit to work with tabular data.
Petl - Python ETL library
Pig - Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
Ratchet - A library for performing data pipeline / ETL tasks in Go.
REPOX - Data Aggregation and Interoperability Manager
Samza - Apache Samza is a distributed stream processing framework.
Silk - The Silk framework provides a declarative language for specifying which types of RDF links should be discovered between data sources as well as which conditions data items must fulfil in order to be interlinked.
Spark Streaming - Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
Storm - Apache Storm is a distributed stream processing framework.
Strukt - The most interactive way to work with all kinds of tabular data
Supplejack - Supplejack was designed to provide assurance to the quality of data management activities when working at scale.
TeePee - Command line tool to extract data from structures
