Home
Gary Anderson edited this page Dec 6, 2018 · 11 revisions
Welcome to the BotWorks wiki, which also covers DataSpider and Kosh. We are glad you came!
The BotWorks ecosystem was developed by a partnership between GlaxoSmithKline (GSK) and Modak Analytics as a framework for ingestion and curation for the Big Data Platform that evolved over the last few years. Bots operate in the larger Hadoop ecosystem and provide both functionality and orchestration, covering simple source-to-target data movement as well as much more complicated multi-step data curation processes.
In 2018, a decision was made to open source the basic Bot code, a metadata crawler, and a metadata design specification. A brief overview of each:
- Kosh: A metadata repository that contains both technical and operational (scheduling, status, etc.) metadata used by the Bots to execute and orchestrate ingestion and curation tasks. The GitHub project contains the DDL needed to create a repository. The repository is intended to store metadata on both the raw data brought into the environment and the curated artifacts generated by the Bots. There is no real distinction between source and target data, since every generated target dataset is in turn a source for someone else.
- DataSpider: An application that crawls many types of enterprise data sources and populates the Kosh metadata repository with the information Bots need to orchestrate the ingestion of data sources into the Big Data repository and to perform curation activities within the system.
- BotWorks: A highly decoupled, asynchronous, metadata-driven framework for deploying tasks related to data ingestion, data curation, and numerous other types of activities.
Each of these technologies is discussed in depth in this wiki.