Add triton docs and fix false link (#2557)
Signed-off-by: Kaiyuan Hu <kaiyuan.hu@zilliz.com>
Chiiizzzy authored May 25, 2023
1 parent fe485d9 commit fc50bf9
Showing 30 changed files with 192 additions and 20 deletions.
8 changes: 5 additions & 3 deletions docs/01-Overview.md
@@ -11,23 +11,25 @@ Towhee is a framework that provides [ETL](https://databricks.com/glossary/extrac

Unstructured data refers to data that cannot be stored in a tabular or key-value format. Nearly all human-generated data (images, video, text, etc.) is unstructured - some market analysts estimate that over 80% of data generated by 2024 will be unstructured data. Towhee is the first open-source project that's meant to process a variety of unstructured data using ETL pipelines.

To accomplish this, we built Towhee atop popular machine learning and unstructured data processing libraries such as `torch`, `timm`, and `transformers`. Models or functions from different libraries are wrapped as standard Towhee operators, and can be freely integrated into application-oriented pipelines using a [Pythonic API](./04-API Reference). To ensure user-friendliness, pre-built pipelines can also be called in just a single line of code, without the need to understand the underlying models or modules used to build them.
To accomplish this, we built Towhee atop popular machine learning and unstructured data processing libraries such as `torch`, `timm`, and `transformers`. Models or functions from different libraries are wrapped as standard Towhee operators, and can be freely integrated into application-oriented pipelines using a [Pythonic API](/05-API%20Reference). To ensure user-friendliness, pre-built pipelines can also be called in just a single line of code, without the need to understand the underlying models or modules used to build them.

For more information, take a look at our [quick start](/02-Getting%20Started/01-quick-start.mdx) page.

## Problems Towhee solves

- **Modern ML applications require far more than a single neural network.** Running a modern ML application in production requires a combination of online pre-processing, data transformation, the models themselves, and other ML-related tools. Building an application that recognizes objects within a video, for example, involves decompression, key-frame extraction, image deduplication, object detection, etc. This necessitates a platform that offers a fast and robust method for developing end-to-end application pipelines that use ML models in addition to supporting data parallelism and resource management.

Towhee solves this problem by reintroducing the concept of `Pipeline` as being _application-centric_ instead of _model-centric_. Where model-centric pipelines are composed of a single model followed by auxiliary code, application-centric pipelines treat every single data processing step as a first-class citizen. Towhee also exposes a [Pythonic API](./04-API Reference) for developing more complex applications in just a couple lines of code.
Towhee solves this problem by reintroducing the concept of `Pipeline` as being _application-centric_ instead of _model-centric_. Where model-centric pipelines are composed of a single model followed by auxiliary code, application-centric pipelines treat every single data processing step as a first-class citizen. Towhee also exposes a [Pythonic API](/05-API%20Reference) for developing more complex applications in just a couple lines of code.

- **Too many model implementations exist without any interface standard.** Machine learning models (NN-based and traditional) are ubiquitous. Different implementations of machine learning models require different auxiliary code to support testing and fine-tuning, making model evaluation and productionization a tedious task.

Towhee solves this by providing a universal `Operator` wrapper for dataset loading, basic data transformations, ML models, and other miscellaneous scripts. Operators have a pre-defined API and glue logic to make Towhee work with a number of machine learning and data processing libraries. Operators can be chained together in a DAG to form entire ML applications.

- **ETL pipelines for unstructured data are nearly nonexistent.** ETL, short for _extract, transform, and load_, is a framework used by data scientists, ML application developers, and other engineers to extract data from various sources, transform the data into a format that can be understood by computers, and load the data into downstream platforms for recommendation, analytics, and other business intelligence tasks.

Towhee solves this by providing an open-source vision for ETL in the era of unstructured data. We provide: 1) over 300 pre-built pipelines across a multitude of different data transformation tasks (including but not limited to image embedding, audio embedding, text summarization), and 2) a way to build pipelines of arbitrary complexity through an intuitive Python API.
Towhee solves this by providing an open-source vision for ETL in the era of unstructured data. We provide:
1. over 300 pre-built pipelines across a multitude of different data transformation tasks (including but not limited to image embedding, audio embedding, text summarization)
2. a way to build pipelines of arbitrary complexity through an intuitive Python API.

## Design philosophy

@@ -8,12 +8,12 @@ Currently, Towhee supports nine types of nodes. Input and output nodes are used

| **Node Type** | **Description** |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| **input***(\*input_schema)* | This node defines the input schema of a pipeline and is the beginning of a pipeline's definition. Note that a pipeline's input schema can not be empty. Refer to [input API](/04-API%20Reference/01-Pipeline%20API/01-input.md) for more details. |
| **output**(**output_schema*) | This node defines the pipeline's output schema, and ends a pipeline definition. Once called, a pipeline instance will be created and returned. Refer to [output API](/04-API%20Reference/01-Pipeline%20API/02-output.md) for more details. |
| **map**(*input_schema, output_schema, func*) | This node applies the given function `func` to each of its inputs and returns the transformed data. `map` returns one row for every row of input. Refer to [map API](/04-API%20Reference/01-Pipeline%20API/03-map.md) for more details. |
| **flat_map***(**input_schema, output_schema,* *func)* | This node flattens the results after applying the function to every row of input, and returns the flattened data. The returned data can have the same number of rows as the input, or more. This is one of the major differences between `flat_map` and `map`, where `map` always returns the same number of rows as input. Refer to [flat_map API](/04-API%20Reference/01-Pipeline%20API/04-flat-map.md) for more details. |
| **filter**(*input_schema, output_schema,* *filter_columns, func*) | This node applies the filter function `func` to the `filter_columns`. Refer to [filter API](/04-API%20Reference/01-Pipeline%20API/05-filter.md) for more details. |
| **window**(*input_schema, output_schema,* *size, step, func*) | This node batches the input rows into multiple rows based on the specified window `size` and `step`. Then it applies a function `func` to each of the windowed data, and returns the results - one row of results for each of the windows. Refer to [window API](/04-API%20Reference/01-Pipeline%20API/06-window.md) for more details. |
| **time_window**(*input_schema, output_schema,* *timestamp_col, size, step, func*) | This node is used to batch rows that have a time sequence, for example, audio or video frames. `time_window` is similar to `window`, but the batching rule is applied based on a timestamp column (`timestamp_col`). `size` is the time interval of each window, and `step` determines how long a window moves from the previous one. Note that if `step` is less than `size`, the windows will overlap. Refer to [time_window API](/04-API%20Reference/01-Pipeline%20API/07-time-window.md) for more details. |
| **window_all**(*input_schema, output_schema,* *func*) | This node batches all input rows into one window, and returns the result by applying a function `func` to the window. Refer to [window_all API](/04-API%20Reference/01-Pipeline%20API/08-window-all.md) for more details. |
| **concat***(\*pipelines)* | This node concats multiple pipelines' intermediate results, and groups all the pipelines into a bigger one. Refer to [concat API](/04-API%20Reference/01-Pipeline%20API/09-concat.md) for more details. |
| **input***(\*input_schema)* | This node defines the input schema of a pipeline and is the beginning of a pipeline's definition. Note that a pipeline's input schema can not be empty. Refer to [input API](/05-API%20Reference/01-Pipeline%20API/01-input.md) for more details. |
| **output**(**output_schema*) | This node defines the pipeline's output schema, and ends a pipeline definition. Once called, a pipeline instance will be created and returned. Refer to [output API](/05-API%20Reference/01-Pipeline%20API/02-output.md) for more details. |
| **map**(*input_schema, output_schema, func*) | This node applies the given function `func` to each of its inputs and returns the transformed data. `map` returns one row for every row of input. Refer to [map API](/05-API%20Reference/01-Pipeline%20API/03-map.md) for more details. |
| **flat_map***(**input_schema, output_schema,* *func)* | This node flattens the results after applying the function to every row of input, and returns the flattened data. The returned data can have the same number of rows as the input, or more. This is one of the major differences between `flat_map` and `map`, where `map` always returns the same number of rows as input. Refer to [flat_map API](/05-API%20Reference/01-Pipeline%20API/04-flat-map.md) for more details. |
| **filter**(*input_schema, output_schema,* *filter_columns, func*) | This node applies the filter function `func` to the `filter_columns`. Refer to [filter API](/05-API%20Reference/01-Pipeline%20API/05-filter.md) for more details. |
| **window**(*input_schema, output_schema,* *size, step, func*) | This node batches the input rows into multiple rows based on the specified window `size` and `step`. Then it applies a function `func` to each of the windowed data, and returns the results - one row of results for each of the windows. Refer to [window API](/05-API%20Reference/01-Pipeline%20API/06-window.md) for more details. |
| **time_window**(*input_schema, output_schema,* *timestamp_col, size, step, func*) | This node is used to batch rows that have a time sequence, for example, audio or video frames. `time_window` is similar to `window`, but the batching rule is applied based on a timestamp column (`timestamp_col`). `size` is the time interval of each window, and `step` determines how long a window moves from the previous one. Note that if `step` is less than `size`, the windows will overlap. Refer to [time_window API](/05-API%20Reference/01-Pipeline%20API/07-time-window.md) for more details. |
| **window_all**(*input_schema, output_schema,* *func*) | This node batches all input rows into one window, and returns the result by applying a function `func` to the window. Refer to [window_all API](/05-API%20Reference/01-Pipeline%20API/08-window-all.md) for more details. |
| **concat***(\*pipelines)* | This node concats multiple pipelines' intermediate results, and groups all the pipelines into a bigger one. Refer to [concat API](/05-API%20Reference/01-Pipeline%20API/09-concat.md) for more details. |
4 changes: 2 additions & 2 deletions docs/03-User Guides/01-Pipeline Programing Guide/02-map.md
@@ -2,7 +2,7 @@

## Introduction

A map node applies a given function to each of its inputs and returns the transformed data. `map` returns one row for every row of input. Refer to [map API](../../04-API Reference/01-Pipeline API/03-map.md) for more details.
A map node applies a given function to each of its inputs and returns the transformed data. `map` returns one row for every row of input. Refer to [map API](../../05-API%20Reference/01-Pipeline%20API/03-map.md) for more details.
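The one-row-in, one-row-out behavior can be sketched in plain Python (an illustrative sketch of the semantics only, not the actual Towhee API; `map_node` is a hypothetical name):

```python
def map_node(rows, func):
    """Apply func to each input row; the output has exactly one row per input row."""
    return [func(row) for row in rows]

# One output row for every input row.
print(map_node([1, 2, 3], lambda x: x * 10))  # [10, 20, 30]
```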

The figure below illustrates how `map` applies the transformation to each row of inputs.

@@ -24,7 +24,7 @@ Now let's take a text feature extraction pipeline as an example to demonstrate

This example defines a pipeline for text feature extraction.

> When running the pipeline, you can use [batch (batch_inputs)](/04-API%20Reference/01-Pipeline%20API/10-batch.md) to insert multiple rows of data at a time.
> When running the pipeline, you can use [batch (batch_inputs)](/05-API%20Reference/01-Pipeline%20API/10-batch.md) to insert multiple rows of data at a time.
```Python
from towhee import pipe, ops
```

@@ -4,7 +4,7 @@

A flat map node flattens the results after applying the function to every row of input, and returns the flattened data.

The returned data can have the same number of rows as the input, or more. This is one of the major differences between `flat_map` and `map`, where `map` always returns the same number of rows as input. Refer to [flat_map API](/04-API%20Reference/01-Pipeline%20API/04-flat-map.md) for more details.
The returned data can have the same number of rows as the input, or more. This is one of the major differences between `flat_map` and `map`, where `map` always returns the same number of rows as input. Refer to [flat_map API](/05-API%20Reference/01-Pipeline%20API/04-flat-map.md) for more details.
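The flatten-after-apply behavior can be sketched in plain Python (an illustrative sketch of the semantics only, not the actual Towhee API; `flat_map_node` is a hypothetical name):

```python
def flat_map_node(rows, func):
    """Apply func to each row, then flatten the per-row results:
    the output may contain more rows than the input."""
    return [item for row in rows for item in func(row)]

# Two input rows produce five output rows after splitting into words.
print(flat_map_node(["a b", "c d e"], str.split))  # ['a', 'b', 'c', 'd', 'e']
```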

The figure below illustrates how `flat_map` applies the transformation to each row of inputs.

@@ -2,7 +2,7 @@

## Introduction

A filter node filters rows based on the return values (T/F) of a given function `func` and takes `filter_columns` as input. The transformation only takes effect on the columns specified by the `input_schema`. Refer to [filter API](/04-API%20Reference/01-Pipeline%20API/05-filter.md) for more details.
A filter node filters rows based on the return values (T/F) of a given function `func` and takes `filter_columns` as input. The transformation only takes effect on the columns specified by the `input_schema`. Refer to [filter API](/05-API%20Reference/01-Pipeline%20API/05-filter.md) for more details.
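The row-keeping behavior can be sketched in plain Python (an illustrative sketch of the semantics only, not the actual Towhee API; `filter_node` is a hypothetical name):

```python
def filter_node(rows, filter_rows, pred):
    """Keep row i when pred(filter_rows[i]) is truthy; the data rows and the
    filter column are evaluated in lockstep."""
    return [row for row, key in zip(rows, filter_rows) if pred(key)]

# Keep only the rows whose score passes the threshold.
print(filter_node(["e1", "e2", "e3"], [0.9, 0.2, 0.7], lambda s: s > 0.5))  # ['e1', 'e3']
```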

![img](https://github.com/towhee-io/data/blob/main/image/docs/filter_intro.png?raw=true)

@@ -2,7 +2,7 @@

## Introduction

The window node batches the input rows into multiple rows based on the specified window size (`size`) and steps (`step`). The `size` determines the window length, and `step` determines how long a window moves from the previous one. Note that if `step` is less than `size`, the windows will overlap. The window node applies a function `func` to each of the windowed data, and returns the results - one row of results for each of the windows. Refer to [window API](/04-API%20Reference/01-Pipeline%20API/06-window.md) for more details.
The window node batches the input rows into multiple rows based on the specified window size (`size`) and steps (`step`). The `size` determines the window length, and `step` determines how long a window moves from the previous one. Note that if `step` is less than `size`, the windows will overlap. The window node applies a function `func` to each of the windowed data, and returns the results - one row of results for each of the windows. Refer to [window API](/05-API%20Reference/01-Pipeline%20API/06-window.md) for more details.
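The `size`/`step` batching rule can be sketched in plain Python (an illustrative sketch of the semantics only, not the actual Towhee API; `window_node` is a hypothetical name, and the handling of trailing partial windows here is an assumption):

```python
def window_node(rows, size, step, func):
    """Slice rows into windows of length `size`, advancing the window start
    by `step` rows (windows overlap when step < size), then apply func
    to each window - one output row per window."""
    windows = [rows[i:i + size] for i in range(0, len(rows), step)]
    return [func(w) for w in windows if w]

# size=3, step=2 over six rows: windows [1,2,3], [3,4,5], [5,6].
print(window_node([1, 2, 3, 4, 5, 6], size=3, step=2, func=sum))  # [6, 12, 11]
```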

This figure shows the relationship between `size`, `step`, input rows, and windows:

@@ -4,7 +4,7 @@

The time window node is used to batch rows that have a time sequence, for example, audio or video frames.

`time_window` is similar to `window`, but the batching rule is applied based on a timestamp column `timestamp_col`. `size` is the time interval of each window, and `step` determines how long a window moves from the previous one. Note that if `step` is less than `size`, the windows will overlap. Refer to [time_window API](/04-API%20Reference/01-Pipeline%20API/07-time-window.md) for more details.
`time_window` is similar to `window`, but the batching rule is applied based on a timestamp column `timestamp_col`. `size` is the time interval of each window, and `step` determines how long a window moves from the previous one. Note that if `step` is less than `size`, the windows will overlap. Refer to [time_window API](/05-API%20Reference/01-Pipeline%20API/07-time-window.md) for more details.
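The timestamp-based batching rule can be sketched in plain Python (an illustrative sketch of the semantics only, not the actual Towhee API; `time_window_node` is a hypothetical name, and the half-open window interval is an assumption):

```python
def time_window_node(rows, timestamps, size, step, func):
    """Group rows whose timestamp falls in [start, start + size), with
    successive window starts `step` time units apart; windows overlap
    when step < size."""
    if not rows:
        return []
    results, t, end = [], min(timestamps), max(timestamps)
    while t <= end:
        window = [r for r, ts in zip(rows, timestamps) if t <= ts < t + size]
        if window:
            results.append(func(window))
        t += step
    return results

# Frames at t = 0..5 s, 3-second windows moving by 2 s (so they overlap).
print(time_window_node(list("abcdef"), [0, 1, 2, 3, 4, 5], size=3, step=2, func="".join))
# ['abc', 'cde', 'ef']
```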

The figure below shows the relationship between `size`, `step`, input rows, and windows. Note that `size` and `step` are both measured by time units. In addition, the video frames may vary in length, so the number of frames in each window can be different.

@@ -2,7 +2,7 @@

## Introduction

A window all node batches all input rows into one, and returns the result by applying a function to the window. Refer to [window_all API](/04-API%20Reference/01-Pipeline%20API/08-window-all.md) for more details.
A window all node batches all input rows into one, and returns the result by applying a function to the window. Refer to [window_all API](/05-API%20Reference/01-Pipeline%20API/08-window-all.md) for more details.
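The all-rows-into-one-window behavior can be sketched in plain Python (an illustrative sketch of the semantics only, not the actual Towhee API; `window_all_node` is a hypothetical name):

```python
def window_all_node(rows, func):
    """Collect every input row into a single window and apply func once."""
    return [func(rows)]

# All six rows are reduced to one averaged result.
print(window_all_node([1, 2, 3, 4, 5, 6], lambda w: sum(w) / len(w)))  # [3.5]
```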

![img](https://github.com/towhee-io/data/blob/main/image/docs/window_all_intro.png?raw=true)

@@ -4,7 +4,7 @@

A concat node concats multiple pipelines' intermediate results, and groups all the pipelines into a bigger one.

The concat node does not apply functions for data processing. Instead, this node only merges the outputs of multiple pipelines. Refer to [concat API](/04-API%20Reference/01-Pipeline%20API/09-concat.md) for more details.
The concat node does not apply functions for data processing. Instead, this node only merges the outputs of multiple pipelines. Refer to [concat API](/05-API%20Reference/01-Pipeline%20API/09-concat.md) for more details.
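The merge-without-transforming behavior can be sketched in plain Python (an illustrative sketch of the semantics only, not the actual Towhee API; `concat_node` is a hypothetical name, and representing each branch's intermediate results as a list of dicts is an assumption):

```python
def concat_node(*branches):
    """Merge the intermediate outputs of several pipeline branches row-wise;
    no function is applied - the columns are simply combined."""
    return [
        {k: v for branch_row in aligned for k, v in branch_row.items()}
        for aligned in zip(*branches)
    ]

branch_a = [{"img": "cat.jpg"}, {"img": "dog.jpg"}]
branch_b = [{"vec": [0.1]}, {"vec": [0.2]}]
print(concat_node(branch_a, branch_b))
# [{'img': 'cat.jpg', 'vec': [0.1]}, {'img': 'dog.jpg', 'vec': [0.2]}]
```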

![img](https://github.com/towhee-io/data/blob/main/image/docs/concat_intro.png?raw=true)

Binary file added docs/04-Triton Server/qps.png
