Design

Introduction

This document describes the design of the advanced analytics framework on which complex data analysis algorithms can be built.

Constraints / Limitations

The best way to run dynamic SQL queries is to use Lua scripts. Because Lua scripting enables us to run queries in the same transaction using the pquery method.
The results of pquery are simple handles and there is a limit of 250 open result set handles for Lua scripts.
The main logic of the framework should be implemented in Python, because it is widely used in data processing and analysis.
Python UDF, where the main logic is placed, returns the result of the query and finishes its execution. Due to that, it cannot be kept running and used as server.

Assumptions

Since the number open result set handles in Lua scripts is limited, it is assumed that Lua scripts is not nested more than 250 times.
The result of pquery is assumed to not exceed the db memory limit in size. The DB usually allows 4 GB per query and node. The UDF can consume up to 32 GB per node, but gets throttled down after the 4 GB. However, the 32 GB are shared between all queries on a node.
Because Python UDF can't run constantly, we need to save its state to the BucketFS. Saving and loading needs to reasonable fast to not introduce too much overhead. For that, we assume the state of the Python UDF is reasonable small (max a few MB).
The (de-)serialization and the upload/download of the state to the BucketFS take less than 3 seconds.

Design Considerations

Due to the fact that pquery is only available for Lua and Python is useful to implement the user code, we proposed an event-driven design for this framework. The designed framework is divided into two parts:

The first part needs to be a Lua Script which is responsible for running SQL queries.
The second part runs the python user code and is responsible for generating the SQL queries

Because Python UDF cannot be kept running and used as server, we have to call for each iteration the UDF again.

This means the UDF has to store and load its state each time it is getting called.
Furthermore, the interface to the user code needs to be suitable for this form of execution.
The user code can't wait actively for the result of a query. The framework will execute it while the UDF is not running anymore.
- The user code has to get called again, when the SQL queries are ready.
- The user code is a kind of callback, however a call back with state.
- To this type of execution, the model of an query handler fits best.
- This means the Lua Script is our query loop.

System Design and Architecture

The designed event-driven framework consists of 3 main components: (1) Query Loop that handles only the state transitions, (2) Query Handler Framework that defines a state machine and provide a framework to the user code, (3) Query Handler, where the user implements their algorithm. The Query Loop component is proposed to be implemented in Lua, because a Lua script is the only way to run the dynamic SQL queries in the same transactions. On the other hand, the Query Handler is implemented in Python script, since Python simplifies the development of data analysis methods by offering a wide variety of data processing tools.

Query Loop

The Query Loop processes only the state transitions by executing queries returned by the Query Handler. This loop provides the new state transition by calling the Query Handler until the Query Handler returns the stop signal with a result or an error.

Initiating Query Loop

dsn~initiating-query-loop~1

The Query Loop is responsible for getting the parameters of the Query Handler that will be called in the loop and forward them to it.

Covers:

req~intiating-algorithm~1

Needs: impl, utest, itest

Iterating over Loop

dsn~handling-states~1

The output of the called Query Handler query can include SQL queries to complete its computation. In that case the Query Loop runs them on each iteration, until the Query Handler returns the status as complete.

Covers:

req~iterating-over-loop~1

Needs: impl, utest, itest

Returning JSON Result

dsn~returning-json-result~1

The Query Loop returns its result in json format.

Covers:

req~returning-result~1

Needs: impl, utest, itest

Query Handler

The Query Handler is a framework whose logic is implemented by the users. It reacts to the query results of the Query Loop and generates SQL queries of the next state. It stores the states in BucketFS.

Python UDF Framework

dsn~python-udf-framework~1

An Query Handler designed as Python UDF instead of a Lua script allows users to have an easy-to-use and better tooling framework to build their algorithm on top of. The user can develop the algorithm as a Python UDF script in which the framework is imported. Some details of this framework interface are listed below:

The query handler is implemented as a class and states are kept as attributes of a class to facilitate the states operations.
The user code should implement a method handle_event where the logic of the code is placed.
The first time the framework calls the query handler, it calls its constructor with a config init(config:Dict[str,Any]), the config is given to the query loop and the Python UDF in JSON format and deserialized to a Dict
The query handler is called by getting the following inputs: The results of the return query, the ExasolDB context object.
Upon completion of the algorithm, the Query Handler calls methods itself to remove temporary records.
In case of an error caught in the Query Loop, If the keep option is not set, the cleanup event is called for the Query Handler and the temporary records are deleted. Otherwise, they are kept and can be used for further investigations such as debugging.

Below you can find the code snippet that simply indicates the Python framework interface. Please note that the return_query is wrapped into the next call to query handler UDF.

class Result:
    query_list: List[str]
    return_query: Optional[str]

class Context:
    """The Context is part of the Python framework and provides additional functions to the user code."""
    
    def getTemporaryTableManager():TemporaryTableManager
    def getTemporaryBucketFSFileManager():TemporaryBucketFSFileManager
    
def handle_event(row_iterator:RowIterable,context:Context)->Result
    # user code

Covers:

req~user-friendly-framework-interface~1

Needs: impl, utest, itest

Storing States in BucketFS

dsn~storing-states-in-bucketfs~1

The framework keeps states during iterations by storing them in BucketFS. The states include the query handler class of the user and the context state for managing temporary tables or bucketfs files

Covers:

req~managing-temporary-bucketfs-files~1

Needs: impl, utest, itest

Returning Queries

dsn~returning-queries~1

The Query Handler returns a list of SQL queries to execute and the return query which will be called again as the next state action so that the new state transition is performed.

Covers:

feat~implementation-framework~1

Needs: impl, utest, itest

Managing Temporary BucketFS Files

dsn~managing-temporary-bucketfs-files~1

The Query Handler create temporary BucketFS files that will be kept as result. These temporary files are placed in the same directory in the BucketFS.

Covers:

req~managing-temporary-bucketfs-files~1

Needs: impl, utest, itest

Managing Temporary Tables

dsn~managing-temporary-tables~1

The Query Handler uses temporary tables to store large intermediate results.

Covers:

req~managing-temporary-tables~1

Needs: impl, utest, itest

Electing Leader

dsn~electing-leader~1

The leader selection consist of 3 main steps: Firstly, each UDF is assigned a unique id which is the concatenation of node_id and vm_id. Secondly, the list of UDF instances are discovered using pycos. Thirdly, the UDF instance with the largest id is elected as the leader. Other instances send a confirmation message to the instance with the highest id that it is the leader. If the leader got all the confirmation, it acknowledges them.

Covers:

req~electing-leader~1

Needs: impl, utest, itest

Collective Operation

dsn~collective-operation~1

The framework uses a collective operation approach, where tasks are simultaneously run on multiple UDF instances to achieve parallelism.

Covers:

req~collective-operation~1

Needs: impl, utest, itest

Error Handling

In case of any errors in the execution, temporarily created files and tables are cleaned.

Cleanup Temporary Tables

dsn~cleanup-temporary-tables~1

Temporary tables are removed in case of an error occuring during the execution of the framework with the keeping option not selected.

Covers:

req~cleanup-temporary-tables~1

Needs: impl, utest, itest

Cleanup Temporary BucketFS Files

dsn~cleanup-temporary-bucketfs-files~1

Temporary BucketFS files are removed in case of an error that occur during the execution of the framework with the keeping option not selected.

Covers:

req~cleanup-temporary-bucketfs-files~1

Needs: impl, utest, itest

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

design.md

design.md

Design

Introduction

Constraints / Limitations

Assumptions

Design Considerations

System Design and Architecture

Query Loop

Initiating Query Loop

Iterating over Loop

Returning JSON Result

Query Handler

Python UDF Framework

Storing States in BucketFS

Returning Queries

Managing Temporary BucketFS Files

Managing Temporary Tables

Electing Leader

Collective Operation

Error Handling

Cleanup Temporary Tables

Cleanup Temporary BucketFS Files

Files

design.md

Latest commit

History

design.md

File metadata and controls

Design

Introduction

Constraints / Limitations

Assumptions

Design Considerations

System Design and Architecture

Query Loop

Initiating Query Loop

Iterating over Loop

Returning JSON Result

Query Handler

Python UDF Framework

Storing States in BucketFS

Returning Queries

Managing Temporary BucketFS Files

Managing Temporary Tables

Electing Leader

Collective Operation

Error Handling

Cleanup Temporary Tables

Cleanup Temporary BucketFS Files