
BMAT Dev Test

The test for BMAT consists of two parts:

  • A CSV processing module
  • An API and task execution system that uses this module to process the uploaded files

Each part is related to one big decision to be discussed:

  • What package to use for processing large CSV files
  • What framework to use for the API

If the project were only going to run locally, the best solution would probably be vaex. It is very fast at simple operations like the one we need for the test (a group by with a sum). But after asking the team, we expect roughly a TB per month, and that volume will probably grow in the future. So it is better to choose an option that can scale horizontally easily, and the best solutions are Dask or Ray, which let the project scale out to a bigger cluster.

The next step should be to run some performance tests with both solutions and to research more packages for this case, but this is a test, so I decided to use Dask because I have more experience with it.

Dask works as a lower-level scheduler that allows you to execute different operations in parallel. When you work at the collections level of Dask (with arrays, bags or, as in this case, dataframes), Dask generates a graph of execution tasks. A master scheduler then takes that graph and makes the workers (locally, your assigned cores) execute it.

In future steps, we should use the distributed client and a cluster to process the data with more partitions.
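As a minimal sketch of that idea, this is roughly what the group-by-with-sum looks like on a Dask dataframe (the file path comes from the demo command described below; the column names "song" and "plays" are an assumption about the CSV schema):

import dask.dataframe as dd

# Lazily read the CSV in partitions; nothing runs yet, Dask only builds
# the task graph at this point.
ddf = dd.read_csv("apps/data_processor/csvs/in/test_data.csv", blocksize="64MB")

# Group by song and sum the plays (the aggregation required by the test).
result = ddf.groupby("song")["plays"].sum()

# compute() hands the graph to the scheduler, which makes the workers
# (locally, your assigned cores) execute the tasks in parallel. With a
# dask.distributed Client pointed at a cluster, the same code scales out.
totals = result.compute()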

The choice of the framework to generate the API also depends on the future of the application. I assumed that soon we will have a repository of songs, dates, and plays that should be cumulative across the different CSVs passed through the API. So it will probably grow in complexity, with varying batch and query processes.

Also, I assume that, as in most cases, the company will have an operations/support department that is not very technical. So, to provide them with an accessible admin web to see the process results, Django is the best option.

With Django, we can integrate the ORM and have multiple databases, one per application. The cumulative songs would probably live in a NoSQL database (a well-configured DynamoDB is probably the best choice). We can also connect that database easily to tools like QuickSight and visualize the data in a professional panel without tons of development.
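As an illustrative sketch only (the names and credentials are made up, and DynamoDB would need its own client or a third-party backend rather than a core ORM engine), the multi-database setup in Django settings would look something like this:

# settings.py (sketch, not the project's actual configuration)
DATABASES = {
    # Default Postgres database used by the Django apps in this test.
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "bmat",          # assumed name
        "USER": "postgres",      # assumed credentials
        "PASSWORD": "postgres",
        "HOST": "database",      # assumed docker-compose service name
        "PORT": 5432,
    },
    # A second database for the cumulative plays could be registered here
    # later and selected per app with a router (DATABASE_ROUTERS setting).
}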

Requirements

The following requirements are needed:

docker
docker compose

But to debug tasks and also to simplify the work, it is highly recommended to also have these dependencies:

make

Installation

Running the project is a piece of cake if you have the minimum requirements. Just run the following:

cp ./docker/src/post_deploy.sh ./src
cp ./docker/src/run* ./src
docker compose build
docker-compose up -d

And if you have make, it is even easier:

make complete-build

And voilà! Your project will be running in a Docker container.

After the first build, you can just use one of the following commands to run the project:

docker-compose up -d

or just

make

Configuration

You can update some values. For example, the admin superuser, by default, is admin with password root1234.

Also, depending on your PC and the file you are using to test the application, you can update MAX_SINGLE_FILE_SIZE.

For my PC and the autogenerated data, the best value is 500000000, which means output files of at most 500 MB per partition.
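These values come from the env values used by docker compose; as a hedged sketch (the exact settings code in the project may differ), reading such a setting usually looks like this:

import os

# Maximum size in bytes for each output partition file.
# 500000000 (~500 MB) worked well on my machine with the autogenerated data.
MAX_SINGLE_FILE_SIZE = int(os.environ.get("MAX_SINGLE_FILE_SIZE", 500000000))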

How it works

This is a Django project dockerized with a Postgres database, so once it is started, you can check if everything is OK by going to the local admin.

If you can see the admin, everything is OK. If not, you should check the logs:

make logs

You can check the API structure with the interfaces exposed by the project.

The API has four endpoints. The first two are for the login: it uses simplejwt to log in as a user and get a token. To log in, you can use the admin user created when the project starts (admin with password root1234 if you don't update the env values).

With the token you have obtained, you can call the other endpoints, passing the token in the Authorization header. (There is a Postman helpers folder in the root of the project to make it easier to configure the API calls.)

Now you can call the main endpoints of the test: api/process_file and api/csv_task_result/{uuid}/.

The first is a simple endpoint to post a CSV file (it checks the format). Once you post it, it will return the UUID of the task created to process it. However, if you post the exact same file again (assuming that file names are unique), it will raise a 409, and the error will give you the UUID of the task that previously tried to process the file.

The endpoint will create a CSVTask object, saving the file you passed and its original name. Then the file processing begins: it creates a Celery task and executes the process. The process generates files of at most MAX_SINGLE_FILE_SIZE bytes with the results and associates them with the CSVTask. You can check that using the admin.

If the file is corrupted, the process will fail, the CSVTask will be marked as error_processing, and the error message will also be saved.
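As a hedged sketch of how that flow could be wired (the import paths, the field names, and where the error handling lives are all assumptions; the real task may differ):

from celery import shared_task

# Import paths below are assumptions for illustration; the real modules
# live somewhere under the data_processor app.
from apps.data_processor.models import CSVTask
from apps.data_processor.processors import process_csv


@shared_task
def process_csv_task(task_uuid: str, csv_path: str) -> None:
    """Illustrative wrapper around process_csv for the uploaded file."""
    try:
        # process_csv performs the group-by-sum and writes output files of
        # at most MAX_SINGLE_FILE_SIZE bytes, associated with the CSVTask.
        process_csv(task_uuid=task_uuid, csv_path=csv_path)
    except Exception as error:
        # A corrupted file ends up here: mark the task as error_processing
        # and keep the error message (field names are assumptions).
        CSVTask.objects.filter(uuid=task_uuid).update(
            status="error_processing", error_message=str(error)
        )
        raise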

The second endpoint gives you the task result by its UUID using a GET. You must also provide the authentication token. If the process has finished, you will get a 200, and the output_files field in the body will give you a list of files with the results.

If the processing of the initial file has not finished, it will return a 425 Too Early. If there was a problem processing the file, it will return a 406 Not Acceptable, and the error field will contain the error message raised during the processing.
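Putting the whole flow together, here is a hedged example with requests (the base URL, the token endpoint path, and the request/response field names are assumptions based on a default simplejwt setup; check the Postman helpers folder for the real values):

import requests

BASE_URL = "http://localhost:8000"  # assumed local address

# 1. Log in with simplejwt to get an access token (the path and payload
#    assume the default TokenObtainPairView route).
token_response = requests.post(
    f"{BASE_URL}/api/token/",
    json={"username": "admin", "password": "root1234"},
)
headers = {"Authorization": f"Bearer {token_response.json()['access']}"}

# 2. Upload the CSV; the response contains the UUID of the created task.
with open("test_data.csv", "rb") as csv_file:
    upload_response = requests.post(
        f"{BASE_URL}/api/process_file",
        headers=headers,
        files={"file": csv_file},  # form field name is an assumption
    )
task_uuid = upload_response.json()["uuid"]  # response key is an assumption

# 3. Poll for the result: 200 returns the output files, 425 means the task
#    is still running, 406 means the processing failed.
result_response = requests.get(
    f"{BASE_URL}/api/csv_task_result/{task_uuid}/", headers=headers
)
if result_response.status_code == 200:
    print(result_response.json()["output_files"])
elif result_response.status_code == 425:
    print("Still processing, try again later")
elif result_response.status_code == 406:
    print("Processing failed:", result_response.json())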

Testing process from the terminal

If you do not want to use the API to process a file, you can process it from the shell.

Django's shell_plus will import everything you need, so it is recommended.

make shell

Once in the shell, the first step will be to create an empty CSVTask. You can create one using one of the following commands:

csv_task = CSVTaskFactory(output_files__total=0)

or

csv_task = CSVTask.objects.create()

Now that you have a CSVTask, just use the process_csv function. The file you want to process must be inside the container. The volumes map all your local src into /src/ in the container, so drop the file anywhere inside your local src and it will be available in the container. You can also use the Django command demo_file to create a file with fake data directly in the container.

python manage.py demo_file

By default, it will create a file named test_data.csv of 1GB, but you can specify other values:

python manage.py demo_file --total_size XXXXXXXXXX --file_name your_name.csv

It will drop the file in apps/data_processor/csvs/in/, and now you can test the process function without Celery.

process_csv(task_uuid=csv_task.uuid.hex, csv_path='apps/data_processor/csvs/in/test_data.csv')  # Update your csv path if needed

And that's it. Now you can see the process logs and how much time the task takes. You can check that the output files were generated using the following statement:

csv_task.output_files_urls

Or you can go to the admin page and check the files.

Tests

The project only has unit tests. To check that all tests pass, just exec one of the following:

make test

or

docker exec bmat_dev_test_backend python -m pytest --log-cli-level=ERROR --disable-pytest-warnings

Next steps

  • The first one is to set up CI/CD and deploy to a cloud provider (for example, AWS is the one I like the most)
  • Create integration tests, using, for example, a Newman container
  • Create a Dask cluster
