The test for BMAT consists of two parts:
- A CSV processing module
- An API and task execution system that uses this module to process the uploaded files
Each part involves one big decision worth discussing:
- Which package to use for processing large CSV files
- Which framework to use for the API
If the project were only going to run locally, the best solution would probably be Vaex. It is very fast at simple operations like the one we need for this test (a group by with a sum). However, after asking the team, we expect roughly a TB of data per month, and that volume will probably keep growing, so it is better to pick an option that can scale horizontally easily. That makes Dask and Ray the best candidates for scaling the project out to a larger cluster.
The next step should be to run performance tests with both solutions and research more packages for this use case, but since this is a test, I decided to use Dask because I have more experience with it.
Dask works as a lower-level scheduler that lets you execute different operations in parallel. When you work at the collections level of Dask (arrays, bags or, as in this case, dataframes), Dask generates a graph of execution tasks. A scheduler then takes that graph and has the workers (locally, your assigned cores) execute it.
In future steps, we should move to the distributed client and a cluster to process the data with more partitions, as in the sketch below.
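As a rough sketch of how the group-by-with-sum could look with Dask dataframes (the column names are assumptions, not the project's real schema; the distributed client lines are commented out because the current setup runs on local cores):

import dask.dataframe as dd

# Reading the CSV is lazy: Dask only builds the task graph at this point.
df = dd.read_csv('apps/data_processor/csvs/in/test_data.csv')

# The operation needed for the test: group by and sum.
# Column names here are assumptions.
plays_per_song = df.groupby(['song', 'date'])['plays'].sum()

# Nothing has executed yet. compute() hands the graph to the scheduler,
# which runs the partitions in parallel on the assigned local cores.
result = plays_per_song.compute()

# Future step: attach a distributed cluster instead of the local scheduler.
# from dask.distributed import Client
# client = Client('tcp://scheduler:8786')  # hypothetical scheduler address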
The choice of framework for the API also depends on the future of the application. I assumed that soon we would have a repository of songs, dates, and play counts that should accumulate across the different CSVs uploaded through the API, so the system will probably grow in complexity with various batch and query processes.
I also assume that, as in most companies, there will be an operations/support department that is not very technical. To give them an accessible admin web interface for checking the processing results, Django is the best option.
With Django, we can use the ORM and configure multiple databases, one per application. The cumulative song repository would probably fit a NoSQL database (a well-configured DynamoDB is probably the best choice). We could also connect that database easily to tools like QuickSight and visualize the data in a professional dashboard without much development.
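As an illustration only, a minimal sketch of the multiple-database idea with the Django ORM (the aliases, app label, and router path are hypothetical; DynamoDB itself is not a stock ORM backend, so it would need a third-party adapter or a separate access layer):

# settings.py (sketch): one database alias per application.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'bmat',
    },
    'songs': {
        # Hypothetical second database for the cumulative song repository.
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'songs',
    },
}
DATABASE_ROUTERS = ['apps.songs.routers.SongsRouter']  # hypothetical path

# routers.py (sketch): route the songs app to its own database.
class SongsRouter:
    def db_for_read(self, model, **hints):
        return 'songs' if model._meta.app_label == 'songs' else None

    def db_for_write(self, model, **hints):
        return 'songs' if model._meta.app_label == 'songs' else None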
The following requirements are needed:
- docker
- docker compose
To debug tasks and to simplify the workflow, it is also highly recommended to have:
- make
Running the project is a piece of cake if you have the minimum requirements. Just run the following:
cp ./docker/src/post_deploy.sh ./src
cp ./docker/src/run* ./src
docker-compose build
docker-compose up -d
And if you have make, it is even easier:
make complete-build
And voilà! Your project will be running in a Docker container.
After the first build, you can just use the following commands to run the project:
docker-compose up -d
or just
make
You can update some values. For example, the admin superuser is admin with password root1234 by default.
Also, depending on your PC and the file you are using to test the application, you can update MAX_SINGLE_FILE_SIZE. For my PC and the autogenerated data, the best value is 500000000, which means a maximum of 500 MB per output partition file.
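As a rough sketch of what such an override might look like (the variable names other than MAX_SINGLE_FILE_SIZE are assumptions; check the project's env files for the exact keys):

# .env (sketch; variable names other than MAX_SINGLE_FILE_SIZE are assumptions)
DJANGO_SUPERUSER_USERNAME=admin
DJANGO_SUPERUSER_PASSWORD=root1234
# Maximum size in bytes of each output partition file
MAX_SINGLE_FILE_SIZE=500000000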
This is a dockerized Django project with a Postgres database, so once it is started, you can check that everything is OK by going to the local admin.
If you can see the admin, everything is OK. If not, you should check the logs:
make logs
You can check the API structure with the following interfaces:
The API has four endpoints. The first two are for the login: the API uses simplejwt to log in as a user and get a token. You can use the admin user created when the project starts (admin with password root1234 if you don't update the env values).
With the token you have obtained, you can call the other endpoints, passing the token in the Authorization header. (There is a Postman helpers folder in the root of the project to make configuring the API calls easier.)
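For example, a minimal Python sketch of the login flow (the token endpoint path and the local address are assumptions based on the default simplejwt routes; adjust them to the project's URL configuration):

import requests

BASE_URL = 'http://localhost:8000'  # assumed local address and port

# Obtain a JWT token pair; the path assumes the default simplejwt route.
resp = requests.post(f'{BASE_URL}/api/token/',
                     json={'username': 'admin', 'password': 'root1234'})
access_token = resp.json()['access']

# All other calls pass the access token in the Authorization header.
headers = {'Authorization': f'Bearer {access_token}'}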
Now you can call the main endpoints of the test, api/process_file and api/csv_task_result/{uuid}/.
The first one is a simple endpoint to post a CSV file (it checks the format). Once you post it, it returns the UUID of the task created to process it. However, if you post the exact same file (assuming that file names are unique), it will raise a 409, and the error will give you the UUID of the task that previously tried to process the file.
The endpoint creates a CSVTask object, saving the file you posted and its original name. Then the file processing begins: it creates a Celery task and executes the process. The process generates files of at most MAX_SINGLE_FILE_SIZE bytes with the results and associates them with the CSVTask. You can check this using the admin.
If the file is corrupted, the process fails, the CSVTask is marked as error_processing, and the error message is also saved.
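A minimal sketch of posting a file, continuing with the headers from the login sketch above (the multipart field name and the exact response body shape are assumptions):

# Post a CSV to be processed; the multipart field name 'file' is an assumption.
with open('test_data.csv', 'rb') as fh:
    resp = requests.post(f'{BASE_URL}/api/process_file',
                         headers=headers, files={'file': fh})

if resp.status_code == 409:
    # The same file name was posted before; the error body carries the UUID
    # of the task that already tried to process it.
    print('Duplicate file:', resp.json())
else:
    # The body contains the UUID of the CSVTask created for this file.
    print('Task created:', resp.json())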
The second endpoint gives you the task result by its UUID using a GET request; you must also provide the authentication token. If the process has finished, you will get a 200, and the output_files field in the body will give you the list of files with the results.
If the processing of the initial file has not finished, it returns a 425 Too Early. If there was a problem processing the file, it returns a 406 Not Acceptable, and the error field will contain the error message raised while processing the file.
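Continuing the sketch above, polling the result endpoint with a task_uuid taken from the previous response might look like this (the status handling follows the codes described; the exact shape of the error body is an assumption):

resp = requests.get(f'{BASE_URL}/api/csv_task_result/{task_uuid}/',
                    headers=headers)

if resp.status_code == 200:
    # Finished: output_files holds the list of result files.
    print(resp.json()['output_files'])
elif resp.status_code == 425:
    print('Too Early: the file is still being processed')
elif resp.status_code == 406:
    print('Not Acceptable: processing failed; the body contains the error message')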
If you do not want to use the API to process a file, you can process files from the shell.
Django's shell_plus will import everything you need, so it is recommended:
make shell
Once in the shell, the first step is to create an empty CSVTask. You can create one with either of the following commands:
csv_task = CSVTaskFactory(output_files__total=0)
or
csv_task = CSVTask.objects.create()
Now that you have a CSVTask, just use the process_csv function. The file you want to process must be inside the container. The volumes map your whole local src into the container's /src/, so drop the file anywhere inside your local src and it will be available in the container. You can also use the Django command demo_file to create a file with fake data directly in the container:
python manage.py demo_file
By default it creates a file named test_data.csv of 1 GB, but you can specify other values:
python manage.py demo_file --total_size XXXXXXXXXX --file_name your_name.csv
It drops the file in apps/data_processor/csvs/in/, and now you can test the process function without Celery:
process_csv(task_uuid=csv_task.uuid.hex, csv_path='apps/data_processor/csvs/in/test_data.csv') # Update your csv path if needed
And that's it. You can see the process logs and how much time the task takes. You can check that the output files were generated with the following statement:
csv_task.output_files_urls
Or you can go to the admin page and check the files.
The project only has unit tests, and to check that they all pass, just run one of the following:
make test
or
docker exec bmat_dev_test_backend python -m pytest --log-cli-level=ERROR --disable-pytest-warnings
- The first one is to set up CI/CD and deploy the project to a cloud provider (AWS is the one I like most)
- Create integration tests using, for example, a Newman container
- Create a Dask cluster