Performance improvements: Index on ClaimMutationID #362

Open
wants to merge 1 commit into base: develop

Conversation

@mngoe (Contributor) commented Jan 28, 2025

Querying MutationLog by ClientMutationID is one of the most common requests in the database.

On a fresh install this kind of request is not really challenging, but with hundreds of users and millions of mutations in the table without an index, the database really struggles.

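For context, a minimal sketch of the kind of model change this implies in Django, assuming the field is called client_mutation_id on the MutationLog model (model, field, and column definitions here are illustrative, not copied from this PR's diff):

```python
# Illustrative only: add an index on client_mutation_id so that looking up a
# MutationLog by ClientMutationID no longer scans the whole table.
# Model/field names are assumptions, not the exact ones from this PR.
from django.db import models


class MutationLog(models.Model):
    # db_index=True makes Django create an index on this column
    # in the next migration.
    client_mutation_id = models.CharField(
        max_length=255, blank=True, null=True, db_index=True
    )
    json_content = models.TextField()
    request_date_time = models.DateTimeField(auto_now_add=True)
```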
@delcroip (Member)

What would you think about this kind of approach:
https://www.perplexity.ai/search/best-django-type-for-uuid-in-m-dLREnfSOSlusSqx10PnDvw#1 (@sunilparajuli mentioned that the UUID stored as text was not so fast on MSSQL)

@delcroip (Member)

@mngoe I am surprised to see no migration, did you run makemigrations?
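For reference, if the index is added with db_index=True (or Meta.indexes), makemigrations would normally emit something along these lines; this is an illustrative sketch, not the migration from this branch:

```python
# Hypothetical auto-generated migration for the client_mutation_id index.
# The app label and dependency are placeholders.
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ("core", "0001_initial"),  # placeholder dependency
    ]

    operations = [
        migrations.AlterField(
            model_name="mutationlog",
            name="client_mutation_id",
            field=models.CharField(
                blank=True, db_index=True, max_length=255, null=True
            ),
        ),
    ]
```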

@sunilparajuli (Member)

@delcroip @mngoe

The difference is very significant with large volumes of data. We hit this issue at the Social Security Fund Nepal, with a database of almost 1 TB: around ~300 GB of textual data and ~600 GB of attachments (attachments were configured to be stored in the database, which was a bad choice made previously), with the mutation log table contributing around ~40 GB.

Major problem faced:
A mutation log entry is generated almost every time a claim is created/updated, etc., and due to the very large table size, the ordering query (the recent mutations we see in the right panel) slowed down. When each thread (worker) starts to take longer, it also requires more RAM, since the work is being held open by MSSQL.

Diagnosis of the performance issue

  1. The UUID is not human-readable.

  2. It is not naturally sortable by creation time. Although the UUIDv1 format includes a timestamp, it encodes it in a little-endian layout, where the least significant portion of the time appears first. This makes it difficult to sort UUIDs by creation time (see the sketch after this list).

  3. For databases like MySQL and Oracle, which use a clustered primary key, both UUIDv1 and UUIDv4 will hurt insertion performance if used as the primary key, because the rows must be reordered to place the newly inserted row at the right position inside the clustered index.
    -> On the other hand, PostgreSQL uses a heap instead of a clustered primary key, so using a UUID as the primary key won't impact PostgreSQL's insertion performance (I would still say no to a UUID PK on a large dataset).
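To make point 2 concrete: with Python's uuid module you have to reassemble the timestamp from the time field to order UUIDv1 values by creation time, because the textual form starts with the least significant 32 bits (time_low). A small illustrative snippet:

```python
import uuid

# Generate a few UUIDv1 values. Each one embeds a 60-bit timestamp
# (100 ns intervals since 1582-10-15), but the string form starts with
# time_low, the least significant 32 bits of that timestamp.
ids = [uuid.uuid1() for _ in range(3)]

for u in ids:
    # u.time reassembles time_hi, time_mid and time_low into the full
    # timestamp, which is what you would need to sort on for creation order.
    print(u, u.time)

# Sorting the string form orders by time_low first, so it is not a reliable
# creation-time ordering (it breaks whenever time_low wraps around).
print(sorted(str(u) for u in ids))
```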

Analysis

[screenshot of the query analysis attached in the original comment]

Deep analysis

Why is an integer primary key faster?

  • Smaller storage size – an INT (4 bytes) or BIGINT (8 bytes) requires much less space than a UUID (16 bytes).
  • Better indexing – smaller keys result in smaller indexes, making lookups, joins, and updates faster.
  • Sequential inserts – an auto-incrementing INT (IDENTITY) ensures sequential inserts, reducing index fragmentation.
  • Efficient joins – foreign keys using INT are more efficient than those using UUID.

Why can a UUID be slower?

  • Random inserts – UUIDs are randomly generated, causing page splits and fragmentation in indexes.
  • Larger index size – more storage overhead leads to slower lookups.
  • Sorting issues – unlike IDENTITY, UUIDs are not sequential, making ordered queries slower.

Since we are not using any kind of distributed setup, using an auto-increment field as the PK is the better choice.
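To put the same argument in Django terms, here is a hedged sketch contrasting the two primary-key choices (model and field names are made up for the example, not taken from the openIMIS models):

```python
# Illustrative comparison of the two primary-key strategies discussed above.
import uuid

from django.db import models


class MutationLogIntPk(models.Model):
    # 8-byte, monotonically increasing PK: new rows always land at the end of
    # a clustered index (MSSQL/MySQL), so inserts cause no page splits.
    id = models.BigAutoField(primary_key=True)
    # The UUID the API needs can live in an ordinary indexed column instead.
    client_mutation_id = models.CharField(max_length=36, db_index=True)


class MutationLogUuidPk(models.Model):
    # 16-byte, randomly ordered PK: on engines with a clustered primary key,
    # inserts land at random positions and fragment the index as the table grows.
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
```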

Mitigation strategy (a sketch of the rename step follows below):

  • We had to create a new table for a fresh start (because deleting from the claim mutation log & core mutation log tables seemed to take an eternity, it could not be done in real time and the service needs to be up almost 24/7).
  • We named the new table "mutation table v2" (something like that) and removed data slowly from the original mutation log table (this took several repeated tasks).
  • We then renamed it back to the original table name (though it was not strictly needed, we still did it). We still need to fix this by using an integer PK.

@weilu, though it seems like a small area, this was the performance issue I was referring to yesterday.
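For illustration, the rename step of that swap can be scripted from Django with raw SQL; this is only a sketch under assumed table names, and sp_rename is MSSQL-specific:

```python
# Hypothetical sketch of the table swap: writes go to a freshly created, empty
# mutation-log table, and the tables are renamed once the old one is drained.
# Table names are placeholders, not the real openIMIS table names.
from django.db import connection


def swap_mutation_log_tables():
    with connection.cursor() as cursor:
        # Keep the old data around under a different name for later archiving.
        cursor.execute("EXEC sp_rename 'core_Mutation_Log', 'core_Mutation_Log_old';")
        # Promote the new, empty table to the original name.
        cursor.execute("EXEC sp_rename 'core_Mutation_Log_v2', 'core_Mutation_Log';")
```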

@sunilparajuli (Member)

> What would you think about this kind of approach https://www.perplexity.ai/search/best-django-type-for-uuid-in-m-dLREnfSOSlusSqx10PnDvw#1 (@sunilparajuli mentioned that the UUID stored as text was not so fast on MSSQL)

Thanks, I will check on this one. Also, whatever we end up doing, the data type should support indexing.

@delcroip (Member)

@sunilparajuli thanks a lot for that feedback, we had already reached the conclusion that MSSQL was not ideal but this adds up; why can't we clear/archive the mutation log with a running window of 7 days?

@sunilparajuli (Member)

> @sunilparajuli thanks a lot for that feedback, we had already reached the conclusion that MSSQL was not ideal but this adds up; why can't we clear/archive the mutation log with a running window of 7 days?

I just meant that it was taking a lot of time ("an eternity" as a figure of speech); clearing the data could take many hours, and the risk was maintaining uptime (the mutation log deletion process runs in parallel with mutation log insertions from active claim creation).

When we perform a DELETE operation in a database like MSSQL, the system needs to lock the rows, pages, or even the entire table to ensure that no other operations interfere with the deletion process. Even if you try to limit the locks to individual rows (using something like ROWLOCK), deleting a large amount of data can cause the locks to escalate to a table-level lock. This means the entire table gets locked, which can block other operations like reads or writes.

If the table is being used actively, for example in a 24/7 service where data is constantly being added or queried, this locking creates contention: multiple operations trying to access the same resource (in this case, the table) at the same time. Because of this, the deletion process becomes much slower, since it has to compete with other operations for access to the table. So deleting a large amount of data from an actively used table can take a very long time.
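One common way around this, which would also implement the 7-day running window suggested above, is to delete in small batches so that no single statement holds enough locks to escalate to a table lock. A hedged sketch, assuming the model is core.models.MutationLog with a request_date_time column (import path, field names and batch size are illustrative):

```python
# Illustrative batched cleanup: purge MutationLog rows older than 7 days in
# small chunks so MSSQL row/page locks never escalate to a table-level lock.
# Import path, field names and batch size are assumptions for the example.
from datetime import timedelta

from django.utils import timezone

from core.models import MutationLog  # assumed import path

BATCH_SIZE = 5000


def purge_old_mutation_logs():
    cutoff = timezone.now() - timedelta(days=7)
    while True:
        # Slice first so each DELETE statement touches a bounded number of rows.
        ids = list(
            MutationLog.objects.filter(request_date_time__lt=cutoff)
            .values_list("id", flat=True)[:BATCH_SIZE]
        )
        if not ids:
            break
        MutationLog.objects.filter(id__in=ids).delete()
```

Each iteration issues its own small delete, so concurrent readers and writers only ever wait on short, bounded locks instead of a long table-level lock.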

There could be other factors affecting the operation's efficiency, such as how many resources Maxime has allocated in the infrastructure.
