Cassandra Approach for Project ML Exhaust #103

Shakthieshwari · 2023-04-13T11:54:13Z

Hi @reshmi-nair @amit-tarento ,

As part of the next release, we are planning a few ML Exhaust optimisations for scaling.
Jira Ticket Link :- https://project-sunbird.atlassian.net/browse/LR-472

Problem Statement :- The CSV should be extracted from the transactional DB (Cassandra) rather than from Druid, so that we no longer need to delete the Druid datasource through batch ingestion.

Reason for Deletion of Datasource :- The status of a project changes over time and Druid doesn't support updating a record, so every day we delete the entire data from Druid and re-ingest all of it to get the updated status of each submission.

Concern :- Druid does not handle huge data volumes well when the data is extracted as a CSV.

Approach (Solution) :- Please check this Confluence doc, where we have detailed the design: https://project-sunbird.atlassian.net/l/cp/TRSTnzhN

Similar to the Data Product and Flink jobs implemented for PII, we need to create the same for projects as well.

Note :- The design doc attached here is very similar to the design developed for ML PII.

@SanthoshVasabhaktula @reshmi-nair @rhwarrier Please share your approval and suggestions so that we can go ahead with this.

Cc- @aishwaryashikshalokam @Ashwiniev95 @Prateek-slokam @aks30 @kiranharidas187 @vijiurs @vivek.m@pacewisdom.com

Please do the needful at the earliest.

Thanks

Shakthieshwari · 2023-04-17T04:04:59Z

@SanthoshVasabhaktula @reshmi-nair @rhwarrier Could you please help us out here at the earliest?

Thanks

8 replies

Shakthieshwari Apr 18, 2023
Author

@SanthoshVasabhaktula Can we please get on a call? Let us know your availability. We have a few queries.

Shakthieshwari Apr 20, 2023
Author

@SanthoshVasabhaktula Could you please let us know your availability? We can discuss.

Shakthieshwari Apr 24, 2023
Author

@SanthoshVasabhaktula We opted for Cassandra DB for the reasons below:

1. To get the unique aggregation count from Cassandra DB to solve the Admin Dashboard issues.
2. So far we have seen that Cassandra DB is efficient enough for huge data extraction, as we have observed in the Course CSV Exhaust. Hence we thought we would use Cassandra DB; we are also not sure whether MongoDB is efficient enough for huge data query extraction.
3. Data in MongoDB is not flattened, so each exhaust would require a Data Product, and the flattening logic would have to be repeated in every Data Product exhaust. If we have to use MongoDB, then I think we should flatten the data using Flink jobs that consume the data from Kafka, store it back into separate MongoDB collections, and create a customised data product on top of the flattened MongoDB data.
4. Our plan was to remove Druid from the exhaust extraction and point the exhaust to Cassandra.

If there are further queries, I would suggest we get on a call to discuss in detail. Please let us know your availability and I will share the invite.

SanthoshVasabhaktula Apr 24, 2023

@Shakthieshwari - Please see my responses to your reasons below:

1. To get the unique aggregation count from Cassandra DB to solve the Admin Dashboard issues.
SV: Cassandra is a key-value store and doesn't support efficient aggregations compared to MongoDB. One has to load all the data into memory and perform the aggregation.
2. So far we have seen that Cassandra DB is efficient enough for huge data extraction, as we have observed in the Course CSV Exhaust. Hence we thought we would use Cassandra DB; we are also not sure whether MongoDB is efficient enough for huge data query extraction.
SV: Cassandra is optimised for writes, not reads. Since it is a distributed key-value DB, reads are performant, but that doesn't mean MongoDB cannot perform well. You can go through this [link](https://sciresol.s3.us-east-2.amazonaws.com/IJST/Articles/2022/Issue-31/IJST-2022-1352.pdf) to understand the performance of Redis vs Cassandra vs MongoDB. Please do your own benchmark before concluding that Cassandra is faster than MongoDB.
3. Data in MongoDB is not flattened, so each exhaust would require a Data Product, and the flattening logic would have to be repeated in every Data Product exhaust. If we have to use MongoDB, then I think we should flatten the data using Flink jobs that consume the data from Kafka, store it back into separate MongoDB collections, and create a customised data product on top of the flattened MongoDB data.
SV: As mentioned earlier, this can be done directly in Spark. You can avoid the complexity of "MongoDB -> Kafka -> Flink -> Cassandra -> Spark" and directly use "MongoDB -> Spark". This makes your code simpler and makes it easier to verify that no data is missing.
4. Our plan was to remove Druid from the exhaust extraction and point the exhaust to Cassandra.
SV: I assumed the plan was to move to the transactional DB, not to Cassandra explicitly.

My recommendation is unchanged: do not complicate the system by creating a complex pipeline and duplicating data. If you still think that the above design helps in accelerating your move from MongoDB to Cassandra, please go ahead with the implementation.

If you still have any further queries, please set up some time on my calendar in the second half of tomorrow.
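
To make the first point in this reply concrete: when the store cannot group and count on the server side, a unique count has to be computed by whoever reads the rows. The following is a minimal Python sketch of that client-side scan. The row shape and the names `program_id`, `user_id` and `status` are assumptions made for illustration, not the actual ML project schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical rows as (program_id, user_id, status); these column names are
# assumptions for illustration, not the real schema.
Row = Tuple[str, str, str]


def unique_users_per_program(rows: Iterable[Row]) -> Dict[str, int]:
    """Count distinct users per program by scanning every row on the client."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for program_id, user_id, _status in rows:
        # The set of user ids per program is held in memory while scanning.
        seen[program_id].add(user_id)
    return {program_id: len(users) for program_id, users in seen.items()}


rows = [
    ("prog-1", "user-a", "started"),
    ("prog-1", "user-a", "submitted"),  # same user, new status: counted once
    ("prog-1", "user-b", "started"),
    ("prog-2", "user-a", "started"),
]
counts = unique_users_per_program(rows)
print(counts)  # {'prog-1': 2, 'prog-2': 1}
```

Every row is read once, and the memory needed grows with the number of distinct users per program, which is the cost this reply refers to.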

Shakthieshwari Apr 25, 2023
Author

@SanthoshVasabhaktula I have scheduled a call today at 2pm. Please do join, we have a few queries.

Shakthieshwari · 2023-06-23T07:25:53Z

@SanthoshVasabhaktula @mathewjpallan As you suggested, we explored the MongoDB approach and did a POC with the two approaches below:

1. Using Spark to read the data from MongoDB, flatten and format it as per our requirements, and store it as a CSV => We could not achieve this with Spark because we faced a lot of issues in formatting the data as per our needs.
2. Using a UDF in Scala-Spark => We could format the data as per our needs, but this fails for huge volumes (we get a Java heap space error), so it does not scale to large data.

Please check this doc for the detailed POC: https://project-sunbird.atlassian.net/l/cp/dh6Ha1uB
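
For readers following the flattening discussion, here is a minimal sketch of what "flatten and write to CSV" can look like when rows are emitted one at a time instead of being collected first. It is plain Python rather than the Scala-Spark code used in the POC, and the document shape (`_id`, `status`, `tasks`) is an assumption made for illustration. Whether row-at-a-time processing would address the heap space error seen in the POC is not established in this thread.

```python
import csv
import io
from typing import Any, Dict, Iterable, Iterator

# Hypothetical output columns; the real exhaust has its own column list.
FIELDS = ["project_id", "project_status", "task_id", "task_status"]


def flatten_project(doc: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per task of a nested project document."""
    base = {
        "project_id": doc["_id"],
        "project_status": doc.get("status", ""),
    }
    # A project without tasks still produces one row with empty task columns.
    tasks = doc.get("tasks") or [{}]
    for task in tasks:
        yield {
            **base,
            "task_id": task.get("_id", ""),
            "task_status": task.get("status", ""),
        }


def write_csv(docs: Iterable[Dict[str, Any]], out) -> int:
    """Stream flattened rows to `out` and return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for doc in docs:
        for row in flatten_project(doc):
            writer.writerow(row)  # each row is written as soon as it is built
            count += 1
    return count


docs = [
    {"_id": "p1", "status": "inProgress", "tasks": [
        {"_id": "t1", "status": "completed"},
        {"_id": "t2", "status": "notStarted"},
    ]},
    {"_id": "p2", "status": "started", "tasks": []},
]
buffer = io.StringIO()
rows_written = write_csv(iter(docs), buffer)
lines = buffer.getvalue().splitlines()
```

Because `flatten_project` is a generator and `write_csv` accepts any iterable, the input can be a database cursor and the output a file handle, so only one document is held in memory at a time.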

Hence, as per our understanding, we have taken the POC to the maximum extent and could not achieve it. We are therefore thinking of going back to the Cassandra approach design that we proposed and that was reviewed earlier: https://project-sunbird.atlassian.net/l/cp/mMBCB4Xy

The only concerns raised by @SanthoshVasabhaktula about the Cassandra approach were DB sync between the two transactional DBs and data duplication across them. Below are the approaches we propose to resolve these:
Approach 1. From the App/Portal, store the data directly in Cassandra.
Approach 2. Run a cron job to sync the data between MongoDB and Cassandra (maybe every 5 minutes), using a new key-value pair that marks each record as processed or not processed.
Approach 3. From the App/Portal, push the data into Kafka; Flink jobs process it and store it in Cassandra. In case of a Cassandra failure, store the Cassandra insert queries in a log file and, with the help of a cron job script, insert the data once Cassandra is up and running again.
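
As an illustration of Approach 2, the flag-based sync could be sketched roughly as below. This uses in-memory stand-ins instead of the MongoDB and Cassandra drivers, and the field name `processed` and the function `sync_once` are hypothetical. It also assumes that any update to a source document resets its flag to not processed, which the approach as described does not spell out.

```python
from typing import Any, Dict, List

# In-memory stand-ins for the two stores. A real job would use the MongoDB
# and Cassandra drivers; all names here are hypothetical.
source_docs: List[Dict[str, Any]] = [
    {"_id": "p1", "status": "started", "processed": False},
    {"_id": "p2", "status": "submitted", "processed": True},
    {"_id": "p3", "status": "inProgress", "processed": False},
]
target_rows: Dict[str, Dict[str, Any]] = {"p2": {"status": "submitted"}}


def sync_once(source: List[Dict[str, Any]],
              target: Dict[str, Dict[str, Any]]) -> int:
    """Copy every unprocessed document to the target, then mark it processed."""
    synced = 0
    for doc in source:
        if doc["processed"]:
            continue
        # Upsert keyed by _id, so copying the same document twice is harmless.
        target[doc["_id"]] = {"status": doc["status"]}
        # Flip the flag only after the write, so a crash in between leads to
        # a retry on the next run rather than a lost record.
        doc["processed"] = True
        synced += 1
    return synced


first_run = sync_once(source_docs, target_rows)   # copies p1 and p3
second_run = sync_once(source_docs, target_rows)  # nothing left to copy
```

With a 5-minute schedule, the target can lag the source by up to one interval plus the run time of the job.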

The details of the above approaches are at the bottom of the doc https://project-sunbird.atlassian.net/l/cp/mMBCB4Xy, please check.
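
The failure handling in Approach 3 (log the insert when Cassandra is down, replay it later) could look roughly like the sketch below. A Python list stands in for the log file, and `FakeStore`, `StoreDown`, `write_or_log` and `replay` are hypothetical names introduced only for this illustration; they are not taken from the linked design doc.

```python
import json
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class StoreDown(Exception):
    """Raised by the hypothetical writer when the target store is unavailable."""


class FakeStore:
    """Stand-in for Cassandra with a switch to simulate an outage."""

    def __init__(self) -> None:
        self.up = True
        self.rows: Dict[str, Record] = {}

    def write(self, record: Record) -> None:
        if not self.up:
            raise StoreDown()
        self.rows[record["id"]] = record


def write_or_log(record: Record, write: Callable[[Record], None],
                 fallback_log: List[str]) -> bool:
    """Try the write; on failure, append the record to the fallback log."""
    try:
        write(record)
        return True
    except StoreDown:
        fallback_log.append(json.dumps(record))
        return False


def replay(fallback_log: List[str], write: Callable[[Record], None]) -> int:
    """Replay logged records in order, stopping at the first failure."""
    replayed = 0
    while fallback_log:
        record = json.loads(fallback_log[0])
        try:
            write(record)
        except StoreDown:
            break  # store is still down; keep the remaining entries
        fallback_log.pop(0)  # drop an entry only after it was written
        replayed += 1
    return replayed


store = FakeStore()
log: List[str] = []
store.up = False
written = write_or_log({"id": "p1", "status": "started"}, store.write, log)
store.up = True
replayed = replay(log, store.write)
```

Replaying in order and removing an entry only after a successful write keeps the log consistent if the store goes down again in the middle of a replay.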

Please review it and let us know the next steps. If required, we can get on a call as well.

@rakeshSgr Request you to take this forward from here.

Cc- @aks30 @kiranharidas187 @vijiurs @Vivek-M-08 @rakeshSgr

Thanks
