Replace Spark+Petastorm with Sqlite+SqlAlchemy #445

gaurav274 · 2022-10-27T15:31:45Z

No description provided.

jarulraj · 2022-10-28T17:05:48Z

eva/storage/sqlite_storage_engine.py

+            if len(data_batch) * row_size >= batch_mem_size:
+                yield Batch(pd.DataFrame(data_batch))
+                data_batch = []
+        if data_batch:


We should add a pretty print function for a Batch as well
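For readers without the full file, the flush-on-threshold loop in the hunk above can be sketched roughly as follows. This is only an illustration: `iter_batches` and its explicit `row_size` argument are hypothetical names, and plain lists stand in for the `Batch(pd.DataFrame(...))` wrapper so the sketch needs only the standard library.

```python
def iter_batches(rows, row_size, batch_mem_size):
    """Yield lists of rows, flushing once the estimated size hits the limit.

    rows: any iterable of row dicts.
    row_size: assumed per-row size in bytes (hypothetical parameter; how the
        PR computes it is not visible in the hunk).
    batch_mem_size: flush threshold in bytes.
    """
    data_batch = []
    for row in rows:
        data_batch.append(row)
        # Same condition as the hunk: estimated batch size reached the limit.
        if len(data_batch) * row_size >= batch_mem_size:
            yield data_batch
            data_batch = []
    # Emit the trailing partial batch, if any rows are left over.
    if data_batch:
        yield data_batch


rows = [{"id": i} for i in range(5)]
batch_sizes = [len(b) for b in iter_batches(rows, row_size=10, batch_mem_size=20)]
```

The trailing `if data_batch:` check matters: without it, rows left over after the last full batch would be silently dropped.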

xzdandy · 2022-10-31T01:43:50Z

eva/storage/sqlite_storage_engine.py

+        for col in columns:
+            if col.type == ColumnType.NDARRAY:
+                dict_row[col.name] = self._serializer.serialize(dict_row[col.name])
+            elif isinstance(dict_row[col.name], (np.generic,)):


What is the ColumnType for this case? It seems that only TEXT, INTEGER, and FLOAT are left. In which of those cases will the value be an np.generic?

Ah, I see. np.generic is the base class for all NumPy scalar types, and tolist() is a general way to convert them to Python types (https://stackoverflow.com/a/53067954). Maybe update the comment; I was initially confused by the tolist() call.
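A small sketch of the conversion being discussed, assuming NumPy is installed. `to_python_scalar` is a hypothetical helper name, not a function from the PR; it only illustrates that `tolist()` on a NumPy scalar returns the matching built-in Python type.

```python
import numpy as np


def to_python_scalar(value):
    """Return a built-in Python value for NumPy scalars; pass others through."""
    # np.generic is the common base class of NumPy scalar types
    # (np.int64, np.float32, np.bool_, ...).
    if isinstance(value, np.generic):
        return value.tolist()
    return value


values = (np.int64(7), np.float32(0.5), np.bool_(True), 3)
converted = [to_python_scalar(v) for v in values]
converted_types = [type(v) for v in converted]
```

After the call, every element of `converted` is a plain `int`, `float`, or `bool`, which is the form a database driver generally expects as a bound parameter.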

xzdandy · 2022-10-31T02:02:52Z

eva/storage/sqlite_storage_engine.py

+                dict_row[col.name] = sql_row[idx]
+        return dict_row
+
+    def create(self, table: DataFrameMetadata, **kwargs):


What happens if the table already exists? This raises an issue that I did not notice before. I was using if_not_exists in the load executor. In the mat executor, we can use handle_if_not_exists because the catalog entry is created in the mat executor. But this is not the case for the load operator: the catalog entry is created before the load executor runs. I think we should choose one approach for both of them, and update the mat and load design and this create accordingly (e.g., document what happens when the table exists, remove **kwargs, and add a call to check whether the table exists, or can we use read to do it?).

Can you raise it as an issue? We can fix it later, as I want to get rid of PySpark ASAP.
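To make the question concrete, here is one possible shape for an explicit "does the table exist?" check. It is a sketch under assumptions: it uses the standard-library `sqlite3` module directly rather than the SQLAlchemy layer from the PR, and `table_exists`, `create_table`, the `if_not_exists` flag, and the two-column schema are hypothetical names and choices made for illustration only.

```python
import sqlite3


def table_exists(conn, name):
    """Check SQLite's schema table for a table with the given name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def create_table(conn, name, if_not_exists=False):
    """Create the table; return True if created, False if it was skipped.

    Raises ValueError when the table exists and if_not_exists is False,
    so the behavior for an existing table is explicit rather than implied.
    """
    if table_exists(conn, name):
        if if_not_exists:
            return False
        raise ValueError(f"table {name!r} already exists")
    # Placeholder schema for the sketch; the name is quoted as an identifier.
    conn.execute(f'CREATE TABLE "{name}" (id INTEGER PRIMARY KEY, data BLOB)')
    return True


conn = sqlite3.connect(":memory:")
first = create_table(conn, "videos")
second = create_table(conn, "videos", if_not_exists=True)
```

With a single documented entry point like this, both the load and mat paths could share the same existing-table behavior instead of each handling it separately.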

gaurav274 and others added 20 commits July 31, 2022 12:47

docs: clean up

4ba5cf3

docs: development guide clean up and version add to docs

cd27791

style: fix

f74b81a

style: only style the defualt dir

d88fae5

Merge branch 'master' of github.com:georgia-tech-db/eva

f876385

merge

e555aae

merge master

b6d5cc5

Merge branch 'master' of github.com:georgia-tech-db/eva

9edf79c

remove spark

dc95ee1

merge

8d0e9cb

Merge branch 'remove_spark' of github.com:georgia-tech-db/eva

a0cf2a0

feat: replace spark+petastorm with sqlalchemy+sqlite

5b9cdd6

feat: remove petastorm related code

7ea159e

feat: remove spark+petastorm dependency

0f37cd0

test: add sqlite test case

7eb3d5d

feat: remove petastorm test case

d033383

feat: enable drop table for sqlite

e9c92b7

style: ran black

177a637

docs: remove old file

40656da

fix: sqlalchemy does not support numpy data types

06c2120

gaurav274 requested review from xzdandy, jarulraj and LordDarkula October 27, 2022 17:43

gaurav274 changed the title ~~[WIP] - Replace Spark+Petastorm with Sqlite+SqlAlchemy~~ Replace Spark+Petastorm with Sqlite+SqlAlchemy Oct 27, 2022

docs: improve docs

bdf8c6e

jarulraj approved these changes Oct 28, 2022

gaurav274 commented Oct 28, 2022

xzdandy requested changes Oct 31, 2022

gaurav274 added 2 commits October 31, 2022 23:08

fix: address PR comments

b649a47

merge

fa4bf70

fix: merge issues

570081a

gaurav274 merged commit b843a7b into master Nov 1, 2022

gaurav274 deleted the remove_spark branch November 1, 2022 13:36
