Replace Spark+Petastorm with Sqlite+SqlAlchemy #445

Merged
merged 24 commits into from
Nov 1, 2022

Conversation

gaurav274
Member

No description provided.

@gaurav274 changed the title from "[WIP] - Replace Spark+Petastorm with Sqlite+SqlAlchemy" to "Replace Spark+Petastorm with Sqlite+SqlAlchemy" on Oct 27, 2022
eva/storage/sqlite_storage_engine.py (resolved)
eva/storage/sqlite_storage_engine.py (resolved)
    if len(data_batch) * row_size >= batch_mem_size:
        yield Batch(pd.DataFrame(data_batch))
        data_batch = []
if data_batch:
Member

We should add a pretty print function for a Batch as well
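For readers following the thread, here is a minimal sketch of the batching loop under review, together with one possible pretty-print method of the kind suggested above. The `Batch` class here is a hypothetical stand-in that only wraps a DataFrame, and `batch_rows`, its `row_size` parameter, and the `__str__` format are assumptions made for illustration, not the project's actual code.

```python
from typing import Any, Dict, Iterable, Iterator

import pandas as pd


class Batch:
    """Hypothetical stand-in for the project's Batch: wraps one DataFrame."""

    def __init__(self, frames: pd.DataFrame):
        self.frames = frames

    def __len__(self) -> int:
        return len(self.frames)

    def __str__(self) -> str:
        # One possible pretty-print format: a row count, then the table.
        return f"Batch({len(self)} rows)\n{self.frames.to_string(index=False)}"


def batch_rows(
    rows: Iterable[Dict[str, Any]], batch_mem_size: int, row_size: int
) -> Iterator[Batch]:
    """Group rows into batches of roughly batch_mem_size bytes.

    row_size is the caller's estimate of bytes per row (an assumption here).
    """
    data_batch = []
    for row in rows:
        data_batch.append(row)
        if len(data_batch) * row_size >= batch_mem_size:
            yield Batch(pd.DataFrame(data_batch))
            data_batch = []
    # Flush the final, partially filled batch so no rows are dropped.
    if data_batch:
        yield Batch(pd.DataFrame(data_batch))


rows = [{"id": i, "label": f"frame_{i}"} for i in range(5)]
batches = list(batch_rows(rows, batch_mem_size=20, row_size=10))
for batch in batches:
    print(batch)
```

The trailing `if data_batch:` check matters: without it, any rows left over after the last full batch would never be yielded.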

eva/catalog/schema_utils.py (resolved)
eva/storage/sqlite_storage_engine.py (outdated, resolved)
eva/storage/sqlite_storage_engine.py (resolved)
eva/storage/sqlite_storage_engine.py (resolved)
for col in columns:
    if col.type == ColumnType.NDARRAY:
        dict_row[col.name] = self._serializer.serialize(dict_row[col.name])
    elif isinstance(dict_row[col.name], (np.generic,)):
Collaborator

What is the ColumnType for this case? It seems that there are only TEXT, INTEGER and FLOAT left. In which case will the type be np.generic?

Collaborator

Ah, I see. np.generic is the base class for all NumPy scalar types, and tolist() is a general way to convert them to Python types (https://stackoverflow.com/a/53067954). Maybe update the comment; I was initially confused by the tolist().
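To make the point about np.generic concrete, here is a small sketch of the conversion step. The helper name `to_sqlite_value` is hypothetical, and `np.save` into an in-memory buffer is only an assumed stand-in for whatever the project's `_serializer.serialize` does.

```python
import io

import numpy as np


def to_sqlite_value(value):
    """Convert one cell into a value a SQLite driver can bind (sketch)."""
    if isinstance(value, np.ndarray):
        # Assumed serializer: store the array as .npy bytes (a BLOB).
        buffer = io.BytesIO()
        np.save(buffer, value, allow_pickle=False)
        return buffer.getvalue()
    if isinstance(value, np.generic):
        # np.generic covers NumPy scalars such as np.int64 or np.float32;
        # tolist() on a scalar returns the matching plain Python object.
        return value.tolist()
    return value


row = {"id": np.int64(7), "score": np.float32(1.5), "data": np.arange(4)}
clean_row = {name: to_sqlite_value(value) for name, value in row.items()}
print({name: type(value).__name__ for name, value in clean_row.items()})
```

Note that an `np.ndarray` is not an instance of `np.generic`, so the two branches do not overlap regardless of their order.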

        dict_row[col.name] = sql_row[idx]
    return dict_row

def create(self, table: DataFrameMetadata, **kwargs):
Collaborator

What happens if the table already exists? This raises an issue that I did not notice before. I was using if_not_exists in the load executor. In the mat executor, we can use handle_if_not_exists because the catalog entry is created in the mat executor. But this is not the case for the load operator: the catalog entry is created before the load executor runs. I think we should choose one approach for both of them, and update the mat and load designs and this create accordingly (e.g., document what happens when the table exists, remove **kwargs, and add a call to check whether the table exists, or can we use read to do it?).

Member Author

Can you raise it as an issue? We can fix it later, as I want to get rid of PySpark ASAP.
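As a starting point for that follow-up issue, one way to give create a single, documented behavior for existing tables might look like the sketch below. It uses the standard-library `sqlite3` module directly rather than SQLAlchemy, and `table_exists`, `create_table`, and the `if_not_exists` flag are hypothetical names, not the project's API.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Check SQLite's own catalog for a table with this name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def create_table(
    conn: sqlite3.Connection,
    name: str,
    columns_sql: str,
    if_not_exists: bool = False,
) -> bool:
    """Create the table; return True if created, False if it was skipped.

    Raises ValueError when the table exists and if_not_exists is False.
    The name and columns_sql are assumed to come from trusted catalog data,
    since they are interpolated into the statement.
    """
    if table_exists(conn, name):
        if if_not_exists:
            return False
        raise ValueError(f"table {name!r} already exists")
    conn.execute(f'CREATE TABLE "{name}" ({columns_sql})')
    return True


conn = sqlite3.connect(":memory:")
created = create_table(conn, "frames", "id INTEGER, data BLOB")
skipped = create_table(conn, "frames", "id INTEGER, data BLOB", if_not_exists=True)
print(created, skipped)
```

With an explicit flag and an explicit existence check, the load and mat paths could share the same call instead of passing the choice through `**kwargs`.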

@gaurav274 merged commit b843a7b into master on Nov 1, 2022
@gaurav274 deleted the remove_spark branch on November 1, 2022 at 13:36
4 participants