
Bug: internal state of Dataset changes after Map operation #861

Closed
andy-esch opened this issue Jul 22, 2019 · 6 comments · Fixed by #902

@andy-esch
Contributor

After using a Dataset in the following workflow, the Dataset seems to have forgotten its table name:

# create dataset and download it -- works as expected
ds = Dataset('tablename')
df = ds.download()

# create map - works as expected
Map(color_category_layer(ds, 'first_name'))

# download again
df2 = ds.download()
df2 is None # evaluates to True

This is using the latest version of develop.

@simon-contreras-deel
Contributor

simon-contreras-deel commented Aug 5, 2019

This is a simplified case:

ds = Dataset('table')
ds.download()  # returns a Dataframe
ds.download()  # returns None

Why?

  1. Dataset('table') creates a TableDataset:
ds = Dataset('table')
  2. The first download changes it to a DataFrameDataset:
ds.download()
  3. The next download runs over a DataFrameDataset, which "has nothing to download" (the data is already local), which is why it returns None.

Of course, I understand the problem. Following a functional approach (pure functions), both calls on the same object should return the same result.
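The state swap above can be sketched as a toy class (illustrative names only, not the real cartoframes code):

```python
# Toy sketch of why the second download() returns None:
# download() mutates the object's internal state.

class TableState:
    def download(self, ds):
        ds.state = DataFrameState()   # state swap: Table -> DataFrame
        return "dataframe"            # stand-in for the real DataFrame

class DataFrameState:
    def download(self, ds):
        return None                   # "nothing to download": data is already local

class Dataset:
    def __init__(self):
        self.state = TableState()

    def download(self):
        return self.state.download(self)

ds = Dataset()
print(ds.download())  # 'dataframe'
print(ds.download())  # None -- the surprising second result
```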

Proposal

I have 2 approaches for the result of the second download:

  1. Return None and log a message: The data is already local. Are you trying to get the data? Use 'dataset.dataframe' instead of
  2. Return the dataframe (with or without the same message as in option 1)

What do you think? @cmongut @andy-esch

I would go with the second one without a message.

@simon-contreras-deel
Contributor

simon-contreras-deel commented Aug 5, 2019

Something similar happens with the upload method when working with a query.

ds = Dataset('SELECT * FROM my_table WHERE ...')
ds.upload(table_name='my_table')  # creating table 'my_table' from query
ds.upload(table_name='my_table')  # raises error: 'Nothing to upload. Dataset needs a DataFrame, a GeoDataFrame or a query to upload data to CARTO.'

In the first upload we move from a QueryDataset to a TableDataset. The second upload, now over a TableDataset, raises that error.

In this case, it is harder to detect the situation and give the user better advice or better behavior.

I think this is a special and fairly unusual use case, but maybe we should take it into account.

@cmongut
Contributor

cmongut commented Aug 5, 2019

Let's wait for @andy-esch's opinion, but here is mine.

When calling ds.download(), ds shouldn't change its type. This means that the next time I do ds.download(), it would download the dataset again.

@simon-contreras-deel
Contributor

simon-contreras-deel commented Aug 6, 2019

A bit of context about the state of the Dataset class.

In the beginning, we wanted to solve every situation with the same class. But we realized that was a real pain, full of corner cases, so in the end we only supported 2 specific "state" changes:

  • From table and query: download creates a "from dataframe" object. Why? We want CARTOframes to be local-first: work with local data when possible (mainly because it is faster and you don't need to go over the internet). For example, when creating a map: if the dataset has both the table name and the local data, which should it use? The local data.
  • From query: upload creates a "from table" object. Why? It is easier, faster and simpler to cache when working with a table than with a subset of a table (or several tables) from a query. For example, creating a map from the query select * from my_table where xxx > 50: the queries against CARTO will be faster if we create a new table with the values bigger than 50 than if we use the full table with the WHERE.
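The two supported transitions can be summarized as a small lookup table (a hypothetical sketch; the state names are illustrative, not the real cartoframes internals):

```python
# Hypothetical sketch of the two supported state transitions described above.
# Any (state, operation) pair not listed keeps its current state.

TRANSITIONS = {
    ("TableDataset", "download"): "DataFrameDataset",  # local-first after download
    ("QueryDataset", "download"): "DataFrameDataset",
    ("QueryDataset", "upload"): "TableDataset",        # materialize the query as a table
}

def next_state(state, operation):
    return TRANSITIONS.get((state, operation), state)

print(next_state("TableDataset", "download"))  # DataFrameDataset
print(next_state("QueryDataset", "upload"))    # TableDataset
```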

download case:

For me, downloading again only makes sense in this case: you have already downloaded a table and then the table receives some changes (inserts, updates or deletes); you know that and you want to download it again. In any other case, it makes no sense. And following our approach to the "dataset state", it could (or should?) be solved by creating a new Dataset.

@andy-esch
Contributor Author

andy-esch commented Aug 6, 2019

I like proposal 2 as well. One main reason is that the table could change between operations and the user wants to fetch the latest version.

In the first upload we are moving from QueryDataset into a TableDataset. The second upload over a TableDataset generates the raise.

I vote for ds.upload(table_name='tablename') to keep the state of ds the same (QueryDataset) and return a TableDataset('tablename').

By the way, I would expect ds.upload with an already created table to return an error that the table already exists.

Overall, I thought we decided a while back to make the Dataset objects immutable.

With this approach, the Dataset instances don't change but operations on them return new class instances.

ds_table = Dataset('tablename')
# download the table dataset into a DataframeDataset
ds_dataframe = ds_table.download()

ds_query = Dataset('select * from table where x > 50')
# 'uploading' a query to create a table returns a TableDataset
ds_table_from_query = ds_query.upload(table_name='tablename')

@simon-contreras-deel
Contributor

simon-contreras-deel commented Aug 9, 2019

By the way, I would expect ds.upload with an already created table to return an error that the table already exists.

That is what we do right now, unless you pass if_exist=REPLACE or if_exist=APPEND.

Overall, I thought we decided a while back to make the Dataset objects immutable.

I see you both think the 2 exceptions are a bad solution, and I agree.

Now I see what I explained in the previous comment (the reasons for having the 2 exceptions) as the user's responsibility or decision. And probably, seen this way, the Dataset class becomes easier to understand.

Example: dataframe - download

ds_remote = Dataset('table')  # or Dataset('select * from my_table')
df = ds_remote.download()

ds_local = Dataset(df)
ds_local.upload(table_name='my_new_table')

ds_remote2 = Dataset('my_new_table')

Example: query - upload

ds_query = Dataset('select * from my_table')
ds_query.upload(table_name='my_new_table')

ds_table = Dataset('my_new_table')

operations on them return new class instances

It is another possibility, but I am not sure about it. For example, in the download case, I think we should return a DataFrame, because that is the object people want to work with.
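A minimal sketch of that behavior: download() always fetches and returns the data, and the Dataset itself never mutates (illustrative code, not the real cartoframes implementation; `_fetch` is a hypothetical stand-in for the request to CARTO):

```python
class Dataset:
    """Sketch: an immutable Dataset whose download() always fetches."""

    def __init__(self, table_name):
        self.table_name = table_name   # fixed at construction, never mutated

    def _fetch(self):
        # hypothetical stand-in for the HTTP request to CARTO;
        # a real implementation would return a pandas DataFrame
        return [{"first_name": "ada"}, {"first_name": "grace"}]

    def download(self):
        # no internal state change: every call re-fetches the data
        return self._fetch()

ds = Dataset('tablename')
df1 = ds.download()
df2 = ds.download()
print(df2 is not None)  # True -- the second download still returns data
```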
