
Bug: internal state of Dataset changes after Map operation #861

Closed
andy-esch opened this issue Jul 22, 2019 · 6 comments · Fixed by #902

@andy-esch
Contributor

After using a Dataset in the following workflow, the Dataset seems to have forgotten its table name:

# create dataset and download it -- works as expected
ds = Dataset('tablename')
df = ds.download()

# create map - works as expected
Map(color_category_layer(ds, 'first_name'))

# download again
df2 = ds.download()
df2 is None # evaluates to True

This is using the latest version of develop.

@simon-contreras-deel
Contributor

simon-contreras-deel commented Aug 5, 2019

This is a simplified case:

ds = Dataset('table')
ds.download()  # returns a Dataframe
ds.download()  # returns None

Why?

  1. Dataset('table') creates a TableDataset:
ds = Dataset('table')
  2. The first download changes it to a DataFrameDataset:
ds.download()
  3. The next download runs over a DataFrameDataset, which "has nothing to download" (the data is already local), which is why it returns None.

Of course, I understand the problem. Following a functional approach (pure functions), both calls on the same object should return the same result.
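The state swap above can be sketched as a toy class (illustrative names only, not the real cartoframes code):

```python
# Toy sketch of why the second download() returns None:
# download() mutates the object's internal state.

class TableState:
    def download(self, ds):
        ds.state = DataFrameState()   # state swap: Table -> DataFrame
        return "dataframe"            # stand-in for the real DataFrame

class DataFrameState:
    def download(self, ds):
        return None                   # "nothing to download": data is already local

class Dataset:
    def __init__(self):
        self.state = TableState()

    def download(self):
        return self.state.download(self)

ds = Dataset()
print(ds.download())  # 'dataframe'
print(ds.download())  # None -- the surprising second result
```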

Proposal

I have 2 approaches for the result of the second download:

  1. Return None and log a message: The data is already local. Are you trying to get the data? Use 'dataset.dataframe' instead of
  2. Return the dataframe (with or without the same message as in option 1)

What do you think? @cmongut @andy-esch

I would go with the second one without a message.

@simon-contreras-deel
Contributor

simon-contreras-deel commented Aug 5, 2019

Something similar happens with the upload method when working with a query.

ds = Dataset('SELECT * FROM my_table WHERE ...')
ds.upload(table_name='my_table')  # creating table 'my_table' from query
ds.upload(table_name='my_table')  # raises error: 'Nothing to upload. Dataset needs a DataFrame, a GeoDataFrame or a query to upload data to CARTO.'

In the first upload we move from a QueryDataset to a TableDataset. The second upload, now over a TableDataset, raises that error.

In this case, it is harder to detect the situation and give the user better advice or better behavior.

I think this is a special and fairly unusual use case, but maybe we should take it into account.

@cmongut
Contributor

cmongut commented Aug 5, 2019

Let's wait for @andy-esch's opinion, but here is mine.

When calling ds.download(), ds shouldn't change its type. This means that the next time I do ds.download(), it would download the dataset again.

@simon-contreras-deel
Contributor

simon-contreras-deel commented Aug 6, 2019

A bit of context about the state of the Dataset class.

In the beginning, we wanted to solve every situation with the same class. But we realized that was a real pain, full of corner cases, so in the end we only supported 2 specific "state" changes:

  • From table and query: download creates a "from dataframe" object. Why? We want CARTOframes to be local-first: work with local data when possible (mainly because it is faster and you don't need to go over the internet). For example, when creating a map: if the dataset has both the table name and the local data, which should it use? The local data.
  • From query: upload creates a "from table" object. Why? It is easier, faster and simpler to cache when working with a table than with a subset of a table (or several tables) from a query. For example, creating a map from the query select * from my_table where xxx > 50: the queries against CARTO will be faster if we create a new table with the values bigger than 50 than if we use the full table with the WHERE.
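The two supported transitions can be summarized as a small lookup table (a hypothetical sketch; the state names are illustrative, not the real cartoframes internals):

```python
# Hypothetical sketch of the two supported state transitions described above.
# Any (state, operation) pair not listed keeps its current state.

TRANSITIONS = {
    ("TableDataset", "download"): "DataFrameDataset",  # local-first after download
    ("QueryDataset", "download"): "DataFrameDataset",
    ("QueryDataset", "upload"): "TableDataset",        # materialize the query as a table
}

def next_state(state, operation):
    return TRANSITIONS.get((state, operation), state)

print(next_state("TableDataset", "download"))  # DataFrameDataset
print(next_state("QueryDataset", "upload"))    # TableDataset
```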

download case:

For me, downloading again only makes sense in this case: you have already downloaded a table and then the table receives some changes (inserts, updates or deletes); you know that and you want to download it again. In any other case, it makes no sense. And following our approach to the "dataset state", it could (or should?) be solved by creating a new Dataset.

@andy-esch
Contributor Author

andy-esch commented Aug 6, 2019

I like proposal 2 as well. One main reason is that the table could change between operations and the user wants to fetch the latest version.

In the first upload we are moving from QueryDataset into a TableDataset. The second upload over a TableDataset generates the raise.

I vote for ds.upload(table_name='tablename') to keep the state of ds the same (QueryDataset) and return a TableDataset('tablename').

By the way, I would expect ds.upload with an already created table to return an error that the table already exists.

Overall, I thought we decided a while back to make the Dataset objects immutable.

With this approach, the Dataset instances don't change but operations on them return new class instances.

ds_table = Dataset('tablename')
# download the table dataset into a DataframeDataset
ds_dataframe = ds_table.download()

ds_query = Dataset('select * from table where x > 50')
# 'uploading' a query to create a table returns a TableDataset
ds_table_from_query = ds_query.upload(table_name='tablename')

@simon-contreras-deel
Contributor

simon-contreras-deel commented Aug 9, 2019

By the way, I would expect ds.upload with an already created table to return an error that the table already exists.

That is what we do right now, unless you pass if_exist=REPLACE or if_exist=APPEND.

Overall, I thought we decided a while back to make the Dataset objects immutable.

I see you both think the 2 exceptions are a bad solution, and I agree.

Now I see what I explained in the previous comment (the reasons for having the 2 exceptions) as the user's responsibility or decision. And probably, seen this way, the Dataset class becomes easier to understand.

Example: dataframe - download

ds_remote = Dataset('table')  # or Dataset('select * from my_table')
df = ds_remote.download()

ds_local = Dataset(df)
ds_local.upload(table_name='my_new_table')

ds_remote2 = Dataset('my_new_table')

Example: query - upload

ds_query = Dataset('select * from my_table')
ds_query.upload(table_name='my_new_table')

ds_table = Dataset('my_new_table')

operations on them return new class instances

It is another possibility, but I am not sure about it. For example, in the download case, I think we should return a DataFrame, because that is the object people want to work with.
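A minimal sketch of that behavior: download() always fetches and returns the data, and the Dataset itself never mutates (illustrative code, not the real cartoframes implementation; `_fetch` is a hypothetical stand-in for the request to CARTO):

```python
class Dataset:
    """Sketch: an immutable Dataset whose download() always fetches."""

    def __init__(self, table_name):
        self.table_name = table_name   # fixed at construction, never mutated

    def _fetch(self):
        # hypothetical stand-in for the HTTP request to CARTO;
        # a real implementation would return a pandas DataFrame
        return [{"first_name": "ada"}, {"first_name": "grace"}]

    def download(self):
        # no internal state change: every call re-fetches the data
        return self._fetch()

ds = Dataset('tablename')
df1 = ds.download()
df2 = ds.download()
print(df2 is not None)  # True -- the second download still returns data
```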
