Transfer Excel data I/O to ixmp #321

khaeru · 2020-03-21T12:10:06Z

This PR removes the message_ix.Scenario.to_excel() and .read_excel() methods. ~~iiasa/ixmp#286~~ iiasa/ixmp#289 adds these to ixmp.Scenario, so they are available for non-MESSAGE scenarios.

It also incorporates test improvements from #265.

How to review

~~Wait for Transfer Excel data I/O from message_ix ixmp#289 to merge, or check out its branch.~~
Run tests against the updated ixmp code.
Try with any old Excel exports one has lying around.

PR checklist

Tests added.
Documentation added.
Release notes updated.

codecov · 2020-03-26T11:02:29Z

Codecov Report

Merging #321 into master will increase coverage by 0.56%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #321      +/-   ##
==========================================
+ Coverage   88.06%   88.63%   +0.56%     
==========================================
  Files          29       29              
  Lines        2581     2543      -38     
==========================================
- Hits         2273     2254      -19     
+ Misses        308      289      -19

Impacted Files	Coverage Δ
message_ix/core.py	`95.65% <100.00%> (+3.63%)`	⬆️
message_ix/tests/test_core.py	`100.00% <100.00%> (ø)`
message_ix/tests/test_integration.py	`100.00% <0.00%> (+15.15%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 326b6bb...0376627. Read the comment docs.

behnam-zakeri · 2020-03-27T15:00:06Z

Thanks a lot @khaeru for cleaning up here and moving this to ixmp.
I tested this with the existing test (core.test_excel_read_write) and it passes the test on my machine as is. So, now it does initialize if a set or parameter is in Excel but not in the scenario default items. That's great.
However, I believe the behaviour of the functionality has changed a bit. Because:
1- The method to_excel() now is trying to save output variables of a scenario as well. Previously to_excel() only saved sets and parameters. This was enough for a user to read the data back and solve the scenario.
2- Due to this change in behaviour, the functionality fails if a scenario has a solution. Because there is a variable DEMAND in message_ix, which has exact the same name but in upper case as parameter demand. This makes pandas xlsxwriter raising the following error:

  File "C:\Users\...\anaconda3\lib\site-packages\xlsxwriter\workbook.py", line 805, in _check_sheetname)

DuplicateWorksheetName: Sheetname 'DEMAND', with case ignored, is already in use.

3- I personally think having variables written in Excel is not needed while it can be useful sometimes, especially for small scenarios. Because for scenarios with large data, there is already a memory burden in writing all data in Excel. I assume with many variables added to this, it becomes more difficult.
4- Previously, only those sets and parameters were written in Excel, which were not empty. Now it seems all items are written, making many empty sheets, which makes it difficult for the user to browse.
5- Suggestion: Usually for the user it's difficult to browse the sheets in Excel, even when there is data for the sheets, as there are many items and not alphabetically ordered. I understand this is because this comes directly from Scenario.set_list() and Scenario.par_list(), which are not ordered alphabetically. If possible, it's good to have an alphabetical order for sets and parameters separately. I mean the Excel sheets start from the left hand side with sets but orderd, and then ordered parameters.
Todo:

a. I'll add a line to the test to check the functionality when a scenario has a solution (to check No. 2 above).
b. It's good to ignore writing those items that are empty (No. 4 above). Is this easy for you to add or I can do that?

c. Any feedback welcome for No. 3 and 5 above.

khaeru · 2020-03-28T11:27:23Z

Thanks for these detailed comments, and for catching the issue with demand/DEMAND. I think they help to clarify what our requirements are here. Picking out just a couple points:

Previously to_excel() only saved sets and parameters. This was enough for a user to read the data back and solve the scenario.

This requirement (call it R1) is “the Excel file contains data necessary to solve a Scenario.”
Now it seems all items are written, making many empty sheets, which makes it difficult for the user to browse.

This requirement (R2) is “the Excel file should be easy for the user to browse,” e.g. without having to skip sheets which are empty, or to hunt through sheets which are out of order.
Since these weren't written anywhere, I guessed the requirement was (R3) “the Excel file contains [all] the data of a Scenario,” i.e. it's sufficient to make s2.read_excel(...) identical to s1.to_excel(). The difference gives rise to some other things you point out, e.g.:
- R1 was previously met while omitting variables and equations, i.e. not meeting R3.
- R2 was previously met by omitting empty sets and parameters, which conflicts with R3: without writing info for empty sets/parameters, they can't be initialized, so the Scenario as-read cannot be identical.

As @zikolach pointed out (see iiasa/ixmp#292 (comment)), we can imagine multiple ways to change the format. But I don't think that's very urgent and—as I commented on that issue—people have existing Excel files sitting around that they will want to continue to use. So I would like to minimize disruption.

Here's what I'd propose. All these changes will go in ixmp:

Change to_excel() to write only sets and parameters by default, with an option to write variables and equations. This should address @behnam-zakeri's points (1), (2), and (3).
As you suggest, change the order of sheets to the following groups. Sheets within each group will be in alphabetical order.
a. All sets with data.
b. All parameters with data.
c. All sets without data.
d. All parameters without data.
e. Equations (if the option is selected).
f. Variables (if the option is selected).

This helps with R2 because the user can focus on a and b. It shouldn't make the to_excel() code much more complicated, and requires no change to the read_excel() code.

Some other reactions:

Because for scenarios with large data, there is already a memory burden in writing all data in Excel. I assume with many variables added to this, it becomes more difficult.

Can you say how this memory burden manifests? Is it just the sheet rows limit, or something else? Ideally, we should add a test of a MESSAGEix-GLOBIOM-sized Scenario (i.e. full of random data, but the same number of elements in parameters) and test that the code works; kind of similar to this code.
there is a variable DEMAND in message_ix, which has exact the same name but in upper case as parameter demand. This makes pandas xlsxwriter raising the following error

This is interesting: on my system I think openpyxl is used, not xlsxwriter. I wonder if this is a difference in the two packages, or a limitation of the Excel format itself. Will investigate.

…ed hard coded number (#2)

khaeru · 2020-03-31T08:41:20Z

On my system I think openpyxl is used, not xlsxwriter. I wonder if this is a difference in the two packages, or a limitation of the Excel format itself. Will investigate.

for engine in 'openpyxl', 'xlsxwriter': 
    xw = pd.ExcelWriter(f'{engine}.xlsx', engine=engine) 
    df = pd.DataFrame()
    df.to_excel(xw, sheet_name='foo') 
    df.to_excel(xw, sheet_name='FOO') 
    xw.close()

What I observe:

With engine='openpyxl', the code completes successfully, but the file has two sheets named 'foo', and 'FOO1', i.e. the '1' is automatically appended.
With engine='xlsxwriter', the exception reported by @behnam-zakeri is reported.

So this is likely a limitation of the Excel format per se, and openpyxl contains a workaround that would, in any case, defeat our code.

khaeru · 2020-03-31T10:44:36Z

In order to keep moving, I went ahead and implemented the suggestion in the above comment as iiasa/ixmp#297. @behnam-zakeri @Jihoon can either of you confirm that the combination of these two branches works as desired?

khaeru · 2020-04-01T13:30:46Z

As with iiasa/ixmp#297, will merge this without review in order to clear the ground for hackathon PRs. If needed, we can make another to adjust, later.

behnam-zakeri · 2020-04-07T15:21:43Z

Thanks @khaeru for taking this forward. The solutions you suggested are excellent, for making sets and parameters default data to be written, and for ordering the sheets. And interesting test for Excel engine you illustrated.
related to your question:

Can you say how this memory burden manifests?

I meant memory burden for writing many rather large parameters to multiple Excel sheets, and not particularly exceeding row limits.
To put an end to this, I suggest we also resolve the issue of parameters exceeding the row limits of Excel. My proposal is to write such parameters in two or more sheets, like: land_output(1), land_output(2), etc. And for reading back the data to ixmp, it should be quite straightforward to drop these parentheses and add several dataframes one after each other. I can make an issue and slightly modify the code in to_excel() and read_excel() for this proposal if makes sense.

khaeru · 2020-04-07T15:58:44Z

iiasa/ixmp#292 is the follow-up issue for modifying the Excel file format to extend past the ~10⁶ row limit. It would be great if you wanted to tackle this, but no pressure. If so, please create an ixmp (here is message_ix) branch named e.g. issue/292.

Also note that @volker-krey supplied example code in iiasa/ixmp#300 that addresses the same limitation by establishing a new, zipped-CSV format, which might be rolled into the same PR.

behnam-zakeri · 2020-04-07T17:20:51Z

Thanks @khaeru for the instructions. I'll follow this on ixmp.

khaeru added the enh New features & functionality label Mar 21, 2020

khaeru added this to the 3.0 milestone Mar 21, 2020

khaeru self-assigned this Mar 21, 2020

This was referenced Mar 21, 2020

Transfer Excel data I/O from message_ix iiasa/ixmp#286

Closed

Transfer Excel data I/O from message_ix iiasa/ixmp#289

Merged

khaeru added 2 commits March 24, 2020 09:55

Add test improvements from iiasa#265; thanks @behnam-zakeri

d03b393

Remove Scenario.to_excel. .read_excel overrides

dd03ad0

khaeru force-pushed the feature/remove-xlsx-io branch from d75df63 to dd03ad0 Compare March 24, 2020 08:56

khaeru mentioned this pull request Mar 26, 2020

Allow existing item with matching names in io.maybe_init_item() iiasa/ixmp#294

Merged

3 tasks

Update documentation, release notes

cfef44e

khaeru force-pushed the feature/remove-xlsx-io branch from 2c5e8d7 to cfef44e Compare March 26, 2020 11:17

khaeru requested a review from behnam-zakeri March 26, 2020 11:44

Improve test_excel_read_write for scenarios with solution, also remov…

b176db0

…ed hard coded number (#2)

khaeru mentioned this pull request Mar 31, 2020

Remove import pd_read and pd_write from core.py #324

Closed

Remove extraneous os.remove() of a test file

0376627

khaeru mentioned this pull request Mar 31, 2020

Write only sets and parameters to Excel by default iiasa/ixmp#297

Merged

3 tasks

khaeru merged commit 76d2775 into iiasa:master Apr 1, 2020

khaeru deleted the feature/remove-xlsx-io branch April 1, 2020 13:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transfer Excel data I/O to ixmp #321

Transfer Excel data I/O to ixmp #321

khaeru commented Mar 21, 2020 •

edited

Loading

codecov bot commented Mar 26, 2020 •

edited

Loading

behnam-zakeri commented Mar 27, 2020 •

edited

Loading

khaeru commented Mar 28, 2020

khaeru commented Mar 31, 2020

khaeru commented Mar 31, 2020

khaeru commented Apr 1, 2020

behnam-zakeri commented Apr 7, 2020 •

edited

Loading

khaeru commented Apr 7, 2020

behnam-zakeri commented Apr 7, 2020

Transfer Excel data I/O to ixmp #321

Transfer Excel data I/O to ixmp #321

Conversation

khaeru commented Mar 21, 2020 • edited Loading

How to review

PR checklist

codecov bot commented Mar 26, 2020 • edited Loading

Codecov Report

behnam-zakeri commented Mar 27, 2020 • edited Loading

khaeru commented Mar 28, 2020

khaeru commented Mar 31, 2020

khaeru commented Mar 31, 2020

khaeru commented Apr 1, 2020

behnam-zakeri commented Apr 7, 2020 • edited Loading

khaeru commented Apr 7, 2020

behnam-zakeri commented Apr 7, 2020

khaeru commented Mar 21, 2020 •

edited

Loading

codecov bot commented Mar 26, 2020 •

edited

Loading

behnam-zakeri commented Mar 27, 2020 •

edited

Loading

behnam-zakeri commented Apr 7, 2020 •

edited

Loading