Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transfer Excel data I/O to ixmp #321

Merged
merged 5 commits into from
Apr 1, 2020
Merged

Conversation

khaeru
Copy link
Member

@khaeru khaeru commented Mar 21, 2020

This PR removes the message_ix.Scenario.to_excel() and .read_excel() methods. iiasa/ixmp#286 iiasa/ixmp#289 adds these to ixmp.Scenario, so they are available for non-MESSAGE scenarios.

It also incorporates test improvements from #265.

How to review

PR checklist

  • Tests added.
  • Documentation added.
  • Release notes updated.

@khaeru khaeru added the enh New features & functionality label Mar 21, 2020
@khaeru khaeru added this to the 3.0 milestone Mar 21, 2020
@khaeru khaeru self-assigned this Mar 21, 2020
@codecov
Copy link

codecov bot commented Mar 26, 2020

Codecov Report

Merging #321 into master will increase coverage by 0.56%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #321      +/-   ##
==========================================
+ Coverage   88.06%   88.63%   +0.56%     
==========================================
  Files          29       29              
  Lines        2581     2543      -38     
==========================================
- Hits         2273     2254      -19     
+ Misses        308      289      -19     
Impacted Files Coverage Δ
message_ix/core.py 95.65% <100.00%> (+3.63%) ⬆️
message_ix/tests/test_core.py 100.00% <100.00%> (ø)
message_ix/tests/test_integration.py 100.00% <0.00%> (+15.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 326b6bb...0376627. Read the comment docs.

@behnam-zakeri
Copy link
Contributor

behnam-zakeri commented Mar 27, 2020

Thanks a lot @khaeru for cleaning up here and moving this to ixmp.
I tested this with the existing test (core.test_excel_read_write) and it passes the test on my machine as is. So, now it does initialize if a set or parameter is in Excel but not in the scenario default items. That's great.
However, I believe the behaviour of the functionality has changed a bit. Because:
1- The method to_excel() now is trying to save output variables of a scenario as well. Previously to_excel() only saved sets and parameters. This was enough for a user to read the data back and solve the scenario.
2- Due to this change in behaviour, the functionality fails if a scenario has a solution. Because there is a variable DEMAND in message_ix, which has exact the same name but in upper case as parameter demand. This makes pandas xlsxwriter raising the following error:

  File "C:\Users\...\anaconda3\lib\site-packages\xlsxwriter\workbook.py", line 805, in _check_sheetname)

DuplicateWorksheetName: Sheetname 'DEMAND', with case ignored, is already in use.

3- I personally think having variables written in Excel is not needed while it can be useful sometimes, especially for small scenarios. Because for scenarios with large data, there is already a memory burden in writing all data in Excel. I assume with many variables added to this, it becomes more difficult.
4- Previously, only those sets and parameters were written in Excel, which were not empty. Now it seems all items are written, making many empty sheets, which makes it difficult for the user to browse.
5- Suggestion: Usually for the user it's difficult to browse the sheets in Excel, even when there is data for the sheets, as there are many items and not alphabetically ordered. I understand this is because this comes directly from Scenario.set_list() and Scenario.par_list(), which are not ordered alphabetically. If possible, it's good to have an alphabetical order for sets and parameters separately. I mean the Excel sheets start from the left hand side with sets but orderd, and then ordered parameters.
Todo:

  • a. I'll add a line to the test to check the functionality when a scenario has a solution (to check No. 2 above).

  • b. It's good to ignore writing those items that are empty (No. 4 above). Is this easy for you to add or I can do that?

c. Any feedback welcome for No. 3 and 5 above.

@khaeru
Copy link
Member Author

khaeru commented Mar 28, 2020

Thanks for these detailed comments, and for catching the issue with demand/DEMAND. I think they help to clarify what our requirements are here. Picking out just a couple points:

  • Previously to_excel() only saved sets and parameters. This was enough for a user to read the data back and solve the scenario.

    This requirement (call it R1) is “the Excel file contains data necessary to solve a Scenario.”

  • Now it seems all items are written, making many empty sheets, which makes it difficult for the user to browse.

    This requirement (R2) is “the Excel file should be easy for the user to browse,” e.g. without having to skip sheets which are empty, or to hunt through sheets which are out of order.

  • Since these weren't written anywhere, I guessed the requirement was (R3) “the Excel file contains [all] the data of a Scenario,” i.e. it's sufficient to make s2.read_excel(...) identical to s1.to_excel(). The difference gives rise to some other things you point out, e.g.:

    • R1 was previously met while omitting variables and equations, i.e. not meeting R3.
    • R2 was previously met by omitting empty sets and parameters, which conflicts with R3: without writing info for empty sets/parameters, they can't be initialized, so the Scenario as-read cannot be identical.

As @zikolach pointed out (see iiasa/ixmp#292 (comment)), we can imagine multiple ways to change the format. But I don't think that's very urgent and—as I commented on that issue—people have existing Excel files sitting around that they will want to continue to use. So I would like to minimize disruption.

Here's what I'd propose. All these changes will go in ixmp:

  1. Change to_excel() to write only sets and parameters by default, with an option to write variables and equations. This should address @behnam-zakeri's points (1), (2), and (3).

  2. As you suggest, change the order of sheets to the following groups. Sheets within each group will be in alphabetical order.
    a. All sets with data.
    b. All parameters with data.
    c. All sets without data.
    d. All parameters without data.
    e. Equations (if the option is selected).
    f. Variables (if the option is selected).

    This helps with R2 because the user can focus on a and b. It shouldn't make the to_excel() code much more complicated, and requires no change to the read_excel() code.

Some other reactions:

  • Because for scenarios with large data, there is already a memory burden in writing all data in Excel. I assume with many variables added to this, it becomes more difficult.

    Can you say how this memory burden manifests? Is it just the sheet rows limit, or something else? Ideally, we should add a test of a MESSAGEix-GLOBIOM-sized Scenario (i.e. full of random data, but the same number of elements in parameters) and test that the code works; kind of similar to this code.

  • there is a variable DEMAND in message_ix, which has exact the same name but in upper case as parameter demand. This makes pandas xlsxwriter raising the following error

    This is interesting: on my system I think openpyxl is used, not xlsxwriter. I wonder if this is a difference in the two packages, or a limitation of the Excel format itself. Will investigate.

@khaeru
Copy link
Member Author

khaeru commented Mar 31, 2020

On my system I think openpyxl is used, not xlsxwriter. I wonder if this is a difference in the two packages, or a limitation of the Excel format itself. Will investigate.

for engine in 'openpyxl', 'xlsxwriter': 
    xw = pd.ExcelWriter(f'{engine}.xlsx', engine=engine) 
    df = pd.DataFrame()
    df.to_excel(xw, sheet_name='foo') 
    df.to_excel(xw, sheet_name='FOO') 
    xw.close() 

What I observe:

  • With engine='openpyxl', the code completes successfully, but the file has two sheets named 'foo', and 'FOO1', i.e. the '1' is automatically appended.
  • With engine='xlsxwriter', the exception reported by @behnam-zakeri is reported.

So this is likely a limitation of the Excel format per se, and openpyxl contains a workaround that would, in any case, defeat our code.

@khaeru
Copy link
Member Author

khaeru commented Mar 31, 2020

In order to keep moving, I went ahead and implemented the suggestion in the above comment as iiasa/ixmp#297. @behnam-zakeri @Jihoon can either of you confirm that the combination of these two branches works as desired?

@khaeru
Copy link
Member Author

khaeru commented Apr 1, 2020

As with iiasa/ixmp#297, will merge this without review in order to clear the ground for hackathon PRs. If needed, we can make another to adjust, later.

@khaeru khaeru merged commit 76d2775 into iiasa:master Apr 1, 2020
@khaeru khaeru deleted the feature/remove-xlsx-io branch April 1, 2020 13:31
@behnam-zakeri
Copy link
Contributor

behnam-zakeri commented Apr 7, 2020

Thanks @khaeru for taking this forward. The solutions you suggested are excellent, for making sets and parameters default data to be written, and for ordering the sheets. And interesting test for Excel engine you illustrated.
related to your question:

Can you say how this memory burden manifests?

I meant memory burden for writing many rather large parameters to multiple Excel sheets, and not particularly exceeding row limits.
To put an end to this, I suggest we also resolve the issue of parameters exceeding the row limits of Excel. My proposal is to write such parameters in two or more sheets, like: land_output(1), land_output(2), etc. And for reading back the data to ixmp, it should be quite straightforward to drop these parentheses and add several dataframes one after each other. I can make an issue and slightly modify the code in to_excel() and read_excel() for this proposal if makes sense.

@khaeru
Copy link
Member Author

khaeru commented Apr 7, 2020

iiasa/ixmp#292 is the follow-up issue for modifying the Excel file format to extend past the ~10⁶ row limit. It would be great if you wanted to tackle this, but no pressure. If so, please create an ixmp (here is message_ix) branch named e.g. issue/292.

Also note that @volker-krey supplied example code in iiasa/ixmp#300 that addresses the same limitation by establishing a new, zipped-CSV format, which might be rolled into the same PR.

@behnam-zakeri
Copy link
Contributor

Thanks @khaeru for the instructions. I'll follow this on ixmp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enh New features & functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants