Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confusion caused by parameter + multiple execution of global section. #1219

Closed
BoPeng opened this issue Feb 22, 2019 · 19 comments
Closed

Confusion caused by parameter + multiple execution of global section. #1219

BoPeng opened this issue Feb 22, 2019 · 19 comments

Comments

@BoPeng
Copy link
Contributor

BoPeng commented Feb 22, 2019

A user reported a case in gitter which boils down to

[global]
parameter: A=[1, 2]
parameter: B=[]
B.extend(A)

[default]
print(B)
input: for_each=dict(i=range(2))
print(B)

and the output is

[1, 2]
[1, 2, 1, 2]
[1, 2, 1, 2]

We should investigate the behavior of parameter in global section more closely.

@gaow
Copy link
Member

gaow commented Feb 22, 2019

I think I've raised the concern a few times in the past that global section got executed multiple times. It is expected that global section should only be executed once. But I recall you gave some good reasons why it executes multiple times ... clearly this now causes troubles. Maybe let's revisit those reasons?

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 22, 2019

The easiest example is

[global]
import time

[1]
input: for_each=dict(i=range(2))
time.sleep(0)

because time module is not pickleable and the substep would not get time unless the global section is executed by the substep each time.

Cannot recall immediately the reason for ignoring parameter in some cases though.

@gaow
Copy link
Member

gaow commented Feb 22, 2019

I see, now I recall the import this exception. But it looks this is an exception we can possibly hack around? Are there other exceptions?

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 22, 2019

Yes, we can hack and separate import statements, but things can get complicated. this document shows what can and what cannot be pickled.

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 22, 2019

For example

import some_class from some_module
a = some_class()

and a might not be pickleable.

We can enforce the pickleable requirement for global section though. Basically, we can say, other than module, which we can import separately, everything else needs to be pickleable....

@gaow
Copy link
Member

gaow commented Feb 22, 2019

Basically, we can say, other than module, which we can import separately, everything else needs to be pickleable....

This is a more acceptable behavior than the example above that B get accumulated. I do not see it very much limiting.

For global objects pickled, are they shared to substeps? In other words each substep will use the environment loaded by the step rather than loading and unpickling saved object on their own? If objects are shared, are they read-only?

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 22, 2019

Right now we send global statement to each substep and execute their, and there is also a background variable obtained from the step level...

Let me revisit this after I completes #1218

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 24, 2019

Use the following to test the pickleability of things:

import pickle

stmt = '''
from time import sleep

def a():
    print('hi')
'''

bundle = {}

exec(stmt, bundle. bundle)

print(bundle.keys())

pickle.dumps(bundle)

So

  1. import time is not, but can be easily separated.
  2. from time import time is ok, but other objects might not.
  3. def a() cannot, and this can be a problem.
  4. class () cannot.

@gaow
Copy link
Member

gaow commented Feb 25, 2019

def and class can be huge limitations ...

Can we then instead only pickle parameters and variables, and run the rest of the code -- I guess this is not easy to do?

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 25, 2019

We can do something complicated but I am not sure how reliable it can be... Basically, we can parse the global section and

  1. Pick out import, def, and class from the statement.
  2. Run these statements and get a dictionary A
  3. Run the entire global section and get another dictionary B
  4. Remove the elements in A from B, and we require the rest of the stuff to be pickleable.

Now the global section is separated into import/def/class statements and a pickleable dictionary. Now for each place when global statement is needed, we

  1. Execute the imports and definitions.
  2. Restore from the pickled dictionary.

This should work most of the time but

  1. It assumes the order of imports and definitions do not matter, which I am sure is not sure for some corner cases.
  2. It will prevent users from the creation of non-pickleable objects using something like
A = function_call()

but I guess we can tolerate such limitations.

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 25, 2019

Not sure if you are still using it but I will have to remove the %include and %from magic from the syntax because it includes global section from another script to the current one and makes the handling of this ticket even more troublesome.

@gaow
Copy link
Member

gaow commented Feb 25, 2019

I do not use these any more. I think you managed to convinced me to stop using these at some point.

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 25, 2019

Yes, I think I was under the influence of snakemake which can manage multiple files, but frowned upon the complexity and the implementation and especially the using of these features, especially because they are not in line with our design philosophy ...

I am removing them now.

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 26, 2019

Done. This also resolves #1155 (execution of global statement on remote host) because now the global statement will only be executed once (locally) and be pickled to all steps, substeps, and tasks.

@BoPeng BoPeng closed this as completed Feb 26, 2019
@gaow
Copy link
Member

gaow commented Feb 26, 2019

Great! Since it is pickled, how does each substep use this pickled data? Do they each load it on their own, or the step loads it and copies to all substeps -- by reference or by value?

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 26, 2019

As far as I can tell, each substep will load it because everything is mixed up on the same worker. This could be improved but right now correctness is more important.

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 26, 2019

The trunk is not ready as there are still randomly failed tests and I can now reproduce the DAG problem but I believe the key parts are done.

@gaow
Copy link
Member

gaow commented Feb 26, 2019

I can now reproduce the DAG problem

This can be reproduced reliably now on your end with some example? I do not think I get it every time, at least I did not see it in versions before 0.18.7.

@BoPeng
Copy link
Contributor Author

BoPeng commented Feb 26, 2019

No. Only with the complicated example you have with 20+k jobs and many nested workflows. I can possibly try to trim down the workflow to create a test case though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants