More efficient reset of global dictionary #1107

Closed
BoPeng opened this issue Dec 14, 2018 · 8 comments

Comments

@BoPeng
Contributor

BoPeng commented Dec 14, 2018

We often need "clean" dictionaries for step analysis and step execution, and we usually do

        env.sos_dict = WorkflowDict()

        env.sos_dict.set('SOS_VERSION', __version__)
        env.sos_dict.set('__step_output__', sos_targets([]))

        # load configuration files
        load_config_files(env.config['config_file'])

        SoS_exec('import os, sys, glob', None)
        SoS_exec('from sos.runtime import *', None)

        # execute global definition to get some basic setup
        try:
            SoS_exec(self.workflow.global_def)
        except Exception:
            if env.verbosity > 2:
                sys.stderr.write(get_traceback())
            raise

repeatedly.

It would be helpful to do this once, save the result to a global dictionary, and use that dictionary to populate sos_dict when needed.

The problem here is that the dictionary would be effective in only one process so it is not that useful when processes are created and destroyed quickly.

@BoPeng
Contributor Author

BoPeng commented Jan 3, 2019

Not doable because the global section can import modules etc.
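One plausible reading of this obstacle, sketched with the standard library only: a namespace that contains imported modules cannot be pickled as a whole, so it cannot simply be saved and handed to another process. The `namespace` dictionary and the `split_picklable` helper are hypothetical illustrations, not part of SoS.

```python
import os
import pickle

# Hypothetical result of a global section that defines a value and imports a module.
namespace = {'threshold': 0.5, 'os': os}


def split_picklable(ns):
    """Separate entries that pickle cleanly from those that do not."""
    picklable, rejected = {}, []
    for name, value in ns.items():
        try:
            pickle.dumps(value)
            picklable[name] = value
        except (TypeError, AttributeError, pickle.PicklingError):
            # Module objects land here: pickle refuses to serialize them.
            rejected.append(name)
    return picklable, rejected


picklable, rejected = split_picklable(namespace)
```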

@BoPeng BoPeng assigned BoPeng and unassigned gaow and BoPeng Jan 3, 2019
@BoPeng BoPeng added the wontfix label Jan 3, 2019
BoPeng pushed a commit that referenced this issue Jan 3, 2019
@gaow
Member

gaow commented Jan 3, 2019

Okay ... but it looks like some optimization can be done with the patch above -- is it the best we can do? One common scenario is that the global section has some large variables (which involve reading / parsing files on disk). It would be nice to facilitate passing them around, unless it is too costly to cache.

@BoPeng
Contributor Author

BoPeng commented Jan 3, 2019

For your particular workflow, the input: step that creates 34k groups takes a lot of time to process... I am checking it.

@BoPeng
Contributor Author

BoPeng commented Jan 3, 2019

It is surprising that

[f'{ss_data_prefix:a}/{x}/{x}.summary_stats.gz' for x in chunks]

takes 1.1s to execute and the :a part takes 1s. Basically, we have 34k os.path.abspath calls here although in theory you can pre-compute ss_data_prefix:a.
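The effect can be sketched in plain Python, with `os.path.abspath` standing in for the `:a` format specifier (an assumption about what `:a` does internally, based on the description above). The function names are hypothetical.

```python
import os


def paths_per_item(prefix, chunks):
    """Resolve the absolute prefix once per element, as the f-string with :a does."""
    return [f'{os.path.abspath(prefix)}/{x}/{x}.summary_stats.gz' for x in chunks]


def paths_hoisted(prefix, chunks):
    """Resolve the absolute prefix once, outside the comprehension."""
    abs_prefix = os.path.abspath(prefix)
    return [f'{abs_prefix}/{x}/{x}.summary_stats.gz' for x in chunks]


chunks = [f'chunk{i}' for i in range(1000)]
slow = paths_per_item('data', chunks)
fast = paths_hoisted('data', chunks)
```

Both functions return the same list; the second makes one `abspath` call instead of one per chunk.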

@gaow
Member

gaow commented Jan 3, 2019

I see ... thank you! I tend to do this a lot (not pre-computing path formats ...). Do you think there is a more principled way to do it? Also, 1.1s is not too bad compared to the total 4min delay we are talking about in #1146, right?

@BoPeng
Contributor Author

BoPeng commented Jan 3, 2019

On your machine it is about 5s for each [f'{:a}/ for x in chunks]. Not a very big deal, but it could be cut to about 1s if you do ss_data_prefix = ss_data_prefix.absolute after the parameter definition.

I think the rest of the delay comes from checking the existence of these files, building the DAG, etc.

@BoPeng
Contributor Author

BoPeng commented Jan 10, 2019

The problem is also that in many cases we evaluate the global definition along with a user-passed dictionary, which makes caching the results of the global definition more difficult because the user-passed dictionary is not hashable.
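To illustrate the hashability problem: a `dict` cannot be used as a cache key directly, so a key has to be derived from its contents. The sketch below is one possible workaround, not what SoS does. `cache_key` is a hypothetical helper, and it assumes the dictionary has string keys and values whose JSON form (or `repr`, as a fallback) is stable.

```python
import hashlib
import json


def cache_key(global_def, user_dict):
    """Derive a hashable key from the source text and an unhashable dict."""
    # sort_keys makes the result independent of insertion order;
    # default=repr is a fallback for values JSON cannot encode (assumption:
    # their repr is stable, which is not true for every object).
    payload = json.dumps(user_dict, sort_keys=True, default=repr)
    combined = global_def + '\0' + payload
    return hashlib.sha256(combined.encode('utf-8')).hexdigest()


source = "x = 1"
key_a = cache_key(source, {'a': 1, 'b': [1, 2]})
key_b = cache_key(source, {'b': [1, 2], 'a': 1})   # same content, different order
key_c = cache_key(source, {'a': 2, 'b': [1, 2]})   # different content
```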

@BoPeng
Contributor Author

BoPeng commented Feb 28, 2019

Done by #1219

@BoPeng BoPeng closed this as completed Feb 28, 2019