Temporary files in outlier detection median computation should read faster #8774
Comments
Comment by Ned Molter on JIRA: The proposed intermediate solution of simply using numpy arrays instead of datamodels for the repeated I/O unfortunately leads to only marginal runtime gains and does not resolve the issue that the runtime is highly sensitive to the choice of chunk size. See the two attached files comparing the runtimes for a similar buffer size. Using datamodels and a 7 MB buffer, the runtime of the load operations is 122 seconds, whereas using numpy arrays and a slightly larger 10 MB buffer, the runtime is 105 seconds. I also checked the difference for a 1 MB buffer, and again there is little difference: both take ~1000 seconds total for the load operations, i.e., the runtime scales roughly linearly with the inverse of the chunk size, as expected when each load carries a fixed per-open overhead. This small test justifies attempting the more nuanced (but more challenging to implement) approach suggested by Brett Graham in this GitHub comment.
Comment by Ned Molter on JIRA: With some help from Brett, I made a draft of this, which can be seen here. The change appears to make a large difference in the amount of runtime spent on file I/O. I'm attaching a third runtime graph for the same input association, but this time using the PR branch. The time spent on file I/O went from roughly 100 seconds to 0.6 seconds. The buffer size is now computed based on the size of the input image data arrays (this can be changed), but for comparison with the other tests it amounted to ~3 MB per model. I'm also attaching a memory profile.
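For reference, a minimal sketch of how a per-model byte budget like that could be translated into a section height; the function and variable names here are illustrative, not the PR's actual code:

```python
import numpy as np

# Illustrative only: how many image rows fit within a per-model byte budget.
def rows_per_section(shape, dtype=np.float32, buffer_bytes=3 * 1024**2):
    n_rows, n_cols = shape
    bytes_per_row = n_cols * np.dtype(dtype).itemsize
    return max(1, min(n_rows, buffer_bytes // bytes_per_row))
```

For a 2048x2048 float32 array and a 3 MB budget, this works out to 384 rows per section.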
Comment by Ned Molter on JIRA: A note on the memory profiles. I have attached memray_flamegraph_output_3741.html for my test association on the PR branch with in_memory=False, and memray_flamegraph_output_3741_main.html for the same test association on the master branch, also with in_memory=False. Peak memory usage on main: 5.6 GB. Peak memory usage on the PR branch: 5.4 GB. So they are nearly identical, and if anything the PR branch uses a bit less memory. I take this as evidence that the PR branch really did get its speedups from decreasing file I/O and not from inadvertently using more memory.
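For context, profiles like these can be captured with memray's `memray run` CLI or, as a minimal sketch, with its Python API; the workload below is a placeholder, not the pipeline step itself:

```python
import numpy as np
import memray

def run_outlier_detection():
    # Placeholder workload standing in for the pipeline step under test.
    stack = np.random.rand(16, 1024, 1024).astype(np.float32)
    return np.nanmedian(stack, axis=0)

# Record allocations into a capture file while the workload runs.
with memray.Tracker("outlier_detection.bin"):
    run_outlier_detection()
# Render the HTML report afterwards with: memray flamegraph outlier_detection.bin
```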
Comment by Ned Molter on JIRA: I'm adding memray outputs for …
Comment by Ned Molter on JIRA: Additional changes have decreased the …
Comment by Ned Molter on JIRA: Fixed by #8782
Issue JP-3741 was created on JIRA by Ned Molter:
Runtimes can be very long for outlier detection on large associations when the in_memory parameter is set to False. When all models cannot be stored in memory at once, the median is computed in spatial sections. As currently written, this operation requires a number of load operations equal to n_sections * n_models, which becomes very large, especially for small sections. Because datamodels are used to load each section, these load operations are extremely inefficient: they load unnecessary arrays (e.g., the error and dq arrays), and schema validation occurs on every single open.
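For illustration, a minimal sketch of that access pattern, assuming hypothetical paths and uniform section bounds; the real step's bookkeeping differs, but the one-open-per-model-per-section structure is the point:

```python
import numpy as np
from stdatamodels.jwst import datamodels

def median_by_sections(model_paths, n_rows, n_sections):
    rows = n_rows // n_sections  # remainder rows omitted for brevity
    medians = []
    for s in range(n_sections):
        r0, r1 = s * rows, (s + 1) * rows
        stack = []
        for path in model_paths:  # one full datamodel open per model, per section
            with datamodels.open(path) as dm:  # schema validation on every open
                stack.append(dm.data[r0:r1].copy())
        medians.append(np.nanmedian(stack, axis=0))
    return np.concatenate(medians, axis=0)
```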
As shown by the comments on the linked ticket JP-3706, as well as a long discussion on the associated pull request, one effect of these inefficiencies is that the runtime is highly sensitive to the choice of section size, and there is no clear "best" section size to use as a default.
There are multiple ways to fix this, ranging from quick-and-dirty to elegant but hard to implement.
One of the quicker options is simply to store the temporary files as plain numpy arrays and load them as such, instead of using datamodels. The current plan is to implement a draft of this, then test the extent to which it improves the runtime of the file I/O and decreases the sensitivity of the entire step's runtime to the section size.
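As a sketch of that quick fix, with illustrative helper names: writing each model's science array once to a plain .npy file and reading sections back through numpy's memory mapping avoids both the unused error/dq loads and the per-open schema validation:

```python
import numpy as np

def write_temp(path, data):
    # Write the science array once to a plain .npy temporary file.
    np.save(path, data)

def read_section(path, r0, r1):
    # mmap_mode="r" maps the file instead of reading it whole, so the
    # slice touches only the requested rows; no schema validation runs.
    return np.load(path, mmap_mode="r")[r0:r1]
```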