Make the dummy data timeout configurable #2258

evansd · 2024-11-28T14:27:01Z

We were deliberately being slow in adding configuration options to avoid locking ourselves in to any regrettable APIs. However, this option now has a demonstrable use case and doesn't really lock us in to much of anything.

Docs now look like:

configure_dummy_data(population_size=10, legacy=False, timeout=60)
Configure the dummy data to be generated.

population_size
Maximum number of patients to generate.

Note that you may get fewer patients than this if the generator runs out of time – see timeout below.

legacy
Use legacy dummy data.

timeout
Maximum time in seconds to spend generating dummy data.

Related threads:
https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1732791626217789
https://bennettoxford.slack.com/archives/C07MZ5WCJV6/p1732800801777969
https://bennettoxford.slack.com/archives/C07MZ5WCJV6/p1732801544406129

We're seeing cases where the generator is making progress, but slowly enough that it doesn't generate enough patients within the default timeout. Being able to bump up the timeout would unblock users in this particular case, although it's obviously not a great solution.

As part of this we use the default values from the config class to define the keyword argument defaults, and we remove the default values from the docstring (as they're already shown in the function signature in the docs). These changes reduce the chance of these values going out of sync.
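A minimal sketch of the single-source-of-defaults idea described above. The class name `DummyDataConfig` and its layout are stand-ins for illustration and may not match the names used in ehrql; only the default values (10, False, 60) are taken from the signature shown in the docs.

```python
from dataclasses import dataclass


@dataclass
class DummyDataConfig:
    # Hypothetical config class; the defaults mirror the documented signature.
    population_size: int = 10
    legacy: bool = False
    timeout: int = 60


def configure_dummy_data(
    population_size=DummyDataConfig.population_size,
    legacy=DummyDataConfig.legacy,
    timeout=DummyDataConfig.timeout,
):
    # The keyword defaults are read from the config class, so there is only
    # one place where each default value is written down.
    return DummyDataConfig(
        population_size=population_size, legacy=legacy, timeout=timeout
    )
```

With this shape, changing a default on the config class changes the function's default too, and the rendered signature in the docs follows along without a separate docstring edit.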

DRMacIver

As well as two nitpicky review comments inline, two more:

I wonder if it would be helpful somewhere to have a test that we actually respect the configured timeout. I think the code is correct and we do pass it through correctly, but this seems like the sort of thing that would be easy to get wrong.
It might be worth logging how to change the timeout in the dummy data generators. We already do this for the population size.

None of this is merge-blocking though, these are just my slightly nitpicky thoughts in the course of reading through it. Please feel free to do any or none of them.

DRMacIver · 2024-11-28T14:31:06Z

ehrql/query_language.py


        _legacy_<br>
        Use legacy dummy data.

+        _timeout_<br>
+        Maximum time in seconds to spend generating dummy data.


This may be too pedantic to be worth noting, but this isn't technically right. The timeout is the time after which the generator will stop trying to generate more dummy data. If you hit the timeout, generation will typically take longer than the timeout.

Yes, that is true. But I can't think of a way of writing this which makes it more correct without simultaneously making it harder to understand, and to some extent defeating the point of documenting it in the first place.

Unless we just say:

Maximum time in seconds (approximately) to spend generating dummy data.
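To make the distinction in this thread concrete, here is a sketch of a loop that checks the deadline only between batches, so the timeout bounds when new work may start rather than the total elapsed time. The function `generate_patients`, the `make_batch` callback and the fake clock are hypothetical and are not ehrql's generator; the real implementation may differ.

```python
import itertools


def generate_patients(population_size, timeout, make_batch, clock):
    """Collect patients until we have enough or the deadline has passed.

    The deadline is only checked between batches, so a batch that is already
    running when the timeout expires is allowed to finish.
    """
    deadline = clock() + timeout
    patients = []
    while len(patients) < population_size and clock() < deadline:
        patients.extend(make_batch())
    return patients[:population_size]


class FakeClock:
    """Deterministic clock so the example does not depend on real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
ids = itertools.count(1)


def slow_batch():
    # Each batch "takes" 40 seconds and yields two patients.
    clock.advance(40)
    return [next(ids), next(ids)]


patients = generate_patients(
    population_size=10, timeout=60, make_batch=slow_batch, clock=clock
)
elapsed = clock.now
```

In this run the second batch starts at 40 seconds, inside the 60-second timeout, and finishes at 80 seconds. The result is four of the ten requested patients and a total time longer than the timeout, which is the behaviour the word "approximately" is meant to cover.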

DRMacIver · 2024-11-28T14:31:41Z

tests/unit/measures/test_dummy_data.py

@@ -102,7 +102,8 @@ def test_configured_population_size(legacy):
        intervals=years(1).starting_on("2020-01-01"),
    )

-    measures.configure_dummy_data(population_size=10, legacy=legacy)
+    measures.configure_dummy_data(population_size=99, legacy=legacy, timeout=123)


The answer to this doesn't really matter, but for my curiosity, why did the population size change here?

I just noticed that it was setting it to the default value (or what is now the default, maybe it wasn't when the test was written) and therefore this would pass even if the argument was completely ignored.
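The point about default values can be shown with a small sketch. Both functions below are hypothetical stand-ins, not ehrql code: one stores its arguments, the other deliberately ignores them. A check that passes in the defaults cannot tell them apart, while a check with non-default values such as 99 and 123 can.

```python
DEFAULTS = {"population_size": 10, "legacy": False, "timeout": 60}


def configure_correct(population_size=10, legacy=False, timeout=60):
    return {"population_size": population_size, "legacy": legacy, "timeout": timeout}


def configure_ignoring_arguments(population_size=10, legacy=False, timeout=60):
    # Deliberately buggy: the arguments are accepted but never used.
    return dict(DEFAULTS)


def passes_check(configure, population_size, timeout):
    # Mimics a test: configure, then compare the stored values with the inputs.
    config = configure(population_size=population_size, timeout=timeout)
    return (
        config["population_size"] == population_size
        and config["timeout"] == timeout
    )


# Checking with the default values cannot detect the buggy implementation.
weak_check_catches_bug = not passes_check(configure_ignoring_arguments, 10, 60)
# Checking with non-default values exposes it.
strong_check_catches_bug = not passes_check(configure_ignoring_arguments, 99, 123)
```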

evansd · 2024-11-28T14:52:38Z

Yes, I agree with both of these, thanks. I think I might merge now though so as to unblock the user (because I need to head out in a bit) and then follow these up later. Both are worth doing though.

evansd added 2 commits November 28, 2024 14:18

Run just generate-docs

ccea2e5

DRMacIver approved these changes Nov 28, 2024


evansd merged commit 1552455 into main Nov 28, 2024
9 checks passed

evansd deleted the evansd/dummy-data-timeout branch November 28, 2024 14:52

evansd mentioned this pull request Nov 28, 2024

Dummy data configuration tweaks #2262

Merged
