FIX failing sphinx-gallery CI #1145

Vincent-Maladiere · 2024-11-19T08:46:52Z

Vincent-Maladiere · 2024-11-19T09:54:42Z

So, it succeeds when I cut example 07 just before importing the Joiner. Let's now run the 2 next cells.

Vincent-Maladiere · 2024-11-19T10:11:36Z

Let's now run everything but the final CV

Vincent-Maladiere · 2024-11-19T10:31:09Z

It worked (the tick is red because some unrelated tests failed due to server issues).
Then, the last cell with the CV should crash

Vincent-Maladiere · 2024-11-19T10:50:49Z

... which it did. So, there might be some weird memory usage due to the text-encoder example, and sentence-transformer or pytorch caching shenanigans.

I didn't properly run all examples on the first commit, let's do it now and check that everything runs smoothly.

GaelVaroquaux · 2024-11-19T10:57:28Z

> So, it succeeds when I cut example 07 just before importing the Joiner

OK, so it's an interaction...

Vincent-Maladiere · 2024-11-19T11:11:29Z

It failed, that's weird. It means that this is not provoked by the TextEncoder example: https://app.circleci.com/pipelines/github/skrub-data/skrub/4761/workflows/c392d89f-b5dd-4ab2-8d72-3438dd3e4911/jobs/9249 (example 2 is not run there, although example 7 fails).

Let's try this one more time, replacing the CV with a simple fit.

Vincent-Maladiere · 2024-11-19T11:32:19Z

The simple fit worked. So we're likely having a memory error due to CV on the (Joiner, TV, HGBT) pipeline. But why now? Let's remove the transformers module from the doc requirements and see what happens if I set the CV back.

GaelVaroquaux · 2024-11-19T11:57:14Z

It could be triggered by an import...

GaelVaroquaux · 2024-11-19T12:33:06Z

> But why now?

Memory that is not cleared? Maybe something that you can investigate on your computer. Are we holding a reference on a big object?
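One way to check locally whether something is still holding a reference to a big object is a weak reference plus a forced garbage collection. This is only a minimal sketch: `BigModel`, `run_example` and the `cache` dict are hypothetical stand-ins, not skrub or sentence-transformers code.

```python
import gc
import weakref


class BigModel:
    """Hypothetical stand-in for a large object, e.g. a loaded text encoder."""

    def __init__(self, n_bytes):
        self.buffer = bytearray(n_bytes)


def run_example(cache):
    model = BigModel(10_000_000)
    # A long-lived cache like this keeps the object alive after the function returns.
    cache["model"] = model
    # A weak reference lets us observe the object without keeping it alive.
    return weakref.ref(model)


cache = {}
ref = run_example(cache)
gc.collect()
still_alive_with_cache = ref() is not None

# Dropping the last strong reference lets the memory be reclaimed.
cache.clear()
gc.collect()
still_alive_after_clear = ref() is not None

print(still_alive_with_cache, still_alive_after_clear)
```

If the weak reference is still alive after the example has finished, something (a cache, a global, a closure) is holding on to the object.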

jeromedockes · 2024-11-19T12:45:06Z

thanks a lot for investigating this @Vincent-Maladiere, I know it's not fun 😅. It's good to keep in mind that sphinx-gallery runs all the examples in the same process, so I guess it is possible that some of the memory from the TextEncoder example has not been freed when example 7 runs, and that the two together exceed circleci's memory limit
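To illustrate the hypothesis that examples sharing one process can add up, here is a small sketch using the standard-library `tracemalloc` module. The two functions and `_leaky_cache` are hypothetical stand-ins for gallery examples; the sizes are arbitrary.

```python
import tracemalloc

# Hypothetical module-level state that survives between examples.
_leaky_cache = []


def example_02():
    # Allocates 5 MB and leaves it referenced after returning.
    _leaky_cache.append(bytearray(5_000_000))


def example_07():
    # Allocates another 5 MB, released when the function returns.
    data = bytearray(5_000_000)
    return len(data)


tracemalloc.start()
example_02()
after_02, _ = tracemalloc.get_traced_memory()
example_07()
after_07, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"after example 02: {after_02 / 1e6:.1f} MB")
print(f"after example 07: {after_07 / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

The peak covers both allocations at once, even though each example on its own only needs about half of that.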

Vincent-Maladiere · 2024-11-19T13:23:36Z

So, when we remove the "transformers" environment (i.e., pytorch, transformers and sentence-transformers dependencies) from the pixi doc environment, the CI runs fine.

> some of the memory from the TextEncoder example has not been freed when example 7 runs, and that the two together exceed circleci's memory limit

That was my hypothesis too, but we have observed that without example 02 (which uses TextEncoder):

- Running the CV on example 07, with "transformers" in the doc environment fails ❌
- Running a simple fit on example 07, with "transformers" in the doc environment works ✅
- Running the CV without "transformers" in the doc environment works ✅

We only import sentence-transformers or pytorch within the TextEncoder, not outside. However, since I disabled example 02, the CI never imports or uses the TextEncoder. So I don't know how to conclude here. I will try to reproduce it locally.

GaelVaroquaux · 2024-11-19T14:05:22Z

> thanks a lot for investigating this @Vincent-Maladiere, I know it's not fun 😅

Indeed. But it is an amazing stress test that reveals how the code lives (and dies) in challenging settings. It makes the software more robust, which the users then love.

> It's good to keep in mind that sphinx-gallery runs all the examples in the same process, so I guess it is possible that some of the memory from the TextEncoder example has not been freed when example 7 runs, and that the two together exceed circleci's memory limit

Quite possible. If that's the case, hopefully we can find how to free memory: it would make the library more lean.

jeromedockes · 2024-11-19T14:06:16Z

> That was my hypothesis too, but we have observed that without example 02 (which uses TextEncoder):
>
> - Running the CV on example 07, with "transformers" in the doc environment fails ❌
> - Running **a simple fit** on example 07, with "transformers" in the doc environment works ✅
> - Running the CV **without "transformers"** in the doc environment works ✅
>
> We only import sentence-transformers or pytorch within the TextEncoder, not outside. However, since I disabled example 02, the CI never imports or uses the TextEncoder. So I don't know how to conclude here.

if both installing transformers and running the CV take a lot of time, could it be a time limit?

Vincent-Maladiere · 2024-11-19T14:17:18Z

> Quite possible. If that's the case, hopefully we can find how to free memory: it would make the library more lean.

This is not what's happening. Cf my message above.

> could it be a time limit?

Interesting, it could be.

Vincent-Maladiere · 2024-11-19T15:24:42Z

I can't reproduce that locally. Let's rerun the CI with example 02 (TextEncoder), but without example 07 (Joiner) as a sanity check. Note that example 07 takes 8min to run: https://skrub-data.org/stable/sg_execution_times

Vincent-Maladiere · 2024-11-19T15:44:32Z

Great! So, example 02 runs fine. The problem is with example 07. It looks like a timeout issue, as @jeromedockes guessed.

Let's have another run with all examples and transformers, with maximum verbose on example 07 CV, to see if this is related to sphinx-gallery/sphinx-gallery#301

Vincent-Maladiere · 2024-11-19T18:39:34Z

When example 02 runs, the CI crashes before reaching the cross-validation part of example 07.

It looks like there is a timeout at 10min (11 actually) in circle-ci, but the run is not idle for long, and no_output_timeout is set to 30m anyway. With circle-ci, the max run time is 1h for freemium accounts, so we're not hitting this limit either. I can't find any other kind of timeout limit.

Could it be RAM consumption? But then, why would the cross-validation in example 07 crash now, whereas we never import TextEncoder, or torch. I'm starting to run out of ideas.

jeromedockes · 2024-11-20T08:56:53Z

I think the duration of example 7 is a problem in itself anyways -- it takes 8 minutes on circleci. So I would suggest reducing it by reducing the sizes of the subsamples we use (say 5,000 flights and 10,000 weather points), and replacing the cross-validation with a train/test split. Hopefully that will get the doc build passing again.

As a second step, we can work on improving that example and maybe using a different dataset. Indeed, at the moment:

- it downloads a very large dataset
- the join with weather data does not improve predictions significantly IIRC
- the predictions are not super far from chance level, which makes the example less compelling IMO

we may never pin down the exact reason why the circleci job is getting killed but I think I can live with that as long as we get it running again :D

jeromedockes · 2024-11-20T09:00:14Z

and I think we owe @Vincent-Maladiere a beer for this long session of blind debugging with a 20-min feedback delay which can be quite frustrating 😅

Vincent-Maladiere · 2024-11-20T09:03:36Z

> we may never pin down the exact reason why the circleci job is getting killed but I think I can live with that as long as we get it running again :D

Haha, I couldn't agree more! You're right, let's revamp this example slightly to make the CI run.
I'm also creating an issue to improve this example in the longer term.

> and I think we owe @Vincent-Maladiere a beer for this long session of blind debugging with a 20-min feedback delay which can be quite frustrating 😅

I only did my duty 🫡

jeromedockes · 2024-11-20T09:09:50Z

cool, thanks! and if that doesn't work: I remember that for nilearn at some point I had reached out to the circleci support and had a good experience (and nilearn uses the free plan) so that option exists as well if all else fails

Vincent-Maladiere · 2024-11-20T09:45:36Z

It worked! 🎉🎉

The computation times on Circle-ci are now:

computation time summary:
    - ../examples/08_join_aggregation.py:            216.80 sec   0.0 MB
    - ../examples/02_text_with_string_encoders.py:   180.11 sec   0.0 MB
    - ../examples/06_ken_embeddings.py:               69.42 sec   0.0 MB
    - ../examples/01_encodings.py:                    67.53 sec   0.0 MB
    - ../examples/07_multiple_key_join.py:            28.30 sec   0.0 MB
    - ../examples/04_fuzzy_joining.py:                11.46 sec   0.0 MB
    - ../examples/00_getting_started.py:               6.69 sec   0.0 MB
    - ../examples/03_datetime_encoder.py:              5.45 sec   0.0 MB
    - ../examples/09_interpolation_join.py:            5.31 sec   0.0 MB
    - ../examples/05_deduplication.py:                 4.65 sec   0.0 MB

We could also simplify/speed up example 08 by removing the grid search and hardcoding the best hyper-parameters, in another PR.

jeromedockes · 2024-11-20T09:51:51Z

nice!! thanks again! I'll merge it before circleci gets a chance to change its mind 😆

jeromedockes · 2024-11-20T09:52:49Z

the example still looks pretty much the same after reducing the subsample sizes so that's good

GaelVaroquaux · 2024-11-21T08:24:36Z

Wohoo! Thanks a lot Vincent!

Vincent-Maladiere added 4 commits November 19, 2024 09:45

tmp remove transformer example

bedc750

change the model use to limit memory footprint

71e5f43

[doc build]

d162b3f

[doc build]

3c5f713

[doc build]

c66fc19

[doc build]

bb95312

[doc build]

c4f65a5

[doc build]

8e981eb

[doc build]

9547719

[doc build]

708abcf

[doc build]

08ae0c5

[doc build]

c5acf30

Vincent-Maladiere added 2 commits November 19, 2024 17:52

[doc build]

d807d87

[doc build]

a06b728

Vincent-Maladiere added 3 commits November 19, 2024 18:27

[doc build]

00834f5

[doc build]

e316eae

[doc build]

951390a

[doc build]

b19584e

[doc build]

02bd08b

jeromedockes approved these changes Nov 20, 2024

jeromedockes merged commit 2b8d68b into skrub-data:main Nov 20, 2024
25 checks passed

Vincent-Maladiere mentioned this pull request Nov 20, 2024

[DOC] Make Joiner example 07 more compelling #1148

Open
