Squeak 6.0 beta image can not be saved #159
Comments
Taking a closer look at this issue, the saving process seems to be stalled at And I found something suspicious: there is an extra process BTW, this extra process also does not exist in TruffleSqueak with the TruffleSqueak-22.1.0 image. |
Thanks for the report and for doing some digging. It seems that
EventSensor>>#shutDown
	InterruptWatcherProcess ifNotNil: [
		InterruptWatcherProcess terminate.
		InterruptWatcherProcess := nil ].
	EventTicklerProcess ifNotNil: [
		"EventTicklerProcess terminate".
		EventTicklerProcess := nil. ].
	inputSemaphore ifNotNil:[Smalltalk unregisterExternalObject: inputSemaphore].
Any idea what could be going on here? /cc @marceltaeumel |
Apparently, the semantics of |
A few months ago, Also /cc @isCzech |
Hi @LinqLover, all, it's been a year since the extended #terminate logic was introduced :) Before digging deeper, could you please file in and check the latest #terminate version (attached) to see if anything changes? The trunk #terminate still contains some bugs which have been fixed in the latest enclosed version (expected to be merged into the release image). And yes, the change in the #terminate logic only affects terminating processes suspended/terminated inside #ensure: unwind blocks. |
Follow up: To be precise, the enclosed changeset Kernel-jar.1447.followUp.cs.zip is meant to be used along with the updated #suspend attached here: The new release VM (bundled in the beta image) supports new #suspend primitive 578 fixing incorrect #suspend semantics of the old primitive 88. The attached RevisedSuspend.2.cs.zip enables the image to use the new primitive. Alternatively, instead of filing-in these two changesets you can try Kernel-jar.1447 from the Inbox to see if anything changes. |
Thanks for the details! Could you summarize how primitive 578 is different to primitive 88? |
The change affects processes waiting on a Semaphore or Mutex; so far #suspend removed the waiting process from the semaphore or mutex and placed it in a run queue when resumed. That's a bug. Now, #suspend backs up the process's code before the wait so as a result the process returns to the same wait state when resumed. Here's part of Eliot's comment for the revised #suspend: I'm not sure applying these changes will help as I don't understand the root cause. But it's a start :) |
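The difference described above could be probed with a short workspace doIt along these lines. This is only a sketch: the names `sem`, `waiter` and `passedWait` are hypothetical, and the interpretation in the comments simply restates the description in this thread rather than any verified output.

```smalltalk
"Sketch only: sem, waiter and passedWait are hypothetical names."
| sem waiter passedWait |
sem := Semaphore new.
passedWait := false.
"Fork above the current priority so the waiter reaches the wait before we continue."
waiter := [sem wait. passedWait := true] forkAt: Processor activePriority + 1.
"Suspend and resume the process while it is blocked on the semaphore."
waiter suspend.
waiter resume.
"As described above: with the old semantics the resumed process sits in a run queue
 and may get past the wait without any signal; with the revised semantics it should
 go back to waiting."
Transcript showln: 'passed wait after suspend/resume: ', passedWait printString.
sem signal.
Transcript showln: 'passed wait after signal: ', passedWait printString.
```

Comparing the first Transcript line between two VM and image combinations would indicate which #suspend semantics is in effect.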
I'm not familiar with TruffleSqueak but I've downloaded TruffleSqueakImage-22.1.0, used the latest VM version VMMaker.oscog-mt.3184 and tried two experiments:
What was your scenario when saving the image failed and the eventTickler process didn't terminate? How does it differ from my scenarios? |
Hi @isCzech , sorry, I should have given more details in the issue description. This issue can be reproduced using the TruffleSqueak VM with recent Squeak 6.0 images. Detailed steps under Windows follow:
I'll try your attached patches later. |
Hi @isCzech , I have tried your two changesets, and image saving works once they are applied! But there is another problem: each time I save, an extra process is created. The following is a screenshot of the Process Browser: |
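As an alternative to a Process Browser screenshot, stale processes could be listed with a doIt such as the following and compared before and after a save. This is a minimal sketch; the temporary `live` is a hypothetical name, and the garbage collection beforehand is an assumption meant to avoid counting processes that are merely awaiting collection.

```smalltalk
"Sketch only: list every process that has not terminated yet."
| live |
Smalltalk garbageCollect.
live := Process allSubInstances reject: [:each | each isTerminated].
live do: [:each |
	Transcript showln: each priority printString, ' ', each name].
Transcript showln: 'live processes: ', live size printString.
```

If the count grows by one on every save, that matches the extra process reported above.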
Thanks for checking, @dram! TruffleSqueak does not implement primitive 578 yet, so maybe that's why you get to see these additional processes. Let me try and push an implementation. |
Hi @dram, I'm a bit confused now: you use Squeak6.0alpha-21736 and say later ones are not compatible. How did you test Squeak beta which is 21757 and later? |
Hi @fniephaus,
Worth a try but as long as #suspend uses primitive 88, things should still work as before... |
I tried implementing primitive 578 (see b926e17) but unfortunately, I still see stale processes when saving the image.
I just opened the |
Well, I guess it means the extended #terminate logic itself is not the culprit... I'll try to follow @dram 's steps and see :) |
Hi again, I can confirm I can reproduce the incorrect behavior (installed things as per steps 1 to 3, then for some reason step 4 didn't work at all - the image didn't start but I used TruffleSqueakImage 22.1.0 instead and updated the terminate method) I've noticed Cuis uses different priorities for the interrupt watcher and the low space watcher while Squeak runs all of them with the same priority, however, changing the priorities doesn't help. But there's another difference - for some reason Cuis has modified the EventSensor implementation, including the eventTickler, which may explain why Cuis works flawlessly :) Maybe @jvuletich could shed some light here why (Hi Juan, I hope you don't mind :) ) |
Just in case you, @isCzech, are not aware: TruffleSqueak is an alternative Smalltalk VM implementation, so it's very likely that something on the VM level isn't working correctly. However, things used to work fine with older Squeak images and with Cuis images. I noticed that newly introduced |
Thanks indeed! I've realized you must have a different VM (not sure how different though and how the latest VM changes map into your VM); nonetheless you're absolutely right there must be something off on the image level if Cuis works fine either way :) |
Hi @isCzech, the incompatibility which I mentioned is about issue #156. It was fixed recently, but not released yet. I tested Squeak6.0beta-21772 with TruffleSqueak built from If you need to experiment with the latest Squeak image, you can follow the instructions in https://github.com/hpi-swa/trufflesqueak/blob/main/docs/development.md |
Hi @dram, thanks, I understand now. In the meantime I've come to a tentative conclusion there may be something wrong with the Squeak image other than the new termination logic. Not sure what though; it's possible fixing #terminate bugs may have unmasked some other issues - something Cuis has fixed or never suffered from. See above observations; I hope Juan (@jvuletich) could enlighten us why he changed the EventSensor implementation recently - whether he was possibly fixing some bugs...? In which case Squeak could follow suit :) |
Hi @isCzech, when looking around code of
According to the doc of
There is some similar code in your patched version, i.e.:
Similarly, after I change it to following, no more stale processes created after saving:
I'm not familiar with Squeak's process scheduling system, so not sure if such change is reasonable. Also I'm curious why there is no problem in OpenSmalltalk VM. |
Yes, the description "ASSUMES aSender is a sender of self" always bothered me :) I probably used the method beyond its original intended use; however, it works correctly even when the receiver and 'aSender' are identical.
yes, it will cover most cases but I'm attaching a test that will fail with your modification; it is a bug in the original #terminate. It fails to recognize the case when the unwind block is the top context (of the stack). #runUntilReturnFrom: is a trivialized (or less restricted) version of #runUntilErrorOrReturnFrom: designed as a helper method for #unwindTo:; its purpose is to run a fragment of another context stack. However, file-in the attached fix - it works in my case. I've compared Cuis's code and found this difference - thanks @jvuletich !! Nice catch :) I don't understand yet why the same problem won't show with the Squeak VM... |
I wonder if speed of the processing could have anything to do with it? Anyway, if you let me know whether the fix works in all your scenarios, I'll send the fix to Squeak Inbox. Thanks |
Hi @isCzech, I have tested the EventSensor-eventTickler patch, and it seems that the stale process problem still exists. The tests were run in two environments:
Steps:
|
WRT #eventTickler in Cuis, it was last tweaked in December 2021 by Andrés Valloud. There is a possible weakness in Delay. If a Delay is waiting, but its process is terminated, and then the same delay is sent #wait without checking #beingWaitedOn, the system will hang. See senders of #wait and #beingWaitedOn in Cuis. In any case, I don't know if this has any relation with the problems when saving Squeak in TruffleSqueak. I'm just answering Jaromir's question about #eventTickler in Cuis. |
yes, sorry, I haven't realized you don't have the full set of patches... It also depends on the combination of the VM+image :) |
Hi @dram, at this point it shouldn't matter whether you apply Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs because they have nothing to do with the issue. Could you please tell me whether the problem persists when you apply only the EventSensor-eventTickler.st patch? (with the recent Squeak image indeed) ? Does it mean that the problem disappeared on Windows 10 and possibly some other scenarios? (I have no way to test Win11/Linux/Mac) It seems, as Juan indicated, it may be a timing issue caused by the eventTickler implementation. I only guess the new terminate made the problem more visible in certain scenarios. It never occurred in Cuis/Squeak though and I may only hypothesize it could be because of the speed of the VM?? I noticed GraalVM is slower... |
Hi @isCzech, according to the test case I mentioned in previous comment #159 (comment), Cuis's |
True, this |
Hi @dram, I have some preliminary results:
Thanks for letting me know. |
Hi @isCzech, I kind of think that the new
Anyway, the first thing to do would be determine what that feature is. The error message and exception stack trace mentioned in #159 (comment) may give some hints. |
Hi @dram, why would you think it may be a "tricky" feature in OpenSmalltalk VM and not in the TruffleSqueak VM? ;) One of the differences between the old and new terminate is that the old terminate uses the simulation machinery (the whole unwind is simulated using I've compared the termination step by step in both the failing case and the "fixed" case with Now, from that follows, I hope, that if the simulation (debugger) correctly terminates the event tickler process but the live run fails, the difference is most likely within the TruffleSqueak VM. So I'd say #terminate in TruffleSqueak relies on being run via the simulation machinery (#popTo & company). Once the TruffleSqueak VM tries to terminate using "regular" code, it fails, while at the same time when running the same "regular" code as a simulation (in debugger) it works correctly. (It's a mystery why changing the one line "fixes" the issue for TruffleSqueak but as discussed in #159 (comment). I say it's not really a fix - it's rather a reminder something doesn't work as expected :) ) What's your thought? Hi @fniephaus, does the exception stack trace mentioned in #159 (comment) tell you anything? Would you agree with the conclusion above? |
Hi @dram and @isCzech, What would be very helpful are very short doIts that demonstrate problems in TruffleSqueak or a difference in OSVM vs TruffleSqueak behavior. Something like the doIt in #159 (comment) and some of the |
The crucial point that makes the difference for TruffleSqueak image is the replacement @dram discovered in this comment. The suggested change will make TruffleSqueak work but I can't see any logical explanation why - IMO the answer lies within the TruffleSqueak VM. For OpenSmalltalk VM @dram's change has no effect and the original version (not working in TruffleSqueak) is more logical/consistent. As for the expected release image, I propose including methods in Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs (attached previously) - but these two will require the latest VM with the 578 suspend primitive. The final decision should be made soon (two weeks timeframe?) Just to make sure: the issue doesn't appear to be a timing or other issue with the event tickler. |
Maybe it's too soon to rule this out... Cuis image with the new terminate seems to work fine with TruffleSqueak VM in terms of saving the image (and terminating the event tickler) but @dram's observation here shows there's an issue with Cuis as well. |
I have a hunch what's going on:
[] ensure: [
	self suspend.
	context := suspendedContext ifNil: [^self].
	suspendedContext :=
		[context releaseCriticalSection; unwindTo: nil. self suspend] asContext.
	self priority: Processor activePriority + 1; resume]
It seems that the ensure block is never executed as part of the top-level unwinding logic:
Lines 44 to 57 in a068eca
That, at least, explains why all three processes of |
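A related check that can be run as a doIt on both VMs is whether an unwind block runs when a process blocked inside the receiver of #ensure: is terminated. This is only a sketch with hypothetical names (`unwound`, `worker`); it exercises unwinding in the terminated process and is not the same code path as the `[] ensure: [...]` inside #terminate quoted above.

```smalltalk
"Sketch only: unwound and worker are hypothetical names."
| unwound worker |
unwound := false.
"Fork above the current priority so the worker is already blocked when we continue."
worker := [[Semaphore new wait] ensure: [unwound := true]]
	forkAt: Processor activePriority + 1.
worker terminate.
Transcript showln: 'ensure block ran: ', unwound printString.
Transcript showln: 'isTerminated: ', worker isTerminated printString.
```

A difference in either Transcript line between OpenSmalltalk VM and TruffleSqueak would point at the unwinding logic rather than at the event tickler itself.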
Interesting! I wish you were right :) Cuis uses the same approach; they just use a slightly different form: they use a method to get the receiver evaluated as an unwind block:
Applying it to the Squeak's #terminate at TruffleSqueak makes no difference though... |
Hi @isCzech, while rethinking about
After this change, image saving in TruffleSqueak works with no problem. |
Hi @dram, nice try ;) |
Hi @isCzech, thanks for pointing to Anyway, following is another try:
All tests passed, so I'm a bit curious why not use |
Hi @dram, this one's even better but nope, try this: |
I just remembered that lots of trufflesqueak/src/de.hpi.swa.trufflesqueak.test/src/de/hpi/swa/trufflesqueak/test/runCuisTests.st Lines 12 to 28 in b34dc08
So I guess there are plenty of tests that could potentially help find the bug(s) in TruffleSqueak. |
These are the tests added to complement the new (fixed/extended) termination logic introduced last year. They should work with newer Squeak or Cuis images... |
Hi @isCzech, For the exception case, it can be fixed in this way:
[ctx isNil] whileFalse: [
	(ctx tempAt: 2) ifNil: [
		ctx tempAt: 2 put: true.
-		top := (ctx tempAt: 1) asContextWithSender: ctx. "see the note below"
-		top runUntilReturnFrom: top].
+		[(ctx tempAt: 1) value]
+			on: Exception do: [:ex |
+				(ctx nextHandlerContextForSignal: ex) ifNotNil: [:hdl | hdl handleSignal: ex]]].
	ctx := ctx findNextUnwindContextUpTo: aContext]
But it will fail for the following case, which is based on your version, with more nested
[
	[
		[
			[
				[Processor activeProcess terminate] ensure: [Transcript showln: 1. 1 / 0]
			] ensure: [Transcript showln: 2]
		] on: ZeroDivide do: [Transcript showln: 3]
	] ensure: [Transcript showln: 4]
] fork
Need more investigation. |
Found the problem,
[ctx isNil] whileFalse: [
	(ctx tempAt: 2) ifNil: [
		ctx tempAt: 2 put: true.
-		top := (ctx tempAt: 1) asContextWithSender: ctx. "see the note below"
-		top runUntilReturnFrom: top].
+		[(ctx tempAt: 1) value]
+			on: Exception do: [:ex |
+				(ctx nextHandlerContextForSignal: ex) ifNotNil: [:hdl | hdl fireHandlerActionForSignal: ex]]].
	ctx := ctx findNextUnwindContextUpTo: aContext]
For this version, all tests in |
Interesting solution... You're searching for handlers on two disjointed stacks instead of joining the stacks :) I wonder what would happen in case of more complicated exceptions like chained outer or other crazy scenarios though. It seems to me joining stacks and executing on the joint stack is what the methods have been developed and tested for. Not saying it wouldn't work, just that it'd require some deep thinking :) I'd still prefer finding out why TruffleSqueak's VM executes the same code differently from the OSVM and take it from there. Fabio seemed to find something suspicious so I'm curious what his findings will be. Thanks! |
Hi @isCzech, While experimenting with exceptions and
After some more investigation, I kind of think that it may be hard or impossible to make So if resource cleanup is needed, some other mechanisms should be used, e.g. see this Java document. Anyway, I'm also looking forward to @fniephaus's findings. Hope this issue can be solved, as image saving is quite a fundamental feature. |
Hi @dram,
Saving is overrated ;) |
@dram, try |
Hi @isCzech, that makes sense, thanks! |
Hi @fniephaus, I wonder if you found something interesting :) I guess it's worth finding out what's causing the irregular behavior but in any case we can try to replace the line |
The problem in TruffleSqueak is indeed that the top-level unwinding logic neither handles So from the TruffleSqueak side, there is no requirement how things should be done in Squeak. However, I think we should put primitive 578 to work asap so that it can be tested as much as possible before the release. If someone could notify me here when this is done, I'm happy to take another look and add support for 578 to TruffleSqueak. |
Thanks for the info; I'll let you know once the support for the new suspend primitives is merged. |
It's merged now. |
Thanks again everyone for the detailed analysis around the new process termination logic and for ultimately helping to improve TruffleSqueak! I have managed to fix saving of An implementation of primitive 578 is also waiting to be merged once we upgrade TruffleSqueak's image so that it is based on the upcoming Squeak 6.0 release. |
Fix confirmed, thanks! |
In TruffleSqueak, the Squeak 6.0 beta image cannot be saved. No error is displayed, either in the Squeak window or in the console.