Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Squeak 6.0 beta image can not be saved #159

Closed
dram opened this issue May 24, 2022 · 63 comments
Closed

Squeak 6.0 beta image can not be saved #159

dram opened this issue May 24, 2022 · 63 comments
Labels

Comments

@dram
Copy link

dram commented May 24, 2022

In TruffleSqueak, Squeak 6.0 beta image can not be saved, no error is displayed, neither in Squeak window, nor in the console.

@dram
Copy link
Author

dram commented May 24, 2022

Have a more close look at this issue, the saving process seems be stalled at SmalltalkImage>>#processShutDownList: in SmalltalkImage>>#snapshot:andQuit:withExitCode:embedded:.

And I found something suspicious, there is an extra process Delay>>wait in TruffleSqueak with 6.0 beta image, which does not exist in OpenSmalltalk VM, i.e.:

image

BTW, this extra process also does not exist in TruffleSqueak with TruffleSqueak-22.1.0 image.

@fniephaus
Copy link
Member

Thanks for the report and for doing some digging. It seems that EventTicklerProcess terminate causes the issue. Not sure what has changed recently, but using the following allows me to save the image again:

EventSensor>>#shutDown

	InterruptWatcherProcess ifNotNil: [
		InterruptWatcherProcess terminate.
		InterruptWatcherProcess := nil ].

	EventTicklerProcess ifNotNil: [
		"EventTicklerProcess terminate".
		EventTicklerProcess := nil. ].
	
	inputSemaphore ifNotNil:[Smalltalk unregisterExternalObject: inputSemaphore].

Any idea what could be going on here? /cc @marceltaeumel

@fniephaus fniephaus added the bug label May 24, 2022
@fniephaus
Copy link
Member

Apparently, the semantics of Process>>#terminate have changed a bit. Maybe @LinqLover has an idea what could be going on.

@LinqLover
Copy link
Collaborator

A few months ago, #terminate has started to complete all ensure: contexts on the receiver stack first, even those that were already active before the termination request. Could there be any ensure: context like this on the stack of the event tickler process?

Also /cc @isCzech

@isCzech
Copy link

isCzech commented May 24, 2022

Hi @LinqLover, all, it's been a year since the extended #terminate logic has been introduced :) Before digging deeper, could you please filein and check the latest #terminate version (attached) to see if anything changes? The trunk #terminate still contains some bugs which have been fixed in the latest enclosed version (expected to be merged into the release image).

And yes, the change in the #terminate logic only affects terminating processes suspended/terminated inside #ensure: unwind blocks.
Kernel-jar.1447.followUp.cs.zip

@isCzech
Copy link

isCzech commented May 24, 2022

Follow up: To be precise, the enclosed changeset Kernel-jar.1447.followUp.cs.zip is meant to be used along with the updated #suspend attached here:
RevisedSuspend.2.cs.zip

The new release VM (bundled in the beta image) supports new #suspend primitive 578 fixing incorrect #suspend semantics of the old primitive 88. The attached RevisedSuspend.2.cs.zip enables the image to use the new primitive.

Alternatively, instead of filing-in these two changesets you can try Kernel-jar.1447 from the Inbox to see if anything changes.

@fniephaus
Copy link
Member

Thanks for the details! Could you summarize how primitive 578 is different to primitive 88?

@isCzech
Copy link

isCzech commented May 24, 2022

The change affects processes waiting on a Semaphore or Mutex; so far #suspend removed the waiting process from the semaphore or mutex and placed it in a run queue when resumed. That's a bug. Now, #suspend backs up the process's code before the wait so as a result the process returns to the same wait state when resumed.

Here's part of Eliot's comment for the revised #suspend:
"...If the given process is not the active process, take it off
its corresponding list. If the list was not its run queue assume it was on some
condition variable (Semaphore, Mutex) and back up its pc to the send that
invoked the wait state the process entered. Hence when the process resumes
it will reenter the wait state. ..."

I'm not sure applying these changes will help as I don't understand the root cause. But it's a start :)

@isCzech
Copy link

isCzech commented May 24, 2022

I'm not familiar with TruffleSqueak but I've downloaded TruffleSqueakImage-22.1.0, used the latest VM version VMMaker.oscog-mt.3184 and tried two experiments:

  1. updated the image via Squeak -> Update Squeak menu to the latest... saving works OK; then filed-in the two changesets attached above... again saving the image works OK
  2. only filed-in the two changesets above to verify #terminate and #suspend work correctly... and again saving the image works OK

What was your scenario when saving the image failed and the eventTickler process didn't terminate. Where is the difference from my scenarios?

@dram
Copy link
Author

dram commented May 24, 2022

Hi @isCzech , sorry that I should give more details in the issue description.

This issue can be reproduced using TruffleSqueak VM with recent Squeak 6.0 images, following are detailed steps under Windows:

  1. download graalvm-ce-java17-windows-amd64-22.1.0.zip from https://github.com/graalvm/graalvm-ce-builds/releases and unpack
  2. download trufflesqueak-installable-svm-java17-windows-amd64-22.1.0.jar from https://github.com/hpi-swa/trufflesqueak/releases
  3. install TruffleSqueak into GraalVM directory with command: graalvm-ce-java17-22.1.0\bin\gu –local-file install trufflesqueak-installable-svm-java17-windows-amd64-22.1.0.jar
  4. download Squeak6.0alpha-21736-64bit-202204190959-Windows-x64.zip from https://files.squeak.org/ and unpack (latest image is not compatible with TruffleSqueak 22.1.0, so a slightly older one is used here)
  5. copy Squeak6.0alpha-21736-64bit.image, Squeak6.0alpha-21736-64bit.changes and SqueakV50.sources into directory graalvm-ce-java17-22.1.0\languages\smalltalk\resources
  6. Launch TruffleSqueak with command graalvm-ce-java17-22.1.0\bin\trufflesqueak

I'll try your attached patches later.

@dram
Copy link
Author

dram commented May 25, 2022

Hi @isCzech ,

I have tried your two changesets, image saving works after applied!

But there is some other problem, for each time I do a saving, there will be an extra process created, following is a screenshot of Process Browser:

image

@fniephaus
Copy link
Member

Thanks for checking, @dram! TruffleSqueak does not implement primitive 578 yet, so maybe that's why you get to see these additional processes. Let me try and push an implementation.

@isCzech
Copy link

isCzech commented May 25, 2022

Hi @dram, I'm a bit confused now: you use Squeak6.0alpha-21736 and say later ones are not compatible. How did you test Squeak beta which is 21757 and later?
About the incompatibility: Again, I'm not familiar with TruffleSqueak; I understand you don't use OpenSamlltalkVM but GraalVM - is it possible the GraalVM somehow doesn't work correctly with the latest Squeak images? I mean even the not so latest, causing incorrect behavior? Is there a version number that works and the next one doesn't? That would help :)))

@isCzech
Copy link

isCzech commented May 25, 2022

Hi @fniephaus,

TruffleSqueak does not implement primitive 578 yet, so maybe that's why you get to see these additional processes. Let me try and push an implementation.

Worth a try but as long as #suspend uses primitive 88, things should still work as before...
A question: You tried using Cuis image - does it work ok? The reason I'm asking is Cuis implemented the latest #terminate a few months ago, so if Cuis image works ok, that means #terminate should work ok and the problem might lie somewhere else. (i.e. the inability to terminate the eventTickler could be a consequence of something else).

@fniephaus
Copy link
Member

I tried implementing primitive 578 (see b926e17) but unfortunately, I still see stale processes when saving the image.

You tried using Cuis image - does it work ok? The reason I'm asking is Cuis implemented the latest #terminate a few months ago, so if Cuis image works ok, that means #terminate should work ok and the problem might lie somewhere else. (i.e. the inability to terminate the eventTickler could be a consequence of something else).

I just opened the Cuis6.0-5171.image with a TruffleSqueak built from b926e17 and saving works just fine, without any extra processes.

@isCzech
Copy link

isCzech commented May 25, 2022

I tried implementing primitive 578 (see b926e17) but unfortunately, I still see stale processes when saving the image.

You tried using Cuis image - does it work ok? The reason I'm asking is Cuis implemented the latest #terminate a few months ago, so if Cuis image works ok, that means #terminate should work ok and the problem might lie somewhere else. (i.e. the inability to terminate the eventTickler could be a consequence of something else).

I just opened the Cuis6.0-5171.image with a TruffleSqueak built from b926e17 and saving works just fine, without any extra processes.

Well, I guess it means the extended #terminate logic itself is not the culprit... I'll try to follow @dram 's steps and see :)

@isCzech
Copy link

isCzech commented May 25, 2022

Hi again, I can confirm I can reproduce the incorrect behavior (installed things as per steps 1 to 3, then for some reason step 4 didn't work at all - the image didn't start but I used TruffleSqueakImage 22.1.0 instead and updated the terminate method)
It's strange that the same Truffle/Squeak image works ok with the OpenSmalltalk VM and behaves incorrectly with the GraalVM, while at the same time the Cuis image with the updated terminate works ok with both VMs!

I've noticed Cuis uses different priorities for the interrupt watcher and the low space watcher while Squeak runs all of them with the same priority, however, changing the priorities doesn't help. But there's another difference - for some reason Cuis has modified the EventSensor implementation, including the evenTickler, which may explain Cuis works flawlessly :) Maybe @jvuletich could shed some light here why (Hi Juan, I hope you don't mind :) )

@fniephaus
Copy link
Member

Just in case you, @isCzech, are not aware: TruffleSqueak is an alternative Smalltalk VM implementation, so it's very likely that something on the VM level isn't working correctly. However, things used to work fine with older Squeak images and with Cuis images. I noticed that newly introduced ProcessTest crash TruffleSqueak, so I expect there to be at least a VM-level problem. But there's probably also something wrong on the image level as Cuis has demonstrated.

@isCzech
Copy link

isCzech commented May 25, 2022

Thanks indeed! I've realized you must have a different VM (not sure how different though and how the latest VM changes map into your VM); nonetheless you're absolutely right there must be something off on the image level if Cuis works fine either way :)
(I haven't tested the new Process tests on an old image but yes, they may be very disruptive with the old image)

@dram
Copy link
Author

dram commented May 25, 2022

Hi @dram, I'm a bit confused now: you use Squeak6.0alpha-21736 and say later ones are not compatible. How did you test Squeak beta which is 21757 and later? About the incompatibility: Again, I'm not familiar with TruffleSqueak; I understand you don't use OpenSamlltalkVM but GraalVM - is it possible the GraalVM somehow doesn't work correctly with the latest Squeak images? I mean even the not so latest, causing incorrect behavior? Is there a version number that works and the next one doesn't? That would help :)))

Hi @isCzech, the incompatibility which I mentioned is about issue #156. It is fixed recently, but not released yet. I tested Squeak6.0beta-21772 with TruffleSqueak build from main branch.

If you need to experiment with latest Squeak image, you can follow instructions in https://github.com/hpi-swa/trufflesqueak/blob/main/docs/development.md

@isCzech
Copy link

isCzech commented May 25, 2022

Hi @dram, thanks, I understand now. In the meantime I've come to a tentative conclusion there may be something wrong with the Squeak image other than the new termination logic. Not sure what though; it's possible fixing #terminate bugs may have unmasked some other issues - something Cuis has fixed or never suffered from. See above observations; I hope Juan (@jvuletich) could enlighten us why he changed the EventSensor implementation recently - whether he was possibly fixing some bugs...? In which case Squeak could follow suit :)

@dram
Copy link
Author

dram commented May 25, 2022

Hi @isCzech, when looking around code of Process>>#terminate, I find following line a bit suspicious:

suspendedContext runUntilErrorOrReturnFrom: suspendedContext

According to the doc of Context>>#runUntilErrorOrReturnFrom:, the argument of it should be a sender of self, so I try to change this line to following, after that, image saving works without blocking:

suspendedContext runUntilErrorOrReturnFrom: ctxt

There is some similar code in your patched version, i.e.:

top runUntilReturnFrom: top

Similarly, after I change it to following, no more stale processes created after saving:

top runUntilReturnFrom: ctx

I'm not familiar with Squeak's process scheduling system, so not sure if such change is reasonable. Also I'm curious why there is no problem in OpenSmalltalk VM.

@isCzech
Copy link

isCzech commented May 25, 2022

According to the doc of Context>>#runUntilErrorOrReturnFrom:, the argument of it should be a sender of self,

Yes, the description "ASSUMES aSender is a sender of self" always bothered me :) I probably used the method beyond it's original intended use; however, it works correctly even when the receiver and 'aSender' are identical.

so I try to change this line to following, after that, image saving works without blocking:

suspendedContext runUntilErrorOrReturnFrom: ctxt

yes, it will cover most cases but I'm attaching a test that will fail with your modification; it is a bug in the original #terminate. It fails to recognize the case when the unwind block is the top context (of the stack). #runUntilReturnFrom: is a trivialized (or less restricted) version of #runUntilErrorOrReturnFrom: designed as a helper method for #unwindTo:; it's purpose is to run a fragment of another context stack.
ProcessTest-testTerminateEnsureAsStackTop.zip

However, file-in the attached fix - it works in my case. I've compared Cuis's code and found this difference - thanks @jvuletich !! Nice catch :)
EventSensor-eventTickler.zip

I don't understand yet why the same problem won't show with the Squeak VM...

@isCzech
Copy link

isCzech commented May 25, 2022

I don't understand yet why the same problem won't show with the Squeak VM...

I wonder if speed of the processing could have anything to do with it?

Anyway, if you let me know whether the fix works in all your scenarios, I'll send the fix to Squeak Inbox. Thanks

@dram
Copy link
Author

dram commented May 25, 2022

Hi @isCzech, I have a test of EventSensor-eventTickler patch, it seems that the stale process problem still exists.

Tests are taken in two environments:

  • Debian Linux 11, TruffleSqueak (main branch), Squeak6.0beta-21772
  • Windows 11, TruffleSqueak 22.1.0, Squeak6.0alpha-21736

Steps:

  1. launch the clean Squeak image
  2. file in EventSensor-eventTickler.st, Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs
  3. open Process Browser
  4. Trigger saving from menu bar

@dram
Copy link
Author

dram commented May 25, 2022

And regarding to ProcessTest-testTerminateEnsureAsStackTop, seems that it failed in TruffleSqueak with EventSensor-eventTickler.st and Kernel-jar.1447.followUp.cs filed in.

image

The assertion failure is caused by following line:

self assert: p1 isTerminated & p2 isTerminated & p3 isTerminated.

@jvuletich
Copy link

WRT #eventTickler in Cuis, it was last tweaked in December 2021 by Andrés Valloud. There is a possible weakness in Delay. If a Delay is waiting, but its process is terminated, and then the same delay is sent #wait without checking #beingWaitedOn, the system wil hang. See senders of #wait and #beingWaitedOn in Cuis. In any case, I don't know if this has any relation with the problems when saving Squeak in TruffleSqueak. I'm just answering Jaromir's question about #ventTicker in Cuis.

@isCzech
Copy link

isCzech commented May 26, 2022

And regarding to ProcessTest-testTerminateEnsureAsStackTop, seems that it failed in TruffleSqueak with EventSensor-eventTickler.st and Kernel-jar.1447.followUp.cs filed in.

image

The assertion failure is caused by following line:

self assert: p1 isTerminated & p2 isTerminated & p3 isTerminated.

yes, sorry, I haven't realized you don't have the full set of patches... It also depends on the combination of the VM+image :)

@isCzech
Copy link

isCzech commented May 26, 2022

Hi @isCzech, I have a test of EventSensor-eventTickler patch, it seems that the stale process problem still exists.

Tests are taken in two environments:

  • Debian Linux 11, TruffleSqueak (main branch), Squeak6.0beta-21772
  • Windows 11, TruffleSqueak 22.1.0, Squeak6.0alpha-21736

Steps:

  1. launch the clean Squeak image
  2. file in EventSensor-eventTickler.st, Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs
  3. open Process Browser
  4. Trigger saving from menu bar

Hi @dram, at this point it shouldn't matter whether you apply Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs because they have nothing to do with the issue. Could you please tell me whether the problem persists when you apply only the EventSensor-eventTickler.st patch? (with the recent Squeak image indeed) ?

Does it mean that the problem disappeared on Windows 10 and possibly some other scenarios? (I have no way to test Win11/Linux/Mac)

It seems, as Juan indicated, it may be a timing issue caused by the eventTickler implementation. I only guess the new terminate made the problem more visible in certain scenarios. It never occurred in Cuis/Squeak though and I may only hypothesize it could be because of the speed of the VM?? I noticed GraalVM is slower...
Thanks!

@dram
Copy link
Author

dram commented May 26, 2022

Hi @isCzech, according to the test case I mentioned in previous comment #159 (comment), Cuis's #terminate is also broken with TruffleSqueak, it will crash the VM.

@isCzech
Copy link

isCzech commented May 26, 2022

Hi @isCzech, according to the test case I mentioned in previous comment #159 (comment), Cuis's #terminate is also broken with TruffleSqueak, it will crash the VM.

True, this [trufflesqueak] Unwind error: ... in both cases is weird but I can't say what it means; I guess it's related to the TruffleSqueak VM which I know nothing about :)

@isCzech
Copy link

isCzech commented May 26, 2022

Hi @dram, I have some preliminary results:

  1. I've fixed the test ProcessTest-testTerminateEnsureAsStackTop, you had a good hunch something was wrong - it was supposed to work with the change you suggested (top runUntilReturnFrom: ctx); check this:
    ProcessTest-testTerminateEnsureAsStackTop-fixed.zip

  2. The change you suggested (top runUntilReturnFrom: ctx) should not change the semantics of termination (but I haven't tested it as thoroughly as the current version and there might be some border situations, e.g. when the unwind context is at the bottom of the stack); what it does is it just finishes executing the unwind context and if you look at the code, finishing it should be harmless; however - it is not very logical to finish what is not supposed to be finished (the process has been terminated and only unwind blocks are supposed to be executed). What's most interesting is it fixes your problem with TruffleSqueak :) BUT I still think there's something off with TruffleSqueak VM because the new terminate works on both Squeak and Cuis (actually it runs on Pharo as well!) so I'm inclined to believe it's TruffleSqueak VM's issue (I guess your results above also point in that direction if I understand correctly. And nice example "reached here", thanks!).

  3. For info: Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs are supposed to work with the latest VM with #suspend primitive 578 (without it #isTerminated needs fixing) and Kernel-jar.1447-Terminate.1.cs (enclosed above) should work with the older VMs and with TruffleSqueak's (In both cases there may be some tests that need fixing, e.g. testAtomicSuspend)

  4. I'd be very interested if you find out more - whether it's a VM problem and what's wrong; I traced the computation and can't find anything wrong. It's really surprising that the small change you suggested causing the unwind context finish executing fixes the issue for you! Maybe it has something to do with the fact that the top of the stack in this particular case is just a primitive call. But I have no experience with that :)

Thanks for letting me know.
best,
Jaromir

@dram
Copy link
Author

dram commented May 26, 2022

Hi @isCzech,

I kind of think that the new #terminate may rely on some "tricky" feature in OpenSmalltak VM, which is not supported in TruffleSqueak. So there are two approaches to fix this problem:

  1. tweak #terminate to not rely on that feature
  2. add that feature to TruffleSqueak

Anyway, the first thing to do would be determine what that feature is. The error message and exception stack trace mentioned in #159 (comment) may give some hints.

@isCzech
Copy link

isCzech commented May 27, 2022

Hi @dram, why would you think it may be a "tricky" feature in OpenSmalltak VM and not in the TruffleSqueak VM? ;)

One of the differences between the old and new terminate is the old terminate uses the simulation machinery (the whole unwind is simulated using self popTo: suspendedContext bottomContext). The new terminate on the other hand replaced all simulation with a "regular", "no simulation, no exceptions" code. One of the advantages is you can more easily debug (debugging a simulation is tricky).

I've compared the termination step by step in both the failing case and the "fixed" case with top runUntilReturnFrom: ctx line replaced: The simulation (debugging) in both cases is identical and in both cases leads to correct termination of the tickler process. Debugging the same thing with the old version of terminate crashed the VM but that's expected with regard to the note above (debugging a simulation...).

Now, from that follows, I hope, that if the simulation (debugger) correctly terminates the event tickler process but the live run fails, the difference is most likely within the TruffleSqueak VM. So I'd say #terminate in TruffleSqueak relies on being run via the simulation machinery (#popTo & company). Once the TruffleSqueak VM trie to terminate using "regular" code, it fails, while at the same time when running the same "regular" code as a simulation (in debugger) it works correctly.

(It's a mystery why changing the one line "fixes" the issue for TruffleSqueak but as discussed in #159 (comment). I say it's not really a fix - it's rather a reminder something doesn't work as expected :) )

What's your thought?

Hi @fniephaus, does the exception stack trace mentioned in #159 (comment) tell anything to you? Would you agree with the conclusion above?
Thanks!

@fniephaus
Copy link
Member

Hi @dram and @isCzech,
Thanks a lot for digging into this so much. Yes, I agree: it very much looks like there is at least one bug in TruffleSqueak somewhere. However, that "somewhere" could be relatively hard to find because scheduling, in general, is rather complex and has some interesting corner cases.

What would be very helpful are very short doIts that demonstrate problems in TruffleSqueak or a difference in OSVM vs TruffleSqueak behavior. Something like the doIt in #159 (comment) and some of the ProcessTests are already rather long. A lot of change sets have been posted in this thread and I almost lost track: how do I obtain a Squeak/Smalltalk image that has all the "right" methods and tests that can be expected to land in the next Squeak release?

@isCzech
Copy link

isCzech commented May 27, 2022

A lot of change sets have been posted in this thread and I almost lost track: how do I obtain a Squeak/Smalltalk image that has all the "right" methods and tests that can be expected to land in the next Squeak release?

The crucial point that makes the difference for TruffleSqueak image is the replacement @dram discovered in this comment. The suggested change will make TruffleSqueak work but I can't see any logical explanation why - IMO the answer lies within the TruffelSqueak VM. For OpenSmalltalk VM @dram's change has no effect and the original version (not working in TruffleSqueak) is more logical/consistent.

As for the expected release image, I propose including methods in Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs (attached previously) - but these two will require the latest VM with the 578 suspend primitive. The final decision should be made soon (two weeks timeframe?)

Just to make sure: the issue doesn't appear to be a timing or other issue with the event tickler.

@isCzech
Copy link

isCzech commented May 27, 2022

Just to make sure: the issue doesn't appear to be a timing or other issue with the event tickler.

Maybe it's too soon to rule this out... Cuis image with the new terminate seems to work fine with TruffleSqueak VM in terms of saving the image (and terminating the even tickler) but @dram's observation here shows there's an issue with Cuis as well.

@fniephaus
Copy link
Member

I have a hunch what's going on: Process>>#terminate was changed and the actual termination logic was moved into an ensure block:

	[] ensure: [
		self suspend.
		context := suspendedContext ifNil: [^self].
		suspendedContext := 
			[context releaseCriticalSection; unwindTo: nil. self suspend] asContext.
		self priority: Processor activePriority + 1; resume]

It seems that the ensure block is never executed as part of the top-level unwinding logic:

protected final ContextObject doUnwind(final ContextObject startContext, final ContextObject targetContext, final Object returnValue) {
ContextObject context = startContext;
while (context != targetContext) {
final AbstractSqueakObject sender = context.getSender();
if (!(sender instanceof ContextObject)) {
CompilerDirectives.transferToInterpreter();
getContext().printToStdErr("Unwind error: sender of", context, "is nil, unwinding towards", targetContext, "with return value:", returnValue);
break;
}
context.terminate();
context = (ContextObject) sender;
}
targetContext.push(returnValue);
return targetContext;

That, at least, explains why all three processes of ProcessTest>>#testTerminateEnsureAsStackTop are never terminated. I'm trying to work out a fix, but this part of the VM is everything but trivial.

@isCzech
Copy link

isCzech commented May 27, 2022

Interesting! I wish you were right :) Cuis use the same approach; they just use a slightly different form: they use a method to get the receiver evaluated as an unwind block:

valueEnsured
	[] ensure: self

Applying it to the Squeak's #terminate at TruffleSqueak makes no difference though...

@dram
Copy link
Author

dram commented May 27, 2022

Hi @isCzech, while rethinking about runUntilErrorOrReturnFrom:, I realized that if the whole point of using it instead of value is to deal with non-local return, why not do something like following:

unwindBlock hasMethodReturn ifTrue: [
    ... runUntilErrorOrReturnFrom: ...
] ifFalse: [
    unwindBlock value
]

After this change, image saving in TruffleSqueak works with no problem.

@isCzech
Copy link

isCzech commented May 27, 2022

Hi @dram, nice try ;)
evaluate this though:
[[] valueWithExit] hasMethodReturn
Yes, the whole point of using #runUntilReturn is to deal with non-local returns. They may be nested however. But it's a great experiment, thanks!

@dram
Copy link
Author

dram commented May 28, 2022

Hi @isCzech, thanks for pointing to valueWithExit, really a complex story. :)

Anyway, following is another try:

  1. filed in Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs into Squeak6.0beta-21829-64bit.image, and running under Open Smalltalk VM
  2. do following change in Context>>#unwindTo:safely:
    	[ctx isNil] whileFalse: [
    		(ctx tempAt: 2) ifNil: [
    			ctx tempAt: 2 put: true.
    -			top := (ctx tempAt: 1) asContextWithSender: ctx.  "see the note below"
    -			top runUntilReturnFrom: top].
    +			[(ctx tempAt: 1) value] on: BlockCannotReturn do: []].
    		ctx := ctx findNextUnwindContextUpTo: aContext]
  3. Run all tests in KernelTests-Processes group

All tests passed, so I'm a bit curious that why not use [... value] on: BlockCannotReturn do: [] instead of runUntilReturnFrom:? In my opinion semantics of these two are similar, and the former is more simple and straightforward.

@isCzech
Copy link

isCzech commented May 28, 2022

Hi @dram, this one's even better but nope, try this:
[ [[Processor activeProcess terminate] ensure: [1/0]] on: ZeroDivide do: [] ] fork
You have to run unwind blocks on the original stack to make sure you won't miss exception handlers.
But you've discovered a missing test situation - thanks!

@fniephaus
Copy link
Member

I just remembered that lots of Process-related tests from Cuis fail or don't even terminate on TruffleSqueak:

#ProcessTerminateBug -> #(#testTerminationDuringNestedUnwind5 #testUnwindFromForeignProcess).
#ProcessTerminateUnwindNonLocalReturn -> #(#test1ATerminate #test1BTerminate #test1CTerminate #test1DTerminate #test2ATerminate #test2BTerminate #test2CTerminate #test2DTerminate #test3ATerminate #test3BTerminate #test3CTerminate #test3DTerminate #test4ATerminate #test4BTerminate #test4CTerminate #test4DTerminate #test5ATerminate #test5BTerminate #test5CTerminate #test5DTerminate #test6ATerminate #test6BTerminate #test6CTerminate #test6DTerminate #test7ATerminate #test7BTerminate #test7CTerminate #test7DTerminate #test8ATerminate #test8BTerminate #test8CTerminate #test8DTerminate).
#SemaphoreTest -> #(#testSemaAfterCriticalWait).
#TestValueWithinFix -> #(#testValueWithinNonLocalReturnFixReal #testValueWithinNonLocalReturnFixSimply).
} collect: [:assoc | | testCase |
testCase := Smalltalk at: assoc key.
assoc value do: [:sel | nonTerminatingTestCases add: (testCase selector: sel) ]].
StdIOWriteStream stdout newLine; nextPutAll: 'Non-terminating TestCases:'; newLine; flush.
nonTerminatingTestCases do: [:ea | StdIOWriteStream stdout nextPutAll: '- ', ea asString; newLine; flush ].
StdIOWriteStream stdout newLine; flush.
failingTests := OrderedCollection new.
{
#FloatTest -> #(#testHashWithSmallishLargeNegativeInteger #testHashWithSmallishLargePositiveInteger #testIsDenormalized #testPrimDivideBy #testPrimTruncated).
#IntegerTest -> #(#testRange "TruffleSqueak supports 64-bit SmallIntegers").
#ProcessorTest -> #("flaky" #testGrabProcessor #testGrabProcessorOnlyForNoTimeout #testGrabProcessorOnlyForTimeout #testValueUnpreemptively).
#ProcessTest -> #(#testTerminateInEnsure #testValueEnsured).

So I guess there are plenty of tests that could potentially help find the bug(s) in TruffleSqueak.

@isCzech
Copy link

isCzech commented May 28, 2022

These are the tests added to complement the new (fixed/extended) termination logic introduced last year. They should work with newer Squeak or Cuis images...

@dram
Copy link
Author

dram commented May 28, 2022

Hi @isCzech,

For the exception case, it can be fixed in this way:

	[ctx isNil] whileFalse: [
		(ctx tempAt: 2) ifNil: [
			ctx tempAt: 2 put: true.
-			top := (ctx tempAt: 1) asContextWithSender: ctx.  "see the note below"
-			top runUntilReturnFrom: top].
+			[(ctx tempAt: 1) value]
+				on: Exception do: [:ex |
+					(ctx nextHandlerContextForSignal: ex) ifNotNil: [:hdl | hdl handleSignal: ex]]].
		ctx := ctx findNextUnwindContextUpTo: aContext]

But it will be failed for following case, which is based on your version, with more nested ensure::

[
	[
		[
			[
				[Processor activeProcess terminate] ensure: [Transcript showln: 1. 1 / 0]
			] ensure: [Transcript showln: 2]
		] on: ZeroDivide do: [Transcript showln: 3]
	] ensure: [Transcript showln: 4]
] fork

Need more investigation.

@dram
Copy link
Author

dram commented May 28, 2022

Found the problem, fireHandlerActionForSignal: should be used instead of handleSignal:, i.e.:

	[ctx isNil] whileFalse: [
		(ctx tempAt: 2) ifNil: [
			ctx tempAt: 2 put: true.
-			top := (ctx tempAt: 1) asContextWithSender: ctx.  "see the note below"
-			top runUntilReturnFrom: top].
+			[(ctx tempAt: 1) value]
+				on: Exception do: [:ex |
+					(ctx nextHandlerContextForSignal: ex) ifNotNil: [:hdl | hdl fireHandlerActionForSignal: ex]]].
		ctx := ctx findNextUnwindContextUpTo: aContext]

For this version, all tests in KernelTests-Processes group are passed under Open Smalltalk VM, and image saving works under TruffleSqueak.

@isCzech
Copy link

isCzech commented May 28, 2022

Interesting solution... You're searching for handlers on two disjointed stacks instead of joining the stacks :) I wonder what would happen in case of more complicated exceptions like chained outer or other crazy scenarios though. It seems to me joining stacks and executing on the joint stack is what the methods have been developed and tested for. Not saying it wouldn't work, just that it'd require some deep thinking :)

I'd still prefer finding out why TruffleSqueak's VM executes the same code differently from the OSVM and take it from there. Fabio seemed to find something suspicious so I'm curious what his findings will be.

Thanks!

@dram
Copy link
Author

dram commented May 29, 2022

Hi @isCzech,

While experiment with exceptions and #terminate, I find following DoIt will raise a BlockCannotReturn exception in Open Smalltalk VM, both with and without your patches, as #cull: is used indirectly in Context>>#handleSignal::

[
	[
		[Processor activeProcess terminate] ensure: [1 / 0]
	] on: ZeroDivide do: [^ nil]
] fork

After some more investigation, I kind of think that it may be hard or impossible to make #terminate do right for all situations, as there are several languages just deprecate or not recommend to use similar functions, see Java, .NET and Swift.

So if resource cleanup is needed, some other mechanisms should be used, e.g. see this Java document.

Anyway, I'm also looking forward to @fniephaus's findings. Hope this issue can be solved, as image saving is quite a fundamental feature.

@isCzech
Copy link

isCzech commented May 29, 2022

Hi @dram,
but raising the BlockCannotReturn is the right thing :) The non-local return ^nil is trying to return to a different thread (i.e. to the UI thread) which is a functionality (currently) prohibited in Smalltalk. In theory such scenarios could work but it would need to be implemented.

Anyway, I'm also looking forward to @fniephaus's findings. Hope this issue can be solved, as image saving is quite a fundamental feature.

Saving is overrated ;)

@isCzech
Copy link

isCzech commented May 29, 2022

@dram, try [^2] fork - same thing...

@dram
Copy link
Author

dram commented May 29, 2022

Hi @isCzech, that makes sense, thanks!

@isCzech
Copy link

isCzech commented May 30, 2022

I have a hunch what's going on: Process>>#terminate was changed and the actual termination logic was moved into an ensure block:

	[] ensure: [
		self suspend.
		context := suspendedContext ifNil: [^self].
		suspendedContext := 
			[context releaseCriticalSection; unwindTo: nil. self suspend] asContext.
		self priority: Processor activePriority + 1; resume]

It seems that the ensure block is never executed as part of the top-level unwinding logic:

protected final ContextObject doUnwind(final ContextObject startContext, final ContextObject targetContext, final Object returnValue) {
ContextObject context = startContext;
while (context != targetContext) {
final AbstractSqueakObject sender = context.getSender();
if (!(sender instanceof ContextObject)) {
CompilerDirectives.transferToInterpreter();
getContext().printToStdErr("Unwind error: sender of", context, "is nil, unwinding towards", targetContext, "with return value:", returnValue);
break;
}
context.terminate();
context = (ContextObject) sender;
}
targetContext.push(returnValue);
return targetContext;

That, at least, explains why all three processes of ProcessTest>>#testTerminateEnsureAsStackTop are never terminated. I'm trying to work out a fix, but this part of the VM is everything but trivial.

Hi @fniephaus, I wonder if you found something interesting :) I guess it's worth finding out what's causing the irregular behavior but in any case we can try to replace the line top runUntilReturnFrom: ctx as per @dram's suggestion (nice observation!) in Squeak's implementation.

@fniephaus
Copy link
Member

Hi @fniephaus, I wonder if you found something interesting :)

The problem in TruffleSqueak is indeed that the top-level unwinding logic neither handles #aboutToReturn:through: nor #cannotReturn:. I have something running (can save image without generating orphan processes) but I need more time to make things work (there are some edge cases that the OSVM doesn't have to deal with).

So from the TruffleSqueak side, there is no requirement how things should be done in Squeak. However, I think we should put primitive 578 to work asap so that it can be tested as much as possible before the release. If someone could notify me here when this is done, I'm happy to take another look and add support for 578 to TruffleSqueak.

@isCzech
Copy link

isCzech commented May 30, 2022

Thanks for the info; I'll let you know once the support for the new suspend primitives is merged.

@marceltaeumel
Copy link

I'll let you know once the support for the new suspend primitives is merged.

It's merged now.

@fniephaus
Copy link
Member

Thanks again everyone for the detailed analysis around the new process termination logic and for ultimately helping to improve TruffleSqueak!

I have managed to fix saving of Squeak6.0beta images in TruffleSqueak. As mentioned before, TruffleSqueak neither send #aboutToReturn:through: nor #cannotReturn: when unwinding the stack on the top level. In case anyone is interested, here are the changes: 4821996...e5d7db9. As another result of this, I was also able to enable more process-related tests in Cuis (see bf18db2).

An implementation of primitive 578 is also waiting to be merged once we upgrade TruffleSqueak's image so that it is based on the upcoming Squeak 6.0 release.

@dram
Copy link
Author

dram commented Jun 12, 2022

Fix confirmed, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants