Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Main thread spread exception when thread-mgr enabled #1889

Merged
merged 20 commits into from
Jan 20, 2023

Conversation

xujuntwt95329
Copy link
Collaborator

No description provided.

@@ -1794,6 +1798,9 @@ wasm_runtime_call_wasm(WASMExecEnv *exec_env,
result_argc, argv)) {
wasm_runtime_set_exception(exec_env->module_inst,
"the result conversion is failed");
#if WASM_ENABLE_THREAD_MGR != 0
wasm_cluster_spread_exception(exec_env);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also spread exception when wasm_runtime_prepare_call_function fails, L1767?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to our discussion, I‘ve moved the spread into wasm_set_exception function, so we don't need to process the spread elsewhere

@@ -902,7 +902,7 @@ create_exports(AOTModuleInstance *module_inst, AOTModule *module,
static bool
clear_wasi_proc_exit_exception(AOTModuleInstance *module_inst)
{
#if WASM_ENABLE_LIBC_WASI != 0
#if (WASM_ENABLE_LIBC_WASI != 0) && (WASM_ENABLE_THREAD_MGR == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why skip clear the exception for multi-threads?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored, thanks~

@@ -4021,7 +4021,7 @@ fast_jit_call_func_bytecode(WASMModuleInstance *module_inst,
static bool
clear_wasi_proc_exit_exception(WASMModuleInstance *module_inst)
{
#if WASM_ENABLE_LIBC_WASI != 0
#if (WASM_ENABLE_LIBC_WASI != 0) && (WASM_ENABLE_THREAD_MGR == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, why skip clear for multi-threads?

@yamt
Copy link
Collaborator

yamt commented Jan 16, 2023

  • i guess you need some kind of a retry loop as other threads can create threads in the meantime.
  • i guess traverse_list should require a lock. (cluster->lock?) but it isn't a fault of this PR.

@xujuntwt95329 xujuntwt95329 changed the title Main thread spread Main thread spread exception when thread-mgr enabled Jan 16, 2023
@xujuntwt95329
Copy link
Collaborator Author

Thanks for the suggestions.

  • i guess you need some kind of a retry loop as other threads can create threads in the meantime.

I think we can add a status field to record the status of a cluster. When we spreading the exception:

  1. get cluster->lock
  2. set cluster's status to exception
  3. spread the exception

Then if there are any other threads creating new thread at the meantime, we directly fail it if the cluster's status is exception

@yamt @wenyongh how do you think about this solution?

  • i guess traverse_list should require a lock. (cluster->lock?) but it isn't a fault of this PR.

Yes, and traverse_list is used for different lists, I‘ll analyze every situation and add the locks if necessary

@yamt
Copy link
Collaborator

yamt commented Jan 16, 2023

Thanks for the suggestions.

  • i guess you need some kind of a retry loop as other threads can create threads in the meantime.

I think we can add a status field to record the status of a cluster. When we spreading the exception:

1. get `cluster->lock`

2. set cluster's status to `exception`

3. spread the exception

Then if there are any other threads creating new thread at the meantime, we directly fail it if the cluster's status is exception

@yamt @wenyongh how do you think about this solution?

i think it works.

  • i guess traverse_list should require a lock. (cluster->lock?) but it isn't a fault of this PR.

Yes, and traverse_list is used for different lists, I‘ll analyze every situation and add the locks if necessary

thank you.

@wenyongh
Copy link
Contributor

Thanks for the suggestions.

  • i guess you need some kind of a retry loop as other threads can create threads in the meantime.

I think we can add a status field to record the status of a cluster. When we spreading the exception:

  1. get cluster->lock
  2. set cluster's status to exception
  3. spread the exception

Then if there are any other threads creating new thread at the meantime, we directly fail it if the cluster's status is exception

@yamt @wenyongh how do you think about this solution?

It looks good to me.

bh_assert(module_inst_comm->module_type == Wasm_Module_Bytecode
|| module_inst_comm->module_type == Wasm_Module_AoT);

const char *exception = wasm_get_exception(module_inst);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had better declare the variable at the beginning of the function, a concern is that old version compiler might report warning.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

#if WASM_ENABLE_THREAD_MGR != 0
wasm_cluster_spread_exception(
wasm_clusters_search_exec_env((WASMModuleInstanceCommon *)module_inst),
exception ? false : true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When exception is NULL, does it mean to clear exception of other threads?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@xujuntwt95329
Copy link
Collaborator Author

When adding lock for traverse_list, we will meet dead lock issue: in wasm_cluster_terminate_all_except_self, we get the lock, and call os_thread_join to wait, but the exiting thread will also try to get the lock so dead lock occurred.

I use this strategy to avoid this:

  • add a new field processing in cluster
  • in wasm_cluster_terminate_all_except_self we set processing to true, and unlock the lock before calling thread_join
  • other threads can still get this lock, but in wasm_cluster_spawn_exec_env and wasm_cluster_create_thread it will fail directly if processing is true
  • wasm_cluster_terminate_all_except_self clear processing flag once it finished

@@ -956,7 +941,8 @@ execute_start_function(AOTModuleInstance *module_inst)
u.f(exec_env);

wasm_exec_env_destroy(exec_env);
(void)clear_wasi_proc_exit_exception(module_inst);
(void)clear_wasi_proc_exit_exception(
(WASMModuleInstanceCommon *)module_inst);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems no need to clear wasi proc exit exception here. WASI module doesn't set internal start func index,instead it exports a function named "_start". If here it really call wasi proc exit,we can also let instantiation process failed. @lum1n0us what is your opinion?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with that, and updated

@@ -1743,6 +1743,30 @@ wasm_runtime_finalize_call_function(WASMExecEnv *exec_env,
}
#endif

bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can set to static if the above is ok

traverse_list(&cluster->exec_env_list, terminate_thread_visitor, NULL);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does traverse_list add lock for the list? why not remove L845 and L849

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terminate_thread_visitor, we will call os_thread_join to wait for other threads to exit, and the exited thread need to get cluster->lock for accessing the list, so we can't hold the lock during terminate_thread_visitor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks.

Copy link
Contributor

@wenyongh wenyongh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wenyongh wenyongh merged commit cadf9d0 into bytecodealliance:main Jan 20, 2023
@wenyongh wenyongh mentioned this pull request Jan 20, 2023
victoryang00 pushed a commit to victoryang00/wamr-aot-gc-checkpoint-restore that referenced this pull request May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants