[KERNEL BUG] update_data: tss_list == NULL (dispatch.c:94, process = console (1), thread = unnamed (3) #21

perlun · 2015-04-20T17:43:58Z

This happens now when trying to start up the console and vga servers. I spent a bit of time looking into this now:

The tss_list variable gets set primarily from the thread_link method. It is also set in the thread_unlink method.
thread_link gets called 2-4 times before the problem happens (thanks to gdb for letting me see this). The value of tss_list seems to be completely sane on the last invocation of the method.
When breaking the code when the kernel bug occurs, this is what I see (slightly simplified):

(gdb) p *list
$13 = (tss_list_type *) 0xd490048
(gdb) p list
$14 = (tss_list_type **) 0x116b3c
(gdb) p tss_list
$15 = (tss_list_type *) 0xd490048
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
debug_run () at debug.c:787
787             while ((port_in_u8(DEBUG_KEYBOARD_STATUS) & DEBUG_KEYBOARD_INPUT_FULL) == 0);
(gdb) p tss_list
$18 = (tss_list_type *) 0x0
(gdb) p &tss_list
$19 = (tss_list_type **) 0x116b3c

So, in essence, something (whatever it is) is overwriting this memory location. For now, thread_unlink would be my best bet actually...

The text was updated successfully, but these errors were encountered:

perlun · 2015-05-05T10:27:11Z

OK, I think I know the cause for this now... It's this code: https://github.com/chaos4ever/chaos/blob/master/storm/ia32/port.c#L136-L155

If the TSS (including I/O map) is too small, it will get expanded. It will then do a thread_unlink first, and then a thread_link (the line after). That stuff doesn't seen to work entirely as intended...

perlun · 2015-05-05T10:39:05Z

Note to self (and to others who may be finding this thread when debugging something similar): watch tss_list, when using gdb + qemu is just so incredibly cool. It steps into the debugger whenever the variable changes. HOW COOL IS THAT! Now, I'm just minutes away from finding the reason for this bug...

perlun · 2015-05-05T19:55:36Z

9 hours later, and "minutes" seems to be a "programmer estimate"... 😄 I managed to solve #25, but I'm still struggling with something fishy:

Kernel:
  Alert: Illegal page fault!
  Version: storm 4d25b1b.
  Uptime: 0 timeslices (0 h, 0 m and 0 s).
Memory:
  Physical memory: 1972 KByte used. 30668 KByte free.
  Global memory: 10 KByte used. 131061 KByte free.
Causing process:
  Process: console (PROCESS ID 1).
  Thread: unnamed (THREAD ID 3).
  TSS: 0D491218
  Thread was dispatched 2 times.
  Exception occured at address 00000008:00000300
Registers:
  EAX: 0x00000006 EBX: 0x0D493018 ECX: 0x00000000 EDX: 0xFFFFFFFF
  ESP: 0xFC000F74 EBP: 0x00000001 ESI: 0xFFFFFF6C EDI: 0x00000001
   DS: 0x00000023  ES: 0x00000023  FS: 0x00000023  GS: 0x00000023
  CR2: 0x00000300 CR3: 0x0019E000              EFLAGS: 0x00000292
Stack pointer: 00005F80

I have no idea why it tries to jump into that crazy address (00000008:00000300) in the first place. I've tried adding assertions in dispatch_next to prevent it from jumping to such a task, but it doesn't work so maybe there's something even weirder going on here...

tss_list is non-NULL. And its content seems to be reasonably sane:

(gdb) p *tss_list
$2 = {previous = 0x0, next = 0xd490078, tss = 0xd491218, thread_id = 3}
(gdb) p *(tss_list_type *)tss_list->next
$5 = {previous = 0xd4900d8, next = 0xd490048, tss = 0xd491418, thread_id = 2}
(gdb) p *(tss_list_type *)0xd490048
$9 = {previous = 0xd490078, next = 0xd490028, tss = 0xd491018, thread_id = 1}
(gdb) p *(tss_list_type *)0xd490028
$10 = {previous = 0xd490048, next = 0x0, tss = 0xa000, thread_id = 0}

Four threads, all of which seem reasonable. So why on earth did it try to jump to that weird place? These are the kind of bugs I'd really not have to sit and debug, and even more so, not debug alone. It just doesn't make any sense, since these bugs feel incredibly hard to nail down on my own.

I placed breakpoints now in both the dispatch and dispatch_task_switcher methods to no avail. It doesn't seem to be either one of them that causes this, and those are the only methods I know of that performs task switching. Weird!!!

perlun · 2015-05-06T09:37:05Z

Alright, I know now what is causing this (from thread.c):

number_of_tasks++;
new_tss->esp = cpu_get_esp();
process_info->number_of_threads++;

The new stack pointer (ESP) is set to the current value of the stack pointer. But the value of the stack pointer at that point is not reliable. Hence, we get a bogus value on the stack and the ret instruction at the end of new_thread_entry will pop that bogus value off of the stack and try jumping to it. Kaboom! Page fault...

perlun · 2015-05-06T17:17:37Z

(It was my refactorings that broke stuff. I reverted a few commits and I am now closer to the original issue again.)

…n a NULL pointer sometimes (which could be causing #21)

perlun · 2015-05-06T19:42:17Z

Alright. I have it now. It is, surprise-surprise, caused by what it seems to be a bug in gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40115

The guy who posted this didn't got so good response. But actually, it behaves oddly for me; I can't really seem to use the label pointer in a reliable way. (The pointer is off by a few instructions.)

perlun · 2015-05-06T19:44:04Z

I'm thinking we change the thread_create syntax to take a function pointer as parameter instead. That's also how pthread_create works in POSIX, so it has the extra advantage in that it makes it easier to port Linux/Unix stuff to chaos... 😄

perlun · 2015-05-10T07:03:40Z

We will indeed change the thread_create syntax. I will set up a separate PR/feature branch for that. It shouldn't be so much work, but the downside is that all servers & programs relying on this functionality will need to be changed.

) It would previously use the PL0 stack for the newly created thread. This commit fixes this, so that the PL3 stack (which is much less limited in size) is used instead. The use of the PL0 stack was a bug I introduced while fixing #21 via pull request #30, but since it was three years ago I can honestly not say with certainty whether this was "by design" or "by accident". Either way, doing this fix is much better than (ab)using the PL0 stack since the latter is limited in size to a single 4 KiB page => way too little for certain recursive programs and larger programs in general. (This change was a drive-by change while I was researching the root cause for some issues mentioned in #120)

perlun self-assigned this Apr 20, 2015

perlun added a commit that referenced this issue May 6, 2015

Added assertion after memory_global_allocate, since it seems to retur…

0eb23f5

…n a NULL pointer sometimes (which could be causing #21)

perlun mentioned this issue May 10, 2015

Misc cleanups and closed #25 #26

Merged

perlun mentioned this issue May 10, 2015

Rewrite system_calls.pl in Ruby #27

Merged

perlun mentioned this issue May 19, 2015

Changed thread_create() to take the new thread entry point as parameter #30

Merged

perlun closed this as completed in #30 May 20, 2015

perlun mentioned this issue May 20, 2015

Make chaos bootable again #31

Closed

9 tasks

perlun mentioned this issue Jul 15, 2019

Deleting a thread causes Illegal page fault #155

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[KERNEL BUG] update_data: tss_list == NULL (dispatch.c:94, process = console (1), thread = unnamed (3) #21

[KERNEL BUG] update_data: tss_list == NULL (dispatch.c:94, process = console (1), thread = unnamed (3) #21

perlun commented Apr 20, 2015 •

edited

Loading

perlun commented May 5, 2015

perlun commented May 5, 2015

perlun commented May 5, 2015

perlun commented May 6, 2015

perlun commented May 6, 2015

perlun commented May 6, 2015

perlun commented May 6, 2015

perlun commented May 10, 2015

[KERNEL BUG] update_data: tss_list == NULL (dispatch.c:94, process = console (1), thread = unnamed (3) #21

[KERNEL BUG] update_data: tss_list == NULL (dispatch.c:94, process = console (1), thread = unnamed (3) #21

Comments

perlun commented Apr 20, 2015 • edited Loading

perlun commented May 5, 2015

perlun commented May 5, 2015

perlun commented May 5, 2015

perlun commented May 6, 2015

perlun commented May 6, 2015

perlun commented May 6, 2015

perlun commented May 6, 2015

perlun commented May 10, 2015

perlun commented Apr 20, 2015 •

edited

Loading