Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KERNEL BUG] update_data: tss_list == NULL (dispatch.c:94, process = console (1), thread = unnamed (3) #21

Closed
perlun opened this issue Apr 20, 2015 · 8 comments · Fixed by #30
Assignees

Comments

@perlun
Copy link
Contributor

perlun commented Apr 20, 2015

This happens now when trying to start up the console and vga servers. I spent a bit of time looking into this now:

  1. The tss_list variable gets set primarily from the thread_link method. It is also set in the thread_unlink method.
  2. thread_link gets called 2-4 times before the problem happens (thanks to gdb for letting me see this). The value of tss_list seems to be completely sane on the last invocation of the method.
  3. When breaking the code when the kernel bug occurs, this is what I see (slightly simplified):
(gdb) p *list
$13 = (tss_list_type *) 0xd490048
(gdb) p list
$14 = (tss_list_type **) 0x116b3c
(gdb) p tss_list
$15 = (tss_list_type *) 0xd490048
(gdb) cont
Continuing.
^C
Program received signal SIGINT, Interrupt.
debug_run () at debug.c:787
787             while ((port_in_u8(DEBUG_KEYBOARD_STATUS) & DEBUG_KEYBOARD_INPUT_FULL) == 0);
(gdb) p tss_list
$18 = (tss_list_type *) 0x0
(gdb) p &tss_list
$19 = (tss_list_type **) 0x116b3c

So, in essence, something (whatever it is) is overwriting this memory location. For now, thread_unlink would be my best bet actually...

@perlun perlun self-assigned this Apr 20, 2015
@perlun
Copy link
Contributor Author

perlun commented May 5, 2015

OK, I think I know the cause for this now... It's this code: https://github.com/chaos4ever/chaos/blob/master/storm/ia32/port.c#L136-L155

If the TSS (including I/O map) is too small, it will get expanded. It will then do a thread_unlink first, and then a thread_link (the line after). That stuff doesn't seen to work entirely as intended...

@perlun
Copy link
Contributor Author

perlun commented May 5, 2015

Note to self (and to others who may be finding this thread when debugging something similar): watch tss_list, when using gdb + qemu is just so incredibly cool. It steps into the debugger whenever the variable changes. HOW COOL IS THAT! Now, I'm just minutes away from finding the reason for this bug...

@perlun
Copy link
Contributor Author

perlun commented May 5, 2015

9 hours later, and "minutes" seems to be a "programmer estimate"... 😄 I managed to solve #25, but I'm still struggling with something fishy:

Kernel:
  Alert: Illegal page fault!
  Version: storm 4d25b1b.
  Uptime: 0 timeslices (0 h, 0 m and 0 s).
Memory:
  Physical memory: 1972 KByte used. 30668 KByte free.
  Global memory: 10 KByte used. 131061 KByte free.
Causing process:
  Process: console (PROCESS ID 1).
  Thread: unnamed (THREAD ID 3).
  TSS: 0D491218
  Thread was dispatched 2 times.
  Exception occured at address 00000008:00000300
Registers:
  EAX: 0x00000006 EBX: 0x0D493018 ECX: 0x00000000 EDX: 0xFFFFFFFF
  ESP: 0xFC000F74 EBP: 0x00000001 ESI: 0xFFFFFF6C EDI: 0x00000001
   DS: 0x00000023  ES: 0x00000023  FS: 0x00000023  GS: 0x00000023
  CR2: 0x00000300 CR3: 0x0019E000              EFLAGS: 0x00000292
Stack pointer: 00005F80

I have no idea why it tries to jump into that crazy address (00000008:00000300) in the first place. I've tried adding assertions in dispatch_next to prevent it from jumping to such a task, but it doesn't work so maybe there's something even weirder going on here...

tss_list is non-NULL. And its content seems to be reasonably sane:

(gdb) p *tss_list
$2 = {previous = 0x0, next = 0xd490078, tss = 0xd491218, thread_id = 3}
(gdb) p *(tss_list_type *)tss_list->next
$5 = {previous = 0xd4900d8, next = 0xd490048, tss = 0xd491418, thread_id = 2}
(gdb) p *(tss_list_type *)0xd490048
$9 = {previous = 0xd490078, next = 0xd490028, tss = 0xd491018, thread_id = 1}
(gdb) p *(tss_list_type *)0xd490028
$10 = {previous = 0xd490048, next = 0x0, tss = 0xa000, thread_id = 0}

Four threads, all of which seem reasonable. So why on earth did it try to jump to that weird place? These are the kind of bugs I'd really not have to sit and debug, and even more so, not debug alone. It just doesn't make any sense, since these bugs feel incredibly hard to nail down on my own.


I placed breakpoints now in both the dispatch and dispatch_task_switcher methods to no avail. It doesn't seem to be either one of them that causes this, and those are the only methods I know of that performs task switching. Weird!!!

@perlun
Copy link
Contributor Author

perlun commented May 6, 2015

Alright, I know now what is causing this (from thread.c):

number_of_tasks++;
new_tss->esp = cpu_get_esp();
process_info->number_of_threads++;

The new stack pointer (ESP) is set to the current value of the stack pointer. But the value of the stack pointer at that point is not reliable. Hence, we get a bogus value on the stack and the ret instruction at the end of new_thread_entry will pop that bogus value off of the stack and try jumping to it. Kaboom! Page fault...

@perlun
Copy link
Contributor Author

perlun commented May 6, 2015

(It was my refactorings that broke stuff. I reverted a few commits and I am now closer to the original issue again.)

perlun added a commit that referenced this issue May 6, 2015
…n a NULL pointer sometimes (which could be causing #21)
@perlun
Copy link
Contributor Author

perlun commented May 6, 2015

Alright. I have it now. It is, surprise-surprise, caused by what it seems to be a bug in gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40115

The guy who posted this didn't got so good response. But actually, it behaves oddly for me; I can't really seem to use the label pointer in a reliable way. (The pointer is off by a few instructions.)

@perlun
Copy link
Contributor Author

perlun commented May 6, 2015

I'm thinking we change the thread_create syntax to take a function pointer as parameter instead. That's also how pthread_create works in POSIX, so it has the extra advantage in that it makes it easier to port Linux/Unix stuff to chaos... 😄

@perlun
Copy link
Contributor Author

perlun commented May 10, 2015

We will indeed change the thread_create syntax. I will set up a separate PR/feature branch for that. It shouldn't be so much work, but the downside is that all servers & programs relying on this functionality will need to be changed.

@perlun perlun mentioned this issue May 20, 2015
9 tasks
perlun added a commit that referenced this issue Oct 8, 2018
)

It would previously use the PL0 stack for the newly created thread. This commit fixes this, so that the PL3 stack (which is much less limited in size) is used instead.

The use of the PL0 stack was a bug I introduced while fixing #21 via pull request #30, but since it was three years ago I can honestly not say with certainty whether this was "by design" or "by accident". Either way, doing this fix is much better than (ab)using the PL0 stack since the latter is limited in size to a single 4 KiB page => way too little for certain recursive programs and larger programs in general.

(This change was a drive-by change while I was researching the root cause for some issues mentioned in #120)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant