One of the trickier problems when coding for an embedded system is ensuring that each task has enough stack space for its needs.
Most microcontroller units (MCUs) have limited amounts of RAM. On a general-purpose machine, for example a Linux desktop, a potential stack overflow could in theory be recognised and additional stack space allocated; even if that's not possible, a large stack can be pre-allocated using virtual memory without any immediate cost.
This can't be done when the MCU only has real memory, and a stack overflow may or may not be caught. If you're lucky, there'll be enough information left behind for you to use the debugger to find the reason for the resulting crash straight away; if not, it could take a while. Under FreeRTOS you're very likely to get a hard fault to add to the joy (stacks and task control blocks live close together, so running off the end of your stack is likely to trample on the stored state of another task, making it try to access invalid memory).
The Python program stack_usage.py is intended to help with this (it's not a panacea, though! If you have AdaCore support, you'll be better off using GNATstack).
The initial motivation for this work was a hard fault encountered while writing a test program to check that Ada timing events work properly (well, usably) with the FreeRTOS-based Cortex GNAT RTS.
stack_usage has been developed on macOS Mojave using Python 2.7 and 3.7 and PLY (Python Lex and Yacc). To install PLY,
pip install --user ply
It relies on the information generated by the GCC compiler (FSF GCC 10 or later, GNAT GPL 2015 or later) using the switch -fcallgraph-info=su,da.
usage: stack_usage.py [flags] <input .ci files>
flags:
-h, --help: output this message
-s, --save=file: save data in file
-l, --load=file: restore previously saved data
-o, --output=file: file for CSV output (D=stack_usage.csv)
-d, --diagnostics: output diagnostic info on parsing
Notes:
- When data is saved, it's in Python's pickle form.
- The diagnostics output is only useful while developing the parsing information.
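Since the saved data is an ordinary Python pickle, it can be loaded for ad-hoc inspection. This is just a minimal sketch: the structure of the unpickled object is internal to stack_usage.py, so treat it as exploratory only.
import pickle

# Load data previously saved with --save=microbit.p; the layout of
# the object is an implementation detail of stack_usage.py and may
# change between versions.
with open("microbit.p", "rb") as f:
    data = pickle.load(f)

print(type(data))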
As noted above, the initial motivation for this work was a hard fault encountered while writing a test program for Ada timing events with the FreeRTOS-based Cortex GNAT RTS.
Timing events should, per ARM D.15(25), "be executed directly by the real-time clock interrupt mechanism". This wasn't done in Cortex GNAT RTS, because of the difficulty of ensuring proper inter-task and inter-interrupt protection (especially tricky on the micro:bit, because it uses a Cortex-M0 part); instead, a highest-priority task checks timings and runs handlers when required.
Handlers are protected procedures, and code for them is generated as for any protected procedure, with the proper locking mechanisms.
The Timer task was specified as
task Timer is
   pragma Priority (System.Priority'Last);
end Timer;
and implemented, comments elided, as
task body Timer is
   Next   : Time := Time_First;
   Period : constant Time_Span := Milliseconds (5);
begin
   loop
      Process_Queued_Events (Next_Event_Time => Next);
      delay until (if Next = Time_First then Clock + Period else Next);
   end loop;
end Timer;
In Cortex GNAT RTS, an Ada task is created as a FreeRTOS task whose initial procedure is a wrapper which allocates any secondary stack required out of the task's stack (in this case, 10% of the task's default 768 bytes) and invokes a compiler-generated wrapper round the user-written task body.
The event handlers for this software are specified as
protected LED_Event_Handling is
   pragma Interrupt_Priority;
   procedure Handle
     (Event : in out Ada.Real_Time.Timing_Events.Timing_Event);
private
   ...
end LED_Event_Handling;
and are implemented by the compiler as a wrapper, which performs any locking and calls the user-written handler body. This wrapper is called directly by the timer task, so any stack used by the handler body comes out of the timer task's stack.
It turned out that much of the problem was that the standard compilation options included -O0 (no optimisation), which led to the wrapper procedure (see above) using 300 bytes more stack than the 40 bytes it used with -Og (which "offer[s] a reasonable level of optimization while maintaining fast compilation and a good debugging experience").
A further improvement was to eliminate the Timer task's secondary stack and bump its main stack:
task Timer
  with
    Priority             => System.Priority'Last,
    Storage_Size         => 1024,
    Secondary_Stack_Size => 0
is
   pragma Task_Name ("events_timer");
end Timer;
At this point, it seemed like a good idea to implement stack_usage.py. I also added callgraph code generation as an option to the Cortex GNAT RTS build process.
To generate raw callgraph info for the micro:bit RTS,
cd ~/cortex-gnat-rts/microbit
make CALLGRAPH=yes clean all install
To generate raw callgraph info for the events test program,
cd ~/cortex-gnat-rts/test-microbit
make clean
gprbuild -p -P testbed events -cargs -Og -fcallgraph-info=su,da
To process and save the RTS's information,
cd ~/stack_usage
./stack_usage.py \
--save=microbit.p \
~/cortex-gnat-rts/microbit/.build/*.ci
To combine this saved RTS information with the events program's data,
./stack_usage.py \
--load=microbit.p \
--output=events.csv \
~/cortex-gnat-rts/test-microbit/.build/*.ci
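Before digging into the numbers, a few lines of Python (3) are handy for picking entries of interest out of the CSV. This sketch assumes nothing about the column layout beyond the output being ordinary CSV, and the search string "timer" is just an example.
import csv

# Print any row of the stack_usage.py output that mentions "timer"
# (for instance the timer task body and its wrapper); we only assume
# the file is plain CSV, not any particular column layout.
with open("events.csv", newline="") as f:
    for row in csv.reader(f):
        if any("timer" in cell.lower() for cell in row):
            print(", ".join(row))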
Looking at events.csv, we find that ada.real_time.timing_events.timerTKVIP (called during elaboration to create the Timer task) uses 264 bytes (from the environment task's stack). The system.tasking.restricted.stages.wrapper procedure uses 40 bytes of the Timer task's stack, and calls (via a pointer, so invisibly to stack_usage.py) ada.real_time.timing_events.timerTKB, the task's body, which uses 296 bytes (including 64 bytes used by ada.real_time.timing_events.process_queued_events).
Again, stack_usage.py can't trace into the handler procedure, because it's called via a pointer:
if Handler /= null then
   Handler.all (Timing_Event (Next_Event.all));
end if;
so we have to use our knowledge of the actual handlers to find that it's event_support.led_event_handling.handleN (312 bytes), wrapped by event_support.led_event_handling.handleP (which organises locking), for a total of 328 bytes.
The total stack usage for the timer task is then predicted to be 40 + 296 + 328 = 664 bytes.
This is a worst-case value: measuring the actual free space showed a peak usage of only 424 bytes.
Why?!
The worst-case usage for a particular subprogram is the amount it uses for its own purposes (register storage, local variables) plus the maximum used by any of the subprograms it calls. If a particular execution pattern doesn't call that maximal subprogram, then the actual usage will be lower. In this case, one of the most expensive called subprograms was 64-bit scaled arithmetic division, at 192 bytes, called by Ada.Real_Time.Time_Of at 248 bytes. I'm assuming that this was never actually invoked.
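The rule is easy to state in code. The sketch below (Python 3) is not how stack_usage.py works internally; it just applies the worst-case calculation to a hypothetical call graph, with frame sizes loosely echoing the figures above, to show why an expensive callee that never actually runs (like the 64-bit division here) inflates the prediction.
# Hypothetical per-subprogram frame sizes (bytes) and call graph;
# the names and numbers are illustrative, not read from events.csv.
frame = {
    "wrapper": 40,
    "task_body": 232,
    "process_queued_events": 64,
    "time_of": 56,
    "divide64": 192,
}
calls = {
    "wrapper": ["task_body"],
    "task_body": ["process_queued_events", "time_of"],
    "time_of": ["divide64"],
}

def worst_case(subprogram):
    # Worst case = own frame plus the largest worst case among the
    # callees (zero if there are none). Cycles aren't handled here.
    return frame[subprogram] + max(
        (worst_case(callee) for callee in calls.get(subprogram, [])),
        default=0)

print(worst_case("wrapper"))  # 520: includes divide64's 192 bytes
                              # even if that path never executes
If at run time the division path is never taken, the measured peak comes in below the predicted figure, which is exactly the discrepancy seen above.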
The tool currently ignores
- calls to subprograms not compiled with the required options (e.g. memcmp)
- dynamic objects (for example, declare blocks with variables sized at runtime)
- dispatching calls
At the very least it should mark the subprograms where there's a potential problem.
It might be helpful to see details of which subprograms are called by a given subprogram.