A first preemptive kernel on ARM Cortex-M3

1. Introduction

Operating systems are often considered overkill for embedded software. However, between a fully “hosted” multi-application system (running on a multithreaded, general-purpose OS) and a fully “standalone”, monolithic or “bare-metal” application-specific system, there are many variations we can go for.

In previous publications I explored the notion of cooperative tasks, from a loop calling tasks from an array of function pointers to something a little more complex, with the processes handled in a circular buffer under explicit time criteria. This time I will show a minimal preemptive implementation, which also paves the way for something a little more complex in other publications.

2. Preemptive versus Cooperative

In a fully cooperative system, the processor does not interrupt any task to accept another, and the task itself needs to release the processor for the next one to use it. There is nothing wrong with this. A Run-To-Completion scheme is sufficient, and sometimes necessary, for many applications, and many embedded systems were deployed this way, including some very complex ones. In the past, even non-embedded systems used a cooperative kernel (Windows 3.x, NetWare 4.x, among others). In a strictly cooperative system, if a task crashes, the entire system is compromised, because the crashed task keeps the processor from moving on to any other task (so in a server operating system like NetWare, which must serve multiple clients, this does not seem to be a good idea).

In preemptive mode, tasks are interrupted and later resumed, i.e., a context (the set of states in the processor registers) is saved and then retrieved. This adds complexity to the implementation but, if done well, it increases robustness and makes it possible to meet tighter timing requirements, mainly when combined with priority and/or periodicity criteria to manage the queue.

3. Call stack

A processor is, in fact, a programmable finite-state machine. With some simplification, each state of our program can be defined by the set of values in the core registers. This set dictates which point of the program is active, and it is called the context. Therefore, activating a task means pushing values to the call stack so that this task will be processed. To resume the task afterwards, it is necessary to save the call stack data at that point in the program. This “frozen” data represents a program state and is therefore called a stack frame. Every saved stack frame corresponds to a context. To resume a task, a previously saved stack frame is loaded back into the call stack.

In the ARM Cortex-M3, the 32-bit registers that define the active state of the processor are R0-R12 for general use and R13-R15 for special use (stack pointer, link register and program counter), in addition to the Program Status Register (xPSR). The xPSR value sits at the top of any stackframe, and it is not actually a single physical register but a composition of three (Application, Interrupt and Execution: APSR, IPSR and EPSR).

3.1. Load-store architectures

A load-store architecture is a processor architecture in which data from memory needs to be loaded into the core registers before being processed. Likewise, the result of an operation must be in a register before it can be stored in memory.

The two basic memory access operations on Cortex-M3:

// reads the data at the address given by Rn + offset and places it in Rd
LDR Rd, [Rn, #offset]
// stores the data contained in Rn at the address given by Rd + offset
STR Rn, [Rd, #offset]

It is important to understand at least the Cortex-M3 instructions shown below. I suggest sources [1] and [2] as good references.

MOV Rd, Rn // Rd = Rn
MOV Rd, #M // Rd = M, where M is an immediate constant that fits the instruction encoding
ADD Rd, Rn, Rm // Rd = Rn + Rm
ADD Rd, Rn, #M // Rd = Rn + M
SUB Rd, Rn, Rm // Rd = Rn - Rm
SUB Rd, Rn, #M // Rd = Rn - M
// PUSH saves Rn on the stack.
// The stack pointer is first decreased by 4 bytes, then Rn is stored at the new SP address
PUSH {Rn}
// POP loads Rn from the address the SP is pointing to, then increases SP by 4 bytes
POP {Rn}
B label // jump to label
BX Rm // jump to the address held in Rm
BL label // jump to label and save the return address in LR
CPSID I // disable interrupts
CPSIE I // enable interrupts

We will operate the M3 in Thumb state, the only one it supports. Its Thumb-2 instruction set mixes 16-bit and 32-bit encodings. According to ARM, this is done to improve code density while maintaining the benefits of a 32-bit architecture. Bit 24 of the PSR (the T bit) is always 1.

3.2. Stacks and stack pointer (SP)

A stack is a memory usage model [1]. It works in the Last In, First Out format (last to enter, first to leave). It is as if I organized a pile of documents to read: it is convenient that the first document to be read is at the top of the pile, and the last at the bottom.

We usually divide the memory between heap and stack. As said, the “call stack” will contain the temporary data that determines the next state of the processor. The heap contains data whose nature is not temporary in the course of the program (this does not mean “non-volatile”). The stack pointer is a kind of pivot that keeps control of the program flow, by pointing to some position of the stack.
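The PUSH/POP behaviour described above can be tried out on a PC with a toy model. The sketch below is only an illustration, not kernel code: the names toy_stack_t, toy_push and toy_pop are hypothetical, the stack holds 8 words, and sp is a word index instead of a byte address (so one step of the index corresponds to 4 bytes on the M3).

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 8u

/* Toy model of a descending stack. sp is an index into mem[];
 * sp == STACK_WORDS means the stack is empty. */
typedef struct {
    uint32_t mem[STACK_WORDS];
    uint32_t sp;
} toy_stack_t;

void toy_init(toy_stack_t *s)
{
    s->sp = STACK_WORDS;
}

/* PUSH: decrease SP by one word first, then store the value there. */
void toy_push(toy_stack_t *s, uint32_t value)
{
    assert(s->sp > 0u);               /* no room left: stack overflow */
    s->sp -= 1u;
    s->mem[s->sp] = value;
}

/* POP: load the value SP points to, then increase SP by one word. */
uint32_t toy_pop(toy_stack_t *s)
{
    assert(s->sp < STACK_WORDS);      /* nothing to pop: stack underflow */
    uint32_t value = s->mem[s->sp];
    s->sp += 1u;
    return value;
}
```

Pushing two values and popping twice returns them in reverse order and leaves sp where it started, which is exactly the property the context switch relies on.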

Figure 3. Model for using a stack. When saving data before processing (transforming) it saves the previous information. (Figure from [1])
Figure 4. Regions of memory mapped on a Cortex M3. The region available for the stack is confined between the addresses 0x20008000 and 0x20007C00. [1]

4. Multitasking on the ARM Cortex M3

The M3 offers two stack pointers (Main Stack Pointer and Process Stack Pointer) to isolate user processes from the kernel. Every interrupt service runs in Handler mode, which is always privileged and always uses the MSP. Application code runs in Thread mode, which can be privileged or unprivileged (user). It is not possible to go from user mode to privileged mode without going through an interruption, but it is possible to go from privileged mode to user mode by writing to the CONTROL register.

Figure 5. Changing context on an OS that isolates the kernel application [1]

The core also has dedicated hardware for switching tasks. The SysTick interrupt, driven by a periodic timer, can be used to implement time-based context switching. There are also software-triggered exceptions (traps), SVC and PendSV. SVC is raised when the application executes the SVC instruction to make a call to the system. PendSV is set pending by software and, by default, can only be triggered in privileged mode. It is usually suggested [1] to trigger it within the SysTick service, because there it is possible to keep track of the ticks to meet the time criteria. The SysTick interruption itself is served promptly, with no risk of losing any tick of the clock. A secure OS implementation would use both stack pointers to separate user and kernel threads, and would separate memory domains if an MPU (Memory Protection Unit) is available. At first, we will only use the MSP in privileged mode.

Figure 6. Memory layout on an OS with two stack pointers and protected memory [1]

5. Building the kernel

Kernel is a somewhat broad concept, but I believe that there is no OS whose kernel is not responsible for scheduling tasks. In addition, there must be IPC (inter-process communication) mechanisms. It is interesting to note the strong hardware dependency of the scheduler that will be shown, due to its low-level nature.

5.1. Stackframes and context switching

Remember: call stack = the registers of the core; stack or stackframe = state (values) of these registers saved in memory.

When a SysTick is served, part of the call stack is saved by the hardware: R0, R1, R2, R3, R12, R14 (LR), R15 (PC) and the PSR. Let’s call this portion the hardware stackframe. The remainder (R4-R11) is the software stackframe [3], which we must explicitly save and retrieve with the PUSH and POP instructions.

To think about our system, we can outline a complete context switch depicting the key positions the stack pointer assumes during the operation. (In the figure below the memory addresses increase from bottom to top. When SP points to R4, it is at a lower address than the PC saved on the stack.)

Figure 7. Switching contexts. The active task is saved by the hardware and the kernel. The stack pointer is re-assigned, according to pre-established criteria, to the R4 of the next stackframe to be activated. The data is retrieved. The task is performed. (Figure based on [3]. In the figure, “Salvo pelo hardware/kernel” means “saved/pushed by hardware/kernel” and “Resgatado pelo hardware/kernel” means “retrieved/popped by hardware/kernel”.)

When an interruption takes place, SP will be pointing to the top of the stack to be saved (SP(0)). This is inevitable because this is how the M3 works. On an interruption, the hardware saves the eight registers of the hardware stackframe at the 8 addresses below the stack pointer, stopping at SP(1). When we save the remaining registers, the SP will be pointing to the R4 of the current stack (SP(2)). When we reassign the SP to the address of the R4 of the next stack (SP(3)), the POP loads the values of R4-R11 into the call stack and the stack pointer is now at SP(4). Finally, the return from the interrupt pops the remaining stackframe, and SP(5) is at the top of the stack that has just been activated. (If you’re wondering where R13 is: it is the stack pointer itself.)
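The arithmetic behind these six positions can be checked with a small model that runs on a PC. This is a sketch under stated assumptions, not the context switcher itself: the names sp_trace_t and trace_switch are hypothetical, and positions are word indices into a stack array (a push lowers the index, as a push lowers the address on the M3).

```c
#include <stdint.h>

enum {
    HW_WORDS    = 8,                    /* xPSR, PC, LR, R12, R3-R0 */
    SW_WORDS    = 8,                    /* R11-R4                   */
    FRAME_WORDS = HW_WORDS + SW_WORDS   /* one complete saved frame */
};

/* The SP positions of Figure 7, as word indices. */
typedef struct {
    uint32_t sp0, sp1, sp2, sp3, sp4, sp5;
} sp_trace_t;

/* cur_top:       SP of the running task when the interrupt arrives.
 * next_saved_sp: sp stored in the TCB of the task to be resumed.   */
sp_trace_t trace_switch(uint32_t cur_top, uint32_t next_saved_sp)
{
    sp_trace_t t;
    t.sp0 = cur_top;             /* SP(0): top of the current stack          */
    t.sp1 = t.sp0 - HW_WORDS;    /* SP(1): hardware pushed its 8 registers   */
    t.sp2 = t.sp1 - SW_WORDS;    /* SP(2): kernel pushed R4-R11, SP at R4    */
    t.sp3 = next_saved_sp;       /* SP(3): SP reassigned to next task's R4   */
    t.sp4 = t.sp3 + SW_WORDS;    /* SP(4): kernel popped R4-R11              */
    t.sp5 = t.sp4 + HW_WORDS;    /* SP(5): interrupt return popped the rest  */
    return t;
}
```

The value that must be written to the TCB of the outgoing task is sp2, because that is what the next switch will use as its SP(3).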

The context switching routine is written in assembly and implements exactly what is described in Figure 7.

Figure 8. Context switcher

PS: When an interruption occurs, the LR takes on a special code (EXC_RETURN): 0xFFFFFFF9 if the interrupted thread was using the MSP, or 0xFFFFFFFD if the interrupted thread was using the PSP.
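These codes differ only in two bits, so they are easy to decode. The helpers below are a small sketch with hypothetical names; the third value, 0xFFFFFFF1 (return to Handler mode, used when one interrupt preempted another), is not used in this post and is listed only for completeness.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXC_RETURN_HANDLER_MSP 0xFFFFFFF1u  /* back to Handler mode, MSP */
#define EXC_RETURN_THREAD_MSP  0xFFFFFFF9u  /* back to Thread mode, MSP  */
#define EXC_RETURN_THREAD_PSP  0xFFFFFFFDu  /* back to Thread mode, PSP  */

/* Bit 2 selects the stack pointer used after the return: 0 = MSP, 1 = PSP. */
bool exc_return_uses_psp(uint32_t lr)
{
    return (lr & (1u << 2)) != 0u;
}

/* Bit 3 selects the mode returned to: 0 = Handler, 1 = Thread. */
bool exc_return_to_thread(uint32_t lr)
{
    return (lr & (1u << 3)) != 0u;
}
```

Since this first kernel runs everything on the MSP, the handlers here should always see 0xFFFFFFF9 in LR when they interrupt a task.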

5.2. Initializing the stacks for each task

For the above strategy to work, we need to initialize the stacks for each task accordingly. The sp of each task starts by pointing to R4. This is by definition the starting stack pointer of a task, as R4 sits at the lowest address in a frame.

In addition, we need to create a data structure that correctly points to the stacks that will be activated at each SysTick service. We usually call this structure a TCB (thread control block). For the time being we do not use any selection criteria, and therefore there are no control parameters other than next: when a task is interrupted, the next one in the queue will be resumed and executed.
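A minimal sketch of such a structure, assuming three threads, could look like the following. The fields sp and next and the names tcb and RunPtr come from the text; kLinkThreads and kNextThread are hypothetical helpers introduced only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 3u

typedef struct tcb {
    uint32_t   *sp;     /* saved stack pointer: address of the stored R4 */
    struct tcb *next;   /* next thread in the circular queue             */
} tcb_t;

tcb_t  tcb[NTHREADS];
tcb_t *RunPtr;          /* thread that currently owns the processor */

/* Hypothetical helper: chain the TCBs in a ring and pick thread 0. */
void kLinkThreads(void)
{
    for (size_t i = 0; i < NTHREADS; i++) {
        tcb[i].next = &tcb[(i + 1u) % NTHREADS];
    }
    RunPtr = &tcb[0];
}

/* Hypothetical helper: the whole "scheduler" of this first kernel. */
void kNextThread(void)
{
    RunPtr = RunPtr->next;
}
```

Keeping sp as the first field is a deliberate choice in this sketch: the address held in RunPtr is then also the address of the saved stack pointer, which keeps an assembly context switcher down to a pair of loads and stores.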

Figure 9. Thread control block
Figure 10. Initializing the stack (the values representing the registers, like 0x07070707 are for debugging purposes)

The kSetInitStack function initializes the stack for each thread “i”. The stack pointer in the associated TCB points to the data relating to R4. Each stack slot is initialized with the number of the register that will be loaded from it, to facilitate debugging. The PSR only needs to be initialized with bit 24 set to 1, which is the bit that identifies Thumb mode. The tasks have the signature void Task (void * args).

To add a task to the stack, we initially need the address of the main function of the task. In addition, we will also pass an argument. The first argument goes in R0. If more arguments are needed, other registers can be used, according to the AAPCS (Procedure Call Standard for the ARM Architecture).
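The two steps above can be sketched as follows. This is an approximation under stated assumptions, not the code of Figures 10 and 11: the signature of kSetInitStack is assumed, kSetTask and demo_task are hypothetical names, and word_t is uintptr_t so that the sketch also runs on a PC (on a Cortex-M3 it is 32 bits wide).

```c
#include <assert.h>
#include <stdint.h>

#define NTHREADS   3u
#define STACK_SIZE 64u                  /* words per task stack, as in the text */

typedef uintptr_t word_t;               /* one stack word */
typedef void (*task_fn_t)(void *args);

typedef struct tcb {
    word_t     *sp;                     /* saved stack pointer, points to R4 */
    struct tcb *next;
} tcb_t;

tcb_t  tcb[NTHREADS];
word_t stacks[NTHREADS][STACK_SIZE];

/* Build a frame that looks as if thread i had been interrupted and saved. */
void kSetInitStack(uint32_t i)
{
    assert(i < NTHREADS);
    word_t *top = stacks[i] + STACK_SIZE;   /* one past the highest word */

    top[-1] = 0x01000000u;                  /* xPSR: only bit 24 (Thumb) set */
    top[-2] = 0u;                           /* PC: filled in by kSetTask()   */
    top[-3] = 0x0E0E0E0Eu;                  /* R14 (LR), debug pattern       */
    top[-4] = 0x0C0C0C0Cu;                  /* R12, debug pattern            */
    for (int r = 0; r <= 3; r++) {
        top[r - 8] = (word_t)r * 0x01010101u;    /* R0 at top[-8] .. R3  */
    }
    for (int r = 4; r <= 11; r++) {
        top[r - 20] = (word_t)r * 0x01010101u;   /* R4 at top[-16] .. R11 */
    }
    tcb[i].sp = top - 16;                   /* lowest address of the frame: R4 */
}

/* Hypothetical helper: store entry point and argument in thread i's frame.
 * kSetInitStack(i) must have been called first. */
void kSetTask(uint32_t i, task_fn_t task, void *args)
{
    assert(i < NTHREADS);
    word_t *frame = tcb[i].sp;
    frame[8]  = (word_t)args;               /* R0: first argument (AAPCS) */
    frame[14] = (word_t)task;               /* PC: where the task starts  */
}

/* Example task with the signature used in the text. */
void demo_task(void *args)
{
    (void)args;
    for (;;) {
    }
}
```

Counting from sp, the frame layout is R4-R11 at offsets 0 to 7, then R0-R3, R12, LR, PC and xPSR at offsets 8 to 15, which is the order the POP and the interrupt return expect.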

Figure 11. Routine for adding tasks and their arguments to the initial stackframe

5.3. Kernel start-up

It is not enough to initialize the stacks and wait for the SysTick. The sp field of a TCB will only hold a valid stack pointer value once the task has been interrupted. We have two types of threads running: background and foreground threads. The background includes the kernel routines, including the context switcher. At each SysTick, it is the kernel’s turn to use the processor. In the foreground are the applications.

If the task has not been executed yet, the stack pointer saved in sp will not be valid. So we need to make it look like the task has been executed, interrupted and saved, to be reactivated later. I used the following strategy:

  1. An interruption is forced (PendSV). The initial hardware stackframe is saved.
  2. tcb[0].sp is loaded into SP.
  3. R4-R11 of the core are loaded with the values of the initialized stackframe.
  4. The ISR returns, retrieves the hardware stackframe, and the SP will be at the top of the stack. The PC is now loaded with the address of the first call to be made, and the program follows the flow.

Figure 12. PendSV interrupt service to boot the kernel

In [2] a much smarter way of starting up the kernel is suggested:

Figure 13. Routine for booting the kernel

No interruption is needed: the call stack is loaded directly from the initialized stackframe, and the LR receives the PC value stored in that stack. After finally taking SP to the top of the stack, BX LR jumps to the task.

If we use the first method presented, kStart is simply:

// using CMSIS lib
void kStart(void) 
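A minimal sketch of what such a kStart could do under the first method is to set the PendSV exception pending. With CMSIS this is usually written as SCB->ICSR |= SCB_ICSR_PENDSVSET_Msk;. The version below avoids the CMSIS headers so that it can also be compiled on a PC: the function name kStartSketch and the icsr parameter are hypothetical, while the register address and bit position are those of the Cortex-M3 Interrupt Control and State Register.

```c
#include <stdint.h>

#define ICSR_ADDRESS    0xE000ED04u   /* SCB->ICSR on a Cortex-M3      */
#define ICSR_PENDSVSET  (1u << 28)    /* write 1 to set PendSV pending */

/* Sketch of kStart: request PendSV, whose handler then loads the first
 * task. On the target, pass (volatile uint32_t *)ICSR_ADDRESS.
 * A plain write is enough because writing 0 to the other bits of this
 * register has no effect. */
void kStartSketch(volatile uint32_t *icsr)
{
    *icsr = ICSR_PENDSVSET;
}
```

Passing the register address as a parameter is only a testing convenience here; on real hardware it is a fixed address and interrupts must be enabled for the PendSV handler to run.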

6. Putting it all together

To illustrate, we will run three tasks in Round-Robin; each one toggles a different output pin and increments a different counter. The time between one change of context and the next will be 1000 cycles of the main clock. Note that these functions run within a “while (1) {}”. It is as if we had several main programs running in the foreground. Each stack has 64 elements of 4 bytes (256 bytes).
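One possible shape for such a task is sketched below. It is not the code of Figure 14: blink_args_t, task_step and the port register are hypothetical, and the real pin access depends on the microcontroller. The loop body is factored into task_step so that it can be exercised on a PC.

```c
#include <stdint.h>

/* Hypothetical per-task data, passed through the void *args parameter. */
typedef struct {
    volatile uint32_t *port;      /* output data register of the pin */
    uint32_t           pin_mask;  /* bit of the pin inside that port */
    uint32_t           counter;   /* incremented on every iteration  */
} blink_args_t;

/* One iteration of the task: toggle the pin and count. */
void task_step(blink_args_t *a)
{
    *a->port ^= a->pin_mask;
    a->counter++;
}

/* The task itself never returns; the kernel preempts it at each SysTick. */
void Task(void *args)
{
    blink_args_t *a = args;
    while (1) {
        task_step(a);
    }
}
```

Because each task gets its own blink_args_t, the same Task function could be registered three times with three different arguments.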

Figure 14. System Tasks

Below is the main function of the system. The hardware is initialized. Tasks are added and the stacks are initialized with the kAddThreads function. The RunPtr receives the address of the thread. After setting the SysTick to trigger every 1000 clock cycles, we boot up the kernel. After executing the first task and being interrupted, the system keeps switching between one task and another, with the context switcher running in the background.

Figure 15. Main program

6.1. Debug

You will need at least a simulator to implement the system more easily, as you will need to access the core registers and see the data moving in the stacks. If the system is working, each time the debugger is paused, the counters should have almost the same value.

In the photo below, I use an Arduino Due board with an Atmel SAM3X8E processor and an Atmel ICE debugger connected to the board’s JTAG. On the oscilloscope you can see the waveforms of the outputs switching for each of the 3 tasks.

Figure 16. Debug bench
Figure 17. Tasks 1, 2, 3 on the oscilloscope.

7. Conclusions

The implementation of a preemptive kernel requires reasonable knowledge of the processor architecture to be used. Loading the call stack registers and saving them in a “handmade” way allows us to have greater control of the system at the expense of the complexity of handling the stacks.

The example presented here is a minimal one, where each task is given the same amount of time to run. After that time, the task is interrupted and suspended, and the data from the call stack is saved. This saved set is called a stackframe: a “photograph” of the point where the program was. The next task to run is loaded at the point where it was interrupted, and resumed. The code was written in order to explain the concepts.

In the next publication we will split the threads between user mode and privileged mode – using two stack pointers – a fundamental requirement for a more secure system.


The text of this post as well as the non-referenced figures are from the author.
[1] The definitive guide to ARM Cortex-M3, Joseph Yiu
[2] Real-Time Operating Systems for the ARM Cortex-M, Jonathan Valvano
[3] https://www.embedded.com/taking-advantage-of-the-cortex-m3s-pre-emptive-context-switches/
