[blog] Synchronous versus asynchronous reset on ASIC Design

Reset is about getting a system back to a known initial state. Temporary data is flushed. When I say data I mean 0s and 1s. Everything will be put to a known state so the circuit starts up. Sounds important.

When we go for synchronous or asynchronous reset, active low, our HDL code will look like one of these:

module synch_reset_pipeline(in1, in2, ..., out1, out2,...);
// #1 module with synchronous reset
always @(posedge clk) 
  if (!rst) begin
   // reset logic: put this module to known state 
  else begin
  // functional logic: processes data
module asynch_reset_pipeline(in1, in2, ..., out1, out2,...);
// #2 module with asynchronous reset
always @(posedge clk or negedge rst) 
  if (!rst) begin
   // reset logic 
  else begin
  // functional logic

To simplify, firstly, think about a single sequential element (a flip-flop). Then, imagine it is starting up, catching the very first data signal arriving on the input and propagating it to other pipeline stages. How would be a desirable reset signal? I would say:

  • it resets the circuit to a known state
  • not prone to any metastability
  • the duty cycle is just long enough to meet the above criteria

Given the above bullets what is the go for approach? Whatever fits the system better, on area, timing and power. On module #1, we are telling the tool that the reset signal is under the domain of the clock, and the reset logic will take place when reset level is low. So when the clock positive edge triggers and reset is asserted, the flops that are connected to that logic will reset. Therefore, the reset signal is pretty much another data signal (like: assign current_input = (!rst_sel) ? reset_value : next_input), selecting between the next value to be sampled or the reset data value, when a reset happens. And we should make sure it will be filtered from glitches and never prone to any meta-stability – but this is valid to any other external input too. It cannot take the fastest way through the datapath when asserted, otherwise it might reset the logic violating setup requirements, and therefore put it to a unknown state, that will be sampled and propagated. The deassertion by its turn need to respect the hold time of the signals that were propagated through the pipeline. As data and reset are on the same clock domain, the work relies pretty much on the clock tree synthesis. It costs more circuitry in general. (Although flops are smaller, they say)

On #2, the asynchronous reset is a high priority interruption, per say. The or inside the sensitivity list changes everything. Whenever reset is low (asserted) it comes to bring everything down and up again. If the pulse has a very low duty cycle, maybe the flop will not be stable yet, and will put the the circuit to an unknown state when reset is released; maybe a very bad combination (boom). That is, we need to model this asynchronous path to respect a time after asserting and before deasserting the signal, so we make sure every pipeline stage is at a known state. To prevent timing violation when asserting, we need to make sure the reset will take place respecting the (worst) removal time, so the initial data to be sampled is valid. For deasserting, we need to make sure the clock samples a stable value, by respecting the (worst) recovery time. Besides, you need to tell the tool that the signal that multiplex the input to your flops between operation data and reset data is a false path – that is the clock must ignore the impact of this signal on the data path, and not make any efforts to support setup and hold timings of the sequential elements. You will take care by yourself by controlling the reset behavior. A technique could be to model the buffers safely enough to charge and discharge on a determined relative time, so the reset removal and recovery time are always safe. That is you fix a positive delay for reset assertion and a negative delay for deassertion, regarding the active clock edge. You will need to synchronize the input reset signal to a common in both aproaches to mitigate metastability. Asynchronous reset costs less circuitry, but is also more complex: a very critical asynchronous control signal is on the game.

[blog] Tip: C variables and pointers declarations and definitions

It might sound naive, but it is a very common interview question, and I actually think the notation can be deceiving. And furthermore, it is essential. A few months off and you will be asking yourself “how is that again?”. Unless you get this rule: start reading from right to left.

Question: Please read the following declarations and explain their properties regarding read and write.

#1 int v;

Reads as v is a variable of integer type

int v; //declaration
int same_scope_var = 0x10; //declaration and definition
v = 0; //definition
v = same_scope_var; //redefinition v=0x10

After the scope (declaration) and initial value of v is defined, you can read and write it in run-time.

#2 int const v;

Reads as v is a constant integer, i.e., read-only

After the scope and initial value of v is defined, you can only read it in run-time.

#3 int *v;

Reads as v is variable pointer to an integer.

int *v; //declaration
int same_scope_var = 0xFFFFFFFF; //declaration and defitinion
int same_scope_var2 = 0xF0FFFFFF; //declaration and defitinion
v = &same_scope_var ; //definition
v = &same_scope_var2 ; //definition

After the scope and initial value (that is a memory address that holds an integer) of v is defined, you can change it to whatever other integer address you want in run-time.

#4 int const *v;

Reads as v is variable pointer to a constant integer; that is, the address v holds can change, although it is expected to point to a constant value.

int const same_scope_k = 0xFFFFFFFF; //declaration and defitinion
int const same_scope_k2 = 0xF0FFFFFF;
int same_scope_var = 0xF0FBFFFF;
int const *v; //declaration
v = &same_scope_k; //definition
v = &same_scope_k2; //redefinition
*v = same_scope_k; //error wont compile
v = &same_scope_var; //probably compile with a warning

After the scope and initial value of v is defined, you can write and read it whenever you want in run-time, but you cannot change the value it points to by dereferencing.

#5 int * const v;

Reads as v is a constant pointer to a variable integer, i.e., v is ready-only, but the value that it points to (dereference) can change.

int same_scope_variable  = 0xFFFFFFFF; //declaration and definition
int same_scope_variable2  = 0xF0FFFFFF; 
int * const v; //declaration
v = &same_scope_variable  ; //definition
*v = same_scope_variable2; // redefinition for same_scope_variable = same_scope_variable2

After the scope and initial value of v is defined, you can only read the address it points to in run-time, but you can change the value the pointed address holds, by dereferencing.

#6 int const * const v;

Reads as v is a constant pointer to a constant integer. That is, the value v holds is a read-only address that points to a read-only integer. Its more usual to see that as a function argument when you want to be sure the method cannot change neither the address v holds nor the value it dereferences.

int const same_scope_k  = 0xFFFFFFFF; 
int const same_scope_k2  = 0xF0FFFFFF; 
int const * const v; //declaration
v = &same_scope_k; //defined, read-only
v = &same_scope_k2; //error wont compile
*v = same_scope_k2; //error wont compile

The code below should compile with no warnings. It is on a PC, therefore memory addresses have 8 bytes

#include <stdio.h>
int main()
    int my_variable = 0x30; 
    int const my_constant = 0x60;
    //#1 defined and declared v 
    int v = 0x05; 
    //#2 defined and declared read-only const
    int const v_k = 0x10; 
    int const v_k2 = 0x11;
    //#3 defined and declared v_ptr as the address of v
    int *v_ptr = &v; 
    //#4 declared and defined this as a pointer to a constant 
    int const *v_ptr_to_k = &v_k; 
    //#5 declared and defined this as a constant pointer to an variable integer
    int * const v_k_ptr = &my_variable; 
    //declared and defined this as a constant pointer to a constant integer
    int const * const v_k_ptr_k = &my_constant;  
    // operations 
    // change v
    printf("v has the value %x, and v_ptr is %p\n\r", *v_ptr, v_ptr);
    v = 0x80;
    printf("v has now the value %x, and v_ptr is %p\n\r", *v_ptr, v_ptr);
    // changing v_ptr
    v_ptr = &my_variable;
    printf("v_ptr dereferences to value %x, and v_ptr is %p\n\r", *v_ptr, v_ptr);
    //change pointer to a constant
    printf("v_ptr_to_k dereferences to value %x, and v_ptr_to_k is %p\n\r", *v_ptr_to_k, (unsigned long int *)v_ptr_to_k);
    v_ptr_to_k = &v_k2;
    printf("v_ptr_to_k dereferences to value %x, and v_ptr_to_k is %p\n\r", *v_ptr_to_k, (unsigned long int *)v_ptr_to_k);
    //*v_ptr_to_k = 0x08; //<== not allowed!
    //operating with a constant pointer
    printf("v_k_ptr dereferences to the value %x, and v_k_ptr is %p\n\r", *v_k_ptr, (unsigned long int *)v_k_ptr);
    v = 0x35;
    *v_k_ptr = v;
    printf("v_k_ptr dereferences to the value %x, and v_k_ptr is %p\n\r", *v_k_ptr, (unsigned long int *)v_k_ptr);
    //operating with a constant pointer to a constant integer
    printf("v_k_ptr_k dereferences to the value %x, and v_k_ptr_k is %p\n\r", *v_k_ptr_k, (unsigned long int *)v_k_ptr_k);
    //*v_k_ptr_k = 0x92; <== not allowed!
    //v_k_ptr_k = &my_constant; <== not allowed!
    return 0;


(Erratum: in the printf above v_ptr_to_k was mentioned as v_ptr_k but the variable declared and defined on the code is v_ptr_to_k.)

A first preemptive kernel on ARM Cortex-M3

1. Introduction

When it comes to operating systems for embedded software, they are generally thought to be an overkill for most solutions. However, between a completely “hosted” multiapplication system (which uses a multi thread and general purpose OS) and a completely “standalone/monolithic” or “bare-metal” application specific system, there are many variations we can go for.

In previous publications I explored the notion of cooperative tasks, from a loop calling tasks from an array of function pointers, to something a little more complex with the processes being handled in a circular buffer, with explicit time criteria. This time I will show a minimal preemptive implementation, to also pave the way for something a little more complex, as in other publications.

2 Preemptive versus Cooperative

In a fully cooperative system, the processor does not interrupt any task to accept another, and the task itself needs to release the processor for the next one to use it. There is nothing wrong with this. A Run-To-Completion scheme is sufficient and / or necessary for many applications, and many embedded systems were deployed this way, including some very complex ones. In the past, even non-embedded systems used a cooperative kernel (Windows 3.x, NetWare 4.x, among others). If a task crashes, the entire system is compromised when we speak in a strictly cooperative way: it keeps the processor from going further (so in a server operating system like NetWare, this does not seem to be a good idea, because multiple clients are a must!).

In preemptive mode, tasks are interrupted and later resumed– i.e., a context (set of states in the processor registers) is saved and then retrieved. This leads to more complexity to the implementation but, if well done, it increases the robustness and the possibility of meeting narrower timing requirements, mainly if used with a priority and/or periodicity criteria to manage the queue.

3 Call stack

A processor is, in fact, a programmable finite-state machine. With some simplification, each state of our program can be defined within the set of core register values . This set dictates which program point is active. Therefore, activating a task means pushing values ​​to the call stack so that this task will be processed. This set of values is called context. To resume the task afterwards, it is necessary to save the call stack data at that point in the program. This “frozen” data represents a program state and is therefore called a stack frame. For every saved stackframe there is a context related. To resume a task, a previously saved stack frame is loaded back into the call stack.

In the ARM Cortex-M3, the 32-bit registers that define the active state of the processor are: R0-R12 for general use and R13-R15 registers for special use, in addition to the Program Status Register (xPSR) – its value is on the top of any stackframe, and it is not actually a single physical register, but a composition of three (Application, Interrupt and Execution: APSR, IPSR e EPSR).

Esta imagem possuí um atributo alt vazio; O nome do arquivo é armcortexm3arch-1.png

3.1. Load-store architectures

A load-store architecture is a processor architecture in which data from memory needs to be loaded to the core registers before being processed. Also, the result of this processing before being stored in memory must be in a register.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é registersm3-1.png

The two basic memory access operations on Cortex-M3:

// reads the data contained in the address indicated by Rn + offset and places it in Rn. 
LDR Rd, [Rn, #offset]
// stores data contained in Rn at the address pointed by Rd + offset
STR Rn, [Rd, #offset]

It is important to understand at least the Cortex-M3 instructions shown below. I suggest sources [1] and [2] as good references, in addition to this or this link.

MOV Rd, Rn // Rd = Rn
MOV Rd, #M // Rd = M, the immediate being a 32-bit value (here represented by M)
ADD Rd, Rn, Rm // Rd = Rn + Rm
ADD Rd, Rn, #M // Rd = Rn + M
SUB Rd, Rn, Rm // Rd = Rn - Rm
SUB Rd, Rn, #M // Rd = Rn - M
// pseudo-instruction to save Rn in a memory location.
// After a PUSH, the value of the stack pointer is decreased by 4 bytes
// POP increases SP by 4 bytes after loading data into Rn.
// this increase-decrease is based on the current address the SP is pointing to
POP {Rn}
B label // jump to routine label
BX Rm // jump to routine specified indirectly by Rm
BL label // jump to label and moves the caller address to LR

CPSID I // enable interrupts
CPSIE I // disable interrupts

We will operate the M3 in Thumb mode , where the instructions are actually 16 bits. According to ARM , this is done to improve code density while maintaining the benefits of a 32-bit architecture. Bit 24 of the PSR is always 1.

3.2. Stacks and stack pointer (SP)

Stack is a memory usage model [1]. It works in the Last In – First Out format (last to enter, first to leave). It is as if I organized a pile of documents to read. It is convenient that the first document to be read is at the top of the stack, and the last at the end.

We usually divide the memory between heap and stack . As said, the “call stack” will contain that temporary data that determines the next state of the processor. The heap contains data which nature is not temporary in the course of the program (this does not mean “non-volatile”). The stack pointer is a kind of pivot that keeps control of the program flow, by pointing to some position of the stack.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é stackusage.png
Figure 3. Model for using a stack. When saving data before processing (transforming) it saves the previous information. (Figure from [1])
Esta imagem possuí um atributo alt vazio; O nome do arquivo é annotation-2020-02-06-001056.png
Figure 4. Regions of memory mapped on a Cortex M3. The region available for the stack is confined between the addresses 0x20008000 and 0 0x20007C00. [1]

4 Multitasking on the ARM Cortex M3

The M3 offers two stack pointers (Main Stack Pointer and Process Stack Pointer) to isolate user processes of the process kernel. Every interrupt service runs in kernel mode . It is not possible to go from user mode to kernel mode (actually called thread mode and privileged mode) without going through an interruption – but it is possible to go from privileged mode to user mode by changing the control register.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é so.png
Figure 5. Changing context on an OS that isolates the kernel application [1]

The core also has dedicated hardware for switching tasks. The SysTick interrupt service can be used to implement synchronous context switching. There are still other asynchronous software interruptions (traps) like PendSV and SVC. Thus, SysTick is used for synchronous tasks in the kernel, while SVC serves asynchronous interruptions, when the application makes a call to the system. The PendSV  is a software interruption which by default can only be triggered in privileged mode. It is usually suggested [1] to trigger it within SysTick service, because it is possible to keep track of the ticks to meet the time criteria. The interruption by SysTick is soon served, with no risk of losing any tick of the clock. A secure OS implementation would use both stack pointers to separate user and kernel threads and separate memory domains if an MPU (Memory Protection Unit) is available. At first, we will only use the MSP in privileged mode.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é 2sps.png
Figure 6. Memory layout on an OS with two stack pointers and protected memory [1]

5. Building the kernel

Kernel is a somewhat broad concept, but I believe that there is no OS which kernel is not responsible for scheduling tasks. In addition, there must be IPC (inter-process communication) mechanisms. It is interesting to note the strong hardware-dependency of the scheduler that will be shown, due to its low-level nature .

5.1. Stackframes and context switching

Remember: call stack = the registers of the core ; stack or stackframe = state (values) of these registers saved in memory.

When a SysTick is served, part of the call stack is saved by the hardware (R0, R1, R2, R3, R12, R14 (LR) and R15 (PC) and PSR). Let’s call this portion saved by the hardware stackframe. The remaining is the software stackframe [3], which we must explicitly save and retrieve with the PUSH and POP instructions .

To think about our system, we can outline a complete context switch depicting the key positions the stack pointer assumes during the operation (in the figure below the memory addresses increase, from bottom to top. When SP points to R4 it is aligned with an address lower than the PC on the stack)

Esta imagem possuí um atributo alt vazio; O nome do arquivo é ctxtswitch-1.png
Figure 7. Switching contexts. The active task is saved by the hardware and the kernel. The stackpointer is re-assigned, according to pre-established criteria, to the R4 of the next stackframe to be activated. The data is rescued. The task is performed. (Figure based on [3]) (“Salvo pelo hardware/kernel” translates to “Saved /pushed by hardware/kernel”; “Resgatado pelo hardware/kernel” translates to “retrieved/popped by hardware/kernel”)

When an interruption takes place, SP will be pointing to the top of the stack (SP (O)) to be saved. This is inevitable because this is how the M3 works. In an interruption the hardware will save the first 8 highest registers in the call stack at the 8 addresses below the stack pointer, stopping at (SP (1)). When we save the remaining registers, the SP will now be pointing to the R4 of the current stack (SP (2)). When we reassign the SP to the address that points to the R4 of the next stack (SP (3)), the POP throws the values of R4-R11 to the call stack and the stack pointer is now at (SP (4)). Finally, the return from the interrupt pops the remaining stackframe, and the SP (5) is at the top of the stack that has just been activated. (If you’re wondering where R13 is: it stores the value of the stack pointer)

The context switching routine is written in assembly and implements exactly what is described in Figure 7.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é systick-2.png
Figure 8. Context switcher

PS: When an interruption occurs, the LR takes on a special code. 0xFFFFFFF9 , if the interrupted thread was using MSP or 0xFFFFFFFD if the interrupted thread was using PSP.

5.1 Initializing the stacks for each task

For the above strategy to work, we need to initialize the stacks for each task accordingly. The sp starts by pointing to R4. This is by definition the starting stack pointer of a task, as it is the lowest address in a frame .

In addition, we need to create a data structure that correctly points to the stacks that will be activated for each SysTick service . We usually call this structure a TCB (thread control block). For the time being we do not use any selection criteria and therefore there are no control parameters other than next: when a task is interrupted, the next one in the queue will be resumed and executed.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é struct.png
Figure 9. Thread control block
Esta imagem possuí um atributo alt vazio; O nome do arquivo é stackinit-3.png
Figure 10. Initializing the stack (the values representing the registers, like 0x07070707 are for debugging purposes)

The kSetInitStack function initializes the stack for each “i” thread . The stack pointer in the associated TCB points to the data relating to R4. The stack data is initialized with the record number that must be loaded to facilitate debugging. The PSR only needs to be initialized with bit 24 as 1, which is the bit that identifies Thumb mode . The tasks have the signature void Task (void * args) .

To add a task to the stack, we initially need the address of the main function of the task. In addition, we will also pass an argument. The first argument is in R0. If more arguments are needed, other registers can be used, according to AAPCS (ARM Application Procedure Call Standard).

Esta imagem possuí um atributo alt vazio; O nome do arquivo é addthreads-1.png
Figure 11. Routine for adding tasks and their arguments to the initial stackframe

5.3. Kernel start-up

It is not enough to initialize the stacks and wait for the SysTick. The TCB structure sp will only hold a valid stack pointer value when the task is interrupted. We have two types of threads running: background and foreground threads. The background includes the kernel routines, including the context switcher. At each SysTick, it is the kernel’s turn to use the processor. In the foreground are the applications.

If the task has not already been executed, the stack pointer saved in sp will not be valid. So we need to make it looks like the task has been executed, interrupted and saved – to be reactivated later. I used the following strategy:

  1. An interruption is forced (PendSV). Initial hardware stackframe is saved.
  2. tcb [0].sp is loaded in SP
  3. The R4 – R11 of the core are loaded with the values ​​of the initialized stackframe.
  4. ISR returns, retrieves the hardware stack frame and the SP will be at the top of the stack. The PC is now loaded with the address of the first call to be made, and the program follows the flow.
Esta imagem possuí um atributo alt vazio; O nome do arquivo é pendsv.png
Figure 12. PendSV interrupt service to boot the kernel

In [2] a very smarter way of starting up the kernel is suggested:

Esta imagem possuí um atributo alt vazio; O nome do arquivo é kstart-1.png
Figure 13. Routine for booting the kernel

The interruption is dispensed and the call stack is loaded by activating the LR with the PC value of the stack . After finally taking SP to the top of the stack, BX LR executes the task and returns.

If we use the first method presented, kStart is simply:

// using CMSIS lib 
void kStart(void) { SCB->ICSR |= SCB_ICSR_PENDSVSET_Msk; }

6. Putting it all together

To illustrate, we will perform, in Round-Robin 3 tasks that switch the output of 3 different pins and increment three different counters. The time between a change of context and another will be 1000 cycles of the main clock. Note that these functions run within a “while (1) {}”. It is like we have several main programs running on the foreground . Each stack has 64 x 4-byte elements (256 bytes).

Esta imagem possuí um atributo alt vazio; O nome do arquivo é tasks-1.png
Figure 14. System Tasks

Below the main function of the system. The hardware is initialized. Tasks are added and the stacks are initialized. The RunPtr receives the address of the thread.  After setting the SysTick to trigger every 1000 clock cycles, boot up the kernel . After executing the first task and being interrupted, the system is switching between one task and another, with the context switcher running in the background .

Esta imagem possuí um atributo alt vazio; O nome do arquivo é main.png
Figure 15. Main program

6.1. Debug

You will need at least a simulator to implement the system more easily, as you will need to access the core registers and see the data moving in the stacks. If the system is working, each time the debugger is paused, the counters should have almost the same value.

In the photo below, I use an Arduino Due board with an Atmel SAM3X8E processor and an Atmel ICE debugger connected to the board’s JTAG. On the oscilloscope you can see the waveforms of the outputs switching for each of the 3 tasks.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é 20200209_163936.jpg
Figure 16. Debug bench
Esta imagem possuí um atributo alt vazio; O nome do arquivo é ds1z_quickprint1-1.png
Figure 17. Tasks 1, 2, 3 on the oscilloscope.

7 Conclusions

The implementation of a preemptive kernel requires reasonable knowledge of the processor architecture to be used. Loading the call stack registers and saving them in a “handmade” way allows us to have greater control of the system at the expense of the complexity of handling the stacks.

The example presented here is a minimal example where each task is given the same amount of time to be performed. After that time, the task is interrupted and suspended – the data from the call stack is saved. This saved set is called a stackframe – a “photograph” of the point at the program was. The next task to be performed is loaded at the point it was interrupted and resumed. The code was written in order to explain the concepts.

In the next publication we will split the threads between user mode and privileged mode – using two stack pointers – a fundamental requirement for a more secure system.


The text of this post as well as the non-referenced figures are from the author.
[1] The definitive guide to ARM Cortex-M3, Joseph Yiu
[2] Real-Time Operating Systems for the ARM Cortex-M, Jonathan Valvano
[3] https://www.embedded.com/taking-advantage-of-the-cortex-m3s-pre-emptive-context-switches/

Separating user space from kernel space on ARM Cortex-M3

1 . Introduction

ARM Cortex-M processors are in SoCs of several application domains, especially in much of what we call smart devices. This publication continues the previous one, in which I had demonstrated the implementation of a minimum preemptive scheduling mechanism for an ARM Cortex-M3, taking advantage of the special hardware resources for context switching.

Some important features of this architecture will now be used to improve the kernel development, such as the separation of user and kernel threads – with 2 stack pointers: MSP and PSP, besides the use of Supervisor Calls to implement system calls.

Although some concepts about operating systems are addressed because they are inherent to the subject, the objective is to explore the ARM Cortex-M3, a relatively inexpensive processor with wide applicability, and how to take advantage of it to develop more robust systems.

2. Special registers

There are 3 special registers on the ARM Cortex-M3. You can consult the ARM documentation to understand the role of each of them in the processor. The most important in this publication is CONTROL .

  • Program Status Registers (APSR, IPSR, and EPSR)
  • Interrupt Mask Registers (PRIMASK, FAULTMASK and BASEPRI)
  • Control (CONTROL) .

Special registers can only be accessed through the privileged instructions MRS (ARM Read Special Register) and MSR (ARM Set Special Register):

// Load in R0 the current value contained in the special register 
// Load in the special register the value contained in R0)

The CONTROL register has only 2 configurable bits. When an exception handler (e.g, SysTick_Handler) is running, the processor will be in privileged mode and using the main stack pointer (MSP) , and CONTROL [1] = 0, CONTROL [0] = 0 . In other routines that are not handlers, this register can assume different values ​​depending on the software implementation (Table 1).

In the small kernel shown before, the application tasks (Task1, Task2 and Task3) were also executed in privileged mode and using the main stack pointer (MSP). Thus, an application program could change the special registers of the core if it wanted to.

3. Kernel and user domains

In the last publication I highlighted the fact that register R13 is not part of the stackframe, as it stores the address of the current stack pointer. The R13 is a “banked” register, meaning that it is physically replicated, and takes a value or another depending on the state of the core.

CTRL [1] (0 = MSP / 1 = PSP)CTRL [0] (0 = Priv, 1 = Non priv )state
00Privileged handler * / Base mode
10Privileged thread
11User thread
Table 1. Possible states of the CONTROL register
* in exception handlers, this mode will always be active even if CTRL [0] = 1.

With two stack pointers, one for application and another for the kernel, means that a user thread can not easily corrupt the kernel stack by a programming application error or malicious code. According to the ARM manuals, a robust operating system typically has the following characteristics:

  • interrupt handlers use MSP (by default)
  • kernel routines are activated through SysTick at regular intervals to perform task scheduling and system management in privileged mode
  • user applications use PSP in non-privileged mode
  • memory for kernel routines can only be accessed in privileged mode* and use MSP

* for now we will not isolate the memory spaces

4. System Calls

Putting it simple, a system call is a method in which a software requests a service from the kernel or OS on which it is running. If we intend to separate our system into privilege levels, it is inevitable that the application level needs call the kernel to have access to, for example, hardware services or whatever else we think is critical to the security and stability of our system.

A common way to implement system calls in ARM Cortex-M3 (and in other ARMv7) is to use the software interrupt Supervisor Call (SVC). The SVC acts as an entry point for a service that requires privileges to run. The only input parameter of an SVC is its number (ASM instruction: SVC #N), which we associate with a function call (callback). Unlike other exceptions triggered via software available, like PendSV (Pendable Supervisor Call), the SVC can be triggered in user mode by default.

Figure 2. Block diagram of a possible operating system architecture. Supervisor Calls act as an entry point for privileged services. [2]

5. Design

5.1 Using two stack pointers

To use the two available stack pointers (MSP and PSP) it is essential to understand 2 things:

  • The control register manipulation: it is only possible to write or read the CONTROL register in handler mode (within an exception handler) or in privileged threads.
  • The exceptions mechanism: when an interrupt takes place, the processor saves the contents of registers R0-R3, LR, PC and xPSR, as explained in the previous publication. The value of the LR when we enter an exception indicates the mode the processor was running, when the thread was interrupted. We can manipulate this value of LR together with the manipulation of the stack pointer to control the program flow.
0xFFFFFFF9Returns to “base” mode, privileged MSP. (CONTROL = 0b00)
0xFFFFFFFDReturns to user mode (PSP, with the privilege level of entry) (Control = 0b1x)
0xFFFFFFF1Returns to the previous interruption, in case a higher priority interruption occurs during a lower priority.
Table 2. Exception return values

5.1.1. One kernel stack for each user stack

Each user stack will have a correspondent kernel stack (one kernel stack per thread). Thus, each Task is associated to a kernel stack and a user stack. Another approach would be only one kernel stack for the entire the system (one kernel stack per processor). The advantage of using the first approach is that from the point of view of who implements the system, the programs that run in the kernel follow the same development pattern as the application programs. The advantage of the second approach is less memory overhead and less latency in context switching.

Figure 3. Each user stack has an associated kernel stack

5.2 Kernel entry and exit mechanisms

In the previous publication, the interruption triggered by SysTick handled the context switching, i.e., it interrupted the running thread, saved its stackframe, searched for the next thread pointed to by the next field in the TCB (thread control block) structure and resumed it.

With the separation between user and supervisor spaces, we will have two mechanisms to get in and out the kernel, the system calls, explicitly called in code, and the interruption by SysTick that implements the scheduling routine. Although still using a round-robin scheme in which each task has the same time slice, the threads of the kernel also work cooperatively with a user thread that evoked it, that is: when there is nothing more to be done, the kernel can explicitly return. If the thread of the kernel takes longer than the time between one tick and another, it will be interrupted and rescheduled. User tasks could also use a similar mechanism, however, for simplicity of exposure, I chose to leave user tasks only in a fixed round-robin scheme, with no cooperative mechanisms.

5.2.1. Scheduler

The flowchart of the preemptive scheduler to be implemented is in Figure 4. The start-up of the kernel and user application is also shown for clarity. The kernel starts upo and voluntarily triggers the first user task. At every SysTick interruption, the thread has its state saved and the next scheduled task is resumed according to the state in which it was interrupted: kernel or user mode.

Figura 4. Scheduler flowchart

5.2.2 System Calls

System Calls are needed when the user requests access to a privileged service. In addition, I also use the same mechanism for a kernel thread to cooperatively return to the user thread.

Note that I chose not to run the kernel threads within the SVC handler, which would be more intuitive, as well as more usual. The reasons for this are because I wanted to take advantage of the processor’s own interrupt mechanism that when returning POPs registers RO-R3, LR, PC and xPSR, and also to avoid the interruption nesting during the preemption of kernel tasks.

If I had chosen to use only one kernel stack for all threads, the implementation within the handler itself I think would be better. Choices, design choices…

Figure 5. System Call flowchart

6. Implementation

Below I explain the codes created to implement previously described features proof of concept. Most of the kernel itself is written in assembly, except for a portion of the supervisor calls handler that is written in C with some inline assembly. In my opinion, more cumbersome and susceptible to errors than writing in assembly is to embed assembly in C code. The toolchain used is the GNU ARM.

6.1. Stacks

There is nothing special here, except that now in addition to the stack user declare another array of integers for the stack of the kernel . These will be associated in the Thread Control Block.

int32_t p_stacks[NTHREADS][STACK_SIZE]; // user stack
int32_t k_stacks[NTHREADS][STACK_SIZE]; // kernel stack

6.2. Task Scheduler

The main difference from this to the scheduler shown in the last publication is that we will now handle two different stack pointers: the MSP and the PSP. Thus, when entering an exception handler, the portion of the stackframe saved automatically depends on the stack pointer used when the exception took place. However, in the exception routine, the active stack pointer is always the MSP. Thus, in order to be able to handle a stack pointer when we are operating with another, we will cannot use the PUSH and POP pseudo-instructions because they have the active stack pointer as their base address . We will have to replace them with the instructions LDMIA (load multiple and increment after) for POP, and STMDB (store multiple decrement before) for PUSH, with the writeback sign “!” at the base address [1] .

// Example of POP 
MRS R12, PSP // reads the value of the process stack pointer in R12
LDMIA R12!, {R4-R11} // R12 contains the base address (PSP)
/ * the address contained in R12 now stores the value from R4; [R12] + 0x4
contains the value of R5, and so on until [R12] + 0x20 contains the
value of R11.
the initial value of R12 is also increased by 32 bytes
* /
MSR PSP, R12 // PSP is updated to the new value of R12

// Example of PUSH
STMDB R12!, {R4-R11}
/ * [R12] - 0x20 contains R4, [R12] - 0x16 contains R5, ..., [R12] contains R4
the initial value of R12 is decremented by 32 bytes * /
MSR MSP, R12 // MSP is updated to the new value of R12

Another difference is that the TCB structure now needs to contain a pointer to each of the stack pointers of the thread it controls, and also a flag indicating whether the task to be resumed was using MSP or PSP when it was interrupted.

// thread control block
struct tcb 
  int32_t*  psp; //psp saved from the last interrupted thread
  int32_t*  ksp; //ksp saved from the last interrupted kernel thread
  struct tcb    *next; //points to next tcb
  int32_t   pid; //task id
  int32_t   kernel_flag; // 0=kernel, 1=user    
typedef struct tcb tcb_t;
tcb_t tcb[NTHREADS]; //tcb array
tcb_t* RunPtr; 

The scheduler routines are shown below. The code was written so it is clear in its intention, without trying to save instructions. Note that in line 5 the value of LR at the entry of the exception is only compared with 0xFFFFFFFD, if false it is assumed that it is 0xFFFFFFFF9, this is because I guarantee that there will be no nested interrupts (SysTick never interrupts an SVC, for example), so the LR should never assume 0xFFFFFFF1. If other than a proof of concept, the test should be considered.

.global SysTick_Handler
.type SysTick_Handler, %function
CPSID   I // atomic begins
CMP LR, #0xFFFFFFFD // were we at an user thread?
BEQ SaveUserCtxt    // if yes
B   SaveKernelCtxt  //if no
STMDB   R12!, {R4-R11}  //push R4-R11
LDR R0,=RunPtr      
LDR R1, [R0]
LDR R2, [R1,#4]
STR R12, [R2]  //saves stack pointer
B   Schedule
STMB    R12!, {R4-R11}
LDR R0,=RunPtr
LDR R1, [R0]
STR R12, [R1]       
B   Schedule
LDR R1, =RunPtr //R1 <- RunPtr
LDR R2, [R1]    
LDR R2, [R2,#8] //R2 <- RunPtr.next
STR R2, [R1]    //updates RunPtr
LDR R0, =RunPtr
LDR R1, [R0]
LDR R2, [R1,#16]
CMP R2, #1       //if kernel_flag=1
BEQ ResumeUser   //yes, resume user thread
B   ResumeKernel //no, resume kernel thread
LDR R1, =RunPtr  //R1 <- RunPtr
LDR R2, [R1]
LDR R2, [R2]
LDMIA   R2!, {R4-R11} //Pop sw stackframe 
MSR PSP, R2     
MOV LR, #0xFFFFFFFD //LR=return to user thread
CPSIE   I       //atomica end
LDR R1, =RunPtr   //R1 <- RunPtr
LDR R2, [R1]
LDR R2, [R2, #4]
LDMIA   R2!, {R4-R11} //Retrieves sw stackframe 
MOV LR, #0xFFFFFFF9  //LR=return to kernel thread
CPSIE   I        //atomic end

6.3 System Calls

The implementation of system calls uses the SVC Handler. As stated, SVC has a unique input parameter (ARM makes it sounds like an advantage…), that is the number we associate with a callback. But then how do we pass the arguments forward to the callback, if we can make system calls with only one input parameter? They need to be retrieved from the stack. The AAPCS (ARM Application Procedure Call Standard) which is followed by compilers, says that when a function (caller) calls another function (callee), the callee expects its arguments to be in R0-R3. Likewise, the caller expects the callee return value to be in R0. R4-R11 must be preserved between calls. R12 is the scratch register and can be freely used.

No wonder that when an exception takes place the core saves (PUSH) the registers R0-R3, LR, PC and xPSR from the interrupted function, and when returning put them (POP) again in the core registers. It is fully prepared to get back to the same point when it was interrupted. But if we change the context, that is, after the interruption we do not return to the same point we were before, there will be a need to explicitly save the remaining stackframe so this thread can be resumed properly later. It is essential to follow the AAPCS if we want to evoke functions written in assembly from C code and vice-versa.

To system calls, I defined a macro function in C that receives the SVC code and the arguments for the callback (the syntax of inline assembly depends on the compiler used).

#define SysCall(svc_number, args) {                                       
    __ASM volatile ("MOV R0, %0 "     :: "r"            (args) );     
    __ASM volatile ("svc %[immediate]"::[immediate] "I" (svc_number) : );   

(There is a reason I created a macro and not a common function: it has to do with the return point to the user thread and the fact that the kernel callbacks are not executed within the exception, which requires changing the context. If I had created a common function for the system call, the user stack pointer would be saved within the call, and upon returning from the kernel, the SVC would be executed again. If you know how to run the system call outside the handler routine without using a macro, please let me know!)

The args value is stored in R0. The SVC call is made with the immediate “svc_number”. When the SVC is triggered, R0-R3 will be automatically saved to the stack. The code was written as follows, without saving instructions, for clarity:

global SVC_Handler
.type   SVC_Handler, %function
 MRS R12, PSP        //saves psp
 BEQ KernelEntry
 B   KernelExit
//saves user context
STMDB   R3!, {R4-R11}
LDR R1,=RunPtr
LDR R2, [R1]
STR R3, [R2]    
LDR R3, =#0  
STR R3, [R1, #16] //kernel flag = 0
MOV R0, R12   //gets r0 from CORE to retrieve SVC number
B svchandler_main //branch to C routine
//retrieves user context
LDR R0, =RunPtr
LDR R1, [R0]
LDR R2, [R1]
LDMIA   R2!, {R4-R11}
LDR R12, =#1 //kernel flag = 1
STR R12, [R1, #16]

The rest of the routine for entering the kernel is written in C [2, 3]. Note that in the routine written in assembly a simple branch occurs (line 20) and therefore we have not yet returned from the exception handler .

The svc_number, in turn, is retrieved by walking two bytes (hence the cast to char) out of the address of the PC that is 6 positions above R0 in the stack [1, 2, 3]. Note that it was necessary to assign to R0 the value contained in PSP shortly after entering the interrupt, before saving the rest of the stack (lines 4 and 19 of the assembly code).

After retrieving the system call number and its arguments, the MSP is overwritten with the value stored in the TCB. Then we change the value of LR so the exception returns to the base mode. The callback does not run within the handler. When the BX LR instruction is executed, the remaining of the stackframe is automatically activated onto the core registers.

#define SysCall_GPIO_Toggle  1 //svc number for gpio toggle
#define SysCall_Uart_PrintLn 2 //svc number for uart print line
void svchandler_main(uint32_t * svc_args)
    uint32_t svc_number;
    uint32_t svc_arg0;
    uint32_t svc_arg1;
    svc_number = ((char *) svc_args[6])[-2]; // recupera o imediato 
    svc_arg0 = svc_args[0];
    svc_arg1 = svc_args[1]; 
 case SysCall_GPIO_Toggle: 
    k_stacks[RunPtr->pid][STACK_SIZE-2] = (int32_t)SysCallGPIO_Toggle_; //PC
    k_stacks[RunPtr->pid][STACK_SIZE-8] = (int32_t)svc_arg0; //R0
    k_stacks[RunPtr->pid][STACK_SIZE-1] = (1 << 24); // T=1 (xPSR)
    __ASM volatile ("MSR MSP, %0" : : "r" (RunPtr->ksp) : );
    __ASM volatile ("POP {R4-R11}");
    __ASM volatile ("MOV LR, #0xFFFFFFF9");
    __ASM volatile ("BX LR"); //returns from exception
 case SysCall_Uart_PrintLn: 
    k_stacks[RunPtr->pid][STACK_SIZE-2] = (int32_t)SysCallUART_PrintLn_; 
    k_stacks[RunPtr->pid][STACK_SIZE-8] = (int32_t)svc_arg0;
    k_stacks[RunPtr->pid][STACK_SIZE-1] = (1 << 24); // T=1
    __ASM volatile ("MSR MSP, %0" : : "r" (RunPtr->ksp) : );
    __ASM volatile ("POP {R4-R11}");
    __ASM volatile ("MOV LR, #0xFFFFFFF9");
    __ASM volatile ("BX LR"); //returns from exception
    __ASM volatile("B SysCall_Dummy");

A callback looks like this:

static void SysCall_CallBack_(void* args)
    BSP_Function((int32_t*) args); //BSP function with one argument int32
    exitKernel_(); // leaves cooperatively

6.4. Start-up

The start-up is a critical point. The system starts in base mode. The stacks are assembled. The first task to be performed by the kernel after booting the system is to configure SysTick, switch to user mode and trigger the first user thread .

The assembly routines for the star-up are as follows:

.equ SYSTICK_CTRL, 0xE000E010 
.equ TIME_SLICE,    999
.global kStart 
.type kStart, %function
LDR R0, =RunPtrStart
LDR R1, [R0]
LDR R2, [R1,#4]
MSR MSP, R2   // MSP <- RunPtr.ksp
POP {R4-R11}  //loads stackframe 0 at call stack
POP {R0-R3}
POP {R12}
ADD SP, SP, #4
POP {LR}     //LR <- PC = UsrAppStart
ADD SP, SP, #4
BX  LR // branches to UsrAppStart
//this function manages the stack to run the first user thread
.global UsrAppStart 
.type   UsrAppStart, %function
LDR R1, =RunPtr //R1 <- RunPtr
LDR R2, [R1]        
LDR R2, [R2]
BL  SysTickConf //configures systick
MOV R0, #0x3
MSR CONTROL, R0 //thread unprivileged mode
ISB         // inst set barrier: guarantees CONTROL is updated before going
POP {R4-R11}   //loads stackframe 0
POP {R0-R3}
POP {R12}
ADD SP, SP, #4
POP {LR}       //LR <- PC
ADD SP, SP, #4
MOV R1, #0
STR R1, [R0]  // resets counter
MOV R1, #0x7   // 0b111:
            // 1: Clock source = core clock 
            // 1: Enables irq
            // 1: Enables counter
STR R1, [R0]        
BX  LR      //get back to caller

7. Test

For a little demonstration, we will write on the PC screen via UART. The callback for the system call was written as follows:

static void SysCallUART_PrintLn_(const char* args)
    uart_write_line(UART, args);        
// waits until transmission is done
    while (uart_get_status(UART) != UART_SR_TXRDY); 
    exitKernel_(); // exit kernel cooperatively

It is necessary to be careful when using multitasking to use any shared resource, since we have not yet inserted any inter-process communication mechanism. However, the operation is at a “Guarded Region”, and it will not be interrupted by SysTick. The main program is as follows:

#include <commondefs.h> //board support package, std libs, etc.
#include <kernel.h>  
#include <tasks.h>
int main(void)
  kAddThreads(Task1, (void*)"Task1nr", Task2, (void*)"Task2nr", Task3, (void*)"Task3nr");
  RunPtrStart = &tcbs[0]; 
  RunPtr = &tcbs[1];
  uart_write_line(UART, "Inicializando kernel...nr");
  delay_ms(500); //delay to print screen  : P

The tasks (main threads) look like this:

void Task1(void* args)
    const char* string = (char*)args;
        SysCall(SysCall_Uart_PrintLn, string);

In the figure below, the running system:

8. Conclusions

The use of two stack pointers, one for application and another for the kernel isolates these spaces not allowing the application to corrupt the kernel stack. The privileges prevent the user from overwriting special registers, keeping the kernel safe from application programming errors or malicious code.

Adding another stack pointer to the system required changes to the scheduling routine because we now manipulate two stacks in the domain of two different stack pointers, and both can be preempted. In addition, a cooperative mechanism has also been added for kernel exiting.

The one kernel stack per user stack approach makes the development of kernel or application routines to follow the same pattern from the perspective of who is writing the system. The price to pay is memory overhead and more latency when switching contexts. To mitigate the last, cooperative mechanisms can be added as shown. To mitigate the memory overhead more attention should be put when modeling the tasks (or concurrent units), so they are efficiently allocated.

The system call mechanism is used as an entry point to hardware services, or whatever else we deem critical for the security and stability of the system. This will make even more sense by separating not only the stacks at privilege levels but also the memory regions with the MPU .

For the next publications we will: 1) create IPC mechanisms, 2) add priority levels to the tasks that are running on a fixed round-robin scheme 3) configure the MPU

9. References

[1] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf

[2] The definitive Guide to ARM Cortex M3, Joseph Yiu

[3] https://developer.arm.com/docs/dui0471/j/handling-processor-exceptions/svc-handlers-in-c-and-assembly-language