The Observer Pattern (also known as Publish-Subscribe pattern) provides notification to a set of interested clients that relevant data have changed. The data-server does not need to have any a priori knowledge about its clients. Most commonly data are sent whenever new data arrive, but clients can also be updated periodically.
Figure 1. The Observer Pattern on an automotive software. Speed changes are pushed to a set of observer clients. [Automotive Software Architectures, Staron 2017]
It is a special case of the client-server style. In this blog post I show a simple implementation of this pattern using the sleep/wake-up synchronization mechanisms of the kernel presented so far. The goal is to use the machinery provided by the kernel – namely multitasking and synchronization – to implement this very common pattern.
The publisher server does not know anything about the clients; the clients “subscribe” themselves by sleeping until new data arrive. The code below shows the SensorSet and SensorGet functions, which use kernel calls to wake up all listeners when new data is sensed, and to sleep while no new data arrive, respectively.
Let’s implement what is depicted in Figure 1.
/*
 * @file sensors.h
 */
#ifndef SENSORS_H_
#define SENSORS_H_

typedef struct Sensor Sensor_t;

/* Sensor number */
#define SPEED 1
#define TEMP  2
#define FLOW  3

struct Sensor
{
    int sensorData;
};

int  SensorInit(Sensor_t* sensor);
void SensorSet(Sensor_t* sensor, int data);
int  SensorGet(Sensor_t* sensor);

#endif /* SENSORS_H_ */
/*
 * @file sensors.c
 */
#include "kernel.h"
#include "sensors.h"

int SensorInit(Sensor_t* sensor)
{
    if (sensor != NULL)
    {
        sensor->sensorData = 0;
        return OK;
    }
    return NOK;
}

void SensorSet(Sensor_t* sensor, int data)
{
    if (sensor != NULL)
    {
        sensor->sensorData = data;
        /* wake up all clients */
        kCallWake(&sensor->sensorData);
    }
}

int SensorGet(Sensor_t* sensor)
{
    if (sensor != NULL)
    {
        /* sleep until new data */
        kCallSleep(&sensor->sensorData);
        return sensor->sensorData;
    }
    return NOK;
}
To emulate data changes on a sensor, I will send data via keyboard through UART, following a simple APDU: [SensorNumber, NewValue, Return].
/*
@file tasks.c
this usage example shows an observer pattern for sensors
using sleep/wake up synch mechanisms
*/
#include <stdio.h>
#include <assert.h>
#include "kernel.h"
#include "tasks.h"
#include "sensors.h"
SEMA_t rcvd; /*sema to signal publisher server*/
volatile char uartbuffer[3]; /*apdu buffer*/
volatile int norb = 0; /*n of rcvd bytes*/
volatile char rcvdbyte; /*rcvd byte*/
Sensor_t SpeedSensor; /*speed sensor*/
/*get data changes via uart
Apdu={sensor type, new value, '\r'}
*/
ISR(UART_Handler)
{
    __disable_irq();
    if ((uart_get_status((Uart*)USART_SERIAL) & UART_SR_RXRDY) == UART_SR_RXRDY)
    {
        uart_read((Uart*)USART_SERIAL, &rcvdbyte);
        uartbuffer[norb] = rcvdbyte;
        norb++;
        if (rcvdbyte == '\r')
        {
            kSemaSignal(&rcvd); /* signal publisher */
            norb = 0;
        }
    }
    __enable_irq();
}
/* publisher server */
/* highest priority task */
void PublisherServer(void* args)
{
    assert(SensorInit(&SpeedSensor) == OK);
    kCall2(SEMAINIT, &rcvd, 0); /* kcall to init sema */
    while (1)
    {
        WAIT(&rcvd);
        /* wake subscribers */
        if (uartbuffer[0] == SPEED)
        {
            printf("Speed sensor changed\n\r");
            SensorSet(&SpeedSensor, (int)uartbuffer[1]);
        }
        else
        {
            printf("No route\n\r");
        }
    }
}
/* windshield client */
void WindshieldClient(void* args)
{
    int value;
    while (1)
    {
        /* sleep until new data */
        value = SensorGet(&SpeedSensor);
        printf("Windshield notified. Speed: %d\n\r", value);
        /* do some processing */
    }
}

/* radio volume client */
void RadioClient(void* args)
{
    int value;
    while (1)
    {
        value = SensorGet(&SpeedSensor);
        printf("Radio notified. Speed: %d\n\r", value);
        /* do some processing */
    }
}

/* lane departure client */
void LaneDepartureClient(void* args)
{
    int value;
    while (1)
    {
        value = SensorGet(&SpeedSensor);
        printf("Lane departure notified. Speed: %d\n\r", value);
        /* do some processing */
    }
}
Note that the SensorGet calls in the client tasks do not poll or busy-wait on the sensor: they put the tasks to sleep, and the tasks wake up only when new data arrive.
The figure below shows the outputs on the PC screen.
I do think system programming is one of the nicest things to do: taking a computer board (I/O, CPU, memory) and creating its system software from scratch so it performs some work is awesome.
I have presented a kernel implementation for an ARM Cortex-M3 microprocessor. So far, the features are:
Cooperative mechanism for user threads via software interrupt (not described in the publications)
Preemptive-Cooperative mechanism for kernel threads
System Calls (software interrupts) to separate user threads from kernel threads
… and only that! :O
Just these features took a considerable amount of knowledge of the processor architecture, not-so-basic low-level programming skills, and operating system concepts.
Now we will keep improving our from-scratch kernel and start implementing the next important feature: interprocess communication.
2. CONCEPTS AND MOTIVATION
The term process is perhaps not the most adequate when talking about embedded systems with no memory virtualization. Since all the concurrent execution units share the same heap and globals, as well as the I/Os, the preferred term is task, and it is widely used when talking about embedded kernels. Nevertheless, most of the literature uses task and process interchangeably.
The need for synchronization
A main concept is the idea of concurrent units competing for a limited number of resources on a computer.
«Concurrent Unit is a set of sequential actions in the same thread of execution. Also known as a task or thread.»
Figure 1. A, B, C and D are Concurrent Units: a) Sharing a single core (pseudo-concurrent) b) Each unit has its own core c) When running, only one concurrent unit will be active at a given time (Figure from [2])
For a system to achieve its goals, tasks must cooperate in using the resources.
«Synchronization points are those where tasks will share information and/or ensure that certain preconditions are satisfied before proceeding.»
In Figure 2, the synchronization barrier only allows the program to proceed when all processes have reached that point – when the conditions for all system actions to be correctly synchronized are true.
Figure 2. a) Running processes b) All processes except C are ready c) Once process C reaches the synchronization point, the program moves on. (Figure from [2])
Figure 3 shows concurrent units, modeled as three tasks. What do we know about them? If we analyze it individually, it becomes clear that in Task 1 Action A occurs before Action B, which goes through a conditional path until reaching E. The same reasoning goes for Tasks 2 and 3.
How are the individual actions of Tasks 1, 2, and 3 related to each other? If the (true) answer is “they do not relate”, there is no concurrency problem to be solved. And your system is probably very simple or very well designed.
Figure 3. Concurrent units represented as flowcharts (Figure from [1])
However, if at some point it is necessary to ensure, for example, that Actions F, X and Zeta are executed only after Actions E, Z and Gamma meet certain conditions, we need to somehow join the states of these tasks into logically a single thread, at a synchronization point. After the necessary conditions have been met at the synchronization point, the threads are forked again, and the program continues.
This means that whenever there are conditions between two or more processes to achieve a certain objective, we need to create mechanisms for them to communicate.
When I presented how the System Calls were implemented, there was a call to the UART peripheral like this:
static void SysCallUART_PrintLn_(const char* args)
{
    __disable_irq(); // critical section begin
    uart_write_line(UART, args);
    while (uart_get_status(UART) != UART_SR_TXRDY); // wait until transmission is done
    __enable_irq();  // critical section end
    exitKernel_();   // exit kernel cooperatively
}
When a task calls the system to access the UART, all interrupts are disabled and nothing else happens until the transmission ends. Bad. Instead of waiting for the transmission to end, it would be better to switch to other tasks that are not using the UART.
Semaphores
Among the many valuable concepts on Computer Science Dijkstra has presented, the idea of semaphores for operating systems is a big one. The simplest semaphore is a non-negative integer “s” that accepts two atomic operations (apart from initialization), Wait and Signal:
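To make the semantics concrete, here is a minimal host-side model of the two operations, not the kernel's implementation: atomicity is assumed rather than enforced, and the names (`sema_try_wait`, `sema_signal`) are mine. Wait may proceed only while s > 0, decrementing it; Signal increments it.

```c
/* Illustrative model of Dijkstra's semaphore semantics.
   Atomicity of each operation is assumed, not enforced here. */
typedef int sema_t;

/* Wait (P): returns 1 if the caller may proceed (s was > 0 and is
   decremented), 0 if the caller would block. A real kernel blocks here. */
int sema_try_wait(sema_t* s)
{
    if (*s > 0)
    {
        (*s)--;
        return 1;
    }
    return 0;
}

/* Signal (V): releases one unit of the resource. */
void sema_signal(sema_t* s)
{
    (*s)++;
}
```

A semaphore initialized to 1 behaves as a lock; initialized to N, it admits N concurrent holders.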
It seems simple, but it is not. The use of semaphores can be very error-prone, and they are actually primitives for creating higher-level IPC schemes.
Synchronization [3]
The simplest form of synchronization is that a process A should not proceed beyond a point L1 until some other process B has reached L2 [3]. Examples of such situations arise whenever A requires information at point L1 which is provided by B when it reaches L2. The synchronization of the two processes below is achieved if proceed is a semaphore initialized to 0.
Figure 5. Semaphore used for synchronization. (Figure from [3])
Mutual Exclusion [3]
Non-shareable resources, whether peripherals, files, or data in memory, can be protected from simultaneous access by several processes by preventing the processes from concurrently executing the pieces of program through which access is made. [3] These pieces of program are called critical sections.
Mutual exclusion in the use of resources can be regarded as mutual exclusion in the execution of critical sections, as in the figure below, where mutex is a semaphore.
Figure 4. Semaphore for mutual exclusion. If the initial value of mutex is 1, the first process to grab it will lock the resource. (Figure from [3])
When processes compete for resources, we can get into a situation where none of them can move forward. Let A and B be processes. A is blocked, waiting for a resource to be freed to move on. B in turn will only release this resource when a certain condition is satisfied. If this condition depends on A, which is already blocked, we have a deadlock.
Spin locking and Blocking
The simplest semaphore implementation is called a spinlock, because it only protects a resource from being accessed; it does not prevent a task that needs that resource from being dispatched. In Figure 6, if Task1 is preempted while still using the resource, Task2 runs even though it performs nothing – it is spinning. At least we have not disabled all interrupts, and Task3 can perform useful work.
Figure 6. Spin locking. The processor is granted to Task2 even though it is locked.
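For concreteness, a spinlock can be sketched over an atomic flag. This is a host-side illustration, not the kernel's code; it assumes the GCC/Clang `__atomic` builtins are available. Its Wait is nothing more than a busy loop on a try-lock.

```c
#include <stdbool.h>

/* Minimal spinlock sketch (GCC/Clang __atomic builtins assumed). */
static volatile bool lock_flag = false;

/* Try to take the lock once; returns true on success. */
bool spin_trylock(void)
{
    /* test-and-set returns the previous value: false means we got it */
    return !__atomic_test_and_set((void*)&lock_flag, __ATOMIC_ACQUIRE);
}

/* A spinning Wait is just a busy loop on the try-lock. */
void spin_lock(void)
{
    while (!spin_trylock())
    {
        /* spinning: the task burns its time slice doing nothing useful */
    }
}

void spin_unlock(void)
{
    __atomic_clear((void*)&lock_flag, __ATOMIC_RELEASE);
}
```

The busy loop in spin_lock is exactly the waste illustrated by Task2 in Figure 6.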
To solve the spin-locking problem we need to tell the scheduler that a task waiting for a resource cannot be dispatched. It is “blocked” until the task using the resource signals the semaphore. This means the Wait and Signal operations need to be able to switch a thread’s status between blocked and runnable.
Figure 7. Blocking Semaphore: Task 2 will not be schedulable until resource is released.
3. DESIGN
Big Picture
The initial design and implementation of the kernel were presented in great detail in two publications, starting from the round-robin scheduler with a single stack pointer up to a scheduler managing user and supervisor threads, the latter accessed through System Calls.
Now, including a Semaphore mechanism to serialize access to a resource, the kernel model is represented on UML:
Figure 8. Kernel model serializing access to a single resource. Note how dependent on the semaphore the thread control flow is. Bad IPC can create hard-to-find bugs.
A given resource might have a semaphore to manage how it is accessed. When a task calls Wait on a semaphore associated with a resource that has no slots available, the thread is blocked. When a task calls Signal on a semaphore and the resource is released, the next blocked thread on that semaphore’s queue is set as runnable. With some effort other criteria can be added: priority, aging, etc.
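As a sketch of one such criterion, the scan over the TCB ring could pick the highest-priority waiter instead of the first one found. This is a host-side illustration, not the kernel's code: the next and block_pt fields follow the article's TCB, but the priority field is hypothetical.

```c
#include <stddef.h>

/* Hypothetical TCB slice for illustration; 'priority' does not exist
   in the kernel presented here. */
typedef struct tcb
{
    struct tcb* next; /* circular list, as in the kernel */
    void* block_pt;   /* semaphore blocking this thread, or NULL */
    int priority;     /* hypothetical: larger value = more urgent */
} tcb_t;

/* One pass over the ring: return the highest-priority thread blocked
   on semaphore s, or NULL if no thread is blocked on it. */
tcb_t* pick_waiter(tcb_t* start, void* s)
{
    tcb_t* best = NULL;
    tcb_t* t = start;
    do
    {
        if (t->block_pt == s && (best == NULL || t->priority > best->priority))
        {
            best = t;
        }
        t = t->next;
    } while (t != start);
    return best;
}
```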
A Task has a Thread Control Block which is associated with the Scheduler. The Scheduler is responsible for selecting (scheduling), dispatching, and saving/resuming user/kernel threads.
The Scheduler flowchart updated considering the blocked threads:
Figure 9. Scheduler flowchart.
So far, the description of a Semaphore I gave took for granted how the Wait and Signal operations are designed. Some authors [2] use the terms Down and Up, respectively.
Blocking Semaphore
I will use an approach based on [4]. A blocking semaphore is a signed integer variable. Every time it is affected by a Signal operation, it is incremented. Every time it is affected by a Wait operation, it is decremented. This way, a value N > 0 means a given resource has N slots to be shared. A value of 0 means it is taken. A negative value, say -2, means it is taken by a thread and two other threads are also waiting for it. With this approach we can infer the number of tasks blocked on the queue.
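The signed-counter bookkeeping can be shown in isolation. A host-side sketch of how the value encodes both free slots and queued waiters (the names are mine; the kernel's actual syscalls follow in the Implementation section):

```c
/* Signed-counter semaphore bookkeeping from [4], modeled in isolation.
   Positive: free slots. Zero: taken. Negative: number of blocked waiters. */
typedef int ksema_model_t;

void model_wait(ksema_model_t* s)   { (*s)--; } /* caller blocks if *s < 0 afterwards */
void model_signal(ksema_model_t* s) { (*s)++; } /* wake one waiter if *s <= 0 afterwards */

/* How many threads are queued on this semaphore? */
int model_blocked(ksema_model_t s)  { return (s < 0) ? -s : 0; }
```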
Below is the flowchart of the described semaphore operations. Note they are system calls; that is why the last step is to exit the kernel, as I have been doing for the other system calls.
Figure 10. Flowchart for Semaphore operations
4. IMPLEMENTATION
The kSemaphore_t type was defined as an int32_t. The Wait and Signal procedures are System Calls, and the callbacks are written as below, following the same pattern for System Calls already presented. exitKernel_() tells the system to leave the kernel and call the user thread switcher (an unblocked user thread will be launched/resumed).
static void SysCallSemaSignal_(kSemaphore_t* s)
{
    tcb_t* next_tcb_ptr;
    __disable_irq();
    (*s) = (*s) + 1;
    if ((*s) <= 0)
    {
        /* switching RunPtr->next, not RunPtr! */
        next_tcb_ptr = RunPtr->next;
        while (next_tcb_ptr->block_pt != s)
        {
            next_tcb_ptr = next_tcb_ptr->next;
        }
        next_tcb_ptr->block_pt = 0; /* unblocks a thread blocked by s */
    }
    __enable_irq();
}

static void SysCallSemaWait_(kSemaphore_t* s)
{
    __disable_irq();
    (*s) = (*s) - 1;
    if ((*s) < 0)
    {
        RunPtr->block_pt = s; // reason it is blocked
        __enable_irq();
        exitKernel_();
    }
    __enable_irq();
}
To make a System Call to a Semaphore, the API is the following:
// General SysCall API
SysCall(SysCall_ID, arguments)
// Semaphores
// Wait Procedure:
SysCall(SemaWait, kSemaphore_t* sema)
// Signal Procedure:
SysCall(SemaSignal, kSemaphore_t* sema)
4.1 Changes
Some small but critical changes have been made to the code. The Thread Control Block data structure now has a block field that is a pointer to a semaphore: as explained, the one that blocked the thread.
The scheduler routine within the SysTick handler has also changed. First and most important, the scheduling routine is now handled by a lower-priority trap, PendSV; SysTick only triggers PendSV.
Besides SysTick delegating the switching to a lower priority software interrupt, there is the addition of the kGetNext routine:
//@File system.s
.global kGetNext
.type kGetNext, %function

//@File kernel.h
extern void kGetNext(void);

//@File kernel.c
/*
 * Schedules the next unblocked thread
 */
void kGetNext(void)
{
    __disable_irq();
    RunPtr = RunPtr->next;
    /* !if there is a deadlock, the kernel will hang in this loop! */
    while (RunPtr->block_pt || RunPtr->sleeping)
    {
        RunPtr = RunPtr->next;
    }
    __enable_irq();
    __ASM volatile("BX LR"); // dispatch
}
5. VALIDATION
Producer-consumer problem
Figure 11. Producer and Consumer processes sharing a FIFO [4]
A set of “producer” processes and a set of “consumer” processes communicate by inserting and extracting items in a buffer. The buffer has size N. There are two main criteria to achieve synchronization: producers cannot write to a full buffer and consumers cannot read from an empty buffer. That is, for i inserted and e extracted items: 0 ≤ i − e ≤ N. Furthermore, there must be a mutex so only one process accesses the buffer at a time.
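The two criteria can be captured by guarding insertion and extraction with the difference i − e. A host-side counter model (names are illustrative; no actual buffer or semaphores yet):

```c
/* Counter model of the bounded-buffer criteria: insertion requires
   i - e < N (free space), extraction requires i - e > 0 (an item). */
#define BUF_N 10
static int ins = 0, ext = 0;

int try_insert(void)
{
    if (ins - ext < BUF_N) { ins++; return 1; }
    return 0; /* full: a producer must wait */
}

int try_extract(void)
{
    if (ins - ext > 0) { ext++; return 1; }
    return 0; /* empty: a consumer must wait */
}
```

The semaphore solution below enforces exactly these guards, but by blocking the task instead of returning a failure.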
In [3] the author proves that the following algorithm solves the producer-consumer problem.
Figure 12. Algorithm for a Producer-Consumer [3]
Space Available is a semaphore initialized to N. Buffer Manipulation is initialized to 1. Item Available is initialized to 0.
Now let’s see if we can reproduce this solution with the implemented Semaphores and Scheduler. Task1, the producer, increments a counter and inserts it into a FIFO, when there is space and the buffer is unlocked. Task2 reads this value when there is an item and the resource is unlocked.
The code below does not encapsulate the semaphores within the Data Queue, i.e., the FIFO methods do not handle the synchronization themselves.
#include "kernel.h"
#include "tasks.h"
#include "dataqueue.h" // FIFO service

/* Semaphores */
kSemaphore_t space = FIFOSIZE; // FIFOSIZE = 10
kSemaphore_t item  = 0;
kSemaphore_t mutex = 1;

/* Producer */
void Task1(void* args)
{
    volatile int counter1 = 0;
    while (1)
    {
        counter1++;
        SysCall(SemaWait, &space);   /* wait until there is space */
        SysCall(SemaWait, &mutex);   /* lock */
        osFifoPut(counter1);         /* insert into fifo */
        SysCall(SemaSignal, &mutex); /* unlock */
        SysCall(SemaSignal, &item);  /* signal there is an item */
    }
}

/* Consumer */
void Task2(void* args)
{
    volatile int read = 0;
    while (1)
    {
        SysCall(SemaWait, &item);    /* wait until there is an item */
        SysCall(SemaWait, &mutex);   /* lock */
        read = osFifoGet();          /* read the last written data */
        SysCall(SemaSignal, &mutex); /* unlock */
        SysCall(SemaSignal, &space); /* signal there is space */
    }
}
I tested preemptive-only and cooperative-only scheduling, and also preemptive+cooperative (although not presented, there is a cooperative mechanism for the user threads too).
The FIFO has 10 positions. In my tests the program never hangs, and the buffer never overflows or underflows. Well, I have nothing fancy to show here but my debugging window:
Figure 13. Producer-consumer code on debug
Now let’s make something a little more elaborate to test our synchronization primitives, and see it happening on the PC screen.
Let’s add a third task. It starts and blocks, waiting on a semaphore. This semaphore is signaled whenever the consumed number is divisible by 1000; the task then sleeps for 5 system ticks, wakes up, and starts all over again.
#include <stdio.h>
#include <asf.h>
#include "kernel.h"
#include "tasks.h"
#include "dataqueue.h" // FIFO

/* Semaphores */
kSemaphore_t space = FIFOSIZE; // FIFOSIZE = 10
kSemaphore_t item  = 0;
kSemaphore_t mutex = 1;
kSemaphore_t sleep = 0;

static inline void printline_(const char* string)
{
    usart_write_line(USART_SERIAL, string);
}

/* Producer */
void Task1(void* args)
{
    osFifoInit();
    volatile int counter1 = 0;
    while (1)
    {
        counter1++;
        SysCall(SemaWait, &space);   /* wait until there is space */
        SysCall(SemaWait, &mutex);   /* lock */
        osFifoPut(counter1);         /* insert into fifo */
        SysCall(SemaSignal, &mutex); /* unlock */
        SysCall(SemaSignal, &item);  /* signal there is an item */
    }
}

/* Consumer */
void Task2(void* args)
{
    volatile int read = 0;
    while (1)
    {
        SysCall(SemaWait, &item);    /* wait until there is an item */
        SysCall(SemaWait, &mutex);   /* lock */
        read = osFifoGet();          /* read the last written data */
        if (read % 1000 == 0) { SysCall(SemaSignal, &sleep); } /* signal task 3 */
        SysCall(SemaSignal, &mutex); /* unlock */
        SysCall(SemaSignal, &space); /* signal there is space */
    }
}

void Task3(void* args)
{
    while (1)
    {
        printline_("Task3 active\n\r");
        SysCall(SemaWait, &sleep);
        printline_("Task3 is going to sleep\n\r");
        SysCallSleepTicks(5);
    }
}
If you are wondering why the call to the UART is not a system call: the design is shifting to a microkernel approach, in which drivers run in user space. The sleep implementation has not been presented yet either.
Here is the output:
On the next publication we will leverage the use of these primitives on the design and implementation of IPC mechanisms.
[1] Design Patterns for Embedded Systems in C (Bruce Powel Douglass, 2009)
[2] Modern Operating Systems (Andrew Tanenbaum, Herbert Bos, 2014)
[3] Fundamentals of Operating Systems (A. M. Lister, 1984)
[4] Embedded Systems: Real-Time Operating Systems for ARM Cortex-M Microcontrollers (Jonathan Valvano, 2012)
ARM Cortex-M processors are found in SoCs across several application domains, especially in much of what we call smart devices. This publication continues the previous one, in which I demonstrated the implementation of a minimal preemptive scheduling mechanism for an ARM Cortex-M3, taking advantage of the special hardware resources for context switching.
Other architecture features are now explored: the separation of user and kernel threads and the use of Supervisor Calls to implement system calls.
Although concepts about operating systems are addressed because they are inherent to the subject, the main goal is to explore the ARM Cortex-M3, a relatively inexpensive processor with wide applicability.
2. Special registers
There are three sets of special registers on the ARM Cortex-M3. You can consult the ARM documentation to understand the role of each of them in the processor. The most important one in this publication is CONTROL:
Program Status Registers (APSR, IPSR, and EPSR)
Interrupt Mask Registers (PRIMASK, FAULTMASK and BASEPRI)
Control (CONTROL)
Special registers can only be accessed through the privileged instructions MRS (ARM Read Special Register) and MSR (ARM Set Special Register):
MRS R0, SPECIAL  // load into R0 the current value of the special register
MSR SPECIAL, R0  // load into the special register the value contained in R0
The CONTROL register has only 2 configurable bits. When an exception handler (e.g., SysTick_Handler) is running, the processor is in privileged mode and uses the main stack pointer (MSP), with CONTROL[1] = 0 and CONTROL[0] = 0. In other routines that are not handlers, this register can assume different values depending on the software implementation (Table 1).
In the small kernel shown before, the application tasks (Task1, Task2 and Task3) were also executed in privileged mode and using the main stack pointer (MSP). Thus, an application program could change the special registers of the core if it wanted to.
3. Kernel and user domains
In the last publication I highlighted the fact that register R13 is not part of the stackframe, as it stores the address of the current stack pointer. R13 is a “banked” register, meaning that it is physically replicated and takes one value or another depending on the state of the core.
CTRL[1] (0 = MSP / 1 = PSP) | CTRL[0] (0 = Priv / 1 = Non-priv) | State
0 | 0 | Privileged handler* / Base mode
0 | 1 | Unprivileged
1 | 0 | Privileged thread
1 | 1 | User thread
Table 1. Possible states of the CONTROL register. *In exception handlers this mode is always active, even if CTRL[0] = 1.
Having two stack pointers, one for the application and another for the kernel, means that a user thread cannot easily corrupt the kernel stack through an application programming error or malicious code. According to the ARM manuals, a robust operating system typically has the following characteristics:
interrupt handlers use MSP (by default)
kernel routines are activated through SysTick at regular intervals to perform task scheduling and system management in privileged mode
user applications use PSP in non-privileged mode
memory for kernel routines can only be accessed in privileged mode* and use MSP
* for now we will not isolate the memory spaces
4. System Calls
Putting it simply, a system call is a method by which software requests a service from the kernel or OS it is running on. If we intend to separate our system into privilege levels, it is inevitable that the application level needs to call the kernel to access, for example, hardware services or whatever else we consider critical to the security and stability of our system.
A common way to implement system calls on the ARM Cortex-M3 (and other ARMv7 cores) is to use the Supervisor Call (SVC) software interrupt. The SVC acts as an entry point for a service that requires privileges to run. The only input parameter of an SVC is its number (ASM instruction: SVC #N), which we associate with a function call (callback). Unlike other software-triggered exceptions, like PendSV (Pendable Service Call), the SVC can be triggered in user mode by default.
Figure 2. Block diagram of a possible operating system architecture. Supervisor Calls act as an entry point for privileged services. [2]
5. Design
5.1 Using two stack pointers
To use the two available stack pointers (MSP and PSP) it is essential to understand 2 things:
The control register manipulation: it is only possible to write or read the CONTROL register in handler mode (within an exception handler) or in privileged threads.
The exceptions mechanism: when an interrupt takes place, the processor saves the contents of registers R0-R3, LR, PC and xPSR, as explained in the previous publication. The value of the LR when we enter an exception indicates the mode the processor was running, when the thread was interrupted. We can manipulate this value of LR together with the manipulation of the stack pointer to control the program flow.
LR | Behavior of BX LR
0xFFFFFFF9 | Returns to “base” mode, privileged MSP (CONTROL = 0b00)
0xFFFFFFFD | Returns to user mode (PSP, with the privilege level of entry) (CONTROL = 0b1x)
0xFFFFFFF1 | Returns to the previous interrupt, when a higher-priority interrupt occurred during a lower-priority one
Table 2. Exception return values
5.1.1. One kernel stack for each user stack
Each user stack will have a corresponding kernel stack (one kernel stack per thread). Thus, each Task is associated with a kernel stack and a user stack. Another approach would be a single kernel stack for the entire system (one kernel stack per processor). The advantage of the first approach is that, from the point of view of whoever implements the system, the programs that run in the kernel follow the same development pattern as the application programs. The advantage of the second approach is less memory overhead and lower context-switching latency.
Figure 3. Each user stack has an associated kernel stack
5.2 Kernel entry and exit mechanisms
In the previous publication, the interruption triggered by SysTick handled the context switching, i.e., it interrupted the running thread, saved its stackframe, searched for the next thread pointed to by the next field in the TCB (thread control block) structure and resumed it.
With the separation between user and supervisor spaces, we will have two mechanisms to get in and out of the kernel: system calls, explicitly invoked in code, and the SysTick interrupt, which implements the scheduling routine. Although still using a round-robin scheme in which each task has the same time slice, the kernel threads also work cooperatively with user threads, that is: when there is nothing more to be done, the kernel can explicitly return. If a kernel thread takes longer than the time between one tick and the next, it is interrupted and rescheduled. User tasks could also use a similar mechanism; however, for simplicity of exposition, I chose to leave user tasks in a fixed round-robin scheme, with no cooperative mechanisms.
5.2.1. Scheduler
The flowchart of the preemptive scheduler to be implemented is in Figure 4. The start-up of the kernel and user application is also shown for clarity. The kernel starts up and voluntarily triggers the first user task. At every SysTick interrupt, the running thread has its state saved and the next scheduled task is resumed according to the state in which it was interrupted: kernel or user mode.
Figure 4. Scheduler flowchart
5.2.2 System Calls
System Calls are needed when the user requests access to a privileged service. In addition, I also use the same mechanism for a kernel thread to cooperatively return to the user thread.
Figure 5. System Call flowchart
6. Implementation
Below I explain the code created to implement the proof of concept. Most of the kernel itself is written in assembly, except for a portion of the supervisor call handler that is written in C with some inline assembly. In my opinion, embedding assembly in C code is even more cumbersome and error-prone than writing in assembly. The toolchain used is GNU ARM.
6.1. Stacks
There is nothing special here, except that now, in addition to the user stack, we declare another array of integers for the kernel stack. These are associated in the Thread Control Block.
int32_t p_stacks[NTHREADS][STACK_SIZE]; // user stack
int32_t k_stacks[NTHREADS][STACK_SIZE]; // kernel stack
6.2. Task Scheduler
The main difference from the scheduler shown in the last publication is that we now handle two different stack pointers: MSP and PSP. Thus, when entering an exception handler, the portion of the stackframe saved automatically depends on the stack pointer in use when the exception took place. However, in the exception routine, the active stack pointer is always the MSP. Therefore, to manipulate one stack pointer while operating with the other, we cannot use the PUSH and POP pseudo-instructions, because they take the active stack pointer as their base address. We have to replace them with the instructions LDMIA (load multiple and increment after) for POP, and STMDB (store multiple decrement before) for PUSH, with the writeback sign “!” at the base address [1].
// Example of POP
MRS R12, PSP          // read the value of the process stack pointer into R12
LDMIA R12!, {R4-R11}  // R12 contains the base address (PSP)
/* the address initially contained in R12 holds the value of R4;
   [R12]+0x04 holds the value of R5, and so on until [R12]+0x1C holds
   the value of R11. R12 itself is incremented by 32 bytes */
MSR PSP, R12          // PSP is updated to the new value of R12
// Example of PUSH
MRS R12, MSP          // read the value of the main stack pointer into R12
STMDB R12!, {R4-R11}
/* relative to the initial value of R12: [R12]-0x20 holds R4,
   [R12]-0x1C holds R5, ..., [R12]-0x04 holds R11.
   R12 itself is decremented by 32 bytes */
MSR MSP, R12          // MSP is updated to the new value of R12
Another difference is that the TCB structure now needs to contain a pointer to each of the stack pointers of the thread it controls, and also a flag indicating whether the task to be resumed was using MSP or PSP when it was interrupted.
// thread control block
struct tcb
{
    int32_t* psp;        // psp saved from the last interrupted thread
    int32_t* ksp;        // ksp saved from the last interrupted kernel thread
    struct tcb* next;    // points to the next tcb
    int32_t pid;         // task id
    int32_t kernel_flag; // 0=kernel, 1=user
};
typedef struct tcb tcb_t;

tcb_t tcb[NTHREADS]; // tcb array
tcb_t* RunPtr;
The scheduler routines are shown below. The code was written to be clear in its intention, without trying to save instructions. Note that in line 5 the value of LR at exception entry is only compared with 0xFFFFFFFD; if the comparison fails, it is assumed to be 0xFFFFFFF9. This is because I guarantee there will be no nested interrupts (SysTick never interrupts an SVC, for example), so LR should never hold 0xFFFFFFF1. In anything other than a proof of concept, this test should be reconsidered.
The implementation of system calls uses the SVC handler. As stated, SVC takes a single input parameter (ARM makes it sound like an advantage…), the number we associate with a callback. But then how do we pass the arguments forward to the callback, if system calls can handle only one parameter? They need to be retrieved from the stack. The AAPCS (ARM Application Procedure Call Standard), which compilers follow, says that when a function (caller) calls another function (callee), the callee expects its arguments to be in R0-R3. Likewise, the caller expects the callee’s return value to be in R0. R4-R11 must be preserved between calls. R12 is the scratch register and can be freely used.
No wonder that when an exception takes place the core saves (pushes) the registers R0-R3, R12, LR, PC and xPSR of the interrupted function, and when returning pops them back into the core registers. It is fully prepared to return to the point where it was interrupted. But if we change the context, that is, if after the interrupt we do not return to the same point we were at before, the remaining stackframe must be saved explicitly so the thread can be properly resumed later. It is essential to follow the AAPCS if we want to invoke functions written in assembly from C code and vice versa.
For system calls, I defined a macro function in C that receives the SVC number and the arguments for the callback (the syntax of inline assembly depends on the compiler used). Beware that you need to tell the compiler R0 is a clobbered register, so it will not keep another value in it.
The args value is stored in R0 and the SVC instruction is issued with the immediate "svc_number". When the SVC is triggered, R0-R3 are automatically saved to the stack. The code was written as follows, without trying to save instructions, for clarity:
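A sketch of such a macro is shown below, assuming GCC-style extended inline assembly; the name kSysCall and the exact operand constraints are illustrative, since the original listing is not reproduced here and the syntax depends on the compiler:

```c
/* Hedged sketch of a system call macro (GCC extended asm assumed).
 * The args pointer goes into R0; the SVC immediate selects the kernel
 * callback. R0 must be declared clobbered so the compiler does not
 * cache another value there across the call. */
#define kSysCall(svc_number, args)                       \
    __asm volatile (                                     \
        "MOV R0, %[a]  \n\t"   /* args -> R0        */   \
        "SVC %[n]      \n\t"   /* trap to the kernel */  \
        :                                                \
        : [a] "r" (args), [n] "I" (svc_number)           \
        : "r0", "memory")
```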
The rest of the routine for entering the kernel is written in C [2, 3]. Note that the assembly routine ends with a simple branch (line 14), so at this point we have not yet returned from the exception handler.
The svc_number, in turn, is retrieved by walking two bytes back (hence the cast to char) from the address held in the stacked PC, which sits 6 words above R0 in the stack frame [1, 2, 3]. Note that R0 had to receive the value of PSP right after entering the interrupt, before the rest of the stack frame was saved (lines 2 and 13 of the assembly code).
After retrieving the system call number and its arguments, the MSP is overwritten with the value stored in the TCB. Then we change the value of LR so the exception returns to base mode. In this implementation the callback does not run within the handler. When the BX LR instruction executes, the remainder of the stack frame is automatically restored to the core registers.
static void SysCall_CallBack_(void* args)
{
BSP_Function((int32_t*) args); // BSP function taking an int32_t* argument
exitKernel_(); // leaves cooperatively
}
6.4. Start-up
The start-up is a critical point. The system starts in base mode and the stacks are assembled. The first task the kernel performs after booting is to configure SysTick, switch to user mode and trigger the first user thread.
The assembly routines for the start-up are as follows:
.equ SYSTICK_CTRL, 0xE000E010
.equ TIME_SLICE, 999
.global kStart
.type kStart, %function
kStart:
LDR R0, =RunPtrStart
LDR R1, [R0]
LDR R2, [R1,#4]
MSR MSP, R2 // MSP <- RunPtr.ksp
POP {R4-R11} //loads stackframe 0 at call stack
POP {R0-R3}
POP {R12}
ADD SP, SP, #4 // skip stacked LR
POP {LR} //LR <- PC = UsrAppStart
ADD SP, SP, #4 // skip stacked xPSR
BX LR // branches to UsrAppStart
//this function manages the stack to run the first user thread
.global UsrAppStart
.type UsrAppStart, %function
UsrAppStart:
LDR R1, =RunPtr //R1 <- RunPtr
LDR R2, [R1]
LDR R2, [R2]
MSR PSP, R2
BL SysTickConf //configures systick
MOV R0, #0x3
MSR CONTROL, R0 // CONTROL=3: unprivileged thread mode, using PSP
ISB // instruction barrier: guarantees CONTROL is updated before proceeding
POP {R4-R11} //loads stackframe 0
POP {R0-R3}
POP {R12}
ADD SP, SP, #4 // skip stacked LR
POP {LR} //LR <- PC
ADD SP, SP, #4 // skip stacked xPSR
BX LR
SysTickConf:
LDR R0, =SYSTICK_CTRL
MOV R1, #0
STR R1, [R0] // resets counter
LDR R1, =TIME_SLICE
STR R1, [R0,#4] // RELOAD <- TIME_SLICE
STR R1, [R0,#8] // any write clears CURR_VALUE
MOV R1, #0x7 // 0b111:
// 1: Clock source = core clock
// 1: Enables irq
// 1: Enables counter
STR R1, [R0]
BX LR //get back to caller
7. Test
As a small test, we will write on the PC screen via UART. The callback for the system call was written as follows:
static void SysCallUART_PrintLn_(const char* args)
{
__disable_irq(); //guarded begin
uart_write_line(UART, args);
// waits until the transmitter is ready (the status register is a bitmask)
while (!(uart_get_status(UART) & UART_SR_TXRDY));
__enable_irq(); // guarded end
exitKernel_(); // exit kernel cooperatively
}
Be careful when using multitasking to access any shared resource, since we have not yet added any inter-process communication mechanism. Here, however, the operation runs within a "Guarded Region" and will not be interrupted by SysTick. The main program is as follows:
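Since the main program listing is not reproduced here, the sketch below shows one plausible shape for it. kAddThread and its signature, the SYSCALL_UART_PRINTLN number, the kSysCall macro and the two tasks are all assumptions for illustration; only kStart corresponds to the assembly entry shown above:

```c
/* Hedged sketch of a possible main program. kAddThread, kSysCall and
 * SYSCALL_UART_PRINTLN are hypothetical names; kStart is the assembly
 * start-up routine shown earlier and never returns. */
#include "kernel.h"

static void Task1(void* args)
{
    while (1)
        kSysCall(SYSCALL_UART_PRINTLN, "Hello from Task1\n\r");
}

static void Task2(void* args)
{
    while (1)
        kSysCall(SYSCALL_UART_PRINTLN, "Hello from Task2\n\r");
}

int main(void)
{
    kAddThread(Task1, NULL); /* assumed registration call */
    kAddThread(Task2, NULL);
    kStart();                /* hands control to the first user thread */
}
```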
The use of two stack pointers, one for the application and another for the kernel, isolates these spaces, preventing the application from corrupting the kernel stack. Privilege levels prevent the user from overwriting special registers, keeping the kernel safe from application programming errors or malicious code.
Adding another stack pointer to the system required changes to the scheduling routine, because we now manipulate two stacks in the domain of two different stack pointers, and both can be preempted. In addition, a cooperative mechanism was added for exiting the kernel.
The one-kernel-stack-per-user-stack approach lets kernel and application routines follow the same development pattern from the perspective of whoever is writing the system. The price to pay is memory overhead and extra latency when switching contexts. To mitigate the latency, cooperative mechanisms can be added, as shown. To mitigate the memory overhead, more attention should be paid when modeling the tasks (or concurrent units), so they are allocated efficiently.
The system call mechanism is used as an entry point to hardware services, or to whatever else we deem critical to the security and stability of the system. This will make even more sense once we separate not only the stacks at different privilege levels but also the memory regions, using the MPU.