Designing and implementing IPC mechanisms on an ARM Cortex-M3 Kernel (2/2)

6. PROCESS COMMUNICATION

Process communication refers to schemes or mechanisms that allow processes to exchange information. It can be accomplished in many different ways, all of which depend on process synchronization. [1] ITC, or Inter-Task Communication, is the term commonly used in the embedded domain.

Now that we have validated that our semaphores can synchronize tasks running concurrently in our system, it is time to use them to build higher-level communication mechanisms.

Microkernel architecture

The overall design goal is a microkernel: some tasks will implement servers, other tasks will be clients, and they need to communicate.

Figure 14. The target architecture

Shared Memory, Pipes, and Message Passing are common communication schemes in operating systems [1, 2, 3]. Shared Memory protected by simple mutexes is effective for direct task communication within the same address space. Pipes are effective for sharing streams of data. Message Passing is used to exchange discrete messages ready to be processed.

6.1 Synchronous versus Asynchronous mechanisms

Communication can be synchronous or asynchronous. The latter means that when a client calls a server, it is free to perform other work while the server has not yet answered. The former means that after sending, the client blocks until it gets an answer.

Figure 15. General client-server synchronous communication.

Asynchronous communication is intended mainly for loosely-coupled systems, in which interprocess communication is infrequent, i.e., processes do not exchange messages on a planned or regular basis. For such systems, asynchronous communication is more suitable due to its greater flexibility. [1]

Synchronous communication is well suited to tightly-coupled systems in which processes exchange messages on a planned or regular basis. In such a system, processes can expect messages to come when they are needed, and the usage of message buffers is carefully planned. [1]

7. MESSAGE PASSING

Message Passing is a general form of inter-task communication mechanism. In operating systems it is the basis of the so-called microkernel; in distributed systems it is used for RPCs (remote procedure calls), and it is also the basis of client-server programming on computer networks. [1, 2, 3]

Message Passing can be synchronous or asynchronous. In the asynchronous form, both send and receive operations are non-blocking; in the synchronous form, both are blocking.

In this publication I will show a synchronous message passing mechanism. As said, Message Passing is the basis of a microkernel: the messages will convey requests to be processed by servers.

7.1. Design and Implementation of a Synchronous Message Passing

In a simple client-server communication system, there are two kinds of processes: servers and clients. Clients can only communicate with servers, and servers can only communicate with clients. Each client has either a guaranteed or dedicated message buffer. After sending a request to a server, a client immediately waits for a reply since there is nothing else the client can do. Upon receiving a client request, a server handles the request, which may require some time delay, but it eventually sends a reply to the client. In more complex server-client systems, there are processes that can only be clients, processes that can only be servers and processes that can be both. [1,2]

The design and implementation presented here is based on [1, 2], and processes are either clients or servers, never both.

7.2 Design

We can extend the producer-consumer problem to N messages. [3]

buffer[N], N = number of messages
process producer:
 I=0
 while (TRUE):
  produce contents
  receive buffer[I] from consumer
  put contents on buffer
  send to consumer
  I=(I+1)%N
process consumer:
 for K=0 to K=N-1 send empty messages to producer
 I=0
 while (TRUE):
  receive buffer[I] from producer
  extract contents from message
  put empty item on buffer
  send to producer
  I=(I+1)%N

Based on this algorithm, but extended to multiple producers and consumers, and using a message queue to sequence message processing, the proposed design is as follows:

  • The message buffers are arranged on a linked list. The next field of the last buffer of the list points to zero, so we know it is the last. The message buffer data structure contains the sender PID, the message contents and the address of the next linked buffer.
  • Every thread has a pointer to a Message Buffer on its Thread Control Block. Since there is a limited number of buffers available on the system, when a client wants to send a message to a server, it must fetch a free buffer from a list. If there is a free buffer, it dequeues its address from the list, writes the message, enqueues it on the receiver message queue and signals the receiver there is a new message. Initially, all buffers are free.
  • When a receiver is signaled, it dequeues the message buffer from its own TCB, reads its contents, and releases the buffer, putting it back on the free list. The figure below shows a possible state for a system with seven message buffers. Three of them are taken, waiting to be read by the receiver threads; that is, three send operations have occurred and no matching receive operations have taken place yet.
  • The number of free buffers is controlled by a counting semaphore whose maximum count is the total number of buffers available on the system. Access to the free buffer list is a critical region (CR) contended for by senders, so it is protected by a mutex. The same reasoning applies to the message queue of every thread control block: each has its own mutex.
Figure 16. Possible state of a Message Passing with three messages waiting to be read by receivers.

Note that for asynchronous communication only the sending and receiving algorithms would change; the architecture depicted in Figure 16 would still apply.

7.3 Implementation

Like the design, the implementation is simple. The hardest part is the synchronization; the linked queue may also need some special attention. The send-receive interface must be a kernel call if the buffers are stored in kernel space (if you have an MMU). I hope the implementation mirrors the design idea:

/******************/
/*@File message.h*/
/*****************/
/*Message Buffer Data Structure*/
#define MSG_SIZE 32
typedef struct mbuff
{
	struct mbuff *next;
	int			 sender_pid;
	char		 contents[MSG_SIZE];
}MBUFF_t;
/*Interface*/
/*initialize free list*/
extern void init_mbuffs(void);
/*send *msg to pid, return OK or NOK*/
extern int sendmsg(char *msg, int pid); 
/*receive msg and store at *msg, return sender PID, 
0 if fails (0 is a reserved PID)*/
extern int recvmsg(char *msg); 
/****************/
/*@File kernel.h*/
/****************/
/*Thread control block with message passing fields*/
struct tcb 
{
	int32_t*		psp;			/*last saved process stack*/
	int32_t*		ksp;			/*last saved kernel stack*/
	struct tcb		*next;			/*next tcb*/
	int32_t			kernel_flag;	/*kernel flag 1=user thread, 0=kernel thread*/
	int32_t			pid;			/*process id*/
	kSemaphore_t*	block_pt;		/*blocked 0=not blocked, semaphore address=blocked*/
	uint32_t		sleeping;		/*sleep counter*/
	kSemaphore_t	nmsg;			/*new msg sema*/
	kSemaphore_t	mlock;			/*thread msg queue sema*/
	MBUFF_t			*mqueue;        /*thread msg queue*/
};
/******************/
/*@File message.c*/
/*****************/
/************************************************************************/
/* Synch Message Passing                                                 */
/************************************************************************/
MBUFF_t mbuff[NMBUF]; /*message buffers*/
MBUFF_t *mbufflist=NULL; /*free message buffers*/
kSemaphore_t nmbuf=NMBUF; /*counting semaphore*/
kSemaphore_t mlock=1; /*sema for mbufflist*/
static inline void copyString_(char *dest, char* src)
{
	for (int i = 0; i<MSG_SIZE; ++i) { dest[i] = src[i] ; }
}
static int menqueue(MBUFF_t **queue, MBUFF_t *p)
{
	MBUFF_t *q  = *queue;
	if (q==NULL)  /*empty list*/
	{
		*queue = p; /* insert buffer address */
		p->next = 0; /*make the last*/
		return OK;
	}
	while (q->next) /*search for the last position*/
	{
		q = q->next;
	}	
	q->next = p; /*take it*/
	p->next = 0; /*make the last*/
	return OK;
}
static MBUFF_t *mdequeue(MBUFF_t **queue)
{
	MBUFF_t *mp = *queue;
	if (mp)	*queue = mp->next;
	return mp;
}
void init_mbuffs(void)
{
	int i;
	for (i=0; i<NMBUF; i++) /*initially all buffers are on the free list*/
	{
		menqueue(&mbufflist, &mbuff[i]);
	}
}
/*Note the System Call API was renamed to kCall (kernel call).*/
static MBUFF_t *get_mbuf() //get a free buffer
{
	MBUFF_t *mp;
	kCall(SemaWait, &nmbuf); /*if no buffers, block*/
	kCall(SemaWait, &mlock); /*CR*/
	mp = mdequeue(&mbufflist);
	kCall(SemaSignal, &mlock); /*CR*/
	return mp;
}
static int put_mbuf(MBUFF_t *mp)
{
	kCall(SemaWait, &mlock); /*CR*/
	menqueue(&mbufflist, mp);
	kCall(SemaSignal, &mlock); /*CR*/
	kCall(SemaSignal, &nmbuf); /*increases counting sema*/
	return OK;
}
int sendmsg(char *msg, int pid)
{
	tcb_t *r_task; 
	r_task = &tcbs[pid]; /*receiver task*/
	MBUFF_t *mp = get_mbuf();
	if (mp == NULL)
	{
		return NOK;
	}
	mp->sender_pid = RunPtr->pid;
	copyString_(mp->contents, msg);
	kCall(SemaWait, &r_task->mlock); /*CR*/
	menqueue(&r_task->mqueue, mp);
	kCall(SemaSignal, &r_task->mlock); /*CR*/
	kCall(SemaSignal, &r_task->nmsg); /*signal a new msg*/
	return OK;
}
int recvmsg(char *msg)
{
	int pid;
	kCall(SemaWait, &RunPtr->nmsg); /*wait for new msg*/
	kCall(SemaWait, &RunPtr->mlock); /*CR*/
	MBUFF_t *mp = mdequeue(&RunPtr->mqueue);
	if (mp == NULL) 
	{
		kCall(SemaSignal, &RunPtr->mlock);
		return 0;
	}
	kCall(SemaSignal, &RunPtr->mlock); /*CR*/
	copyString_(msg, mp->contents);
	pid = mp->sender_pid;
	put_mbuf(mp); /*releases buffer*/
	return pid;
}

7.4 Simple Validation

This small validation shows that the proposed Message Passing mechanism is able to send and receive messages synchronously.

Here is the test code:

void Task1(void* args)
{
	init_mbuffs();	/*init buffers*/
	int ret=NOK;
    /*Application Data Units*/
	char apdu2[] = {0xA2, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x02, 0x02};
	char apdu3[] = {0xA3, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF, 0x03, 0x03};
	while(1)
	{
		ret = sendmsg(apdu2, 2);
		if (ret == OK) printline_("Sent to 2\n\r");
		else assert(0);
		ret = sendmsg(apdu3, 3);
		if (ret == OK) printline_("Sent to 3\n\r");
		else assert(0);	
	}
}
void Task2(void* args)
{
	 int32_t sender_pid=0;
	 char msgrcvd[MSG_SIZE]={0};
	while(1)
	{  
		sender_pid = recvmsg(msgrcvd);	
		if (sender_pid == 1) { printline_("2 rcvd from 1\n\r");}
		else assert(0);
		(void)msgrcvd;
	}
}
void Task3(void* args)
{
	int32_t sender_pid=0;
	char msgrcvd[MSG_SIZE]={0};
	while(1)
	{
		sender_pid = recvmsg(msgrcvd);
		if (sender_pid == 1) { printline_("3 rcvd from 1\n\r");}
		else assert(0);
		(void)msgrcvd;
	}
}

Note that the tasks send and receive in loops while being blocked/preempted. No additional synchronization was added besides that provided by the implemented mechanism.

The total number of message buffers is 2, since communication is synchronous. The message size was set to 8 bytes.

Below is a picture of the debugging window of Task 2, with the proper message copied from its queue.

Most important are the printed lines showing the expected synchronous behaviour:

Below is another demonstration of message passing. An interrupt-driven system receives a 2-byte command from the keyboard via UART. The first byte is the PID of the receiver, a server. The server's job is to blink an LED at a certain frequency; the number of blinks is given by the second byte.

In this example, the Message Passing interface is implemented as a Kernel Call.

/*
 Message Passing Usage:
 This example receives a 2-byte message from UART
 The first byte is the PID of the receiving task
 The second byte is the number of led toggles
 Each server toggles the led on a different frequency 
*/
#include <stdio.h>
#include <bsp.h> 
#include "kernel.h"
#include "tasks.h"
MUTEX_t print;
SEMA_t rcvd;
MSG_t uartbuffer[2];
volatile int norb = 0;
volatile char rcvdbyte;
ISR(UART_Handler)
{
    __disable_irq();
	if((uart_get_status(UART) & UART_SR_RXRDY) == UART_SR_RXRDY)
	{
		uart_read(UART, &rcvdbyte);
		uartbuffer[norb] = rcvdbyte;
		norb++;
		if (norb == 2) 
		{
			kSemaSignal(&rcvd);
			norb=0;
		}		
	}
   __enable_irq();
}
static inline void printline_(const char* string)
{
	kCall(LOCK, &print);
	uart_write_line(UART, string);
	kCall(RELEASE, &print);
}
/* Client */
/* init_mbuffs() was called on kernel init */
void Task1(void* args)
{	
    kSemaInit(&rcvd, 0);
    kMutexInit(&print, 0);
	while(1)
	{
		WAIT(&rcvd); /*block until signaled by ISR*/
		if (uartbuffer[0] == 2) 
		{ 
			kCall2(SENDMSG, uartbuffer, (int)uartbuffer[0]);
        }
		else if (uartbuffer[0] == 3) 
		{ 
			kCall2(SENDMSG, uartbuffer, (int)uartbuffer[0]);
        }
		else
		{
			printline_("No route\n\r");
            SUSPEND(); //yield
		}
	}
}
/* Server */
void Task2(void* args)
{
	MSG_t msgrcvd[MSG_SIZE]={0};
	while(1)
	{
		kCall(RECVMSG, msgrcvd);
		if (SYSRET == 1) /*rcvd from PID 1*/
		{
			printline_("2\n\r");
			for (int i = 0; i<msgrcvd[1]; i++)	
			{
				gpio_toggle_pin(LED0_GPIO);
				delay_ms(500);
			}
		}
	}
}
/* Server */
void Task3(void* args)
{
	MSG_t msgrcvd[MSG_SIZE]={0};
	while(1)
	{
		kCall(RECVMSG, msgrcvd);
		if (SYSRET == 1)
		{
			printline_("3\n\r");
			for (int i = 0; i<msgrcvd[1]; i++)
			{
				gpio_toggle_pin(LED0_GPIO);
				delay_ms(1000);
			}
		}
	}
}

This usage demonstrates Message Passing conveying commands to be processed by servers. From this “Hello World” to a system performing more complex tasks in a client-server pattern, some effort is required, but it is quite doable now that the mechanisms are in place.

8. FINAL REMARKS

This article ends a series of four publications that demonstrate the design and implementation of a functional preemptive kernel on an ARM Cortex-M3 MCU, sitting on top of the CMSIS layer. The GNU ARM toolchain was used.

I started by creating a simple preemptive round-robin scheduler for concurrent tasks running in kernel space. Then user and kernel threads were split, which required a system call mechanism. Synchronization primitives were designed and implemented, and with them it was possible to create inter-task communication schemes.

Starting by exploring the ARM Cortex-M3 MCU, this series evolved into concepts and practice in embedded operating systems design, and while researching to write it I learned a lot. Although the result is a useful, functional multitasking engine, there is still a distance between this stage and a full-fledged (RT)OS kernel. This is an ongoing project, in a very early phase. The full source code can be found at https://github.com/antoniogiacomelli/k0ba.

References

[1] Embedded and Real-Time Operating Systems, (K.C. Wang, 2017)

[2] Design and Implementation of the MTX Operating System, (K.C. Wang, 2015)

[3] Modern Operating Systems (Andrew S. Tanenbaum, Herbert Bos, 2014)

Separating user space from kernel space on ARM Cortex-M3

1 . Introduction

ARM Cortex-M processors are found in SoCs across several application domains, especially in much of what we call smart devices. This publication continues the previous one, in which I demonstrated the implementation of a minimal preemptive scheduling mechanism for an ARM Cortex-M3, taking advantage of the special hardware resources for context switching.

Other architecture features are now explored: the separation of user and kernel threads and the use of Supervisor Calls to implement system calls.

Although concepts about operating systems are addressed because they are inherent to the subject, the main goal is to explore the ARM Cortex-M3, a relatively inexpensive processor with wide applicability.

2. Special registers

There are three groups of special registers on the ARM Cortex-M3. You can consult the ARM documentation to understand the role of each of them in the processor. The most important one in this publication is CONTROL.

  • Program Status Registers (APSR, IPSR, and EPSR)
  • Interrupt Mask Registers (PRIMASK, FAULTMASK and BASEPRI)
  • Control (CONTROL)

Special registers can only be accessed through the privileged instructions MRS (move the contents of a special register to a general-purpose register) and MSR (move the contents of a general-purpose register to a special register):

// Load into R0 the current value contained in the special register
MRS R0, SPECIAL
// Load into the special register the value contained in R0
MSR SPECIAL, R0

The CONTROL register has only 2 configurable bits. When an exception handler (e.g., SysTick_Handler) is running, the processor is in privileged mode using the main stack pointer (MSP), with CONTROL[1] = 0 and CONTROL[0] = 0. In routines that are not handlers, this register can assume different values depending on the software implementation (Table 1).

In the small kernel shown before, the application tasks (Task1, Task2 and Task3) were also executed in privileged mode and using the main stack pointer (MSP). Thus, an application program could change the special registers of the core if it wanted to.

3. Kernel and user domains

In the last publication I highlighted the fact that register R13 is not part of the stackframe, as it is the stack pointer itself. R13 is a “banked” register, meaning it is physically replicated and takes one value or another depending on the state of the core.

CTRL[1] (0 = MSP / 1 = PSP) | CTRL[0] (0 = Privileged / 1 = Unprivileged) | State
0 | 0 | Privileged handler* / Base mode
0 | 1 | Unprivileged thread
1 | 0 | Privileged thread
1 | 1 | User thread
Table 1. Possible states of the CONTROL register
* In exception handlers this mode is always active, even if CTRL[0] = 1.

Having two stack pointers, one for the application and another for the kernel, means that a user thread cannot easily corrupt the kernel stack through an application programming error or malicious code. According to the ARM manuals, a robust operating system typically has the following characteristics:

  • interrupt handlers use MSP (by default)
  • kernel routines are activated through SysTick at regular intervals to perform task scheduling and system management in privileged mode
  • user applications use PSP in non-privileged mode
  • memory for kernel routines can only be accessed in privileged mode* and use MSP

* for now we will not isolate the memory spaces

4. System Calls

Putting it simply, a system call is the method by which software requests a service from the kernel or OS it runs on. If we intend to separate our system into privilege levels, the application level inevitably needs to call the kernel to access, for example, hardware services or whatever else we consider critical to the security and stability of our system.

A common way to implement system calls on the ARM Cortex-M3 (and other ARMv7 processors) is to use the Supervisor Call (SVC) software interrupt. The SVC acts as an entry point for a service that requires privileges to run. The only input parameter of an SVC is its number (ASM instruction: SVC #N), which we associate with a function call (callback). Unlike other software-triggered exceptions, such as PendSV (Pendable Supervisor Call), the SVC can be triggered in user mode by default.

Figure 2. Block diagram of a possible operating system architecture. Supervisor Calls act as an entry point for privileged services. [2]

5. Design

5.1 Using two stack pointers

To use the two available stack pointers (MSP and PSP) it is essential to understand 2 things:

  • The control register manipulation: it is only possible to write or read the CONTROL register in handler mode (within an exception handler) or in privileged threads.
  • The exception mechanism: when an interrupt takes place, the processor saves the contents of registers R0-R3, R12, LR, PC and xPSR, as explained in the previous publication. The value of LR on exception entry indicates the mode the processor was running in when the thread was interrupted. We can manipulate this LR value, together with the stack pointer, to control the program flow.
LR | BX LR
0xFFFFFFF9 | Returns to “base” mode, privileged, using MSP (CONTROL = 0b00)
0xFFFFFFFD | Returns to user mode, using PSP (with the privilege level at entry) (CONTROL = 0b1x)
0xFFFFFFF1 | Returns to the preempted handler, in case a higher-priority interrupt occurred during a lower-priority one
Table 2. Exception return values

5.1.1. One kernel stack for each user stack

Each user stack will have a corresponding kernel stack (one kernel stack per thread). Thus, each task is associated with a kernel stack and a user stack. Another approach would be a single kernel stack for the entire system (one kernel stack per processor). The advantage of the first approach is that, from the system implementer's point of view, the programs that run in the kernel follow the same development pattern as application programs. The advantage of the second is less memory overhead and lower context-switching latency.

Figure 3. Each user stack has an associated kernel stack

5.2 Kernel entry and exit mechanisms

In the previous publication, the interruption triggered by SysTick handled the context switching, i.e., it interrupted the running thread, saved its stackframe, searched for the next thread pointed to by the next field in the TCB (thread control block) structure and resumed it.

With the separation between user and supervisor spaces, we will have two mechanisms for getting in and out of the kernel: system calls, explicitly invoked in code, and the SysTick interrupt that implements the scheduling routine. Although still using a round-robin scheme in which each task has the same time slice, the kernel threads also work cooperatively with user threads; that is, when there is nothing more to be done, the kernel can explicitly return. If a kernel thread takes longer than the time between one tick and another, it is interrupted and rescheduled. User tasks could use a similar mechanism, but for simplicity of exposition I chose to leave user tasks in a fixed round-robin scheme, with no cooperative mechanisms.

5.2.1. Scheduler

The flowchart of the preemptive scheduler to be implemented is in Figure 4. The start-up of the kernel and user application is also shown for clarity. The kernel starts up and voluntarily triggers the first user task. At every SysTick interrupt, the running thread has its state saved and the next scheduled task is resumed according to the mode in which it was interrupted: kernel or user.

Figure 4. Scheduler flowchart

5.2.2 System Calls

System Calls are needed when the user requests access to a privileged service. In addition, I also use the same mechanism for a kernel thread to cooperatively return to the user thread.

Figure 5. System Call flowchart

6. Implementation

Below I explain the code created to implement the proof of concept. Most of the kernel itself is written in assembly, except for a portion of the supervisor call handler, which is written in C with some inline assembly. In my opinion, embedding assembly in C code is even more cumbersome and error-prone than writing plain assembly. The toolchain used is GNU ARM.

6.1. Stacks

There is nothing special here, except that, in addition to the user stack, we now declare another array of integers for the kernel stack. These will be associated in the Thread Control Block.

int32_t p_stacks[NTHREADS][STACK_SIZE]; // user stack
int32_t k_stacks[NTHREADS][STACK_SIZE]; // kernel stack

6.2. Task Scheduler

The main difference from the scheduler shown in the last publication is that we now handle two different stack pointers: MSP and PSP. Thus, when entering an exception handler, the portion of the stackframe saved automatically depends on the stack pointer in use when the exception took place. However, within the exception routine, the active stack pointer is always the MSP. Therefore, to manipulate one stack pointer while operating with the other, we cannot use the PUSH and POP pseudo-instructions, because they take the active stack pointer as their base address. We have to replace them with the instructions LDMIA (load multiple, increment after) for POP, and STMDB (store multiple, decrement before) for PUSH, with the writeback sign “!” on the base address [1].

// Example of POP
MRS R12, PSP // reads the value of the process stack pointer into R12
LDMIA R12!, {R4-R11} // R12 contains the base address (PSP)
/* R4 receives the value at [R12], R5 the value at [R12]+0x4,
and so on, until R11 receives the value at [R12]+0x1C.
R12 is also incremented by 32 bytes (writeback)
*/
MSR PSP, R12 // PSP is updated to the new value of R12

// Example of PUSH
MRS R12, MSP // reads the value of the main stack pointer into R12
STMDB R12!, {R4-R11}
/* [R12]-0x20 receives R4, [R12]-0x1C receives R5, ..., [R12]-0x4 receives R11.
R12 is decremented by 32 bytes (writeback) */
MSR MSP, R12 // MSP is updated to the new value of R12

Another difference is that the TCB structure now needs to contain a pointer to each of the stack pointers of the thread it controls, and also a flag indicating whether the task to be resumed was using MSP or PSP when it was interrupted.

// thread control block
struct tcb 
{
  int32_t*  psp; //psp saved from the last interrupted thread
  int32_t*  ksp; //ksp saved from the last interrupted kernel thread
  struct tcb    *next; //points to next tcb
  int32_t   pid; //task id
  int32_t   kernel_flag; // 0=kernel, 1=user    
};
typedef struct tcb tcb_t;
tcb_t tcb[NTHREADS]; //tcb array
tcb_t* RunPtr; 

The scheduler routines are shown below. The code was written to be clear in its intention, without trying to save instructions. Note that the value of LR at the entry of the exception is only compared with 0xFFFFFFFD; if the comparison fails, it is assumed to be 0xFFFFFFF9. This is because I guarantee there will be no nested interrupts (SysTick never interrupts an SVC, for example), so LR should never assume 0xFFFFFFF1. In anything other than a proof of concept, this case should be tested.

.global SysTick_Handler
.type SysTick_Handler, %function
SysTick_Handler:            
CMP LR, #0xFFFFFFFD // were we at an user thread?
BEQ SaveUserCtxt    //yes
B   SaveKernelCtxt  //no
 
SaveKernelCtxt:
PUSH	{R4-R11}  //push R4-R11
LDR		R0,=RunPtr
LDR		R1, [R0]
LDR		R2, [R1,#4]
LDR		R3, =#0  
STR		R3, [R1, #16] //kernel flag = 0
STR		SP, [R2]
B		Schedule

SaveUserCtxt:
MRS		R12, PSP
STMDB	R12!, {R4-R11}
MSR		PSP, R12
LDR		R0,=RunPtr
LDR		R1, [R0]
LDR		R3, =#1  
STR		R3, [R1, #16] //kernel flag = 1
STR		R12, [R1]		
B		Schedule
 
Schedule:
LDR R1, =RunPtr //R1 <- RunPtr
LDR R2, [R1]    
LDR R2, [R2,#8] //R2 <- RunPtr.next
STR R2, [R1]    //updates RunPtr
LDR R0, =RunPtr
LDR R1, [R0]
LDR R2, [R1,#16]
CMP R2, #1       //kernel_flag==1?
BEQ ResumeUser   //yes, resume user thread
B   ResumeKernel //no, resume kernel thread
 
ResumeUser:
LDR		R1, =RunPtr			//R1 <- RunPtr
LDR		R2, [R1]
LDR		R2, [R2]
LDMIA	R2!, {R4-R11}		//restore sw stackframe 
MSR		PSP, R2				//PSP4
MOV		LR, #0xFFFFFFFD		//LR=return to user thread
MOV		R0, #0x03
MSR		CONTROL, R0
ISB
BX		LR 
	
ResumeKernel:
LDR		R1, =RunPtr			//R1 <- RunPtr updated
LDR		R2, [R1]
LDR		R2, [R2, #4]
LDR		SP, [R2]
POP		{R4-R11}
MOV		LR, 0xFFFFFFF9
MOV		R0, #0x00
MSR		CONTROL, R0
ISB
BX		LR

6.3 System Calls

The implementation of system calls uses the SVC Handler. As stated, SVC takes a single input parameter (ARM makes it sound like an advantage…): the number we associate with a callback. But then how do we pass the arguments forward to the callback, if system calls can handle only one parameter? They need to be retrieved from the stack. The AAPCS (Procedure Call Standard for the ARM Architecture), which compilers follow, says that when a function (caller) calls another function (callee), the callee expects its arguments to be in R0-R3. Likewise, the caller expects the callee's return value to be in R0. R4-R11 must be preserved across calls. R12 is a scratch register and can be freely used.

No wonder that when an exception takes place the core saves (pushes) the registers R0-R3, R12, LR, PC and xPSR of the interrupted function, and when returning pops them back into the core registers. It is fully prepared to get back to the point where it was interrupted. But if we switch context, that is, after the interruption we do not return to the point we were at before, we need to explicitly save the remaining stackframe so the thread can be resumed properly later. It is essential to follow the AAPCS if we want to call functions written in assembly from C code and vice-versa.

For system calls, I defined a C macro that receives the SVC number and the arguments for the callback (inline assembly syntax depends on the compiler used). Beware that you need to tell the compiler R0 is a clobbered register so it will not keep another value in it.

/*
asm ( assembler template
: output operands
: input operands
: list of clobbered registers
)*/

#define SysCall(svc_number, args) {\                                        
	asm volatile ("MOV R0, %0 " \
	: \
	: "r"  (args) \
	: "%r0");     \
	asm volatile ("svc %[immediate]" \
	: \
	:[immediate] "I" (svc_number) \
	: );  	  	  \
}

The args value is stored in R0. The SVC call is made with the immediate “svc_number”. When the SVC is triggered, R0-R3 will be automatically saved to the stack. The code was written as follows, without saving instructions, for clarity:

SVC_Handler:
MRS R12, PSP		 //saves psp
CMP LR, #0xFFFFFFFD
BEQ KernelEntry
B	 KernelExit
KernelEntry: 
MRS		R3, PSP
STMDB	R3!, {R4-R11}
MSR		PSP, R3
LDR		R1,=RunPtr
LDR		R2, [R1]
STR		R3, [R2]	
MOV		R0, R12 
B      svchandler_main

KernelExit:
// save kernel context
MRS		R12, MSP 
STMDB	R12!, {R4-R11}  //push R4-R11
MSR		MSP, R12
LDR		R0,=RunPtr
LDR		R1, [R0]
LDR		R2, [R1,#4]
STR		R12, [R2]		
// load user context
LDR		R2, [R1]
LDMIA	R2!, {R4-R11}
MOV		LR, #0xFFFFFFFD
MSR		PSP, R2
MOV		R0, #0x03
MSR		CONTROL, R0
ISB
BX		LR 

The rest of the kernel-entry routine is written in C [2, 3]. Note that the assembly routine ends with a simple branch (B svchandler_main), so at that point we have not yet returned from the exception handler.

The svc_number, in turn, is retrieved by walking two bytes back (hence the cast to char) from the stacked PC, which sits 6 word positions above R0 on the stack [1, 2, 3]. Note that it was necessary to copy the PSP value to R0 shortly after entering the interrupt, before saving the rest of the stackframe (the MRS R12, PSP and MOV R0, R12 instructions in the assembly code).

After retrieving the system call number and its arguments, the MSP is overwritten with the value stored in the TCB. Then we change the value of LR so the exception returns to base mode. In this implementation the callback does not run within the handler. When the BX LR instruction is executed, the remainder of the stackframe is automatically popped into the core registers.

#define SysCall_GPIO_Toggle  1 //svc number for gpio toggle
#define SysCall_Uart_PrintLn 2 //svc number for uart print line
 
void svchandler_main(uint32_t * svc_args)
{       
    uint32_t svc_number;
    uint32_t svc_arg0;
    uint32_t svc_arg1;
    svc_number = ((char *) svc_args[6])[-2]; // retrieves the SVC immediate
    svc_arg0 = svc_args[0];
    svc_arg1 = svc_args[1]; 
  
 switch(svc_number)
 {
 case SysCall_GPIO_Toggle: 
    k_stacks[RunPtr->pid][STACK_SIZE-2] = (int32_t)SysCallGPIO_Toggle_; //PC
    k_stacks[RunPtr->pid][STACK_SIZE-8] = (int32_t)svc_arg0; //R0
    k_stacks[RunPtr->pid][STACK_SIZE-1] = (1 << 24); // T=1 (xPSR)
    __ASM volatile ("MSR MSP, %0" : : "r" (RunPtr->ksp) : );
    __ASM volatile ("POP {R4-R11}");
    __ASM volatile ("MOV LR, #0xFFFFFFF9");
    __ASM volatile ("MOV R0, #0x0");
    __ASM volatile ("MSR CONTROL, R0"); /* base mode */
    __ASM volatile ("ISB");
    __ASM volatile ("BX LR"); /* returns from exception */
    break;
 case SysCall_Uart_PrintLn: 
    k_stacks[RunPtr->pid][STACK_SIZE-2] = (int32_t)SysCallUART_PrintLn_; 
    k_stacks[RunPtr->pid][STACK_SIZE-8] = (int32_t)svc_arg0;
    k_stacks[RunPtr->pid][STACK_SIZE-1] = (1 << 24); // T=1
    __ASM volatile ("MSR MSP, %0" : : "r" (RunPtr->ksp) : );
    __ASM volatile ("POP {R4-R11}");
    __ASM volatile ("MOV LR, #0xFFFFFFF9");
    __ASM volatile ("MOV R0, #0x0");
    __ASM volatile ("MSR CONTROL, R0"); /* base mode */
    __ASM volatile ("ISB");
    __ASM volatile ("BX LR"); /* returns from exception */
    break;
 default:
    __ASM volatile("B SysCall_Dummy");
    break;
 }
}

A callback looks like this:

static void SysCall_CallBack_(void* args)
{
    BSP_Function((int32_t*) args); //BSP function with one argument int32
    exitKernel_(); // leaves cooperatively
}

6.4. Start-up

The start-up is a critical point. The system boots in base mode, with the stacks already assembled. The kernel's first job after booting is to configure SysTick, switch to user mode, and launch the first user thread.

The assembly routines for the startup are as follows:

.equ SYSTICK_CTRL, 0xE000E010 
.equ TIME_SLICE,    999
 
.global kStart 
.type kStart, %function
kStart:
LDR R0, =RunPtrStart
LDR R1, [R0]
LDR R2, [R1,#4]
MSR MSP, R2   // MSP <- RunPtr.ksp
POP {R4-R11}  //loads stackframe 0 at call stack
POP {R0-R3}
POP {R12}
ADD SP, SP, #4
POP {LR}     //LR <- PC = UsrAppStart
ADD SP, SP, #4
BX  LR // branches to UsrAppStart
 
//this function manages the stack to run the first user thread
.global UsrAppStart 
.type   UsrAppStart, %function
UsrAppStart:                
LDR R1, =RunPtr //R1 <- RunPtr
LDR R2, [R1]        
LDR R2, [R2]
MSR PSP, R2
BL  SysTickConf //configures systick
MOV R0, #0x3
MSR CONTROL, R0 //thread unprivileged mode
ISB         // inst set barrier: guarantees CONTROL is updated before going
POP {R4-R11}   //loads stackframe 0
POP {R0-R3}
POP {R12}
ADD SP, SP, #4
POP {LR}       //LR <- PC
ADD SP, SP, #4
BX LR
     
SysTickConf:
LDR R0, =SYSTICK_CTRL 
MOV R1, #0
STR R1, [R0]  // resets counter
LDR R1, =TIME_SLICE  
STR R1, [R0,#4] // RELOAD <- TIME_SLICE
STR R1, [R0,#8] // any write clears CURR_VALUE
MOV R1, #0x7   // 0b111:
            // 1: Clock source = core clock 
            // 1: Enables irq
            // 1: Enables counter
STR R1, [R0]        
BX  LR      //get back to caller

7. Test

As a small test, we will write on the PC screen via UART. The callback for the system call was written as follows:

static void SysCallUART_PrintLn_(const char* args)
{
    __disable_irq(); //guarded begin
    uart_write_line(UART, args);        
    // waits until the transmitter is ready
    while (!(uart_get_status(UART) & UART_SR_TXRDY)); // SR is a bitfield, so mask the flag
    __enable_irq(); // guarded end
    exitKernel_(); // exit kernel cooperatively
}

The tasks (main threads) look like this:

void Task1(void* args)
{
    const char* string = (char*)args;
    while(1)
    {
        SysCall(SysCall_Uart_PrintLn, string);
    }
}

Be careful when multitasking accesses any shared resource, since we have not yet added any inter-process communication mechanism. Here, however, the operation runs within a "Guarded Region" and will not be interrupted by SysTick. The main program is as follows:

#include <commondefs.h> //board support package, std libs, etc.
#include <kernel.h>  
#include <tasks.h>
 
int main(void)
{
  kHardwareInit(); 
  kAddThreads(Task1, (void*)"Task1\n\r", Task2, (void*)"Task2\n\r", Task3, (void*)"Task3\n\r");
  RunPtrStart = &tcbs[0]; 
  RunPtr = &tcbs[1];
  uart_write_line(UART, "Initializing kernel...\n\r");
  kStart(); 
  while(1);
}

System running:

8. Conclusions

The use of two stack pointers, one for the application and another for the kernel, isolates these spaces and prevents the application from corrupting the kernel stack. Privilege levels prevent the user from overwriting special registers, keeping the kernel safe from application programming errors or malicious code.

Adding another stack pointer to the system required changes to the scheduling routine, because we now manipulate two stacks in the domains of two different stack pointers, and both can be preempted. In addition, a cooperative mechanism was added for exiting the kernel.

The one-kernel-stack-per-user-stack approach lets kernel and application routines follow the same pattern from the perspective of whoever is writing the system. The price to pay is memory overhead and extra latency when switching contexts. To mitigate the latter, cooperative mechanisms can be added, as shown. To mitigate the memory overhead, more attention should be paid when modeling the tasks (or concurrent units) so they are allocated efficiently.

The system call mechanism is used as an entry point to hardware services, or to whatever else we deem critical for the security and stability of the system. This will make even more sense once we separate not only the stacks by privilege level but also the memory regions with the MPU.


9. References

[1] ARM, Cortex-M3 Technical Reference Manual (DDI 0337E). http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/DDI0337E_cortex_m3_r1p1_trm.pdf

[2] Joseph Yiu, The Definitive Guide to ARM Cortex-M3.

[3] ARM, SVC handlers in C and assembly language. https://developer.arm.com/docs/dui0471/j/handling-processor-exceptions/svc-handlers-in-c-and-assembly-language