Designing and implementing IPC mechanisms on an ARM Cortex-M3 kernel (1/2)

1. INTRODUCTION

I do think system programming is one of the nicest things to do: taking a computer board (I/O, CPU, Memory) and creating its system software from the scratch so it performs some work is awesome.

I have presented a kernel implementation for an ARM Cortex-M3 microprocessor. So far, the features are:

  • Fixed preemptive round-robin scheduling by hardware interruption
  • Cooperative mechanism for user threads by software interruption (not described on the publications)
  • Preemptive-Cooperative mechanism for kernel threads
  • System Calls (software interrupts) to separate user threads from kernel threads
  • … and only that! :O

Just these features took a considerable amount of knowledge on the processor architecture, not-so-basic low-level programming skills and operating system concepts.

Now, we will keep improving our kernel from the scratch and start implementing the next important feature: interprocess communication.

2. CONCEPTS AND MOTIVATION

The need for synchronization

A main concept is the idea of concurrent units competing for a limited number of resources on a computer.

«Concurrent Unit is a set of sequential actions in the same thread of executionAlso known as a task or thread.» 

Figure 1. A, B, C and D are Concurrent Units: a) Sharing a single core (pseudo-concurrent) b) Each unit has its own core c) When running, only one concurrent unit will be active at a given time (Figure from [2])

To a system achieve its goals, tasks must cooperate to use the resources.

«Synchronization points are those where tasks will share information and/or ensure that certain preconditions are satisfied before proceeding. » 

On Figure 2, the synchronization barrier only allows the program to proceed when all processes have reached that point – the conditions for all system actions to be correctly synchronized are true.

Figure 2. a) Running processes b) All processes except C are ready c) Once process C reaches the synchronization point, the program moves on. (Figure from [2])

Figure 3 shows concurrent units, modeled as three tasks. What do we know about them? If we analyze it individually, it becomes clear that in Task 1 Action A occurs before Action B, which goes through a conditional path until reaching E. The same reasoning goes for Tasks 2 and 3.

How are the individual actions of Tasks 1, 2, and 3 related to each other? If the (true) answer is “they do not relate”, there is no concurrency problem to be solved. And your system is probably very simple or very well designed.

Figure 3. Concurrent units represented as flowcharts (Figure from [1])

However, if at some point it is necessary to ensure, for example, that Actions FX and Zeta are executed only after Actions E, Z and Gamma meet certain conditions, we need to somehow join each of the states of these tasks in logically a single thread, on a synchronization point. After the necessary conditions have been met at the synchronization point, the threads are again forked, and the program continues.

This means that whenever there are conditions between two or more processes to achieve a certain objective, we need to create mechanisms for them to communicate.

When I presented how the System Calls were implemented, there was a call to UART peripheral like this:

static void SysCallUART_PrintLn_(const char* args)
{
    __disable_irq(); //critical session begin
    uart_write_line(UART, args); 
    while (uart_get_status(UART)!=UART_SR_TXRDY); // waits until transmission is done 
    __enable_irq(); // critical session end
    exitKernel_(); // exit kernel cooperatively
}

When the task calls the system to access UART, all interruptions are disabled and nothing else will be done until the transmission ends. Bad. Instead for waiting to transmission end, would be better to switch to other tasks that are not using UART.

Semaphores

Among the many valuable concepts on Computer Science Dijkstra has presented, the idea of semaphores for operating systems is a big one. The simplest semaphore is a non-negative integer “s” that accepts two atomic operations (apart from initialization), Wait and Signal:

wait(s): Hello system, I would like to use the resource related to this semaphore.
signal(s): Hello system, I have just finished using the resource related to this semaphore.

It seems simple but it is not. The use of semaphores can be very error prone, and they are actually primitives to create higher-level IPCs schemes.

Mutual Exclusion [3]

Non-shareable resources, whether peripherals, files, or data in memory, can be protected from simultaneous access by several processes by preventing the processes from concurrently executing the pieces of program through which access is made. [3] These pieces of program are called critical sections.

Mutual exclusion in the use of resources can be regarded as mutual exclusion in the execution of critical sections, as on the figure below, where mutex is a semaphore.

Figure 4. Semaphore for mutual exclusion. If the initial value of mutex is 1, the first process to grab it will lock the resource. (Figure from [3])

Synchronization [3]

The simplest form of synchronization is that a process A should not proceed beyond a point Ll until some other process B has reached L2 [3]. Examples of this situations arise whenever A requires information at point LI which is provided by B when it reaches L2. The synchronization of the two processes below is achieved if proceed is a semaphore initialized as 0.

Figure 5. Semaphore used for synchronization. (Figure from [3])

Deadlocks

When processes compete for resources, we can get into a situation where none of them can move forward. Let A and B be processes. A is blocked, waiting for a resource to be freed to move on. B in turn will only release this resource when a certain condition is satisfied. If this condition depends on A, which is already blocked, we have a deadlock.

Spin locking and Blocking

The simplest semaphore implementation is called spinlock. Because it only protects a resource from being accessed but does not avoid another task that needs that resource to be dispatched. On Figure 6 if Task1 is preempted while still using resource, Task2 runs even though it performs nothing – it is spinning. At least we have not disabled all interrupts and Task3 can perform useful work.

Figure 6. Spin locking. The processor is granted to Task2 even though it is locked.

To solve the spin locking problem we need to tell the scheduler that a task which is waiting for a resource is not able to be dispatched. It is “blocked” until the task using the resource signals the semaphore. It means the Wait and Signal operations need to be able to switch a thread status between blocked and runnable.

Figure 7. Blocking Semaphore: Task 2 will not be schedulable until resource is released.

3. DESIGN

Big Picture

The initial design and implementation of the kernel was presented on great level of detail on two publications, starting from the round-robin scheduler for one single stack pointer up to a scheduler managing user and supervisor threads, the latter being accessed by System Calls.

Now, including a Semaphore mechanism to serialize access to a resource, the kernel model is represented on UML:

Figure 8. Kernel Model serializing access to a single resource. Note how dependent from the semaphore the thread control flow is. Bad IPC can create hard-to-find bugs.

A given resource might have a semaphore to manage how it is accessed. When a task calls a Wait on a semaphore associated to a resource that has no slots available, the thread will be blocked. When a task calls a Signal on a semaphore and the resource is released, the next thread on the queue that is blocked by that semaphore is set as runnable. With some effort other criteria can be added: priority, aging, etc.

A Task has a Thread Control Block which is associated to the Scheduler. The Scheduler is accountable to select (schedule), dispatch and save/resume user/kernel threads. A better model would isolate the Dispatcher as an instance itself, but considering the way it is coded on ASM, the dispatch is just a branch for the address the updated RunPtr is pointing to, after a scheduled thread context is retrieved.

The Scheduler flowchart updated considering the blocked threads:

Figure 9. Scheduler flowchart.

So far, the description of a Semaphore I gave took for granted how the Wait and Signal operations are designed. Some authors [2] use the terms Down and Up respectively.

There are several ways to use a semaphore. The initial value depends on the goal. On Figure 5, the semaphore was initialized as 0, for instance. If we aim mutual exclusion, we initialize the value as 1 meaning the resource is initially free.

Blocking Semaphore

Also, there are several ways to design them. I will use an approach based on [4]. A blocking semaphore is a signed integer variable. Every time it is affected by a Signal operation, it is increased. Every time it is affected by a Wait operation, it is decreased. This way, for a value N > 0, it means a given resource has N slots to be shared. A value 0 means it is taken. A negative value, say -2, means it is taken by a thread, and other two threads are also waiting for it. With this approach we can infer the number of tasks that are blocked on the queue.

Below, the flow chart of the described semaphore operations. Note it is a system call, that’s why the last step is to exit the kernel as I have been doing for other system calls.

Figure 10. Flowchart for Semaphore operations

4. IMPLEMENTATION

The kSemaphore_t type was defined as an int32_t. The Wait and Signal procedures are System Calls, and the callbacks are written as below, following the same pattern for System Calls already presented. exitKernel_() tells the system to leave the kernel and call the user thread switcher (an unblocked user thread will be launched/resumed).

static void SysCallSemaSignal_(kSemaphore_t* s)
{	
	tcb_t* next_tcb_ptr;
	__disable_irq();
	(*s) = (*s) + 1;
	if ((*s) <= 0)
	{
       /* switching RunPtr->next, not RunPtr! */
		next_tcb_ptr = RunPtr->next;
		while (next_tcb_ptr->block_pt != s)
		{
			/*if deadlock this loop hangs the kernel!*/
			next_tcb_ptr = next_tcb_ptr->next;
		}
		next_tcb_ptr->block_pt=0; /*unblocks a blocked by s*/
		__enable_irq();
		exitKernel_();
	}
	__enable_irq();
	exitKernel_();
}
static void SysCallSemaWait_(kSemaphore_t* s)
{
	__disable_irq();
	(*s) = (*s) - 1;
	if ((*s) < 0)
	{
		RunPtr->block_pt = s; // reason it is blocked
		__enable_irq();
		exitKernel_();
	}
	__enable_irq();
	exitKernel_();
}

To make a System Call to a Semaphore, the API is the following:

// General SysCall API
SysCall(SysCall_ID, arguments)
// Semaphores
// Wait Procedure:
SysCall(SemaWait, kSemaphore_t* sema)
// Signal Procedure:
SysCall(SemaSignal, kSemaphore_t* sema)

4.1 Changes

Some small but critical changes have been made on the code. The Thread Control Block data structure now has a block field that is a pointer to a semaphore, as explained, the one who blocked the thread.

// @File kernel.h
// Structs for Threads
//********************
struct tcb 
{
	int32_t*		psp;			/*last saved process stack*/
	int32_t*		ksp;			/*last saved kernel stack*/
	struct tcb		*next;			/*next tcb*/
	int32_t			pid;
	int32_t			kernel_flag;	/*kernel flag 1=user thread, 0=kernel thread*/
	kSemaphore_t*	block_pt;		/*0=not blocked, semaphore address=blocked*/
};
/* Globals */
extern tcb_t tcbs[NTHREADS];	// thread control block				
extern int32_t p_stacks[NTHREADS][STACK_SIZE];   // process stack
extern int32_t k_stacks[NTHREADS][STACK_SIZE];   // kernel stack
extern tcb_t* RunPtr; 

The Scheduler routine within the SysTick Handler is also changed. First and most important, the scheduler routine now is handled by lower priority trap, PendSV. SysTick only triggers PendSV. This is done because we do not want a context switch during a System Call. If for example, a context switch happens just before the system enters the critical section to lock/unlock a mutex, we will probably mess our synchronization, and all the problems that come with that, like a deadlock. Besides SysTick delegating the switching to a lower priority software interrupt, there is the addition of the kGetNext routine:

//@File: system.s
SysTick_Handler:
// set the bit 28 as 1 to trigger pendsv
	LDR R0, =SCB_ICSR //0xE000ED04
	LDR R1, =ISCR_SETPSV // (1 << 28) 
	STR R1, [R0]
	ISB //flush pipeline
// within the PendSV handler routine
PendSV_Handler:
    /* same as before up to here */ 
    Schedule:
	BL		kGetNext             //get next unblocked
	Dispatch:
	LDR		R0, =RunPtr
	LDR		R1, [R0]
	LDR		R2, [R1,#16]
	CMP		R2, #1				//kernel_flag==1?
	BEQ		ResumeUser			//yes
	B		ResumeKernel	    //no
	ResumeUser:
	LDR		R1, =RunPtr			//R1 <- RunPtr
	LDR		R2, [R1]
	LDR		R2, [R2]
	LDMIA	R2!, {R4-R11}		//retrieve sw stackframe 
	MSR		PSP, R2				//PSP4
	MOV		LR, #0xFFFFFFFD		//LR=return to user thread
	CPSIE	I					//critical end
	BX		LR 
	
	ResumeKernel:
	LDR		R1, =RunPtr			//R1 <- RunPtr 
	LDR		R2, [R1]
	LDR		R2, [R2, #4]
	MSR		MSP, R2
	LDMIA	R2!, {R4-R11}		//retrieve sw stackframe 
	MSR		MSP, R2
	MOV		LR, #0xFFFFFFF9		//LR=return to kernel thread
	CPSIE	I					//critical end
	BX		LR

The kGetNext routine was written in C with a line of inline assembly:

//@File system.s
.global kGetNext
.type kGetNext, %function
//@File kernel.h
extern void kGetNext(void);
//@File kernel.c
/*
 Schedules next unblocked thread
*/
void kGetNext(void)
{
	__disable_irq();
	RunPtr = RunPtr->next; 
    /*!if there is a deadlock, the kernel will hang on this loop!*/
	while(RunPtr->block_pt || RunPtr->sleeping)
	{ 
    	RunPtr = RunPtr->next;
	}
	__enable_irq();
	__ASM volatile("BX LR"); // Dispatch
}

5. VALIDATION

Producer-consumer problem

Figure 11. Producer and Consumer processes sharing a FIFO [4]

A set of “producer” processes and a set of “consumer” processes communicate by means of inserting and extracting items on a buffer. The buffer has size N. There are two main criteria to achieve synchronization: producers cannot write on a full buffer and consumers cannot read from an empty buffer. That is for the number of inserted data i and extracted data e: 0 < i – e < N. Furthermore, there must be a mutex so only one process can access the buffer at a time.

In [3] the author proves the following algorithm solves the consumer-producer problem.

Figure 12. Algorithm for a Producer-Consumer [3]

Space Available is a semaphore that is initialized as N. Buffer Manipulation is the mutex, initialized as 1. Item available, is initialized as 0.

Now let’s see if we can reproduce this solution with the implemented Semaphores and Scheduler. Task1, the producer, increments a counter and inserts it into a FIFO, when there is space and the buffer is unlocked. Task2 reads this value when there is an item and the resource is unlocked.

The code below does not encapsulate the semaphores within the Data Queue, i.e., the FIFO methods do not handle the synchronization themselves.

#include "kernel.h"
#include "tasks.h"
#include "dataqueue.h" // FIFO service
/* Semaphores */
kSemaphore_t space = FIFOSIZE; //FIFOSIZE=10
kSemaphore_t item = 0;
kSemaphore_t mutex = 1;
/* Produced data, consumed data */
volatile int counter1 = 0;
volatile int read = 0;
/* Producer */
void Task1(void* args)
{
	while(1)
	{
		counter1++;
		SysCall(SemaWait, &space); /* waits there is space */
		SysCall(SemaWait, &mutex); /* locks */
		osFifoPut(counter1); /* insert into fifo */
		SysCall(SemaSignal, &mutex); /* unlocks */
		SysCall(SemaSignal, &item); /* signals there is an item */
        //SysCall_SuspendTask(); /* leaves cooperatively */
	}
}
/* Consumer */
void Task2(void* args)
{
	while(1)
	{
		SysCall(SemaWait, &item); /* waits there is an item */
		SysCall(SemaWait, &mutex); /* locks */
		read = osFifoGet(); /* reads the last written data */
		SysCall(SemaSignal, &mutex); /* unlocks */
		SysCall(SemaSignal, &space); /* signals there is space*/
        //SysCall_SuspendTask(); /* leaves cooperatively */
	}
}

I tested both preemptive-only and cooperative-only scheduling, and also preemptive+cooperative (although not presented, there is a cooperative mechanism for the user threads too).

The FIFO has 10 positions. On my tests the program never hangs, buffer never overflows or underflows. Well, I have nothing fancy to show here, but my debugging window:

Figure 13. Producer-consumer code on debug

Now let’s make something a little more elaborate to test our synchronization primitives, and see it happening on the PC screen.

Let’s add a third task. It starts and blocks waiting for a semaphore to signal. This semaphore is signaled if the number consumed is divisible by 1000, and the task sleeps for 5 system ticks, wakes-up, and all over again.

#include <stdio.h>
#include <asf.h>
#include "kernel.h"
#include "tasks.h"
#include "dataqueue.h" // FIFO

/* Semaphores */
kSemaphore_t space = FIFOSIZE; //FIFOSIZE=10
kSemaphore_t item =  0;
kSemaphore_t mutex = 1;
kSemaphore_t sleep = 0;
kSemaphore_t print_sema = 1;

static inline void printline_(const char* string)
{
	SysCall(SemaWait, &print_sema);
	usart_write_line(USART_SERIAL, string);
	SysCall(SemaSignal, &print_sema);
}
/* Produced data, consumed data */
volatile int counter1 = 0;
volatile int read = 0;
/* Producer */
void Task1(void* args)
{
	osFifoInit();
	while(1)
	{
		counter1++;
		SysCall(SemaWait, &space); /* waits there is space */
		SysCall(SemaWait, &mutex); /* locks */
		osFifoPut(counter1); /* insert into fifo */
		SysCall(SemaSignal, &mutex); /* unlocks */
		SysCall(SemaSignal, &item); /* signals there is an item */
		//SysCallSuspend(); /* leaves cooperatively */
	}
}
/* Consumer */
void Task2(void* args)
{
	while(1)
	{
		SysCall(SemaWait, &item); /* waits there is an item */
		SysCall(SemaWait, &mutex); /* locks */
		read = osFifoGet(); /* reads the last written data */
		if (read % 1000 == 0) { SysCall(SemaSignal, &sleep); } /* signals task 3 to sleep */
		SysCall(SemaSignal, &mutex); /* unlocks */
		SysCall(SemaSignal, &space); /* signals there is space*/
		//SysCallSuspend(); /* leaves cooperatively */
	}
}
void Task3(void* args)
{
	while(1)
	{
		printline_("Task3 active\n\r");
		SysCall(SemaWait, &sleep);
		printline_("Task3 is going to sleep\n\r");
		SysCallSleepTicks(5);
	}
}

If you are wondering the call to UART is not a system call, that is because the design is shifting to a microkernel approach: drivers run on user space. The sleep implementation was not presented yet also.

Here is the output:

On the next publication we will leverage the use of these primitives on the design and implementation of IPC mechanisms.

[1] Design Patterns for Embedded Systems in C (Douglass Powell, 2009) 
[2] Modern Operating Systems (Andrew Taneunbaum, Herbert Boss, 2014) 
[3] Fundamentals of Operating Systems (A. M. Lister, 1984) 
[4] Embedded Systems: Real-Time Operating Systems for Arm Cortex-M MCUs (Jonathan Valvanno, 2012) 

Author: Antonio Giacomelli de Oliveira

Embedded Systems Engineer

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: