[blog] Monolithic Kernel and Microkernel

Setting aside more exotic approaches, two kernel models are generally described: monolithic and microkernel. The crucial difference between them is how much of the system runs in supervisor (kernel) space versus user space.

Figure 1 conceptually illustrates the two approaches. In the microkernel model, a smaller set of processes runs in kernel space. System programs, such as the file system, device drivers, etc., live in the user domain and act as servers. The kernel domain holds only the most essential modules, such as the scheduler and the interprocess communication mechanisms. The primary function of the microkernel is to manage resources and, to a lesser extent, to abstract the hardware. Modularity is a strong point of this architecture, normally described as a loosely coupled set of layers in the client-server model.

Figure 1. Monolithic Kernel vs. Microkernel [1]

A monolithic kernel concentrates all services in a single compilation unit, and only the applications run in user space. In this type of system the kernel, besides being a resource manager, is also an extension of the machine (“an extended machine”), because it completely abstracts the hardware from the user.

Kernel + OS

Figure 2 shows a famous conceptual illustration of the GNU Operating System on top of the Linux kernel. The fundamental glibc library sits in user space. In kernel space there is a part that is “universal” and another part that depends on the hardware platform.

Figure 2. GNU/Linux [3]

Compare it to the System V architecture as depicted in a 1986 book [5].

Figure 3. System V [5]

On microkernel-based systems such as those in Figures 4 and 5, most of the system lives in user space. Besides modularity, another alleged advantage of this type of architecture is fault isolation: if, for instance, a device driver deadlocks, the kernel can recover, restarting the driver or resetting the system. On a monolithic kernel this would not be possible, because the system would already be locked in supervisor mode. Microkernels are, however, known to be slower on general-purpose systems, in addition to adding complexity to interprocess communication, which can eventually be exploited by malicious code.

Figure 4. QNX Neutrino
Figure 5. Minix 3 [4]

Different operating system flavors are possible when using a microkernel. In Figure 6a, “monolithic OS” means that all servers are concentrated in a single program, and applications use clients to access these concentrated services. In Figure 6b, the servers are modularly distributed. In Figure 6c, a specialized (dedicated) application interacts directly with the microkernel, a common approach in embedded systems.

Figure 6. Systems with microkernel: a) Monolithic OS with microkernel b) Distributed OS with microkernel c) Monolithic system with microkernel [2]

[1] YU, Ken. RTOS Model and Simulation using SystemC. 2010

[2] HERDER, Jorrit. Towards a True Microkernel Operating System. 2005

[3] Anatomy of the Linux kernel – IBM Developer

[4] https://upload.wikimedia.org/wikipedia/commons/7/7d/The_MINIX_3_Microkernel_Architecture.png

[5] BACH, Maurice. The Design of the UNIX® Operating System. 1986

Designing and implementing IPC mechanisms on an ARM Cortex-M3 Kernel (2/2)

6. PROCESS COMMUNICATION

Process communication refers to schemes or mechanisms that allow processes to exchange information. It can be accomplished in many different ways, all of which depend on process synchronization [1]. ITC, or Inter-Task Communication, is the term commonly used in the embedded domain.

Now that we have validated that our semaphores can synchronize tasks running concurrently in our system, it is time to use them to create higher-level communication mechanisms.

Microkernel architecture

The overall design goal is a microkernel. Accordingly, some tasks will implement servers while others will be clients, and they need to communicate.

Figure 14. The target architecture

Shared Memory, Pipes, and Message Passing are common communication schemes in operating systems [1, 2, 3]. Shared memory guarded by a simple mutex is effective for direct task communication within the same address space, as sketched below. Pipes are effective for sharing streams of data. Message passing is used to exchange messages ready to be processed.
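For instance, here is a minimal shared-memory sketch, assuming the kCall semaphore interface used throughout this article; the task and variable names are illustrative only, not part of the kernel:

/* Minimal shared-memory sketch: a global counter shared by two tasks,
   serialized by a binary semaphore acting as a mutex. */
kSemaphore_t shm_mutex = 1;      /* 1 = resource free */
volatile int shared_counter = 0; /* memory both tasks can see */

void WriterTask(void* args)
{
	while (1)
	{
		kCall(SemaWait, &shm_mutex);   /* lock the shared region */
		shared_counter++;              /* safe read-modify-write */
		kCall(SemaSignal, &shm_mutex); /* unlock */
	}
}

void ReaderTask(void* args)
{
	volatile int snapshot;
	while (1)
	{
		kCall(SemaWait, &shm_mutex);   /* lock */
		snapshot = shared_counter;     /* consistent read */
		kCall(SemaSignal, &shm_mutex); /* unlock */
		(void)snapshot;
	}
}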

6.1 Synchronous versus Asynchronous mechanisms

Communication can be synchronous or asynchronous. Asynchronous means that after a client calls a server, it is free to perform other work while the server has not yet answered. Synchronous means that after sending, the client blocks until it gets an answer.

Figure 15. General client-server synchronous communication.

Asynchronous communication is intended mainly for loosely-coupled systems, in which interprocess communication is infrequent, i.e., processes do not exchange messages on a planned or regular basis. For such systems, asynchronous communication is more suitable due to its greater flexibility. [1]

Synchronous communication is well suited to tightly-coupled systems in which processes exchange messages on a planned or regular basis. In such a system, processes can expect messages to come when they are needed, and the usage of message buffers is carefully planned. [1]

7. MESSAGE PASSING

Message Passing is a general form of inter-task communication. In operating systems it is the basis of the so-called microkernel; in distributed systems it underlies RPC (remote procedure call) and is also the basis of client-server programming on computer networks. [1, 2, 3]

Message Passing can be synchronous or asynchronous. In the asynchronous form, both the send and receive operations are non-blocking; in the synchronous form, both are blocking.

In this publication I will show a synchronous message passing mechanism. As said, message passing is the basis of a microkernel: the messages will convey requests to be processed by servers.

7.1. Design and Implementation of a Synchronous Message Passing

In a simple client-server communication system there are two kinds of processes: servers and clients. Clients can only communicate with servers, and servers can only communicate with clients. Each client has either a guaranteed or a dedicated message buffer. After sending a request to a server, a client immediately waits for a reply, since there is nothing else it can do. Upon receiving a client request, a server handles it, which may take some time, but eventually sends a reply to the client. In more complex client-server systems there are processes that can only be clients, processes that can only be servers, and processes that can be both. [1, 2]

The design and implementation presented here are based on [1, 2]; processes are either clients or servers, never both. The sketch below illustrates the intended request-reply pattern.
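To make the pattern concrete before diving into the design, here is a sketch of a client-server transaction over the sendmsg()/recvmsg() interface implemented in Section 7.3. SERVER_PID and the payloads are hypothetical:

#define SERVER_PID 2 /* hypothetical server task */

/* Client: sends a request and immediately blocks for the reply. */
void ClientTask(void* args)
{
	char request[MSG_SIZE] = "REQ";
	char reply[MSG_SIZE];
	while (1)
	{
		if (sendmsg(request, SERVER_PID) == OK)
		{
			(void)recvmsg(reply); /* blocks until the server answers */
		}
	}
}

/* Server: blocks for a request, processes it, replies to the sender. */
void ServerTask(void* args)
{
	char request[MSG_SIZE];
	char reply[MSG_SIZE] = "ACK";
	while (1)
	{
		int client_pid = recvmsg(request); /* blocks for a request */
		if (client_pid > 0)
		{
			/* ...handle the request... */
			(void)sendmsg(reply, client_pid); /* reply to the caller */
		}
	}
}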

7.2 Design

We can extend the consumer-producer problem to N messages. [3]

buffer[N], N = number of messages

process producer:
  I = 0
  while (TRUE):
    produce contents
    receive buffer[I] from consumer   // wait for an empty message
    put contents in buffer[I]
    send buffer[I] to consumer
    I = (I+1) % N

process consumer:
  for K = 0 to N-1: send empty message to producer
  I = 0
  while (TRUE):
    receive buffer[I] from producer   // wait for a full message
    extract contents from message
    put empty item in buffer[I]
    send buffer[I] to producer
    I = (I+1) % N

Based on this algorithm, but extended to multiple producers and consumers and using a message queue to sequence message processing, the proposed design is as follows:

  • The message buffers are arranged on a linked list. The next field of the last buffer of the list points to zero, so we know it is the last. The message buffer data structure contains the sender PID, the message contents and the address of the next linked buffer.
  • Every thread has a pointer to a Message Buffer on its Thread Control Block. Since there is a limited number of buffers available on the system, when a client wants to send a message to a server, it must fetch a free buffer from a list. If there is a free buffer, it dequeues its address from the list, writes the message, enqueues it on the receiver message queue and signals the receiver there is a new message. Initially, all buffers are free.
  • When a receiver is signaled, it dequeues the message buffer from its own TCB, reads its contents, and releases the buffer – puts it back on the free list. The figure below shows a possible state for a system with seven message buffers. Three of them are taken, waiting to be read by the receiver threads. It means three send operations have occurred and no receive operations have taken place for these messages yet.
  • The number of free buffers is controlled by a counting semaphore whose maximum count is the total number of buffers available on the system. Access to the free buffer list is a critical region (CR), disputed by senders, so it is protected by a mutex. The same reasoning applies to the message queue of every thread control block: each has its own mutex.
Figure 16. Possible state of a Message Passing with three messages waiting to be read by receivers.

Note that for asynchronous communication only the sending and receiving algorithms would change; the architecture depicted in Figure 16 would still apply. A sketch of such a non-blocking send follows.
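As a sketch only: an asynchronous send would fail instead of blocking when no buffer is free. The semaTryWait() primitive below is hypothetical – it is not part of this kernel – and would decrement the counting semaphore only when its value is positive; everything else mirrors sendmsg():

/* Hypothetical non-blocking send. semaTryWait() is NOT implemented in
   this kernel: assume it returns NOK instead of blocking when the
   semaphore count is zero. */
int trysendmsg(char *msg, int pid)
{
	tcb_t *r_task = &tcbs[pid];
	MBUFF_t *mp;
	if (semaTryWait(&nmbuf) == NOK) return NOK; /* would block: give up */
	kCall(SemaWait, &mlock); /*CR*/
	mp = mdequeue(&mbufflist);
	kCall(SemaSignal, &mlock); /*CR*/
	mp->sender_pid = RunPtr->pid;
	copyString_(mp->contents, msg);
	kCall(SemaWait, &r_task->mlock); /*CR*/
	menqueue(&r_task->mqueue, mp);
	kCall(SemaSignal, &r_task->mlock); /*CR*/
	kCall(SemaSignal, &r_task->nmsg); /*signal a new msg*/
	return OK;
}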

7.3 Implementation

Like the design, the implementation is simple. The hardest part is the synchronization, and the linked queue may need some special attention. The send-receive interface must be a kernel call if the buffers are stored in kernel space (i.e., if you have an MMU). I hope the implementation mirrors the design idea:

/******************/
/*@File message.h*/
/*****************/
/*Message Buffer Data Structure*/
#define MSG_SIZE 32
typedef struct mbuff
{
	struct mbuff *next;
	int			 sender_pid;
	char		 contents[MSG_SIZE];
}MBUFF_t;
/*Interface*/
/*initialize free list*/
extern void init_mbuffs(void);
/*send *msg to pid, return OK or NOK*/
extern int sendmsg(char *msg, int pid); 
/*receive msg and store at *msg, return sender PID, 
0 if fails (0 is a reserved PID)*/
extern int recvmsg(char *msg); 
/****************/
/*@File kernel.h*/
/****************/
/*Thread control block with message passing fields*/
struct tcb 
{
	int32_t*		psp;			/*last saved process stack*/
	int32_t*		ksp;			/*last saved kernel stack*/
	struct tcb		*next;			/*next tcb*/
	int32_t			kernel_flag;	/*kernel flag 1=user thread, 0=kernel thread*/
	int32_t			pid;			/*process id*/
	kSemaphore_t*	block_pt;		/*blocked 0=not blocked, semaphore address=blocked*/
	uint32_t		sleeping;		/*sleep counter*/
	kSemaphore_t	nmsg;			/*new msg sema*/
	kSemaphore_t	mlock;			/*thread msg queue sema*/
	MBUFF_t			*mqueue;        /*thread msg queue*/
};
/******************/
/*@File message.c*/
/*****************/
/************************************************************************/
/* Synch Message Passing                                                 */
/************************************************************************/
MBUFF_t mbuff[NMBUF]; /*message buffers*/
MBUFF_t *mbufflist=NULL; /*free message buffers*/
kSemaphore_t nmbuf=NMBUF; /*counting semaphore*/
kSemaphore_t mlock=1; /*sema for mbufflist*/
static inline void copyString_(char *dest, char* src)
{
	for (int i = 0; i<MSG_SIZE; ++i) { dest[i] = src[i] ; }
}
static int menqueue(MBUFF_t **queue, MBUFF_t *p)
{
	MBUFF_t *q  = *queue;
	if (q==NULL)  /*empty list*/
	{
		*queue = p; /* insert buffer address */
		p->next = 0; /*make the last*/
		return OK;
	}
	while (q->next) /*search for the last position*/
	{
		q = q->next;
	}	
	q->next = p; /*take it*/
	p->next = 0; /*make the last*/
	return OK;
}
static MBUFF_t *mdequeue(MBUFF_t **queue)
{
	MBUFF_t *mp = *queue;
	if (mp)	*queue = mp->next;
	return mp;
}
void init_mbuffs(void)
{
	int i;
	for (i=0; i<NMBUF; i++) /*initially all buffers are on the free list*/
	{
		menqueue(&mbufflist, &mbuff[i]);
	}
}
/*Note the System Call API was renamed to kCall (kernel call).*/
static MBUFF_t *get_mbuf() //get a free buffer
{
	MBUFF_t *mp;
	kCall(SemaWait, &nmbuf); /*if no buffers, block*/
	kCall(SemaWait, &mlock); /*CR*/
	mp = mdequeue(&mbufflist);
	kCall(SemaSignal, &mlock); /*CR*/
	return mp;
}
static int put_mbuf(MBUFF_t *mp)
{
	kCall(SemaWait, &mlock); /*CR*/
	menqueue(&mbufflist, mp);
	kCall(SemaSignal, &mlock); /*CR*/
	kCall(SemaSignal, &nmbuf); /*increases counting sema*/
	return OK;
}
int sendmsg(char *msg, int pid)
{
	tcb_t *r_task; 
	r_task = &tcbs[pid]; /*receiver task*/
	MBUFF_t *mp = get_mbuf();
	if (mp == NULL)
	{
		return NOK;
	}
	mp->sender_pid = RunPtr->pid;
	copyString_(mp->contents, msg);
	kCall(SemaWait, &r_task->mlock); /*CR*/
	menqueue(&r_task->mqueue, mp);
	kCall(SemaSignal, &r_task->mlock); /*CR*/
	kCall(SemaSignal, &r_task->nmsg); /*signal a new msg*/
	return OK;
}
int recvmsg(char *msg)
{
	int pid;
	kCall(SemaWait, &RunPtr->nmsg); /*wait for new msg*/
	kCall(SemaWait, &RunPtr->mlock); /*CR*/
	MBUFF_t *mp = mdequeue(&RunPtr->mqueue);
	if (mp == NULL) 
	{
		kCall(SemaSignal, &RunPtr->mlock);
		return 0;
	}
	kCall(SemaSignal, &RunPtr->mlock); /*CR*/
	copyString_(msg, mp->contents);
	pid = mp->sender_pid;
	put_mbuf(mp); /*releases buffer*/
	return pid;
}

7.4 Simple Validation

This small validation shows the proposed Message Passing mechanism is able to send and receive messages synchronously.

Here is the test code:

void Task1(void* args)
{
	init_mbuffs();	/*init buffers*/
	int ret=NOK;
    /*Application Data Units*/
	char apdu2[] = {0xA2, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x02, 0x02};
	char apdu3[] = {0xA3, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF, 0x03, 0x03};
	while(1)
	{
		ret = sendmsg(apdu2, 2);
		if (ret == OK) printline_("Sent to 2\n\r");
		else assert(0);
		ret = sendmsg(apdu3, 3);
		if (ret == OK) printline_("Sent to 3\n\r");
		else assert(0);	
	}
}
void Task2(void* args)
{
	 int32_t sender_pid=0;
	 char msgrcvd[MSG_SIZE]={0};
	while(1)
	{  
		sender_pid = recvmsg(msgrcvd);	
		if (sender_pid == 1) { printline_("2 rcvd from 1\n\r");}
		else assert(0);
		(void)msgrcvd;
	}
}
void Task3(void* args)
{
	int32_t sender_pid=0;
	char msgrcvd[MSG_SIZE]={0};
	while(1)
	{
		sender_pid = recvmsg(msgrcvd);
		if (sender_pid == 1) { printline_("3 rcvd from 1\n\r");}
		else assert(0);
		(void)msgrcvd;
	}
}

Note the tasks send and receive in loops, being blocked/preempted along the way. No synchronization was added beyond what the implemented mechanism itself provides.

The total number of message buffers is 2, since communication is synchronous. The message size was set to 8 bytes.

Below is a picture of the debugging window of Task 2, with the proper message copied from its queue.

Most important are the printed lines showing the expected synchronous behaviour:

Below is another demonstration of message passing. An interrupt-driven system receives a 2-byte command from the keyboard via UART. The first byte is the PID of the receiver, the server. The server's job is to blink an LED at a certain frequency; the number of blinks is given by the second byte.

In this example, the Message Passing interface is implemented as a Kernel Call.

/*
 Message Passing Usage:
 This example receives a 2-byte message from UART
 The first byte is the PID of the receiving task
 The second byte is the number of led toggles
 Each server toggles the led on a different frequency 
*/
#include <stdio.h>
#include <bsp.h> 
#include "kernel.h"
#include "tasks.h"
MUTEX_t print;
SEMA_t rcvd;
MSG_t uartbuffer[2];
volatile int norb = 0;
volatile char rcvdbyte;
ISR(UART_Handler)
{
    __disable_irq();
	if((uart_get_status(UART) & UART_SR_RXRDY) == UART_SR_RXRDY)
	{
		uart_read(UART, &rcvdbyte);
		uartbuffer[norb] = rcvdbyte;
		norb++;
		if (norb == 2) 
		{
			kSemaSignal(&rcvd);
			norb=0;
		}		
	}
   __enable_irq();
}
static inline void printline_(const char* string)
{
	kCall(LOCK, &print);
	uart_write_line(UART, string);
	kCall(RELEASE, &print);
}
/* Client */
/* init_mbuffs() was called on kernel init */
void Task1(void* args)
{	
    kSemaInit(&rcvd, 0);
    kMutexInit(&print, 0);
	while(1)
	{
		WAIT(&rcvd); /*block until signaled by ISR*/
		if (uartbuffer[0] == 2) 
		{ 
			kCall2(SENDMSG, uartbuffer, (int)uartbuffer[0]);
        }
		else if (uartbuffer[0] == 3) 
		{ 
			kCall2(SENDMSG, uartbuffer, (int)uartbuffer[0]);
        }
		else
		{
			printline_("No route\n\r");
            SUSPEND(); //yield
		}
	}
}
/* Server */
void Task2(void* args)
{
	MSG_t msgrcvd[MSG_SIZE]={0};
	while(1)
	{
		kCall(RECVMSG, msgrcvd);
		if (SYSRET == 1) /*rcvd from PID 1*/
		{
			printline_("2\n\r");
			for (int i = 0; i<msgrcvd[1]; i++)	
			{
				gpio_toggle_pin(LED0_GPIO);
				delay_ms(500);
			}
		}
	}
}
/* Server */
void Task3(void* args)
{
	MSG_t msgrcvd[MSG_SIZE]={0};
	while(1)
	{
		kCall(RECVMSG, msgrcvd);
		if (SYSRET == 1)
		{
			printline_("3\n\r");
			for (int i = 0; i<msgrcvd[1]; i++)
			{
				gpio_toggle_pin(LED0_GPIO);
				delay_ms(1000);
			}
		}
	}
}

This usage demonstrates message passing conveying commands to be processed by servers. Going from this “Hello World” to a system performing more complex work in a client-server pattern might take some effort, but it is quite doable now that the mechanisms are in place.

8. FINAL REMARKS

This article ends a series of four publications that demonstrate the design and implementation of a functional preemptive kernel on an ARM Cortex-M3 MCU, sitting on top of the CMSIS layer. The GNU ARM toolchain was used.

I started by creating a simple preemptive round-robin scheduler to preempt concurrent tasks running in kernel space. Then user and kernel threads were split, which required a system call mechanism. Synchronization primitives were designed and implemented, and with them it was possible to create inter-task communication schemes.

Starting by exploring the ARM Cortex-M3 MCU, this series evolved into concepts and practice of embedded operating system design, and I learned a lot while researching to write it. Although the result is a useful, functional multitasking engine, there is still a distance from this stage to a full-fledged (RT)OS kernel. This is an ongoing project, in a very early phase. The full source code can be found at https://github.com/antoniogiacomelli/k0ba.

References

[1] Embedded and Real-Time Operating Systems (K.C. Wang, 2017)

[2] Design and Implementation of the MTX Operating System (K.C. Wang, 2015)

[3] Modern Operating Systems (Andrew S. Tanenbaum, Herbert Bos, 2014)

Designing and implementing IPC mechanisms on an ARM Cortex-M3 kernel (1/2)

1. INTRODUCTION

I do think system programming is one of the nicest things to do: taking a computer board (I/O, CPU, memory) and creating its system software from scratch so it performs some work is awesome.

I have presented a kernel implementation for an ARM Cortex-M3 microprocessor. So far, the features are:

  • Fixed preemptive round-robin scheduling driven by a hardware interrupt
  • Cooperative mechanism for user threads via software interrupt (not described in the publications)
  • Preemptive-cooperative mechanism for kernel threads
  • System Calls (software interrupts) to separate user threads from kernel threads
  • … and only that! :O

Even these few features took a considerable amount of knowledge of the processor architecture, not-so-basic low-level programming skills, and operating system concepts.

Now we will keep improving our kernel and start implementing the next important feature: interprocess communication.

2. CONCEPTS AND MOTIVATION

The term process is perhaps not the most adequate when talking about embedded systems with no memory virtualization. Since all the concurrent execution units share the same heap, globals, and I/O, the preferred term is task, which is widely used when talking about embedded kernels. Nevertheless, most of the literature uses task and process interchangeably.

The need for synchronization

A central concept is the idea of concurrent units competing for a limited number of resources on a computer.

«A Concurrent Unit is a set of sequential actions in the same thread of execution. Also known as a task or thread.»

Figure 1. A, B, C and D are Concurrent Units: a) Sharing a single core (pseudo-concurrent) b) Each unit has its own core c) When running, only one concurrent unit will be active at a given time (Figure from [2])

For a system to achieve its goals, tasks must cooperate in their use of resources.

«Synchronization points are those where tasks will share information and/or ensure that certain preconditions are satisfied before proceeding.» 

In Figure 2, the synchronization barrier only allows the program to proceed when all processes have reached that point – the conditions for all system actions to be correctly synchronized hold.

Figure 2. a) Running processes b) All processes except C are ready c) Once process C reaches the synchronization point, the program moves on. (Figure from [2])

Figure 3 shows concurrent units, modeled as three tasks. What do we know about them? Analyzing each individually, it is clear that in Task 1 Action A occurs before Action B, which goes through a conditional path until reaching E. The same reasoning goes for Tasks 2 and 3.

How are the individual actions of Tasks 1, 2, and 3 related to each other? If the (true) answer is “they do not relate”, there is no concurrency problem to be solved. And your system is probably very simple or very well designed.

Figure 3. Concurrent units represented as flowcharts (Figure from [1])

However, if at some point it is necessary to ensure, for example, that Actions FX and Zeta execute only after Actions E, Z and Gamma meet certain conditions, we need somehow to join the states of these tasks into logically a single thread at a synchronization point. After the necessary conditions have been met at the synchronization point, the threads fork again and the program continues.

This means that whenever there are conditions between two or more processes to achieve a certain objective, we need to create mechanisms for them to communicate.

When I presented how the System Calls were implemented, there was a call to the UART peripheral like this:

static void SysCallUART_PrintLn_(const char* args)
{
    __disable_irq(); // critical section begin
    uart_write_line(UART, args); 
    while (!(uart_get_status(UART) & UART_SR_TXRDY)); // busy-waits until transmission is done 
    __enable_irq(); // critical section end
    exitKernel_(); // exit kernel cooperatively
}

When a task makes this system call to access the UART, all interrupts are disabled and nothing else runs until the transmission ends. Bad. Instead of busy-waiting for the transmission to end, it would be better to switch to other tasks that are not using the UART. A sketch of what this could look like, once we have blocking semaphores, follows.
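Once blocking semaphores are in place (Sections 3 and 4), the call could be reworked roughly as below. This is a sketch only: the UART TX-complete interrupt handler and the tx_done semaphore are assumptions, not part of the kernel as presented in this series.

/* Sketch: block the calling thread until a (hypothetical) TX-complete
   interrupt signals the semaphore, instead of busy-waiting. */
kSemaphore_t tx_done = 0; /* 0 = no transmission finished yet */

void UART_TX_Handler(void) /* hypothetical TX-complete ISR */
{
	SysCallSemaSignal_(&tx_done); /* wake the thread waiting on the UART */
}

static void SysCallUART_PrintLn_(const char* args)
{
	uart_write_line(UART, args); /* start the transmission */
	SysCallSemaWait_(&tx_done);  /* block; the scheduler runs other tasks */
	exitKernel_();               /* exit kernel cooperatively */
}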

Semaphores

Among the many valuable concepts Dijkstra contributed to computer science, the idea of semaphores for operating systems is a big one. The simplest semaphore is a non-negative integer “s” that accepts two atomic operations (apart from initialization), Wait and Signal:

wait(s): wait until s > 0, then decrement s.
signal(s): increment s.

It seems simple, but it is not. Semaphores can be very error-prone in use; they are really primitives from which to build higher-level IPC schemes.

Synchronization [3]

The simplest form of synchronization is that a process A should not proceed beyond a point L1 until some other process B has reached L2 [3]. Such situations arise whenever A requires, at point L1, information that is provided by B when it reaches L2. Synchronization of the two processes below is achieved if proceed is a semaphore initialized to 0.

Figure 5. Semaphore used for synchronization. (Figure from [3])
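In code, the idea of Figure 5 could be sketched as follows, using the SysCall semaphore API implemented in Section 4; the process bodies are illustrative:

kSemaphore_t proceed = 0; /* initialized to 0: the first Wait blocks */

void ProcessA(void* args)
{
	/* ... */
	SysCall(SemaWait, &proceed); /* point L1: blocks until B signals */
	/* ... uses the information produced by B ... */
}

void ProcessB(void* args)
{
	/* ... produces the information A needs ... */
	SysCall(SemaSignal, &proceed); /* point L2: releases A */
}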

Mutual Exclusion [3]

Non-shareable resources, whether peripherals, files, or data in memory, can be protected from simultaneous access by several processes by preventing the processes from concurrently executing the pieces of program through which access is made. [3] These pieces of program are called critical sections.

Mutual exclusion in the use of resources can be regarded as mutual exclusion in the execution of critical sections, as in the figure below, where mutex is a semaphore.

Figure 4. Semaphore for mutual exclusion. If the initial value of mutex is 1, the first process to grab it will lock the resource. (Figure from [3])

PS: Mutexes and semaphores are different mechanisms in an operating system, and the operations on a mutex are usually called “Lock” and “Release”. Here, for simplicity of exposition, a semaphore is used to perform mutual exclusion; a minimal sketch follows. See https://www.beningo.com/everything-you-need-to-know-about-semaphores-and-mutexes/
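A minimal sketch of Figure 4, again with the SysCall API from Section 4 (the resource access itself is illustrative):

kSemaphore_t mutex = 1; /* 1 = resource free */

void ProcessX(void* args)
{
	while (1)
	{
		SysCall(SemaWait, &mutex);   /* first process to grab it locks */
		/* critical section: access the non-shareable resource */
		SysCall(SemaSignal, &mutex); /* release for the next process */
	}
}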

Deadlocks

When processes compete for resources, we can end up in a situation where none of them can move forward. Let A and B be processes. A is blocked, waiting for a resource to be freed in order to move on. B, in turn, will only release this resource when a certain condition is satisfied. If this condition depends on A, which is already blocked, we have a deadlock. The classic two-resource version of this situation is sketched below.
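In the sketch, each task holds one semaphore while waiting for the other, so each blocks forever waiting for a resource the other will never release (task bodies are illustrative):

kSemaphore_t r1 = 1, r2 = 1; /* two resources, both initially free */

void TaskA(void* args)
{
	SysCall(SemaWait, &r1);   /* A takes r1... */
	SysCall(SemaWait, &r2);   /* ...then blocks here if B already holds r2 */
	/* never reached under the circular wait */
	SysCall(SemaSignal, &r2);
	SysCall(SemaSignal, &r1);
}

void TaskB(void* args)
{
	SysCall(SemaWait, &r2);   /* B takes r2... */
	SysCall(SemaWait, &r1);   /* ...then blocks waiting for A's r1 */
	SysCall(SemaSignal, &r1);
	SysCall(SemaSignal, &r2);
}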

Spin locking and Blocking

The simplest semaphore implementation is the spinlock. It protects a resource from concurrent access, but it does not prevent a task that needs that resource from being dispatched. In Figure 6, if Task1 is preempted while still using the resource, Task2 runs even though it accomplishes nothing – it just spins. At least we have not disabled all interrupts, so Task3 can still perform useful work.

Figure 6. Spin locking. The processor is granted to Task2 even though it is locked.

To solve the spinning problem we need to tell the scheduler that a task waiting for a resource cannot be dispatched. It is “blocked” until the task using the resource signals the semaphore. This means the Wait and Signal operations need to be able to switch a thread's status between blocked and runnable.

Figure 7. Blocking Semaphore: Task 2 will not be schedulable until resource is released.

3. DESIGN

Big Picture

The initial design and implementation of the kernel were presented in great detail in two publications, starting from a round-robin scheduler with a single stack pointer up to a scheduler managing user and supervisor threads, the latter accessed through System Calls.

Now, adding a semaphore mechanism to serialize access to a resource, the kernel model is represented in UML:

Figure 8. Kernel Model serializing access to a single resource. Note how dependent from the semaphore the thread control flow is. Bad IPC can create hard-to-find bugs.

A given resource may have a semaphore managing how it is accessed. When a task calls Wait on a semaphore associated with a resource that has no slots available, the thread is blocked. When a task calls Signal on a semaphore and the resource is released, the next blocked thread in the queue waiting on that semaphore is set as runnable. With some effort, other criteria can be added: priority, aging, etc.

A Task has a Thread Control Block, which is associated with the Scheduler. The Scheduler is responsible for selecting (scheduling), dispatching, and saving/resuming user and kernel threads.

The Scheduler flowchart, updated to consider blocked threads:

Figure 9. Scheduler flowchart.

So far, the description I gave of a semaphore took for granted how the Wait and Signal operations are designed. Some authors [2] use the terms Down and Up, respectively.

Blocking Semaphore

I will use an approach based on [4]. A blocking semaphore is a signed integer variable. Every Signal operation increments it; every Wait operation decrements it. This way, a value N > 0 means the resource has N slots available to be shared. A value of 0 means it is taken. A negative value, say -2, means it is taken by one thread and two other threads are waiting for it. With this approach we can infer the number of tasks blocked on the queue.

Below is the flowchart of the described semaphore operations. Note they are system calls; that is why the last step is to exit the kernel, as done for the other system calls.

Figure 10. Flowchart for Semaphore operations

4. IMPLEMENTATION

The kSemaphore_t type was defined as an int32_t. The Wait and Signal procedures are System Calls, and their callbacks are written as below, following the pattern already presented for System Calls. exitKernel_() tells the system to leave the kernel and call the user thread switcher (an unblocked user thread will be launched/resumed).

static void SysCallSemaSignal_(kSemaphore_t* s)
{	
	tcb_t* next_tcb_ptr;
	__disable_irq();
	(*s) = (*s) + 1;
	if ((*s) <= 0)
	{
       /* switching RunPtr->next, not RunPtr! */
		next_tcb_ptr = RunPtr->next;
		while (next_tcb_ptr->block_pt != s)
		{
			next_tcb_ptr = next_tcb_ptr->next;
		}
		next_tcb_ptr->block_pt=0; /*unblocks a blocked by s*/
	}
	__enable_irq();
}
static void SysCallSemaWait_(kSemaphore_t* s)
{
	__disable_irq();
	(*s) = (*s) - 1;
	if ((*s) < 0)
	{
		RunPtr->block_pt = s; // reason it is blocked
        __enable_irq();
	    exitKernel_();
	}
	__enable_irq();
}

To make a System Call to a Semaphore, the API is the following:

// General SysCall API
SysCall(SysCall_ID, arguments)
// Semaphores
// Wait Procedure:
SysCall(SemaWait, kSemaphore_t* sema)
// Signal Procedure:
SysCall(SemaSignal, kSemaphore_t* sema)

4.1 Changes

Some small but critical changes were made to the code. The Thread Control Block data structure now has a block field: a pointer to a semaphore, as explained, the one that blocked the thread.

// @File kernel.h
// Structs for Threads
//********************
struct tcb 
{
	int32_t*		psp;			/*last saved process stack*/
	int32_t*		ksp;			/*last saved kernel stack*/
	struct tcb		*next;			/*next tcb*/
	int32_t			pid;
	int32_t			kernel_flag;	/*kernel flag 1=user thread, 0=kernel thread*/
	kSemaphore_t*	block_pt;		/*0=not blocked, semaphore address=blocked*/
};
/* Globals */
extern tcb_t tcbs[NTHREADS];	// thread control block				
extern int32_t p_stacks[NTHREADS][STACK_SIZE];   // process stack
extern int32_t k_stacks[NTHREADS][STACK_SIZE];   // kernel stack
extern tcb_t* RunPtr; 

The Scheduler routine within the SysTick handler also changed. First and most important, the scheduler routine is now handled by a lower-priority trap, PendSV; SysTick only pends PendSV.

//@File: system.s
.equ SCB_ICSR,		0xE000ED04 
.equ ICSR_PENDSVSET, (1 << 28)
SysTick_Handler:
	LDR R0, =SCB_ICSR
	LDR R1, =ICSR_PENDSVSET
	STR R1, [R0]	// pend PendSV
	ISB
	BX LR			// return; PendSV runs when no higher-priority exception is active
PendSV_Handler:		
	CMP		LR, #0xFFFFFFFD  // user thread?
	BEQ		SaveUserCtxt	 // yes
	CMP		LR, #0xFFFFFFF9  // kernel thread?
	BEQ		SaveKernelCtxt    
	BX		LR		//else
	SaveKernelCtxt:
	PUSH	{R4-R11}  //push R4-R11
	LDR		R0,=RunPtr
	LDR		R1, [R0]
	LDR		R2, [R1,#4]
	LDR		R3, =#0  
	STR		R3, [R1, #16] //kernel flag = 0
	STR		SP, [R2]
	B		Schedule
	SaveUserCtxt:
	MRS		R12, PSP
	STMDB	R12!, {R4-R11}
	MSR		PSP, R12
	LDR		R0,=RunPtr
	LDR		R1, [R0]
	LDR		R3, =#1  
	STR		R3, [R1, #16] //kernel flag = 1
	STR		R12, [R1]		
	B		Schedule
	
	Schedule:
	BL		kGetNext
	Dispatch:
	LDR		R0, =RunPtr
	LDR		R1, [R0]
	LDR		R2, [R1,#16]
	CMP		R2, #1				//kernel_flag==1
	BEQ		ResumeUser			//resume user thread
	B		ResumeKernel	    //resume kernel thread
	ResumeUser:
	CPSID	I
	LDR		R1, =RunPtr			//R1 <- RunPtr
	LDR		R2, [R1]
	LDR		R2, [R2]
	LDMIA	R2!, {R4-R11}		//sw stackframe 
	MSR		PSP, R2				
	MOV		LR, #0xFFFFFFFD		//LR=return to user thread
	MOV		R0, #0x03
	MSR		CONTROL, R0
	CPSIE	I
	BX		LR 
	
	ResumeKernel:
	CPSID	I
	LDR		R1, =RunPtr			
	LDR		R2, [R1]
	LDR		R2, [R2, #4]
	LDR		SP, [R2]
	POP		{R4-R11}
	MOV		LR, #0xFFFFFFF9
	MOV		R0, #0x00
	MSR		CONTROL, R0
	CPSIE	I
	BX		LR

Besides SysTick delegating the switching to a lower-priority software interrupt, there is the addition of the kGetNext routine:

//@File system.s
.global kGetNext
.type kGetNext, %function
//@File kernel.h
extern void kGetNext(void);
//@File kernel.c
/*
 Schedules next unblocked thread
*/
void kGetNext(void)
{
	__disable_irq();
	RunPtr = RunPtr->next; 
    /*!if there is a deadlock, the kernel will hang on this loop!*/
	while(RunPtr->block_pt || RunPtr->sleeping)
	{ 
    	RunPtr = RunPtr->next;
	}
	__enable_irq();
	__ASM volatile("BX LR"); // Dispatch
}

5. VALIDATION

Producer-consumer problem

Figure 11. Producer and Consumer processes sharing a FIFO [4]

A set of “producer” processes and a set of “consumer” processes communicate by inserting and extracting items from a buffer of size N. There are two main synchronization criteria: producers cannot write to a full buffer, and consumers cannot read from an empty buffer. That is, for i items inserted and e items extracted: 0 ≤ i − e ≤ N. Furthermore, there must be a mutex so only one process can access the buffer at a time.

In [3] the author proves that the following algorithm solves the producer-consumer problem.

Figure 12. Algorithm for a Producer-Consumer [3]

Space Available is a semaphore initialized to N; Buffer Manipulation is initialized to 1; Item Available is initialized to 0.

Now let's see if we can reproduce this solution with the implemented semaphores and scheduler. Task1, the producer, increments a counter and inserts it into a FIFO when there is space and the buffer is unlocked. Task2 reads this value when there is an item and the resource is unlocked.

The code below does not encapsulate the semaphores within the Data Queue, i.e., the FIFO methods do not handle the synchronization themselves.

#include "kernel.h"
#include "tasks.h"
#include "dataqueue.h" // FIFO service
/* Semaphores */
kSemaphore_t space = FIFOSIZE; //FIFOSIZE=10
kSemaphore_t item = 0;
kSemaphore_t mutex = 1;
/* Producer */
void Task1(void* args)
{
    volatile int counter1 = 0;
	while(1)
	{
		counter1++;
		SysCall(SemaWait, &space); /* wait there is space */
		SysCall(SemaWait, &mutex); /* lock */
		osFifoPut(counter1); /* insert into fifo */
		SysCall(SemaSignal, &mutex); /* unlock */
		SysCall(SemaSignal, &item); /* signals there is an item */
	}
}
/* Consumer */
void Task2(void* args)
{
    volatile int read = 0;
	while(1)
	{
		SysCall(SemaWait, &item); /* wait there is an item */
		SysCall(SemaWait, &mutex); /* lock */
		read = osFifoGet(); /* read the last written data */
		SysCall(SemaSignal, &mutex); /* unlock */
		SysCall(SemaSignal, &space); /* signal there is space*/
	}
}

I tested preemptive-only and cooperative-only scheduling, as well as preemptive+cooperative (although not presented, there is a cooperative mechanism for the user threads too).

The FIFO has 10 positions. In my tests the program never hangs, and the buffer never overflows or underflows. Well, I have nothing fancy to show here but my debugging window:

Figure 13. Producer-consumer code on debug

Now let’s make something a little more elaborate to test our synchronization primitives, and see it happening on the PC screen.

Let's add a third task. It starts and blocks waiting for a semaphore. This semaphore is signaled whenever the consumed number is divisible by 1000; the task then sleeps for 5 system ticks, wakes up, and starts all over again.

#include <stdio.h>
#include <asf.h>
#include "kernel.h"
#include "tasks.h"
#include "dataqueue.h" // FIFO
/* Semaphores */
kSemaphore_t space = FIFOSIZE; //FIFOSIZE=10
kSemaphore_t item =  0;
kSemaphore_t mutex = 1;
kSemaphore_t sleep = 0;
static inline void printline_(const char* string)
{
	usart_write_line(USART_SERIAL, string);
}
/* Producer */
void Task1(void* args)
{
	osFifoInit();
    volatile int counter1 = 0;
	while(1)
	{
		counter1++;
		SysCall(SemaWait, &space); /* waits there is space */
		SysCall(SemaWait, &mutex); /* locks */
		osFifoPut(counter1); /* insert into fifo */
		SysCall(SemaSignal, &mutex); /* unlocks */
		SysCall(SemaSignal, &item); /* signals there is an item */
	}
}
/* Consumer */
void Task2(void* args)
{
    volatile int read = 0;
	while(1)
	{
		SysCall(SemaWait, &item); /* waits there is an item */
		SysCall(SemaWait, &mutex); /* locks */
		read = osFifoGet(); /* reads the last written data */
		if (read % 1000 == 0) { SysCall(SemaSignal, &sleep); } /* signals task 3 to sleep */
		SysCall(SemaSignal, &mutex); /* unlocks */
		SysCall(SemaSignal, &space); /* signals there is space*/
	}
}
void Task3(void* args)
{
	while(1)
	{
		printline_("Task3 active\n\r");
		SysCall(SemaWait, &sleep);
		printline_("Task3 is going to sleep\n\r");
		SysCallSleepTicks(5);
	}
}

If you are wondering why the call to the UART is not a system call: the design is shifting to a microkernel approach, where drivers run in user space. Also, the sleep implementation has not been presented yet.

Here is the output:

On the next publication we will leverage the use of these primitives on the design and implementation of IPC mechanisms.


[1] Design Patterns for Embedded Systems in C (Bruce Powel Douglass, 2009)
[2] Modern Operating Systems (Andrew S. Tanenbaum, Herbert Bos, 2014)
[3] Fundamentals of Operating Systems (A. M. Lister, 1984)
[4] Embedded Systems: Real-Time Operating Systems for ARM Cortex-M MCUs (Jonathan Valvano, 2012)