Designing and implementing IPC mechanisms on an ARM Cortex-M3 Kernel (2/2)


Process communication refers to schemes or mechanisms that allow processes to exchange information. It can be accomplished in many different ways, all of which depend on process synchronization. [1] ITC (Inter-Task Communication) is a common term in the embedded domain.

Now that we have validated that our semaphores can synchronize tasks running concurrently on our system, it is time to use them to build higher-level communication mechanisms.

Microkernel architecture

The overall design goal is a microkernel: some tasks will implement servers, others will be clients, and they need to communicate.

Figure 14. The target architecture

Shared Memory, Pipes, and Message Passing are common communication schemes in operating systems [1, 2, 3]. Shared Memory with simple mutexes is effective for direct task communication within the same address space. Pipes are effective for sharing streams of data. Message Passing is used to exchange messages ready to be processed.

6.1 Synchronous versus Asynchronous mechanisms

Communication can be synchronous or asynchronous. The latter means that when a client calls a server, it is free to perform other work while the server has not yet answered. The former means that after sending, the client blocks until it gets an answer.

Figure 15. General client-server synchronous communication.

Asynchronous communication is intended mainly for loosely-coupled systems, in which interprocess communication is infrequent, i.e., processes do not exchange messages on a planned or regular basis. For such systems, asynchronous communication is more suitable due to its greater flexibility. [1]

Synchronous communication is well suited to tightly-coupled systems in which processes exchange messages on a planned or regular basis. In such a system, processes can expect messages to come when they are needed, and the usage of message buffers is carefully planned. [1]


Message Passing is a general form of inter-task communication. In operating systems it is the basis of the so-called microkernel; in distributed systems it underlies RPCs (remote procedure calls) and client-server programming on computer networks. [1, 2, 3]

Message Passing can be synchronous or asynchronous. In the asynchronous form, both send and receive operations are non-blocking. In the synchronous form, both are blocking.

In this publication I will show a synchronous message passing mechanism. As said, Message Passing is the basis of a microkernel: the messages will convey requests to be processed by servers.

7.1. Design and Implementation of a Synchronous Message Passing

In a simple client-server communication system, there are two kinds of processes: servers and clients. Clients can only communicate with servers, and servers can only communicate with clients. Each client has either a guaranteed or dedicated message buffer. After sending a request to a server, a client immediately waits for a reply since there is nothing else the client can do. Upon receiving a client request, a server handles the request, which may require some time delay, but it eventually sends a reply to the client. In more complex server-client systems, there are processes that can only be clients, processes that can only be servers and processes that can be both. [1,2]

The design and implementation presented here is based on [1, 2], and processes are either clients or servers, never both.

7.2 Design

We can extend the consumer-producer problem to N messages. [3]

buffer[N], N = number of messages
process producer:
 while (TRUE):
  produce contents
  receive empty buffer[i] from consumer
  put contents in buffer
  send buffer to consumer
process consumer:
 for k = 0 to N-1: send empty buffer to producer
 while (TRUE):
  receive full buffer[i] from producer
  extract contents from message
  mark buffer as empty
  send empty buffer to producer

Based on this algorithm, but extended to multiple producers and consumers and using a message queue to sequence message processing, the design is proposed as follows:

  • The message buffers are arranged on a linked list. The next field of the last buffer of the list points to zero, so we know it is the last. The message buffer data structure contains the sender PID, the message contents and the address of the next linked buffer.
  • Every thread has a pointer to a Message Buffer on its Thread Control Block. Since there is a limited number of buffers available on the system, when a client wants to send a message to a server, it must fetch a free buffer from a list. If there is a free buffer, it dequeues its address from the list, writes the message, enqueues it on the receiver message queue and signals the receiver there is a new message. Initially, all buffers are free.
  • When a receiver is signaled, it dequeues the message buffer from its own TCB, reads its contents, and releases the buffer, putting it back on the free list. The figure below shows a possible state for a system with seven message buffers. Three of them are taken, waiting to be read by the receiver threads: three send operations have occurred, and no receive operations have taken place for these messages yet.
  • The number of free buffers is controlled by a counting semaphore whose maximum count is the total number of buffers available on the system. Access to the free buffer list is a critical region (CR) contended by senders, so it is protected by a mutex. The same reasoning applies to the message queue of every thread control block; each has its own mutex.
Figure 16. Possible state of a Message Passing with three messages waiting to be read by receivers.

Note that for asynchronous communication the changes would affect only the sending and receiving algorithms; the architecture depicted in Figure 16 would still apply.

7.3 Implementation

Like the design, the implementation is simple. The hardest part is the synchronization; the linked queue may need some special attention. The send-receive interface must be a kernel call if the buffers are stored in kernel space (if you have an MMU). I hope the implementation mirrors the design idea:

/*@File message.h*/
/*Message Buffer Data Structure*/
#define MSG_SIZE 32
typedef struct mbuff
{
	struct mbuff *next;
	int			 sender_pid;
	char		 contents[MSG_SIZE];
} MBUFF_t;
/*initialize free list*/
extern void init_mbuffs(void);
/*send *msg to pid, return OK or NOK*/
extern int sendmsg(char *msg, int pid); 
/*receive msg and store at *msg, return sender PID, 
0 if fails (0 is a reserved PID)*/
extern int recvmsg(char *msg); 
/*@File kernel.h*/
/*Thread control block with message passing fields*/
struct tcb
{
	int32_t*		psp;			/*last saved process stack*/
	int32_t*		ksp;			/*last saved kernel stack*/
	struct tcb		*next;			/*next tcb*/
	int32_t			kernel_flag;	/*kernel flag 1=user thread, 0=kernel thread*/
	int32_t			pid;			/*process id*/
	kSemaphore_t*	block_pt;		/*blocked 0=not blocked, semaphore address=blocked*/
	uint32_t		sleeping;		/*sleep counter*/
	kSemaphore_t	nmsg;			/*new msg sema*/
	kSemaphore_t	mlock;			/*thread msg queue sema*/
	MBUFF_t			*mqueue;		/*thread msg queue*/
};
/*@File message.c*/
/* Synch Message Passing                                                 */
MBUFF_t mbuff[NMBUF]; /*message buffers*/
MBUFF_t *mbufflist=NULL; /*free message buffers*/
kSemaphore_t nmbuf=NMBUF; /*counting semaphore*/
kSemaphore_t mlock=1; /*sema for mbufflist*/
static inline void copyString_(char *dest, char *src)
{
	for (int i = 0; i < MSG_SIZE; ++i) { dest[i] = src[i]; }
}
static int menqueue(MBUFF_t **queue, MBUFF_t *p)
{
	MBUFF_t *q = *queue;
	if (q == NULL) /*empty list*/
	{
		*queue = p;  /*insert buffer address*/
		p->next = 0; /*make it the last*/
		return OK;
	}
	while (q->next) /*search for the last position*/
		q = q->next;
	q->next = p; /*take it*/
	p->next = 0; /*make it the last*/
	return OK;
}
static MBUFF_t *mdequeue(MBUFF_t **queue)
{
	MBUFF_t *mp = *queue;
	if (mp) *queue = mp->next;
	return mp;
}
void init_mbuffs(void)
{
	int i;
	for (i = 0; i < NMBUF; i++) /*initially all buffers are on the free list*/
		menqueue(&mbufflist, &mbuff[i]);
}
/*Note the System Call API was renamed to kCall (kernel call).*/
static MBUFF_t *get_mbuf(void) /*get a free buffer*/
{
	MBUFF_t *mp;
	kCall(SemaWait, &nmbuf); /*if no buffers, block*/
	kCall(SemaWait, &mlock); /*CR*/
	mp = mdequeue(&mbufflist);
	kCall(SemaSignal, &mlock); /*CR*/
	return mp;
}
static int put_mbuf(MBUFF_t *mp)
{
	kCall(SemaWait, &mlock); /*CR*/
	menqueue(&mbufflist, mp);
	kCall(SemaSignal, &mlock); /*CR*/
	kCall(SemaSignal, &nmbuf); /*increase counting sema*/
	return OK;
}
int sendmsg(char *msg, int pid)
{
	tcb_t *r_task;
	r_task = &tcbs[pid]; /*receiver task*/
	MBUFF_t *mp = get_mbuf();
	if (mp == NULL)
		return NOK;
	mp->sender_pid = RunPtr->pid;
	copyString_(mp->contents, msg);
	kCall(SemaWait, &r_task->mlock); /*CR*/
	menqueue(&r_task->mqueue, mp);
	kCall(SemaSignal, &r_task->mlock); /*CR*/
	kCall(SemaSignal, &r_task->nmsg); /*signal a new msg*/
	return OK;
}
int recvmsg(char *msg)
{
	int pid;
	kCall(SemaWait, &RunPtr->nmsg); /*wait for new msg*/
	kCall(SemaWait, &RunPtr->mlock); /*CR*/
	MBUFF_t *mp = mdequeue(&RunPtr->mqueue);
	if (mp == NULL)
	{
		kCall(SemaSignal, &RunPtr->mlock);
		return 0;
	}
	kCall(SemaSignal, &RunPtr->mlock); /*CR*/
	copyString_(msg, mp->contents);
	pid = mp->sender_pid;
	put_mbuf(mp); /*release buffer*/
	return pid;
}

7.4 Simple Validation

This small validation shows the proposed Message Passing mechanism is able to send and receive messages synchronously.

Here is the test code:

void Task1(void* args)
{
	init_mbuffs(); /*init buffers*/
	int ret = NOK;
	/*Application Data Units*/
	char apdu2[] = {0xA2, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x02, 0x02};
	char apdu3[] = {0xA3, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF, 0x03, 0x03};
	while (1)
	{
		ret = sendmsg(apdu2, 2);
		if (ret == OK) printline_("Sent to 2\n\r");
		else assert(0);
		ret = sendmsg(apdu3, 3);
		if (ret == OK) printline_("Sent to 3\n\r");
		else assert(0);
	}
}
void Task2(void* args)
{
	int32_t sender_pid = 0;
	char msgrcvd[MSG_SIZE] = {0};
	while (1)
	{
		sender_pid = recvmsg(msgrcvd);
		if (sender_pid == 1) { printline_("2 rcvd from 1\n\r"); }
		else assert(0);
	}
}
void Task3(void* args)
{
	int32_t sender_pid = 0;
	char msgrcvd[MSG_SIZE] = {0};
	while (1)
	{
		sender_pid = recvmsg(msgrcvd);
		if (sender_pid == 1) { printline_("3 rcvd from 1\n\r"); }
		else assert(0);
	}
}

Note that the tasks are sending and receiving in loops, and being blocked/preempted. No additional synchronization was added besides that provided by the implemented mechanism.

The total number of Message Buffers is 2, since we are going synchronous. The message size was set to 8 bytes.

Below a picture of the debugging window of Task 2 with the proper message copied from its queue.

Most important are the printed lines showing the expected synchronous behaviour:

Below is another demonstration of message passing. An interrupt-driven system receives a 2-byte command from the keyboard via UART. The first byte is the receiver PID (the server). The server's job is to blink an LED at a certain frequency; the number of blinks is given by the second byte.

In this example, the Message Passing interface is implemented as a Kernel Call.

/* Message Passing Usage:
   This example receives a 2-byte message from UART.
   The first byte is the PID of the receiving task.
   The second byte is the number of LED toggles.
   Each server toggles the LED at a different frequency. */
#include <stdio.h>
#include <bsp.h>
#include "kernel.h"
#include "tasks.h"
MUTEX_t print;
SEMA_t rcvd;
MSG_t uartbuffer[2];
volatile int norb = 0; /*number of received bytes*/
volatile char rcvdbyte;
/* UART interrupt handler (the handler name depends on the vendor start-up code) */
void UART_Handler(void)
{
	if ((uart_get_status(UART) & UART_SR_RXRDY) == UART_SR_RXRDY)
	{
		uart_read(UART, &rcvdbyte);
		uartbuffer[norb] = rcvdbyte;
		norb++;
		if (norb == 2)
		{
			norb = 0;
			SIGNAL(&rcvd); /*wake the client (macro assumed to pair with WAIT)*/
		}
	}
}
static inline void printline_(const char* string)
{
	kCall(LOCK, &print);
	uart_write_line(UART, string);
	kCall(RELEASE, &print);
}
/* Client */
/* init_mbuffs() was called on kernel init */
void Task1(void* args)
{
	kSemaInit(&rcvd, 0);
	kMutexInit(&print, 0);
	while (1)
	{
		WAIT(&rcvd); /*block until signaled by the ISR*/
		if (uartbuffer[0] == 2)
			kCall2(SENDMSG, uartbuffer, (int)uartbuffer[0]);
		else if (uartbuffer[0] == 3)
			kCall2(SENDMSG, uartbuffer, (int)uartbuffer[0]);
		else
			printline_("No route\n\r");
		SUSPEND(); /*yield*/
	}
}
/* Server */
void Task2(void* args)
{
	MSG_t msgrcvd[MSG_SIZE] = {0};
	while (1)
	{
		kCall(RECVMSG, msgrcvd);
		if (SYSRET == 1) /*rcvd from PID 1*/
		{
			for (int i = 0; i < msgrcvd[1]; i++)
			{
				/*toggle the LED at this server's rate (BSP-specific)*/
			}
		}
	}
}
/* Server */
void Task3(void* args)
{
	MSG_t msgrcvd[MSG_SIZE] = {0};
	while (1)
	{
		kCall(RECVMSG, msgrcvd);
		if (SYSRET == 1)
		{
			for (int i = 0; i < msgrcvd[1]; i++)
			{
				/*toggle the LED at this server's rate (BSP-specific)*/
			}
		}
	}
}

This usage demonstrates Message Passing conveying commands to be processed by servers. From this “Hello World” to the design of a system performing more complex tasks in a client-server pattern, it might take some effort, but it is quite doable because the mechanisms are in place.


This article ends a series of four publications that demonstrate the design and implementation of a functional preemptive kernel on an ARM Cortex-M3 MCU, sitting on top of the CMSIS layer. The GNU ARM toolchain was used.

I started by creating a simple preemptive round-robin scheduler to preempt concurrent tasks running in kernel space. Then user and kernel threads were split, which required a system call mechanism. Synchronization primitives were designed and implemented, and with them it was possible to create inter-task communication schemes.

Starting by exploring the ARM Cortex-M3 MCU, this series evolved into concepts and practice of embedded operating systems design, and while researching to write it I learned a lot. Although the result is a useful, functional multitasking engine, there is still a distance from this stage to a full-fledged (RT)OS kernel. This is an ongoing project, in a very early phase. The full source code can be found on


[1] Embedded and Real-Time Operating Systems, (K.C. Wang, 2017)

[2] Design and Implementation of the MTX Operating System, (K.C. Wang, 2015)

[3] Modern Operating Systems (Andrew Tanenbaum, Herbert Bos, 2014)

A first preemptive kernel on ARM Cortex-M3

1. Introduction

When it comes to operating systems for embedded software, they are generally thought to be overkill for most solutions. However, between a completely “hosted” multi-application system (which uses a multithreaded, general-purpose or real-time OS) and a completely “standalone/monolithic” or “bare-metal” application-specific system, there are many variations we can go for.

In previous publications I explored the notion of cooperative tasks, from a loop calling tasks from an array of function pointers, to something a little more complex with the processes being handled in a circular buffer, with explicit time criteria. This time I will show a minimal preemptive implementation, to also pave the way for something a little more complex, as in other publications.

2 Preemptive versus Cooperative

In a fully cooperative system, the processor does not interrupt any task to accept another; the task itself needs to release the processor for the next one to use it. There is nothing wrong with this. A run-to-completion scheme is sufficient and/or necessary for many applications, and many embedded systems were deployed this way, including some very complex ones. In the past, even non-embedded systems used a cooperative kernel (Windows 3.x, NetWare 4.x, among others). But in a strictly cooperative system, if a task crashes it compromises the entire system: it keeps the processor from going further.

In preemptive mode, tasks are interrupted and later resumed, i.e., a context (the set of states in the processor registers) is saved and then retrieved. This adds complexity to the implementation but, if well done, it increases robustness and the possibility of meeting tighter timing requirements, especially when combined with priority and/or periodicity criteria to manage the queue.

3 The core registers

A processor is, in fact, a programmable finite-state machine. Each state of our program can be defined by the set of core register values. This set dictates which program point is active. Therefore, activating a task means loading values into the registers so that this task will be processed. To resume a task, a previously saved (in memory) set of values is loaded back into the core registers.

In the ARM Cortex-M3, the 32-bit registers that define the active state of the processor are R0-R12, for general use, and the special-use registers R13-R15, in addition to the Program Status Register (xPSR). Its value is on top of any stackframe, and it is not actually a single physical register, but a composition of three (Application, Interrupt and Execution: APSR, IPSR and EPSR).


3.1. Load-store architectures

A load-store architecture is a processor architecture in which data from memory needs to be loaded into the core registers before being processed. Likewise, a result must be in a register before it can be stored back to memory.


The two basic memory access operations on Cortex-M3:

// reads the data contained in the address indicated by Rn + offset and places it in Rd. 
LDR Rd, [Rn, #offset]
 // stores data contained in Rn at the address pointed by Rd + offset 
STR Rn, [Rd, #offset]

It is important to understand at least the Cortex-M3 instructions shown below. I suggest sources [1] and [2] as good references.

MOV Rd, Rn // Rd = Rn
MOV Rd, #M // Rd = M, the immediate being a 32-bit value (here represented by M)
ADD Rd, Rn, Rm // Rd = Rn + Rm
ADD Rd, Rn, #M // Rd = Rn + M
SUB Rd, Rn, Rm // Rd = Rn - Rm
SUB Rd, Rn, #M // Rd = Rn - M
// PUSH saves Rn in a memory location on the stack.
// After a PUSH, the value of the stack pointer is decreased by 4 bytes
// POP increases SP by 4 bytes after loading data into Rn.
// this increase-decrease is based on the current address the SP is pointing to
PUSH {Rn}
POP {Rn}
B label // jump to routine label
BX Rm // jump to routine specified indirectly by Rm
BL label // jump to label and moves the caller address to LR

CPSID I // disable interrupts
CPSIE I // enable interrupts

We will operate the M3 in Thumb mode, where instructions are mostly 16 bits wide. According to ARM, this is done to improve code density while maintaining the benefits of a 32-bit architecture. Bit 24 of the PSR is always 1.

3.2. Stacks and stack pointer (SP)

Stack is a memory usage model [1]. It works in Last In, First Out (LIFO) fashion. It is as if I organized a pile of documents to read: it is convenient that the first document to be read is at the top of the pile, and the last at the bottom. The stack pointer is a kind of pivot that keeps control of the program flow by pointing to some position of the stack.

Figure 3. Model for using a stack: data is saved before being processed (transformed), preserving the previous information. (Figure from [1])
Figure 4. Regions of memory mapped on a Cortex-M3. The region available for the stack is confined between the addresses 0x20008000 and 0x20007C00. [1]

4 Multitasking on the ARM Cortex M3

The M3 offers two stack pointers (Main Stack Pointer and Process Stack Pointer) to isolate user processes from the kernel processes. Every interrupt service runs in kernel mode. It is not possible to go from user mode to kernel mode (actually called thread mode and privileged mode, respectively) without going through an interruption – but it is possible to go from privileged mode to user mode by changing the control register.

Figure 5. Changing context on an OS that isolates the kernel application [1]

The core also has dedicated hardware for switching tasks. The SysTick interrupt service can be used to implement synchronous context switching. There are also asynchronous software interrupts (traps) like PendSV and SVC. Thus, SysTick is used for the kernel's synchronous work, while SVC serves asynchronous calls, when the application makes a call to the system. PendSV is a software interrupt which by default can only be triggered in privileged mode. It is usually suggested [1] to trigger it within the SysTick service, because it is possible to keep track of the ticks to meet the time criteria; the SysTick interrupt is then served quickly, with no risk of losing any tick of the clock. A secure OS implementation would use both stack pointers to separate user and kernel threads, and separate memory domains if an MPU (Memory Protection Unit) is available. At first, we will only use the MSP in privileged mode.

Figure 6. Memory layout on an OS with two stack pointers and protected memory [1]

5. Building the kernel

On an operating system, the kernel is the program responsible – among other things, depending on the system – for managing processes (or tasks).

5.1. Stackframes and context switching

When a SysTick is served, part of the core register values are saved by the hardware (R0-R3, R12, R14 (LR), R15 (PC) and xPSR). Let's call the portion saved by the hardware the hardware stackframe. The remaining registers form the software stackframe [3], which we must save explicitly.

To think about our system, we can outline a complete context switch, depicting the key positions the stack pointer assumes during the operation (in the figure below, memory addresses increase from bottom to top; when SP points to R4 it is at a lower address than the PC on the stack).

Figure 7. Switching contexts. The active task is saved by the hardware and the kernel. The stack pointer is reassigned, according to pre-established criteria, to the R4 of the next stackframe to be activated. The data is restored and the task runs. (Figure based on [3])

When an interrupt takes place, SP will be pointing to the top of the stackframe to be saved (SP(0)). This is inevitable, because this is how the M3 works. On an interrupt, the hardware saves the hardware stackframe at the 8 addresses below the stack pointer, stopping at SP(1). When we save the remaining registers, the SP will then be pointing to the R4 of the current stack (SP(2)). When we reassign the SP to the address of the R4 of the next stack (SP(3)), the POP loads the values of R4-R11 into the core registers and the stack pointer lands at SP(4). Finally, the return from the interrupt pops the remaining stackframe, and SP(5) is at the top of the stack that has just been activated. (If you are wondering where R13 is: it stores the value of the stack pointer itself.)

The context switching routine is written in assembly and implements exactly what is described in Figure 7.

Figure 8. Context switcher (sorry for the comments in Portuguese…)

PS: when an interrupt occurs, the LR takes on a special code (EXC_RETURN): 0xFFFFFFF9 if the interrupted thread was using the MSP, or 0xFFFFFFFD if it was using the PSP.

5.2 Initializing the stacks for each task

For the above strategy to work, we need to initialize the stacks for each task accordingly. The SP starts by pointing to R4. This is by definition the starting stack pointer of a task, as it is the lowest address in a frame.

In addition, we need to create a data structure that correctly points to the stacks that will be activated at each SysTick service. We usually call this structure a TCB (thread control block), and it is implemented as a linked list. We do not use any selection criteria, and therefore there are no control parameters other than next: when a task is interrupted, the next one in the list is resumed and executed.

Figure 9. Thread control block
Figure 10. Initializing the stack (the values representing the registers, like 0x07070707 are for debugging purposes)

The kSetInitStack function initializes the stack for each thread “i”. The stack pointer in the associated TCB points to the data corresponding to R4. The stack data is initialized with the register number, to facilitate debugging. The PSR only needs to be initialized with bit 24 set to 1, which is the bit that identifies Thumb mode. The tasks have the signature void Task(void *args).

To add a task, we initially need the address of the task's main function. In addition, we will also pass an argument. The first argument goes in R0. If more arguments are needed, other registers can be used, according to the AAPCS (the ARM Architecture Procedure Call Standard).

Figure 11. Routine for adding tasks and their arguments to the initial stackframe

5.3. Kernel start-up

We have two types of threads running: background and foreground threads. The background includes the kernel routines, among them the context switcher. At each SysTick, it is the kernel's turn to use the processor. In the foreground are our tasks.

In [2] a start-up routine is suggested:

Figure 13. Routine for booting the kernel

The PC address contained on the TCB pointed by RunPtr is popped onto LR. After moving the SP to the top, the branch is performed.

6. Putting it all together

To illustrate, we will run, in round-robin, 3 tasks that toggle the outputs of 3 different pins and increment three different counters. The time between one context switch and the next will be 1001 cycles of the main clock. Note that these functions run within a “while(1){}”; it is as if we had several main programs running in the foreground.

Figure 14. System Tasks

Below is the main function of the system. The hardware is initialized, tasks are added and the stacks are initialized. The RunPtr receives the address of the first thread control block. After setting the SysTick to trigger every 1001 clock cycles, the kernel is started. Now the scheduler is running in the background; each task has its turn to use the processor for a given time slice, and then it is preempted.

Figure 15. Main program

6.1. Debug

You will need at least a simulator to implement the system more easily, as you will need to access the core registers and see the data moving in the stacks. If the system is working, each time the debugger is paused, the counters should have almost the same value.

In the photo below, I use an Arduino Due board with an Atmel SAM3X8E processor and an Atmel ICE debugger connected to the board’s JTAG. On the oscilloscope you can see the waveforms of the outputs switching for each of the 3 tasks.

Figure 16. Debug bench
Figure 17. Tasks 1, 2, 3 on the oscilloscope.

7 Conclusions

The implementation of a preemptive kernel requires reasonable knowledge of the processor architecture to be used. Loading the call stack registers and saving them in a “handmade” way allows us to have greater control of the system at the expense of the complexity of handling the stacks.

The example presented here is a minimal one, where each task is given the same amount of time to run. After that time, the task is interrupted and suspended, and the data from the call stack is saved. This saved set is called a stackframe: a “photograph” of the point at which the program was. The next task to be performed is loaded from the point where it was interrupted and resumed. The code was written in order to explain the concepts.

In the next publication we will split the threads between user mode and privileged mode – using two stack pointers – a fundamental requirement for a more secure system.

Next publication


The text of this post as well as the non-referenced figures are from the author.
[1] The definitive guide to ARM Cortex-M3, Joseph Yiu
[2] Real-Time Operating Systems for the ARM Cortex-M, Jonathan Valvano

Low-power 3/3: microarchitecture and RTL

6 Approaches to power reduction in RTL

Synthesis and back-end tools detect opportunities to save power directly from the RTL, so the microarchitecture, and consequently the coding style, has a great impact on power consumption. As the flow advances, the interventions that can reduce dynamic and static power become smaller. A good low-power design starts at the architectural levels, as already discussed, and even before synthesis, while the code is being written in a hardware description language. It is said that 80% of the power consumption of an ASIC is defined by the time the RTL is finalized; only the other 20% depends on the back-end. [1]

6.1 State machines – encoding and decomposition

Minimizing bit toggling in the transitions between states naturally minimizes the power consumption of a state machine. Gray encoding, in which only one bit flips in a transition between adjacent states, is a cliché for power reduction.

Figure 6.1 Binary versus Gray encoding in a 5-state FSM (Figure taken from [1])

Still, a more ad hoc approach can be considered: identify the paths most frequently exercised by the machine and encode them so that fewer bits toggle between one state and the next.

In Figure 6.1, for a 5-state machine, there are 32 possible transitions. In this case, only 6 can actually happen. In binary encoding, 2 bits flip in the transitions B->C and E->B. If 30% of the transitions during normal operation are E->B (or B->C), the energy saved with Gray code is 15% from register bit toggling alone, and there will also be a power reduction in the combinational logic, in the same proportion or even greater [1].

Another way to save energy is to decompose an FSM into different sub-machines, guaranteeing that together they have the same STG (state transition graph) as the original machine. During operation, except when there are transitions between one sub-machine and another, only one needs to be active. A problem with this approach is that STGs grow exponentially with the number of registers, and the techniques for decomposing the machine also become more complicated [3].

6.2 Binary representation

The representation and operation of negative numbers in digital circuits has already been discussed on this site. For most applications, two's complement is more suitable than sign-magnitude representation [1]. However, for some very specific applications, sign-magnitude can be advantageous. In a transition from “0” to “-1”, the two's complement representation flips all the bits, unlike sign-magnitude logic:

// two's complement:
"0" : 00000000
"-1": 11111111

// sign-magnitude
"0" : 00000000
"-1": 10000001

6.3 Clock-gating inference

Below are two possible Verilog descriptions of a flip-flop bank (reg_bank_ff) that registers a new value (reg_bank_nxt) only when the select signal load_sel is high. The next value also depends on a select signal deadbeef_sel:

Code 1. Clock-gating inference

In the “bad example”, the signal i_load_sel_w was not placed directly under the clock domain, but inside a purely combinational logic block. The tool may still infer the clock gating after flattening, but it is not good practice to count on that. In the “good example”, the synthesis tool readily infers that i_load_sel_w is the control of the clock-gating cell to be integrated (line 47), saving the 32 multiplexers that would otherwise be controlled by i_load_sel_w, as described in the “bad example” code.

6.4 One-hot multiplexers

Multiplexers are usually described/inferred from case and if statements [1]:

// 4:1 mux (binary-encoded select line)
always @(*) begin
   case (SEL)
      2'b00: out = a;
      2'b01: out = b;
      2'b10: out = c;
      2'b11: out = d;
   endcase
end

// 4:1 mux (one-hot select line)
always @(*) begin
   case (SEL)
      4'b0001: out = a;
      4'b0010: out = b;
      4'b0100: out = c;
      4'b1000: out = d;
      default: out = out; // hold: no toggling for invalid select values
   endcase
end
When a multiplexer selects among multi-bit buses, the switching activity involved can be significant.

In Figure 6.3, on the right, the deselected buses are immediately masked by the AND logic at the input of the OR gate, which also reduces the delay from the selected input to the output.

Figure 6.3 – N:1 muxes – binary vs. one-hot [1]
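The AND-OR structure on the right of Figure 6.3 can also be written out explicitly in RTL. An 8-bit bus width and these signal names are assumed here:

```verilog
// One-hot 4:1 mux as explicit AND-OR logic: each replicated select
// bit masks a deselected bus to zero, so only the selected input
// can toggle the OR network.
wire [7:0] out_w = ({8{sel_w[0]}} & a_w) |
                   ({8{sel_w[1]}} & b_w) |
                   ({8{sel_w[2]}} & c_w) |
                   ({8{sel_w[3]}} & d_w);
```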

Much of the logic in a design can consist of multiplexers, so avoiding or masking spurious transitions should reduce power consumption significantly [1, 2]. Note that one-hot encoding is also a good default choice for state machines.

6.5 Avoiding redundant transitions

Consider the microarchitecture in Figure 6.4. The signal load_op enables the four operands a_in, b_in, c_in, and d_in. The signal sel_in selects which operation has its result loaded to the output. Note that load_op does not need to be asserted, and should not be, unless load_out is asserted as well, since the operations in the logic clouds would then have no effect on the output.

Figure 6.4 – A microarchitecture that allows redundant (useless) transitions [1]

In the figure below, the microarchitecture is modified to avoid useless transitions. Adding AND gates at the operand inputs enables a_in and b_in only when sel_in is '0', and c_in and d_in only when sel_in is '1'.

Figure 6.5 – Suppressing the redundant transitions [1]
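A minimal sketch of the gating added in Figure 6.5. The control names follow the text; the 8-bit operand width and the gated net names are assumptions:

```verilog
// Operands may toggle only when their result can actually reach the
// output: a_in/b_in when sel_in is 0, c_in/d_in when sel_in is 1,
// and in either case only while load_op is asserted.
wire [7:0] a_gated_w = {8{load_op & ~sel_in}} & a_in;
wire [7:0] b_gated_w = {8{load_op & ~sel_in}} & b_in;
wire [7:0] c_gated_w = {8{load_op &  sel_in}} & c_in;
wire [7:0] d_gated_w = {8{load_op &  sel_in}} & d_in;
```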

6.6 Resource sharing

Consider the following combinational logic, described in Verilog [adapted from 1]:

// Code #1
// without resource sharing
always @(*) begin
   case (SEL)
      3'b000: OUT = 1'b0;
      3'b001: OUT = 1'b1;
      3'b010: OUT = (value1 == value2); // equal
      3'b011: OUT = (value1 != value2); // different
      3'b100: OUT = (value1 >= value2); // greater or equal
      3'b101: OUT = (value1 <= value2); // less or equal
      3'b110: OUT = (value1 < value2);  // less
      3'b111: OUT = (value1 > value2);  // greater
   endcase
end

// Code #2
// sharing resources
assign cmp_equal   = (value1 == value2);
assign cmp_greater = (value1 > value2);

always @(*) begin
   case (SEL)
      3'b000: OUT = 1'b0;
      3'b001: OUT = 1'b1;
      3'b010: OUT = cmp_equal;
      3'b011: OUT = !cmp_equal;
      3'b100: OUT = cmp_equal || cmp_greater;
      3'b101: OUT = !cmp_greater;               // less or equal
      3'b110: OUT = !cmp_greater && !cmp_equal; // less: not greater, not equal
      3'b111: OUT = cmp_greater;
   endcase
end

Code #1 performs one arithmetic operation for each value of the select input SEL. These operations have in common the comparison of two operands, value1 and value2. In Code #2, two comparators are assigned to the nets cmp_equal and cmp_greater; these are then combined with cheap logic operations at the select outputs, avoiding the replication of the more expensive comparison operations over the same operands.

6.7 Bus inversion

When the Hamming distance between the current data word and the next is greater than N/2 (where N is the bus width), the next word can be inverted to minimize the number of transitions:

Figure 6.6 – On an 8-bit bus, when the Hamming distance between one word and the next is greater than 4, applying inversion reduces the number of transitions (adapted from [1]).

Note that a control signal INV is required to tell the receiver that the data was inverted. Even though an encoder is added at Tx and a decoder at Rx, the bus wires, and therefore the largest capacitances in the path, switch less often. The largest currents are avoided and transmitting the data consumes less energy.
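A minimal bus-invert encoder sketch under these rules. The module and port names are illustrative, not from the original post:

```verilog
module businv_tx #(parameter N = 8) (
    input  wire         clk,
    input  wire [N-1:0] data_i,   // next word to transmit
    output reg  [N-1:0] bus_o,    // physical bus
    output reg          inv_o     // tells Rx the word is inverted
);
    // bits that would toggle if data_i were sent as-is
    wire [N-1:0] diff_w = data_i ^ bus_o;

    // Hamming distance between the next word and the current bus value
    integer i;
    reg [31:0] hd;
    always @(*) begin
        hd = 0;
        for (i = 0; i < N; i = i + 1)
            hd = hd + diff_w[i];
    end

    always @(posedge clk) begin
        if (hd > N/2) begin       // inverting toggles fewer than N/2 wires
            bus_o <= ~data_i;
            inv_o <= 1'b1;
        end else begin
            bus_o <= data_i;
            inv_o <= 1'b0;
        end
    end
endmodule

// Receiver side decode: assign data_rx = inv_o ? ~bus_o : bus_o;
```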

The text of this post is original. The following sources were consulted:
[1] The Art of Hardware Architecture, Mohit Arora (ISBN 978-1-4614-0397-5)
[2] Ultra-Low Power Integrated Circuit Design, Nianxiong Nick Tan et al. (ISBN 978-1-4419-9973-3)
[3] FSM decomposition by direct circuit manipulation applied to low power design, J. C. Monteiro et al., DOI: 10.1109/ASPDAC.2000.835123