[blog] O Amplificador Diferencial

Comentários básicos sobre este workhorse em eletrônica. O conceito de amplificador diferencial é elemento-chave em uma infinidade de aplicações analógicas: buffers, filtros, osciladores, reguladores de tensão, referências de tensão/corrente, processamento de sinais, PLLs, etc., devido às suas propriedades úteis. As figuras foram retiradas do livro CMOS Circuit, Design, Layout and Simulation (Baker).

Figura retirada de analog.com

O Par diferencial

O conceito de diferencial é comum em engenharia e consiste na ideia de que determinado fluxo de energia é distribuído entre dois ramos ‘complementares’ de um dispositivo (no automóvel, isto permite que uma roda gire mais rápido que a oposta para que possamos fazer uma curva – cada uma das rodas vai percorrer diferentes comprimentos de arco no mesmo tempo). Em eletrônica o par diferencial nos permite principalmente eliminar ruídos de modo-comum (neste caso especificamente aqueles provindos de uma mesma fonte DC que alimenta os circuitos que geram os sinais AC de entrada, e possivelmente o amp-op).

O Amplificador Diferencial

A apresentação clássica do amplificador diferencial é assim:

Vout = (V+ – V)*Ganho [V]

Também sempre é dito que idealmente temos:

  • impedância de entrada infinita
  • impedância de saída nula
  • ganho infinito em malha aberta
  • rejeição de modo comum infinita

Vou tentar mostrar aqui porque estas características são importantes, utilizando circuitos analógicos CMOS básicos e análises intuitivas.

Talvez a primeira coisa que vem à mente de quem é introduzido ao assunto é que se o ganho é muito grande (como deve ser), a tensão Vout vai explodir. Primeiramente, esta equação está em malha aberta, sem a realimentação negativa, quando parte do sinal de saída é subtraído do sinal de entrada. Depois, na maior parte das vezes estamos interessados em trabalhar em pequenos sinais AC – centésimos ou milésimos de Vdd (v+, v-). Cabe ainda dizer que ganho mede-se em V/V (é um número sem dimensão), e independente do seu tamanho a máxima tensão alcançada será sempre Vdd.

Os pequenos sinais (v+,- v-) variam em torno da polarização DC (V+, V-). Polarizar significa determinar um ponto de operação do circuito, de cada transistor na verdade (em elementos semicondutores, as características elétricas variam conforme às tensões elétricas a que o dispositivo é submetido).  Na figura acima o ganho de malha aberta (open-loop) é mostrado como função da frequência do sinal de entrada. Uma topologia clássica do circuito amplificador diferencial em CMOS é assim:

(PS: transistores NMOS estarão em região linear se VGS >= Vth, e PMOS com VSG >= Vth e IDS = f(VDS,VGS), onde f assume formas distintas a depender da região de operação)

A corrente Iss é resultado da tensão de gate em M6, que estará com a corrente estável em um range de tensão dreno-fonte. Supondo uma tensão Vn, de modo comum aplicada aos gates de M1 e M2, a corrente drenada por M6 seja um valor Iss = Id1 + Id2.

Se extrairmos a tensão de saída no dreno de cada um dos transistores M1 e M2, Vout1,2= Id1,2*RCarga1,2. Isto basicamente diz que o ganho Av=Vout/Vin é proporcional à transcondutância do transistor em determinado ponto de polarização multiplicado pela carga, pois ID/Vin = Gm = 1/(Rds).

A saída Vout diferencial é VD1-VD2 ou ao contrário, a depender do que estabelecermos como saída positiva e negativa. Com a topologia acima é impossível dizer qualquer coisa sobre qual será a entrada positiva e qual será a negativa. Só podemos dizer isto quando definirmos qual a entrada que aumenta a tensão de saída, e a entrada que a diminui.

Para tanto, completa-se o par diferencial com uma carga espelho de corrente do tipo PMOS, uma carga ativa.

Se a tensão do gate M1 aumenta, mais corrente é drenada, maior a tensão VSG de M3 e M4 (ou menor a tensão VGS, se preferir). Se ambos M1 e M2 tiverem a mesma tensão de gate, a corrente no sentido dreno-fonte de cada elemento do par será ID1 = ID2 = Iss/2 [A]. Se tensão de gate em M1 aumenta em relação à tensão de gate M2, menos corrente é drenada por M2 e portanto menor a queda dreno-fonte em M2. Se M2 continuar no ponto de operação desejável, sua (trans)condutância é constante. Se a condutância de M2 mantém-se constante, mas sua corrente diminui, a tensão Vout = VDD – VSD4 obrigatoriamente aumenta. Então VG1 = V+. O raciocínio inverso vale para V: quanto maior a corrente drenada em M2, menor a tensão Vout. É este balanço na entrada que nos permite reduzir drasticamente o ruído de modo comum, se comparado a um amplificador de entrada única. Se adicionarmos mais um estágio de amplificação teremos ganho suficiente para chamá-lo de amplificador operacional. Tradicionalmente, o próximo estágio seria um amplificador PMOS, dreno-comum, com carga NMOS.

Ainda nesta topologia de só um estágio, a curva de transferência assume o seguinte, sendo o eixo horizontal a tensão de modo comum Vin nos gates do par diferencial, e no eixo vertical, Vout a tensão medida nos drenos de M2 e M4.

A derivada no ponto de polarização é o ganho do amplificador quando polarizado com a tensão correspondente Vin. Um bom ponto de polarização é aquela cuja derivada em relação ao pequeno sinal vin mantenha o maior intervalo em vout sem perturbar a transcondutância de M2 // M4. Idealmente é um ponto com derivada infinita. Pois para o sinal vout não remover M2 e M4 do ponto de operação desejável, as correntes drenadas não podem causar quedas de tensão que retire os transistores do ponto de operação, o que no limite significa impedância nula no nodo vout. Para a impedância de entrada ser infinita, a variação da tensão vgs de M1 e M2, não deve causar variação de tensão vgs + Vgs que retire o transistor da região de operação desejada. Se for saturado na região linear, a tensão dreno-fonte precisa manter-se em Vds >= tensão de overdrive (Vgs – Vth). A rejeição de modo comum será tão boa quanto a precisão do espelhamento de corrente – o mais iguais o módulo das correntes de dreno de M3 e M4, mais ruídos do sinal de tensão de modo-comum Vin estarão subtraídos no nodo Vout = Iout*RLout.

[blog] Synchronous versus asynchronous reset on ASIC Design

Reset is about getting a system back to a known initial state. Temporary data is flushed. When I say data I mean 0s and 1s. Everything will be put to a known state so the circuit starts up. Sounds important.

When we go for synchronous or asynchronous reset, active low, our HDL code will look like one of these:

module synch_reset_pipeline(in1, in2, ..., out1, out2,...);
// #1 module with synchronous reset
always @(posedge clk) 
begin
  if (!rst) begin
   // reset logic: put this module to known state 
  end
  else begin
  // functional logic: processes data
  end
end
endmodule
module asynch_reset_pipeline(in1, in2, ..., out1, out2,...);
// #2 module with asynchronous reset
always @(posedge clk or negedge rst) 
begin
  if (!rst) begin
   // reset logic 
  end
  else begin
  // functional logic
  end
end
endmodule

To simplify, firstly, think about a single sequential element (a flip-flop). Then, imagine it is starting up, catching the very first data signal arriving on the input and propagating it to other pipeline stages. How would be a desirable reset signal? I would say:

  • it resets the circuit to a known state
  • not prone to any metastability
  • the duty cycle is just long enough to meet the above criteria

Given the above bullets what is the go for approach? Whatever fits the system better, on area, timing and power. On module #1, we are telling the tool that the reset signal is under the domain of the clock, and the reset logic will take place when reset level is low. So when the clock positive edge triggers and reset is asserted, the flops that are connected to that logic will reset. Therefore, the reset signal is pretty much another data signal (like: assign current_input = (!rst_sel) ? reset_value : next_input), selecting between the next value to be sampled or the reset data value, when a reset happens. And we should make sure it will be filtered from glitches and never prone to any meta-stability – but this is valid to any other external input too. It cannot take the fastest way through the datapath when asserted, otherwise it might reset the logic violating setup requirements, and therefore put it to a unknown state, that will be sampled and propagated. The deassertion by its turn need to respect the hold time of the signals that were propagated through the pipeline. As data and reset are on the same clock domain, the work relies pretty much on the clock tree synthesis. It costs more circuitry in general. (Although flops are smaller, they say)

On #2, the asynchronous reset is a high priority interruption, per say. The or inside the sensitivity list changes everything. Whenever reset is low (asserted) it comes to bring everything down and up again. If the pulse has a very low duty cycle, maybe the flop will not be stable yet, and will put the the circuit to an unknown state when reset is released; maybe a very bad combination (boom). That is, we need to model this asynchronous path to respect a time after asserting and before deasserting the signal, so we make sure every pipeline stage is at a known state. To prevent timing violation when asserting, we need to make sure the reset will take place respecting the (worst) removal time, so the initial data to be sampled is valid. For deasserting, we need to make sure the clock samples a stable value, by respecting the (worst) recovery time. Besides, you need to tell the tool that the signal that multiplex the input to your flops between operation data and reset data is a false path – that is the clock must ignore the impact of this signal on the data path, and not make any efforts to support setup and hold timings of the sequential elements. You will take care by yourself by controlling the reset behavior. A technique could be to model the buffers safely enough to charge and discharge on a determined relative time, so the reset removal and recovery time are always safe. That is you fix a positive delay for reset assertion and a negative delay for deassertion, regarding the active clock edge. You will need to synchronize the input reset signal to a common in both aproaches to mitigate metastability. Asynchronous reset costs less circuitry, but is also more complex: a very critical asynchronous control signal is on the game.

[blog] Tip: C variables and pointers declarations and definitions

It might sound naive, but it is a very common interview question, and I actually think the notation can be deceiving. And furthermore, it is essential. A few months off and you will be asking yourself “how is that again?”. Unless you get this rule: start reading from right to left.

Question: Please read the following declarations and explain their properties regarding read and write.

#1 int v;

Reads as v is a variable of integer type

int v; //declaration
int same_scope_var = 0x10; //declaration and definition
v = 0; //definition
v = same_scope_var; //redefinition v=0x10

After the scope (declaration) and initial value of v is defined, you can read and write it in run-time.

#2 int const v;

Reads as v is a constant integer, i.e., read-only

After the scope and initial value of v is defined, you can only read it in run-time.

#3 int *v;

Reads as v is variable pointer to an integer.

int *v; //declaration
int same_scope_var = 0xFFFFFFFF; //declaration and defitinion
int same_scope_var2 = 0xF0FFFFFF; //declaration and defitinion
v = &same_scope_var ; //definition
v = &same_scope_var2 ; //definition

After the scope and initial value (that is a memory address that holds an integer) of v is defined, you can change it to whatever other integer address you want in run-time.

#4 int const *v;

Reads as v is variable pointer to a constant integer; that is, the address v holds can change, although it is expected to point to a constant value.

int const same_scope_k = 0xFFFFFFFF; //declaration and defitinion
int const same_scope_k2 = 0xF0FFFFFF;
int same_scope_var = 0xF0FBFFFF;
int const *v; //declaration
v = &same_scope_k; //definition
v = &same_scope_k2; //redefinition
*v = same_scope_k; //error wont compile
v = &same_scope_var; //probably compile with a warning

After the scope and initial value of v is defined, you can write and read it whenever you want in run-time, but you cannot change the value it points to by dereferencing.

#5 int * const v;

Reads as v is a constant pointer to a variable integer, i.e., v is ready-only, but the value that it points to (dereference) can change.

int same_scope_variable  = 0xFFFFFFFF; //declaration and definition
int same_scope_variable2  = 0xF0FFFFFF; 
int * const v; //declaration
v = &same_scope_variable  ; //definition
*v = same_scope_variable2; // redefinition for same_scope_variable = same_scope_variable2

After the scope and initial value of v is defined, you can only read the address it points to in run-time, but you can change the value the pointed address holds, by dereferencing.

#6 int const * const v;

Reads as v is a constant pointer to a constant integer. That is, the value v holds is a read-only address that points to a read-only integer. Its more usual to see that as a function argument when you want to be sure the method cannot change neither the address v holds nor the value it dereferences.

int const same_scope_k  = 0xFFFFFFFF; 
int const same_scope_k2  = 0xF0FFFFFF; 
int const * const v; //declaration
v = &same_scope_k; //defined, read-only
v = &same_scope_k2; //error wont compile
*v = same_scope_k2; //error wont compile

The code below should compile with no warnings. It is on a PC, therefore memory addresses have 8 bytes

#include <stdio.h>
int main()
{
    int my_variable = 0x30; 
    int const my_constant = 0x60;
    //#1 defined and declared v 
    int v = 0x05; 
    
    //#2 defined and declared read-only const
    int const v_k = 0x10; 
    int const v_k2 = 0x11;
    
    //#3 defined and declared v_ptr as the address of v
    int *v_ptr = &v; 
    
    //#4 declared and defined this as a pointer to a constant 
    int const *v_ptr_to_k = &v_k; 
    
    //#5 declared and defined this as a constant pointer to an variable integer
    int * const v_k_ptr = &my_variable; 
    
    //#6
    //declared and defined this as a constant pointer to a constant integer
    int const * const v_k_ptr_k = &my_constant;  
    // operations 
    // change v
    printf("v has the value %x, and v_ptr is %p\n\r", *v_ptr, v_ptr);
    v = 0x80;
    printf("v has now the value %x, and v_ptr is %p\n\r", *v_ptr, v_ptr);
    
    // changing v_ptr
    v_ptr = &my_variable;
    printf("v_ptr dereferences to value %x, and v_ptr is %p\n\r", *v_ptr, v_ptr);
    //change pointer to a constant
    printf("v_ptr_to_k dereferences to value %x, and v_ptr_to_k is %p\n\r", *v_ptr_to_k, (unsigned long int *)v_ptr_to_k);
    v_ptr_to_k = &v_k2;
    printf("v_ptr_to_k dereferences to value %x, and v_ptr_to_k is %p\n\r", *v_ptr_to_k, (unsigned long int *)v_ptr_to_k);
    //*v_ptr_to_k = 0x08; //<== not allowed!
    
    //operating with a constant pointer
    printf("v_k_ptr dereferences to the value %x, and v_k_ptr is %p\n\r", *v_k_ptr, (unsigned long int *)v_k_ptr);
    v = 0x35;
    *v_k_ptr = v;
    printf("v_k_ptr dereferences to the value %x, and v_k_ptr is %p\n\r", *v_k_ptr, (unsigned long int *)v_k_ptr);
    
    //operating with a constant pointer to a constant integer
    printf("v_k_ptr_k dereferences to the value %x, and v_k_ptr_k is %p\n\r", *v_k_ptr_k, (unsigned long int *)v_k_ptr_k);
    //*v_k_ptr_k = 0x92; <== not allowed!
    //v_k_ptr_k = &my_constant; <== not allowed!
    return 0;
}

Output:

(Erratum: in the printf above v_ptr_to_k was mentioned as v_ptr_k but the variable declared and defined on the code is v_ptr_to_k.)

[blog] Good reads for free #1

I just ran into this awesome book, FPGA-Based Prototyping Methodology Manual

«The FPGA-Based Prototyping Methodology Manual: Best practices in Design-for-Prototyping(FPMM) is a comprehensive and practical guide to using FPGAs as a platform for SoC development and verification. The manual is organized into chapters which are roughly in the same order as the tasks and decisions which are performed during an FPGA-based prototyping project. The manual can be read start-to-finish or, since the chapters are designed to stand alone, you can start reading at any point that is of current interest to you.

The main lesson of the FPGA-based prototyping methodology manual might be to tackle complexity one step at a time but make sure that each step is in the right direction and not into a dead end. When starting an FPGA-based prototyping project, our success will come from a combination of preparation and effort; the more we have of the former, the less we should need of the latter.” —- Doug Amos, Austin Lesea, René Richter»

You can download it for free, with a corporate email account:

[blog] NASA’s Ten Principles of Safety-Critical Code

These are the NASAs ten principles for safety-critical code. It is interesting to note that they were published later than MISRA-C that already covers it naturally.

  1. Restrict all code to very simple control flow constructs, do not use goto statements, setjmp or longjmp constructs, direct or indirect recursion.
  2. Give all loops a fixed upper bound. It must be trivially possible for a checking tool to prove statically that the loop cannot exceed a preset upper bound on the number of iterations. If a tool cannot prove the loop bound statically, the rule is considered violated.
  3. Do not use dynamic memory allocation after initialization.
  4. No function should be longer than what can be printed on a single sheet of paper in a standard format with one line per statement and one line per declaration. Typically, this means no more than about 60 lines of code per function.
  5. The code’s assertion density should average to minimally two assertions per function. Assertions must be used to check for anomalous conditions that should never happen in real-life executions. Assertions must be side effect-free and should be defined as Boolean tests. When an assertion fails, an explicit recovery action must be taken, such as returning an error condition to the caller of the function that executes the failing assertion. Any assertion for which a static checking tool can prove that it can never fail or never hold violates this rule.
  6. Declare all data objects at the smallest possible level of scope.
  7. Each calling function must check the return value of non-void functions, and each called function must check the validity of all parameters provided by the caller.
  8. The use of the preprocessor must be limited to the inclusion of header files and simple macro definitions. Token pasting, variable argument lists (ellipses), and recursive macro calls are not allowed. All macros must expand into complete syntactic units. The use of conditional compilation directives must be kept to a minimum.
  9. The use of pointers must be restricted. Specifically, no more than one level of dereferencing should be used. Pointer dereference operations may not be hidden in macro definitions or inside typedef declarations. Function pointers are not permitted.
  10. All code must be compiled, from the first day of development, with all compiler warnings enabled at the most pedantic setting available. All code must compile without warnings. All code must also be checked daily with at least one, but preferably more than one, strong static source code analyzer and should pass all analyses with zero warnings.

A first preemptive kernel on ARM Cortex-M3

1. Introduction

When it comes to operating systems for embedded software, they are generally thought to be an overkill for most solutions. However, between a completely “hosted” multiapplication system (which uses a multi thread and general purpose OS) and a completely “standalone/monolithic” or “bare-metal” application specific system, there are many variations we can go for.

In previous publications I explored the notion of cooperative tasks, from a loop calling tasks from an array of function pointers, to something a little more complex with the processes being handled in a circular buffer, with explicit time criteria. This time I will show a minimal preemptive implementation, to also pave the way for something a little more complex, as in other publications.

2 Preemptive versus Cooperative

In a fully cooperative system, the processor does not interrupt any task to accept another, and the task itself needs to release the processor for the next one to use it. There is nothing wrong with this. A Run-To-Completion scheme is sufficient and / or necessary for many applications, and many embedded systems were deployed this way, including some very complex ones. In the past, even non-embedded systems used a cooperative kernel (Windows 3.x, NetWare 4.x, among others). If a task crashes, the entire system is compromised when we speak in a strictly cooperative way: it keeps the processor from going further (so in a server operating system like NetWare, this does not seem to be a good idea, because multiple clients are a must!).

In preemptive mode, tasks are interrupted and later resumed– i.e., a context (set of states in the processor registers) is saved and then retrieved. This leads to more complexity to the implementation but, if well done, it increases the robustness and the possibility of meeting narrower timing requirements, mainly if used with a priority and/or periodicity criteria to manage the queue.

3 Call stack

A processor is, in fact, a programmable finite-state machine. With some simplification, each state of our program can be defined within the set of core register values . This set dictates which program point is active. Therefore, activating a task means pushing values ​​to the call stack so that this task will be processed. This set of values is called context. To resume the task afterwards, it is necessary to save the call stack data at that point in the program. This “frozen” data represents a program state and is therefore called a stack frame. For every saved stackframe there is a context related. To resume a task, a previously saved stack frame is loaded back into the call stack.

In the ARM Cortex-M3, the 32-bit registers that define the active state of the processor are: R0-R12 for general use and R13-R15 registers for special use, in addition to the Program Status Register (xPSR) – its value is on the top of any stackframe, and it is not actually a single physical register, but a composition of three (Application, Interrupt and Execution: APSR, IPSR e EPSR).

Esta imagem possuí um atributo alt vazio; O nome do arquivo é armcortexm3arch-1.png

3.1. Load-store architectures

A load-store architecture is a processor architecture in which data from memory needs to be loaded to the core registers before being processed. Also, the result of this processing before being stored in memory must be in a register.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é registersm3-1.png

The two basic memory access operations on Cortex-M3:

// reads the data contained in the address indicated by Rn + offset and places it in Rn. 
LDR Rd, [Rn, #offset]
// stores data contained in Rn at the address pointed by Rd + offset
STR Rn, [Rd, #offset]

It is important to understand at least the Cortex-M3 instructions shown below. I suggest sources [1] and [2] as good references, in addition to this or this link.

MOV Rd, Rn // Rd = Rn
MOV Rd, #M // Rd = M, the immediate being a 32-bit value (here represented by M)
ADD Rd, Rn, Rm // Rd = Rn + Rm
ADD Rd, Rn, #M // Rd = Rn + M
SUB Rd, Rn, Rm // Rd = Rn - Rm
SUB Rd, Rn, #M // Rd = Rn - M
// pseudo-instruction to save Rn in a memory location.
// After a PUSH, the value of the stack pointer is decreased by 4 bytes
PUSH {Rn}
// POP increases SP by 4 bytes after loading data into Rn.
// this increase-decrease is based on the current address the SP is pointing to
POP {Rn}
B label // jump to routine label
BX Rm // jump to routine specified indirectly by Rm
BL label // jump to label and moves the caller address to LR

CPSID I // enable interrupts
CPSIE I // disable interrupts

We will operate the M3 in Thumb mode , where the instructions are actually 16 bits. According to ARM , this is done to improve code density while maintaining the benefits of a 32-bit architecture. Bit 24 of the PSR is always 1.

3.2. Stacks and stack pointer (SP)

Stack is a memory usage model [1]. It works in the Last In – First Out format (last to enter, first to leave). It is as if I organized a pile of documents to read. It is convenient that the first document to be read is at the top of the stack, and the last at the end.

We usually divide the memory between heap and stack . As said, the “call stack” will contain that temporary data that determines the next state of the processor. The heap contains data which nature is not temporary in the course of the program (this does not mean “non-volatile”). The stack pointer is a kind of pivot that keeps control of the program flow, by pointing to some position of the stack.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é stackusage.png
Figure 3. Model for using a stack. When saving data before processing (transforming) it saves the previous information. (Figure from [1])
Esta imagem possuí um atributo alt vazio; O nome do arquivo é annotation-2020-02-06-001056.png
Figure 4. Regions of memory mapped on a Cortex M3. The region available for the stack is confined between the addresses 0x20008000 and 0 0x20007C00. [1]

4 Multitasking on the ARM Cortex M3

The M3 offers two stack pointers (Main Stack Pointer and Process Stack Pointer) to isolate user processes of the process kernel. Every interrupt service runs in kernel mode . It is not possible to go from user mode to kernel mode (actually called thread mode and privileged mode) without going through an interruption – but it is possible to go from privileged mode to user mode by changing the control register.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é so.png
Figure 5. Changing context on an OS that isolates the kernel application [1]

The core also has dedicated hardware for switching tasks. The SysTick interrupt service can be used to implement synchronous context switching. There are still other asynchronous software interruptions (traps) like PendSV and SVC. Thus, SysTick is used for synchronous tasks in the kernel, while SVC serves asynchronous interruptions, when the application makes a call to the system. The PendSV  is a software interruption which by default can only be triggered in privileged mode. It is usually suggested [1] to trigger it within SysTick service, because it is possible to keep track of the ticks to meet the time criteria. The interruption by SysTick is soon served, with no risk of losing any tick of the clock. A secure OS implementation would use both stack pointers to separate user and kernel threads and separate memory domains if an MPU (Memory Protection Unit) is available. At first, we will only use the MSP in privileged mode.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é 2sps.png
Figure 6. Memory layout on an OS with two stack pointers and protected memory [1]

5. Building the kernel

Kernel is a somewhat broad concept, but I believe that there is no OS which kernel is not responsible for scheduling tasks. In addition, there must be IPC (inter-process communication) mechanisms. It is interesting to note the strong hardware-dependency of the scheduler that will be shown, due to its low-level nature .

5.1. Stackframes and context switching

Remember: call stack = the registers of the core ; stack or stackframe = state (values) of these registers saved in memory.

When a SysTick is served, part of the call stack is saved by the hardware (R0, R1, R2, R3, R12, R14 (LR) and R15 (PC) and PSR). Let’s call this portion saved by the hardware stackframe. The remaining is the software stackframe [3], which we must explicitly save and retrieve with the PUSH and POP instructions .

To think about our system, we can outline a complete context switch depicting the key positions the stack pointer assumes during the operation (in the figure below the memory addresses increase, from bottom to top. When SP points to R4 it is aligned with an address lower than the PC on the stack)

Esta imagem possuí um atributo alt vazio; O nome do arquivo é ctxtswitch-1.png
Figure 7. Switching contexts. The active task is saved by the hardware and the kernel. The stackpointer is re-assigned, according to pre-established criteria, to the R4 of the next stackframe to be activated. The data is rescued. The task is performed. (Figure based on [3]) (“Salvo pelo hardware/kernel” translates to “Saved /pushed by hardware/kernel”; “Resgatado pelo hardware/kernel” translates to “retrieved/popped by hardware/kernel”)

When an interruption takes place, SP will be pointing to the top of the stack (SP (O)) to be saved. This is inevitable because this is how the M3 works. In an interruption the hardware will save the first 8 highest registers in the call stack at the 8 addresses below the stack pointer, stopping at (SP (1)). When we save the remaining registers, the SP will now be pointing to the R4 of the current stack (SP (2)). When we reassign the SP to the address that points to the R4 of the next stack (SP (3)), the POP throws the values of R4-R11 to the call stack and the stack pointer is now at (SP (4)). Finally, the return from the interrupt pops the remaining stackframe, and the SP (5) is at the top of the stack that has just been activated. (If you’re wondering where R13 is: it stores the value of the stack pointer)

The context switching routine is written in assembly and implements exactly what is described in Figure 7.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é systick-2.png
Figure 8. Context switcher

PS: When an interruption occurs, the LR takes on a special code. 0xFFFFFFF9 , if the interrupted thread was using MSP or 0xFFFFFFFD if the interrupted thread was using PSP.

5.1 Initializing the stacks for each task

For the above strategy to work, we need to initialize the stacks for each task accordingly. The sp starts by pointing to R4. This is by definition the starting stack pointer of a task, as it is the lowest address in a frame .

In addition, we need to create a data structure that correctly points to the stacks that will be activated for each SysTick service . We usually call this structure a TCB (thread control block). For the time being we do not use any selection criteria and therefore there are no control parameters other than next: when a task is interrupted, the next one in the queue will be resumed and executed.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é struct.png
Figure 9. Thread control block
Esta imagem possuí um atributo alt vazio; O nome do arquivo é stackinit-3.png
Figure 10. Initializing the stack (the values representing the registers, like 0x07070707 are for debugging purposes)

The kSetInitStack function initializes the stack for each “i” thread . The stack pointer in the associated TCB points to the data relating to R4. The stack data is initialized with the record number that must be loaded to facilitate debugging. The PSR only needs to be initialized with bit 24 as 1, which is the bit that identifies Thumb mode . The tasks have the signature void Task (void * args) .

To add a task to the stack, we initially need the address of the main function of the task. In addition, we will also pass an argument. The first argument is in R0. If more arguments are needed, other registers can be used, according to AAPCS (ARM Application Procedure Call Standard).

Esta imagem possuí um atributo alt vazio; O nome do arquivo é addthreads-1.png
Figure 11. Routine for adding tasks and their arguments to the initial stackframe

5.3. Kernel start-up

It is not enough to initialize the stacks and wait for the SysTick. The TCB structure sp will only hold a valid stack pointer value when the task is interrupted. We have two types of threads running: background and foreground threads. The background includes the kernel routines, including the context switcher. At each SysTick, it is the kernel’s turn to use the processor. In the foreground are the applications.

If the task has not already been executed, the stack pointer saved in sp will not be valid. So we need to make it looks like the task has been executed, interrupted and saved – to be reactivated later. I used the following strategy:

  1. An interruption is forced (PendSV). Initial hardware stackframe is saved.
  2. tcb [0].sp is loaded in SP
  3. The R4 – R11 of the core are loaded with the values ​​of the initialized stackframe.
  4. ISR returns, retrieves the hardware stack frame and the SP will be at the top of the stack. The PC is now loaded with the address of the first call to be made, and the program follows the flow.
Esta imagem possuí um atributo alt vazio; O nome do arquivo é pendsv.png
Figure 12. PendSV interrupt service to boot the kernel

In [2] a very smarter way of starting up the kernel is suggested:

Esta imagem possuí um atributo alt vazio; O nome do arquivo é kstart-1.png
Figure 13. Routine for booting the kernel

The interruption is dispensed and the call stack is loaded by activating the LR with the PC value of the stack . After finally taking SP to the top of the stack, BX LR executes the task and returns.

If we use the first method presented, kStart is simply:

// using CMSIS lib 
void kStart(void) { SCB->ICSR |= SCB_ICSR_PENDSVSET_Msk; }

6. Putting it all together

To illustrate, we will perform, in Round-Robin 3 tasks that switch the output of 3 different pins and increment three different counters. The time between a change of context and another will be 1000 cycles of the main clock. Note that these functions run within a “while (1) {}”. It is like we have several main programs running on the foreground . Each stack has 64 x 4-byte elements (256 bytes).

Esta imagem possuí um atributo alt vazio; O nome do arquivo é tasks-1.png
Figure 14. System Tasks

Below the main function of the system. The hardware is initialized. Tasks are added and the stacks are initialized. The RunPtr receives the address of the thread.  After setting the SysTick to trigger every 1000 clock cycles, boot up the kernel . After executing the first task and being interrupted, the system is switching between one task and another, with the context switcher running in the background .

Esta imagem possuí um atributo alt vazio; O nome do arquivo é main.png
Figure 15. Main program

6.1. Debug

You will need at least a simulator to implement the system more easily, as you will need to access the core registers and see the data moving in the stacks. If the system is working, each time the debugger is paused, the counters should have almost the same value.

In the photo below, I use an Arduino Due board with an Atmel SAM3X8E processor and an Atmel ICE debugger connected to the board’s JTAG. On the oscilloscope you can see the waveforms of the outputs switching for each of the 3 tasks.

Esta imagem possuí um atributo alt vazio; O nome do arquivo é 20200209_163936.jpg
Figure 16. Debug bench
Esta imagem possuí um atributo alt vazio; O nome do arquivo é ds1z_quickprint1-1.png
Figure 17. Tasks 1, 2, 3 on the oscilloscope.

7 Conclusions

The implementation of a preemptive kernel requires reasonable knowledge of the processor architecture to be used. Loading the call stack registers and saving them in a “handmade” way allows us to have greater control of the system at the expense of the complexity of handling the stacks.

The example presented here is a minimal example where each task is given the same amount of time to be performed. After that time, the task is interrupted and suspended – the data from the call stack is saved. This saved set is called a stackframe – a “photograph” of the point at the program was. The next task to be performed is loaded at the point it was interrupted and resumed. The code was written in order to explain the concepts.

In the next publication we will split the threads between user mode and privileged mode – using two stack pointers – a fundamental requirement for a more secure system.

References

The text of this post as well as the non-referenced figures are from the author.
[1] The definitive guide to ARM Cortex-M3, Joseph Yiu
[2] Real-Time Operating Systems for the ARM Cortex-M, Jonathan Valvano
[3] https://www.embedded.com/taking-advantage-of-the-cortex-m3s-pre-emptive-context-switches/