Debugging memory corruption

Let's take something that happened to me today as an example.

After a while, an assert triggers inside FreeRTOS (because, of course, you enabled the FreeRTOS asserts and hooked them to one of your breakpoints while developing!)
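For reference, here is a minimal sketch of how the asserts can be routed to a single function that you can break on. configASSERT is the macro FreeRTOS expects you to define in FreeRTOSConfig.h; the hook name debug_assert_failed, the bookkeeping variables and take_mutex_like are hypothetical names made up for this illustration, and the sketch only records the failure instead of halting so that it can also be tried on a PC.

```c
#include <stddef.h>

/* Hypothetical bookkeeping, only here so the sketch can be checked on a PC. */
static const char *assert_file = NULL;
static int assert_line = 0;
static int assert_count = 0;

/* Hypothetical hook (not a FreeRTOS function): every failed assert ends up here,
   so one breakpoint on this function catches all of them. */
void debug_assert_failed(const char *file, int line)
{
    assert_file = file;
    assert_line = line;
    assert_count++;
    /* On the target you would typically disable interrupts and spin here,
       so the debugger halts with the faulty call stack intact. */
}

/* FreeRTOS calls configASSERT(x) internally; the definition is up to you. */
#define configASSERT(x) \
    do { if (!(x)) debug_assert_failed(__FILE__, __LINE__); } while (0)

/* Made-up stand-in for the check that fired in xQueueSemaphoreTake. */
int take_mutex_like(unsigned item_size)
{
    configASSERT(item_size == 0);
    return item_size == 0;
}
```

With something like this in place, "break debug_assert_failed" in gdb stops on any failed assert, whichever FreeRTOS function raised it.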




Let's look at the assert:

#2  0x0800b5a4 in xQueueSemaphoreTake (xQueue=0x20001940 <ucHeap+4668>, xTicksToWait=<optimized out>) at /home/fx/Arduino_gd32/spotwelder_gd32/lnArduino/FreeRTOS/queue.c:1481
1481        configASSERT( pxQueue->uxItemSize == 0 );


So there is an inconsistency inside the FreeRTOS mutex internals: xQueueSemaphoreTake expects uxItemSize to be 0 for this queue, and it is not.

OK, let's look at pxQueue:

(gdb) p *pxQueue  
$14 = {pcHead = 0x20001940 <ucHeap+4668> "@\031", pcWriteTo = 0x20001940 <ucHeap+4668> "@\031", u = {xQueue = {pcTail = 0x20001940 <ucHeap+4668> "@\031", pcReadFrom = 0x20001940 <ucHeap+4668> "@\031"},  
   xSemaphore = {xMutexHolder = 0x20001940 <ucHeap+4668>, uxRecursiveCallCount = 536877376}}, xTasksWaitingToSend = {uxNumberOfItems = 0, pxIndex = 0x20001958 <ucHeap+4692>, xListEnd = {
     xItemValue = 18446744073709551615, pxNext = 0x20001958 <ucHeap+4692>, pxPrevious = 0x20001958 <ucHeap+4692>}}, xTasksWaitingToReceive = {uxNumberOfItems = 0, pxIndex = 0x20001970 <ucHeap+4716>,  
   xListEnd = {xItemValue = 18446744073709551615, pxNext = 0x20001970 <ucHeap+4716>, pxPrevious = 0x20001970 <ucHeap+4716>}}, uxMessagesWaiting = 0, uxLength = 1, uxItemSize = 2105344, cRxLock = -33 '\337',  
 cTxLock = -33 '\337'}

The end of the structure is full of garbage (uxItemSize = 2105344, cRxLock and cTxLock = -33); there is no need to understand what the values mean.
So we'll put a hardware watchpoint on a field that should not change too often.
uxNumberOfItems seems like a good candidate:

(gdb) p &xQueue->xTasksWaitingToSend.uxNumberOfItems     
$22 = (volatile UBaseType_t *) 0x20001950 <ucHeap+4684>
(gdb) watch *(int *)0x20001950  
Hardware watchpoint 5: *(int *)0x20001950

So now the program will stop every time the content of that address changes.
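The watched address can be cross-checked by hand: it is the base address of the queue plus the offset of the field. The struct below is a simplified model of the first fields shown in the gdb dump, assuming a 32-bit target where pointers and UBaseType_t are 4 bytes; QueueHeadModel and watch_address are hypothetical names, not FreeRTOS ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the start of the queue structure, in the order gdb printed it.
   Pointers are modelled as uint32_t to mimic the 32-bit target (assumption). */
typedef struct {
    uint32_t pcHead;
    uint32_t pcWriteTo;
    uint32_t u[2];            /* the union: pcTail/pcReadFrom or xMutexHolder/uxRecursiveCallCount */
    uint32_t uxNumberOfItems; /* first field of xTasksWaitingToSend */
} QueueHeadModel;

/* Address to watch = base address of the queue + offset of the field. */
uint32_t watch_address(uint32_t queue_base)
{
    return queue_base + (uint32_t)offsetof(QueueHeadModel, uxNumberOfItems);
}
```

Calling watch_address(0x20001940) gives the address to hand to the watch command; if it does not match what gdb prints for the field, the model of the layout is wrong for your build.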

/!\ It works because, in my case, the address in RAM stays the same from one run to the next as long as you don't change the code. Double-check that this is the case for you. /!\
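Why would the address be stable? The queue lives in ucHeap, and if the same allocations happen in the same order at every boot, they land at the same offsets. The sketch below is a deliberately simplified bump allocator, not the real FreeRTOS heap code; every name and size in it is made up for the illustration.

```c
#include <stddef.h>

#define MODEL_HEAP_SIZE 8192u
#define MODEL_ALIGN     8u

/* Offset of the next free byte in the modelled heap. */
static size_t model_next = 0;

void model_heap_reset(void)
{
    model_next = 0;
}

/* Returns the offset of the new block inside the heap, or (size_t)-1 if it does not fit. */
size_t model_alloc(size_t size)
{
    if (size > MODEL_HEAP_SIZE)
        return (size_t)-1;
    size_t rounded = (size + (MODEL_ALIGN - 1u)) & ~(size_t)(MODEL_ALIGN - 1u);
    if (rounded > MODEL_HEAP_SIZE - model_next)
        return (size_t)-1;
    size_t offset = model_next;
    model_next += rounded;
    return offset;
}

/* Made-up boot sequence: returns the offset of the object we want to watch.
   extra_early_alloc models a code change that adds one allocation before it. */
size_t boot_sequence(size_t extra_early_alloc)
{
    model_heap_reset();
    if (extra_early_alloc != 0)
        model_alloc(extra_early_alloc);
    model_alloc(512);       /* e.g. a task stack */
    model_alloc(100);       /* e.g. some driver state */
    return model_alloc(80); /* the "mutex" we want to watch */
}
```

Two calls to boot_sequence(0) return the same offset, while boot_sequence(16) returns a different one: any change in what gets allocated before the object moves it, and the watchpoint address has to be looked up again.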

Restart the program; it first stops at the very beginning, when the memory is filled with zeros.

Continue and just wait.

Hardware watchpoint 5: *(int *)0x20001950
Old value = 0
New value = 8
OLEDCore::myDrawChar (this=this@entry=0x20001998 <ucHeap+4756>, x=3, y=<optimized out>, c=<optimized out>, invert=<optimized out>)
   at /home/fx/Arduino_gd32/spotwelder_gd32/lnArduino/libraries/simplerSSD1306/ssd1306_ex_ll.cpp:177
177                 mask>>=1;

OK, so the screen driver is writing directly into FreeRTOS data, which is highly abnormal.
Remember that a hardware watchpoint triggers AFTER the faulty instruction has been executed, so look at what happens just before the reported line.


Side note:

If the optimizer mangles the code a bit too much, you can ask objdump to interleave the C/C++ source with the generated assembly, so it's easier to see what the code is trying to do.

/opt/gd32/toolchain2/bin/riscv32-unknown-elf-objdump  -S -d ./CMakeFiles/SSD1306.dir/lnArduino/libraries/simplerSSD1306/ssd1306_ex_ll.cpp.obj 


The key when dealing with this kind of corruption is to find a case that is reasonably repeatable, i.e. one that often happens at the same place, so it can be tracked down.
If it is completely random, it is much more painful.

Update:

After analysis, it is either a compiler bug or a race condition where some registers are not properly saved and restored.
Updating from gcc 10.2 to gcc 11.x fixed the issue, but it is still possibly a race.


Update 2:

Nope, it's a bug in the modified Nuclei FreeRTOS port. Using the QQ one fixes all problems.







