[sdiy] Using Cortex Mx Arms in Synth DIY

Byron G. Jacquot thescum at surfree.com
Wed Dec 28 03:43:54 CET 2011


I've spent the last 5 or so years doing ARM7/Cortex work, so I'd like to throw my 2 cents in.

>> Can anyone comment on the code output efficiency of the ARM GCC?  I've
>> used the IAR EWARM and it produces high quality and highly optimized
>> code.  Does ARM GCC come close to it (or better it)?  I haven't done a
>> side by side comparison.
>
>I can only speak about the code generated for the 32 bit instruction
>set, so it does not apply to the Cortex-M line directly, but I doubt the
>differences between 32 bit and thumb mode are that large.

As far as absolute speed goes, I've found a few other things that make a pretty large difference in performance, at a coarser grain than instruction-level optimization.

I found that judiciously using thumb-mode can streamline parts of a program.  We did it on a subsystem-by-subsystem basis.  Thumb is not so good for array processing, but it's great for the boilerplate that gets you in and out of the array handling sections.  There's some overhead for mode switching on function calls on the ARM7; I believe it's gone on the Cortex.
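
For illustration, here is a rough sketch of choosing the instruction set per function rather than per file. It assumes GCC 6 or later, which accepts target("arm") and target("thumb") function attributes on ARM targets; it is not necessarily how the split described above was done. The macro and function names are made up for the example, and on other compilers or Thumb-only parts the macros expand to nothing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macros: they only expand to the GCC attributes when
   building with GCC 6+ for an ARM target that has both instruction sets. */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 6) && \
    defined(__arm__) && defined(__ARM_ARCH_ISA_ARM) && \
    defined(__ARM_ARCH_ISA_THUMB)
#define ARM_MODE   __attribute__((target("arm")))
#define THUMB_MODE __attribute__((target("thumb")))
#else
#define ARM_MODE
#define THUMB_MODE
#endif

/* Array-heavy inner loop: built as ARM code where available.
   Adds src into dst and returns the largest resulting value (or 0). */
ARM_MODE int32_t mix_buffers(int32_t *dst, const int32_t *src, size_t n)
{
    int32_t peak = 0;
    for (size_t i = 0; i < n; i++) {
        dst[i] += src[i];
        if (dst[i] > peak)
            peak = dst[i];
    }
    return peak;
}

/* Boilerplate around the loop: built as Thumb code where available. */
THUMB_MODE int32_t process_block(int32_t *dst, const int32_t *src, size_t n)
{
    if (dst == NULL || src == NULL || n == 0)
        return 0;
    return mix_buffers(dst, src, n);
}
```

With older toolchains the same split is usually made per source file, by compiling some files with -mthumb and others with -marm.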

Understanding the memory controller hardware and layout was probably the most beneficial thing we got into.  On the NXP ARM7s, the internal flash is on a 128-bit-wide bus...it'll fetch a whole row from that memory, then carve the row into instructions as needed...4 ARM-mode, or 8 thumb-mode instructions, with no wait states.  This was the single biggest speedup I ever found.  We retargeted critical pieces of our app to take advantage of that...specifically the math library and the alpha-blending for the LCD controller.
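
The row arithmetic can be sketched in a few lines of C. This is a back-of-the-envelope model only: it assumes straight-line code that starts on a row boundary, and the function name is made up for the example.

```c
#include <stdint.h>

#define FLASH_ROW_BITS 128u  /* row width quoted above */

/* Number of flash row fetches needed for a run of straight-line
   instructions, assuming the run starts on a row boundary and
   contains no branches. */
uint32_t rows_needed(uint32_t n_insns, uint32_t insn_bits)
{
    uint32_t total_bits = n_insns * insn_bits;
    return (total_bits + FLASH_ROW_BITS - 1u) / FLASH_ROW_BITS;
}
```

For example, rows_needed(8, 16) is 1 for eight thumb-mode instructions, while rows_needed(8, 32) is 2 for eight ARM-mode instructions.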

External memories come with some caveats as well...you'll leave more port pins free with a 16-bit data bus, but it means ARM-mode instructions & 32-bit data take 2 fetches.  Plus wait-states, if needed.  Another argument for using thumb if possible.
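
To put rough numbers on that, here is a simplified cost model in C. It assumes every bus access costs one cycle plus a fixed number of wait states, which real memory controllers only approximate; the function name is made up for the example.

```c
#include <stdint.h>

/* Simplified model: an item wider than the bus needs several accesses,
   and each access costs 1 cycle plus the configured wait states. */
uint32_t bus_cycles(uint32_t item_bits, uint32_t bus_bits, uint32_t wait_states)
{
    uint32_t accesses = (item_bits + bus_bits - 1u) / bus_bits;
    return accesses * (1u + wait_states);
}
```

Under this model, on a 16-bit bus with one wait state an ARM-mode instruction costs bus_cycles(32, 16, 1) = 4 cycles to fetch, and a thumb-mode instruction costs bus_cycles(16, 16, 1) = 2.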

If you're on a processor that uses DMA, consider carefully what it means, and how multiple channels will interact.  On the NXP ARM7s, the LCD controller for a QVGA refresh was effectively stealing something like 1/5th of the available memory cycles.  Adding SD-card and USB channels made it even worse.  Once again, the Cortexes have a crossbar arrangement that should help alleviate the bottleneck.
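
A quick way to size up a DMA channel is to work out the raw traffic it generates. The sketch below does that for a display refresh; the 16 bits per pixel and 60 Hz used in the example are assumptions, not figures from the hardware described above, and the function name is made up.

```c
#include <stdint.h>

/* Bytes per second the LCD controller has to pull from the frame
   buffer just to keep the panel refreshed. */
uint32_t lcd_refresh_bytes_per_sec(uint32_t width, uint32_t height,
                                   uint32_t bytes_per_pixel,
                                   uint32_t refresh_hz)
{
    return width * height * bytes_per_pixel * refresh_hz;
}
```

For a QVGA panel with those assumed settings, lcd_refresh_bytes_per_sec(320, 240, 2, 60) comes to 9216000 bytes per second, before any SD-card or USB traffic is added on top.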

I've seen compilers emit less-than-wonderful code for data that was smaller than 32 bits...like for a single byte: a fetch operation, then a shift left to truncate off the 24 "dirty bits," then a shift right (or loading a mask and AND-ing), finally leaving 8 bits with leading 0's.  It can be a little counter-intuitive, but it's a case of fewer bits not necessarily taking less work to handle than more bits.  I tend to stick with the 32-bit types, unless I have a compelling reason to use fewer.
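
A small C sketch of the difference, with made-up function names. Both versions return the same 8-bit checksum; whether the narrow one actually costs extra instructions depends on the compiler and optimization level, so treat it as an illustration rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Narrow locals: the compiler may have to re-truncate 'sum' and 'i'
   to 8 bits after each operation, since registers are 32 bits wide. */
uint8_t checksum_narrow(const uint8_t *buf, uint8_t len)
{
    uint8_t sum = 0;
    for (uint8_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Wide locals: work in full registers, truncate once at the end. */
uint8_t checksum_wide(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return (uint8_t)sum;
}
```

If a type has to stay portable across word sizes, uint_fast8_t from <stdint.h> asks for "at least 8 bits, whatever is fastest" and lets the toolchain pick the width.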

And speaking of compilers:

Most of the development work I was doing was using ifDev from iSystem...their little chiclet board, I think it sold for $79, with the IDE.  I was developing for the eCOS RTOS, and the iSystem stuff worked well for it.  We did swap to the GCC compiler that comes with the latest eCOS revision - it's got some stuff specific to eCOS.  Some of the other GCC compilers we tried (GNUARM being the one I remember) were set up for other environments, and didn't want to work with the eCOS libraries.

I spent the last year using the "top-grade" Keil tools and RTX RTOS.  I found the IDE to be slower than iSystem (slower as in watching windows draw on my screen), and we found notable bugs in the debugger.

Byron Jacquot


