[sdiy] was how_you_got_started...

Olivier Gillet ol.gillet at gmail.com
Wed Sep 30 00:05:49 CEST 2015


> At the lowest level, the main overhead of C++ is the virtual
> function table and the code to navigate that table, all of
> which is constructed by the compiler and linker. Otherwise,
> C++ code should perform about the same as C code.

Note that this extra code is generated only if you use dynamic
polymorphism (virtual functions).
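
A made-up fragment, just to illustrate the difference: a class with no
virtual functions carries no vtable pointer, and its methods are
resolved - and usually inlined - at compile time:

// No 'virtual' anywhere: sizeof(Svf) is just its data members, there
// is no vtable, and Process() can be inlined into the caller.
class Svf {
 public:
  void set_frequency(float f) { frequency_ = f; }
  float Process(float in) { return in * frequency_; }  // Placeholder DSP.
 private:
  float frequency_;
};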

Other C++ features that can cause code bloat:
* Exceptions (larger function prologues/epilogues + extra RAM for
bookkeeping)
* RTTI (at least one extra pointer stored in each object + some tables
in RAM for each type)
http://www.hexblog.com/wp-content/uploads/2012/06/Recon-2012-Skochinsky-Compiler-Internals.pdf

None of these are really necessary - and there are quite a few large
C++ projects that do not use these features (especially the last
two).
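
For what it's worth, on a GCC or Clang based toolchain both can be
switched off explicitly - typical flags, to be adapted to your build
system:

CXXFLAGS += -fno-exceptions -fno-rtti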

Some of my favorite C++ features have no overhead at all, entirely
take place during compilation, and have nothing to do with objects:
* Data access policies (const, references)
* Evaluation of expressions during compilation
* Syntactic sugar (auto, range-based for loops, default function arguments)
* Functors (and now lambda expressions) leading to very efficient
code for filtering/traversing data structures (e.g. filtering a list
of MIDI events sorted by time in a sequencer).
* Cleanup through RAII ("set the CS line high or finish the
transaction with this peripheral regardless of the exit point of this
function" - see the sketch after this list).

That's actually the point of the language and what makes it unique -
that all the towers of abstraction it provides can be made to collapse
during compilation. The best example is templates. Yes, they are
controversial - they can bloat code size, but they are my favourite
C++ feature for expressing computations in a way that the compiler can
see through.

Let's say I want to build a granular sample player. It can work in
mono or stereo, with 8-bit or 16-bit samples, and, depending on some
settings, with or without a complex envelope on the grains. If I
write the code naively in C or assembly, the code will be littered
with if statements checking for these conditions inside the audio
rendering loop. These tests hamper performance (time spent evaluating
them, disrupted prefetching, wasted registers). So in
such situations, you really want to create as many rendering loops as
there are combinations of parameters - each of these rendering loops
being tight and stripped of any test. While in C or assembly you'd
have to duplicate the loops (or maybe use some ugly pre-processor
macro tricks), you can do this in C++ with templates. The grain
rendering loop from my module Clouds looks like this:
https://github.com/pichenettes/eurorack/blob/master/clouds/dsp/grain.h#L120
. And this actually gets compiled into 2 x 3 x 2 variants for all
combinations of #channels, buffer resolution, and grain quality - each
of them very deeply inlined and almost branch-free. Of course this
generates 25k of highly redundant code - but factoring all the
configuration tests out of the loop gave a 20-40% boost in
performance.
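
The idiom itself fits in a few lines. This is an illustrative sketch,
not the actual Clouds code: the configuration travels as template
parameters, so every test inside the loop compares compile-time
constants and disappears when the template is instantiated.

#include <cstddef>
#include <cstdint>

template<int num_channels, typename Sample, bool use_envelope>
void RenderGrain(const Sample* source, float* out, size_t size,
                 float phase, float phase_increment,
                 float envelope_increment) {
  float envelope = 0.0f;
  while (size--) {
    float gain = 1.0f;
    if (use_envelope) {  // Compile-time constant: half of the variants
      envelope += envelope_increment;  // lose this branch entirely.
      gain = envelope;
    }
    size_t i = static_cast<size_t>(phase);
    *out++ += static_cast<float>(source[i * num_channels]) * gain;
    if (num_channels == 2) {  // Also resolved at compile time.
      *out++ += static_cast<float>(source[i * num_channels + 1]) * gain;
    }
    phase += phase_increment;
  }
}

// The variant is picked at the call site, e.g.:
//   RenderGrain<2, int16_t, true>(buffer, output, n, 0.0f, ratio, slope);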

I used a similar technique to write reusable oscillator code that is
configured at compile time (sketched from this:
https://github.com/pichenettes/polyblep/blob/master/dsp/oscillator.h#L193)
- you can configure whether parameters like frequency or waveshape are
provided at audio rate, linearly interpolated from control rate, or
set once per block at control rate. When developing a module, you can
quickly check which variant fits within your computational budget -
you get generic code that is "flattened" for each project you use it
in. Or you can include two variants, one of them suited to a higher
quality, lower polyphony mode.
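
A stripped-down sketch of that idea - the names are illustrative, not
the actual polyblep code: how the frequency parameter is fetched is
chosen by a template parameter, so the cost of the unused modes is
never paid at run time.

#include <cstddef>

enum ParameterMode {
  PARAMETER_CONSTANT,      // Read once per block.
  PARAMETER_INTERPOLATED,  // Ramped from the previous block's value.
  PARAMETER_AUDIO_RATE     // Read from a buffer at every sample.
};

template<ParameterMode frequency_mode>
class SawOscillator {
 public:
  void Render(const float* frequency, float* out, size_t size) {
    float f = frequency_;
    float f_increment = 0.0f;
    if (frequency_mode == PARAMETER_CONSTANT) {
      f = frequency[0];
    } else if (frequency_mode == PARAMETER_INTERPOLATED) {
      f_increment = (frequency[0] - frequency_) / static_cast<float>(size);
    }
    for (size_t i = 0; i < size; ++i) {
      if (frequency_mode == PARAMETER_AUDIO_RATE) {
        f = frequency[i];
      } else if (frequency_mode == PARAMETER_INTERPOLATED) {
        f += f_increment;
      }
      phase_ += f;  // f is the normalized frequency (cycles per sample).
      if (phase_ >= 1.0f) phase_ -= 1.0f;
      out[i] = 2.0f * phase_ - 1.0f;  // Naive saw, no polyblep correction.
    }
    frequency_ = f;
  }
 private:
  float phase_ = 0.0f;
  float frequency_ = 0.0f;
};

Whether "frequency" points at a whole audio-rate buffer or at a single
control-rate value is then part of the instantiation, and switching
modes is a one-line change at the call site.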

Another nice application is FFTs. Chip manufacturers give you these
nice little FFT libraries in which the FFT size is an argument. But do
you often write programs in which the FFT size is decided only at run
time? If the FFT size is known at compile time, you can write your FFT
code as a template
(http://www.drdobbs.com/cpp/a-simple-and-efficient-fft-implementatio/199500857),
and all the computations (including the computation of the roots of
unity) can be done at compile time and put right there in the code,
giving a gigantic branch-free blob of computation that the compiler is
free to unroll and reorganize. Obviously, the code size will suffer,
but it can be worth it in terms of run time. I have
an upcoming project in which I use a similar approach for the
compile-time generation and "unrolling" of sample rate conversion code
(polyphase FIR). Once all the computations were unrolled, the compiler
found an efficient way of working through several adjacent samples at
once, using batched loads and stores, that I wouldn't have figured out myself.
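
The polyphase FIR case is easier to sketch in a few lines than a full
FFT, and the same compile-time recursion trick applies (illustrative
only, not the code from that article or from my project):

#include <cstddef>

// Fir<N>::Compute() is a dot product whose length is fixed at compile
// time: the recursion is flattened into N multiply-accumulates, with
// no loop, no counter and no branches.
template<size_t N>
struct Fir {
  static float Compute(const float* coefficients, const float* x) {
    return coefficients[0] * x[0] +
           Fir<N - 1>::Compute(coefficients + 1, x + 1);
  }
};

template<>
struct Fir<0> {
  static float Compute(const float*, const float*) { return 0.0f; }
};

// One such instantiation per polyphase branch, e.g.:
//   float y = Fir<8>::Compute(phase_coefficients, input_samples);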

Some of these tricks might be doable in hand-written assembly with a
lot of patience (or with Python/Perl scripts generating the assembly
code) - but to me it's much more convenient to do everything in a
single language... that I can compile, test, debug and instrument to
my heart's content on my workstation rather than on the target device.
It's really worth the price of having to occasionally peek at the
generated assembly code to make sure that the compiler gets a good
grasp of the flow of the computations the code is meant to perform.

I really think this idiom of compile-time code
unrolling/specialization is worth knowing for signal processing
applications - embedded or not!

Olivier


