[sdiy] Hardware convolution box?
cheater00 cheater00
cheater00 at gmail.com
Wed Feb 15 08:06:02 CET 2017
Thanks for the feedback! Some pretty good points.
Regarding 11 bit depth for audio: couldn't you remodulate the audio in a
DSD-like fashion - higher sample rate, lower bit depth? I assume convolution
would still work. I guess that might be a way to harness the power of GPUs
without forcing them into, say, inefficiently emulated 64-bit mode.
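
To make "DSD-like" concrete, here is a minimal Python sketch of a
first-order 1-bit sigma-delta modulator, with plain averaging standing in
for a real decimation filter. The function names and the 64x oversampling
ratio are illustrative assumptions, and the sketch says nothing about
whether a GPU would process such a stream efficiently.

```python
def sigma_delta_1bit(samples, oversample=64):
    """First-order 1-bit sigma-delta modulator for inputs in [-1.0, 1.0].

    Each input sample is held for `oversample` output ticks (zero-order
    hold), so the output runs at a higher rate with 1 bit per tick.
    """
    integrator = 0.0
    feedback = 0.0  # previous output bit, fed back into the loop
    bits = []
    for x in samples:
        for _ in range(oversample):
            integrator += x - feedback
            feedback = 1.0 if integrator >= 0.0 else -1.0
            bits.append(feedback)
    return bits


def decimate_mean(bits, oversample=64):
    """Crude decimator: average each block of `oversample` bits."""
    return [
        sum(bits[i:i + oversample]) / oversample
        for i in range(0, len(bits), oversample)
    ]


if __name__ == "__main__":
    stream = sigma_delta_1bit([0.5, 0.5, -0.25])
    print(decimate_mean(stream))
```

A constant input level comes back out of the averaged 1-bit stream as
roughly the same level, which is the trade being proposed: fewer bits per
value, many more values per second.
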
Regarding int vs fp on TI TMS dsps: for stuff like compressors or mastering
grade equipment people really want 64 bit fp processing; plus it does have
its benefits. Even if your int accumulator is 80 bits, I imagine you can't
always prevent truncation to 32 bits and so some algorithms will not be
able to claim double precision from end to end. This is an issue. Also with
the prevalence of vst on x86 and x64, as well as now very popular gpgpu
work, I imagine fp based algorithms are going to be much easier to find, and
will have a lot more mindshare. So it is my guess that fp dsps are much
better. By the time any sort of community around this has lifted off,
today's chips will all be obsolete - am I wrong to think future chips will
more likely be fp than not?
TI dsps with fp start at $6.30 @ 1k, so budget for even the simplest stomp
box, effect, or synth module should not be an issue.
In any case the newer TIs still do integer dsp; I don't know if that's
emulated or native.
Power budget might be an issue. That is however an unusual requirement for
audio.
Great comment re bookkeeping etc on cpu vs dsp. Yes, direct comparisons of
GFLOPS or GMACS are very unfair to dsp. Especially if you have an
inexpensive separate dedicated arm cpu for the dsp board.
On Wed, 15 Feb 2017 07:28 , <rsdio at audiobanshee.com> wrote:
>
> On Feb 14, 2017, at 3:39 PM, cheater00 cheater00 <cheater00 at gmail.com>
> wrote:
> > If you started with one of the lower end TI dsps, say C55x or C674x,
> > would the asm code carry over to the C66x? It is my feeling the answer
> > would be a sound "no". If that is the case, maybe it would be best to
> > start with lower end C66x? The chips start at roughly $40 which really
> > isn't that much and provide 20 GMACS which is plenty already. This
> > family goes up to 358 GMACS.
>
> When I was getting into the open source library for the TMS320VC5506, I
> learned all of the opcodes for that revision of the family. I noticed that
> many of the open source assembly routines did not use the most efficient
> opcode in some parts of the code, and my assumption is that these were
> written for older chip instruction sets and never updated because they
> still run on the newer chips. In some cases, there were comments or
> conditionals to segregate different versions of the TMS320, but they
> largely run the same instructions. Each generation adds a lot, though.
>
> I might assume that C66x will execute C55x assembly, but I believe that
> C66x is floating point and C55x is fixed point as the major distinction
> between the families, so the algorithms would be quite different. It's
> possible that C66x has a very different set of opcodes, but I haven't
> written for C6600 yet.
>
> Note that the fixed point DSP instruction sets have things like Saturate
> to prevent calculations from overflowing or underflowing the available
> bits. There are also flags for many opcodes that will shift data by one or
> two bits to pre-scale or post-scale values, optionally, which is very
> useful for fixed point. I think that a lot of these opcodes would be
> unnecessary with floating point, since floating point is automatically
> shifting the significand bits and adjusting the exponent to compensate with
> every operation.
>
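
A rough Python model of the two fixed-point helpers described above,
saturation and post-scaling by a shift, using the common Q15 format
(16-bit signed, 15 fractional bits). This only imitates the arithmetic; the
function names are illustrative and do not correspond to any TMS320
opcode.

```python
Q15_MAX = 32767    # largest Q15 value, just under +1.0
Q15_MIN = -32768   # smallest Q15 value, exactly -1.0


def saturate_q15(value):
    """Clamp to the Q15 range instead of letting the value wrap around."""
    return max(Q15_MIN, min(Q15_MAX, value))


def wrap_q15(value):
    """What a plain 16-bit register does on overflow: wrap around."""
    return ((value + 32768) & 0xFFFF) - 32768


def add_q15(a, b):
    """Saturating add of two Q15 values."""
    return saturate_q15(a + b)


def mul_q15(a, b):
    """Q15 multiply: the full product is Q30, so shift right by 15
    (the post-scale step) and saturate the result."""
    return saturate_q15((a * b) >> 15)


if __name__ == "__main__":
    print(wrap_q15(30000 + 10000))   # wraps to a large negative value
    print(add_q15(30000, 10000))     # clamps at Q15_MAX instead
    print(mul_q15(16384, 16384))     # 0.5 * 0.5 in Q15
```

Without saturation an overflowing sum flips sign, which in audio is a loud
click; with it the signal merely clips.
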
> Your best bet is to choose between C55x and C66x based on benchmarks
> provided by Texas Instruments. If only one of them has the minimum
> performance needed, then your choice is made. If both families fit the
> bill, then you might consider other things. As I mentioned, USB power (at
> 2.5 W) was a requirement for my design, and it seems that only the C55x (of
> the modern families) can handle this. The older C54x is also low power, but
> it is less capable in performance (and probably NRND).
>
> I might recommend floating point over fixed point for a beginner, but for
> convolution you can fairly easily predict the maximum Accumulator value
> based on the length of the impulse response. It's probably easy enough to
> just code for the length and then fixed point is almost as easy as float.
> Cost might also be a consideration if you're planning on manufacturing this
> effect. Maybe the C66x is just too expensive for a small production run - I
> haven't done the analysis.
>
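
The accumulator estimate mentioned above can be sketched in a few lines of
Python. This is a conservative worst-case bound under the assumption of
signed samples and coefficients; the function names are illustrative, and
the tighter bound needs the actual impulse response.

```python
def guard_bits(taps):
    """Extra bits needed so that `taps` worst-case products can be summed
    without overflow: ceil(log2(taps)), computed with integers only."""
    if taps < 1:
        raise ValueError("taps must be at least 1")
    return (taps - 1).bit_length()


def worst_case_accumulator_bits(sample_bits, coeff_bits, taps):
    """Conservative accumulator width: a signed product of sample_bits and
    coeff_bits fits in sample_bits + coeff_bits, plus the guard bits."""
    return sample_bits + coeff_bits + guard_bits(taps)


def peak_accumulator(impulse_response, sample_peak):
    """Tighter bound from the real coefficients:
    |y[n]| <= sample_peak * sum(|h[k]|)."""
    return sample_peak * sum(abs(c) for c in impulse_response)


if __name__ == "__main__":
    # One second of impulse response at 48 kHz, 16-bit samples and taps.
    print(worst_case_accumulator_bits(16, 16, 48000))
    print(peak_accumulator([1, -2, 3], 32767))
```

For a real impulse response the second function usually gives a much
smaller number than the first, since most taps are far below full scale.
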
>
> > C55x tops out at 0.6 GMACS and C674x at
> > 3.6 GMACS and neither are enough of an increase to challenge arm
> > boards which many people already have, most likely.
>
> The problem with comparing GMACS is that a DSP effect involves more
> operations than just the Multiply and Accumulate. As I mentioned, there's
> pointer math, loop counting, multiple simultaneous bus transactions, and
> bit-reversed addressing. When you consider all of the other things that
> have to happen in an FFT, the optimizations of DSP chips mean that you're
> comparing apples to oranges when looking at raw GMAC benchmarks. Granted,
> pure time domain convolution might be simple enough that they'd be close to
> the same number of cycles, but as soon as you use FFT for the longer
> impulse responses you're going to need far fewer cycles in the DSP
> instruction set.
>
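
As one small example of that extra bookkeeping, here is the index math for
bit-reversed ordering, which a radix-2 FFT needs on its input or output.
A DSP with bit-reversed addressing gets this from its address generator;
the Python sketch below (illustrative function names) shows what has to be
computed explicitly otherwise.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Bit-reversed index order for an n-point radix-2 FFT
    (n must be a power of two)."""
    bits = n.bit_length() - 1
    if n < 1 or n != 1 << bits:
        raise ValueError("n must be a power of two")
    return [bit_reverse(i, bits) for i in range(n)]


if __name__ == "__main__":
    print(bit_reversed_order(8))
```
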
> On that note, I've started developing with the Cortex-M4, but I keep
> reading that the CMSIS FFT implementation is only optimized for Cortex-M3.
> So, even though the M4 is potentially way faster, nobody has taken advantage
> of it for the available FFT library. Does anyone know whether the FFT
> libraries have been updated to be optimized for M4? Now that we're seeing
> M7 and M33 chips, are we really able to take advantage of the new chips
> with FFT?
>
> Texas Instruments has tables showing how many clock cycles each type of
> FFT takes for each DSP. If you could get that kind of benchmark for the
> ARM, you'd be able to compare oranges to oranges.
>
>
> > I have had a longer chat about the capabilities of the Raspberry Pi
> > Broadcom QPU. It can be hacked to do realtime dsp, but, well, it's a
> > hack. It turns out they are 32-bit fp units and the APU (that contains
> > the QPUs) is already running an RTOS of some sort. I will include the
> > log in a separate email with a different subject.
>
> I scanned as much of that email as I could, and it seems that the QPU is
> basically a SIMD instruction set. We've had SIMD for ages. PowerPC has
> AltiVec, x86 has SSE3, ARM has Neon. SIMD can certainly be powerful, but
> often the challenge becomes streaming enough new data in and out to keep up
> with the fast processing. PowerPC had ways to prime a special streaming
> engine to keep the AltiVec engine fed with high-throughput data. I think
> that if you try to do convolution with QPU, then you'll have one very fast
> cog in a machine where all the other parts are really slow and inefficient.
> It might work out nice if you were processing quadraphonic audio, or 14.2
> surround, but a simple stereo stream might not be able to achieve full
> utilization of the maximum QPU processing power.
>
> Note that the half float format of the graphics accelerator can only
> handle 11 bits of audio data. While I don't think that you really need 24
> bit CODECs in your convolution effect, you certainly won't be able to get
> by with less than 12. The DSP chips handle 16 or 24 bit audio streams with
> 56 or more bits of precision, sometimes as much as 80 bits. You can't touch
> that with a graphics accelerator because graphics are tuned for only 8 bits
> or 10 bits per signal (red, green, and blue are handled separately). The
> way the eye works in adjusting to different overall brightness levels, you
> don't really need more than 11 bits, so I'm not criticizing the design of
> the graphics accelerator instruction sets - just pointing out how they're
> designed for a different type of data.
>
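
The 11-bit figure can be checked from Python, whose struct module supports
the IEEE 754 half-precision format via the 'e' format code. This is only a
sketch of the precision limit; the function names are illustrative, and it
does not model any particular graphics accelerator.

```python
import struct


def through_half(value):
    """Round-trip a number through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def first_unrepresentable_integer(limit=65504):
    """Smallest non-negative integer that half precision cannot store
    exactly (65504 is the largest finite half-precision value)."""
    for n in range(limit + 1):
        if through_half(float(n)) != n:
            return n
    return None


if __name__ == "__main__":
    print(first_unrepresentable_integer())
    print(through_half(2048.0), through_half(2049.0))
```

Every integer up to 2048 (2 to the 11th) survives the round trip; past that
point neighbouring sample values start collapsing onto each other.
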
> Note also that the TMS320 can also do lots of parallel processing, without
> the same limitations as SIMD. For example, the C55x can execute 1 or 2
> completely different instructions in the same cycle, so long as they're not
> both using the same registers or units. SIMD, as you learned, must always
> execute the same instruction on multiple data values. I seem to recall that
> IIR effects are not able to take advantage of SIMD as much as other types
> of processing. I've done some AltiVec programming in AudioUnits plug-ins,
> but prefer DSP for embedded audio.
>
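
One common explanation for the IIR remark, offered here as an assumption
rather than something stated in the thread, is the feedback dependency:
each IIR output needs the previous output, while FIR outputs depend only
on inputs. A minimal Python sketch with illustrative function names:

```python
def fir(x, h):
    """Direct-form FIR: every output sample depends only on input
    samples, so the outputs could be computed in any order or in
    parallel."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def one_pole_iir(x, a):
    """One-pole IIR, y[n] = x[n] + a * y[n-1]: each output needs the
    previous output, so the loop is inherently serial."""
    y = []
    previous = 0.0
    for sample in x:
        previous = sample + a * previous
        y.append(previous)
    return y


if __name__ == "__main__":
    print(fir([1, 2, 3], [1, 1]))
    print(one_pole_iir([1.0, 0.0, 0.0], 0.5))
```
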
> Brian
>
>