<p dir="ltr">Thanks for the feedback! Some pretty good points</p>
<p dir="ltr">Regarding 11 bit depth for audio: couldn't you remodulate audio in a dsd like fashion - higher freq, lower bit rate? I assume convolution would still work. I guess that might be a way to harness the power of GPUs without forcing them into, say, inefficiently emulated 64-bit mode.</p>
<p dir="ltr">Regarding int vs fp on TI TMS dsps: for stuff like compressors or madtering grade equipment people really want 64 bit fp processing; plus it does have its benefits. Even if your int accumulator is 80 bits, i imagine you can't always prevent truncation to 32 bits and so some algorithms will not be able to tout back to back double precision. This is an issue. Also with the prevalence of vst on x86 and x64, as well as now very popular gpgpu work, i imagine fp based algorithms are going to be much easier to find, and will have a lot more mindshare. So it is my guess that fp dsps are much better. By the time any sort of community around this has lifted off, today's chips will all be obsolete - am i wrong to think future chips will more likely be fp than not?</p>
<p dir="ltr">TI dsps with fp start at $6.30 @ 1k, so budget for even the simplest stomp box, effect, or synth module should not be an issue.</p>
<p dir="ltr">In any case the newer TIs still do integer dsp; I don't know if that's emulated or native.</p>
<p dir="ltr">Power budget might be an issue. That is however an unusual requirement for audio.</p>
<p dir="ltr">Great comment re book keeping etc on cpu vs dsp. Yes, direct comparisons of GFLOPs or GMACS are very unfair to dsp. Especially if you have an inexpensive separate dedicated arm cpu for the dsp board.</p>
<br><div class="gmail_quote"><div dir="ltr">On Wed, 15 Feb 2017 07:28 , <<a href="mailto:rsdio@audiobanshee.com">rsdio@audiobanshee.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br class="gmail_msg">
On Feb 14, 2017, at 3:39 PM, cheater00 cheater00 <<a href="mailto:cheater00@gmail.com" class="gmail_msg" target="_blank">cheater00@gmail.com</a>> wrote:<br class="gmail_msg">
> If you started with one of the lower end TI dsps, say C55x or C674x,<br class="gmail_msg">
> would the asm code carry over to the C66x? It is my feeling the answer<br class="gmail_msg">
> would be a sound "no". If that is the case, maybe it would be best to<br class="gmail_msg">
> start with lower end C66x? The chips start at roughly $40 which really<br class="gmail_msg">
> isn't that much and provide 20 GMACS which is plenty already. This<br class="gmail_msg">
> family goes up to 358 GMACS.<br class="gmail_msg">
<br class="gmail_msg">
When I was getting into the open source library for the TMS320VC5506, I learned all of the opcodes for that revision of the family. I noticed that many of the open source assembly routines did not use the most efficient opcode in some parts of the code, and my assumption is that these were written for older chip instruction sets and never updated because they still run on the newer chips. In some cases, there were comments or conditionals to segregate different versions of the TMS320, but they largely run the same instructions. Each generation adds a lot, though.<br class="gmail_msg">
<br class="gmail_msg">
I might assume that C66x will execute C55x assembly, but I believe that C66x is floating point and C55x is fixed point as the major distinction between the families, so the algorithms would be quite different. It's possible that C66x has a very different set of opcodes, but I haven't written for C6600 yet.<br class="gmail_msg">
<br class="gmail_msg">
Note that the fixed point DSP instruction sets have things like Saturate to prevent calculations from overflowing or underflowing the available bits. There are also flags for many opcodes that will shift data by one or two bits to pre-scale or post-scale values, optionally, which is very useful for fixed point. I think that a lot of these opcodes would be unnecessary with floating point, since floating point is automatically shifting the significand bits and adjusting the exponent to compensate with every operation.<br class="gmail_msg">
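As a rough illustration of what the Saturate behavior buys you, here is a plain-C sketch of a saturating multiply-accumulate - the kind of operation these opcodes perform in hardware in a single cycle. The function name is made up, not a TI intrinsic:

```c
#include <stdint.h>

/* Saturating 32-bit MAC: acc += a * b, clamped to the int32 range
 * instead of wrapping around on overflow. Fixed-point DSPs do this
 * in hardware; in plain C we must widen and check explicitly. */
static int32_t sat_mac(int32_t acc, int16_t a, int16_t b)
{
    int64_t wide = (int64_t)acc + (int64_t)a * (int64_t)b;
    if (wide > INT32_MAX) return INT32_MAX;  /* clip high */
    if (wide < INT32_MIN) return INT32_MIN;  /* clip low */
    return (int32_t)wide;
}
```

Without the clamp, the same overflow would wrap a loud positive peak to a large negative value - an audible click - which is why saturation matters so much for fixed-point audio.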
<br class="gmail_msg">
Your best bet is to choose between C55x and C66x based on benchmarks provided by Texas Instruments. If only one of them has the minimum performance needed, then your choice is made. If both families fit the bill, then you might consider other things. As I mentioned, USB power (at 2.5 W) was a requirement for my design, and it seems that only the C55x (of the modern families) can handle this. The older C54x is also low power, but it is less capable in performance (and probably NRND).<br class="gmail_msg">
<br class="gmail_msg">
I might recommend floating point over fixed point for a beginner, but for convolution you can fairly easily predict the maximum Accumulator value based on the length of the impulse response. It's probably easy enough to just code for the length and then fixed point is almost as easy as float. Cost might also be a consideration if you're planning on manufacturing this effect. Maybe the C66x is just too expensive for a small production run - I haven't done the analysis.<br class="gmail_msg">
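To make that accumulator prediction concrete, here is a small sketch (my own arithmetic, not from TI documentation) of the worst-case bit budget for a Q15 convolution as a function of impulse-response length:

```c
#include <stdint.h>

/* Guard bits needed so an accumulator summing n_taps products of two
 * 16-bit (Q15) samples can never overflow: each product fits in 30
 * magnitude bits, and summing n_taps of them can grow the result by
 * up to ceil(log2(n_taps)) further bits. */
static unsigned guard_bits(uint32_t n_taps)
{
    unsigned bits = 0;
    while ((1u << bits) < n_taps)  /* ceil(log2(n_taps)) */
        bits++;
    return bits;
}

/* Total accumulator width for a worst-case Q15 x Q15 convolution. */
static unsigned acc_bits_needed(uint32_t n_taps)
{
    return 30u + 1u + guard_bits(n_taps);  /* product + sign + guard */
}
```

For a 256-tap response this comes to 39 bits - comfortably inside the 40-bit accumulators typical of these fixed-point DSPs; longer responses need intermediate scaling.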
<br class="gmail_msg">
<br class="gmail_msg">
> C55x tops out at 0.6 GMACS and C674x at<br class="gmail_msg">
> 3.6 GMACS and neither are enough of an increase to challenge arm<br class="gmail_msg">
> boards which many people already have, most likely.<br class="gmail_msg">
<br class="gmail_msg">
The problem with comparing GMACS is that a DSP effect involves more operations than just the Multiply and Accumulate. As I mentioned, there's pointer math, loop counting, multiple simultaneous bus transactions, and bit-reversed addressing. When you consider all of the other things that have to happen in an FFT, the optimizations of DSP chips mean that you're comparing apples to oranges when looking at raw GMAC benchmarks. Granted, pure time domain convolution might be simple enough that they'd be close to the same number of cycles, but as soon as you use FFT for the longer impulse responses you're going to need far fewer cycles in the DSP instruction set.<br class="gmail_msg">
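A back-of-the-envelope sketch of that gap for convolution specifically - textbook operation counts only, not measured cycles, so the DSP's addressing-mode and parallel-bus advantages come on top of this:

```c
#include <stdint.h>

/* Direct time-domain convolution of an n-sample block with an
 * n-tap impulse response: n multiplies per output sample. */
static uint64_t direct_macs(uint64_t n)
{
    return n * n;
}

/* Rough multiply count for FFT-based convolution of the same block:
 * two forward FFTs plus one inverse (each ~ (len/2)*log2(len)
 * butterflies on a 2n-point zero-padded transform), plus len
 * pointwise spectrum multiplies. An order-of-magnitude estimate. */
static uint64_t fft_macs(uint64_t n)
{
    uint64_t len = 2 * n;  /* zero-padded FFT length */
    uint64_t log2len = 0;
    while ((1ull << log2len) < len)
        log2len++;
    return 3 * (len / 2) * log2len + len;
}
```

For a 4096-tap response this estimates roughly 16.8M multiplies per block direct versus about 170k via FFT - two orders of magnitude, before counting any of the bookkeeping.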
<br class="gmail_msg">
On that note, I've started developing with the Cortex-M4, but I keep reading that the CMSIS FFT implementation is only optimized for Cortex-M3. So, even though the M4 is potentially way faster, nobody has taken advantage of it for the available FFT library. Does anyone know whether the FFT libraries have been updated to be optimized for M4? Now that we're seeing M7 and M33 chips, are we really able to take advantage of the new chips with FFT?<br class="gmail_msg">
<br class="gmail_msg">
Texas Instruments has tables showing how many clock cycles each type of FFT takes for each DSP. If you could get that kind of benchmark for the ARM, you'd be able to compare oranges to oranges.<br class="gmail_msg">
<br class="gmail_msg">
<br class="gmail_msg">
> I have had a longer chat about the capabilities of the Raspberry Pi<br class="gmail_msg">
> Broadcom QPU. It can be hacked to do realtime dsp, but, well, it's a<br class="gmail_msg">
> hack. It turns out they are 32-bit fp units and the APU (that contains<br class="gmail_msg">
> the QPUs) is already running an RTOS of some sort. I will include the<br class="gmail_msg">
> log in a separate email with a different subject.<br class="gmail_msg">
<br class="gmail_msg">
I scanned as much of that email as I could, and it seems that the QPU is basically a SIMD instruction set. We've had SIMD for ages. PowerPC has AltiVec, x86 has SSE3, ARM has Neon. SIMD can certainly be powerful, but often the challenge becomes streaming enough new data in and out to keep up with the fast processing. PowerPC had ways to prime a special streaming engine to keep the AltiVec engine fed with high-throughput data. I think that if you try to do convolution with the QPU, then you'll have one very fast cog in a machine where all the other parts are really slow and inefficient. It might work out nicely if you were processing quadraphonic audio, or 14.2 surround, but a simple stereo stream might not be able to achieve full utilization of the maximum QPU processing power.<br class="gmail_msg">
<br class="gmail_msg">
Note that the half float format of the graphics accelerator can only handle 11 bits of audio data. While I don't think that you really need 24 bit CODECs in your convolution effect, you certainly won't be able to get by with less than 12. The DSP chips handle 16 or 24 bit audio streams with 56 or more bits of precision, sometimes as much as 80 bits. You can't touch that with a graphics accelerator because graphics are tuned for only 8 bits or 10 bits per signal (red, green, and blue are handled separately). The way the eye works in adjusting to different overall brightness levels, you don't really need more than 11 bits, so I'm not criticizing the design of the graphics accelerator instruction sets - just pointing out how they're designed for a different type of data.<br class="gmail_msg">
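A small sketch of that 11-bit ceiling: quantizing an integer sample to a half float's significand (10 stored bits plus 1 implicit). This toy rounds half-up rather than IEEE round-to-nearest-even, for brevity:

```c
#include <stdint.h>

/* Quantize a non-negative integer to the nearest value representable
 * with an 11-bit significand (IEEE 754 half: 10 stored bits + 1
 * implicit). Shows why a half float carries at most 11 bits of audio:
 * samples at or above 2^11 lose their low-order bits. */
static uint32_t half_quantize(uint32_t x)
{
    if (x < (1u << 11))
        return x;  /* up to 2047: exactly representable */
    uint32_t shift = 0;
    while ((x >> shift) >= (1u << 11))
        shift++;   /* drop bits beyond the 11-bit significand */
    uint32_t step = 1u << shift;
    return ((x + step / 2) >> shift) << shift;  /* round half up */
}
```

Values up to 2047 survive exactly; 2049 comes back as 2050, and by full 16-bit scale the bottom four bits of every sample are gone.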
<br class="gmail_msg">
Note also that the TMS320 can also do lots of parallel processing, without the same limitations as SIMD. For example, the C55x can execute 1 or 2 completely different instructions in the same cycle, so long as they're not both using the same registers or units. SIMD, as you learned, must always execute the same instruction on multiple data values. I seem to recall that IIR effects are not able to take advantage of SIMD as much as other types of processing. I've done some AltiVec programming in AudioUnits plug-ins, but prefer DSP for embedded audio.<br class="gmail_msg">
<br class="gmail_msg">
Brian<br class="gmail_msg">
<br class="gmail_msg">
</blockquote></div>