[sdiy] Hardware convolution box?

cheater00 cheater00 cheater00 at gmail.com
Fri Feb 10 06:56:06 CET 2017


http://alumni.media.mit.edu/~adamb/docs/ConvolutionPaper.pdf

Efficient Convolution Without Latency
William G. Gardner
Perceptual Computing Group
MIT Media Lab E15-401B
20 Ames St
Cambridge MA 02139-4307
Internet: billg at media.mit.edu

November 11, 1993

Abstract
It is well known that a block FFT implementation of convolution is
vastly more efficient than the direct form FIR filter. Unfortunately,
block processing incurs significant input/output latency which is
undesirable for real-time applications. A hybrid method is proposed for
doing convolution by combining direct form and block FFT processing.
The result is a zero latency convolver that performs significantly
better than direct form methods.
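The abstract's hybrid idea rests on linearity: the impulse response can be split into a short head, run as a direct-form FIR that produces output immediately, and a tail, run as FFT block convolution whose output is only needed `head_len` samples later. The Python sketch below checks that identity offline on whole signals. It is not Gardner's real-time scheduling, and the function names and the `head_len` split point are illustrative assumptions.

```python
import cmath
import random


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_convolve(x, h):
    """Linear convolution: zero-pad to a power of two, multiply spectra, inverse FFT."""
    n_out = len(x) + len(h) - 1
    n = 1
    while n < n_out:
        n *= 2
    X = fft([complex(v) for v in x] + [0j] * (n - len(x)))
    H = fft([complex(v) for v in h] + [0j] * (n - len(h)))
    y = fft([a * b for a, b in zip(X, H)], invert=True)
    return [v.real / n for v in y[:n_out]]


def direct_convolve(x, h):
    """Direct-form FIR: one multiply-accumulate per tap per output sample."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y


def hybrid_convolve(x, h, head_len):
    """Direct form for h[:head_len]; FFT for the tail, whose output is delayed by head_len."""
    head, tail = h[:head_len], h[head_len:]
    y = [0.0] * (len(x) + len(h) - 1)
    for i, v in enumerate(direct_convolve(x, head)):
        y[i] += v
    if tail:
        for i, v in enumerate(fft_convolve(x, tail)):
            y[i + head_len] += v
    return y


random.seed(1)
x = [random.uniform(-1, 1) for _ in range(200)]
h = [random.uniform(-1, 1) for _ in range(300)]
ref = direct_convolve(x, h)
out = hybrid_convolve(x, h, head_len=32)
print(max(abs(a - b) for a, b in zip(ref, out)))  # only floating-point rounding error
```

The printed difference is rounding noise, which shows the two halves sum to the plain convolution. In a real-time box, the head is what keeps the latency at zero, and the tail has `head_len` samples of slack to finish its FFT work.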

On Fri, Feb 10, 2017 at 6:49 AM, Terry Shultz <thx1138 at earthlink.net> wrote:
> http://www.st.com/en/development-tools/apworkbench.html?sc=apworkbench
>
> Use a Cortex M7 Discovery board for more performance than the Cortex M4.
>
> regards,
>
> Terry
>
> On Feb 9, 2017, at 9:24 PM, Terry Shultz <thx1138 at earthlink.net> wrote:
>
> Another guy I know is Jason Kridner, who started BeagleBoard.org
> https://beagleboard.org/x15/
>
> This is a monster performer http://www.ti.com/product/am5728
>
> Processor: TI AM5728 2×1.5-GHz ARM® Cortex-A15
>
> 2GB DDR3 RAM
> 4GB 8-bit eMMC on-board flash storage
> 2D/3D graphics and video accelerators (GPUs)
> 2×700-MHz C66 digital signal processors (DSPs)
> 2×ARM Cortex-M4 microcontrollers (MCUs)
> 4×32-bit programmable real-time units (PRUs)
>
> Connectivity
>
> 2×Gigabit Ethernet
> 3×SuperSpeed USB 3.0 host
> HighSpeed USB 2.0 client
> eSATA (500mA)
> full-size HDMI video output
> microSD card slot
> Stereo audio in and out
> 4×60-pin headers with PCIe, LCD, mSATA
> and much more...
>
> Software Compatibility
>
> Debian
> Android
> Ubuntu
> Cloud9 IDE on Node.js
> plus much more
>
>
> http://uk.rs-online.com/web/p/processor-microcontroller-development-kits/8874764/
> The cost is approx. 207.49, but it is backordered at this site until 18/04/2017.
>
> More than enough horsepower for Linux and Convolution engine I should think.
>
> regards,
>
> Terry
>
>
> On Feb 9, 2017, at 6:26 PM, Terry Shultz <thx1138 at earthlink.net> wrote:
>
> Check out my friend Dr. Paul Beckman’s site for tools
> https://www.dspconcepts.com/audio-weaver
>
> and my friend Tony Rouget’s site in Hong Kong https://www.minidsp.com
>
> and lastly my pal Al Clark’s site https://www.danvillesignal.com
> https://www.danvillesignal.com/landing-pages/snowbird-audio
>
> These are good examples of audio products that are better than what the DSP
> manufacturers can build.
>
> and lastly my old friend from MIT Dr. Bill Gardner
>
> https://www.audiobuildersworkshop.com
>
> hope this helps you guys a bit more.
>
> regards,
>
> Terry
>
>
>
>
> On Feb 9, 2017, at 5:20 PM, cheater00 cheater00 <cheater00 at gmail.com> wrote:
>
> That makes sense. It's also a very solid way to do things, if one manages to
> dot all the i's so the result is an accurate copy of the naive method.
>
>
> On Fri, 10 Feb 2017 02:01 Olivier Gillet, <ol.gillet at gmail.com> wrote:
>>
>> I think you're vastly overestimating how much computational resources
>> this requires.
>>
>> A well-known trick is to partition the head of the IR into small
>> blocks (say 32 samples long if you want sub ms latency at 48kHz), and
>> use larger blocks for the tail of the IR (latency is not a problem for
>> the tail). The whole convolution can be decomposed as a sum of
>> convolutions by each of the blocks (each delayed by that block's
>> offset within the IR), which can be evaluated in the frequency domain
>> by DFT, complex multiplication by the DFT of the IR block, and
>> inverse DFT.
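A minimal offline sketch of that decomposition: the IR is cut into consecutive blocks, each block is convolved with the input in the frequency domain (zero-padded DFT, complex multiplication, inverse DFT), and each result is added into the output delayed by the block's offset. The doubling block sizes are a hypothetical choice for illustration. A real-time version would also precompute the IR-block spectra and schedule the per-block FFTs, which this sketch does not attempt.

```python
import cmath
import random


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def fft_convolve(x, h):
    """DFT of both (zero-padded), complex multiply, inverse DFT."""
    n_out = len(x) + len(h) - 1
    n = 1
    while n < n_out:
        n *= 2
    X = fft([complex(v) for v in x] + [0j] * (n - len(x)))
    H = fft([complex(v) for v in h] + [0j] * (n - len(h)))
    return [v.real / n for v in fft([a * b for a, b in zip(X, H)], invert=True)[:n_out]]


def partitioned_convolve(x, h, block_sizes):
    """Sum of per-block FFT convolutions, each delayed by the block's offset in h."""
    if sum(block_sizes) < len(h):
        raise ValueError("block sizes must cover the whole impulse response")
    y = [0.0] * (len(x) + len(h) - 1)
    offset = 0
    for size in block_sizes:
        block = h[offset:offset + size]
        if not block:
            break
        for i, v in enumerate(fft_convolve(x, block)):
            y[offset + i] += v
        offset += size
    return y


def direct_convolve(x, h):
    """Reference direct-form convolution."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y


random.seed(2)
x = [random.uniform(-1, 1) for _ in range(256)]
h = [random.uniform(-1, 1) for _ in range(600)]
sizes = [32, 32, 64, 64, 128, 128, 256]  # hypothetical: small head blocks, larger tail blocks
y = partitioned_convolve(x, h, sizes)
ref = direct_convolve(x, h)
print(max(abs(a - b) for a, b in zip(y, ref)))  # rounding error only
```

Small blocks at the head keep the first output path short; larger blocks further into the IR cost fewer operations per sample, and their extra latency is absorbed by their offset.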
>>
>> I did some back of the envelope computations and arrived at the result
>> of 40 MMACs for a sample rate of 48kHz and a 2s-long IR.
>>
>> I found it a bit too good to be true, so I went back to the source:
>>
>> http://www.cs.ust.hk/mjg_lib/bibs/DPSu/DPSu.Files/Ga95.PDF
>> p. 132, just before the beginning of section 6:
>>
>> "Thus a filter of size 128K samples will require approximately 427
>> multiplies per output sample".
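To put the quoted figure in perspective: a direct-form FIR needs one multiply per tap per output sample, so 128K taps means 131,072 multiplies per sample against the roughly 427 quoted. The per-second numbers below assume a 48 kHz sample rate, which is an assumption here and not taken from the paper.

```python
FS = 48_000                    # assumed sample rate (Hz)
TAPS = 128 * 1024              # "128K samples" from the quote
PARTITIONED_PER_SAMPLE = 427   # multiplies per output sample, as quoted

direct_per_second = TAPS * FS                          # one multiply per tap per sample
partitioned_per_second = PARTITIONED_PER_SAMPLE * FS

print(f"direct form: {direct_per_second / 1e6:,.0f} M multiplies/s")
print(f"partitioned: {partitioned_per_second / 1e6:,.1f} M multiplies/s")
print(f"ratio:       {TAPS / PARTITIONED_PER_SAMPLE:,.0f}x")
```

This prints about 6,291 M multiplies/s for the direct form versus about 20.5 M for the partitioned scheme, a ratio of roughly 307.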
>>
>> This assumes that the DFTs of all the blocks making up the IR have
>> been pre-computed, but this can be done faster than real time when
>> the IR is loaded, assuming you've got enough RAM.
>>
>> Of course there's the issue of scheduling and a lot of additional
>> bookkeeping, but at the very worst we're talking about hundreds of
>> MMACs and a couple of MBytes of RAM.
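A rough check of the "couple of MBytes" figure for the precomputed IR spectra. It assumes each IR block of B taps is zero-padded to a 2B-point FFT (so block products give linear rather than circular convolution), spectra stored as single-precision complex values (8 bytes each), and a 2 s IR at 48 kHz cut into uniform 32-sample blocks. All three are assumptions for illustration, and input-side buffers are not counted.

```python
def ir_spectra_bytes(ir_len, block_sizes, bytes_per_bin=8, real_fft=True):
    """Memory for precomputed IR-block spectra (each block padded to a 2B-point FFT)."""
    total = 0
    offset = 0
    for b in block_sizes:
        if offset >= ir_len:
            break
        n_fft = 2 * b
        # A real-input FFT only needs n/2 + 1 bins; a full complex FFT stores all n.
        bins = n_fft // 2 + 1 if real_fft else n_fft
        total += bins * bytes_per_bin
        offset += b
    return total


IR_LEN = 2 * 48_000               # assumed: 2 s at 48 kHz
BLOCKS = [32] * (IR_LEN // 32)    # assumed: uniform 32-sample blocks

print(ir_spectra_bytes(IR_LEN, BLOCKS, real_fft=False))  # full complex spectra
print(ir_spectra_bytes(IR_LEN, BLOCKS))                  # real-FFT half spectra
```

This gives 1,536,000 bytes (about 1.5 MB) with full complex spectra, or 792,000 bytes if only the non-redundant half of each real-input spectrum is kept, consistent with the estimate above.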
>>
>> On Fri, Feb 10, 2017 at 1:09 AM, cheater00 cheater00
>> <cheater00 at gmail.com> wrote:
>> > Found the right spot at the TI website. I've made a somewhat large
>> > survey of AD and TI chips. I've uploaded the data to Google Docs (see
>> > link at the end of this email).
>> >
> >> > For a lot of power, TI can't be beat. Their chips are as cheap as
> >> > $0.78/GMACS: that's the TMS320C6678CYP, a chip with 8.5MB RAM and
> >> > 256 GMACS, $200 at Mouser.
> >> >
> >> > The cheapest TMS320C is the TMS320C6652CZH6, with 19.2 GMACS and 1MB
> >> > RAM, at $41.95 from Arrow.
>> >
> >> > For cheap chips, AD is great. Their most powerful non-obsolete
> >> > offering is the ADSP-BF561SKBCZ-5A, 2 GMACS, 328KB RAM, $32.59 at
> >> > Arrow, for $16.30/GMACS. Some of their unusually cheap chips include:
> >> > - ADSP-BF525BBCZ-5A, 1.2 GMACS, 132KB, $11.79 at Newark Element14, for $9.83/GMACS
> >> > - ADSP-BF534BBCZ-4A, 1 GMACS, 134KB, $5.88 at Newark Element14, for $5.88/GMACS
> >> > - ADSP-BF531SBBCZ400, 0.8 GMACS, 53KB, $4.44 at Avnet, for $5.55/GMACS
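The $/GMACS figures are unit price divided by GMACS. A small script to recompute them from the numbers quoted in this message (prices as listed in February 2017, not current ones):

```python
# (part, GMACS, unit price in USD) as quoted in the survey message
PARTS = [
    ("TMS320C6678CYP", 256.0, 200.00),
    ("TMS320C6652CZH6", 19.2, 41.95),
    ("ADSP-BF561SKBCZ-5A", 2.0, 32.59),
    ("ADSP-BF525BBCZ-5A", 1.2, 11.79),
    ("ADSP-BF534BBCZ-4A", 1.0, 5.88),
    ("ADSP-BF531SBBCZ400", 0.8, 4.44),
]


def dollars_per_gmacs(gmacs, price):
    """Cost per unit of MAC throughput."""
    return price / gmacs


for name, gmacs, price in PARTS:
    print(f"{name:20s} ${dollars_per_gmacs(gmacs, price):6.2f}/GMACS")
```

On these figures the TMS320C6678 comes out at about $0.78/GMACS and the TMS320C6652 at about $2.18/GMACS, while the listed Blackfin parts range from about $5.55 to $16.30/GMACS.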
>> >
> >> > Those chips were noticeably (3-4x) cheaper than their close
> >> > counterparts; apparently Newark and Avnet are having some sort of
> >> > blowout sale.
>> >
>> > I stopped surveying AD chips around 1.2 GMACS. There are going to be
>> > much cheaper ones than I found, I guess, but they just have so many
>> > chips I'd spend 2 days figuring out the prices. It's obvious: their
>> > stuff is cheap.
>> >
> >> > AD chips are inexpensive, but clearly, if you need a lot of processing
> >> > power and/or a lot of memory, TI will be 5 to 10 times cheaper. $200
> >> > might not be so much if that's the majority of the cost of the box for
> >> > a DIY gamer.
>> >
> >> > As far as evaluation boards go, the highest-powered TI board seems to
> >> > be the best value. The TMDSEVM6678L costs $399 on TI's website, has
>> > 64MB Flash, 512 MB DDR3 SDRAM, gigabit ethernet, usb mini-B, 80 IO
>> > header and an AMC header with PCIe, an emulator port, a small FPGA for
>> > configuration and booting, etc. See features at these two links:
>> > http://www.ti.com/tool/tmdsevm6678#Technical%20Documents
>> > http://www2.advantech.com/Support/TI-EVM/6678le_of.aspx
>> >
>> > I don't know if the USB can be used in host mode. Does anyone know?
>> >
> >> > It is unclear to me which version of the chip this board has: the
> >> > 320 GMACS one at 1.25 GHz or the 256 GMACS one at 1 GHz.
>> >
>> > Finally, there is a version of this board that costs $599 (50% more)
>> > and it has an XDS560V2 emulation mode. I understand that's a debugger.
>> > I don't know why exactly it is significant. What advantages does this
>> > bring for a developer?
>> > Is the emulator port shown on advantech's website only available in
>> > this more expensive version? If the cheaper version also has it, what
>> > can it be used for if the XDS560V2 emulation mode is not available?
>> >
>> > Survey data is on Google Docs. Anyone can comment:
>> >
>> >
>> > https://docs.google.com/spreadsheets/d/1oT-9PVh8yZMMwAkpltqGo8mhSL1NLXZJiC-LH7swsbY/edit?usp=sharing
>> >
>> >
>> > Have fun!
>> >
>> > On Thu, Feb 9, 2017 at 9:02 PM, cheater00 cheater00
>> > <cheater00 at gmail.com> wrote:
> >> >> Yeah, USB host mode sounds super useful, unless SD will allow faster UI
> >> >> interaction.
>> >>
>> >> Do you know which TI chips have the most MMACS? I find the website
>> >> confusing.
>> >>
>> >>
>> >> On Thu, 9 Feb 2017 20:29 , <rsdio at audiobanshee.com> wrote:
>> >>>
> >> >>> Based on your survey, I'd recommend the Analog Devices board, even
> >> >>> though I usually lean towards TMS320. The TMS320 family is huge,
> >> >>> including both fixed-point and floating-point, low-power and
> >> >>> high-speed, old and new designs, etc. Some of the TMS320 boards you
> >> >>> listed are really geared more towards motor control than audio,
> >> >>> which is why they might be underpowered for long impulse response
> >> >>> convolution. I know that the AD SHARC family is also large, and
> >> >>> they're very popular, but I am less familiar with the options.
>> >>>
> >> >>> Don't forget to look at the chip manufacturer as a direct source for
> >> >>> these boards. I always buy directly from Texas Instruments because
> >> >>> Digi-Key tends to have a markup. Outside the US, maybe it's a
> >> >>> different story due to international availability.
>> >>>
> >> >>> I'd recommend something like the 1MB 800 MMAC board and not worry
> >> >>> about external RAM. 1MB seems like plenty. I'd also recommend trying
> >> >>> to implement both the time domain convolution and the frequency
> >> >>> domain version. There are ways to reduce the latency of the
> >> >>> frequency domain approach, and at least it would allow for longer
> >> >>> impulse responses to be supported. For IRs that are short enough,
> >> >>> the time domain approach would work. I've also seen papers on
> >> >>> combining the two, since LTI techniques can be run in parallel and
> >> >>> summed.
>> >>>
> >> >>> As for taking pairs of 16-bit samples to speed things up, be aware
> >> >>> that not all instructions can work that way. I think that most DSPs
> >> >>> can do a few simple operations on value pairs, but the most complex
> >> >>> DSP instructions can only handle full samples. DSP architectures
> >> >>> have internal registers that are much larger than the sample size,
> >> >>> like 56-bit or higher. If you think about all of the potential
> >> >>> overflow when adding thousands of samples from an impulse response,
> >> >>> you can see why such large registers are needed. When working in
> >> >>> that model, it's not possible to handle the overflow from two
> >> >>> samples that are combined in a single 32-bit input value.
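The overflow argument can be made concrete. Summing N products needs about log2(N) extra guard bits on top of the product width, and if two samples share one 32-bit word, a carry out of the low sample spills into the high one. Both points are sketched below. The 16-bit sample and coefficient widths and the 2 s / 48 kHz tap count are assumptions for illustration, and DSPs that offer dual 16-bit (SIMD) instructions keep the two halves separate, which this plain 32-bit addition does not model.

```python
import math


def accumulator_bits(sample_bits, coeff_bits, taps):
    """Worst-case accumulator width to sum `taps` signed products without overflow."""
    product_bits = sample_bits + coeff_bits      # a signed b1 x b2 product fits in b1 + b2 bits
    guard_bits = math.ceil(math.log2(taps))      # headroom for adding `taps` such products
    return product_bits + guard_bits


def packed_add(a_lo, a_hi, b_lo, b_hi):
    """Add two pairs of unsigned 16-bit values packed into one 32-bit word."""
    a = (a_hi << 16) | a_lo
    b = (b_hi << 16) | b_lo
    s = (a + b) & 0xFFFFFFFF                     # ordinary 32-bit wraparound add
    return s & 0xFFFF, s >> 16                   # unpack (low, high)


print(accumulator_bits(16, 16, 2 * 48_000))      # 2 s IR at 48 kHz
print(packed_add(0xFFFF, 0, 1, 0))               # low half overflows into the high half
```

The first line gives 49 bits, which fits a 56-bit accumulator. The second returns `(0, 1)`: the low sample wrapped to 0 and its carry corrupted the high sample, which should have stayed 0.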
>> >>>
> >> >>> Finally, I think that nobody has made something like this because
> >> >>> the user interface would be rather difficult. It's a bit of a
> >> >>> power-user effect. On that note, some sort of SD card might be
> >> >>> useful, so I can see why you're looking into that. However, perhaps
> >> >>> just a custom USB class device would be enough of an interface to
> >> >>> allow downloading impulse responses to the device. At a minimum,
> >> >>> you'll need a large Flash to store the current impulse response, or
> >> >>> some way to partition the program Flash to set aside room for the
> >> >>> data. The AD board with USB host mode could feasibly read directly
> >> >>> from a USB memory stick or Flash drive.
>> >>>
>> >>> Brian
>> >>>
>> >>>
>> >>> On Feb 6, 2017, at 9:20 PM, cheater00 cheater00 <cheater00 at gmail.com>
>> >>> wrote:
> >> >>> > Brian, $50 is a steal. I've had a look at Digi-Key.
>> >>> >
>> >>> > This TI board is £24. It has ~150 KB on-chip RAM, but it has an
>> >>> > integrated SDRAM interface.
>> >>> >
>> >>> >
>> >>> > http://www.digikey.co.uk/product-detail/en/texas-instruments/LAUNCHXL-F28377S/296-42484-ND/5404239
>> >>> >
>> >>> >
>> >>> > This TI board is £40. It has ~384 KB on-chip RAM and an integrated
>> >>> > SRAM interface and SD card support. 200 MMACS.
>> >>> >
>> >>> >
>> >>> > http://www.digikey.co.uk/product-detail/en/texas-instruments/TMDX5505EZDSP/296-24965-ND/2127652
>> >>> >
> >> >>> > This AD board is £60. It has 1MB on-chip RAM and USB host mode; no
> >> >>> > idea about a RAM interface or SD card. 800 MMACS.
>> >>> >
>> >>> >
>> >>> > http://www.digikey.co.uk/product-detail/en/analog-devices-inc/ADZS-BF706-EZMINI/ADZS-BF706-EZMINI-ND/5408943
>> >>> >
> >> >>> > This last one sports 800 MMACS. Is this enough processing power for
> >> >>> > the 5-second convolution I mentioned above? It seemed to me like
> >> >>> > that would need 4608 MMACs. Maybe 2304 if we take pairs of 16-bit
> >> >>> > samples and treat them as 32-bit values. Are my numbers correct?
> >> >>> > Are there optimizations that can be done to lower this number,
> >> >>> > while still having zero latency? I understand FFT domain
> >> >>> > convolution introduces latency, which is not wanted in hardware.
> >> >>> > "Naive" MAC based convolution doesn't seem too far out of reach.
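The naive direct-form cost is taps × sample rate MACs per second, where taps = IR length in seconds × sample rate. The sketch below assumes mono audio at 48 kHz and one MAC per tap per output sample; both are assumptions, since the sample rate isn't stated in this message.

```python
def direct_form_macs_per_second(ir_seconds, fs, channels=1):
    """Direct-form FIR cost: every output sample needs one MAC per IR tap."""
    taps = round(ir_seconds * fs)
    return taps * fs * channels


FS = 48_000  # assumed sample rate
for seconds in (2, 5):
    mmacs = direct_form_macs_per_second(seconds, FS) / 1e6
    print(f"{seconds} s IR: {mmacs:,.0f} MMAC/s")
```

Under these assumptions, 4608 MMAC/s corresponds to a 2 s IR, while a 5 s IR would need about 11,520 MMAC/s per channel, roughly 14 times what a single 800 MMACS part provides.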
>> >>> >
> >> >>> > This TI board is £156. It has 256 KB on-chip RAM and support for
> >> >>> > DDR2 SDRAM. No mention of MMACs, but they say 3648 MIPS, and I
> >> >>> > assume a pipelined MAC costs one instruction. Would that be correct?
>> >>> >
>> >>> >
>> >>> > http://www.digikey.co.uk/product-detail/en/texas-instruments/TMDSLCDK6748/TMDSLCDK6748-ND/5213032
>> >>> >
>> >>> >
> >> >>> > The more expensive boards don't seem to have more powerful DSP
> >> >>> > chips, and those chips don't really get much more powerful either.
> >> >>> > However, convolution is easily parallelised. So, worst case
> >> >>> > scenario, if you wanted really long impulse responses you'd have to
> >> >>> > use a few chips. However, even the really good 800 MMACS Blackfin
> >> >>> > ones are £15 unit price, so that's not so bad...
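Parallelising across chips works because the IR can be cut into contiguous segments: each chip convolves the input with its own segment, the result is delayed by the segment's start offset, and the partial outputs are summed. A minimal sketch, assuming direct-form processing on each chip and ignoring inter-chip data transfer, memory bandwidth and control overhead; `split_across_chips` and `chips_needed` are hypothetical names.

```python
import math
import random


def direct_convolve(x, h):
    """Direct-form FIR convolution."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y


def split_across_chips(x, h, n_chips):
    """Each 'chip' handles one contiguous IR segment; outputs are offset and summed."""
    seg = math.ceil(len(h) / n_chips)
    y = [0] * (len(x) + len(h) - 1)
    for c in range(n_chips):
        start = c * seg
        part = h[start:start + seg]
        if not part:
            break
        for i, v in enumerate(direct_convolve(x, part)):
            y[start + i] += v
    return y


def chips_needed(required_macs_per_s, macs_per_chip):
    """Lower bound on chip count from raw MAC throughput alone."""
    return math.ceil(required_macs_per_s / macs_per_chip)


random.seed(3)
x = [random.randint(-100, 100) for _ in range(100)]
h = [random.randint(-100, 100) for _ in range(250)]
assert split_across_chips(x, h, 4) == direct_convolve(x, h)  # integer math: exact match
print(chips_needed(11_520_000_000, 800_000_000))  # 5 s IR at 48 kHz on 800 MMACS parts
```

The last line prints 15: under an assumed 48 kHz mono rate, a naive 5 s IR (11,520 MMAC/s) needs at least 15 chips of 800 MMACS each, before any overhead.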
>> >>> >
>> >>> > So, tell me, why hasn't anyone made this yet?
>> >>> >
>> >
>> > _______________________________________________
>> > Synth-diy mailing list
>> > Synth-diy at synth-diy.org
>> > http://synth-diy.org/mailman/listinfo/synth-diy
>




More information about the Synth-diy mailing list