[sdiy] Using Raspberry PI GPU/APU/QPU from RTOS
cheater00 cheater00
cheater00 at gmail.com
Wed Feb 15 00:49:34 CET 2017
I had a longer chat about the capabilities of the Raspberry Pi's
Broadcom QPUs. They can be hacked to do realtime DSP, but, well, it's a
hack. It turns out they are 32-bit floating-point units, and the
VideoCore side of the chip (which contains the QPUs) is already running
an RTOS of some sort.
You may find the IRC log below; it started in #raspberrypi and
continued in #raspberrypi-internals on Freenode.
22:11 < cheater > has there been an open source graphics driver
released for the raspberry pi 2 and 3 or do you still have to use the
broadcom stuff?
22:12 < ali1234 > cheater: the anholt driver is available. i dont
think it is finished though
22:12 < cheater > what is the anholt driver?
22:12 < ali1234 > cheater: it's an open source driver written by eric anholt
22:12 < cheater > is that a linux driver?
22:12 < ali1234 > everyone calls it the anholt driver because it
doesn't seem to have a name
22:12 < ali1234 > yes
22:14 < cheater > may i ask another question? i was wondering if the
pi 2/3 would be good as an embedded board to use for dsp. the GLES
support provided by the 3d hardware has a couple gigaflops which is
very powerful. however, you would like to be able to use it from a
realtime OS, with low latency. do you suppose that is doable?
22:15 < ali1234 > cheater: there's no real way to do GPGPU
programming on the Pi, except by reverse engineering
22:15 < ali1234 > there are a couple of simple examples and that is about it
22:17 <mgottschla > cheater: btw, those gigaflops are probably half precision
22:17 <mgottschla > (I don't know, but I'd expect that)
22:20 < clever > cheater: this is an old project i was working on,
it used the 3d hardware to render a single polygon from userland:
https://github.com/cleverca22/hackdriver
22:21 < clever > cheater: basically all it does is mmap /dev/mem and
start poking at things, so it would be easy to port to a more
baremetal OS
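The approach clever describes — mmap /dev/mem and poke registers directly — can be sketched as below. hackdriver itself is C++; this Python sketch makes the same idea concrete, and the addresses are my assumptions for a Pi 2/3 (peripherals at 0x3F000000; the original Pi used 0x20000000). Reading V3D_IDENT0 is the usual sanity check, since it should read back the ASCII characters "V3D".

```python
import mmap
import os
import struct

# Assumed addresses for a Pi 2/3 (BCM2836/7): ARM-visible peripherals start
# at 0x3F000000, with the V3D register block at offset 0xC00000.
PERI_BASE = 0x3F000000
V3D_OFFSET = 0xC00000
V3D_IDENT0 = 0x0000  # identity register; low bytes should spell "V3D"

def v3d_reg_phys(reg_offset):
    """Physical address of a V3D register, hackdriver-style."""
    return PERI_BASE + V3D_OFFSET + reg_offset

def read_v3d_ident0():
    """mmap /dev/mem and read V3D_IDENT0.

    Needs root and real Pi hardware; on anything else this will fail.
    """
    fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
    try:
        regs = mmap.mmap(fd, 4096, offset=v3d_reg_phys(0))
        (ident0,) = struct.unpack_from("<I", regs, V3D_IDENT0)
        return ident0
    finally:
        os.close(fd)
```

On a baremetal OS the mmap step disappears and the code pokes the physical addresses directly, which is why clever says the port would be easy.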
22:21 < cheater > clever: i'm looking to not do this in linux
because that is bad for realtime perf
22:21 < cheater > ah
22:21 < cheater > ok
22:21 < cheater > clever: how did that work out?
22:21 < cheater > also this is not about rendering, it's about using
instructions like FMA3/FMA4 aka multiply-and-accumulate
22:22 < cheater > for audio dsp
22:22 < clever > cheater: it was enough of a proof of concept that
i figured out how the 3d stuff worked, and was able to expand it to a
partially working opengl stack with kernel support
22:22 < clever > cheater: i was also able to get the pixel shader to run
22:22 < cheater > clever: what do you think is the overhead to get
the pixel shader running?
22:22 < clever > cheater: this is a shader i had written to do
texture rendering:
https://github.com/cleverca22/hackdriver/blob/master/texture.s
22:23 < clever > cheater: i think there is a different pipeline in
the v3d hardware for GPGPU stuff, to bypass all of the polygon
handling
22:24 < clever > cheater: this function builds the shader program
up from raw words:
https://github.com/cleverca22/hackdriver/blob/master/triangle.cpp#L38
22:24 < cheater > "from raw words"?
22:24 < clever > cheater: 4 byte fragments, pre-compiled and typed in as hex
22:25 < cheater > ah
22:25 < cheater > btw, what is the precision of the GLES shaders?
22:25 < ali1234 > cheater: half i think (one of the differences
between GLES and GL)
22:25 < clever > cheater: and line 121 refers to the shader in a
shader record, which tells the polygon stuff what to execute on each
pixel
22:25 < ali1234 > cheater: may no longer apply these days, i dunno
22:25 < cheater > ali1234: so GL has full precision?
22:25 < clever > cheater: let me find the pdf i was using
22:25 < ali1234 > cheater: i think so yes
22:25 < cheater > half means 16 bit right?
22:25 < ali1234 > yeah
22:26 < cheater > thanks
22:26 < cheater > clever: cool
22:26 < cheater > what would the reduction in performance be if i
wanted to emulate normal precision? what if i wanted double precision?
22:26 < clever > cheater:
https://web.archive.org/web/20160803202903/https://www.broadcom.com/docs/support/videocore/VideoCoreIV-AG100-R.pdf
22:27 < clever > cheater: page 35, the QPU instruction set
22:27 <mgottschla > clever: nice, I didn't know that there was a public manual
22:29 < clever > cheater: one oddity with the QPU, it has a form of
hyperthreading, 4 threads share a single pipeline, operating on 4 sets
of registers, but all 4 threads must be executing the same opcode
22:30 < clever > cheater: so to make the most out of it, those 4
threads have to run the same code, with all of them doing the same
branching (if any)
22:30 < clever > cheater: i have no idea what it will do if you try
to diverge from that
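What "diverging" would mean in practice: all lanes of a SIMD group execute the same instruction, so per-lane conditionals are normally compiled into selects rather than branches. A toy illustration, with plain Python lists standing in for the hardware lanes:

```python
def simd_select(cond, a, b):
    # All lanes evaluate both operands; per-lane condition flags then pick
    # the result, so no lane ever takes a different branch from its
    # neighbours.
    return [x if c else y for c, x, y in zip(cond, a, b)]

def simd_abs(lanes):
    # abs() without a divergent branch: compute the negation for every lane,
    # then select per lane based on the sign.
    return simd_select([v < 0 for v in lanes], [-v for v in lanes], lanes)
```

The cost model follows directly: both sides of a conditional are paid for in every lane, which is fine for short expressions but expensive for big divergent code paths.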
22:32 < clever > cheater: if you come to #raspberrypi-internals, i
can infodump more things i know and answer any other questions you
have
22:33 < cheater > clever: the 4-thread thing is fine, the processing
will most likely be fft anyways.
22:33 < cheater > so that can be broken up into 4.
22:33 < cheater > sure i can go there
(continued in #raspberrypi-internals)
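Splitting an FFT into four independent jobs, as cheater suggests, falls out of the standard radix-4 decimation: the four interleaved subsequences can be transformed separately (one per thread) and then combined with twiddle factors. A naive sketch of the decomposition, not an optimized FFT:

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT, used both as the reference and for the sub-jobs.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def dft_radix4_split(x):
    # One radix-4 decimation-in-time step: four independent N/4-point DFTs
    # over the interleaved subsequences x[0::4] .. x[3::4], then a
    # twiddle-factor combine.  The four sub-DFTs are the parallel jobs.
    N = len(x)
    assert N % 4 == 0
    subs = [dft(x[r::4]) for r in range(4)]
    return [sum(cmath.exp(-2j * cmath.pi * r * k / N) * subs[r][k % (N // 4)]
                for r in range(4))
            for k in range(N)]
```

In a real implementation the sub-DFTs would themselves be FFTs (or further radix-4 steps), but the four-way independence is the point here.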
22:34 < clever > cheater: and each of those 1c4t things, is grouped
up into a 4c16t module,
22:35 < clever > that 4c16t module, shares a single opcode decoder
22:35 < clever > and on each clock cycle, one core takes a turn on
the decoder
22:35 < clever > so the 4 cores are staggered
22:36 < cheater > clever: what does 1c4t mean?
22:36 < clever > so as an example, on the 1st clock cycle, the 1st
QPU will decode an opcode, and then run that opcode thru the pipeline,
22:36 < cheater > 1 cpu 4 threads?
22:36 < clever > 1 core, 4 thread
22:36 < cheater > and there's four of those?
22:37 < clever > 4 of those in each QPU
22:37 < clever > and 12 QPU's i believe
22:37 < cheater > and each core must execute those 4 "similar"
threads but the four cores could be executing different stuff?
22:37 < clever > yeah
22:37 < clever > but those 4 cores share a single opcode decoder
22:37 < cheater > what does that mean for me
22:37 < clever > so every 4 clock cycles, they take turns using it
22:37 < cheater > ah
22:37 < cheater > ok
22:38 < cheater > i don't really know how this works, but i envision
i would most likely create some sort of pipeline and just stream data
through it
22:38 < clever > and because the 4 clock cycle pipeline is running
4 threads on the same opcode
22:38 < clever > each core only needs to decode an opcode every 4 clocks
22:38 < cheater > i'm not sure if that would mean the opcode decoder
needs to be engaged all the time
22:38 < cheater > or if you engage it once and forget
22:39 < clever > i believe its designed so if all 4 QPU's are
active (16 threads total), the opcode decoder will be 100% active,
with no idle time or resource starvation
22:39 < cheater > if i wanted double precision FMA3
(multiply-and-accumulate), would my FLOP performance be divided by 4
(from 24 GFLOPs to 6 GFLOPs)?
22:40 < cheater > or would the performance hit be greater?
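For reference, the usual software answer to emulating wider precision is an error-free transform: each value becomes an unevaluated sum of two machine floats, and two_sum/two_prod recover the rounding error of each operation exactly. A sketch with Python doubles standing in for the hardware's float type:

```python
_SPLITTER = 2.0 ** 27 + 1.0  # Dekker's splitting constant for 53-bit doubles

def two_sum(a, b):
    # Knuth's algorithm: s + err == a + b exactly.
    s = a + b
    t = s - a
    err = (a - (s - t)) + (b - t)
    return s, err

def two_prod(a, b):
    # Dekker's algorithm: p + err == a * b exactly (no overflow assumed).
    # With a fused multiply-add this whole tail collapses to
    # err = fma(a, b, -p).
    p = a * b
    ah = _SPLITTER * a - (_SPLITTER * a - a)
    al = a - ah
    bh = _SPLITTER * b - (_SPLITTER * b - b)
    bl = b - bh
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err
```

Without an FMA, each emulated multiply costs several native operations, so the slowdown is typically worse than the 4x cheater guesses; an FMA instruction makes the error term a single extra op.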
22:40 < clever > let me see
22:40 < clever >
https://web.archive.org/web/20160803202903/https://www.broadcom.com/docs/support/videocore/VideoCoreIV-AG100-R.pdf
22:40 < clever > page 35, floating point add's, hmmm
22:41 < clever > page 36, the multiply opcodes
22:41 < clever > another oddity (to me, ive previously only used
x86 and AVR), every opcode in the QPU is basically 4 different things
at once
22:42 < clever > you have an add operation, a mult operation, and 2
different sets of bit-fields for various control flags
22:42 < clever > so a single opcode and clock cycle, can do add,
multiply, and some logic control
22:42 < cheater > so each opcode is a MAC
22:42 < cheater > half precision
22:45 < clever > page 53, "qpu reading and writing of VPM"
22:46 < clever > i believe this is for general purpose ram
22:46 < cheater > clever so the native format of the QPUs is
floating point? or does it have a native integer mode too?
22:46 < clever > and it can contain either 32, 16, or 8bit data samples
22:47 < clever > i believe it is floating point natively, but every
opcode can accept either ints or floats
22:47 < clever > for both input and output
22:47 < clever > one of the pipeline stages will convert it
automatically, depending on bits within the opcode
22:48 < clever > so in the case of a pixel shader, it takes the
8bit color of a pixel, and turns it into a 0.0-1.0 float, does a
single operation, and turns it back into
an 8bit int
22:48 < clever > but i think you can also store it as 32bit floats
in registers/ram
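The normalization clever describes — an 8-bit channel mapped to 0.0-1.0 and back — is simple enough to state exactly. Roughly what the pack/unpack conversion would do per channel:

```python
def unpack_u8(byte):
    # 8-bit channel -> normalized float in [0.0, 1.0].
    return byte / 255.0

def pack_u8(value):
    # Normalized float -> 8-bit channel, clamped and rounded.
    return min(255, max(0, int(round(value * 255.0))))
```

The round trip is lossless for all 256 channel values, so doing a single operation in float and packing back, as clever describes for the pixel shader, does not itself degrade the image.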
22:49 < cheater > aha
22:49 < cheater > i'm fine with it being floats
22:49 < cheater > so on intel, when you add or multiply two numbers
the result is stored in an increased-precision 56 bit register before
being truncated for output. does
this happen on the QPUs as well? if so, what is
the bit depth of that?
22:49 < clever > not sure, been a few years since i worked on this
22:50 < clever > i'll skim over the PDF some more
22:52 < clever > cheater: page 60, it mentions that the varying are
32bit floats, and some of the x/y coords are 12.4 fixed point
22:53 < cheater > aha
22:53 < clever > ah, depends on the mode though, next page is a
different mode with different layout
22:53 < cheater > i am not sure why i would be using 12.4 fixed point?
22:53 < cheater > or rather
22:53 < cheater > i am not sure why i would be using x/y coords?
22:53 < clever > i think this is the format for interfacing with
some of the 3d focused hardware
22:54 < clever > so not exactly something you would be using
22:54 < cheater > i want FMA3's
22:54 < cheater > aka MAC
22:55 < cheater > this is for running a convolution engine
22:55 < cheater > with use for possibly other audio dsp related code
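A convolution engine is, at its core, one multiply-accumulate per tap per sample — exactly the operation a paired add/mul instruction issues every cycle. A direct-form FIR sketch:

```python
def fir(x, h):
    # Direct-form FIR filter: y[n] = sum_k h[k] * x[n-k].
    # Each term of the inner sum is one multiply-accumulate (MAC).
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, coef in enumerate(h):
            if n - k >= 0:
                acc += coef * x[n - k]  # one MAC
        y.append(acc)
    return y
```

Feeding in an impulse returns the kernel itself, which is the standard sanity check for a convolution engine.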
22:55 < clever > page 83, address 0x00430
22:56 < clever > the registers in this region are for executing
"user programs" on the QPU
22:56 < cheater > ok so it looks like all registers are 32 bit
22:56 < cheater > for convolution that's plenty fine
22:56 < clever > the bulk of those registers take raw physical
addresses for the programs
22:57 < cheater > for compression and filtering maybe less so
22:57 < cheater > compression meaning audio dynamics (loudness
leveling) processors, not video or audio codecs like h.264 or mp3
22:57 < clever > there are another 2 cores in the 3d area, that
handle polygon level stuff
22:58 < clever > i think the binner breaks the polygons up into
shorter lists for each 64x64 pixel square, and then it runs the pixel
shader on every pixel of every polygon within it
22:58 < clever > and each 64x64 pixel region renders in parallel
22:59 < clever > cheater: page 89, QPU schedule registers
22:59 < clever > these let you reserve each bank of cores for certain tasks
22:59 < clever > so the 3d hardware wont put shaders on it, though
if you're running baremetal, that wont even be active
23:00 < clever > cheater: regarding start.elf stuff, it wont even
talk to the V3D module if you dont tell it to, so that wont cause you
issues
23:01 < clever > cheater: but you may want to mask the V3D interrupt
within start.elf, i can find the info on that
23:01 < clever > start.elf will crash hard if you enable the
interrupt when it isnt expecting it
23:02 < clever > page 90, the description of V3D_SRQPC
23:02 < clever > it mentions that it has a 16 job FIFO, to post QPU
programs for general-purpose use
23:03 < clever > and the job gets its data from the uniform list
set in V3D_SRQUA at the time V3D_SRQPC was written to
23:05 < cheater > ok
23:05 < clever > page 91, there is a pair of 8 bit counters, for
the total number of requests made, and request finished
23:05 < cheater > so basically we can use 16 bit floating point data
and mult and add into 32 bit registers?
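If the 16-bit samples were stored as IEEE half-precision floats (an assumption — the manual's VPM formats could equally be 16-bit integers), decoding one on the host side is a one-liner, since Python's struct module has a native half format:

```python
import struct

def f16_bits_to_float(bits):
    # Decode a raw IEEE 754 half-precision bit pattern (as it would sit in
    # 16-bit VPM storage) into a Python float via struct's "e" format.
    return struct.unpack("<e", struct.pack("<H", bits))[0]
```

This is the host-side view only; on the QPU the unpack stage would widen such samples to 32 bits before the ALUs touch them.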
23:05 < clever > so you can monitor the load and job rate
23:06 < clever > page 94, the registers to enable interrupts
23:06 < clever > you will need the right config.txt to use this, or
start.elf will lock up
23:07 < clever > cheater: oh, is your RTOS running on the VPU or ARM?
23:07 < cheater > ARM i assume
23:08 < clever > ah
23:08 < clever > the VPU side is far less documented, but already
has an RTOS running on it in every pi
23:08 < cheater > aha
23:08 < cheater > mhm
23:09 < clever > when the rpi boots, it loads start.elf into ram,
and then uses it to run an RTOS on the VPU cores (dual-core, no MMU)
23:09 < clever > and that is what then enables the arm core(s), and
loads linux into ram
23:09 < cheater > gotcha
23:09 < clever > then the VPU runs the firmware RTOS, and the ARM runs linux
23:09 < cheater > i guess the rtos being on the vpu is fine too
23:10 < cheater > the arm would do other stuff that's not related to dsp
23:10 < clever > the firmware blob is what manages bringing up the
dram and a lot of other stuff
23:10 < cheater > like stuff related to digital i/o, reading sd card, etc
23:10 < clever > so a lot more stuff has to be manually done if you
use your own firmware
23:11 < clever > page 98 lists some performance counters, like the
QPU stalling for a memory read
23:11 < cheater > aha
23:12 < cheater > so the background is that me and a couple other
folks who are in the synthesizer dev community are talking about dsp
23:12 < clever > ah
23:12 < cheater > and we're wondering what would be a good
inexpensive platform for dsp
23:13 < cheater > many people already have raspberry pi's
23:13 < cheater > and they're fairly inexpensive for the FLOPs provided
23:13 < clever > page 13 has a block diagram of the entire system
23:14 < cheater > if you continue your work on using the QPUs from
rtos, then you might find people will use it for that purpose
23:14 < clever > ah, each grouping of 4c16t is called a slice in
this document
23:14 < clever > and there are 3 slices
23:14 < clever > so thats a total of 12c 192 thread i believe
23:15 < clever > up near the top is the control list executor that
i used for 2d polygon rendering
23:16 < clever > and the binner is below that by 2
23:16 < cheater > aha
23:16 < clever > and you can see ~3 modules go to the QPU scheduler
23:16 < clever > and a register i previously mentioned, lets you
directly push jobs to the QPU scheduler
23:17 < clever > the VPM beside the scheduler is used for access to
the VPM memory i previously mentioned, and appears to also do general
DMA
23:18 < clever > the scoreboard is some kind of mutex management
module, so all of the cores can co-operate without conflicts
23:18 < clever > oops, 4 slices, missed slice 0
23:19 < clever > Scalability is principally provided by multiple
instances of a special purpose floating-point shader processor,
23:19 < clever > termed a Quad Processor (QPU). The QPU is a 16-way
SIMD processor. Each processor has two vector floating-point
23:19 < clever > ALUs which carry out multiply and non-multiply
operations in parallel with single instruction cycle
23:19 < clever > latency. Internally the QPU is a 4-way SIMD
processor multiplexed 4× over four cycles, making it particularly
23:19 < clever > suited to processing streams of quads of pixels.
23:19 < clever > cheater: from page 14
23:21 < cheater > so what does that mean for me?
23:21 < clever > just pasting random facts that look of interest
23:21 < clever > still searching for some solid numbers on the FPU
register sizes
23:22 < clever > cheater: page 15 mentions that the output goes to
a 64x64 pixel buffer, which can also do 64x32 pixels in 64bit float
mode
23:22 < clever > cheater: this implies that it uses 32bit floats by default
23:24 < clever > "For all intents and purposes the QPU can be
regarded as a 16-way 32-bit SIMD processor"
23:24 < clever > from page 16
23:25 < cheater > gotcha
23:25 < cheater > that's pretty good
23:25 < cheater > so the inputs are 32-bit and the internal
accumulators are 32-bit too?
23:25 < clever > probably
23:26 < clever > page 17 shows the pipeline diagram
23:26 < clever > cheater: the PACK and UNPACK you can see in the
pipeline deal with converting data to/from int/float
23:27 < clever > looks like you can only do that pack/unpack on the
main register file
23:27 < clever > and the accumulator doesnt support it
23:27 < clever > except for r4
23:28 < clever > 6 accumulators, but 2 of them are special purpose
23:28 < clever > and 1 supports unpack
23:29 < clever > 4 muxes, to select the A&B for the add&mult
separately, and the A inputs can also be data that follows the opcode
or is embedded inside the opcode
23:30 < clever > and it looks like it takes 3 clock cycles for the
ALU pair to fully do the add+mult
23:30 < clever > but thats hidden from your program by running 4
threads interleaved
23:31 < clever > so each opcode takes 4 clocks to run thru it
fully, and within that thread, it only starts on every 4th clock
23:31 < clever > so less need to deal with the fact that the data
isnt valid until 4 clocks later
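The interleaving clever describes can be made concrete: with four threads rotating through one decoder, each thread issues on every fourth cycle, which is enough to cover a 3-cycle ALU latency without interlocks. A toy model:

```python
def issue_cycles(thread, n_cycles):
    # With four threads rotating through one shared decoder, thread t issues
    # on cycles t, t+4, t+8, ...  A result that takes 3 cycles in the ALU is
    # therefore always ready before that thread's next instruction issues.
    return [c for c in range(n_cycles) if c % 4 == thread]
```

Every gap between a thread's consecutive issues is 4 cycles, strictly greater than the 3-cycle ALU latency, which is why the program never has to worry about data not being valid yet.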
23:40 < clever > 32 registers for generic data, 32 registers for IO
to other modules
23:40 < clever > cheater: paragraph 1 on page 18 describes that
latency in detail
23:41 < clever > cheater: yeah, page 18 goes on to confirm that it
appears to be primarily a 32bit float system
23:42 < clever > cheater: "Both ALU units can operate on integer or
floating point data, and internally always operate on 32-bit data (or
vectors of 4 8-bit quantities)."
23:43 < clever > cheater: hmmm, i think its saying that the
registers are 16bit?, and the accumulator is 32bit?
23:50 < clever > cheater: oh yeah, this may be of use:
https://github.com/mn416/QPULib
23:51 < clever > cheater: its describing a lot of the same things i have been
23:57 < clever > cheater: oh, this looks of use to you also:
http://www.aholme.co.uk/GPU_FFT/Main.htm
23:59 < clever > cheater: ah, and it seems start.elf also has a
proper execute_qpu() API that could save you some work
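That firmware call is reached through the mailbox property interface. The tag id (0x00030011) and value layout below are my assumptions based on how the GPU_FFT demo drives it, not something confirmed in the quoted manual; this sketch shows only the byte layout of the request:

```python
import struct

# Assumed firmware property tag for execute_qpu (as used by GPU_FFT).
TAG_EXECUTE_QPU = 0x00030011

def build_execute_qpu_request(num_qpus, control_bus_addr,
                              noflush=1, timeout_ms=1000):
    """Pack a mailbox property request asking start.elf to run QPU programs.

    control_bus_addr is assumed to be the bus address of num_qpus
    (uniforms, code) pointer pairs.  Real code must also place this buffer
    in GPU-visible memory, pad it to 16 bytes, and pass its bus address
    through the mailbox; none of that is shown here.
    """
    values = struct.pack("<4I", num_qpus, control_bus_addr, noflush, timeout_ms)
    tag = struct.pack("<3I", TAG_EXECUTE_QPU, len(values), 0) + values
    body = tag + struct.pack("<I", 0)                    # end tag
    return struct.pack("<2I", 8 + len(body), 0) + body   # total size, request
```

Going through this API keeps start.elf aware of the QPU jobs, which avoids the interrupt-handling crash clever warns about above.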
00:26 < cheater > clever: that's a lot of info
00:26 < cheater > i will pass that on, thank you
00:26 < clever > kk
00:27 < cheater > clever: if you'd like, you should check out the
synth-diy mailing list, where we're talking about dsp platforms now
00:27 < clever > cheater: and if you have any more questions, feel
free to ask in here
00:27 < cheater > thanks clever
00:28 < clever > cheater: this also sounds like its related to a
past project i worked on, i wanted to make a crazy audio capture
system
More information about the Synth-diy mailing list