[Bf-committers] CUDA backend implementation for GSoC?

Giuseppe Ghibò ghibo at mandriva.com
Tue Dec 16 19:06:12 CET 2008

Timothy Baldridge wrote:
>> Indeed, with PCIe 2.0 the bandwidth has doubled to 0.5GB/s per lane
>> in each direction, thus allowing 16GB/s in total on an x16 slot
>> (consider also that there are configurations with SLI or quad-SLI).
> Right, but the comment still stands....in many (perhaps most) cases,
> going from memory->PCIe->GPU->Stream Processor->GPU->PCIe->memory is
> going to be slower or at least have more overhead than
> memory->CPU->memory.
Yes, of course the memory bus is always faster than the PCIe bus, but
IMHO at this point we don't yet know whether these bottlenecks are
visible or how much they matter. IMHO the 20GB/s figure is also
theoretical. Furthermore, there are other memory factors to consider:
whether the memory controller is internal or external, the
availability of NUMA (e.g. Opterons with up to 8 quad-core CPUs = 32-way),
etc., and all sorts of memory hogs that affect current multicore
systems (see for instance this paper:
> Perhaps that's the best starting point. Can we get some solid
> benchmarks that show overhead (latency and bandwidth) for transferring
> data to and from the CPU (and setting up a simple program on the GPU)
> vs doing it all in memory.
Yes, a benchmark is probably the best approach. Any volunteers?
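
As a possible starting point for the "all in memory" side of such a
benchmark, here is a minimal Python sketch. The helper names best_time
and host_copy_bandwidth are made up for illustration, and the sketch
only measures a plain host memory copy; the GPU path is not covered and
would have to be timed the same way through whatever CUDA/OpenCL binding
is eventually chosen.

```python
import time


def best_time(fn, repeats=5):
    """Run fn() `repeats` times and return the shortest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def host_copy_bandwidth(n_bytes, repeats=5):
    """Estimate host memory copy throughput in bytes per second (hypothetical helper)."""
    src = bytearray(n_bytes)
    # bytes(src) makes one full copy of the buffer in host memory.
    elapsed = best_time(lambda: bytes(src), repeats)
    # Guard against a zero reading on very small buffers or coarse timers.
    return n_bytes / max(elapsed, 1e-9)


if __name__ == "__main__":
    for size in (1 << 16, 1 << 20, 1 << 24):
        print(size, "bytes:", host_copy_bandwidth(size) / 1e9, "GB/s")
```

Taking the best of several runs is just one way to reduce noise from
the OS scheduler; a real benchmark would probably also report the
spread.
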
>  Don't forget, in Blender you will have to
> grab data from and insert data back into the Blender structures,
> unless you plan on handing data to CUDA/OpenCL in the format Blender
> uses it in.
You'll probably get a graph showing the speedup of CUDA/OpenCL vs
multithreaded CPU for increasing data sizes. At a certain point, as the
data size increases further, this gain will fall to 1 or even less: the
benchmark should find this "size"/crossing point.
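
To make the crossing point concrete, here is a small Python sketch of
how it could be located once timings exist. It assumes the benchmark
produces, for each data size, one CPU time and one GPU time (including
transfers); the function names are hypothetical, and the crossing
between two measured sizes is only estimated by linear interpolation.

```python
def speedups(cpu_times, gpu_times):
    """Speedup of the GPU path over the CPU path for each data size."""
    return [c / g for c, g in zip(cpu_times, gpu_times)]


def crossing_sizes(sizes, cpu_times, gpu_times):
    """Return the data sizes at which the speedup crosses 1.0.

    Crossings that fall between two measured sizes are estimated by
    linear interpolation of the speedup curve.
    """
    s = speedups(cpu_times, gpu_times)
    crossings = []
    for i in range(len(s) - 1):
        a, b = s[i] - 1.0, s[i + 1] - 1.0
        if a == 0.0:
            # Speedup is exactly 1 at a measured size.
            crossings.append(float(sizes[i]))
        elif a * b < 0.0:
            # Sign change: the curve passes through 1 between the two sizes.
            frac = a / (a - b)
            crossings.append(sizes[i] + frac * (sizes[i + 1] - sizes[i]))
    if s and s[-1] == 1.0:
        crossings.append(float(sizes[-1]))
    return crossings


# Made-up numbers, not measurements: speedup goes from 1.5 to 0.5.
print(crossing_sizes([10, 20], [3.0, 1.0], [2.0, 2.0]))  # prints [15.0]
```

The function does not assume in which direction the gain moves, so it
would report both the size where the GPU starts to pay off and any size
where the gain drops back to 1.
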
> From what I last heard, there is no good way to get data from CUDA
> directly into OpenGL without taking it out of the GPU and inserting it
> back in. I think OpenCL allows inserting data into textures from
> OpenCL. So if we were going to use this for Subdivision surfaces,
> you'd have to upload the data to the GPU then stream the vertices out
> of the GPU and back into the GPU. Whereas the current method only
> streams them to the CPU.
Are you saying that the OpenGL part of the video card is not able to
talk "directly" to the OpenCL/CUDA part without going through the CPU
over and over (so no DMA)?

