Dissecting OpenCL code
In my previous post, I got a sample OpenCL matrix-multiplication kernel to run inside Python on Windows. That was one of the first times I got to work with OpenCL. The code itself is fairly self-explanatory for the most part, but it is interesting to see the comparisons with CUDA, and also how the kernel gets invoked through Python.
import pyopencl as cl
import numpy as np
import os
os.environ['PYOPENCL_CTX']='0'
(n, m, p) = (3, 4, 5)
a = np.random.randn(n, m).astype(np.float32)
b = np.random.randn(m, p).astype(np.float32)
c = np.zeros((n*p), dtype=np.float32)
context = cl.create_some_context()
queue = cl.CommandQueue(context)
mf = cl.mem_flags
a_buf = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_buf = cl.Buffer(context, mf.WRITE_ONLY, c.nbytes)
prg = cl.Program(context, """
__kernel void multiply(ushort n,
                       ushort m, ushort p, __global float *a,
                       __global float *b, __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = 0.0f;
    int rowC = gid/p;
    int colC = gid%p;
    __global float *pA = &a[rowC*m];
    __global float *pB = &b[colC];
    for(int k=0; k<m; k++)
    {
        pB = &b[colC+k*p];
        c[gid] += (*(pA++))*(*pB);
    }
}
""").build()
prg.multiply(queue, c.shape, None,
             np.uint16(n), np.uint16(m), np.uint16(p),
             a_buf, b_buf, c_buf)
a_mul_b = np.empty_like(c)
cl.enqueue_copy(queue, a_mul_b, c_buf)
print("matrix A:")
print(a.reshape(n, m))
print("matrix B:")
print(b.reshape(m, p))
print("multiplied A*B:")
print(a_mul_b.reshape(n, p))
At a high level, we set up data in CPU memory, allocate memory on the GPU, and copy the data over. The script then runs the kernel and copies the result back to CPU memory to print it. But I'd like to dig into each line of the code and dissect it to understand it better.
- context = cl.create_some_context() seems to set up a context on the target GPU device. What is a context? That search led me to the guide "How does OpenCL work?"
- The next command, queue = cl.CommandQueue(context), is now easily understood based on the block diagram above. Queues can apparently be either synchronous or asynchronous, and can be used for copies to and from the GPU as well as for the execution of kernels. The queue abstraction for performing operations on the device looks pretty interesting to me.
- a_buf = cl.Buffer(...) seems to be allocating memory on the device; I'm not sure whether that happens synchronously or asynchronously. https://documen.tician.de/pyopencl/runtime_memory.html#buffer indicates that "Buffers are attached to a Context and are only moved to a device once the buffer is used on that device"
- But there also seem to be methods that provide fine-grained control over how the copy is performed
- In the docs, they also mention the concept of Shared Virtual Memory (SVM) where the host and GPU share address space and can reference the same pointers from memory. This is very interesting and something I should try out soon
- The kernel is being initialized with the cl.Program class, with the snippet of OpenCL C code passed in. Looking at the docs, it looks like it could also take prebuilt binaries.
- The Program class contains all the kernels that exist in the code, and from the docs, "You may use program.kernel_name to obtain a Kernel object from a program", which explains the line prg.multiply(...)
- The kernel itself looks fairly straightforward, especially how it is invoked; the global size seems to correspond to the total thread count (blocks times threads per block) in CUDA, and the local size to the number of threads per block
- global size is the size of the resultant matrix, so each element in the matrix is computed by one work-item
- local size is set to None, and from the docs, "None may be passed for local_size, in which case the implementation will use an implementation-defined workgroup size". That just means the runtime picks the work-group size itself; since this kernel never references its local id, whatever size is picked doesn't change the result
- The rest of the kernel is just accessing the corresponding row and column and computing the element of the resultant matrix
a_mul_b = np.empty_like(c) seems like an unnecessary step, since c already has the right shape and dtype; copying back into c directly with cl.enqueue_copy(queue, c, c_buf) should work just as well.
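To convince myself the kernel's index arithmetic is right, the same flat-index walk can be replayed on the host in plain NumPy and checked against the built-in matrix product. This mirrors the kernel line by line, with one loop iteration playing the role of one work-item gid; the shapes match the listing above:

```python
import numpy as np

# Same shapes as the listing above
n, m, p = 3, 4, 5
rng = np.random.default_rng(0)
a = rng.standard_normal((n, m)).astype(np.float32)
b = rng.standard_normal((m, p)).astype(np.float32)

# Flatten exactly as the buffers are laid out in device memory (row-major)
a_flat = a.ravel()
b_flat = b.ravel()
c = np.zeros(n * p, dtype=np.float32)

# Each iteration of this loop plays the role of one work-item with id `gid`
for gid in range(n * p):
    rowC = gid // p          # int rowC = gid/p;  (integer division in C)
    colC = gid % p           # int colC = gid%p;
    acc = np.float32(0.0)
    for k in range(m):
        # pA walks row rowC of a; pB walks column colC of b with stride p
        acc += a_flat[rowC * m + k] * b_flat[colC + k * p]
    c[gid] = acc

# The flat result, reshaped to (n, p), should equal a @ b
assert np.allclose(c.reshape(n, p), a @ b, atol=1e-5)
```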
Closing Thoughts
While trying to understand OpenCL and comparing its programming model with CUDA, I ran into some other literature on newer GPU runtimes, especially for AMD GPUs. There is a newer framework for programming both AMD and Nvidia GPUs, called HIP. AMD trying to make it easy to port CUDA code to AMD GPUs by building a similar, compatible framework sounds like a good strategy, but I'm not sure how that is working out. I also ran into another programming language, by OpenAI, called Triton. These are definitely worth exploring, as Triton claims to be Python-like.
References
- https://documen.tician.de/pyopencl/runtime.html
- https://www.intel.com/content/www/us/en/developer/articles/technical/opencl-20-shared-virtual-memory-overview.html
- https://futhark-lang.org/blog/2024-07-17-opencl-cuda-hip.html
- https://arxiv.org/vc/arxiv/papers/1005/1005.2581v1.pdf
- https://openai.com/index/triton/

