Running pyopencl in Windows
After my failed attempt at executing OpenCL code on my AMD APU mini PC within WSL, I'm now attempting to do it in Windows. I hadn't coded in Python on Windows before, so I had to look up how to create a virtual env and activate it. It was fairly straightforward with PowerShell, which let me use some of the Linux commands familiar to me.
I then installed pyopencl by running
pip install pyopencl
and jumped into the Python console:
import pyopencl
from pyopencl.tools import get_test_platforms_and_devices
get_test_platforms_and_devices()
This gave me a warning about siphash24 not being installed (which I installed later), but it also showed the GPU as an OpenCL device:
[(<pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7ff8f0e98000>,
[<pyopencl.Device 'gfx902' on 'AMD Accelerated Parallel Processing' at 0x146c185c0b0>])]
Getting back to the exercise at https://homepages.math.uic.edu/~jan/mcs572f16/mcs572notes/lec29.html, rerunning it as-is caused the error I saw earlier:
RuntimeError: input did not match any platform
I inspected the code in pyopencl/__init__.py and realized that I might need to leave unset some of the environment variables given in the exercise. So I ran this snippet, and it prompted me to choose a device:
ctx = cl.create_some_context()
Choose platform:
[0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7ff8f0e98000>
Choice [0]:
Set the environment variable PYOPENCL_CTX='' to avoid being asked again.
I then went ahead and ran this code:
import pyopencl as cl
import numpy as np
import os
os.environ['PYOPENCL_CTX']='0'
(n, m, p) = (3, 4, 5)
a = np.random.randn(n, m).astype(np.float32)
b = np.random.randn(m, p).astype(np.float32)
c = np.zeros((n*p), dtype=np.float32)
context = cl.create_some_context()
queue = cl.CommandQueue(context)
mf = cl.mem_flags
a_buf = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_buf = cl.Buffer(context, mf.WRITE_ONLY, c.nbytes)
prg = cl.Program(context, """
__kernel void multiply(ushort n,
                       ushort m, ushort p, __global float *a,
                       __global float *b, __global float *c)
{
    // One work item per element of the flattened n*p output.
    int gid = get_global_id(0);
    c[gid] = 0.0f;
    int rowC = gid/p;                 // row of this output element
    int colC = gid%p;                 // column of this output element
    __global float *pA = &a[rowC*m];  // start of row rowC in a
    __global float *pB = &b[colC];    // start of column colC in b
    for(int k=0; k<m; k++)
    {
        pB = &b[colC+k*p];            // walk down column colC of b
        c[gid] += (*(pA++))*(*pB);
    }
}
""").build()
prg.multiply(queue, c.shape, None,
np.uint16(n), np.uint16(m), np.uint16(p),
a_buf, b_buf, c_buf)
a_mul_b = np.empty_like(c)
cl.enqueue_copy(queue, a_mul_b, c_buf)
print("matrix A:")
print(a.reshape(n, m))
print("matrix B:")
print(b.reshape(m, p))
print("multiplied A*B:")
print(a_mul_b.reshape(n, p))
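To convince myself that the kernel's flattened indexing (rowC = gid/p, colC = gid%p) really computes a row-major matrix product, here is a host-side sketch that replicates the same loop in pure NumPy. The function name multiply_flat is my own; it's just a mental model of the kernel, not part of pyopencl:

```python
import numpy as np

def multiply_flat(a, b, n, m, p):
    """Replicate the OpenCL kernel's indexing on the host:
    one 'work item' per element of the flattened n*p output."""
    a = a.ravel()  # row-major 1-D view, like the device buffer
    b = b.ravel()
    c = np.zeros(n * p, dtype=np.float32)
    for gid in range(n * p):
        rowC, colC = gid // p, gid % p
        for k in range(m):
            c[gid] += a[rowC * m + k] * b[colC + k * p]
    return c.reshape(n, p)

n, m, p = 3, 4, 5
a = np.random.randn(n, m).astype(np.float32)
b = np.random.randn(m, p).astype(np.float32)
print(np.allclose(multiply_flat(a, b, n, m, p), a @ b, atol=1e-4))
```

If the indexing is right, this prints True: each gid touches row rowC of a and column colC of b, exactly as the pointers pA and pB do in the kernel.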
Now that we've successfully run some sample code, we can dissect it and then run some bandwidth tests on memory copies from CPU space to GPU space.
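As a preview, the timing pattern for those bandwidth tests can be sketched with nothing but the standard library. This is a host-to-host stand-in: the buffer size and repeat count are arbitrary choices of mine, and in the real test the bytes copy below would be replaced by a cl.enqueue_copy between host memory and a device buffer:

```python
import time

def measure_bandwidth(nbytes=16 * 1024 * 1024, repeats=10):
    """Time repeated copies of an nbytes buffer and return GB/s.
    Host-to-host only; stands in for a host-to-device copy."""
    src = bytearray(nbytes)
    start = time.perf_counter()
    for _ in range(repeats):
        dst = bytes(src)  # full copy of the buffer
    elapsed = time.perf_counter() - start
    return nbytes * repeats / elapsed / 1e9

print(f"host copy bandwidth: {measure_bandwidth():.2f} GB/s")
```

The same shape carries over to the GPU version: allocate once, copy many times, and divide total bytes moved by wall-clock time.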
