CUDA Tutorial 02
Before diving into the specifics of how to use your graphics card efficiently, this session is meant to show you just how simple getting started with CUDA really is. With the advent of CUDA 6.0, even more of the barriers standing in the way of beginners have fallen. If you are familiar with a programming language like C, getting CUDA devices to work is ridiculously easy.
In this tutorial, we will program a function that creates an array of numbers and then increments each number in that array. This might not be a very useful utilization of computing power, but it illustrates how CUDA devices work and what purposes they can fulfill.
As in all tutorial sessions, the source code presented here can be found as an archive in the header of this document. You may download and build it using the Makefile, or you can copy (and adjust to your liking) the source code from this document and follow the build instructions.
Please consider the following source code:
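The complete listing ships with the archive linked above; what follows is a minimal sketch reconstructed from the walkthrough below, so treat the function names incrementOnGPU and incrementOnCPU as placeholders rather than the archive’s exact code:

    #include <stdio.h>

    // GPU version (a CUDA kernel): each thread increments exactly one element.
    __global__ void incrementOnGPU(int *numbers)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        numbers[idx] = numbers[idx] + 1;
    }

    // CPU version: a for-loop walks through all elements consecutively.
    void incrementOnCPU(int *numbers, int length)
    {
        for (int i = 0; i < length; i++)
            numbers[i] = numbers[i] + 1;
    }

    int main(void)
    {
        int numberOfNumbers = 100;
        int *numbers1, *numbers2;

        // Request two arrays in unified memory, accessible by CPU and GPU alike.
        cudaMallocManaged(&numbers1, numberOfNumbers * sizeof(int));
        cudaMallocManaged(&numbers2, numberOfNumbers * sizeof(int));

        // Fill both arrays with incrementing numbers.
        for (int i = 0; i < numberOfNumbers; i++) {
            numbers1[i] = i;
            numbers2[i] = i;
        }

        // The GPU increments numbers1: 1 block, numberOfNumbers threads per block.
        incrementOnGPU<<<1, numberOfNumbers>>>(numbers1);
        cudaDeviceSynchronize();  // wait until the kernel has finished

        // The CPU increments numbers2.
        incrementOnCPU(numbers2, numberOfNumbers);

        // Check whether both arrays are identical.
        int identical = 1;
        for (int i = 0; i < numberOfNumbers; i++)
            if (numbers1[i] != numbers2[i])
                identical = 0;

        if (identical)
            printf("Arrays are identical.\n");
        else
            printf("Arrays differ!\n");

        cudaFree(numbers1);
        cudaFree(numbers2);
        return 0;
    }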
As in all C programs, execution starts at the main function. After defining an array length (numberOfNumbers = 100), we tell the CUDA framework that we would like to have two arrays (numbers1 and numbers2) in unified memory. Unified memory is a construct introduced in CUDA 6.0 which simply means that the graphics processing unit (GPU, your graphics card) as well as the central processing unit (CPU) have access to the data stored in it. Do not worry about this for now; the next tutorial will tell you more about it. For now, just know that this command (cudaMallocManaged) gives you some space in memory to work with.
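For reference, cudaMallocManaged takes the address of your pointer and the desired size in bytes; this is the allocation from the sketch above:

    // Reserve room for 100 ints in unified memory; afterwards both the
    // CPU and the GPU can read and write through numbers1.
    cudaMallocManaged(&numbers1, numberOfNumbers * sizeof(int));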
Now we have to initialize our arrays with some numbers so we can start working on them. The loop simply fills both arrays with incrementing numbers.
The task is to increment each and every number in the arrays, so we tell the GPU to do that with the array numbers1 and the CPU to do the same with the array numbers2.
Note that the function calls are very similar. The CUDA function (also called a kernel) needs some more information, which is passed to it inside the <<<>>>-brackets, namely the number of blocks we wish to invoke (here: 1) and the number of threads per block (here: numberOfNumbers = 100). Again, do not worry about this right now; just use one block with as many threads as you like.
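Schematically, a launch reads kernelName<<<numberOfBlocks, threadsPerBlock>>>(arguments). With the placeholder names from the sketch above, the two calls look like this:

    incrementOnGPU<<<1, numberOfNumbers>>>(numbers1);  // GPU: 1 block, 100 threads
    cudaDeviceSynchronize();                           // wait for the GPU to finish
    incrementOnCPU(numbers2, numberOfNumbers);         // CPU: an ordinary function call

The call to cudaDeviceSynchronize makes the CPU wait until the kernel is done before touching the arrays again; kernel launches return immediately otherwise.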
If you take a look at the top of the file, you will find the two functions we just called. The CUDA version (the kernel) is marked with the keyword __global__. Note that both functions are very similar. The CPU version contains a for-loop, which iterates through all the elements consecutively.
You will notice that the CUDA kernel only manipulates one element of the array, namely the one at position idx. This is because the kernel is invoked numberOfNumbers (100) times, once for each element. This means that for each element, one thread is utilized to do the work. The first line in the kernel tells the thread which element in the array it has to work with. The identifiers blockIdx.x, blockDim.x and threadIdx.x are provided by the CUDA framework and tell the thread in which block it is positioned, how wide the block is, and which position inside the block it occupies, respectively.
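Spelled out with comments, that first line from the sketch:

    // Global index of this thread across all blocks:
    //   blockIdx.x  - which block this thread belongs to (always 0 here, one block)
    //   blockDim.x  - how many threads each block contains (100 here)
    //   threadIdx.x - this thread's position inside its block (0..99)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;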
After performing operations on both our arrays, the main function checks whether both arrays are identical and tells you if that is indeed true.
So let’s go on and see how this program performs.
Save the source code from the previous section in a file called "main_really_simple.cu", open a terminal and navigate to the folder where you stored it. As we have two different processing units involved, we need two compilers to build our program. The "normal" C/C++ compiler on Unix-like systems is called gcc (GNU Compiler Collection). As for CUDA devices, NVIDIA created a compiler called nvcc (NVIDIA CUDA Compiler). The following two lines create our program from source code:
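The exact commands are in the archive’s Makefile; a typical pair (the CUDA library path and flags are assumptions and vary by installation) looks like this:

    nvcc -c main_really_simple.cu -o main_really_simple.o
    gcc main_really_simple.o -o main_really_simple -L/usr/local/cuda/lib64 -lcudart -lstdc++

Here nvcc compiles the .cu file (including the kernel) into an object file, and gcc links that object file against the CUDA runtime library to produce the executable.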
Or, if you downloaded the archive and extracted it, you might use make (by typing exactly that in the terminal).
The only thing left to do is to run that program:
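Assuming the build produced a binary named main_really_simple in the current folder:

    ./main_really_simple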
And that’s it! You just got past the most complicated aspect of developing CUDA programs: you convinced yourself that it might be worth looking at.
Now give yourself a pat on the back and go on to the next lesson. Or try changing the value of numberOfNumbers and run the program again. For which numbers does it work? When does it fail? Why is that? Hint: I told you in the very first lesson...