OpenMP Accelerator -- CIM and OpenMP
Figure: Convolution C code example, with OpenMP-style pragma highlighted in yellow
Figure: Matrix multiply C code example, with OpenMP-style pragma highlighted in yellow
Figure: Section of arbitrary C code marked for offloading to the CIM array
Figure: Separate compute-intensive sections of C code, marked for parallel execution (pragmas highlighted in yellow)
Overview
CIM technology software supports three (3) basic modes of acceleration:
- OpenMP
- API based (planned for a future release)
- CUDA emulation (planned for a future release)
In the first mode the CIM array functions as an "OpenMP accelerator". The basic concept of
OpenMP is to use well-defined "markers", or directives, at the source code level (programmer level)
to tell the code build process which sections of source code should be accelerated, and in what
manner. Using CIM OpenMP syntax, the following types of code sections can be marked:
- Parallel sections -- sections of code that run on different cores at the same time
- Parallel for-loops -- a nested for-loop that shares work among some number of cores. In this case, "work" typically
means the innermost processing performed by the loop
- Offloaded sections -- arbitrary sections of code, related or unrelated, marked for offloading to the CIM array
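As a minimal sketch of the first two kinds of marking, the C code below uses standard OpenMP pragmas ("omp"). The CIM-specific pragma form is not given in this text, so none is assumed here. The function names convolve and sum_and_max are illustrative only.

```c
#include <stddef.h>

/* 1-D convolution: out[i] = sum over k of in[i + k] * taps[k].
 * Each output sample is independent, so the outer loop can be shared
 * among cores. `in` must hold at least n_out + n_taps - 1 samples. */
void convolve(const float *in, size_t n_out,
              const float *taps, size_t n_taps, float *out)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n_out; i++) {
        float acc = 0.0f;   /* private to each iteration */
        for (size_t k = 0; k < n_taps; k++)
            acc += in[i + k] * taps[k];
        out[i] = acc;
    }
}

/* Two unrelated computations marked as parallel sections.
 * Each section writes a different variable, so there is no data race.
 * Assumes n >= 1. */
void sum_and_max(const float *x, size_t n, float *sum, float *max)
{
    float s = 0.0f;
    float m = x[0];

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (size_t i = 0; i < n; i++)
                s += x[i];
        }
        #pragma omp section
        {
            for (size_t i = 1; i < n; i++)
                if (x[i] > m)
                    m = x[i];
        }
    }
    *sum = s;
    *max = m;
}
```

If the compiler is run without OpenMP support, the pragmas are ignored and the same functions run serially, which illustrates the point that marking code does not change its content.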
The main advantage of the OpenMP model is that it does not require a code
rewrite. Compute-intensive sections are accelerated by marking them,
not by changing the source code itself. A secondary advantage is that it fits
Texas Instruments multicore CPUs particularly well.
Another advantage is that a "group of tasks", or related sections of code, can
be moved (or "offloaded") to the CIM array. For example, an FFT, followed by
frequency domain operations, followed by an inverse FFT, followed by time domain
processing to reduce the data rate, could be handled by the CIM array, with data
transfer occurring only at the start and end of the task group. This type of
"algorithm based" processing is not readily supported by CUDA or the OpenCL standard.
In this example they would use a series of function calls or APIs (Application
Programming Interfaces), which in turn would need more frequent data transfers.
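The task-group idea can be sketched with the standard OpenMP "target" construct (OpenMP 4.0 and later); whether the CIM toolchain uses this exact syntax is not stated here. The three stages below are simplified stand-ins for the FFT chain described above, and process_group, N, and DECIM are hypothetical names chosen for illustration.

```c
#include <stddef.h>

#define N 1024      /* input length (illustrative) */
#define DECIM 4     /* rate-reduction factor (illustrative) */

/* A group of related stages inside ONE offloaded region.
 * `in` is copied to the accelerator once at entry and `out` is copied
 * back once at exit; the intermediate buffer never leaves the device. */
void process_group(const float *in, float *out)
{
    #pragma omp target map(to: in[0:N]) map(from: out[0:N / DECIM])
    {
        float tmp[N];

        /* Stage 1: scale the input */
        for (size_t i = 0; i < N; i++)
            tmp[i] = 0.5f * in[i];

        /* Stage 2: two-point moving average, in place.
         * Walking backwards keeps tmp[i - 1] unmodified when it is read. */
        for (size_t i = N - 1; i > 0; i--)
            tmp[i] = 0.5f * (tmp[i] + tmp[i - 1]);

        /* Stage 3: reduce the data rate by keeping every DECIM-th sample */
        for (size_t i = 0; i < N / DECIM; i++)
            out[i] = tmp[i * DECIM];
    }
}
```

Had each stage been a separate offloaded call, tmp would cross between host and accelerator memory after every stage. When no accelerator is available, a standard OpenMP runtime executes the region on the host instead.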
CIM Software Build Process
In brief, the CIM software build process:
- Separates C source into two (2) code streams, one for x86 / ARM (the server "host" CPU), and one for TI multicore CPUs
- Modifies the generated code streams to deal with run-time communication (moving data between host CPU memory and CIM array memory)
- Builds both code streams, creating a host executable file, and one or more (typically many) CIM array executable files
When the host program runs, it downloads and initializes CIM array executable
files. This is done transparently to the host program's user. This is a key
point -- from the user's perspective, the program looks, feels, and runs as
it did before -- except faster.
Another key point is that CIM acceleration can easily be enabled or disabled at
run-time, without any program changes. If for any reason the CIM array is not
present in the server, programs still function and run normally -- just
slower.
CIM at Run-Time
- CIM and OpenMP pragmas may be used at the same time
- CIM pragmas use the "cim" keyword, OpenMP pragmas use "omp"
- Programs can decide at run-time whether to use CIM, and to what extent. If the CIM array is not available or is offline, the code still works (but runs slower)
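A run-time decision of this kind can be sketched with the standard OpenMP "if" clause, which turns a marked region on or off per call without changing the code. This is only an analogy using "omp" pragmas; the equivalent "cim" syntax is not given here. The environment variable USE_ACCEL and the functions accel_enabled and saxpy are hypothetical.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical switch: USE_ACCEL is an assumed variable name used only
 * for this illustration. Returns 1 when it is set to a value starting
 * with '1', otherwise 0. */
int accel_enabled(void)
{
    const char *v = getenv("USE_ACCEL");
    return v != NULL && v[0] == '1';
}

/* y[i] = a * x[i] + y[i]
 * When use_accel is 0 the loop runs serially on the calling thread;
 * when it is non-zero the iterations are shared among cores.
 * The result is the same either way. */
void saxpy(float a, const float *x, float *y, size_t n, int use_accel)
{
    #pragma omp parallel for if(use_accel)
    for (long i = 0; i < (long)n; i++)
        y[i] = a * x[i] + y[i];
}
```

A caller would write saxpy(a, x, y, n, accel_enabled()); so the choice is made when the program runs, not when it is built.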