OpenMP Accelerator -- CIM and OpenMP

Figure: Convolution C code example, with OpenMP-style pragma highlighted in yellow

Figure: Matrix multiply C code example, with OpenMP-style pragma highlighted in yellow

Figure: Section of arbitrary C code marked for offloading to the CIM array

Figure: Separate compute-intensive sections of C code, marked for parallel execution (pragmas highlighted in yellow)

Overview

CIM technology software supports three basic modes of acceleration:

OpenMP
API based (planned for a future release)
CUDA emulation (planned for a future release)

In the first mode, the CIM array functions as an "OpenMP accelerator". The basic concept of OpenMP is to use well-defined "markers", or directives, at the source code level (programmer level) to tell the code build process which sections of source code should be accelerated, and in what manner. Using CIM OpenMP syntax, the following types of code sections can be marked:

Parallel sections -- sections of code that run on different cores at the same time
Parallel for-loops -- a nested for-loop whose work is shared among some number of cores. Here, "work" typically means the innermost processing performed by the loop
Offloaded sections -- arbitrary sections of code, related or unrelated, marked for offloading to the CIM array
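
As a rough illustration of the first two section types, here is a minimal sketch using standard OpenMP directives. The CIM-specific directive syntax is not shown on this page, so the sketch assumes plain "omp" pragmas only; the function names matmul and scale_and_offset are hypothetical and chosen for this example. If the compiler is run without OpenMP support, the pragmas are ignored and the code runs serially with the same results.

```c
#include <assert.h>
#include <stddef.h>

/* Parallel for-loop: C = A * B for n x n row-major matrices.
   The pragma asks the OpenMP runtime to share the outer-loop
   iterations among the available cores. */
void matmul(const float *a, const float *b, float *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL && n > 0);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];   /* innermost "work" */
            c[i * n + j] = sum;
        }
    }
}

/* Parallel sections: two independent pieces of work that may run
   on different cores at the same time. */
void scale_and_offset(float *x, float *y, int n, float gain, float bias)
{
    assert(x != NULL && y != NULL && n >= 0);

    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 0; i < n; i++)
            x[i] *= gain;

        #pragma omp section
        for (int i = 0; i < n; i++)
            y[i] += bias;
    }
}
```

Note that only pragmas were added; the loop bodies are unchanged from what a serial version would contain.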

The main advantage of the OpenMP model is that it does not require a code rewrite. Compute-intensive sections are accelerated by marking them, without changing the source code itself. A secondary advantage is that the model fits Texas Instruments multicore CPUs particularly well.

Another advantage is that a "group of tasks", or related sections of code, can be moved (or "offloaded") to the CIM array. For example, an FFT, followed by frequency domain operations, followed by an inverse FFT, followed by time domain processing to reduce the data rate, could be handled entirely by the CIM array, with data transfer occurring only at the start and end of the task group. This type of "algorithm based" processing is not readily supported by CUDA and the OpenCL standard. In this example they would use a series of function calls or APIs (Application Programming Interfaces), which in turn would need more frequent data transfers.
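
The difference in data movement can be sketched with a small host-only simulation. This is an illustration under stated assumptions, not CIM code: the "array" is an ordinary local buffer, the four stages are trivial arithmetic placeholders standing in for FFT, frequency domain operations, inverse FFT, and data rate reduction, and the per-call variant assumes each API call moves its own input and output, as described above. All names (to_array, from_array, run_per_call, run_grouped) are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define N 8            /* samples per buffer */
#define NUM_STAGES 4   /* FFT, freq-domain ops, inverse FFT, rate reduction */

static int transfers;  /* counts simulated host <-> array copies */

/* Stand-ins for host-to-array and array-to-host copies. */
static void to_array(float *dev, const float *host)
{
    memcpy(dev, host, sizeof(float) * N);
    transfers++;
}

static void from_array(float *host, const float *dev)
{
    memcpy(host, dev, sizeof(float) * N);
    transfers++;
}

/* Placeholder stages; real code would do the signal processing. */
static void stage_a(float *d) { for (int i = 0; i < N; i++) d[i] += 1.0f; }
static void stage_b(float *d) { for (int i = 0; i < N; i++) d[i] *= 2.0f; }
static void stage_c(float *d) { for (int i = 0; i < N; i++) d[i] -= 3.0f; }
static void stage_d(float *d) { for (int i = 0; i < N; i++) d[i] *= 0.5f; }

typedef void (*stage_fn)(float *);
static const stage_fn stages[NUM_STAGES] = { stage_a, stage_b, stage_c, stage_d };

/* API style: every call copies its data in and back out.
   Returns the number of simulated transfers. */
int run_per_call(float *host)
{
    float dev[N];
    assert(host != NULL);
    transfers = 0;
    for (int s = 0; s < NUM_STAGES; s++) {
        to_array(dev, host);
        stages[s](dev);
        from_array(host, dev);
    }
    return transfers;
}

/* Task-group style: copy in once, run all stages, copy out once. */
int run_grouped(float *host)
{
    float dev[N];
    assert(host != NULL);
    transfers = 0;
    to_array(dev, host);
    for (int s = 0; s < NUM_STAGES; s++)
        stages[s](dev);
    from_array(host, dev);
    return transfers;
}
```

Under these assumptions, both variants produce the same output buffer, but the per-call version performs two copies per stage while the grouped version performs two copies in total.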

CIM Software Build Process

Here is a brief overview of the CIM software build process:

Separates C source into two code streams, one for x86 / ARM (the server "host" CPU) and one for TI multicore CPUs
Modifies the generated code streams to deal with run-time communication (moving data between host CPU memory and CIM array memory)
Builds both code streams, creating a host executable file, and one or more (typically many) CIM array executable files

When the host program runs, it downloads and initializes the CIM array executable files. This is done transparently to the host program's user. This is a key point -- from the user's perspective, the program looks, feels, and runs as it did before -- except faster.

Another key point is that CIM acceleration can easily be enabled or disabled at run-time, without any program changes. If for any reason the CIM array is not present in the server, programs still function and run normally -- just slower.

CIM at Run-Time

CIM and OpenMP pragmas may be used at the same time
CIM pragmas use the "cim" keyword, OpenMP pragmas use "omp"
Programs can decide at run-time whether to use CIM, and how much. If the CIM array is unavailable or offline, code is still functional (but runs slower)
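
The run-time fallback behavior can be pictured as a simple dispatch, sketched below. This is a hedged illustration only: the page does not show how a program queries or enables the CIM array, so the cim_enabled flag, set_cim_enabled(), and the sum_cim() stand-in are hypothetical names invented for this example, and sum_cim() simply runs the same math on the host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical switch; a real program would query the CIM run-time. */
static bool cim_enabled = false;

void set_cim_enabled(bool on) { cim_enabled = on; }

typedef enum { RAN_ON_HOST, RAN_ON_CIM } run_target;

/* Host path: always available. */
static void sum_host(const float *x, int n, float *out)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
    *out = s;
}

/* Stand-in for the accelerated path; here it just reuses the host
   math so the sketch stays runnable without any CIM hardware. */
static void sum_cim(const float *x, int n, float *out)
{
    sum_host(x, n, out);
}

/* Same call and same result either way; only the place where the
   work runs (and so the speed) is expected to differ. */
run_target vector_sum(const float *x, int n, float *out)
{
    assert(x != NULL && out != NULL && n >= 0);
    if (cim_enabled) {
        sum_cim(x, n, out);
        return RAN_ON_CIM;
    }
    sum_host(x, n, out);
    return RAN_ON_HOST;
}
```

The caller's code is identical in both cases, which is the property described above: with the array absent or disabled, the program still produces the same answer on the host.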

Multicore Hardware Supported

Several types of multicore hardware are supported for CIM acceleration, including:

Advantech 32-core and 64-core PCIe cards (DSPC-8681 and DSPC-8682). Clock rates of 1, 1.25, and 1.5 GHz. DDR3 memory sizes of 1 and 2 GByte
CommAgility 32-core uATCA modules
Advantech 160-core ATCA boards (8901 board). DDR3 memory sizes of 1 and 2 GByte