3. CSDL Loops

CSDL contains a custom iterable for use in for loops that can accelerate compilation of CSDL code, use less memory, and give better run-time performance. In some cases, CSDL loops can replace manually vectorized code while maintaining similar performance. In addition to easing implementation, this results in more readable code, and is the recommended approach.

This tutorial shows how to use CSDL loops, explains when ths should and shouldn’t be used, and gives an overview of how they work and their performance benefits.

3.1. Basic Usage

To start, we’ll import the necessary libraries and demonstrate a simple example of a CSDL loop, where we’re summing the elements of an array.

import csdl_alpha as csdl
import numpy as np
recorder = csdl.Recorder(inline=True)
recorder.start()
dim = 100
array = np.ones((dim, dim))

sum = csdl.Variable(value=0)
array = csdl.Variable(value=array)
for i in csdl.frange(dim):
    for j in csdl.frange(dim):
        sum = sum + array[i, j]
recorder.stop()
print(sum.value)

[10000.]

As can be seen in this example, the CSDL loop is used in the same way as a normal for loop, but with the csdl.frange function used to generate the range of values to iterate over. While it looks like the full range is being iterated over, the loop is actually unrolled by the backend, and CSDL only stores a graph representing a single iteration of the loop. This means that the memory and time complexity of the loop is O(1) with respect to the number of iterations, and the loop can be used with very large ranges.

Note

Inline evaluation of CSDL loops does not apply any optimization, and the loop will be executed as normal. To see the performance benefits of CSDL loops, the code must be compiled and run.

To demonstrate the performance benefits, we’ll compare the performance of the CSDL loop to a standard python loop:

import time

dim = 300
np_array = np.random.rand(dim, dim)

# CSDL Loop
recorder = csdl.Recorder()
recorder.start()

sum = csdl.Variable(value=0)
array = csdl.Variable(value=np_array)
start = time.time()
for i in csdl.frange(dim):
    for j in csdl.frange(dim):
        sum = sum + array[i, j]
print('csdl loop time: ', time.time() - start)
recorder.stop()

# Standard Loop
recorder = csdl.Recorder()
recorder.start()

sum = csdl.Variable(value=0)
array = csdl.Variable(value=np_array)
start = time.time()
for i in range(dim):
    for j in range(dim):
        sum = sum + array[i, j]
print('standard loop time: ', time.time() - start)
recorder.stop()

csdl loop time:  0.0008194446563720703
standard loop time:  4.270884990692139

3.2. Loop Requirements

While CSDL loops can be used in many cases, there are some requirements that must be met for the loop to be valid. In short, because CSDL represents the loop using the graph of a single iteration, this graph cannot change between iterations, and everything done within the loop must be representable by the graph. This can be ensured by following these rules:

The loop must consist of only CSDL operations and CSDL variable creations.
The loop must not contain any conditional statements.
Variable setting and getting can depend on the loop index, but slices must be constant size.

Warning

CSDL does not currently allow slices using the loop index, so lists must be used instead. This is a limitation of the current implementation, and may be fixed in the future.

If these requirements are met, the loop can be used in place of a normal for loop, and will be compiled to a more efficient form. The first rule specifically may mean that some loops may need to be rewritten to use CSDL loops, but this can usually be accomplished by breaking the loop into two loops, one standard loop for non-CSDL operations, and one CSDL loop for CSDL operations. The following example demonstrates this in the context of assembling a sparse matrix. Constructing the row and column indices of the matrix is done with a standard loop, while the data is assembled with a CSDL loop.

recorder = csdl.Recorder()
recorder.start()

num_elements = 10
rows = np.zeros(16 * num_elements, dtype=np.int32)
cols = np.zeros(16 * num_elements, dtype=np.int32)
data = csdl.Variable(value=np.zeros((16 * num_elements,)))
K_local = csdl.Variable(value=np.random.rand(num_elements, 4, 4))
j = 16
for ind in range(num_elements): 
   ind1 = 2 * ind
   # NE quadrant
   rows[j:j+4] = np.array([ind1, ind1, ind1 + 1, ind1 + 1])
   cols[j:j+4] = np.array([ind1 + 2, ind1 + 3, ind1 + 2, ind1 + 3])

   # SE and SW quadrants together
   rows[j+4:j+12] = np.repeat(np.arange(ind1 + 2, ind1 + 4), 4)
   cols[j+4:j+12] = np.tile(np.arange(ind1, ind1 + 4), 2)

   j += 12

j_offset = 16
for ind in csdl.frange(num_elements):
   j = j_offset + ind * 12
   ind1 = 2 * ind
   K = K_local[ind, :, :]

   # NW quadrant gets summed with previous connected element.
   indices1 = [j-6, j-5]
   indices2 = [j-2, j-1]
   data = data.set(csdl.slice[indices1], data[indices1] + K[0, :2])
   data = data.set(csdl.slice[indices2], data[indices2] + K[1, :2])

   # NE quadrant
   data = data.set(csdl.slice[[j, j+1, j+2, j+3]], K[:2, 2:].flatten())

   # SE and SW quadrants together
   data = data.set(csdl.slice[[j+4, j+5, j+6, j+7, j+8, j+9, j+10, j+11]], K[2:, :].flatten())

recorder.stop()

3.3. Under the hood

While CSDL is able to represent a loop using the graph from a single iteration, obtaining this graph requires two iterations of the loop. This is needed to find feedback in the loop - that is, variables that are used then overwritten in the same iteration.

To demonstrate this, we’ll using the following example, which calculates the sum of the first n numbers:

recorder = csdl.Recorder()
recorder.start()

sum = csdl.Variable(value=0)
n = 100
for i in csdl.frange(n):
    sum = sum + i

The following image shows the graph of the loop after the first iteration, after the second iteration, and the final graph after all processing is completed:

alt text

The process for generating the final graph is as follows:

The loop is executed once, generating the graph for the first iteration.
The graph is parsed to make an ordered list of all inputs and outputs - that is, all variables that feed into operations but are not outputs of operations, and all outputs of operations.
All operations are removed from the graph
The loop is executed again, generating the graph for the second iteration.
The graph is again parsed to make an ordered list of all inputs. This list is compared to the list from the first iteration, and any variables that changed and are outputs from the first iteration are marked as feedback.
All inputs and outputs from the first iteration that are not used in the second are removed from the graph, resulting in the final graph.

This process is repeated for each loop in the code, and the final graph is compiled and executed. This process is what allows CSDL loops to be used in place of normal loops, and is what allows them to be more efficient.