@HaomingJiang 2017-11-11T04:06:58.000000Z

# CSE6230 Final Project Checkpoint #2

Haoming Jiang, Wenjie Yao

### Computing Platform

Bridges, using 1--4 K80 GPUs or 1--2 P100 GPUs

### Pseudocode

Denote the position of each entry of the cube by a tuple $(i,j,k)$, $0 \leq i,j,k < n$. Basically, we want to reorder the computation so that the entries are divided into successive parts, where all entries in the same part can be computed in parallel.

Observation 1: For a fixed $(i,j)$, no two entries $(i,j,k)$ and $(i,j,k')$ depend on each other directly, so all $k$ in a stripe can be computed in parallel.

Observation 2: Any stripe $(i,j,*)$ depends only on its 8 neighboring stripes.

Based on Observation 2, we can derive the following coloring, in which all entries of the same color can be computed in parallel:
1. `i%2==0 && j%2==0`: color = 0
2. `i%2==0 && j%2==1`: color = 1
3. `i%2==1 && j%2==0`: color = 2
4. `i%2==1 && j%2==1`: color = 3
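
The four cases above amount to `color = 2*(i%2) + (j%2)`. The following Python sketch is only an illustration, not the project's GPU code: it checks by brute force that no stripe shares a color with any of its 8 neighbors. The helper names `color`, `neighbors`, and `coloring_is_conflict_free` are hypothetical, and the sketch assumes a plain $n \times n$ grid of stripes with no wrap-around at the boundary.

```python
def color(i, j):
    """Color of stripe (i, j, *), matching the four cases listed above."""
    return 2 * (i % 2) + (j % 2)


def neighbors(i, j, n):
    """Yield the (up to 8) neighboring stripes of (i, j) inside an n x n grid.

    Assumption: no periodic wrap-around at the boundary.
    """
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if (di, dj) != (0, 0) and 0 <= i + di < n and 0 <= j + dj < n:
                yield i + di, j + dj


def coloring_is_conflict_free(n):
    """True if no stripe has the same color as one of its neighbors."""
    return all(
        color(i, j) != color(a, b)
        for i in range(n)
        for j in range(n)
        for a, b in neighbors(i, j, n)
    )
```

Any neighbor differs by exactly 1 in $i$, in $j$, or in both, so at least one parity flips and the color changes. This is why stripes of one color can be updated together.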

We evenly divide the $(i,j)$ pairs of each color into $block\_num$ parts, where $block\_num$ is the total number of blocks used in the GPU computation.
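
One possible way to build such a partition is sketched below in Python. The text only says the division is even; the round-robin assignment used here is an assumption, and the function name `build_ijpairs` is hypothetical (the result is named after the `ijpairs` table used in the pseudocode).

```python
def build_ijpairs(n, block_num):
    """Return ijpairs[color][block]: the (i, j) stripes assigned to each block.

    Assumption: stripes of each color are dealt out round-robin, so the
    per-block counts for a given color differ by at most one.
    """
    ijpairs = [[[] for _ in range(block_num)] for _ in range(4)]
    counts = [0] * 4  # number of stripes seen so far, per color
    for i in range(n):
        for j in range(n):
            c = 2 * (i % 2) + (j % 2)  # same coloring as the four cases above
            ijpairs[c][counts[c] % block_num].append((i, j))
            counts[c] += 1
    return ijpairs
```

For example, `build_ijpairs(6, 4)` gives each color 9 stripes, spread over the 4 blocks as 3, 2, 2, and 2.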

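    // Pseudocode
    block_id = get_block_id();
    thread_id = get_thread_id();
    for (iter = 1:max_iterations)
        for (color_id = 0:3)
        {
            // each block handles its share of the (i,j) stripes of this color
            for ((i,j) in ijpairs[color_id][block_id])
                // each thread handles a contiguous range of k
                for (k = (n*thread_id)/thread_num : (n*thread_id+n)/thread_num-1)
                    update(i,j,k);
            synchronize(); // or barrier(); all blocks must finish this color before the next one starts
        }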
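
The inner loop splits the $n$ values of $k$ among the threads of a block using integer division. The short Python sketch below mirrors only that index arithmetic (the helper names `k_range` and `covered_ks` are hypothetical) and can be used to check that every $k$ is visited exactly once, even when $n$ is not a multiple of the thread count.

```python
def k_range(thread_id, thread_num, n):
    """k values owned by one thread, following the pseudocode's bounds.

    The pseudocode's inclusive upper bound (n*thread_id+n)/thread_num - 1
    becomes the exclusive bound of a Python range.
    """
    lo = (n * thread_id) // thread_num
    hi = (n * thread_id + n) // thread_num
    return range(lo, hi)


def covered_ks(thread_num, n):
    """All k values visited by the threads of one block, in thread order."""
    return [k for t in range(thread_num) for k in k_range(t, thread_num, n)]
```

Because the upper bound of thread `t` equals the lower bound of thread `t + 1`, the ranges tile `0 .. n-1` without gaps or overlap. With more threads than $k$ values, some threads simply get an empty range.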

### Machine characteristics

| Characteristic | K80 | P100 |
|---|---|---|
| Number of cores | 4992 | 3584 |
| Total memory bandwidth | 480 GB/s | 732 GB/s |
| Memory | 24 GB GDDR5 | 16 GB HBM2 |
| Double-precision performance | 2.91 TFLOPS | 4.7 TFLOPS |
| Single-precision performance | 8.73 TFLOPS | 9.3 TFLOPS |

PCIe x16 interconnect bandwidth (K80, P100): 32 GB/s

NVIDIA NVLink™ interconnect bandwidth (P100): 160 GB/s
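
As a rough aid for reading these figures, the Python sketch below computes the ratio of peak floating-point rate to peak memory bandwidth (FLOPs per byte) for each GPU. It uses only the peak numbers quoted in this section; the dictionary layout and the name `flops_per_byte` are hypothetical, and sustained performance on a real kernel may be quite different.

```python
# Peak figures copied from this section (bandwidth in GB/s, rates in TFLOPS).
specs = {
    "K80": {"bw_gbs": 480.0, "dp_tflops": 2.91, "sp_tflops": 8.73},
    "P100": {"bw_gbs": 732.0, "dp_tflops": 4.7, "sp_tflops": 9.3},
}


def flops_per_byte(tflops, bw_gbs):
    """Peak FLOPs available per byte moved from device memory."""
    return (tflops * 1e12) / (bw_gbs * 1e9)


balance = {
    name: {
        "dp": flops_per_byte(s["dp_tflops"], s["bw_gbs"]),
        "sp": flops_per_byte(s["sp_tflops"], s["bw_gbs"]),
    }
    for name, s in specs.items()
}

for name, b in balance.items():
    print(f"{name}: {b['dp']:.2f} DP FLOPs/byte, {b['sp']:.2f} SP FLOPs/byte")
```

In double precision both GPUs come out at roughly 6 FLOPs per byte, so a kernel that performs fewer operations than that per byte loaded is likely to be limited by memory bandwidth rather than by arithmetic.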
