
CS 149 Assignments

Posted on: 2022.10.26


Assignment 1

after lecture 04 (prog basics)

> screenfetch
██████████████████ ████████ vgalaxy@vgalaxy-virtualbox
██████████████████ ████████ OS: Manjaro 22.0.0 Sikaris
██████████████████ ████████ Kernel: x86_64 Linux 5.15.72-1-MANJARO
██████████████████ ████████ Uptime: 50m
████████ ████████ Packages: 1256
████████ ████████ ████████ Shell: zsh 5.9
████████ ████████ ████████ Resolution: 1920x964
████████ ████████ ████████ DE: GNOME 42.3
████████ ████████ ████████ WM: Mutter
████████ ████████ ████████ WM Theme:
████████ ████████ ████████ GTK Theme: Adw-dark [GTK2/3]
████████ ████████ ████████ Icon Theme: Papirus-Dark
████████ ████████ ████████ Font: Noto Sans 11
████████ ████████ ████████ Disk: 56G / 83G (71%)
CPU: Intel Core i7-10875H @ 8x 2.304GHz
GPU: llvmpipe (LLVM 14.0.6, 256 bits)
RAM: 3778MiB / 7947MiB
> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 165
model name : Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
stepping : 2
cpu MHz : 2303.998
cache size : 16384 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase avx2 invpcid rdseed clflushopt md_clear flush_l1d arch_capabilities
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit srbds mmio_stale_data retbleed
bogomips : 4609.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
...

Program 1: Parallel Fractal Generation Using Threads (20 points)

This assignment uses the std::thread library. Dividing the work evenly by rows is enough; be careful with the rounding error of the integer division.

Since thread #0 waits for the remaining threads to join, it generally takes the longest.
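A minimal sketch of the row partitioning (WorkerArgs and mandelbrotSerial come from the starter code; exact field names may differ):

void workerThreadStart(WorkerArgs* const args) {
    // Divide rows evenly; the last thread absorbs the remainder so that
    // integer division does not drop rows.
    int rowsPerThread = args->height / args->numThreads;
    int startRow = args->threadId * rowsPerThread;
    int numRows = (args->threadId == args->numThreads - 1)
                      ? (int)args->height - startRow
                      : rowsPerThread;
    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                     args->width, args->height,
                     startRow, numRows,
                     args->maxIterations, args->output);
}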

> ./mandelbrot -t 2 -v 2
[mandelbrot serial]: [249.444] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread #01]: [0.113] ms
[mandelbrot thread #00]: [0.153] ms
[mandelbrot thread #01]: [0.110] ms
[mandelbrot thread #00]: [0.152] ms
[mandelbrot thread #01]: [0.135] ms
[mandelbrot thread #00]: [0.171] ms
[mandelbrot thread #01]: [0.112] ms
[mandelbrot thread #00]: [0.154] ms
[mandelbrot thread #01]: [0.109] ms
[mandelbrot thread #00]: [0.151] ms
[mandelbrot thread]: [151.544] ms
Wrote image file mandelbrot-thread.ppm
(1.65x speedup from 2 threads)

Some more measurements:

> ./mandelbrot -t 2 -v 2
[mandelbrot serial]: [248.491] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [147.886] ms
Wrote image file mandelbrot-thread.ppm
(1.68x speedup from 2 threads)
> ./mandelbrot -t 3 -v 2
[mandelbrot serial]: [251.135] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [113.363] ms
Wrote image file mandelbrot-thread.ppm
(2.22x speedup from 3 threads)
> ./mandelbrot -t 4 -v 2
[mandelbrot serial]: [249.733] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [95.561] ms
Wrote image file mandelbrot-thread.ppm
(2.61x speedup from 4 threads)
> ./mandelbrot -t 5 -v 2
[mandelbrot serial]: [249.757] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [84.263] ms
Wrote image file mandelbrot-thread.ppm
(2.96x speedup from 5 threads)
> ./mandelbrot -t 6 -v 2
[mandelbrot serial]: [250.032] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [73.146] ms
Wrote image file mandelbrot-thread.ppm
(3.42x speedup from 6 threads)
> ./mandelbrot -t 7 -v 2
[mandelbrot serial]: [249.426] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [64.983] ms
Wrote image file mandelbrot-thread.ppm
(3.84x speedup from 7 threads)
> ./mandelbrot -t 8 -v 2
[mandelbrot serial]: [249.807] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [59.706] ms
Wrote image file mandelbrot-thread.ppm
(4.18x speedup from 8 threads)
> ./mandelbrot -t 16 -v 2
[mandelbrot serial]: [249.328] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [40.238] ms
Wrote image file mandelbrot-thread.ppm
(6.20x speedup from 16 threads)
> ./mandelbrot -t 32 -v 2
[mandelbrot serial]: [249.752] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]: [36.408] ms
Wrote image file mandelbrot-thread.ppm
(6.86x speedup from 32 threads)

Program 2: Vectorizing Code Using SIMD Intrinsics (20 points)

Computation is done with simulated vector instructions. Each vector instruction takes a mask parameter indicating which lanes of the vector participate in the operation.

For example, for the following vector instruction:

void _cs149_vadd_int(__cs149_vec_int &vecResult, __cs149_vec_int &veca, __cs149_vec_int &vecb, __cs149_mask &mask);

Only the lanes of veca and vecb whose mask bit is 1 are added; the results go into the corresponding lanes of vecResult, while the other lanes of vecResult are left unchanged.

Masks themselves can also be constructed with vector instructions, for example:

void _cs149_veq_int(__cs149_mask &vecResult, __cs149_vec_int &veca, __cs149_vec_int &vecb, __cs149_mask &mask);

A lane of vecResult is set to 1 only when the corresponding lanes of veca and vecb are equal, and to 0 otherwise — again under the semantics of the mask parameter.

From here, implementing clampedExpVector is not difficult. One extra wrinkle: since N is not necessarily a multiple of VECTOR_WIDTH, the all-ones mask needs to be narrowed for the final partial vector, including in the underflow and overflow handling.
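A minimal sketch, assuming the simulated intrinsics from the assignment's CS149intrin.h (_cs149_init_ones, _cs149_vload_int, _cs149_vgt_int, etc.):

void clampedExpVector(float* values, int* exponents, float* output, int N) {
    __cs149_vec_float x, result, clampVal;
    __cs149_vec_int y, count, zero, one;
    __cs149_mask maskAll, maskActive, maskClamp;

    for (int i = 0; i < N; i += VECTOR_WIDTH) {
        // Tail handling: when N is not a multiple of VECTOR_WIDTH,
        // enable only the first N - i lanes.
        int width = (N - i < VECTOR_WIDTH) ? (N - i) : VECTOR_WIDTH;
        maskAll = _cs149_init_ones(width);

        _cs149_vload_float(x, values + i, maskAll);
        _cs149_vload_int(y, exponents + i, maskAll);
        _cs149_vset_float(result, 1.f, maskAll);  // x^0 == 1
        _cs149_vset_int(zero, 0, maskAll);
        _cs149_vset_int(one, 1, maskAll);
        count = y;

        // Keep multiplying while any active lane still has count > 0;
        // lanes that finish drop out of maskActive.
        _cs149_vgt_int(maskActive, count, zero, maskAll);
        while (_cs149_cntbits(maskActive) > 0) {
            _cs149_vmult_float(result, result, x, maskActive);
            _cs149_vsub_int(count, count, one, maskActive);
            _cs149_vgt_int(maskActive, count, zero, maskAll);
        }

        // Clamp the lanes whose result exceeds 9.999999f.
        _cs149_vset_float(clampVal, 9.999999f, maskAll);
        _cs149_vgt_float(maskClamp, result, clampVal, maskAll);
        _cs149_vset_float(result, 9.999999f, maskClamp);

        _cs149_vstore_float(output + i, result, maskAll);
    }
}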

Now to analyze the vector statistics: as the Vector Width increases, the Vector Utilization drops.

> ./myexp -s 10000
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 218409
Vector Utilization: 87.5%
Utilized Vector Lanes: 382241
Total Vector Lanes: 436818
> ./myexp -s 10000
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 125719
Vector Utilization: 85.5%
Utilized Vector Lanes: 430081
Total Vector Lanes: 502876
> ./myexp -s 10000
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 68284
Vector Utilization: 84.6%
Utilized Vector Lanes: 462157
Total Vector Lanes: 546272
> ./myexp -s 10000
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 35584
Vector Utilization: 84.2%
Utilized Vector Lanes: 479417
Total Vector Lanes: 569344

This is because the number of while-loop iterations is determined by the largest exponent in the vector: as the Vector Width increases, a single large exponent stalls more and more lanes, which lowers utilization.

Take Vector Width = 16 as an example:

> ./myexp -l
***************** Printing Vector Unit Execution Log *****************
Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
vset | ****************
vset | ****************
vset | ****************
vset | ****************
vload | ****************
vload | ****************
veq | ****************
vmove | ____________*___
masknot | ****************
vmove | ************_***
vsub | ************_***
maskand | ****************
cntbits | ****************
vgt | ************_***
maskand | ****************
vmult | ****_*******_***
vsub | ************_***
cntbits | ****************
vgt | ************_***
maskand | ****************
vmult | ***__*******_**_
vsub | ************_***
cntbits | ****************
vgt | ************_***
maskand | ****************
vmult | ***__*******__*_
vsub | ************_***
cntbits | ****************
vgt | ************_***
maskand | ****************
vmult | ***__*******__*_
vsub | ************_***
cntbits | ****************
vgt | ************_***
maskand | ****************
vmult | *____*******____
vsub | ************_***
cntbits | ****************
vgt | ************_***
maskand | ****************
vmult | _____**___**____
vsub | ************_***
cntbits | ****************
vgt | ************_***
maskand | ****************
vmult | ______*___**____
vsub | ************_***
cntbits | ****************
vgt | ************_***
maskand | ****************
vmult | ______*____*____
vsub | ************_***
cntbits | ****************
vgt | ************_***
maskand | ****************
vmult | ________________
vsub | ************_***
cntbits | ****************
vmove | ************_***
vgt | ****************
vmove | ***____**_____*_
vstore | ****************

Observe that in the later vmult instructions the fraction of active lanes (*) gets lower and lower.

For arraySumVector, it suffices to use _cs149_hadd_float and _cs149_interleave_float to collapse the vector of partial sums down to the front of the vector.
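A minimal sketch (assumes N is a multiple of VECTOR_WIDTH, as the assignment guarantees):

float arraySumVector(float* values, int N) {
    __cs149_mask maskAll = _cs149_init_ones();
    __cs149_vec_float sum, tmp;
    _cs149_vset_float(sum, 0.f, maskAll);

    // First accumulate VECTOR_WIDTH partial sums in parallel.
    for (int i = 0; i < N; i += VECTOR_WIDTH) {
        _cs149_vload_float(tmp, values + i, maskAll);
        _cs149_vadd_float(sum, sum, tmp, maskAll);
    }
    // hadd adds adjacent pairs; interleave gathers the distinct pair sums
    // to the front. log2(VECTOR_WIDTH) rounds leave the total in lane 0.
    for (int span = VECTOR_WIDTH; span > 1; span /= 2) {
        _cs149_hadd_float(sum, sum);
        _cs149_interleave_float(sum, sum);
    }
    return sum.value[0];
}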

Program 3: Parallel Fractal Generation Using ISPC (20 points)

Install ispc:

yay -S ispc-bin

Program 3, Part 1. A Few ISPC Basics (10 of 20 points)

The example program is adapted from https://ispc.github.io/example.html

The results:

> ./mandelbrot_ispc
[mandelbrot serial]: [185.266] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [39.082] ms
Wrote image file mandelbrot-ispc.ppm
(4.74x speedup from ISPC)
> ./mandelbrot_ispc -v 2
[mandelbrot serial]: [116.578] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [29.948] ms
Wrote image file mandelbrot-ispc.ppm
(3.89x speedup from ISPC)

This is close to the speedup of ./mandelbrot -t 8 -v 2 in prog1.

Look at the command used to compile the ISPC program:

ispc -O3 --target=avx2-i32x8 --arch=x86-64 --opt=disable-fma --pic mandelbrot.ispc -o objs/mandelbrot_ispc.o -h objs/mandelbrot_ispc.h

The flags are documented at https://ispc.github.io/ispc.html#basic-command-line-options, or see ispc --help.

Changing --target=avx2-i32x8 to --target=avx2-i32x16 improves the speedup further; the 8 and 16 here play the same role as VECTOR_WIDTH in prog2.

> ./mandelbrot_ispc -v 2
[mandelbrot serial]: [117.725] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [18.595] ms
Wrote image file mandelbrot-ispc.ppm
(6.33x speedup from ISPC)

The --opt=disable-fma flag is essential here; without it the computed results may not match the serial reference.

FMA instruction set - Wikipedia

Program 3, Part 2: ISPC Tasks (10 of 20 points)

My earlier notes forgot to cover ISPC tasks.

ISPC tasks are expressed with the launch keyword:

export void mandelbrot_ispc_withtasks(uniform float x0, uniform float y0,
                                      uniform float x1, uniform float y1,
                                      uniform int width, uniform int height,
                                      uniform int maxIterations,
                                      uniform int output[])
{
    uniform int rowsPerTask = height / 16;

    // create 16 tasks
    launch[16] mandelbrot_ispc_task(x0, y0, x1, y1,
                                    width, height,
                                    rowsPerTask,
                                    maxIterations,
                                    output);
}

Food for thought: the difference between the launch and foreach keywords (foreach maps iterations onto the SIMD lanes of one gang, while launch creates tasks that the runtime can schedule across cores).

With ISPC tasks the speedup improves further; the runs below use 2, 4, 8, 16, and 32 tasks respectively.

> ./mandelbrot_ispc -v 2 --tasks
[mandelbrot serial]: [128.511] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [28.792] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [16.295] ms
Wrote image file mandelbrot-task-ispc.ppm
(4.46x speedup from ISPC)
(7.89x speedup from task ISPC)
> ./mandelbrot_ispc -v 2 --tasks
[mandelbrot serial]: [125.069] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [28.750] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [11.636] ms
Wrote image file mandelbrot-task-ispc.ppm
(4.35x speedup from ISPC)
(10.75x speedup from task ISPC)
> ./mandelbrot_ispc -v 2 --tasks
[mandelbrot serial]: [125.179] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [30.252] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [7.951] ms
Wrote image file mandelbrot-task-ispc.ppm
(4.14x speedup from ISPC)
(15.74x speedup from task ISPC)
> ./mandelbrot_ispc -v 2 --tasks
[mandelbrot serial]: [117.844] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [27.866] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [5.914] ms
Wrote image file mandelbrot-task-ispc.ppm
(4.23x speedup from ISPC)
(19.93x speedup from task ISPC)
> ./mandelbrot_ispc -v 2 --tasks
[mandelbrot serial]: [117.066] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [27.534] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [5.250] ms
Wrote image file mandelbrot-task-ispc.ppm
(4.25x speedup from ISPC)
(22.30x speedup from task ISPC)

With avx2-i32x16 and 32 tasks, the speedup reaches 34x 😲

> ./mandelbrot_ispc -v 2 --tasks
[mandelbrot serial]: [116.469] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]: [17.689] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]: [3.413] ms
Wrote image file mandelbrot-task-ispc.ppm
(6.58x speedup from ISPC)
(34.13x speedup from task ISPC)

Since the CPU does not support AVX-512, I did not push the speedup further.

Compared with the 32-thread run in prog1, this version additionally benefits from vector instructions; moreover, ISPC's task scheduling is more efficient than manually spawning and joining threads.

Program 4: Iterative sqrt (15 points)

With avx2-i32x8 and 64 tasks, the speedups are:

> ./sqrt
[sqrt serial]: [665.442] ms
[sqrt ispc]: [146.958] ms
[sqrt task ispc]: [23.555] ms
(4.53x speedup from ISPC)
(28.25x speedup from task ISPC)


Next, construct inputs that maximize or minimize the relative speedup.

Setting every input to 1 (the initial guess is already converged, so each element does almost no work and overhead dominates), the speedups are:

> ./sqrt
[sqrt serial]: [15.509] ms
[sqrt ispc]: [8.151] ms
[sqrt task ispc]: [8.114] ms
(1.90x speedup from ISPC)
(1.91x speedup from task ISPC)


Setting every input to 2.998 (every element runs the maximum number of iterations and all lanes do identical work — the ideal case for SIMD), the speedups are:

> ./sqrt
[sqrt serial]: [1549.775] ms
[sqrt ispc]: [223.970] ms
[sqrt task ispc]: [38.122] ms
(6.92x speedup from ISPC)
(40.65x speedup from task ISPC)



Finally, write a sqrt program directly with AVX2 intrinsics, referring to the Intel Intrinsics Guide.

There are a few key points to get right; a sketch of the idea follows.
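A hedged sketch of the vectorized loop, assuming the same Newton iteration as the serial version (guess converges to 1/sqrt(x), output is x * guess), threshold 0.00001f, and N a multiple of 8:

#include <immintrin.h>

void sqrtAvx(int N, float initialGuess, float* values, float* output) {
    const __m256 kThreshold = _mm256_set1_ps(0.00001f);
    const __m256 kOne = _mm256_set1_ps(1.f);
    const __m256 kThree = _mm256_set1_ps(3.f);
    const __m256 kHalf = _mm256_set1_ps(0.5f);
    // Clearing the sign bit implements fabs.
    const __m256 kAbsMask = _mm256_castsi256_ps(_mm256_set1_epi32(0x7fffffff));

    for (int i = 0; i < N; i += 8) {
        __m256 x = _mm256_loadu_ps(values + i);
        __m256 guess = _mm256_set1_ps(initialGuess);

        // error = |guess * guess * x - 1|
        __m256 error = _mm256_and_ps(
            _mm256_sub_ps(_mm256_mul_ps(_mm256_mul_ps(guess, guess), x), kOne),
            kAbsMask);
        __m256 notDone = _mm256_cmp_ps(error, kThreshold, _CMP_GT_OQ);

        // Iterate until every lane converges; converged lanes keep their
        // guess via the blend.
        while (_mm256_movemask_ps(notDone) != 0) {
            __m256 g3 = _mm256_mul_ps(_mm256_mul_ps(guess, guess), guess);
            // newGuess = (3*guess - x*guess^3) * 0.5
            __m256 newGuess = _mm256_mul_ps(
                _mm256_sub_ps(_mm256_mul_ps(kThree, guess), _mm256_mul_ps(x, g3)),
                kHalf);
            guess = _mm256_blendv_ps(guess, newGuess, notDone);
            error = _mm256_and_ps(
                _mm256_sub_ps(_mm256_mul_ps(_mm256_mul_ps(guess, guess), x), kOne),
                kAbsMask);
            notDone = _mm256_cmp_ps(error, kThreshold, _CMP_GT_OQ);
        }
        _mm256_storeu_ps(output + i, _mm256_mul_ps(x, guess));
    }
}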

The final results:

> ./sqrt
[sqrt serial]: [669.400] ms
[sqrt avx]: [136.881] ms
[sqrt ispc]: [150.825] ms
[sqrt task ispc]: [23.403] ms
(4.89x speedup from AVX)
(4.44x speedup from ISPC)
(28.60x speedup from task ISPC)

More efficient than the vector code ISPC generates — hand-written, after all 🤣

Program 5: BLAS saxpy (10 points)

With avx2-i32x8 and 64 tasks, the results are:

> ./saxpy
[saxpy ispc]: [10.886] ms [27.376] GB/s [3.674] GFLOPS
[saxpy task ispc]: [12.660] ms [23.540] GB/s [3.159] GFLOPS
(0.86x speedup from use of tasks)

Using tasks actually made things slower. My guess: the core saxpy computation amounts to a single FMA instruction per element, so the overhead of maintaining the task abstraction plus the memory bandwidth become the bottleneck:

result[i] = scale * X[i] + Y[i];

The harness also reports bandwidth and GFLOPS, under the assumption that over the whole computation:

TOTAL_BYTES = 4 * N * sizeof(float);
TOTAL_FLOPS = 2 * N;

The two floating-point operations per element are easy to see; the factor of 4 in TOTAL_BYTES requires thinking about the cache.

FROM PKUFlyingPig:

When the program writes to one element of result, it will first fetch the cache line which contains this element into the cache. This requires one memory operation. Then when this cache line is no longer needed, it will be flushed out of the cache; this requires another memory operation.
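As a sanity check (assuming N = 20 * 1000 * 1000, as in the starter code): TOTAL_BYTES = 4 · N · 4 B ≈ 0.30 GiB, and dividing by the 10.886 ms above gives about 27.4 GB/s, matching the printed bandwidth; likewise TOTAL_FLOPS = 2 · N over the same time gives about 3.7 GFLOPS.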

Program 6: Making K-Means Faster (15 points)

No data for this one, alas……

Assignment 2

after lecture 07 (prog perf 2)

Ref

C++ 并发编程 (C++ concurrent programming)

基于 C++11 实现线程池 (implementing a thread pool based on C++11)

ChunelFeng/CGraph

Test

The tests come in two parts: correctness tests with tests/main.cpp as the entry point, and performance tests with tests/run_test_harness.py as the entry point.

./runtasks -n 8 mandelbrot_chunked
python3 ../tests/run_test_harness.py -t simple_test_sync ping_pong_equal ping_pong_unequal super_light super_super_light recursive_fibonacci math_operations_in_tight_for_loop math_operations_in_tight_for_loop_fewer_tasks math_operations_in_tight_for_loop_fan_in math_operations_in_tight_for_loop_reduction_tree spin_between_run_calls mandelbrot_chunked

To test the asynchronous versions, add the -a flag; the working directory is the part_b folder.

Part A

The following three versions of ITaskSystem need to be implemented.

TaskSystemParallelSpawn

Since at most num_threads threads may be used, it suffices to spawn threads in batches inside run and join each batch; thread reuse is not a concern yet.
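A minimal sketch (run's signature comes from ITaskSystem; the num_threads_ member is assumed to be saved by the constructor):

void TaskSystemParallelSpawn::run(IRunnable* runnable, int num_total_tasks) {
    for (int base = 0; base < num_total_tasks; base += num_threads_) {
        // Spawn one batch of at most num_threads_ threads, then join them.
        int batch = (num_total_tasks - base < num_threads_)
                        ? num_total_tasks - base
                        : num_threads_;
        std::vector<std::thread> threads;
        for (int t = 0; t < batch; t++)
            threads.emplace_back(
                [=] { runnable->runTask(base + t, num_total_tasks); });
        for (auto& th : threads)
            th.join();
    }
}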

TaskSystemParallelThreadPoolSpinning

Thread reuse now matters, so a simple thread pool is needed.

The task type in the pool is fixed to void(), so there is no need to wrestle with heavy template machinery.

To let run monitor task completion, use the standard concurrency primitives std::packaged_task and std::future. Once every task has been submitted and is executing, the future queue can be accessed safely; the future queue is updated before a worker thread executes a task.

Calling get on a future blocks the current thread, so this part of run is not a busy loop.

std::future::get

A mutex protects concurrent access to the task queue and the future queue, following this pattern:

while (true) {
    mu.lock();
    if (!qu.empty()) {
        // do something with qu
    }
    mu.unlock();
    std::this_thread::yield();
}

Because of the yield here, this is a busy loop; it is optimized away in the next part.

One more detail:

threads_.push_back(std::thread(&TaskSystemParallelThreadPoolSpinning::worker_thread, this));

TaskSystemParallelThreadPoolSleeping

This part borrows heavily from 基于 C++11 实现线程池 (the C++11 thread-pool reference above).

First, for convenience, wrap a concurrency-safe queue to hold tasks.

In addition, worker threads wait on a condition variable; when run submits a task, one worker thread is woken up:

auto task_ptr = std::make_shared<std::packaged_task<void()>>(
    [=] { runnable->runTask(i, num_total_tasks); });
std::function<void()> wrapper_func = [task_ptr]() { (*task_ptr)(); };
task_queue_.enqueue(wrapper_func);
task_queue_cv_.notify_one();
future_queue_.emplace(task_ptr->get_future());

Second, the future queue is now updated in run rather than in the worker threads.

The tricky part is the cleanup process.

The original implementation used an atomic<bool> to indicate whether the pool was shut down; the destructor woke all worker threads and joined them.

However, the following interleaving is possible: a worker thread evaluates the inner while condition below, sees that done_ is false and the queue is empty, and is about to call wait:

while (!done_) {
    {
        std::unique_lock<std::mutex> lock(task_queue_mu_);
        while (task_queue_.empty() && !done_) {
            task_queue_cv_.wait(lock);
        }
        dequeued = task_queue_.dequeue(func);
    }
    if (dequeued) {
        func();
    }
}

At that point the main thread runs: it sets done_ to true, wakes all worker threads, and starts joining this one.

When the worker thread is scheduled again, it calls wait — and now it can never be woken: deadlock.

Since the main thread blocks inside join, there is no way to issue another wakeup.

The workaround is to use wait_for instead of wait. Cleanup time is not counted toward the total, so as long as the timeout is reasonably long, performance is essentially unaffected.
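A minimal sketch of the change (the timeout value is arbitrary):

while (task_queue_.empty() && !done_) {
    // Bounded wait: even if the final notification is missed, the thread
    // re-checks done_ after the timeout instead of sleeping forever.
    task_queue_cv_.wait_for(lock, std::chrono::milliseconds(10));
}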

The root cause is that a single condition variable is serving two purposes:

  • whether the task queue is empty
  • whether the thread pool has been shut down

If the two were decoupled, the problem would likely disappear.

Result

The performance results:

================================================================================
Running task system grading harness... (11 total tests)
- Detected CPU with 8 execution contexts
- Task system configured to use at most 8 threads
================================================================================
================================================================================
Executing test: super_super_light...
Reference binary: ./runtasks_ref_linux
Results for: super_super_light
STUDENT REFERENCE PERF?
[Serial] 6.243 6.289 0.99 (OK)
[Parallel + Always Spawn] 673.277 107.176 6.28 (NOT OK)
[Parallel + Thread Pool + Spin] 46.542 25.794 1.80 (NOT OK)
[Parallel + Thread Pool + Sleep] 80.643 29.738 2.71 (NOT OK)
================================================================================
Executing test: super_light...
Reference binary: ./runtasks_ref_linux
Results for: super_light
STUDENT REFERENCE PERF?
[Serial] 56.501 57.531 0.98 (OK)
[Parallel + Always Spawn] 697.43 125.474 5.56 (NOT OK)
[Parallel + Thread Pool + Spin] 35.857 33.212 1.08 (OK)
[Parallel + Thread Pool + Sleep] 76.662 30.164 2.54 (NOT OK)
================================================================================
Executing test: ping_pong_equal...
Reference binary: ./runtasks_ref_linux
Results for: ping_pong_equal
STUDENT REFERENCE PERF?
[Serial] 904.174 919.997 0.98 (OK)
[Parallel + Always Spawn] 882.033 236.661 3.73 (NOT OK)
[Parallel + Thread Pool + Spin] 179.942 184.491 0.98 (OK)
[Parallel + Thread Pool + Sleep] 194.083 169.073 1.15 (OK)
================================================================================
Executing test: ping_pong_unequal...
Reference binary: ./runtasks_ref_linux
Results for: ping_pong_unequal
STUDENT REFERENCE PERF?
[Serial] 1628.283 1663.599 0.98 (OK)
[Parallel + Always Spawn] 980.627 358.202 2.74 (NOT OK)
[Parallel + Thread Pool + Spin] 301.565 298.065 1.01 (OK)
[Parallel + Thread Pool + Sleep] 285.468 280.122 1.02 (OK)
================================================================================
Executing test: recursive_fibonacci...
Reference binary: ./runtasks_ref_linux
Results for: recursive_fibonacci
STUDENT REFERENCE PERF?
[Serial] 893.467 1263.774 0.71 (OK)
[Parallel + Always Spawn] 385.19 189.118 2.04 (NOT OK)
[Parallel + Thread Pool + Spin] 149.51 204.019 0.73 (OK)
[Parallel + Thread Pool + Sleep] 142.117 186.332 0.76 (OK)
================================================================================
Executing test: math_operations_in_tight_for_loop...
Reference binary: ./runtasks_ref_linux
Results for: math_operations_in_tight_for_loop
STUDENT REFERENCE PERF?
[Serial] 699.348 701.187 1.00 (OK)
[Parallel + Always Spawn] 1020.442 508.353 2.01 (NOT OK)
[Parallel + Thread Pool + Spin] 202.333 215.601 0.94 (OK)
[Parallel + Thread Pool + Sleep] 246.562 219.489 1.12 (OK)
================================================================================
Executing test: math_operations_in_tight_for_loop_fewer_tasks...
Reference binary: ./runtasks_ref_linux
Results for: math_operations_in_tight_for_loop_fewer_tasks
STUDENT REFERENCE PERF?
[Serial] 696.272 700.483 0.99 (OK)
[Parallel + Always Spawn] 797.813 485.306 1.64 (NOT OK)
[Parallel + Thread Pool + Spin] 239.505 243.286 0.98 (OK)
[Parallel + Thread Pool + Sleep] 272.237 270.787 1.01 (OK)
================================================================================
Executing test: math_operations_in_tight_for_loop_fan_in...
Reference binary: ./runtasks_ref_linux
Results for: math_operations_in_tight_for_loop_fan_in
STUDENT REFERENCE PERF?
[Serial] 369.658 368.772 1.00 (OK)
[Parallel + Always Spawn] 522.272 106.377 4.91 (NOT OK)
[Parallel + Thread Pool + Spin] 77.449 77.174 1.00 (OK)
[Parallel + Thread Pool + Sleep] 85.46 73.101 1.17 (OK)
================================================================================
Executing test: math_operations_in_tight_for_loop_reduction_tree...
Reference binary: ./runtasks_ref_linux
Results for: math_operations_in_tight_for_loop_reduction_tree
STUDENT REFERENCE PERF?
[Serial] 369.786 384.031 0.96 (OK)
[Parallel + Always Spawn] 129.87 76.565 1.70 (NOT OK)
[Parallel + Thread Pool + Spin] 67.842 71.859 0.94 (OK)
[Parallel + Thread Pool + Sleep] 64.901 62.265 1.04 (OK)
================================================================================
Executing test: spin_between_run_calls...
Reference binary: ./runtasks_ref_linux
Results for: spin_between_run_calls
STUDENT REFERENCE PERF?
[Serial] 315.139 458.11 0.69 (OK)
[Parallel + Always Spawn] 162.345 233.455 0.70 (OK)
[Parallel + Thread Pool + Spin] 182.367 265.183 0.69 (OK)
[Parallel + Thread Pool + Sleep] 163.849 236.194 0.69 (OK)
================================================================================
Executing test: mandelbrot_chunked...
Reference binary: ./runtasks_ref_linux
Results for: mandelbrot_chunked
STUDENT REFERENCE PERF?
[Serial] 373.527 378.335 0.99 (OK)
[Parallel + Always Spawn] 66.077 56.558 1.17 (OK)
[Parallel + Thread Pool + Spin] 58.571 62.57 0.94 (OK)
[Parallel + Thread Pool + Sleep] 54.34 55.382 0.98 (OK)
================================================================================
Overall performance results
[Serial] : All passed Perf
[Parallel + Always Spawn] : Perf did not pass all tests
[Parallel + Thread Pool + Spin] : Perf did not pass all tests
[Parallel + Thread Pool + Sleep] : Perf did not pass all tests

Since no effort went into optimizing TaskSystemParallelSpawn, its performance is unsurprisingly poor.

A few observations.

Sleep only shows an advantage for long tasks, e.g. ping_pong_unequal; for lighter tasks, the overhead of maintaining the task system becomes the bottleneck.

For spin_between_run_calls, the likely explanation is already in the harness comments:

// Notice that since the number of tasks is small, only a small
// number of available threads can perform work here, while the
// rest spin.

For mandelbrot_chunked, the likely explanation is that the task workload is well balanced, so thread utilization under Always Spawn is no worse than under the Thread Pool.

Part B

Only TaskSystemParallelThreadPoolSleeping needs to change: it must support asynchronous run and schedule tasks with dependencies.

A correct implementation is not complicated: introduce a scheduler thread and a ready-task queue.

Dependencies are tracked with a simple structure along these lines:

std::vector<{id, num_total_tasks, finished_tasks}>

Here id has type TaskID — the return type of the asynchronous run — and must not be confused with runTask's task_id parameter.
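A hedged sketch of the record (field names are mine; TaskID comes from the assignment's interface):

struct TaskRecord {
    TaskID id;            // value returned by the asynchronous run
    int num_total_tasks;  // number of sub-tasks in this bulk launch
    int finished_tasks;   // sub-tasks completed so far
};
std::vector<TaskRecord> records_;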

The synchronous run can then be implemented in terms of the asynchronous run plus sync.
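Sketch (runAsyncWithDeps and sync per the Part B interface):

void TaskSystemParallelThreadPoolSleeping::run(IRunnable* runnable,
                                               int num_total_tasks) {
    // A synchronous run is an async launch with no dependencies plus a sync.
    runAsyncWithDeps(runnable, num_total_tasks, std::vector<TaskID>{});
    sync();
}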

The performance results:

================================================================================
Running task system grading harness... (11 total tests)
- Detected CPU with 8 execution contexts
- Task system configured to use at most 8 threads
================================================================================
================================================================================
Executing test: super_super_light_async...
Reference binary: ./runtasks_ref_linux
Results for: super_super_light_async
STUDENT REFERENCE PERF?
[Serial] 6.099 6.228 0.98 (OK)
[Parallel + Always Spawn] 5.905 106.762 0.06 (OK)
[Parallel + Thread Pool + Spin] 6.595 25.646 0.26 (OK)
[Parallel + Thread Pool + Sleep] 634.362 10.897 58.21 (NOT OK)
================================================================================
Executing test: super_light_async...
Reference binary: ./runtasks_ref_linux
Results for: super_light_async
STUDENT REFERENCE PERF?
[Serial] 56.418 57.362 0.98 (OK)
[Parallel + Always Spawn] 56.383 120.668 0.47 (OK)
[Parallel + Thread Pool + Spin] 63.654 26.483 2.40 (NOT OK)
[Parallel + Thread Pool + Sleep] 722.975 28.171 25.66 (NOT OK)
================================================================================
Executing test: ping_pong_equal_async...
Reference binary: ./runtasks_ref_linux
Results for: ping_pong_equal_async
STUDENT REFERENCE PERF?
[Serial] 913.298 931.521 0.98 (OK)
[Parallel + Always Spawn] 911.16 235.139 3.87 (NOT OK)
[Parallel + Thread Pool + Spin] 1048.457 189.585 5.53 (NOT OK)
[Parallel + Thread Pool + Sleep] 914.774 175.214 5.22 (NOT OK)
================================================================================
Executing test: ping_pong_unequal_async...
Reference binary: ./runtasks_ref_linux
Results for: ping_pong_unequal_async
STUDENT REFERENCE PERF?
[Serial] 1641.079 1673.688 0.98 (OK)
[Parallel + Always Spawn] 1641.178 354.939 4.62 (NOT OK)
[Parallel + Thread Pool + Spin] 1898.547 317.095 5.99 (NOT OK)
[Parallel + Thread Pool + Sleep] 970.493 284.24 3.41 (NOT OK)
================================================================================
Executing test: recursive_fibonacci_async...
Reference binary: ./runtasks_ref_linux
Results for: recursive_fibonacci_async
STUDENT REFERENCE PERF?
[Serial] 902.116 1280.551 0.70 (OK)
[Parallel + Always Spawn] 900.286 190.191 4.73 (NOT OK)
[Parallel + Thread Pool + Spin] 1027.269 200.316 5.13 (NOT OK)
[Parallel + Thread Pool + Sleep] 154.708 185.133 0.84 (OK)
================================================================================
Executing test: math_operations_in_tight_for_loop_async...
Reference binary: ./runtasks_ref_linux
Results for: math_operations_in_tight_for_loop_async
STUDENT REFERENCE PERF?
[Serial] 702.094 705.272 1.00 (OK)
[Parallel + Always Spawn] 699.175 507.874 1.38 (NOT OK)
[Parallel + Thread Pool + Spin] 832.174 193.913 4.29 (NOT OK)
[Parallel + Thread Pool + Sleep] 10441.547 192.014 54.38 (NOT OK)
================================================================================
Executing test: math_operations_in_tight_for_loop_fewer_tasks_async...
Reference binary: ./runtasks_ref_linux
Results for: math_operations_in_tight_for_loop_fewer_tasks_async
STUDENT REFERENCE PERF?
[Serial] 699.177 701.617 1.00 (OK)
[Parallel + Always Spawn] 695.58 492.998 1.41 (NOT OK)
[Parallel + Thread Pool + Spin] 821.325 116.941 7.02 (NOT OK)
[Parallel + Thread Pool + Sleep] 130.087 107.359 1.21 (NOT OK)
================================================================================
Executing test: math_operations_in_tight_for_loop_fan_in_async...
Reference binary: ./runtasks_ref_linux
Results for: math_operations_in_tight_for_loop_fan_in_async
STUDENT REFERENCE PERF?
[Serial] 360.279 360.273 1.00 (OK)
[Parallel + Always Spawn] 359.416 103.933 3.46 (NOT OK)
[Parallel + Thread Pool + Spin] 423.262 61.165 6.92 (NOT OK)
[Parallel + Thread Pool + Sleep] 281.535 57.842 4.87 (NOT OK)
================================================================================
Executing test: math_operations_in_tight_for_loop_reduction_tree_async...
Reference binary: ./runtasks_ref_linux
Results for: math_operations_in_tight_for_loop_reduction_tree_async
STUDENT REFERENCE PERF?
[Serial] 356.873 363.942 0.98 (OK)
[Parallel + Always Spawn] 356.569 72.74 4.90 (NOT OK)
[Parallel + Thread Pool + Spin] 420.686 58.409 7.20 (NOT OK)
[Parallel + Thread Pool + Sleep] 64.009 55.38 1.16 (OK)
================================================================================
Executing test: spin_between_run_calls_async...
Reference binary: ./runtasks_ref_linux
Results for: spin_between_run_calls_async
STUDENT REFERENCE PERF?
[Serial] 313.752 451.405 0.70 (OK)
[Parallel + Always Spawn] 313.145 231.897 1.35 (NOT OK)
[Parallel + Thread Pool + Spin] 365.605 257.833 1.42 (NOT OK)
[Parallel + Thread Pool + Sleep] 164.641 233.714 0.70 (OK)
================================================================================
Executing test: mandelbrot_chunked_async...
Reference binary: ./runtasks_ref_linux
Results for: mandelbrot_chunked_async
STUDENT REFERENCE PERF?
[Serial] 369.846 373.895 0.99 (OK)
[Parallel + Always Spawn] 368.05 57.582 6.39 (NOT OK)
[Parallel + Thread Pool + Spin] 415.671 60.354 6.89 (NOT OK)
[Parallel + Thread Pool + Sleep] 61.402 53.752 1.14 (OK)
================================================================================
Overall performance results
[Serial] : All passed Perf
[Parallel + Always Spawn] : Perf did not pass all tests
[Parallel + Thread Pool + Spin] : Perf did not pass all tests
[Parallel + Thread Pool + Sleep] : Perf did not pass all tests

Note that except for TaskSystemParallelThreadPoolSleeping, the async implementations of the other task systems are serial.

A few observations.

The two slowest tests both have relatively complex dependency graphs. In super_super_light_async the dependencies form a long chain:

A --> B --> C --> ... --> D

In math_operations_in_tight_for_loop_async the dependencies form a binary tree.

Moreover, the tasks in these two tests are quite small, so the dependency bookkeeping in the scheduler thread becomes the bottleneck.

Running perf on super_super_light_async and drawing a flame graph:

(flame graph: f4d234276970451f8ff1aa3508e80b35.svg)

It shows that the scheduler thread spends most of its time checking whether tasks are runnable.

One possible improvement: topologically sort the task queue by the dependency relation, which would cut down the number of runnable checks; see CGraph for a concrete design.

Assignment 3

RTFM

https://docs.nvidia.com/cuda/cuda-c-programming-guide/

Environment Setup

Use Google Colab for free:

from google.colab import drive
drive.mount('/content/drive')
!cd /content/drive/MyDrive/asst3-master/saxpy && make && ./cudaSaxpy

Part 1: CUDA Warm-Up 1: SAXPY (5 pts)

Following An Easy Introduction to CUDA C and C++ | NVIDIA Technical Blog is enough.
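A minimal sketch in the spirit of that post (illustrative names, not necessarily the starter code's; error checking omitted):

__global__ void saxpy_kernel(int N, float alpha, float* x, float* y, float* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)  // guard: N may not divide the grid size evenly
        result[i] = alpha * x[i] + y[i];
}

void saxpyCuda(int N, float alpha, float* xarray, float* yarray, float* resultarray) {
    float *device_x, *device_y, *device_result;
    size_t bytes = N * sizeof(float);
    cudaMalloc(&device_x, bytes);
    cudaMalloc(&device_y, bytes);
    cudaMalloc(&device_result, bytes);
    cudaMemcpy(device_x, xarray, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(device_y, yarray, bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 512;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    saxpy_kernel<<<blocks, threadsPerBlock>>>(N, alpha, device_x, device_y, device_result);
    cudaDeviceSynchronize();

    cudaMemcpy(resultarray, device_result, bytes, cudaMemcpyDeviceToHost);
    cudaFree(device_x);
    cudaFree(device_y);
    cudaFree(device_result);
}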

The results:

mkdir -p objs/
nvcc saxpy.cu -O3 -m64 --gpu-architecture compute_35 -c -o objs/saxpy.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
g++ -m64 -O3 -Wall -o cudaSaxpy objs/main.o objs/saxpy.o -L/usr/local/cuda/lib64/ -lcudart
---------------------------------------------------------
Found 1 CUDA devices
Device 0: Tesla T4
SMs: 40
Global mem: 15110 MB
CUDA Cap: 7.5
---------------------------------------------------------
Running 3 timing tests:
Overall Kernel Duration 0.004634s
Kernel BW by CUDA saxpy: 4.634 ms [241.182 GB/s]
Effective BW by CUDA saxpy: 266.717 ms [4.190 GB/s]
Overall Kernel Duration 0.004607s
Kernel BW by CUDA saxpy: 4.607 ms [242.604 GB/s]
Effective BW by CUDA saxpy: 286.075 ms [3.907 GB/s]
Overall Kernel Duration 0.004610s
Kernel BW by CUDA saxpy: 4.610 ms [242.419 GB/s]
Effective BW by CUDA saxpy: 283.091 ms [3.948 GB/s]

Looking only at kernel execution time, the performance far exceeds Program 5 of Assignment 1.

But cudaMemcpy consumes a large amount of time, limited by the host-to-device transfer bandwidth rather than the kernel.

The measured effective bandwidth on Google Colab's Tesla T4 works out to about 4 GB/s.

Part 2: CUDA Warm-Up 2: Parallel Prefix-Sum (10 pts)

First, one needs to understand the work-efficient parallel prefix-sum algorithm, which runs as follows:

(figure: the upsweep/downsweep scan algorithm — 1a2e35dd53dc4b5ea31936f7d9e11e4a.png)

Here exclusive scan means the following:

(figure: exclusive scan example — e2870a0cd750401faee8e67117c8f72d.png)

A serial version of the algorithm, with + as the binary operation:

void cpu_exclusive_scan(int *start, int *end, int *output) {
    int N = end - start;
    memmove(output, start, N * sizeof(int));
    // upsweep phase
    for (int twod = 1; twod < N / 2; twod *= 2) {
        int twod1 = twod * 2;
        for (int i = 0; i < N; i += twod1) {
            output[i + twod1 - 1] = output[i + twod - 1] + output[i + twod1 - 1];
        }
    }
    output[N - 1] = 0;
    // downsweep phase
    for (int twod = N / 2; twod >= 1; twod /= 2) {
        int twod1 = twod * 2;
        for (int i = 0; i < N; i += twod1) {
            int tmp = output[i + twod - 1];
            output[i + twod - 1] = output[i + twod1 - 1];
            output[i + twod1 - 1] = tmp + output[i + twod1 - 1];
        }
    }
}

A parallel version must be implemented in CUDA. The iterations of the outer for loop depend on one another, while the iterations of the inner for loop are independent, so the idea is to replace the inner loop's computation with a kernel launch. A few points — first, an error-checking macro (a kernel sketch follows it):

#define DEBUG
#ifdef DEBUG
#define cudaCheckError(ans) { cudaAssert((ans), __FILE__, __LINE__); }
inline void cudaAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "CUDA Error: %s at %s:%d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
#else
#define cudaCheckError(ans) ans
#endif

Next, exclusive scan is used to implement find_repeats:

Given a list of integers A, returns a list of all indices i for which A[i] == A[i+1].
For example, given the array {1,2,2,1,1,1,3,5,3,3}, your program should output the array {1,3,4,8}.

Here the parallel exclusive scan computes, in parallel, the output index for each qualifying position.

Using the example above: define an array B with B[i] = 1 if A[i] == A[i+1] and B[i] = 0 otherwise:

B = {0,1,0,1,1,0,0,0,1,0}

Define array C = exclusive_scan(+, B):

C = {0,0,1,1,2,3,3,3,3,4}

Then, for the output array O: wherever B[i] = 1, simply set O[C[i]] = i; and length(O) = C[length(C)-1].

These assignments are independent across different i, so they map efficiently onto a kernel launch.
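A hedged sketch of the two kernels (B, C, O as defined above; names are mine):

__global__ void mark_repeats(int* A, int* B, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N - 1)
        B[i] = (A[i] == A[i + 1]) ? 1 : 0;  // B marks qualifying positions
}

// After C = exclusive_scan(+, B), scatter each qualifying index into O.
__global__ void scatter_indices(int* B, int* C, int* O, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N - 1 && B[i] == 1)
        O[C[i]] = i;
}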

The final performance results:

-------------------------
Scan Score Table:
-------------------------
-------------------------------------------------------------------------
| Element Count | Ref Time | Student Time | Score |
-------------------------------------------------------------------------
| 1000000 | | 0.824 | 0 |
| 10000000 | | 8.094 | 0 |
| 20000000 | | 15.541 | 0 |
| 40000000 | | 30.654 | 0 |
-------------------------------------------------------------------------
| | Total score: | 0/5 |
-------------------------------------------------------------------------
-------------------------
Find_repeats Score Table:
-------------------------
-------------------------------------------------------------------------
| Element Count | Ref Time | Student Time | Score |
-------------------------------------------------------------------------
| 1000000 | | 1.306 | 0 |
| 10000000 | | 10.950 | 0 |
| 20000000 | | 19.956 | 0 |
| 40000000 | | 38.701 | 0 |
-------------------------------------------------------------------------
| | Total score: | 0/5 |
-------------------------------------------------------------------------

The scores read 0 because the reference binary is incompatible with the installed toolchain:

UP WARNING: A CUDA error occurred: code=222, the provided PTX was compiled with an unsupported toolchain. 42

Comparing against the numbers in README.md, the performance gap is small.

Part 3: A Simple Circle Renderer (85 pts)

TODO

Assignment 4

Ref

https://github.com/aksbaih/cs149-asst4

RTFM

Visit https://www.openmp.org/specifications/ for the latest specifications.

OpenMP for automatically splits for loop iterations for us.

But, depending on our program, the default behavior may not be ideal.

For loops where each iteration takes roughly equal time, static schedules work best, as they have little overhead.

For loops where each iteration can take very different amounts of time, dynamic schedules work best, as the work will be split more evenly across threads.

Specifying chunks, or using a guided schedule, provides a trade-off between the two.

Choosing the best schedule depends on understanding your loop.

Jaka’s Corner OpenMP series

Environment Setup

Install OpenMP:

install openmp

Build with CMake:

https://cliutils.gitlab.io/modern-cmake/chapters/packages/OpenMP.html

Representing Graphs

The graph is directed, and the data layout is compact:

struct graph
{
    // Number of edges in the graph
    int num_edges;
    // Number of vertices in the graph
    int num_nodes;
    // The node reached by vertex i's first outgoing edge is given by
    // outgoing_edges[outgoing_starts[i]]. To iterate over all
    // outgoing edges, please see the top-down bfs implementation.
    int* outgoing_starts;
    Vertex* outgoing_edges;
    int* incoming_starts;
    Vertex* incoming_edges;
};

Vertex ids range from 0 to num_nodes - 1.

For vertex i, the endpoints of all its outgoing edges are stored in outgoing_edges[j], where j ranges from outgoing_starts[i] to outgoing_starts[i+1] - 1.

Essentially, traversal boils down to gather and scatter operations.
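For example, iterating over vertex i's outgoing edges (the last vertex's range ends at num_edges):

int start = g->outgoing_starts[i];
int end = (i == g->num_nodes - 1) ? g->num_edges : g->outgoing_starts[i + 1];
for (int j = start; j < end; j++) {
    Vertex neighbor = g->outgoing_edges[j];
    // ... process edge i -> neighbor
}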

Page Rank

Algorithm

https://www.dbs.ifi.lmu.de/Lehre/BigData-Management&Analytics/WS15-16/Chapter-8_Link_Analysis_part1.pdf

First, a quick introduction to the PageRank algorithm, which scores the importance of web pages (see the slides linked above).

The core idea of PageRank is that a page is more important if it has more in-links.

This motivates defining a page's rank as

r_j = \sum_{i \to j} \frac{r_i}{d_i}

where d_i is the number of out-links of page i.

This definition is recursive; the recursion starts from the initial condition

r_i = \frac{1}{N}

where N is the total number of vertices.

The update can be written in matrix form:

R = M \cdot R

An example:

(figure: PageRank example — 79017dd3fa8546d3b92a647504e081d5.png)

In practice, however, this algorithm runs into two problems:

  1. Some pages are dead ends (have no out-links)

Such pages leak importance out of the system.

Fix: pretend these pages link to every page.

  2. Spider traps (all out-links are within a group)

Such pages eventually absorb all of the importance.

Fix: with some probability, jump out of the group to a random page.

(figure: 1e7c196d3d9141cead3d6c04be10cb9e.png)

The update rule then becomes

r_j = \beta \sum_{i \to j} \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N} + \beta \sum_{k \,:\, d_k = 0} \frac{r_k}{N}

where the last term redistributes the rank of the dangling pages (those with no out-links) evenly over all pages.

In the actual code, damping is the parameter \beta and convergence is the convergence threshold.

Note, however, that convergence of the iteration is not guaranteed in general; it depends on the properties of the particular graph.

Implementation

First download the graph data: http://cs149.stanford.edu/cs149asstdata/all_graphs.tgz

main.cpp builds pr, which takes the full path to a graph; grade.cpp builds pr_grader, which takes the directory containing the graphs:

./pr ~/Desktop/shared/graphs/tiny.graph 8
./pr_grader ~/Desktop/shared/graphs/

The serial implementation of the algorithm above is trivial.

To improve performance there are a few key points; a hedged sketch of one parallel iteration follows.
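A sketch of one OpenMP iteration (buffer names score_old/score_new and the out_degree helper are mine; damping and convergence as above; graph fields from the struct in Representing Graphs):

// out-degree of vertex i (CSR layout; the last vertex ends at num_edges)
static inline int out_degree(const graph* g, int i) {
    int end = (i == g->num_nodes - 1) ? g->num_edges : g->outgoing_starts[i + 1];
    return end - g->outgoing_starts[i];
}

bool converged = false;
while (!converged) {
    // Rank leaked by dangling pages (no out-links), shared by every j.
    double dangling = 0.0;
    #pragma omp parallel for reduction(+ : dangling)
    for (int i = 0; i < g->num_nodes; i++)
        if (out_degree(g, i) == 0)
            dangling += score_old[i];

    double global_diff = 0.0;
    #pragma omp parallel for reduction(+ : global_diff)
    for (int j = 0; j < g->num_nodes; j++) {
        double sum = 0.0;
        int start = g->incoming_starts[j];
        int end = (j == g->num_nodes - 1) ? g->num_edges : g->incoming_starts[j + 1];
        for (int e = start; e < end; e++) {
            int i = g->incoming_edges[e];
            sum += score_old[i] / out_degree(g, i);
        }
        score_new[j] = damping * (sum + dangling / g->num_nodes)
                     + (1.0 - damping) / g->num_nodes;
        global_diff += fabs(score_new[j] - score_old[j]);
    }
    std::swap(score_old, score_new);  // double* buffers
    converged = (global_diff < convergence);
}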

Result

In practice the graph-loading time far exceeds the compute time.

> ./pr ~/Desktop/shared/graphs/com-orkut_117m.graph
----------------------------------------------------------
Max system threads = 8
----------------------------------------------------------
Loading graph...
Graph stats:
Edges: 117185083
Nodes: 3072441
----------------------------------------------------------
Running with 1 threads
Testing Correctness of Page Rank
----------------------------------------------------------
Running with 2 threads
Testing Correctness of Page Rank
----------------------------------------------------------
Running with 4 threads
Testing Correctness of Page Rank
----------------------------------------------------------
Running with 8 threads
Testing Correctness of Page Rank
----------------------------------------------------------
Your Code: Timing Summary
Threads Time (Speedup)
1: 6.7025 (1.0000x)
2: 3.5078 (1.9108x)
4: 2.0653 (3.2452x)
8: 1.2832 (5.2234x)
----------------------------------------------------------
Reference: Timing Summary
Threads Time (Speedup)
1: 6.2877 (1.0000x)
2: 4.2482 (1.4801x)
4: 2.5614 (2.4548x)
8: 1.7962 (3.5005x)
----------------------------------------------------------
Correctness:
Relative Speedup to Reference:
Threads Speedup
1: 0.94x
2: 1.21x
4: 1.24x
8: 1.40x
> ./pr_grader ~/Desktop/shared/graphs/
Max system threads = 8
Running with 8 threads
Graph: soc-livejournal1_68m.graph
ref_time: 2.99031s
stu_time: 1.46187s
Graph: com-orkut_117m.graph
ref_time: 1.92497s
stu_time: 1.31045s
Graph: rmat_200m.graph
ref_time: 16.1516s
stu_time: 12.896s
-------------------------------------------
SCORES :
-------------------------------------------
soc-livejournal1_68m.graph | 4.00000 / 4 |
-------------------------------------------
com-orkut_117m.graph | 4.00000 / 4 |
-------------------------------------------
rmat_200m.graph | 4.00000 / 4 |
-------------------------------------------
TOTAL | 12.00000 / 12 |
-------------------------------------------

pr_grader did not test random_500m.graph — not enough memory.

Top Down

The starter code describes the BFS frontier with the following structure:

struct vertex_set {
    // # of vertices in the set
    int count;
    // max size of buffer vertices
    int max_vertices;
    // array of vertex ids in set
    int *vertices;
};

The top_down_step function needs to be parallelized; the idea is sketched after the OpenMP notes below.

A few points on using OpenMP.

  1. #pragma omp parallel and #pragma omp for

Their meaning:

#pragma omp parallel [clauses]
{ // parallel region
    #pragma omp for [clauses]
    for (...)
    {
        // body
    }
}

#pragma omp parallel for is shorthand for #pragma omp parallel plus #pragma omp for, i.e. for the case where the parallel region contains only a single #pragma omp for:

#pragma omp parallel for [clauses]
for (...)
{
    // body
}

  2. The default(none) clause

The default(none) clause forces a programmer to explicitly specify the data-sharing attributes of all variables in a parallel region. Using this clause then forces the programmer to think about data-sharing attributes.

By default, clang-tidy checks whether #pragma omp parallel uses default(none).

With default(none), every variable shared between threads must be listed explicitly in the shared clause.

Variables defined inside the parallel region are treated by OpenMP as thread-private — for example, the local frontier in the sketch below.

With the default(none) clause, performance seemed to improve a little.

  3. Protecting the critical section

Consider #pragma omp atomic capture:

#pragma omp atomic capture
structured-block

In practice, the structured-block here is

{
    v = x;
    x binop= expr;
}

where v and expr are thread-private and x is shared between threads.

#pragma omp critical also works; RTFM for the details.
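Putting the pieces together, a hedged sketch of the parallel top_down_step (vertex_set, distances, and NOT_VISITED_MARKER come from the starter code; the per-thread local_frontier is my own):

void top_down_step(graph* g, vertex_set* frontier, vertex_set* new_frontier,
                   int* distances) {
    #pragma omp parallel default(none) shared(g, frontier, new_frontier, distances)
    {
        // Per-thread buffer avoids contention on the shared frontier.
        std::vector<int> local_frontier;

        #pragma omp for
        for (int i = 0; i < frontier->count; i++) {
            int node = frontier->vertices[i];
            int start_edge = g->outgoing_starts[node];
            int end_edge = (node == g->num_nodes - 1)
                               ? g->num_edges
                               : g->outgoing_starts[node + 1];
            for (int e = start_edge; e < end_edge; e++) {
                int outgoing = g->outgoing_edges[e];
                // CAS ensures exactly one thread claims each vertex.
                if (distances[outgoing] == NOT_VISITED_MARKER &&
                    __sync_bool_compare_and_swap(&distances[outgoing],
                                                 NOT_VISITED_MARKER,
                                                 distances[node] + 1)) {
                    local_frontier.push_back(outgoing);
                }
            }
        }

        // Reserve a contiguous slot in the shared frontier atomically.
        int index;
        #pragma omp atomic capture
        {
            index = new_frontier->count;
            new_frontier->count += (int)local_frontier.size();
        }
        for (size_t k = 0; k < local_frontier.size(); k++)
            new_frontier->vertices[index + k] = local_frontier[k];
    }
}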

Bottom Up

Pseudocode:

for each vertex v in graph:
    if v has not been visited AND
       v shares an incoming edge with a vertex u on the frontier:
        add vertex v to (new) frontier;

There is no need to record the frontier's vertex ids; recording membership is enough.

So the BFS frontier is described with the following structure instead:

struct vertex_bitmap {
    int count;
    bool *visited;
};

The key subtlety: vertices newly visited in a round must not join that round's frontier, only the next round's. In other words, Bottom Up and Top Down must produce identical frontiers in every round.

As for data races, only the new_bitmap_count variable needs a reduction; everything else is independent across threads.
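A hedged sketch of one bottom-up step (assumes new_frontier->visited has been cleared before the call):

void bottom_up_step(graph* g, vertex_bitmap* frontier,
                    vertex_bitmap* new_frontier, int* distances) {
    int count = 0;
    // Only the frontier count needs a reduction; each v is touched by
    // exactly one iteration, so the rest is race-free.
    #pragma omp parallel for reduction(+ : count) schedule(dynamic, 1024)
    for (int v = 0; v < g->num_nodes; v++) {
        if (distances[v] != NOT_VISITED_MARKER) continue;
        int start = g->incoming_starts[v];
        int end = (v == g->num_nodes - 1) ? g->num_edges
                                          : g->incoming_starts[v + 1];
        for (int e = start; e < end; e++) {
            int u = g->incoming_edges[e];
            if (frontier->visited[u]) {           // u is on the current frontier
                distances[v] = distances[u] + 1;
                new_frontier->visited[v] = true;  // next round's frontier
                count++;
                break;
            }
        }
    }
    new_frontier->count = count;
}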

Hybrid

Hybrid combines the strengths of the Top Down and Bottom Up strategies.

Ideally, as BFS proceeds, the frontier count first grows and then shrinks.

So consider the following strategy. Switching between the two modes has a cost, and switching from Top Down to Bottom Up is cheaper than switching back (because the vertex_bitmap does not explicitly record the frontier's vertex ids), so only a single switch is used.

As for choosing the threshold: under the assumption of a complete directed graph, one round of Top Down touches about

\text{frontier-count} \cdot \frac{D}{2}

vertices, while one round of Bottom Up touches about

N \cdot \frac{D}{2}

vertices. The threshold is therefore set to N, where N is the total number of vertices.

Result

> ./bfs ~/Desktop/shared/graphs/com-orkut_117m.graph
----------------------------------------------------------
Max system threads = 8
----------------------------------------------------------
Loading graph...
Graph stats:
Edges: 117185083
Nodes: 3072441
----------------------------------------------------------
Running with 1 threads
Testing Correctness of Top Down
Testing Correctness of Bottom Up
[Switching to bottom up mode with frontier=777010]
Testing Correctness of Hybrid
----------------------------------------------------------
Running with 2 threads
Testing Correctness of Top Down
Testing Correctness of Bottom Up
[Switching to bottom up mode with frontier=777010]
Testing Correctness of Hybrid
----------------------------------------------------------
Running with 4 threads
Testing Correctness of Top Down
Testing Correctness of Bottom Up
[Switching to bottom up mode with frontier=777010]
Testing Correctness of Hybrid
----------------------------------------------------------
Running with 8 threads
Testing Correctness of Top Down
Testing Correctness of Bottom Up
[Switching to bottom up mode with frontier=777010]
Testing Correctness of Hybrid
----------------------------------------------------------
Your Code: Timing Summary
Threads Top Down Bottom Up Hybrid
1: 0.38 (1.00x) 0.50 (1.00x) 0.05 (1.00x)
2: 0.27 (1.44x) 0.25 (2.05x) 0.03 (1.65x)
4: 0.16 (2.35x) 0.14 (3.68x) 0.02 (2.48x)
8: 0.10 (3.84x) 0.10 (4.93x) 0.02 (3.25x)
----------------------------------------------------------
Reference: Timing Summary
Threads Top Down Bottom Up Hybrid
1: 0.38 (1.00x) 0.50 (1.00x) 0.07 (1.00x)
2: 0.23 (1.65x) 0.25 (2.01x) 0.07 (1.10x)
4: 0.16 (2.32x) 0.15 (3.38x) 0.06 (1.27x)
8: 0.15 (2.61x) 0.10 (5.19x) 0.05 (1.46x)
----------------------------------------------------------
Correctness:
Speedup vs. Reference:
Threads Top Down Bottom Up Hybrid
1: 0.99 1.01 1.37
2: 0.86 1.02 2.05
4: 1.00 1.10 2.67
8: 1.46 0.96 3.06
> ./bfs_grader ~/Desktop/shared/graphs
Max system threads = 8
Running with 8 threads
Graph: grid1000x1000.graph
Top down bfs
ref_time: 0.0333177s
stu_time: 0.214779s
Bottom up bfs
ref_time: 0.719265s
stu_time: 0.703228s
Hybrid bfs
ref_time: 0.292586s
stu_time: 0.198405s
Graph: soc-livejournal1_68m.graph
Top down bfs
ref_time: 0.144646s
stu_time: 0.104211s
Bottom up bfs
ref_time: 0.0981488s
stu_time: 0.125747s
Hybrid bfs
[Switching to bottom up mode with frontier=1174905]
ref_time: 0.0638454s
stu_time: 0.0495188s
Graph: com-orkut_117m.graph
Top down bfs
ref_time: 0.172277s
stu_time: 0.0954649s
Bottom up bfs
ref_time: 0.110134s
stu_time: 0.10408s
Hybrid bfs
[Switching to bottom up mode with frontier=777010]
ref_time: 0.0509855s
stu_time: 0.0175802s
Graph: rmat_200m.graph
Top down bfs
ref_time: 1.60406s
stu_time: 1.58742s
Bottom up bfs
ref_time: 1.00752s
stu_time: 1.13439s
Hybrid bfs
[Switching to bottom up mode with frontier=10172240]
ref_time: 0.756962s
stu_time: 0.886052s
--------------------------------------------------------------------------
SCORES : | Top-Down | Bott-Up | Hybrid |
--------------------------------------------------------------------------
grid1000x1000.graph | 0.40 / 2 | 3.00 / 3 | 3.00 / 3 |
--------------------------------------------------------------------------
soc-livejournal1_68m.graph | 2.00 / 2 | 3.00 / 3 | 3.00 / 3 |
--------------------------------------------------------------------------
com-orkut_117m.graph | 2.00 / 2 | 3.00 / 3 | 3.00 / 3 |
--------------------------------------------------------------------------
rmat_200m.graph | 7.00 / 7 | 8.00 / 8 | 8.00 / 8 |
--------------------------------------------------------------------------
TOTAL | 45.40 / 47 |
--------------------------------------------------------------------------

As before, bfs_grader did not test random_500m.graph — not enough memory.