StanFord CS149 Assignment1 简记

2024-02-01 14:03:10 # Labs #C++ #Lab #Parallel #CUDA

环境

于 Windows11, WSL2, Ubuntu22.04 环境下进行作业

1 2	wget https://github.com/ispc/ispc/releases/download/v1.22.0/ispc-v1.22.0-linux.tar.gz tar -xvf ispc-v1.22.0-linux.tar.gz

1 2	# 加入 ~/.bashrc export PATH=/home/lozical/Apps/ispc-v1.22.0-linux/bin:$PATH

1
2
3

git clone git@github.com:stanford-cs149/asst1.git
cd asst1/prog3_mandelbrot_ispc
make

make 成功即环境搭建完成

Program 1

main.cpp

读取命令行参数
串行计算, 并行计算
对比答案并计算加速效果

mandelbrotSerial 函数

根据 i, j 计算所在点的值

mandelbrotThread 函数

每个线程记录参数
执行 1 ~ numThreads 这些任务
主线程执行第 0 个任务
等待其他线程任务执行完毕

任务 1

根据注释, 完成 workerThreadStart 函数即可

关键点是填充 mandelbrotSerial 函数中 startRow 和 totalRows 参数

暂时把行平均分配给这些线程
$$
averageRows = height / numThreads
$$
对于所有的线程
$$
StartRow = threadId * averageRows
$$

对于前边的线程, $totalRows = averageRows$

对于最后一个线程, 还要处理边角料部分

1 2	if (height < startRow + numRows + numRows) numRows = height - startRow;

最后完成 workerThreadStart 函数

void workerThreadStart(WorkerArgs *const args)
{
    auto id = args->threadId;
    int width = args->width, height = args->height;

    int numRows = height / args->numThreads;
    int startRow = id * numRows;
    if (height < startRow + numRows + numRows)
        numRows = height - startRow;

    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                     width, height, startRow, numRows,
                     args->maxIterations, args->output);
}

任务 2

对于 --view 1

(2.00x speedup from 2 threads)
(1.64x speedup from 3 threads)
(2.43x speedup from 4 threads)
(2.48x speedup from 5 threads)
(3.26x speedup from 6 threads)
(3.44x speedup from 7 threads)
(4.08x speedup from 8 threads)

因为负载均衡的问题, 平均分配明显有瓶颈

任务 3

16 线程测试结果, 负载不均衡

[thread 0]:             [0.716] ms
[thread 15]:            [0.748] ms
[thread 1]:             [3.796] ms
[thread 14]:            [4.365] ms
[thread 13]:            [6.186] ms
[thread 2]:             [6.479] ms
[thread 3]:             [17.412] ms
[thread 12]:            [17.786] ms
[thread 4]:             [21.650] ms
[thread 11]:            [22.190] ms
[thread 5]:             [26.202] ms
[thread 10]:            [27.264] ms
[thread 6]:             [34.356] ms
[thread 9]:             [34.580] ms
[thread 7]:             [37.976] ms
[thread 8]:             [38.341] ms

任务 4

重新写一个函数 mandelbrotStepSerial

分块的方式从原来的整块分配变为交错分配

void mandelbrotStepSerial(
    float x0, float y0, float x1, float y1,
    int width, int height,
    int startRow, int step,
    int maxIterations,
    int output[])
{
    float dx = (x1 - x0) / width;
    float dy = (y1 - y0) / height;

    for (int j = startRow; j < height; j+=step) {
        for (int i = 0; i < width; ++i) {
            float x = x0 + i * dx;
            float y = y0 + j * dy;

            int index = (j * width + i);
            output[index] = mandel(x, y, maxIterations);
        }
    }
}

void workerThreadStart(WorkerArgs *const args)
{

    double startTime = CycleTimer::currentSeconds();

    mandelbrotStepSerial(args->x0, args->y0, args->x1, args->y1,
                         args->width, args->height, args->threadId, args->numThreads,
                         args->maxIterations, args->output);

    double endTime = CycleTimer::currentSeconds();

    printf("[thread %d]:\t\t[%.3f] ms\n", args->threadId, (endTime - startTime) * 1000);
}

任务5

如果官方指定的四核 CPU 上, 没有明显提升, 因为最多支持 8 个线程

如果在本机 CPU 上, 有明显提升

Program 2

任务 1

调用写好的函数翻译一下即可, 具体看代码

任务 2

// 2
****************** Printing Vector Unit Statistics *******************
Vector Width:              2
Total Vector Instructions: 162728
Vector Utilization:        82.8%
Utilized Vector Lanes:     269354
Total Vector Lanes:        325456
// 4
****************** Printing Vector Unit Statistics *******************
Vector Width:              4
Total Vector Instructions: 94576
Vector Utilization:        77.5%
Utilized Vector Lanes:     293250
Total Vector Lanes:        378304
// 6
****************** Printing Vector Unit Statistics *******************
Vector Width:              6
Total Vector Instructions: 66444
Vector Utilization:        75.9%
Utilized Vector Lanes:     302617
Total Vector Lanes:        398664
// 8
****************** Printing Vector Unit Statistics *******************
Vector Width:              8
Total Vector Instructions: 51628
Vector Utilization:        74.9%
Utilized Vector Lanes:     309300
Total Vector Lanes:        413024

任务 3

先调用 _cs149_vadd_float 函数把所有的数加起来

最后还剩 VECTOR_WIDTH 个数

迭代 log2(VECTOR_WIDTH) 次, 每次

1 2	_cs149_hadd_float(sum, sum); _cs149_interleave_float(sum, sum);

具体可以自己模拟

Program 3

Part I

理想情况下是八倍加速

// -v 1
(6.00x speedup from ISPC)
// -v 2
(5.30x speedup from ISPC)

原因可能还是像上次一样的负载不均衡问题

Part II

任务 1

// -v 1
(6.09x speedup from ISPC)
(11.85x speedup from task ISPC)
// -v 2
(5.26x speedup from ISPC)
(8.90x speedup from task ISPC)

任务 2

export void mandelbrot_ispc_withtasks(uniform float x0, uniform float y0,
                                      uniform float x1, uniform float y1,
                                      uniform int width, uniform int height,
                                      uniform int maxIterations,
                                      uniform int output[])
{

    uniform int cnt = 32;
    uniform int rowsPerTask = height / cnt;

    // create 2 tasks
    launch[cnt] mandelbrot_ispc_task(x0, y0, x1, y1,
                                     width, height,
                                     rowsPerTask,
                                     maxIterations,
                                     output); 
}

把 cnt 改成自己机器支持的最大超线程数即可

1 2	(5.86x speedup from ISPC) (64.05x speedup from task ISPC)

任务 3

关于线程抽象和 ISPC 任务抽象的区别

抽象级别
线程抽象：更通用, 可以处理粗粒度和细粒度任务
ISPC任务抽象： ISPC任务专门用于数据并行任务, 细粒度任务比较好
创建和同步：
线程抽象：线程通过pthread_create和pthread_join等机制显式创建和加入。通常需要使用锁或屏障等显式同步机制来协调线程。
ISPC任务抽象：在ISPC中，使用launch关键字启动任务，使用sync关键字实现同步。ISPC任务在任务块结束时隐式同步。
开销和可扩展性：
线程抽象：创建和管理线程可能具有较高的开销
ISPC任务抽象： ISPC任务通常针对SIMD架构进行了优化，运行时系统可以高效处理大量任务。对于数据并行工作负载，可扩展性可能更有利。
任务依赖性：
线程抽象：可能需要使用条件变量或其他同步原语来显式管理任务依赖性。
ISPC任务抽象：任务依赖性通常在数据并行结构中隐含处理。依赖关系通过程序中控制流的自然流程解决。

任务数量过多

ISPC任务：启动10,000个ISPC任务可能更有效，由于ISPC的数据并行特性，任务可以被高效地调度和并行执行，特别是在具有SIMD功能的架构上。
线程：启动10,000个线程可能导致更高的开销，由于线程创建、管理的成本，以及对系统资源的潜在争用。对线程管理进行精细调整对于可扩展性至关重要。

Program 4

任务 1

1 2	(4.76x speedup from ISPC) (91.56x speedup from task ISPC)

任务 2

values 都改成 2.999

[sqrt serial]:          [1322.087] ms
[sqrt ispc]:            [193.282] ms
[sqrt task ispc]:       [8.609] ms
                                (6.84x speedup from ISPC)
                                (153.56x speedup from task ISPC)

任务 3

values 都改成 1

[sqrt serial]:          [11.529] ms
[sqrt ispc]:            [5.537] ms
[sqrt task ispc]:       [6.095] ms
                                (2.08x speedup from ISPC)
                                (1.89x speedup from task ISPC)

这样调度成了主要开销

values 索引为 8 倍数值为 2.999 否则为 1

[sqrt serial]:          [176.444] ms
[sqrt ispc]:            [195.748] ms
[sqrt task ispc]:       [9.730] ms
                                (0.90x speedup from ISPC)
                                (18.13x speedup from task ISPC)

这样构造, 在不用 task 的时候效率类似串行计算
再加上 SIMD 的其他开销就比串行计算慢了

任务 4

弃之

Program 5

任务 1

主要的瓶颈是在内存访问上, 计算速度很快, 所以优化不明显

任务 2

因为主要是行缓存读取, 内存一次性取出一行而不是一个 float

任务 3

弃之

Program 6

没数据, 弃之

2024-02-01 14:03:10 # Labs #C++ #Lab #Parallel #CUDA