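An FPGA-based implementation of CNN hardware accelerator

Application of Electronic Technique (AET)

Qiu Zhenbo

College of Photoelectric Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China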

Abstract: This paper proposes a general-purpose FPGA-based CNN hardware accelerator design. For the convolutional layers, which dominate the computation, three forms of parallelism are used: input-channel parallelism, intra-kernel parallelism, and output-channel parallelism, with each degree of parallelism set according to the FPGA's on-chip resources. For data loading, adjacent data are merged into wider words for transfer, which raises the accelerator's effective transfer bandwidth. An input buffer module is designed around row-based data-stream loading; it needs to buffer only two rows of data before convolution can begin, which brings forward the start of computation. The data input, computation, and data output modules are overlapped using loop pipelining, which greatly improves the computing performance of the hardware. Finally, the accelerator is applied to the VGG16 and Darknet-19 networks. Experiments show computing performance of 34.30 GOPS and 33.68 GOPS, and DSP computing efficiency of 79.45% and 78.01%, respectively.

Key words: convolutional neural network acceleration; FPGA; row data loading; module division; pipeline structure

CLC number: TP391 Document code: A DOI: 10.16157/j.issn.0258-7998.234372

Citation format: Qiu Zhenbo. An FPGA-based implementation of CNN hardware accelerator[J]. Application of Electronic Technique, 2023, 49(12): 20-25.
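The three forms of convolution-layer parallelism named in the abstract can be illustrated with a tiled loop nest. The following is a minimal software sketch, not the accelerator's actual design: the function names, the parallelism degrees `tn` and `tm`, the stride-1 unpadded convolution, and the one-cycle-per-tile-pixel timing model are all assumptions made for illustration.

```python
from itertools import product
from math import ceil


def conv_tiled(x, w, tn=2, tm=2):
    """Direct convolution (stride 1, no padding) with a tiled loop nest.

    x: input feature maps, indexed x[n][row][col]
    w: weights, indexed w[m][n][i][j] with a square K x K kernel
    tn, tm: hypothetical input- and output-channel parallelism degrees
    Returns (output maps, modeled cycle count).
    """
    n_ch, height, width = len(x), len(x[0]), len(x[0][0])
    m_ch, k = len(w), len(w[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(m_ch)]
    cycles = 0

    # Sequential loops: channel tiles and output pixels are time-multiplexed.
    for m0, n0 in product(range(0, m_ch, tm), range(0, n_ch, tn)):
        for r, c in product(range(out_h), range(out_w)):
            cycles += 1  # one modeled cycle per tile and output pixel
            # The three loops below stand in for logic that would be
            # fully unrolled into parallel multipliers in hardware.
            for m in range(m0, min(m0 + tm, m_ch)):           # output-channel parallelism
                for n in range(n0, min(n0 + tn, n_ch)):       # input-channel parallelism
                    for i, j in product(range(k), range(k)):  # intra-kernel parallelism
                        out[m][r][c] += w[m][n][i][j] * x[n][r + i][c + j]
    return out, cycles


def modeled_cycles(n_ch, m_ch, out_h, out_w, tn, tm):
    """Closed-form cycle count of the loop nest above."""
    return ceil(m_ch / tm) * ceil(n_ch / tn) * out_h * out_w
```

Under this model, up to `tn * tm * K * K` multiply-accumulate units are active in each cycle, so raising any of the three degrees trades on-chip multipliers for fewer cycles. That is the sense in which the parallelism would be sized to the available FPGA resources.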

0 Introduction

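With the rapid development of deep learning, neural network models have made great technical progress in image recognition, object detection, and image segmentation[1-2]. Compared with traditional algorithms, however, neural networks achieve their high performance at the cost of high computational complexity, which has made acceleration on dedicated hardware a focus of attention in neural network applications. At present there are three main options for hardware acceleration of neural network models: GPU, ASIC, and FPGA. Compared with GPUs, FPGAs offer lower cost and power consumption; compared with ASICs, FPGAs offer flexible model implementation, fast development, and low overall cost. This makes them particularly well suited to deploying neural networks on edge devices, and FPGA-based acceleration of neural network models has therefore become an active research topic[3-5].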

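In most neural network models, the convolutional layers account for more than 90% of the total computation, so a network can be accelerated by executing the convolution operations on an FPGA[6-7]. Reference [6] implements a general matrix multiplication (GEMM) accelerator on an FPGA to accelerate neural networks and obtains good acceleration performance. Reference [7] proposes a matrix multiplication module based on a systolic array structure, applies it to neural network acceleration, and obtains a good performance improvement. References [8-9] study acceleration algorithms for the convolution operation itself. Liang Y et al.[8] implemented a CNN on an FPGA using the two-dimensional Winograd algorithm; compared with a conventional convolution unit, their Winograd-based unit reduces the number of multiplications by 56%. Tahmid Abtahi et al.[9] used the fast Fourier transform (FFT) to optimize the convolutions in the ResNet-20 model and reduced the DSP resource usage of a single convolution unit.

Beyond accelerating the convolution itself, research groups have also studied other aspects of neural network acceleration in depth[10-14]. Reference [10] proposes block convolution, a memory-efficient alternative to conventional convolution that moves the intermediate data buffers entirely from external DRAM to on-chip memory, although accuracy decreases as the number of blocked layers increases. Reference [11] proposes a data transfer optimization that merges the bit widths of adjacent layers and reorders the weight parameters, which increases transfer parallelism while saving channels. References [12-14] adopt ping-pong processing structures to raise the convolution rate in the input module, the convolution unit, and the output module, respectively.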
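The 56% reduction in multiplications cited for the two-dimensional Winograd algorithm is consistent with the common F(2×2, 3×3) variant, which computes a 2×2 output tile with 16 multiplications instead of 36. Whether reference [8] uses exactly this variant is an assumption here. The sketch below shows the one-dimensional building block F(2, 3) and the resulting count; the function names are hypothetical.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """Same two outputs with 4 data multiplications (Winograd F(2, 3))."""
    # Filter-side terms depend only on the weights and can be precomputed.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    return [m1 + m2 + m3, m2 - m3 - m4]


def multiplication_reduction(direct, fast):
    """Fraction of multiplications saved by the fast algorithm."""
    return 1 - fast / direct


# 2-D F(2x2, 3x3): a 2x2 output tile needs 2*2*3*3 = 36 multiplications
# directly, versus 4*4 = 16 element-wise multiplications with Winograd.
saving_2d = multiplication_reduction(2 * 2 * 3 * 3, 4 * 4)
print(f"{saving_2d:.1%}")  # prints 55.6%
```

The saving applies only to multiplications. The input and output transforms add extra additions, which is usually an acceptable trade on an FPGA where DSP multipliers are the scarcer resource.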

For the full text of this article, please download: https://www.chinaaet.com/resource/share/2000005800

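Author information

Qiu Zhenbo

(College of Photoelectric Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China)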

