A multi-teacher knowledge distillation model compression algorithm for deep neural network

Application of Electronic Technique, 2023, Issue 8

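Gu Mingzhu 1,2, Ming Ruicheng 2, Qiu Chuangyi 1,2, Wang Xinwen 1,2

(1. School of Advanced Manufacturing, Fuzhou University, Quanzhou 362000, Fujian, China; 2. Quanzhou Institute of Equipment Manufacturing, Haixi Institutes, Chinese Academy of Sciences, Quanzhou 362000, Fujian, China)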

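Abstract: To minimize the accuracy loss incurred when large deep learning models are compressed for deployment on devices with limited computing power and storage, this paper studies knowledge distillation as a model compression method and proposes an improved multi-teacher knowledge distillation compression algorithm with teacher screening. Exploiting the ensemble advantage of multiple teacher models, the algorithm uses each teacher's prediction cross-entropy as the quantitative screening criterion to select the better-performing teachers to guide the student, lets the student model extract information starting from the teachers' feature layers, and gives better-performing teachers more say in the guidance. Experiments with classification models such as VGG13 on the CIFAR100 dataset show that, for the same final student model size, the proposed method achieves better accuracy than other compression algorithms.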

Keywords: model compression; knowledge distillation; multi-teacher model; cross-entropy; feature layer

CLC number: TP399    Document code: A    DOI: 10.16157/j.issn.0258-7998.233812
Chinese citation format: 顾明珠，明瑞成，邱创一，等. 一种多教师模型知识蒸馏深度神经网络模型压缩算法[J]. 电子技术应用，2023，49(8)：7-12.
English citation format: Gu Mingzhu, Ming Ruicheng, Qiu Chuangyi, et al. A multi-teacher knowledge distillation model compression algorithm for deep neural network[J]. Application of Electronic Technique, 2023, 49(8): 7-12.

A multi-teacher knowledge distillation model compression algorithm for deep neural network

Gu Mingzhu 1,2, Ming Ruicheng 2, Qiu Chuangyi 1,2, Wang Xinwen 1,2

(1. School of Advanced Manufacturing, Fuzhou University, Quanzhou 362000, China; 2. Quanzhou Institute of Equipment Manufacturing, Haixi Institutes, Chinese Academy of Sciences, Quanzhou 362000, China)

Abstract: To minimize accuracy loss when large deep learning models are compressed and deployed to devices with limited computing power and storage capacity, knowledge distillation is investigated as a model compression method, and an improved multi-teacher knowledge distillation compression algorithm with screening is proposed. Taking advantage of the ensemble of multiple teacher models, the better-performing teacher models are screened for student instruction using the prediction cross-entropy of each teacher model as the quantitative criterion, the student model is allowed to extract information starting from the feature layers of the teacher models, and the better-performing teacher models are given more say in the instruction. Experimental results with classification models such as VGG13 on the CIFAR100 dataset show that, for the same final student model size, the multi-teacher compression method in this paper achieves better accuracy than other compression algorithms.

Key words: model compression; knowledge distillation; multi-teacher model; cross entropy; feature layer
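
The screening and weighting idea in the abstract can be illustrated with a small sketch. The excerpt available here does not give the paper's actual selection rule or weighting formula, so everything specific below is an assumption made for illustration: the function names, the `keep` parameter, the use of a softmax over negative cross-entropy as the weight, and the temperature value. The sketch also covers only output-level guidance and does not model the feature-layer transfer the abstract mentions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of a model's prediction against the true class index."""
    return -math.log(max(softmax(logits)[label], 1e-12))


def screen_and_weight(teacher_logits, label, keep=2):
    """Assumed rule (not taken from the paper): keep the `keep` teachers with the
    lowest cross-entropy and weight them with a softmax over the negative
    cross-entropy, so a better teacher receives a larger weight."""
    losses = [cross_entropy(logits, label) for logits in teacher_logits]
    kept = sorted(range(len(losses)), key=lambda i: losses[i])[:keep]
    weights = softmax([-losses[i] for i in kept])
    return kept, weights


def blended_soft_target(teacher_logits, kept, weights, temperature=4.0):
    """Weighted average of the kept teachers' temperature-softened predictions."""
    num_classes = len(teacher_logits[0])
    target = [0.0] * num_classes
    for idx, weight in zip(kept, weights):
        probs = softmax(teacher_logits[idx], temperature)
        for c in range(num_classes):
            target[c] += weight * probs[c]
    return target


# Toy example: three hypothetical teachers, three classes, true class 0.
teacher_logits = [
    [2.0, 0.5, 0.1],  # correct, fairly confident
    [0.2, 1.5, 0.3],  # prefers the wrong class
    [3.0, 0.2, 0.1],  # correct, most confident
]
kept, weights = screen_and_weight(teacher_logits, label=0, keep=2)
soft_target = blended_soft_target(teacher_logits, kept, weights)
print(kept, weights, soft_target)
```

In this toy case the teacher that prefers the wrong class has the highest cross-entropy and is screened out, and the more confident of the two remaining teachers receives the larger weight in the blended target.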

0 Introduction

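As artificial intelligence advances, the computing power and storage required of the hardware have become the biggest obstacles to deploying ever larger models in real industrial and social settings. To accelerate the widespread use of artificial intelligence in daily life and industry, a growing number of researchers are studying lightweight compression of deep learning models [1], and knowledge distillation has become one of the mainstream approaches to making models lightweight [2].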

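Knowledge distillation uses an already trained large deep learning model to help train a small one. The large model, called the teacher model, supervises and assists the training of the small model. The small model, called the student model, receives knowledge from the teacher model and is the one ultimately deployed. Since Hinton [3] first proposed the concept of knowledge distillation in 2015, researchers have studied how to preserve the accuracy of the student model after compression.

By the number of teacher models involved, knowledge distillation falls into two categories: single-teacher distillation and multi-teacher distillation. Single-teacher distillation uses only one teacher model to distill the student model. For example, Romero [4] designed the student network to be thinner and deeper and connected the feature layers of the student and teacher models, so that the student extracts knowledge from the teacher's feature layers. Chen et al. [5] added a GAN structure to distillation to imitate the original dataset and enlarge the amount of data supplied to the new model for knowledge distillation. Liu et al. [6] introduced NAS into knowledge distillation, selecting from NAS the student network that best fits the structure of the teacher model to achieve the best distillation effect, but the large memory that NAS requires makes the method hard to adopt widely. Dai et al. [7] proposed exploiting the difference between the teacher's and the student's predictions on individual instances, defined an evaluation metric for this instance difference, and distilled using the distinguishable instances. Because most of the student network's knowledge comes from the teacher model, the accuracy ceiling of a student distilled from a single teacher is bounded by that teacher and is hard to raise substantially.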
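
As a concrete reference for the single-teacher setting described above, the following is a minimal sketch of the classic soft-target distillation loss in the style of Hinton [3]: a temperature-softened KL term between teacher and student, scaled by the squared temperature, mixed with the ordinary cross-entropy on the true label. The function names, the default `temperature` and `alpha` values, and the toy logits are illustrative assumptions and are not taken from this paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label).

    The default temperature and alpha are illustrative, not values from the paper.
    """
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    # Ordinary cross-entropy of the student against the true label (temperature 1).
    hard = -log_softmax(student_logits)[label]
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard


# Toy logits for a 3-class problem whose true class is 0.
teacher = [3.0, 1.0, 0.2]
student_close = [2.5, 1.0, 0.3]  # roughly agrees with the teacher
student_far = [0.2, 1.0, 3.0]    # disagrees with the teacher and the label
loss_close = kd_loss(student_close, teacher, label=0)
loss_far = kd_loss(student_far, teacher, label=0)
print(loss_close, loss_far)
```

A student whose outputs track the teacher gets a small loss, while one that disagrees with both the teacher and the label is penalized by both terms. Since the soft term only pulls the student toward one teacher's outputs, the sketch also makes the limitation noted above visible: the student has no signal from that term beyond what the single teacher provides.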

For the full text of this article, please download: https://www.chinaaet.com/resource/share/2000005484

