
论文精翻《Progressive Tandem Learning for Pattern Recognition With Deep Spiking Neural Networks》


目录

0 摘要/Abstract
1 简介/Introduction
2 相关工作/Related Work
3 重新思考ANN-to-SNN的转换/Rethinking ANN-to-SNN Conversion
  3.1 脉冲神经元与ANN神经元/Spiking Neuron Versus ANN Neuron
  3.2 神经离散化与激活量化/Neural Discretization Versus Activation Quantization
  3.3 阈值层归一化/Threshold LayerNorm
  3.4 神经元编码/Neural Coding
4 渐进串联学习/Progressive Tandem Learning
  4.1 串联学习/Tandem Learning
  4.2 渐进串联学习的调度/Scheduling of Progressive Tandem Learning
  4.3 其他硬件约束的优化/Optimizing for Other Hardware Constraints
5 模式分类实验/Experiments on Pattern Classification
  5.1 实验设置/Experimental Setup
  5.2 基于脉冲的端到端学习导致累积梯度近似误差/End-to-End Spike-Based Learning Leads to Accumulated Gradient Approximation Errors
  5.3 Cifar-10和ImageNet-12上的目标识别/Object Recognition on Cifar-10 and ImageNet-12
  5.4 低精度神经形态硬件量化感知训练/Quantization-Aware Training for Low Precision Neuromorphic Hardware
  5.5 基于SNN的快速高效分类/Rapid and Efficient Classification With SNNs
6 信号重构实验/Experiments on Signal Reconstruction
  6.1 图像重建与自动编码器/Image Reconstruction With Autoencoder
  6.2 时域语音分离/Time-Domain Speech Separation
  6.3 实验设置/Experimental Setup
    6.3.1 图像重建/Image Reconstruction
    6.3.2 时域语音分离/Time-Domain Speech Separation
  6.4 实验结果/Experimental Results
    6.4.1 自编码器图像重建/Image Reconstruction With Autoencoder
    6.4.2 时域语音分离/Time-Domain Speech Separation
7 结论/Conclusion


⚠️ 请注意:译文中,加粗文字为译者认为的重点部分,加粗斜体文字为译者觉得难以翻译/翻译不准的部分。

0 摘要/Abstract

脉冲神经网络(SNNs)由于其事件驱动和稀疏通信的特性,在低延迟和高计算效率方面比传统人工神经网络(ANN)显示出明显的优势。然而,深度SNN的训练并不简单。在本文中,我们提出了一种新的ANN-to-SNN转换和分层学习框架,用于快速有效的模式识别,称为渐进串联学习。通过研究ANN和SNN在离散表示空间中的等价性,引入了一种原始网络转换方法,充分利用脉冲计数来近似ANN神经元的激活值。为了补偿由原始网络转换引起的近似误差,我们进一步引入了一种分层学习方法,使用自适应训练调度器来微调网络权重。渐进串联学习框架还允许在训练过程中逐步施加硬件约束,例如有限的权重精度与扇入连接。这样训练的SNN在大规模物体识别、图像重建与语音分离任务上表现出了卓越的分类和回归能力,同时比其他最先进的SNN实现至少减少一个数量级的推理时间与突触操作。因此,这为在功耗预算有限的移动和嵌入式设备上的普及应用提供了无数机会。

Spiking neural networks (SNNs) have shown clear advantages over traditional artificial neural networks (ANNs) for low latency and high computational efficiency, due to their event-driven nature and sparse communication. However, the training of deep SNNs is not straightforward. In this paper, we propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition, which is referred to as progressive tandem learning. By studying the equivalence between ANNs and SNNs in the discrete representation space, a primitive network conversion method is introduced that takes full advantage of spike count to approximate the activation value of ANN neurons. To compensate for the approximation errors arising from the primitive network conversion, we further introduce a layer-wise learning method with an adaptive training scheduler to fine-tune the network weights. The progressive tandem learning framework also allows hardware constraints, such as limited weight precision and fan-in connections, to be progressively imposed during training. The SNNs thus trained have demonstrated remarkable classification and regression capabilities on large-scale object recognition, image reconstruction, and speech separation tasks, while requiring at least an order of magnitude reduced inference time and synaptic operations than other state-of-the-art SNN implementations. It, therefore, opens up a myriad of opportunities for pervasive mobile and embedded devices with a limited power budget.

1 简介/Introduction

经过数亿年的进化,人类的大脑具有令人难以置信的效率,能够执行复杂的模式识别任务。近年来,受分层组织皮层网络启发的深度人工神经网络(ANN)已成为许多模式识别任务的主要方法,并在广泛的应用领域取得了显著的成功,例如语音处理[1],[2],计算机视觉[3],[4],语言理解[5]和机器人[6]。然而,在计算成本和内存使用方面,深度ANN的运行开销是出了名的昂贵。因此,难以将其大规模部署到无处不在的移动和物联网(IoT)设备中。

Human brains, after evolving for many hundreds of millions of years, are incredibly efficient and capable of performing complex pattern recognition tasks. In recent years, the deep artificial neural networks (ANNs) that are inspired by the hierarchically organized cortical networks have become the dominant approach for many pattern recognition tasks and achieved remarkable successes in a wide spectrum of application domains, instances include speech processing [1], [2], computer vision [3], [4], language understanding [5] and robotics [6]. The deep ANNs, however, are notoriously expensive to operate both in terms of computational cost and memory usage. Therefore, they are prohibited from large-scale deployments in pervasive mobile and Internet-of-Things (IoT) devices.

相比之下,成年人的大脑在执行复杂的感知与认知任务时只消耗了大约20瓦,这只相当于一个昏暗灯泡的功耗。虽然许多努力都致力于提高深度神经网络的内存和计算效率,例如网络压缩[8],网络量化[9]与知识蒸馏[10],但更有趣的是利用生物神经系统固有的高效计算范式,这与上述策略有根本不同,并且可能与上述策略集成。

In contrast, the adult’s brains only consume about 20 watts to perform complex perceptual and cognitive tasks that are only equivalent to the power consumption of a dim light bulb [7]. While many efforts are devoted to improving the memory and computational efficiency of deep ANNs, for example, network compression [8], network quantization [9] and knowledge distillation [10], it is more interesting to exploit the efficient computation paradigm inherent to the biological neural systems that are fundamentally different from and potentially integratable with the aforementioned strategies.

脉冲神经网络(Spiking Neural Networks, SNNs)最初被引入是为了研究生物大脑的功能和组织机制。最近的研究表明,深度ANN也受益于生物现实的实现,如事件驱动计算与稀疏通信[11],以提高计算效率。神经形态计算(Neuromorphic computing, NC)是一种新兴的非冯·诺依曼计算范式,旨在利用硅中的SNN[12]模拟生物神经系统。新的神经形态计算架构,包括Tianjic[13]、TrueNorth[14]和Loihi[15],在模式识别任务中显示出引人注目的吞吐量和能源效率,这归功于它们固有的事件驱动计算与计算单元的细粒度并行性。此外,在数据驱动的模式识别任务中,存储与计算的共址部署可以有效地缓解计算单元和内存之间的低带宽问题(即冯·诺依曼瓶颈)。

The spiking neural networks (SNNs) are initially introduced to study the functioning and organizing mechanisms of biological brains. Recent studies have shown that deep ANNs also benefit from biologically realistic implementation, such as event-driven computation and sparse communication [11], for computational efficiency. Neuromorphic computing (NC), as an emerging non-von Neumann computing paradigm, aims to mimic the biological neural systems with SNNs in silicon [12]. The novel neuromorphic computing architectures, including Tianjic [13], TrueNorth [14], and Loihi [15], have shown compelling throughput and energy-efficiency in pattern recognition tasks, crediting to their inherent event-driven computation and fine-grained parallelism of the computing units. Moreover, the colocated memory and computation can effectively mitigate the problem of low bandwidth between the computing units and memory (i.e., von Neumann bottleneck) in data-driven pattern recognition tasks.

然而,训练大规模的脉冲神经网络并部署到这些NC芯片上,用于现实世界的模式识别任务,仍然是一个挑战。由于脉冲神经元函数的离散性与不可微性,广泛用于深度ANN训练的反向传播(BP)算法并不直接适用于SNN。

While it remains a challenge to train large-scale spiking neural networks that can be deployed onto these NC chips for real-world pattern recognition tasks. Due to the discrete and hence non-differentiable nature of spiking neuronal function, the powerful back-propagation (BP) algorithm that is widely used for deep ANN training is not directly applicable to the SNN.

最近的研究表明,脉冲神经元形成的动力系统可以被描述为一个循环ANN[16],从而可以有效地模拟这些泄漏积分器(即脉冲神经元)的阈下膜电位动态。此外,脉冲生成函数的不连续可以通过代理梯度来避免,该梯度提供了真实梯度的无偏估计[17],[18],[19],[20],[21],[22]。这样,可以应用典型误差反向传播时间算法(BPTT)对SNN进行优化。然而,使用BPTT算法优化SNN的计算效率和内存效率都很低,因为脉冲序列通常在时间与空间上都非常稀疏。因此,该技术的可伸缩性还有待提高,例如在一个手势分类任务[19]中,SNN的大小受GPU内存限制。此外,面对长时间或低发射速率的输入脉冲序列,BPTT算法的梯度消失与梯度爆炸问题[23]将对学习产生不利影响。

Recent studies suggest that the dynamical system formed by spiking neurons can be formulated as a recurrent ANN [16], whereby the subthreshold membrane potential dynamics of these leaky integrators (i.e., spiking neurons) can be effectively modeled. In addition, the discontinuity of the spike generation function can be circumvented with surrogate gradients that provide an unbiased estimation of the true gradients [17], [18], [19], [20], [21], [22]. In this way, the canonical error back-propagation through time algorithm (BPTT) can be applied to optimize the SNN. However, it is both computation- and memory-inefficient to optimize the SNN using the BPTT algorithm since spike trains are typically very sparse in both time and space. Therefore, the scalability of the technique remains to be improved, for instance, the size of SNNs is GPU memory bounded as demonstrated in a gesture classification task [19]. Furthermore, the vanishing and exploding gradient problem [23] of the BPTT algorithm adversely affects the learning in face of input spike trains of long temporal duration or low firing rate.

为了解决代理梯度学习中的上述问题,一种新的串联学习框架[24]被提出。该学习框架由一个ANN和一个通过权值共享耦合的SNN组成,其中SNN用于推导精确的神经表示,而ANN用于在脉冲序列级别上近似代理梯度。这样训练的SNN已经在一些基于帧和事件的基准测试上展示了具有竞争力的分类和回归能力,显着降低了计算成本和内存使用。尽管这些基于脉冲的学习方法表现出了很好的学习性能,但它们对于具有10个以上隐藏层的深度SNN的适用性仍然是难以捉摸的。

To address the aforementioned issues in surrogate gradient learning, a novel tandem learning framework [24] has been proposed. This learning framework consists of an ANN and an SNN coupled through weight sharing, wherein the SNN is used to derive the exact neural representation, while the ANN is designed to approximate the surrogate gradients at the spike-train level. The SNNs thus trained have demonstrated competitive classification and regression capabilities on a number of frame- and event-based benchmarks, with significantly reduced computational cost and memory usage. Despite the promising learning performance demonstrated by these spike-based learning methods, their applicability to deep SNNs with more than 10 hidden layers remains elusive.

最近的研究表明,按照速率编码的思想,用脉冲神经元的放电速率来近似ANN神经元的激活值,可以有效地构建SNN[25],[26],[27],[28],[29],[30]。这种方法不仅简化了上述基于脉冲的学习方法的训练过程,而且还使SNN能够在许多具有挑战性的任务上获得最佳报告结果,包括ImageNet-12数据集[27],[28]上的物体识别和PASCAL VOC和MS COCO数据集[29]上的物体检测。然而,为了达到可靠的发射速率近似值,它需要一个非常大的编码时间窗口,至少有几百个时间步长。此外,执行一次分类所需的突触操作总数通常会随着编码时间窗口的大小而增加,因此,较大的编码时间窗口也会对计算效率产生不利影响。一个理想的SNN模型不仅要能高精度地执行模式识别任务,而且要能以尽可能少的时间步长快速地获得结果,并以少量的突触操作高效地获得结果。在这项工作中,我们介绍了一种新的ANN-to-SNN转换和学习框架,将预先训练好的ANN逐步转换为SNN,以实现准确、快速和高效的模式识别。

Following the idea of rate-coding, recent studies have shown that SNNs can be effectively constructed from ANNs by approximating the activation value of ANN neurons with the firing rate of spiking neurons [25], [26], [27], [28], [29], [30]. This approach not only simplifies the training procedures of the aforementioned spike-based learning methods but also enables SNNs to achieve the best-reported results on a number of challenging tasks, including object recognition on the ImageNet-12 dataset [27], [28] and object detection on the PASCAL VOC and MS COCO datasets [29]. However, to reach a reliable firing rate approximation, it requires a notoriously large encoding time window with at least a few hundred time steps. Moreover, the total number of synaptic operations required to perform one classification usually increases with the size of the encoding time window, therefore, a large encoding time window will also adversely impact the computational efficiency. An ideal SNN model should not only perform pattern recognition tasks with high accuracy but also obtain the results rapidly with as few time steps as possible, and efficiently with a small number of synaptic operations. In this work, we introduce a novel ANN-to-SNN conversion and learning framework to progressively convert a pre-trained ANN into an SNN for accurate, rapid, and efficient pattern recognition.

为了提高推理速度和能量效率,我们引入了一种分层阈值确定机制,充分利用脉冲神经元的编码时间窗口进行信息表示。为了保持较高的模式识别精度,进一步应用带有自适应训练调度器的分层学习方法对每个原始层转换后的网络权重进行微调,以补偿转换误差。所提出的分层转换和学习框架还通过在训练过程中逐步施加硬件约束来支持有效的算法-硬件协同设计。综上所述,本工作的主要贡献有四个方面:

To improve the inference speed and energy efficiency, we introduce a layer-wise threshold determination mechanism to make good use of the encoding time window of spiking neurons for information representation. To maintain a high pattern recognition accuracy, a layer-wise learning method with an adaptive training scheduler is further applied to fine-tune the network weights after each primitive layer conversion that compensates for the conversion errors. The proposed layer-wise conversion and learning framework also supports effective algorithm-hardware co-design by progressively imposing hardware constraints during the training process. To summarize, the main contributions of this work are in four aspects:

重新思考ANN-to-SNN转换:我们引入了一个新的视角来理解脉冲神经元的神经离散化过程,将其与ANN神经元的激活量化进行比较,这为理解和执行网络转换提供了一个新的角度。通过有效利用以编码时间窗口大小为上限的脉冲计数来表示对应ANN神经元的信息,与其他基于放电速率近似的转换方法相比,推理时间和计算成本可以显著降低。
Rethinking ANN-to-SNN Conversion: We introduce a new perspective to understand the neural discretization process of spiking neurons by comparing it to the activation quantization of ANN neurons, which offers a new angle to understand and perform network conversion. By making efficient use of the spike count that is upper bounded by the encoding time window size to represent the information of counterparts, the inference speed, and computational cost can be significantly reduced over other conversion methods grounded on a firing rate approximation.
渐进式串联学习框架:我们提出了一种新颖的分层ANN-to-SNN转换和学习框架,具有自适应训练调度器,支持轻松高效的转换,允许深度SNN快速、准确和高效的模式识别。所提出的转换框架还允许轻松地将硬件约束纳入训练过程,例如,有限的权重精度与扇入连接,以便在部署到实际的神经形态芯片时实现最佳性能。
Progressive Tandem Learning Framework: We propose a novel layer-wise ANN-to-SNN conversion and learning framework with an adaptive training scheduler to support effortless and efficient conversion, which allows fast, accurate, and efficient pattern recognition with deep SNNs. The proposed conversion framework also allows easy incorporation of hardware constraints into the training process, for instance, limited weight precision and fan-in connections, such that the optimal performance can be achieved when deploying onto the actual neuromorphic chips.
重新思考基于脉冲的学习方法:我们对基于时间的代理梯度学习和基于脉冲计数的串联学习方法的可扩展性进行了全面的研究,揭示了累积的梯度近似误差可能会阻碍深度SNN的训练收敛
Rethinking Spike-based Learning Methods: We conduct a comprehensive study on the scalability of both the time-based surrogate gradient learning and the spike count-based tandem learning methods, revealing that the accumulated gradient approximation errors may impede the training convergence in deep SNNs.
用SNN解决鸡尾酒会问题(计算机语音识别领域中一类盲源分离问题,译者注):为了评估所提出的学习框架,我们应用深度SNN从混合多说话者语音中分离出高保真的声音,这有效地模拟了人脑的感知和认知能力。据我们所知,这是第一个成功应用深度SNN来解决具有挑战性的鸡尾酒会问题的工作。
Solving Cocktail Party Problem with SNN: To evaluate the proposed learning framework, we apply deep SNNs to separate high fidelity voices from a mixed multiple-talker speech, which effectively mimics the perceptual and cognitive ability of the human brain. To the best of our knowledge, this is the first work that successfully applied deep SNNs to solve the challenging cocktail party problem.

本文的其余部分组织如下。在第2节中,我们首先回顾了传统的ANN到SNN转换方法,并讨论了准确性和延迟之间的权衡。在第3节中,我们比较了脉冲神经元和ANN神经元之间的神经元功能,以及它们的离散等价,这为执行网络转换提供了一个新的视角。基于此,我们提出用脉冲神经元的脉冲计数作为脉冲神经元与其对应的ANN网络之间的桥梁,进行网络转换。在第4节中,为了最小化转换误差,我们提出了一种新的分层学习方法,使用自适应训练调度器来微调网络权重。在第5节和第6节中,我们通过一组分类和回归任务,包括大规模图像分类、时域语音分离和图像重建,验证了所提出的网络转换和学习框架,即渐进串联学习(PTL)。最后,我们在第7节对本文进行总结。

The rest of the paper is organized as follows. In Section 2, we first review the conventional ANN-to-SNN conversion methods and discuss the trade-off between accuracy and latency. In Section 3, we compare the neuronal functions between the spiking neurons and ANN neurons, and their discrete equivalents, which provide a new perspective to perform network conversion. With this, we propose to use the spike count of spiking neurons as the bridge between the spiking neurons and their ANN counterparts for network conversion. In Section 4, to minimize the conversion errors, we propose a novel layer-wise learning method with an adaptive training scheduler to fine-tune network weights. In Sections 5 and 6, we validate the proposed network conversion and learning framework, that is referred to as progressive tandem learning (PTL), through a set of classification and regression tasks, including the large-scale image classification, time-domain speech separation and image reconstruction. Finally, we conclude the paper in Section 7.

2 相关工作/Related Work

近年来,人们提出了许多ANN-to-SNN的转换方法。这些方法几乎都遵循速率编码的思想,即用脉冲神经元的放电速率来近似ANN神经元的激活值。在接下来的内容中,我们将回顾ANN-to-SNN转换方法的发展,并强调这些方法中准确性与延迟的权衡问题。

Recently, many ANN-to-SNN conversion methods are proposed. Nearly all of these methods follow the idea of rate-coding, which approximates the activation value of ANN neurons with the firing rate of spiking neurons. In what follows, we will review the development of ANN-to-SNN conversion methods and highlight the issue of accuracy and latency trade-off in these methods.

ANN-to-SNN转换的最早尝试出现在[31]中,Perez-Carrasco等人设计了一种用ANN神经元近似泄漏积分-发放(LIF)神经元的方法。在复制到SNN之前,通过考虑脉冲神经元的泄漏率等参数,对预训练ANN神经元的权重进行重新缩放。该转换方法被提出用于处理事件驱动相机捕获的事件流,并在人体轮廓方向和扑克牌符号识别任务中展示了良好的识别效果。然而,这种转换方法需要手动确定大量的超参数,并且转换过程存在量化等近似误差。

The earliest attempt for ANN-to-SNN conversion was presented in [31], where Perez-Carrasco et al. devised an approximation method for leaky integrate-and-fire (LIF) neurons using ANN neurons. The pre-trained weights of ANN neurons are rescaled by considering the leaky rate and other parameters of spiking neurons before copying into the SNN. This conversion method was proposed to handle event streams captured by the event-driven camera, whereby promising recognition results were demonstrated on the human silhouette orientation and poker card symbol recognition tasks. While this conversion method requires a large number of hyperparameters to be determined manually and the conversion process suffers from quantization and other approximation errors.

近年来,已有研究将ANN-to-SNN转换应用于基于帧的图像上的精确目标识别与检测。Cao等人[25]提出了一种转换框架,使用线性整流单元(ReLU)作为ANN神经元的激活函数,并将偏置项设为零。这样,ANN神经元的激活值可以很好地由积分-发放(IF)神经元的放电速率来近似。此外,对于基于速率的SNN,难以在时域中确定的最大池化操作被平均池化所取代。Diehl等人[26]通过分析性能下降的原因进一步改进了这一转换框架,揭示了脉冲神经元过激活与欠激活的潜在问题。为了解决这些问题,他们提出了基于模型和基于数据的权重归一化方案,根据ANN神经元的最大激活值重新缩放SNN权重。这些归一化方案防止了脉冲神经元的过激活与欠激活,并在放电阈值和模型权重之间取得了良好的平衡。结果,在MNIST数据集上,全连接与卷积脉冲神经网络均报告了近乎无损的分类精度。

There are recent studies on ANN-to-SNN conversion with applications to accurate object recognition and detection on frame-based images. Cao et al. [25] proposed a conversion framework by using the rectified linear unit (ReLU) as the activation function for ANN neurons and set the bias term to zero. The activation value of ANN neurons can thus be well approximated by the firing rate of integrate-and-fire (IF) neurons. Furthermore, the max-pooling operation, which is hard to determine in the temporal domain for a rate-based SNN, is replaced with the average pooling. Diehl et al. [26] further improved this conversion framework by analyzing the causes of performance degradation, which reveals the potential problems of over- and under-activation of spiking neurons. To address these problems, they proposed model- and data-based weight normalization schemes to rescale the SNN weights based on the maximum activation values of ANN neurons. These normalization schemes prevent the over- and under-activation of spiking neurons and strike a good balance between the firing threshold and the model weights. As a result, near-lossless classification accuracies were reported on the MNIST dataset with fully connected and convolutional spiking neural networks.

Rueckauer等人[27]发现了由IF神经元重置为零方案引起的量化误差,其中超过放电阈值的剩余膜电位在放电后被丢弃。这种量化误差容易在层间累积,严重影响转换后深层SNN的分类精度。为了解决这一问题,他们提出了一种减法重置方案,以保留每次发射后的剩余膜电位。此外,引入了改进的基于数据的权重归一化方案,提高了对异常值的鲁棒性,显著提高了脉冲神经元的放电率,从而提高了SNN的推理速度。在具有挑战性的ImageNet-12物体识别任务中,他们第一次展示了与ANN具有竞争性的结果。

Rueckauer et al. [27] identified a quantization error caused by the reset-to-zero scheme of IF neurons, where surplus membrane potential over the firing threshold is discarded after firing. This quantization error tends to accumulate over layers and severely impacts the classification accuracy of converted deep SNNs. To address this problem, they propose a reset-by-subtraction scheme to preserve the surplus membrane potential after each firing. Moreover, a modified data-based weight normalization scheme is introduced to improve the robustness against outliers, which significantly improves the firing rate of spiking neurons and hence the inference speed of SNN. For the first time, they had demonstrated competitive results to the ANN counterparts on the challenging ImageNet-12 object recognition task.

在同一研究方向上,Hu等人[30]提供了一种转换深度残差网络的系统方法,并提出了一种误差补偿方案来解决累积的量化误差。通过这些修改,他们实现了近乎无损的转换,使脉冲残差网络达到110层。Kim等人[29]通过对卷积神经网络逐通道地应用权重归一化扩展了该转换框架,并提出了一种转换同时具有正、负激活值的ANN神经元的有效策略。所提出的逐通道归一化方案提高了神经元的放电速率,从而提高了信息传输速率。得益于这些改进,在需要精确预测边界框坐标的挑战性目标检测任务中展示了具有竞争力的结果。Sengupta等人[28]通过考虑脉冲神经元在运行时的行为进一步优化了权重归一化方案,在ImageNet-12数据集上获得了最佳报告结果。为了提高上述转换方法对池化层的适用性并降低整体计算开销,Xu等人[32]和Wang等人[33]提出对放电阈值而非权重进行归一化。

In the same line of research, Hu et al. [30] provided a systematic approach to convert deep residual networks and propose an error compensation scheme to address the accumulated quantization errors. With these modifications, they achieved near-lossless conversion for spiking residual networks up to 110 layers. Kim et al. [29] extended the conversion framework by applying the weight normalization channel-wise for convolutional neural networks and propose an effective strategy for converting ANN neurons with both positive and negative activation values. The proposed channel-wise normalization scheme boosted the firing rate of neurons and hence improved the information transmission rate. Benefiting from these modifications, competitive results are demonstrated in the challenging object detection task where the precise coordinate of bounding boxes is required to be predicted. Sengupta et al. [28] further optimized the weight normalization scheme by taking into consideration the behavior of spiking neurons at the run time, which achieved the best-reported result on the ImageNet-12 dataset. To improve the applicability of the aforementioned conversion methods to the pooling layer as well as to reduce the overall computational overhead, Xu et al. [32] and Wang et al. [33] proposed to normalize the firing threshold instead of the weights.

在这些早期的研究中,提出了发射阈值确定或权重归一化的方法,以获得较好的发射速率近似。尽管这些转换方法取得了具有竞争力的结果,但潜在的发射速率假设导致了准确性和延迟之间的内在权衡,这需要几百到数千个时间步才能达到稳定的发射速率。Rueckauer等人[27]通过分析这些ANN-to-SNN转换方法的发射速率偏差,对这一问题进行了理论分析。通过假设第一层脉冲神经元的输入电流恒定,第一层(Eq.(1))和后续层(Eq.(2))的实际放电速率可以总结如下

In these earlier studies, methods are proposed for the firing threshold determination or weight normalization so as to achieve a good firing rate approximation. Despite competitive results achieved by these conversion methods, the underlying firing-rate assumption has led to an inherent trade-off between accuracy and latency, which requires a few hundred to thousands of time steps to reach a stable firing rate. Rueckauer et al. [27] provided a theoretical analysis of this issue by analyzing the firing rate deviation of these ANN-to-SNN conversion methods. By assuming a constant input current to spiking neurons at the first layer, the actual firing rate of the first (Eq. (1)) and subsequent layers (Eq. (2)) can be summarised as follows

\begin{equation}r^1_i(t)=a^l_i r_{\max}-\frac{V^l_i(t)}{t\vartheta}\end{equation}

\begin{equation}r^l_i(t)=\sum^{M^{l-1}}_j w^l_{ij}r^{l-1}_j(t)+b^l_i r_{\max}-\frac{V^l_i(t)}{t\vartheta}\end{equation}

其中 $r^l_i(t)$ 表示第 $l$ 层神经元 $i$ 的放电速率,$r_{\max}$ 表示由时间步长决定的最大放电速率。$a^l_i$ 为第一层ANN神经元 $i$ 的激活值,$V^l_i(t)$ 为对应脉冲神经元的膜电位,$\vartheta$ 为神经元放电阈值。$M^{l-1}$ 为第 $l-1$ 层神经元的总数,$b^l_i$ 为第 $l$ 层ANN神经元 $i$ 的偏置项。理想情况下,脉冲神经元的放电速率应与对应ANN神经元的激活值成正比,如式(1)的第一项所示。而模拟结束时尚未放电的剩余膜电位将导致如式(1)第二项所示的近似误差,这可以通过较大的放电阈值或较大的编码时间窗口来抵消。由于增大放电阈值将不可避免地延长证据积累时间,因此通常首选一个既能防止脉冲神经元欠激活又能防止其过激活的适当放电阈值,并通过延长编码时间窗口来最小化这种放电速率近似误差。

where $r^l_i(t)$ denotes the firing rate of neuron $i$ at layer $l$ and $r_{\max}$ denotes the maximum firing rate that is determined by the time step size. $a^l_i$ is the activation value of ANN neuron $i$ at the first layer, $V^l_i(t)$ is the membrane potential of the corresponding spiking neuron, and $\vartheta$ is the neuronal firing threshold. $M^{l-1}$ is the total number of neurons in layer $l-1$ and $b^l_i$ is the bias term of ANN neuron $i$ at layer $l$. Ideally, the firing rate of spiking neurons should be proportional to the activation value of their ANN counterparts as per the first term of Eq. (1). While the surplus membrane potential that has not been discharged by the end of simulation will cause an approximation error as shown by the second term of Eq. (1), which can be counteracted with a large firing threshold or a large encoding time window. Since increasing the firing threshold will inevitably prolong the evidence accumulation time, a proper firing threshold that can prevent spiking neurons from either under- or over-activating is usually preferred and the encoding time window is extended to minimize such a firing rate approximation error.

此外,如式(2)所示,这种近似误差在层间传播时会逐渐累积,因此需要进一步扩展编码时间窗口来补偿。因此,对于具有10层以上的深度SNN[28],[29],通常需要数千个时间步才能达到有竞争力的精度。从这些公式可以清楚地看出,用脉冲神经元的放电速率来近似ANN的连续输入-输出表示,将不可避免地导致准确性和延迟之间的权衡。为了克服这个问题,正如后续章节将要介绍的,我们提出了一种基于离散神经表示的新转换方法,其中以编码时间窗口大小为上限的脉冲计数被用于近似ANN的离散输入-输出表示。为了有效地利用脉冲计数进行信息表示,我们提出了一种新的放电阈值确定策略,从而可以用SNN实现快速高效的模式识别。为了抵消转换误差、确保模式识别任务的高精度,进一步提出了一种分层学习方法来对网络进行微调。

Besides, this approximation error accumulates gradually while propagating over layers as shown in Eq. (2), thereby a further extension of the encoding time window is required to compensate. As such, a few thousand time steps are typically required to achieve a competitive accuracy for deep SNNs with more than 10 layers [28], [29]. From these formulations, it is clear that to approximate the continuous input-output representation of ANNs with the firing rate of spiking neurons will inevitably lead to the accuracy and latency trade-off. To overcome this issue, as will be introduced in the following sections, we propose a novel conversion method that is grounded on the discrete neural representation, whereby the spike count, upper bounded by the encoding time window size, is taken to approximate the discrete input-output representation of ANNs. To make efficient use of the spike count for information representation, we propose a novel firing threshold determination strategy such that rapid and efficient pattern recognition can be achieved with SNNs. To counteract the conversion errors and hence ensure high accuracies in pattern recognition tasks, a layer-wise learning method is further proposed to fine-tune the network.

3 重新思考ANN-to-SNN的转换/Rethinking ANN-to-SNN Conversion

近年来,人们开发了许多脉冲神经元模型来描述生物神经元丰富的动力学行为。然而,对于现实世界的模式识别任务来说,其中大多数都过于复杂。如第2节所述,为了计算简单且易于转换,IF神经元模型通常用于ANN-to-SNN的转换工作[26],[27],[28]。尽管这种简化的脉冲神经元模型没有模拟生物神经元丰富的亚阈值动态,但它保留了离散和稀疏通信的诱人特性,因此可以实现高效的硬件实现。在本节中,我们将重新研究ReLU ANN神经元和积分-发放(integrate-and-fire)脉冲神经元之间输入-输出表示的近似。

Over the years, many spiking neuron models are developed to describe the rich dynamical behavior of biological neurons. Most of them, however, are too complex for real-world pattern recognition tasks. As discussed in Section 2, for computational simplicity and ease of conversion, the IF neuron model is commonly used in ANN-to-SNN conversion works [26], [27], [28]. Although this simplified spiking neuron model does not emulate the rich sub-threshold dynamics of biological neurons, it preserves attractive properties of discrete and sparse communication, therefore, allows for efficient hardware implementation. In this section, we reinvestigate the approximation of input-output representation between a ReLU ANN neuron and an integrate-and-fire spiking neuron.

3.1 脉冲神经元与ANN神经元/Spiking Neuron Versus ANN Neuron

让我们考虑编码时间窗口为 $N_s$ 的脉冲神经元的离散时间模拟,其中编码时间窗口决定了SNN的推理速度。在每一个时间步 $t$,到达第 $l$ 层神经元 $i$ 的输入脉冲根据下式被转导为突触电流 $z^l_i[t]$:

Let us consider a discrete-time simulation of spiking neurons with an encoding time window of $N_s$ that determines the inference speed of an SNN. At each time step $t$, the incoming spikes to the neuron $i$ at layer $l$ are transduced into synaptic current $z^l_i[t]$ according to

\begin{equation}z^l_i[t]=\sum_j w^{l-1}_{ij}s^{l-1}_j[t]+b^l_i\end{equation}

其中 $s^{l-1}_j[t]$ 表示在时间步 $t$ 出现的输入脉冲,$w^{l-1}_{ij}$ 是第 $l$ 层突触前神经元 $j$ 和突触后神经元 $i$ 之间的突触权值。$b^l_i$ 可以解释为恒定的注入电流。

where $s^{l-1}_j[t]$ indicates the occurrence of an input spike at time step $t$, and $w^{l-1}_{ij}$ is the synaptic weight between the pre-synaptic neuron $j$ and the post-synaptic neuron $i$ at layer $l$. $b^l_i$ can be interpreted as a constant injecting current.

突触电流 $z^l_i[t]$ 根据式(4)进一步被积分到膜电位 $V^l_i[t]$ 中。在不失一般性的情况下,本工作假设膜电阻为单位电阻。如式(4)的最后一项所述,膜电位在每次放电后通过减去放电阈值来重置。

The synaptic current $z^l_i[t]$ is further integrated into the membrane potential $V^l_i[t]$ as per Eq. (4). Without loss of generality, a unitary membrane resistance is assumed in this work. The membrane potential is reset by subtracting the firing threshold after each firing as described by the last term of Eq. (4).

\begin{equation}V^l_i[t]=V^l_i[t-1]+z^l_i[t]-\vartheta^l s^l_i[t-1]\end{equation}

每当 $V^l_i[t]$ 上升到放电阈值 $\vartheta^l$(按层确定)以上,就会产生输出脉冲,如下所示

An output spike is generated whenever the $V^l_i[t]$ rises above the firing threshold $\vartheta^l$ (determined layer-wise) as follows

\begin{equation}s^l_i[t]=\Theta(V^l_i[t]-\vartheta^l)\quad\text{with}\quad\Theta(x)=\begin{cases}1, & \text{if } x\ge 0\\0, & \text{otherwise}\end{cases}\end{equation}

因此,可以确定时间窗口 $N_s$ 内的脉冲序列 $s^l_i$ 与脉冲计数 $c^l_i$,表示如下

The spike train $s^l_i$ and spike count $c^l_i$ for a time window of $N_s$ can thus be determined and represented as follows

\begin{equation}\begin{split}s^l_i&=\{s^l_i[1],\cdots,s^l_i[N_s]\}\\c^l_i&=\sum^{N_s}_{t=1}s^l_i[t]\end{split}\end{equation}
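
As a reading aid, the following is a minimal PyTorch sketch of the discrete-time IF neuron dynamics of Eqs. (3)-(6); the function and argument names (`spikes_in` of shape `[n_steps, n_in]`, `weight`, `bias`, `threshold`) are illustrative assumptions and not the authors' released code.

```python
import torch

def if_neuron_forward(spikes_in, weight, bias, threshold, n_steps):
    """Discrete-time simulation of one layer of integrate-and-fire neurons:
    synaptic transduction (Eq. 3), membrane integration with
    reset-by-subtraction (Eq. 4), and threshold crossing (Eq. 5)."""
    n_out = weight.shape[0]
    v = torch.zeros(n_out)                        # membrane potential V^l_i
    spikes_out = torch.zeros(n_steps, n_out)      # output spike train s^l_i[t]
    for t in range(n_steps):
        z = weight @ spikes_in[t] + bias          # Eq. (3): synaptic current z^l_i[t]
        prev = spikes_out[t - 1] if t > 0 else torch.zeros(n_out)
        v = v + z - threshold * prev              # Eq. (4): integrate, reset by subtraction
        spikes_out[t] = (v >= threshold).float()  # Eq. (5): fire on threshold crossing
    spike_count = spikes_out.sum(dim=0)           # Eq. (6): spike count c^l_i
    return spikes_out, spike_count
```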

对于非脉冲ANN神经元,我们将第 $l$ 层神经元 $i$ 的神经元功能描述为

For non-spiking ANN neurons, let us describe the neuronal function of neuron $i$ at layer $l$ as

\begin{equation}a^l_i=f\left(\sum_j w^{l-1}_{ij}x^{l-1}_j+b^l_i\right)\end{equation}

其中 $w^{l-1}_{ij}$ 和 $b^l_i$ 为权重和偏置。$x^{l-1}_j$ 和 $a^l_i$ 表示ANN神经元的输入与输出。$f(\cdot)$ 表示激活函数,本工作中使用ReLU。对于ANN-to-SNN的转换,在转换之前首先训练具有ReLU神经元的ANN,这称为预训练。

which has $w^{l-1}_{ij}$ and $b^l_i$ as the weight and bias. $x^{l-1}_j$ and $a^l_i$ denote the input and output of the ANN neuron. $f(\cdot)$ denotes the activation function, for which we use the ReLU in this work. For ANN-to-SNN conversion, an ANN with the ReLU neurons is first trained, which is called pre-training, before the conversion.

3.2 神经离散化与激活量化/Neural Discretization Versus Activation Quantization

在传统的ANN-to-SNN的转换研究中,通常采用脉冲神经元的放电速率来近似预训练ANN的连续输入输出表示。正如第2节所讨论的,脉冲神经元需要一个众所周知的长时间窗口才能可靠地近似一个连续值。然而,最近的研究表明,对于ANN来说,这种连续的神经表征可能不是必需的[34]。事实上,将ANN神经元的激活值适当量化为一个低精度的离散表示[35],[36],即激活量化,对网络性能影响不大。

In the conventional ANN-to-SNN conversion studies, the firing rate of spiking neurons is usually taken to approximate the continuous input-output representation of the pretrained ANN. As discussed in Section 2, a spiking neuron takes a notoriously long time window to reliably approximate a continuous value. Recent studies, however, suggest such a continuous neural representation may not be necessary for ANNs [34]. In fact, there could be little impact on the network performance when the activation value of ANN neurons are properly quantized to a low-precision discrete representation [35], [36], which is known as activation quantization.

在ANN中,激活量化是指将一个浮点激活值 $a^{l,f}_i$ 映射到一个量化值 $a^{l,q}_i$。使用ReLU激活函数,激活量化可以表述如下

In ANNs, the activation quantization refers to the mapping of a floating-point activation value $a^{l,f}_i$ to a quantized value $a^{l,q}_i$. With a ReLU activation function, the activation quantization can be formulated as follows

\begin{equation}\begin{split}\hat{a}^{l,f}_i&=\min(\max(a^{l,f}_i,0),a^l_u)\\\varphi^l&=\frac{a^l_u}{N_q}\\a^{l,q}_i&=\mathrm{round}\left(\frac{\hat{a}^{l,f}_i}{\varphi^l}\right)\cdot\varphi^l\end{split}\end{equation}

其中 $a^l_u$ 为第 $l$ 层量化范围的上界,其值通常由训练数据确定。$N_q$ 为量化级别的总数,$\varphi^l$ 为第 $l$ 层的量化尺度。采用这种离散的神经表示,可以显著降低ANN训练和推理过程中的计算和存储开销。激活量化的成功可以用连续神经表示中存在高度冗余这一事实来解释。

where $a^l_u$ refers to the upper bound of the quantization range at layer $l$, whose values are usually determined from the training data. $N_q$ is the total number of quantization levels and $\varphi^l$ is the quantization scale for layer $l$. With such a discrete neural representation, the computation and storage overheads during the training and inference of ANNs can be significantly reduced. The success of activation quantization can be explained by the fact that there is a high level of redundancy in the continuous neural representation.
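
For illustration, a minimal sketch of the activation quantization in Eq. (8) could look as follows; `a_u` and `n_q` correspond to $a^l_u$ and $N_q$, and the function name is a hypothetical one introduced here.

```python
import torch

def quantize_activation(a, a_u, n_q):
    """Uniform activation quantization of Eq. (8): clip the ReLU activation
    to [0, a_u] and round it to one of n_q levels of scale phi = a_u / n_q."""
    a_hat = torch.clamp(a, min=0.0, max=a_u)   # \hat{a}^{l,f}_i
    phi = a_u / n_q                            # quantization scale \varphi^l
    return torch.round(a_hat / phi) * phi      # a^{l,q}_i
```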

在SNN中,根据脉冲神经元的神经元动力学,将信息固有地离散成脉冲序列,下文称为神经离散化。值得注意的是,编码时间窗口的大小决定了SNN的离散表示空间。ANN的激活量化导致数据存储的减少,这发生在空间域中。通过将表现良好的ANN的离散神经表示映射到SNN,我们期望将数据存储的减少转化为编码时间窗口大小的减少,从而允许使用SNN进行快速有效的模式识别

In SNNs, the information is inherently discretized into spike trains according to the neuronal dynamics of spiking neurons, which is referred to as the neural discretization hereafter. It is worth noting that the size of the encoding time window determines the discrete representation space for SNNs. The activation quantization of ANNs leads to a reduction in data storage, which takes place in the spatial domain. By mapping the discrete neural representation of a good performing ANN to an SNN, it is expected that we translate the reduction of the data storage into the reduction of the encoding time window size, thus allowing rapid and efficient pattern recognition with SNNs.

ANN神经元立即对输入刺激做出反应,而脉冲神经元通过时间窗口内的时间过程对输入脉冲序列做出反应。为了建立ANN神经元的激活量化和脉冲神经元的神经离散化之间的对应关系,我们假设前一层的脉冲序列和恒定注入电流被累积并瞬间释放,从而简化了神经离散化过程。前一层的脉冲序列和恒定注入电流的总体贡献可概括为自由聚合膜电位(无发射)[24],定义为

The ANN neurons respond to the input stimuli instantly, while spiking neurons respond to the input spike trains through a temporal process within a time window. In order to establish a correspondence between the activation quantization of ANN neurons and the neural discretization of spiking neurons, we simplify the neural discretization process by assuming the preceding layer’s spike trains and the constant injecting current are integrated and discharged instantly. The overall contributions from the preceding layer’s spike trains and constant injecting current can be summarized by the free aggregate membrane potential (no firing) [24] defined as

\begin{equation}V^l_i=\sum_j w^{l-1}_{ij}c^{l-1}_j+b^l_i N_s\end{equation}

将 $b^l_i N_s$ 作为偏置项,将 $c^{l-1}_j$ 作为式(7)中定义的ANN神经元的输入,则 $V^l_i$ 与非脉冲ANN神经元的预激活量完全相同。通过将脉冲神经元的脉冲计数作为信息载体,这种对神经离散化的简化为将ANN神经元的离散输入映射到脉冲神经元的离散脉冲计数输入提供了基础。

By considering $b^l_i N_s$ as the bias term and $c^{l-1}_j$ as the input to ANN neurons that defined in Eq. (7), $V^l_i$ is exactly the same as the pre-activation quantity of non-spiking ANN neurons. By considering the spike count of spiking neurons as the information carrier, the simplification of neural discretization provides the basis for mapping the discrete inputs of an ANN neuron to the discrete spike count inputs of a spiking neuron.

请注意,IF神经元对输入脉冲序列的反应是发放零个或正数个输出脉冲。它执行类似于ANN神经元ReLU激活函数的非线性变换。如式(8)所定义,激活量化通过固定的量化尺度 $\varphi^l$ 将ReLU神经元的正激活值离散为整数。类似地,IF神经元的神经离散化通过固定的离散化尺度(即放电阈值 $\vartheta^l$)将正值 $V^l_i$ 离散为离散的脉冲计数,可表述如下

Note that an IF neuron responds to the input spike trains by firing zero or a positive number of output spikes. It performs a non-linear transformation similar to that of the ReLU activation function of an ANN neuron. As defined in Eq. (8), the activation quantization discretizes the positive activation value of ReLU neurons, by a fixed quantization scale $\varphi^l$, into an integer. Similarly, the neural discretization of an IF neuron discretizes the positive-valued $V^l_i$ by a fixed discretization scale, that is the firing threshold $\vartheta^l$, into a discrete spike count, that can be formulated as follows

\begin{equation}\begin{split}\hat{V}^l_i&=\min(\max(V^l_i,0),V^l_u)\\\vartheta^l&=\frac{V^l_u}{N_s}\\V^{l,q}_i&=\underbrace{\mathrm{round}\left(\frac{\hat{V}^l_i}{\vartheta^l}\right)}_{\approx c^l_i}\cdot\vartheta^l\end{split}\end{equation}

式中 $V^l_u$ 为第 $l$ 层自由聚合膜电位的上界。式(8)和式(10)建立了ReLU神经元的激活量化与IF神经元的离散神经表示之间的对应关系,从而为将ANN神经元的离散输出映射到脉冲神经元的脉冲计数输出提供了基础。值得注意的是,对于ANN,量化尺度 $\varphi^l$ 通常被独立存储,并在运算过程中乘到定点数上。然而,离散化尺度 $\vartheta^l$ 仅存储在脉冲神经元处,并不会与输出脉冲序列一起传播到下一层。这一问题可以通过将 $\vartheta^l$ 乘到后续层 $l+1$ 中神经元的权重上来轻松抵消。

where $V^l_u$ refers to the free aggregated membrane potential upper bound of layer $l$. Eqs. (8) and (10) establish a correspondence between the activation quantization of a ReLU neuron and the discrete neural representation of an IF neuron, thus provides the basis for mapping the discrete output of an ANN neuron to the spike count output of a spiking neuron. It is worth noting that the quantization scale $\varphi^l$ is usually stored independently for ANNs and multiplied to the fixed point number during operations. However, the discretization scale $\vartheta^l$ is only stored at the spiking neuron and does not propagate together with output spike trains to the next layer. This issue can be easily counteracted by multiplying $\vartheta^l$ to the weights of neurons in the subsequent layer $l+1$.

通过对神经离散化的简化,我们证明了ANN神经元的离散输入-输出表示可以用脉冲神经元很好地近似。按照这一公式,可以通过直接复制预训练ANN的权重来构建SNN。脉冲神经元的恒定注入电流可以通过将对应ANN神经元的偏置项除以 $N_s$ 来确定。根据式(10),第 $l$ 层脉冲神经元的放电阈值 $\vartheta^l$ 可以通过将上界 $V^l_u$ 除以 $N_s$ 来确定。由式(7)和式(9)可知,上界 $V^l_u$ 等于对应ANN层的最大激活值 $a^l_u$,因此可以直接取用。

With the simplification of neural discretization, we show that the discrete input-output representation of ANN neurons can be well approximated with spiking neurons. Following this formulation, an SNN can be constructed from the pre-trained ANN by directly copying its weights. The constant injecting current to spiking neurons can be determined by dividing the bias term of the corresponding ANN neuron over $N_s$. According to Eq. (10), the firing threshold $\vartheta^l$ of spiking neurons at layer $l$ can be determined by dividing the upper bound $V^l_u$ over $N_s$. From Eqs. (7) and (9), it is clear that the upper bound $V^l_u$ is equivalent to and hence can be directly taken from the maximum activation value $a^l_u$ of the corresponding ANN layer.
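
A minimal sketch of this primitive layer conversion, under the assumption that the layer is a fully connected `torch.nn.Linear` and that the upper bound $a^l_u$ (= $V^l_u$) has already been estimated (e.g., as in Section 3.3), might look as follows; the helper name is hypothetical.

```python
import torch

@torch.no_grad()
def convert_layer(ann_layer, a_u, n_steps):
    """Primitive ANN-to-SNN conversion of one layer: copy the weights, divide
    the ANN bias by N_s to obtain the constant injecting current per time step,
    and set the firing threshold to the activation upper bound divided by N_s."""
    weight = ann_layer.weight.clone()          # shared with the coupled SNN layer
    bias = ann_layer.bias.clone() / n_steps    # per-step injecting current b^l_i
    threshold = a_u / n_steps                  # firing threshold \vartheta^l = V^l_u / N_s
    return weight, bias, threshold
```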

然而,这一公式可能会产生两个潜在的误差:一个是受编码时间窗口大小影响的量化误差,另一个是由输入脉冲序列的时间结构引起的、可能影响 $V^l_i$ 实际放电过程的脉冲计数近似误差。不过,这些转换误差可以通过阈值归一化机制和后续章节将介绍的分层训练方法得到有效缓解。

However, two potential errors may arise from this formulation: a quantization error affected by the encoding time window size and a spike count approximation error arising from the temporal structure of input spike trains that may affect the actual discharging of $V^l_i$. These conversion errors, however, can be effectively mitigated by the threshold normalization mechanism and the layer-wise training method that will be introduced in the following sections.

3.3 阈值层归一化/Threshold LayerNorm

为了使用具有预定义编码时间窗口 $N_s$ 的脉冲神经元更好地表示ANN神经元的量化范围,我们引入了一种新的脉冲神经元阈值确定机制。为了正确定义一层ANN神经元的量化范围,我们需要确定激活值的上界 $a^l_u$。

To better represent the quantization range of ANN neurons using spiking neurons that have a pre-defined encoding time window $N_s$, we introduce a novel threshold determination mechanism for spiking neurons. To properly define the quantization range of ANN neurons in a layer, we need to determine the activation value upper bound $a^l_u$.

如图1所示并在[27]中指出,$a^l_u$ 往往会受到离群样本的偏置,例如,Conv1层的 $a^1_u$ 比其第99百分位数(以蓝色虚线标出)大5倍。为了有效地利用可用的离散表示空间并减少量化误差,我们建议使用从训练数据中确定的、该层所有 $a^l_i$ 的第99或99.9百分位数作为上界 $a^l_u$,以便很好地保留关键信息。鉴于前一节中建立的 $a^l_u$ 与 $V^l_u$ 的等价性,第 $l$ 层脉冲神经元的放电阈值 $\vartheta^l$ 即可通过将 $a^l_u$ 除以 $N_s$ 来确定。若百分位数高于99.9,计算出的放电阈值容易受到离群值的影响;另一方面,若百分位数低于99,则会导致一部分有信息量的激活范围无法在SNN中得到表示。在实践中,我们观察到在批大小足够大(例如128或256)时,这两个百分位数在不同数据批次之间保持相对稳定。因此,$a^l_u$ 可以有效地由一个随机训练批次得到。

As shown in Fig. 1 and also highlighted in [27], the $a^l_u$ tends to be biased by the outlier samples, for instance, the $a^1_u$ of Conv1 layer is five times larger than the 99th percentile (highlighted as the blue dotted line). To make efficient use of the available discrete representation space and reduce the quantization errors, we propose to use the 99th or 99.9th percentile of all $a^l_i$ in a layer, determined from the training data, as the upper bound $a^l_u$ such that the key information can be well-preserved. Given the equivalence of $a^l_u$ and $V^l_u$ established in the earlier section, the firing threshold $\vartheta^l$ of spiking neurons at layer $l$ can hence be determined by dividing the value of $a^l_u$ over $N_s$. For percentiles larger than 99.9th, the calculated firing threshold is prone to be affected by the outlier. On the other hand, a percentile below 99th will result in some informative activation range not represented in the SNN. In practice, we observe these two percentiles remain relatively stable across data batches with a sufficiently large batch size (e.g., 128 or 256). Therefore, the $a^l_u$ can be effectively derived from a random training batch.

🖼️ 图1 ReLU神经元激活值 $a^l_i$ 在预训练ANN各层中的分布。在这里,横轴表示激活值,而纵轴表示对数尺度下的神经元数量。大多数神经元输出的激活值较低,神经元数量随着激活值的增加而迅速减少。虚线表示每一层神经元数量的第99个百分位数。

Fig. 1. Distribution of the activation value $a^l_i$ of ReLU neurons in the pretrained ANN layers. Here, the horizontal axis represents the activation values, while the vertical axis represents the number of neurons in a log scale. The majority of neurons output low activation values and the number of neurons decreases rapidly as the activation value increases. The dotted lines mark the 99th percentile of the number of neurons in each layer.
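
The percentile-based upper bound of Section 3.3 could be estimated from one random training batch roughly as in the sketch below, assuming a standard PyTorch model whose ReLU modules expose the layer activations through forward hooks; the helper name and structure are illustrative.

```python
import torch

@torch.no_grad()
def activation_upper_bound(ann_model, batch, percentile=99.0):
    """Estimate the upper bound a^l_u of every ReLU layer as the 99th (or
    99.9th) percentile of its activations on one random training batch."""
    bounds, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            bounds[name] = torch.quantile(output.flatten(), percentile / 100.0).item()
        return hook

    for name, module in ann_model.named_modules():
        if isinstance(module, torch.nn.ReLU):
            hooks.append(module.register_forward_hook(make_hook(name)))
    ann_model(batch)                  # one forward pass records all bounds
    for h in hooks:
        h.remove()
    return bounds                     # {layer_name: a^l_u}
```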

为了进一步提高卷积神经网络的数值分辨率,可以像[29]中提出的那样,为每个通道独立确定放电阈值。不过,我们在实验中并未观察到分类或回归性能的显著提升,这可能是因为我们所采用的分层学习方法已经抵消了相应的性能下降。

To further improve the numerical resolution for convolutional neural networks, the firing threshold can be determined independently for each channel similar to that proposed in [29]. While we did not notice significant improvements in the classification or regression performance in our experiments, probably because the layer-wise learning method that we have applied counteracts the performance drop.

3.4 神经元编码/Neural Coding

需要一种合适的神经编码方案,将静态的输入特征张量或图像转换为脉冲序列,以便在SNN中进行神经处理。研究发现,直接对输入进行离散化会对底层信息造成严重扭曲;而对第一个网络层得到的特征张量进行离散化,则可以利用高维特征表示中的冗余有效地保留信息[37]。按照这种方法,我们将ANN神经元的激活值 $a^l_i$ 解释为对应脉冲神经元的输入电流,并在第一个时间步将其加入式(4)中。根据IF神经元的动态特性,这一数量被分布到连续的时间步上,从而生成脉冲序列;脉冲输出从第一个隐藏层开始。该神经编码方案有效地离散了特征张量,并将其表示为脉冲计数。

A suitable neural encoding scheme is required to convert the static input feature tensors or images into spike trains for neural processing in SNNs. It was found that a direct discretization of the inputs introduces significant distortions to the underlying information. While discretizing the feature tensors derived from the first network layer can effectively preserve the information by leveraging the redundancies in the high-dimensional feature representation [37]. Following this approach, we interpret the activation value $a^l_i$ of ANN neurons as the input current to the corresponding spiking neurons and add it to Eq. (4) at the first time step. The spike trains are generated by distributing this quantity over consecutive time steps according to the dynamic of IF neurons; the spiking output then starts from the first hidden layer. This neural encoding scheme effectively discretizes the feature tensor and represents it as spike counts.
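
A minimal sketch of this encoding scheme follows: the activation of the first network layer is injected as input current at the first time step only, and the IF dynamics spread it into a spike train over the encoding time window (variable names are illustrative, not the authors' code).

```python
import torch

def encode(first_layer_activation, threshold, n_steps):
    """Neural encoding sketch (Section 3.4): the ANN activation of the first
    layer is injected at the first time step, and the reset-by-subtraction
    IF dynamics spread it into spikes over the N_s-step time window."""
    v = first_layer_activation.clone()          # all current injected at t = 1
    spikes = torch.zeros(n_steps, *v.shape)
    for t in range(n_steps):
        spikes[t] = (v >= threshold).float()    # Eq. (5): fire while above threshold
        v = v - threshold * spikes[t]           # Eq. (4): reset by subtraction
    return spikes                               # spike count approximates a / threshold
```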

神经解码根据脉冲神经元的突触活动确定输出类别。我们建议不使用离散的脉冲计数,而是使用最后一个SNN层中神经元的自由聚合膜电位来确定输出类别;由于在输出层可以得到连续的误差梯度[24],这相比离散脉冲计数提供了更平滑的学习曲线。而且,这个连续的量也可以直接作为回归任务中的输出,例如本文后面将要介绍的图像重建和语音分离。如下一节将要解释的,神经编码层和解码层分别在第一个和最后一个网络转换阶段被加入到图2B所示的混合网络中。

The neural decoding determines the output class from the synaptic activity of spiking neurons. Instead of using the discrete spike counts, we suggest using the free aggregate membrane potential of neurons in the final SNN layer to determine the output class, which provides a much smoother learning curve over the discrete spike count due to the continuous error gradients derived at the output layer [24]. Moreover, this continuous quantity can also be directly considered as the outputs in regression tasks, such as image reconstruction and speech separation that will be presented later in this paper. As will be explained in the following section, the neural encoding and decoding layers are added into the hybrid network shown in Fig. 2B at the first and the last network conversion stage, respectively.

4 渐进串联学习/Progressive Tandem Learning

前面章节介绍的原始ANN-to-SNN转换方法提供了一种更有效的方式来近似ANN的输入-输出表示。然而,如第3.2节所述,转换过程固有地引入量化和脉冲计数近似误差。这种误差倾向于在层间积累,并导致显著的性能下降,特别是当 $N_s$ 很小时。因此,这就要求有一种训练方案在原始转换后对网络权重进行微调,以补偿这些转换误差。

The primitive ANN-to-SNN conversion method introduced in the earlier section provides a more efficient way to approximate the input-output representation of ANNs. However, the conversion process inherently introduces quantization and spike count approximation errors as discussed in Section 3.2. Such errors tend to accumulate over layers and cause significant performance degradation especially with a small $N_s$. This therefore calls for a training scheme to fine-tune the network weights after the primitive conversion, so as to compensate for these conversion errors.

目前已有基于脉冲的学习方案,如基于时间的代理梯度学习[16]和基于脉冲计数的串联学习方法[24],用于端到端的SNN训练。但是,对于所需的微调任务,它们并不是最好的。例如,从这些方法近似得到的代理梯度对于我们想要的极短的编码时间窗口往往是有噪声的。在5.2节中可以看到,使用这些端到端学习方法,梯度逼近误差会在层上累积,这大大降低了超过10层的SNN的学习性能。

There have been spike-based learning schemes, such as time-based surrogate gradient learning [16] and spike count-based tandem learning methods [24], for SNN training in an end-to-end manner. However, they don’t work the best for the required fine-tuning task. For example, the surrogate gradients approximated from these methods tend to be noisy for an extremely short encoding time window that we would like to have. As will be seen in Section 5.2, gradient approximation errors accumulate over layers with these end-to-end learning methods, which significantly degrade the learning performance for an SNN of over 10 layers.

为了解决这个问题,我们提出了一种分层学习方法,每次将一个ANN层转换为SNN层,以防止梯度逼近误差的累积。我们将一个SNN层的转换和权重微调定义为一个阶段。因此,如图2A所示,对于 $L$ 层的ANN网络,需要 $L$ 个阶段才能完成整个转换和微调过程。

To address this issue, we propose a layer-wise learning method, whereby ANN layers are converted into SNN layers one layer at a time to prevent the gradient approximation errors from accumulating. We define the conversion and weight fine-tuning of one SNN layer as one stage. Therefore, for an ANN network of $L$ layers, as shown in Fig. 2A, it takes $L$ stages to complete the entire conversion and fine-tuning process.

🖼️ 图2 所提出的PTL框架的说明。(A)整个训练过程被组织成独立的阶段。(B)训练阶段2的混合网络细节。请注意,SNN Layer 1执行的神经编码遵循3.4节中描述的过程。(C)第2阶段训练过程的细节。(D)自适应训练调度器的说明。

Fig. 2. Illustration of the proposed PTL framework. (A) The whole training process is organized into separate stages. (B) Details of the hybrid network at the training stage 2. Note that the SNN Layer 1 performs neural encoding following the process described in Section 3.4. (C) Details of the training processes at stage 2. (D) Illustration of the adaptive training scheduler.

每个训练阶段的细节如图2C所示。同一SNN层的所有脉冲神经元共享相同的放电阈值,该阈值首先根据所提出的阈值层归一化机制确定。此外,脉冲神经元的恒定注入电流通过将ANN神经元对应的偏置项除以 $N_s$ 来确定。按照串联学习方法[24],将转换后的SNN层与预训练的ANN层通过权值共享耦合,进一步构建混合网络,此后ANN层成为辅助结构,便于对转换后的SNN层进行微调。在每个训练阶段,PTL方案都遵循串联学习的思想,不同之处在于:1)我们固定之前各阶段SNN层的权重;2)只更新一个SNN层以及所有的ANN层。

The details of each training stage are illustrated in Fig. 2C. All spiking neurons in the same SNN layer share the same firing threshold, which is first determined according to the proposed Threshold LayerNorm mechanism. Besides, the constant injecting current to spiking neurons is determined by dividing the corresponding bias term of ANN neurons over $N_s$. Following the tandem learning approach [24], a hybrid network is further constructed by coupling the converted SNN layer to the pre-trained ANN layer through weight sharing, thereafter the ANN layer becomes an auxiliary structure to facilitate the fine-tuning of the converted SNN layer. At each training stage, the PTL scheme follows the tandem learning idea except that 1) we fix the weights of the SNN layers in the previous stages; 2) we update only one SNN layer together with all ANN layers.

4.1 串联学习/Tandem Learning

如图2B所示,由前一SNN层得到的脉冲序列及其等效脉冲计数被前向传播到耦合层。在耦合层中,脉冲神经元以脉冲序列为输入,生成脉冲计数作为输出;而ANN神经元以脉冲计数为输入,生成近似于耦合脉冲神经元脉冲计数的输出量。为了允许ANN层和SNN层之间共享权重,我们将脉冲计数作为桥梁。为此,我们将脉冲神经元的非线性变换表示为

As shown in Fig. 2B, the spike trains, derived from the preceding SNN layer, and their equivalent spike counts are forward propagated to the coupled layer. In the coupled layer, the spiking neurons take spike trains as input and generate spike counts as output, while the ANN neurons take spike counts as input and generate an output quantity that approximates the spike count of the coupled spiking neurons. To allow for weight sharing between the ANN and the SNN layers, we take the spike counts as the bridge. To this end, let us express the non-linear transformation of a spiking neuron as

\begin{equation}c^l_i=g(s^{l-1};w^{l-1}_i,b^l_i,\vartheta^l)\end{equation}

其中 $g(\cdot)$ 表示脉冲神经元执行的有效变换。鉴于脉冲生成的状态依赖性质,直接确定从 $s^{l-1}$ 到 $c^l_i$ 的解析表达式是不可行的。在这里,我们通过假设由 $s^{l-1}$ 产生的突触电流在时间上均匀分布来简化脉冲生成过程。因此,我们得到输出脉冲序列的脉冲间隔为

where $g(\cdot)$ denotes the effective transformation performed by spiking neurons. Given the state-dependent nature of spike generation, it is not feasible to directly determine an analytical expression from $s^{l-1}$ to $c^l_i$. Here, we simplify the spike generation process by assuming the resulting synaptic currents from $s^{l-1}$ are evenly distributed over time. We thus obtain the inter-spike interval of the output spike train as

\begin{equation}\Delta^l_i=\rho\left(\frac{\vartheta^l N_s}{\sum_j w^{l-1}_{ij}c^{l-1}_j+b^l_i N_s}\right)\end{equation}

其中 $\rho(\cdot)$ 表示ReLU非线性。等效输出脉冲计数可进一步确定为

where $\rho(\cdot)$ denotes the ReLU non-linearity. The equivalent output spike count can be further determined as

\begin{equation}c^l_i=\frac{N_s}{\Delta^l_i}=\frac{1}{\vartheta^l}\cdot\rho\left(\sum_j w^{l-1}_{ij}c^{l-1}_j+b^l_i N_s\right)\end{equation}

在实践中,为了重用原始的ANN层进行微调,我们将缩放因子 $1/\vartheta^l$ 吸收到学习率中。这种配置允许从ANN层有效地近似脉冲序列级别的误差梯度。结果表明,对于速率编码的网络,ANN-SNN串联学习方法比其他在每个时间步更新权重的基于脉冲的学习方法更有效[24]。

In practice, to reuse the original ANN layer for the fine-tuning purpose, we absorb the scaling factor $1/\vartheta^l$ into the learning rate. This configuration allows spike-train level error gradients to be effectively approximated from the ANN layer. It was shown that the ANN-SNN tandem learning method works more efficiently for rate-coded networks than other spike-based learning methods that update the weights for each time step [24].

在本文中,串联学习规则允许在原始转换后对脉冲突触滤波器进行微调,而原始转换为离散神经表示提供了良好的初始化。通过对后续ANN层的权重进行微调,可以有效地减小转换误差。与[24]中引入的端到端串联学习框架不同,这里的串联学习是一次一层地进行的,以防止梯度逼近误差跨层累积。在每个训练阶段结束后,该SNN层的权重被冻结。

In this paper, the tandem learning rule allows the spiking synaptic filters to be fine-tuned after the primitive conversion, which offers a good initialization for discrete neural representation. Along with the weights fine-tuning of subsequent ANN layers, the conversion errors can be effectively mitigated. Different from the end-to-end tandem learning framework introduced in [24], the tandem learning here is performed one layer at a time to prevent the gradient approximation error from accumulating across layers. The weights of the SNN layer are frozen after each training stage.
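
One simple way to realize this weight-sharing coupling in PyTorch is sketched below: the forward value is the exact spike count from the SNN branch, while the gradients flow through the ANN approximation of Eq. (13) (the $1/\vartheta^l$ factor being absorbed into the learning rate). It reuses the `if_neuron_forward` sketch given earlier and is only an illustration of the idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tandem_layer(spikes_in, count_in, weight, bias, threshold, n_steps):
    """Tandem-coupled layer: SNN branch gives the exact forward spike count,
    ANN branch (relu(W c_in + b N_s), Eq. (13)) carries the backward gradients."""
    count_ann = F.relu(F.linear(count_in, weight, bias * n_steps))
    with torch.no_grad():  # exact spike count of the coupled IF layer
        spikes_out, count_snn = if_neuron_forward(spikes_in, weight, bias,
                                                  threshold, n_steps)
    # forward value equals the SNN spike count; gradients follow the ANN branch
    count_out = count_ann + (count_snn - count_ann).detach()
    return spikes_out, count_out
```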

4.2 渐进串联学习的调度/Scheduling of Progressive Tandem Learning

PTL框架要求为每个训练阶段确定一个调度。受[36]的启发,我们提出了一个自适应训练调度器来自动化PTL过程。如图2D所示,在每个训练周期结束时,我们根据当前验证损失和当前训练阶段的最佳验证损失更新耐心计数器 $t$。当当前验证损失改善时,耐心计数器重置为零;否则,耐心计数器加1。耐心计数器的作用与ANN学习率调度器的耐心参数类似:在ANN训练过程中,耐心参数决定学习率衰减何时发生;而这里的耐心计数器决定何时应该转换下一层。一旦耐心计数器达到预定义的耐心周期 $T_p$,在冻结已训练SNN层的权重之前,将具有最佳验证损失的混合网络参数(即当前训练阶段的最佳模型)重新加载到网络中。当最后一个ANN层被SNN层取代后,训练过程结束。算法1给出了所提出的分层ANN-to-SNN转换框架的伪代码。

The PTL framework requires a schedule to be determined for each training stage. Inspired from [36], we propose an adaptive training scheduler to automate the PTL process. As shown in Fig. 2D, at the end of each training epoch we update the patience counter $t$ based on the current validation loss and the best validation loss at the current training stage. The patience counter is reset to zero when the current validation loss improves, otherwise, the patience counter is increased by one. The patience counter serves a similar purpose to the patience parameter of an ANN learning rate scheduler. During ANN training, the patience parameter determines when the learning rate decay should happen. While the patience counter determines when the next layer should be converted. Once the patience counter reaches the pre-defined patience period $T_p$, the hybrid network parameters with the best validation loss are re-loaded to the network (i.e., the best model at the current training stage) before the weights of the trained SNN layer are frozen. The training process terminates after the last ANN layer is replaced by the SNN layer. The pseudo codes of the proposed layer-wise ANN-to-SNN conversion framework are presented in Algorithm 1.

📜 算法1/Algorithm 1
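
The adaptive scheduler of Fig. 2D can be captured by a few lines of book-keeping, as in the hypothetical helper sketched below (not the authors' code): the counter resets whenever the validation loss improves and signals the next conversion stage once it reaches the patience period $T_p$.

```python
class AdaptiveScheduler:
    """Patience-based scheduler for progressive tandem learning (Fig. 2D)."""

    def __init__(self, patience_period):
        self.patience_period = patience_period  # T_p
        self.best_loss = float("inf")
        self.patience = 0                       # patience counter t

    def step(self, val_loss):
        """Update the counter after one epoch; return True when the current
        stage should end and the next ANN layer should be converted."""
        if val_loss < self.best_loss:
            self.best_loss, self.patience = val_loss, 0   # improvement: reset
        else:
            self.patience += 1                            # no improvement: count up
        return self.patience >= self.patience_period

    def new_stage(self):
        """Reset the statistics at the beginning of a new training stage."""
        self.best_loss, self.patience = float("inf"), 0
```

In such a sketch, `new_stage()` would be called at the beginning of every conversion stage, and the stage ends (best checkpoint reloaded, SNN layer frozen) as soon as `step()` returns `True`.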

4.3 其他硬件约束的优化/Optimizing for Other Hardware Constraints

PTL框架还允许其他硬件约束,如非易失性存储设备的有限电导状态和神经形态架构中有限的扇入连接,在训练过程中很容易被纳入。因此,它极大地促进了硬件-算法的协同设计,并允许在将训练好的SNN模型部署到实际的神经形态硬件上时实现最佳性能。

The PTL framework also allows other hardware constraints, such as the limited conductance states of non-volatile memory devices and limited fan-in connections in the neuromorphic architecture, to be incorporated easily during training. It hence greatly facilitates hardware-algorithm co-design and allows optimal performance to be achieved when deploying the trained SNN models onto the actual neuromorphic hardware.

为了阐明这一前景,我们重点讨论了导致SNN权重精度有限的有限电导状态的约束。具体来说,我们探索了量化感知训练[35]方法,在训练过程中逐步施加低精度权重。如图3所示,按照Eq.(8)中描述的激活量化的类似流程,在共享给SNN层之前,将网络权重和偏置项量化到所需的精度。而它们的全精度副本保存在ANN层中,以继续高精度的学习。PTL框架提供的灵活性允许SNN模型逐步导航到合适的参数空间,以适应各种硬件约束。

To elucidate on this prospect, we focus on the constraint of limited conductance states that will lead to the limited weight precision for SNNs. Specifically, we explored the quantization-aware training [35] method whereby the low-precision weights are imposed progressively during training. As illustrated in Fig. 3, following the similar procedures that have been described for activation quantization in Eq. (8), the network weights and bias terms are quantized to a desirable precision before sharing to the SNN layer. While their full-precision copies are kept in the ANN layer to continue the learning with high precision. The flexibility provided by the PTL framework allows the SNN model to progressively navigate to a suitable parameter space to accommodate various hardware constraints.

🖼️ 图3 量化感知训练的说明,可以纳入拟议的PTL框架。将神经网络神经元的全精度权重和偏差项量化到所需的精度,然后与耦合的脉冲神经元共享。

Fig. 3. Illustration of the quantization-aware training that can be incorporated into the proposed PTL framework. The full precision weight and bias terms of ANN neurons are quantized to the desired precision before sharing with the coupled spiking neurons.
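
A common way to impose such a weight precision constraint, consistent with quantization-aware training in the spirit of [35], is sketched below: the full-precision weights are kept for the ANN updates, while a uniformly quantized copy is what gets shared with the coupled SNN layer (the symmetric scheme and the 4-bit default are illustrative assumptions, not the paper's exact recipe).

```python
import torch

def quantize_weight(w, num_bits=4):
    """Symmetric uniform weight quantization with a straight-through estimator:
    the forward pass uses the quantized copy shared with the SNN layer, while
    gradients still update the full-precision weights kept in the ANN layer."""
    n_levels = 2 ** (num_bits - 1) - 1          # e.g. 7 positive levels for 4 bits
    scale = w.detach().abs().max() / n_levels   # quantization scale
    w_q = torch.round(w / scale).clamp(-n_levels, n_levels) * scale
    return w + (w_q - w).detach()               # value of w_q, gradient of w
```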

5 模式分类实验/Experiments on Pattern Classification

在本节中,我们首先研究了基于脉冲的学习方法的可伸缩性,这激发了在微调转换后的SNN时提出分层学习方法的建议。其次,我们展示了所提出的PTL框架在大规模物体识别任务上的学习有效性和可扩展性。第三,我们研究了算法-硬件协同设计方法的有效性,该方法将硬件约束纳入转换过程,并以低精度神经形态硬件的量化感知训练为例。最后,我们研究了所提出的转换框架的训练效率,以及训练后SNN模型的推理速度和能量效率的改进。

In this section, we first investigate the scalability of spike-based learning methods, which motivates the proposal of a layer-wise learning method in fine-tuning the converted SNN. Second, we demonstrate the learning effectiveness and scalability of the proposed PTL framework on large-scale object recognition tasks. Third, we investigate the effectiveness of the algorithm-hardware co-design methodology, that incorporates hardware constraints into the conversion process, with an example on the quantization-aware training for low precision neuromorphic hardware. Finally, we study the training efficiency of the proposed conversion framework as well as the improvements in the inference speed and energy efficiency of the trained SNN models.

5.1 实验设置/Experimental Setup

我们使用PyTorch库执行所有实验,该库支持多GPU机器上的加速和内存高效训练。在一个离散时间仿真中,我们使用IF神经元在Pytorch中实现了定制的线性层和卷积层。我们使用Adam优化器[38]进行所有的实验。为了提高训练效率,我们在每个卷积层和线性层之后增加了批量归一化(batch normalization, BN)层[39]。按照[27]中介绍的方法,我们将BN层的参数整合到它们之前的卷积层或线性层的权重中,然后与耦合的SNN层共享。除非另有说明,否则本节的模式分类任务和下一节将介绍的信号重建任务都将使用此设置。

We perform all experiments with the PyTorch library, which supports accelerated and memory-efficient training on multi-GPU machines. Under a discrete-time simulation, we implement the customized linear and convolution layers in PyTorch using IF neurons. We use the Adam optimizer [38] for all the experiments. To improve the training efficiency, we add a batch normalization (BN) layer [39] after each convolution and linear layer. Following the approach introduced in [27], we integrate the parameters of the BN layers into the weights of their preceding convolution or linear layers before sharing them with the coupled SNN layers. We use this setup consistently for both the pattern classification tasks of this section and the signal reconstruction tasks that will be presented in the next section, unless otherwise stated.
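The BN-folding step mentioned above follows a standard identity; the sketch below shows how the BN parameters could be absorbed into the preceding convolution layer in PyTorch. It is illustrative and not taken from the authors' released code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Absorb a BatchNorm layer into the preceding convolution so that the folded
    weights and biases can be shared with the coupled SNN layer."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                                # per-channel scaling factor gamma / std
    conv.weight.mul_(scale.reshape(-1, 1, 1, 1))           # w' = w * gamma / std
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    conv.bias = nn.Parameter((bias - bn.running_mean) * scale + bn.bias)  # b' = (b - mu) * gamma / std + beta
    return conv
```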

数据集。我们在MNIST[40]、Cifar-10[41]和ImageNet-12数据集[42]上进行了物体识别实验，这些数据集被机器学习和神经形态计算社区广泛用于基准测试不同的学习算法。MNIST手写数字数据集由28×28像素的灰度数字组成，分为60,000个训练样本和10,000个测试样本。Cifar-10数据集由来自10个类别、尺寸为32×32×3的60,000张彩色图像组成，标准划分为50,000张训练图像和10,000张测试图像。大规模ImageNet-12数据集由来自1000个对象类别的120多万张高分辨率图像组成。对于Cifar-10和MNIST数据集，我们将原始训练集以9:1的比例随机划分为训练集和验证集，并在之后的所有实验中保持固定。对于ImageNet-12数据集，所有实验均遵循标准数据划分。

Dataset. We perform the object recognition experiments on the MNIST [40], Cifar-10 [41] and ImageNet-12 datasets [42], which are widely used in the machine learning and neuromorphic computing communities to benchmark different learning algorithms. The MNIST handwritten digits dataset consists of grayscale digits of 28×28 pixels, split into 60,000 training and 10,000 testing samples. The Cifar-10 dataset consists of 60,000 color images of size 32×32×3 from 10 classes, with a standard split of 50,000 and 10,000 for train and test, respectively. The large-scale ImageNet-12 dataset consists of over 1.2 million high-resolution images from 1,000 object categories. For the Cifar-10 and MNIST datasets, we randomly split the original train set into train and validation sets with a split ratio of 9:1, which are fixed afterward for all the experiments. For the ImageNet-12 dataset, the standard data split is followed for all experiments.

网络、实施和评估度量。在Cifar-10数据集上探索了两种经典的CNN架构:AlexNet[3]和VGG-11[43]。对于ImageNet-12数据集,我们使用AlexNet和VGG-16[43]架构进行了实验,以方便与其他现有的ANN-to-SNN转换工作进行比较。

Network, Implementation and Evaluation Metric. Two classical CNN architectures are explored on the Cifar-10 dataset: AlexNet [3] and VGG-11 [43]. For the ImageNet-12 dataset, we performed experiments with AlexNet and VGG-16 [43] architectures to facilitate comparison with other existing ANN-to-SNN conversion works.

我们还在MNIST和Cifar-10数据集上进行了不同权重精度的量化感知训练实验。对于MNIST数据集，使用结构为28×28-c16s1-c32s2-c32s1-c64s2-800-10的卷积神经网络，其中“c”和“s”后的数字分别表示卷积滤波器的数量和每个卷积层的步幅。所有卷积层都一致地使用大小为3的卷积核。对于Cifar-10数据集，我们采用了AlexNet架构。

We also performed experiments with quantization-aware training of different weight precisions on the MNIST and Cifar-10 datasets. For the MNIST dataset, a convolutional neural network with the structure of 28×28-c16s1-c32s2-c32s1-c64s2-800-10 is used, wherein the numbers after ‘c’ and ‘s’ refer to the number of convolution filters and the stride of each convolution layer, respectively. A kernel size of 3 is used consistently for all convolution layers. For the Cifar-10 dataset, we used the AlexNet architecture.

对于所有实验，网络使用交叉熵损失函数训练100个epoch。耐心期$T_p$通过逐步增大、以匹配可用训练epoch数的方式进行微调。学习率初始化为$10^{-3}$，并在第50个epoch衰减到$10^{-4}$。使用随机选择的一个训练批次中所有$a^l_i$的第99百分位数来确定发射阈值。对于Cifar-10数据集，报告5次独立运行的最佳测试精度；而对ImageNet-12数据集只执行一次运行。

For all experiments, the networks are trained for 100 epochs using the cross-entropy loss function. The patience period $T_p$ is fine-tuned by progressively increasing it to match the number of available training epochs. The learning rate is initialized at $10^{-3}$ and decayed to $10^{-4}$ at Epoch 50. The 99th percentile of all $a^l_i$ in a randomly selected training batch is used to determine the firing threshold. The best test accuracy across 5 independent runs is reported for the Cifar-10 dataset, while only a single run is performed for the ImageNet-12 dataset.
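The firing-threshold rule stated above can be expressed compactly; the snippet below is a minimal sketch assuming the layer's ANN pre-activations $a^l_i$ from one training batch have already been collected into an array.

```python
import numpy as np

def firing_threshold_from_percentile(preactivations, percentile=99.0):
    """Firing threshold of a layer, taken as the 99th percentile of the ANN
    pre-activations a_i^l collected over one randomly selected training batch."""
    return np.percentile(np.asarray(preactivations).ravel(), percentile)
```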

为了评估转换后的SNN模型相对于其ANN对应模型的能量效率，我们遵循神经形态计算社区的惯例，统计总突触操作数[27]。对于SNN，如下式所定义，总突触操作(SynOps，AC操作)与神经元的放电率、扇出$f_{out}$(到后续层的传出连接数量)和时间窗口大小$N_s$相关。

To evaluate the energy efficiency of the converted SNN models relative to their ANN counterparts, we follow the convention of the neuromorphic computing community by counting the total synaptic operations [27]. For SNNs, as defined below, the total synaptic operations (SynOps, AC operations) correlate with the neurons’ firing rates, the fan-out $f_{out}$ (number of outgoing connections to the subsequent layer), and the time window size $N_s$.

\begin{equation}\text{SynOps}=\sum^{N_s}_{t=1}\sum^{L-1}_{l=1}\sum^{Q^l}_{j=1}f^l_{out,j}s^l_j[t]\end{equation}

其中$L$为层数总数，$Q^l$为层$l$中的神经元总数。

where $L$ is the total number of layers and $Q^l$ denotes the total number of neurons in layer $l$.

相比之下，在ANN中对一张图像进行分类所需的总突触操作(MAC操作)如下所示

In contrast, the total synaptic operations (MAC operations) required to classify one image in the ANN are given as follows

\begin{equation}\text{SynOps}=\sum^{L}_{l=1}f^l_{in}Q^l\end{equation}

其中$f^l_{in}$表示到达第$l$层每个神经元的传入连接数。

where $f^l_{in}$ denotes the number of incoming connections to each neuron in layer $l$.
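For clarity, the two SynOps definitions above can be computed as in the following sketch; the spike trains, fan-out and fan-in arrays are assumed to have been collected from a simulation, and the data layout is an assumption made for illustration.

```python
import numpy as np

def snn_synops(spikes, fan_out):
    """Total synaptic (AC) operations of an SNN, following the SNN SynOps definition above.
    `spikes[l]` is assumed to be a binary array of shape (N_s, Q_l) for layer l,
    and `fan_out[l]` holds the fan-out of every neuron in that layer."""
    return sum(float(np.sum(layer_spikes.sum(axis=0) * layer_fan_out))
               for layer_spikes, layer_fan_out in zip(spikes, fan_out))

def ann_synops(fan_in, num_neurons):
    """Total synaptic (MAC) operations of the ANN counterpart, following the ANN SynOps definition above."""
    return sum(f * q for f, q in zip(fan_in, num_neurons))
```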

5.2 基于脉冲的端到端学习导致累积梯度近似误差/End-to-End Spike-Based Learning Leads to Accumulated Gradient Approximation Errors

如第4节所述,为了补偿由原始ANN-to-SNN转换引起的误差,需要一种训练方法来微调网络权重。在这里,我们以Cifar-10数据集上的物体识别任务为例,研究基于脉冲的学习方法在训练深度SNN执行快速模式识别时的可扩展性。具体来说,我们分别实现了[16]和[24]中提出的代理梯度学习方法和串联学习方法。本研究采用的网络结构取自著名的VGGNet[43]。

As discussed in Section 4, to compensate for the errors arising from the primitive ANN-to-SNN conversion, a training method is required to fine-tune the network weights. Here, we take the object recognition task on the Cifar-10 dataset as an example to study the scalability of spike-based learning methods in training deep SNNs to perform rapid pattern recognition. Specifically, we implemented the surrogate gradient learning method and tandem learning method proposed in [16] and [24], respectively. The network structures employed in this study are taken from the famous VGGNet [43].

在编码时间窗$N_s$为8的情况下，不同网络深度的ANN模型和SNN模型的学习曲线如图4所示。如图4A所示，尽管VGG13和VGG16模型有轻微过拟合，但所有ANN模型的训练都很容易收敛。相比之下，如图4B和4C所示，网络深度超过10层的脉冲对应网络训练难以收敛。这一观察结果表明，基于脉冲的学习方法的梯度近似误差倾向于在层间累积，并显著降低了超过10层的深度SNN的学习性能。因此，这些端到端学习方法不能很好地用于原始网络转换后所需的微调任务。在接下来的部分中，我们将展示所提出的每次只微调一层的PTL框架，可以有效地克服累积的梯度近似误差，并自由扩展到16层的深度SNN。

With an encoding time window $N_s$ of 8, the learning curves for ANN and SNN models with different network depths are presented in Fig. 4. As shown in Fig. 4A, the training converges easily for all ANN models, despite slight overfitting observed for the VGG13 and VGG16 models. In contrast, the training convergence is difficult for the spiking counterparts that have a network depth of over 10 layers, as shown in Figs. 4B and 4C. This observation suggests the gradient approximation error tends to accumulate over layers with the spike-based learning methods and significantly degrades the learning performance for deep SNNs over 10 layers. Therefore, these end-to-end learning methods would not work well for the fine-tuning task required after the primitive network conversion. In the following sections, we will show that the proposed PTL framework, which performs fine-tuning one layer at a time, can effectively overcome the accumulated gradient approximation errors and scale up freely to deep SNNs with 16 layers.

🖼️ 图4 Cifar-10数据集上的学习曲线说明。(A)ANN模型。(B)基于脉冲计数的串联学习[24]训练的SNN模型。(C)采用基于时间的代理梯度学习[17]训练的SNN模型。值得注意的是，第50个epoch处学习曲线的跳跃是由于学习率的衰减。

Fig. 4. Illustration of learning curves on the Cifar-10 dataset. (A) ANN models. (B) SNN models trained with spike count-based tandem learning [24]. (C) SNN models trained with time-based surrogate gradient learning [17]. It is worth noting that the jump of learning curves at Epoch 50 is due to the learning rate decay.

5.3 Cifar-10和ImageNet-12上的目标识别/Object Recognition on Cifar-10 and ImageNet-12

如图5所示，我们绘制了AlexNet和VGG-11模型在Cifar-10数据集上的训练进展，以说明所提出的PTL框架的有效性。正如预期的那样，由于引入了转换误差，验证精度主要在每个转换阶段的开始下降。值得注意的是，这些误差被所提出的分层学习方法抵消了，测试和验证的准确性仅用几个训练epoch就能快速恢复。总体而言，在整个训练过程中，验证和测试精度保持相对稳定，训练后甚至可以超过预训练的ANN。这表明，所提出的转换框架可以通过利用存在于ANN高维特征表示中的冗余来显著减小表示空间$N_s$。

As shown in Fig. 5, we plot the training progress of the AlexNet and VGG-11 models on the Cifar-10 dataset to illustrate the effectiveness of the proposed PTL framework. As expected, the validation accuracy drops mostly at the beginning of each conversion stage due to the conversion errors introduced. Notably, these errors are counteracted by the proposed layer-wise learning method, whereby the test and validation accuracies are restored quickly with only a few training epochs. Overall, the validation and test accuracies remain relatively stable during the whole training progress and can even surpass those of the pre-trained ANNs after training. It suggests that the proposed conversion framework can significantly reduce the representation space $N_s$ by exploiting the redundancies that exist in the high-dimensional feature representation of the ANN.

🖼️ 图5 AlexNet和VGG-11在Cifar-10数据集($N_s=16$, $T_p=6$)上的训练进展说明。阴影区域对应不同的训练阶段。在每个训练阶段的开始用等效的SNN层替换每个ANN层后，可以使用所提出的PTL框架快速恢复验证和测试准确性。在这些实验中，为了寻找更好的SNN模型，在最后的转换阶段没有采用早期终止。

Fig. 5. Illustration of the training progresses of the AlexNet and VGG-11 on the Cifar-10 dataset ($N_s=16$, $T_p=6$). The shaded regions correspond to different training stages. After replacing each ANN layer with an equivalent SNN layer at the beginning of each training stage, the validation and test accuracies can be quickly restored with the proposed PTL framework. In these experiments, to allow searching for a better SNN model, the early termination did not apply during the last conversion stage.

如表1所述，与其他具有类似网络架构的现有SNN实现相比，经过训练的深度SNN达到了最先进的分类精度，在Cifar-10数据集上，AlexNet和VGG-11的测试精度分别为90.86%和91.24%。值得一提的是，这些SNN模型的性能甚至比预训练的ANN基线分别高出1.27%和0.65%。与最近提出的用于神经形态实现的二值神经网络训练方法[36](其分类准确率为84.67%)相比，结果表明较大的编码时间窗口$N_s=16$有助于提高准确率。

As reported in Table 1, the trained deep SNNs achieve state-of-the-art classification accuracies over other existing SNN implementations with similar network architecture, with a test accuracy of 90.86% and 91.24% for AlexNet and VGG-11 respectively on the Cifar-10 dataset. It is worth mentioning that these SNN models even outperform their pre-trained ANN baselines by 1.27% and 0.65%. In comparison with a recently introduced binary neural network training method for neuromorphic implementation [36], which achieved a classification accuracy of 84.67%, the results suggest that the larger encoding time window $N_s=16$ contributes to the higher accuracy.

📊 表1 Cifar-10和ImageNet-12测试集上不同SNN实现的分类准确率比较。精度一列的圆括号内外的数字分别表示top-1和top-5的精度。

Comparison of Classification Accuracy of Different SNN Implementations on the Cifar-10 and ImageNet-12 Test Sets.

The numbers inside and outside the round bracket of the ‘Accuracy’ column refer to the top-1 and top-5 accuracy, respectively.

为了研究所提出的PTL框架在更复杂的数据集和网络架构上的可伸缩性,我们在具有挑战性的ImageNet-12数据集上进行了实验。由于深度SNN建模的计算复杂度很高,存储其中间状态需要巨大的内存,只有有限数量的ANN-to-SNN转换方法在该数据集上取得了一些有前景的结果。

To study the scalability of the proposed PTL framework on more complex datasets and network architectures, we conduct experiments on the challenging ImageNet-12 dataset. Due to the high computational complexity of modeling deep SNNs and the huge memory demand to store their intermediate states, only a limited number of ANN-to-SNN conversion methods have achieved some promising results on this dataset.

如表1所述，使用所提出的PTL框架训练的脉冲AlexNet和VGG-16模型在ImageNet-12数据集上取得了有希望的结果。对于脉冲AlexNet，top-1 (top-5)精度比早期采用先约束后训练(constrain-then-train)方法[44]的工作提高了3.39%(2.21%)。同时，所需的总时间步数从200减少到16，减少了超过一个数量级。对于脉冲VGG-16，尽管总时间步数减少了至少25倍，但我们的结果与最先进的ANN-to-SNN转换方法[27],[28]所取得的结果同样具有竞争力。

As reported in Table 1, the spiking AlexNet and VGG-16 models trained with the proposed PTL framework achieve promising results on the ImageNet-12 dataset. For the spiking AlexNet, the top-1 (top-5) accuracy improves by 3.39% (2.21%) over the early work that takes a constrain-then-train approach [44]. Meanwhile, the total number of time steps required is reduced by more than one order of magnitude, from 200 to 16. For the spiking VGG-16, despite the total number of time steps being reduced by at least 25 times, our result is as competitive as those achieved with the state-of-the-art ANN-to-SNN conversion approaches [27], [28].

Nitin等人[45]最近应用了一种基于脉冲的学习方法，对转换后的SNN权重进行端到端微调，以提高模型在运行时的速度。该方法成功地将总时间步数从2500减少到250，在ImageNet-12数据集上的准确率下降了约3%。相比之下，本文提出的离散神经表示提供了一个更好的网络初始化，允许更大幅度地缩减编码时间窗口。值得注意的是，我们系统的分类精度与他们的相当，而总共只需要16个时间步。虽然我们的SNN模型相比预训练的AlexNet和VGG-16模型分别下降了约3%和6%，但这比ANN-to-SNN转换所报告的16.6%的精度下降[46]要好得多。此外，预期通过提供更大的表示空间$N_s$可以弥合这一精度差距。

Nitin et al. [45] recently applied a spike-based learning method to fine-tune the weights of the converted SNN end-to-end, so as to speed up the model at run time. This method successfully reduces the total time steps from 2,500 to 250, with an accuracy drop of about 3% on the ImageNet-12 dataset. In contrast, the discrete neural representation proposed in this work provides an improved network initialization that allows for a more radical reduction in the encoding time window. Notably, the classification accuracy of our system is on par with theirs, while requiring only a total of 16 time steps. Although our SNN models drop from the pre-trained AlexNet and VGG-16 models by about 3% and 6% respectively, this is much better than that obtained from the ANN-to-SNN conversion which is reported to have an accuracy drop of 16.6% [46]. Moreover, it is expected that our accuracy drop could be closed by providing a larger representation space $N_s$.

5.4 低精度神经形态硬件量化感知训练/Quantization-Aware Training for Low Precision Neuromorphic Hardware

表2提供了量化感知训练的目标识别结果。在MNIST和Cifar-10数据集上，尽管比特宽度减小且表示空间有限(即$N_s=16$)，低精度SNN模型仍表现非常好。具体而言，当权重量化为4位时，MNIST和Cifar-10数据集的分类精度分别仅下降0.03%和0.85%。因此，所提出的PTL框架为在低精度神经形态硬件(例如受有限电导状态限制的新兴非易失性存储器件)上实现SNN提供了巨大的机会。

Table 2 provides the object recognition results with the quantization-aware training. On the MNIST and Cifar-10 datasets, the low-precision SNN models perform exceedingly well regardless of the reduced bit-width and the limited representation space (i.e., $N_s=16$). Specifically, when the weights are quantized to 4-bit, the classification accuracy drops by only 0.03% and 0.85% on the MNIST and Cifar-10 datasets, respectively. Therefore, the proposed PTL framework offers immense opportunities for implementing SNNs on low-precision neuromorphic hardware, for instance with emerging non-volatile memory devices that suffer from limited conductance states.

📊 表2 分类结果作为权重精度函数的比较。通过量化感知训练得到SNN模型的结果。报告了5次独立运行的平均结果。

Comparison of the Classification Results as a Function of Weight Precision.

The result of SNN models is obtained through quantization-aware training. The average results across 5 independent runs are reported.

5.5 基于SNN的快速高效分类/Rapid and Efficient Classification With SNNs

当在神经形态芯片上实现时,SNN相较于ANN,在提高实时性能和能量效率方面具有很大的潜力。然而,基于发射速率假设的学习方法需要较长的推理时间,通常需要几百到数千个时间步才能达到稳定的网络发射状态。它们降低了SNN的异步操作所带来的延迟优势。相比之下,所提出的转换框架允许有效利用可用的时间步长,这样就可以在ImageNet-12数据集上仅使用16个时间步长进行快速推断。如图6A所示,我们注意到Cifar-10数据集上的编码时间窗大小与分类精度之间存在明显的正相关关系。值得注意的是,在二进制神经网络场景中,当训练SNN利用有限的信息量时,只需要一个时间步就可以做出可靠的预测,而当提供更大的编码时间窗口时,性能可以进一步提高。

When implemented on the neuromorphic chips, the SNNs have great potential to improve the real-time performance and energy efficiency over ANNs. However, the learning methods grounded on the firing rate assumption require long inference time, typically a few hundred to thousands of time steps, to reach a stable network firing state. They diminish the latency advantages that can be obtained from the asynchronous operation of SNNs. In contrast, the proposed conversion framework allows making efficient use of the available time steps, such that rapid inference can be performed with only 16 time steps on the ImageNet-12 dataset. As shown in Fig. 6A, we notice a clear positive correlation between the encoding time window size and the classification accuracy on the Cifar-10 dataset. Notably, a reliable prediction can still be made with only a single time step when SNN is trained to utilize this limited amount of information as in the scenario of binary neural networks, while the performance can be further improved when larger encoding time windows are provided.

🖼️ 图6 (A)分类精度作为编码时间窗口在Cifar-10数据集上的函数。水平虚线指的是预训练的人工神经网络的精度。(B)在Cifar-10数据集上,SNN和ANN之间的总突触操作的比例作为编码时间窗口的函数。(C)分类精度作为自适应调度器中定义的耐心周期的函数。(D)完成周期作为耐心周期的函数。所有实验结果都是在5个独立运行的脉冲AlexNet中总结出来的。误差条表示5次运行的一个标准偏差。

Fig. 6. (A) Classification accuracy as a function of the encoding time window on the Cifar-10 dataset. The horizontal dashed line refers to the accuracy of the pre-trained ANN. (B) The ratio of total synaptic operations between SNN and ANN as a function of the encoding time window on the Cifar-10 dataset. (C) Classification accuracy as a function of the patience period defined in the adaptive scheduler. (D) Finishing epoch as a function of the patience period. All experimental results are summarized over 5 independent runs with the spiking AlexNet. The error bars represent one standard deviation across the 5 runs.

为了进一步研究训练后SNN模型的能量效率，我们遵循惯例，统计每次推理的突触操作数，并计算其与相应ANN模型的比率[24],[27]。一般情况下，ANN所需的总突触操作数是一个仅取决于网络架构的常数，而SNN的总突触操作数与编码时间窗口和发射率正相关。如图6B所示，在等精度设置下，即当ANN模型和SNN模型达到相同的精度时，SNN ($N_s=8$)仅消耗约为ANN对应模型0.315倍的总突触操作。相比之下，在类似的VGGNet-9网络[47]上，采用ANN-to-SNN转换和基于脉冲的学习方法的最先进SNN实现所报告的SynOps比分别为25.60和3.61。这表明我们的SNN实现在运行时的效率分别高出81.27倍和11.46倍。

To further study the energy efficiency of the trained SNN models, we follow the convention by counting the synaptic operations per inference and calculating the ratio to the corresponding ANN models [24], [27]. In general, the total synaptic operations required by the ANN is a constant number depending on the network architecture, while it positively correlates with the encoding time window and the firing rate for SNNs. As shown in Fig. 6B, under the iso-accuracy setting, when the ANN and SNN models achieve an equal accuracy, the SNN ($N_s=8$) consumes only around 0.315 times the total synaptic operations of the ANN counterpart. In contrast, the state-of-the-art SNN implementations with the ANN-to-SNN conversion and spike-based learning methods have reported SynOps ratios of 25.60 and 3.61 respectively on a similar VGGNet-9 network [47]. It suggests our SNN implementation is 81.27 and 11.46 times more efficient at run-time, respectively.

值得注意的是,SNN主要执行累加(AC)操作,以整合来自传入脉冲的膜电位贡献。相比之下,在ANN中使用乘法-累加(MAC)运算,这在能量消耗和芯片面积使用方面显着更昂贵。例如,在Global Foundry 28nm工艺的模拟中,MAC操作的成本是AC操作的14倍,需要21倍的芯片面积。因此,相比于ANN,通过采用稀疏和廉价的AC操作,我们的SNN模型可以节省超过40倍的成本,并且可以从高效的神经形态芯片架构设计和新兴的超低功耗器件实现中进一步提高成本节约。值得一提的是,由4.3节中提出的量化感知训练策略支持的低精度网络可以进一步降低计算成本和内存占用。

It is worth noting that SNNs perform mostly accumulate (AC) operations to integrate the membrane potential contributions from incoming spikes. In contrast, multiply-accumulate (MAC) operations are used in ANNs, which are significantly more expensive in terms of energy consumption and chip area usage. For instance, simulations in a GlobalFoundries 28 nm process report that the MAC operation is 14x more costly than the AC operation and requires 21x the chip area [27]. Therefore, over 40 times cost savings can be obtained by our SNN models by taking the sparse and cheap AC operations over the ANN counterparts, and the cost savings can be further boosted by efficient neuromorphic chip architecture designs and emerging ultra-low-power device implementations. It is worth mentioning that low-precision networks supported by the quantization-aware training strategy proposed in Section 4.3 can further reduce the computing cost and memory footprint.
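A back-of-the-envelope check of the "over 40 times" figure, combining the 14x MAC-vs-AC cost quoted above with the iso-accuracy SynOps ratio of about 0.315 reported earlier; this is purely illustrative arithmetic, not a hardware measurement.

```python
def energy_saving_ratio(synops_ratio, mac_to_ac_cost=14.0):
    """Rough energy-saving estimate of an SNN over its ANN counterpart: the ANN
    spends one MAC per synaptic operation while the SNN spends one AC, so the
    saving is (MAC cost / AC cost) divided by the SNN-to-ANN SynOps ratio."""
    return mac_to_ac_cost / synops_ratio

# With the iso-accuracy SynOps ratio of about 0.315 reported above:
# energy_saving_ratio(0.315) ≈ 44, consistent with the "over 40 times" saving.
```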

图6C和6D给出了分类结果和所需的训练epoch数作为自适应训练调度器中耐心期的函数。如图6C所示，即使耐心期仅为1，也可以达到超过预训练ANN模型的有竞争力的分类精度，而如图6D所示，此时平均仅需18个训练epoch。如果给予更长的耐心期，精度还可以进一步提高。

Figs. 6C and 6D present the classification results and the required training epochs as a function of the patience period in the adaptive training scheduler. As shown in Fig. 6C, a competitive classification accuracy that surpasses the pre-trained ANN model can be achieved even with a patience period of only 1, which requires an average of only 18 training epochs as shown in Fig. 6D. The accuracy can be further improved if a longer patience period is given.

6 信号重构实验/Experiments on Signal Reconstruction

在第5节中,我们展示了所提出的PTL框架在模式分类任务上的卓越学习能力和可伸缩性。现有的ANN-to-SNN转换工作主要集中在模式分类任务上,不需要高精度的输出。然而,像信号重建这样的回归任务需要SNN模型使用脉冲来预测高精度输出,这一点还没有得到很好的探索。在本节中,我们进一步应用SNN来解决已知的对SNN具有挑战性的模式回归任务。具体来说,我们对图像重建和语音分离任务进行了实验,这两个任务都需要重构高保真信号。

In Section 5, we demonstrate superior learning capability and scalability of the proposed PTL framework on pattern classification tasks. The existing ANN-to-SNN conversion works mainly focus on the pattern classification tasks, where a high-precision output is not required. The regression tasks like signal reconstruction however require the SNN model to predict high precision outputs using spikes, which have not been well explored. In this section, we further apply SNNs to solve pattern regression tasks that are known to be challenging for SNNs. Specifically, we perform experiments on the image reconstruction and speech separation tasks, both of which require reconstructing high-fidelity signals.

6.1 图像重建与自动编码器/Image Reconstruction With Autoencoder

自编码器是一种神经网络，它学习将输入信号分解为紧凑的潜在表示，然后使用该表示尽可能地重建原始信号[48]。通常，自编码器通过一个维数低于输入的瓶颈层来学习紧凑的潜在表示。通过这种方式，它忽略变化、消除噪声并解开混杂的信息。在这里，我们研究使用脉冲计数对静态图像进行紧凑潜在表示的提取与重建。

An autoencoder is a type of neural network that learns to decompose input signals into a compact latent representation, and then use that representation to reconstruct the original signals as closely as possible [48]. Typically an autoencoder learns compact latent representations through a bottleneck layer that has a reduced dimensionality over the input. In this way, it ignores the variation, removes the noise, and disentangles a mixture of information. Here, we investigate the compact latent representation extraction and reconstruction for static images using spike counts.

6.2 时域语音分离/Time-Domain Speech Separation

语音分离是解决鸡尾酒会问题的方案之一，在鸡尾酒会问题中，人们期望在多人谈话的场景中有选择地聆听特定的说话人[49]。生理学研究表明，选择性听觉注意既在局部发生(通过改变单个神经元的感受野特性)，也在整个听觉皮层中发生(通过皮层回路的快速神经适应或可塑性)[50],[51]。然而，机器尚未达到与人类相同的、将混合刺激分离成不同流的注意能力。这种听觉注意能力在现实应用中有很高的需求，如助听器[52]、语音识别[53]、说话人验证[54]和说话人日志(speaker diarization)[55]等。

Speech separation is one of the solutions for the cocktail party problem, where one is expected to selectively listen to a particular speaker in a multi-talker scenario [49]. Physiological studies reveal that selective auditory attention takes place both locally, by transforming the receptive field properties of individual neurons, and globally throughout the auditory cortex, by rapid neural adaptation, or plasticity, of the cortical circuits [50], [51]. However, machines have yet to achieve the same attention ability as humans in segregating mixed stimuli into different streams. Such auditory attention capability is highly demanded in real-world applications, such as hearing aids [52], speech recognition [53], speaker verification [54], and speaker diarization [55].

受深度ANN方法在时域语音分离和提取[56],[57]方面最新进展的启发，我们提出并实现了一种基于深度SNN的语音分离方案。如图7所示，SNN将混合语音作为输入，并将各个说话人的语音生成到单独的流中。通过一组堆叠的扩张卷积层，SNN以可控的参数量捕获语音信号的长程依赖关系。该网络通过最大化尺度不变信号失真比(SI-SDR)[58]损失进行优化，以实现高保真的语音重建。

Inspired by the recent progress in deep ANN approaches to time-domain speech separation and extraction [56], [57], we propose and implement a deep SNN-based solution for speech separation. As shown in Fig. 7, the SNN takes the mixture speech as input and generates individual speech into separate streams. With a stack of dilated convolutional layers, the SNN captures the long-range dependency of speech signals with a manageable number of parameters. It is optimized to maximize a scale-invariant signal-to-distortion ratio (SI-SDR) [58] loss for high fidelity speech reconstruction.

🖼️ 图7 (A)基于SNN的语音分离方法解决鸡尾酒会问题的说明。(B)所提出的基于SNN的语音分离网络示意图。它以两个说话人的混合语音作为输入，输出两个独立的流，每个流对应一个说话人。“1d-Conv”表示一维卷积。“1×1 Conv”是具有1×1卷积核的卷积。“d-Conv”是扩张卷积。“Deconv”是反卷积(也称为转置卷积)。“ReLU”是整流线性单元函数。“BN”表示批归一化。⨂表示逐元素乘法运算。

Fig. 7. (A) Illustration of the SNN-based speech separation approach to solving the cocktail party problem. (B) Illustration of the proposed SNN-based speech separation network. It takes two speakers mixture as input and outputs two independent streams for each individual speaker. “1d-Conv” indicates a 1-dimensional convolution. “1×1 Conv” is a convolution with a 1×1 kernel. “d-Conv” is a dilated convolution. “Deconv” is a deconvolution (also known as transposed convolution). “ReLU” is a rectified linear unit function. “BN” represents batch normalization. ⨂ refers to the element-wise multiplication.

所提出的基于SNN的语音分离框架由编码器、分离器和解码器三部分组成，如图7所示。编码器将时域混合信号转换为高维表示，然后将其作为分离器的输入。分离器在每个时间步为每个说话人估计一个掩码。在此之后，通过用该说话人的估计掩码对混合输入的编码表示进行滤波，提取每个说话人的合适表示。最后，利用解码器重构每个说话人的时域信号。

The proposed SNN-based speech separation framework consists of three components: an encoder, a separator, and a decoder, as shown in Fig. 7. The encoder transforms the time-domain mixture signal into a high-dimensional representation, which is then taken as the input to the separator. The separator estimates a mask for each speaker at each time step. After that, a suitable representation for every individual speaker is extracted by filtering the encoded representation of the input mixture with the estimated mask for that speaker. Finally, the time-domain signal of each speaker is reconstructed using a decoder.

6.3 实验设置/Experimental Setup

下面,我们将介绍为图像重建和语音分离任务设计的实验。通过应用PTL框架,将预训练ANN转换为SNN,用于这些任务中的高保真信号重建。

In the following, we will present the experiments designed for image reconstruction and speech separation tasks. By applying the PTL framework, the pre-trained ANNs are converted into SNNs for high-fidelity signal reconstruction in these tasks.

6.3.1 图像重建/Image Reconstruction

6.3.1.1 数据集。图像重建任务使用MNIST数据集[40],其中包含6万个训练样本和1万个测试样本。这些样本直接用于训练和测试,而不应用任何数据预处理步骤。

6.3.1.1 Dataset. The MNIST dataset [40] is used for the image reconstruction task, which consists of 60,000 training and 10,000 test samples. These samples are directly used for training and testing without applying any data pre-processing steps.

6.3.1.2 网络、实施和评估指标。我们评估了一个全连接自编码器,其架构为784-128-64-32-64-128-784,其中数字指的是每层神经元的数量[36]。在输出层中使用sigmoid激活函数来规范化输出以匹配输入范围,而其余层使用ReLU激活函数。按照3.4节中介绍的神经编码方案,我们不再使用脉冲计数,而是将最终SNN层中脉冲神经元的自由聚合膜电位作为sigmoid激活函数的预激活量,从而提供了高分辨率的重建。使用均方误差(MSE)损失函数对网络进行100个epoch的训练,并将训练调度器的耐心周期TpT_pTp​设置为6。我们报告了在不同编码时间窗大小的MNIST测试集上重构图像的峰值信噪比(PSNR)和结构相似性(SSIM)。其余的训练配置遵循第5.1节中介绍的模式分类任务中使用的配置。

6.3.1.2 Network, Implementation and Evaluation Metric. We evaluate a fully-connected autoencoder that has an architecture of 784-128-64-32-64-128-784, wherein the numbers refer to the number of neurons at each layer [36]. The sigmoid activation function is used in the output layer to normalize the output so as to match the input range, while the rest of the layers use a ReLU activation function. Following the neural coding scheme introduced in Section 3.4, instead of using the spike count, the free aggregate membrane potential of spiking neurons in the final SNN layer is considered as the pre-activation quantity to the sigmoid activation function, which provides a high-resolution reconstruction. The networks are trained for 100 epochs using the mean square error (MSE) loss function, and the patience period $T_p$ of the training scheduler is set to 6. We report the peak signal-to-noise ratio (PSNR) and Structural Similarity (SSIM) of reconstructed images on the MNIST test set with different encoding time window sizes. The rest of the training configurations follow those used in the pattern classification tasks as presented in Section 5.1.
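For reference, a minimal PyTorch definition of the 784-128-64-32-64-128-784 autoencoder described above (in its ANN form used for pre-training) might look as follows; the training loop is omitted and the exact module layout is an assumption made for illustration.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """784-128-64-32-64-128-784 fully-connected autoencoder in its ANN form used
    for pre-training: ReLU in the hidden layers, sigmoid at the output to match
    the input range."""
    def __init__(self):
        super().__init__()
        dims = [784, 128, 64, 32, 64, 128, 784]
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            layers.append(nn.ReLU() if i < len(dims) - 2 else nn.Sigmoid())
        self.net = nn.Sequential(*layers)

    def forward(self, x):                  # x: (batch, 1, 28, 28)
        return self.net(x.flatten(1))      # flattened to (batch, 784)
```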

6.3.2 时域语音分离/Time-Domain Speech Separation

6.3.2.1 数据集。我们在采样率为8 kHz的双说话人混合WSJ0-2mix数据集[59]上评估所提出的方法，该数据集通过从WSJ0语料库[60]中随机选择两个说话人的话语进行混合而得到。WSJ0-2mix语料库由三个集合组成：训练集(20,000条话语，约30小时)、开发集(5,000条话语，约8小时)和测试集(3,000条话语，约5小时)。具体来说，从WSJ0训练集(si_tr_s)中随机选取50名男性和51名女性说话人的话语，以在0 dB到5 dB之间均匀选取的信噪比(SNR)混合，生成WSJ0-2mix中的训练集和开发集。类似地，测试集是通过随机混合WSJ0开发集(si_dt_05)和评估集(si_et_05)中10名男性和8名女性说话人的话语而创建的。由于测试集中的说话人与训练集和开发集中的说话人不同，测试集被视为开放条件评估。我们使用开发集来调优参数，并将其视为封闭条件评估，因为其说话人在训练过程中出现过。训练集和开发集中的话语被切分成4秒的片段。

6.3.2.1 Dataset. We evaluated the methods on the two-talker mixed WSJ0-2mix dataset [59] with a sampling rate of 8 kHz, which was mixed by randomly choosing utterances of two speakers from the WSJ0 corpus [60]. The WSJ0-2mix corpus consists of three sets: training set (20,000 utterances, ≈30 h), development set (5,000 utterances, ≈8 h), and test set (3,000 utterances, ≈5 h). Specifically, the utterances from 50 male and 51 female speakers in the WSJ0 training set (si_tr_s) were randomly selected to generate the training and development sets in WSJ0-2mix at various signal-to-noise ratios (SNRs) uniformly chosen between 0 dB and 5 dB. Similarly, the test set was created by randomly mixing the utterances from 10 male and 8 female speakers in the WSJ0 development set (si_dt_05) and evaluation set (si_et_05). The test set was considered as the open condition evaluation because the speakers in the test set were different from those in the training and development sets. We used the development set to tune parameters and considered it as the closed condition evaluation because its speakers are seen during training. The utterances in the training and development sets were broken into 4 s segments.

6.3.2.2 网络与实现。受Conv-TasNet语音分离系统[56]的启发，所提出的基于SNN的语音分离系统首先通过带有$N$(=512)个滤波器的一维卷积和ReLU激活函数对混合输入$x(t)\in\mathbb{R}^{1\times T}$进行编码。每个滤波器的窗口为$L$(=20)个样本，步幅为$L/2$(=10)个样本。在分离器部分，对通道维度上的编码表示$A\in\mathbb{R}^{K\times N}$应用带有可训练增益和偏置参数的均值-方差归一化，其中$K$等于$2(T-L)/L+1$。随后对归一化的编码表示应用1×1卷积以及批归一化和ReLU激活。带有512个滤波器的扩张卷积重复10次，扩张率为$[2^0,2^1,\cdots,2^9]$。这些扩张卷积滤波器的核大小为1×3，步幅为1。批归一化和ReLU激活函数同样应用于扩张卷积层。然后由带sigmoid激活函数的1×1卷积为每个说话人估计一个掩码$(M_1, M_2)$。用估计的掩码$(M_1, M_2)$对编码表示$A$进行滤波，得到每个说话人的调制表示$(S_1, S_2)$。最后，由解码器重构每个说话人的时域信号$(S_1, S_2)$，解码器相当于编码器的逆过程。

6.3.2.2 Network and Implementation. Inspired by the Conv-TasNet speech separation system [56], the proposed SNN-based speech separation system first encodes the mixture input $x(t)\in\mathbb{R}^{1\times T}$ by a 1d-convolution with $N$ (=512) filters followed by the ReLU activation function. Each filter has a window of $L$ (=20) samples with a stride of $L/2$ (=10) samples. In the separator part, a mean and variance normalization with trainable gain and bias parameters is applied to the encoded representation $A\in\mathbb{R}^{K\times N}$ on the channel dimension, where $K$ is equal to $2(T-L)/L+1$. A 1×1 convolution together with batch normalization and ReLU activation is applied to the normalized encoded representation. The dilated convolutions with 512 filters are repeated 10 times with dilation ratios of $[2^0, 2^1, \cdots, 2^9]$. These dilated convolution filters have a kernel size of 1×3 and a stride of 1. The batch normalization and ReLU activation function are also applied to the dilated convolution layers. A mask $(M_1, M_2)$ for each speaker is then estimated by a 1×1 convolution with a sigmoid activation function. The modulated representation $(S_1, S_2)$ for each speaker is obtained by filtering the encoded representation $A$ with the estimated mask $(M_1, M_2)$. Finally, the time-domain signal for each speaker is reconstructed by the decoder, which acts as the inverse process of the encoder.
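To make the encoder-separator-decoder description above concrete, the following is a simplified ANN-form sketch (before the separator is converted into an SNN). The hyper-parameters follow the text (N=512, L=20, stride 10, ten dilated blocks with dilation $2^0$ to $2^9$), but details such as the normalization choice and the internal channel widths are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SeparationNet(nn.Module):
    """Encoder-separator-decoder sketch in its ANN form, before the separator is
    converted into an SNN. N=512 filters, window L=20, stride L/2=10, ten dilated
    convolution blocks with dilation ratios 2^0 .. 2^9, sigmoid mask estimation."""
    def __init__(self, N=512, L=20, num_speakers=2, num_blocks=10):
        super().__init__()
        self.num_speakers = num_speakers
        self.encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
        self.norm = nn.GroupNorm(1, N)                    # mean/variance normalization with trainable gain/bias
        self.bottleneck = nn.Conv1d(N, N, kernel_size=1)
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(N, N, kernel_size=3, dilation=2 ** b, padding=2 ** b),
                nn.BatchNorm1d(N),
                nn.ReLU(),
            ) for b in range(num_blocks)
        ])
        self.mask_conv = nn.Conv1d(N, N * num_speakers, kernel_size=1)
        self.decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

    def forward(self, mixture):                           # mixture: (batch, 1, T)
        A = torch.relu(self.encoder(mixture))             # encoded representation, (batch, N, K)
        h = torch.relu(self.bottleneck(self.norm(A)))
        for block in self.blocks:                         # stack of dilated convolutions
            h = block(h)
        masks = torch.sigmoid(self.mask_conv(h))          # (batch, N * num_speakers, K)
        return [self.decoder(A * m)                       # one reconstructed waveform per speaker
                for m in masks.chunk(self.num_speakers, dim=1)]
```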

基于ANN的系统使用从0.001开始的学习率进行优化，当开发集上的损失连续至少3个epoch上升时，学习率减半。然后，我们取预训练的ANN模型，并将其分离器部分转换为SNN。值得一提的是，聚合膜电位被用作最后一个1×1卷积层的输入，因为该层需要浮点表示来生成高分辨率的听觉掩码。SNN的编码时间窗口$N_s$和耐心期$T_p$分别设置为32和3。ANN和SNN模型均训练100个epoch，当开发集上的损失连续10个epoch没有改善时，应用提前停止方案。

The ANN-based system is optimized with a learning rate starting from 0.001, which is halved when the loss increases on the development set for at least 3 epochs. Then, we take the pre-trained ANN model and convert the separator into an SNN. It is worth mentioning that the aggregate membrane potential is applied as the input to the last 1×1 convolution layer, where a floating-point representation is required to generate high-resolution auditory masks. The encoding time window $N_s$ and patience period $T_p$ are set to 32 and 3 for SNNs, respectively. Both ANN and SNN models are trained for 100 epochs, and an early stopping scheme is applied when the loss does not improve on the development set for 10 epochs.
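The pre-training schedule described above (learning-rate halving on a development-set plateau plus early stopping) could be written as follows; `train_one_epoch` and `evaluate` are assumed callables, and the loop is a sketch rather than the authors' training script.

```python
import torch

def train_separator(model, train_one_epoch, evaluate, max_epochs=100):
    """ANN pre-training schedule: Adam starting at lr=1e-3, halved when the
    development-set loss does not improve for 3 epochs, early stopping after
    10 epochs without improvement. `train_one_epoch(model, optimizer)` and
    `evaluate(model)` are assumed callables."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=3)
    best_loss, bad_epochs = float('inf'), 0
    for _ in range(max_epochs):
        train_one_epoch(model, optimizer)
        dev_loss = evaluate(model)             # e.g. negative SI-SDR on the development set
        scheduler.step(dev_loss)               # halve the learning rate on a plateau
        if dev_loss < best_loss:
            best_loss, bad_epochs = dev_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= 10:               # early stopping
                break
    return model
```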

6.3.2.3 训练目标与评价指标。通过最大化尺度不变信号失真比(SI-SDR)[58]来优化语音分离系统,其定义为

6.3.2.3 Training Objective and Evaluation Metric. The speech separation system is optimized by maximizing the scale-invariant signal-to-distortion ratio (SI-SDR) [58], that is defined as

\begin{equation}\text{SI-SDR}=10\log_{10}\left(\frac{\left\|\frac{\langle\hat{s},s\rangle}{\langle s,s\rangle}s\right\|^2}{\left\|\frac{\langle\hat{s},s\rangle}{\langle s,s\rangle}s-\hat{s}\right\|^2}\right)\end{equation}

其中$\hat{s}$和$s$分别为分离信号和目标干净信号，$\langle\cdot,\cdot\rangle$表示内积。为了确保尺度不变性，信号$\hat{s}$和$s$在计算SI-SDR之前被归一化为零均值。由于我们不知道分离出的流属于哪个说话人(置换问题)，我们采用置换不变训练，通过在所有置换中最大化SI-SDR性能来找到最佳置换。使用SI-SDR作为评价指标，比较原始基于ANN的语音分离系统和转换后的基于SNN的语音分离系统的性能。我们还使用语音质量感知评估(PESQ)[61],[62]来评估系统，PESQ被推荐为ITU-T P.862标准，用于自动评估语音质量，以替代主观的平均意见得分(MOS)。在评估过程中，分离流与相应目标干净信号之间的置换问题按照训练阶段置换不变训练的方式确定。

where $\hat{s}$ and $s$ are the separated and target clean signals, respectively. $\langle\cdot,\cdot\rangle$ denotes the inner product. To ensure scale invariance, the signals $\hat{s}$ and $s$ are normalized to zero-mean prior to the SI-SDR calculation. Since we don't know which speaker a separated stream belongs to (the permutation problem), we adopt permutation invariant training to find the best permutation by maximizing the SI-SDR performance among all the permutations. The SI-SDR is used as the evaluation metric to compare the performances of the original ANN-based and the converted SNN-based speech separation systems. We also evaluate the systems with the Perceptual Evaluation of Speech Quality (PESQ) [61], [62], which is recommended as the ITU-T P.862 standard to automatically assess speech quality instead of the subjective Mean Opinion Score (MOS). During the evaluation, the permutation problem between the separated streams and the corresponding target clean signals is decided following the permutation invariant training during the training phase.
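The SI-SDR objective and the permutation-invariant evaluation described above can be sketched as follows for 1-D waveforms; this is an illustrative implementation, not the authors' code.

```python
import itertools
import torch

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D waveforms, following the definition above;
    both signals are normalized to zero mean before the projection."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    projection = (torch.dot(estimate, target) / (torch.dot(target, target) + eps)) * target
    noise = projection - estimate
    return 10 * torch.log10(projection.pow(2).sum() / (noise.pow(2).sum() + eps))

def pit_si_sdr(estimates, targets):
    """Permutation-invariant SI-SDR: evaluate every speaker ordering and keep the best mean score."""
    best = float('-inf')
    for perm in itertools.permutations(range(len(targets))):
        score = sum(si_sdr(estimates[i], targets[p]) for i, p in enumerate(perm)) / len(targets)
        best = max(best, float(score))
    return best
```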

6.4 实验结果/Experimental Results

6.4.1 自编码器图像重建/Image Reconstruction With Autoencoder

图像重建结果如表3所示。正如预期的那样，编码时间窗大小$N_s$与图像重建质量之间存在明显的正相关关系。值得注意的是，在编码时间窗口为32的情况下，脉冲自编码器在PSNR和SSIM指标方面实现了与预训练ANN相当的性能。如图8所示，该脉冲自编码器($N_s=32$)可以有效地重建高质量的图像。与图6A所示的物体识别结果相反，图像重建的结果表明，回归任务可能需要更大的离散表示空间或编码时间窗口，才能达到与预训练ANN相当的性能。

Table 3 provides the image reconstruction results. As expected, a clear positive correlation between the encoding time window size $N_s$ and the image reconstruction quality has been observed. Notably, with an encoding time window of 32, the spiking autoencoder achieves a comparable performance to the pre-trained ANN in terms of the PSNR and SSIM metrics. As also shown in Fig. 8, this spiking autoencoder ($N_s=32$) can effectively reconstruct images with high quality. In contrast to the object recognition results shown in Fig. 6A, the results on the image reconstruction suggest that regression tasks may require a larger discrete representation space or encoding time window to match the performance of the pre-trained ANN.

📊 表3 图像重建结果与编码时间窗大小$N_s$的函数比较。报告了5次独立运行的平均结果。

Comparison of the Image Reconstruction Results as a Function of the Encoding Time Window Size $N_s$.

The average results across 5 independent runs are reported.

🖼️ 图8 MNIST数据集上脉冲自编码器($N_s=32$)重建图像的示例。对于每对数字，左侧为原始图像，右侧为SNN重建图像。

Fig. 8. Illustration of the reconstructed images from the spiking autoencoder ($N_s=32$) on the MNIST dataset. For each pair of digits, the left side is the original image and the right side is the reconstruction by SNN.

6.4.2 时域语音分离/Time-Domain Speech Separation

表4总结了原始基于ANN的语音分离系统与转换后的基于SNN的语音分离系统的对比研究。基于ANN和SNN的系统在开放条件下的SI-SDR分别为12.8dB和12.2dB。在感知质量方面,我们观察到ANN和SNN的PESQ得分非常接近,分别为2.94和2.85。开放条件评估结果表明,SNN可以在这个具有挑战性的语音分离任务中获得与ANN相当的性能,而SNN可以在测试时获得快速推断和能源效率的额外好处。封闭条件评价也可以得到同样的结论。

Table 4 summarizes the comparative study between the original ANN-based and the converted SNN-based speech separation systems. The ANN- and SNN-based systems achieve an SI-SDR of 12.8 dB and 12.2 dB under the open condition evaluation, respectively. In terms of the perceptual quality, we observe that the ANN and SNN have a very close PESQ score of 2.94 and 2.85, respectively. The open condition evaluation results suggest that the SNN can achieve comparable performance to the ANN in this challenging speech separation task, while the SNN can take additional benefits of rapid inference and energy efficiency at test time. The same conclusion could also be drawn for the closed condition evaluation.

📊 表4 封闭与开放条件下ANN与SNN在语音分离任务上的比较研究。封闭条件对应开发集，其中的说话人在训练中出现过；开放条件对应测试集，其中的说话人在训练中未出现。“Diff.”指不同性别的混合，“Same”指相同性别的混合，“Overall”指两者的组合。

Comparative Study Between ANN and SNN on Speech Separation Tasks Under Both Closed and Open Condition.

The closed condition is on the development set, where the speakers are seen during training. The open condition is on the test set, where the speakers are unseen during training. “Diff.” refers to the different gender mixture. “Same” refers to the same gender mixture. “Overall” refers to the combination of both different and same gender mixtures.

通过试听ANN和SNN生成的分离样例，我们观察到SNN生成的分离样例与ANN生成的非常相似，且保真度很高。我们在线发布了一些来自测试集(开放条件)的样例，以演示我们系统的性能。我们从测试集中随机选取一个男-男混合条件下的语音样本，其幅度谱如图9所示。我们观察到，即使在相同性别这种具有挑战性的条件下(多位说话人具有相似的声学特征，如音调，因此可用于区分他们的信息较少)，SNN也获得了与真实干净频谱相似的频谱。

By listening to the separated examples generated by both ANN and SNN, we observe that the separated examples by SNN are very similar to those generated by ANN, with high fidelity. We publish some examples from the testing set (open condition) online to demonstrate our system performance. We randomly select a speech sample under the male-male mixture condition from the test set and show their magnitude spectra in Fig. 9. We observe that the SNN obtains a similar spectrum as the ground truth clean spectrum even under the challenging condition of the same gender, where the multi-talkers have similar acoustic characteristics, i.e., pitch, hence less information is available to discriminate them from each other.

🖼️ 图9 基于SNN的语音分离网络分离男-男混合语音的实例。

Fig. 9. The example of male-male mixture speech separated by SNN-based speech separation network.

7 结论/Conclusion

在这项工作中，我们重新审视了传统的ANN-to-SNN转换方法，并指出了采用发射速率假设所带来的精度与延迟之间的权衡。受激活量化工作的启发，我们进一步提出了一种新的网络转换方法，利用脉冲计数来表示ANN神经元的激活空间。这种配置可以更好地利用有限的表示空间并提高推理速度。此外，我们还引入了一种分层学习方法来抵消原始网络转换带来的误差。所提出的转换和学习框架被称为渐进串联学习(PTL)，通过所提出的自适应训练调度器实现高度自动化，支持灵活和高效的训练。得益于所提出的PTL框架，可以在训练过程中逐步施加硬件约束，从而有效地完成算法-硬件协同设计。

In this work, we reinvestigate the conventional ANN-to-SNN conversion approach and identify the accuracy and latency trade-off with the adopted firing rate assumption. Taking inspiration from the activation quantization works, we further propose a novel network conversion method, whereby the spike count is utilized to represent the activation space of ANN neurons. This configuration allows better exploitation of the limited representation space and improves the inference speed. Furthermore, we introduce a layer-wise learning method to counteract the errors resulting from the primitive network conversion. The proposed conversion and learning framework, which is called progressive tandem learning (PTL), is highly automated with the proposed adaptive training scheduler, which supports flexible and efficient training. Benefiting from the proposed PTL framework, the algorithm-hardware co-design can also be effectively accomplished by imposing the hardware constraints progressively during training.

这样训练的SNN在具有挑战性的ImageNet-12对象识别、图像重建与语音分离任务中表现出具有竞争力的分类和回归能力。此外,所提出的PTL框架允许有效利用可用的编码时间窗口,从而可以通过深度SNN实现快速有效的模式识别。以量化感知训练为例,我们说明了如何在训练过程中有效地引入硬件约束,即有限的权重精度,从而在实际的神经形态硬件上实现最佳性能。通过集成深度SNN的算法能力和高效的神经形态计算架构,它为普遍存在的低功耗设备的快速高效推断提供了无数机会。

The SNNs thus trained have demonstrated competitive classification and regression capabilities on the challenging ImageNet-12 object recognition, image reconstruction, and speech separation tasks. Moreover, the proposed PTL framework allows making efficient use of the available encoding time window, such that rapid and efficient pattern recognition can be achieved with deep SNNs. Taking the quantization-aware training as an example, we illustrate how the hardware constraint, limited weight precision, can be effectively introduced during training, such that the optimal performance can be achieved on the actual neuromorphic hardware. By integrating the algorithmic power of deep SNNs and energy-efficient neuromorphic computing architecture, it opens up a myriad of opportunities for rapid and efficient inference on the pervasive low-power devices.
