Abstract:
Applying convolutional neural networks (CNNs) to high-resolution images produces very large intermediate feature maps (FMs), which dominate the memory traffic. Processing in the classical layer-by-layer order requires storing the complete FMs at once when moving from one layer to the next. Because FMs of this size realistically fit only in off-chip memory, this leads to high off-chip bandwidth, which comes at a great energy cost. The DepFiN processor chip, presented in this article, overcomes this cost by running CNNs in a deep layer-fusion mode, dubbed depth-first execution, made possible by a control flow that supports frequent switching between layers. To also reduce the computational cost, DepFiN explicitly supports the computationally efficient depthwise + pointwise (DW + PW) layer pairs through a novel accelerator core that can dynamically change its configuration to manage the low computational intensity of the depthwise layers. Benchmarking measurements show the 12-nm DepFiN chip reaching up to 20 TOPS/W peak and 8.2 TOPS/W on the MC-CNN-fast stereo-matching network excluding input-output (IO) power (at 8-bit, 0.6 Vdd). Crucially, it reaches 3.95 TOPS/W with the IO power included on the same network, and an up to 18× improvement is realized by supporting depth-first execution (MC-CNN-fast at 8-bit, 0.65 V Vdd).
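The storage argument behind depth-first execution can be illustrated with a back-of-the-envelope sketch. It assumes a hypothetical stack of 3×3, stride-1 convolution layers with 8-bit activations, and a simple line-buffered form of layer fusion in which each fused layer keeps only as many input rows as its kernel is tall. The function names, image size, channel count, and layer count are illustrative assumptions; they are not DepFiN's actual scheduling scheme or measured figures.

```python
def layer_by_layer_bytes(height, width, channels, bytes_per_value=1):
    """Bytes needed to hold one complete intermediate feature map,
    as required when the network is processed one full layer at a time."""
    return height * width * channels * bytes_per_value


def depth_first_bytes(width, channels, num_layers, kernel=3, bytes_per_value=1):
    """Bytes for rolling line buffers in a line-based fused schedule
    (an assumed scheme): each layer keeps `kernel` rows of its input."""
    return num_layers * kernel * width * channels * bytes_per_value


# Hypothetical workload: 1920x1080 image, 32 channels, 4 fused layers.
full_fm = layer_by_layer_bytes(height=1080, width=1920, channels=32)
line_buffers = depth_first_bytes(width=1920, channels=32, num_layers=4)

print(f"layer-by-layer: {full_fm / 1e6:.1f} MB per feature map")
print(f"depth-first:    {line_buffers / 1e3:.1f} kB of line buffers")
print(f"reduction:      {full_fm // line_buffers}x")
```

Under these assumptions, a single complete feature map needs tens of megabytes, while the line buffers for all four fused layers together need well under one megabyte. That gap is what allows intermediate data to stay on-chip.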