We are excited to release the neon™ 2.6.0 framework, which improves the CPU inference path for a VGG-16-based Single Shot MultiBox Detector (SSD) neural network. Together with the training optimizations released in neon 2.5.0, these updates bring significant gains to neon in both training and inference performance. (Granular configuration details, as well as the raw data used in testing these configurations, are at the end of this post.)
Our focus was a well-known hotspot in the SSD model: dilated convolution. To accelerate convolution layers with dilation, we replaced the vanilla Python implementation with C kernels for matrix multiplication. We also optimized the detection-output layer in the inference path to make better use of these kernels.
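To illustrate why a fast matrix-multiplication kernel helps dilated convolution, here is a minimal pure-Python sketch. It assumes the common im2col + GEMM lowering (single image, stride 1, no padding); this post does not describe neon's exact lowering, and the function names below are hypothetical, not neon APIs. Each dilated receptive field is unrolled into a column, so the whole layer becomes one matrix product, which is the step an optimized C kernel would replace.

```python
"""Hypothetical sketch: dilated 2-D convolution lowered to im2col + GEMM.

Assumptions (not neon's code): x is C x H x W, weights are K x C x kh x kw,
stride 1, no padding. All function names here are illustrative only.
"""
from typing import List, Tuple

Matrix = List[List[float]]


def im2col_dilated(x: List[Matrix], kh: int, kw: int,
                   dilation: int) -> Tuple[Matrix, int, int]:
    """Unroll each dilated receptive field into one column of a
    (C*kh*kw) x (out_h*out_w) matrix."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    # Dilation widens the kernel's footprint to (k - 1) * d + 1.
    out_h = h - (kh - 1) * dilation
    out_w = w - (kw - 1) * dilation
    cols = []
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # One row per (channel, kernel tap); one column per output pixel.
                cols.append([x[ci][oy + i * dilation][ox + j * dilation]
                             for oy in range(out_h) for ox in range(out_w)])
    return cols, out_h, out_w


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive GEMM: the hotspot an optimized C kernel would take over."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][s] for t in range(inner)) for s in range(cols)]
            for row in a]


def dilated_conv2d(x: List[Matrix], weights, dilation: int):
    """Dilated convolution (cross-correlation) via im2col + GEMM."""
    c, kh, kw = len(weights[0]), len(weights[0][0]), len(weights[0][0][0])
    cols, out_h, out_w = im2col_dilated(x, kh, kw, dilation)
    # Flatten each filter in the same (channel, i, j) order used by im2col.
    w_mat = [[f[ci][i][j] for ci in range(c) for i in range(kh) for j in range(kw)]
             for f in weights]
    y = matmul(w_mat, cols)  # K x (out_h * out_w)
    # Reshape each output row back into an out_h x out_w feature map.
    return [[row[oy * out_w:(oy + 1) * out_w] for oy in range(out_h)] for row in y]


def dilated_conv2d_direct(x: List[Matrix], weights, dilation: int):
    """Reference implementation with explicit nested loops."""
    c, kh, kw = len(weights[0]), len(weights[0][0]), len(weights[0][0][0])
    out_h = len(x[0]) - (kh - 1) * dilation
    out_w = len(x[0][0]) - (kw - 1) * dilation
    return [[[sum(f[ci][i][j] * x[ci][oy + i * dilation][ox + j * dilation]
                  for ci in range(c) for i in range(kh) for j in range(kw))
              for ox in range(out_w)] for oy in range(out_h)] for f in weights]


if __name__ == "__main__":
    # 1 x 5 x 5 input with values 0..24; one 2x2 filter [[1, 0], [0, -1]].
    x = [[[float(r * 5 + col) for col in range(5)] for r in range(5)]]
    w = [[[[1.0, 0.0], [0.0, -1.0]]]]
    # With dilation 2, each output is x[oy][ox] - x[oy+2][ox+2] = -12.
    print(dilated_conv2d(x, w, dilation=2))
```

Both functions compute the same result; the im2col version simply reorganizes the work so that nearly all of the arithmetic happens inside a single `matmul` call, which is why swapping in a compiled matrix-multiplication kernel pays off.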
Taken together, these improvements yielded the results shown in Figure 1 below: up to a 3.7X speedup for batch size 8 and up to 2.4X for batch size 32 on a two-socket system based on the Intel® Xeon® Processor E5-2699 v4. The improvements were even larger when running SSD on a two-socket system based on the Intel® Xeon® Platinum 8180 Processor: up to 8.4X for batch size 8 and up to 4.3X for batch size 32.

https://ai.intel.com/neon-2-6-0-inference-optimizations-single-shot-multibox-detector-intel-xeon-processor-architectures/