Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao, William J. Dally

Abstract

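Neural networks are both computationally intensive and memory intensive
- Hard to deploy on embedded systems
Deep Compression addresses this
- pruning, trained quantization, Huffman coding
- Reduces model storage without affecting accuracy
  - by 35x to 49x
First prune so that only the important connections are learned, then quantize the weights, and finally apply Huffman coding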

Introduction

Deep learning models have become very powerful
- However, they are too large to deploy on mobile systems
- Heavy computation and many memory fetches lead to high energy consumption
  - Energy consumption is dominated by memory access
Algorithm
1. Pruning
  - Removes unnecessary connections, keeping only the important ones
2. Quantization
  - Multiple connections share the same weight
  - Helps reduce storage
3. Huffman coding
  - Takes advantage of the biased distribution of effective weights
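To make the Huffman coding step concrete, here is a minimal sketch of why a biased distribution helps. The index stream, its counts, and the helper name `huffman_code_lengths` are hypothetical values chosen for illustration, not figures from the paper. The sketch compares the total bits under a Huffman code with a fixed 2-bit encoding.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    # The tie-breaker keeps the dicts from ever being compared.
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical stream of 2-bit quantization indices, heavily biased toward 0.
indices = [0] * 70 + [1] * 15 + [2] * 10 + [3] * 5
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[s] for s in indices)
fixed_bits = 2 * len(indices)
print(lengths, huffman_bits, fixed_bits)
```

The most frequent index gets the shortest code, so the more skewed the distribution, the larger the saving over a fixed-length encoding.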
Key idea
- pruning & quantization: used together, each can maximize the effect of the other

Network Pruning

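Pruning has long been studied as a way to reduce model complexity and prevent overfitting
First, train the network normally
Prune the small-weight connections
- Remove every connection whose weight falls below a given threshold
Retrain the remaining connections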
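The train, prune, retrain steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, assuming a single weight matrix stored as nested lists; the names `prune_by_magnitude` and `masked_update`, the threshold, and the learning rate are hypothetical, and the update is a plain gradient step rather than the authors' training setup.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below `threshold`.

    Returns the pruned weights and a 0/1 mask marking surviving connections.
    """
    mask = [[1 if abs(w) >= threshold else 0 for w in row] for row in weights]
    pruned = [
        [w if m else 0.0 for w, m in zip(row, mrow)]
        for row, mrow in zip(weights, mask)
    ]
    return pruned, mask


def masked_update(weights, grads, mask, lr):
    """One retraining step in which pruned connections (mask == 0) stay at zero."""
    return [
        [(w - lr * g) if m else 0.0 for w, g, m in zip(wrow, grow, mrow)]
        for wrow, grow, mrow in zip(weights, grads, mask)
    ]


# Hypothetical 2x3 weight matrix after normal training.
weights = [[0.8, -0.05, 0.3], [0.02, -0.6, 0.01]]
pruned, mask = prune_by_magnitude(weights, threshold=0.1)

# Hypothetical gradients: only surviving connections are updated.
grads = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
updated = masked_update(pruned, grads, mask, lr=0.1)
print(mask)
print(updated)
```

Keeping the mask around is the important design choice: without it, retraining would let the pruned weights grow back.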
Store the result in compressed sparse row (CSR) or compressed sparse column (CSC) format
- Requires $2a+n+1$ numbers
  - a: number of non-zero elements
  - n: number of rows or columns
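The $2a+n+1$ count can be checked with a small hand-rolled CSR conversion. This is a sketch in plain Python on a hypothetical 3x4 matrix; `to_csr` is an illustrative helper, not a library function. CSR stores $a$ values, $a$ column indices, and $n+1$ row pointers.

```python
def to_csr(dense):
    """Convert a dense 2-D list into CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0:
                values.append(w)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends inside `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


# Hypothetical pruned weight matrix with 4 non-zero entries in 3 rows.
dense = [
    [0.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, -0.6],
    [0.3, 0.0, 0.0, 0.5],
]
values, col_idx, row_ptr = to_csr(dense)
a, n = len(values), len(dense)
stored = len(values) + len(col_idx) + len(row_ptr)
print(values, col_idx, row_ptr)
print(stored, 2 * a + n + 1)
```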
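Index differences can be stored instead of absolute positions
- Compresses further
- The differences are encoded with a fixed bit width
  - conv layer: 8 bits
  - fc layer: 5 bits
  - When a difference needs more bits than that, the zero padding solution is applied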
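The index-difference scheme with zero padding can be sketched as below. This is an illustrative reading, assuming a flattened 1-D weight list and gaps measured from position 0. The bound `max_gap` is passed in explicitly because how it relates to the bit width is an assumption here, and the function names and sample values are hypothetical.

```python
def encode_relative(dense, max_gap):
    """Encode a 1-D sparse list as (gap, value) pairs.

    `gap` is the distance from the previously stored entry (starting at
    position 0). When a gap exceeds `max_gap`, filler zeros are stored so
    that every gap fits in the fixed bit width.
    """
    pairs, prev = [], 0
    for pos, w in enumerate(dense):
        if w == 0:
            continue
        while pos - prev > max_gap:
            pairs.append((max_gap, 0.0))  # filler zero (zero padding)
            prev += max_gap
        pairs.append((pos - prev, w))
        prev = pos
    return pairs


def decode_relative(pairs, length):
    """Rebuild the dense list by accumulating the gaps."""
    dense, pos = [0.0] * length, 0
    for gap, w in pairs:
        pos += gap
        dense[pos] = w
    return dense


# Hypothetical sparse vector of length 16 with non-zeros at positions 1, 4, 15.
dense = [0.0] * 16
dense[1], dense[4], dense[15] = 3.4, 0.9, 1.7

pairs = encode_relative(dense, max_gap=8)
print(pairs)
restored = decode_relative(pairs, len(dense))
```

The gap from position 4 to 15 is 11, which exceeds the bound of 8, so one filler zero is stored at position 12 and the real weight follows with a gap of 3.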

Trained Quantization and Weight Sharing