Discovery II Building
8888 University Dr.
Simon Fraser University
Burnaby, BC V5A 1S6
Email: alec_lu [at] sfu.ca
I am a PhD student in Computer Engineering at Simon Fraser University, co-advised by Prof. Zhenman Fang and Prof. Lesley Shannon. My research interests include FPGA-based custom accelerator design, big data analytics acceleration, and heterogeneous computing. I received my B.A.Sc. from Simon Fraser University in 2018. During my undergrad, I interned as a software developer at the Canadian Nuclear Laboratories (previously known as Atomic Energy of Canada Limited) and as an SoC designer at Intel. During grad school, I have interned at Meta as an ASIC designer for their AR image signal processor.
Aside from engineering, I am also a potter and a cook. I thoroughly enjoy the process of making and being fully immersed in it, much like being in the 'zone' when coding. I have worked as a cook at several local restaurants. Nowadays, I occasionally teach ceramics classes at HiDe Ceramic Works, owned by pottery master HiDe Ebina. Some of my creations are posted here.
You can find my CV here, or view my Google Scholar profile.
C9 |
SQL2FPGA: Automatic Acceleration of SQL Query Processing on Modern CPU-FPGA Platforms FCCM'23
Today’s big data query engines are constantly under pressure to keep up with the rapidly increasing demand for faster processing of more complex workloads. In the past few years, FPGA-based database acceleration efforts have demonstrated promising performance improvement with good energy efficiency. However, few studies target the programming and design automation support to leverage the FPGA accelerator benefits in query processing. Most of them rely on the SQL query plan generated by CPU query engines, and manually map the query plan onto the FPGA accelerators, which is tedious and error prone. Moreover, such CPU-oriented query plans do not consider the utilization of FPGA accelerators and could lose more optimization opportunities. In this paper, we present SQL2FPGA, an FPGA accelerator-aware compiler to automatically map SQL queries onto the heterogeneous CPU-FPGA platforms. Our SQL2FPGA front-end takes an optimized logical plan of a SQL query from a database query engine, and transforms it into a unified operator-level intermediate representation. To generate an optimized FPGA-aware physical plan, SQL2FPGA implements a set of compiler optimization passes to 1) improve operator acceleration coverage by the FPGA, 2) eliminate redundant computation during physical execution, and 3) minimize data transfer overhead between operators on the CPU and FPGA. Finally, SQL2FPGA generates the associated query acceleration code that can be deployed on the heterogeneous CPU-FPGA system. Compared to the widely used Apache Spark SQL framework running on the CPU, SQL2FPGA—using two AMD/Xilinx HBM-based Alveo U280 FPGA boards—achieves an average performance speedup of 10.1x and 13.9x across all 22 TPC-H benchmark queries in a scale factor of 1GB and 30GB, respectively.
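For a rough flavor of the unified operator-level intermediate representation described above, here is a minimal hypothetical C++ sketch (the type and function names are illustrative and are not taken from the actual SQL2FPGA code base): each relational operator of the logical plan becomes an IR node tagged with the device it is mapped to, which is what lets the optimization passes reason about FPGA operator coverage and CPU-FPGA data transfers.

#include <memory>
#include <string>
#include <vector>

// Hypothetical operator-level IR: one node per relational operator.
enum class OpKind { Scan, Filter, Project, Join, Aggregate, Sort };
enum class Device { CPU, FPGA };

struct OpNode {
    OpKind kind;
    Device target = Device::CPU;                   // decided by the FPGA-aware passes
    std::vector<std::string> columns;              // columns produced by this operator
    std::vector<std::shared_ptr<OpNode>> children; // upstream operators
};

// Illustrative pass: mark operators that an FPGA overlay could execute,
// i.e., a simplified take on "improve operator acceleration coverage".
void mapToFpga(const std::shared_ptr<OpNode>& node) {
    for (auto& child : node->children) mapToFpga(child);
    if (node->kind == OpKind::Filter || node->kind == OpKind::Join ||
        node->kind == OpKind::Aggregate) {
        node->target = Device::FPGA;
    }
}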
@inproceedings{lu23sql2fpga,
title={SQL2FPGA: Automatic Acceleration of SQL Query Processing on Modern CPU-FPGA Platforms},
author={Alec Lu and Zhenman Fang},
year={2023},
booktitle = {The 31st IEEE International Symposium On Field-Programmable Custom Computing Machines},
series = {FCCM'23},
location = {Marina Del Rey, CA},
numpages = {11},
pages = {}
}
|
C8 |
ESRU: Extremely Low-Bit and Hardware-Efficient Stochastic Rounding Unit Design for 8-Bit DNN Training DATE'23 |
C7 |
HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers HPCA'23 |
C6 |
You Already Have It: A Generator-Free Low-Precision DNN Training Framework using Stochastic Rounding ECCV'22 |
C5 |
Auto-ViT-Acc: FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization FPL'22 |
C4 |
FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization FPGA'22
With the trend to deploy Deep Neural Network (DNN) inference models on edge devices with limited resources, quantization techniques have been widely used to reduce on-chip storage and improve computation throughput. However, existing DNN quantization work deploying quantization below 8-bit may be either suffering from evident accuracy loss or facing a big gap between the theoretical improvement of computation throughput and the practical inference speedup. In this work, we propose a general framework, called FILM-QNN, to quantize and accelerate multiple DNN models across different embedded FPGA devices. First, we propose the novel intra-layer, mixed-precision quantization algorithm that assigns different precisions onto the filters of each layer. The candidate precision levels and assignment granularity are determined from our empirical study with the capability of preserving accuracy and improving hardware parallelism. Second, we apply multiple optimization techniques for the FPGA accelerator architecture in support of quantized computations, including DSP packing, weight reordering, and data packing, to enhance the overall throughput with the available resources. Moreover, a comprehensive resource model is developed to balance the allocation of FPGA computation resources (LUTs and DSPs) as well as data transfer and on-chip storage resources (BRAMs) to accelerate the computations in mixed precisions within each layer. Finally, to improve the portability of FILM-QNN, we implement it using Vivado High-Level Synthesis (HLS) on Xilinx PYNQ-Z2 and ZCU102 FPGA boards. Our experimental results of ResNet-18, ResNet-50, and MobileNet-V2 demonstrate that the implementations with intra-layer, mixed-precision (95% of 4-bit weights and 5% of 8-bit weights, and all 5-bit activations) can achieve comparable accuracy (70.47%, 77.25%, and 65.67% for the three models) as the 8-bit (and 32-bit) versions and comparable throughput (214.8 FPS, 109.1 FPS, and 537.9 FPS on ZCU102) as the 4-bit designs.
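As a rough, hedged illustration of the intra-layer, mixed-precision idea (this is not the FILM-QNN algorithm itself; the ranking heuristic and function names below are hypothetical), per-filter precisions can be assigned within a single layer, for example 8-bit weights for the small fraction of filters that are hardest to quantize and 4-bit weights for the rest:

#include <algorithm>
#include <cmath>
#include <functional>
#include <utility>
#include <vector>

// Squared error of symmetric uniform quantization of one filter at a given bit width.
static float quantError(const std::vector<float>& w, int bits) {
    float maxAbs = 0.f;
    for (float v : w) maxAbs = std::max(maxAbs, std::fabs(v));
    if (maxAbs == 0.f) return 0.f;
    const float qmax = static_cast<float>((1 << (bits - 1)) - 1);
    const float scale = qmax / maxAbs;
    float err = 0.f;
    for (float v : w) {
        float q = std::round(std::clamp(v * scale, -qmax, qmax)) / scale;
        err += (v - q) * (v - q);
    }
    return err;
}

// Hypothetical intra-layer assignment: the 5% of filters with the largest
// 4-bit quantization error keep 8-bit weights; the remaining 95% use 4-bit,
// matching the 95%/5% weight mix reported in the abstract.
std::vector<int> assignFilterPrecisions(const std::vector<std::vector<float>>& filters) {
    std::vector<std::pair<float, size_t>> ranked;
    for (size_t f = 0; f < filters.size(); ++f)
        ranked.push_back({quantError(filters[f], 4), f});
    std::sort(ranked.begin(), ranked.end(), std::greater<>());
    std::vector<int> bits(filters.size(), 4);
    for (size_t i = 0; i < filters.size() / 20; ++i)
        bits[ranked[i].second] = 8;
    return bits;
}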
@inproceedings{sunFilmQNN,
title = {FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization},
author = {Sun, Mengshu and Li, Zhengang and Lu, Alec and Li, Yanyu and Chang, Sung-En and Ma, Xiaolong and Lin, Xue and Fang, Zhenman},
year = {2022},
booktitle = {Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
location = {Virtual Event, USA},
series = {FPGA '22},
numpages = {12},
pages = {134–145}
}
|
C3 |
Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking FPGA'21
With the public availability of FPGAs from major cloud service providers like AWS, Alibaba, and Nimbix, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS). The major goal of this paper is to figure out how to efficiently access the memory system of modern datacenter FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design only utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including 1) the number of concurrent memory access ports, 2) the data width of each port, 3) the maximum burst access length for each port, and 4) the size of consecutive data accesses. Then we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the Xilinx Alveo U200 and U280 FPGA memory systems when changing those affecting factors, and provide insights into efficient memory access in HLS-based accelerator designs. To demonstrate the usefulness of our insights, we also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms. Compared to the baseline designs, optimized designs leveraging our insights achieve about 3.5x and 8.5x speedups for the KNN and SpMV accelerators.
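The practical takeaway is to keep each HLS memory port wide and its accesses long and consecutive. Below is a minimal, hedged sketch of the kind of kernel pattern the microbenchmarks exercise (illustrative code, not an actual benchmark kernel from the paper): a 512-bit port iterated over consecutive addresses so the HLS tool can infer long AXI bursts, instead of a naive 32-bit element-by-element access.

#include <ap_int.h>

// Illustrative HLS kernel: each AXI beat carries sixteen 32-bit words, and the
// loop walks consecutive addresses so long read/write bursts can be inferred.
extern "C" void wide_burst_copy(const ap_uint<512>* in, ap_uint<512>* out, int n) {
#pragma HLS INTERFACE m_axi port=in  offset=slave bundle=gmem0 max_read_burst_length=64
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1 max_write_burst_length=64
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];
    }
}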
@inproceedings{lu21demystify,
title={Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking},
author={Alec Lu and Zhenman Fang and Weihua Liu and Lesley Shannon},
year={2021},
booktitle = {2021 International Symposium on Field-Programmable Gate Arrays},
series = {FPGA'21},
location = {Virtual Conference},
numpages = {11},
pages = {}
}
|
C2 |
CHIP-KNN: A Configurable and High-Performance K-Nearest Neighbors Accelerator on Cloud FPGAs FPT'20
The k-nearest neighbors (KNN) algorithm is an essential algorithm in many applications, such as similarity search, image classification, and database query. With the rapid growth in the dataset size and the feature dimension of each data point, processing KNN becomes more compute and memory hungry. Most prior studies focus on accelerating the computation of KNN using the abundant parallel resource on FPGAs. However, they often overlook the memory access optimizations on FPGA platforms and only achieve a marginal speedup over a multi-thread CPU implementation for large datasets. In this paper, we design and implement CHIP-KNN---an HLS-based, configurable, and high-performance KNN accelerator---which optimizes the off-chip memory access on cloud FPGAs with multiple DRAM or HBM (high-bandwidth memory) banks. CHIP-KNN is configurable for all essential parameters used in the algorithm, including the size of the search dataset, the feature dimension of each data point, the distance metric, and the number of nearest neighbors - K. To optimize its performance, we build an analytical performance model to explore the design space and balance the computation and memory access performance. Given a user configuration of the KNN parameters, our tool can automatically generate the optimal accelerator design on the given FPGA platform. Our experimental results on the Nimbix cloud computing platform show that: Compared to a 16-thread CPU implementation, CHIP-KNN on the Xilinx Alveo U200 FPGA board with four DRAM banks and U280 FPGA board with HBM achieves an average of 7.5x and 19.8x performance speedup, and 6.1x and 16.0x performance/dollar improvement.
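For intuition, the compute side of such an accelerator is essentially a distance pipeline like the hedged HLS sketch below (parameter values and names are illustrative, not the code CHIP-KNN generates); the framework replicates units like this across multiple DRAM/HBM banks and, per the memory-system insights above, feeds them through wide, burst-friendly ports rather than the narrow float pointers shown here.

#define DIM 128  // illustrative feature dimension

// Illustrative HLS kernel body: squared Euclidean distance from one query point
// to every search point, with the inner dimension loop fully unrolled.
extern "C" void knn_distances(const float* points, const float* query,
                              float* dist, int numPoints) {
    float q[DIM];
#pragma HLS ARRAY_PARTITION variable=q complete dim=1
    for (int d = 0; d < DIM; ++d) q[d] = query[d];

    for (int i = 0; i < numPoints; ++i) {
#pragma HLS PIPELINE II=1
        float sum = 0.f;
        for (int d = 0; d < DIM; ++d) {
#pragma HLS UNROLL
            float diff = points[i * DIM + d] - q[d];
            sum += diff * diff;
        }
        dist[i] = sum;
    }
}

A top-K selection stage (not shown) would then keep the K smallest distances as the neighbor candidates.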
@inproceedings{luChipKNN,
title={CHIP-KNN: A Configurable and High-Performance K-Nearest Neighbors Accelerator on Cloud FPGAs},
author={Alec Lu and Zhenman Fang and Nazanin Farahpour and Lesley Shannon},
year={2020},
booktitle = {2020 International Conference on Field-Programmable Technology},
series = {FPT'20},
location = {Virtual Conference},
numpages = {9},
pages = {}
}
|
C1 |
Rethinking Integer Divider Design for FPGA-based Soft-Processors FCCM'19
Most existing soft-processors on FPGAs today support a fixed-latency instruction pipeline. Therefore, for integer division, a simple fixed-latency radix-2 integer divider is typically used, or algorithm-level changes are made to avoid integer divisions. However, for certain important application domains the simple radix-2 integer divider becomes the performance bottleneck, as every 32-bit division operation takes 32 cycles. In this paper, we explore integer divider designs for FPGA-based soft-processors, by leveraging the recent support of variable-latency execution units in their instruction pipeline. We implement a high-performance, data-dependent, variable-latency integer divider called Quick-Div, optimize its performance on FPGAs, and integrate it into a RISC-V soft-processor called Taiga that supports a variable-latency instruction pipeline. We perform a comprehensive analysis and comparison—in terms of cycles, clock frequency, and resource usage—for both the fixed-latency radix-2/4/8/16 dividers and our variable-latency Quick-Div divider with various optimizations. Experimental results on a Xilinx Virtex UltraScale+ VCU118 FPGA board show that our Quick-Div divider can provide over 5x better performance and over 4x better performance/LUT compared to a radix-2 divider for certain applications like random number generation. Finally, through a case study of integer square root, we demonstrate that our Quick-Div divider provides opportunities for reconsidering simpler and faster algorithmic choices.
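The core trick, sketched below as a hypothetical C++ software model (not the actual Quick-Div RTL; __builtin_clz is the GCC/Clang count-leading-zeros builtin), is that a shift-and-subtract divider does not have to walk all 32 bit positions: aligning the divisor with the dividend via leading-zero counts bounds the iteration count by the quotient width, so divisions with small quotients finish in only a few steps, which is what makes the latency data dependent.

#include <cstdint>

// Hypothetical software model of a data-dependent, variable-latency divider.
void quickDivModel(uint32_t dividend, uint32_t divisor,
                   uint32_t& quotient, uint32_t& remainder) {
    quotient = 0;
    remainder = dividend;
    if (divisor == 0 || dividend < divisor) return;  // real hardware also flags divide-by-zero

    // Iteration count is data dependent: clz(divisor) - clz(dividend) + 1 steps.
    int shift = __builtin_clz(divisor) - __builtin_clz(dividend);
    uint64_t shiftedDivisor = static_cast<uint64_t>(divisor) << shift;

    for (int i = shift; i >= 0; --i) {
        if (remainder >= shiftedDivisor) {
            remainder -= static_cast<uint32_t>(shiftedDivisor);
            quotient |= 1u << i;
        }
        shiftedDivisor >>= 1;
    }
}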
@INPROCEEDINGS{8735506,
author={E. {Matthews} and A. {Lu} and Z. {Fang} and L. {Shannon}},
booktitle={2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)},
title={Rethinking Integer Divider Design for FPGA-Based Soft-Processors},
year={2019},
volume={},
number={},
pages={289-297},
doi={10.1109/FCCM.2019.00046}}
|
J3 |
SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs TRETS'22 |
J2 |
Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking TRETS'22 |
J1 |
Quick-Div: Rethinking Integer Divider Design for FPGA-based Soft-Processors TRETS'22 |
A3 |
FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization DAC'22 LBR |
A2 |
Hardware-Efficient Stochastic Rounding Unit Design for DNN Training DAC'22 LBR |
A1 |
You Already Have It: A Generator-Free Low-Precision DNN Training Framework using Stochastic Rounding DAC'22 WIP |
uBench is a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the Xilinx Alveo FPGA memory systems under a comprehensive set of factors that affect the memory bandwidth, including 1) the clock frequency of the accelerator design, 2) the number of concurrent memory access ports, 3) the data width of each port, 4) the maximum burst access length for each port, and 5) the size of consecutive data accesses. uBench is open-source and publicly available on GitHub.
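As a hypothetical illustration of how those factors can be exposed as knobs in one benchmark configuration (the macro and function names below are not uBench's actual parameters), a single read port might look like this; the full benchmark instantiates several such ports on separate AXI bundles and sweeps the burst length and kernel clock while measuring the achieved bandwidth.

#include <ap_int.h>

#define PORT_WIDTH 512   // data width of the memory port, in bits
#define NUM_BEATS  1024  // number of consecutive beats read per invocation

typedef ap_uint<PORT_WIDTH> bus_t;

// One read port streaming NUM_BEATS consecutive beats through a PORT_WIDTH-bit bus.
extern "C" void ubench_read(const bus_t* in, bus_t* out) {
#pragma HLS INTERFACE m_axi port=in  offset=slave bundle=gmem0 max_read_burst_length=64
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1
    bus_t acc = 0;
    for (int i = 0; i < NUM_BEATS; ++i) {
#pragma HLS PIPELINE II=1
        acc ^= in[i];
    }
    out[0] = acc;
}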
CHIP-KNN is a framework for building configurable and high-performance K-Nearest Neighbors accelerators on cloud FPGAs. Given a user configuration of the KNN parameters, it automatically generates a bandwidth-optimized KNN accelerator for the target cloud FPGA platform. CHIP-KNN is open-source and publicly available on GitHub.
QuickDiv is a high-performance, data-dependent, variable-latency integer divider whose architecture is optimized for FPGAs. It has been integrated as one of the functional units in Taiga, a RISC-V soft-processor that supports a variable-latency instruction pipeline. QuickDiv is distributed as part of Taiga, and both are open-source and publicly available on GitLab.