[@SFU] SERI: Streaming Acceleration for Electron Repulsion Integral Computation on FPGA
[@SFU] BitBlender: Scalable Bloom Filter Accelerator on FPGA
[@SFU] FORC: FPGA Accelerator for Apache ORC File Decoder
[@SFU] HiSpMV: A High-Performance SpMV Accelerator on FPGA for Imbalanced Matrices
[@SFU] PASTA: Scalable Task-Parallel HLS Programming Framework on FPGA
[@SFU] SQL2FPGA: Spark SQL to FPGA Compiler
[@SFU] SASA: Scalable and Automatic Stencil Acceleration on FPGA
[@SFU] SyncNN: Novel Synchronous Spiking Neural Network Acceleration on FPGA
[@SFU] Rodinia-HLS: FPGA Version of Rodinia Benchmarks in HLS C/C++
[@SFU] Microbenchmarks to Characterize Modern FPGA Memory Systems
[@SFU] CHIP-KNN: A Configurable and High-Performance KNN Accelerator on FPGA
[@UCLA] PARADE: Full-System Accelerator-Rich Architecture Simulator
[@UCLA] Blaze: Deploying Accelerators at Datacenter Scale
[@UCLA] Microbenchmarks to Characterize Modern CPU-FPGA Platforms
[@UCLA] High-Throughput Deflate Compression Accelerator on FPGA
[@UMN] Microbenchmarks to Characterize Multi-/Many-core Memory Systems
Software description:
SERI is a high-throughput streaming accelerator for electron repulsion integral (ERI) computation on HBM-based FPGAs; ERI computation is the largest bottleneck in ab initio molecular dynamics (AIMD) quantum chemistry simulations. To meet the varying computation, bandwidth, and floorplanning requirements among the 55 canonical quartet classes in ERI calculation, SERI provides an automation tool, together with an accurate performance model, that automatically customizes the architecture and floorplanning strategy for each canonical quartet class to maximize its throughput. Running on the AMD/Xilinx Alveo U280 FPGA board, SERI achieves an average speedup of 9.80x over the previous best-performing FPGA design, 3.21x over a 64-core AMD EPYC 7713 CPU, and 15.64x over an Nvidia A40 GPU.
Team Members:
Students: Philip Stachura
Faculty: Zhenman Fang
Download:
SERI is open source and available for download at github: https://github.com/SFU-HiAccel/SERI
Further Interest:
For more details, please read our FPL 2024 paper, or contact Philip Stachura (pstachur@sfu.ca).
If you use SERI in your research, please cite our FPL 2024 paper (bibtex download):
Philip Stachura, Guanyu Li, Xin Wu, Christian Plessl, Zhenman Fang. SERI: High-Throughput Streaming Acceleration of Electron Repulsion Integral Computation in Quantum Chemistry using HBM-based FPGAs. The 34th IEEE International Conference on Field-Programmable Logic and Applications (FPL 2024), Turin, Italy, September 2024.
Software description:
BitBlender is the first dynamically scheduled, configurable, and scalable multi-stream Bloom filter acceleration framework, supporting large Bloom filters with a low false-positive rate and high throughput. To effectively share one large on-chip bit-vector among all streams, we design and implement novel arbiter and unshuffle modules that dynamically schedule conflicting accesses to execute sequentially and non-conflicting accesses to execute in parallel. On the AMD/Xilinx Alveo U280 FPGA, BitBlender achieves a throughput of up to 2,194 MQueries/s (i.e., 8.8 GB/s) for a 96 Mb bit-vector with a 0.01% false-positive rate.
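For readers new to Bloom filters, the insert/query semantics that BitBlender parallelizes across streams can be sketched in a few lines of Python. This is an illustrative software model only (the hash construction below is our own assumption, not BitBlender's hardware hashing):

```python
import hashlib

class BloomFilter:
    """Minimal software model of a Bloom filter. BitBlender's contribution
    is sharing one large on-chip bit-vector among many query streams via
    dynamic arbitration; the set/test semantics modeled here are standard."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k independent bit positions from a personalized BLAKE2b hash
        # (an illustrative choice; any family of independent hashes works).
        for i in range(self.num_hashes):
            h = hashlib.blake2b(key, person=i.to_bytes(8, 'little')).digest()
            yield int.from_bytes(h[:8], 'little') % self.num_bits

    def insert(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def query(self, key: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

In hardware, concurrent streams may hash to the same region of the bit-vector; BitBlender's arbiter serializes only those conflicting accesses while the rest proceed in parallel.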
Team Members:
Students: Kenny Liu, Alec Lu
Faculty: Zhenman Fang
Download:
BitBlender is open source and available for download at github: https://github.com/SFU-HiAccel/BitBlender
Further Interest:
For more details, please read our FPL 2024 paper, or contact Kenny Liu (kenny_liu_2@sfu.ca).
If you use BitBlender in your research, please cite our FPL 2024 paper (bibtex download):
Kenny Liu, Alec Lu, Zhenman Fang. BitBlender: Scalable Bloom Filter Acceleration on FPGAs with Dynamic Scheduling. The 34th IEEE International Conference on Field-Programmable Logic and Applications (FPL 2024), Turin, Italy, September 2024.
Software description:
FORC is the first high-throughput streaming FPGA accelerator for decoding the Apache ORC files used in modern big data engines. It features (1) a resource-efficient overlay design that shares the logic of common operations across decoders and supports different combinations of ORC decoding schemes (and bit widths), (2) a fully (dynamically) pipelined ORC decoder engine that can process data at up to the output streaming rate, i.e., four 512-bit-wide streaming writes per cycle, and (3) an end-to-end dataflow integration of our accelerator into the Apache ORC C++ library. FORC achieves up to 12.9 GB/s decoding throughput on the AMD/Xilinx Alveo U280 FPGA, with a geomean speedup of 65x (up to 335x) over the CPU.
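To give a flavor of the lightweight integer encodings such decoders handle, here is a simplified run-length decoder in the style of ORC's RLE v1. This is a hedged sketch, not the exact ORC specification (real ORC uses varint-encoded values; here every field is a single small integer):

```python
def decode_rle_v1(data):
    """Simplified ORC-style RLE v1 integer decoder (illustrative only).
    data is a flat list of control/payload values:
      control c in 0..127   -> run of (c + 3) values: base, base+delta, ...
      control c in -128..-1 -> |c| literal values follow.
    """
    out, i = [], 0
    while i < len(data):
        c = data[i]; i += 1
        if c >= 0:                       # run: length = c + 3
            delta = data[i]; base = data[i + 1]; i += 2
            out.extend(base + k * delta for k in range(c + 3))
        else:                            # literals: length = -c
            n = -c
            out.extend(data[i:i + n]); i += n
    return out
```

The hardware decoder engine performs this kind of control-byte dispatch and value expansion fully pipelined, across several schemes and bit widths, at the output streaming rate.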
Team Members:
Students: Abdul Wadood, Alec Lu
Faculty: Zhenman Fang
Download:
FORC is open source and available for download at github: https://github.com/SFU-HiAccel/FORC
Further Interest:
For more details, please read our FPL 2024 paper, or contact Abdul Wadood (abdul_wadood@sfu.ca).
If you use FORC in your research, please cite our FPL 2024 paper (bibtex download):
Abdul Wadood, Alec Lu, Ken Zhang, Zhenman Fang. FORC: A High-Throughput Streaming FPGA Accelerator for Optimized Row Columnar File Decoders in Big Data Engines. The 34th IEEE International Conference on Field-Programmable Logic and Applications (FPL 2024), Turin, Italy, September 2024.
Software description:
HiSpMV is a sparse matrix-vector multiplication (SpMV) accelerator on HBM-based FPGAs, optimized for imbalanced SpMV. It features (1) a hybrid row distribution network that enables both inter-row and intra-row distribution for better load balance, (2) fully pipelined floating-point accumulation on the output vector using a combination of an adder chain and a register-based circular buffer, (3) hybrid buffering to improve memory accesses to the input vector, and (4) an automation framework that generates the optimized HiSpMV accelerator in Vitis HLS.
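As a point of reference, the computation being accelerated is standard CSR-format SpMV, sketched below in plain Python. When row lengths are highly skewed, a naive row-per-processing-element mapping leaves most units idle, which is the imbalance HiSpMV's hybrid inter-/intra-row distribution addresses:

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Reference CSR sparse matrix-vector multiply (software model; the
    accelerator parallelizes this both across rows and within long rows).
    row_ptr[r]..row_ptr[r+1] delimits the nonzeros of row r."""
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for j in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[j] * x[col_idx[j]]   # gather from the input vector
        y[r] = acc
    return y
```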
Team Members:
Students: Manoj BR, Xingyu Tian
Faculty: Zhenman Fang
Download:
HiSpMV is open source and available for download at github: https://github.com/SFU-HiAccel/HiSpMV
Further Interest:
For more details, please read our FPGA 2024 paper, or contact Manoj BR (mba151@sfu.ca).
If you use HiSpMV in your research, please cite our FPGA 2024 paper (bibtex download):
Manoj Bheemasandra Rajashekar, Xingyu Tian, Zhenman Fang. HiSpMV: Hybrid Row Distribution and Vector Buffering for Imbalanced SpMV Acceleration on FPGAs. The 32nd ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2024), Monterey, CA, March 2024.
Software description:
PASTA is a programming framework that takes a large task-parallel HLS design as input and automatically generates a high-frequency accelerator on modern multi-die FPGAs via HLS and physical-design co-optimization. It is built on top of the UCLA TAPA/AutoBridge [TRETS 2023] framework and extends it with a latency-insensitive buffer channel design that supports memory partitioning and ping-pong buffering while remaining compatible with vendor HLS tools. PASTA therefore supports a much broader class of HLS designs, in which parallel tasks can communicate via both FIFO-based and buffer-based channels.
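The task-parallel style PASTA targets can be pictured as independent tasks connected by channels. The toy Python model below shows two tasks communicating over a bounded FIFO channel; it illustrates the programming pattern only, and its names are illustrative, not the PASTA or TAPA API:

```python
import queue
import threading

def run_task_graph(data):
    """Toy software model of a task-parallel dataflow design: two tasks
    connected by a FIFO channel. PASTA additionally supports buffer-based
    channels (with memory partitioning and ping-pong buffering) for tasks
    that need random access to a shared block of data."""
    fifo = queue.Queue(maxsize=4)      # bounded, like a hardware FIFO
    out = []

    def producer():
        for x in data:
            fifo.put(x * 2)            # stage 1: scale each element
        fifo.put(None)                 # end-of-stream token

    def consumer():
        while True:
            x = fifo.get()
            if x is None:
                break
            out.append(x + 1)          # stage 2: offset each element

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start(); t1.join(); t2.join()
    return out
```

In hardware, the two stages run concurrently and the bounded FIFO provides backpressure; the latency-insensitive channel abstraction is what lets the floorplanner place tasks on different dies without breaking correctness.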
Team Members:
Students: Moazin Khatti, Xingyu Tian, Ahmad Sedigh Baroughi, Akhil Raj Barnawal
Faculty: Zhenman Fang
Download:
PASTA is open source and available for download at github: https://github.com/SFU-HiAccel/PASTA
Further Interest:
For more details, please read our FCCM 2023 paper, or contact Moazin Khatti (moazin_khatti@sfu.ca).
If you use PASTA in your research, please cite our FCCM 2023 paper (bibtex download):
Moazin Khatti, Xingyu Tian, Yuze Chi, Licheng Guo, Jason Cong, Zhenman Fang. PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs. The 31st IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM 2023), Marina Del Rey, CA, May 2023.
Software description:
SQL2FPGA is a hardware-aware SQL query compilation framework for translating and efficiently mapping SQL queries onto modern heterogeneous CPU-FPGA platforms. SQL2FPGA takes the optimized query execution plans of SQL queries from big data query processing engines (currently Apache Spark SQL); performs hardware-aware optimizations to map query operations to FPGA accelerators (currently the AMD/Xilinx Vitis database overlays); and finally generates the deployable CPU host code and the associated FPGA accelerator configuration code.
Team Members:
Students: Alec Lu
Faculty: Zhenman Fang
Download:
SQL2FPGA is open source and available for download at github: https://github.com/SFU-HiAccel/SQL2FPGA
Further Interest:
For more details, please read our FCCM 2023 paper, or contact Alec Lu (alec_lu@sfu.ca).
If you use SQL2FPGA in your research, please cite our FCCM 2023 paper (bibtex download):
Alec Lu, Zhenman Fang. SQL2FPGA: Automatic Acceleration of SQL Query Processing on Modern CPU-FPGA Platforms. The 31st IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM 2023), Marina Del Rey, CA, May 2023.
Software description:
SASA is a scalable and automatic stencil acceleration framework on modern HBM-based FPGAs. It automatically parses a stencil DSL, exploits the best hybrid spatial and temporal parallelism configuration on an HBM-based FPGA, and generates the optimal FPGA design in Vitis HLS with TAPA/AutoBridge-based floorplanning optimization. It is developed and maintained by the SFU-HiAccel group at SFU, led by Dr. Fang.
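For concreteness, the kernels SASA generates implement iterative stencil computations such as the 4-point Jacobi update below (a plain software reference; SASA's generated designs run many such iterations with hybrid spatial and temporal parallelism):

```python
def jacobi_step(grid):
    """One iteration of a 4-point Jacobi stencil on a 2D grid: each
    interior cell becomes the average of its four neighbors. Boundary
    cells are left unchanged in this simple reference."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return out
```

Spatial parallelism computes many cells of one iteration at once, while temporal parallelism chains several iterations on chip; SASA's automation picks the best mix for a given stencil and HBM-based FPGA.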
Team Members:
Students: Xingyu Tian
Faculty: Zhenman Fang
Download:
SASA is open source and available for download at github: https://github.com/SFU-HiAccel/SASA
Further Interest:
For more details, please read our TRETS 2023 paper, or contact Xingyu (xingyut@sfu.ca).
If you use SASA in your research, please cite our TRETS 2023 paper (bibtex download):
Xingyu Tian, Zhifan Ye, Alec Lu, Licheng Guo, Yuze Chi, and Zhenman Fang. SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS 2023), 2023.
Software description:
SyncNN adopts a novel synchronous approach for rate-encoding-based spiking neural networks (SNNs) and accelerates SNNs on Xilinx ARM-FPGA SoC boards using HLS C++. The SyncNN framework scales to deep SNNs on various Xilinx FPGA boards. The code base includes three widely used image classification networks: LeNet, Network in Network (NiN), and VGG-13, evaluated on the MNIST, CIFAR-10, and SVHN datasets. It is developed and maintained by the SFU-HiAccel group at SFU, led by Dr. Fang.
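To illustrate the idea behind synchronous rate encoding, the sketch below converts input intensities to spike counts and evaluates a simplified integrate-and-fire neuron directly on those counts. This is a deliberately simplified model under our own assumptions; SyncNN's actual encoding and neuron evaluation differ in detail:

```python
def rate_encode(pixels, num_steps, max_val=255):
    """Rate encoding sketch: each input intensity becomes a spike count
    proportional to its value over a window of num_steps timesteps."""
    return [round(p * num_steps / max_val) for p in pixels]

def if_neuron_spikes(input_counts, weights, threshold):
    """Simplified integrate-and-fire neuron operating directly on spike
    counts, the flavor of synchronous evaluation SyncNN exploits: the
    membrane accumulates weighted input spike counts and the neuron emits
    floor(potential / threshold) output spikes (never negative)."""
    potential = sum(c * w for c, w in zip(input_counts, weights))
    return max(0, int(potential // threshold))
```

Working with whole spike counts per layer, rather than simulating every timestep event by event, is what makes the evaluation synchronous and hardware-friendly.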
Team Members:
Students: Sathish Panchapakesan
Faculty: Zhenman Fang
Download:
SyncNN is open source and available for download at github: https://github.com/SFU-HiAccel/SyncNN
Further Interest:
For more details, please read our FPL 2021 paper, or contact Sathish Panchapakesan (sathishp@sfu.ca).
If you use SyncNN in your research, please cite our FPL 2021 paper (bibtex download):
Sathish Panchapakesan, Zhenman Fang, Jian Li. SyncNN: Evaluating and Accelerating Spiking Neural Networks on FPGAs. The 31st International Conference on Field-Programmable Logic and Applications (FPL 2021), Virtual Conference, Sept 2021.
Software description:
Rodinia-hls is an FPGA version of the widely used GPU benchmark suite Rodinia, written in HLS (High-Level Synthesis) C/C++. The project was initiated by Dr. Fang when he was a postdoc at UCLA, mentoring several summer intern students; it is now updated and maintained by the HiAccel group at SFU, led by Dr. Fang. The host code is written in Xilinx OpenCL, for which we have abstracted a common code template. The kernels are written and optimized in Xilinx Vivado HLS C/C++. We apply a series of common optimizations, including tiling, pipelining, unrolling (parallelization), double buffering, and memory coalescing, and we include the code of the step-by-step HLS optimizations, which can be very useful for HLS beginners.
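The first of those optimizations, tiling, restructures a loop so data is processed in fixed-size chunks. The Python sketch below shows only the loop structure (the actual benchmarks express this in HLS C/C++ with pragmas); on the FPGA, each tile is burst-read into on-chip BRAM, and with double buffering the next tile's transfer overlaps the current tile's compute:

```python
def tiled_sum(data, tile_size):
    """Illustrative tiled reduction: process data in fixed-size tiles,
    modeling the tiling pattern used throughout the rodinia-hls kernels."""
    total = 0.0
    for base in range(0, len(data), tile_size):
        tile = data[base:base + tile_size]   # models a burst read into BRAM
        for v in tile:                       # compute on the on-chip tile
            total += v
    return total
```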
Team Members:
Students: Xingyu Tian, Alec Lu
Faculty: Zhenman Fang
Download:
Rodinia-hls is open source and available for download at github: https://github.com/SFU-HiAccel/rodinia-hls
Further Interest:
For more details, please read our FCCM 2018 paper, or contact Dr. Fang (zhenman@sfu.ca) or Xingyu (xingyut@sfu.ca).
If you use Rodinia-hls in your research, please cite our FCCM 2018 paper (bibtex download):
Jason Cong, Zhenman Fang, Michael Lo, Hanrui Wang, Jingxian Xu, Shaochong Zhang. Understanding Performance Differences of FPGAs and GPUs. The 26th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2018 short paper), Boulder CO, May 2018, pp. 172-175.
Software description:
uBench is a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the Xilinx Alveo FPGA memory systems (DRAM and HBM) under a comprehensive set of factors that affect the off-chip memory bandwidth and on-chip streaming bandwidth, including 1) the number of concurrent memory access ports, 2) the data width of each port, 3) the maximum burst access length for each port, and 4) the size of consecutive data accesses.
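The basic measurement idea is simple: time a bulk transfer of known size and report effective bandwidth. The host-side Python analogue below conveys only that calculation; uBench itself measures FPGA off-chip and on-chip streaming bandwidth with HLS kernels and hardware counters, sweeping the four factors listed above:

```python
import time

def measure_copy_bandwidth(num_bytes):
    """Toy analogue of a bandwidth microbenchmark: time one bulk memory
    copy of num_bytes and return the effective bandwidth in GB/s."""
    src = bytearray(num_bytes)
    t0 = time.perf_counter()
    dst = bytes(src)                 # one bulk copy of num_bytes
    elapsed = time.perf_counter() - t0
    return len(dst) / elapsed / 1e9
```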
Team Members:
Students: Alec Lu
Faculty: Zhenman Fang
Download:
uBench is open source and available for download at github: https://github.com/SFU-HiAccel/uBench
Further Interest:
For more details, please read our FPGA 2021 paper, or contact Alec Lu (alec_lu@sfu.ca).
If you use uBench in your research, please cite our FPGA 2021 paper (bibtex download):
Alec Lu, Zhenman Fang, Weihua Liu, Lesley Shannon. Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking. The 29th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2021), Virtual Conference, Mar 2021.
Software description:
CHIP-KNN is a configurable and high-performance K-nearest-neighbors (KNN) acceleration framework. It automatically generates a bandwidth-optimized KNN accelerator on cloud FPGA platforms and supports different configurations of (1) N, the number of points in the search space; (2) D, the number of data dimensions; (3) Dist, the distance metric; and (4) K, the number of nearest neighbors.
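The underlying computation is brute-force KNN search, sketched below with squared Euclidean distance (one of several possible Dist choices). This is a plain software reference; the generated accelerators stream the N points from memory banks in parallel and keep running top-K lists in hardware:

```python
import heapq

def knn(points, query, k):
    """Brute-force K-nearest-neighbors reference: compute the distance
    from the query to every point and return the k closest points."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, query))
    return heapq.nsmallest(k, points, key=sq_dist)
```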
Team Members:
Students: Alec Lu
Faculty: Zhenman Fang
Download:
CHIP-KNN is open source and available for download at github: https://github.com/SFU-HiAccel/CHIP-KNN
Further Interest:
For more details, please read our FPT 2020 paper, or contact Alec Lu (alec_lu@sfu.ca).
If you use CHIP-KNN in your research, please cite our FPT 2020 paper (bibtex download):
Alec Lu, Zhenman Fang, Nazanin Farahpour, Lesley Shannon. CHIP-KNN: A Configurable and High-Performance K-Nearest Neighbors Accelerator on Cloud FPGAs. IEEE International Conference on Field-Programmable Technology (FPT 2020). Virtual Conference, December 2020.
Software description:
PARADE is a cycle-accurate full-system simulation platform that enables the design and exploration of the emerging accelerator-rich architectures (ARA). It extends the widely used gem5 simulator with high-level synthesis (HLS) support.
PARADE simulates ARAs at the system level.
Team Members:
Students: Michael Gill, Yuchen Hao
Faculty: Jason Cong, Zhenman Fang, Glenn Reinman
Download:
PARADE is open source and available for download at github: https://github.com/cdsc-github/parade-ara-simulator
Here is the download link for disk image and Linux binaries that PARADE simulates.
Further Interest:
For more details, please read our ICCAD 2015 paper, ISCA 2015 tutorial, HPCA 2017 paper, and ASAP 2019 paper, or contact Dr. Zhenman Fang.
If you use PARADE in your research, please cite our ICCAD 2015 paper (bibtex download):
Jason Cong, Zhenman Fang, Michael Gill, Glenn Reinman. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2015), Austin TX, Nov 2015, pp. 380-387.
Software description:
Blaze is an accelerator-aware programming framework for warehouse-scale accelerator deployment. It provides a programming interface compatible with Apache Spark, an in-memory compute engine for large-scale data processing, and a runtime system that manages accelerators transparently. With Blaze, the effort of deploying accelerator tasks is reduced by more than 10x compared to traditional frameworks such as OpenCL. Blaze efficiently manages accelerator platforms with intelligent data caching and task pipelining schemes that minimize communication and data movement overheads.
Team Members:
Students: Muhuan Huang, Di Wu, Cody Hao Yu
Faculty: Jason Cong, Zhenman Fang
Download:
Blaze is open source and available for download at github: https://github.com/UCLA-VAST/blaze
Further Interest:
For more details, please read our ACM SOCC 2016 paper, or contact Di Wu (allwu@cs.ucla.edu) or Cody Hao Yu (hyu@cs.ucla.edu).
If you use Blaze in your research, please cite our ACM SOCC 2016 paper (bibtex download):
Muhuan Huang, Di Wu, Cody Hao Yu, Zhenman Fang, Matteo Interlandi, Tyson Condie, Jason Cong. Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale. The ACM Symposium on Cloud Computing (ACM SoCC 2016), Santa Clara, CA, Oct 2016, pp. 456-469.
Software description:
With the rapid evolution of CPU-FPGA heterogeneous acceleration platforms, it is critical for both platform developers and users to quantify the fundamental microarchitectural features of the platforms. We developed a set of microbenchmarks to evaluate mainstream CPU-FPGA platforms.
The first benchmark (https://github.com/peterpengwei/Microbench_AlphaData) targets the Alpha Data card, which connects a CPU to an FPGA via the PCIe interface. The benchmark follows the Xilinx SDAccel programming model and contains a host program written in C and a kernel program written in OpenCL. Using a set of timers in the host program, users can quantify the latency and bandwidth of each critical step, including device buffer allocation, pageable-to-pinned memory copy, PCIe DMA, etc.
The second benchmark (https://github.com/peterpengwei/Microbench_HARP) targets the Intel/Altera Heterogeneous Accelerator Research Platform (HARP). HARP connects a CPU to an FPGA via Intel's QPI processor interconnect and implements a coherent cache interface (CCI) on the FPGA side to achieve coherence between the CPU and the FPGA. The benchmark follows the Intel AALSDK programming model and contains a host program written in C++ and a kernel program written in Verilog HDL. Users can quantify the hit/miss latency of the coherent cache, as well as the remote memory access bandwidth between the FPGA and the main memory on the CPU side.
Team Members:
Students: Young-kyu Choi, Yuchen Hao, Peng Wei
Faculty: Jason Cong, Zhenman Fang, Glenn Reinman
Further Interest:
For more details, please read our DAC 2016 paper, or contact Peng Wei or Dr. Zhenman Fang.
If you use our microbenchmarks in your research, please cite our DAC 2016 paper (bibtex download):
Young-kyu Choi, Jason Cong, Zhenman Fang, Yuchen Hao, Glenn Reinman, and Peng Wei. A Quantitative Analysis on Microarchitectures of Modern CPU-FPGA Platforms. Proceedings of the 53rd Annual Design Automation Conference (DAC 2016), Austin, TX, June 5-9, 2016.
Software description:
This project implements a high-throughput FPGA accelerator for the widely used Deflate compression algorithm, the core of many lossless compression standards such as ZLIB and GZIP. We propose a novel multi-way parallel and fully pipelined architecture to achieve high-throughput lossless compression on modern FPGA platforms. To compensate for the compression ratio loss of a multi-way design, we implement novel techniques such as a better data feeding method and a hash chain that extends the hash dictionary history. Our accelerator kernel alone achieves a compression throughput of 12.8 GB/s (2.3x better than the previous record throughput) and a comparable compression ratio of 2.03 on standard benchmark data. Our approach enables design scalability without a reduction in clock frequency and also improves performance-per-area efficiency (by up to 1.5x). Moreover, we exploit the high CPU-FPGA communication bandwidth of the HARPv2 platform to improve the overall system compression throughput, achieving an average practical end-to-end throughput of 10.0 GB/s (up to 12 GB/s for larger input files) on HARPv2.
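To show what the hash-chain stage does, here is a heavily simplified sketch of LZ77 match finding, the part of Deflate where the hash dictionary history matters. It is a hedged illustration (3-byte hash, greedy longest match, tiny window, no lazy matching or Huffman coding), not the accelerator's design:

```python
def lz_matches(data, window=64):
    """Sketch of hash-chain match finding at the heart of Deflate's LZ77
    stage. Returns a list of tokens: either a literal byte (int) or a
    (distance, length) back-reference into the already-seen data."""
    head = {}       # 3-byte prefix -> earlier positions (the "hash chain")
    tokens, i = [], 0
    while i < len(data):
        key = bytes(data[i:i + 3])
        best_len, best_dist = 0, 0
        for j in reversed(head.get(key, [])):    # walk the chain, newest first
            if i - j > window:
                break                            # beyond the history window
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= 3:
            tokens.append((best_dist, best_len))
            step = best_len
        else:
            tokens.append(data[i])               # emit a literal byte
            step = 1
        for p in range(i, i + step):             # index the new positions
            head.setdefault(bytes(data[p:p + 3]), []).append(p)
        i += step
    return tokens
```

A longer hash chain finds matches deeper in the history and improves the compression ratio; the challenge the paper addresses is keeping that search fully pipelined across multiple parallel ways in hardware.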
Team Members:
Students: Weikang Qiao, Jieqiong Du
Faculty: Jason Cong, Zhenman Fang
Download:
Our FPGA compression accelerator is open source and available for download at github: https://github.com/UCLA-VAST/HT-Deflate-FPGA
Further Interest:
For more details, please read our FCCM 2018 paper, or contact Weikang Qiao (wkqiao2015@ucla.edu).
If you use our high-throughput compression accelerator in your research, please cite our FCCM 2018 paper (bibtex download):
Weikang Qiao, Jieqiong Du, Zhenman Fang, Jason Cong, Mau-Chung Frank Chang. High-Throughput Lossless Compression on Tightly-Coupled CPU-FPGA Platforms. The 26th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2018), Boulder CO, May 2018, pp. 37-44.
Software description:
To guide memory optimizations via compiler transformations and performance tuning on multi- and many-core systems (Intel Xeon and Xeon Phi), we propose a novel microbenchmarking methodology based on short elapsed-time events to uncover relevant microarchitectural details unpublished by vendors. We also conduct a detailed analysis of potential interfering factors that could affect the intended behavior of such memory systems, and lay out effective guidelines to control and mitigate those interfering factors.
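A classic example of this kind of probe is pointer chasing: every access depends on the previous one, so the hardware cannot overlap loads and the measured time exposes raw access latency. The Python sketch below shows the structure of such a probe; the actual microbenchmarks are native code, since in Python the timing is dominated by interpreter overhead:

```python
import random
import time

def pointer_chase_ns(num_nodes, iters=100000, seed=0):
    """Sketch of a pointer-chasing latency microbenchmark: build one
    random cyclic permutation over num_nodes slots, then chase it so each
    access depends on the last. Returns average ns per dependent access."""
    rng = random.Random(seed)
    perm = list(range(num_nodes))
    rng.shuffle(perm)
    # Turn the permutation into a single cycle: nxt[a] = successor of a.
    nxt = [0] * num_nodes
    for a, b in zip(perm, perm[1:] + perm[:1]):
        nxt[a] = b
    p = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        p = nxt[p]                     # each load depends on the previous one
    return (time.perf_counter() - t0) / iters * 1e9
```

Varying num_nodes sweeps the working set across cache levels, and the jumps in measured latency reveal cache sizes and memory-level details.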
Download:
The microbenchmarks can be downloaded here.
Further Interest:
For more details, please read our TACO 2015 paper, or contact Dr. Zhenman Fang.
If you use our microbenchmarks in your research, please cite our TACO 2015 paper (bibtex download):
Zhenman Fang, Sanyam Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, Binyu Zang. Measuring Microarchitectural Details of Multi- and Many-Core Memory Systems through Microbenchmarking. ACM Transactions on Architecture and Code Optimization (TACO), 2015.