Please check Publications for our latest projects
Enabling customizable FPGA accelerators in datacenters
Modeling and design of Accelerator-Rich Architectures (ARAs)
When do applications run better on GPUs, when on FPGAs, and why?
Improving memory system performance for multi-/many-core processors
Research infrastructure development for multicore architectures
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters for their low power, high performance, and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Azure cloud, Intel's US$16.7B acquisition of Altera, and the public FPGA cloud offerings announced by Amazon, Alibaba, and Huawei, integrating FPGAs into datacenters is considered one of the most promising approaches to sustain future datacenter growth. However, it is quite challenging for existing big data computing systems, such as Apache Spark and Hadoop, to exploit the performance and energy efficiency of FPGA accelerators.
In this project, we study how to choose the right CPU+FPGA server platform in datacenters, and how to efficiently integrate FPGA accelerators into big data computing systems.
We were the first to conduct a quantitative study and in-depth analysis of the microarchitectures of emerging CPU-FPGA acceleration platforms, including 1) a Xeon CPU + Alpha Data board (PCIe), 2) the QPI-based Intel-Altera HARP v1 and v2 (now known as the Xeon+FPGA multi-chip package), and 3) the IBM CAPI system (coherent PCIe). We characterized CPU-FPGA communication latency and throughput and provided users with insights into platform selection. The results and open-source microbenchmarks were published in [DAC 2016], which now has over 100 citations on Google Scholar. An ACM TRETS journal extension has also been accepted.
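To make the measurement methodology concrete, the following is a minimal sketch of a host-to-FPGA transfer microbenchmark, assuming an OpenCL-capable PCIe FPGA board; the released DAC 2016 microbenchmarks are more thorough (more platforms, communication paths, and access patterns). The idea is simply that sweeping transfer sizes exposes latency at the small end and bandwidth at the large end.

```cpp
// Sketch only: measure blocking host->FPGA buffer writes over a range of sizes.
#include <CL/cl.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  cl_platform_id platform;
  cl_device_id device;
  clGetPlatformIDs(1, &platform, nullptr);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);

  // Small sizes expose round-trip latency; large sizes expose sustained bandwidth.
  for (size_t bytes = 64; bytes <= (64u << 20); bytes <<= 1) {
    std::vector<char> host(bytes, 1);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, nullptr);

    const int iters = 100;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i)  // blocking write: host DRAM -> FPGA device memory
      clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host.data(), 0, nullptr, nullptr);
    auto t1 = std::chrono::high_resolution_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count() / iters;
    std::printf("%10zu bytes: %8.2f us, %8.2f MB/s\n", bytes, sec * 1e6, bytes / sec / 1e6);
    clReleaseMemObject(buf);
  }
  clReleaseCommandQueue(q);
  clReleaseContext(ctx);
  return 0;
}
```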
Second, based on a case study of Apache Spark and FPGA integration for a DNA sequencing application [HotCloud 2016], I proposed the concept of accelerator-as-a-service (AaaS) and developed the Blaze system in collaboration with UCLA spin-off Falcon Computing. Blaze implements AaaS and provides efficient programming and runtime support for the state-of-the-art big data framework Spark to easily utilize FPGA and GPU accelerators, which can be efficiently shared among multiple threads and multiple nodes. Blaze was published at the ACM Symposium on Cloud Computing 2016 and is open source, bringing hardware developers, application developers, and system developers together to realize the AaaS concept. Blaze has also been successfully commercialized by Falcon Computing.
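To illustrate the AaaS idea (this is not the actual Blaze API), here is a hypothetical, simplified sketch of an accelerator task interface and its runtime dispatch: a hardware developer implements the accelerated kernel once, the runtime shares it among application threads and nodes, and a CPU fallback keeps applications correct when no accelerator is available. All names below (AccTask, process_partition, etc.) are illustrative.

```cpp
// Hypothetical sketch of an accelerator-as-a-service task interface.
#include <cstddef>
#include <vector>

// Interface a hardware developer implements once per accelerated kernel.
class AccTask {
 public:
  virtual ~AccTask() = default;
  // Offload one batch of records to the FPGA/GPU; return false to request CPU fallback.
  virtual bool run(const std::vector<float>& in, std::vector<float>& out) = 0;
};

// Software path that keeps the application correct when no accelerator is free.
void cpu_kernel(const std::vector<float>& in, std::vector<float>& out) {
  out.resize(in.size());
  for (size_t i = 0; i < in.size(); ++i) out[i] = in[i] * 2.0f;  // placeholder compute
}

// What the big-data runtime does per partition: try the shared accelerator,
// otherwise run the software path, so application code stays unchanged.
void process_partition(AccTask* acc, const std::vector<float>& in, std::vector<float>& out) {
  if (acc == nullptr || !acc->run(in, out)) cpu_kernel(in, out);
}
```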
Finally, this line of research has also led to multiple invited talks (see my CV) at universities and companies in the US, Canada, Switzerland, Italy, mainland China, and Taiwan. It is also highlighted in our 2019 Proceedings of the IEEE journal paper.
More work remains to be done along this line of research.
The power and utilization walls in today's processors have led to a focus on accelerator-rich architectures (ARAs), which will include a sea of accelerators that can achieve orders-of-magnitude performance and energy gains. ARAs are still at an early stage, and many system-level design issues, such as efficient accelerator resource management and efficient communication between accelerators and CPU cores, remain open.
In this project, we work on modeling and designing Accelerator-Rich Architectures (ARAs).
First, we designed and implemented two major research infrastructures to enable such design-space exploration: 1) PARADE [ICCAD 2015], an open-source cycle-accurate full-system simulator that extends the widely used gem5 simulator with high-level synthesis (HLS) to model accelerators, and 2) the ARAPrototyper flow [FPGA 2016 poster, arXiv 2016], which enables faster prototyping and evaluation of ARAs on the Xilinx Zynq ARM-FPGA SoC. I co-organized tutorials on PARADE and ARAPrototyper at two top computer architecture conferences (ISCA 2015 and MICRO 2016), which were well received.
Based on these research infrastructures, we also investigated novel architecture support that optimizes CPU-accelerator interaction by providing a unified memory space with efficient address translation; it achieves a 7.6x speedup over the naïve solution and leaves only a 6.4% gap to ideal performance. This work was nominated as a best paper candidate at HPCA 2017, one of the top computer architecture conferences. Furthermore, we proposed a near-memory acceleration architecture called accelerator-interposed memory (AIM), which achieves 4x better performance than CPU-side acceleration for genomics workloads. The AIM work won the best paper award at MEMSYS 2017.
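As a rough illustration of the unified-memory idea (not the actual HPCA 2017 design), the sketch below models an accelerator-side TLB backed by a slow host page-table walk, so the accelerator can operate directly on virtual addresses; the page size, TLB organization, and host_page_walk placeholder are all assumptions made for the example.

```cpp
// Sketch: accelerator-side address translation with a host page-walk fallback.
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kPageBits = 21;            // assume 2 MB pages for large arrays
constexpr uint64_t kPageSize = 1ull << kPageBits;

struct AccTLB {
  std::unordered_map<uint64_t, uint64_t> entries;  // VPN -> PPN (fully associative model)

  // Translate a virtual address; on a miss, fall back to the host page walk.
  uint64_t translate(uint64_t vaddr) {
    uint64_t vpn = vaddr >> kPageBits;
    auto it = entries.find(vpn);
    if (it == entries.end()) {
      uint64_t ppn = host_page_walk(vpn);     // expensive: goes back to the CPU/IOMMU
      entries[vpn] = ppn;
      it = entries.find(vpn);
    }
    return (it->second << kPageBits) | (vaddr & (kPageSize - 1));
  }

  // Placeholder for the host-side walk of the process page table.
  static uint64_t host_page_walk(uint64_t vpn) { return vpn; /* identity for the sketch */ }
};
```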
Finally, this line of research has also led to multiple invited talks (see my CV) at universities and companies in the US, Canada, Switzerland, Italy, and China. It is also highlighted in our 2019 Proceedings of the IEEE journal paper.
More work remains to be done along this line of research.
For many data-intensive applications, such as machine learning, video processing, computational genomics, big data analytics, and network processing, performance and energy efficiency are limited not only by the computation itself but also by the data communication. One of the most promising solutions to the computation challenge is to move from general-purpose processors to specialized hardware accelerators, where the computation is fully customized for an application or application domain to achieve the best performance. Meanwhile, one of the most promising solutions to the data communication challenge is to push computing closer to the data, where data accesses have lower latency and higher bandwidth and consume less energy.
In this project, we focus on near-data acceleration, aiming to address both the computation and communication challenges in data-intensive applications.
As an initial step, we have proposed a near-memory acceleration architecture called accelerator-interposed memory (AIM), which places a reconfigurable chip (the AIM module) near each DRAM DIMM. Experimental results show that AIM achieves 4x better performance than CPU-side acceleration for genomics workloads. The AIM work won the best paper award at MEMSYS 2017.
More work remains to be done along this line of research.
To improve the performance and energy efficiency of important application domains, different kinds of accelerators have been developed, including GPUs, FPGAs, and ASICs. Compared to ASICs, GPUs and FPGAs have gained more popularity by providing better programmability and flexibility. It is natural to ask: when is an FPGA better, when is a GPU better, and why?
In this project, we aim to better understand the performance differences between FPGAs and GPUs.
As an initial step, we have ported 11 Rodinia benchmarks (15 kernels in total) to the FPGA, using Vivado HLS (high-level synthesis) C for the kernels and OpenCL for the host programs. To achieve reasonable performance on FPGAs, we apply a sequence of optimizations, including caching (tiling), customized pipelining, parallelization, double buffering, and memory coalescing and bursting, which can be easily understood and mastered by software programmers. Our preliminary results (published in [FCCM 2018]) show that, for 6 out of the 15 ported kernels, the Xilinx Virtex-7 FPGA (28 nm) provides comparable or even better performance than the NVIDIA K40 GPU (28 nm), while consuming an average of 28% of the GPU's power.
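The following is a minimal Vivado HLS sketch of a hypothetical vector-scale kernel showing what these optimizations look like in code (tiling into on-chip buffers, burst reads/writes, pipelining, and partial unrolling); the actual Rodinia kernels and their tuning in the FCCM 2018 work are more involved, and double buffering (e.g., via DATAFLOW to overlap read, compute, and write) is omitted here for brevity.

```cpp
// Sketch: tiled, pipelined vector-scale kernel with burst memory access.
extern "C" void scale(const float *in, float *out, int n, float alpha) {
#pragma HLS INTERFACE m_axi port=in  offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=alpha
#pragma HLS INTERFACE s_axilite port=return

  const int TILE = 1024;
  float buf_in[TILE], buf_out[TILE];       // on-chip BRAM tiles (caching)
#pragma HLS ARRAY_PARTITION variable=buf_in  cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=buf_out cyclic factor=8

  for (int base = 0; base < n; base += TILE) {
    int len = (n - base < TILE) ? (n - base) : TILE;

  read:                                    // burst read from DRAM (coalescing/bursting)
    for (int i = 0; i < len; ++i) {
#pragma HLS PIPELINE II=1
      buf_in[i] = in[base + i];
    }

  compute:                                 // pipelined + partially unrolled (parallelization)
    for (int i = 0; i < len; ++i) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=8
      buf_out[i] = alpha * buf_in[i];
    }

  write:                                   // burst write back to DRAM
    for (int i = 0; i < len; ++i) {
#pragma HLS PIPELINE II=1
      out[base + i] = buf_out[i];
    }
  }
}
```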
More work remains to be done along this line of research.
I have worked on the characterization and acceleration of many emerging workloads on commodity hardware, which drives my work on customizable computing. For example, during my PhD, we parallelized a typical image retrieval algorithm called SURF on commodity multicore CPUs and GPUs, where the GPU achieved a 46x speedup over a single-core CPU [ISPASS 2011]. We also worked on FPGA acceleration of the widely used convolutional neural network (CNN) algorithm and implemented the Caffeine FPGA engine [ICCAD 2016, TCAD 2018] for the industry-standard machine learning framework Caffe. Caffeine achieves 1.46 TOPS for the 8-bit convolution layers, a 100x speedup for the fully connected layers, and 5.7x energy savings over a high-end NVIDIA GPU. The Caffeine paper now has over 300 citations on Google Scholar and received the 2019 IEEE CEDA Donald O. Pederson Best Paper Award.
More recently, we have worked on profiling and understanding computational genomics workloads, which was nominated as a best paper candidate at ISPASS 2018. We also proposed both scale-out and scale-up solutions to accelerate the DNA sequencing pipeline in customized datacenters [HotCloud 2016], and proposed the AIM architecture for near-memory acceleration [MEMSYS 2017]. Moreover, we accelerated a genome compression algorithm on the Intel-Altera HARPv2 server and achieved a record 12.8 GB/s compression throughput [FCCM 2018].
More work remains to be done along this line of research.
To mitigate the memory wall on today's multicore and many-core systems, hardware-based data prefetching has been used extensively. Software prefetching using compiler techniques still lags behind, in particular for the more sophisticated multilevel cache hierarchies of multicore and many-core systems such as Intel's Xeon Phi and Sandy Bridge. In this project, I first proposed a novel micro-benchmarking methodology [TACO 2015] based on short elapsed-time events to uncover relevant microarchitectural details unpublished by vendors. Based on this, I proposed a coordinated multi-stage data prefetching algorithm [ICS 2014], implemented in the ROSE compiler, that coordinates prefetching across multilevel caches and between software and hardware prefetching; it achieved a 1.5x average speedup on top of Xeon Phi's hardware prefetchers for a variety of memory-intensive benchmarks.
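As a rough sketch of the multi-stage idea (with illustrative distances, not the ones the ICS 2014 compiler pass derives), each element below is prefetched twice: far ahead toward a lower cache level to hide DRAM latency, and again shortly before use to stage it into L1. The real algorithm also coordinates these requests with the hardware prefetchers, which this sketch does not model.

```cpp
// Sketch: two-stage software prefetching in a streaming loop.
#include <cstddef>

void scale_with_prefetch(const double* __restrict a, double* __restrict b,
                         std::size_t n, double alpha) {
  constexpr std::size_t DIST_L2 = 64;   // far-ahead stage: hide DRAM latency
  constexpr std::size_t DIST_L1 = 8;    // near stage: move the line from L2 into L1

  for (std::size_t i = 0; i < n; ++i) {
    if (i + DIST_L2 < n)
      __builtin_prefetch(&a[i + DIST_L2], /*rw=*/0, /*locality=*/2);  // toward outer cache
    if (i + DIST_L1 < n)
      __builtin_prefetch(&a[i + DIST_L1], /*rw=*/0, /*locality=*/3);  // toward L1
    b[i] = alpha * a[i];
  }
}
```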
Cycle-accurate multicore simulators are extremely useful tools for architecture research and performance tuning. To support new architecture features, higher accuracy, or faster simulation, a number of new functional models (FMs, which execute the OS and applications) and timing models (TMs, which model the detailed microarchitecture) have been proposed. However, easily integrating these powerful models and achieving reasonable simulation speed remain challenging. To ease the process of extending new FMs or TMs, and to enable efficient parallel simulation and sampling techniques for faster multicore simulation, I proposed a loosely coupled, FM-driven multicore simulator framework called Transformer [DAC 2012]. To further speed up simulation, I proposed a multilevel repetitive program phase analysis (MLPA) method [LCTES 2012]. MLPA combines coarse-grained (outer-loop) and fine-grained (inner-loop) phase analysis and achieves a 14x speedup over the widely used SimPoint sampling simulation technique.
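The sketch below conveys the loosely coupled, FM-driven organization with hypothetical types (it is not the Transformer implementation): the functional model runs ahead and streams committed instructions into a shared trace queue, and the timing model consumes them on a separate thread, so either side can be replaced or parallelized independently.

```cpp
// Sketch: FM produces an instruction trace; TM consumes it asynchronously.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

struct TraceRecord { uint64_t pc; uint64_t mem_addr; bool is_load; };

std::queue<TraceRecord> trace;      // FM -> TM channel (a real framework would use
std::mutex trace_mu;                // a lock-free ring buffer per simulated core)
std::atomic<bool> fm_done{false};

void functional_model() {           // executes the program and emits committed instructions
  for (uint64_t pc = 0; pc < 1000; pc += 4) {
    std::lock_guard<std::mutex> g(trace_mu);
    trace.push({pc, /*mem_addr=*/pc * 8, /*is_load=*/true});
  }
  fm_done = true;
}

void timing_model() {               // models pipeline/cache timing from the trace
  uint64_t cycles = 0;
  while (true) {
    TraceRecord r;
    {
      std::lock_guard<std::mutex> g(trace_mu);
      if (trace.empty()) { if (fm_done) break; else continue; }
      r = trace.front(); trace.pop();
    }
    cycles += r.is_load ? 4 : 1;    // placeholder latency model
  }
  std::printf("simulated cycles: %llu\n", (unsigned long long)cycles);
}

int main() {
  std::thread fm(functional_model), tm(timing_model);
  fm.join(); tm.join();
  return 0;
}
```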