Amir Ghazizadeh — Efficient AI Systems · HW/SW Co-Design

Hi, I'm Amir —

a Bit-Bothering Researcher. an ML Accelerator Designer. an Efficient ML Systems Designer.

Senior Ph.D. student in Computer Engineering at the University of Central Florida, working at the seam between computer architecture and AI/ML systems. I design hardware-software co-optimized accelerators for GNNs, DNNs, and sparse workloads, translating model-level insight into dataflow, tiling, and silicon-aware decisions. Published at NeurIPS and MICRO. I like the parts of the stack where the abstraction starts to leak.

AI Accelerator Design · GNN Acceleration · Sparse Kernels · HW/SW Co-Design · CUDA / GPU Systems · LLM × Graph

[01] research

paper · NeurIPS '25 · GAMMA

GNN Architecture Design for Heterophilic Graphs

Gated multi-hop message passing rooted in an information-theoretic objective that maximizes relevant signal across hops. SOTA accuracy on 16 heterophilic benchmarks, 12× lower GPU memory, and 20× inference speedup via weight sharing and fixed-dim design.

PyTorch · PyG · Heterophily

paper
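As a rough illustration of the idea of gated multi-hop aggregation with a fixed output dimension, here is a minimal pure-Python sketch. It is not the GAMMA implementation: the function names, the mean aggregator, and the single scalar gate per hop are all assumptions made for illustration. The point it shows is that summing gated hop outputs, instead of concatenating them, keeps the representation size independent of the number of hops.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_aggregate(neighbors, feats):
    """One hop: each node takes the mean of its neighbors' feature vectors."""
    dim = len(feats[0])
    out = []
    for nbrs in neighbors:
        if not nbrs:
            out.append([0.0] * dim)  # isolated node receives no message
            continue
        out.append([sum(feats[n][d] for n in nbrs) / len(nbrs) for d in range(dim)])
    return out


def gated_multi_hop(neighbors, feats, gate_logits):
    """Add one gated hop per entry in gate_logits (hypothetical scalar gates).

    Hop outputs are summed into `combined`, so the output dimension
    equals the input dimension no matter how many hops are used.
    """
    hop = feats
    combined = [list(row) for row in feats]
    for logit in gate_logits:
        hop = mean_aggregate(neighbors, hop)
        gate = sigmoid(logit)
        for i, row in enumerate(hop):
            for d, value in enumerate(row):
                combined[i][d] += gate * value
    return combined


# Toy path graph 0 - 1 - 2 with one feature per node.
neighbors = [[1], [0, 2], [1]]
feats = [[1.0], [2.0], [3.0]]
result = gated_multi_hop(neighbors, feats, gate_logits=[0.0, 0.0])
```

With both gate logits at 0.0, each gate is 0.5 and every hop contributes half of its aggregated value to each node.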

[02] research

paper · MICRO '25

HW/SW Co-Design for Sparse Matrix Multiplication

Reformulated SpMM as a graph-transformation problem: non-contiguous tiling, custom PE µ-architecture, and a Bidirectional Fiber Tree compression format. Cycle-accurate simulation (Ramulator + HBM): 3.2× lower off-chip traffic, 4.8× better energy efficiency, 4.3× average speedup over Sextans.

SpMM · Cycle-Accurate · Ramulator / HBM

paper
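For readers less familiar with the kernel, the sketch below shows sparse-times-dense matrix multiplication (SpMM) over the standard CSR format. It is the textbook baseline only, not the Bidirectional Fiber Tree format or the tiling scheme described above. The function name `spmm_csr` and the toy matrices are illustrative assumptions.

```python
def spmm_csr(indptr, indices, data, dense):
    """Multiply a CSR sparse matrix by a dense matrix (list of rows)."""
    n_rows = len(indptr) - 1
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        # Nonzeros of this sparse row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            col, val = indices[k], data[k]
            dense_row = dense[col]  # irregular, data-dependent access
            for j in range(n_cols):
                out[row][j] += val * dense_row[j]
    return out


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
indptr = [0, 2, 3, 5]
indices = [0, 2, 2, 0, 1]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

C = spmm_csr(indptr, indices, data, B)
# C == [[11.0, 14.0], [15.0, 18.0], [19.0, 28.0]]
```

The `dense[col]` lookup is where the irregular memory traffic comes from: which dense rows are fetched depends entirely on the sparsity pattern.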

[03] research

M.Sc. · ICCKE '24

Efficient Deep Learning Model Quantization

Adaptive post-training quantization with mixed-precision allocation for Capsule Networks, driven by layer-wise vulnerability analysis. Power-of-Two scalers replace floating-point ops with hardware bitshifts — 8.67× weight compression, 4.56× activation reduction, 4× speedup, within 1% of full-precision accuracy.

PyTorch · Mixed Precision · CapsNet

repo
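The bitshift claim can be made concrete with a small sketch. Assuming a quantization scale is rounded to the nearest power of two in the log domain (one common choice, not necessarily the one used in this work), rescaling an integer accumulator becomes a shift instead of a floating-point multiply. The helper names below are hypothetical.

```python
import math


def pot_exponent(scale):
    """Return integer e such that 2**e approximates a positive scale."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return round(math.log2(scale))


def rescale_with_shift(acc, exponent):
    """Compute acc * 2**exponent on integers using shifts only."""
    if exponent >= 0:
        return acc << exponent
    shift = -exponent
    # Add half of the divisor first so the arithmetic right shift
    # rounds to nearest instead of always rounding down.
    return (acc + (1 << (shift - 1))) >> shift


e = pot_exponent(0.0123)             # -6, since 2**-6 = 0.015625
scaled = rescale_with_shift(1000, e)  # 16, close to 1000 * 0.015625 = 15.625
```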

[04] research

M.Sc. work

Reliability of Capsule Networks under SEU

Fault-injection study tracing how Single-Event Upsets propagate through dynamic-routing layers, with mitigation strategies for soft-error tolerance — feeding directly into the layer-wise vulnerability ranking used in the quantization work.

CapsNet · SEU / Fault-Injection · Reliability

repo
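A Single-Event Upset is commonly modeled as a single bit flip in a stored value. The sketch below flips one bit of a float32 weight using only the standard library. The function name and bit numbering are assumptions for illustration, and the actual study's injection framework may differ.

```python
import struct


def flip_bit_float32(value, bit):
    """Flip one bit of value's IEEE 754 float32 encoding.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped


sign_flip = flip_bit_float32(1.0, 31)      # -1.0
exponent_lsb = flip_bit_float32(1.0, 23)   # 0.5
exponent_msb = flip_bit_float32(1.0, 30)   # inf
```

The three cases hint at why bit position matters for vulnerability analysis: a flip in a high exponent bit can turn a weight of 1.0 into infinity, while a low mantissa bit barely changes it.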

[05] embedded

undergrad · embedded

Soccer Robot Development

Embedded firmware in C/C++ for ARM-based microcontrollers, paired with OpenCV-based ball-detection pipelines on the host side. End-to-end build of the perception, control, and wireless command loop.

C/C++ · ARM MCU · OpenCV

repo

[06] arch

course · arch

MIPS Processor in Proteus

A complete MIPS ISA datapath built from gate-level primitives — fetch, decode, ALU, hazard handling, and memory — simulated and verified inside Proteus. Same material I later TA'd for UCF's Computer Organization course.

MIPS ISA · Proteus · Datapath

repo

[07] parallel

course · parallel

Parallel Algorithms — Multicore & GPU

Course project benchmarking parallel primitives across CPU thread pools and CUDA kernels — communication, contention, warp coordination, and the gap between Amdahl and reality.

CUDA · OpenMP · Benchmarks

repo
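For reference, the Amdahl bound mentioned above is easy to compute. The sketch below evaluates the ideal speedup for a given parallel fraction and worker count. The function name and the sample numbers are illustrative and are not measurements from the course project.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Ideal speedup when a fraction of the work scales across workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


on_8_cores = amdahl_speedup(0.9, 8)       # about 4.71
on_1024_lanes = amdahl_speedup(0.9, 1024)  # just under the 10x ceiling
```

Measured speedups typically fall below this bound because the formula ignores communication, contention, and synchronization costs.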

> Experience

'25

GNN Architecture for Heterophilic Graphs

UCF · RA

Information-theoretic gated message passing — SOTA accuracy on 16 heterophilic benchmarks, 12× lower GPU memory, 20× inference speedup. NeurIPS '25 (GAMMA).

'24

HW/SW Co-Design for SpMM

UCF · RA

Graph-transformation framework, custom PE µ-arch, and a Bidirectional Fiber Tree compression format. 3.2× off-chip reduction, 4.8× energy gains, 4.3× speedup vs. Sextans. MICRO '25.

'24

Teaching Assistant

UCF · Computer Org.

Delivered lectures on the MIPS ISA and processor hardware design — datapath, control, hazards — with concept-driven exercises to build student debugging intuition.

'21–'22

MVL Arbiter PUF for HW Auth

KGUT · Visiting Researcher

Designed Arbiter PUFs in 32 nm and 14 nm CNTFET libraries with multi-valued logic, simulated in HSPICE. 99.56% (ternary) and 99.15% (quaternary) reliability across temperature and supply variations, deviation under 2%.
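A common way to report PUF reliability is 100% minus the average intra-device Hamming distance between a reference response and responses re-measured under varied conditions. The sketch below applies that definition to multi-valued symbols. It is an assumption that this matches the metric used here, and the function name and sample responses are made up for illustration.

```python
def puf_reliability(reference, remeasured):
    """Percent of response symbols that match the reference on average.

    `reference` is one response (a list of symbols, e.g. 0/1/2 for
    ternary logic); `remeasured` is a list of responses to the same
    challenge taken under different temperature or supply conditions.
    """
    if not remeasured:
        raise ValueError("need at least one re-measured response")
    mismatches = 0
    for response in remeasured:
        if len(response) != len(reference):
            raise ValueError("responses must have equal length")
        mismatches += sum(a != b for a, b in zip(reference, response))
    return 100.0 * (1.0 - mismatches / (len(reference) * len(remeasured)))


reference = [0, 1, 2, 1, 0, 2, 2, 1]
remeasured = [
    [0, 1, 2, 1, 0, 2, 2, 1],  # identical to the reference
    [0, 1, 2, 1, 0, 2, 0, 1],  # one symbol differs
]
reliability = puf_reliability(reference, remeasured)  # 93.75
```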

'21–'22

Efficient DL Model Quantization

Shiraz Univ. · RA

Adaptive post-training quantization with layer-wise vulnerability analysis and Power-of-Two scalers (bitshift-friendly). 8.67× weight compression, 4.56× activation reduction, accuracy within 1% of FP baseline.

'14–'20

Soccer Robot

Shahid Bahonar Univ.

Embedded software in C/C++ for ARM microcontrollers; OpenCV-based ball-detection pipelines on the host side. End-to-end perception, control, and wireless command loop.

> Education

'22→

Ph.D. Computer Engineering

University of Central Florida — Orlando, FL. Research: efficient AI systems, hardware accelerators, and GPU kernels for sparse and irregular workloads.

'20–'22

M.Sc. Computer Engineering

Shiraz University — Shiraz, Iran. Ranked 1st in M.Sc. studies. Thesis on capsule-network quantization and reliability under single-event upsets.

'14–'20

B.Sc. Computer Engineering

Shahid Bahonar University — Kerman, Iran. Foundation in digital design, architecture, and embedded systems.

#Selected publications

TensorPrism: Rethinking Sparse High-order Tensor Acceleration via Co-occurrence Graph

GAMMA: Gated Multi-hop Message Passing for Homophily-Agnostic Node Representation in GNNs

Rethinking Tiling and Dataflow for SpMM Acceleration: A Graph Transformation Framework

Towards Efficient Capsule Networks Through Approximate Squash Function and Layer-Wise Quantization

EGMA: Enhancing Data Reuse and Workload Balancing in Message Passing GNN Acceleration via Gram Matrix Optimization

Polyform: A Versatile Architecture for Multi-DNN Execution via Spatial and Temporal Acceleration

Aries: Accelerating Distributed Training in Chiplet-Based Systems via Flexible Interconnects

#Experience & education

> Service

#Technical skills

Programming

ML & Systems

Architecture & HW

#Get in touch

// contact

// how can I help?