$ welcome — initializing portfolio.exe

Hi, I'm Amir —
a Bit-Bothering Researcher, an ML Accelerator Designer, and an Efficient ML Systems Designer.

Senior Ph.D. student in Computer Engineering at the University of Central Florida, working at the seam between computer architecture and AI/ML systems. I design hardware-software co-optimized accelerators for GNNs, DNNs, and sparse workloads, translating model-level insight into dataflow, tiling, and silicon-aware decisions. Published at NeurIPS and MICRO. I like the parts of the stack where the abstraction starts to leak.

AI Accelerator Design · GNN Acceleration · Sparse Kernels · HW/SW Co-Design · CUDA / GPU Systems · LLM × Graph
amir@ucf:~ — zsh
$ whoami
amir.ghazizadeh

$ cat ~/about.json
{
  "role": "Senior Ph.D. Student, CompE",
  "school": "UCF",
  "focus": [
    "AI accelerators",
    "sparse GPU kernels",
    "HW/SW co-design"
  ],
  "venues": ["NeurIPS", "MICRO", "ISCA", "DAC", "ICCAD", "ICCD"],
  "reviewer": ["NeurIPS", "ICML", "MLSys", "IEEE TC"],
  "coffee": "strong"
}

$ echo $STATUS
// open to research collaborations & FT roles, 2026

$
// 02 · selected work

#Projects & academic experiments

A mix of accelerator-design research, GPU-systems work, and embedded builds. Drop me a line if any of them sparks something.

[01] research
paper · NeurIPS '25 · GAMMA

GNN Architecture Design for Heterophilic Graphs

Gated multi-hop message passing rooted in an information-theoretic objective that maximizes relevant signal across hops. SOTA accuracy on 16 heterophilic benchmarks, 12× lower GPU memory, and 20× inference speedup via weight sharing and fixed-dim design.

PyTorch · PyG · Heterophily
[02] research
paper · MICRO '25

HW/SW Co-Design for Sparse Matrix Multiplication

Reformulated SpMM as a graph-transformation problem: non-contiguous tiling, custom PE µ-architecture, and a Bidirectional Fiber Tree compression format. Cycle-accurate simulation (Ramulator + HBM): 3.2× lower off-chip traffic, 4.8× better energy efficiency, 4.3× average speedup over Sextans.

SpMM · Cycle-Accurate · Ramulator / HBM
[03] research
M.Sc. · ICCKE '24

Efficient Deep Learning Model Quantization

Adaptive post-training quantization with mixed-precision allocation for Capsule Networks, driven by layer-wise vulnerability analysis. Power-of-Two scalers replace floating-point ops with hardware bitshifts — 8.67× weight compression, 4.56× activation reduction, 4× speedup, within 1% of full-precision accuracy.

PyTorch · Mixed Precision · CapsNet
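
The power-of-two scaling idea in this project can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the ICCKE '24 work: the helper names (`pow2_shift`, `quantize`, `rescale`), the 8-bit width, and the sample values are all hypothetical. It shows why restricting scale factors to 2^-s lets an integer accumulator be rescaled with a bit shift instead of a floating-point multiply.

```python
import math


def pow2_shift(scale: float) -> int:
    """Return s such that 2**-s is the power of two nearest to `scale` (in the log domain)."""
    return -round(math.log2(scale))


def quantize(values, shift: int, bits: int = 8):
    """Quantize floats to signed `bits`-wide integers using the scale 2**-shift."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    return [max(qmin, min(qmax, round(v * 2.0 ** shift))) for v in values]


def rescale(acc: int, shift_in: int, shift_out: int) -> int:
    """Move an integer accumulator from scale 2**-shift_in to 2**-shift_out with a shift only."""
    delta = shift_in - shift_out
    return acc >> delta if delta >= 0 else acc << -delta


# Hypothetical weights and activations, chosen to be exactly representable.
weights = [0.5, -0.25, 0.75]
activations = [1.0, 2.0, -1.5]

w_shift = pow2_shift(1 / 128)  # weight scale 2**-7
x_shift = 5                    # activation scale 2**-5 (assumed)

q_w = quantize(weights, w_shift)
q_x = quantize(activations, x_shift)

# Integer dot product; the accumulator carries scale 2**-(w_shift + x_shift).
acc = sum(w * x for w, x in zip(q_w, q_x))

# Requantize to an output scale of 2**-4 using only a right shift.
out_shift = 4
q_out = rescale(acc, w_shift + x_shift, out_shift)

print(q_out / 2 ** out_shift)  # -1.125, matching the floating-point dot product
```

Because every scale is a power of two, the requantization step between layers reduces to a shift by the difference of the exponents, which is the property that makes the scheme attractive for hardware.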
[04] research
M.Sc. work

Reliability of Capsule Networks under SEU

Fault-injection study tracing how Single-Event Upsets propagate through dynamic-routing layers, with mitigation strategies for soft-error tolerance — feeding directly into the layer-wise vulnerability ranking used in the quantization work.

CapsNet · SEU / Fault-Injection · Reliability
[05] embedded
undergrad · embedded

Soccer Robot Development

Embedded firmware in C/C++ for ARM-based microcontrollers, paired with OpenCV-based ball-detection pipelines on the host side. End-to-end build of the perception, control, and wireless command loop.

C/C++ · ARM MCU · OpenCV
[06] arch
course · arch

MIPS Processor in Proteus

A complete MIPS ISA datapath built from gate-level primitives — fetch, decode, ALU, hazard handling, and memory — simulated and verified inside Proteus. Same material I later TA'd for UCF's Computer Organization course.

MIPS ISA · Proteus · Datapath
[07] parallel
course · parallel

Parallel Algorithms — Multicore & GPU

Course project benchmarking parallel primitives across CPU thread pools and CUDA kernels — communication, contention, warp coordination, and the gap between Amdahl and reality.

CUDA · OpenMP · Benchmarks
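
For reference, the Amdahl bound mentioned in this project can be computed directly. This is a minimal sketch: the function name and the sample parallel fraction are illustrative assumptions, not measurements from the course project. Measured speedups usually land below this bound once communication and contention are counted.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the runtime scales across `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical workload: 90% of the runtime is parallelizable.
for n in (1, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With a 10% serial portion the bound never exceeds 10×, however many workers are added.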
// 03 · peer-reviewed

#Selected publications

A condensed log of papers — full list and citations on Google Scholar.

~/publications — cat *.bib
ISCA '26

TensorPrism: Rethinking Sparse High-order Tensor Acceleration via Co-occurrence Graph

F. Ye, S. Tian, A. Ghazizadeh, F. Yao, H. Zheng
NeurIPS '25

GAMMA: Gated Multi-hop Message Passing for Homophily-Agnostic Node Representation in GNNs

A. Ghazizadeh, R. Ewetz, H. Zheng
MICRO '25

Rethinking Tiling and Dataflow for SpMM Acceleration: A Graph Transformation Framework

A. Ghazizadeh, L. Lin, S. Tian, F. Ye, H. Zheng
ICCKE '24

Towards Efficient Capsule Networks Through Approximate Squash Function and Layer-Wise Quantization

M. Raji, K. Soroush, A. Ghazizadeh
DAC '24

EGMA: Enhancing Data Reuse and Workload Balancing in Message Passing GNN Acceleration via Gram Matrix Optimization

F. Ye, L. Yin, A. Ghazizadeh, H. Zheng
ICCD '23

Polyform: A Versatile Architecture for Multi-DNN Execution via Spatial and Temporal Acceleration

L. Yin, A. Ghazizadeh, S. Tian, H. Zheng
ICCAD '23

Aries: Accelerating Distributed Training in Chiplet-Based Systems via Flexible Interconnects

L. Yin, A. Ghazizadeh, A. Louri, H. Zheng
// 04 · the long log

#Experience & education

A reverse-chronological dump of where I've been and what I worked on.

> Experience

'25

GNN Architecture for Heterophilic Graphs · UCF · RA

Information-theoretic gated message passing — SOTA accuracy on 16 heterophilic benchmarks, 12× lower GPU memory, 20× inference speedup. NeurIPS '25 (GAMMA).

'24

HW/SW Co-Design for SpMM · UCF · RA

Graph-transformation framework, custom PE µ-arch, and a Bidirectional Fiber Tree compression format. 3.2× lower off-chip traffic, 4.8× better energy efficiency, 4.3× average speedup over Sextans. MICRO '25.

'24

Teaching Assistant · UCF · Computer Org.

Delivered lectures on the MIPS ISA and processor hardware design — datapath, control, hazards — with concept-driven exercises to build student debugging intuition.

'21–'22

MVL Arbiter PUF for HW Auth · KGUT · Visiting Researcher

Designed Arbiter PUFs in 32 nm and 14 nm CNTFET libraries with multi-valued logic, simulated in HSPICE. 99.56% (ternary) and 99.15% (quaternary) reliability across temperature and supply variations, deviation under 2%.

'21–'22

Efficient DL Model Quantization · Shiraz Univ. · RA

Adaptive post-training quantization with layer-wise vulnerability analysis and Power-of-Two scalers (bitshift-friendly). 8.67× weight compression, 4.56× activation reduction, accuracy within 1% of FP baseline.

'14–'20

Soccer Robot · Shahid Bahonar Univ.

Embedded software in C/C++ for ARM microcontrollers; OpenCV-based ball-detection pipelines on the host side. End-to-end perception, control, and wireless command loop.

> Education

'22→

Ph.D. Computer Engineering

University of Central Florida — Orlando, FL. Research: efficient AI systems, hardware accelerators, and GPU kernels for sparse and irregular workloads.

'20–'22

M.Sc. Computer Engineering

Shiraz University, Shiraz, Iran. Ranked 1st in M.Sc. studies. Thesis on capsule-network quantization and reliability under single-event upsets.

'14–'20

B.Sc. Computer Engineering

Shahid Bahonar University — Kerman, Iran. Foundation in digital design, architecture, and embedded systems.

> Service

Reviewer — NeurIPS, ICML, MLSys, IEEE TC

// 05 · the toolbox

#Technical skills

Things I reach for daily.

Programming

Python · C / C++ · CUDA · JavaScript

ML & Systems

PyTorch · PyG · DGL · TensorFlow · TF Lite · NumPy · SciPy · scikit-learn

Architecture & HW

Cycle-Accurate Sim · Ramulator / HBM · RTL / Datapath · HSPICE · CNTFET libs · MIPS ISA
// 06 · let's chat

#Get in touch

Happy to talk research, AI accelerators, GPU systems — or just nerdy side projects.

// how can I help?