ROCmPort AI -- CUDA to ROCm Migration

How the pipeline works

Agent	Role
Analyzer	Scans CUDA for AMD-specific risks: wavefront size, ballot/shuffle idioms, shared-memory layout
Translator	Runs `hipify` then applies LLM-guided fixes for bugs `hipify` cannot detect
Tester	Verifies compilation with `hipcc` and checks output correctness
Optimizer	Proposes MI300X-specific optimisations, each re-tested against the baseline
Coordinator	Orchestrates the loop; retries up to 3x if the optimised output regresses
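
The Coordinator's retry rule in the table above can be sketched as a small host-side loop. This is a minimal illustration, not the project's actual implementation: `TestResult`, `Optimizer`, `Tester`, and `coordinate` are hypothetical names, "regresses" is assumed to mean "fails to compile, produces wrong output, or is no faster than the baseline", and "up to 3x" is read as three optimisation attempts in total.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical result of one Tester run; field names are illustrative only.
struct TestResult {
    bool compiled;      // did the compiler accept the source?
    bool correct;       // did the output match the reference?
    double runtime_ms;  // measured kernel time
};

// Hypothetical agent interfaces, modelled as plain callables.
using Optimizer = std::function<std::string(const std::string& hip_source, int attempt)>;
using Tester = std::function<TestResult(const std::string& hip_source)>;

// Returns the first optimised source that does not regress against the
// baseline, or the baseline itself if every attempt regresses.
std::string coordinate(const std::string& baseline_src,
                       const Optimizer& optimize,
                       const Tester& test,
                       int max_attempts = 3) {
    assert(max_attempts >= 0);
    const TestResult baseline = test(baseline_src);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        const std::string candidate = optimize(baseline_src, attempt);
        const TestResult r = test(candidate);
        // Assumed definition of a regression (not stated in the source).
        const bool regressed =
            !r.compiled || !r.correct || r.runtime_ms >= baseline.runtime_ms;
        if (!regressed) {
            return candidate;
        }
    }
    return baseline_src;  // keep the known-good translation
}
```

Falling back to the baseline keeps the pipeline's output at least as good as the plain translation, which is one plausible reading of "re-tested against baseline".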

The key bug: warp-size assumption

// NVIDIA (warpSize = 32) -- silently WRONG on AMD
if (tid < 32) { vsmem[tid] += vsmem[tid + 32]; ... }

// AMD-correct (wavefront = 64)
if (tid < 64) {
    vsmem[tid] += vsmem[tid + 32];
    if (tid < 32) { vsmem[tid] += vsmem[tid + 16]; ... }
}
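
The same assumption also shows up in shuffle-based reductions, one of the idioms the Analyzer scans for. The sketch below is a host-side C++ emulation, not device code and not the shared-memory snippet above: it mimics `val += __shfl_down(val, offset)` across one wavefront so the effect of a hard-coded first offset of 16 can be checked without a GPU. The function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Emulates the loop
//   for (offset = first_offset; offset > 0; offset /= 2)
//       val += __shfl_down(val, offset);
// over one wavefront. A lane whose source lane is out of range receives
// its own value back. Returns the value held by lane 0 at the end.
long long lane0_after_shuffle_reduce(std::vector<long long> lanes, int first_offset) {
    const int width = static_cast<int>(lanes.size());
    assert(first_offset > 0 && first_offset < width);
    for (int offset = first_offset; offset > 0; offset /= 2) {
        std::vector<long long> next(lanes);
        for (int lane = 0; lane < width; ++lane) {
            const int src = lane + offset;
            next[lane] = lanes[lane] + (src < width ? lanes[src] : lanes[lane]);
        }
        lanes = next;  // all lanes update together, as in lockstep execution
    }
    return lanes[0];
}

// Test helper: lane i holds the value i + 1.
std::vector<long long> iota_lanes(int width) {
    std::vector<long long> v(width);
    for (int i = 0; i < width; ++i) {
        v[i] = i + 1;
    }
    return v;
}
```

With 64 lanes holding 1..64, a first offset of 32 leaves the full sum in lane 0, while a first offset of 16 leaves only the sum of lanes 0..31. In HIP device code, deriving the first offset from the built-in `warpSize` rather than a literal avoids baking in either width.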

Benchmark highlights (MI300X, ROCm 7.0)

Kernel	Result
matrix_multiply 512x512	2.91x speedup over baseline HIP
vector_add 32M elements	3918 GB/s (74% of the MI300X's 5.3 TB/s peak memory bandwidth)
reduction 16M elements	correctness PASS after wavefront-64 fix

Source: docs/benchmark_runs/ -- real rocprof CSV output, May 2026. Results vary with kernel complexity; these figures are not guaranteed on every input.
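
For readers who want to reproduce the bandwidth percentage, the arithmetic can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, the peak of 5300 GB/s is AMD's published MI300X memory-bandwidth figure, and the traffic model (two reads plus one write per element) is the usual convention for vector addition rather than something taken from the rocprof output.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by c[i] = a[i] + b[i]: two loads and one store per element.
double vector_add_bytes(std::size_t n, std::size_t elem_size) {
    return 3.0 * static_cast<double>(n) * static_cast<double>(elem_size);
}

// Effective bandwidth in GB/s, with 1 GB = 1e9 bytes.
double effective_gbs(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}

// Achieved bandwidth as a percentage of the device peak.
double percent_of_peak(double achieved_gbs, double peak_gbs) {
    assert(peak_gbs > 0.0);
    return 100.0 * achieved_gbs / peak_gbs;
}
```

For example, `percent_of_peak(3918.0, 5300.0)` gives roughly 73.9, which rounds to the 74% reported in the table.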

ROCmPort AI

Agentic CUDA to ROCm/HIP migration with wavefront-64 bug detection
