ROCmPort AI
Agentic CUDA to ROCm/HIP migration with wavefront-64 bug detection
Backend API: rocmport-ai-q2b1.onrender.com | GitHub: tazwaryayyyy/ROCmPort-AI
hipify-clang translates CUDA API calls mechanically, so it cannot detect that an if (tid < 32) guard in a
warp-level reduction silently skips lanes 32-63 on AMD wavefront-64 hardware.
The resulting code compiles and runs without errors, but its output is wrong. ROCmPort AI catches this before execution.
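To make the kind of check described above concrete, here is a minimal sketch of a pattern scan for hard-coded 32-lane guards. It is an illustration only, not ROCmPort AI's actual Analyzer: the helper name find_warp32_guards and the choice of identifiers it looks for (tid, lane, threadIdx.x) are assumptions, and a real analysis would need far more than a regular expression.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of ROCmPort AI): returns the 1-based line
// numbers on which a thread index is compared against the literal 32.
std::vector<int> find_warp32_guards(const std::string& source) {
    // Matches e.g. "tid < 32" or "threadIdx.x<32", but not "tid < 320"
    // or "tid < warpSize".
    static const std::regex guard(R"(\b(tid|lane|threadIdx\.x)\s*<\s*32\b)");

    std::vector<int> hits;
    std::istringstream in(source);
    std::string line;
    int line_no = 0;
    while (std::getline(in, line)) {
        ++line_no;
        if (std::regex_search(line, guard)) {
            hits.push_back(line_no);
        }
    }
    return hits;
}
```

Passing a kernel source string to find_warp32_guards yields the lines worth a closer look. A hit is only a candidate: whether the guard is actually wrong on wavefront-64 depends on the surrounding reduction logic.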
How the pipeline works
| Agent | Role |
|---|---|
| Analyzer | Scans CUDA for AMD-specific risks: wavefront size, ballot/shuffle idioms, shared-memory layout |
| Translator | Runs hipify then applies LLM-guided fixes for bugs hipify cannot detect |
| Tester | Verifies compilation with hipcc and checks output correctness |
| Optimizer | Proposes MI300X-specific optimisations; re-tested against baseline |
| Coordinator | Orchestrates the loop; retries up to 3x if the optimised output regresses |
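The Coordinator's retry behaviour in the table can be sketched roughly as follows. This is a hedged illustration, not the project's implementation: the types Candidate and LoopOutcome, the function coordinate, and the callback propose_and_test are all hypothetical, and reading "retries up to 3x" as at most three optimisation attempts is an assumption.

```cpp
#include <functional>

// Hypothetical result of one Optimizer proposal after the Tester has run it.
struct Candidate {
    bool correct;       // output matched the reference
    double runtime_ms;  // measured runtime of the optimised kernel
};

// Hypothetical summary of the whole retry loop.
struct LoopOutcome {
    bool accepted;   // true if some candidate beat the baseline
    int attempts;    // number of proposals that were tried
    double best_ms;  // runtime that is kept (the baseline if nothing was accepted)
};

// Asks for optimised candidates until one is both correct and faster than the
// baseline, or until max_attempts proposals have been rejected.
LoopOutcome coordinate(double baseline_ms,
                       const std::function<Candidate(int)>& propose_and_test,
                       int max_attempts = 3) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        Candidate c = propose_and_test(attempt);
        if (c.correct && c.runtime_ms < baseline_ms) {
            return {true, attempt, c.runtime_ms};
        }
        // Regression or wrong output: discard the candidate and try again.
    }
    // Every attempt regressed, so the baseline translation is kept.
    return {false, max_attempts, baseline_ms};
}
```

The design point is that the baseline HIP translation is never replaced by a slower or incorrect candidate; the loop either improves on it or falls back to it.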
The key bug: warp-size assumption
// NVIDIA (warpSize = 32) -- silently WRONG on AMD
if (tid < 32) { vsmem[tid] += vsmem[tid + 32]; ... }
// AMD-correct (wavefront = 64)
if (tid < 64) {
    vsmem[tid] += vsmem[tid + 32];
    if (tid < 32) { vsmem[tid] += vsmem[tid + 16]; ... }
}
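The same width assumption appears in the shuffle idioms the Analyzer looks for. The sketch below is a CPU-only simulation, not code from ROCmPort AI and not the shared-memory form shown above: it models a shuffle-down tree reduction over one wavefront and returns what lane 0 holds at the end. The function name simulate_wave_reduce is hypothetical, and the out-of-range rule (a lane reads back its own value) is the behaviour modelled here for illustration.

```cpp
#include <vector>

// Simulates a shuffle-down tree reduction across the lanes of one wavefront.
// `lanes` holds each lane's starting value; `first_offset` is the first
// shuffle distance (16 when the code assumes a 32-wide warp).
// Returns the value lane 0 holds once the offset reaches zero.
int simulate_wave_reduce(const std::vector<int>& lanes, int first_offset) {
    std::vector<int> v = lanes;
    const int width = static_cast<int>(v.size());
    for (int offset = first_offset; offset > 0; offset /= 2) {
        std::vector<int> next(v.size());
        for (int lane = 0; lane < width; ++lane) {
            const int src = lane + offset;
            // Modelled rule: an out-of-range source returns the lane's own value.
            const int fetched = (src < width) ? v[src] : v[lane];
            next[lane] = v[lane] + fetched;
        }
        v = next;  // all lanes advance together, as in lockstep execution
    }
    return v[0];
}
```

With 64 lanes that each start at 1, a first offset of 16 leaves lane 0 holding 32, so the contributions of lanes 32-63 never arrive. A first offset of 32, half the wavefront width, gives the full 64.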
Benchmark highlights (MI300X, ROCm 7.0)
| Kernel | Result |
|---|---|
| matrix_multiply 512x512 | 2.91x speedup over baseline HIP |
| vector_add 32M elements | 3918 GB/s (74% of MI300X peak) |
| reduction 16M elements | correctness PASS after wavefront-64 fix |
Source: docs/benchmark_runs/ -- real rocprof CSV output, May 2026. Results vary with kernel complexity; these figures are not guaranteed on every input.