ROCmPort AI

Agentic CUDA to ROCm/HIP migration with wavefront-64 bug detection

Backend API: rocmport-ai-q2b1.onrender.com | GitHub: tazwaryayyyy/ROCmPort-AI

hipify-clang translates CUDA API calls mechanically; it does not flag hard-coded warp-size assumptions. A guard such as if (tid < 32) in a warp-level reduction assumes 32-lane warps, so on an AMD wavefront-64 device it silently skips lanes 32-63. The code compiles and runs without errors, but the output is wrong. ROCmPort AI catches this before execution.
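As a rough illustration of what a scan for hard-coded warp-size assumptions could look like, the C++ sketch below flags a few textual patterns that tend to assume 32 lanes. It is a heuristic written for this page, not ROCmPort AI's Analyzer: the Finding struct, the find_warp32_assumptions name, and the three regex rules are all hypothetical, and a plain regex pass will produce false positives (for example on a shift such as x << 32).

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one suspicious spot in a CUDA source string.
struct Finding {
    std::string rule;   // short label for the pattern that fired
    std::string match;  // the exact text that matched
};

// Heuristic scan for idioms that commonly assume a 32-lane warp.
// Assumption: these three patterns are illustrative, not exhaustive.
std::vector<Finding> find_warp32_assumptions(const std::string& source) {
    static const std::vector<std::pair<std::string, std::regex>> rules = {
        {"comparison against literal 32", std::regex(R"(<\s*32\b)")},
        {"lane index computed with & 31", std::regex(R"(&\s*31\b)")},
        // Exactly eight hex F digits, so a 64-bit all-ones mask is not flagged.
        {"32-bit full lane mask", std::regex(R"(0[xX][fF]{8}(?![0-9a-fA-F]))")},
    };

    std::vector<Finding> findings;
    for (const auto& rule : rules) {
        const std::sregex_iterator end;
        for (std::sregex_iterator it(source.begin(), source.end(), rule.second);
             it != end; ++it) {
            findings.push_back({rule.first, it->str()});
        }
    }
    return findings;
}
```

A guard written against warpSize instead of the literal 32 produces no findings under these rules, which is the behaviour one would want from such a check.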

How the pipeline works

Agent        Role
Analyzer     Scans CUDA for AMD-specific risks: wavefront size, ballot/shuffle idioms, shared-memory layout
Translator   Runs hipify then applies LLM-guided fixes for bugs hipify cannot detect
Tester       Verifies compilation with hipcc and checks output correctness
Optimizer    Proposes MI300X-specific optimisations; re-tested against baseline
Coordinator  Orchestrates the loop; retries up to 3x if the optimised output regresses
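The Coordinator's retry rule can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the project's implementation: TestResult, PortOutcome, and coordinate are hypothetical names, the Tester and Optimizer are passed in as plain callbacks, "regresses" is taken to mean "incorrect or not faster than the baseline", and "up to 3x" is read as at most three optimiser attempts in total.

```cpp
#include <functional>
#include <string>

// Hypothetical summary of one Tester run.
struct TestResult {
    bool correct;       // compiled and matched the reference output
    double runtime_ms;  // measured kernel time
};

// Hypothetical outcome reported by the Coordinator.
struct PortOutcome {
    std::string kernel;  // source that was finally accepted
    bool optimised;      // true if an optimised variant replaced the baseline
    int attempts;        // optimiser attempts consumed
};

// Keep the translated kernel unless an optimised candidate is both correct
// and strictly faster; give up after max_attempts candidates.
PortOutcome coordinate(
    const std::string& translated,
    const std::function<TestResult(const std::string&)>& test,
    const std::function<std::string(const std::string&, int)>& optimise,
    int max_attempts = 3) {
    const TestResult baseline = test(translated);
    if (!baseline.correct) {
        return {translated, false, 0};  // nothing sound to optimise yet
    }
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        const std::string candidate = optimise(translated, attempt);
        const TestResult result = test(candidate);
        if (result.correct && result.runtime_ms < baseline.runtime_ms) {
            return {candidate, true, attempt};
        }
    }
    return {translated, false, max_attempts};  // fall back to the baseline
}
```

Falling back to the translated kernel when every attempt regresses means the pipeline never returns something slower or less correct than the baseline HIP port.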

The key bug: warp-size assumption

// NVIDIA (warpSize = 32) -- silently WRONG on AMD
if (tid < 32) { vsmem[tid] += vsmem[tid + 32]; ... }

// AMD-correct (wavefront = 64)
if (tid < 64) {
    vsmem[tid] += vsmem[tid + 32];
    if (tid < 32) { vsmem[tid] += vsmem[tid + 16]; ... }
}
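The same assumption appears in shuffle-based reductions, where the loop starts at a hard-coded offset of 16. The host-side C++ model below (no GPU required) shows why lanes 32-63 drop out when such code runs on a 64-lane wavefront. It is a simplified simulation written for illustration: simulate_wave_reduce is a hypothetical name, the vector stands in for per-lane registers, and the lockstep shuffle is modelled by reading the previous round's values before writing.

```cpp
#include <cstddef>
#include <vector>

// Model of a shuffle-down tree reduction across one warp/wavefront.
// lanes.size() plays the role of the hardware width (32 or 64);
// first_offset is the offset the kernel's loop starts from.
// Returns the value that ends up in lane 0.
int simulate_wave_reduce(std::vector<int> lanes, std::size_t first_offset) {
    const std::size_t width = lanes.size();
    for (std::size_t offset = first_offset; offset > 0; offset /= 2) {
        // Every lane reads the previous round's values before any lane writes.
        std::vector<int> next = lanes;
        for (std::size_t lane = 0; lane + offset < width; ++lane) {
            next[lane] = lanes[lane] + lanes[lane + offset];
        }
        // Lanes whose source would be out of range are left unchanged here;
        // they never feed into lane 0, so this does not affect the result.
        lanes = next;
    }
    return lanes[0];
}
```

With every lane holding 1, a 32-lane warp starting at offset 16 yields 32, as intended. A 64-lane wavefront starting at the same offset also yields 32, so half the inputs are lost, while starting at half the width (offset 32) yields the correct 64. Deriving the first offset from the device's warp size rather than a literal avoids the mismatch.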

Benchmark highlights (MI300X, ROCm 7.0)

Kernel                    Result
matrix_multiply 512x512   2.91x speedup over baseline HIP
vector_add 32M elements   3918 GB/s (74% of MI300X peak)
reduction 16M elements    correctness PASS after wavefront-64 fix

Source: docs/benchmark_runs/ -- real rocprof CSV output, May 2026. Results vary with kernel complexity; these figures are not guaranteed on every input.
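For readers who want to check the bandwidth figure, the arithmetic can be sketched as follows. The helper names are hypothetical, and two inputs are assumptions not stated on this page: that vector_add moves two reads and one write per element, and that the percentage is taken against AMD's published MI300X peak of 5.3 TB/s (5300 GB/s).

```cpp
// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel that moved
// bytes_moved bytes in the given number of seconds.
double effective_bandwidth_gbs(double bytes_moved, double seconds) {
    return bytes_moved / seconds / 1e9;
}

// Bytes touched by c[i] = a[i] + b[i]: two reads and one write per element.
// Assumption: no cache reuse and no extra traffic.
double vector_add_bytes(double elements, double bytes_per_element) {
    return 3.0 * elements * bytes_per_element;
}

// Achieved bandwidth as a percentage of a peak figure.
double percent_of_peak(double achieved_gbs, double peak_gbs) {
    return 100.0 * achieved_gbs / peak_gbs;
}
```

Under those assumptions, percent_of_peak(3918.0, 5300.0) comes to roughly 73.9, which rounds to the 74% quoted in the table.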