prefixsum
Deadline
41 days 17 hours (2025-06-30 00:00 UTC)
Language
Python
GPU Types
A100, H100, L4, T4
Description
Implement an inclusive prefix sum (scan) kernel that matches the reference implementation. The kernel should compute the cumulative sum of all elements up to each position. Because of numerical instability, the tolerance is scaled by the square root of the input size. Input: - `data`: A 1D tensor of size `n` Output: - `output`: A 1D tensor of size `n`
Reference Implementation
from utils import match_reference
import torch
from task import input_t, output_t
def ref_kernel(data: input_t) -> output_t:
"""
Reference implementation of inclusive prefix sum using PyTorch.
Args:
data: Input tensor to compute prefix sum on
Returns:
Tensor containing the inclusive prefix sum
"""
return torch.cumsum(data.to(torch.float64), dim=0).to(torch.float64)
def generate_input(size: int, seed: int) -> input_t:
"""
Generates random input tensor.
Returns:
Tensor to compute prefix sum on
"""
gen = torch.Generator(device='cuda')
gen.manual_seed(seed)
return torch.randn(size, device='cuda', dtype=torch.float32, generator=gen).contiguous()
# This algorithm is very sensitive to the tolerance and the error is magnified by the input size
# The tolerance is scaled by the square root of the input size
def check_implementation(data: input_t, output: output_t) -> str:
# Then get the size for scaling the tolerance
n = data.numel()
scale_factor = n ** 0.5 # Square root of input size
rtol = 1e-5 * scale_factor
atol = 1e-5 * scale_factor
return match_reference(data, output, reference=ref_kernel, rtol=rtol, atol=atol)
Rankings
L4
Karang 🥇 | 8876.066μs | scan_cuda_v2.py |
az 🥈 | 9219.998μs +343.932μs | submission.py |
Snektron 🥉 | 9482.677μs +262.679μs | aaa.py |
T4
Karang 🥇 | 9106.582μs | scan_cuda_v3.py |
az 🥈 | 9289.434μs +182.851μs | submission.py |
Snektron 🥉 | 9307.217μs +17.784μs | aaa.py |
A100
Nader 🥇 | 1356.021μs | submission.py |
Snektron 🥈 | 1406.648μs +50.627μs | aaa.py |
tomaszki 🥉 | 1461.332μs +54.683μs | prefixsum.py |
H100
Nader 🥇 | 889.569μs | submission.py |
Snektron 🥈 | 902.879μs +13.310μs | aaa.py |
siro 🥉 | 958.149μs +55.269μs | submission_1.py |