prefixsum

Deadline

ended (2025-06-30 00:00 UTC)

Language

Python

GPU Types

A100, H100, L4, T4

Description

Implement an inclusive prefix sum (scan) kernel that matches the reference implementation. The kernel should compute the cumulative sum of all elements up to each position. Because of numerical instability, the tolerance is scaled by the square root of the input size. Input: - `data`: A 1D tensor of size `n` Output: - `output`: A 1D tensor of size `n`

Reference Implementation

from utils import match_reference
import torch
from task import input_t, output_t


def ref_kernel(data: input_t) -> output_t:
    """
    Reference implementation of inclusive prefix sum using PyTorch.
    Args:
        data: Input tensor to compute prefix sum on
    Returns:
        Tensor containing the inclusive prefix sum
    """
    return torch.cumsum(data.to(torch.float64), dim=0).to(torch.float64)


def generate_input(size: int, seed: int) -> input_t:
    """
    Generates random input tensor.
    Returns:
        Tensor to compute prefix sum on
    """
    gen = torch.Generator(device='cuda')
    gen.manual_seed(seed)
    return torch.randn(size, device='cuda', dtype=torch.float32, generator=gen).contiguous()


# This algorithm is very sensitive to the tolerance and the error is magnified by the input size
# The tolerance is scaled by the square root of the input size
def check_implementation(data: input_t, output: output_t) -> str:
    # Then get the size for scaling the tolerance
    n = data.numel()
    
    scale_factor = n ** 0.5  # Square root of input size
    rtol = 1e-5 * scale_factor
    atol = 1e-5 * scale_factor

    return match_reference(data, output, reference=ref_kernel, rtol=rtol, atol=atol)

Rankings

L4

Nader 🥇	8525.036μs	submission.py
Karang 🥈	8876.066μs +351.030μs	scan_cuda_v2.py
az 🥉	9219.998μs +343.932μs	submission.py
Snektron	9482.677μs +262.679μs	aaa.py
ajhinh	17858.542μs +8375.865μs	l4.py

T4

Nader 🥇	8937.644μs	submission.py
Karang 🥈	9106.582μs +168.938μs	scan_cuda_v3.py
az 🥉	9289.434μs +182.851μs	submission.py
Snektron	9307.217μs +17.784μs	aaa.py
ajhinh	17643.982μs +8336.764μs	t4.py

A100

Nader 🥇	1356.021μs	submission.py
Snektron 🥈	1406.648μs +50.627μs	aaa.py
tomaszki 🥉	1461.332μs +54.683μs	prefixsum.py
Karang	1488.339μs +27.008μs	scan_cuda_v5.py
az	1762.813μs +274.474μs	submission.py
Shihab	2403.750μs +640.937μs	decoupled_lookback.py
ajhinh	3452.166μs +1048.416μs	a100.py

H100

Nader 🥇	889.569μs	submission.py
Snektron 🥈	902.879μs +13.310μs	aaa.py
siro 🥉	958.149μs +55.269μs	submission_1.py
Karang	1025.282μs +67.133μs	scan_cuda_v4.py
az	1033.482μs +8.200μs	submission.py
Smexy	1034.798μs +1.317μs	submission.py
yechenzhi	1037.441μs +2.643μs	ref.py
Seraphim	1044.386μs +6.945μs	submission.py
Shihab	1836.295μs +791.909μs	decoupled_lookback.py
ajhinh	2131.119μs +294.824μs	h100.py