prefixsum

Deadline

41 days 17 hours (2025-06-30 00:00 UTC)

Language

Python

GPU Types

A100, H100, L4, T4

Description

Implement an inclusive prefix sum (scan) kernel that matches the reference implementation. The kernel should compute the cumulative sum of all elements up to each position. Because of numerical instability, the tolerance is scaled by the square root of the input size. Input: - `data`: A 1D tensor of size `n` Output: - `output`: A 1D tensor of size `n`

Reference Implementation

from utils import match_reference
import torch
from task import input_t, output_t


def ref_kernel(data: input_t) -> output_t:
    """
    Reference implementation of inclusive prefix sum using PyTorch.
    Args:
        data: Input tensor to compute prefix sum on
    Returns:
        Tensor containing the inclusive prefix sum
    """
    return torch.cumsum(data.to(torch.float64), dim=0).to(torch.float64)


def generate_input(size: int, seed: int) -> input_t:
    """
    Generates random input tensor.
    Returns:
        Tensor to compute prefix sum on
    """
    gen = torch.Generator(device='cuda')
    gen.manual_seed(seed)
    return torch.randn(size, device='cuda', dtype=torch.float32, generator=gen).contiguous()


# This algorithm is very sensitive to the tolerance and the error is magnified by the input size
# The tolerance is scaled by the square root of the input size
def check_implementation(data: input_t, output: output_t) -> str:
    # Then get the size for scaling the tolerance
    n = data.numel()
    
    scale_factor = n ** 0.5  # Square root of input size
    rtol = 1e-5 * scale_factor
    atol = 1e-5 * scale_factor

    return match_reference(data, output, reference=ref_kernel, rtol=rtol, atol=atol)

Rankings

L4

Karang 🥇 8876.066μs scan_cuda_v2.py
az 🥈 9219.998μs   +343.932μs submission.py
Snektron 🥉 9482.677μs   +262.679μs aaa.py
ajhinh 17858.542μs   +8375.865μs l4.py

T4

Karang 🥇 9106.582μs scan_cuda_v3.py
az 🥈 9289.434μs   +182.851μs submission.py
Snektron 🥉 9307.217μs   +17.784μs aaa.py
ajhinh 17643.982μs   +8336.764μs t4.py

A100

Nader 🥇 1356.021μs submission.py
Snektron 🥈 1406.648μs   +50.627μs aaa.py
tomaszki 🥉 1461.332μs   +54.683μs prefixsum.py
Karang 1488.339μs   +27.008μs scan_cuda_v5.py
az 1762.813μs   +274.474μs submission.py
Shihab 2403.750μs   +640.937μs decoupled_lookback.py
ajhinh 3452.166μs   +1048.416μs a100.py

H100

Nader 🥇 889.569μs submission.py
Snektron 🥈 902.879μs   +13.310μs aaa.py
siro 🥉 958.149μs   +55.269μs submission_1.py
Karang 1025.282μs   +67.133μs scan_cuda_v4.py
az 1033.482μs   +8.200μs submission.py
Smexy 1034.798μs   +1.317μs submission.py
Shihab 1836.295μs   +801.497μs decoupled_lookback.py
ajhinh 2131.119μs   +294.824μs h100.py