vectorsum
Deadline
41 days 17 hours (2025-06-30 00:00 UTC)
Language
Python
GPU Types
A100, H100, L4, T4
Description
Implement a vector sum reduction kernel that computes the sum of all elements in the input tensor.
Input: A float32 tensor of shape `(N,)`. Values are drawn from a standard normal distribution (mean 0, variance 1) and then scaled and offset by random constants (see `generate_input` in the reference implementation).
Output: A scalar value equal to the sum of all elements in the input tensor.
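As a rough illustration of the shape a reduction like this often takes, the sketch below splits the input into fixed-size blocks, reduces each block to a partial sum, and then reduces the partial sums. It is plain Python standing in for a GPU kernel, not a submission: `block_sum`, `two_stage_sum`, and the block size of 1024 are hypothetical names and values chosen for illustration and are not part of the task's API.

```python
import math
import random


def block_sum(values, start, stop):
    """Sum values[start:stop]; stands in for one block's partial reduction."""
    total = 0.0
    for i in range(start, stop):
        total += values[i]
    return total


def two_stage_sum(values, block_size=1024):
    """Stage 1: one partial sum per block. Stage 2: sum the partials."""
    partials = [
        block_sum(values, start, min(start + block_size, len(values)))
        for start in range(0, len(values), block_size)
    ]
    return block_sum(partials, 0, len(partials))


# Usage: compare against an exactly rounded sum on standard normal data.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
print(two_stage_sum(data), math.fsum(data))
```

On a GPU the first stage would run in parallel across blocks, which is the reason for splitting the work this way; the Python loop only shows the data flow.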
Reference Implementation
from utils import make_match_reference
import torch
from task import input_t, output_t


def ref_kernel(data: input_t) -> output_t:
    """
    Reference implementation of vector sum reduction using PyTorch.
    Args:
        data: Input tensor to be reduced
    Returns:
        Tensor containing the sum of all elements
    """
    # Let's be on the safe side here, and do the reduction in 64 bit
    return data.to(torch.float64).sum().to(torch.float32)


def generate_input(size: int, seed: int) -> input_t:
    """
    Generates random input tensor of specified shape with random offset and scale.
    The data is first generated as standard normal, then scaled and offset
    to prevent trivial solutions.
    Returns:
        Tensor to be reduced
    """
    gen = torch.Generator(device='cuda')
    gen.manual_seed(seed)
    # Generate base random data
    data = torch.randn(size, device='cuda', dtype=torch.float32, generator=gen).contiguous()
    # Generate random offset and scale (using different seeds to avoid correlation)
    offset_gen = torch.Generator(device='cuda')
    offset_gen.manual_seed(seed + 1)
    scale_gen = torch.Generator(device='cuda')
    scale_gen.manual_seed(seed + 2)
    # Generate random offset between -100 and 100
    offset = (torch.rand(1, device='cuda', generator=offset_gen) * 200 - 100).item()
    # Generate random scale between 0.1 and 10
    scale = (torch.rand(1, device='cuda', generator=scale_gen) * 9.9 + 0.1).item()
    # Apply scale and offset
    return (data * scale + offset).contiguous()


check_implementation = make_match_reference(ref_kernel)
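The reference upcasts to float64 before summing. Because `generate_input` shifts the data by an offset of up to ±100, the running sum can become large relative to the individual elements, and a float32 accumulator then loses low-order bits on each addition. The sketch below emulates that effect in plain Python. `to_f32`, `sum_f32`, and `sum_f64_then_f32` are hypothetical helpers, the `struct` round-trip is only an emulation of float32 rounding (not how PyTorch does it), and the scale and offset values are assumptions picked from inside the ranges used above.

```python
import math
import random
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Sequential sum with the accumulator rounded to float32 after every add."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return acc


def sum_f64_then_f32(values):
    """Accumulate in float64 and round to float32 once, as the reference does."""
    return to_f32(sum(values))


# Small deterministic case: above 2**24, float32 can no longer represent odd integers.
big = [16777216.0, 1.0, 1.0]
print(sum_f32(big))           # 16777216.0 (both ones are rounded away)
print(sum_f64_then_f32(big))  # 16777218.0

# Offset data similar in spirit to generate_input (scale and offset are assumed values).
rng = random.Random(0)
scale, offset = 2.0, 50.0
data = [to_f32(rng.gauss(0.0, 1.0) * scale + offset) for _ in range(100_000)]
exact = math.fsum(data)
print("float32 accumulator error:", sum_f32(data) - exact)
print("float64 accumulator error:", sum_f64_then_f32(data) - exact)
```

A submission does not necessarily need float64 to stay close to the reference; summing per-block partials or using a compensated sum are other ways to limit the error, at the cost of extra work.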
Rankings
Each row lists user, time, and file. The value in parentheses is the gap to the entry ranked immediately above.
L4
tomaszki 🥇 | 665.082μs | vectorsum.py
FourCore 🥈 | 941.813μs (+276.731μs) | submission.py
Snektron 🥉 | 957.137μs (+15.324μs) | a.py
T4
tomaszki 🥇 | 258.937μs | vectorsum.py
Karang 🥈 | 793.310μs (+534.373μs) | vecsum_cuda.py
ajhinh 🥉 | 810.232μs (+16.922μs) | t4.py
A100
tomaszki 🥇 | 99.117μs | vectorsum.py
Snektron 🥈 | 154.198μs (+55.081μs) | a.py
FourCore 🥉 | 158.628μs (+4.431μs) | vectorsum.py
H100
tomaszki 🥇 | 73.976μs | vectorsum.py
FourCore 🥈 | 93.859μs (+19.883μs) | vectorsum.py
Snektron 🥉 | 97.200μs (+3.340μs) | a.py