histogram
Deadline
41 days 17 hours (2025-06-30 00:00 UTC)
Language
Python
GPU Types
A100, H100, L4, T4
Description
Implement a histogram kernel that counts the number of elements falling into each bin across the specified range. The minimum and maximum values of the range are fixed to 0 and 100, respectively. All sizes are multiples of 16, and the number of bins is set to the size of the input tensor divided by 16.
Input:
- data: a tensor of shape (size,)
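The counting step can be sketched in plain Python as a point of comparison. This is a minimal CPU sketch, not a GPU kernel. It assumes the byte-valued inputs (0 to 255) and the 256 output bins used by the reference implementation on this page, and `histogram_cpu` is a hypothetical name chosen for illustration.

```python
def histogram_cpu(data, num_bins=256):
    """Count how often each integer value in [0, num_bins) occurs in data."""
    counts = [0] * num_bins
    for value in data:
        # On a GPU this increment is the step that would need an atomic add,
        # because many threads may update the same bin at the same time.
        counts[value] += 1
    return counts


# 16 elements, matching the "sizes are multiples of 16" constraint.
sample = [3, 3, 7, 255, 0, 3, 7, 0, 1, 1, 1, 1, 2, 2, 2, 2]
result = histogram_cpu(sample)
```

The output always has one slot per possible value, so bins that never occur stay at zero and the counts sum to the input size.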
Reference Implementation
from utils import verbose_allequal
import torch
from task import input_t, output_t


def ref_kernel(data: input_t) -> output_t:
    """
    Reference implementation of histogram using PyTorch.
    Args:
        data: tensor of shape (size,)
    Returns:
        Tensor containing bin counts
    """
    # Count values in each bin
    return torch.bincount(data, minlength=256)


def generate_input(size: int, contention: float, seed: int) -> input_t:
    """
    Generates random input tensor for histogram.
    Args:
        size: Size of the input tensor (must be multiple of 16)
        contention: float in [0, 100], specifying the percentage of identical values
        seed: Random seed
    Returns:
        The input tensor with values in [0, 255]
    """
    gen = torch.Generator(device='cuda')
    gen.manual_seed(seed)
    # Generate integer values in [0, 255] (the upper bound of randint is exclusive)
    data = torch.randint(0, 256, (size,), device='cuda', dtype=torch.uint8, generator=gen)
    # Make one value appear quite often, increasing the chance of atomic contention
    evil_value = torch.randint(0, 256, (), device='cuda', dtype=torch.uint8, generator=gen)
    evil_loc = torch.rand((size,), device='cuda', dtype=torch.float32, generator=gen) < (contention / 100.0)
    data[evil_loc] = evil_value
    return data.contiguous()


def check_implementation(data, output):
    expected = ref_kernel(data)
    reasons = verbose_allequal(output, expected)
    if len(reasons) > 0:
        return "mismatch found! custom implementation doesn't match reference: " + " ".join(reasons)
    return ''
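To show what the `contention` parameter does to the input distribution, the generator can be imitated on the CPU with the standard library. This is a rough sketch under stated assumptions: `generate_input_cpu` is a hypothetical stand-in, it uses Python's `random` module instead of a CUDA `torch.Generator`, and so it will not reproduce the same values as `generate_input` for a given seed.

```python
import random


def generate_input_cpu(size, contention, seed):
    """CPU stand-in for generate_input: uniform bytes, a fraction overwritten."""
    rng = random.Random(seed)
    data = [rng.randrange(256) for _ in range(size)]
    evil_value = rng.randrange(256)
    overwritten = 0
    for i in range(size):
        # Each position is replaced independently with probability contention / 100.
        if rng.random() < contention / 100.0:
            data[i] = evil_value
            overwritten += 1
    return data, evil_value, overwritten


data, evil_value, overwritten = generate_input_cpu(size=4096, contention=50.0, seed=0)
# Count of the most contended bin: overwritten positions plus any
# positions that drew the same value by chance.
hot_bin_count = data.count(evil_value)
```

With `contention=0` no position is overwritten and the values stay roughly uniform. With `contention=100` every element equals the same value, so every update lands in a single bin, which is the worst case for atomic increments.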
Rankings
L4
FourCore 🥇 | 79.095μs | histogram.py
tomaszki 🥈 | 87.895μs (+8.800μs) | histogram_2.py
Darshan 🥉 | 96.719μs (+8.824μs) | baseline.py
T4
tomaszki 🥇 | 115.777μs | histogram.py
FourCore 🥈 | 129.195μs (+13.419μs) | histogram.py
Darshan 🥉 | 130.399μs (+1.203μs) | baseline.py
A100
mancala 🥇 | 38.125μs | submission.py
tomaszki 🥈 | 44.217μs (+6.092μs) | histogram_2.py
Darshan 🥉 | 44.764μs (+0.547μs) | baseline.py
H100
mancala 🥇 | 26.493μs | submission.py
FourCore 🥈 | 31.467μs (+4.973μs) | histogram.py
tomaszki 🥉 | 34.097μs (+2.630μs) | histogram.py