Hello,
I've recently discovered that if the key/value buffers used in the sort are not a multiple of PARALLELSORT_THREADGROUP_SIZE * 4, then the buffers are read out of bounds and undefined behavior can occur.
See these lines below:
|
uint srcKeys[FFX_PARALLELSORT_ELEMENTS_PER_THREAD]; |
|
srcKeys[0] = SrcBuffer[DataIndex]; |
|
srcKeys[1] = SrcBuffer[DataIndex + FFX_PARALLELSORT_THREADGROUP_SIZE]; |
|
srcKeys[2] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 2)]; |
|
srcKeys[3] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 3)]; |
No bounds checks are done to SrcBuffer here, causing GPU instability when these are read out of bounds.
Also an issue here:
|
// Pre-load the key values in order to hide some of the read latency |
|
uint srcKeys[FFX_PARALLELSORT_ELEMENTS_PER_THREAD]; |
|
srcKeys[0] = SrcBuffer[DataIndex]; |
|
srcKeys[1] = SrcBuffer[DataIndex + FFX_PARALLELSORT_THREADGROUP_SIZE]; |
|
srcKeys[2] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 2)]; |
|
srcKeys[3] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 3)]; |
|
|
|
#ifdef kRS_ValueCopy |
|
uint srcValues[FFX_PARALLELSORT_ELEMENTS_PER_THREAD]; |
|
srcValues[0] = SrcPayload[DataIndex]; |
|
srcValues[1] = SrcPayload[DataIndex + FFX_PARALLELSORT_THREADGROUP_SIZE]; |
|
srcValues[2] = SrcPayload[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 2)]; |
|
srcValues[3] = SrcPayload[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 3)]; |
Later on, the number of keys is checked, but by that point it's too late:
|
uint localKey = (DataIndex < CBuffer.NumKeys ? srcKeys[i] : 0xffffffff); |
|
#ifdef kRS_ValueCopy |
|
uint localValue = (DataIndex < CBuffer.NumKeys ? srcValues[i] : 0); |
I suspect the fix would be to just check the number of keys before pre-loading the key/value pairs.
Reproducing is simple enough, just run the sort on data that is less than PARALLELSORT_THREADGROUP_SIZE * 4 with GPU-assisted validation that checks out of bounds descriptor reads.
Hello,
I've recently discovered that if the key/value buffers used in the sort are not a multiple of PARALLELSORT_THREADGROUP_SIZE * 4, then the buffers are read out of bounds and undefined behavior can occur.
See these lines below:
FidelityFX-ParallelSort/ffx-parallelsort/FFX_ParallelSort.h
Lines 133 to 137 in 0c53994
No bounds checks are done to SrcBuffer here, causing GPU instability when these are read out of bounds.
Also an issue here:
FidelityFX-ParallelSort/ffx-parallelsort/FFX_ParallelSort.h
Lines 348 to 360 in 0c53994
Later on, the number of keys is checked, but by that point it's too late:
FidelityFX-ParallelSort/ffx-parallelsort/FFX_ParallelSort.h
Lines 369 to 371 in 0c53994
I suspect the fix would be to just check the number of keys before pre-loading the key/value pairs.
Reproducing is simple enough, just run the sort on data that is less than PARALLELSORT_THREADGROUP_SIZE * 4 with GPU-assisted validation that checks out of bounds descriptor reads.