I'm experimenting with some use cases for structured buffers in a side project, where I'm building parallel implementations in Vulkan and DirectX 12. My performance stress test is brute-force lighting: a simple PBR forward fragment shader that iterates over all 256 point lights in the buffer. No tiled-forward or clustered-forward (yet), so this is the absolute worst case, just to explore structured buffer performance.
Each frame, I update all the lights and copy their positions/intensities/range (two float4s per light) into a buffer.
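For reference, here is a minimal sketch of that per-frame packing step. The struct names, the field names, and the exact split across the two float4s (position plus range in the first, RGB intensity plus one padding float in the second) are assumptions for illustration, not my exact code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical CPU-side light; names and layout are assumptions.
struct PointLight {
    float position[3];
    float range;
    float intensity[3];  // RGB intensity
};

// GPU-side element as the shader would see it: exactly two float4s.
struct GpuLight {
    float positionRange[4];  // xyz = position, w = range
    float intensityPad[4];   // rgb = intensity, w = unused padding
};
static_assert(sizeof(GpuLight) == 32, "GpuLight must be two float4s");

constexpr std::size_t kLightCount = 256;

// Packs the lights into dst, which stands in for mapped staging or
// host-visible memory. Returns the number of bytes written.
std::size_t packLights(const std::vector<PointLight>& lights,
                       void* dst, std::size_t dstCapacity) {
    const std::size_t bytes = lights.size() * sizeof(GpuLight);
    assert(bytes <= dstCapacity);
    auto* out = static_cast<unsigned char*>(dst);
    for (std::size_t i = 0; i < lights.size(); ++i) {
        const PointLight& l = lights[i];
        const GpuLight g = {
            {l.position[0], l.position[1], l.position[2], l.range},
            {l.intensity[0], l.intensity[1], l.intensity[2], 0.0f}};
        // Write each element once, in order, which suits write-combined memory.
        std::memcpy(out + i * sizeof(GpuLight), &g, sizeof(GpuLight));
    }
    return bytes;
}
```

With this layout, 256 lights come to 256 * 32 = 8192 bytes per frame, so the upload itself is small.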
On D3D12, this copy goes into a staging buffer that gets transferred to a GPU-side resource before use.
On Vulkan, I have two paths. The first mirrors the D3D12 pattern (write to a staging buffer, then call vkCmdCopyBuffer). The second round-robins through write-target buffers that are created as host-visible; I write to these and then just call vkFlushMappedMemoryRanges. These buffers are bound directly to the pipeline for the fragment shader to read.
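A rough sketch of the bookkeeping behind the second path is below, written without the Vulkan headers so it stands alone. The ring depth of 3 and all names are assumptions. The alignment helper reflects the rule that, for non-coherent memory, the offset passed to vkFlushMappedMemoryRanges must be a multiple of nonCoherentAtomSize, and the size must be either a multiple of it or reach the end of the allocation:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical ring depth: one buffer per frame that may still be in flight.
constexpr std::uint32_t kBuffersInFlight = 3;

// Chooses which host-visible buffer to write this frame.
constexpr std::uint32_t ringSlot(std::uint64_t frameIndex) {
    return static_cast<std::uint32_t>(frameIndex % kBuffersInFlight);
}

struct FlushRange {
    std::uint64_t offset;
    std::uint64_t size;
};

// Expands [offset, offset + size) outward to multiples of atomSize
// (the device's nonCoherentAtomSize), clamping the end to the allocation size.
FlushRange alignedFlushRange(std::uint64_t offset, std::uint64_t size,
                             std::uint64_t atomSize,
                             std::uint64_t allocationSize) {
    assert(atomSize > 0 && offset + size <= allocationSize);
    const std::uint64_t begin = (offset / atomSize) * atomSize;
    std::uint64_t end = ((offset + size + atomSize - 1) / atomSize) * atomSize;
    if (end > allocationSize) {
        end = allocationSize;  // a range ending at the allocation end is allowed
    }
    return {begin, end - begin};
}
```

The resulting offset and size would go into the VkMappedMemoryRange for the slot written that frame. If the memory type is also HOST_COHERENT, the explicit flush is not needed at all.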
I was expecting the write/transfer path to be faster on both D3D12 and Vulkan, since I assumed the host-visible/flush memory would live on the CPU side and basically just be read over the PCIe bus. Instead, I'm seeing better performance on D3D12 (~60ms/frame on my RTX 3090) and the same performance through either path on Vulkan (around 10% slower, at ~66ms/frame).
Profiling hasn't given me anything useful. Does anyone have any ideas why Vulkan is consistently slower here, and what I can do to mitigate it? I generally prefer Vulkan as an API, so I'd like to make sure I get this set up correctly.