10

Background:
I found that it is very easy to use a linear depth buffer, using only a slight modification to the canonical vertex transformation. The simplest method is found at the bottom of https://www.mvps.org/directx/articles/linear_z/linearz.htm.

However, the caveat is that it only works for triangles that don't need to be clipped against the near or far planes. (And an alternate solution, of performing the perspective divide in the vertex shader, will yield a similar problem for the other four frustum planes.)

Because clipping requires linear interpolation to work across all four clip-space coordinates, I think it's impossible to work with linear depth using only a vertex shader. But the reason for that comes down entirely to Z being divided by W.

Why is that done? X and Y need to be divided by the distance from the camera, but the Z coordinate does not need that division in order to fit into the NDC box.
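To make the question concrete, here is a rough sketch in Python of the difference I mean. The projection constants and helper names are just for illustration, and I'm assuming a D3D-style projection that maps view-space depth [near, far] to [0, 1]:

```python
# Sketch: compare the depth produced by the canonical transform, z_ndc = z_clip / w_clip,
# with a "linear Z" variant in which the vertex shader pre-multiplies the desired linear
# depth by w_clip, so that the hardware's divide by w cancels out.
# Assumes a D3D-style projection mapping view-space depth [near, far] to [0, 1].

near, far = 0.1, 100.0

def clip_z_w(z_view):
    """Canonical projection: (z_clip, w_clip) for a given view-space depth."""
    z_clip = (far / (far - near)) * z_view - (far * near) / (far - near)
    w_clip = z_view
    return z_clip, w_clip

def depth_standard(z_view):
    z_clip, w_clip = clip_z_w(z_view)
    return z_clip / w_clip                      # hyperbolic in view-space depth

def depth_linear(z_view):
    _, w_clip = clip_z_w(z_view)
    wanted = (z_view - near) / (far - near)     # the linear depth we want stored
    return (wanted * w_clip) / w_clip           # vertex shader outputs wanted * w; the divide cancels

for z in (0.1, 1.0, 10.0, 50.0, 100.0):
    print(f"z_view = {z:6.1f}   standard = {depth_standard(z):.4f}   linear = {depth_linear(z):.4f}")
```

The standard depth clumps toward 1 almost immediately, while the linear variant spreads evenly; the catch described above is that the linear variant only survives the hardware divide as long as nothing gets clipped in between.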

Jessy
  • 203
  • 2
  • 7

2 Answers

15

If you are rendering a perspective image and your model has implicit intersections, then using "linear Z" will make those intersections appear in the wrong places.

For example, consider a simple ground plane with a line of telephone poles receding into the distance, which pierce the ground (and continue below it). The implicit intersections are determined by the interpolated depth values. If those are not interpolated with 1/Z, then even though the vertices themselves have been projected with perspective, the image will look incorrect.

I apologise for the non-aesthetic quality of the following illustrations but I did them back in '97.

The first image shows the required rendering effect. (Note that the blue "pylons" extend quite a long way below the ground plane, so they are clipped at the bottom of the images.)

[Image: the required rendering effect, with the pylons intersecting the ground plane in the correct places]

This second image shows the result of using a non-reciprocal depth buffer: (Apologies for the change of scale - these were copied out of an old MS Word doc and I've no idea what has happened with the scaling.)

[Image: the result of using a non-reciprocal depth buffer, with the intersections in the wrong places]

As you can see, the results are incorrect.
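As a minimal numeric sketch of the effect (a toy setup in Python, not the exact geometry from the figures): take a ground plane at world Y = -1 and a vertical pole at constant depth Z = 5, projected with a simple divide-by-Z camera. With 1/Z interpolation the pole appears to enter the ground at the correct screen row; interpolating Z directly moves the apparent intersection drastically.

```python
# Toy setup: ground plane at world Y = -1, vertical pole at constant world Z = 5,
# projected with screen_y = Y / Z. We find the screen row at which the pole starts
# to win the depth test against the ground, i.e. where it appears to enter the ground.

y_near, z_near = -1.0, 1.0      # projected ground vertex at Z = 1   -> screen_y = -1
y_far,  z_far  = -0.01, 100.0   # projected ground vertex at Z = 100 -> screen_y = -0.01
pole_z = 5.0                    # the pole sits at constant depth 5

# True intersection: the pole meets the plane Y = -1 at Z = 5, i.e. screen_y = -1/5 = -0.2.

def ground_depth(screen_y, reciprocal):
    """Depth of the ground at a screen row, interpolated from its projected vertices."""
    t = (screen_y - y_near) / (y_far - y_near)       # fraction across the screen
    if reciprocal:                                   # interpolate 1/Z, then invert
        return 1.0 / ((1 - t) / z_near + t / z_far)
    return (1 - t) * z_near + t * z_far              # interpolate Z directly ("linear Z")

def apparent_intersection(reciprocal, steps=10000):
    # March up the screen; report where the pole first passes in front of the ground.
    for i in range(steps + 1):
        sy = y_near + (y_far - y_near) * i / steps
        if pole_z < ground_depth(sy, reciprocal):
            return sy
    return None

print("with 1/Z interpolation:", apparent_intersection(True))   # ~ -0.2  (correct)
print("with linear Z         :", apparent_intersection(False))  # ~ -0.96 (much too low on screen)
```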

On another note, are you sure you really want a linear Z representation? If you are rendering with perspective, surely one wants more precision closer to the camera than at a distance?

Re your later comment:

> It’s the “if those are not interpolated with 1/Z” that I don’t understand. What interpolation is that?

The first thing to note is that, with a standard perspective projection, straight lines in world space remain straight lines in perspective space. Distances/lengths, however, are not preserved.

For simplicity, let us assume a trivial perspective transform is used to project the vertices, i.e. $$X_{Screen} = \frac{X_{World}}{Z_{World}}$$ $$Y_{Screen} = \frac{Y_{World}}{Z_{World}}$$ We should also compute a reciprocal screen-space depth, e.g. $$Z_{Screen} = \frac{1}{Z_{World}}$$ but linear Z in the depth buffer would, to me, require something like $$Z_{Screen} = \mathit{scale} \times Z_{World}$$ (We can assume here that $\mathit{scale} = 1$.)

Let's assume we have a line with world-space end points $$\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \text{ and } \begin{bmatrix} 200 \\ 0 \\ 10 \end{bmatrix}$$ With the perspective mapping these map to the screen-space coordinates $$\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \text{ and } \begin{bmatrix} 20 \\ 0 \\ 0.1 \end{bmatrix}$$

The rendering system/hardware will linearly interpolate the screen-space Z, so at the half-way point of the line as it appears on screen, i.e. at pixel (10, 0), we would get a projected (inverse) Z value of 0.55, which corresponds to a world-space Z value of ~1.818. Given the starting and end Z values, that is only about 9% of the way along the length of the line.

If, instead, we tried to interpolate using the original Z values, we'd end up with a Z corresponding to a world-space value of 5.5. As long as nothing intersects, you might be OK (I've not thought about it too thoroughly), but anything with implicit intersections will be incorrect.
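Those numbers can be checked directly; here is a small sketch in Python, assuming the trivial projection defined above:

```python
# Line from (0, 0, 1) to (200, 0, 10), projected with x_screen = X/Z, y_screen = Y/Z.
# The projected endpoints are (0, 0) and (20, 0); we look at the on-screen midpoint, pixel (10, 0).
z0, z1 = 1.0, 10.0
t = 0.5                                     # half-way across the line *on screen*

# Perspective-correct: linearly interpolate 1/Z in screen space, then invert.
inv_z = (1 - t) * (1 / z0) + t * (1 / z1)
print(inv_z)                                # 0.55
print(1 / inv_z)                            # ~1.818

# Naive: linearly interpolate Z itself in screen space.
print((1 - t) * z0 + t * z1)                # 5.5

# Fraction of the way along the world-space line that Z ~= 1.818 corresponds to:
print((1 / inv_z - z0) / (z1 - z0))         # ~0.09
```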

What I haven't mentioned is that once you introduce perspective-correct texturing (or even perspective-correct shading), you must do per-pixel interpolation of 1/w and, in addition, compute per pixel the reciprocal of that interpolated value.
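As a sketch of that per-pixel step (same line as above, with world-space Z playing the role of w, and a made-up texture coordinate u running from 0 at the near end to 1 at the far end):

```python
# Perspective-correct attribute interpolation: linearly interpolate u/w and 1/w in
# screen space, then divide per pixel. Here w is the depth divisor (w = Z for the
# trivial projection above); u is a texture coordinate, 0 at the near end, 1 at the far end.
w0, w1 = 1.0, 10.0
u0, u1 = 0.0, 1.0
t = 0.5                                       # on-screen midpoint of the line

u_over_w = (1 - t) * (u0 / w0) + t * (u1 / w1)
inv_w    = (1 - t) * (1 / w0)  + t * (1 / w1)
print(u_over_w / inv_w)                       # ~0.0909: only ~9% along the line in world space

# Naive screen-space interpolation of u would give 0.5 here, which is exactly the
# classic affine texture-mapping distortion.
print((1 - t) * u0 + t * u1)                  # 0.5
```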

Simon F
  • 4,241
  • 12
  • 30
  • I don't think I'll be able to understand this answer without more math/diagrams. And yes, more precision, closer, probably makes sense, but a scaling from linear by far / z, which is standard, doesn't make sense. It yields a depth buffer that becomes more linear the closer the two clip planes are to each other. It seems like a conflation of two concepts: screen space-linear Z, and a non-constant depth buffer mapping for a performance hack. – Jessy Sep 04 '18 at 16:16
  • Specifically, it’s the, “if those are not interpolated with 1/Z” that I don’t understand. What interpolation is that? – Jessy Sep 05 '18 at 02:51
  • 1
    I'll add some additional text to hopefully explain – Simon F Sep 05 '18 at 11:48
  • Thanks! I think the problem comes down to "The rendering system/hardware will linearly interpolate the screen space z". I was under the impression that NDC position would be computed as (x, y, z) / w per-fragment, but apparently, instead, we have to deal with a linearly-interpolated version of (x/w, y/w, z/w)? That doesn't seem reasonable to me in 2018, but it would be good to know if that's the hack we have to live with for now anyway! – Jessy Sep 06 '18 at 20:26
  • To perform perspective correct texturing/shading/whatever, you need to linearly interpolate (Val/w) values, and then, per fragment, do a division by the linearly interpolated 1/w. It's a bit hard to explain just in a comment, but there is a little bit of an explanation in https://computergraphics.stackexchange.com/a/4799/209. Alternatively, do a search for Jim Blinn's article "Hyperbolic Interpolation" – Simon F Sep 07 '18 at 08:01
  • That article doesn't seem to be available without paying and I can't see a preview to know if it's worth it.

    Even if linear interpolation for a reciprocal was necessary for something, it could be interpolated along with the original values, and I don't think it would ever be the right choice for storing depth. I amended the question to emphasize position.

    – Jessy Sep 08 '18 at 00:30
  • Okay, I've read the article now! From that, what I'm gathering is that it's not a problem that the division happens. Rather, the problem is that depth doesn't get written by dividing the NDC position's Z value by the interpolated clip space 1 / W value. Sound right? I don't understand why that wouldn't happen, if it happens for all the other interpolated values, such as UVs. – Jessy Sep 10 '18 at 23:57
  • You could do that (there were proposals for a "W" buffer), but it's unnecessary for perspective depth comparisons as straight lines in world space remain straight lines in perspective space. If you ever try doing a traditional perspective drawing with pencil/ruler (e.g. using vanishing points) this may become evident.

    What doesn't get preserved is relative lengths but that doesn't matter as you go along a line as it's "just a line". It's when you add something that does depend on distances, e.g. texturing, that you need to do the per-pixel division.

    – Simon F Sep 11 '18 at 07:43
7

Using Z/W for the depth buffer goes deeper than just clipping against the near and far planes. As Simon alluded to, this has to do with interpolation between the vertices of a triangle, during rasterization.

Z/W is the unique option that allows NDC depth values to be correctly calculated for points in the interior of the triangle, by simply linearly interpolating the NDC depth values from the vertices in screen space. In principle, we could use any function we like to map camera-space Z to the depth buffer value, but any choice other than Z/W would require more complicated math to be done per pixel, which would be slower and more difficult to build in hardware.

Note that if you use a linear depth buffer, then of course linearly interpolating depth values will be correct in world space...but not, in general, in screen space! And it is screen space that matters for rasterization, as we need to be able to generate perspective-correct depth values (and other attribute values, like UVs) for each pixel center, or other sample point, within the screen-space bounds of a triangle being rasterized.
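A sketch of why Z/W has this property, assuming a standard projection in which $w_{clip}$ equals camera-space $Z$: the triangle lies in some plane $aX + bY + cZ = d$ in camera space. Dividing through by $Z$ gives $$a\frac{X}{Z} + b\frac{Y}{Z} + c = \frac{d}{Z}$$ i.e. $a\,x_s + b\,y_s + c = d/Z$, so $1/Z$ is an affine function of the screen coordinates $(x_s, y_s)$. Since $z_{clip}$ is itself an affine function of camera-space $Z$ for such a projection, the NDC depth $$z_{NDC} = \frac{z_{clip}}{w_{clip}} = \alpha + \frac{\beta}{Z}$$ is also affine across the screen, and that is exactly what plain linear interpolation in screen space reproduces. Camera-space $Z$ itself, by contrast, is $d/(a\,x_s + b\,y_s + c)$, a rational function of screen position, so interpolating it linearly in screen space is only correct for triangles that happen to lie at constant depth.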

Nathan Reed
  • 25,002
  • 2
  • 68
  • 107
  • I don’t know how to design a GPU, but it seems to me that all that’s needed is to interpolate Z instead of Z/W, for linear depth, and Z/W interpolation could still happen afterwards for anything visible. I still can’t tell if this is a matter of good reasoning or one of “nobody cares so we don’t bother updating”. – Jessy Sep 09 '18 at 10:28
  • Interpolating Z instead of Z/W does not give correct results in screen space. Z/W does. – Nathan Reed Sep 09 '18 at 18:42
  • Right. But if the depth buffer is quantized to a lower precision than position, then, aside from being performant when it works, it's not a good idea to store a scaled chunk of screen space Z. If linear interpolation is all we get, then clipping needs to happen in view space. And Z needs to be interpolated before division by W, for the depth buffer, and after it, for what you've gone over.

    So is the answer to my question, "because GPUs have always only interpolated in clip space because it was the only practical solution on the first GPUs, and it has worked out well enough since"?

    – Jessy Sep 10 '18 at 05:00
  • I'm not following what you mean about "quantized to a lower precision than position", or "store a scaled chunk of screen space Z". – Nathan Reed Sep 10 '18 at 05:17
  • 1
    Also, "Z needs to be interpolated before division by W, for the depth buffer"—no. That's what I've been trying to explain. You get the wrong answers if you interpolate Z (or anything else) in screen space without dividing it by W first. You seem to be stuck on this idea that a linear Z buffer would just work if we didn't divide by W. But it won't work—it won't interpolate in screen space properly. – Nathan Reed Sep 10 '18 at 05:21
  • Sorry, I meant "clip space", not "view space" in my last comment. I'm not convinced that interpolating in clip space instead of NDC space won't work for Z, but I am convinced (as mentioned in my last comment to Simon) that interpolating 1/ZW in NDC space and then dividing the interpolated 1/Z by 1/W would work. Trouble is, that step doesn't happen. What ends up getting stored in the depth buffer is linear depth, divided by distance from the camera (which is useful as you've said), scaled by the far clip plane – this last part seems a hack to achieve 1 at the far clip plane, to avoid division. – Jessy Sep 11 '18 at 00:17