I might solve this with an index texture, similar to this tilemap approach I presented earlier.
Set aside an extra texture that covers your whole terrain with a resolution of about 1 texel per aerial photo width, so it's very low-res compared to the photos themselves.
For each of these index texels that overlaps a photo location, choose the closest photo, and encode the UV coordinates of its center point in the red and green components of the index texel, and the photo's array index in the blue component.
The center point can be stored as an offset from this texel's location, in units where 1.0 = 1 photo width, so even at 1 byte per channel, you can position photos to a precision of 1/256th of their width (4 pixels for a 1024-pixel-wide photo). You can bump up to larger texel formats for even higher precision if needed. You can reserve a special value like (1, 1, 1), i.e. every channel at its maximum, to indicate "no photo nearby".
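To make the packing concrete, here is a rough CPU-side sketch in Python of how one index texel could be encoded and decoded. The function names are mine, and it rests on a few assumptions that go beyond the description above: positions are measured in index-texel units (1.0 = 1 photo width), the offset is measured from the texel's minimum corner so it falls in [0, 1), and array index 255 is reserved so the (255, 255, 255) sentinel can never collide with a real photo.

```python
# Sentinel meaning "no photo nearby": every 8-bit channel at its maximum.
NO_PHOTO = (255, 255, 255)


def encode_texel(texel_x, texel_y, photo_cx, photo_cy, photo_index):
    """Pack a photo's centre and array index into one 8-bit RGB texel.

    Assumption: all coordinates are in index-texel units, and the photo's
    centre lies inside this texel's cell, so the offset is in [0, 1).
    """
    off_x = photo_cx - texel_x
    off_y = photo_cy - texel_y
    if not (0.0 <= off_x < 1.0 and 0.0 <= off_y < 1.0):
        raise ValueError("photo centre is outside this texel's cell")
    if not 0 <= photo_index < 255:
        raise ValueError("index 255 is reserved for the sentinel")
    # Quantise each offset to 256 steps of 1/256 photo width.
    return (int(off_x * 256), int(off_y * 256), photo_index)


def decode_texel(texel_x, texel_y, rgb):
    """Recover (centre_x, centre_y, photo_index), or None for the sentinel."""
    if rgb == NO_PHOTO:
        return None
    r, g, b = rgb
    return (texel_x + r / 256, texel_y + g / 256, b)
```

For example, a photo centred at (3.5, 7.25) in the cell at (3, 7) packs to (128, 64, index), and any centre round-trips to within 1/256 of a photo width.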
When drawing the terrain, sample the four closest texels in the index texture, with sample filtering set to point/nearest so they don't blend together. From this, you get the coordinates and array indices of the 4 closest aerial photos. Select the closest one to sample, or fall back on your default colour if none of them are close enough to overlap this fragment.
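The per-fragment lookup could go roughly like this. It's written in Python purely for illustration; in practice this logic would live in your fragment shader. The `find_photo` name, the list-of-rows texture layout, and the byte encoding (offset from the texel's minimum corner in steps of 1/256, with (255, 255, 255) as the sentinel) are all assumptions of this sketch, as is treating each photo as an axis-aligned square exactly 1 index texel wide.

```python
import math

# Sentinel meaning "no photo nearby" (assumed encoding, see above).
NO_PHOTO = (255, 255, 255)


def find_photo(index_tex, px, py):
    """Return (photo_index, u, v) for the closest overlapping photo, or None.

    index_tex is a list of rows of (r, g, b) byte tuples; (px, py) is the
    fragment position in index-texel units (1.0 = 1 photo width).
    """
    height, width = len(index_tex), len(index_tex[0])
    # The 2x2 block of texels whose centres surround the fragment, i.e. the
    # same footprint bilinear filtering would use, but read unblended.
    x0 = math.floor(px - 0.5)
    y0 = math.floor(py - 0.5)
    best = None
    best_dist = 0.5  # a photo only covers points within half its width
    for ty in (y0, y0 + 1):
        for tx in (x0, x0 + 1):
            if not (0 <= tx < width and 0 <= ty < height):
                continue
            r, g, b = index_tex[ty][tx]
            if (r, g, b) == NO_PHOTO:
                continue
            # Decode the photo's centre from the stored offset.
            cx, cy = tx + r / 256, ty + g / 256
            # Chebyshev distance matches a square photo footprint.
            dist = max(abs(px - cx), abs(py - cy))
            if dist <= best_dist:
                best_dist = dist
                # UV within the photo, with (0.5, 0.5) at its centre.
                best = (b, px - cx + 0.5, py - cy + 0.5)
    return best
```

A `None` result is where you'd fall back on the default colour; otherwise the returned index and UV are what you'd use for the single tap into the photo array.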
The advantage of this approach over sending coordinates as uniforms is that it scales up to large numbers of photos well: the work per fragment remains constant at 4 taps from the index texture and at most 1 tap from the photo array, and you don't have to iterate over a list of all photos.
The advantage of this approach over encoding the information in vertex attributes is that you can transition between photos in the middle of a polygon, rather than being restricted to the resolution of the mesh subdivision.