According to the Attention Is All You Need paper, the Transformer's encoder is described as follows:
The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
How are the outputs of the N = 6 identical layers put together? Is it via concatenation, summation, an element-wise product, or are the 6 blocks simply placed in sequence, with each layer's output feeding the next (see the sketch below)? Is the same also done for the decoder?
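
To make the question concrete, here is a minimal sketch in plain Python of the two interpretations I have in mind. The `layers` argument is a hypothetical list of callables, each standing in for one self-attention + feed-forward block; this is only to illustrate the question, not the paper's actual implementation.

```python
def encoder_sequential(x, layers):
    """Interpretation 1: the 6 layers are stacked in sequence,
    each layer consuming the previous layer's output."""
    for layer in layers:
        x = layer(x)
    return x


def encoder_combined(x, layers):
    """Interpretation 2: all 6 layers see the same input and their
    outputs are combined afterwards (here, by summation)."""
    return sum(layer(x) for layer in layers)
```

Which of these (if either) matches what the paper means by "a stack of N = 6 identical layers"?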