Context-free grammars are equivalent to push-down automata, and in this particular case constructing an automaton first is easier:
$$
\begin{array}{ccccccc}
\mathtt{0}:\mathrm{push}&&\mathtt{1}:\mathrm{push}&&\mathtt{2}:\mathrm{pop}&&\mathtt{3}:\mathrm{pop} \\
\curvearrowleft && \curvearrowleft && \curvearrowleft && \curvearrowleft \\
s_0& \xrightarrow{\epsilon} &s_1& \xrightarrow{\epsilon} &s_2& \xrightarrow{\epsilon} &s_3
\end{array}
$$
where $s_0$ is the initial state and $s_3$ is the accepting state (acceptance also requires an empty stack). We only use $\mathrm{push}$ and $\mathrm{pop}$ of a single stack symbol because the stack only has to count; no additional information is needed.
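
To make the behaviour of the automaton concrete, here is a minimal Python sketch that simulates it. The function name `pda_accepts` is only illustrative, and the sketch relies on two observations about the diagram above: the $\epsilon$-moves only go forward, so the state can be tracked deterministically, and a single stack symbol means the stack can be replaced by a counter.

```python
def pda_accepts(word: str) -> bool:
    """Simulate the four-state automaton on a string over the alphabet 0..3."""
    state = 0  # index of the current state s_0 .. s_3
    stack = 0  # one stack symbol only, so the stack height is all we need
    for ch in word:
        if ch not in "0123":
            return False
        target = int(ch)       # only state s_k can read the symbol k
        if target < state:     # epsilon moves never go backward
            return False
        state = target         # follow epsilon moves up to s_target
        if target <= 1:        # s_0 and s_1 push
            stack += 1
        elif stack == 0:       # s_2 and s_3 pop, which fails on an empty stack
            return False
        else:
            stack -= 1
    # s_3 is always reachable by epsilon moves, so only the stack matters here
    return stack == 0
```

Under these assumptions the sketch accepts exactly the strings $\mathtt{0}^a\mathtt{1}^b\mathtt{2}^c\mathtt{3}^d$ with $a+b=c+d$, for example `pda_accepts("0123")` holds while `pda_accepts("0112")` does not.
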
To simulate this automaton with a context-free grammar we have to preserve the symmetry between pushes and pops. In other words, each time we produce a $\mathtt{0}$ or a $\mathtt{1}$ we must also produce a $\mathtt{2}$ or a $\mathtt{3}$. To keep track of which ones we can produce, we need enough nonterminals to represent all the combinations:
- nonterminal $A$ will represent the pair of states $(s_0,s_3)$,
- nonterminal $B$ will represent the pair of states $(s_1,s_3)$,
- nonterminal $C$ will represent the pair of states $(s_0,s_2)$,
- nonterminal $D$ will represent the pair of states $(s_1,s_2)$.
Observe that the path of the automaton determines possible dependencies in the grammar (with a slight oversimplification we could say that in "push" states we go forward and in "pop" states we go backward):
- from $A$ we could go to $B$, because we can go forward from $s_0$ to $s_1$, but not back;
- from $A$ we could go to $C$, because we can go backward from $s_3$ to $s_2$, but not the other way around;
- from $B$ we cannot go to $C$, because we cannot go from $s_1$ to $s_0$ (but we can go back from $s_3$ to $s_2$, in particular $B \to D$ is ok);
- etc.
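
The remaining cases can be enumerated mechanically. The following Python sketch is one possible way to do it; the names `NAMES`, `PUSH_STATES`, `POP_STATES` and `successors` are hypothetical, and a pair `(i, j)` stands for $(s_i, s_j)$. It lists only single steps, where either the left state advances or the right state retreats.

```python
PUSH_STATES = (0, 1)  # s_0 and s_1 push
POP_STATES = (2, 3)   # s_2 and s_3 pop

# Pair (i, j) stands for (s_i, s_j), named as in the list above.
NAMES = {(0, 3): "A", (1, 3): "B", (0, 2): "C", (1, 2): "D"}

def successors(pair):
    """Pairs reachable in one step: move the left state forward
    or the right state backward, staying on the same kind of state."""
    i, j = pair
    steps = []
    if i + 1 in PUSH_STATES:
        steps.append((i + 1, j))
    if j - 1 in POP_STATES:
        steps.append((i, j - 1))
    return steps

for pair, name in NAMES.items():
    print(name, "->", [NAMES[p] for p in successors(pair)])
```

This prints `A -> ['B', 'C']`, `B -> ['D']`, `C -> ['D']` and `D -> []`. A direct step from $A$ to $D$ is just the composition of two of these single steps.
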
The completed grammar looks as follows:
\begin{align}
S &\to A \\
A &\to \mathtt{0}\ A\ \mathtt{3} \mid B \mid C \mid D \\
B &\to \mathtt{1}\ B\ \mathtt{3} \mid D \\
C &\to \mathtt{0}\ C\ \mathtt{2} \mid D \\
D &\to \mathtt{1}\ D\ \mathtt{2} \mid \epsilon
\end{align}
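
As a sanity check, the grammar can be compared against the counting condition by brute force. The Python sketch below is an illustration only: `GRAMMAR`, `derive`, `in_language` and `brute_force` are names made up here, and the target language is assumed to be the strings $\mathtt{0}^a\mathtt{1}^b\mathtt{2}^c\mathtt{3}^d$ with $a+b=c+d$, which is what the automaton above accepts.

```python
import re
from itertools import product

# Right-hand sides of the grammar above; () is the epsilon production.
GRAMMAR = {
    "A": [("0", "A", "3"), ("B",), ("C",), ("D",)],
    "B": [("1", "B", "3"), ("D",)],
    "C": [("0", "C", "2"), ("D",)],
    "D": [("1", "D", "2"), ()],
}

def derive(symbol, max_len):
    """Set of terminal strings of length <= max_len derivable from symbol."""
    if max_len < 0:
        return set()
    words = set()
    for rhs in GRAMMAR[symbol]:
        if len(rhs) == 3:    # terminal, nonterminal, terminal
            left, inner, right = rhs
            words |= {left + w + right for w in derive(inner, max_len - 2)}
        elif len(rhs) == 1:  # unit production
            words |= derive(rhs[0], max_len)
        else:                # epsilon
            words.add("")
    return words

def in_language(word):
    """Check the shape 0*1*2*3* and the condition a + b = c + d."""
    if re.fullmatch(r"0*1*2*3*", word) is None:
        return False
    return word.count("0") + word.count("1") == word.count("2") + word.count("3")

def brute_force(max_len):
    """All strings over 0..3 of length <= max_len that satisfy in_language."""
    candidates = ("".join(t) for n in range(max_len + 1)
                  for t in product("0123", repeat=n))
    return {w for w in candidates if in_language(w)}

# The grammar and the counting condition agree on all strings up to length 6.
assert derive("A", 6) == brute_force(6)
```

For instance, $\mathtt{011233}$ is derived as $S \Rightarrow A \Rightarrow \mathtt{0}A\mathtt{3} \Rightarrow \mathtt{0}B\mathtt{3} \Rightarrow \mathtt{01}B\mathtt{33} \Rightarrow \mathtt{01}D\mathtt{33} \Rightarrow \mathtt{011}D\mathtt{233} \Rightarrow \mathtt{011233}$.
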
I hope this helps $\ddot\smile$