I have two news for you. I start with the bad one.
The dark side of the question
Computation theory tells us that checking whether two programs, or
program fragments, are equivalent is not decidable.
What that means is only that there is no unique technique that can
check that equivalence of any pair of programs. This remains true if
you consider a single programming language, as long as it is Turing
complete. (Note: I do not understand what you intend when mentionning
Turing complete algorithms in your comment - and, by the way,
precisions should be integrated in the question, preferably to
comments).
No unique technique also means no unique finite set of techniques as
they could be applied simultaneously. It also means no infinite set
of techniques that is finitely describable, etc.
This can be formally proved on Turing machine with Rice theorem (which
is a bit subtle to use). The proof can be tediously transposed to any
other Turing complete formalisation of computation. But invoking
Church-Turing thesis is usually considered enough.
To summarize it, there is no way you can produce a system that will
take two arbitrary program fragments of a Turing Complete language and
tell you when they are equivalent semantically, i.e. when their
computation results are the same.
But do not despair, there is hope.
The bright side of the question
While the above is true when you make the question so general, it does
not mean that this can never be done. Actually, this, or problems very
close to it, is the object of considerable reasearch. The
undecidability statement should only be seen as a limitation to what is to
be expected, but there is considerable room within that limitation.
So, there are many situation when it is actually possible to apply a
procedure that will actually decide whether two (fragments of)
programs are actually equivalent. The applicability of such a
procedure can be defined by a language (that is not Turing complete),
or by some limitation on the computational power such fragments can
express (so that they do not have to be in the same language).
Much of the research related to type theory also concerns provability
of programs properties, and can lead to answers to your question. But
that is much outside my competence.
Many other techniques have been developed for your purpose.
About your examples
Your idea of running both codes and comparing results is a good one,
at least in simple cases, like your example.
But you have to run the code symbolically, and then use a symbolic
computation system to check that the answers are indeed the same.
So, assume that initially a==$a$ and b==$b$, where I use italics for
symbolic expressions, i.e. non evaluated formulae. The symbols $a$ and $b$ just stand for themselves, and have no associated value.
running the first code:
a = a + b; - - so a==$a+b$
b = a - b; - - so b==$(a+b)-b$
a = a - b; - - so a==$(a+b)-((a+b)-b)$
Recall, again, that what is in italics is just symbolic expressions, trees if you prefer. There is nothing to be computed.
running the second code:
int temp = a; - - so temp==$a$
a = b; - - so a==$b$
b = temp;; - - so b==$a$
Now you give these results to a symbolic calculator that check that
the values of variables a and b are the same at the end. It must be able to simplify expressions such as $(a+b)-b$ and $(a+b)-((a+b)-b)$ to respectively $a$ and $b$, which requires using known algebraic properties of the operators $+$ and $-$.
It is actually a good technique (when applied properly - I goofed my first try),as it allowed me to notice that your two codes were not equivalent, and I corrected the first one.
Running the code symbolically is an example of a general paradigm
called abstract interpretation. The "casting out nines" test is a very
elementary example of these techniques.
Symbolic evaluation and abstract interpretations are a well
studied way of proving things about programs.
It is pretty much what is done by type checkers.
But it is far from the whole story about proving things about programs.
Large systems are being developed to prove properties of programs by
different means.
In other cases (probably for your sorting examples), it may be better to have a specification of what the
program is supposed to do and separately prove the two pieces of code
conformant with that specification. This avoid having to consider
simultaneously the specifics of two algorithms.