In the proof systems I can think of right away, a single proof rule can only inspect formula syntax down to a finite depth from the root.
Thus, if we have a proof that contains standardly many rule applications, even if they work on formulas with nonstandardly deep parse trees, the correctness of the proof cannot depend on what's in those parse trees beyond a certain finite depth. So we ought to be able to replace everything deeper than that with finite dummies, and get a standard proof of the same conclusion.
(There may be some subtleties with term substitution for $\forall$ and/or $\exists$ rules, but I don't think that changes the conclusion).
So a nonstandard proof of a contradiction must -- if PA is consistent in the first place -- have nonstandard length.
More precisely:
We're assuming we're given a model $\mathfrak M$ of PA, and something that this model thinks is the Gödel number of a proof of a contradiction. The model considers the length of this proof (that is, the number of steps) to be a standard number $n$.
Now consider a graph $G$ whose nodes are those elements of $\mathfrak M$ that the model considers to be Gödel numbers of wffs, with edges given as
- from $\varphi\to\psi$ to $\varphi$ as well as to $\psi$.
- from $\neg\varphi$ to $\varphi$.
- from $\forall x.\varphi$ to $\varphi$.
... and so forth. Depending on precisely how our version of first-order logic looks, there is a small finite number $D$ such that checking that a rule is correctly applied involves inspecting this graph at a distance of at most $D$ from the nodes that represent the premises and conclusion of the rule. (Where the rule wants equality between different pieces of formulas, that only requires checking that they're represented by the same node, not inspecting the entire syntax of the subformulas.)
For this purpose, each logical or non-logical axiom or axiom schema counts as a rule with zero premises. Then $D$ can be chosen large enough to allow us to check the validity of axioms too.
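As a toy illustration of this locality, here is a minimal Python sketch of checking one application of modus ponens. The representation is an assumption made for the example: formulas are opaque node ids, and a hypothetical `nodes` table maps each id to its top connective and the ids of its immediate subformulas. The check looks only one edge below the implication and otherwise compares node identities, so in this toy case $D=1$ suffices.

```python
def is_modus_ponens(nodes, premise_imp, premise_ant, conclusion):
    """Check that `conclusion` follows from `premise_imp` and `premise_ant`.

    `nodes` maps a node id to a tuple (tag, child ids...). Only the root
    of `premise_imp` is inspected; the other formulas are compared by id.
    """
    entry = nodes[premise_imp]
    return (
        entry[0] == "imp"
        and entry[1] == premise_ant
        and entry[2] == conclusion
    )


# Node 0 is (phi -> psi); nodes 1 and 2 stand for phi and psi, whose
# internal structure is never examined and could be arbitrarily deep.
nodes = {
    0: ("imp", 1, 2),
    1: ("opaque",),
    2: ("opaque",),
}

print(is_modus_ponens(nodes, 0, 1, 2))  # premises phi -> psi and phi, conclusion psi
print(is_modus_ponens(nodes, 0, 2, 1))  # wrong way round
```

Rules with substitution, and axiom schemas, need a deeper look than this, which is why $D$ depends on the proof system.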
Now let $H$ be the quotient of $G$ by the equivalence relation that relates two formulas if they differ only in the contents of terms and/or names of quantified variables.
$H$ is acyclic even when viewed outside $\mathfrak M$, because a finite cycle in it would correspond to a relation between finitely many Gödel numbers that PA proves can't exist. It is not necessarily well-founded, though.
There are finitely many formulas appearing as premises/conclusions in the proof. Our graph of formulas has bounded fanout, so there are finitely many of the nodes in $H$ that are reachable in at most $D$ steps from a node that represents something in the proof sequence. These nodes are the relevant nodes.
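The set of relevant nodes is just a depth-bounded search from the formulas of the proof. Here is a rough Python sketch on a small finite stand-in for $H$; the names `relevant_nodes`, `edges` and `max_dist` are made up for the illustration, and `max_dist` plays the role of $D$. Because the fanout is bounded and the search stops at distance $D$, it visits only finitely many nodes no matter how far the chains below them continue.

```python
from collections import deque


def relevant_nodes(edges, proof_nodes, max_dist):
    """Return all nodes within `max_dist` edges of some node in `proof_nodes`."""
    dist = {node: 0 for node in proof_nodes}
    queue = deque(proof_nodes)
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue  # do not look any deeper than max_dist
        for child in edges.get(node, ()):
            if child not in dist:
                dist[child] = dist[node] + 1
                queue.append(child)
    return set(dist)


# Toy subformula graph: a chain a -> b -> c -> d -> e plus a branch a -> x.
edges = {"a": ["b", "x"], "b": ["c"], "c": ["d"], "d": ["e"]}
relevant = relevant_nodes(edges, ["a"], max_dist=2)
print(sorted(relevant))
```

In this toy graph the nodes `d` and `e` lie deeper than distance 2 from the proof formula `a`, so they are not relevant.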
Now consider the following rewriting of formulas: If the formula corresponds to a relevant node, rewrite its subformulas recursively; if it is not relevant, write $0=0$. Since the restriction of $H$ to relevant nodes is finite (and still acyclic), this rewriting always produces a wff of standard syntactic depth.
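To make the rewriting concrete, here is a small Python sketch under simplifying assumptions: formulas are nested tuples, the set `relevant` is given explicitly as a set of such tuples (standing in for the relevant nodes of $H$), and a finite but deep formula stands in for one of nonstandard depth. The tags `"imp"`, `"not"`, `"forall"` and `"atom"` are invented for the example, and atoms are left untouched since terms are not dealt with at this stage.

```python
DUMMY = ("atom", "0=0")


def rewrite(formula, relevant):
    """Replace every non-relevant subformula by the dummy 0=0."""
    if formula not in relevant:
        return DUMMY
    tag = formula[0]
    if tag == "atom":
        return formula
    if tag == "not":
        return ("not", rewrite(formula[1], relevant))
    if tag == "imp":
        return ("imp", rewrite(formula[1], relevant), rewrite(formula[2], relevant))
    if tag == "forall":
        return ("forall", formula[1], rewrite(formula[2], relevant))
    raise ValueError(f"unknown connective: {tag!r}")


# layers[k] is p with k negations in front; layers[50] is the "deep" formula.
layers = [("atom", "p")]
for _ in range(50):
    layers.append(("not", layers[-1]))

phi = ("imp", layers[50], ("atom", "q"))
# Only the top two negation layers count as relevant here.
relevant = {phi, layers[50], layers[49], ("atom", "q")}

result = rewrite(phi, relevant)
print(result)
```

The output keeps the implication and the two relevant negations, and everything below them collapses to `0=0`, so its depth is bounded by the size of the relevant set rather than by the depth of the input.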
Rewriting every formula in the proof sequence produces a proof of the same length that contains only standardly many connectives and quantifiers, and which concludes $0=1$ just like the original proof did.
In order to get an actual finite proof, we now need to apply the same trick once again at the term syntax level, rather than the wff level. The details would be sensitive to exactly how our proof system handles term substitution in $\forall$ and $\exists$ rules, but since there are only standardly many terms or quantifiers left in the proof, a similar process ought to succeed.
(The details are elusive, though. The problem goes away if we imagine formulating PA in a language where there are no function symbols, but instead primitive predicates encoding $x=0$, $x=Sy$, $x=y+z$, $x=y\times z$. But translating proofs in the usual language of arithmetic to this alternative can cause the number of steps to blow up; in particular, it might turn a standard-length proof into a nonstandard-length one.)