I would like to make a few comments complementary to Prof. Mosher's answer.
- It might be helpful to consider the following characterization of ergodicity of $(\mu,T)$: Given two measurable subsets $A,B\subseteq X$, the measure $\mu(T^{-n}(A)\cap B)$ (which models the chance of a point $x\in X$ that starts somewhere in $B$ to end up somewhere in $A$ at time exactly $n\in\mathbb{Z}_{\geq1}$ under the time evolution $T$) is comparable to $\mu(A)\mu(B)$ for large $n$, on average, that is, $(\mu,T)$ is ergodic iff
$$\forall A,B\in\Sigma: \lim_{n\to \infty} \dfrac{1}{n}\sum_{k=0}^{n-1}\left[\,\mu(T^{-k}(A)\cap B)-\mu(A)\mu(B)\,\right]=0.$$
The term over which we average here can be thought of as a "correlation" or "covariance"; the fact that it decays in time in some sense means that events become asymptotically independent (see my answer at "What's so special about standard deviation?" for more on this).
I should remark that if, in the above characterization of ergodicity, one replaces the square brackets with absolute values, one obtains a stronger property called "weak mixing"; if one further drops the averaging, one obtains an even stronger property called "strong mixing". (The "mixing hierarchy" goes further than that, but the definitions get more sophisticated, at least with this formalism.)
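To make the difference between averaging the correlations and averaging their absolute values concrete, here is a minimal numerical sketch. The choice of system is mine and not part of the discussion above: I take $X$ to be the circle $\mathbb{R}/\mathbb{Z}$ with Lebesgue measure, $T(x)=x+\alpha \bmod 1$ with $\alpha$ the golden ratio conjugate, and $A$, $B$ arcs of length $0.3$ and $0.1$. The helper names `arc_overlap` and `correlation_averages` are hypothetical.

```python
import math

def arc_overlap(start_a, len_a, start_b, len_b):
    """Lebesgue measure of the intersection of two arcs on the circle R/Z.

    Each arc is [start, start + length) mod 1, with 0 <= length <= 1.
    """
    s_a = start_a % 1.0
    s_b = start_b % 1.0
    total = 0.0
    # Lift arc A to the real line and compare it with the lifts of arc B
    # shifted by -1, 0 and 1; no other lift of B can meet the lift of A.
    for shift in (-1.0, 0.0, 1.0):
        lo = max(s_a, s_b + shift)
        hi = min(s_a + len_a, s_b + len_b + shift)
        total += max(0.0, hi - lo)
    return total

def correlation_averages(n, alpha, a=(0.0, 0.3), b=(0.0, 0.1)):
    """Averages over k = 0..n-1 of c_k and of |c_k|, where
    c_k = mu(T^{-k}(A) & B) - mu(A) * mu(B) for the rotation T(x) = x + alpha mod 1.

    The arcs A and B are given as (start, length) pairs (illustrative defaults).
    """
    (a_start, a_len), (b_start, b_len) = a, b
    signed = 0.0
    absolute = 0.0
    for k in range(n):
        # For a rotation, T^{-k}(A) is the arc A shifted back by k * alpha.
        c_k = arc_overlap(a_start - k * alpha, a_len, b_start, b_len) - a_len * b_len
        signed += c_k
        absolute += abs(c_k)
    return signed / n, absolute / n

golden = (math.sqrt(5) - 1) / 2  # an irrational rotation angle (assumed choice)
signed_avg, abs_avg = correlation_averages(20000, golden)
print(signed_avg, abs_avg)
```

If I have set this up correctly, the first number should be close to $0$ while the second stays bounded away from $0$, which is consistent with an irrational rotation being ergodic but not weak mixing.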
There is something to be said about the time parameter being discrete. One could make sense of this by referring to the fact (?) that human perception is discrete (biologically there is a minimum time length below which we don't perceive), and ergodic theory is supposed to model (human) observation. Or else one could think that for one reason or another (faulty or imprecise instruments, cost, etc.) we make stroboscopic observations. Of course, mathematically there is a well-developed ergodic theory of continuous time (and beyond).
This is more of a historical comment. Arguably one of the earliest such heuristics regarding mixing properties is from Halmos's Lectures on Ergodic Theory (p.37), where he mixes vermouth and gin. There is a nice discussion of this in Brown's Ergodic Theory and Topological Dynamics (p.15); here is an excerpt:
> To borrow an illustrative example from Halmos [32], suppose that a
> mixture is made containing 90% gin and 10% vermouth. If the process of
> stirring the mixture is ergodic, then after sufficient stirring any portion
> of the container will contain on the average (with respect to the number of
> stirrings) about 10% vermouth.
The correspondence is as follows: $A$ represents an arbitrary part of the container one is observing, and $B$ represents the part of the container where the vermouth is originally located. Thus $\mu(B)=0.1$. In order for the observation to be nontrivial, assume $\mu(A)>0$, so that the observed part has positive volume. Then, renormalizing the correlation above, we get that for $n$ large, on average
$$\dfrac{\mu(T^{-n}(A)\cap B)}{\mu(A)}\approx \mu(B) = 0.1,$$
that is, approximately 10% of the part we are observing will be occupied by vermouth. (How large $n$ ought to be depends on $A$, $B$, and how good an approximation one wants, assuming $(\mu,T)$ is fixed.)
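For a feel of how the vermouth fraction settles down, here is a small sketch with a toy "stirring" of my own choosing, not one taken from Halmos or Brown: the doubling map $T(x)=2x \bmod 1$ on $[0,1)$ with Lebesgue measure, which is known to be strong mixing. I assume the vermouth starts in $B=[0,0.1)$ and the observed region is $A=[0.5,0.8)$; the function name `vermouth_fraction` is hypothetical.

```python
def vermouth_fraction(n, a_lo=0.5, a_hi=0.8, b_lo=0.0, b_hi=0.1):
    """Return mu(T^{-n}(A) & B) / mu(A) for the doubling map T(x) = 2x mod 1,
    where A = [a_lo, a_hi) and B = [b_lo, b_hi) are subintervals of [0, 1)."""
    scale = 2 ** n
    overlap = 0.0
    # T^{-n}(A) is the union of the 2^n intervals
    # [(j + a_lo) / 2^n, (j + a_hi) / 2^n) for j = 0, ..., 2^n - 1.
    for j in range(scale):
        lo = max((j + a_lo) / scale, b_lo)
        hi = min((j + a_hi) / scale, b_hi)
        overlap += max(0.0, hi - lo)
    return overlap / (a_hi - a_lo)

# Fraction of the observed region A occupied by vermouth after n stirrings.
fractions = [vermouth_fraction(n) for n in range(13)]
for n, frac in enumerate(fractions):
    print(n, round(frac, 4))
```

Since this particular map is mixing, no averaging over $n$ is needed here: the fraction itself should approach $0.1$. For a map that is merely ergodic one would have to average the fractions over $n$ instead.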
For the record, this heuristic only started making sense to me after getting accustomed to being able to think of the direction of time consistently (formally the difference between the past and the future may be confusing, and it depends on the interpretation).