Incorrect reasoning during Taylor series derivation?

Question

I want to derive the Taylor series approximation of a function $f(x)$ at a point $p$ using the following reasoning, but "my" Taylor series formula misses the inverse factorial scaling of individual terms.

Why? Is my reasoning missing some steps? Are my assumptions incorrect?

1. The goal is to build a local approximation of a function $f(x)$ at point $p$ using incremental changes of functions.
2. Our first local approximation of $f(x)$ will be the constant function $f_{0}(x) = f(p)$. This is a very crude approximation, which we seek to incrementally improve by incorporating our knowledge of the first derivative $f'(x)$.
3. The numerical value of the first derivative $f'(x)$ tells us how much the value of $f$ changes if we move one unit from the point $x$ along the $x$-axis (but this is just a local linear approximation, which is valid only near the point $x$, not necessarily 1 unit away from the point $x$).
4. We can therefore improve our first guess $f_{0}$ to $f_{1}(x) = f(p) + (x - p)f'(p)$. The term $(x - p)f'(p)$ uses the local measure of change $f'(p)$ to construct a linear function, which represents an incremental offset that improves the approximation provided by $f_{0}$.
5. $(x - p)f'(p)$ is only a linear function, so to further reduce our approximation error $\vert f(x) - f_{1}(x)\vert$, we can use the higher-order derivative $f''(x)$ to construct another linear approximation of how the function $f'(x)$ changes: $f_{2}(x) = f(p) + (x - p) \left( f'(p) + (x - p) f''(p) \right) $.

In general, we can keep recursively improving the linear approximations by adding incremental correction terms for lower-order derivatives, e.g. adding the offset $(x - p) f^{(n)}(p)$ that will bring the value of $f^{(n-1)}(p)$ closer to the exact value $f^{(n-1)}(x)$:

$$ \begin{aligned} \Delta x &= x - p\\ f_{n}(x) &= f(p) + \underbrace{\Delta x \left( f'(p) + \underbrace{\Delta x \left(f''(p) + \underbrace{\Delta x \left(f'''(p) + \cdots\right)}_\text{incremental correction for $f''(p)$} \right)}_\text{incremental correction for $f'(p)$} \right)}_\text{incremental correction for $f(p)$}\\ f_{n}(x) &= f(p) + f'(p)\Delta x + f''(p)\Delta x^2 + f'''(p) \Delta x^3 + \cdots \end{aligned} $$

Yet the correct Taylor expansion looks like this:

$$ f_{T}(x) = f(p) + \frac{f'(p)}{1!}\Delta x + \frac{f''(p)}{2!}\Delta x^2 + \frac{f'''(p)}{3!} \Delta x^3 + \cdots $$

Why is my reasoning incorrect? Why is "my" recursive formula not an optimal function approximation when compared to the equivalent Taylor expansion? And how can I change my reasoning to arrive at the "correct" Taylor polynomials that include inverse factorials?
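To make the gap between the two formulas concrete, here is a minimal numerical sketch. It assumes the example function $f = \exp$ at $p = 0$ (so every derivative at $p$ equals 1); the helper names `nested_no_factorials` and `taylor` are made up for illustration.

```python
import math

def nested_no_factorials(derivs, dx):
    """The recursive formula from the question: sum of f^(k)(p) * dx**k."""
    return sum(d * dx**k for k, d in enumerate(derivs))

def taylor(derivs, dx):
    """Taylor polynomial: sum of f^(k)(p) * dx**k / k!."""
    return sum(d * dx**k / math.factorial(k) for k, d in enumerate(derivs))

# Assumed example: f = exp at p = 0, where every derivative equals 1.
derivs = [1.0] * 8
dx = 0.5
exact = math.exp(dx)

err_nested = abs(nested_no_factorials(derivs, dx) - exact)
err_taylor = abs(taylor(derivs, dx) - exact)
print(err_nested, err_taylor)
```

For this example the sum without factorials is a geometric series that settles near $1/(1 - \Delta x) = 2$ rather than $e^{0.5} \approx 1.65$, while the Taylor polynomial is accurate to many digits.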

A side-question: seeking a clearer geometrical understanding, what does the term $\frac{1}{n!}$ scale? Does it scale the term $f^{(n)}(p)$ or does it scale $\Delta x^n$?

The point of a Taylor series is to make all the derivatives match at $p$: that's not really what you're doing. — Randall, May 27 '23 at 20:19
@Randall Thank you, I understand your point, which is beautifully explained in this 3Blue1Brown video, but I wanted to arrive at a function approximation on my own, hoping I would accidentally arrive at the Taylor series. It sounds like Taylor series and "my" recursive formula are two different ways of function approximation. How do they differ? Does "my" recursive formula have a name that I could look up and read more about it? — jordi, May 27 '23 at 20:25
Furthermore, if "my" formula and Taylor series are two distinct methods of function approximation, then is Taylor series in some way more optimal (relative to what "metric")? — jordi, May 27 '23 at 20:28
First two terms are good of course. The missing factorials come from differentiation of $x^n$, which you do $n$ times. Not including them makes for a worse and worse approximation for bigger and bigger $n$. Not sure about a common name. — Piita, May 27 '23 at 20:40
@Piita I do not think I understood. Where is the derivative of $x^n$ hidden in "my" formula? I can only see $\Delta x^n$ (which contains $x$), but I am unable to see that it is being differentiated. Which steps should I include into my 5-step reasoning above to make the differentiation of $x^n$ more visible? — jordi, May 27 '23 at 20:52
Read: The missing factorials come from differentiation of $x^n$, which one does n times in Taylor series. In your steps, you don't, correct; sorry for the throw-off. Wanted to point out where the factorials come from, which are not present in your approx. — Piita, May 27 '23 at 21:03
Did my newly elaborated answer clear your question ,or, did I miss something still — tryst with freedom, May 28 '23 at 20:12
@HopefulWhitepiller I commented under your answer (to keep our discussion at one place, under your answer). — jordi, May 28 '23 at 21:36

score 10 · Answer 1 · answered May 27 '23 at 20:53

$f^\prime(p)+(x-p)f^{\prime\prime}(p)$ is an approximation to $f^\prime(x)$

As a starting point, you might consider the following question:

Which approximation is better?

$f(x)\approx f(p)+f^\prime(p)(x-p)$ (Taylor polynomial)
$f(x)\approx f(p)+f^\prime(x)(x-p)$
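A quick numerical experiment can help with this comparison. The sketch below assumes the example $f = \exp$, $p = 0$, $x = 0.1$ (none of these choices come from the question), and also tries the average of the two approximations.

```python
import math

f = math.exp    # assumed example function; its derivative is also exp
df = math.exp
p, x = 0.0, 0.1

approx1 = f(p) + df(p) * (x - p)      # slope taken at p
approx2 = f(p) + df(x) * (x - p)      # slope taken at x
average = 0.5 * (approx1 + approx2)   # mean of the two slopes

err1 = f(x) - approx1
err2 = f(x) - approx2
err_avg = f(x) - average
print(err1, err2, err_avg)
```

In this example the two errors have similar size but opposite sign, so neither slope alone is clearly better, and averaging them cancels most of the error.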

answered May 27 '23 at 20:53

Dunham

Approximation #2 does not make sense to me; it computes local change of function value at the point $x$ ($f'(x)$), yet adds that linear increment or differential ($f'(x)(x-p)$) to the function value at point $p$ instead of to the fn value at point $x$. Using $(x-p)$ with $f'(x)$ introduces an unwanted $x$-value offset, I think. Therefore I feel like approximation #1 is better. – jordi May 27 '23 at 21:17
Do you see the connection between Approximation #2 and your step 5? – Dunham May 27 '23 at 21:27
I think I do see a connection; by writing $f'(p) + (x-p)f''(p)$ in my step 5, I am approximating $f'(x)$, which is similar to the Approximation #2 (which I think is incorrect). In other words, I think you lead me to realize that my whole attempt at function approximation is incorrect from the ground up :). Did I get that right? P.S. Thank you for asking me questions and letting me realize my mistake! Now I have to rethink my approach at deriving a function approximation method :). – jordi May 27 '23 at 22:14
@Dunham Your (2) could of course be expanded to a series where we evaluate the even derivatives at $p$, the odd ones at $x$; and we might wonder what the correct coefficients are. They are in fact the secant/tangent numbers (with alternating signs). This from a question once set to schoolboys aspiring to join Trinity College Cambridge. – ancient mathematician May 28 '23 at 12:57
Thinking about what @Dunham wrote, I came up with the following observation (is it incorrect?). It seems like Taylor series tries to (approximately) evaluate $f'$ at $p + \frac{x-p}{2}$. The higher-order derivatives are evaluated closer to $p$ (the factorial scaling determines the point of evaluation). Consider this: $f(p) + (x-p)f'(p) + \frac{(x-p)^2}{2!} f''(p)$ (Taylor polynomial) can be rewritten as $f(p) + (x-p)\left(f'(p) + \frac{x-p}{2} f''(p)\right)$, which can be seen as an approximation of $f(p) + (x-p)f'( p + \frac{x-p}{2})$. Why is it a valid thing to do? – jordi May 28 '23 at 19:12
You could argue that when you have a changing slope, the average slope would be better than simply taking the slope. That is, $f(x+h) = f(x) + \frac{f'(x) + f'(x+h) }{2}h$ – tryst with freedom May 29 '23 at 02:22
You may want to formulate your comment into a new question. I suspect that it will get more attention that way. Also, I would encourage you to work out the details in the comment of @ancient-mathematician. You may well find insight there for your pursuit. – Dunham May 29 '23 at 03:42
@Dunham in a comment under HopefulWhitepiller's answer, I wrote that I want to award you a bounty for your excellent answer too, but now that I awarded a bounty to HopefulWhitepiller, I can't award a bounty smaller than 200 rep. (I didn't know about this rule). Your answer helped me understand why my approach was incorrect, and I wish it was possible to accept two answers. I do appreciate your time and explanation, so I'm going to award you the promised bounty in a different way: I'll award it to your answer on an unrelated question (I wish it was possible not to abuse StackExchange). – jordi May 30 '23 at 21:03
No worries about the bounty. I understand the restrictions. I'm just glad you found something useful in my answer. Keep asking good questions. – Dunham May 30 '23 at 21:09

tryst with freedom · Accepted Answer · 2023-05-31T09:29:30.930

Note

After reading Dunham's answer once more, I realized he directly pointed out a critical mistake you've made (which I had also pointed out here, indirectly). In the bulk of this answer, I will explain from scratch how the Taylor series is just nested linear approximations.

You need to increase the number of sample points to involve higher derivatives!

Taylor series is one of my favourite topics in mathematics, and I have gone through this same question myself before. The issue is how you incorporated the second derivative (step 5).

We will first talk about approximating functions by discrete sampling, and then talk about what happens when the sampling is infinite.

To talk about an $n$-th order velocity (the $n$-th derivative), we need information about the function at $n+1$ points. For example, if we talk about acceleration (2nd derivative), we need the function's values at three points.

This could be understood, for instance, from the first-principles definition of the derivative (*):

$$f''(x) = \lim \frac{f'(x+h) - f'(x)}{h} = \lim \frac{ \left(\frac{f(x+2h) - f(x+h)}{h} \right) - \left(\frac{f(x+h) -f(x)}{h} \right) }{h}$$

Now, the thing is that the second-derivative contribution can only be captured if we sample at least two steps beyond $x$ (three points in total).

If we invert for the second $f(x+2h)$, we have (using $f(x+h)=f(x) + f'(x)h$):

$$ f(x+2h) = f(x) + 2hf'(x) + h^2f''(x)$$
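One way to read "invert" here, as a sketch: drop the limit in the second-difference expression (*) and treat it as an approximate identity for small $h$, then solve for $f(x+2h)$ and substitute the linear approximation for $f(x+h)$:

$$ \begin{aligned} h^2 f''(x) &\approx f(x+2h) - 2f(x+h) + f(x) \\ f(x+2h) &\approx 2f(x+h) - f(x) + h^2 f''(x) \\ &\approx 2\left[ f(x) + hf'(x) \right] - f(x) + h^2 f''(x) \\ &= f(x) + 2hf'(x) + h^2 f''(x) \end{aligned} $$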

Similarly, we can find

$$ f(x+3h) = f(x) + 3 hf'(x)+ 3h^2f''(x) + h^3f'''(x)$$

We'd find that in general, we have :

$$f(x+nh) = f(x) + \binom{n}{1} hf'(x) + \binom{n}{2} h^2 f''(x) +...$$

Now why does this feel like a binomial expansion? That's answered in the section "Now, what on earth is the meaning of all this?". If you can accept the result for now, you can continue reading through the next sections.

Direct calculation of (*):

Write $$ \begin{align}f(x+h) &= f(x+ \frac{h}{2} + \frac{h}{2})\\ &= f(x+\frac{h}{2}) + f'(x+ \frac{h}{2}) \frac{h}{2} \\ &= f(x) + \frac{h}{2} f'(x) +\left[ f'(x) + f''(x) \frac{h}{2} \right] \frac{h}{2} \\ &= f(x) + f'(x)h + f''(x) \frac{h^2}{4} \end{align}$$

A three-way split of $h$ would let you calculate $f(x+h)$ in the same way, using three sub-steps.
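As a sketch of what the three-way version looks like, one can take the $f(x+3h)$ formula above and replace the step $h$ by $\frac{h}{3}$:

$$ \begin{aligned} f(x+h) &\approx f(x) + 3 \cdot \frac{h}{3} f'(x) + 3\left(\frac{h}{3}\right)^2 f''(x) + \left(\frac{h}{3}\right)^3 f'''(x) \\ &= f(x) + hf'(x) + \frac{h^2}{3} f''(x) + \frac{h^3}{27} f'''(x) \end{aligned} $$

The coefficient of $f''(x)h^2$ moves from $\frac{1}{4}$ (two-way split) to $\frac{1}{3}$ (three-way split), heading towards the Taylor value $\frac{1}{2}$ as the split gets finer.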

Intuitive calculation of $(*)$ [how you should have actually done it]

You could think of $f(x)$ as measuring the distance travelled by an accelerating car, with $a$ being the first point in time and $b$ the final point. We introduce a point in the middle, $a+ \frac{b-a}{2}$, to involve the second derivative.

The contribution of velocity to the change over this whole interval would be $ f'(a) (b-a)$. What about acceleration? Well, suppose we gain velocity $\delta v$ in the first $\frac{b-a}{2}$ seconds; then that velocity can only affect the distance in the next $\frac{b-a}{2}$ seconds. Hence, we have:

$$ f(b) = f(a) + f'(a) (b-a) +( f''(a) \frac{b-a}{2} \frac{b-a}{2}) = f(a) + f'(a)(b-a) + f''(a) (\frac{b-a}{2})^2$$

Now, you'd still be confused, since there is an additional factor of $1/2$ compared with the Taylor term. Obviously, this two-slice formula is not a good approximation: if we slice the time interval more finely, then even over the finest slices the acceleration keeps contributing, so we have to take the $f(x+nh)$ formula with $n \to \infty, h \to 0, nh=b-a$.

How would we get the Taylor series from the approximation based on the n-point sampling?

If we want to approximate $f(b)$ given $f(a)$, we proceed as follows.

Setting $nh=t = b-a$, we have:

$$ f(x+t) = f(x) + \frac{ \binom{n}{1}}{n} tf'(x) + \frac{\binom{n}{2}}{n^2} t^2 f''(x) + \cdots $$

We find that as $n \to \infty$ the above expression turns into

$$ f(x + t) = f(x) + tf'(x) + \frac{t^2}{2!}f''(x) + \cdots $$
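The limit can be checked numerically. In the $n$-step formula the coefficient of $t^k f^{(k)}(x)$ is $\binom{n}{k}/n^k$; the sketch below (the helper name `sampled_coefficient` is made up for illustration) prints it next to $1/k!$ for a few values of $n$.

```python
import math

def sampled_coefficient(n, k):
    """Coefficient of t**k * f^(k)(x) in the n-step formula: C(n, k) / n**k."""
    return math.comb(n, k) / n**k

for k in (2, 3, 4):
    for n in (2, 10, 1000):
        print(k, n, sampled_coefficient(n, k), 1 / math.factorial(k))
```

With $n = 2$ and $k = 2$ the coefficient is exactly $\frac{1}{4}$, the "two-point" value from the direct calculation, and it approaches $\frac{1}{2}$ as $n$ grows.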

and so on.

Now, what on earth is the meaning of all this?

This is probably the most interesting and conceptually rich section in this post, as these ideas are used even in the most advanced calculus calculations. So, buckle up!

We have the same premise as before: we want to approximate $f(x)$ at $b$ given $f(x)$ at $a$. We split the interval $\left[a, b \right]$ into $\left[ a , a+h \right] , \left[ a+h , a+2h \right] \cdots \left[ a+(n-1)h , a+nh=b \right] $

So, we will calculate the Taylor series by approximating in steps. First question: how does the value of the function change over $\left[ a, a+h \right]$? By linear approximation, we have:

$$ f(a+h) = f(a) + f'(a) h $$

But let's write this in a different way:

$$f(a+h) = \left[(1 + h \frac{d}{dx}) f(x)\right]_{x=a}$$

We can see this is the same as the previous result by unpacking everything (let me know if this step wasn't clear).

Now, how does the value of the function change on the interval $\left[ a+h , a+ 2h \right]$?

$$ f(a+2h) = f(a+h) + h f'(a+h)$$

What happens if we use the previously mentioned trick? We have:

$$ f(a+2h) = \left[ (1 + h\frac{d}{dx}) f(x+h)\right]_{x=a}$$

But, hey we can do the trick again on the inner $f(x+h)$, we have:

$$f(a+2h) = \bigg[ (1 + h\frac{d}{dx}) \left[ (1+ h\frac{d}{dx}) f(x) \right] \bigg]_{x=a}$$

Now, we do another trick: we think of $(1+h \frac{d}{dx})$ as a function in itself, which takes functions and gives out functions. This is known as an operator. Then we consider $(1+ h\frac{d}{dx})^2$ to be this map applied twice in succession. We have,

$$f(a+2h) = \left[ (1+ h\frac{d}{dx})^2 f(x)\right]_{x=a}$$

By induction, we can show that:

$$f(a+nh) = \left[ (1+h \frac{d}{dx})^n f(x) \right]_{x=a}$$

Now, suppose we fix $n \cdot h = b-a$ (number of partitions times the size of each partition of $b-a$), and send the number of partitions to infinity. We have:

$$ \begin{align}f(b)= \lim_{ n \to \infty} f(a+nh) &= \lim_{n \to \infty} \left[ ( 1 + h \frac{d}{dx})^n f(x) \right]_{x=a} \\ &=\left[\lim_{n \to \infty} ( 1 + h \frac{d}{dx})^n f(x) \right]_{x=a} \end{align}$$

Now, here is something interesting, which we can show (try to see why):

$$\lim_{n \to \infty} ( 1+ h \frac{d}{dx})^n = \lim_{n \to \infty} ( 1+ (\frac{b-a}{n} ) \frac{d}{dx})^n = 1+ (b-a) \frac{d}{dx} + \frac{(b-a)^2}{2!} \frac{d^2}{dx^2} + \cdots =e^{(b-a)\frac{d}{dx} }$$

We identify the operator series as evaluating the series for the exponential at the operator. Finally,

$$f(b) = \left[ e^{(b-a)\frac{d}{dx} } f(x) \right]_{x=a}$$
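The operator picture can be tried out on a polynomial, where $\frac{d}{dx}$ acts exactly on the coefficients. This is a minimal sketch under assumed choices ($f(x) = x^3$, $a = 1$, $b = 2$; all helper names are hypothetical): it applies $(1 + h\frac{d}{dx})$ $n$ times with $h = (b-a)/n$ and evaluates the result at $a$.

```python
def derivative(coeffs):
    """Derivative of a polynomial given as coefficients [c0, c1, c2, ...]."""
    return [k * c for k, c in enumerate(coeffs)][1:] or [0.0]

def step(coeffs, h):
    """Apply the operator (1 + h d/dx) to the polynomial."""
    d = derivative(coeffs)
    return [c + h * (d[k] if k < len(d) else 0.0) for k, c in enumerate(coeffs)]

def evaluate(coeffs, x):
    """Value of the polynomial at x."""
    return sum(c * x**k for k, c in enumerate(coeffs))

def n_step_estimate(coeffs, a, b, n):
    """Evaluate [(1 + h d/dx)^n f](a) with h = (b - a) / n."""
    h = (b - a) / n
    for _ in range(n):
        coeffs = step(coeffs, h)
    return evaluate(coeffs, a)

cubic = [0.0, 0.0, 0.0, 1.0]   # assumed example: f(x) = x**3, so f(2) = 8
for n in (1, 2, 100, 10000):
    print(n, n_step_estimate(cubic, 1.0, 2.0, n))
```

With $n = 1$ this is just the tangent-line value $4$; as $n$ grows the estimate creeps up towards $f(2) = 8$, which is what the exponential of the operator produces in the limit.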

And that's it! That's also the basis of the fancy-schmancy category theory answer that tp1 wrote. It may be remarked that the bracket-evaluation trick I kept doing is also the basic idea behind a shadowy version of calculus called umbral calculus.

Why did this whole procedure "feel" like doing an integral?

See the accepted answer here

Bonus : Shifting points of evaluation generally in higher calculus

Roger Penrose, The Road to Reality

The idea of linear approximation is quite profound, and more general than single-variable calculus itself. Thinking of it abstractly, it says that to find the value a little bit away, we take the value at the initial point and add the change in the parameter times the rate of change of the function with respect to that parameter. We have,

$$E_h f(x) = ( I+ h \nabla_t) f$$

In other words, we can write the value of the function at a later point $x+h$ using its value and derivative at the initial point. This idea also works in higher calculus to change the point of evaluation of a function of many variables. To capture the idea of the point of evaluation being changed, we think of the curve as being made up of a bunch of tiny tangent vectors and, by some math magic, associate the change in the output as we move along these tangent vectors with a derivative operator acting on the function. Once we have that, we can immediately use the Taylor series idea to shift the evaluation point of the function.

I am reacting to your comment under my question: thank you for your answer (I upvoted). I learned something new from almost all answers here, so I'm not sure which answer to accept (I wish I could accept multiple). What is not clear to me, is how you arrived at the expression $f(x+2h) = f(x) + 2f'(x) + f''(x)$ (you wrote "If we invert for the second $f(x + 2h)$…") and why is it true (there's no $h$ inside). Also, it seems as if Taylor series approximately evaluates $f'$ at $x + \frac{h}{2}$, which doesn't make sense to me (detailed explanation is in my third comment under @Dunham's answer). — jordi, May 28 '23 at 21:50
No, we don't evaluate it at $h/2$, we evaluate at $x$ itself. Basically write $f(x+h) = f(x+ \frac{h}{2} + \frac{h}{2}) = f(x+ \frac{h}{2}) + f'(x+ \frac{h}{2}) \frac{h}{2} = f(x) + \frac{h}{2} f'(x) +\left[ f'(x) + f''(x) \frac{h}{2} \right] \frac{h}{2} = f(x) + f''(x)h + f''(x) \frac{h^2}{4}$. Which you can see is the same thing I got by the method elaborated in my answer @jordi — tryst with freedom, May 29 '23 at 02:03
Thank you for editing the answer, but I am still confused. In the edited version, you wrote $f(x+h) = f(x) + f'(x)h + f''(x) \frac{h^2}{4}$ (I assumed there was a typo and changed $f''(x)h$ to $f'(x)h$), yet earlier above you wrote $f(x + t) = f(x) + tf'(x) + \frac{t^2}{2!}f''(x)$ (a Taylor approximation). Why are those expressions different? The expression with $h$ inside uses $f''(x)\frac{h^2}{4}$, meanwhile the Taylor approximation uses $f''(x)\frac{t^2}{2}$. How does the expression with $h$ inside relate to the Taylor approximation? — jordi, May 29 '23 at 09:53
Ok, non standard terminology incoming (basically made up by me on the spot), if you do a two point taylor, yes then the factor is actually four, but if you take an infinite point taylor, the factor is actually 1/2 @jordi. I'll do another addition on the interpretation — tryst with freedom, May 29 '23 at 10:21
Thank you for taking the time to further explain the topic, I appreciate it! I am going to award you a bounty as soon as StackExchange will let me do it for this question. — jordi, May 29 '23 at 11:48
Oh my god! That means the world to me. Thank you so much. Let me know if you have further doubts :D @jordi — tryst with freedom, May 29 '23 at 11:50
Ah, but what did you conclude? What was your takeaway from this essay I wrote @jordi — tryst with freedom, May 29 '23 at 18:30
My takeaway is that we can use approximations of terms we don't know while constructing an approx. of a different term, e.g. we can use $f'(p)$ to approx. $f(p + h)$ while approximating $f'(p + h)$. Next we can combine the two approx. together. Using these nested approximations with a low # of terms leads to crude approx.. That's why we need to use infinite # of terms, which is similar to compound interest (the exponentiation). We can use infinite nested series to replace integration of $\int_p^{x - p} f'(x)\,dx$. I think I gained a better understanding of approximation in general, thank you! — jordi, May 29 '23 at 20:18
After the previous edit to your answer (prior to the bonus section), the ideas you shared helped me to finally fully understand everything (and more) I wanted to know about Taylor series when I wrote this question. Now I can say I truly understand where the factorials come from and why your "two-point" Taylor expansion has the $\frac{h^2}{\mathbf{4}}$ term in it, because I can derive Taylor expansion from scratch. That's why I decided to accept your answer, even though Dunham's answer did uncover a big part of the puzzle for me (I will give @Dunham a smaller bounty too). Thank you so much!! — jordi, May 30 '23 at 09:59

FShrike · Answer 3 · 2023-05-28T12:41:40.980

I'd say your intuitive idea of improving the approximation is trying to change the wrong things.

$f(x)-f(p)=f'(p)\cdot(x-p)+\psi_1(x-p)$ where $\psi_1(u)/u\to0$ as $u\to0$ by definition of derivative. $f'(p)$ is the (well, represents the) best linear approximation to $f(x)-f(p)$. In light of that, to want to replace $f'(p)(x-p)$ with $(\text{something else})(x-p)$ is, I would say, the wrong philosophy (yes, the Taylor expression can be factored so that it is in that form, but that places the wrong emphasis imo: we don't want "better" linear approximations, we want a perfect linear approximation + a perfect quadratic approximation + ...)

You reasoned: "$f'(p)$ is a measure of the change, but we can improve upon it by using the more sensitive measure $f'(p)+(x-p)f''(p)$". I personally don't buy this argument, but I appreciate the concept of wanting a better measure of the change. It is the combination $x\mapsto f'(p)(x-p)$ which is the measure of the change (in some sense a "perfect" one) and if you want to improve upon it, you should focus on reducing the error. Changing either of the terms $(x-p)$ or $f'(p)$ individually - that has no reason to work. You could only expect improvements by reducing the error, which is guaranteed to be an improvement.

At the moment, we can only say that the error function - $\psi_1$ - is bounded by some constant multiple of $(x-p)$ in some small neighbourhood of $p$. The natural thing to want to do is push this to $(x-p)^2$ (and then to $(x-p)^3$, etc... and if your function is very nice (analytic) you can push this error right down to zero), but if you change $f'(p)(x-p)$ into something else you're being counterproductive: let's instead work on changing $\psi_1$ into something else!

So you can ask the question: focusing on the error term, what is the magic number $a$ (if it exists) so that: $$\psi_1(x-p)=a(x-p)^2+\psi_3(x-p)$$I.e. so that: $$f(x)-f(p)=f'(p)(x-p)+a(x-p)^2+\psi_3(x-p)$$Where $\psi_3(u)/u^{\color{red}{2}}\to0$ as $u\to0$? This magic number exists iff. $f$ is twice differentiable at $p$ and in this case it equals $\frac{1}{2}f''(p)$. The correct thing to do is simply calculate $a$, rather than try to guess at its value. But, this value can be intuitively motivated, and this has been done many times online - I'm not going to do a better job of that.

All I'll say is: if you want to approximate $f$, you want your approximation to start in the correct place (at $f(p)$) and then change in the same way as $f$ does (so the approximation can "keep up" with $f$ as it moves away from $f(p)$). That can be made precise by demanding the (first few) derivatives at $p$ are equal, and if you make that calculation you find $a=\frac{1}{2}f''(p)$ is correct (rather than your $a=f''(p)$ suggestion).
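The claim $a=\frac{1}{2}f''(p)$ can be checked numerically. This sketch assumes the example $f = \sin$ at $p = 0.5$ (an arbitrary choice) and computes $\psi_1(u)/u^2$ for shrinking $u$; if a magic number $a$ exists, this ratio should approach it.

```python
import math

p = 0.5
# Assumed example: f = sin, with its first and second derivatives.
f, df, d2f = math.sin, math.cos, lambda t: -math.sin(t)

def psi1(u):
    """Error of the linear approximation: f(p + u) - f(p) - f'(p) * u."""
    return f(p + u) - f(p) - df(p) * u

for u in (1e-1, 1e-2, 1e-3):
    print(u, psi1(u) / u**2, 0.5 * d2f(p))
```

The ratio settles near $\frac{1}{2}f''(p) \approx -0.24$ and not near $f''(p) \approx -0.48$, which is the value the formula without factorials would need.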

1/2 Thank you, I agree that the high-level goal should be to minimize the approximation error. Although it was my original intention in my approach, I guess I did not think my approach through long enough, but thanks to @Dunham's answer (please see my latest comment under his answer), I realized my mistake. — jordi, May 28 '23 at 21:47
2/2 Do I understand correctly that Taylor series is all about using combination of only polynomial functions to approximate other functions and that polynomials are the best way how to utilize the available derivative info at a point $p$? Does there exist another class of functions (e.g. exponential functions) for approximation, which could better utilize (not necessarily via addition) all available derivatives? — jordi, May 28 '23 at 21:47
@jordi Yes, the Taylor polynomials are about finding the "optimal" polynomial approximations to $f$ for a suitable sense of optimal. You can also approximate nice classes of functions via trigonometric polynomials, exponentials, complex exponentials, (Fourier series) but also there is a general theorem - the Stone Weierstrass theorem - that says for many general sets of "building block"-functions, you can approximate any continuous function via these building blocks, arbitrarily well. Though, the theorem doesn't give you formulas with which to build such approximations — FShrike, May 28 '23 at 22:00
As far as I know, the Taylor polynomials - or the Pade' approximants!! - are the only standard approximations / series representations that use derivative information. You can also express some functions as infinite products (thereby having an approximation with the finite products), e.g. the Weierstrass factorisation of the sine, cosine, tangent, etc. — FShrike, May 28 '23 at 22:01

tp1 · Answer 4 · 2023-06-27T22:40:16.090

Taylor series is best explained by the following function: $$ f(x+k) = e^{k\frac{d}{dx}} [f(x)] $$

This is called the "translation operator", $(x) \mapsto (x+k)$, which is equivalent to the Taylor series:

\begin{array}{c} \hspace{2.0cm}\mathbf{(x)} \overset{+k}{\longrightarrow} \mathbf{(x+k)} \\ \hspace{1.4cm}\scriptstyle{\mathbf{f}}\hspace{0.1cm}\mathbf{\downarrow} \hspace{1.7cm} \mathbf{\downarrow} \hspace{0.1cm}\scriptstyle{\mathbf{f}}\\ \hspace{1.8cm}\mathbf{f(x)} \underset{\mathbf{e^{k\frac{d}{dx}}}}{\longrightarrow} \mathbf{f(x+k)} \end{array}

Then we need to remember the famous formula (from the first pages of Rudin): $$ \exp(z) = \sum_{n=0}^{\infty} \frac{z^n}{n!} $$ (This basically answers where the $n!$ is coming from, i.e. it's part of the definition of $e^{z}$.)

When applied to the translation operator: $$ f(x+k) = \sum_{n=0}^{\infty} \frac{k^n\frac{d^n}{dx^n}}{n!} [f(x)]$$.

This expression is a function $$f(x+k) ::(x,k,f(x),f'(x),f''(x),f'''(x),...,f^{(\infty)}(x)) \rightarrow R$$ (This explains what data is needed to calculate the Taylor series.)

This starts to look very much like the ordinary Taylor series.
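As a small numerical sketch of the translation operator, assume $f = \sin$, whose $n$-th derivative is $\sin(x + n\pi/2)$. Truncating the operator series after a few terms should reproduce $f(x+k)$; the helper names below are made up for illustration.

```python
import math

def sin_derivative(n, x):
    """n-th derivative of sin at x (the derivatives cycle with period 4)."""
    return math.sin(x + n * math.pi / 2)

def translated(x, k, terms):
    """Partial sum of sum_n k**n / n! * f^(n)(x) for f = sin."""
    return sum(k**n / math.factorial(n) * sin_derivative(n, x)
               for n in range(terms))

x, k = 0.3, 1.2   # arbitrary example values
print(translated(x, k, 20), math.sin(x + k))
```

The two printed numbers agree to many digits, and dropping the `math.factorial(n)` divisor would break that agreement.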