Lecture notes for MS-C1650 Numerical Analysis, Aalto University
Last updated: 30.5.2024
Largely based on the lecture transcript by Harri Hakula, 2021.
Intended learning outcomes. After the course, the student will be able to…
explain the fundamental concepts of numerical analysis, like condition number, stability, and convergence rate;
construct the floating point numbers;
discuss and employ basic numerical algorithms like Newton’s method;
use the Monte-Carlo method in basic problems in analysis and geometry;
apply different polynomial interpolation methods and numerical quadrature rules;
understand the Euler scheme and linear multi-step methods for solving ordinary differential equations.
Floating-point numbers
The set of real numbers R is infinite in two ways: it is unbounded and continuous. In most practical computing, the second kind of infiniteness is more consequential than the first kind, so we turn our attention there first.
Instead of $\mathbb{R}$, we shall introduce the set of floating-point numbers (floats) $\mathbb{F}$. They come with different bases, precisions, exponent ranges, and other features. The basic representation is $x = \pm(d_0.d_1d_2\ldots d_p)_k \cdot k^e$. Here $k \in \mathbb{N}\setminus\{1\}$ is called the base or radix, $p \in \mathbb{N}_0 := \mathbb{N}\cup\{0\}$ is called the precision, the digits satisfy $d_i \in \{0,1,\ldots,k-1\}$, $i \in \{0,\ldots,p\}$, and the number
$$(d_0.d_1d_2\ldots d_p)_k := \sum_{i=0}^{p} d_i k^{-i}$$
is called the mantissa or significand. The exponent $e \in \mathbb{Z}$ is bounded, $m \le e \le M$, where $m, M \in \mathbb{Z}$.
If $k = 10$, we can read off the usual decimal representation from the mantissa: $(1.01)_{10} = 1\cdot 10^{0} + 0\cdot 10^{-1} + 1\cdot 10^{-2} = 1.01$.
If $k = 2$, we have binary floats. In the binary case, we observe that we can always choose $d_0 = 1$ by adjusting the exponent $e$, hence saving one bit. We refer to this as normalization. The mantissa is then always contained in the interval $[1,2)$.
Example. (Toy floating point system). Binary floats of the type $(1.b_1b_2)_2$ with exponents $e = -1, 0, 1$.
Hence $(1.00)_2 = 1$, $(1.01)_2 = \tfrac{5}{4}$, $(1.10)_2 = \tfrac{3}{2}$, and $(1.11)_2 = \tfrac{7}{4}$. By multiplying with the exponents $2^{-1} = \tfrac{1}{2}$, $2^{0} = 1$, $2^{1} = 2$, we get the whole set:
| $e$ | $(1.00)_2$ | $(1.01)_2$ | $(1.10)_2$ | $(1.11)_2$ |
|---|---|---|---|---|
| $0$ | $1$ | $5/4$ | $3/2$ | $7/4$ |
| $1$ | $2$ | $5/2$ | $3$ | $7/2$ |
| $-1$ | $1/2$ | $5/8$ | $3/4$ | $7/8$ |
Important quantity: $(1.01)_2 - 1 = \tfrac{1}{4}$, the so-called machine epsilon.
Define the machine epsilon by $\varepsilon := 2^{-p} = (1.00\ldots01)_2 - 1$.
Rounding
For the rounding function $\mathrm{round}\colon \mathbb{R}\to\mathbb{F}$, we have 5 alternative definitions:
Rounding to nearest (default)
Rounding to +∞
Rounding to −∞
Rounding to 0
Rounding away from 0
It holds that $\mathrm{round}(x) = x(1+\delta)$, where $|\delta| < \frac{\varepsilon}{2}$ and $\varepsilon$ denotes the machine epsilon. Note that usually $\delta$ depends on $x$.
There is a way to define the standard arithmetic operations on $\mathbb{F}$ such that
$$a\oplus b = \mathrm{round}(a+b) = (a+b)(1+\delta_1), \qquad a\ominus b = \mathrm{round}(a-b) = (a-b)(1+\delta_2),$$
$$a\otimes b = \mathrm{round}(ab) = ab(1+\delta_3), \qquad a\oslash b = \mathrm{round}(a/b) = \frac{a}{b}(1+\delta_4), \quad b \neq 0.$$
Here, generally $\delta_i \neq \delta_j$ for $i \neq j$.
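As a small illustration, here is a minimal Python sketch (the specific numbers 0.1, 0.2, 0.3 are just a familiar example) showing that machine addition agrees with exact addition only up to a relative error on the order of the machine epsilon:

```python
# 0.1, 0.2 and 0.3 are not exactly representable in binary, and the sum is
# rounded, so the computed result differs slightly from the decimal value 0.3.
s = 0.1 + 0.2                       # this is round(0.1) (+) round(0.2)
print(s == 0.3)                     # False
print(abs(s - 0.3))                 # about 5.5e-17
print(abs(s - 0.3) / 0.3 < 2**-52)  # True: relative error below machine epsilon
```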
IEEE 754 “Double precision”
k=2, 64 bits, where:
The sign: 1 bit;
The exponent field $0 \le e_x \le 2047$: 11 bits, where $e = e_x - 1023$ and $-1022 \le e \le 1023$; $e_x = 0$ and $e_x = 2047$ are special cases.
The mantissa: 52 bits, precision $p = 52$.
| Exponent field | Number | Type of number |
|---|---|---|
| $00\ldots00 = 0$ | $\pm(0.b_1b_2\ldots b_{52})_2\cdot 2^{-1022}$ | 0 or subnormal |
| $00\ldots01 = 1$ | $\pm(1.b_1b_2\ldots b_{52})_2\cdot 2^{-1022}$ | normalized |
| $00\ldots10 = 2$ | $\pm(1.b_1b_2\ldots b_{52})_2\cdot 2^{-1021}$ | normalized |
| $\ldots$ | $\ldots$ | |
| $01\ldots11 = 1023$ | $\pm(1.b_1b_2\ldots b_{52})_2\cdot 2^{0}$ | normalized |
| $\ldots$ | $\ldots$ | |
| $11\ldots10 = 2046$ | $\pm(1.b_1b_2\ldots b_{52})_2\cdot 2^{1023}$ | normalized |
| $11\ldots11 = 2047$ | $\pm\infty$ if $b_1 = b_2 = \ldots = b_{52} = 0$, otherwise NaN | exception |
Thus, there are two zeros, two infinities and NaN, which denotes “not a number”. The smallest positive normalized number is $(1.0)_2\cdot 2^{-1022} \approx 2.2\cdot 10^{-308}$.
The largest positive number is $(1.1\ldots1)_2\cdot 2^{1023} \approx 1.8\cdot 10^{308}$.
The machine epsilon is $2^{-52} \approx 2.22\cdot 10^{-16}$.
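These constants can be checked directly; a minimal Python sketch (values as reported by the standard library on an IEEE 754 double precision system):

```python
import sys

fi = sys.float_info
print(fi.epsilon)   # 2.220446049250313e-16, i.e. 2**-52, the machine epsilon
print(fi.min)       # 2.2250738585072014e-308, smallest positive normalized number
print(fi.max)       # 1.7976931348623157e+308, largest finite number
print(fi.max * 2)   # inf: overflow gives one of the two infinities
print(5e-324)       # smallest positive subnormal number
```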
Here’s an easy-to-follow video explaining floating point numbers (and a specific version of Newton’s algorithm).
Condition number and stability
Conditioning of problems
Assume that $f\colon\mathbb{R}\to\mathbb{R}$ is the “solution map” of the problem, and that the input numbers $x$, $\hat{x}$ are close in value, e.g. $\hat{x} = \mathrm{round}(x)$. Set $y := f(x)$, $\hat{y} := f(\hat{x})$.
Definition. The absolute condition number $C(x)$ is defined by the relation $|y - \hat{y}| \approx C(x)\,|x - \hat{x}|$.
The relative condition number $K(x)$ is defined by the relation $\left|\frac{y-\hat{y}}{y}\right| \approx K(x)\left|\frac{x-\hat{x}}{x}\right|$.
By the normalization, we guarantee that (relative error in the output)≈K(x)×(relative error in the input).
Now,
$$y - \hat{y} = f(x) - f(\hat{x}) = \underbrace{\frac{f(x)-f(\hat{x})}{x-\hat{x}}}_{\to f'(x)\ \text{as}\ \hat{x}\to x}\,(x-\hat{x}).$$
Thus, $C(x) = |f'(x)|$.
Furthermore,
$$\frac{y-\hat{y}}{y} = \frac{f(x)-f(\hat{x})}{f(x)} = \underbrace{\frac{f(x)-f(\hat{x})}{x-\hat{x}}}_{\to f'(x)\ \text{as}\ \hat{x}\to x}\,\frac{x-\hat{x}}{x}\,\frac{x}{f(x)}.$$
Thus, $K(x) = \left|\frac{x f'(x)}{f(x)}\right|$.
Example. $f(x) = 2x$, $f'(x) = 2$. Thus, $C(x) = 2$ and $K(x) = \left|\frac{2x}{2x}\right| = 1$. This is a well-conditioned problem.
Example. $g(x) = \sqrt{x}$, $g'(x) = \frac{1}{2\sqrt{x}}$. Thus, $C(x)$ becomes unbounded for $x > 0$ close to zero, e.g. $x \approx 10^{-8}$ yields $C(x) \approx 10^{4}$. On the other hand, $K(x) = \left|\frac{x}{2\sqrt{x}\,\sqrt{x}}\right| = \frac{1}{2}$.
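As a sanity check, the relative condition number can be estimated numerically by feeding in a relatively perturbed input; a small Python sketch (the perturbation size $h = 10^{-6}$ is an arbitrary choice):

```python
import math

def rel_condition(f, x, h=1e-6):
    """Estimate K(x): relative change of the output divided by that of the input."""
    xh = x * (1 + h)                       # relatively perturbed input
    return abs((f(xh) - f(x)) / f(x)) / abs((xh - x) / x)

print(rel_condition(lambda t: 2 * t, 3.0))   # about 1.0  (K = 1)
print(rel_condition(math.sqrt, 1e-8))        # about 0.5  (K = 1/2)
```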
Stability of algorithms
Definition. An algorithm or numerical process is called stable if small changes in the input produce small changes in the output. It is called unstable if large changes in the output are produced by small changes in the input.
An algorithm is stable, if every step is well-conditioned (i.e. has a uniformly bounded condition number). It is unstable if any step is ill-conditioned (i.e. the condition number may become arbitrarily large).
Forward error analysis (FEA) is asking:
“How far are we from the true solution?”
Backward error analysis (BEA) is asking:
“Given the answer, what was the problem?”
Example.
Set $\mathrm{fl}(x+y) := \mathrm{round}(x)\oplus\mathrm{round}(y) = \big(x(1+\delta_1) + y(1+\delta_2)\big)(1+\delta_3)$,
where $|\delta_i| < \frac{\varepsilon}{2}$, $i = 1, 2, 3$.
FEA: $\mathrm{fl}(x+y) = x + y + x(\delta_1+\delta_3+\delta_1\delta_3) + y(\delta_2+\delta_3+\delta_2\delta_3)$.
The absolute error is $|\mathrm{fl}(x+y) - (x+y)| \le (|x|+|y|)\left(\varepsilon + \frac{\varepsilon^2}{4}\right)$.
The relative error is $\left|\frac{\mathrm{fl}(x+y)-(x+y)}{x+y}\right| \le \frac{(|x|+|y|)\left(\varepsilon + \frac{\varepsilon^2}{4}\right)}{|x+y|}$.
BEA: $\mathrm{fl}(x+y) = x(1+\delta_1)(1+\delta_3) + y(1+\delta_2)(1+\delta_3)$.
Thus the relative error for each term is less than or equal to $\varepsilon + \frac{\varepsilon^2}{4}$.
Hence the sum of two floating-point numbers is backward stable.
Well-conditioned problems may have unstable algorithms. For stability, every step has to be well-conditioned; a single ill-conditioned step can already produce an unstable algorithm. Ill-conditioned problems cannot be reliably solved even with a stable algorithm.
Example. Consider evaluating $f(x) = \sqrt{1+x} - 1$ for $x$ close to zero. The relative condition number is
$$K(x) = \left|\frac{x f'(x)}{f(x)}\right| = \frac{x}{2\sqrt{1+x}\,(\sqrt{1+x}-1)} = \frac{x(\sqrt{1+x}+1)}{2\sqrt{1+x}\,(\sqrt{1+x}-1)(\sqrt{1+x}+1)} = \frac{\sqrt{1+x}+1}{2\sqrt{1+x}},$$
and $K(0) = 1$.
Consider the following 3 steps:
$t_1 := 1 + x$, well-conditioned for $x$ close to 0.
$t_2 := \sqrt{t_1}$, relatively well-conditioned, also absolutely well-conditioned, because $t_1$ is close to 1.
$t_3 := t_2 - 1$, ill-conditioned; the relative condition number of this step is $K_3(t_2) = \left|\frac{t_2}{t_2-1}\right|$, which becomes unbounded for $t_2$ close to 1!
On the other hand, the problem itself is well-conditioned. Solve it by writing
$$f(x) = \sqrt{1+x} - 1 = \frac{(\sqrt{1+x}+1)(\sqrt{1+x}-1)}{\sqrt{1+x}+1} = \frac{x}{\sqrt{1+x}+1},$$
which can be evaluated directly for $x$ close to zero.
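The following Python sketch (with an arbitrarily chosen small input $x = 10^{-15}$) illustrates the difference between the naive, cancellation-prone formula and the rewritten one:

```python
import math

def f_naive(x):
    return math.sqrt(1 + x) - 1         # step t3 = t2 - 1 cancels for small x

def f_stable(x):
    return x / (math.sqrt(1 + x) + 1)   # algebraically identical, no cancellation

x = 1e-15
exact = x / 2 - x**2 / 8                 # Taylor expansion, accurate enough here
print(abs(f_naive(x) - exact) / exact)   # large relative error (cancellation)
print(abs(f_stable(x) - exact) / exact)  # relative error near machine epsilon
```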
Numerical differentiation
Recall Taylor’s theorem: for a twice differentiable function $f\colon\mathbb{R}\to\mathbb{R}$,
$$f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}(x-x_0)^2 f''(\xi),$$
for any $x_0, x \in \mathbb{R}$, where $\xi \in [x_0, x]$.
What is that $\xi$ (= “xi”)? Under certain assumptions, elementary functions have series expansions. If the series is truncated, we obtain the Taylor polynomial. The remainder has an explicit expression, but since an intermediate value theorem is applied in deriving it, the exact location of the point $\xi$ is not known, i.e. it is “generic”.
By setting $x := z + h$, $x_0 := z$, we obtain the useful equivalent formula
$$f(z+h) = f(z) + f'(z)h + \frac{1}{2}f''(\xi)h^2,$$
for every $z, h \in \mathbb{R}$, $\xi \in [z, z+h]$.
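Rearranging gives the forward difference approximation $f'(z) \approx \frac{f(z+h)-f(z)}{h}$ with truncation error $\frac{h}{2}f''(\xi) = O(h)$. A minimal Python sketch (the test function $\sin$ and the step sizes are arbitrary choices) also shows how rounding errors take over for very small $h$:

```python
import math

def forward_diff(f, z, h):
    """Forward difference approximation of f'(z); truncation error is O(h)."""
    return (f(z + h) - f(z)) / h

z = 1.0                                  # true derivative of sin at 1.0 is cos(1.0)
for h in (1e-1, 1e-4, 1e-8, 1e-12):
    err = abs(forward_diff(math.sin, z, h) - math.cos(z))
    print(f"h = {h:.0e}   error = {err:.2e}")
# The error first decreases like O(h), but grows again for very small h
# because of cancellation in f(z + h) - f(z).
```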
Rate of convergence (Q-convergence)
Let $(x_k)$ be an infinite sequence of real numbers. Let $s_k := \sup_{l\ge k} x_l$, $k\in\mathbb{N}$, be the supremum (i.e., the least upper bound) of the tail (that is, large indices $l \ge k$) of $(x_k)$. Define the limsup (limes superior) as
$$\limsup_{k\to\infty} x_k := \lim_{k\to\infty} s_k \in [-\infty, +\infty].$$
Unlike a limit, it always exists, but it can be $\pm\infty$. If $(x_k)$ is bounded, the limsup is the largest limit of a converging subsequence. If $\lim_{k\to\infty}x_k \in (-\infty,\infty)$ exists, then $\lim_{k\to\infty}x_k = \limsup_{k\to\infty}x_k$. The converse is not true.
Examples. (1) $x_k := (-1)^k$, then $s_k = 1$, and $\limsup_{k\to\infty}x_k = 1$.
(2) $x_k = \sin(k)$, then $s_k = 1$, and $\limsup_{k\to\infty}x_k = 1$.
Assume that $\lim_{k\to\infty}x_k = x$ and that there exists some large index $M\in\mathbb{N}$ such that $x_k \neq x$ for all $k \ge M$. Then we define the following quantity for $p \ge 0$:
$$C(p) := \limsup_{k\to\infty}\frac{|x_{k+1}-x|}{|x_k-x|^p}.$$
We observe that $C(p^*) < \infty$ for some $p^* > 0$ implies $C(p) = 0$ for every $0 \le p < p^*$. If $C(p^*) > 0$ for some $p^* > 0$, then $C(p) = \infty$ for any $p > p^*$.
Proof. By the submultiplicative property of the limsup,
$$C(p) = \limsup_{k\to\infty}\frac{|x_{k+1}-x|}{|x_k-x|^p} = \limsup_{k\to\infty}\left[\frac{|x_{k+1}-x|}{|x_k-x|^{p^*}}\,|x_k-x|^{p^*-p}\right] \le \left(\limsup_{k\to\infty}\frac{|x_{k+1}-x|}{|x_k-x|^{p^*}}\right)\cdot\left(\limsup_{k\to\infty}|x_k-x|^{p^*-p}\right) = C(p^*)\cdot\begin{cases}0 & \text{if } p < p^*,\\ \infty & \text{if } p > p^*.\end{cases} \qquad\square$$
Thus, there exists a (possibly infinite) $p^*$ such that
$$C(p) = \begin{cases}0 & \text{if } 0\le p < p^*,\\ C(p^*) & \text{if } p = p^*,\\ \infty & \text{if } p > p^*.\end{cases}$$
The number $p^*$ is called the order of convergence for the sequence $(x_k)$ and determines the rate of convergence as follows:
If $p^* = 1$ and $C(1) = 1$, then we say the convergence is sublinear.
If $p^* = 1$ and $0 < C(1) < 1$, then we say the convergence is linear.
If $p^* > 1$ or $C(1) = 0$, then we say the convergence is superlinear.
If $p^* = 2$, then we say the convergence is quadratic.
If $p^* = 3$, then we say the convergence is cubic, etc.
When working with convergence estimates it is often useful to use the following approximation:
$$|x_{k+1}-x| \approx C|x_k - x|^{p^*}$$
for some constant $C > 0$, not necessarily $C(p^*)$.
Here, it is useful to look at the logarithmic behavior:
$$\log(|x_{k+1}-x|) \approx \log\!\left(C|x_k-x|^{p^*}\right) = \log(C) + \log\!\left(|x_k-x|^{p^*}\right) = \log(C) + p^*\log(|x_k-x|).$$
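Writing $e_k := |x_k - x|$, this suggests estimating the order in practice via $p^* \approx \frac{\log(e_{k+1}/e_k)}{\log(e_k/e_{k-1})}$. A small Python sketch with a synthetic quadratically convergent error sequence (the constant $C = 0.7$ is an arbitrary choice):

```python
import math

# Synthetic error sequence with e_{k+1} = C * e_k^2 (quadratic convergence).
C = 0.7
errors = [1e-1]
for _ in range(4):
    errors.append(C * errors[-1] ** 2)

# The slope of log(e_{k+1}) versus log(e_k) estimates the order p*.
for k in range(1, len(errors) - 1):
    p = math.log(errors[k + 1] / errors[k]) / math.log(errors[k] / errors[k - 1])
    print(p)    # close to 2
```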
The rate of convergence can be used interchangeably with the order of convergence. However, some caution is necessary, as different authors use different terminology here. The order of convergence usually refers to the same thing, namely the exponent $p^*$ in the denominator of the limit defining $C(p)$. Most confusingly, some authors call the order of convergence the “rate of convergence”, as e.g. here. The English Wikipedia article calls it the order of convergence, whereas here the rate of convergence is the constant in the definition, which also determines the speed of convergence, together with the order of convergence. So, please always check the context, as the use of the terminology should be clear from it. If there is no definition, try to figure out what is meant in each text. As a rule of thumb: the “order of convergence” is a unique terminology in numerical analysis. The “rate of convergence” can mean at least two different things. I will use both words for the same thing, but will try to make clear what I mean from case to case. In any case, to be sure, use “order of convergence”. My PhD advisor usually said that in mathematics “it’s all hollow words” (meaning that one should check the definition).
Landau’s little o- and big O-notation
Copied from Wikibooks under a Creative Commons BY-SA 4.0 license.
The Landau notation is an amazing tool applicable in all of real analysis. The reason it is so convenient and widely used is because it underlines a key principle of real analysis, namely “estimation”. Loosely speaking, the Landau notation introduces two operators which can be called the “order of magnitude” operators, which essentially compare the magnitude of two given functions.
The “little-o”
The “little-o” provides a function that is of lower order of magnitude than a given function, that is the function o(g(x)) is of a lower order than the function g(x). Formally,
Definition.
Let $A\subseteq\mathbb{R}$ and let $c\in\mathbb{R}$.
Let $f, g\colon A\to\mathbb{R}$.
If $\lim_{x\to c}\frac{f(x)}{g(x)} = 0$, then we say that
“As $x\to c$, $f(x) = o(g(x))$”.
Examples.
As $x\to\infty$ (and $m < n$), $x^m = o(x^n)$;
As $x\to\infty$ (and $n\in\mathbb{N}$), $\log x = o(x^n)$;
As $x\to 0$, $\sin x = o(1)$.
The “Big-O”
The “Big-O” provides a function that is at most the same order as that of a given function, that is the function O(g(x)) is at most the same order as the function g(x). Formally,
Definition.
Let $A\subseteq\mathbb{R}$ and let $c\in\mathbb{R}$.
Let $f, g\colon A\to\mathbb{R}$.
If there exists $M > 0$ such that $\lim_{x\to c}\left|\frac{f(x)}{g(x)}\right| < M$, then we say that
“As $x\to c$, $f(x) = O(g(x))$”.
Examples.
As $x\to 0$, $\sin x = O(x)$;
As $x\to \frac{\pi}{2}$, $\sin x = O(1)$.
Applications
We will now consider a few examples which demonstrate the power of this notation.
Differentiability
Let $f\colon U\subseteq\mathbb{R}\to\mathbb{R}$ and $x_0\in U$.
Then $f$ is differentiable at $x_0$ if and only if
there exists a $\lambda\in\mathbb{R}$ such that, as $x\to x_0$, $f(x) = f(x_0) + \lambda(x-x_0) + o(|x-x_0|)$.
Mean Value Theorem
Let $f\colon[a,b]\to\mathbb{R}$ be differentiable on $[a,b]$. Then,
as $x\to a$, $f(x) = f(a) + O(x-a)$.
Taylor’s Theorem
Let $f\colon[a,b]\to\mathbb{R}$ be $n$-times differentiable on $[a,b]$. Then,
as $x\to a$, $f(x) = f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \ldots + \frac{f^{(n-1)}(a)}{(n-1)!}(x-a)^{n-1} + O((x-a)^n)$.
Finding roots of functions and fixed points
Let $f\colon\mathbb{R}\to\mathbb{R}$ be continuous. We are interested in methods for finding zeros, that is, roots of $f$: in other words, $x\in\mathbb{R}$ such that $f(x) = 0$.
Definition. If $f\colon\mathbb{R}^n\to\mathbb{R}$, $n\in\mathbb{N}$, is $k$-times continuously differentiable, then we write $f\in C^k(\mathbb{R}^n)$ or just $f\in C^k$. For $k = 0$, $f\in C^0$ or $f\in C(\mathbb{R}^n)$ or $f\in C([a,b])$ just means that $f$ is assumed to be continuous.
Bisection method
The intermediate value theorem for continuous functions implies that $x_1 < x < x_2$ with $f(x) = 0$ exists if $f(x_1)f(x_2) < 0$, i.e., there is a sign change. The bisection method is based on halving the interval such that the sign condition is preserved. Note that, in principle, we first have to find an interval $[x_1, x_2]$ with a sign change.
Let us analyze the convergence rate. Let $[a,b]$ be an interval. After $k$ steps the interval of analysis has length $\frac{b-a}{2^k}$, which converges to zero as $k\to\infty$. Let us look for a neighborhood of radius $\delta > 0$, so that
$$\frac{b-a}{2^k} \le 2\delta \iff 2^{k+1} \ge \frac{b-a}{\delta} \iff k \ge \log_2\!\left(\frac{b-a}{\delta}\right) - 1.$$
Every step reduces the error by a factor $\frac{1}{2}$. The convergence rate is thus linear.
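A minimal Python sketch of the bisection method (the example function, interval and tolerance are illustrative choices):

```python
def bisection(f, a, b, tol=1e-10):
    """Find a root of f in [a, b], assuming f(a) and f(b) have opposite signs."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "need a sign change on [a, b]"
    while b - a > 2 * tol:
        m = (a + b) / 2
        fm = f(m)
        if fa * fm <= 0:       # sign change in [a, m]
            b, fb = m, fm
        else:                  # sign change in [m, b]
            a, fa = m, fm
    return (a + b) / 2

# Example: the root of x^2 - 2 on [1, 2] is sqrt(2) = 1.41421356...
print(bisection(lambda x: x * x - 2, 1.0, 2.0))
```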
Newton’s method
Assume that $f\in C^1$. For an initial value $x_0$, consider the iteration
$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}, \qquad k = 0, 1, \ldots$$
Heuristics. If $f(x^*) = 0$ and $f\in C^2$, by Taylor’s expansion,
$$0 = f(x^*) = f(x_0) + (x^*-x_0)f'(x_0) + \frac{(x^*-x_0)^2}{2}f''(\xi)$$
for $\xi\in[x_0, x^*]$, and upon neglecting the 2nd order term,
$$x^* \approx x_0 - \frac{f(x_0)}{f'(x_0)}.$$
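A minimal Python sketch of the iteration (the stopping criterion on the step size and the example function are illustrative choices):

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton iteration x_{k+1} = x_k - f(x_k) / f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: sqrt(2) as the root of f(x) = x^2 - 2, starting from x0 = 1.5.
print(newton(lambda x: x * x - 2, lambda x: 2 * x, 1.5))   # ~1.4142135623730951
```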
Theorem. If $f\in C^2$ and $x_0$ is sufficiently good (i.e. close to the root $x^*$), and if $f'(x^*) \neq 0$, then Newton’s method converges quadratically.
Proof. By Taylor’s expansion, it follows that
$$x^* = x_k - \frac{f(x_k)}{f'(x_k)} - \frac{(x^*-x_k)^2}{2}\,\frac{f''(\xi_k)}{f'(x_k)}$$
for some $\xi_k\in[x^*, x_k]$. Take $x_{k+1}$ from the method and subtract,
$$x_{k+1} - x^* = (x^*-x_k)^2\underbrace{\frac{f''(\xi_k)}{2f'(x_k)}}_{\le D}.$$
In other words,
$$|x_{k+1}-x^*| \le D|x_k - x^*|^2$$
for $x_k$ close to $x^*$, and thus $x_k\to x^*$ as $k\to\infty$.
Hence the method is quadratic. Note that $f'(x_k)$ does not vanish, by continuity, if $x_k$ is close to $x^*$. $\square$
What happens if $f'(x^*) = 0$? Then
$$x_{k+1} - x^* = (x^*-x_k)^2\,\frac{f''(\xi_k)}{2\underbrace{f'(x_k)}_{\to 0}}$$
as $k\to\infty$.
By Taylor’s expansion ($\eta$ = “eta”):
$$f'(x_k) = \underbrace{f'(x^*)}_{=0} + (x_k-x^*)f''(\eta_k) = (x_k-x^*)f''(\eta_k)$$
for some $\eta_k\in[x^*, x_k]$, and hence
$$x_{k+1} - x^* = \frac{f''(\xi_k)}{2f''(\eta_k)}\,(x_k - x^*).$$
The method has degenerated to a linear method!
Sometimes it can be difficult or computationally expensive to compute the derivative $f'(x_k)$. Newton’s method can be adapted by approximating the derivative by a difference quotient. The secant method is the following two-step recursive algorithm:
$$x_{k+1} = x_k - \frac{f(x_k)(x_k - x_{k-1})}{f(x_k) - f(x_{k-1})}, \qquad k = 1, 2, \ldots$$
with two distinct starting points $x_0 \neq x_1$. The convergence rate is $\frac{1+\sqrt{5}}{2} \approx 1.62$, the golden ratio.
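A minimal Python sketch of the secant method (same illustrative example as above; the guard against a vanishing denominator is a practical safeguard, not part of the formula):

```python
def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant iteration: Newton's method with a difference-quotient derivative."""
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if f1 == f0:                 # avoid division by zero near convergence
            break
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x1 - x0) < tol:
            break
    return x1

# Example: root of x^2 - 2 with starting points x0 = 1, x1 = 2.
print(secant(lambda x: x * x - 2, 1.0, 2.0))   # ~1.4142135623730951
```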
Fixed point iteration
Definition. A point x∈R is called a fixed point of φ:R→R if φ(x)=x.
We could for instance use Newton’s method to find fixed points by setting f(x):=φ(x)−x.
Banach’s Fixed Point Theorem. Suppose that $\varphi$ is a contraction, that is, there exists a constant $L < 1$ such that
$$|\varphi(x) - \varphi(y)| \le L|x - y|$$
for all $x, y\in\mathbb{R}$. Then there exists a unique fixed point $x^*\in\mathbb{R}$ of $\varphi$, i.e., $\varphi(x^*) = x^*$, and the fixed point iteration $\varphi^n := \underbrace{\varphi\circ\ldots\circ\varphi}_{n\text{-times}}$ satisfies $\lim_{n\to\infty}\varphi^n(x_0) = x^*$ for any starting point $x_0\in\mathbb{R}$. The convergence rate is at least linear.
Proof. We prove that the sequence $\{\varphi^k(x_0)\}_{k=0}^{\infty} = \{x_k\}_{k=0}^{\infty}$ is a Cauchy sequence. Let $k > j$. Then, by the triangle inequality,
$$|x_k - x_j| \le \underbrace{|x_k - x_{k-1}| + |x_{k-1} - x_{k-2}| + \ldots + |x_{j+1} - x_j|}_{(k-j)\text{ summands}}.$$
Furthermore,
$$|x_m - x_{m-1}| = |\varphi(x_{m-1}) - \varphi(x_{m-2})| \le L|x_{m-1} - x_{m-2}| \le L^{m-1}|x_1 - x_0|.$$
Hence, by the geometric series,
$$|x_k - x_j| \le L^j\,\frac{1 - L^{k-j}}{1 - L}\,|x_1 - x_0|.$$
If $k > N$, $j > N$, then
$$|x_k - x_j| \le L^N\,\frac{1}{1 - L}\,|x_1 - x_0| \to 0$$
as $N\to\infty$, which proves that $\{x_k\}$ is a Cauchy sequence. The linear convergence rate also follows from this estimate.
The existence of a fixed point follows from the continuity of $\varphi$ (as contractions are uniformly continuous, in fact, even Lipschitz continuous) as follows:
$$x^* = \lim_{k\to\infty}x_k = \lim_{k\to\infty}x_{k+1} = \lim_{k\to\infty}\varphi(x_k) = \varphi\!\left(\lim_{k\to\infty}x_k\right) = \varphi(x^*). \qquad\square$$
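A minimal Python sketch of the fixed point iteration (the example map $\varphi = \cos$ is a standard illustration; it is a contraction on $[0,1]$):

```python
import math

def fixed_point(phi, x0, tol=1e-12, max_iter=100):
    """Iterate x_{k+1} = phi(x_k) until successive iterates are close."""
    x = x0
    for _ in range(max_iter):
        x_new = phi(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: the fixed point of cos is x* ≈ 0.739085 (linear convergence).
print(fixed_point(math.cos, 1.0))
```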
Theorem. Assume that $\varphi\in C^p$ for $p \ge 1$. Furthermore, assume that $\varphi$ has a fixed point $x^*$ and that $\varphi'(x^*) = \varphi''(x^*) = \ldots = \varphi^{(p-1)}(x^*) = 0$ for $p \ge 2$, and $|\varphi'(x^*)| < 1$ if $p = 1$. Then the fixed point sequence $\{\varphi^k(x_0)\}$ converges to $x^*$ at least with rate $p$, provided that the starting point $x_0$ is sufficiently close to $x^*$. If, in addition, $\varphi^{(p)}(x^*) \neq 0$, then the rate of convergence is precisely $p$.
Proof. First note that by Banach’s fixed point theorem the iteration indeed converges to $x^*$ for suitable starting points $x_0$. By the Taylor expansion,
$$x_{k+1} - x^* = \varphi(x_k) - \varphi(x^*) = \sum_{l=1}^{p-1}\frac{\varphi^{(l)}(x^*)}{l!}(x_k - x^*)^l + \frac{\varphi^{(p)}(\xi_k)}{p!}(x_k - x^*)^p$$
for some $\xi_k$ between $x^*$ and $x_k$. The sum is empty in the case $p = 1$. Since $\varphi^{(l)}(x^*) = 0$ for $1 \le l \le p-1$, we get that
$$|x_{k+1} - x^*| = \frac{|\varphi^{(p)}(\xi_k)|}{p!}|x_k - x^*|^p.$$
By continuity, there exists $C > 0$ (with $C < 1$ for $p = 1$) such that $\frac{|\varphi^{(p)}(\xi_k)|}{p!} \le C$ for $\xi_k$ sufficiently close to $x^*$ (that is, for sufficiently large $k$). Thus,
$$|x_{k+1} - x^*| \le C|x_k - x^*|^p$$
for large $k$, and thus the rate of convergence is at least $p$. Note that for $p = 1$, this also proves convergence, since $|x_{k+1} - x^*| = \underbrace{|\varphi'(\xi_k)|}_{<1}\,|x_k - x^*|$.
If $\varphi^{(p)}(x^*) \neq 0$, then by continuity, there exists $K > 0$ such that $\frac{|\varphi^{(p)}(\xi_k)|}{p!} \ge K$
for $\xi_k$ sufficiently close to $x^*$. Thus
$$|x_{k+1} - x^*| \ge K|x_k - x^*|^p,$$
which implies that the rate of convergence cannot be higher than $p$. Thus the rate of convergence is precisely $p$. $\square$
Note. From the above proof, we expect that close to the fixed point $x^*$,
$$|x_{k+1} - x^*| \approx \frac{|\varphi^{(p)}(x^*)|}{p!}|x_k - x^*|^p,$$
when $\varphi'(x^*) = \varphi''(x^*) = \ldots = \varphi^{(p-1)}(x^*) = 0$,
but $\varphi^{(p)}(x^*) \neq 0$.
Polynomial interpolation
Idea. Approximate a function $f\colon\mathbb{R}\to\mathbb{R}$ over $[a,b]$ by a polynomial $p$ such that at distinct data points $(x_i, y_i)$, $i = 0, 1, \ldots, n$, the approximation is exact, that is,
$$f(x_i) = y_i = p(x_i), \qquad \text{for all } i = 0, 1, \ldots, n.$$
We may call $x_i$ a node and $y_i$ a value.
We need at least 2 data points. We usually just assume that $x_i \neq x_j$ for $i \neq j$.
Note. Interpolation polynomials are not per se unique; for instance, the data $\{(-1,1), (1,1)\}$ can be interpolated by $p(x) = 1$, $q(x) = x^2$, or $r(x) = x^4 - x^2 + 1$. However, we will see later that $p$ is the unique interpolation polynomial with $\deg p \le 1 = n$.
Example. $(1,2)$, $(2,3)$, $(3,6)$, as data set $\{(x_i, y_i) : i = 0, 1, 2\}$ on the interval $[1,3]$.
We are looking for a polynomial $p_2(x) = \sum_{j=0}^{2}c_j x^j$, which is chosen to be of 2nd order because we have 3 data points and 3 unknown coefficients.
We can formulate the problem in matrix form:
$$\begin{pmatrix}1 & x_0 & x_0^2\\ 1 & x_1 & x_1^2\\ 1 & x_2 & x_2^2\end{pmatrix}\begin{pmatrix}c_0\\ c_1\\ c_2\end{pmatrix} = \begin{pmatrix}y_0\\ y_1\\ y_2\end{pmatrix}.$$
The coefficient matrix is a so-called Vandermonde matrix, which has determinant
$$\det\begin{pmatrix}1 & x_0 & x_0^2\\ 1 & x_1 & x_1^2\\ 1 & x_2 & x_2^2\end{pmatrix} = \prod_{i<j}(x_j - x_i) \neq 0,$$
and is thus invertible. Here,
$$\begin{pmatrix}1 & 1 & 1\\ 1 & 2 & 4\\ 1 & 3 & 9\end{pmatrix}\begin{pmatrix}c_0\\ c_1\\ c_2\end{pmatrix} = \begin{pmatrix}2\\ 3\\ 6\end{pmatrix}.$$
As a result, $c_0 = 3$, $c_1 = -2$, and $c_2 = 1$, and thus $p_2(x) = x^2 - 2x + 3$.
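A minimal Python sketch of this computation (assuming NumPy is available; `np.vander` builds the Vandermonde matrix):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # nodes
y = np.array([2.0, 3.0, 6.0])          # values
V = np.vander(x, increasing=True)      # columns 1, x, x^2
c = np.linalg.solve(V, y)              # coefficients c0, c1, c2
print(c)                               # [ 3. -2.  1.]  ->  p2(x) = x^2 - 2x + 3
```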
The computational complexity of solving the linear system is $O(n^3)$. We used the natural basis for the polynomials. What would be the ideal basis?
Definition. (Lagrange basis polynomials) Suppose that $x_i \neq x_j$ if $i \neq j$. We call
$$\varphi_i(x) := \prod_{\substack{j=0\\ j\neq i}}^{n}\frac{x - x_j}{x_i - x_j}$$
the ith Lagrange basis polynomial.
The Lagrange interpolation polynomial is given by
$$p(x) := \sum_{i=0}^{n} y_i\,\varphi_i(x).$$
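A minimal Python sketch evaluating the Lagrange form directly (using the same illustrative data as in the Vandermonde example):

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolation polynomial p at the point x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)   # i-th Lagrange basis polynomial
        total += yi * basis
    return total

# Same data as above: p2(x) = x^2 - 2x + 3.
xs, ys = [1.0, 2.0, 3.0], [2.0, 3.0, 6.0]
print([lagrange_eval(xs, ys, t) for t in (1.0, 2.0, 2.5, 3.0)])  # [2.0, 3.0, 4.25, 6.0]
```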