Numerical Analysis

Error Analysis

Understanding numerical errors, floating-point representation, and convergence analysis.

1. Floating-Point Arithmetic

IEEE 754 Double-Precision Format

A 64-bit floating-point number is represented as:

(-1)^s \times 2^{e-1023} \times (1 + f)
Sign (s) 1 bit: 0 for positive, 1 for negative
Exponent (e) 11 bits: biased by 1023
Mantissa (f) 52 bits: fractional part

Machine Epsilon

\varepsilon \approx 2.22 \times 10^{-16}
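
Machine epsilon is the gap between 1 and the next larger representable double, and it can be measured directly. A minimal sketch in Python, whose float is an IEEE 754 double:

```python
eps = 1.0
while 1.0 + eps / 2 > 1.0:  # halve until adding it no longer changes 1.0
    eps /= 2
print(eps)  # 2.220446049250313e-16, i.e. 2**-52
```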

Interactive IEEE 754 Converter

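The converter decodes a double into the three fields above. A minimal sketch of the same decoding in Python, using the standard struct module to expose the raw bits:

```python
import struct

def decode_double(x: float):
    # View the 64 bits of x as an integer (big-endian byte order).
    bits = int.from_bytes(struct.pack('>d', x), 'big')
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)   # 52 bits
    return sign, exponent - 1023, fraction / 2**52

s, e, f = decode_double(6.238)
print(f"sign={s}, exponent={e}, fraction={f:.6f}")
# 6.238 = (-1)^0 * 2^2 * (1 + 0.5595)
```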

Normalization

A floating-point number in base 10 is normalized when the leading digit d₁ ≠ 0, i.e., the mantissa satisfies 0.1 ≤ |m| < 1:

x = (-1)^s \times 0.d_1 d_2 d_3 \ldots \times 10^e, \quad d_1 \neq 0

Static Examples

6.238 = (−1)⁰ × 0.6238 × 10¹ normalized ✓
−0.0014 = −0.0014 × 10⁰ → normalized: −0.14 × 10⁻²
0.00345 → normalized: 0.345 × 10⁻²

Interactive Normalizer

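A sketch of the normalization step in Python; normalize is a hypothetical helper mirroring the examples above:

```python
import math

def normalize(x: float):
    # Return (sign, mantissa, exponent) with 0.1 <= mantissa < 1.
    if x == 0:
        return 0, 0.0, 0
    sign = 0 if x > 0 else 1
    e = math.floor(math.log10(abs(x))) + 1
    return sign, abs(x) / 10**e, e

print(normalize(6.238))    # ≈ (0, 0.6238, 1)
print(normalize(-0.0014))  # ≈ (1, 0.14, -2)
```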

Overflow & Underflow

The exponent e is bounded below and above. For IEEE 754 double precision, e ∈ [−1022, 1023]:

e_{\min} = -1022 \leq e \leq 1023 = e_{\max}

Overflow: e > e_max → result is ±Infinity
Underflow: e < e_min → result rounds to 0 (or to a subnormal)

Static Examples

10³⁰⁸ ≈ 1e308: near the maximum representable value (normal)
10³⁰⁹ → Infinity (overflow)
10⁻³⁰⁸ ≈ 1e-308: below the smallest normal value (≈ 2.2e-308), stored as a subnormal
10⁻³²⁴ → rounds to 0, since it lies below half the smallest subnormal, 5e-324 (underflow)

Interactive Classifier

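The classifier's logic reduces to comparing |x| against the IEEE 754 thresholds; a sketch in Python, whose float is an IEEE 754 double:

```python
import math
import sys

def classify(x: float) -> str:
    ax = abs(x)
    if math.isinf(ax):
        return 'overflow (Infinity)'
    if ax == 0.0:
        return 'underflow (zero)'
    if ax < sys.float_info.min:  # smallest normal, 2.2250738585072014e-308
        return 'subnormal'
    return 'normal'

for x in [1e308, 1e309, 1e-308, 5e-324, 1e-324]:
    print(x, classify(x))
```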

2. Error Types

Theory

Absolute Error

E_{abs} = |p - p^*|

Relative Error

E_{rel} = \frac{|p - p^*|}{|p|}

Significant Digits

The approximation p* has n significant digits with respect to p if:

E_{rel} < 5 \times 10^{-n}

Error Calculator

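A sketch of what the calculator computes, using the 5 × 10⁻ⁿ significant-digit criterion above (Problem 1 below serves as the test case):

```python
def error_metrics(p: float, p_star: float):
    e_abs = abs(p - p_star)
    e_rel = e_abs / abs(p)
    # Count the largest n with e_rel < 5 * 10^(-n); assumes p* != p.
    n = 0
    while e_rel > 0 and e_rel < 5 * 10.0 ** -(n + 1):
        n += 1
    return e_abs, e_rel, n

print(error_metrics(2.71828, 2.718))  # ≈ (2.8e-4, 1.03e-4, 4)
```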

Worked Examples

Practice Problems

Compute the absolute error, relative error, and significant digits for each pair.

Problem 1: p = 2.71828, p* = 2.718

Problem 2: p = 1.41421, p* = 1.414

Problem 3: p = 9.8696, p* = 9.87

3. Convergence Rates

Theory

A sequence {αₙ} converges to zero with rate r if:

|\alpha_{n+1}| \leq C|\alpha_n|^r
r = 1 Linear convergence (e.g., O(h))
r = 2 Quadratic convergence (e.g., O(h²))
1 < r < 2 Superlinear convergence

Convergence Visualization

Model error sequences, plotted over 15 iterations:

Linear (r = 1): αₙ = 0.5ⁿ
Quadratic (r = 2): αₙ = 0.5^(2ⁿ)
Superlinear (r ≈ 1.618): αₙ = 0.5^(φⁿ)

Sequence Rate Analyzer

Enter a sequence of error values (comma-separated) to classify its convergence rate.
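
Taking logarithms of |αₙ₊₁| ≈ C·|αₙ|^r gives r ≈ log(αₙ₊₂/αₙ₊₁) / log(αₙ₊₁/αₙ) for each consecutive triple of errors. A minimal sketch of the analyzer:

```python
import math

def estimate_rate(errors):
    # One rate estimate per consecutive triple (a, b, c) of error values.
    return [math.log(c / b) / math.log(b / a)
            for a, b, c in zip(errors, errors[1:], errors[2:])]

linear = [0.5**n for n in range(1, 8)]
quadratic = [0.5**(2**n) for n in range(1, 6)]
print(estimate_rate(linear))     # ≈ [1.0, 1.0, 1.0, 1.0, 1.0]
print(estimate_rate(quadratic))  # ≈ [2.0, 2.0, 2.0]
```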

4. Condition Number

Theory

Condition number of evaluating f at x:

\kappa(x) = \left|\frac{x \cdot f'(x)}{f(x)}\right|

Well-conditioned: κ ≈ 1

Small input changes produce small output changes.

Ill-conditioned: κ ≫ 1

Small input changes produce large output changes.

Relative error amplification:

\text{rel\_err}(f(x)) \approx \kappa(x) \cdot \text{rel\_err}(x)

The visualization shows f(x) near the evaluation point. Brackets illustrate how an input perturbation ε maps to an output perturbation κ · ε.

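A sketch of a numerical condition-number estimator; the central-difference derivative and the step size h = 1e-6 are assumptions of this sketch:

```python
import math

def condition_number(f, x, h=1e-6):
    # kappa(x) = |x * f'(x) / f(x)|, with f' from a central difference.
    fprime = (f(x + h) - f(x - h)) / (2 * h)
    return abs(x * fprime / f(x))

print(condition_number(math.exp, 1.0))       # ≈ 1.0: well-conditioned
print(condition_number(math.log, 1.000001))  # ≈ 1e6: ill-conditioned near x = 1
```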

5. Series Convergence

Theory

An infinite series is a sum of infinitely many terms:

S = \sum_{n=1}^{\infty} a_n

The N-th partial sum accumulates the first N terms:

S_N = \sum_{n=1}^{N} a_n = a_1 + a_2 + \cdots + a_N

Convergence means the partial sums approach a finite limit:

S_N \xrightarrow{N\to\infty} S \in \mathbb{R}

Ratio Test

L = \lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right|

L < 1 → converges; L > 1 → diverges; L = 1 → inconclusive

Comparison Test

0 \le a_n \le b_n

If Σbₙ converges, so does Σaₙ

Integral Test

\sum a_n \sim \int_1^\infty f(x)\,dx

The sum and the integral converge or diverge together when f is positive and decreasing with aₙ = f(n)

Preset Series

a_n = r^n

Converges when |r| < 1; the running example uses r = 0.5.

Custom Series

Enter a formula for a_n using variable n. Supports: + - * / ^ sin cos sqrt abs log exp pi e n!

With aₙ = 0.5ⁿ and N = 50 terms, the partial sums appear to converge to ≈ 1.000000. The ratio test estimate |a_{n+1}/a_n| ≈ 0.5 (< 1) supports convergence; a sketch of the computation follows.
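
A sketch of the partial-sum accumulation and the tail ratio-test estimate:

```python
def analyze_series(a, N):
    # S_N = a(1) + ... + a(N), plus |a(N)/a(N-1)| as a ratio-test estimate.
    s = 0.0
    for n in range(1, N + 1):
        s += a(n)
    return s, abs(a(N) / a(N - 1))

s, ratio = analyze_series(lambda n: 0.5**n, 50)
print(s)      # ≈ 1.0
print(ratio)  # 0.5 (< 1, supports convergence)
```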

Visualization

The plot shows the partial sums Sₙ, the term sizes |aₙ|, and the estimated limit.

Partial Sums Table (first 20 rows)

n     aₙ          Sₙ
1     0.500000    0.500000
2     0.250000    0.750000
3     0.125000    0.875000
4     0.062500    0.937500
5     0.031250    0.968750
6     0.015625    0.984375
7     0.007813    0.992188
8     0.003906    0.996094
9     0.001953    0.998047
10    0.000977    0.999023
11    0.000488    0.999512
12    0.000244    0.999756
13    0.000122    0.999878
14    6.104e-5    0.999939
15    3.052e-5    0.999969
16    1.526e-5    0.999985
17    7.629e-6    0.999992
18    3.815e-6    0.999996
19    1.907e-6    0.999998
20    9.537e-7    0.999999

6. Floating-Point Arithmetic Proof

Theory

Floating-point representation model

\text{fl}(x) = x(1 + \delta), \quad |\delta| \leq u

Unit roundoff for IEEE 754 double precision

u = 2^{-53} \approx 1.11 \times 10^{-16}

Arithmetic operations (each rounded once)

\text{fl}(x \oplus y) = (x + y)(1 + \varepsilon_1), \quad |\varepsilon_1| \leq u
\text{fl}(x \otimes y) = (x \cdot y)(1 + \varepsilon_2), \quad |\varepsilon_2| \leq u

Error accumulation after n operations

\left|\frac{\hat{f} - f}{f}\right| \lesssim n \cdot u \quad \text{(first-order bound)}

Chopping (truncation)

Relative error bound: \beta^{1-k}

Rounding (nearest)

Relative error bound: \tfrac{1}{2}\beta^{1-k}

Step-by-Step Proof Walkthrough


Step 1: Normalized Floating-Point Form

\pm 0.d_1 d_2 \cdots d_k \times \beta^e, \quad d_1 \neq 0

Any nonzero real number is written in normalized form with base β (typically 2 for binary), k significant digits d₁d₂…dₖ where d₁ ≠ 0, and exponent e. For IEEE 754 double precision: β = 2, k = 53 (1 implicit + 52 stored).


Interactive: Catastrophic Cancellation

Computing (1 + ε) − 1 should yield ε. Watch the relative error grow as ε approaches machine epsilon.

ε value      Exact result   Computed result   Relative error
1.0000e-13   1.0000e-13     9.9920e-14        7.99e-4
1.0000e-14   1.0000e-14     9.9920e-15        7.99e-4
1.0000e-15   1.0000e-15     1.1102e-15        1.10e-1
1.0000e-16   1.0000e-16     0                 1.00e+0
1.0000e-17   1.0000e-17     0                 1.00e+0

For ε = 10⁻¹⁶ ≈ u, JavaScript (IEEE 754) computes (1 + ε) = 1 exactly due to rounding, so (1 + ε) − 1 = 0. The relative error is 100%.

For ε = 10⁻¹⁷ (below machine epsilon), the situation is the same — ε is rounded to 0 when added to 1.
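
The table is easy to reproduce; a sketch in Python, where float is the same IEEE 754 double as JavaScript's Number:

```python
for eps in [1e-13, 1e-14, 1e-15, 1e-16, 1e-17]:
    computed = (1.0 + eps) - 1.0           # the leading 1s cancel
    rel_err = abs(computed - eps) / eps
    print(f"eps={eps:.4e}  computed={computed:.4e}  rel_err={rel_err:.2e}")
```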

Error Propagation Calculator

Given two values and their relative errors, compute the propagated relative error bound.

Example: x = 1.5 and y = 2.3, each carrying relative error 10⁻¹⁰.

Result: 1.5 + 2.3 = 3.800000000

Propagated rel. error bound: (1.5 · 10⁻¹⁰ + 2.3 · 10⁻¹⁰) / 3.8 + u ≈ 1.0000e-10

Addition error bound

\rho_{x+y} \leq \frac{|x|\,\rho_x + |y|\,\rho_y}{|x+y|} + u
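
A sketch of the bound as a function:

```python
U = 2.0 ** -53  # unit roundoff for IEEE 754 double precision

def add_rel_error_bound(x, rho_x, y, rho_y):
    # rho_{x+y} <= (|x| * rho_x + |y| * rho_y) / |x + y| + u
    return (abs(x) * rho_x + abs(y) * rho_y) / abs(x + y) + U

print(add_rel_error_bound(1.5, 1e-10, 2.3, 1e-10))  # ≈ 1.0e-10
```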

Loss of Significance

Theory

When two nearly equal numbers are subtracted, the leading significant digits cancel and the relative error of the result can be much larger than the relative errors of the individual operands.

If E_r(x_A) and E_r(y_A) are small, the relative error of z_A = x_A - y_A satisfies:

E_r(z_A) \approx \frac{|x|\,E_r(x_A) + |y|\,E_r(y_A)}{|x - y|}

When x ≈ y, the denominator |x − y| is tiny while the numerator remains of order |x|, causing catastrophic amplification.

Example 1.11 — Subtraction of Nearly Equal Numbers

Given values and their approximations (rounded):

x = 7.6545428, \quad x_A = 7.6545421 \;(6 \text{ sig. digits})
y = 7.6544201, \quad y_A = 7.6544200 \;(7 \text{ sig. digits})

Step 1: Compute the subtraction

z = x - y = 7.6545428 - 7.6544201 = 0.0001227
z_A = x_A - y_A = 7.6545421 - 7.6544200 = 0.0001221

Step 2: Absolute error of the result

|z - z_A| = |0.0001227 - 0.0001221| = 0.6 \times 10^{-6}

Although x_A had 6 significant digits, z_A has only 3 significant digits — three digits were lost to cancellation.

Step 3: Error amplification factor

E_r(z_A) \approx 53736 \times E_r(x_A)

The relative error of the result is roughly 53,736 times larger than the relative error of the original approximation — a dramatic loss of significance.

Example 1.13 — Reformulation to Avoid Cancellation

Consider the function:

f(x) = x\bigl(\sqrt{x+1} - \sqrt{x}\bigr)

For large x, the terms √(x+1) and √x are nearly equal, causing catastrophic cancellation. Rationalising the numerator gives the equivalent but numerically stable form:

f(x) = \frac{x}{\sqrt{x+1} + \sqrt{x}}
x         Naive (cancellation)   Stable (rationalised)
1         0.414214               0.414214
100       4.987562               4.987562
1,000     15.807437              15.807437
10,000    49.998750              49.998750
100,000   158.113488             158.113488

At x = 100,000 the two columns still agree to the digits shown, because double precision keeps enough guard digits here, but the naive form has already lost several significant digits internally, and the loss worsens as x grows (the sketch below pushes x to 10¹⁵). The stable form x/(√(x+1) + √x) is algebraically identical but avoids subtracting nearly equal square roots, preserving all significant digits.
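
A sketch comparing the two forms, pushing x beyond the table to expose the cancellation:

```python
import math

def naive(x):
    return x * (math.sqrt(x + 1) - math.sqrt(x))  # subtracts nearly equal values

def stable(x):
    return x / (math.sqrt(x + 1) + math.sqrt(x))  # no cancellation

for x in [1.0, 1e5, 1e10, 1e15]:
    print(f"x={x:.0e}  naive={naive(x):.12g}  stable={stable(x):.12g}")
# At x = 1e15 the naive form has lost most of its significant digits.
```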

Propagated Error & Stability

Theory

By the Mean Value Theorem, if x_A approximates x, then the error in f(x) propagates as:

f(x) - f(x_A) \approx f'(x)\,(x - x_A)

Converting to relative errors:

E_r(f(x)) \approx \left|\frac{x\,f'(x)}{f(x)}\right| E_r(x) = \kappa(x)\,E_r(x)

Well-conditioned: κ ≈ 1

Input error is not amplified.

Ill-conditioned: κ ≫ 1

Input errors are amplified dramatically.

Example 1.16 — Stability of f(x) = √(x+1) − √x

Consider x = 12345. The true value is:

f(12345) = \sqrt{12346} - \sqrt{12345} \approx 0.004500

Step 1: 3-digit rounding of each square root

\sqrt{12346} \approx 111.112 \xrightarrow{\text{3-digit}} 111
\sqrt{12345} \approx 111.108 \xrightarrow{\text{3-digit}} 111
f_A = 111 - 111 = 0

Step 2: Relative error of the naive computation

E_r(f_A) = \frac{|f(12345) - f_A|}{|f(12345)|} = \frac{|0.004500 - 0|}{0.004500} = 100\%

The computed result is completely wrong — a 100% relative error from 3-digit rounding.

Step 3: Condition number shows the problem itself is well-conditioned

\kappa(12345) = \left|\frac{x\,f'(x)}{f(x)}\right| = \frac{\sqrt{x}}{2\sqrt{x+1}} \approx 0.5

A condition number of about 0.5 means the problem barely amplifies input error at all (consistent with the table below). The 100% error therefore comes not from ill-conditioning but from the algorithm: subtracting two nearly equal rounded square roots cancels every significant digit. The naive formula is numerically unstable even though the problem is well-conditioned.

Step 4: Stable reformulation

f(x) = \sqrt{x+1} - \sqrt{x} = \frac{1}{\sqrt{x+1} + \sqrt{x}}

This algebraically equivalent form avoids subtracting nearly equal numbers and is numerically stable.
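
The 3-digit experiment is easy to replay; round_sig is a hypothetical helper that simulates k-significant-digit rounding:

```python
import math

def round_sig(x: float, k: int = 3) -> float:
    # Round x to k significant digits, simulating k-digit arithmetic.
    if x == 0:
        return 0.0
    return round(x, k - 1 - math.floor(math.log10(abs(x))))

x = 12345
r1, r2 = round_sig(math.sqrt(x + 1)), round_sig(math.sqrt(x))
print(r1 - r2)        # 0.0 -> the naive form loses every digit
print(1 / (r1 + r2))  # ≈ 0.004505, close to the true value 0.004500
```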

Condition Number vs. x for f(x) = √(x+1) − √x

x         f(x)          κ(x)
1         0.41421356    0.4
10        0.15434713    0.5
100       0.04987562    0.5
1,000     0.01580744    0.5
12,345    0.00450003    0.5
100,000   0.00158113    0.5

The condition number settles near 0.5 as x → ∞, so the problem itself stays well-conditioned for all x. The catastrophic error of the naive formula f(x) = √(x+1) − √x is an algorithmic instability caused by cancellation, which the rationalised form eliminates.