Montgomery reduction algorithm
Montgomery reduction is a technique to speed up backtoback modular multiplications by transforming the numbers into a special form. Here I will explain how the algorithm works in precise detail, give mathematical justifications, and provide working code as a demonstration.
For a single multiplication, Montgomery is inferior to doing the modular multiplication directly. But for a chain of multiplications, such as in modular exponentiation, we transform the input numbers into Montgomery form, perform numerous multiplications, and transform back to standard numbers at the end.
Summary
Steps to compute \(c = (a b \text{ mod } n)\):
Choose \(r \in \mathbb{N}\) such that \(r > n\) and \(\text{gcd}(r, n) = 1\).
\(k = \frac{r (r^{1} \text{ mod } n) \,  \, 1}{n}\).
\(\bar{a} = (a r \text{ mod } n)\).
\(\bar{b} = (b r \text{ mod } n)\).\(x = \bar{a} \bar{b}\).
\(s = (x k \text{ mod } r)\).
\(t = x + s n\).
\(u = \frac{t}{r}\).
\(\bar{c} = \text{if } (u < n) \text{ then } (u) \text{ else } (u  n)\).
\(c = (\bar{c} r^{1} \text{ mod } n)\).
Detailed algorithm
Initialization

Problem: We want to compute many instances of \(c \equiv a × b\text{ mod }n\). We will compute with many values of \(a\) and \(b\) but keep \(n\) fixed.

Choose an \(r\) that is greater than \(n\) and coprime with \(n\). Typically we let \(r\) be a power of 2, which means \(n\) needs to be odd and greater than 3.
Later in the algorithm, we will perform modulo by \(r\) and division by \(r\). With \(r\) being a power of 2, these operations respectively become inexpensive bit masking and right shifting.
If the base is 10 due to performing decimal arithmetic by hand or using BCD, then \(r\) should accordingly be a power of 10. The reasoning behind the algorithm is still all valid.

Let \(k = \frac{r (r^{1} \text{ mod } n) \,  \, 1}{n}\). (This division is exact.)
The reciprocal \(r^{1} \text{ mod } n\) exists because \(r\) is coprime with \(n\). We know that \(r \, r^{1} \equiv 1 \text{ mod } n\). Thus \(r \, r^{1} = 1 + k n\) for some nonnegative integer \(k\).
\(r^{1} \text{ mod } n\) is computed by the extended Euclidean algorithm.
Outer algorithm

Since we are doing arithmetic modulo \(n\), we assume that all input and output numbers are in the range \([0, n)\). Intermediate results in the algorithm may be larger but not negative.
In this discussion, we carefully distinguish between equations on integers (\(x = y\)) versus congruences of integers modulo a number (\(x \equiv y \text{ mod } n\)). We also distinguish the use of \(\text{mod}\) as an arithmetic operator versus its use in congruence relations.

First convert the input numbers to Montgomery form: \(\bar{a} = (a r \text{ mod } n)\), \(\bar{b} = (b r \text{ mod } n)\).
Assume that our desired output in Montgomery form is \(\bar{c} = (c r \text{ mod } n) = (a b r \text{ mod } n)\).

Compute the product directly, so we have: \(\bar{a} \times \bar{b} \equiv (a r) (b r) \equiv a b r^2 \equiv \bar{c} r \text{ mod } n\).
Note that \(0 \le \bar{a} \times \bar{b} < n^2\).

Now compute the reduction: \(\bar{c} = R(\bar{a} \times \bar{b}) = (a b r \text{ mod } n)\).

Finally convert the output back to standard form: \(c = (\bar{c} \, r^{1} \text{ mod } n)\).
Reduction function

The reduction function \(R : [0, n^2) \rightarrow [0, n)\) effectively computes \(R(x) = (x \, r^{1} \text{ mod } n)\) in an efficient way.

Let \(s = (x k \text{ mod } r)\). (\(k\) is defined in the initialization.)
We know that \(0 \le s < r\). By the properties of modular arithmetic, we can replace \(x\) in this expression with \((x \text{ mod } r)\) to get the same result more efficiently.

Let \(t = x + s n\).
We know that \(0 \le x < n^2 < r n\) and \(0 \le s n < r n\), thus \(0 \le t < 2 r n\). (Recall that we chose \(r > n\).)
We claim that \(t\) is a multiple of \(r\), with proof:
\(x + s n \: \equiv \: x + x k n \: \equiv \: x (1 + k n) \: \equiv \: x \, (r \, r^{1}) \: \equiv \: 0 \: \text{ mod } r\).
Therefore we can write \(t = u r\) for some nonnegative integer \(u\).We know that \(t \equiv x \text{ mod } n\) because adding \(s n\) preserves the congruence.

Let \(u = \frac{t}{r}\). (This division is exact.)
We can see that \(u \: \equiv \: u \times 1 \: \equiv \: u \, (r \, r^{1}) \: \equiv \: (u r) r^{1} \: \equiv \: t \, r^{1} \: \text{ mod } n\).
Moreover, \(u \equiv x \, r^{1} \text{ mod } n\).

If \(0 \le u < n\) then return \(u\). Otherwise return \(u  n\).
Because \(0 \le t < 2 r n\), we have \(0 \le u < 2 n\). Hence this simple ifelse performs a modulo by \(n\) correctly. Furthermore, the result is still equal (and congruent) to \(x \, r^{1} \text{ mod } n\).
This completes the explanation and proof of the Montgomery reduction algorithm.
Source code
My runnable implementations of Montgomery reduction for modular multiplication and exponentiation:
 MontgomeryReducer.java (Java library)
 MontgomeryReducerDemo.java (commandline main program)
 MontgomeryReducerTest.java (JUnit suite)
 montgomeryreducer.py (Python library and selfcheck)