Smooth predictive axes for principal surface biplots

This vignette states the theory behind the axes drawn by plot.prinsurf and read by predict.prinsurf. It follows the accompanying manuscript; the package functions are pointed to as each result is introduced. The companion vignette Principal surface biplots with prinsurf covers usage.

Setup

Let \(\mathbf{X}\) be an \(n \times p\) column-centred data matrix with rows \(\mathbf{x}_i^\top\). A principal surface (Hastie and Stuetzle, 1989) is a smooth two-dimensional manifold in \(\mathbb{R}^p\) carrying a coordinate \(\boldsymbol{\lambda} = (\lambda_1, \lambda_2)^\top\) over a region \(\Omega \subset \mathbb{R}^2\), with \[\mathbf{x} \approx \mathbf{f}(\boldsymbol{\lambda}) = \bigl(\hat f_1(\boldsymbol{\lambda}), \ldots, \hat f_p(\boldsymbol{\lambda})\bigr)^\top,\] where each fitted coordinate function \(\hat f_j : \Omega \to \mathbb{R}\) is a smooth scatterplot smoother (local regression) of variable \(j\) on \(\boldsymbol{\lambda}\). Sample \(i\) carries the coordinate \(\boldsymbol{\lambda}_i\) of its projection onto the surface; plotting the \(\boldsymbol{\lambda}_i\) gives the display underlying the biplot, and \(\boldsymbol{\lambda}_c\) denotes their centroid. We write \(\Omega\) for the region supported by data (Estimation, below) and \(\partial\Omega\) for its boundary.

In the package, prinsurf() fits the surface and returns the \(\hat f_j\) as per-variable loess models (object$models) together with the sample coordinates \(\boldsymbol{\lambda}_i\) (object$lambda). The task is to equip the display, for each variable, with a calibrated biplot axis: a curve in \(\Omega\) along which \(x_j\) is read off, and onto which a sample is projected to recover a predicted value.

The axis as a trajectory of the fitted response

The established construction adapts the back-projection trajectory of nonlinear biplots (Gower, Lubbe and le Roux, 2011): for marker values \(\mu_1 < \cdots < \mu_R\), each \(\mu_r\) is represented by one point of its contour line \(\{\boldsymbol{\lambda} : \hat f_j(\boldsymbol{\lambda}) = \mu_r\}\), chosen independently of the others, and the chosen points are joined in marker order. Because the points are selected without reference to their neighbours, the axis is continuous but only piecewise linear, with sharp interior kinks; it approximates poorly the curve along which \(x_j\) actually varies, and predicts poorly.

We instead define the axis as the trajectory along which \(\hat f_j\) increases most rapidly. As \(\hat f_j\) is differentiable, this is the integral curve of its gradient field, smooth by construction.

Definition (gradient-flow axis). Let \(\hat f_j \in C^1(\Omega)\) with \(\nabla \hat f_j \neq \mathbf{0}\) on a connected neighbourhood of \(\boldsymbol{\lambda}_c\). The gradient-flow axis \(\gamma_j : I \to \Omega\) is the maximal solution of \[\frac{d\gamma_j}{ds} = \frac{\nabla \hat f_j(\gamma_j(s))}{\lVert \nabla \hat f_j(\gamma_j(s)) \rVert}, \qquad \gamma_j(0) = \boldsymbol{\lambda}_c, \tag{1}\] integrated in both directions of \(s\) until it reaches \(\partial\Omega\). A point \(\gamma_j(s)\) is calibrated by \(\hat f_j(\gamma_j(s))\), and marker \(\mu\) is placed where \(\hat f_j(\gamma_j(s)) = \mu\).

psaxis() implements this: it evaluates \(\hat f_j\) on a grid, forms \(\nabla \hat f_j\) by central differences, and integrates (1) by fourth-order Runge–Kutta from the supported grid node nearest \(\boldsymbol{\lambda}_c\).

Smoothness and orthogonality

Proposition 1 (smoothness). If \(\hat f_j \in C^k(\Omega)\), \(k \geq 1\), and \(\nabla \hat f_j\) does not vanish along \(\gamma_j\), then \(\gamma_j \in C^k\).

Where \(\nabla \hat f_j \neq \mathbf{0}\) the unit field in (1) is \(C^{k-1}\) and locally Lipschitz; by Picard–Lindelöf the solution exists and is unique, and by smooth dependence on a \(C^{k-1}\) field, \(\gamma_j \in C^k\). The axis is thus as smooth as the surface, in contrast to the merely continuous back-projection axis. kink_max() reports the largest turning angle of an axis after resampling to equal arc length; a gradient-flow axis registers only the curvature the contours genuinely demand, not spurious kinks.

Proposition 2 (orthogonality and monotone calibration). Along \(\gamma_j\) the calibration \(s \mapsto \hat f_j(\gamma_j(s))\) is strictly increasing, with derivative \(\lVert \nabla \hat f_j(\gamma_j(s)) \rVert > 0\), and \(\gamma_j\) crosses every contour line orthogonally.

By the chain rule and (1), \(\tfrac{d}{ds}\hat f_j(\gamma_j(s)) = \nabla \hat f_j^\top \dot\gamma_j = \lVert \nabla \hat f_j \rVert > 0\). The tangent \(\dot\gamma_j\) is parallel to \(\nabla \hat f_j\), which is normal to the level set, so \(\gamma_j\) meets every contour orthogonally. Monotone calibration makes the axis unambiguously readable; the orthogonal crossings are the calibration marks, and express that motion along the axis is the most efficient way to change the variable.

Reduction to the linear biplot

Proposition 3 (linear limit). If the fitted surface is affine, \(\mathbf{f}(\boldsymbol{\lambda}) = \mathbf{c} + \mathbf{B}\boldsymbol{\lambda}\) — as when \(\mathbf{X}\) is exactly two-dimensional — then for each \(j\) the gradient-flow axis is the straight line through \(\boldsymbol{\lambda}_c\) in direction \(\mathbf{b}_j\), the \(j\)th row of \(\mathbf{B}\). This is the prediction biplot axis of the corresponding principal-component biplot.

Affinity gives \(\hat f_j(\boldsymbol{\lambda}) = c_j + \mathbf{b}_j^\top \boldsymbol{\lambda}\), so \(\nabla \hat f_j = \mathbf{b}_j\) is constant and (1) integrates to the line \(\gamma_j(s) = \boldsymbol{\lambda}_c + s\,\mathbf{b}_j / \lVert \mathbf{b}_j \rVert\). In the linear biplot \(\hat x_j = \mathbf{b}_j^\top \boldsymbol{\lambda}\), so \(x_j\) is recovered by projecting \(\boldsymbol{\lambda}\) onto \(\mathbf{b}_j\); the two axes coincide in direction and calibration. The surface coordinates are determined only up to an orthogonal transformation of \(\Omega\), under which this result is equivariant, so the coincidence is frame-independent. The method therefore contains the classical biplot as its affine special case — a sanity check the display should always pass on genuinely two-dimensional data.

Monotonicity: two criteria

Definition (1) requires \(\nabla \hat f_j \neq \mathbf{0}\). A single calibrated axis is meaningful only when \(\hat f_j\) is monotone over the surface, i.e. has no interior extremum. If \(\hat f_j\) attains an extremum at an interior \(\boldsymbol{\lambda}^\star \in \mathrm{int}(\Omega)\) then \(\nabla \hat f_j(\boldsymbol{\lambda}^\star) = \mathbf{0}\), the nearby level sets are closed curves encircling it, no integral curve can pass through it, and any axis would assign one value to a whole closed contour. The height coordinate of a hemisphere, whose contours are concentric circles, is the canonical instance.

Testing this through the vanishing of the estimated gradient is unreliable: a wiggly-but-monotone surface produces near-flat patches where \(\lVert \nabla \hat f_j \rVert\) dips without an extremum. We therefore test the defining property of an extremum directly, through its value relative to the boundary.

Criterion 1 (boundary-extremum test). Let \(\delta > 0\) and \(r_j = \max_\Omega \hat f_j - \min_\Omega \hat f_j\). Variable \(j\) is declared non-monotone if \[\max_{\mathrm{int}(\Omega)} \hat f_j > \max_{\partial\Omega} \hat f_j + \delta r_j \quad\text{or}\quad \min_{\mathrm{int}(\Omega)} \hat f_j < \min_{\partial\Omega} \hat f_j - \delta r_j, \tag{2}\] and monotone otherwise; a non-monotone variable is deferred to contour reading.

The rationale is exact: a continuous function on \(\Omega\) attains its extrema on \(\partial\Omega\) unless it has an interior extremum, so an interior value beyond the entire boundary range certifies a genuine interior peak or basin, while the margin \(\delta r_j\) guards against estimation noise. A monotone ramp — however curved — keeps its extreme values on the boundary and is passed. In the affine case of Proposition 3 the extrema lie on \(\partial\Omega\), so the criterion never fires. In psaxis(), \(\delta\) is the argument delta (default 0.05).

Criterion 1 detects an interior extremum that is also the global extreme. A shallower local extremum, not global but still with \(\nabla \hat f_j = \mathbf{0}\) at an interior point, will not exceed the boundary range, yet it stalls the flow: the integral curve climbs into the local peak and stops, so the axis spans only part of the variable’s range. This is caught after the axis is built, by its coverage.

Criterion 2 (axis coverage). Let \(R_j = \max_i \hat f_j(\boldsymbol{\lambda}_i) - \min_i \hat f_j(\boldsymbol{\lambda}_i)\) be the range of fitted values over the samples. The coverage is \[c_j = \frac{\max_s \hat f_j(\gamma_j(s)) - \min_s \hat f_j(\gamma_j(s))}{R_j}. \tag{3}\] If \(c_j < c_{\min}\) the axis fails to span the variable and \(j\) is deferred.

Coverage measures the deliverable itself rather than inferring it from geometry: a trapping interior maximum, however shallow relative to \(R_j\), truncates the calibrated range and drives \(c_j\) down, whereas a genuinely monotone variable has \(c_j \approx 1\). In psaxis(), \(c_{\min}\) is cover_min (default 0.55).

A variable is given a calibrated axis only if it passes both criteria; otherwise it is read from contours, on which the closed level sets that defeat a single axis are read without ambiguity. The biplot is thus mixed by construction — monotone variables carry calibrated axes read by projection, folded variables are read from contours — mirroring the classical split between axis prediction and contour prediction. plot.prinsurf returns the deferred variables invisibly, and drawing their contours (with contours =) is the intended reading for them.

Reading and prediction

The surface supplies \(\hat f_j(\boldsymbol{\lambda})\) for any coordinate. Two distinct notions of prediction follow, and it is essential to keep them apart.

Surface prediction reconstructs a sample’s value as \(\hat x^{\mathrm{surf}}_{ij} = \hat f_j(\boldsymbol{\lambda}_i)\). It uses the fitted surface alone, not the axis, and is the natural reconstruction of the embedding. This is what fitted.prinsurf returns and what predictivity() scores.

Axis prediction reads the value off the drawn axis: for a monotone variable, let \(\boldsymbol{\lambda}_i^\star = \arg\min_{\boldsymbol{\lambda}\in\gamma_j} \lVert \boldsymbol{\lambda}_i - \boldsymbol{\lambda} \rVert\) be the orthogonal projection of \(\boldsymbol{\lambda}_i\) onto the polyline; then \(\hat x^{\mathrm{ax}}_{ij}\) is the calibration \(\hat f_j\) interpolated linearly at \(\boldsymbol{\lambda}_i^\star\). This is what a reader recovers by eye, and is the sense in which biplot axes are conventionally assessed. It is what predict.prinsurf returns for a monotone variable; for a deferred variable predict returns the surface value \(\hat f_j(\boldsymbol{\lambda}_i)\) instead, which is the contour reading.

The axis is therefore primarily a display device. Its fidelity is measured by how closely axis prediction tracks the data; we summarise it for variable \(j\) by the standardised squared prediction error \[\mathrm{SSPE}_j = \frac{\sum_i (x_{ij} - \hat x^{\mathrm{ax}}_{ij})^2} {\sum_i (x_{ij} - \bar x_j)^2}, \tag{4}\] \(0\) meaning exact recovery, \(1\) no better than the mean. axis_predictive_error() reports the root of the numerator of (4) per axis — the standard predictive error, in the units of the working data.

The reason a single axis cannot in general recover the data exactly is geometric, and it determines where the surface’s predictive advantage can and cannot be realised.

Proposition 4 (exactness of axis reading). For a monotone variable, the orthogonal-projection reading equals the surface value, \(\hat x^{\mathrm{ax}}_{ij} = \hat f_j(\boldsymbol{\lambda}_i)\) for every sample, if and only if the level sets of \(\hat f_j\) are parallel straight lines (i.e. \(\hat f_j\) is affine over the samples). When the contours curve, with \(d_i = \lVert \boldsymbol{\lambda}_i - \boldsymbol{\lambda}_i^\star \rVert\) the off-axis distance, \[\hat f_j(\boldsymbol{\lambda}_i) - \hat x^{\mathrm{ax}}_{ij} = \tfrac{1}{2}\,\kappa_i\, \lVert \nabla \hat f_j(\boldsymbol{\lambda}_i^\star) \rVert\, d_i^2 + o(d_i^2), \tag{5}\] where \(\kappa_i\) is the curvature of the contour through \(\boldsymbol{\lambda}_i^\star\).

By Proposition 2 the axis is tangent to \(\nabla \hat f_j\), so the perpendicular to the axis at \(\boldsymbol{\lambda}_i^\star\) is tangent to the contour there. The displacement \(\boldsymbol{\lambda}_i - \boldsymbol{\lambda}_i^\star\) is to leading order along that perpendicular; expanding \(\hat f_j\) from \(\boldsymbol{\lambda}_i^\star\) in this direction, the first-order term vanishes because \(\nabla \hat f_j\) is normal to the contour tangent, leaving the second-order term in (5). The error vanishes for all samples exactly when \(\kappa_i \equiv 0\), i.e. straight contours.

Single-axis reading is thus intrinsically lossy, in proportion to contour curvature and the square of off-axis distance, and is exact only in the affine limit. The surface itself carries no such loss — it uses the full position \(\boldsymbol{\lambda}_i\) — so the representational advantage of the surface over a plane is recovered by reading from position (sample predictivity) and from contours for folded variables, not by projection onto one curved axis.

Which fidelity measure, and why \([0,1]\) fails

The bounded axis predictivity of the linear biplot — the ratio of predicted to total sum of squares, lying in \([0,1]\) — is bounded precisely because linear axis prediction is orthogonal projection onto a subspace, whence \(\sum_i x_{ij}^2 = \sum_i (\hat x^{\mathrm{ax}}_{ij})^2 + \sum_i (x_{ij} - \hat x^{\mathrm{ax}}_{ij})^2\) and the ratio equals \(1 - \mathrm{SSPE}_j\). For a curved axis the reading is projection onto a curve, not a subspace; the Pythagorean decomposition fails, the ratio form is no longer confined to \([0,1]\), and the appropriate fidelity measure is the standard predictive error \(\bigl(n^{-1}\sum_i (x_{ij} - \hat x^{\mathrm{ax}}_{ij})^2\bigr)^{1/2}\) of the predictive-biplot tradition, which assumes no subspace structure. This is why axis_predictive_error() reports an RMS error rather than a bounded ratio.

Surface prediction, by contrast, retains the bounded form: \(\hat{\mathbf{f}}(\boldsymbol{\lambda}_i)\) is the nearest-point projection of \(\mathbf{x}_i\) onto the fitted surface, for which the decomposition does hold, so sample predictivity \(1 - \lVert \mathbf{x}_i - \hat{\mathbf{f}}(\boldsymbol{\lambda}_i) \rVert^2 / \lVert \mathbf{x}_i - \bar{\mathbf{x}} \rVert^2\) lies in \([0,1]\). This is the form predictivity() returns.

What is established, and what is not

Propositions 1–3 are theorems: the gradient-flow axis is smooth, crosses contours orthogonally with monotone calibration, and reduces to the linear biplot axis in the affine limit. Axis prediction, by contrast, is an empirical property. On smooth, single-mode surfaces the smooth axis follows the variable’s variation closely and predicts at least as well as back-projection. On complex multi-modal surfaces a single steepest-ascent curve need not pass near samples lying off it, and axis prediction is then comparable to — and occasionally worse than — back-projection, even while the axis remains far smoother. The contribution is a smooth, well-defined, orthogonally calibrated axis with a principled monotonicity diagnostic, whose predictive reading is competitive rather than uniformly superior.

A robust discrete estimator

When the estimated gradient field is noisy, the axis may instead be obtained by choosing one point per contour jointly, minimising a data-fidelity term plus a curvature penalty \(\sum_r \lVert \boldsymbol{\lambda}_{r-1} - 2\boldsymbol{\lambda}_r + \boldsymbol{\lambda}_{r+1} \rVert^2\) that couples adjacent markers. Because the penalty couples only neighbours, the global optimum follows by dynamic programming over the ordered contours, and its weight interpolates between the back-projection axis and a straight line. This is the internal energy of an active contour (Kass, Witkin and Terzopoulos, 1988). It is not the default in this package, which uses the gradient flow, but it is the natural fallback where the smoother is unreliable.

Algorithm

For variable \(j\), given sample coordinates \(\{\boldsymbol{\lambda}_i\}\), fitted \(\hat f_j\), grid size \(N\), step \(h\), and margin \(\delta\):

  1. Evaluate \(\hat f_j\) on an \(N \times N\) grid over the bounding box of the \(\boldsymbol{\lambda}_i\).
  2. Define the support \(\Omega\) as grid nodes within a density-scaled radius of a sample; mask the rest. Estimate \(\partial\Omega\) as supported nodes adjacent to a masked node or the frame.
  3. If Criterion 1 holds, return non-monotone; defer to contour reading.
  4. Estimate \(\nabla \hat f_j\) by central differences; interpolate bilinearly off-grid.
  5. Integrate (1) from the supported node nearest \(\boldsymbol{\lambda}_c\) by RK4 in both directions, stopping at \(\partial\Omega\).
  6. Calibrate \(\hat f_j\) along \(\gamma_j\) and place markers by inverse interpolation.
  7. If coverage \(c_j < c_{\min}\) (Criterion 2), return non-monotone; defer.

This is psaxis(); steps 1–2 are shared with contour drawing through the internal grid helper.

Estimation

The surface is fitted by the expectation/projection algorithm with \(\hat f_j\) estimated by local regression (prinsurf()). The support radius scales as the bounding-box diagonal divided by \(\sqrt n\), adapting to sample density and preventing the smoother’s extrapolation beyond the data from creating spurious gradients or spurious interior extrema. Grid evaluation and differencing are \(O(N^2)\) per variable and the integration is \(O(1/h)\) in the number of steps, so axis construction is inexpensive relative to fitting the surface.

References

Gower, J.C. and Hand, D.J. (1996). Biplots. Chapman & Hall, London.

Gower, J.C., Lubbe, S. and le Roux, N. (2011). Understanding Biplots. Wiley, Chichester.

Hastie, T. and Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84, 502–516.

Kass, M., Witkin, A. and Terzopoulos, D. (1988). Snakes: active contour models. International Journal of Computer Vision, 1, 321–331.