--- title: "Principal surface biplots with prinsurf" author: "Raeesa Ganey" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Principal surface biplots with prinsurf} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5.5, fig.align = "center" ) library(prinsurf) ``` # Introduction A principal surface (Hastie and Stuetzle, 1989) is a smooth two-dimensional manifold fitted through the middle of a $p$-dimensional point cloud, estimated by alternating an expectation step (the coordinate functions are loess smooths of each variable on the surface coordinates) and a projection step (each sample is projected onto its nearest point of the current surface). Where a principal component analysis flattens the data onto the best-fitting plane, a principal surface is allowed to bend, so it recovers curved structure that a plane must smear out. The display problem is that a curved surface has no linear biplot axes. This package supplies the missing reading rules. Each variable's fitted values over the surface form a smooth scalar field; the **gradient-flow axis** for a variable is the steepest-ascent trajectory of that field, started from the centre of the samples. It crosses the variable's contours orthogonally, carries a monotone calibration in the variable's original units, and is read exactly like a classical biplot axis: project a sample onto the axis, read off the value. A variable whose field is not axis-readable — because it has an interior extremum (Criterion 1) or because the constructed axis spans too little of the variable's range (Criterion 2) — is **deferred** and read from its contour lines instead. The workflow is `prinsurf()` to fit, `plot()` to display, and `predict()` / `predictivity()` / `axis_predictive_error()` to quantify how well the display can be read. # A curved sheet We start with data simulated on a blanket-shaped surface: two latent coordinates observed almost directly, and a third variable that bends parabolically in one of them. ```{r blanket-data} set.seed(1) n <- 200 s <- runif(n, -1, 1) # latent coordinate 1 t <- runif(n, -1, 1) # latent coordinate 2 e <- function() runif(n, -0.05, 0.05) x1 <- s + e() x2 <- t + e() x3 <- 0.7 * s + 0.9 * t^2 + e() # curved (parabolic) contours in (s,t) X <- cbind(x1, x2, x3) colnames(X) <- c("x", "y", "z") ``` ```{r blanket-fit} fit_b <- prinsurf(X, verbose = FALSE) fit_b ``` The default display draws the samples inside a plain frame (the surface coordinates carry no metric meaning, so there are no coordinate ticks) and a calibrated gradient-flow axis for every variable that passes the two criteria. `plot()` invisibly returns the names of the deferred variables. ```{r blanket-plot} deferred <- plot(fit_b) deferred ``` The `x` and `y` axes are close to straight, because those variables are close to linear in the latent coordinates. The interest is in `z`: its contours over the surface are curved, and its gradient-flow axis bends smoothly so as to stay orthogonal to them. To see the field itself, suppress the axes and ask for the contours, coloured by their marker value: ```{r blanket-contour} plot(fit_b, vars = NULL, contours = "z", col_by_level = TRUE) ``` With `col_by_level = TRUE` each contour line is coloured by the value it represents (low to high through a sequential palette), so colour *is* the calibration — the contour analogue of tick marks on an axis. Only the calibrated lines are drawn; the space between them is left blank, so the display never colours a region where the fitted value would be an extrapolation. Contour levels and axis calibrations are always labelled in the variable's original units, matching what `predict()` returns. Where the contours stop short of the frame, the surface simply has no local data support there — the grid is masked to the region the samples cover, and refuses to draw where the fitted value would be an extrapolation. How much better is the bent axis than a straight one? `kink_max()` measures the largest turning angle along an axis (small for a smooth flow), and `axis_predictive_error()` gives the root-mean-square error, per axis, of reading each sample by orthogonal projection — the standard predictive error of the Gower biplot tradition, in working (standardised) units. ```{r blanket-diagnostics} kink_max(psaxis(fit_b, "z")$axis) axis_predictive_error(fit_b) ``` # A half-sphere cap: deferral in action The second example is data on a spherical cap. One coordinate — the one along the cap's axis — attains its maximum at the pole, in the *interior* of the surface. No monotone axis can represent such a variable, and this is exactly what Criterion 1 detects. ```{r sphere-data} set.seed(05012014) data1 <- function(n, side) { x <- 2 * runif(n) - 1 theta <- 2 * pi * runif(n) - pi y <- sin(theta) * sqrt(1 - x^2) z <- cos(theta) * sqrt(1 - x^2) X <- matrix(c(x, y, z), nrow = n, ncol = 3, byrow = FALSE) X[X[, side] > -0.4, ] # keep cap (> half ball) along axis `side` } X <- data1(200, 1) colnames(X) <- c("x", "y", "z") cat(sprintf("half-sphere: n = %d after the cap filter\n", nrow(X))) ``` ```{r sphere-fit} fit_s <- prinsurf(X, verbose = FALSE) deferred <- plot(fit_s) deferred ``` The deferred variable gets no axis. Reading it is still possible — from its contours, which on a cap are concentric rings around the pole: ```{r sphere-contour} plot(fit_s, vars = NULL, contours = deferred, col_by_level = TRUE) ``` `psaxis()` reports the deferral decision for any single variable, and `predict()` respects it: for a monotone variable it projects onto the calibrated axis, while for a deferred variable it returns the surface value $\hat f_j(\lambda_i)$ (the contour reading), with a message. Either way the result is in the variable's original units. ```{r sphere-predict} psaxis(fit_s, deferred[1]) head(predict(fit_s, deferred[1])) ``` If you want an axis regardless — for instance to demonstrate *why* deferral is the right call — switch the criteria off with `defer = FALSE` in `psaxis()` or in `plot()`, and inspect the coverage the forced axis achieves. # Iris: a real dataset with groups Fisher's iris data has four measured variables and a species factor. Variables are standardised by default (`scale = TRUE`), which is recommended whenever the variables are on unequal scales. ```{r iris-fit} fit_i <- prinsurf(as.matrix(iris[, 1:4]), verbose = FALSE) deferred <- plot(fit_i, group = iris$Species) deferred ``` The `group` argument colours the samples and adds a legend; the axes are read in the usual way, in centimetres. To study one variable's behaviour over the whole surface, overlay its contours on the biplot together with its axis, or drop the axes altogether: ```{r iris-one-variable} plot(fit_i, vars = "Petal.Length", contours = "Petal.Length", group = iris$Species) ``` Passing `contours = TRUE` switches to a panel display, one panel per variable with its contours coloured by level — the quickest overall diagnostic view of the fitted surface: ```{r iris-panels, fig.height = 7} plot(fit_i, contours = TRUE, col_by_level = TRUE, group = iris$Species) ``` ## How well can the display be read? Two complementary summaries. **Sample predictivity** asks, per sample, what proportion of its squared length the surface reconstructs, $1 - \lVert x_i - \hat f(\lambda_i)\rVert^2 / \lVert x_i \rVert^2$; values near 1 are well represented, and negative values flag samples the surface cannot reach. **Axis predictive error** asks, per variable, how accurately the axis-projection reading recovers the true values. ```{r iris-diagnostics} pred <- predictivity(fit_i) attr(pred, "overall") summary(as.numeric(pred)) axis_predictive_error(fit_i) ``` And the direct check — predicted against observed for one variable: ```{r iris-predict} plot(iris$Petal.Length, predict(fit_i, "Petal.Length"), xlab = "observed", ylab = "read from the biplot axis") abline(0, 1, col = "grey50") ``` ## Two reading rules `predict()` offers two rules. The default, `rule = "axis"`, projects each sample onto the calibrated axis — the value a reader recovers by eye. `rule = "contour"` instead returns the surface value $\hat f_j(\lambda_i)$ for every variable, the reading from the contours; this carries no contour-curvature loss (see the theory vignette, Proposition 4) and is the comparison the method turns on. The two coincide only where the contours are straight, and diverge in proportion to their curvature: ```{r iris-rules} ax <- predict(fit_i, "Petal.Length", rule = "axis") cont <- predict(fit_i, "Petal.Length", rule = "contour") obs <- iris$Petal.Length c(axis = sqrt(mean((obs - ax)^2)), contour = sqrt(mean((obs - cont)^2))) ``` The contour reading is at least as accurate everywhere, and strictly better where the field bends — but it is not a drawn axis a reader can follow, which is the trade the mixed biplot makes. # Practical notes **Span.** The single smoothing parameter is the loess span (default `0.6`). Smaller values let the surface bend more; larger values push it towards the principal-component plane. If the fitted contours look implausibly wiggly, increase it. **Deferral tuning.** Criterion 1 uses `delta` (the margin, as a fraction of the value range, by which an interior extremum must exceed the boundary) and Criterion 2 uses `cover_min` (the minimum fraction of the variable's range the axis must span, default `0.55`). Both are arguments to `psaxis()` and to `plot()`. Deferral is a display decision, not a verdict on the variable: a deferred variable is still fitted, still predictable, and often the most interesting one on the plot. **Colouring by level.** With `col_by_level = TRUE`, contour lines are coloured by the value they represent, so colour encodes the calibration directly. Only the calibrated curves are drawn; unlike a filled surface, nothing is painted in the space between them, so the display never suggests a value where the surface merely interpolates. Set `n_contours` for the number of levels and `level_pal` for the palette. **Blank regions.** The contours are evaluated on a grid masked to the data-support region: cells with no sample nearby are dropped. Where the contours stop short of the frame, the configuration has a genuine void there — between clusters, or at a sparse edge — and the smooth surface would otherwise extrapolate a value with no data behind it. **Units.** Displayed calibrations, contour levels, and everything returned by `predict()` are in the original units of each variable. The internal computations, `predictivity()`, and `axis_predictive_error()` operate in the working (centred, optionally standardised) units. # References Gower, J.C. and Hand, D.J. (1996). *Biplots*. Chapman & Hall, London. Gower, J.C., Lubbe, S. and le Roux, N. (2011). *Understanding Biplots*. Wiley, Chichester. Hastie, T. and Stuetzle, W. (1989). Principal curves. *Journal of the American Statistical Association*, **84**, 502–516.