I just received mail from an understandably confused user who ran our singular-value decomposition (SVD) tutorial example multiple times and got multiple different answers. As I replied to that user, there are two reasons for this, one trivial and one deeper.

The trivial problem is non-identifiability. SVD is a matrix factoring. As such, you can think of it as a solution to a matrix equation. If you remember your quadratic equations, you'll remember that some equations have multiple solutions. In the simplest case, the equation `x * x = 4` has two solutions, one where `x = 2` and one where `x = -2`. This is why you sometimes get all the signs flipped in the SVD factoring of a matrix. You'll get the same inferences in terms of document-document, document-term, and term-term similarity, though.
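To see why the flipped signs don't change anything, consider one rank-1 term of the factorization: if its contribution is s · u vᵀ, negating both u and v leaves every reconstructed entry unchanged. A minimal sketch (the vectors and singular value here are made up for illustration):

```python
# Sign non-identifiability in a rank-1 factor: flipping the signs of
# both u and v leaves the product s * u v^T unchanged.
u = [1.0, 2.0]
v = [3.0, 4.0]
s = 5.0

def rank1(s, u, v):
    # Reconstruct the matrix s * u v^T as a list of rows.
    return [[s * ui * vj for vj in v] for ui in u]

a = rank1(s, u, v)
a_flipped = rank1(s, [-x for x in u], [-x for x in v])
assert a == a_flipped
```

The same cancellation happens term by term in a full SVD, which is why the similarity inferences come out identical.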

In practice, the way to solve the first problem is to simply choose one of the representations to serve as the canonical solution. For instance, vectors may be ordered lexicographically by dimension, giving you the positive solution `x = 2`. You see this kind of maneuver in both Bayesian and non-Bayesian statistics.
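One way to pick a canonical representative, sketched below for a single singular-vector pair (the function is hypothetical, not part of any library): flip the signs of u and v together whenever the first nonzero entry of u is negative, which leaves the product u vᵀ unchanged.

```python
def canonicalize(u, v):
    # Make the first nonzero entry of u positive, flipping u and v
    # together so that the outer product u v^T is unchanged.
    pivot = next((x for x in u if x != 0.0), 1.0)
    if pivot < 0.0:
        u = [-x for x in u]
        v = [-x for x in v]
    return u, v

u, v = canonicalize([-1.0, 2.0], [3.0, -4.0])  # both vectors get flipped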

The second problem is deeper and has to do with the stochastic nature of the search. Our SVD is implemented with stochastic gradient descent, which initializes the vectors randomly and then follows the gradient (slope) of the error curve (least squares in the case of SVD), taking smaller and smaller steps over time (a so-called annealing schedule) until it's near the answer. This kind of search is guaranteed to work in theory. Unfortunately, that theory requires arbitrarily many epochs (iterations) and arbitrary arithmetic precision, and it may take a long time to even get close to an answer.
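Here is a toy version of the idea (a sketch, not LingPipe's actual implementation): stochastic gradient descent fitting a rank-1 factorization u vᵀ to a small matrix by least squares, with a step size that shrinks each epoch. The target matrix and all constants are made up for illustration.

```python
import random

random.seed(0)

A = [[1.0, 2.0], [2.0, 4.0]]                      # a rank-1 target matrix
u = [random.gauss(0.0, 0.1) for _ in range(2)]    # random initialization
v = [random.gauss(0.0, 0.1) for _ in range(2)]

rate = 0.02
for epoch in range(2000):
    for i in range(2):
        for j in range(2):
            err = A[i][j] - u[i] * v[j]           # residual for this cell
            u[i], v[j] = (u[i] + rate * err * v[j],
                          v[j] + rate * err * u[i])
    rate *= 0.999                                 # smaller steps over time

# total squared reconstruction error after training
sse = sum((A[i][j] - u[i] * v[j]) ** 2
          for i in range(2) for j in range(2))
```

Because the initialization is random, different runs converge to different (possibly sign-flipped, possibly slightly offset) values of `u` and `v`, even when the reconstruction error ends up nearly identical.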

In practice, the exact values that aren't well identified by stochastic search often don't matter. The error in SVD is measured by least squares, so two runs that reach nearly the same squared error support nearly the same similarity inferences, even if the individual vector values differ.

In practice, the way to mitigate this insoluble problem is to run multiple trials and take something like a consensus. You see this done in Gibbs sampling, where starting with diffuse initial values and assessing chain mixing (using, for example, the R-hat statistic) is standard operating procedure in Bayesian model fitting.
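A sketch of the restart-and-compare pattern (the search function here is a made-up stand-in for the SVD fit): run the stochastic search from several seeds and compare the results. If the runs agree, the answer is probably stable; here we simply keep the lowest-error run.

```python
import random

def noisy_search(seed):
    # Hypothetical stochastic search minimizing the error x**2:
    # start at a random point, keep random steps that reduce the error.
    rng = random.Random(seed)
    x = rng.uniform(-10.0, 10.0)
    for _ in range(200):
        step = rng.gauss(0.0, 0.5)
        if (x + step) ** 2 < x ** 2:
            x += step
    return x

results = [noisy_search(seed) for seed in range(5)]
best = min(results, key=lambda x: x ** 2)   # take the lowest-error run
```

For SVD you would canonicalize the signs first (as above for the first problem) before comparing runs, since sign-flipped solutions are equally good.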


This entry was posted on October 30, 2008 at 12:30 pm and is filed under Carp's Blog, LingPipe in Use.

October 31, 2008 at 9:40 am

You should probably fix this thinko: “In the simplest case, the equation x * x = 4 has two solutions, one where x = 1 and one where x = -1.” Or else people may think there’s a deeper problem here. ;-)

October 31, 2008 at 11:01 am

Doh! Fixed.

I’ve never seen the term “thinko” before, but I like it better than “braino”; it implies more of a software than a hardware malfunction. The crowd seems to agree with you, with Google hit counts (always approximate) for “thinko” at 83,800 and “braino” at 36,600.