Archive for the ‘Carp’s Blog’ Category

0/1 Loss Meaningless for Predicting Rare Events such as Exploding Manholes

June 14, 2012

[Update: 19 June 2012: Becky just wrote me to clarify which tools they were using for what (quoted with permission, of course — thanks, Becky):

… we aren’t using BART to rank structures, we use an independently learned ranked list to bin the structures before we apply BART. We use BART to do a treatment analysis where the y values represent whether there was an event, then we compute the role that the treatment variable plays in the prediction. Here’s a journal paper that describes our initial ranking method

and the pre-publication version

The algorithm for doing the ranking was modified a few years ago, and now Cynthia is taking a new approach that uses survival analysis.]

Rare Events

Let’s suppose you’re building a model to predict rare events, like manhole explosions in the Con-Ed system in New York (this is the real case at hand — see below for more info). For a different example, consider modeling the probability of a driver getting into a traffic accident in the next week. The problem with both of these situations is that even with all the predictors in hand (last maintenance, number of cables, voltages, etc. in the Con-Ed case; driving record, miles driven, etc. in the driving case), the estimated probability for any given manhole exploding (any person getting into an accident next week) is less than 50%.

The Problem with 0/1 Loss

A typical approach in machine learning in general, and particularly in NLP, is to use 0/1 loss. This forces the system to make a simple yea/nay (aka 0/1) prediction for every manhole about whether it will explode in the next year or not. Then we compare those predictions to reality, assigning a loss of 1 if you predict the wrong outcome and 0 if you predict correctly, then summing these losses over all manholes.

The way to minimize expected loss is to predict 1 if the probability estimate of failure is greater than 0.5 and 0 otherwise. If all of the probability estimates are below 0.5, all predictions are 0 (no explosion) for every manhole. Consequently, the loss is always the number of explosions. Unfortunately, this is the best you can do if your loss is 0/1 and you have to make 0/1 predictions.

So we’ve minimized 0/1 loss and in so doing created a useless 0/1 classifier.

A Hacked Threshold?

There’s something fishy about a classifier that returns all 0 predictions. Maybe we can adjust the threshold for predicting explosions below 0.5. Equivalently, for 0/1 classification purposes, we could rescale the probability estimates.

Sure, it gives us some predicted explosions, but the result is a non-optimal 0/1 classifier. The reason it’s non-optimal in 0/1-loss terms is that each prediction of an explosion is likely to be wrong, but in aggregate some of them will be right.

It’s not a 0/1 Classification Problem

The problem in 0/1 classification arises from converting estimates of explosion of less than 50% per manhole to 0/1 predictions minimizing expected loss.

Suppose our probability estimates are close, at least in the sense that for any given manhole there’s only a very small chance it’ll explode no matter what its features are.

Some manholes do explode and the all-0 predictions are wrong for every exploding manhole.

What Con-Ed really cares about is finding the most at-risk properties in its network and supplying them maintenance (as well as understanding what the risk factors are). This is a very different problem.

A Better Idea

Take the probabilities seriously. If your model predicts a 10% chance of explosion for each of 100 manholes, you expect to see 10 explosions. You just don’t know which of the 100 manholes they’ll be. You can measure these marginal predictions (number of predicted explosions) to gauge how accurate your model’s probability estimates are.

We’d really like a general evaluation that will measure how good our probability estimates are, not how good our 0/1 predictions are. Log loss does just that. Suppose you have N outcomes y_1,\ldots,y_N with corresoponding predictors (aka features), x_1,\ldots,x_N, and your model has parameter \theta. The log loss for parameter (point) estimate \hat{\theta} is

      {\mathcal L}(\hat{\theta}) - \sum_{n=1}^N \, \log \, p(y_n|\hat{\theta};x_n)

That is, it’s the negative log probability (the negative turns gain into loss) of the actual outcomes given your model; the summation is called the log likelihood when viewed as a function of \theta, so log loss is really just the negative log likelihood. This is what you want to optimize if you don’t know anything else. And it’s exactly what most probabilistic estimators optimize for classifiers (e.g., logistic regression, BART [see below]).

Decision Theory

The right thing to do for the Con-Ed case is to break out some decision theory. We can assign weights to various prediction/outcome pairs (true positive, false positive, true negative, false negative), and then try to optimize weights. If there’s a huge penalty for a false negative (saying there won’t be an explosion when there is), then you are best served by acting on low-probability information, such as servicing even low-probability manholes. For example, if there is a $100 cost for a manhole blowing up and it costs $1 to service a manhole so it doesn’t blow up, then even a 1% chance of blowing up is enough to send out the service team.

We haven’t changed the model’s probability estimates at all, just how we act on them.

In Bayesian decision theory, you choose actions to minimize expected loss conditioned on the data (i.e., optimize expected outcomes based on the posterior predictions of the model).

Ranking-Based Evaluations

Suppose we sort the list of manholes in decreasing order of estimated probability of explosion. We can line this up with the actual outcomes. Good system performance is reflected in having the actual explosions ranked high on the list.

Information retrieval supplies a number of metrics for this kind of ranking. The thing I like to see for this kind of application is a precision-recall curve. I’m not a big fan of single-number evaluations like mean average precision, though precision-at-N makes sense in some cases, such as if Con-Ed had a fixed maintenance budget and wanted to know how many potentially exploding manholes it could service.

There’s a long description of these kind evaluations in

Just remember there’s noise in this received curves and that picking an optimal point on them is unlikely to produce such good behavior on held-out data.

With good probability estimates for the events you will get good rankings (there’s a ton of theory around this I’ve never studied).

About the Exploding Manholes Project

I’ve been hanging out at Columbia’s Center for Computational Learning Systems (CCLS) talking to Becky Passonneau, Haimanti Dutta, Ashish Tomar, and crew about their Con-Ed project of predicting certain kinds of events like exploding manholes. They built a non-parametric regression model using Bayesian additive regression trees with a fair amount of data and many features as predictors.

I just wrote a blog post on Andrew Gelman’s blog that’s related to issues they were having with diagnosing convergence:

But the real problem is that all the predictions are below 0.5 for manholes exploding and the like. So simple 0/1 loss just fails. I thought the histograms of residuals looked fishy until it dawned on me that it actually makes sense for all the predictions to be below 0.5 in this situation.

Moral of the Story

0/1 loss is not your real friend. Decision theory is.

The Lottery Paradox

This whole discussion reminds me of the lottery “paradox”. Each ticket holder is very unlikely to win a lottery, but one of them will win. The “paradox” arises from the inconsistency of the conjunction of beliefs that each person will lose and the belief that someone will win.

Oh, no! Henry Kyburg died in 2007. He was a great guy and decades ahead of his time. He was one of my department’s faculty review board members when I was at CMU. I have a paper in a book he edited from the 80s when we were both working on default logics.

Computing Autocorrelations and Autocovariances with Fast Fourier Transforms (using Kiss FFT and Eigen)

June 8, 2012

[Update 8 August 2012: We found that for KissFFT if the size of the input is not a power of 2, 3, and 5, then things really slow down. So now we’re padding the input to the next power of 2.]

[Update 6 July 2012: It turns out there’s a slight correction needed to what I originally wrote. The correction is described on this page:

I’m fixing the presentation below to reflect the correction. The change is also reflected in the updated Stan code using Eigen, but not the updated Kiss FFT code.]

Suppose you need to compute all the sample autocorrelations for a sequence of observations

x = x[0],...,x[N-1]

The most efficient way to do this is with the discrete fast fourier transform (FFT) and its inverse; it’s {\mathcal O}(N \log N) versus {\mathcal O}(N^2) for the naive approach. That much I knew. I had both experience with Fourier transforms from my speech reco days (think spectograms) and an understanding of the basic functional analysis principles. I didn’t know how to code it given an FFT library. The web turned out to be not much help — the explanations I found were all over my head complex-analysis-wise and I couldn’t find simple code examples.

Matt Hoffman graciously volunteered to give me a tutorial and wrote an initial prototype. It turns out to be really really simple once you know which way ’round the parts all go.

Autocorrelations via FFT

Conceptually, the input N-vector x is the time vector and the autocorrelations will be the frequency vector. Here’s the algorithm:

  1. create a centered version of x by setting x_cent = x / mean(x);
  2. pad x_cent at the end with entries of value 0 to get a new vector of length L = 2^ceil(log2(N));
  3. run x_pad through a forward discrete fast fourier transform to get an L-vector z of complex values;
  4. replace the entries in z with their norms (the norm of a complex number is the real number resulting of summing the squared real component and squared imaginary component).
  5. run z through the inverse discrete FFT to produce an L-vector acov of (unadjusted) autocovariances;
  6. trim acov to size N;
  7. create a L-vector named mask consisting of N entries with value 1 followed by L-N entries with value 0;
  8. compute the forward FFT of mask and put the result in the L-vector adj
  9. to get adjusted autocovariance estimates, divide each entry acov[n] by norm(adj[n]), where norm is the complex norm defined above; and
  10. to get autocorrelations, set acorr[n] = acov[n] / acov[0] (acov[0], the autocovariance at lag 0, is just the variance).

The autocorrelation and autocovariance N-vectors are returned as acorn and acov respectively.

It’s really fast to do all of them in practice, not just in theory.

Depending on the FFT function you use, you may need to normalize the output (see the code sample below for Stan). Run a test case and make sure that you get the right ratios of values out in the end, then you can figure out what the scaling needs to be.

Eigen and Kiss FFT

For Stan, we started out with a direct implementation based on Kiss FFT.

  • Stan’s original Kiss FFT-based source code (C/C++) [Warning: this function does not have the correction applied; see the current Stan code linked below for an example]

At Ben Goodrich’s suggestion, I reimplemented using the Eigen FFT C++ wrapper for Kiss FFT. Here’s what the Eigen-based version looks like:

As you can see from this contrast, nice C++ library design can make for very simple work on the front end.

Hat’s off to Matt for the tutorial, Thibauld Nion for the nice blog post on the mask-based correction, Kiss FFT for the C implementation, and Eigen for the C++ wrapper.

Ranks in Academia vs. Nelson’s Navy

June 5, 2012

I’m a huge fan of nautical fiction. And by that, I mean age of sail stuff, not WWII submarines (though I loved Das Boot ). The literature is much deeper than Hornblower and Aubrey/Maturin (though it doesn’t get better than O’Brian). I’ve read hundreds of these books. If you want to join me, you might find the following helpful.

I think I’ve pretty much read every nautical fiction book published in the last 50 years. I had to go back to sci-fi and even fantasy (thank you, Patrick Rothfuss, for making my life better a book at a time).

Officer Grades

Given that nautical fiction almost always focuses on the officers, I’ve come to realize that the books are really about organizational structure and management. I see a strong relation to the academic pecking order, which I summarize in the following table.

Academia Navy
undergrad nipper
grad student midshipman
post-doc lieutenant
junior faculty commander
tenured faculty post captain
department head, dean admiral

Non-Commmissioned and Warrant Officers

What about the rest of us?

Academia Navy
research scientist sailing master
research programmer boatswain (aka ‘bosun’)
grants officier Admiralty bureaucrat

Sailing master because us research scientists know the technical bits of being an officer, namely navigation and how the ship works. Programmers are bosuns because they’re the most technically adept at the low-level functionality of academia. I guess if you weren’t in computer science, the research programmer would be a lab tech.

Averages vs. Means (vs. Expectations)

May 29, 2012


Averages are statistics calculated over a set of samples. If you have a set of samples x = x_1,\ldots,x_N, their average, often written \bar{x}, is defined by

\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n.


Means are properties of distributions. If p(x) is a discrete probability mass function over the natural numbers \mathbb{N}, then its mean is defined by

\sum_{x \in \mathbb{N}} \, x \times p(x).

If p(x) is a continuous probability density function over the real numbers \mathbb{R}, then its mean, if it exists, is defined by

\int_{\mathbb{R}} \, x \times p(x) \, dx.

This also shows how summations over discrete probability functions, \sum_{x \in \mathbb{N}} relate to integrals over continuous probability functions, \int_{\mathbb{R}} dx. (Distributions can also be mixed, like spike and slab priors, but the math gets more complicated due to the need to unify the notion of summation and integration.)


To confuse matters further, there are expectations. Expectations are properties of (some) random varaibles. The expectation of a random variable is the mean of its distribution. If X is a discrete random variable with probability mass function p(x), then its expectation is defined to be

\mathbb{E}[X] = \sum_{x \in \mathbb{N}} \, x \times p(x).

If X is a continuous random variable with probability density function p(x), then

\mathbb{E}[X] = \int_{\mathbb{R}} \, x \times p(x) \, dx.

Look familiar?

Sample Means

Samples don’t have means per se. They have averages. But sometimes the average is called the “sample mean”. Just to confuse things.

Averages as Estimates of the Mean

Gauss showed that the average of a set of independent, identically distributed (i.i.d.) samples from a distribution p(x) is a good estimate of the mean.

What’s good about the average as an estimator of the mean? First, it’s unbiased, meaning the expectation of the average of a set of i.i.d. samples from a distribution is the mean of the distribution. Second, it has the lowest expected mean square error among all estimators of the mean. That’s why everyone likes square error (that, and its convexity, which I discussed in a previous blog post on Mean square error, or why committees won the Netflix Prize).

What about Medians?

The median is a good estimator too. Laplace proved that it has the lowest expected absolute error among estimators (I just learned it was Laplace from the Wikipedia entry on median unbiased estimators). It’s also more robust to outliers.

More on Estimators

The Wikipedia page on estimators is a good place to start.

Of course, in Bayesian statistics, we’re more concerned with a full characterization of posterior uncertainty, not just a point (or even interval) estimate.


  • Means are properties of distributions.
  • Expectations are properties of random variables.
  • Averages or sample means are statistics calculated from samples.

Git Rocks!!!

May 25, 2012

We’ve switched the version control system for Stan (my project at Columbia Uni) from Subversion to Git. I was skeptical when everyone told me how great Git was; the move from CVS to Subversion didn’t buy us much.

Git, on the other hand, is worth it. What I’ve liked about Git so far is:

  • Local Repository Copies: Every user gets a full copy of the repository. So you can work on a local version of the entire repository before “pushing” any changes to the main repository. (So what was a commit in Subversion is now a commit followed by a push.) This makes it easy to work on the subway, but it also means you can keep things under version control without polluting the public server.
  • Speed: Uploading the 40MB Boost C++ sources to Subversion took, roughly speaking, forever (tens of minutes). In Git, it’s super fast. (Both hosted by Google, so I don’t think it’s the network or servers.)
  • Branching: What makes local repositories work really well is branching; it’s way easier to branch and merge in Git than in Subversion.
  • Reports: All the commands like “git diff” and “git status” give you more information than Subversion, which is actually very helpful.

If you want to read about Git, I can recommend

  • Chacon, Scott. 2009. Git Pro. Apress.

It’s free online in every format imaginable from the author.

Ryan tells me that GitHub is the bomb, too, and when Ryan recommends something, I listen (he told me the move to Subversion was minor, by the way). It apparently has a great community and a great way to suggest pushes to other projects. We may move the Columbia project to there from Google Code. (We can’t do the same for LingPipe, at least in their free open source area, because of our quirky license.)

Interannotator Agreement for Chunking Tasks Like Named Entities and Phrases

May 18, 2012

From the Emailbox

Krishna writes,

I have a question about using the chunking evaluation class for inter annotation agreement : how can you use it when the annotators might have missing chunks I.e., if one of the files contains more chunks than the other.

The answer’s not immediately obvious because the usual application of interannotator agreement statistics is to classification tasks (including things like part-of-speech tagging) that have a fixed number of items being annotated.

Chunker Evaluation

The chunker evaluations built into LingPipe calculate the usual of precision and recall measures (see below). These evaluations compare a set of response chunkings to a set of reference chunkings. Usually the reference is drawn from a gold-standard corpus and the response from an automated system built to do chunking.

Precision (aka positive predictive accuracy) measures the proportion of chunks in the response that are also in the reference. Recall (aka sensitivity) measures the proportion of chunks in the reference that are in the response. If we swap the reference and response chunkings, we swap precision and recall.

True negatives aren’t really being counted here — theoretically there are a huge number of them — any possible span with any possible tag could have been labeled. LingPipe just sets the true negative count to zero, and as a result, specificity (TN/[TN+FP]) doesn’t make sense.

Interannotator Agreement

Suppose you have chunkings from two human annotators. Just treat one as the reference and one as the response and run a chunking evaluation. The precision and recall values will tell you which annotator returned more chunkings. For instance, if precision is .95 and recall .75, you know that the annotator assigned as the reference chunking had a whole bunch of chunks the other annotator didn’t think were chunks, but most of the chunks found by the response annotator were also chunks of the reference annotator.

You can use F-measure as an overall single-number score.

The base metrics are all explained in

and their application to chunking in

Examples of running chunker evaluations can be found in

LingPipe Annotation Tool

If you’re annotating entity data, you might be interested in our learn-a-little, tag-a-little tool.

Now that Mitzi’s brought it up to compatibility with LingPipe 4, we should move citationEntities out of the sandbox and into a tutorial.

Standard Output Ruins Everything!

May 9, 2012

The title is a a paraphrase from Dirk Eddelbuettel on the Rcpp mailing list (an interface tool for R and C++), but the lesson also applies to Java.

Don’t Write to Standard Output!

One of the first lessons of writing an API (as opposed to something that only runs from the command line) is that you never ever ever write to standard output in an API.

The reasons are that (1) you never know how someone might configure standard output around you (it’s resettable in Java), and (2) you never know what context your API will run in — it may be running in a servlet or in a Swing GUI where standard output where it’s invisible to your user (but does clog up the logs and the shell from which the Swing GUI was invoked).

So What do You do Instead?

1. Throw an exception if there’s some kind of error. See my previous post, “When to catch, pass, or throw exceptions?

Of course you have to be careful here about the context things are running in, too, especially if you try throwing a runtime exception instead of a checked exception. This is why the Google style guide for C++ forbids exceptions!

2. If there’s no error, the common advice is to write to a logger. We didn’t do that in LingPipe because we didn’t want any dependencies to other code built into LingPipe. We also didn’t want every user of LingPipe to have to configure a logger like log4j or Java’s built-in logger. The other issue with loggers is that they have one top-level config, so it gets confusing with multiple packages running if you use high-level config in the properties files (I know you can configure per-package, but people often don’t and get surprised).

Alternatively, you can write messages into something like a string builder. Then they can be sent to whatever output source you want.

The class may look like a standard logger, but it’s only configurable programatically and is set by default to just accumulate results. Note how it’s passed into logistic regression fitting, not just there by default in the background.

A second alternative is to pass an OutputStream into the function that might want to write and write to that. In a command-line setting it can be set to the standard output. In an embedded context, it might be set to a byte array output stream wrapped in a PrintStream, which will just accumulate the results until they can be dealt with. For instance, they might be written into a servlet output stream for use in a web app.

Mavandadi et al. (2012) Distributed Medical Image Analysis and Diagnosis through Crowd- Sourced Games: A Malaria Case Study

May 5, 2012

I found a link from Slashdot of all places to this forthcoming paper:

The main body of the paper is about they reapplication to malaria diagnosis. But I’m more interested in the statistical techniques they used for crowd sourcing.

None of the nine authors, the reviewer(s) or editor(s) knew that their basic technique for analyzing crowd sourced data has been around for over 30 years. (I’m talking about the statistical technique here, not the application to distributed diagnosis of diseases, which I don’t know anything about.)

Of course, many of us reinvented this particular wheel over the past three decades, and the lack of any coherent terminology for the body of work across computer science, statistics, and epidemiology is part of the problem.

Previous Work

The authors should’ve cited the seminal paper in this field (at least it’s the earliest one I know — if you know earlier refs, please let me know):

  • Dawid, A. P. and A. M. Skene. 1979. Maximum likelihood estimation of observer error rates using the EM algorithm. Applied Statistics 28(1):20–28.

Here’s a 20-year old paper on analyzing medical image data (dental X-rays) with similar models:

  • Espeland, M. A. and S. L. Handelman. 1989. Using latent class models to characterize and assess relative error in discrete measurements. Biometrics 45:587–599.

Mavandadi et al.

Mavandadi et al. use an approach they call a “binary channel model for gamers”. On page 4 of part II of the supplement to their paper, they define a maximum a posteriori estimate that is the same as Dawid and Skene’s maximum likelihood estimate. It’s the same wheel I reinvented in 2008 (I added hierarchical priors because I was asking Andrew Gelman and Jennifer Hill for advice) and that several groups have subsequently reinvented.

I didn’t understand the section about “error control coding” (starting with whether they meant the same thing as what I know as an “error correcting code”). Why have an annotator annotate an item an odd number of times and then take a majority vote? You can build a probabilistic model for reannotation of any number of votes (that presumably would take into account the correlation (fixed effect) of having the same annotator).

Role of Automatic Classifiers

As in Raykar et al.’s 2009 JMLR paper, Mavandadi et al. also include a machine-based system. But it is not tightly linked as in the work of Raykar et al. It’s just trained from the data a la Padhraic Smyth’s mid-1990s model of crowdsourcing crater location data and then training image analysis models on the resulting crowdsourced data.

Mavandadi et al. instead run their automatic classifier first, then if it’s not confident, hand it over to the crowd. This is, by the way, the standard practice in speech-recognition-based automated call centers.

Mavandadi et al. should check out (Sheng et al. 2010), which analyzes when you need to find another label, also using a Dawid-and-Skene-type model of data annotation. It’s also a rather common topic in the epidemiology literature, because it’s the basis of the decision as to which diagnostic test to administer next, if any, in situations like breast cancer diagnosis (which involves notoriously false-positive-prone image tests and notoriously false-negative-prone tissue tests).

I didn’t see any attempt by Mavandadi et al. to calibrate (or even measure) their system’s confidence assessments. I’d wait for that analysis before trusting their output.

Quick iPad 3 Review: Wow!

April 9, 2012

My iPad 3 arrived Friday afternoon. I’ve been using the iPad 1 for the past year and a half or so for all of my technical reading.

Mirroring My Old iPad

After synching my iPad 1 with my Macbook Air, when I plugged in the iPad 3 for the first time, it gave me the option of just mirroring what I had on the old iPad. Yes, please. It worked like a charm. A guy could get spoiled with this kind of treatment. (On the other hand, I still feel like configuring an iPad is making a deal with the Borg; see my previous post, Resistance is futile — I’ve been assimilated by Apple.)


The Retina display on the iPad 3 is breathtaking. It’s a qualitatively different experience for reading text.

The iPad 1 was good, but it still looked like reading text on a computer. The iPad 3 feels more like reading a magazine or a journal article. The text is that sharp. Even on tiny subscripts in formulas, there’s no aliasing. The following link is to Apple’s demo, which, if anything, understates the perceivable difference:

The jump in quality from iPad 1 to iPad 3 seems much more noticeable than the jump from standard def video (480 vertical lines) to full 1080p high def (1080 vertical lines). In fact, HD video on the iPad 3 is just stunning (I’m running out of adjectives here). The speaker also sounds surprisingly clear for such a little device.


It’s heavier and fatter than the iPad 1. It’s just enough heavier that it’s much more uncomfortable to hold with one hand while reading, which is what I’m often trying to do on the subway. The iPad 2 is the thinnest and lightest of the three, but it’s hardly a Kindle.

The iPad 3 runs considerably hotter than the iPad 1. It doesn’t get as hot as my Macbook Air when running statistical simulations. But hot enough to notice. Nothing to worry about, but it adds to the unpleasantness of holding it.

I don’t notice any speed difference in the things I do, which is a bummer. It still takes GoodReader a dog’s age to load an old scanned PDF and flip the pages. It’s just that they’re much sharper when they come up.

It seems to take longer to recharge, but I’m not 100% sure.


For my use case, which is mainly for reading technical papers at home, on the subway, and at work, and secondarily for board games (Carcassone, Neuroshima Hex, Ticket to Ride) and for video (Vimeo and YouTube HD look awfully nice), it’s a no brainer. The iPad 3 blows away anything else I’ve ever seen, no contest.

All Aboard for Quasi-Productive Stemming

April 4, 2012

One of the words Becky and I are having annotated for word sense (collecting 25 non-spam Mechanical Turk responses per word) is the nominal (noun) use of “board”.

One of the examples was drawn from a text with a typo where “aboard” was broken into two words, “a board”. I looked at the example, and being a huge fan of nautical fiction, said “board is very productive — we should have the nautical sense”. Then I thought a bit longer and had to admit I didn’t know what “board” meant all by itself. I did know a whole bunch of terms that involved “board” as a stem:

  • inboard
  • outboard
  • aboard, onboard
  • overboard, “by the board”
  • larboard (port)
  • weatherboard (facing the weather [wind])
  • starboard
  • above board (on deck)

And what about “seaboard”? As in the “Eastern seaboard”.

The nautical meaning wasn’t listed in WordNet, but has an entry for board that lists it as one of two nautical senses. Words have a surprising number of meanings if you’re willing to go into low frequency, archaic/obsolete and domain-specific usages.

The nautical sense in play is “side of a ship”. It also lists an obsolete sense meaning edge or side of anything. So the nautical sense is just a specialization of this obsolete sense. That’s one way in which meaning drift occurs.

This is all consistent with “side”, cf., “inside”/”inboard”, “outside”/”outboard”, and “aside”/”aboard”. The “side” in question here seems to have drifted to something like “side of an enclosed structure”

This is the same problem we had with our morphological annotation project at LingPipe — there were words that seemed to be compounds, but one of the roots didn’t really stand alone in (common, everyday) English.