Computational Linguistics Curriculum

I recently received a request from someone asking how to tune their undergrad curriculum for computational linguistics. As I’ll recount later, I have considerable experience in designing curricula for computational linguistics at all levels of study. Here’s what I’d recommend.

My main piece of advice is to make sure you’re very solid in at least one of the component fields, which I take to be (1) computer science, (2) statistics, (3) linguistics, and (4) cognitive psychology. The biggest danger of an interdisciplinary education is becoming a jack of all trades, master of none.

Course Work

For CS, it’s mainly the software side that’s relevant, including programming languages/data structures, algorithm analysis and automata theory/logic/discrete math.

These days, you’ll need a strong stats background for the machine learning component of the field, and I’d recommend a core math stats sequence, a course on linear regression and a course on Bayesian data analysis if one’s available.

As background to stats and CS theory, you’ll need math through calc, diff eqs and matrices, though any kind of more advanced math would help, either analysis or abstract algebra. Numerical methods are extremely helpful.

For linguistics, what you want to learn for computational linguistics is the good old-fashioned, data-intensive, detail-oriented descriptive linguistics. That’s still practiced among the laboratory or acoustic phonetics folks, but by few others. Even so, you’ll want to take in phonetics/phonology, syntax, semantics and pragmatics. Sociolinguistics is good, too. The more empirical data you get to play with, the better.

It helps to get some basic cognitive psych and psycholinguistics if the latter’s available. This isn’t so important from an engineering perspective, but it really helps to have some basic ideas about how the brain works. And there are some really really nifty experiments. Sometimes you can count these as your social science requirements.

If you can take other social sci requirements in something quantitative like micro-economics, all the better, especially if you can get beyond maximizing convex functions into things like behavioral econ and decision theory.

Then there are the interdisciplinary courses like machine learning, artificial intelligence, and computational linguistics itself. By all means, take these classes.

Other interdisciplinary studies include speech recognition, which is often taught in electrical engineering. Speech recognition’s great because it gives you some continuous math as a basis and also has efficiency issues beyond what you see with text.

Information retrieval is also very useful if you can find a class in that in either CS or a library science school.

Genomics sequence analysis would also be a great thing to take as the algorithms and math are very similar to much of comp ling and it’s a really fun area.
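To make that overlap concrete, here’s a minimal sketch of my own (not taken from any of the courses above, and the function name is just illustrative): the dynamic program for edit distance. The same table-filling idea underlies both aligning DNA sequences and scoring spelling corrections. I’m assuming unit costs for insertion, deletion and substitution; real alignment tools use richer scoring schemes.

```python
def edit_distance(source, target):
    """Levenshtein distance between two sequences, assuming unit edit costs."""
    # previous[j] is the distance between the empty prefix of source
    # and the first j symbols of target (j insertions).
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        # Turning the first i symbols of source into nothing takes i deletions.
        current = [i]
        for j, t in enumerate(target, start=1):
            substitution = previous[j - 1] + (s != t)  # free if symbols match
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Works the same on words and on nucleotide strings.
    print(edit_distance("kitten", "sitting"))
    print(edit_distance("GATTACA", "GCATGCU"))
```

Keeping only two rows of the table holds memory linear in the length of the target, which starts to matter once the sequences are genome-sized rather than word-sized.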

Courseware

You can do worse than follow Chris Manning’s Stanford Course on NLP. It was the first version of this course in 1995 that got me into statistical NLP. Why doesn’t Michael Collins have an MIT courseware version?

Software to Study

There are also lots of software packages out there distributed with source. Steve Bird’s NLTK is designed explicitly for teaching and is based on Python. Our toolkit, LingPipe, isn’t explicitly designed for teaching, but contains a large number of tutorials. The two other big toolkits out there, Mallet and MinorThird, are much harder to understand without already knowing the field.

Books to Read

Here’s some recommended reading in these topics: I put together a comprehensive computational linguistics reading list on Amazon and occasionally update it (it’s up to date as of now).

Do Some Real Research

If at all possible, do some real research. It’s the single biggest factor a grad school will look at if you have the grades and test scores to make their basic cuts.

I’d also highly recommend browsing recent ACL proceedings online to get a feeling for what the field’s really like. And I’d highly suggest going to one of the ACL meetings. Next year, University of Colorado Boulder’s hosting NAACL 2009.

Another great opportunity for undergrads is the Johns Hopkins Summer Workshops. This is a phenomenal opportunity to work with good and diverse teams. There’s nothing like applying to grad school with a publication and references from top researchers.

Internships

Almost everyone in this field seems to offer internships. They’ll be harder to land as an undergrad unless you have an advisor hooked into the field who can set you up. Try to get an internship that’s different from what you do in school. The best thing you can get is product programming experience on real projects with teams of more than one person.

Blogs to Read

You’re already here, so you can presumably read our links. This’ll give you more of a feel for the day-to-day in the field than the textbooks.

What the Profs Think

You can see what some of the professors in the field think of curricula and teaching by looking at the proceedings of this year’s ACL workshop on teaching comp ling.

How’d I Get Here?

I started back in 6th grade (1973-74) when my parents got me a Digi-Comp I from Edmund Scientific as a present. The reproduction manufacturers rightly conclude that “… Digi-Comp is an ingenious, transparent Logical Gizmo that can teach anyone about binary numbers and Boolean algebra …”. I don’t know about transparent, but the books explained Boolean algebra, binary arithmetic, and game trees, concepts even an elementary school student can grasp.

In 12th grade (1980-81), I read Hofstadter’s inspiring book Gödel, Escher, Bach, which made artificial intelligence in general, and learning logic in particular, sound fascinating.

As an undergraduate math major in Michigan State University’s Honors College (1981-1984), I created a computational linguistics curriculum for myself without knowing it. I took philosophy of language, analytic philosophy and non-standard logics as my humanities classes, I took developmental psych, micro-econ, cognitive psych and psycholinguistics as soc classes, and split the rest of my classes between computer science and (mostly discrete!) math.

As a Ph.D. student at the University of Edinburgh (1984-1987; go to the U.K. for speed), I found myself in another computational linguistics degree, this time masquerading as a School of Epistemics Centre for Cognitive Science. Our four qualifying exams were in syntax, semantics, computational linguistics and psycholinguistics, which is not exactly a general cognitive science curriculum.

After a brief stint writing my thesis while hanging out at Stanford’s Center for the Study of Language and Information (1987-1988), I landed a faculty gig in Carnegie Mellon’s Computational Linguistics program (1988-1996), which is now part of the Language Technologies Institute. We introduced new M.S. and B.S. programs while I was there and I had a significant hand in designing (and teaching) the undergraduate and graduate curricula in computational linguistics. Sitting in on Chris Manning’s 1995/96 class on statistical NLP was the last straw sending me down the statistical NLP path.

I learned how to program professionally at SpeechWorks (2000–2002). For that, you need a real project, a good team, and a great mentor, like Sasha Caskey.

8 Responses to “Computational Linguistics Curriculum”

  1. Teo D'Smyrni Says:

    Thanks, got a lot of useful info here. Can you perhaps suggest a topic for a senior project in this area? Thanks in advance.

  2. Bob Carpenter Says:

    Teo:

    This is a great area for projects because there’s so much free data. Or you could create some of your own data. Projects can be more or less linguistically oriented.

    At the simplest level, there’s just throwing something at a new corpus. Like say our latent Dirichlet clustering in some language or domain other than English news. Or even simpler, topic extraction over time.

    Or you could create your own named entity corpus using our annotation tool and use it to build a named entity extractor.

    For that matter, you could follow my instructions and build a Chinese search engine. Or Thai or Turkish for that matter.

    Take a look through the ACL proceedings — there are lots of student papers and poster write-ups in there. I’d aim at getting something you could submit to the ACL student session as a concrete goal. Assuming you’re going to write it in English, that is.

    At the most complex level, you could try to take on a whole new problem or develop a new model for an existing problem. That’s much harder, though.

  3. Shreya S. Says:

    Hello! I’m a high school senior and I’m interested in computational linguistics as an area of study. I’ve always liked languages and linguistics, and I’m taking a computer science course in school this year.

    How do you think I can begin to learn about computational linguistics? If you have any suggestions, I would really appreciate it. (I’ve seen your list of book recommendations: is that up to date?)

    Also, from what I’ve seen, it seems like I would have to major in computer science and minor in linguistics. I’m definitely applying to the colleges in-state (Texas for me), and I’m going to apply to some schools out of state, but I’m trying to narrow my list down. Do you have any input as to which institutions might give me a good foundation for computational linguistics?

    Thanks and regards!
    Shreya S.

    • Bob Carpenter Says:

      You’re in luck!

      There’s interesting work going on at U.T. Austin in Ray Mooney’s group and Jason Baldridge’s group (they have a whole center around stats and language and AI issues at Austin). There’s also a strong linguistics department at UT.

      I’d also check out UNT — Rada Mihalcea’s there and she’s a world class NLP researcher.

      If you can get into and afford Stanford, that’s the obvious choice these days. UC Berkeley or UPenn would also be great. All three of these are strong in both linguistics and computational linguistics.

      Alas, the list of textbooks is up to date. Start with the OSU Language Files — it’s the best, easiest-to-understand overview of linguistics. All the intros to comp ling are pretty math heavy, so they might be a bit daunting if you haven’t already done college-level computer science. You might start with the NLTK book from O’Reilly (written by a classmate of mine and my grad school advisor). It’s intended for beginners in both CS and linguistics and has working examples in Python, which is worth knowing itself.

      Plenty to keep you busy for the rest of the summer.

  4. jennifer Says:

    Hi, I have a degree in humanities with English literature, Spanish translation and linguistics comprising the core of it. I am toying with the idea of obtaining a degree in computational linguistics. I am, however, not very good at math (average, I would say), and I am wondering if you can advise me on the subject. Thanks.

    • Bob Carpenter Says:

      I think you’ll find modern computational linguistics to be very mathematically oriented towards statistics. Check out the proceedings of a conference like ACL to get a feel for what the academic side of the field is like:

      http://aclweb.org/anthology-new/

      A good place to start if you’re not mathematically inclined is Bird et al.’s NLTK package (it’s free and oriented toward teaching in linguistics departments, not computer science departments).

      You must must must take the real programming classes, though — you at least need to know basic data structures and algorithms.

  5. Aslan Says:

    Hi Bob,

    Great post! I do see that it was written a long time ago and hope that you can still find time to offer your insight.

    I graduated from the University of Michigan with an Honor in Linguistics (2004) and an MA in Cognitive Semiotics from the University of Aarhus, Denmark (2010). My MA thesis was ‘An Enactive Approach to Courtship Study: Enactive, Schematic Behavioral Patterns Leading Toward Attachment in Male Pickup’. Since then and before, I have drawn directly and indirectly on my uniquely formed views of language to serve me in the professional sector: teaching English to foreigners at a school, coaching men and women on how to pick up the opposite sex for a company named Charisma Arts, and working for various hospitality companies project managing large-scale groups. However, my passions still revolve around linguistics, language use, and behavioral patterns and how these correspond with thinking and achieving goals.

    I am keenly interested in dating sites, online interaction in courtship circles, and of course analyzing real life interactions.

    What would you recommend as a course track for someone who wishes to tackle these topics from an NLP or computational linguistics perspective? I have not taken statistics or any computer programming courses. I do consider myself self-motivated and, due to the nature of my current work, am free to set my own schedule and implement various sorts of study regimes. Returning to school, formal or not, is an option (preferably in Europe, most so in Denmark, Norway, Sweden or Germany), but I would want it to deliver a solid job skill set, marketable to companies and eventually tailored towards my entrepreneurial drive.

    Here are some English-language programs that I came across before coming upon your blog post via Chris’s at the Lousylinguist:
    http://www.sfs.uni-tuebingen.de/studium-lehre/studiengaenge/computerlinguistik/internationaler-ma-studiengang-iscl.html

    http://www.uni-marburg.de/fb10/studium/studiengaenge/malingwebtech/studium?language_sync=1

    http://www.lel.ed.ac.uk/lel_students/postgraduate/speech_and_language_processing/index.php

    As well as some online courses that may give me a jump in programming before I would potentially go off to university or market:

    http://lifehacker.com/plan-your-free-online-education-at-lifehacker-u-fall-s-1201482793#tech

    Your thoughts and advice are very much appreciated,
    Aslan

    • Bob Carpenter Says:

      Get a job at OK-Cupid — they’re doing some amazing analytics work on their data!

      If you have a strong math background, you can teach yourself stats, but like anything, working with someone who knows what they’re doing is the best option. If you can handle it, I like Gelman and Hill’s regression book. Otherwise you can start with something more basic in math stats; barring that, it’s going to be an uphill battle. Something gentler is Kruschke’s book on Bayesian stats. The Wagenmakers book is also good, and perhaps somewhere in the middle; it’s also focused on cognitive psych, which seems like it’d fit your background.

      As to teaching yourself programming, you need to learn a programming language (think Java or C++), and then practice. The best practice I’ve found is the TopCoder challenge problems. You might start with Python, but it covers up a lot of the intricacies of what’s really going on. On the other hand, it may be all you ever need.

      As to NLP, I don’t do much of it these days, so I don’t know what to recommend on that front.
