16 May 2011

Lexicon Bleg

I was all in the mood to polish up and publish some of the huge stack of draft posts that have been backing up recently when Blogger shit the bed last week.  I decided to spend the time I wouldn't be blogging polishing up my overly rusty C++ chops, so I turned my attention to writing a program that would generate crossword puzzles. (Just fitting answers into a gird, not writing clues for them.) It's something I got assigned to do way back in undergrad, but I never had the time to do it right, and as a result I've wanted to revisit it for years.

After a surprisingly few hours over the weekend and it's working pretty well now. The key, as is so often the case, was having the right data structure to store the lexicon of possible words. It only needs to support one basic operation: determine if there is an entry n letters long which begins with a certain prefix. It's a bonus if you can populate the lexicon from an unordered list quickly, but you don't need to worry about deletion or insertions more generally.

I'm still playing around with some optimizations, like trying to fill the cells in Cantor diagonal order rather than row-major order, and perhaps incorporating letter frequency.

I think the biggest thing holding it back right now is the lack of a big enough list of words to choose from. Right now I only have 58,000 options to slot in to a puzzle.  Longer words are in especially short supply. Does anyone have an idea where I can find a plain text file with English words? Preferably lots and lots of words.  Most of the things I've found are aimed at linguists, so they're annotated with all sorts of parts-of-speech and pronunciation information that I don't want to have to deal with.

Bonus points if you know where I could find a list of phrases and names and idioms and other sorts of potential answers, but right now I'll settle for simple words.

Here are a couple of samples my system has come up with:

*********
*fridges*
*lenient*
*infanta*
*tarrier*
*slayers*
*********



***************
*you**keel*eta*
*alpha*guy*hod*
*mesons*rob**d*
****lie*onuses*
*ye*son***gnu**
*em*t*tome*or**
*tenet***gabon*
**tu*oval*s*pa*
**ide***yap*eg*
*scenic*ole****
*t**doh*nectar*
*egg*tin*stoma*
*moo*aces**opt*
***************

Some of the words in there are admittedly odd — "ide," "ye" — but they're in one of the word lists I'm working from, so the fault is there and not my algorithm's.  On the other hand some of the words are great — "gnu," "mesons," "infanta" — so I'll take the strikes with the gutters.

5 comments:

  1. Don't know if this will work, but here (https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/) is a rhyming dictionary maintained by Carnegie Mellon University. The file you want is cmudict.0.7a; I think it has something like 100,000 or 150,000 entries, in plain ASCII.

    ReplyDelete
  2. Thanks! That's just the sort of thing I need.

    ReplyDelete
  3. Why would you use C++ for this? It seems like masochism to me. A hundredfold simpler and easier in C.

    ReplyDelete
  4. Honestly? Because the job listings I was looking through last week all required C++, so I figured I might as well be sharpening a tool people were interested in having me wield.

    I played around with doing this in Python last year because I wanted a project to work on while I taught myself that language, but I decided not to since speed is really a factor for this project. Ditto Ruby.

    ReplyDelete
  5. Ahh, makes sense. Then I should redirect the question to the businessfags who are making such a blunder. Ah well, all you can do is enjoy the show as the whole project goes fubar because some suit chose a gimmicky trainwreck of a language :P

    ReplyDelete