I was all in the mood to polish up and publish some of the huge stack of draft posts that have been backing up recently when Blogger shit the bed last week. I decided to spend the time I wouldn't be blogging polishing up my overly rusty C++ chops, so I turned my attention to writing a program that would generate crossword puzzles. (Just fitting answers into a gird, not writing clues for them.) It's something I got assigned to do way back in undergrad, but I never had the time to do it right, and as a result I've wanted to revisit it for years.
After a surprisingly few hours over the weekend and it's working pretty well now. The key, as is so often the case, was having the right data structure to store the lexicon of possible words. It only needs to support one basic operation: determine if there is an entry n letters long which begins with a certain prefix. It's a bonus if you can populate the lexicon from an unordered list quickly, but you don't need to worry about deletion or insertions more generally.
I'm still playing around with some optimizations, like trying to fill the cells in Cantor diagonal order rather than row-major order, and perhaps incorporating letter frequency.
I think the biggest thing holding it back right now is the lack of a big enough list of words to choose from. Right now I only have 58,000 options to slot in to a puzzle. Longer words are in especially short supply. Does anyone have an idea where I can find a plain text file with English words? Preferably lots and lots of words. Most of the things I've found are aimed at linguists, so they're annotated with all sorts of parts-of-speech and pronunciation information that I don't want to have to deal with.
Bonus points if you know where I could find a list of phrases and names and idioms and other sorts of potential answers, but right now I'll settle for simple words.
Here are a couple of samples my system has come up with:
Some of the words in there are admittedly odd — "ide," "ye" — but they're in one of the word lists I'm working from, so the fault is there and not my algorithm's. On the other hand some of the words are great — "gnu," "mesons," "infanta" — so I'll take the strikes with the gutters.