This is a post about a question from Cracking the Coding Interview. It’s boring, so probably go watch Stranger Things instead.
“Given a list of millions of words, design an algorithm to create the largest possible rectangle of letters such that every row forms a word (reading left to right) and every column forms a word (reading top to bottom). The words need not be chosen consecutively from the list, but all rows must be the same length and all columns must be the same height.”
Example:
spears
planet
easily
animal
relate
styles
Occasionally I’ll try a practice problem like this on a whiteboard. Usually I learn something cool or at least feel a little less like I’m going to curl up into a ball the next time I have a programming interview, but this one got super weird and interesting.
Try to come up with one of these rectangles with just the words you know; it’s like a crossword puzzle with no clues or boxes, and it’s horrible. I muddled along, though, and with just a couple of hints from the book I came up with something, question mark? Let’s go through it.
When you’re building up a rectangle row by row, you tend to get stuck pretty quickly with something like
necessary
potential
classroom
because npc isn’t the start of any word I can think of. If the words are stored in something called a prefix tree (these are also called tries, but I don’t know how to say that and it isn’t very descriptive), you can easily check this for any partial rectangle you’ve built up, and cry because basically nothing works. But if you’re a computer, you wouldn’t cry, because that’s ridiculous, and because it means you don’t have to try quite so many rectangles and can move on to rectangles that still have potential.
So, let’s just assume we know all about prefix trees and have some object that tells us whether something is a prefix of a word in the list:
tree.isPrefix('spe')
True
tree.isPrefix('npc')
False
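If you haven’t seen one, a prefix tree is just nested maps from letters to subtrees. Here’s a minimal toy version that supports the calls above (a sketch of the idea, not the tree from my actual solution):

class PrefixTree:
    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for letter in word:
                # Each letter maps to the subtree of everything that can follow it.
                node = node.setdefault(letter, {})
            node['$'] = True  # marks the end of a complete word

    def isPrefix(self, s):
        # Walk down one letter at a time; falling off the tree means no word starts with s.
        node = self.root
        for letter in s:
            if letter not in node:
                return False
            node = node[letter]
        return True

    def isWord(self, s):
        node = self.root
        for letter in s:
            if letter not in node:
                return False
            node = node[letter]
        return '$' in node

Building it over the whole word list is cheap compared to the search itself, and isPrefix only ever does len(s) dictionary lookups.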
Now, instead of doing a search on all possible 10 x 10 word rectangles (conservatively, if you had just ten 10-letter words in your language, that’s already ten billion rectangles), we can “prune off” some of the rectangles before they’re even fully formed, and go right from
spears
planet
planet # spp? I don't think so
to
spears
planet
easily
skipping right past
spears
planet
planet
planet # a person would have stopped trying by now
After this, it’s basically brute force. I took the book’s advice and tried areas in decreasing order until finding a size that was possible, so the main loop does that and calls a function sort of like this at each step (I changed a few things to make it clearer what’s going on):
def makeRectangle(length, width, words, rect):
    if rect.width == width:
        if isFullRect(rect):
            return rect
        else:
            return None
    for word in words:
        rect.addWord(word)
        if not isPartialRect(rect):
            # Prune this one! Don't try any of its descendants.
            rect.removeWord()
            continue
        else:
            fullRect = makeRectangle(length, width, words, rect)
            if fullRect:
                # The recursive call got it!
                return fullRect
            # Or it didn't, and we remove the failed word to try the next one.
            rect.removeWord()
    # If all those fail, it's impossible.
    return None
You might recognize this as a depth-first search - it goes deep and tries everything it can with a given first word before trying with the next first word.
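The isPartialRect and isFullRect checks are where the prefix tree comes in. They aren’t shown above, but assuming the rectangle is just a list of equal-length row strings and tree is a prefix tree like the toy one earlier (my real versions differ in the details), they’d look something like this:

def isPartialRect(rect, tree):
    # Every column built so far has to be the prefix of some word.
    for col in range(len(rect[0])):
        prefix = ''.join(row[col] for row in rect)
        if not tree.isPrefix(prefix):
            return False
    return True

def isFullRect(rect, tree):
    # Every column has to be a complete word.
    for col in range(len(rect[0])):
        column = ''.join(row[col] for row in rect)
        if not tree.isWord(column):
            return False
    return True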
There are a few elements of the solution I’m glossing over - walking down potential areas to see if any of them are possible, splitting the list of words up into the different lengths - check out my full solution here.
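For a rough picture, that outer loop has roughly this shape (makeRectangle is the function above; emptyRectangle and the exact bookkeeping are stand-ins, not my actual code):

def makeLargestRectangle(wordList):
    # Split the word list up by length so rows and columns only draw from the right bucket.
    wordsByLength = {}
    for word in wordList:
        wordsByLength.setdefault(len(word), []).append(word)

    # Try every (length, width) pair in decreasing order of area;
    # the first size that works gives the largest rectangle.
    sizes = sorted(((length * width, length, width)
                    for length in wordsByLength
                    for width in wordsByLength),
                   reverse=True)
    for area, length, width in sizes:
        # emptyRectangle() is a stand-in for however the rectangle object starts out.
        rect = makeRectangle(length, width, wordsByLength[length], emptyRectangle())
        if rect is not None:
            return rect
    return None

The sort is what makes the first rectangle that comes back also the biggest one possible.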
Once I debugged it on this very small test set
['dog',
'emu',
'fog',
'def',
'omo',
'gug']
(I’m not good at crossword puzzles so I cheated), I put it to work on /usr/share/dict/words, a file with about 230,000 words in it, which has words almost as obscure as ‘gug’ but is a good, smallish example of the type of data the question asks about.
My program chugged away for a few minutes without coming up with anything, so I ran a profiler on it to see what was taking the most time. Over the next day, I tried everything I could think of (short of rewriting the damn thing in C) to get this program to run faster, including but not limited to storing rectangles as byte arrays, storing rectangles as columns instead of rows, and replacing the prefix tree with a huge hash table of strings that were valid prefixes. It still couldn’t finish the search for words much shorter than 16 letters (the search is easier the more letters you’re trying to use per word, because of how few words there are with 16+ letters).
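One of those attempts, the huge hash table of valid prefixes, is simple enough to show - it trades a pile of memory for plain set lookups (a rough sketch, with words standing in for the full word list):

# Precompute every prefix of every word into one big set.
prefixes = set()
for word in words:
    for i in range(1, len(word) + 1):
        prefixes.add(word[:i])

# The prefix-tree lookup then becomes a plain membership test:
# 'spe' in prefixes  ->  True
# 'npc' in prefixes  ->  False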
Finally, frustrated, I turned to the solution in the book. Just reading the code, I couldn’t figure out why it would be much faster than mine, so I thought I’d clone it and try to run it myself. The solution is written in Java, which I know barely anything about, and the author recommends using Eclipse, which I know nothing about. I halfheartedly tried to bully it into running the code for a few minutes, but then I actually looked at the test that was going to run, and saw that the list of words it was going to use was just a couple thousand words long. 🤔 I ran my program on this test case, and it spat out the answer in 1.5 seconds.
Unsatisfied, I grabbed this list of the 10,000 most common English words, and just under two hours later I had
spears
planet
easily
animal
relate
styles
to show for it.
So, anybody else think about or solve this one? Can any Java person try it and tell me if the book solution can handle the Unix words file? I believe that the question as originally posed isn’t possible, and not just because I couldn’t do it.