Video demo:
Where does this idea come from
I have been playing various NYT games for quite some time, and I have been thinking about writing a helper or solver since early this year.
- For logical puzzles like pips, I’m mostly aiming for a solver.
- It has to be fully automated, i.e. it needs to open the game page, figure out the configuration of the game, come up with the solution, and finally apply the solution on the page, all programmatically.
- The “come up with the solution” part isn’t particularly interesting, as it can be solved purely by backtracking.
- The pips game is the first one I chose to target because the game itself is closed-ended, and I’ll use this opportunity to lay out the basic structure and iterate on top of it for other games.
- For word puzzles like crosswords (only vague ideas at this moment):
- I’ll make an inline helper, as I’m not hopeful this can be solved purely programmatically (with the help of LLMs).
- The helper will behave like this: for single word clues, check some thesaurus API and filter results based on the length of the answer and the cells that are already filled (I bet we won’t need rebus for this kind of clue); for clues that boil down to an entity name (like a singer, actor, place, etc.), either query Wikipedia and do some pattern matching or ask an LLM. I don’t yet have a good idea for other types of clues, and I doubt I’ll ever make something for puns or themes. The results will simply be displayed, and it’s up to the gamer (myself) to choose one and type it into the grid.
- These two will take a while to come to a working state, and I’ll decide on the next games afterwards.
- Specifically for wordle, I have an alpha-beta pruning based solution (the player tries to minimize the remaining options, while the imaginary opponent tries to maximize them), but it’s purely CLI based (I have to type the word into the page and feed the response back to the program). Maybe I’ll automate it? But wordle seems to have lost its popularity, and I don’t find it interesting any more.
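To make the wordle idea concrete, here is a minimal sketch of the “minimize the worst case” step. It is a one-ply simplification, not the alpha-beta search itself, and the names `feedback`, `worst_case`, and `best_guess` are hypothetical ones I picked for illustration. The imaginary opponent picks the feedback that leaves the most candidates, and the player picks the guess whose worst bucket is smallest.

```python
from collections import Counter


def feedback(guess: str, answer: str) -> str:
    """Wordle-style feedback: G = right spot, Y = wrong spot, . = absent."""
    result = ["."] * len(guess)
    # Answer letters that were not matched exactly can still produce yellows.
    unmatched = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
    for i, g in enumerate(guess):
        if result[i] == "." and unmatched[g] > 0:
            result[i] = "Y"
            unmatched[g] -= 1
    return "".join(result)


def worst_case(guess: str, candidates: list[str]) -> int:
    """Opponent's move: size of the largest group of candidates sharing one feedback."""
    buckets = Counter(feedback(guess, answer) for answer in candidates)
    return max(buckets.values())


def best_guess(candidates: list[str]) -> str:
    """Player's move: the candidate whose worst-case bucket is smallest."""
    return min(candidates, key=lambda g: worst_case(g, candidates))


if __name__ == "__main__":
    words = ["crane", "crate", "grate", "slate"]
    print(best_guess(words))
```

A deeper search would recurse into each bucket instead of stopping after one guess, which is where pruning starts to pay off.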
So to summarize, the ideal outcomes are:
- Some tool (browser extension?) I can use to help me solve (cheat on?) crosswords
- More experience with browser automation
- And some experience with LLMs.
Some back and forth
The very first setback was the aggressive bot prevention on the NYT website. Even when I use my home Internet without a VPN and launch a manually installed Chrome browser (not the ones installed by playwright), I still can’t get past their bot blocker. So I switched to building a browser extension instead.
Apart from parsing the game, programmatically dispatching the right JavaScript events from the extension proved much harder than I anticipated. NYT pips doesn’t use the canonical drag and drop, and there are a myriad of MouseEvent types. I was never able to “drag the domino and drop it on the board”. Maybe I could reverse engineer the event handling implementation, but I decided against that rabbit hole. With playwright, you just need to say “mouse down at this location, move the mouse to that location, and mouse up”. From the event handler’s perspective, it’s no different from a user actually doing that with a physical mouse. In other words, you can work at a higher level of abstraction.
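The “mouse down, move, mouse up” sequence can be sketched roughly like this. `page.mouse` and `locator(...).bounding_box()` are real Playwright APIs, but the helper names (`center`, `drag`, `drag_between`) are hypothetical, and any selectors passed in would have to come from inspecting the actual page, which I’m not reproducing here.

```python
def center(box):
    """Center point of a bounding box dict with x, y, width, height keys."""
    return (box["x"] + box["width"] / 2, box["y"] + box["height"] / 2)


def drag(mouse, start, end, steps=10):
    """Press at `start`, move to `end`, release.

    `mouse` is expected to behave like Playwright's `page.mouse`.
    """
    mouse.move(start[0], start[1])
    mouse.down()
    # `steps` makes Playwright emit intermediate mousemove events,
    # which is closer to what a real drag looks like to the page.
    mouse.move(end[0], end[1], steps=steps)
    mouse.up()


def drag_between(page, source_selector, target_selector):
    """Drag from the center of one element to the center of another.

    Both selectors are placeholders supplied by the caller.
    """
    source = page.locator(source_selector).bounding_box()
    target = page.locator(target_selector).bounding_box()
    if source is None or target is None:
        raise ValueError("source or target element is not visible")
    drag(page.mouse, center(source), center(target))
```

Nothing here needs to know which events the page listens for; the browser generates them from the low-level mouse actions.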
Then I came back to the playwright-based approach after finding that you can manually launch the browser with remote debugging turned on. This way you bypass the bot blocker and can still control the browser programmatically with playwright via CDP (is Firefox supported?).
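A minimal sketch of attaching to a manually launched browser, assuming port 9222 and a throwaway profile directory (both arbitrary choices for illustration). `connect_over_cdp` is a real Playwright call; as far as I can tell from the Playwright docs it is only supported for Chromium-based browsers. The helper names and the separate `--user-data-dir` are my assumptions, the latter because I believe recent Chrome versions want a non-default profile when remote debugging is on.

```python
def chrome_args(port, profile_dir):
    """Flags to pass when starting Chrome by hand (the binary path varies by OS)."""
    return [
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",  # assumption: use a non-default profile
    ]


def cdp_endpoint(port):
    """HTTP endpoint that Chrome exposes for CDP clients."""
    return f"http://localhost:{port}"


def attach_and_print_title(port=9222):
    """Attach to the already running browser and print the first tab's title."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_endpoint(port))
        # The existing browser window shows up as the default context.
        context = browser.contexts[0]
        page = context.pages[0] if context.pages else context.new_page()
        print(page.title())


if __name__ == "__main__":
    # Start Chrome yourself first, e.g. with the flags from chrome_args(9222, "/tmp/pips-profile").
    attach_and_print_title()
```

Because the browser is started by hand, playwright only attaches to it rather than launching it.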
My very first experience with Gemini
One thing I realized is it’s very hard to parse the board from the underlying HTML.

I tried to take a screenshot of the particular DOM element and analyze it in Python, but it’s not straightforward to handle the restrictions.
So I decided to give Gemini a try. So far:
- It can only meaningfully handle easy boards.
- Even for easy boards, it constantly makes mistakes, and it gives different results for the same board every time you try.
What’s next
I’ll try to tune the prompt more, maybe with whatever I can infer from the HTML or the DOM screenshot, and see if I can make it more reliable, so that we don’t need the “open editor and verify/edit” step.
Then for the browser extension based approach, I most likely won’t go back down the rabbit hole of reverse engineering their event handling implementation, but will instead just draw a simple animation in the popup.
And lastly, I’ll research how to persist the state of the popup. Right now whenever you close it, everything is lost.