Testing The Untestable · The Reluctant Guide to Shopify Migrations

A modern Plus build is four testing problems sitting side by side, and the unsettling thing is that none of them is first-class. Shopify spent years building one of the most extensible commerce platforms available. You can rewrite the checkout. You can inject your own logic into the pricing and discount engine. You can render native components inside the admin. You can run a full custom app that talks to every corner of the platform. It is a genuinely remarkable amount of surface area to be handed.

What Shopify didn’t build, to anywhere near the same degree, is a testing story.

That gap used to be a footnote. It isn’t anymore, because the build that triggers it is now the normal build. A serious Plus store today routinely spans themes carrying real business logic, a handful of Functions, one or two checkout or admin extensions, and a custom app receiving webhooks. In a normal Rails or Next.js project, you have one testing stack for everything: one runner, one fixture system, one set of conventions everybody already knows. Here, you have four different answers, and none of them is first-class. On the surfaces Shopify hosts, you control neither your own data nor your own environment. The tooling that was designed for exactly your situation exists for, generously, one of the four.

To the engineering manager who priced this migration off a proposal that had a single line item called testing: the line item is four line items. Two of them are improvised. One of them does not exist at all. None of this is a reason to panic—it is workable, all of it, and teams ship beautifully tested Plus stores every week. But the budget the proposal quoted was the budget for testing a normal application, and this is not a normal application. Better you learn that here than in the retro.

Start with the layer where the news is worst, because everything else feels generous by comparison. Shopify ships no test framework for themes. No runner, no helpers, no fixture system, nothing. You are handed Liquid that compiles and renders on Shopify’s servers, never in your hands at runtime, and told, in effect, good luck.

The approach that works—the one we reach for first—is Playwright running against a preview theme. When a pull request opens, you generate a preview theme off the live one, point Playwright at it, and run the things that actually matter: accessibility checks, end-to-end browser flows, a real checkout completed with Shopify’s test card numbers. Failures gate the merge. It is a sturdy setup, and for the flows that carry money it is the right one.
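One small piece of that gate can be sketched without any test runner at all: turning a preview theme into the URLs the browser tests will visit. This is a minimal sketch, assuming the storefront's `preview_theme_id` query parameter is how the unpublished theme gets rendered; the store domain, theme ID, and the list of paths are placeholders, not values from any real project.

```typescript
// Paths the gate should always visit. Illustrative only; pick your own.
const CRITICAL_PATHS = ["/", "/collections/all", "/cart"];

// Builds a storefront URL that renders a specific unpublished theme.
// `storeDomain` and `themeId` are placeholders; in CI they would come from
// whichever step created the preview theme for the pull request.
function previewUrl(storeDomain: string, themeId: string, path = "/"): string {
  const url = new URL(path, `https://${storeDomain}`);
  url.searchParams.set("preview_theme_id", themeId);
  return url.toString();
}

// One preview URL per critical path, in the order listed above.
function previewUrls(storeDomain: string, themeId: string): string[] {
  return CRITICAL_PATHS.map((path) => previewUrl(storeDomain, themeId, path));
}
```

In a Playwright spec, each of these would be handed to `page.goto` before the assertions run, so every test in the suite is looking at the pull request's theme rather than the live one.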

It works within one hard limit, and the limit is the whole story of this layer. The test environment is read-only. You can navigate the store, fill a cart, complete a checkout—but you cannot create test data. If the behavior you need to verify depends on a specific product configuration, that product has to already exist in the store at the moment the test runs. If it doesn’t exist, you do not have a slower way to test the scenario. You have no way to test the scenario.

What “Read-Only Test Environment” Means When You’re Trying To Seed Data.

Picture the engineer who has written exactly this kind of test a thousand times on a normal stack. The first three lines of the test are always the same: create the customer, create the product, put the product in the cart. It is muscle memory; it is the spine of every integration test ever written.

So they write create the product—and there is no create the product. There is no API that will let the test build the world it needs to run in. The store is a museum: you may walk through it and look at what is on the walls, but you may not hang anything new, and the exhibit you came to photograph is either already up or it is not.

The reader who has sat there at 11pm, cursor blinking after a comment that just says // TODO: how do we even get a subscription product in here, should feel seen. You did not do it wrong. The door is locked from the other side.

The partial mitigation is a staging store: a dedicated store, hand-stocked with a curated set of products that cover the configurations you care about. This genuinely helps. It also introduces a problem that is quieter and more corrosive than the one it solves, because staging stores drift. A product configuration that exists in staging but not in production—or in production but not in staging—does not announce itself. It generates a false positive, a green check against a world that no longer matches the real one, and it does this silently, and it does it more often the longer the store lives. The suite keeps passing. What it is passing against slowly stops being true.
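Drift can at least be made to announce itself. The sketch below compares two snapshots of the products the suite depends on and reports where staging and production disagree. Everything about the snapshot shape is an assumption made for illustration (`ProductSnapshot`, its fields, and how you would export it from each store are all hypothetical); the point is only that the comparison is cheap once the snapshots exist.

```typescript
// Hypothetical snapshot shape: one entry per product the suite depends on.
interface ProductSnapshot {
  handle: string;
  variantCount: number;
  hasSellingPlan: boolean; // for example, a subscription product
  tags: string[];
}

interface DriftReport {
  missingInStaging: string[];
  missingInProduction: string[];
  changed: string[];
}

// Reduces a product to the configuration the tests care about.
// Tags are sorted so that ordering alone never counts as drift.
function fingerprint(p: ProductSnapshot): string {
  return JSON.stringify([p.variantCount, p.hasSellingPlan, p.tags.slice().sort()]);
}

function indexByHandle(list: ProductSnapshot[]): Record<string, ProductSnapshot> {
  const index: Record<string, ProductSnapshot> = Object.create(null);
  list.forEach((p) => { index[p.handle] = p; });
  return index;
}

function detectDrift(staging: ProductSnapshot[], production: ProductSnapshot[]): DriftReport {
  const stagingIndex = indexByHandle(staging);
  const productionIndex = indexByHandle(production);
  const report: DriftReport = { missingInStaging: [], missingInProduction: [], changed: [] };

  production.forEach((prod) => {
    const stage = stagingIndex[prod.handle];
    if (!stage) report.missingInStaging.push(prod.handle);
    else if (fingerprint(stage) !== fingerprint(prod)) report.changed.push(prod.handle);
  });
  staging.forEach((stage) => {
    if (!productionIndex[stage.handle]) report.missingInProduction.push(stage.handle);
  });
  return report;
}
```

Run on a schedule, a non-empty report is a reason to distrust the next green check before anyone has to find out the hard way.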

For snippets where the logic branches hard enough that you want to exercise twenty combinations rather than two, Playwright is too coarse and too slow. So our team built and open-sourced minitest_shopify_themes, a small library that renders theme snippets and sections in isolation using Minitest and Capybara, with fixtures you inject directly instead of depending on store state. You pass in the data, you assert against the rendered output, and you do it for as many combinations as you like, quickly and deterministically. We are calling it an experiment because that is honestly what it is. It covers what it covers well; dynamic blocks and inter-snippet rendering misbehave in its test environment; and because Shopify blesses none of this, there is no promise it survives the next platform change. We use it the way you would use any sharp tool you made yourself—deliberately, for the cuts it is actually good at, with Playwright still doing the structural work underneath.

After the theme layer, Functions feel like a different platform built by different people who liked you more. This is the bright spot, and it is bright precisely because of a design decision, not a tooling decision.

Functions compile to WebAssembly and run as pure functions. JSON shaped by a GraphQL input query goes in, JSON matching Shopify’s schema comes out, and in between there is no network, no external state, nothing to reach for and nothing to mock. The whole function is its own test fixture. There is no world to construct around it because it does not depend on a world. It depends only on its input, and its input is a value you can write by hand. You import the Function directly into a test file, call it like any other piece of code, hand it an input, and assert against the output. That is the entire ceremony.
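To make the ceremony concrete, here is a minimal sketch of a Function written as a plain function. The input and output shapes are simplified stand-ins invented for this example, not Shopify's actual schema, and the discount rule itself is hypothetical; in a real project the types would be generated from your input query.

```typescript
// Simplified, hypothetical shapes. Not Shopify's real Function schema.
interface CartLine { id: string; quantity: number; }
interface RunInput { cart: { lines: CartLine[] }; }
interface Discount {
  targets: { cartLineId: string }[];
  percentage: number;
  message: string;
}
interface RunResult { discounts: Discount[]; }

const NO_DISCOUNT: RunResult = { discounts: [] };

// Hypothetical rule: 10% off every line with a quantity of 3 or more.
// No network, no clock, no store state: the result depends only on `input`.
function run(input: RunInput): RunResult {
  const targets = input.cart.lines
    .filter((line) => line.quantity >= 3)
    .map((line) => ({ cartLineId: line.id }));
  if (targets.length === 0) return NO_DISCOUNT;
  return { discounts: [{ targets, percentage: 10, message: "Bulk discount" }] };
}
```

A test is then nothing more than a hand-written input and an assertion on the returned value: a cart with one line at quantity 1 and one at quantity 3 should produce a single discount targeting only the second line, and an empty cart should produce none.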

Shopify leans into this. A type-generation tool, typegen, reads your GraphQL query and emits TypeScript types from Shopify’s actual schema, so your editor autocompletes against reality and the compiler catches a structural mismatch before it can reach production. You wire the tests into CI, gate the deploy on them, and a Function whose tests fail simply does not ship. For once the platform is holding up its end.

The limit is worth naming plainly, because it is easy to over-trust a layer that finally behaves. This is unit testing, and only unit testing. It tells you the Function is correct in the abstract. It does not tell you the Function is correct against real Shopify data in a real store, because the seeding constraint from the theme layer never left—if triggering the edge case requires a particular cart or product configuration to actually exist, you cannot conjure it here either. For those scenarios the unit test is the ceiling, and you should know you are standing on it.

UI Extensions are where the documentation runs out entirely and you are left building in the dark with your hands. Checkout extensions, order-status extensions, admin extensions—they are React components rendering inside iframes that Shopify serves and Shopify owns. You can reach one from a Playwright test. You cannot control the context it runs in. There is no official testing path, none, not even a wrong one to argue with.

So we made one up, and we want to be precise about the phrase made up. We mock Shopify’s extension library APIs at the module level: global mocks that let the extension code run in isolation, away from the Shopify host, so the rendering logic and the conditional behavior and the metafield reads can be exercised without a live checkout standing by. We are not aware of any documentation describing this. We worked it out from the shape of the library and from years of mocking similar host APIs in other contexts. For an extension carrying real logic, the coverage is meaningful, and it is a great deal better than nothing.
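The shape of the idea can be shown without the extension library or a test runner. What follows is not the module-level mock itself but a plainer stand-in for it: the extension's decision logic is written against a narrow host interface, and a hand-rolled stub plays the part that a runner's module mocking (for example `vi.mock` in Vitest or `jest.mock` in Jest) would play in a real suite. `HostApi`, `bannerFor`, the metafield names, and the banner copy are all hypothetical.

```typescript
// Hypothetical slice of what an extension reads from its host.
// The real extension library exposes far more; these names are illustrative.
interface HostApi {
  cartLineCount(): number;
  metafield(namespace: string, key: string): string | undefined;
}

// The logic under test: decide which banner, if any, the extension shows.
function bannerFor(host: HostApi): string | null {
  if (host.cartLineCount() === 0) return null;
  const tier = host.metafield("loyalty", "tier");
  if (tier === "gold") return "Free express shipping applied";
  if (tier === "silver") return "Free standard shipping applied";
  return null;
}

// A stub standing in for the mocked library. Metafields are keyed
// as "namespace.key" purely for convenience in tests.
function stubHost(lines: number, metafields: Record<string, string>): HostApi {
  return {
    cartLineCount: () => lines,
    metafield: (namespace, key) => metafields[`${namespace}.${key}`],
  };
}
```

The stub encodes a belief about how the host behaves. That belief is exactly the part nobody at Shopify has promised to keep true.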

The risk is real, and it is the kind that does not knock before entering. If Shopify changes the behavior of the extension library, our mocks do not change with it. The drift is silent: the tests stay green while production quietly diverges from the thing the tests believe in. We have accepted that tradeoff with our eyes open, because the alternative is not better. End-to-end tests against a real Shopify checkout are slower, flakier, and still pinned under the same seeding constraint: if the extension’s behavior turns on a specific cart state or metafield value, you cannot reliably build that state at test time anyway. No clean answer exists here. You pick the tradeoff you would rather live with, and you make peace with the fact that you are making a choice rather than finding a solution.

The app layer is where you finally get to feel competent again, mostly. It is the most familiar surface on the stack: a standard web application, a mature framework—Rails and RSpec, in our case—webhook payload fixtures, HTTP request mocking, the whole comfortable apparatus you already know how to operate. Most of it is unremarkable, and unremarkable is a luxury after the last three sections.

There are two traps, and both are the silent kind, which on this layer is the only kind worth warning you about.

The first is webhook versioning. A Shopify webhook carries the payload structure of whatever API version your app is configured against, and your fixtures mirror that structure faithfully—at the moment you write them. Then Shopify deprecates an old version and you upgrade, as you eventually must, and your mocked payloads can quietly drift away from what Shopify now actually sends. The tests keep passing. They are passing against payloads that no longer reflect reality, which is a more dangerous green than a red, because a red at least asks for your attention. Catching this is not automatic; it is a discipline. Every single time you bump an API version, you pull up the webhook fixtures, check them against the new version’s schema, and update what changed. Put it in the upgrade checklist, because it will not put itself there.
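Part of that checklist item can be mechanized. The sketch below compares the key structure of a stored fixture against a reference payload, which we assume you have captured from the new API version by whatever means you trust. It checks structure only, not values or types, and the field names in the usage note are invented for illustration rather than taken from any real Shopify version change.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Collects dotted key paths, e.g. "customer.email" or "line_items[].sku".
function keyPaths(value: Json, prefix = ""): string[] {
  if (Array.isArray(value)) {
    // Only the first element is inspected; assumes array items share a shape.
    return value.length > 0 ? keyPaths(value[0], `${prefix}[]`) : [];
  }
  if (value !== null && typeof value === "object") {
    const obj = value;
    let paths: string[] = [];
    for (const key of Object.keys(obj)) {
      const path = prefix ? `${prefix}.${key}` : key;
      paths.push(path);
      paths = paths.concat(keyPaths(obj[key], path));
    }
    return paths;
  }
  return [];
}

interface FixtureDiff {
  missingFromFixture: string[]; // present in the reference, absent in the fixture
  onlyInFixture: string[];      // present in the fixture, absent in the reference
}

function diffFixture(fixture: Json, reference: Json): FixtureDiff {
  const fixturePaths = keyPaths(fixture);
  const referencePaths = keyPaths(reference);
  return {
    missingFromFixture: referencePaths.filter((p) => fixturePaths.indexOf(p) === -1),
    onlyInFixture: fixturePaths.filter((p) => referencePaths.indexOf(p) === -1),
  };
}
```

If a fixture carries `email` and `line_items[].price` while the reference carries `contact_email` and `line_items[].price_set.amount`, both lists come back non-empty, and the version bump has something concrete to review instead of a suite that is green for the wrong reasons.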

The second trap has a name and a small confession attached. App Bridge, Shopify’s framework for rendering native admin UI inside your app, relies on the admin’s iframe context to authenticate and render, and you cannot replicate that context in a test environment. There is no honest way to test an App Bridge-wired component as such. The only path we have found that actually works is to refuse to put anything worth testing inside those components in the first place: extract the business logic out, ruthlessly, until App Bridge is a thin layer of presentation and everything that matters lives underneath it, in plain testable code. (Ask us how we know. We know because we once entangled the logic and the UI inside the App Bridge components, shipped it, and then had to pull them apart later under deadline, which is the most expensive possible time to learn that lesson.) App proxies, for what it is worth, are the easy cousin: those endpoints are just APIs, so test them like any other API and move on.
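What "everything that matters lives underneath it" can look like, as a minimal sketch: the rule is a plain function returning a plain value, and the embedded component's only job is to render whatever comes back. The order shape, the 30-day window, and the reason strings are hypothetical and exist only to give the example something to decide.

```typescript
// Hypothetical order shape; only the fields this rule needs.
interface OrderSummary {
  daysSinceDelivery: number | null; // null when not yet delivered
  totalCents: number;
  refundedCents: number;
}

type RefundDecision =
  | { allowed: true; maxRefundCents: number }
  | { allowed: false; reason: string };

// Plain business rule with no dependency on the admin iframe or any UI library.
function refundDecision(order: OrderSummary, windowDays = 30): RefundDecision {
  const remaining = order.totalCents - order.refundedCents;
  if (remaining <= 0) return { allowed: false, reason: "Already fully refunded" };
  if (order.daysSinceDelivery !== null && order.daysSinceDelivery > windowDays) {
    return { allowed: false, reason: "Outside the refund window" };
  }
  return { allowed: true, maxRefundCents: remaining };
}
```

The component that calls this stays untested, and that is acceptable precisely because there is nothing left in it to get wrong beyond showing a button or a message.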

Now the layer that doesn’t exist, which is the one that will actually hurt you, because it is invisible right up until the moment it isn’t.

Every layer above can be tested, to some degree, by some means. What cannot be tested with any reliability is how the layers behave together. Walk the path: a Function modifies the cart line items, a checkout extension reads those line items to decide what to show, the theme renders an element reflecting the extension’s output, and the app records the resulting order event. Four layers, one user journey. Test each layer in isolation and all four go green. Whether they compose correctly—under real conditions, with real data, in a real checkout—is a completely different question that none of those four green checks has answered.

This is the oldest problem in distributed systems wearing Shopify’s clothes: every component can be individually correct while the system as a whole is wrong. Shopify’s ecosystem is vast and compositional by design—that composability is the entire point of it, the reason you chose the platform. But composability and testability pull in opposite directions. The more freely the pieces combine, the more combinations there are that no test ever saw, and in a stack this distributed, real end-to-end integration testing is slow, fragile, and expensive to maintain—frequently so expensive that maintaining it honestly is not a thing anyone ends up doing.

So here is the line the previous chapter taught you to listen for, reprised in a new key. Shopify partially handles cross-layer integration—which is to say it does not handle it at all, and hands you the seam to manage yourself. You don’t solve this problem. You manage it. The distinction is not a consolation prize; it is the actual instruction. Solving implies a tool you have not found yet, and there is no such tool to find. Managing implies a practice, and the practice is concrete: write clear contracts between the layers, so each one states what it guarantees about the shape of what it emits. Hold disciplined change management whenever any single layer moves, because a change that is locally safe can be globally breaking. And maintain a small, targeted end-to-end smoke suite over the highest-stakes paths only—add to cart, checkout, order confirmation—the handful of journeys that cannot be allowed to break, accepting that this suite gives you signal rather than certainty. Signal on the paths that matter is worth more than certainty you were never going to get.
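A contract between layers can be as small as one shared parser. In the sketch below, a Function is assumed to tag bundled cart lines with two attributes, and a checkout extension is assumed to read them. The attribute keys (`_bundle_id`, `_bundle_role`) and the whole bundle scenario are hypothetical; what matters is that both sides of the seam test against the same definition.

```typescript
// Hypothetical contract: what the producing layer promises about a bundled line.
interface BundleLineContract {
  bundleId: string;
  bundleRole: "parent" | "child";
}

type Attributes = { key: string; value: string }[];

// Returns the parsed contract, or null when the attributes do not satisfy it.
function parseBundleLine(attributes: Attributes): BundleLineContract | null {
  const lookup = (key: string): string | undefined => {
    for (const attr of attributes) {
      if (attr.key === key) return attr.value;
    }
    return undefined;
  };
  const bundleId = lookup("_bundle_id");
  const role = lookup("_bundle_role");
  if (!bundleId) return null;
  if (role === "parent" || role === "child") {
    return { bundleId, bundleRole: role };
  }
  return null;
}
```

The Function's unit tests assert that everything it emits parses cleanly; the extension's tests feed the parser both valid and malformed attributes. Neither test proves the layers compose in a live checkout, but a change that breaks the contract on one side now fails somewhere visible.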

Full coverage across a Plus stack is not on offer. It is important to say that without flinching, because half the anxiety in this work comes from chasing a number that the platform structurally will not let you reach, and treating the shortfall as a personal failure. It isn’t one. It is a property of the territory.

What good looks like here is not a coverage percentage. It is knowing, layer by layer, exactly what kind of confidence you are holding: real coverage on Functions, which are genuinely testable and worth investing in heavily; consciously fragile, improvised coverage on UI Extensions, which you keep and watch and do not trust further than you built it; and signal-not-certainty on theme end-to-end and on cross-layer integration, where you operate on the smoke suite and the contracts and your own paranoia. Good is not pretending those three are the same thing. Good is knowing which one you are standing on at any given moment.

And the one decision that compounds across all four layers, the closest thing to a universal law this chapter has: write the code so it can be exercised in isolation. Keep business logic away from rendering. Keep rendering logic away from Shopify’s host context. Keep the part that matters extractable from the part you do not control. The code that is testable in isolation is the only code you can reliably test, because on most of this stack, isolation is the best environment you are ever going to get—and code you wrote to be isolatable is code you can still cover when every other strategy hits the read-only wall.

The teams that test Shopify well are not the ones who found better tools. There are no better tools waiting to be found; we looked. They are the ones who stopped waiting for Shopify to hand them a framework and started building their own—a preview-theme Playwright gate, a homemade snippet renderer they will not promise survives the next release, module-level mocks worked out from the shape of a library, a smoke suite guarding the three paths that carry the money. None of it is first-class. All of it was built by people who decided that the absence of an official answer was not the same as permission to ship blind. Be one of those teams. The budget is bigger than the proposal said, the answers are partly invented, and the work is entirely doable—in that order, and with your eyes open.