Act IV: The Joinery
Building Custom (When You Have To)
The Shopify documentation is, genuinely, some of the best platform documentation in the industry. It is thorough, it is current, it is searchable, and it is free. We are not about to tell you to stop reading it. We are about to tell you something more uncomfortable, which is that the documentation is not written for you—or rather, it is written for you and also for four other people, all on the same page, with no sign hung up to say which paragraph belongs to whom.
Picture the single web page that explains how to build a Shopify app. The indie developer reads it—the one shipping a tip-calculator to eleven thousand stores, who needs the thing to be cheap to run, trivial to deploy, and forgettable the moment it works. The Plus systems integrator reads it—the one wiring a single store into an ERP that was old when the developer was in school. The hobbyist reads it, on a Sunday, to see whether this is fun. And you read it: the engineering lead at a brand doing nine figures, who is about to build one application that has to survive five years, a re-org, two platform migrations, and the specific person who wrote it leaving for a competitor. The page serves all of you at once. The defaults it reaches for—the starter template, the tunnel, the webhook subscription, the deploy command—are correct. They are correct for whoever the documentation had in mind when it reached for them. The trouble is that the documentation does not say who that was, and the person it had in mind was, more often than not, the indie developer. The default path turns out to have been designed for someone else.

This whole chapter is about which defaults to override, and why, and what it costs you when you don’t. None of the overrides are exotic. Most of them are the kind of thing a senior engineer arrives at on their own, eventually, around two in the morning, after the thing the default told them to do has fallen over in production. We would simply like you to arrive there at a more reasonable hour.
The first place the defaults bill you is the one you choose before you’ve written a line: the stack. On Shopify, a serious custom build is rarely one piece. It’s an app, plus some Shopify Functions for the cart-and-checkout logic the app can’t touch, plus maybe a UI Extension or two for the surfaces customers actually see. These are not three independent decisions. They interact, and the interaction is where the regret accrues.
The seductive answer is one language for everything. Write the app in JavaScript, write the Functions in JavaScript, share types across the boundary, hire for one skill set. It is a genuinely good argument and we have made it ourselves, at the start of projects, with confidence. The bill comes later. Node has no Rails. We mean that almost literally: there is no single, boring, settled framework in the Node ecosystem that hands you queues, background jobs, and request routing already solved, the way Rails hands you Active Job and a router on day one, with Sidekiq a single well-worn gem away. So you assemble them. You pick a queue library, you pick a job runner, you wire up routing, and you make these infrastructure decisions in the middle of the project, while you are supposed to be building features, because the platform’s defaults walked you into a language whose ecosystem makes you bring your own plumbing. One language for everything is appealing right up until you’re debugging a production incident at midnight and wishing you had Sidekiq. (Ask us how we know. The Sidekiq we wished for, that night, was very specific.)
So our default (and a default is all it is, a place to start arguing from) runs the other way. Ruby and Rails for the app: Sidekiq and Active Job make webhook queuing a solved problem instead of a research project, routing comes in the box, and there’s even a Liquid gem so you can test the templates the storefront will actually render. The cost is honest and we’ll name it: there is no Ruby for Shopify Functions. None. Functions run in a constrained compute sandbox, and your choices are TypeScript or Rust, so choosing Ruby for the app means giving up the dream of one language for the part of the system that prices the cart. Which is fine, because for Functions specifically we lean toward Rust anyway. The compute limits are tight enough that Rust’s performance headroom stops being a flex and starts being the reason the Function fits inside the budget at all; and because end-to-end testing of Functions is still thin, the compiler doing your type-checking is doing real safety work that you cannot easily get any other way. App in Ruby on Rails, Functions in TypeScript or Rust. That’s the default.
If that recommendation makes your team—fluent in JavaScript, indifferent to Ruby—wince, notice that the ground under the question has shifted. It used to be that you picked the language your team already knew, because the cost of unfamiliarity was measured in weeks of fumbling. AI tooling has quietly shrunk that gap. A strong engineer can now be productive in an unfamiliar-but-conventional framework far faster than they could three years ago, which means the old question—which language does my team know?—has lost most of its weight. The question that replaced it is sharper and harder to dodge: which ecosystem gives you solved infrastructure for the problems you’re guaranteed to hit? You are guaranteed to hit webhook queuing. You are guaranteed to hit background jobs. The language your team is comfortable in will not change whether those problems exist; it only changes whether you solve them or rebuild them.
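To make “webhook queuing” concrete, here is a minimal sketch of the pattern itself: acknowledge the delivery immediately, do the slow work in a background job. It is written in Python with the standard library purely so it is self-contained; in the Rails default above, Active Job and Sidekiq play these roles. The names receive_webhook, process, and start_worker are ours, invented for illustration, and the in-process queue is an assumption standing in for a real, durable job backend.

```python
import queue
import threading

# Hypothetical in-process queue. A real job backend persists jobs so they
# survive a restart; this one does not, which is exactly why you want the
# real thing in production.
jobs = queue.Queue()
handled = []


def receive_webhook(topic, raw_body):
    """Body of the HTTP handler: enqueue and acknowledge, nothing else."""
    jobs.put((topic, raw_body))
    return 200  # respond fast; the slow work happens in the worker


def process(topic, raw_body):
    """Stand-in for the slow business logic (API calls, database writes)."""
    handled.append((topic, len(raw_body)))


def worker():
    """Pull jobs off the queue forever, one at a time."""
    while True:
        topic, raw_body = jobs.get()
        try:
            process(topic, raw_body)
        except Exception as exc:
            # A real job runner would retry with backoff; this only reports.
            print(f"job failed for {topic}: {exc}")
        finally:
            jobs.task_done()


def start_worker():
    """Start one daemon worker thread and return it."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The design choice worth noticing is that receive_webhook contains no business logic at all. Everything you would otherwise assemble by hand (retries, persistence, concurrency limits) lives behind the queue, which is the part a settled ecosystem hands you already solved.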
Which brings us to webhooks, where the platform’s default is at its most charming and its most dangerous, because the default genuinely works—for a while, at a size you will outgrow.
The tutorial path is simple and it is what the documentation shows you: you subscribe to an event, Shopify makes an HTTP POST to your endpoint when the event happens, your app does something. Order created, fire a webhook. Customer updated, fire a webhook. It is reactive, it is real-time, it is one diagram you can draw on a napkin, and at a small store it is correct. At enterprise volume it has four failure modes, and they do not announce themselves; they accrue.
Events arrive out of order. Shopify does not promise you that customer updated lands after customer created, and at volume it sometimes won’t, and now your handler has to be defensive about a world where effects precede their causes. Events get lost—quietly. If your app is unavailable for the thirty seconds you spent deploying, the webhooks fired into that window are not patiently redelivered forever; some of them are simply gone, and nothing tells you, and you find out weeks later when a report doesn’t reconcile. There’s no built-in dead-letter queue, no built-in fan-out—if two parts of your system both need to know about an order, that’s your problem to solve, in your code, on every event. And the cruelest one: Shopify’s API rate limits apply to your reactions, so a busy storefront can fire events faster than you can process them, and a reactive architecture can exhaust your API quota just keeping up with incoming events—burning your entire budget on staying level, with nothing left for the work you actually wanted to do.
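Being “defensive about a world where effects precede their causes” can be sketched in a few lines. The Python below treats created and updated events as the same upsert, drops duplicate deliveries, and ignores any payload older than the one already held. It assumes each delivery carries a unique identifier (we believe Shopify sends one in a webhook ID header, but confirm that against your own traffic) and that the payload has an ISO 8601 updated_at field. The in-memory dictionaries are hypothetical stand-ins for your database.

```python
from datetime import datetime

# Hypothetical in-memory state; a real app would keep both in its database.
seen_webhook_ids = set()
customers = {}


def apply_customer_event(webhook_id, payload):
    """Upsert a customer payload unless it is a duplicate or out of date.

    Returns "duplicate", "stale", or "applied".
    """
    if webhook_id in seen_webhook_ids:
        return "duplicate"
    seen_webhook_ids.add(webhook_id)

    incoming = datetime.fromisoformat(payload["updated_at"])
    current = customers.get(payload["id"])
    # An older event arriving late must not overwrite newer state.
    if current is not None and incoming <= current["_updated_at"]:
        return "stale"

    customers[payload["id"]] = {**payload, "_updated_at": incoming}
    return "applied"
```

With this shape, an update that lands before its create simply creates the record, and the late create is discarded as stale. Nothing here recovers an event that never arrived; that still needs a periodic reconciliation pass.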

The shape that survives is to stop letting Shopify POST directly into your application at all. Put a real bus between them. The version we reach for: Shopify delivers events to an Amazon EventBridge bus, and EventBridge routes them into SQS. SQS gives you the three things raw HTTP webhooks never did: durability, so an event waits in the queue instead of evaporating while you deploy; ordering, where you need it; and a dead-letter queue, so the message that can’t be processed lands somewhere you can see it instead of vanishing. Need a second consumer? You add another rule on the same bus, pointed at another SQS queue, without touching a single thing in your Shopify configuration. The fan-out lives in your infrastructure, where you control it, not in a vendor’s webhook settings. CloudWatch gives you the observability. And as a quiet bonus that your security team will love more than you expect, your application stops exposing a webhook endpoint to the public internet at all; it reads from a queue it owns rather than waiting for the open web to knock.
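The consumer side of that bus is small. The sketch below, in Python, shows one polling pass over an SQS queue. It takes any client object that behaves like boto3’s SQS client (receive_message and delete_message with the keyword arguments shown) so that it can be exercised without AWS. The envelope shape it parses, with the topic under detail.metadata and the body under detail.payload, is our assumption about how Shopify events arrive through EventBridge; inspect a real message before relying on it. The function names are ours.

```python
import json


def extract_shopify_event(sqs_body):
    """Pull (topic, payload) out of an EventBridge envelope delivered via SQS.

    Assumed envelope shape (verify against a real message):
    {"detail": {"metadata": {"X-Shopify-Topic": ...}, "payload": {...}}}
    """
    detail = json.loads(sqs_body)["detail"]
    return detail["metadata"]["X-Shopify-Topic"], detail["payload"]


def drain_once(sqs, queue_url, handle):
    """Receive one batch and delete only the messages that were handled.

    A message whose handler raises is left on the queue. SQS redelivers it
    after the visibility timeout and, if the queue has a redrive policy,
    eventually moves it to the dead-letter queue.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS maximum per call
        WaitTimeSeconds=20,      # long polling
    )
    handled = 0
    for message in response.get("Messages", []):
        try:
            topic, payload = extract_shopify_event(message["Body"])
            handle(topic, payload)
        except Exception as exc:
            print(f"leaving message for redelivery: {exc}")
            continue
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        handled += 1
    return handled
```

Deleting only after success is the whole trick: durability and the dead-letter queue fall out of SQS’s own redelivery rules rather than anything you wrote.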
There is a temptation, having read the last two paragraphs, to go build all of that immediately, and we want to head it off, because the most important sentence in this section is the one that tells you when not to. Before you build a full reactive webhook architecture, ask whether the operation actually needs to be real-time. A great many of them don’t. The example we keep coming back to is Omnibus price-display compliance—the EU rule that you show the lowest price an item held over the prior thirty days. It feels like a webhook problem: price changes, recompute the lowest-prior-price, react. But the law does not care if your displayed figure is thirty seconds stale. A nightly bulk operation that walks the catalog and recomputes the numbers is legally sufficient, and it is dramatically—almost insultingly—simpler than a reactive pipeline. No bus, no queue, no ordering anxiety, no quota math. One job, once a night, the kind of thing you can reason about completely while making coffee. Half the reactive architectures we’ve seen built were monuments to a real-time requirement that nobody had actually checked was real.
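For a sense of how small the nightly version is, here is a sketch of its core calculation in Python. It assumes you can obtain, per product, a price history as a list of (effective date, price) pairs; where that history comes from is outside the sketch. The function names and the data shape are ours, and the reading of the rule is the simplified one given above, so treat it as an illustration and not as legal guidance.

```python
from datetime import date, timedelta
from decimal import Decimal


def lowest_prior_price(history, as_of, window_days=30):
    """Lowest price in effect during the window_days ending at as_of.

    history: list of (effective_from: date, price: Decimal), in any order.
    The price already in effect when the window opens counts too.
    Returns None if no price was in effect during the window.
    """
    window_start = as_of - timedelta(days=window_days)
    ordered = sorted(history)
    in_window = [p for day, p in ordered if window_start <= day <= as_of]
    before = [p for day, p in ordered if day < window_start]
    if before:
        in_window.append(before[-1])  # price carried into the window
    return min(in_window) if in_window else None


def nightly_run(catalog, as_of):
    """Recompute the figure for every SKU: {sku: lowest price or None}."""
    return {
        sku: lowest_prior_price(history, as_of)
        for sku, history in catalog.items()
    }
```

The one subtlety is the carried-in price: a price set before the window opened is still the price customers saw on the window’s first day, so it has to be considered even though its change date falls outside the thirty days.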
A short tangent, because it would have derailed the paragraph above and deserves its own corner.
The tunnel deserves more than a placard, because it’s where a multi-developer team meets the single-developer assumption head-on, usually in week one, usually in confusion.
Here is the scene. To the engineering lead about to run the official Shopify CLI quickstart with the whole team watching the shared screen: this part is kind, and then it is a warning. The CLI is wonderful. You run one command and it spins up a development app, opens a tunnel to your local machine, and starts delivering live webhooks to the code on your laptop. It feels like magic the first time. The warning is that the magic is built for one person. The CLI creates a single webhook tunnel per development app, and webhooks go to whoever started their server most recently. So your second developer runs the same command, and the tunnel quietly re-points to them, and now the first developer’s local app has gone deaf and doesn’t know it: sitting there, server running, receiving nothing, debugging a silence that isn’t a bug. Two engineers, one mailbox, and the mail goes to whoever touched it last. On a team of six this is a daily papercut that costs real hours and a recurring “wait, are you getting webhooks? because I’m not” in the channel.
The fix is to stop fighting the tunnel and route around it. Point the development app at the same EventBridge bus the rest of your architecture already uses, and let each developer read their own stream off it. A MeUndies engineer, tired of exactly this, built a small EventBridge log reader that gave every developer a private per-developer event stream: their own view of the events, no fighting over a single tunnel, no deaf laptops. It was not a big piece of software. It was a Saturday’s work that paid for itself by the following Tuesday, and it is the kind of thing the documentation will never tell you to build, because the documentation is still, helpfully, talking to the one developer it imagines you are.
And the staging story rhymes with this: Shopify has no native notion of environments for custom apps (no “staging” toggle, no environment dropdown), so you construct one. Two separate apps in the Dev Dashboard, one staging and one production, the same codebase deploying to each with different credentials, CI pushing your main branch to staging and your tagged releases to production. The app automation tokens that landed earlier this year made this clean in a way it frankly wasn’t before; what used to involve a human clicking through a dashboard is now a deploy step like any other. The platform won’t hand you environments. It will, now, let you build them properly.
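The branch-to-environment rule is simple enough to write down as a function, which is roughly what your CI configuration ends up encoding. The Python sketch below maps a git ref to one of the two apps. Everything named in it is hypothetical: the environment-variable names are placeholders for wherever your CI keeps each app’s credentials, and the vMAJOR.MINOR.PATCH tag convention is an assumption about how you tag releases.

```python
import re
from typing import Optional

# Hypothetical placeholders: the names of the CI secrets holding each
# app's client ID and automation token. Use whatever your CI calls them.
ENVIRONMENTS = {
    "staging": {
        "client_id_var": "STAGING_CLIENT_ID",
        "token_var": "STAGING_DEPLOY_TOKEN",
    },
    "production": {
        "client_id_var": "PRODUCTION_CLIENT_ID",
        "token_var": "PRODUCTION_DEPLOY_TOKEN",
    },
}

# Assumed release-tag convention: v1.2.3
RELEASE_TAG = re.compile(r"^refs/tags/v\d+\.\d+\.\d+$")


def deploy_target(git_ref: str) -> Optional[str]:
    """Map a git ref to an environment name, or None for 'do not deploy'."""
    if git_ref == "refs/heads/main":
        return "staging"
    if RELEASE_TAG.match(git_ref):
        return "production"
    return None
```

Returning None for every other ref is deliberate: feature branches and stray tags deploy nowhere, so the only roads into production are the ones you drew.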
The last default is the largest, the most fashionable, and the one we feel most strongly about: how many things you deploy.
The constant pitch—from conference talks, from the AWS console’s gentle nudges, from at least one engineer on every team who read a good blog post—is serverless. A Lambda per webhook. A function per concern. Small, independent, infinitely scalable pieces, each doing one thing. It sounds like cleanliness itself, and for some systems it is the right answer. For one enterprise Shopify integration, owned by one team, it tends to be a slow-motion mistake, and the mistakes are specific. Your business logic fragments across a dozen deploy units, so the rule about how a discount stacks now lives in three Lambdas and an environment variable, and understanding it means opening four consoles. Cold starts hurt precisely where you can least afford them—the latency-sensitive cart-and-checkout path, where a Lambda waking up adds the milliseconds a customer feels. And tracing a single failure across several Lambdas, a multi-step flow that touched four functions and broke at the third, is an afternoon with a correlation ID and a prayer, where in one process it would have been one stack trace.
So we reach, deliberately and a little unfashionably, for what DHH and the 37signals crowd named the Majestic Monolith. One well-structured, long-running application. Background jobs queued internally rather than scattered across the cloud. One log stream: when something breaks, there is one place to look. One deployment surface: when you ship, you ship one thing. It is boring to operate, and that is not an apology; it is the entire argument. The hours you do not spend correlating logs across functions and reasoning about cold starts are hours you spend on the business. Tannico runs this way in production (one app, a multi-market enterprise wine business, a catalog with real complexity, several integrations hanging off it), and the reason it stays comprehensible is that there is one of it. One app, one log, one place where the truth lives.

This is the same instinct as the primitives test from the previous chapter, pointed inward. There, the question was whether an app builds on Shopify’s native concepts or smuggles in a parallel layer; here, the question is whether your own system builds one coherent thing or smuggles in a dozen parallel ones. Same temperament, same payoff: the fewer separate worlds you have to reconcile at the boundaries, the longer your system stays something a human can hold in their head.
And so the line we want you to carry out of this chapter, the one worth taping to a monitor where the next architecture-astronaut on your team can see it: start with a monolith. Split a service off only when you have a concrete, demonstrated scaling problem. Not an anticipated one. Not a whiteboard one. A real one, with a graph behind it, that you can point to. Splitting early is paying the full operational tax of distribution for a scale you do not have and may never reach. You can always carve a service out of a well-structured monolith when the graph finally demands it. You cannot easily reassemble twelve Lambdas into something you can reason about.
There’s a quieter version of the same decision sitting one layer down, in how your one app holds its data. Sooner or later you’ll choose between mirroring Shopify’s data into your own database—syncing it over, keeping a local copy current with webhooks and reconciliation—and staying API-first, querying Shopify on demand and holding as little as you can. Mirroring buys you fast local queries, resilience when Shopify’s API has a bad afternoon, and the ability to run real analytics over data shaped the way you need it; it costs you a sync to build, maintain, and reconcile, plus the standing risk that your copy and the truth quietly disagree. API-first buys you simplicity and data that is always current by definition; it costs you latency, and it ties your fate to someone else’s rate limits and uptime. The honest answer is usually a hybrid—mirror the reference data you query constantly and that rarely changes, stay API-first for the operational data that has to be current to the second—but the meta-point is the one to hold onto: let the business requirement drive the architecture, not the other way around. You decide what the data is for, and the shape follows. Decide the shape first and you’ll spend a year explaining to the business why the system can’t do the obvious thing.
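The hybrid can be sketched as a single read path with two doors. In the Python below, reference data is served from a local copy while that copy is fresh, and operational data always goes to the API. This is a deliberate simplification: a time-to-live stands in for the webhook-driven sync and reconciliation a real mirror needs, and the class name, method names, and the injected fetch_from_api function are all ours, invented for illustration.

```python
import time


class HybridReader:
    """Mirror reference data locally; always fetch operational data live."""

    def __init__(self, fetch_from_api, mirror_ttl_seconds=3600,
                 clock=time.monotonic):
        self._fetch = fetch_from_api   # hypothetical callable: key -> value
        self._ttl = mirror_ttl_seconds
        self._clock = clock            # injectable so tests control time
        self._mirror = {}              # key -> (fetched_at, value)

    def reference(self, key):
        """Reference data: serve the local copy while it is fresh."""
        now = self._clock()
        cached = self._mirror.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        value = self._fetch(key)
        self._mirror[key] = (now, value)
        return value

    def operational(self, key):
        """Operational data: always ask the API, never the mirror."""
        return self._fetch(key)
```

Which door a given piece of data goes through is the business decision the paragraph above describes. The code only makes that decision visible at each call site, which is most of its value.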
None of this is a knock on Shopify, and we want to be clear about that, because a chapter that spends this long listing defaults to override can read as a chapter that thinks the platform got it wrong. It didn’t. The defaults are good answers. They are simply answers to questions asked by people who are not building what you are building, at the scale you are building it, with the lifespan you need it to have. The teams that succeed at custom development on Shopify are not the ones with the cleverest architecture or the most exotic stack. They are the ones who understood which defaults to override, had specific reasons for each override, and built the infrastructure those reasons demanded—no more, no less.
So when you open that documentation page tomorrow—and you should, it really is excellent—read it the way a translator reads a letter addressed to someone else. Most of it is for you. Some of it is for the indie developer shipping to eleven thousand stores, and that part will look exactly like the part that’s for you, and the only way to tell them apart is to keep asking the question this chapter has been asking all along: who was this default designed for, and is that me? Usually it isn’t. Override it on purpose, write down why, and go to bed at a reasonable hour. The midnight version of you will be grateful, and so will the engineer who inherits this in three years and finds, against all odds, a system they can actually understand.