Shopify ERP Integration Failures: Building Error Handling and Reconciliation That Actually Works

A failed Shopify-to-ERP sync usually surfaces three days late, as a customer complaint. Here's the error handling and reconciliation that catches it first.

June 22, 2026 9 min read

Priya runs operations for Maison Reve, a homewares brand doing about $9M a year on Shopify Plus with the back office living in NetSuite. On a Thursday she got a Slack from the warehouse: three orders from Monday had never landed in the ERP, so nothing got picked. The customers had already emailed. Twice.

“When an order fails to sync, what’s your process? We currently find out about failures three days later when someone complains.” A dev told us exactly that on a discovery call last month, and honestly it’s the most accurate description of the problem we’ve heard.

Three days late. That’s the whole thing, right there.

The integration wasn’t broken in the dramatic sense. The connector was running, the status page was green, almost every order flowed through fine. It was the handful that didn’t, plus the fact that nobody knew, that turned a small API error into refunds and a couple of one-star reviews.

Why these syncs break quietly instead of loudly

A loud failure is a gift. When a connector throws a fatal error and the whole pipeline stops, someone notices within the hour because nothing is moving. The dangerous failures are the partial ones, where 98% of records sync and 2% fall on the floor without a sound.

That 2% has a few favorite hiding spots. A webhook fires once, the receiving endpoint times out, and there’s no retry, so the event is just gone. Or the ERP rejects a record because a required field is empty and the connector logs it somewhere nobody reads. Rate limits are another classic: a burst of orders during a flash sale blows past the API ceiling, and the overflow is never queued for a second attempt.

The pattern underneath all of them is the same. The system assumes the happy path will hold, and when it doesn’t, there’s no mechanism to notice, retry, or escalate. Shopify’s webhook delivery model is explicit that delivery is best-effort and your endpoint has to be built to handle duplicates and misses, but a lot of integrations are wired as if every event is guaranteed to arrive exactly once.
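Handling duplicates can be sketched in a few lines. The version below is a minimal illustration, not a production receiver: it checks the `X-Shopify-Hmac-Sha256` signature, then uses the `X-Shopify-Webhook-Id` header to acknowledge repeat deliveries without processing them twice. The `WebhookReceiver` class, the `enqueue` callback, and the in-memory `seen_ids` set are our assumptions for the example; a real deployment would keep seen IDs in a database or cache and normalize header casing.

```python
import base64
import hashlib
import hmac


def verify_shopify_hmac(raw_body: bytes, header_hmac: str, secret: str) -> bool:
    """Compare the signature header with our own HMAC-SHA256 digest of the raw body."""
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    return hmac.compare_digest(expected, header_hmac.encode("utf-8"))


class WebhookReceiver:
    """Accepts each webhook at most once, keyed on the webhook ID header."""

    def __init__(self, secret: str, enqueue):
        self.secret = secret
        self.enqueue = enqueue   # hypothetical: hands the raw payload to a durable queue
        self.seen_ids = set()    # assumption: in production this is a DB table or cache

    def handle(self, headers: dict, raw_body: bytes) -> int:
        """Return the HTTP status code to send back."""
        signature = headers.get("X-Shopify-Hmac-Sha256", "")
        if not verify_shopify_hmac(raw_body, signature, self.secret):
            return 401
        webhook_id = headers.get("X-Shopify-Webhook-Id")
        if webhook_id is None:
            return 400
        if webhook_id in self.seen_ids:
            return 200           # duplicate delivery: acknowledge and do nothing
        self.enqueue(raw_body)   # persist first, process later
        self.seen_ids.add(webhook_id)
        return 200
```

This only covers duplicates. Events that never arrive at all leave nothing to deduplicate, which is why a direct comparison of the two systems is still needed.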

So the failures aren’t really technical mysteries. They’re design gaps. The integration was built to move data, not to be honest about the data it failed to move.

Map where it actually breaks before you fix anything

You can’t build error handling for failures you haven’t named. Before touching code, we walk the four flows that carry the most risk and ask, for each one, what happens when it fails.

Orders are the obvious one, and the most expensive when they drop, because a missing order means an unfulfilled customer. Inventory is sneakier: a sync lag means Shopify keeps selling something the ERP knows is gone, or hides stock that’s actually available. Customer records drift in ways that don’t hurt today but corrupt reporting and tax later. Refunds are the quiet fourth: a return processed in Shopify never writes back to the ERP, and your financials slowly stop matching reality.

Each flow has a direction, a trigger, and a failure cost. Write those down. A simple table of “flow, source of truth, sync trigger, what breaks if it fails” does more for reliability than any tool, because it forces you to admit which failures you can tolerate for an hour and which you can’t tolerate for a minute.

The orders flow usually can’t tolerate much. Refund and customer drift can wait for a nightly catch. That ranking is the whole basis for everything that comes next.
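If it helps to keep that table next to the code, it can live as a small data structure. This is only a sketch: the `SyncFlow` fields mirror the table columns, and every value below (sources of truth, triggers, tolerable delays) is a placeholder assumption to replace with your own answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncFlow:
    name: str
    source_of_truth: str
    trigger: str
    failure_cost: str
    max_tolerable_delay_minutes: int  # how long a failure can go unnoticed


# Placeholder values for illustration only.
FLOWS = [
    SyncFlow("inventory", "ERP", "scheduled push", "overselling or hidden stock", 60),
    SyncFlow("orders", "Shopify", "order webhook", "unfulfilled customer", 15),
    SyncFlow("refunds", "Shopify", "refund webhook", "financials stop matching", 24 * 60),
    SyncFlow("customers", "Shopify", "customer webhook", "corrupted reporting and tax", 24 * 60),
]


def by_urgency(flows):
    """Least tolerant flow first: that is where error handling gets built first."""
    return sorted(flows, key=lambda flow: flow.max_tolerable_delay_minutes)


if __name__ == "__main__":
    for flow in by_urgency(FLOWS):
        print(f"{flow.name}: tolerate {flow.max_tolerable_delay_minutes} min, "
              f"fails as {flow.failure_cost}")
```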

Retry queues and the dead-letter pattern

Once you know what can fail, the first real defense is making failure recoverable instead of final. Every sync operation should go through a queue that can retry on its own, with backoff, so a transient timeout or a rate-limit blip gets a second and third attempt before anyone is paged.
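A retry with exponential backoff fits in a dozen lines. The sketch below assumes a hypothetical `TransientError` that your sync code raises for timeouts and rate-limit responses; the attempt count and delays are illustrative defaults, not recommendations for any particular ERP.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is used up.

    The wait ceiling doubles after each failed attempt, capped at max_delay,
    and the actual wait is a random value below that ceiling (jitter).
    The last TransientError is re-raised so the caller can set the record aside.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Only transient errors are retried here. A record the ERP rejects for a missing field will fail the same way every time, so retrying it just delays the moment someone sees it.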

The piece teams skip is the dead-letter queue. After a record fails its retries, say three to five attempts, it shouldn’t vanish or sit silently in a log. It moves to a dead-letter queue, a holding pen for records that genuinely couldn’t sync. That queue becomes your single most useful artifact, because its length is a live count of exactly how many orders, refunds, and updates are currently broken.

A dead-letter queue with five things in it on a Tuesday morning is a problem you can see and fix in twenty minutes. The same five failures scattered across application logs are a problem you discover on Friday when the customer emails. Same failures. Completely different outcome.

Build it so each dead-lettered record carries its payload, the error message, and a timestamp. That way replaying it after you fix the root cause is a button, not an archaeology project.
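Here is roughly what that looks like, as a minimal in-memory sketch. The `DeadLetter` and `DeadLetterQueue` names are ours, and a real version would sit on a database table or a managed queue so records survive a restart.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class DeadLetter:
    record_id: str
    payload: dict
    error: str
    failed_at: str = field(default_factory=_utc_now)


class DeadLetterQueue:
    """Holds records that exhausted their retries, ready to replay."""

    def __init__(self):
        self._items = {}  # assumption: in memory for the sketch only

    def add(self, record_id: str, payload: dict, error) -> None:
        self._items[record_id] = DeadLetter(record_id, payload, str(error))

    def __len__(self) -> int:
        # The queue length is the live count of currently broken records.
        return len(self._items)

    def replay(self, sync) -> list:
        """Re-run sync(payload) for each held record; keep the ones that still fail."""
        replayed = []
        for record_id, item in list(self._items.items()):
            try:
                sync(item.payload)
            except Exception as exc:
                item.error = str(exc)  # record the latest failure reason
                continue
            del self._items[record_id]
            replayed.append(record_id)
        return replayed
```

After fixing the root cause, `replay` takes whatever function normally pushes a record to the ERP and returns the IDs that went through, so clearing the queue needs no manual data entry.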

A nightly reconciliation that catches the drift retries miss

Retries handle the failures your system noticed. Reconciliation handles the ones it didn’t.

Even a well-built pipeline drifts, because some failures never raise an error at all. A record gets written with a wrong value, a webhook never fires because of a Shopify-side hiccup, a manual edit in the ERP overwrites synced data. None of that shows up in a retry queue. The only way to catch it is to compare the two systems directly, on a schedule, and flag every mismatch.

A good reconciliation job pulls yesterday’s orders from both Shopify and the ERP, matches them by order ID, and reports anything that exists in one but not the other, plus anything whose totals or statuses disagree. Run it nightly at minimum. For high-volume stores we run the orders check hourly and keep the heavier financial reconciliation on the nightly cadence.

The output should be boringly specific: “Order #10482 exists in Shopify, missing in NetSuite” beats “sync may have issues.” Specificity is what turns a report into an action.
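The comparison itself is simple once both sides are loaded. The sketch below assumes each system’s orders have already been fetched into a dict keyed by order ID, with totals as `Decimal` and statuses normalized to a shared vocabulary. The fetching and the normalization are the real work and are left out; the function name and input shape are our assumptions.

```python
from decimal import Decimal


def reconcile_orders(shopify_orders: dict, erp_orders: dict) -> list:
    """Compare two {order_id: {"total": Decimal, "status": str}} maps.

    Returns one specific finding per problem, sorted by order ID.
    """
    findings = []
    for order_id in sorted(shopify_orders.keys() | erp_orders.keys()):
        in_shopify = shopify_orders.get(order_id)
        in_erp = erp_orders.get(order_id)
        if in_erp is None:
            findings.append(f"Order #{order_id} exists in Shopify, missing in ERP")
        elif in_shopify is None:
            findings.append(f"Order #{order_id} exists in ERP, missing in Shopify")
        else:
            if in_shopify["total"] != in_erp["total"]:
                findings.append(
                    f"Order #{order_id} total mismatch: "
                    f"Shopify {in_shopify['total']}, ERP {in_erp['total']}"
                )
            if in_shopify["status"] != in_erp["status"]:
                findings.append(
                    f"Order #{order_id} status mismatch: "
                    f"Shopify {in_shopify['status']}, ERP {in_erp['status']}"
                )
    return findings


if __name__ == "__main__":
    shopify = {
        "10481": {"total": Decimal("49.00"), "status": "paid"},
        "10482": {"total": Decimal("120.00"), "status": "paid"},
    }
    erp = {"10481": {"total": Decimal("45.00"), "status": "paid"}}
    for line in reconcile_orders(shopify, erp):
        print(line)
```

Using `Decimal` rather than floats for totals avoids false mismatches from rounding.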

This is also where you find the failures that aren’t your integration’s fault at all, the human ones, the marketplace ones, the someone-edited-it-directly ones. Reconciliation doesn’t care whose fault it was. It just tells you the truth about whether the two systems agree.

Get the alert to a human who can act

A failure caught by a system nobody watches is the same as a failure nobody caught.

The dead-letter queue and the reconciliation report only matter if their findings reach a person whose job is to act on them. We route both to a dedicated Slack channel with a clear threshold: zero dead-lettered orders is silence, one or more is a message with the order ID and the error. No daily “all good” noise, because the absence of an alert should mean all good, and an alert should always mean do something now.

The threshold matters more than the channel. Alert on everything and people mute it inside a week. Alert only on things that need a human and the alert keeps its teeth. Refund mismatches can roll up into a once-a-day digest. A missing paid order should ping immediately, because every hour it sits unfulfilled is an hour closer to a chargeback.
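That split between ping-now and daily digest is easy to encode. In this sketch the finding kinds (`missing_paid_order`, `dead_lettered_order`, and so on) are hypothetical labels your own queue and reconciliation job would assign, and actually posting to Slack is left out.

```python
# Hypothetical finding kinds that always need a human right away.
IMMEDIATE_KINDS = {"missing_paid_order", "dead_lettered_order"}


def route_alerts(findings: list) -> tuple:
    """Split findings into (immediate messages, daily digest messages).

    Each finding is a dict with 'kind', 'order_id', and 'detail' keys.
    Two empty lists mean nothing gets sent: silence is the all-clear.
    """
    immediate, digest = [], []
    for finding in findings:
        message = f"Order #{finding['order_id']}: {finding['detail']}"
        if finding["kind"] in IMMEDIATE_KINDS:
            immediate.append(message)
        else:
            digest.append(message)
    return immediate, digest
```

Keeping the immediate set small and explicit is the design choice here: adding a kind to it should be a deliberate decision, because every addition makes the channel a little easier to mute.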

And someone has to own the channel. An alert with no owner is just anxiety with a notification sound.

Connector app, custom middleware, or iPaaS

The reliability work above is mostly independent of which tool you run it on, which is the part people get backwards. They shop for a tool first and assume reliability comes in the box.

Off-the-shelf connector apps are the right call for standard Shopify-to-ERP flows with common platforms, where someone has already solved the field mapping a hundred times. They get expensive at the edges, the custom flows and the unusual ERP configs, and you’re stuck with whatever error handling the vendor chose to expose. An iPaaS or workflow-automation platform gives you more visibility and control over retries and branching, at the cost of more to build and maintain. Custom middleware gives you everything and owes you everything, so it only makes sense at scale or when your logic is genuinely unusual.

Here’s the honest comparison we walk clients through.

Approach | Best when | Reliability tradeoff
Connector app | Standard flows, common ERP, small team | Fast to launch, but error handling is whatever the vendor exposed
iPaaS / workflow tool | Custom logic, want visibility into retries | Strong control and logging, more to build and maintain
Custom middleware | High volume or genuinely unusual mapping | Total control, total ownership of every failure mode

Whichever you pick, the retry queue, the dead-letter queue, the reconciliation job, and the alerting are still your job. The tool is the engine. Reliability is the seatbelt, and it doesn’t come standard.

An integration health checklist for agencies

When we audit an existing Shopify-ERP setup, we run the same short list, and it surfaces the gaps fast.

  • Is there a retry with backoff on every sync operation, or do transient errors die on first failure?
  • Is there a dead-letter queue, and does anyone look at it?
  • Does a reconciliation job compare both systems on a schedule, or do you trust the connector’s word?
  • Do failures alert a named human, with a threshold that prevents both silence and noise?
  • Can you replay a failed record after fixing the root cause without manual data entry?
  • Do you know your API rate limits and what happens to overflow during a flash sale?

A setup that passes all six rarely surprises anyone. A setup that fails three of them is the one that finds out about problems three days late. Shopify’s own Admin API rate limit docs are worth reading alongside this, because the flash-sale overflow case is the one that turns a healthy integration into a broken one for exactly the orders you most wanted to keep.
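For the overflow case, one option is to pace outgoing calls on your side so a burst waits its turn instead of being rejected. The token bucket below is a generic sketch, not Shopify’s own algorithm or numbers: `capacity` and `refill_per_second` are placeholders to set from the limits documented for your plan, and the `clock` and `sleep` parameters exist only so the class can be tested without real waiting.

```python
import time


class TokenBucket:
    """Client-side pacing: callers wait for a slot rather than dropping work."""

    def __init__(self, capacity, refill_per_second,
                 clock=time.monotonic, sleep=time.sleep):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.sleep = sleep
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now

    def acquire(self):
        """Block until one request slot is free, then take it."""
        self._refill()
        while self.tokens < 1:
            self.sleep((1 - self.tokens) / self.refill_per_second)
            self._refill()
        self.tokens -= 1


if __name__ == "__main__":
    bucket = TokenBucket(capacity=40, refill_per_second=2)  # placeholder numbers
    for order_number in range(3):
        bucket.acquire()  # waits here during a burst instead of losing the order
        print(f"sending order {order_number}")
```

Pacing reduces how often you hit the ceiling, but it doesn’t replace retries: a rate-limit response from the API should still go back through the retry path.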

What we keep telling clients

The instinct, when a sync breaks, is to go shopping for a better connector. Pretty much every time, that’s the wrong first move. The connector is rarely the problem. The absence of a system that notices failure is the problem, and a new connector without that system just fails more confidently.

Reliability isn’t a feature you buy. It’s four habits you build: retry what’s transient, hold what won’t sync where you can see it, compare the two systems on a schedule, and put the result in front of someone who can act. None of that is exotic. Most of it is a weekend of work for a competent dev, and it pays for itself the first time it catches a missing order before the customer does.

We tell teams to start with the orders flow and the dead-letter queue, because that’s where the money and the reputation live. Inventory, refunds, and customer drift can come into the net after. Trying to make everything bulletproof on day one is how the project stalls and nothing ships.

Priya’s team didn’t replace NetSuite or the connector. They added a retry-and-dead-letter layer on the orders flow, stood up a nightly reconciliation that pings a Slack channel, and gave one person the job of clearing the queue each morning. The next time a webhook silently dropped, the alert landed at 9:04am and the order was fulfilled by 9:20. No customer email. No one-star review. The three-days-late problem just stopped being a problem.

Questions we get every week

Why does an order fail to sync from Shopify to my ERP in the first place? Usually it’s a transient issue like a timeout or a rate limit, or a data issue like a missing field the ERP requires. The failure itself is normal and expected; the real problem is when there’s no retry to catch the transient ones and no alert to surface the data ones.

How often should I reconcile Shopify and my ERP? Nightly is the baseline for most brands, matching orders and totals across both systems. If you’re high volume or run flash sales, move the orders check to hourly and keep the heavier financial reconciliation on the nightly run.

Is a connector app enough, or do I need custom middleware? For standard flows on a common ERP, a connector app is usually the right start and the fastest to launch. You move to iPaaS or custom middleware when your logic gets unusual or your volume is high enough that you need full control over how retries and failures behave.

What’s a dead-letter queue and do I really need one? It’s a holding place for records that failed all their retry attempts, so they don’t silently disappear. You need one because its length is a live, honest count of exactly how many things are currently broken, which is the number you otherwise find out about three days late.

If your Shopify-to-ERP sync keeps surfacing failures days too late, talk to Monkey Man and we’ll map your failure points and build the retry, reconciliation and alerting layer that catches them first.
