facebook killed my site
how I spent a sunday debugging an OOM and found zuckerberg at the other end
I run databakkes.be. free lookup for every company in Belgium. 1.9 million pages. SvelteKit + postgres.
sunday morning. site loads like a rock. so I open the logs.
FATAL ERROR: Ineffective mark-compacts near heap limit
Allocation failed - JavaScript heap out of memory

node process: 2GB heap. dead. GC pauses: 4 seconds. only one page loading. everything else: gone.
step 1: find the symptom
postgres.js was throwing this:
TimeoutNegativeWarning: -93125.98 is a negative number.
Timeout duration was set to 1.

the reconnection loop was clamped to 1ms. memory goes brrr. but this was just a symptom.
step 2: add logging
added one line to the hooks. logged every request.
GET /nl/0851149363 -> 301 (937ms, ua=meta-externalagent/1.1)
GET /nl/0533845834 -> 301 (902ms, ua=meta-externalagent/1.1)
GET /fr/0599815237 -> 301 (825ms, ua=meta-externalagent/1.1)
GET /fr/0868985980 -> 301 (194ms, ua=meta-externalagent/1.1)

every. single. request. facebook.
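for reference, the "one line" lives in a SvelteKit handle hook. roughly this - a sketch, not my exact code, with the types loosened so it stands alone:

```typescript
// src/hooks.server.ts - minimal sketch of the request logger (not the exact code)
export async function handle({ event, resolve }: { event: any; resolve: any }) {
  const start = Date.now();
  // let the rest of the app handle the request
  const response = await resolve(event);
  const ua = event.request.headers.get('user-agent') ?? '-';
  // one log line per request: method, path, status, duration, user-agent
  console.log(
    `${event.request.method} ${new URL(event.request.url).pathname} -> ` +
      `${response.status} (${Date.now() - start}ms, ua=${ua})`
  );
  return response;
}
```

that one log line is what cracked the whole case. log your user-agents.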
meta-externalagent - crawling all 3 language versions simultaneously. ~6 requests per second. 24/7.
and look - 6 req/s is not crazy. this is a 12 core / 48GB machine. shared between postgres, meilisearch, and SvelteKit. it should eat 6 req/s for breakfast. the problem wasn’t the traffic. it was the code.
bing too. google? ignoring my site. thanks google.
step 3: understand why it’s so bad
each company page = 9 database queries + SSR of 1842 lines of svelte.
the slug redirect? at the bottom of the load function.
// first: run ALL 9 queries
const [addresses, denominations, contacts, ...] = await Promise.all([...])
// then: oh wait, wrong slug? throw it all away lol
if (params.slug !== expectedSlug) redirect(301, '...')

facebook hits /nl/0851149363 (no slug). runs 9 queries. redirects. runs 9 queries again. = 18 queries per bot visit. ~6 visits per second.
step 4: it gets worse
connection pool was set to 5.
each page grabs 6+ connections via Promise.all.
two concurrent requests = pool saturated. everything queues.
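to see why that math is brutal, here's a toy model - NOT postgres.js internals, just a counting semaphore standing in for the pool, with made-up 50ms queries:

```typescript
// toy model of pool saturation: N slots, each "page" fires 6 concurrent queries
class Pool {
  private waiters: (() => void)[] = [];
  constructor(private free: number) {}
  async acquire(): Promise<void> {
    if (this.free > 0) { this.free--; return; }
    // no slot free: park until someone releases
    await new Promise<void>(resolve => this.waiters.push(resolve));
  }
  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot straight to a waiter
    else this.free++;
  }
}

const sleep = (ms: number) => new Promise(r => setTimeout(r, ms));

// one fake query: hold a connection for 50ms
async function query(pool: Pool): Promise<void> {
  await pool.acquire();
  try { await sleep(50); } finally { pool.release(); }
}

// one fake page load: 6 queries via Promise.all, like the real load function
async function page(pool: Pool): Promise<number> {
  const start = Date.now();
  await Promise.all(Array.from({ length: 6 }, () => query(pool)));
  return Date.now() - start;
}
```

with a pool of 5, two concurrent pages = 12 queries fighting for 5 slots: three 50ms waves instead of one. size the pool past the per-request fan-out and the queueing disappears.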
checked pg_stat_activity:
Active queries: 43
35x address lookups -- bot traffic
1x COUNT(DISTINCT nace...) -- running for 4 MINUTES

the sync process ran a massive aggregation on every deploy. competing with bots for connections.
the fixes
slug redirect moved up - was at the bottom after 9 queries. moved to the top. redirects: 3-5s -> ~100ms.
connection pool 5 -> 50 - Promise.all grabs 6 connections per request. pool of 5 = instant saturation.
cached 5000 NACE codes - same static data loaded from DB on every request. now a 1-hour in-memory cache.
keyset pagination for sitemaps - was doing OFFSET 390,000 with a correlated subquery. 10s -> 36ms.
sync only at midnight - a 4-minute aggregation query was running on every deploy. competing with bot traffic.
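the slug fix, sketched. the helper names are mine (the real load function obviously talks to postgres), but the shape is the point: one cheap lookup, then bail before any heavy work:

```typescript
// sketch of the reordered load function - helper names are made up
type LoadResult =
  | { redirect: { status: 301; location: string } }
  | { data: Record<string, unknown> };

// stand-in for a single-row "what's the canonical slug for this vat" query
async function getExpectedSlug(
  vat: string,
  slugs: Map<string, string>
): Promise<string | undefined> {
  return slugs.get(vat);
}

async function load(
  params: { lang: string; vat: string; slug?: string },
  slugs: Map<string, string>
): Promise<LoadResult> {
  // FIRST: the cheap check. a bot hitting the slugless URL now costs 1 query, not 9.
  const expected = await getExpectedSlug(params.vat, slugs);
  if (expected && params.slug !== expected) {
    return { redirect: { status: 301, location: `/${params.lang}/${params.vat}/${expected}` } };
  }
  // only now run the 9 heavy queries (Promise.all([...]) in the real code)
  return { data: {} };
}
```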
all good improvements. none of them fixed it.
the real problem: node.js is single-threaded. SSR rendering 1842 lines of svelte takes ~600ms of CPU. while that’s happening, the event loop is blocked. every other request just sits there waiting. at 6 req/s, they pile up. total_db climbs from 11ms to 14 seconds - not because postgres is slow, but because node hasn’t gotten around to reading the response yet.
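you can demo the whole failure mode in a dozen lines - a synchronous CPU burn (stand-in for the SSR render) delays a timer that was already due:

```typescript
// demo of event loop blocking: synchronous CPU work delays everything else
function busySsr(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* CPU-bound, never yields to the event loop */ }
}

let timerDelay = 0;
const t0 = Date.now();
// this timer is due in 10ms...
setTimeout(() => { timerDelay = Date.now() - t0; }, 10);
// ...but the loop is busy "rendering" for 200ms, so it fires ~200ms late.
// same mechanism as total_db climbing to 14s: postgres answered long ago,
// node just hadn't gotten around to reading the socket.
busySsr(200);
```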
the fix was stupid simple: 3 replicas. three node processes, three event loops, three CPUs doing SSR in parallel.
could have also dropped SSR entirely. but I need it for SEO - the whole point is google indexing these pages. (whenever google decides to actually visit my site.)
so: 3 replicas. done.
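for the curious: "replicas" here just means N identical node processes behind the same port. in docker compose terms it's one line - a sketch, my actual deploy config differs and the image name is made up:

```yaml
# docker-compose.yml sketch - 3 node processes, 3 event loops
services:
  app:
    image: databakkes-app   # made-up name
    deploy:
      replicas: 3
```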
the result
# before
total_db = 14458ms
total = 14979ms
OOM every 4 hours

# after
total_db = 11ms
total = 648ms
stable

tldr
javascript slow? add more servers.
thanks for the free load test, zuck.
databakkes.be - free Belgian company lookup. 1.9M companies. now with 100ms redirects.
if you’re here because meta-externalagent is destroying your server: you’re not alone. check your logs. move your redirects up. cache your static data. and maybe send zuck the hosting bill. hope you’re not on vercel.
hit me up on X (Twitter) if you have the same problem.