Diagnosing PostgreSQL Connection Leaks on RDS

The site started throwing 502 Bad Gateway errors. Everything stopped. Restarting Gunicorn fixed it within seconds. Then it would happen again, at a completely random time, once every few days. This went on for about two weeks before we decided to properly dig in. That pattern is almost always a leak. In our case it was database connections.

Infrastructure context
RDS: db.m5.4xlarge, max_connections=5000, tcp_keepalives_idle=300s
Gunicorn: 8 workers x 25 threads = 200 concurrent connections
Celery: 3 instances (1 main worker at concurrency=35, 2 side workers at concurrency=25 each) = 85 worker threads total
Total max DB connections across all processes: 285
Peak traffic (8am to 8pm SGT): ...
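The connection totals quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes each Gunicorn thread and each Celery worker slot holds at most one database connection, and the `looks_like_leak` helper and its 10% headroom are hypothetical choices, not something taken from the post. The observed count would typically come from something like `SELECT count(*) FROM pg_stat_activity`.

```python
# Figures taken from the infrastructure context above.
GUNICORN_WORKERS = 8
GUNICORN_THREADS = 25
CELERY_CONCURRENCY = [35, 25, 25]  # 1 main worker + 2 side workers
RDS_MAX_CONNECTIONS = 5000


def expected_max_connections():
    """Upper bound on connections if every thread holds exactly one."""
    gunicorn = GUNICORN_WORKERS * GUNICORN_THREADS
    celery = sum(CELERY_CONCURRENCY)
    return gunicorn + celery


def looks_like_leak(observed, headroom=0.10):
    """Hypothetical check: flag counts well above the expected budget.

    `headroom` is an assumed tolerance, not a value from the post.
    """
    budget = expected_max_connections()
    return observed > budget * (1 + headroom)


if __name__ == "__main__":
    print("expected max connections:", expected_max_connections())
    print("share of max_connections:",
          expected_max_connections() / RDS_MAX_CONNECTIONS)
```

With a budget this far below max_connections, a connection count that keeps climbing past the expected ceiling points at connections that are never returned rather than at genuine load.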

March 16, 2026 · Pranav Gore

RDS PostgreSQL 13 to 15 Upgrade with GCP DataStream

We upgraded our AWS RDS PostgreSQL instance from version 13 to 15. On paper it looks like a few clicks in the console. In practice, with logical replication and a CDC pipeline involved, there are several things that will block or break the upgrade if you do not handle them in the right order.

Context
RDS instance: db.m5.4xlarge, ~1.5TB database, primary + read-replica
~5000 API requests/min at peak load
GCP DataStream connected to BigQuery via logical replication – this is the main complication
Custom parameter group on both primary and read-replica with logical replication enabled
No blue-green deployment on AWS RDS, so this is an in-place upgrade with real downtime
BigQuery will show roughly 1 hour of data loss for the period the slot was dropped – DataStream cannot backfill that gap automatically. AWS does offer a manual backfill option but it has additional cost associated with it

Because there is no blue-green option here, the ~8 minutes of downtime is real and users will see it. Planning matters. ...
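Since the excerpt mentions that the replication slot was dropped for the upgrade, a pre-flight check for remaining logical slots is one thing that could be scripted. The sketch below is a minimal illustration, assuming rows have already been fetched from the `pg_replication_slots` view as dictionaries; the `logical_slots` helper and the sample slot names are hypothetical and not taken from the post.

```python
def logical_slots(rows):
    """Return the names of logical replication slots.

    `rows` is assumed to be a list of dicts with the `slot_name` and
    `slot_type` columns of pg_replication_slots.
    """
    return [r["slot_name"] for r in rows if r["slot_type"] == "logical"]


# Hypothetical sample data standing in for a real query result.
sample_rows = [
    {"slot_name": "datastream_slot", "slot_type": "logical", "active": True},
    {"slot_name": "replica_slot", "slot_type": "physical", "active": True},
]

if __name__ == "__main__":
    remaining = logical_slots(sample_rows)
    if remaining:
        print("logical slots still present:", ", ".join(remaining))
    else:
        print("no logical slots found")
```

Running a check like this right before the upgrade window keeps the slot drop, and the resulting gap in BigQuery, as short as possible.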

April 20, 2025 · Pranav Gore