Diagnosing PostgreSQL Connection Leaks on RDS

The site started throwing 502 Bad Gateway errors. Everything stopped. Restarting Gunicorn fixed it within seconds. Then it would happen again, at a completely random time, once every few days. This went on for about two weeks before we decided to properly dig in. That pattern is almost always a leak. In our case it was database connections.

Infrastructure context

RDS: db.m5.4xlarge, max_connections=5000, tcp_keepalives_idle=300s
Gunicorn: 8 workers x 25 threads = 200 concurrent connections
Celery: 3 instances (1 main worker at concurrency=35, 2 side workers at concurrency=25 each) = 85 worker threads total
Total max DB connections across all processes: 285
Peak traffic (8am to 8pm SGT): ...

March 16, 2026 · Pranav Gore

Under Fire: What a Real DDoS Attack Looks Like and How We Fought Back

It was a Tuesday afternoon, around 3pm. We were deep into sprint planning when the Slack messages started coming in. “I can’t log in.” “The app is really slow.” “Is something wrong with Privyr?” The first instinct was to look for a simpler explanation. Maybe a flaky deploy. Maybe a user on a slow connection. Then more messages. Then a lot more. I tried to SSH into the box. It would not connect. CPU was pegged. I pulled up the metrics dashboards and stared at the graphs for a few seconds. Request counts were off the charts. Response times had collapsed. This was not a bug. ...

September 12, 2025 · Pranav Gore

RDS PostgreSQL 13 to 15 Upgrade with GCP DataStream

We upgraded our AWS RDS PostgreSQL instance from version 13 to 15. On paper it looks like a few clicks in the console. In practice, with logical replication and a CDC pipeline involved, there are several things that will block or break the upgrade if you do not handle them in the right order.

Context

RDS instance: db.m5.4xlarge, ~1.5TB database, primary + read-replica
~5000 API requests/min at peak load
GCP DataStream connected to BigQuery via logical replication – this is the main complication
Custom parameter group on both primary and read-replica with logical replication enabled
No blue-green deployment on AWS RDS, so this is an in-place upgrade with real downtime
BigQuery will show roughly 1 hour of data loss for the period the slot was dropped – DataStream cannot backfill that gap automatically. AWS does offer a manual backfill option but it has additional cost associated with it

Because there is no blue-green option here, the ~8 minutes of downtime is real and users will see it. Planning matters. ...

April 20, 2025 · Pranav Gore