Recently I had been seeing the occasional request take a very long time to complete. It happened rarely, but often enough to be worth investigating.
Looking at the diagnostic logs I could see that when it happened, the high-level reason was that getting a database connection from the c3p0 connection pool took a long, long time (about 15 minutes, sometimes more).
The pool already had checkoutTimeout set to 30 seconds precisely to avoid having a request sit around forever if for some reason a connection could not be acquired in reasonable time, so whatever was causing the delay was ignoring this setting. The docs say this setting controls “the number of milliseconds a client calling getConnection() will wait for a Connection to be checked-in or acquired when the pool is exhausted”. It turns out the key part here is “when the pool is exhausted”: the pool statistics from the time of the slow request showed the pool was not exhausted, there were several idle connections available to be had. That only made the long wait stranger, but it explains why this timeout never kicked in.
Trading some speed for more reliability, the server is also configured to testConnectionOnCheckout. This almost guarantees the connection will be good when the application gets it (almost, because it could still go stale in the short window between the time it is checked out and the time the application actually uses it). Since idle connections were available in the pool, it seemed the only way grabbing one could take so long was if this test itself took a long time. But that didn’t make much sense: if the database were down or unreachable, the test would normally fail promptly.
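For reference, this is roughly what the relevant pool configuration looked like. The property names are real c3p0 settings; the exact values here are illustrative:

```properties
# c3p0.properties (values illustrative)

# How long (in milliseconds) getConnection() waits when the pool is exhausted
c3p0.checkoutTimeout=30000

# Validate each connection with a test query before handing it to the application
c3p0.testConnectionOnCheckout=true
```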
I should mention that other requests which came in at about the same time as the slow request had no trouble getting their connections from the pool, and those requests completed in normal time. So there was no connectivity problem to the database, nor was the database responding slowly.
Long story short, I discovered there is a stateful firewall between the web server and the database.
So, it turns out that if a database connection in the pool sat unused long enough, the firewall dropped it from its connection table. When one of these connections was later checked out, c3p0 tested it by sending a query to the database. In normal conditions this would either work or quickly fail. But here the firewall was silently dropping every packet on that connection, so the network stack on the web server kept retransmitting for a long time before giving up. Only then did c3p0 get the network failure, drop that connection, create a new one and hand it to the application.
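The roughly 15-minute delay is consistent with how long a TCP stack will retransmit into a black hole before declaring the connection dead. As a back-of-the-envelope sketch (assuming the web server runs Linux with its documented defaults: retransmission timeout starting near TCP_RTO_MIN of 200 ms, doubling up to TCP_RTO_MAX of 120 s, and tcp_retries2 = 15 retransmissions), the total wait works out to about 15 minutes:

```python
# Hypothetical illustration of Linux TCP retransmission backoff.
# The RTO starts near TCP_RTO_MIN (200 ms), doubles on each failed
# retransmission, and is capped at TCP_RTO_MAX (120 s). With the
# default tcp_retries2 = 15, the stack waits through 16 timeouts
# (one before each retransmission, plus one after the last) before
# giving up on the connection.
RTO_MIN, RTO_MAX, RETRIES = 0.2, 120.0, 15

total, rto = 0.0, RTO_MIN
for _ in range(RETRIES + 1):
    total += rto
    rto = min(rto * 2, RTO_MAX)

print(f"{total:.1f} s = {total / 60:.1f} minutes")  # → 924.6 s = 15.4 minutes
```

The 924.6-second figure matches the hypothetical timeout the Linux kernel documentation quotes for the default tcp_retries2 value, which lines up nicely with the delays I was seeing.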
This was happening very rarely because most of the time my server gets a fairly steady load of concurrent requests, so the connections in the pool get used frequently enough that the firewall never drops them. The problem surfaced only after an occasional spike in concurrent requests led c3p0 to add more connections to the pool. Once the spike was over, the pool had some “extra” connections above and beyond the normal use pattern, so some of them sat unused long enough to be dropped by the firewall. Eventually a request got unlucky, picked up one of these long-idled connections, and ran into the problem.
The easiest solution was to change maxIdleTime from its default of zero (no timeout) to a time just a bit shorter than the firewall timeout. With that, connections which are about to be dropped by the firewall get dropped by c3p0 first. It’s a bit unfortunate, since it causes otherwise unnecessary churn of good connections, but it is certainly better than the alternative. Since I changed this setting, we haven’t seen any more problems.
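The fix is a one-line configuration change. maxIdleTime is a real c3p0 setting, measured in seconds; the firewall timeout below is a made-up example, so pick a value a bit under whatever your firewall actually uses:

```properties
# c3p0.properties
# Suppose the firewall drops idle connections after 60 minutes;
# have c3p0 retire them first, at 55 minutes (3300 seconds).
c3p0.maxIdleTime=3300
```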