The forgotten axis of scalability

I have written variants of this article before… it is such a recurring topic that it seems worth revisiting once again.

Back in the day, bumming instructions out of your assembly code was the thing to do to gain a few more CPU cycles here and there. It was very time-consuming work, but computers were expensive and very, very slow, so the performance gains were worth it. It was great fun, but it hasn’t been cost-effective for a couple of decades now.

In the ’90s, software performance became a forgotten art for the most part. With the MHz (later, GHz) wars in full swing, it was a given that CPUs would be twice as fast by the time you released the code, so why bother with any performance optimizations? As long as it was adequate on your development box, it would be plenty fast later.

In the 2000s the CPU frequency race was slowing down and Internet scale was speeding up. Up to a point you could buy bigger servers to keep up, but that was quite expensive and only got you so far. No matter how much budget you had, at some point faster servers were not going to cut it anymore, so you had to scale sideways instead. And thus the obvious conclusion was to skip the expensive server part altogether and scale horizontally on cheaper hardware from day one. Remember the buzz in the earlier part of the decade about Google having 10K servers? (Seems like such a small number now!)

It became a point of pride to have as many servers as possible and, once again, improving code performance was not seen as a good use of time when you can always throw another cheap box (or another hundred, who’s counting?) at the problem to compensate.

There’s nothing to argue with in the basic premises of these trends. It was true that CPUs were getting faster all the time, and it is true that scaling horizontally on commodity hardware is the way to go. And it is also very true that intensive code optimization is hardly ever worth the effort and the opportunity cost of not doing something else.

(Back at Sun, on the Web Server team, we did spend a fair amount of time on such intense optimization work, looking for a few percent here and a few percent there. The goal was to be able to post world record SPECweb numbers (one example here). While fun, it was an exercise driven by marketing, not so much by the needs of data centers. For most platform vendors, such an effort is not worth the cost. For companies offering services as opposed to products, it’s basically never worth the cost.)

The end result of all this, however, is that the concept of writing faster code and architecting for performance seems to have been lost! I’ve been seeing this for years now and if anything the trend is becoming more prevalent. The idea of scaling with more boxes from day one is so ingrained that I rarely see teams doing some basic performance sanity checking first.

More servers do cost more money. Not just to buy, but particularly to run, cool and house in a rack. If you can get by with a few fewer servers, that’s not a bad thing. If you can get by with a lot fewer servers, all those operating costs go straight to your profit margin. Not a bad deal!

If you read the popular book The Art of Scalability from a few years back you’ll be familiar with its three axes of scalability (more boxes horizontally, split by service, shard by customer). I note with amusement that they forgot the easiest and cheapest axis of scalability, which is to write more performant code in the first place…

The usual argument goes that efficient design and code is not worth it because the gains, while real, are small enough that they are lost in the noise and you’ll still need about the same number of boxes anyway so why bother? That’s usually true IF you’re starting from a reasonably optimized design and implementation. However, if the development team has not been running realistic load testing and performance analysis all along, I can pretty much guarantee there are gains to be had that’ll save you quite a bit in operating costs.

Enough philosophizing, how about a real world example…

When I started at my current position I took over one of the core production REST services. It was (and is) a very standard setup… REST APIs, Java Servlets, JAX-RS, MySQL. The usual. Response times were plenty adequate although not stellar. About a year ago response times started climbing as our user base kept growing every month. While the service was still doing fine, comparing the usage growth curves to the response time curves made it clear it was time to order some more hardware soon to spread the load a bit, before things slowed down enough that customers would notice.

Meanwhile, though, I had been working on sanitizing the performance. Long story short, I never did order more hardware. Just the opposite… after I upgraded the code, about 50% of the hardware dedicated to this service became available to reassign to other things; it simply wasn’t needed anymore given the increased capacity of the new code.

The new code can handle just about 40 times more throughput per server (not 40 percent, 40 times!). When fully loaded (at max capacity) it now maintains mean service times in the 8ms to 10ms range. The previous version had mean response times in the 100ms to 150ms range even though it was handling less than 1/40th of the load!

These may be commodity boxes, but buying 40 of them still takes some cash. And the monthly operating cost of 40 boxes is real money as well. Think about it, it means that roughly a rack full of 1U servers can be downsized to a single server…

I’d love to be able to boast about having done some extreme performance magic to get these scalability benefits, but the reality is all I did was some basic design and implementation optimizations across the board, grabbing the low-hanging performance gains here and there. Such gains add up and so by the time I was done the system could handle 40 times as much traffic (customer requests handled per second).

Why wouldn’t you do this level of performance sanity checking? It doesn’t take that much extra work to design and implement for scalability and the end result is a competitive advantage.

Rails Not Finding a File

I came across an interesting bug today… a Ruby on Rails application was consistently failing on a test machine, displaying an “uninitialized constant” error message. It was failing because it had not loaded an .rb file which contained the definition of the class being used. That seems straightforward enough, except that:

  • The file that was not being loaded was in a directory among other .rb files, all of which did get loaded, so why only this one file?
  • The application worked perfectly on the developer’s machine.
  • The application was also working on another test server.

The file was there in all the installations with no access permission issues; it was complete, readable and identical everywhere. So there was no issue with the file itself. Yet the runtime was apparently sometimes able to load it and sometimes not, depending on which machine it ran on.

The only thing different about this file was that its name was capitalized, whereas all the other .rb files there had all-lowercase names. Then I found out the developer was running the application on a Mac and my test machine is a Linux box. As soon as I heard that, I remembered that the Mac HFS+ filesystem is a bit wacky about case: it preserves case, but by default it doesn’t enforce it. The following surprising sequence actually works on a Mac:

% echo Hello > Hello
% cat hello
Hello

Problem solved, I thought! To confirm, I strace’d the application and indeed, it was opening “file.rb” even though the disk file was called “File.rb”. So on Linux it fails, whereas on a Mac that lookup still works. That explains everything!

Except… the other test server that had been set up, the one where the application also works, is also a Linux box! How can it possibly be working there?

Running strace there showed it was calling open() on “File.rb”, so it worked. But why?

After more closely reviewing the strace output, I noticed that on the working Linux box, the process was open()ing the directory, reading the list of files and then opening and reading each one in turn. Because it got the actual name of the file (“File.rb”), it was able to open it. On the Linux box where the application did not work, it never opened the directory; it went straight to attempting to open “file.rb”, which of course failed. OK, that explains the discrepancy in the file name being opened, but why does one read the directory while the other does not? Both machines have identical installations of all relevant software!

I then noticed that on the Linux box where it worked, the application was being run with the “production” environment flag of rails (-e). On the Linux box where it did not work, it wasn’t.

After some more digging, I discovered that production.rb sets:

config.cache_classes = true

The cache_classes setting is documented as follows:

config.cache_classes controls whether or not application classes and modules should be reloaded on each request. Defaults to false in development mode, and true in test and production modes.

Aha! So it looks like the way it is implemented is that if cache_classes is true, Rails scans the directories at startup and loads (and caches) all the .rb files it finds, which is what the strace output showed. Thus, it finds and loads “File.rb”. If cache_classes is false, it never scans the directory; it simply attempts to (re)load “file.rb” on each request, always failing.

With that, mystery truly solved! If you run into this, now you know…

TCP connection to local MySQL with Ruby

Here’s a small detail that took me a while to discover. Maybe it helps you, or at least it’ll probably help me in the future when I need to do this again…

I was working on a Ruby script which made some calls to MySQL. Among other things, the script takes an argument with the hostname of the database, which it later uses to open the connection. The implementation is a bit too smart for its own good, though: if the hostname is “localhost” it won’t do a TCP connection. I needed to force it to do a TCP connection even if the host was localhost. I couldn’t find this clearly documented anywhere, but it turns out there is a way. You need to set the OPT_PROTOCOL option to 1:

h = Mysql.init
h.options(Mysql::OPT_PROTOCOL, 1) # 1=TCP, force TCP connection
connection = h.real_connect(dbhost, $MYSQL_UID, $MYSQL_PWD)

That’ll do it.

As to why I needed this? The MySQL server was actually on a different machine, but I had set up an ssh tunnel to it mapped to localhost:3306. So the Ruby MySQL library’s assumption that a “localhost” connection must be to a local process was not true in this case.

RPM Scripts

When creating an rpm package, the spec file can specify a number of scripts to be run before and after package install and uninstall. For the simple cases of a new install or an uninstall it is obvious which script runs when. However, the documentation didn’t seem very clear on the behavior during a package upgrade (rpm -U).

Documenting this as a note to my future self, for the next time I need it… The table shows which script runs when (and in what order) and what integer parameter it is given:

Fresh install (rpm -i)   %pre      1
                         %post     1
Upgrade (rpm -U)         %pre      2
                         %post     2
                         %preun    1
                         %postun   1
Uninstall (rpm -e)       %preun    0
                         %postun   0

With this, the scripts can do something like:

%pre

case "$1" in
  1)
    echo "package is about to be installed for the first time"
    ;;
  2)
    echo "package is about to be upgraded; prepare the component"
    echo "for the upgrade (e.g. stop daemons, etc.)"
    ;;
esac
exit 0

Firewalls and database pools

Recently I had been seeing the occasional request taking a very long time to complete. It was happening very rarely, but enough to be worth investigating.

Looking at diagnostic logs I could see that when it happened, the high level reason was that getting a database connection from the c3p0 connection pool took a long, long time (about 15 minutes or often more).

The pool already had checkoutTimeout set to 30 seconds, precisely to avoid having a request sit around forever if for some reason a connection could not be acquired in reasonable time. So whatever was causing the delay was ignoring this setting. The docs describe this setting as “the number of milliseconds a client calling getConnection() will wait for a Connection to be checked-in or acquired when the pool is exhausted”. Turns out the key part here is “when the pool is exhausted”: from the pool statistics at the time of the slow request I could see that the pool was not exhausted; there were several idle connections available to be had. Which just made it stranger that it would take so long, but it explains why this timeout was not relevant.

Trading some speed for more reliability, the server is also configured with testConnectionOnCheckout. This helps almost guarantee the connection will be good when the application gets it (almost, because it could still become stale in the short window between the time it is checked out and the time the application actually uses it). Since idle connections were available in the pool, it seemed the only way grabbing one could take a long time was if this test took a long time. But that didn’t make much sense: if the database is down or unreachable, the test normally fails promptly.

I should mention that other requests which came in at about the same time as the slow request had no trouble getting their connections from the pool and those requests completed in normal time. So there was no connectivity problem to the database nor was the database responding slowly.

Long story short, I discovered there is a stateful firewall between the web server and the database.

So, it turns out that if a database connection in the pool sat around unused long enough, the firewall dropped it from its connection table. When one of these connections was later grabbed from the pool, c3p0 attempted to test the connection by sending a query to the database. In normal conditions this would either work or quickly fail. But here the firewall was silently dropping all packets related to this connection, so the network stack on the web server machine kept retrying for a long time before giving up. Only after that did c3p0 get the network failure, drop that connection, create a new one and hand it to the application.

This was happening very rarely because most of the time my server gets a fairly steady load of concurrent requests, so the connections in the pool get used frequently enough that the firewall never drops them. The problem surfaced only after an occasional spike in concurrent requests led c3p0 to add more connections to the pool. After the spike was over, the pool had some “extra” connections above and beyond the normal use pattern, so some of these connections now sat unused long enough to be dropped by the firewall. Eventually a request got unlucky, got one of these long-idled connections and ran into the problem.

The easiest solution was to change maxIdleTime from the default of zero (no timeout) to a value just a bit shorter than the firewall timeout. With that, connections which are about to be dropped by the firewall get dropped by c3p0 first. It’s a bit unfortunate since it causes otherwise unnecessary churn of good connections, but it is certainly better than the alternative. After I changed this setting, we haven’t seen any more problems.
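For reference, here is a minimal sketch of how these three c3p0 settings fit together when configured programmatically. The driver class, JDBC URL, credentials and the 30-minute firewall timeout below are illustrative assumptions, not the actual production values:

import com.mchange.v2.c3p0.ComboPooledDataSource;
import javax.sql.DataSource;

public class PoolSetup {
    // Sketch only: driver, URL, credentials and timeout values are illustrative.
    static DataSource createPool() throws Exception {
        ComboPooledDataSource pool = new ComboPooledDataSource();
        pool.setDriverClass("com.mysql.jdbc.Driver");
        pool.setJdbcUrl("jdbc:mysql://dbhost:3306/mydb");
        pool.setUser("appuser");
        pool.setPassword("secret");

        pool.setCheckoutTimeout(30000);          // milliseconds; applies only when the pool is exhausted
        pool.setTestConnectionOnCheckout(true);  // validate each connection as it is handed out

        // Retire idle connections before the firewall silently drops them.
        // Assuming a (hypothetical) 30-minute firewall idle timeout, stay a bit under it.
        pool.setMaxIdleTime(25 * 60);            // seconds

        return pool;
    }
}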

Performance Testing With Faban

I’ve been quiet on the blog front lately, having all kinds of fun with new challenges. Now that I got around to installing WordPress maybe it is time to write again…

One of the interesting things I’m looking into at Proofpoint is the performance and scalability of one of the web services we offer. Architecturally the server is of a fairly standard design: it is built using Java Servlets and provides a number of services via REST APIs.

In between working on features, I’ve been having fun exploring the performance characteristics. After setting up a suitable lab environment, the next question was deciding which load generator to use.

Back at Sun, while working on the Web Server, I had used Faban, so I was already familiar with it, although only with running the load tests, not writing them. Earlier, with the SunONE Application Server, I had also used JMeter quite a lot, so that was another choice. In the end, I decided to try Faban first.

While there is a lot of documentation on the Faban web site, there are also gaps in the explanations that can be quite confusing. I found that it took some experimentation to get a custom benchmark driver to work. One drawback is that Faban doesn’t do a very good job of reporting problem root causes. Often if it doesn’t like something about the test or configuration it just doesn’t work, and finding out why involves trial and error. Oh well. Still, in the end it is fairly straightforward, so with a bit of patience I had a nice custom benchmark which exercises the primary REST APIs of the server.

One thing I found was that none of the convenience HTTP query APIs built into Faban was suitable for my needs, because they insisted on encoding the request data in ways not compatible with the server. The solution turned out to be easy in hindsight but difficult to find in the documentation at first, so documenting it is the primary reason for this article…

I ended up using the TimedSocketFactory provided by Faban. In the constructor of my benchmark class I create one instance of it:

    socketFactory = new TimedSocketFactory();

Then in each of the benchmark operation methods I do:

    Socket s = socketFactory.createSocket(myServerIP, 80);
    PrintWriter out = new PrintWriter(s.getOutputStream(), true);
    BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
    out.println(req);

Here ‘req’ is the previously constructed request buffer. Then I read and process the server response from ‘in’.
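To make this concrete, here is a sketch of how one complete operation method might look, continuing the fragments above. The method name, the raw HTTP request and the status-line check are hypothetical, and a real Faban driver would also carry the appropriate benchmark operation annotations:

    // Sketch of one benchmark operation, using the socketFactory and
    // myServerIP fields set up in the constructor as shown above.
    public void doStatusRequest() throws Exception {
        // Hypothetical raw HTTP request; the real driver builds 'req' per REST API call.
        String req = "GET /api/status HTTP/1.1\r\n"
                   + "Host: " + myServerIP + "\r\n"
                   + "Connection: close\r\n"
                   + "\r\n";

        Socket s = socketFactory.createSocket(myServerIP, 80);
        try {
            PrintWriter out = new PrintWriter(s.getOutputStream(), true);
            BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));

            out.print(req);                    // send the request
            out.flush();

            String statusLine = in.readLine(); // e.g. "HTTP/1.1 200 OK"
            if (statusLine == null || !statusLine.contains("200")) {
                throw new Exception("Unexpected response: " + statusLine);
            }
            while (in.readLine() != null) {
                // drain the rest of the response; parse/validate as needed
            }
        } finally {
            s.close();
        }
    }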

With that, Faban takes care of timing and collecting the statistics very nicely.

Overall I found Faban to be quite useful; I’ve used it to collect a lot of good data on the performance characteristics of our server under various load conditions. I now have a long list of ideas on how to scale up the performance!