Thursday, February 22, 2007

success is a double-edged sword

Since we launched our repository, its success has revealed some server issues. We uncovered capacity problems we had never seen before, and three very patient and experienced programmers and sysadmins have been working very hard to troubleshoot them. To allay fears: no, we had no problems with Fedora itself. But Fedora kept thinking that Tomcat/Cocoon wasn't responding, so it would decline to complete its disseminations. The most puzzling thing was that our image delivery worked just fine, but our texts would fail. Actually, they'd work for a while and then fail. The texts are much larger and have much more complex disseminations and rendering, so we knew it was a capacity issue of some sort. The greatly increased number of such complex renderings was causing something somewhere to intermittently give up the ghost.

We ended up doing a number of things: adding more retries when requesting objects found through text search results; moving Cocoon and Tomcat onto a newer box with upgraded versions and a lot more memory to allocate; and upping timeout limits in Tomcat and Apache. We also found a log that kept filling up, and there was a Cocoon STX bug that we needed to take into account in one text disseminator. We have one remaining mystery issue: some Apache connections that aren't being released. It may be that Cocoon errors aren't being properly identified as such and are going into a wait state.
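
For readers curious what "adding more retries" can look like in practice, here is a minimal sketch in Python. It is an illustration only, not the repository's actual code: the function name fetch_with_retries, the parameter names, and the default values are all hypothetical, and the caller supplies whatever function actually requests the object.

```python
import time


def fetch_with_retries(fetch, retries=3, delay=0.5, backoff=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the retries are used up.

    fetch   -- hypothetical zero-argument callable that requests one object
    retries -- how many extra attempts to make after the first failure
    delay   -- seconds to wait before the first retry
    backoff -- multiplier applied to the wait after each failed attempt
    sleep   -- injectable so the wait can be stubbed out in tests
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except Exception:
            # Assumption: any exception is treated as a transient failure.
            attempt += 1
            if attempt > retries:
                raise  # out of retries, let the caller see the last error
            sleep(delay)
            delay *= backoff  # wait longer each time to ease load on the server
```

Waiting a little longer before each retry gives an overloaded rendering pipeline a chance to recover instead of piling more requests on top of it.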

It's interesting what we never found in two years that we found in three weeks once more people started using the service.
