Vivek Agarwal’s Portal/Java Blog

An IBM Gold Consultant’s weblog about IBM, Lotus, WebSphere, J2EE, IT Processes, and other IT technologies

Back after a WPS v5.1 upgrade deployment

Posted by Vivek Agarwal on December 11, 2005

It has been a crazy few weeks, with a couple of 80+ hour work weeks. I got back from Africa on Black Friday but just have not been able to make the time to blog. The WebSphere Portal v5.1 upgrade deployment was fairly hellish. We experienced a lot of unexpected issues in spite of various deployment rehearsals on staging. Several of our issues had to do with the basic server and network infrastructure, and the poor connectivity between the dual master site locations. However, things have finally stabilized and all of our major issues are resolved, or so I believe.

Some of the non-infrastructure issues we ran into make me red in the face, but I will share them so that somebody else can possibly learn from them:

  • The WebSphere Portal servers kept crashing with OutOfMemoryErrors. Our deployment is on Windows 2003 on servers with 4GB RAM, and I had bumped up the minimum heap size to 1024MB and the maximum heap size to 1400MB to take advantage of the available RAM. However, I believe the heap was too big: with the 32-bit Windows limit of 2GB of address space for a single process, a 1400MB heap leaves only about 650MB for non-heap memory. I reduced the minimum heap size to 512MB and the maximum to 1024MB early last week, and the servers have not crashed since then. Keeping my fingers crossed that the issue is resolved!
  • After a restart, the servers would randomly refuse to authenticate any users, even when the users entered the correct login credentials: the login page would simply refresh with no error message. Restarting the server again would clear the issue. Looking through the logs, I found the exception “EJPSG0015E: Data Backend Problem The profile repository did not return a external identifier”. Eventually, I realized that the issue was of our own making! Ouch! We have identical server names on each site, except that the names carry a suffix “a” on one site and a suffix “b” on the other. However, we configured WPS to point to srvsonwps002 for LDAP on both sites, instead of pointing it to srvsonwps002a on one site and srvsonwps002b on the other, and we use hosts file entries to abstract the server names. The problem was that we also had a box actually named srvsonwps002 on the same network as srvsonwps002a, and I believe that was the root of our problem. Disconnecting srvsonwps002 from the network seems to have resolved the issue. A weird issue that has something to do with Windows networking and is unlikely to affect anybody else …
  • The WebSphere connection pool kept running out of connections on a version 4.0 data source used by a set of portlet applications that we had ported over from WPS v4.2 with minimal changes. The application code was clearly calling connection.close() appropriately, even on transaction rollbacks, as this application used to work just great on WPS v4.2. Eventually we tracked the leak on WAS 5.1 down to one specific operation; none of the other operations in this set of portlets caused the connection pool to grow. This operation was unique in that we would begin a save transaction, go off in the middle and execute a search query in another transaction on another connection, and then come back and complete the save transaction. For some reason this seems to leave a connection in the pool that never gets allocated to another database transaction. Once enough of these operations have been performed, the pool runs out of available connections and the application starts experiencing ConnectionWaitTimeoutExceptions. Restructuring the operation so that the search query is no longer performed as part of the save transaction has fixed the issue with the connection pool! Not quite sure why this is so, but I will take it. It seems to be related to the whole shareable/unshareable connection change in WAS 5.x, but I am unclear on why this happens. I will have to educate myself on this once things are more relaxed.
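
To make the restructuring in the connection pool item concrete, here is a rough sketch of the two orderings. It is a toy model under loud assumptions: ToyPool and Lease are hypothetical stand-ins invented for illustration, not the WebSphere connection pool API, and the sketch does not reproduce or explain the leak itself. All it shows is that the original shape holds two connections at once while the restructured shape never holds more than one. It uses try-with-resources, which needs Java 7 or later; on an older JVM the same shape would be written with try/finally.

```java
/** Toy model only: counts how many pooled connections are held at once. */
final class ConnectionUsageSketch {

    /** Hypothetical stand-in for a connection pool; NOT the WebSphere API. */
    static final class ToyPool {
        private final int maxSize;
        private int inUse;
        private int peakInUse;

        ToyPool(int maxSize) { this.maxSize = maxSize; }

        /** Hands out a "connection"; fails when the pool is exhausted. */
        Lease acquire() {
            if (inUse == maxSize) {
                throw new IllegalStateException("pool exhausted");
            }
            inUse++;
            peakInUse = Math.max(peakInUse, inUse);
            return new Lease(this);
        }

        int inUse() { return inUse; }
        int peakInUse() { return peakInUse; }
    }

    /** A borrowed connection (hypothetical); close() returns it to the pool. */
    static final class Lease implements AutoCloseable {
        private final ToyPool pool;
        private boolean closed;

        Lease(ToyPool pool) { this.pool = pool; }

        @Override public void close() {
            if (!closed) {
                closed = true;
                pool.inUse--;
            }
        }
    }

    /** Original shape: the search runs while the save still holds its connection. */
    static void saveWithNestedSearch(ToyPool pool) {
        try (Lease save = pool.acquire()) {          // begin the save transaction
            try (Lease search = pool.acquire()) {    // second connection, mid-save
                // the search query would run here
            }
            // the save would be completed and committed here
        }
    }

    /** Restructured shape: the search completes before the save begins. */
    static void searchThenSave(ToyPool pool) {
        try (Lease search = pool.acquire()) {
            // the search query would run here
        }
        try (Lease save = pool.acquire()) {
            // the save would be performed and committed here
        }
    }

    public static void main(String[] args) {
        ToyPool nested = new ToyPool(10);
        saveWithNestedSearch(nested);
        ToyPool sequential = new ToyPool(10);
        searchThenSave(sequential);
        System.out.println("nested peak: " + nested.peakInUse());
        System.out.println("sequential peak: " + sequential.peakInUse());
    }
}
```

In this model the nested ordering peaks at two connections and the sequential one at one, and both return everything to the pool. Whatever WAS 5.x does differently with the second connection, keeping the search outside the save avoids the overlap altogether.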

Still battling a couple of issues, though:

  • On one site, if we perform a full resync from the deployment manager to the WPS nodes, the serverindex.xml for the WebSphere_Portal server on the WPS node disappears! Naturally, if you then restart WPS after the resync, it fails to start up. For now, we make sure that whenever we resync the nodes, we manually drop the serverindex.xml file back in the right place. I need to look into this more closely, but it has not been a top focus item so far.
  • We have a virtual portal on this WPS install whose users do NOT have Sametime access. However, since Sametime is enabled on our install, WPS unconditionally inserts references to Sametime even for users who do not have access to it. This just does not make sense to me. If the Sametime login fails, WPS should simply not insert the presence awareness snippets. I will have to chase this down with IBM, as I do not see any obvious way to selectively enable Sametime integration.

All in all, life is never lacking a challenge out here! That is great at most times, but right now I would be very happy being bored!!

