IdentityIQ upgrade to 8.4p2 and unrelated performance issues

This is written as a story. If you want to skip ahead, scroll down to the Performance issue section, or go straight to Solution at the end.

I spent five months working on an IdentityIQ upgrade project as the technical lead. It went OK and was delivered on time. I could not have done it without the team's help, coming together for regression testing and finalizing tasks.

The upgrade weekend went through with some minor issues. It was long, and there were minor mistakes too. I blame the rushed last two weeks, with many tasks and many people involved. Either way, it was an overall success; I was very happy and tired.

The next day there was a minor performance issue, probably the system catching up on tasks after being off for two days. By Tuesday it was not so minor anymore. I was trying to get involved, but was refused. On one hand that was a good decision in my opinion, as the cause might have been something other than the upgrade, and that had to be confirmed before I joined. I was eventually called in for help a week later to take over from my colleague, who could neither confirm nor deny the upgrade as the cause.

Performance issue

The issue was overall slowness affecting everything and everyone. It was not an isolated part of the application, like a specific task or GUI section. It was everything: UI, batch, and all other operations, with spikes.

What we knew had happened, chronologically:
IIQ upgrade (weekend)
Minor performance issues (Monday)
Major performance issues (Wednesday)
Collapse (Friday)
SQL Server patch (weekend)

We had not upgraded Java or Tomcat, so yes, you are guessing right: it was either the upgraded IIQ OOTB functionality, our custom code, or the database. Right?

For the first day, we were jumping between IIQ and the DB, trying to debug different things. There was a bit of blaming, and also clarifying possible causes, or more specifically excluding what it could not have been.

Our first conclusion with the DB team was that we needed to rebuild indexes, as that had somehow helped previously when there were a lot of table updates (not really my idea). It helped a bit for a day. Then we were hit again.
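For context, the mitigation itself is nothing exotic. Below is a minimal sketch of what a fragmentation check and rebuild could look like against a SQL Server backend, using Python with pyodbc; the DSN name and the 30% threshold are assumptions, and in our case the DB team ran their own scripts.

import pyodbc

# Find heavily fragmented indexes in the current database, then rebuild them.
FRAGMENTATION_SQL = """
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30 AND i.name IS NOT NULL
"""

with pyodbc.connect("DSN=iiq_db", autocommit=True) as conn:  # placeholder DSN
    cur = conn.cursor()
    for table_name, index_name, frag in cur.execute(FRAGMENTATION_SQL).fetchall():
        print(f"Rebuilding {index_name} on {table_name} ({frag:.1f}% fragmented)")
        cur.execute(f"ALTER INDEX [{index_name}] ON [{table_name}] REBUILD")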

I think at that point we were able to narrow it down to the database, because the queries were spiking not only from the IIQ servers but also when run directly on the DB. Therefore it could not have been the IIQ servers causing the performance hit. That was a good conclusion.

The only trouble was that CPU utilization was maybe 30-50%. RAM was at 90%, managed by SQL Server. There were no visible spikes. That went on for another day, with some random query testing and checking whether the Perform Maintenance tasks were the cause, slowing everything else down.
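For illustration, one quick check when CPU looks idle but everything is slow is SQL Server's wait statistics, which show what sessions are actually waiting on. A minimal sketch, again assuming a placeholder ODBC DSN:

import pyodbc

# Top waits since the last restart, excluding a few benign background waits.
# IO-related waits such as PAGEIOLATCH_* dominating would point at storage
# rather than CPU or locking.
WAIT_STATS_SQL = """
SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH',
                        'XE_TIMER_EVENT', 'REQUEST_FOR_DEADLOCK_SEARCH', 'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC
"""

with pyodbc.connect("DSN=iiq_db") as conn:  # placeholder DSN
    for wait_type, wait_ms, tasks in conn.cursor().execute(WAIT_STATS_SQL):
        print(f"{wait_type:<45} {wait_ms / 1000:>10.0f} s over {tasks} waits")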

The next day I had an enlightenment: it could be the disk IO utilization. We had not received any information about it until then. Once it was provided, we could see it. Yes, the utilization was at 100% almost all the time. The spikes were visible in the graph and matched the behaviour we were seeing. Finally moving somewhere: my first win.
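The 100% figure came from the infrastructure monitoring graphs, but SQL Server itself also exposes cumulative IO stall numbers per database file, which tell a similar story from the database side. A minimal sketch, with the same placeholder DSN assumption:

import pyodbc

# IO stalls per database file since the last SQL Server restart.
# A high average stall per operation points at the storage layer.
IO_STATS_SQL = """
SELECT DB_NAME(vfs.database_id)                      AS database_name,
       mf.physical_name,
       vfs.num_of_reads + vfs.num_of_writes          AS io_count,
       vfs.io_stall_read_ms + vfs.io_stall_write_ms  AS io_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY io_stall_ms DESC
"""

with pyodbc.connect("DSN=iiq_db") as conn:  # placeholder DSN
    for name, path, io_count, stall_ms in conn.cursor().execute(IO_STATS_SQL):
        avg_stall = stall_ms / io_count if io_count else 0
        print(f"{name:<20} {path:<50} avg stall {avg_stall:.1f} ms over {io_count} IOs")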

We were still trying to figure out what was causing this. To be honest, I didn't know, as I am not a database expert, but I knew for sure there was something wrong with the disk IO. Being tired of taking the blame that it was due to the upgrade, I requested disk IO statistics for the past month to compare how things really were before the IIQ upgrade.

Solution

And there it was. I was angry, I was happy, I was cursing, I wanted to share it and kick some ass.

The disk IO statistics report, shared as an image, looked OK at first, but there was a pattern suggesting a disk IO upper limit: a horizontal line on the graph at 100% utilization since the upgrade date.

At first glance, it looked like it had started with the upgrade, but when I zoomed in and inspected the blurry image, it turned out the limit had been applied a day or two before the upgrade. Yes, another win.

I raised the question with the DB/storage team, and of course there was a change setting an upper limit for our service, marked as a low impact, low risk change 🙂 I couldn't believe it. The guy was very unprofessional and kept blaming us for the change, but in the end he reverted it (thanks to a good push and a good communication history as proof from my colleague, which I did not have). I was pushing for the change to be reverted due to the incident.

Once the change was reverted and the upper limit lifted, everything went back to normal and we could finally leave for the weekend.

Good approach

Push other teams or people to provide proof or gather more information

Go for your hunch, but verify with proof or confirm logically

Be the one to take the lead or stand up when needed

Make a single change at a time to find the root cause

Lessons learned

Check for changes sooner when large incidents occur (this could have been found sooner)

Take a moment and brainstorm a few theories at the beginning (this could increase the chance of finding the right path from the start)

Avoid pushing your opinion without any proof (a hunch is OK, but be reasonable)
