Category: SailPoint

  • Password page takes long time to load

    I was assigned a task about two pages that took too long to load. The first was a custom form with password reset functionality (solved). The second was poorly described and concerned certifications that took a long time to load (moved to the next post).

    Password reset page

    Let's start with the custom page for password reset. Some basic info: different roles have access to different users and accounts, e.g. manager, user administrator, or basic user with admin accounts.

    The workflow did not contain extensive QueryOptions or context searches, but the form did have a fairly large QueryOptions object with quite a few filters combined in AND and OR clauses.

    My first approach and funny silly debug mistake.

    This was the initial step, to confirm that the slow performance was caused by the custom code and not by OOTB IIQ.

    A note for future retrospectives, in case this ever changes: so far I have not found a better way to debug XML objects in IIQ. The only way I know is to write “console output” messages into a log file.

    These are the last lines of the first script section.

    qo.addFilters(filters);
    return (context.countObjects(Links.class, qo) > 0);

    I added log lines to the code above in the Form object's script fields, like this:

    log.error("script identity start");
    ...
    qo.addFilters(filters);
    log.error("script identity end");
    return (context.countObjects(Links.class,qo) > 0);

    Second script for hidden variable

    log.error("script hidden start");
    ...
    qo.addFilters(filters);
    log.error("script hidden end");
    return (context.countObjects(Links.class,qo) > 0);

    Then I opened the password reset page and got a bit confused by the timings between the steps in the log file.

    It showed about 2 seconds between the end of one script and the start of the next, rather than between the start and end messages of a single script (not the exact log messages):

    script identity end – 14:03:14
    script hidden start – 14:03:16

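    I was a bit puzzled, but after a while I saw it clearly and smiled at how silly I had been 🙂 The "end" message was logged before countObjects ran, so the query time showed up after the "end" message instead of between start and end.

    After correcting the debug messages, I confirmed that the delay was caused by the countObjects DB search in both scripts.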

    log.error("script hidden start");
    ...
    qo.addFilters(filters);
    log.error("script hidden end1");
    int objs = context.countObjects(Links.class,qo);
    log.error("script hidden end2");
    return (objs > 0);
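
    The same idea, taking a timestamp immediately before and after the one call you suspect, can be sketched as a small helper. This is only an illustration in plain Java: the QueryTimer and Timed names are made up for this post, and the lambda in main is a stand-in for the real context.countObjects call, which is not available outside IIQ.

```java
import java.util.function.Supplier;

/** Minimal timing helper: runs a piece of work and reports how long it took. */
class QueryTimer {

    /** Holds the result of the work plus the elapsed wall-clock time in milliseconds. */
    static final class Timed<T> {
        final T value;
        final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Takes timestamps right before and right after the work, so only the work is measured. */
    static <T> Timed<T> time(Supplier<T> work) {
        long start = System.nanoTime();
        T value = work.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(value, elapsedMillis);
    }

    public static void main(String[] args) {
        // Stand-in for context.countObjects(Links.class, qo): sleeps briefly and returns 0.
        Timed<Integer> result = time(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return 0;
        });
        System.out.println("count=" + result.value + " took " + result.elapsedMillis + " ms");
    }
}
```

    With the elapsed time in a single log line there is nothing to misread between two separate messages.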

    Moving down from TEST to my DEV environment

    The spadmin user in my own DEV is set up without any additional roles, so it cannot reset any passwords with the custom functionality. That is a good thing here, because the query takes around 26 seconds and returns 0 records.

    Assumptions:
    It must go through whole table to find nothing. (correct)
    It must be due to missing index. (wrong)
    It should be easy to locate the DB query with trace enabled (correct)

    I used the dev tools in Edge to check the timings of the API call. Then I enabled trace on all objects in log4j2 and was soon able to see the two queries.

    The problem was a combined select query with one inner join and 3 left joins on the Identity table. I am not a database expert, but I could see in the live statistics in MSSQL studio that the numbers grew exponentially with each left join.

    I removed left joins and the related OR clauses until one inner join and one left join remained. Suddenly it returned results (still 0) within a second. I was on the right path.

    I reviewed the code for the QueryOptions and filters. Logically the query could be split, so I split it, with some extra optimizing, and added comments for my colleagues on why it is split into similar QueryOptions and context.countObjects calls.
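
    To show the shape of such a split, here is a rough sketch in plain Java, not the actual form code. It assumes the original OR branches are independent of each other, so each can become its own narrow count query, and that the checks can stop at the first hit. SplitCount and anyMatches are hypothetical names, and each lambda stands in for one context.countObjects call with its own QueryOptions.

```java
import java.util.List;
import java.util.function.IntSupplier;

/** Runs several narrow count queries instead of one query with many OR branches. */
class SplitCount {

    /**
     * Returns true as soon as one of the count queries reports a match.
     * Queries after the first match are never executed.
     */
    static boolean anyMatches(List<IntSupplier> countQueries) {
        for (IntSupplier query : countQueries) {
            if (query.getAsInt() > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical branches and counts, for illustration only.
        List<IntSupplier> queries = List.of(
            () -> 0, // e.g. accounts reachable as manager
            () -> 3, // e.g. accounts reachable as user administrator
            () -> { throw new IllegalStateException("not reached"); }
        );
        System.out.println(anyMatches(queries));
    }
}
```

    The trade-off is more round trips to the database in the worst case, in exchange for each query having fewer joins.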

    Now it runs on my DEV within a second. I retested and confirmed the use cases for the different user types, then deployed to TEST: from 4 seconds to under 1 second (success). Finally I created a PR and am now waiting for the next release to save 8 seconds in Production on each password reset.

    How did I debug in this case?

    • used error debug messages to find/confirm the problematic code
    • set up the DEV environment to not run any tasks, which allows me to run trace on all objects and classes used by IIQ when needed
    • read trace logs for SQL queries and timings



  • IdentityIQ upgrade 8.4p2 and unrelated performance issues

    Written as a story. If you want to skip straight to the performance issue, scroll down to the Solution section at the end.

    I worked for 5 months on an IdentityIQ upgrade project as technical lead. It went OK and was delivered on time. I could not have done it without the team's help and everyone coming together for regression testing and finalizing tasks.

    The upgrade weekend went through with some minor issues. It was long, with minor mistakes too. I blame the rushed last two weeks with many tasks and many people involved. Either way it was an overall success, and I was very happy and tired.

    The next day there was a minor performance issue, probably the system catching up on tasks after being off for 2 days. By Tuesday it was not so minor anymore. I tried to get involved but was refused. In my opinion that was a good decision on one hand, as the cause might have been something other than the upgrade, and that had to be confirmed before I joined. I was eventually called in a week later to take over from my colleague, who had neither confirmed nor ruled out the upgrade as the cause.

    Performance issue

    The issue was overall slowness affecting everything and everyone. It was not isolated to one part of the application, like a specific task or GUI section. It was everything: UI, batch and all operations, with spikes.

    What we knew had happened, chronologically:
    IIQ upgrade (weekend)
    Minor performance issues (Monday)
    Major performance issues (Wednesday)
    Collapse (Friday)
    SQL server Patch (weekend)

    We had not upgraded Java or Tomcat, so yes, you are guessing right: it was either the upgraded IIQ OOTB functionality, our custom code, or the database. Right?

    For the first day we kept jumping between IIQ and the DB, trying to debug different things. There was a bit of blaming, and also clarifying what could be causing this, or more specifically ruling out what could not.

    Our first conclusion with the DB team was that we needed to rebuild indexes, as that had somehow helped previously after a lot of table updates (not really my idea). It helped a bit for a day. Then we were hit again.

    I think at that point we were able to narrow it down to the database, because queries were spiking not only when run from the IIQ servers but also when run directly on the DB. Therefore it could not have been the IIQ servers causing the performance hit. That was a good conclusion.

    The only trouble was that CPU utilization was only about 30-50%. RAM usage was at 90%, but that memory is managed by SQL Server itself. There were no visible spikes. This went on for another day, with some random query testing and checking whether the Perform Maintenance tasks were the cause and were slowing down everything else.

    The next day I had an insight that it could be disk IO utilization. We had not received any info about it until then. Once it was provided, we could see it: utilization was at 100% almost all the time. The spikes were visible in the graph and matched the behaviour we were seeing. Finally we were getting somewhere. My first win.

    We were still trying to figure out what was causing this. To be honest I didn’t know, as I am not a database expert, but I knew for sure that something was wrong with the disk IO. Tired of taking the blame that it was due to the upgrade, I requested disk IO statistics for 1 month to compare with how it really was before the IIQ upgrade.

    Solution

    And there it was. I was angry, I was happy, I was cursing, I wanted to share it and kick some ass.

    The Disk IO statistics report, delivered as an image, looked OK at first, but there was a pattern: a horizontal line on the graph at 100% utilization, an upper limit on Disk IO, starting around the upgrade date.

    At first glance it looked like it started with the upgrade, but when I zoomed in and inspected the blurry image, it turned out the limit had been in place since a day or two before the upgrade. Yes, another win.
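
    I only had a blurry image to zoom into, but with raw numbers the same check could be done in a few lines. The sketch below is purely illustrative: it assumes one peak utilization sample per day in chronological order, the SaturationOnset and firstSustainedSaturation names are made up, and all dates and values in main are invented, not taken from our report.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Finds the day from which utilization stays pinned at a ceiling. */
class SaturationOnset {

    /**
     * Returns the first date that starts a run of at least minRunDays consecutive
     * samples at or above the threshold, if there is one.
     * Assumes one sample per day, iterated in chronological order.
     */
    static Optional<LocalDate> firstSustainedSaturation(
            Map<LocalDate, Double> dailyPeakPercent, double threshold, int minRunDays) {
        LocalDate runStart = null;
        int runLength = 0;
        for (Map.Entry<LocalDate, Double> sample : dailyPeakPercent.entrySet()) {
            if (sample.getValue() >= threshold) {
                if (runLength == 0) {
                    runStart = sample.getKey();
                }
                runLength++;
                if (runLength >= minRunDays) {
                    return Optional.of(runStart);
                }
            } else {
                runLength = 0; // run broken, start over
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Invented sample values and dates, for illustration only.
        Map<LocalDate, Double> samples = new LinkedHashMap<>();
        samples.put(LocalDate.of(2024, 1, 1), 62.0);
        samples.put(LocalDate.of(2024, 1, 2), 71.0);
        samples.put(LocalDate.of(2024, 1, 3), 100.0);
        samples.put(LocalDate.of(2024, 1, 4), 100.0);
        samples.put(LocalDate.of(2024, 1, 5), 100.0);
        LocalDate hypotheticalChangeDate = LocalDate.of(2024, 1, 5);

        firstSustainedSaturation(samples, 99.0, 3).ifPresent(onset ->
            System.out.println("pinned since " + onset
                + ", before the change date: " + onset.isBefore(hypotheticalChangeDate)));
    }
}
```

    If the onset falls before the date of the change being blamed, that change cannot be what introduced the ceiling.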

    I raised the question with the DB/storage team, and of course there had been a change setting an upper limit for our service, marked as a low impact, low risk change 🙂 I couldn’t believe it. The guy was very unprofessional and still blamed us for the change, but in the end reverted it (thanks to a good push and a good communication history as proof from my colleague). I did not have that. I was pushing for the change to be reverted because of the incident.

    Once the change was reverted and the upper limit lifted, everything went back to normal and we could finally leave for the weekend.

    Good approach

    Push the other teams or people to provide proof or gather more information

    Go for your hunch, but verify with proof or confirm logically

    Be the one to take the lead or stand up when needed

    Make a single change at a time to find the root cause

    Lessons learned

    Check for changes sooner when large incidents occur (could have been found sooner)

    Take a moment and brainstorm a few theories at the beginning (this could increase the chance of finding the right path from the start)

    Avoid pushing your opinion without any proof (hunch is ok, but be reasonable)