Investigation
We began the investigation by attempting to replicate the issue in our test environments, however, we found it a challenge to capture meaningful performance metrics. As a result of this, we chose to work with a performance monitoring tool, NewRelic, as it worked within the limitations and captured enough information for us to perform analysis. NewRelic allowed us to quickly identify the root cause of the performance issue by providing traces sorted by response time and returning all invocation patterns and details. We found that caching of complex data, non-performant database queries and logging added directly to response times.
Analysis
Once the most impactful areas were identified, we pinpointed the exact methods and SQL calls that were creating the most damage using the method tagging analysis feature in NewRelic. From here, we analysed the code to understand its intent and any dependencies it may have.
Improvements
We then promptly workshopped solutions, estimated their effort with t-shit sizing and the associated value. We found this a valuable exercise as it ensured that we targeted the improvements with the highest return on investment. Our team began developing the solutions in the ‘Quick Wins’ bucket. If we reached a blocker or found that the fix required complex code, we moved onto the next improvement. This tactic kept us lean by preventing us from getting bogged down on a singular improvement. It was also important not to over-optimise as each optimisation was partial to diminishing returns.