Issue Summary:
On January 21st, some Salsa Engage clients began experiencing sluggish responses from several parts of the product: webpages, email sending, email statistics and some reports. Over the following days, until February 7th, some clients continued to experience issues to varying degrees. Normal performance resumed on February 8th.
Root Cause:
The infrastructure for all of Salsa, both Salsa CRM and Engage, was being upgraded. The entirety of CRM was upgraded without issue, but Engage ran into problems when its database was upgraded. This was primarily due to a race condition that appeared as the new database and infrastructure were spun up. Although the race condition was resolved, a significant backlog had built up by then and needed to catch up and re-synchronize. In addition, verbose logging during the upgrade took up considerably more disk space than planned. This combination of events had a knock-on effect on some reports, elements of email sending and webpage performance. While no element was entirely non-functional, the slowness of some degraded the performance of many others that were not initially directly affected. Once the backlog had been purged, logging updated and reports manually re-queued, Salsa Engage returned to the expected performance levels, and indeed to better response times, throughput and output than before the upgrade.
Prevention:
Detailed error aggregation is now in place and logging capacity is enforced. Database performance has been substantially improved, and auto-scaling has been enabled. Locked jobs are automatically identified and released through auto-scaling compute capacity.
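
As an illustration of what identifying and releasing locked jobs can look like, here is a minimal sketch. It is not Salsa's implementation: the `Job` record, the `release_stale_locks` function and the timeout value are all hypothetical, and the sketch does not model the auto-scaling side at all. It assumes only that each job records when it was locked, so that a lock held longer than a chosen timeout can be treated as stale and cleared.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical job record; the field names are assumptions for illustration."""
    job_id: str
    locked_by: Optional[str] = None       # worker currently holding the lock, if any
    locked_at: Optional[datetime] = None  # when the lock was taken, if locked


def release_stale_locks(jobs: List[Job], now: datetime,
                        lock_timeout: timedelta) -> List[str]:
    """Clear locks held longer than lock_timeout and return the released job ids."""
    released = []
    for job in jobs:
        if job.locked_at is not None and now - job.locked_at > lock_timeout:
            # The lock has outlived the timeout: treat it as stale and clear it
            # so that another worker can pick the job up.
            job.locked_by = None
            job.locked_at = None
            released.append(job.job_id)
    return released
```

In a sketch like this, the timeout would need to be longer than the slowest legitimate job, otherwise a job that is still running could be released and picked up twice.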