Delogue was experiencing technical issues that made the platform unavailable from time to time - roughly twice a day, at the same times each day (10 am & 2:30 pm CET).
Among other things, the platform timeout was increased in December because some suppliers were unable to download large files before the download timed out.
This resulted in many open connections at once.
We tried to ensure that connections were closed regularly and that no single connection would run for too long - without impacting the usability of the platform.
We discovered that the connection pool limit was being reached. A fix was scheduled overnight to avoid additional downtime.
The connection pool limit was raised by 1000 %.
Unfortunately, the issues persisted.
Additional monitoring was introduced and API call logging scheduled.
Additional logging for API calls was introduced to monitor API connections.
We found that even with the 1000 % increase in connection pool size we were still reaching the limit, which could mean there was a connection leak. We tried to fix this by forcefully closing the leaked connections at the data layer.
Additional increase of the connection pool was planned.
Forcefully closing the connection at the data layer still did not solve the problem, as the application still treated it as an active connection from the pool and kept a position reserved for the leaked connection.
We called in external consulting experts to help analyse the issues.
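To illustrate why a forced close at the data layer does not free capacity, here is a minimal sketch in Python. It is not the platform's code: `TinyPool` and `kill_at_data_layer` are hypothetical names, and the "connection" is a plain dictionary standing in for a real one. The assumption is only that the pool counts a slot as taken until the application itself hands the connection back.

```python
import threading


class TinyPool:
    """Toy pool (hypothetical): a slot is freed only when the caller releases it."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.in_use = 0
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            if self.in_use >= self.max_size:
                raise RuntimeError("pool exhausted")
            self.in_use += 1
            return {"alive": True}  # stand-in for a real connection

    def release(self, conn):
        with self._lock:
            self.in_use -= 1


def kill_at_data_layer(conn):
    # The database side drops the session, but the pool is never told.
    conn["alive"] = False


pool = TinyPool(max_size=2)
leaked = pool.acquire()          # a code path that forgets to call release()
kill_at_data_layer(leaked)       # forced close on the database side
slots_after_kill = pool.in_use   # the slot is still counted as in use

pool.acquire()                   # takes the last free slot
try:
    pool.acquire()
    exhausted = False
except RuntimeError:
    exhausted = True             # limit reached although one connection is dead
```

In this sketch the only way to get the slot back is for the application to call `release`, which is why the leak has to be fixed where the connection is used.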
Several corrective actions were taken:
- We identified places where queries were not closing the connection correctly
- The underlying code has been changed to handle connections better
- We increased the connection pool again - by ten times
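As a rough illustration of the first two actions above, the sketch below shows one common way to make sure a query always closes its connection, even when the query fails. It uses Python's built-in `sqlite3` module purely as a stand-in; the platform's actual database, driver and code are not described here, and `fetch_one` is a hypothetical helper.

```python
import sqlite3
from contextlib import closing


def fetch_one(db_path, sql, params=()):
    """Run a query and always close the connection, even if the query raises."""
    # closing() guarantees conn.close() and cur.close() on every exit path.
    with closing(sqlite3.connect(db_path)) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(sql, params)
            return cur.fetchone()


value = fetch_one(":memory:", "SELECT 1 + 1")[0]

try:
    fetch_one(":memory:", "SELECT * FROM missing_table")
    failed = False
except sqlite3.OperationalError:
    failed = True  # the query failed, but the connection was still closed
```

The point of the pattern is that closing the connection no longer depends on every caller remembering to do it on every code path.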
We found that this time the application was going down for a new reason. CPU utilisation on the application server was now running high, while the database connection problem seemed to be fixed.
This could be because of the increased connection pool size.
The largely increased connection pool size was putting a toll on the application server CPU.
Connection pool size was decreased again - to a more realistic number.
We did not experience the problem a second time that day.
We have not experienced issues with the platform at the 'usual' time (10 am & 2:30 pm CET).
Around 12 noon we had a short, one-minute restart.
We identified that our 10-year-old - but until now working - job scheduler was most likely the root cause of the sporadic downtime and server restarts during the past 8-9 days, because multiple jobs were running irregularly - some too many times and some not at all.
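The kind of check that exposes jobs running "too many times" or "not at all" can be sketched as follows. This is only an illustration in Python, not the analysis the team performed: `audit_job_runs` and the job names are hypothetical, and the run log is assumed to be a simple list with one job name per recorded run.

```python
from collections import Counter


def audit_job_runs(expected_runs, run_log):
    """Compare the expected number of runs per job with the runs found in a log."""
    actual = Counter(run_log)
    report = {}
    for job, expected in expected_runs.items():
        ran = actual.get(job, 0)
        if ran == 0 and expected > 0:
            report[job] = "never ran"
        elif ran != expected:
            report[job] = f"ran {ran} times, expected {expected}"
    return report


# Hypothetical job names and counts for one day.
expected = {"nightly_export": 1, "cache_cleanup": 4, "mail_digest": 1}
log = ["nightly_export"] * 3 + ["cache_cleanup"] * 4
report = audit_job_runs(expected, log)
```

Here `report` flags `nightly_export` as having run more often than planned and `mail_digest` as never having run, while `cache_cleanup` is left out because it matches its schedule.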
On Monday we will re-code these jobs to use the Hangfire framework, which we already use to handle other scheduled jobs. We expect the re-coding to be completed by week 5 (03.02.23).
The team continues to analyse the data and logs from the past days' instability to ensure we have indeed identified the root cause.
We continue to have our main focus on this and will keep updating this article as we know more.
Updated 29.01.23, 15:40 CET