Recently I had to troubleshoot a SQL server that runs nightly batch jobs for a management information system. Under normal conditions the batch takes 6.5 hours, but this suddenly increased to 11.5 hours. An increase of roughly 77%!
Because of this delay the information wasn’t presented on time, which had a lot of implications. Several departments were asked what had changed in the past days; of course the answer was “nothing”.
Point in time
The batch delay first appeared on the 6th of June (increasing from 400 to 600+ minutes):
VMware vSphere Client
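A jump like this can be flagged automatically if the batch durations are recorded somewhere. The following is a minimal sketch, assuming the durations (in minutes) are available as a plain list. The function name, the 7-run window and the 1.25 factor are illustrative choices, and the sample values are made up to mirror the 400 to 600+ minute jump; they are not the measured data.

```python
from statistics import median


def find_slow_runs(durations, window=7, factor=1.25):
    """Return the indexes of runs that took longer than `factor` times
    the median of the preceding `window` runs (hypothetical helper)."""
    slow = []
    for i in range(window, len(durations)):
        baseline = median(durations[i - window:i])
        if durations[i] > factor * baseline:
            slow.append(i)
    return slow


# Illustrative values only (minutes per nightly run), not measured data.
sample = [400, 395, 410, 405, 398, 402, 400, 610, 620]
print(find_slow_runs(sample))
```

Comparing against a rolling median rather than a fixed limit keeps the check usable when the normal duration slowly drifts over time.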
Performance
The performance metrics of the virtual machines showed a decrease in both processor and disk performance while the network was hardly affected.
This was unexpected, since the content of the batch job was unchanged and the same applied to the infrastructure. No (major) changes had been made that would justify the decrease in performance.
Storage
There was a sudden increase (of ~600 GB) in allocated disk space, with a substantial amount used by snapshots. Aha!
Snapshot
Unless a change is being performed (and a rollback may be required), no snapshot should be present. However, there was a snapshot called “Consolidate Helper- 0”.
This snapshot was left behind by a failed Veeam backup (as described by Jim Jones in this article).
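Forgotten snapshots like this one can be found by walking the snapshot tree of each virtual machine and reporting anything older than expected. The sketch below assumes each node exposes `name`, `createTime` and `childSnapshotList`, which is how pyVmomi presents the entries under `vm.snapshot.rootSnapshotList`. To keep the example runnable without a vCenter connection it uses a stand-in object; the function name, the 24-hour limit and the dates are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def list_old_snapshots(root_snapshots, now, max_age=timedelta(hours=24)):
    """Walk a snapshot tree and return (name, age) for every snapshot
    older than `max_age` (hypothetical helper).

    Nodes are assumed to have `name`, `createTime` (timezone-aware
    datetime) and `childSnapshotList` attributes.
    """
    found = []
    stack = list(root_snapshots)
    while stack:
        node = stack.pop()
        age = now - node.createTime
        if age > max_age:
            found.append((node.name, age))
        # childSnapshotList may be empty or None on leaf nodes.
        stack.extend(node.childSnapshotList or [])
    return found


# Stand-in for a real snapshot tree node; the dates are made up.
now = datetime(2000, 1, 10, tzinfo=timezone.utc)
helper = SimpleNamespace(
    name="Consolidate Helper- 0",
    createTime=now - timedelta(days=3),
    childSnapshotList=[],
)
for name, age in list_old_snapshots([helper], now):
    print(f"{name}: {age.days} days old")
```

Against a live environment the same function could be fed `vm.snapshot.rootSnapshotList` for every VM that has a snapshot, with `now` set to the current UTC time.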
Veeam Backup & Replication
To verify that the snapshot was indeed a leftover of a failed backup, I checked the backup log. And indeed, after a successful backup on the 4th, the backup of the 5th of June ended with a warning:
Removing snapshot
Unable to connect to the remote server
No connection could be made because the target machine actively refused it xxx.xxx.xxx.xxx:443
Veeam Backup will attempt to remove snapshot during the next job cycle, but you may consider removing snapshot manually.
Possible causes for snapshot removal failure:
- Network connectivity issue, or vCenter Server is too busy to serve the request
- ESX host was unable to process snapshot removal request in a timely manner
- Snapshot was already removed by another application
The backup on the 6th of June could not be completed at all and ended with an error:
Initializing target session
RemoveSnapshot failed, snapshotRef "snapshot-35436", timeout "3600000"
Unable to access file since it is locked
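Because both messages were only a warning or an error inside a backup report, they are easy to miss. As a minimal sketch, the snippet below scans log text for two phrases taken from the messages quoted above. It assumes the log is available as plain text lines; the function name and the choice of phrases are illustrative and do not reflect an official Veeam log format or API.

```python
import re

# Phrases copied from the two Veeam messages quoted above.
FAILURE_PATTERNS = [
    re.compile(r"RemoveSnapshot failed", re.IGNORECASE),
    re.compile(r"consider removing snapshot manually", re.IGNORECASE),
]


def snapshot_removal_failures(log_lines):
    """Return the lines that hint at a failed snapshot removal
    (hypothetical helper)."""
    return [
        line.strip()
        for line in log_lines
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS)
    ]


# Sample lines taken from the messages in this post.
sample_log = [
    "Removing snapshot",
    "Veeam Backup will attempt to remove snapshot during the next job cycle, "
    "but you may consider removing snapshot manually.",
    'RemoveSnapshot failed, snapshotRef "snapshot-35436", timeout "3600000"',
    "Unable to access file since it is locked",
]
for hit in snapshot_removal_failures(sample_log):
    print(hit)
```

Any hit is a reason to check the virtual machine for a leftover snapshot before the next batch run.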
Result
After removing the snapshot, the storage space was reclaimed and the time required to perform the batch job was back to normal.
Moral of the story
Be careful with snapshots of virtual machines. The impact on performance can be dramatic, and if you’re unaware that a snapshot exists it can take quite a while to find the cause.
More information: