SCAS stress test results
This page collects the results of the stress tests run on the first SCAS/glexec patches.
The scripts used to run the tests are explained in the README file in CVS.
10 March 2009
This test lasted 4 days from 6 Feb 2009 to 10 Feb 2009 09:30.
It was executed using 10 WNs and one SCAS server all deployed on VMs.
The Worker Nodes were activated in sequence, with an interval of 2 hours.
This test used the new scripts, which support multiple user credentials. On each worker node,
each glexec call chooses a random proxy from a set of 10 available proxies. In total, 100 proxies are
used across all the worker nodes.
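The per-call proxy selection described above can be sketched as follows. This is only an illustration, not the actual test script kept in CVS: the proxy file paths and the `pick_proxy` helper are assumptions made for the example.

```python
import random

# Hypothetical pool of 10 proxy files available on one worker node.
# The real paths used by the test scripts are not given on this page.
PROXY_POOL = [f"/tmp/stress/proxy_{i:02d}.pem" for i in range(10)]


def pick_proxy(pool, rng=random):
    """Return one proxy path chosen uniformly at random from the pool."""
    return rng.choice(pool)


if __name__ == "__main__":
    # Each simulated glexec call draws its own proxy independently.
    for call in range(5):
        print(call, pick_proxy(PROXY_POOL))
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which can help when comparing two test campaigns.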
The new glexec patch with the new SCAS client proved tolerant of SCAS internal refreshes, which no longer cause glexec errors as
they did in the previous versions.
The hostnames were:
- lxb7606v1 to lxb7606v5 and lxb7605v1 to lxb7605v5 (WNs)
- vtb-generic-83 (SCAS)
The patches installed were:
The total requests and the error rate were:
- Total requests: 3306508
- Frequency achieved: 10.02 Hz
- Total errors: 1 (glexec failed after 172915 seconds with the message "[gLExec]: LCMAPS failed, see '/var/log/glexec/lcas_lcmaps.log' for more info")
GLEXEC response time on the Worker Nodes
The response time on the first host (lxb7606v1) is shown in the following graph:
A frequency histogram of the same data, with a logarithmic Y axis, is here:
The same plot with a linear Y axis is here:
Categorizing the response times into 3 zones, we have:
- zone 1 [0,2): 98.16%
- zone 2 [2,10): 1.80%
- zone 3 [10, +inf): 0.04%
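The zone percentages above could be derived from raw response times along these lines. This is a minimal sketch: the boundaries are taken from the list above and assumed to be in seconds, while the function names and the sample values are made up for illustration.

```python
from bisect import bisect_right

# Zone boundaries (assumed to be seconds), taken from the list above:
# zone 1 = [0, 2), zone 2 = [2, 10), zone 3 = [10, +inf).
BOUNDARIES = [2.0, 10.0]


def zone_of(seconds, boundaries=BOUNDARIES):
    """Return the 1-based zone index for one response time."""
    # bisect_right puts a value equal to a boundary into the upper zone,
    # which matches the half-open intervals [low, high).
    return bisect_right(boundaries, seconds) + 1


def zone_percentages(samples, boundaries=BOUNDARIES):
    """Return the percentage of samples falling in each zone."""
    counts = [0] * (len(boundaries) + 1)
    for s in samples:
        counts[zone_of(s, boundaries) - 1] += 1
    total = len(samples)
    if total == 0:
        return [0.0] * len(counts)
    return [100.0 * c / total for c in counts]


if __name__ == "__main__":
    # Made-up response times, for illustration only.
    demo = [0.4, 0.7, 1.9, 2.0, 6.1, 10.0, 12.5, 0.3]
    print(zone_percentages(demo))
```

Changing `BOUNDARIES` to `[2.0, 8.0]` gives the zone definition used for the 19 February test.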
Memory consumption on the SCAS server
The memory consumption on the SCAS server is shown in the following graphs:
19 February 2009
This test lasted 6 days from 13 Feb 2009 14:25 to 19 Feb 2009 08:00.
It was executed using 10 WNs and one SCAS server all deployed on VMs.
The Worker Nodes were activated in sequence, with an interval of 1 hour.
The hostnames were:
- lxb7606v1 to lxb7606v5 and lxb7605v1 to lxb7605v5 (WNs)
- vtb-generic-83 (SCAS)
The patches installed were:
The total requests and the error rate were:
- Total requests: 6264443
- Total errors: 14200
- Error rate: 0.2267% (an error being a glexec failure with an error message)
- Requests per second: 12.65 (the frequency achieved, given that the WNs make requests continuously)
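As a cross-check, the error rate and the request frequency can be recomputed from the totals and the test interval quoted above. The sketch below assumes that the whole interval from start to end counts as active time; the helper names are introduced only for this example.

```python
from datetime import datetime

# Figures reported above for this test.
TOTAL_REQUESTS = 6264443
TOTAL_ERRORS = 14200
START = datetime(2009, 2, 13, 14, 25)
END = datetime(2009, 2, 19, 8, 0)


def error_rate_percent(errors, requests):
    """Share of failed glexec calls, as a percentage."""
    return 100.0 * errors / requests


def requests_per_second(requests, start, end):
    """Average request frequency over the whole test interval."""
    return requests / (end - start).total_seconds()


if __name__ == "__main__":
    print(f"Error rate: {error_rate_percent(TOTAL_ERRORS, TOTAL_REQUESTS):.4f}%")
    print(f"Requests per second: {requests_per_second(TOTAL_REQUESTS, START, END):.2f}")
```

Both values agree with the reported figures after rounding.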
GLEXEC response time on the Worker Nodes
The response time on the first host (lxb7606v1) is shown in the following graph:
Zooming in on a 1-hour period in the middle of the test, we get:
Two levels are present in the response time graph. Most of the executions have a response time of less than 1 second, but for a considerable number of executions (~40 per hour) the response time is around 6 seconds. Some spikes are present around 10 seconds and, very rarely, at higher values.
Using a three-zone categorization, these are the results:
- zone 1 [0,2): 99.49%
- zone 2 [2,8): 0.50%
- zone 3 [8,+inf): 0.02%
A frequency histogram plot is available here:
Truncating the Y axis at 5000 makes the smaller contributions visible:
Memory consumption on the SCAS server
The memory consumption on the SCAS server is shown in the following graphs:
The trend is more visible in the following graph, zoomed in on a 1-hour period:
The memory leak problem (see patch #2684) has been addressed by killing the SCAS child process every 5 minutes. This keeps the SCAS server from crashing, but it introduces periodic errors that occur while the child process is restarting (see the Error rate and distribution section). This problem is known to the SCAS developers and is tracked in bug #47148. Some memory leak is still present and is visible in the first graph (bug #47149).
Error rate and distribution
The error rate of glexec executions, which was around 0.03% with 2 WNs, reaches 0.2% with 10 WNs.
The error distribution graph, zoomed in on the same 1-hour period as before, shows that errors happen at the time of the child process switch in the SCAS server:
12 February 2009
This test lasted 18 hours from Thu Feb 12 13:38:13 2009.
It was executed using 10 WNs and one SCAS server all deployed on VMs.
The hostnames were:
- lxb7606v1 to lxb7606v5 and lxb7605v1 to lxb7605v5 (WNs)
- vtb-generic-83 (SCAS)
The patches installed were:
The total requests and the error rate were:
- Total requests: 765937
- Total errors: 1360
- Requests per second: 11.59
- Error rate: 0.1775%
The response time on the first host (lxb7606v1) is shown in the following graph:
The error distribution on the same WN is shown in:
The memory consumption on the SCAS server is shown in the following graphs:
The total error rate computed each hour is shown in the following graph:
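A per-hour error rate like the one plotted here could be computed as in the sketch below. The record format, a Unix timestamp paired with a success flag, is an assumption made for the example; the real log format of the test scripts is not described on this page.

```python
from collections import defaultdict


def hourly_error_rate(records, start):
    """Return {hour_index: error percentage} for (unix_time, ok) records.

    hour_index 0 covers the first 3600 seconds after `start`, and so on.
    Hours without any request do not appear in the result.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, ok in records:
        hour = int((ts - start) // 3600)
        totals[hour] += 1
        if not ok:
            errors[hour] += 1
    return {h: 100.0 * errors[h] / totals[h] for h in sorted(totals)}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    demo = [(0, True), (10, False), (3599, True),
            (3600, True), (7199, True), (7200, False)]
    for hour, rate in hourly_error_rate(demo, start=0).items():
        print(f"hour {hour}: {rate:.2f}%")
```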
06 February 2009
This test lasted almost 3 days (67 hours), from Fri Feb 6 12:47:14 to Mon Feb 9 08:00:00 (in unix time, from 1233920839 to 1234162799)
It was executed using 2 WNs and one SCAS server all deployed on VMs.
The hostnames were:
- vtb-generic-111 and lxb7606v1 (WNs)
- vtb-generic-83 (SCAS)
The patches installed were:
The total requests and the error rate were:
- Total requests: 2475288
- Total errors: 907
- Requests per second: 10.23
- Error rate: 0.03664%
The response time on each host is shown in the following graphs:
The memory consumption on the SCAS server is shown in the following graphs:
-- GianniPucciani - 09 Feb 2009