Week from 03/08/2009 to 10/08/2009
Job Statistics
- Summary:
- Almost 127 K jobs ran last week
- Over 14% failed
- Daily peak of over 57 K jobs
- 101 K Production jobs ran to the end
- 9 K User jobs ran to the end
- 7 K Production Jobs Failed
- 11 K User Jobs Failed
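As a quick consistency check on the rounded figures above, the failure fraction follows directly from the counts; a minimal Python sketch using the summary's rounded values:

    # Rounded weekly totals from the summary above.
    total_jobs = 127_000       # almost 127 K jobs ran last week
    failed_production = 7_000  # 7 K Production Jobs Failed
    failed_user = 11_000       # 11 K User Jobs Failed

    failure_rate = (failed_production + failed_user) / total_jobs
    print(f"Failure rate: {failure_rate:.1%}")  # -> 14.2%, i.e. "over 14%"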
In the last two days we had very few running jobs. Production and user jobs were concentrated between 07/08 and 09/08, when productions 5034 to 5046 were set to automatic.
- Total number of Jobs by Final Major Status
- Daily number of Jobs by Final Major Status
- Done|Completed Jobs by User Group
- Done|Completed Production Jobs by Job Type
- Failed Jobs by User Group
- Failed Production Jobs by Minor Status
- Failed User Jobs by Minor Status
Running at Tier1s
- Summary:
- 35 K Production Jobs at Tier1s
- 8 % CERN share
- 10 % CNAF share
- 40 % GRIDKA share
- 10 % IN2P3 share
- 1 % NIKHEF share
- 21 % PIC share
- 9 % RAL share
- 8 K User Jobs at Tier1s
- 53 % CERN share
- 0 % CNAF share
- 12 % GRIDKA share
- 11 % IN2P3 share
- 3 % NIKHEF share
- 10 % PIC share
- 10 % RAL share
CNAF had been banned since 23/07 and was added back to the site mask on 07/08.
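The percentages above are each site's fraction of the Tier1 total for that job category. A minimal sketch of the computation, with hypothetical per-site counts chosen only to roughly reproduce the user-job shares listed (the real numbers come from the accounting plots):

    # Hypothetical per-site user-job counts at Tier1s (~8 K total).
    user_jobs = {
        "CERN": 4240, "CNAF": 30, "GRIDKA": 960, "IN2P3": 880,
        "NIKHEF": 240, "PIC": 800, "RAL": 800,
    }
    total = sum(user_jobs.values())
    for site, count in sorted(user_jobs.items()):
        print(f"{round(100 * count / total):2d} % {site} share")

Note that rounding makes the shares sum to 99 %, as in the lists above.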
- Done|Completed Production Jobs by Site
- Done|Completed User Jobs by Site
Job Failure Analysis
- Summary:
- Production Jobs Failed mostly due to:
- Application Finished With Errors everywhere (5.33 K)
- Watchdog identified this job as stalled mostly at LCG.Torino.it (0.34 K from 0.55 K)
- Pending Requests mostly at LCG.ITEP.ru (0.29 K from 0.53 K)
- Received Kill signal mostly at LCG.CERN.ch (0.09 K from 0.10 K)
Torino was problematic all week and was then banned; a ticket was opened.
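In the breakdowns above and below, "x K from y K" gives the dominant site's count out of the weekly total for that minor status. A minimal sketch of that aggregation over per-job failure records (the record layout is an illustrative assumption, not the actual accounting schema):

    from collections import Counter, defaultdict

    # Hypothetical records: one (minor_status, site) pair per failed job.
    failed = [
        ("Watchdog identified this job as stalled", "LCG.Torino.it"),
        ("Watchdog identified this job as stalled", "LCG.CERN.ch"),
        ("Pending Requests", "LCG.ITEP.ru"),
        # ... one tuple per failed job ...
    ]

    per_status = defaultdict(Counter)
    for status, site in failed:
        per_status[status][site] += 1

    for status, sites in per_status.items():
        top_site, top_count = sites.most_common(1)[0]
        total = sum(sites.values())
        print(f"{status} mostly at {top_site} ({top_count} K from {total} K)")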
- User Jobs Failed mostly due to:
- Application Finished With Errors mostly at LCG.CERN.ch (3.41 K from 7.17 K)
- Input Data Resolution mostly at LCG.CERN.ch (1.88 K from 1.93 K)
- No eligible sites for job mostly at VOLHCB13.CERN.CH (1.21 K from 1.21 K)
- Chosen site is not eligible mostly at VOLHCB13.CERN.CH (0.21 K from 0.21 K)
- Input Data Not Available mostly at VOLHCB09.CERN.CH (0.12 K from 0.12 K)
A bug was introduced with DIRAC v4r18, and a fix was released on 07/08: v4r18p1 corrects a typo in the GaudiApplicationScript module of the Workflow library. The release is deployed in the pilots and prevents user jobs from failing with the following exception:
= EXCEPTION =
exceptions.NameError: global name 'stdError' is not defined
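For reference, a NameError of this kind is raised the moment the interpreter evaluates an identifier that was never bound, so a single misspelled variable is enough to kill the job step. A minimal reproduction (the function and variable names are illustrative, not the actual GaudiApplicationScript code):

    def report_application_output():
        stdout, stderr = "", "application log"
        # Typo: 'stdError' was never defined; 'stderr' was intended.
        # Raises NameError: (global) name 'stdError' is not defined.
        return stdError

    report_application_output()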
- Failed Production Jobs (Application Finished With Errors) by Site
- Failed Production Jobs (Watchdog identified this job as stalled) by Site
- Failed Production Jobs (Pending Requests) by Site
- Failed Production Jobs (Received Kill signal) by Site
- Failed User Jobs (Application Finished With Errors) by Site
- Failed User Jobs (Input Data Resolution) by Site
- Failed User Jobs (No eligible sites for job) by Site
- Failed User Jobs (Chosen site is not eligible) by Site
- Failed User Jobs (Input Data Not Available) by Site
- Failed Jobs at CERN by Minor Status
- Failed Jobs at CNAF by Minor Status
- Failed Jobs at GRIDKA by Minor Status
- Failed Jobs at IN2P3 by Minor Status
- Failed Jobs at NIKHEF by Minor Status
- Failed Jobs at PIC by Minor Status
- Failed Jobs at RAL by Minor Status
Hardware Status
- Various volhcb09:
- CPU utilization: idle more than 60% almost all of the time?
- Network utilization: less than 150 k on average
- Swap Used: less than 500 MB
- Partition Used: stable at 103 GB for the first two days, then stable at 50 GB
- DMS volhcb10:
- CPU utilization: idle less than 20%
- Network utilization: less than 300 k
- Swap Used: stable at 400 MB, then stable at 80 MB
- Partition Used: stable at 80 GB
- LogSE volhcb06:
- CPU utilization: idle ~50%
- Network utilization: above 1 M?
- Swap Used: do we get close to the limit (2 GB)?
- Partition Used: is it stable?
- WMS volhcb13:
- CPU utilization: idle ~84%
- Network utilization: a peak above 1 M
- Swap Used: close to the limit (2 GB) in the first days of the week
- Partition Used: quite stable at 150 GB
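The numbers above are read off the machines' monitoring plots. Purely as an illustration, swap and partition usage of the kind quoted can be sampled on the hosts themselves with standard Linux interfaces (a sketch, not the monitoring system actually used):

    import os

    def swap_used_mb():
        # Parse /proc/meminfo (Linux); values are reported in kB.
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                info[key] = int(rest.split()[0])
        return (info["SwapTotal"] - info["SwapFree"]) / 1024.0

    def partition_used_gb(path="/"):
        # Used space on the filesystem holding `path`.
        st = os.statvfs(path)
        return (st.f_blocks - st.f_bfree) * st.f_frsize / 1024.0**3

    print(f"Swap used: {swap_used_mb():.0f} MB")
    print(f"Partition used: {partition_used_gb():.0f} GB")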
--
FedericoStagni - 10 Aug 2009