Week from 03/08/2009 to 10/08/2009

Job Statistics

  • Summary:
    • Almost 127 K jobs ran last week
    • Over 14% failed
    • Daily peak of over 57 K jobs
    • 101 K Production jobs ran to completion
    • 9 K User jobs ran to completion
    • 7 K Production Jobs Failed
    • 11 K User Jobs Failed
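
The summary figures above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration using the rounded counts from the list; the dictionary layout and the `failure_rate` helper are assumptions made for this example, not part of any monitoring tool.

```python
# Rounded job counts (in thousands), copied from the summary list above.
counts_k = {
    ("production", "done"): 101,
    ("user", "done"): 9,
    ("production", "failed"): 7,
    ("user", "failed"): 11,
}


def failure_rate(counts, job_type=None):
    """Return failed / (done + failed), optionally for a single job type."""
    done = sum(v for (t, s), v in counts.items()
               if s == "done" and job_type in (None, t))
    failed = sum(v for (t, s), v in counts.items()
                 if s == "failed" and job_type in (None, t))
    return failed / (done + failed)


total_k = sum(counts_k.values())
print(f"total from rounded counts: {total_k} K")
print(f"overall failure rate:    {failure_rate(counts_k):.1%}")
print(f"production failure rate: {failure_rate(counts_k, 'production'):.1%}")
print(f"user failure rate:       {failure_rate(counts_k, 'user'):.1%}")
```

The rounded counts add up to 128 K, slightly above the quoted "almost 127 K", which is presumably a rounding effect. The overall rate comes out just above 14%, consistent with the summary, while the split is very uneven: about 6% of production jobs failed against 55% of user jobs.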

In the last 2 days, we had very few running jobs. Production jobs and user jobs were concentrated between 07/08 and 09/08, when productions 5034 to 5046 were set to automatic.

  • Total number of Jobs by Final Major Status
Total_Number_of_Jobs_by_FinalMajorStatus.png

  • Daily number of Jobs by Final Major Status
Daily_Number_of_Jobs_by_FinalMajorStatus.png

  • Done|Completed Jobs by User Group
Done+Complete_Jobs_by_UserGroup.png

  • Done|Completed Production Jobs by Job Type
Done+Complete_Production_Jobs_by_JobType.png

  • Failed Jobs by User Group
Failed_Jobs_by_UserGroup.png

  • Failed Production Jobs by Minor Status
Failed_Production_Jobs_by_MinorStatus.png

  • Failed User Jobs by Minor Status
Failed_User_Jobs_by_MinorStatus.png

Running at Tier1s

  • Summary:
    • 35 K Production Jobs at Tier1s
      • 8 % CERN share
      • 10 % CNAF share
      • 40 % GRIDKA share
      • 10 % IN2P3 share
      • 1 % NIKHEF share
      • 21 % PIC share
      • 9 % RAL share

    • 8 K User Jobs at Tier1s
      • 53 % CERN share
      • 0 % CNAF share
      • 12 % GRIDKA share
      • 11 % IN2P3 share
      • 3 % NIKHEF share
      • 10 % PIC share
      • 10 % RAL share
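
The percentage shares above can be turned into approximate job counts per site. This is a rough sketch under the assumption that each share applies to the rounded totals (35 K production jobs, 8 K user jobs); the names `TIER1_SHARES` and `approx_jobs_k` are invented for the example.

```python
# Totals (in thousands) and percentage shares copied from the lists above.
TIER1_SHARES = {
    "production": (35, {"CERN": 8, "CNAF": 10, "GRIDKA": 40, "IN2P3": 10,
                        "NIKHEF": 1, "PIC": 21, "RAL": 9}),
    "user": (8, {"CERN": 53, "CNAF": 0, "GRIDKA": 12, "IN2P3": 11,
                 "NIKHEF": 3, "PIC": 10, "RAL": 10}),
}


def approx_jobs_k(total_k, shares_pct):
    """Convert percentage shares into approximate job counts (thousands)."""
    return {site: total_k * pct / 100 for site, pct in shares_pct.items()}


for job_type, (total_k, shares_pct) in TIER1_SHARES.items():
    print(f"{job_type}: shares sum to {sum(shares_pct.values())}%")
    for site, jobs_k in approx_jobs_k(total_k, shares_pct).items():
        print(f"  {site:<7} ~{jobs_k:.2f} K jobs")
```

Both sets of shares sum to 99% rather than 100%, which is what one would expect from rounding each share to a whole percent. Under these assumptions GRIDKA ran roughly 14 K production jobs and CERN roughly 4.2 K user jobs.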

CNAF had been banned since 23/07 and was added back to the site mask on 07/08.

  • Done|Completed Production Jobs by Site
Done+Complete_Production_Jobs_at_Tier1_by_Site.png

  • Done|Completed User Jobs by Site
Done+Complete_User_Jobs_at_Tier1_by_Site.png

Job Failure Analysis

  • Summary:
    • Production Jobs Failed mostly due to:
      • Application Finished With Errors everywhere (5.33 K)
      • Watchdog identified this job as stalled mostly at LCG.Torino.it (0.34 K from 0.55 K)
      • Pending Requests mostly at LCG.ITEP.ru (0.29 K from 0.53 K)
      • Received Kill signal mostly at LCG.CERN.ch (0.09 K from 0.10 K)

Torino was problematic all week and was then banned. A ticket was opened.

    • User Jobs Failed mostly due to:
      • Application Finished With Errors mostly at LCG.CERN.ch (3.41 K from 7.17 K)
      • Input Data Resolution mostly at LCG.CERN.ch (1.88 K from 1.93 K)
      • No eligible sites for job mostly at VOLHC13.CERN.CH (1.21 K from 1.21 K)
      • Chosen site is not eligible mostly at VOLHC13.CERN.CH (0.21 K from 0.21 K)
      • Input Data Not Available mostly at VOLHCB09.CERN.CH (0.12 K from 0.12 K)
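
To see how strongly each failure type is concentrated at a single site, the "X K from Y K" pairs in the two lists above can be converted into fractions. The sketch below is a simple illustration; the tuple layout and the `dominant_site_shares` helper are assumptions for this example. The production "Application Finished With Errors" entry is left out because the report gives no dominant site for it.

```python
# (job type, minor status, dominant site, failures at site [K], total failures [K])
# All values are copied from the failure summary lists above.
FAILURES = [
    ("production", "Watchdog identified this job as stalled", "LCG.Torino.it", 0.34, 0.55),
    ("production", "Pending Requests", "LCG.ITEP.ru", 0.29, 0.53),
    ("production", "Received Kill signal", "LCG.CERN.ch", 0.09, 0.10),
    ("user", "Application Finished With Errors", "LCG.CERN.ch", 3.41, 7.17),
    ("user", "Input Data Resolution", "LCG.CERN.ch", 1.88, 1.93),
    ("user", "No eligible sites for job", "VOLHC13.CERN.CH", 1.21, 1.21),
    ("user", "Chosen site is not eligible", "VOLHC13.CERN.CH", 0.21, 0.21),
    ("user", "Input Data Not Available", "VOLHCB09.CERN.CH", 0.12, 0.12),
]


def dominant_site_shares(rows):
    """Map (job type, minor status) to the fraction of failures at the dominant site."""
    return {(job_type, status): at_site / total
            for job_type, status, _site, at_site, total in rows}


shares = dominant_site_shares(FAILURES)
for job_type, status, site, _at_site, _total in FAILURES:
    print(f"{job_type:<10} {status:<42} {site:<18} {shares[(job_type, status)]:.0%}")
```

Entries with a fraction of 100% (the three statuses reported at VOLHC13.CERN.CH and VOLHCB09.CERN.CH) occur only at that one location, whereas the stalled production jobs at Torino account for a bit over 60% of that failure type.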

A bug was introduced with Dirac v4r18, and a fix was released on 07/08: v4r18p1 corrects a typo in the GaudiApplicationScript module of the Workflow library. The release is deployed in pilots. The fix prevents user jobs from failing with the following exception:

= EXCEPTION = exceptions.NameError: global name 'stdError' is not defined
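
For readers unfamiliar with this class of error, the sketch below shows how a misspelled variable name produces a NameError of this kind. It is not the GaudiApplicationScript code: the function names and the `std_error` variable are hypothetical, chosen only to mirror the `stdError` name in the exception. Note that Python 3 words the message as "name 'stdError' is not defined", while the "global name ..." form above comes from Python 2.

```python
def run_application_buggy(command):
    """Hypothetical example: returns a name that was never defined."""
    std_error = f"{command}: no errors"
    # Typo: the variable assigned above is 'std_error', so looking up
    # 'stdError' fails at run time, not when the module is imported.
    return stdError


def run_application_fixed(command):
    """Same function with the variable name spelled consistently."""
    std_error = f"{command}: no errors"
    return std_error


try:
    run_application_buggy("my_app")
except NameError as exc:
    message = str(exc)
    print("buggy version raised NameError:", message)

print("fixed version returned:", run_application_fixed("my_app"))
```

Because the lookup only fails when the offending line is executed, a typo like this can pass import-time checks and surface only once jobs actually run, which may be why it showed up as failed user jobs rather than at deployment.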

  • Failed Production Jobs (Application Finished With Errors) by Site
Failed_Production_Jobs_Application_Finished_With_Errors_by_Site.png
  • Failed Production Jobs (Watchdog identified this job as stalled) by Site
Failed_Production_Jobs_Watchdog_identified_this_job_as_stalled_by_Site.png
  • Failed Production Jobs (Pending Requests) by Site
Failed_Production_Jobs_Pending_Requests_by_Site.png
  • Failed Production Jobs (Received Kill signal) by Site
Failed_Production_Jobs_Received_Kill_signal_by_Site.png
  • Failed User Jobs (Application Finished With Errors) by Site
Failed_User_Jobs_Application_Finished_With_Errors_by_Site.png
  • Failed User Jobs (Input Data Resolution) by Site
Failed_User_Jobs_Input_Data_Resolution_by_Site.png
  • Failed User Jobs (No eligible sites for job) by Site
Failed_User_Jobs_No_eligible_sites_for_job_by_Site.png
  • Failed User Jobs (Chosen site is not eligible) by Site
Failed_User_Jobs_Chosen_site_is_not_eligible_by_Site.png
  • Failed User Jobs (Input Data Not Available) by Site
Failed_User_Jobs_Input_Data_Not_Available_by_Site.png

  • Failed Jobs at CERN by Minor Status
Failed_Jobs_at_CERN_by_MinorStatus.png
  • Failed Jobs at CNAF by Minor Status
Failed_Jobs_at_CNAF_by_MinorStatus.png
  • Failed Jobs at GRIDKA by Minor Status
Failed_Jobs_at_GRIDKA_by_MinorStatus.png
  • Failed Jobs at IN2P3 by Minor Status
Failed_Jobs_at_IN2P3_by_MinorStatus.png
  • Failed Jobs at NIKHEF by Minor Status
Failed_Jobs_at_NIKHEF_by_MinorStatus.png
  • Failed Jobs at PIC by Minor Status
Failed_Jobs_at_PIC_by_MinorStatus.png
  • Failed Jobs at RAL by Minor Status
Failed_Jobs_at_RAL_by_MinorStatus.png

Hardware Status

  • Various volhcb09:
    • CPU utilization: almost always more than 60% idle?
    • Network utilization: less than 150k on average
    • Swap Used: less than 500 MB
    • Partition Used: stable at 103 GB for the first 2 days, then stable at 50 GB

volhcb09_1_-86400_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb09_1_-86400_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb09_1_-86400_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb09_1_-86400_SWAP_SPACE_USED_STACKEDS_1.gif.png

  • DMS volhcb10:
    • CPU utilization: idle less than 20%
    • Network utilization: less than 300k
    • Swap Used: stable at 400 MB, then stable at 80 MB
    • Partition Used: stable at 80 GB

volhcb10_1_-86400_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb10_1_-86400_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb10_1_-86400_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb10_1_-86400_SWAP_SPACE_USED_STACKEDS_1.gif.png

  • LogSE volhcb06:
    • CPU utilization: Idle ~ 50%
    • Network utilization: above 1 M?
    • Swap Used: do we get close to the limit (2 GB)?
    • Partition Used: is it stable?

volhcb06_1_-86400_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb06_1_-86400_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb06_1_-86400_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb06_1_-86400_SWAP_SPACE_USED_STACKEDS_1.gif.png

  • WMS volhcb13:
    • CPU utilization: Idle ~84%
    • Network utilization: a peak of above 1 M
    • Swap Used: close to the limit (2 GB) in the first days of the week
    • Partition Used: quite stable at 150 GB

volhcb13_1_-86400_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb13_1_-86400_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb13_1_-86400_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb13_1_-86400_SWAP_SPACE_USED_STACKEDS_1.gif.png

-- FedericoStagni - 10 Aug 2009

Topic revision: r1 - 2009-08-10 - FedericoStagni
 