DIRAC Weekly Report 20090517

Week Ending 2009-05-17

  • Summary:
    • Almost 53 K jobs run last week
    • Over 26% failed
    • Daily peak of over 12 K jobs
    • 4 K Production jobs run to the end
    • 33 K User jobs run to the end
    • 1 K Production Jobs Failed
    • 12 K User Jobs Failed
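
The failure percentage can be cross-checked from the rounded counts above. The following is a minimal sketch, not the accounting system's own calculation: the helper name failure_rate is hypothetical, and because the rounded counts add up to 50 K rather than the reported total of almost 53 K, the result is only approximate.

```python
# Rounded job counts in thousands, copied from the summary bullets.
done = {"production": 4, "user": 33}
failed = {"production": 1, "user": 12}


def failure_rate(done_k, failed_k):
    """Return failed / (done + failed) as a fraction (hypothetical helper)."""
    total = done_k + failed_k
    return failed_k / total if total else 0.0


# Overall rate across both groups, then one rate per group.
overall = failure_rate(sum(done.values()), sum(failed.values()))
per_group = {group: failure_rate(done[group], failed[group]) for group in done}

print(f"overall: {overall:.1%}")
for group, rate in per_group.items():
    print(f"{group}: {rate:.1%}")
```

With these rounded inputs the overall rate is 13 / 50, in line with the "over 26%" quoted in the summary.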

  • Total number of Jobs by Final Major Status
Total_Number_of_Jobs_by_FinalMajorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s9:_typeNames3:Jobs9:_groupings16:FinalMajorStatuse )
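
The part of each plot link after "#" appears to be a length-prefixed serialisation of the plot selection: "d" and "e" wrap a dictionary, and every string is written as "s<length>:<text>". The sketch below decodes it under that assumption. The format is inferred from the links in this report rather than from DIRAC documentation, only these two token types are handled, and decode_fragment is a hypothetical helper name.

```python
def decode_fragment(fragment):
    """Decode a 'd s<len>:<text> ... e' fragment into a dict of strings."""
    if not (fragment.startswith("d") and fragment.endswith("e")):
        raise ValueError("expected a d...e dictionary")
    pos, end = 1, len(fragment) - 1  # skip the leading 'd' and trailing 'e'
    items = []
    while pos < end:
        if fragment[pos] != "s":
            raise ValueError(f"unexpected token at offset {pos}")
        colon = fragment.index(":", pos)
        length = int(fragment[pos + 1:colon])  # declared string length
        start = colon + 1
        items.append(fragment[start:start + length])
        pos = start + length
    if len(items) % 2:
        raise ValueError("odd number of strings, cannot pair keys and values")
    # Strings alternate key, value, key, value, ...
    return dict(zip(items[0::2], items[1::2]))


# Fragment of the "Total number of Jobs by Final Major Status" link.
fragment = (
    "ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1"
    "s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17"
    "s9:_typeNames3:Jobs9:_groupings16:FinalMajorStatuse"
)
params = decode_fragment(fragment)
for key, value in params.items():
    print(key, "=", value)
```

Links that contain percent-encoded characters such as %20 would need to be passed through urllib.parse.unquote first, since the declared lengths count the decoded characters.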

  • Daily number of Jobs by Final Major Status
Daily_Number_of_Jobs_by_FinalMajorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames12:NumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s9:_typeNames3:Jobs9:_groupings16:FinalMajorStatuse )

  • Done|Completed Jobs by User Group
Done+Complete_Jobs_by_UserGroup.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s17:_FinalMajorStatuss14:Done,Completeds9:_typeNames3:Jobs9:_groupings9:UserGroupe )

  • Done|Completed Production Jobs by Job Type
Done+Complete_Production_Jobs_by_JobType.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss14:Completed,Dones9:_typeNames3:Jobs9:_groupings7:JobTypee )

  • Failed Jobs by User Group
Failed_Jobs_by_UserGroup.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings9:UserGroupe )

  • Failed Production Jobs by Minor Status
Failed_Production_Jobs_by_MinorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse )

  • Failed User Jobs by Minor Status
Failed_User_Jobs_by_MinorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s10:_UserGroups9:lhcb_users17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse )

Running at Tier1s

  • Summary:
    • 1 K Production Jobs at Tier1s
      • Shares?
    • 24 K User Jobs at Tier1s
      • 59 % CERN Share

  • Done|Completed Production Jobs by Site
Done+Complete_Production_Jobs_at_Tier1_by_Site.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s10:_UserGroups9:lhcb_prods5:_Sites86:LCG.CERN.ch,LCG.CNAF.it,LCG.GRIDKA.de,LCG.IN2P3.fr,LCG.NIKHEF.nl,LCG.PIC.es,LCG.RAL.uks17:_FinalMajorStatuss14:Done,Completeds9:_typeNames3:Jobs9:_groupings4:Sitee )

  • Done|Completed User Jobs by Site
Done+Complete_User_Jobs_at_Tier1_by_Site.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s10:_UserGroups9:lhcb_users5:_Sites86:LCG.CERN.ch,LCG.CNAF.it,LCG.GRIDKA.de,LCG.IN2P3.fr,LCG.NIKHEF.nl,LCG.PIC.es,LCG.RAL.uks17:_FinalMajorStatuss14:Done,Completeds9:_typeNames3:Jobs9:_groupings4:Sitee )

Job Failure Analysis

  • Summary:
    • Production Jobs Failed mostly due to:
      • Application Finished With Errors across all sites (709)
      • Application Finished With Errors mostly at GRIDKA (94), IN2P3 (114), NIKHEF (118) and PIC (87)
    • User Jobs Failed mostly due to:
      • Input Data Resolution across all sites (8 K)
      • Input Data Resolution mostly at CERN (6 K)
    • Production 4741 failed almost every job. The failures can be classified as DaVinci failures (CERN, CNAF, GRIDKA, PIC and RAL) and Brunel failures (NIKHEF).
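
To put the per-site numbers in proportion, the share of the 709 production failures that occurred at the four named sites can be worked out directly. This is a rough sketch that assumes 709 is the total over all sites and that the per-site counts are a subset of it; the variable names are illustrative only.

```python
# Per-site counts of "Application Finished With Errors" from the summary.
site_failures = {"GRIDKA": 94, "IN2P3": 114, "NIKHEF": 118, "PIC": 87}
total_failures = 709  # assumed to be the total over all sites

listed = sum(site_failures.values())
share = listed / total_failures

print(f"{listed} of {total_failures} failures ({share:.0%}) at the four listed sites")
# Sites sorted from most to fewest failures.
for site, count in sorted(site_failures.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {site}: {count} ({count / total_failures:.1%})")
```

Under these assumptions the four sites account for 413 of the 709 failures, a little under 60%.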

  • Failed Production Jobs (Application Finished With Error) by Site
Failed_Production_Jobs_Application_Finished_With_Errors_by_Site.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss6:Faileds17:_FinalMinorStatuss32:Application%20Finished%20With%20Errorss9:_typeNames3:Jobs9:_groupings4:Sitee )

  • Failed User Jobs (Input Data Resolution) by Site
Failed_Users_Jobs_Input_Data_Resolution_by_Site.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s10:_UserGroups9:lhcb_users17:_FinalMajorStatuss6:Faileds17:_FinalMinorStatuss21:Input%20Data%20Resolutions9:_typeNames3:Jobs9:_groupings4:Sitee )

  • Failed Jobs at CERN by Minor Status
Failed_Jobs_at_CERN_by_MinorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s5:_Sites11:LCG.CERN.chs17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse )

  • Failed Jobs at GRIDKA by Minor Status
Failed_Jobs_at_GRIDKA_by_MinorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s5:_Sites13:LCG.GRIDKA.des17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse )

  • Failed Jobs at IN2P3 by Minor Status
Failed_Jobs_at_IN2P3_by_MinorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s5:_Sites12:LCG.IN2P3.frs17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse )

  • Failed Jobs at NIKHEF by Minor Status
Failed_Jobs_at_NIKHEF_by_MinorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s5:_Sites13:LCG.NIKHEF.nls17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse )

  • Failed Jobs at PIC by Minor Status
Failed_Jobs_at_PIC_by_MinorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-05-11s8:_endTimes10:2009-05-17s5:_Sites10:LCG.PIC.ess17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse )

Hardware Status

  • WMS volhcb09:
    • CPU utilization: Idle < 50%?, IO Wait peaks?
    • Network utilization: several peaks of 800k and an average of 300k
    • Swap Used: 0.2 GB at the beginning of the week, then drops to 40 MB.

(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=1d&entity=volhcb09&detailed=yes )

volhcb09_1_0_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb09_1_0_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb09_1_0_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb09_1_0_SWAP_SPACE_USED_STACKEDS_1.gif.png

  • DMS volhcb10:
    • CPU utilization: barely any IO.
    • Network utilization: one peak of 60k, average of 30k
    • Swap Used: 0.4 GB throughout the week.
    • Partition Used: ~70 GB throughout the week

(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=1d&entity=volhcb10&detailed=yes )

volhcb10_1_0_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb10_1_0_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb10_1_0_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb10_1_0_SWAP_SPACE_USED_STACKEDS_1.gif.png

  • LogSE volhcb06:
    • CPU utilization: several peaks at the beginning of the day from the 14th onwards.
    • Network utilization: a couple of peaks of 100k on the 12th and 13th
    • Swap Used: < 300 kB.
    • Partition Used: slight decrease in usage from ~650 GB to ~600 GB

(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=1d&entity=volhcb06&detailed=yes )

volhcb06_1_0_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb06_1_0_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb06_1_0_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb06_1_0_SWAP_SPACE_USED_STACKEDS_1.gif.png

  • Various volhcb01:
    • CPU utilization: periodic peaks of user load, periodic peaks of IO from the 13th onwards
    • Swap Used: ~150 MB consistently used.
    • Partition Used: ~500GB stable.

(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=1d&entity=volhcb01&detailed=yes )

volhcb01_1_0_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb01_1_0_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb01_1_0_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb01_1_0_SWAP_SPACE_USED_STACKEDS_1.gif.png

-- MarcosASeco - 18 May 2009
