Week from 05 Jul 2009 to 12 Jul 2009

Job Statistics

  • Summary:
    • Almost 295K jobs run last week
    • Over 6% failed
    • Daily peak of over 49K jobs
    • 252K Production jobs run to the end
    • 43K User jobs run to the end
    • 18K Production jobs failed
    • Almost 5K User jobs failed
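
The counts above are rounded to the nearest thousand, so any recomputed failure fraction is only approximate. The following is a minimal sketch, assuming that the "run to the end" counts exclude the failed jobs (the report does not state this explicitly); the variable and function names are illustrative only.

```python
# Rounded counts taken from the summary above.
done = {"production": 252_000, "user": 43_000}
failed = {"production": 18_000, "user": 5_000}


def failed_fraction(done_count, failed_count):
    """Share of failed jobs among all jobs that reached a final state.

    Assumption: done_count does not already include failed_count.
    """
    return failed_count / (done_count + failed_count)


overall = failed_fraction(sum(done.values()), sum(failed.values()))
per_group = {group: failed_fraction(done[group], failed[group]) for group in done}

print(f"overall: {overall:.1%}")
for group, fraction in per_group.items():
    print(f"{group}: {fraction:.1%}")
```

Under this assumption the overall fraction is consistent with the "over 6%" quoted above, and the user jobs fail at a higher rate than the production jobs.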

  • Total number of Jobs by Final Major Status
Total_Number_of_Jobs_by_FinalMajorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s9:_typeNames3:Jobs9:_groupings16:FinalMajorStatuse )

  • Daily number of Jobs by Final Major Status
Daily_Number_of_Jobs_by_FinalMajorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames12:NumberOfJobss13:_timeSelectors5:86400s9:_typeNames3:Jobs9:_groupings16:FinalMajorStatuse )

  • Done|Completed Jobs by User Group
Done+Complete_Jobs_by_UserGroup.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s17:_FinalMajorStatuss14:Completed,Dones9:_typeNames3:Jobs9:_groupings9:UserGroupe )

  • Done|Completed Production Jobs by Job Type
Done+Complete_Production_Jobs_by_JobType.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss14:Completed,Dones9:_typeNames3:Jobs9:_groupings7:JobTypee )

  • Failed Jobs by User Group
Failed_Jobs_by_UserGroup.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings9:UserGroupe )

  • Failed Production Jobs by Minor Status
Failed_Production_Jobs_by_MinorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse )

  • Failed User Jobs by Minor Status
Failed_User_Jobs_by_MinorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s10:_UserGroups9:lhcb_users17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse )
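
The plot links in this section carry their selection (plot name, time window, grouping, filters) in the URL fragment. Judging only from the links themselves, the fragment looks like a length-prefixed encoding similar to bencode: `d ... e` wraps a dictionary and `s<length>:<text>` is a string. The sketch below decodes it under that assumption; it is not taken from the portal documentation, it handles only the two markers seen in these links, and the function names are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def decode_value(data, pos=0):
    """Decode one value starting at data[pos]; return (value, next_pos)."""
    kind = data[pos]
    if kind == "s":  # string: s<length>:<characters>
        colon = data.index(":", pos)
        length = int(data[pos + 1:colon])
        start = colon + 1
        return data[start:start + length], start + length
    if kind == "d":  # dictionary: d<key><value>...e
        result = {}
        pos += 1
        while data[pos] != "e":
            key, pos = decode_value(data, pos)
            value, pos = decode_value(data, pos)
            result[key] = value
        return result, pos + 1
    raise ValueError(f"unsupported type marker {kind!r} at offset {pos}")


def decode_plot_link(url):
    """Return the selection dictionary carried in an accounting plot link."""
    # Lengths appear to count characters after percent-decoding,
    # so the fragment is unquoted before it is parsed.
    fragment = unquote(urlsplit(url).fragment)
    value, _ = decode_value(fragment)
    return value


link = ("https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/"
        "accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelector"
        "s5:86400s9:_typeNames3:Jobs9:_groupings16:FinalMajorStatuse")
print(decode_plot_link(link))
```

Read this way, the time selector of 86400 in every link is a string value, and filters such as the user group or the final major status appear as extra keys whose values are comma-separated lists.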

Running at Tier1s

  • Summary:
    • 96K Production Jobs at Tier1s
      • 29% at GridKA
      • 28% at RAL
      • 12% at CERN
      • 11% at CNAF
      • 7% at IN2P3
      • 7% at NIKHEF
      • 6% at PIC

    • Almost 14K User Jobs at Tier1s
      • 22% CERN Share
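
The production shares above can be turned into approximate job counts per site. This is a rough sketch only: both the 96K total and the percentages are rounded in the report, so the results are indicative to within a few hundred jobs. The variable names are illustrative.

```python
# Rounded figures from the Tier1 summary above.
production_jobs_at_tier1 = 96_000
shares_percent = {
    "GridKA": 29,
    "RAL": 28,
    "CERN": 12,
    "CNAF": 11,
    "IN2P3": 7,
    "NIKHEF": 7,
    "PIC": 6,
}

# Sanity check: the quoted shares should cover all Tier1 production jobs.
total_percent = sum(shares_percent.values())

# Approximate number of production jobs per site.
approx_jobs = {
    site: production_jobs_at_tier1 * percent // 100
    for site, percent in shares_percent.items()
}

print(f"shares add up to {total_percent}%")
for site, jobs in approx_jobs.items():
    print(f"{site:8s} ~{jobs / 1000:.1f}K")
```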

  • Done|Completed Production Jobs by Site
Done+Complete_Production_Jobs_at_Tier1_by_Site.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s10:_UserGroups9:lhcb_prods5:_Sites86:LCG.CERN.ch,LCG.CNAF.it,LCG.GRIDKA.de,LCG.IN2P3.fr,LCG.NIKHEF.nl,LCG.PIC.es,LCG.RAL.uks17:_FinalMajorStatuss14:Done,Completeds9:_typeNames3:Jobs9:_groupings4:Sitee )

  • Done|Completed User Jobs by Site
Done+Complete_User_Jobs_at_Tier1_by_Site.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s10:_UserGroups9:lhcb_users5:_Sites86:LCG.CERN.ch,LCG.CNAF.it,LCG.GRIDKA.de,LCG.IN2P3.fr,LCG.NIKHEF.nl,LCG.PIC.es,LCG.RAL.uks17:_FinalMajorStatuss14:Done,Completeds9:_typeNames3:Jobs9:_groupings4:Sitee )

Job Failure Analysis

  • Summary:
    • Production Jobs Failed mostly due to:
      • application finished with errors (~3600)
      • watchdog identified this job as stalled (~600)
    • User Jobs Failed mostly due to:
      • input data resolution (~1600)
      • application finished with errors (~1600)
      • received kill signal (~200)
      • uploading job outputs (~200)
    • IN2P3 was unable to handle a batch of user jobs. This is still being investigated.

  • Failed Production Jobs (Application Finished With Error) by Site
Failed_Production_Jobs_Application_Finished_With_Errors_by_Site.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss6:Faileds17:_FinalMinorStatuss32:Application%20Finished%20With%20Errorss9:_typeNames3:Jobs9:_groupings4:Sitee )

  • Failed User Jobs (Input Data Resolution) by Site
Failed_Users_Jobs_Input_Data_Resolution_by_Site.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s10:_UserGroups9:lhcb_users17:_FinalMajorStatuss6:Faileds17:_FinalMinorStatuss21:Input%20Data%20Resolutions9:_typeNames3:Jobs9:_groupings4:Sitee )

  • Failed Jobs at GRIDKA by Minor Status
Failed_Jobs_at_GRIDKA_by_MinorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors5:86400s5:_Sites13:LCG.GRIDKA.des17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse )

  • Failed Jobs at CERN by Minor Status
Failed_Jobs_at_CERN_by_MinorStatus.png

  • Failed Jobs at PIC by Minor Status
Failed_Jobs_at_PIC_by_MinorStatus.png

Hardware Status

  • WMS volhcb09:
    • CPU utilization: Idle more than 50%
    • Network utilization: almost always below 400k?
    • Swap Used: ~71Mb, well under the limits
    • Partition Used: stable at ~96Gb

(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=1d&entity=volhcb09&detailed=yes )

volhcb09_1_-86400_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb09_1_-86400_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb09_1_-86400_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb09_1_-86400_SWAP_SPACE_USED_STACKEDS_1.gif.png

  • DMS volhcb10:
    • CPU utilization: always 60%+
    • Network utilization: between 100k and 250k
    • Swap Used: ~350Mb. Little peak at the beginning of the week at ~450Mb
    • Partition Used: stable at 80Mb

(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=1d&entity=volhcb10&detailed=yes )

volhcb10_1_-86400_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb10_1_-86400_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb10_1_-86400_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb10_1_-86400_SWAP_SPACE_USED_STACKEDS_1.gif.png

  • LogSE volhcb06:
    • CPU utilization: Idle more than 50%? Several IO Wait peaks between 50% and 60%
    • Network utilization: 200k-300k
    • Swap Used: ~200k
    • Partition Used: quite stable, around 900k

(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=1d&entity=volhcb06&detailed=yes )

volhcb06_1_-86400_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb06_1_-86400_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb06_1_-86400_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb06_1_-86400_SWAP_SPACE_USED_STACKEDS_1.gif.png

  • Various volhcb01:
    • CPU utilization: almost always Idle more than 90%
    • Network utilization: mean: ~100k
    • Swap Used: 150k
    • Partition Used: quite stable (~700Mb)

(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=1d&entity=volhcb01&detailed=yes )

volhcb01_1_-86400_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb01_1_-86400_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb01_1_-86400_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb01_1_-86400_SWAP_SPACE_USED_STACKEDS_1.gif.png

-- FedericoStagni - 13 Jul 2009
