DIRAC Weekly Report 20090510

Week Ending 2009-05-10

This is the report for the week ending 10 May 2009. The URLs still point to an earlier period because I copy-pasted the information, but the images are correct: the DIRAC plots cover 3 May 2009 to 9 May 2009 (inclusive), and the hardware statuses were taken at about 15:00 on 10 May 2009.

Job Statistics

(Follow the URLs, change the dates, save the new plot and add it as an attachment, then update the images)
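
The date change can also be scripted. The sketch below assumes, from inspecting the URLs on this page, that the part after `#` is a flat dictionary of length-prefixed strings (`s<length>:<text>`, wrapped in `d` ... `e`). The helper names `decode_fragment` and `set_plot_dates` are hypothetical and not part of DIRAC. Because a `YYYY-MM-DD` date is always 10 characters long, the `s10:` prefixes stay valid after the dates are swapped.

```python
import re
from urllib.parse import unquote

DATE = r"\d{4}-\d{2}-\d{2}"


def decode_fragment(fragment):
    """Decode a 'd s<len>:<text> ... e' fragment into a dict of strings.

    Assumption: the fragment is a flat dict whose keys and values are all
    length-prefixed strings, as seen in the URLs on this page.
    """
    text = unquote(fragment)  # some URLs carry %20 for spaces
    if not (text.startswith("d") and text.endswith("e")):
        raise ValueError("expected a fragment of the form d...e")
    pos, end, items = 1, len(text) - 1, []
    while pos < end:
        if text[pos] != "s":
            raise ValueError(f"unexpected marker {text[pos]!r} at offset {pos}")
        colon = text.index(":", pos)
        length = int(text[pos + 1:colon])
        items.append(text[colon + 1:colon + 1 + length])
        pos = colon + 1 + length
    # Items alternate key, value, key, value, ...
    return dict(zip(items[0::2], items[1::2]))


def set_plot_dates(url, start, end):
    """Return url with the start and end dates replaced (YYYY-MM-DD)."""
    for value in (start, end):
        if not re.fullmatch(DATE, value):
            raise ValueError(f"not a YYYY-MM-DD date: {value!r}")
    url = re.sub(r"(_startTimes10:)" + DATE, r"\g<1>" + start, url)
    url = re.sub(r"(_endTimes10:)" + DATE, r"\g<1>" + end, url)
    return url


# Fragment copied from the first plot URL in this report.
old_fragment = (
    "ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1"
    "s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02"
    "s9:_typeNames3:Jobs9:_groupings16:FinalMajorStatuse"
)
new_fragment = set_plot_dates(old_fragment, "2009-05-03", "2009-05-09")
print(decode_fragment(new_fragment))
```

The same call works on the full URL, since only the two date fields are touched.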

  • Total number of Jobs by Final Major Status
Total_Number_of_Jobs_by_FinalMajorStatus.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s9:_typeNames3:Jobs9:_groupings16:FinalMajorStatuse)
  • Daily number of Jobs by Final Major Status
Daily_Number_of_Jobs_by_FinalMajorStatus.png

(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames12:NumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s9:_typeNames3:Jobs9:_groupings16:FinalMajorStatuse)

  • Done|Completed Jobs by User Group
Done_and_Complete_Jobs_by_UserGroup.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s17:_FinalMajorStatuss14:Completed,Dones9:_typeNames3:Jobs9:_groupings9:UserGroupe)

  • Done|Completed Production Jobs by JobType
Done_and_Complete_Production_Jobs_by_JobType.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss14:Completed,Dones9:_typeNames3:Jobs9:_groupings7:JobTypee)

  • Failed Jobs by User Group
Failed_Jobs_by_UserGroup.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings9:UserGroupe)

  • Failed Production Jobs by Minor Status
Failed_Production_Jobs_by_MinorStatus.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse)

  • Failed User Jobs by Minor Status
Failed_User_Jobs_by_MinorStatus.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_users17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse)

  • Summary:
    • More than 54K jobs run last week
    • Approximately 7.7K failed
    • Daily peak of between 5K and 16K jobs (done + completed + failed)
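
As a rough cross-check of the summary figures, the failure fraction can be computed directly. This is only a sketch: the total is quoted as a lower bound and the failed count is approximate, so the result is indicative rather than exact, and `failure_percent` is a hypothetical helper name.

```python
def failure_percent(failed, total):
    """Percentage of jobs that failed, given failed and total job counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


jobs_run = 54_000    # the report quotes this as a lower bound
jobs_failed = 7_700  # approximate figure from the report
print(f"Failure fraction: about {failure_percent(jobs_failed, jobs_run):.0f}%")
```

With these inputs the failure fraction comes out at roughly 14%.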

Running at Tier-1s

  • Done|Completed Production Jobs by Site
Done_and_Complete_Production_Jobs_at_Tier1_by_Site.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_prods5:_Sites86:LCG.CERN.ch,LCG.CNAF.it,LCG.GRIDKA.de,LCG.IN2P3.fr,LCG.NIKHEF.nl,LCG.PIC.es,LCG.RAL.uks17:_FinalMajorStatuss14:Done,Completeds9:_typeNames3:Jobs9:_groupings4:Sitee)

  • Done|Completed User Jobs by Site
Done_and_Complete_User_Jobs_at_Tier1_by_Site.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_users5:_Sites86:LCG.CERN.ch,LCG.CNAF.it,LCG.GRIDKA.de,LCG.IN2P3.fr,LCG.NIKHEF.nl,LCG.PIC.es,LCG.RAL.uks17:_FinalMajorStatuss14:Done,Completeds9:_typeNames3:Jobs9:_groupings4:Sitee)

  • Summary:
    • This primarily corresponds to the 10 + 10 M MC09 productions and the corresponding merging productions
    • About 36K user jobs on the Tier-1s in this period (done + completed)
      • > 50% CERN share

Job Failure Analysis

(Change Error and User Group as appropriate)

  • Failed Production Jobs by FinalMinorStatus
Failed_Production_Jobs_by_Final_Minor_Status.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse)

  • Failed Production Jobs ( Application Finished With Error) by Site
Failed_Production_Jobs_Application_Finished_With_Errors_by_Site.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss6:Faileds17:_FinalMinorStatuss32:Application%20Finished%20With%20Errorss9:_typeNames3:Jobs9:_groupings4:Sitee)

  • Failed Production Jobs ( Input Sandbox Download) by Site
Failed_Production_Jobs_Input_Sandbox_Download_by_Site.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss6:Faileds17:_FinalMinorStatuss22:Input%20Sandbox%20Downloads9:_typeNames3:Jobs9:_groupings4:Sitee)

  • Failed Production Jobs ( Input Data Resolution) by Site
Failed_Production_Jobs_Input_Data_Resolution_by_Site.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_prods17:_FinalMajorStatuss6:Faileds17:_FinalMinorStatuss21:Input%20Data%20Resolutions9:_typeNames3:Jobs9:_groupings4:Sitee)

  • Failed User Jobs by FinalMinorStatus
Failed_User_Jobs_by_Final_Minor_Status.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_users17:_FinalMajorStatuss6:Faileds9:_typeNames3:Jobs9:_groupings16:FinalMinorStatuse)

  • Failed User Jobs ( Input Data Resolution) by Site
Failed_Users_Jobs_Input_Data_Resolution_by_Site.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_users17:_FinalMajorStatuss6:Faileds17:_FinalMinorStatuss21:Input%20Data%20Resolutions9:_typeNames3:Jobs9:_groupings4:Sitee)

  • Failed User Jobs ( Application Finished With Error) by Site
Failed_Users_Jobs_Application_Finished_With_Errors_by_Site.png
(https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames17:TotalNumberOfJobss13:_timeSelectors2:-1s10:_startTimes10:2009-04-25s8:_endTimes10:2009-05-02s10:_UserGroups9:lhcb_users17:_FinalMajorStatuss6:Faileds17:_FinalMinorStatuss32:Application%20Finished%20With%20Errorss9:_typeNames3:Jobs9:_groupings4:Sitee)

  • Summary:
    • The application failures in MC09 simulation (productions 4719, 4720) have been communicated to the experts.
      • Jobs also failed at some sites because the queues are tuned to LHCb requirements and are just a fraction too short for running 4000 minimum bias events per job.
    • Most failures for merging productions were due to the Tier-1 SEs being unstable at CNAF and GridKa. These largely cleared up by the Friday of the week.
      • CNAF was down because the StoRM / GPFS instance was improperly configured. The problem was finally solved on Friday. GGUS ticket 48392.
      • GridKa had problems with the number of TURLs they could handle, and some worker nodes had a maximum allowed file size of 4 GB while the merging productions produced 5 GB output files. The problem was identified and temporarily fixed (pending a more thorough fix) on Friday. GGUS tickets 48627 and 48270.
      • Data access at IN2P3 was much slower than at other Tier-1s, though jobs finished successfully. GGUS ticket 48241.
      • The merging jobs (productions 4723, 4725) ran without problems at RAL, PIC, CERN and NIKHEF.
    • User jobs failed mostly due to:
      • Input Data Resolution (mostly at GridKa)
      • Application Finished with Error
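
The GridKa file-size problem noted in the summary could be caught before a merge job starts writing. The sketch below assumes the 4 GB cap was a per-process file-size limit (the kind reported by `ulimit -f`); the report does not say where the limit came from, so this is an illustration only. `fits_limit` and `current_file_size_limit` are hypothetical helper names, and the `resource` module is available on Unix only.

```python
import resource

GIB = 2 ** 30  # assumption: sizes counted in binary gigabytes


def fits_limit(size_bytes, limit_bytes):
    """True if a file of size_bytes can be written under limit_bytes."""
    return limit_bytes == resource.RLIM_INFINITY or size_bytes <= limit_bytes


def current_file_size_limit():
    """Soft per-process file-size limit in bytes (RLIM_INFINITY if none)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    return soft


expected_output = 5 * GIB  # merged output size quoted in the report
if fits_limit(expected_output, current_file_size_limit()):
    print("file-size limit allows the expected output")
else:
    print("file-size limit is too small for the expected output")
```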

Hardware Status

Status at ~15:00 on 2009-05-10:

  • WMS volhcb09:
    • swap usage ~0.6 GB
(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=0&entity=volhcb09&detailed=yes)
volhcb09_1_0_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb09_1_0_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb09_1_0_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb09_1_0_SWAP_SPACE_USED_STACKEDS_1.gif.png
[root@volhcb09 ~]# df -h /opt/dirac
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda6             100G   85G   11G  90% /home

  • DMS volhcb10:
(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=0&entity=volhcb10&detailed=yes)
volhcb10_1_0_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb10_1_0_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb10_1_0_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb10_1_0_SWAP_SPACE_USED_STACKEDS_1.gif.png
[root@volhcb10 ~]# df -h /opt/dirac
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda6             100G   60G   36G  63% /home

  • LogSE volhcb06:
    • Must watch disk usage more closely, probably need additional disk.
    • Daily IOwait peaks, probably due to full disk scan by updatedb cron (to be removed).
(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=0&entity=volhcb06&detailed=yes)
volhcb06_1_0_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb06_1_0_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb06_1_0_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb06_1_0_SWAP_SPACE_USED_STACKEDS_1.gif.png
[root@volhcb06 ~]# df -h /opt/dirac/ /storage/
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda11             92G  3.0G   84G   4% /opt
/dev/sdb1             917G  599G  310G  66% /storage

  • Various volhcb01:

(https://lemonweb.cern.ch/lemon-web/info.php?time=1&offset=0&entity=volhcb01&detailed=yes)
volhcb01_1_0_CPUUTILPERCUSER_CPUUTILPERCSYSTEM_CPUUTILPERCNICE_CPUUTILPERCIDLE_CPUUTILPERCIOWAIT_CPUUTILPERCIRQ_CPUUTILPERCSOFTIRQSTACKEDC_1.gif.png volhcb01_1_0_NUMKBREADAVG_NUMKBWRITEAVGOVERLAYN_1.gif.png
volhcb01_1_0_PARTITIONUSEDPERC_STACKEDP_1.gif.png volhcb01_1_0_SWAP_SPACE_USED_STACKEDS_1.gif.png
[root@volhcb01 ~]# df -h /opt/dirac /storage
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda9             129G   67G   56G  55% /opt
/dev/sdb1             917G  384G  487G  45% /storage
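
To follow up on the note about watching disk usage more closely, a small check along these lines could be run periodically. It is a sketch only: the 80% threshold and the helper names are assumptions, and the percentages in `reported` are the Use% values copied from the `df` output above.

```python
import shutil


def use_percent(path):
    """Use% for the filesystem holding path, computed as used / (used + free)."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / (usage.used + usage.free)


def over_threshold(percent_by_mount, threshold=80.0):
    """Names of mounts whose usage is at or above threshold (in percent)."""
    return sorted(
        name for name, pct in percent_by_mount.items() if pct >= threshold
    )


# Use% values copied from the df output in this report.
reported = {
    "volhcb09:/home": 90,
    "volhcb10:/home": 63,
    "volhcb06:/opt": 4,
    "volhcb06:/storage": 66,
    "volhcb01:/opt": 55,
    "volhcb01:/storage": 45,
}
print("Mounts at or above 80%:", over_threshold(reported))
print(f"Local root filesystem: {use_percent('/'):.0f}% used")
```

On a live host, `use_percent` can be called on each mount point of interest and the results passed to `over_threshold`.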


-- Main.RajaNandakumar - 09 May 2009
Topic revision: r2 - 2009-05-10 - RajaNandakumar