Difference: ProductionShifterGuide (1 vs. 23)

Revision 23 - 2017-01-19 - GiacomoGraziani

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"
Changed:
<
<
A new page https://lhcb-shifters.web.cern.ch/ has been created with additional information for shifters, please also follow instructions there. This twiki page is not being maintained anymore %ENCDOLOR%
>
>
A new page https://lhcb-shifters.web.cern.ch/ has been created with additional information for shifters, please also follow instructions there. This twiki page is not being maintained anymore
 

Grid Shifter Guide

Revision 22 - 2012-11-12 - StefanRoiser

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"
Added:
>
>
A new page https://lhcb-shifters.web.cern.ch/ has been created with additional information for shifters, please also follow instructions there. This twiki page is not being maintained anymore %ENCDOLOR%
 

Grid Shifter Guide

UpdatedProductionShifterGuide (under development Sept 2010)

Revision 21 - 2010-11-16 - FedericoStagni

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 35 to 35
 

Tier-1 Sites

Tier-1 sites are used for Analysis, Monte Carlo production, file transfer and file storage in the LHCb Computing Model.

Changed:
<
<
  • LCG.CERN.ch
>
>
  • LCG.CERN.ch (acting also as a Tier 0)
 
  • LCG.CNAF.it
  • IN2P3.fr
  • LCG.NIKHEF.nl
Added:
>
>
  • LCG.SARA.nl
 
Line: 50 to 51
 

Backend Storage Systems

Changed:
<
<
Two backend storage technologies are employed at the Tier-1 sites, Castor and dCache. The Tier-1 sites which utilise each technology choice are summarised in the table below:
>
>
Three backend storage technologies are employed at the Tier-1 sites: Castor, dCache and StoRM. The Tier-1 sites which utilise each technology are summarised in the table below; a quick status check for the corresponding storage elements is sketched after the table:
 
Backend Storage Tier-1 Site
Changed:
<
<
Castor CERN, CNAF, RAL
>
>
Castor CERN, RAL
 
dCache IN2P3, NIKHEF, GridKa, PIC
Added:
>
>
StoRM CNAF
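
As a quick cross-check of the storage behind these sites, the status of the LHCb storage elements can be listed from the command line. This is only a sketch: it assumes the standard DIRAC script dirac-dms-show-se-status (not mentioned elsewhere on this page) is available in the shifter environment.

dirac-dms-show-se-status    # list each Storage Element with its current read/write status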
 

Jobs

Changed:
<
<
The number of jobs created for a productions varies depending on the exact requirements of the production. Grid Shifters are generally not required to create jobs for a production.
>
>
The number of jobs created for a production varies depending on the exact requirements of the production. Grid Shifters are generally not required to create jobs for a production.
 

JobIDs

Line: 293 to 295
  The Production Monitoring Webpage has the following features:
Deleted:
<
<
 
Deleted:
<
<

Site Downtime Calendar

The calendar [6] displays all the sites with scheduled and unscheduled downtime. Calendar entries are automatically parsed through the site downtime RSS feed and added to the calendar.

Occasionally the feed isn't parsed correctly and Grid Shifters should double-check that the banned and allowed sites are correct. Useful scripts for this are:

dirac-admin-get-banned-sites
and
dirac-admin-get-site-mask
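
For example, a minimal sketch of the cross-check (both scripts are normally run without arguments and simply print the relevant site lists):

dirac-admin-get-site-mask       # sites currently allowed to receive jobs
dirac-admin-get-banned-sites    # sites currently banned

Any site with an announced downtime should normally appear in the banned list; if it does not, the discrepancy is worth reporting in the ELOG.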

Plots

The Production Monitoring Webpage has the capacity to produce various plots. Many of which are extremely useful to monitor the performance of the production system.

Links to useful plots can be found on the DIRAC System Monitoring Pages. These plots should be monitored at three times daily.

 

Buglist and Feature Request

The procedure to submit a bug report or a feature request is outlined in Procedures.

Line: 333 to 312
  The new shifter should:
Changed:
<
<
>
>
  • Ensure their Grid certificate is valid for all expected duties
  • Create accounts on all relevant web-resources
  • Subscribe to the relevant mailing lists
 

Grid Certificates

Line: 386 to 365
 
  • Check that there is a minimum of one successful (and complete) job.
  • Confirm that data access is working at least intermittently.
  • Report problems to the operations team.
Changed:
<
<
  • Submit a summary of the job status at all the grid sites to the ELOG 7.
>
>
  • Submit a summary of the job status at all the grid sites to the ELOG.
 

Performance Monitoring

Line: 436 to 415
  Return the key for the Production Operations Room (TCE5) to the secretariat or the next Grid Shifter.
Deleted:
<
<

Weekly Report

A weekly report should be prepared by the Grid Shifter at the end of each week. The report should contain information on all the processed production and user jobs, the respective failure rates and some basic analysis of the results. The report should be compiled on the last day of the shift and contain information about the previous seven full days of operation, i.e. it should not include information from the day the report is compiled.

The weekly reports are to be uploaded to the Weekly Reports Page on the LHCb Computing tWiki. Grid Shifters should use the template provided when compiling a report.

 

Base Plots

Line: 482 to 452
 

Machine Monitoring Plots

Monitoring of the LHCb VO boxes is vital to maintaining the efficient running of all Grid operations. Particular attention should be paid to the used and free space on the various

Changed:
<
<
disks, network and CPU usage. Reports on the state of the following boxes should be constructed:
  • vobox01
  • vobox06
  • vobox09
  • vobox10

For each machine, save and then upload the plots for:

  • CPU utilization
  • Network utilization
  • Partition Used
  • Swap Used
Note: Mac users may find that the suggested name when saving the plots does not follow the format “*.gif.png” and they should take care to either rename the saved files or edit that week’s report page accordingly.
>
>
disks, network and CPU usage. The machines can be monitored using Lemon.
 

Analysis and Summary

Line: 534 to 491
 

When to Submit an ELOG

Changed:
<
<
Submit an ELOG in the following situations:
>
>
A non-exhaustive list of cases in which an ELOG has to be submitted includes (a sample entry is sketched after the list):
 
  • Jobs finalise with exceptions.
Changed:
<
<
>
>
  • The applications run in the job crash with exceptions.
  • A production is stuck/does not proceed/is failing all the jobs/...
  • Site related problems:
    • A large number/percentage of pilots are aborting
    • Shared area slowness (e.g. : jobs failed with Application status = "SetupProject.sh execution failed")
    • The site is killing a suspiciously high number of jobs.
    • ...
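For illustration only, a minimal ELOG entry for one of the cases above might look like the sketch below (the production ID, job ID and site are invented):

Subject: Production 00001234 - large fraction of pilots aborting at LCG.CNAF.it
Production ID: 00001234
Example JobID: 9876
Site: LCG.CNAF.it
Symptom: pilots aborting since the start of the morning shift; jobs remain in the Waiting state.
Action: reported to the operations team; status to be followed up at the end of the shift.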
 

Exceptions

Revision 20 - 2010-11-14 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 976 to 976
  -- PaulSzczypka - 14 Aug 2009
Deleted:
<
<

Development Section for monitoring plots.

These are being assembled at https://twiki.cern.ch/twiki/bin/view/Main/PeterClarkeShiftProcesses

 

META FILEATTACHMENT attachment="dirac-primary-states.png" attr="" comment="Job status flowchart. Note that the ``Checking'' and ``Staging'' status are omitted." date="1251818035" name="dirac-primary-states.png" path="dirac-primary-states.png" size="111571" stream="dirac-primary-states.png" tmpFilename="/usr/tmp/CGItemp57097" user="szczypka" version="1"

Revision 19 - 2010-11-04 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Changed:
<
<
UpdatedProductionShifterGuide (under development Sept 2010
>
>
UpdatedProductionShifterGuide (under development Sept 2010)
 
Contents:

Revision 18 - 2010-10-19 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Added:
>
>
UpdatedProductionShifterGuide (under development Sept 2010
 
Contents:
Added:
>
>
 Placeholder for the tWiki version of the shifter guide.

Introduction

Revision 17 - 2010-10-05 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 973 to 973
  -- PaulSzczypka - 14 Aug 2009
Changed:
<
<

Development Section for new Shifter Guide Material and Shifter Plots Sept 2010 (Pete Clarke)

The aim is to eventually gather a set of common production shifter plots. This is a stopgap whilst developing a useful set of plots. It is hoped to be able to share these via presenter in the future. The format of all plots is a 7-day view on the left and a 1-day view on the right.

All LHCb Jobs:

This sets the scene showing the split between production jobs and user jobs.

lhcb_prod Jobs:

This set of plots is for lhcb production only (filtered on lhcb_prod). The first line shows the number of failed lhcb_prod jobs.

The rest of these plots are for the failed lhcb_prod jobs only, to diagnose where and why they failed.

lhcb_user Jobs:

This set of plots is for lhcb user only (filtered on lhcb_user). The first line shows the number of failed lhcb_user jobs.

The rest of these plots are for the failed user jobs only, to diagnose where and why they failed.

Data Upload:

This next set of plots shows the results of data upload at each Tier-1 site, listed by source location. Green indicates success, whilst red indicates failure. These plots only show there may be a space problem, but dont tell you which space token

These plots show the data upload results for each space token listed by site.

Pilot monitoring:

These plots would show a set of three for each site, showing pilot status and then information on aborted pilots.

>
>

Development Section for monitoring plots.

 
Added:
>
>
These are being assembled at https://twiki.cern.ch/twiki/bin/view/Main/PeterClarkeShiftProcesses
 

Revision 16 - 2010-10-04 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 971 to 971
 
XML
eXtensible Markup Language
XML-RPC
XML Remote Procedure Call
Added:
>
>
-- PaulSzczypka - 14 Aug 2009
 
Changed:
<
<

Development Section for Shifter Plots Sept 2010 (dont rely on for now -Pete Clarke)

>
>

Development Section for new Shifter Guide Material and Shifter Plots Sept 2010 (Pete Clarke)

  The aim is to eventually gather a set of common production shifter plots. This is a stopgap whilst developing a useful set of plots. It is hoped to be able to share these via presenter in the future. The format of all plots is a 7-day view on the left and a 1-day view on the right.
Line: 1295 to 1296
 
Changed:
<
<
-- PaulSzczypka - 14 Aug 2009
>
>
 
META FILEATTACHMENT attachment="dirac-primary-states.png" attr="" comment="Job status flowchart. Note that the ``Checking'' and ``Staging'' status are omitted." date="1251818035" name="dirac-primary-states.png" path="dirac-primary-states.png" size="111571" stream="dirac-primary-states.png" tmpFilename="/usr/tmp/CGItemp57097" user="szczypka" version="1"
META FILEATTACHMENT attachment="get_logfiles.png" attr="" comment="View all the output files of a job via the Job Monitoring Webpage." date="1251818814" name="get_logfiles.png" path="get_logfiles.png" size="38751" stream="get_logfiles.png" tmpFilename="/usr/tmp/CGItemp60974" user="szczypka" version="1"

Revision 15 - 2010-10-04 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 1161 to 1161
 
Added:
>
>

Pilot monitoring:

These plots would show a set of three for each site, showing pilot status and then information on aborted pilots.

 

Revision 14 - 2010-10-04 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 974 to 974
 

Development Section for Shifter Plots Sept 2010 (dont rely on for now -Pete Clarke)

Changed:
<
<
Apologies to those who come across this in Sept/Oct 2010. The aim is to eventually gather a set of common production shifter plots. This is a stopgap whilst developing a useful set of plots. It is hoped to be able to share these via presenter in the future.
>
>
The aim is to eventually gather a set of common production shifter plots. This is a stopgap whilst developing a useful set of plots. It is hoped to be able to share these via presenter in the future. The format of all plots is a 7-day view on the left and a 1-day view on the right.
 

All LHCb Jobs:

Added:
>
>
This sets the scene showing the split between production jobs and user jobs.
 
Line: 988 to 988
 
Changed:
<
<

LHCB_prod Jobs:

>
>

lhcb_prod Jobs:

 
Added:
>
>
This set of plots is for lhcb production only (filtered on lhcb_prod). The first line shows the number of failed lhcb_prod jobs.
 
Line: 999 to 1000
 
Added:
>
>
The rest of these plots are for the failed lhcb_prod jobs only, to diagnose where and why they failed.
 
Line: 1038 to 1041
 
Changed:
<
<

LHCB_user Jobs:

>
>

lhcb_user Jobs:

 
Added:
>
>
This set of plots is for lhcb user only (filtered on lhcb_user). The first line shows the number of failed lhcb_user jobs.
 
Line: 1054 to 1058
 
Added:
>
>
The rest of these plots are for the failed user jobs only, to diagnose where and why they failed.
 
Line: 1095 to 1101
 
Added:
>
>

Data Upload:

This next set of plots shows the results of data upload at each Tier-1 site, listed by source location. Green indicates success, whilst red indicates failure. These plots only show there may be a space problem, but dont tell you which space token

These plots show the data upload results for each space token listed by site.

 

Revision 13 - 2010-09-28 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 975 to 975
 

Development Section for Shifter Plots Sept 2010 (dont rely on for now -Pete Clarke)

Apologies to those who come across this in Sept/Oct 2010. The aim is to eventually gather a set of common production shifter plots.

Changed:
<
<
The mechanism doesnt allow sharing yet. This is a stopgap to see if its feasible to give presenter export strings.

--+++ Data Production Jobs:

>
>
This is a stopgap whilst developing a useful set of plots. It is hoped to be able to share these via presenter in the future.
 
Added:
>
>

All LHCb Jobs:

 
Line: 989 to 988
 
Added:
>
>

LHCB_prod Jobs:

 
Line: 998 to 999
 
Added:
>
>

LHCB_user Jobs:

 
Changed:
<
<
--+++ User Jobs:
>
>

 
Deleted:
<
<
columns is equal .49&refresh is equal 0&plots is equal https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames12:NumberOfJobss9:_groupings16:FinalMinorStatuss13:_timeSelectors6:604800s10:_UserGroups9:lhcb_prods9:_typeNames3:Jobe;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1jk1qwzAQha_iCxQk14mU2QUCgS7SRZwDKNGrGZAto5lA3NNXzipQshh4A-_ni-Lplqd44JvGqoeS7zNPQ5IdXQTluP6AOMIUex7xG5RbYw3YgT3YdvU24K5q79rtpsOEGsZDS9iXQaJYSymInrEOCW9N541ZPXPK2rMmiHW0T6n5yldpPhrXHMIieMF5pRFrqGDORU9hrNGWTvfxivL9s8arUzQU_cdq37F60mXGs-uTagX-AIUTYCk=;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1jk1qwzAQha-iCxQkJ7Wc2QUChS7SRZwDKNGrGZAto5lAnNNHzipQuhiYGd7PF6Wja57iga8a6z6UfJt5GpLs6CwoX-sNiCdMsecRj6DcWGfBHtyB3bZOC_6sn53ftI3DhGrGXUvYl0GiOEcpiJ6wFgl37dbaVTKnrD1rgjhP-5TMd76I-TDOmENY8EbzDiPOUsGcix7DWK0NHW_jBeXnd7VXpWgo-gfV_4fakS4zXlkbqhF4AgrsX5I=;[ampersand];[ampersand];https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j01OAzEMha-SG5DQ6UyaHbT8CAELpl1X7sQajNJkFHukltOTlA0SsLDsZz09f_Zs3ZCi39Agnldux5gfcpqnUER4Hw77uWywuMa6pTgGNq27pwjhBT5S7gWkWJA7h9Fv6YifIHStjUZaIZmmVOlLpMbW0bZ2iRFLOp4kw00e2bMxLgBLjxWFqdWN1bp6ppBkSxKQF517flzf7nf93Zt6SgdWV-pCob4JiuzUBs4_Uf8iZaNdxilleYUjsrGuhCk84TALpagySD3MAll-fdP984x1cp7wErioefgFxll4RQ==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j81OwzAQhF_Fb4DdlCT1DVp-hIADac-VG4_CIteOvBup5elxygUJOKxWMxrNfuu5tX2KfkO9eF7ZHSM_5DSNoYjw3h_2U3FQUsPsUhwCm9reU3ThxX2k3ImTEgE3FtFv6YhPJ7TQRoNWILMsU3YNul6AmqbS2iCitOMk2d3kgT0bY4Nj6TCjMLX1Uus5MoYkW5IArhr7_Li-3e-6uzf1lA6srtQFQn0DFGnUxp1_kv4FykbbjDFleXVHsGltKVM4oZ-EUlTZyXyYxWX5_Uz1zzOtlfOIS2M1F-ILxBJ4Dg==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j91KxDAQhV8lb7DpD902d_6gIurNrteSbY91lmxSMlNYfXqTgCioMDdzGL75zsS9GYOfrmmUiavO3JC37tEeQ9yJlZUdp8ySwwQezDMj3sawLi4t7m08vKwpQYLMOSU_u28I-S8IwFsDP-3phA8rVOtKgwZQvQWladPWdqBmqLt2gM-vcJZoL-LMSasyzrLskE2ZOt32WuebxQXZkzhw25iHu6vLoqM2qgioUkNxUVCb19JCHcOBfwr_5cuVNhFLiPJkT-CqN_fhoHDGuAoFr6KV_D6Ro_zupP_p1Bt5X1CITQbiE2VUh5I=;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j91OxCAQhV-FN1joNpTlzp-oMerNrteGlrHOhoWGmSarTy-QGE3UhAs4Id_5jidjpxT9NU7sSWl7g9GFR3dMec-OVwpUMocBPNDOPhPk25zWJZRHeJvGl7UkUCBzTTHO4RuC8QsCQIOF6A94gg_H2EklAXeA3QBYTq_KtSS6k9oMEGsVnDm7izxT0VI2OOI9VFNCo3sp65clJD4gB6B-ax_uri6bjdiI1i_aCkHNQGxe2whxTCP99P1Ll5S0GZaU-cmdgJSx92kUcIZpZUxRZMe1vpAz_56k_5lkLL8v0IjbCoRP86aHUg==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1jstOxDAMRX8lfzBpp9NHdjwECAGbDmuUtpfiUSapYlca-HqSigUSQvLC98o6PhO3Zgx-uqVRJi5qc0feumd7CrEXKys7Tp0lhwncmVdGvI9hXVwK7mMc3tbUIEHm3JKfHVemJwHAjYGfjnTGlxUqdaFBHahsQGmqElQc0nrYN3UDn-m4SLRXceZkUhhnWXpkOaZaV63W-WZxQY4kDlx25unh5nozUDuVn6rd-6aqTmHg31Y_UlxoE7GEKC_2nFJrHsOgcMG4CgWvopX8g8VG-Suu_xFvjXwu2Ij7DMQ3eDt4Qg==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNptjt1KxDAQhV8lb7Bpu3TT3PmDiqg3Xa8lbY91lmxSMlNYfXqT4oWgMANzDsM538TGjjFMtzTKxFVr7yg4_-xOMfXiZGXP2XPkMYE7-8pI9ymui8_Cf4zD25od5JC5uBRmz3vbkwDgg0WYjnTGlxOqdaVBHag-gPLsm7z5anRrDEIJx0WSu0ozZ5DKesfSo7AxmXavdXlZfJQjiQfXnX16uLne-tVOlUq1e99A1SkO_JvpB4krbROWmOTFnbMy9jEOCheMq1AMKjkpHSwuyV_s9n9sY-VzwRbYlDx8A07yd7A=;&
  -- PaulSzczypka - 14 Aug 2009

Revision 12 - 2010-09-27 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 977 to 977
 Apologies to those who come across this in Sept/Oct 2010. The aim is to eventually gather a set of common production shifter plots. The mechanism doesnt allow sharing yet. This is a stopgap to see if its feasible to give presenter export strings.
Added:
>
>
--+++ Data Production Jobs:

 --+++ User Jobs:

columns is equal .49&refresh is equal 0&plots is equal https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames12:NumberOfJobss9:_groupings16:FinalMinorStatuss13:_timeSelectors6:604800s10:_UserGroups9:lhcb_prods9:_typeNames3:Jobe;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1jk1qwzAQha_iCxQk14mU2QUCgS7SRZwDKNGrGZAto5lA3NNXzipQshh4A-_ni-Lplqd44JvGqoeS7zNPQ5IdXQTluP6AOMIUex7xG5RbYw3YgT3YdvU24K5q79rtpsOEGsZDS9iXQaJYSymInrEOCW9N541ZPXPK2rMmiHW0T6n5yldpPhrXHMIieMF5pRFrqGDORU9hrNGWTvfxivL9s8arUzQU_cdq37F60mXGs-uTagX-AIUTYCk=;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1jk1qwzAQha-iCxQkJ7Wc2QUChS7SRZwDKNGrGZAto5lAnNNHzipQuhiYGd7PF6Wja57iga8a6z6UfJt5GpLs6CwoX-sNiCdMsecRj6DcWGfBHtyB3bZOC_6sn53ftI3DhGrGXUvYl0GiOEcpiJ6wFgl37dbaVTKnrD1rgjhP-5TMd76I-TDOmENY8EbzDiPOUsGcix7DWK0NHW_jBeXnd7VXpWgo-gfV_4fakS4zXlkbqhF4AgrsX5I=;[ampersand];[ampersand];https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j01OAzEMha-SG5DQ6UyaHbT8CAELpl1X7sQajNJkFHukltOTlA0SsLDsZz09f_Zs3ZCi39Agnldux5gfcpqnUER4Hw77uWywuMa6pTgGNq27pwjhBT5S7gWkWJA7h9Fv6YifIHStjUZaIZmmVOlLpMbW0bZ2iRFLOp4kw00e2bMxLgBLjxWFqdWN1bp6ppBkSxKQF517flzf7nf93Zt6SgdWV-pCob4JiuzUBs4_Uf8iZaNdxilleYUjsrGuhCk84TALpagySD3MAll-fdP984x1cp7wErioefgFxll4RQ==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j81OwzAQhF_Fb4DdlCT1DVp-hIADac-VG4_CIteOvBup5elxygUJOKxWMxrNfuu5tX2KfkO9eF7ZHSM_5DSNoYjw3h_2U3FQUsPsUhwCm9reU3ThxX2k3ImTEgE3FtFv6YhPJ7TQRoNWILMsU3YNul6AmqbS2iCitOMk2d3kgT0bY4Nj6TCjMLX1Uus5MoYkW5IArhr7_Li-3e-6uzf1lA6srtQFQn0DFGnUxp1_kv4FykbbjDFleXVHsGltKVM4oZ-EUlTZyXyYxWX5_Uz1zzOtlfOIS2M1F-ILxBJ4Dg==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j91KxDAQhV8lb7DpD902d_6gIurNrteSbY91lmxSMlNYfXqTgCioMDdzGL75zsS9GYOfrmmUiavO3JC37tEeQ9yJlZUdp8ySwwQezDMj3sawLi4t7m08vKwpQYLMOSU_u28I-S8IwFsDP-3phA8rVOtKgwZQvQWladPWdqBmqLt2gM-vcJZoL-LMSasyzrLskE2ZOt32WuebxQXZkzhw25iHu6vLoqM2qgioUkNxUVCb19JCHcOBfwr_5cuVNhFLiPJkT-CqN_fhoHDGuAoFr6KV_D6Ro_zupP_p1Bt5X1CITQbiE2VUh5I=;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j91OxCAQhV-FN1joNpTlzp-oMerNrteGlrHOhoWGmSarTy-QGE3UhAs4Id_5jidjpxT9NU7sSWl7g9GFR3dMec-OVwpUMocBPNDOPhPk25zWJZRHeJvGl7UkUCBzTTHO4RuC8QsCQIOF6A94gg_H2EklAXeA3QBYTq_KtSS6k9oMEGsVnDm7izxT0VI2OOI9VFNCo3sp65clJD4gB6B-ax_uri6bjdiI1i_aCkHNQGxe2whxTCP99P1Ll5S0GZaU-cmdgJSx92kUcIZpZUxRZMe1vpAz_56k_5lkLL8v0IjbCoRP86aHUg==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1jstOxDAMRX8lfzBpp9NHdjwECAGbDmuUtpfiUSapYlca-HqSigUSQvLC98o6PhO3Zgx-uqVRJi5qc0feumd7CrEXKys7Tp0lhwncmVdGvI9hXVwK7mMc3tbUIEHm3JKfHVemJwHAjYGfjnTGlxUqdaFBHahsQGmqElQc0nrYN3UDn-m4SLRXceZkUhhnWXpkOaZaV63W-WZxQY4kDlx25unh5nozUDuVn6rd-6aqTmHg31Y_UlxoE7GEKC_2nFJrHsOgcMG4CgWvopX8g8VG-Suu_xFvjXwu2Ij7DMQ3eDt4Qg==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNptjt1KxDAQhV8lb7Bpu3TT3PmDiqg3Xa8lbY91lmxSMlNYfXqT4oWgMANzDsM538TGjjFMtzTKxFVr7yg4_-xOMfXiZGXP2XPkMYE7-8pI9ymui8_Cf4zD25od5JC5uBRmz3vbkwDgg0WYjnTGlxOqdaVBHag-gPLsm7z5anRrDEIJx0WSu0ozZ5DKesfSo7AxmXavdXlZfJQjiQfXnX16uLne-tVOlUq1e99A1SkO_JvpB4krbROWmOTFnbMy9jEOCheMq1AMKjkpHSwuyV_s9n9sY-VzwRbYlDx8A07yd7A=;&

Revision 11 - 2010-09-27 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 972 to 972
 
XML-RPC
XML Remote Procedure Call
Added:
>
>

Development Section for Shifter Plots Sept 2010 (dont rely on for now -Pete Clarke)

 
Added:
>
>
Apologies to those who come across this in Sept/Oct 2010. The aim is to eventually gather a set of common production shifter plots. The mechanism doesnt allow sharing yet. This is a stopgap to see if its feasible to give presenter export strings.

--+++ User Jobs:

columns is equal .49&refresh is equal 0&plots is equal https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/job#ds9:_plotNames12:NumberOfJobss9:_groupings16:FinalMinorStatuss13:_timeSelectors6:604800s10:_UserGroups9:lhcb_prods9:_typeNames3:Jobe;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1jk1qwzAQha_iCxQk14mU2QUCgS7SRZwDKNGrGZAto5lA3NNXzipQshh4A-_ni-Lplqd44JvGqoeS7zNPQ5IdXQTluP6AOMIUex7xG5RbYw3YgT3YdvU24K5q79rtpsOEGsZDS9iXQaJYSymInrEOCW9N541ZPXPK2rMmiHW0T6n5yldpPhrXHMIieMF5pRFrqGDORU9hrNGWTvfxivL9s8arUzQU_cdq37F60mXGs-uTagX-AIUTYCk=;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1jk1qwzAQha-iCxQkJ7Wc2QUChS7SRZwDKNGrGZAto5lAnNNHzipQuhiYGd7PF6Wja57iga8a6z6UfJt5GpLs6CwoX-sNiCdMsecRj6DcWGfBHtyB3bZOC_6sn53ftI3DhGrGXUvYl0GiOEcpiJ6wFgl37dbaVTKnrD1rgjhP-5TMd76I-TDOmENY8EbzDiPOUsGcix7DWK0NHW_jBeXnd7VXpWgo-gfV_4fakS4zXlkbqhF4AgrsX5I=;[ampersand];[ampersand];https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j01OAzEMha-SG5DQ6UyaHbT8CAELpl1X7sQajNJkFHukltOTlA0SsLDsZz09f_Zs3ZCi39Agnldux5gfcpqnUER4Hw77uWywuMa6pTgGNq27pwjhBT5S7gWkWJA7h9Fv6YifIHStjUZaIZmmVOlLpMbW0bZ2iRFLOp4kw00e2bMxLgBLjxWFqdWN1bp6ppBkSxKQF517flzf7nf93Zt6SgdWV-pCob4JiuzUBs4_Uf8iZaNdxilleYUjsrGuhCk84TALpagySD3MAll-fdP984x1cp7wErioefgFxll4RQ==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j81OwzAQhF_Fb4DdlCT1DVp-hIADac-VG4_CIteOvBup5elxygUJOKxWMxrNfuu5tX2KfkO9eF7ZHSM_5DSNoYjw3h_2U3FQUsPsUhwCm9reU3ThxX2k3ImTEgE3FtFv6YhPJ7TQRoNWILMsU3YNul6AmqbS2iCitOMk2d3kgT0bY4Nj6TCjMLX1Uus5MoYkW5IArhr7_Li-3e-6uzf1lA6srtQFQn0DFGnUxp1_kv4FykbbjDFleXVHsGltKVM4oZ-EUlTZyXyYxWX5_Uz1zzOtlfOIS2M1F-ILxBJ4Dg==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j91KxDAQhV8lb7DpD902d_6gIurNrteSbY91lmxSMlNYfXqTgCioMDdzGL75zsS9GYOfrmmUiavO3JC37tEeQ9yJlZUdp8ySwwQezDMj3sawLi4t7m08vKwpQYLMOSU_u28I-S8IwFsDP-3phA8rVOtKgwZQvQWladPWdqBmqLt2gM-vcJZoL-LMSasyzrLskE2ZOt32WuebxQXZkzhw25iHu6vLoqM2qgioUkNxUVCb19JCHcOBfwr_5cuVNhFLiPJkT-CqN_fhoHDGuAoFr6KV_D6Ro_zupP_p1Bt5X1CITQbiE2VUh5I=;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1j91OxCAQhV-FN1joNpTlzp-oMerNrteGlrHOhoWGmSarTy-QGE3UhAs4Id_5jidjpxT9NU7sSWl7g9GFR3dMec-OVwpUMocBPNDOPhPk25zWJZRHeJvGl7UkUCBzTTHO4RuC8QsCQIOF6A94gg_H2EklAXeA3QBYTq_KtSS6k9oMEGsVnDm7izxT0VI2OOI9VFNCo3sp65clJD4gB6B-ax_uri6bjdiI1i_aCkHNQGxe2whxTCP99P1Ll5S0GZaU-cmdgJSx92kUcIZpZUxRZMe1vpAz_56k_5lkLL8v0IjbCoRP86aHUg==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNp1jstOxDAMRX8lfzBpp9NHdjwECAGbDmuUtpfiUSapYlca-HqSigUSQvLC98o6PhO3Zgx-uqVRJi5qc0feumd7CrEXKys7Tp0lhwncmVdGvI9hXVwK7mMc3tbUIEHm3JKfHVemJwHAjYGfjnTGlxUqdaFBHahsQGmqElQc0nrYN3UDn-m4SLRXceZkUhhnWXpkOaZaV63W-WZxQY4kDlx25unh5nozUDuVn6rd-6aqTmHg31Y_UlxoE7GEKC_2nFJrHsOgcMG4CgWvopX8g8VG-Suu_xFvjXwu2Ij7DMQ3eDt4Qg==;https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/systems/accountingPlots/getPlotImg?file=Z:eNptjt1KxDAQhV8lb7Bpu3TT3PmDiqg3Xa8lbY91lmxSMlNYfXqT4oWgMANzDsM538TGjjFMtzTKxFVr7yg4_-xOMfXiZGXP2XPkMYE7-8pI9ymui8_Cf4zD25od5JC5uBRmz3vbkwDgg0WYjnTGlxOqdaVBHag-gPLsm7z5anRrDEIJx0WSu0ozZ5DKesfSo7AxmXavdXlZfJQjiQfXnX16uLne-tVOlUq1e99A1SkO_JvpB4krbROWmOTFnbMy9jEOCheMq1AMKjkpHSwuyV_s9n9sY-VzwRbYlDx8A07yd7A=;&

  -- PaulSzczypka - 14 Aug 2009

Revision 10 - 2010-06-16 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 250 to 250
 which also includes the minor status of each job category and provides an example JobID for each category. The example JobIDs can then be used to investigate the failures further.
Changed:
<
<
>
>
Beware of failed jobs which have been killed - when a production is complete, the remaining jobs may be automatically killed by DIRAC. Killed jobs like this are ok.
 

Non-Progressing Jobs

Revision 9 - 2009-12-08 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 355 to 355
 The new Grid Shifter should subscribe to the following mailing lists:

Deleted:
<
<
 

Revision 8 - 2009-10-05 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 23 to 23
 A number of quick-reference sections are also available. DIRAC Scripts and Acronyms list the available DIRAC 3 scripts and commonly-used acronyms respectively.
Deleted:
<
<
Link to Jobs here#Jobs

Link to Shifts here#Shifts

Jobs Link

Shifts Link sans hash

 

Grid Sites

Jobs submitted to the Grid will be scheduled to run at one of a number of Grid sites.

Line: 47 to 38
 
  • LCG.NIKHEF.nl
  • LCG.PIC.es
  • RAL.uk
Changed:
<
<
  • LCG.GRIDKA.de
>
>
 

Tier-2 Sites

Line: 60 to 51
 
Backend Storage Tier-1 Site
Castor CERN, CNAF, RAL
Changed:
<
<
dCache IN2P3, NIKHEF, GridKa, PIC
>
>
dCache IN2P3, NIKHEF, GridKa, PIC
 

Jobs

Line: 183 to 175
  The submission of a production can be started once it has been formulated and all the required jobs created. Grid Shifters should ensure they have the permission of the Production Operations Manager (or equivalent) before starting a production.
Changed:
<
<
Production jobs can be submitted manually or automatically.
>
>
Production jobs can be submitted manually or automatically.
  The state of a production can also be set using:
dirac-production-change-status <Command> <Production ID> | <Production ID>
Line: 284 to 276
 

Operations on Productions

Changed:
<
<
All CLI scripts which can be used to manage productions are listed in DIRAC Scripts.
>
>
All CLI scripts which can be used to manage productions are listed in DIRAC Scripts.
 Running a script without arguments will return basic usage notes. In some cases further help is available by running a script with the option "--help".
Line: 320 to 312
 The Production Monitoring Webpage has the capacity to produce various plots. Many of which are extremely useful to monitor the performance of the production system.
Changed:
<
<
Links to useful plots can be found on the DIRAC System Monitoring Pages.
>
>
Links to useful plots can be found on the DIRAC System Monitoring Pages.
 These plots should be monitored at three times daily.

Buglist and Feature Request

Changed:
<
<
The procedure to submit a bug report or a feature request is outlined in [[][Section 9]].
>
>
The procedure to submit a bug report or a feature request is outlined in Procedures.
 
Line: 337 to 329
  The new shifter should:
Changed:
<
<
  • Ensure their Grid certificate is valid for all expected duties (Sec. 6.1.2).
  • Create accounts on all relevant web-resources (Sec. 6.1.2).
  • Subscribe to the relevant mailing lists (Sec. 6.1.3).
>
>
 

Grid Certificates

Line: 362 to 354
  The new Grid Shifter should subscribe to the following mailing lists:
Changed:
<
<
  • lhcb-datacrash [11,4,12].
  • lhcb-dirac-developers [12].
  • lhcb-dirac [12].
  • lhcb-production [12].
>
>
 
Changed:
<
<
Note that both the lhcb-datacrash and lhcb-production mailing lists recieve a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.
>
>
Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.
 

Production Operations Key

Line: 395 to 388
 

Performance Monitoring

Changed:
<
<
Grid Shifters should view the plots accessible via the DIRACSystemMonitoring page [8] at least three times a day and investigate any unusual features present.
>
>
Grid Shifters should view the plots accessible via the DIRACSystemMonitoring page at least three times a day and investigate any unusual features present.
 

Production Operations Meeting

Changed:
<
<
A Production Operations Meeting takes place at the end of the morning shift and allows the morning Grid Shifter to highlight any recent or outstanding issues. Both the morning and afternoon Grid Shifter should attend. The morning Grid Shifter should give a report summerising the morning activities.
>
>
A Production Operations Meeting takes place at the end of the morning shift and allows the morning Grid Shifter to highlight any recent or outstanding issues. Both the morning and afternoon Grid Shifter should attend. The morning Grid Shifter should give a report summarising the morning activities.
  The Grid Shifter's report should contain:
Changed:
<
<
  • Current production progress, jobs submitted, wating etc.
>
>
  • Current production progress, jobs submitted, waiting etc.
 
  • Status of all Tier1 sites.
  • Recently observed failures, paying particular attention to previously-unknown problems.
Line: 431 to 423
  Unsubscribe from the following mailing lists:
Changed:
<
<
  • lhcb-datacrash [11,4,13].
  • lhcb-dirac-developers [13].
  • lhcb-dirac [13].
  • lhcb-production [12].
>
>
  • lhcb-datacrash.
  • lhcb-dirac-developers.
  • lhcb-dirac.
  • lhcb-production.
 

Miscellaneous

Line: 447 to 439
 The report should contain information on all the processed production and user jobs, the respective failure rates and some basic analysis of the results. The report should be compiled on the last day of the shift and contain information about the previous seven full days of operation, i.e. it should not include information from the day the report is compiled.
Changed:
<
<
The weekly reports are to be uploaded to the Weekly Reports Page [16] on the LHCb Computing tWiki. Grid Shifters should use the template [17] provided when compiling a report.
>
>
The weekly reports are to be uploaded to the Weekly Reports Page on the LHCb Computing tWiki. Grid Shifters should use the template provided when compiling a report.
 

Base Plots

Line: 457 to 448
 The following plots should always be included in the report:
  • Total number of Jobs by Final Major Status
  • Daily number of Jobs by Final Major Status
Changed:
<
<
  • Done—Completed Jobs by User Group
  • Done—Completed Production Jobs by JobType
>
>
  • Done - Completed Jobs by User Group
  • Done - Completed Production Jobs by Job Type
 
  • Failed Jobs by User Group
  • Failed Production Jobs by Minor Status
  • Failed User Jobs by Minor Status
Changed:
<
<
  • Done—Completed Production Jobs by Site
  • Done—Completed User Jobs by Site
From these plots the Grid Shifter should then create a number of further plots to analyse the causes and execution locations of failed jobs.
>
>
  • Done - Completed Production Jobs by Site
  • Done - Completed User Jobs by Site

From these plots the Grid Shifter should then create a number of further plots to analyse the causes and execution locations of failed jobs.

 

Specific Plots

Changed:
<
<
On analysis of the failed jobs, the Grid Shifter should produce plots of the breakdown by site of all failed jobs with the three or four main job “MinorStatus” results. • Failed Production Jobs by FinalMinorStatus – Failed Production Jobs (FinalMinorStatus 1) by Site – Failed Production Jobs (FinalMinorStatus 2) by Site – Failed Production Jobs (FinalMinorStatus 3) by Site – . . . • Failed User Jobs by FinalMinorStatus – Failed User Jobs (FinalMinorStatus 1) by Site – Failed User Jobs (FinalMinorStatus 2) by Site – Failed User Jobs (FinalMinorStatus 3) by Site – . . .
>
>
On analysis of the failed jobs, the Grid Shifter should produce plots of the breakdown by site of all failed jobs with the three or four main job “MinorStatus” results.

 

Machine Monitoring Plots

Changed:
<
<
Monitoring of the LHCb VO boxes is vital to maintaining the effcient running of all Grid operations. Particular attention should be paid to the used and free space on the various
>
>
 Monitoring of the LHCb VO boxes is vital to maintaining the efficient running of all Grid operations. Particular attention should be paid to the used and free space on the various
 disks, network and CPU usage. Reports on the state of the following boxes should be constructed:
  • vobox01
Line: 501 to 491
 
  • Network utilization
  • Partition Used
  • Swap Used
Changed:
<
<
Note: Mac users may find that the suggested name when saving the plots does not follow the format “*.gif.png” and they should take care to either rename the saved files or edit that week’s report page accordingly.
>
>
Note: Mac users may find that the suggested name when saving the plots does not follow the format “*.gif.png” and they should take care to either rename the saved files or edit that week’s report page accordingly.
 

Analysis and Summary

Line: 539 to 527
  Once a problem has been logged it is useful to report the continuing status of the affected productions at the end of each shift.
Changed:
<
<
If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG following the format outlined in (Sec. 8.1.1).
>
>
If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG following the format outlined in the ELOG.
 

When to Submit an ELOG

Changed:
<
<
Submit an elog in the following situations:
>
>
Submit an ELOG in the following situations:
 
  • Jobs finalise with exceptions.
Line: 605 to 593
 
  • Was there an announcement of downtime for the site?
  • Is the problem specific to a single site?
    • Are all the CE’s at the site affected?
Changed:
<
<
  • Is the problem systamatic across sites with different backend storage technologies? (Sec. 3.3)
>
>
  • Is the problem systematic across sites with different backend storage technologies?
 
  • Is the problem specific to an SE?
    • Are there any stalled jobs at the site clustered in time?
    • Are other jobs successfully reading data from the SE?
Line: 625 to 613
  Once the user has prepared all the relevant information, they should:
Changed:
<
<
>
>
 

Revision 7 - 2009-10-05 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 190 to 190
 where the available commands are:
'start', 'stop', 'manual', 'automatic'
Added:
>
>
Remember to validate a production before setting it to automatic! Run a few jobs to ensure they complete successfully before launching the whole production and setting it to automatic submission.
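
For example, a sketch of the usual sequence with the command above (the production ID 00001234 is purely illustrative):

dirac-production-change-status start 00001234        # start submission for production 00001234
dirac-production-change-status automatic 00001234    # switch to automatic submission once the validation jobs complete successfully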
 

Starting and Stopping a Production

The commands:

Line: 269 to 271
 One of the most common reasons is that a site is due to enter scheduled downtime and is no longer submitting jobs to the batch queues. Jobs will stay at the site in a ``Waiting'' state and report that there are no CE's available. Multiple jobs in this state should be reported.
Added:
>
>

Merging Productions

Each MC Production should have an associated Merging Production which merges the output files together into more manageable file sizes. Ensure that the number of files available to the Merging Production increases in proportion to the number of successful jobs of the MC Production. If the number of files does not increase, this can point to a problem in the Bookkeeping which should be reported.

 

Ending a Production

Ending a completed production is handled by the Productions Operations Manager (or equivalent).

Revision 6 - 2009-09-02 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 11 to 11
 This document describes the frequently-used tools and procedures available to Grid Shifters when managing production activities. It is expected that the current Grid Shifters should update this document by incorporating or linking to the stable procedures available on the LHCbProduction TWiki pages [1] when appropriate.
Changed:
<
<
The grid sites section gives some brief information about the various Grid sites and their backend storage systems. The jobs section details the jobs types a Grid Shifter is expected to encounter and provides some debugging methods.
>
>
The Grid Sites section gives some brief information about the various Grid sites and their backend storage systems. The Jobs section details the job types a Grid Shifter is expected to encounter and provides some debugging methods.
 The methods available to manage and monitor productions are described in the Productions section. The Web Production Monitor section describes the main features of the Production Monitor webpage.
Line: 40 to 40
 

Tier-1 Sites

Changed:
<
<
Tier-1 sites are used for Analysis, Monte Carlo production, file transfer and file storage in the LHCb Computing Model.
>
>
Tier-1 sites are used for Analysis, Monte Carlo production, file transfer and file storage in the LHCb Computing Model.
 
  • LCG.CERN.ch
  • LCG.CNAF.it
  • IN2P3.fr
Line: 71 to 71
 A particular job is tagged with the following information:

  • Production Identifier (ProdID), e.g. 00001234 - the 1234$^{th}$ production.
Changed:
<
<
  • Job Idetifier (JobID), e.g. 9876 - the 9876$^{th}$ job in the DIRAC system.
  • JobName, e.g. 00001234_00000019 - the 19$^{th}$ job in production 00001234.
>
>
  • Job Identifier (JobID), e.g. 9876 - the 9876th job in the DIRAC system.
  • JobName, e.g. 00001234_00000019 - the 19th job in production 00001234.
 

Job Status

Line: 182 to 182
 

Starting a Production

The submission of a production can be started once it has been formulated and all the required jobs created.

Changed:
<
<
Grid Shifters should ensure they have the permission of the Production Operations Manager (or equivalent) before starting a production (Sec. 4.1.1). Production jobs can be submitted manually (Sec. 4.1.2) or automatically (Sec. 4.1.3).
>
>
Grid Shifters should ensure they have the permission of the Production Operations Manager (or equivalent) before starting a production. Production jobs can be submitted manually or automatically.
  The state of a production can also be set using:
dirac-production-change-status <Command> <Production ID> | <Production ID>
Line: 845 to 845
 

Common Acronyms

Changed:
<
<
  • ACL Access Control Lists
  • API Application Programming Interface
  • ARC Advance Resource Connector
  • ARDA A Realisation of Distributed Analysis
  • BDII Berkeley Database Information Index
  • BOSS Batch Object Submission System
  • CA Certification Authority
  • CAF CDF Central Analysis Farm
  • CCRC Common Computing Readiness Challenge
  • CDF Collider Detector at Fermilab
  • CE Computing Element
  • CERN Organisation Européenne pour la Recherche Nucléaire: Switzerland/France
  • CNAF Centro Nazionale per la Ricerca e Svilupponelle Tecnologie Informatiche e Telematiche: Italy
  • ConDB Conditions Database
  • CPU Central Processing Unit
  • CRL Certifcate Revocation List
  • CS Confguration Service
  • DAG Directed Acyclic Graph
  • DC04 Data Challenge 2004
  • DC06 Data Challenge 2006
  • DCAP Data Link Switching Client Access Protocol
  • DIAL Distributed Interactive Analysis of Large datasets
  • DIRAC Distributed Infrastructure with Remote Agent Control
  • DISET DIRAC Secure Transport
  • DLI Data Location Interface
  • DLLs Dynamically Linked Libraries
  • DN Distinguished Name
  • DNS Domain Name System
  • DRS Data Replication Service
  • DST Data Summary Tape
  • ECAL Electromagnetic CALorimeter
  • EGA Enterprise Grid Alliance
  • EGEE Enabling Grids for E-sciencE
  • ELOG Electronic Log
  • ETC Event Tag Collection
  • FIFO First In First Out
  • FTS File Transfer Service
  • GASS Global Access to Secondary Storage
  • GFAL Grid File Access Library
  • GGF Global Grid Forum
  • GIIS Grid Index Information Service
  • GLUE Grid Laboratory Uniform Environment
  • GRAM Grid Resource Allocation Manager
  • GridFTP Grid File Transfer Protocol
  • GridKa Grid Computing Centre Karlsruhe
  • GriPhyN Grid Physics Network
  • GRIS Grid Resource Information Server
  • GSI Grid Security Infrastructure
  • GT Globus Toolkit
  • GUI Graphical User Interface
  • GUID Globally Unique IDentifer
  • HCAL Hadron CALorimeter
  • HEP High Energy Physics
  • HLT High Level Trigger
  • HTML Hyper-Text Markup Language
  • HTTP Hyper-Text Transfer Protocol
  • I/O Input/Output
  • IN2P3 Institut National de Physique Nucleaire et de Physique des Particules: France
  • iVDGL International Virtual Data Grid Laboratory
  • JDL Job Description Language
  • JobDB Job Database
  • JobID Job Identifer
  • L0 Level 0
  • LAN Local Area Network
  • LCG LHC Computing Grid
  • LCG IS LCG Information System
  • LCG UI LCG User Interface
  • LCG WMS LCG Workload Management System
  • LDAP Lightweight Directory Access Protocol
  • LFC LCG File Catalogue
  • LFN Logical File Name
  • LHC Large Hadron Collider
  • LHCb Large Hadron Collider beauty
  • LSF Load Share Facility
  • MC Monte Carlo
  • MDS Monitoring and Discovery Service
  • MSS Mass Storage System
  • NIKHEF National Institute for Subatomic Physics: Netherlands
  • OGSA Open Grid Services Architecture
  • OGSI Open Grid Services Infrastructure
  • OSG Open Science Grid
  • P2P Peer-to-peer
  • Panda Production ANd Distributed Analysis
  • PC Personal Computer
  • PDC1 Physics Data Challenge
  • PFN Physical File Name
  • PIC Port d’Informació Cientfca: Spain
  • PKI Public Key Infrastructure
  • POOL Pool Of persistent Ob jects for LHC
  • POSIX Portable Operating System Interface
  • PPDG Particle Physics Data Grid
  • ProdID Production Identifer
  • PS Preshower Detector
  • R-GMA Relational Grid Monitoring Architecture
  • RAL Rutherford-Appleton Laboratory: UK
  • RB Resource Broker
  • rDST reduced Data Summary Tape
  • RFIO Remote File Input/Output
  • RICH Ring Imaging CHerenkov
  • RM Replica Manager
  • RPC Remote Procedure Call
  • RTTC Real Time Trigger Challenge
  • SAM Service Availability Monitoring
  • SE Storage Element
  • SOA Service Oriented Architecture
  • SOAP Simple Ob ject Access Protocol
  • SPD Scintillator Pad Detector
  • SRM Storage Resource Manager
  • SSL Secure Socket Layer
  • SURL Storage URL
  • TCP/IP Transmission Control Protocol / Internet Protocol
  • TDS Transient Detector Store
  • TES Transient Event Store
  • THS Transient Histogram Store
  • TT Trigger Tracker
  • TURL Transport URL
  • URL Uniform Resource Locator
  • VDT Virtual Data Toolkit
  • VELO VErtex LOcator
  • VO Virtual Organisation
  • VOMS Virtual Organisation Membership Service
  • WAN Wide Area Network
  • WMS Workload Management System
  • WN Worker Node
  • WSDL Web Services Description Language
  • WSRF Web Services Resource Framework
  • WWW World Wide Web
  • XML eXtensible Markup Language
  • XML-RPC XML Remote Procedure Call
>
>
ACL
Access Control Lists
API
Application Programming Interface
ARC
Advance Resource Connector
ARDA
A Realisation of Distributed Analysis
BDII
Berkeley Database Information Index
BOSS
Batch Object Submission System
CA
Certification Authority
CAF
CDF Central Analysis Farm
CCRC
Common Computing Readiness Challenge
CDF
Collider Detector at Fermilab
CE
Computing Element
CERN
Organisation Européenne pour la Recherche Nucléaire: Switzerland/France
CNAF
Centro Nazionale per la Ricerca e Sviluppo nelle Tecnologie Informatiche e Telematiche: Italy
ConDB
Conditions Database
CPU
Central Processing Unit
CRL
Certificate Revocation List
CS
Configuration Service
DAG
Directed Acyclic Graph
DC04
Data Challenge 2004
DC06
Data Challenge 2006
DCAP
Data Link Switching Client Access Protocol
DIAL
Distributed Interactive Analysis of Large datasets
DIRAC
Distributed Infrastructure with Remote Agent Control
DISET
DIRAC Secure Transport
DLI
Data Location Interface
DLLs
Dynamically Linked Libraries
DN
Distinguished Name
DNS
Domain Name System
DRS
Data Replication Service
DST
Data Summary Tape
ECAL
Electromagnetic CALorimeter
EGA
Enterprise Grid Alliance
EGEE
Enabling Grids for E-sciencE
ELOG
Electronic Log
ETC
Event Tag Collection
FIFO
First In First Out
FTS
File Transfer Service
GASS
Global Access to Secondary Storage
GFAL
Grid File Access Library
GGF
Global Grid Forum
GIIS
Grid Index Information Service
GLUE
Grid Laboratory Uniform Environment
GRAM
Grid Resource Allocation Manager
GridFTP
Grid File Transfer Protocol
GridKa
Grid Computing Centre Karlsruhe
GriPhyN
Grid Physics Network
GRIS
Grid Resource Information Server
GSI
Grid Security Infrastructure
GT
Globus Toolkit
GUI
Graphical User Interface
GUID
Globally Unique IDentifier
HCAL
Hadron CALorimeter
HEP
High Energy Physics
HLT
High Level Trigger
HTML
Hyper-Text Markup Language
HTTP
Hyper-Text Transfer Protocol
I/O
Input/Output
IN2P3
Institut National de Physique Nucleaire et de Physique des Particules: France
iVDGL
International Virtual Data Grid Laboratory
JDL
Job Description Language
JobDB
Job Database
JobID
Job Identifier
L0
Level 0
LAN
Local Area Network
LCG
LHC Computing Grid
LCG IS
LCG Information System
LCG UI
LCG User Interface
LCG WMS
LCG Workload Management System
LDAP
Lightweight Directory Access Protocol
LFC
LCG File Catalogue
LFN
Logical File Name
LHC
Large Hadron Collider
LHCb
Large Hadron Collider beauty
LSF
Load Share Facility
MC
Monte Carlo
MDS
Monitoring and Discovery Service
MSS
Mass Storage System
NIKHEF
National Institute for Subatomic Physics: Netherlands
OGSA
Open Grid Services Architecture
OGSI
Open Grid Services Infrastructure
OSG
Open Science Grid
P2P
Peer-to-peer
Panda
Production ANd Distributed Analysis
PC
Personal Computer
PDC1
Physics Data Challenge
PFN
Physical File Name
PIC
Port d’Informació Científica: Spain
PKI
Public Key Infrastructure
POOL
Pool Of persistent Objects for LHC
POSIX
Portable Operating System Interface
PPDG
Particle Physics Data Grid
ProdID
Production Identifier
PS
Preshower Detector
R-GMA
Relational Grid Monitoring Architecture
RAL
Rutherford-Appleton Laboratory: UK
RB
Resource Broker
rDST
reduced Data Summary Tape
RFIO
Remote File Input/Output
RICH
Ring Imaging CHerenkov
RM
Replica Manager
RPC
Remote Procedure Call
RTTC
Real Time Trigger Challenge
SAM
Service Availability Monitoring
SE
Storage Element
SOA
Service Oriented Architecture
SOAP
Simple Object Access Protocol
SPD
Scintillator Pad Detector
SRM
Storage Resource Manager
SSL
Secure Socket Layer
SURL
Storage URL
TCP/IP
Transmission Control Protocol / Internet Protocol
TDS
Transient Detector Store
TES
Transient Event Store
THS
Transient Histogram Store
TT
Trigger Tracker
TURL
Transport URL
URL
Uniform Resource Locator
VDT
Virtual Data Toolkit
VELO
VErtex LOcator
VO
Virtual Organisation
VOMS
Virtual Organisation Membership Service
WAN
Wide Area Network
WMS
Workload Management System
WN
Worker Node
WSDL
Web Services Description Language
WSRF
Web Services Resource Framework
WWW
World Wide Web
XML
eXtensible Markup Language
XML-RPC
XML Remote Procedure Call
 

Revision 5 - 2009-09-02 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 34 to 36
  Jobs submitted to the Grid will be scheduled to run at one of a number of Grid sites. The exact site at which a job is executed depends on the job requirements and the current status of all relevant grid sites. Grid sites are grouped into two tiers, Tier-1 and Tier-2.
Changed:
<
<
Cern is an exception, because it is also responsible for processing and archiving the RAW experimental data it is also referred to as a Tier-0 site.
>
>
CERN is an exception: because it is also responsible for processing and archiving the RAW experimental data, it is also referred to as a Tier-0 site.
 

Tier-1 Sites

Line: 110 to 112
 

Job Output via the Job Monitoring Webpage

Changed:
<
<
There are two methods to view the output of a job via the https://lhcbweb.pic.es/DIRAC/jobs/JobMonitor/displayJob Monitoring Webpage [3]. The first returns the last 20 lines of the std.out and the second allows the Grid Shifter to view all the output files.
>
>
There are two methods to view the output of a job via the Job Monitoring Webpage. The first returns the last 20 lines of the std.out and the second allows the Grid Shifter to view all the output files.
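
The same output can usually also be reached from the command line; the sketch below assumes the standard DIRAC job scripts (not listed on this page) are available, with 9876 the example JobID used earlier in this guide:

dirac-wms-job-peek 9876          # print the std.out reported by the job
dirac-wms-job-get-output 9876    # download the full output sandbox of the job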
 
Line: 165 to 167
 

Operations on Jobs

Changed:
<
<
The full list of scripts which can be used to perform operations on a job is given in (App. A.19).
>
>
The full list of scripts which can be used to perform operations on a job is given in DIRAC Scripts.
 The name of each script should be a clear indication of its purpose. Running a script without arguments will print basic usage notes.
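
As a sketch of typical usage (the scripts below are standard DIRAC job commands assumed to be available; 9876 is the example JobID from the JobIDs section):

dirac-wms-job-status 9876          # current major and minor status of the job
dirac-wms-job-logging-info 9876    # full status history, useful for jobs which are not progressing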
Line: 216 to 218
 When started, a production set to automatic submission will submit all jobs in the production in quick succession.

A production can be set to automatic submission once you are satisfied that there are no specific problems with the production jobs. To set a production to automatic submission use:

Changed:
<
<
dirac-production-set-automatic <Production ID> \vert<Production ID>
>
>
dirac-production-set-automatic <Production ID> | <Production ID>
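
For example (illustrative production ID):

dirac-production-set-automatic 00001234    # all remaining jobs of production 00001234 will then be submitted without further intervention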
 

Monitoring a Production

Line: 240 to 242
 

Failed Jobs

A Grid Shifter should monitor a production for failed jobs and jobs which are not progressing.

Changed:
<
<
Due to the various configurations of all the sites it is occasionally not possible for an email to be sent to the https://mmm.cern.ch/public/archive-list/l/lhcb-datacrash/lhcb-datacrash [4] mailing list for each failed job.
>
>
Due to the various configurations of all the sites it is occasionally not possible for an email to be sent to the lhcb-datacrash mailing list for each failed job.
 It is therefore not enough to simply rely on the number of lhcb-datacrash emails to indicate if there are any problems with a production.
Changed:
<
<
In addition to any lhcb-datacrash notifications, the Grid Shifter should also check the number of failed jobs in a production via the CLI or the https://lhcbweb.pic.es/DIRAC/jobs/ProductionMonitor/displayProduction Monitoring Webpage [5].
>
>
In addition to any lhcb-datacrash notifications, the Grid Shifter should also check the number of failed jobs in a production via the CLI or the Production Monitoring Webpage.
  Using the CLI, the command:
dirac-production-progress [<Production ID>]
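
For example, to check a single production (the ID is illustrative):

dirac-production-progress 00001234    # summary of the job states for production 00001234, including the number of failed jobs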
Line: 269 to 271
 

Ending a Production

Changed:
<
<
Ending a completed production is handeled by the Productions Operations Manager (or equivalent).
>
>
Ending a completed production is handled by the Productions Operations Manager (or equivalent).
 No action is required on the part of the Grid Shifter.

Operations on Productions

Changed:
<
<
All CLI scripts which can be used to manage productions are listed in (App. A.16).
>
>
All CLI scripts which can be used to manage productions are listed in DIRAC Scripts.
 Running a script without arguments will return basic usage notes. In some cases further help is available by running a script with the option "--help".

Web Production Monitor

Changed:
<
<
Production monitoring via the web is possible through the https://lhcbweb.pic.es/DIRAC/jobs/ProductionMonitor/displayProduction Monitoring Webpage [5].
>
>
Production monitoring via the web is possible through the Production Monitoring Webpage.
 A valid grid certificate loaded into your browser is required to use the webpage.

Features

The Production Monitoring Webpage has the following features:

Changed:
<
<
>
>
 

Site Downtime Calendar

Line: 307 to 309
 

Plots

Added:
>
>
The Production Monitoring Webpage has the capacity to produce various plots. Many of which are extremely useful to monitor the performance of the production system.
 
Added:
>
>
Links to useful plots can be found on the DIRAC System Monitoring Pages. These plots should be monitored at three times daily.
 

Buglist and Feature Request

Changed:
<
<
The procedure to submit a bug report or a feature request is outlined in section 8.
>
>
The procedure to submit a bug report or a feature request is outlined in [[][Section 9]].
 
Line: 331 to 337
 

Grid Certificates

A Grid certificate is mandatory for Grid Shifters.

Changed:
<
<
If you don't have a certificate you should register for one through http://lcg.web.cern.ch/lcg/users/registration/registration.html CERN LCG [8] and apply to join the LHCb Virtual Organisation (VO).
>
>
If you don't have a certificate you should register for one through CERN LCG and apply to join the LHCb Virtual Organisation (VO).
 
Changed:
<
<
To access the https://lhcbweb.pic.es/DIRAC/jobs/ProductionMonitor/displayproduction monitoring webpages [5] you will also need to load your certificate into your browser. Detailed instructions on how to do this can be found on the http://lcg.web.cern.ch/lcg/users/registration/load-cert.html CERN LCG pages [9].
>
>
To access the production monitoring webpages you will also need to load your certificate into your browser. Detailed instructions on how to do this can be found on the CERN LCG pages.
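
Beyond the browser, the same certificate is used to create the Grid proxy required by the DIRAC command-line scripts referenced throughout this guide. A minimal sketch, assuming the standard DIRAC proxy commands are available in the shifter environment:

dirac-proxy-init    # create a Grid proxy from the certificate
dirac-proxy-info    # confirm the proxy is valid and check its remaining lifetime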
 

Web Resources

Primary web-based resources for DIRAC 3 production shifts:

Changed:
<
<
>
>
 

Mailing Lists

Line: 381 to 387
 

Performance Monitoring

Changed:
<
<
>
>
Grid Shifters should view the plots accessible via the DIRACSystemMonitoring page [8] at least three times a day and investigate any unusual features present.
 

Production Operations Meeting

Line: 427 to 434
 Return the key for the Production Operations Room (TCE5) to the secretariat or the next Grid Shifter.

Weekly Report

Added:
>
>
A weekly report should be prepared by the Grid Shifter at the end of each week. The report should contain information on all the processed production and user jobs, the respective failure rates and some basic analysis of the results. The report should be compiled on the last day of the shift and contain information about the previous seven full days of operation, i.e. it should not include information from the day the report is compiled.

The weekly reports are to be uploaded to the Weekly Reports Page [16] on the LHCb Computing tWiki. Grid Shifters should use the template [17] provided when compiling a report.

 

Base Plots

Added:
>
>
The following plots should always be included in the report:
  • Total number of Jobs by Final Major Status
  • Daily number of Jobs by Final Major Status
  • Done—Completed Jobs by User Group
  • Done—Completed Production Jobs by JobType
  • Failed Jobs by User Group
  • Failed Production Jobs by Minor Status
  • Failed User Jobs by Minor Status
  • Done—Completed Production Jobs by Site
  • Done—Completed User Jobs by Site
From these plots the Grid Shifter should then create a number of further plots to analyse the causes and execution locations of failed jobs.
 

Specific Plots

Added:
>
>

On analysis of the failed jobs, the Grid Shifter should produce plots of the breakdown by site of all failed jobs with the three or four main job “MinorStatus” results.
  • Failed Production Jobs by FinalMinorStatus
    • Failed Production Jobs (FinalMinorStatus 1) by Site
    • Failed Production Jobs (FinalMinorStatus 2) by Site
    • Failed Production Jobs (FinalMinorStatus 3) by Site
    • ...
  • Failed User Jobs by FinalMinorStatus
    • Failed User Jobs (FinalMinorStatus 1) by Site
    • Failed User Jobs (FinalMinorStatus 2) by Site
    • Failed User Jobs (FinalMinorStatus 3) by Site
    • ...

 

Machine Monitoring Plots

Added:
>
>
Monitoring of the LHCb VO boxes is vital to maintaining the effcient running of all Grid operations. Particular attention should be paid to the used and free space on the various disks, network and CPU usage. Reports on the state of the following boxes should be constructed:
  • vobox01
  • vobox06
  • vobox09
  • vobox10

For each machine, save and then upload the plots for:

  • CPU utilization
  • Network utilization
  • Partition Used
  • Swap Used
Note: Mac users may find that the suggested name when saving the plots does not follow the format “*.gif.png” and they should take care to either rename the saved files or edit that week’s report page accordingly.
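If interactive shell access to the VO boxes is available (an assumption; the monitoring plots remain the primary reference for the weekly report), the same quantities can be spot-checked with standard Linux commands, for example:

df -h                    # used and free space on each partition
free -m                  # memory and swap usage
top -b -n 1 | head -20   # one-shot snapshot of CPU load and the busiest processes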
 

Analysis and Summary

Added:
>
>
A summary of each group of plots should be written to aid the next Grid Shifter’s appraisal of the current situation and to enable the Grid Expert on duty to investigate problems further.
 

ELOG

Changed:
<
<
All Grid Shifter actions of note should be recorded in the http://lblogbook.cern.ch/Operations/ELOG [10].
>
>
All Grid Shifter actions of note should be recorded in the ELOG.
This has the benefit of allowing new Grid Shifters to familiarise themselves with recent problems affecting current productions. ELOG entries should contain as much relevant information as possible.
Line: 460 to 531
  Once a problem has been logged it is useful to report the continuing status of the affected productions at the end of each shift.
Changed:
<
<
If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG following the format outlined in (Sec. 7.1.1).
>
>
If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG following the format outlined in Sec. 8.1.1.
 

When to Submit an ELOG

Line: 471 to 542
 

Exceptions

Changed:
<
<
Jobs which finalise with an exception should be noted in the ELOG. The ELOG entry should contain:
>
>
Jobs which finalise with an exception should be noted in the ELOG. The ELOG entry should contain:
 
  • The production ID.
  • An example job ID.
Line: 485 to 557
 

Datacrash Emails

Changed:
<
<
The Grid Shifter should filter the datacrash emails and determine if the crash reported is actually due to one of the applications.
>
>
The Grid Shifter should filter the datacrash emails and determine if the crash reported is actually due to one of the applications.
 If so, then the Grid Shifter should submit an ELOG describing the problem and including an example error message. The Grid Shifter should ensure the “Applications” radio button is selected when submitting the ELOG report since this means that the relevant experts will be alerted to the problem.
Line: 504 to 576
  Once a problem has been discovered it is important to assess the severity of the problem. Section 9.1.1 provides a checklist which the Grid Shifter should go through after discovering a problem.
Changed:
<
<
Additionally, there are a number of Grid-specific issues to consider (Sec. 9.1.2).
>
>
Additionally, there are a number of Grid-specific issues to consider (Sec. 9.1.2).
 

Standard Checklist

Line: 523 to 595
 

Grid-Specific Issues

  • Was there an announcement of downtime for the site?
Changed:
<
<
  • Is the problem specific to a single site?
>
>
  • Is the problem specific to a single site?
 
    • Are all the CEs at the site affected?
  • Is the problem systematic across sites with different backend storage technologies? (Sec. 3.3)
Changed:
<
<
  • Is the problem specific to an SE?
>
>
  • Is the problem specific to an SE?
 
    • Are there any stalled jobs at the site clustered in time?
    • Are other jobs successfully reading data from the SE?
Line: 539 to 611
  * Record all relevant information.
  * Identify a use-case for the new feature.
Changed:
<
<
Image savannah_support_browse
>
>
 Figure 5: Browse current support issues.

Once the user has prepared all the relevant information, they should:

Line: 547 to 620
 
Changed:
<
<
Image savannah_support_submit
>
>
 Figure 6: Savannah support submit.
Changed:
<
<
Image savannah_support_example
>
>
 Figure 7: Savannah support submit feature request.

Assuming the feature request has not been previously submitted, the user should then:

Line: 572 to 647
  Once the user is convinced that the behaviour they are experiencing is a bug, they should then prepare to submit a bug report. Users should:
Changed:
<
<
>
>

 
Deleted:
<
<
Image savannah_bugs
 Figure 8: Browse current bugs.
Line: 588 to 664
 
  • Set the privacy option to ``private''.
  • Submit the bug report.
Changed:
<
<
Image savannah_bugs_example
>
>
 Figure 9: Example bug report.
Line: 602 to 679
  Grid Shifter actions:
Changed:
<
<
>
>
  • Submit an ELOG report listing the affected productions and sites.
 
  • Ban the relevant sites until they pass their SAM tests.
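As an illustration, a problematic site can be removed from and later restored to the site mask with the admin scripts listed in the next section (the site name below is only an example; check the exact arguments by running each script without parameters):

dirac-admin-ban-site LCG.PIC.es      # remove the site from the mask
dirac-admin-get-banned-sites         # confirm the current list of banned sites
dirac-admin-allow-site LCG.PIC.es    # restore the site once it passes its SAM tests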

DIRAC 3 Scripts

Line: 654 to 731
 

DIRAC DMS

Added:
>
>
  • dirac-dms-add-file
  • dirac-dms-get-file
  • dirac-dms-lfn-accessURL
  • dirac-dms-lfn-logging-info
  • dirac-dms-lfn-metadata
  • dirac-dms-lfn-replicas
  • dirac-dms-pfn-metadata
  • dirac-dms-pfn-accessURL
  • dirac-dms-remove-pfn
  • dirac-dms-remove-lfn
  • dirac-dms-replicate-lfn
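A minimal sketch of typical data-management checks on a single file, using the placeholder style adopted elsewhere in this guide (run any script without arguments for its usage notes):

dirac-dms-lfn-replicas <LFN>     # list the SEs holding replicas of the file
dirac-dms-lfn-metadata <LFN>     # size, checksum and status of the catalogue entry
dirac-dms-lfn-accessURL <LFN>    # obtain a URL with which the file can be read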

 

DIRAC Embedded

Added:
>
>
  • dirac-embedded-external
 

DIRAC External

Added:
>
>
  • dirac-external
 

DIRAC Fix

Added:
>
>
  • dirac-fix-ld-library-path

 

DIRAC Framework

Added:
>
>
  • dirac-framework-ping-service

 

DIRAC Functions

Added:
>
>
  • dirac-functions.sh
 

DIRAC Group

Added:
>
>
  • dirac-group-init
 

DIRAC Jobexec

Added:
>
>
  • dirac-jobexec
 

DIRAC LHCb

Added:
>
>
  • dirac-lhcb-job-replica
  • dirac-lhcb-manage-software
  • dirac-lhcb-production-job-check
  • dirac-lhcb-sam-submit-all
  • dirac-lhcb-sam-submit-ce
 

DIRAC Myproxy

Added:
>
>
  • dirac-myproxy-upload
 

DIRAC Production

Added:
>
>

  • dirac-production-application-summary
  • dirac-production-change-status
  • dirac-production-job-summary
  • dirac-production-list-active
  • dirac-production-list-all
  • dirac-production-list-id
  • dirac-production-logging-info
  • dirac-production-mcextend
  • dirac-production-manager-cli
  • dirac-production-progress
  • dirac-production-set-automatic
  • dirac-production-set-manual
  • dirac-production-site-summary
  • dirac-production-start
  • dirac-production-stop
  • dirac-production-submit
  • dirac-production-summary
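For example, a typical sequence when following up a problematic production might look like the sketch below (ProdID 00001234 is purely illustrative; each script prints its usage notes when run without parameters):

dirac-production-progress 00001234                # breakdown of job states for the production
dirac-production-job-summary 00001234 Failed      # minor status and an example JobID per failure class
dirac-production-stop 00001234                    # stop submission if the failure rate is significant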

 

DIRAC Proxy

Added:
>
>
  • dirac-proxy-info
  • dirac-proxy-init
  • dirac-proxy-upload
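Before any of the above scripts can contact the DIRAC services a valid proxy is needed. A minimal check looks like this (available options vary between DIRAC versions):

dirac-proxy-init     # create a proxy from the Grid certificate
dirac-proxy-info     # confirm the proxy identity and remaining lifetime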

 

DIRAC Update

Added:
>
>
  • dirac-update

 

DIRAC WMS

Added:
>
>
  • dirac-wms-job-delete
  • dirac-wms-job-get-output
  • dirac-wms-job-get-input
  • dirac-wms-job-kill
  • dirac-wms-job-logging-info
  • dirac-wms-job-parameters
  • dirac-wms-job-peek
  • dirac-wms-job-status
  • dirac-wms-job-submit
  • dirac-wms-job-reschedule
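A short, hedged example of chasing a single suspect job (the JobID is a placeholder):

dirac-wms-job-status <JobID>         # current major and minor status
dirac-wms-job-peek <JobID>           # last lines of the job's std.out
dirac-wms-job-get-output <JobID>     # retrieve std.out and std.err to a local directory
dirac-wms-job-reschedule <JobID>     # resubmit the job if the failure looks transient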

 

Common Acronyms

Changed:
<
<
ACL Access Control Lists, 26 API Application Programming Interface, 26 ARC Advance Resource Connector, 26 ARDA A Realisation of Distributed Analysis, 26 BDII Berkeley Database Information Index, 26 BOSS Batch Object Submission System, 26 CA Certification Authority, 26 CAF CDF Central Analysis Farm, 26 CCRC Common Computing Readiness Challenge, 6 CDF Collider Detector at Fermilab, 26 CE Computing Element, 26 CERN Organisation Europ´eenne pour la Recherche Nucléaire: Switzerland/France, 2 CNAF Centro Nazionale per la Ricerca e Svilupponelle Tecnologie Informatiche e Telematiche: Italy, 2 ConDB Conditions Database, 26 CPU Central Processing Unit, 26 CRL Certificate Revocation List, 26 CS Configuration Service, 26 DAG Directed Acyclic Graph, 26 DC04 Data Challenge 2004, 26 DC06 Data Challenge 2006, 26 DCAP Data Link Switching Client Access Protocol, 26 DIAL Distributed Interactive Analysis of Large datasets, 26 DIRAC Distributed Infrastructure with Remote Agent Control, 26 DISET DIRAC Secure Transport, 26 DLI Data Location Interface, 26 DLLs Dynamically Linked Libraries, 26 DN Distinguished Name, 26 DNS Domain Name System, 26 DRS Data Replication Service, 26 DST Data Summary Tape, 26 ECAL Electromagnetic CALorimeter, 26 EGA Enterprise Grid Alliance, 26 EGEE Enabling Grids for E-sciencE, 26 ELOG Electronic Log, 1 ETC Event Tag Collection, 26 FIFO First In First Out, 26 FTS File Transfer Service, 26 Ganga Gaudi / Athena and Grid Alliance, 26 GASS Global Access to Secondary Storage, 26 GFAL Grid File Access Library, 26 GGF Global Grid Forum, 26 GIIS Grid Index Information Service, 26 GLUE Grid Laboratory Uniform Environment, 26 GRAM Grid Resource Allocation Manager, 26 GridFTP Grid File Transfer Protocol, 26 GridKa Grid Computing Centre Karlsruhe, 2 GriPhyN Grid Physics Network, 26 GRIS Grid Resource Information Server, 26 GSI Grid Security Infrastructure, 26 GT Globus Toolkit, 26 GUI Graphical User Interface, 26 GUID Globally Unique IDentifier, 26 HCAL Hadron CALorimeter, 26 HEP High Energy Physics, 26 HLT High Level Trigger, 26 HTML Hyper-Text Markup Language, 26 HTTP Hyper-Text Transfer Protocol, 26 I/O Input/Output, 26 IN2P3 Institut National de Physique Nucleaire et de Physique des Particules: France, 2 iVDGL International Virtual Data Grid Laboratory, 26 JDL Job Description Language, 26 JobDB Job Database, 26 JobID Job Identifier, 3 L0 Level 0, 26 LAN Local Area Network, 26 LCG LHC Computing Grid, 26 LCG IS LCG Information System, 26 LCG UI LCG User Interface, 26 LCG WMS LCG Workload Management System, 26 LDAP Lightweight Directory Access Protocol, 26 LFC LCG File Catalogue, 26 LFN Logical File Name, 26 LHC Large Hadron Collider, 26 LHCb Large Hadron Collider beauty, 26 LSF Load Share Facility, 26 MC Monte Carlo, 6 MDS Monitoring and Discovery Service, 26 MSS Mass Storage System, 26 NIKHEF National Institute for Subatomic Physics: Netherlands, 2 OGSA Open Grid Services Architecture, 26 OGSI Open Grid Services Infrastructure, 26 OSG Open Science Grid, 26 P2P Peer-to-peer, 26 Panda Production ANd Distributed Analysis, 26 PC Personal Computer, 26 PDC1 Physics Data Challenge, 26 PFN Physical File Name, 26 PIC Port d’Informaci´o Cient´ıfica: Spain, 2 PKI Public Key Infrastructure, 26 POOL Pool Of persistent Ob jects for LHC, 26 POSIX Portable Operating System Interface, 26 PPDG Particle Physics Data Grid, 26 ProdID Production Identifier, 3, 6 PS Preshower Detector, 26 R-GMA Relational Grid Monitoring Architecture, 26 RAL Rutherford-Appleton Laboratory: UK, 2 RB Resource Broker, 26 rDST reduced 
Data Summary Tape, 26 RFIO Remote File Input/Output, 26 RICH Ring Imaging CHerenkov, 26 RM Replica Manager, 26 RPC Remote Procedure Call, 26 RTTC Real Time Trigger Challenge, 26 SAM Service Availability Monitoring, 20 SE Storage Element, 18, 26 SOA Service Oriented Architecture, 26 SOAP Simple Ob ject Access Protocol, 26 SPD Scintillator Pad Detector, 26 SRM Storage Resource Manager, 26 SSL Secure Socket Layer, 26 SURL Storage URL, 26 TCP/IP Transmission Control Protocol / Internet Protocol, 26 TDS Transient Detector Store, 26 TES Transient Event Store, 26 THS Transient Histogram Store, 26 TT Trigger Tracker, 26 TURL Transport URL, 26 URL Uniform Resource Locator, 26 VDT Virtual Data Toolkit, 26 VELO VErtex LOcator, 26 VO Virtual Organisation, 11, 26 VOMS Virtual Organisation Membership Service, 26 WAN Wide Area Network, 26 WMS Workload Management System, 26 WN Worker Node, 26 WSDL Web Services Description Language, 26 WSRF Web Services Resource Framework, 26 WWW World Wide Web, 26 XML eXtensible Markup Language, 26 XML-RPC XML Remote Procedure Call, 26
>
>
  • ACL Access Control Lists
  • API Application Programming Interface
  • ARC Advanced Resource Connector
  • ARDA A Realisation of Distributed Analysis
  • BDII Berkeley Database Information Index
  • BOSS Batch Object Submission System
  • CA Certification Authority
  • CAF CDF Central Analysis Farm
  • CCRC Common Computing Readiness Challenge
  • CDF Collider Detector at Fermilab
  • CE Computing Element
  • CERN Organisation Européenne pour la Recherche Nucléaire: Switzerland/France
  • CNAF Centro Nazionale per la Ricerca e Svilupponelle Tecnologie Informatiche e Telematiche: Italy
  • ConDB Conditions Database
  • CPU Central Processing Unit
  • CRL Certificate Revocation List
  • CS Configuration Service
  • DAG Directed Acyclic Graph
  • DC04 Data Challenge 2004
  • DC06 Data Challenge 2006
  • DCAP Data Link Switching Client Access Protocol
  • DIAL Distributed Interactive Analysis of Large datasets
  • DIRAC Distributed Infrastructure with Remote Agent Control
  • DISET DIRAC Secure Transport
  • DLI Data Location Interface
  • DLLs Dynamically Linked Libraries
  • DN Distinguished Name
  • DNS Domain Name System
  • DRS Data Replication Service
  • DST Data Summary Tape
  • ECAL Electromagnetic CALorimeter
  • EGA Enterprise Grid Alliance
  • EGEE Enabling Grids for E-sciencE
  • ELOG Electronic Log
  • ETC Event Tag Collection
  • FIFO First In First Out
  • FTS File Transfer Service
  • Ganga Gaudi / Athena and Grid Alliance
  • GASS Global Access to Secondary Storage
  • GFAL Grid File Access Library
  • GGF Global Grid Forum
  • GIIS Grid Index Information Service
  • GLUE Grid Laboratory Uniform Environment
  • GRAM Grid Resource Allocation Manager
  • GridFTP Grid File Transfer Protocol
  • GridKa Grid Computing Centre Karlsruhe
  • GriPhyN Grid Physics Network
  • GRIS Grid Resource Information Server
  • GSI Grid Security Infrastructure
  • GT Globus Toolkit
  • GUI Graphical User Interface
  • GUID Globally Unique IDentifier
  • HCAL Hadron CALorimeter
  • HEP High Energy Physics
  • HLT High Level Trigger
  • HTML Hyper-Text Markup Language
  • HTTP Hyper-Text Transfer Protocol
  • I/O Input/Output
  • IN2P3 Institut National de Physique Nucleaire et de Physique des Particules: France
  • iVDGL International Virtual Data Grid Laboratory
  • JDL Job Description Language
  • JobDB Job Database
  • JobID Job Identifier
  • L0 Level 0
  • LAN Local Area Network
  • LCG LHC Computing Grid
  • LCG IS LCG Information System
  • LCG UI LCG User Interface
  • LCG WMS LCG Workload Management System
  • LDAP Lightweight Directory Access Protocol
  • LFC LCG File Catalogue
  • LFN Logical File Name
  • LHC Large Hadron Collider
  • LHCb Large Hadron Collider beauty
  • LSF Load Share Facility
  • MC Monte Carlo
  • MDS Monitoring and Discovery Service
  • MSS Mass Storage System
  • NIKHEF National Institute for Subatomic Physics: Netherlands
  • OGSA Open Grid Services Architecture
  • OGSI Open Grid Services Infrastructure
  • OSG Open Science Grid
  • P2P Peer-to-peer
  • Panda Production ANd Distributed Analysis
  • PC Personal Computer
  • PDC1 Physics Data Challenge
  • PFN Physical File Name
  • PIC Port d’Informació Científica: Spain
  • PKI Public Key Infrastructure
  • POOL Pool Of persistent Objects for LHC
  • POSIX Portable Operating System Interface
  • PPDG Particle Physics Data Grid
  • ProdID Production Identifier
  • PS Preshower Detector
  • R-GMA Relational Grid Monitoring Architecture
  • RAL Rutherford-Appleton Laboratory: UK
  • RB Resource Broker
  • rDST reduced Data Summary Tape
  • RFIO Remote File Input/Output
  • RICH Ring Imaging CHerenkov
  • RM Replica Manager
  • RPC Remote Procedure Call
  • RTTC Real Time Trigger Challenge
  • SAM Service Availability Monitoring
  • SE Storage Element
  • SOA Service Oriented Architecture
  • SOAP Simple Object Access Protocol
  • SPD Scintillator Pad Detector
  • SRM Storage Resource Manager
  • SSL Secure Socket Layer
  • SURL Storage URL
  • TCP/IP Transmission Control Protocol / Internet Protocol
  • TDS Transient Detector Store
  • TES Transient Event Store
  • THS Transient Histogram Store
  • TT Trigger Tracker
  • TURL Transport URL
  • URL Uniform Resource Locator
  • VDT Virtual Data Toolkit
  • VELO VErtex LOcator
  • VO Virtual Organisation
  • VOMS Virtual Organisation Membership Service
  • WAN Wide Area Network
  • WMS Workload Management System
  • WN Worker Node
  • WSDL Web Services Description Language
  • WSRF Web Services Resource Framework
  • WWW World Wide Web
  • XML eXtensible Markup Language
  • XML-RPC XML Remote Procedure Call
 

Revision 42009-09-02 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 433 to 433
 

Analysis and Summary

ELOG

Added:
>
>
All Grid Shifter actions of note should be recorded in the http://lblogbook.cern.ch/Operations/ELOG [10]. This had the benefits of allowing new Grid Shifters to familiarise themselves with recent problems with current productions. ELOG entries should contain as much relevant information as possible.
 

Typical ELOG Format

Added:
>
>
Each ELOG entry which reports a new problem should include as much relevant information as possible. This allows the production operations team to quickly determine the problem and apply a solution.
 

ELOG Entry for a New Problem

Added:
>
>
A typical ELOG entry for a new problem contains:

  • The relevant ProdID or ProdIDs.
  • An example JobID.
  • A copy of the relevant error message and output.
  • The number of affected jobs.
  • The Grid sites affected.
  • The time of the first and last occurrence of the problem.

 

Subsequent ELOG Entries

Added:
>
>
Once a problem has been logged it is useful to report the continuing status of the affected productions at the end of each shift.

If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG following the format outlined in (Sec. 7.1.1).

 

When to Submit an ELOG

Added:
>
>
Submit an elog in the following situations:

  • Jobs finalise with exceptions.
 

Exceptions

Added:
>
>
Jobs which finalise with an exception should be noted in the ELOG. The ELOG entry should contain:

  • The production ID.
  • An example job ID.
  • A copy of the relevant error messages.
  • The number of jobs in the production which have the same status.
 

Crashed Application

Added:
>
>
Should submit example error log for the crashed application.
 

Datacrash Emails

Added:
>
>
The Grid Shifter should filter the datacrash emails and determine if the crash reported is actually due to one of the applications. If so, then the Grid Shifter should submit an ELOG describing the problem and including an example error message. The Grid Shifter should ensure the “Applications” radio button is selected when submitting the ELOG report since this means that the relevant experts will be alerted to the problem.
 

ELOG Problems

Added:
>
>
If ELOG is down, send a notification email to lhcb-production@cern.ch.
 

Procedures

Added:
>
>
If a problem is discovered it is very important to escalate it to the operations team. Assessing the scale of the problem is very important and Grid Shifters should attempt to answer the questions in section 9.1.1 as soon as possible.
 

On the Discovery of a Problem

Added:
>
>
Once a problem has been discovered it is important to assess the severity of the problem. Section 9.1.1 provides a checklist which the Grid Shifter should go through after discovering a problem. Additionally, there are a number of Grid-specific issues to consider (Sec. 9.1.2).
 

Standard Checklist

Added:
>
>
On the discovery of a new problem, attempt to provide answers to the following questions as quickly as possible:
  • How many jobs does the problem affect?
  • Are the central DIRAC services running normally?
  • Are all jobs affected?
  • When did the problem start?
  • When did the last successful job run in similar conditions?
  • Is it a DIRAC problem?
    • Can extra redundancy be introduced to the system?
    • Is there enough information available to determine the error?
 

Grid-Specific Issues

Added:
>
>
  • Was there an announcement of downtime for the site?
  • Is the problem specific to a single site?
    • Are all the CE’s at the site affected?
  • Is the problem systematic across sites with different backend storage technologies? (Sec. 3.3)
  • Is the problem specific to an SE?
    • Are there any stalled jobs at the site clustered in time?
    • Are other jobs successfully reading data from the SE?
 

Feature Requests

Added:
>
>
Before submitting a feature request, the user should:

  * Identify conditions under which the feature is to be used.
  * Record all relevant information.
  * Identify a use-case for the new feature.

Image savannah_support_browse Figure 5: Browse current support issues.

Once the user has prepared all the relevant information, they should:

Image savannah_support_submit Figure 6: Savannah support submit.

Image savannah_support_example Figure 7: Savannah support submit feature request.

Assuming the feature request has not been previously submitted, the user should then:

  • Navigate to the ``Support'' tab at the top of the page (Fig. 6) and click on ``submit''.
  • Ensure that the submission webform contains all relevant information (Fig. 7).
  • Set the severity option to ``wish''.
  • Set the privacy option to ``private''.
  • Submit the feature request.
 

Bug Reporting

Added:
>
>
Before submitting a bug report, the user should:

  • Identify conditions under which the bug occurs.
  • Record all relevant information.
  • Try to ensure that the bug is reproducible.

Once the user is convinced that the behaviour they are experiencing is a bug, they should then prepare to submit a bug report. Users should:

Image savannah_bugs Figure 8: Browse current bugs.

Assuming the bug is new, the procedure to submit a bug report is as follows:

  • Navigate to the ``Support'' tab at the top of the page (Fig. 6) and click on ``submit''.
  • Ensure that the submission webform contains all relevant information (Fig. 9).
  • Set the appropriate severity of the problem.
  • Write a short and clear summary.
  • Set the privacy option to ``private''.
  • Submit the bug report.

Image savannah_bugs_example Figure 9: Example bug report.

 

Software Unavailability

Added:
>
>
Symptom: Jobs fail to find at least one software package.

Software installation occurs during Service Availability Monitoring (SAM) tests. Sites which fail to find software packages should have failed at least part of their most recent SAM test.

Grid Shifter actions:

DIRAC 3 Scripts

DIRAC Admin Scripts

* dirac-admin-accounting-cli * dirac-admin-add-user * dirac-admin-allow-site * dirac-admin-ban-site * dirac-admin-delete-user * dirac-admin-get-banned-sites * dirac-admin-get-job-pilot-output * dirac-admin-get-job-pilots * dirac-admin-get-pilot-output * dirac-admin-get-proxy * dirac-admin-get-site-mask * dirac-admin-list-hosts * dirac-admin-list-users * dirac-admin-modify-user * dirac-admin-pilot-summary * dirac-admin-reset-job * dirac-admin-service-ports * dirac-admin-site-info * dirac-admin-sync-users-from-file * dirac-admin-upload-proxy * dirac-admin-users-with-proxy

DIRAC Bookkeeping Scripts

* dirac-bookkeeping-eventMgt * dirac-bookkeeping-eventtype-mgt * dirac-bookkeeping-ls * dirac-bookkeeping-production-jobs * dirac-bookkeeping-production-informations

DIRAC Clean

* dirac-clean

DIRAC Configuration

* dirac-configuration-cli

DIRAC Distribution

* dirac-distribution

DIRAC DMS

DIRAC Embedded

DIRAC External

DIRAC Fix

DIRAC Framework

DIRAC Functions

DIRAC Group

DIRAC Jobexec

DIRAC LHCb

DIRAC Myproxy

DIRAC Production

DIRAC Proxy

DIRAC Update

DIRAC WMS

Common Acronyms

ACL Access Control Lists, 26 API Application Programming Interface, 26 ARC Advance Resource Connector, 26 ARDA A Realisation of Distributed Analysis, 26 BDII Berkeley Database Information Index, 26 BOSS Batch Object Submission System, 26 CA Certification Authority, 26 CAF CDF Central Analysis Farm, 26 CCRC Common Computing Readiness Challenge, 6 CDF Collider Detector at Fermilab, 26 CE Computing Element, 26 CERN Organisation Europ´eenne pour la Recherche Nucléaire: Switzerland/France, 2 CNAF Centro Nazionale per la Ricerca e Svilupponelle Tecnologie Informatiche e Telematiche: Italy, 2 ConDB Conditions Database, 26 CPU Central Processing Unit, 26 CRL Certificate Revocation List, 26 CS Configuration Service, 26 DAG Directed Acyclic Graph, 26 DC04 Data Challenge 2004, 26 DC06 Data Challenge 2006, 26 DCAP Data Link Switching Client Access Protocol, 26 DIAL Distributed Interactive Analysis of Large datasets, 26 DIRAC Distributed Infrastructure with Remote Agent Control, 26 DISET DIRAC Secure Transport, 26 DLI Data Location Interface, 26 DLLs Dynamically Linked Libraries, 26 DN Distinguished Name, 26 DNS Domain Name System, 26 DRS Data Replication Service, 26 DST Data Summary Tape, 26 ECAL Electromagnetic CALorimeter, 26 EGA Enterprise Grid Alliance, 26 EGEE Enabling Grids for E-sciencE, 26 ELOG Electronic Log, 1 ETC Event Tag Collection, 26 FIFO First In First Out, 26 FTS File Transfer Service, 26 Ganga Gaudi / Athena and Grid Alliance, 26 GASS Global Access to Secondary Storage, 26 GFAL Grid File Access Library, 26 GGF Global Grid Forum, 26 GIIS Grid Index Information Service, 26 GLUE Grid Laboratory Uniform Environment, 26 GRAM Grid Resource Allocation Manager, 26 GridFTP Grid File Transfer Protocol, 26 GridKa Grid Computing Centre Karlsruhe, 2 GriPhyN Grid Physics Network, 26 GRIS Grid Resource Information Server, 26 GSI Grid Security Infrastructure, 26 GT Globus Toolkit, 26 GUI Graphical User Interface, 26 GUID Globally Unique IDentifier, 26 HCAL Hadron CALorimeter, 26 HEP High Energy Physics, 26 HLT High Level Trigger, 26 HTML Hyper-Text Markup Language, 26 HTTP Hyper-Text Transfer Protocol, 26 I/O Input/Output, 26 IN2P3 Institut National de Physique Nucleaire et de Physique des Particules: France, 2 iVDGL International Virtual Data Grid Laboratory, 26 JDL Job Description Language, 26 JobDB Job Database, 26 JobID Job Identifier, 3 L0 Level 0, 26 LAN Local Area Network, 26 LCG LHC Computing Grid, 26 LCG IS LCG Information System, 26 LCG UI LCG User Interface, 26 LCG WMS LCG Workload Management System, 26 LDAP Lightweight Directory Access Protocol, 26 LFC LCG File Catalogue, 26 LFN Logical File Name, 26 LHC Large Hadron Collider, 26 LHCb Large Hadron Collider beauty, 26 LSF Load Share Facility, 26 MC Monte Carlo, 6 MDS Monitoring and Discovery Service, 26 MSS Mass Storage System, 26 NIKHEF National Institute for Subatomic Physics: Netherlands, 2 OGSA Open Grid Services Architecture, 26 OGSI Open Grid Services Infrastructure, 26 OSG Open Science Grid, 26 P2P Peer-to-peer, 26 Panda Production ANd Distributed Analysis, 26 PC Personal Computer, 26 PDC1 Physics Data Challenge, 26 PFN Physical File Name, 26 PIC Port d’Informaci´o Cient´ıfica: Spain, 2 PKI Public Key Infrastructure, 26 POOL Pool Of persistent Ob jects for LHC, 26 POSIX Portable Operating System Interface, 26 PPDG Particle Physics Data Grid, 26 ProdID Production Identifier, 3, 6 PS Preshower Detector, 26 R-GMA Relational Grid Monitoring Architecture, 26 RAL Rutherford-Appleton Laboratory: UK, 2 RB Resource Broker, 26 rDST reduced 
Data Summary Tape, 26 RFIO Remote File Input/Output, 26 RICH Ring Imaging CHerenkov, 26 RM Replica Manager, 26 RPC Remote Procedure Call, 26 RTTC Real Time Trigger Challenge, 26 SAM Service Availability Monitoring, 20 SE Storage Element, 18, 26 SOA Service Oriented Architecture, 26 SOAP Simple Ob ject Access Protocol, 26 SPD Scintillator Pad Detector, 26 SRM Storage Resource Manager, 26 SSL Secure Socket Layer, 26 SURL Storage URL, 26 TCP/IP Transmission Control Protocol / Internet Protocol, 26 TDS Transient Detector Store, 26 TES Transient Event Store, 26 THS Transient Histogram Store, 26 TT Trigger Tracker, 26 TURL Transport URL, 26 URL Uniform Resource Locator, 26 VDT Virtual Data Toolkit, 26 VELO VErtex LOcator, 26 VO Virtual Organisation, 11, 26 VOMS Virtual Organisation Membership Service, 26 WAN Wide Area Network, 26 WMS Workload Management System, 26 WN Worker Node, 26 WSDL Web Services Description Language, 26 WSRF Web Services Resource Framework, 26 WWW World Wide Web, 26 XML eXtensible Markup Language, 26 XML-RPC XML Remote Procedure Call, 26

  -- PaulSzczypka - 14 Aug 2009
Line: 457 to 812
 
META FILEATTACHMENT attachment="get_logfiles.png" attr="" comment="View all the output files of a job via the Job Monitoring Webpage." date="1251818814" name="get_logfiles.png" path="get_logfiles.png" size="38751" stream="get_logfiles.png" tmpFilename="/usr/tmp/CGItemp60974" user="szczypka" version="1"
META FILEATTACHMENT attachment="get_std_out.png" attr="" comment="Peek the std.out of a job via the Job Monitoring Webpage" date="1251818832" name="get_std_out.png" path="get_std_out.png" size="38486" stream="get_std_out.png" tmpFilename="/usr/tmp/CGItemp60935" user="szczypka" version="1"
META FILEATTACHMENT attachment="get_pilot_output.png" attr="" comment="View the pilot output of a job via the Job Monitoring Webpage." date="1251820011" name="get_pilot_output.png" path="get_pilot_output.png" size="43849" stream="get_pilot_output.png" tmpFilename="/usr/tmp/CGItemp60995" user="szczypka" version="1"
Added:
>
>
META FILEATTACHMENT attachment="savannah_bugs.png" attr="" comment="Savannah bugs dialogue" date="1251887682" name="savannah_bugs.png" path="savannah_bugs.png" size="47603" stream="savannah_bugs.png" tmpFilename="/usr/tmp/CGItemp49580" user="szczypka" version="1"
META FILEATTACHMENT attachment="savannah_bugs_example.png" attr="" comment="Savannah bug example" date="1251887702" name="savannah_bugs_example.png" path="savannah_bugs_example.png" size="42727" stream="savannah_bugs_example.png" tmpFilename="/usr/tmp/CGItemp49509" user="szczypka" version="1"
META FILEATTACHMENT attachment="savannah_support_browse.png" attr="" comment="Savanah Support Dialogue" date="1251887722" name="savannah_support_browse.png" path="savannah_support_browse.png" size="38383" stream="savannah_support_browse.png" tmpFilename="/usr/tmp/CGItemp49537" user="szczypka" version="1"
META FILEATTACHMENT attachment="savannah_support_example.png" attr="" comment="Savannah support example" date="1251887756" name="savannah_support_example.png" path="savannah_support_example.png" size="43565" stream="savannah_support_example.png" tmpFilename="/usr/tmp/CGItemp49598" user="szczypka" version="1"
META FILEATTACHMENT attachment="savannah_support_submit.png" attr="" comment="Savannah support submission" date="1251888429" name="savannah_support_submit.png" path="savannah_support_submit.png" size="31343" stream="savannah_support_submit.png" tmpFilename="/usr/tmp/CGItemp49573" user="szczypka" version="1"

Revision 32009-09-01 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 6 to 6
  Placeholder for the tWiki version of the shifter guide.
Added:
>
>

Introduction

This document describes the frequently-used tools and procedures available to Grid Shifters when managing production activities. It is expected that the current Grid Shifters should update this document by incorporating or linking to the stable procedures available on the LHCbProduction TWiki pages [1] when appropriate.

The grid sites section gives some brief information about the various Grid sites and their backend storage systems. The jobs section details the jobs types a Grid Shifter is expected to encounter and provides some debugging methods. The methods available to manage and monitor productions are described in the Productions section. The Web Production Monitor section describes the main features of the Production Monitor webpage. A chronological guide through a production shift, from beginning to end, is presented in the shifts section. The ELOG section outlines the situations for which the submission of an ELOG is appropriate. Finally, the Procedures section details the well-established procedures for Grid Shifters. A number of quick-reference sections are also available. DIRAC Scripts and Acronyms list the available DIRAC 3 scripts and commonly-used acronyms respectively.

Link to Jobs here#Jobs

Link to Shifts here#Shifts

Jobs Link

Shifts Link sans hash

 

Grid Sites

Line: 26 to 49
 

Tier-2 Sites

Added:
>
>
There are numerous Tier-2 sites with sites being added frequently. As such, it is of little worth presenting a list of all the current Tier-2 sites in this document. Tier-2 sites are used for MC production in the LHCb Computing Model.
 

Backend Storage Systems

Added:
>
>
Two backend storage technologies are employed at the Tier-1 sites, Castor and dCache. The Tier-1 sites which utilise each technology choice are summarised in the table below:

Backend Storage Tier-1 Site
Castor CERN, CNAF, RAL
dCache IN2P3, NIKHEF, GridKa, PIC
 

Jobs

Added:
>
>
The number of jobs created for a productions varies depending on the exact requirements of the production. Grid Shifters are generally not required to create jobs for a production.
 

JobIDs

Added:
>
>
A particular job is tagged with the following information:

  • Production Identifier (ProdID), e.g. 00001234 - the 1234th production.
  • Job Identifier (JobID), e.g. 9876 - the 9876th job in the DIRAC system.
  • JobName, e.g. 00001234_00000019 - the 19th job in production 00001234.
 

Job Status

Added:
>
>
The job status of a successful job proceeds in the following order:

  1. Received,
  2. Checking,
  3. Staging,
  4. Waiting,
  5. Matched,
  6. Running,
  7. Completed,
  8. Done.

Jobs which return no heartbeat have a status of ``Stalled'' and jobs where any workflow modules return an error status are classed as ``Failed''.

The basic flowchart describing the evolution of a job's status can be found in figure 1. Jobs are only ``Grid-active'' once they have reached the ``Matched'' phase.

Job status flowchart. Note that the ``Checking'' and ``Staging'' status are omitted.

Figure 1: Job status flowchart. Note that the ``Checking'' and ``Staging'' status are omitted.

 

Job Output

Added:
>
>
The standard output and standard error of a job can be accessed through the API, the CLI and the webpage via a global job ``peek''.
 

Job Output via the CLI

Added:
>
>
The std.out and std.err for a given job can be retrieved using the CLI command:
dirac-wms-job-get-output <JobID> | [<JobID>]
This creates a directory containing the std.out and std.err for each JobID entered. Standard tools can then be used to search the output for specific strings, e.g. ``FATAL''.

To simply view the last few lines of a job's std.out (``peek'') use:

dirac-wms-job-peek <JobID> | [<JobID>]
 

Job Output via the Job Monitoring Webpage

Added:
>
>
There are two methods to view the output of a job via the https://lhcbweb.pic.es/DIRAC/jobs/JobMonitor/displayJob Monitoring Webpage [3]. The first returns the last 20 lines of the std.out and the second allows the Grid Shifter to view all the output files.

Figure 2: Peek the std.out of a job via the Job Monitoring Webpage.

To ``peek'' the std.out of a job:

  1. Navigate to the Job Monitoring Webpage.
  2. Select the relevant filters from the left panel.
  3. Click on a job.
  4. Select ``StandardOutput'' (Fig. 2).

Figure 3: View all the output files of a job via the Job Monitoring Webpage.

Similarly, to view all output files for a job:

  1. Navigate to the Job Monitoring Webpage.
  2. Select the relevant filters from the left panel.
  3. Click on a job.
  4. Select ``Get Logfile'' (Fig. 3).

This method can be particularly quick if the Grid Shifter only wants to check the output of a selection of jobs.

 

Job Pilot Output

Added:
>
>
The output of the Job Pilot can also be retrieved via the API, the CLI or the Webpage.
 

Job Pilot Output via the CLI

Added:
>
>
To obtain the Job Pilot output using the CLI, use:
dirac-admin-get-pilot-output <Grid pilot reference> [<Grid pilot reference>]
This creates a directory for each JobID containing the Job Pilot output.
 

Job Pilot Output via the Job Monitoring Webpage

Added:
>
>
Viewing the std.out and std.err of a Job Pilot via the Job Monitoring Webpage is achieved by:

  1. Navigate to the Job Monitoring Webpage.
  2. Select the relevant filters from the left panel.
  3. Click on a job.
  4. Select ``Pilot'' then ``Get StdOut'' or ``Get StdErr'' (Fig. 4).

Figure 4: View the pilot output of a job via the Job Monitoring Webpage.

 

Operations on Jobs

Added:
>
>
The full list of scripts which can be used to perform operations on a job is given in (App. A.19). The name of each script should be a clear indication of its purpose. Running a script without arguments will print basic usage notes.
 

Productions

Added:
>
>
As a Grid Shifter you will be required to monitor the official LHCb productions. Each production is assigned a unique Production ID (ProdID). These consist of Monte Carlo (MC) generation, data stripping and CCRC productions. Production creation will generally be performed by the Production Operations Manager and is not a duty of the Grid Shifter.

The current list of all active productions can be obtained with the command:

dirac-production-list-active
The command also gives the current submission status of the active productions.
 

Starting a Production

Added:
>
>
The submission of a production can be started once it has been formulated and all the required jobs created. Grid Shifters should ensure they have the permission of the Production Operations Manager (or equivalent) before starting a production (Sec. 4.1.1). Production jobs can be submitted manually (Sec. 4.1.2) or automatically (Sec. 4.1.3).

The state of a production can also be set using:

dirac-production-change-status <Command> <Production ID> | <Production ID>
where the available commands are:
'start', 'stop', 'manual', 'automatic'

Starting and Stopping a Production

The commands:

dirac-production-start <Production ID> | <Production ID>
and
dirac-production-stop <Production ID> | <Production ID>
are used to start and stop a production. Grid Shifters may stop a current production if a significant number of jobs are failing.
 

Manual Submission

Added:
>
>
A production is set to manual submission by default. To reset the submission status of one or more productions, use the command:
dirac-production-set-manual <Production ID> | <Production ID>

A small number of test jobs should be manually submitted for each new production. In the case of stripping or CCRC productions, a small number of test jobs should be sent to all the Tier1 sites and closely monitored.

To manually submit jobs to a selected site, use the following command:

dirac-production-site-submit <ProdID> <Num Jobs> <Site>
Note that the full site name string must be entered, e.g. to submit a job to CERN you must type:
dirac-production-site-submit <ProdID> 1 LCG.CERN.ch

Any observed problems or job failures should be investigated and an ELOG entry submitted. Assuming there are no problems in all the test jobs, the production may be set to automatic submission.

 

Automatic Submission

Added:
>
>
When started, a production set to automatic submission will submit all jobs in the production in quick succession.

A production can be set to automatic submission once you are satisfied that there are no specific problems with the production jobs. To set a production to automatic submission use:

dirac-production-set-automatic <Production ID> | <Production ID>
 

Monitoring a Production

Added:
>
>
Jobs in each production should be periodically checked for failed jobs (Sec. 4.2.1) and to ensure that jobs are progressing (Sec. 4.2.2).

When monitoring a production, a Grid Shifter should be aware of a number of issues which can cause jobs to fail:

  • Staging.
  • Stalled Jobs.
  • Segmentation faults.
  • DB access.
  • Software problems.
  • Data access.
  • Shared area access.
  • Site downtime.
  • Problematic files.
  • Excessive runtime.
 

Failed Jobs

Added:
>
>
A Grid Shifter should monitor a production for failed jobs and jobs which are not progressing. Due to the various configurations of all the sites it is occasionally not possible for an email to be sent to the https://mmm.cern.ch/public/archive-list/l/lhcb-datacrash/lhcb-datacrash [4] mailing list for each failed job. It is therefore not enough to simply rely on the number of lhcb-datacrash emails to indicate if there are any problems with a production. In addition to any lhcb-datacrash notifications, the Grid Shifter should also check the number of failed jobs in a production via the CLI or the https://lhcbweb.pic.es/DIRAC/jobs/ProductionMonitor/displayProduction Monitoring Webpage [5].

Using the CLI, the command:

dirac-production-progress [<Production ID>]
entered without any arguments will return a breakdown of the jobs of all current productions. Entering one or more ProdIDs returns only the breakdown of those productions.

A more detailed breakdown is provided by:

dirac-production-job-summary <Production ID> [<DIRAC Status>]
which also includes the minor status of each job category and provides an example JobID for each category. The example JobIDs can then be used to investigate the failures further.

 

Non-Progressing Jobs

Added:
>
>
In addition to failed jobs, jobs which do not progress should also be monitored. Particular attention should be paid to jobs in the states ``Waiting'' and ``Staging''. Problematic jobs at this stage are easily overlooked since the associated problems are not easily identifiable.
 

Non-Starting Jobs

Added:
>
>
Jobs arriving at a site but then failing to start have multiple causes. One of the most common is that the site is due to enter scheduled downtime and is no longer submitting jobs to the batch queues. Jobs will stay at the site in a ``Waiting'' state, reporting that no CEs are available. Multiple jobs in this state should be reported.
 

Ending a Production

Added:
>
>
Ending a completed production is handled by the Production Operations Manager (or equivalent). No action is required on the part of the Grid Shifter.
 

Operations on Productions

Added:
>
>
All CLI scripts which can be used to manage productions are listed in (App. A.16). Running a script without arguments will return basic usage notes. In some cases further help is available by running a script with the option "--help".
 

Web Production Monitor

Added:
>
>
Production monitoring via the web is possible through the https://lhcbweb.pic.es/DIRAC/jobs/ProductionMonitor/displayProduction Monitoring Webpage [5]. A valid grid certificate loaded into your browser is required to use the webpage.
 

Features

Added:
>
>
The Production Monitoring Webpage has the following features:

 

Site Downtime Calendar

Added:
>
>

The calendar [6] displays all the sites with scheduled and unscheduled downtime. Calendar entries are automatically parsed through the site downtime RSS feed and added to the calendar.

Occasionally the feed isn't parsed correctly and Grid Shifters should double-check that the banned and allowed sites are correct. Useful scripts for this are:

dirac-admin-get-banned-sites
and
dirac-admin-get-site-mask
 

Plots

Added:
>
>
 

Buglist and Feature Request

Added:
>
>
The procedure to submit a bug report or a feature request is outlined in section 8.

 

Shifts

Added:
>
>
Grid Shifters are required to monitor all the current LHCb productions and must have a valid Grid Certificate and be a member of the LHCb VO.
 

Before a Shift Period

Added:
>
>
The new shifter should:

  • Ensure their Grid certificate is valid for all expected duties (Sec. 6.1.2).
  • Create accounts on all relevant web-resources (Sec. 6.1.2).
  • Subscribe to the relevant mailing lists (Sec. 6.1.3).
 

Grid Certificates

Added:
>
>
A Grid certificate is mandatory for Grid Shifters. If you don't have a certificate you should register for one through http://lcg.web.cern.ch/lcg/users/registration/registration.html CERN LCG [8] and apply to join the LHCb Virtual Organisation (VO).

To access the https://lhcbweb.pic.es/DIRAC/jobs/ProductionMonitor/displayproduction monitoring webpages [5] you will also need to load your certificate into your browser. Detailed instructions on how to do this can be found on the http://lcg.web.cern.ch/lcg/users/registration/load-cert.html CERN LCG pages [9].

 

Web Resources

Added:
>
>
Primary web-based resources for DIRAC 3 production shifts:

 

Mailing Lists

Added:
>
>
The new Grid Shifter should subscribe to the following mailing lists:

  • lhcb-datacrash [11,4,12].
  • lhcb-dirac-developers [12].
  • lhcb-dirac [12].
  • lhcb-production [12].

Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It is suggested that suitable message filters and folders are created in your mail client of choice.

 

Production Operations Key

Added:
>
>
The new shifter should obtain the Production Operations key (TCE5) from the LHCb secretariat or the previous Grid Shifter.
 

During a Shift

Added:
>
>
During a shift Grid Shifters are expected to monitor all current productions and be aware of the current status of the Tier1 sites. A knowledge of the purpose of each production is also useful and aids in determining the probable cause of any failed jobs.
 

Daily Actions

Added:
>
>
Grid Shifters are expected to carry out the following daily actions for sites used in the current productions:

  • Trigger submission of pending productions.
  • Monitor active productions.
  • Check transfer status.
  • Verify that the staging at each site is functional.
  • Check that there is a minimum of one successful (and complete) job.
  • Confirm that data access is working at least intermittently.
  • Report problems to the operations team.
  • Submit a summary of the job status at all the grid sites to the ELOG (Sec. 7).
 

Performance Monitoring

Added:
>
>
 

Production Operations Meeting

Added:
>
>
A Production Operations Meeting takes place at the end of the morning shift and allows the morning Grid Shifter to highlight any recent or outstanding issues. Both the morning and afternoon Grid Shifter should attend. The morning Grid Shifter should give a report summarising the morning activities.

The Grid Shifter's report should contain:

  • Current production progress, jobs submitted, waiting etc.
  • Status of all Tier1 sites.
  • Recently observed failures, paying particular attention to previously-unknown problems.
 

Ending a Shift

Added:
>
>
At the end of each shift, morning Grid Shifters should:

  • Pass on the key (TCE5) for the Production Operations room to the next Grid Shifter.
  • Prepare a list of outstanding issues to be handed over to the next Grid Shifter and discussed in the Production Operations meeting.
  • Submit an ELOG report summarising the shift and any ongoing or unresolved issues.

Similarly, evening Grid Shifters should:

  • Place the key (TCE5) to the Productions Operations room in the secretariat key box.
  • Submit an ELOG report summarising the shift and any ongoing or unresolved issues.
 

End of Shift Period

Added:
>
>
At the end of a shift period the Grid Shifter may wish to unsubscribe from the various mailing lists (Sec. 6.4.1) in addition to returning the Production Operations room key, TCE5 (Sec. 6.4.2).
 

Mailing Lists

Added:
>
>
Unsubscribe from the following mailing lists:

  • lhcb-datacrash [11,4,13].
  • lhcb-dirac-developers [13].
  • lhcb-dirac [13].
  • lhcb-production [12].
 

Miscellaneous

Added:
>
>
Return the key for the Production Operations Room (TCE5) to the secretariat or the next Grid Shifter.
 

Weekly Report

Base Plots

Specific Plots

Line: 96 to 452
 

-- PaulSzczypka - 14 Aug 2009

Added:
>
>
META FILEATTACHMENT attachment="dirac-primary-states.png" attr="" comment="Job status flowchart. Note that the ``Checking'' and ``Staging'' status are omitted." date="1251818035" name="dirac-primary-states.png" path="dirac-primary-states.png" size="111571" stream="dirac-primary-states.png" tmpFilename="/usr/tmp/CGItemp57097" user="szczypka" version="1"
META FILEATTACHMENT attachment="get_logfiles.png" attr="" comment="View all the output files of a job via the Job Monitoring Webpage." date="1251818814" name="get_logfiles.png" path="get_logfiles.png" size="38751" stream="get_logfiles.png" tmpFilename="/usr/tmp/CGItemp60974" user="szczypka" version="1"
META FILEATTACHMENT attachment="get_std_out.png" attr="" comment="Peek the std.out of a job via the Job Monitoring Webpage" date="1251818832" name="get_std_out.png" path="get_std_out.png" size="38486" stream="get_std_out.png" tmpFilename="/usr/tmp/CGItemp60935" user="szczypka" version="1"
META FILEATTACHMENT attachment="get_pilot_output.png" attr="" comment="View the pilot output of a job via the Job Monitoring Webpage." date="1251820011" name="get_pilot_output.png" path="get_pilot_output.png" size="43849" stream="get_pilot_output.png" tmpFilename="/usr/tmp/CGItemp60995" user="szczypka" version="1"

Revision 22009-08-14 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 7 to 7
 Placeholder for the tWiki version of the shifter guide.
Added:
>
>

Grid Sites

Jobs submitted to the Grid will be scheduled to run at one of a number of Grid sites. The exact site at which a job is executed depends on the job requirements and the current status of all relevant grid sites. Grid sites are grouped into two tiers, Tier-1 and Tier-2. CERN is an exception: because it is also responsible for processing and archiving the RAW experimental data, it is also referred to as a Tier-0 site.

Tier-1 Sites

Tier-1 sites are used for Analysis, Monte Carlo production, file transfer and file storage in the LHCb Computing Model.

  • LCG.CERN.ch
  • LCG.CNAF.it
  • IN2P3.fr
  • LCG.NIKHEF.nl
  • LCG.PIC.es
  • RAL.uk
  • LCG.GRIDKA.de

Tier-2 Sites

Backend Storage Systems

Jobs

JobIDs

Job Status

Job Output

Job Output via the CLI

Job Output via the Job Monitoring Webpage

Job Pilot Output

Job Pilot Output via the CLI

Job Pilot Output via the Job Monitoring Webpage

Operations on Jobs

Productions

Starting a Production

Manual Submission

Automatic Submission

Monitoring a Production

Failed Jobs

Non-Progressing Jobs

Non-Starting Jobs

Ending a Production

Operations on Productions

Web Production Monitor

Features

Site Downtime Calendar

Plots

Buglist and Feature Request

Shifts

Before a Shift Period

Grid Certificates

Web Resources

Mailing Lists

Production Operations Key

During a Shift

Daily Actions

Performance Monitoring

Production Operations Meeting

Ending a Shift

End of Shift Period

Mailing Lists

Miscellaneous

Weekly Report

Base Plots

Specific Plots

Machine Monitoring Plots

Analysis and Summary

ELOG

Typical ELOG Format

ELOG Entry for a New Problem

Subsequent ELOG Entries

When to Submit an ELOG

Exceptions

Crashed Application

Datacrash Emails

ELOG Problems

Procedures

On the Discovery of a Problem

Standard Checklist

Grid-Specific Issues

Feature Requests

Bug Reporting

Software Unavailability

 

-- PaulSzczypka - 14 Aug 2009

Revision 12009-08-14 - PaulSzczypka

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Contents:

Placeholder for the tWiki version of the shifter guide.

-- PaulSzczypka - 14 Aug 2009

 