Difference: UpdatedProductionShifterGuide (1 vs. 43)

Revision 43 (2012-11-12) - StefanRoiser

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"
Added:
>
>
A new page https://lhcb-shifters.web.cern.ch/ has been created with additional information for shifters; please also follow the instructions there. This twiki page is no longer maintained.
 

Grid Shifter Guide : Being updated autumn 2010

This topic is under development during Autumn 2010. It is experimental. Please contact Pete Clarke (clarke@cernNOSPAMPLEASE.ch) for complaints or suggestions.

Revision 42 (2012-02-13) - StefanRoiser

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 535 to 535
 

Backend Storage Systems

Changed:
<
<
Two backend storage technologies are employed at the Tier-1 sites, Castor and dCache. The Tier-1 sites which utilise each technology choice are summarised in the table below:
>
>
Three backend storage technologies are employed at the Tier-1 sites: Castor, dCache and StoRM. The Tier-1 sites which utilise each technology choice are summarised in the table below:
 
Backend Storage Tier-1 Site
Changed:
<
<
Castor CERN, CNAF, RAL
>
>
Castor CERN, RAL
 
dCache IN2P3, NIKHEF, GridKa, PIC
Added:
>
>
StoRM CNAF
 

File Transfer System, FTS

Revision 41 (2011-11-04) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 139 to 139
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Changed:
<
<
Reproc: S17   - - Req:4356,4357,4358,4359,4360,4363 link
Reproc: Reco12   12503,12504,12518,12519,12601,12628,12674 - - link
>
>
Reproc: Reco12 - after Aug TS   12908 - - link
Reproc: Reco12 - before Aug-TS   12503,504,518,519,601,628,674,701,714,727 - - link
 
Reco11a-S16 MD   12527 12528,12539,12542 12529-12538, 12540,12542,12543 link
Reco11a-S16 MU   12448 12449,12460,12463 12450-12459, 12461,12462,12464 link
Reco11a-S16 MD (Post Sept TS)   12362 12363,12374,12377 12364-12373, 12375,12376,12378 link

Revision 40 (2011-10-16) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 139 to 139
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Changed:
<
<
Reco12-S17 MU   12504,12519 not yet not yet link
Reco12-S17 MD   12503,12518 not yet not yet link
>
>
Reproc: S17   - - Req:4356,4357,4358,4359,4360,4363 link
Reproc: Reco12   12503,12504,12518,12519,12601,12628,12674 - - link
Reco11a-S16 MD   12527 12528,12539,12542 12529-12538, 12540,12542,12543 link
 
Reco11a-S16 MU   12448 12449,12460,12463 12450-12459, 12461,12462,12464 link
Reco11a-S16 MD (Post Sept TS)   12362 12363,12374,12377 12364-12373, 12375,12376,12378 link
Reco11a-S16 MU (Post Sept TS)   12051 12052,12063,12066 12053 - 12062,12064,12065,12067 link
Line: 188 to 189
 This information is provided by Philippe's Dashboard tables
Changed:
<
<
>
>
 

Revision 39 (2011-10-05) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 139 to 139
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Changed:
<
<
Reco12-S17 MU   12504 not yet not yet link
Reco12-S17 MD   12503 not yet not yet link
>
>
Reco12-S17 MU   12504,12519 not yet not yet link
Reco12-S17 MD   12503,12518 not yet not yet link
 
Reco11a-S16 MU   12448 12449,12460,12463 12450-12459, 12461,12462,12464 link
Reco11a-S16 MD (Post Sept TS)   12362 12363,12374,12377 12364-12373, 12375,12376,12378 link
Reco11a-S16 MU (Post Sept TS)   12051 12052,12063,12066 12053 - 12062,12064,12065,12067 link

Revision 38 (2011-10-03) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 139 to 139
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Changed:
<
<
Reco12-S17 MU   12503 not yet not yet link
>
>
Reco12-S17 MU   12504 not yet not yet link
Reco12-S17 MD   12503 not yet not yet link
 
Reco11a-S16 MU   12448 12449,12460,12463 12450-12459, 12461,12462,12464 link
Reco11a-S16 MD (Post Sept TS)   12362 12363,12374,12377 12364-12373, 12375,12376,12378 link
Reco11a-S16 MU (Post Sept TS)   12051 12052,12063,12066 12053 - 12062,12064,12065,12067 link

Revision 37 (2011-09-30) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 139 to 139
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Added:
>
>
Reco12-S17 MU   12503 not yet not yet link
Reco11a-S16 MU   12448 12449,12460,12463 12450-12459, 12461,12462,12464 link
 
Reco11a-S16 MD (Post Sept TS)   12362 12363,12374,12377 12364-12373, 12375,12376,12378 link
Reco11a-S16 MU (Post Sept TS)   12051 12052,12063,12066 12053 - 12062,12064,12065,12067 link
Reco11a-S16 MD   11891 11892,11903 11893-11902,11904,11905 link

Revision 36 (2011-09-23) - FedericoStagni

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 149 to 149
 Once you have checked that there are no gross problems with productions, you need to look at the next level down. A few problematic cases can hold up completion of the whole chain. This will typically show up by some of the productions getting stuck at 99.x% complete. This is more involved and is described in the "Compendium of Examples" below.
Added:
>
>

Hints on the productions monitor page

From the productions monitoring page (e.g. https://lhcb-web-dirac.cern.ch/DIRAC/LHCb-Production/lhcb_prod/jobs/ProductionMonitor/display) you can follow many types of "transformations", where "transformation" is a general name including "job productions" (MC, merge, stripping, reco, reprocessing) and "data movement" (replication, removal) activities. As production shifters, you are mostly interested in "job productions".

Reconstruction, reprocessing, stripping (and the related merging) productions are the most important because they process the real data. Here, the single most important view to look at is "file status": e.g., click on any Reconstruction "line" and you will get the menu where "file status" is. This gives you a summary (and, if you click further, the complete list) of the input files of that production. You get a file status for everything except MC.

Files can be in:

  • processed: files that are done. Jobs process files, and the job that processed this file finished successfully. Good, one down!
  • unused: files that are not currently being processed by any job.
  • assigned: files that are being processed right now by a job.

Basically, this is the cycle:

  1. A production is created with a BK query (it can be seen by clicking on "input data query", next to "file status") and 0 files in.
  2. After some minutes, the BK query is run; the files retrieved are assigned to the production and are all in "unused".
  3. Wait some more minutes and an agent will create "tasks", which will in the end become jobs. Every file assigned to a task is marked as "assigned".
    1. You'll see the column "created" of https://lhcb-web-dirac.cern.ch/DIRAC/LHCb-Production/lhcb_prod/jobs/ProductionMonitor/display being populated.
    2. Some tasks need more than one input file to be created. When this is not yet possible, the agent will just wait. This is the case, for example, for merging productions.
  4. If the jobs are successful, the files are marked as "processed"; otherwise they are treated by another agent that marks the files as "unused" again, and point 3 restarts.

There are some "special cases", none of which is a good sign:

  • MaxReset: the system has tried to process the file 10 times with 10 different jobs, and all of them failed.
  • missingLFC: the file can't be found in the LFC
  • applicationCrash: very rare, sometimes a DIRAC problem.

These last three cases should be reported. For the "MaxReset" case it would be good to find out which jobs tried to process those files and report the reason for their failures, but this is quite difficult without a deeper understanding of the system.

Remember: the transformation system handles tasks, the WMS handles jobs. A task lives in DIRAC only, a job runs on the grid. For transformations that are "job productions", each task will eventually become a job.
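
If you prefer to do this file-status check from a terminal, the short Python sketch below tallies the input files of one production by status using the DIRAC Python API. This is only an illustrative sketch: it assumes a configured DIRAC/LHCbDIRAC client environment with a valid proxy, and that the TransformationSystem client (TransformationClient.getTransformationFiles) is available under the module path shown; exact module paths and return structures may differ between DIRAC versions, and the production ID is just an example taken from the table above.

# Sketch (not an official tool): count the input files of one production
# by status (Processed / Unused / Assigned / MaxReset / ...).
from collections import Counter
from DIRAC.Core.Base import Script
Script.parseCommandLine()   # initialise DIRAC before using any client
from DIRAC.TransformationSystem.Client.TransformationClient import TransformationClient

PROD_ID = 12908   # example production ID, replace with the one you are checking
result = TransformationClient().getTransformationFiles({'TransformationID': PROD_ID})
if not result['OK']:
    raise SystemExit('Query failed: %s' % result['Message'])
counts = Counter(f['Status'] for f in result['Value'])
for status, n in counts.most_common():
    print('%-12s %d' % (status, n))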

 

Howto: Look at the processed data in the bookkeeping

Revision 35 (2011-09-19) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 139 to 139
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Added:
>
>
Reco11a-S16 MD (Post Sept TS)   12362 12363,12374,12377 12364-12373, 12375,12376,12378 link
 
Reco11a-S16 MU (Post Sept TS)   12051 12052,12063,12066 12053 - 12062,12064,12065,12067 link
Reco11a-S16 MD   11891 11892,11903 11893-11902,11904,11905 link
Reco11a-S16 MU   11878 11912,11923 11913-11922,11924,11925 link
Line: 151 to 152
 

Howto: Look at the processed data in the bookkeeping

Changed:
<
<
This information is provided by Philippes magic tables
>
>
This information is provided by Philippe's Dashboard tables
 

Revision 34 (2011-09-19) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 139 to 139
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Added:
>
>
Reco11a-S16 MU (Post Sept TS)   12051 12052,12063,12066 12053 - 12062,12064,12065,12067 link
 
Reco11a-S16 MD   11891 11892,11903 11893-11902,11904,11905 link
Reco11a-S16 MU   11878 11912,11923 11913-11922,11924,11925 link
Reco11 MD after July TS 11715 11716 11752,11755 11753,11754,11756-11765 link

Revision 33 (2011-08-23) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 76 to 76
 
  • General Job success rate for production jobs: The shifter should monitor the overall job success rate for all production jobs. How to do this.
Added:
>
>
* Progress of productions reaching the bookkeeping: The shifter should look at the tables made by Philippe. How to do this.
 
  • General job success rate for user jobs: The shifter should look at the progress of user jobs How to do this

  • Monte Carlo production: t.b.d
Line: 141 to 143
 
Reco11a-S16 MU   11878 11912,11923 11913-11922,11924,11925 link
Reco11 MD after July TS 11715 11716 11752,11755 11753,11754,11756-11765 link
Reco11 MU after July TS 11367 11368 11553,11564 11554-11563,11565,11566 link
Deleted:
<
<
CaloFemtoDST     11497,11384,11645 11498,11385,11646 link
1.38 tev 11567,11568 11720,11721 11746,11748 11747,11749 link
Reco10 MU after May TS [2nd tranche] 11066 11067 11094,11092 11093-11105 link
Reco10 MU after May TS [1st tranche] 10882 10883 10886,10884 10887-10897,10885 link
Reco10 MD after May TS [1st tranche]   10822 10823,10832 10824-10833,10927,10928 link
Reco10 MU+MD before May TS   10691,10685 10731,10719,10747,10749 R3438,3437,3441,3442 link
  Once you have checked that there are no gross problems with productions, you need to look at the next level down. A few problematic cases can hold up completion of the whole chain. This will typically show up by some of the productions getting stuck at 99.x% complete. This is more involved and is described in the "Compendium of Examples" below.
Added:
>
>

Howto: Look at the processed data in the bookkeeping

This information is provided by Philippe's magic tables

 

Howto: Monitor general job success rate

Revision 32 (2011-08-21) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 137 to 137
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Added:
>
>
Reco11a-S16 MD   11891 11892,11903 11893-11902,11904,11905 link
Reco11a-S16 MU   11878 11912,11923 11913-11922,11924,11925 link
 
Reco11 MD after July TS 11715 11716 11752,11755 11753,11754,11756-11765 link
Reco11 MU after July TS 11367 11368 11553,11564 11554-11563,11565,11566 link
CaloFemtoDST     11497,11384,11645 11498,11385,11646 link

Revision 31 (2011-08-03) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 172 to 172
  A set of monitoring plots has been assembled to help you try to diagnose problems.
Changed:
<
<
Firstly there are some overview plots.

>
>
  If there are lots of production jobs failing then you need to investigate. This may be a new problem associated with current processing of recent runs. Or it may be some manifestation of an old problem.
Changed:
<
<
See what site the problem is connected with ? Are the failures associated with an old current production ? Are the failures in reconstruction or merging ? Are the failures re-tries associated with some site which has a known problem ? If it looks like an old problem there will likely be a comment in the logbook. You can drill down with these links:
>
>
Which site is the problem connected with? Are the failures associated with an old or a current production? Are the failures in reconstruction or merging? Are the failures retries associated with some site which has a known problem? If it looks like an old problem there will likely be a comment in the logbook.
 At this point you may see that, for example, "site-XYZ" is failing lots of jobs in "Merging" with "InputDataResolutionErrors". You probably want to identify which productions/runs these are associated with. Go back to the current productions to do this. It's not trivial from here, as you need to work out which production the failures are associated with. On the main productions monitor page you can look in the "failed jobs" column, and that might give you a clue. Once you have identified the production you can look at the "run status" and also "show jobs" in the pop-out menu and try to correlate them with site-XYZ.
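
As a complement to the web drill-down, here is a minimal Python sketch of the same search done with the DIRAC API. Everything in it is an assumption to verify against your DIRAC version: the Dirac().selectJobs and getJobAttributes calls (the same machinery behind dirac-wms-select-jobs), the exact minor-status string, the hypothetical site name, and the convention that production jobs carry the zero-padded production ID as their JobGroup.

# Sketch (assumptions noted above): list failed jobs at one site with a
# given minor status, then count them per JobGroup to spot the production.
from collections import Counter
from DIRAC.Core.Base import Script
Script.parseCommandLine()
from DIRAC.Interfaces.API.Dirac import Dirac

dirac = Dirac()
result = dirac.selectJobs(status='Failed',
                          minorStatus='Input Data Resolution',  # copy the exact string from the monitor
                          site='LCG.SiteXYZ.xy',                 # hypothetical site name
                          date='2011-10-01')                     # only jobs since this date
if not result['OK']:
    raise SystemExit('Query failed: %s' % result['Message'])

groups = Counter()
for job_id in result['Value']:
    attrs = dirac.getJobAttributes(job_id)
    if attrs['OK']:
        # For production jobs the JobGroup is normally the zero-padded
        # production ID (e.g. 00012908) -- an assumption worth checking.
        groups[attrs['Value'].get('JobGroup', 'unknown')] += 1
print(groups.most_common())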
Line: 223 to 220
  A set of monitoring plots has been assembled to help you try to diagnose problems.
Changed:
<
<
Firstly there are some overview plots and some diagnostic plots:
>
>
* Monitoring Plots : New Overview from Mark
 

Data Transfer Monitoring

Revision 30 (2011-07-29) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 137 to 137
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Changed:
<
<
Reco11 MD after July TS 11715 11716     link
>
>
Reco11 MD after July TS 11715 11716 11752,11755 11753,11754,11756-11765 link
 
Reco11 MU after July TS 11367 11368 11553,11564 11554-11563,11565,11566 link
CaloFemtoDST     11497,11384,11645 11498,11385,11646 link
Changed:
<
<
1.38 tev 11567,11568 11720,11721     link
>
>
1.38 tev 11567,11568 11720,11721 11746,11748 11747,11749 link
 
Reco10 MU after May TS [2nd tranche] 11066 11067 11094,11092 11093-11105 link
Reco10 MU after May TS [1st tranche] 10882 10883 10886,10884 10887-10897,10885 link
Reco10 MD after May TS [1st tranche]   10822 10823,10832 10824-10833,10927,10928 link

Revision 29 (2011-07-27) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 140 to 140
 
Reco11 MD after July TS 11715 11716     link
Reco11 MU after July TS 11367 11368 11553,11564 11554-11563,11565,11566 link
CaloFemtoDST     11497,11384,11645 11498,11385,11646 link
Added:
>
>
1.38 tev 11567,11568 11720,11721     link
 
Reco10 MU after May TS [2nd tranche] 11066 11067 11094,11092 11093-11105 link
Reco10 MU after May TS [1st tranche] 10882 10883 10886,10884 10887-10897,10885 link
Reco10 MD after May TS [1st tranche]   10822 10823,10832 10824-10833,10927,10928 link

Revision 28 (2011-07-26) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 137 to 137
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Added:
>
>
Reco11 MD after July TS 11715 11716     link
 
Reco11 MU after July TS 11367 11368 11553,11564 11554-11563,11565,11566 link
CaloFemtoDST     11497,11384,11645 11498,11385,11646 link
Reco10 MU after May TS [2nd tranche] 11066 11067 11094,11092 11093-11105 link

Revision 27 (2011-07-26) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 138 to 138
 
Activity Express Full Stripping Merging Link
Reco11 MU after July TS 11367 11368 11553,11564 11554-11563,11565,11566 link
Added:
>
>
CaloFemtoDST     11497,11384,11645 11498,11385,11646 link
 
Reco10 MU after May TS [2nd tranche] 11066 11067 11094,11092 11093-11105 link
Reco10 MU after May TS [1st tranche] 10882 10883 10886,10884 10887-10897,10885 link
Reco10 MD after May TS [1st tranche]   10822 10823,10832 10824-10833,10927,10928 link
Reco10 MU+MD before May TS   10691,10685 10731,10719,10747,10749 R3438,3437,3441,3442 link
Deleted:
<
<
CaloFemtoDST     11497,11384 11498,11385 link
  Once you have checked that there are no gross problems with productions, you need to look at the next level down. A few problematic cases can hold up completion of the whole chain. This will typically show up by some of the productions getting stuck at 99.x% complete. This is more involved and is described in the "Compendium of Examples" below.

Revision 26 (2011-07-20) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 137 to 137
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Changed:
<
<
Reco11 MU after July TS 11367 11368     link
>
>
Reco11 MU after July TS 11367 11368 11553,11564 11554-11563,11565,11566 link
 
Reco10 MU after May TS [2nd tranche] 11066 11067 11094,11092 11093-11105 link
Reco10 MU after May TS [1st tranche] 10882 10883 10886,10884 10887-10897,10885 link
Reco10 MD after May TS [1st tranche]   10822 10823,10832 10824-10833,10927,10928 link

Revision 25 (2011-07-19) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 137 to 137
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Changed:
<
<
Reco11 MagUp after July TS 11367 11368     link
Reco10 MagDown after May TS [1st tranche] 10882 10883 10886,10884 10887-10897,10885 link
Reco10 MagDown after May TS ? 10822 10823,10832 10824-10833,10927,10928 link
>
>
Reco11 MU after July TS 11367 11368     link
Reco10 MU after May TS [2nd tranche] 11066 11067 11094,11092 11093-11105 link
Reco10 MU after May TS [1st tranche] 10882 10883 10886,10884 10887-10897,10885 link
Reco10 MD after May TS [1st tranche]   10822 10823,10832 10824-10833,10927,10928 link
Reco10 MU+MD before May TS   10691,10685 10731,10719,10747,10749 R3438,3437,3441,3442 link
CaloFemtoDST     11497,11384 11498,11385 link
  Once you have checked that there are no gross problems with productions, you need to look at the next level down. A few problematic cases can hold up completion of the whole chain. This will typically show up by some of the productions getting stuck at 99.x% complete. This is more involved and is described in the "Compendium of Examples" below.

Revision 24 (2011-07-18) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 137 to 137
 The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Changed:
<
<
Reco11 on Data after July TS 11367 11368     link
>
>
Reco11 MagUp after July TS 11367 11368     link
Reco10 MagDown after May TS [1st tranche] 10882 10883 10886,10884 10887-10897,10885 link
Reco10 MagDown after May TS ? 10822 10823,10832 10824-10833,10927,10928 link
  Once you have checked that there are no gross problems with productions, you need to look at the next level down. A few problematic cases can hold up completion of the whole chain. This will typically show up by some of the productions getting stuck at 99.x% complete. This is more involved and is described in the "Compendium of Examples" below.

Revision 23 (2011-07-18) - MarkWSlater

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 134 to 134
 This link takes you to the book-keeping where (after many clicks) you can see if DSTs are there.
Added:
>
>
The following table shows the current active production 'Activities' with links that will show each:

Activity Express Full Stripping Merging Link
Reco11 on Data after July TS 11367 11368     link
 Once you have checked that there are no gross problems with productions, you need to look at the next level down. A few problematic cases can hold up completion of the whole chain. This will typically show up by some of the productions getting stuck at 99.x% complete. This is more involved and is described in the "Compendium of Examples" below.

Revision 22 (2011-05-27) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 54 to 54
 
    • Status and progress of problems you inherited at the beginning of shift (i.e. resolution, or still ongoing)
    • Summary of any new problems in your shift (there will likely be separate Elog entries for the details)
    • ..anything else relevant...
Added:
>
>
    • It is important that you only tick the shift report box. This will mean that the report automatically goes to an LHCb summary page
 
  • Return the key to the operations room to the secretariat if appropriate.

Revision 21 (2011-02-02) - RobCurrie

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 127 to 127
 This link picks out the active productions from the production monitoring page:
Changed:
<
<
This link is supposed to pick out the active reconstruction and merging productions but it doesn't work - a beer to anyone who can fix it.
>
>
This link is supposed to pick out the active reconstruction and merging productions Rob fixed it, Pete owes him one beer.
  This link takes you to the book-keeping where (after many clicks) you can see if DSTs are there.

Revision 20 (2011-02-01) - RobCurrie

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 128 to 128
 

This link is supposed to pick out the active reconstruction and merging productions but it doesn't work - a beer to anyone who can fix it.

Changed:
<
<
  • [[https://lhcb-web-dirac.cern.ch/DIRAC/LHCb-Production/lhcb_prod/jobs/ProductionMonitor/display?prodStatus=Active&productionType=DataReconstruction:::%20productionType=Merge]
>
>
  This link takes you to the book-keeping where (after many clicks) you can see if DSTs are there.

Revision 19 (2011-01-06) - NickBrook

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 487 to 487
 
Castor CERN, CNAF, RAL
dCache IN2P3, NIKHEF, GridKa, PIC
Added:
>
>

File Transfer System, FTS

Many of the LHCb data transfers are done under the auspices of the central FTS service provided at CERN and the other T1 centres. It is possible to monitor what is going on with these transfers via the following links at the Tier-1 sites.

 

DIRAC Scripts

DIRAC Scripts

Revision 18 (2010-12-05) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 46 to 46
 
  • Open your favourite links Many shifters will have their own "favourite" set of links they open at the start of a shift. Most of those you might want will be linked below. You will develop your own over time.

Deleted:
<
<
  • Subscribe to the following mailing lists [Is this still a reccommendation ???]: Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.
 At the end of a shift you should

  • Enter a short summary in the ELOG. This is important as it provides an interface to the next shifter, and allows others to get a summary. This might contain:

Revision 17 (2010-11-20) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 84 to 84
  * Monte Carlo production: t.b.d
Changed:
<
<
  • Data transfer success rate: t.b.d
>
>
 
Line: 165 to 165
 A set of monitoring plots has been assembled to help you try to diagnose problems.

Firstly there are some overview plots.

Changed:
<
<
>
>

  If there are lots of production jobs failing then you need to investigate. This may be a new problem associated with current processing of recent runs. Or it may be some manifestation of an old problem. See what site the problem is connected with ? Are the failures associated with an old current production ? Are the failures in reconstruction or merging ? Are the failures re-tries associated with some site which has a known problem ? If it looks like an old problem there will likely be a comment in the logbook. You can drill down with these links:
Changed:
<
<
>
>
  At this point you may see that, for example, "site-XYZ" is failing lots of jobs in "Merging" with "InputDataResolutionErrors". You probably want to identify which productions/runs these are associated with. Go back to Current production productions to do this. Its not trivial from here as you need to try to identify which production the failures are associated with. On the main productions monitor page you can look in the "failed jobs" column and that might give you a clue. Once you have identified the production can look at the "run status" and also "show jobs" in the pop out menu and try to correlate then with the site-XYZ
Line: 215 to 216
 A set of monitoring plots has been assembled to help you try to diagnose problems.

Firstly there are some overview plots and some diagnostic plots:

Changed:
<
<
>
>

Data Transfer Monitoring

This on needs writing by someone who knows what to describe. For now here are some plots

 

Line: 246 to 255
 

The general monitoring plots may have already alerted you to failing jobs at a particular site. We have also provided a set of plots centred on each site with a bit more information.

Changed:
<
<
>
>
 

Revision 16 (2010-11-16) - Graciani

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 427 to 427
 
Changed:
<
<
>
>
 

Revision 15 (2010-11-14) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 82 to 82
 
  • General job success rate for user jobs: The shifter should look at the progress of user jobs How to do this
Changed:
<
<

  • Monte Carlo production: t.b.d
>
>
* Monte Carlo production: t.b.d
 
  • Data transfer success rate: t.b.d
Added:
>
>
 
  • Site Centric View: The shifter should take a site centric view of job success and SAM test results How to do this

  • Daily Operations Meeting: Attend this at 11.15 CET.
Line: 222 to 222
 

Space Token Monitoring

Changed:
<
<
t.b.d.
>
>
Whilst file transfer is a primary shifter function, overall space usage is not, as the production and data managers should be watching this. However, it does not hurt for the shifter to be aware of the situation, particularly if it is leading to upload failures. It never hurts to mention obvious space problems at the daily ops meeting.
 
Changed:
<
<
Useful link which shows used and free space:
>
>
This is a good link to be aware of - it shows used and free space at all sites:
 

Revision 14 (2010-11-14) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 4 to 4
  This topic is under development during Autumn 2010. It is experimental. Please contact Pete Clarke (clarke@cernNOSPAMPLEASE.ch) for complaints or suggestions.
Added:
>
>
 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Changed:
<
<
INTRODUCTION
>
>

Introduction

  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL
Deleted:
<
<

Introduction

 This document is for LHCb Grid computing shifters. It is organised as follows.

Line: 21 to 22
  *Preliminaries:* We provide links to information which might be useful, but which you probably only need to read once or twice until you are familiar with it. For this reason it comes at the end.
Deleted:
<
<
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
 
Deleted:
<
<
DOING SHIFTS
 
Deleted:
<
<
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL
 
Changed:
<
<

Doing Shifts

>
>
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

Doing Shifts

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL

  This section contains suggested activities for:
  • The start and end of shift
Line: 84 to 86
 
  • Monte Carlo production: t.b.d
Changed:
<
<
  • Data transfer success rate: t.b.d
>
>
  • Data transfer success rate: t.b.d
 
  • Site Centric View: The shifter should take a site centric view of job success and SAM test results How to do this
Line: 94 to 96
 
  • Escalate a problem: If a problem is discovered experienced shifters may recognise the context and either know how to fix it, or escalate it. New and inexperienced shifters may not easily be able to know the next step. The general procedure is Investigate : Go as far as you can using the monitoring plots and the examples below. Consult the GEOC : When you reach an impasse, or in doubt, please consult the GEOC first. Please resist the temptation to just interrupt the Production Manager (or anyone else) who may be sitting behind you. By escalating to the GEOC you will (i) be more likely to learn about a known problem and (ii) aid continuity of knowledge of the problem you have found. The GEOC will escalate to the Production Manager or others as necessary.
Added:
>
>
 

How to: See what has happened in last few days

Line: 139 to 143
 

Howto: Monitor general job success rate

Added:
>
>
Jobs may fail for many reasons.
  • Staging.
  • Stalled Jobs.
  • Segmentation faults.
  • DB access.
  • Software problems.
  • Data access.
  • Shared area access.
  • Site downtime.
  • Problematic files.
  • Excessive runtime.
 The shifter should monitor the overall job success rate for all production jobs. If jobs start failing at a single site, and it is not a "known problem" then it may be that a new problem has arisen at that site. If jobs start failing at all sites then it is more likely to be a production or application misconfiguration.
Changed:
<
<
Jobs may fail for many reasons. The problem for the shifter is that some of them are important and some not and some will already be "well known" and some will be new.
>
>
The problem for the shifter is that some of the failures are important and some are not, and some will already be "well known" while some will be new.
 
  • A common job failure minor status is "Input Resolution Errors". This can be transitory when a new production is launched and some disk servers get "overloaded". However these jobs are resubmitted automatically, and if they then run there is no problem.
  • Jobs which retry too many times will time out. These show up with a final minor status of "Watchdog Identified Jobs as Stalled". These are cause for concern.
  • Jobs have been seen to fail at a site which is known to have its data servers down, but still gets requests for data. This is quite hard to judge: the shifter will observe lots of failed jobs continuing over days, but the problem may well have been reported long ago and remedial work is underway.
Line: 164 to 180
  Don't be afraid to ask the GEOC if in doubt.
Added:
>
>
Using the CLI to look at failed jobs

Using the CLI, the command:

dirac-production-progress [<Production ID>]
entered without any arguments will return a breakdown of the jobs of all current productions. Entering one or more ProdIDs returns only the breakdown of those productions.

A more detailed breakdown is provided by:

dirac-production-job-summary <Production ID> [<DIRAC Status>]
which also includes the minor status of each job category and provides an example JobID for each category. The example JobIDs can then be used to investigate the failures further.

Beware of failed jobs which have been killed - when a production is complete, the remaining jobs may be automatically killed by DIRAC. Killed jobs like this are ok.
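
As a small illustration of how these two commands might be chained during a shift, here is a Python sketch that prints the progress and the failed-job breakdown for a few productions. The command names are the ones documented above; the wrapper itself and the production IDs are purely illustrative, and it assumes the LHCb/DIRAC environment has been set up so that the dirac-* scripts are on the PATH with a valid proxy.

# Sketch: loop over a few productions and show their progress plus the
# breakdown of Failed jobs (with example JobIDs to drill into further).
import subprocess

PROD_IDS = ['12908', '12527']   # illustrative production IDs

for prod_id in PROD_IDS:
    print('=== Production %s: overall progress ===' % prod_id)
    subprocess.call(['dirac-production-progress', prod_id])
    print('=== Production %s: failed jobs ===' % prod_id)
    subprocess.call(['dirac-production-job-summary', prod_id, 'Failed'])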

Non-Progressing Jobs

In addition to failed jobs, jobs which do not progress should also be monitored. Particular attention should be paid to jobs in the states ``Waiting'' and ``Staging''. Problematic jobs at this stage are easily overlooked since the associated problems are not easily identifiable.

Non-Starting Jobs

Jobs arriving at a site but then failing to start have multiple causes. One of the most common reasons is that a site is due to enter scheduled downtime and is no longer submitting jobs to the batch queues. Jobs will stay at the site in a ``Waiting'' state, reporting that there are no CEs available. Multiple jobs in this state should be reported.

 

Line: 259 to 300
 Figure 9: Example bug report.
Changed:
<
<
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

COMPENDIUM OF EXAMPLES

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL

Compendium of Example Procedures To Address Problems.

It is hoped that shifters and other experts will contribute to this section and build it up

To make it easy it is arranged so that you add a separate "topic" for each example problem you contribute. You do not need to add text in the main body of this document.

How to add a new example:

Edit this page and add a bullet under the example list below. Add only a few few words to briefly describe the example you are adding. Keep it short - leave the details out . You finish this bullet with the following: "Go to ShifterGuideExamplexxxxxxxx" (where you replace xxxxxxx with your own topic title). You then exit and save.

You can now click on your topic title which will be highlighted in red. This will take to you a fresh area where you can write what you like.

Example List

TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

PRELIMINARIES

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL

Preliminaries

A Grid certificate is mandatory for Grid Shifters. If you don't have a certificate you should register for one through CERN LCG and apply to join the LHCb Virtual Organisation (VO).

To access the production monitoring webpages you will also need to load your certificate into your browser. Detailed instructions on how to do this can be found on the CERN LCG pages.

The new shifter should:

=================================================

Background Information

Grid Sites

Jobs submitted to the Grid will be scheduled to run at one of a number of Grid sites. The exact site at which a job is executed depends on the job requirements and the current status of all relevant grid sites. Grid sites are grouped into two tiers, Tier-1 and Tier-2. CERN is an exception, because it is also responsible for processing and archiving the RAW experimental data it is also referred to as a Tier-0 site.

Tier-1 Sites

Tier-1 sites are used for Analysis, Monte Carlo production, file transfer and file storage in the LHCb Computing Model.

Tier-2 Sites

There are numerous Tier-2 sites with sites being added frequently. As such, it is of little worth presenting a list of all the current Tier-2 sites in this document. Tier-2 sites are used for MC production in the LHCb Computing Model.

Backend Storage Systems

Two backend storage technologies are employed at the Tier-1 sites, Castor and dCache. The Tier-1 sites which utilise each technology choice are summarised in the table below:

Backend Storage Tier-1 Site
Castor CERN, CNAF, RAL
dCache IN2P3, NIKHEF, GridKa, PIC

DIRAC Scripts

DIRAC Scripts

Accronyms

Acronyms

============================================

Taxonomy of Jobs

>
>

More Details on looking at jobs

 

JobIDs

Line: 456 to 402
  Figure 4: View the pilot output of a job via the Job Monitoring Webpage.
Deleted:
<
<

Operations on Jobs

 
Deleted:
<
<
The full list of scripts which can be used to perform operations on a job is given in DIRAC Scripts. The name of each script should be a clear indication of it's purpose. Running a script without arguments will print basic usage notes.
 
Deleted:
<
<

Monitoring a Production

 
Deleted:
<
<
Jobs in each production should be periodically checked for failed jobs (Sec. 4.2.1) and to ensure that jobs are progressing (Sec. 4.2.2).
 
Changed:
<
<
When monitoring a production, a Grid Shifter should be aware of a number of issues which can cause jobs to fail:
>
>
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
 
Changed:
<
<
  • Staging.
  • Stalled Jobs.
  • Segmentation faults.
  • DB access.
  • Software problems.
  • Data access.
  • Shared area access.
  • Site downtime.
  • Problematic files.
  • Excessive runtime.
>
>

Compendium of Example Procedures To Address Problems.

 
Added:
>
>
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL
 
Changed:
<
<

Failed Jobs

>
>
It is hoped that shifters and other experts will contribute to this section and build it up
 
Changed:
<
<
A Grid Shifter should monitor a production for failed jobs and jobs which are not progressing. Due to the various configurations of all the sites it is occasionally not possible for an email to be sent to the lhcb-datacrash mailing list for each failed job. It is therefore not enough to simply rely on the number of lhcb-datacrash emails to indicate if there are any problems with a production. In addition to any lhcb-datacrash notifications, the Grid Shifter should also check the number of failed jobs in a production via the CLI or the Production Monitoring Webpage.
>
>
To make it easy it is arranged so that you add a separate "topic" for each example problem you contribute. You do not need to add text in the main body of this document.
 
Changed:
<
<
Using the CLI, the command:
dirac-production-progress [<Production ID>]
entered without any arguments will return a breakdown of the jobs of all current productions. Entering one or more ProdIDs returns only the breakdown of those productions.
>
>
Edit this page and add a bullet under the example list below. Add only a few words to briefly describe the example you are adding. Keep it short - leave the details out. You finish this bullet with the following: "Go to ShifterGuideExamplexxxxxxxx" (where you replace xxxxxxx with your own topic title). You then exit and save. You can now click on your topic title, which will be highlighted in red. This will take you to a fresh area where you can write what you like.
 
Changed:
<
<
A more detailed breakdown is provided by:
dirac-production-job-summary <Production ID> [<DIRAC Status>]
which also includes the minor status of each job category and provides an example JobID for each category. The example JobIDs can then be used to investigate the failures further.
>
>
Example List
 
Changed:
<
<
Beware of failed jobs which have been killed - when a production is complete, the remaining jobs may be automatically killed by DIRAC. Killed jobs like this are ok.
>
>
 
Changed:
<
<

Non-Progressing Jobs

>
>
 
Deleted:
<
<
In addition to failed jobs, jobs which do not progress should also be monitored. Particular attention should be paid to jobs in the states ``Waiting'' and ``Staging''. Problematic jobs at this stage are easily overlooked since the associated problems are not easily identifiable.
 
Deleted:
<
<

Non-Starting Jobs

 
Deleted:
<
<
Jobs arriving at a site but then failing to start have multiple causes. One of the most common reasons is that a site is due to enter scheduled downtime and are no longer submitting jobs to the batch queues. Jobs will stay at the site in a ``Waiting'' state and state that there are no CE's available. Multiple jobs in this state should be reported.
 
Deleted:
<
<

Merging Productions

 
Deleted:
<
<
Each MC Production should have an associated Merging Production which merges the output files together into more manageable file sizes. Ensure that the number of files available to the Merging Production increases in proportion to the number of successful jobs of the MC Production. If the number of files does not increase, this can point to a problem in the Bookkeeping which should be reported.
 
Deleted:
<
<

Ending a Production

 
Changed:
<
<
Ending a completed production is handled by the Productions Operations Manager (or equivalent). No action is required on the part of the Grid Shifter.
>
>
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

Preliminary Things & Background Information

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL

 
Added:
>
>

Grid Certificates

 
Changed:
<
<

Web Production Monitor

>
>
A Grid certificate is mandatory for Grid Shifters. If you don't have a certificate you should register for one through CERN LCG and apply to join the LHCb Virtual Organisation (VO).

To access the production monitoring webpages you will also need to load your certificate into your browser. Detailed instructions on how to do this can be found on the CERN LCG pages.

The new shifter should:

 
Deleted:
<
<
Production monitoring via the web is possible through the Production Monitoring Webpage. A valid grid certificate loaded into your browser is required to use the webpage.
 
Changed:
<
<

Features

>
>

Grid Sites

 
Changed:
<
<
The Production Monitoring Webpage has the following features:
>
>
Jobs submitted to the Grid will be scheduled to run at one of a number of Grid sites. The exact site at which a job is executed depends on the job requirements and the current status of all relevant grid sites. Grid sites are grouped into two tiers, Tier-1 and Tier-2. CERN is an exception: because it is also responsible for processing and archiving the RAW experimental data, it is also referred to as a Tier-0 site.

Tier-1 Sites

Tier-1 sites are used for Analysis, Monte Carlo production, file transfer and file storage in the LHCb Computing Model.

Tier-2 Sites

There are numerous Tier-2 sites with sites being added frequently. As such, it is of little worth presenting a list of all the current Tier-2 sites in this document. Tier-2 sites are used for MC production in the LHCb Computing Model.

Backend Storage Systems

Two backend storage technologies are employed at the Tier-1 sites, Castor and dCache. The Tier-1 sites which utilise each technology choice are summarised in the table below:

Backend Storage Tier-1 Site
Castor CERN, CNAF, RAL
dCache IN2P3, NIKHEF, GridKa, PIC

DIRAC Scripts

DIRAC Scripts

Acronyms

Acronyms

 
Deleted:
<
<
 

Site Downtime Calendar

Line: 848 to 801
 -- PaulSzczypka - 14 Aug 2009

-- PeterClarke - 19-Oct-2010

Added:
>
>
META FILEATTACHMENT attachment="dirac-primary-states.png" attr="" comment="Dirac states diagram" date="1289737986" name="dirac-primary-states.png" path="dirac-primary-states.png" size="111571" stream="dirac-primary-states.png" tmpFilename="/usr/tmp/CGItemp3672" user="clarke" version="1"
META FILEATTACHMENT attachment="get_logfiles.png" attr="" comment="" date="1289738093" name="get_logfiles.png" path="get_logfiles.png" size="38751" stream="get_logfiles.png" tmpFilename="/usr/tmp/CGItemp3888" user="clarke" version="1"
META FILEATTACHMENT attachment="get_pilot_output.png" attr="" comment="" date="1289738112" name="get_pilot_output.png" path="get_pilot_output.png" size="43849" stream="get_pilot_output.png" tmpFilename="/usr/tmp/CGItemp3946" user="clarke" version="1"
META FILEATTACHMENT attachment="get_std_out.png" attr="" comment="" date="1289738130" name="get_std_out.png" path="get_std_out.png" size="38486" stream="get_std_out.png" tmpFilename="/usr/tmp/CGItemp3941" user="clarke" version="1"

Revision 13 (2010-11-14) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 133 to 134
  This link takes you to the book-keeping where (after many clicks) you can see if DSTs are there.
Changed:
<
<
Once you have checked that there are no gross problems with productions, you need to look at the next level down. A few problematic cases can hold up completion of the whole chain. This will typically show up by some of the productions getting stuck at 99.x% complete.

To understand why this is, you need to be aware that although many thousands of files are allocated to jobs, everything works on a run basis. Thus if a single file from any run is problematic then the reconstruction and/or merging of that run will not complete. If you see a production with 99.x% complete, then by clicking on it you can chose to view the "run status". If you go through this you might see a run which is not complete. You can then look at the jobs associated with this. Unfortunately this is a laborious process which requires developing some skill with practice.

could someone experienced please add better description here

>
>
Once you have checked that there are no gross problems with productions, you need to look at the next level down. A few problematic cases can hold up completion of the whole chain. This will typically show up by some of the productions getting stuck at 99.x% complete. This is more involved and is described in the "Compendium of Examples" below.
 

Howto: Monitor general job success rate

Line: 284 to 281
 

Example List

Changed:
<
<
>
>
 

Revision 12 (2010-11-14) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 42 to 42
  At the start of a shift.
Changed:
<
<
>
>
  • Open your favourite links Many shifters will have their own "favourite" set of links they open at the start of a shift. Most of those you might want will be linked below. You will develop your own over time.
 
Changed:
<
<
  • [Is this still a reccommendation ???] Subscribe to the following mailing lists. Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.
>
>
  • Subscribe to the following mailing lists [Is this still a recommendation ???]: Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.
 

At the end of a shift you should

Changed:
<
<
  • Enter a short summary in the ELOG. NEEDS TEMPLATE
>
>
  • Enter a short summary in the ELOG. This is important as it provides an interface to the next shifter, and allows others to get a summary. This might contain:
    • Your name
    • Status of any ongoing data reconstruction or reprocessing productions (including EXPRESS, FULL,...)
    • Status and progress of problems you inherited at the beginning of shift (i.e. resolution, or still ongoing)
    • Summary of any new problems in your shift (there will likely be separate Elog entries for the details)
    • ..anything else relevant...
 
  • Return the key to the operations room to the secretariat if appropriate.

and finally - please log out of the terminal stations in the operations room so that you don't block them for the next person

Revision 11 (2010-11-13) - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

This topic is under development during Autumn 2010. It is experimental. Please contact Pete Clarke (clarke@cernNOSPAMPLEASE.ch) for complaints or suggestions.

Changed:
<
<
======================================================
>
>
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
 
Added:
>
>
INTRODUCTION

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL

 

Introduction

This document is for LHCb Grid computing shifters. It is organised as follows.

Deleted:
<
<
Firstly we provide links to information which might be useful, but which you probably only need to read once or twice until you are familiar with it.
  • Preliminaries (things you need to do before starting to do these shifts)
  • Background Information (information which may be useful to understand the context better)
  • Taxonomy of Jobs (details the jobs types a Grid Shifter is expected to encounter and provides some debugging method)
 
Changed:
<
<
Doing Shifts: The main body of this document describes the principal activities of a production shifter under . This includes many links to web pages and to monitoring plots which may help you in the shift.
>
>
*Doing Shifts:* The main body of this document describes the principal activities of a production shifter under . This includes many links to web pages and to monitoring plots which may help you in the shift.

*Compendium of Examples:* We (try to) give a compendium of problems and processes you may encounter. This is written in a way to easily allow shifters and other experts to edit and add their own useful examples in a simple factorised way (i.e. you don't have to add text in the main body of this document).

*Preliminaries:* We provide links to information which might be useful, but which you probably only need to read once or twice until you are familiar with it. For this reason it comes at the end.

 
Changed:
<
<
Compendium of Examples: Finally we (try to) give a compendium of problems and processes you may encounter. This is written in a way to easily allow shifters and other experts to edit and add their own useful examples in a simple factorised way (i.e. you don't have to add text in the main body of this document).
>
>
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
 
Changed:
<
<
==========================================================
>
>
DOING SHIFTS

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL

 

Doing Shifts

Line: 37 to 42
  At the start of a shift.
Changed:
<
<
  • Open your favourite links Many shifters will have their own "favourite" set of links they open at the start of a shift. Here is a possible set to start from.
>
>
  • Open your favourite links Many shifters will have their own "favourite" set of links they open at the start of a shift. Here is a possible set to start from, but you will develop your own.
 
Line: 45 to 50
 
Changed:
<
<
  • Find out what has been happening in the last few days: If you havn't been on shift for a while its probably a good idea to get a quick picture of the fills and associated data taking runs of the last few days. This will help understand what to expect in the reconstructions productions. How to do this

  • Find out what the current reconstruction productions are if you are not already familiar with them. To do this look at production request page in the Dirac portal. If you look down it you will set those that are associated with Reconstruction and are Active. You are typically looking for those which are Recox-Strippingy or some variation of this (there may be validation versions which are under test which you can ignore). How to do this

  • Read the ELOG for the last few days. You should read the end of shift report from the last shifter. You will also be able to pick up threads pertaining to current issues.

  • Subscribe to the following mailing lists. Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.
>
>
  • [Is this still a recommendation ???] Subscribe to the following mailing lists. Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.
 
Line: 68 to 66
  The principle things you need to keep an eye on during the shift are listed below. In many cases the "activities" are not orthogonal, i.e. may be different ways to view the same thing.
Added:
>
>
  • Find out what has been happening in the last few days: If you haven't been on shift for a while it's probably a good idea to get a quick picture of the fills and associated data taking runs of the last few days. This will help understand what to expect in the reconstruction productions. How to do this

  • Find out what the current reconstruction productions are if you are not already familiar with them. To do this look at the production request page in the Dirac portal. If you look down it you will see those that are associated with Reconstruction and are Active (if in doubt it might help to filter on these using the filter arrow at top left). You are typically looking for those which are Recox-Strippingy or some variation of this (there may be validation versions which are under test which you can ignore). How to do this

  • Read the ELOG for the last few days. You should read the end of shift report from the last shifter. You will also be able to pick up threads pertaining to current issues.
 
  • Live Data flow: This means keeping abreast of the data flow from the pit to it being picked up by the current reconstruction production How to do this.

  • Current Reconstructions: Ensure the current reconstruction productions are progressing How to do this.
Line: 82 to 86
 
  • Data transfer success rate: t.b.d
Deleted:
<
<
 
  • Site Centric View: The shifter should take a site centric view of job success and SAM test results How to do this
Changed:
<
<
  • Attend the daily operations meeting at 11.15 CET.
>
>
  • Daily Operations Meeting: Attend this at 11.15 CET.
 
  • Elog entries: Make entries when there are new problems/observations or when there are developments to an existing problem. Making Elog entires

  • Escalate a problem: If a problem is discovered experienced shifters may recognise the context and either know how to fix it, or escalate it. New and inexperienced shifters may not easily be able to know the next step. The general procedure is Investigate : Go as far as you can using the monitoring plots and the examples below. Consult the GEOC : When you reach an impasse, or in doubt, please consult the GEOC first. Please resist the temptation to just interrupt the Production Manager (or anyone else) who may be sitting behind you. By escalating to the GEOC you will (i) be more likely to learn about a known problem and (ii) aid continuity of knowledge of the problem you have found. The GEOC will escalate to the Production Manager or others as necessary.
Deleted:
<
<
  • Shift Report: At the end of a shift you must submit a shift report to the Operations ELOG.
 

How to: See what has happened in last few days

Changed:
<
<
Look at the RUNDB to see fills in last days. It might be helpful to note down which runs start and end each recent fill and the luminosity. This may be useful when using the DIRAC portal later. Check that all runs destined for offline are in the BKK.
>
>
Look at the RUNDB to see fills in the last few days. By clicking on the fills you will get a list of runs associated with each. It might be helpful to note down which runs start and end each recent fill, and the luminosity. This may be useful when using the DIRAC portal later. Check that all runs destined for offline are in the BKK.
 
Line: 104 to 105
 

Howto: Look at live data flow

Changed:
<
<
This means keeping abreast of the data flow from the pit to it being picked up by the current reconstruction production. Each live data taking run results in RAW DATA files produced in the pit being transferred OFFLINE and into the book-keeping (BKK). Once in the BKK these files are automatically found by the current reconstruction production and processed This should rarely fail, but as a first task the shifter can look and verify the integrity of the chain. The questions to be answered are:
  • Are we taking data now ?
  • Is the RAW data for each run flowing from the pit and getting into the BKK ?
>
>
This means keeping abreast of the data flow from the pit to it being picked up by the current reconstruction production. Each live data taking run results in RAW DATA files produced in the pit being transferred OFFLINE and into the book-keeping (BKK). Once in the BKK these files are automatically found by the current reconstruction production and processed. This should rarely fail, but as a first task the shifter can look and verify the integrity of the chain.
 
Changed:
<
<
These links help you answer these questions. When we are in proper data taking you will see the LHCb online page showing COLLISION10 with data destination OFFLINE. If the destination is LOCAL then you can ignore it.
>
>
When we are in proper data taking you will see the LHCb online page showing COLLISION10 with data destination OFFLINE. If the destination is LOCAL then you can ignore it.
 
Line: 118 to 117
 

Howto: Look at the current reconstruction production.

Changed:
<
<
In general, if we are data taking and there is no anomalous situation, the latest data reconstruction production will be running. You will know what the current active data reconstructions are from the earlier step at the start of shift. These have 3 visible steps (i) reconstruction and stripping (ii) merging (iii) replication to sites. The shifter must keep an eye on these to ensure they are progressing, and they there are not an unexpected number of failing jobs which are the result of a recent (as yet unknown) problem.
>
>
In general, if we are data taking, or are in a re-processing period, the latest data reconstruction production will be running. You will know what the current active data reconstructions are from the earlier step at the start of shift. These have 3 visible steps: (i) reconstruction and stripping, (ii) merging, (iii) replication to sites. The shifter must keep an eye on these to ensure they are progressing and that there is not an unexpected number of failing jobs resulting from a recent (as yet unknown) problem.
 Questions to be answered now are:
  • Is the RAW data for each run being picked up by the current data reconstruction production ?
  • Is the merging going properly ? When each run is 100% reconstructed and stripped, the stripped data should be picked up by the merging production to produce DSTs. There is some delay here: typically merging may not yet be running on the latest runs from the very latest fill.
  • When merged the DST data should appear in the BKK.
Changed:
<
<
These links pick out the active data reconstruction and merging productions from the production monitoring page:
>
>
This link picks out the active productions from the production monitoring page:
 
Changed:
<
<
>
>
This link is supposed to pick out the active reconstruction and merging productions but it doesn't work - a beer to anyone who can fix it.
  • [[https://lhcb-web-dirac.cern.ch/DIRAC/LHCb-Production/lhcb_prod/jobs/ProductionMonitor/display?prodStatus=Active&productionType=DataReconstruction:::%20productionType=Merge]
  This link takes you to the book-keeping where (after many clicks) you can see if DSTs are there.
Changed:
<
<
A systematic procedure for DataReconstruction productions is something like this:
  • Look at the...
  • t.b.c.
>
>
Once you have checked that there are no gross problems with productions, you need to look at the next level down. A few problematic cases can hold up completion of the whole chain. This will typically show up as some of the productions getting stuck at 99.x% complete.
 
Added:
>
>
To understand why this is, you need to be aware that although many thousands of files are allocated to jobs, everything works on a run basis. Thus if a single file from any run is problematic then the reconstruction and/or merging of that run will not complete. If you see a production at 99.x% complete, then by clicking on it you can choose to view the "run status". If you go through this you might see a run which is not complete. You can then look at the jobs associated with it. Unfortunately this is a laborious process which requires developing some skill with practice.
 
Added:
>
>
could someone experienced please add better description here
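In the meantime, purely as an illustration (and not an agreed procedure), the same drill-down can be attempted from the command line with the DIRAC python API. The sketch below assumes a standard DIRAC client environment on lxplus with a valid LHCb grid proxy; the production number 00012345 is only a placeholder and the method names (selectJobs, status) should be checked against the installed DIRAC version.

# Minimal sketch: list the Failed/Stalled jobs of one production with their
# site and minor status, to help spot the few jobs holding a run at 99.x%.
# All identifiers here are illustrative assumptions, not an official recipe.
from DIRAC.Core.Base import Script
Script.parseCommandLine()                      # initialise the DIRAC configuration
from DIRAC.Interfaces.API.Dirac import Dirac

dirac = Dirac()
prod_id = '00012345'                           # placeholder production ID

for state in ('Failed', 'Stalled'):
    sel = dirac.selectJobs(jobGroup=prod_id, status=state)
    if not sel['OK'] or not sel['Value']:
        continue
    res = dirac.status(sel['Value'])           # Status / MinorStatus / Site per job
    if res['OK']:
        for jid, info in sorted(res['Value'].items()):
            print jid, state, info.get('Site'), info.get('MinorStatus')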
 

Howto: Monitor general job success rate

Line: 237 to 233
  If Elog is down, send a notification email to lhcb-production@cernNOSPAMPLEASE.ch.
Changed:
<
<
============================================================

Compendium of Example Procedures To Address Problems.

It is hoped that shifters and other experts will contribute to this section and build it up

To make it easy it is arranged so that you add a separate "topic" for each example problem you contribute. You do not need to add text in the main body of this document.

How to add a new example:

Edit this page and add a bullet under the example list below. Add only a few words to briefly describe the example you are adding. Keep it short - leave the details out. You finish this bullet with the following: "Go to ShifterGuideExamplexxxxxxxx" (where you replace xxxxxxx with your own topic title). You then exit and save.

You can now click on your topic title which will be highlighted in red. This will take you to a fresh area where you can write what you like.

Example List

>
>
 

Bug Reporting

Changed:
<
<
Before submitting a bug report, the user should:
>
>
You may reach a point where you should submit a bug report. Before submitting a bug report:
 
  • Identify conditions under which the bug occurs.
  • Record all relevant information.
  • Try to ensure that the bug is reproducible.
Changed:
<
<
Once the user is convinced that the behaviour they are experiencing is a bug, they should then prepare to submit a bug report. Users should:
>
>
Once the shifter is convinced that the behaviour they are experiencing is a bug, they should then prepare to submit a bug report. They should:
 
Line: 290 to 262
 Figure 9: Example bug report.
Added:
>
>
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

COMPENDIUM OF EXAMPLES

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL

Compendium of Example Procedures To Address Problems.

It is hoped that shifters and other experts will contribute to this section and build it up

To make it easy it is arranged so that you add a separate "topic" for each example problem you contribute. You do not need to add text in the main body of this document.

How to add a new example:

Edit this page and add a bullet under the example list below. Add only a few words to briefly describe the example you are adding. Keep it short - leave the details out. You finish this bullet with the following: "Go to ShifterGuideExamplexxxxxxxx" (where you replace xxxxxxx with your own topic title). You then exit and save.

You can now click on your topic title which will be highlighted in red. This will take you to a fresh area where you can write what you like.

Example List

TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

 
Added:
>
>
PRELIMINARIES
 
Changed:
<
<
=============================================================
>
>
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL
 

Preliminaries

Revision 10 2010-11-04 - PaulSzczypka

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 91 to 91
 
  • Escalate a problem: If a problem is discovered, experienced shifters may recognise the context and either know how to fix it, or escalate it. New and inexperienced shifters may not easily know the next step. The general procedure is: Investigate: go as far as you can using the monitoring plots and the examples below. Consult the GEOC: when you reach an impasse, or in doubt, please consult the GEOC first. Please resist the temptation to just interrupt the Production Manager (or anyone else) who may be sitting behind you. By escalating to the GEOC you will (i) be more likely to learn about a known problem and (ii) aid continuity of knowledge of the problem you have found. The GEOC will escalate to the Production Manager or others as necessary.
Added:
>
>
  • Shift Report: At the end of a shift you must submit a shift report to the Operations ELOG.
 

How to: See what has happened in last few days

Line: 98 to 99
 Look at the RUNDB to see fills in last days. It might be helpful to note down which runs start and end each recent fill and the luminosity. This may be useful when using the DIRAC portal later. Check that all runs destined for offline are in the BKK.
Added:
>
>
  • Check the ELOG!
 

Howto: Look at live data flow

Line: 194 to 196
 

Howto: Monitor Monte Carlo Production

Changed:
<
<
t.b.d.
>
>
  • Identify the Productions of interest
  • Use either the Production Monitoring webpage or the Job Monitoring webpage to check the job progress.
  • Check job failures from the productions of interest to ensure they are random/site-specific rather than a problem with the production itself (a sketch of one way to check this follows below).
  • ...
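For the third point above, here is a purely illustrative sketch (not an official procedure) that counts the Failed jobs of one MC production per site using the DIRAC python API. It assumes a standard DIRAC client environment with a valid lhcb proxy; the production number 00054321 is a placeholder and the method names should be verified against the installed DIRAC version. Failures concentrated at a single site usually point to a site problem, while failures spread over all sites point to the production itself.

# Minimal sketch: count Failed jobs of an MC production per site, to judge
# whether the failures look site-specific or production-wide.
# All identifiers here are illustrative assumptions.
from collections import defaultdict
from DIRAC.Core.Base import Script
Script.parseCommandLine()
from DIRAC.Interfaces.API.Dirac import Dirac

dirac = Dirac()
sel = dirac.selectJobs(jobGroup='00054321', status='Failed')   # placeholder ProdID
failed_ids = sel['Value'] if sel['OK'] else []

per_site = defaultdict(int)
if failed_ids:
    res = dirac.status(failed_ids)
    if res['OK']:
        for info in res['Value'].values():
            per_site[info.get('Site', 'Unknown')] += 1

for site, count in sorted(per_site.items(), key=lambda item: -item[1]):
    print site, count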
 

Howto: Site Centric View

Revision 9 2010-10-25 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 128 to 128
 
Changed:
<
<
>
>
  This link takes you to the book-keeping where (after many clicks)you can see if DSTs are there.

Revision 8 2010-10-24 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

This topic is under development during Autumn 2010. It is experimental. Please contact Pete Clarke (clarke@cernNOSPAMPLEASE.ch) for complaints or suggestions.

Changed:
<
<
#!comment ======================================================
>
>
======================================================
 

Introduction

Line: 43 to 43
 
Added:
>
>
 

  • Find out what has been happening in the last few days: If you haven't been on shift for a while it's probably a good idea to get a quick picture of the fills and associated data taking runs of the last few days. This will help you understand what to expect in the reconstruction productions. How to do this
Line: 82 to 83
 
  • Data transfer success rate: t.b.d
Changed:
<
<
  • Site Centric View: The shifter should take a site centric view to look for problems How to do this
>
>
  • Site Centric View: The shifter should take a site centric view of job success and SAM test results How to do this
 
  • Attend the daily operations meeting at 11.15 CET.
Changed:
<
<
  • Make ELOG entries when there are new problems/observations or when there are developments to an existing problem. Making ELOG entries
>
>
  • Elog entries: Make entries when there are new problems/observations or when there are developments to an existing problem. Making Elog entries
 
  • Escalate a problem: If a problem is discovered, experienced shifters may recognise the context and either know how to fix it, or escalate it. New and inexperienced shifters may not easily know the next step. The general procedure is: Investigate: go as far as you can using the monitoring plots and the examples below. Consult the GEOC: when you reach an impasse, or in doubt, please consult the GEOC first. Please resist the temptation to just interrupt the Production Manager (or anyone else) who may be sitting behind you. By escalating to the GEOC you will (i) be more likely to learn about a known problem and (ii) aid continuity of knowledge of the problem you have found. The GEOC will escalate to the Production Manager or others as necessary.
Line: 203 to 204
 
Added:
>
>
The Sam dashboard shows you the results of Sam availability tests (here is the SAM topic in full in case it's any further help).
 The general monitoring plots may have already alerted you to failing jobs at a particular site. We have also provided a set of plots centred on each site with a bit more information.

Changed:
<
<
#ELOG

Making ELOG Entries

>
>

Making Elog Entries

Here is the ELOG.

  All Grid Shifter actions of note should be recorded in the ELOG. This has the benefit of allowing new Grid Shifters to familiarise themselves with recent problems with current productions.
Changed:
<
<
ELOG entries should contain as much relevant information as possible.
>
>
Elog entries should contain as much relevant information as possible.
 
Changed:
<
<
A typical ELOG entry for a new problem contains:
>
>
A typical ELOG entry for a new problem contains some or all of:
 
  • The relevant ProdID or ProdIDs.
  • An example JobID.
  • A copy of the relevant error message and output.
Added:
>
>
  • The application in which the job failed
 
  • The number of affected jobs.
  • The Grid sites affected.
  • The time of the first and last occurrence of the problem.
Line: 222 to 229
 
  • The number of affected jobs.
  • The Grid sites affected.
  • The time of the first and last occurrence of the problem.
Deleted:
<
<
 Once a problem has been logged it is useful to report the continuing status of the affected productions at the end of each shift.
Changed:
<
<
If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG.

ELOG Problems

If ELOG is down, send a notification email to lhcb-production@cernNOSPAMPLEASE.ch.

When to Submit an ELOG

Submit an ELOG in the following situations:

  • Jobs finalise with exceptions.

Exceptions

Jobs which finalise with an exception should be noted in the ELOG. The ELOG entry should contain:

  • The production ID.
  • An example job ID.
  • A copy of the relevant error messages.
  • The number of jobs in the production which have the same status.

Crashed Application

Should submit example error log for the crashed application.

Datacrash Emails

The Grid Shifter should filter the datacrash emails and determine if the crash reported is actually due to one of the applications. If so, then the Grid Shifter should submit an ELOG describing the problem and including an example error message. The Grid Shifter should ensure the “Applications” radio button is selected when submitting the ELOG report since this means that the relevant experts will be alerted to the problem.

>
>
If Elog is down, send a notification email to lhcb-production@cernNOSPAMPLEASE.ch.
  ============================================================
Line: 264 to 239
 

Compendium of Example Procedures To Address Problems.

Changed:
<
<
* It is hoped that shifters and other experts will contribute to this section and build it up.*
>
>
It is hoped that shifters and other experts will contribute to this section and build it up

To make it easy it is arranged so that you add a separate "topic" for each example problem you contribute. You do not need to add text in the main body of this document.

How to add a new example:

 
Changed:
<
<
To make it easy it is arranged so that you add a separate "topic" for each example problem you contribute. You do not need to add text in the main body of this document. To add an example:
  • Simply edit this page and add a new short section as shown in the example below in ShifterGuideExampleTopicHeading.
  • Save and exit this page.
  • Then when you click on the highlighted new topic heading it will open up a new area to write your example.
>
>
Edit this page and add a bullet under the example list below. Add only a few words to briefly describe the example you are adding. Keep it short - leave the details out. You finish this bullet with the following: "Go to ShifterGuideExamplexxxxxxxx" (where you replace xxxxxxx with your own topic title). You then exit and save.
 
Changed:
<
<

Examples

>
>
You can now click on your topic title which will be highlighted in red. This will take you to a fresh area where you can write what you like.
 
Changed:
<
<

Example Topic Heading

Add a few words in here to briefly describe the example you are adding. Keep it short - leave the details out. You finish this short section with the following: Go to ShifterGuideExampleTopicHeading.
>
>

Example List

 
Deleted:
<
<
You then exit and save. You can now click on ShifterGuideExampleTopicHeading (capitals left out here to avoid it becoming another link) and start editing in a fresh area.
 
Deleted:
<
<

Software Unavailability

Go to SoftwareUnavailability
 

Bug Reporting

Revision 7 2010-10-23 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

This topic is under development during Autumn 2010. It is experimental. Please contact Pete Clarke (clarke@cernNOSPAMPLEASE.ch) for complaints or suggestions.

Changed:
<
<
>
>
#!comment ======================================================
 
Deleted:
<
<
//==========================================================
 

Introduction

This document is for LHCb Grid computing shifters. It is organised as follows.

Line: 18 to 16
 
  • Background Information (information which may be useful to understand the context better)
  • Taxonomy of Jobs (details the jobs types a Grid Shifter is expected to encounter and provides some debugging method)
Changed:
<
<
DoingShifts: The main body of this document describes the principal activities of a production shifter under . This includes many links to web pages and to monitoring plots which may help you in the shift.
>
>
Doing Shifts: The main body of this document describes the principal activities of a production shifter. This includes many links to web pages and to monitoring plots which may help you in the shift.
 
Changed:
<
<
Compendium of Examples: Finally we give a of problems and processes you may encounter. This is written in a way to easily allow shifters and other experts to edit and add their own useful examples in a simple factorised way (i.e. you don't have to add text in the main body of this document).
>
>
Compendium of Examples: Finally we (try to) give a compendium of problems and processes you may encounter. This is written in a way to easily allow shifters and other experts to edit and add their own useful examples in a simple factorised way (i.e. you don't have to add text in the main body of this document).
  ==========================================================
Line: 37 to 35
 

Start and End of Shift

Changed:
<
<
At the start of a shift you should
>
>
At the start of a shift.

 

  • Find out what has been happening in the last few days: If you haven't been on shift for a while it's probably a good idea to get a quick picture of the fills and associated data taking runs of the last few days. This will help you understand what to expect in the reconstruction productions. How to do this
Line: 60 to 65
 

Principal Activities During a Shift

Changed:
<
<
The principle activities of the shifter are listed below. In many cases the "activities" are not orthogonal, i.e. may be different ways to view the same thing.
>
>
The principal things you need to keep an eye on during the shift are listed below. In many cases the "activities" are not orthogonal, i.e. they may be different ways to view the same thing.
 
Changed:
<
<
Many shifters will have their own "favourite" set of links they open at the start of a shift. Here is a possible set to start from.
>
>
  • Live Data flow: This means keeping abreast of the data flow from the pit to it being picked up by the current reconstruction production How to do this.
 
Changed:
<
<
In many cases when using the DIRAC production monitor, it takes several clicks as well as setting up filters in order to get to the information you want. Many of the links below pre-code this for you so that you get directly to specific bits of information.
>
>
  • Current Reconstructions: Ensure the current reconstruction productions are progressing How to do this.
 
Changed:
<
<
  • Live Data flow: This means keeping abreast of the data flow from the pit to it being picked up by the current reconstruction production. Each live data taking run results in RAW DATA files produced in the pit being transferred OFFLINE and into the bookeeping (BKK). Once in the BKK these files are automatically found by the current reconstruction production and processed This should rarely fail, but as a first task the shifter can look and verify the integrity of the chain How to do this
>
>
  • General Job success rate for production jobs: The shifter should monitor the overall job success rate for all production jobs. How to do this.
 
Changed:
<
<
  • The current reconstruction productions: In general, if we are data taking and there is no anomalous situation, the latest data reconstruction production will be running. This has 3 visible steps (i) reconstruction and stripping (ii) merging (iii) replication to sites. The shifter must keep an eye on this to ensure it is progressing, and they there are not an unexpected number of failing jobs which are the result of a recent (as yet unknown) problem. How to do this
>
>
  • General job success rate for user jobs: The shifter should look at the progress of user jobs How to do this
 
Changed:
<
<
  • General Job success rate for production jobs: The shifter should monitor the overall job success rate for all production jobs. If jobs start failing at a single site, and it is not a "known problem" then it may be that a new problem has arisen at that site. If jobs start failing at all sites then it is more likely to be a production or application misconfiguration. A set of monitoring plots has been assembled to at least alert the shifter to a problem, and to start the diagnostic process. How to do this
>
>
 
Changed:
<
<
  • General job success rate for user jobs: t.b.d
>
>
  • Monte Carlo production: t.b.d
 
  • Data transfer success rate: t.b.d
Changed:
<
<
  • Site Integrity: The shifter should look at the sites
>
>
  • Site Centric View: The shifter should take a site centric view to look for problems How to do this
 
  • Attend the daily operations meeting at 11.15 CET.
Line: 100 to 101
 

Howto: Look at live data flow

Changed:
<
<
The questions to be answered are:
>
>
This means keeping abreast of the data flow from the pit to it being picked up by the current reconstruction production. Each live data taking run results in RAW DATA files produced in the pit being transferred OFFLINE and into the book-keeping (BKK). Once in the BKK these files are automatically found by the current reconstruction production and processed. This should rarely fail, but as a first task the shifter can look and verify the integrity of the chain. The questions to be answered are:
 
  • Are we taking data now ?
  • Is the RAW data for each run flowing from the pit and getting into the BKK ?

Changed:
<
<
These links help you answer these questions. When we are in proper data taking you will see the LHCb online page showing COLLISION10 with data destination OFFLINE
>
>
These links help you answer these questions. When we are in proper data taking you will see the LHCb online page showing COLLISION10 with data destination OFFLINE. If the destination is LOCAL then you can ignore it.
 
Line: 114 to 115
 

Howto: Look at the current reconstruction production.

Changed:
<
<
You will know what the current active data reconstructions are from the earlier step at the start of shift. Questions to be answered now are:
>
>
In general, if we are data taking and there is no anomalous situation, the latest data reconstruction production will be running. You will know what the current active data reconstructions are from the earlier step at the start of shift. These have 3 visible steps: (i) reconstruction and stripping, (ii) merging, (iii) replication to sites. The shifter must keep an eye on these to ensure they are progressing and that there is not an unexpected number of failing jobs resulting from a recent (as yet unknown) problem. Questions to be answered now are:
 
  • Is the RAW data for each run being picked up by the current data reconstruction production ?
  • Is the merging going properly ? When each run is 100% reconstructed and stripped, the stripped data should be picked up by the merging production to produce DSTs. There is some delay here: typically merging may not yet be running on the latest runs from the very latest fill.
  • When merged the DST data should appear in the BKK.

These links pick out the active data reconstruction and merging productions from the production monitoring page:

Changed:
<
<
>
>
  This link takes you to the book-keeping where (after many clicks)you can see if DSTs are there.
Added:
>
>
A systematic procedure for DataReconstruction productions is something like this:
  • Look at the...
  • t.b.c.
 

Howto: Monitor general job success rate

Added:
>
>
The shifter should monitor the overall job success rate for all production jobs. If jobs start failing at a single site, and it is not a "known problem" then it may be that a new problem has arisen at that site. If jobs start failing at all sites then it is more likely to be a production or application misconfiguration.
 Jobs may fail for many reasons. The problem for the shifter is that some of them are important and some not and some will already be "well known" and some will be new.
  • A common job failure minor status is "Input Resolution Errors". This can be transitory when a new production is launched and some disk servers get "overloaded". However these jobs are resubmitted automatically and if they then run there is no problem.
  • Jobs which retry too many times will time out. These show up with a final minor status of "Watchdog Identified Jobs as Stalled". These are cause for concern.
Line: 143 to 152
 A set of monitoring plots has been assembled to help you try to diagnose problems.

Firstly there are some overview plots.

Changed:
<
<
>
>
  If there are lots of production jobs failing then you need to investigate. This may be a new problem associated with current processing of recent runs. Or it may be some manifestation of an old problem. See what site the problem is connected with ? Are the failures associated with an old current production ? Are the failures in reconstruction or merging ? Are the failures re-tries associated with some site which has a known problem ? If it looks like an old problem there will likely be a comment in the logbook. You can drill down with these links:
Changed:
<
<
>
>
  At this point you may see that, for example, "site-XYZ" is failing lots of jobs in "Merging" with "InputDataResolutionErrors". You
Changed:
<
<
probably want to identify which jobs/runs these are associated with. Go back to Current production productions to do this. From here you can look at the "run status" and also "show jobs" in the pop out menu. You can search for failing jobs in both reconstruction or merging and see what sites they are associated with.
>
>
probably want to identify which productions/runs these are associated with. Go back to the current productions to do this. It's not trivial from here as you need to try to identify which production the failures are associated with. On the main productions monitor page you can look in the "failed jobs" column and that might give you a clue. Once you have identified the production you can look at the "run status" and also "show jobs" in the pop out menu and try to correlate them with site-XYZ.
  Once you have used these monitoring plots the next line of diagnosis depends upon the shifter experience. One has to look at the job log outputs and see if there is any information which helps diagnose the problem.
Added:
>
>
You can also try the site centric monitoring plots (see below)
 Don't be afraid to ask the GEOC if in doubt.
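For the log inspection step, a minimal sketch with the DIRAC python API is given below. It assumes a standard DIRAC client environment with a valid lhcb proxy, that the peek and getOutputSandbox methods exist in the installed DIRAC version, and that the JobID 12345678 is only a placeholder taken from the monitor.

# Minimal sketch: look at the std.out of one example failed job and download
# its output sandbox (log files) for local inspection.
# All identifiers here are illustrative assumptions.
from DIRAC.Core.Base import Script
Script.parseCommandLine()
from DIRAC.Interfaces.API.Dirac import Dirac

dirac = Dirac()
job_id = 12345678                 # placeholder JobID of a failing job

print dirac.peek(job_id)          # quick look at the job's std.out, if available
dirac.getOutputSandbox(job_id)    # fetches the sandbox into ./<job_id>/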
Added:
>
>

Howto: Monitor user job success rate

User jobs may fail because a user has submitted a job with an error (which is not an ops issue) or there may be a problem at a site (which you do need to care about). A set of monitoring plots has been assembled to at least alert the shifter to a problem, and to start the diagnostic process.

A set of monitoring plots has been assembled to help you try to diagnose problems.

Firstly there are some overview plots and some diagnostic plots:

Space Token Monitoring

t.b.d.

Useful link which shows used and free space:

Howto: Monitor Monte Carlo Production

t.b.d.

Howto: Site Centric View

Firstly look at the site summary page

The general monitoring plots may have already alerted you to failing jobs at a particular site. We have also provided a set of plots centred on each site with a bit more information.

 #ELOG

Making ELOG Entries

Revision 6 2010-10-23 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 39 to 39
  At the start of a shift you should
Deleted:
<
<
  • Read the ELOG for the last few days. You should read the end of shift report from the last shifter. You will also be able to pick up threads pertaining to current issues.

  • Find out what has been happening in the last few days: If you havn't been on shift for a while its probably a good idea to get a quick picture of the fills and associated data taking runs of the last few days. This will help understand what to expect in the reconstructions productions. How to do this
 
Changed:
<
<
  • Find out what the current reconstruction productions are if you are not already familiar with them. To do this look at this list: Production Requests. In it you will find many productions. If you look down it you will set those that are associated with Reconstruction and are Active. You are typically looking for those which are Recox-Strippingy or some variation of this (there may be validation versions which are under test which you can ignore)
>
>
  • Find out what has been happening in the last few days: If you haven't been on shift for a while it's probably a good idea to get a quick picture of the fills and associated data taking runs of the last few days. This will help you understand what to expect in the reconstruction productions. How to do this
 
Changed:
<
<
Active Reconstruction Requests
>
>
  • Find out what the current reconstruction productions are if you are not already familiar with them. To do this look at the production request page in the Dirac portal. If you look down it you will see those that are associated with Reconstruction and are Active. You are typically looking for those which are Recox-Strippingy or some variation of this (there may be validation versions which are under test which you can ignore). How to do this
 
Changed:
<
<
Active Reconstruction Requests
>
>
  • Read the ELOG for the last few days. You should read the end of shift report from the last shifter. You will also be able to pick up threads pertaining to current issues.
 
  • Subscribe to the following mailing lists. Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.
Line: 59 to 56
 
  • Enter a short summary in the ELOG. NEEDS TEMPLATE
  • Return the key to the operations room to the secretariat if appropriate.
Added:
>
>
and finally - please log out of the terminal stations in the operations room so that you don't block them for the next person
 

Principal Activities During a Shift

Changed:
<
<
The DIRAC Web portal is the workhorse of the shift and provides much and varied information. One can always start from here, and many shifters do just that. A possible set of web links to start from is
  • x
  • x
  • x
>
>
The principal activities of the shifter are listed below. In many cases the "activities" are not orthogonal, i.e. they may be different ways to view the same thing.
 
Changed:
<
<
In many cases when using the DIRAC production monitor, it takes several clicks as well as setting up filters in order to get to the information you want. For this reason many of the links below pre-code this for you so that you get directly to specific bits of information.
>
>
Many shifters will have their own "favourite" set of links they open at the start of a shift. Here is a possible set to start from.
 
Changed:
<
<
The principle activities of the shifter are listed below. In many cases the "activities" are not orthogonal, i.e. may be different ways to view the same thing.
>
>
In many cases when using the DIRAC production monitor, it takes several clicks as well as setting up filters in order to get to the information you want. Many of the links below pre-code this for you so that you get directly to specific bits of information.
 
  • Live Data flow: This means keeping abreast of the data flow from the pit to it being picked up by the current reconstruction production. Each live data taking run results in RAW DATA files produced in the pit being transferred OFFLINE and into the book-keeping (BKK). Once in the BKK these files are automatically found by the current reconstruction production and processed. This should rarely fail, but as a first task the shifter can look and verify the integrity of the chain. How to do this
Line: 119 to 120
 
  • When merged the DST data should appear in the BKK.

These links pick out the active data reconstruction and merging productions from the production monitoring page:

Changed:
<
<
>
>
  This link takes you to the book-keeping where (after many clicks) you can see if DSTs are there.

Revision 5 2010-10-20 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 18 to 18
 
  • Background Information (information which may be useful to understand the context better)
  • Taxonomy of Jobs (details the jobs types a Grid Shifter is expected to encounter and provides some debugging method)
Changed:
<
<
The main body of this document describes the principal activities of a production shifter under DoingShifts. This includes many links to web pages and to monitoring plots which may help you in the shift.
>
>
DoingShifts: The main body of this document describes the principal activities of a production shifter under . This includes many links to web pages and to monitoring plots which may help you in the shift.
 
Changed:
<
<
Finally we give a Compendium of Examples of problems and processes you may encounter. This is written in a way to easily allow shifters and other experts to edit and add their own useful examples in a simple factorised way (i.e. you don't have to add text in the main body of this document).
>
>
Compendium of Examples: Finally we give a of problems and processes you may encounter. This is written in a way to easily allow shifters and other experts to edit and add their own useful examples in a simple factorised way (i.e. you don't have to add text in the main body of this document).
  ==========================================================

Revision 4 2010-10-19 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 6 to 6
 
Changed:
<
<
UpdatedProductionShifterGuide
>
>
//==========================================================
 

Introduction

Line: 17 to 18
 
  • Background Information (information which may be useful to understand the context better)
  • Taxonomy of Jobs (details the jobs types a Grid Shifter is expected to encounter and provides some debugging method)
Changed:
<
<
Principal Shift Activities The main body of this document describes the principal activities of a production shifter. This includes
>
>
The main body of this document describes the principal activities of a production shifter under DoingShifts. This includes many links to web pages and to monitoring plots which may help you in the shift.
 
Changed:
<
<
A number of quick-reference sections are also available. DIRAC Scripts and Acronyms list the available DIRAC 3 scripts and commonly-used acronyms respectively.
>
>
Finally we give a Compendium of Examples of problems and processes you may encounter. This is written in a way to easily allow shifters and other experts to edit and add their own useful examples in a simple factorised way (i.e. you don't have to add text in the main body of this document).
 
Added:
>
>
==========================================================
 

Doing Shifts

This section contains suggested activities for:

  • The start and end of shift
Changed:
<
<
  • The tasks and observations a shifter is expected to make continually during a shift
>
>
  • The principal activities during a shift
 
  • Links to many web pages which you may find useful.
  • A suggested shift checklist
  • A compendium of possible problems and possible actions to take.
Line: 44 to 45
 
  • Find out what the current reconstruction productions are if you are not already familiar with them. To do this look at this list: Production Requests. In it you will find many productions. If you look down it you will set those that are associated with Reconstruction and are Active. You are typically looking for those which are Recox-Strippingy or some variation of this (there may be validation versions which are under test which you can ignore)
Changed:
<
<
Active Reconstruction Requests
>
>
Active Reconstruction Requests
  Active Reconstruction Requests
Changed:
<
<
  • Subscribe to the following mailing lists. Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily.
It's suggested that suitable message filters and folders are created in your mail client of choice.
>
>
  • Subscribe to the following mailing lists. Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.
 
Line: 61 to 61
 

Principal Activities During a Shift

Added:
>
>
The DIRAC Web portal is the workhorse of the shift and provides much and varied information. One can always start from here, and many shifters do just that. A possible set of web links to start from is
  • x
  • x
  • x

In many cases when using the DIRAC production monitor, it takes several clicks as well as setting up filters in order to get to the information you want. For this reason many of the links below pre-code this for you so that you get directly to specific bits of information.

 The principle activities of the shifter are listed below. In many cases the "activities" are not orthogonal, i.e. may be different ways to view the same thing.

  • Live Data flow: This means keeping abreast of the data flow from the pit to it being picked up by the current reconstruction production. Each live data taking run results in RAW DATA files produced in the pit being transferred OFFLINE and into the bookeeping (BKK). Once in the BKK these files are automatically found by the current reconstruction production and processed This should rarely fail, but as a first task the shifter can look and verify the integrity of the chain How to do this
Line: 71 to 78
 
  • General job success rate for user jobs: t.b.d
Added:
>
>
  • Data transfer success rate: t.b.d
 
  • Site Integrity: The shifter should look at the sites

  • Attend the daily operations meeting at 11.15 CET.

  • Make ELOG entries when there are new problems/observations or when there are developments to an existing problem. Making ELOG entires
Added:
>
>
  • Escalate a problem: If a problem is discovered, experienced shifters may recognise the context and either know how to fix it, or escalate it. New and inexperienced shifters may not easily know the next step. The general procedure is: Investigate: go as far as you can using the monitoring plots and the examples below. Consult the GEOC: when you reach an impasse, or in doubt, please consult the GEOC first. Please resist the temptation to just interrupt the Production Manager (or anyone else) who may be sitting behind you. By escalating to the GEOC you will (i) be more likely to learn about a known problem and (ii) aid continuity of knowledge of the problem you have found. The GEOC will escalate to the Production Manager or others as necessary.
 

Changed:
<
<

What has happened in last few days

>
>

How to: See what has happened in last few days

  Look at the RUNDB to see fills in last days. It might be helpful to note down which runs start and end each recent fill and the luminosity. This may be useful when using the DIRAC portal later. Check that all runs destined for offline are in the BKK.
Line: 118 to 128
 

Changed:
<
<

Howto: General Job success rate

>
>

Howto: Monitor general job success rate

  Jobs may fail for many reasons. The problem for the shifter is that some of them are important and some not and some will already be "well known" and some will be new.
  • A common jobs failure minor status is "Input Resolution Errors". This can be transitory when a new production is launched and some disk servers get "overloaded" . However these jobs are resubmitted automatically and if they then run there is not a problem.
Line: 149 to 159
 This had the benefits of allowing new Grid Shifters to familiarise themselves with recent problems with current productions. ELOG entries should contain as much relevant information as possible.
Deleted:
<
<

ELOG Entry for a New Problem

 A typical ELOG entry for a new problem contains:
Deleted:
<
<
 
  • The relevant ProdID or ProdIDs.
  • An example JobID.
  • A copy of the relevant error message and output.
Line: 160 to 167
 
  • The Grid sites affected.
  • The time of the first and last occurrence of the problem.
Deleted:
<
<

Subsequent ELOG Entries

 Once a problem has been logged it is useful to report the continuing status of the affected productions at the end of each shift.
Changed:
<
<
If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG following the format outlined in the ELOG.
>
>
If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG.
 

ELOG Problems

Line: 200 to 205
 ============================================================
Changed:
<
<
#Compendium
>
>
 

Compendium of Example Procedures To Address Problems.

Changed:
<
<
* It is hoped that shifters and other experts will contribute to this section. To make it easy it is suggested that people add a "topic" for each example problem they are able to contribute. To add a topic
  • Simply edit this page and add a single wiki-word (means at least two capitals in the word) in the bullet list of examples.
  • Save and exit this page.
  • Then when you click on the highlighted wiki-word it will open up a new area to write your example.
*
>
>
* It is hoped that shifters and other experts will contribute to this section and build it up.*
 
Changed:
<
<
If a problem is discovered experienced shifters may recognise the context and either know how to fix it, or escalate it. New and inexperienced shifters may not easily be able to know the next step. The general procedure is
  • Investigate : Go as far as you can using the monitoring plots and the examples below.
  • Consult the GEOC : When you reach an impasse, or in doubt, please consult the GEOC first. Please resit the temptation to just interrupt the Production Manager (or anyone else) who may be sitting behind you. By escalating to the GEOC you will (i) be more likely to learn about a known problem and (ii) will aid continuity of knowledge the problem you have found. The GEOC will escalate to the Production Manager or others as necessary.

On the discovery of a new problem, attempt to provide answers to the following questions as quickly as possible:

  • How many jobs does the problem affect?
  • Are the central DIRAC services running normally?
  • Are all jobs affected?
  • When did the problem start?
  • When did the last successful job run in similar conditions?
  • Is it a DIRAC problem?
  • Is there enough information available to determine the error?
>
>
To make it easy it is arranged so that you add a separate "topic" for each example problem you contribute. You do not need to add text in the main body of this document. To add an example:
  • Simply edit this page and add a new short section as shown in the example below in ShifterGuideExampleTopicHeading.
  • Save and exit this page.
  • Then when you click on the highlighted new topic heading it will open up a new area to write your example.
 

Examples

ExampleTopicHeading

Changed:
<
<
Go to ExampleTopicHeading
>
>
Add some few words in here to briefly describe the example you are adding. Keep it short - leave the details out . You finish this short section with the following : Go to ShifterGuideExampleTopicHeading.

You then exit and save. You can now click on ShifterGuideExampleTopicHeading (capitals left out here to avoid it becoming another link and start editing in a fresh area.

 

Software Unavailability

Go to SoftwareUnavailability
Line: 267 to 261
  =============================================================
Added:
>
>
 

Preliminaries

A Grid certificate is mandatory for Grid Shifters.

Line: 282 to 277
  =================================================
Added:
>
>
 

Background Information

Grid Sites

Line: 314 to 310
 
Castor CERN, CNAF, RAL
dCache IN2P3, NIKHEF, GridKa, PIC
Changed:
<
<

All about Jobs

>
>

DIRAC Scripts

DIRAC Scripts

Acronyms

Acronyms

============================================

Taxonomy of Jobs

 

JobIDs

Revision 3 2010-10-19 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 13 to 13
 This document is for LHCb Grid computing shifters. It is organised as follows.

Firstly we provide links to information which might be useful, but which you probably only need to read once or twice until you are familiar with it.

Changed:
<
<
  • Preliminaries (things you need to do before starting to do these shifts)
  • Background Information (information which may be useful to understand the context better)
  • Jobs (details the jobs types a Grid Shifter is expected to encounter and provides some debugging method)

shifts The main body of this document describes the principal activities of a production shifter. This includes

  • Start of shift
  • The tasks and observations a shifter is expected to make continually during a shift
  • Links to many web pages which you may find useful.
  • A suggested shift checklist
  • A compendium of possible problems and possible actions to take.
  • Instructions and a template for the report you are strongly requested to lodge in the ELOG at the end of each shift.

Orphan links for now

The ELOG section outlines the situations for which the submission of an ELOG is appropriate.

Finally, the Procedures section details the well-established procedures for Grid Shifters.

The Grid Sites section gives some brief information about the various Grid sites and their backend storage systems.

The Web Production Monitor section describes the main features of the Production Monitor webpage.

The methods available to manage and monitor productions are described in the Productions section.

>
>
  • Preliminaries (things you need to do before starting to do these shifts)
  • Background Information (information which may be useful to understand the context better)
  • Taxonomy of Jobs (details the jobs types a Grid Shifter is expected to encounter and provides some debugging method)
 
Added:
>
>
Principal Shift Activities: The main body of this document describes the principal activities of a production shifter. This includes
  A number of quick-reference sections are also available. DIRAC Scripts and Acronyms list the available DIRAC 3 scripts and commonly-used acronyms respectively.
Changed:
<
<
>
>
 

Doing Shifts

Added:
>
>
This section contains suggested activities for:
  • The start and end of shift
  • The tasks and observations a shifter is expected to make continually during a shift
  • Links to many web pages which you may find useful.
  • A suggested shift checklist
  • A compendium of possible problems and possible actions to take.
  • Instructions and a template for the report you are strongly requested to lodge in the ELOG at the end of each shift.
 

Start and End of Shift

At the start of a shift you should

Revision 2 2010-10-19 - PeterClarke

Line: 1 to 1
 
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

Line: 25 to 25
 
  • A compendium of possible problems and possible actions to take.
  • Instructions and a template for the report you are strongly requested to lodge in the ELOG at the end of each shift.
Changed:
<
<
Orphan for now
>
>
Orphan links for now
  The ELOG section outlines the situations for which the submission of an ELOG is appropriate.
Line: 54 to 54
 
  • Find out what has been happening in the last few days: If you havn't been on shift for a while its probably a good idea to get a quick picture of the fills and associated data taking runs of the last few days. This will help understand what to expect in the reconstructions productions. How to do this
Changed:
<
<
  • Find out what the current reconstruction productions are if you are not already familiar with them. To do this look at this list: Production Requests. In it you will find many productions. If you look down it you will set those that are associated with Reconstruction and are Active. You are looking for those which are Recox-Strippingy.
>
>
  • Find out what the current reconstruction productions are if you are not already familiar with them. To do this look at this list: Production Requests. In it you will find many productions. If you look down it you will set those that are associated with Reconstruction and are Active. You are typically looking for those which are Recox-Strippingy or some variation of this (there may be validation versions which are under test which you can ignore)
 
Changed:
<
<
Active Reconstruction Requests
>
>
Active Reconstruction Requests

Active Reconstruction Requests

 
  • Subscribe to the following mailing lists. Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily.
It's suggested that suitable message filters and folders are created in your mail client of choice.
Line: 67 to 69
 At the end of a shift you should

  • Enter a short summary in the ELOG. NEEDS TEMPLATE
Added:
>
>
  • Return the key to the operations room to the secretariat if appropriate.
 
Changed:
<
<

Principal Activities During a Shift

>
>

Principal Activities During a Shift

  The principle activities of the shifter are listed below. In many cases the "activities" are not orthogonal, i.e. may be different ways to view the same thing.
Line: 84 to 87
 
  • Attend the daily operations meeting at 11.15 CET.
Added:
>
>
  • Make ELOG entries when there are new problems/observations or when there are developments to an existing problem. Making ELOG entires

 
Changed:
<
<

What has happened in last few days

>
>

What has happened in last few days

  Look at the RUNDB to see fills in last days. It might be helpful to note down which runs start and end each recent fill and the luminosity. This may be useful when using the DIRAC portal later. Check that all runs destined for offline are in the BKK.

Changed:
<
<

Howto: Look at live data flow

>
>

Howto: Look at live data flow

 
Changed:
<
<
The questions to be checked are:
>
>
The questions to be answered are:
 
  • Are we taking data now ?
  • Is the RAW data for each run flowing from the pit and getting into the BKK ?

Line: 106 to 113
 

Changed:
<
<

Howto: Look at the current reconstruction production.

>
>

Howto: Look at the current reconstruction production.

 
Changed:
<
<
Questions to be answered are:
  • What is the current reconstruction production ?
>
>
You will know what the current active data reconstructions are from the earlier step at the start of shift. Questions to be answered now are:
 
  • Is the RAW data for each run being picked up by the current data reconstruction production ?
  • Is the merging going properly ? When each run is 100% reconstructed and stripped, the stripped data should be picked up by the merging production to produce DSTs . There is some delay here, typically merging may not yet be running on the latest runs form the very latest fill yet.
  • When merged the DST data should appear in the BKK.
Changed:
<
<
Now look at the productions associated with the current reconstruction. This is hard wired for you here (if the most recent reconstruction is not shown then tell clarke@cernNOSPAMPLEASE.ch)

This one will probably be the main reconstruction production. It is probably useful to then look at the pop up "Run Status" option (by left clicking on the line). This will show you which runs are 100% complete.

This one shows you (probably) the mergings. Merging only works on runs which are already 100% reconstructed (which is what you will have seen above). Again, left clicking and looking at run status will show you what is happening.

This one probably shows you the EXPRESS stream for completeness.

  • DIRAC Production Monitoring: Reco06

These links show similar information, but grouped by all current reconstruction and merging productions:

>
>
These links pick out the active data reconstruction and merging productions from the production monitoring page:
 
Deleted:
<
<
 
Changed:
<
<
Finally you can (after many clicks) look in the BKK to see if DSTs are there.
>
>
This link takes you to the book-keeping where (after many clicks)you can see if DSTs are there.
 
Line: 136 to 128
 
Added:
>
>
 
Changed:
<
<

Howto: General Job success rate

>
>

Howto: General Job success rate

  Jobs may fail for many reasons. The problem for the shifter is that some of them are important and some not and some will already be "well known" and some will be new.
  • A common jobs failure minor status is "Input Resolution Errors". This can be transitory when a new production is launched and some disk servers get "overloaded" . However these jobs are resubmitted automatically and if they then run there is not a problem.
Line: 161 to 154
 Don't be afraid to ask the GEOC if in doubt.
Changed:
<
<
WHERE IS THERE A SIMPLE UP TO DATE SITE STATUS MONITOR ????? SUCH THAT SHIFTER CAN EASILY SEE STATUS AND PROBLEMS WHICH ARE KNOWN AT EACH SITE.

================================================

Production Operations Meeting

A Production Operations Meeting takes place at the end of the morning shift and allows the morning Grid Shifter to highlight any recent or outstanding issues. Both the morning and afternoon Grid Shifter should attend. The morning Grid Shifter should give a report summarising the morning activities.

The Grid Shifter's report should contain:

  • Current production progress, jobs submitted, waiting etc.
  • Status of all Tier1 sites.
  • Recently observed failures, paying particular attention to previously-unknown problems.

Ending a Shift

At the end of each shift, morning Grid Shifters should:

  • Pass on the key (TCE5) for the Production Operations room to the next Grid Shifter.
  • Prepare a list of outstanding issues to be handed over to the next Grid Shifter and discussed in the Production Operations meeting.
  • Submit an ELOG report summarising the shift and any ongoing or unresolved issues.

Similarly, evening Grid Shifters should:

  • Place the key (TCE5) to the Productions Operations room in the secretariat key box.
  • Submit an ELOG report summarising the shift and any ongoing or unresolved issues.

End of Shift Period

At the end of a shift period the Grid Shifter may wish to unsubscribe from the various mailing lists (Sec. 6.4.1) in addition to returning the Production Operations room key, TCE5 (Sec. 6.4.2).

Miscellaneous

Return the key for the Production Operations Room (TCE5) to the secretariat or the next Grid Shifter.

ELOG

>
>
#ELOG

Making ELOG Entries

  All Grid Shifter actions of note should be recorded in the ELOG. This had the benefits of allowing new Grid Shifters to familiarise themselves with recent problems with current productions. ELOG entries should contain as much relevant information as possible.
Changed:
<
<

Typical ELOG Format

Each ELOG entry which reports a new problem should include as much relevant information as possible. This allows the production operations team to quickly determine the problem and apply a solution.

ELOG Entry for a New Problem

>
>

ELOG Entry for a New Problem

  A typical ELOG entry for a new problem contains:
Line: 238 to 172
 
  • The Grid sites affected.
  • The time of the first and last occurrence of the problem.
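For illustration only (the ProdID, JobID, counts, times and site below are all invented), a minimal entry following this format might read:

  ProdID: 00001234
  Example JobID: 9876
  Error: final minor status "Watchdog Identified Jobs as Stalled"
  Number of affected jobs: ~150
  Sites affected: LCG.CERN.ch
  First occurrence: 08:30, last occurrence: 13:45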
Changed:
<
<

Subsequent ELOG Entries

>
>

Subsequent ELOG Entries

  Once a problem has been logged it is useful to report the continuing status of the affected productions at the end of each shift.

If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG following the format outlined in the ELOG.

Added:
>
>

ELOG Problems

If ELOG is down, send a notification email to lhcb-production@cernNOSPAMPLEASE.ch.

 

When to Submit an ELOG

Submit an ELOG in the following situations:

  • Jobs finalise with exceptions.
Changed:
<
<

Exceptions

>
>

Exceptions

  Jobs which finalise with an exception should be noted in the ELOG. The ELOG entry should contain:
Line: 263 to 198
 
  • A copy of the relevant error messages.
  • The number of jobs in the production which have the same status.
Changed:
<
<

Crashed Application

>
>

Crashed Application

  Submit an example error log for the crashed application.
Changed:
<
<

Datacrash Emails

>
>

Datacrash Emails

  The Grid Shifter should filter the datacrash emails and determine if the crash reported is actually due to one of the applications. If so, then the Grid Shifter should submit an ELOG describing the problem and including an example error message. The Grid Shifter should ensure the “Applications” radio button is selected when submitting the ELOG report since this means that the relevant experts will be alerted to the problem.
Changed:
<
<

ELOG Problems

>
>
============================================================
 
Deleted:
<
<
If ELOG is down, send a notification email to lhcb-production@cernNOSPAMPLEASE.ch.
 
Changed:
<
<

Procedures

>
>
#Compendium

Compendium of Example Procedures To Address Problems.

 
Changed:
<
<
If a problem is discovered it is very important to escalate it to the operations team. Assessing the scale of the problem is very important and Grid Shifters should attempt to answer the questions in section 9.1.1 as soon as possible.

On the Discovery of a Problem

Once a problem has been discovered it is important to assess the severity of the problem. Section 9.1.1 provides a checklist which the Grid Shifter should go through after discovering a problem. Additionally, there are a number of Grid-specific issues to consider (Sec. 9.1.2).

Standard Checklist

>
>
* It is hoped that shifters and other experts will contribute to this section. To make it easy, it is suggested that people add a "topic" for each example problem they are able to contribute. To add a topic:
  • Simply edit this page and add a single wiki-word (meaning a word with at least two capital letters) in the bullet list of examples.
  • Save and exit this page.
  • Then when you click on the highlighted wiki-word it will open up a new area to write your example.
*

If a problem is discovered, experienced shifters may recognise the context and either know how to fix it or know how to escalate it. New and inexperienced shifters may not easily know the next step. The general procedure is:

  • Investigate : Go as far as you can using the monitoring plots and the examples below.
  • Consult the GEOC : When you reach an impasse, or in doubt, please consult the GEOC first. Please resist the temptation to just interrupt the Production Manager (or anyone else) who may be sitting behind you. By escalating to the GEOC you will (i) be more likely to learn about a known problem and (ii) aid continuity of knowledge about the problem you have found. The GEOC will escalate to the Production Manager or others as necessary.
  On the discovery of a new problem, attempt to provide answers to the following questions as quickly as possible:
Line: 301 to 234
 
  • When did the problem start?
  • When did the last successful job run in similar conditions?
  • Is it a DIRAC problem?
Deleted:
<
<
    • Can extra redundancy be introduced to the system?
 
    • Is there enough information available to determine the error?
Added:
>
>

Examples

 
Changed:
<
<

Grid-Specific Issues

  • Was there an announcement of downtime for the site?
  • Is the problem specific to a single site?
    • Are all the CE’s at the site affected?
  • Is the problem systematic across sites with different backend storage technologies?
  • Is the problem specific to an SE?
    • Are there any stalled jobs at the site clustered in time?
    • Are other jobs successfully reading data from the SE?

Feature Requests

Before submitting a feature request, the user should:

  • Identify conditions under which the feature is to be used.
  • Record all relevant information.
  • Identify a use-case for the new feature.

Figure 5: Browse current support issues.

Once the user has prepared all the relevant information, they should:

Figure 6: Savannah support submit.

Figure 7: Savannah support submit feature request.

Assuming the feature request has not been previously submitted, the user should then:

  • Navigate to the ``Support'' tab at the top of the page (Fig. 6) and click on ``submit''.
  • Ensure that the submission webform contains all relevant information (Fig. 7).
  • Set the severity option to ``wish''.
  • Set the privacy option to ``private''.
  • Submit the feature request.
>
>

ExampleTopicHeading

Go to ExampleTopicHeading

Software Unavailability

Go to SoftwareUnavailability
 

Bug Reporting

Line: 383 to 276
 
Deleted:
<
<

Software Unavailability

 
Changed:
<
<
Symptom: Jobs fail to find at least one software package.
>
>
=============================================================
 
Changed:
<
<
Software installation occurs during Service Availability Monitoring (SAM) tests. Sites which fail to find software packages should have failed at least part of their most recent SAM test.
>
>

Preliminaries

 
Changed:
<
<
Grid Shifter actions:
>
>
A Grid certificate is mandatory for Grid Shifters. If you don't have a certificate you should register for one through CERN LCG and apply to join the LHCb Virtual Organisation (VO).
 
Changed:
<
<
  • Submit an ELOG report listing the affected productions and sites.
  • Ban the relevant sites until they pass their SAM tests.
>
>
To access the production monitoring webpages you will also need to load your certificate into your browser. Detailed instructions on how to do this can be found on the CERN LCG pages.
 
Added:
>
>
The new shifter should:

=================================================

Background Information

 
Changed:
<
<

Grid Sites

>
>

Grid Sites

  Jobs submitted to the Grid will be scheduled to run at one of a number of Grid sites. The exact site at which a job is executed depends on the job requirements and the current status of all relevant grid sites. Grid sites are grouped into two tiers, Tier-1 and Tier-2. CERN is an exception: because it is also responsible for processing and archiving the RAW experimental data, it is also referred to as a Tier-0 site.
Changed:
<
<

Tier-1 Sites

>
>

Tier-1 Sites

  Tier-1 sites are used for Analysis, Monte Carlo production, file transfer and file storage in the LHCb Computing Model.
  • LCG.CERN.ch
Line: 414 to 314
 
Changed:
<
<

Tier-2 Sites

>
>

Tier-2 Sites

  There are numerous Tier-2 sites with sites being added frequently. As such, it is of little worth presenting a list of all the current Tier-2 sites in this document. Tier-2 sites are used for MC production in the LHCb Computing Model.
Line: 426 to 326
 
Castor CERN, CNAF, RAL
dCache IN2P3, NIKHEF, GridKa, PIC
Changed:
<
<

Jobs

>
>

All about Jobs

 
Changed:
<
<
The number of jobs created for a productions varies depending on the exact requirements of the production. Grid Shifters are generally not required to create jobs for a production.

JobIDs

>
>

JobIDs

  A particular job is tagged with the following information:
Line: 439 to 337
 
  • JobName, e.g. 00001234_00000019 - the 19th job in production 00001234.
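As a minimal sketch of this naming convention (the numbers are invented), both fields are zero-padded to eight digits:

# Illustrative only: build the JobName of the 19th job in production 1234
ProdID=1234
JobIndex=19
printf "%08d_%08d\n" "$ProdID" "$JobIndex"    # prints 00001234_00000019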
Changed:
<
<

Job Status

>
>

Job Status

  The job status of a successful job proceeds in the following order:
Line: 461 to 359
 Figure 1: Job status flowchart. Note that the ``Checking'' and ``Staging'' status are omitted.
Changed:
<
<

Job Output

>
>

Job Output

  The standard output and standard error of a job can be accessed through the API, the CLI and the webpage via a global job ``peek''.
Changed:
<
<

Job Output via the CLI

>
>

Job Output via the CLI

  The std.out and std.err for a given job can be retrieved using the CLI command:
dirac-wms-job-get-output <JobID> | [<JobID>]
Line: 474 to 372
 To simply view the last few lines of a job's std.out (``peek'') use:
dirac-wms-job-peek <JobID> | [<JobID>]
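A minimal sketch of how these commands might be combined in practice (the JobIDs are invented, and the exact layout of the retrieved directories may differ between DIRAC versions):

# Retrieve std.out/std.err for two jobs, then search the retrieved files for FATAL messages
dirac-wms-job-get-output 9876 9877
grep -r "FATAL" 9876/ 9877/    # assumes one directory per JobID is created, as described above

# Quick look at the last few lines of a job's std.out
dirac-wms-job-peek 9876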
Changed:
<
<

Job Output via the Job Monitoring Webpage

>
>

Job Output via the Job Monitoring Webpage

  There are two methods to view the output of a job via the Job Monitoring Webpage. The first returns the last 20 lines of the std.out and the second allows the Grid Shifter to view all the output files.
Line: 506 to 404
  This method can be particularly quick if the Grid Shifter only wants to check the output of a selection of jobs.
Changed:
<
<

Job Pilot Output

>
>

Job Pilot Output

  The output of the Job Pilot can also be retrieved via the API, the CLI or the Webpage.
Changed:
<
<

Job Pilot Output via the CLI

>
>

Job Pilot Output via the CLI

  To obtain the Job Pilot output using the CLI, use:
dirac-admin-get-pilot-output <Grid pilot reference> [<Grid pilot reference>]
This creates a directory for each JobID containing the Job Pilot output.
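As a sketch only (the pilot reference below is a placeholder, not a real one):

# Retrieve the pilot std.out/std.err; a directory per job is created
dirac-admin-get-pilot-output https://example-lb.invalid:9000/AbCdEf    # placeholder reference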
Changed:
<
<

Job Pilot Output via the Job Monitoring Webpage

>
>

Job Pilot Output via the Job Monitoring Webpage

  Viewing the std.out and std.err of a Job Pilot via the Job Monitoring Webpage is achieved by:
Line: 529 to 427
  Figure 4: View the pilot output of a job via the Job Monitoring Webpage.
Changed:
<
<

Operations on Jobs

>
>

Operations on Jobs

 

The full list of scripts which can be used to perform operations on a job is given in DIRAC Scripts. The name of each script should be a clear indication of its purpose. Running a script without arguments will print basic usage notes.
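For example (the JobID is invented), a script can first be run without arguments to see its usage notes and then re-run with real arguments:

# Print basic usage notes
dirac-wms-job-peek
# ...then run it for a specific job
dirac-wms-job-peek 9876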

Deleted:
<
<

Productions

As a Grid Shifter you will be required to monitor the official LHCb productions. Each production is assigned a unique Production ID (ProdID). These consist of Monte Carlo (MC) generation, data stripping and CCRC productions. Production creation will generally be performed by the Production Operations Manager and is not a duty of the Grid Shifter.

The current list of all active productions can be obtained with the command:

dirac-production-list-active
The command also gives the current submission status of the active productions.
 

Monitoring a Production

Line: 563 to 453
 
  • Excessive runtime.
Changed:
<
<

Failed Jobs

>
>

Failed Jobs

  A Grid Shifter should monitor a production for failed jobs and jobs which are not progressing. Due to the various configurations of all the sites it is occasionally not possible for an email to be sent to the lhcb-datacrash mailing list for each failed job.
Line: 583 to 473
 Beware of failed jobs which have been killed - when a production is complete, the remaining jobs may be automatically killed by DIRAC. Killed jobs like this are ok.
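A minimal sketch of such a check from the CLI, using the production summary scripts listed in this guide (the ProdID is invented; the exact status strings accepted should be checked via the script's usage notes):

# Breakdown of job states for all active productions
dirac-production-progress
# Detailed breakdown (minor status plus an example JobID) for one production's failed jobs
dirac-production-job-summary 00001234 Failed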
Changed:
<
<

Non-Progressing Jobs

>
>

Non-Progressing Jobs

  In addition to failed jobs, jobs which do not progress should also be monitored. Particular attention should be paid to jobs in the states ``Waiting'' and ``Staging''. Problematic jobs at this stage are easily overlooked since the associated problems are not easily identifiable.
Changed:
<
<

Non-Starting Jobs

>
>

Non-Starting Jobs

  Jobs arriving at a site but then failing to start have multiple causes. One of the most common is that the site is due to enter scheduled downtime and is no longer submitting jobs to the batch queues. Jobs will stay at the site in a ``Waiting'' state, reporting that there are no CEs available. Multiple jobs in this state should be reported.
Changed:
<
<

Merging Productions

>
>

Merging Productions

  Each MC Production should have an associated Merging Production which merges the output files together into more manageable file sizes. Ensure that the number of files available to the Merging Production increases in proportion to the number of successful jobs of the MC Production. If the number of files does not increase, this can point to a problem in the Bookkeeping which should be reported.
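One way to keep an eye on this is the bookkeeping script listed later in this guide; this is a sketch only (the ProdIDs are invented, and the exact arguments should be checked by running the script without arguments):

# Compare the file counts reported for the MC production and for its merging production
dirac-bookkeeping-production-informations 00001234
dirac-bookkeeping-production-informations 00001235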
Changed:
<
<

Ending a Production

>
>

Ending a Production

  Ending a completed production is handled by the Productions Operations Manager (or equivalent). No action is required on the part of the Grid Shifter.
Changed:
<
<

Web Production Monitor

>
>

Web Production Monitor

  Production monitoring via the web is possible through the Production Monitoring Webpage. A valid grid certificate loaded into your browser is required to use the webpage.
Changed:
<
<

Features

>
>

Features

  The Production Monitoring Webpage has the following features:
Line: 620 to 510
 
Changed:
<
<

Site Downtime Calendar

>
>

Site Downtime Calendar

 

The calendar [6] displays all the sites with scheduled and unscheduled downtime.

Line: 633 to 523
 
dirac-admin-get-site-mask
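A quick cross-check from the command line might therefore look like this (the output format varies between DIRAC versions):

# Sites currently banned in DIRAC...
dirac-admin-get-banned-sites
# ...and the full site mask, to compare against the downtime calendar
dirac-admin-get-site-mask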
Deleted:
<
<

Shifts

Grid Shifters are required to monitor all the current LHCb productions and must have a valid Grid Certificate and be a member of the LHCb VO.

Before a Shift Period

The new shifter should:

Grid Certificates

A Grid certificate is mandatory for Grid Shifters. If you don't have a certificate you should register for one through CERN LCG and apply to join the LHCb Virtual Organisation (VO).

To access the production monitoring webpages you will also need to load your certificate into your browser. Detailed instructions on how to do this can be found on the CERN LCG pages.

Web Resources

Primary web-based resources for DIRAC 3 production shifts:

Mailing Lists

The new Grid Shifter should subscribe to the following mailing lists:

Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.

Production Operations Key

The new shifter should obtain the Production Operations key (TCE5) from the LHCb secretariat or the previous Grid Shifter.

During a Shift

During a shift Grid Shifters are expected to monitor all current productions and be aware of the current status of the Tier1 sites. A knowledge of the purpose of each production is also useful and aids in determining the probable cause of any failed jobs.

Daily Actions

Grid Shifters are expected to carry out the following daily actions for sites used in the current productions:

  • Trigger submission of pending productions.
  • Monitor active productions.
  • Check transfer status.
  • Verify that the staging at each site is functional.
  • Check that there is a minimum of one successful (and complete) job.
  • Confirm that data access is working at least intermittently.
  • Report problems to the operations team.
  • Submit a summary of the job status at all the grid sites to the ELOG 7.

Performance Monitoring

Grid Shifters should view the plots accessible via the DIRACSystemMonitoring page at least three times a day and investigate any unusual features present.

Production Operations Meeting

A Production Operations Meeting takes place at the end of the morning shift and allows the morning Grid Shifter to highlight any recent or outstanding issues. Both the morning and afternoon Grid Shifter should attend. The morning Grid Shifter should give a report summarising the morning activities.

The Grid Shifter's report should contain:

  • Current production progress, jobs submitted, waiting etc.
  • Status of all Tier1 sites.
  • Recently observed failures, paying particular attention to previously-unknown problems.

Ending a Shift

At the end of each shift, morning Grid Shifters should:

  • Pass on the key (TCE5) for the Production Operations room to the next Grid Shifter.
  • Prepare a list of outstanding issues to be handed over to the next Grid Shifter and discussed in the Production Operations meeting.
  • Submit an ELOG report summarising the shift and any ongoing or unresolved issues.

Similarly, evening Grid Shifters should:

  • Place the key (TCE5) to the Productions Operations room in the secretariat key box.
  • Submit an ELOG report summarising the shift and any ongoing or unresolved issues.

End of Shift Period

At the end of a shift period the Grid Shifter may wish to unsubscribe from the various mailing lists (Sec. 6.4.1) in addition to returning the Production Operations room key, TCE5 (Sec. 6.4.2).

Mailing Lists

Unsubscribe from the following mailing lists:

  • lhcb-datacrash.
  • lhcb-dirac-developers.
  • lhcb-dirac.
  • lhcb-production.

Miscellaneous

Return the key for the Production Operations Room (TCE5) to the secretariat or the next Grid Shifter.

Weekly Report

A weekly report should be prepared by the Grid Shifter at the end of each week. The report should contain information on all the processed production and user jobs, the respective failure rates and some basic analysis of the results. The report should be compiled on the last day of the shift and contain information about the previous seven full days of operation, i.e. it should not include information from the day the report is compiled.

The weekly reports are to be uploaded to the Weekly Reports Page on the LHCb Computing tWiki. Grid Shifters should use the template provided when compiling a report.

Analysis and Summary

A summary of each group of plots should be written to aid the next Grid Shifter’s appraisal of the current situation and to enable the Grid Expert on duty to investigate problems further.

ELOG

All Grid Shifter actions of note should be recorded in the ELOG. This had the benefits of allowing new Grid Shifters to familiarise themselves with recent problems with current productions. ELOG entries should contain as much relevant information as possible.

Typical ELOG Format

Each ELOG entry which reports a new problem should include as much relevant information as possible. This allows the production operations team to quickly determine the problem and apply a solution.

ELOG Entry for a New Problem

A typical ELOG entry for a new problem contains:

  • The relevant ProdID or ProdIDs.
  • An example JobID.
  • A copy of the relevant error message and output.
  • The number of affected jobs.
  • The Grid sites affected.
  • The time of the first and last occurrence of the problem.

Subsequent ELOG Entries

Once a problem has been logged it is useful to report the continuing status of the affected productions at the end of each shift.

If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG following the format outlined in the ELOG.

When to Submit an ELOG

Submit an ELOG in the following situations:

  • Jobs finalise with exceptions.

Exceptions

Jobs which finalise with an exception should be noted in the ELOG. The ELOG entry should contain:

  • The production ID.
  • An example job ID.
  • A copy of the relevant error messages.
  • The number of jobs in the production which have the same status.

Crashed Application

Should submit example error log for the crashed application.

Datacrash Emails

The Grid Shifter should filter the datacrash emails and determine if the crash reported is actually due to one of the applications. If so, then the Grid Shifter should submit an ELOG describing the problem and including an example error message. The Grid Shifter should ensure the “Applications” radio button is selected when submitting the ELOG report since this means that the relevant experts will be alerted to the problem.

ELOG Problems

If ELOG is down, send a notification email to lhcb-production@cernNOSPAMPLEASE.ch.

Procedures

If a problem is discovered it is very important to escalate it to the operations team. Assessing the scale of the problem is very important and Grid Shifters should attempt to answer the questions in section 9.1.1 as soon as possible.

On the Discovery of a Problem

Once a problem has been discovered it is important to assess the severity of the problem. Section 9.1.1 provides a checklist which the Grid Shifter should go through after discovering a problem. Additionally, there are a number of Grid-specific issues to consider (Sec. 9.1.2).

Standard Checklist

On the discovery of a new problem, attempt to provide answers to the following questions as quickly as possible:

  • How many jobs does the problem affect?
  • Are the central DIRAC services running normally?
  • Are all jobs affected?
  • When did the problem start?
  • When did the last successful job run in similar conditions?
  • Is it a DIRAC problem?
    • Can extra redundancy be introduced to the system?
    • Is there enough information available to determine the error?

Grid-Specific Issues

  • Was there an announcement of downtime for the site?
  • Is the problem specific to a single site?
    • Are all the CE’s at the site affected?
  • Is the problem systematic across sites with different backend storage technologies?
  • Is the problem specific to an SE?
    • Are there any stalled jobs at the site clustered in time? *Are other jobs successfully reading data from the SE?

Feature Requests

Before submitting a feature request, the user should:

* Identify conditions under which the feature is to be used. * Record all relevant information. * Identify a use-case for the new feature.

Figure 5: Browse current support issues.

Once the user has prepared all the relevant information, they should:

Figure 6: Savannah support submit.

Figure 7: Savannah support submit feature request.

Assuming the feature request has not been previously submitted, the user should then:

  • Navigate to the ``Support'' tab at the top of the page (Fig. 6) and click on ``submit''.
  • Ensure that the submission webform contains all relevant information (Fig. 7).
  • Set the severity option to ``wish''.
  • Set the privacy option to ``private''.
  • Submit the feature request.

Bug Reporting

Before submitting a bug report, the user should:

  • Identify conditions under which the bug occurs.
  • Record all relevant information.
  • Try to ensure that the bug is reproducible.

Once the user is convinced that the behaviour they are experiencing is a bug, they should then prepare to submit a bug report. Users should:

Figure 8: Browse current bugs.

Assuming the bug is new, the procedure to submit a bug report is as follows:

  • Navigate to the ``Support'' tab at the top of the page (Fig. 6) and click on ``submit''.
  • Ensure that the submission webform contains all relevant information (Fig. 9).
  • Set the appropriate severity of the problem.
  • Write a short and clear summary.
  • Set the privacy option to ``private''.
  • Submit the bug report.

Figure 9: Example bug report.

Software Unavailability

Symptom: Jobs fail to find at least one software package.

Software installation occurs during Service Availability Monitoring (SAM) tests. Sites which fail to find software packages should have failed at least part of their most recent SAM test.

Grid Shifter actions:

  • Submit an ELOG report listing the affected productions and sites.
  • Ban the relevant sites until they pass their SAM tests.
 

DIRAC 3 Scripts

Changed:
<
<

DIRAC Admin Scripts

>
>

DIRAC Admin Scripts

 
  • dirac-admin-accounting-cli
  • dirac-admin-add-user
Line: 969 to 548
 
  • dirac-admin-upload-proxy
  • dirac-admin-users-with-proxy
Changed:
<
<

DIRAC Bookkeeping Scripts

>
>

DIRAC Bookkeeping Scripts

 
  • dirac-bookkeeping-eventMgt
  • dirac-bookkeeping-eventtype-mgt
Line: 977 to 556
 
  • dirac-bookkeeping-production-jobs
  • dirac-bookkeeping-production-informations
Changed:
<
<

DIRAC Clean

>
>

DIRAC Clean

 
  • dirac-clean
Changed:
<
<

DIRAC Configuration

>
>

DIRAC Configuration

 
  • dirac-configuration-cli
Changed:
<
<

DIRAC Distribution

>
>

DIRAC Distribution

 
  • dirac-distribution

Changed:
<
<

DIRAC DMS

>
>

DIRAC DMS

 
  • dirac-dms-add-file
  • dirac-dms-get-file
Line: 1008 to 587
 
Changed:
<
<

DIRAC Embedded

>
>

DIRAC Embedded

 
  • dirac-embedded-external
Changed:
<
<

DIRAC External

>
>

DIRAC External

 
  • dirac-external
Changed:
<
<

DIRAC Fix

>
>

DIRAC Fix

 
  • dirac-fix-ld-library-path

Changed:
<
<

DIRAC Framework

>
>

DIRAC Framework

 
  • dirac-framework-ping-service

Changed:
<
<

DIRAC Functions

>
>

DIRAC Functions

 
  • dirac-functions.sh
Changed:
<
<

DIRAC Group

>
>

DIRAC Group

 
  • dirac-group-init
Changed:
<
<

DIRAC Jobexec

>
>

DIRAC Jobexec

 
  • dirac-jobexec
Changed:
<
<

DIRAC LHCb

>
>

DIRAC LHCb

 
  • dirac-lhcb-job-replica
  • dirac-lhcb-manage-software
Line: 1049 to 628
 
  • dirac-lhcb-sam-submit-all
  • dirac-lhcb-sam-submit-ce
Changed:
<
<

DIRAC Myproxy

>
>

DIRAC Myproxy

 
  • dirac-myproxy-upload
Changed:
<
<

DIRAC Production

>
>

DIRAC Production

 

  • dirac-production-application-summary
Line: 1076 to 655
 
Changed:
<
<

DIRAC Proxy

>
>

DIRAC Proxy

 
  • dirac-proxy-info
  • dirac-proxy-init
Line: 1084 to 663
 
Changed:
<
<

DIRAC Update

>
>

DIRAC Update

 
  • dirac-update

Changed:
<
<

DIRAC WMS

>
>

DIRAC WMS

 
  • dirac-wms-job-delete
  • dirac-wms-job-get-output
Line: 1239 to 818
  -- PaulSzczypka - 14 Aug 2009
Deleted:
<
<

Development Section for monitoring plots.

These are being assembled at https://twiki.cern.ch/twiki/bin/view/Main/PeterClarkeShiftProcesses

 -- PeterClarke - 19-Oct-2010

Revision 12010-10-19 - PeterClarke

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="ProductionShifterGuide"

Grid Shifter Guide : Being updated autumn 2010

This topic is under development during Autumn 2010. It is experimental. Please contact Pete Clarke (clarke@cernNOSPAMPLEASE.ch) for complaints or suggestions.

UpdatedProductionShifterGuide

Introduction

This document is for LHCb Grid computing shifters. It is organised as follows.

Firstly we provide links to information which might be useful, but which you probably only need to read once or twice until you are familiar with it.

  • Preliminaries (things you need to do before starting to do these shifts)
  • Background Information (information which may be useful to understand the context better)
  • Jobs (details the jobs types a Grid Shifter is expected to encounter and provides some debugging method)

shifts The main body of this document describes the principal activities of a production shifter. This includes

  • Start of shift
  • The tasks and observations a shifter is expected to make continually during a shift
  • Links to many web pages which you may find useful.
  • A suggested shift checklist
  • A compendium of possible problems and possible actions to take.
  • Instructions and a template for the report you are strongly requested to lodge in the ELOG at the end of each shift.

Orphan for now

The ELOG section outlines the situations for which the submission of an ELOG is appropriate.

Finally, the Procedures section details the well-established procedures for Grid Shifters.

The Grid Sites section gives some brief information about the various Grid sites and their backend storage systems.

The Web Production Monitor section describes the main features of the Production Monitor webpage.

The methods available to manage and monitor productions are described in the Productions section.

A number of quick-reference sections are also available. DIRAC Scripts and Acronyms list the available DIRAC 3 scripts and commonly-used acronyms respectively.

Doing Shifts

Start and End of Shift

At the start of a shift you should

  • Read the ELOG for the last few days. You should read the end of shift report from the last shifter. You will also be able to pick up threads pertaining to current issues.

  • Find out what has been happening in the last few days: If you haven't been on shift for a while it's probably a good idea to get a quick picture of the fills and associated data-taking runs of the last few days. This will help you understand what to expect in the reconstruction productions. How to do this

  • Find out what the current reconstruction productions are if you are not already familiar with them. To do this look at this list: Production Requests. In it you will find many productions. If you look down it you will see those that are associated with Reconstruction and are Active. You are looking for those which are Recox-Strippingy.

Active Reconstruction Requests

  • Subscribe to the following mailing lists. Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily.
It's suggested that suitable message filters and folders are created in your mail client of choice.

At the end of a shift you should

  • Enter a short summary in the ELOG. NEEDS TEMPLATE

Principal Activities During a Shift

The principal activities of the shifter are listed below. In many cases the "activities" are not orthogonal, i.e. they may be different ways to view the same thing.

  • Live Data flow: This means keeping abreast of the data flow from the pit to it being picked up by the current reconstruction production. Each live data-taking run results in RAW DATA files produced in the pit being transferred OFFLINE and into the bookkeeping (BKK). Once in the BKK these files are automatically found by the current reconstruction production and processed. This should rarely fail, but as a first task the shifter can check and verify the integrity of the chain. How to do this

  • The current reconstruction productions: In general, if we are taking data and there is no anomalous situation, the latest data reconstruction production will be running. This has 3 visible steps: (i) reconstruction and stripping, (ii) merging, (iii) replication to sites. The shifter must keep an eye on this to ensure it is progressing, and that there is not an unexpected number of failing jobs resulting from a recent (as yet unknown) problem. How to do this

  • General Job success rate for production jobs: The shifter should monitor the overall job success rate for all production jobs. If jobs start failing at a single site, and it is not a "known problem" then it may be that a new problem has arisen at that site. If jobs start failing at all sites then it is more likely to be a production or application misconfiguration. A set of monitoring plots has been assembled to at least alert the shifter to a problem, and to start the diagnostic process. How to do this

  • General job success rate for user jobs: t.b.d

  • Site Integrity: The shifter should look at the sites

  • Attend the daily operations meeting at 11.15 CET.

What has happened in last few days

Look at the RUNDB to see fills in last days. It might be helpful to note down which runs start and end each recent fill and the luminosity. This may be useful when using the DIRAC portal later. Check that all runs destined for offline are in the BKK.

Howto: Look at live data flow

The questions to be checked are:

  • Are we taking data now ?
  • Is the RAW data for each run flowing from the pit and getting into the BKK ?

These links help you answer these questions. When we are in proper data taking you will see the LHCb online page showing COLLISION10 with data destination OFFLINE

Howto: Look at the current reconstruction production.

Questions to be answered are:

  • What is the current reconstruction production ?
  • Is the RAW data for each run being picked up by the current data reconstruction production ?
  • Is the merging going properly? When each run is 100% reconstructed and stripped, the stripped data should be picked up by the merging production to produce DSTs. There is some delay here: typically merging may not yet be running on the runs from the very latest fill.
  • When merged the DST data should appear in the BKK.

Now look at the productions associated with the current reconstruction. This is hard wired for you here (if the most recent reconstruction is not shown then tell clarke@cernNOSPAMPLEASE.ch)

This one will probably be the main reconstruction production. It is probably useful to then look at the pop up "Run Status" option (by left clicking on the line). This will show you which runs are 100% complete.

This one shows you (probably) the mergings. Merging only works on runs which are already 100% reconstructed (which is what you will have seen above). Again, left clicking and looking at run status will show you what is happening.

This one probably shows you the EXPRESS stream for completeness.

  • DIRAC Production Monitoring: Reco06

These links show similar information, but grouped by all current reconstruction and merging productions:

Finally you can (after many clicks) look in the BKK to see if DSTs are there.

Howto: General Job success rate

Jobs may fail for many reasons. The difficulty for the shifter is that some failures are important and some are not, and some will already be well known while others will be new.

  • A common job failure minor status is "Input Resolution Errors". This can be transitory when a new production is launched and some disk servers become overloaded. However, these jobs are resubmitted automatically, and if they then run there is no problem.
  • Jobs which retry too many times will time out. These show up with a final minor status of "Watchdog Identified Jobs as Stalled". These are cause for concern.
  • Jobs have been seen to fail at a site which is known to have its data servers down but which still gets requests for data. This is quite hard for the shifter: they will observe lots of failed jobs continuing over days, but the problem may well have been reported long ago and remedial work may already be underway.

A set of monitoring plots has been assembled to help you try to diagnose problems.

Firstly there are some overview plots.

If there are lots of production jobs failing then you need to investigate. This may be a new problem associated with the current processing of recent runs, or it may be some manifestation of an old problem. Which site is the problem connected with? Are the failures associated with an old or a current production? Are the failures in reconstruction or merging? Are the failures retries associated with some site which has a known problem? If it looks like an old problem there will likely be a comment in the logbook. You can drill down with these links:

At this point you may see that, for example, "site-XYZ" is failing lots of jobs in "Merging" with "InputDataResolutionErrors". You probably want to identify which jobs/runs these are associated with. Go back to Current production productions to do this. From here you can look at the "run status" and also "show jobs" in the pop out menu. You can search for failing jobs in both reconstruction or merging and see what sites they are associated with.

Once you have used these monitoring plots the next line of diagnosis depends upon the shifter experience. One has to look at the job log outputs and see if there is any information which helps diagnose the problem. Don't be afraid to ask the GEOC if in doubt.

Where is there a simple, up-to-date site status monitor, so that the shifter can easily see the status and the known problems at each site?

================================================

Production Operations Meeting

A Production Operations Meeting takes place at the end of the morning shift and allows the morning Grid Shifter to highlight any recent or outstanding issues. Both the morning and afternoon Grid Shifter should attend. The morning Grid Shifter should give a report summarising the morning activities.

The Grid Shifter's report should contain:

  • Current production progress, jobs submitted, waiting etc.
  • Status of all Tier1 sites.
  • Recently observed failures, paying particular attention to previously-unknown problems.

Ending a Shift

At the end of each shift, morning Grid Shifters should:

  • Pass on the key (TCE5) for the Production Operations room to the next Grid Shifter.
  • Prepare a list of outstanding issues to be handed over to the next Grid Shifter and discussed in the Production Operations meeting.
  • Submit an ELOG report summarising the shift and any ongoing or unresolved issues.

Similarly, evening Grid Shifters should:

  • Place the key (TCE5) to the Productions Operations room in the secretariat key box.
  • Submit an ELOG report summarising the shift and any ongoing or unresolved issues.

End of Shift Period

At the end of a shift period the Grid Shifter may wish to unsubscribe from the various mailing lists (Sec. 6.4.1) in addition to returning the Production Operations room key, TCE5 (Sec. 6.4.2).

Miscellaneous

Return the key for the Production Operations Room (TCE5) to the secretariat or the next Grid Shifter.

ELOG

All Grid Shifter actions of note should be recorded in the ELOG. This has the benefit of allowing new Grid Shifters to familiarise themselves with recent problems affecting current productions. ELOG entries should contain as much relevant information as possible.

Typical ELOG Format

Each ELOG entry which reports a new problem should include as much relevant information as possible. This allows the production operations team to quickly determine the problem and apply a solution.

ELOG Entry for a New Problem

A typical ELOG entry for a new problem contains:

  • The relevant ProdID or ProdIDs.
  • An example JobID.
  • A copy of the relevant error message and output.
  • The number of affected jobs.
  • The Grid sites affected.
  • The time of the first and last occurrence of the problem.

Subsequent ELOG Entries

Once a problem has been logged it is useful to report the continuing status of the affected productions at the end of each shift.

If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG following the format outlined in the ELOG.

When to Submit an ELOG

Submit an ELOG in the following situations:

  • Jobs finalise with exceptions.

Exceptions

Jobs which finalise with an exception should be noted in the ELOG. The ELOG entry should contain:

  • The production ID.
  • An example job ID.
  • A copy of the relevant error messages.
  • The number of jobs in the production which have the same status.

Crashed Application

Should submit example error log for the crashed application.

Datacrash Emails

The Grid Shifter should filter the datacrash emails and determine if the crash reported is actually due to one of the applications. If so, then the Grid Shifter should submit an ELOG describing the problem and including an example error message. The Grid Shifter should ensure the “Applications” radio button is selected when submitting the ELOG report since this means that the relevant experts will be alerted to the problem.

ELOG Problems

If ELOG is down, send a notification email to lhcb-production@cernNOSPAMPLEASE.ch.

Procedures

If a problem is discovered it is very important to escalate it to the operations team. Assessing the scale of the problem is very important and Grid Shifters should attempt to answer the questions in section 9.1.1 as soon as possible.

On the Discovery of a Problem

Once a problem has been discovered it is important to assess the severity of the problem. Section 9.1.1 provides a checklist which the Grid Shifter should go through after discovering a problem. Additionally, there are a number of Grid-specific issues to consider (Sec. 9.1.2).

Standard Checklist

On the discovery of a new problem, attempt to provide answers to the following questions as quickly as possible:

  • How many jobs does the problem affect?
  • Are the central DIRAC services running normally?
  • Are all jobs affected?
  • When did the problem start?
  • When did the last successful job run in similar conditions?
  • Is it a DIRAC problem?
    • Can extra redundancy be introduced to the system?
    • Is there enough information available to determine the error?

Grid-Specific Issues

  • Was there an announcement of downtime for the site?
  • Is the problem specific to a single site?
    • Are all the CE’s at the site affected?
  • Is the problem systematic across sites with different backend storage technologies?
  • Is the problem specific to an SE?
    • Are there any stalled jobs at the site clustered in time? *Are other jobs successfully reading data from the SE?

Feature Requests

Before submitting a feature request, the user should:

* Identify conditions under which the feature is to be used. * Record all relevant information. * Identify a use-case for the new feature.

Figure 5: Browse current support issues.

Once the user has prepared all the relevant information, they should:

Figure 6: Savannah support submit.

Figure 7: Savannah support submit feature request.

Assuming the feature request has not been previously submitted, the user should then:

  • Navigate to the ``Support'' tab at the top of the page (Fig. 6) and click on ``submit''.
  • Ensure that the submission webform contains all relevant information (Fig. 7).
  • Set the severity option to ``wish''.
  • Set the privacy option to ``private''.
  • Submit the feature request.

Bug Reporting

Before submitting a bug report, the user should:

  • Identify conditions under which the bug occurs.
  • Record all relevant information.
  • Try to ensure that the bug is reproducible.

Once the user is convinced that the behaviour they are experiencing is a bug, they should then prepare to submit a bug report. Users should:

Figure 8: Browse current bugs.

Assuming the bug is new, the procedure to submit a bug report is as follows:

  • Navigate to the ``Support'' tab at the top of the page (Fig. 6) and click on ``submit''.
  • Ensure that the submission webform contains all relevant information (Fig. 9).
  • Set the appropriate severity of the problem.
  • Write a short and clear summary.
  • Set the privacy option to ``private''.
  • Submit the bug report.

Figure 9: Example bug report.

Software Unavailability

Symptom: Jobs fail to find at least one software package.

Software installation occurs during Service Availability Monitoring (SAM) tests. Sites which fail to find software packages should have failed at least part of their most recent SAM test.

Grid Shifter actions:

  • Submit an ELOG report listing the affected productions and sites.
  • Ban the relevant sites until they pass their SAM tests.

Grid Sites

Jobs submitted to the Grid will be scheduled to run at one of a number of Grid sites. The exact site at which a job is executed depends on the job requirements and the current status of all relevant grid sites. Grid sites are grouped into two tiers, Tier-1 and Tier-2. CERN is an exception: because it is also responsible for processing and archiving the RAW experimental data, it is also referred to as a Tier-0 site.

Tier-1 Sites

Tier-1 sites are used for Analysis, Monte Carlo production, file transfer and file storage in the LHCb Computing Model.

Tier-2 Sites

There are numerous Tier-2 sites with sites being added frequently. As such, it is of little worth presenting a list of all the current Tier-2 sites in this document. Tier-2 sites are used for MC production in the LHCb Computing Model.

Backend Storage Systems

Two backend storage technologies are employed at the Tier-1 sites, Castor and dCache. The Tier-1 sites which utilise each technology choice are summarised in the table below:

Backend Storage Tier-1 Site
Castor CERN, CNAF, RAL
dCache IN2P3, NIKHEF, GridKa, PIC

Jobs

The number of jobs created for a production varies depending on the exact requirements of the production. Grid Shifters are generally not required to create jobs for a production.

JobIDs

A particular job is tagged with the following information:

  • Production Identifier (ProdID), e.g. 00001234 - the 1234th production.
  • Job Identifier (JobID), e.g. 9876 - the 9876th job in the DIRAC system.
  • JobName, e.g. 00001234_00000019 - the 19th job in production 00001234.

Job Status

The job status of a successful job proceeds in the following order:

  1. Received,
  2. Checking,
  3. Staging,
  4. Waiting,
  5. Matched,
  6. Running,
  7. Completed,
  8. Done.

Jobs which return no heartbeat have a status of ``Stalled'' and jobs where any workflow modules return an error status are classed as ``Failed''.

The basic flowchart describing the evolution of a job's status can be found in figure 1. Jobs are only ``Grid-active'' once they have reached the ``Matched'' phase.

Job status flowchart. Note that the ``Checking'' and ``Staging'' status are omitted.

Figure 1: Job status flowchart. Note that the ``Checking'' and ``Staging'' status are omitted.

Job Output

The standard output and standard error of a job can be accessed through the API, the CLI and the webpage via a global job ``peek''.

Job Output via the CLI

The std.out and std.err for a given job can be retrieved using the CLI command:

dirac-wms-job-get-output <JobID> | [<JobID>]
This creates a directory containing the std.out and std.err for each JobID entered. Standard tools can then be used to search the output for specific strings, e.g. ``FATAL''.

To simply view the last few lines of a job's std.out (``peek'') use:

dirac-wms-job-peek <JobID> | [<JobID>]

Job Output via the Job Monitoring Webpage

There are two methods to view the output of a job via the Job Monitoring Webpage. The first returns the last 20 lines of the std.out and the second allows the Grid Shifter to view all the output files.

Figure 2: Peek the std.out of a job via the Job Monitoring Webpage.

To ``peek'' the std.out of a job:

  1. Navigate to the Job Monitoring Webpage.
  2. Select the relevant filters from the left panel.
  3. Click on a job.
  4. Select ``StandardOutput'' (Fig. 2).

Figure 3: View all the output files of a job via the Job Monitoring Webpage.

Similarly, to view all output files for a job:

  1. Navigate to the Job Monitoring Webpage.
  2. Select the relevant filters from the left panel.
  3. Click on a job.
  4. Select ``Get Logfile'' (Fig. 3).

This method can be particularly quick if the Grid Shifter only wants to check the output of a selection of jobs.

Job Pilot Output

The output of the Job Pilot can also be retrieved via the API, the CLI or the Webpage.

Job Pilot Output via the CLI

To obtain the Job Pilot output using the CLI, use:

dirac-admin-get-pilot-output <Grid pilot reference> [<Grid pilot reference>]
This creates a directory for each JobID containing the Job Pilot output.

Job Pilot Output via the Job Monitoring Webpage

Viewing the std.out and std.err of a Job Pilot via the Job Monitoring Webpage is achieved by:

  1. Navigate to the Job Monitoring Webpage.
  2. Select the relevant filters from the left panel.
  3. Click on a job.
  4. Select ``Pilot'' then ``Get StdOut'' or ``Get StdErr'' (Fig. 4).

Figure 4: View the pilot output of a job via the Job Monitoring Webpage.

Operations on Jobs

The full list of scripts which can be used to perform operations on a job is given in DIRAC Scripts. The name of each script should be a clear indication of its purpose. Running a script without arguments will print basic usage notes.

Productions

As a Grid Shifter you will be required to monitor the official LHCb productions. Each production is assigned a unique Production ID (ProdID). These consist of Monte Carlo (MC) generation, data stripping and CCRC productions. Production creation will generally be performed by the Production Operations Manager and is not a duty of the Grid Shifter.

The current list of all active productions can be obtained with the command:

dirac-production-list-active
The command also gives the current submission status of the active productions.
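As a sketch of a typical first step (the ProdID is invented), the active productions can be listed and one of them inspected in more detail:

# List all active productions together with their submission status
dirac-production-list-active
# Then drill into one production of interest
dirac-production-job-summary 00001234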

Monitoring a Production

Jobs in each production should be periodically checked for failed jobs (Sec. 4.2.1) and to ensure that jobs are progressing (Sec. 4.2.2).

When monitoring a production, a Grid Shifter should be aware of a number of issues which can cause jobs to fail:

  • Staging.
  • Stalled Jobs.
  • Segmentation faults.
  • DB access.
  • Software problems.
  • Data access.
  • Shared area access.
  • Site downtime.
  • Problematic files.
  • Excessive runtime.

Failed Jobs

A Grid Shifter should monitor a production for failed jobs and jobs which are not progressing. Due to the various configurations of all the sites it is occasionally not possible for an email to be sent to the lhcb-datacrash mailing list for each failed job. It is therefore not enough to simply rely on the number of lhcb-datacrash emails to indicate if there are any problems with a production. In addition to any lhcb-datacrash notifications, the Grid Shifter should also check the number of failed jobs in a production via the CLI or the Production Monitoring Webpage.

Using the CLI, the command:

dirac-production-progress [<Production ID>]
entered without any arguments will return a breakdown of the jobs of all current productions. Entering one or more ProdIDs returns only the breakdown of those productions.

A more detailed breakdown is provided by:

dirac-production-job-summary <Production ID> [<DIRAC Status>]
which also includes the minor status of each job category and provides an example JobID for each category. The example JobIDs can then be used to investigate the failures further.

Beware of failed jobs which have been killed - when a production is complete, the remaining jobs may be automatically killed by DIRAC. Killed jobs like this are ok.

Non-Progressing Jobs

In addition to failed jobs, jobs which do not progress should also be monitored. Particular attention should be paid to jobs in the states ``Waiting'' and ``Staging''. Problematic jobs at this stage are easily overlooked since the associated problems are not easily identifiable.

Non-Starting Jobs

Jobs arriving at a site but then failing to start have multiple causes. One of the most common reasons is that a site is due to enter scheduled downtime and are no longer submitting jobs to the batch queues. Jobs will stay at the site in a ``Waiting'' state and state that there are no CE's available. Multiple jobs in this state should be reported.

Merging Productions

Each MC Production should have an associated Merging Production which merges the output files together into more manageable file sizes. Ensure that the number of files available to the Merging Production increases in proportion to the number of successful jobs of the MC Production. If the number of files does not increase, this can point to a problem in the Bookkeeping which should be reported.

Ending a Production

Ending a completed production is handled by the Productions Operations Manager (or equivalent). No action is required on the part of the Grid Shifter.

Web Production Monitor

Production monitoring via the web is possible through the Production Monitoring Webpage. A valid grid certificate loaded into your browser is required to use the webpage.

Features

The Production Monitoring Webpage has the following features:

Site Downtime Calendar

The calendar [6] displays all the sites with scheduled and unscheduled downtime. Calendar entries are automatically parsed through the site downtime RSS feed and added to the calendar.

Occasionally the feed isn't parsed correctly and Grid Shifters should double-check that the banned and allowed sites are correct. Useful scripts for this are:

dirac-admin-get-banned-sites
and
dirac-admin-get-site-mask

Shifts

Grid Shifters are required to monitor all the current LHCb productions and must have a valid Grid Certificate and be a member of the LHCb VO.

Before a Shift Period

The new shifter should:

Grid Certificates

A Grid certificate is mandatory for Grid Shifters. If you don't have a certificate you should register for one through CERN LCG and apply to join the LHCb Virtual Organisation (VO).

To access the production monitoring webpages you will also need to load your certificate into your browser. Detailed instructions on how to do this can be found on the CERN LCG pages.

Web Resources

Primary web-based resources for DIRAC 3 production shifts:

Mailing Lists

The new Grid Shifter should subscribe to the following mailing lists:

Note that both the lhcb-datacrash and lhcb-production mailing lists receive a substantial amount of mail daily. It's suggested that suitable message filters and folders are created in your mail client of choice.

Production Operations Key

The new shifter should obtain the Production Operations key (TCE5) from the LHCb secretariat or the previous Grid Shifter.

During a Shift

During a shift Grid Shifters are expected to monitor all current productions and be aware of the current status of the Tier1 sites. A knowledge of the purpose of each production is also useful and aids in determining the probable cause of any failed jobs.

Daily Actions

Grid Shifters are expected to carry out the following daily actions for sites used in the current productions:

  • Trigger submission of pending productions.
  • Monitor active productions.
  • Check transfer status.
  • Verify that the staging at each site is functional.
  • Check that there is a minimum of one successful (and complete) job.
  • Confirm that data access is working at least intermittently.
  • Report problems to the operations team.
  • Submit a summary of the job status at all the grid sites to the ELOG 7.
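A rough sketch of how some of these daily checks can be driven from the CLI, using only commands documented in this guide (the ProdID is invented; the ELOG summary itself still has to be written by hand):

#!/bin/bash
# Daily shift checks (sketch only)
dirac-production-list-active            # which productions are active?
dirac-production-progress               # job breakdown for each active production
dirac-admin-get-banned-sites            # any sites currently banned?
dirac-admin-get-site-mask               # full site mask, to compare with the downtime calendar
dirac-production-job-summary 00001234   # closer look at one production of interest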

Performance Monitoring

Grid Shifters should view the plots accessible via the DIRACSystemMonitoring page at least three times a day and investigate any unusual features present.

Production Operations Meeting

A Production Operations Meeting takes place at the end of the morning shift and allows the morning Grid Shifter to highlight any recent or outstanding issues. Both the morning and afternoon Grid Shifter should attend. The morning Grid Shifter should give a report summarising the morning activities.

The Grid Shifter's report should contain:

  • Current production progress, jobs submitted, waiting etc.
  • Status of all Tier1 sites.
  • Recently observed failures, paying particular attention to previously-unknown problems.

Ending a Shift

At the end of each shift, morning Grid Shifters should:

  • Pass on the key (TCE5) for the Production Operations room to the next Grid Shifter.
  • Prepare a list of outstanding issues to be handed over to the next Grid Shifter and discussed in the Production Operations meeting.
  • Submit an ELOG report summarising the shift and any ongoing or unresolved issues.

Similarly, evening Grid Shifters should:

  • Place the key (TCE5) to the Productions Operations room in the secretariat key box.
  • Submit an ELOG report summarising the shift and any ongoing or unresolved issues.

End of Shift Period

At the end of a shift period the Grid Shifter may wish to unsubscribe from the various mailing lists (Sec. 6.4.1) in addition to returning the Production Operations room key, TCE5 (Sec. 6.4.2).

Mailing Lists

Unsubscribe from the following mailing lists:

  • lhcb-datacrash.
  • lhcb-dirac-developers.
  • lhcb-dirac.
  • lhcb-production.

Miscellaneous

Return the key for the Production Operations Room (TCE5) to the secretariat or the next Grid Shifter.

Weekly Report

A weekly report should be prepared by the Grid Shifter at the end of each week. The report should contain information on all the processed production and user jobs, the respective failure rates and some basic analysis of the results. The report should be compiled on the last day of the shift and contain information about the previous seven full days of operation, i.e. it should not include information from the day the report is compiled.

The weekly reports are to be uploaded to the Weekly Reports Page on the LHCb Computing tWiki. Grid Shifters should use the template provided when compiling a report.

Analysis and Summary

A summary of each group of plots should be written to aid the next Grid Shifter’s appraisal of the current situation and to enable the Grid Expert on duty to investigate problems further.

ELOG

All Grid Shifter actions of note should be recorded in the ELOG. This has the benefit of allowing new Grid Shifters to familiarise themselves with recent problems affecting current productions. ELOG entries should contain as much relevant information as possible.

Typical ELOG Format

Each ELOG entry which reports a new problem should include as much relevant information as possible. This allows the production operations team to quickly determine the problem and apply a solution.

ELOG Entry for a New Problem

A typical ELOG entry for a new problem contains the items below; a sketch of commands which can gather this information follows the list.

  • The relevant ProdID or ProdIDs.
  • An example JobID.
  • A copy of the relevant error message and output.
  • The number of affected jobs.
  • The Grid sites affected.
  • The time of the first and last occurrence of the problem.
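
Much of this information can be gathered with the DIRAC command-line tools listed later in this guide. The sketch below is illustrative only: the ProdID and JobID are placeholders and the exact arguments are assumptions.

  # Summary of job states for the affected production (placeholder ProdID)
  dirac-production-job-summary 12504

  # Status, logging history and application output of an example failed job (placeholder JobID)
  dirac-wms-job-status 1234567
  dirac-wms-job-logging-info 1234567
  dirac-wms-job-peek 1234567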

Subsequent ELOG Entries

Once a problem has been logged it is useful to report the continuing status of the affected productions at the end of each shift.

If a Grid Shifter is unsure whether a problem has been previously logged then they should submit a fresh ELOG following the standard format outlined above.

When to Submit an ELOG

Submit an ELOG in the following situations:

  • Jobs finalise with exceptions.

Exceptions

Jobs which finalise with an exception should be noted in the ELOG. The ELOG entry should contain:

  • The production ID.
  • An example job ID.
  • A copy of the relevant error messages.
  • The number of jobs in the production which have the same status.

Crashed Application

The Grid Shifter should submit an example error log for the crashed application.

Datacrash Emails

The Grid Shifter should filter the datacrash emails and determine if the crash reported is actually due to one of the applications. If so, then the Grid Shifter should submit an ELOG describing the problem and including an example error message. The Grid Shifter should ensure the “Applications” radio button is selected when submitting the ELOG report since this means that the relevant experts will be alerted to the problem.

ELOG Problems

If ELOG is down, send a notification email to lhcb-production@cernNOSPAMPLEASE.ch.

Procedures

If a problem is discovered it is very important to escalate it to the operations team. Assessing the scale of the problem is equally important, and Grid Shifters should attempt to answer the questions in Sec. 9.1.1 as soon as possible.

On the Discovery of a Problem

Once a problem has been discovered it is important to assess the severity of the problem. Section 9.1.1 provides a checklist which the Grid Shifter should go through after discovering a problem. Additionally, there are a number of Grid-specific issues to consider (Sec. 9.1.2).

Standard Checklist

On the discovery of a new problem, attempt to provide answers to the following questions as quickly as possible:

  • How many jobs does the problem affect?
  • Are the central DIRAC services running normally?
  • Are all jobs affected?
  • When did the problem start?
  • When did the last successful job run in similar conditions?
  • Is it a DIRAC problem?
    • Can extra redundancy be introduced to the system?
    • Is there enough information available to determine the error?

Grid-Specific Issues

  • Was there an announcement of downtime for the site?
  • Is the problem specific to a single site?
    • Are all the CEs at the site affected?
  • Is the problem systematic across sites with different backend storage technologies?
  • Is the problem specific to an SE?
    • Are there any stalled jobs at the site clustered in time?
    • Are other jobs successfully reading data from the SE?

Feature Requests

Before submitting a feature request, the user should:

  • Identify conditions under which the feature is to be used.
  • Record all relevant information.
  • Identify a use-case for the new feature.

Figure 5: Browse current support issues.

Once the user has prepared all the relevant information, they should browse the current support issues (Fig. 5) to check whether the feature has already been requested.

Figure 6: Savannah support submit.

Figure 7: Savannah support submit feature request.

Assuming the feature request has not been previously submitted, the user should then:

  • Navigate to the "Support" tab at the top of the page (Fig. 6) and click on "submit".
  • Ensure that the submission webform contains all relevant information (Fig. 7).
  • Set the severity option to "wish".
  • Set the privacy option to "private".
  • Submit the feature request.

Bug Reporting

Before submitting a bug report, the user should:

  • Identify conditions under which the bug occurs.
  • Record all relevant information.
  • Try to ensure that the bug is reproducible.

Once the user is convinced that the behaviour they are experiencing is a bug, they should then prepare to submit a bug report. Users should first browse the current list of bugs (Fig. 8) to check whether the bug has already been reported.

Figure 8: Browse current bugs.

Assuming the bug is new, the procedure to submit a bug report is as follows:

  • Navigate to the "Support" tab at the top of the page (Fig. 6) and click on "submit".
  • Ensure that the submission webform contains all relevant information (Fig. 9).
  • Set the appropriate severity of the problem.
  • Write a short and clear summary.
  • Set the privacy option to "private".
  • Submit the bug report.

Figure 9: Example bug report.

Software Unavailability

Symptom: Jobs fail to find at least one software package.

Software installation occurs during Service Availability Monitoring (SAM) tests. Sites which fail to find software packages should have failed at least part of their most recent SAM test.

Grid Shifter actions:

  • Submit an ELOG report listing the affected productions and sites.
  • Ban the relevant sites until they pass their SAM tests (see the example below).
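
A minimal sketch of banning and later re-allowing a site with the admin scripts listed in the DIRAC 3 Scripts section; the site name is a placeholder and the free-text comment argument is an assumption about the script interface.

  # Ban a site which fails its SAM software-installation tests (placeholder site name)
  dirac-admin-ban-site LCG.PIC.es "Fails SAM tests: LHCb software not found"

  # Re-allow the site once it passes its SAM tests again
  dirac-admin-allow-site LCG.PIC.es "SAM tests OK again"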

DIRAC 3 Scripts

DIRAC Admin Scripts

  • dirac-admin-accounting-cli
  • dirac-admin-add-user
  • dirac-admin-allow-site
  • dirac-admin-ban-site
  • dirac-admin-delete-user
  • dirac-admin-get-banned-sites
  • dirac-admin-get-job-pilot-output
  • dirac-admin-get-job-pilots
  • dirac-admin-get-pilot-output
  • dirac-admin-get-proxy
  • dirac-admin-get-site-mask
  • dirac-admin-list-hosts
  • dirac-admin-list-users
  • dirac-admin-modify-user
  • dirac-admin-pilot-summary
  • dirac-admin-reset-job
  • dirac-admin-service-ports
  • dirac-admin-site-info
  • dirac-admin-sync-users-from-file
  • dirac-admin-upload-proxy
  • dirac-admin-users-with-proxy
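
As an illustration, a shifter investigating a problematic job or site might use some of these scripts as sketched below; the JobID and site name are placeholders and the argument conventions are assumptions.

  # Retrieve the output of the pilot which ran a given job (placeholder JobID)
  dirac-admin-get-job-pilot-output 1234567

  # Show information about a particular site (placeholder site name)
  dirac-admin-site-info LCG.GRIDKA.de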

DIRAC Bookkeeping Scripts

  • dirac-bookkeeping-eventMgt
  • dirac-bookkeeping-eventtype-mgt
  • dirac-bookkeeping-ls
  • dirac-bookkeeping-production-jobs
  • dirac-bookkeeping-production-informations

DIRAC Clean

  • dirac-clean

DIRAC Configuration

  • dirac-configuration-cli

DIRAC Distribution

  • dirac-distribution

DIRAC DMS

  • dirac-dms-add-file
  • dirac-dms-get-file
  • dirac-dms-lfn-accessURL
  • dirac-dms-lfn-logging-info
  • dirac-dms-lfn-metadata
  • dirac-dms-lfn-replicas
  • dirac-dms-pfn-metadata
  • dirac-dms-pfn-accessURL
  • dirac-dms-remove-pfn
  • dirac-dms-remove-lfn
  • dirac-dms-replicate-lfn
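
For example, when investigating data-access problems a shifter might check where a file is replicated and, if asked to by the Grid Expert, trigger an additional replica. The LFN and destination SE below are placeholders and the argument order is an assumption.

  # List the known replicas of a file (placeholder LFN)
  dirac-dms-lfn-replicas /lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/12345/012345_0000000001.raw

  # Replicate the file to another storage element (placeholder SE name)
  dirac-dms-replicate-lfn /lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/12345/012345_0000000001.raw CERN-USER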

DIRAC Embedded

  • dirac-embedded-external

DIRAC External

  • dirac-external

DIRAC Fix

  • dirac-fix-ld-library-path

DIRAC Framework

  • dirac-framework-ping-service

DIRAC Functions

  • dirac-functions.sh

DIRAC Group

  • dirac-group-init

DIRAC Jobexec

  • dirac-jobexec

DIRAC LHCb

  • dirac-lhcb-job-replica
  • dirac-lhcb-manage-software
  • dirac-lhcb-production-job-check
  • dirac-lhcb-sam-submit-all
  • dirac-lhcb-sam-submit-ce

DIRAC Myproxy

  • dirac-myproxy-upload

DIRAC Production

  • dirac-production-application-summary
  • dirac-production-change-status
  • dirac-production-job-summary
  • dirac-production-list-active
  • dirac-production-list-all
  • dirac-production-list-id
  • dirac-production-logging-info
  • dirac-production-mcextend
  • dirac-production-manager-cli
  • dirac-production-progress
  • dirac-production-set-automatic
  • dirac-production-set-manual
  • dirac-production-site-summary
  • dirac-production-start
  • dirac-production-stop
  • dirac-production-submit
  • dirac-production-summary
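
As an illustration, the progress and per-site breakdown of a production can be checked as sketched below; the ProdID is a placeholder and the arguments are assumptions. Starting or stopping a production should only be done when instructed by the Grid Expert or production manager.

  # Progress and per-site breakdown of a production (placeholder ProdID)
  dirac-production-progress 12504
  dirac-production-site-summary 12504

  # Stop and restart a production (only when instructed to do so)
  dirac-production-stop 12504
  dirac-production-start 12504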

DIRAC Proxy

  • dirac-proxy-info
  • dirac-proxy-init
  • dirac-proxy-upload
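
A valid proxy is required before running most of the commands in this section; a minimal sketch is shown below, assuming lhcb_prod is the appropriate DIRAC group for production shift work.

  # Create a proxy with the production role (the group name is an assumption)
  dirac-proxy-init -g lhcb_prod

  # Check the owner and remaining lifetime of the current proxy
  dirac-proxy-info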

DIRAC Update

  • dirac-update

DIRAC WMS

  • dirac-wms-job-delete
  • dirac-wms-job-get-output
  • dirac-wms-job-get-input
  • dirac-wms-job-kill
  • dirac-wms-job-logging-info
  • dirac-wms-job-parameters
  • dirac-wms-job-peek
  • dirac-wms-job-status
  • dirac-wms-job-submit
  • dirac-wms-job-reschedule
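
A typical follow-up of an individual problematic job might look like the sketch below; the JobID is a placeholder.

  # Retrieve the output sandbox of a completed or failed job (placeholder JobID)
  dirac-wms-job-get-output 1234567

  # Reschedule a job, or kill it if it should not run again
  dirac-wms-job-reschedule 1234567
  dirac-wms-job-kill 1234567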

Common Acronyms

ACL
Access Control Lists
API
Application Programming Interface
ARC
Advance Resource Connector
ARDA
A Realisation of Distributed Analysis
BDII
Berkeley Database Information Index
BOSS
Batch Object Submission System
CA
Certification Authority
CAF
CDF Central Analysis Farm
CCRC
Common Computing Readiness Challenge
CDF
Collider Detector at Fermilab
CE
Computing Element
CERN
Organisation Européenne pour la Recherche Nucléaire: Switzerland/France
CNAF
Centro Nazionale per la Ricerca e Sviluppo nelle Tecnologie Informatiche e Telematiche: Italy
ConDB
Conditions Database
CPU
Central Processing Unit
CRL
Certificate Revocation List
CS
Configuration Service
DAG
Directed Acyclic Graph
DC04
Data Challenge 2004
DC06
Data Challenge 2006
DCAP
Data Link Switching Client Access Protocol
DIAL
Distributed Interactive Analysis of Large datasets
DIRAC
Distributed Infrastructure with Remote Agent Control
DISET
DIRAC Secure Transport
DLI
Data Location Interface
DLLs
Dynamically Linked Libraries
DN
Distinguished Name
DNS
Domain Name System
DRS
Data Replication Service
DST
Data Summary Tape
ECAL
Electromagnetic CALorimeter
EGA
Enterprise Grid Alliance
EGEE
Enabling Grids for E-sciencE
ELOG
Electronic Log
ETC
Event Tag Collection
FIFO
First In First Out
FTS
File Transfer Service
GASS
Global Access to Secondary Storage
GFAL
Grid File Access Library
GGF
Global Grid Forum
GIIS
Grid Index Information Service
GLUE
Grid Laboratory Uniform Environment
GRAM
Grid Resource Allocation Manager
GridFTP
Grid File Transfer Protocol
GridKa
Grid Computing Centre Karlsruhe
GriPhyN
Grid Physics Network
GRIS
Grid Resource Information Server
GSI
Grid Security Infrastructure
GT
Globus Toolkit
GUI
Graphical User Interface
GUID
Globally Unique IDentifier
HCAL
Hadron CALorimeter
HEP
High Energy Physics
HLT
High Level Trigger
HTML
Hyper-Text Markup Language
HTTP
Hyper-Text Transfer Protocol
I/O
Input/Output
IN2P3
Institut National de Physique Nucléaire et de Physique des Particules: France
iVDGL
International Virtual Data Grid Laboratory
JDL
Job Description Language
JobDB
Job Database
JobID
Job Identifier
L0
Level 0
LAN
Local Area Network
LCG
LHC Computing Grid
LCG IS
LCG Information System
LCG UI
LCG User Interface
LCG WMS
LCG Workload Management System
LDAP
Lightweight Directory Access Protocol
LFC
LCG File Catalogue
LFN
Logical File Name
LHC
Large Hadron Collider
LHCb
Large Hadron Collider beauty
LSF
Load Sharing Facility
MC
Monte Carlo
MDS
Monitoring and Discovery Service
MSS
Mass Storage System
NIKHEF
National Institute for Subatomic Physics: Netherlands
OGSA
Open Grid Services Architecture
OGSI
Open Grid Services Infrastructure
OSG
Open Science Grid
P2P
Peer-to-peer
Panda
Production ANd Distributed Analysis
PC
Personal Computer
PDC1
Physics Data Challenge
PFN
Physical File Name
PIC
Port d’Informació Científica: Spain
PKI
Public Key Infrastructure
POOL
Pool Of persistent Objects for LHC
POSIX
Portable Operating System Interface
PPDG
Particle Physics Data Grid
ProdID
Production Identifier
PS
Preshower Detector
R-GMA
Relational Grid Monitoring Architecture
RAL
Rutherford-Appleton Laboratory: UK
RB
Resource Broker
rDST
reduced Data Summary Tape
RFIO
Remote File Input/Output
RICH
Ring Imaging CHerenkov
RM
Replica Manager
RPC
Remote Procedure Call
RTTC
Real Time Trigger Challenge
SAM
Service Availability Monitoring
SE
Storage Element
SOA
Service Oriented Architecture
SOAP
Simple Object Access Protocol
SPD
Scintillator Pad Detector
SRM
Storage Resource Manager
SSL
Secure Socket Layer
SURL
Storage URL
TCP/IP
Transmission Control Protocol / Internet Protocol
TDS
Transient Detector Store
TES
Transient Event Store
THS
Transient Histogram Store
TT
Trigger Tracker
TURL
Transport URL
URL
Uniform Resource Locator
VDT
Virtual Data Toolkit
VELO
VErtex LOcator
VO
Virtual Organisation
VOMS
Virtual Organisation Membership Service
WAN
Wide Area Network
WMS
Workload Management System
WN
Worker Node
WSDL
Web Services Description Language
WSRF
Web Services Resource Framework
WWW
World Wide Web
XML
eXtensible Markup Language
XML-RPC
XML Remote Procedure Call

-- PaulSzczypka - 14 Aug 2009

Development Section for monitoring plots.

These are being assembled at https://twiki.cern.ch/twiki/bin/view/Main/PeterClarkeShiftProcesses

-- PeterClarke - 19-Oct-2010

 