Difference: ProductionShifterGuide (20 vs. 21)

Revision 21 2010-11-16 - FedericoStagni

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"

Grid Shifter Guide

Line: 35 to 35
 

Tier-1 Sites

Tier-1 sites are used for Analysis, Monte Carlo production, file transfer and file storage in the LHCb Computing Model.

Changed:
<
<
  • LCG.CERN.ch
>
>
  • LCG.CERN.ch (acting also as a Tier 0)
 
  • LCG.CNAF.it
  • IN2P3.fr
  • LCG.NIKHEF.nl
Added:
>
>
  • LCG.SARA.nl
 
Line: 50 to 51
 

Backend Storage Systems

Changed:
<
<
Two backend storage technologies are employed at the Tier-1 sites, Castor and dCache. The Tier-1 sites which utilise each technology choice are summarised in the table below:
>
>
Three backend storage technologies are employed at the Tier-1 sites: Castor, dCache and StoRM. The Tier-1 sites which utilise each technology are summarised in the table below:
 
Backend Storage Tier-1 Site
Changed:
<
<
Castor CERN, CNAF, RAL
>
>
Castor CERN, RAL
 
dCache IN2P3, NIKHEF, GridKa, PIC
Added:
>
>
StoRM CNAF
 

Jobs

Changed:
<
<
The number of jobs created for a productions varies depending on the exact requirements of the production. Grid Shifters are generally not required to create jobs for a production.
>
>
The number of jobs created for a production varies depending on the exact requirements of the production. Grid Shifters are generally not required to create jobs for a production.
 

JobIDs

Line: 293 to 295
  The Production Monitoring Webpage has the following features:
Deleted:
<
<
 
Deleted:
<
<

Site Downtime Calendar

The calendar [6] displays all the sites with scheduled and unscheduled downtime. Calendar entries are automatically parsed through the site downtime RSS feed and added to the calendar.

Occasionally the feed isn't parsed correctly and Grid Shifters should double-check that the banned and allowed sites are correct. Useful scripts for this are:

dirac-admin-get-banned-sites
and
dirac-admin-get-site-mask

Plots

The Production Monitoring Webpage has the capacity to produce various plots. Many of which are extremely useful to monitor the performance of the production system.

Links to useful plots can be found on the DIRAC System Monitoring Pages. These plots should be monitored at three times daily.

 

Buglist and Feature Request

The procedure to submit a bug report or a feature request is outlined in Procedures.

Line: 333 to 312
  The new shifter should:
Changed:
<
<
>
>
  • Ensure their Grid certificate is valid for all expected duties
  • Create accounts on all relevant web-resources
  • Subscribe to the relevant mailing lists
 

Grid Certificates

Line: 386 to 365
 
  • Check that there is a minimum of one successful (and complete) job.
  • Confirm that data access is working at least intermittently.
  • Report problems to the operations team.
Changed:
<
<
  • Submit a summary of the job status at all the grid sites to the ELOG 7.
>
>
  • Submit a summary of the job status at all the grid sites to the ELOG.
 

Performance Monitoring

Line: 436 to 415
  Return the key for the Production Operations Room (TCE5) to the secretariat or the next Grid Shifter.
Deleted:
<
<

Weekly Report

A weekly report should be prepared by the Grid Shifter at the end of each week. The report should contain information on all the processed production and user jobs, the respective failure rates and some basic analysis of the results. The report should be compiled on the last day of the shift and contain information about the previous seven full days of operation, i.e. it should not include information from the day the report is compiled.

The weekly reports are to be uploaded to the Weekly Reports Page on the LHCb Computing tWiki. Grid Shifters should use the template provided when compiling a report.

 

Base Plots

Line: 482 to 452
 

Machine Monitoring Plots

Monitoring of the LHCb VO boxes is vital to maintaining the efficient running of all Grid operations. Particular attention should be paid to the used and free space on the various

Changed:
<
<
disks, network and CPU usage. Reports on the state of the following boxes should be constructed:
  • vobox01
  • vobox06
  • vobox09
  • vobox10

For each machine, save and then upload the plots for:

  • CPU utilization
  • Network utilization
  • Partition Used
  • Swap Used
Note: Mac users may find that the suggested name when saving the plots does not follow the format “*.gif.png” and they should take care to either rename the saved files or edit that week’s report page accordingly.
>
>
disks, network and CPU usage. The machines can be monitored using Lemon.
 

Analysis and Summary

Line: 534 to 491
 

When to Submit an ELOG

Changed:
<
<
Submit an ELOG in the following situations:
>
>
An ELOG has to be submitted in cases including, but not limited to, the following:
 
  • Jobs finalise with exceptions.
Changed:
<
<
>
>
  • Applications running within the job crash with exceptions.
  • A production is stuck/does not proceed/is failing all the jobs/...
  • Site related problems:
    • A large number/percentage of pilots are aborting
    • Shared area slowness (e.g. jobs failed with Application status = "SetupProject.sh execution failed")
    • The site is killing a suspiciously high number of jobs.
    • ...
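
The "large number/percentage of pilots are aborting" criterion above can be made concrete with a small helper. The following is a minimal sketch only, not part of DIRAC: the function names, the status string "Aborted", the 50% threshold and the minimum pilot count are all assumptions chosen for illustration, and the per-site pilot statuses are assumed to have been collected by the shifter beforehand (for example from the monitoring pages).

```python
from collections import Counter


def aborted_pilot_fraction(statuses):
    """Return the fraction of pilots whose status is 'Aborted'.

    `statuses` is a list of status strings for one site. The literal
    'Aborted' is an assumption about how the status is spelled.
    """
    if not statuses:
        return 0.0
    counts = Counter(statuses)
    return counts["Aborted"] / len(statuses)


def sites_to_report(pilots_by_site, threshold=0.5, min_pilots=10):
    """Return {site: aborted fraction} for sites worth an ELOG entry.

    `threshold` and `min_pilots` are hypothetical values; sites with
    fewer than `min_pilots` pilots are skipped to avoid noisy alarms.
    """
    flagged = {}
    for site, statuses in pilots_by_site.items():
        if len(statuses) < min_pilots:
            continue
        fraction = aborted_pilot_fraction(statuses)
        if fraction >= threshold:
            flagged[site] = fraction
    return flagged


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {
        "LCG.CERN.ch": ["Aborted"] * 8 + ["Done"] * 2,
        "LCG.CNAF.it": ["Done"] * 10,
    }
    for site, fraction in sites_to_report(sample).items():
        print(f"{site}: {fraction:.0%} of pilots aborted")
```

The threshold is deliberately left as a parameter: what counts as a "suspiciously high" rate is a judgement call for the shifter and the operations team, not something this sketch can decide.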
 

Exceptions

 