LCG Grid Deployment - CERN ROC

CERN ROC weekly report

  • FIO fills the CERN-PROD part
  • ROC manager fills the ROC CERN part, taking the C5 reports as input
  • PPS manager fills the CERN PPS site part, taking the C5 reports as input
  • The ROC on duty of the previous week adds tickets and other items that were not reported in other sections of the report.
  • ROC on duty checks if the Tier-1 sites (FNAL, BNL, and TRIUMF) have entered their report.

GGUS issues

How to request new features in GGUS (e.g. creation of a new support unit)

  1. Go to www.ggus.org.
  2. Select "Submit a request for a new feature to GGUS". The direct link is https://savannah.cern.ch/support/?func=additem&group=esc

How to handle "temporary hack solutions" in GGUS

If a ticket yields a solution that fixes the problem only temporarily, and a more stable fix is known to be possible and due, the ROC operator on duty should

  1. set the user ticket to "solved" and state in the solution field that a solution has been found but that a more stable/permanent one is needed, warning the user that the problem might recur and advising them to open a new ticket in GGUS if it does,
  2. open a new ticket describing the problem, the temporary fix, and the real solution (if known)
  3. link the old closed ticket to the new one using the "relates issue" field in the ticket
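
The three steps above can be sketched as a small in-memory model. This is only an illustration of the bookkeeping, not GGUS code: the `Ticket` class, its fields, and the `handle_temporary_fix` function are hypothetical names, and recording the link on both tickets is an assumption about how the "relates issue" field is used.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_ids = count(1)  # hypothetical local numbering, not real GGUS ticket IDs


@dataclass
class Ticket:
    """Minimal stand-in for a GGUS ticket (hypothetical fields)."""
    description: str
    status: str = "open"
    solution: str = ""
    related: List[int] = field(default_factory=list)
    id: int = field(default_factory=lambda: next(_ids))


def handle_temporary_fix(user_ticket: Ticket, temporary_fix: str,
                         real_solution: Optional[str] = None) -> Ticket:
    # Step 1: solve the user ticket and warn that the problem might recur.
    user_ticket.status = "solved"
    user_ticket.solution = (
        f"Temporary solution applied: {temporary_fix}. "
        "A more permanent solution is still needed, so the problem might recur; "
        "please open a new GGUS ticket if it does."
    )
    # Step 2: open a new ticket with the problem, the temporary fix,
    # and the real solution (if known).
    followup = Ticket(
        description=(
            f"Problem: {user_ticket.description}\n"
            f"Temporary fix: {temporary_fix}\n"
            f"Real solution: {real_solution or 'not yet known'}"
        )
    )
    # Step 3: link the two tickets (assumed here to be recorded on both).
    followup.related.append(user_ticket.id)
    user_ticket.related.append(followup.id)
    return followup
```

For example, `handle_temporary_fix(Ticket("CE rejects jobs"), "restarted the gatekeeper")` returns the follow-up ticket, which stays open until the permanent fix is in place.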

CERN ROC Site Procedures

When a site wants to join

If a request by a site to join the grid is received, one should

  1. Evaluate whether the site will be under the CERN ROC or not.
    a. If it falls under ROC CERN, write an e-mail to the requester providing the necessary information and explaining the procedure for joining the grid. An example of the e-mail to send is in the How to join template.
    b. Otherwise, send an e-mail to the ROC under which the site should fall, asking whether it is willing to be the ROC for that site.
  2. Check that the site administrator has signed their candidature. If they have not, send an e-mail to remind them that they need to provide a digital signature. An example of the e-mail to send is in the Request to a site to sign their candidature template. If you are asked for a template of the statement that the site has to sign, you can use the following:
        "I have read and hereby accept the policy documents as requested in the ROC CERN e-mail. 
         All administrators and other necessary personnel at our site will be informed of and agree to 
         abide by all Grid operating policies of LCG/EGEE Security Response procedure and the 
         relevant Open Science Grid (OSG) guide." 

How to Suspend a Site

A definitive CERN ROC escalation process for deciding whether and when a site should be suspended is not defined yet. So far, two classes of events have been identified that can lead to this decision.
  1. The CERN ROC is informed by the Grid Operator on Duty (GOoD) that the decision has been made to suspend one of the sites which the ROC is managing. Usually the next step for the GOoD is to put forward the decision during the weekly operation meeting. Although the suspension action can be performed by the GOoD, it has been internally agreed that, if the CERN ROC has no serious objections to the decision, the CERN ROC should proactively suspend the site itself. In this case a notification should be sent to the site using the Suspension Notification template.
  2. The decision to suspend the site is made directly by the CERN ROC (e.g. due to an excessively growing number of unhandled issues affecting the site). In this case it is preferable to contact the site first, informing the administrators that the ROC is going to suspend it. The site should be given reasonable time to reply if it wants to oppose the measure. An example of the e-mail to send is in the Suspension Request template.

Once the site has been informed and all the "timeouts" have expired, the steps are as follows:

  1. Suspend the site on the GOC db.
    • Select "ROC Management" page.
    • Change site Status --> suspended.
    • Status Reason: Enter the main reason why the site is being suspended, possibly with reference to tickets.
  2. Add the suspended site to the CERN-ROC BDII.
    • From the GOC db, copy the site name (in bold characters) and the site GIIS URL (ldap://...).
    • Paste them in the BDII configuration file /afs/cern.ch/project/gd/egee/www/roc-cern/bdii/observed-sites.conf.
      The format is:
      Site-Name ldap-Connection-String
      e.g.
      # suspended sites
      BEIJING-CNIC-LCG2-IA64 ldap://ce-lcg.sdg.ac.cn:2170/Mds-Vo-Name=BEIJING-CNIC-LCG2-IA64,o=grid
      CARLETONU-LCG2 ldap://lcg02.physics.carleton.ca:2135/mds-vo-name=carletonlcg02,o=grid
      HPTC-LCG2 ldap://ce.prd.hp.com:2170/mds-vo-name=HPTC-LCG2,o=grid
      HPTC-LCG2ia64 ldap://ice.prd.hp.com:2170/mds-vo-name=HPTC-LCG2ia64,o=grid
      SDU-LCG2 ldap://ce.grid.hepg.sdu.edu.cn:2170/mds-vo-name=SDU-LCG2,o=grid
         
  3. Close all relevant open tickets on GGUS.
    • Tickets opened by the GOoD can be retrieved by a GGUS search using the site name as a keyword.
    • Specify in the solution: "Site suspended by Cern ROC on _DATE_"
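
As an illustration of steps 2 and 3 above, the following sketch builds a "Site-Name ldap-Connection-String" entry for the BDII configuration file and the solution text for the closed tickets. It works on the file content as a string and does not touch the real file. The function names are hypothetical, treating lines starting with "#" as comments is inferred from the example above, and the ISO date format for _DATE_ is an assumption.

```python
from datetime import date


def format_site_entry(site_name, giis_url):
    """Return one config line: 'Site-Name ldap-Connection-String'."""
    if not site_name or any(c.isspace() for c in site_name):
        raise ValueError("site name must be non-empty and contain no whitespace")
    if not giis_url.startswith("ldap://"):
        raise ValueError("GIIS URL is expected to start with ldap://")
    return f"{site_name} {giis_url}"


def parse_observed_sites(text):
    """Map site name -> GIIS URL, skipping blank lines and '#' comments."""
    sites = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"malformed line: {line!r}")
        sites[parts[0]] = parts[1]
    return sites


def add_suspended_site(text, site_name, giis_url):
    """Return the config text with the site appended, unless already listed."""
    if site_name in parse_observed_sites(text):
        return text
    entry = format_site_entry(site_name, giis_url)
    if text and not text.endswith("\n"):
        text += "\n"
    return text + entry + "\n"


def suspension_solution_text(day):
    """Solution text for the closed tickets (date format is an assumption)."""
    return f"Site suspended by Cern ROC on {day.isoformat()}"
```

Checking for an existing entry before appending keeps the file free of duplicate site names if the procedure is repeated for the same site.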

SLA with sites

  1. First send an e-mail to the site administrator, attaching a copy of the SLA template, for comments. An example of the e-mail to send is in the First request to sign the SLA template. There is also an editable pdf for version 1.6 of the SLA attached at the bottom of this page.
  2. A GGUS (Global Grid User Support) ticket is opened and assigned to the CERN ROC. This ticket will be used to keep track of the SLA signing activity, which in general can be followed by several people within the ROC. It will be closed when the site and the ROC have signed and a signed copy has been returned to the ROC.
  3. Agree on possible amendments.
  4. Have the ROC manager sign two copies of the SLA and send them to the site administrator.
  5. Have the ROC manager digitally sign a copy of the SLA and e-mail it to the site administrator for their signature.

Site Certification: overview

  1. Add the site to the GOCDB.
  2. A GGUS (Global Grid User Support) ticket is opened and assigned to the CERN ROC. This ticket will be used to keep track of the certification activity, which in general can be followed by several people within the ROC. It will be closed when the site is set in production.
  3. A gap analysis is done with the site to make clear which services and versions the site is going to run. The aim of this analysis is to define the actual requirements for the site and any documentation/training needed.
  4. The site is configured by the site administrator(s) and the CERN ROC is notified when it is done. During all this time the SAM tests run by the CERN ROC will be active, to help the site admins monitor their progress.
  5. After the notification arrives that the site is correctly configured (proved by successful SAM tests), a three-day countdown starts.
  6. An end-to-end test of the support line is done to verify the ability of the site to receive and work with service tickets opened by the Operators-on-Duty (COD).
  7. After three days of continuous successful tests, the site will be put in production.
  8. As an add-on to the technical certification, we require the site to sign the EGEE SLA with the ROC.

All the relevant communication and interactions between the site and the CERN ROC during the process described above should in principle be done through the GGUS ticket opened at step 2.

Site Certification: details

Starting the certification

A GGUS (Global Grid User Support) ticket is opened and assigned to the CERN ROC. This ticket will be used to keep track of the certification activity, which in general can be followed by several people within the ROC. It will be closed when the site is set in production. A template for an e-mail to do everything in one step is given here.

Gap Analysis

A gap analysis has to be done with the site to make clear which services and versions the site is going to run. The aim of this analysis is to define the actual requirements for the site and any documentation/training needed.

Certification Tests

The site is configured by the site administrator(s) and the CERN ROC is notified when it is done. During all this time the SAM tests run by the CERN ROC will be active, to help the site admins monitor their progress. Details on SAM for the CERN ROC can be read here.

After the notification arrives that the site is correctly configured (proved by successful SAM tests), a three-day countdown starts.

The support line of the site will also be checked to verify the ability of the site to receive and work with service tickets opened by the Operators-on-Duty (COD).

Putting a site in production

After three days of continuous successful tests, the site will be put in production (see next section).
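
The three-day criterion can be checked mechanically once test outcomes are available as timestamped pass/fail records. The sketch below is a minimal illustration under stated assumptions: the `(timestamp, passed)` tuples are a hypothetical representation, not the actual SAM output format, and the function names are invented for this example. Any failure resets the countdown, and the streak is measured up to the most recent recorded result.

```python
from datetime import datetime, timedelta

REQUIRED_STREAK = timedelta(days=3)  # the three-day countdown


def passing_streak(results):
    """Length of the current unbroken run of successful tests.

    `results` is an iterable of (timestamp, passed) tuples (hypothetical
    format). The run is measured from its first success to the most
    recent result; any failure resets it.
    """
    start = None
    last = None
    for when, passed in sorted(results):
        last = when
        if not passed:
            start = None      # a failure restarts the countdown
        elif start is None:
            start = when      # first success of a new run
    if start is None:
        return timedelta(0)
    return last - start


def ready_for_production(results):
    """True once the tests have passed continuously for three days."""
    return passing_streak(results) >= REQUIRED_STREAK
```

Measuring up to the latest recorded result, rather than up to the current time, means a site is not counted as passing during a period in which no tests were recorded.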

After the certification: delivery of a site in Production

After the certification tests have been carried out successfully, the site must be re-introduced in the production service. This should not happen automatically: the site should be informed and must accept. We suggest the following simple procedure.

  1. Send an e-mail to the site administrator (cc egee-roc-cern) giving a report on the certification tests and asking if the site agrees to go in production. Ideally a "certification document" should be provided at this stage, but this is not defined right now. An example of this communication can be found in the template, but please consider that each site's certification has its own history, which should be summarized in the e-mail.
  2. Wait for explicit acceptance by the site administrator. We assume here that the site accepts.
  3. Set the "Certified" status in the GOC db.
    • Select "ROC Management" page.
    • Change site Status --> certified.
    • Status Reason: Enter a brief summary of the certification criteria (e.g. Job submission succeeded for 1 week).
  4. Send an e-mail to the site administrator and the cic-on-duty (cc egee-roc-cern) confirming the start of operation in production mode. A template is given here.
  5. Add the site contact e-mails to the lists egee-roc-cern-all-sites and egee-roc-cern-certified-sites using e-groups.

CERN ROC Email templates

CERN ROC Certification Tools

A quick guide to digital signatures

-- DianaBosio - 26 Nov 2008

Topic attachments
  • EGEE-ROC-Site-CERN-SLD-v1.6-editable.pdf (r1, 147.3 K, 2008-11-26 14:07, DianaBosio): ROC CERN SLA for site admins to sign
Topic revision: r28 - 2009-11-19 - unknown
 