Experiment Services in the T0 VOBOXES Wiki

Description of the project

The four LHC experiments rely on a set of VOBOXES placed at T0/T1/T2 sites to run experiment-specific services. For some of these, such as those required for the production/analysis tasks of the experiment, a high service level is needed.

  • This project aims to establish a set of procedures and protocols that ensure smooth operation of these nodes at the T0 and that can be easily followed by the CERN-IT operators.
  • In a 2nd phase, a similar approach will also be applied to the T1 and some T2 sites.

Phases of the project

  • 1st Phase: Each experiment should individually provide, through its EIS contact persons, the list of services and the nodes hosting them, following the form included below. STATUS: ONGOING
  • 2nd Phase: In collaboration with the FIO experts and based exclusively on the experiment requirements, specific Lemon sensors will be deployed together with specific procedures that the operators can easily follow
  • 3rd Phase: Depending on the time schedule and experiment requirements, the experience gained should be applied to the T1 sites

ALICE Services

Contact persons: Patricia Mendez Lorenzo (IT), Lachezar Betev (ALICE), Pablo Saiz (IT), Costin Grigoras (ALICE)

Regarding the ALICE VOBOXES at CERN, MOST OF THE EXPERIMENT-SPECIFIC SERVICES WILL BE MANAGED BY THE EXPERIMENT EXPERTS, WHO WILL FOLLOW THEIR STATUS ON THEIR NATIVE MONITORING PAGES. Therefore, no actions or descriptions regarding these services are included in this page. There is, however, a single ALICE service which might require operator intervention and is included here: MonaLisa
In addition, since ALICE uses the standard WLCG-VOBOX service, essentially all generic WLCG-VOBOX services have to be covered by the IT operators on duty, because they require admin privileges.

  • ALICE services requiring operator actions:
    • MonaLisa
  • Generic WLCG-VOBOX services requiring operator actions
    • alice-box-renewal
    • gsisshd

General Information for alice-box-renewal

  • Name of the service: alice-box-renewal
  • Description of the service: Generic service in charge of the user proxy renewal mechanism
  • Contact: alice-support@cern.ch
  • Criticality: (and include a description of the criticality): 10. Any stop of this service halts the proxy renewal mechanism of the VOBOX and therefore stops the ALICE-specific services running on the VOBOX
  • Impact and max allowed downtime: Stops all ALICE services, the T0-T1 raw data transfers, and the Pass 1 reconstruction of raw data. Max allowed downtime since the 1st alarm: 12h
  • Existing instances: 2
  • Site(s): CERN
  • Dependencies: myproxy.cern.ch
  • Cluster: vobox (vobox/alice)

Hardware

  • Managed by: IT/FIO
  • Name of the node and network alias: voalice03(128.142.173.147), voalice06(128.142.173.212)

Monitoring

  • Monitoring available via: Native ALICE system and SAM
  • Alarms sent to site: NO
  • Alarms sent to experiment: YES

Procedures

  • Description of procedures to install/upgrade: In Preparation
  • Description of procedures to start/stop: In preparation
  • Description of procedures to reboot: In preparation
  • Configuration documented: YES
  • Error troubleshooting documented: not applicable
  • Backup procedures documented: not applicable
  • Configuration backed up: not applicable
  • Other procedures documented: YES

Operations

  • Covered by experiment shifts: YES
  • Covered by experts on call: YES
  • Coverage: best effort (for the moment)
  • Additional FIO coverage required, and if so what: Service restart at alarm time (In preparation)

General Information for gsisshd

  • Name of the service: gsisshd
  • Description of the service: Generic daemon allowing login to the VOBOX
  • Contact: alice-support@cern.ch
  • Criticality: (and include a description of the criticality): 10. Any stop of this service prevents any login to the node
  • Impact and max allowed downtime: Stops any manual ALICE operation on the VOBOX.
  • Existing instances: 2
  • Site(s): CERN
  • Dependencies: none
  • Cluster: vobox (vobox/alice)

Hardware

  • Managed by: IT/FIO
  • Name of the node and network alias: voalice03(128.142.173.147), voalice06(128.142.173.212)

Monitoring

  • Monitoring available via: NO
  • Alarms sent to site: NO
  • Alarms sent to experiment: NO

Procedures

  • Description of procedures to install/upgrade: In Preparation
  • Description of procedures to start/stop: In preparation
  • Description of procedures to reboot: In preparation
  • Configuration documented: NO
  • Error troubleshooting documented: not applicable
  • Backup procedures documented: not applicable
  • Configuration backed up: not applicable
  • Other procedures documented: YES

Operations

  • Covered by experiment shifts: NO
  • Covered by experts on call: NO
  • Coverage: best effort
  • Additional FIO coverage required, and if so what: Service restart at alarm time (In preparation)

General Information for MonaLisa

  • Name of the service: MonaLisa
  • Description of the service: Local monitoring system. It is also responsible for starting all ALICE-specific services on the VOBOXES
  • Contact: alice-support@cern.ch
  • Criticality: (and include a description of the criticality): 10. Any stop of this service may disrupt the correct behaviour of the other services (2h max)
  • Impact and max allowed downtime: Stops the monitoring of the ALICE services
  • Existing instances: 4
  • Site(s): CERN
  • Dependencies: none (only with AliEn central services)
  • Cluster: vobox (vobox/alice)

Hardware

  • Managed by: IT/FIO
  • Name of the node and network alias: voalice01, voalice02, voalice03, voalice06

Monitoring

  • Monitoring available via: YES, ALICE native system
  • Alarms sent to site: NO
  • Alarms sent to experiment: NO

Procedures

  • Description of procedures to install/upgrade: In Preparation
  • Description of procedures to start/stop: In preparation
  • Description of procedures to reboot: In preparation
  • Configuration documented: NO
  • Error troubleshooting documented: not applicable
  • Backup procedures documented: not applicable
  • Configuration backed up: not applicable
  • Other procedures documented: YES

Operations

  • Covered by experiment shifts: YES
  • Covered by experts on call: YES
  • Coverage: 24x7
  • Additional FIO coverage required, and if so what: Service restart at alarm time (In preparation)

Specific procedures and operations for ALICE

The following procedures will be available on the AliEn page.

alice-box-renewal

  • Description of service: This service is responsible for refreshing the user proxies previously registered in the VOBOX
  • Placement: all ALICE VOBOXES: voalice03, voalice06 (these nodes will be out of warranty in October 2009. New voboxes already available: voalice11, voalice12, voalice13, voalice14)
  • Nodes criticality: >50
  • Service criticality: 10
  • Dependencies: myproxy.cern.ch
  • Characteristics: This service runs under root credentials; therefore, no actions can be expected from the experiment
  • Level of support:
  • Procedures
    • Description of procedures to install/upgrade: Any upgrade of the WLCG VOBOX (in terms of middleware) must be announced in advance and agreed with the ALICE support team through the corresponding mailing list: alice-support@cern.ch. Once the VOBOX has been upgraded: THE SERVICE SHOULD BE AUTOMATICALLY RESTARTED
    • Description of procedures to start/stop: Any stop of the service HAS TO BE AGREED WITH THE ALICE SUPPORT TEAM through the corresponding mailing list: alice-support@cern.ch.
      • STOP PROCEDURE: /etc/rc.d/init.d/alice-box-renewal stop
      • START PROCEDURE: /etc/rc.d/init.d/alice-box-renewal start
    • Description of procedures to reboot: Reboot of the node can be done if required without announcing to the experiment. After the reboot of the node the status of the service must be checked:
      • CHECK PROCEDURE: /etc/rc.d/init.d/alice-box-renewal status
      • If the result of this command is "Service not running", the service has to be restarted as specified before
  • Control tools
    • The following script will need to be installed on each ALICE VOBOX and executed regularly to trigger Lemon alarms in case of errors: http://pmendez.web.cern.ch/pmendez/alice/lcg_cern_vobox_tests
    • This script writes its results into /tmp/CERN-$NODE_NAME, where $NODE_NAME is the name of the host. At the end of the script execution, 3 files appear inside this directory, each including an error code. This error code needs to be parsed and must trigger Lemon alarms in case of errors.
    • Description of the tests related to the user proxy renewal service
      • Test1: Status of the proxy renewal service. In case of failure, the service will be restarted automatically (TO BE IMPLEMENTED AT THE LEMON LEVEL). After two failed attempts, an email has to be sent to the 2nd (for action) and 3rd (for information) levels of support
      • Test2: Proper registration of the host in the myproxy server at CERN. In case of failure, an email has to be sent to the 3rd level of support
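
The Test1 policy described above (check the status, restart automatically, escalate after two failed attempts) could be sketched as follows. This is only an illustration, not the Lemon implementation, which is still to be written: the function names, the return values and the `notify` callback are hypothetical, and the check relies on the "Service not running" message quoted in the reboot procedure. The actual email delivery to the 2nd and 3rd levels of support is left to the caller.

```python
import subprocess
from typing import Callable

# Init script path as given in the start/stop procedure above.
INIT_SCRIPT = "/etc/rc.d/init.d/alice-box-renewal"
# The procedure asks for an email after two failed restart attempts.
MAX_RESTART_ATTEMPTS = 2


def run_init_script(action: str) -> str:
    """Run the init script with the given action and return its combined output."""
    result = subprocess.run(
        [INIT_SCRIPT, action],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return result.stdout


def is_running(status_output: str) -> bool:
    """Treat the service as down when the status output says 'Service not running'."""
    return "Service not running" not in status_output


def check_and_recover(
    run: Callable[[str], str] = run_init_script,
    notify: Callable[[str], None] = print,
) -> str:
    """Return 'ok', 'restarted' or 'escalated' (hypothetical labels)."""
    if is_running(run("status")):
        return "ok"
    for _ in range(MAX_RESTART_ATTEMPTS):
        run("start")
        if is_running(run("status")):
            return "restarted"
    # Assumption: the caller turns this message into the email to the
    # 2nd (for action) and 3rd (for information) levels of support.
    notify("alice-box-renewal still down after 2 restart attempts")
    return "escalated"
```

Passing `run` and `notify` as parameters keeps the policy testable on a machine where the init script is not installed.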

MonaLisa

  • Description of service: This service is responsible for monitoring the ALICE-specific local services running on each VOBOX and for automatically restarting the other ALICE-specific services if they stop

  • Placement: voalice03, voalice06 (these nodes will be out of warranty in October 2009. New voboxes already available: voalice11, voalice12, voalice13, voalice14)
  • Nodes criticality: >50
  • Service criticality: 10
  • Dependencies: none (only AliEn central services)
  • Characteristics: This service runs under alicesgm credentials
  • Level of Support

  • Procedures
    • Description of procedures to install/upgrade: Any upgrade of the WLCG VOBOX (in terms of middleware) must be announced in advance and agreed with the ALICE support team through the corresponding mailing list: alice-support@cern.ch. Once the VOBOX has been upgraded: THE SERVICE SHOULD BE AUTOMATICALLY RESTARTED (via a cron job installed on each VOBOX)
    • Description of procedures to start/stop: Any stop of the service will be managed centrally by ALICE
    • Description of procedures to reboot: Reboot of the node can be done if required without announcing to the experiment. After the reboot of the node, MonaLisa will be automatically restarted (via a cron job installed on each VOBOX)
  • Control tools
    • Depending on ALICE support

gsisshd

  • Description of service: This service is responsible for providing VOBOX access to sgm users

  • Placement: voalice03, voalice06 (these nodes will be out of warranty in October 2009. New voboxes already available: voalice11, voalice12, voalice13, voalice14)
  • Nodes criticality: >50
  • Service criticality: 10
  • Dependencies: none
  • Characteristics: This service runs under root credentials
  • Level of Support
    • 1st level of support: Operator
    • 2nd level of support: FIO
    • 3rd level of support: ALICE experts: alice-support@cern.ch
  • Procedures
    • Description of procedures to install/upgrade: Any upgrade of the WLCG VOBOX (in terms of middleware) must be announced in advance and agreed with the ALICE support team through the corresponding mailing list: alice-support@cern.ch. Once the VOBOX has been upgraded: THE SERVICE SHOULD BE AUTOMATICALLY RESTARTED
    • Description of procedures to start/stop: /etc/rc.d/init.d/gsisshd stop (or start). Any stop operation has to be agreed with the 3rd level of support: alice-support@cern.ch
    • Description of procedures to reboot: Reboot of the node can be done if required without announcing to the experiment. After the reboot of the node, gsisshd will be automatically restarted, but its status must be confirmed: /etc/rc.d/init.d/gsisshd status
  • Control tools
    • The following script will need to be installed on each ALICE VOBOX and executed regularly to trigger Lemon alarms in case of errors: http://pmendez.web.cern.ch/pmendez/alice/lcg_cern_vobox_tests
    • This script writes its results into /tmp/CERN-$NODE_NAME, where $NODE_NAME is the name of the host. At the end of the script execution, 3 files appear inside this directory, each including an error code. This error code needs to be parsed and must trigger Lemon alarms in case of errors.
    • Description of the tests related to the gsisshd service
      • Test3: Status of the gsisshd service. In case of failure, the service will be restarted automatically (TO BE IMPLEMENTED AT THE LEMON LEVEL). After two failed attempts, an email has to be sent to the 2nd (for action) and 3rd (for information) levels of support
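
Parsing the result files that the control script leaves in /tmp/CERN-$NODE_NAME might look like the sketch below. The page does not specify the file names or their format, so this assumes (hypothetically) that each file holds a single integer error code and that 0 means success; unparseable content is mapped to -1 and reported as a failure. Whether $NODE_NAME is the short or the fully qualified host name is also an assumption here.

```python
import socket
from pathlib import Path
from typing import Dict, List


def read_error_codes(result_dir: Path) -> Dict[str, int]:
    """Map each result file name to the integer error code it contains."""
    codes: Dict[str, int] = {}
    for path in sorted(result_dir.iterdir()):
        if not path.is_file():
            continue
        try:
            codes[path.name] = int(path.read_text().strip())
        except ValueError:
            # Assumption: content that is not an integer counts as a failure.
            codes[path.name] = -1
    return codes


def failed_tests(codes: Dict[str, int]) -> List[str]:
    """Names of the result files whose code is non-zero (assumed to mean failure)."""
    return sorted(name for name, code in codes.items() if code != 0)


if __name__ == "__main__":
    # Assumption: $NODE_NAME matches what socket.gethostname() returns.
    result_dir = Path("/tmp") / f"CERN-{socket.gethostname()}"
    if result_dir.is_dir():
        for name in failed_tests(read_error_codes(result_dir)):
            print(f"ALARM: {name}")
```

A Lemon sensor could then raise one alarm per name returned by `failed_tests`; how the sensor consumes that list is outside the scope of this sketch.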

ATLAS Services

Contact persons: Sergey Baranov
List of the ATLAS services: ATLAS.DistributedComputingMachines

CMS Services and important info

Contact Persons: Andrea Sciaba, Nicolo Magini, Patricia Bittencourt
List of the CMS services: https://twiki.cern.ch/twiki/bin/view/CMS/CMSServices_Criticality

Check this page for some important notes about CMS VOBOXES procedures

There are plans to implement Lemon sensors for the SAM client and the Job Robot.

LHCb Services

LHCb Operational Procedures and Maintenance

Contact Persons: Roberto Santinelli, Joel Closier, Raja Nandakumar
List of the LHCb Services: LHCb Services

EXPERIMENTS LEMON METRICS

The specific experiment sensors can be found here.

OPERATORS PROCEDURES

  • everything begins here
  • The template can be found here
  • a quick guide is available here

PROJECT INDICO CATEGORY

The indico list can be found here

QUESTIONNAIRE FOR EACH EXPERIMENT SERVICE

General Information

  • Name of the service
  • Description of the service
  • Contact:
  • Criticality: (and include a description of the criticality)
  • Impact and max allowed downtime
  • Existing instances:
  • Site(s):
  • Dependencies:
  • Cluster:

Hardware

  • Managed by:
  • Name of the node and network alias

Monitoring

  • Monitoring available via
  • Alarms sent to site
  • Alarms sent to experiment

Procedures

  • Description of procedures to install/upgrade
  • Description of procedures to start/stop
  • Description of procedures to reboot
  • Configuration documented
  • Error troubleshooting documented
  • Backup procedures documented
  • Configuration backed up
  • Other procedures documented

Operations

  • Covered by experiment shifts
  • Covered by experts on call
  • Coverage
  • Additional FIO coverage required, and if so what

-- PatriciaMendezLorenzo - 24 Jun 2009

Mailing List(s)

  • support-eis@cern.ch

Documents & presentations

  • See "attachments" below

Topic revision: r18 - 2010-04-01 - JiriHorky
 