CE Notes

Introduction

The Computing Element (CE) is the service representing a computing resource. Its main functionality is job management (job submission, job control, etc.). The CE may be used by a generic client: an end-user interacting directly with the Computing Element, or the Workload Manager, which submits a given job to an appropriate CE found by a matchmaking process. For job submission, the CE can work in push model (where the job is pushed to a CE for execution) or pull model (where the CE asks the Workload Management Service for jobs). Besides job management capabilities, a CE must also provide information describing itself. In the push model this information is published in the Information Service, and it is used by the matchmaking engine, which matches available resources to queued jobs. In the pull model the CE information is embedded in a "CE availability" message, which is sent by the CE to a Workload Management Service. The matchmaker then uses this information to find a suitable job for the CE.
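The difference between the two models can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Job` and `CE` classes, the function names, and the matching rule (free CPUs against CPUs needed) are assumptions made for illustration and are not the actual CE or Workload Management Service interfaces.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    cpus_needed: int  # hypothetical requirement used for matching


@dataclass
class CE:
    name: str
    free_cpus: int  # hypothetical attribute describing the CE


def match_push(job, published_ces):
    """Push model: choose a CE for a job from the information the CEs published."""
    for ce in published_ces:
        if ce.free_cpus >= job.cpus_needed:
            return ce
    return None


def match_pull(ce_availability, queued_jobs):
    """Pull model: choose a queued job for the CE that sent an availability message."""
    for job in queued_jobs:
        if job.cpus_needed <= ce_availability.free_cpus:
            return job
    return None
```

In the push case the matchmaker starts from a job and searches the published CE information; in the pull case it starts from one CE's availability message and searches the job queue.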

In the production cluster at CERN, the local batch system is based on LSF.

The CE participates in the following flows

Components

The CE consists of

  • An LDAP server which presents the information on the CE to the BDII

Data

Configuration

Configuration is performed via YAIM.

High Availability

An RB service failure has the following impact:

  • New jobs cannot be submitted
  • Status of existing jobs cannot be queried
  • Jobs which complete will not be shown as completed until the RB service has been recovered
  • Output data from jobs may be lost since they cannot copy the job results to the output sandbox
  • User sandboxes will not be available for retrieval

Currently, the RB service does not support IP aliases. This is being worked on and should be fixed before SC4 implementation.

There is a drain function available to stop new submissions while allowing old submissions to complete.

Approach RB 1

An IP alias rbvo.cern.ch will be defined which allows the service to be switched between machines if required.

All state data will be stored 'off the box'. The state data consists of several directories (/var/edgwl,...) and the MySQL database server.

Thus, in the event of failure of the master, the slave would take over the external disks. The state data stored on file systems would be 'rolled back' using ext3 functions. The MySQL database would be restarted and would play its redo log to arrive at a consistent state.

Approach RB 2

The database for logging and bookkeeping is split off onto separate servers. The MySQL servers can then be shared between all of the resource brokers.

Using replication from the master to slave, the slave can take over the role of the master in the event of a failure. This also resolves the issue of hot online backups in MySQL since you just stop the slave, perform the backup and then start the slave again.
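The stop, back up, restart sequence on the slave could be scripted roughly as follows. This is a sketch under assumptions: the host name and dump path are placeholders, credentials are omitted, and the choice of `mysqldump` as the backup tool is an assumption, since the procedure is still listed as work to be developed. `STOP SLAVE` and `START SLAVE` are the standard MySQL replication statements.

```python
import subprocess


def slave_backup_commands(slave_host, dump_file):
    """Return, in order, the commands for a backup taken from the slave."""
    return [
        ["mysql", f"--host={slave_host}", "--execute=STOP SLAVE;"],
        ["mysqldump", f"--host={slave_host}", "--all-databases",
         f"--result-file={dump_file}"],
        ["mysql", f"--host={slave_host}", "--execute=START SLAVE;"],
    ]


def run_slave_backup(slave_host, dump_file, run=subprocess.run):
    """Stop replication, dump the databases, then always resume replication."""
    stop, dump, start = slave_backup_commands(slave_host, dump_file)
    run(stop, check=True)
    try:
        run(dump, check=True)
    finally:
        # Resume replication even if the dump failed, so the slave catches up.
        run(start, check=True)
```

The `run` parameter exists only so the sequence can be exercised without a database, for example by passing a function that records the commands. The master keeps serving requests throughout, which is the point of taking the backup from the slave.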

Equipment required

Approach 1

Assuming n RBs and 2 spares, the hardware required is

Component          Number    Purpose
Midrange Server    n+2       RB masters and standby machines
FC HBA             n+2       Fibre channel connectivity
FC Switch Ports    2*n+2     Connectivity for the two servers
FC Disk space      20 GB     Storage for credentials (2x10GB on different disk subsystems)

Approach 2

Assuming n RBs and 2 spares, the hardware required is

Component          Number     Purpose
Midrange Server    n+4+2      n RB masters and standby machines along with 2 MySQL clusters
FC HBA             n+4+2      Fibre channel connectivity
FC Switch Ports    2*n+8+2    Connectivity for the two servers
FC Disk space      20 GB      Storage for credentials (2x10GB on different disk subsystems)
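The counts in the two hardware tables can be evaluated for a concrete number of RBs with a small helper. The function name and dictionary keys are illustrative only; the formulas are taken directly from the tables, and the disk figure is read as 20 GB on the basis of the "2x10GB" note.

```python
def hardware_required(n_rbs, approach):
    """Return component counts for n_rbs resource brokers and 2 spares."""
    if approach == 1:
        extra_servers, extra_ports = 0, 0
    elif approach == 2:
        # Approach 2 adds 4 servers (and 8 switch ports) for the 2 MySQL clusters.
        extra_servers, extra_ports = 4, 8
    else:
        raise ValueError("approach must be 1 or 2")
    servers = n_rbs + extra_servers + 2
    return {
        "midrange_servers": servers,
        "fc_hbas": servers,  # one HBA per server, as in the tables
        "fc_switch_ports": 2 * n_rbs + extra_ports + 2,
        "fc_disk_gb": 20,
    }
```

For example, with 3 RBs, Approach 1 needs 5 servers and 8 switch ports, while Approach 2 needs 9 servers and 16 switch ports.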

Disk space requirements

Disk space required is based on the following data.

Parameter                 Value
Size of input sandbox     10 MB
Size of output sandbox    50 MB
Jobs / Day                10000
Sandbox Purge Time        14 days
Jobs in queue             10000
Disk Space Required       8400000 MB

Thus, the total space required is 8400 GBytes.

Warning: this figure seems very large and has therefore been raised as an issue in the technical factors. Either the number of jobs/day per resource broker is not correct, or the purge period will have to be reduced to keep the sandboxes to a reasonable size.
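The disk space figure can be reproduced from the listed parameters. The formula below is an inference: it is the product of sandbox size per job, jobs per day, and purge time, which matches the stated 8400000 MB exactly, and it does not use the "Jobs in queue" value. The function name is illustrative.

```python
def sandbox_disk_space_mb(input_mb, output_mb, jobs_per_day, purge_days):
    """Space held by sandboxes that have not yet been purged, in MB."""
    return (input_mb + output_mb) * jobs_per_day * purge_days


# Parameters from the table above.
total_mb = sandbox_disk_space_mb(10, 50, 10000, 14)
total_gb = total_mb / 1000  # using 1 GB = 1000 MB, as in the text

# Effect of halving the purge period, one of the two remedies mentioned.
reduced_mb = sandbox_disk_space_mb(10, 50, 10000, 7)
```

With the table values this gives 8400000 MB (8400 GB). Because the requirement scales linearly with the purge period, a 7-day purge would halve it to 4200000 MB.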

Engineering required

Development                            Purpose
Hot backup for MySQL                   A hot backup procedure needs to be developed. The MySQL database cannot be shut down for extended periods of time while the backup is performed.
Start/Stop/Status procedure            Scripts for RB operations
Replication procedure for MySQL        Enable MySQL master/slave setup
Lemon MySQL availability test          A Lemon-aware sensor which can be used for reporting availability
Lemon RB availability test             A Lemon-aware sensor which can be used for reporting availability (see CeMonitoringNotes)
Linux Heartbeat availability test      A Linux-HA aware sensor which would activate the procedure for automatic switch from master to slave
Switch procedure                       Automatic switch from master to slave: changing the DNS alias, disabling the master, and enabling the slave in its new master role
Capacity Metric                        Capacity metrics defined for:
                                         • Number of renewals / second
                                         • Number of inits / second
Quattor configuration for Linux-HA     NCM component to configure Linux-HA/Heartbeat
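As an illustration of what an availability test might check, the sketch below probes whether a TCP service accepts connections. It is a generic probe and does not use the Lemon sensor interface; how the result would be reported to Lemon or Linux-HA is not shown. The function name is hypothetical, and a successful connection only shows that the port is open, not that the service behind it is healthy.

```python
import socket


def tcp_service_available(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A MySQL check could, for example, call this against port 3306 (the MySQL default) on the database host. A fuller test would also run a trivial query.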

RBs are configured per VO so that one VO cannot cause problems for another. The spare slave boxes, however, could be shared between the VOs until a problem occurs.

-- TimBell - 13 Sep 2005

-- TimBell - 15 Sep 2005

Topic revision: r2 - 2005-09-15 - TimBell
 