
LCG Computing Element (LCG CE)

Functional description

LCG CE is a native computing resource access service based on the Globus Gatekeeper. LCG has modified some of its components to improve performance.

Daemons running

  • globus-gatekeeper — must be started
  • globus-gridftp — must be started
  • globus-job-manager-marshal — must be started
  • globus-gass-cache-marshal — should be started, but the client can work in fall-back mode if the daemon is stopped
  • globus-gma — must be started if GLOBUS_GMA is enabled in site's config

Init scripts and options (start|stop|restart|reload|...)

  • The globus-job-manager-marshal, globus-gass-cache-marshal and globus-gma scripts support a 'reload' action, which sends a SIGHUP to the daemon.

Configuration files location with example or template

  • /opt/globus/etc/globus-gass-cache-marshal.conf, /opt/globus/etc/globus-job-manager-marshal.conf
    • logf (string) — location of the log file (default is relative to GLOBUS_LOCATION)
    • dgaspath (string) — [only for globus-job-manager-marshal] path to DGAS directory (default is /opt/edg/var/gatekeeper/jobs/)
    • maxproc (numeric) — maximum number of parallel requests [this is the most useful variable for tuning] (5 by default)
    • rrobin (0, 1 or 2) — enables round-robin queue mode for users (1) or groups (2) (disabled (0) by default)
    • groups (0 or 1) — if set, supplementary groups will be applied to the job-manager processes (disabled (0) by default)
    • tick (numeric) — interval in seconds at which hung child processes are killed (if no other events are happening) (300 by default)
    • reqtout (numeric) — number of seconds the client has to send a complete request after connecting (10 by default)
    • proctout (numeric) — each request (child process) is allowed to run for this number of seconds (600 by default)
    • reqlimit (numeric) — maximum size of a request in bytes (16384 by default). Increase this limit if the environment is very large.
    • window (numeric) — data block for recv/send in bytes [never change this] (default value is 4096 (x86 page size)).
    • debug (0, 1 or 2) — debug level: 0 — only warnings (default), 1 — all messages, 2 — stderr is redirected to the log file (bad for log parsers, but good for catching problems in Perl jobmanagers)
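
The documented defaults and allowed values for the marshal daemons can be collected in one place for sanity checking before a config change. The following is a minimal sketch only: the MarshalConfig class and the validate function are hypothetical helpers and not part of the LCG CE software, and the on-disk syntax of the .conf files is deliberately not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MarshalConfig:
    """Documented defaults for globus-*-marshal (hypothetical container)."""
    maxproc: int = 5        # maximum number of parallel requests
    rrobin: int = 0         # 0 = disabled, 1 = per user, 2 = per group
    groups: int = 0         # 1 = apply supplementary groups
    tick: int = 300         # seconds between sweeps for hung children
    reqtout: int = 10       # seconds allowed to receive a complete request
    proctout: int = 600     # seconds a request (child process) may run
    reqlimit: int = 16384   # maximum request size in bytes
    window: int = 4096      # recv/send block size, not meant to be changed
    debug: int = 0          # 0 = warnings, 1 = all messages, 2 = stderr to log


def validate(cfg: MarshalConfig) -> List[str]:
    """Return a list of problems; an empty list means the values look sane."""
    problems = []
    if cfg.rrobin not in (0, 1, 2):
        problems.append("rrobin must be 0, 1 or 2")
    if cfg.groups not in (0, 1):
        problems.append("groups must be 0 or 1")
    if cfg.debug not in (0, 1, 2):
        problems.append("debug must be 0, 1 or 2")
    for name in ("maxproc", "tick", "reqtout", "proctout", "reqlimit"):
        if getattr(cfg, name) <= 0:
            problems.append(f"{name} must be a positive number")
    if cfg.window != 4096:
        problems.append("window should stay at its default of 4096")
    return problems
```

For example, validate(MarshalConfig(maxproc=10)) returns an empty list, while a value such as rrobin=3 is reported as a problem.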

  • /opt/globus/etc/globus-gma.conf (note that the current production version does not yet understand condorfix and statefact)
    • logf (string) — location of the log file (default is relative to GLOBUS_LOCATION)
    • gridservices (string) — path to the gridservices directory [never change this] (default is relative to GLOBUS_LOCATION)
    • agentpath (string) — path to the directory with agent files [never change this] (default is relative to GLOBUS_LOCATION)
    • groups (0 or 1) — if set, supplementary groups will be applied to the poll process (disabled (0) by default)
    • condorfix (0 or 1) — enables a Condor work-around for not distinguishing VOMS attributes (disabled (0) by default)
    • tout (numeric) — sets a limit in seconds for a single job state poll to finish (30 by default)
    • toutlim (numeric) — sets a limit for a number of consecutive poll timeouts for a given user, after which all remaining jobs for that user will be skipped till the next poll cycle (4 by default)
    • tick (numeric) — number of seconds between poll cycles. This parameter defines the granularity for stateage, fileage and the adaptive state refresh interval below (300 by default).
    • stateage (numeric) — number of seconds for which a job state is considered 'fresh' (600 by default)
    • statefact (numeric) — division factor for calculating adaptive state refresh interval for short jobs [refresh_interval = min(stateage, job_run_time / statefact)] (disabled (0) by default)
    • fileage (numeric) — number of seconds before a job file is considered 'stale' and gets removed (86400 by default)
    • fileretry (numeric) — number of retries to read a job file (2 by default)
    • filesleep (numeric) — delay in milliseconds between retries above (10 by default)
    • debug (0, 1 or 2) — debug level: 0 — only warnings (default), 1 — all messages, 2 — stderr is redirected to the log file (bad for log parsers, but good for catching problems in Perl jobmanagers)
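
The adaptive refresh formula given for statefact, refresh_interval = min(stateage, job_run_time / statefact), can be worked through with a small helper. This is a sketch only: the function name is hypothetical, falling back to stateage when statefact is 0 (disabled) is an assumption, and the rounding to tick granularity mentioned above is not modelled.

```python
def refresh_interval(job_run_time, stateage=600, statefact=0):
    """Seconds for which a job state is treated as fresh.

    Implements refresh_interval = min(stateage, job_run_time / statefact).
    Assumption: with statefact disabled (0) the plain stateage applies.
    """
    if statefact <= 0:
        return stateage
    return min(stateage, job_run_time / statefact)
```

With stateage=600 and statefact=4, a job that has run for 120 seconds is refreshed after 30 seconds, while a job that has run for 6000 seconds is capped at the 600 second stateage.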

  • If your CE suffers from a very high load, try decreasing the maxproc parameter of globus-*-marshal. On the other hand, if you have a CE with modern hardware, lots of CPUs and a very fast disk subsystem, consider increasing it.
  • If your site is running short jobs, consider decreasing the tick and stateage parameters of globus-gma to make job status updates faster.
  • All parameters (except debug 1 → 2) can be changed online (modify the config file and send a SIGHUP). All daemons create pidfiles in /var/run/ (not configurable).
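
An online configuration change as described above amounts to reading the daemon's pidfile and sending it a SIGHUP. The sketch below assumes a Linux host and a pidfile that contains just the PID; the function name and the pidfile name in the usage note are hypothetical, since only the /var/run/ directory is stated here.

```python
import os
import signal
from pathlib import Path


def reload_daemon(pidfile, kill=os.kill):
    """Read a PID from pidfile and send SIGHUP so the daemon re-reads its config.

    The kill argument exists only so the function can be exercised without
    signalling a real process.
    """
    pid = int(Path(pidfile).read_text().strip())
    kill(pid, signal.SIGHUP)
    return pid
```

A call would look like reload_daemon("/var/run/globus-gma.pid"), where the file name is an assumption. The init scripts' 'reload' action is the supported way to do the same thing.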

Logfile locations (and management) and other useful audit information

  • /opt/globus/var/log/*.log — configurable with logf options above.
  • /var/log/globus-gridftp.log
  • /var/log/globus-gatekeeper.log
  • /var/log/messages
  • /opt/edg/var/gatekeeper/

Open ports

  • 2811 — Gridftp Server
  • 2119 — Globus Gatekeeper
  • 9002 — Locallogger Daemon
  • Ports from $GLOBUS_TCP_PORT_RANGE should be open
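
A quick reachability check of the fixed ports listed above can be sketched as follows. The helper names are hypothetical, the check only confirms that a TCP connection can be established, and ports from $GLOBUS_TCP_PORT_RANGE are not covered because they are only bound while transfers or jobs are active.

```python
import socket

# Fixed service ports as listed in this card.
SERVICE_PORTS = {
    2811: "Gridftp Server",
    2119: "Globus Gatekeeper",
    9002: "Locallogger Daemon",
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ce_ports(host, ports=SERVICE_PORTS, timeout=3.0):
    """Map each port number to True (reachable) or False (not reachable)."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

Running check_ce_ports against the CE host name from a machine outside the site firewall gives a dictionary keyed by port number.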

Possible unit test of the service

Submit jobs to the CE through both the WMS and globus-job-run.
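
The globus-job-run half of this test can be scripted along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the job manager name must be replaced with the one configured at the site, /bin/hostname is just a convenient harmless executable, and a valid grid proxy must already exist for the submission to succeed.

```python
import subprocess


def build_job_run_command(ce_host, jobmanager, executable="/bin/hostname", port=2119):
    """Build the globus-job-run argument list for a trivial test job."""
    contact = f"{ce_host}:{port}/{jobmanager}"
    return ["globus-job-run", contact, executable]


def smoke_test(ce_host, jobmanager, timeout=120):
    """Run the test job; True if it exits cleanly and prints something."""
    result = subprocess.run(
        build_job_run_command(ce_host, jobmanager),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0 and result.stdout.strip() != ""
```

A call such as smoke_test("ce.example.org", "jobmanager-lcgpbs") uses placeholder values for both arguments. The WMS submission path is not covered by this sketch.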

Where is service state held (and can it be rebuilt)

Staged files are held under the home directory of the pool account

Job state files are in $GLOBUS_LOCATION/tmp/gram_job_state

Cron jobs

The cron jobs can be found in:

  • /etc/cron.d/

and are:

  • bdii-proxy
  • edg-mkgridmap
  • lcg-expiregridmapdir
  • cleanup-grid-accounts
  • edg-pbs-knownhosts
  • cleanup-job-records
  • edg-pbs-shostsequiv
  • edg-apel-pbs-parser
  • fetch-crl
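
Whether the cron jobs listed above are installed can be checked with a short script. This is a sketch only: the helper name is hypothetical, it assumes the file names in /etc/cron.d/ match the names in the list exactly, and which entries apply may depend on the batch system in use at the site.

```python
from pathlib import Path

# Cron job names as listed in this card.
EXPECTED_CRON_JOBS = [
    "bdii-proxy",
    "edg-mkgridmap",
    "lcg-expiregridmapdir",
    "cleanup-grid-accounts",
    "edg-pbs-knownhosts",
    "cleanup-job-records",
    "edg-pbs-shostsequiv",
    "edg-apel-pbs-parser",
    "fetch-crl",
]


def missing_cron_jobs(cron_dir="/etc/cron.d", expected=EXPECTED_CRON_JOBS):
    """Return the expected cron job names that have no file in cron_dir."""
    present = {entry.name for entry in Path(cron_dir).iterdir()}
    return [name for name in expected if name not in present]
```

An empty result from missing_cron_jobs() means every listed job has a file in /etc/cron.d/.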

Security information

To be filled in by the OSCT team

Access control Mechanism description (authentication & authorization)

To be filled in by the OSCT team

How to block/ban a user

  • If it is necessary to ban a user on a CE, follow these steps:

  • Add the DN of each user to the "ban_users.db" file, which by default is located in /opt/edg/etc/lcas/ (or /opt/glite/etc/lcas/ on a gLite CE), as follows:
    • "User1's DN"
    • "User2's DN"
    • ... ... ...
    • "UserN's DN"

  • If there are multiple DNs to be banned, each DN must be on a separate line and enclosed in double quotes (""), otherwise LCAS will not be able to block the user. At the moment, LCAS does not support wildcards, so you cannot use "/C=UK/O=eScience/OU=CLRC/L=RAL/*" to ban a group of users. To verify that the user has indeed been banned, check the log: if a job from the banned user landed on the CE, there should be a message like "LCAS failed authorization".

  • Nothing needs to be restarted

  • If it is necessary to ban a VO, reconfigure the service without that VO
    • This also updates the information system
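
The ban_users.db rules for banning individual users (one DN per line, each in double quotes, no wildcards) can be enforced by a small helper when adding entries. This is a minimal sketch: the function names are hypothetical, the file path is passed in by the caller, and nothing here is part of LCAS itself.

```python
from pathlib import Path


def format_ban_entry(dn):
    """Return the DN wrapped in double quotes, rejecting wildcard patterns."""
    dn = dn.strip().strip('"')
    if "*" in dn:
        raise ValueError("LCAS does not support wildcards in ban_users.db")
    return f'"{dn}"'


def ban_users(dns, ban_file):
    """Append quoted DNs to ban_file, one per line, skipping duplicates."""
    path = Path(ban_file)
    text = path.read_text() if path.exists() else ""
    existing = {line.strip() for line in text.splitlines()}
    added = []
    for dn in dns:
        entry = format_ban_entry(dn)
        if entry not in existing and entry not in added:
            added.append(entry)
    if added:
        with path.open("a") as handle:
            # Make sure new entries start on their own line.
            if text and not text.endswith("\n"):
                handle.write("\n")
            for entry in added:
                handle.write(entry + "\n")
    return added
```

The return value lists the entries that were actually appended, so an empty list means every DN was already banned.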

Network Usage

To be filled in by the OSCT team

Firewall configuration

To be filled in by the OSCT team

Security recommendations

See EGEE'08 presentation.

Security incompatibilities

To be filled in by the OSCT team

List of externals (packages are NOT maintained by Red Hat or by gLite)

To be filled in by the OSCT team

Other security relevant comments

  • If you need to handle suspicious jobs, these are the steps to follow:
    • Pause or stop the batch system queues
    • Suspend all active jobs, if the batch system supports it
    • Stop the gatekeeper and the gridftp server while the suspected DNs are not yet identified
    • Ban the suspected DNs or VO
    • Keep the active jobs submitted by the suspected accounts suspended if possible, to facilitate forensic investigations. Otherwise kill the jobs.
    • Follow the EGEE Incident Response Procedure: IncidentReporting

Utility scripts

Location of reference documentation for users

Location of reference documentation for administrators

  • gLite 3.1 documentation
  • LCG-CE Internals:
    LCG-CE-internals.png
  • On the image above: red boxes are Globus binaries, yellow boxes are Perl daemons from LCG, green boxes are Perl job-manager libraries.

Topic attachments

  • LCG-CE-internals.png (r1, 96.6 K, 2009-02-19 12:31, AndreyKiryanov): LCG-CE Internals
Topic revision: r21 - 2009-03-26 - AndreyKiryanov