---+ LCG Computing Element (LCG CE)

%TOC%

---+++ Functional description

The LCG CE is a native computing resource access service built on the Globus Gatekeeper. LCG has modified some of its components to improve performance.

---+++ Daemons running

   * globus-gatekeeper: must be started
   * globus-gridftp: must be started
   * globus-job-manager-marshal: must be started
   * globus-gass-cache-marshal: should be started, but the client can work in fall-back mode if the daemon is stopped
   * globus-gma: must be started if GLOBUS_GMA is enabled in the site's configuration

---+++ Init scripts and options (start|stop|restart|reload|...)

   * The globus-job-manager-marshal, globus-gass-cache-marshal and globus-gma scripts support the 'reload' action, which sends a SIGHUP to the daemon.

---+++ Configuration files location with example or template

   * */opt/globus/etc/globus-gass-cache-marshal.conf*, */opt/globus/etc/globus-job-manager-marshal.conf*
      * ==logf== (string): location of the log file (default is relative to GLOBUS_LOCATION)
      * ==dgaspath== (string): _[only for globus-job-manager-marshal]_ path to the DGAS directory (default is /opt/edg/var/gatekeeper/jobs/)
      * ==maxproc== (numeric): maximum number of parallel requests _[this is the most useful variable for tuning]_ (5 by default)
      * ==rrobin== (0, 1 or 2): enables round-robin queue mode for users (1) or groups (2) (disabled (0) by default)
      * ==groups== (0 or 1): if set, supplementary groups will be applied to the job-manager processes (disabled (0) by default)
      * ==tick== (numeric): interval in seconds at which hung child processes are killed, if no other events are happening (300 by default)
      * ==reqtout== (numeric): number of seconds after connection within which the client should send a complete request (10 by default)
      * ==proctout== (numeric): number of seconds each request (child process) is allowed to run (600 by default)
      * ==reqlimit== (numeric): maximum size of a request in bytes (16384 by default).
        Increase this limit if the environment is very large.
      * ==window== (numeric): data block size for recv/send in bytes _[never change this]_ (4096 by default, the x86 page size)
      * ==debug== (0, 1 or 2): debug level. 0: only warnings (default); 1: all messages; 2: stderr is redirected to the log file (bad for log parsers, but good for catching problems in Perl jobmanagers)
   * */opt/globus/etc/globus-gma.conf* (please note that the current _production_ version does not yet understand ==condorfix== and ==statefact==)
      * ==logf== (string): location of the log file (default is relative to GLOBUS_LOCATION)
      * ==gridservices== (string): path to the gridservices directory _[never change this]_ (default is relative to GLOBUS_LOCATION)
      * ==agentpath== (string): path to the directory with agent files _[never change this]_ (default is relative to GLOBUS_LOCATION)
      * ==groups== (0 or 1): if set, supplementary groups will be applied to the poll process (disabled (0) by default)
      * ==condorfix== (0 or 1): enables a Condor work-around for not distinguishing VOMS attributes (disabled (0) by default)
      * ==tout== (numeric): limit in seconds for a single job state poll to finish (30 by default)
      * ==toutlim== (numeric): number of consecutive poll timeouts allowed for a given user, after which all remaining jobs for that user are skipped until the next poll cycle (4 by default)
      * ==tick== (numeric): number of seconds between poll cycles. This parameter defines the granularity for ==stateage==, ==fileage== and the adaptive state refresh interval below (300 by default).
      * ==stateage== (numeric): number of seconds for which a job state is considered 'fresh' (600 by default)
      * ==statefact== (numeric): division factor for calculating the adaptive state refresh interval for short jobs _[refresh_interval = min(stateage, job_run_time / statefact)]_ (disabled (0) by default)
      * ==fileage== (numeric): number of seconds before a job file is considered 'stale' and gets removed (86400 by default)
      * ==fileretry== (numeric): number of retries to read a job file (2 by default)
      * ==filesleep== (numeric): delay in milliseconds between the retries above (10 by default)
      * ==debug== (0, 1 or 2): debug level. 0: only warnings (default); 1: all messages; 2: stderr is redirected to the log file (bad for log parsers, but good for catching problems in Perl jobmanagers)
   * *If your CE suffers from a very high load, try decreasing the ==maxproc== parameters of globus-*-marshal. On the other hand, if you have a CE with modern hardware, lots of CPUs and a very fast disk subsystem, consider increasing it.*
   * *If your site is running short jobs, consider decreasing the ==tick== and ==stateage== parameters of globus-gma to make job status updates faster.*
   * *All parameters (except debug 1 → 2) can be changed online (modify the config file and send a SIGHUP). All daemons create pidfiles in /var/run/ (not configurable).*

---+++ Logfile locations (and management) and other useful audit information

   * /opt/globus/var/log/*.log: configurable with the ==logf== options above.
   * /var/log/globus-gridftp.log
   * /var/log/globus-gatekeeper.log
   * /var/log/messages
   * /opt/edg/var/gatekeeper/

---+++ Open ports

   * 2811: GridFTP server
   * 2119: Globus Gatekeeper
   * 9002: Locallogger daemon
   * Ports from $GLOBUS_TCP_PORT_RANGE should be open

---+++ Possible unit test of the service

Submit jobs to the CE through both the WMS and globus-job-run.

---+++ Where is service state held (and can it be rebuilt)

Staged files are held under the home directory of the pool account.

Job state files are in $GLOBUS_LOCATION/tmp/gram_job_state.

---+++ Cron jobs

The cron jobs can be found in /etc/cron.d/ and are:
   * bdii-proxy
   * edg-mkgridmap
   * lcg-expiregridmapdir
   * cleanup-grid-accounts
   * edg-pbs-knownhosts
   * cleanup-job-records
   * edg-pbs-shostsequiv
   * edg-apel-pbs-parser
   * fetch-crl

---+++ Security information

To be filled in by the OSCT team.

---++++ Access control Mechanism description (authentication & authorization)

To be filled in by the OSCT team.

---++++ How to block/ban a user

   * If it is necessary to ban a user on a CE, follow these steps:
      * Add the DN of each user to the "ban_users.db" file, which by default can be found in /opt/edg/etc/lcas/ (or /opt/glite/etc/lcas/ on a gLite CE), as follows:
         * "User1's DN"
         * "User2's DN"
         * ... ... ...
         * "UserN's DN"
      * If there are multiple DNs to be banned, each DN must be on a separate line and enclosed in double quotes (""); otherwise LCAS will not be able to block the user. At the moment, LCAS does not support wildcards, so you cannot use "/C=UK/O=eScience/OU=CLRC/L=RAL/*" to ban a group of users. To verify that the user has indeed been banned, check the log for a message such as "LCAS failed authorization", which appears when a job of the banned user lands on the CE.
      * Nothing needs to be restarted.
   * If it is necessary to ban a VO, reconfigure the service without that VO.
      * This will also adapt the information system.

---++++ Network Usage

To be filled in by the OSCT team.

---++++ Firewall configuration

To be filled in by the OSCT team.

---++++ Security recommendations

See [[http://indico.cern.ch/contributionDisplay.py?contribId=185&sessionId=40&confId=32220][EGEE'08 presentation]].

---++++ Security incompatibilities

To be filled in by the OSCT team.

---++++ List of externals (packages are NOT maintained by Red Hat or by gLite)

To be filled in by the OSCT team.

---++++ Other security relevant comments

   * If you need to handle suspicious jobs, these are the steps to follow:
      * Pause or stop the batch system queues
      * Suspend all active jobs, if the batch system supports it
      * Stop the gatekeeper and the gridftp server while the suspected DNs are not yet identified
      * Ban the suspected DNs or VO
      * Keep the active jobs submitted by the suspected accounts suspended if possible, to facilitate forensic investigations. Otherwise kill the jobs.
      * Follow the EGEE Incident Response Procedure: [[http://osct.web.cern.ch/osct/incident-reporting.html][IncidentReporting]]

---+++ Utility scripts

---+++ Location of reference documentation for users

   * [[http://glite.web.cern.ch/glite/documentation/R3.1/default.asp][gLite 3.1 documentation]]

---+++ Location of reference documentation for administrators

   * [[http://glite.web.cern.ch/glite/documentation/R3.1/default.asp][gLite 3.1 documentation]]
   * LCG-CE Internals: <br /> <img src="%ATTACHURLPATH%/LCG-CE-internals.png" alt="LCG-CE-internals.png" width='1020' height='680' />
   * On the image above: red boxes are Globus binaries, yellow boxes are Perl daemons from LCG, green boxes are Perl job-manager libraries.
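
As an illustration of the ==statefact== setting described for globus-gma.conf, the formula _refresh_interval = min(stateage, job_run_time / statefact)_ can be sketched in Python as follows. This is only a sketch and not the globus-gma code: the function name is hypothetical, the fall-back to ==stateage== when ==statefact== is 0 is an assumption based on the "disabled (0)" default, and rounding to the ==tick== granularity is not modelled.

```python
def refresh_interval(stateage, job_run_time, statefact=0):
    """Return the number of seconds a polled job state is treated as fresh.

    stateage     -- the 'stateage' setting, in seconds
    job_run_time -- how long the job has been running, in seconds
    statefact    -- the 'statefact' division factor (0 = disabled)
    """
    if statefact <= 0:
        # Assumption: with statefact disabled, only stateage applies.
        return stateage
    # Formula quoted in the parameter description above.
    return min(stateage, job_run_time / statefact)


if __name__ == "__main__":
    # Default stateage of 600 s with an illustrative statefact of 2.
    for run_time in (120, 3600):
        print(run_time, refresh_interval(600, run_time, statefact=2))
```

With these illustrative values, a job that has been running for 120 seconds is refreshed after 60 seconds, while a long-running job is capped at the ==stateage== value of 600 seconds.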
Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| LCG-CE-internals.png | r1 | 96.6 K | 2009-02-19 - 12:31 | AndreyKiryanov | LCG-CE Internals |
Topic revision: r21 - 2009-03-26 - AndreyKiryanov