Sites Rank

Checking the site rank: when and how


If a site seems unable to run payloads, the reasons can be several: either the pilots are not able to get slots at the site, or the site is banned from the site mask, or there is a problem in the CS.
Supposing the site is properly registered in DIRAC, the fact that pilots do not go there means that either the site systematically fails all jobs (check SAM), or the site is not matched properly by the WMS, which therefore does not steer pilots to it. If the latter is the explanation, there are again a couple of possible causes: the gLite WMS/gLite BDII pair is not matching the site because of instabilities of one service (or the other, or a combination of the two), or the site is genuinely not attractive according to the LHCb matching policies.
We suggest first checking the following metrics advertised by the site in the WLCG Information System:
  1. Site Rank value (taking into account the number of jobs running and waiting in the local LRMS, and the free and total slots)
  2. Length of the queue
To simplify this task, web pages displaying this information per CE have been set up at http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview; the same metrics computed using the VO-View information (used by the gLite WMS) are available at the twin page http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview2.
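The same raw numbers (free and total CPUs, running and waiting jobs per CE) can also be queried from the command line; a minimal sketch, assuming the gLite UI tool lcg-infosites is available in your environment:

> lcg-infosites --vo lhcb ce      # one line per CE with total/free CPUs and running/waiting LHCb jobs

The web pages above remain the more convenient view, since they also evaluate the LHCb rank expression.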
The column LHCb Rank gives the current value of the LHCb rank expression: the bigger this value, the more attractive the site. A value less than -5 for that expression explains why no more jobs are submitted through that CE. For the sake of completeness, the same page gives the Estimated Response Time (ERT) value as retrieved from the site, which also gives an indication of the time required to get jobs running.
The page http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview also gives an indication of the length of the queue the site provides. LHCb requires 300K seconds on a 500 SI2K machine. On this page the operator can find the SpecInt value of the (lowest? average?) node at the site and the GlueCEMaxCPUTime the site provides for its queue. These two values should be evaluated against the reference value according to the following expression:
GlueCEMaxCPUTime > 300000 * 500 / GlueCESpecInt
If this expression is satisfied, the queue is long enough to hold LHCb production jobs.
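This check can be scripted; below is a minimal sketch (a hypothetical helper, not an official LHCb or DIRAC tool) that takes the two values read from the lcg-voview page and evaluates the expression above:

#!/bin/bash
# check_queue_length.sh -- hypothetical helper, not part of DIRAC or the gLite tools.
# Usage: ./check_queue_length.sh <GlueCEMaxCPUTime> <GlueCESpecInt>
max_cpu_time=$1   # GlueCEMaxCPUTime as shown on the lcg-voview page
spec_int=$2       # SpecInt (SI2K) of the site's nodes, as shown on the same page
# Reference requirement: 300K seconds on a 500 SI2K machine
required=$(( 300000 * 500 / spec_int ))
if [ "$max_cpu_time" -gt "$required" ]; then
    echo "OK: queue is long enough for LHCb production jobs ($max_cpu_time > $required)"
else
    echo "TOO SHORT: queue cannot hold LHCb production jobs ($max_cpu_time <= $required)"
fi

For example, a site advertising GlueCESpecInt=1500 needs to advertise a GlueCEMaxCPUTime above 100000 (= 300000*500/1500).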

CERN: why are no jobs running there?


It is not rare that the large majority of users want to run their payloads at CERN, where Grid queue usage has to compete with local batch job submission. The grid and non-grid resources at the T0 for LHCb are shared, and the influx of people routinely accessing CERN resources can often exhaust the (limited) CERN computing capacity.
As explained, a glance at the grid information system might therefore not report the "real" load the VO is running at CERN as a whole (including through the local batch nodes), since only grid-submitted jobs are reported.
For that reason we give here some tips for understanding the possible reasons why no further jobs are submitted at CERN via DIRAC (i.e. a lot of waiting jobs in the DIRAC sense).
From lxplus go through the following checks:
> bjobs -G u_LHCB | grep -i run | wc -l

This gives the number of running jobs for LHCb. One can also grep for the number of pending jobs and see which accounts submitted these jobs; refer to the help of this command for more information.
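For example (a sketch using the same filter as above; the second column of the bjobs output is the submitting account):

> bjobs -G u_LHCB | grep -ci pend                                        # number of pending (PEND) LHCb jobs
> bjobs -G u_LHCB | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn    # jobs per submitting account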
Once you have evidence that a lot of traffic (not necessarily only from the grid) is nevertheless running, you can look at other interesting information via the bhpart command.
A first look gives general accounting information over the last 24 hours.
[lxplus235] ~ > bhpart SHARE
HOST_PARTITION_NAME: SHARE
HOSTS: g_share/
SHARE_INFO_FOR: SHARE/
USER/GROUP SHARES PRIORITY STARTED RESERVED CPU_TIME RUN_TIME
u_TOTEM 100 33.333 0 0 0.0 0
u_DIRAC 100 29.324 0 0 2109.7 0
u_NTOF 80 26.667 0 0 0.0 0
u_HARP 70 23.333 0 0 0.0 0
u_HARPD 25 8.333 0 0 0.0 0
u_ALICE 1400 7.676 22 0 142866.7 440303
u_NA61 20 6.667 0 0 0.0 0
u_FLUKARP 20 6.667 0 0 0.0 0
u_GEANT4 100 4.978 0 0 87880.6 0
u_C3 10 3.333 0 0 0.0 0
others 20 1.106 3 0 23358.3 7949
u_NA48 400 0.522 146 0 324291.1 1351106
u_NA45 1 0.333 0 0 0.0 0
u_ITDC 1 0.333 0 0 0.0 0
u_NOMAD 1 0.333 0 0 0.0 0
u_DELPHI 1 0.333 0 0 0.0 0
u_OPAL 1 0.333 0 0 0.0 0
u_SLDIV 575 0.255 366 0 4753936.5 1178285
u_NA49 75 0.226 13 0 463448.2 1026397
u_ATLASCAT 20 0.089 13 0 546947.3 389406
u_LHCB 600 0.086 433 0 12201683.0 16849682
u_NA38NA50 1 0.073 3 0 4537.4 4182
u_ATLAS 900 0.065 1101 0 19379564.0 35002355
u_CMS 1400 0.059 291 0 35614644.0 80917510
u_OPERA 1 0.040 2 0 49355.8 31846
u_THEORY 50 0.021 128 0 3858136.8 6185725
u_PARC 1 0.021 1 0 67461.8 144254
u_COMPASS 1800 0.019 1592 0 136260064.0 316508363
u_DTEAM 10 0.017 1 0 0.0 2945314
u_ALEPH 1 0.012 4 0 96351.1 242961
u_GEAR 100 0.012 95 0 17353044.0 23545161
u_L3 1 0.008 1 0 88.5 633582

A more verbose output, providing all the accounting information user by user, can be retrieved using the bhpart command with the -r option:

[lxplus235] ~ > bhpart -r SHARE

From this command you can immediately identify which user consumed all the CPU allocated to the VO, preventing, through the fair-share mechanism, the rest of the community from running.
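For instance, to list the heaviest CPU consumers first (a sketch, assuming the -r output keeps the same column layout as the table above, with CPU_TIME in column 6):

> bhpart -r SHARE | sort -k6 -rn | head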
Please note that a user eating up all the resources shouldn't be blamed: there are indeed conditions (like a pre-empted system) that might lead to that. Just wait until this user's jobs finish; the priority mechanism (which for LSF is based on the fair share among VOs and each VO's groups and users) will re-normalize the situation and allow the VO to submit more jobs.

The allocated shares/groups for LHCb and other experiments in CERN LSF can be consulted through the web interface https://j2eeps.cern.ch/service-lsfweb/login (log in with your AFS/NICE account).

- 12 Jan 2009
