LCGSCM Workload Management Status

Oct 21 2009

Status

  • 4 LCG CEs are in production pointing to the SLC5 resources (2 running the latest version of the software, 3.1.35-0).

Work in progress

  • Upgrade of the SLC5 WN software to gLite version 3.2.3-0 is being tested (preprod). It will be deployed with the next scheduled Linux upgrades.
  • 2 new LCG CEs will be put into production today, pointing to the SLC5 resources.
    • 3 LCG CEs pointing to SLC4 will be retired; draining will start on 26 Oct.
  • SLC5 BDII (3.2.3-0) will be in production soon.

Issues

Oct 07 2009

Status

  • LXPLUS has 2 new aliases, lxplus4 and lxplus5, and the SLC4 motd was revised to encourage users to move to SLC5.

Work in progress

  • Upgrade of the SLC5 WN software to gLite version 3.2.3-0 is being tested (preprod). It will be deployed with the next scheduled Linux upgrades.
  • 4 new LCG CEs submitting to SLC5 will be put into production soon; these will replace 4 LCG CEs which are being retired. The number of LCG CEs submitting to SLC5 will increase from 2 to 6 when this operation is completed.
  • The CREAM CEs have been unstable lately. The new release will be installed as soon as possible.
  • SLC5 BDII is being prepared.

Issues

May 26 2009

Status

  • The public batch capacity available under SLC5 now exceeds the SLC4 capacity, using exclusively additional (new) resources.
    x86_64_slc5 32053 HEPspec2006   (8148 wmperf)
    x86_64_slc4 29177 HEPspec2006   (7738 wmperf)
    
  • 2 CREAM CEs have been in production for about 2 months; both submit to SLC5 batch resources. These CEs are ce201 and ce202.
  • 2 LCG CEs, ce128 and ce129, are submitting to the SLC5 resources as well.
  • A total of 18 machines are now behind the SLC5 lxplus alias lx64slc5 and available for users to test the new OS. We are awaiting a release of the UI software for SLC5.

Work in progress

  • HEPSpec2006 benchmarking of older worker nodes is ongoing. In the longer term a renormalization of CPU factors at CERN is planned as well.
  • Upgrade of SLC5 WN to gLite version 3.2.1-0 is being tested (preprod).
  • Disabling of the SELinux execheap check (requested by ATLAS) is also being tested (preprod). At CERN an ncm component is used to configure this; the same result can be achieved by executing setsebool -P allow_execheap=true (see the sketch below).
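
The following is only a minimal sketch of the idea, not the ncm component used at CERN: it checks the allow_execheap boolean with getsebool and enables it persistently with setsebool if needed. It assumes the standard SELinux policy utilities are installed and that it runs as root; Python is used merely as a convenient wrapper around the commands quoted above.

    import subprocess

    def execheap_allowed():
        # getsebool prints a line like "allow_execheap --> off"
        proc = subprocess.Popen(["getsebool", "allow_execheap"],
                                stdout=subprocess.PIPE)
        out = proc.communicate()[0]
        return out.strip().endswith(b"on")

    if __name__ == "__main__":
        if execheap_allowed():
            print("allow_execheap is already enabled")
        else:
            # -P makes the change persistent across reboots
            ret = subprocess.call(["setsebool", "-P", "allow_execheap=true"])
            print("setsebool exit status: %d" % ret)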

Issues

  • Submission to CREAM CEs via the WMS is still broken. Direct submission works fine though (used by Alice).

April 29 2009

Status

  • We are currently running 20 WMS nodes in production with gLite update 44.
  • There are a number of open bugs and as yet unsolved problems requiring tedious manual intervention to fix. In addition, a number of automated workarounds have been implemented.

Work in progress

  • Nothing to report

Issues

  • Monitoring of the usage suggests that, due to the poor throughput of the WMS, more nodes may need to be requested. This is particularly true for ATLAS and LHCb, which were previously not heavy users of the system.
  • gLite update 44 broke ICE so submission to CREAM CEs is not possible.
  • It appears that a number of bugs will only be fixed in the 3.2 release.

Nov 26 2008

Status

  • Due to regular crashes, wms108, the public WMS node for small VOs, has been replaced by wms210. The CERN AFS UI has been updated accordingly. The other new nodes should be available soon (this week). This should allow the RBs to be retired afterwards. The current plan is to retire them at the beginning of January 2009.
  • An additional post-installation step for LCG CEs at CERN has been made obsolete by an automatic procedure using SINDES. This affects the configuration of APEL, which needs to have a plain-text password in its configuration file. The password is now properly stored in SINDES and securely transferred when SINDES is configured.
  • The local APEL database has been moved to a new machine. The old box (monb006) is still up, but will be reused for other purposes in the coming days once we are sure that we no longer need it.

Work in progress

  • We have started to prepare support for the pilot role for all experiments. New pool accounts have been requested and are now available; we have 10 such accounts per experiment.
  • Support for the SGM role for NA48 will be added. This is needed for the software area of this VO.

Issues

  • CMS requests that CE110 be published in the prod-bdii. It needs to be in production because it will remain in the PPS. Are there any objections from the other experiments?
  • APEL publishing problems this month: APEL stopped working with a local "table full" error; the lcg-records table reached a 4 GB limit (see the sketch below).
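
For reference, a hedged sketch of the usual recovery, assuming the local APEL database is MySQL with MyISAM tables and that the 4 GB ceiling is the classic default data-pointer limit (the table name, host and credentials below are placeholders, not the real CERN setup): the affected table is rebuilt with larger MAX_ROWS / AVG_ROW_LENGTH hints.

    import MySQLdb

    # Placeholder connection parameters and table name
    db = MySQLdb.connect(host="localhost", user="accounting",
                         passwd="********", db="accounting")
    cur = db.cursor()
    # Inspect the current size and row limits of the records table
    cur.execute("SHOW TABLE STATUS LIKE 'LcgRecords'")
    for row in cur.fetchall():
        print(row)
    # Rebuild the table with room for many more rows; this raises the
    # MyISAM data pointer size and locks the table while it runs
    cur.execute("ALTER TABLE LcgRecords "
                "MAX_ROWS=1000000000 AVG_ROW_LENGTH=200")
    db.close()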

Nov 05 2008

Status

  • added support for pilot role for LHCb
  • added support for VO "na48" (under test)
  • we were temporarily short of resources in public batch. The problem is under control now.

Work in progress

  • planning to upgrade the SLC5 worker nodes to 64bit WN software
  • for the WMS we are waiting for new hardware which will be used to deploy WMS 3.1 nodes (WMS+LB installed on each node). These will replace the current RB nodes and finalize the migration off SLC3.

Issues

  • CE110 (registered in the preprod system and submitting to SLC5 resources) was accidentally kept in maintenance. This prevented Alice from running their tests on it.
  • There is a known issue around gssklog on SLC5 which still needs to be understood. This problem may affect Alice testing.
  • While deploying the new pilot role, a bug was temporarily introduced which broke the mappings for the static alicesgm account, an account heavily used by Alice for production. The problem came to the attention of the support staff only via a private e-mail from Maarten, and was fixed within 30 minutes of our becoming aware of it.

Oct 15 2008

Status

  • At the request of LHCb, the mapping of the role "user" has been corrected. This involved a reconfiguration of the CEs.

Work in progress

  • we are preparing a new group of pool accounts to map a new role "pilot" for LHCb

Issues

  • LHCb complained about insufficient resources at CERN. This problem was followed up last week and is now solved. A post-mortem is in preparation.

Sep 17 2008

Status

  • rolling software update on the BDII nodes (transparent)

Work in progress

Issues

  • There were recent complaints about the availability of our CEs. It turned out that the cause was outdated, hard-coded lists of CEs on the experiment side, which referred to retired nodes. Experiments are reminded not to use such lists but to get the current state from the information system instead (see the sketch below).
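
As an illustration of querying the information system rather than keeping static lists, here is a minimal sketch using the python-ldap module against a BDII; the endpoint below (prod-bdii.cern.ch:2170) is an assumption for the example. It lists the currently published CEs together with their state.

    import ldap

    BDII_URI = "ldap://prod-bdii.cern.ch:2170"   # assumed BDII endpoint

    con = ldap.initialize(BDII_URI)
    con.simple_bind_s("", "")                    # anonymous bind
    entries = con.search_s("o=grid", ldap.SCOPE_SUBTREE,
                           "(objectClass=GlueCE)",
                           ["GlueCEUniqueID", "GlueCEStateStatus"])
    for dn, attrs in entries:
        ce = attrs.get("GlueCEUniqueID", [b"?"])[0]
        status = attrs.get("GlueCEStateStatus", [b"?"])[0]
        print("%s  %s" % (ce.decode(), status.decode()))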

Sep 03 2008

Status

  • Preparing the updates that were announced yesterday. There is currently an issue when upgrading the mon service, which is being followed up with Felix Ehm.

Work in progress

Issues

  • The Linux kernel deployed with the latest software upgrade caused a hiccup with LSF which was not detected in the preproduction phase. When the problem was detected on Monday, further upgrades were paused and the vendor, Platform, was contacted; they provided a fix very quickly. The patch went directly into production on Monday afternoon, and the upgrades continued.

Aug 20 2008

Status

  • glite-WN software version 3.1.15-0 was deployed with the latest scheduled software upgrades at CERN.

Work in progress

  • A replacement for the old BDII nodes is being prepared (now "fully" quattorized); there is a pending issue with a CERN-specific RPM for SRM.
  • gLite-WMS: Konstantin from the SAM team is testing SAM tests against the 3.1 WMS (on wms112 and wms113). The load limit had to be increased (15 was not enough, as SAM submits many jobs in bursts). Konstantin will test again with the load limit set to 25.

Issues

Aug 06 2008

Status

  • Ewan Roche just joined FIO-FS. He is now one of the WMS service managers. Welcome Ewan!
  • wms117 is now supporting LHCb and ATLAS.

Work in progress

  • replacement for old BDII nodes scheduled for next week
  • some improvements on grid related lemon sensors, specifically for bdii and linuxha, close to deployment
  • We have received 6 new machines, to be installed with the quattorized version of the SLC4 gLite 3.1 WMS.

Issues

  • LHCb reproduced this bug on wms117, running the 3.1 WMS: "User proxy mixup for job submissions too close in time" (http://savannah.cern.ch/bugs/?39641)

July 23 2008

Status

  • completed the implementation of vo.sixt.cern.ch on WMS and CEs
  • mon.cern.ch officially retired (de facto replaced)

Work in progress

  • Fully quattorized and yaimonized version of the gLite 3.1 WMS
  • ongoing review of access to production batch nodes and grid services
  • new BDII nodes are being set up (hardware replacements)
  • gLite 3.1 WN software will be deployed with the next scheduled software upgrade
  • planning to add a script to ease the instrumentation of experiment frameworks for efficiency monitoring

Issues

  • several incidents with myproxy, followed up and patched, patches submitted upstream
  • we've been hit by two cases of undetected black hole nodes in lxbatch, which developed a slight preference to eat grid jobs. This is still under investigation
  • the gLite middleware was uninstalled on wms117 due to an rpm glitch which removed the spma config file
  • problems with "interlogd_wrong" alarms on wms101 and wms106 (LHCb) are under investigation
  • we have seen several incidents where LSF (temporarily) lost queues during the automatic reconfiguration. The suspicion is that this is due to communication problems caused by orphaned nodes trying to connect to LSF: 14 such nodes recently came back to life by accident, still running LSF 6.1 with an ancient configuration. The nodes have been retired, and the reconfiguration procedure has been instrumented to detect such failures, abort the reconfiguration and inform the service managers by SMS (see the sketch below).
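
The following is only a sketch of that idea, not the actual CERN procedure: wrap the LSF batch reconfiguration, look for errors in its output, and alert the service managers instead of failing silently. The notification address, the error heuristic and the use of a local mail gateway for SMS delivery are all assumptions for the example.

    import smtplib
    import subprocess
    from email.mime.text import MIMEText

    NOTIFY = "lsf-service-managers@example.invalid"   # placeholder address

    def notify_service_managers(message):
        # assumes a local MTA that relays to an e-mail-to-SMS gateway
        msg = MIMEText(message)
        msg["Subject"] = "LSF reconfiguration aborted"
        msg["From"] = NOTIFY
        msg["To"] = NOTIFY
        server = smtplib.SMTP("localhost")
        server.sendmail(NOTIFY, [NOTIFY], msg.as_string())
        server.quit()

    # Run the standard LSF batch reconfiguration and capture its output;
    # the "y" on stdin answers a possible confirmation prompt.
    proc = subprocess.Popen(["badmin", "reconfig"], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out = proc.communicate(b"y\n")[0].decode("utf-8", "replace")
    if proc.returncode != 0 or "error" in out.lower():
        # abort here and page the service managers with the captured output
        notify_service_managers("badmin reconfig reported a problem:\n" + out)
        raise SystemExit(1)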

July 09 2008

Status

  • 10 CEs have been retired
  • as decided in the last meeting and announced via broadcast last Friday, services on the R-GMA box monb001 have been stopped today
  • decreasing capacity in CERN-PROD due to node retirements in lxbatch (affecting also 64bit resources) since the last meeting
  • preparation for gLite 3.1 WMS/LB service on SLC4: hardware requested. Configuration change needed. Move to SLC4 won't happen before August.

Issues

  • 2 service interruptions of batch services this week, both related to the move of license servers. This affected both BATCH and CASTOR systems
  • problems with APEL uploads after the production release of the new CEs, due to a configuration issue in the local APEL database; the procedures have been updated

June 25 2008

Status

  • 6 new CEs put into production
  • draining 10 CEs on old hardware for retirement
  • ongoing massive retirement of nodes by hardware type

Issues

  • we were hit again by the myproxy middleware bug
  • monb001 running R-GMA: the box has to be retired by mid-July. Can this service be dropped, or do we need to migrate it?
  • False "Downtime" due to issue publishing sBDII GStat test results to SAM

June 04 2008

Status

  • CPU counts: we are now publishing cores as physical CPUs (on one CE only). This is known to be wrong, but everybody does it like this.
  • The number of jobs jumped up by a factor of 10 around the 26th of May.
  • The power cut on Friday had the side effect of bringing all remaining grid nodes up on the new kernel version.
  • myproxy has been migrated to SLC4. After the migration we were hit by a bug: the new executables are wrongly linked, with the effect that the server stops serving requests when it is restarted from a non-interactive login shell, for example by the monitoring or an automatic yaim reconfiguration. The problem is not solved yet.

Issues

  • bug in the myproxy middleware, see above
  • since the power cut, LHCb has had problems submitting pilot jobs to CERN: they exit with a Maradona error. This is currently being debugged. The debugging (mainly by Maarten) unveiled a bug in the gridftp daemon on the CEs, for which he applied a manual workaround on all CEs. The problem is still not fully fixed (nor understood).

May 21 2008

Status

  • closed access to 32bit worker nodes from the GRID (nodes are scheduled for retirement)
  • migrated ce101 and ce102 to 64bit WN submission nodes -> only one unique subcluster left
  • revised CPU and core counts
  • first steps happening towards SLC5 support

Work in progress

  • WMS on SLC4 prototype (Yvan)
  • review of APEL accounting: meeting with the developers, with the aim of attacking the outstanding issues
  • replacement and migration of the myproxy service ongoing. The new nodes are up and running, a few things still need to be fixed, and HA linux needs to be tested.

Issues

  • occasional site unavailabilities (SAM) due to sBDII failures. All the failed tests were issued by Taiwan. This is currently being investigated.
  • counting of CPUs and cores (a sketch follows after this list)
    • CERN reports the number of physical CPUs which are in production in the GlueSubClusterPhysicalCPUs field
    • the number of cores is reported as GlueSubClusterLogicalCPUs
    • other sites apparently publish their number of cores as physical CPUs, with the result that CERN appears to be a small center
  • since the start of CCRC08 the number of jobs at CERN has dropped significantly. We are unable to fill our job slots due to a lack of jobs
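
To make the comparison concrete, here is a small sketch with the same caveats as the BDII query shown earlier (python-ldap, assumed endpoint): it sums the GlueSubClusterPhysicalCPUs and GlueSubClusterLogicalCPUs values published by a BDII, so the two totals can be compared across sites. LDAP attribute names are case-insensitive, so the spelling matches the fields quoted above.

    import ldap

    def cpu_counts(bdii_uri):
        """Sum physical and logical CPU counts over all published subclusters."""
        con = ldap.initialize(bdii_uri)
        con.simple_bind_s("", "")
        entries = con.search_s("o=grid", ldap.SCOPE_SUBTREE,
                               "(objectClass=GlueSubCluster)",
                               ["GlueSubClusterPhysicalCPUs",
                                "GlueSubClusterLogicalCPUs"])
        physical = logical = 0
        for dn, attrs in entries:
            physical += int(attrs.get("GlueSubClusterPhysicalCPUs", [b"0"])[0])
            logical += int(attrs.get("GlueSubClusterLogicalCPUs", [b"0"])[0])
        return physical, logical

    # Assumed site BDII endpoint, as in the earlier sketch
    phys, cores = cpu_counts("ldap://prod-bdii.cern.ch:2170")
    print("physical CPUs: %d, logical CPUs (cores): %d" % (phys, cores))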

April 30 2008

Status

  • WN upgraded to latest release on 29/4
  • 4 CEs (ce111-ce114) replaced by recent hardware
  • many batch configuration changes in preparation of CCRC08
  • LSF master double logging feature has been reenabled after we got a fix from Platform

February 27 2008

Status

  • the new worker node software, which was recently deployed in a hurry after user complaints, broke some local jobs (e.g. T0 activity). A workaround was provided, and a proper fix is in progress (Savannah bug #33530)
  • the CEs are suffering from high load. We are in touch with Di Qing to find ways to improve this
  • job priority (DENY tags) is under test now on ce110 (our production-like PPS CE)
  • Critical bug found on LB nodes (lb102 and lb103) in production last Friday, causing the job submission to fail on all WMS. We switched the configuration of the WMS node to use new LB nodes lb104 and lb105.
  • Use the CDB template pro_type_iptables_gridwms to configure the local firewall on all WMS nodes.
  • Use the CDB template pro_type_iptables_gridlb to configure the local firewall on all LB nodes.

Issues

  • renaming of the GEAR VO to vo.gear.cern.ch. Do we have the green light to go ahead?

February 13 2008

  • Following complaints about an outdated version of lcg_utils at CERN, we have upgraded the WN software to the latest version. This update was originally scheduled for the next upgrade campaign.
  • The latest version of the CE software has been put into preproduction and will be deployed with the next scheduled software upgrade campaign.
  • New SLS probe for WMS nodes: http://sls.cern.ch/sls/service.php?id=WMS

January 22 2008

Status

  • Current status of these nodes can be found here.

Issues

  • Need new LB machines (~8). It seems that the number of WMS actually in production is sufficient for the time being.

January 16 2008

Status

  • During the Christmas break, all CEs have been upgraded to SLC4.
  • Current status of the gLite WMS, gLite LB and LCG RB nodes can be found here.

Work in progress

Old Reports
