December 12 2007

Status

  • The NFS server used for failover of the LSF master nodes has been replaced by new hardware, which now runs 64bit SLC4

Work in progress

  • 5 new CEs are being set up on powerful new midrange servers running a 64bit SLC4 OS

November 21 2007

Status

Work in progress

  • we are setting up new LCG-CEs (V3.1) based on SLC4 and a minimal set of installed rpms. This requires a change in the packaging of LSF, which is currently being tested
  • we'll need several CE reconfigurations in the coming days (requested for SRM), which will give us an opportunity to debug problems seen in the past, such as gatekeeper and BDII crashes after reconfigurations.
  • Working on the new CDB templates for SLC4 which will be used by the WMS and LB nodes.

Issues

  • Problem with the matchmaking mechanism in the WMS after job resubmission. It should be fixed in the next middleware release.

November 14 2007

Status

  • LSF7 upgrade completed on 29/10. No major issues were seen, but we were flagged as down (?!) although the intervention was properly announced.
  • access to the remaining SLC3-submitting CEs was closed two weeks ago
  • little SLC3 batch capacity is left
  • access to the one remaining node in lxslc3 (lxplus) has been revoked
  • WN software was upgraded on 13/11, fixing a critical bug related to (t)csh local users.
  • wms101 disk raid problem - solved.
  • gLite 3.0 Update 36 done on all the LCG RB nodes

Work in progress

  • plan to migrate a test CE and BDII to SLC4
  • first glance at glExec on a test node: we are fighting dependency problems, and the package does not install.
  • reboot of all WMS/LB/RB next week for kernel upgrade.

October 17 2007

Issues

  • still problems with APEL accounting; under investigation

Work in progress

  • we'll migrate 4 more CEs to SLC4/32 and SLC4/64 submission. A scheduled downtime will allow them to drain

October 10 2007

Status

Issues

  • Some problems with the LB node lb102 (the purger is not installed by default because a package is missing from the glite-LB meta-package). There is also a problem with the purger when the database is huge; this is under investigation by the developers. As a consequence, all the WMS nodes now use lb103 as a stand-alone LB.
  • All the WMS/LB/RB nodes need to be rebooted (kernel upgrade for the SLC3 nodes). This could be done next week.
  • The LCG RB rb128 dedicated to CMS will be shut down tomorrow morning (Thursday 11 October) so that its RAID controller can be changed.

September 26 2007

Status

  • the gssklog AFS token-grabbing mechanism has been revised by plugging a script into the grid job wrappers (see the sketch below)
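
A minimal sketch of the idea, in Python for illustration; the wrapper hook, the gssklog option and the server name are assumptions, not the deployed script:

  import os
  import subprocess

  def acquire_afs_token():
      # Only attempt the token grab when the job actually carries
      # grid credentials.
      if not os.environ.get("X509_USER_PROXY"):
          return
      try:
          # gssklog exchanges the GSI proxy for an AFS token; the
          # '-server' value below is a hypothetical placeholder.
          rc = subprocess.call(["gssklog", "-server", "afsdb.example.org"])
          if rc != 0:
              print("gssklog failed with exit code %d" % rc)
      except OSError:
          print("gssklog is not installed on this worker node")

  # called from the grid job wrapper before the user payload starts
  acquire_afs_token()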

Issues

  • still synchronization problems with the APEL August data. Under investigation with Dave Kant.

Work in progress

  • ce110 is currently submitting to our LSF7 test cluster. No issues have been seen, so it will be returned to the production machines.

September 19 2007

Status

  • Current status of the LCG RBs, gLite WMS and gLite LB nodes.
  • New WMS for CMS put in production (wms107) with the latest patch #1251 from certification.
  • New LCG RB for Alice put in production (rb129).
  • a WMS for CMS is in drain mode because of some backlog.
  • the gLite 3.1 SLC4 WN software has been installed on the SLC4/32bit subcluster of lxbatch
  • ce121 and ce122 have been modified to submit to the 32bit SLC4 subcluster of lxbatch
  • more machines have been moved from SLC3 to 32bit SLC4 which now has about 600 job slots

Issues

  • the APEL accounting for August showed crazy numbers because of a new job record class that was not properly handled by the APEL LSF log file parser. A new version has been created by Dave Kant and deployed at CERN. This new version also improves/fixes support for sites on which the ksi2k number reported by the CE can change. At CERN, this number is dynamically calculated as the average over all machines in the subcluster; because nodes move between subclusters, this average can change frequently, which was causing problems for the released version of APEL (see the sketch below).
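
To make the effect concrete, here is a small sketch of the averaging described above; the node names and ksi2k ratings are made up for illustration:

  # The ksi2k figure published for a subcluster is the mean over its
  # current members, so moving nodes between subclusters shifts the
  # published value between accounting runs.
  subcluster = {
      "lxb0001": 1.4,   # hypothetical per-node ksi2k ratings
      "lxb0002": 1.4,
      "lxb0003": 2.1,   # newer hardware joining raises the average
  }

  def published_ksi2k(nodes):
      return sum(nodes.values()) / len(nodes)

  print("published ksi2k: %.2f" % published_ksi2k(subcluster))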

August 22 2007

Status

  • update of the WN software rolled out on all SLC3 (gLite 3.0) and SLC4 (gLite 3.1) worker nodes
  • software updated on CEs and monbox, in particular APEL

Issues

  • we have seen jobs arriving on CEs via some WMS even though these CEs were publishing a draining status.

Work in progress

  • 4 CEs are still draining, blocked by stalled jobs from LHCb; we are waiting for news from the user

August 15 2007

Status

  • all LCG CEs have been upgraded to yaim 4.0.X
  • globus_mds has been replaced by a local BDII on all CEs. This change went along with a port change (see the sketch after this list)
  • the gLite production CEs have been reinstalled as LCG CEs submitting to SLC4/64
  • CE113 and CE114 have been migrated to SLC4/64 submission hosts
  • CE115, CE116, CE117 and CE118 are now draining, and will be migrated to SLC4/64 submission hosts
  • a new version of the CA certificate rpms has been deployed
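
For reference, a quick way to check that a CE answers on the new port; the port number (2170 for a local BDII, replacing 2135 for the old MDS/GRIS), the base DN and the hostname follow the usual conventions and are assumptions here:

  import subprocess

  CE = "ce110.cern.ch"  # hypothetical example CE

  # Query the CE's local BDII directly; with the old globus_mds this
  # query would have gone to the GRIS port (2135) instead.
  subprocess.call([
      "ldapsearch", "-x",
      "-h", CE, "-p", "2170",
      "-b", "mds-vo-name=resource,o=grid",
      "(objectClass=GlueCE)", "GlueCEStateStatus",
  ])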

Issues

  • in July we again experienced problems uploading the APEL accounting data. A manual intervention was needed
  • the change in the information system port mentioned above caused APEL to crash (Savannah bug posted) and to stop reporting accounting data to the RGMA host. As a result, the APEL accounting data for July had to be corrected by hand
  • accounting of local jobs to APEL is not working yet for CERN. We are not ready to report those numbers.
  • many problems found on wms10x/lb101.
  • many jobs are still queued on wms104, which makes it difficult to upgrade.

Work in progress

  • work on APEL local accounting has to be resumed
  • preparation to reinstall all wms10x/lb101 with the newer glite-WMS patch 1251

July 11 2007

Status

  • The native SLC4 build of the gLite 3.1 worker node software has been packaged and deployed without reinstalling the nodes
  • Thorsten left; I'll take over the work on APEL
  • SLC4 deployment: our gLite CERN_PROD CEs have been converted into LCG CEs, and submit to SLC4/64 worker nodes
  • created and deployed a 32bit version of python on our 64bit machines. It is available as /usr/bin/python32 (see the sketch after this list)
  • successfully tested a LCG CE with LSF 7.0
  • new CA rpms have been deployed on 10/7 (1-15)
  • A new WMS and a new LB have been recently installed and configured with the latest packages available last week. Two bugs have been found in the meantime.
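
A quick way to verify which architecture a given Python build targets; run the snippet with /usr/bin/python32 and with the default interpreter to compare:

  import platform
  import sys

  # platform.architecture() inspects the interpreter binary itself.
  print("executable  : %s" % sys.executable)
  print("architecture: %s" % (platform.architecture(),))
  # the 32bit build reports ('32bit', 'ELF') on Linux, while the
  # native build on the 64bit machines reports ('64bit', 'ELF')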

Work in progress

  • packaging of SLC4 gLite3.1 WN software for deployment.
  • we are preparing for additional SGM and PRD pool accounts
  • 3 WMS will be reinstalled from scratch (one for Atlas, rb111 for Alice and rb125 for CMS) this week and will have the new middleware installed.

Issues

  • APEL accounting for June was broken due to various problems, both on the APEL ddb side and on the monbox. We still have problems with this.
  • a first attempt to replace the globus-mds on ce110 by a BDII did not work as expected. Laurence is looking into the issue. When the problem is fixed we plan to test it on one CE first (presumably ce110) and then deploy it site-wide. The change should improve the situation of CEs falling out of the information system under high load.

June 20 2007

Status

  • an LSF update was rolled out which fixes a problem recently seen with SGM user accounts. This also affected opssgm when running tests from the production gLite CEs via blah.
  • various problems last week on lxbatch worker nodes (and lxplus) related to disk and memory problems. This seems to be under control now.

Work in progress

  • APEL accounting: work is going on to include local users of the VOs (Thorsten).
  • SLC4 native worker node software: successfully installed on a test WN without reinstallation.
  • Preparing a new CDB template containing a minimal set of packages to install on GD production nodes.

Issues

  • Sandbox partition full on one CMS WMS last week due to a bug in CRAB. The CRAB developers have been contacted.

June 06 2007

Status

  • 8 new LCG-CEs (batch hardware) installed and put into production within 2 days to be able to cope with increased activity on the GRID
  • updated operator procedures for dealing with problems on the CE cluster
  • removed gridice_daemons to reduce the number of queries to the batch system
  • tuning of NFS servers used by the gridce cluster
  • tuning of the /tmp cleaner (a special version runs on the CEs now) in the gridce cluster to reduce the risk of files being removed by race conditions (see the sketch after this list)
  • fell back to the CERN private version of the lcg-info-generic script; with the released version, our CEs publish default numbers when they are under load
  • renewed certificates of ce106 and ce107 which were about to expire
  • improved monitoring and actuators (AFS, nscd, grisgrid)
  • lsfnfs02, which is used for the fall-back mechanism in LSF and which runs the gLite log file parser, had a disk problem and was running in degraded mode. The problem was fixed by the vendor and the sysadmins in a fully transparent way
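
As an illustration of the kind of race the /tmp cleaner tuning addresses, here is a minimal sketch (not the deployed cleaner): it only removes files whose timestamps are all old, and it tolerates files disappearing underneath it; the age threshold is an assumed value:

  import os
  import time

  MAX_AGE = 10 * 24 * 3600  # assumed age threshold in seconds

  def clean_tmp(directory="/tmp"):
      now = time.time()
      for name in os.listdir(directory):
          path = os.path.join(directory, name)
          try:
              st = os.lstat(path)
              # require *all* timestamps to be old: a file that was
              # recently written, read or chown'ed is probably in use
              newest = max(st.st_mtime, st.st_atime, st.st_ctime)
              if now - newest > MAX_AGE and not os.path.isdir(path):
                  os.remove(path)
          except OSError:
              # the file vanished or changed under us: another process
              # (or the job itself) got there first, so skip it
              continue

  clean_tmp()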

Issues

  • due to increased activity on the GRID we hit scalability issues on our CE cluster. The measures taken to improve the situation are listed above

May 30 2007

Status

  • WNs and CEs have been updated to the latest version (update 24), on both SLC4 (32bit SLC3 compat) and SLC3 worker nodes
  • the production CEs now publish GlueHostArchitecturePlatformType (x86_64 for ce110 and ce111, i686 for the others). This should allow experiments to identify the target architecture (see the sketch after this list).
  • fixed the published values of GlueHostOperatingSystemRelease and GlueHostProcessorModel
  • WMS nodes are now published by the top-level BDII lcg-bdii.cern.ch.
  • Middleware upgrade (update 24) on all the gLite WMS 3.0 and lcg RB nodes.
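
As an illustration, a query against the top-level BDII for the published platform types; the base DN is the conventional one and an assumption here:

  import subprocess

  # List each subcluster together with the platform type it publishes;
  # experiments can match on GlueHostArchitecturePlatformType to pick
  # 32bit (i686) or 64bit (x86_64) resources.
  subprocess.call([
      "ldapsearch", "-x",
      "-h", "lcg-bdii.cern.ch", "-p", "2170",
      "-b", "o=grid",
      "(objectClass=GlueSubCluster)",
      "GlueSubClusterName", "GlueHostArchitecturePlatformType",
  ])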

Issues

  • Problem with the glite account on all the WMS nodes, which had a bad group identity (gid). The gid of this account was changed at CERN a week or two ago, and this change suddenly caused trouble on all the gLite WMS nodes last Tuesday. Unfortunately, the only way to make this information consistent again was to reboot the nodes. Problem fixed.
  • Some CMS users generated some huge output sandbox files (> 500MB), and the sandbox partition became full on rb102 (CMS WMS 3.1).
  • A load-balanced DNS alias lhcb-wms.cern.ch has been created and applied to the WMS 3.1 nodes dedicated to LHCb (rb112 and rb117). However, the WMS middleware does not support this mechanism yet (see the sketch below).
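
A quick way to see which nodes currently sit behind the alias:

  import socket

  # With round-robin DNS the returned address list (and its order) can
  # change between lookups, which is the behaviour the WMS middleware
  # cannot follow yet.
  name, aliases, addresses = socket.gethostbyname_ex("lhcb-wms.cern.ch")
  print("canonical name: %s" % name)
  print("aliases       : %s" % aliases)
  print("addresses     : %s" % addresses)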

May 09 2007

Status

  • Alice WMS nodes upgraded to gLite 3.1.
  • LHCb WMS nodes upgraded to gLite 3.1.
  • Waiting for the green light from CMS and Atlas to upgrade their WMS nodes.
  • Restriction of root access on the LCG RB and gLite WMS nodes (clusters lcgrb and gridrb). Only operators, sysadmins, smod and service managers are now allowed to log in as root on these nodes.
  • Installation of a new gridview service on some LCG RB nodes (requested by the GridView team).

Issues

  • DNS load-balancing on the WMS and RB nodes?

Apr. 25 2007

Status

  • CE101 has been put back into production after the HW intervention

Work in progress

  • Deployment of the new lemon metrics/exceptions for the gLite WMS cluster.
  • Middleware upgrade (update 22 for gLite 3.0) on all the gLite WMS 3.0 and lcg RB nodes.
  • SLC3 worker node software is being upgraded to the latest version
  • Work on APEL accounting is still ongoing; we received a new patch

Issues

  • The upgrade done on one of the gLite WMS 3.1 nodes (rb112) failed because of some package dependency problems. The people responsible for the release have been contacted and the problem is under investigation.

Apr. 04 2007

Status

  • Network intervention on Monday went fine
  • Pending software upgrades will be done today on CE and WN
  • APEL accounting almost under control (the new monbox monb006 has to be opened on the firewall)
  • Middleware upgrade (update 20 for gLite 3.0) on all the gLite WMS 3.0 and LCG RB nodes. It concerns the BDII component only.

Work in Progress

  • Waiting for updated packages for gLite WMS 3.1.

Issues

  • Interlogger problem on gLite WMS 3.1.
  • gahp_server on lcg-RB looping.
  • glite-wms-purger cronjob created high I/O load.

Mar. 21 2007

Status

  • Some lemon sensors have been written for the gLite WMS nodes.
  • Middleware upgrade (update 18 for gLite 3.0) on all the gLite WMS nodes.
  • the gdrbxx nodes will be removed from production next Friday 21/03/2007. A broadcast will be sent by the GMOD today.
  • ce105 has been retired due to a hardware failure. It will not be put back as a CE, because it is the wrong hardware.
  • CE101 & CE102 have been reinstalled and still need to be checked.
  • Next step is to drain and retire the other CEs based on batch hardware: ce103, ce104, ce106 and ce107

Issues

  • APEL accounting is still not able to handle multiple CE machines. Therefore our accounting is not correct.

Mar. 14 2007

Status

  • As requested by LHCb, the two gLite WMS rb112 and rb117 have been reinstalled from scratch to version 3.1.
  • One gLite WMS dedicated to Atlas (rb101) upgraded to version 3.1.
  • Middleware upgrade (updates 16 and 17) done on all the gLite WMS 3.0 nodes during the last 2 weeks.
  • Middleware upgrade (updates 16 and 17) done on all the LCG RBs.
  • About 20 new lemon exceptions configured on all the LCG RB nodes. There is a wiki page available here. The operational procedure guide has been updated accordingly.

Work in Progress

  • Write some lemon sensors for the gLite WMS nodes.
  • CE101 and CE102 are in draining status for reinstallation
  • CE105 has a hardware problem. It is draining now and will be removed when empty (batch hardware).

Issues

  • Jobs are still being submitted to CE hosts that are in draining status; currently ~300 jobs from ATLASPRD.

Feb. 28 2007

Status

  • 1 new LCG RB for SAM put into production.

Work in Progress

  • Middleware upgrade (update 15 for gLite 3.0) on all the gLite WMS nodes and on the gLite-CE (to version 2.4.23-4). The LCG-CEs were not affected by this update.

Issues

  • We would like to use the status of a CE to steer itself: when a CE is overloaded, it should put itself into the status 'draining' and reduce its load by processing the queued jobs; once the load is healthy again, the node should put itself back into 'production'. This is currently not possible, because GOCDB would have to be modified for this and the SAM tests don't take the status of a CE into account (see the sketch after this list).
  • We exceeded 10K Grid jobs (pending and running) over the weekend, of which ~4K were running. We now see some scalability issues on the batch master (high loads), and possibly GRID job submission failures. Although the batch system is able to cope with this number of jobs, the GRID jobs cause a much higher load than probably necessary compared to local jobs. We should look for ways to improve the GRID usage of the batch system.
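
A sketch of the self-steering we have in mind, with illustrative thresholds; it is not deployable as-is, precisely because the published CE status lives in GOCDB rather than in a local flag file like the hypothetical one below:

  import os

  HIGH_LOAD = 20.0                    # assumed thresholds
  OK_LOAD = 5.0
  STATUS_FILE = "/var/run/ce-status"  # hypothetical local status flag

  def read_status():
      try:
          return open(STATUS_FILE).read().strip()
      except IOError:
          return "production"

  def steer():
      load1, load5, load15 = os.getloadavg()
      current = read_status()
      if current == "production" and load5 > HIGH_LOAD:
          open(STATUS_FILE, "w").write("draining")    # stop accepting jobs
      elif current == "draining" and load5 < OK_LOAD:
          open(STATUS_FILE, "w").write("production")  # accept jobs again

  steer()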

Feb. 21 2007

Status

  • 6 new LCG RBs and 1 new experimental gLite WMS 3.1 put into production. The distribution of the new LCG RBs and gLite WMS nodes is as follows:
    • LCG RBs:
      • Alice: rb105, rb120.
      • Atlas: rb106, rb121.
      • CMS: rb107, rb119, rb122.
      • LHCb: rb114, rb123.
      • shared: rb104, rb124.
      • SAM: rb113, rb115.
    • gLite WMS 3.0:
      • Alice: rb111, rb116.
      • Atlas: rb101.
      • CMS: rb102, rb109.
      • LHCb: rb112, rb117.
      • shared: rb103.
      • SAM: rb108, rb118.
    • Experimental gLite WMS 3.1:
      • Atlas: rb110, rb126.
      • CMS: rb125.
  • Middleware upgrade (update 14 for gLite 3.0) done on all the LCG RB and gLite WMS nodes.
  • Job submission has been blocked on the old gdrbxx nodes (LCG RB 2.7.0). These machines will be removed from production in one month.
  • activated lcg-info-dynamic-scheduler by hand for CERN.

Work in Progress

  • Configuration of the middleware on the experimental gLite WMS 3.1 node rb126.
  • retirement of the old batch-HW-based CEs still to be done: the new h/w is in place; once the information provider instabilities are under control we will start to retire the old boxes.

Issues

  • investigating stability issues of the CEs (under high load for LCG, in general for gLite). Minor upgrades of the LSF information provider plugin have been deployed.

Feb. 14 2007

Status

  • The new LCG and gLite-CE machines are ready to go into production. New information providers have been installed. CE108 was put into the IS yesterday. There are still problems with timeouts in the IS on the CEs.

Work in Progress

  • We will migrate to the new CE machines in the coming days. The old (batch h/w) machines, ce103-ce107, will be retired afterwards. ce101 and ce102 will be upgraded. The new machines are ce108 - ce113.

-- SteveTraylen - 15 Oct 2008
