Older Status Reports

LCGSCM Workload Management Status

Jan. 31 2007

Status

  • lxplus alias changed as scheduled on Monday to SLC4_64
  • ce110, which was recently added to the PPS system, is being used successfully, mainly by Alice, to run jobs on SLC4_64 in 32-bit compatibility mode
  • following the model of ce110 (LCG-like), a second CE, ce111 (gLite-like), has been set up and added to the PPS system.

Work in Progress

  • work on APEL accounting is still ongoing

Jan. 17 2007

Status

  • new CEs installed with newest SW release: ce108 - ce113 but not yet put into production
  • ce110 (LCG flavor CE) taken for SLC4 tests: submits to SLC4_64 WN
  • new version of LSF information providers: adds support for nodes with multiple OS types (job slot and job counting now take into account the LSF type of the submission host)
  • ce110 has been added to the PPS now
  • (SLC3) WN software has been installed on SLC4_64
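
The type-aware counting mentioned above can be illustrated with a minimal sketch. It assumes that each batch host carries a type label and that a CE should only count hosts whose type matches that of its submission host. The function name, the dictionary keys and the type labels are hypothetical and are not taken from the actual information providers.

```python
def count_slots_for_type(hosts, submission_host_type):
    """Sum job slots and running jobs over hosts of one (hypothetical) LSF type.

    `hosts` is an iterable of dicts with the keys "type", "slots" and "running".
    """
    total_slots = 0
    running_jobs = 0
    for host in hosts:
        if host["type"] != submission_host_type:
            continue  # hosts of another type are not counted for this CE
        total_slots += host["slots"]
        running_jobs += host["running"]
    return {
        "slots": total_slots,
        "running": running_jobs,
        "free": total_slots - running_jobs,
    }
```

With such a filter, a CE submitting to SLC4_64 nodes would publish only the slots of that node type rather than the totals of the whole batch farm.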

Work in Progress

  • APEL accounting for SLC4 (new)
  • APEL accounting for SLC3 needs updates on both the CE as well as on the monbox

Issues

  • some SAM test problems on gLite CEs before Christmas were finally solved by SAM people

Dec. 13 2006

Work in Progress

  • We are planning to set up WNs on LXBATCH running SLC4 as soon as we have the software from PP. For this, we will also add a new CE.
  • New CE machines on mid-range server h/w are being installed. These nodes will get the latest CE s/w and replace the CE machines on batch h/w (ce103 - ce107).

Dec. 06 2006

Status

  • New VO EELA configured on CE, WN and LSF. Something to be done on WMSLB/RB?
  • Actuator for GridGris wrong alarm configured
  • Two new gLite WMSLB put in production: rb111 for Alice, rb112 for LHCb.

Issues

  • ce103, ce104, ce105, ce106 and ce107 were unavailable last week due to a faulty network switch. The switch was replaced on Friday. The network people want to do some more tests on the new switch and have asked for a downtime of 30 minutes. To be scheduled.

Nov. 29 2006

Status

  • some progress on SFT/SAM tests on gLite CEs. We found and fixed one problem on the gLite CEs which was caused by a pool cleaner tool. It has been disabled. The success rate of the SFT/SAM tests on ce103 has increased since then but there are still failures. The last test shown for ce104 is a couple of days old, so we cannot verify for that one.
  • LHCb reported problems with the experiment tags. This has been traced to a buggy version of lcg-ManageVOTag installed in the AFS UI. The format of the tags was corrected by hand, and LHCb is now using a newer version that should fix the problem.

Nov. 15 2006

Status

  • Last week's LSF update had a side effect on the SGM Grid accounts, which are special. Job submission from the CEs temporarily failed for these accounts. The problem was understood yesterday, and a fix was applied.

Issues

  • The SFT/SAM test failures on gLite CEs are still not solved. Under investigation by the developers now.
  • Grid jobs that run into the CPU time limit seem to report strange and misleading error messages back to the users. Also, it does not make much sense to resubmit such jobs automatically without any change.

Nov. 08 2006

Issues

SFT/SAM tests on gLite CEs frequently fail. The reason is unclear and under investigation by the experts.

Nov. 01 2006

Nothing to report

Oct. 25 2006

Work in progress

New (generic) information providers for the CE hosts are under investigation. They should reduce the execution time for the IS and therefore stabilize it.

Status

rb110 has been installed as a new WMSLB

gLite WN software upgraded to the new version announced last week

Oct. 18 2006

Status

New gLite-LCG flavour CE ce107. Same for two new BDII machines, one as top level LCG-EGEE.BDII and one as EXP-EGEE.BDII.

The firewall settings for all these types of services are now configured by LANDB sets. Adding a new machine therefore no longer requires either a request to computer.security or a port scan. The machine simply has to be added to the appropriate LANDB set, which is done in CDB. Nevertheless, security port scans will still be done in the future.

In coordination with the VOs the experiment software tags have been merged and unified on all CEs. Any updates should now be propagated automatically to all CEs.

Issues

ce107 still fails the SFT tests.

Oct. 11 2006

Issues

High loads and failures on LCG-CE and LCG-EGEE.BDII machines are ongoing.

Suffering from inode problems in /tmp on the RBs. Workaround: a cron job that regularly cleans /tmp of the thousands of empty files that cause the problem. Long-term solution: change the configuration of the software so that it stops writing these files.
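
The cleanup workaround can be sketched as follows. This is only an illustration under stated assumptions: it removes zero-length regular files older than a minimum age so that files still being created are left alone. The function name, the age threshold and the directory argument are hypothetical and are not taken from the cron job actually deployed.

```python
import os
import time


def remove_empty_files(directory, min_age_seconds=3600, now=None):
    """Delete zero-length regular files in `directory` older than `min_age_seconds`.

    Returns the sorted list of removed paths. Subdirectories are not visited.
    """
    now = time.time() if now is None else now
    removed = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip directories, symlinks, sockets and other non-regular files.
            if not entry.is_file(follow_symlinks=False):
                continue
            info = entry.stat(follow_symlinks=False)
            if info.st_size == 0 and now - info.st_mtime >= min_age_seconds:
                os.remove(entry.path)
                removed.append(entry.path)
    return sorted(removed)
```

Each empty file still consumes one inode, which is why deleting them frees inodes even though they occupy no data blocks.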

Work in Progress

New CE and BDII machines are being put in place: ce107 (waiting for firewall settings), BDII107 & BDII108 are ready for installation, and the EXP-EGEE.BDII machine will become LCG-EGEE.BDII machines as well, which makes 6. More CEs should come as well (ce108, ce109, ce110), already on mid-range server h/w.

Oct. 04 2006

Issues

Many high load and daemon dead alarms on ce101 and ce102. Under investigation, but possibly due to heavy (but not unusual) usage.

Sep. 27 2006

Issues

gssklog s/w has been updated due to a problem with the error output (See GGUS #13289)

Sep. 22 2006

Status

  • rb109 is in production

Work in Progress

  • installation of lxgate20 and lxgate21 is ongoing

Issues

  • problems with one LCG CE: ce102 suffered from high load over the last two days and had to be rebooted (once). It was put into draining status for some time to give it a chance to recover. It is back in production now, but the problem itself remains unsolved and may return at any time.

  • got another GGUS ticket about failing SFTs for ce102, with the reason: JL failed for this CE, although the CE was in "draining" mode. Does this make sense?

Sep. 13 2006

Status

* gssklog allowing GRID sgm accounts to get AFS tokens to write to software areas has been deployed on LXBATCH. It works for gLite CE without changes to the user jobs. For jobs submitted via the LCG CE, we are trying to find a similar solution. So far, a small modification to the user job is necessary.

Work in Progress

* rb109 installed as second WMSLB for CMS. Status: waiting for rename of the machine. This machine will be configured with RAID1 on the data disk.

* lxgate20 and lxgate21 are installed for ATLAS as UI and classic SE. Status: UI installation ongoing.

Issues

* Waiting for the expert to be around for upgrading the gLite software and checking.

* We can enable passing of memory requirements on the gLite WMS, as required by ATLAS. TBD: Should we do it at CERN-PROD, and what are the other sites doing?

* Still instabilities on ce101 due to high load (peaks). But now the load is distributed over the four CE hosts. Do we need more?

Sep. 06 2006

Status

Since Monday at noon, CERN has been publishing additional short queues of 2 hours (or 2 days) length. These are:
grid_cms_2nh
grid_lhcb_2nh
grid_atlas_2nh
grid_lhcb_2nd

The new queues are in production and already being used by the experiments.

Work in Progress

New gLite RB lxb7283 is ready to be used.

New software (gLite 3.0.2) is deployed on the WMS.

Issues

ce103 (gLite-CE) has shown problems recently, "GRAM gatekeeper[6430]: GSS failed to get server credentials". Reconfiguration and rebooting did not help. The certificate is not expired and has not been touched since its installation in June. We will go for a reinstall. Any help is appreciated.

gLite-3.0 CE hosts do not publish the right memory size for the cluster (GlueHostMainMemoryRAMSize), which should come from "CE_MINPHYSMEM" in the lcg-site-info.def. They are publishing the RAM size of the CE instead. GGUS ticket opened (12279).

Aug. 30 2006

Status

  • new LSF groups grid_CMSPRD and grid_ATLASPRD holding the prd accounts have been allocated. Effective from tomorrow. Prepared the same for LHCb and Alice, but we need to know the share allocation (in percent, relative to the current total GRID share allocation)

  • new LSF queues introduced:
   QUEUE_NAME      PRIO STATUS       MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
   grid_2nh_cms     25  Open:Active    -    -    -    -     0     0     0     0
   grid_2nh_atlas   25  Open:Active    -    -    -    -     0     0     0     0
   grid_2nh_lhcb    25  Open:Active    -    -    -    -     0     0     0     0

Tests coming soon. We will have a meeting with CMS on this today and we will also contact ALICE.

  • Besides the issues with monb001 (see below), the grid accounting is now working, even with multiple CEs. (But still no accounting for gLite 3.0.)

Work in progress

  • New test WMS (lxb7283) is set up to be used as test RB instead of the production RB102. Currently waiting for a certificate

  • We have started to implement local firewalls on Quattor-managed GRID machines. We will start with CE & WMS. Thanks to Romain Wartel from GD for providing the list of GRID ports in a CDB-usable template format. More tests have to be carried out, and an announcement will be made before we go live on the production machines.

Issues

  • (again) new version of the information providers deployed on the CEs:
    • fixes a problem seen when a user recently decided to submit 40k jobs
    • some minor bug fixes

  • monb001 is running out of memory and is therefore no longer reporting accounting information to RAL.
    • Who is currently responsible for this machine? We need some help on what to do in a situation like this: restart gLite, reboot the node? And is this a known problem that is being followed up, or should we open a GGUS ticket?

Aug. 22 2006

Status

  • The LSF information providers have been updated on all our production CEs. No problems seen so far.

Aug. 16 2006

Status

  • A new version of the LSF dynamic info provider plugin was deployed yesterday on the production gLite CEs.
  • One of the two gLite CEs, ce104, has been given back by GD, who used it for tests. It was reinstalled and put back into production recently.

Work in progress

  • we plan to deploy the new info providers on our LCG production CEs today, if no major problems are seen.
  • GRID accounting has been enabled again, although we are still only publishing the data from one CE, because the APEL software is not ready to handle multiple CEs. We are in contact with the developers and waiting for a reply.
  • The power problems have been solved and all of LXBATCH will be switched on again successively, which will bring back the CERN-PROD capacity to its level before the shutdown.

Aug. 1st 2006

Status

  • problems with the LHCb log file server: the machine was flooded with grid-ftp requests until it ran out of memory and went down. A reboot did not help, since the grid-ftp daemon is started very early in the boot sequence, even before ssh, and brought the machine down again by answering requests before it had a chance to boot up properly. As a hot fix, iptables rules were set up to limit the rate of incoming requests to 3/s, which was then increased to 10/s and finally to 20/s. LHCb took action and changed the way it uses the machine on 1/8/06, and the iptables rules were disabled again.
  • LSF reconfiguration problems are definitely solved. A full reconfiguration now takes about 2 min (previously 2 h) and does not bring the system down.
  • LSF dynamic info providers: caching works fine, but a problem with stale lock files on the cache file caused the published number of CPUs to become erratic today. A fix (detection of stale lock files) will go into the new version, which is under development.
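
Stale lock detection can be sketched along the following lines. This is a minimal illustration, assuming a lock file counts as stale once its modification time is older than a fixed threshold. The function names and the threshold are hypothetical and do not describe the fix that went into the info providers.

```python
import os
import time


def lock_is_stale(lock_path, max_age_seconds=600, now=None):
    """Return True if the lock file exists and is older than `max_age_seconds`."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return False  # no lock at all, so nothing is stale
    return now - mtime > max_age_seconds


def acquire_lock(lock_path, max_age_seconds=600):
    """Try to take the lock, removing a stale lock file first.

    Returns True if the lock was acquired, False if a fresh lock is held.
    """
    if lock_is_stale(lock_path, max_age_seconds):
        try:
            os.remove(lock_path)
        except FileNotFoundError:
            pass  # another process cleaned it up first
    try:
        # O_EXCL makes the creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True
```

The sketch leaves a small race between removing a stale lock and creating the new one, and the atomicity of exclusive creation depends on the file system holding the cache, so a production version would need more care.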

Work in progress

  • rewrite of LSF dynamic info providers: due to scalability issues these scripts are being restructured.

July 25th 2006

Status

The hot fix mbatchd delivered last week did not solve our problems. The problem is going into another iteration now. A new attempt is currently being organized for the coming night. This has been requested by Platform.

UPDATE: the test of the new mbatchd on 26/7/06 at 4:00am done by Platform was successful. They did a full reconfiguration twice, and according to the lemon monitoring plots there was no visible service interruption. We will resume the reconfiguration procedure using this new mbatchd and see how it works, and Platform will continue to watch it for some time.

A problem with the site functional tests failing for our gLite CEs was tracked down to a problem with the site functional tests themselves, and was fixed at the end of last week.

Last week a problem with the firewall settings of our new LCG CE's ce105 and ce106 was discovered and fixed.

Work in progress

  • A new version of the LSF dynamic info providers rpm is currently being tested. Besides a couple of bug fixes, it contains better modularisation and configuration of the SW, and better support for cache files on a remote file system. These tests are being done on lxgate13.
  • ongoing tests on the implementation of passing user requirements to gLite CEs via blah. In this context a set of new time based grid queues have been added to BATCH.

July 17th 2006

Several people reported that the CERN production CEs published inconsistent information, e.g. about the number of CPUs in the batch system. The reason for this was that each CE reported a snapshot of the system taken at a different time. The problem was solved by introducing a better synchronisation of the lsf-dynamic-info-providers running on these nodes. As a side effect, the number of batch system queries to the LSF master is significantly reduced, taking away a bit of load from the master.
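
The report does not describe how the synchronisation was implemented. One way to obtain both effects (consistent numbers and fewer master queries) is to let all readers share a single cached snapshot that is refreshed only when it expires. The sketch below is an in-process stand-in for such a shared cache; the class name, the `query` callable and the lifetime are assumptions made for illustration.

```python
import time


class SharedSnapshot:
    """Cache one batch-system snapshot and hand the same copy to every reader."""

    def __init__(self, query, ttl_seconds=120, clock=time.time):
        self._query = query        # callable that asks the batch master
        self._ttl = ttl_seconds    # lifetime of one snapshot in seconds
        self._clock = clock        # injectable clock, useful for testing
        self._data = None
        self._taken_at = None
        self.queries = 0           # how often the master was actually asked

    def get(self):
        now = self._clock()
        expired = self._taken_at is None or now - self._taken_at >= self._ttl
        if expired:
            self._data = self._query()
            self._taken_at = now
            self.queries += 1
        return self._data
```

All readers that call `get()` within one lifetime see identical values, and the master is queried once per lifetime instead of once per reader.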

Problems with LSF continued to be investigated by Platform. The automatic reconfiguration procedure has been stopped until the problem is solved. Platform developers are working on a solution for us. Platform delivered several versions of mbatchd to debug the problem. With the debugging output delivered on Monday they finally found the root cause of the problem. They have sent us a hot fix mbatchd which we are about to install now.

GSSKLOG functionality roll out has been delayed because initial tests of the proposed mechanism failed. Work is in progress.

July 6th 2006

Ongoing load problems with the production CERN-PROD WMS. Two new CEs (ce105, ce106) added to the existing ce101 and ce102.

gLite 3.0 WMS (ce103, ce104 and rb101, rb102, rb103) added to the information system

LXBATCH has been upgraded to gLite 3.0

=> We will announce today gLite 3.0 for CERN-PROD on CIC.

Still many workarounds are necessary, and the solutions that come from bug reports need fixes that produce extra work. The proper solution will only come in the next release, if ever.

Problems with our local batch scheduler LSF and wrong information published in the IS due to timeouts in LSF commands are under investigation between our LSF experts and Platform. They are now investigating online when problems appear.

At the request of ALICE and LHCB, we are planning to add GSSKLOG functionality (AFS tokens from GRID certs) for VO-SGM accounts, to allow the SGM accounts to install software at CERN in AFS via Grid jobs. The announcement was made yesterday; if there is no negative reply, we will modify the Jobstarter on Monday.

June 8th 2006

New gLite 3.0 WMS ready, waiting for opening of ports on the CERN firewall to do the final testing.

Added 100 GRID users to the four VOs ALICE, ATLAS, CMS, LHCB.

Created new 'OPS' VO with 50 accounts

Fixed entries in the IS:

GlueHostOperatingSystemName: Scientific Linux CERN
GlueHostOperatingSystemRelease: 3.0.6
GlueHostOperatingSystemVersion: SL

Work in progress

Creating queue and CE entries for the 'OPS' VO

Issues

It is disappointing that there is no documentation about which ports are needed for what services, and if they are needed only locally, on the LAN or on the WAN. It is only due to the personal knowledge of people that we were able to identify these ports, and we still have to test if this is o.k. In addition the ports have changed from LCG_2_7 to gLite 3.0, therefore a simple copy and paste was not sufficient. Furthermore, the PORT_RANGE parameter in the WMS configuration is not honoured by the software and a manual patch has to be applied after the installation but before the configuration of the middleware.

GRID accounting display stopped (again) for CERN-PROD at the beginning of May, due to software changes that have overwritten the configuration. To be followed up.

'Erratic' publication of resources is understood to be due to timeouts and overload of the CEs. New info-provider scripts for LSF are being tested by the LSF team.

May 31st 2006

Today we deployed 5 new RBs on mid-range servers that have been set up by Yvan: rb104 (Alice), rb105 (Atlas), rb106 (CMS), rb107 (LHCb), rb108 (dteam, geant, gear, na48, sixt, unosat). The old RBs will be phased out in the coming weeks.

gLite 3.0 WMS (test-) set up successful. 3 RB's have been set up: rb101, rb102, rb103 with gLite 3.0. Two CE's are set up with gLite 3.0: ce103 and ce104. The last step before putting them in production is to open the necessary ports on the CERN firewall for these services.

gLite 3.0 UI has been installed as before in AFS, and is available on LXPLUS and any other node running AFS.

New LSF dynamic info providers have been installed on all CE's to offload them and the LSF batch system.

Work in progress

gLite 3.0 WN installation will follow on LXBATCH.

Issues

Needed open ports for gLite 3.0 WMS not clear. A patch is necessary on CE to use the port range parameter.

Reminder: Additional VO accounts (VO101 - VO200) still not created. This has to be done by the VO admins in the new CRA system to reserve their UIDs, before they can be used in the WMS.

May 30th 2006

Today we did a very smooth, zero-downtime transition of the myproxy.cern.ch alias from the old MyProxy server to the new high-availability cluster (prod-px) set up by Tim.

May 10th 2006

Work in progress

Setting up a gLite 3.0 WMS

Issues

Accounting problems: How to report accounting data from different CE's for the same WN. Harry in contact with Dave Kant from RAL.

LSF queries on the CEs time out due to many executions of (heavy) b-commands. This leads to the publication of useless default values in the IS. The cause is understood, the solution not so easy. TBD

Killing of many CMS GRID jobs overloaded and killed one of our CEs (ce102) last weekend. Again, the problem is now understood and being investigated. Solution?
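
One possible mitigation, sketched here only as an illustration and not as the solution adopted, is to bound each batch query with a timeout and to fall back to the last successfully obtained value instead of a default. The function name and its parameters are hypothetical; `command` stands for whatever batch query is being run.

```python
import subprocess


def query_with_fallback(command, timeout_seconds, last_good, default):
    """Run `command`; on timeout or failure return the last good value.

    Returns (value, fresh), where `fresh` tells whether the command succeeded.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            check=True,
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # Prefer a previously obtained value over a meaningless default.
        fallback = last_good if last_good is not None else default
        return fallback, False
    return result.stdout.strip(), True
```

The `fresh` flag lets the caller decide whether to update its stored value or to mark the published data as possibly outdated.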

-- ThorstenKleinwort - 10 Jul 2006

Topic revision: r5 - 2007-03-14 - ThorstenKleinwort
 