Week of 090720

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:



Attendance: local(Simone, David, Eva, Julia, Gavin, Gang, Alessandro, Patricia, Harry, Jean-Philippe);remote(Xavier Mol, Angela, Michael, Ronald, Fabio, John Kelly, Fabio, Danielle).

Experiments round table:

  • ATLAS - SC: Last Friday's problem at ASGC of "No default service classes defined" has been resolved - Castor had gone down following a power cut. Serial processing at ASGC still cannot start as stage pool cleaning is not working. There were a couple of problematic Tier 2s over the weekend. NDGF reported some corrupt files last week and these are now being recovered from other sites.

  • CMS reports - DB: Will give a full report today then return to incremental mode. 1) A Tier0 CAF user is getting error messages reading from CAF pools but this is thought to be a user environment problem. 2) There were 7 tickets reopened at ASGC - 4 have been understood and closed. The main issue left is that for some weeks important data have not been migrated to tape. Gang reported that their tape drives (currently 7) are busy reading CMS files from tape (1200 jobs in queue) but that they should reprioritise their use, perhaps dedicating 2 drives to migration. Also ASGC fails to install a new software release, reporting a locked lcg-tags file. 3) The link from a Russian Tier 2 to ASGC has been failing commissioning for 5-6 days - it looks like the Russian site is overloaded by ATLAS jobs. 4) The GRIF (French) Tier 2 fails to install a new software release, reporting a hanging installation job. 5) Phedex transfers from FZK to TIFR (India) are failing and expiring. Not thought to be a Tier 1 issue. 6) Loadtest transfers from PIC to CALTECH have been failing - may be due to network overloads at Caltech as reported in another ticket. Probably already solved. 7) Two old tickets (from end June) will be escalated to experts - CRAB jobs failing in Estonia and Phedex transfers to IPHC (France) timing out.

  • ALICE - PM: Have been testing a new version of the ALIEN module that submits jobs to the CERN gLite 3.2 WMS. Due to a bug it submitted huge numbers of jobs, overloading the WMS, which had to be drained over the weekend, so not much production was done. Known problems have now been resolved and testing continues today. The WMS itself is working fine and does not show the scalability issues shown by previous versions.

  • LHCb reports - The 1 billion event MC production is now running its last 500 jobs. Some further signal production was submitted and there was a long tail of reprocessing activity last week (500 jobs at NIKHEF).

Sites / Services round table:

* FZK AP: Had a disk problem with stale NFS handles that was quickly fixed but that they would like to make more automated.

* SARA RS: Have installed 12 new tape drives. Still needing some configuration/tuning but already being used for data migration.

* IN2P3 FH: Advance warning of downtime from 22-24 September where work on the electrical infrastructure will lead to significantly reduced batch capacity. Full service resumption on the morning of 25th and detailed planning will be available early in September.

  • CERN Networking: There was a short (5 minutes) Geant network interruption between Paris and London on Saturday about 08.30 and there is no backup link so traffic stopped. No report from Geant yet.

  • CERN FTS: Some transfers have been observed to be going via version 1 endpoints by not using fully qualified surls. Details are being checked.

  • CERN Databases: The CMS offline production database has been migrated to a RHEL5 64-bit platform (later news - hit performance issues and was rolled back).

  • CERN CASTOR: Two switch modules in the CASTOR RAC infrastructure have to be changed. A test of the procedure was successful today so the intervention to be made tomorrow from 10.00-11.00 should be transparent but if not there is a risk of a 15 minute interruption.
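
The CERN FTS item above concerns SURLs that are not fully qualified. As a minimal sketch, assuming the common convention that a fully qualified SRM v2.2 SURL names the port and web-service path explicitly (`srm://host:port/srm/managerv2?SFN=/path`), a check could look like the following. The helper name and the example hostnames are hypothetical, and this is not how FTS itself selects an endpoint.

```python
from urllib.parse import urlsplit


def is_fully_qualified_surl(surl: str) -> bool:
    """Return True if the SURL gives the port, service path and SFN explicitly.

    Assumed convention (not taken from the minutes):
        srm://host:port/srm/managerv2?SFN=/path/to/file
    """
    parts = urlsplit(surl)
    if parts.scheme != "srm" or not parts.hostname:
        return False
    try:
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    return (
        port is not None
        and parts.path.startswith("/srm/")
        and parts.query.startswith("SFN=/")
    )


# Hypothetical examples: the first is fully qualified, the second is the short form.
full = "srm://srm.example.org:8443/srm/managerv2?SFN=/data/atlas/file1"
short = "srm://srm.example.org/data/atlas/file1"
print(is_fully_qualified_surl(full), is_fully_qualified_surl(short))
```

A scan of this kind over a transfer job's source and destination SURLs would flag the short forms that leave the endpoint version to be guessed.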



Attendance: local(Gang, Graeme, Harry, Jean-Philippe, Simone, Miguel, Patricia, Ricardo, Gavin, MariaG);remote(Xavier (FZK), Tiju (RAL), Michael, Jeremy, Joel, Danielle).

Experiments round table:

  • ATLAS - 1) Small issue with the ATLAS Panda to Oracle connections where we had two hanging connections on our side. Have now increased the logging on our side and also have worked out a procedure to kill such processes even from outside of CERN. 2) We want to retest STEP'09 reprocessing at ASGC, where we are waiting for confirmation, and at FZK. FZK then announced they are bringing in new tapes and doing some tape system reconfiguring so would like this to start Thursday of this week.

  • CMS reports - 1) The T0 CAF problem reported yesterday was a user problem. 2) The ASGC software installation problem has been closed as understood. 3) The link commissioning failure from Russia to ASGC has now been passed to network experts. 4) 3 Tier 2 tickets closed - left open are failing test transfers from CIEMAT to Brunel (London), which is waiting for the site admin to come back from vacation, and failing transfers from T1 to IPHC (French T2). 5) The FZK to PSI (Switzerland) problem had been resolved the same day but the ticket was not closed so we need to check communications there.

  • ALICE - ALIEN central servers are being relocated to new racking (within their building 12 centre) since 09.00 today so production has been halted. Scheduled to finish by 17.30.

  • LHCb reports - Two GGUS tickets for stalled batch jobs at CERN (ticket 50406) and Nikhef (ticket 50408).

Sites / Services round table:

* BNL (ME): Some 2 PB of disk is being added, which requires rebooting some 30 disk servers. Each takes less than 5 mins so they will be done today starting at about 14.00 UTC.

* CERN Databases (MG) : The CMS offline cluster migration to RHEL5 yesterday had to be rolled back adding an hour to the downtime. There is a known incompatibility between RHEL5 and QLogic switch firmware for which we have a workaround that does not appear to work well for databases under high load. Migration of the LCGR DB scheduled for tomorrow has been postponed. That of the LHCBR DB on the 5 August will also be reconsidered with a decision announced on Monday (Roberto Santinelli will coordinate for LHCb). Yesterday the ALICE offline cluster automatically evicted some failing disks and the DB had to be restarted today to rebalance the storage.

* CERN FTS (GM): Decided now to change FTS to use SRM 2.2 by default. Will be scheduled for next Thursday. Danielle questioned the FTS 2.2 release timeline. It is being tested in the PPS. ATLAS is testing the checksum changes and there are changes in how it interacts with SRM to test at some scale. Gavin will contact Nicolo Magini to involve CMS in the tests. Simone reported that ATLAS have been testing it from CERN to TRIUMF for a few days at a steady 4 MB/sec and 30 files an hour with no problems seen.

* ASGC: Yesterday tried to clean a lot of data from the SAM disk pool so SAM CE tests recovered, at least temporarily. We have turned on garbage collection and plan to double the size of the pool.

* CNAF - by email from Luca: On Sunday 12 July at 01:13 am the ATLAS LFC standby database in Roma became unreachable because of a storage problem. Moreover, at CNAF, on Sunday afternoon, a problem that is not well understood caused the loss of connectivity to the storage area network from several Oracle clusters, among which was the ATLAS LFC one. Due to this connectivity problem, several clusters were automatically rebooted. After the reboot, the connection between the LFC front-end and the back-end was automatically restored, but unfortunately the software wasn't functional. On Monday the 13th, the database was hung with an error ORA-29702 (error in cluster group service operation). We found a lot of connections (order of 100) on the database, while the usual number is 40. The investigation of this problem is difficult because in the LFC front-end logs there is a hole between July 12 at 22:41 and July 13 at 10:19, probably due to the fact that the lfcdaemon was hung. As the database in Roma was unavailable, the failover didn't succeed. The service was restored in the evening on Monday the 13th, at both the CNAF and Roma sites.
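
The hole in the LFC front-end logs reported by CNAF runs from July 12 at 22:41 to July 13 at 10:19. As a minimal sketch, a gap of this kind can be located by scanning consecutive log timestamps for a jump larger than some threshold. The function name, the one-hour threshold and the two extra sample timestamps are assumptions for illustration and are not taken from the actual LFC logs.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, threshold=timedelta(hours=1)):
    """Return (start, end, length) for each pair of consecutive
    timestamps that are further apart than the threshold."""
    ordered = sorted(timestamps)
    return [
        (a, b, b - a)
        for a, b in zip(ordered, ordered[1:])
        if b - a > threshold
    ]


stamps = [
    datetime(2009, 7, 12, 22, 40),  # invented entry before the hole
    datetime(2009, 7, 12, 22, 41),  # last entry quoted in the email
    datetime(2009, 7, 13, 10, 19),  # first entry quoted after the hole
    datetime(2009, 7, 13, 10, 20),  # invented entry after the hole
]

gaps = find_gaps(stamps)
for start, end, length in gaps:
    print(start, "->", end, "gap of", length)
```

For the two timestamps quoted in the email, the gap is 11 hours 38 minutes.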



Attendance: local(Oliver, Harry, Graeme, Miguel, Jean-Philippe, Maria, Simone);remote(Tiju, Angela).

Experiments round table:

  • ATLAS - GS: 1) The ASGC stage buffer has been cleared so reprocessing tests with pre-staging recall from tape have been launched. CPU activity will peak after about 8 hours. ASGC confirm that CMS is also busy running stripping work so this will be a useful multi-experiment test of sharing batch and data access. 2) ATLAS have been running a large scale analysis stress test in the UK for the last 24 hours. The Glasgow T2 had upgraded their DPM to version 1.7.2-4 and during these 24 hours, under high load, their SRM version 2.2 crashed 7 times - this is not seen under low load. J-P volunteered to look at a dump.

  • CMS reports - apologies from Danielle who is unable to attend today.

  • ALICE - After the relocation of the ALICE central services nodes, production has been running again since this morning. They are not seeing any significant issues at any of the ALICE sites. For the coming two weeks ALICE will connect to this meeting only when they have a specific issue.

  • LHCb reports - busy in another meeting.

Sites / Services round table:

* BNL (by email from ME): Yesterday I mentioned that we would have to reboot ~30 servers to connect the ~2PB of new storage. It turned out that this was not necessary since Solaris (as others do) allows a “live attachment” to be performed. Our experts found that this was working well and almost all of the 50 storage arrays are now visible behind the existing storage servers. The storage management group will be installing/configuring dCache pools on the new arrays and we expect to have them all available by the end of today.

* CERN Databases (MG): 1) The ATLAS DB at IN2P3 is down for a 12 hour intervention to fix a block corruption (announced yesterday). They also cannot restart the capture process for replication of the ATLAS AMI database from IN2P3 to CERN. 2) The restart of the ALICE online to rebalance the disks after yesterday's failure did not work as planned and a complete recovery from disk is now being performed. Should be back in 1 more hour. 3) All upgrades to RHEL5 64-bit (would have been LHCb next week) have now been put on hold following yesterday's problems after the CMS upgrade and rollback.

* CERN CASTOR (MCS): Tomorrow there will be a transparent rolling upgrade to the CASTOR nameserver from 10.00 to 12.00.

AOB:


Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

IN2P3 (by email): This mail is to let you know that the DBATL database was put back into production last night (around 11 PM). Unfortunately I clicked on the delete link instead of the update one (maybe a confirm dialog would be useful to avoid such errors, certainly due to the late hour). The problem we hit was that a corrupt block in the sysaux tablespace caused all 4 instances to core dump with the following errors:

ORA-00607: Internal error occurred while making a change to a data block

ORA-00600: internal error code, arguments: [kddummy_blkchk], [3], [19539], [18007], [], [], [], []

Thanks to Oracle support we were able to mark the block as corrupted and then drop it. At this time, there is no clear explanation or clue as to the reason for this corruption (which was logical, not physical). All we know is that the corrupted block was of Block Type = Pagetable segment header block, and the HWM was wrong (HWM block 50 beyond extent boundary 8).



Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:


-- JamieShiers - 16 Jul 2009

Topic revision: r7 - 2009-07-23 - HarryRenshall