Week of 090713

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:



Attendance: local(Eva, Jamie, MariaG, Ueda, Andrea, JeanPhilippe, Gang, David, Julia, Harry, Roberto, Simone, Alessandro, Patricia, Jan );remote(Gareth, Brian, Angela, Alessandro).

Experiments round table:

  • ATLAS (Ueda and Simone): Problems reported at the weekend of raw data files not being accessible. Also a problem on Sunday morning, from 10.27 to 10.33, of lost connections from the central catalogs to ATLR, to be further investigated by the Physics Database Services. It appears that as sites upgrade their WNs to the latest gLite version, they start failing ATLAS production jobs; the failure occurs when the ATLAS-distributed python 2.5 imports the logging module. Details in https://gus.fzk.de/ws/ticket_info.php?ticket=50148; a workaround is available to the sites, as described in the ticket. This morning there was a problem of high load caused by DQ2 on ATLR, now fixed and due to a developer's error; a post-mortem is being produced. Problems with the CASTOR DB backend at CNAF occurred at the weekend. It might be OK now, but this cannot be confirmed as there is currently a downtime of the local LFC (possibly due to a network problem - under investigation).
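As a rough illustration of the kind of check involved (the actual workaround is the one documented in GGUS ticket 50148; this sketch is not it), a site could verify which interpreter and which logging module a worker-node job environment actually resolves:

```python
# Hedged sketch of a WN environment sanity check for the gLite/ATLAS
# python conflict described above. The printed paths will vary per site;
# this is illustrative only, not the official workaround from the ticket.
import sys
import logging

# Show which interpreter and which logging module were actually picked up.
print("interpreter:", sys.executable, sys.version.split()[0])
print("logging module:", logging.__file__)

# A shadowed or mismatched logging module typically fails on import or
# resolves to an unexpected path; confirm it is importable and usable.
logging.basicConfig(level=logging.INFO)
logging.getLogger("wn-check").info("logging import OK")
```

On an affected node, the failure would show up already at the `import logging` step or in an unexpected module path.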

  • ALICE (Patricia): Production is currently on standby due to the set-up of the new MC cycle. There is also a new module for job submission via the WMS, which might bring in some initial inefficiencies. The VOboxes will go out of warranty in Q4 2009; their replacement is being reviewed with FIO (2 in production and 2 on standby).

  • LHCb (Roberto) reports - Currently running 10.5K jobs between user private analysis (~1K) and MC09 production. The reprocessing activity announced last week is still awaiting all sites removing data from the cache. About 15K jobs are pending the green light to be submitted to sites for testing the staging and file access at the T1's. Problems at CNAF with CASTOR at the weekend (see GGUS ticket) and at FZK with the shared area on some of the WNs. Also at T2s: a wrong BDII publication altering the computation of the rank, shared area issues, and too many pilots aborting.

Sites / Services round table:

  • RAL (Gareth): The problem on the ATLAS CASTOR reported last Wednesday was fixed in the early afternoon and was due to a RAC node crash. Today the Oracle "BigID" patch was applied. Tomorrow a scheduled CASTOR upgrade to 2.1.7-27 will take place from 5am to 4pm UTC, affecting all VOs. In answer to a question from Simone: the 2 disk servers which were not available last week at the time of the computer room move are now available again. Because of the intervention foreseen tomorrow, ATLAS will include RAL as of tomorrow evening for getting the fast reprocessing ESDs.

  • CNAF (Alessandro): LFC problem at the moment being investigated and probably due to a network problem.

  • ASGC (Gang): ATLAS reprocessing agenda to be published shortly.

  • FZK (Angela): Nothing to report.

  • NL-T1 (by email):
Nikhef moving grid infrastructure to new data center

A new data center has been built at Nikhef. The existing grid
infrastructure at Nikhef will be moved to this new data center between
10 August and 21 August. During the migration process, grid services
will be unavailable.

This large-scale operation will take place in two phases:

1) Moving grid services and network infrastructure (10-14 August 2009)
During this phase, all grid services at site NIKHEF-ELPROD will be
unavailable. For grid users this means that the following services
cannot be used:
- Computing services (CEs gazon.nikhef.nl and trekker.nikhef.nl);
- Storage services (SE tbn18.nikhef.nl);
- Job submission services (WMS graszode.nikhef.nl, graspol.nikhef.nl);
- Requesting renewal of grid certificates via the Dutchgrid CA web site
will not be possible on 10 and 11 August (requests can be submitted via
mail but will not be processed);
- The web sites www.dutchgrid.nl, www.vl-e.nl and poc.vl-e.nl will be
unavailable.
2) Moving compute and storage clusters (15-21 August 2009)
In this phase, the computing and storage clusters will be unavailable.
Grid users will not be able to:
- Use the computing services (CEs gazon.nikhef.nl and trekker.nikhef.nl);
- Access certain data files via SE tbn18.nikhef.nl.

We advise all users of the grid infrastructure to:
- Request renewal of grid certificates before August 5th (only if the
certificate will expire early or mid-August);
- Use the grid computing services at SARA (CE ce.gina.sara.nl);
- Submit grid jobs via the WMS at SARA (WMS wms.grid.sara.nl);
- Plan their work such that no access is required to data files via
storage element tbn18.nikhef.nl, or copy relevant data elsewhere.

Kind regards,
Ronald Starink

  • CERN Services round table

    • Databases (Eva): Migration of some production DBs (LHCb, CMS and WLCG) to RHEL5 is announced and scheduled in three weeks, after the successful migration of the validation clusters. Also, three more schemas were added to the ATLAS PVSS replication from the online to the offline DB.



Attendance: local(MariaG, Eva, Jan, Jamie, Ueda, Simone, Harry, Gang, Julia, Alessandro, Roberto, Andrea);remote(Ronald, Jeremy, Michael, Angela, Gareth, Luca).

Experiments round table:

  • ATLAS - Reprocessing of cosmics is progressing at a good pace; 90% has already been done. The plan is to complete it by July 16th. The problems at CNAF reported yesterday on CASTOR and LFC are solved. Also, the problem reported yesterday of sites upgrading WNs to the latest gLite version causing job crashes is now understood, and the workaround has been published on GGUS (ticket number 50148) and Savannah.

  • ALICE - No report.

  • LHCb reports - LHCb will restart reprocessing tomorrow afternoon after the CERN CASTOR intervention finishes. Special care will be taken to transfer data back to NIKHEF (where it had accidentally been scratched, also from tape). SARA is now being asked to go again through the cache to remove data. MC09 is proceeding well with 11k concurrent jobs running. Three GGUS tickets are open (the one reported yesterday for a stager problem at CNAF having been reopened), and DST transfers to the MC-DST space at PIC are failing.

Sites / Services round table:

  • NIKHEF (Ronald): WN capacity will be ramped down by 30% because of some cooling problems that will be solved only with the move to the new Computer Center (see Monday's report). The full pledge will be reached by Oct/Nov 2009.

  • BNL (Michael): The HPSS is currently being upgraded (just started) and will be offline for the entire business day (till 5pm local time). During this window, no data transfers into or out of the system will be possible.

  • FZK (Angela): Since Sunday evening one of the pair of NFS servers serving the ATLAS software repository has been suffering from a high load. This causes ATLAS repro jobs to fail. Trying to identify the reason for this, but possibly some worker node(s) are in a loop causing excessive access. At the moment the server with the high load is no longer selected by the WNs; ATLAS will try to get the fast reprocessing going using the remaining server. One of the tape libraries shows errors contacting drives; we may have to restart the library control to try to fix this. The assembly of a dedicated 'tape stager' cluster is progressing, and early next week we can continue recall tests using the new stagers. We have observed a sustained 100 MB/s for CMS recalls (using 2 LTO drives) and 300-350 MB/s for ATLAS recalls (using 6 LTO drives). The new stagers will provide redundancy and increased bandwidth. The tape intervention announced in the CIC site report for this Thursday (16th July) has to be postponed to next Tuesday (21st July); we will change some tapes and add two tape drives.

  • RAL (Gareth): Currently in the scheduled intervention for the CASTOR upgrade to 2.1.7-27; it should finish within the announced time window. Next Monday there will be an intervention affecting the ATLAS LFC to separate its database backend from the other, non-ATLAS ones.

  • CNAF (Luca): Still investigating the root cause of yesterday's ATLAS LFC outage (possibly due to the network or to exhaustion of the DB's maximum number of concurrent connections).

  • ASGC (Gang): Just out of a scheduled CE downtime of a few hours for an upgrade.

  • FIO Operations (Jan): An intervention to upgrade CASTORPUBLIC to version 2.1.8-9 will affect all CASTOR users from 9:00 to 13:00. Upgrades on the CASTOR tape service will affect all CASTOR users from 8:00 to 17:00.

  • Databases (Eva): The ATLR services suffered a DB service stop, with session kills for several connected sessions, on Sunday 12th at 10:26; full service connectivity was restored at 10:28. A second similar issue happened at 10:32 and full service connectivity was restored at 10:36. Applications with 'automatic reconnect' should have experienced 2 glitches in the time window mentioned. The event was triggered by what should have been a transparent change to fix an error message from the monitoring system: the flash recovery area was filling up due to DB growth and the parameter db_recovery_file_dest_size needed to be increased accordingly. An erroneous setting of the parameter by the DBA triggered Oracle to perform service relocation as described above. The parameter was set correctly at 10:36 by the DBA and the services were restarted and have been fully available since then. Yesterday afternoon the streams replication from the ATLAS production DB to the Tier1 sites affected 8 sites due to a deadlock triggered by code executed by ATLAS people at CERN in order to migrate some COOL schemas; the problem is understood and solved. Also, this morning the apply processes at BNL, IN2P3 and ASGC were blocked. The problem seemed to be caused by the apply servers blocking themselves. To fix it, the apply processes were stopped, the apply parallelism was changed to 1, and the processes were restarted; the apply was then able to resume.
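The recovery steps described above can be sketched with generic Oracle Streams administration commands; this is a hedged illustration under the assumption of a 10g-style Streams setup, with a hypothetical apply-process name and quota, not the exact commands used by the DBAs:

```sql
-- Hedged sketch only: generic commands matching the fixes described above.
-- APPLY_ATLAS_T1 is a hypothetical apply-process name; the actual names
-- and sizes used in production may differ.

-- Enlarge the flash recovery area quota (the parameter whose erroneous
-- setting triggered the ATLR service relocation).
ALTER SYSTEM SET db_recovery_file_dest_size = 500G SCOPE = BOTH;

-- Unblock a stuck Streams apply process by dropping its parallelism to 1,
-- then restarting it.
BEGIN
  DBMS_APPLY_ADM.STOP_APPLY(apply_name => 'APPLY_ATLAS_T1');
  DBMS_APPLY_ADM.SET_PARAMETER(apply_name => 'APPLY_ATLAS_T1',
                               parameter  => 'parallelism',
                               value      => '1');
  DBMS_APPLY_ADM.START_APPLY(apply_name => 'APPLY_ATLAS_T1');
END;
/
```

Reducing parallelism to 1 serializes the apply servers, which removes the self-blocking described in the report at the cost of apply throughput.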


Attendance: local(Ueda, MariaG, Jamie, Harry, Simone, Alessandro, Julia, David, Miguel, Roberto, Gang, Andrea, Nick) ;remote(Michael, Angela, John, Luca).

Experiments round table:

  • ATLAS - The CASTOR file access problems reported on Monday are solved. Some problems with the RAL GridMap files were reported (see RAL report below). Also reported, and under investigation with CNAF, some instabilities in accessing the LFC (see CNAF report).

  • ALICE (reported by Patricia before the meeting) - Small issues with the latest (pilot) version of the experiment-specific submission module are causing some inefficiencies in the submission procedure through one of the ALICE VOBOXES at CERN. This was expected, since the new module is still in its testing phase. The current ALICE VOBOXES will be out of warranty at the end of the year; the experiment has asked for 4 new VOBOXES and the support team will cover the configuration requirements with the FIO responsible this week.

  • LHCb reports - Reprocessing will be re-established this afternoon, with tests on all Tier1s with small job samples. MC09 is proceeding smoothly (with 8k simultaneous jobs) plus the usual 1K for analysis. A GGUS ticket with CNAF for the wrong BDII publication is still open, while the problem at PIC with DST transfers is solved.

Sites / Services round table:

  • BNL (Michael): The HPSS upgrade went well, although it took 4 hours more than announced: the format conversion took longer than expected and there was a small hardware fault. Currently experiencing some timeouts when staging out of an SE (a massive deletion is currently in progress, with some DB locking issues in the scheduling).

  • RAL (John): Two disk server failures overnight caused some file transfer problems; the servers are now back in production. Also, an issue with the GridMap files is still under investigation, as they seem to still be seen by ATLAS. To be checked.

  • FZK (Angela): Still investigating the NFS problem reported yesterday.

  • CNAF (Luca): Ongoing investigations of the LFC problem that occurred over the weekend. Two one-minute network glitches today between midday and 2pm might explain the LFC instability reported by ATLAS.

  • ASGC (Gang): CMS have cleaned out their jobs and a CE upgrade is being scheduled.

  • Services:

  • FIO Ops: Successful upgrade of CASTORPUBLIC (the intervention finished on time). Some clarification is required for the availability calculations from the SAM tests for the LHC experiments (which are not affected by CASTORPUBLIC, whereas the OPS VO is). The tape service upgrade is ongoing.
  • Databases: nothing to report.



Attendance: local( );remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:



Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:


-- JamieShiers - 10 Jul 2009

Topic revision: r13 - 2009-07-15 - MariaGirone