Week of 090907

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:



Attendance: local(Jamie, Steve, Gavin, Andrew, Edoardo, Stephane, Olof, Jean-Philippe, Patricia, MariaD, Roberto, Gang, MariaG, Dirk(chair));remote(Gonzalo, Ron, Gareth, Angela).

Experiments round table:

  • ATLAS - (Stephane) A disk server RAID rebuild at RAL over the weekend affected ATLAS. Q: why are such problems more frequent at RAL than at other sites? Gareth: when an automatic RAID rebuild runs into trouble (e.g. a hot spare problem, or the controller failing to initiate the rebuild), disk servers have to be taken out of production. What do other sites do? T0 does not systematically take disk servers out of production, but does so in case of rebuild problems. ATLAS experienced problems at Lancaster: the site has migrated to DPM 1.7.2 but has not yet applied the patch for the SRM problems discovered at Glasgow during STEP. Several other sites are also still missing this patch. Brian: will look into this problem based on the ATLAS Elog entry. ATLAS also saw slow T2 transfers from ASGC and is investigating together with site experts.

  • ALICE - (Patricia) Quiet weekend. CREAM CE down at CERN (ticket 51389, BL parser is not alive); it may just need a restart. Gavin: will check. In addition, the WMS at CERN is showing strange behaviour, with the number of scheduled jobs ramping up and down quickly. So far only the falling slope can perhaps be explained, by the ALICE workload dropping from 11k to 5k jobs.

  • LHCb reports - (Roberto) A large number of MC production jobs ran over the weekend, but problems with the WMS at CERN forced LHCb to stop pilot submission to allow the system to catch up (more details in the LHCb twiki report) (running/idle). It is not yet clear whether the cause is local to CERN or rather a wrong number of jobs published by the sites (one case found at Manchester: 8k waiting, while the info system showed 11 jobs). Investigation is ongoing at additional sites. As the system recovered, LHCb has resumed pilot submission. CASTOR outage this morning from 6:10 to 7:45, caused by lack of database space. SIR requested to clarify the DB space monitoring in place.
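
The Manchester case above (8k jobs waiting at the site while the info system showed 11) suggests a simple consistency check between published and actual queue depth. The following is only a minimal Python sketch: the function name, the tolerance and the input dictionaries are hypothetical, and it does not query any real information system or batch system.

```python
def find_mispublished_sites(published, actual, tolerance=0.5):
    """Return sites whose published waiting-job count deviates from the actual one.

    published, actual: dicts mapping site name -> number of waiting jobs.
    tolerance: accepted relative deviation (0.5 accepts values up to 50% off).
    """
    suspects = []
    for site, real in actual.items():
        shown = published.get(site)
        if shown is None:
            suspects.append(site)  # nothing published for this site at all
            continue
        # Relative deviation against the larger value; the 1 guards against 0/0.
        reference = max(real, shown, 1)
        if abs(real - shown) / reference > tolerance:
            suspects.append(site)
    return sorted(suspects)


# Manchester figures are taken from the minutes; "SiteB" is made up.
published = {"Manchester": 11, "SiteB": 400}
actual = {"Manchester": 8000, "SiteB": 420}
print(find_mispublished_sites(published, actual))
```

With these inputs only Manchester is flagged, since SiteB is within the tolerance.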

Sites / Services round table:

  • Angela/FZK: NTR
  • Ron/SARA: New intervention for network expansion next Wed 9 Sept (whole day) - apologies for the late announcement. Further interventions planned for 16 Sept: grid services will move to another switch (short outage), and 22 Sept: mass storage will move to a new switch (short tape outage). Roberto: does the 9 Sept outage also apply to NIKHEF? -> No, just SARA. MariaG: the intervention slot for the DB migration at SARA to new h/w on 11 Sept falls on a Fri, and some of the preconditions have not yet been met. Suggest rescheduling to a later date (to be announced).
  • Gareth/RAL: scheduled outage tomorrow on the 3D cluster - migration to 64-bit Oracle. Wed 9 Sept: SRM endpoint upgrade (2h, expected to be transparent).
  • Gonzalo/PIC: NTR
  • Gang/ASGC: Last Sat inode space was exhausted; fixed on the same day.
  • Gavin/CERN: Scheduled Linux upgrade is ongoing and should finish today. The LSF service will start using a new license server (intervention should be transparent).
  • Edoardo/CERN: maintenance on the LHC backbone; the problem with T0->T1 traffic observed by PIC and RAL is now understood and fixed (https://gus.fzk.de/ws/ticket_info.php?ticket=51180). PIC should confirm that the issue is now resolved.
  • Jan/CERN: CASTOR upgrade on the shared t3 instance (analysis) will take place tomorrow; the upgrade of the experiment stagers is planned for Tue/Wed next week.



Attendance: local(MariaG, Gavin, Gang, Patricia, Roberto, Andrea, Jamie, JPB, MariaD, Diana, Dirk(chair));remote(Onno/SARA, John/RAL, Michael/BNL).

Experiments round table:

  • ATLAS - (Stephane) T2 transfer problems at ASGC are being investigated. Some files at IN2P3 are not accessible: a disk server is down for repair.

  • CMS reports - (Andrea) File migration problem at T0 solved - what caused the problem? Gavin will add a link to the full description to the minutes. After an upgrade, the development instance of PhEDEx showed a missing package (openldap). A ticket has been created and the production PhEDEx upgrade has been delayed.

  • ALICE - (Patricia) CREAM at T0: any news? Gavin: the problem with the BL parser is not yet fully understood, but is being investigated. New ticket for Budapest: vobox not reachable. The problem with the SARA vobox, which lasted several days, is now solved and the vobox is back in production. The T2 in Bologna also has vobox problems - a ticket will be created.

  • LHCb reports - (Roberto) The WMS problem reported yesterday has been traced to a new bug in the gLite WMS (details on the LHCb twiki). Despite the WMS problems LHCb are now running 15k jobs for MC/physics production, but are operating close to a limit. New ticket at CNAF: shared area instability. Short CASTOR DB instability (h/w problem on head nodes) around noon, which affected experiment users.

Sites / Services round table:

  • Onno/SARA: problems with SRM pools (last week and today): when the max. number of open files is reached, dCache switches the pool off. The site has now increased this parameter. WNs: a s/w directory using a GlusterFS mount showed instability and has now been replaced with an NFS mount. Reminder: network maintenance tomorrow.
  • John/RAL: the scheduled DB outage had to be extended to 3pm local time (affects the ATLAS 3D cluster).
  • Michael/BNL: NTR
  • Jos/FZK: LFC hiccup for ATLAS FTS after a DNS misconfiguration - solved now.
  • Gang/ASGC: NTR
  • Gavin/CERN: upgrade of the t3 CASTOR instance is now complete. LSF for the main batch cluster has been reconfigured to use a new license server. Batch users should report any unexpected behaviour. Scheduled OS upgrades are being run now.
  • MariaG/CERN: security patch and recommended patch bundle (PSU) applied in a rolling way to the CMS online DB cluster yesterday and to LHCb offline (this afternoon). Tomorrow the CMS offline and WLCG clusters will follow. Streams replication to FZK is currently paused due to problems with the destination DB. Jos: being dealt with by the site and ATLAS DBAs.
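
The SARA item above (pools switched off once the maximum number of open files is reached) is the kind of limit that can be watched before it is hit. The Python sketch below is illustrative only: it reads the open-file limit of the Python process itself via the standard `resource` module (Unix only), not of a dCache pool, and the helper name, the 80% warning threshold and the sample numbers are hypothetical.

```python
import resource


def open_file_headroom(open_count, soft_limit, warn_fraction=0.8):
    """Classify descriptor usage as 'ok', 'warning' or 'exhausted'."""
    if open_count >= soft_limit:
        return "exhausted"
    if open_count >= warn_fraction * soft_limit:
        return "warning"
    return "ok"


# Limits of the current process (soft limit is enforced, hard limit is the ceiling).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# Made-up usage figure: 900 open files against a limit of 1024.
print(open_file_headroom(open_count=900, soft_limit=1024))
```

How the corresponding limit is configured for a dCache pool is not covered by the minutes and is not shown here.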



Attendance: local(Gang, Roberto, Oliver, Alessandro, Antonio, MariaG, Dirk(chair));remote(John/RAL, Michael/BNL, Angela/FZK, Jos/FZK).

Experiments round table:

  • ATLAS - (Alessandro) SARA, in scheduled downtime, has been removed from DDM. FTS 2.2 is being tested by ATLAS and some problems with delegation have been observed; Gavin has been contacted and is working on the problem. IN2P3: problems with a group of users accessing data, all of whom belong to the same CA (NORDUGRID). Debugging the issue is difficult without using a user certificate.

  • ALICE -

  • LHCb reports - (Roberto) 15k MC physics jobs running concurrently. The WMS at CERN seems to have recovered, despite no change on the experiment side. New tickets: the CERN WMS did not process the status of jobs that finished weeks ago, which raised an alarm on the LHCb side; a WMS bug is suspected and LHCb are in contact with support. IN2P3: open ticket about access problems (disk server problems?); also, in a few cases a TURL could not be returned due to a SAX exception. CNAF: migrated DST space to free space for LHCb - the experiment is validating the operation.

Sites / Services round table:

  • ASGC: NTR - ATLAS would like to thank the site for disabling the T2 queue! Gang: this may only be temporary. ATLAS: please check whether the dedicated resources for T2 will be used.
  • John/RAL: upgraded ALICE and DTEAM SRM to 2.8 (at risk). Upgrades of the LHCb and CMS SRMs are planned for next week.
  • Michael/BNL: NTR
  • Angela/FZK: reboot of voboxes. The planned LHCb space token extension cannot be done this week; the token may run out of space this weekend. Roberto: do not expect problems for LHCb, as FZK is currently not much used.

Release report: deployment status wiki page


  • Next meeting will be on Friday due to local holiday in Geneva


No call today - holiday at CERN (Jeûne genevois)




Attendance: local(Olof, Alessandro, Gavin, Gang, Dirk(chair));remote(Roberto, Gareth, Michael, Brian).

Experiments round table:

  • ATLAS - (Alessandro) RAL down yesterday - good information flow, problem fixed during the afternoon. IN2P3: still some problems with data access, and now other users (not from the NORDUGRID CA) are also affected - investigation ongoing. Milano T2 migrated to StoRM - a few remaining FTS channel and quota problems are being fixed.

  • ALICE - (Patricia, before the meeting) Is there any input regarding GGUS ticket 51389, opened on Monday and related to the bad performance of the CREAM-CE services at CERN? ALICE submission to SL5@CERN is stopped because of this issue. The ticket has not received any input since it was submitted.

  • LHCb reports - (Roberto) Overnight cleanup and ramp-up for MC production this morning. SARA: old MC09 data is reported by SRM as not available due to a dCache problem - support contacted. CNAF: many pilot jobs aborting, investigation ongoing. The LHCb master DST space token, which was used for MC data, is being moved. The LHCb summary of this week's WMS problems is now available on the LHCb twiki.

Sites / Services round table:

  • Gang/ASGC: NTR
  • Gareth/RAL: the outage yesterday was caused by a fault on the SAN used for DB storage. This affected the CASTOR database cluster. After some h/w investigation it remains unclear why the Oracle cluster hung instead of using a different path to access the DB storage (in touch with Oracle support on this problem). Service was resumed in the afternoon.
  • Michael/BNL: NTR
  • Brian/RAL: testing TCP stack tuning to increase the transfer rate between BNL and RAL. Some improvement already achieved, but tests will continue.
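
As background to the TCP tuning item above: on a long-distance path, the throughput of a single TCP stream is bounded by window size divided by round-trip time, which is why buffer sizes matter for transatlantic transfers. The Python sketch below works through that arithmetic. The link speed, RTT and window size are hypothetical values chosen for illustration, not measurements from the BNL-RAL tests.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_seconds):
    """Bytes that must be in flight to keep a path of this speed and RTT full."""
    return bandwidth_bits_per_s * rtt_seconds / 8


def max_throughput_bits_per_s(window_bytes, rtt_seconds):
    """Upper bound on single-stream throughput for a given window and RTT."""
    return window_bytes * 8 / rtt_seconds


# Hypothetical path: 1 Gbit/s link with a 100 ms round-trip time.
link_bits_per_s = 1e9
rtt_s = 0.1

needed_window = bandwidth_delay_product_bytes(link_bits_per_s, rtt_s)
print(f"window needed to fill the link: {needed_window / 1e6:.1f} MB")

# Hypothetical 64 KiB window on the same path.
limit = max_throughput_bits_per_s(64 * 1024, rtt_s)
print(f"throughput bound with a 64 KiB window: {limit / 1e6:.2f} Mbit/s")
```

Under these assumed numbers a small window caps a single stream far below the link speed, which is the usual motivation for raising buffer limits or using parallel streams.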


-- JamieShiers - 2009-09-03

Topic revision: r11 - 2009-09-11 - DirkDuellmann