Week of 090824

WLCG Service Incidents, Interventions and Availability

General Information
Attendance: local(Jamie, Lola, Simone, Gang, Ewan, Edoardo, Patricia, Roberto, Julia, Harry, Andrea, Jean-Philippe, Jacek, Maria, Dirk, Markus, Diana, Olof (chair)); remote(Gonzalo, Michael, Andreas Heiss, John Kelly, Jos, Ronald).

Experiments round table:

  • ATLAS - (Simone) Announcements:
    • NIKHEF - Hoang informed ATLAS that the DC move has completed. Ronald confirmed that the move has completed and that the WNs are being brought back.
    • A new version of the ATLAS site services has been running in the French cloud for a week. It will now be changed to use LFC bulk inserts.
    • Reminder: tomorrow there will be the name change for BNL at 15:00 CEST (09:00 EDT). Sites are reminded to run the script 15 minutes after the start, i.e. at 15:15. The procedure is documented at https://twiki.cern.ch/twiki/bin/view/LCG/FtsProcedureSiteNameChange

  • ALICE - (Patricia) still finding small issues in the ALIEN configuration which prevent production from starting. This is a central configuration fault, not a site problem (many sites have asked). The two WMSes for ALICE use in France are completely dead. For grid33 @ LAL, GGUS ticket 51107 has been submitted. Patricia discovered just before this meeting that the other one was also dead (a GGUS ticket will be submitted). CREAM CE issues at CERN: the ticket for ce201 had to be reopened over the weekend. It seems to be a grid-proxy related problem. POST-MEETING: France has confirmed that both WMS@FRANCE are in scheduled downtime.
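
The timing of the BNL name change announced in the ATLAS report can be cross-checked with a short sketch. It assumes that "tomorrow" is Tuesday 25 August 2009 (inferred from the week heading, not stated explicitly in the minutes) and uses fixed summer-time offsets (CEST = UTC+2, EDT = UTC-4) rather than a time zone database. The names `change_start`, `script_time` and `fmt` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets valid for late August 2009 (assumption: summer time on both sides).
CEST = timezone(timedelta(hours=2), "CEST")
EDT = timezone(timedelta(hours=-4), "EDT")

# Assumed date: Tuesday 25 August 2009, the day after the first meeting of the week.
change_start = datetime(2009, 8, 25, 15, 0, tzinfo=CEST)
# Sites run the FTS script 15 minutes after the start of the change.
script_time = change_start + timedelta(minutes=15)


def fmt(dt, tz):
    """Render a timestamp as HH:MM plus zone name in the given time zone."""
    return dt.astimezone(tz).strftime("%H:%M %Z")


schedule = {
    "change_start": (fmt(change_start, CEST), fmt(change_start, EDT)),
    "run_script": (fmt(script_time, CEST), fmt(script_time, EDT)),
}

for step, (cern_time, bnl_time) in schedule.items():
    print(f"{step}: {cern_time} / {bnl_time}")
```

Under these assumptions the start of the change maps to 09:00 at BNL and the script time to 09:15.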

Sites / Services round table:

  • PIC (Gonzalo) - NTR
  • BNL (Michael) - name change already mentioned
  • FZK (Andreas) - short downtime for the CE pointing to the SL5 cluster. Still migrating the remaining WNs to SL5, which will continue until the end of the week. SE downtime for an SRM upgrade, scheduled for 30 minutes tomorrow morning.
  • RAL (John) - scheduled an 'at risk' intervention on one of the tape robots tomorrow to repair the damage from the leak.
  • ASGC (Gang) - CMS migration started again. 2000 files have been migrated.
  • CERN (Ewan) - CASTORCMS default pool was overloaded this morning.

  • Network: (Edoardo)
      Since 24/8/2009 14:00 CEST, the CERN LHCOPN routers have been announcing the new /16 prefix together with the previous, more specific one.
      The BGP announcement of the old prefix will be stopped only when all the Tier1s have acknowledged that they are receiving and using the new /16.
     The new address space is needed for the new servers that will soon be installed in the CERN computer centre.

     The intervention is documented here: https://gus.fzk.de/pages/ticket_lhcopn_details.php?ticket=50916
  • A new router will be put in production tomorrow, starting at 10am.

  • middleware: (Markus) WMS 3.2 released


  • Maria: Do ATLAS and BNL want to run an alarm ticket test tomorrow late afternoon, after the name change, to make sure that the workflow is still OK? Simone and Michael agree.
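
The LHCOPN prefix change reported under Network relies on longest-prefix matching: while both announcements are present, routers prefer the older, more specific prefix, and traffic moves to the new /16 only once the old announcement is withdrawn. The sketch below illustrates that rule. The two prefixes are hypothetical placeholders chosen from private address space, since the real ones are not given in these minutes, and `best_route` is an illustrative helper, not a router API.

```python
import ipaddress

# Hypothetical prefixes for illustration only; the real ones are not in the minutes.
NEW_PREFIX = ipaddress.ip_network("10.20.0.0/16")    # stands in for the new /16
OLD_PREFIX = ipaddress.ip_network("10.20.128.0/20")  # stands in for the old, more specific prefix


def best_route(address, announced):
    """Return the most specific announced prefix containing the address, or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in announced if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


both = [NEW_PREFIX, OLD_PREFIX]  # transition period: both prefixes announced
only_new = [NEW_PREFIX]          # after the old announcement is stopped

print(best_route("10.20.130.5", both))      # host inside the old prefix
print(best_route("10.20.130.5", only_new))  # same host once only the /16 remains
print(best_route("10.20.1.1", both))        # host only covered by the /16
```

A Tier1 that filters on the old prefix only would lose reachability to new servers placed elsewhere in the /16, which is why the old announcement is kept until every Tier1 has acknowledged the new one.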


Attendance: local(Simone, Gang, Dirk, Jamie, Maria, Harry, Patricia, Roberto, Maria, Julia, Dan, Olof (chair));remote(Gonzalo, Tiju, Daniele, Brian, Andreas).

Experiments round table:

  • ATLAS - (Simone) only concern: one Tier-2 (Milano) is failing all transfers due to SRM errors. FTS 2.2 testing: no problems seen here at CERN, and the checksum functionality was successfully tested yesterday with dCache sites. It still needs to be tested with DPM. In summary, the tests look promising. Roberto: did you test with StoRM? No. Simone thinks we need to foresee a stress test with FTS 2.2 in the first two weeks of September. Brian: does this mean that ATLAS will require a minimum version of DPM 1.7.2? Simone: we still need to verify that it works with DPM. In any case, the ATLAS production system does not require the checksum checks but will use them if the SE supports them.

  • ALICE - (Patricia) the CREAM CE problems at CERN have been fixed. More than 1000 concurrent jobs successfully run so far today. Concerning the issue with WMS in France reported yesterday: the sites were in scheduled downtime so it was normal that they were down. In order to cope with similar situations in the future ALICE can now use two WMS at CERN for submitting to the French CEs.

Sites / Services round table:

  • TRIUMF (Reda/offline): Yesterday we experienced reverse DNS lookup failures affecting all of our Grid services which are in address space (nodes on the lightpaths). The problem affected all sites outside the province of British Columbia. There was a configuration issue with a BCNET server which took some time to resolve after a few iterations with the NOC. This essentially resulted in 0 MB/s traffic to/from TRIUMF for about 7 hours, starting at about 5PM UTC.
  • PIC (Gonzalo) - NTR. Scheduled downtime at PIC as announced. Maria: forwarded a mail to the wlcg-operations list last week concerning the GOCDB problem reported by Gonzalo about not being able to broadcast early announcements for scheduled downtime. As there was no answer, Maria suggests now to follow up the issue with the GOCDB developers. This was agreed by the meeting.
  • RAL (Tiju): NTR
  • ASGC (Gang): the CASTOR operations team found a problem with the 'expert' system, but the CMS migrations have still been stalled for 24 hours. 110 TB of disk space is available, so space is not yet critical.
  • FZK (Andreas): problem with PBS server this morning. Job submission didn't work for a few minutes. The SRM intervention scheduled this morning was successful.
  • CERN (Olof): LFC upgrade


  • Simone: as agreed yesterday ATLAS will send an ALARM and TEAM ticket to BNL this afternoon after the name change.


Attendance: local(Jean-Philippe, Jan, Simone, Oliver, Maria, Gang, Antonio, Harry, Cedric, Roberto, Diana, Julia, Jamie, Olof (chair));remote(Gonzalo, Daniele, Michael, Tiju, Ron, Jos).

Experiments round table:

  • ATLAS - (Simone) BNL name change: situation of FTS (as known) - CERN, PIC, ASGC and IN2P3 applied the script. SARA applied it but the channels still existed; this is being followed up at the site. RAL and TRIUMF do not do a daily BDII update and will therefore apply the script at the next intervention. NDGF and CNAF have not answered but the channels are working. GGUS test: the name has not been updated and is still BNL-ATLAS-T2. The ALARM ticket test will therefore be postponed until tomorrow. Maria: not sure it will be possible by then? Diana: GGUS has not been updated but that can be done tomorrow. However, another issue is that the resource grouping used for publishing sites in the EGEE sense has nothing to do with GGUS. This will be followed up offline. TEAM ticket: several BNL-... names exist, which is confusing and needs to be understood/fixed.

  • CMS reports - (Daniele) MC production ran throughout August: 265M raw events in 26 days at the Tier-2s. Remarkable results according to physicists. Transfers Tier-2 -> Tier-1 ~90TB. Correction to minutes submitted by Daniele after the meeting: T0->Tier-*: 90 TB, T1->Tier-*: 210 TB, T2->Tier-*: 110 TB. Further details in the CMS twiki. Some transfer errors are being investigated. SAM status: Russian sites seem to have suffered some instabilities in August - being followed up with the sites. All problems have been tracked in savannah tickets, which doubled in number during August.

  • ALICE -

  • LHCb reports - (Roberto)
    • The request to reshuffle the CASTORLHCB disk pools has been fulfilled by the CERN CASTOR operations team. One open question, which needs to be confirmed by Philippe: it seems that reducing the lhcbraw space by removing out-of-warranty hardware is sufficient, and it should not be necessary to drain and remove other servers.
    • PIC and NL-T1: the ROOT - dCache client incompatibility problem has been solved by running a specific version of the client packaged with the LHCb software. LHCb would like the sites to install the client natively on the nodes. Being followed up by the sites.

Sites / Services round table:

  • PIC (Gonzalo) - NTR
  • BNL (Michael) - NTR
  • RAL (Tiju) - NTR
  • NL-T1 (Ron) - next Monday the tape backend will be down for the whole day due to a tape robot change. The FTS issue with the name change has been solved.
  • FZK (Jos) - some dCache disk nodes went down due to hardware problems. The data are also on tape, so access should not be affected. TEAM ticket about FTS transfers to Tier-2 sites in the US, which required some effort on the FZK side.
  • ASGC (Gang) - NTR
  • CERN (Jan) -
    • BNL name change went ok on the FTS @ CERN.
    • EGEE broadcast about an emergency kernel update. It will be rolled out tomorrow. Simone: what about the kernel patch for the VOBoxes? Jan: everything will be prepared in CDB, but it is up to the VO contacts to deploy the new kernel on their own machines. A notification will be sent out.

  • Middleware (Antonio): last Monday a new gLite release went to production, with a complete redesign of the WMS service. Feedback from sites participating in the pilot is good. It has the advantage that it can submit to CREAM. The next release for production will be carried out this week or next (tomorrow or next Monday): a yaim core change that impacts all services. The main thing for the WN is the glExec wrapper scripts together with a corresponding fix to the LCAS interface. Harry: what are CERN's plans for the deployment of the new WMS?


  • (MariaDZ) Appended file with email thread with GOCDB developer on broadcast tool behaviour change. Jamie: We need an early notification about changes like that?


Attendance: local(Jean-Philippe, Lola, Jamie, Harry, Maria, Diana, MariaDZ, Gavin, Roberto, Andrea, Simone, Cedric, Gang, Olof (chair));remote(Andreas, Daniele, Brian, Gareth, Gonzalo).

Experiments round table:

  • ATLAS - (Cedric) start ESD processing next week unless any problems found. Testing ALARM ticket to BNL at 5pm today. FTS notification for the BNL name change: still no news from CNAF. The channel seems to work, though.

  • CMS reports - (Daniele) tickets from last 24hrs. An ALARM ticket was opened to Tier-0 yesterday evening because of the queues having been set inactive following the emergency reboot yesterday. This was not expected from the message that had been sent out that said that only public batch queues were closed. Tape migration problems at ASGC. A ticket has been opened and the ASGC operation team is requesting help from CASTOR development team. Gang: the problem has not been solved. CASTOR development has not yet been requested to investigate. More than 40 tickets opened with the Tier-2s: some due to holidays.

  • ALICE -

  • LHCb reports - (Roberto) zero production going on, only a few user jobs for the distributed analysis activity. Follow-up concerning CASTOR: there is no need to reduce the LHCBRAW space. Second follow-up: problem at IN2P3 where disk-resident files cannot be accessed. Thirdly: any news from PIC or SARA about the issue reported yesterday concerning the dCache client installation? Gonzalo: no news but will follow up after the meeting.

Sites / Services round table:

  • ASGC (Gang) - CASTOR tape migration problem reported for CMS above.
  • FZK (Andreas) - network problems affected file transfers from CERN to ftp servers at FZK. Caused by human error, which has been fixed.
    • Correction received from Andreas after the meeting: The network problem caused SAM tests to fail because the routing between the hosts running the tests (e.g. monb002.cern.ch) and FZK was broken. Regular FTS transfers from/to CERN were not affected.
  • RAL (Gareth) - outage this morning for deploying the kernel upgrade. GOCDB announcement for migration of WNs to SL5. Scheduled for week of 14th of September. Bank holidays in UK tomorrow. Roberto: will RAL still support SL4 resources? Yes, there will be some left. Brian: following up ticket with ATLAS for cleaning up connections from monbox@cern after the LFC migration. Simone: not aware of the issue (yet).
    • Correction submitted by Gareth after the meeting: The UK bank holiday is next Monday, and RAL will be closed Monday and Tuesday (i.e. 31st August, 1st September) - not tomorrow.
  • PIC (Gonzalo) - for some days we have seen intermittent failures in some of the CEs at PIC. The Replica Manager test fails when trying to replicate files to the DPM server at CERN. A ping to the DPM server shows high packet loss. GGUS ticket 51180 has been opened and is being followed up by Maarten. Gavin: we see the same problem here at CERN.
  • CERN (Gavin) - we had to stop the LXBATCH service yesterday night. In order to limit the impact we tried to drain the jobs before the reboot. However, this may not have been the desired action and in particular CMS submitted an ALARM ticket for rebooting the Tier-0 production farm late in the evening. A post-mortem analysis is being produced.


  • Gareth: concerning the packet loss reported above - we have also seen some packet loss on the OPN since Tuesday this week. For historical reasons we ping srm-dteam.cern.ch (two nodes behind it). Gavin: this could be because SRM nodes are being rebooted for memory upgrades, although a node would then normally be taken out of the load-balanced alias.
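
The packet loss observed by PIC and RAL can be quantified from ping output with a small helper. This is a minimal sketch that assumes the summary line format printed by common Linux and BSD `ping` implementations. The sample text is a constructed illustration, not a measurement from these sites, and `packet_loss_percent` is a hypothetical helper name.

```python
import re

# Matches "N packets transmitted, M received" (Linux) and
# "N packets transmitted, M packets received" (BSD/macOS).
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def packet_loss_percent(ping_output):
    """Compute the packet loss percentage from the ping summary line."""
    match = SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


# Constructed sample output for illustration only, not real measurements.
sample = """\
--- srm-dteam.cern.ch ping statistics ---
20 packets transmitted, 17 received, 15% packet loss, time 19027ms
"""

print(packet_loss_percent(sample))
```

Because srm-dteam.cern.ch is a load-balanced alias with two nodes behind it, a figure obtained this way mixes both nodes; pinging each node by its own name would be needed to tell a rebooting node from a network problem.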


Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:


-- JamieShiers - 2009-08-24

