  • local: AndreaS, Luca, Alberto, Felix, Maarten
  • remote: Vladimir/LHCb, Sang-Un/KISTI, Rolf/IN2P3-CC, Xavier/KIT, Michael/BNL, Salvatore/CNAF, Onno/NL-T1, Oli/CMS, Christian/NDGF, Tiju/RAL, Rob/OSG, Wahid/ATLAS, Pepe/PIC

Experiments round table:

  • CMS reports -
    • T2_FR_GRIF_IRFU (WLCG name: IFRU): the site is being flagged for SAM test failures although jobs are running fine:
      • One CE, node74.datagrid.cea.fr, simply disappeared from the SSB (GGUS:98341). Can the site support team please follow up? This may be related to the next issue.
      • The CEStatus of the 2nd CE, lpnhe-cream.in2p3.fr, for the CMS queue differs between top-bdii.cern.ch and topbdii.grif.fr (GGUS:98418)
Andrea: inconsistencies in the CERN top BDII have been seen often lately.

Maarten: indeed, some BDII bugs were found and corrected; one remaining issue is not yet understood or fixed, but it is usually remedied by a restart. Unfortunately the main developer is on holiday for 1-2 weeks. CERN is using the latest version, by the way.

  • ALICE -
    • CERN
      • because of persistent low efficiencies and high failure rates, job submission to SLC6 queues was disabled on Thursday last week
      • to be debugged at a smaller scale
      • meanwhile half of the SLC5 jobs will be made to use CVMFS instead of Torrent, to check if the use of CVMFS has anything to do with the matter
    • KIT
      • error rate has been highish due to known issues in CVMFS version on WN
      • looking forward to seeing 2.1.15 deployed...
Andrea: Xavier, when do you plan to update CVMFS?

Xavier: we are currently busy migrating to SL6; the update will be done afterwards.

  • LHCb reports -
    • Main activities are incremental stripping (T0/1) and simulation (T2)
    • T0: FTS3 Downtime
    • T1:
      • SARA: Downtime

Sites / Services round table:

  • ASGC: one DPM disk server has a high failure rate, which impacts ATLAS datadisk; recovery is being attempted.
  • BNL: ntr
  • CNAF: ntr
  • IN2P3-CC: will have a dCache downtime on Nov 12 to install a SHA-2 compliant version.
  • KIT: ntr
  • KISTI: ntr
  • NDGF: ntr
  • NL-T1: about the SARA downtime, cksum verifications found some corrupted files. We needed vendor help to remount a filesystem, but we are still getting some I/O errors, so we will soon migrate the data off it (it will also go out of warranty within a month). The advantage of migrating the files is that dCache will checksum all of them, and we will therefore spot any corrupted files. A list of such files will eventually be circulated.
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr
  • CERN FTS - CERN FTS3 was updated this morning to 3.1.33 as published on ITSSB. It is accepted this should have been published in GOCDB as well. Monitoring at https://fts3.cern.ch:8449/ will be available outside CERN shortly.
  • CERN storage services: we just upgraded CASTORCMS and CASTORALICE. The former is confirmed to be OK, for the latter it would be good if ALICE could test it. [Maarten: I will pass the message]. Tomorrow we'll upgrade CASTORLHCB and CASTORPUBLIC.
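
As an illustration of the checksum verification mentioned in the NL-T1 report, here is a minimal sketch of how a list of corrupted files could be produced by comparing recorded checksums with freshly computed ones. It is not the procedure used at SARA. The checksum type (Adler-32), the function names and the `expected` mapping (a hypothetical dump of path-to-checksum records) are all assumptions made for the example; the minutes only say "cksum".

```python
import zlib
from pathlib import Path


def adler32_of(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as an 8-digit lowercase hex string."""
    value = 1  # Adler-32 running value starts at 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the file in chunks so large files do not need to fit in memory.
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def find_corrupted(expected):
    """Return the sorted paths that are missing or whose checksum has changed.

    `expected` is a hypothetical mapping of file path -> recorded hex checksum,
    standing in for whatever catalogue dump a site actually has.
    """
    corrupted = []
    for path, recorded in expected.items():
        if not Path(path).is_file() or adler32_of(path) != recorded.lower():
            corrupted.append(path)
    return sorted(corrupted)
```

Calling `find_corrupted` with such a mapping returns the paths to report; an empty list means every file still matches its recorded checksum.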




  • local: AndreaS, Maarten, Alessandro, Luca, Alberto
  • remote: Gareth/RAL, Jhen-Wei/ASGC, Burt/FNAL, Sang-Un/KISTI, Kyle/OSG, Michael/BNL, Ronald/NL-T1

Experiments round table:

Jhen-Wei explains that the problem is due to the disk server failure reported last Monday. Experts are trying to recover the data, but in the worst case about one million files and 140 TB would be lost. Waiting for (hopefully) good news from Taiwan. Andrea asks to produce a SIR.

  • ALICE -
    • CERN
      • SLC6 jobs: the efficiency has climbed to very good levels on Tue ~11:20 CET, for reasons unknown
        • the current job mix should be more demanding on I/O, if anything
        • the efficiencies varied between 70 and 90% during the last 24h and were significantly higher than the SLC5 jobs efficiency for a number of hours
      • SLC5 jobs: 1 of the 2 VOBOXes was switched to CVMFS on Tue morning
        • about half the SLC5 jobs were running with CVMFS, the rest with Torrent
        • Torrent jobs could no longer run as of yesterday due to an unexpected side effect of an AliEn update for CVMFS
          • this also affected other sites that had not yet fully migrated to CVMFS (RRC-KI, NIKHEF, ...)
          • Torrent jobs now use the previous version for the time being

Maarten suspects that the job failures and inefficiencies might be related to worker nodes or disk servers at Wigner; collecting more evidence will be necessary. Alessandro suggests that PES could report on the fraction of SLC6 resources at Wigner, at least to have an idea if they could produce a measurable effect on efficiency. All experiments are invited to share experiences, which would help in finding the cause of the problem.

  • LHCb reports -
    • Main activities are incremental stripping (T0/1) and simulation (T2)
    • T0:
    • T1:
      • SARA: Downtime

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • FNAL: installing SLC6 on the worker nodes; it should be completed by the end of November. It took some time because everything had to be puppetized.
  • KIT:
  • KISTI: ntr
  • NL-T1: the problem at SARA mentioned by LHCb is GGUS:98370
  • RAL: there will be a scheduled downtime on Tuesday-Wednesday for work on the power supplies; some services will be taken down (FTS-3 should keep running, apart from a possible short interruption)
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage: the upgrade of the CASTOR stagers went OK; on Monday we will upgrade the nameserver and it should be transparent.


