Week of 160404

  • local: Maria Alandes (Chair and minutes), Andrea Manzi (MW Officer), Cedric Serfon (ATLAS), Maarten Litmaath (ALICE), Jesus (Storage), Stefan Roiser (LHCb)
  • remote: Tommaso Boccali (CMS), Michael Ernst (BNL), Salvatore Tuputti (CNAF), Dave Mason (FNAL), Rolf Rumler (IN2P3), Dimitri (KIT), Ulf Tigerstedt (NDGF), Andrew Pickford (NL-T1), Kyle Gross (OSG), Pepe Flix (PIC), John Kelly (RAL), Di Qing (TRIUMF)

Experiments round table:

  • ATLAS reports -
    • Activities:
      • MC15c (digi+reco reconstruction) running smoothly and taking up 120k slots. Producing approx 100M events per day, which is good.
      • HeavyIon reprocessing under testing, will start soon (this week most probably). It will require MCORE high memory slots.
      • Upgrade studies: running SCORE very-high-memory jobs (more than 4 GB needed), currently running in parallel at very few sites.
      • Analysis as usual
    • VOMS-related issue fixed on Tuesday.
    • Disk full at T1_DATADISK: SARA, BNL. Data will be reshuffled.
    • Consistency checks: sites are reminded to provide regular dumps (at least quarterly; monthly is even better).

  • CMS reports -
    • Spring16DR80 production started, first tranche of events injected
    • Quite a calm week, apart from the authentication problem, which manifested in two ways, as far as I know:
      • ASO not communicating with FTS (March 28th, GGUS:120454); this required restarting the proxy on the ASO services
      • Phedex at sites not communicating with FTS (April 1st, GGUS:120536); this required a broadcast to sites to renew proxies
      • What is still not clear:
        • Why did the problems manifest at different times?
        • Could the second one have been anticipated from the first (and thus avoided)?

  • ALICE -
    • NTR

  • LHCb reports -
    • Activity
      • Very high activity
      • Incremental Run1 Stripping Campaign in full swing
      • Turbo data reprocessing half done
      • MC and User jobs
    • Site Issues
      • T1:
        • FZK-LCG2: Problem uploading job output data to local storage (GGUS:120533)
        • INFN-T1: Problem uploading job output data to local storage (GGUS:120530)

Sites / Services round table:

  • ASGC: NA
  • BNL: NTR
  • CNAF: ATLAS downtime: On Tuesday March 22 a downtime was announced to finalize the ATLAS file system migration as soon as possible, so that the storage would be in production before the start of data taking. Since the procedure was expected to need up to 5 days, the plan was to have 2 days of total shutdown followed by 3 more days with the storage in read-only mode (as was effectively done). After a mail exchange on March 23, CNAF and ATLAS agreed to proceed with the downtime despite the short notice.
  • GridPP: NA
  • IN2P3: From yesterday morning until April 15th the batch capacity is reduced by draining a varying number of workers (at most 20 percent) to allow a minor but annoying bug in the batch system to be corrected. No other effect on user jobs is expected.
  • JINR: NA
  • KIT: NTR
  • NDGF: A downtime that started at the same time as the meeting lasted 4 minutes and went OK.
  • NL-T1:
  • NRC-KI: NA
  • OSG: NTR
  • PIC: NTR
  • RAL: On Wed 30th March at 19:00 BST there was a network link break at RAL. The failover did not work as expected and staff had to attend the site. The break lasted about 2 hours, but the site was in downtime until 10:00 the next day.

  • CERN computing services: As discussed during the meeting, there will be a follow-up with the VOMS sys admins at CERN to understand the details of the problems reported by the experiments and FTS, and to find out why the VOMS certificate renewal was not done sufficiently in advance. The certificate renewal in this case is a bit more complex than usual due to the certificate DN and requires manual intervention. It would be good to know whether a similar situation can be avoided in the future.
  • CERN storage services:
    • FTS issues last week affecting all VOs due to VOMS / gridsite issues (under investigation in GGUS:120463)
      • CMS (ASO), ATLAS and LHCb, which use FTS REST access, got jobs assigned to the wrong VO because the VOMS extensions could not be extracted from the proxies. The issue was fixed by proxy re-delegation plus manual DB changes by the FTS admins.
      • CMS Phedex transfers could not be submitted due to expired proxies not being reported by the VOMS clients. A WLCG broadcast was sent to sites to renew the affected proxies.
  • CERN databases: NA
  • GGUS: NA
  • MW Officer:
    • WLCG broadcast for the issue of expired VOMS ACs not being recognised by VOMS clients. The issue mainly affects Phedex, but also any other service that makes use of long-lasting proxies. All VOMS proxies created before 29th March and still supposed to be valid must be renewed.
  • Security: NA

AOB: Maria reminds everyone of the Ops Coord meeting on Thursday, which will be dedicated to accounting.

