Week of 160404

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:
  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints make it impossible to choose another time slot, the Tier-1 will point out the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
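
The conflict check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used by the SCOD: the `Downtime` record, its field names, the severity labels and the site names `T1_A`, `T1_B`, `T1_C` are all hypothetical, and a real check would read its entries from the downtimes calendar.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Downtime:
    # Hypothetical record; field names are assumptions, not a calendar schema.
    site: str
    severity: str      # e.g. "outage" or "warning"
    start: datetime
    end: datetime
    vos: frozenset     # VOs supported by the site


def overlaps(a, b):
    """True if the two downtimes share any instant in time."""
    return a.start < b.end and b.start < a.end


def find_conflicts(downtimes):
    """Return (site, site, shared VOs) for overlapping "outage" downtimes
    declared by different sites that support at least one common VO."""
    outages = [d for d in downtimes if d.severity == "outage"]
    conflicts = []
    for a, b in combinations(outages, 2):
        shared = a.vos & b.vos
        if a.site != b.site and shared and overlaps(a, b):
            conflicts.append((a.site, b.site, sorted(shared)))
    return conflicts


# Made-up example entries, for illustration only.
calendar = [
    Downtime("T1_A", "outage", datetime(2016, 4, 11, 8), datetime(2016, 4, 11, 18),
             frozenset({"atlas", "cms"})),
    Downtime("T1_B", "outage", datetime(2016, 4, 11, 12), datetime(2016, 4, 12, 12),
             frozenset({"cms"})),
    Downtime("T1_C", "warning", datetime(2016, 4, 11, 9), datetime(2016, 4, 11, 10),
             frozenset({"cms"})),
]
print(find_conflicts(calendar))
```

Only downtimes classified as "outage" are compared, and two outages count as a conflict only when the sites share at least one VO, which mirrors the wording of the recommendation.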

Links to Tier-1 downtimes

ALICE  ATLAS  CMS   LHCB
       BNL    FNAL

Monday

Attendance:

  • local: Maria Alandes (Chair and minutes), Andrea Manzi (MW Officer), Cedric Serfon (ATLAS), Maarten Litmaath (ALICE), Jesus (Storage), Stefan Roiser (LHCb)
  • remote: Tommaso Boccali (CMS), Michael Ernst (BNL), Salvatore Tuputti (CNAF), Dave Mason (FNAL), Rolf Rumler (IN2P3), Dimitri (KIT), Ulf Tigerstedt (NDGF), Andrew Pickford (NL-T1), Kyle Gross (OSG), Pepe Flix (pic), John Kelly (RAL), Di Qing (TRIUMF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Activities:
      • MC15c (digi+reco reconstruction) running smoothly and taking up 120k slots. Producing approx 100M events per day, which is good.
      • HeavyIon reprocessing under testing, will start soon (this week most probably). It will require MCORE high memory slots.
      • Upgrade studies: running SCORE very high memory (more than 4GB needed), currently running in parallel at very few sites.
      • Analysis as usual
    • VOMS-related issue fixed on Tuesday.
    • Disk full at T1_DATADISK : SARA, BNL. Will reshuffle data.
    • Consistency checks: Sites reminded to provide regular dumps (at least quarterly, monthly even better)

  • CMS reports (raw view) -
    • Spring16DR80 production started, first tranche of events injected
    • Quite a calm week, apart from the authentication problem, which manifested in two ways, as far as I know:
      • ASO not communicating with FTS (March 28th, GGUS:120454), needed restarting of proxy on ASO services
      • Phedex @ sites not communicating with FTS (April 1st, GGUS:120536), needed broadcast to sites to renew proxies
      • what is not clear, at least, is:
        • why the problems manifested in different moments?
        • was the second one expected from the first one (and we could have avoided that)?

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Activity
      • Very high activity
      • Incremental Run1 Stripping Campaign in full swing
      • Turbo data reprocessing half done
      • MC and User jobs
    • Site Issues
      • T1:
        • FZK-LCG2: Problem uploading job output data to local storage (GGUS:120533)
        • INFN-T1: Problem uploading job output data to local storage (GGUS:120530)

Sites / Services round table:

  • ASGC: NA
  • BNL: NTR
  • CNAF: ATLAS downtime: On Tuesday March 22 a downtime was announced to finalize the ATLAS fs migration, to be done as soon as possible so that the storage is in production before the start of data taking. Since the procedure was expected to take up to 5 days, 2 days of total shutdown were planned, followed by 3 more days with the storage in read-only mode (as was effectively done). After a mail exchange on March 23, CNAF and ATLAS agreed to proceed with the downtime despite the short notice.
  • FNAL: NTR
  • GridPP: NA
  • IN2P3: From yesterday morning until April 15th the batch capacity is reduced by draining a varying number of workers, at most 20 percent, to allow a minor but annoying bug in the batch system to be corrected. No other effect on user jobs is expected.
  • JINR: NA
  • KISTI: NA
  • KIT: NTR
  • NDGF: The downtime, which started at the same time as the meeting, lasted 4 min and went OK.
  • NL-T1:
  • NRC-KI: NA
  • OSG: NTR
  • PIC: NTR
  • RAL: On Wed 30th March at 19:00 BST there was a network link break at RAL. The failover did not work as expected and staff had to attend site. The break lasted about 2 hours, but the site was in downtime until 10:00 the next day.
  • TRIUMF: NTR

  • CERN computing services: As discussed during the meeting, there will be a follow-up with the VOMS sys admins at CERN to understand the details of the problems reported by the experiments and FTS, and also to find out why the VOMS certificate renewal was not done sufficiently in advance. The certificate renewal in this case is a bit more complex than usual due to the certificate DN and requires manual intervention. It would be good to know whether a similar situation can be avoided in the future.
  • CERN storage services:
    • FTS issues last week affecting all VOs due to a VOMS / gridsite issue (under investigation in GGUS:120463)
      • CMS (ASO), ATLAS and LHCb, using FTS REST access, got jobs assigned to the wrong VO because the VOMS extensions could not be extracted from proxies. Issue fixed with proxy re-delegation + manual DB changes by the FTS admins
      • CMS Phedex transfers could not be submitted due to expired proxies not reported by the VOMS clients. WLCG broadcast sent to sites to renew the affected proxies
  • CERN databases: NA
  • GGUS: NA
  • MW Officer:
    • WLCG broadcast sent for the issue related to VOMS and expired VOMS ACs not recognised by VOMS clients. The issue mainly affects Phedex, but also any other services that use long-lived proxies. All VOMS proxies created before 29th March and still supposed to be valid must be renewed.
  • Security: NA
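
The renewal rule from the MW Officer broadcast (proxies created before 29th March and still supposed to be valid must be renewed) can be written as a simple date check. This is a minimal sketch under assumptions: the function name and its arguments are hypothetical, the year 2016 is taken from the revision date of this page, and the cutoff is assumed to be the start of 29 March in UTC. It does not query any VOMS client; the creation and expiry times have to be supplied by the caller.

```python
from datetime import datetime, timezone

# Assumed cutoff: start of 29 March 2016 (UTC). The year is inferred from the
# page revision date, and the exact time of day is an assumption.
CUTOFF = datetime(2016, 3, 29, tzinfo=timezone.utc)


def needs_renewal(created, expires, now):
    """True if the proxy was created before the cutoff and has not yet expired."""
    return created < CUTOFF and expires > now


# Hypothetical example: a long-lived proxy created on 20 March, valid for a month.
utc = timezone.utc
print(needs_renewal(datetime(2016, 3, 20, tzinfo=utc),
                    datetime(2016, 4, 20, tzinfo=utc),
                    now=datetime(2016, 4, 4, tzinfo=utc)))
```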

AOB: Maria reminds participants of the Ops Coord meeting on Thursday, which will be dedicated to accounting.

Topic revision: r12 - 2016-04-05 - MaartenLitmaath