Week of 160118

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota.
  • General information about the WLCG Service can be accessed from the Operations Web.
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:
  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints prevent choosing another time slot, the Tier-1 will point out the existence of the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
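
The conflict check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Downtime record, the site names and the functions find_conflicts and scod_window are hypothetical and do not correspond to any GOCDB or downtimes-calendar API. Downtimes are assumed to be already available as simple records, and the SCOD window is assumed to run from the Monday of the current week to the end of the second following week.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from itertools import combinations


@dataclass(frozen=True)
class Downtime:
    """Hypothetical record for one declared downtime (not a real API type)."""
    site: str
    vos: frozenset      # VOs supported by the site, e.g. frozenset({"CMS"})
    severity: str       # "outage" or any other classification
    start: datetime
    end: datetime


def overlaps(a, b):
    """True if the two downtimes share at least one instant."""
    return a.start < b.end and b.start < a.end


def scod_window(today):
    """Assumed SCOD check window: current week plus the following two weeks."""
    monday = today - timedelta(days=today.weekday())
    start = datetime(monday.year, monday.month, monday.day)
    return start, start + timedelta(weeks=3)


def find_conflicts(downtimes, window_start, window_end):
    """Return (site_a, site_b, shared_vos) for overlapping "outage" downtimes
    declared by different sites that support at least one common VO."""
    outages = [
        d for d in downtimes
        if d.severity == "outage"
        and d.start < window_end and window_start < d.end
    ]
    conflicts = []
    for a, b in combinations(outages, 2):
        shared = a.vos & b.vos
        if a.site != b.site and shared and overlaps(a, b):
            conflicts.append((a.site, b.site, sorted(shared)))
    return conflicts
```

A site planning an intervention could add its candidate slot to the list of known downtimes and call find_conflicts; an empty result means step 1 of the procedure found no clash. The SCOD check is the same call with the window returned by scod_window for the current date.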

Links to Tier-1 downtimes

ALICE   ATLAS   CMS     LHCB
        BNL     FNAL

Monday

Attendance:

  • local: Maria Alandes (chair&minutes), Maarten Litmaath (ALICE), Alessandro di Girolamo (ATLAS), Julia Andreeva, Ben Jones (Batch&Grid), Jan Iven (Storage), Raja Nandakumar (LHCb), David Collados (DB)
  • remote: Christoph Wissing (CMS), Rolf Rumler (IN2P3), Sang Un Ahn (KISTI), Ulf Tigerstedt (NDGF), Onno Zweers (NL-T1), John Kelly (RAL), Di Qing (TRIUMF)

Experiments round table:

  • ATLAS reports -
    • All smooth in general. Minor internal details:
      • The transfer rate was decreasing to almost zero until midnight last night, when it jumped up. To be investigated with the Rucio team.
      • Derivation jobs are failing; they are testing new software, and we expect to hear from them tomorrow.
      • Some logs are not accessible on pandamon; sometimes access works over http but not over https. Reported to Sergei.
    • Reminder: the ATLAS Sites Jamboree https://indico.cern.ch/event/440821/ takes place from Wednesday 27 January to Friday 29 January.

  • CMS reports -
    • CERN/Tier-0
      • Backlog of Heavy Ion PromptRECO (actually not so prompt...) basically cured
    • Distributed Sites
      • High production/processing activity at the various Grid sites
    • No major issues

  • ALICE -
    • high activity

  • LHCb reports -
    • Data Processing
      • Mostly MC and User jobs
    • Site Issues
      • T0: NTR
      • T1:
        • Strange transfer failure patterns at SARA in the middle of last week, likely due to SRM issues. Being followed up internally.
        • Problems transferring to RRCKI-ARCHIVE (http://lblogbook.cern.ch/Operations/22905) - likely server misconfiguration. Admins being contacted internally.
    • Miscellaneous
      • One problematic user running multi-core jobs on the grid - being contacted
      • About to start stripping 24 to reprocess 2015 data. Applications are being prepared; we will start with a small validation before turning on everything. Staging of some of the data was already done in December (copied from tape to buffer).

Sites / Services round table:

  • ASGC: NA
  • BNL: NA
  • CNAF: NA
  • FNAL: NA
  • GridPP: NA
  • IN2P3: NTR
  • JINR: NA
  • KISTI: NTR
  • KIT: NA
  • NDGF: Several patches for the OS, dCache and firmware are scheduled for tomorrow. Short 10-minute downtime. More details in GOCDB.
  • NL-T1:
    • Last week we had a user who started srmput operations but did not cancel pending operations when the associated gridftp had failed. This led to a buildup of turls (srmput reservations), causing congestion in dCache, affecting other users. The user will fix his workflow.
    • We're still seeing "space manager timeouts", so we continue investigating.
  • NRC-KI: NA
  • OSG: NA
  • PIC: NA
  • RAL: NTR
  • TRIUMF: NTR

  • CERN batch and grid services: OS upgrade of the MyProxy service scheduled for Friday. It should be transparent.
  • CERN storage services: The CASTOR DB backend is being patched: the instances for all LHC VOs today, and the Nameserver and Public instances on Thursday. 5-minute intervention. It should be almost transparent and should not affect ongoing transfers. In very few cases, clients may get disconnected.
  • Databases: NTR
  • GGUS: NA
  • Grid Monitoring: NA
  • MW Officer: NA

AOB:

Thursday

Attendance:

  • local: Maria Alandes (chair&minutes), Maarten Litmaath (ALICE), Ben Jones (Batch&Grid), Jan Iven (Storage), Raja Nandakumar (LHCb), David Collados (DB), Ulf Tigerstedt (NDGF)
  • remote: Christoph Wissing (CMS), Rolf Rumler (IN2P3), Sang Un Ahn (KISTI), Andrew Pickford (NL-T1), Gareth Smith (RAL), Di Qing (TRIUMF), Pepe Flix (PIC), Kyle Gross (OSG), Salvatore Tupputi (CNAF), Felix Lee (ASGC), Michael Ernst (BNL)

Experiments round table:

  • ATLAS reports -
    • Nothing special to report. Running jobs are on the order of 200k.

  • ALICE -
    • low activity since yesterday morning

  • LHCb reports -
    • Data Processing
      • Mostly MC and User jobs
    • Site Issues
      • T0: NTR
      • T1:
    • Miscellaneous
      • About to start stripping 24 to reprocess 2015 data.
      • Pre-staging of turbo data for re-stripping.

Sites / Services round table:

  • ASGC: Scheduled downtime for maintenance of the network link to Amsterdam. Whole site will be unavailable on 29th January from 9am to 6pm.
  • BNL: NTR
  • CNAF:
    • The problem reported by LHCb should now be fixed. It was due to a tape drive that was stuck.
    • MJF setup now ready at CNAF. Ongoing final verification.
  • FNAL: NA
  • GridPP: NA
  • IN2P3: NTR
  • JINR: NA
  • KISTI: NTR
  • KIT: NA
  • NDGF:
    • Problems with the firmware update this week. New intervention scheduled to fix the problems. It affected mainly ALICE files.
    • ALICE disks almost full. More space in one month. In the meantime, ALICE requested to delete files.
  • NL-T1: NTR
  • NRC-KI: NA
  • OSG: Kyle asks whether any particular input from OSG is needed at the WLCG workshop. Maria replies that the agenda is now defined and that, although no specific presentation is to be made by OSG or EGI, their input in all the discussions is very important. Maria explains that the current chairmen are working on their sessions, contacting the speakers and gathering feedback from existing fora such as TFs or WGs. OSG representatives in those fora already have the chance to comment on the proposed topics and discussion material. For instance, Rob Quick is involved in the Information System TF.
  • PIC: NTR
  • RAL: NTR
  • TRIUMF: NTR

  • CERN batch and grid services:
    • Raja asks about the current status of the slow kernel issues. Ben explains that it has been decided to switch on EPT. 4 cells are done this week, but it will take until the end of February to reconfigure all machines. The penalty seems to be below 5% after the configuration changes. Ben will provide more concrete numbers on the machines already reconfigured (in the next 3 bullets).
    • CERN has currently upgraded 700 of the 2000 hypervisors requiring the "kilo1" settings, which address both the "virtualisation overhead" and the extremely slow performance seen on some worker nodes, particularly by LHCb, due to EPT settings. The remainder will be upgraded incrementally on a weekly basis by the end of February.
    • Older hardware that cannot profit from NUMA settings will be addressed only by switching EPT on. This concerns 1200 hypervisors, which will be upgraded on a weekly basis to finish by the end of February.
    • Details of the "virtualisation overhead" as well as the effects due to "EPT" can be seen in this HEPiX presentation.
  • CERN storage services:
    • CASTOR Oracle DB upgrades still ongoing without major problems.
    • EOSALICE is full. Some data needs to be removed.
    • Feedback on the impact of the AFS outage from Monday night to Tuesday morning would be very useful to the Storage team to understand if there are any critical dependencies on AFS coming from the experiments. The idea is to reduce the usage of AFS at CERN, so a list of use cases is being collected. The last outage is a good opportunity to learn about unknown dependencies (again for critical workflows).
  • Databases:
    • The ATLAS ADG and ALICE ADG databases were rebooted this morning due to the scheduled intervention "Extension of storage network on RAC51". Service incident ticket: OTG0028008.
    • The LHCb offline (LHCBR) and PDBR databases were patched for the OS, CRS and RDBMS yesterday, Wednesday Jan 20th. More information in SSB entry OTG0027608.
    • The CMSONR database was migrated to new hardware yesterday, Wednesday Jan 20th. More information in SSB entry OTG0027890.
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Topic attachments
  • MB-Jan-16.pptx (r1, 2857.4 K, 2016-01-18 15:57, PabloSaiz)
Topic revision: r15 - 2016-01-22 - BeJones
 