Week of 160215

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints do not allow another time slot to be chosen, the Tier-1 will point out the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
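The conflict check described above can be sketched as follows. This is only a minimal illustration, not an existing WLCG or GOCDB tool: the `Downtime` record, the `find_outage_conflicts` function, the site names and the calendar entries are all hypothetical. It assumes that two downtimes conflict when both are classified as "outage", belong to different Tier-1 sites, share at least one VO, and overlap in time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Downtime:
    """Hypothetical downtime record; not the GOCDB data model."""
    site: str
    vos: frozenset      # VOs supported by the site, e.g. {"atlas", "cms"}
    severity: str       # "outage" or "at risk"
    start: datetime
    end: datetime


def find_outage_conflicts(candidate, declared):
    """Return declared outages at other sites that clash with the candidate."""
    if candidate.severity != "outage":
        return []       # only "outage" downtimes are covered by the procedure
    conflicts = []
    for other in declared:
        if other.severity != "outage" or other.site == candidate.site:
            continue
        shares_vo = bool(candidate.vos & other.vos)
        # Two intervals overlap when each one starts before the other ends.
        overlaps = candidate.start < other.end and other.start < candidate.end
        if shares_vo and overlaps:
            conflicts.append(other)
    return conflicts


# Made-up calendar entries, for illustration only.
declared = [
    Downtime("T1_A", frozenset({"atlas", "cms"}), "outage",
             datetime(2016, 3, 15, 6), datetime(2016, 3, 15, 20)),
    Downtime("T1_B", frozenset({"alice"}), "outage",
             datetime(2016, 3, 15, 8), datetime(2016, 3, 15, 16)),
    Downtime("T1_C", frozenset({"atlas"}), "at risk",
             datetime(2016, 3, 15, 10), datetime(2016, 3, 15, 14)),
]
candidate = Downtime("T1_D", frozenset({"atlas", "lhcb"}), "outage",
                     datetime(2016, 3, 15, 12), datetime(2016, 3, 15, 18))

conflicts = find_outage_conflicts(candidate, declared)
print([d.site for d in conflicts])  # prints ['T1_A']
```

In this sketch only T1_A is reported: T1_B shares no VO with the candidate and the T1_C downtime is "at risk" rather than "outage". A non-empty result corresponds to step 2 (choose another slot) or, failing that, step 3 (raise the conflict with the SCOD).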

Links to Tier-1 downtimes

ALICE   ATLAS   CMS     LHCB
        BNL     FNAL

Monday

Attendance:

  • local: Maria D (SCOD), Bo (CMS), Miguel (batch + grid), Julia (WLCG), Maarten (ALICE), Maria A (WLCG), Hervé (Storage).
  • remote: Pepe (PIC), David C (ATLAS), David M (FNAL), Ulf (NDGF), Francesco (CNAF), Kyle (OSG), Onno (NL_T1), Rolf (IN2P3), John K (RAL), Di (Triumf).

Experiments round table:

During the PIC report (further down in this twiki), David C commented that ATLAS is aware of the unrecoverable ATLAS data on Oracle T10000D tapes at the site. This is the 4th time in the last 2 weeks that ATLAS has lost data on tape. Nevertheless, these data are not critical.

  • CMS reports -
    • CMS mid-week global run occurred 10 Feb-12 Feb last week
      • T0 read events from all runs successfully
      • Next such run is next week (24 Feb-26 Feb)
    • Operations continue to be busy (~138K running jobs on average last week, peak of 155K)

About the ARGUS bug affecting CMS SAM tests, Maarten provided the following report:

  • big debugging exercise Fri late afternoon, evening, Sat and Sun (GGUS:118701 and GGUS:117125)
  • breakthrough in understanding the recurrent CMS SAM gLExec failures
  • the service does not honor a crucial configuration option
  • 2 bugs were identified as well
  • Argus devs are involved

Such investigations would go faster in future if Maarten, an expert member of the ARGUS collaboration, had privileged access to the ARGUS servers. Can this be arranged by the CERN service managers?

  • ALICE -
    • NTR

Sites / Services round table:

  • ASGC: Not connected.
  • BNL: Not connected.
  • CNAF: Reminder of the already announced "at risk" on 18 Feb: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=19761
  • FNAL: NTR
  • GridPP: Not connected.
  • IN2P3: Planned all-day outage on 15/3. Most services will be stopped.
  • JINR: Not connected.
  • KISTI: Sang Un, now travelling for a conference, emailed: NTR
  • KIT: Not connected.
  • NDGF: Downtime this Wednesday at 12:00 CET for dCache and kernel updates. One of the NDGF sites will be down this Saturday - ALICE operations will be affected. All in GOCDB.
  • NL-T1: Onno further investigated the question of tape robustness, following the discussion at WLCGDailyMeetingsWeek160208#Monday about the SARA move to a new data centre in October. It seems that no tape damage was ever found, even though the robot dropped tapes by mistake on multiple occasions. There are no such 'statistics' about disks.
  • NRC-KI:
  • OSG:
  • PIC: Oracle T10000D tape troubles. 'T10KD' technology has been deployed at PIC since Q3 2015, with ATLAS enabled to write to it. On Dec. 17th PIC detected a problematic file while reading from a 'T10KD' tape (internal migration) and decided to stop ATLAS writing to 'T10KD'. 80 'T10KD' cartridges were read completely; 8 tapes had files that were problematic to read. Data was migrated from these problematic tapes using dedicated data-recovery firmware from Oracle. In the end only 20 files could not (yet) be recovered; only half were unique copies, and none of them were critical (confirmed by ATLAS). A lot of time and effort is being spent on understanding the cause of this problem (tapes have been sent to Oracle USA; there may be a problematic drive, which is under investigation). PIC will provide a SIR to help others who might face similar problems.
  • RAL: One disk server intervention is taking place now. The server is expected to be patched and back in production tomorrow at the latest.
  • TRIUMF: NTR

  • CERN batch and grid services: The ARGUS event, detailed under the CMS report above.
  • CERN storage services: Nothing to report
  • Databases: not present.
  • GGUS:
  • Grid Monitoring:
    • The draft availability reports for January are ready
  • MW Officer:

AOB: Many thanks to the T0 and T1s for their expert and complete input on their batch systems' configuration. The BatchSystemsConfig twiki is now complete and a recommendation has been produced for tomorrow's MB. Maria D.

Thursday

Attendance:

  • local: Maria D (SCOD), Miguel (batch + grid), Maria A (WLCG), Hervé (Storage), Kate (DB), David C (ATLAS).
  • remote: Pepe (PIC), David M (FNAL), Ulf (NDGF), Kyle (OSG), Andrew P (NL_T1), Rolf (IN2P3), John K (RAL), Di (Triumf), Michael E (BNL), Daniele B (CMS), Christoph W (CMS), Sang Un (KISTI), Vladimir R (LHCb).

Experiments round table:

  • ATLAS reports -
    • CERN castor checksum problems recovering old RAW data after tape losses: GGUS:119568
    • Taiwan CVMFS Stratum 1 is persistently "bad" in monitoring GGUS:119557
    • FTS deployment of 3.4.x version requested to sites: to be discussed during WLCG Ops coord.
    • Amazon EC2 test with BNL in one week from now. Up to 100k cores available for ATLAS simulation for a few days.
      • Plan is to ramp up during next week Monday/Tuesday and to go up to 100k (plus all the standard 200+k) Wednesday/Thursday.
    • Reprocessing has been ongoing for 10 days. Fairshare inside Panda is being tuned to speed it up to the max; it is expected to finish in 1-2 weeks.

On the storage issue, Hervé said that it had been hoped to be fixed since last year. Discussions at the WLCG Ops Coord (after this meeting) revealed that the FTS 3.4.2 campaign should wait a bit more than a week, until Readiness verification is completed on the CERN pilot. Michael E briefly described the Amazon EC2 set-up at BNL, which will double or triple the resources of the T1 and several T2s. Miguel asked Michael E to share their technical report, with the "Event Server" definition, the conditions for terminating the VM, etc.

  • CMS reports -
    • Good CPU utilization over the last days (incl. HLT and Openstack resources usually used for Tier-0)
    • Network problem between Europe and America
      • Started in the night from Monday to Tuesday (15th to 16th)
      • Fixed on Tuesday
      • Further details: GGUS:119551

  • ALICE -
    • NTR

  • LHCb reports -
    • Data Processing
      • Stripping 24 is almost at 100%, the merging productions are following slowly. Validation of stripping for Turbo and TurCal
      • MC and User jobs
    • Site Issues
      • T0:
      • T1:

Sites / Services round table:

This list contains per-site information on the status of the patches for the critical vulnerability announced yesterday. The issue was discussed at the Ops Coord meeting. Some sites were contacted directly to complete the information. As it is late for Asian sites, not all status info is known at this point.

  • ASGC: not connected. Will patch this week.
  • BNL: nothing to add. Already completed the patches. Done in a rolling fashion as they have many service instances and can reboot transparently.
  • CNAF: Sent by Matteo by email: nothing to report apart from the ongoing intervention on the electric line. Maria D asked about the patches by email.
  • FNAL: network problems experienced during the last 2 days are now gone. Still operating on the secondary link.
  • GridPP: not connected
  • IN2P3: ntr. Vulnerability patches installed. Services restarted.
  • JINR: not connected
  • KISTI: Plan to update glibc starting on Monday (22nd) at 00:00 UTC (09:00 KST) and to finish by Tuesday (23rd) at 00:00 UTC (09:00 KST). Will try to disturb the service as little as possible.
  • KIT: not connected. Emergency downtime announcement in GOCDB: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=19931 for the critical vulnerability announced yesterday.
  • NDGF: Yesterday's intervention, announced last Monday, went well. It took longer than planned, but services were still available again on time. Pending downtime for one of the NDGF sites this Saturday - ALICE operations will be affected. All in GOCDB.
  • NL-T1: ntr
  • NRC-KI: not connected
  • OSG: ntr
  • PIC: Will reboot its entire farm, as a result of the security incident with the glibc library. It will be done in two steps, between tomorrow Friday the 19th and Monday 22nd of February, proceeding first with one half of the farm, then the other half. Worker nodes will be put to drain 24 hours in advance to minimize the impact on running tasks. It will however still affect some running jobs. ATLAS, CMS and LHCb collaborations have been informed.
  • RAL: ntr. They will patch for the vulnerability early next week.
  • TRIUMF: The network problem reported by FNAL impacted Triumf as well. It is fixed now. The vulnerability patches have been applied.

  • CERN batch and grid services:
    • The VOMS service hosted on voms2.cern.ch and lcg-voms2.cern.ch will be down on 29 Feb at 13:30 for approximately 1 hour due to a software upgrade. SSB
  • CERN storage services: nta
  • Databases: Rolling upgrade of the experiment databases during the next 2 weeks
  • GGUS: ntr
  • Grid Monitoring: not present. no report.
  • MW Officer: full report at the Ops Coord MW news.

AOB:

  • Operations Coordination Meetings have been reorganised as of 1st March. See MB slides presented this week:
    • 3PM meetings once a week on Mondays
    • Written reports from sites and central services are requested. Either edit the twiki yourselves or send a mail to wlcg-scod@cern.ch and the SCOD will include the report in the twiki before the meeting.
    • We encourage sites to declare downtimes in GOCDB and central services to use SSB for any service changes or interventions. Do not hesitate to include a reference to these items in your reports if they are already defined before the meeting.
  • Meeting on the 7th of March clashes with GDB. Is it OK to keep the date?

The wlcg-scod@cern.ch posting permission problem reported by Ulf is now fixed. Kate was added to the SCOD e-group and will participate in the rota. Dziękuję Kate!

Topic attachments

  • GGUS-for-MB-Feb-16.pptx (2845.5 K, 2016-02-15 18:08, MariaDimou): GGUS slide for the 16/2 WLCG MB
Topic revision: r20 - 2016-02-19 - MariaDimou