Week of 150216

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Maria D. (SCOD), Maarten (ALICE), Nacho (CERN Grid Services), Alessandro (ATLAS), Herve' (CERN Storage).
  • remote: Hung-Te Lee (ASGC), Lisa Giacchetti (FNAL), Onno Zweers (NL_T1), Sanh Un Ahn (KISTI), Tiju Idiculla (RAL), Michael Ernst (BNL), Rolf Rumler (IN2P3), Christoph Wissing (CMS), Dimitri (KIT), Alexei (LHCb), Dmytro (NDGF), Rob Quick (OSG).

Experiments round table:

  • ATLAS reports ( raw view) -
    • Central Services/T0/T1
      • nothing special to report
Alessandro added that ATLAS running has been stable since last week, with 130K job slots in parallel and increasing. They are monitoring the issues in their Data Management system.

  • CMS reports ( raw view) -
    • CRUZET (Field-off cosmics)
      • Night from Fri to Sat: Problems reading CMS streamer files from CASTOR, GGUS:111754
        • Fixed Saturday morning. Herve' said this was apparently a network problem that resolved itself. The cause remains unclear.
    • Production
      • MC generation for Run2 (mainly at T2s)
      • Digi-Reco of Upgrade MC

  • ALICE -
    • CERN: ALARM ticket (GGUS:111753) Fri evening when all the old gLite VOBOXes were found to suffer SSL errors on job submission.
      • The trouble may have been due to a Java update on the CEs on Thu/Fri:
        • SSLv3 got disabled by default.
      • Fortunately the single new WLCG VOBOX was able to take on the whole workload.
      • Its companion WLCG VOBOXes should enter production in the coming days (earlier than foreseen).
        • The old hosts then can finally be retired.
Nacho commented that the CEs started rebooting on Fri. Maria said that, if the Java update was part of an automatic procedure and the process entails a CE reboot, measures should be taken to prevent anything 'automatic' from happening on Fridays, to avoid weekend surprises.

  • LHCb reports ( raw view) -
    • Still finalising last few file statuses of Stripping21 campaign. User jobs and MC at the moment.
    • T0: NTR
    • T1: NTR

Sites / Services round table:

  • ASGC: This week there is a public holiday from Wednesday to next Monday Feb 23rd to celebrate the Chinese New Year (Year of the Goat). All services will be available on a best-effort basis.
  • BNL: ntr
  • CNAF: not connected
  • FNAL: The wrong person still seems to be the one who is paged in case of a GGUS ALARM. Rob will remind Soichi to put the right info in the source that GGUS uses for the "Emergency email" of the T0/T1s. Maria said this source is GOCDB and OIM, but Lisa said this is not true. The ticket where the valid info is recorded is JIRA:1385.
  • GridPP: not connected
  • IN2P3: ntr
  • JINR: not connected
  • KISTI: ntr. Not possible to connect next Thu due to a public holiday.
  • KIT: a tape library was found broken and is being repaired. All experiments are affected but data hosted there are old.
  • NDGF: ntr
  • NL-T1: Last Tue an OS update and a minor dCache upgrade led to a nice temperature drop of 3-4 C in the Computer Centre. There were issues with the gridftp SRM client due to SSLv3, which is no longer accepted by the site. No HEP users complained. With the environment variable GLOBUS_GSSAPI_FORCE_TLS=1 set, globus-url-copy will use TLS instead of SSLv3.
  • NRC-KI: not connected. Alessandro said there is a thread open since this morning concerning pledged resources at the site, so it would be practical if they were connected to discuss the issue 'live'.
  • OSG: ntr
  • PIC: not connected
  • RAL: ntr
  • TRIUMF: Planned site outage this Thu 1am CET to finalise their Switchgear work.
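
The NL-T1 workaround above can be sketched in a few lines of Python. This is only an illustration: the helper names tls_forced_env and build_copy_command and the example URLs are hypothetical, not taken from the minutes. Only the variable GLOBUS_GSSAPI_FORCE_TLS=1 and the globus-url-copy client come from the NL-T1 report.

```python
import os
import shlex


def tls_forced_env(base_env=None):
    """Return a copy of the environment with GLOBUS_GSSAPI_FORCE_TLS=1 set.

    The variable name and value are those reported by NL-T1; the helper
    itself is a hypothetical convenience wrapper.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GLOBUS_GSSAPI_FORCE_TLS"] = "1"
    return env


def build_copy_command(source_url, dest_url):
    """Build a plain 'globus-url-copy SOURCE DEST' argument list."""
    return ["globus-url-copy", source_url, dest_url]


if __name__ == "__main__":
    # Placeholder URLs, for illustration only.
    cmd = build_copy_command(
        "gsiftp://se.example.org/path/to/file", "file:///tmp/file"
    )
    env = tls_forced_env()
    # Show what would be run; nothing is executed here.
    print("GLOBUS_GSSAPI_FORCE_TLS=" + env["GLOBUS_GSSAPI_FORCE_TLS"],
          shlex.join(cmd))
    # To actually run the transfer on a host with the client installed:
    #   import subprocess
    #   subprocess.run(cmd, env=env, check=True)
```

Passing the modified environment only to the child process leaves the calling shell untouched, which makes it easy to compare behaviour with and without the setting.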

  • CERN batch and grid services:
    • The whole VOMS service will be unavailable on Wednesday 18th of February from 9AM to 10AM due to a hardware intervention in the underlying database.
    • VOMRS to be decommissioned and replaced by VOMS-admin on Monday 2nd of March.
      • The intervention will last the whole day. During that period of time, VOMS will be unaffected (e.g. voms-proxy-init) but registrations won't work.
      • More info: OTG0018459
  • CERN storage services:
    • All instances of Castor were unavailable between Friday 23h40 and Saturday 1h30 because of some network issues.
    • Tuesday: EOSALICE Upgrade starting at 09h30 (CET) OTG0018572
    • Thursday: EOSLHCB Upgrade starting at 09h30 (CET) OTG0018573
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Maria D. (SCOD), Maarten (ALICE), Nacho (CERN Grid Services), Andrea Manzi (MW Officer), Herve' (CERN Storage), Zbigniew Baranowski (CERN DB), Pablo Saiz (GGUS & Monitoring), Alexei Zhelezov (LHCb).
  • remote: Andrej Filipcic (ATLAS), Christoph Wissing (CMS), Di Qing (Triumf), Dmytro Karpenko (NDGF), Elizabeth Prout (OSG), Jeremy Coles (GridPP), Rolf Rumler (IN2P3), Gareth (RAL), Lisa Giacchetti (FNAL), Matteo (CNAF), Michael Ernst (BNL).

Experiments round table:

  • ATLAS reports ( raw view) -
    • Central Services/T0/T1
      • nothing special to report
ATLAS and the FTS3 experts are still investigating the observed fair-share bug (fair share does not work). More at the WLCG Ops Coord meeting later today.

  • ALICE -
    • KIT: ALARM ticket (GGUS:111813) Tue evening Feb 17 because of GGUS ticket update e-mail flood
      • see GGUS report below

  • LHCb reports ( raw view) -
    • User jobs and MC at the moment.
    • T0: NTR
    • T1: NTR
Contact was lost with 5K jobs at RAL around 2am UTC last night. All is OK now.

Sites / Services round table:

  • ASGC: not connected. Holiday. Chinese New Year.
  • BNL: Scheduled maintenance next Tuesday 24th February around 15:00 UTC on the ATLAS Tier1 WAN connection. The SEs will experience a 1-2 minute interruption, short enough to be classified as a "Transparent intervention".
  • CNAF: ntr
  • FNAL: ntr
  • GridPP: ntr
  • IN2P3: ntr
  • JINR: not connected
  • KISTI: not connected. Public holiday.
  • KIT: not connected.
  • NDGF: A cluster in Sweden is currently accepting no jobs due to a service intervention. It will be back to normal tomorrow. A scheduled downtime for work on dCache tomorrow is recorded in GOCDB.
  • NL-T1: not connected
  • NRC-KI: not connected
  • OSG: ntr
  • PIC: ntr
  • RAL: Planned work on the network will cause an interruption of a few minutes. Gareth asked Alexei to open a GGUS ticket to investigate the LHCb problem with last night's queued jobs (what happened to them?)
  • TRIUMF: The whole Tier1 is back OK after today's intervention at 1am CET to finalise the Switchgear work.

  • CERN batch and grid services:
    • VOMRS to be decommissioned and replaced by VOMS-admin on Monday 2nd of March.
      • The intervention will last the whole day. During that period of time, VOMS will be unaffected (e.g. voms-proxy-init) but registrations won't work.
      • More info: OTG0018459
  • CERN storage services:
    • Upgrade of CMS AAA root redirector to XrootD 4.1.1 planned for next Monday at 14:00 (CET)
  • Databases: A lot of work on the integration databases starts this week and continues next week, in agreement with the experiments.
  • GGUS:
    • ALARM GGUS:111813 describes a major email problem that GGUS suffered on Tue 17th Feb around 6pm CET, which flooded some users with email notifications.
    • On Wednesday 18th, from 16:00 to 18:00 UTC, one of the four KIT DNS servers caused trouble and had to be replaced. During this period access to GGUS was slow and in a few cases ticket updates failed with: "ERROR (39): Filter/escalation 'set field' process timed out before completion".
    • About the person paged at FNAL following the last ALARM test: Lisa reminded that this number should not appear anywhere, and Pablo repeated that GGUS has always simply taken the contact info from OIM. Still to be investigated and commented on in JIRA:1385.
  • Grid Monitoring: ntr
  • MW Officer: ntr

AOB: Savannah closes down today. Final notes on https://twiki.cern.ch/twiki/bin/view/LCG/TrackingToolsEvolution#Savannah_close_down_date_2015_02

Topic attachments
  • MB-Feb-15-v2.pptx (r1, 2850.0 K, 2015-02-19 14:49, MaartenLitmaath): v2 of the GGUS slides for the WLCG Service Report of the Feb 17 MB
Topic revision: r22 - 2015-02-19 - MaartenLitmaath
 