Week of 140428

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Alessandro (ATLAS), Alexandre (grid monitoring + GGUS), David (CMS), Felix (ASGC), Hector (grid monitoring + GGUS), Jan (CERN storage), Joao (Audio/Video services), Luca C (databases), Maarten (SCOD), Raja (LHCb), Vitor (CERN grid services)
  • remote: Alexander (NLT1), Jeremy (GridPP), Lisa (FNAL), Lucia (CNAF), Michael (BNL), Pavel (KIT), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI), Tiju (RAL), Xavier (KIT)
Experiments round table:

  • CMS reports (raw view) -
    • Quiet until Sunday -- 13 TeV MC production and high priority upgrade workflows running
    • GGUS:104926 A ~12-hour reverse DNS failure at FNAL brought down CMS production during that time. No problems since the workaround was put in place on Sunday afternoon CERN time.
    • INC:0538717 "Problems accessing files on EOS" We are re-running the jobs that had trouble and should soon see if the problem is fixed.

  • ALICE -
    • CERN: 45 files on a broken tape were lost
      • not a big deal

  • LHCb reports (raw view) -
    • Incremental Stripping campaign finished - many thanks to all T1 sites !!!!
    • MC simulation, Working Group Production and User jobs.
    • T0:
      • Migration of user data Castor -> EOS finished. Some minor problems in mapping users with production roles fixed by hand.
    • T1:
      • GridKa : Some users not able to upload to GridKa-User SEs. Being investigated by LHCb GridKa contact - will open GGUS ticket if needed.
Sites / Services round table:

  • ASGC - ntr
  • BNL
    • on Tue May 6 all T1 services will be affected by network upgrades the whole day; the services are expected to be back by the end of that day; if necessary, the downtime will be extended into the next day
  • CNAF - ntr
  • IN2P3 - ntr
  • FNAL - ntr
  • GridPP - ntr
  • KISTI - ntr
  • KIT
    • firewall maintenance this morning appears to have worked OK, no complaints were received
  • NLT1
    • on Thu memory modules will be replaced on 9 dCache pool nodes
  • OSG - ntr
  • RAL
    • tomorrow there will be network upgrades the whole day

  • CERN grid services
    • CERN CvmFS: the stratum 0 nodes for /cvmfs/grid.cern.ch, /cvmfs/na61.cern.ch and /cvmfs/na49.cern.ch will be migrated from CvmFS 2.0 to 2.1, starting at 12:00 UTC on Monday 5th May. All sites mounting these repositories should be unaffected.
  • CERN storage - ntr
  • GGUS
    • New release on the 30th of April. New support units: 'ARGO/SAM', rOCCI and several CMS-specific ones
  • grid monitoring
    • tomorrow the old EGI message brokers will be decommissioned

AOB:

  • The detailed Vidyo firewall configuration is attached, and the updated list of routers is at this link:
    • http://information-technology.web.cern.ch/services/fe/info/vidyo-routers-cernlhc-vidyo-network
      • Joao: Vidyo routers at some sites are behind firewalls, causing traffic failures and other anomalies
      • Raja: do users need to do something?
      • Joao: they ought to configure their PC to talk to the router of their institute, else use the proxy mode (less performant); in any case they can contact our Vidyo support when they experience connection problems
      • Alessandro: let's open a GGUS bulk ticket for the sites concerned
      • Maarten: will do

  • Next meeting on Friday

Thursday: May Day holiday

  • The meeting will be held on Friday instead.

Friday

Attendance:

  • local: Alexandre (grid monitoring), Felix (ASGC), Jan (CERN storage), Maarten (SCOD), Raja (LHCb), Vitor (CERN grid services)
  • remote: Antonio (CNAF), Dennis (NLT1), John (RAL), Ken (CMS), Kyle (OSG), Michael (BNL), Sang-Un (KISTI), Thomas (NDGF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
      • deletion errors at CERN-PROD_DATATAPE (GGUS:103691), SURLs turned into question marks
    • T1
      • staging errors at FZK-LCG2_MCTAPE (GGUS:104987), solved
      • deletion errors at RAL-LCG2 (ELOG:49039)
      • staging errors at SARA-MATRIX tapes (GGUS:105102)

  • CMS reports (raw view) -
    • Two problems at CERN on Tuesday, GGUS:104977 and GGUS:104978. (Looks like they got incorrectly bridged to GGUS at first.) Both attributed to problems with the ARGUS service, but the tickets are not resolved yet. Not that this seems to have kept us from operating.
    • On Thursday, voms-proxy-init went flaky all over the place, alarm ticket GGUS:105060. Things are fine now and the ticket was closed, but to be sure it looks like the underlying issue is not understood! It might (or might not) be related to the fact that FNAL reverse DNS lookup went bad (again) around the same time.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • MC simulation, Working Group Production tests and User jobs.
    • T0:
    • T1:
      • RAL : Jobs hitting CVMFS problems due to biomed jobs overloading some WNs (GGUS:105047). Thanks for quick action solving this issue.
      • GridKa : Some users not able to upload to GridKa-User SEs. Being investigated by LHCb GridKa contact - problem confirmed by GridKa, who will look at it on Monday.
      • GridKa : Problem accessing some files at GridKa (GGUS:105093)

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • BNL
    • in the last few days there have been intermittent job and transfer failures due to invalid CRLs for the DigiCert CA; more details in the OSG report
  • KISTI
    • scheduled downtime for firewall maintenance (network intervention) on 7 May from 06:00 to 15:00
    • During the downtime we will add a new firewall rule that makes perfSONAR accessible from outside: web services and control ports will be opened only to known sources. Details will come in the next report.
  • NDGF - ntr
  • NLT1
    • 1 dCache door node had a memory problem today, but should be OK now
    • other dCache door nodes had memory replaced yesterday as planned
  • RAL
    • 1 CMS disk server is out, being looked into
  • OSG
    • for unknown reasons the DigiCert CA CRL expiration period was reduced from 30 to 3 days
    • this should have been mostly transparent for grid services that run the fetch-crl cron job every 6 hours, but because of the observed failures, the CA has agreed to reinstate the 30-day period later today
    • we will discuss this matter further with the CA
    • Maarten:
      • you may want to point out that grid services normally get the latest CRL every 6 hours, so there would only be a 6-hour window during which a revoked certificate may still be used without error
      • DigiCert could therefore apply a "generous" period to their grid CA, while being stricter for their other CAs
      • various CAs work OK with periods of 1 week or longer
      • the important thing would be to cover at least a long weekend (4 days), so the expiration period had best be 5 days or more
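The arithmetic behind this recommendation can be sketched as follows. This is only an illustration, not anything agreed with the CA: the function names are hypothetical, and adding one fetch interval as a safety margin on top of the unattended period is an assumption. The 6-hour fetch interval and the 4-day long weekend come from the discussion above.

```python
from datetime import timedelta

# Values taken from the discussion in the minutes.
FETCH_INTERVAL = timedelta(hours=6)  # fetch-crl cron period at grid services
LONG_WEEKEND = timedelta(days=4)     # longest period without a fresh CRL to cover


def minimum_validity(outage=LONG_WEEKEND, fetch_interval=FETCH_INTERVAL):
    """Shortest CRL validity that survives an unattended period.

    Assumption (not from the minutes): one extra fetch interval is added as
    margin, because a service may hold a CRL that is up to one interval old.
    """
    return outage + fetch_interval


def covers_outage(crl_validity, outage=LONG_WEEKEND, fetch_interval=FETCH_INTERVAL):
    """True if a CRL with this validity stays usable through the outage."""
    return crl_validity >= minimum_validity(outage, fetch_interval)


if __name__ == "__main__":
    for days in (3, 5, 7, 30):
        status = "ok" if covers_outage(timedelta(days=days)) else "too short"
        print(f"{days:2d}-day CRL validity: {status}")
```

Under these assumptions the reduced 3-day period does not cover a long weekend, while 5 days or more does, in line with the suggestion above.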

  • CERN grid services
    • SHA2 certificates automatically added to the CMS VOMS users. The other VOs will follow next week.
      • this should make it easier for users with CERN SHA-1 certificates to migrate to CERN SHA-2 certificates
      • all VO managers welcomed this mechanism
  • CERN storage - ntr
  • GGUS
    • April release deployed on the 30th of April. The test alarms were sent successfully. One test alarm is still open.
  • grid monitoring - ntr

AOB:

Topic attachments
  • VidyoFirewallConfiguration.rtf (r1, 2.4 K, 2014-04-28 14:37, MaartenLitmaath): Vidyo firewall configuration
  • adminsVidyoT1s.rtf (r1, 1.1 K, 2014-04-28 14:39, MaartenLitmaath): T1 Vidyo admins
Topic revision: r10 - 2014-05-02 - MaartenLitmaath
 