Week of 131125

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: AndreaV/SCOD, Xavi/Storage, Felix/ASGC, Stefan/LHCb, Alessandro/ATLAS, MariaD/middleware, Ignacio/Grid, Ulrich/Grid, Pablo/Dashboard, Maarten/ALICE
  • remote: Sang-Un/KISTI, Xavier/KIT, Stefano/CNAF, Onno/NLT1, Lisa/FNAL, Pepe/PIC, Rolf/IN2P3, Tiju/RAL, Kyle/OSG, Tommaso/CMS, Alexei/ATLAS

Experiments round table:

  • CMS reports (raw view) -
    • not much to say, all quiet [Tommaso: learnt now from ATLAS about RAL network issue, we did not notice any issue]
    • last GGUS:
      • GGUS:98253 : Problem CMSSW/DPM @ T3_UK_ScotGrid_GLA (WLCG: UKI-SCOTGRID-GLASGOW). Switching to xrootd for local access.
      • (Not yet transformed into a GGUS): Apparent Problem in communications between T2_CN_Beijing and a specific WMS (wms306.cern.ch). [MariaD: suggest that the ticket is assigned to CERN with info CC to Beijing]

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Main activity is simulation at all sites.
    • T0: ntr
    • T1:
      • Reminder for all dCache sites to please check whether they have a single endpoint for xroot access (Castor, EOS, StoRM are ok)
        • SARA: single xroot endpoint currently being set up
      • GRIDKA: mount point of a worker node fixed

Sites / Services round table:

  • Sang-Un/KISTI: ntr
  • Xavier/KIT:
    • Yesterday the tape software crashed; tapes were not available until this morning
    • There will be a dCache intervention for LHCb on Thu affecting both disks and tapes
  • Stefano/CNAF: intervention on ATLAS filesystem ongoing to unify the two ATLAS filesystems into a single one, storage will be completely unavailable until tomorrow
  • Onno/NLT1: the controller that was installed last Thu broke on Fri and will be replaced next Wed. This is the third controller of this type and at that location that broke down in one month, and the eighth controller of this type that had to be replaced in one month. This is old hardware at end of life; new hardware will be installed over the next few weeks, before Christmas. [Alessandro: what is the type of this controller, so that other sites can check? Onno: this is DDN S2A9900. Pepe: we probably have a few at PIC too.]
  • Lisa/FNAL: ntr
  • Pepe/PIC: was travelling this weekend, will follow up on dCache issues
  • Rolf/IN2P3: ntr
  • Tiju/RAL: there was a network problem this morning due to a fiber cut; manual intervention was needed to switch to the backup link. Will investigate why automatic failover did not work.
  • Kyle/OSG: ntr
  • Felix/ASGC: intervention went ok. Will prepare a SIR.

  • Pablo/GGUS: there will be a GGUS release this Wed
  • Pablo/Dashboard: problem with SAM last week, no results available for experiment tests for 27 hours between Thu and Fri morning
  • Ignacio/Grid: contacted ATLAS and LHCb to go on with the deployment of new puppet managed nodes for LFC (but will keep the old VMs around anyway). [Alessandro: noticed that authentication is slightly different now for LFC. Ignacio: will follow up offline on the technical details.]
  • Xavi/Storage:
    • incident yesterday, broken switch in the CC around 1.20pm, affected EOSATLAS/EOSCMS head node, service was in read-only mode for two hours
    • Castor version 14 deployment has finally completed, it took two weeks to update the database for all entries
    • Castor minor patch release will be deployed this week (ATLAS) and next week (CMS, ALICE, LHCb), this will be a transparent intervention, will send email reminders to the experiments
    • [Alexei: noticed a reduction in tape token for ATLAS at CERN prod, less than 9 TB are available. Xavi: will follow up.]
  • MariaD/middleware: a doodle http://doodle.com/pc7yaxb4cai62tmx was communicated to the WLCG operations list about the Middleware Readiness Working Group, please sign up if you want to be part of the working group and meeting. Details in https://twiki.cern.ch/twiki/bin/view/LCG/MiddlewareReadinessArchive#Coming_Events
  • Ulrich/Grid: there was a network outage with Wigner last Fri, part of the SLC6 batch capacity for CERN was unavailable [MariaD/Maarten: should have a SIR about this, Wigner is meant to be transparent]

AOB: none

Thursday

Attendance:

  • local: MariaD/SCOD, Victor/LHCb, Alessandro/ATLAS, Felix/ASGC, Xavi/Storage, Maarten/ALICE, Ulrich/Grid_services, Pablo/Dashboards&GGUS, Ivan/CMS.
  • remote: Alexei/ATLAS, Xavier/KIT, Dennis/NL_T1, Jon/NDGF, Rolf/IN2P3, Gareth/RAL, Salvatore/CNAF.

Experiments round table:

  • CMS reports (raw view) -
    • GGUS:99203 (26.11, Tue.) - CASTOR default pool full. After the upgrade to CASTOR v14 a file tag was missing, making files persistent on disk. A hot fix was applied.
    • GGUS:99205 (26.11, Tue.) - BDII configuration error due to the emergency removal of FCR from the BDII configuration (GGUS:99133).

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Answering Victor's question on the long-lasting SARA_MATRIX controller problems, Dennis reported that multiple controller replacements haven't brought the end of the problem yet and many files remain unretrievable.

Sites / Services round table:

  • OSG, BNL, FNAL: Not connected, Thanksgiving
  • ASGC: CMS broken tape found. Luckily some files were still in cache and others have replicas elsewhere.
  • NL_T1: nta
  • CNAF: nta
  • KIT: Coming out of the official downtime. Several services are already up again.
  • PIC: Not connected
  • RAL: ntr
  • NDGF: dCache pools will be updated next Monday. GOCDB is up-to-date.
  • IN2P3: ntr
  • KISTI: ntr (said offline to MariaD)

  • CERN:
    • VOMS: the update of voms and voms-admin on lcg-voms.cern.ch and voms.cern.ch, including the migration from SL5 to SL6, is now complete. Points of note:
      • Startup of vomrs to sync new registrations was delayed by a couple of hours. I initially thought there was a problem, but in fact there was none.
      • The VO-Admins in particular were bombarded by email about changing certificate authorities. It was unfortunate that we had a new IGTF release around the same time as I was doing final comparisons of the old and new service, so the installed CAs were flip-flopping.
      • For two hours voms proxies for cms, alice, atlas, lhcb and geant4 were being issued with a validity of 24 hours rather than the pre-agreed durations. ITSSB. Alessandro showed us that the SSB now shows all interventions (not only computing), which makes it hard to find anything. Maarten will open a SNOW ticket for this.
        • After the meeting: the correct IT SSB page only lists IT-related incidents, interventions and changes.
      • SLS for voms is now reflecting the same checks as before on the new service.
      • Maarten added that a ticket was opened today because the new voms v3 client does not work with the upgraded CERN servers (GGUS:99292).
    • FCR temporarily got re-enabled on sam-bdii with the rollback to prevent a worse situation (sam-bdii getting empty); a proper rpm fix is being awaited. The normal top-level BDII (lcg-bdii) was ok. The symptom is that if a CE dedicated to CMS is considered bad by the FCR, it will be removed from the Information System and then can no longer be tested by the FCR...
    • Storage:
      • A bug was found in CASTOR v14 where a label was missing for tape-recalled files. This skewed the garbage collection mechanism, with the effect of filling disks. This is now fixed; the deployment for the other experiments is in progress and will continue next week.
      • ALICE intervention next Tuesday.
      • CMS CASTOR intervention next Monday.
      • ALICE and LHCb CASTOR intervention next Wednesday.

  • Dashboards: Recomputing the site A/R during the period of the sam-bdii problems.

AOB:

Topic revision: r16 - 2018-02-28 - MaartenLitmaath
 