Week of 141013

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cernNOSPAMPLEASE.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Andrea Sciabà, Alessandro Di Girolamo, Maarten Litmaath, Xavi Espinal (IT-DSS), Alessandro Fiorot (IT-DSS), Nacho Barrientos (IT-PES), Tsung-Hsun Wu (ASGC), Zbigniew Baranowski (IT-DB)
  • remote: Dea-Han Kim (KISTI), Tommaso Boccali (CMS), Dmytro Karpenko (NDGF), Rolf Rumler (IN2P3-CC), John Kelly (RAL), Josep Flix (PIC), Michael Ernst (BNL), Onno Zweers (NL-T1), Rob Quick (OSG), Xavier Mol (KIT), Alexey Zheledov (LHCb), Lucia Morganti (CNAF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Daily Activity overview
      • DiskSpace: cleaning and rebalancing: still waiting to hear from the MCprod expert whether a list of unmerged HITS can be obsoleted. FZK and SARA primary
      • LHCOPN link between IN2P3-CC and CERN saturated for more than 24 hours https://netstat.cern.ch/monitoring/network-statistics/ext/?q=Tiers1&p=LHCOPN&mn=FR-CCIN2P3&t=Weekly
      • Rucio stress test is ongoing. It has been running at 1M files/day for more than a week; a small issue on Saturday brought the rate below 1M files/day. The repair service for stuck rules needs to be checked: from tomorrow one person (Joaquin) will be tasked with checking each morning. Log file rotation, size and verbosity are to be increased.
      • Rucio: staging - still waiting for the overwrite.
      • Migration to AI (for DDM service): the deadline is October 31, but after discussion CERN-IT kindly agreed to keep Quattor alive for a few more weeks, so we can keep the nodes a bit longer. We need a list of "don't touch" nodes: these are all the DQ2 machines, plus things like the SLS.
      • DQ2 accounting runs on CERN-IT Hadoop. Migrated to Rucio last week. The 2 dumps . Popularity needs Paul to add one line to enable double posting.

  • CMS reports (raw view) -
    • production / analysis at full steam
    • Global Run done, still reprocessing tests at the Tier0
    • No major issues:
      • GGUS:109201 : for the past 4 days the CMS SRM SLS has shown frequent problems. I am not aware of any real disruption, but would like to know whether it is known/expected. It seems to be due to overload caused by an ongoing transfer of several million streamer files from CASTORCMS to EOSCMS via FTS3. SOLVED by decreasing the number of file movements per job.
      • GGUS:109274 : Saturday evening, SRM not working for CMS. Alarm sent and service promptly restarted (daemon problem). Xavi says that SRM was stuck trying to talk to a central daemon; still under investigation.

  • ALICE -
    • CERN: 1 or 2 CEs published bad job numbers on Sun from ~16:43 to ~21:52; job submissions had to be switched off to avoid overloading LSF (GGUS:109281). Nacho explains that the problem is understood on the LSF side; the LSF master was rebooted but the procedure should be improved. Maarten adds that there is also a bug in the information providers, because they should never report zero jobs if they cannot get the information from LSF.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: scheduled downtime tomorrow at 17:00 to replace a router, but the LHC VOs should not be affected
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT:
    • had issues with the ATLAS SRM, which was frequently overloaded for several hours last week. The autotuning of FTS3 at RAL was misconfigured. Alessandro adds that the autotuning is probably not the only cause of the problem and that the issue is still under investigation.
    • we have upgraded dCache for LHCb to solve the ROOT6 problem
  • NDGF: due to a fire in the Copenhagen machine room over the weekend, the site is running at reduced capacity and some disk pools are unavailable. Everything should be back to normal tomorrow.
  • NL-T1: ntr
  • OSG: tomorrow there will be a maintenance intervention in OSG operations; it will not affect any WLCG service. Last Friday we had a meeting with CMS and Maarten to discuss the new VOMS servers. The plan is to deploy the new vomses file on November 11 (GGUS:109265)
  • PIC: ntr
  • RAL: downtime planned tomorrow to move the databases for the ATLAS and ALICE CASTOR instances back to production machines
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services:
    • CERN FTS
      • Wednesday 8th: srm-ifce was updated simultaneously on fts3.cern.ch and fts3-pilot.cern.ch. This should never happen; changes have been made to stop it happening again. The pilot should be ahead of production.
      • The new srm-ifce caused transfers with overwrite enabled and EOS (bestman) as a destination to fail.
      • Monday 13th October, 10:00 CEST: the scheduled fts3 upgrade from 3.2.27 to 3.2.28 took place; it would have included the broken srm-ifce had that not already been deployed.
      • Monday 13th October, 13:00 CEST: srm-ifce upgraded with a fix for EOS (bestman) writes.
      • FTS all good now.
  • CERN storage services: ntr
  • Databases: ntr
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB: starting from the next meeting, whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cernNOSPAMPLEASE.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Thursday

Attendance:

  • local: Andrea Sciabà, Felix Lee, Alessandro Di Girolamo, Maarten Litmaath, Pablo Saiz, Nacho Barrientos
  • remote: Andrej Filipcic, Dmytro Karpenko (NDGF), Gareth Smith (RAL), Rolf Rumler (IN2P3-CC), Pepe Flix (PIC), José Hernández (CMS), Michael Ernst (BNL), Sonia Taneja (CNAF), Sang Un Ahn (KISTI), Alexey Zheledov (LHCb)

Experiments round table:

  • ATLAS reports (raw view) -
    • Holding increasing: "ORA-00001: unique constraint (ATLAS_RUCIO.DIDS_GUID_IDX) violated". It seems PanDA is trying to add a GUID that already exists, apparently because one user created a broken pool file catalog in which the GUIDs are the same. The number of sites affected is 20-30.
    • SiteServices getting full. This is due to a bug in the new FTS3 clients: the output of glite-transfer-status is very large. Our glite version has been reported to FTS3; waiting for a patch from FTS3.
    • DQ2 get is not returning files from RRC-KI (ADCSUPPORT-3960)

ATLAS is working on defining procedures to follow when FTS3 has problems; there will be a document which will be shared.

  • CMS reports (raw view) -
    • GGUS:109339 : Tuesday evening, the SRM endpoint for CMS got stuck again. Alarm sent and service promptly restarted. SRM is under pressure due to the staging to EOS of a few million streamer files needed for scale testing the Tier-0 processing system. However, the SRM endpoint should not stop. Operations are contacting the CASTOR admins to find out what the admissible load is.

  • ALICE -
    • major instabilities were observed yesterday until early evening:
      • the number of active jobs at CERN showed a "roller coaster" profile
      • most pilots at CERN were exiting prematurely because they could not talk to the central AliEn services
      • other sites initially seemed unaffected
      • in the late afternoon one T2 reported that all their WNs had frozen due to CVMFS
      • many other sites went bad similarly, while CERN was unaffected
      • around 18:15 CVMFS started recovering on the affected sites
      • operations then recovered everywhere
      • the IT SSB did not show what may have caused all this trouble. Nacho proposes to check the CVMFS server logs to see if during that period the number of requests was lower than normal. Maarten points out that usually if the client cannot connect to a server, it will automatically fall back to another, which did not happen.

  • LHCb reports (raw view) -
    • MC and User jobs. Prestaging from tapes for future processing.
    • T0: NTR
    • T1: NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL:
    • Last week the WAN connection of the Tier-1 was reconfigured; the 8-10 Gb link was replaced by a 100 Gb link, bringing the total bandwidth for LHCONE to 200 Gb. The change was totally transparent and ATLAS operations were notified.
    • We have doubled the capacity of the Amazon-provided nodes at the Tier-1, for a total of 2000 8-core nodes. This is in the context of a project with Amazon to study usage patterns of the LHC experiments. The resources are accessed via a dedicated PanDA queue.
  • CNAF: ntr
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: yesterday afternoon we observed a drop in the number of running ALICE jobs without an obvious reason, which then recovered automatically. Maarten comments that it is probably related to the issue he reported and that there is no need to worry too much.
  • KIT:
  • NDGF: ntr
  • NL-T1:
  • OSG: We had problems with the twiki service for CERN certificates due to the CRL updating being stuck. Now it is back to normal.
  • PIC: ntr
  • RAL: ntr
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services: XRDFED upgrade to xrootd4 for CMS & ATLAS
  • Databases:
  • GGUS:
    • Scheduled outage on the 16th of October, 7:30 to 9:30, to switch to the failover instance of REMEDY
    • The test alarm for INFN-T1 took two weeks to be acknowledged.
  • Grid Monitoring:
  • MW Officer:

AOB:

Topic attachments
  • MB-Oct.pptx (r1, 2868.3 K, 2014-10-13 09:56, PabloSaiz)
Topic revision: r12 - 2014-10-16 - AndreaSciaba