Week of 150706

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Alessandro (ATLAS), Doug (ATLAS), Fernando (batch and grid), Kate (databases), Maarten (SCOD + ALICE), Xavi (storage)
  • remote: Alexander (NLT1), Antonio (CNAF), Christoph (CMS), Gareth (RAL), Kyle (OSG), Michael (BNL), Pavel (KIT), Rolf (IN2P3), Sang-Un (KISTI), Ulf (NDGF), Vladimir (LHCb)

Experiments round table:

  • ATLAS reports (raw view) -
    • ATLAS is monitoring the changes made to Rucio to cure the lost files issue reported last week. Early prognosis looks good.
    • TRIUMF tape issue while consolidating SUSY space (400 TB): GGUS:114796.
    • TRIUMF errors on LCG2-MWTEST-DATATAPE ("no write pool" error): GGUS:114819
      • Maarten: as that is a MW Readiness test instance, the issue is not urgent

  • CMS reports (raw view) -
    • No problems with respect to restart of collisions.
    • On Friday afternoon, the xrootd global redirector got hung up again, and I myself logged in and restarted the processes. But once again, it was noticed by the wrong person. On the re-opened GGUS:114712 I asked to get a copy of the logs so that we can understand the underlying problem, but I have no evidence that anyone has gotten them.
      • Xavi: will check with the experts
      • Christoph: we can log in and restart the service, but have no access to the logs
    • GGUS:114777 is still open, at least in part due to the holiday weekend in the US. It is believed that some network problems got fixed; the solution is still being verified.
    • GGUS:114792 -- Problems with ssh key pair handling in OpenStack affecting T0. Seems to have started around June 22. Still iterating on this.
    • It's still very hot outside! Also many places inside....

  • ALICE -
    • CERN: some recently written CASTOR files could not be read back; being investigated
      • Xavi: could be due to congestion in a few of the disk servers:
        • there are 2 HW types, big vs. small
        • their configurations may need to be checked
        • any server might be temporarily unavailable due to disk-to-disk copies or draining

  • LHCb reports (raw view) -
    • Data Processing, User and MC jobs on the grid.
    • T0
      • Data transfer problem from Pit to CASTOR. Fixed; it was caused by a misconfiguration of some new SRM nodes.
    • T1

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF: ntr
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI:
    • transparent OPN intervention tomorrow morning 5-7 CEST
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • NRC-KI:
  • OSG: ntr
  • PIC:
  • RAL: ntr
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services: nta
  • Databases: ntr
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Asa (ASGC), Cheng-Hsi (ASGC), Fernando (batch and grid), Hervé (storage), Maarten (SCOD + ALICE)
  • remote: Chris (LHCb), Christoph (CMS), David (ATLAS), Di (TRIUMF), Elizabeth (OSG), Jeff (NLT1), Lisa (FNAL), Michael (BNL), Rolf (IN2P3), Sang-Un (KISTI), Thomas (KIT), Tiju (RAL), Ulf (NDGF), Zoltan (LHCb)

Experiments round table:

  • ATLAS reports (raw view) -
    • ALARM ticket to CERN, LSF instance not accessible. Caused a burst of lost T0 jobs each morning for the last two days: GGUS:114929
      • Fernando: the WNs were rejected by LSF because of an incorrect configuration change; the issue has been fixed
    • Problem reading data from NIKHEF: GGUS:114431
      • Jeff: only functional tests seem to be affected, not production
    • Some sites failing large jobs due to wrongly configured working dir size in AGIS. Campaign is underway to fix it.
    • Still problems with wrong/duplicate FTS messages (being discussed in concurrent FTS meeting)
    • Massive consistency check underway to check for lost files - T1s were asked to provide storage dumps.

  • CMS reports (raw view) -
    • CMS magnet at full field since Monday evening
      • Maarten: congratulations and best wishes!
    • Problems with CASTOR CERN
      • SLS page dropped to red values on Tuesday afternoon - recovered by Wednesday morning (without any action from CMS side)
      • SAM test still had problems writing to CASTOR on Wednesday
      • PhEDEx transfers also have issues writing
      • GGUS:114906 opened on Wednesday morning - DN mapping issue, fixed Wednesday afternoon
      • Another issue seems to be reading from CASTOR, GGUS:114938 - likely a firewall issue
        • confirmed, see below
    • Problems with analysis jobs getting executed at FNAL
      • Suspicion: site overloaded with production - but it should process at least some user jobs
      • Problem checked from the Glidein Factory side and the site side
      • GGUS:114887 and GGUS:114888
    • Symbolic link (to site configuration) got lost in CVMFS repository Thursday morning ~6:00 (CERN time)
      • Caused basically all CMS jobs to fail, including SAM tests at CERN (GGUS:114935)
      • Link got restored in CVMFS around 10:30
      • CVMFS experts found a bug in CVMFS (to be fixed in the next release)
      • ALARM Ticket: GGUS:114933
      • Site readiness will be corrected
    • Once more trouble with Global Xrootd redirector at CERN
      • Same problem as reported by Ken
      • Service ran out of threads
      • CMS xroot experts suggest upgrading one machine behind the alias to xrootd 4.2 (from 4.1)
      • GGUS:114712
      • Hervé: the upgrade is planned for early next week; meanwhile the devs are examining a core dump of the current version

  • ALICE -
    • NTR

Sites / Services round table:

  • ASGC:
    • downtime next Mon 01:00-10:30 UTC for memory upgrades
  • BNL: ntr
  • CNAF:
  • FNAL:
    • the issue with CMS jobs is being investigated
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI:
    • one of the AliEn daemons on the ALICE VOBOX crashed and needed to be restarted manually
  • KIT: ntr
  • NDGF: ntr
  • NL-T1:
    • today's important openssl update may entail some downtime if services need to be updated urgently
  • NRC-KI:
  • OSG:
    • the accounting report for June looks OK except for 1 site that is being looked into
  • PIC:
  • RAL: ntr
  • TRIUMF: ntr

  • CERN batch and grid services: nta
  • CERN storage services:
    • New SRM nodes in production had not been added to the firewall exception list; now fixed
    • Upgrade of the ATLAS xrootd redirector to 4.2 completed; the same process for CMS is ongoing
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Topic revision: r11 - 2015-07-09 - MaartenLitmaath
 