Week of 110919

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Peter, Stefan, Andrea, Stephen, Torre, Jamie, Maria, Ivan, Luca, Ricardo, Ignacio, Ale, Dirk);remote(Michael/BNL, Catalin/FNAL, Onno/NL-T1, Daniele/CNAF, Kyle/OSG, Gareth/RAL, Jhen-Wei/ASGC, Pavel/KIT, Rolf/IN2P3).

Experiments round table:

  • ATLAS reports - Peter
    • T0/Central Services
      • RAW files disappeared again from CASTOR/T0ATLAS - ALARM ticket GGUS:74448 (machine in maintenance)
      • Ignacio: will check with Massimo, but short term (<1h) disk server interventions go on all the time.
      • Ale: ATLAS retries with 15min and 30 min delay.
    • T1 sites
      • SARA pre-staging GGUS:74422
        • Onno: updated ticket - will check again and mark solved
      • CNAF StoRM offline - ALARM ticket GGUS:74429
        • Daniele: will send email update as audio quality of phone connection was poor.
      • RAL pre-staging problems; it was OK for reprocessing part I in August (no tape?). Has CASTOR been upgraded since?
        • Gareth: upgrades were completed before August - will follow up with Shaun about the result of further debugging...
    • T2 sites
      • Ongoing issue at AGLT2
        • Dirk: will contact network people offline to close loop with site and experiment experts

  • CMS reports - Stephen
    • LHC / CMS detector
      • NTR
    • CERN / central services
      • NTR
    • T0:
      • Requested nodes moved from CAF to T0. Done during the weekend, thanks! GGUS:74417
      • PromptReco processed data released (after the 48h delay) up to Friday 9pm.
    • T1 sites:
      • Backfill mostly, some MC in progress. Also redigi running at FNAL.
    • T2 sites:
      • NTR

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Stefan
    • Experiment activities:
      • Reconstruction and stripping
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 2
      • T2: 1
    • Issues at the sites and services
      • SARA: Problems with file access via protocol; several pool nodes were rebooted during the w/e (GGUS:74416). NIKHEF is also affected by the same problem (GGUS:74427). Another ticket was opened because, after the removal and archiving campaign executed over the w/e, the disk pools in front of tape storage filled up (GGUS:74441)
      • CNAF: Problems with file access, turned out to be a problem with the GPFS file system, which was restarted on Sunday (GGUS:74428)
      • dCache sites were asked to re-allocate disk space from "old" LHCb disk space tokens after the removal campaign (GGUS:74365, GGUS:74366, GGUS:74367, GGUS:74368)
      • Onno/NL-T1: disk pools in front of tape: the problems were caused by the mass storage system - space should be freed soon. The problems with the pool nodes are still being investigated.
      • Pavel/KIT: received ticket for shared area problem - starting to look into this

Sites / Services round table:

  • Michael/BNL - ntr
  • Catalin/FNAL - ntr
  • Onno/NL-T1 - ntr
  • Daniele/CNAF - nta
  • Kyle/OSG - ntr
  • Gareth/RAL - Problems early Friday evening: first a DB deadlock for the ATLAS SRM, then some LSF scheduling issues - all were resolved that evening. A short network break of a few minutes tomorrow has been registered in the GOC DB.
  • Jhen-Wei/ASGC - ntr
  • Pavel/KIT - staging problems related to the tape library have been fixed - ATLAS should confirm
  • Rolf/IN2P3 - ntr
  • Luca/DB - ATLAS Streams replication to the T1s was affected by a Streams bug. Latency rose to 4h and is now back to normal. Peter: saw some DB monitoring alerts - can the shifter do anything about this? Luca: this was high load on ATLR (which should go away with the move to frontier/coral srv). The shifter should wait for 30 mins before acting.

AOB: (MariaDZ) GGUS MB slides with ticket totals and ALARM drills attached for use at tomorrow's MB by the SCOD on duty.

Tuesday:

Attendance: local(Torre, Ricardo, Stefan, Massimo, Nicolo, Ivan, Ale, Eva, Dirk);remote(Michael/BNL, Kyle/OSG, Lisa/FNAL, Burt/FNAL, Daniele/CNAF, Rolf/IN2P3, Pavel/KIT, Ronald/NL-T1, John/RAL).

Experiments round table:

  • ATLAS reports - Torre
    • T0/Central Services
      • Alarm ticket "RAW files disappeared again from CASTOR/T0ATLAS" confirmed as routine maintenance on a node, no data loss, closed and verified. GGUS:74448
      • A high rate of generator data file requests against www.hepforge.org amounted to a denial of service, disabling their server. Traced to a Panda user, who was contacted (email and phone) and immediately killed his jobs. (This all played out over ~50 min following ATLAS ops notification this afternoon; Hepforge first noticed it yesterday.)
        • From Hendrik Hoeth @ Hepforge: it is quiet now, keeping an eye out. "Thanks a lot to everybody for the quick and effective response!!!!!!"
        • There are 7 GGUS tickets in various places for this issue - question of how best to disseminate the information to all of them.
    • T1 sites
      • RAL SRM errors are degrading data transfer efficiency. Being looked at. GGUS:74466
      • RAL ticket on reattempts to replicate existing files closed. Not a site issue. GGUS:74432
      • SARA pre-staging problem - DPF mass storage stuck, restarted Sep 19 morning. Transfers successful since. Closed and verified. GGUS:74422
      • CNAF StoRM offline alarm. A GPFS deadlock led to a file system hang. Seen before; a bug report is being filed with IBM. Restarted. Closed and verified. GGUS:74429
    • T2 sites
      • AGLT2 - CERN bandwidth issue for calibration data: CERN IT network experts in contact with AGLT2, arranging perfmon tests to diagnose. Tracking as ongoing issue. GGUS:73463

  • CMS reports - Nicolo
    • LHC / CMS detector
      • NTR
    • CERN / central services
      • NTR
    • T0:
    • T1 sites:
      • Backfill mostly, some MC in progress. Also redigi running at FNAL.
      • IN2P3-CC: GGUS:74474 for timeouts on software area.
    • T2 sites:
      • NTR

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Stefan
    • Experiment activities:
      • Reconstruction and stripping
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 2
      • T2: 1
    • Issues at the sites and services
      • Gridka: Problems with access to local LFC server, due to broken certificates (GGUS:74476)
      • SARA: Problems with file access via protocol (GGUS:74416)
        • Ronald: 8 pool nodes were affected by crashes, but the reason has not been found yet. All nodes are on SL4, so an upgrade is being considered, which would need a downtime. Will inform once a decision has been taken.

Sites / Services round table:

  • Michael/BNL - ntr
  • Lisa/FNAL, Burt/FNAL - VOMRS did not send emails for several days, affecting new registrations. Is the support 24*7 or 8*7? Steve Traylen confirmed the problem lasted until today and will give a more detailed report tomorrow. The support level is 8*7.
  • Daniele/CNAF - ntr
  • Rolf/IN2P3 - ntr
  • Pavel/KIT - problem fixed for LFC server
  • Ronald/NL-T1 - SARA fixed the migration from disk to tape - more free disk space now.
  • John/RAL - planned intervention went well: < 1min downtime
  • Jhen-Wei/ASGC : Thu downtime for castor upgrade to 2.1.11-2 (1am - 6am UTC)
  • Kyle/OSG - is there a general mailing list for BDII & SAM exchanges? Dirk: will check with the experts...

AOB:

  • A service incident report by BNL has been filed on the TWiki for the recently observed Oracle Streams inconsistency.

Wednesday

Attendance: local(Torre, Stefan, Ricardo, Massimo, Luca, Nicolo, Ivan, Dirk);remote(Michael/BNL, Jhen-Wei/ASGC, Ronald/NL-T1, Pavel/KIT, Jeremy/GridPP, Tiju/RAL, Rolf/IN2P3, Liz/FNAL, Lorenzo/CNAF, Kyle/OSG, Ale/ATLAS)

Experiments round table:

  • ATLAS reports - Torre
    • T0/Central Services
      • NTR
    • T1 sites
      • RAL SRM errors resolved. Transfers successful since. Closed and verified. GGUS:74466
      • FZK stager problem declared as solved by site this morning, keeping an eye on it for a while before verifying. GGUS:74433
      • NDGF stager errors, ticket 16:29 UTC yesterday, under investigation by site. GGUS:74502
    • T2 sites
      • NTR

  • CMS reports - Nicolo
    • LHC / CMS detector
      • Physics running
    • CERN / central services
      • NTR
    • T0:
      • Going through PromptReco backlog
      • Nicolo: CMS would like a report on the recent DB reboots (last week and today around lunchtime)
        • Luca: CMSR DB: 3 nodes rebooted 12:20-12:50 - similar to last week's incident. Don't have a full understanding yet; high load is visible on node 2, but the fact that another two nodes rebooted as a consequence looks like an Oracle Clusterware problem. The DB team opened an SR against Oracle and started to collect more debugging information. Nicolo: some queries seem to execute slower since the reboot. Luca: will look into this.
    • T1 sites:
      • Backfill mostly, some MC in progress. Also redigi and PromptSkimming at FNAL.
      • New MC reprocessing campaign about to start, T1 contacts were reminded that the MinBias sample used as input for pileup mixing needs to be available on disk (if possible with multiple replicas).
      • IN2P3-CC (T1 and T2): GGUS:74474 for timeouts on software area was closed yesterday.
      • IN2P3-CC: GGUS:74501 for file failing in exports, file was missing from storage and will be invalidated. Closed.
      • CNAF: GGUS:74510 - SRM endpoint unavailable for ~9 hours, affecting SAM tests and transfers to/from CNAF. Closed at 10:02 UTC
      • FNAL: SAV:123624 opened for input files missing on storage in MC production
    • T2 sites:
      • Starting submission to T2s of test MC workflows with WMAgent.

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Stefan
    • Experiment activities:
      • Reconstruction and stripping
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 2
      • T2: 1
    • Issues at the sites and services
      • Gridka: Problems with access to local LFC server, due to broken certificates (GGUS:74476), fixed
      • SARA: Problems with file access via protocol (GGUS:74416), upgrading pool nodes

Sites / Services round table:

  • Michael/BNL - ntr
  • Jhen-Wei/ASGC - after consultation with the CERN team, ASGC will upgrade to CASTOR 2.1.11-5 during the downtime mentioned yesterday.
  • Ronald/NL-T1 - ntr
  • Pavel/KIT - ntr
  • Tiju/RAL - ntr
  • Rolf/IN2P3 - ntr
  • Liz/FNAL - ntr
  • Lorenzo/CNAF- ntr
  • Kyle/OSG - ntr
  • Jeremy/GridPP - ntr

  • Ricardo/CERN: explained the response to the VOMRS question listed under AOB
  • Ivan/Dashboard - SAM tests are failing for ATLAS at INFN - the ATLAS experts have been informed.

AOB:

  • Feedback on yesterday's question, from Steve Traylen by email:
    • CERN VOMRS
      • Between the 14th and the 18th the VOMRS backend agent was not running on the CERN LHC VOMS instance. The initial cause of the failure is still being looked at; it was related to the number of VOMS database connections peaking at that time. While action was taken soon after the situation was reported, it was ineffective due to a missing piece of configuration that had not made it to the SLC5 service (chkconfig vomrs on); a minimal sketch of that fix is given after this list.
      • During this time all visible services were available; however, new registrations to the VOs were held up.
      • Comment: the VOMRS backend service has always been less critical to operations; it has always relied on a cold standby as its recovery procedure.
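      • For illustration only - a minimal sketch of the kind of check and fix described above on an SLC5 node, assuming the VOMRS backend ships a standard SysV init script named vomrs (the script name and its support for a status action are assumptions, not stated in the report):
          # Check whether the vomrs backend agent is registered to start at boot
          chkconfig --list vomrs
          # Register it for its default runlevels (the step that was missing) and start it now
          chkconfig vomrs on
          service vomrs start
          # Confirm the backend agent is running (where the init script provides a status action)
          service vomrs status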

Thursday

Attendance: local(Torre, Nicolo, Massimo, Ignacio, Ricardo, Stefan, Ivan);remote(Michael-BNL, Daniele-CNAF, Jhen-Wei-ASGC, Lisa-FNAL, Kyle-OSG, Tiju-RAL, Rolf-IN2P3, Onno-NL-T1, Pavel-KIT, NDGF).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • High priority HLT reprocessing launched at CERN via production system last night
    • T1 sites
      • RAL SRM errors reappeared today, new ticket opened. GGUS:74562
      • FZK staging errors reappeared today, ticket reopened. GGUS:74433
      • NDGF tape pool problem fixed, ticket set to solved yesterday evening, keeping an eye on it. GGUS:74502
      • PIC gridftp errors, new ticket 10:45 UTC today. GGUS:74578
    • T2 sites
      • NTR

  • CMS reports -
    • LHC / CMS detector
      • Physics running
    • CERN / central services
      • NTR
    • T0:
      • Going through PromptReco backlog, currently 5k jobs pending
      • Today CMSSW_4_4_0 will be released, with improvements for memory usage and reconstruction time. It will be validated for PromptReco over the weekend; if all goes well, the T0 will switch to the new version next week.
      • This version will also be used for reprocessing at T1s later this year.
    • T1 sites:
      • MC production. Also redigi and PromptSkimming at FNAL.
      • FNAL: SAV:123624 opened for input files missing on storage in MC production - files were lost, will be invalidated.
      • RAL: SAV:123626 for failure to open input files in MC production - issue with disk server, now fixed, closed.
      • ASGC: SAV:123651 for failure to open input files in MC production - site is restaging the files.
    • T2 sites:
      • Starting submission to T2s of test MC workflows with WMAgent.
      • T2_PL_Warsaw failing SAM tests, SAV:123647
      • T2_RU_PNPI failing SAM tests, now OK, SAV:123648

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • CNAF: For a few hours this morning there were no jobs at the site due to the expiration of the proxies. Solved.
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Reconstruction and stripping
      • Validation of reprocessing applications has started
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 2
      • T2: 1
    • Issues at the sites and services
      • CERN: very low number of running jobs observed since this morning, currently under investigation
      • SARA: Problems with file access via protocol (GGUS:74416), upgrading pool nodes

Sites / Services round table:

  • ASGC: CASTOR upgrade done. Investigating a problem; some downtime is being registered in the GOC DB.
  • BNL : ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: Several small data management problems were behind the current (ATLAS) issues. Confident that this is now solved.
  • NDGF: Problems generated by recent upgrades. Root cause identified and problem removed.
  • NL-T1: 1 pool node upgraded successfully (SLC5; no more errors of the kind reported by LHCb). In fact the other nodes (still on SLC4) are also OK; the main difference is that there is little tape activity these days. It looks like the weak point could be the SLC4 gridFTP used to ship data to tape. Propose to upgrade to SLC5 tomorrow (8h downtime). OK for LHCb, pending OK from ATLAS.
  • PIC: ntr
  • RAL: ATLAS problem under study. It looks like they can control it by throttling. Investigating.
  • OSG: ntr

  • CASTOR/EOS: ntr
  • dashboards: ntr
  • databases: One node of the CMS DB (offline - main application: tier0) just went down. Investigating
  • grid services: ntr

AOB:

Friday

Attendance: local(Stefan, Torre, Ivan, Nicolo, Massimo, Ale, Lola, Ricardo, Eva, Jamie, Dirk);remote(Michael/BNL, Onno/NL-T1, Lisa/FNAL, Daniele/CNAF, Jeremy/GridPP, Jhen-Wei/ASGC, Rolf/IN2P3, Gareth/RAL, Pavel/KIT).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • NTR
    • T1 sites
      • RAL SRM errors yesterday solved (related to heavy ATLAS load), keeping an eye on it. GGUS:74562
      • FZK staging errors on ~10% of staging transfers this morning, ticket remains open. GGUS:74433
      • NDGF tape pool has looked healthy since yesterday's fix, solution verified. GGUS:74502
      • PIC gridftp high-rate errors to DESY-HH, DESY-ZN endpoints yesterday, ticket being worked on but error rates much lower today. GGUS:74578
    • T2 sites
      • NTR

  • CMS reports -
    • LHC / CMS detector
      • Physics running
      • Issue with switch of CMSONR database at P5 at 6AM, resulting in unavailability of 5 out of 6 nodes. IT-DB piquet promptly reacted and called the CMS sysadmin, who was able to complete the intervention at ~9AM. In the meantime, the surviving node was overloaded, so most subsystems at P5 could not be configured for the run - lost essentially all of this morning's short 2h run.
    • CERN / central services
      • CMSR1 reboot reported at yesterday's meeting (the node hosting T0AST): low impact - several T0 components crashed when they lost their connection, but came back when they were restarted by experts.
    • T0:
      • Going through PromptReco backlog, currently 3.5k jobs pending
      • Test CMSSW_4_4_0 workflow will be run today on a replay of an old run. If all goes well, configuration will be switched on Sunday, resulting in first jobs with new version running on Tuesday (after 48h built-in delay for PromptReco).
    • T1 sites:
      • MC production. Also redigi, PromptSkimming and Release Validation at FNAL.
        • Several workflows at various sites (FNAL, CNAF, IN2P3, ASGC) failing - experts investigating, it seems to be an issue with workflow configuration rather than a site-related problem.
      • ASGC: SAV:123651 for failure to open input files in MC production - site is restaging the files, and fixing config issues on WNs which have trouble opening the files.
    • T2 sites:
      • Starting submission to T2s of test MC workflows with WMAgent.
      • T2_PL_Warsaw failing SAM tests, SAV:123647
      • T2_IN_TIFR failing SAM tests, SAV:123658
      • T2_BE_UCL failing SAM SRM tests, now OK, closed, SAV:123665
      • T2_DE_DESY failing SAM tests and transfers, SAV:123675

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • RAL: ALICE jobs are going into waiting status. The waiting jobs were killed, but others immediately changed their status from queued to waiting.
        • Lola: fixed wrong VObox config just before the meeting and now waiting to confirm result.
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping
      • Finishing validation of reprocessing applications
    • T0
      • Very low number of running jobs observed; fixed yesterday afternoon
      • Migration to CASTOR of one RAW file is pending (GGUS:74601)
    • T1 sites:
      • SARA: Problems with file access via protocol (GGUS:74416), downtime today for upgrading pool nodes

Sites / Services round table:

  • Michael/BNL - ntr
  • Onno/NL-T1 - addition: we hope the pool node upgrade fixed the problem, but we will monitor carefully over the weekend and would like to be notified immediately about any problems. This morning the frontier/squid server got stuck due to a full partition - fixed now.
  • Lisa/FNAL - ntr
  • Daniele/CNAF - ntr
  • Jhen-Wei/ASGC - ntr
  • Rolf/IN2P3 - ntr
  • Gareth/RAL - follow-up on transfer failures: found 100 corrupted disk files from the period before checksums were deployed. Will contact ATLAS via the UK contact.
  • Pavel/KIT - applied fix this morning for staging errors - no further errors since then
  • Kyle/OSG - (by email) ntr

AOB:

-- JamieShiers - 25-Aug-2011

Topic attachments
  • ggus-data.ppt (PowerPoint, 2498.0 K, r1, uploaded 2011-09-19 11:42 by MariaDimou)