Week of 150706
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
- Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.
Monday
Attendance:
- local: Alessandro (ATLAS), Doug (ATLAS), Fernando (batch and grid), Kate (databases), Maarten (SCOD + ALICE), Xavi (storage)
- remote: Alexander (NLT1), Antonio (CNAF), Christoph (CMS), Gareth (RAL), Kyle (OSG), Michael (BNL), Pavel (KIT), Rolf (IN2P3), Sang-Un (KISTI), Ulf (NDGF), Vladimir (LHCb)
Experiments round table:
- ATLAS reports -
- ATLAS monitoring changes were made to Rucio to cure the lost-files issue reported last week; the early prognosis looks good.
- TRIUMF tape issue while consolidating SUSY space (400 TB): GGUS:114796
- TRIUMF errors for LCG2-MWTEST-DATATAPE: "no write pool" error, GGUS:114819
- Maarten: as that is an MW Readiness test instance, the issue is not urgent
- CMS reports -
- No problems with respect to the restart of collisions.
- On Friday afternoon the xrootd global redirector got hung again, and I logged in myself and restarted the processes. But once again it was noticed by the wrong person (see the probe sketch after the experiment reports). In the re-opened GGUS:114712 I asked for a copy of the logs so that we can understand the underlying problem, but I have no evidence that anyone has retrieved them.
- Xavi: will check with the experts
- Christoph: we can log in and restart the service, but have no access to the logs
- GGUS:114777 is still open, at least in part due to the holiday weekend in the US, but it is believed that some network problems have been fixed; the solution is still being verified.
- GGUS:114792 - problems with SSH key pair handling in OpenStack affecting the T0. Seems to have started around June 22; still iterating on this.
- It's still very hot outside! Also in many places inside...
- ALICE -
- CERN: some recently written CASTOR files could not be read back; being investigated
- Xavi: could be due to congestion in a few of the disk servers:
- there are 2 HW types, big vs. small
- their configurations may need to be checked
- any server might be temporarily unavailable due to disk-to-disk copies or draining
- LHCb reports -
- Data Processing, User and MC jobs on the grid.
- T0
- Data transfer problem from the Pit to CASTOR. Fixed: a misconfiguration of some new SRM nodes.
- T1
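Related to the CMS redirector report above: below is a minimal sketch of the kind of periodic functional probe that could flag such a hang to the service managers, instead of it being noticed by the wrong person. It assumes the xrootd client tools (xrdfs) are installed; the redirector alias, test path and timeout are placeholder values, not taken from the ticket.

```python
#!/usr/bin/env python3
"""Minimal functional probe for an xrootd redirector (illustrative sketch)."""
import subprocess
import sys

REDIRECTOR = "cms-xrd-global.cern.ch:1094"  # placeholder alias for the global redirector
TEST_PATH = "/store/test/some-known-file"   # placeholder: any file known to exist behind it
TIMEOUT_S = 60                              # a hung redirector typically never answers at all

def redirector_answers() -> bool:
    """Return True if the redirector answers a simple stat request within the timeout."""
    try:
        result = subprocess.run(
            ["xrdfs", REDIRECTOR, "stat", TEST_PATH],
            capture_output=True, text=True, timeout=TIMEOUT_S,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # no reply at all: the "hung" symptom described above
    except FileNotFoundError:
        sys.exit("xrdfs not found; the xrootd client tools are assumed to be installed")

if __name__ == "__main__":
    if redirector_answers():
        print(f"OK: {REDIRECTOR} responded")
    else:
        # In a real setup this is where the service managers would be alerted.
        print(f"ALERT: {REDIRECTOR} did not answer a stat within {TIMEOUT_S}s")
        sys.exit(1)
```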
Sites / Services round table:
- ASGC:
- BNL: ntr
- CNAF: ntr
- FNAL:
- GridPP:
- IN2P3: ntr
- JINR:
- KISTI:
- transparent OPN intervention tomorrow morning 5-7 CEST
- KIT: ntr
- NDGF: ntr
- NL-T1: ntr
- NRC-KI:
- OSG: ntr
- PIC:
- RAL: ntr
- TRIUMF:
- CERN batch and grid services: ntr
- CERN storage services: nta
- Databases: ntr
- GGUS:
- Grid Monitoring:
- MW Officer:
AOB:
Thursday
Attendance:
- local: Asa (ASGC), Cheng-Hsi (ASGC), Fernando (batch and grid), Hervé (storage), Maarten (SCOD + ALICE)
- remote: Chris (LHCb), Christoph (CMS), David (ATLAS), Di (TRIUMF), Elizabeth (OSG), Jeff (NLT1), Lisa (FNAL), Michael (BNL), Rolf (IN2P3), Sang-Un (KISTI), Thomas (KIT), Tiju (RAL), Ulf (NDGF), Zoltan (LHCb)
Experiments round table:
- ATLAS reports (raw view) -
- ALARM ticket to CERN: LSF instance not accessible. Caused a burst of lost T0 jobs each morning for the last two days. GGUS:114929
- Fernando: the WNs were rejected by LSF because of an incorrect configuration change; the issue has been fixed
- Problem reading data from NIKHEF: GGUS:114431
- Jeff: only functional tests seem to be affected, not production
- Some sites are failing large jobs due to a wrongly configured working directory size in AGIS; a campaign is underway to fix it.
- Still problems with wrong/duplicate FTS messages (being discussed in the concurrent FTS meeting)
- A massive consistency check is underway to look for lost files; T1s were asked to provide storage dumps.
- CMS reports -
- CMS magnet at full field since Monday evening
- Maarten: congratulations and best wishes!
- Problems with CASTOR at CERN
- The SLS page dropped to red values on Tuesday afternoon and recovered by Wednesday morning (without any action from the CMS side)
- SAM tests still had problems writing to CASTOR on Wednesday
- PhEDEx transfers also had issues writing
- GGUS:114906 opened on Wednesday morning: a DN mapping issue, fixed Wednesday afternoon
- Another issue seems to be with reading from CASTOR (GGUS:114938), likely a firewall issue
- Problems with analysis jobs getting executed at FNAL
- Suspicion: the site is overloaded with production, but it should still process at least some user jobs
- The problem is being checked from both the Glidein Factory side and the site side
- GGUS:114887 and GGUS:114888
- A symbolic link (to the site configuration) got lost in the CVMFS repository on Thursday morning around 6:00 (CERN time); a minimal check sketch follows after this report
- This caused basically all CMS jobs to fail, including SAM tests at CERN (GGUS:114935)
- The link was restored in CVMFS around 10:30
- CVMFS experts found a bug in CVMFS (to be fixed in the next release)
- ALARM Ticket: GGUS:114933
- Site readiness will be corrected
- Once more trouble with the global xrootd redirector at CERN
- Same problem as reported by Ken
- The service ran out of threads
- CMS xrootd experts suggest upgrading one machine behind the alias to xrootd 4.2 (from 4.1)
- GGUS:114712
- Hervé: the upgrade is planned for early next week; meanwhile the devs are examining a core dump of the current version
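Regarding the lost site-configuration symlink reported above: a check like the one sketched below can be run on a worker node to spot the problem quickly. It assumes the configuration is reached through a /cvmfs/cms.cern.ch/SITECONF/local symlink; the exact link location is an assumption for illustration, not quoted from the ticket.

```python
#!/usr/bin/env python3
"""Sanity check for the CMS site-configuration symlink in CVMFS (illustrative sketch)."""
from pathlib import Path
import sys

# Assumed location of the symlink that went missing; adjust to the real one.
SITECONF_LINK = Path("/cvmfs/cms.cern.ch/SITECONF/local")

def link_is_healthy(link: Path) -> bool:
    """Return True if the link exists, is a symlink, and points to an existing target."""
    if not link.is_symlink():
        print(f"PROBLEM: {link} is missing or not a symlink")
        return False
    target = link.resolve()
    if not target.exists():
        print(f"PROBLEM: {link} is dangling, points to {target}")
        return False
    print(f"OK: {link} -> {target}")
    return True

if __name__ == "__main__":
    sys.exit(0 if link_is_healthy(SITECONF_LINK) else 1)
```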
Sites / Services round table:
- ASGC:
- downtime next Mon 01:00-10:30 UTC for memory upgrades
- BNL: ntr
- CNAF:
- FNAL:
- the issue with CMS jobs is being investigated
- GridPP:
- IN2P3: ntr
- JINR:
- KISTI:
- one of the AliEn daemons on the ALICE VOBOX crashed and needed to be restarted manually
- KIT: ntr
- NDGF: ntr
- NL-T1:
- today's important openssl update may entail some downtime if services need to be updated urgently
- Breaking news: the update is NOT relevant for WLCG, hooray!
- NRC-KI:
- OSG:
- the accounting report for June looks OK except for 1 site that is being looked into
- PIC:
- RAL: ntr
- TRIUMF: ntr
- CERN batch and grid services: nta
- CERN storage services:
- New SRM nodes in production => we forgot to add them to the firewall exception list, now fixed
- Upgrade of the ATLAS xrootd redirector to 4.2 completed; the same process for CMS is ongoing
- Databases:
- GGUS:
- Grid Monitoring:
- MW Officer:
AOB: