Week of 121203

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(AndreaS, Stefano, Ignacio, Torre, Mike, Nicolò, Massimo, Maarten, MariaD); remote(Ulf/NDGF, Rolf/IN2P3, Lisa/FNAL, Michael/BNL, Kyle/OSG, Pavel/KIT, Tiju/RAL, Paolo/CNAF, Wei-Jen/ASGC).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • CASTOR: a problem transferring 2 files for T0 export because of a faulty disk server (ALARM GGUS:89107 submitted at 01:00 on Sunday). Massimo fixed the problem within one hour - many thanks. [Massimo adds that this is the third incident of this type, always in the same cluster of nodes, which are a bit old and will be retired (transparently) some time this week]
    • Central services
      • Overloaded PanDA server on Sunday. The problem was fixed by the Phys. DB person on shift, who killed the 'analyze index' session as it was taking too long and had caused the 'library cache lock'. Thanks.
    • T1
      • IN2P3: the long proxies problem (GGUS:88984) is fixed and IN2P3 is back in T0 export. We noticed a 30-minute glitch with failing transfers on Saturday (no tickets) and many transfers were failing for one hour on Sunday (GGUS:89111). [Nicolò comments that actually the problem with long proxies was not solved and the glitch mentioned by Torre is still due to it, as can be seen from the ticket; Rolf confirms]
      • RAL: Failures in input file staging in past 12 hrs, ticketed today. "File has no copy on tape and no diskcopies are accessible." Site looking into it. GGUS:89141
    • GGUS
      • When an ATLAS shifter clicks the back button in Firefox twice after submitting a GGUS ticket, a second ticket is created (we noticed this twice during the weekend; a generic sketch of how such double submissions can be avoided is given below). [Maria asks to put the ticket numbers in the agenda, so she can follow up]
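
The browser-back resubmission described above is a generic web-form issue, usually avoided with the Post/Redirect/Get pattern. Below is a minimal sketch of that pattern using a hypothetical Flask handler; it is purely illustrative and is not the actual GGUS code (create_ticket and the in-memory store are made-up placeholders).

    from flask import Flask, redirect, request, url_for

    app = Flask(__name__)
    TICKETS = {}  # toy in-memory store, standing in for the real ticketing backend

    def create_ticket(summary):
        """Placeholder for the real ticket-creation call."""
        ticket_id = len(TICKETS) + 1
        TICKETS[ticket_id] = summary
        return ticket_id

    @app.route('/submit', methods=['GET', 'POST'])
    def submit():
        if request.method == 'POST':
            tid = create_ticket(request.form['summary'])
            # Redirect after POST: the browser's history entry becomes the GET
            # below, so pressing "back" re-displays a page instead of re-posting
            # the form and creating a duplicate ticket.
            return redirect(url_for('show', ticket_id=tid))
        return '<form method="post"><input name="summary"><button>Submit</button></form>'

    @app.route('/ticket/<int:ticket_id>')
    def show(ticket_id):
        return 'Ticket %d: %s' % (ticket_id, TICKETS.get(ticket_id, 'unknown'))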

  • CMS reports -
    • LHC / CMS
      • Proton-proton physics data taking.
      • Special run with increased ZeroBias trigger rate on Sunday, for subdetector studies.
    • CERN / central services and T0
      • Some backlog on the Tier-0 farm from Sunday's special run; it has been reabsorbed.
      • VOMS: ANSPGrid CA not recognized by CMS VOMS server GGUS:88889 FIXED.
      • CEs: HammerCloud test jobs aborting on CERN CREAM CEs ce206 and ce208 with reason "the endpoint is blacklisted", IN PROGRESS GGUS:89124
    • Tier-1:
      • IN2P3: local contacts report that the dCache SRM patch to allow large proxies did not solve the issue (a quick proxy-size check is sketched after this report).
      • RAL: Files on CASTOR had been waiting for tape migration for several days: the files were on a disk server which was in intervention. GGUS:89004 FIXED
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, network investigation IN PROGRESS GGUS:88752
    • Tier-2:
    • Dashboard:
      • SiteStatusBoard: SUM SRM availability for OSG sites was N/A in the SSB (despite SAM tests running fine) due to a change in an output field with the deployment of SAM-Update-19 on Wed 28th. FIXED on Friday 30th. SAV:134216
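
The "large proxies" issue above concerns the delegated credential growing too big (for example through long VOMS attribute chains) for what the SRM front-end accepts. A quick local sanity check is simply to look at the proxy file size, as in the sketch below; the 10 kB threshold is purely illustrative and is not a documented dCache limit.

    import os

    # Conventional proxy location: $X509_USER_PROXY, falling back to /tmp/x509up_u<uid>
    proxy = os.environ.get('X509_USER_PROXY', '/tmp/x509up_u%d' % os.getuid())

    LIMIT = 10 * 1024  # illustrative threshold only, not a documented dCache limit

    size = os.stat(proxy).st_size
    print('proxy %s is %d bytes' % (proxy, size))
    if size > LIMIT:
        print('proxy is unusually large; long VOMS attribute chains can exceed'
              ' size limits in some SRM implementations')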

  • LHCb reports -
    • Reprocessing up to the last stop is finished. New databases for the last step (from 30 November) will be ready around Thursday this week.
    • Prompt reconstruction: CERN + 5 Tier2 sites
    • MC productions at T2s and T1s (until reprocessing restarts)
    • T0:
      • The upgrade of the LFC to EMI is planned.
    • T1:
      • RAL: Some problems accessing data. A disk server is down and needs an fsck before it can be put back into production (not before tomorrow). [Tiju announces that it was put back in production this morning]
      • CNAF: Installed the new disk storage (many thanks to CNAF people)
      • GRIDKA: transfer failures to/from several sites; no clue from the site yet. [Pavel adds that experts are working on the problem: they increased the debug level and do see some transfers failing (but not all of them)]

Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: about the long proxy SRM problem, we are waiting for more input from the developers. There are indications that it is induced by some T2 activity on the same SRM server. Investigations are continuing.
  • KIT:
    • We are having problems with the tape library, which is down.
    • We plan to migrate the whole cluster to SGE from 12/12 to 20/12. It will be a rolling intervention, so experiments will see a reduction in the available resources. During the intervention we will also have some CREAM downtimes and introduce new CEs.
  • NDGF:
    • Last weekend we had tape problems for ATLAS, now solved.
    • We will have a short network glitch tonight; it should not take more than the few minutes needed to reboot a router.
    • Question to ALICE: when is the current tape writing activity foreseen to end? Maarten: will find out and let you know after the meeting
  • RAL: Investigating a CASTOR ATLAS problem affecting transfers, it looks like it might be due to the database.
  • CERN batch and grid services: we received the EGI broadcast about the end of support for the gLite LFC at the end of January. We sent an email to ATLAS and LHCb to agree on a good time to upgrade to the EMI version.
  • CERN storage: ntr
  • Dashboards: ntr
  • GGUS: MariaD: last week's alarm tests completed at the T0 and all T1s, but the Savannah ticket was only closed this weekend because of problems some NGIs had with their ticketing system interfaces.
AOB:

Tuesday

Attendance: local(Massimo, Oliver, Stefano, MariaD, David, Ignacio);remote(Xavier, Gonzalo, Torre, WeiJen, Ulf, Lisa, Paolo, Ronald, Tiju, Rolf, Rob).

Experiments round table:

  • CMS reports -
    • LHC / CMS
      • Proton-proton physics data taking.
    • CERN / central services and T0
      • CEs: HammerCloud test jobs aborting on CERN CREAM CEs ce206 and ce208 with reason "the endpoint is blacklisted", IN PROGRESS GGUS:89124
      • CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573
      • Question: the content of the CASTOR pool CMSPRODLOGS has been archived, so the pool can be recycled. We can manually delete the files or just recycle the pool (easier). What should we do?
      • Question: We are preparing to delete old streamer files from 2010 from tape. Should we delete them ourselves or could we submit a list of files to be deleted (it would be a large number of files)?
    • Tier-1:
      • IN2P3: local contacts report that dCache SRM patch to allow large proxies did not solve the issue, another ticket was opened because of SAM failures in SUM, GGUS:89170
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, low level network investigation IN PROGRESS GGUS:88752
    • Tier-2:
      • NTR

  • LHCb reports -
    • Prompt reconstruction: CERN + 5 Tier2 sites
    • MC productions at T2s and T1s (until reprocessing will restart)
    • Had some problems because a partition on a VOBOX got full; hot-fixed. We plan to reshuffle the distribution of databases among the VOBOXes.
    • T0:
    • T1:
      • RAL: Some problems in the early morning with FTS transfers from CERN. It seemed to be a corruption in the FTS database; it was fixed quickly.
      • IN2P3: Lots of FTS transfer failures during the night (also IN2P3-->IN2P3 and IN2P3-->IN2P3-T2). The problem disappeared in the morning.
Sites / Services round table:
  • ASGC: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3:
    • The SRM is now monitored so that it can be restarted before it blocks completely (an illustrative watchdog sketch follows this round table).
    • Long outage (electrical intervention) next week (Mon-Wed). Submission to long queues (12h+) will be stopped on Sunday.
  • KIT: Tape drive fixed.
  • NDGF:
    • Info from ALICE received: thanks.
    • Similar problems with the SRM dCache as at IN2P3; it seems connected to long-lived VOMS proxies. Suggestion to have a phone call with IN2P3 to find out if the two issues have the same origin.
  • NLT1: ntr
  • PIC: ntr
  • RAL: The problem reported by LHCb (FTS) seems to have been a network glitch (it resolved by itself).
  • OSG: ntr
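
The monitor-and-restart approach mentioned by IN2P3 above can be illustrated with a small watchdog loop like the one below. This is only a sketch under assumptions: the endpoint, port, probe interval and restart command are hypothetical and do not describe the actual IN2P3 setup.

    import socket
    import subprocess
    import time

    HOST, PORT = 'srm.example.org', 8443                   # hypothetical SRM endpoint
    RESTART_CMD = ['sudo', 'systemctl', 'restart', 'srm']  # hypothetical restart command
    FAILS_BEFORE_RESTART = 3

    def srm_responds(timeout=10):
        """Return True if a TCP connection to the SRM port succeeds."""
        try:
            with socket.create_connection((HOST, PORT), timeout=timeout):
                return True
        except OSError:
            return False

    failures = 0
    while True:
        if srm_responds():
            failures = 0
        else:
            failures += 1
            if failures >= FAILS_BEFORE_RESTART:
                subprocess.call(RESTART_CMD)   # restart before the service blocks completely
                failures = 0
        time.sleep(60)                         # probe once a minute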

  • CASTOR/EOS:ntr
  • Central Services: ntr. The open ticket is being dealt with. The problem looks load-related.
  • Dashboard: Downtime tomorrow (9-10 UTC) for DB intervention

  • GGUS:
    • GGUS:84770 (Cagliari) and GGUS:84261 (Pisa) are TEAM tickets for the BIOMED VO, open since July and August and still in status 'Assigned'. Although we understand these are not WLCG problems, MariaD has received 3rd-level escalation notifications from the submitters because of inadequate support; would NGI_IT be able to mediate to get some supporters' attention?
    • If WLCG experiments use any sites in Greece, please note https://ggus.eu/pages/news_detail.php?ID=475 about the GRnet-to-GGUS interface direction being now broken.
    • NB! Sites/VOs with GGUS tickets that are not receiving adequate support: please report them to MariaD for this Thursday's WLCG Operations Coordination meeting.
AOB:

Wednesday

Attendance: local (AndreaV, Kate, Oliver, Torre, David, Massimo, Ignacio, MariaD, Maarten); remote (Michael/BNL, Ulf/NDGF, Gonzalo/PIC, Wei-Jen/ASGC, Dimitrios/RAL, Ron/NLT1, Pavel/KIT, Paolo/CNAF, Rolf/IN2P3, Lisa/FNAL, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • T0 LSF: ALARM ticket at 23:46 last night about slow LSF job dispatching. Promptly answered: a reconfiguration run at 23:00 to fix an issue was slow and reduced responsiveness to job submission. Queues refilled by 00:06. Looking into why the reconfiguration took so long. Ticket closed. GGUS:89202
    • ATLAS VO
      • Security ticket to ATLAS VOSupport: ATLAS was creating world-writable directories. In the PanDA pilot, one directory creation case (the job recovery directory) was missed when setting access to 770. Fixed in pre-production code (a generic sketch of such a fix follows this report). GGUS:89182
    • T1
      • Taiwan-LCG2: many job failures due to insufficient space on local disk. Resolved by site and maximum job workdir size limited in PanDA site config to avoid recurrence. Ticket closed. GGUS:89200
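
The 770 permission fix in the ATLAS VO item above amounts to creating every working directory with user and group access only, and nothing for "other". Below is a minimal, generic sketch of that in Python; it is not the actual PanDA pilot code, and the example directory name is made up.

    import os
    import stat

    def make_private_dir(path):
        """Create a directory with mode 770 (rwx for user and group, nothing for others)."""
        os.makedirs(path, exist_ok=True)
        # chmod explicitly: the mode passed to makedirs is filtered by the umask,
        # so a permissive umask could otherwise leave the directory world-writable.
        os.chmod(path, stat.S_IRWXU | stat.S_IRWXG)   # 0o770

    make_private_dir('/tmp/panda_job_recovery_example')   # made-up directory name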

  • CMS reports -
    • LHC / CMS
      • Last fills of proton-proton physics data taking
      • Thursday morning, the 25ns program starts with scrubbing and machine development
      • 25 ns physics fills are expected around Dec. 12th
    • CERN / central services and T0
      • SRM: srm-cms.cern.ch went down due to a chain of srmAbort and srmReleaseFiles requests; this unveiled a bug which is not planned to be corrected immediately because of the LHC schedule, GGUS:89186
      • CEs: HammerCloud test jobs aborting on CERN CREAM CEs ce206 and ce208 with reason "the endpoint is blacklisted", IN PROGRESS GGUS:89124
      • CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573
      • [Oliver: having problems with EOS getting full. Massimo: will look into capacity, we should be able to increase the CMS quota.]
    • Tier-1:
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, low level network investigation IN PROGRESS GGUS:88752
    • Tier-2:
      • NTR

  • LHCb reports -
    • Normal operation activities. Waiting for the new databases to start the last reprocessing step.
    • T0:
      • GGUS:88796 - Pilots failing; open since 23 November. The problem is not critical; no issue from an operational point of view.
    • T1:
      • NTR

Sites / Services round table:

  • Michael/BNL: ntr
  • Ulf/NDGF: upgraded dCache from 2.2 to 2.4 (via an intermediate upgrade to 2.3, because the direct upgrade failed)
  • Gonzalo/PIC: will have an intervention from the 10th to the 21st; power will need to be reduced during that time, so the site will run at 70% capacity
  • Wei-Jen/ASGC: ntr
  • Dimitrios/RAL: ntr
  • Ron/NLT1: network issue this morning, solved within 45 minutes, all ok now
  • Pavel/KIT: question to LHCb, is the planned SE/dcache intervention for January 15th-17th ok? [Andrea: will ask LHCb after the meeting]
  • Paolo/CNAF: ntr
  • Rolf/IN2P3: ntr
  • Lisa/FNAL: ntr
  • Rob/OSG: ntr

  • Massimo/Storage: ntr
  • David/Dashboard: ntr
  • Kate/Databases: ntr
  • Ignacio/Grid: ntr

AOB:

  • MariaD: tried to understand the issue with duplicate GGUS ticket creation reported by ATLAS, but could not reproduce it so far. Will continue looking into it.

Thursday

Attendance: local(David, Ignacio, Maarten, Massimo, Oliver, Stefano, Torre);remote(John, Lisa, Michael, Paolo, Rob, Rolf, Ronald, Ulf, Wei-Jen, WooJin).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • Requested a SAM/Nagios recalculation of ATLAS availability without SRM tests: the ATLAS SAM machine was not able to properly submit SRM (and OSG-SRM) tests for some hours on 2012-11-23. GGUS:89220
    • T1
      • T0 export transfers to FZK tape showed an increased failure rate for several hours early this morning, then the problem subsided. Ticket updated and closed; any recurrence will go to a new ticket. GGUS:88877
      • Taiwan-LCG2: recurrence of job failures due to insufficient space on local disk. New ticket opened, site is looking at it. GGUS:89253

  • CMS reports -
    • LHC / CMS
      • The 25 ns program (scrubbing and machine development) has started and will last at least until Dec. 12th
      • 25 ns physics fills are not expected before Dec. 12th
    • CERN / central services and T0
      • CEs: HammerCloud test jobs aborting on CERN CREAM CEs ce206 and ce208 with reason "the endpoint is blacklisted", IN PROGRESS GGUS:89124, last updated 2012-12-04
      • CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573, last updated 2012-11-30
    • Tier-1:
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, low level network investigation IN PROGRESS GGUS:88752, last updated 2012-12-05
    • Tier-2:
      • NTR

  • LHCb reports -
    • Normal operation activities. Waiting for the new databases to start the last reprocessing step.
    • Still some problems with the agents responsible for submitting pilots to the sites. Investigation is ongoing.
    • T0:
      • GGUS:87702. Ticket about 4 files in EOS with bad checksums. Ticket closed and verified.
      • GGUS:89190. Another batch of files found missing in EOS. The files were lost when a filesystem check was launched.
    • T1:
      • NL-T1: Bunch of failed FTS transfers just before lunch.
    • WooJin: Jan 15-17 OK for LHCb SE maintenance at KIT?
      • Stefano: probably OK, will discuss with colleagues
    • Maarten: are the EOS data incidents understood?
      • Massimo: the first incident was due to a bug, while the root cause of the second is not clear yet

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • KIT
    • checksum verification for all unavailable ATLAS files has finished, no problems found; the files will be moved to new disks; we expect to finish that some time next week
  • IN2P3 - ntr
  • NDGF - ntr
  • NLT1
    • Mon Dec 10 SARA downtime for tape back-end maintenance and dCache upgrade
  • OSG - ntr
  • RAL - ntr

  • dashboards - ntr
  • grid services - ntr
  • storage
    • EOS-CMS: 500 TB added
    • CASTOR-ATLAS: T0 capacity will fluctuate as old machines are replaced with new ones

AOB:

Friday

Attendance: local(Simone, Torre, Ignacio, Edi, Jan, Belinda, Eva, Maarten); remote(Xavier - KIT, Onno - NL-T1, Stefano - LHCb, Michael - BNL, Oliver - CMS, Kyle - OSG, Jeremy - GridPP, Felix - ASGC, Zeeshan - NDGF, Lisa - FNAL, Gareth - RAL)

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • SARA: T0 export failures ticketed at 15:30 yesterday, quick site response and resolution, "we were overloaded with requests from jobs from another cluster. This has been blocked now..." which solved the problem. GGUS:89289
      • Taiwan-LCG2: still seeing many job failures due to insufficient space on local disk, site has reduced job slots on small disk WNs to see if it helps. GGUS:89253

  • CMS reports -
    • LHC / CMS
      • The 25 ns program (scrubbing and machine development) has started and will last until Dec. 12th; 25 ns physics fills are not expected before Dec. 12th.
    • CERN / central services and T0
      • CEs: HammerCloud test jobs aborting on CERN CREAM CEs ce206 and ce208 with reason "the endpoint is blacklisted", IN PROGRESS GGUS:89124
        • last updated 2012-12-06: The GlueCEStateEstimatedResponseTime calculation is fixed. The issue with blacklisting is difficult to debug because it usually recovers by itself. I suspect that we have reached the limit of what CREAM can do, so we may need to add some more CEs. (A sketch for inspecting the published GlueCEStateEstimatedResponseTime follows this report.)
      • CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573, last updated 2012-11-30
    • Tier-1:
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, low level network investigation IN PROGRESS GGUS:88752, last updated 2012-12-06 via Footprints: "setting out for review"
    • Tier-2:
      • NTR
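
The GlueCEStateEstimatedResponseTime value mentioned in the CE item above is published by the CEs through the BDII. One way to see what a CE is currently publishing is an LDAP query against a BDII, as in the sketch below. Assumptions: the ldap3 Python module is available, lcg-bdii.cern.ch:2170 is used as the top-level BDII endpoint, and the CE host names are taken from GGUS:89124.

    from ldap3 import Connection, Server

    # Top-level BDII endpoint (assumed); a site BDII answers the same query on port 2170.
    server = Server('ldap://lcg-bdii.cern.ch:2170')
    conn = Connection(server, auto_bind=True)

    # Glue 1.3 CE entries for the two CEs mentioned in GGUS:89124
    conn.search(
        search_base='o=grid',
        search_filter='(&(objectClass=GlueCE)(|(GlueCEUniqueID=ce206.cern.ch*)'
                      '(GlueCEUniqueID=ce208.cern.ch*)))',
        attributes=['GlueCEUniqueID', 'GlueCEStateStatus',
                    'GlueCEStateEstimatedResponseTime'],
    )

    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateStatus,
              entry.GlueCEStateEstimatedResponseTime)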

  • LHCb reports -
    • Prompt reconstruction at CERN + attached T2s. Monte Carlo at T1s and T2s
    • The problem with the agents submitting pilots to the sites seems related to network issues between our VOBOXes.
    • T0:
    • T1:
      • GRIDKA: OK for the SE downtime of 15-17 January. Please fill in GOCDB and remind us a few days before.
      • NL-T1: The SE downtime of 10 December is OK. Do you plan to stop just the tape backend or also disk?
        • Onno: both tape and disk (two maintenances combined; the disk one will be shorter). Stefano: maybe LHCb will be able to restart as soon as disk is back in production.

Sites / Services round table:

  • KIT: next week the tape library will be stopped to add drives; a 2-hour intervention, the exact time is not yet defined.
  • RAL: Tuesday morning at risk for UPS work (low risk).
  • CERN: EOS updates in the next 10 days. Experiments will be contacted.

AOB:

  • Maarten: the audio conference (Alcatel) server should be upgraded soon, which should fix known issues we face in the daily calls from time to time.

-- JamieShiers - 18-Sep-2012
