Week of 111024

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Alessandro, Alex, Iouri, Jacob, Jan, Luca, Maarten, Maria D, Maria G, Mattia, Raja);remote(Burt, Dimitri, Gareth, Giovanni, Gonzalo, Jhen-Wei, Michael, Onno, Rob, Ulf).

Experiments round table:

  • ATLAS reports -
    • Physics
      • data taking
      • HLT reconstruction in progress.
    • T0/Central services
      • High load on ADCR (instance 1) over weekend
        • Luca: investigating the matter together with ATLAS DBAs and developers; the latest high load was related to large deletion activities and has been gone since ~2 pm
    • T1 site
      • RAL: SRM down, Saturday morning (3 am UTC). GGUS:75597, converted to ALARM. The problem for ATLAS was fixed 30 minutes after the alarm ticket was sent. RAL had a problem with the DB behind CASTOR, but ATLAS was not affected from 6:30 on Saturday until 4 am on Sunday, when all transfers started to fail. RAL declared a downtime until 13:00 on Monday and was taken out of ATLAS activities. The problem was fixed at 20:45 on Sunday and the site is back in ATLAS activities, with a warning downtime until 13:00 today. The GGUS ticket is still open. Many thanks.
        • Maria D: the cause of an incident is not always reflected in a ticket's solution - when further details are foreseen, it may be desirable not to verify the ticket yet, such that it can be updated later
      • INFN-T1: LFC down, Saturday ~10 am UTC. ALARM GGUS:75601. The problem was fixed less than 1.5 h after the ticket was submitted. Many thanks.
        • Giovanni: issue was due to memory leak, the LFC had been up for 1 year; cured by a restart
      • IN2P3-CC : srm down, Sunday 7 am, GGUS:75609, converted to ALARM. The site was offline until the the problem was fixed at 18:00 (there are no details on what caused the problem in GGUS). GGUS is still open. Many thanks.
      • INFN-T1: one stuck FTS job blocking new transfers. GGUS:75524. Now only one channel has a very long queue (a high number of FTS jobs submitted). The limit on the number of files that can be transferred simultaneously has been increased to 30. Solved.
        • Giovanni: FTS problem not understood, but it is gone for now
        • Alessandro: we investigated the matter with the help of Stefano Antonelli at CNAF and wonder whether the problem of long queues may have started after the T2D channels were added; if so, other T1s might be similarly affected. Could CNAF look into monitoring the queue lengths (a monitoring sketch follows this report)? The problematic files have been transferred in the meantime.
      • Transfers PIC->GRIF-LAL GGUS:75429. We are still seeing the problem; GRIF-LAL also has problems with transfers from other sites. There has been no reply from GRIF-LAL.
    • T2 sites
      • ntr
    • Other business
      • the priority of an ALARM GGUS ticket was changed from "top priority" to "less urgent" when the ticket was updated.
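
A minimal illustrative sketch, in Python, of the per-channel queue monitoring suggested in the INFN-T1 item above. The input format, file name and threshold are assumptions for illustration only, not actual FTS output; the real channel names and tooling at CNAF are not shown here.

    #!/usr/bin/env python
    # Minimal sketch: flag FTS channels with unusually long queues.
    # Assumes a hypothetical report file "channel_queues.txt" with lines like
    #   CNAF-SITEX 1250
    # (channel name and number of queued FTS jobs); this is not a real FTS format.

    QUEUE_THRESHOLD = 500  # hypothetical alert threshold (queued jobs per channel)

    def read_queue_lengths(path):
        """Parse 'channel count' lines into a dict, skipping malformed ones."""
        queues = {}
        with open(path) as report:
            for line in report:
                parts = line.split()
                if len(parts) == 2 and parts[1].isdigit():
                    queues[parts[0]] = int(parts[1])
        return queues

    def long_queues(queues, threshold=QUEUE_THRESHOLD):
        """Return (channel, length) pairs above the threshold, longest first."""
        busy = [(name, n) for name, n in queues.items() if n > threshold]
        return sorted(busy, key=lambda item: item[1], reverse=True)

    if __name__ == "__main__":
        for name, n in long_queues(read_queue_lengths("channel_queues.txt")):
            print("WARNING: channel %s has %d queued FTS jobs" % (name, n))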

  • CMS reports -
    • LHC / CMS detector
      • Data taking ongoing
    • CERN / central services
      • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information, still waiting for a solution
    • T0 and CAF:
      • cmst0 : very busy processing data !
    • T1 sites:
      • [T1_TW_ASGC]: general file access problem, GGUS:75377, possibly a software problem?
      • [T1_FR_CCIN2P3]: overload in dCache affecting transfers from/to IN2P3. GGUS:75397. Ongoing.

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping at CERN
      • Reprocessing at T1 sites and T2 sites
    • T0
      • CERN : Running a lot more jobs now, but not fully clear if the fairshare system has been fixed
    • T1 sites:
      • IN2P3 : (GGUS:75610) : SRM problem on 23 Oct. Fixed after an alert from LHCb.
      • RAL : SE in unscheduled downtime.
    • T2 sites:

Sites / Services round table:

  • ASGC - nta
  • BNL - ntr
  • CNAF - nta
  • FNAL - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • ~1 hour ago 12 SARA pool nodes were restarted to pick up new host certificates
    • communication problem between dCache head node and pool nodes reported on Fri persists, dCache developers have been asked to look into it
  • OSG
    • 3h normal maintenance window tomorrow, should essentially be transparent
  • PIC - ntr
  • RAL
    • during the weekend all 4 CASTOR instances were down due to problems with the 2 Oracle RAC setups: both suffered nodes crashing and not automatically rebooting; the cause lay in corruption of a disk array area used for backups; the situation was corrected and a restart of all nodes then restored the service; for now the FTS channels and the batch system are throttled, but those limitations are expected to be lifted later today

  • CASTOR/EOS
    • eosatlas updated between 10 and 11 am; deletions then failed due to misconfiguration of added node; fixed, but looking further into unexpected errors that were observed by ATLAS (e.g. connection refused)
      • Iouri: ~10% of the transfers failed
  • dashboards - ntr
  • databases - nta
  • grid services - ntr

AOB: (MariaDZ) Drills for the 9 real GGUS ALARMs for tomorrow's MB are attached at the end of this twiki page.

Tuesday:

Attendance: local(Alessandro, Alex, Iouri, Jan, Luca, Maarten, Maria D, Mattia, Pepe, Raja);remote(Burt, Gareth, Giovanni, Gonzalo, Jhen-Wei, Michael, Rob, Rolf, Ronald, Ulf, Xavier).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • CERN-PROD_DATADISK transfer failures. GGUS:75632 verified: due to a scheduled replacement of the SRM server in CERN_PROD from 9-10 UTC. No errors since then, all failed transfers staged successfully meanwhile.
      • CERN-PROD_DATADISK -> JINR: one transfer is constantly failing with "Source file/user checksum mismatch". Savannah:88115; the file seems to be corrupted (a checksum-check sketch follows this report).
    • T1 sites
      • Taiwan-LCG2 production job failures in the morning: the host certificate of dpm25 had expired. GGUS:75662. The host certificate was replaced at ~8 am, thanks!
      • Iouri: IN2P3 came back OK after their scheduled downtime
    • T2 sites
      • ntr
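
As an aside on the JINR checksum-mismatch failure above, a minimal sketch of how a suspect replica could be re-checked by hand, assuming Adler32 checksums (as commonly used for ATLAS data). The file path and catalogue value are placeholders, not the actual file from Savannah:88115.

    #!/usr/bin/env python
    # Minimal sketch: recompute the Adler32 checksum of a locally readable copy
    # of a file and compare it with the value recorded in the catalogue.
    # The path and expected checksum below are placeholders.

    import zlib

    def adler32_of(path, chunk_size=4 * 1024 * 1024):
        """Compute the Adler32 checksum of a file, reading it in chunks."""
        value = 1  # Adler32 starting value
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xffffffff)

    if __name__ == "__main__":
        local_copy = "suspect_file.root"   # placeholder path
        catalogue_checksum = "deadbeef"    # placeholder catalogue value
        computed = adler32_of(local_copy)
        if computed == catalogue_checksum.lower():
            print("checksum OK: %s" % computed)
        else:
            print("MISMATCH: file %s, catalogue %s" % (computed, catalogue_checksum))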

  • CMS reports -
    • LHC / CMS detector
      • Data taking ongoing. Injection set-up until 16:30.
    • CERN / central services
      • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information, still waiting for a solution
    • T0 and CAF:
      • cmst0 : very busy processing data !
    • T1 sites:
      • [T1_TW_ASGC]: general file access problem, GGUS:75377, possibly a software problem?
      • [T1_FR_CCIN2P3]: overload in dCache affecting transfers from/to IN2P3. GGUS:75397. Ongoing.
      • [T1_IT_CNAF]: Failing transfers from and to CNAF. GGUS:75675. There is a problem with the STORM storage backend. Experts having a look.
        • Giovanni: the problem is fixed, was due to a GPFS bug
    • Others:
      • Yesterday we spotted a problem with downtime tracking in the SSB: a bug caused at least CNAF to be marked as being in unscheduled downtime for some days, which was not the case. Savannah:124240.
        • Mattia: the Dashboard team are looking into it

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping at CERN
      • Reprocessing at T1 sites and T2 sites
      • Starting to actively synchronise files on LHCb Tier-1 SEs with expectation (LFC, DIRAC catalogs)
    • T0
      • CERN : GGUS ticket submitted regarding fairshare (GGUS:75663) as requested yesterday.
        • Alessandro: will the increase in the LHCb share affect other VOs?
        • Raja: possibly, but it will be reset tomorrow
    • T1 sites:
      • IN2P3 : "Scheduled" downtime to update Chimera server, but batch queues were not drained. There were also problems with LHCb jobs at IN2P3 even before the downtime officially started.
        • Rolf: LHCb had told us the announcement was sufficient for LHCb to prepare for the downtime; please open a ticket to discuss what LHCb expect to happen in such cases; AFAIK, nothing happened prior to the official start of the downtime, but if there was anything wrong, please open a ticket for that
    • T2 sites:
      • GRIF in downtime until end of this week. Update appreciated on how soon it will be back - used for LHCb reprocessing.

Sites / Services round table:

  • ASGC
    • no update on Condor-G issue affecting CREAM jobs (see FNAL report)
  • BNL - ntr
  • CNAF
    • at-risk downtime on Thu from 9 to 12 for ATLAS LFC upgrade to 1.8.0
  • FNAL
    • there are tickets open against a few T1 CREAM CEs that do not work OK for the Condor-G pilot factory at FNAL
      • Maarten: there is a known issue in Condor's use of CREAM CE leases, I will forward details offline; the upshot is that the sites may be unable to do anything about such errors
  • IN2P3
    • scheduled downtime was mainly to put more RAM into dCache servers; running jobs were canceled
    • also some CVMFS work was done
    • an issue with a CREAM CE for CMS led to an unscheduled downtime, now fixed: certificate DNs containing a '/' in the final CN did not work (a small sketch follows this round table)
  • KIT - ntr
  • NDGF
    • on Nov 4 starting at 18:00 for 10 h both OPN links to NDGF will be cut; also NLT1 may be affected by the intervention, but they would have a backup path via KIT; more news expected tomorrow
  • NLT1
    • 1 out of 2 tape libraries had a problem leading to reduced performance, now fixed
  • OSG
    • scheduled maintenance in progress, still 2.5 h to go
  • PIC - ntr
  • RAL
    • the site services have been ramped up slowly to full capacity without incident

  • CASTOR/EOS
    • after yesterday's EOS update for ATLAS a problem was discovered: clients effectively have more permissions than they should, allowing files to be stored outside designated areas and with unclear ownership; an emergency update was applied, but then rolled back after it had led to crashes; the developers are on it
  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW - ntr
  • grid services - ntr
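
Regarding the IN2P3 CREAM CE issue with a '/' inside the final CN (see above): a minimal sketch in Python, not the actual CREAM code, of why naively splitting an OpenSSL-style DN string on every '/' misparses such subjects, together with one more careful alternative. The DN shown is hypothetical.

    #!/usr/bin/env python
    # Minimal sketch (not the actual CREAM CE logic): splitting an OpenSSL-style
    # DN on every '/' breaks when the final CN itself contains a '/'.

    import re

    DN = "/DC=org/DC=example/OU=people/CN=John Doe/ID 1234"  # hypothetical DN

    def naive_split(dn):
        """Split on every '/': 'ID 1234' wrongly becomes its own component."""
        return [part for part in dn.split("/") if part]

    def careful_split(dn):
        """Split only on a '/' that starts a new 'attr=' component."""
        return [part for part in re.split(r"/(?=[A-Za-z]+=)", dn) if part]

    if __name__ == "__main__":
        print(naive_split(DN))    # [..., 'CN=John Doe', 'ID 1234']  <- CN broken in two
        print(careful_split(DN))  # [..., 'CN=John Doe/ID 1234']     <- CN kept intact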

AOB:

Wednesday

Attendance: local(Mattia, Luca, Yuri, Maarten, Jamie, Jarka, Jan, Alessandro, Steve, Pepe, Raja);remote(Burt, Giovanni, Gonzalo, Jeremy, Ulf, Michael, John, Rolf, Pavel, Kyle, ASGC, Ron).

Experiments round table:

  • ATLAS reports -
  • T0/Central services
    • CERN-PROD_DATADISK transfer failures yesterday: source error, failed to contact the remote SRM (srm-eosatlas.cern.ch). GGUS:75690 verified: the SRM daemon was running out of file descriptors (it has been running on new HW since yesterday). A hotfix was applied and the service restarted; it seems to be OK (a diagnostic sketch follows this report).
  • T1 sites
    • TRIUMF-LCG2_DATADISK -> INFN-MILANO-ATLASC transfer failures (timeouts). GGUS:75437 updated. Manual transfer checks are very slow: traffic from TRIUMF to MILANO (and to CERN lxplus) goes through the research network (1 Gb bandwidth), which is fully saturated.
  • T2 sites
    • Last minute: since 09:00 this morning no calibration data has reached the calibration centre at Great Lakes (US). The expert was informed but has not replied. As transfers keep failing, GGUS:75739 was submitted half an hour ago.
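
A small diagnostic sketch related to the file-descriptor exhaustion mentioned above: it compares a process's open descriptors against its soft limit via /proc (Linux only). The PID used is a placeholder; this is not how the SRM service itself was debugged.

    #!/usr/bin/env python
    # Minimal sketch (Linux only): compare a process's number of open file
    # descriptors with its soft "Max open files" limit, using /proc.

    import os

    def open_fds(pid):
        """Number of file descriptors currently open by the process."""
        return len(os.listdir("/proc/%d/fd" % pid))

    def max_open_files(pid):
        """Soft limit on open files, parsed from /proc/<pid>/limits."""
        with open("/proc/%d/limits" % pid) as limits:
            for line in limits:
                if line.startswith("Max open files"):
                    return int(line.split()[3])  # soft-limit column
        raise RuntimeError("Max open files limit not found")

    if __name__ == "__main__":
        pid = os.getpid()  # placeholder: inspect this script itself
        used, limit = open_fds(pid), max_open_files(pid)
        print("%d of %d file descriptors in use" % (used, limit))
        if used > 0.9 * limit:
            print("WARNING: close to the open-files limit")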


  • CMS reports -
  • LHC / CMS detector
    • Data taking (high PU runs ongoing). Till Sunday!

  • CERN / central services
    • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information, still waiting for a solution. Asked for more information.
  • T0 and CAF:
    • cmst0 : very busy processing data !
    • GGUS:75743 ALARM ticket: problems with the T1 transfer pool; it seems the transfer pool was overloaded and most transfers timed out in FTS. The ticket was closed as soon as we had checked that all was OK.
  • T1 sites:
    • [T1_TW_ASGC]: general file access problem, GGUS:75377, possibly a software problem? The site reports the file is OK in CASTOR. We will run on the file again to check whether the error is still reproducible, then decide how to proceed.
    • [T1_FR_CCIN2P3]: overload in dCache affecting transfers from/to IN2P3. GGUS:75397. Ongoing. The priority has been increased; this needs to be solved soon. There is a large backlog of transfers (90 TB) from IN2P3 to CMS Tier-2s, growing for two weeks (since 6 Oct), which is not being digested properly even though changes were made to the FTS configuration: see the attached plot.
      Also, it seems the proxy has expired and incoming transfers are failing as well; this needs to be looked at with high priority. [ Rolf: someone from the local CMS support is working on this; it is not clear how much progress has been made, but work is ongoing. ]
    • [T1_IT_CNAF]: failing transfers from and to CNAF. GGUS:75675. There is a problem with the STORM storage backend; experts are having a look. Closed.
    • [T1_IT_CNAF]: CREAM reporting dead jobs at CNAF T1 as REALLY-RUNNING. GGUS:75648.
  • Others:
    • On Monday we spotted a problem with downtime tracking in the SSB: a bug caused at least CNAF to be marked as being in unscheduled downtime for some days, which was not the case. Savannah:124240. [ Mattia: two overlapping downtimes for CNAF confused the monitoring; it will be corrected by hand this time and a longer-term solution is being worked on. ]




Sites / Services round table:

  • FNAL - ntr
  • CNAF - this morning CNAF joined the LHCONE network and the operation went smoothly
  • PIC - ntr
  • NDGF - update on the networking trouble: peering should take care of everything. The same fibre carries the links for NDGF, NL-T1 and KIT; the backup link via KIT (CNAF) will carry everyone's traffic during the 10 h downtime and will probably be heavily loaded. It is a concern that one fibre cut can take out 3 T1s at once; the networking people at these sites will have to be asked to know for sure. Data transfers will be somewhat slower while everything goes through CNAF.
  • BNL - ntr
  • RAL - ntr
  • KIT - ntr
  • IN2P3 - nta
  • ASGC - ntr
  • NL-T1 - ntr
  • GridPP - ntr
  • OSG - maintenance yesterday had no issues

  • CERN Grid: incident: the CMS VOMRS service is not syncing correctly to VOMS, which is holding up new users; reported on the IT SSB. Intervention on Monday: we would like to update the web servers for T0 export to SL5, which has been running on the T2 service for some months; the change is transparent, with easy rollback. Agent node upgrades will happen in November. Alessandro: fine for ATLAS.
  • CERN Storage - new EOS; will try to upgrade ATLAS this afternoon.

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 14-Sep-2011

Topic attachments
  • ggus-data.ppt (PowerPoint, 2389.5 K, 2011-10-24 17:23, MariaDimou): GGUS ALARM drill slides for the 2011/10/25 MB
  • pending-source-T1_FR.png (PNG, 66.0 K, 2011-10-26 14:34, MaartenLitmaath): pending PhEDEx transfers for source T1_FR
