Week of 111024

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Alessandro, Alex, Iouri, Jacob, Jan, Luca, Maarten, Maria D, Maria G, Mattia, Raja);remote(Burt, Dimitri, Gareth, Giovanni, Gonzalo, Jhen-Wei, Michael, Onno, Rob, Ulf).

Experiments round table:

  • ATLAS reports -
    • Physics
      • data taking
      • HLT reconstruction in progress.
    • T0/Central services
      • High load on ADCR (instance 1) over the weekend
        • Luca: investigating the matter together with ATLAS DBAs and developers; the latest high load was related to large deletion activities and is gone since ~2 pm
    • T1 site
      • RAL: SRM down, Saturday morning (3 am UTC). GGUS:75597, converted to ALARM. The problem for ATLAS was fixed 30 min after the alarm ticket was sent. RAL had a problem with the DB behind CASTOR, but ATLAS was not affected from 6:30 on Saturday until 4 am on Sunday, when all transfers started to fail. RAL declared a downtime until 13:00 on Monday and was taken out of ATLAS activities. The problem was fixed at 20:45 on Sunday and the site is back in ATLAS activities, with a warning downtime until 13:00 today. The GGUS ticket is still open. Many thanks.
        • Maria D: the cause of an incident is not always reflected in a ticket's solution - when further details are foreseen, it may be desirable not to verify the ticket yet, such that it can be updated later
      • INFN-T1: LFC down, Saturday ~10 am UTC. ALARM GGUS:75601. The problem was fixed less than 1.5 h after the ticket was submitted. Many thanks.
        • Giovanni: issue was due to memory leak, the LFC had been up for 1 year; cured by a restart
      • IN2P3-CC: SRM down, Sunday 7 am, GGUS:75609, converted to ALARM. The site was offline until the problem was fixed at 18:00 (GGUS contains no details on what caused the problem). The GGUS ticket is still open. Many thanks.
      • INFN-T1: one stuck FTS job blocking new transfers. GGUS:75524. Now only one channel has a very long queue (a high number of submitted FTS jobs). The limit on the number of files that can be transferred simultaneously has been increased to 30. Solved.
        • Giovanni: FTS problem not understood, but it is gone for now
        • Alessandro: we investigated the matter with the help of Stefano Antonelli at CNAF and wonder whether the problem of long queues may have started after the T2D channels were added; if so, other T1s might be similarly affected. Could CNAF look into monitoring the queue lengths? The problematic files have been transferred in the meantime.
      • Transfers PIC->GRIF-LAL, GGUS:75429. We are still seeing the problem; GRIF-LAL also has problems with transfers from other sites. There has been no reply from GRIF-LAL.
    • T2 sites
      • ntr
    • Other business
      • the priority of an ALARM GGUS ticket was changed from top priority to less urgent when the ticket was updated.

  • CMS reports -
    • LHC / CMS detector
      • Data taking ongoing
    • CERN / central services
      • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information; still waiting for a solution
    • T0 and CAF:
      • cmst0: very busy processing data!
    • T1 sites:
      • [T1_TW_ASGC]: general file access problem, GGUS:75377, possibly a software problem?
      • [T1_FR_CCIN2P3]: overload in dCache affecting transfers from/to IN2P3. GGUS:75397. Ongoing.

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping at CERN
      • Reprocessing at T1 sites and T2 sites
    • T0
      • CERN : Running a lot more jobs now, but not fully clear if the fairshare system has been fixed
    • T1 sites:
      • IN2P3 (GGUS:75610): SRM problem on 23 Oct. Fixed after an alert from LHCb.
      • RAL : SE in unscheduled downtime.
    • T2 sites:

Sites / Services round table:

  • ASGC - nta
  • BNL - ntr
  • CNAF - nta
  • FNAL - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • ~1 hour ago 12 SARA pool nodes were restarted to pick up new host certificates
    • communication problem between dCache head node and pool nodes reported on Fri persists; dCache developers have been asked to look into it
  • OSG
    • 3h normal maintenance window tomorrow, should essentially be transparent
  • PIC - ntr
  • RAL
    • during the weekend all 4 CASTOR instances were down due to problems with the 2 Oracle RAC setups: both suffered nodes crashing and not automatically rebooting; the cause lay in corruption of a disk array area used for backups; the situation was corrected and a restart of all nodes then restored the service; for now the FTS channels and the batch system are throttled, but those limitations are expected to be lifted later today

  • CASTOR/EOS
    • eosatlas updated between 10 and 11 am; deletions then failed due to a misconfiguration of the newly added node; fixed, but looking further into unexpected errors that were observed by ATLAS (e.g. connection refused)
      • Iouri: ~10% of the transfers failed
  • dashboards - ntr
  • databases - nta
  • grid services - ntr

AOB: (MariaDZ) Drills for 9 real GGUS ALARMs, for tomorrow's MB, are attached at the end of this twiki page.

Tuesday:

Attendance: local();remote().

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • CERN-PROD_DATADISK transfer failures. GGUS:75632 verified: due to a scheduled replacement of the SRM server in CERN-PROD from 9-10 UTC. No errors since then; all previously failed transfers have been staged successfully in the meantime.
      • CERN-PROD_DATADISK -> JINR: 1 transfer is constantly failing with "Source file/user checksum mismatch". Savannah:88115; the file seems to be corrupt.
    • T1 sites
      • Taiwan-LCG2 production job failures in the morning: the host certificate of dpm25 had expired. GGUS:75662. The certificate was replaced at ~8 am, thanks!
    • T2 sites
      • ntr

  • CMS reports -
    • LHC / CMS detector
      • Data taking ongoing. Injection set-up until 16:30.
    • CERN / central services
      • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information; still waiting for a solution
    • T0 and CAF:
      • cmst0: very busy processing data!
    • T1 sites:
      • [T1_TW_ASGC]: general file access problem, GGUS:75377, possibly a software problem?
      • [T1_FR_CCIN2P3]: overload in dCache affecting transfers from/to IN2P3. GGUS:75397. Ongoing.
      • [T1_IT_CNAF]: Failing transfers from and to CNAF. GGUS:75675. There is a problem with the STORM storage backend; experts are having a look.
    • Others:
      • Yesterday we spotted a problem with downtime tracking in the SSB: a bug caused at least CNAF to be marked as being in unscheduled downtime for some days, which was not the case. Savannah:124240.

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping at CERN
      • Reprocessing at T1 sites and T2 sites
      • Starting to actively synchronise the files on LHCb Tier-1 SEs with what is expected from the catalogues (LFC, DIRAC)
    • T0
      • CERN : GGUS ticket submitted regarding fairshare (GGUS:75663) as requested yesterday.
    • T1 sites:
      • IN2P3 : "Scheduled" downtime to update Chimera server, but batch queues were not drained. There were also problems with LHCb jobs at IN2P3 even before the downtime officially started.
    • T2 sites:
      • GRIF in downtime until the end of this week. An update would be appreciated on how soon it will be back, as it is used for LHCb reprocessing.

Sites / Services round table:

AOB:

Wednesday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 14-Sep-2011

Topic attachments
Attachment: ggus-data.ppt (PowerPoint, 2389.5 K, uploaded 2011-10-24 17:23 by MariaDimou) - GGUS ALARM drill slides for the 2011/10/25 MB