Week of 110725

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Peter, Massimo, Maarten, Lukasz, Jacek, Ale, Dirk);remote().

Experiments round table:

  • ATLAS reports - Ale
    • CERN-PROD CASTORATLAS problem: GGUS:72890
      • GGUS ALARM page did not work; a TEAM ticket was sent, then escalated to ALARM, but CERN-PROD did not receive this alarm. We called the CERN computer operator directly.
      • CASTORATLAS was down from 5am till approx. 9am. Tier0 and DataExport were affected. From 9am till early afternoon Tier0 was OK, but DataExport/Import was still at 80% efficiency in reading and 50% in writing.
    • Discussion:
      • Massimo: root cause still being investigated, might be networking issue as connection to several machines was lost at the same time.
      • Dirk: need to follow up in addition on the GGUS alarm issues and on valid certificates for all shifters.
    • French Cloud status due to the IN2P3-CC LFC issue: ATLAS re-enabled all the French sites to import/export data via DDM. Tier0 data will not yet be subscribed to IN2P3-CC (no RAW/AOD/HIST etc). The cloud is still offline for production and analysis.

  • CMS reports - Peter
    • LHC / CMS detector
      • Relatively short runs over the week-end (CMS acquired a total of ~53 pb-1), with high peak luminosities (1.7×10^33 cm^-2 s^-1).
    • CERN / central services
      • CASTOR pool CMSPRODLOGS has only ~1% free space. The space on that pool is entirely managed by CMS; however, the situation is urgent and deletion is starting now (2PM)
    • T1 sites:
      • Very high CMS Readiness-Fraction for Tier-1s in the last 2 weeks (see attached t1readyrank.png)
      • Small issue with the PhEDEx FileStager agent at FNAL periodically turning red on the PhEDEx Component monitor (see Savannah:122411). The ticket is now closed; however, PhEDEx developers believe this should be investigated further.
    • T2 sites:
      • Nagios test-publication errors at Vienna Tier-2 (DPM based site) : these cause the Site Availability to be "undefined" and hence the Site Readiness to be red. Support is provided by CMS SAM and DPM experts (Savannah:122420) and a general ticket (GGUS:72841) was opened to SAM/Nagios
    • AOB:
      • Stephen Gowdy taking over as CMS CRC from tomorrow on

  • ALICE reports - Maarten
    • Nothing to report

Sites / Services round table:

  • Christian/NDGF - ntr
  • IN2P3-CC - up and running: still some issues with recovery of the DB, none of them affecting LHC VOs. A post-mortem is underway; a SIR will follow within this week.
  • Jon Bakken/FNAL - ntr
  • Gonzalo/PIC -ntr
  • Tiju/RAL - ntr
  • Giovanni/CNAF - ntr
  • Ron/NL-T1 - ntr
  • Michael/BNL - ntr
  • Jhen-wei/ASGC - ntr
  • Dimitri/KIT - ntr
  • Rob/OSG - service maintenance of the OSG BDII tomorrow at approx. 15:00 CET to apply BDII caching
  • CERN services
    • Jacek/DB: yesterday there was a problem with the offline DB copy of the ALICE PVSS data (a short period running with just one node instead of 3)

AOB: (MariaDZ) Attached (end of this page) the GGUS slides for the MB for use by the presenting wlcg-scod.

Tuesday:

Attendance: local(Lukasz, Alexei, Luca, Maarten, Stefan, Massimo, Steve, Stephen, Dirk);remote(Michael/BNL, Dimitri/KIT, Jon/FNAL, Kyle/OSG, Giovanni/CNAF, Ronald/NL-T1, Jeremy/Gridpp, Tiju/RAL, Ale, Marc/IN2P3, Gonzalo/PIC, ?/ASGC, Christian/NDGF ).

Experiments round table:

  • ATLAS reports -
    • CERN-PROD CASTORATLAS : Issue was observed again last evening. To try to reduce problems for users, some central activities were stopped at CERN:
      • Analysis queue automatically blacklisted since last Sunday evening
      • MC production was significantly reduced (CERN + associated multicloud sites)
      • Central deletion stopped
    • IN2P3-CC : Activity is restarting. Sites are currently being validated with a few production jobs (usual procedure)

  • CMS reports -
    • LHC / CMS detector
      • Still recovering
    • CERN / central services
      • CASTOR pool CMSPRODLOGS deletion ongoing (up to 1.7% free); many small files to remove
      • ALARM ticket opened due to CASTOR xrootd redirector problem GGUS:72944
    • T1 sites:
      • None
    • T2 sites:
      • Nagios test-publication errors at Vienna Tier-2 (DPM based site) : these cause the Site Availability to be "undefined" and hence the Site Readiness to be red. Support is provided by CMS SAM and DPM experts (Savannah:122420) and a general ticket (GGUS:72841) was opened to SAM/Nagios
    • AOB:
      • Are ALARM tickets working?

  • ALICE reports -
    • T0 site - Nothing to report
    • T1 sites - Nothing to report
    • T2 sites - Nothing to report

  • LHCb reports -
    • Experiment activities:
      • Production chain working on latest data.
    • T0
    • T1
      • IN2P3: Oracle problem (GGUS:72756). Yesterday the conddb user was unblocked at IN2P3. Monday evening some more failures with authentication problems.
        • Marc: can't see many jobs for LHCb at IN2P3 - will check whether the affected user now has access to conditions
      • RAL: One disk server broke on Monday and was replaced early this morning

Sites / Services round table:

  • Michael/BNL - ntr
  • Jon/FNAL - ntr
  • Kyle/OSG - reminder: in maintenance window for bdii
  • Giovanni/CNAF - ntr
  • Ronald/NL-T1 - NIKHEF had an overloaded NFS server which temporarily affected the CE - a reboot of the NFS server solved it
  • Jeremy/Gridpp - ntr
  • Tiju/RAL - ntr
  • Marc/IN2P3 - ntr
  • Christian/NDGF - ntr
  • Gonzalo/PIC - ntr
  • ASGC - scheduled downtime: Thu 4am - Fri 10am
  • Luca/CERN - investigating high streams latency (3h) for LHCb

AOB:

  • Regarding the GGUS incident with the ALARM on Sunday, here short response from the GGUS team: A required change in the system security configuration caused the temporarily limited access to the Alarm-mail interface. It is fixed now and tested since 10:30 UTC (Monday). A full SIR will be produced describing the issue and follow-up in more detail.
  • At the meeting CMS reported again problems with the alarm interface today - GGUS developers came back after the meeting with another issue found which they will fix asap. The SIR draft for this incident will need to be updated with the additional issue.

Wednesday

Attendance: local(Ale, Steve, Massimo, Lukasz, Edoardo, Maarten, Stephen, Stefan, Dirk);remote(Michael/BNL, Jon/FNAL, Kyle/OSG, Gonzalo/PIC, Jhen-Wei/ASGC, Giovanni/CNAF, Tiju/RAL, Onno/NL-t1, Marc/ IN2P3, Dimitri/KIT, Christian/NDGF).

Experiments round table:

  • ATLAS reports -
    • CERN-PROD CASTORATLAS : In agreement with Castor team, all Grid activities at CERN were resumed this morning. No major problem reported for the moment.
    • CERN-PROD : List of lost files provided and being processed (recover lost files from other sites whenever possible)
    • IN2P3-CC/FR cloud : All activities resumed. IN2P3-CC provided a list of affected LFC entries that will be used

  • CMS reports -
    • LHC / CMS detector
      • Still recovering
    • CERN / central services
      • Just opened a GGUS ticket for redirector problems at CERN.
    • T1 sites:
      • Nothing to report
    • T2 sites:
      • No change on GGUS:72841 which was opened to SAM/Nagios
    • AOB:
      • Nothing to report

  • ALICE reports -
    • General Information: the hardware of some of the machines which host the central services is being upgraded. This operation has been causing some instabilities since the beginning of this week.

  • LHCb reports -
    • Experiment activities:
      • Changed magnet polarity; new productions have been set up and are working on the new data
    • T0
    • T1
      • IN2P3: Oracle problem (GGUS:72756). Despite the conddb user being unblocked, several more authentication errors have been observed. The IN2P3 LFC is currently inactive and the redirect for conditions access to CERN has been re-activated.

Sites / Services round table:

  • Michael/BNL - ntr
  • Jon/FNAL - ntr
  • Kyle/OSG - ntr
  • Gonzalo/PIC - ntr
  • Jhen-Wei/ASGC - ntr
  • Giovanni/CNAF - ntr
  • Tiju/RAL - ntr
  • Onno/NL-t1 - ntr
  • Marc/ IN2P3 - ntr
  • Dimitri/KIT - ntr
  • Christian/NDGF - ntr
  • Edoardo/CERN - planned intervention on T1 links ongoing (NL-T1, RAL backup, CNAF; tomorrow: FNAL, TRIUMF). Independently of this: investigating packet loss on the link to FNAL with the carriers and the site. The problem started on Monday

AOB:

  • The GGUS alarm issue has been fixed after yesterday's meeting. A Service Incident Report for the issue has been prepared by the GGUS team and is available at https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents .
    • Agreed to send test alarms for CERN tomorrow morning at 9:30 to check proper functioning of the alarm chain.

Thursday

Attendance: local(Alessandro, Giuseppe, Jacek, Lukasz, Maarten, Massimo, Stefan, Stephen);remote(Christian, Giovanni, Gonzalo, Jhen-Wei, John, Jon, Kyle, Marc, Michael, Ronald).

Experiments round table:

  • ATLAS reports -
    • TAIWAN-LCG2 : Offline until tomorrow since site is in scheduled downtime

  • CMS reports -
    • LHC / CMS detector
      • Still recovering
        • Maarten: what is recovering from what?
        • Stephen: LHC is recovering from the shutdown...
        • Maarten: there was an 11-hour fill overnight, not too bad...
        • Stephen: OK, will adjust that text tomorrow
    • CERN / central services
      • EOS configuration issue caused problems overnight for CAF and other users: INC:055795
        • Massimo: also ATLAS were affected, the problem is understood
        • Massimo: NOTE - the piquet service currently covers CASTOR only, not EOS, so experiments should not open alarm tickets for EOS!
        • Stephen: CMS have plans to include EOS in the T0 workflow
        • Massimo/Giuseppe: that will first have to be discussed and agreed, we are not yet in a position to be able to treat EOS as a T0 resource, the support is best effort outside working hours
    • T1 sites:
      • FNAL hit by storms causing power outage at 1am their time. GCC down, FCC on generators. Expect to start recovery of GCC at 6:30am their time.
      • Are there issues at CNAF for our jobs? See jobs queuing instead of running.
        • Giovanni: no problems were seen, CMS jobs appear to be running OK
        • Stephen: will look further into the matter
    • T2 sites:
      • No change on GGUS:72841 which was opened to SAM/Nagios
    • AOB:
      • Nothing to report

  • ALICE reports -
    • Further upgrades of central service machines, going smoothly so far.

  • LHCb reports -
    • Issues at the sites and services
      • T0
      • T1
        • IN2P3: Oracle problem (GGUS:72756), access to the database has been tried by LHCb experts, no success yet. The password is going to be re-initialized today at IN2P3.
        • PIC: LHCb-Disk space token is almost full. Write access has been disabled. Will ask PIC to reallocate space from other space tokens (GGUS:73002)

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL
    • the power to the whole site was lost, which had not happened for several years; the FCC (Feynman Computing Center) backup generator worked OK, protecting the critical services as designed; the GCC (Grid Computing Center) is expected to have its power restored around 09:00 CDT
  • IN2P3 - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • CASTOR - nta
  • dashboards - ntr
  • databases
    • Streams replication for ATLAS to KIT is not working since an intervention was done on the ATLAS DB at KIT; under investigation

AOB:

  • the alarm tests were not done at 09:30 as suggested yesterday, but in the course of the afternoon: all OK for the 4 experiments

Friday

Attendance: local(Stefan, Steve, Lukasz, Ale, Jacek, Maarten, Jan, Dirk);remote(Kyle/OSG, Jon/FNAL, Ulf/NDGF, Jhen-Wei/ASGC, Marc/IN2P3, Stephen/CMS, Giovanni/CNAF, Onno/NL-T1, Gareth/RAL, Michael/BNL, Gonzalo/PIC).

Experiments round table:

  • ATLAS reports -
    • Activity in TAIWAN-LCG2 resumed
    • CASTOR CERN : Increased the number of stager threads
      • Jan: instability two weeks ago - developers found likely cause. Now increased stager threads.
    • The vobox (DDM transfer) at CERN was not reporting to SLS, but this did not affect the service itself
    • Many T2s are unstable (example : IFIC-LCG2 : Calibration site)

  • CMS reports -
    • LHC / CMS detector
      • 34pb-1 recorded overnight
      • ~90% efficiency for CMS.
    • CERN / central services
      • Nothing to report
    • T1 sites:
      • CNAF job issue yesterday was due to the Factory being at FNAL
    • T2 sites:
      • No change on GGUS:72841 which was opened to SAM/Nagios
    • AOB:
      • Nothing to report

  • ALICE reports -
    • Upgrades of central AliEn service machines finished OK yesterday.

  • LHCb reports -
    • Issues at the sites and services
      • T0
      • T1
        • IN2P3: Oracle problem (GGUS:72756), access is working after re-set of password. Ticket closed.

Sites / Services round table:

  • Kyle/OSG - ntr
  • Jon/FNAL - this morning a 30 min power outage; the situation is again normal - almost all services back
  • Ulf/NDGF - ntr
  • Jhen-Wei/ASGC - ntr
  • Marc/IN2P3 - ntr
  • Giovanni/CNAF - ntr
  • Onno/NL-T1 - on Tuesday at 10:00 CEST network maintenance which might affect the SARA SRM. Scheduled outage of 1h.
  • Gareth/RAL - ntr
  • Michael/BNL - encountered failure of network line cards: fts and lfc affected - should be back in 10-20 mins
  • Gonzalo/PIC - ntr

AOB:

-- JamieShiers - 27-Jun-2011

Topic attachments
Attachment: ggus-data.ppt (PowerPoint, 2412.5 K, 2011-07-12, MariaDimou) - GGUS ALARM drill slides up to 2011/07/12
Attachment: t1readyrank.png (PNG, 33.8 K, 2011-07-25, DirkDuellmann)
Topic revision: r17 - 2011-07-29 - DirkDuellmann