Week of 110801

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Alessandro, Jan, Lukasz, Maarten, Rod, Stefan, Stephen); remote(Catalin, Giovanni, Jhen-Wei, Marc, Michael, Onno, Pepe, Tiju, Ulf).

Experiments round table:

  • ATLAS reports -
    • INFN-T1 : ALARM ticket GGUS:73054 (probably also Team GGUS:73068) : Cannot export data to INFN-T1. Site answer: 'the GPFS process in the StoRM BE failed; a restart fixed the problem.' Solved
    • TAIWAN-LCG2 : Team GGUS:73068 : Some jobs could not download input files from SE (timeout). Problem not observed later. Solved
    • BNL-OSG2 : Team GGUS:73066 : voms-proxy-init issue when pointing to BNL. Site answer: 'the VOMS_ATLAS (vomsatlas) process had crashed; a service restart seems to have fixed it.' Solved
    • StoRM problems at several sites seem to be caused by a recent upgrade to non-validated releases. The 'staged rollout' found problems at CNAF, QMUL, Milano, Technion and Weizmann. A 'staged' rollout should probably start with 1 site first; proper validation beforehand might be better.

  • CMS reports -
    • LHC / CMS detector
      • Recovering from LHC Cryo problem most of today
      • Record peak lumi yesterday: 2.03e33 cm^-2 s^-1
      • Logged about 22 pb^-1 during the weekend (45 pb^-1 on Friday)
    • CERN / central services
      • Unable to log in to lxplus for some minutes on Friday evening: INC:056242
      • Two files lost on EOS were rediscovered by a user (CMS had failed to recover them after being informed a week ago): INC:056376
      • Increase in "monitoring glitches" in SLS: INC:056362
        • Jan: that will happen e.g. when a disk server is inaccessible due to a network glitch; some monitoring alarms were seen and will be looked into (see the sketch after this report)
    • T1 sites:
      • Some transfer errors from FNAL to a T2 look like they are due to a busy SRM server there: Savannah:122551
    • T2 sites:
      • No change on GGUS:72841, which was opened against SAM/Nagios
    • AOB:
      • Nothing to report
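
  A note on the SLS "monitoring glitches" item above: the underlying issue is telling a short network blip apart from a real disk-server outage. Below is a minimal illustrative sketch in Python of probing a host over a grace period before raising an alarm; the host name, port and thresholds are made-up placeholders, not the actual SLS or CASTOR monitoring configuration.

    import socket
    import time

    GRACE_PERIOD = 120   # seconds a host may stay unreachable before alarming (illustrative)
    PROBE_INTERVAL = 20  # seconds between probes (illustrative)

    def reachable(host, port, timeout=5):
        """Single TCP-level probe; a transient network glitch makes this fail briefly."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def host_is_down(host, port):
        """Report a host as down only if it stays unreachable for the whole grace period."""
        deadline = time.time() + GRACE_PERIOD
        while time.time() < deadline:
            if reachable(host, port):
                return False          # it was just a glitch
            time.sleep(PROBE_INTERVAL)
        return True                   # persistently unreachable: raise a real alarm

    if __name__ == "__main__":
        # "diskserver042.example.org" is a made-up name, not a real CERN host.
        if host_is_down("diskserver042.example.org", 1094):
            print("ALARM: disk server unreachable beyond grace period")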

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • VOMS server incident: voms-proxy-init automatic failover to CERN worked as designed (see the sketch after this round table)
    • planning an upgrade to the next VOMS server version built by OSG (equivalent to the EMI version)
  • CNAF - nta
  • FNAL - ntr
  • IN2P3
  • NDGF
    • had a mysterious fault on Sunday: dCache lost contact with all pools, causing 1.5 h of flapping and instability. According to the graphs data flowed as usual, and no users seem to have noticed.
  • NLT1
    • reminder: short SARA SRM downtime tomorrow
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • CASTOR
    • today's transparent intervention to increase the number of threads in the ATLAS SRM did not work; will be rescheduled with ATLAS
    • tomorrow the castoratlas.cern.ch alias (for Xroot and RFIO access) will be changed to a load-balanced alias
  • dashboards - ntr
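
  On the BNL VOMS item above: voms-proxy-init can fail over because the client's vomses configuration lists several servers for the same VO and they are tried in turn. Below is a minimal illustrative sketch in Python of that client-side logic; the host names and ports are placeholders, not the actual ATLAS VOMS endpoints.

    import socket

    # Illustrative vomses-style entries (nickname, host, port); real deployments list
    # more than one server per VO precisely so that the client can fail over.
    VOMS_ENDPOINTS = [
        ("atlas-bnl",  "voms.example-bnl.org",  15001),   # placeholder host/port
        ("atlas-cern", "voms.example-cern.org", 15001),   # placeholder host/port
    ]

    def first_reachable(endpoints, timeout=5):
        """Return the first endpoint accepting a TCP connection, mimicking the
        order in which a client would try the configured VOMS servers."""
        for nick, host, port in endpoints:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return nick, host, port
            except OSError:
                continue                 # this server is down: try the next one
        return None

    if __name__ == "__main__":
        hit = first_reachable(VOMS_ENDPOINTS)
        print("would contact:", hit if hit else "no VOMS server reachable")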

AOB:

Tuesday:

Attendance: local(Alessandro, Eva, Gavin, Giuseppe, Jan, Jose, Lukasz, Maarten, Nilo, Rod, Stefan); remote(Catalin, Giovanni, Jeremy, Jhen-Wei, Kyle, Marc, Pepe, Ronald, Tiju, Ulf, Xavier).

Experiments round table:

  • ATLAS reports -
    • LYON LFC: loss of 3 days of registrations, not yet cleaned up.
    • Deletion service backlogs lead to LFC registrations remaining long after the SRM file has been removed, causing production failures.
      • A replica deletion rate of 3-6 Hz per T1 is needed in LFC bulk calls; only ~3 Hz is obtained at KIT. Investigation ongoing (see the sketch after this report).
    • To avoid flip-flopping of StoRM sites, the 2nd sys admin claim that the problem is fixed will be ignored; instead, wait for stable SAM tests. For some sites the 1st claim will be ignored too.
    • The Castor intervention was transparent.
    • The EOS analysis queue runs stably. Awaiting a pilot version capable of reading both Castor and EOS before retiring the Castor disk-only pools.
      • Some concern from off-grid users: will nsls-like access still work? What is the status of xrootd-fuse?
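
  The 3-6 Hz figure above is a deletion throughput: LFC replica entries removed per second of wall-clock time spent in bulk calls. Below is a minimal illustrative sketch in Python of measuring such a rate; delete_replicas_bulk() is a hypothetical stand-in for the real bulk LFC call used by the deletion service, not the actual API.

    import time

    def delete_replicas_bulk(surls):
        # Hypothetical stand-in for the real bulk LFC replica-deletion call.
        time.sleep(0.05 * len(surls))  # pretend each entry costs ~50 ms server side

    def measure_deletion_rate(surls, batch_size=200):
        """Delete replicas in batches and report the achieved rate in Hz,
        i.e. entries removed per second of wall-clock time."""
        start = time.time()
        deleted = 0
        for i in range(0, len(surls), batch_size):
            batch = surls[i:i + batch_size]
            delete_replicas_bulk(batch)
            deleted += len(batch)
        elapsed = time.time() - start
        rate = deleted / elapsed if elapsed > 0 else float("inf")
        print("deleted %d replicas in %.1f s -> %.1f Hz" % (deleted, elapsed, rate))
        return rate

    if __name__ == "__main__":
        # Fake SURLs; for scale, clearing a hypothetical backlog of 100000 stale
        # entries at 3 Hz would take roughly 100000 / 3 / 3600 ~ 9 hours per T1.
        fake_surls = ["srm://example.org/atlas/file%06d" % i for i in range(1000)]
        measure_deletion_rate(fake_surls)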

  • CMS reports -
    • LHC / CMS detector
      • Fills for physics
    • CERN / central services
      • INC:056376 fixed.
      • INC:056242 (occasional glitches with the whole nscd / XLDAP stack) and INC:056362 (a problem with the monitoring script talking to the stager DB) acknowledged.
    • T1 sites:
      • Some transfer errors from FNAL to CERN look like they are due to a busy SRM server there
    • T2 sites:
      • Nothing to report
    • AOB:
      • Nothing to report

  • ALICE reports -
    • job efficiency problems look solved, after various improvements in the central AliEn services: more and better machines, latest xrootd version, improved configuration of various services

  • LHCb reports -
    • Issues at the sites and services
      • T0
        • Castor: Pending transfers to the LHCb-Archive ST token are progressing; currently 10 files are left to be transferred (INC:055007)
      • T1
        • SARA: Downtime this morning passed unnoticed
        • SARA: Thanks for quick handling of disk space allocation (GGUS:73090)

Sites / Services round table:

  • ASGC - ntr
  • CNAF
    • 2 problems are observed when lcg-gt is used with StoRM:
      1. it can abort with a core dump when the service is not fully available: GGUS:73101
      2. it sends its own environment as part of the list of protocols, leading to lots of errors logged on the server (but the command works): GGUS:70066, GGUS:72464, BUG:84618 (see the sketch after this round table)
  • FNAL
    • CMS transfer problem may be due to network issue, looking into it
  • GridPP - ntr
  • IN2P3 - ntr
  • KIT
    • 2 downtimes Wed next week 08:00-16:00 CEST:
      1. tape system maintenance
      2. ATLAS SRM GPFS maintenance + dCache upgrade
  • NDGF - ntr
  • OSG - ntr
  • PIC
    • intervention ongoing for LHCb and CMS SW areas, currently read-only while new HW is being set up
  • RAL - ntr

  • CASTOR - ntr
  • dashboards
    • new Site Status Board version deployed for ATLAS, ALICE and LHCb; lots of new features and fixes
  • databases
    • AMI replication being restored
    • because of high disk failure rates the ADCR DB will be moved to the standby HW (on the Safe Host premises) tomorrow
  • grid services - ntr
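
  For reference on the CNAF lcg-gt items above: the command takes a SURL and an explicit protocol and returns a transfer URL, so the second problem amounts to the client leaking its environment into the protocol list. Below is a minimal illustrative sketch in Python of driving lcg-gt with a single explicit protocol; it assumes lcg-gt is installed on the path and uses a placeholder SURL.

    import subprocess

    def get_turl(surl, protocol="gsiftp"):
        """Call lcg-gt with one explicit protocol and return its raw output lines
        (the TURL and any tokens it prints). Raises on a non-zero exit code."""
        result = subprocess.run(
            ["lcg-gt", surl, protocol],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.splitlines()

    if __name__ == "__main__":
        # Placeholder SURL, not a real CNAF path.
        surl = "srm://storm.example.org/atlas/datafile"
        try:
            print(get_turl(surl))
        except (OSError, subprocess.CalledProcessError) as exc:
            print("lcg-gt failed:", exc)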

AOB:

Wednesday

Attendance: local(Edoardo, Eva, Jan, Jose, Lukasz, Maarten, Nilo, Rod, Stefan); remote(Catalin, Dimitri, Giovanni, Jhen-Wei, Kyle, Marc, Onno, Pepe, Tiju, Ulf, Xavier).

Experiments round table:

  • ATLAS reports -
    • ADCR DB downtime 17:00-18:00 leads to an ADC downtime 15:00-19:00 and grumpy users
      • leave DDM running this time
        • Rod: switching it off would imply many operations on many machines

  • CMS reports -
    • LHC / CMS detector
      • Fills for physics
    • CERN / central services
      • Tier-0 scale test ongoing, staging out into EOS
        • Jan: the rate is still limited by a suboptimal Kerberos configuration
    • T1 sites:
    • T2 sites:
    • AOB:
      • Nothing to report

  • LHCb reports -
    • Issues at the sites and services
      • T0
      • T1
        • RAL: a few files are missing on the storage element. Currently under investigation by the site
        • SARA: several files found to be outside the space token, currently checking how to resolve the issue (GGUS:73087)

Sites / Services round table:

  • ASGC
    • there was a power cut in the morning; most services are OK again, except the FTS, which is being worked on
  • CNAF - ntr
  • FNAL
    • CMS transfers OK after restart of various dCache components
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • CASTOR - ntr
  • dashboards - ntr
  • databases
    • CMS integration DB has been updated with the latest patches
  • networks - ntr

AOB:

Thursday

Attendance: local(Alessandro, Andrea V, Eva, Jan, Jose, Lukasz, Maarten, Nilo, Rod, Stefan); remote(Catalin, Giovanni, Jeremy, Jhen-Wei, John, Ronald, Ulf).

Experiments round table:

  • ATLAS reports -
    • ADCR downtime smooth.
      • Minimal user disruption thanks to leaving the services running and the DB downtime being short (~35 min?).
      • Panda was accidentally on during part of the downtime, but no problem.
    • Hint of a problem with the backup DB hardware
      • The number of holding Panda jobs is increasing due to slow DDM central catalogue operations
      • An operation that took <1 s up to 06:00 UTC now takes 45 s. How is the DB load, or should we look elsewhere?
      • Errors seen: ORA-00054: resource busy and acquire with NOWAIT specified (see the sketch after this report)
      • Rod: I opened a ticket now (GGUS:73192)
        • Eva/Nilo: we will have a look
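
  For context on the ORA-00054 errors above: Oracle raises that error when a statement asks for a row or table lock with NOWAIT and another session already holds it. Below is a minimal illustrative sketch in Python (cx_Oracle driver) of backing off and retrying on ORA-00054 instead of failing; the table, column and connection details are illustrative, not the actual ADCR/DDM schema or code.

    import time
    import cx_Oracle  # Oracle client driver (illustrative choice)

    ORA_RESOURCE_BUSY = 54  # ORA-00054: resource busy and acquire with NOWAIT specified

    def update_row_with_retry(conn, row_id, new_state, retries=5, backoff=2.0):
        """Lock one row with FOR UPDATE NOWAIT, update it and commit.
        On ORA-00054 (another session holds the lock) back off and retry
        instead of failing the whole operation."""
        for attempt in range(retries):
            cur = conn.cursor()
            try:
                cur.execute(
                    "SELECT id FROM subscriptions WHERE id = :i FOR UPDATE NOWAIT",
                    {"i": row_id})
                cur.execute(
                    "UPDATE subscriptions SET state = :s WHERE id = :i",
                    {"s": new_state, "i": row_id})
                conn.commit()
                return True
            except cx_Oracle.DatabaseError as exc:
                error, = exc.args
                if error.code != ORA_RESOURCE_BUSY:
                    raise                  # a different problem: let it propagate
                conn.rollback()
                time.sleep(backoff * (attempt + 1))
            finally:
                cur.close()
        return False                       # row still locked after all retries

    # Usage (illustrative):
    #   conn = cx_Oracle.connect("user/password@ADCR")
    #   update_row_with_retry(conn, 42, "DONE")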

  • CMS reports -
    • LHC / CMS detector
      • PS intervention to start at 09:00 for 8 hours
    • CERN / central services
      • Nothing to report
    • T1 sites:
      • Transfers from/to ASGC now OK
    • T2 sites:
      • Nothing to report
    • AOB:
      • VOMRS issue INC:057129 : at least 2 CMS users are facing problems, more specifically users with CA = /C=BE/OU=BEGRID/O=BELNET/CN=BEgrid CA . The error message says "User is currently suspended!" or "status of your certificate [...] has been changed from "Approved" to "Expired" due to following reason: "Certificate signed by /C=BE/O=BELNET/OU=BEGrid/CN=BEGrid CA/Email=gridca@belnet.be is not longer valid", while these users seem to have valid CMS registrations. It is not 100% clear whether this needs the intervention of a central VOMRS manager, but it would be nice if a central VOMRS manager provided some advice on the above ticket.
        • Maarten: VOMRS admin Steve is on holidays, back next week; please send me those certs so I can check whether they have been revoked
          • The certs look OK, but the BEGrid CRL had only been updated in the early afternoon: the problem would be explained if the old CRL had expired in the meantime; probably only the VOMRS admin can fix this now (see the sketch below)
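
  On the CRL hypothesis above: if the old BEGrid CRL expired before the new one was installed, every certificate from that CA looks invalid until the CRL is refreshed. Below is a minimal illustrative sketch in Python (cryptography library) of checking a CRL's validity window; the file path is a placeholder, and this is not the actual VOMRS or fetch-crl machinery.

    import datetime
    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    def crl_is_expired(path):
        """Load a PEM-encoded CRL and compare its nextUpdate field with 'now';
        an expired CRL makes all certificates of that CA appear invalid."""
        with open(path, "rb") as f:
            crl = x509.load_pem_x509_crl(f.read(), default_backend())
        print("lastUpdate:", crl.last_update, " nextUpdate:", crl.next_update)
        return crl.next_update < datetime.datetime.utcnow()

    if __name__ == "__main__":
        # Placeholder file name; CRLs normally live under /etc/grid-security/certificates/.
        if crl_is_expired("/etc/grid-security/certificates/BEGrid-example.r0"):
            print("CRL has expired: certificates from this CA will be rejected")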

  • ALICE reports -
    • T0 site
      • Strange distribution of jobs among some of the LCG-CEs. One of the CEs had 12k ALICE jobs. We will clean the node today.

  • LHCb reports -
    • Issues at the sites and services
      • T0
        • Problems with setting up the runtime environment on batch nodes have re-appeared. This seems to be specific to a certain type of worker node only (GGUS:73177)
      • T1
        • RAL: a few files are missing on the storage element. Currently under investigation by the site
        • SARA: several files found to be outside the space token, currently checking how to resolve the issue (GGUS:73087)

Sites / Services round table:

  • ASGC
    • grid services OK again since last night
  • CNAF
    • tape libraries not available, intervention expected to be finished later this afternoon; disk buffers are OK
  • FNAL - ntr
  • GridPP - ntr
  • NDGF
    • at-risk SRM downtime tomorrow: kernel updates in some pools in Sweden
  • NLT1
    • NIKHEF WNs lost external network access last evening, due to a switch disabling one of its interfaces; fixed ~22:30
  • RAL
    • downtime next Tue (Aug 9) 08:00-09:00 local time for site firewall reboot

  • CASTOR
    • castorlhcb was unavailable 08:xx-09:30 due to master head node losing its SW RAID; OK after reboot
  • DB
    • yesterday's ADCR move went fine; will look into issues reported by ATLAS
  • dashboards - ntr

AOB:

Friday

Attendance: local(Alessandro, Andrea V, Eva, Jan, Lukasz, Maarten, Nilo, Rod, Simone, Stefan); remote(Catalin, Gareth, Giovanni, Jhen-Wei, Jose, Kyle, Michael, Onno, Pepe, Ulf, Xavier).

Experiments round table:

  • ATLAS reports -
    • ADCR problem continued until ~21:00
      • Restarted prod and analysis around then
    • The DDM Central Catalogue (CC) eased contention on rows by reducing the number of threads (256 to 64); see the sketch after this report
      • it appears that the backup hardware cannot handle the same contention as the production HW
    • The problem did not appear right after the switch to the standby HW, but only at 08:00
      • this coincides with the backup of the standby HW (was this incremental or full?)
        • Eva/Nilo: so far there appears to be no relation between the backup and the problems experienced by ATLAS; the problem went away after the client application had been modified following advice from the ATLAS DBA
        • Simone: we do not know if the fix actually is OK; the problem is still not understood: we have not seen a change in the way the DB is being used for about 1 year, but now we have throttled the number of threads and thereby the load; ATLAS will further improve its own monitoring, to be better prepared when the problem comes back
    • Other side-effects
      • T0 export stopped due to an uncaught exception, made worse by the slow Central Catalogue
      • A DDM site service crashed with an out-of-memory error due to the increase in unfinished subscriptions
    • The problem was only seen through its effects. Some monitoring of standard DB query timings will be added.
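
  The 256-to-64 thread reduction above is a way of bounding the number of concurrent DB sessions, and hence the row contention, without changing the workload itself. Below is a minimal illustrative sketch in Python of that throttling pattern using a semaphore; the names and numbers are illustrative, not the actual DDM Central Catalogue code.

    import threading

    MAX_CATALOGUE_WORKERS = 64   # throttled down from 256, as in the report
    catalogue_slots = threading.BoundedSemaphore(MAX_CATALOGUE_WORKERS)

    def update_catalogue(task):
        pass  # stand-in for the real catalogue/DB update

    def process_task(task):
        """Each worker must hold a slot while talking to the catalogue, so at most
        MAX_CATALOGUE_WORKERS sessions can contend for the same DB rows."""
        with catalogue_slots:
            update_catalogue(task)

    if __name__ == "__main__":
        threads = [threading.Thread(target=process_task, args=(i,))
                   for i in range(256)]   # many pending tasks, few concurrent DB sessions
        for t in threads:
            t.start()
        for t in threads:
            t.join()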

  • CMS reports -
    • LHC / CMS detector
      • Fills for physics
    • CERN / central services
      • Yesterday afternoon an accidental submission of a huge number of jobs via the CMS WMS at CERN caused large backlogs. The situation was back to normal a few hours later. Noticed by Maarten; caused by CRAB-3 tests. The developers are trying to understand what happened.
      • Thanks to Maarten for acting on VOMRS issue INC:057129 . It looks like a problem with the update of the BEGrid CRL, the old CRL having expired in the meantime. It might require the intervention of the VOMRS admin (next week)
    • T1 sites:
      • Issue of the CMS WMS with KIT preventing job submission: it found no CE fulfilling the requirements close to the SE. It is not clear whether the problem is in the publication of the site information in the BDII or on the CMS WMS side. Site and CMS experts are looking into it (Savannah:122611)
        • Maarten: will have a look - the information published by the CEs and the SE looks OK, with a caveat; ticket updated with details
      • Plan to run the reprocessing of the data taken during the last few weeks at all Tier-1s over the weekend.
    • T2 sites:
      • Nothing to report
    • AOB:
      • Nothing to report

  • LHCb reports -
    • Issues at the sites and services
      • T0
        • Problems with setting up the runtime environment on batch nodes have re-appeared. This seems to be specific to a certain type of worker node only (GGUS:73177)
        • Castor SRM was down during the night and fixed swiftly this morning (GGUS:73213)
          • Jan: the SRM problem was induced by a network outage that triggered a bug in the logic that keeps track of the number of busy threads (see the sketch after this report)
        • Castor: it seems some files on the disk-only pool are corrupted. Out of the whole lot only 2 files are probably lost for good, as they had not yet been transferred to the archive (INC:0021564)
          • Jan: there may be others, we will provide a complete list
      • T1
        • SARA: several files were found to be outside the space token; a list of files will be provided by the site (GGUS:73087). Another ticket, GGUS:73196, in which a user cannot remove his files, is probably related to this issue
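
  On the busy-thread accounting bug Jan mentions above: this class of bug appears when a counter is incremented on entry but the matching decrement is skipped on an abnormal exit (e.g. during a network outage), so the service eventually believes all threads are busy. Below is a minimal illustrative sketch in Python of the failure mode and the usual fix (decrement in a finally block); it illustrates the bug class only and is not the actual CASTOR SRM code.

    import threading

    busy_threads = 0
    counter_lock = threading.Lock()

    def handle_request_leaky(work):
        """Buggy version: if work() raises, the decrement is skipped and
        busy_threads drifts upwards until no slot ever looks free."""
        global busy_threads
        with counter_lock:
            busy_threads += 1
        work()                       # an exception here leaks one 'busy' slot
        with counter_lock:
            busy_threads -= 1

    def handle_request_fixed(work):
        """Fixed version: the decrement always runs, even when work() raises."""
        global busy_threads
        with counter_lock:
            busy_threads += 1
        try:
            work()
        finally:
            with counter_lock:
                busy_threads -= 1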

Sites / Services round table:

  • ASGC
    • power maintenance tomorrow 01:00-04:00 UTC, should be transparent for grid services; at-risk downtime has been declared
  • BNL - ntr
  • CNAF
    • tape libraries back in operation since yesterday afternoon
  • FNAL
    • downtime Aug 25 for dCache HW upgrade, should take a few hours
  • KIT - ntr
  • NDGF - ?
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ?
  • RAL - ntr

  • CASTOR
    • updates of xrootd redirectors and EOS will be scheduled with the affected experiments
  • dashboards - ntr
  • databases
    • an extra node (#3) will be added to ADCR on Monday, should be transparent

AOB:

-- JamieShiers - 18-Jul-2011
