Week of 120910

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local (AndreaV, Yuri, Mike, Jan, Steve, MariaDZ, Eva, Ignacio); remote (Michael/BNL, Rolf/IN2P3, Lisa/FNAL, Ulf/NDGF, Jhen-Wei/ASGC, Jeremy/GridPP, Gareth/RAL, Salvatore/CNAF, Rob/OSG, Dimitri/KIT; Joel/LHCb, Ian/CMS).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • PIC: errors seen in the deletion log point to SRM instabilities (Monday morning); awaiting site response. GGUS:85960. High load on the SRM due to the large number of transfers; pool data were rebalanced to improve performance (new pools are being added in production).
      • DE/FZK-LCG2_MCTAPE: ~14,000 staging failures on Sat.-Sun.: "SRMV2STAGER:StatusOfBringOnlineRequest:SRM_FAILURE". GGUS:85955.

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • On Friday evening a user analysis script attempted to remove the mount point of a FUSE-mounted EOS session with rm -r. Unfortunately the unmount had failed, causing the script to recursively remove all group-writable data in EOS. The act appears to have been accidental; thanks go to everyone involved in the weekend emergency recovery efforts. We are still trying to assess what was lost and how to avoid it in the future (a defensive check for this failure mode is sketched after this report).
    • Tier-1/2:
      • NTR
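
The failure mode above (a cleanup that falls through to the data behind a mount when the unmount silently fails) can be guarded against generically. Below is a minimal, hypothetical sketch, not the actual CMS script: it assumes an illustrative mount path and simply refuses to delete a directory that is still a mount point or is not empty.

```python
#!/usr/bin/env python3
# Minimal sketch (not the actual CMS script): refuse to delete a cleanup
# directory that is still a mount point or non-empty, so that a failed FUSE
# unmount cannot turn into a recursive deletion of data behind the mount.
import os
import sys

def safe_remove_mount_dir(path):
    """Remove a FUSE mount directory only after verifying it was unmounted."""
    if os.path.ismount(path):
        # The unmount failed or never happened: do NOT recurse into EOS.
        raise RuntimeError("%s is still a mount point, refusing to delete" % path)
    if os.listdir(path):
        # An unmounted mount point should be empty; anything else is suspicious.
        raise RuntimeError("%s is not empty, refusing to delete" % path)
    os.rmdir(path)  # safe: removes only the empty directory itself

if __name__ == "__main__":
    mount_dir = sys.argv[1] if len(sys.argv) > 1 else "/tmp/eos-mount"  # hypothetical path
    try:
        safe_remove_mount_dir(mount_dir)
    except RuntimeError as err:
        sys.exit("cleanup aborted: %s" % err)
```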

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Validation of reprocessing at T1s and selected T2s: OK
    • Validation of stripping will start soon
    • T0: problem with the CVMFS stratum 1, which was full [Steve: should be OK now]
    • T1: ntr

Sites / Services round table:

  • Michael/BNL: ntr
  • Rolf/IN2P3: ntr
  • Lisa/FNAL: ntr
  • Ulf/NDGF: issues with storage in Denmark over the weekend, being followed up
  • Jhen-Wei/ASGC: issue with tape hardware last Friday, fixed this morning by vendor intervention
  • Jeremy/GridPP: ntr
  • Gareth/RAL: scheduled Castor4 upgrade tomorrow morning, will only affect LHCb
    • [Steve: can you please check the stratum1 at RAL? Gareth: will do, thanks]
  • Salvatore/CNAF: storm issues for CMS during the weekend, possibly due to a filesystem bug, all is ok after a restart
  • Rob/OSG: ntr
  • Dimitri/KIT: ntr
  • Onno/NLT1 (via e-mail): the mass storage system at SARA is in maintenance due to a hardware issue, so files on tape are temporarily unavailable. See GOCDB for details.

  • Ignacio/Grid: ntr
  • Mike/Dashboard: ntr
  • Eva/Databases: latest Oracle patches have been deployed on test and integration DBs, now scheduling with the experiments the deployment on the production DBs
  • Jan/Storage:
    • CASTOR: series of database interventions during the technical stop:
      • 18.09.2012 10:00 ~2h CASTOR nameserver NAS migration + security patches
      • 19.09. 10:00 NAS intervention for all stagers. Should be "at risk" but might have 10min outage
      • 24.09. 10:00 CMS NAS migration, up to 2h downtime
      • 26.09. 10:00 LHCb NAS migration, up to 2h downtime
        • [Joel: could you do the LHCb intervention this week? We have a massive production upcoming. Jan: will check with database team (Kate) and let you know.]
      • 27.09. 10:00 ATLAS NAS migration, up to 2h downtime
      • rolling DB patches for ALICE and PUBLIC stagers, exact time to be announced
    • CASTOR - will update to CASTOR-2.1.13 on ATLAS,CMS,ALICE,LHCb during the technical stop, may combine with the above database interventions
    • EOSCMS - will update to EOS-0.2 on Monday 17.09., 30min down + 1h30 readonly

AOB:

  • CERN-PROD Batch: upgrading to the latest gLite-WN (3_2_12-1) and gLite-gLexec-WN (3_2_6-3) in preparation for the move to Argus. Nothing noted in preprod.

Tuesday

Attendance: local (AndreaV, Yuri, Maarten, Massimo, MariaDZ, Mike, MariaG, Ignacio); remote (Ulf/NDGF, Rolf/IN2P3, Jhen-Wei/ASGC, Lisa/FNAL, Paco/NLT1, Gareth/RAL, Xavier/KIT, Michael/BNL, Salvatore/CNAF; Ian/CMS, Joel/LHCb).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
    • T1
      • NIKHEF data transfer failures from T0. GGUS:85993 filed at 3:20 UTC, resolved at ~5:40. The disk server had lost the file system mount; it was remounted.
      • RAL-LCG2: some production job failures with "lost heartbeat" errors on the range of WNs testing hyper-threading. Jobs were killed because they exceeded the wall-time limits, which had not been correctly adjusted for the effect of hyper-threading. Should be fixed this morning. ELOG:39241. [Gareth/RAL: we reduced the overcommit so that fewer jobs are executed; this has fixed the issue.]

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Error reported at RAL: "srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/data/Run2012C/TauPlusX/AOD/PromptReco-v2/000/202/272/F8A13C61-1CFA-E111-966A-003048D3750A.root" with error: TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error 500 Command failed. : bad data was encountered
    • Tier-1/2:
      • NTR
    • Ian Fisk is CRC again

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Validation of reprocessing at T1s and selected T2s: OK
    • Validation of stripping will start soon
    • T0: problem with the CVMFS stratum 1, which was full. Problem understood.
    • T1: ntr
    • [Joel: noticed a mismatch of 600 TB in storage between different reports, this should be checked. Massimo: will follow this up offline.]
    • [Joel: agreed with IT-DB to move the DB intervention from next week to this week.]

Sites / Services round table:

  • Ulf/NDGF: ntr
  • Rolf/IN2P3: ntr
  • Jhen-Wei/ASGC: ntr
  • Lisa/FNAL: ntr
  • Paco/NLT1: followup on mass storage issue at SARA, the controller has been replaced and this fixed the problem
  • Gareth/RAL: had planned an upgrade of LHCb CASTOR for today, but it was cancelled yesterday. Apologies if there was a communication problem due to the use of different mailing lists in LHCb. [Joel: you can tweet the messages; from RAL this works fine for us.]
  • Xavier/KIT: ntr
  • Michael/BNL: ntr
  • Salvatore/CNAF: ntr
  • Rob/OSG (sent via email because of problems connecting via Alcatel, again):

  • Mike/Dashboard: ntr
  • Massimo/Storage: EOS upgrade for ALICE was done this morning
  • Ignacio/Grid: report by Steve on CVMFS (global, mainly LHCb but ATLAS also affected). All stratum 1s stopped replicating /cvmfs/lhcb.cern.ch at different times on Sunday after a bug surfaced in the replication scripts; the full file system at CERN was a consequence rather than a cause of this bug. At the time of writing (18:00 UTC Monday) RAL, CERN and BNL have taken action to reinstate replication; requests have just been sent to FNAL and TAIWAN (GGUS:85983 and GGUS:85982). Status appears fine in the LHCb SLS. The bug itself is trivial and will of course be addressed. While LHCb files were being transferred in excess, ATLAS replication was also affected, but no one noticed AFAIK (see the ATLAS SLS). [Joel: the issue seems to have been resolved.] A minimal replication-lag check is sketched below.
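
A silent replication stall like this can be caught by comparing repository revisions between the stratum 0 and the stratum 1s. The sketch below is a minimal, assumption-laden example, not the monitoring actually in place: it relies on the usual CVMFS convention that each server exposes its manifest at http://<host>/cvmfs/<repo>/.cvmfspublished with the revision on the line starting with 'S', and the hostnames are purely illustrative.

```python
#!/usr/bin/env python3
# Minimal sketch: compare the repository revision published by the stratum 0
# with the revision served by each stratum 1 and flag replicas that lag.
# Hostnames are illustrative; the manifest URL layout and the 'S<revision>'
# field follow the usual CVMFS convention (an assumption -- verify locally).
from urllib.request import urlopen

REPO = "lhcb.cern.ch"
STRATUM0 = "http://cvmfs-stratum-zero.example.org"   # hypothetical
STRATUM1S = [
    "http://cvmfs-stratum-one.example.org",          # hypothetical
    "http://cvmfs-replica.example.net",              # hypothetical
]

def revision(server):
    """Return the integer revision from a server's .cvmfspublished manifest."""
    url = "%s/cvmfs/%s/.cvmfspublished" % (server, REPO)
    manifest = urlopen(url, timeout=10).read().decode("ascii", "replace")
    for line in manifest.splitlines():
        if line.startswith("S"):
            return int(line[1:])
    raise ValueError("no revision field found in %s" % url)

if __name__ == "__main__":
    ref = revision(STRATUM0)
    for s1 in STRATUM1S:
        rev = revision(s1)
        status = "OK" if rev >= ref else "LAGGING (%d < %d)" % (rev, ref)
        print("%-45s revision %-8d %s" % (s1, rev, status))
```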

AOB: ntr

Wednesday

Attendance: local (AndreaV, Yuri, Ignacio, Jan, MariaDZ, Luca, Mike); remote (Ulf/NDGF, Ron/NLT1, Tiju/RAL, Dimitri/KIT, Gonzalo/PIC, Jhen-Wei/ASGC, Rob/OSG, Salvatore/CNAF; Joel/LHCb, Ian/CMS).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • HLT reprocessing jobs in progress.
    • T1
      • NIKHEF data transfer failures from T0 appeared again. GGUS:85993, GGUS:86035: the original problem (and its recurrence on another file system) has been solved by remounting the FS.
      • SARA -> TRIUMF transfer failures (DATADISK): "failed to contact on remote SRM". GGUS:86201.
      • PIC still many file deletion failures "Error reading token data: Connection reset by peer". GGUS:85960.

  • CMS reports -
    • LHC / CMS
      • Preparing for proton-ion test run. Tier-0 is configured for this data.
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • Transfer errors reported at T2_FR_IN2P3, which is unable to submit to FTS. The latest ticket is GGUS:86022; the issue was also reported on Sept 6.

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Validation of reprocessing at T1s and selected T2s: OK
    • Validation of stripping will start soon
    • T0:
      • CERN: cleaning TMPDIR on lxbatch (GGUS:86039) [Ignacio: batch experts are looking at the issue]
    • T1 :
      • CERN: FTS transfer problem between CERN and GRIDKA (GGUS:86025)
      • SARA: one node was not behaving properly and all LHCb pilots were aborted.
      • PIC: tape repack in progress

Sites / Services round table:

  • Ulf/NDGF: ntr
  • Ron/NLT1: ntr
  • Tiju/RAL: ntr
  • Dimitri/KIT: ntr
  • Gonzalo/PIC: working on SRM issue reported by ATLAS
  • Jhen-Wei/ASGC: ntr
  • Rob/OSG: ntr
  • Salvatore/CNAF: ntr

  • Jan/Storage: the list of 36k EOS CMS files that can be recovered (from last Friday's incident) will be sent around; tonight we observed another PB of data deleted from EOS CMS, but this looked like normal production operations
  • Luca/Database: ntr
  • Mike/ Dashboard: ntr
  • Ignacio/Grid: report from Ulrich
    • CERN-PROD Batch: Preproduction nodes have been reconfigured to use Argus for glExec instead of SCAS. Nothing noted in preprod so far. The plan is to move this configuration to prod on Thursday morning unless any issues are reported.

AOB: ntr

Thursday

Attendance: local (AndreaV, Yuri, Kate, Jan, Ignacio); remote (Saverio/CNAF, Gonzalo/PIC, Ulf/NDGF, Rolf/IN2P3, Woojin/KIT, Gareth/RAL, Jhen-Wei/ASGC, Paco/NLT1, Lisa/FNAL; Ian/CMS).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • ATLAS-AMI-CERN DB replica not available in SLS for ~2h. Savannah:132019 filed ~3pm: a misconfiguration problem with the Tomcat server. Fixed at ~5pm on Wed.
      • Short outage of ATLAS Elog server (connection errors) observed this morning at ~11:30.
    • T1
      • SARA -> TRIUMF transfer failures. GGUS:86021 solved: there was a network outage on the link between SARA and TRIUMF. Unfortunately, due to a known problem, when the link came back the router did not correctly reroute the traffic. Fixed. Was it the OPN link outage? [Paco: a TRIUMF note mentions that the outage is due to a known cause; then a router did not come up as expected, so it seems that the problem is on the TRIUMF side; but will check with colleagues at SARA anyway. Yuri: thanks, this gives a better understanding.]
      • SARA-MATRIX job failures with cvmfs errors on one WN. GGUS:86074 filed at 10am, fixed in ~40 min. A malfunctioning disk will be replaced; the system is temporarily out of production.
      • PIC GGUS:85960 update: SRM restarted this morning, but one pool (dc14_1) responsible for at least some transfer failures didn't restart properly - forced to restart.
      • BNL file transfer failures "FIRST_MARKER_TIMEOUT". GGUS:80861. Seems network-related; no failures found in BNL storage. Some tuning of the doors has been done.
      • IN2P3: some lost files found. GGUS:86059. Possibly related to the migration of ATLAS files from the old to the new servers using a dCache internal module. The files can be declared "lost", after which the recovery process can start. The complete list will be prepared.
      • [Yuri: one item was marked as ATLAS Internal but is worth mentioning: we are observing a strange overload of Frontier and Squid servers at several sites (Brookhaven, TRIUMF and RAL); the number of connection requests has increased dramatically (a factor of 10 in some cases) and it is not clear why. There is no ticket yet. Andrea: please open a ticket for better tracking if the issue persists.] A simple request-rate check is sketched after this report.
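
One simple way to quantify such a jump, offered here only as a hedged illustration (not necessarily what the sites did), is to count requests per hour in the Squid access log. The sketch assumes the default native access.log format, in which the first field is the epoch timestamp, and an illustrative log path.

```python
#!/usr/bin/env python3
# Minimal sketch: count Frontier/Squid requests per hour from access.log to
# make a sudden jump in request rate visible. Assumes the default native
# Squid log format (first field = epoch timestamp); adapt the parsing if a
# custom logformat is configured. The log path is illustrative.
import collections
import time

LOGFILE = "/var/log/squid/access.log"   # hypothetical path

def requests_per_hour(path):
    counts = collections.Counter()
    with open(path) as log:
        for line in log:
            try:
                ts = float(line.split(None, 1)[0])
            except (ValueError, IndexError):
                continue  # skip malformed or empty lines
            hour = time.strftime("%Y-%m-%d %H:00", time.localtime(ts))
            counts[hour] += 1
    return counts

if __name__ == "__main__":
    for hour, count in sorted(requests_per_hour(LOGFILE).items()):
        print("%s  %8d requests" % (hour, count))
```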

  • CMS reports -
    • LHC / CMS
      • Proton-ion test run was largely uneventful
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Validation of reprocessing at T1s and selected T2s: OK
    • Validation of stripping started
    • T0:
      • CERN: cleaning TMPDIR on lxbatch (GGUS:86039); CASTOR intervention
    • T1 :
      • CERN: FTS transfer problem between CERN and GRIDKA (GGUS:86025)

Sites / Services round table:

  • Saverio/CNAF: ntr
  • Gonzalo/PIC: still debugging the SRM overload issue for ATLAS
  • Ulf/NDGF: lost power on the Denmark pool again, affecting ALICE (but no one complained); not clear why this keeps happening
  • Rolf/IN2P3: announcement of an outage next week on Tue 18; batch will slow down the previous evening and will resume on Wed 19, but will only be fully operational on Thu 20; also affecting dCache and databases. After the outage, the new CVMFS configuration for LHCb and CMS will be operational.
  • Woojin/KIT: ntr
  • Gareth/RAL: ntr
  • Jhen-Wei/ASGC: ntr
  • Paco/NLT1: nta
  • Lisa/FNAL: ntr
  • Rob/OSG (via e-mail): OSG has nothing to report other than the continuing troubles with Alcatel [more details were provided to help debug the issue in INC:158097]

  • Jan/Storage: LHCb castor upgrade ongoing, after a database intervention
  • Kate/Databases:
    • many DB interventions for Castor next week as announced at a previous meeting
    • also planning Oracle upgrades for all experiments on the production DBs next week
  • Ignacio/Grid:
    • the glExec switch to Argus announced yesterday has been done on all production nodes
    • as mentioned in the SSB, there was a hiccup of many hypervisors this morning, causing a reboot of many VMs; this may explain the ATLAS Elog issue. [Yuri: how should we communicate about these issues? Ignacio: GGUS is the best way.]

AOB:

Friday

Attendance: local (AndreaV, Kate, Yuri, Jan, Mike, Ignacio); remote (Xavier/KIT, John/RAL, Gonzalo/PIC, Rob/OSG, Rolf/IN2P3, Lisa/FNAL, Salvatore/CNAF, Jhen-Wei/ASGC, Ulf/NDGF; Joel/LHCb, Ian/CMS).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • PIC GGUS:85960 update: modified some parameters in the dCache namespace domain in order to improve its performance.
      • BNL file transfer. GGUS:80861 solved: minor network issues were fixed; the storage pool has been fine after the adjustment.
      • [Yuri: the Frontier/Squid issue reported yesterday has now been understood as being fully an ATLAS issue. This was due to the misconfiguration of one Squid, and at the same time to the increase in ATLAS MC job submission.]

  • CMS reports -
    • LHC / CMS
      • Negotiating a reprocessing of the 2011 Heavy Ion data
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • We issued an unpin request at FNAL
      • All Tier-2 sites were contacted for a consistency check

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Validation of reprocessing at T1s and selected T2s: OK
    • Validation of stripping started
    • [Joel: we are now ready to start our massive reconstruction on Monday]
    • T0:
      • CERN: cleaning TMPDIR on lxbatch (GGUS:86039) [Ignacio: the experts are still working on this]
    • T1 :
      • IN2P3: problem of stalled jobs; investigating whether the remaining CPU time is not being properly reported.

Sites / Services round table:

  • Xavier/KIT: a downtime was scheduled for this coming Monday, but it has been cancelled; no new date has been fixed yet.
  • John/RAL: ntr
  • Gonzalo/PIC: still looking at the ATLAS deletion failures. [Question to ATLAS: are these massive deletions going to last much longer? Yuri: these are due to a centralized deletion policy that depends on each site and is tuned according to volume and number of files. We discussed with the Spanish cloud and we could decrease the deletion; there would be less space available at the site then, but PIC just added some storage space, so overall this could be ok. Gonzalo: ok, let's keep in contact offline.]
  • Rob/OSG: ntr [Andrea: thanks for the details about the Alcatel issue yesterday, this is still being followed up]
  • Rolf/IN2P3: ntr
  • Lisa/FNAL: ntr
  • Salvatore/CNAF: ntr
  • Jhen-Wei/ASGC: ntr
  • Ulf/NDGF: reminder, there will be a downtime on Monday
  • Onno/NLT1 (via email because of the ongoing problems with Alcatel): follow up on the SARA-Triumf connection issue - the problem was on the Triumf side.

  • Mike/Dashboard: ntr
  • Ignacio/Grid: nta
  • Jan/Storage:
    • update on CASTOR: the LHCb upgrade yesterday went OK
    • contacted CMS today to understand if their current usage of CASTOR pools is as expected. [Ian: if spikes are seen in CASTOR, these are probably due to the ongoing repopulation of EOS after last week's incident. Jan: there may also be some individual users contributing to the CASTOR pool usage however.]
    • reminder next Tuesday CASTOR upgrade (in parallel to DB interventions)
  • Kate/DB: reminder DB intervention on CASTOR next Tuesday, as well as many other DB interventions as previously discussed

AOB: none

-- JamieShiers - 02-Jul-2012
