Week of 120213

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local (AndreaV, Fernando, Jamie, Maarten, Luca, Eva, Mike, Steve, MariaDZ); remote (Michael/BNL, Gonzalo/PIC, Alexander/NLT1, Ulf/NDGF, Kyle/OSG, Burt/FNAL, Rolf/IN2P3, Dimitri/KIT, Tiju/RAL, Lorenzo/CNAF; Ian/CMS, Raja/LHCb).

Experiments round table:

  • ATLAS reports -
    • ddmadmin certificate is used for all central DDM operations (e.g. FTS job submission). The yearly proxy had been renewed 2 weeks ago, but the ATLAS VO membership had not, so transfers started failing and it took a few hours until the membership was renewed. The fix then took a while to propagate to services because of proxy delegation caches on the different FTSs, SEs, etc.; commands such as glite-delegation-* didn't seem to have any effect. Bottom line: transfers on Saturday had a very low efficiency. (A minimal proxy/VOMS lifetime check is sketched at the end of this report.)
      • RAL had to roll out the grid-mapfiles manually after the incident - missing entry for ddmadmin. GGUS:79137
    • FZK staging and transfer errors, due to high load and a full disk. GGUS:79145
    • Best effort AMOD for the next 2 weeks
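    • Note on the proxy/membership failure mode above: the proxy and its VOMS attribute certificate (the ATLAS membership) expire independently, so watching only the proxy lifetime misses exactly this case. The sketch below is a minimal illustration only, not part of the DDM tooling; it assumes a standard grid UI with voms-proxy-info on the PATH and uses an arbitrary 48-hour warning threshold.

      # Minimal sketch (assumes voms-proxy-info from a standard grid UI is available):
      # compare the remaining lifetime of the proxy with that of its VOMS attribute
      # certificate, so an expiring VO membership is flagged even while the plain
      # proxy is still valid.
      import subprocess
      import sys

      def seconds_left(option):
          """Return the seconds reported by 'voms-proxy-info <option>', or None."""
          try:
              out = subprocess.check_output(["voms-proxy-info", option]).decode()
              return int(out.strip().split()[0])
          except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
              return None

      WARN = 48 * 3600  # warn below 48 hours (arbitrary threshold)

      proxy_left = seconds_left("--timeleft")   # lifetime of the proxy itself
      ac_left = seconds_left("--actimeleft")    # lifetime of the VOMS attributes

      if proxy_left is None or ac_left is None:
          sys.exit("could not query proxy lifetimes (is a VOMS proxy present?)")

      print("proxy: %dh left, VOMS AC: %dh left" % (proxy_left // 3600, ac_left // 3600))

      if min(proxy_left, ac_left) < WARN:
          sys.exit("WARNING: renew the proxy and/or VOMS attributes before they expire")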

  • CMS reports -
    • Tier-0 Services:
      • First Midweek global run will happen this Wednesday. Getting data to the Tier-0 will be considered a success. The Tier-0 will run with the latest CMSSW release (release 5).
    • Tier-1s:
      • Problem with BlockDownload verify agents at IN2P3. GGUS:79143
    • Tier-2s: NTR
    • CRC on duty: Ian Fisk (Jose Hernandez starting tomorrow)

  • LHCb reports -
    • Experiment activities: User analysis
    • T1
      • IN2P3 : Job problems since last Friday, caused by the LFC at IN2P3; jobs recovered after the LFC there was rebooted. Currently a problem with ghost jobs at the IN2P3 CREAM CE (GGUS:79164). Also a possibly related problem with LHCb pilots at IN2P3 not picking up jobs from the DIRAC task queue - this is currently being analysed.
    • Other information
      • Need to avoid having two Tier-1 sites simultaneously down if possible, at least for scheduled downtimes.
        • Note : 24 hours is the minimum notice for declaring a scheduled downtime.
      • Need CVMFS at GridKa (asap ...)
      • LHCb sees problems with proxy delegation with the EMI WMS at PIC

Sites / Services round table:

  • Michael/BNL: ntr
  • Gonzalo/PIC: ntr
  • Alexander/NLT1: one file server crashed because of a broken CPU; there will be a short at-risk downtime tomorrow to replace the CPU [Raja: is this the one affecting LHCb? Alexander: yes]
  • Ulf/NDGF: ntr
  • Kyle/OSG: any news about the GGUS:78844 issue? [MariaDZ: this is being followed up, progress will be documented in the ticket.]
  • Burt/FNAL: ntr
  • Rolf/IN2P3:
    • problem since Friday, as reported by LHCb: the switch from one batch system to another failed; waiting for an intervention by Oracle, otherwise will switch back
    • about the outage announcement policy discussed by LHCb: note that we will not be able to cancel outages if experiments ask only 24h in advance, because some interventions require external vendors [Raja/Maarten: ok, major interventions are announced much more than 24h in advance anyway]
  • Dimitri/KIT: about CVMFS for LHCb, this has been accepted and will be installed [Raja: any time estimate? Dimitri: not yet, will give an update tomorrow]
  • Tiju/RAL: reminder, intervention tomorrow from 8 to 1 on the batch system as well as Oracle
  • Lorenzo/CNAF: ntr

  • Eva/Databases: transparent intervention ongoing on storage for CMS online/offline and LCG
  • Luca/Storage: EOSCMS was read-only for 30 minutes; this is a known issue, the software version will be upgraded tomorrow
  • Steve/Grid: new gLite release for the AFS UI will be rolled out tomorrow, changing the new_3.2 link from 3.2.10 to 3.2.11 (a minimal symlink-switch sketch follows this list)
  • Mike/Dashboard: switched to new interface for WLCG availability plots, noticed some inconsistencies for ATLAS (3 other VOs are ok), under investigation
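  • A minimal sketch of the kind of symlink switch mentioned in the AFS UI item above (paths and version names are illustrative assumptions, not the actual AFS UI layout):

      # Sketch only: repoint a version symlink (e.g. new_3.2 -> 3.2.11) by creating
      # a temporary link and renaming it over the old one; rename() replaces the
      # old link in a single step on POSIX filesystems (behaviour on AFS may differ).
      import os

      def switch_link(link_path, new_target):
          tmp = link_path + ".tmp"
          if os.path.lexists(tmp):
              os.remove(tmp)
          os.symlink(new_target, tmp)
          os.replace(tmp, link_path)   # Python 3.3+; os.rename also works on POSIX

      # hypothetical layout: .../ui/3.2.10, .../ui/3.2.11 and a link named new_3.2
      switch_link("/some/ui/area/new_3.2", "3.2.11")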

AOB: none

Tuesday

Attendance: local (AndreaV, Eva, Mike, Steve, Jamie, MariaDZ, Luca); remote (Ulf/NDGF, Michael/BNL, Xavier/KIT, Ronald/NLT1, Gonzalo/PIC, Lisa/FNAL, Jeremy/GridPP, Tiju/RAL, Rolf/IN2P3, Rob/OSG; Stephane/ATLAS, Ian/CMS, Jose/CMS, Raja/LHCb).

Experiments round table:

  • ATLAS reports -
    • LFC Migration RAL->CERN : Progressing as expected. UK cloud is offline.
    • Writing on TAPE at FZK : ATLASMCTAPE was being filled quickly (15 TB in < 24 hours) because of MC production. Since this morning, ATLASMCTAPE occupancy has stabilized with a few TB free, which means that writing to TAPE is back in production. [Xavier/KIT: a tape library broke down yesterday evening and was fixed by external experts between 10am and noon.]
    • T1s contacted through the ATLAS T1 mailing list to request the same FTS privileges for the ATLAS backup certificate as for the production one. Already done for CA/TW/US/FR. Implementation expected within a week.

  • CMS reports -
    • CRC on duty: Jose Hernandez for one week
    • Tier-0 Services:
      • First Midweek global run tomorrow 9h30 till Thurs 21h00
    • Tier-1s:
      • Problem with JobRobot monitoring jobs at IN2P3. GGUS:79162

  • LHCb reports -
    • Experiment activities: User analysis
    • T1
      • RAL : Possible corrupted file. RAL Internal ticket opened.
      • [Raja: any news about CVMFS at Gridka? Xavier: not yet.]
    • Other information
      • Proxy delegation bug on EMI WMS. LHCb would like to request that if possible not all WMS-es are upgraded until the bug is fixed. [Maarten: Dirac developers agreed to implement a workaround for this: when this is done, this issue will go away. A fix in WMS is also being pursued but will take a few weeks. The WMS at CERN can be kept a bit longer but we hope this will not be needed. Raja: thanks, can you follow this up with the Dirac developers? Maarten: yes, send them an email and copy me in CC.]

Sites / Services round table:

  • Ulf/NDGF: ntr
  • Michael/BNL: ntr
  • Xavier/KIT: nta
  • Ronald/NLT1: ntr
  • Gonzalo/PIC: ntr
  • Lisa/FNAL: ntr
  • Jeremy/GridPP: ntr
  • Tiju/RAL:
    • upgraded nameserver, but there were problems in communicating this to the tape system, so this is at risk
    • updated batch system
    • there was a power glitch on site; some machines rebooted
  • Rolf/IN2P3: again a crash of the batch system, and jobs were lost. Asked Oracle again to fix this within 24h, otherwise the system will be rolled back. The cause was identified as memory exhaustion (though it is not clear why this happened).
  • Rob/OSG: ntr

  • Steve/Grid: ntr
  • Eva/Databases:
    • applied security patches on LHCb offline yesterday
    • the transparent hardware intervention announced for yesterday was completed today
    • LHCb online will be moved to new hardware on Thursday
  • Mike/Dashboard: ntr
  • Luca/Storage:
    • updated EOSCMS today to the latest version
    • decommissioned the ALICE CASTOR default pool this morning
    • the LHCb default pool was overloaded; nothing can be done other than asking LHCb users to reduce the load. This is also a problem for SRM, because LHCb default is used for the SRM probe - can the SRM probe be moved to another LHCb pool? [Raja: will follow this up with the experts.]

AOB:

  • MariaDZ/GGUS: ticket GGUS:77157 about BDII has been open for three months; can someone from the Grid team have a look at it? Steve: will forward this to the experts in the team.

Wednesday

Attendance: local (AndreaV, Edoardo, Mike, LucaC, Steve, MariaDZ, LucaM); remote (Gonzalo/PIC, Ulf/NDGF, Burt/FNAL, Tiju/RAL, Michael/BNL, Pavel/KIT, Stefano/CNAF, Rolf/IN2P3, Jhen-Wei/ASGC, Rob/OSG; Raja/LHCb, Alexei/ATLAS).

Experiments round table:

  • ATLAS reports -
    • LFC Migration RAL->CERN : finished; verification is in progress, UK cloud will be set online later today
    • Scheduled (14:30-16:00) ATLAS INTR database intervention

  • LHCb reports -
    • Experiment activities: User analysis
    • T0
      • lhcbDefault overload : users were informed and the service is believed to be back to normal
      • The move of the srm-lhcb probe to LHCb-Disk is welcome
    • T1
      • RAL : Corrupted file. LHCb data management informed. Also problems with the batch server (internal ticket opened).
    • Other information
      • GridKa : CVMFS asap. [Pavel: CVMFS is still in test, will be deployed in a few weeks]

Sites / Services round table:

  • Gonzalo/PIC: next Tuesday there is a scheduled intervention (declared in GOCDB) on CVMFS; queues will be drained, which may impact ATLAS and LHCb
  • Ulf/NDGF: short break in the network connection to Finland; ALICE and CMS may suffer a short unavailability
  • Burt/FNAL: ntr
  • Tiju/RAL: ntr
  • Michael/BNL: network engineers are replacing some routers; these are in redundant pairs so there is internal failover, but there is a small risk to connectivity, announced in the ATLAS elog
  • Pavel/KIT: nta
  • Stefano/CNAF: ntr
  • Rolf/IN2P3: ntr
  • Jhen-Wei/ASGC: there will be a short downtime affecting CASTOR tape servers
  • Rob/OSG: ntr

  • Steve/Grid: the pilot FTS was faulty overnight; it is now OK
  • Edoardo/Network: traffic increased a lot and packets were dropped over the last two days, which may affect data transfers; the capacity is now being doubled to fix the issue
  • Mike/Dashboard: ntr
  • LucaC/Database: ntr
  • LucaM/Storage:
    • EOSCMS suffered a short unavailability at lunch
    • In contact with Stefan in LHCb to follow up the SRM issues reported yesterday and discussed by Raja today

AOB:

Thursday

Attendance: local(Alessandro, Eva, Luca M, Maarten, Maria D, Mike, Oliver K, Steve);remote(Andreas M, Gonzalo, Ian, Jhen-Wei, John, Jose, Kyle, Lisa, Michael, Raja, Rolf, Ronald, Stefano P, Ulf).

Experiments round table:

  • ATLAS reports -
    • IN2P3-CC :
      • Whole batch system down for the night
        • Rolf: yesterday late afternoon an urgent downtime was declared for the batch system, after the master server had suffered yet more crashes; on advice from Oracle some fixes were applied in the evening, after which the situation started looking better
      • GGUS:79278 (ALARM ticket). DDM transfer activity could not proceed because CERN VOBOXes could not contact LFC and FTS in IN2P3-CC. In addition, pilot factories located in IN2P3-CC could not contact Panda server at CERN. Problem occurred over the whole night and was reported/detected at lunch time.
        • Rolf: around 09:00 UTC routing changes were made at IN2P3 because of LHCONE, but it seems the problems actually started earlier; the Operations Portal is also affected and that is not on LHCONE
          • added after the meeting: indeed, the problem lay on the NREN side, as explained in the ticket resolution
    • RAL : LFC migration is done. It took longer than expected and this needs to be followed up for the next migrations.
    • INFN-T1 : GGUS:79269 : Many lcg-gt errors from 5:00 to 8:00 CET. No feedback in the ticket yet, although the problem looks solved.
    • TRIUMF : GGUS:79266 : Problem staging a file because a tape was stuck in a tape drive
    • ATLAS integration database INTR intervention : Done successfully with a small delay

  • CMS reports -
    • IN2P3 announced problems with the batch system yesterday at 16:30. They closed it for draining in order to intervene this morning.
    • Tier-0 to Tier-1 data transfer stress tests: 1.2 GB/s distributed across the sites according to their custodial tape pledges. All the T1s use the FTS pilot at CERN. We have asked CERN and the Tier-1s to upgrade to FTSMonitor-1.6 (CERN GGUS:79256, FZK GGUS:79257, INFN GGUS:79259). Steve Traylen is deploying it at CERN. With the new FTSMonitor-1.6.1 we can properly see the last 24h of transfers, while the current older versions have some limitations.

  • LHCb reports -
    • T0
      • Problems transferring files to IN2P3. GGUS ticket (GGUS:79281) opened against IN2P3 - problem with network?
    • T1
      • IN2P3 : See above.
      • RAL : Continuing problems with publishing queue parameters / submission of jobs. GGUS ticket (GGUS:79283) submitted.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - nta
  • KIT - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL
    • CMS CASTOR upgrade on Monday, still to be declared in GOCDB

  • dashboards - ntr
  • databases
    • LHCb online DB has been moved to new HW
  • GGUS/SNOW - ntr
  • grid services
    • CERN LCG-CE nodes are being retired, yet are still used by some CMS users
      • Maarten: CMS were made aware already; many/most of those jobs are just tests that can be ignored
  • storage - ntr

AOB:

  • FTS 2.2.8 upgrade
       After a successful run on the CERN pilot service, EMI is expected to release
       FTS 2.2.8 on Thu Feb 16 (today).
    
       The intention is for the same FTS version to be made available also in the
       glite 3.2 repository. 
    
       The EMI version has the additional benefit of a standard packaging and file
       location layout. This will make it easier to deploy and operate.
    
       FTS-2.2.8 is foreseen to be the production version for the remainder of the
       year. Given that the previous version was released about 20 months ago, it is
       certainly worthwhile to consider a move to new hardware for the service.
       Those on SL4 have to re-install from scratch to move to SL5.
    
       Taking all this into account and considering the cost to support two
       packaging styles, we recommend that all T1s move to the EMI version. 
    
       In order to coordinate the FTS Servers upgrade to the 2.2.8 release,
       please post when you think it is possible for you to do the upgrade:
    
       https://docs.google.com/spreadsheet/ccc?key=0AthhzXLQok7XdFpUeDBfLXE2S1RDZE4zcHp6QWVpUFE
    
       For more information about the upgrade and previous experience from CERN
       please check:
    
       https://svnweb.cern.ch/trac/glitefts/wiki/FTS228RolloutPlanning
    
       Those sites that need more information on moving to the EMI version are
       invited to contact fts-support@cern.ch.
       

  • MariaDZ http://emisoft.web.cern.ch/emisoft/index.html is the location promised yesterday. FTS 2.2.8 is not there yet but it will appear a.s.a.p. Please open a GGUS ticket to the Support Unit 'FTS Development' for any clarification needed.

Friday

Attendance: local (AndreaV, Alessandro, Eva, Mike, Steve); remote (Michael/BNL, Mette/NDGF, Alexandre/NLT1, Xavier/KIT, Lisa/FNAL, Stefano/CNAF, Rolf/IN2P3, Jeremy/GridPP, Gareth/RAL, Rob/OSG; Jose/CMS, Raja/LHCb).

Experiments round table:

  • CMS reports -
    • SLS monitoring unresponsiveness: INC:105438. It was caused by ourselves: massive failures of prompt reconstruction jobs during the MWGR produced a large XML update which overloaded SLS. Already fixed. Thanks to the SLS support for the prompt reaction.

  • LHCb reports -
    • Experiment activities: User analysis, Validation productions for restripping
    • T0: NTR.
    • T1:
      • PIC : LHCbUser space token out of space - more space added by site (GGUS:79317). Thanks!
      • PIC : Request for space token migration (GGUS:79305)
      • GridKa : LHCbUser space token out of space - more space added by site (GGUS:79318). Thanks!
      • GridKa : Request for space token migration (GGUS:79303)
      • SARA : Request for space token migration (GGUS:79307)
      • SARA : FTS timeout transferring a file to CNAF. (GGUS:79325) [Alexandre: looking into this problem right now]
      • RAL : Regular problems with zombie jobs in Cream CE (GGUS:78873, GGUS:79283). Also seen at IN2P3 (GGUS:79200) and LAL (GGUS:79200). Is this already known?

Sites / Services round table:

  • Michael/BNL: ntr
  • Mette/NDGF: ntr
  • Alexandre/NLT1: nta
  • Xavier/KIT: ntr
  • Lisa/FNAL: ntr
  • Stefano/CNAF: ntr
  • Rolf/IN2P3: comments on the network problem yesterday: it was due to some configuration changes external to IN2P3, between Renater and CERN; the issue is now fixed.
    • [Alessandro: did not understand exactly what went on; experiments should be informed if there is a change in the planned use of the networks. Steve: this was just a mistake, not the consequence of a change in the model.]
    • [Raja: some users are complaining today that they cannot access the CERN web pages from IN2P3, is this related? Rolf: not aware of this, issue should be fixed, please open a ticket if necessary. Raja: will check with the user again and open a ticket if necessary.]
  • Jeremy/GridPP: ntr
  • Gareth/RAL: scheduled intervention on Monday on CASTOR CMS, announced in gocdb
  • Rob/OSG: still working on ticket attachments between GGUS and OSG, will give an update next week

  • Eva/Databases: the hardware intervention on RAC10 did not fix the problems; there will be another transparent intervention on Monday (affecting CMS and LCGR; users have been notified)
  • Mike/Dashboard: ntr
  • Steve/Grid: FTS 2.2.8 has been released from EMI today, will update the pilot from the 2.2.8 release candidate to this 2.2.8 official release (this has been agreed with CMS)

AOB: none

-- JamieShiers - 18-Jan-2012

Topic attachments
  • ggus-data.ppt (PowerPoint, 2340.5 K, 2012-02-13 14:48, MariaDimou) - Final GGUS slides including ALARM drills for the St Valentine's MB