Week of 111107
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
- The scod rota for the next few weeks is at ScodRota
WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments
General Information
Monday:
Attendance: local(Massimo, MariaDZ, Ricardo, Doug, Jhen-Wei, David, Jacek); remote(Rolf, Onno, Lisa, Michael, Mette, Gonzalo, Gareth, RobQ, Dimitri, Paolo).
Experiments round table:
- ATLAS reports -
- T0
- CERN-PROD transfer failures: "SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] [srm2__srmPrepareToGet". GGUS:76031. Recovered within one hour and no further errors appeared; possibly a temporary problem with the threads being busy during that period.
- CERN-PROD slow LSF response time. GGUS:76039. Hardware problem affecting the LSF master machine. Now running on the secondary master while the RAID array is rebuilding on the primary master. Response times to LSF commands already look good.
- Transparent upgrade of CASTOR 2.1.11-8 on CASTOR Atlas. 7 Nov. 10:00 - 16:00 UTC. Completed at 10:30 UTC.
- Patching of ATLAS online production database (ATONR) - rolling intervention. 7 Nov. 10:30 - 12:30 UTC. Completed at 10:45 UTC.
- WLCG production database (LCGR) rolling intervention. 7 Nov. 11:00 - 13:00 UTC.
- T1 sites
- BNL scheduled downtime - three days of facility maintenance for all services, 7-9 November.
- Taiwan-LCG2 scheduled downtime - two days of maintenance for the CASTOR upgrade, 7-8 November.
- T2 sites
- LHCb reports -
- Experiment activities
- Reprocessing is now being ramped up
- Stripping is continuing at CERN:
- T1
- PIC: GGUS:76028 - FTS transfer problem. Fixed.
- PIC: problem of space on the pool behind the LHCb-tape space token.
Sites / Services round table:
- ASGC: ntr
- BNL
- Intervention at BNL started at 13:00 UTC. Services at US ATLAS Tier-1 expected to be restored by Wednesday, November 9 at 18:00 UTC.
- FNAL: ntr
- PIC: LHCb's tickets have been fixed, but the one still open is being monitored, as FTS transfers appeared blocked due to a dCache problem. The dCache developers have been contacted.
- NDGF: There was a network problem today with a fibre that cut off the University. It seems to be solved now.
- NL_T1: ATLAS dark data was found at SARA; it seems to be old data and is now being cleaned up.
- CNAF: ntr
- RAL: A patch was installed this morning to fix the over-reporting of tape capacity.
- KIT: There will be a scheduled intervention from next Monday to Wednesday on the ATLAS LFC instances; next Tuesday specifically, changes will be applied to the ATLAS dCache.
- OSG: A maintenance session will take place tomorrow (Tue) between 14:00-18:00 UTC that will affect several OSG services. One of the resulting improvements will be the possibility to include ticket attachments (important for the interface with GGUS).
- IN2P3: There are 4 CMS tickets on network issues; CMS should add performance numbers to the tickets to help the debugging! GGUS:75514 is waiting for a reply from LHCb. There is a problem with the IN2P3 ticketing system's interface to GGUS, tracked via GGUS:76041.
- CERN:
- Grid Services: Busy checking Atlas' reported LSF problem.
- Dashboards: ntr
- Storage Services: There is an intervention today and one scheduled for this Wednesday, concerning the name service that should be transparent.
- Databases: Oracle Security patches being applied now.
AOB: (MariaDZ) ALARM drills for 4 weeks are attached at the end of this page. The GGUS-SNOW interface was not working from last Friday until this morning; tracked via GGUS:76052. SIR in preparation.
Tuesday:
Attendance: local(Adam, Douglas, Eddie, Maarten, Maria D, Massimo, Nicolo, Steve);remote(Burt, Gareth, Gonzalo, Jhen-Wei, Joel, Kyle, Mette, Michael, Paco, Paolo, Rolf, Xavier).
Experiments round table:
- ATLAS reports -
- Tier-0
- Rolling upgrades of Oracle machines today, affecting the ATLR and ADCR databases. The upgrades happened between 10:00 and 11:30 am, and no problems were reported.
- Problems with SRM access to CERN-PROD, with failures like "failed to contact on remote SRM [httpg://srm-eosatlas.cern.ch:8443/srm/v2/server]". This was a problem yesterday and was fixed with a system reboot at 10:50 pm, but the problem returned this afternoon. This has been a daily occurrence this week and was also reported a few times in each of the previous few weeks. GGUS:76123.
- Tier-1
- BNL outage continues for today. Offline status correctly set in ATLAS systems, no problems to report at this time.
- Taiwan CASTOR outage continues today; the status is correctly set in ATLAS systems, nothing to report.
- Jhen-Wei: more time is needed, downtime extended until Wed 11:00 UTC
- CMS reports -
- LHC / CMS detector
- Technical stop
- Preparing for HI run
- CERN / central services
- LSF master node issues affecting T0 during the weekend, GGUS:76045 TEAM ticket opened. VOBOXes used for CMS T0 submission were reconfigured to contact the LSF secondary master on Monday. The ticket was escalated to ALARM when LSF stopped responding again around 16:00, understood to be caused by switching back to the primary master. T0 recovered after that; ticket closed.
- No issue observed during T0 FTS upgrade
- CASTORSRM degraded: https://sls.cern.ch/sls/history.php?id=CASTOR-SRM_CMS&more=availability&period=24h - but minimal impact on transfers.
- T1 sites:
- Processing and/or MC running at all sites.
- [T1_IT_CNAF] 2 files waiting for migration to tape for 26 days, GGUS:76090
- [T1_DE_KIT]: Debugging network issues with US T2s GGUS:75985
- [T1_FR_IN2P3]: investigation on networking issues in progress
- T2 sites
- LHCb reports -
- Experiment activities
- Reprocessing is now being ramped up
- Stripping is continuing at CERN:
- T0
- T1
- Steve: last Friday's FTS authorization problem was due to a simple configuration input mistake, viz. LHCb not being in the list of VOs to support!
Sites / Services round table:
- ASGC - nta
- BNL
- maintenance work progressing according to schedule
- Steve: notice - when the CERN-BNL FTS channel is switched on again, it will be using the upgraded FTS agent for the first time; all other channels are looking OK so far
- Nicolo: in PhEDEx the FTS agent upgrade was invisible
- CNAF
- CMS ticket: the main issue is fixed; the remaining matters are in progress
- FNAL - ntr
- IN2P3
- various tickets waiting for reply:
- unscheduled downtime of tape robot tomorrow between 8:00 and 15:00 UTC to apply fixes for issues that caused Oct 31 incident (internal switch + microcode update); file staging will be unavailable
- KIT - ntr
- NDGF - ntr
- NLT1
- SARA dark data mentioned yesterday affects both ATLAS and LHCb; it is being cleaned up
- OSG - ntr
- PIC - ntr
- RAL - ntr
- CASTOR/EOS
- transparent upgrade of CASTOR name server front-end nodes on Thu
- CMS SLS issue: for each experiment the SLS requests now go to the proper SRM and the default pool - maybe a better pool can be decided per experiment
- Nicolo: the t1transfer pool would make SLS show better what CMS experiences as the state of the CASTOR SRM
- dashboards - ntr
- databases - ntr
- GGUS/SNOW - ntr
- grid services - ntr
AOB:
Wednesday
Attendance: local(David, Douglas, Jhen-Wei, Maarten, Massimo, Nicolo, Ricardo, Steve);remote(Dimitri, Gonzalo, Joel, John, Lisa, Mette, Michael, Paolo, Rob, Rolf, Ron).
Experiments round table:
- ATLAS reports -
- Tier-0
- SRM problems at CERN-PROD happened yet again; the ticket was updated with the new errors. Stability issues are becoming a more than daily problem. GGUS:76123.
- Massimo: the EOS SRM was restarted after some time was spent to try and understand why it was stuck; we also are looking into a more robust configuration for it; more news in the coming days
- Upgrades to Oracle continue today, with the ATLARC database servers.
- Dashboard intervention today; the service is down from 14:00 to 15:00.
- VOMS DB intervention to fix the ATLAS VO's ability to update the database. We have not heard whether this is done, or how well it went.
- Steve: the operation is ongoing, the progress is fine; more news later today
- Maarten: can the other experiments also run into the problem?
- Steve: CMS already did and already got the same fix recently, viz. a dump + restore of their DB to get an automatic counter reset, after which the DB should again be good for a few years; ALICE and LHCb are still very far below the limit in question (a rough sketch of such a headroom check is given at the end of this ATLAS report)
- Tier-1
- BNL outage continues today, but should end this evening at 21:00 CERN time. Is this still the case?
- Michael: looking OK so far
- Taiwan CASTOR outage is over today; ATLAS systems are set back to online for the site.
- Jhen-Wei: will check the current state
- The name for the site IN2P3 was changed from "LYON" to "IN2P3-CC" in internal ATLAS systems today.
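On the VOMS DB counter mentioned by Steve above: the minutes do not name the counter involved, but assuming it is an Oracle sequence in the VOMS schema, a check along the following lines could estimate how close a VO is to the limit. This is a minimal sketch only; the DSN, credentials and 90% warning threshold are illustrative placeholders, not the actual VOMS setup or the procedure used by the service managers.

# Hypothetical headroom check for Oracle sequences in a VOMS-like schema.
# Connection details are placeholders; the actual counter involved in the
# incident is not named in the minutes.
import cx_Oracle

def sequence_headroom(dsn, user, password, warn_fraction=0.9):
    """Print sequences whose LAST_NUMBER exceeds warn_fraction of MAX_VALUE."""
    conn = cx_Oracle.connect(user, password, dsn)
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT sequence_name, last_number, max_value FROM user_sequences"
        )
        for name, last, max_value in cur:
            used = float(last) / float(max_value)
            if used >= warn_fraction:
                print("%s: %.1f%% of its range used" % (name, 100 * used))
    finally:
        conn.close()

if __name__ == "__main__":
    # Placeholder DSN and credentials, for illustration only.
    sequence_headroom("voms-db-alias", "voms_reader", "secret")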
- CMS reports -
- LHC / CMS detector
- Technical stop
- Cosmic runs
- CERN / central services
- T0
- Deployed config for HI running, in testing
- T1 sites:
- MC production and/or reprocessing running at all sites.
- Run2011 data reprocessing starting.
- [T1_IT_CNAF] 2 files waiting for migration to tape for 26 days, GGUS:76090 - fixed. Several more files are stuck in migration because they were produced by CMS with wrong LFN pattern: asked CNAF to define migration policy for these files.
- [T1_IT_CNAF]: Transient SRM SAM test failures tonight, GGUS:76154 - closed.
- [T1_DE_KIT]: Debugging network issues with US T2s GGUS:75985
- [T1_FR_CCIN2P3]: investigation on networking issues in progress GGUS:75983, GGUS:75919, GGUS:71864
- [T1_FR_CCIN2P3]: GGUS:75397 about FTS channel config: configuration of shared FTS channel now looks OK, network issues on slow links to be followed up in other tickets, can be closed.
- [T1_FR_CCIN2P3]: GGUS:75829 about proxy expiration in JobRobot test: happened again on Nov 5th on cccreamceli02.in2p3.fr
- [T1_TW_ASGC]: GGUS:75377 about read errors on CASTOR: pending, investigation to resume after end of downtime.
- LHCb reports -
- Experiment activities
- Reprocessing is progressing well
- Stripping ongoing
- T0
- T1
- IN2P3: close GGUS:75703?
- Rolf: let's follow up in the ticket
- KIT: CVMFS deployment plans?
- Dimitri: will ask and report tomorrow
- SARA: CVMFS deployment plans?
- Ron: planned for the next 2 weeks
Sites / Services round table:
- ASGC - nta
- BNL
- expect to be back on time
- tried one extra network intervention (multiple spanning trees), but ran into problems and had to back out
- CNAF - ntr
- FNAL - ntr
- IN2P3
- robot downtime proceeding OK
- network problems affecting CMS transfers: Renater and Geant network providers are investigating, there appear to be packet losses, cause unknown; transfers to a Japanese T2 are affected as well
- incomplete storage dumps for LHCb: are the missing space tokens the ones that were removed at the request of LHCb?
- Joel: will check and resolve the confusion in the ticket
- KIT - ntr
- NDGF - ntr
- NLT1 - ntr
- OSG - ntr
- PIC - ntr
- RAL - ntr
- CASTOR/EOS - nta
- dashboards
- upgrade of ATLAS DDM dashboard went OK
- grid services - nta
AOB:
Thursday
Attendance: local(David, Douglas, Jan, Maarten, Maria D, Nicolo, Ricardo);remote(Andreas, Gonzalo, Jhen-Wei, Joel, John, Lisa, Mette, Michael, Paolo, Rob, Rolf, Ronald).
Experiments round table:
- ATLAS reports -
- Tier-0
- No problems seen with the CERN-PROD SRM today.
- Tier-1
- BNL outage continues today, but should be over very soon.
- Michael: during the attempt to move the SRM DB to new HW the DB dump was found to contain consistency/integrity violations; it was mandatory to find and repair the problems before the operation could proceed; the problems were fixed overnight and the downtime was extended to 15:00 UTC today; that timeline looks OK so far
- David: LCG-CE and CREAM CE SAM tests for INFN-T1 were timing out on Tue and yesterday, the cause is being investigated; now they are OK
- CMS reports -
- LHC / CMS detector
- Technical stop
- Cosmic runs
- CERN / central services
- T0
- Deployed config for HI running, in testing
- T1 sites:
- MC production and/or reprocessing running at all sites.
- Run2011 data reprocessing starting.
- [T1_IT_CNAF] GGUS:76090 - files stuck in migration because they were produced by CMS with wrong LFN pattern: asked CNAF to define migration policy for these files.
- [T1_IT_CNAF] GGUS:76175 - degraded imports from other T1s in PhEDEx Debug instance
- [T1_DE_KIT]: Debugging network issues with US T2s GGUS:75985
- [T1_DE_KIT]: JobRobot failures - jobs expired after staying in WMS queue with "no compatible resources" - GGUS:76191
- [T1_FR_CCIN2P3]: investigation on networking issues in progress GGUS:75983, GGUS:71864
- [T1_FR_CCIN2P3]: GGUS:75829 about proxy expiration in JobRobot test: happened again on Nov 5th and Nov 9th on cccreamceli02.in2p3.fr - also affecting T2_BE_IIHE SAV:124597 and SAV:124471 - possible CREAM bug?
- Nicolo: the CREAM developers will be asked to have a look
- [T1_TW_ASGC]: GGUS:75377 and GGUS:76204 about read errors on CASTOR
- LHCb reports -
- Experiment activities
- Reprocessing is progressing well
- Stripping ongoing
- T0
- T1
- IN2P3 (GGUS:75158): migration of files to the correct space token.
- GRIDKA (GGUS:75851): problem with the benchmark on some nodes
- CNAF: file transfer issue being looked into; will report on it tomorrow if needed
Sites / Services round table:
- ASGC
- yesterday CASTOR got stuck; as data is being migrated and new disk servers are being deployed for CMS, the capacity is reduced; the CMS tickets are being looked into
- BNL - nta
- CNAF - ntr
- FNAL
- tape back-end maintenance until 2 pm local time, reads are delayed, writes are buffered
- IN2P3
- network problems being investigated by Renater NREN
- LHCb ticket: waiting for information from LHCb
- an outage is planned for Dec 6, no details yet
- tomorrow is a public holiday, the service level will be as during weekends
- KIT
- CVMFS: still testing, will try moving it into production before the end of the year
- NDGF
- 1 hour SRM downtime on Mon for kernel update
- NLT1 - ntr
- OSG
- today there is a release fixing a potential accounting vulnerability, sites will be asked to deploy the fix urgently
- PIC - ntr
- RAL - ntr
- CASTOR/EOS
- SRM-EOSATLAS outage 2011-11-09 20:00-21:00 (no ticket)
- SRM-EOSCMS has a large number of stuck GridFTP sessions using up ports, mostly from lcg-cp clients on worker nodes at 2 sites
- Maarten: you need to deploy a cron job that kills such processes after a few hours at most; we do that on the WMS nodes, I will point you to the rpm and the sources (a rough sketch of such a watchdog is given after this round table)
- Jan: will look into that
- the rpm was deployed in the late afternoon, with a grace period of 8 hours for now
- CASTOR transparent name server update to 2.1.11-8 - done.
- dashboards - ntr
- GGUS/SNOW - ntr
- grid services
- yesterday's VOMS intervention for ATLAS finished OK
- various top-level BDII nodes were serving stale data due to a full "tmpfs" (RAM file system); the monitoring will be improved and a ticket will be opened for the developers to look into the problem
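Regarding the stuck GridFTP sessions on SRM-EOSCMS above: a minimal sketch of the kind of cron-driven watchdog Maarten describes, assuming Linux /proc is available. The process-name pattern and the 4-hour cutoff are illustrative assumptions; the actual rpm used on the WMS nodes may behave differently.

#!/usr/bin/env python
# Minimal sketch of a watchdog that kills GridFTP-related processes running
# longer than a cutoff, intended to be run from cron (e.g. hourly).
# The name pattern and 4-hour threshold are illustrative assumptions.
import os
import signal

MAX_AGE_SECONDS = 4 * 3600          # assumed cutoff: 4 hours
NAME_PATTERN = "gridftp"            # assumed fragment of the process name
CLOCK_TICKS = os.sysconf("SC_CLK_TCK")

def process_age_and_name(pid):
    """Return (age_in_seconds, command_name) for a pid, or None if it is gone."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            stat = f.read()
        with open("/proc/uptime") as f:
            uptime = float(f.read().split()[0])
    except (IOError, OSError):
        return None
    # comm is in parentheses and may contain spaces; starttime is field 22.
    name = stat[stat.index("(") + 1:stat.rindex(")")]
    fields = stat[stat.rindex(")") + 2:].split()
    starttime = float(fields[19]) / CLOCK_TICKS   # field 22 of /proc/<pid>/stat
    return uptime - starttime, name

def main():
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        info = process_age_and_name(int(entry))
        if info is None:
            continue
        age, name = info
        if NAME_PATTERN in name and age > MAX_AGE_SECONDS:
            print("killing pid %s (%s), age %ds" % (entry, name, int(age)))
            try:
                os.kill(int(entry), signal.SIGTERM)
            except OSError:
                pass  # process may have exited already

if __name__ == "__main__":
    main()

A follow-up SIGKILL for processes that ignore SIGTERM, and logging instead of printing, would be natural refinements in a production version.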
AOB:
Friday
Attendance: local();remote().
Experiments round table:
- ATLAS reports -
- Tier-0
- Missing file from the atlt3 service class on CASTOR: it appears in the name server but is not staged on the atlt3 service class. GGUS:76250
- More problems with CERN-PROD SRM to EOS. GGUS:76237
- Tier-1
- BNL was back online last night at 8 pm CERN time. ATLAS use started slowly, but by morning things looked to be working rather well. The SRM was restarted last night to adjust parameters, and then again today because of disk space lost to a log file and to reduce the debug logging of the service.
- CMS reports -
- LHC / CMS detector
- Technical stop ending at 18:00
- First Heavy Ions collisions possible tonight
- CERN / central services
- T0
- Deployed config for HI running, in testing
- T1 sites:
- MC production and/or reprocessing running at all sites.
- Run2011 data reprocessing started.
- Heavy Ion data to be subscribed to T1_US_FNAL and T1_FR_CCIN2P3
- [T1_IT_CNAF] GGUS:76090 - files stuck in migration because they were produced by CMS with wrong LFN pattern: asked CNAF to define migration policy for these files, in progress.
- [T1_IT_CNAF] GGUS:76175 - degraded imports from other T1s in PhEDEx Debug instance: looks OK since 06:00 UTC
- [T1_DE_KIT]: Debugging network issues with US T2s GGUS:75985 - any update?
- [T1_FR_CCIN2P3]: investigation on networking issues in progress GGUS:75983, GGUS:71864
- [T1_FR_CCIN2P3]: GGUS:75829 about proxy expiration in JobRobot test: happened again on Nov 5th and Nov 9th on cccreamceli02.in2p3.fr - also affecting T2_BE_IIHE SAV:124597 and SAV:124471 - CREAM developers involved, GGUS:76208
- [T1_TW_ASGC]: GGUS:75377 and GGUS:76204 about read and stageout errors on CASTOR, in progress
- T2 sites:
- [T2_US_Vanderbilt]: main T2 for Heavy Ions, deployed configuration and testing.
- LHCb reports -
- Experiment activities
- Finishing up reprocessing of the last-but-one range of data over the weekend
- Beginning of next week the last reprocessing productions will be launched
- T0
- T1
- IN2P3: (GGUS:76248) check possibility of disabling pool to pool replication to increase throughput
- IN2P3 (GGUS:75158): migration of files to the correct space token.
- GRIDKA (GGUS:75851): problem with the benchmark on some nodes
Sites / Services round table:
AOB:
--
JamieShiers - 31-Oct-2011