---+ Week of 111128

%TOC%

---++ Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

   1. Dial +41227676000 (Main) and enter access code 0119168, or
   1. To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]

The scod rota for the next few weeks is at ScodRota.

---++ WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

| *VO Summaries of Site Usability* |||| *SIRs, Open Issues & Broadcasts* ||| *Change assessments* |
| [[http://dashb-alice-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=101&sites=CERN-PROD&sites=FZK-LCG2&sites=IN2P3-CC&sites=INFN-T1&sites=NDGF-T1&sites=NIKHEF-ELPROD&sites=RAL-LCG2&sites=SARA-MATRIX&algoId=6&timeRange=lastWeek][ALICE]] | [[http://dashb-atlas-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=403&sites=CERN-PROD&sites=BNL-ATLAS&sites=FZK-LCG2&sites=IN2P3-CC&sites=INFN-T1&sites=NDGF-T1&sites=NIKHEF-ELPROD&sites=RAL-LCG2&sites=SARA-MATRIX&sites=TRIUMF-LCG2&sites=Taiwan-LCG2&sites=pic&algoId=161&timeRange=lastWeek][ATLAS]] | [[http://dashb-cms-sam.cern.ch/dashboard/request.py/historicalsiteavailability?siteSelect3=T1T0&sites=T0_CH_CERN&sites=T1_DE_KIT&sites=T1_ES_PIC&sites=T1_FR_CCIN2P3&sites=T1_IT_CNAF&sites=T1_TW_ASGC&sites=T1_UK_RAL&sites=T1_US_FNAL&timeRange=lastWeek][CMS]] | [[http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=501&sites=LCG.CERN.ch&sites=LCG.CNAF.it&sites=LCG.GRIDKA.de&sites=LCG.IN2P3.fr&sites=LCG.NIKHEF.nl&sites=LCG.PIC.es&sites=LCG.RAL.uk&sites=LCG.SARA.nl&algoId=82&timeRange=lastWeek][LHCb]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents][WLCG Service Incident Reports]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpenIssues][WLCG Service Open Issues]] | [[https://cic.gridops.org/index.php?section=roc&page=broadcastretrievalD][Broadcast archive]] | [[https://twiki.cern.ch/twiki/bin/view/CASTORService/CastorChanges][CASTOR Change Assessments]] |

---++ General Information

| *General Information* ||||| *GGUS Information* | *LHC Machine Information* |
| [[http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/][CERN IT status board]] | M/W PPSCoordinationWorkLog | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][WLCG Baseline Versions]] | [[http://cern.ch/planet-wlcg][WLCG Blogs]] | | GgusInformation | [[https://espace.cern.ch/be-dep-op-lhc-machine-operation/default.aspx][Sharepoint site]] - [[http://lhc.web.cern.ch/lhc/][Cooldown Status]] - [[http://lhc.web.cern.ch/lhc/News.htm][News]] |

---

---++ Monday

Attendance: local(Massimo, Oliver, Alexei, David, Manuel, Eva, JhenWei); remote(Michael, Onno, Gonzalo, Lisa, Tiju, Rolf, Roger, Kyle, Paolo).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * DDM Dashboard issue: statistics plots have not been updated since Nov 25, ~21:00.
         * David Tuckett is investigating. There is an issue with the LCGR database; the PhysDB team is investigating it.
         * Ticket: https://cern.service-now.com/service-portal/view-incident.do?n=INC082484
         * Fixed at ~11:00; some statistics will be restored on Monday.
      * FR, US Tier2s issues.
      * Sun Nov 27, distributed analysis monitoring:
         * Some tables have no info; experts reported that the information probably isn't in the [[../../view/Atlas/PanDA][PanDA]] database. Under investigation.
      * Mon Nov 28, 11:40: CASTOR to EOS ph.groups migration is in progress.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC / CMS detector
         * Heavy-ion collision data taking.
      * CERN / central services
         * [CERN SRM]: srm-cms.cern.ch availability had some holes yesterday; under investigation: https://sls.cern.ch/sls/service.php?id=CASTOR-SRM_CMS
         * [CERN/Vanderbilt FTS]: request to create a dedicated channel; apologies for the earlier competing request to increase the STAR channel, experts determined that it would be better to have its own channel. GGUS:76783, solved.
         * [CASTORCMS_T0TEMP]: 400 active transfers stuck since the 24th; continuation of GGUS:76649.
      * T0
         * Running HI express and prompt reconstruction.
      * T1 sites:
         * MC production and/or reprocessing running at all sites.
         * Run2011 data reprocessing.
         * [T1_DE_KIT]: reduced throughput of transfers from KIT to US T2s. iperf tests ongoing with some US sites; seems to be mostly political issues. GGUS:75985 (in progress, last update 11/21). A sketch of such a throughput test appears after today's AOB.
         * [T1_FR_CCIN2P3]: reduced throughput of transfers from CCIN2P3; improvements visible, but not enough. IN2P3 asked to start low-level investigations with iperf; due to Thanksgiving we can follow up on Monday. GGUS:75983 (in progress, last update 11/23), GGUS:71864 (in progress, last update 11/23).
         * [T1_IT_CNAF]: CREAM reporting dead jobs @ CNAF T1 as REALLY-RUNNING. Workaround in the queue for the next EMI release; CNAF asked to also clear the REALLY-RUNNING jobs left for debugging. This does not hurt processing right now; will clean up on Monday due to Thanksgiving. GGUS:76597 (in progress, last update 11/23).
      * T2 sites:
         * NTR (at least, relevant for this meeting)
      * Other:
         * NTR
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * NTR

Sites / Services round table:

   * IN2P3:
      * UPS incident this morning. A service incident report is being prepared.
      * Several interventions are being prepared and will be bundled in a single downtime (Dec 6th). WLCG services affected: dCache and HPSS.
   * PIC: transparent interventions from today to Wednesday. The only side effect (since we are moving worker nodes) is some oscillation (up to -30%) of the total capacity.
   * NLT1: several interventions are being prepared and will be bundled in a single downtime (Dec 13th).
   * KIT: misbehaving filesystem stopped on Friday. Now it is back (debugged). This affected 5 CMS pools.
   * CERN:
      * Dashboards: follow-up of the ATLAS incident: the last sets of statistics are being regenerated.
      * Storage services: in order to allow swift addition of capacity, a new version of EOS has been deployed on EOSCMS. The CMS T0TEMP ticket is being followed up, but it is not impacting production (the 400 transfers are "zombie" transfers).
      * Databases: CMS online DB problems: analysis ongoing (including a service request to Oracle).

AOB: (MariaDZ) ALARM drills for the last 3 weeks (since the last MB) are attached to this page.
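A note on the iperf tests mentioned in the CMS T1 items above: these are point-to-point network throughput measurements between an agreed pair of hosts at the two sites. The sketch below shows how such a test could be driven; it is illustrative only, the host name is a placeholder, and the actual endpoints and parameters would be agreed between the site network teams.

<verbatim>
# Sketch of a multi-stream TCP throughput test with iperf (version 2),
# assuming the remote site runs "iperf -s" on the agreed host.
# "perfmon.example-t2.edu" is a placeholder, not a real KIT / US-T2 endpoint.
import subprocess

def run_iperf(server: str, seconds: int = 30, streams: int = 4) -> str:
    """Run an iperf client against the given server and return its report."""
    cmd = [
        "iperf",
        "-c", server,        # client mode: connect to the remote iperf server
        "-t", str(seconds),  # test duration in seconds
        "-P", str(streams),  # parallel TCP streams, to expose per-stream limits
        "-f", "m",           # report throughput in Mbit/s
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    print(run_iperf("perfmon.example-t2.edu"))
</verbatim>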
---++ Tuesday

Attendance: local(Alex, Claudio, Jhen-Wei, Maarten, Maria D, Massimo, Mike, Przemyslaw); remote(Burt, Jeremy, Michael, Rob, Roger, Ronald, Tiju).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * T0
         * Migrated CERN-PROD group disk area from CASTOR to EOS. Done at 16:00, Nov 28. No major issue observed.
         * Massimo: there are still other volumes to be moved, in particular atlast3 (large); a big jump is expected after the HI run has finished.
      * T1 sites
         * Set IT cloud offline for LFC migration, Nov 29th - Dec 1st 2011.
      * T2 sites
         * ntr
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC / CMS detector
         * No data taking due to LHC problems until 21:00; then, if the recovery is OK, HI physics.
      * CERN / central services
         * srm-cms availability was down to 0 tonight, due to T1TRANSFER service class unavailability. Had ~540 (zombie) active transfers yesterday. OK since the morning.
         * Massimo: the problem was not with the SRM, but due to overload of the service class, which caused monitoring requests to remain queued and the status to go red; we will discuss this offline.
         * The 400 active transfers on T0TEMP have been cleaned (continuation of GGUS:76649).
         * FTS channel between CERN and Vanderbilt (GGUS:76783): ticket reopened because it had to use CERN-PROD, not CERN.
      * T0
         * Running HI express and prompt reconstruction.
      * T1 sites:
         * MC production and/or reprocessing running at all sites.
         * Run2011 data reprocessing.
         * [T1_DE_KIT]: reduced throughput of transfers from KIT to US T2s. iperf tests ongoing with some US sites; seems to be mostly political issues. GGUS:75985 (in progress, last update 11/21).
         * [T1_FR_CCIN2P3]: reduced throughput of transfers from CCIN2P3; improvements visible, but not enough. Low-level investigations with iperf started yesterday. GGUS:75983 (in progress, last update 11/28; GGUS:71864 closed as duplicate).
         * [T1_IT_CNAF]: CREAM reporting dead jobs @ CNAF T1 as REALLY-RUNNING. Workaround in the queue for the next EMI release; CNAF asked to also clear the REALLY-RUNNING jobs left for debugging. This does not hurt processing right now. GGUS:76597 (in progress, last update 11/24).
      * T2 sites:
         * NTR (at least, relevant for this meeting)
      * Other:
         * NTR
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * No report

Sites / Services round table:

   * BNL - ntr
   * FNAL - ntr
   * GridPP - ntr
   * NDGF - ntr
   * NLT1 - ntr
   * OSG - ntr
   * RAL - ntr
   * CASTOR/EOS - nta
   * dashboards - ntr
   * databases
      * Yesterday morning the CMS online DB suffered 3 hangs of 0.5 hour each; the problem has been identified by Oracle support and a bugfix will be available at some point; meanwhile a workaround has been applied.
      * Tomorrow between 09:30 and 12:30 CET the LHCb integration DB will be upgraded to Oracle 11g.
   * GGUS/SNOW - ntr
   * grid services - ntr

AOB:

   * The AFS team at CERN has observed a high number of failing AFS callback connections to remote machines. For instance, yesterday we detected about 35k failed callback connections. In addition to AFS consistency relying on proper callback delivery, failures to deliver callbacks may result in delays for other clients (as the server needs to time out on the unresponsive clients first). As this impacts the AFS service at CERN, it would be desirable to understand why these machines do not respond to callbacks and whether that could be changed. Sites with the most "failed connections" will be contacted by the AFS team. (A sketch of a callback-port check follows below.)
      * Massimo: the first site would be RAL.
      * Tiju: I will send our helpdesk e-mail address to the list (done).
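Background to the AFS item in today's AOB: the fileserver delivers callbacks to the client cache manager over UDP port 7001, so a client behind a firewall or NAT that silently drops that traffic shows up exactly as described (the server times out on it). One way a site could check a suspect host is sketched below, assuming the OpenAFS rxdebug utility is installed; the host name is a placeholder.

<verbatim>
# Probe a client's AFS callback port (UDP 7001) with OpenAFS's rxdebug.
# A responsive cache manager answers the version query; a timeout suggests
# that callback traffic is being dropped on its way to the client.
import subprocess

def callback_port_answers(host: str, timeout_s: int = 30) -> bool:
    """Return True if the host's cache manager answers on UDP 7001."""
    try:
        result = subprocess.run(
            ["rxdebug", host, "7001", "-version"],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

if __name__ == "__main__":
    print(callback_port_answers("wn001.example-site.org"))  # placeholder host
</verbatim>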
---++ Wednesday

Attendance: local(Jhen-Wei, Massimo, Claudio, David, Luca, MariaDZ, Maarten, Dirk); remote(Michael/BNL, Giovanni/CNAF, John/RAL, Ron/NL-T1, Roger/NDGF, Lisa/FNAL, Pavel/KIT, Rolf/IN2P3).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] - Jhen-Wei
      * T0
         * ntr
      * T1 sites
         * IN2P3-CC_DATADISK: "Internal timeout. Also affected T0 export." GGUS:76911. High srmput activity yesterday evening (ATLAS and CMS import + WNs), making the transfers slower and generating a backlog. This explains the timeouts seen. The backlog disappeared by itself around midnight. Ticket closed.
         * FZK-LCG2 DATATAPE and MCTAPE: "TRANSFER_TIMEOUT". GGUS:76908. Many queued gridftp movers in our dCache system. The active movers have very low throughput, which leads to timeouts. MCTAPE had been blacklisted to let transfers focus on T0 export. Some transfers back in DATATAPE.
            * Pavel/KIT: contacted dCache support, who suggested updating to the latest release; this will be done soon.
         * INFN-T1 scheduled downtime for one LFC and SRM [LFC ATLAS consolidation and ATLAS GPFS filesystem check]: Wed, 30 November, 07:00 - Thu, 1 December, 12:00.
         * Set IT cloud offline for LFC migration, Nov 29th - Dec 1st 2011.
      * T2 sites
         * ntr
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] - Claudio
      * LHC / CMS detector
         * No data taking due to LHC problems. HI physics will hopefully restart tomorrow.
      * CERN / central services
         * The CASTOR_T1TRANSFER load showed up again briefly yesterday afternoon (with a corresponding short glitch in srm-cms availability). The reason is not yet clear. Production activities were high, but not enough to explain the peak. Probably it was an overlap with some individual operation.
         * Massimo: added additional resources to the CMS pool.
         * FTS channel to Vanderbilt reconfigured at CERN (GGUS:76783).
      * T0
         * Running HI express and prompt reconstruction.
      * T1 sites:
         * MC production and/or reprocessing running at all sites.
         * Run2011 data reprocessing.
         * [T1_DE_KIT]: reduced throughput of transfers from KIT to US T2s. iperf tests ongoing with some US sites; seems to be mostly political issues. GGUS:75985 (in progress, last update 11/21).
         * [T1_FR_CCIN2P3]: reduced throughput of transfers from CCIN2P3; improvements visible, but not enough. Low-level investigations with iperf started yesterday. GGUS:75983 (in progress, last update 11/28; GGUS:71864 closed as duplicate).
         * [T1_IT_CNAF]: CREAM reporting dead jobs as REALLY-RUNNING (GGUS:76597). Ticket closed. Will test the new EMI CREAM as soon as it is installed.
      * T2 sites:
         * NTR (at least, relevant for this meeting)
      * Other:
         * NTR
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] - Maarten
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * No report

Sites / Services round table:

   * Michael/BNL - NTR
   * Giovanni/CNAF: in scheduled outage for CMS SRM: GPFS is offline due to h/w failures. Expected to finish by 18:00.
   * John/RAL: NTR
   * Ron/NL-T1: NTR
   * Roger/NDGF: NTR
   * Lisa/FNAL: tomorrow's power work in the CC might affect some dCache pool nodes (expect access delays) from 7:00-13:00.
   * Pavel/KIT: NTR
   * Rolf/IN2P3: NTR
   * Rob/OSG: a combination of changes (several weeks ago and recent) broke the comparison report between OSG- and SAM-calculated availability for a few days. This has been fixed and the report looked OK this morning.

AOB: (MariaDZ) Concerns the Tier0 only: in view of the Year End CERN closure, the GGUS-SNOW ticket handling was reviewed for the period 2011/12/22-2012/01/04. The conclusions for production grid services at CERN (in IT PES, DSS, DB groups) are summarised by Maite Barroso below:

   * The main route for getting notification of critical problems is GGUS ALARM tickets. This is working and does not need any change.
   * For GGUS TEAM tickets, the members of the e-group grid-cern-prod-admins will be put in the Outside Working Hours (OWH) group in order to have access to the SNOW instance of the ticket at all times.
   * For the rest, CERN-specific: all services are correctly declared in SNOW with OWH support groups, and in the Services Database (SDB) with the right criticality, so it should be fine.
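Tickets are cited throughout these minutes as bare GGUS:NNNNN references; a trivial helper for expanding them into the ticket_info links used elsewhere on this page is sketched below. The URL pattern is taken verbatim from those links; nothing else is assumed.

<verbatim>
# Expand a bare GGUS ticket number into the ticket_info URL used
# in the links on this page.
def ggus_url(ticket: int) -> str:
    return "https://gus.fzk.de/ws/ticket_info.php?ticket=%d" % ticket

# Example: the CERN/Vanderbilt FTS channel ticket from the CMS reports.
assert ggus_url(76783) == "https://gus.fzk.de/ws/ticket_info.php?ticket=76783"
</verbatim>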
---++ Thursday

Attendance: local(Jhen-Wei, Massimo, Eva, David, Alex, Dirk); remote(Gonzalo/PIC, Michael/BNL, Gareth/RAL, Giovanni/CNAF, Rolf/IN2P3, Roger/NDGF, Claudio/CMS, Lisa/FNAL, Rob/OSG, Ronald/NL-T1).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] - Jhen-Wei
      * T0
         * ntr
      * T1 sites
         * FZK-LCG2 DATATAPE and MCTAPE: "TRANSFER_TIMEOUT". GGUS:76908. Many queued gridftp movers in our dCache system. The active movers have very low throughput, which leads to timeouts. Is the rolling upgrade done? Do other T1s using the same version have a potential issue?
         * INFN-T1 scheduled downtime for one LFC and SRM [LFC ATLAS consolidation and ATLAS GPFS filesystem check]: Wed, 30 November, 07:00 - Thu, 1 December, 12:00.
         * Set IT cloud offline for LFC migration, Nov 29th - Dec 1st 2011.
      * T2 sites
         * ntr
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC / CMS detector
         * HI physics data taking.
      * CERN / central services
         * Smooth...
         * Still peaky use of the T1TRANSFER pool. No consequences. The activity is legitimate. We are investigating whether we need more resources allocated permanently to CMS.
      * T0
         * Running HI express and prompt reconstruction.
      * T1 sites:
         * MC production and/or reprocessing running at all sites.
         * Run2011 data reprocessing.
         * T1_IT_CNAF back after the storage intervention.
         * Problem of throughput from T1_DE_KIT and T1_FR_CCIN2P3 to US T2s still open (GGUS:75985 and GGUS:75983).
      * T2 sites:
         * NTR (at least, relevant for this meeting)
      * Other:
         * NTR
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] - Maarten
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * No report

Sites / Services round table:

   * Gonzalo/PIC - ntr
   * Michael/BNL - ntr
   * Gareth/RAL - ntr
   * Giovanni/CNAF - ntr
   * Rolf/IN2P3 - ntr
   * Roger/NDGF - ntr
   * Claudio/CMS
   * Lisa/FNAL - in power maintenance with a few dCache nodes down. Should be up by 13:00.
   * Ronald/NL-T1 - ntr
   * Rob/OSG - ntr
   * Massimo/CERN - next week: transparent CASTOR SRM upgrade on Tue (ALICE and LHCb) and Wed (CMS and ATLAS).

AOB:

   * In yesterday's meeting, Michael asked for an update on the networking issues at KIT. Pavel will follow up with experts at KIT.
---++ Friday

Attendance: local(); remote().

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -

Sites / Services round table:

AOB:

-- Main.JamieShiers - 31-Oct-2011
---++ Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| ggus-data_MB_20111129.ppt | r1 | 2330.0 K | 2011-11-28 - 11:03 | MariaDimou | |