---+ Week of 120618

%TOC%

---++ Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
   1. Dial +41227676000 (Main) and enter access code 0119168, or
   2. To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]
   3. The scod rota for the next few weeks is at ScodRota

---++ WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

| *VO Summaries of Site Usability* ||||*SIRs, Open Issues & Broadcasts*||| *Change assessments* |
| [[http://dashb-alice-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=ALICE_CRITICAL&group=all%2Bsites&site%5B%5D=CCIN2P3&site%5B%5D=CERN&site%5B%5D=CNAF&site%5B%5D=FZK&site%5B%5D=NIKHEF&site%5B%5D=RAL&site%5B%5D=SARA&type=quality][ALICE]] | [[http://dashb-atlas-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=ATLAS_CRITICAL&group=All%2Bsites&site%5B%5D=BNL-ATLAS&site%5B%5D=CERN-PROD&site%5B%5D=FZK-LCG2&site%5B%5D=IN2P3-CC&site%5B%5D=INFN-T1&site%5B%5D=NDGF-T1&site%5B%5D=NIKHEF-ELPROD&site%5B%5D=pic&site%5B%5D=RAL-LCG2&site%5B%5D=SARA-MATRIX&site%5B%5D=Taiwan-LCG2&site%5B%5D=TRIUMF-LCG2&type=quality][ATLAS]] | [[http://dashb-cms-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=CMS_CRITICAL_FULL&group=Tier1s%2B%252B%2BTier0&site%5B%5D=T0_CH_CERN&site%5B%5D=T1_CH_CERN&site%5B%5D=T1_DE_KIT&site%5B%5D=T1_ES_PIC&site%5B%5D=T1_FR_CCIN2P3&site%5B%5D=T1_IT_CNAF&site%5B%5D=T1_TW_ASGC&site%5B%5D=T1_UK_RAL&site%5B%5D=T1_US_FNAL&type=quality][CMS]] | [[http://dashb-lhcb-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=LHCb_CRITICAL&group=Tier%2B0/1&site%5B%5D=LCG.CERN.ch&site%5B%5D=LCG.CNAF.it&site%5B%5D=LCG.GRIDKA.de&site%5B%5D=LCG.IN2P3.fr&site%5B%5D=LCG.NIKHEF.nl&site%5B%5D=LCG.PIC.es&site%5B%5D=LCG.RAL.uk&site%5B%5D=LCG.SARA.nl&type=quality][LHCb]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents][WLCG Service Incident Reports]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpenIssues][WLCG Service Open Issues]] | [[https://cic.gridops.org/index.php?section=roc&page=broadcastretrievalD][Broadcast archive]] | [[https://twiki.cern.ch/twiki/bin/view/CASTORService/CastorChanges][CASTOR Change Assessments]] |

---++ General Information

| *General Information* ||||| *GGUS Information* | *LHC Machine Information* |
| [[http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/][CERN IT status board]] | M/W PPSCoordinationWorkLog | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][WLCG Baseline Versions]] | [[http://cern.ch/planet-wlcg][WLCG Blogs]]| | GgusInformation | [[https://espace.cern.ch/be-dep-op-lhc-machine-operation/default.aspx][Sharepoint site]] - [[http://lhc.web.cern.ch/lhc/][Cooldown Status]] - [[http://lhc.web.cern.ch/lhc/News.htm][News]] |

<HR>

---++ Monday

Attendance: local (Andrea, Doug, Ian, David, Maarten, Vladimir, Luca, Eva); remote (Michael/BNL, Ulf/NDGF, Lisa/FNAL, Rolf/IN2P3, Jhen-Wei/ASGC, Tiju/RAL, Onno/NLT1, Kyle/OSG, Dimitri/KIT).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * T1
         * IN2P3 FTS channels got stuck. GGUS:83320 solved: some channel agents did not recover the Oracle connection after the logrotate at 4:00 AM due to a problem with Oracle virtual IPs. Solved by defining a new connection string which does not use Oracle virtual IPs.
         * TRIUMF 1745 files lost. Files declared to the consistency service. Savannah:95440. Ticket will be updated when the exact number of lost files is confirmed.
      * [Doug: also had a hiccup in T0 processing two nights ago due to a full disk, not properly monitored hence not noticed by the shifters. David: what monitoring was this? Doug: the issue was in Firefox for the T0 console monitoring, we are now trying to improve our monitoring]
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC machine / CMS detector
         * Good data taking
      * CERN / central services and T0
         * Problems with a few Express stream files. Software experts are looking into it
      * Tier-1/2:
         * Problems with FNAL over the weekend. Network issues and problems on the submission services. It seems to have recovered now
         * Migration issues at ASGC. Local admins are working on it
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * Over the weekend, a large number of jobs at CERN failed due to insufficient scratch space: GGUS:83345
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * User analysis and prompt reconstruction and stripping at T1s ongoing
      * MC production at Tier-2s
      * T0:
         * CERN (GGUS:83351) batch system: We have a peak of submitted jobs every night.
         * [Vladimir: also had pilots aborted with reason 999 during the last two days, no ticket as the issue is now fixed. Maarten: must have been an LSF issue.]
      * T1: ntr

Sites / Services round table:

   * Michael/BNL: ntr
   * Ulf/NDGF:
      * ATLAS ticket about files not returned from tape is being investigated, may be related to dCache rather than tapes
      * tomorrow electrical maintenance in Slovenia, some ATLAS files will be unavailable
   * Lisa/FNAL: ntr
   * Rolf/IN2P3: ntr
   * Jhen-Wei/ASGC: ntr
   * Tiju/RAL: work on site network tomorrow morning 8am to 11am
   * Onno/NLT1: this morning SARA downtime, completed at 2pm: dCache was upgraded to 2.2 and the tape library was fixed for cartridge insertion issues
   * Kyle/OSG: ntr
   * Dimitri/KIT: ntr
   * Luca/Storage: ntr
   * David/Dashboard: ntr
   * Eva/Databases: ntr

AOB: (!MariaDZ) https://twiki.cern.ch/twiki/pub/LCG/WLCGOperationsMeetings/ggus-tickets.xls is up-to-date and attached to twiki WLCGOperationsMeetings. Complete ALARM drills are attached at the end of this page. There were 7 real ALARMS since the last MB, all from ATLAS, all for CERN, mostly storage and LSF issues.

---++ Tuesday

Attendance: local (David, Eva, Ignacio, Luca M, Maarten, Oliver, Yuri); remote (Gareth, Gonzalo, Jeremy, Jhen-Wei, Lisa, Lorenzo, Michael, Rob, Rolf, Ulf, Vladimir, Xavier M).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * T0
         * T0,CERN-PROD ~5000 transfer failures: SRMV2STAGER:SRM_FAILURE. GGUS:83361 solved: see GGUS:83360.
         * T0 problems in writing/retrieving data to/from t0merge and t0atlas pools. ALARM GGUS:83360 solved: configuration issue, all the diskservers were unreachable, fixed.
         * T0 delay with finishing 6000 running jobs, many pending jobs since 5pm, June 18. ALARM GGUS:83362 solved: filesystems unavailable issue fixed in <1h, June 18.
            * Luca: all those problems were due to the CASTOR disk servers not being reachable; it took ~45 minutes before almost all of them had recovered; 5 remained in a funny state, fixed ~20:00 yesterday evening
         * T0: LSF very slow response time to bsub (>3-5min.) affects event reco distribution to T1. ALARM GGUS:83375 assigned at ~7:40am, June 19. Looks better after 8am.
            * Ignacio: this time no culprit has been identified yet; snapshots and logs have been sent to Platform who are trying to reproduce the problem in their labs; the problem disappeared by itself, then it got a bit worse again later around noon
      * T1
         * NDGF-T1 files can't be pinned from tape issue. GGUS:83349 solved: HSM script failure pointed to tape problems, files restored, transfers succeeded.
         * FZK many transfer failures due to the FTS log partition being full. ALARM GGUS:83367 solved: cleaned up this morning.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC machine / CMS detector
         * Machine development
         * Preparing to move Tier-0 to a different machine with more disk space, testing today
      * CERN / central services and T0
         * The CERN Security Team discovered a security incident on the CMS HyperNews system: a security hole was exploited, resulting in a bunch of passwords being exposed online. All measures were taken within a few hours, including informing users and blocking unsafe access to resources: operations are not compromised, and a post-mortem is in progress
      * Tier-1/2:
         * KIT had tape issues yesterday which caused very low CPU efficiencies for running jobs (no writing and almost no reading from tape). This is fixed now. But now running almost no jobs because of the fairshare of CMS.
            * Xavier: the problem is back, it started failing during the night; currently no one can write to tape, because the failing library is the only one with free space! the other libraries are available for reading only; we will post an entry in the GOCDB
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * Some EOS-ALICE disk servers cannot be reached from outside CERN, leading to job failures and/or inefficiencies; being worked on.
         * Luca: it has been fixed just now
      * IN2P3: bad job efficiency being investigated, apparently due to issues with accessing local storage.
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * User analysis and prompt reconstruction and stripping at T1s ongoing
      * MC production at Tier-2s
      * [[https://ggus.eu/ws/ticket_search.php?show_columns_check%5B%5D=REQUEST_ID&show_columns_check%5B%5D=TICKET_TYPE&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=AFFECTED_SITE&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CREATION&show_columns_check%5B%5D=LAST_UPDATE&show_columns_check%5B%5D=SHORT_DESCRIPTION&ticket=&supportunit=all&vo=lhcb&user=&keyword=&involvedsupporter=&assignto=&affectedsite=&specattrib=0&status=open&priority=all&typeofproblem=all&mouarea=&radiotf=1&timeframe=lastyear&tf_date_day_s=1&tf_date_month_s=1&tf_date_year_s=2008&tf_date_day_e=&tf_date_month_e=&tf_date_year_e=&lm_date_day=16&lm_date_month=3&lm_date_year=2009&orderticketsby=GHD_INT_REQUEST_ID&orderhow=descending][<strong>New GGUS (or RT) tickets </strong>]]
      * T0:
         * CERN (GGUS:83351) batch system: We have a peak of submitted jobs every night.
      * T1:
         * RAL: Downtime
         * IN2P3: (GGUS:83391) Redundant jobs at cccreamceli05.in2p3.fr

Sites / Services round table:

   * ASGC - ntr
   * BNL - ntr
   * CNAF - ntr
   * FNAL - ntr
   * GridPP - ntr
   * IN2P3
      * will look into issues reported by LHCb and ALICE
   * KIT - nta
   * NDGF - ntr
   * OSG
      * one week from today (i.e. June 26) OSG central services will be patched during the maintenance window that day
   * PIC - ntr
   * RAL
      * today's planned network outage went OK, the access routers were updated
   * dashboards - ntr
   * databases - ntr
   * grid services - nta
   * storage - nta

AOB:

---++ Wednesday

Attendance: local (Andrea, Yuri, Oliver, David, Luca, !MariaDZ, Ignacio, Eva); remote (Michael/BNL, Ulf/NDGF, Lisa/FNAL, Pavel/KIT, Jhen-Wei/ASGC, Ron/NLT1, Gonzalo/PIC, Tiju/RAL, Rolf/IN2P3, Rob/OSG; Vladimir/LHCb).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * CENTRAL SERVICES
         * GGUS:82907 updated: still not possible to specify VO when sending a GGUS team ticket from comp@P1. In other words, to the right of the VO field there is neither an option nor a box to fill. [!MariaDZ: GGUS developer discovered what happens. It is OK if the certificate is loaded in the browser, VOMS knows it's ATLAS. It is not OK if the username and password are used because the certificate is not seen and the user is not associated to ATLAS. Will be fixed on Monday, ATLAS please test it on Monday.]
      * T0
         * NTR
      * T1
         * NTR
      * T2 + OTHERS
         * Running jobs failed at Prague after applying the SL5 python security update/patch (python-2.4.3-46.el5_8.2) from the FNAL repo, not in the CERN repo yet. https://cern.service-now.com/service-portal/view-request.do?n=RQF0111006 New jobs starting after the upgrade run well. We would like some reassurance that the CERN repo will not have this problem.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC machine / CMS detector
         * Machine development
      * CERN / central services and T0
         * NTR
      * Tier-1/2:
         * NTR
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * ALICE VOMRS service is failing, GGUS:83432 [Ignacio: fixed now, it was related to the tns database upgrade on Friday, the low level address was used instead of the tns alias and the port was changed. Eva: it is always better to use the tns alias if possible.]
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * User analysis and prompt reconstruction and stripping at T1s ongoing
      * MC production at Tier-2s
      * T0:
         * CERN (GGUS:83351) batch system: We have a peak of submitted jobs every night.
      * T1:
         * FZK-LCG2: (GGUS:83425) Jobs failed with "dcap: Last IO operation timeout."

Sites / Services round table:

   * Michael/BNL: ntr
   * Ulf/NDGF: ntr
   * Lisa/FNAL: ntr
   * Pavel/KIT: tape system is now fully operational
   * Jhen-Wei/ASGC: ntr
   * Ron/NLT1: had to reboot SRM to fix a dCache issue and a storage issue, now moving to a new kernel and a new driver
   * Gonzalo/PIC: announcement of a major full-day intervention on the core router on July 4th (urgent but presently waiting because of ICHEP pressure)
   * Tiju/RAL: investigating a problem with network traffic into RAL
   * Rolf/IN2P3: ntr
   * Rob/OSG: ntr
   * David/Dashboard: ntr
   * Eva/Databases: LHCb online database is being patched with security updates
   * Luca/Storage: ntr
   * Ignacio/Grid: still working with the platforms group on the problem with latency and submissions, looking into both network and storage
   * CERN VOMRS - Registration processing including renewals has been impossible for LHC VOs for the last one or two days. The situation will be corrected today or tomorrow at the latest.
   * GGUS: (!MariaDZ) As announced a week ago in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek120611#Wednesday (see section AOB/GGUS) please note that the June 2012 GGUS Release will take place next Monday 2012/06/25, especially focused on the [[https://ggus.eu/stat/stat.php][Report Generator]] redesign. *NB!* ATLAS is kindly asked to test, as of Monday pm, using GGUS username "CompAtP1Shift", the fix of Savannah:129256 (the VO value was not appearing on the GGUS ticket, if the submitter was authenticating via username/passwd instead of certificate). Conclusions of this test should be recorded in Savannah:129607.

AOB: none

---++ Thursday

Attendance: local (Andrea, Yuri, Stephen, Marcin, Mike, Luca, Ignacio); remote (Gonzalo/PIC, Ulf/NDGF, Lisa/FNAL, John/RAL, Jhen-Wei/ASGC, Ronald/NLT1, Rolf/IN2P3, Rob/OSG; Vladimir/LHCb).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * CENTRAL SERVICES
         * NTR
      * T0
         * NTR
      * T1
         * SARA: file transfer failures to various sites in CA with "failed to contact on remote SRM". GGUS:82490. Probably caused by a wrong kernel level on the SRM. Booted the SRM with a new kernel on June 20 (afternoon).
      * T2
         * GOEGRID->FZK transfer failures. Source error: failed to contact on remote SRM. GGUS:83444 solved (June 21, 8:17). A pool node got stuck and needed to be rebooted.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC machine / CMS detector
         * Machine development
      * CERN / central services and T0
         * NTR
      * Tier-1/2:
         * NTR
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * User analysis and prompt reconstruction and stripping at T1s ongoing
      * MC production at Tier-2s
      * T0:
         * CERN (GGUS:83351) batch system: We have a peak of submitted jobs every night.
      * T1:
         * FZK-LCG2: (GGUS:83425) Jobs failed with "dcap: Last IO operation timeout."
         * FZK-LCG2: (GGUS:83456) LHCb VO-box at !GridKa is down; Fixed

Sites / Services round table:

   * Gonzalo/PIC: ntr
   * Ulf/NDGF: ntr
   * Lisa/FNAL: ntr
   * John/RAL:
      * network issue mentioned yesterday has been understood and fixed this morning at 10am
      * next Wednesday will upgrade the database behind Castor, will be in GOCDB
   * Jhen-Wei/ASGC: ntr
   * Ronald/NLT1: ntr
   * Rolf/IN2P3: ntr
   * Rob/OSG: ntr
   * Mike/Dashboard: ntr
   * Luca/Storage: ntr
   * Ignacio/Grid: ntr
   * Marcin/Database:
      * yesterday patched LHCb online db
      * tomorrow will patch CMS online db and CMS active data guard

AOB: none

---++ Friday

Attendance: local (Andrea); remote ().

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * CERN CENTRAL SERVICES, T0
         * NTR
      * T1
         * !GridKa informed about some issues with the disk stack at ~5pm June 21, resulting in the 13 dCache disk-only pools being taken offline. No DT announcement in GOCDB. Didn't affect ATLAS transfer and production.
      * T2 + OTHERS
         * Running job failures after applying the SL5 python security update/patch (python-2.4.3-46.el5_8.2) https://cern.service-now.com/service-portal/view-request.do?n=RQF0111006 Discussed on Wed. June 20. Now this update/patch is in both the FNAL and CERN repos. ATLAS prepared a special pilot patch to fix this issue, but it will be implemented only on Monday in order to complete the urgent tasks. We recommend that sites also postpone the SL5/python update till Monday if possible.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -

Sites / Services round table:

AOB:

-- Main.JamieShiers - 22-May-2012
Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| ggus-data.ppt | r1 | 2546.5 K | 2012-06-18 - 11:54 | UnknownUser | Complete GGUS ALARM drills for the 2012/06/19 WLCG MB. |
Topic revision: r18 - 2012-06-22 - unknown
Copyright © 2008-2021 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.