(2014-02-06, MariaDimou)
---+!! Week of 140203

%TOC%

---++ WLCG Operations Call details

   * At CERN the meeting room is [[https://maps.cern.ch/mapsearch/?centerX=2492565&centerY=1121070&centerScale=2500][513]] R-068.
   * For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
      1 Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
      1 To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]
   * In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found [[https://indico.cern.ch/conferenceDisplay.py?confId=287280][here]]. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

---++ General Information

   * The SCOD rota for the next few weeks is at ScodRota
   * General information about the WLCG Service can be accessed from the [[WLCGOperationsWeb][Operations Web]]

---++ Monday

Attendance:

   * local: !MariaD (SCOD), Maarten (ALICE), Massimo (CERN Data Mgnt), Vitor (CERN Grid Services), Felix (ASGC).
   * remote: Roger (NDGF), Sang-Un (KISTI), Michael (BNL), Matteo (CNAF), Elena (ATLAS), Eric (CMS), Onno (NL_T1), Kyle (OSG), Tiju (RAL), Alexei (LHCb), Lisa (FNAL), Pepe (PIC).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCOperationsDailyReports2014][reports]] ([[https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCOperationsDailyReports2014?raw=on][raw view]]) -
      * Central services/T0
         * CERN_PROD: Transfers were failing with permission denied errors on Monday morning. Noticed and fixed by CERN team. Thanks.
      * T1
         * TAIWAN: heavy SRM load caused transfer failures on Sunday (GGUS:100904). Fixed.
         * FZK: staging errors for DATATAPE on Friday (GGUS:100885). Fixed by issuing a retry for all outstanding stage requests for ATLAS and restarting tape storage software.
         * PIC: problem with one disk pool, which caused transfers to fail on Friday (GGUS:100874). dCache pool restarted.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports?raw=on][raw view]]) -
      * T1/T2/Others: Business as usual. Smooth running.
      * Preparing for DBS (data catalog) upgrade on Feb 10. That week will see little to no central processing.
      * One problem: ARGUS cluster issue(s) (DNS? and then a new, uninitialized node in the cluster) caused problems with analysis jobs running.
         * Debugged by CMS analysis operations. It would be better to have SLS monitoring of the ARGUS cluster. Ticket is GGUS:100870
   * ALICE -
      * sites please take note of the necessary WLCG VOBOX update announced last Fri
         * see details below
      * KIT
         * the number of corrupted files has _shrunk_ by 45% to 26126
         * 21k files have been salvaged after all, thanks very much!
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports?raw=on][raw view]]) -
      * Mostly simulation and user jobs. Smooth running over most of the grid.
      * T0: Pilots aborted at ce202.cern.ch today. Ticket is GGUS:100902
      * T1: NTR
      * T2: NTR

Sites / Services round table:

   * ASGC: ntr
   * BNL: ntr
   * FNAL: ntr
   * OSG: ntr
   * KISTI: ntr
   * NL_T1: ntr
   * CNAF: ntr
   * PIC: ntr
   * NDGF: ntr
   * !IN2P3: ntr (sent by email)
   * !RAL: Tomorrow, between 8 and 10 am UK time, tape system intervention. Site set at risk in GOCDB.
   * CERN:
      * Grid Services: ntr
      * Data Mgnt:
         * Problem accessing EOS from outside CERN. Now solved. Lasted for 1h 15'.
         * ROOT access to CASTOR is now switched off. Hardly 10 users concerned. They have been informed about alternative access methods.

AOB:

   * WLCG VOBOX
      * as announced on the wlcg-operations list last Fri, please ensure your WLCG VOBOX instances generate host proxies with *1024-bit* keys!
      * preferably update Globus; correct minimal versions of the affected rpm:
         * =globus-proxy-utils-5.0-6= (Globus 5.0)
            * [[http://repository.egi.eu/sw/production/umd/3/sl6/x86_64/updates/globus-proxy-utils-5.0-6.el6.x86_64.rpm][UMD-3 SL6 rpm]] (can be installed manually also on UMD-2 machines)
            * [[http://repository.egi.eu/sw/production/umd/3/sl5/x86_64/updates/globus-proxy-utils-5.0-6.el5.x86_64.rpm][UMD-3 SL5 rpm]] (ditto)
         * =globus-proxy-utils-5.2-1= (Globus 5.2)
            * from EPEL for EMI-3 and EMI-2
      * otherwise one can apply this quick hack:
<verbatim>
perl -pi.bak -e 's/ -q / -bits 1024 $&/' \
    /etc/vobox/templates/voname-box-proxyrenewal \
    /etc/init.d/*-box-proxyrenewal
</verbatim>

---++ Thursday

Attendance:

   * local: !MariaD (SCOD), Maarten (ALICE), Massimo (CERN Data Mgnt), Vitor (CERN Grid Services), Felix (ASGC), Pablo (GGUS), Przemek (DB), Alexandre (Dashboards).
   * remote: Roger (NDGF), Michael (BNL), Saverio (CNAF), Eric (CMS), Dennis (NL_T1), Kyle (OSG), Gareth (RAL), Alexei (LHCb), Lisa (FNAL), Pepe (PIC), Jeremy (GridPP), Rolf (IN2P3), Pavel (KIT).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCOperationsDailyReports2014][reports]] ([[https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCOperationsDailyReports2014?raw=on][raw view]]) -
      * Central services/T0
         * IT and DE clouds moved to FTS3
      * T1
         * CERN-PROD: CVMFS inside CERN faulty (GGUS:100928), see https://cern.service-now.com/service-portal/view-outage.do?from=CSP-Service-Status-Board&&n=OTG7278
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports?raw=on][raw view]]) -
      * T1/T2/Others: Business as usual. Smooth running.
      * Preparing for DBS (data catalog) upgrade on Feb 10. That week will see little to no central processing. Ramp-down has begun.
      * We are encouraging all our sites to switch to the FTS3 server at RAL for load testing. Begins in a week or so.
   * ALICE -
      * CNAF
         * tape SE updated to xrootd v3.3.4 (on Jan 28) with new checksum plugin successfully validated (Feb 5) with test transfers, thanks!
      * KIT
         * investigating why many jobs read a lot of data remotely from CERN
      * RRC-KI-T1
         * memory tuning for jobs ongoing, thanks!
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports?raw=on][raw view]]) -
      * Mostly simulation and user jobs. Smooth running over most of the grid.
      * T0: CVMFS caused 50%+ job failures on Mon and Tue, back to normal since Wed
      * T1: NTR

Sites / Services round table:

   * ASGC: ntr
   * BNL: ntr
   * FNAL: ntr
   * OSG: ntr
   * KISTI: not connected or I couldn't hear... sorry!
   * NL_T1: ntr
   * CNAF: ntr
   * PIC: ntr
   * NDGF: ntr
   * !IN2P3: ntr
   * !RAL: Now investigating CASTOR failing tests. Scheduling FTS3 tests.
   * CERN:
      * Grid Services: ntr
      * Data Mgnt: ntr
      * Dashboards: The following FTS servers do not report information properly.
         * These do not appear in the year log (is it possible they still have the old broker name hardcoded?): fts02.usatlas.bnl.gov, w-fts001.grid.sinica.edu.tw, fts-kit.gridka.de, fts-fzk.gridka.de
         * These have authentication error in the broker: fts00.grid.hep.ph.ic.ac.uk, fts3.grid.sara.nl (empty or bogus username).
      * Databases: Here is a short description of last Thursday's 2014/01/30 LCGR problems: _As a part of preparation work to migration and upgrade of LCGR database, the replication of the database has been established. Because of a misconfiguration of database archived log files deletion policy, the space on production database server has been exhausted and the database got stuck at around 7.20AM. The problem has been corrected at 9.30AM and the database came back. After the database went up, all applications, which cached their data during the database outage, started to write to the database the content of their cache. The LCGR database was not able to handle such strike of traffic in one moment, so it got stuck again. Another reboot of the database was required. The manually synchronized restart of applications allowed the database to come back to normal operation. The preparation work for the migration was continued during the day and around 5PM we hit an Oracle bug, which caused the database not to accept new connections. The existing ones were working properly. Around 6PM, restart of one of database nodes and cut of connection between the production and replicated database helped to solve the problem._
   * GGUS:
      * Suggestion to remove three fields from the 'Ticket Submission Form' (see [[https://twiki.cern.ch/twiki/pub/LCG/WLCGDailyMeetingsWeek140203/submitForm.png][attachment]]). Those fields are hardly ever used, and they are in any case concatenated to the body of the issue. The meeting decided these fields can be deleted.

AOB:

   * !OpenSSL issue
      * [[https://operations-portal.egi.eu/broadcast/archive/id/1079][EGI broadcast]] sent Feb 4 describing current state of affairs and recipes for cures
      * Sites using *HTCondor as batch system* may need to apply one of these configuration changes for now:
         * =DELEGATE_JOB_GSI_CREDENTIALS = False=
         * =GSI_DELEGATION_KEYBITS = 1024=
      * !HTCondor v8.0.6 will have the default increased to 1024
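
For sites that want to see what the Perl quick hack from Monday's AOB does before running it in place, here is a minimal Python sketch of the same per-line substitution. It is only an illustration: the function name =add_key_bits= and the sample command line are hypothetical, and the real content of the proxy renewal scripts is not reproduced here.

```python
import re

# Same pattern as the Perl one-liner: the literal flag " -q " with its
# surrounding spaces.
QUIET_FLAG = re.compile(r" -q ")


def add_key_bits(line: str, bits: int = 1024) -> str:
    """Insert ' -bits <bits> ' in front of the first ' -q ' on the line."""
    # Perl's s/ -q / -bits 1024 $&/ has no /g flag, so only the first match
    # per line is rewritten; \g<0> plays the role of Perl's $& (whole match).
    return QUIET_FLAG.sub(rf" -bits {bits} \g<0>", line, count=1)


if __name__ == "__main__":
    # Hypothetical command line, NOT taken from the real vobox scripts.
    sample = "hypothetical-proxy-command -q -other-option"
    print(add_key_bits(sample))
```

As with the Perl original, the rewritten line ends up with two spaces in front of =-q=, and lines without a =-q= flag are left unchanged. Applying the substitution twice would insert the option twice, so the hack should be run only once per file.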
Topic revision: r15 - 2014-02-06 - MariaDimou