(2009-07-31, HarryRenshall)
---+ Week of 090727

%TOC%

---++ WLCG Service Incidents, Interventions and Availability

| *VO Summaries of Site Availability* ||||*SIRs & Broadcasts*||
| [[http://dashb-alice-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=101&sites=CERN-PROD&sites=FZK-LCG2&sites=IN2P3-CC&sites=INFN-T1&sites=NDGF-T1&sites=NIKHEF-ELPROD&sites=RAL-LCG2&sites=SARA-MATRIX&algoId=6&timeRange=lastWeek][ALICE]] | [[http://dashb-atlas-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=403&sites=CERN-PROD&sites=FZK-LCG2&sites=IN2P3-CC&sites=INFN-T1&sites=NDGF-T1&sites=NIKHEF-ELPROD&sites=RAL-LCG2&sites=SARA-MATRIX&sites=TRIUMF-LCG2&sites=Taiwan-LCG2&sites=pic&algoId=21&timeRange=lastWeek][ATLAS]] | [[http://dashb-cms-sam.cern.ch/dashboard/request.py/historicalsiteavailability?siteSelect3=T1T0&sites=T0_CH_CERN&sites=T1_DE_FZK&sites=T1_ES_PIC&sites=T1_FR_CCIN2P3&sites=T1_IT_CNAF&sites=T1_TW_ASGC&sites=T1_UK_RAL&sites=T1_US_FNAL&timeRange=lastWeek][CMS]] | [[http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=501&sites=LCG.CERN.ch&sites=LCG.CNAF.it&sites=LCG.GRIDKA.de&sites=LCG.IN2P3.fr&sites=LCG.NIKHEF.nl&sites=LCG.PIC.es&sites=LCG.RAL.uk&algoId=82&timeRange=lastWeek][LHCb]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents][WLCG Service Incident Reports]] | [[https://cic.gridops.org/index.php?section=roc&page=broadcastretrievalD][Broadcast archive]] |

---++ GGUS section

   * [[https://gus.fzk.de/pages/metrics/download_escalation_reports_wlcg.php][GGUS Escalation reports every Monday]] (used for WLCG Service Report to MB)
   * Full search: https://gus.fzk.de/ws/ticket_search.php and select "VO" - this will give all tickets, including team & alarm
   * *LHCOPN* Tickets in GGUS: see https://gus.fzk.de/pages/all_lhcopn.php and change your selection criteria. Future actions are also listed.
      * If network group participation is necessary, please invite them in time.
   * [[https://gus.fzk.de/pages/ggus-docs/documentation/pdf/1541_FAQ_for_team_member_registration.pdf][Procedure to become a LHC Experiment VO TEAM member]]
   * Other recent [[https://gus.fzk.de/pages/faq.php][GGUS FAQs]]

---++ Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
   1. Dial +41227676000 (Main) and enter access code 0119168, or
   2. To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]

| *General Information* ||||
| [[http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/][CERN IT status board]] | M/W PPSCoordinationWorkLog | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][WLCG Baseline Versions]] | [[https://twiki.cern.ch/twiki/bin/view/EGEE/WlcgOsgEgeeOpsMeetingMinutes][Weekly joint operations meeting minutes]] |

Additional Material:
| *STEP09* | [[https://twiki.cern.ch/twiki/bin/view/Atlas/Step09][ATLAS]] | [[https://twiki.cern.ch/twiki/bin/view/Atlas/Step09Logbook][ATLAS logbook]] |[[https://twiki.cern.ch/twiki/bin/view/CMS/Step09][CMS]] [[http://cern.ch/planet-wlcg][WLCG Blogs]]|

<HR>

---++ Monday:

Attendance: local(Julia, Jean-Philippe, Ricardo, Eva, Jamie, Harry, Olof, David, Andrew, Alessandro, Gang, Simone);remote(Daniele, Michael, Andrea, John, Angela).

Experiments round table:

   * ATLAS (Ale) - not many issues: observed a problem with DESY, due to a misconfiguration by central OPS. Monitoring showed it and it was fixed Saturday night. Issues with 2 T2s: 1 RO (FR cloud) and 1 RU (NL). Security incident observed late Friday afternoon - already discussed with the security team. This morning the ATLAS elog was not accessible. James Casey answered - only a few hours' downtime - but there is no contact except one single person: should be improved.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] (Daniele) - full details in report!
     (Follow link) ASGC (Gang): data migration - due to a streaming problem(?); link connection quality - waiting for a reply from the Russian T2 admin.
   * ALICE - RAS
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] - (not updated - no report)

Sites / Services round table:

   * FZK (Angela) - informed this morning about a security intervention tomorrow morning. Short network interruptions between 7 and 8 CEST. Sorry for the late announcement. Declared "at risk" for this period.
   * BNL (Michael) - ATLAS farm will be upgraded tomorrow to SL5. Work on other services will also be performed, so expect to be down for the entire day. Already announced in GOCDB.

AOB:

---++ Tuesday:

Attendance: local(Ricardo, Eva, Jamie, Olof, Simone, Jean-Philippe, Andrea, Gavin, Julia, Gang, Harry, Roberto);remote(Tiju Idiculla (RAL), Xavier Mol (FZK), Ronald Starink (NIKHEF / NL-T1), Michael, Daniele Bonacorsi [CMS], Jeremy).

Experiments round table:

   * ATLAS - 1st: problem at CERN with acrontab - users A-L. DDM users are in this range. No interruption of service for DM & WM but monitoring was affected (SLS relies on acron). VO Boxes flagged as "downgraded(?)" - 30% availability. Now cured. Functional tests - which rely on the batch system at CERN - also affected (same reason). acron back in service mid-late morning. 2nd: dashboards - reminders / alarms from the monitoring system about ATLR DB & dashboard app - locked sessions. Got a variety of alarms in the morning - can someone from the dashboard team check? Julia - nothing seen on support list. Simone - will forward one. 3rd: ASGC status: reprocessing tested. Seems to be no more workload management problem. Jobs can run at T1 - peak at >1K jobs. Remaining issues: CASTOR showing XIO errors. Failure rate too high so test stopped. Ran 4K reprocessing jobs - quite good! Main worry: ATLAS cannot run when CMS is running & vice versa, because of access to the tape system: FIFO handling of tape requests & not enough tape drives.
     Makes little sense to run an ATLAS-only exercise - need ATLAS+CMS. Files lost at TRIUMF - thousands! MC files - can be regenerated. Very little interest. Believed to date from the SRM 1-2 migration. Cleaned from catalogs. Yesterday reported unavailability of 2 files (CASTOR) at CERN - problematic disk server - understood. Yesterday DDM bug - occurs when the Oracle b/e is unavailable etc. Client side: client not transaction safe. Transaction not rolled back - caused trouble 20 days ago - problem will be fixed. SARA has been in unscheduled downtime since this morning. What is the problem? Last point: FTS - noticed that transfers CERN-BNL are failing as the channel is not defined in FTS at CERN. Looks like BNL stopped publishing its sitename under the old name. Michael - part of site name consolidation. To the surprise of the experts here, the site name change affects transfers such that they fail. Name change reversed but will go ahead to work towards consolidation. BNL in scheduled downtime for the entire day anyway. Should not worry about these failures at this point in time. Gavin - understand go back to BNL_LCG2. Michael - yes, then BNL_ATLAS1 in future. Gavin - will affect CERN and all T1 FTS servers. Can generate a procedure for this.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] - 1st: on dashboard had issue with DESY CE not properly included. Fixed. ASGC: still a backlog of migration. ~2.5TB of data to be migrated to tape. Files ready for migration - tape drives free but migration not proceeding. Expect some news later... Some issues at PIC: CMSSW had some R/O f/s errors but these are solved. T2s: all tickets open apart from one at Caltech. >3/4 related to intense T2-T2 commissioning activity. Julia: need to upgrade schema for dashboard job monitoring - request to track output SE and some parameters of staging out. Probably need to coordinate this upgrade. Daniele - stable production from now until early September! Please suggest a duration for this for facilities hypernews.
     Some midday slot probably fine. Just propose a date... Andrea - acron also affected CMS: PhEDEx, DBS and FroNTier.
   * ALICE -
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] - 2 weeks ago a reprocessing activity was run after STEP'09 to test staging capacity: post-mortem made available, see LHCb report page. As for ATLAS & CMS, affected by the acron problem (SLS sensor & SAM test submission). A lot of MC production - up to 10 MC prods from physics groups. 15K concurrent jobs. Sustained for several days. 2 different problems: 1st at CERN, reported yesterday, with LHCb data pool - full - more disk capacity requested. Bernd agreed to provide 100TB. Got 5TB temporarily... CNAF - sqlite DB file in shared area again? - looks like the fix to copy the file locally to the WN before the job doesn't work for all s/w versions; site banned from production. Will chase problem up via GGUS ticket.

Sites / Services round table:

   * FZK - this morning's "at risk" intervention with a short network interruption was performed without problems. Did some security upgrades on dCache. Also no problems, production not interrupted.
   * NL-T1 - SARA started dCache upgrade this morning. Last info: SRM nodes wouldn't start. Will continue to work on this. If it cannot be solved before tomorrow, will be downgraded again. Version # 1.9.3-3.
   * GridPP - redoing Hammercloud tests that ran last week with some problems.
   * ASGC - globus XIO error as ATLAS pool only has 1 server. Has reached its limit. Waiting for another 2 servers expected mid August. Tape drives: # jobs sent is based on # CPUs. If other resources cannot be increased as fast this will be a problem - big increase in CPU resources recently. Simone - should rerun tests after mid August - TBC with Jason.
   * DB - LHCb online DB down since ~21:00 last night. Someone pressed a red emergency button - everything stopped! Trying to recover DB from backup.
     SLC5 upgrade postponed - problem understood, testing the solution before going ahead with migrations.

AOB:

---++ Wednesday

Attendance: local(Julia, Gavin, Eva, Antonio, Harry, Jamie, Olof, Ale, Edoardo, Jean-Philippe, Gang, Simone, Ricardo);remote(Ronald, Michael, Andrea, Xavier, Tiju, Daniele).

Experiments round table:

   * ATLAS - email from Simone:
<verbatim>
1) BNL is back in production after yesterday dCache upgrade. All configuration changes associated with the site name consolidation had to be reverted, i.e. the US ATLAS Tier-1 center is back to BNL-LCG2 (from Michael). Therefore FTSes can transfer files from/to BNL, but it remains an open question of hw to have a smooth transition once the name is changed.

More details as to the results of the investigation (summarized by John Hover (BNL)) became available later in the day.

The issue was ultimately traced back to the way a particular GIP probe is configured within CEMon. Despite assurances that the change could be made solely in OIM (OSG's GOCDB), there was an inconsistency that made glite-query-sd fail.

The wider issue is that OSG is moving to a model where "resource groups" correspond to EGEE "sites", whereas previously OSG "resources" roughly corresponded to sites. The OSG tools have not all made this shift consistently, and no one seems to fully understand all the ramifications.

Deeper technical details and troubleshooting discussion can be found at this Tiwki page which is being used in place of an email thread:
https://twiki.grid.iu.edu/bin/view/Operations/AtlasBdiiIssues

Gavin provided scripts and procedure to change a site name in FTS
https://twiki.cern.ch/twiki/bin/view/LCG/FtsProcedureSiteNameChange

It is very good news that a clear recipe and scripts now exist to quickly switch sitenames for FTS. That should make our eventual consolidation much easier. It must be understood how to do this transparently.

2) SARA this morning was still having problems with the dCache upgrade.
The information from Hurng might be interesting for other sites:
"""
After the upgrade of dCache, SARA was suffered by a problem that the new version of dCache writes out additional information to the billing file, which then cause the partition of the main dCache node full (instead of 40 MB/day .. it grows up to 5 GB after few hours). SARA people is working on it and contacting developers for a good solution. Another downtime was claimed.
"""
</verbatim>
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] - progress on T1 link commissioning ASGC-RU site. DPM layout changed. Expect commissioning to progress soon. No other progress on open tickets, e.g. ASGC migration, PIC load test T2 transfer. T2s: quite a lot of progress in T2-T2 link commissioning. High level of activity in US and UK. GRIF/Roma tickets closed - good progress in this activity!
   * ALICE -
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] - Currently running 16K concurrent jobs for 6 different physics MC productions. In this picture a snapshot of the last 24 hours of running jobs. T0 issues: intermittent timeouts retrieving tURL from SRM at CERN (verified for the RDST space, most probably a general issue). Observed LFC access timeouts this morning. SLS does not seem to indicate problems. T1 issues: issue with SQLite DB file in the shared area: provided CNAF people with all suggestions to fix this problem as done at GridKA. Waiting for them.

Sites / Services round table:

   * SARA (Ronald) - did upgrade due to vulnerability. Try not to make a habit of it! This morning and last night had a problem with a partition filling up with logging info - about orphaned files (on disk but not in namespace). Lost but taking up space. Have to be cleaned up. This new version of dCache complains at quite a high rate. There is a separate partition for logging but this was written to the billing file!
     May affect sites upgrading from dCache 1.9.1 or older to 1.9.2 or newer! (If the site also has a lot of orphaned files.)
   * FZK - shortly after midnight the dCache SRM space manager for all VOs apart from ATLAS died after running out of memory. Only restarted at working time. Monitoring confused by errors - not always critical, sometimes warning or unknown - hence on-call not activated.
   * BNL - as reported earlier, site name consolidation failed and is under investigation with OSG colleagues. OS upgrade to 64bit SL5, NFS server (split between analysis and production now). Upgrade of AFS server - all completed within announced downtime.
   * ASGC - CMS savannah ticket now understood - the assigned tape pool was full. Added 80TB of space and migration then started again. About 400 files migrated in last few hours. (Thanks Gang from Daniele!)
   * DB - LHCb online DB back in prod since yesterday evening. During last night ASGC ATLAS DB down until this morning - some problems with storage and listener. Waiting for Jason to restart propagation CERN-Taiwan.
   * SRM b/e DBs (Gav) - patch scheduled tomorrow at 15:00 CEST - should be transparent!
   * Release update (Antonio): BDII V5 released last week. May be of interest for services on SL5. Some fixes for an issue found at one of the production sites - affects BDII on SL5 - some sites may disappear due to ongoing change of GLUE schema. If you run SL4 BDII the only effect is a corruption of the LDAP tree. SL5 top-level BDII should run this fix, released in gLite 3.2 update 04. New version of LFC & DPM on SL5, new gfal libs, myproxy client on WN. For the future, preparing gLite 3.1 update 52 on SL4. Update to CREAM CE. Will start to be published correctly in IS. New host certs for VOMS server. Current certs expire end August - voms.cern.ch certs to be replaced and will be in next release.
   * Network - upgrade DNS at CERN due to vulnerability. No impact except for load balanced alias 07:00 - 07:30.
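
Illustration for the FZK item above, where on-call was not activated because the failing probe alternated between critical, warning and unknown: a minimal Python sketch of an escalation rule that pages on any critical result and also on a run of consecutive non-OK results. This is not FZK's monitoring configuration, which the minutes do not describe; =Status=, =EscalationPolicy=, =first_page_index= and the threshold of 3 are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Probe states as named in the minutes (hypothetical enum)."""
    OK = "ok"
    WARNING = "warning"
    UNKNOWN = "unknown"
    CRITICAL = "critical"


@dataclass
class EscalationPolicy:
    """Hypothetical rule: page on CRITICAL, or after `threshold`
    consecutive non-OK results of any kind."""
    threshold: int = 3  # assumed value, not taken from the minutes
    _streak: int = 0    # current run of consecutive non-OK results

    def observe(self, status: Status) -> bool:
        """Record one probe result; return True if on-call should be paged."""
        if status is Status.OK:
            self._streak = 0
            return False
        self._streak += 1
        # WARNING and UNKNOWN only page once they persist; CRITICAL pages at once.
        return status is Status.CRITICAL or self._streak >= self.threshold


def first_page_index(results, threshold=3):
    """Return the index of the first result that triggers a page, or None."""
    policy = EscalationPolicy(threshold=threshold)
    for index, status in enumerate(results):
        if policy.observe(status):
            return index
    return None
```

With this rule a service that never reports critical but stays in warning or unknown for three probes in a row still reaches the on-call person, while a single transient warning followed by OK does not.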
AOB:

---++ Thursday

Attendance: local(Ricardo, Jamie, Julia, Eva, Jean-Philippe, Roberto, Harry, Gang, Simone, Alessandro, Miguel);remote(Xavier, John, Michael, Daniele, Jeremy).

Experiments round table:

   * ATLAS - nothing!
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] - update on ASGC ticket: migration backlog being digested. Just 7 files waiting now. Problem now seems to be solved. Some progress especially in T2-T2 link commissioning - details in twiki.
   * ALICE -
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] - Productions that had been running at full regime over the previous days have now been stopped because the failover is also getting full at all T1s, due to the lhcbdata space at CERN being full. CERN disk capacity is today the main problem (SLS shows this too) and has become a show stopper. LHCb would like to hear from FIO what the plans are to get the extra 100TB agreed on Monday. Ricardo: machines are being prepared, still have to be installed and configured but should start to become available next week. Roberto: CNAF working with LHCb & Roberto on SQLite issue.

Sites / Services round table:

   * FZK - same problem as yesterday but this time on-call was called as monitoring has been improved. SRM service restarted... Local monitoring still failing as cert outdated. Normal SAM & OPS tests succeeded - production not affected.
   * BNL (John R. Hover [jhover@bnl.gov])
<verbatim>
Hello all,

As Michael said, all changes were rolled back so nothing needed to change on the EGEE side.

The issue was ultimately traced back to the way a particular GIP probe is configured within CEMon. Despite assurances that the change could be made solely in OIM (OSG's GOCDB), there was an inconsistency that made glite-query-sd fail.

The wider issue is that OSG is moving to a model where "resource groups" correspond to EGEE "sites", whereas previously OSG "resources" roughly corresponded to sites.
The OSG tools have not all made this shift consistently, and no one seems to fully understand all the ramifications.

Deeper technical details and troubleshooting discussion can be found at this Tiwki page which is being used in place of an email thread:
https://twiki.grid.iu.edu/bin/view/Operations/AtlasBdiiIssues

It is very good news that a clear recipe and scripts now exist to quickly switch sitenames for FTS. That should make our eventual consolidation much easier.

Cheers,

--john
</verbatim>
   * LHCb - also ticket for SRM problem at CERN due to disk space full. Fixed now.

AOB:

---++ Friday

Attendance: local(Luca, Harry, Roberto, Alessandro, Jean-Philippe, Ricardo, Gavin, Gang);remote(Xavier/FZK, John/RAL).

Experiments round table:

   * ATLAS - 1) Problems staging data in at RAL with some files waiting since 26 July. Ticket raised and is being worked on. 2) Planning to add a third machine into the central catalogue pool next Monday. Should be transparent. 3) The LCG tags interface does not work on SL5 worker nodes so ATLAS cannot update their WN software on SL5. A new version 0.4 that does work will be released with gLite 3.2 so ATLAS will meantime package this with their software.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] - Apologies for absence today.
   * ALICE -
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] - 1) Monte Carlo production has been put on hold due to the LHCBDATA space token being full at CERN. An increase is expected next week - an advance warning would be appreciated. 2) IN2P3 is publishing CE names in a way that prevents LHCb agents from working. 3) Two UK Tier 2 sites (Lancaster U. and Queen Mary London U.) have a high job failure rate.

Sites / Services round table:

   * FZK: SRM out of memory problems and associated 'confused' monitoring both now fixed.
   * RAL: The lcg monbox disk replacement yesterday during a scheduled outage subsequently failed leading to an additional unscheduled outage.
   * CERN Databases: The LHCb Online database will be down next Monday and Tuesday for a scheduled maintenance at P8: "upgrade of the core-router in the SX8 network and the electrical installation".

AOB:
Topic revision: r10 - 2009-07-31 - HarryRenshall