---+ Week of 090928

%TOC%

---++ WLCG Service Incidents, Interventions and Availability

| *VO Summaries of Site Availability* ||||*SIRs & Broadcasts*||
| [[http://dashb-alice-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=101&sites=CERN-PROD&sites=FZK-LCG2&sites=IN2P3-CC&sites=INFN-T1&sites=NDGF-T1&sites=NIKHEF-ELPROD&sites=RAL-LCG2&sites=SARA-MATRIX&algoId=6&timeRange=lastWeek][ALICE]] | [[http://dashb-atlas-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=403&sites=CERN-PROD&sites=FZK-LCG2&sites=IN2P3-CC&sites=INFN-T1&sites=NDGF-T1&sites=NIKHEF-ELPROD&sites=RAL-LCG2&sites=SARA-MATRIX&sites=TRIUMF-LCG2&sites=Taiwan-LCG2&sites=pic&algoId=21&timeRange=lastWeek][ATLAS]] | [[http://dashb-cms-sam.cern.ch/dashboard/request.py/historicalsiteavailability?siteSelect3=T1T0&sites=T0_CH_CERN&sites=T1_DE_FZK&sites=T1_ES_PIC&sites=T1_FR_CCIN2P3&sites=T1_IT_CNAF&sites=T1_TW_ASGC&sites=T1_UK_RAL&sites=T1_US_FNAL&timeRange=lastWeek][CMS]] | [[http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=501&sites=LCG.CERN.ch&sites=LCG.CNAF.it&sites=LCG.GRIDKA.de&sites=LCG.IN2P3.fr&sites=LCG.NIKHEF.nl&sites=LCG.PIC.es&sites=LCG.RAL.uk&algoId=82&timeRange=lastWeek][LHCb]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents][WLCG Service Incident Reports]] | [[https://cic.gridops.org/index.php?section=roc&page=broadcastretrievalD][Broadcast archive]] |

---++ GGUS section

   * [[https://gus.fzk.de/pages/metrics/download_escalation_reports_wlcg.php][GGUS Escalation reports every Monday]] (used for WLCG Service Report to MB)
   * Full search: https://gus.fzk.de/ws/ticket_search.php and select "VO" - this will give all tickets, including team & alarm
   * *LHCOPN* Tickets in GGUS: see https://gus.fzk.de/pages/all_lhcopn.php and change your selection criteria. Future actions are also listed.
   * If network group participation is necessary, please invite them in time.
   * OSG items selected from the [[https://gus.fzk.de/pages/metrics/download_escalation_reports_roc.php][GGUS escalation reports]]. (MariaDZ) No overdue tickets assigned to OSG in this week's reports.
   * [[https://gus.fzk.de/pages/ggus-docs/documentation/pdf/1541_FAQ_for_team_member_registration.pdf][Procedure to become a LHC Experiment VO TEAM member]]
   * Other recent [[https://gus.fzk.de/pages/faq.php][GGUS FAQs]]

---++ Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
   1. Dial +41227676000 (Main) and enter access code 0119168, or
   2. To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]

| *General Information* ||||
| [[http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/][CERN IT status board]] | M/W PPSCoordinationWorkLog | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][WLCG Baseline Versions]] | [[https://twiki.cern.ch/twiki/bin/view/EGEE/WlcgOsgEgeeOpsMeetingMinutes][Weekly joint operations meeting minutes]] |

| Additional Material: | *STEP09* | [[https://twiki.cern.ch/twiki/bin/view/Atlas/Step09][ATLAS]] | [[https://twiki.cern.ch/twiki/bin/view/Atlas/Step09Logbook][ATLAS logbook]] |[[https://twiki.cern.ch/twiki/bin/view/CMS/Step09][CMS]] [[http://cern.ch/planet-wlcg][WLCG Blogs]]|

<HR>

---++ Monday:

Attendance: local(Lola, Eva, Roberto, Maria, Simone, Andrea, Olof, Gang, Jan);remote (Daniele, Kyle, Angela, Ronald, Brian, Alexei, Jim, Gareth, Michael, Brian)

Experiments round table:
   * ATLAS (Alexei): Two main activities going on: transfer of reprocessed cosmics data and transfer requests to support the ATLAS Users Analysis Test (UAT). 150 TB being distributed to all Tier1s at the moment (50 for analysis, 100 for reprocessing) (bulk at BNL)
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] - (Daniele): still open ticket for gLite WMS.
     CMS MC production has problems (suffering from error 10). It is a CondorG/GRAM problem that is being investigated by the developers, and a workaround is available (Savannah 110143). To report for T1s: the backlog digested at ASGC; transfer issues CIEMAT, DESY -> IN2P3; and SAM analysis test and job failures at IN2P3, which are being looked at. For T2s: transfer issues from Florida -> CNAF, FNAL -> UCSD, IC -> RAL, and SAM analysis test failures at T2_TR_METU, IC and Lisbon. More details in the attached report.
   * ALICE - No report.
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] Roberto: No activity in the system apart from a few hundred user jobs at the T1s. Certificate expired on volhcb04. No alarm as the server is in maintenance (as requested by LHCb last August).

Sites / Services round table:
   * ASGC (Gang): SRM degradation fixed today. 25 disk servers rebooted this morning because of some FS problem. SAM errors for job submission (for 6 hours) fixed. Also Streams replication propagation is disabled because of a listener problem with the database at ASGC. To be checked. Yesterday's transfer problems with ATLAS are now fixed (after the power maintenance). FTS failures with a delegation credentials error. This affects the replication to ASGC.
   * FZK (Angela): All OK. The problem on the worker nodes reported on Friday is fixed.
   * BNL (Michael): Maximum number of connections exhausted on the LFC (Jean-Philippe and the BNL experts are looking into this).
   * NL-T1 (Ronald): Nothing to report.
   * RAL (Gareth): Tomorrow "at risk" intervention on the Castor information provider.
   * FIO Ops: SRM transfer failures for ATLAS due to a disk server in maintenance. "At risk" intervention tomorrow morning on the Castor storage switches.

AOB: (MariaDZ) The periodic ALARM tests to T1s must be sent this week.
Instructions in https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru

---++ Tuesday:

Attendance: local(Jean-Philippe, Jamie, Harry, Dirk, Eva, MariaG, Simone, Olof, Alessandro, Jan, Gang, Andrea, Diana, MariaD);remote(Jeremy, Gareth, Daniele, Michael, Ronald, Brian, Angela).

Experiments round table:
   * ATLAS (Simone) - Reprocessing from ESD finished (bulk done). Output distribution to the 10 T1s is now ongoing. Also analysis tests for a total of 150 TB/site. 3 Gb/s total throughput; 1 week required for 1 PB. Delegation credential problems reported yesterday fixed at ASGC. Also Castor DB overload at ASGC, now fixed. No transfers to Lyon as the site is in scheduled downtime (Chimera upgrade).
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] (Daniele) - The ticket RAL had been asking for on Estonia -> RAL transfers is closed (Savannah ticket # 110138). The gLite WMS problem reported yesterday is still being addressed. For details and updates see Savannah ticket # 110143. Good progress with the open tickets to the T1s, which are all closed. One new ticket (Savannah ticket # 110207) opened to ASGC: "no default service classes defined" error in transfers from T1_TW_ASGC to T2_IT_Rome. Highlights from T2s: several tickets are in the process of being closed. Worth mentioning is the still slow responsiveness of a few Russian T2s, Brazil UERJ, ...
   * ALICE - No report
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] - Today the FEST week starts. The activity represents the expected work flow at T0 and T1s. The XPRESS stream will be analyzed first, followed by the DQ WG giving the green light and then the FULL stream at 1.8Hz HLT data acquisition rate. The stripping work flow has to be commissioned (top priority task for the coming weeks).

Sites / Services round table:
   * ASGC (Gang): Many disk servers rebooted yesterday after the scheduled downtime on Sunday.
     The database seems to be corrupted after the Sunday intervention. A point-in-time recovery was performed but the physics database service team was not informed about this, and Streams operations cannot be restarted at the moment. A SIR is required from the site. Communication also has to be improved.
   * RAL (Gareth): Space token problem caused by a disk server misconfiguration for ATLAS MCdisks. SAM test problems under investigation. Checksum verification of files transferred with FTS fails because of the ext3 configuration; this is being discussed at the request of ATLAS (Shawn will give an update in the next days). Plan to migrate to SRM 2.8? Yes, but waiting for a patch. Can the LHCb migration to 64bit be postponed by one week?
   * BNL (Michael): Reprocessing and data distribution for the analysis exercise going on: the load on the storage element is 40 GBits/sec, combined read/write. A large fraction of the ESD data has been processed and is delivered at 1GByte/sec to other T1s at a sustained rate. The site has shown no problem in coping with this load, which actually seems to be below its capability. Well done!
   * FZK (Angela): Nothing to report.
   * Services Round Table:
      * Databases: the RHEL5 migration, foreseen for the summer and delayed because of several bugs, is now postponed until after the LHC run has finished. New hardware will be installed on RHEL4.

AOB (MariaD): Please look at https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Site_scheduled_interventions for instructions on interventions, as decided at the SA1 coordination meeting at EGEE'09.

---++ Wednesday

Attendance: local (Eva, Jamie, MariaG, MariaD, Jan, Harry, Gavin, Oliver, Julia, Lola, Gang, Nick, Jean-Philippe, Antonio);remote (Daniele, Gareth, Onno, Michael, Jim, Alexei, Angela).

Experiments round table:
   * ATLAS (Alexei) - Data transfers from BNL -> T1s for data reprocessing and analysis going on. Also starting data deletion at sites to prepare for MC production starting next week.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] (Daniele) - Nothing special to highlight today with respect to what was already discussed yesterday. Most of the activities are at T2 level because of the analysis challenge starting next week. Harry commented on the 2k pool accounts to be created. 999 were done. The rest will follow after the analysis challenge.
   * ALICE (Patricia) - Not much happening last week because of a MC cycle change. Activities restarted last night. No problems so far.
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] (Roberto) - No reconstruction jobs in the system because of a problem with the options file used by the DaVinci application. A few thousand jobs currently in the system from distributed user analysis activity. Problems at RAL with file opening, which takes a long time (about 1 hour). A GGUS ticket will be opened at RAL's request to track the problem.

Sites / Services round table:
   * ASGC (Gang): Access to the 3D database is possible but the database seems to be corrupted. The problem is being investigated. SIR requested.
   * RAL (Gareth): The LHCb database upgrade to 64bit is postponed to the week after next, when the DBA returns from holidays. The SAM test failures reported yesterday seem to be fixed but investigations are still going on. The SRM upgrade to the latest 2.8.1 patch will happen after the tests at CERN have finished. Checksum is not in 2.1.7, which is the version deployed at RAL for the LHC run.
   * BNL (Michael): Slowness of data export to T1s is currently being investigated. More news tomorrow.
   * FZK (Angela): Nothing to report.
   * NL-T1 (Onno): Yesterday afternoon SARA had SRM problems while running a stage test. The stage test was stopped and the SRM restarted; since then no problem. The dCache developers are still searching for a solution.
   * Services round table:
      * FIO Ops (Jan): All LCG CEs planning an upgrade for next week.
Release report (Antonio): [[https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Deployment_Status][deployment status wiki page]]. The version of the lcg-CE to be recommended to the production sites is lcg-CE 3.1.35-0.

AOB (Diana): EGEE has a new ROC. Latin American sites will be taken care of by the ROC_LA.

---++ Thursday

Attendance: local(Julia, Jean-Philippe, Maria, Eva, Olof, Ricardo, Jan, Jamie, Harry, Andrea, Edoardo, Roberto, Gang); remote(Jeremy, Michael, Ronald, Gareth, Daniele, Xavier).

Experiments round table:
   * ATLAS (Simone by email)
      * SARA MCDISK space token is full. ATLAS is working out a strategy for how to proceed.
      * It would be good to know the status of Lyon. From what I understand, the intervention should be over today.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] (Daniele) - Update on the gLite WMS globus/gram problem: significant decrease of job aborts. More feedback from debugging expected soon. About T2s (in the spotlight because of the analysis exercise): T2-T2 link commissioning for the muon and tracker physics groups in progress. Three tickets for Florida, TR_METU and Lisbon closed, and also progress with the responsiveness of some sites.
   * ALICE - No report.
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] (Roberto) - FEST production still paused for a fix of the DaVinci option file. Transfers from P8 to CASTOR going on. Also MC productions, with about 20k concurrent jobs in the system.

Sites round table:
   * ASGC (Gang): No news.
   * RAL (Gareth): Upgrade to SRM 2.8.1 and timeouts observed on the ATLAS Castor instance are being investigated.
   * NL-T1 (Ronald): GGUS ticket on FTS investigated and solved by Simone.
   * BNL (Michael): The slowness of the functional data transfer tests is understood.
     The distribution of the Monte Carlo samples for the user analysis test (UAT) and the Functional Test (FT) data both use the same DQ2 input share, which is currently fully utilized. The data volume for the sample distribution is 45TB * 10 ATLAS T1s, so the FTs get swallowed by the MC distribution.
   * FZK (Xavier): Nothing to report

Services round table:
   * FIO Operations (Ricardo): 4 new CEs (version 3.1.35, SLC5) will be made available next week.
   * Database Operations (Eva): Firmware upgrade and parameter configuration change scheduled for next week on a faulty disk array of the LHCb online cluster.
   * Network Operations (Edoardo): Connectivity was lost with CNAF (re-established) and FNAL following the removal of an old prefix (scheduled for today after the router upgrade).

   $ *CERN !VOMS*: Since 21:00 UTC Wednesday the LHC VOMS service on both voms.cern.ch and lcg-voms.cern.ch has been experiencing failures and restarts. The cause appears to be high load triggering monitoring and service restarts.
      * The service is basically available, though you may be unlucky.
      * No user reports have been made on failures as yet.
      * Under investigation.

<img src="%ATTACHURLPATH%/voms.png" alt="voms.png" width='500' height='200' />

VOMS status for 24 hours up to 12:00 October 1st.

AOB:

---++ Friday

Attendance: local();remote().

Experiments round table:
   * ATLAS -
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
   * ALICE -
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -

Sites / Services round table:

AOB:

-- Main.JamieShiers - 2009-09-24
Topic revision: r14 - 2009-10-02 - JamieShiers
Copyright © 2008-2023 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.