Week of 090928

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Lola, Eva, Roberto, Maria, Simone, Andrea, Olof, Gang, Jan);remote (Daniele, Kyle, Angela, Ronald, Brian, Alexei, Jim, Gareth, Michael, Brian)

Experiments round table:

  • ATLAS (Alexei): Two main activities going on: transfer of reprocessed cosmics data and transfer requests to support the ATLAS Users Analysis Test (UAT). 150 TB is being distributed to all Tier1s at the moment (50 for analysis, 100 for reprocessing), with the bulk at BNL.

  • CMS reports (Daniele): the ticket for gLite WMS is still open. CMS MC production has problems (suffering from error 10). It is a CondorG/GRAM problem that is being investigated by the developers, and a workaround is available (Savannah 110143). For T1s, the items to report are the backlog digested at ASGC, transfer issues CIEMAT, DESY ->IN2P3, and SAM analysis tests and job failures at IN2P3, which are being looked at. For T2s, transfer issues from Florida->CNAF, FNAL ->UCSD, IC->RAL and SAM analysis test failures at T2_TR_METU, IC and Lisbon. More details in the attached report.

  • ALICE - No report.

  • LHCb reports (Roberto): No activity in the system apart from a few hundred user jobs at the T1s. Certificate expired on volhcb04. No alarm was raised as the server is in maintenance (as requested by LHCb last August).

Sites / Services round table:

  • ASGC (Gang): SRM degradation fixed today. 25 disk servers were rebooted this morning because of an FS problem. SAM errors for job submission (lasting 6 hours) are fixed. Streams replication propagation is disabled because of a listener problem with the database at ASGC; to be checked. Yesterday's transfer problems with ATLAS (after the power maintenance) are now fixed. FTS failures with a delegation credentials error; this affects the replication to ASGC.

  • FZK (Angela) : All OK. The problem on the worker nodes reported on Friday is fixed.

  • BNL (Michael) : Maximum number of connections exhausted on the LFC (Jean-Philippe and the BNL experts are looking into this).

  • NL-T1 (Ronald) : Nothing to report.

  • RAL (Gareth): At-risk intervention tomorrow on the Castor information provider.

  • FIO Ops : SRM transfer failures for ATLAS due to a disk server in maintenance. Intervention at risk tomorrow morning on the Castor storage switches.

AOB: (MariaDZ) The periodic ALARM tests to T1s must be sent this week. Instructions in https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru

Tuesday:

Attendance: local(Jean-Philippe, Jamie, Harry, Dirk, Eva, MariaG, Simone, Olof, Alessandro, Jan, Gang, Andrea, Diana, MariaD);remote(Jeremy, Gareth, Daniele, Michael, Ronald, Brian, Angela).

Experiments round table:

  • ATLAS (Simone) - Reprocessing from ESD finished (bulk done). Output distribution to the 10 T1s is ongoing, together with analysis tests for a total of 150 TB/site. At 3 Gb/s total throughput, 1 week is required for 1 PB. The delegation credential problems reported yesterday are fixed at ASGC. The Castor DB overload at ASGC is also fixed. No transfers to Lyon as the site is in scheduled downtime (Chimera upgrade).

  • CMS reports (Daniele) - The ticket RAL had been asking for on Estonia ->RAL transfers is closed (Savannah ticket # 110138). The gLite WMS problem reported yesterday is still being addressed; for details and updates see Savannah ticket # 110143. Good progress with the open tickets to the T1s, which are all closed. One new ticket (Savannah ticket # 110207) opened to ASGC: "no default service classes defined" error in transfers from T1_TW_ASGC to T2_IT_Rome. For T2s, several tickets are in the process of being closed. Still worth mentioning is the slow responsiveness of a few Russian T2s and of Brazil UERJ.

  • ALICE - No report

  • LHCb reports - Today the FEST week starts. The activity represents the expected work flow at T0 and T1s. XPRESS Stream will be analyzed first, followed by the DQ WG giving green-light and then the FULL stream at 1.8Hz HLT data acquisition rate. Stripping work-flow has to be commissioned (top priority task for coming weeks).
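As a rough cross-check of the ATLAS throughput figure above (3 Gb/s, 1 week, 1 PB), the sketch below computes the transfer time under two readings of the unit. The minutes do not say whether "Gb" means gigabits or gigabytes, nor whether decimal units are meant; both are assumptions here, and the function name is purely illustrative.

```python
# Rough transfer-time arithmetic for the ATLAS figures quoted in the minutes.
# Assumption: decimal units (1 PB = 10**15 bytes); the minutes do not specify.

SECONDS_PER_DAY = 86_400
PB = 10**15  # bytes


def days_to_transfer(volume_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to move volume_bytes at a sustained rate_bytes_per_s."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return volume_bytes / rate_bytes_per_s / SECONDS_PER_DAY


# Reading "3 Gb/s" as gigabits per second: 3e9 bits/s = 3.75e8 bytes/s.
days_if_gigabits = days_to_transfer(PB, 3e9 / 8)

# Reading "3 Gb/s" as gigabytes per second: 3e9 bytes/s.
days_if_gigabytes = days_to_transfer(PB, 3e9)

print(f"3 Gbit/s : {days_if_gigabits:.1f} days for 1 PB")
print(f"3 GByte/s: {days_if_gigabytes:.1f} days for 1 PB")
```

Under these assumptions the gigabit reading gives about 31 days and the gigabyte reading about 4 days, so the "1 week" quoted in the report appears to fit the gigabyte reading better, with some margin.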

Sites / Services round table:

  • ASGC (Gang): Many disk servers were rebooted yesterday after the scheduled downtime on Sunday. The database seems to be corrupted after the Sunday intervention. A point-in-time recovery was performed, but the physics database service team was not informed about this and Streams operations cannot be restarted at the moment. A SIR is required from the site. Communication also has to be improved.

  • RAL (Gareth): space token problem caused by a disk server misconfiguration for ATLAS MCDISK. SAM test problems under investigation.
Checksum verification of files transferred with FTS fails because of the ext3 configuration; this is being discussed at the request of ATLAS (Shawn will give an update in the next few days). Plan to migrate to SRM 2.8? Yes, but waiting for a patch. Can the LHCb migration to 64-bit be postponed by one week?

  • BNL (Michael): Reprocessing and data distribution for the analysis exercise are going on: the load on the storage element is 40 GBits/sec, combined read/write. A large fraction of the ESD data has been processed and is being delivered to other T1s at a sustained rate of 1 GByte/sec. The site has shown no problem coping with this load, which actually seems to be below its capability. Well done!

  • FZK (Angela): nothing to report.

Services round table:

  • Databases: the RHEL5 migration, foreseen for the summer and delayed because of several bugs, is now postponed until after the LHC run has finished. New hardware will be installed on RHEL4.

AOB (MariaD): Please look at https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Site_scheduled_interventions for instructions on interventions, as decided at the SA1 coordination meeting at EGEE'09.

Wednesday

Attendance: local (Eva, Jamie, MariaG, MariaD, Jan, Harry, Gavin, Oliver, Julia, Lola, Gang, Nick, Jean-Philippe, Antonio);remote (Daniele, Gareth, Onno, Michael, Jim, Alexei, Angela).

Experiments round table:

  • ATLAS (Alexei) - Data transfers from BNL->T1s for data reprocessing and analysis going on. Also starting data deletion at sites to prepare for MC production starting next week.

  • CMS reports (Daniele) - Nothing special to highlight today with respect to what was already discussed yesterday. Most of the activity is at the T2 level because of the analysis challenge starting next week. Harry commented on the 2k pool accounts to be created: 999 are done, and the rest will follow after the analysis challenge.

  • ALICE (Patricia) - Not much happening last week because of a MC cycle change. Activities restarted last night. No problems so far.

  • LHCb reports (Roberto) - No reconstruction jobs in the system because of a problem with the options file used by the DaVinci application. A few thousand jobs are currently in the system from distributed user analysis activity. Problems at RAL with file opening, which takes a long time (about 1 hour). A GGUS ticket will be opened at RAL's request to track the problem.

Sites / Services round table:

  • ASGC (Gang): Access to the 3D database is possible but the database seems to be corrupted. The problem is being investigated. SIR requested.

  • RAL (Gareth): The LHCb database upgrade to 64-bit is postponed to the week after next, when the DBA returns from holidays. The SAM test failures reported yesterday seem to be fixed but investigations are still going on. The SRM upgrade to the latest 2.8.1 patch will happen after the tests at CERN have finished. Checksum support is not in 2.1.7, which is the version deployed at RAL for the LHC run.

  • BNL (Michael): Slowness for data export to T1s being currently investigated. More news tomorrow.

  • FZK (Angela): Nothing to report.

  • NL-T1 (Onno): yesterday afternoon SARA had SRM problems while running a stage test. The stage test was stopped and the SRM restarted; no problems since then. The dCache developers are still searching for a solution.

Services round table:

  • FIO Ops (Jan): All LCG CEs planning an upgrade for next week.

Release report (Antonio): deployment status wiki page. The version of the lcg-CE to be recommended to the production sites is lcg-CE 3.1.35-0.

AOB (Diana): EGEE has a new ROC. Latin American sites will be taken care of by the ROC_LA.

Thursday

Attendance: local(Julia, Jean-Philippe, Maria, Eva, Olof, Ricardo, Jan, Jamie, Harry, Andrea, Edoardo, Roberto, Gang, Steve); remote(Jeremy, Michael, Ronald, Gareth, Daniele, Xavier).

Experiments round table:

  • ATLAS (Simone by email) - SARA MCDISK space token is full. ATLAS is working out a strategy on how to proceed.
- It would be good to know the status of Lyon. From what I understand, the intervention should be over today.

  • CMS reports (Daniele) - Update on the gLite WMS globus/gram problem: significant decrease of job aborts. More feedback from debugging expected soon. About T2s (in the spotlight because of the analysis exercise): T2-T2 links commissioning for the muon and tracker physics groups is in progress. Three tickets for Florida, TR_METU and Lisbon are closed, and there is also progress with the responsiveness of some sites.

  • ALICE - No report.

  • LHCb reports (Roberto) - FEST production still paused for a fix of the DaVinci options file. Transfers from P8 to CASTOR going on. Also MC productions, with about 20k concurrent jobs in the system.

Sites round table:

  • ASGC (Gang): No news.

  • RAL (Gareth): Upgrade to SRM 2.8.1 and timeouts observed on the ATLAS Castor instance being investigated.

  • NL-T1 (Ronald): GGUS ticket on FTS investigated and solved by Simone.

  • BNL (Michael): The slowness of the functional data transfer tests is understood. The distribution of the Monte Carlo samples for the user analysis test (UAT) and the Functional Test (FT) data both use the same DQ2 input share, which is currently fully utilized. The data volume for the sample distribution is 45 TB * 10 ATLAS T1s, so the FTs get swallowed by the MC distribution.

  • FZK (Xavier): Nothing to report

Services round table:

  • FIO Operations (Ricardo): 4 new CEs (version 3.1.35, SLC5) will be made available next week.

  • Database Operations (Eva): firmware upgrade and parameter configuration change scheduled for next week on a faulty disk array of the LHCb online cluster.

  • Network Operations (Edoardo): Connectivity lost with CNAF (re-established) and FNAL following removal of the old prefix (scheduled for today after the router upgrade).

  • VOMS Operation (Steve)

CERN VOMS
Since 21:00 UTC Wednesday the LHC VOMS service on both voms.cern.ch and lcg-voms.cern.ch has been experiencing failures and restarts. It looks as if high load is triggering the monitoring and service restarts.
  • The service is basically available, though you may be unlucky. Under investigation.
  • No user reports have been made on failures as yet.
Figure (voms.png): VOMS status for 24 hours up to 12:00 October 1st.

AOB:

Friday

Attendance: local(Jean-Philippe, Eva, Maria, Roberto, Simone, Gang, Olof, Jan, Steve );remote(Xavier, Jeremy, Gareth, Onno, Michael, Daniele).

Experiments round table:

  • ATLAS (Simone) - Lyon finished the Chimera migration yesterday. Back in functional tests and data distribution (production activity on Monday).
MCdisk failure at ASGC being investigated by Jason. Transfers have improved to 80% efficiency; the remaining 20% loss needs to be investigated. Next week: throughput tests from T0->T1s for the new T0 set-up and FTS 2.2 test. Monday-Tuesday: low throughput (17k files delivered per site per day, approx. 5 TB/day/T1). Wednesday-Thursday: same number of files per site, but a factor 10 bigger in size (total 50 TB/site/day). Sites with spare capacity should use it (deploy in ATLASDATADISK). It would be nice if SLAC could participate since it has BeStMan; waiting for confirmation. Deletion day next Friday.

  • CMS reports (Daniele) - Most of activity is on T2s, ramping up to get everything ready for next week (for the Physics Groups for the October exercise). Tickets being followed up - evident progress - and still commissioning of the remaining T2-T2 links.

  • ALICE - No report.

  • LHCb reports (Roberto) - MC production drained due to book-keeping problem. Now restarted with 2K jobs at the moment. P8-CASTOR transfers continuing at nominal rate. Stripping also being commissioned and ready to be launched against large MC production (1B minimum bias events). FEST week in 2 weeks exercising full chain detector to plots. File access slowness at CERN understood: due to high I/O activity.
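The per-site numbers in the ATLAS throughput test plan above (17k files and about 5 TB per T1 per day, then 50 TB per day with files ten times larger) can be turned into a mean file size and a sustained rate with the small sketch below. Decimal units (1 TB = 10**12 bytes) and a transfer spread evenly over 24 hours are assumptions, and the helper names are illustrative only.

```python
# Back-of-the-envelope numbers for the ATLAS T0->T1 throughput test plan.
# Assumptions: decimal units and a load spread evenly over 24 hours.

SECONDS_PER_DAY = 86_400
TB = 10**12  # bytes
MB = 10**6   # bytes


def mean_file_size_mb(volume_tb_per_day: float, files_per_day: int) -> float:
    """Average file size in MB for a given daily volume and file count."""
    return volume_tb_per_day * TB / files_per_day / MB


def sustained_rate_mb_s(volume_tb_per_day: float) -> float:
    """Average rate in MB/s needed to deliver the daily volume."""
    return volume_tb_per_day * TB / SECONDS_PER_DAY / MB


FILES_PER_SITE_PER_DAY = 17_000  # same file count in both phases

for label, tb_per_day in (("Mon-Tue", 5), ("Wed-Thu", 50)):
    size = mean_file_size_mb(tb_per_day, FILES_PER_SITE_PER_DAY)
    rate = sustained_rate_mb_s(tb_per_day)
    print(f"{label}: ~{size:.0f} MB/file, ~{rate:.0f} MB/s per T1")
```

Under these assumptions the first phase corresponds to roughly 294 MB per file at about 58 MB/s per T1, and the second to roughly 2.9 GB per file at about 579 MB/s per T1.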

Sites round table:

  • ASGC (Gang): disk server not properly configured caused file transfer efficiency to drop - now fixed. Still to understand the 20% missing efficiency reported by ATLAS.

  • FZK (Xavier):

  • NL-T1 (Onno): MCDisk at SARA SRM was full. Some files were moved and the issue should be fixed. Scheduled intervention on DPM at NIKHEF next Monday from 1pm-2pm CET.

  • RAL (Gareth): some problems on ATLAS transfers from RAL with timeouts and higher proportion of failures. LHCb 3D upgrade is scheduled for Monday.

  • BNL (Michael): Planning an upgrade of HPSS mass storage from 6.2 to 7.1 from 6th-8th Oct. This does not affect dCache, and therefore BNL will participate in the ATLAS throughput test (see the ATLAS report above).

Services Round Table:

  • Databases: LHCb online disk array intervention is scheduled for next Monday. Announced on the SSB.

AOB: The LHCb test alarm sent yesterday was not properly received by RAL and FZK (the field indicating where the alarm is sent was empty for both sites). The ATLAS test alarm ticket will be sent today. Sites please check. Another LHCb test alarm will be sent next week. The proper escalation of alarms also needs to be followed up next week for CERN, as yesterday's LHCb test alarm was not received by the site.

-- JamieShiers - 2009-09-24

Topic revision: r15 - 2009-10-02 - MariaGirone
 