Week of 090518

WLCG Baseline Versions

WLCG Service Incident Reports

GGUS Team / Alarm Tickets during last week

Archive of Broadcasts

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Harry, Gang, Alessandro, Olof, Jacek, Diana, Nick, MariaDZ, Roberto, Patricia, Simone);remote(Michael, Jeremy, Angela, Gareth, Daniele).

Experiments round table:

  • ATLAS - (By email from Simone Campana): On Saturday Jason reported that the network problem at ASGC had been fixed. Therefore we started testing the site again, with controlled subscriptions of datasets with various file sizes. I cannot measure the network performance exactly, but it is certainly now what I would consider "normal": 3 GB files are transferred in the order of a few minutes and the site shows 100% success with no failures at all. Therefore I believe we should move on and restart functional tests for the site. To keep the environment controlled, I would start with T0-T1 traffic, which I have just resumed. I would resume the T1-T1 traffic tomorrow if things go OK.

AdG report: No major service interruptions during the weekend. Problems exporting data from BNL to NL-T1 (a GGUS ticket was opened in the end) illustrate the need for Tier 1 experts to be able to quickly contact ATLAS experts. Also, some datasets at BNL are not complete because the site is running an old version of the site services; all sites are asked to upgrade (BNL is planning to do so). This afternoon (Monday) the central catalogue will be upgraded. This will be done first on one of the two load-balanced servers and should be transparent.

  • CMS reports - A quiet weekend in terms of tickets. Solved one problem at CERN (the CAF space needed cleaning up), two at Tier 1 and three at Tier 2, all routine activities. The planning for the castorcms upgrade at CERN has been agreed.

  • ALICE - Ran more than 10000 concurrent jobs over the weekend. Some minor issues: the VObox at Nikhef was inaccessible on Saturday (ticket sent), the CREAM-CE at FZK was not working (Angela reported a typo in a script that redefined their queues), and a strange new job submission problem in Torino where a normal proxy works but there are submission errors with a delegated proxy.

  • LHCb reports - No large scale production at the moment. GGUS tickets could not be created this morning (Maria DZ reported this as due to a temporary CRL failure) and a user had a problem at IN2P3 where some pnfs files became inaccessible. Olof asked for more details on the failure rates LHCb presented at last week's GDB - Roberto to follow up.

Sites / Services round table:

  • ASGC - (by email from Jason Shih): The recent ATLAS FT errors arose from a faulty interface installed in the core switch; the root cause was clarified after mutual network performance testing. The continuous timeout errors observed since last week affected all production VOs, especially ATLAS and CMS. CMS confirm that transfers are able to resume normally and we confirm that the local MSS can now take FT data smoothly.

Gang reported that the faulty switch hardware was corrected on Saturday. There will be a downtime of several hours on 25 May for further router work. Simone added that the repair of ASGC was good news and that ATLAS T1-T1 traffic there will resume tomorrow. Data for the STEP'09 low rate tests is already being sent to ASGC and it is critical they get back their 5% of last year's ATLAS cosmics data to be ready for the STEP'09 reprocessing component. They expect ASGC to be fully available by next week.

  • RAL - in the middle of their two-day move of back-end Oracle databases to new hardware.

  • FZK - having some instabilities in their tape services that they must fix before STEP'09.

  • CERN (Jacek, databases) - In the coming two weeks they will migrate the integration databases from RHEL4 to RHEL5. This will take about 2 hours per database but should be transparent. They are not planning to migrate the production databases before STEP'09.

AOB: Maria reminded ATLAS that there are two outstanding GGUS tickets (that generated OSG tickets) from early May, namely 48366 and 48574. Alessandro agreed to follow up.

Tuesday:

Attendance: local(Harry, Gang, Alessandro, Olof, Ricardo, Miguel, Steve, MariaDZ, Diana, Roberto);remote(Daniele, Luca, Angela, Gareth).

Experiments round table:

  • ATLAS - Have issued detailed instructions for clouds to prepare for STEP'09 and are looking for contacts to report back, having so far only UK, FR and NL. They were not so affected by the WAN failure as their current operations are more centralised. Yesterday's upgrade of one of the central catalogue servers was successful so the second one will be done today. T1-T1 transfers with ASGC have resumed and they hope to complete certification tomorrow. They will then start transferring cosmics raw data and AOD files to ASGC in preparation for STEP'09.

  • CMS reports - One Tier 1 ticket (to ASGC) closed and several Tier 2 tickets closed, mostly concerning resolved transfer errors. Most worrisome is the Caltech Tier 2 with an unresponsive SRM. CMS operations were interrupted by the WAN failure this morning. It was seen immediately by users (messaging, Indico, web access) and CERN web services monitoring was cut off. Job status could not be monitored and some work was lost. Daniele raised a CMS GGUS alarm ticket to CERN, given the severity and in order to be sure of reaching CERN, but queried whether this was the correct action for CMS to take and, if not, what should have been done. In fact this ticket did not reach CERN until the network came back at 11.06 (unsurprisingly), when it was rejected by the CERN operator as not relating to data management. Olof thought a phone call to the CERN operator would have been better had Daniele known the number. We will follow up on this particular case involving a WAN failure and also look at other high-severity use cases not yet covered.

  • ALICE (by email from Patricia) - Running smoothly with more than 11000 jobs and no major issues.

  • LHCb reports - some transactions were aborted by the network failure. Reports of aborted pilot jobs at Nikhef. Should restart MC09 later today. CERN Castor upgrade now scheduled for 27 May.

Sites / Services round table:

  • CERN - There was a network connectivity failure to CERN from (at least) DESY and CNAF this morning, from just before 10.00 to 11.00. The cause is not yet known but was initially thought to be a GEANT configuration issue which cut off CERN addresses. More information to come. CMS raised a site alarm ticket https://gus.fzk.de/ws/ticket_info.php?ticket=48850 which was rejected by the CERN operator as per procedure, since it did not concern the CASTOR, FTS, LFC or SRM services. How to handle such cases is to be followed up.

  • ASGC (detailed report by email from Jason and verbally by Gang) -

* network: the new contract for the 10G network starts from June 1st and the Data Centre is now fully restored. A two-hour intervention for core router relocation is scheduled, with a maintenance window from 2009-05-25 02:00 UTC until 2009-05-25 04:00 UTC. All connections to T0/1/2/3 will be affected, with 100% overall degradation.

* the new tape system delivered this Wednesday will be appended to the existing base frame set and will provide service two weeks after the delivery check. Tape drive expansion continues this Friday, with installation hopefully finished before the weekend. They expect to have 6 LTO4 drives involved in functional testing and in the upcoming STEP'09 (the total drive pool should be 8 drives).

* the Data Centre is fully restored. The facility expects to move back starting this Thursday. Not all capacity will be back in the DC for STEP'09.

  • RAL - in day 2 of the Oracle RAC move. Should be complete by 18.00 BST.

  • FZK - ongoing tape library problems.

AOB:

Wednesday

Attendance: local(Harry, Gang, Eva, Simone, Laura Perini, Ricardo, MariaDZ, Olof, Nick);remote(Daniele, Jeremy, Angela, Michael, Jeff, Gareth, Luca, Alessandro I, Greig).

Experiments round table:

  • ATLAS - 1) The update of the central catalogue was completed yesterday. There was some disruption of services (for about an hour) at Tier 2 sites running VO boxes as there was a role change that needed to be propagated. 2) The SARA downtime extension yesterday led to a ticket being sent, as the ATLAS framework does not detect such extensions since no broadcast is sent. Jeff suggested we think of a mechanism allowing the frameworks to poll GOCDB on a suitable timescale (a sketch of such a polling loop follows this list). 3) At ASGC, T0-T1 and T1-T1 transfers have resumed and the datasets allowing the Panda and Ganga robot functional tests to resume have been sent. All required ATLAS software is there and replication to the Melbourne Tier 2 has resumed. The last element is conditions data for reprocessing, where the flat files component has been sent, leaving the Oracle component. Eva is in contact with ASGC and this DB part should be ready by Monday.

  • CMS reports - Closed the Caltech unresponsive srm ticket. Waiting for suggestions on how to better react to the recent Geant WAN failure. Waiting for results of recent FZK downtime which should have addressed several internal tickets and waiting for consistency checks to be made on a dataset at CNAF (this is an internal CMS ticket).

  • ALICE - (Patricia before the meeting) Good behavior of ALICE sites in general. The cause of the CREAM issue reported two days ago and affecting the Torino T2 has been found by the developers of the service, and ALICE production through the CREAM system in Torino is back. The ALICE VOBOX at GRIF has been moved to new hardware and several issues were found and reported to the site expert this morning: first of all, the whole ALICE home directory was gone; its contents were recovered by the site admin. In addition, the whole VOBOX configuration (as a service) was also gone; the site admin has been asked to reconfigure the node with the corresponding service.
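
Jeff's suggestion in the ATLAS item above (have the experiment frameworks poll GOCDB on a suitable timescale instead of relying on broadcasts to notice extended downtimes) could look roughly like the minimal sketch below. The GOCDB programmatic-interface URL, the site name and the XML field names used here are assumptions for illustration only and would have to be replaced with the real interface details; the point is simply that a periodic diff of the advertised end times catches an extension even when no broadcast is sent.

#!/usr/bin/env python
# Minimal sketch: poll GOCDB-style downtime data and flag extensions.
# The endpoint, site name and XML tags below are ASSUMPTIONS for illustration.
import time
import urllib.request
import xml.etree.ElementTree as ET

GOCDB_PI_URL = "https://goc.example.org/gocdbpi/public/?method=get_downtime&topentity={site}"  # assumed URL
SITE = "SARA-MATRIX"      # assumed site name
POLL_SECONDS = 1800       # poll every 30 minutes

known_end_times = {}      # downtime id -> last seen end timestamp

def fetch_downtimes(site):
    """Return {downtime_id: end_timestamp} for the given site (assumed XML layout)."""
    with urllib.request.urlopen(GOCDB_PI_URL.format(site=site), timeout=60) as resp:
        root = ET.fromstring(resp.read())
    result = {}
    for dt in root.findall("DOWNTIME"):
        dt_id = dt.get("ID") or dt.findtext("PRIMARY_KEY", default="")
        end = int(dt.findtext("END_DATE", default="0"))
        result[dt_id] = end
    return result

while True:
    try:
        current = fetch_downtimes(SITE)
    except Exception as exc:
        # Network or parse error: log and retry on the next cycle.
        print("GOCDB poll failed: %s" % exc)
    else:
        for dt_id, end in current.items():
            old = known_end_times.get(dt_id)
            if old is not None and end > old:
                # An existing downtime now ends later than before: an extension
                # that would not have been announced by a broadcast.
                print("Downtime %s at %s extended: %d -> %d" % (dt_id, SITE, old, end))
            known_end_times[dt_id] = end
    time.sleep(POLL_SECONDS)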

Sites / Services round table:

  • NL-T1: During their downtime they upgraded to the most recent dCache and increased the tape disk buffer size from 1.7 to 22 TB. They have memory issues with a Java virtual machine - working, but at risk. They also lost an xntp (time server) configuration file, leading some WNs to drift in time and fail.

  • CNAF: Had a temporary problem of SAM tests failing due to a configuration mistake of the sgm role - jobs failed to be submitted. They queried the reported failing LHCb disk server with a bad CRL and were given its ID by Greig (it is in StoRM).

  • RAL: Oracle RAC move downtime was extended to 16.00 BST today but has just ended. Gareth confirmed FTS and CE traffic can be resumed. There will be a 90 minute scheduled outage next Tuesday (hopefully shorter) for a second try at making network changes. Batch jobs will be paused.

  • ASGC: 1) New Tier 2 DPM setup: they expect to add a new endpoint into ATLAS DDM and join the functional tests. The new setup is expected to be finished this Thursday, including the cabling to the back-end disk servers. 2) They can confirm the schedule for reprocessing validation; the new tape drives (6 LTO4) await completion of acceptance verification this Friday, with installation continuing in the afternoon. They hope the cabling can be finished in the evening, including also the server package upgrade.

AOB (MariaDZ): Maria queried the status of BNL's migration from GGUS to OIM. Michael confirmed that this is in the final stages of preparation and should be done before STEP'09. Added later: for Michael's information on the need to obtain OSG sites' contact and emergency email addresses for use by GGUS: https://savannah.cern.ch/support/?107531#comment13 NB! The emergency emails can only be used for ALARM tickets to Tier 1s. Other sites do not need to provide this information.

Thursday - CERN holiday, but the conf call will be available for use.

Attendance: local();remote().

Experiments round table:

  • ATLAS: Data distribution to T1s and T2s has been resumed for various data formats. The major impact comes from the MC performance DPDs (derived from ESDs), which are not merged and will be distributed to all T1s and many T2s (a lot of small files). Details in the decisions of the ATLAS CREM. Site issues: failures transferring to MCDISK at RAL due to lack of disk space and file servers being particularly busy (see the elog from Gareth), and a problem sending data to TRIUMF (DATADISK), for which a GGUS ticket has been submitted.

  • ALICE -

Sites / Services round table:

AOB:

Friday

Attendance: local(Harry, Gang, Simone, Andrea, Jamie, Patricia, Ricardo);remote(Michael, Alessandro, Gareth).

Experiments round table:

  • ATLAS - 1) As reported yesterday, ATLAS is distributing many small DPD files to the Tier 1s for redistribution to the Tier 2s, followed by deletion from the Tier 1s. 2) From 15.00 to 19.00 yesterday, data exports from CERN to the Tier 1s were degraded. The problem cured itself but ATLAS would like an explanation and have submitted a ticket for this. 3) SARA failed for one hour to export data anywhere - this also cured itself. 4) BNL had a problem getting data from MCDISK in Lyon. This turned out to be a poorly configured dCache pool in Lyon being too busy. 5) ASGC is getting Oracle insert errors causing 30% of transfers to fail. Gang reported this is being worked on. 6) This morning PANDA monitoring was showing zeroes for jobs being activated while the number of pilots was correct and jobs were actually running.

  • CMS reports - Operations running more or less OK - some backlog from the FZK tape issues. A STEP'09 planning meeting was held yesterday with 35 participants, mostly looking at prestaging, which will be tested on a small scale at the Tier 1s next week. Sites must decide if they want to prestage via PhEDEx, a gfal/srm script or manually (a sketch of the script option follows this list).

  • ALICE - Suffering from small network interruptions from time to time today, but not really affecting production (this may be related to the fact that the CERN primary firewall router crashed this morning and we are running over its backup).
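
For sites considering the "gfal/srm script" prestage option mentioned in the CMS item above, the driver can be as small as the sketch below. The client command used (srm-bring-online, as shipped with the dCache SRM client tools) and the batch size are assumptions for illustration; a real script would call whichever SRM or gfal client the site actually deploys, with the appropriate options.

#!/usr/bin/env python
# Minimal prestage sketch: read one SURL per line and ask the SRM to bring
# the files online in small batches. The client command is an ASSUMPTION;
# substitute the SRM/gfal client and options provided at your site.
import subprocess
import sys

BATCH_SIZE = 50               # number of SURLs per bring-online request
CLIENT = "srm-bring-online"   # assumed SRM client command

def batches(items, size):
    """Yield successive chunks of the SURL list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def main(surl_file):
    with open(surl_file) as f:
        surls = [line.strip() for line in f if line.strip()]

    failed = []
    for chunk in batches(surls, BATCH_SIZE):
        # One bring-online (prestage) request per batch of SURLs.
        ret = subprocess.call([CLIENT] + chunk)
        if ret != 0:
            failed.extend(chunk)
            print("bring-online request failed for a batch of %d files" % len(chunk))

    print("requested prestage for %d files, %d in failed batches" % (len(surls), len(failed)))

if __name__ == "__main__":
    main(sys.argv[1])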

Sites / Services round table:

  • RAL: The issue of MCDISK space token transfer failures reported by ATLAS yesterday has been solved. CASTOR is still publishing 36 TB of available disk space that ATLAS cannot in fact use (disks draining - this is allowed for in Castor 2.1.8). Monday is a UK bank holiday so RAL will not join the conf call.

  • CNAF: The LHCb ticket on a StoRM disk server failing on CRL had not been sent to the optimal address. The problem is in fact an expired host certificate for which a replacement has been requested.

AOB:

-- JamieShiers - 14 May 2009
