Week of 081006

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Harry,Ulrich,Jean-Philippe,Simone,Maria,Roberto,Gavin,Jan,Patricia,Markus);remote(Gavin,Michael).

elog review:

Experiments round table:

ATLAS (SC): The weekend of cosmics showed several problems:
  1. SRM timeouts at IN2P3; they have installed a cron job to restart the agents.
  2. FTS exports to RAL are being throttled somewhere back to 50-60 MB/s (instead of 80-90 MB/s). There is a 40 TB backlog that will be slow to clear and which is occupying disk space at CERN that will need to be freed. The FTS has already been increased to 20 slots and will now go to 30. ATLAS will look again tomorrow and may decide to drop RAL out of the cosmics distribution.
  3. NL-T1 had a power cut on Saturday (affecting part of Amsterdam). We were informed by the ATLAS contact but not officially by the site, and the site did not fully come back after the power cut. We put in a GGUS ticket asking them to declare a scheduled downtime and we have quarantined the site.
Other ATLAS issues:
  4. Testing of FTS on SLC4 continues.
  5. PIC still has FTS proxy delegation failures, although the same configuration works at other sites.
  6. The ATLAS LCG VOMS certificate has been updated, causing problems at NDGF, PIC and ASGC.
  7. Prestaging has been tested at CERN, TRIUMF, RAL and CNAF, while at NDGF, PIC, FZK and IN2P3 it needs attention. Tests at BNL begin today (the pandamover will be involved).
  8. The problem of FTS slowing down because of uncleared history in the database, reported at FZK two weeks ago, has now happened at NL-T1. This is a known issue and there is a workaround script that ATLAS will refer their sites to; see https://twiki.cern.ch/twiki/bin/view/LCG/FtsAdminTools20#FTS_history_package (a minimal sketch of the idea follows below). The tool ought to be included as part of the FTS core release.
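
The history-cleanup workaround referenced in point 8 essentially purges records of completed transfers from the FTS back-end database so that scheduling queries stay fast. A minimal sketch of the idea, with assumed table and column names (not the real FTS schema; sites should use the packaged tool linked above rather than this):

    import cx_Oracle  # Oracle bindings; the FTS back-end is an Oracle database

    # NOTE: t_job/t_file and the column names below are illustrative assumptions only.
    PURGE_FILES = """DELETE FROM t_file WHERE job_id IN
                       (SELECT job_id FROM t_job
                         WHERE job_state IN ('Finished', 'Failed', 'Canceled')
                           AND finish_time < SYSDATE - :days)"""
    PURGE_JOBS = """DELETE FROM t_job
                     WHERE job_state IN ('Finished', 'Failed', 'Canceled')
                       AND finish_time < SYSDATE - :days"""

    def purge_history(user, password, dsn, days=90):
        """Delete transfer history older than 'days' days: file rows first, then jobs."""
        conn = cx_Oracle.connect(user, password, dsn)
        cur = conn.cursor()
        cur.execute(PURGE_FILES, days=days)
        cur.execute(PURGE_JOBS, days=days)
        conn.commit()
        conn.close()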

LHCb (RS): Not much activity, so the dummy MC work has been increased. There was some slowness in the DIRAC server this weekend, cured by restarting it. Some pilot jobs aborted at FZK. The VOMS Role=user has been deployed at CERN and we are checking the LHCb LSF resource shares at CERN.

ALICE (PM): Have been testing FTS under SLC4 and so far so good. We have problems with LCG RBs and failing proxies at a couple of Tier-2 sites. M. Schulz pointed out that this is no surprise, as these old RBs are not VOMS-aware and have been obsolete for some time. They will never be fixed, so these sites must upgrade. H. Renshall will organise a suitable communication.

Sites round table:

Core services (CERN) report: Jan reported that there will be two transparent CASTOR upgrades tomorrow: one to UserPrivilegesV at 10.00, and bringing castoratlas to patch level 2.1.7-19 at 11.00. A VDQM upgrade is planned for 8 October. The LHCb LFC global change from SRMv1 to SRMv2 file names (1.5 M entries) is scheduled for 10.00-12.00 tomorrow; this will allow the CERN SRMv1 endpoint to be stopped on Wednesday. A major release of the CASTOR SRM interface has been made and this will be tested closely with the experiments. Gavin reported that they have now patched the SLC4 FTS for the bug where large error messages crashed the server.
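
For illustration, the LFC change amounts to rewriting the endpoint prefix of each replica SURL. A minimal sketch of the kind of rewrite involved, with assumed endpoint names (the actual migration is a bulk database update, not a per-entry client loop):

    # Assumed endpoint prefixes, for illustration only.
    OLD_PREFIX = "srm://srm.cern.ch"        # SRMv1-style endpoint (assumed)
    NEW_PREFIX = "srm://srm-lhcb.cern.ch"   # SRMv2-style endpoint (assumed)

    def rewrite_surl(surl):
        """Rewrite one replica SURL from the old endpoint prefix to the new one."""
        if surl.startswith(OLD_PREFIX):
            return NEW_PREFIX + surl[len(OLD_PREFIX):]
        return surl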

DB services (CERN) report (MG): An Oracle patch fixing a caching problem has been deployed on the integration cluster prior to moving to the deployment cluster in two weeks' time. The next Oracle security patches are due then, so they will look at synchronising these patches.

Monitoring / dashboard report:

Release update (MS): They are still waiting for a gLExec patch from Nikhef. There will be a pre-GDB tomorrow discussing middleware options now that the long LHC shutdown has started (see http://indico.cern.ch/conferenceDisplay.py?confId=20246 ).

AOB:

Tuesday:

Attendance: local(Julia, Roberto, Patricia, Andrea, Simone, Sophie);remote(Gareth).

elog review: no new eLog entries

Experiments round table:

  • ALICE: nothing to report
  • ATLAS (Simone): the situation has improved with respect to yesterday; SARA has partially recovered and functional tests will restart this afternoon. For RAL, after CASTOR went down yesterday, the backlog is increasing (now 60 TB) as subscriptions are still coming in. RAL will be excluded from cosmic data distribution to give it a chance to recover. A configuration problem at CNAF was found and fixed this morning by Gavin.
  • CMS (Andrea): the mid-week global run started this morning; data is flowing to the T0 and reconstruction jobs have just started.
  • LHCb (Roberto):
    1. the intervention on the LFCs went fine: all the Tier-1 LFC mirrors were properly updated, apart from RAL where the database backend update is very slow. Gareth explained that there are a number of problems, among them a hardware issue with the database disk server which needs a fairly urgent intervention; Oracle will be down in the afternoon. Sophie asked to be notified when it is done;
    2. at IN2P3 400 jobs were found doing nothing, with a call to access data on dCache via dcap hanging forever after a connection was lost. This is being investigated; the jobs may be killed. The underlying problem is that there is no client-side timeout to prevent this kind of hang (a minimal sketch of such a timeout follows below).
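
A minimal sketch of a client-side timeout that would prevent such hangs, assuming a Unix worker node and the blocking read wrapped as a Python callable (an illustration only, not the DIRAC or dCache implementation):

    import signal

    class CallTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise CallTimeout("blocking call exceeded the time limit")

    def call_with_timeout(blocking_call, timeout_seconds=600):
        """Run a potentially blocking call (e.g. a dcap read wrapped in a lambda),
        raising CallTimeout after timeout_seconds. Main thread, Unix only."""
        old_handler = signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(timeout_seconds)
        try:
            return blocking_call()
        finally:
            signal.alarm(0)
            signal.signal(signal.SIGALRM, old_handler)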

Sites round table:

  • RAL (Gareth): the problem with database crosstalk is still being investigated; a test instance is now being set up to try to reproduce it.

Core services (CERN) report: nothing to report. Andrea mentioned a problem with the CERN AFS UI, where the certificate of one of the VOMS servers, voms.cern.ch, is old and thus the server signature cannot be verified on proxies generated by it. Sophie reported that the problem is only on the UI, all the CERN Grid services have received the updated certificate, so no major problems are expected. Patricia asked if it could affect the proxy renewal on the VOBOX, but this should not be the case.

DB services (CERN) report: nothing to report

Monitoring / dashboard report: nothing to report

Release update:

AOB:

Wednesday

Attendance: local(Gavin, Roberto, Oliver, Jean-Philippe, Ulrich, Jan, Olof); remote(Michael - BNL, Gareth - RAL, Olof(?) - Nikhef).

elog review: no new log entries.

Experiments round table:

  • LHCb: Yesterday we had a problem at IN2P3 (400 jobs hanging). This has been understood and is due to clients stuck in a call to DIRAC. Not much going on this week: fake MC production has stopped after a bug was found in job finalisation. The LFC intervention (updating SURLs) went fine, so there is a green light to CASTOR at CERN for the phase-out of the SRMv1 endpoints; the same applies to CASTOR at RAL.

Sites round table:

  • RAL: An intervention on Oracle hardware caused some downtime. There was a distinct problem with the re-cataloguing of the LHCb LFC: it went very slowly compared to other sites and it is still not clear why (5 hours vs. an expected 10 minutes). Questions from LHCb to RAL: does the same RAC support the LFC and CASTOR? No, they are distinct databases. Are all the LFCs on the same RAC? RAL does not use a RAC for the LFC (not sure exactly how this is laid out, one per experiment or all on the same database; RAL will find out).

  • NIKHEF: noted that batch was down for 2 hours, with the loss of several ATLAS jobs.

Core services (CERN) report:

  • VOMS (Steve) - On Tuesday afternoon, October 7th, the VOMRS VALIDATION cluster was misconfigured such that incorrect VOMRS notifications were emailed to a number of VO members. The false notifications can be identified easily (a sketch of a filter based on these criteria is given after this list):
    1. The emails' From address is steve.traylen@cern.ch rather than the VO's mail address.
    2. The URLs in the email point to https://voms104.cern.ch:8443/vo//vomrs rather than the production VOMRS service on lcg-voms.cern.ch.
    • The VO managers have been notified.
    • No harm was done; the links in the mails went nowhere.
    • It has of course caused confusion and I have replied to many mails that have come my way.

  • Silent data corruption was found in CMS data on one of the CASTOR file-systems; we are trying to assess the impact (SMART failures on the filesystem). One file was found to be bad and was migrated to tape in its corrupted state; we still have the original and will fix it, but we need to look for other cases. Hardware experts are currently trying to understand the problem.

  • Decommissioning SRMv1 endpoints ongoing -> will be gone by end of the week.
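
A minimal sketch of a filter based on the two criteria listed above under VOMS, assuming the received mail is available in a local mbox file (the path handling and exact match strings are assumptions):

    import mailbox

    FALSE_FROM = "steve.traylen@cern.ch"        # From address of the false mails
    FALSE_URL_HOST = "voms104.cern.ch:8443"     # validation cluster, not lcg-voms.cern.ch

    def find_false_notifications(mbox_path):
        """Return the subjects of messages matching both criteria."""
        hits = []
        for msg in mailbox.mbox(mbox_path):
            sender = msg.get("From", "")
            payload = msg.get_payload(decode=True) or b""
            body = payload.decode("utf-8", "replace")
            if FALSE_FROM in sender and FALSE_URL_HOST in body:
                hits.append(msg.get("Subject", "(no subject)"))
        return hits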

DB services (CERN) report: no report

Monitoring / dashboard report: no report

Release update: nothing

AOB: none

Thursday

Attendance: local(Jeremy, Uli, Harry, Jamie, Jean-Philippe, Julia, Simone, Andrea, Roberto, Nick);remote(Gareth, Michael).

elog review:

Experiments round table:

  • ATLAS (Simone) - Things look smooth. 1) Some 4-5% failure rate exporting data from CERN to all sites; a source problem? A ticket has been sent to castor-support: SRM timeout on prepareToGet. The files were checked on disk and are OK; waiting for support to get back. 2) Strange effect at Lyon: data can be fetched from CERN with 100% efficiency, but getting data from other sites fails 100% of the time. Not yet reported; will check after the meeting. As usual, cosmic data taking over night and shipping in the morning.

  • CMS (Andrea) -
    • The mid-week global run has finished.
    • CMS might decide to have a global run with magnetic field at the weekend.
    • Data corruption was discovered in the CAF due to a faulty disk (the error was not generating an alarm). FIO has produced a post-mortem (https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20081008). Some files were rescued, the unrecoverable files are being identified, and an alarm will be triggered in such cases in the future.
    • It was decided to make the check that a site's trivial file catalogue matches the version in CVS a critical test; many sites (including CERN) became unavailable in SAM as a result (a sketch of such a check follows after this list of experiment reports).

  • LHCb (Roberto) - Ramp-up of MC at Tier-2/3/4 sites in the afternoon after successful test jobs yesterday. Some FTS transfer problems from SARA (source preparation problem) to other T1s; reported to SARA. Otherwise ....

  • ALICE (Patricia) - I will be in the ALICE TF meeting and then in the NA4 meeting this afternoon. During the ALICE TF meeting I am presenting the WMS status and the strategy to migrate to CREAM. Otherwise there is nothing special to report today.
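
As a rough illustration of the critical check mentioned under CMS above, comparing a site's deployed trivial file catalogue with the reference copy checked out from CVS could look like the following (all paths are assumptions and the real SAM test is implemented differently):

    import difflib

    # Hypothetical paths: the site's deployed catalogue and a fresh CVS checkout of it.
    LOCAL_TFC = "/opt/cms/SITECONF/local/PhEDEx/storage.xml"
    CVS_TFC = "/tmp/cvs-checkout/SITECONF/local/PhEDEx/storage.xml"

    def tfc_diff():
        """Return a unified diff; an empty list means the site catalogue matches CVS."""
        local = open(LOCAL_TFC).read().splitlines()
        reference = open(CVS_TFC).read().splitlines()
        return list(difflib.unified_diff(reference, local, "CVS", "site", lineterm=""))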

Sites round table:

  • RAL (Gareth) - LFC access from LHCb: the re-cataloguing at RAL was extremely slow. After tuning and adding memory it should now be significantly better.

  • BNL (Michael) - All running smoothly. ~400 MB/s of cosmic data coming in. Looking forward to physicists requesting access to this, and to the upcoming reprocessing tests.

Core services (CERN) report:

  • Silent corruption issue at CERN: We have started to draft a post-mortem for this incident (https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20081008), where you will also find a first go at the list of affected or potentially affected files. We will continue analysing the impact and figuring out how to clean up to reach a consistent state. However, if you still have copies of the affected files in your online buffers, you may consider erasing the CASTOR copy (with nsrm) and copying the file again. In this particular case it is important to first remove the file in CASTOR with nsrm in order to force a new bitfileid, so that we can see it is a new copy of the file if in parallel we are attempting a rescue operation.

    We apologize for all the trouble caused. Clearly there are still a number of 'holes' in our monitoring and procedures, despite a lot of work in recent years on improving the protection against data corruption.
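
For experiments wanting to check their own copies against the list of potentially affected files, a minimal sketch of a local Adler-32 verification (the checksum recorded in the catalogue or bookkeeping must be obtained separately; this is an illustration, not the FIO procedure):

    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Compute the Adler-32 checksum of a local file as an 8-digit hex string."""
        value = 1  # Adler-32 is seeded with 1
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xffffffff)

    # Usage (hypothetical names): flag a copy whose checksum differs from bookkeeping.
    # if adler32_of("/data/online_copy.root") != bookkeeping_checksum: flag_for_rescue()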

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

  • Announcement (CERN) - change in power supply used for tape systems on 18th November - tape systems will be down for 8 hours on that day.

Friday

Attendance: local(Maite, Jamie, Simone, Uli, Andrea, Simone, Harry, Sasha, Maria, Olof);remote(Gonzalo, Jeremy, Michael, Gareth).

elog review:

Experiments round table:

  • CMS (Andrea) - Little to report. Transfers from CERN to FNAL have very bad quality in PhEDEx; for many files the checksum fails. This seems to be related to the CASTOR file corruption problem, but it is not clear whether it is related to the problems on the CAF: the same thing or another?

  • ATLAS (Simone) - First observation: transfers to Lyon from CERN were at a degraded level until ~24:00, with 40% failures. The site was contacted during the afternoon; the cause is not understood but the errors disappeared around midnight. A few hours ago transfers to/from FZK stopped; John Kennedy reported an overload of PNFS and is following up at the site. Yesterday 16:00-17:00 there were failures at the 5% level with CASTOR as the source, increasing to 15% overnight. ATLAS is pushing extremely hard - 3 GB/s for an hour with peaks at 3.5 GB/s! - so the level of alarm was not increased, as the errors might have been due to this. The rate went back down to nominal this morning but errors still occur; the Remedy ticket has been updated. Testing of FTS on SLC4: it was noticed that error messages for failed transfers are missing the source/destination information and the stage (preparation, transfer, finalisation). In general the errors cannot be understood, which makes operations very hard. ATLAS requests this be fixed before FTS on SLC4 is deployed for production. Andrea: CMS has the same request. There is an open bug from January 2008 (!) opened by the developers; its priority has been increased by Simone and Andrew Cameron Smith. It is considered a pre-requisite for putting this into production. Jamie: this should trigger a post-mortem!

  • ALICE (Patricia) - I have written a document which specifies the ALICE requirements for the CREAM CE to be deployed at the sites. It will be distributed to the operations team next week to clarify to the sites what exactly ALICE is asking for. No additional issues regarding production.

  • LHCb (Roberto)
    • The CERN LSF problem seems to have been understood (I had opened a Remedy ticket yesterday for traceability): http://lblogbook.cern.ch/Operations/783.
    • The gLite 3.1.18 UI (production) got broken yesterday afternoon; this was reported directly via a phone call to get it restored. It would be worth knowing the reasons.

Sites round table:

  • NIKHEF - Still working on the post-mortem from last week. Simone: SARA was quarantined after the problems but has now been put back and is working OK. Jeff: we see no LHCb jobs; we assume this is because there is no active production.

Core services (CERN) report:

  • Uli - The LSF share issue raised by LHCb: about a week was spent trying to understand it; one of the root causes was found last night and fixed this morning. LHCb are now getting 200 job slots, even above what is expected, and jobs are draining out.

  • AFS UI - The service manager wanted to update the VOMS certificate but this was missed on Monday. Something went wrong (details not known) and was fixed; job submission scripts did not work. Olof: it wasn't updated on Monday because the UI is a tar ball and not installed in the standard way. Sophie ran the YAIM script, apparently a new procedure. Andrea: in practice it is just replacing a single file. Olof: on most nodes a single RPM is installed; the script was interrupted. Harry: it would have been good to have had the intervention announced at the morning meeting, as is traditionally done.

DB services (CERN) report:

  • Maria - A transparent intervention at RAL to apply a parameter set needed for Streams has been announced for Tuesday 09:30-10:30 UTC+1. Distributed DB operations meeting: CMS requires a study to allow some PVSS data to be replicated from the online to the offline cluster; tests will start schema by schema. A task force has been launched by Sasha to understand conditions DB access for ATLAS, currently focussing on CERN; first results will be published by the DB workshop in November. Sasha has initiated a request to sites to collect disk I/O data: we have memory and CPU speed figures but not disk performance, which seems to be the limiting factor. Gonzalo: which figure? Sasha: disk I/O, i.e. how many MB/s can be read (a sketch of such a probe follows below). Jamie: would you not expect this data to be cached? A: no, as this is slow-control data which is not shared between jobs.
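
A minimal sketch of the kind of disk read probe being asked of the sites, measuring sequential read throughput of one file in MB/s (a rough illustration only; the OS page cache will inflate the result unless the file is larger than RAM or caches are dropped first):

    import time

    def sequential_read_mb_per_s(path, chunk_size=1024 * 1024):
        """Read 'path' sequentially and return the throughput in MB/s."""
        total_bytes = 0
        start = time.time()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total_bytes += len(chunk)
        elapsed = time.time() - start
        return (total_bytes / (1024.0 * 1024.0)) / elapsed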

Monitoring / dashboard report:

Release update:

AOB:

  • Jeremy - Is the problem with the LHCb SAM test(s) understood? It seems they are using the wrong role, or the software installation script is broken.

  • Andrea - elogs: the problem has appeared again whereby you cannot log in using Firefox 3; with IE it works.

-- JamieShiers - 02 Oct 2008
