Week of 081006

Open Actions from last week:

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (CERN room 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Harry,Ulrich,Jean-Philippe,Simone,Maria,Roberto,Gavin,Jan,Patricia,Markus);remote(Gavin,Michael).

elog review:

Experiments round table:

ATLAS (SC): The weekend of cosmics running exposed several problems:

  1. SRM timeouts at IN2P3; the site has installed a cron job to restart the agents.
  2. FTS exports to RAL are being throttled somewhere back to 50-60 MB/s (instead of 80-90 MB/s). There is a 40 TB backlog that will be slow to clear and which is occupying disk space at CERN that will need to be freed. FTS has already been increased to 20 slots and will now go to 30; the situation will be reviewed tomorrow, and RAL may be dropped from the cosmics distribution.
  3. NL-T1 had a power cut on Saturday (affecting part of Amsterdam). ATLAS was informed by its contact but not officially by the site, and the site did not fully come back after the power cut. A GGUS ticket has been opened asking the site to declare a scheduled downtime, and the site has been quarantined.

Other ATLAS issues:

  4. Testing of FTS under SLC4 continues.
  5. PIC still has FTS proxy delegation failures, though the same configuration works at other sites.
  6. The ATLAS LCG VOMS certificate has been updated, causing problems at NDGF, PIC and ASGC.
  7. Pre-staging has been tested at CERN, TRIUMF, RAL and CNAF, while at NDGF, PIC, FZK and IN2P3 it needs attention. Tests at BNL begin today (the Panda mover will be involved).
  8. The FTS slowdown caused by uncleared history in the database, reported at FZK two weeks ago, has now occurred at NL-T1. This is a known issue, and there is a workaround script that ATLAS will point its sites to. See: https://twiki.cern.ch/twiki/bin/view/LCG/FtsAdminTools20#FTS_history_package . The tool ought to be included as part of the FTS core release.
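The history cleanup in the last point amounts to purging jobs in a terminal state that are older than a retention window, while never touching active transfers. A minimal in-memory sketch of that policy (the field names and states below are hypothetical; the real workaround is the FTS history package linked above, which operates on the Oracle schema behind the FTS server):

```python
from datetime import datetime, timedelta

# Hypothetical terminal states; the real FTS state names may differ.
TERMINAL_STATES = {"Finished", "Failed", "Canceled"}

def purge_history(jobs, retention_days=7, now=None):
    """Keep active jobs unconditionally; drop terminal jobs whose
    finish time is older than the retention window."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=retention_days)
    return [j for j in jobs
            if j["state"] not in TERMINAL_STATES
            or j["finish_time"] >= cutoff]
```

Run periodically (e.g. from cron), this keeps the history tables from growing without bound, which is what was slowing the server down.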

LHCb (RS): Not much activity, so the dummy MC work has been increased. There was some slowness in the DIRAC server this weekend, cured by restarting it. Some pilot jobs aborted at FZK. The VOMS role=user has been deployed at CERN, and we are checking the LHCb LSF resource shares at CERN.

ALICE (PM): Have been testing FTS under SLC4 and so far so good. We have problems with LCG RBs and failing proxies at a couple of Tier-2 sites. M. Schulz pointed out that this is no surprise, as these old RBs are not VOMS-aware and have been obsolete for some time. They will never be fixed, so these sites must upgrade. H. Renshall will organise a suitable communication.

Sites round table:

Core services (CERN) report: Jan reported there will be two transparent CASTOR upgrades tomorrow: to the UserPrivilegesV at 10.00, and bringing castoratlas to patch level 2.1.7-19 at 11.00. A VDQM upgrade is planned for 8 October. The LHCb LFC global change from SRM v1 to SRM v2 file names (1.5 M entries) is scheduled for 10.00-12.00 tomorrow; this will make it possible to stop the CERN SRM v1 endpoint on Wednesday. A major release of the CASTOR SRM interface has been made, and this will be tested closely with the experiments. Gavin reported that the SLC4 FTS has now been patched for the bug where large error messages crashed the server.
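Per entry, the LFC migration above is essentially a prefix rewrite of each stored SURL from the SRM v1 endpoint form to the SRM v2 one. A minimal sketch, with placeholder endpoint strings (the actual CERN endpoint hostnames, port and SFN layout are assumptions for illustration only):

```python
# Placeholder prefixes; the real v1/v2 endpoint names are assumptions.
V1_PREFIX = "srm://srm.cern.ch/castor/"
V2_PREFIX = "srm://srm-lhcb.cern.ch:8443/srm/managerv2?SFN=/castor/"

def srmv1_to_srmv2(surl):
    """Rewrite one SURL from the v1 to the v2 endpoint form;
    leave anything that does not match the v1 prefix untouched."""
    if surl.startswith(V1_PREFIX):
        return V2_PREFIX + surl[len(V1_PREFIX):]
    return surl
```

Applied over 1.5 M catalogue entries in one transaction per batch, this is why the update is dominated by database write speed, which is where RAL later ran into trouble.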

DB services (CERN) report (MG): An Oracle patch fixing a caching problem has been deployed on the integration cluster prior to moving to the deployment cluster in two weeks' time. The next Oracle security patches are due then, so they will look at synchronising these patches.

Monitoring / dashboard report:

Release update (MS): They are still waiting for a gLExec patch from Nikhef. There will be a pre-GDB tomorrow discussing middleware options now that the long LHC shutdown has started (see http://indico.cern.ch/conferenceDisplay.py?confId=20246 ).

AOB:

Tuesday:

Attendance: local(Julia, Roberto, Patricia, Andrea, Simone, Sophie);remote(Gareth).

elog review: no new eLog entries

Experiments round table:

  • ALICE: nothing to report
  • ATLAS (Simone): the situation has improved with respect to yesterday; SARA has partially recovered, and functional tests will restart this afternoon. At RAL, after CASTOR went down yesterday, the backlog is increasing (now 60 TB) as subscriptions are still coming in. RAL will be excluded from cosmic data distribution to give it a chance to recover. A configuration problem at CNAF was found and fixed this morning by Gavin.
  • CMS (Andrea): the mid-week global run started this morning; data is flowing to the T0 and reconstruction jobs have just started.
  • LHCb (Roberto):
    1. the intervention on the LFCs went fine; all the Tier-1 LFC mirrors were properly updated, apart from RAL, where the database backend update is very slow. Gareth explained that there are a number of problems, among them a hardware issue with the database disk server which needs a fairly urgent intervention; Oracle will be down in the afternoon. Sophie asked to be notified when it is done;
    2. at IN2P3, 400 jobs were found doing nothing: a call to access data on dCache via dCap had hung forever after a connection was lost. This is now being investigated, and the jobs may be killed. The underlying problem is that there is no client-side timeout to prevent this kind of hang.
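The IN2P3 hang above is the classic blocking-read-with-no-deadline pattern. A minimal sketch of the missing safeguard, using a plain TCP-style socket rather than the dCap client itself (which is not shown here):

```python
import socket

def recv_with_deadline(sock, nbytes, timeout=30.0):
    """Read from a socket with a deadline, so a silently dropped
    connection raises an error instead of blocking the job forever."""
    sock.settimeout(timeout)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        raise RuntimeError("no data within %.1f s; connection presumed lost"
                           % timeout)
```

With a deadline in place, the job fails fast and can retry or exit, instead of occupying a batch slot indefinitely.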

Sites round table:

  • RAL (Gareth): the problem with database crosstalk is still being investigated; a test instance is now being set up to try to reproduce it.

Core services (CERN) report: nothing to report. Andrea mentioned a problem with the CERN AFS UI, where the certificate of one of the VOMS servers, voms.cern.ch, is out of date, so the server signature on proxies generated by it cannot be verified. Sophie reported that the problem is only on the UI; all the CERN Grid services have received the updated certificate, so no major problems are expected. Patricia asked whether it could affect proxy renewal on the VOBOX, but this should not be the case.

DB services (CERN) report: nothing to report

Monitoring / dashboard report: nothing to report

Release update:

AOB:

Wednesday:

Attendance: local(Gavin, Roberto, Oliver, Jean-Philippe, Ulrich, Jan, Olof); remote(Michael - BNL, Gareth - RAL, Olof(?) - Nikhef).

elog review: no new log entries.

Experiments round table:

  • LHCb: Yesterday's problem at IN2P3 (400 hanging jobs) was understood: it is due to clients stuck in a call to DIRAC. Not much going on this week; the fake MC production has been stopped after a bug was found in job finalisation. The LFC intervention (updating SURLs) went fine, so CASTOR at CERN has the green light to phase out the SRM v1 endpoints. The same applies to CASTOR at RAL.

Sites round table:

  • RAL: An intervention on Oracle hardware caused some downtime. There was a separate problem with the recataloguing of the LHCb LFC: it went very slowly compared to other sites, and it is still not clear why (5 hours vs. an expected 10 minutes). Q to RAL (LHCb): does the same RAC support both LFC and CASTOR? No, they are distinct databases. Are all LFCs on the same RAC? We don't use a RAC for the LFC (not sure how this is laid out, one per experiment or all on the same database; will find out).

  • NIKHEF: noted that batch was down for 2 hours, with the loss of several ATLAS jobs.

Core services (CERN) report:

  • VOMS (Steve): On Tuesday afternoon, October 7th, the VOMRS VALIDATION cluster was misconfigured such that incorrect VOMRS notifications were emailed to a number of VO members. The false notifications can be identified easily:
    1. The emails' From address is steve.traylen@cern.ch rather than the VO's mail address.
    2. The URLs in the emails point to https://voms104.cern.ch:8443/vo//vomrs rather than to the production VOMRS service on lcg-voms.cern.ch.
    • The VO managers have been notified.
    • No harm was done; the links in the mails went nowhere.
    • It has of course caused confusion, and I have replied to the many mails that have come my way.
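The two fingerprints of the false notifications can be checked mechanically. A minimal sketch, treating the From address and the message body as plain strings (the address and host are taken from the report above; the function name is illustrative):

```python
# Fingerprints of the misconfigured VALIDATION cluster, per the report.
FALSE_FROM = "steve.traylen@cern.ch"   # instead of the VO's mail address
FALSE_HOST = "voms104.cern.ch:8443"    # instead of lcg-voms.cern.ch

def is_false_notification(from_addr, body):
    """Flag a VOMRS mail as a false notification if either
    fingerprint matches."""
    return from_addr == FALSE_FROM or FALSE_HOST in body
```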

  • Silent data corruption in CMS on one of the CASTOR filesystems; trying to assess the impact (SMART failures on the filesystem). One file was found bad and was migrated bad to tape; we still have the original and will fix it, but we need to look for other cases. Hardware experts are currently trying to understand the problem.

  • Decommissioning of the SRM v1 endpoints is ongoing; they will be gone by the end of the week.

DB services (CERN) report: no report

Monitoring / dashboard report: no report

Release update: nothing

AOB: none

Thursday:

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday:

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

-- JamieShiers - 02 Oct 2008

Topic revision: r9 - 2008-10-08 - GavinMcCance