-- HarryRenshall - 17 Apr 2009

Week of 090420

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).

GGUS Team / Alarm Tickets during last week

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Nick, Julia, Alessandro, Harry, Gavin, Ewan, Olof, Roberto, Patricia, Steve);remote(Angela, Michael, Gareth, Ronald).

Experiments round table:

  • ATLAS report - During the weekend many Tier-1 failures were observed, in particular from PIC to the INFN Tier-1 and FZK. GGUS tickets were submitted. The problems were solved at PIC and FZK; no news from INFN. Many other small issues were observed. Also a problem with a Tier-2 (Toronto): starting from March it was down for 80% of the time, sometimes scheduled, sometimes not, but from the point of view of availability for ATLAS it is the same. How should ATLAS cope with such fluctuations? Harry: in the MB it was decided that it is up to the experiment to decommission a site if it is considered unusable. If so, the information about the decommissioning should be made public. Harry will alert the Canadian Tier-2 federation contact person with a copy to the MB.

  • ALICE - (Patricia) Several issues: the second VOBox at the GridKa Tier-1 is not performing well because of late proxy renewal; the renewal period should be increased to 48 hrs. Second issue: Maarten said that several WMS at CERN suffered from large queues last Friday. The WMS were submitting to a site with which there had already been problems during Christmas. Patricia corrected the repeated resubmissions in the ALICE software at that site (it had not been patched), but that seems to have moved the problem to a different WMS, which looks like a problem with the WMS itself. Still investigating. Nick: there is already the GD WMS, which could be used for debugging in a production-like environment. Patricia will start to use it.

  • LHCb reports - (Roberto) All hands on MC09 preparation, which is the next round of physics MC production. In parallel there will be a test of the time-left utility used by LHCb software to use up the CPU slots on the WNs. Other issues over the last couple of days: a problem at PIC with a rogue user, which was immediately fixed. Also a problem with the CASTOR SRM publication in the CNAF BDII. Gavin: seems to be because CNAF is publishing with the wrong site name. LHCb did not receive the notification about the downtime extension at CNAF at the end of March and suggests that sites register a new downtime instead of an extension in such cases. Nick: developers are working on a solution for flagging the different types of downtime extensions.

Sites / Services round table:

  • FZK (Angela) Seeing submission problems: the globus-gma daemon is dying but they don't know why; they keep restarting it. Also confirms the problems reported by ATLAS: they should have been fixed by the latest PBS patch but that does not seem to be the case.
  • BNL (Michael): NTR
  • RAL (Gareth): there were some problems with the national BDII last week. It is all working fine now and they have some ideas about the root cause. A downtime has been scheduled for tomorrow while they repower service nodes (BDII, FTS, ...). Also an 'at-risk' has been put in for CASTOR for next week for a VDQM update (tape queue).
  • NIKHEF (Ronald): announcement: no worker nodes will be available during the upcoming weekend. They will start draining queues from Wednesday (ATLAS) and Thursday (others).
  • CERN (Ewan): a very long tape queue for CMS tape recalls. WMS problems for ALICE and also for ATLAS.

AOB:

  • FZK: seeing SAM test failures because the site BDII cannot be queried from Taiwan. Losing a few percent of availability per day. Could be related to the network connection (a minimal connectivity check is sketched below). Has anybody else seen this problem before? Ronald: yes, some time ago at NIKHEF.
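As an aside (not part of the meeting discussion), one simple way to test the network-connectivity hypothesis is to check whether the site BDII port can be reached from the remote location. The sketch below is only an illustration: the host name is hypothetical and 2170 is assumed to be the conventional BDII port; it is not the SAM test itself.

    import socket

    # Hypothetical site BDII endpoint; 2170 is the conventional BDII port.
    SITE_BDII_HOST = "site-bdii.example.org"
    SITE_BDII_PORT = 2170

    def can_reach(host, port, timeout=5.0):
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        reachable = can_reach(SITE_BDII_HOST, SITE_BDII_PORT)
        print(f"{SITE_BDII_HOST}:{SITE_BDII_PORT} reachable: {reachable}")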

Tuesday:

Attendance: local(Harry, Olof, Ewan, Julia, Maria, Alessandro, Patricia, Roberto, Gavin, Nick);remote(Michael, John Kelly, Jeff).

Experiments round table:

  • ATLAS reports - (Alessandro) No issues to report. One comment about BNL, where ATLAS shifters have by mistake been repeatedly submitting tickets for the same issue. The shifters have been told to check for existing open tickets before submitting a new one.

  • ALICE - (Patricia) Yesterday afternoon/evening a strange issue was seen with ALICE jobs creating unkillable child processes consuming zero CPU. Other sites reported the same problem. The cause was found this morning: a gLite bug (savannah ticket https://savannah.cern.ch/bugs/?49440, submitted this morning) whereby the gLite-WMS jobwrapper leaves a process behind. It was seen at CERN because ALICE had been using the GRIF WMS, which runs the WMS 3.2 version, for submitting jobs to the CERN-PROD CE. This is currently a showstopper for WMS 3.2.

  • LHCb reports - (Roberto) Not much. LHCb is 'massaging' the system in preparation for the upcoming big MC production. FEST issues: an FTS problem with the CNAF channel - seems to be related to bad publishing of CNAF/CASTOR. Another issue is that several sites (~10) have not upgraded to the new VOMS host certificate (from the 31st of March). Tickets have been submitted to the affected sites. Nick thought there was already a critical SAM test for that?

Sites / Services round table:

  • BNL (Michael) A temporary failure in one of the two LFCs running at BNL. The server ran out of file descriptors even though the per-process limit had been set to the maximum and there was no shell limit. They have now set the system-wide limit on the number of file descriptors to a very large number in an attempt to avoid this problem (see the sketch after this list).
  • RAL (John) NTR
  • NIKHEF (Jeff) Question for Patricia: they also saw the problem of excessive load, but they do not see a 'sleep' process in the tree. The problem is going on right now, so Patricia will log in and check after the meeting.
  • CERN (Ewan) NTR
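As an aside (not discussed at the meeting), the distinction Michael described between the per-process file-descriptor limit and the system-wide one can be inspected with the minimal, Linux-specific sketch below; it only reads standard kernel interfaces and is shown purely for illustration.

    import resource

    # Per-process open-file limit (the limit that had already been raised to its maximum).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"per-process limit: soft={soft} hard={hard}")

    # System-wide limit (fs.file-max), the value that was subsequently raised.
    with open("/proc/sys/fs/file-max") as f:
        print("system-wide file-max:", f.read().strip())

    # Current system-wide usage: allocated, unused, maximum.
    with open("/proc/sys/fs/file-nr") as f:
        print("file-nr (allocated unused max):", f.read().strip())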

AOB:

Wednesday

Attendance: local(Eva, Alessandro, Jamie, Sophie, Jean-Philippe, MariaDZ);remote(Gonzalo, Angela, JT, Gareth, Michael).

Experiments round table:

  • ATLAS reports (Alessandro) - Yesterday was a quiet day, with mainly problems in the UK and at BNL. The BNL problem is understood and solved (Michael has updated the elog for ATLAS). In the UK the problems were related to 2 T2s: one had a configuration problem, still to be solved; the second, Durham, got an 'attack' from an ATLAS user (not malicious) which hit the SE - measures are being taken to prevent the error happening again. There was also a DoS on CASTOR at CERN about a week ago. Vladmyr Vinograd was contacted and sent the ROOT macro used for his analysis. It has to be studied to understand the problem - CASTOR should avoid that a user can create such problems.

  • ALICE (Patricia) - The WMS in France has been dropped out of production: too many critical bugs. The site admin has downgraded the version to the latest gLite 3.1. It is suffering some instabilities (at node level, not service level). There is now only one WMS in France, so the latest WMS at CERN (gswms01), which was intended as a debug node, will be put into production for the FR-T1 and FR-T2s.

Sites / Services round table:

  • BNL (Michael) - One of the JVMs on the dCache admin node ran out of memory and hence the entire dCache instance got stuck. The problem started at 20:30 local time and was fixed 90 minutes later. Currently there is a problem in conjunction with a T2 in the US - a problem with a proxy - it is thought to be understood and just has to be fixed.

  • PIC (Gonzalo) - LHCb ID card: one requirement asks to configure authorization (the gridmap file in particular) so that all members are mapped to the standard pool account by default. This means any role other than the standard user will not use a static mapping. Are other VOs happy with this default behaviour? If we configure this for LHCb we have to do it for all other VOs, i.e. removing anything but the standard user from the gridmap files. MariaDZ - how many service nodes still need a gridmap file? Sophie - at CERN something different is done for each VO; for some services there is a special 'hack'. Maria - open a GGUS ticket and assign it to installation and configuration. Does YAIM have to offer one configuration across all VOs? That would be "too brutal" for e.g. sites that support non-LHC VOs. JT - we have implemented this behaviour for all VOs - if that is not ok, it is interesting that there are no complaints!

  • CERN (Sophie) - Some tape robot problems at the moment - being investigated by Sun - affecting the CASTOR tape infrastructure.

AOB:

Thursday

Attendance: local(Harry, Alessandro, MariaD, Diana, Jacek, Roberto, Julia, Nick, Dirk);remote(JohnK).

Experiments round table:

  • ATLAS reports - (Alessandro) The SRM space token for the scratch disk at CNAF is not working. Most likely this is due to a configuration problem. A GGUS ticket has already been submitted.

  • ALICE - Apologies from Patricia

  • LHCb reports - (Roberto) 3 WMS problems at RAL, CNAF and GRIDKA - all have GGUS tickets. John Kelly had no additional information, but will follow up at RAL.

Sites / Services round table:

  • Physics Database Services - (Jacek) The quarterly Oracle security patches will be applied in a rolling way on all pre-production and validation database clusters on Monday and Tuesday next week. The intervention is transparent.

AOB:

Friday

Attendance: local(Jamie, Olof, Nick, Jean-Philippe, Harry, Sophie, Julia, Patricia, Roberto, Stephane, Dirk);remote(Gang Qin, Gareth, Jeff).

Experiments round table:

  • ATLAS reports - (Stephane Jezequel) The functional test for the Italian T2 had to be restarted; no problem. Wuppertal has not been functional for a while and will be cleaned for a fresh start. ATLAS is in contact with the site.

  • ALICE - (Patricia) Smooth running, 2 small issues: for the new WMS made available earlier this week, root access for Martin is requested. The WMS in Cagliari shows performance problems; this is being followed up by ALICE directly with the site.

  • LHCb reports - (Roberto) A new WMS issue at PIC after an upgrade; a ticket has been submitted. The LHCb issue with LFC use from Persistency is being discussed between the two development teams. Jeff: LHCb jobs only arrived at NIKHEF a few hours before the queues had to be disabled for the intervention.

Sites / Services round table:

  • ASGC: For ATLAS the cleaning of the LFC is done; cleaning of the SE is ongoing.
  • NIKHEF: site partially down (e.g. storage is still up); expected to be fully back on Monday.
  • FZK: sporadic dCache read problems. Waiting for a new occurrence for further analysis.

AOB:
