Week of 090914

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Olof, Gang, Andrew, MariaG, Jean-Philippe, Alessandro, Markus, Simone, Diana, MariaD, Roberto, Lola, Ewan, Harry, Julia, Dirk(chair));remote(Gonzalo/PIC, Michael/BNL, Brian + Gareth /RAL, Daniele/CMS , Andreas/FZK, Ron/SARA).

Experiments round table:

  • ATLAS - (Alessandro) Over the weekend only a few issues at T2s. Central catalog: a load-balanced node outage due to a kernel panic; the service was restarted, SAM tests were affected for 8h. ATLAS has started ESD-ESD reprocessing; most T1 sites have been validated for the correct ATHENA version. During this activity ATLAS will run a daily phone conference at 15:30 (after this meeting) focusing on site issues (chaired by the expert on call). All participating sites are invited to join, same code as ADC.

  • CMS reports - (Daniele) Last week was physics week in Bologna - now restarting with the report on the CMS twiki, which should be completed by tomorrow. Some delay in support replies from IN2P3. Good progress in closing tickets in Asia/India. Still waiting on replies from Russian T2s on issues outstanding for 2-3 weeks (not acceptable).

  • ALICE -

  • LHCb reports - (Roberto) Active MC production (some 7k jobs running right now) with an issue on the Dirac side: a large number of failing pilot jobs (WMS overload suspected). Currently debugging many T2 sites (pilot aborts). A T0 network issue this morning (6:30-9:00) affected Castor and the BDII. Also an issue at PIC this morning with no access to Dirac services; the problem disappeared around noon. The reason is not fully understood yet, but it is apparently not related to the T0 network problem. Also at PIC: LHCb found a larger number of stalled user jobs (GGUS ticket from 3rd Sept). Gonzalo: no known issue at PIC, but will check the logs of the Dirac services. Roberto: connection problems not only from CERN but also from RAL (around 12:00). Alessandro: PIC networking works fine for ATLAS. CNAF ticket: one of the StoRM directories had the wrong write protection. SARA: a dCache problem is currently blocking MC09. Ron: waiting for a response from the dCache developers. IN2P3 also had similar dCache issues. They may be able to provide a workaround if the problem turns out to be the same.

Sites / Services round table:

  • Gonzalo: NTR
  • Michael: NTR
  • Gareth/RAL: disk server in ATLAS DATA disk was out until this morning, upgrade of LHCb SRM to 2.8.0 ongoing, migrating most batch capacity to SL5 over next days, CASTOR name server upgrade to 2.1.8 planned.
  • Andreas/FZK: problem with 2 ATLAS disk-only pools being worked on. Thu 17, 8-10 local time, at risk for firmware upgrade (in GOCDB), expect a short interruption of minutes. ATLAS SRM SAM tests have been failing for a few days: ATLAS now has a new dedicated LFC but the tests still use the old LFC - ATLAS will change the tests.
  • Ron/SARA: Thu last week the storage was moved to 10Gb switches; the intervention went smoothly. Last Fri NIKHEF had cooling problems and WNs had to be switched off. The problem is solved now. This Wed the WNs will move to 10Gb, next Tue (22nd) grid services will move to the new networking infrastructure.
  • Gang/ASGC: good news for ATLAS: T1 and T2 will ultimately stay separate - exact timeline will be given by Jason later. The currently available 6 tape drives will be extended by another 18 drives in mid-October. Tape performance will be limited until then. Plan to dedicate 4-5 drives to ongoing ATLAS activity. Mail from Jason: CERN<->ASGC network performance problems since the network maintenance at CERN; CERN and T1 experts are investigating the cause.
  • MariaG/CERN: Security patch being applied to LHCb online today.
  • MariaD/CERN and OSG participants: discussion of OSG tickets related to LHC experiments now takes place in this meeting (each Monday). Currently no items in the escalation queue for OSG. Does ATLAS want to test the alarm ticket procedure already today? Fine for ATLAS and OSG. ATLAS will create a new alarm ticket after the green light from Diana/MariaD.

AOB:

Tuesday:

Attendance: local(Harry,Jamie, Gavin, Olof, Ewan, Patricia, Julia, Jean-Philippe, Alessandro, Roberto, Diana, Dirk(chair));remote(Gonzalo/PIC, Tiju/RAL, Michael/BNL, Jeremy/GridPP, Daniele/CMS ).

Experiments round table:

  • ATLAS - (Alessandro) Another kernel panic on the 3 load-balanced central catalog nodes (experts are investigating), but the service was not affected. ATLAS reprocessing activity had to be stopped due to unforeseen problems: all tasks have been aborted. More detail in the ATLAS daily meeting in 30 min.

  • CMS reports - (Daniele) CMS is following the Castor intervention, which took longer than expected, but the information flow is good. T1 sites: going systematically through all issues still open from the last two weeks. CMS remarked that CERN and all T1s had very high availability during the last weeks! A number of individual T1/T2 transfer problems are being followed up with good progress. The CMS twiki gives more details and individual ticket references. UCSD is restructuring its storage element: all files are deleted and afterwards retransferred (to measure the time for several TBs hosted).

  • ALICE - (Patricia) ALICE production at all sites is stopped to prepare for the new MC cycle. Now is a good moment for sites to provide a new VObox. One instance has been set up at CERN (SLC5 VObox pointing to CREAM) and should be ready for the next production.

  • LHCb reports - (Roberto) Smooth MC and merging activity at T1s. As many sites are moving to SL5 while Dirac certification for SL5 is still pending, the usable capacity for Dirac is decreasing. One faulty WN has been observed at CERN. LHCb is still debugging many T2 sites with failing pilot jobs (open GGUS tickets and more details on the LHCb twiki).

Sites / Services round table:

  • Gonzalo/PIC: PIC in scheduled downtime, transfer failures today due to that intervention - back at 5. SRM outage affects SE - always closed as well.
  • Tiju/RAL: Castor name server upgrade encountered a problem - outage extended to tomorrow midday.
  • Michael/BNL: NTR
  • Jeremy/GridPP: NTR
  • Gang/ASGC: NTR - Alessandro: any result from the debugging of transfer problems ASGC->T2? Gang: no concrete problem found. Alessandro: maybe FTS monitoring would help to diagnose the problem. Currently the configuration of the ASGC FTS does not allow further diagnostics for ATLAS. Gang: will follow up.

AOB:

Wednesday:

Attendance: local(Gang, Jan, Antonio, Harry, Olof, Patricia, Roberto, Eduardo, Alessandro, MariaD, Jamie, Dirk(chair));remote(Gareth/RAL, Xavier/FZK).

Experiments round table:

  • ATLAS - (Alessandro) ATLAS dashboard still quite red but due to only a few issues. RAL in extended downtime - waiting for a report. SARA in scheduled downtime and out of DDM; T1-T1 traffic from/to SARA is also affected. FTS problems with CNAF as destination (e.g. from TRIUMF and other sites). Suspect that CNAF disappeared briefly while the FTS config was updated. The problem went away at most sites but TRIUMF should verify/update its FTS server config.

  • ALICE - (Patricia) ALICE is back in production and ramped up quickly: first 13k now 3k jobs. Two of the new voboxes are being flagged “maintenance mode” to avoid spma errors for operators. Final configuration will be put into Quattor soon.

  • LHCb reports - (Roberto) Little activity at the moment. Dirac has been certified for SL5, which allows LHCb to progressively use SL5 resources. The proposed Castor intervention slot is fine for LHCb. SARA: still a problem with the locality returned by SRM - a show stopper for LHCb: asking for an update from the site.

Sites / Services round table:

  • Gareth/RAL: Name server update (2.1.8) intervention triggered unrelated problems with disk-to-disk copies. RAL rolled back to the old name server version but the problem persisted. RAL experts are working closely with Castor development. A service incident report will be created once the root cause has been identified.
  • Xavier/FZK: Scheduled outage from 8:00-10:00 local time for firmware upgrade, expect a short network outage.
  • Gang/ASGC: NTR
  • Jan/CERN: Castor update today went well. CMS Castor intervention yesterday had to be extended due to a missing index after reimporting the DB. A service incident report has been prepared.

Release report: deployment status wiki page

Antonio summarised the status of upcoming releases. Alessandro: DPM 1.7.3 with the SRM patch for SL5 is still in certification. Some sites are upgrading to SL5 already. When will the patch be out? Antonio: no firm date yet, but will follow up offline.

AOB:

MariaD: A test alarm ticket to BNL will be submitted by ATLAS on Monday at 15:00. ATLAS should report whether the ticket was successfully assigned to OSG.

Thursday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

Friday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

-- DirkDuellmann - 2009-09-14

Topic revision: r6 - 2009-09-17 - JanIven