Week of 090907

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, Weekly joint operations meeting minutes

Additional Material:

STEP09, ATLAS, ATLAS logbook, CMS, WLCG Blogs


Monday:

Attendance: local(Jamie, Steve, Gavin, Andrew, Edoardo, Stephane, Olof, Jean-Philippe, Patricia, MariaD, Roberto, Gang, MariaG, Dirk(chair));remote(Gonzalo, Ron, Gareth, Angela).

Experiments round table:

  • ATLAS - (Stephane) A disk server RAID rebuild at RAL over the weekend affected ATLAS. Q: why do such problems occur more frequently at RAL than at other sites? Gareth: when there are issues with the automatic RAID rebuild (e.g. a hot spare problem, or the controller failing to initiate the rebuild), disk servers have to be taken out of production. What do other sites do? T0 does not systematically take disk servers out of production, but also does so in case of rebuild problems. ATLAS experienced problems at Lancaster: the site has migrated to DPM 1.7.2 but has not yet applied the patch that solves the SRM problems discovered at Glasgow during STEP. Several other sites are also still missing this patch. Brian: will look into this problem based on the ATLAS Elog entry. ATLAS also saw slow T2 transfers from ASGC; ATLAS is investigating together with site experts.

  • ALICE - (Patricia) Quiet weekend. The CREAM CE at CERN is down (ticket 51389, "BL parser is not alive"); it may just need a restart. Gavin: will check. In addition, the WMS at CERN is showing strange behavior, with the number of scheduled jobs ramping up and down quickly. So far only the falling slope can perhaps be explained, by the ALICE workload falling from 11k to 5k jobs.

  • LHCb reports - (Roberto) A large number of MC production jobs ran over the weekend, but LHCb experienced problems with the WMS at CERN, which forced it to stop pilot submission to allow the system to catch up (more details in the LHCb twiki report) (running/idle). It is not yet clear whether the cause is local to CERN or is rather a wrong number of jobs published by the sites (one case found at Manchester: 8k jobs waiting, while the info system showed 11 jobs). Investigation is ongoing at additional sites. As the system has recovered, LHCb has resumed pilot submission. Castor outage this morning from 6:10 to 7:45, caused by lack of database space. A SIR has been requested to clarify the DB space monitoring in place.

Sites / Services round table:

  • Angela/FZK: NTR
  • Ron/SARA: New intervention for network expansion next Wed 9 Sept (whole day) - apologies for the late announcement. Further interventions are planned for 16 Sept, when grid services will move to another switch (short outage), and for 22 Sept, when mass storage will move to a new switch (short tape outage). Roberto: does the 9 Sept outage also apply to NIKHEF? No, just SARA. MariaG: the intervention slot for the DB migration at SARA to new h/w on 11 Sept falls on a Friday, and some of the preconditions have not yet been met. Suggests rescheduling to a later date (to be announced).
  • Gareth/RAL: Scheduled outage tomorrow on the 3D cluster for migration to 64-bit Oracle. Wed 9 Sept: SRM endpoint upgrade (2h, expected to be transparent).
  • Gonzalo/PIC: NTR
  • Gang/ASGC: Last Saturday inode space was exhausted; fixed on the same day.
  • Gavin/CERN: The scheduled Linux upgrade is ongoing and should finish today. The LSF service will start using a new license server (intervention should be transparent).
  • Edoardo/CERN: Maintenance on the LHC backbone. The problem with T0->T1 traffic observed by PIC and RAL is now understood and fixed (https://gus.fzk.de/ws/ticket_info.php?ticket=51180). PIC should confirm that the issue is now resolved.
  • Jan/CERN: The Castor upgrade on the shared t3 instance (analysis) will take place tomorrow; the upgrade of the experiment stager is planned for Tue/Wed next week.

AOB

Tuesday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

Wednesday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

Release report: deployment status wiki page

AOB:

Thursday

No call today - holiday at CERN (Jeûne genevois)

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 2009-09-03

Topic revision: r5 - 2009-09-08 - DirkDuellmann
 