Week of 090907

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Jamie, Steve, Gavin, Andrew, Edoardo, Stephane, Olof, Jean-Philippe, Patricia, MariaD, Roberto, Gang, MariaG, Dirk(chair)); remote(Gonzalo, Ron, Gareth, Angela).

Experiments round table:

  • ATLAS - (Stephane) A disk server RAID rebuild over the weekend at RAL affected ATLAS. Q: why does this problem occur more frequently at RAL than at other sites? Gareth: when there are issues with the automatic RAID rebuild (e.g. a hot spare problem, or the controller failing to initiate the rebuild), disk servers have to be taken out of production. What do other sites do? T0 does not systematically take disk servers out of production, but does so in case of rebuild problems. ATLAS experienced problems at Lancaster: the site has migrated to DPM 1.7.2 but has not yet applied the patch solving the SRM problems discovered during STEP in Glasgow. Several other sites are also still missing this patch. Brian: will look into this problem based on the ATLAS Elog entry. ATLAS also saw slow T2 transfers from ASGC and is investigating together with site experts.

  • ALICE - (Patricia) Quiet weekend. CREAM CE down at CERN (ticket 51389, BL parser is not alive); it may just need a restart. Gavin: will check. In addition, the WMS at CERN is showing strange behaviour, with the number of scheduled jobs ramping up and down quickly. So far only the falling slope can perhaps be explained, by the ALICE workload falling from 11k to 5k jobs.

  • LHCb reports - (Roberto) A large number of MC production jobs ran over the weekend, but problems with the WMS at CERN forced LHCb to stop pilot submission to allow the system to catch up (more details in the LHCb twiki report) (running/idle). It is not yet clear whether the cause is local to CERN or rather a wrong number of jobs published by the sites (one case found at Manchester: 8k jobs waiting, while the info system showed 11). Investigation is ongoing at additional sites. As the system has recovered, LHCb has resumed pilot submission. CASTOR outage this morning from 6:10 to 7:45, caused by lack of database space. SIR requested to clarify the DB space monitoring in place.

Sites / Services round table:

  • Angela/FZK: NTR
  • Ron/SARA: New intervention for network expansion next Wed 9 Sept (whole day) - apologies for the late announcement. Further interventions are planned for 16 Sept, when grid services will move to another switch (short outage), and 22 Sept, when mass storage will move to a new switch (short tape outage). Roberto: does the 9 Sept outage also apply to NIKHEF? No, just SARA. MariaG: the intervention slot for the DB migration at SARA to new h/w on 11 Sept falls on a Friday, and some of the preconditions have not yet been met. Suggest rescheduling to a later date (to be announced).
  • Gareth/RAL: Scheduled outage tomorrow on the 3D cluster (migration to 64-bit Oracle). Wed 9 Sept: SRM endpoint upgrade (2h, expected to be transparent).
  • Gonzalo/PIC: NTR
  • Gang/ASGC: Last Sat inode space was exhausted; fixed on the same day.
  • Gavin/CERN: Scheduled Linux upgrade is ongoing and should finish today. The LSF service will start using a new license server (intervention should be transparent).
  • Edoardo/CERN: Maintenance on the LHC backbone. The problem with T0->T1 traffic observed by PIC and RAL is now understood and fixed (https://gus.fzk.de/ws/ticket_info.php?ticket=51180). PIC should confirm that the issue is now resolved.
  • Jan/CERN: CASTOR upgrade on the shared t3 instance (analysis) will take place tomorrow; the upgrade of the experiment stagers is planned for Tue/Wed next week.
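
The inode exhaustion reported by ASGC above is the kind of condition a simple threshold check can flag early. The following is a minimal sketch only, assuming a POSIX host and Python's standard `os.statvfs`; the function names, the default path and the 90% threshold are illustrative assumptions, not ASGC's actual monitoring.

```python
import os


def inode_usage_fraction(total_inodes: int, free_inodes: int) -> float:
    """Return the fraction of inodes in use, between 0.0 and 1.0."""
    if total_inodes <= 0:
        # Some filesystems report zero total inodes; treat as "no inode limit".
        return 0.0
    return (total_inodes - free_inodes) / total_inodes


def inodes_nearly_exhausted(path: str = "/", threshold: float = 0.9) -> bool:
    """Return True if inode usage on the filesystem holding `path` exceeds `threshold`.

    The 0.9 default is an arbitrary illustrative value.
    """
    st = os.statvfs(path)  # available on POSIX systems only
    return inode_usage_fraction(st.f_files, st.f_ffree) > threshold
```

A check like `inodes_nearly_exhausted("/some/mount")` could be run periodically; the mount point here is hypothetical.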

AOB:

Tuesday:

Attendance: local(MariaG, Gavin, Gang, Patricia, Roberto, Andrea, Jamie, JPB, MariaD, Diana, Dirk(chair)); remote(Onno/SARA, John/RAL, Michael/BNL).

Experiments round table:

  • ATLAS - (Stephane) Problems with T2 transfers at ASGC are being investigated. Some files at IN2P3 are not accessible: a disk server is down for repair.

  • CMS reports - (Andrea) File migration problem at T0 solved - what caused the problem? Gavin will add a link to the full description to the minutes. After its upgrade, the development instance of PhEDEx showed a missing package (openldap). A ticket has been created and the production PhEDEx upgrade has been delayed.

  • ALICE - (Patricia) CREAM at T0: any news? Gavin: the problem with the BL parser is not yet fully understood, but is being investigated. New ticket for Budapest: VOBOX not reachable. The problem with the SARA VOBOX, which had lasted several days, is now solved and the VOBOX is back in production. The T2 in Bologna also has VOBOX problems - a ticket will be created.

  • LHCb reports - (Roberto) The WMS problem reported yesterday has been traced to a new bug in the gLite WMS (details on the LHCb twiki). Despite the WMS problems, LHCb is now running 15k jobs for MC/physics production, but is operating close to a limit. New ticket at CNAF: shared area instability. Short CASTOR DB instability (h/w problem on the head nodes) around noon, which affected experiment users.

Sites / Services round table:

  • Onno/SARA: Problems with SRM pools (last week and today): when the maximum number of open files is reached, dCache switches the pool off. The site has now increased this parameter. WNs: a s/w directory using a GlusterFS mount showed instability and has now been replaced with an NFS mount. Reminder: network maintenance tomorrow.
  • John/RAL: The scheduled DB outage had to be extended to 3pm local time (affects the ATLAS 3D cluster).
  • Michael/BNL: NTR
  • Jos/FZK: LFC hiccup for ATLAS FTS after a DNS misconfiguration - solved now.
  • Gang/ASGC: NTR
  • Gavin/CERN: Upgrade of the t3 CASTOR instance is now complete. LSF for the main batch cluster has been reconfigured to use a new license server. Batch users should report any unexpected behaviour. Scheduled OS upgrades are being run now.
  • MariaG/CERN: Security patch and recommended patch bundle (PSU) applied in a rolling way to the CMS online DB cluster yesterday and to LHCb offline this afternoon. Tomorrow the CMS offline and WLCG clusters will follow. Streams replication to FZK is currently paused due to problems with the destination DB. Jos: being dealt with by the site and ATLAS DBAs.
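
SARA's report above describes a pool being switched off once the maximum number of open files is reached. How dCache itself is configured is not covered in these minutes; as a generic illustration only, the per-process open-file limit on a Unix host can be read with Python's standard `resource` module. The helper names and the 80% warning fraction below are assumptions made for this sketch.

```python
import resource


def current_soft_limit() -> int:
    """Return this process's soft limit on open file descriptors."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def headroom(open_now: int, soft_limit: int) -> int:
    """Number of further files that can be opened before the soft limit is hit."""
    return max(soft_limit - open_now, 0)


def near_limit(open_now: int, soft_limit: int, warn_fraction: float = 0.8) -> bool:
    """Return True once open files reach `warn_fraction` of the soft limit.

    An unlimited soft limit (resource.RLIM_INFINITY) never triggers a warning.
    The 0.8 default is an arbitrary illustrative value.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return False
    return open_now >= warn_fraction * soft_limit
```

Warning before the limit is reached, rather than at it, leaves time to raise the parameter before a service shuts itself off.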

AOB:

Wednesday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

Release report: deployment status wiki page

AOB:

Thursday:

No call today - holiday at CERN (jeune Genevois)

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

Friday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 2009-09-03

Topic revision: r8 - 2009-09-08 - MariaGirone
 