Week of 120105

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

CERN closed

Tuesday:

CERN closed

Wednesday

CERN closed

Thursday

Attendance: local(Massimo, Stephane, Alessandro, MariaD, Eddie, Marcin, Zbigniew, Maarten, Manuel); remote(Paco, Michael, Burt, Mette, Ian, Pavel, Rob, Rolf, Gareth, Vladimir).

Experiments round table:

  • ATLAS reports -
  • Christmas summary
    • HAPPY NEW YEAR
    • ATLAS was able to run 110-120k jobs continuously over the Christmas period. The Grid ran efficiently throughout, and the usual site issues (mainly at T2s) did not significantly affect global production. CERN CPUs were not fully used because the last-minute changes to bypass eos-srm limitations with 6k jobs were not fully debugged.
    • For T1s implementing the TAPE family: new project mc11_2TeV
    • GGUS:77804: the backup path for IN2P3-CC and SARA to INFN-T1 was going over the general-purpose network, which was saturated. As a consequence, transfers to IT T1 and T2 sites were affected. The backup path now goes through CERN.
    • VOMS issue at BNL (elog:32755): affected Panda (still not understood why having the CERN server running correctly was not sufficient) and analysis jobs. Many sites were blacklisted for analysis for about half a day (3 January).
    • Software versions not deployed everywhere (Savannah:90197): affecting mainly TW and RAL/UK T2s (related to a change in Stratum 0/1 at CERN in December).
    • ATLAS SSB not providing reliable numbers over the first two days of the year: the displayed information is correct now and the previous values will be corrected.
    • RAL scheduled downtime (today): CASTOR + FTS.
    • Reminder: due to the ATLAS database migration to Oracle 11g, there will be no ATLAS activity on the Grid from late afternoon on Monday 16 January until the afternoon of the 17th. Please use this slot as an opportunity for interventions.
  • January 5
    • RAL scheduled downtime: CASTOR + FTS
    • GGUS:77804: the problem seems to have been solved by the network team. The backup path for IN2P3-CC and SARA to INFN-T1, which was going over the saturated general-purpose network and affecting transfers to IT T1 and T2 sites, now goes through CERN.
  • January 4:
    • VOMS server issue at BNL (quickly solved once the problem was identified). One of the Panda servers could not renew its proxy (4 renewal attempts before expiration); see the proxy-renewal sketch after this report. A fraction of HammerCloud jobs failed to register their output dataset and, as a consequence, most of the analysis queues were set to 'brokeroff'. Because of observed DDM limitations, the successful jobs did not run quickly enough, so the analysis queues were set back online manually to speed up recovery.
    • New problems with 5 T2s
  • January 3
    • GGUS:77804 - transfers failing from the DE and NL clouds to the IT cloud: 'probably related to an LHCOPN link failure at the Italian GARR PoP in Milan'
    • CERN-PROD: Rod is trying to force waiting jobs to CERN-CVMFS.
    • Software versions not deployed everywhere (Savannah:90197): affecting mainly TW and RAL/UK T2s (related to a change in Stratum 0/1 at CERN in December)
    • ATLAS SSB not providing reliable numbers over the first two days of the year: the displayed information is correct now and the previous values will be corrected
    • Main activity is still MC production
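
A note on the proxy-renewal mechanics behind the BNL VOMS item above: this is a minimal, purely illustrative Python sketch (not the Panda code) of checking a VOMS proxy lifetime and renewing it when it gets short. It assumes the standard voms-proxy-info / voms-proxy-init clients are on the PATH; the 6-hour threshold, the 'atlas' VO and the 96-hour validity are arbitrary example values.

    #!/usr/bin/env python
    """Check the local VOMS proxy lifetime and renew it when it runs short.
    Illustrative sketch only; threshold, VO name and validity are examples."""

    import subprocess

    RENEW_BELOW_SECONDS = 6 * 3600  # renew when less than 6 hours remain (arbitrary)

    def proxy_seconds_left():
        """Return the remaining proxy lifetime in seconds, or 0 on any error."""
        try:
            out = subprocess.check_output(["voms-proxy-info", "-timeleft"])
            return int(out.decode().strip())
        except (subprocess.CalledProcessError, OSError, ValueError):
            return 0

    def renew_proxy():
        """Request a fresh proxy with ATLAS VOMS attributes (may prompt for a passphrase)."""
        subprocess.check_call(["voms-proxy-init", "-voms", "atlas", "-valid", "96:00"])

    if __name__ == "__main__":
        left = proxy_seconds_left()
        if left < RENEW_BELOW_SECONDS:
            print("proxy has %d s left, renewing" % left)
            renew_proxy()
        else:
            print("proxy still valid for %d s" % left)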

  • CMS reports -
    • LHC / CMS detector
      • Shutdown
    • CERN / central services
      • One TEAM ticket was issued during the holidays for SRM problems at CERN; it was promptly addressed by the experts.
    • T0
      • The Tier-0 farm performed LHE MadGraph production at 8 TeV over the break; it ran generally smoothly.
    • T1 sites:
      • #125488: transfers from T1_TW_ASGC to many T2s failing (properly bridged to GGUS)
      • Tier-1 sites had good availability and were responsive over the holiday. Reprocessing activity remained reasonably high.
    • T2 sites:
      • Noticeable dip in analysis submissions to Tier-2s during the holiday week. Beginning to come back up.
      • Tier-2s had good availability over the holiday
    • Other:
      • Over the break 60 tickets were issued, roughly half of which have been closed.

  • LHCb reports -
    • Experiment activities
      • MC11 Monte Carlo production (some problems with merging jobs at RAL around 1 January and more recently at CERN)
    • T0
    • T1

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: Waiting for answers on a few tickets from CMS (network with the US: GGUS:75983; SW area configuration: GGUS:77359). The other pending issue was confirmation of the closure of GGUS:77633 (AFS problem): done.
  • KIT: ntr. Note that tomorrow is a public holiday in Germany.
  • NDGF: ntr
  • NLT1: ntr
  • PIC:
  • RAL: The LHCb problem may have been due to a disk server that failed at around the same time. The upgrades are going OK; they should be done by 4 pm UTC as expected.
  • OSG: ntr

  • CASTOR/EOS:
    • Name Server upgrade to Oracle 11g (plus new hardware): 11 January, whole morning (all CASTOR services unavailable).
    • Note that on the 10th we will upgrade PUBLIC to the new version (2.1.11-9 TG) using the new tape software (for the moment only on PUBLIC, not on the LHC experiment instances).
    • Oracle 11g for the experiment stagers: ATLAS on 16 January, CMS on the 17th, ALICE on the 23rd and LHCb on the 24th (2 h downtime).
  • Central Services: As of 16 January no LCG-CE will be deployed at CERN. MariaD: please remove the OWS settings for ticket routing (Christmas setting).
  • Data bases: Several interventions are scheduled (essentially the Oracle 11g migration). ATLAS (online) will be affected as of 11 January at 2 pm. Tomorrow we will try to put together a table of the main interventions so that they can be better coordinated (see the ATLAS comment above).
  • Dashboard: ntr

AOB:

Friday

Attendance: local(Massimo, Maarten, Alessandro, Stephane, Eddie, Manuel, Marcin); remote(Michael, Onno, JhenWei, Burt, Rolf, Rob).

Experiments round table:

  • ATLAS reports -
    • RAL back in production after the scheduled downtime
    • FTS pilot 2.2.8 being tested for all transfers to CERN (until Monday). No problems observed so far.
    • Data export with gsiftp from EOS-CERN being tested within the Functional Tests (transfers to CERN-EOS through gsiftp already tested). No problems observed so far.
    • GGUS:77892: srm-eos instability affecting data export from CERN on 4-5 January (mainly MC production for the other T1s). OK since last evening.
    • GGUS:77923: CVMFS issue at TAIWAN. The caches on the WNs were full; the cache size is being increased (see the sketch below).
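
On the TAIWAN CVMFS ticket above, the immediate symptom was the worker-node cache filling up. Below is a minimal monitoring sketch (an illustration, not part of CVMFS itself) that reports how full the filesystem holding the cache is. The /var/lib/cvmfs location and the 80% warning threshold are assumptions; the actual cache quota is normally set via CVMFS_QUOTA_LIMIT (in MB) in /etc/cvmfs/default.local.

    #!/usr/bin/env python
    """Report how full the filesystem holding the CVMFS cache is.
    Illustrative sketch; cache path and threshold are assumptions."""

    import os

    CACHE_BASE = "/var/lib/cvmfs"  # common default for CVMFS_CACHE_BASE
    WARN_FRACTION = 0.8            # warn when the partition is more than 80% full

    def cache_usage(path):
        """Return (used_bytes, total_bytes) for the filesystem holding 'path'."""
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        free = st.f_bavail * st.f_frsize
        return total - free, total

    if __name__ == "__main__":
        used, total = cache_usage(CACHE_BASE)
        frac = float(used) / total if total else 0.0
        print("cache partition %s: %.1f%% used" % (CACHE_BASE, 100 * frac))
        if frac > WARN_FRACTION:
            print("WARNING: consider raising CVMFS_QUOTA_LIMIT or cleaning the cache")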

Sites / Services round table:

  • ASGC:
    • CVMFS problem being fixed (tracked down to a WN problem); expected to be resolved within a few hours.
    • CMS T1-to-T2 transfers failing: now fixed (configuration change; too many parallel streams were being used).
  • BNL: ntr
  • CNAF:
  • FNAL: ntr
  • IN2P3: ntr (a typo in yesterday's minutes has been corrected: wrong GGUS number)
  • KIT:
  • NDGF:
  • NLT1:
    • Last night one data server went down; restarted this morning
    • GGUS:77660: Investigating transfer failures from SARA to T2s (Ireland, Israel, ...). Waiting for a reply from dCache support, but the error is suspected to be network related (TCP settings or similar).
  • PIC:
  • RAL: ntr (yesterday's intervention went OK)
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central Services: ntr
  • Data bases: ntr
  • Dashboard: ntr

AOB: ATLAS would like to ask NLT1 to increase the number of files transferred in parallel (from CERN); currently this is set to 10. Since this might impact LHCb, it was decided to wait until Monday to see whether there are counterarguments, and possibly to schedule a test during next week (see the sketch below).
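
For illustration of the kind of change being discussed, here is a hypothetical Python sketch that raises the number of concurrent files on an FTS channel via the glite-transfer-channel-set admin client. The channel name is made up, channel-administration privileges are required, and the -f option (number of concurrent files) is an assumption to be verified against the FTS client documentation for the version in use.

    #!/usr/bin/env python
    """Hypothetical sketch: raise the number of concurrent files on an FTS channel.
    Channel name and command option are assumptions; verify with
    'glite-transfer-channel-set --help' on the FTS host before using."""

    import subprocess

    CHANNEL = "CERN-SARA"  # illustrative channel name, not taken from the minutes
    NEW_FILES = 20         # e.g. raise from 10 to 20 concurrent files

    def set_concurrent_files(channel, nfiles):
        """Call the FTS channel admin CLI (requires channel-admin rights)."""
        subprocess.check_call(["glite-transfer-channel-set", "-f", str(nfiles), channel])

    if __name__ == "__main__":
        set_concurrent_files(CHANNEL, NEW_FILES)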

-- JamieShiers - 29-Nov-2011
