Week of 120716

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Massimo, Luc, Maarten, Giuseppe, Ulrich, Edward, Alexandre); remote(Michael, Saerda, Gonzalo, Jhen-Wei, Lisa, Ronald, Paolo, Tiju, Vladimir, Rolf, Rob).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
    • T1
      • PIC transfer failures after migration to Chimera. Alarm ticket GGUS:84217. PIC stable now & back in T0 export.
    • CALIB_T2

  • CMS reports -
    • LHC machine / CMS detector
      • Taking data during the week-end
      • Van der Meer scan for CMS is foreseen on Tuesday morning
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • PIC recovered almost completely; some runs were not transferred from T0 to PIC due to a problem that was fixed in the morning
      • GGUS:83486 (FTS delegation problem): currently no problems, but keeping it here until the software is fixed
      • GGUS:84229: CMSSW_5_3_2_patch4 missing at PIC. Will be installed by the SW deployment team ASAP
      • T2_DE_DESY had a power cut today. The site is recovering; GRID services may be affected until tomorrow
    • Other:
      • NTR

  • LHCb reports -
    • User analysis and reconstruction at T1s
      • MC production at T2s
    • New GGUS (or RT) tickets
Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • NDGF: ntr
  • NLT1: In response to the question in GGUS:84223 (ticket solved after the weekend), Ronald pointed out that this matches their service level (weekend support on best effort)
  • PIC: the CMS SW ticket was due to an "sgm" worker node misconfiguration; CMS should trigger another software install. GGUS:84217 was due to the system overloading after the upgrade (registrations + new transfers); the new transfers had to be cancelled in order to finish the registrations (situation recovered on Saturday around noon).
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central Services: one CE had its /var partition fill up. It is back in production, but the root cause is under investigation. The LHCb ticket is also under investigation
  • Dashboard: ntr
AOB:

Tuesday

Attendance: local(Massimo, JhenWei, Oliver, Guido, Alexandre, Edward, Ulrich, Eva, Maarten); remote(Michael, Saerda, Paolo, Lisa, Tiju, Gonzalo, Jeremy, Rolf, Vladimir, Rob).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD: FTS problem GGUS:84154 still open, no major news, not a showstopper
      • atlt3 Castor pool being erased, will be discarded by ATLAS in the next few days
    • T1
      • PIC downtime finished yesterday
      • NDGF-T1: transfer failures to MCTAPE due to a staging problem, GGUS:84207, solved (files lost)
    • CALIB_T2
      • INFN-NAPOLI still in downtime after the power cut over the weekend

  • CMS reports -
    • LHC machine / CMS detector
      • Van der Meer scan for CMS is now foreseen on Tuesday afternoon/night
      • Tomorrow, Wednesday, back to physics
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • KIT: high load situation on Frontier squids, maybe related to large number of running jobs yesterday? Peaked at close to 5k running jobs in parallel.
      • GGUS:83486 (FTS delegation problem): currently no problems, but keeping it here until the software is fixed
      • T2_DE_DESY had a power cut yesterday. The site is recovering; network and basic services are working again, and queues have been opened
    • Other:
      • NTR

  • LHCb reports -
    • Nothing new to report
Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NLT1: ntr
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central Services: ntr
  • Data bases: ntr
  • Dashboard: ntr
AOB:

Wednesday

Attendance: local(Massimo, Guido, Alexandre, Edward, Ulrich, Luca, Luca, Maarten); remote(Oliver, Michael, Saerda, Jhen-Wei, Paolo, Burt, Tiju, Gonzalo, Rolf, Vladimir, Kyle, Alexander).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD: some failures in writing to castor (castoratlas/t0atlas), under investigation
      • 2 new LFC frontend nodes added yesterday; since then, a high number of connections (each node accepts 90 connections)
    • T1
      • PIC: many transfer failures (GGUS:84311). All dCache pools assigned to the ATLAS VO were full; new disk space was assigned to ATLAS, and there have been no more errors since
      • TRIUMF: GGUS:84327, bad ACLs on some directories in the LFC; asked the site to kindly change them (the certificate used to create them is no longer valid)
    • CALIB_T2
      • INFN-NAPOLI back in production since last night, everything fine

  • CMS reports -
    • LHC machine / CMS detector
      • Machine had RF and cryo problems yesterday
      • Van der Meer scans postponed, expected to start this afternoon with LHCb, then Atlas and CMS starting in the evening, takes 12 hours for Atlas and CMS
    • CERN / central services and T0
      • GGUS:84302: ce206, ce207, ce208 show issues when jobs from wms316 land there: "Failed to create a delegation id", recovered over night, was most probably effect of reconfiguration yesterday, ticket closed
    • Tier-1/2:
      • KIT: high load situation on Frontier squids, had again a spike in number of jobs
        • the two CMS squids were again maxed out, although they max out at 54 MB/s and not at over 100 MB/s, which is normal for squids with 1 Gbit connections
        • currently failover to CERN is able to sustain the load
        • recommendation is that KIT deploys a 3rd squid (CNAF did already) and investigates why the performance of the two already deployed is limited
    • Other:
      • NTR

  • ALICE reports -
    • Low job activity due to temporary unavailability of a popular package (fixed) and erroneous job train definition (fixed). Activity should ramp up later today.
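The recurring KIT squid numbers above can be sanity-checked with simple arithmetic: a 1 Gbit/s network connection (the link speed the report assumes per squid) corresponds to 125 MB/s in decimal units, so squids saturating at 54 MB/s are running at well under half of line rate. A minimal sketch of that check:

```python
# Back-of-the-envelope check of the KIT squid throughput figures.
# Assumptions: 1 Gbit/s NIC per squid (from the report), decimal units (1 MB = 1e6 bytes).
LINK_BITS_PER_S = 1e9                         # 1 Gbit/s link
theoretical_mb_s = LINK_BITS_PER_S / 8 / 1e6  # bits -> bytes -> MB: 125.0 MB/s
observed_mb_s = 54.0                          # throughput at which each KIT squid maxed out
utilization = observed_mb_s / theoretical_mb_s

print(f"line rate: {theoretical_mb_s:.0f} MB/s, "
      f"observed: {observed_mb_s:.0f} MB/s ({utilization:.0%} of line rate)")
```

This is consistent with the recommendation: since >100 MB/s is normal for a squid on a 1 Gbit link, the 54 MB/s ceiling points to a host-side limitation rather than network saturation.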

Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • NDGF: ntr
  • NLT1: ntr
  • PIC: shortage of space (ATLAS) due to some hardware problems in the new delivery. Being fixed; in the meantime some spare disks have been put in production to alleviate the problem
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: investigating the root cause of the ATLAS problem
  • Central Services: CE reconfiguration was due to a restart of Tomcat after installing the new BLAH component (normally invisible)
  • Data bases: in relation to the LFC problem, whenever possible they would like to be pre-alerted about an expected increase in load
  • Dashboard: ntr
AOB:

Thursday

Attendance: local(Massimo, Guido, Alexandre, Edward, Ulrich, Luca, Eva, Maarten); remote(Oliver, Michael, Saerda, Jhen-Wei, Paolo, Lisa, John, Rolf, Vladimir, Rob, Alexander).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD: still a few failures when reading from castor (castoratlas/t0atlas), under investigation; not a showstopper
    • T1
      • TRIUMF: GGUS:84327, bad ACLs on some directories in LFC, solved
    • T2
      • nothing to report

  • CMS reports -
    • LHC machine / CMS detector
      • Van der Meer scans for CMS started in the morning, takes 12 hours
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • T1_DE_KIT: high load situation on Frontier squids, had again a spike in number of jobs
        • the two CMS squids were again maxed out, although they max out at 54 MB/s and not at over 100 MB/s which is normal for squids with 1 Gbit connections
        • currently failover to CERN is able to sustain the load
        • recommendation is that KIT deploys a 3rd squid (CNAF did already) and investigates why the performance of the two already deployed is limited
      • T1_TW_ASGC: periodic stage out problems noticed in T1 prompt processing, GGUS:84365
    • Other:
      • NTR

Sites / Services round table:
  • ASGC: CMS problem due to new worker nodes (misconfiguration). Now closed.
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • NDGF: ntr
  • NLT1: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central Services: ntr
  • Data bases: ntr
  • Dashboard: ntr
AOB:

Friday

Attendance: local(Massimo, Oliver, Xavier, Edward, Maarten); remote(Stephan, Michael, Saerda, Jhen-Wei, Paolo, Lisa, John, Rolf, Vladimir, Rob, Xavier, Onno, Gonzalo).

Experiments round table:

  • CMS reports -
    • LHC machine / CMS detector
      • physics data taking
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • T1_DE_KIT: high load situation on Frontier squids, had again a spike in number of jobs
        • the two CMS squids were again maxed out, although they max out at 54 MB/s and not at over 100 MB/s, which is normal for squids with 1 Gbit connections
        • currently failover to CERN is able to sustain the load
        • recommendation is that KIT deploys a 3rd squid (CNAF did already) and investigates why the performance of the two already deployed is limited
    • Other:
      • NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: because of very low ALICE job efficiency, the equivalent of 900 cores is unused. ALICE: a lot of the data currently being processed is not local; we will look into this
  • KIT:
    • We will deploy an additional Squid
    • Tape library failure last night (now fixed). Writes were rerouted to another library; reads failed or were postponed
  • NDGF: 1h downtime at 6am on Monday (SRM intervention - see GOCDB)
  • NLT1: ntr
  • PIC: ntr
  • RAL: waiting for a reply on GGUS:84307 (LHCb)
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Dashboard: ntr
AOB:

-- JamieShiers - 09-Jul-2012

Topic revision: r11 - 2012-07-20 - MassimoLamanna
 