Week of 120806

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local (AndreaV, Alexandre, Ken, Marc, LucaM, Philippe, Alessandro, Eva); remote (Ulf/NDGF, Kyle/OSG, Michael/BNL, Jhen-Wei/ASGC, John/RAL, Alexander/NLT1, Marc/IN2P3, Dimitri/KIT, Burt/FNAL).

Experiments round table:

  • ATLAS reports -
    • T1s:
      • PIC: Issue exporting T0 data to PIC during the night. GGUS:84833. Solved this morning. From PIC: "We had a huge load on ATLAS pools due to data replication to a new hardware. It was solved more than one hour ago but we'll keep the ticket opened a couple of hours until we are sure that current load won't affect data transfers anymore."

  • CMS reports -
    • LHC machine / CMS detector
      • Not much data this weekend due to a series of unfortunate events.
    • CERN / central services and T0
      • Had to delay prompt reconstruction of some recent runs because of a problem updating databases that is still being investigated.
      • Some files have been inaccessible from Castor, see INC:151642. [LucaM: files were inaccessible from EOS (not Castor) due to two machines down, now fixed.]
    • Tier-1/2:
      • ASGC is in downtime today to fix Castor problems. Currently there are open tickets for HammerCloud (GGUS:84658), SUM tests (GGUS:84632), and MC production (SAV:130787).
    • Other:
      • Daniele Bonacorsi is the next CRC.

  • LHCb reports -
    • Ongoing issues with RAW export to GridKa. Still many files in "Ready" status and a very low transfer rate. The latest updates on the tickets suggest that CERN is seeing SRM timeouts (GGUS:84550, GGUS:84778 (Alarm), GGUS:84670).
      • [Dimitri/KIT: problems ongoing with connectivity to tape, experts are following up. Marc: could these issues also explain the problem with low space on tape cache? Dimitri: not sure, will ask the experts.]
      • [LucaM: will do some debugging from CERN too, is this issue also seen at other sites? Marc: no, this is only Gridka.]
    • T1:
      • GridKa: Very low space left on the GridKa-Tape cache, possibly related to the above. Ticket opened (GGUS:84838).
      • IN2P3: Investigating why pilots aren't using CVMFS, both at IN2P3 and at its Tier-2. This caused some job failures last week.

Sites / Services round table:

  • Ulf/NDGF: ALICE is writing a lot of data and we are having trouble coping with the data rates. Tomorrow the situation may get worse because of a network intervention: we will have only 2/3 of the writing capacity for three hours.
  • Kyle/OSG: ntr
  • Michael/BNL: ntr
  • Jhen-Wei/ASGC: Castor downtime is ongoing, scheduled until 4pm UTC (6pm CEST). [Alessandro: will switch on a few transfers to Taiwan tonight, but will wait for tomorrow morning, after we are sure that everything is back to normal, before switching on T0 exports to Taiwan.]
  • John/RAL: ntr
  • Alexander/NLT1: ntr
  • Marc/IN2P3: ntr
  • Dimitri/KIT: nta
  • Burt/FNAL: ntr

  • Philippe/Grid: ntr
  • Eva/Databases: ntr
  • LucaM/Storage: nta
  • Alexandre/Dashboard: SRM voput tests failing for CMS/ASGC. [Jhen-Wei: due to the ongoing Castor intervention.]

AOB: none

Tuesday

Attendance: local (Andrea, Alexandre, Alessandro, Mark, Eva, Philippe); remote (Kyle/OSG, Ulf/NDGF, John/RAL, Xavier/KIT, Marc/IN2P3, Jhen-Wei/ASGC, Michael/BNL, Burt/FNAL, Ronald/NLT1; Daniele/CMS).

Experiments round table:

  • ATLAS reports -
    • T0/1s:
      • CERN: ALARM ticket for slow LSF response (GGUS:84928): "Since 22:00, LSF takes on average 10 s to answer, with peaks of 3 minutes." The problem disappeared after 2 hours.
        • [Philippe/LSF: the slowdown yesterday evening happened because LSF hit machines that do not exist in LANDB/CDB/DNS; Ulrich is looking into this now. Alessandro: we have been seeing issues with LSF for several weeks now; it would be nice to add some monitoring and checks in LSF to block client IPs that submit too many jobs if they are not recognized as production users (normal users often submit too many jobs and affect production without realizing it); a minimal sketch of such a check is given after this report. Philippe: thanks for this suggestion, it will be passed on to the LSF experts as a temporary solution; however, as discussed two weeks ago, LSF is hitting its limits and we are investigating a possible replacement for LSF as a longer-term solution.]
      • PIC: same issue as yesterday; ticket updated (GGUS:84833).
      • [Alessandro: T0 export to ASGC still disabled, will enable it tomorrow morning.]
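
A minimal sketch of the kind of per-client submission check suggested above, assuming a hypothetical whitelist of recognized production accounts and an illustrative job-rate threshold (none of the account names, thresholds or identifiers come from LSF itself):

<verbatim>
# Hypothetical sketch of the per-client submission throttle suggested above.
# This is NOT LSF code: the client identifiers, the threshold and the
# "production users" whitelist are illustrative assumptions.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60          # sliding window over which submissions are counted
MAX_JOBS_PER_WINDOW = 500    # illustrative threshold for non-production clients
PRODUCTION_USERS = {"atlprod", "cmsprod"}   # hypothetical whitelisted accounts

_recent = defaultdict(deque)   # client id -> timestamps of recent submissions

def allow_submission(client_id, now=None):
    """Return True if a submission from this client should be accepted."""
    if client_id in PRODUCTION_USERS:
        return True              # recognized production users are never throttled
    now = time.time() if now is None else now
    window = _recent[client_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()         # drop timestamps that fell out of the window
    if len(window) >= MAX_JOBS_PER_WINDOW:
        return False             # too many recent submissions: flag/block this client
    window.append(now)
    return True

if __name__ == "__main__":
    # quick check: within one minute, the 501st submission from an
    # unrecognized client is rejected
    verdicts = [allow_submission("some_user", now=100.0 + 0.01 * i) for i in range(501)]
    print(verdicts.count(True), "accepted,", verdicts.count(False), "rejected")
</verbatim>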

  • CMS reports -
    • LHC / CMS
      • pp physics collisions
    • CERN / central services and T0
      • T0 Ops reported problems accessing 2 files from EOS. Initial investigation showed that they had only one copy, located on a faulty disk server (waiting for a vendor call). The files were quickly made available again, thanks (INC:152249).
        • We need a stable solution here: losing files in this way impacts T0 Ops. Apart from this specific case (a single replica should never happen), can we increase the replication factor from 2 to 3? (See the sketch after this report.)
          • [Alessandro: you can probably increase the replication factor from 2 to 3 yourself, but you will then be charged a factor of 1.5 more in disk space. Daniele: thanks, yes we could probably do it ourselves, but we want to discuss and negotiate this with the EOS team first.]
      • T0_CH_CERN: HC errors from a late-July ticket; now green, but no ticket update yet (Savannah:130709 bridged to GGUS:84617, HC 2nd-line support).
    • Tier-1:
    • Other:
      • MC prod and analysis smooth on T2 sites
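
Following up on the EOS replication discussion above: going from 2 to 3 replicas costs 3/2 = 1.5 times the disk space. A minimal sketch of how such a change might be applied per directory via the eos CLI, assuming the sys.forced.nstripes attribute controls the replica count of a replica layout (the attribute name and the path are assumptions to be confirmed with the EOS team):

<verbatim>
# Hedged sketch: raise the replica count on an EOS directory from 2 to 3.
# Assumes the "eos attr" CLI and the sys.forced.nstripes attribute control
# the replica count for a replica layout; the path is purely illustrative.
import subprocess

EOS_DIR = "/eos/cms/store/t0streamer"   # hypothetical directory, for illustration only

def show_layout_attrs(path):
    """Print the extended attributes currently set on the directory."""
    subprocess.run(["eos", "attr", "ls", path], check=True)

def set_replica_count(path, n_replicas):
    """Request n_replicas copies for new files written under 'path'."""
    subprocess.run(
        ["eos", "attr", "set", "sys.forced.nstripes={}".format(n_replicas), path],
        check=True,
    )

if __name__ == "__main__":
    show_layout_attrs(EOS_DIR)
    set_replica_count(EOS_DIR, 3)   # 3 copies instead of 2 => 1.5x the disk space
</verbatim>

Such a directory attribute would presumably only affect newly written files; existing files would keep their current replica count unless explicitly re-replicated.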

  • ALICE reports -
    • NDGF: experts at CERN and NDGF are looking into the unexpectedly high write rates reported yesterday and discovered a few issues that are being followed up.
      • [Ulf/NDGF: we are working on this and have found the probable cause. We realize, however, that there has been a communication problem with ALICE: it is unfortunate that ALICE had already seen this type of error but never reported it to us. The ALICE system retries writing tapes over and over when there are problems, so our tapes are now full of files that are unusable or unnecessary. We are in contact with ALICE to improve this situation.]

  • LHCb reports -
    • T1:
      • Ongoing issues with RAW export to GridKa. Still many files in "Ready" status and a very low transfer rate. GridKa still investigating (GGUS:84550, GGUS:84778 (Alarm), GGUS:84670).
      • GridKa have added 40 TB of front-end disk to the tape system to allow transfers to come through (GGUS:84838). [Xavier/KIT: we actually added 80 TB, not 40 TB. We cannot really tell, however, what improved the situation: we experimented with many changes and one of them worked, but we are still trying to understand which one was relevant. We will leave the setup as it is now, so that LHCb can recover from the backlog.]
      • IN2P3: Re: pilots not using CVMFS - there was confusion because CVMFS is mounted under /opt at IN2P3 rather than under /cvmfs. The problem turned out to be an LHCb configuration issue (a sketch of this kind of mount-point check follows this report).
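
A trivial sketch of the kind of mount-point probe that can catch this sort of configuration confusion, trying the standard /cvmfs location first and then the /opt layout used at IN2P3 (the repository name lhcb.cern.ch is the standard one; the candidate roots and fallback behaviour are illustrative assumptions):

<verbatim>
# Minimal sketch: locate the LHCb CVMFS repository when the mount point
# differs between sites (/cvmfs at most sites, /opt at IN2P3 as noted above).
# Candidate roots beyond those two, and the fallback message, are assumptions.
import os

CANDIDATE_ROOTS = ["/cvmfs", "/opt"]   # standard mount point first, IN2P3 layout second
REPOSITORY = "lhcb.cern.ch"

def find_cvmfs_repo(repo=REPOSITORY, roots=CANDIDATE_ROOTS):
    """Return the first existing mount path for the repository, or None."""
    for root in roots:
        path = os.path.join(root, repo)
        if os.path.isdir(path):
            return path
    return None

if __name__ == "__main__":
    path = find_cvmfs_repo()
    if path is None:
        print("CVMFS repository not found; pilot should fall back to the shared software area")
    else:
        print("Using CVMFS software area at", path)
</verbatim>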

Sites / Services round table:

  • Kyle/OSG: ntr
  • Ulf/NDGF:
    • mainly working on the ALICE issue
    • downtime at 15h UTC today (OPN to Finland down)
    • one dCache upgrade in Norway tomorrow
  • John/RAL: ntr
  • Xavier/KIT: nta
  • Marc/IN2P3: nta
  • Jhen-Wei/ASGC:
    • a downtime was necessary this morning at 10:30 to complete the Castor intervention started yesterday. The database was moved to new storage, Castor was upgraded to 1.11-9, and xrootd was switched on. Transfers have looked stable since this morning, so we are back in production.
  • Michael/BNL: ntr
  • Burt/FNAL: ntr
  • Ronald/NLT1: ntr

  • Philippe/Grid: BDII was restarted this morning to fix issues with BNL and FNAL
  • Alexandre/Dashboard: ntr
  • Eva/Databases: ntr

AOB:

  • Maintenance work on the KIT internet connection (originally planned for August 20) is postponed to September 17, same time slot (see also GOCDB:115602).

Wednesday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 02-Jul-2012
