Week of 120730

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation, Sharepoint site - Cooldown Status - News
LHC Machine Information


Monday

Attendance: local(Jamie, Alexandre, Javier, Peter, Eddie, Stefan, Eva, Steve);remote(Michael, Thierry, Gonzalo, Jhen-Wei, Paolo, Kyle, Tiju, Pavel, Onno, Marc).

Experiments round table:

  • ATLAS reports -
    • Taiwan SRM DB problems over weekend (and last week), any comment? GGUS:84589 Last 48hr DDM plot
    • Deletion errors at BU_ATLAS (T2) last week due to SRM congestion; ATLAS reduced the deletion rate for this site: GGUS:84189
    • Frontier problem seen by CMS at Tier-0 due to bad batch node config; ATLAS sees the fail-over being used. Frontier monitor

  • CMS reports -
    • LHC machine / CMS detector
      • physics data taking, until the machine stop due to the French Prime Minister's visit!
    • CERN / central services and T0
      • Issue with batch nodes on public queues at CERN not reserving enough memory; hence CMS Tier-0 stopped the job spill-over onto public queues last Friday. CMS suspects a misconfiguration, since the machines concerned have a ulimit -v (virtual memory) of only 1.5 GB, while such CPUs should have ~3 GB available per slot (see the sketch at the end of this report). Could CERN Batch Support please respond to INC:150045 and tell CMS what to do?
    • Tier-1/2:
      • ASGC (T1): SAM SUM CE errors during the night, probably related to a CASTOR DB error (see GGUS:84632). Now recovered; however, HammerCloud efficiency is now low (no new ticket yet). Are these problems related?
      • PIC (T1): high HammerCloud error rates over the weekend; the local admin restarted the PBS services today, and rates now seem fine again (covered in GGUS:84633, GGUS:84623 and GGUS:84622)
      • CCIN2P3 (T2): LFN-to-PFN mismatch issue (GGUS:84522): still no solution, hence CMS is not submitting any jobs to the site. Marc: sorry, the local T2 contact is away; the ticket has just been updated (David Bouvet).
    • Other:
      • Next CMS CRC-on-Duty, reporting here from tomorrow on: Ken Bloom
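
      The memory misconfiguration suspected above can be cross-checked on a worker node itself. The snippet below is an illustrative sketch only (not part of the minutes, and not necessarily how the batch team will diagnose it): a job prints the per-process virtual-memory limit it actually sees (the same quantity as "ulimit -v"), to compare against the ~3 GB per slot that CMS expects on these machines.

        # Illustrative sketch (assumption: plain Python available on the worker node).
        # Print the per-process virtual memory limit a batch job actually sees,
        # i.e. the "ulimit -v" value, to compare with the ~3 GB/slot CMS expects.
        import resource

        def fmt(limit):
            if limit == resource.RLIM_INFINITY:
                return "unlimited"
            return "%.1f GB" % (limit / 1024.0 ** 3)

        soft, hard = resource.getrlimit(resource.RLIMIT_AS)  # address-space limit
        print("virtual memory limit: soft=%s hard=%s" % (fmt(soft), fmt(hard)))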

  • ALICE reports -
    • NTR (at least until 13:00)

  • LHCb reports -
    • Validation productions for new versions of LHCb application software.
    • T0: ntr
    • T1:
      • PIC: SE under load and very slow streaming of data to user jobs. These (user) jobs end up with very low efficiency and get killed by the DIRAC watchdog. User jobs are now limited to 150 at PIC. Ongoing consultation with the LHCb contact at PIC.
Sites / Services round table:

  • BNL - ntr
  • NDGF - had some pools go down over the w/e at IGF / Slovenia. Fixed; h/w problem, but the pools are not online yet, warning downtime on the portal. Will update the end of the downtime when we get new info.
  • PIC - ntr; acknowledge CMS issues: a partition filled up Fri/Sat. Space freed and most problems fixed; service restarted this morning
  • ASGC - CASTOR: 2 issues: 1) yesterday our DB had a h/w problem - now solved; 2) observed last week that some CMS transfers fail from time to time - not enough threads for the daemon; our CASTOR manager is investigating and will report if there is any news
  • CNAF - ntr
  • KIT - ntr
  • IN2P3 - ntr
  • NL-T1 - follow-up on the pool node issue reported last week. On Friday evening the broken storage controller was replaced, which fixed the issue. At the same time another controller broke, taking two other pool nodes down. Only one spare was on site and it had already been used, hence 2 nodes were down over the w/e. The 2nd controller was replaced this afternoon, so now all is OK
  • RAL - ntr
  • OSG - ntr

  • CERN - ATLAS migration of CVMFS repositories still ongoing. Transparent for sites
AOB:

Tuesday

Attendance: local(Andrea, Ken, Simone, Stefan, Xavier, Alex, Eddie, Eva); remote(Xavier, Stefano, Kyle, Michael, Jhen-Wei, Marc, Ron, Thierry, Tiju, Jeremy).

Experiments round table:

  • ATLAS reports -
    • T0
      • One file inaccessible in CASTOR, needed by HLT reprocessing. GGUS:84642. The file was on a disabled disk server. SRMBringOnline reported the file as "online" while srmLs reported it as NEARLINE (see the sketch at the end of this report). Fixed by the CASTOR team by re-staging the file.
    • Central Services
      • CVMFS migration took longer than expected but was successful per se. An issue was introduced as a side effect: the new DDM clients were prematurely introduced and broke many analysis queues. The DDM clients have been rolled back.
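
      As an aside (illustrative only, not from the minutes), this kind of locality mismatch can be cross-checked by listing the SURL directly. The sketch below assumes the lcg-utils client and a valid grid proxy are available on the node; the exact flags and output format depend on the installed lcg_util version, and the SURL shown is a made-up placeholder, not the file from GGUS:84642.

        # Illustrative sketch: query the SRM long listing for a CASTOR SURL and
        # print it, so the reported file locality (ONLINE / NEARLINE) can be
        # compared with what srmBringOnline claims.
        import subprocess

        # Placeholder SURL for illustration only (not the real file in the ticket).
        surl = ("srm://srm-atlas.cern.ch/castor/cern.ch/grid/atlas/"
                "path/to/some/file")

        # -l: long listing (includes locality); -b -D srmv2: skip the BDII lookup.
        out = subprocess.check_output(["lcg-ls", "-l", "-b", "-D", "srmv2", surl])
        for line in out.decode().splitlines():
            print(line)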

  • CMS reports -
    • LHC machine / CMS detector
      • physics data taking, have requested that LHC provide 100 Hz of Higgs production.
    • CERN / central services and T0
      • Issue with batch nodes on public queues at CERN not reserving enough memory, followed at INC:150045. People are working on it, but we would appreciate a quick resolution, as it is affecting random users, including users who land on these queues from the grid.
      • Also following up on issues with propagation of downtime information; dashboard team has given explanations for some of the observed behavior.
      • GGUS:84618, about HammerCloud errors at CERN, is unresolved after four days??
    • Tier-1/2:
    • Other:
      • NTR

  • LHCb reports -
    • Validation productions for new versions of LHCb application software to be launched tomorrow
    • New GGUS (or RT) tickets
    • T0:
      • Application crashes on certain CERN batch node types (GGUS:84672)
      • Castor default pool under stress, DMS group contacted to get more information about the current activity on the pool.
    • T1:
      • RAL: many pilots stuck in state "REALLY-RUNNING", was already observed before and site admins have procedure to cleanup (GGUS:84671)
      • CNAF: FTS transfers from RAL not working because the SRM endpoint cannot be contacted from CNAF (site contacts informed)
      • GRIDKA: RAW export to GridKa: many files in "Ready" status; executed jobs succeed with an efficiency of 55% (GGUS:84550)
      • PIC: SE under load and very slow streaming of data to user jobs. These (user) jobs end up with very low efficiency and get killed by the DIRAC watchdog. User jobs are now limited to 150 at PIC. Ongoing consultation with the LHCb contact at PIC.
    • Other:
Sites / Services round table:
  • ASGC - ntr
  • BNL - ntr
  • CNAF - found the cause of the FTS transfer problems from RAL: it was a misconfiguration. Being fixed
  • IN2P3 - ntr
  • KIT - intervention on Oracle performed to apply some patches. The LHCb LFC crashed at 09:45 but recovered, and the intervention completed successfully. Still need to find out why the Streams replication did not start for ATLAS, though. A 3rd squid server has been set up for CMS; the CMS contact was informed. This Thursday from 10:00 to 12:00 two disk controllers need to be replaced; for that the entire rack has to be switched off and six LHCb disk-only pools will be down. Does a downtime need to be declared? Stefan: no, it is not necessary.
  • PIC - ntr
  • NDGF - ntr
  • NL-T1 - successfully upgraded the CREAM CEs at NIKHEF
  • RAL - ntr
  • OSG - ntr
  • GridPP - ntr

  • CERN
    • Storage - investigating rfcp reads timing out on some ATLAS jobs (about 1 in 1000 jobs affected). Some CMS files have been lost due to problems in a disk controller; the CMS computing operations team has been informed
    • Databases - ntr
    • Dashboards - the FTS server at FNAL is not sending out transfer information
AOB:

Wednesday

Attendance: local(Jamie, Alexandre, Kasia, Ken, Stefan, Simone);remote(Michael, Stefano, Alexander, Dimitri, Thierry, Jhen-Wei, Tiju, Marc, Gonzalo, Kyle).

Experiments round table:

  • ATLAS reports -
  • T0
    • NTR
  • T1s
    • Writing T0 data into TW tape has shown 25% inefficiency since this morning. Taiwan needs some time for an intervention to fully resolve the problem, which is complicated by a typhoon expected in the next days; therefore the service will be back to standard operations only on Monday. T0 subscriptions to TW DATATAPE have been preventively stopped and will be resumed next week.
  • Central Services


  • CMS reports -
  • LHC machine / CMS detector
    • physics data taking, after some studies this morning
  • CERN / central services and T0
    • INC:150045 -- memory on public batch queues appears to be resolved, but tests ongoing to check
    • GGUS:84618, about HammerCloud errors at CERN, is unresolved after five days?? [ Being checked by HC 2nd line support ]
  • Tier-1/2:
    • ASGC is having Castor problems, and has suggested having a downtime next week (after the typhoon). Currently have open tickets for Hammercloud (GGUS:84658), SUM test (GGUS:84632), and MC production (SAV:130787).
    • I went and closed GGUS:84576, as the problem seems to have gone away.
    • GGUS:84522 needs attention -- confusing separation of storage between T1 and T2 at IN2P3 is preventing MC production at the site.
    • In general we have too many Savannah tickets open, with sites not responding to them within 24 hours as they should. I tickled a bunch of them to see what might happen.
  • Other:
    • Still trying to figure out if today is a real holiday or not.
    • FNAL thinks it is (now?) sending dashboard info


  • LHCb reports -
  • T0:
    • Application crashes on certain CERN batch node types (GGUS:84672)
    • Castor default pool under stress, user has been contacted.
  • T1:
    • RAL: many pilots stuck in state "REALLY-RUNNING", have been cleaned (GGUS:84671)
    • CNAF: FTS transfers from RAL were not working because the SRM endpoint could not be contacted from CNAF; fixed, it was a misconfiguration (site contacts informed)
    • GRIDKA: RAW export to GridKa: many files in "Ready" status; executed jobs succeed with an efficiency of 55% (GGUS:84550)
    • PIC: SE under load and very slow streaming of data to user jobs. These (user) jobs end up with very low efficiency and get killed by the DIRAC watchdog. The situation has improved, probably due to the limiting of user jobs.

Sites / Services round table:

  • BNL - ntr
  • CNAF -
  • NL-T1 - ntr
  • KIT - ntr
  • NDGF - an ATLAS pool machine with 18 TB of data has h/w problems; migrating data out of it. 7 files lost.
  • ASGC - to fully recover CASTOR we need to do something on the DB. To avoid unexpected problems, like a power cut during the typhoon days, we will have a downtime on Monday. Tomorrow there will be no people on site for security reasons
  • RAL - ntr
  • IN2P3 - ntr
  • PIC - ntr
  • OSG - ntr

  • CERN DB - intervention tomorrow on the public stager and SRM DB at 14:00, for 1.5 hours
AOB:

  • Has the issue with the (CERN) downtime depending on CASTOR public finally been fixed?

Thursday

Attendance: local(Alexandre, Jamie, Simone, Stefan, Ken, Javier);remote(Stefano, Kyle, Thierry, Jhen-Wei, John, Ronald, Marc).

Experiments round table:

  • CMS reports -
  • LHC machine / CMS detector
    • physics data taking; great uptime yesterday :), then a cooling failure today :(
  • CERN / central services and T0
    • GGUS:84618 -- Jamie was right, now it's closed.
    • CMS was doing scale testing of DBS3 on a CERN server yesterday, resulting in high load. We were asked whether it could be rebooted; we said no, because of the tests.
    • Late breaking: three files lost due to a CASTOR disk server controller problem. Investigating to see whether they can be recovered; these are files of real reconstructed data. [ Javier - recovered files on Tuesday! ]
  • Tier-1/2:
    • ASGC is having Castor problems, and has suggested having a downtime next week (after the typhoon). Currently have open tickets for Hammercloud (GGUS:84658), SUM test (GGUS:84632), and MC production (SAV:130787).
    • In general we have too many Savannah tickets open, with sites not responding to them within 24 hours as they should. I tickled a bunch of them to see what might happen.
  • Other:
    • Turns out the Swiss like explosives as much as Americans do.



  • LHCb reports -
  • T0:
    • Application crashes on certain CERN batch node types (GGUS:84672)
    • RAW export to GridKa: many files in "Ready" status; executed jobs succeed with an efficiency of 55% (GGUS:84550). The ticket priority has been raised; the situation has been unchanged for 1 week. In addition, timeouts for the GridKa SRM are observed in SAM/Nagios tests


Sites / Services round table:

  • CNAF - we had to close a fraction of our WNs in order to upgrade to EMI. Temporary reduction in the number of job slots
  • NDGF - during the night an ATLAS tape pool was down in Sweden; the problem was fixed during the morning. The primary OPN connection to CERN is down - no more info for the moment.
  • ASGC - ntr
  • NL-T1 - ntr
  • RAL - ntr
  • IN2P3 - ntr
  • KIT - ntr; new CMS squid up and running. Q: intervention on LHCb disk servers - is it over? A: no news, will contact offline
  • FNAL - the FTS monitoring reported as not working at FNAL looks OK from our point of view.
  • OSG - ntr

  • CERN storage: for ALICE, an intervention is planned next week on the ALICE DB. Need an OK from ALICE online (have the OK from Latchezar for offline)

AOB:

Friday

Attendance: local (AndreaV, Ken, Simone, Steve, Gavin, Stefan, Alexander, Xavi, Eva); remote (Stefano/CNAF, Kyle/OSG, Alexandre/NLT1, Xavier/KIT, John/RAL, Thierry/NDGF, Burt/FNAL, Jhen-Wei/ASGC, Marc/IN2P3, Gonzalo/PIC).

Experiments round table:

  • ATLAS reports -
    • Almost NTR: the INFN-NAPOLI calibration T2 started failing to import calibration data from T0 at 5 AM; GGUS:84777; fixed in the early morning.

  • CMS reports -
    • LHC machine / CMS detector
      • Yesterday's cooling failure was "fortunately" coincident with machine problems. But the cooling failure was fairly serious; it took 12 hours to recover.
    • CERN / central services and T0
      • Assorted stuff with VO boxes, none of which seems to be critical: cmsweb analytics weighing down vocms100, failed backup on vocms111, need to renew host certificate on vocms163. All recovered.
    • Tier-1/2:
      • ASGC is having Castor problems, and has suggested having a downtime next week (after the typhoon). Currently have open tickets for Hammercloud (GGUS:84658), SUM test (GGUS:84632), and MC production (SAV:130787).
      • In general we have too many Savannah tickets open (mostly to T2's), with sites not responding to them within 24 hours as they should. I tickled a bunch of them to see what might happen. Still not getting a whole lot of response.
    • Other:
      • nix!

  • LHCb reports -
    • T0:
      • RAW export to GridKa: many files in "Ready" status; executed jobs succeed with an efficiency of 55% (GGUS:84550, GGUS:84778 (ALARM)). The ticket priority has been raised; the situation has been unchanged for 1 week. In addition, timeouts for the GridKa SRM are observed in SAM/Nagios tests
      • CASTOR files unavailable; they have been recovered by the CASTOR team, but maybe more files are missing (GGUS:84763)
      • [Gavin/FTS: two problems with FTS, many timeouts for CASTOR, and a low bandwidth of 8 MB/s (asked GridKa to investigate).]
    • T1:
      • Gridka: Pilots aborted, fixed this morning (GGUS:84740)

Sites / Services round table:

  • Stefano/CNAF: participation of CNAF in the meeting is not guaranteed until August 19, but the experts on call will continue to answer
  • Kyle/OSG: ntr
  • Alexandre/NLT1: ntr
  • Xavier/KIT:
    • yesterday's disk intervention for LHCb went ok and finished around midday
    • many issues with tapes since yesterday, including a broken card, being followed up
  • John/RAL: ntr
  • Thierry/NDGF: ntr
  • Burt/FNAL: ntr
  • Jhen-Wei/ASGC: intervention next Monday to fix DB issues between 1pm and 4pm UTC
  • Marc/IN2P3: downtime on CE, issues with GridFTP, but should not impact jobs thanks to failover
  • Gonzalo/PIC: ntr

  • Xavi/Storage: ntr
  • Eva/Database: ntr
  • Alexander/Dashboard: issue with messages is now fixed, the problem lasted for a few hours
  • Gavin/Grid: ntr

AOB: none

-- JamieShiers - 09-Jul-2012
