Week of 101004

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Edward, Jean-Philippe, Alexei, Jan, Ricardo, Przemek, Maarten, Julia, Edoardo, Jamie, Maria, MariaD, Nicolo, Simone, Massimo, Dirk);remote(Jon/FNAL, Joel/LHCb, Kyle/OSG, Ron/NLT1, Rolf/IN2P3, Dimitri/KIT, Pepe/CMS, John/RAL, Michael/BNL, Gonzalo/PIC, Gang/ASGC).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
      • CNAF-BNL network problem (slow transfers) GGUS:61440.
    • Oct 4 (Sat, Sun, Mon)
      • LHC/ATLAS activities
        • Data collected but issue with beam background
        • Main CPU activity for MC production
      • T0
        • GGUS ALARM 62662: afs26 slow response (discussed by mail). MariaD: an alarm ticket should not be discussed by mail.
        • GGUS ALARM 62688 : File copy not correctly reported to Castor (solved within hours). Jan: the problem is understood (putdone missing), file status set manually.
        • GGUS 62701/62705 : File not accessible in CERN-PROD_DATADISK (Related to 62688)
        • Overload in accessing DB releases at CERN. A HOTDISK space token will be deployed at CERN.
      • T1-T1 network issue
        • CNAF-BNL : Waiting for feedback from network experts. Michael: intense investigation of LHCnet and ESnet: the path between BNL and Vienna is OK (no packet loss). It is now up to GARR to investigate the rest of the path (a minimal path-check sketch follows this report).
      • T1 :
        • IN2P3-CC : GGUS 62500 : Seems to work much better since this morning, but problem not understood.
        • RAL : GGUS 62714 : Missing libaio
      • SAM tests pointed to a (network?) issue at IN2P3-CC on Saturday morning
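    • Illustration (not part of the minutes): the CNAF-BNL investigation above proceeds by checking the path segment by segment for packet loss. The minimal Python sketch below does the same kind of check using the standard ping tool; the hop names are hypothetical placeholders, not the real LHCOPN routers.

      #!/usr/bin/env python
      # Minimal packet-loss probe along a network path. Assumes the standard
      # Linux "ping" tool is available; hop names are hypothetical placeholders.
      import re
      import subprocess

      HOPS = ["router1.example.bnl.gov", "router2.example.geant.net", "router3.example.garr.it"]

      def packet_loss(host, count=20):
          """Return the packet-loss percentage reported by ping, or None on error."""
          try:
              out = subprocess.run(["ping", "-c", str(count), host],
                                   capture_output=True, text=True, timeout=120).stdout
          except Exception:
              return None
          m = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
          return float(m.group(1)) if m else None

      if __name__ == "__main__":
          for hop in HOPS:
              loss = packet_loss(hop)
              print("%-35s %s" % (hop, "unreachable" if loss is None else "%.1f%% loss" % loss))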

  • CMS reports -
    • Experiment activity
      • Data taking
    • CERN
      • GGUS ticket 62696: occasional errors when opening files (via xrootd) on the Tier-0 processing pools. Became evident for jobs opening a large number of files. In progress.
      • GGUS ticket 62716: problem using myproxy.cern.ch affecting analysis users. Error message: Server authorization failed. Server identity (/DC=ch/DC=cern/OU=computers/CN=px301.cern.ch) does not match expected identities
    • Tier1 issues
      • Savannah ticket 117117. Slow data distribution from PIC to T2s. Important re-reconstructed data is being distributed to the T2s at about 50 MB/s on average over the last 3 days. Transfer quality is good, but for some reason the transfers are proceeding slowly. PIC should check whether the data is on disk and the FTS channels are configured reasonably.
    • Tier2 Issues
    • MC production
      • ongoing
    • AOB

  • ALICE reports -
    • T0 site
      • CT15493: problems with myproxy.cern.ch while registering new user proxies. GOCDB reported some problems during the weekend with ce202.cern.ch (apparently associated with this issue). However, it appears to have been solved yesterday afternoon.
    • T1 sites
      • NDGF SE: xrdcp issues reported by MonaLisa. Checking the issue with the AliEn experts
    • T2 sites
      • Usual operations, no remarkable issues to report

  • LHCb reports - Analysis: a lot of jobs stuck in our pre-stager (DIRAC). We will ask the T1s for a dump of the LHCb USER space in order to clean up all the orphaned user files. One dump already received.
    • T0
      • jobs failing at CERN due to problems in DIRAC. Fixed this morning.
    • T1 site issues:
      • IN2P3 : jobs failing with segmentation faults (did something go wrong on Sunday at IN2P3? Jobs are running fine now). No comment yet.
      • RAL: CondDB. A few WNs configured with an old version of the middleware to try to understand the problem. In progress.
      • IN2P3: Eddie reports that the SAM availability tests show failure in Lyon (gridFTP doors?). Joel will open a ticket. GGUS:62739.

Sites / Services round table:

  • FNAL: ntr
  • NLT1: network maintenance this morning: short interruptions to storage, but all back now.
  • KIT: ntr
  • RAL:
    • libaio now updated
    • LHCb DB problem: 4 nodes set aside to investigate.
    • tape migration problems for LHCb during the weekend.
  • BNL: ntr
  • PIC: ntr
  • ASGC: ntr
  • OSG: ntr

  • Ricardo/CERN:
    • Myproxy server host certificate was renewed with a wrong DN, causing a 2-hour stop. A SIR has been produced (a small DN-check sketch follows this list).
    • CREAM CEs down this morning because the log partition on the LB server was full.
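    • Illustration (not part of the minutes): a host certificate renewed with a wrong DN, as in the myproxy incident above, can be spotted by inspecting the certificate the service actually presents. The Python sketch below does that; the host and port are examples, and it is an assumption here that the MyProxy port accepts a plain TLS handshake without the protocol's initial exchange — the same check works for any ordinary TLS endpoint.

      #!/usr/bin/env python
      # Print the subject DN of the certificate presented by a TLS service.
      # Host and port are examples; treat this as an illustrative sketch only.
      import socket
      import ssl
      import subprocess

      HOST, PORT = "myproxy.cern.ch", 7512   # 7512 is the usual MyProxy port

      ctx = ssl.create_default_context()
      ctx.check_hostname = False             # we only want to look at the certificate
      ctx.verify_mode = ssl.CERT_NONE

      with socket.create_connection((HOST, PORT), timeout=10) as sock:
          with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
              der_cert = tls.getpeercert(binary_form=True)

      # Decode the subject DN with the standard openssl CLI (no extra dependencies).
      out = subprocess.run(["openssl", "x509", "-inform", "DER", "-noout", "-subject"],
                           input=der_cert, capture_output=True)
      print(out.stdout.decode().strip())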
  • Edoardo/CERN: Emergency maintenance of one of the two routers tomorrow morning. Some sites will use the backup link for 30 seconds. See: https://gus.fzk.de/pages/ticket_lhcopn_details.php?ticket=62734
  • Massimo/CERN: upgrade of xrootd CASTOR for CMS, ATLAS and LHCb. Will be at 14:00 tomorrow for CMS and 09:00 on Wednesday for LHCb. Should be transparent but site declared at risk. Should fix problems seen in the past few weeks.
  • Przemek/CERNDB: supporting restore of ATLAS CondDB from RAL to ASGC.

AOB:

Tuesday:

Attendance: local(Jean-Philippe, Flavia, Luca, Jan, Manuel, Alexei, Stephen, Edward, Massimo, Andrea, Roberto, Nicolo, Ignacio);remote(Gonzalo/PIC, Michael/BNL, Dimitri/KIT, Rolf/IN2P3, Jon/FNAL, Joel/LHCb, Ronald/NLT1, Gareth/RAL, Roger/NDGF, Rob/OSG, Luca/CNAF, Farida/ASGC).

Experiments round table:

  • ATLAS reports -
    • Data taking and data distribution are going smoothly to all sites except INFN-T1
    • GGUS Tickets :
      • CNAF
        • GGUS-Ticket 62761 (alarm Oct 5, 11:30am)
        • GGUS-Ticket 62745 (very urgent Oct 4, 22:00)
        • fixed at 11:50am
      • Tier-0
        • GGUS 62733: it looks like an internal ATLAS problem. Ticket is closed
      • Tier1s
        • network transfer problems with NDGF

  • CMS reports -
    • Experiment activity
      • Data taking
    • CERN
      • GGUS ticket 62696: occasional errors when opening files (via xrootd) on the Tier-0 processing pools. Became evident for jobs opening a large number of files. In progress.
    • Tier1 issues
      • Savannah ticket 117117. Slow data distribution from PIC to T2s. It looks like this is due to all T2s sharing the catch-all (star) channel. The configuration will be updated to put good and bad T2s on different channels, so that the limits can be raised for the good sites (a small illustrative sketch of such a split follows this report).
    • Tier2 Issues
    • MC production
      • ongoing
    • AOB
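    • Illustration (not part of the minutes): the channel re-configuration described above amounts to splitting the T2s by their recent transfer performance so that limits can be raised for the well-behaved sites only. The Python sketch below shows that grouping logic; the site names, rates and limits are made-up examples, not actual FTS settings or commands.

      #!/usr/bin/env python
      # Group T2s into a "good" channel and a conservative catch-all channel
      # based on their recent average transfer rate. All figures are examples.
      recent_rate_mbs = {
          "T2_EXAMPLE_A": 45.0,
          "T2_EXAMPLE_B": 3.5,
          "T2_EXAMPLE_C": 60.0,
          "T2_EXAMPLE_D": 1.2,
      }

      GOOD_THRESHOLD_MBS = 20.0            # sites above this are considered "good"
      FILES_GOOD, FILES_DEFAULT = 50, 10   # concurrent-file limits per channel (examples)

      good = sorted(s for s, r in recent_rate_mbs.items() if r >= GOOD_THRESHOLD_MBS)
      slow = sorted(s for s, r in recent_rate_mbs.items() if r < GOOD_THRESHOLD_MBS)

      print("dedicated channel (limit %d files): %s" % (FILES_GOOD, ", ".join(good)))
      print("catch-all channel (limit %d files): %s" % (FILES_DEFAULT, ", ".join(slow)))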

  • ALICE reports -
    • T0 site
      • Lack of synchronization between the read/write and read-only AFS volumes at CERN (even when forcing the synchronization manually). Rainer contacted, waiting for an answer. The CREAM-VOBOX is out of production because of this issue
    • T1 sites
      • No issues to report
    • T2 sites
      • Common operations, nothing remarkable

  • LHCb reports -
    • Issues at the sites and services
      • T0
        • xroot intervention agreed for tomorrow
      • T1 site issues:
        • RAL : LHCb files "NONE" at RAL SRM (GGUS:62744)
        • RAL : CondDB : part of the problem is understood (see GGUS:62667). The problem is in the LHCb software and is being fixed. Can batch be reopened? Wait for the LHCb OK.
        • IN2P3 : SAM tests failing against LHCb_USER at IN2P3 (GGUS:62739)
        • GRIDKA : All jobs aborted at cream-2-fzk.gridka.de cream-3-fzk.gridka.de (GGUS:62758)
        • GRIDKA : Dump of the LHCb USER space token (GGUS:62763)
        • SARA : Most pilots aborted creamce2.gina.sara.nl creamce.gina.sara.nl (GGUS:62759)
        • SARA : LHCb waiting for the dump of the LHCb user space token.

Sites / Services round table:

  • PIC: ?
  • BNL: Michael reports that there is a lot of activity going on to understand the BNL-CNAF slow transfer rates. Tests continue. The path Amsterdam-Vienna shows some packet loss. The path Vienna-Milano seems to be OK. Still to be done: test the path Milano-CNAF and better understand the packet loss between Amsterdam and Vienna.
  • KIT: LHCb tickets updated. CREAM CE fixed.
  • IN2P3: ntr
  • FNAL: ntr
  • NLT1: ntr
  • RAL: ntr
  • NDGF:
    • downtime tomorrow evening for network testing by the ISP.
    • the SAM test certificate was revoked and a new one has to be installed. This does not affect the NDGF service, only the SAM figures.
  • CNAF: a StoRM problem for ATLAS started yesterday night. Because of a bug, the backend was taking all the CPU. As it was a team ticket and not an alarm ticket, the problem was not noticed immediately.
  • OSG: ntr
  • ASGC: fighting with 3D recovery. Slow network performance between RAL and ASGC does not help.

  • Jan/CERN:
    • missing putdone problem: still investigating the status of some ATLAS files
    • ATLASHOTDISK space token now available. Waiting for feedback.
    • ATLAS and LHCb xroot upgrade tomorrow.
  • Flavia/CERN: recommended version of Frontier (3.24) being installed for ATLAS and could be used by CMS later.
  • Ignacio/CERN: the "missing putdone" problem reported above was due to an ldap problem on a new SLC5 head node. A SIR is being produced.
  • Manuel/CERN:
    • ce201 back in production
    • SLS WMS problem being investigated

AOB: (MariaDZ) NDGF, please reply to the survey on ALARM handling at each Tier1, as per https://savannah.cern.ch/support/?116430

Wednesday

Attendance: local(Alexei, Jean-Philippe, Serguei, Edward, Manuel, Luca, Stephen, Lola, MariaD, Carlos, Roberto);remote(Michael/BNL, Jon/FNAL, Gonzalo/PIC, Roger/NDGF, Onno/NLT1, John/RAL, Gang/ASGC, Xavier/KIT, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • Bulk of errors generated by group data transfers from WEIZMANN to FZK
      • The issue is known; according to C. Serfon it is related to the low network performance at WEIZMANN
    • atladcops was rebooted and the SLS service was not started automatically. Restart done; small issue with a lock file.
    • General failure while installing a TopPhys cache, caused by a typo in the installation definition, which caused the AtlasLogin requirements to break at all sites for release 15.6.12.
    • DDM build system is down.
      • The machines hosting the DDM build system are down after a reboot; it looks like a hardware problem. The repository was back at ~12:00. These machines are important for ATLAS and should get the necessary level of support.

  • CMS reports -
    • Experiment activity
      • Data taking
    • CERN
      • GGUS:62696 occasional errors when opening files (via xrootd) on the Tier-0 processing pools. Became evident for jobs opening a large number of files. Valid service classes are being reported as invalid. In progress.
    • Tier1 issues
      • Savannah ticket 117117. Slow data distribution from PIC to T2s, apparently due to all T2s sharing the catch-all (star) channel. The configuration is being updated to put good and bad T2s on different channels so that the limits can be raised for the good sites. Looks much better since the change.
      • GGUS:62807 SAM Tests failing at CNAF
    • Tier2 Issues
      • situation improved
    • MC production
      • ongoing
    • AOB

  • ALICE reports -
    • Production status: Pass1 reconstruction, MC cycles and several analysis jobs ongoing
    • T0 site
      • AFS issue solved this morning (lack of synchronization between R/W and R/O partitions).
    • T1 sites
      • Nothing to report
    • T2 sites

  • LHCb reports -
    • Issues at the sites and services
      • T0
        • none
      • T1 site issues:

Sites / Services round table:

  • BNL: news on the BNL-CNAF transfer problems: still a lot of testing activity by GÉANT and USLHCnet. The problem is almost certainly in the LHCnet-GÉANT connection. More news later today or tomorrow.
  • FNAL: ntr
  • PIC: ntr
  • NDGF: ntr
  • NLT1: ntr
  • RAL: ntr
  • ASGC: ntr
  • KIT: ntr
  • OSG: one BDII server will be down on Tuesday 12th October at 09:00, because of a short maintenance (15 minutes): the machine will be moved from one rack to another.
  • CNAF by mail: As stated in the GOCDB ticket, a downtime for the ATLAS SRM endpoint is scheduled at CNAF tomorrow: we have closed the queues on the CEs and will stop and unmount the T0D1 filesystem late tonight.

  • Luca/CERN: ntr
  • Manuel/CERN:ntr
  • Carlos/CERN:ntr

AOB:

Thursday

Attendance: local(Manuel, Jean-Philippe, Edward, Roberto, Marcin, Maria, Simone, Massimo, Alessandro, Alexei, Lola, Andrea);remote(Elizabeth/OSG, Jon/FNAL, Gonzalo/PIC, Michael/BNL, Rolf/IN2P3, Stephen/CMS, John/RAL, Foued/KIT, Patrick/dCache, Jeff/NLT1, Roger/NDGF, Gang/ASGC, Luca/CNAF).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
      • CNAF-BNL network problem (slow transfers) GGUS:61440.

    • CNAF downtime (was it announced? see CNAF report below)
    • SW installation problem solved
    • 9:40am warning : t0atlas pool I/O capacity exceeded (I/O max in/out = 4 GB/s)
    • ATLASHOTDISK space token deployed and tested. Can be used.

  • CMS reports -
    • Experiment activity
      • Data taking
    • CERN
      • GGUS:62696 occasional errors when opening files (via xrootd) on the Tier-0 processing pools. Became evident for jobs opening a large number of files. Valid service classes are being reported as invalid. In progress.
    • Tier1 issues
      • GGUS:62807 SAM Tests failing at CNAF, now in unscheduled downtime due to storage problems.
      • Savannah:117190: IN2P3 transfer quality was low. The CMS_DEFAULT space was not big enough for the incoming data; should be fixed now.
    • Tier2 Issues
    • MC production
      • ongoing
    • AOB

  • ALICE reports -
    • Production status: High activity, 31K concurrent jobs running. Pass1 reconstruction, MC cycles and several analysis jobs ongoing
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites

  • LHCb reports -
    • Experiment activities: general remark concerning SAM availability at the T1s:
      • IN2P3: DIRAC unit test failing systematically due to a connectivity problem from lxplus. The clients were moved to a dedicated LHCb machine. It seems fixed
      • PIC, CNAF and NIKHEF: DIRAC unit test failing because the space was full. A preamble was added to this test (it was already in the gfal-based unit test) to take this into account. Fixed
      • NIKHEF, PIC and RAL: a test (CondDB-conn) checking the connectivity from a site WN against all 7 instances of CondDB was failing because of an issue with SARA. In fact this test should not have been in the "LHCb Critical Availability" list of tests, but rather in the "LHCb Desired Availability" list in the dashboard. Fixed.
    • Issues at the sites and services
      • T0
        • sam201 and sam202 (used to publish SAM test results) became unresponsive (overloaded) and many test results had not been published for a long time. Moved to a dedicated machine (samnag003).
      • T1 site issues:
        • RAL : FTS transfers to the RAL SRM are reported as Finished, yet the destination files have zero size (GGUS:62829)
      • Edward: CondDB-conn test still configured as "critical". Fixed now and site availability will be corrected.

Sites / Services round table:

  • FNAL: ntr
  • PIC: ntr
  • BNL: ntr for T1. Update for CNAF-BNL problem: The path Milano-CNAF has been successfully tested (no packet loss). Still testing the path between US and Vienna
  • IN2P3: ntr
  • RAL:
    • CASTOR problem for LHCb still being investigated by the CASTOR team
    • problem with one CE
  • KIT: downtime for disk-only storage because of a firmware upgrade. Should be OK at 18:00. Communicated to ATLAS.
  • dCache: ntr
  • NLT1:ntr
  • NDGF: ntr
  • ASGC: ntr
  • CNAF:
    • scheduled downtime for ATLAS (firmware upgrade). Short notice, but it was the only possible time slot. Agreed with ATLAS
    • CMS: maintenance on GPFS yesterday. The maintenance has not yet completed successfully; waiting for IBM. Site in unscheduled downtime.
  • OSG: the correct link for the OSG Operations machine room maintenance is http://osggoc.blogspot.com/2010/10/goc-machine-room-maintenance-tuesday.html

  • Massimo/CERN: at risk next Tuesday: CASTOR upgrade for the public stager
  • Marcin: not able to start replication for ASGC because the initial copy has not been completed (due to slow network)

AOB:

Friday

Attendance: local(Alexei, Edward, Jean-Philippe, Roberto, Harry, Stephen, Maarten, Lola, Carlos, Alessandro, Manuel, Maria, Dirk, Jamie, Massimo);remote(Michael/BNL, Jon/FNAL, Xavier/KIT, Gonzalo/PIC, John/RAL, Rolf/IN2P3, Gang/ASGC, Onno/NLT1, Alessandro/CNAF, Jeremy/GridPP, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
      • CNAF-BNL network problem (slow transfers) GGUS:61440.

    • CNAF downtime
    • IN2P3-CC
      • SRMs of all sites went down at the same time, at about 22:00 UTC on 10/07
      • 8:01 CEST : escalated to alarm ticket
      • 8:31 CEST : fixed
        • DB servers couldn't be accessed because of networking problem
    • Autumn2010 reprocessing
      • express stream reprocessing will be started later today
    • Calibration stream replication to Napoli: problem solved (network), can restart replication (probably on Monday)

  • CMS reports -
    • Experiment activity
      • Data taking
    • CERN
      • GGUS:62696 occasional errors when opening files (via xrootd) on the Tier-0 processing pools. Became evident for jobs opening a large number of files. Valid service classes are being reported as invalid. In progress.
    • Tier1 issues
      • GGUS:62807 SAM Tests failing at CNAF, still in unscheduled downtime due to storage problems.
      • Savannah:117216, problems with rereco at T1_TW_ASGC, merging jobs failing. Gang: files seem to have been created a couple of weeks ago and deleted (garbage collector?); investigating.
    • Tier2 Issues
    • MC production
      • ongoing
    • AOB

  • ALICE reports -
    • Production status: Lots of activity. Pass1 reconstruction, MC cycles and several analysis jobs ongoing
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites

  • LHCb reports -
    • Experiment activities: Mostly User jobs. Reconstruction jobs delayed due to a bug in the new release of DIRAC.
    • Issues at the sites and services
      • T0
        • CE: they seem to be publishing the architecture of the cluster incorrectly into the BDII (RT717022)
      • T1 site issues:
        • IN2P3 : segmentation faults still present; we would like a post-mortem of the main problem of Sunday (GGUS:62732)
        • RAL: SRM problem (no GGUS ticket yet but Roberto will create one with the necessary information)
        • Eddie: investigating why SARA appears as red. Could be a dashboard issue.

Sites / Services round table:

  • BNL: There will be an intervention on Tuesday to upgrade the name server, SRM server and SRM DB server. The intervention will last 4 hours and the queues will be drained 8 hours beforehand.
  • FNAL: ntr
  • KIT: downtime finished, but apparently the service was restarted before all disk servers were available, so the cache information was lost. No files were lost, but some files are currently unreadable. Will be fixed before the weekend.
  • PIC: ntr
  • RAL:
    • outage 14th October for FTS and LFC (reboot for a kernel patch)
    • LHCb problem: some files with zero size (a few percent)
  • IN2P3:
    • DB problem due to an unexpected reboot of a network switch
    • will follow up on the LHCb segmentation faults
  • ASGC: ntr
  • NLT1:
    • site BDII at Nikhef unstable: needed a reboot
    • SARA Vobox for ALICE: symlink not correct
    • FTS monitor node has hardware problems. Should be fixed today.
  • CNAF:
    • downtime finished for ATLAS
    • downtime still ongoing for CMS. The filesystem is now up, but CNAF wants information from IBM to be sure that the problem will not reoccur. Outage extended until tomorrow (hopefully).
  • GridPP: ntr
  • OSG: ntr

  • Massimo/CERN: on 1 November, 09:00-13:00, there will be 3 short power interruptions in the Computer Centre. The UPS will be available and the intervention should be transparent. The date was chosen because it is just before the Heavy Ion run and the PS is off, but CASTOR and probably other teams will be testing at that time to prepare for the Heavy Ion run.

AOB:

-- JamieShiers - 30-Sep-2010
