Week of 121022

Daily WLCG Operations Call details

To join the call at 15:00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(AndreaS, Jarka, Stephen, Torre, Felix, Alessandro, Manuel, Eva, Luca);remote(Paolo/CNAF, Michael/BNL, Stefano/LHCb, Tiju/RAL, Rob/OSG, Dimitri/KIT, Christian/NDGF, Rolf/IN2P3-CC).

Experiments round table:

  • ATLAS reports -
    • Central services
      • Slow VOMRS updates to VOMS, reported for a specific user on 10/19; the issue seems to be more general. Assigned to VOMRS support to have a look. GGUS:87597
      • Removed as ongoing issue: mod_gridsite crashing due to access from hosts which don't have a reverse DNS mapping. The fix is included in gridsite-1.7.22 and the PanDA server machines have installed that version. (GGUS:81757)
      • Removed as ongoing issue: lfc.lfc_addreplica problem when using Python 2.6.5. We cannot use Python 2.6 in combination with LFC registrations in the pilot until this is fixed. The problem has been resolved and we are in the process of starting large-scale tests. GGUS:84716 (a hedged registration sketch follows this report)
    • T0/T1
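
Regarding the lfc.lfc_addreplica item above, here is a minimal sketch of the kind of version-guarded replica registration a pilot could perform. It is illustrative only: the LFC host is a placeholder, and the lfc_addreplica argument order (mirroring the C API) is an assumption, not taken from the actual pilot code.

```python
import sys

def register_replica(guid, surl, lfc_host="lfc.example.org"):
    """Illustrative replica registration via the LFC Python bindings.

    Skips direct registration under Python 2.6.x, where lfc_addreplica
    was reported to fail (GGUS:84716); the caller would then fall back
    to whatever out-of-band registration mechanism the pilot provides.
    """
    if sys.version_info[:2] == (2, 6):
        return False  # defer registration instead of calling the broken path

    import lfc  # LFC Python bindings, assumed available on the worker node
    # Arguments follow the C API: guid, fileid, host, sfn, status, f_type,
    # poolname, fs. Status '-' = available, f_type 'P' = permanent.
    rc = lfc.lfc_addreplica(guid, None, lfc_host, surl, '-', 'P', '', '')
    return rc == 0
```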

  • CMS reports -
    • LHC / CMS
      • Physics running
    • CERN / central services and T0
      • Problems with transfers from CASTOR to EOS since ~noon Thursday. INC:179785.
      • FTS bug causing various transfer issues, GGUS:86775. [Alessandro and Nicolò checked whether the number of ATLAS transfer errors increased after the sites applied the FTS patch, but it does not seem to be the case.]
      • Tier-0 Frontier node down, INC:180042
    • Tier-1:
      • KIT has stopped production and custodial imports due to tape system issues.
        • Production restarted this morning.
    • Tier-2:
      • CMS SAM tests leave behind cmsRun jobs on the node, GGUS:87624. Shouldn't the batch system remove all child processes? (see the cleanup sketch after this report)
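
On the leftover cmsRun processes: batch systems typically clean up a job's whole process group in an epilogue step. Below is a minimal, generic sketch of that idea; it is not the actual LSF/CREAM cleanup code, just an illustration of the mechanism, assuming the job wrapper started the job in its own process group (e.g. via os.setsid).

```python
import os
import signal
import time

def kill_job_process_group(job_pid, grace_seconds=10):
    """Epilogue-style cleanup: terminate every process in the job's
    process group so helpers such as cmsRun cannot outlive the job."""
    pgid = os.getpgid(job_pid)
    os.killpg(pgid, signal.SIGTERM)      # polite request first
    time.sleep(grace_seconds)            # give processes time to exit
    try:
        os.killpg(pgid, signal.SIGKILL)  # force-kill any survivors
    except OSError:
        pass                             # group already gone
```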

  • LHCb reports -
    • T0:
      • CERN: the problem with aborting pilots (GGUS:87447) is not solved yet. The pilots that appeared stuck and prevented the submission of new jobs (GGUS:87448) turn out not to be stuck: the CEs are just very slow. Still to be solved.
    • T1:
      • GRIDKA: very low staging efficiency (GGUS:80794); data access problems for jobs reading from the tape cache (GGUS:87318)
      • RAL: scheduled downtime tomorrow to upgrade the CEs to EMI. Today we will stop user job submission and ban the SEs.
      • CNAF: staging efficiency has increased and disk-to-tape transfers have increased. Still recovering the backlog from the storage downtime in mid-September.
    • Activity:
      • Prompt reconstruction and reprocessing are going on quite well (very low failure rate)
      • Rebalanced resource usage in order to increase the speed of prompt reconstruction.

Sites / Services round table:

  • ASGC: Two tape-related problems: one tape robot is unavailable due to maintenance, and the tape pool urgently needs some repacking because it is running out of space. There is no estimate of the time required; more details are expected later.
  • BNL: ntr
  • CNAF: ntr
  • IN2P3-CC: ntr
  • KIT: experts are working on the tape problems, but there is no real news; more may be known tomorrow. New CREAM CEs are being added.
  • NDGF: heads-up for a downtime next Wednesday for a dCache upgrade at DCSC, some files on tape might be temporarily unavailable.
  • RAL: tomorrow there will be two outages: one to migrate CEs to EMI and one to upgrade CASTOR for CMS
  • OSG: maintenance of all the OSG services is foreseen for tomorrow at 9:00 EST for an OS upgrade.
  • GGUS: NB!! GGUS Release this Wednesday 2012/10/24 with the usual test tickets (ALARMs and attachments).
  • CERN Grid Services: 2 transparent interventions today
    • Upgrade of CERN WMS servers to EMI1 Update 19
    • Move of myproxy.cern.ch to 2 new servers (due to retirement of current hardware)
  • CERN storage: we updated EOS-ALICE.
  • Dashboards: ntr
  • Databases: ntr

AOB:

Tuesday

Attendance: local(AndreaS, Torre, Felix, Luca, Jarka, Manuel, Ian);remote(Stefano/LHCb, Michael/BNL, Lisa/FNAL, Paolo/CNAF, Xavier/KIT, Christian/NDGF, Gareth/RAL, Ron/NL-T1, Rob/OSG, Rolf/IN2P3).

Experiments round table:

  • ATLAS reports -
    • Central services
      • Slow VOMRS updates to VOMS, expert update: synchronization is failing for ALL vomrs-voms instances at CERN. The VOMRS service certificate doesn't have the right ACL in VOMS. Elevated to very urgent. GGUS:87597
    • T0/T1
      • SARA-MATRIX: SRMV2STAGER errors, ticket from last week (10/18): "We're not sure what the 'Request clumping limit reached' means so we'll ask dCache support." Asked for an update yesterday; no response yet. We want to be sure all is OK before the start of bulk reprocessing next week. GGUS:87531
      • FZK- DATATAPE remains out of Tier 0 export pending resolution of hardware issues GGUS:87510 GGUS:87526 GGUS:87630

  • CMS reports -
    • LHC / CMS
      • Physics running
    • CERN / central services and T0
      • Problems with transfers from CASTOR to EOS since ~noon Thursday. INC:179785. [Luca: an RPM to fix the problem has been built and will be transparently applied this afternoon]
      • FTS bug causing various transfer issues, GGUS:86775. [Things look even worse than before the patch, and Torre reports that ATLAS now also sees that. The developers are working on a fix, but for the moment there is no satisfactory workaround. Rolling back has been suggested.]
    • Tier-1:
      • Requests for the monthly T1 consistency checks were issued yesterday
      • T1_FR_CCIN2P3 has successfully enabled CVMFS for CMS. The plan is to switch to it after some tests are performed by CMS
    • Tier-2:
      • NTR

  • ALICE reports -
    • CERN: EOS-ALICE looks OK so far after yesterday afternoon's upgrade
    • jobs are failing at a number of sites for a number of users, possibly due to a bug in a recent AliEn version; under investigation

  • LHCb reports -
    • Activity:
      • Prompt reconstruction and reprocessing are going on quite well (very low failure rate)
    • T0:
      • Nothing new to report.
    • T1:
      • RAL: Upgrade of CEs to EMI is done. Waiting for the first jobs to be picked up by the system in order to check that everything is fine.

Sites / Services round table:

  • ASGC: the tape robot is now back in production and the problem with too little space left in the tape pool has been alleviated by the addition of 500 tapes not previously accounted for.
  • BNL: ntr
  • CNAF: ntr
  • FNAL: all services recovered and back to production after yesterday's outage
  • IN2P3-CC: tomorrow between 10:00 and 12:00 CEST there will be an 'at risk' downtime for SRM, in principle almost transparent (apart from a restart of a few minutes).
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • RAL: updated CASTOR for LHCb to 2.1.12 and replaced the gLite CREAM CEs with EMI CREAM CEs. So far everything looks OK; we just need a little time for things to settle down. Investigating network traffic issues seen with PerfSONAR: already understood for CNAF, still under study for other sites.
  • OSG: the announced maintenance window has started and is ongoing; it should be transparent and the nodes will be rebooted in the next 2 hours. The power issue at FNAL temporarily affected OSG accounting, but no problems were reported.
  • CERN batch and grid services: ntr
  • CERN storage services: ntr
  • Dashboards: ntr
  • GGUS: NB!! GGUS Release tomorrow Wednesday 2012/10/24 with the usual test tickets (ALARMs and attachments).

AOB:

Wednesday

Attendance: local(AndreaS, Torre, Felix, LucaM, Jarka, LucaC, Manuel, MariaD);remote(Michael/BNL, Stefano/LHCb, Lisa/FNAL, Rolf/IN2P3-CC, Tiju/RAL, Christian/NDGF, Paolo/CNAF, Rob/OSG, Pavel/KIT, Ian/CMS, Ron/NL-T1).

Experiments round table:

  • ATLAS reports -
    • ATLAS computing operations
      • Express stream reprocessing finished except for some tails (see INFN issue). Bulk reprocessing expected to begin next week. Should have a firmer idea of the date on Friday.
    • Central services
      • The FTS error "delegation ID can't be generated" seen by CMS is also seen by ATLAS. See the FTS proxy expiration ongoing issue above and GGUS:86775
      • Slow VOMRS updates to VOMS, synchronization failing for all vomrs-voms instances at CERN; received word that an investigation is underway. GGUS:87597
    • T0/T1
      • FZK: urgent UNAVAILABLE files at FZK-LCG2 ATLASDATADISK needed for high priority production jobs GGUS:87766
      • INFN-T1: Reopened 10/15 ticket for jobs failing at INFN-T1-EMI due to a missing library for Oracle access. GGUS:87346
      • SARA-MATRIX: SRMV2STAGER errors, ticket from last week 10/18. Received word that dialogue with dCache support is ongoing. Want to be sure all is OK in advance of the start of bulk reprocessing next week. GGUS:87531
      • FZK- DATATAPE remains out of Tier 0 export pending resolution of hardware issues GGUS:87510 GGUS:87526 GGUS:87630

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • NTR
      • [Ian: apologies for yesterday's incorrect report about IN2P3 using CVMFS for CMS (note by Andrea: I amended the minutes)]
    • Tier-2:
      • NTR
[Ian: any news about the FTS problem? Manuel: nothing from yesterday]

  • ALICE reports -
    • CERN: the old SE hosting conditions data is being retired; for that type of data ALICE mainly relies on EOS-ALICE now, with backup replicas at other sites

  • LHCb reports -
    • Activity:
      • Prompt reconstruction and reprocessing are going on quite well (very low failure rate)
    • T0:
      • GGUS:87702: some files have a bad checksum; it seems they were copied both from a WN and via an FTS transfer. GGUS:87690: some files were missing due to a bug in the EOS repair tools; not a big deal, they can be recovered.
    • T1:
      • RAL: had a hardware problem with a CASTOR server, which was replaced with a spare. The usual server will be put back in production next week.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • IN2P3-CC: yesterday's SRM intervention went well
  • KIT: the ATLAS pools are being recovered. The tape situation is stable; work is ongoing, nothing new to add. Ian: we opened a ticket about the CMS Download Agent having been down for 13 hours.
  • NDGF: the pools' downtime finished and now they are back online
  • NL-T1: this morning we had issues with the tape system, some nodes needed to be rebooted.
  • RAL: it seems that ATLAS, ALICE and LHCb are still using the old CEs instead of the new EMI CREAM CEs.
  • OSG: ntr
  • CERN batch and grid services: about GGUS:87597, we rebooted VOMRS. Could the experiments check if the synchronisation problem is still there? Our guess is that it might be related to a certificate update last Friday.
  • CERN storage: ntr
  • Dashboards: ntr
  • Databases: ntr

AOB:

Thursday

Attendance: local(AndreaS, Torre, Michail, Manuel, Alessandro, Felix, Jarka, LucaM);remote(Michael/BNL, Woo-Jin/KIT, Stefano/LHCb, Gareth/RAL, Rolf/IN2P3-CC, Christian/NDGF, Lisa/FNAL, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • Central services
      • FTS ticket, ATLAS position on rollback: we see a large increase in submission errors since the patch, but before agreeing to a rollback we would like to hear from the experts whether there is a technical workaround to alleviate the errors until a proper fix, and thereby avoid a rollback (which could not be agreed on and carried out before the weekend anyway). We propose to discuss this here on Monday with input from the experts. GGUS:86775
      • Slow VOMRS updates to VOMS: we confirm the problem appears to be resolved. GGUS:87597
    • T0/T1
      • TAIWAN-LCG2 new ticket today: DATATAPE: [SRM_FILE_UNAVAILABLE] File has no copy on tape. Site says tape mounting is hanging, being addressed. GGUS:87796
      • IN2P3-CC: regarding the 10/18 ticket reporting failures staging data from tape, we received a response yesterday that files on tape are lost. Presumably then all files on the tape are lost? Will the lost files be reported? GGUS:87529
      • FZK: urgent UNAVAILABLE files at FZK-LCG2 ATLASDATADISK needed for high priority production jobs. Some data restored, some still unavailable at last report. GGUS:87766
      • FZK- DATATAPE remains out of Tier 0 export pending resolution of hardware issues GGUS:87510 GGUS:87526 GGUS:87630
      • INFN-T1: Reopened ticket for jobs failing at INFN-T1-EMI due to a missing library is pending. GGUS:87346

Alessandro and Michail explain that the recently applied FTS patch was needed, and that we see more errors only because these errors also happened before but were less visible. Michail has most probably identified the root cause (a bug in FTS introduced in the gLite-to-EMI FTS code migration) and a fix is under testing. On Monday or Tuesday this may be confirmed, and a decision can then be taken on whether to apply the fix at the T0 and subsequently at the T1s. Applying the new patch only requires a simple Tomcat restart, so it will be transparent.

About the unavailable files at IN2P3-CC, Rolf said that he has already contacted the local ATLAS contact to provide the requested information.

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • Preparing to start Heavy Ion reprocessing at Vanderbilt
    • Tier-2:
      • NTR
Written report only today.

  • LHCb reports -
    • Activity:
      • Prompt reconstruction and reprocessing are going on quite well (very low failure rate)
    • T0:
      • GGUS:87702: the investigation is ongoing. It seems the files were corrupted because the FTS transfer was not using the overwrite flag; this has to be understood, because it is the default option for LHCb. The EOS people suggested setting the overwrite flag as the default for EOS; we prefer to keep things as they are and continue to investigate. [LucaM: investigating a simpler case where, for some reason, when a transfer is aborted the file is reopened without the truncation flag] (a generic checksum-check sketch follows this report)
      • Around midday we started to suffer problems with transfers to EOS. [LucaM: we think it was because, during an EOS server update, the clients were stalled for longer than the timeout used by LHCb]
    • T1:
      • NTR
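
On the bad checksums in GGUS:87702: such corruption is normally caught by comparing the Adler32 checksum of the destination copy against the catalogue value after the transfer. Below is a minimal, generic sketch of that check in Python; it is illustrative only and not LHCb's or FTS's actual verification code.

```python
import zlib

def adler32_of_file(path, block_size=1024 * 1024):
    """Compute the Adler32 checksum of a file, formatted as the usual
    8-character lowercase hex string used in grid file catalogues."""
    checksum = 1  # Adler32 starts from 1 by definition
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            checksum = zlib.adler32(block, checksum)
    return "%08x" % (checksum & 0xFFFFFFFF)

def transfer_is_consistent(destination_copy, catalogue_checksum):
    """Return True if the freshly transferred file matches the catalogue."""
    return adler32_of_file(destination_copy) == catalogue_checksum.lower()
```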

Sites / Services round table:

  • ASGC:
    • we have two tapes stuck during mounting, one for ATLAS and one for CMS. Now doing some raw tests on them to see if there is any media error.
    • last evening we had an SRM issue caused by some stuck requests, probably related to the problematic tapes. SRM was restarted and we are keeping a close eye on it and on the tape system.
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3-CC: ntr
  • KIT: ntr
  • NDGF: ntr
  • RAL: on Tuesday we phased out the gLite CEs, which were replaced by EMI CEs, but ATLAS is still testing the old ones in SAM. Alessandro: checking why.
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage services: ntr
  • Dashboards: ntr

AOB:

Friday

Attendance: local(AndreaS, Jarka, Manuel, Felix, Torre, LucaM);remote(Stefano/LHCb, Paolo/CNAF, Xavier/KIT, Miguel/NDGF, Gonzalo/PIC, Dimitrios/RAL, Michael/BNL, Lisa/FNAL, Rolf/IN2P3-CC, Rob/OSG, Ian/CMS).

Experiments round table:

  • ATLAS reports -
    • Computing operations
      • Bulk reprocessing of 2012 beam data expected to begin mid-week next week, but later is possible
    • Central services
      • No new issues
    • T0/T1
      • TAIWAN-LCG2 DATATAPE: [SRM_FILE_UNAVAILABLE] due to tape mounting hanging, reported yesterday, awaiting word. GGUS:87796
      • IN2P3-CC: Seeking clarification re: lost files on tape. Are only two files lost, or is the whole tape (and therefore presumably more than two files) lost? We would like the full list. GGUS:87529 [Rolf: sorry for the delay, the person in charge of this incident was not in the office yesterday]
      • FZK: urgent UNAVAILABLE files at FZK-LCG2 ATLASDATADISK needed for high priority production jobs. Site working on recovering data. GGUS:87766
      • FZK- DATATAPE remains out of Tier 0 export pending resolution of hardware issues GGUS:87510 GGUS:87526 GGUS:87630 [Xavier: we are confident that the tape setup at KIT is now good enough to resume T0 export. This holds for all the experiments.]
      • INFN-T1: Reopened ticket for jobs failing at INFN-T1-EMI due to a missing library is pending. GGUS:87346

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • LSF slowdown at CERN. We stopped the test infrastructure for now. [Ian will make sure a ticket is created, as Manuel did not see one]
    • Tier-1:
      • Transfer quality problems to ASGC; increasing rate of failures over the last 18 hours. [Ian: we see 50% of SRM calls failing, but the latest transfers look OK, so maybe it is fixed?]
    • Tier-2:
      • NTR

  • LHCb reports - Apologies, the Alcatel connection does not work.
    • Activity:
      • Prompt reconstruction and reprocessing are going on quite well (very low failure rate)
    • T0:
      • Problem caused by EOS server update has been solved.
    • T1:
      • NTR

Sites / Services round table:

  • ASGC: The failed transfers reported by Ian seem to be due to some authorisation problem; still troubleshooting. About the tape problems reported yesterday, we are still investigating.
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • PIC: we have completed (and linked to the official twiki) the SIR about the accidental file deletion of two weeks ago
  • RAL: just a reminder that this coming Tuesday morning we are going to upgrade CASTOR for ALICE.
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage: yesterday's problem with EOS for LHCb was due to a file-number quota issue, which we have fixed.
  • Dashboards: ntr

AOB:

-- JamieShiers - 18-Sep-2012
