Week of 110404

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Ken, Alexei, Dirk, Maria, Jamie, Maarten, Massimo, Ulrich, Mattia);remote(Felix, Rolf, Maria Francesca, Michael, Jon, Roberto, Gonzalo, John, Claudia, Ron, Rob).

Experiments round table:

  • ATLAS reports -
    • ATLAS
      • reduced shift crew until Wednesday 15:00
    • ADC
      • AFS/LSF issue: ALARM ticket GGUS:69320.
        • The problem was fixed by the AFS team within 1 hour.
      • Problem with IN2P3-CC_MCTAPE (GGUS ticket; also a Savannah ticket). The site was excluded from DDM. The tape staging service was restarted and the endpoint re-enabled, but after 2 hours the same error was observed again, so the endpoint has been disabled. [Rolf: we look at GGUS tickets, not Savannah. Alexei: there is a GGUS ticket, GGUS:69278.]
      • INFN-MILANO-ATLASC endpoints have problems; all excluded from DDM (Savannah ticket).
      • ITEP, JINR and PROTVINO (Russian T2s in the NL cloud) have a similar problem (Elog).
      • GRIF-LAL_DATADISK is excluded

  • CMS reports -
    • LHC / CMS detector
      • Just accelerator tests all weekend.
    • CERN / central services
      • Last night's shifter noticed that the dashboard reported a number of sites as not visible in the BDII, but the dashboard information may have been stale. The situation is unclear; the dashboard team has been notified and is looking into it. [Mattia: being followed up.]
      • CERN SRM appeared to be down briefly earlier this afternoon and seems to have resolved itself. [Could be the public SRM? To be checked.]
    • Tier-0 / CAF
      • No news.
    • Tier-1
      • Some MC production in progress (trying to get to 50% utilization), many sites available for WMAgent testing
    • Tier-2
      • One site in scheduled downtime, usual collection of minor site issues, otherwise MC production continues

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK: the problem with MonALISA is still there.
      • CNAF: GGUS:69284. Since Friday the site had not been running ALICE jobs according to MonALISA, with 2.7K jobs hanging. There was a problem with the BDII of one of the CREAM CEs.
    • T2 sites
      • GRIF_IPNO: GGUS:69284. The VOBox was not reachable due to a memory consumption problem during the night. Solved.
      • LPSC (Grenoble): GGUS:69326. The information provider is reporting incorrect information.

  • LHCb reports -
    • Waiting for a new CondDB TAG to run the Reconstruction for Collision11 data.
    • MC productions running smoothly.
    • T0
      • CASTOR: sent a request to evaluate the possibility of changing the LHCb space token definition by merging all space tokens that share the same service class.
      • CVMFS: Steve will change the environment on all batch nodes so that the shared-area mount point is CVMFS (see the sketch after this report).
    • T1
      • RAL was accidentally kept banned during the weekend and consequently ran no jobs other than SAM tests.
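
    A minimal sketch of the kind of change implied by the CVMFS item above, assuming the conventional VO_<NAME>_SW_DIR shared-area variable and the standard /cvmfs/lhcb.cern.ch mount point; both names are assumptions, not taken from the minutes:

      # on each batch node (or in the site grid environment profile), point the LHCb
      # shared software area at the CVMFS mount instead of the old shared file system
      export VO_LHCB_SW_DIR=/cvmfs/lhcb.cern.ch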

Sites / Services round table:

  • ASGC - merged the T2 farm into the T1; channels are connected to the DPM endpoint. Each T1 is asked to create an FTS channel to the Taiwan FTT endpoint, with the same configuration as for Taiwan-LCG2 (see the sketch after this round table). [Jon: are you asking us to change FTS channels to ASGC? Maarten: ASGC have moved their disk-only storage to DPM at the T1. Jon: is there a ticket on this? Maarten: the change is only for ATLAS; all ATLAS T1s need to implement what ASGC asked.]
  • IN2P3 - regarding the ATLAS ticket: we found it, but its state is "solved"; it was set to solved when the service was restarted on Saturday. A comment was added by ATLAS since then, but as an internal comment it did not reopen the ticket; please reopen it if required. We saw long delays on the staging queues due to a large number of requests for small files, which might explain the problems ATLAS saw. People here did not notice that you still had a problem because the ticket was in the solved state. From our side the service is OK, so if you see any further symptoms please describe them.
  • NDGF - found a bug in dCache at the Finnish T2 resources which causes failures at the SRM door. Investigating at the site at the moment; we apologise and will report soon.
  • BNL -
  • FNAL - ntr
  • PIC - ntr
  • RAL - over the weekend we had problems with the site BDIIs: the round-robin DNS alias covers 3 machines, and 1 (sometimes 2) of them were not reporting anything. We believe this is fixed now (see the query sketch after this round table).
  • CNAF - during the weekend there were problems on the LCG CE due to the local YAIM configuration; the problem is understood and being fixed, and we hope to close the ticket today. It affected ALICE and ATLAS. A downtime for LHCb was scheduled in GOCDB last week; we apologise that it was cancelled late (communications were given by e-mail and at this meeting last week). The StoRM endpoint upgrade may take place during the next LHC technical stop.
  • NL-T1 - issue with the xrootd door of the SARA dCache instance used by ALICE; fixed over the weekend.
  • OSG - on Friday afternoon US time our main CA (the DOEGrids CA) experienced an outage affecting most services: the CRL service was OK, but new and renewed certificates failed. Service was restored on Friday evening; we are awaiting word from DOEGrids about stability and expect an announcement later today.
  • KIT - this morning around 02:00 one of our disk-only pools was down; due to a dCache problem the whole dCache instance (GridKA-dcache-fzk.de) became very slow. This is the instance for LHCb.

  • CERN Grid - comment on the ATLAS batch/LSF issue: there was a problem on one of the Kerberos servers, where a software upgrade went wrong. Over the weekend there was also a configuration error on a SCAS server, causing problems when glexec was used.

  • CERN Storage - the SRM upgrade was on the PUBLIC and ATLAS instances and so could have affected availability; it finished by 09:45. Some ATLAS tests should probably be re-discussed: for certain failures the probe keeps failing even after the system is back, which is more a logic problem in the probe.
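
  Regarding the ASGC request above, a hedged sketch of what each ATLAS T1 could do with the FTS 2.x channel-management CLI; the channel and site names are placeholders, not the real ASGC endpoint names:

    # inspect the settings of the existing channel towards Taiwan-LCG2
    glite-transfer-channel-list MYSITE-TAIWANLCG2
    # create an equivalent channel towards the new Taiwan FTT endpoint,
    # then copy the settings (number of files, streams, state) from the old channel
    glite-transfer-channel-add MYSITE-TAIWANFTT MYSITE TAIWAN-FTT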
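
  Regarding the RAL site-BDII issue above, a minimal diagnostic sketch, assuming the standard Glue 1.3 schema and the usual site-BDII port 2170; the alias name is a placeholder:

    # see which hosts the round-robin alias currently resolves to
    host site-bdii.example.ac.uk
    # query each instance directly; an instance that returns nothing is the broken one
    ldapsearch -x -LLL -H ldap://<one-of-the-hosts>:2170 -b o=grid '(objectClass=GlueSite)' GlueSiteName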

AOB: (MariaDZ) Announcement by OSG published in GGUS:69276 and also broadcast on http://operations-portal.egi.eu/: the primary CA used by OSG is offline until further notice. CRLs are not affected, but new and renewal certificates cannot be issued. Can the experiments please comment in these notes whether this is the necessary and sufficient action to take in such cases.
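
As a quick check on the CRL side, assuming the usual /etc/grid-security/certificates layout and the OpenSSL CLI (the file name is a placeholder for the DOEGrids CRL file):

    # CRLs keep being distributed during the outage; verify that the local copy is still valid
    openssl crl -in /etc/grid-security/certificates/<ca-hash>.r0 -noout -lastupdate -nextupdate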

Tuesday:

Attendance: local(Alessandro, Eva, Maarten, Maria D, Miguel, Mike, Nicolo, Pedro, Ulrich);remote(Claudia, Jeff, Jeremy, Jhen-Wei, Jon, Maria Francesca, Michael, Rob, Roberto, Rolf, Tiju, Xavier).

Experiments round table:

  • ATLAS reports -
    • ATLAS
      • reduced shift crew until Friday 15:00
      • Muon reprocessing started: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/ADCppRepro2011#Muon_reprocessing_campain_March
      • Central services
        • Central Catalog: one of the 2 reader instances showed no activity for one hour during the night. No service degradation was observed by users. Experts are investigating.
        • Pilot factories: 2 of the 3 instances running at CERN are in trouble (for different reasons). Experts are investigating.
        • A GGUS problem occurred while trying to update a ticket; a ticket about it was submitted to GGUS: GGUS:69375
    • Site issues
      • Problem with IN2P3-CC TAPE (GGUS:69377). The MCTAPE errors are resolved: the site has been re-included in DDM and GGUS:69278 closed. However, since yesterday there are DATATAPE errors (again srmbringonline); a ticket was opened and we are waiting for a reply. The errors are now related to the MuonStream reprocessing.
        • Rolf: the ticket was updated 2 hours ago with a question for ATLAS
      • IFIC-LCG2_CALIBDISK storage in trouble (GGUS:69366). Discussion is ongoing with the site responsibles, who do not think the problem is on their side. A link to the ATLAS Nagios tests (showing the unavailability of IFIC-LCG2) was added to the GGUS ticket.
      • BNL-ATLAS reported an anomalous I/O intensive activity on the SRM database. Experts are investigating https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/24021
      • Some "file exist" errors have been noticed for Functional Test activity. The FTS overwrite option is configurable per site. In ATLAS we are trying to understand how many sites observe this problem and where we can apply eventually a fix
        • Jeff: was overwriting ever supported? delete + rewrite instead!
        • Alessandro: will look further into the matter; something was changed, the file name time stamp suffix is no longer being used; more info tomorrow

  • CMS reports -
    • LHC / CMS detector
      • Scrubbing run, magnet ramped down.
    • CERN / central services
      • CMSWEB and PhEDEx upgrade this morning.
      • Mike: a partial fix for yesterday's dashboard problem is in place, a full fix is being worked on
    • Tier-0 / CAF
      • No news.
    • Tier-1
      • Some MC production in progress, many sites available for WMAgent testing
      • FNAL: 3 custodial files from AOD dataset missing: SAV:120192
      • CNAF: SAM CE-cms-analysis and CE-cms-frontier tests failing with timeouts, now looks OK, GGUS:69348
    • Tier-2
      • T2_BR_UERJ in scheduled downtime, usual collection of minor site issues, otherwise MC production continues

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK: one of the CREAM CEs had been out of production since last night, and on the other one submission was quite slow this morning. Current status: cream-3 is back in production but submission is slow, cream-1 is OK, and a new CE (cream-5) has been added.
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Launched the FULL Reconstruction production over 2011 data this morning.
      • MC productions running smoothly.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • First archival of 2.8 TB of data at CERN and CNAF. Some slowness in the migration to tape at CERN due to the use of a single tape drive.
          • Miguel: open a ticket if the migration speed causes a problem
      • T1
        • RAL: discussion on how to deploy CVMFS in production there. LHCb proposed to set up a small batch of nodes as pre-production before moving all of them to CVMFS.

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • this morning another SRM performance degradation was observed, again due to high I/O load on the DB; more forensic data have been collected and the performance is OK now
  • CNAF
    • downtime tomorrow 14-16 UTC for ATLAS StoRM instance to increase the RAM in the back-end nodes
    • CMS problem mentioned was due to a broken switch
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - nta
  • KIT
    • around 13:00 CEST the dCache instance used by LHCb became unusable for 1h due to a central dCache daemon running out of memory
  • NDGF
    • tomorrow 6:30-10 UTC power maintenance at NSC site, affecting the FTS, 1 computing cluster and a dCache pool with ALICE and ATLAS data
  • NLT1 - ntr
  • OSG - ntr
  • RAL
    • last night the site BDII was degraded for a few hours; it is fixed now and being investigated

  • CASTOR - nta
  • dashboards - nta
  • databases - ntr
  • GGUS-SNOW
    • Q for OSG: was the issue in yesterday's AOB (CA outage) handled correctly?
    • Rob: yes
  • grid services
    • 6 LCG-CE nodes will be retired (4 remain)
  • GT group
    • see AOB

AOB:

  • Maria D: sensible values for a 'Type of Problem' field to be added to the GGUS TEAM and ALARM ticket submission forms will be discussed this Thursday after the daily meeting. Comments on the proposal https://savannah.cern.ch/support/?117206#comment20 are welcome at any time. Thanks!
  • Nicolo: there was a SAM-Nagios issue this morning affecting all experiments, and it now seems test jobs can remain blocked for hours on the WN!
  • Pedro: a temporary problem was fixed an hour ago; the WN timeout issue will be looked into

Wednesday

Attendance: local(Jamie, Mike, Uli, Nico, Luca);remote(Roberto, Dimitri, Jon, Claudia, Michael, Rolf, Onno, Maria Francesca, Tiju, Rob, Alessandro, Jhen-Wei).

Experiments round table:

  • ATLAS reports -
    • ATLAS general
      • calibration foreseen until Friday 15:00
    • Central Ops
    • Site issues
      • Many sites have observed failures in Functional Test transfers due to "file exists" errors. This is caused by the FTS -o (overwrite) option, used in FT for the last few days to understand whether ATLAS DDM could drop the _DQ2 timestamp suffix that ATLAS currently appends to all retried transfers. The option is now switched off; it needs more testing (see the sketch after this report).
      • CERN-PROD files unavailable - GGUS:69442
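
      A hedged illustration of the overwrite behaviour discussed above, assuming it is the -o flag of the glite-transfer-submit client that DDM passes through; the FTS endpoint and SURLs are placeholders:

        # with -o (overwrite) FTS replaces an existing destination file instead of
        # failing the transfer with a "file exists" error
        glite-transfer-submit -s https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer \
            -o \
            srm://source.example.org/atlas/datafile \
            srm://dest.example.org/atlas/datafile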

  • CMS reports -
    • LHC / CMS detector
      • Scrubbing run continues; collimator studies [hence no major Tier-0 activities]
    • CERN / central services
      • CMSWEB and PhEDEx upgrade generally successful; a bug in the PhEDEx Datasvc (the T0 lost the ability to automatically approve subscriptions) has a fix in testing.
    • Tier-0 / CAF
    • Tier-1
      • MC production in progress, many sites available for WMAgent testing
      • FNAL: 3 custodial files from AOD dataset missing: SAV:120192
    • Tier-2
      • T2_BR_UERJ in scheduled downtime
      • T2_IN_TIFR SAM CE failures SAV:120015
      • MC production in progress

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • LPC-Clermont: GGUS:69409. Submission working but ALICE jobs are being cancelled at the site.
      • Usual operations

  • LHCb reports -
    • Reconstruction and MC simulation
    • T0
      • ~20% of reconstruction jobs at CERN showed a problem related to CVMFS (either the application was not found or there was a problem setting up the environment). As pointed out by Steve, the problem is due to cold caches on some "monster" 48-core WNs: the first job populates the cache, and the rest of the jobs on these machines run fine (see the sketch after this report).
    • T1
      • PIC: many transfers failing on the RAL-PIC channel. PIC people are looking at the FTS configuration to check whether the channel is properly configured.
      • GridKA: we confirmed that the dCache issue reported yesterday affected transfers to GridKA for a few hours in the morning. It is now OK.
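
    Regarding the cold-cache explanation above, a small sketch of how the cache on such a WN could be checked or pre-warmed, assuming the standard cvmfs_config tool and the lhcb.cern.ch repository:

      # verify that the repository mounts and answers on this node
      cvmfs_config probe lhcb.cern.ch
      # a first access triggers the autofs mount and starts filling the local cache
      ls /cvmfs/lhcb.cern.ch >/dev/null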

Sites / Services round table:

  • BNL -
  • KIT - announcement of a downtime next Wednesday, April 13, 13:00 - 17:00, for the FTS and LFC back-end Oracle migration. Also in GOCDB.
  • CNAF - ntr; reminder of the downtime for the ATLAS StoRM endpoint in less than 1 hour, to increase the RAM on the StoRM back-end.
  • FNAL - yesterday, at the request of CMS, we switched from dCache to Lustre for unmerged datasets; this works successfully. We have intermittent trouble with the Alcatel phone system: we have to dial several times, and we also notice drop-outs while people are talking. Do others see this? It has been happening for about a week.
  • IN2P3 - ntr
  • NL-T1 - ntr
  • NDGF - from tomorrow, 7 April, until 20 April there is maintenance work in the server room at Copenhagen University. All services there will be offline, and ATLAS disk and tape reading will be affected. [Ale: degradation or complete downtime for DATATAPE? There is a reprocessing campaign going on; we can verify how much data has to be recalled from tape, but it is important that we know the level of degradation.] [To be confirmed tomorrow.]
  • RAL - at risk at the site on Saturday and Sunday due to work on the power in the building hosting network equipment.
  • ASGC - ntr
  • OSG - ntr

AOB:

Thursday

Attendance: local(Jamie, Maria, Uli, Alessandro, Maarten, Mattia, Pedro, Miguel, Jacek, Nicolo);remote(Jon, Felix, John, Gonzalo, Ronald, Jeremy, Rolf, Maria Francesca, Foued, Rob, Claudia).

Experiments round table:

  • ATLAS reports -
    • Central Ops
      • Central Catalog: one of the 4 instances (voatlas03) got stuck around 22:30; after a restart it was OK. The same happened to another instance (voatlas05) around 01:00; it was restarted and is now OK. https://savannah.cern.ch/bugs/index.php?80612
      • The Muon reprocessing campaign is going smoothly: 81497 total jobs, 80690 done (99.0%).
    • Site issues
      • FZK-LCG2 problems in accessing DBRelease file GGUS:69486
      • NIKHEF-ELPROD GGUS:69485: not a site issue; ATLAS production jobs were using more than 4 GB of virtual memory. From the person responsible for the simulation: "All problematic tasks were aborted. To be redone with optimized setup. Junji". Thanks to the site for reporting this.

  • CMS reports -
    • LHC / CMS detector
      • Scrubbing run continues; possibilities of short STABLE BEAMS periods; magnet off.
    • CERN / central services
      • NTR
    • Tier-0 / CAF
      • NTR
    • Tier-1
      • MC production in progress, many sites available for WMAgent testing
      • FNAL: 3 custodial files from AOD dataset missing: SAV:120192
    • Tier-2
      • T2_BR_UERJ in scheduled downtime
      • T2_IN_TIFR in unscheduled downtime - SAM CE failures SAV:120015
      • T2_UK_SGrid_RALPP draining queues for scheduled downtime
      • MC production in progress

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • SARA opened GGUS:69434 for ALICE support because the VOBox was showing a memory problem, as it did last week, suggesting it might become unreachable soon. We identified the cause (CM) and fixed it temporarily until the new AliEn deployment, which includes the patch, is done. Thanks for reporting it.
    • T2 sites
      • Usual operations

  • LHCb reports - Reconstruction and MC simulation
    • T0
      • NTR
    • T1
      • PIC: one file-transfer problem on the RAL-PIC channel. Investigation is ongoing; the dashboard does not report any obvious problem with the FTS configuration at PIC.

Sites / Services round table:

  • FNAL - ntr
  • ASGC - ntr
  • RAL - ntr
  • PIC - ntr
  • NL-T1 - ntr
  • NDGF - comment on yesterday's announcement: a downtime at Copenhagen starts today and some ATLAS data will not be available. What will the degradation be? About 20% of ATLAS data will not be available, but we will still be able to accept data from CERN at full rate. The downtime is scheduled until 20 April but will hopefully finish earlier. [Ale: about 1.3% of the data will be accessed at NDGF, so perhaps 20-30 files, which are hopefully already staged in.]
  • IN2P3 - ntr
  • KIT - ntr
  • CNAF - ntr
  • GridPP - ntr
  • OSG - ntr

  • CERN - ntr

AOB: (MariaDZ) Discussion on values for a field 'Type of Problem' to be added to GGUS TEAM and ALARM ticket submission form. Proposed values are:

 
Data Management - generic
File Access
File Transfer
Information System
Local Batch System
Middleware
Monitoring
Network problem
Security
Storage Systems
VO Specific Software
Workload Management
3D/Databases
Other 
Names can be changed. Are all areas covered? Details in https://savannah.cern.ch/support/?117206#comment20

Friday

Attendance: local(Dirk, Nicolo, Miguel, Uli, Maarten, Jamie, Maria, Alessandro, Andrea, Mattia);remote(Giovanni Zizzi, Jon, Xavier, Roberto, Ulf, Onno, John, Jhen-Wei, Rolf, Gonzalo, Rob).

Experiments round table:

  • ATLAS reports -
    • ATLAS general
      • Commissioning for now, waiting for beam
    • Central Ops
      • Central Catalog: again one instance (voatlas05) got stuck yesterday evening; a restart cured the issue. The problem is (most probably) understood: a misunderstanding in an upgrade of the oracle-instantclient packages.
    • Everything else is roughly OK.

  • CMS reports -
    • LHC / CMS detector
      • No beam today
    • CERN / central services
      • NTR
    • Tier-0 / CAF
      • NTR
    • Tier-1
      • MC production in progress, many sites available for WMAgent testing
      • FNAL: 3 custodial files from AOD dataset missing: SAV:120192
    • Tier-2
      • T2_BR_UERJ in scheduled downtime
      • T2_IN_TIFR in unscheduled downtime - SAM CE failures SAV:120015
      • T2_UK_SGrid_RALPP draining queues for scheduled downtime
      • Short unscheduled outages at T2_ES_IFCA and T2_FR_GRIF_IRFU
      • MC production in progress

  • ALICE reports -
    • T0 site - Nothing to report
    • T1 sites
      • IN2P3: GGUS:69528. The VOBox is not reachable and its services are not running; we were informed of a cooling problem that might be related. [Rolf: we had a cooling problem that caused several servers to shut down due to high temperature. We are still bringing machines back and will respond to the ticket with news. The site is currently in an unscheduled downtime due to this cooling problem.]
    • T2 sites - Usual operations

  • LHCb reports -
    • Reconstruction and MC simulation
    • Validation of the stripping.
    • T0
      • NTR
    • T1
      • PIC: reported a problem transferring data to PIC using the FTS server there. The problem is the ambiguity of the VO membership of the Data Manager credentials, which FTS does not resolve properly using the VOMS FQAN. They will open an FTS support case (GGUS:69520); see the sketch after this report. [Nicolo: this has also been seen several times by CMS site admins; the work-around is to ask to be registered exclusively for CMS.] [Gonzalo: people are still looking into the details; I was not aware that this was known to other VOs. The ticket has not yet been forwarded to FTS support, but this will be done this afternoon. Maarten: check the order of the VOs in the grid-mapfiles; DTEAM should come after the relevant VO. We have always said that DTEAM and OPS should be mapped below the LHC experiments.]
      • NIKHEF: problem with CVMFS yesterday due to a reconfiguration of the WNs; it seems OK now (GGUS:69501).
      • RAL is going to revert the shared area from CVMFS back to NFS for the weekend. LHCb would prefer not to step back: most likely new software will not be in NFS. We understand there would be no call-out for CVMFS over the weekend, but that may mean risking running without RAL over the weekend.
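
    Regarding the FQAN ambiguity above, a minimal sketch of how the Data Manager can check which VO attributes the proxy actually carries, assuming the standard VOMS client tools:

      # request the LHCb attributes explicitly rather than relying on whatever VO comes first
      voms-proxy-init --voms lhcb
      # the first FQAN printed is the primary one; mapping services generally act on that
      voms-proxy-info --fqan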

Sites / Services round table:

  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - short service break on Monday morning on the SRM for an upgrade. I will be in Vilnius all next week and hope to join the call from there.
  • NL-T1 - announcement: SARA will have a downtime on Monday morning, scheduled for 1 hour at 10:00. We have to reboot a few nodes in the storage cluster because they are running the wrong version of a storage driver, which is unstable. The impact will probably be small; apologies for the short notice. We would have liked to plan this further in advance, but it is important to reduce the risk of problems. [Alessandro: we don't see a need to drain the queues.]
  • RAL - 1) at risk this weekend as the building housing network equipment is having power work done; 2) on the LHCb query about CVMFS: we had a meeting with the local LHCb representative and decided to leave CVMFS in place over the weekend. [Roberto: very happy.]
  • ASGC - ntr
  • PIC - nta
  • OSG - ntr

  • CERN Grid - investigating a problem with an update in pre-production that causes some grid jobs to fail; it affects only the pre-production batch nodes.

AOB:

-- JamieShiers - 31-Mar-2011
