Week of 101206

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local( Jamie, MariaG, Massimo, Juan Manuel, Ale, MariaDZ, Jacek, Edward, Maarten, Roberto, Simone, Dirk);remote( Michael/BNL, Jon/FNAL, Felice/CNAF, Xavier/KIT, Kyle/OSG, Rolf/IN2P3, Ron/NL-T1, John/RAL, Christian/NDGF, Suijian/ASGC).

Experiments round table:

  • ATLAS reports - Alessandro
    • LHC
      • Last fill 1540 in stable beams since 04:00
      • Last refill of the year in the morning (10:00-11:00)
        • Dumped by 18:00
    • T0 & Central Services:
      • DB node atlr4 rebooted at ~04:30 on Saturday. Services were not affected. During the past week atlr4 has rebooted roughly every 2 days at 04:30.
        • Diagnosis of the atlr4 rebooting issue?
          • Jacek: DB log rotation seems to trigger the problem, but it is not yet understood how, nor why only this node is affected. There has been no reboot since log rotation was switched off.
      • New pilots (version released on Friday at 13:00) had 2 problems: generic jobs using direct access were not working properly due to a problem with the PandaMover, and ANALY_CERN was failing due to its special copy tool setup. Both issues were fixed on Saturday.
    • T1:
      • SARA-MATRIX LFC registration errors: ALARM GGUS:65019 submitted on Sunday at 19:19 UTC.
        • No answer until Monday 08:14 UTC. The problem was fixed at 10:10 UTC.
        • SARA ALARM ticket: the problem was solved quickly once picked up, but the response time was ~12 hours (GGUS:65019).
        • Ron: the ticket was seen - people came in today and fixed the problem.
      • NDGF-T1 file staging timeouts GGUS:65031
      • RAL CASTOR upgrade: UK production and analysis are in 'brokeroff' state
      • DDM Site Services upgrade: we start today with the DE cloud.

  • CMS reports -
    • T1 sites
      • [SAM tests] The CE sft-job test was failing in RAL during the weekend. The failing test was running as user ‘cmssgm’, which is only allowed to have one job at a time. One job had been stuck for a few days, preventing other cmssgm jobs from running. Savannah:118201.
  • ALICE reports - Maarten
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK: ALARM GGUS ticket 65015. Yesterday none of the CREAM-CEs were working; they did not accept submission requests. Since Saturday night there have been problems with the batch server. Latest news: a reset of the whole cluster is needed.
    • T2 sites
      • Usual operations

  • LHCb reports - Roberto
    • Experiment activities: Reprocessing going at full steam, more than 80% done. CNAF and CERN have almost finished; IN2P3 and RAL are about to finish. Merging is running smoothly in parallel almost everywhere.
    • T0
      • Despite the 2 new disk servers added last week, this morning we overwhelmed LHCbDST again. We banned the SE for writing until the number of active transfers decreased. The root cause of the problem is not yet clear, but it looks like when gfal gives up on a transfer there is no corresponding SRM abort internally in CASTOR. The transfer request then remains active in the LSF queue, and when it is finally scheduled it takes 3 minutes to realize that the client has gone before it eventually aborts. This feeds the snowball effect. Sebastien is looking at it.
    • T1 sites:
      • SARA: Problem grabbing computing resources (GGUS:65407). The site rank becomes strongly unattractive; then all of a sudden the site starts running jobs, but then (due to fair share) the site rejects any further jobs, which pile up in the local queue to be scheduled the next day. We need to run almost 12K reconstruction jobs there and at this pace it will take 2 weeks.

Sites / Services round table:

  • Michael/BNL - ntr
  • Jon/FNAL - ntr
  • Felice/CNAF - ntr
  • Xavier/KIT - problem on Saturday with a network adapter for NFS which impacted the job submission cluster. PBS was reinstalled but the jobs in the queue could not be kept (they were cancelled). Now all fine again. Planned at-risk on Wed 8-12 UTC - new network component and firmware upgrades.

  • Rolf/IN2P3 - ntr
  • Ron/NL-T1 - LFC error due to "out of table space" in the DB. The tablespace has been extended, and a Nagios plugin is being worked on to give an advance warning (a minimal illustrative sketch of such a check follows this round table). Fair-share problem for LHCb: looking at the problem now and will update GGUS. Also, one pool node had a h/w problem this morning - this will be fixed tomorrow.
  • John/RAL - castor atlas upgrade is going well.
  • Christian/NDGF - looking into GGUS ticket from ATLAS
  • Kyle/OSG - CERN timeout adjusted - looks ok
  • ASGC - ntr

  • Juan Manuel/CERN: FTS was shown as grey for some period during the monitoring migration that took place. Reminder: two power tests on 7 and 9 Dec, 7:30-17:00 (at risk).
  • Massimo/CERN: planned downtime on Wed - CASTOR upgrade to 2.1.10 (PUBLIC) + 10.2.0.5 DB upgrade. No direct impact on other VOs, but the SAM tests run on public.
    • Simone: public upgrade: it is odd that CERN is shown as affected by this downtime even though the experiments are not affected. This information is also propagated to the experiment blacklisting and needs special treatment so that the site can be used. We should review the situation and fix the root cause of this inconsistency.
    • Massimo: did not have much time to address this yet. It needs some discussion between the experiments and the owners of the ops tests.
    • Maria: who owns the ops tests anyway? This needs to be clarified.
    • Maarten: just ignore the ops tests for your blacklisting.
    • Ale: maybe not that easy, as an ops test failure affects all instances.
    • Rolf/IN2P3: also a problem for dCache - e.g. when only ATLAS pools are affected.
    • -> Needs offline follow-up and a change proposal.
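Regarding the NL-T1 "out of table space" LFC item above: the following is a minimal sketch of the kind of Nagios check mentioned there, under the assumption of an Oracle-backed LFC and the cx_Oracle Python module on the monitoring host; the connection details and thresholds are placeholders, not the actual NL-T1 configuration.

# Minimal Nagios-style tablespace check (sketch; placeholder DSN, credentials and thresholds).
# Exit code 0/1/2 maps to OK/WARNING/CRITICAL as Nagios expects.
import sys
import cx_Oracle  # assumed to be installed on the monitoring host

WARN, CRIT = 85.0, 95.0   # percent-used thresholds (placeholders)

QUERY = """
SELECT df.tablespace_name,
       ROUND((df.bytes - NVL(fs.bytes, 0)) / df.bytes * 100, 1) AS pct_used
FROM   (SELECT tablespace_name, SUM(bytes) AS bytes
        FROM dba_data_files GROUP BY tablespace_name) df
LEFT JOIN
       (SELECT tablespace_name, SUM(bytes) AS bytes
        FROM dba_free_space GROUP BY tablespace_name) fs
ON     df.tablespace_name = fs.tablespace_name
"""

def main() -> int:
    # Placeholder connection: a read-only monitoring account on the LFC database.
    conn = cx_Oracle.connect("monitor_user", "secret", "lfc-db-host/LFCDB")
    rows = conn.cursor().execute(QUERY).fetchall()
    conn.close()
    worst = max(pct for _, pct in rows)
    summary = ", ".join(f"{name}={pct}%" for name, pct in rows)
    if worst >= CRIT:
        print(f"CRITICAL - tablespace usage: {summary}")
        return 2
    if worst >= WARN:
        print(f"WARNING - tablespace usage: {summary}")
        return 1
    print(f"OK - tablespace usage: {summary}")
    return 0

if __name__ == "__main__":
    sys.exit(main())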

AOB: (MariaDZ) The GGUS host certificate will change on Thu Dec 9th at 08:00 UTC. The alarm process should not be affected, as a different certificate is used to sign the email notifications of ALARM tickets. Nevertheless, GGUS tickets will be opened to notify each Tier1 of this change.

Tuesday:

Attendance: local(Gavin, Manuel, Jerome, Peter, Ignacio, Harry, Ale, Roberto, Lola, Edward, Luca, Jamie, Maarten, Simone, MariaDZ, MariaG, Andrea, Dirk);remote(Michael/BNL, Jon/FNAL, Felice/CNAF, Ronald/NL-T1, Gareth/RAL, Kyle/OSG, Rolf/IN2P3, Christian/NDGF, Stephen/CMS).

Experiments round table:

  • ATLAS reports - Peter
    • T1 issues
      • RAL testing of the CASTOR upgrade postponed due to a network outage; the UK cloud is offline whilst the LFC is unavailable
      • NDGF-T1 file staging timeouts GGUS:65013

  • CMS reports - Stephen
    • Experiment activity
      • End of 2010 run yesterday
    • CERN and Tier0
      • Some issues with transfers of the last data from P5 to IT due to CMS people working on the system; processing should finish within the next day
    • Tier1 issues and plans
      • Not much going on, running backfill at T1's
    • Tier-2 Issues
      • no noteworthy T2 issues, files get lost here and there, disk pools come and go... usual things. Site admins always very responsive to those problems, thanks.

  • ALICE reports - Lola
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK: GGUS ticket 65015. The CEs have been working again since yesterday afternoon and submission has restarted. Ticket closed.
      • RAL: one VOBOX was not working this morning (the services were stuck) and the other one was unreachable. The news we got from the site is that it is completely isolated due to a major network problem.
    • T2 sites
      • Usual operations

  • LHCb reports - Roberto
    • Experiment activities: Reprocessing now mainly running at SARA (a backlog formed because of the fair-share (FS) issue)
    • T0
      • Found the reason for the failures with LHCbDST: the key problem is a mismatch between the resource scheduling and polling protocols. Idle requests left in LSF prevent further requests from getting a slot. This is triggered when the disk server gets overloaded by too many requests (exhausting the slots). The only viable workaround is more h/w than would otherwise be reasonable, with extra slots to compensate for the "unused" ones. I think a detailed SIR is needed. (An illustrative sketch of the slot-exhaustion mechanism is given after this list.)
    • T1 site issues:
      • SARA: Problem grabbing computing resources (GGUS:65407). It was a fair-share issue (the share was very small for our largest T1), which prevented us from running constantly at the load that LHCb would expect. We also added NIKHEF to the abstract definition of NL-T1 to drain the backlog of 15K jobs that had formed, but we also hit a glitch of PBS there (4 hours last night). Ron agreed to tune the fair share on the SARA batch farm to speed up draining this backlog.
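To illustrate the slot-exhaustion mechanism described in the T0 item above, here is a small, purely illustrative Python model. All numbers (slot count, client timeout, transfer duration) are made-up assumptions, not CASTOR or LSF parameters: once the queueing delay exceeds the client timeout, abandoned requests still occupy disk-server slots for the full dead-client detection time, which reproduces the snowball effect reported above.

# Illustrative model of the LHCbDST slot-exhaustion ("snowball") effect: requests whose
# client has already given up still consume a slot for the dead-client detection time.
SLOTS = 10                 # concurrent transfer slots on the disk server (assumed)
CLIENT_TIMEOUT = 180       # seconds before the client (gfal) gives up (assumed)
DEAD_CLIENT_COST = 180     # seconds a scheduled request needs to notice the client is gone
LIVE_TRANSFER_COST = 60    # seconds for a healthy transfer (assumed)

def simulate(n_requests: int, arrival_spacing: float) -> float:
    """Return the fraction of slot time spent on requests whose client had already gone."""
    wasted = busy = 0.0
    slot_free_at = [0.0] * SLOTS            # time at which each slot next becomes free
    for i in range(n_requests):
        submitted = i * arrival_spacing
        slot = min(range(SLOTS), key=lambda s: slot_free_at[s])
        start = max(submitted, slot_free_at[slot])
        if start - submitted > CLIENT_TIMEOUT:
            cost = DEAD_CLIENT_COST          # "ghost" request: the client gave up long ago
            wasted += cost
        else:
            cost = LIVE_TRANSFER_COST        # healthy transfer
        slot_free_at[slot] = start + cost
        busy += cost
    return wasted / busy

print("wasted slot fraction, light load:", round(simulate(200, 10.0), 2))
print("wasted slot fraction, heavy load:", round(simulate(200, 1.0), 2))

Under light load the model wastes no slot time; under heavy load most of the slot time ends up being spent on requests whose client has already disappeared, exactly the effect that extra hardware and extra slots can only paper over.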

Sites / Services round table:

  • Michael/BNL - ntr
  • Jon/FNAL - ntr
  • Felice/CNAF - ntr - tomorrow will be a holiday in Italy
  • Ronald/NL-T1 - ntr
  • Gareth/RAL - The RAL Tier1 was unavailable from 06:30 to 12:15 local time (=UTC) today owing to a network problem. This was traced to a problem with the Site Access Router at RAL. Yesterday we announced (via a broadcast) a problem with our callout system that would have delayed our response to any alarm tickets. This was resolved early yesterday evening. The upgrade of the ATLAS CASTOR instance to version 2.1.9 is going well, although testing has been delayed by the networking problem.
  • Rolf/IN2P3 - There will be an outage of IN2P3-CC / IN2P3-CC-T2 on December 14th. Services impacted: HPSS, xrootd; FTS, LFC (except lfc-lhcb-ro), VOMS (R/O access possible), other Oracle based services like Operations portal. This means in particular that downtime notifications for any grid site of WLCG / EGI which normally would occur during the outage of the portal will be delayed. However, the operations dashboard used by the operators of EGI and the various NGIs will stay available. The downtime declaration will be done as usual.
  • Christian/NDGF - at risk tomorrow - some srm pools (atlas data) may be unavailable
  • Kyle/OSG - the BDII timeout is back to 30 sec; so far no problems. On Thu at 08:00 UTC (2am local) GGUS will switch certificates: asked if that could be done during OSG working hours, e.g. at 14:00 UTC. MariaDZ will check with the developers and report back tomorrow.
  • Luca/CERN: upgrading the ATLAS archive DB to 10.2.0.5 on Wednesday.
  • Gavin/CERN: reminder: will retire SLC4 batch services next week
  • Ignacio/CERN: castor 2.1.10 + DB upgrade for public tomorrow

AOB:

Wednesday

Attendance: local(Carles, Eddie, Ignacio, Lola, Luca, Maarten, Manuel, Maria D, Peter);remote(Dimitri, John, Jon, Kyle, Michael, Onno, Renato, Rolf, Stephen).

Experiments round table:

  • ATLAS reports -
    • T0 / Central issues
      • LFC and srm-atlas outage observed this morning GGUS:65121
        • Ignacio: srm-atlas issue due to castor-public upgrade (coupled through SAM tests)
        • LFC debugged afterwards - seems the same problem as reported for RAL below
    • T1 issues
      • RAL castor upgrade went smoothly from ATLAS perspective, thanks!
      • Peter: LFC ping failures observed at RAL, gone later
        • Debugged afterwards: the problems started after the upgrade of the vo.racf.bnl.gov VOMS server certificate on Tuesday evening - that server was not properly supported by the LFC at RAL (a minimal sketch for inspecting the new server certificate follows this list)
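As a purely illustrative aid for the VOMS certificate issue above (not an official debugging procedure), the sketch below fetches the certificate currently presented by vo.racf.bnl.gov and prints its subject, issuer and validity dates, which is the information an administrator would check against the locally installed VOMS server certificates. The port number is an assumption, and a VOMS server that requires a client certificate during the handshake may reject this plain connection; in that case openssl s_client with a grid proxy could be used instead.

# Sketch: retrieve and decode the certificate presented by the VOMS server.
# The port (8443) is an assumption, not taken from the report above.
import ssl
import subprocess

HOST, PORT = "vo.racf.bnl.gov", 8443

# Fetch the server certificate in PEM form over a plain TLS handshake.
pem = ssl.get_server_certificate((HOST, PORT))

# Decode subject, issuer and validity dates with the openssl CLI.
out = subprocess.run(
    ["openssl", "x509", "-noout", "-subject", "-issuer", "-dates"],
    input=pem, capture_output=True, text=True, check=True,
)
print(out.stdout)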

  • CMS reports -
    • Experiment activity
      • Shutdown activities
    • CERN and Tier0
      • One PromptReco job still running (but looks like it will fail).
      • The CMS CVS migration blocked a lot of work, including the JobRobots, from yesterday afternoon till sometime this morning (still not working offsite due to DNS caching). It is not clear how this was coordinated; it will be followed up.
        • Manuel: AFAIK there was some scheduled intervention/migration announced, will look into it
        • Maarten: CMS to open a ticket as needed
      • Stephen: the SSO upgrade at CERN 1.5 months ago broke support for non-CERN certificates; this was fixed yesterday
        • Jon: who fixed it, CERN or DOEGrids?
        • Stephen: CERN. UK certificates were also affected, but fixed earlier
    • Tier1 issues and plans
      • Rereco in process.
    • Tier-2 Issues
      • Nothing to report.

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities: Reprocessing now mainly running at SARA (a backlog formed because of the fair-share issue)
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • Merging is over at CERN.
      • T1 site issues:
        • SARA/NIKHEF: reprocessing jobs proceeding smoothly but still 10K remaining. Jobs are competing with MC simulation due to some internal configuration in DIRAC. Addressed.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NLT1
    • Tue Dec 14 the Oracle DB will be moved: Conditions DB, LFC and FTS will be down
  • OSG
    • GGUS host certificate change: at which time?
      • Thu Dec 9 at 14:00 UTC, see AOB
  • RAL
    • CASTOR upgrade for ATLAS finished early today

  • CASTOR
    • castor-public upgrade to 2.1.9-10 went OK
  • dashboards - ntr
  • databases - ntr
  • GGUS
    • see AOB
  • grid services
    • Manuel: 5 LCG-CEs were retired, 2 are being drained, see AOB
    • Maarten: SAM does not consider those 2 to be in maintenance - the downtime notification might have been lost due to the network outage at RAL yesterday
  • networks
    • message from John Shade sent after the meeting: Reduced capacity on CERN-BNL link this weekend

AOB:

  • Juan Manuel/CERN: Last Monday CE103, CE104, CE105, CE106 and CE107 were retired (after one week of draining). CE112 and CE113 were put into draining mode yesterday afternoon. They will be migrated to SLC5 submission when drained, scheduled for next Monday.
  • MariaDZ: The OSG request to move the new GGUS host certificate change to 14:00 UTC was accepted, as per https://savannah.cern.ch/support/?118146#comment4. Nevertheless, this announcement was sent to the GGUS interface developers e-group on 2010/12/02, and an earlier request for a time change would have been appreciated. Should we put more OSG members in the e-group for backup purposes? Apologies for not having thought of the timezones in the first place.

Thursday

Attendance: local( Peter, Simone, Ale, Manuel, David, Harry, Roberto, Lola, Jacek, Giuseppe, Ignacio, Stephen, Maarten, Dirk);remote(Gonzalo/PIC, Jon/FNAL, Rolf/IN2P3, Gareth/RAL, Felice/CNAF, Ronald/NL-T1, Foued/KIT, Christian/NDGF, ASGC).

Experiments round table:

  • ATLAS reports - Peter
    • T0 / Central issues
      • LFC at CERN needs attention with respect to BNL cert update GGUS:65121
    • T1 issues
      • RAL saw the same VOMS issue and fixed it yesterday; all sites please check GGUS:65121

  • CMS reports - Stephen
    • Experiment activity
      • Shutdown activities
    • CERN and Tier0
      • The last PromptReco job failed. It turns out to be due to corruption of the RAW data on a WN; the data is being regenerated, see GGUS:65130
    • Tier1 issues and plans
      • Rereco in process.
    • Tier-2 Issues
      • Nothing to report.

  • ALICE reports - Lola
    • T0 site
      • Possibly strange behavior has been observed on one of the CREAM-CEs (ce201). We are keeping track of it and gathering statistics to conclude whether it is normal or a misbehavior
    • T1 sites
      • Nothing to report
    • T2 sites
      • A couple of T2s, Hiroshima and Cyfronet, are back in production after downtimes

  • LHCb reports - Roberto
    • Experiment activities: 6K jobs remaining to complete the reprocessing (NL-T1 still has 5500 jobs in the pipeline but is draining very smoothly). User dark-area clean-up ongoing; CERN is still missing.
    • T0
      • Requested the exhaustive list of user files (GGUS:65036). Not having it prevents us from finishing the user dark-area clean-up in due time.
      • Received the SIR from the CASTOR developers about the shortage of the lhcbdst service class.
    • T1 site issues:
      • NTR

Sites / Services round table:

  • Gonzalo/PIC - ntr
  • Jon/FNAL - the circuit between FNAL and KIT was down yesterday for 6h; the trouble was on the New York - Amsterdam leg
  • Rolf/IN2P3 - ntr
  • Gareth/RAL - reported a disk server (CMS) unavailable this morning - it will come back now
  • Felice/CNAF - had a problem with network interface in a gridftp server - now back in production
  • Ronald/NL-T1 - small defect in tape robot made some tapes briefly unavailable - defective part is now replaced
  • Foued/KIT - ntr
  • Christian/NDGF - ntr
  • ASGC - ntr
  • Ignacio/CERN - small hiccup after the CASTOR public DB upgrade: fixed quickly by the DBAs refreshing the DB statistics.

AOB:
Frontier/Squid

  • The new release of the frontier-squid rpms, version 2.7.STABLE9-5.1, fixes a vulnerability identified by Alessandra Forti and includes the latest release of the squid distribution made available by Dave Dykstra. More info about these rpms can be found here: https://twiki.cern.ch/twiki/bin/view/PDBService/SquidRPMsTier1andTier2
  • The frontier-tomcat-1.0-8 rpm configures log rotation for the Tomcat catalina.out and includes some performance tuning for ATLAS sites. More details can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGFrontier
  • Frontier servers, squid launchpads and squid servers at CERN have been updated to these latest releases (a minimal version-check sketch for sites follows below).
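As a convenience (not part of the official instructions on the twiki pages above), a site admin could verify the locally installed package versions with a small check like the one below. It assumes an RPM-based node with the rpm CLI available, and it does a plain string comparison against the versions quoted in this report rather than a full RPM version comparison.

# Minimal sketch: compare installed frontier-squid / frontier-tomcat versions with
# the versions mentioned in the report above. Assumes an RPM-based system.
import subprocess

TARGETS = {
    "frontier-squid": "2.7.STABLE9-5.1",
    "frontier-tomcat": "1.0-8",
}

for pkg, wanted in TARGETS.items():
    # `rpm -q --qf` prints the installed version-release, or fails if not installed.
    result = subprocess.run(
        ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}", pkg],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"{pkg}: not installed")
    else:
        installed = result.stdout.strip()
        status = "OK" if installed == wanted else f"differs from expected {wanted}"
        print(f"{pkg}: installed {installed} ({status})")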

Friday

Attendance: local(Stephen, Roberto, Simone, Ignacio, Massimo, David, Harry, MariaDZ, Maarten, Ale, Dirk);remote(Michael/BNL, Jon/FNAL, Gonzalo/PIC, Rolf/IN2P3, Christina/CNAF, Suijan/ASGC, John/RAL, Tore/NDGF, Kyle/OSG, Xavier/KIT, Onno/NL-T1).

Experiments round table:

  • ATLAS reports - Simone
    • [ Dirk: These are my notes from the discussion - the official ATLAS report may replace those when it arrives ]
    • First-pass reconstruction (HI) is ongoing - should be done on Monday.
    • DB intervention to fix problems serving PanDA: the status of 50k jobs got lost and will be cleaned up now.
    • Advance warning for Jan 17-18: the plan is to split the ATLR DB into two DBs depending on the applications. This will be a complex operation and will cause downtime for all ATLAS activities for up to two days. A more detailed notification will follow via the usual channels.

  • CMS reports - Stephen
    • Experiment activity
      • Shutdown activities
    • CERN and Tier0
      • Have requested incident report on CVS migration. Old repository was still getting updates during migration.
      • Tier-0 Idle
    • Tier1 issues and plans
      • Rereco in process.
    • Tier-2 Issues
      • Nothing to report.

  • ALICE reports - Maarten
    • T0 site
      • Downtime scheduled for the deployment of the new AliEn v2.19 on Wednesday. It will take three days and we will start with the Central Services and a couple of T2s.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Roberto
    • Experiment activities: 3K jobs remaining in total to complete the reprocessing. Proceeding smoothly.
    • T0
      • NTR
    • T1 site issues:
      • NIKHEF was flooded with direct CREAM submissions due to a problem on the DIRAC side.

Sites / Services round table:

  • Michael/BNL - there will be a network intervention at the weekend (h/w relocation done by the provider): for 8h one 10 Gb/s link between BNL and CERN will not be operational. The second link will take all traffic (up to 8.5 Gb/s). No impact on experiment activities is expected.
  • Jon/FNAL - yesterday attempted a DNS upgrade for the CMS system with only partial success. The intervention was rolled back after 15 minutes - all fine now. Now working with the company to find out the reasons for the upgrade problem.
  • Gonzalo/PIC - scheduled intervention on 15th Dec (one day): the main reason is a migration of the h/w for PNFS.
  • Rolf/IN2P3 - update on the LHCb job setup problem: a hyper-threading problem is suspected and a reference will be taken from jobs running at CERN. The site is in contact with the CERN team to get local submission rights for this test.
  • Christina/CNAF - reminder: next week, on 14th Dec, 1h downtime for a StoRM upgrade.
  • Suijan/ASGC - ntr
  • John/RAL - power outage expected in old data center - should not affect any experiment services (at risk)
  • Tore/NDGF - ntr
  • Xavier/KIT - ntr
  • Onno/NL-T1 - ntr
  • Kyle/OSG - switched to new GGUS certs - went well

AOB:

  • The experiments are asked to comment on an issue observed in the FTS behavior (GGUS:65151): it does not transfer empty files - is that OK? (An illustrative sketch of the behavior follows this discussion.)
  • Simone, Ale: OK for ATLAS (and ATLAS would be against changing the current behavior)
  • Stephen: not OK for CMS - existence of a file (even empty) may carry information
  • Jon: FNAL would actually delete 0-byte files anyway
  • Maarten: will fix the current misleading error message in FTS, but not the current behavior
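For illustration of the behavior discussed above (GGUS:65151), the sketch below shows how a submission workflow that must preserve empty files (the CMS concern) could separate zero-byte source files before handing the rest to FTS. The file names and the handle_empty() helper are hypothetical, and real transfers would use SRM/gridFTP URLs and the FTS client rather than local paths; this only models the reported "zero-byte files are not transferred" behavior.

# Sketch: split a file list into FTS-transferable files and zero-byte files that,
# per GGUS:65151, FTS would not transfer and therefore need separate handling.
import os
import tempfile

def split_by_size(paths):
    """Separate non-empty files from zero-byte files."""
    transferable, empty = [], []
    for path in paths:
        (empty if os.path.getsize(path) == 0 else transferable).append(path)
    return transferable, empty

def handle_empty(path):
    # Hypothetical placeholder: e.g. record the empty file in a catalogue,
    # or recreate it directly at the destination outside FTS.
    print(f"would handle empty file separately: {path}")

if __name__ == "__main__":
    # Create two throwaway example files (one empty) so the sketch runs standalone.
    tmpdir = tempfile.mkdtemp()
    full_file = os.path.join(tmpdir, "run1234_raw.dat")
    empty_file = os.path.join(tmpdir, "run1234_index.dat")
    with open(full_file, "w") as f:
        f.write("payload")
    open(empty_file, "w").close()

    to_transfer, empties = split_by_size([full_file, empty_file])
    print("submit to FTS:", to_transfer)
    for path in empties:
        handle_empty(path)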

-- JamieShiers - 03-Dec-2010
