Week of 100809

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Christine, Ho, Cedric, Ullrich, Luca, Edward, Lola, Harry, Ignacio, Ale, Maarten, Dirk);remote(Barbara/CNAF, Gang/ASGC, Catalin/FNAL, Marc/IN2P3, Xavier/KIT, Michael/BNL, Tiju/RAL, Vladimir/LHCb, Rob/OSG, Pepe/PIC).

Experiments round table:

  • ATLAS reports - Ho
    • Tier-0:
      • A couple of stable runs as well as a couple of issues with the LHC (beam loss, cryo problem) and the detector during the weekend. Reached the first pb^-1 of integrated luminosity.
      • eLog system unavailable from Saturday morning. Reported to elog support right after the problem was found; fixed on Sunday morning. The system is critical for ADC shifters to communicate and hand over issues, so ATLAS would ask for 7-day support (no need for night support).
      • The T0 data processing system was stuck from Saturday afternoon; noticed and fixed by an expert by Sunday morning.
    • Tier-1s:
      • started replicating extra ESD copies to 4 T1 sites last Friday (100% to FZK, IN2P3-CC, NL-T1; 30% to CNAF); transfers went through the T1-T1 channels. Observed a certain delay in the transfers from BNL, possibly related to high load at BNL ... to be followed up.
      • some transfer errors from BNL to US T2s (SWT2_CPB and MWT2_UC) and to NDGF. The transfer error to MWT2_UC was clarified to be a Tier-2 issue; however, later on the shifter updated the ticket with the errors "Pinning failed: finding read pool failed" and "file locality unavailable" at BNL (GGUS:60980).
      • "SRM Aborted" error in the T0 export to ASGC. This is an old issue (GGUS:60740) that had been claimed to be transient, but it has happened again since this morning. Alarm ticket issued at 06:45 UTC this morning. No response from the site so far - did the site receive the alarm? (GGUS:60983)
      • still waiting for RAL MCDISK
    • Tier-2/3s: nothing to report
    • Harry: for e-logger issues please use an alarm ticket, which gets dispatched also at weekends.
    • Michael: no load issue seen for the T1 transfers - the nature of the delay is not clear yet. Several different transfer issues for T2s: the problem with the mid-west site has been traced to a storage issue there. Re-opening the existing ticket may be misleading as the ticket information may not correspond to the recent issue.
    • Brian: work on the MCDISK issue will be presented at a different meeting today, but the transfer of the important files is being expedited.
    • Gang: received the alarm ticket - the problem was due to an overload condition which has since disappeared, and the files have been transferred. ATLAS will check again.

  • CMS reports - Christine
    • Tier-0
      • Normal operation. Data taking over the weekend.
      • Tail of skims ongoing
    • Tier-1s
    • Tier-2s
      • [OPEN] GGUS ticket 60855 about transfers between the Beijing T2 and CCIN2P3, still open
    • Notes: MC production with pile-up should start soon.
    • Christine: CMS also had e-logger problems over the weekend and contacted the helpdesk. Harry: please use an alarm ticket, as helpdesk tickets are only picked up on Monday.
    • CMS observed some problems with the CIC portal.

  • ALICE reports - Lola
    • Production: running 22K jobs
    • T0 site
      • Nothing to report
    • T1 sites
      • GGUS:60970 submitted Saturday morning. The CREAM CE cream-3-fzk.gridka.de does not accept proxy delegation.
    • T2 sites
      • Small issues found at some T2 sites.

  • LHCb reports -
    • Experiment activities:
      • Data during weekend and today. Reconstruction and merging. No MC.
    • GGUS (or RT) tickets:
      • T0: 0
      • T1: 1
      • T2: 1
    • T0 site issues:
    • T1 site issues:
      • GRIDKA: users reporting timeouts when accessing data. Looks like a load issue with dCache pools (GGUS:60887). No problems today.
      • RAL: after the recovery of the faulty disk server for the LHCb_USER space token, users reported problems with file access during the weekend. No problems today.
    • T2 site issues:
      • nothing

Sites / Services round table:

  • Barbara/CNAF - FTS intervention finished successfully.
  • Gang/ASGC - yesterday afternoon one squid server was down for 2h, fixed by restarting. Some CMS production jobs failed at the ASGC T1 and T2 due to a wrong workflow defined by cmsdataops.
  • Catalin/FNAL - during the weekend problems due to WM agent testing (late Friday) - now resolved.
  • Marc/IN2P3 - two tickets updated. GGUS:58646 (transfer problem with BNL): since this morning a new network configuration is in place on the FTP servers - transfer speed improved by a factor of 5. Beijing transfer problems: the network team is awaiting an answer from GEANT about a problematic network segment (ticket 5966).
  • Xavier/KIT - glitch in one disk cluster - some ATLAS disk-only pools were down for 5 min. CREAM node 2 is down in a scheduled downtime; nodes 1 and 3 are up and the ticket should be updated soon. Observed short-term instability in GGUS from 11:00-11:30 (loss of DB contact).
  • Michael/BNL - one pool failing due to a Java I/O error (dCache issue)
  • Tiju/RAL - ntr
  • Pepe/PIC - cooling issue affecting worker nodes - stopped some WNs until the cooling is fixed; about 30% of the WNs remain in production.
  • Rob/OSG - on Friday afternoon all OSG information in the CERN BDII was stale by several hours, as the update processes had stopped. The problem was fixed a few hours later, but OSG would like more details on this issue (GGUS:60927). Ullrich will follow up.

  • Luca/CERN: 2 nodes of the ATLAS offline DB had to be rebooted (10-11)
  • Ullrich/CERN: a glitch with the CREAM log file size was detected and resolved before affecting experiments. This morning production CE support for the ILC VO was finalized.
  • Ignacio/CERN: detected a problem in the Lemon downtime display - fixed by the Lemon developers. Next technical stop (30th): proposing a transparent intervention on the CASTOR name server to add additional indexes for CASTOR 2.1.9-8. Ale: will discuss the proposal inside ATLAS.

AOB:

Tuesday:

Attendance: local(Harry(chair), Ricardo, Ulrich, Marie-Christine, Edward, Cedric, Nilo, Luca, Maarten, Alessandro);remote(Barbara(INFN), Michael(BNL), Gang(ASGC), Angela(KIT), Ronald(NL-T1), John(RAL), Jeremy(GridPP), Thomas(NDGF), Pepe(PIC), Catalin(FNAL), Rob(OSG), Vladimir(LHCb)).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues: FZK-NDGF transfer issue GGUS:60437 (updated 2010-08-04)
    • Tier-0:
      • Problem with PVSS2Cool yesterday around noon, probably due to the reboot of 2 nodes of the ATLAS offline DB mentioned yesterday. Luca reported there is a known problem in COOL not supporting the reboot of nodes.
      • Issue with files at T0 not registered in the catalogs (LFC/DQ2). Tracked back to a DQ2 bug. Will be fixed.
    • Tier-1:
      • ASGC: the problem reported yesterday ("SRM Aborted") effectively disappeared. Alarm ticket 60983 probably didn't reach the right people. From Jhen-Wei: "asgc-t1@lists.grid.sinica.edu.tw is registered as 'Emergency email' in GOCDB and just a few persons are listed there (only I belong to T1 operation). I am updating the contact information in GOCDB https://goc.gridops.org/site/list?id=22." Propose to submit a test alarm. It was agreed this would be done tomorrow (Wednesday) at about 9 am Geneva time (CEST).
      • Yesterday transfers from BNL to CNAF were running slowly without obvious explanation. Barbara reported they could not find any network problems at CNAF and will copy to Michael the relevant FTS logs.
    • Tier-2: UKI-SOUTHGRID-OX-HEP back after fixing ticket 60969

  • CMS reports -
    • Central infrastructure: no objection to the date of the announced CERN intervention (30th August) on the CASTOR name server to add additional indexes. Maarten queried the ATLAS position - they would like more information on the timing and risk of the intervention, as they may have a backlog of data on a Monday morning. The CASTOR team is to be contacted.
    • Experiment activity: luminosity in the last 24 hours (11:00 am to 11:00 am): delivered 42.2 nb-1; recorded luminosity still being checked.
    • Tier1/2 Issues: nothing to report
    • MC production: ongoing MC production with pile-up
    • AOB: still some problems in accessing cmsweb since the front-end upgrade, maybe due to not well understood security (certificates) interactions between Firefox and the PhEDEx web pages. Reported in Savannah ticket https://savannah.cern.ch/support/?116081

  • ALICE reports -
    • Production: several analysis trains and an MC cycle ongoing
    • T1 sites: GGUS:60970 submitted Saturday morning. The CREAM CE cream-3-fzk.gridka.de does not accept proxy delegation. Status: SOLVED
    • T2 sites: small issues found at some T2 sites.

  • LHCb reports -
    • Experiment activities: no data due to a power cut in the pit and corruption of the LHCBONR database. Reconstruction and merging. No MC. Luca reported the power outage lasted from 10 pm till midnight. Recovery of LHCBONR from the local 'compressed on disk' copy was tried first but took too long; a computer centre copy was then used and completed by 2 pm, though configuration is still ongoing.

Sites / Services round table:

  • BNL: Michael restated the request for site access to FTS logs at other sites in order to help analyse their frequent file transfer performance issues. Maarten reported that this has been accepted as a feature request by the FTS developers, but with no date yet.

  • ASGC: Expecting a new test alarm ticket tomorrow morning CEST.

  • KIT: the third CREAM-CE is back in production. From time to time, roughly 1 attempt in 2, ALICE authorisation to their LCG-CE fails. Maarten suggested looking at the CERN VOMS service, where there are two load-balanced front-ends (a minimal front-end probe sketch follows this round table).

  • RAL: Ongoing disk server issues - gdfs81, part of ATLASDATADISK, is read-only but has a ticket to fix it and put it back in production.

  • NDGF: tomorrow from 10:00 to 12:00 dCache servers will be down for a dCache upgrade.

  • PIC: repairs following yesterday's cooling problems were completed this morning and all worker nodes should be back by about 4 pm today.

  • OSG: 1) What was the CERN BDII problem of last Friday? Ricardo reported that an upgrade triggered a rare bug, only seen once, which then affected all top-level gLite 3.2 BDIIs. The fix is in a release being prepared for September, but Rob should contact Laurence Field for details on this particular bug. 2) A top-priority ticket for a file transfer issue from Texas was unnecessarily submitted last night after working hours - sites should please be careful about what they flag as top priority.

  • CERN LSF: The failover node of the CERN lsf master was down for 30 minutes this morning for a memory change.
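
On the intermittent ALICE authorisation failures and the two load-balanced VOMS front-ends mentioned in the KIT item above: a roughly 1-in-2 failure rate is the classic signature of one broken node behind a round-robin alias. The Python sketch below is a hypothetical check, not an existing tool; it resolves an alias to all of its A records and probes each with a plain TCP connect. The alias voms.cern.ch and port 15000 are placeholders, not the confirmed CERN VOMS configuration.

```python
#!/usr/bin/env python
# Hypothetical sketch: find the individual front-ends behind a DNS alias and
# check which of them accept a TCP connection. Alias and port are placeholders.
import socket

ALIAS = "voms.cern.ch"   # placeholder service alias
PORT = 15000             # placeholder service port

def front_ends(alias):
    """Return the distinct IPv4 addresses behind a DNS alias."""
    infos = socket.getaddrinfo(alias, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

def probe(ip, port, timeout=5.0):
    """True if a plain TCP connection to ip:port succeeds within the timeout."""
    try:
        sock = socket.create_connection((ip, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False

if __name__ == "__main__":
    for ip in front_ends(ALIAS):
        status = "ok" if probe(ip, PORT) else "FAILED"
        print("%s -> %s: %s" % (ALIAS, ip, status))
```

A TCP probe only shows that a front-end is reachable; a service-level test against each address would still be needed to confirm which node rejects the authorisation.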

AOB:

Wednesday

Attendance: local(Cedric, Ian, Luca, Lola, Edoardo, Ullrich, Ignacio, Maarten, Ale, Ricardo, Dirk);remote(Barbara/CNAF, Michael/BNL, Angela/KIT, Gang/ASGC, John/RAL, Onno/NL-T1, Vladimir/LHCb, Thomas/NDGF, Pepe/PIC, Rob/OSG, Catalin/FNAL).

Experiments round table:

  • ATLAS reports -
    • Tier-0:
      • ~100 gridftp failures for export from CERN around 01:00 AM. Correlated with a high I/O rate from the T0. Waited a bit but the I/O did not decrease (it even went above 6 GB/s). The number of files for the FTS channels was reduced to 1/3 to release the load. Default values were put back this morning.
    • Tier-1:
      • ALARM test ticket submitted to ASGC was quickly acknowledged.
      • RAL: "After copying about 4000 files off gdss417 the disk server crashed again. We believe the file system has been corrupted by the RAID card hardware fault and we no longer consider it practical to attempt to recover any more files. We were able to recover all the high priority files we were given." ~44000 files lost. ~26000 were unique at RAL and will be regenerated if needed. The rest is being recovered from other sites.
      • SAM problem reported yesterday by BNL, due to ldapsearch queries to the OSG BDII taking very long and possibly hitting timeouts. Need to know if it can affect central services (e.g. FTS to BNL) if the central BDII is updated.
    • Tier-2:
      • TR-10-ULAKBIM whitelisted in DDM
    • Ignacio: to further investigate the T0 issue, some example filenames + timestamps would be needed. Ale: the problem is understood on the ATLAS side - the pool is dimensioned too small for this increased load. Maarten: as similar problems have occurred before, should the standard load be reduced to avoid incidents? Ale: tuning knobs to throttle on the ATLAS side exist, but automation is still needed.
    • Maarten summarized the update problems with the OSG BDII: one of the two OSG BDII nodes showed long query times after a rolled-back upgrade attempt. Long query times may cause the update procedure for the WLCG BDII to fail and need further investigation. Rob: spikes have also been seen after previous updates, but only for short durations. Investigation is ongoing.
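
The long ldapsearch query times discussed above are straightforward to reproduce by hand. Below is a minimal, hypothetical Python sketch (not a tool used by the teams) that times a few ldapsearch queries against a top-level BDII; the endpoint lcg-bdii.cern.ch:2170 and the base DN o=grid are assumptions for illustration and should be replaced by the BDII actually under investigation.

```python
#!/usr/bin/env python
# Sketch: time repeated ldapsearch queries against a (assumed) top-level BDII
# to spot the long query times / timeouts mentioned in the minutes.
import subprocess
import time

BDII_URI = "ldap://lcg-bdii.cern.ch:2170"   # assumed top-level BDII endpoint
BASE_DN = "o=grid"                          # conventional GLUE base DN

def timed_query(filter_expr="(objectClass=GlueSE)", attrs=("GlueSEUniqueID",)):
    """Run one anonymous ldapsearch; return (elapsed seconds, entries returned)."""
    cmd = ["ldapsearch", "-x", "-LLL", "-H", BDII_URI, "-b", BASE_DN,
           filter_expr] + list(attrs)
    start = time.time()
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    elapsed = time.time() - start
    entries = sum(1 for line in result.stdout.splitlines() if line.startswith("dn:"))
    return elapsed, entries

if __name__ == "__main__":
    for i in range(5):
        secs, n = timed_query()
        print("query %d: %.1f s, %d entries" % (i + 1, secs, n))
```

Running this against each BDII node individually (rather than the load-balanced alias) would show whether only one of the two nodes is slow.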

  • CMS reports -
    • Central infrastructure
      • No objection to the date of the announced CERN intervention (30th August) on the CASTOR name server to add additional indexes. If there is any risk, Tuesday the 31st would be safer.
      • Ignacio: would like to start early, as other interventions are planned for later days and the last days should be left as contingency in case of problems.
      • [from discussions after the meeting]: for future intervention planning, suggest discussing whether the start of a machine stop (when the experiment production pipeline is still full) is really the best time to start service interventions. Allowing some time to drain the production queues might lower the risks.
    • Experiment activity: Luminosity in the last 24 hours (11:00 am to 11:00 am):
      • Not much lumi
    • Tier1 Issues
      • Two Savannah tickets at RAL both related to MSS and Transfers
        • #116178: Large amount of idle data at T1_UK_RAL
        • #116138: T1_UK_RAL_Buffer to T1_IT_CNAF_buffer: no transfer but queued data
    • John: points out that RAL does not receive notifications for Savannah bugs. Ian: will check whether the communication chain for the above issues is adequate and working. Should a GGUS ticket have been auto-generated? RAL did not receive one.

    • Tier2 Issues: nothing to report
    • MC production: ongoing MC production with pile-up
    • AOB: still some problems in accessing cmsweb since the front-end upgrade, maybe due to not well understood security interactions between Firefox and the PhEDEx web pages. Reported in Savannah ticket https://savannah.cern.ch/support/?116081

  • ALICE reports -
    • Production: Several analysis trains and MC cycle ongoing
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Small issues found at some T2 sites

  • LHCb reports -
    • Experiment activities:
      • No data today due to power cut in pit. Reconstruction and merging. No MC.
    • GGUS (or RT) tickets:
      • T0: 0
      • T1: 3
      • T2: 0
    • T0 site issues:
    • T1 site issues:
      • IN2P3 SRM problem GGUS:61023, solved by restarting the service.
      • IN2P3 SharedArea problem GGUS:61045. SAM tests failed, user jobs failed, software installation problem.
      • CNAF: reconstruction jobs killed by the batch system, GGUS:61048. Under investigation.
      • RAL: most pilots aborted, GGUS:61052.
    • T2 site issues:
    • John: RAL just received the ticket about the pilot failures - will investigate.

Sites / Services round table:

  • Barbara/CNAF - ntr
  • Michael/BNL - ntr
  • Angela/KIT - the CMS SRM will be restarted tomorrow with a new host certificate - short interruption.
  • Gang/ASGC - ntr
  • John/RAL - ntr
  • Onno/NL-T1 - ntr
  • Thomas/NDGF - ntr
  • Pepe/PIC - ntr
  • Catalin/FNAL - ntr
  • Rob/OSG - ntr in addition to the BDII investigation.
  • Ullrich/CERN - 47 batch nodes were temporarily blocked yesterday due to unauthorised user activity - the user has been notified.
  • Luca/CERN: switched the LHCb online DB to the standby node after the power cut. The DB has been restored on the other node in the meantime. Now planning the flip back to normal DB operation and preparing a service incident report.
AOB: The service incident report for the cooling failure in the CERN vault has been received: https://twiki.cern.ch/twiki/bin/view/FIOgroup/513Temp100719 .

Thursday

Attendance: local(Cedric, Jacek, Edward, Harry, Ullrich, Miguel, Lola, Ignacio, Ale, Maarten, Dirk);remote(Barbara/CNAF, Michael/BNL, Gang/ASGC, Ian/CMS, Angela/KIT, Ronald/NL-T1, John/RAL, Jeremy/GridPP, Rob/OSG, Pepe/PIC, Vladimir/LHCb, Catalin/FNAL).

Experiments round table:

  • ATLAS reports - Cedric
    • Quiet day
    • Tier-0:
      • Ticket 61080 submitted for xrdcp problem.
      • Ignacio: problem has been passed to xroot developers at CERN for analysis.
    • Tier-1:
      • CNAF-BNL slow transfers. Some exchanges, but still no clue what is happening. What about an iperf test? (A minimal memory-to-memory throughput sketch follows this report.)
      • Barbara: performance right now seems not bad (the 10 Gb link is filled). Suspect the problems may be due to the MTU settings of a router on the path to BNL. Iperf tests are ongoing.
      • Michael: ESnet did not see indications of dropped packets when consulted yesterday. Also, NIKHEF/SARA and other sites use the same path as CNAF but do not experience similar problems.
      • Jeremy: who else uses jumbo frames? PIC and FNAL do. Michael: BNL too, but on the OPN, not on the general IP network. Miguel: all sites on the OPN use jumbo frames.

      • Ticket 60740 (SRM_ABORTED) to ASGC reopened.
    • Tier-2:
      • UKI-LT2-RHUL whitelisted in DDM
      • Wuppertalprod blacklisted
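
Regarding the iperf discussion above: a memory-to-memory test isolates the network path (and its MTU settings) from any storage or FTS effects. The following is a minimal Python sketch of such a test, not the actual iperf runs done by the sites; the port number and the 1 GiB transfer size are arbitrary placeholders.

```python
#!/usr/bin/env python
# Sketch of an iperf-style memory-to-memory throughput test between two hosts.
# Run "python thr.py serve" on one host, "python thr.py <server-host>" on the other.
import socket
import sys
import time

PORT = 5001                 # placeholder port, assumed free on both hosts
BLOCK = 1024 * 1024         # 1 MiB send/receive buffer
TOTAL = 1024 * 1024 * 1024  # move 1 GiB per test

def serve():
    """Receiver: accept one connection and discard everything sent."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, peer = srv.accept()
    received = 0
    while True:
        chunk = conn.recv(BLOCK)
        if not chunk:
            break
        received += len(chunk)
    conn.close()
    print("received %.1f MiB from %s" % (received / 2.0**20, peer[0]))

def send(host):
    """Sender: stream TOTAL bytes and report the achieved rate."""
    payload = b"\0" * BLOCK
    sock = socket.create_connection((host, PORT))
    start = time.time()
    sent = 0
    while sent < TOTAL:
        sock.sendall(payload)
        sent += len(payload)
    sock.close()
    elapsed = time.time() - start
    print("sent %.1f MiB in %.1f s -> %.1f MB/s"
          % (sent / 2.0**20, elapsed, sent / 1e6 / elapsed))

if __name__ == "__main__":
    if sys.argv[1:] == ["serve"]:
        serve()
    else:
        send(sys.argv[1])
```

If such a test reaches line rate while FTS transfers stay slow, the problem is more likely in storage or transfer settings than in the network path itself.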

  • CMS reports - Ian
    • Central infrastructure
      • Nothing to report
    • Experiment activity *Luminosity in the last 24 hours (11:00 am to 11:00 am):
      • Zero delivered, Zero recorded
    • Tier1 Issues
      • Two Savannah tickets at RAL, both related to MSS and transfers. Interacting with Chris Brew. It appears a sample was intentionally removed from RAL, but somehow this was reflected in PhEDEx. A consistency check was requested (a minimal storage-vs-catalogue comparison sketch follows this report).
    • Tier2 Issues
      • T2_BR_UERJ recently did a full migration of its storage. This involves invalidating all the resident data and retransferring it. Ongoing.
    • MC production
      • Next processing campaign with new software is expected to start shortly.
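
On the consistency check mentioned in the Tier1 item above: such checks usually boil down to comparing what the catalogue (here PhEDEx) believes a site hosts with a dump of what the storage element actually contains. The sketch below is a hypothetical illustration of that comparison over two plain-text dumps (one LFN per line); it is not a CMS tool and the dump format is an assumption.

```python
#!/usr/bin/env python
# Sketch: compare a catalogue dump with a storage dump and report files that
# are missing on disk or unknown to the catalogue ("dark" data).
import sys

def load(path):
    """Return the set of non-empty, non-comment lines (LFNs) in a dump file."""
    with open(path) as fh:
        return {line.strip() for line in fh
                if line.strip() and not line.startswith("#")}

def main(catalogue_dump, storage_dump):
    in_catalogue = load(catalogue_dump)
    on_storage = load(storage_dump)
    missing = sorted(in_catalogue - on_storage)   # catalogued but not on disk
    dark = sorted(on_storage - in_catalogue)      # on disk but not catalogued
    print("in catalogue: %d, on storage: %d" % (len(in_catalogue), len(on_storage)))
    print("missing on storage: %d" % len(missing))
    print("dark (uncatalogued) files: %d" % len(dark))
    for lfn in missing[:20]:
        print("MISSING", lfn)
    for lfn in dark[:20]:
        print("DARK   ", lfn)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

For the RAL case, "missing" entries would correspond to the intentionally removed sample still visible in PhEDEx, which is what the requested consistency check should surface.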

  • ALICE reports - Lola
    • Production:
      • Several analysis trains and MC cycle ongoing
    • T0 site
      • ALICE agrees to the CASTOR intervention on the 30th of August
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report
    • Discussion on CASTOR interventions during technical stop
      • Miguel: three days are needed for the interventions planned during the stop. Delaying the start of the first intervention will increase the risk of not being ready at the accelerator restart.
      • Ignacio: taking into account the experiment comments, the CASTOR interventions may get rescheduled so that the (common) nameserver intervention comes later, not on the first day.

  • LHCb reports - Vladimir
    • Experiment activities:
      • No data. Reconstruction, merging and MC.
    • GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • T0 site issues:
    • T1 site issues:
      • IN2P3 SharedArea problem GGUS:61045. SAM tests failed, user jobs failed, software installation problem.
      • Job submission problems at RAL: only 50 jobs running for LHCb. John: the ticket has been updated - LHCb slots are limited by increased activity from other VOs.
    • T2 site issues:

Sites / Services round table:

  • Barbara/CNAF - ntr
  • Michael/BNL - ntr
  • Gang/ASGC, - ntr
  • Angela/KIT - repeated problems with the LHCb software test - working on a way to get better diagnostics.
  • Ronald/NL-T1 - ntr
  • John/RAL - ntr
  • Brian: the ATLAS backlog is taking long to clear for IN2P3, KIT and ASGC. Could the channel settings be increased? KIT: OK - ASGC and IN2P3 should be ticketed to obtain their OK.
  • Jeremy/GridPP - ntr
  • Rob/OSG - the BDII investigation is continuing - the source of the spikes has not been found; planning an update to a more current version.
  • Pepe/PIC - ntr
  • Catalin/FNAL - yesterday dCache was overloaded during a lumi section calculation (WM agent test).

AOB:

Friday

Attendance: local(Harry(chair), Lola, Edward, Cedric, Jacek, Kate, Alessandro, Ulrich, Maarten);remote(Vera(NDGF), Xavier(KIT), Gang(ASGC), Michael(BNL), Onno(NL-T1), John(RAL), Pepe(PIC), Vladimir(LHCb), Catalin(FNAL), Rob(OSG), Ian(CMS)).

Experiments round table:

  • ATLAS reports -
    • Tier-0: nothing to report.
    • Tier-1:
      • KIT: 2 unrelated issues: some WNs have lost the software mount (ticket 61090) + a staging problem (ticket 61083). Xavier suggested that ATLAS contact him by phone.
      • ASGC: "Some disk servers have insufficient bandwidth for data transfers to other T1s (CNAF, FZK, LYON, SARA). We are upgrading bandwidth and some disk servers are affected by this action (c2fs081, c2fs082, c2fs083)." FZK and LYON look OK now (no more backlog from the ESD replication).
      • Failing transfers to SARA from GRIF-IRFU (ticket 61049). This is again a network issue.
      • RAL fully online in Panda now (was brokeroff till yesterday) after the fix for the lost files.

  • CMS reports -
    • Experiment activity: luminosity in the last 24 hours (11:00 am to 11:00 am): zero delivered, zero recorded.
    • Tier1 Issues: two Savannah tickets at RAL, both related to MSS and transfers. Interacting with Chris Brew. It appears a sample was intentionally removed from RAL, but somehow this was reflected in PhEDEx. Reassigned to data operations - not a site issue.
    • MC production: the next processing campaign with new software is expected to start shortly.

  • ALICE reports -
    • Production: several analysis trains and MC cycle ongoing
    • T0+T1 sites: nothing to report
    • T2 sites: Hiroshima: after suffering from several problems, such as a serious mismatch between the OpenSSL version deployed with the recent CREAM-CE release and an issued CRL, the site is back in production.

  • LHCb reports -
    • Experiment activities: no data. Reconstruction, merging and MC.
    • T1 site issues:
      • IN2P3 SharedArea problem GGUS:61045. SAM tests failed, user jobs failed, software installation problem.
      • NIKHEF (SARA): timeouts while getting TURLs from the SE at NIKHEF, GGUS:60603 (updated - the same files are always timing out). Onno will look at the ticket.

Sites / Services round table:

  • KIT: announcing a downtime of the CE cream1 over 21-23 August for an upgrade to SL5. The remaining two CREAM-CEs will take the load.

  • OSG: scheduling a BDII upgrade in about a month's time. The BDII which is failing will be restarted in their next maintenance window, probably 24 August. The objective is to upgrade both to BDII 5.08, the level CERN is running. Kyle will be giving the OSG reports for the next 6 weeks.

  • CERN CASTOR: a consolidation of the DLF (distributed logging facility) will be made on Monday 16th August starting at 10:00. The nameserver and stagers are not affected.
AOB:

-- JamieShiers - 03-Aug-2010
