Week of 120305

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Alessandro, Eva, Ivan, Maarten, Maria D, Massimo);remote(Burt, Dimitri, Gonzalo, Jhen-Wei, Lorenzo, Michael, Onno, Pepe, Rob, Rolf, Tiju, Ulf, Vladimir).

Experiments round table:

  • ATLAS reports -
    • Central Services
      • a corrupted pool file catalog in CVMFS, /cvmfs/atlas.cern.ch/repo/conditions/poolcond/PoolFileCatalog.xml, caused massive failures (SAV:126847). Problem fixed, but it took a while for the modifications to propagate (see the catalog check sketched after this report)
      • the SAM(NAGIOS) machine is in trouble, the NAGIOS team is working on it (GGUS:79883). Most probably an availability recalculation will be needed. We will keep following this issue.
      • GGUS:79850: SAM availability recalculation requested, to cancel 5 hours of global failures on 11 February
      • acron jobs on lxplus were failing intermittently (GGUS:79853): excluding one problematic lxplus node fixed the issue.
    • T0/1:
        • from last week: CERN-PROD had one tape with problems (GGUS:79788), one file lost
        • Massimo: the last file on that tape cannot be read - can we declare it lost or should we try to get it recovered (can take many weeks)?
        • Alessandro: declaring it lost is OK
    • T2s:
      • two tickets, GGUS:79855 and GGUS:79854, opened to T2s (LIP-LISBON and NCG-INGRID-PT) for jobs failing with "libaio.so.1: cannot open shared object file: No such file or directory" (still to verify, waiting for the CVMFS problem to be fully fixed; see the library check sketched after this report)
      • SRM problems at IFIC-LCG2: restarted SRM, fixed GGUS:79852
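    A minimal well-formedness check for the corrupted catalog mentioned above, as a sketch only: the CVMFS path is taken from the report, while the script itself is illustrative and not an ATLAS tool (it catches broken XML, not semantically wrong catalog entries):

        # Sketch: verify that the POOL file catalog is well-formed XML.
        # Path taken from the ATLAS report above; everything else is illustrative.
        import sys
        import xml.etree.ElementTree as ET

        CATALOG = "/cvmfs/atlas.cern.ch/repo/conditions/poolcond/PoolFileCatalog.xml"

        try:
            ET.parse(CATALOG)
            print("catalog parses as well-formed XML")
        except (ET.ParseError, OSError) as exc:
            print("catalog unreadable or corrupted:", exc)
            sys.exit(1)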
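    For the libaio.so.1 failures at the two T2s, a quick worker-node check along these lines (an illustrative sketch, not an experiment validation script) would tell whether the library is genuinely missing from the node:

        # Sketch: check whether libaio.so.1 can be loaded on a worker node.
        import ctypes

        try:
            ctypes.CDLL("libaio.so.1")
            print("libaio.so.1 can be loaded")
        except OSError as exc:
            print("libaio.so.1 missing or not loadable:", exc)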

  • CMS reports -
    • Tier-0 (plan):
      • Today at 14:00 the magnet will ramp to 0.2T for 2h; continue at 17:00 to 1T, then back down to 0T for the night. Tomorrow go all the way up to 3.8T, not clear yet when. Then, Cosmics Data Taking.
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 sites, ramping up on Tier-2 sites
    • Tier-0/Tier-1 issues:
      • [closed+] GGUS:79829 (SAV:126799): T1_FR_CCIN2P3: Files not staged, cannot be accessed. The syntax of the configuration file used by the tape protection mechanism implemented in dCache 1.9.12 (second golden release) has changed. They modified it and the (automatic) staging is working again, last update 2012-03-02
      • [in progress+] GGUS:79887 (SAV:126368): T1_TW_ASGC: Backfill dataset (not meant to go to tape) was accepted for transfer to tape. The migration is stuck. See the ticket; to solve this incident we might delete the dataset, which will clear the request and the queue... moved to GGUS, last update 2012-03-05
      • [open/closed+] GGUS:79868 (SAV:126837): T1_DE_KIT: JobRobot failures due to "error while loading shared libraries: libcares.so.0: cannot open shared object file: No such file or directory /opt/glite/bin/glite-lb-logevent". Checks on the WNs OK. There were not many jobs (24) failing with this error. A glitch elsewhere? Not reproduced later on. Closed the ticket, last update 2012-03-05
      • [new/in progress+] GGUS:79858 (SAV:126822): T1_IT_CNAF: All or some JobRobot jobs going through wmsXXX.cnaf.infn.it are failing for all the CMS sites. The WMSes cannot get or upload files. glite-wms-wm was using a very high amount of memory and was restarted this morning, last updated 2012-03-05
      • [in progress] GGUS:79648 (SAV:126644): T0_CH_CERN: Tier-0 node (probably lxbrl2316) failing a lot of jobs. Apparently it has some sort of memory allocation problem. last updated 2012-02-29
      • [in progress] GGUS:79259 (SAV:126400): T1_IT_CNAF: request to update FTSMonitor, will be handled within 2-4 weeks due to separation from lemon that runs on the same server, last update 2012-02-21
      • [in progress] GGUS:79257 (SAV:126398): T1_DE_KIT: request to update FTSMonitor, experts already planning the upgrade, last update 2012-02-29
      • [in progress+] GGUS:79799 (SAV:126762): T1_FR_CCIN2P3: Misbehaving CREAM-CE, needs an upgrade. Upgrade done, CMS is ramping up to use it, last update 2012-03-05
    • Services and Infrastructure issues:
      • [in progress] GGUS:79719: Some CMS SAM tests not published in SAM for several days. Affected site: RWTH-Aachen; from 21/2 to 27/2 two CMS WN tests, org.cms.WN-analysis and org.cms.WN-basic, were not published. Assigned (SAM/Nagios), but they do not provide support (the EMI Product Teams do). CMS asks further: "if the code of a WN test gets stuck, what happens? Is Nagios going to kill the process and report the output as CRITICAL, or does the test simply not get published?", last update 2012-03-02
    • Notes:
      • [new+] SAV:126848: Column 126 in the SSB has been stuck since 1 March, last update 2012-03-05
        • Ivan: we will look into it

  • LHCb reports -
    • T0
      • CERN : setting the variable TMPDIR : (GGUS:79685)
    • T1
      • SARA : All pilots waiting (GGUS:79869) Fixed
      • GRIDKA : Condition DB unavailable (GGUS:79800)
      • NIKHEF : HOME directory not set (GGUS:79650)
      • PIC : Request for space token migration (GGUS:79305)
      • GridKa : Request for space token migration (GGUS:79303) "nearly finished"
      • SARA : Request for space token migration (GGUS:79307)
        • Onno: the ticket is waiting for a reply
        • Vladimir: we are waiting for our data manager to respond
    • T2

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF
    • any news on the EMI worker node issue broadcast last Friday?
    • Maarten: no news yet, but we will send an update this month once we know to what extent ATLAS or CMS jobs would continue to fail on the EMI WN; hopefully any remaining issues can be addressed on the experiment side, such that sites can proceed with upgrades of their WN
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF
    • tomorrow FTS 2.2.8 upgrade; warning: we had issues with the publicly available Oracle client packages
      • Maarten (after the meeting): the issues only affect the 10g client, whereas the 11g client is fine; if you still need the 10g client, please contact the FTS support list
  • NLT1
    • SARA tape system in scheduled maintenance, ongoing
      • Alessandro: why did you declare an outage in the GOCDB instead of an at-risk downtime?
      • Onno: that seems to have been a mistake, to be followed up
    • GGUS:76920 (FTS errors) - network tuning ongoing, but currently there is little traffic
    • will not attend tomorrow
  • OSG - ntr
  • PIC - ntr
  • RAL
    • tomorrow FTS 2.2.8 upgrade

  • dashboards - nta
  • databases
    • Fri evening there was a problem with CMS conditions streaming due to a user mistake, quickly fixed
  • GGUS/SNOW
    • see AOB
  • storage
    • tomorrow CASTOR ATLAS upgrade to 2.1.12, 3h downtime

AOB: (MariaDZ) File ggus-tickets.xls is up to date and attached to page WLCGOperationsMeetings. We had no real ALARM this week. The 20+ test ALARMs launched last Monday with the GGUS release no longer appear in the weekly escalation reports https://ggus.eu/pages/metrics/download_escalation_reports_wlcg.php, as suggested in this meeting and developed as per Savannah:125862.

Tuesday

Attendance: local(Alex, Fernando, Guido, Ignacio, Ivan, Luca, Maarten, Massimo, Peter);remote(Gareth, Gonzalo, Jeremy, Jhen-Wei, Lisa, Rob, Rolf, Stefano, Ulf, Vladimir, Xavier).

Experiments round table:

  • ATLAS reports -
    • Central Services:
      • SAM(NAGIOS) machine srm-atlas-prod back in production: the "pipe" file was corrupted, it was regenerated and everything is back (GGUS:79883). Still to understand how /var/nagios/rw/nagios.cmd got corrupted (see the sketch after this report).
    • T0/1:
    • T2s:
      • UKI-LT2-QMUL: problems with SRM (GGUS:79900); the 10Gig network card seems to have failed, the server was reconfigured to use the 1Gig card.
      • PSNC (Poland): jobs failing systematically with a commands.getoutput error (text.pipe() call), GGUS:79385; no reply from the site for days. Problem disappeared. Ticket closed
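    The nagios.cmd "pipe" file is the named FIFO through which external commands reach Nagios. A sketch of the kind of check-and-recreate step described above (the path comes from the report; the permissions and the standalone script form are assumptions, and in practice Nagios recreates the file itself on restart):

        # Sketch: recreate the Nagios external command FIFO if it is missing
        # or is no longer a named pipe. Path taken from the report above;
        # the 0o660 permissions are an assumption about the local setup.
        import os
        import stat

        CMD_FILE = "/var/nagios/rw/nagios.cmd"

        st = os.stat(CMD_FILE) if os.path.exists(CMD_FILE) else None
        if st is None or not stat.S_ISFIFO(st.st_mode):
            if st is not None:
                os.remove(CMD_FILE)     # corrupted: a regular file, not a FIFO
            os.mkfifo(CMD_FILE, 0o660)  # recreate the named pipe
            print("recreated FIFO", CMD_FILE)
        else:
            print(CMD_FILE, "is a valid FIFO")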

  • CMS reports -
    • Tier-0 (plan):
      • Cosmics Data Taking with Magnet ON planned tonight.
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 sites, ramping up on Tier-2 sites
    • Tier-0/Tier-1 issues:
      • [Info] KIT Tier-1 will be in downtime next week on Mar 13-15 and CMS is starting to organize the draining of queues
    • Services and Infrastructure issues:
      • [Issue] Saw a glitch of the CMS SRM endpoint at CERN this morning, between 11:00 and noon, see https://sls.cern.ch/sls/history.php?period=6h&offset=null&more=availability&id=CASTOR-SRM_CMS . No major consequence for CMS, since CMS is not transferring data out of CERN heavily at the moment. No ticket opened by CMS.
        • Massimo: will check SRM monitoring glitch, often due to high load on t1transfer; see next item
      • [Warning] Have observed that the public batch cluster looks degraded since last night (see https://sls.cern.ch/sls/history.php?id=LXBATCH&more=availability&period=24h ): it is not clear to what extent this is affecting CMS jobs, but probably it does. No ticket opened by CMS.
        • Ignacio: there was no degradation of lxbatch; will look into the issue
        • added after the meeting: SLS probes were affected by an lxplus overload, fixed
      • [Info] Central CMSWEB Service upgrade this morning, all went fine.

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • KIT
    • tape staging for ATLAS failed from yesterday evening until 07:30 today, now OK
  • NDGF
    • FTS 2.2.8 being ramped up, all channels expected online by 15:00 UTC
  • OSG - ntr
  • PIC - ntr
  • RAL
    • CASTOR DB move went OK
    • CMS transfer failures being investigated
    • FTS 2.2.8 upgrade went OK

  • dashboards - ntr
  • databases - ntr
  • grid services
    • CERN gLite 3.1 WMS servers being drained for retirement
  • storage - nta

AOB:

Wednesday

Attendance: local(Edoardo, Ignacio, Ivan, Luca, Maarten, Maria D, Massimo, Peter);remote(Burt, Gonzalo, Jhen-Wei, Pavel, Rob, Rolf, Ron, Stefano, Tiju, Ulf, Vladimir).

Experiments round table:

  • ATLAS reports -
    • Central Services:
      • Site Services (and kernel!) upgrade on CERN voboxes: problems with MySQL database on 3 machines. Reinstalled MySQL and rebuilt the database, everything seems fine. Problem still under investigation.
    • T0/1:
      • Lost file on EOS reported yesterday (INC:110511): file never written!!! To be investigated with ATLAS DDM team
      • FTS upgrade in NDGF and UK. Finished in the afternoon yesterday and everything ok
      • US network intervention yesterday. Everything ok
      • Problem with SRM at PIC (GGUS:79963). Still under investigation
    • T2s:
      • IFIC SRM problems (GGUS:79948): the StoRM frontend daemon (SRM) died unexpectedly. The automatic recovery procedure did not work; manually restarted

  • CMS reports -
    • Tier-0 (plan):
      • Cosmics Data Taking on-going.
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 and Tier-2 sites
    • Tier-0/Tier-1
      • -
    • Services and Infrastructure
      • [Issue] FTS server issue at CERN this morning. First report came from the CMS Computing shifter at 09:50 regarding CERN-FNAL transfer problems. CMS opened ticket GGUS:79958.
        • Maarten: experts looking into high load from the dCache to EOS copy commands running on the channel agent machine
        • Massimo: last week Steve already reported strange behavior for that channel, which went away by itself, but it was then moved to a separate machine as a precaution
      • [Issue] (internal to CMS): the central CMSWEB service upgrade caused the transfer request approval tool (via the PhEDEx Web interface) to fail, hence at the moment data transfer requests can only be approved by central transfer admins. CMS site contacts have been made aware of this. Should be fixed soon with a new CMSWEB release.

  • ALICE reports -
    • The central AliEn services were unreachable from 13:36 to 14:02 due to a power cut affecting the network.

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • FNAL
    • FTS 2.2.8 upgrade may already happen tomorrow instead of next week
  • IN2P3 - ntr
  • KIT
    • LHCb DB ticket (GGUS:79800) is waiting on LHCb
      • Vladimir: we asked for a generic alias for the DB machine
      • Pavel: please put that into the ticket
  • NDGF
    • FTS 2.2.8 upgrade went almost completely OK, there was just an issue with the site BDII not publishing the endpoint
    • network downtime at 18:00 UTC, max 4h; also tomorrow
  • NLT1
    • 2 short downtimes this morning
      • fixed memory of crashed pool node
      • restarted GridFTP service with different network configuration to address long-standing ticket
  • OSG
    • there was a network problem between MIT and CERN, issue was on the MIT side, resolved
  • PIC
    • ATLAS SRM issue being investigated
  • RAL - ntr

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW
    • next monthly GGUS update on Tue March 20
      • avoid overlap with EGI Community Forum
      • implies Remedy update affecting OSG and other ticketing systems
  • grid services - ntr
  • networks
    • Oracle 11g upgrade of network DBs starts at 17:30
      • network configurations frozen for 3h
      • otherwise transparent
  • storage
    • CASTOR ALICE red in the monitoring, could just be the probe
      • Maarten: there was a sysadmin ticket about a CASTOR ALICE head node
    • CASTOR upgrade to 2.1.12 done for ATLAS; proposed schedule for the others:
      • Mon March 12 - CMS + LHCb, 09:00-13:00 CET
      • Wed March 14 - ALICE rate test
      • Mon March 19 - ALICE upgrade

AOB:

Thursday

Attendance: local(Alessandro, Guido, Ivan, Maarten, Nicolo, Paul, Peter, Steve, Xavier);remote(Andreas M, Gonzalo, Jeremy, John, Lisa, Michael, Paco, Rob, Rolf, Stefano, Ulf, Vladimir).

Experiments round table:

  • ATLAS reports -
    • Central Services:
      • FTS overwriting (-o option) not working properly, GGUS:80034 (a usage sketch follows this report)
    • T0/1:
      • Problem with SRM at PIC (GGUS:79963). Apparently a bug in dCache (movers not timing out, leading to pool blocking and/or overloading). The workaround to reduce the load on the pools is working
      • FTS channel IN2P3->FZK not working, and the FZK FTS monitoring (http://ftm-kit.gridka.de/ftsmonitor/ ) not accessible, GGUS:80043. "After restarting the transfer-agent, transfers seem to succeed now. The agent log indicates connection problems to gridftp doors at IN2P3 which may have caused the agent process to hang"
    • T2s:
      • nothing to report
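    For reference, a usage sketch of the overwrite option mentioned above: -o asks FTS to overwrite an existing destination file when the transfer is submitted. The SURLs below are placeholders and the call is shown without an explicit service endpoint; both are illustrative assumptions, not details taken from the ticket, and the gLite FTS client must be installed for the command to exist:

        # Sketch: submit an FTS transfer with the overwrite flag (-o) that
        # GGUS:80034 reports as misbehaving. The SURLs are placeholders.
        import subprocess

        cmd = [
            "glite-transfer-submit", "-o",
            "srm://source.example.org/atlas/path/somefile",
            "srm://dest.example.org/atlas/path/somefile",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout or result.stderr)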

  • CMS reports -
    • Tier-0 (plan):
      • Cosmics Data Taking on-going.
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 and Tier-2 sites
    • Tier-0/Tier-1
      • -
    • Tier-2
      • [Issue] Failing data transfers from several CMS Tier-2s (UCSD, MIT) to the Beijing Tier-2, identified as a middleware issue (the BeStMan SRM doesn't accept the Beijing-provided proxy certificate). It has been reported that ATLAS also saw similar problems. The issue is solved in the latest SRM server (bestman2-server-2.2.0-14.osg.el5) and tested at Nebraska, see Savannah:125025. CMS is now wondering if all sites need to make the same upgrade?
        • Maarten: the issue may have to do with the IHEP CA (along with the APAC and IUCC CAs) still using e-mail addresses in their signing policies; that will not change in the near future
        • Michael: we saw the same issue for USATLAS sites; there is a complicated workaround
        • Maarten: it would seem best that OSG inform their sites; the EOS team should also check their SRM
          • Nicolo: the EOS SRM is currently only used for stageout of user output files from WNs at remote CMS sites into EOS
          • ATLAS experts: the EOS SRM is not used for transfers
    • Services and Infrastructure
      • [Known issue] FTS server problems at CERN reported on Mar 7: FTS developers still looking at the large memory consumption on the server, waiting for an update on GGUS:79958.
        • Paul: is there a memory leak?
        • Steve: the machine got rebooted before we could do a full analysis; it now has more memory and swap; currently the problem has disappeared
        • Paul: we will hammer the channel with transfers to provoke the problem, so that the developers get a chance to debug it

Sites / Services round table:

  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • KIT
    • 1 broken tape for CMS, details will be sent to CMS
    • Vladimir: can you reopen a queue for LHCb jobs?
    • Andreas: will ask colleagues
  • NLT1 - ntr
  • NDGF
    • network intervention at 18:00 UTC, max 4h
    • ATLAS FTS issue: the relevant logs were not found yet
  • OSG
  • PIC
    • the dCache bug affecting ATLAS is caused by a patch that was applied to fix another bug, we are in contact with the developers
  • RAL
    • some issues observed with the SAM Programmatic Interface etc. during the past 2 days
      • Maarten: 2 days ago various old SAM components were switched off; they should not have been in use any more for a while; please open a ticket for the SAM team if you need their assistance in this matter

  • dashboards - ntr
  • grid services - ntr
  • storage
    • CASTOR ALICE 3h degradation yesterday was due to a network glitch that was not properly handled, now fixed in the code
    • 1.5 PB added to EOS ATLAS
    • 0.5 PB soon to be added to EOS ALICE
    • next week's stress test by ALICE will be using (and testing) 500 TB to be added to CASTOR ATLAS (sic)

AOB:

Friday

Attendance: local(Guido, Ignacio, Ivan, Maarten, Peter, Xavier E);remote(Gonzalo, Jhen-Wei, Lisa, Michael, Onno, Rob, Rolf, Stefano, Tiju, Ulf, Vladimir, Xavier M).

Experiments round table:

  • ATLAS reports -
    • T0/1:
      • FTS overwriting (-o option) not working in NDGF. GGUS:80034. Misconfiguration of a channel, fixed
    • T2s:
      • SRM problem in UKI-LT2-RHUL (GGUS:80073). Due to network problem on storage nodes, SRM fails to access some of storage nodes. Fixed
    • Guido: lots of failed SAM-Nagios tests for CREAM and OSG-CE at KIT, BNL and SARA; such jobs are sent via CERN WMS nodes and somehow get stuck somewhere; after 5.5 hours SAM gives up on them

  • CMS reports -
    • Tier-0 (plan):
      • Cosmics Data Taking on-going.
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 and Tier-2 sites
    • Tier-0/Tier-1
    • Tier-2
      • -
    • Services and Infrastructure
      • [Known issue] CERN/FTS server memory issues : on-going tests as reported in GGUS:79958 .
      • [Issue] Central JobRobot issue affecting the test results at many CMS sites, due to one or more gLite WMSes deployed at INFN-CNAF, see GGUS:79858. The main effect is that CMS Computing shifters cannot reliably ping CMS sites for "real" JobRobot errors, but it may also affect real CMS analysis jobs. Local experts at CNAF are on the case; the alternative for CMS would be to remove the concerned WMSes from the central CRAB configuration, but we are not sure about potential scaling issues after losing 1/2 gLite WMSes.
        • Maarten: CERN WMS admin provided CNAF WMS admin with details on WMS configuration and workarounds in place; one tip looks promising

  • LHCb reports -
    • T0
      • CERN : setting the variable TMPDIR : (GGUS:79685)
    • T1
      • SARA : Incorrect platform (GGUS:80048) Solved by "hot fix" at site
        • Vladimir: thanks to SARA!
      • GRIDKA : Condition DB unavailable (GGUS:79800)
      • PIC : Request for space token migration (GGUS:79305)
      • GridKa : Request for space token migration (GGUS:79303) "nearly finished"
      • SARA : Request for space token migration (GGUS:79307)
      • Vladimir: thanks to KIT for reopening a queue for LHCb!
    • T2

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - nta
  • FNAL
    • FTS 2.2.8 upgrade postponed until next week
  • IN2P3 - ntr
  • KIT
    • there was no tape access between 01:00 and 10:00, OK now
  • NDGF
    • FTS 2.2.8 all OK; we forgot to stop the old server!
  • NLT1
    • ATLAS ticket GGUS:80102 seems due to cleaning the scratch space with an "mtime" grace period of 7 days
      • Maarten: you probably just need to check the "ctime" instead (see the sketch after this list)
    • lots of tape activity: LHCb staging data while ATLAS are storing data; could cause delays; storing has higher priority
    • SARA downtime on March 20
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr
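    A minimal sketch of the cleanup logic Maarten suggests above for GGUS:80102: select files for deletion by ctime rather than mtime, so recently (re)created files that carry old modification times are not purged. The 7-day grace period follows the ticket discussion; the scratch path and the dry-run form are illustrative assumptions:

        # Sketch: purge scratch files whose ctime (not mtime) exceeds the
        # grace period, per the suggestion for GGUS:80102 above.
        import os
        import time

        SCRATCH = "/scratch"            # placeholder path
        GRACE = 7 * 24 * 3600           # 7-day grace period from the ticket
        now = time.time()

        for dirpath, dirnames, filenames in os.walk(SCRATCH):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue            # file vanished meanwhile
                if now - st.st_ctime > GRACE:
                    print("would remove:", path)   # dry run; no actual os.remove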

  • dashboards - ntr
  • grid services - ntr
  • storage
    • Mon 09:00-13:00 CASTOR upgrade for CMS and LHCb
    • Mon March 19: idem for ALICE
    • EOS ATLAS monitoring glitch 06:00-07:30 could be due to issue with ATLAS SAM tests: not running?

AOB:

-- JamieShiers - 31-Jan-2012
