Week of 100301

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Availability SIRs & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Harry(chair), David, Malik, Simone, Ricardo, Lola, MariaDZ, Roberto, Andrea, Alessandro, Eva, Timur, Julia, Miguel, Dirk);remote(Jon(FNAL), Gonzalo(PIC), Michael(BNL), Josep(CMS), Ron(NL-T1), Rob(OSG), Alessandro(CNAF), Jens(NDGF), Gareth(RAL), Rolf(IN2P3), Gang(ASGC), Xavier(KIT)).

Experiments round table:

  • CMS reports -
    • 27 beam splash events from beam 1 plus 24 events from beam 2.
    • T1 highlights:
      • Backfill running.
      • Temporarily enabled T2_CH_CAF --> T1_FR_CCIN2P3 transfers to allow recovery of some datasets lost in the deletion incident of November 2009. Some files are no longer on CMSCAF and might need invalidation.
      • IN2P3 - two skimming jobs stuck waiting a long time to access input files; the files are still not accessible. Rolf reported that they are accessible to local CMS staff, so it is not clear what is going on.
      • IN2P3 - temporary SAM test failures in stageout during the weekend, now disappeared.
      • RAL - tape migration stuck for a few days; the Castor migration hunter was restarted.
      • PIC - GGUS #56066: PIC has a high fraction of jobs aborted with the Maradona error. For at least the last two weeks the CMS Job Robot has shown a significant failure rate at PIC, typically around 20%. These jobs are submitted via the gLite WMS and the logging information of the failed jobs gives the failure reason "File not available. Cannot read JobWrapper output, both from Condor and from Maradona.". A reasonable guess is that a fraction of "bad" worker nodes is causing the jobs to fail, since the SAM tests, which use different nodes, do not show any problem (a diagnostic sketch is given after the experiment reports). In progress.
    • T2 highlights:
      • MC ongoing in all T2 regions.
      • T2_TW_Taiwan - SLC5 WNs deployed; some SAM test failures caused by known DPM library issues after the SLC5 migration, fix applied.
      • T2_PK_NCP - SLC5 CMSSW deployed.
      • T2_IN_TIFR - SAM test error: "Got a job held event, reason: Globus error 12: the connection to the server failed (check host and port)".
      • T2_RU_JINR - SAM test errors: squid server not reachable.
    • Services: GridMap back, GGUS ticket verified (GGUS #55917).

  • ALICE reports -
    • GENERAL INFORMATION: A new MC cycle was submitted this weekend. In addition, the Pass 1 reconstruction continues at the T0 site, with the number of concurrent jobs exceeding 7800.
    • T0 site: Used for the Pass 1 reconstruction activities, no issues to report. voalice14 has been (temporarily) taken out of production; the node is currently used to check some of the new AliEn v2.18 packages, which will be deployed in a few weeks. The machine has been put into maintenance status.
    • T1 sites: Following the agreement reached at the TF meeting a few weeks ago, we are changing the configuration at CCIN2P3 to use the CREAM-CE as the only submission backend.
    • T2 sites: 4 sites were put into CREAM-CE submission mode during the weekend: GRIF_IRFU, GRIF_IPNO, IRES and Clermont, all in France. The services were tested before being put into the AliEn production software, with no incidents to report.

  • LHCb reports -
    • Experiment activities: No large activity in the system right now (a few MC productions created). Drained the remaining stripping productions to allow the new release of the DaVinci application (based on the version of ROOT that fixes the compatibility problem with the dcap libraries) to be deployed at all Tier-1s. As a consequence, all dCache sites have been reintegrated into the production mask. SAM tests also confirm that the problem preventing data access is now fixed at almost all dCache sites.
    • T1 sites issues: SARA: issue with the critical file-access test (seen after the test moved to the ROOT version that fixes the compatibility issue, so independent of it). RAL: issue last week with the voms-cert tests. RAL: issue on Friday with one disk server (gdss160), taken temporarily out of production.
    • T2 sites issues: GRIF and UKI-LT2-QMUL also failing the voms-cert critical SAM test. Software installation issues at UKI-SOUTHGRID-BHAM-HEP. User jobs stalling at RAL-HEP, most likely due to a C/A issue.
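The PIC "Maradona" failures reported by CMS above are typically narrowed down by counting failures per worker node, since a handful of bad nodes can produce a ~20% overall failure rate while SAM tests, which land on other nodes, stay green. A minimal sketch of such a per-node tally, assuming a hypothetical CSV export of Job Robot results with columns job_id, worker_node, exit_reason (the file name and column names are illustrative, not an existing CMS tool):

```python
#!/usr/bin/env python
# Count Job Robot failures per worker node to spot "bad" WNs.
# Assumes a hypothetical CSV export: job_id,worker_node,exit_reason
import csv
from collections import Counter

failures = Counter()   # Maradona-style failures per worker node
totals = Counter()     # all jobs per worker node

with open("jobrobot_pic.csv") as fh:          # illustrative file name
    for row in csv.DictReader(fh):
        wn = row["worker_node"]
        totals[wn] += 1
        if "Maradona" in row["exit_reason"]:
            failures[wn] += 1

# Flag nodes where most jobs fail; a few such nodes would explain a ~20%
# overall failure rate while SAM (running on different nodes) stays green.
for wn in sorted(failures, key=failures.get, reverse=True):
    rate = failures[wn] / float(totals[wn])
    if rate > 0.5 and totals[wn] >= 5:
        print("%-30s %3d/%3d failed (%.0f%%)" % (wn, failures[wn], totals[wn], 100 * rate))
```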

Sites / Services round table:

  • FNAL: Reminder they will upgrade to FTS 2.2.3 starting at about 17:00 UTC.

  • NL-T1: 1) A pool node crashed 3 times, thought to be Infiniband problems; back up but under investigation. 2) Upgraded dCache to 1.9.5-16. 3) The LHCb problem is actually at Nikhef and they have added information to the ticket.

  • CNAF: Tomorrow afternoon will upgrade ATLAS STORM instance to version 1.5 - expected to take 2 hours.

  • RAL: Have two ATLASDATADISK servers out with independent problems. A bdii problem (ATLAS ticket) from another UK site has been resolved.

  • ASGC: This Wednesday the new parallel power supply will be retested between 01:00 and 07:00 UTC.

  • OSG: Maria reported there is one open ticket from ATLAS on 26 Feb #544674 concerning an incomplete Athena installation in Nebraska. Rob will raise this at the next OSG operations meeting and Michael added that this concerns non-production resources which are offered on an opportunistic basis.

AOB: LHC news: Beam splash events over the weekend. Cryo intervention and recovery will last until about 22:00 today. As soon as we are back with cryo conditions: pre-cycle and alarm checks. When beam injection conditions are restored: inject, check orbits and check capture for the 2 beams. Beam measurements: tune + chromaticity + coupling and finally beta-beat.

Tuesday:

Attendance: local(Miguel, Jean-Philippe, Nicolo, Harry, Timur, Malek, Alessandro, Simone, Patricia, Roberto, Eva, Nilo, MariaD);remote(Jon/FNAL, Xavier/KIT, Michael/BNL, Gang/ASGC, Jeremy/GridPP, Ronald/NLT1, Tiju/RAL, Rolf/IN2P3, Alessandro/CNAF, Tore/NDGF, Rob/OSG).

Experiments round table:

  • CMS reports (Nicolo)-
    • IN2P3 - following up tails in repopulation of deleted datasets.
    • IN2P3 - More skimming jobs stuck waiting a long time to access input files - files still not accessible, to be checked again after the end of the IN2P3 downtime. There are actually more files in that state than previously reported.
    • PIC - ongoing investigation on MARADONA errors in JobRobot.
    • RAL - ReReconstruction running - first jobs failed for operator mistake in workflow configuration, second attempt running smoothly.
    • CNAF - error opening 32 RAW files on TSM in backfill jobs. elog entries can be found in the CMS wiki page.
    • T2s: Monte Carlo at high scale. Transfer problems between Warsaw T2 and KIT (timeouts when trying to connect to SRM).

    • T0: continuous data taking at low level
    • T1s: reconstruction of cosmics at RAL
    • T2s: MC production
    • Andrea becomes Tier1 coordinator for CMS
    • If sites install FTS 2.2.3, they should probably install FTS monitoring 1.2.0 at the same time.

  • ALICE reports (Patricia)-
    • GENERAL INFORMATION: Pass1 reconstruction at the T0, MC production and train analysis activities going on now, also during the weekend with no incidents to report (close to 100 % efficiency).
    • T0 site: Low rate of jobs running due to the low amount of jobs required for the pass1 reconstruction. No issues to report.
    • T1 sites: Stable behavior observed at T1 sites.
    • T2 sites: Some sites (in particular Subatech) are still reporting user jobs with high memory consumption. The problem has been identified as coming from the user jobs themselves; the users will be contacted. The issue, and the procedures proposed by some T2 sites, will be discussed during this week's ALICE TF meeting, including the possibility of adding a JobAgent protection against such user jobs. GRIF_IPNO has announced a downtime to reorganize the site's computing room; the site is in downtime from 9:00 to 19:00.

  • LHCb reports (Roberto)-
    • 40 million MC events to be produced scattered across 20 different production requests. Running smoothly so far.
    • Gridview: issue with historical SAM information: the 1st of March was skipped (fixed now).
    • T0 sites issues: none.
    • T1 sites issues: SARA: issue with critical test for file access under investigation.
    • T2 sites issues: UK sites: BDII information flickering; pilots cannot be submitted because no available resources are found. Possibly related to what was reported yesterday? Tiju says that the DNS problems have been fixed (see the BDII query sketch below).
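The BDII "flickering" reported by LHCb above (resources appearing and disappearing between queries) can be reproduced with a direct LDAP query against a top-level BDII, repeated across a publication cycle. A minimal sketch, assuming the python-ldap module, GLUE 1.3 attributes and anonymous access on port 2170; the BDII host name is illustrative:

```python
#!/usr/bin/env python
# Query a top-level BDII twice and report CEs that appear or disappear
# between queries ("flickering"). Assumes the python-ldap module and
# GLUE 1.3 attributes; the BDII host below is illustrative.
import time
import ldap

BDII = "ldap://lcg-bdii.example.org:2170"   # illustrative host
BASE = "o=grid"
FILT = "(&(objectClass=GlueCE)(GlueCEAccessControlBaseRule=VO:lhcb))"

def visible_ces():
    conn = ldap.initialize(BDII)
    conn.simple_bind_s()                    # anonymous bind
    entries = conn.search_s(BASE, ldap.SCOPE_SUBTREE, FILT, ["GlueCEUniqueID"])
    conn.unbind_s()
    return set(e[1]["GlueCEUniqueID"][0] for e in entries)

first = visible_ces()
time.sleep(120)                             # wait roughly one publication cycle
second = visible_ces()
print("CEs gone between queries: ", sorted(first - second))
print("CEs new between queries:  ", sorted(second - first))
```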

Sites / Services round table:

  • FNAL: GGUS 56091 - 2 out of 3 of our CEs reported an error; we see the 3rd one as OK (see OSG RSV data), but WLCG sees our 3rd CE as in state unknown and hence charged FNAL with a failing test. Rob says that there was a problem with the messaging on the SAM side. To be watched.

  • KIT: NTR
  • BNL: NTR for T1. SE not working at Midwest T2 (being investigated).
  • ASGC: NTR
  • NLT1: problem between SARA and NIKHEF being investigated.
  • RAL: NTR
    • APEL is experiencing unusual peaks of activity around the times when most sites publish. For reasons we are still investigating, connections to the service do not get cleaned up properly and cause the service to hang until it finally restarts. As a result, many sites do not see their records published in APEL even if their logs do not show any obvious error. To keep everyone informed of the daily status of APEL, we will maintain the following page as a permanent feature: http://goc.grid.sinica.edu.tw/gocwiki/ApelStatus . All relevant information and known issues about the service will be put on this page.
  • IN2P3: nearly at the end of the downtime. Most services are back and seem to be ok.
  • CNAF: about the CMS problem, no ticket has been opened. The CMS local contact is working with the site on the problem and will open a ticket if necessary.
  • NDGF: NTR

  • CERN central services: NTR

AOB:

Wednesday

Attendance: local(Harry, Miguel, Graeme, Jean-Philippe, Nicolo, Patricia, Roberto, Alessandro, Timur, Maria, Ricardo, Eva, Nilo);remote(Gonzalo/PIC, Alessandro/CNAF, Jon/FNAL, Michael/BNL, Onno/NLT1, Tiju/RAL, Jens/NDGF, Jeremy/GridPP, Gang/ASGC, Rolf/IN2P3, Xavier/KIT, Rob/OSG).

Experiments round table:

  • CMS reports (Nicolo)-
    • T0 Highlights: Disk server on T0EXPORT service class temporarily unavailable, causing job failures - jobs succeeded on resubmission.
    • T1 Highlights: IN2P3 - following up tails in the repopulation of deleted datasets - almost all files recovered (five left). IN2P3 - more skimming jobs stuck waiting a long time to access input files - files still not accessible; update from IN2P3: the problem has been reproduced and the dCache experts are investigating. RAL - ReReconstruction running - succeeded, except for one job killed by the batch system after reaching the 60-hour CPU time limit (probably the input file contains a run taken at high rate). Backfill running at T1s - mostly non-site-related failures at FNAL, PIC, ASGC. CNAF - error opening RAW files on TSM in backfill jobs - Elog #1029, Elog #1046 and Elog #1061.
    • T2 highlights: MC ongoing in all T2 regions. Upload of MC from T2_PL_Warsaw to T1_DE_KIT failing - HTTP_TIMEOUT from the Warsaw SRM (ticket not answered). Regeneration of the input GEN files that were invalidated for duplicate events is in progress at CERN. T2_IN_TIFR SAM and JobRobot test errors - now OK. T2_RU_JINR SAM test errors - squid server not reachable (HW issues).
    • Other: CMS SAM test not submitted for some time after upgrade, reverted to old submission method.

  • ALICE reports (Patricia)-
    • GENERAL INFORMATION: Pass 1 reconstruction tasks going on together with the train analysis (about to finish, 90% completed). Registration of pending VOBOXes in the GOCDB: after today's meeting with the SAM experts we have concluded that the ALICE VOBOXes need to be registered in the GOCDB. At this moment only 27 VOBOXes out of 90 nodes are properly registered in that system. This will be discussed with the sites during this week's ALICE TF meeting.
    • T0 site: voalice14 will be moved out of production to be used as a development machine. A new VOBOX will be required this week to take over the production tasks previously performed on voalice14. CE202 is now back in production according to Ulrich/Ricardo.
    • T1 sites: The MC cycle has been successfully completed with no incidents to report.
    • T2 sites: Testing of the CREAM-CE system at Hiroshima T2 scheduled for this afternoon.
    • Just a few minutes before the ops meeting the site admin from Strasbourg (IRES) announced the setup of a CREAM-CE service available for ALICE. We will test it today so that it can be put into production today.

  • LHCb reports (Roberto)-
    • MC production running at the pace of 8K concurrent jobs.
    • T0 sites issues: none.
    • T1 sites issues: SARA (after the LRMS intervention yesterday): users (and also SAM) report problems getting their jobs running (all stuck in Scheduled status). Ron was notified via private mail. It seems that the LHCb jobs at SARA were in Scheduled state because many ATLAS jobs were running at SARA, and not because of a failure. GRIDKA: SQLite problems due to the usual nfslock mechanism getting stuck; the NFS server was restarted (a sketch of this failure mode is given after the experiment reports).
    • T2 sites issues: INFN-CAGLIARI (aborting jobs), UKI-NORTHGRID-SHEF-HEP (uploading problem) UKI-LT2-UCL-CENTRAL (shared area) UKI-SCOTGRID-GLASGOW (uploading problems).
    • XROOT testing and discussing protocol publication. CERN, KIT, LYON and SARA seem ok, PIC is working on it. Not available at CNAF. RAL needs to upgrade CASTOR before offering an XROOT service.
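The GridKA SQLite problem mentioned in the LHCb report is the classic pattern of SQLite databases on an NFS shared area: SQLite relies on POSIX byte-range locks, so a stuck nfslock/rpc.lockd leaves every access hanging or failing with "database is locked". A minimal sketch of the kind of probe a shifter could run against a database on the shared area, using only Python's built-in sqlite3 module; the path is illustrative:

```python
#!/usr/bin/env python
# Quick check that SQLite locking works on an NFS-mounted shared area.
# SQLite needs POSIX byte-range locks; if rpc.lockd/nfslock is stuck,
# the write below hangs or fails with "database is locked".
import sqlite3

DB = "/shared/lhcb/sw/install.db"     # illustrative path on the shared area

try:
    conn = sqlite3.connect(DB, timeout=10)   # give up after 10 s instead of hanging
    conn.execute("CREATE TABLE IF NOT EXISTS lock_probe (ts TEXT)")
    conn.execute("INSERT INTO lock_probe VALUES (datetime('now'))")
    conn.commit()                            # the commit forces an exclusive lock
    conn.close()
    print("SQLite locking on %s looks OK" % DB)
except sqlite3.OperationalError as exc:
    # Typical symptom when the NFS lock manager is wedged.
    print("SQLite locking problem on %s: %s" % (DB, exc))
```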

Sites / Services round table:

  • PIC: NTR
  • CNAF: LHCb DB upgrade will take place Thursday next week. STORM upgrade for LHCb and CMS will take place next week.
  • FNAL: NTR
  • BNL: NTR
  • NLT1: Problem with one disk server at NIKHEF, which triggered ATLAS failures. The problem seems to be due to an Infiniband driver and looks similar to the problem at SARA last weekend. Kernel timeout values need to be increased.
  • RAL: NTR
  • NDGF: BDII for FTS stopped updating last night. Restarted. Ok now.
  • GridPP: the BDII problem reported by LHCb does not seem to occur anymore. LHCb will confirm.
  • ASGC: The parallel power system was tested successfully during the scheduled downtime this morning. Most services recovered quickly except for LFC and FTS, which hit an Oracle block error; a 2-hour unscheduled downtime was declared for this. These two services recovered at 14:15 and we are still monitoring them.
  • IN2P3: NTR apart from the lack of a GGUS ticket for the CMS problems reported above: there seems to be some misunderstanding between the local CMS expert and the IN2P3 operations staff about the need to create a ticket for every problem. IN2P3 insists that a ticket should be created and Nicolo will make sure that this is followed up in CMS.
  • KIT: NTR
  • OSG: the SAM records missing from last weekend have been retransferred. The Midwest T2 problem still needs to be investigated (GGUS 56122).

  • CERN central services:
    • CE202 back in production as reported by ALICE.
    • Hot fix on CASTOR public tomorrow: entire site marked at risk, but LHC experiments should not be affected.

AOB:

  • MariaDZ: The team ticket submission form has a mandatory field "MoU". Should it be kept or removed? The decision is to keep the field but accept the value "not applicable" for some sites. Development progress will be reflected in https://savannah.cern.ch/support/?113091

Thursday

Attendance: local(Maria, Jamie, Majik, Jean-Philippe, Simone, Miguel, Jacek, Harry, Lola, Nicolo, Roberto, Ricardo, Graeme, Alessandro, MariaDZ);remote(Jon/FNAL, Michael/BNL, anon, Gareth/RAL, Ronald/NL-T1, Rolf/IN2P3, Rob/OSG, Jeremy/GridPP, Xavier/KIT, Jens/NDGF, Gang/ASGC).

Experiments round table:

  • CMS reports -
    • T1 Highlights:
      • MC processing running at T1s.
        1. FNAL - file open errors, fixed by restarting dcap doors.
        2. RAL - also file open errors on Castor; reducing the number of jobs running.
        3. Both issues caused by the special nature of these workflows: a large number of jobs accessing the same few GEN input files.
      • ReReco running at PIC, CNAF, FNAL
        1. Merge failures at PIC, VSIZE limit reached.
      • CNAF - corrupt files identified, retransferring.

  • ALICE reports - GENERAL INFORMATION: Pass1 reconstruction and some remaining user analysis jobs are the current activity of Alice
    • T0 site
      • 2nd CREAM-CE at CERN: ce202.cern.ch has been checked and its good behaviour validated by ALICE. Ready to enter production.
      • CASTOR upgrade announced yesterday afternoon to update the system to the latest 2.1.9-4 version. Both offline and online representatives have given the green light to perform this operation (it will take about 2 hours) before the end of this week.
    • T1 sites - No issues to report
    • T2 sites
      • Hiroshima CREAM-CE service tested. Submission problems coming from some configuration issues have been reported to the site admin. The system cannot yet be put into production.
      • Strasbourg CREAM-CE service tested and validated by ALICE. The system is entering production today.

  • LHCb reports - MC productions running at sustained regime w/o major problems
    • T0 sites issues - Proposed time-slot for CASTOR upgrade: the sooner the better or at the next LHC stop. [ Tentatively agreed for Monday depending on LHC schedule. ]
    • T1 sites issues:
      • SARA: the failing critical FileAccess SAM test has been understood by the core application developers: some libraries (libgsitunnel) for the SLC5 platform were not properly deployed in the AA.
      • NIKHEF: issue with HOME not set on some WNs yesterday afternoon. The issue was hit by some jobs in the (very) short period during which the local watchdog, which restarts the nscd daemon when it dies, was not running.
    • T2 sites issues:
      • Shared area issue at INFN-NAPOLI and an uploading problem at UKI-LT2-Brunel. This is the fourth UK site (in addition to the ones reported yesterday) banned in LHCb because of this problem, which is under investigation.
      • The issue with the BDII "flickering" was due to a bug in the DIRAC script that queries the BDII and fills LHCb's own information system.
    • DB (Jacek) - logical corruption in the LHCb offline DB, introduced on Tuesday due to a misconfiguration. The schema was restored to a time in the past, which took 2 hours. The schema was streamed to online and the 6 T1 sites; because the apply processes were misconfigured at 5 sites and online, the restore was not correctly propagated there. This was discovered yesterday evening. LHCb online, GridKA and SARA were fixed immediately, the others this morning. CNAF was working well all the time as its apply process was configured properly. All replication is now OK (since ~10:00 today).

Sites / Services round table:

  • FNAL - more about the problem Nicolo mentioned at FNAL with dcap doors. This was a new workflow never run before at FNAL. When traced down, the jobs running on the WNs were all accessing one file, which was opened 2100 times; this ultimately caused problems in the dcap doors, which were restarted to get things back online quickly. Proposed to CMS to replicate the file - the file that will be opened many times is known before the job runs. It has been replicated to 120 pools on different nodes and we are waiting for the job to run again to see whether this works (a hot-file detection sketch is given at the end of this round table).
  • BNL - ntr
  • RAL - same issue: CMS hot files. Read requests to 2 servers for 3 files. Looking at what we can do; for future runs maybe use different file(s). Jon - CMS needs to inform sites before running this type of workflow so that the relevant file(s) can be replicated, or run fewer jobs so that a file is not opened so many times. Miguel - could also consider copying the file locally.
  • NL-T1 - next week on Wednesday FTS will be upgraded to 2.2.3. On Thursday dCache head nodes will be migrated to new h/w.
  • IN2P3 - ntr
  • GridPP - following up on 4 LHCb sites that have problems.
  • OSG - ntr
  • KIT - ntr
  • NDGF - ntr
  • ASGC - ntr

  • CERN CASTOR operations - before the upgrade on CASTOR public there was an inconsistency in the DB and jobs got locked for ~15 minutes. A link to the incident report will be sent. The inconsistency is known but we need to understand how it was generated.
  • CERN DB - issue with the streaming of CMS PVSS data from online to offline, between 2 production DBs.
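The FNAL and RAL "hot file" incidents above have the same root cause: thousands of jobs in one workflow opening the same few GEN input files, which overwhelms individual dcap doors or disk servers. The mitigation discussed (replicating or locally copying the file) requires spotting such files before submission. A minimal pre-submission check, assuming a hypothetical JSON dump of the workflow's job descriptions with an input_files field (file name, format and threshold are illustrative, not an existing CMS tool):

```python
#!/usr/bin/env python
# Pre-submission check: count how many jobs of a workflow read each input
# file, so that "hot" files can be replicated to extra pools (or copied
# locally) before the jobs start. Input format is a hypothetical JSON list
# of job descriptions, each with an "input_files" list.
import json
from collections import Counter

HOT_THRESHOLD = 200   # ask for replication if more jobs than this share a file

with open("workflow_jobs.json") as fh:        # illustrative file name
    jobs = json.load(fh)

opens = Counter()
for job in jobs:
    for lfn in job.get("input_files", []):
        opens[lfn] += 1

hot = [(lfn, n) for lfn, n in opens.items() if n >= HOT_THRESHOLD]
for lfn, n in sorted(hot, key=lambda x: -x[1]):
    print("HOT: %s opened by %d jobs - replicate before submission" % (lfn, n))
```

In the incident above FNAL replicated the hot file to 120 pools by hand; a check like this would let the experiment warn sites before such a workflow runs.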

AOB:

Friday

Attendance: local(Nico, Graeme, Maria, Jamie, Harry, Roberto, Majik, Timur, Jean-Philippe, Lola, Alessandro, Ricardo);remote(Jon/FNAL, Michael/BNL, Onno/NL-T1, Rolf/IN2P3, Jeremy/GridPP, Andreas/KIT, Gang/ASGC, Gareth/RAL, Gonzalo/PIC, Rob/OSG).

Experiments round table:

  • CMS reports -
    • T1 Highlights:
      • MC processing and ReReco running at T1s.
        1. Some jobs waiting for files from Castor at RAL
        2. Investigating merge job memory issues at PIC
    • T2 highlights
      • MC ongoing in all T2 regions.
        1. Some jobs seem lost in a communication problem between CREAM CEs and the WMS (status Done on CREAM but Running on the WMS); a cross-check sketch is given after the experiment reports.
    • Other
      • SLS taking a long time to render plots due to DB issues; fixed (Remedy #665986). Harry - SLS was down yesterday evening.

  • ALICE reports - GENERAL INFORMATION: Production dominated by the user analysis tasks (Pass 1 reconstruction still ongoing and, in addition, the train analysis TR017: ESD+MC -> AODMC + delta AOD, 91% already completed).
    • T0 site - ALICE Online agrees with the CASTOR update (announced this week) for next Monday the 8th. (Announcement already sent by the CASTOR team)
    • T1 sites - No issues to report
    • T2 sites - Hiroshima CREAM-CE: the problem reported yesterday has been solved this morning and validated by ALICE support.

  • LHCb reports -
    • MC productions running at sustained regime w/o major problems (9-10K jobs concurrently)
    • Organizing the first LHCb T1 Jamboree meeting in Amsterdam (22nd and 23rd of March). All site representatives/site-service managers at the T1s and CERN are invited. Confirmations have already come from the T1s; we still need a few names of people from CERN. Agenda still to be finalized.
    • T0 sites issues: CASTOR upgrade to 2.1.9-4 + SRM upgrade to 2.8-6 on Monday the 8th (9:00 -11:00 am CET).
    • T1 sites issues:none
    • T2 sites issues: Shared area issue at INFN-MILANO
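The CREAM-CE/WMS state mismatch in the CMS report above (Done on the CE but still Running on the WMS) is usually confirmed by cross-checking the two status sources job by job. A minimal sketch built around the standard gLite command-line tools via subprocess, assuming a plain text file of WMS job IDs; the file name and the "Current Status:" parsing are assumptions about the local setup and CLI output, so treat it as a shape rather than a finished monitor:

```python
#!/usr/bin/env python
# Cross-check job states when CREAM says Done but the WMS still says Running.
# Calls the standard gLite CLI via subprocess; the job-ID file name is
# illustrative and the "Current Status:" parsing is an assumption about the
# glite-wms-job-status output format.
import subprocess

def wms_status(job_id):
    out = subprocess.Popen(["glite-wms-job-status", job_id],
                           stdout=subprocess.PIPE,
                           universal_newlines=True).communicate()[0]
    for line in out.splitlines():
        if "Current Status" in line:
            return line.split(":", 1)[1].strip()
    return "UNKNOWN"

with open("running_jobs.txt") as fh:          # illustrative list of WMS job IDs
    job_ids = [l.strip() for l in fh if l.strip()]

for jid in job_ids:
    if wms_status(jid).startswith("Running"):
        # Candidates for the CREAM/WMS mismatch: check them on the CE side
        # (glite-ce-job-status) or resubmit if their state never updates.
        print("still Running on WMS, check CREAM side: %s" % jid)
```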

Sites / Services round table:

  • FNAL - ntr
  • BNL - As Graeme mentioned we had a brief SE outage. The problem was detected by the facility alarm system during the early morning (no GGUS ticket was submitted) and was presumably caused by a network glitch. A port card was rebooted; following the reboot of the port card a synchronisation problem between the PNFS server and the SRM caused transfer timeouts. Investigation by BNL networking and Force10 suggests replacing all 10 Gbit port cards.
  • NL-T1 - ntr
  • KIT - ntr
  • ASGC - ntr
  • RAL - ntr
  • PIC - ntr
  • IN2P3 - ntr
  • GridPP - ntr
  • OSG - ntr

  • CERN - ntr

AOB:

  • Next Thursday there will be a Tier1 Service Coordination meeting. Services + support + problem handling will be discussed, including alarms / monitoring, with a short report from each service. Also a request for a DB Streams SIR for the problem seen yesterday.
  • Physics officially starts on the Ides of March.

-- HarryRenshall - 26-Feb-2010
