Week of 110502

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Ueda, Dan, Ramon, Dirk, Maria, Jamie, Mike, Nilo, Eva, Massimo, Maarten, MariaDZ, Alessandro, Ignacio);remote(Michael, Gonzalo, Jon, Onno, Rolf, Ian, Rob, Chiara Genta/CNAF, Federico, Tore, Todd/ASGC).

Experiments round table:

  • ATLAS reports -
    • ATLAS: collecting a lot of data (30 pb^-1 per day)
    • T0/Central Services:
      • Sunday morning at CERN (GGUS:70164) 700+ source transfer errors from CERN-PROD_SCRATCHDISK to CERN-PROD_PERF-JETS: source file failed on the SRM with error [SRM_ABORTED]
        • Fixed by a hard reboot of the disk server
    • T1s:
      • Saturday AM ALARM to NDGF (GGUS:70157) "all transfers fail with SRM Auth Failed"; T0 export affected
        • Alarm at 6:12. Solution at 7:42 GMT: one dCache component was stuck and needed manual intervention. Possibly triggered by a larger than usual number of queued staging requests.
      • TAIWAN finished the scheduled downtime. Transfers started OK, but eventually there was no space left on tape (GGUS:70166).
        • TAIWAN freed 10 TB, which resolved the problem.
      • SARA FTS Overwrite (GGUS:70128). Message from the FTS developers is that delete/recopy is not implemented, so there is nothing SARA can do - propose to close the GGUS ticket (see the sketch below). [ Maarten - SARA does not want to enable the flag? Dan - that's my understanding ] [ Ueda - from ATLAS we want the possibility to overwrite if this flag is set. ]
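    Since overwrite (delete/recopy) is not implemented in FTS itself, the only option on the ATLAS side is the manual equivalent: delete the destination replica and resubmit the transfer. Below is a minimal, hedged sketch of that workaround using the gLite command-line tools of the time; the FTS endpoint and SURLs are placeholders, not values from the ticket, and the lcg-del flag should be checked against the local client version.
<verbatim>
#!/usr/bin/env python
# Sketch only: manual "overwrite" for an FTS 2.x transfer, i.e. delete the
# destination replica first, then resubmit the copy. Endpoint and SURLs are
# placeholders; real values would come from the DDM bookkeeping.
import subprocess

FTS_ENDPOINT = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"  # placeholder
SRC = "srm://source.example.org/atlas/scratchdisk/some/file"       # placeholder SURL
DST = "srm://destination.example.org/atlas/datadisk/some/file"     # placeholder SURL

def overwrite_transfer(src, dst):
    # 1) remove the existing destination replica; "-l" is meant to skip the
    #    file catalogue lookup (flag from memory - verify with lcg-del --help)
    subprocess.check_call(["lcg-del", "-l", dst])
    # 2) resubmit the copy through FTS; glite-transfer-submit prints a job id
    return subprocess.check_output(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, src, dst]).decode().strip()

if __name__ == "__main__":
    print(overwrite_transfer(SRC, DST))
</verbatim>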

  • CMS reports -
    • LHC / CMS detector
      • Good data taking over the weekend
    • CERN / central services
      • NTR
    • Tier-0 / CAF
      • High memory consumption on T0/CMSSW application side seems to have been alleviated with a patch. No jobs needed to be killed over the weekend.
      • Problem with submitter over the weekend due to expired AFS token. Promptly recovered.
    • Tier-1
      • The 2010 reprocessing ongoing at the Tier-1s is mostly complete.
      • ASGC still failing availability tests after the downtime. A Savannah ticket has been opened for the local CMS site contact.
    • Tier-2
      • MC production and analysis in progress (summer11 production)
    • Other
      • CMS still waiting for unscheduled downtimes to be included in the SSB, see Savannah:119944
    • SSB - issue was reported again on Friday and has since been resolved (Mike)

  • ALICE reports -
    • General Information: Production has dropped significantly to give priority to user analysis jobs for the coming two weeks. More or less only urgent productions are still going on.
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operation

  • LHCb reports -
    • RAW data distribution and FULL reconstruction are ongoing at most Tier-1s.
    • A lot of MC continues to run.
    • T0
    • T1
      • SARA problem with aborted pilots for reco jobs (GGUS:70170). Problem fixed this morning.
      • RAL jobs looping over 3 unavailable files (GGUS:70158).
      • GridKa - some jobs have been failing for a few days while the site is changing space tokens; SAM jobs are failing. The people responsible for the SAM tests are on holiday. Stefan is looking into it, but it is not clear how to make a quick fix.
      • IN2P3 - some jobs were failing up to Friday, but Stefan solved the issue - should be OK now.
    • T2

Sites / Services round table:

  • BNL - ntr.
  • PIC - ntr.
  • FNAL - ntr.
  • NL-T1 - nta to ATLAS & LHCb reports.
  • IN2P3 - two things: 1) Planned outage on 24 May, affecting nearly all services; details of what will be stopped are still being defined, but batch & storage will be stopped. Not yet known whether transfers from CERN to the T1 will also be affected - they might be, as the dCache core server will be upgraded. TBC. 2) The computing centre is changing its batch system to Grid Engine. This has been planned and ongoing for some time; non-grid users are already in a 'beta testing phase', and a CREAM CE has been run with Grid Engine with success. Now going to switch to Grid Engine for real: in coordination with the local LHC support people, one CREAM CE will be opened so that they can start to test the new batch system. 60% of the current compute power should be on Grid Engine by end-June; the rest will depend on usage etc.
  • CNAF - yesterday we had a connection problem that affected s/w area. Lasted ~1h then service restored by manual intervention.
  • NDGF - problem with SRM server Saturday morning early. Resolved around 08:00.
  • ASGC - ntr.
  • OSG - ntr.

  • CERN DB - reminder that between today and Wednesday inclusive the INT DBs are being patched - a rolling intervention, so they should be available all the time.

AOB:

Tuesday:

Attendance: local(Ian, Stefan, Gavin, Jamie, Maria, Ueda, Mike, Michal, Alessandro, Massimo, Maarten, Eva, MariaDZ);remote(Tore, Paco, Michael, Jon, Gonzalo, Jeremy, John, Rolf, Chiara, Rob, Xavier, Jhen-Wei).

Experiments round table:

  • ATLAS reports -
    • LHC : Mon-Tue Physics fills, Wed-Sun Machine Studies, Mon-Thu Technical Stop (http://op-webtools.web.cern.ch/op-webtools/vistar/vistars.php?usr=LHCCOORD)
    • ATLAS : runs Tue 3, Wed 4 morning, Fri 6 night (20h-4h); otherwise tests and calibrations
    • FZK (GGUS:70199 - alarm) 2011-05-02 19:07 UTC CONNECTION_ERROR, HTTP_TIMEOUT
      • solved 2011-05-02 19:56
      • alarm 2011-05-02 19:58 (apologies for noticing the solution late)
    • FZK : (GGUS:70222) 2011-05-03 08:29 UTC SRM_FAILURE: Pinning failed: CacheException(rc=10007;msg=Entry not in repository
      • files staged after a while
      • error rate is not too high ~30/100 (8h-8h20 UTC), ~20/100 (9h-9h20)
      • no more such failure after 09:16 UTC
    • CERN EOS (GGUS:70207) 2011-05-03 04:05 UTC GRIDFTP_ERROR: open/create error: No space left on device; solved 2011-05-03 09:34 (a software bug). NOTE: ATLAS will submit GGUS tickets when we see problems in EOS (in the past the agreement was to avoid GGUS tickets for EOS). [ nodded agreement from IT-DSS ]
    • TW (GGUS:70228) 2011-05-03 11:13 UTC
      • backlog in data export from T0 to Taiwan-LCG2/TW-FTT
      • possibly due to slow transfer rate (e.g. 200 kB/s for 1 GB files) - see the estimate below
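    For scale, a back-of-the-envelope estimate (illustrative only, using the round numbers quoted above): at 200 kB/s a single 1 GB file takes well over an hour, longer than typical FTS transfer timeouts, so a T0-export backlog builds up.
<verbatim>
# Illustrative arithmetic for the quoted CERN -> Taiwan-LCG2 rate.
file_size_bytes = 1e9        # 1 GB file, as quoted above
rate_bytes_per_s = 200e3     # 200 kB/s observed
t = file_size_bytes / rate_bytes_per_s
print("per-file transfer time: %.0f s (~%.1f h)" % (t, t / 3600.0))
# -> ~5000 s (~1.4 h) per file, hence the growing export backlog.
</verbatim>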

  • CMS reports -
    • LHC / CMS detector
      • Quiet evening without beam
    • CERN / central services
    • Tier-0 / CAF
      • Prompt Reco from the weekend is working its way through the system. Running at a high load.
        • Memory issues seem improved, but some work is still needed - an upgrade of the s/w during the technical stop will hopefully fix this.
    • Tier-1
      • Some prestage requests were sent to the site contacts for 2010 reprocessing, but generally proceeding
      • ASGC still failing availability tests after the downtime. Transfer quality is extremely poor from CERN->ASGC but 100% successful from FNAL->ASGC. "Something is up" [ Jhen-Wei: about the slow transfer rate - will ask the network manager for info - seems to be some problem with the (under-)sea cable ]
      • Large scale MC with out of time pile-up expected within the next few days - dataops requested replication of data samples that are required for this.
    • Tier-2
      • MC production and analysis in progress (summer11 production)

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports -
    • RAW data distribution and FULL reconstruction are ongoing at most Tier-1s.
    • A lot of MC continues to run.
    • T0
    • T1
      • SARA problem with aborted pilots for reconstruction jobs (GGUS:70170). [ Paco - problem at SARA: related to tape that cannot be mounted. Last update that I have - under investigation. Update ticket soon! ]
      • RAL jobs looping over 3 unavailable files (GGUS:70158). (problem with tape server - these files should be available again)
      • PIC SAM SRM tests working again as of tonight. There was a problem due to changes in space token names.
      • CNAF - the same space-token change happened. SAM SRM tests changed today.

Sites / Services round table:

  • NDGF - ntr
  • NL-T1 - ntr
  • BNL - ntr
  • FNAL - ntr
  • PIC - ntr
  • RAL - ntr
  • CNAF - ntr
  • IN2P3 - ntr
  • ASGC - about CMS SAM tests: removed old s/w on WNs(?) and now ok
  • KIT - nta
  • GridPP - ntr
  • OSG - opened a ticket with GGUS yesterday regarding the BDII - the clock of one BDII is about 1 hour off. This turned into a SNOW ticket which we don't have access to. It doesn't seem as if the GGUS ticket is getting updated, so we don't know if there has been any activity on it. GGUS:70197 [ MariaDZ - will check the ticket and reassign if appropriate. ] The SNOW ticket is with 3rd line support and there has been no update to it.

  • CERN DB - the Oracle patching of the INTR DBs is progressing well. Scheduling for the production DBs will start next week. INT11R was moved to different h/w running 11g clusterware but with the DB still at 10.2.0.5; it will be upgraded next Tuesday afternoon.

  • CERN storage - to confirm that EOS tickets are welcome. Planning: CASTOR NS upgrade in 10 days for the SLC5 migration; 12 May as a possible date (technical stop). Meant to be transparent, but the impact could be large as it is the only service shared across all experiments. VOMS downtime on 10 May - can this be co-scheduled? 12 May is the last day of the technical stop, so maybe earlier? Tape reconfigurations during the same period, for LHCb especially. Ale - will report this to ATLAS management so the date can be confirmed.

AOB:

Wednesday

Attendance: local(Ueda, Jamie, Maarten, Stefan, Mattia, Michal, Luca, Alessandro, Massimo, Pedro, Gavin, Pepe, Ian, MariaDZ);remote(Michael, Jon, Gonzalo, John, Rolf, Tore, Onno, Jhen-Wei, Chiara, Rob, Dimitri).

Experiments round table:

  • ATLAS reports -
  • LHC : Machine Studies, ATLAS : reconstruction and export of data from yesterday
  • Tier-1
    • TW : low transfer rate (GGUS:70228) 2011-05-03 11:13 UTC
      • T0-export affected
      • TW ASGC-AMS Link down caused by a submarine cable cut (GGUS:70194)
    • FZK : GRIDFTP_ERROR (GGUS:70268) 2011-05-04 11:44 UTC
      • T0-export affected
      • reached the limit of the GridFTP doors for the maximum count of 1000 transfers
  • Tier-2
    • AGLT2 (GGUS:70251) 2011-05-04 06:32 UTC
      • T0-export affected (for calibration stream)
    • dlopen error: liblcgdm.so (GGUS:70264) 2011-05-04 09:24 UTC
      • known issue in glite-WN/3.2.10-0 (missing symbolic links for libdpm.so, liblfc.so or liblcgdm.so) [ some sites have fixed it with a manual patch - document it in "known issues". Broadcast this? See the sketch after this report. ]
      • already reported at LCG.WLCGDailyMeetingsWeek110411#Friday and some sites have fixed (BUG:80061, GGUS:70033)
      • Can WLCG make sure that the sites are aware of the issue and the patch?
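    As a rough illustration of the manual patch mentioned above (a sketch only: the library directory and the versioned file names are assumptions and differ between WN installations), the fix amounts to recreating the unversioned .so links that glite-WN 3.2.10-0 does not ship:
<verbatim>
#!/usr/bin/env python
# Sketch of the manual work-around for BUG:80061 on a glite-WN 3.2.10-0 node:
# recreate the unversioned .so symlinks so that dlopen("liblcgdm.so") etc. work.
# LIB_DIR and the versioned file names are assumptions - check the actual WN.
import glob
import os

LIB_DIR = "/opt/lcg/lib64"          # assumed location of the LCG libraries
MISSING = ["libdpm.so", "liblfc.so", "liblcgdm.so"]

for name in MISSING:
    link = os.path.join(LIB_DIR, name)
    if os.path.exists(link):
        continue                                 # link already there
    versioned = sorted(glob.glob(link + ".*"))   # e.g. liblcgdm.so.1, ...
    if not versioned:
        print("no versioned library found for " + name)
        continue
    os.symlink(os.path.basename(versioned[0]), link)
    print("created %s -> %s" % (link, os.path.basename(versioned[0])))
</verbatim>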

  • CMS reports -
    • LHC / CMS detector
      • Quiet evening (now week...) without beam after the early evening
    • CERN / central services
    • Tier-0 / CAF
      • Prompt Reco from weekend mostly caught up
        • Memory issues seem improved, but some work is still needed. There will be some resubmissions
    • Tier-1
      • 2010 re-reco is mostly complete. Skims are running
      • ASGC still failing availability tests after the downtime (submarine cable cut - will not be repaired until the 9th). [ Jhen-Wei: sorry about the sea cable cut. Will let you know once our link is back. News from the service provider: back next Monday. For the moment we can change the FTS channel configuration to a lower rate. Set an "AT RISK" event in GOCDB? Will announce via EGI broadcast ]
      • Large scale MC with out-of-time pile-up expected; pilot samples will be launched when the MinimumBias sample is replicated
    • Tier-2
      • MC production and analysis in progress (summer11 production) - about 200K completed analysis jobs / day

  • ALICE reports -
    • T0 site
      • Users experienced many errors with the (ALICE) Central Services yesterday due to a problem with a switch. It needed to be restarted and the issue was solved.
    • T1 sites
      • FZK: there were problems with the xrootd SE, not completely resolved yet. Experts are looking into that.
    • T2 sites
      • Usual operations

  • LHCb reports -
    • RAW data distribution and FULL reconstruction are ongoing at most Tier-1s.
    • A lot of MC continues to run.
    • T0
    • T1
      • SARA problem with aborted pilots for reconstruction jobs (GGUS:70170). [ Memory monitoring was implemented last week, but it seems that the water mark is too high for LHCb reconstruction jobs ]
      • RAL jobs looping over 3 unavailable files (GGUS:70158). [ Has been fixed, but we still cannot confirm as there is a backlog at RAL to process ]
    • T2

Sites / Services round table:

  • CNAF - ntr
  • BNL - ntr
  • FNAL - ntr
  • PIC - ntr
  • RAL - seeing problems with the LHCb SRM - an LHCb disk space token is only 3% free, all on one disk server. Currently putting two standby disk servers into the pool.
  • IN2P3 - ntr
  • NDGF - yesterday at 14:30 the LFC server was unavailable for 30' due to a scheduled downtime which was not properly announced. At 15:00 an ATLAS data pool was lost due to a faulty RAID controller. Tomorrow there is a scheduled downtime 13:00 - 17:00 which will result in some ALICE and ATLAS data being unavailable.
  • NL-T1 - a few minutes ago a switch broke in the compute cluster at SARA; the CE and CREAM CE are in unscheduled downtime. Reminder: on Tuesday 10 May the SRM will be in scheduled maintenance. Another reminder that tomorrow is a Dutch national holiday, so no one will be on the call.
  • ASGC - the timeout value in the FTS configuration has been changed for the ASGC-CERN and CERN-ASGC channels, but we don't have the privileges for the CERN FTS, so will submit a request for this to be changed at CERN.
  • KIT - ntr
  • OSG - ntr

  • CERN - ntr

AOB: (MariaDZ) Notification emails will be sent as of the May 25th GGUS Release to sites which have no local ticketing system on every GGUS ticket update. Affected sites are NDGF (requestor), UK-T1 (accepted), NL-T1 and ASGC (being informed now). Reasons are explained in https://savannah.cern.ch/support/?120243#comment11 [ Onno - we have no objection to this plan ]

  • CERN CASTOR intervention - ATLAS prefer before May 12 as this is the last day of the technical stop. Massimo - the 2nd choice was Tuesday 10. Ueda - this would be OK. Confirmed also by CMS and LHCb.

Thursday

Attendance: local(Massimo, Maarten, Jamie, Maria, Dirk, Ueda, Mike, Michal, Alessandro, Ignacio, Stefan, Eva, Manuel, Nilo);remote(Jon, Tore, Michael, Wejien, Todd, Jeremy, Gonzalo, Marc, Tiju, Paolo, Onno, Chiara, Ian).

Experiments round table:

  • ATLAS reports -
    • LHC : Machine Studies
    • ATLAS : reconstruction and export of data still on-going
    • IN2P3-CC : dCache problem (GGUS:70280) 2011-05-04 19:06 UTC [ Problem with the dCache server yesterday, all VOs impacted due to a memory leak on the dCache core server. It has impacted the SRM servers - had to drain batch for all experiments. A downtime was added this morning. The cause is a dCache problem which has been going on for a few weeks - discussing with the dCache developers, trying to understand what is going on. ]
    • FZK : LFC/FTS problem (no ticket) (one report from the German cloud) - would be nice to have some more info [ FZK - LFC problem. Around midday there was a short outage of the LFC for ATLAS, now solved. Rebooting some Oracle nodes solved the problem. No ticket as it was a short outage. ]
    • AOB - problem with the WN software yesterday; Maarten sent a broadcast about the work-around yesterday

  • CMS reports -
  • LHC / CMS detector
    • Machine Development
  • CERN / central services
    • Frontier configuration modified to help improve condition situation at ASGC
  • Tier-0 / CAF
    • Prompt Reco from weekend essentially complete
    • Replays scheduled to test new software version.
  • Tier-1
    • 2010 re-reco announced. Skims are running
    • ASGC has 2 issues
      • Frontier conditions were not being updated due to poor network connectivity on the backup link. Dave Dykstra trying to help
        • Massive job failures
      • Transfer quality is difficult to understand: ASGC transfer quality is green to all Tier-1s except CERN, where it is very low.
    • The first 30M simulation events with out-of-time pile-up were launched last night. As expected this puts a very high load on the storage infrastructure for reads. Sites were asked to replicate the samples, but some impact on storage systems is still observed.
  • Tier-2
    • MC production and analysis in progress (summer11 production)
    • T2_IN_TIFR dropped off the BDII. Referred to their local ROC for help.

  • ALICE reports -
    • T0 site
      • A high level of user activity has led to some "interesting" observations in the AliEn central services.
      • The Job Broker was down for some hours last night, which at one point left around 115 k jobs waiting in the Task Queue. The value is still abnormally high (80 k).
    • T1 sites
      • FZK: there were problems with the SE, experts are looking into that. Still ongoing.
      • SARA: site was unavailable due to a problem with a switch. Its VOBOX has a problem that is being investigated. (Related??)
      • NDGF maintenance was noticed by users (job failures).
    • T2 sites
      • Usual operations

  • LHCb reports -

Experiment activities:

  • RAW data distribution and FULL reconstruction are ongoing at most Tier-1s.
  • Cleaning of old data to be started (~ 1/2 PB)
  • A lot of MC continues to run.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
  • T1
    • SARA problem with aborted pilots for reconstruction jobs (GGUS:70170).
    • RAL jobs looping over 3 unavailable files (GGUS:70158).
    • RAL has increased the disk pools that were reported low yesterday
    • Change of space token names has been done at GridKa
  • T2

Sites / Services round table:

  • BNL - had a brief outage of storage services last night - the dCache namespace manager became unresponsive, probably due to a problem at the O/S level. A reboot cured it. Nothing in the logs.
  • FNAL - ntr
  • NDGF - ntr
  • KIT - except for LFC (see ATLAS above) ntr
  • ASGC - we will have a downtime on May 7 at 07:30 - 15:30. The purpose is a DPM migration (moving the DPM SRM from T2 to T1).
  • IN2P3 - nta
  • PIC - ntr
  • RAL - ntr
  • CNAF - Paolo: ntr
  • GridPP - ntr

  • CERN DB: all INTR DBs have been patched. Next week we will have a round of production DBs, also the CASTOR DBs, with the latest security and other recommended patches.
    • 9th May
      • CMSR
      • ALIONR - to be confirmed
      • ATONR
      • LHCBONR - to be confirmed
      • PDBR
    • 10th May
      • ATLR
      • Downstream DBs (ATLDSC, LHCBDSC) -> downtime of the replication to Tier-1 sites
    • 11th May
      • LCGR

AOB:

Friday

Attendance: local(Simone, Ueda, Jamie, Massimo, Michal, Mattia, Stefan, Alessandro, Maarten, Nilo, Eva, Manuel);remote(Jon, Gonzalo, Michael, Xavier, Tore, Jhen-Wei, Kyle, Rolf, Tiju, Chiara, Ian).

Experiments round table:

  • ATLAS reports -
  • LHC : Machine Studies
  • ATLAS : export of data still on-going to TW (backlog)

  • TW : LFC problem (elog:25092, GGUS:70305) 2011-05-05 16:55 UTC
    • solved 2011-05-06 05:36 [ The claim from the Taiwan people is that the backup link is significantly slower ]
  • IN2P3-CC : another dCache problem (ELOG:25100, GGUS:70309) 2011-05-06 02:06 UTC
    • solved 2011-05-06 09:46 [ IN2P3 - the dCache problem is being worked on in collaboration with the developers - it appears to be a memory leak in the Java VM. In this case an overload by CMS activity triggered the problem. Not specifically a CMS problem, but too much load makes the memory leak appear. Still being worked on, but it will happen again until the developers find a way to avoid it. ]


  • CMS reports -
  • LHC / CMS detector
    • Machine Development
  • CERN / central services
    • PhEDEx central agents were down for ~1 hour yesterday; components were waiting for a lock. The issue was also observed 1 week ago; experts are investigating.
  • Tier-0 / CAF
    • Ticket opened to IT for jobs stuck in the running state although already finished; this is preventing us from closing out a run
    • Replays of Tier-0 workflows have begun
  • Tier-1
    • Simulation with out-of-time pile-up was launched last night. As expected this puts a very high load on the storage infrastructure for reads. Success at KIT, IN2P3, FNAL. No requests yet to RAL, and we're seeing efficiency issues at PIC - some sort of failure of the staging infrastructure: input files were not staged as believed; tracking this down with the local site contacts.
  • Tier-2
    • MC production and analysis in progress (summer11 production)


  • ALICE reports -
    • T0 site
      • Continuously high user activity, besides production running in the background at lower priority than usual. The task queue remains unusually loaded (~70 k jobs where 30 k is normal) while the grid works fairly well (75 k jobs done OK in 1 day, up to 30 k running in parallel).
    • T1 sites
      • FZK: no more user complaints about the xrootd SE, but it still failed the standard tests this afternoon; being looked into.
      • SARA: the VOBOX had run out of space on /var due to the AliEn SW cache; it was OK again after a cleanup.
    • T2 sites
      • Usual operations


  • LHCb reports -

Experiment activities:

  • RAW reconstruction of current data almost completed, Stripping/Merging jobs in progress.
  • Data removal / archiving postponed because of backlogs in data management processes.
  • MC productions on most T1/T2 sites

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
  • T1
    • SARA problem with aborted pilots for reconstruction jobs (GGUS:70170).
    • RAL jobs looping over 3 unavailable files (GGUS:70158). Running right now - so OK.
    • The RAL space token renaming started yesterday. (Only SARA, IN2P3 and CERN are left to be done.)
  • T2

Sites / Services round table:

  • FNAL - ntr
  • PIC - ntr
  • BNL - ntr
  • KIT - ntr
  • NDGF - ntr
  • ASGC - yesterday the tablespace of the DB was full, so files could not be registered properly. The tablespace was increased and it is now OK. Network issue - no news; the backup link is very busy - will confirm with the network manager.
  • CNAF - ntr
  • IN2P3 - nta
  • RAL - ntr
  • OSG - ntr

  • CERN - ntr

AOB:

  • Q - LCGR intervention? It should be on Wednesday and transparent.

-- JamieShiers - 28-Apr-2011
