Week of 101108

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Jerka, Maria, Jamie, Hurng, Peter, Maarten, Simone, Andrea, Luca, Dirk, Harry, Ignacio, Nilo, Ricardo, Alessandro, Eddie, Lola, Roberto, Massimo, MariaDZ);remote(Michael, Xavier, Jon, Alexander Verkooijen, Daniele, Gonzalo, Federico, Rolf, John, Rob).

Experiments round table:

  • ATLAS reports -
    • SARA-MATRIX had BDII issues and no ATLAS jobs were running. GGUS ALARM ticket #63994 was sent on Saturday evening but was only acknowledged after 17 hrs. A bunch of emails were also sent to the SARA contact email (the one in GGUS) and to the site administrators, in addition to GGUS. Can SARA please make sure that ALARM tickets are acknowledged a bit sooner? Two separate issues: the BDII, and an issue with the local batch system. The SARA cloud was excluded from ATLAS DDM for the night and the cloud stayed offline for production until Sunday noon. The issue was fixed Monday morning. Thank you!
    • BNL: Sunday morning lots of "dCache" errors in DDM. GGUS ALARM ticket #63999 was sent. The issue was fixed within several hours of the report. Thank you!
    • INFN-T1: Saturday evening SRM failures. GGUS TEAM ticket #63995 (top priority) opened on Saturday evening. With no response to the TEAM ticket by Sunday morning, we escalated to GGUS ALARM ticket #64000. Issue with StoRM's underlying GPFS cluster; fixed within a couple of hours. Thank you! [ Daniele - on the INFN T1 StoRM issue: we didn't answer the TEAM ticket until Sunday morning as it was opened after midnight on Saturday. The problem was not with StoRM itself but with the underlying GPFS cluster, which had problems due to a human error. Actions have been taken to avoid such problems in the future. ]
    • Taiwan-LCG2: SRM issues, GGUS team ticket #63983
    • Lyon: No major problems over the weekend. Can you please comment on the evolution of the previous issues? Thank you! [ Rolf - the ATLAS jobs ran well this w/e as ATLAS increased the timeout values; this was sufficient to get the jobs through. Obviously this does not solve the underlying problem - we are working on it, with investigations and tests on the basis of LHCb and ATLAS jobs (LHCb has the same problem). No more details for now, but our Tier1 representative will give a more complete report at the next T1SCM. Any news will be reported here. ]

  • CMS reports -
    • Experiment activity
      • First HI collisions over the week-end
    • CERN and Tier0
      • CMS reconstructed first HI events/tracks
      • Tier-0: small SW worries for some Prompt Reco/AlCa workflows, but a patch was made and is now in place
      • SLS glitch this morning 10:50-11:20; now resolved
    • Tier1 issues and plans
      • IN2P3-CC : Monte-Carlo Pile-up production now complete. Will immediately switch to Heavy Ion simulation.
      • More generally: data re-reco + skimming of the new HI data at all Tier-1s
    • Tier2 Issues
    • AOB
      • The upgrade of the CMSWEB cluster last Thursday, Nov 4, in particular the new http==>https re-direction rule, caused problems for some CMS WebTools clients, eventually causing excessive memory usage (blockreplicas calls) on the CMSR node hosting PhEDEx, which needed to be rebooted on Friday, Nov 5, at 16:38 CET (see the redirect-handling sketch after this report).
        • Immediate action was to patch the most active clients (ProdAgent; CRAB already had an up-to-date version, but it was not deployed everywhere)
        • CMS Operators in alert state over the whole week-end to monitor the situation
        • Will discuss short/long-term solutions internally this afternoon
      • Next CMS Computing Run Coordinator, reporting here from tomorrow on: Oliver Gutsche
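
The CMSWEB issue above is essentially a set of clients that did not follow the new http==>https redirect. Purely as a hedged illustration (this is not the actual CMS WebTools or PhEDEx client code, and the URL below is a hypothetical placeholder), a minimal sketch in Python with the requests library:

    # Minimal sketch, not CMS code: a client that stops at the redirect sees the
    # new 301/302 from the http ==> https rule instead of the payload it expects,
    # while a redirect-aware client keeps working.
    import requests

    # Hypothetical endpoint, for illustration only.
    URL = "http://cmsweb.example.cern.ch/data-service/blockreplicas"

    # Old behaviour: stop at the redirect and inspect it explicitly.
    r = requests.get(URL, allow_redirects=False, timeout=30)
    print(r.status_code, r.headers.get("Location"))   # e.g. 301 plus the https:// location

    # Fixed behaviour: follow the http ==> https redirect transparently
    # (this is also the default for requests.get).
    r = requests.get(URL, allow_redirects=True, timeout=30)
    print(r.status_code, r.url)                        # final https:// URL and the real payload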

  • ALICE reports - GENERAL INFORMATION: right now there are stable Pb beams and data taking is ongoing. It has to be followed up immediately by MC for the RAW data runs with the same detector conditions data, so we are keeping the resources on hot standby. Some problems with AliROOT were solved on the spot.
    • T0 - ntr
    • T1 - ntr
    • T2 - usual operations

  • LHCb reports - MC campaign running all weekend
    • T0
      • Transparent intervention on CASTOR (10-12 UTC). Maybe not really transparent? GGUS:64043. [ Ignacio - I don't see anything in CASTOR, could be the SRM, will look at it. Roberto - looks like a disk server problem. Massimo - one thing that looks a bit unusual: at least 4 users are reading data with more streams than LHCBPROD. Unusual, but it could be unrelated. ]
    • T1 site issues:

Sites / Services round table:

  • RAL - SIR for the problems for LHCb at the RAL Tier1: these appeared first as bad TURLs being returned, then as an interruption to the service. See the SIR.

  • KIT - during the w/e one NFS server died, with a high impact on other services, e.g. the WLM service died also. Fixed today. Probably will have unstable WNs for the rest of the week but will keep an eye on it.
  • FNAL - had a lot of trouble this morning with Alcatel. Currently asking for a pw! Today is our downtime, started 3 hours ago and continuing until at least 18:00 Chicago time (10 more hours). The availability problem of last week is partially fixed; some changes done on the LCG side have not yet propagated to CMS. We detected that the FNAL-KIT network link was down on Saturday for some time - no more info.
  • BNL - the channel to CERN didn't work! In addition we encountered BDII problems: our probe last night detected that the LCG BDII dropped all resources associated with the BNL T1 around 09:00 pm (it recovered shortly after), and we also encountered issues with the SAM BDII, unclear why. Opened OSG ticket 9526. Maarten - OSG tickets are not automatically forwarded to the right GGUS category - they should be assigned to CERN. Rob - submitted a ticket to GGUS a couple of hours ago: GGUS:64039.
  • NL-T1 - 3 things: in the night Fri-Sat the NIKHEF BDII had VM problems and had to be restarted. On Saturday the SARA BDII was restarted. Our batch system had some problems - several thousand ALICE jobs were submitted. Simone - are the batch and BDII problems uncorrelated? A: yes, not correlated.
  • CNAF - nothing to add to ATLAS report
  • PIC - half an hour ago a cooling incident in part of the data centre - had to switch off part of the farm, affecting ~400 jobs. More later.
  • IN2P3 - nta
  • RAL - quite a few things. Gareth mailed in the SIR - see above. Problems over the w/e with the site BDII - it locked up and failed some SAM tests. An ATLAS disk server failed this a.m.; the fabric team is looking at it and ATLASDATADISK is out of service. An outage for LHCb on Wednesday to upgrade the disk server o/s to 64-bit. Outage for CASTOR CMS next week.
  • OSG - ATLAS alarm this w/e; several people were woken up early Sunday. We don't mind responding if they really are alarms. Made sure the ticket was forwarded to BNL. Would like to find out if this was a real alarm issue. Michael - no doubt it was an alarm situation that needed to be fixed asap, which it was. Jerka -

  • CERN storage - CASTOR LHCb upgraded to 2.1.9-9 - nothing to do with problems seen since.

AOB:

  • LHC news: "Geneva, 8 November 2010. Four days is all it took for the LHC operations team at CERN to complete the transition from protons to lead ions in the LHC. After extracting the final proton beam of 2010 on 4 November, commissioning the lead-ion beam was underway by early afternoon. First collisions were recorded at 00:30 CET on 7 November, and stable running conditions marked the start of physics with heavy ions at 11:20 CET today."

Tuesday:

Attendance: local(Eddie, Roberto, Oliver, Lola, Andrea, Jamie, Huang, Luca, Harry, Ricardo, Alessandro, Miguel);remote(Rolf, Michael, Jon, Kyle, Daniele, Ronald, John, Christian Sottrup (NDGF)).

Experiments round table:

  • ATLAS reports -
    • Nov 8th - 9th (Mon, Tue)
      • last night, heavy ion collisions with stable beams, ATLAS in data taking mode.
      • p-p data reprocessing ongoing at the T1s. ATLAS is about to submit the 3rd batch of jobs to process the data taken during October (~217 TB in total)
    • T0:
      • last night, around 4:07 - 4:22 am, we observed many acron job failures, most probably due to AFS problems. Has CERN observed something? [ Harry - nothing in the morning meeting. Alessandro - short glitches were also observed yesterday, and last week too. ]
    • T1s:
      • short SE glitches at INFN-T1 last evening (~6:00 pm CERN time). After GGUS:64056 was opened, the problem seemed to be gone. [ Daniele - GGUS ticket solved. In case it comes back, please let us know. ]
      • data reprocessing backlog at SARA due to the CE issue reported yesterday. Although the problem was fixed, ATLAS didn't get a reasonable CPU share to catch up. SARA reconfigured the share so now more jobs are able to run at SARA (thanks for the help). Still watching the progress and preparing alternative plans in case we need them for the next batch of data reprocessing.
      • FZK increased the DATADISK space token, 100% ESD replication resumed.
    • T2s:
      • copy-on-demand mechanism for ESD/DESD analysis implemented for the UK cloud; therefore ESD/DESD auto-replication stopped
      • keeping the T2-T2 functional tests to CA-ALBERTA-WESTGRID-T2 and CA-VICTORIA-WESTGRID-G2 for debugging the FTS + gridftp door setup that is causing T2-T2 transfer failures in the CA cloud.

  • ALICE reports - GENERAL INFORMATION: The amount of RAW data is still at a level where we can reconstruct all runs (Pass1@T0) quasi-online, and the data replication to the T1s can follow the data taking (no backlog). Analysis is ongoing. The MC production has returned to normal.
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK: the disk at the SE was full, so it could not store more files. They are working on getting new storage capacity up, so the issue will probably be solved today
    • T2 sites
      • Usual operations

  • LHCb reports - MC campaign still going on, user jobs, reconstruction tail
    • T0
      • GGUS:64043 (opened yesterday). Files could not be accessed. CASTOR support is now asking to close it; we still see some failures and are re-checking on our side. [ Miguel - you are seeing timeouts, right? In effect the number of files being opened and their location on the servers is causing users to block each other out; users see some timeouts when the maximum number of connections is reached. Roberto - is a user addressing a huge number of files? A: that is a special case - user activities are conflicting with each other on the default pool. It would be better if user activity could be moved to xroot (see the access sketch after this report). The other option is to add some more h/w, but this would have more limitations than the first option. Roberto - we do not have control over users doing analysis using lxbatch. Will encourage usage of xroot. ]
      • GGUS:63933 (opened 5 days ago). Shared area slowness causing jobs to fail. Seems to be understood. Can probably close - just wait to see if there are some more problems in this area.
    • T1 site issues:
      • NTR
      • [ Eddie - tests are still timing out with the shared area at IN2P3, both CE and SRM; still many errors from many tests. Rolf - on the shared area problem, a lot of things are ongoing: we have isolated some WNs to see if only LHCb is working on them, and we are trying to simulate heavy load for the LHCb setup. On the concrete issue just mentioned, I don't know if this has been seen; is there a ticket reference? Roberto - no ticket was opened on the recent SAM failures. Rolf - we will report at the next T1SCM and also at the MB. ]
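
As a hedged illustration of the "move user reads to xroot" suggestion in the LHCb report above (a minimal sketch only, not the LHCb production analysis code; the redirector host and file path are hypothetical placeholders), opening a file over the root:// protocol with PyROOT looks like this:

    # Minimal sketch, not LHCb production code: read a file via the xroot protocol
    # instead of a direct/rfio path, letting the storage system redirect and queue
    # clients rather than exhausting per-disk-server connection slots.
    import ROOT

    # Hypothetical redirector and path, for illustration only.
    url = "root://castorlhcb.example.cern.ch//castor/cern.ch/user/x/someuser/ntuple.root"

    f = ROOT.TFile.Open(url)      # ROOT resolves the root:// scheme via its xrootd client
    if not f or f.IsZombie():
        raise IOError("could not open %s" % url)
    print([k.GetName() for k in f.GetListOfKeys()])   # inspect the file contents
    f.Close()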

Sites / Services round table:

  • IN2P3 - nta
  • BNL - all T1 services running smoothly. Problems with the SAM BDII continue; meanwhile we found out that there are 2 BDIIs used in round-robin mode, and 1 of them fails constantly as it has no info on the BNL resources, in particular the SE (see the query sketch at the end of this round table). Communication ongoing between Hiro and Ale; Ale opened another ticket, while the issue is also going through OSG and Rob had already filed a GGUS ticket. Close one ticket or cross-reference? Ale - will close it.
  • FNAL - had our downtime yesterday, generally successful. A few loose ends, e.g. the BDII publishing, as Oliver mentioned. It was reported to me that the FNAL-KIT circuit was restored yesterday; there was a faulty line card in AMS on US-LHCNET. FNAL runs a central PhEDEx service for the US T3s; the node that does that crashed and is broken, so all circuits to the US T3s are down. Replacing the node.
  • CNAF - there will be a power intervention on Friday morning. It should not affect the farm as there are diesel generators. Will set an AT RISK but not more.
  • NL-T1 - ntr
  • RAL - 3 things today. The issue with the ATLAS disk server is still ongoing; the local ATLAS contact is in touch. 1 machine that serves the site BDII is out of the DNS alias, so only 1 is left. Scheduled downtime tomorrow for the LHCb SRM to upgrade to 64-bit.
  • NDGF - upgrading dCache. If it goes according to plan it will finish tomorrow around noon. During this time some data may be unavailable.
  • OSG - noticed a ticket exchange problem with FNAL which we are currently trying to diagnose. Any tickets to US CMS might not arrive immediately. (Good says Jon!)

  • CERN Grid services: LSF - need to schedule an intervention to replace the h/w of the 2 masters. Before the end of the year? End of January? Please suggest a timeframe, preferably the beginning of December if possible. Acron job failures - a ticket, please. SAM BDII failures - please update the ticket with the info that only 1 of the BDIIs is bad.

  • CERN DB - replication of CMS online to offline stopped 07:15 - 09:15 due to a problem with a maintenance job; it was restarted manually. Investigating.

  • CERN Dashboards - scheduled downtime today at 10:00 for 25' for the ATLAS DDM dashboard. All went smoothly - the new version has statistics on data transfer rates.
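
On the SAM BDII round-robin issue reported by BNL above, a hedged sketch of the kind of per-instance check involved (assuming the standard GLUE 1.3 schema and LDAP port 2170; the hostnames, the SE name and the use of the ldap3 Python library are illustrative assumptions, not the actual SAM probe):

    # Minimal sketch, not the real SAM probe: query each BDII instance behind a
    # round-robin alias separately and check whether it publishes a given SE.
    from ldap3 import Server, Connection

    # Hypothetical hostnames and SE name, for illustration only.
    BDII_HOSTS = ["bdii1.example.cern.ch", "bdii2.example.cern.ch"]
    TARGET_SE = "dcsrm.example.bnl.gov"

    for host in BDII_HOSTS:
        conn = Connection(Server(host, port=2170), auto_bind=True)   # BDIIs serve LDAP on 2170
        conn.search("o=grid",                                        # GLUE 1.3 base DN
                    "(&(objectClass=GlueSE)(GlueSEUniqueID=%s))" % TARGET_SE,
                    attributes=["GlueSEUniqueID"])
        print("%s %s %s" % (host, "publishes" if conn.entries else "is MISSING", TARGET_SE))
        conn.unbind()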

AOB:

  • Next T1SCM this coming Thursday, day after tomorrow.

Wednesday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 04-Nov-2010
