Week of 110530

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Jamie, Maarten, Fernando, Ewan, Eva, Jan, Michal, Ignacio, MariaDZ);remote(Alexander/NL-T1, 0764872215, Jhen-Wei/ASGC, Xavier/KIT, Rolf/IN2P3, Stephen, Jon/FNAL, Vladimir Romanovsky/LHCb, Daniele Andreotti/CNAF).

Experiments round table:

  • ATLAS reports -
    • CERN-PROD GGUS:70977 CERN-PROD_TMPDATADISK, [GRIDFTP_ERROR] globus_ftp_client errors, due to automatic firewall misconfiguration. Solved. Thank you!
    • CERN-PROD GGUS:71015 no space at CERN TMPDATADISK, issue is understood (an EOS bug caused regular "heartbeat" messages not to be processed, so the "available space" reporting was incorrect; see the sketch at the end of this report). Fixed by deploying a software update around 09:30 local time (28th May 2011).
    • CERN-PROD GGUS:71026 "Device or resource busy" from the t0atlas castor pool. Identified one "stuck" node in the T0ATLAS pool (a "blackhole" for jobs), which was at least responsible for the example files in NC042060. Rebooted, should hopefully fix the issue (29th May).
    • CERN-PROD GGUS:71027 SRM_ABORTED on CERN-PROD_TZERO, the source diskserver for the example file was stuck; rebooted, should be OK now (same machine as involved in GGUS:71026).
    • CERN-PROD GGUS:71049 Too many threads busy with Castor at the moment. The PrepareToPut request has been successfully aborted. [ Jan - high load seen on DB backend. ]
    • SARA-MATRIX ALARM GGUS:71028: LFC down due to a spanning tree problem in part of SARA's network. (29th May)
    • IN2P3-CC GGUS:71039 SRMV2STAGER issue at IN2P3-CC_MCTAPE, open (30th May).
    • ATLAS is taking data.
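    • A conceptual sketch of the "stale heartbeat" failure mode behind GGUS:71015 (this is not EOS code; the class, the node names and the 60-second staleness window are illustrative assumptions only):

      import time

      class SpaceAggregator:
          """Toy model: track free space per storage node from periodic heartbeat messages."""

          def __init__(self, stale_after_s=60.0):
              self.free_bytes = {}   # node -> last reported free space (bytes)
              self.last_seen = {}    # node -> time the last heartbeat was processed
              self.stale_after_s = stale_after_s

          def process_heartbeat(self, node, free_bytes, now):
              self.free_bytes[node] = free_bytes
              self.last_seen[node] = now

          def available_space(self, now):
              # Only nodes with a recently processed heartbeat are counted. If heartbeats
              # stop being processed (the bug described above), every node eventually looks
              # stale and the reported space drops to zero even though the disks are not full.
              return sum(free for node, free in self.free_bytes.items()
                         if now - self.last_seen[node] <= self.stale_after_s)

      agg = SpaceAggregator()
      t0 = time.time()
      agg.process_heartbeat("node1", 5e12, t0)
      agg.process_heartbeat("node2", 3e12, t0)
      print(agg.available_space(t0))         # 8e12: heartbeats processed normally
      print(agg.available_space(t0 + 600))   # 0.0: heartbeats no longer processed -> "no space"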

  • CMS reports -
  • LHC / CMS detector
  • CERN / central services
    • CVS issues Thursday, supposed to be fixed. Still some issues afterwards; asked for clarification. SNOW:INC041721.
  • Tier-0 / CAF
    • Tier-0 busy processing latest data
    • Backlog in copying files from Point5 to CASTOR. GGUS:71047. [ Ignacio - replied in SNOW before midday. Will put update directly in GGUS. Jan - you are sending stuff at 10-15Hz for various commands. Ignacio - is this sort of rate normal? Stephen - yes, this is normal for files from P5 to T0. ]
  • Tier-1
    • MC and Data re-reconstruction ongoing.
    • Transfers from T1s to FNAL not working well. Believed to be in PhEDEx somewhere, being investigated.
  • Tier-2
    • MC production and analysis in progress
    • T2_BE_IIHE probably lost cooling over the weekend. Savannah:121206.
  • AOB :
    • None
    • IN2P3 - Rolf: GGUS:70664 is waiting for a reply from CMS. Stephen - is it assigned to CMS? I haven't seen it. A: opened by CMS to IN2P3. We have answered the request but we need a confirmation and have had no reaction.

  • ALICE reports -
    • General information:
    • T0 site
      • voalicefs05, one of the xrootd servers, is back in production after a hardware intervention this morning
    • T1 sites
      • FZK: Two of the three CREAM-CEs are back in production. [ KIT - we fixed problems with the LRMS on Saturday and brought the CREAM CEs back into production one after the other. The last one will enter production at 18:00 today. ]
    • T2 sites
      • Usual operations

  • LHCb reports - A lot of data during the weekend. Processing and reprocessing are running. Certification of a new DIRAC version.
    • Tier1
      • GRIDKA CREAMCE (GGUS:70835)
      • IN2P3 LFC RO Mirror back.
      • SARA LFC RO Mirror problem (GGUS:71042) [ Alexander - yes, we had a problem in part of the network and the LFC could not reach the DB because of this. It started yesterday around 09:30 and was fixed at 10:00 this morning. We forgot to put in a downtime - very sorry. MariaDZ - ATLAS opened an alarm ticket for this yesterday, GGUS:71028. ]

Sites / Services round table:

  • NL-T1 - nta
  • ASGC - ntr
  • KIT - Friday at 17:00 one of the GPFS clusters crashed and several CMS services were down as a consequence. Restored at 21:00 the same day.
  • IN2P3 - nta
  • FNAL - ntr
  • CNAF - ntr

  • CERN storage - still investigating the network problem from the pit to CASTOR tape, which appears to be a pool running at full speed and hitting the bandwidth limit.

  • CERN DB - last week the ALICE availability of T0 and T1s was marked red, due to some changes in the tests: one test was not executed. Now under investigation in ALICE. Expect some "false red" boxes in the report. As of today it should be OK again.

  • CERN DB - ATLAS integration database being upgraded to 11g now. Finish around 17:00. Tomorrow ATLAS offline DB ADCR will be switched back to original h/w.

AOB: (MariaDZ) The true ALARM tests will take place tomorrow Tuesday due to UK and USA public holidays. Related tickets GGUS:71007 and Savannah:120772. The interface GGUS - NGI_IBERGRID is still not restored following the Release. Savannah:119899 is dedicated to the GGUS Remedy upgrade and the follow-up of other ticketing systems' interfaces.

  • CERN is closed Thursday and Friday - no meeting on those days.

Tuesday:

Attendance: local(Eva, Ewan, Ignacio, Jamie, Fernando, Maarten, Michal, Stefan, Ian);remote(Xavier/KIT, Jhen-Wei/ASGC, Jon/FNAL, Tore/NDGF, Ronald/NL-T1, Tiju/RAL, Rob/OSG, Gonzalo/PIC, Michael/BNL, Lorenzo/CNAF).

Experiments round table:

  • ATLAS reports -
  • Data taking: data11_7TeV
  • CERN/T0/Central Services
    • Transient alarms related to ATLAS SSB, DDM Site Services, Tier 0. Immediately solved.
    • Transfer errors from CERN-PROD_TMPDATADISK: GGUS:70977 updated with some new cases. (For EOS)
    • ADCR database under high load since yesterday and occasionally the DB monitor highlights the instance loads in red. DB migration scheduled for today should mitigate the problem.
    • Scheduled ADCR database intervention to migrate database back to original hardware (Tue 10-11AM local time). Consequent DDM and PanDA service disruption. Intervention finished as planned and DDM&PanDA started up correctly.
    • FTS monitor CERN fts22-t0-export.cern.ch not working GGUS:71073. [ Ewan - will be fixed later today or tomorrow morning. Not high priority but being looked at. ]
    • T1s
      • TAIWAN-LCG2_MCTAPE GGUS:70763: Faulty tape - impossible to recover the files. ADC would need a list of affected files and asks whether the site could prepare an incident report. [ Jhen-Wei - we will do a SIR ]
      • Transfer error burst to SARA-MATRIX_SCRATCHDISK GGUS:71110


  • CMS reports -
  • LHC / CMS detector
  • CERN / central services
    • CVS issues Thursday, supposed to be fixed. Still some issues afterwards; asked for clarification. SNOW:INC041721. "We should expect it to be slow". Forever? Clarification please.
  • Tier-0 / CAF
    • Tier-0 busy processing latest data. Prompt reco injected
    • Backlog in copying files from Point5 to CASTOR. GGUS:71047. It cleared with the lack of data, but appears correlated with the change in the file transaction rate in CMS: CMS split one of the streams from P5 into two, which seems to have pushed us over the limit at which CASTOR can create files. We can buffer at P5; next week we need to understand whether we can increase the rate or need to recombine the streams and split them out afterwards. There was a backlog of 4-5K files at the end. As said, the change looks to have been on the CMS side (a back-of-envelope sketch of the rate arithmetic is given under AOB below). [ Ignacio - could not see any issue in T0STREAMER; the total transaction rate is close to the "limit", something like 10Hz put and ~9Hz read, but there is no queue in CASTOR. Ian - we are expecting data over the weekend; if we see a backlog we will mark the time and look at it later in detail. ]
  • Tier-1
    • MC and Data re-reconstruction ongoing. Restarting large scale simulation production with pile-up.
    • Good response from Tier-1s creating tape families. Consistency check of storage requested to experiment site contacts.
    • Transfers to FNAL generally not working well. Believed to be in PhEDEx somewhere, being investigated. Logs sent to experts
  • Tier-2
    • MC production and analysis in progress
  • AOB :
    • New CRC Ian Fisk until June 14
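    • To make the file transaction rate argument above concrete, a back-of-envelope sketch (the per-stream rates are assumptions chosen to match the discussion, not measured CMS or CASTOR figures; only the ~10 Hz put ceiling comes from the minutes):

      # Hypothetical per-stream file-creation ("put") rates in Hz.
      CASTOR_PUT_LIMIT_HZ = 10.0   # approximate rate at which CASTOR can create files, per the minutes

      streams_before = {"streamA": 6.0, "streamB": 3.0}
      # Splitting stream B into two streams roughly doubles its file-creation rate
      # for the same data volume (twice as many, half-sized files).
      streams_after = {"streamA": 6.0, "streamB1": 3.0, "streamB2": 3.0}

      for label, streams in (("before split", streams_before), ("after split", streams_after)):
          rate = sum(streams.values())
          status = "over" if rate > CASTOR_PUT_LIMIT_HZ else "under"
          print(f"{label}: {rate:.1f} Hz put, {status} the ~{CASTOR_PUT_LIMIT_HZ:.0f} Hz limit")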


  • ALICE reports - Production is slowly ramping up again to normal levels. Some issues with central services, which mean that e.g. CERN does not see a constantly high number of jobs.
    • T0 site Nothing to report
    • T1 sites
      • IN2P3: GGUS:71067. NAGIOS job submission tests had not been running since yesterday morning after giving an error. Moreover, submission to one of the CREAM-CEs was very slow, when it was possible at all. The problem has been solved and the ticket closed: the issue was related to web services and Java.
    • T2 sites Usual operations

  • LHCb reports - A lot of data. Processing and reprocessing are running.
    • T0
      • The number of possible rootd connections to the Castor disk pools was exhausted twice last night, together with a build-up of queues of waiting jobs (this triggered an intervention of the Castor team during the night). The reason for the long queues is stripping jobs which take much more time than usual, thus piling up. The Castor team has taken immediate action and is on the verge of deploying new disk servers for this pool (5 new disk servers now deployed). The root cause of the longer stripping jobs has not been found yet. [ Ignacio - working with disk servers ]
    • T1
      • GRIDKA CREAMCE (GGUS:70835) (On Hold)
      • IN2P3 LFC RO Mirror (waiting for an LFC update at CERN) [ a schema mismatch between CERN and IN2P3 is being followed up by the LFC people ]
      • IN2P3 pilots aborted at cccreamceli02 GGUS:71077 (Fixed)
      • SARA LFC (GGUS:71042) is reachable again as of yesterday morning.
    • T2

Sites / Services round table:

  • KIT - ntr
  • ASGC - nta
  • FNAL - last night at 03:15 we received a ticket as a GGUS alarm ticket, via the OSG Footprints bridge; we did not receive the GGUS alarm directly. 1) 03:15 is not a good time for a test alarm! 2) The alarm failed because we depend on the GGUS ticket arriving directly to "wake people up" and trigger procedures; a ticket from OSG is treated as normal and we don't respond to those in the middle of the night. Maarten - alarms were sent for 3 of the 4 experiments, not for ATLAS.
  • NDGF - ntr
  • NL-T1 - ntr
  • RAL - ntr
  • PIC - ntr
  • BNL - ntr
  • CNAF - tomorrow afternoon at 15:00 there will be an intervention on the CNAF-FNAL network for T1-T1 traffic, switching from general IP to the LHCOPN. A ticket has been opened on GGUS OPN for this. The intervention should be transparent for transfer activities.
  • OSG - ntr

  • CERN - ntr

AOB: (MariaDZ)

  • Thanks to Ignacio for bringing to our attention that the GGUS-SNOW interface is broken. The reason is that, with the 2011/05/25 GGUS release, the WSDL URI that SNOW should use for the web service calls to GGUS has changed (a minimal sketch of this failure mode follows below). The GGUS developers communicated this info to ggus-if-devs@cern.ch on 4 occasions since 2011/03/22, via Savannah:119899. Unfortunately the SNOW developers were not included in any of the e-groups notified. Apologies were sent to all circles affected by this incident.
  • The GGUS-NGI_IBERGRID interface was restored, as announced at 7:58am CEST today. The record is Savannah:119899#comment16.
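  • The broken GGUS-SNOW interface above boils down to a SOAP consumer still pointing at a WSDL URI that moved with the release. A minimal sketch of that failure mode, assuming the third-party Python library zeep and purely hypothetical URIs (the real GGUS endpoints are tracked in Savannah:119899):

    from zeep import Client   # third-party SOAP library: pip install zeep

    # Hypothetical URIs, standing in for the pre- and post-release GGUS WSDL locations.
    OLD_WSDL = "https://ggus.example.org/ws/GGUS_old.wsdl"
    NEW_WSDL = "https://ggus.example.org/ws/GGUS.wsdl"

    def make_ggus_client(wsdl_uri):
        """Build a SOAP client from the WSDL; raises if the WSDL can no longer be fetched."""
        try:
            return Client(wsdl_uri)
        except Exception as exc:   # e.g. HTTP 404 on a moved WSDL, or a parse error
            raise RuntimeError(f"cannot load GGUS WSDL {wsdl_uri}: {exc}") from exc

    # A consumer (here standing in for SNOW) configured with OLD_WSDL fails at
    # client-construction time after the release; updating the configured URI to
    # NEW_WSDL is what restores the interface.
    # client = make_ggus_client(NEW_WSDL)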

Wednesday

Attendance: local(Fernando, Jamie, Maria, Ewan, Michal, Ignacio, Edoardo);remote(Michael, Gonzalo, Felix, Jon, Xavier, Kyle, Vladimir, Tore, Jeremy, Daniele, Onno, Ian, Rolf, Tiju).

Experiments round table:

  • ATLAS reports -
  • Data taking: data11_7TeV
  • CERN/T0/Central Services:
    • The ADCR database continues to suffer occasionally from high load.
  • T1s:
    • IN2P3-CC continues to have a high job failure rate related to the setup time, due to AFS slowness GGUS:71032.


  • CMS reports -
  • LHC / CMS detector
    • NTR
  • CERN / central services
    • NTR [ just moments ago CVS was unresponsive - will submit a SNOW ticket ]
  • Tier-0 / CAF
    • Global Tag problems yesterday, with failures from Express and Prompt Reco. Recovering after a fix.
    • Backlog in copying files from Point5 to CASTOR. GGUS:71047. CMS has combined the 2 streams again, which should lower the transaction rate, and we are watching.
  • Tier-1
    • MC and Data re-reconstruction ongoing. Restarting large scale simulation production with pile-up.
    • FNAL Transfers improved after interactions with Nicolo (Thanks)
  • Tier-2
    • MC production and analysis in progress
  • AOB :
    • New CRC Ian Fisk until June 14


  • ALICE reports - General information: Production is back to normal, ~30k jobs running. Last night the torrent server was down and all jobs fell back on using "wget", which overloaded the build server hosting the software. The authentication servers were overloaded and the JobBroker on one of the central machines was not working. The situation went back to normal around midnight.
    • T0 site Nothing to report
    • T1 sites NIKHEF: VOBox SAM tests failing at the site. GGUS:71155. The DN of the host changed so it has to be registered in myproxy again. Done.
    • T2 sites Usual operations

  • LHCb reports - No data. Processing and reprocessing are running. Problems with Stripping jobs.
  • T0
    • CERN: Some stripping jobs consume more CPU time than usual, with a high "sleeping %" on the 48-core batch nodes.
  • T1
    • GRIDKA CREAMCE (GGUS:70835) (On Hold)
    • IN2P3 LFC RO Mirror (Solved)
    • SARA, RAL Huge backlog of waiting stripping jobs.
  • T2

Sites / Services round table:

  • BNL - ntr
  • PIC - ntr
  • ASGC - ntr
  • FNAL - ntr
  • KIT - Two broken tapes, hence 190 files lost for ATLAS and 13 files for CMS; the German experiment representatives have been informed. At 01:30 in the morning part of the PBS system crashed and no new jobs started until 07:30. Next Tuesday 09:00 - 09:30, gridka-dCache.fzk.de reconfiguration and update.
  • NDGF - ntr
  • CNAF - ntr
  • NL-T1 - ntr
  • IN2P3 - ntr
  • RAL - ntr
  • OSG - ntr
  • GridPP - ntr

  • CERN Storage - as discussed yesterday, 5 disk servers were added to the LHCb tape pool to avoid peaks of load affecting data taking.

AOB: (MariaDZ) GGUS test ALARMs' outcome in Savannah:120772, including comments on the FNAL remark for a 3:15am test!

Thursday

No meeting - CERN closed

Friday

No meeting - CERN closed

-- JamieShiers - 24-May-2011
