Week of 120319

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

  • General Information
  • GGUS Information
  • LHC Machine Information
  • CERN IT status board
  • M/W PPSCoordinationWorkLog
  • WLCG Baseline Versions
  • WLCG Blogs
  • GgusInformation
  • Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Cedric, Maarten, Jamie, Jae, Luc, Michail, MariaDZ, Claudio, Mark, Massimo, Ivan, Alexei, Eva);remote(Gonzalo, Jhen-Wei, Kyle, Thomas, Onno, Burt, Rolf, Stefano, Daniele, Tiju, Lorenzo, Dimitri).

Experiments round table:

  • ATLAS reports -
  • TAPE family for T1s : new project : mc12_14TeV
  • AGIS error during the weekend, experts are investigating "agis.utils.exceptions.AGISException: (agis.utils.exceptions.AGISException) Unable to acquire Oracle environment handle"
  • IN2P3-CC LFC migration: drain of the panda queues started, LFC migration will start Tuesday morning
  • TRIUMF-LCG2 FTS 2.2.8 upgrade foreseen for today 6:30-8:30 PM CET. AMOD will follow up: TRIUMF-LCG2 experts, please check that your new FTS is publishing into http://dashb-wlcg-transfers.cern.ch/ui/
  • FZK-LCG2 Oracle issue due to storage failure https://savannah.cern.ch/bugs/?92676

  • CMS reports -
  • Tier-0 (plan):
    • Data taking during beam commissioning
  • Processing activities:
    • 8 TeV MC simulation on Tier-1 and Tier-2 sites
  • Sites:
    • Nothing to report
  • Services and Infrastructure
    • glitch on SRM on Friday evening
    • Still living with WMS problems that affect the Job robot results

  • ALICE reports -
    • Low job efficiency at CERN Sat afternoon and evening due to absence of one popular package on Torrent server, with failover to an SE that is not meant for that purpose. Fixed late Sat evening.


  • LHCb reports -
  • Data reStripping and user analysis at Tiers1
  • MC simulation at Tiers2
  • Latest productions using Stripping 18 are going well

  • T0
    • Batch/CE/WMS/FTS/LFC/VOMS/BDII: NTR
  • T1
    • Ongoing investigation into corrupt files at IN2P3 (GGUS:80338)
      • dCache reports the checksum correctly for the file as stored, but this differs from the checksum recorded in the LFC
      • Affects both MC & Data; ROOT can open them, but will reach a bad event and crash
      • Jobs report a successful upload, but the pfn-metadata and lfn-metadata checksums are different
      • (Rolf) - as far as we can see, the checksum from SRM at the moment of writing is the same as what you get when you apply the checksum algorithm in dCache now. The command updating the LFC, lcg_cr: how does this command calculate the checksum? Does some expert know? Two example files have been corrupted - the only common point is the creation date (~10 Mar). Massimo - in the past with LHCb data we had a series of problems that seemed similar. Triggered by an aborted transfer which was not retried. gridftp terminated strangely. Clear signature: a binary dump shows large blobs of zeroes. Ultimately always connected to LHCb having a problem and not retrying. Mark - could be a problem with lcg_cr or the transfer command. Maarten - the checksum in the LFC is presumably the checksum of the "good" file, not of the corrupted file as stored in the SE. (A minimal checksum sketch is given at the end of this report.)
    • Minor config issue at CNAF affecting internal ARCHIVE/TAPE transfers was quickly fixed by Vincenzo
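
For illustration of the kind of check being discussed: a minimal sketch (in Python, not the LHCb or dCache tooling) that computes the Adler-32 checksum of a locally staged copy of a file and compares it with the value recorded in a catalogue. The file path and the catalogue value are hypothetical placeholders.

    import zlib

    def adler32_of_file(path, chunk_size=1024 * 1024):
        """Compute the Adler-32 checksum of a file, formatted as the usual 8 hex digits."""
        value = 1  # Adler-32 starting value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xffffffff)

    # Hypothetical comparison of a staged replica against the catalogue entry.
    local_checksum = adler32_of_file("/tmp/staged_replica.dst")
    catalogue_checksum = "1a2b3c4d"  # value as recorded in the catalogue (placeholder)
    if local_checksum != catalogue_checksum:
        print("checksum mismatch: replica %s vs catalogue %s" % (local_checksum, catalogue_checksum))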

Sites / Services round table:

  • PIC - reminder of the large downtime in 2 weeks (3 April) for the yearly electrical maintenance
  • ASGC - ntr
  • NDGF - ntr
  • NL-T1 - reminder of tomorrow's SARA downtime: interlink between NIKHEF and SARA will be increased; MSS maintenance and various s/w updates
  • FNAL - ntr
  • IN2P3 - nta
  • CNAF - ntr
  • RAL - on Sat am we had a problem on the FTS DB - service down ~3 hours; still investigating the cause
  • KIT - had a problem on the w/e where some of the CREAM CEs had job failures; the cause was CRLs that had not been updated; two tickets were opened about this problem and it is fixed now
  • OSG - wondering if there is going to be a GGUS update tomorrow: A - yes!

  • CERN Storage - upgrade of ALICE CASTOR to 2.1.12.4; starting now with ALICE rate tests

  • CERN Dashboards - ATLAS-SAM i/f not working properly at the moment; multiple overlapping intervals GGUS:80361; meanwhile working on handling multiple availabilities. Please can FNAL configure FTS so that we can receive messages to monitor transfers; Burt - working on it this week

AOB: (MariaDZ) File ggus-tickets.xls is up-to-date and attached to page WLCGOperationsMeetings. We had no real ALARM this week. Slides with totals and drills for tomorrow's MB are attached at the end of this page as ggus-data.ppt.

  • GGUS: update tomorrow as usual plus alarm tests;

  • CERN DG: mandate extended by two years (2014 - 2015)

Tuesday

Attendance: local(Cedric, Ricardo, Jamie, Mark, Ignacio, Eva, Ivan, Alessandro, MariaDZ);remote(Gonzalo, Xavier, Paco, Rolf, Kyle, Tiju, Jhen-Wei, Thomas, Burt, Lorenzo).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • Some jobs running at CERN fail with "Failed to connect to service sqlite200/ALLP200.db" : GGUS:80391.
  • T1s
    • TRIUMF : FTS upgrade to 2.2.8. Problem with the /var partition being full (fixed now) + a problem reporting to the FTS dashboard (saw activity for 2 hours during the night, but nothing after).
    • IN2P3-CC : Test of the FTS 2.2.8 test server. Works fine. Cloud now blacklisted for the LFC consolidation.
    • PIC : Corrupted files GGUS:80385
  • T2s
    • NET2 problems during the night : "checksum mismatch" GGUS:80389


  • CMS reports -
  • Tier-0:
    • Data taking during beam commissioning. Beams lost several times. SPS intervention. Probably first collisions early next week (no stable beams though)
    • DataOps seems to have found a black-hole node in the T0: node lxbrl2311 had 11 failures in the last day, according to the alarms. Rises in the number of failures were observed in the past hours (see SLS here). The DataOps operator Samir Cury is opening a ticket to CERN-IT. (A small failure-counting sketch is given after this report.)
  • Processing activities:
    • as usual: 8 TeV MC simulation on Tier-1 and Tier-2 sites continues
  • Sites:
    • Nothing special to report
  • Services and Infrastructure
    • Still living with WMS problems that affect the Job robot results (expected to be like this for a while)

  • NOTE: Current CMS CRC is Daniele Bonacorsi (remotely from Bologna), starting today at 8AM.
      • SCOD informed, with apologies for not being able to attend Tuesday's and Friday's calls this week (due to a clash with teaching). Reports will be posted in due time though.
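
As an aside on spotting this kind of "black hole" node: a minimal sketch (not the CMS or T0 monitoring code) that tallies job failures per worker node from a list of (node, status) records and flags nodes whose failure count exceeds a threshold. The records and the threshold are purely illustrative.

    from collections import Counter

    def find_black_hole_nodes(job_records, max_failures=10):
        """Count failed jobs per worker node; return nodes above the threshold."""
        failures = Counter(node for node, status in job_records if status == "failed")
        return {node: count for node, count in failures.items() if count > max_failures}

    # Illustrative records; in practice these would come from the job monitoring.
    records = [("lxbrl2311", "failed")] * 11 + [("lxbrl2312", "ok")] * 50
    print(find_black_hole_nodes(records))  # {'lxbrl2311': 11}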



  • LHCb reports -
  • No significant issues from yesterday
  • Data reStripping and user analysis at Tiers1 ongoing
  • MC simulation at Tiers2 ongoing

  • T0
  • T1
    • Banned SARA in DIRAC due to the DT today
    • Re-enabled the LFC at GridKa

  • Update to investigation into corrupt files at IN2P3 (https://ggus.eu/ws/ticket_info.php?ticket=80338)
  • LHCb jobs use lcg-cp and then register separately in LFC
    • As files seem consistent at storage, one of the only explanations now is that the lcg-cp didn't transfer properly but returned successfully
    • Very unlikely an overwrite occurred
  • More investigations are ongoing.
  • IN2P3 (Rolf) - saw a ticket open for a CVMFS issue. Some dirs are apparently not there. Local experts checked and these dirs are not available at CERN either. Some problem with the LHCb setup perhaps? Mark - will investigate


Sites / Services round table:

  • NL-T1 - link between SARA & NIKHEF has been upgraded to 20 Gbps
  • PIC - corrupted file issue mentioned by ATLAS; arose yesterday. Launched an extensive checksum check in most disk pools. Found two pools with files affected. Complete list being finalized and will be posted later today in GGUS ticket. ~3000 files affected in two pools that had disk rebuild last week. In close contact now with vendor of disks and of RAID controllers. Suspicion is that RAID rebuild procedure caused this corruption in a silent way. Any news -> GGUS ticket incl. complete list
  • KIT - ntr
  • IN2P3 - nta
  • RAL - ntr
  • ASGC - ntr
  • FNAL - ntr
  • NDGF - ntr
  • CNAF - scheduled downtime on 29 Mar for upgrade to FTS 2.2.8
  • OSG - ntr; testing that the ticket exchange worked after GGUS update

  • CERN CVMFS LHCb stratum 0: The CVMFS installation machine for /cvmfs/lhcb.cern.ch was migrated yesterday to new hardware. The migration took around 24 hours, 12 hours longer than hoped.
  • CERN VOMS: On Monday 26th March the host certificate for voms.cern.ch will be updated. The only services requiring the update (to lcg-vomscerts-6.9.0) are gLite FTS and gLite WMS; all other VOMS-aware services support .lsc files.
  • CERN FTS: One agent node on T0 was suffering very high load (60). The unreleased FTS version from the pilot (of EMI 2.2.8) was installed on this agent node on Monday Mar 19th around 21:00, upon which the load dropped below 1. GGUS:79958
  • CERN WN: pre-production nodes upgraded to the equivalent of SLC 5.8

AOB:

  • GGUS: test alarms ongoing; alarms to European sites issued: 80411 - 80420 (CERN & T1s). Operator responses started coming in. One issue to be investigated offline (ASGC alarm). The GGUS host cert will also change, with effect Mar 22 at 13:00 UTC. The public key was sent to all i/f developers.

Wednesday

Attendance: local(Cedric, Jamie, Mark, MariaDZ, Ivan, Luca);remote(Gonzalo, Daniele, Jhen-Wei, Guenter, Rolf, Tiju, Dimitri, Kyle, Ron, Lorenzo, Burt).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • GGUS portal problem.
  • T1s
    • TRIUMF : FTS now correctly reports to the FTS dashboard after a fix in the configuration. Another problem affecting transfers between TRIUMF and ASGC was also fixed : GGUS:80404
    • FZK : Problem with one FTS agent GGUS:80478
    • RAL : Credential problem in FTS transfers to/from UKI-LT2-QMUL GGUS:80471
    • IN2P3-CC : LFC successfully consolidated at CERN. The FR cloud is back now
    • PIC : Corrupted files GGUS:80385. The list of 2766 corrupted files is being checked and they will probably be declared to the DDM recovery service.
  • T2s
    • Transfer errors to Weizmann "StoRM encountered an unexpected error!" GGUS:80444

  • CMS reports -
  • LHC
    • All yesterday afternoon: injection studies. Over night: correction of optics during the squeeze.
  • CERN / central services
    • CASTORCMS_T1TRANSFER: availability drop in SLS (here), queued transfers (here), figuring out what's triggering this, and monitoring how it evolves, no tickets (yet)
  • Tier-0:
    • DataOps yesterday found a black hole node in the T0: node lxbrl2311 had 11 failures in a day, according to the alarms. The DataOps operator Samir Cury opened a ticket to IT (here). No reply yet. Ricardo - was looking at it this morning; will update the ticket
  • Sites:
    • T1_TW_ASGC: SUM SAM (SRMv2) errors, seem intermittent (here), ticket opened (Savannah:127263) Jhen-Wei: 1 disk server cannot handle transfers; 1 CASTOR SRM server has no more SRM process running and hence transfers will fail if this node is selected. Now looks like CASTOR returned to normal

  • LHCb reports -
  • No significant issues from yesterday
  • Data reStripping and user analysis at Tiers1 going well (~2700 running jobs)
  • 17b Stripping will complete in ~1 week at current rate
  • MC simulation at Tiers2 ongoing

  • T1
    • Reenabled SARA after DT. SAM tests and jobs working again.
    • IN2P3: Corrupt file issue and CVMFS issue are still under investigation


Sites / Services round table:

  • GGUS - first of all we had a communication problem within our team concerning the release. The sysadmin put online two new frontend webservers, which was not communicated via a Savannah ticket in time. This change had been tested on the test system some weeks ago and worked quite well, so the sysadmin did not expect any issues - unfortunately that was not the case on the production system and there were a lot of problems with DNS updates. We use the DNS server of KIT and we don't understand what went wrong with the DNS updates. For some regions there were no problems, for others, especially CERN, there were. Still no explanation. In the end, at some regions the local network admin had to force an update of the local DNS to get the new IP addresses, and then the problems were fixed for the users. There are still one or two issues for some clients which we are trying to fix. In parallel we also had a problem with the update of the Remedy server: a necessary patch was not installed. The associated problem is intermittent, the first tests looked OK, and we did not recognize that the patch was missing. The US colleagues in particular noticed this. We installed the necessary patch this morning and will redo the alarm tests shortly. [ See also AOB points ] (A small resolver-check sketch follows.)
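
On the DNS-update point above: a minimal sketch (not the GGUS or KIT tooling) that asks the local resolver for a host name and prints the addresses it currently returns - the kind of check used to see whether a cached record has been refreshed. The host name is a placeholder.

    import socket

    def resolve(hostname):
        """Return the addresses the local resolver currently gives for hostname."""
        infos = socket.getaddrinfo(hostname, None)
        return sorted({info[4][0] for info in infos})

    if __name__ == "__main__":
        # Placeholder host name for illustration only.
        print(resolve("ggus.example.org"))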

  • PIC - an update on the issue with ATLAS corrupted files: updated the GGUS ticket with the list of affected files. In contact with the vendor but support is still unsatisfactory. ATLAS has the list of files; will wait until tomorrow, with a deadline of Friday, to declare those files lost or not. Will try to recover them in the meantime.
  • ASGC - nta
  • RAL - update on the FTS problems. Today we have updated the Oracle DB with the latest patches and also moved to the latest FTS release. Still monitoring to see if this solves the problem
  • IN2P3 - ntr
  • KIT - nta
  • NL-T1 - ntr
  • CNAF - ntr
  • FNAL - alarm tickets from GGUS: new tests at 14:00 UTC, same problem as yesterday, will respond to Savannah. Guenter - will investigate again

  • OSG - follow up on GGUS: Guenter please cc me or GOC on that discussion. Exchanges working better from tests at 14:00. Some people woken up at 04:00 today from other tests.

  • CERN - ntr

AOB: (MariaDZ)

Thursday

Attendance: local(Ricardo, Jamie, Maarten, Cedric, Mark, Ivan, Rolf);remote(Paco, Daniele, Kyle, Jhen-Wei, Thomas, Burt).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • ntr
  • T1s
    • BNL : Problem on one queue (temporary high load on an AFS volume) GGUS:80505. Quickly fixed.
    • FZK : FTS problem reported yesterday GGUS:80478 due to deadlocks in the FTS DB caused by duplicated transfer agents running. Fixed.
  • T2s


  • CMS reports -
  • LHC machine
    • Commissioning is going well. Collimator setup at 450GeV finished. Preliminary loss maps available. Need to test squeeze with crossing angles and beam separation.
    • Plans (as from this morning's LHC Ops meeting):
      • Thu M/A: optics measurements with squeezed beams at beta* of 0.7m, 1 pilot per beam; will deliver 5 splashes for ATLAS at 1pm
      • Thu A: will try LHCb crossing angle study; could result in probe bunch collisions for CMS.
      • Fri: access 07:30-11:00
      • Fri N - Sun: full LHC cycle with 1 probe bunch, possibility of collisions with a probe
      • Sun N - Mon: same thing
      • Mon A: will inject 2-3 nominal bunches, might try collisions!
  • CMS
    • Sub-systems preparing for access tomorrow morning.
  • CERN / central services
    • [follow-up] CASTORCMS_T1TRANSFER: the availability drop in SLS observed yesterday, correlated with load (queued transfers), has now been digested. Tracked down to user activities. SOLVED.
  • Tier-0:
    • [follow-up] black-hole WN in the T0 farm: node lxbrl2311, Samir Cury opened a ticket to IT (here). Raised at yesterday's call. Ricardo working on it. Samir did not see any reply this morning yet, so he asked again on the ticket. OPEN. [ Ricardo - didn't follow up any further but put the machine in standby so it won't start new jobs; updated the ticket ]
  • Tier-1:
    • [follow-up] T1_TW_ASGC: yesterday, intermittent SUM (SRMv2) errors, Savannah:127263, Jhen-Wei fixed, commented at yesterday's call and closed the ticket. I confirm it's OK now. Unfortunately, ASGC had 85% and 78% of SAM availability over the last 2 days, so it is now "red" in the CMS Site Readiness, but it will become green again tomorrow if the solution is indeed stable and no other problems arise. SOLVED.
  • Tier-2:
    • business as usual



  • LHCb reports -
  • No significant issues from yesterday
  • Data reStripping and user analysis at Tiers1 going well
  • Stripping jobs (both 17b & 18) are ~complete at all sites except GridKa and IN2P3
  • The associated merging jobs are ongoing.
  • MC simulation at Tiers2 ongoing

  • T0
  • T1
    • Update on the CVMFS issue at IN2P3 (GGUS:80405): It seems there were some 'dead' dirs that were reported as if they didn't exist. A forced cache refresh caused the client to die, so a corrupt cache is suspected (on these workers). The CVMFS people have been notified. [ Rolf - cannot confirm the status either, Yannick still investigating ]

Sites / Services round table:

  • NL-T1 - ntr
  • IN2P3 - nta
  • ASGC - ntr
  • NDGF - ntr
  • FNAL - issue with GGUS alarm tickets; something changed either in the release and/or on the GOC side - what changed? Trying to figure it out; we get emails but are not paged for alarms
  • BNL - brief issue at BNL; upgraded job slots from 7K to 12K. Observed some hot AFS volumes depending on the job profile. Replicated the hot volumes across different servers to spread the load. Proved to be OK in terms of lowering the load and preventing commands from timing out
  • RAL - applied patches to FTS yesterday; in the morning patched the DB and in the afternoon moved to FTS 2.2.8; waiting to see if things are better now
  • OSG - nta

  • CERN - ntr

AOB: (MariaDZ) Can't join today due to a GGUS-SNOW dev. tel. meeting, but as advised by the CERN central service experts at yesterday's daily meeting I opened a SNOW ticket for the GGUS macros in the CERN twiki. This is it: https://cern.service-now.com/service-portal/view-incident.do?n=INC115285 . It was changed into https://cern.service-now.com/service-portal/view-request.do?n=RQF0080242 [ The twiki macro was changed this morning ]

Friday

Attendance: local(Jamie, Maarten, Cedric, Mark, Ivan, Massimo, Ricardo);remote(Gonzalo, Mette, Xavier, Alexander, Kyle, Rolf, Guenter, Jhen-Wei, John).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • ntr
  • T1s
    • PIC : Corrupted files GGUS:80385. What is the status of the files: recoverable/irrecoverable? [ Gonzalo - not much news; still in contact with the vendor, who will need more time and is asking us not to take any action. WRT ATLAS, just after this meeting we will upload the latest version of the list of lost files; it is somewhat shorter than the previous list - we have recovered some files from tape and some from elsewhere. About 2400 single copies have been affected. Will upload it into the ticket and declare them as lost. Will keep in contact with the vendor but ATLAS can replicate from other sites or whatever. ]
    • SARA : FTS problem (another one this week) GGUS:80537 due to crashing VO-agent. Restarted manually twice. No new problem after upgrading the packages: transfer-url-copy, gridftp-ifce and ic-interface.
  • T2s
    • ntr


  • CMS reports -
  • Absent at the call today, my apologies. SCOD informed by mail earlier this week. Report is below

  • LHC machine
    • Access on Fri morning, 07:30-11:00
    • Plans (as from this morning's LHC Ops meeting):
      • Fri N - Sun: full LHC cycle with 1 probe bunch, possibility of collisions with a probe
      • Sun N - Mon: same thing
      • Mon A: will inject 2-3 nominal bunches, might try collisions!
  • CERN / central services
    • Issue affecting gLite CrabServer usage for analysis for a while on Thu afternoon. Opened TEAM ticket (GGUS:80542), see also SNOW (here). Investigated by Maarten and Steve, tracked down to voms.cern.ch, CMS just happened to get this first, fixed at 11pm on Thu. Perfect support, thanks.
  • Tier-0:
    • [follow-up] interesting comments and possible future investigations by Gavin and CMS Ops on the black-hole WN in the CMST0 queue. Organizing more checks.
      • Anyway, production activities are protected, so I will drop this item from the next reports - unless something becomes relevant.
    • some crashes in the express workflow, being followed up internally in DataOps
  • Tier-1:
    • [follow-up] T1_TW_ASGC: back to green in the CMS Site-Readiness. Case closed.
  • Tier-2:
    • business as usual



  • LHCb reports -
  • Data reStripping and user analysis at Tiers1 going well
  • Stripping jobs (both 17b & 18) are ~complete at all sites except GridKa and IN2P3
  • The associated merging jobs are ongoing.
  • MC simulation at Tiers2 ongoing

  • T0
  • T1
    • Had issues at GridKa this morning (and at several T2s) due to pilots failing from failed proxy renewal from the WMS. This has happened before but we're not sure of the cause, though we believe there is a fix on its way (this is an EMI release of the WMS)
    • Yesterday we increased the number of stripping & merging jobs at IN2P3. This seems to have hit a limit, as over night a backlog of GridFTP transfers built up until there were 2700+ jobs running this morning. Went back to the previous limit and jobs are slowly transferring as they should. [ Rolf - as mentioned we had a power cut and all WNs were gone just before 12:00, they are starting up now (for all other experiments too). One reason why LHCb jobs are not getting data fast enough is the use of srmget and not the dcap protocol. This introduces a performance loss of more than a factor 4, so you might get 20 MB/s max instead of 80 MB/s. ] Mark - we want to grab the whole file; we do an lcg_cp which uses GridFTP rather than dcap. The current policy is to avoid different protocols for different sites - a manpower issue.


Sites / Services round table:

  • GGUS
  • PIC - nta
  • NDGF - ntr
  • KIT - from midnight to midday a tape library was down, hence reading was degraded
  • NL-T1 - ntr
  • IN2P3 - some news on the long-standing network problem. Mostly resolved but one thing was missing: performance from here to T2s in the US and AP was very bad. The basic reason is understood: the algorithm used by the Linux bonding service when bonding 2 Ethernet cards. The workaround is to use just 1 card - performance will go up by a factor of 20 or more. For a real connection we would have to upgrade the network switches, which will have to wait for the next major downtime. Could also affect other sites. (A small bonding-status sketch follows this list.)
  • ASGC - ntr
  • RAL - ntr
  • OSG - ntr
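
On the bonding point above: a minimal, generic sketch (not the IN2P3 configuration) that reads the Linux kernel's bonding status file to report which bonding mode and slave interfaces are in use. The interface name bond0 is an assumption.

    # Inspect the Linux bonding driver status for a bonded interface.
    def bonding_summary(interface="bond0"):
        path = "/proc/net/bonding/%s" % interface  # status file exposed by the bonding driver
        mode, slaves = None, []
        try:
            with open(path) as f:
                for line in f:
                    if line.startswith("Bonding Mode:"):
                        mode = line.split(":", 1)[1].strip()
                    elif line.startswith("Slave Interface:"):
                        slaves.append(line.split(":", 1)[1].strip())
        except IOError:
            return "no bonding status found at %s" % path
        return "mode=%s slaves=%s" % (mode, ",".join(slaves))

    if __name__ == "__main__":
        print(bonding_summary())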

  • CERN storage - announcing 2 small interventions which should be transparent, on Mon and Tue: Oracle-related and small changes on the central DBs of CASTOR, which should be invisible to clients, perhaps with some delays.

  • CERN grid - the VOMS problem was an operational error in following a procedure. We will further emphasize the necessity of checking the up-to-date procedure. Maarten - the host certs of the VOMS servers behind voms.cern.ch were going to expire soon. For VOMS this always needs special treatment - the new certs had been prepared some weeks ago. An automatic procedure kicked in; the old certs and keys were no longer available. The only recourse was for the service manager to deploy the new certs earlier than planned, hence since yesterday evening the new certs are being used. In case anyone sees cert-related errors this could be the reason. They should then update the lcg-vomscerts RPM that was announced at least to all EGI sites. (A small expiry-check sketch is given below.)
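
To illustrate the kind of expiry check behind such certificate rotations: a minimal sketch (not the CERN procedure) that connects to a TLS service and reports how many days remain before its host certificate expires. The host name and port are placeholders, and the check assumes the issuing CA is present in the local trust store.

    import socket
    import ssl
    import time

    def days_until_cert_expiry(host, port=443, timeout=10):
        """Return the number of days until the server's host certificate expires."""
        context = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]  # e.g. 'Mar 26 10:00:00 2013 GMT'
        return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400.0

    if __name__ == "__main__":
        # Placeholder host/port for illustration only.
        print("%.1f days left" % days_until_cert_expiry("voms.example.org", 443))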

AOB:

  • European Summer Time begins (clocks go forward) at 01:00 UTC on 25 March 2012

-- JamieShiers - 31-Jan-2012

Topic attachments
  • ggus-data.ppt (2396.5 K, 2012-03-19 14:31, MariaDimou) - ALARM drills for the 2012/03/20 WLCG MB