Week of 120521

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local (Ueda, Steve, Marc, MariaDZ, Nicolo, Mike); remote (Mette, Dimitri, Stephane, Jhen-Wei, Lisa, Onno, Tiju, Rolf, Paolo, Elisabeth).

Experiments round table:

  • ATLAS reports -
    • RAL (GGUS:82317): New list of 90 lost files. The 2 RAW files were recovered, but the 88 MC files (single copy) were not.
    • RAL (GGUS:82188 reopened): SRM issue, but it seems to have been solved over Saturday (no info in GGUS on the 21st morning). Tiju (RAL) said the SRM problem fixed itself; they are still investigating to understand how.
    • ATLAS SAM tests (GGUS:82344): Apparently running but not visible to ATLAS. Configuration issue solved. Affected Wednesday-Sunday (we will ask for a cleanup of that period).
    • CRL issue at CERN (GGUS:82326 and INC:130120): No problem noticed on Saturday.

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • T1_TW_ASGC: HammerCloud test success rate degraded on May 18th, GGUS:82310 - disk server issue, was fixed.
      • T1_TW_ASGC: ongoing file access problems since May 5th, GGUS:81887 - on May 16th a problematic disk server was put offline.
      • T1_FR_CCIN2P3: added new EMI CREAM-CEs to glidein factory SAV:128747
      • T1_DE_KIT: CMS Frontier squid servers saturated, causing large fail-over to CERN. GGUS:82337 - also seen at T1_IT_CNAF.

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation mostly complete for the moment
    • Prompt reprocessing of data
    • Disabled filling mode for pilots for the next 10 days.
    • New GGUS (or RT) tickets
    • T0
      • Ongoing "Input data resolution" problems by jobs at CERN - DIRAC tunings were tried over the weekend but there still seems to be an issue.
      • Waiting for the files from the faulty diskserver to be recovered.
      • A peak of failed pilots this morning (GGUS:82356)
    • T1
      • RAL : Reported an unavailable disk server in the LHCbUser space token - any news?
      • SARA :
        • Access to data seems to have improved considerably over the last 24h, though I assume we're waiting for the upgrade before marking this as solved (GGUS:82178).
        • Peak of aborted pilots this morning with qsub errors (GGUS:82368). Onno said that they are now busy at SARA throttling file deletion to the SRM. After this Wednesday's downtime the namespace database will be moved to faster hardware. There will also be Oracle changes this Wed. Discussion on aborted pilots will continue in the ticket.

Sites / Services round table:

  • NDGF: ntr
  • NL_T1: Oracle security patches will be applied during the SARA downtime this Wednesday. This will also affect FTS.
  • KIT: ntr
  • FNAL: ntr
  • CNAF: ntr
  • ASGC: ntr
  • IN2P3: ntr
  • OSG: ntr
  • CERN:
    • Dashboards: ntr
    • Grid services: ntr

AOB:

Tuesday

Attendance: local (Andrea, Mark, Ueda, Kate, Szymon, Maarten, Luca, Mike, Nicolo); remote (Xavier/KIT, Mette/NDGF, Lisa/FNAL, Paco/NLT1, Jhen-Wei/ASGC, Tiju/RAL, Jeremy/GridPP, Paolo/CNAF, Rolf/IN2P3, Elisabeth/OSG).

Experiments round table:

  • ATLAS reports -
    • TW TAPE errors (ORA-01578: ORACLE data block corrupted); not a high rate, but worrying. No ticket (the site reacted before a ticket was opened).
    • SARA is still excluded from T0 export, but has been included in MC production (by the automatic exclusion system).
      • [Paco/NLT1: a downtime that was scheduled for tomorrow to fix GGUS:82032 has been cancelled after some tuning done for a similar LHCb issue. Did this solve the ATLAS issue too? Ueda: was not aware of this ticket. Mark/LHCb: the tuning did fix the FTS and transfer issues reported by LHCb. Ueda: we also stopped seeing transfer failures for ATLAS, so maybe this also worked for us.]
    • CASTOR intervention appears on the ITSSB and as a GOCDB downtime, but there was no announcement to castor-announce-atlas. [Luca: can send a warning also to the mailing list, but this was supposed to be a transparent DB intervention. Ueda: it was not transparent, we saw some errors, and we prefer to always be informed on the mailing list too. Luca: OK, will always copy that mailing list; please send us some details about the errors so that we can investigate.]

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
      • CMSONR DB blocked due to an Oracle issue while creating a snapshot - the run could not be configured at P5 for two hours. [Kate: we apologize for not contacting the CMS shifters until after the issue was fixed; we were busy fixing it.]
      • T0: some file transfers from T0 to T1_UK_RAL stuck since May 16th. GGUS:82385 - also affected other T1s. Files were stuck in "STAGEIN" state on CASTORCMS T1TRANSFER, fixed by restaging them manually. Ticket can be closed when all files are at destination.
      • CMS software installation on CVMFS broken, causing job failures at several T2s. SAV:128794
    • Tier-1/2:
      • T1_TW_ASGC: ongoing file access problems since May 5th, GGUS:81887 - on May 16th a problematic disk server was put offline; waiting for confirmation from the CMS production team to close the ticket.
      • T1_ES_PIC: CMS software area was mounted read-only, fixed. GGUS:82397

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • New (and quite significant) MC campaign starting, T1s included as well
    • Prompt reprocessing of data
    • T0
      • Waiting for the files from the faulty diskserver to be recovered. Any updates? [Luca: recovered the file system 20 minutes ago, now working on recovering the files.]
      • Failed pilot problem from yesterday solved (GGUS:82356)
    • T1
      • RAL :
        • Diskserver problem from yesterday quickly solved
        • Some network glitches this morning but with little impact on jobs
      • SARA :
        • Data access continues to be good so I think we regard this problem as solved (GGUS:82178).
        • Still a background of aborted pilots that started a few days ago. Not a big problem but would be nice to understand it (GGUS:82368). [Paco/NLT1: will follow up.]
      • IN2P3 :
        • Site was offline this morning due to the scheduled downtime

Sites / Services round table:

  • Xavier/KIT: ntr
  • Mette/NDGF: ntr
  • Lisa/FNAL: ntr
  • Paco/NLT1:
    • the downtime planned for tomorrow to address LHCb GGUS:82678 and ATLAS GGUS:82032 has been cancelled, as discussed above
    • another downtime is confirmed for Oracle instead
  • Jhen-Wei/ASGC: our DBA is looking at the Oracle issue
  • Tiju/RAL:
    • concerning the network glitch reported by LHCb, we had a few SAM tests failing; Tuesday morning is a known 'at risk' period for the network
    • [Andrea: are you happy with the resolution of the CRL issue? Tiju: yes, the issue is fixed and we agree with the new policy]
  • Jeremy/GridPP: ntr
  • Paolo/CNAF: ntr
  • Rolf/IN2P3: still in downtime, should be back in a few minutes
  • Elisabeth/OSG: ntr

  • Kate/Databases: ntr
  • Luca/Storage: nta
  • Mike/Dashboard: ntr

AOB:

  • Maarten: the echo we are experiencing on the CERN phone may be due to one of the remote phones, please check your connections.

Wednesday

Attendance: local (Andrea, LucaM, LucaC, Maarten, Mike, Ueda, Giuseppe, MariaDZ, Mark, Nicolo); remote (Mette/NDGF, Alexander/NLT1, Elizabeth/OSG, Lisa/FNAL, Rolf/IN2P3, Tiju/RAL, Jhen-Wei/ASGC, Paolo/CNAF).

Experiments round table:

  • ATLAS reports -
    • GGUS was not accessible from Tue, May 22 around 18:00 CET until this morning.
      • [Maria: thanks to Ueda for reporting the issue, it had not been noticed before. This is now being followed up, see some details in the AOB.]
      • [Andrea: is there no monitoring for GGUS? Maria: that's one of the problems that is being followed up between the management of KIT and the management of GGUS. Maarten: we could develop a sensor for this (a minimal sketch follows the ATLAS report below), though hopefully this sort of DNS issue should not happen again in the future.]
      • [Ueda: can we still use the GGUS support email? Maria: yes, the support email should still work and should have generated a ticket. Will check after the meeting.]
    • CERN CASTOR (intervention yesterday) -- following the discussion yesterday, sent a mail to castor-operations with information about errors. "Xavier Espinal: The timestamps of the errors matches perfectly with a rolling intervention in ATLAS oracle DB (castor listener patches) this morning (from 10h to 11h). Checking some of the files I see they are healthy on CASTOR, so the errors you got were most probably related to the (nearly transparent) intervention and you should not see any error of this kind after 10:57AM (and if so let us know)". Indeed, the failed processes have recovered afterwards, and we have not seen such errors since then.
      • [Giuseppe: yesterday there was actually an incident with some diskservers, unrelated to the DB intervention that was successful. Creating a report about this.]
    • CERN CASTOR and EOS (intervention today) -- We knew about these interventions from the notification (EOS) and from the ITSSB (CASTOR and EOS), but wonder why there is no GOCDB downtime for them. Our distributed shifters rely on GOCDB rather than site-specific pages.
      • [Giuseppe: for CASTOR, today's intervention was on the nameserver, which is even lower level and less visible. This type of intervention should be transparent for CASTOR and SRM as it is a local intervention. It was a deliberate choice not to put it in GOCDB, otherwise we would flood GOCDB with low-level interventions. Ueda: ok.]
      • [LucaM: for EOS, it was decided in agreement with Guido that this should not go in GOCDB, because it is a CERN intervention not affecting Grid users.]
    • TW-T1 problem in data transfers to TAPE (CASTOR). Data export from T0 is switched off. The problem was solved within ~1.5h; no ticket (due to the GGUS problem). We then started observing failures in transfers from TAPE, still not solved (no ticket yet).
    • SARA (GGUS:82032) -- as reported yesterday, we have not seen any problem since the "tuning". Data export from T0 will be restarted once we are sure about the stability. The GGUS ticket will be verified after that.
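
A minimal sketch of the kind of GGUS availability sensor Maarten mentioned above: a cron-style probe that checks both DNS resolution and an HTTPS response for the portal front page. This is only an illustration, not an agreed tool; the hostname, URL, timeout and exit-code convention are all assumptions.

    #!/usr/bin/env python3
    # Minimal GGUS availability probe -- illustrative sketch only.
    # Exit codes follow the usual Nagios convention: 0=OK, 1=WARNING, 2=CRITICAL.
    import socket
    import sys
    import urllib.error
    import urllib.request

    GGUS_HOST = "ggus.eu"          # assumed hostname
    GGUS_URL = "https://ggus.eu/"  # assumed front-page URL
    TIMEOUT = 20                   # seconds, arbitrary choice

    def main():
        # 1) DNS check: the May 2012 outage was a DNS-update-related problem.
        try:
            socket.gethostbyname(GGUS_HOST)
        except OSError as exc:
            print("CRITICAL: DNS lookup for %s failed: %s" % (GGUS_HOST, exc))
            return 2
        # 2) HTTP check: make sure the portal actually answers.
        try:
            with urllib.request.urlopen(GGUS_URL, timeout=TIMEOUT) as response:
                if response.status != 200:
                    print("WARNING: %s returned HTTP %d" % (GGUS_URL, response.status))
                    return 1
        except (urllib.error.URLError, OSError) as exc:
            print("CRITICAL: cannot fetch %s: %s" % (GGUS_URL, exc))
            return 2
        print("OK: GGUS resolves and answers")
        return 0

    if __name__ == "__main__":
        sys.exit(main())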

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
      • CMS software installation on CVMFS broken, causing SAM test and job failures at several T2s. SAV:128794
    • Tier-1/2:
      • T1_UK_RAL: update on files stuck in transfer from T0 GGUS:82385 - files are now on RAL storage element, but not yet migrated to tape after 2 days.
      • T1_FR_CCIN2P3: HammerCloud errors "JobSubmissionDisabledExceptionE", probably due to the downtime - the site can close the ticket if this is confirmed. GGUS:82436. [Rolf: do the shifters not see the downtime? Nicolo: yes, they see the downtime, but maybe the calculation was done when the downtime had already finished. We should check how the calculation is done with the HammerCloud and dashboard teams.]

  • LHCb reports -
    • DataReprocessing of 2012 data and prompt reprocessing at T1s going well.
    • MC generation running well after fixing some Bookkeeping issues
    • T0
      • Files now seem accessible. Waiting for Joel to close the ticket (GGUS:82146)
    • T1
      • SARA :
        • Pilot problem seems to have resolved itself but I'll leave the ticket open for a bit longer just in case (GGUS:82368).
      • IN2P3 :
        • Ongoing investigations into corrupt files (GGUS:82247). [Rolf: found that checksums are different for several thousands of files, but this is largely due to different algorithms being used by the middleware to calculate checksums (see the sketch after this report). Mark: thanks, indeed there is a lot of information about this in the ticket. Maarten: MD5 should no longer be used; if it is still being used then there is a problem.]
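
To illustrate the checksum-algorithm point above: the same file gives completely different, incomparable values under MD5 and Adler32, so a catalogue/storage "mismatch" only indicates corruption when both sides used the same algorithm. A minimal sketch (the file name is hypothetical):

    #!/usr/bin/env python3
    # Compute MD5 and Adler32 checksums of the same file -- a sketch showing that
    # values from different algorithms can never be compared with each other.
    import hashlib
    import sys
    import zlib

    def checksums(path):
        md5 = hashlib.md5()
        adler = 1  # Adler32 seed value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                md5.update(chunk)
                adler = zlib.adler32(chunk, adler)
        # Grid tools usually report Adler32 as 8 zero-padded hex digits.
        return md5.hexdigest(), format(adler & 0xFFFFFFFF, "08x")

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "some_mc_file.root"  # hypothetical
        md5sum, adler32sum = checksums(path)
        print("MD5:     %s" % md5sum)       # 32 hex digits
        print("Adler32: %s" % adler32sum)   # 8 hex digits -- not comparable with MD5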

Sites / Services round table:

  • Mette/NDGF: ntr
  • Alexander/NLT1: ntr
  • Elizabeth/OSG: ntr
  • Lisa/FNAL: risk of network issues tomorrow morning, 8:30am to 11am Chicago time, due to work by the accelerator division
  • Rolf/IN2P3: nta
  • Tiju/RAL: ntr
  • Jhen-Wei/ASGC: transfer failures from TW to CERN and other T1s are being investigated
  • Paolo/CNAF: ntr

  • LucaM/Storage: ntr
  • LucaC/Databases: ntr
  • Mike/Dashboard: have a question for BNL about FTS, will follow up offline

AOB: (MariaDZ)

  • GGUS was not accessible from yesterday 17:45 CET until today 07:45 CET (05:45 UTC) due to a DNS-update-related problem again. Details in Savannah:113831#comment40. Preventive measures for the future will be described in the same ticket.
  • NB! GGUS Release next Wednesday 2012/05/30! There will be no representation at the daily meeting till then. Problems should be reported via GGUS or email to helpdesk@ggus.org .
  • File ggus-tickets.xls with total numbers of GGUS tickets per experiment per week is up-to-date and attached to the WLCGOperationsMeetings page. There was one real ALARM last week, GGUS:82237, on 2012/05/15 by CMS against CERN - 15 real ALARMs so far since the last MB.

Thursday

Attendance: local (Andrea, Nicolo, Ueda, Mark, LucaM, LucaC, Mike, Ignacio, Alessandro, Maarten); remote (John/RAL, Paco/NLT1, Christian/NDGF, Jhen-Wei/ASGC, Rolf/IN2P3, Lisa/FNAL, Jeremy/GridPP, Giovanni/CNAF).

Experiments round table:

  • ATLAS reports -
    • GGUS: see Maria's report
    • TW-T1 CASTOR problem in staging from tape (GGUS:82474): the same issue as yesterday.
      • Even though it is not related to the problem in data transfers to TAPE (reported yesterday), we keep data export from T0 to the site suspended.
      • [Jhen-Wei: will follow up, this is related to a diskserver move to new hardware for Castor, we stopped draining the queues.]
    • SARA ATLASMCTAPE (GGUS:82475 verified) no free space in ATLASMCTAPE
      • Solution: A configuration error in our MSS affecting ATLASMCTAPE has been corrected. There is some free space in the space token now.
    • SARA SRM errors (GGUS:82490) since yesterday ~16h UTC (16h-22h and 01h-now)
      • Data export from T0 will stay off for the site.
      • The GGUS ticket can be merged with GGUS:82032 if the problems are the same (one of the error patterns matches the one in ticket GGUS:82032).
    • TRIUMF FTS proxy expiration (GGUS:82460 verified) - caused by backlog in the transfer jobs to CA T2s due to slow transfers, rather than a problem at TRIUMF FTS. ATLAS and TRIUMF are working on it.
    • Transfer error between SARA and TRIUMF (GGUS:82489) - The problem looks similar to GGUS:82143

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
    • Tier-1/2:
      • T1_US_FNAL: some transfer failures following FTS reboot SAV:128866
      • T1_DE_KIT: short glitch in HammerCloud test failures, closed. SAV:128873
      • T1_IT_CNAF: transfer failures with connection errors to specific T2s, SAV:128884

  • LHCb reports -
    • DataReprocessing of 2012 data and prompt reprocessing at T1s going well
    • MC generation mostly complete at present
    • T1
      • SARA / NIKHEF :
        • This morning we saw a large increase in 'Input Data Resolution' errors - any problems to report with respect to storage? [Paco: is this SARA or NIKHEF? Mark: both SARA and NIKHEF, the peak was around the same time. Paco: thanks, please open another ticket if you see the problem again.]
      • IN2P3 :
        • Ongoing investigations into corrupt files (GGUS:82247)

Sites / Services round table:

  • John/RAL: a few downtimes for next week (Castor) have been added to GOCDB
  • Paco/NLT1: will follow up GGUS:82490 for ATLAS
  • Christian/NDGF: ntr
  • Jhen-Wei/ASGC: nta
  • Rolf/IN2P3: ntr
  • Lisa/FNAL: ntr
  • Jeremy/GridPP: ntr
  • Giovanni/CNAF: ntr

  • Ignacio/Grid: ntr
  • LucaM/Storage: ntr
  • LucaC/Databases: ntr
  • Mike/Dashboard: can sites please complete this Doodle poll to indicate whether they are running FTS 2.2.8 patched for the ActiveMQ-cpp bug. [Alessandro: will also follow up via email with user mailing lists, admins and ATLAS contacts at sites. Maarten: all T1s please fill this in by tomorrow.]

AOB:

Friday

Attendance: local (Ueda, Mark, Maarten, Andrea, Mike, Nicolo, LucaM, Marcin, Zbiszek); remote (Elisabeth/OSG, Ulf/NDGF, Alexander/NLT1, Xavier/KIT, John/RAL, Lisa/FNAL, Rolf/IN2P3, Paolo/CNAF).

Experiments round table:

  • ATLAS reports -
    • LHC/ATLAS - physics data taking. Important and urgent MC production on-going (not new).
    • FTS proxy expiration is observed at many FTS instances in transfers to several sites
      • possibly because of a backlog in data transfers
      • consulting with the FTS developers on whether FTS can avoid such errors (a brief illustration of the symptom follows this report)
      • [Nicolo: we also had proxy expiration issues, but this actually seems to be a bug in FTS (2.2.8, after the upgrade) that uses a wrong delegated proxy. More details and comments from the developers in GGUS:81844. Maarten: this was also seen at KIT, SARA and PIC; the FTS developers have been looking at this for weeks but the issue is still not fully understood.]
    • T1
      • SARA SRM errors (GGUS:82490) the number of errors dropped by 22h UTC (24 May).
        • The ticket remains open, waiting for a response from the site.
        • Data export from T0 will stay off for the site.
      • TW-T1 CASTOR problem in staging from tape (GGUS:82474): the situation has been better since yesterday evening, with no new failures in the last 12h.
        • Transfers to the site failed yesterday (2012-05-24) at 14:03:08; the problem is not understood.
        • Data export from T0 will stay off for the site.
    • Network
      • Transfer error between SARA and TRIUMF (GGUS:82489) - solved
        • Solution: (TRIUMF) "We did a hard reset on the BGP session for SARA again on our edge router, now it's working. We will watch if this happen again."
      • Transfer failures from T0 to Italian T2s (calibration sites) (GGUS:82463, GGUS:82454)
        • Solution (NAPOLI): Interruption of the wide-area network from Napoli and Roma to CERN, due to a misconfiguration in the GEANT-LHCONE routing propagation. The issue was fixed by the GARR and CERN network teams.
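
The FTS-side delegation bug mentioned in the proxy-expiration item above is something for the FTS developers to diagnose on the server side, but as an illustration of the symptom, the sketch below (assuming a standard Unix UI with openssl available and a Globus-style proxy location) reports how much lifetime a local proxy has left; when a transfer backlog takes longer than that to drain, the tail of the queue fails with proxy-expired errors.

    #!/usr/bin/env python3
    # Report the remaining lifetime of a local X.509 proxy -- illustrative sketch
    # only; it does not show the FTS-internal delegation logic.
    import datetime
    import os
    import subprocess
    import sys

    # Conventional proxy location: $X509_USER_PROXY or /tmp/x509up_u<uid>.
    proxy = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())

    # openssl prints e.g. "notAfter=May 25 14:03:08 2012 GMT"
    out = subprocess.check_output(
        ["openssl", "x509", "-noout", "-enddate", "-in", proxy]).decode()
    not_after = datetime.datetime.strptime(
        out.strip().split("=", 1)[1], "%b %d %H:%M:%S %Y %Z")

    remaining = not_after - datetime.datetime.utcnow()
    print("Proxy %s expires at %s UTC (%s left)" % (proxy, not_after, remaining))

    # The 6-hour threshold below is an arbitrary example, not an FTS setting.
    if remaining < datetime.timedelta(hours=6):
        sys.exit("WARNING: less than 6 hours of proxy lifetime left")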

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
      • Jobs failing on T0 for input file unavailability: GGUS:82497 - affected CASTOR disk server promptly restored by IT-DSS; most files were recovered immediately, the rest this morning after hw intervention. [Luca: ok, will close the ticket.]
    • Tier-1/2:
      • T1_UK_RAL - GGUS:82495 HammerCloud and production job failures with xrootd open errors, switched back to rfio
      • T1_TW_ASGC - two files repeatedly failing export for checksum mismatch: SAV:128925 pending decision to recover or invalidate
    • Other:
      • The CMS CRC role during the weekend will be covered by other members of the CRC team on a best-effort basis.

  • LHCb reports -
    • DataReprocessing of 2012 data and prompt reprocessing at T1s ongoing
    • MC generation mostly complete at present - more due
    • T0:
      • Inaccessible files have been verified (GGUS:82146)
    • T1
      • SARA / NIKHEF :
        • Previously reported Input Data Resolution errors possibly due to the Swimming production overloading the SE. Configuration changes were made to prevent this.
      • IN2P3 :
        • Ongoing investigations into corrupt files (GGUS:82247).
        • Some MC files that weren't accessible have been moved. The faulty disk server will be repaired soon and no additional problems have been seen as yet.

Sites / Services round table:

  • Elisabeth/OSG: ntr
  • Ulf/NDGF: ntr
  • Alexander/NLT1: next Monday is a holiday in Holland
  • Xavier/KIT: ntr
  • John/RAL: ntr
  • Lisa/FNAL: updated the Doodle poll
  • Rolf/IN2P3: ntr
  • Paolo/CNAF: ntr

  • Mike/Dashboard: please remember to fill in this Doodle poll, thanks to those who did it already
  • Ignacio/Grid: ntr
  • Luca/Storage: ntr
  • MarcinB/Databases: Post mortem for CMS Online DB incident from 22.05 is ready: https://twiki.cern.ch/twiki/bin/view/DB/PostMortem22May12
    • [Zbiszek: there seem to be several concurrent causes for this issue, not all related to Data Pump. An SR has been opened to better understand what went on. In any case a fix has been applied and should prevent further issues.]

AOB:

  • Maarten: GGUS seems to have a problem with email notifications. Opened a ticket about this but also got no notification; will try to inform the GGUS developers through another route to make sure they are aware of the issue. [Xavier/KIT: will inform the GGUS developers at KIT. Luca and Nicolo: also saw similar issues with email notifications when opening and closing tickets yesterday and today.]
  • No meeting on Monday May 28th. Next meeting is on Tuesday May 29th.

-- JamieShiers - 11-Apr-2012
