Week of 140721

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Maria Alandes (chair, minutes), Belinda Chan Kwok Cheong (Storage), Maria Dimou (GGUS), Kate Dziedziniewicz-Wojcik (IT-DB), Maarten Litmaath (ALICE), Raja Nandakumar (LHCb), Alberto Rodriguez (Grid&Batch)
  • remote: Sang-Un Ahn (KISTI), Jeremy Coles (GridPP), Michael Ernst (BNL), Lisa Giacchetti (FNAL), Tiju Idiculla (RAL), Kai Leffhalm (ATLAS), Dmitry Nilsen (KIT), Rob Quick (OSG), Rolf Rumler and Emmanouil Vamvakopoulos (IN2P3), Ulf Tigerstedt (NDGF), Matteo Manzali (CNAF)
Experiments round table:

  • ATLAS reports (raw view) -
    • Tier0/1
      • IN2P3: Staging problem on Sunday, solved (GGUS:107064)
      • Taiwan network problem solved (GGUS:106736, GGUS:107052)

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • MC and User jobs mostly
    • T0:
      • Problem with lbvobox14 (Alarm GGUS:107065). Hardware problem yesterday and the machine is still not back in operation. We would really like to know an ETA for the machine, as we are debating what to do with some services that were on it. They will need to be moved if the machine is not back today, but that will require a lot of effort (customisation).
      • Continuing problem with lcg-voms2 (GGUS:107014)
      • Awaiting a CERN update on GGUS:106434 about open files at CERN with Brazilian proxies. CERN needs to let us know whether the fix has also been rolled out to the production machines.
    • T1: Problem with transfers on the SARA-NCBJ channel (GGUS:106949 against NCBJ). Only this channel is affected; transfers between these two sites and all other destinations are fine.
Raja adds, regarding the lbvobox14 problem, that CERN IT seems to be waiting for an external technician to fix the hardware, but so far there is no further news. Stefan Roiser is following this up; Maria will contact him after the meeting to see whether anything else could be done to speed this up.

Maarten asks whether LHCb still sees any problems with the Brazilian certificates. Raja explains that the Brazilian users now have a CERN certificate to work around the issue, but the problem itself is still there. Since the problem requires a fix in EOS, Maarten suggests adding an update on the ticket asking the EOS team for the status.

Sites / Services round table:

  • ASGC: Reported by mail that the LHCOPN problem between CERN and Taiwan has now been solved.
  • BNL: NTR
  • CNAF: NTR
  • FNAL: NTR
  • GridPP: NTR
  • IN2P3: NTR
  • JINR: Not present
  • KISTI: Connected but couldn't be reached on the phone during the meeting.
  • KIT: NTR
  • NDGF: Reminder for the tape downtime scheduled on Wednesday that will affect ATLAS users. For more details check GOCDB Downtime
  • NL-T1: Not present
  • OSG: Rob reports about the activity to set up a testing Condor CE SAM instance. All relevant people are successfully collaborating on this and making progress.
  • PIC: Not present
  • RAL: The FTS3 server was upgraded to v3.2.26 this morning; see the GOCDB downtime for more details. Raja from LHCb asks about the networking upgrade scheduled for Tuesday morning. Tiju confirms this is correct. More details in the GOCDB downtime.
  • RRC-KI: Not present
  • TRIUMF: Not present

  • CERN batch and grid services:
    • CvmFS stratum 2.0 -> 2.1 migration:
    • FTS3 software upgrade, Tuesday afternoon, transparent (ITSSB entry): upgrade from 3.2.22 to 3.2.26, including workarounds for frequent crashes in the underlying GridSite library.
  • CERN storage services: NTR
  • Databases: The following interventions will happen in the upcoming days:
    • 23.07.2014: CASTOR CMSSTG, PUBSTG, ADCR (transparent for the users)
    • 24.07.2014 at 10am: ATLARC, LHCBR (rolling interventions); ATLR, CMSR, LCGR, CASTOR ATLASSTG, CMSSTG, LHCBSTG (transparent)
  • GGUS: Follow-up from last week's ALARM tests with the new GGUS host certificate. We have a new cert we could install tomorrow Tue 22 July at 9am CEST. We need to decide which tests to redo:
    • The sites where the ALARM test failed are RAL, IN2P3, FNAL and TRIUMF. Are RAL and IN2P3 OK with the suggested date/time above? INFN didn't clearly answer their test ALARM ticket GGUS:106905, so we don't know if they use the GGUS host cert at all...
    • Can we do the ALARM test for FNAL and TRIUMF at 11am Central Standard Time (CST), as suggested by Lisa? The old certificate is still valid until 28 July, so let's fix a day that suits everyone.
    • We, the GGUS dev team, do not know why some sites, e.g. CERN, did not fail to receive the test ALARM last week. Is it because they don't verify the certificate before accepting the ALARM notification? If T0/T1s could contact ggus-info@cern.ch off-line with this information, we could enhance the ALARM documentation.
      • The following sites affected by the ALARM tests have provided the following information in terms of times for the test:
        • RAL 1 hr later than usual
        • FNAL 1 hr earlier from now on, i.e. at 11am CST
        • IN2P3 the usual time
  • Grid Monitoring: Not present
  • MW Officer: Not present
AOB:

Maria reminds sites that there will be a WLCG Coordination Meeting on Thursday; sites are invited to use the slot reserved for them to bring up issues and ask questions of concern to them.

Thursday

Attendance:

  • local: Maria Alandes (chair, minutes), Marcin Blaszczyk (Databases), Belinda Chan Kwok Cheong (Storage), Maria Dimou (GGUS), Maarten Litmaath (ALICE), Raja Nandakumar (LHCb), Sebastien Ponce (Storage), Alberto Rodriguez (Grid&Batch)
  • remote: Sang-Un Ahn (KISTI), Tommaso Boccali (CMS), Jeremy Coles (GridPP), Michael Ernst (BNL), Kyle Gross (OSG), Lisa Giacchetti (FNAL), Tiju Idiculla (RAL), John Kelly (RAL), Felix Lee (ASGC), Emmanouil Vamvakopoulos (IN2P3), Ulf Tigerstedt (NDGF), Thomas Hartmann (KIT)

Experiments round table:

  • CMS reports (raw view) -
    • No major issues, processing and production is continuing
    • CSA14 is ongoing; the major part is analysis tests with CRAB3
    • Yesterday CMS sent a PhEDEx deletion request to T1_*_*_Disk for more than 3 PB. This should help the sites: most of our T1s were at 95%+ disk usage.
    • GGUS/INC: nothing major; business as usual.

  • ALICE -
    • NTR

Raja adds that there is now progress with the Brazilian certificates issue; on the other hand, the ALARM ticket is still not resolved. Maria Dimou adds that information has been fed back into the ticket, but Raja complains that ALARM tickets are normally fixed within 24h, while this case is taking very long. Maria Alandes mentions that Stefan Roiser reported that the problem has been worked around internally by LHCb.

Maarten informs that the lcg-voms2 problem is now solved. Raja explains that a number of machines still seem to be hanging, but Maarten explains that some sites have special configurations, and the cause of those failures may be unrelated to this issue. Under normal conditions it now appears to be working.

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: Not present
  • FNAL: Lisa explains that the test ALARM ticket did not arrive at FNAL at 11am CST as requested. The problem is that if the ticket arrives at FNAL at lunch time, there won't be a fast reaction, since people may be away for the lunch break; for this reason it would be better to have the ticket arrive earlier. Maria Dimou will follow up on this after some discussion on whether daylight saving time should be taken into account, and on whether the GGUS development team should handle all these individual timing requests at all.
  • GridPP: NTR
  • IN2P3: NTR
  • JINR: Not present
  • KISTI: Sang-Un reports that there was a problem with one of the tapes, where some files were missing or reported with 0 size. KISTI will get the list of affected files in the coming days and will try to recover them.
  • KIT: Not present
  • NDGF: Ulf explains that the tape upgrade went OK. The plans for upgrading to dCache 2.10 have been postponed after an issue was found in this version (acknowledged by the developers, who have an internal ticket to track it); they are going to wait for version 2.10.1. Ulf also reports some strange xrootd ALICE file reads, mostly coming from INFN Bologna, that will be investigated. Maarten will follow up on this after the meeting.
  • NL-T1: Not present
  • OSG: NTR
  • PIC: Not present
  • RAL: NTR
  • RRC-KI: Not present
  • TRIUMF: Not present

  • CERN batch and grid services:
    • Incident with the Load Balancing service that led to services such as Argus and BDII being degraded (https://cern.service-now.com/service-portal/view-outage.do?from=CSP-Service-Status-Board&&n=OTG0012529)
    • 2 Argus backend nodes needed to be restarted yesterday.
    • FTS (fts3.cern.ch) upgrade 3.2.22 -> 3.2.26 happened on Tuesday. It was not transparent as advertised, with transfers failing for a time, but the service was quickly restored.
    • FTS Pilot (fts3-pilot.cern.ch) database migration: 1 hour of downtime after this meeting.
  • CERN storage services:
    • Sebastien reports some GridFTP transfers from Lyon failing for the AMS experiment; see BUG 143266. Maarten explains that this is something to be understood with AMS: unless there is a common component with the LHC VOs, there is little WLCG Operations can do about this matter. Andrea explains, following the thread in the mentioned ticket, that the problem seems to come from their storage. Raja also mentions some problems in the past related to files corrupted after transfers.
  • Databases:
    • The migration to GoldenGate for ATLAS conditions replication went well. This affects the online-to-offline replication. The replication to the T1s still uses Streams technology; it is planned to move this to GoldenGate in September.
    • There will be a rolling intervention on Tuesday morning at 9am to apply security patches to the ALICE online DB. More details in IT SSB.
  • GGUS: The test ALARMs following the GGUS host certificate change were all successful this time. Details in JIRA:1291. Following the discussion on the best time of day to open GGUS test ALARMs to American sites, MariaD opened JIRA:1296 for off-line follow-up.
  • Grid Monitoring: Not present
  • MW Officer: NTR
AOB:
Topic revision: r18 - 2014-07-25 - MaartenLitmaath