Difference: WLCGOpsMinutes160317 (1 vs. 62)

Revision 62 2018-02-28 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Line: 185 to 185
 

Middleware Readiness WG

Changed:
<
<

>
>

  • The 16th MW Readiness WG meeting took place yesterday March 16th 2016. Agenda https://indico.cern.ch/e/MW-Readiness_16, notes MWReadinessMeetingNotes20160316, Summary:
  • Tier1s are invited to tell the e-group wlcg-ops-coord-wg-middleware at cern.ch whether they agree to install the pakiti client on their production service nodes, so that the versions of MW running at the site are known to authorised DNs, i.e. site managers taken from GOCDB and expert operations supporters.
  • SRM-less DPM test on-hold until ATLAS pilot code is changed as per JIRA:MWR-104
  • Excellent progress with gfal2 testing (various configurations) as per JIRA:MWR-101 & JIRA:MWR-117
  • Proposed date for the next meeting is Wed May 18th at 4pm CEST.

 

Multicore Deployment

Revision 61 2016-03-21 - MariaDimou

Line: 22 to 22
 
  • Next Ops Coord meetings:
    • April 7th
Changed:
<
<
    • May 12th (or April 28th, since May 5th is Ascension day)
>
>
    • April 28th (since May 5th is Ascension day)
 
    • June 2nd
    • July 7th
Line: 131 to 131
 
  • SAM instability last week/this weekend after VM migration
  • SAM test infrastructure will switch from Nagios to ETF by the end of March
Changed:
<
<
Giuseppe added that CMS users also suffered from the massive email notifications coming from voms-admin.
>
>
Giuseppe added that CMS users also suffered from low voms-admin performance.
 

LHCb

Line: 238 to 238
 

Specific actions for sites

Creation date Description Affected VO Affected TF Comments Deadline Completion
Changed:
<
<
17.03.2016 ASGC (Felix) to report progress on their networking issues at the Ops Coord meeting ATLAS network metrics      
>
>
17.03.2016 The Operations' team to follow-up ASGC's progress on the networking issues experienced there and report progress at the Ops Coord meeting ATLAS network metrics      
 
22.01.2016 CMS sites are requested to move to Phedex 4.1.5 (minimum) or 4.1.7 (recommended) on SL6 CMS - Should be done by the end of Feb. GGUS tickets are opened. Still 9 sites missing. - CLOSED
18.02.2016 CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date? GGUS:119557 ATLAS - Done - CLOSED

Revision 60 2016-03-18 - GiuseppeBagliesi

Line: 14 to 14
 

Attendance

Changed:
<
<
  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), Andrea Manzi (MW Officer), Marian Babik (Network WG), Krystof Borkovec (CERN IT Compute Group). Marc Slater (LHCb), Nurcan Ozturk (ATLAS).
>
>
  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), Andrea Manzi (MW Officer), Marian Babik (Network WG), Krystof Borkovec (CERN IT Compute Group), Marc Slater (LHCb), Giuseppe Bagliesi (CMS), Nurcan Ozturk (ATLAS).
 
  • remote: Stephan Lammel (CMS), David Cameron (ATLAS), Alessandra Doria, Michael Ernst, Antonio Maria Perez Yzquierdo, Di Qing, David Mason, Andrew McNab, Hung-Te Lee, Renaud Vernet, Jeremy Coles, Catherine Biscarat, Eygene Ryabinkin, Gareth Smith, Onno Zweers, Oliver Keeble, Thomas Hartmann, Ulf Tigerstedt, Victor Zhiltsov, Vincenzo Spinoso, Xavier Mol, Alessandro Cavalli.
  • apologies: Josep Flix (PIC)

Revision 59 2016-03-17 - MaartenLitmaath

Line: 15 to 15
 

Attendance

  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), Andrea Manzi (MW Officer), Marian Babik (Network WG), Krystof Borkovec (CERN IT Compute Group). Marc Slater (LHCb), Nurcan Ozturk (ATLAS).
Changed:
<
<
  • remote: Stephan Lammel (CMS), David Cameron (ATLAS), Alessandra Doria, Michael Ernst, Antonio Maria Perez Yzquierdo, Di Qing, David Mason, Andrew McNab, Hung-Te Lee, Renaud Vernet, Jeremy Coles, Catherine Biscarat, Eygene Ryabinkin, Gareth Smith, Onno Zweers, Oliver Keeble, Thomas Hartmann, Ulf Tigerstedt, Victor Zhitsov, Vincenzo Spinoso, Xavier Mol, Alessandro Cavalli.
>
>
  • remote: Stephan Lammel (CMS), David Cameron (ATLAS), Alessandra Doria, Michael Ernst, Antonio Maria Perez Yzquierdo, Di Qing, David Mason, Andrew McNab, Hung-Te Lee, Renaud Vernet, Jeremy Coles, Catherine Biscarat, Eygene Ryabinkin, Gareth Smith, Onno Zweers, Oliver Keeble, Thomas Hartmann, Ulf Tigerstedt, Victor Zhiltsov, Vincenzo Spinoso, Xavier Mol, Alessandro Cavalli.
 
  • apologies: Josep Flix (PIC)

Operations News

Line: 109 to 109
 
  • VOMS: VOMS2 has a different behavior wrt VOMS. After one year of deployment, now we have seen a lot of ATLAS users with membership suspension because of the fact that they did not re-sign the AUP. While we know this is to be dealt within ATLAS, it's quite a painful exercise.
  • FTS: New release 3.4.x deployed. ATLAS will start consuming messages. Still some duplicates have been noticed, FTS team is aware.
Changed:
<
<
Andrea M.: The FTS issues are not related to the messaging but to a race condition. The problem ia known. Tentative fix in the pilot now. Duplicate messages are related to duplicate actual transfers. It is good to consume messages, this will bring the load down.
>
>
Andrea M.: The FTS issues are not related to the messaging but to a race condition. The problem is known. Tentative fix in the pilot now. Duplicate messages are related to duplicate actual transfers. It is good to consume messages, this will bring the load down.
 

CMS

  • General Status
Line: 144 to 144
 Marc added the continuing LHCb concern of simultaneous downtimes by more than 2 Tier1s. Maria A. explained the new version of GOCDB downtime calendar, very helpful for checking clashes. This is linked from the 3pm twiki. The Tier1s were asked to inform the wlcg-scod about downtimes between Monday 3pm calls.
Changed:
<
<
Maarten explained the various issues related to voms-admin, affecting all experiments. One year ago, we were running vomrs, which had a different behaviour. This is the first year that voms-admin has to handle period expiration of AUP signatures and other forms of VO affiliation expiration. The massive co-signature of the AUP by all VO members is not justified to bring the service down. The load shouldn't be that much. There are 2 physical servers, supposedly equivalent, which present instabilities at various points in time.
>
>
Maarten explained the various issues related to voms-admin, affecting all experiments. One year ago, we were running VOMRS, which had a different behaviour. This is the first year that voms-admin has to handle periodic expiration of AUP signatures and other forms of VO affiliation expiration. The massive co-signature of the AUP by all VO members is not justified to bring the service down. The load shouldn't be that much. There are 2 services, lcg-voms2.cern.ch and voms2.cern.ch, that should be completely equivalent, but sometimes one of them presented instabilities recently.
 

Ongoing Task Forces and Working Groups

Line: 153 to 153
 
Changed:
<
<
Maarten confirmed that support will continue. No 100% alternative solution is ready at this moment but it is expected in the course of this year. The GDB recently instaured teh Traceability WG and Security Ops Centres which will be offering similar functionality. The Amsterdam GDB kicked-off these WGs. Ian Collier, GDB chairman, is active in these activities. Experiments are represented, e.g. Alessandro di Girolamo from ATLAS. Marian asked if gLExec monitoring should be stopped. The Ops team will check that with the experiments.
>
>
Maarten confirmed that support will continue. No 100% alternative solution is ready at this moment but significant progress is expected in the course of this year. The GDB recently instated a WG on Traceability and another one on Security Ops Centres which will concern themselves with these matters. The Amsterdam GDB kicked-off these WGs. Ian Collier, GDB chairman, is active in these activities. Experiments are represented, e.g. Alessandro Di Girolamo from ATLAS. Marian asked if gLExec monitoring should be stopped for ALICE, ATLAS and LHCb. Only CMS use it in production. The Ops team will check that with the other experiments.
 

HTTP Deployment TF

Line: 218 to 218
 
  • IN2P3: Resources available as pledged.
  • JINR: CMS-only site. Expected increase of about 50%. All resources are hoped to conform to pledges, tape will take longer to increase capacity.
  • KISTI: not connected
Changed:
<
<
  • KIT: Alessandro asked about the tape insfrastructure because slow performance is observed at times. Thomas replies that there is no specific plan for tape system migration, probably towards the end of 2016.
  • NDGF: More memory installed, more jobs can be run, more storage available, hardware was delivered to various NDGF sites at the end of last year and early this year. Additions were transparent. No more expected storage deliveries this year. Tapes are not evenly distributed, so some capacity limitations can be temporarily faced while the overall backup capacity is still abundant. No problem envisaged to fulfill the pledge. Maria A. reminded that monthly Accounting reports gives the current availability. This can be compared with the pledge.
>
>
  • KIT: Alessandro asked about the tape infrastructure because slow performance is observed at times. Thomas replied that there is no specific plan yet for the tape system migration from TSM to HPSS; it will probably happen towards the end of 2016.
  • NDGF: More memory installed, more jobs can be run, more storage available, hardware was delivered to various NDGF sites at the end of last year and early this year. Additions were transparent. No more expected storage deliveries this year. Tapes are not evenly distributed, so some capacity limitations can be temporarily faced while the overall backup capacity is still abundant. No problem envisaged to fulfill the pledge. Maria A. reminded that the monthly Accounting reports give the current usage. This can be compared with the pledge.
 
  • NL-T1: The site supports ALICE, ATLAS and LHCb. The slides list pledges per VO. The experiments are invited to inform NL-T1 how they wish to distribute the increase of storage between disk and tape. April pledges are fine, with some exception concerning ATLAS pledged disk. There will be a downtime in April to replace dCache database servers, and a long shutdown of 2 weeks in October.
  • NRC-KI: All on slide. No comment.
  • OSG: Not applicable.
Line: 230 to 230
 

Action list

Creation date Description Responsible Status Comments
Changed:
<
<
17.03.2016 The Operations' team to check with the 4 experiments whether the gLExec monitorin can be stopped. Maarten New  
>
>
17.03.2016 The Operations' team to check with ALICE, ATLAS and LHCb whether the gLExec monitoring can be stopped. Maarten New  
 

Specific actions for experiments

Creation date Description Affected VO Affected TF Comments Deadline Completion

Revision 58 2016-03-17 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Highlights

Changed:
<
<
>
>
  • Please check the T0 & T1 2016 pledges attached to the agenda.
  • The Multicore TF accomplished its mission. Its twiki remains as a documentation source.
  • The gLExec TF also completed. Support will continue. Its twiki is up-to-date.
 

Agenda

Attendance

Changed:
<
<
  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), Andrea Manzi, Marian Babik, Julia Andreeva. EDIT AFTER THE MEETING!!!!!!!!!
  • remote: Stephan Lammel (CMS), David Cameron (ATLAS), Alessandra Doria, Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Pepe Flix, Andrew McNab, Daniele Bonacorsi, Hung-Te Lee, Javier Sanchez, Kyle Gross, Rob Quick, Renaud Vernet, Eric Lancon, Vladimir Romanovski, Andrea Valassi, John Kelly, Massimo Sgaravatto, Jeremy Coles. EDIT AFTER THE MEETING!!!!!!!!!
  • apologies:
>
>
  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), Andrea Manzi (MW Officer), Marian Babik (Network WG), Krystof Borkovec (CERN IT Compute Group). Marc Slater (LHCb), Nurcan Ozturk (ATLAS).
  • remote: Stephan Lammel (CMS), David Cameron (ATLAS), Alessandra Doria, Michael Ernst, Antonio Maria Perez Yzquierdo, Di Qing, David Mason, Andrew McNab, Hung-Te Lee, Renaud Vernet, Jeremy Coles, Catherine Biscarat, Eygene Ryabinkin, Gareth Smith, Onno Zweers, Oliver Keeble, Thomas Hartmann, Ulf Tigerstedt, Victor Zhitsov, Vincenzo Spinoso, Xavier Mol, Alessandro Cavalli.
  • apologies: Josep Flix (PIC)
 

Operations News

Deleted:
<
<
 
  • Next Ops Coord meetings:
    • April 7th
Changed:
<
<
    • May 12th (Since May 5th is Ascension day)
>
>
    • May 12th (or April 28th, since May 5th is Ascension day)
 
    • June 2nd
    • July 7th
Changed:
<
<
Trying to sync with GDB
>
>
In principle the meetings are held on the first Thursday of each month, but there need to be exceptions like today and in May. In general, we try to sync with GDB dates so that mutual input is timely and useful for both groups.
 

Middleware News

Line: 64 to 66
  public cloud procurements. Mass delivery is foreseen for mid May to mid August. We are in contact with the experiments to discuss details.
Added:
>
>
Maria A. asked if this point affects the pledges or it refers to the Cloud procurement activity, which has no effect on pledges. The answer will be added here when known.
 
  • The loan of CPU capacity to AMS has been extended until March 10, at which time it will be claimed back. There was no impact on the capacity allocated for LHC VOs.
Line: 105 to 109
 
  • VOMS: VOMS2 has a different behavior wrt VOMS. After one year of deployment, now we have seen a lot of ATLAS users with membership suspension because of the fact that they did not re-sign the AUP. While we know this is to be dealt within ATLAS, it's quite a painful exercise.
  • FTS: New release 3.4.x deployed. ATLAS will start consuming messages. Still some duplicates have been noticed, FTS team is aware.
Changed:
<
<
Andrea M.: The FTS issues are not related to the messaging but to a race condition. Problem known. Tentative fix in the pilot now. Duplicate messages are related to duplicate actual transfers. It is good to consume messages, this will bring the load down.
>
>
Andrea M.: The FTS issues are not related to the messaging but to a race condition. The problem ia known. Tentative fix in the pilot now. Duplicate messages are related to duplicate actual transfers. It is good to consume messages, this will bring the load down.
 

CMS

  • General Status
Line: 205 to 209
  All slides linked from the agenda. Clarifications from the discussion at the meeting, where needed, are:
Changed:
<
<
  • CERN: The resources pledged till April are respected. New capacity is arriving in May. These are opportunistic resources, not related to the pledged figures. Not clear right now if all resources pledged for 2016 will be there on time.
>
>
  • CERN: The resources pledged till April are respected. New capacity is arriving in May. These are opportunistic resources, not related to the pledged figures (? - to be clarified). Not clear right now if all resources pledged for 2016 will be there on time. Maria A. asked to clarify the relationship between the statement in T0 report about massive delivery in May-August and the second statement in the pledge slide.
 
  • ASGC: Oliver said that the DPM performance limitations must be due to some non-optimal configuration because other sites can work without bottleneck issues. To be taken offline. Alessandro says that network issues can harm the performance that achieved pledges can offer. Still dealing with hardware equipment issues related to their network set-up with the vendor. Action on the site to report progress in this meeting.
  • BNL: ESnet offers a very good quality network connectivity. Opportunistically, the network capacity can go up to 100Gbps on LHCONE and OPN
  • CNAF: Resources available as pledged. There will be slide uploaded tomorrow, explaining the numbers.
Line: 226 to 230
 

Action list

Creation date Description Responsible Status Comments
Added:
>
>
17.03.2016 The Operations' team to check with the 4 experiments whether the gLExec monitorin can be stopped. Maarten New  
 

Specific actions for experiments

Creation date Description Affected VO Affected TF Comments Deadline Completion
Line: 233 to 238
 

Specific actions for sites

Creation date Description Affected VO Affected TF Comments Deadline Completion
Changed:
<
<
22.01.2016 CMS sites are requested to move to Phedex 4.1.5 (minimum) or 4.1.7 (recommended) on SL6 CMS - Should be done by the end of Feb - ONGOING
18.02.2016 CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date? GGUS:119557 ATLAS - - - ONGOING
>
>
17.03.2016 ASGC (Felix) to report progress on their networking issues at the Ops Coord meeting ATLAS network metrics      
22.01.2016 CMS sites are requested to move to Phedex 4.1.5 (minimum) or 4.1.7 (recommended) on SL6 CMS - Should be done by the end of Feb. GGUS tickets are opened. Still 9 sites missing. - CLOSED
18.02.2016 CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date? GGUS:119557 ATLAS - Done - CLOSED
 

AOB

Revision 57 2016-03-17 - OnnoZweersExternal

Line: 216 to 216
 
  • KISTI: not connected
  • KIT: Alessandro asked about the tape insfrastructure because slow performance is observed at times. Thomas replies that there is no specific plan for tape system migration, probably towards the end of 2016.
  • NDGF: More memory installed, more jobs can be run, more storage available, hardware was delivered to various NDGF sites at the end of last year and early this year. Additions were transparent. No more expected storage deliveries this year. Tapes are not evenly distributed, so some capacity limitations can be temporarily faced while the overall backup capacity is still abundant. No problem envisaged to fulfill the pledge. Maria A. reminded that monthly Accounting reports gives the current availability. This can be compared with the pledge.
Changed:
<
<
  • NL-T1: The site support ALICE, ATLAS and LHCb. The slides list pledges per VO. The experiments are invited to inform NL_T1 how they wish to distribute the increase of storage between disk and tape. April pledges are fine with some exception concerning ATLAS pledged disk. There will be a long shutdown of 2 weeks in April.
>
>
  • NL-T1: The site support ALICE, ATLAS and LHCb. The slides list pledges per VO. The experiments are invited to inform NL_T1 how they wish to distribute the increase of storage between disk and tape. April pledges are fine with some exception concerning ATLAS pledged disk. There will be a downtime in April to replace dCache database servers, and a long shutdown of 2 weeks in October.
 
  • NRC-KI: All on slide. No comment.
  • OSG: Not applicable.
  • PIC: All on slide. The T10KD tape incident is not yet understood.

Revision 56 2016-03-17 - MarkWSlater

Line: 133 to 133
 
  • Had a clash of 3 Tier 0/1s being in downtime simultaneously earlier this week which would have significantly affected operations.
  • Validation Productions for Turbo and 2011 ReStripping currently being processed
Added:
>
>
  • Once validation is complete this evening/tomorrow, major Run1 stripping campaign will be launched early next week and will take ~1 month
  • Assuming this goes OK, we will start staging the 2012 data soon after
 
  • In addition, significant MC production and user jobs running.

Marc added the continuing LHCb concern of simultaneous downtimes by more than 2 Tier1s.

Revision 55 2016-03-17 - MariaDimou

Line: 11 to 11
 

Attendance

Changed:
<
<
  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), David Cameron (ATLAS), Andrea Manzi, Marian Babik, Stephan Lammel (CMS), Julia Andreeva. EDIT AFTER THE MEETING!!!!!!!!!
  • remote: Alessandra Forti (chair), Alessandra Doria, Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Pepe Flix, Andrew McNab, Daniele Bonacorsi, Hung-Te Lee, Javier Sanchez, Kyle Gross, Rob Quick, Renaud Vernet, Eric Lancon, Vladimir Romanovski, Andrea Valassi, John Kelly, Massimo Sgaravatto, Jeremy Coles. EDIT AFTER THE MEETING!!!!!!!!!
>
>
  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), Andrea Manzi, Marian Babik, Julia Andreeva. EDIT AFTER THE MEETING!!!!!!!!!
  • remote: Stephan Lammel (CMS), David Cameron (ATLAS), Alessandra Doria, Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Pepe Flix, Andrew McNab, Daniele Bonacorsi, Hung-Te Lee, Javier Sanchez, Kyle Gross, Rob Quick, Renaud Vernet, Eric Lancon, Vladimir Romanovski, Andrea Valassi, John Kelly, Massimo Sgaravatto, Jeremy Coles. EDIT AFTER THE MEETING!!!!!!!!!
 
  • apologies:

Operations News

Line: 24 to 24
 
    • June 2nd
    • July 7th
Added:
>
>
Trying to sync with GDB
 

Middleware News

  • Useful Links:
Line: 103 to 105
 
  • VOMS: VOMS2 has a different behavior wrt VOMS. After one year of deployment, now we have seen a lot of ATLAS users with membership suspension because of the fact that they did not re-sign the AUP. While we know this is to be dealt within ATLAS, it's quite a painful exercise.
  • FTS: New release 3.4.x deployed. ATLAS will start consuming messages. Still some duplicates have been noticed, FTS team is aware.
Added:
>
>
Andrea M.: The FTS issues are not related to the messaging but to a race condition. Problem known. Tentative fix in the pilot now. Duplicate messages are related to duplicate actual transfers. It is good to consume messages, this will bring the load down.
 

CMS

  • General Status
    • CMS Detector is being commissioned for 2016 running
Line: 123 to 127
 
  • SAM instability last week/this weekend after VM migration
  • SAM test infrastructure will switch from Nagios to ETF by the end of March
Added:
>
>
Giuseppe added that CMS users also suffered from the massive email notifications coming from voms-admin.
 

LHCb

  • Had a clash of 3 Tier 0/1s being in downtime simultaneously earlier this week which would have significantly affected operations.
  • Validation Productions for Turbo and 2011 ReStripping currently being processed
  • In addition, significant MC production and user jobs running.
Added:
>
>
Marc added the continuing LHCb concern of simultaneous downtimes by more than 2 Tier1s. Maria A. explained the new version of GOCDB downtime calendar, very helpful for checking clashes. This is linked from the 3pm twiki. The Tier1s were asked to inform the wlcg-scod about downtimes between Monday 3pm calls.

Maarten explained the various issues related to voms-admin, affecting all experiments. One year ago, we were running vomrs, which had a different behaviour. This is the first year that voms-admin has to handle period expiration of AUP signatures and other forms of VO affiliation expiration. The massive co-signature of the AUP by all VO members is not justified to bring the service down. The load shouldn't be that much. There are 2 physical servers, supposedly equivalent, which present instabilities at various points in time.

 

Ongoing Task Forces and Working Groups

gLExec Deployment TF

Added:
>
>
Maarten confirmed that support will continue. No 100% alternative solution is ready at this moment but it is expected in the course of this year. The GDB recently instaured teh Traceability WG and Security Ops Centres which will be offering similar functionality. The Amsterdam GDB kicked-off these WGs. Ian Collier, GDB chairman, is active in these activities. Experiments are represented, e.g. Alessandro di Girolamo from ATLAS. Marian asked if gLExec monitoring should be stopped. The Ops team will check that with the experiments.
 

HTTP Deployment TF

  • TF has completed what we hope is the final round of ticketing - ~20 issues still open http://cern.ch/go/h8Kl
Line: 145 to 159
 
  • One proposal is the continuation of the storage provider collaboration beyond the lifetime of the TF, in order to synchronise development targeting mid/long term WLCG data management. Feedback on this proposal is welcome.
Added:
>
>
Discuss offline, including also Julia, the possible conversion of the TF into a WG.
 

Information System Evolution


  • List of primary information sources is now summarised in this document.
  • CMS and ATLAS agreed to evaluate together a common information system (CRIC). First meetings are taking place. It was agreed to work on a prototype in the next few months.
  • The strategy to stop depending on the BDII and using GOCDB/OIM as unique information sources will be evaluated as part of the CRIC work.
  • Short and medium term plans within the TF will be discussed at the next meeting taking place on 31st of March.
Line: 169 to 185
 
Added:
>
>
Antonio Y. confirmed, on behalf of the TF that it can now be closed. Advice to LHCb and T2s will continue. The twiki will be updated so that it stands as documentation for the future.
 

Network and Transfer Metrics WG

Changed:
<
<

>
>

  • ICFA SCIC meeting was held at J-Park in February, slides from the report (including WG contribution) can be found at http://icfa-scic.web.cern.ch/ICFA-SCIC/meetings.html
  • LHCOPN/LHCONE Meeting held in Taipei (https://indico.cern.ch/event/461511/)
  • WLCG Network Throughput SU: ASGC connectivity
    • Packet loss and high latency for certain packets (queuing issue ?) reported by perfSONAR on ASGC to CERN, but not confirmed by the counters
    • Narrowed down to the StarLight to ASGC segment, but unfortunately there are very few sonars in Asia with very limited peering, which will impact further investigation
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • Throughput meetings held on Feb 24th and March 9th :
    • Soichi Hayashi presented the new configuration interface that will become part of perfSONAR 3.6
    • Shawn presented the way we currently monitor the perfSONAR infrastructure, including OSG production services
  • perfSONAR 3.5.1 released, 184 instances were auto-updated, only 13 instances on 3.4
 

RFC proxies

Line: 183 to 201
 

Tier0 & Tier1 2016 pledges

Changed:
<
<
  • ASGC:
  • BNL:
  • CNAF:
  • FNAL:
  • GridPP:
  • IN2P3:
  • JINR:
  • KISTI:
  • KIT:
  • NDGF:
  • NL-T1:
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL:
  • TRIUMF:
>
>
All slides linked from the agenda. Clarifications from the discussion at the meeting, where needed, are:

  • CERN: The resources pledged till April are respected. New capacity is arriving in May. These are opportunistic resources, not related to the pledged figures. Not clear right now if all resources pledged for 2016 will be there on time.
  • ASGC: Oliver said that the DPM performance limitations must be due to some non-optimal configuration because other sites can work without bottleneck issues. To be taken offline. Alessandro says that network issues can harm the performance that achieved pledges can offer. Still dealing with hardware equipment issues related to their network set-up with the vendor. Action on the site to report progress in this meeting.
  • BNL: ESnet offers a very good quality network connectivity. Opportunistically, the network capacity can go up to 100Gbps on LHCONE and OPN
  • CNAF: Resources available as pledged. There will be slide uploaded tomorrow, explaining the numbers.
  • FNAL: Resources available as pledged.
  • GridPP: Not applicable.
  • IN2P3: Resources available as pledged.
  • JINR: CMS-only site. Expected increase of about 50%. All resources are hoped to conform to pledges, tape will take longer to increase capacity.
  • KISTI: not connected
  • KIT: Alessandro asked about the tape insfrastructure because slow performance is observed at times. Thomas replies that there is no specific plan for tape system migration, probably towards the end of 2016.
  • NDGF: More memory installed, more jobs can be run, more storage available, hardware was delivered to various NDGF sites at the end of last year and early this year. Additions were transparent. No more expected storage deliveries this year. Tapes are not evenly distributed, so some capacity limitations can be temporarily faced while the overall backup capacity is still abundant. No problem envisaged to fulfill the pledge. Maria A. reminded that monthly Accounting reports gives the current availability. This can be compared with the pledge.
  • NL-T1: The site support ALICE, ATLAS and LHCb. The slides list pledges per VO. The experiments are invited to inform NL_T1 how they wish to distribute the increase of storage between disk and tape. April pledges are fine with some exception concerning ATLAS pledged disk. There will be a long shutdown of 2 weeks in April.
  • NRC-KI: All on slide. No comment.
  • OSG: Not applicable.
  • PIC: All on slide. The T10KD tape incident is not yet understood.
  • RAL: All on slide. Resources available as pledged.
  • TRIUMF: All on slide. Resources available as pledged.
 

Action list

Revision 54 2016-03-17 - MarkWSlater

Line: 125 to 125
 

LHCb

Added:
>
>
  • Had a clash of 3 Tier 0/1s being in downtime simultaneously earlier this week which would have significantly affected operations.
  • Validation Productions for Turbo and 2011 ReStripping currently being processed
  • In addition, significant MC production and user jobs running.
 

Ongoing Task Forces and Working Groups

gLExec Deployment TF

Revision 53 2016-03-17 - GiuseppeBagliesi

Line: 118 to 118
 
  • Phedex update: sites are asked to move to Phedex 4.1.5 (required), 4.1.7 is recommended
    • 1 Tier-1, 5 Tier-2, 22 Tier-3 sites need to upgrade
    • Sites not at required version received a GGUS ticket
Changed:
<
<
  • Still fallout triggered by the glibc update/reboot a few weeks ago, md5/FTS issue now understood
>
>
  • md5/FTS issue now understood: legacy proxies signed with MD5
  • Still fallout triggered by the glibc update/reboot a few weeks ago
 
  • SAM instability last week/this weekend after VM migration
  • SAM test infrastructure will switch from Nagios to ETF by the end of March

Revision 52 2016-03-17 - AndreaManzi

Line: 30 to 30
 
Changed:
<
<
  • Baselines/New releases:
>
>
 
  • Issues:
Changed:
<
<
>
>
    • Advisory on NSS vulnerability broadcast by EGI SVG (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2016-1950). All running resources MUST be either patched or have the software removed by 2016-03-22 00:00 UTC. As there is no list of affected services, site admins are asked to check which services are impacted after the upgrade (needs-restarting command) and restart them, or just reboot the host. We have discussed within WLCG ops the need to make those advisories clearer in terms of the software affected, and will check with SVG about this.
    • The DPM team has discovered a critical bug in the new HTTP based drain commands implemented in the dmlite-shell and the lcgdm-dav component available in DPM 1.8.10. Using the new drain commands together with lcgdm-dav version <= 0.16.0 can lead to data loss in some circumstances. Sites have been contacted via broadcast and asked not to use the new drain commands (documented at https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Dev/Dmlite/Shell#Newfunctionality:Drain) and to continue using the old dpm-drain command until the fixed components are released. One production site has been heavily affected (200TB data loss).
    • Issues are still being reported with software using MD5 signatures (the latest OpenJDK versions have disabled them). This time the cause of FTS transfer failures was an old version of the fts2 client (used by CMS), packaged in OSG 3.0 with a dependency on a version of gridsite using MD5 and deployed on some Phedex boxes. The affected OSG sites have been asked to move to the version of fts2-client included in OSG 3.2.
 
  • T0 and T1 services
Changed:
<
<
>
>
    • BNL
      • FTS upgraded to v 3.4.2
    • CERN
      • FTS upgraded to v 3.4.2
    • JINR-T1
      • dCache upgraded to v 2.10.56
      • Postgres upgraded to v 9.4.6
    • IN2P3
      • dCache upgraded to v 2.13.26 on pools
    • NDGF
      • dCache upgraded to v 2.15.1
    • NL-T1
      • SURFsara will move all grid hardware to a new datacenter in October 2016. We expect to be down for 2 weeks.
      • Upgrade to dCache 2.13 planned before 2016-05-01.
    • RAL
      • FTS upgraded to v 3.4.2
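
The DPM drain issue listed under "Issues" comes down to a version comparison: the new drain commands are unsafe with lcgdm-dav <= 0.16.0. A minimal sketch of such a guard follows, assuming the installed version is available as a plain dotted numeric string (no RPM release suffix). The helper names are hypothetical, and the version that carries the fix is not stated in these minutes, so only the affected bound is encoded.

```python
# Last lcgdm-dav version reported as affected in the minutes.
LAST_AFFECTED_LCGDM_DAV = (0, 16, 0)


def parse_version(text):
    """Turn a dotted numeric version such as '0.16.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def new_drain_is_unsafe(lcgdm_dav_version):
    """True if the new HTTP-based drain commands should not be used."""
    return parse_version(lcgdm_dav_version) <= LAST_AFFECTED_LCGDM_DAV


if __name__ == "__main__":
    for version in ("0.15.1", "0.16.0", "0.16.1"):
        print(version, "unsafe" if new_drain_is_unsafe(version) else "not in affected range")
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "0.9.0" sorting after "0.16.0".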
 

Tier 0 News

Revision 512016-03-17 - GiuseppeBagliesi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Line: 88 to 88
 
  • FTS: New release 3.4.x deployed. ATLAS will start consuming messages. Still some duplicates have been noticed, FTS team is aware.

CMS

Added:
>
>
  • General Status
    • CMS Detector is being commissioned for 2016 running
      • Global runs with increasing number of components
      • First field-off Cosmics taken
  • Overall production/processing load is rather low
    • Ongoing campaigns are in their tails
    • New major 2016 MC production campaign expected to be launched early April
  • Enabling Tier-2 sites for multi-core pilots in progress
    • 8 sites (19,050 cores) completely transitioned
    • 6 sites ( 7,500 cores) work in progress
    • 10 sites (21,600 cores) site admins contacted
  • Phedex update: sites are asked to move to Phedex 4.1.5 (required), 4.1.7 is recommended
    • 1 Tier-1, 5 Tier-2, 22 Tier-3 sites need to upgrade
    • Sites not at required version received a GGUS ticket
  • Still fallout triggered by the glibc update/reboot a few weeks ago, md5/FTS issue now understood
  • SAM instability last week/this weekend after VM migration
  • SAM test infrastructure will switch from Nagios to ETF by the end of March
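
The multi-core figures above can be put in proportion with a short calculation. The sketch below uses only the site and core counts quoted in the CMS report; the stage labels are shortened and the variable names are illustrative. It covers only the Tier-2 sites listed there, which may not be all Tier-2 sites.

```python
# (number of sites, number of cores) per transition stage, as quoted above.
STAGES = {
    "completely transitioned": (8, 19050),
    "work in progress": (6, 7500),
    "site admins contacted": (10, 21600),
}

TOTAL_SITES = sum(sites for sites, _ in STAGES.values())
TOTAL_CORES = sum(cores for _, cores in STAGES.values())


def core_share(stage):
    """Fraction of the listed cores that belongs to the given stage."""
    return STAGES[stage][1] / TOTAL_CORES


if __name__ == "__main__":
    print(TOTAL_SITES, "sites,", TOTAL_CORES, "cores in total")
    for stage in STAGES:
        print(f"{stage}: {core_share(stage):.1%} of listed cores")
```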
 

LHCb

Revision 502016-03-17 - NurcanOzturk

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Line: 80 to 80
 
  • various VOMS-Admin issues caused a support load on the VO admins

ATLAS

Added:
>
>
  • Reprocessing of 2015 data has been completed.
  • Large scale MC15c Monte-Carlo reconstruction campaign started, will process 5B events over the next 3-4 months.
  • High Level Trigger reprocessing is running.
  • Heavy Ion express stream processing is ongoing, bulk processing is expected in ~2 weeks.
  • VOMS: VOMS2 behaves differently from VOMS. After one year of deployment, many ATLAS users have now had their membership suspended because they did not re-sign the AUP. While we know this is to be dealt with within ATLAS, it is quite a painful exercise.
  • FTS: New release 3.4.x deployed. ATLAS will start consuming messages. Still some duplicates have been noticed, FTS team is aware.
 

CMS

Revision 492016-03-17 - JeromeBelleman

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Line: 37 to 37
 
  • T0 and T1 services
Added:
>
>
 

Tier 0 News

Added:
>
>
  • Following CERN's Invitation to Tender IT-4143/IT, a contract has been placed with T-Systems. The idea is to deploy these resources in a way more similar to the standard Tier-0 processes than with previous public cloud procurements. Mass delivery is foreseen for mid May to mid August. We are in contact with the experiments to discuss details.

  • The loan of CPU capacity to AMS has been extended until March 10, at which time it will be claimed back. There was no impact on the capacity allocated for LHC VOs.

  • The HTCondor pool has been extended to 15'000 cores; further extension will require Kerberos support, which is very close to being available as a result of a very good collaboration between CERN and the HTCondor team. We will invite pilot users and then adapt the ratio of HTCondor to LSF according to user demands.

  • The upgrade of computing hypervisors to valid configurations (Kilo-1 wherever possible, EPT switched on on the others) to contribute to better performance has been finished on schedule - thanks to the cloud team!

  • The CASTOR software has been consolidated; release 2.1.16 merges CASTOR with the SRM stack, which will allow for co-locating the stager and the SRM daemons. This feature will be rolled out to the new headnodes.
 

DB News

Tier 1 Feedback

Revision 482016-03-17 - MariaALANDESPRADILLO

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Line: 78 to 78
 

Information System Evolution

Changed:
<
<

>
>

  • List of primary information sources is now summarised in this document.
  • CMS and ATLAS agreed to evaluate together a common information system (CRIC). First meetings are taking place. It was agreed to work on a prototype in the next few months.
  • The strategy to stop depending on the BDII and using GOCDB/OIM as unique information sources will be evaluated as part of the CRIC work.
  • Short and medium term plans within the TF will be discussed at the next meeting taking place on 31st of March.
 

IPv6 Validation and Deployment TF

Changed:
<
<

>
>

 

Machine/Job Features TF

Line: 94 to 94
 

Middleware Readiness WG

Changed:
<
<

>
>

 

Multicore Deployment

Changed:
<
<
>
>
 

Network and Transfer Metrics WG

Changed:
<
<

>
>

 

RFC proxies

Revision 472016-03-17 - MariaALANDESPRADILLO

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Line: 66 to 66
 
Deleted:
<
<

Machine/Job Features TF

  • Finalized specification published as HSF-TN-2016-02
  • Next step is to update and complete implementations, starting with Vac/Vcycle for VMs (done), PBS/Torque (done), and HTCondor (started).
  • See MachineJobFeaturesImplementations for links to sources, RPMs, YUM repo etc.
  • Will begin rolling out sites, initially with sites that have volunteered to help test updated implementations.
  • Update MJF SAM tests.
 

HTTP Deployment TF

  • TF has completed what we hope is the final round of ticketing - ~20 issues still open http://cern.ch/go/h8Kl
Line: 92 to 84
 
Added:
>
>

Machine/Job Features TF

  • Finalized specification published as HSF-TN-2016-02
  • Next step is to update and complete implementations, starting with Vac/Vcycle for VMs (done), PBS/Torque (done), and HTCondor (started).
  • See MachineJobFeaturesImplementations for links to sources, RPMs, YUM repo etc.
  • Will begin rolling out sites, initially with sites that have volunteered to help test updated implementations.
  • Update MJF SAM tests.
 

Middleware Readiness WG


Revision 462016-03-17 - OliverKeeble

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Line: 76 to 76
 

HTTP Deployment TF

Added:
>
>
  • TF has completed what we hope is the final round of ticketing - ~20 issues still open http://cern.ch/go/h8Kl
  • Advice on storage configuration has been documented - https://twiki.cern.ch/twiki/bin/view/LCG/HTTPTFSAMProbe#Storage_systems_specific_advice
    • This is particularly important for DPMs
  • As most systematic issues have been found and fixed, followup can pass to the experiments when ETF goes live (which will automatically deploy the HTTP TF monitoring).
  • Next meeting (Wed 23rd) will discuss what final steps the TF needs to take
  • One proposal is the continuation of the storage provider collaboration beyond the lifetime of the TF, in order to synchronise development targeting mid/long term WLCG data management. Feedback on this proposal is welcome.
 

Information System Evolution


Revision 452016-03-17 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Line: 49 to 49
 

ALICE

Added:
>
>
  • mostly high activity
    • also thanks to the HLT clusters: up to 9.2k job slots
    • new record reached March 10: 99372 running jobs!
  • various VOMS-Admin issues caused a support load on the VO admins
 

ATLAS

CMS

Line: 59 to 64
 

gLExec Deployment TF

Added:
>
>
 

Machine/Job Features TF

Line: 91 to 98
 

RFC proxies

Added:
>
>
  • NTR
 

Squid Monitoring and HTTP Proxy Discovery TFs

  • lhchomeproxy.cern.ch is now being used for LHC@home cvmfs, and monitoring is set up. For now the URL is hard-coded but by next month it should be using Web Proxy Auto Discovery so it can switch to other worldwide proxies, automatically ordered by geo location. Squid deployments are being pursued at Fermilab and in China, Australia, and Brazil.
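
Web Proxy Auto Discovery hands clients a PAC file whose `FindProxyForURL` function returns a semicolon-separated, ordered list of proxies to try. The sketch below only shows how such a return string can be split into an ordered list; the proxy host names are hypothetical, and fetching or evaluating the PAC file itself is outside its scope.

```python
def parse_pac_result(result):
    """Split a FindProxyForURL return value into ordered (kind, host, port) entries."""
    entries = []
    for item in result.split(";"):
        parts = item.split()
        if not parts:
            continue  # skip empty segments, e.g. after a trailing ';'
        kind = parts[0].upper()
        if kind == "DIRECT" or len(parts) < 2:
            entries.append((kind, None, None))
            continue
        host, sep, port = parts[1].rpartition(":")
        if sep:
            entries.append((kind, host, int(port)))
        else:
            entries.append((kind, parts[1], None))  # no explicit port given
    return entries


if __name__ == "__main__":
    # Hypothetical proxy names, ordered as a geo-sorting service might return them.
    example = "PROXY squid1.example.org:3128; PROXY squid2.example.org:3128; DIRECT"
    for entry in parse_pac_result(example):
        print(entry)
```

Keeping the order of the entries matters, since a client is expected to try the proxies in the sequence given and fall back to the next one on failure.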

Revision 442016-03-16 - AndrewMcNab

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Line: 63 to 59
 

gLExec Deployment TF

Deleted:
<
<
 

Machine/Job Features TF

Added:
>
>
  • Finalized specification published as HSF-TN-2016-02
  • Next step is to update and complete implementations, starting with Vac/Vcycle for VMs (done), PBS/Torque (done), and HTCondor (started).
  • See MachineJobFeaturesImplementations for links to sources, RPMs, YUM repo etc.
  • Will begin rolling out sites, initially with sites that have volunteered to help test updated implementations.
  • Update MJF SAM tests.
 

HTTP Deployment TF

Revision 432016-03-16 - DaveDykstra

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Line: 94 to 94
 

Squid Monitoring and HTTP Proxy Discovery TFs

Added:
>
>
  • lhchomeproxy.cern.ch is now being used for LHC@home cvmfs, and monitoring is set up. For now the URL is hard-coded but by next month it should be using Web Proxy Auto Discovery so it can switch to other worldwide proxies, automatically ordered by geo location. Squid deployments are being pursued at Fermilab and in China, Australia, and Brazil.
 

Tier0 & Tier1 2016 pledges

  • ASGC:

Revision 422016-03-15 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, March 17th 2016

Revision 412016-03-15 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"
Changed:
<
<

WLCG Operations Coordination Minutes, February 18th 2016

>
>

WLCG Operations Coordination Minutes, March 17th 2016

 

Highlights

Changed:
<
<
  • The new Experiments Test Framework (ETF) will be tentatively ready in April. It will contain few user changes.
  • The WLCG workshop and the MB decided to freeze the gLExec deployment.
  • LHCb to discuss with Multicore deployment TF experts about the future of the TF and advise the sites.
  • All sites are to install the patches for the critical vulnerability announced yesterday.
>
>
 

Agenda

Changed:
<
<
>
>
 

Attendance

Changed:
<
<
  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), David Cameron (ATLAS), Andrea Manzi, Marian Babik, Stephan Lammel (CMS), Julia Andreeva.
  • remote: Alessandra Forti (chair), Alessandra Doria, Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Pepe Flix, Andrew McNab, Daniele Bonacorsi, Hung-Te Lee, Javier Sanchez, Kyle Gross, Rob Quick, Renaud Vernet, Eric Lancon, Vladimir Romanovski, Andrea Valassi, John Kelly, Massimo Sgaravatto, Jeremy Coles.
  • apologies: Catherine Biscarat (IN2P3)
>
>
  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), David Cameron (ATLAS), Andrea Manzi, Marian Babik, Stephan Lammel (CMS), Julia Andreeva. EDIT AFTER THE MEETING!!!!!!!!!
  • remote: Alessandra Forti (chair), Alessandra Doria, Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Pepe Flix, Andrew McNab, Daniele Bonacorsi, Hung-Te Lee, Javier Sanchez, Kyle Gross, Rob Quick, Renaud Vernet, Eric Lancon, Vladimir Romanovski, Andrea Valassi, John Kelly, Massimo Sgaravatto, Jeremy Coles. EDIT AFTER THE MEETING!!!!!!!!!
  • apologies:
 

Operations News

Changed:
<
<
  • Operations Coordination Meetings have been reorganised as of 1st March. See MB slides presented this week:
    • 3PM meetings once a week on Mondays
    • Ops Coord meetings once per month on the first Thursday of the month
      • Topical meetings
      • Written reports still requested, but not necessary to go through them during the meeting
>
>
 
  • Next Ops Coord meetings:
Deleted:
<
<
    • March 3rd
 
    • April 7th
    • May 12th (Since May 5th is Ascension day)
    • June 2nd
    • July 7th
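
The dates above follow the "first Thursday of the month" rule for Ops Coord meetings. A small sketch using only the Python standard library can cross-check them; the function name is illustrative. Exceptions such as the May meeting, moved because May 5th is Ascension day, still have to be handled by hand.

```python
import calendar
from datetime import date


def first_thursday(year, month):
    """Return the date of the first Thursday of the given month."""
    first_weekday, _ = calendar.monthrange(year, month)  # Monday == 0
    offset = (calendar.THURSDAY - first_weekday) % 7
    return date(year, month, 1 + offset)


if __name__ == "__main__":
    for month in (3, 4, 5, 6, 7):
        print(first_thursday(2016, month).isoformat())
```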
Deleted:
<
<
  • The WLCG workshop took place on 1-3 February in Lisbon. Very high participation. Interesting discussions. People are encouraged to check agenda and attached material.
  • A follow up of the WLCG workshop was done at the MB on Tuesday. Concrete actions and probably new TFs and WGs will be created, more news on the coming weeks.
  • The MB has also agreed to adapt and improve the LCG monthly accounting reports. A pre-GDB will be organised to discuss the accounting in detail and to review the way accounting reports are currently done. The comparison of CPU usage to pledges was also addressed, and there is universal agreement that WALL-time usage should be compared to pledges (a correction will be made in the reports, as they still compare CPU-time to pledges). See MB slides for more details.

Experiments Test Framework (ETF)

Marian Babik presented ETF, a successor of SAM/Nagios test framework, currently under validation that should tentatively go into production on the 1st of April, depending on the validation process being successful by this date.

ETF is a complete rewrite of the SAM/Nagios test framework, but it still uses the same plugins, so it is not a major change from the sites' perspective. There are a few changes that could impact sites:

  • Testing with RFC proxies (coordinated by RFC TF)
  • All services in the VO feeds will be tested
  • New HTTP tests (coordinated by HTTP TF)
  • Updated gLExec worker node test to the latest from UMD

Probes' results will be taken only from job outputs, i.e. the WN will no longer send them directly to the message bus.

 

Middleware News

Line: 55 to 31
 
Changed:
<
<
    • DPM 1.8.10 is now baseline. It is already in UMD3 and was verified by MW Readiness some time ago. It includes bug fixes and improvements in core and frontend components
    • Perfsonar 3.5.0 is baseline. The end of life of the previous version (3.4.1) is set to 8th of April. A security upgrade has also just been released (v 3.5.0.7) http://www.perfsonar.net/#20160216-security. Please make sure to have the latest version installed.
>
>
 
  • Issues:
Changed:
<
<
    • The latest version of Java OpenJDK for all platforms disabled support for MD5-signed certificates. This has caused some issues, mainly to LHCb, because of an old certificate used for transfers stored in MyProxy (solved), and to SAM tests towards CREAM (for CMS and ATLAS), because SAM is using a version of HTCondor (8.2.10) which includes an old version of the CREAM CLI still signing with MD5. This is going to be fixed with the new ETF-Nagios, which currently has HTCondor 8.4.4. Sites have been asked either not to upgrade to the latest Java, or to re-enable MD5 on their Java services for now. After the SAM upgrade we will ask sites to safely upgrade Java and disable MD5 again.
    • EGI SVG advisory sent yesterday describing a critical vulnerability of glibc on all platforms (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2015-7547), All running resources MUST be patched by 2016-02-24 21:00 UTC.
  • T0 and T1 services
    • FNAL
      • EOS Upgraded to v 0.3.127, with xrootd 3.3.6-4.slc6 and Bestman 2.3.0-21
      • planned upgrade to dCache 2.13 in April
    • JINR-T1
      • minor dCache upgrade to v 2.10.54
    • IN2P3
      • xrootd 4.2.3-3 and new balanced redirector on tape buffer under test
    • INFN-T1
      • Upgrade Storm to v 1.11.10 on the lhcb,cms and atlas instances.
      • Installed the last production version of the storm-webdav service on the lhcb gridftp servers for the HTTP TF

Marian will look into how complicated it is to upgrade HTCondor still on the old SAM-Nagios hosts.

>
>
 
Added:
>
>
  • T0 and T1 services
 

Tier 0 News

Deleted:
<
<
  • Condor: 118 kHS06 out of a total of 817 kHS06 (15%).
  • Larger CREAM CE flavours being deployed.

Jerome clarified that much more memory will be configured for the CREAM CEs. By the summer the LSF vs HTCondor proportion at the T0 is expected to be half-half.

 

DB News

Tier 1 Feedback

Line: 91 to 49
 

ALICE

Deleted:
<
<
  • mostly high activity
  • disk space
    • about 2.5 PB were recovered thanks to ad-hoc cleanups
    • further cleanups expected by spring, pending agreement on policy changes
    • CASTOR situation for raw data reco looks good, thanks for the support!
 

ATLAS

Deleted:
<
<
  • High activity: reprocessing of some 2012 data completed one week ago, just in time to start re-reprocessing of all 2015 data (expected to last 1-2 more weeks)
  • condor/CREAM issues: CREAM database issue a couple of weeks ago, today core dumps on the pilot factories. Neither issue is yet understood.
  • CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date? GGUS:119557
  • FTS: New release 3.4.x contains a couple of important features for ATLAS, can sites deploy it?

FTS3.4.2 is now verified for Readiness at the CERN pilot. Expected in production at CERN next week. The stratum issue will become an action for firm follow-up.

 

CMS

Deleted:
<
<
  • General Operational Issues
    • Overall good usage at Tier-1 and Tier-2 sites, good job success rate and CPU utilization (incl. HLT and Tier-0 Openstack resources for processing)
    • We are short on disk space and are in contact with sites about readiness of the 2016 pledges
  • Requests for Sites
    • Sites are asked to update to Phedex 4.1.5 (or higher, 4.1.7 is the recommended version) by the end of February. One Tier-1 and about 20 Tier-2 sites still need to upgrade.
    • All Tier-1 sites are running multi-core pilots and Tier-2 sites are now switching coordinated by the Submission Infrastructure team via GGUS tickets (still to be opened).
 
Deleted:
<
<
Maria A. suggested that sites report on 2016 pledges at the next Ops Coord meeting (March 3rd).
 

LHCb

Deleted:
<
<
  • Stripping for 2015 is almost finished. Cleaning processed RAW files from disk
  • Validation of Turbo and TurCal; Prestaging files for them
  • Validation of Sim09; We hope we can start massive MC production soon
 

Ongoing Task Forces and Working Groups

gLExec Deployment TF

Deleted:
<
<
  • a new plan for gLExec has been proposed in Ian Bird's presentation on the
    "Follow-up to the WLCG Workshop in Lisbon" during the Feb 16 MB meeting
  • page 4 says:
    "Freeze deployment of glexec - keep it supported for the existing use,
    but no point to expend further effort in deployment"
  • in principle the minutes of that meeting still need to be approved
  • in practice this will imply closing the remaining open tickets and wrapping up the TF
 

Machine/Job Features TF

Deleted:
<
<
  • Finalized specification and Technical Note document: https://twiki.cern.ch/twiki/bin/view/LCG/MachineJobFeaturesSpec
  • TN in HSF TN consultation process (for formatting, spelling etc.)
  • Next step is to update and complete implementations, starting with Vac/Vcycle for VMs (done), PBS/Torque (started), and HTCondor.
  • Will begin rolling out sites, initially with sites that have volunteered to help test updated implementations.
  • Update MJF SAM tests.
 

HTTP Deployment TF

Information System Evolution

Changed:
<
<

  • Information System discussed at the WLCG workshop:
    • General agreement that it would be desirable to become independent from the BDII, although in practice this needs to be understood.
    • No clear outcome about the new IS. There is a general feeling that a new IS is useful, but this needs in any case to be supported by the experiments. As a follow up at the MB on Tuesday, it was agreed to re-visit the experiment needs for this.
  • An IS TF meeting took place on 11th February:
    • In order to define a strategy for the BDII, EGI was invited to present their plans to support the BDII and it was made clear that EGI plans to support the BDII as many VOs rely on it.
    • It was agreed to assess the feasibility of moving static information to GOCDB/OIM, since experiments like ATLAS are interested in going in this direction.
    • It was agreed to work on a table where all primary information sources for each experiment will be described and identified. This should be a compact version of the Use Cases document and an easy way to understand where information is defined and where information is consumed, highlighting possible inconsistencies and also helping to steering the discussion on how to evolve the IS.
    • It was agreed to investigate whether there is room for collaboration between LHCb and ATLAS after LHCb’s implementation of multiple information collector plugins for the DIRAC CS.
    • It was decided to stop discussing about definitions since this work fits better within the benchmarking working group and the MJF TF.
>
>

 

IPv6 Validation and Deployment TF

Changed:
<
<

>
>

 

Middleware Readiness WG

Changed:
<
<

>
>

 

Multicore Deployment

Changed:
<
<

>
>
 
Deleted:
<
<
Andrea V. said that the Friday Feb. 26th LHCb meeting will discuss the issue and decide on the preferred model (slides 4 and 8 in Antonio's presentation).
 

Network and Transfer Metrics WG

Changed:
<
<

  • WG has contributed to the International Committee for Future Accelerators (ICFA) Annual networking report (https://cds.cern.ch/record/2130751)
  • WLCG Network Throughput SU: BNL to PIC throughput degradation
    • Root cause was instability of the GEANT Spain fiber channels
    • Issue was reported by ATLAS and involved ESNet, LHCONE, perfSONAR and BNL
  • WLCG Network Throughput SU: FNAL to CERN
    • Issue at ESNet, resolved by LHCOPN ops
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • Meeting held on LHCb DIRAC bridge on January 18th:
    • Ongoing developments on adding additional graphs (latencies, throughput) and bug fixing, plan is to go production by Q3 2016
  • Throughput meeting held on January 27th:
>
>

 

RFC proxies

Deleted:
<
<
  • NTR
 

Squid Monitoring and HTTP Proxy Discovery TFs

Changed:
<
<
  • Squid monitoring based on OIM/GOCDB registrations is improving, with Alastair Dewhurst making a bit more progress on getting exceptions added on the monitoring machine
  • CMS is now planning on making a virtual opportunistic computing site, that can find its proxies with a single configuration based on a Web Proxy Auto Discovery service
    • Dave Dykstra is beginning to work on hosting http://wlcg-wpad.cern.ch/wpad.dat on an existing pair of 10gbit/s external proxy machines, beginning by just supporting a few sites but eventually basing it on the OIM/GOCDB data
  • A separate proxy service is also being added to the same external proxy machines for support of LHC@home and CMS opendata
>
>

Tier0 & Tier1 2016 pledges

  • ASGC:

  • BNL:
  • CNAF:
  • FNAL:
  • GridPP:
  • IN2P3:
  • JINR:
  • KISTI:
  • KIT:
  • NDGF:
  • NL-T1:
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL:
  • TRIUMF:
 

Action list

Creation date Description Responsible Status Comments
Changed:
<
<
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea M, Maarten DONE Host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. All identified affected services now have compliant certificates and the corresponding tickets have been closed.
2015-12-17 Recommend site configurations to enforce memory limits on jobs   DONE 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F. Status of Jan 12th: A new twiki BatchSystemsConfig was finally decided as a better idea. Tickets opened, answered, recommendation written in the same twiki and MB informed.
>
>
 

Specific actions for experiments

Creation date Description Affected VO Affected TF Comments Deadline Completion
Deleted:
<
<
22.01.2016 Provide feedback to AFS service managers at CERN on whether the AFS outage OTG:0027970 that happened on 18-19.01 affected any of their critical workflows All - AFS team at CERN is reducing the dependencies and usage of AFS and is collecting existing use cases that are critical for experiments. The outage is a good opportunity to discover unknown use cases - DONE
 

Specific actions for sites

Line: 198 to 129
 

AOB

Changed:
<
<
The vulnerability issue announced yesterday was raised at the 3pm Ops call and moved here due to lack of time. All positions known at this moment are in the site reports of WLCGDailyMeetingsWeek160215#Thursday. The CERN security team will tell the T0 when to do the batches within the allowed timeframe (before Feb 24th).

-- MariaDimou - 2016-02-16

>
>
-- MariaDimou - 2016-03-15

Revision 402016-02-19 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 8 to 8
 
  • The new Experiments Test Framework (ETF) will be tentatively ready in April. It will contain few user changes.
  • The WLCG workshop and the MB decided to freeze the gLExec deployment.
  • LHCb to discuss with Multicore deployment TF experts about the future of the TF and advise the sites.
Changed:
<
<
>
>
 

Agenda

Line: 34 to 34
 
    • July 7th
  • The WLCG workshop took place on 1-3 February in Lisbon. Very high participation. Interesting discussions. People are encouraged to check agenda and attached material.
  • A follow up of the WLCG workshop was done at the MB on Tuesday. Concrete actions and probably new TFs and WGs will be created, more news on the coming weeks.
Changed:
<
<
  • The MB has also agreed to adapt and improve the LCG monthly accounting reports. A pre-GDB will be organised to discuss the accounting in detail and to review the way accounting reports are currently done. The question of CPU usage comparison to pledges was as well addressed, and there is a universal agreement that WALLTime usages should be compared to pledges (a correction will be done in the reports, as they are still comparing CPUtime to pledges). See MB slides for more details.
>
>
  • The MB has also agreed to adapt and improve the LCG monthly accounting reports. A pre-GDB will be organised to discuss the accounting in detail and to review the way accounting reports are currently done. The comparison of CPU usage to pledges was also addressed, and there is universal agreement that WALL-time usage should be compared to pledges (a correction will be made in the reports, as they still compare CPU-time to pledges). See MB slides for more details.
 

Experiments Test Framework (ETF)

Marian Babik presented ETF, a successor of SAM/Nagios test framework, currently under validation that should tentatively go into production on the 1st of April, depending on the validation process being successful by this date.

Changed:
<
<
ETF is a complete re-write of the SAM/Nagios test framework, but it's still using the same plugins, therefore not a major change from site's perspective. There are few changes that we will introduced that could impact sites:
>
>
ETF is a complete rewrite of the SAM/Nagios test framework, but it still uses the same plugins, so it is not a major change from the sites' perspective. There are a few changes that could impact sites:
 
  • Testing with RFC proxies (coordinated by RFC TF)
  • All services in the VO feeds will be tested
  • New HTTP tests (coordinated by HTTP TF)
  • Updated gLExec worker node test to the latest from UMD
Changed:
<
<
Probes' output will be produced only via job submission.
>
>
Probes' results will be taken only from job outputs, i.e. the WN will no longer send them directly to the message bus.
 

Middleware News

Line: 58 to 58
 
    • DPM 1.8.10 is now baseline. It is already in UMD3 and was verified by MW Readiness some time ago. It includes bug fixes and improvements in core and frontend components
    • Perfsonar 3.5.0 is baseline. The end of life of the previous version (3.4.1) is set to 8th of April. A security upgrade has also just been released (v 3.5.0.7) http://www.perfsonar.net/#20160216-security. Please make sure to have the latest version installed.
  • Issues:
Changed:
<
<
    • latest version of java openjdk for all platforms disabled the support for Md5 signed certificates. This has caused some issues, mainly to LHCb, because of an old certificate used for transfers stored in MyProxy (solved), and to SAM tests towards CREAM ( for CMS and ATLAS), cause SAM is using a version of HTCondor ( 8.2.10) which includes an old version of Cream CLI still signing with MD5. This is going to be fixed with the new version of SAM which will move to HTCondor 8.4. Sites have been asked not to upgrade to the latest java or to enable MD5 on their JAVA services for now. After the SAM upgrade we will ask sites to safely upgrade java and disable again MD5.
>
>
    • The latest version of Java OpenJDK for all platforms disabled support for MD5-signed certificates. This has caused some issues, mainly to LHCb, because of an old certificate used for transfers stored in MyProxy (solved), and to SAM tests towards CREAM (for CMS and ATLAS), because SAM is using a version of HTCondor (8.2.10) which includes an old version of the CREAM CLI still signing with MD5. This is going to be fixed with the new ETF-Nagios, which currently has HTCondor 8.4.4. Sites have been asked either not to upgrade to the latest Java, or to re-enable MD5 on their Java services for now. After the SAM upgrade we will ask sites to safely upgrade Java and disable MD5 again.
 
Line: 72 to 72
 
      • Upgrade Storm to v 1.11.10 on the lhcb,cms and atlas instances.
      • Installed the last production version of the storm-webdav service on the lhcb gridftp servers for the HTTP TF
Changed:
<
<
Marian needs to evaluate how complicated it is to fix the java issue reported above in SAM.
>
>
Marian will look into how complicated it is to upgrade HTCondor still on the old SAM-Nagios hosts.
 

Tier 0 News

  • Condor: 118 kHS06 out of a total of 817 kHS06 (15%).
  • Larger CREAM CE flavours being deployed.
Changed:
<
<
Jerome clarified that much more memory will be configured for the CREAM CEs. By the summer the LSF vs HTCondor proportion at the T0 will be half-half.
>
>
Jerome clarified that much more memory will be configured for the CREAM CEs. By the summer the LSF vs HTCondor proportion at the T0 is expected to be half-half.
 

DB News

Line: 163 to 163
 

Changed:
<
<
Andrea V. said that the Friday Feb. 26th LHCb meeting will discuss the issue and decide on the prefered model (slides 4 and 8 in Antonio's presentation).
>
>
Andrea V. said that the Friday Feb. 26th LHCb meeting will discuss the issue and decide on the preferred model (slides 4 and 8 in Antonio's presentation).
 

Network and Transfer Metrics WG

Revision 392016-02-19 - JeremyColes

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 16 to 16
 

Attendance

  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), David Cameron (ATLAS), Andrea Manzi, Marian Babik, Stephan Lammel (CMS), Julia Andreeva.
Changed:
<
<
  • remote: Alessandra Forti (chair), Alessandra Doria, Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Pepe Flix, Andrew McNab, Daniele Bonacorsi, Hung-Te Lee, Javier Sanchez, Kyle Gross, Rob Quick, Renaud Vernet, Eric Lancon, Vladimir Romanovski, Andrea Valassi, John Kelly, Massimo Sgaravatto.
>
>
  • remote: Alessandra Forti (chair), Alessandra Doria, Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Pepe Flix, Andrew McNab, Daniele Bonacorsi, Hung-Te Lee, Javier Sanchez, Kyle Gross, Rob Quick, Renaud Vernet, Eric Lancon, Vladimir Romanovski, Andrea Valassi, John Kelly, Massimo Sgaravatto, Jeremy Coles.
 
  • apologies: Catherine Biscarat (IN2P3)

Operations News

Revision 382016-02-19 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 72 to 72
 
      • Upgraded StoRM to v1.11.10 on the LHCb, CMS and ATLAS instances.
      • Installed the latest production version of the storm-webdav service on the LHCb GridFTP servers for the HTTP TF
Changed:
<
<
Marian will need to evaluate the java issue reported above.
>
>
Marian needs to evaluate how complicated it is to fix the java issue reported above in SAM.
 

Tier 0 News

Line: 151 to 151
 
  • Information System discussed at the WLCG workshop:
    • General agreement that it would be desirable to become independent from the BDII, although in practice this needs to be understood.
    • No clear outcome about the new IS. There is a general feeling that a new IS is useful, but this needs in any case to be supported by the experiments. As a follow up at the MB on Tuesday, it was agreed to re-visit the experiment needs for this.
  • An IS TF meeting took place on 11th February:
    • In order to define a strategy for the BDII, EGI was invited to present their plans to support the BDII and it was made clear that EGI plans to support the BDII as many VOs rely on it.
    • It was agreed to assess the feasibility of moving static information to GOCDB/OIM, since experiments like ATLAS are interested in going in this direction.
    • It was agreed to work on a table where all primary information sources for each experiment will be described and identified. This should be a compact version of the Use Cases document and an easy way to understand where information is defined and where it is consumed, highlighting possible inconsistencies and also helping to steer the discussion on how to evolve the IS.
    • It was agreed to investigate whether there is room for collaboration between LHCb and ATLAS after LHCb’s implementation of multiple information collector plugins for the DIRAC CS.
    • It was decided to stop discussing definitions, since this work fits better within the benchmarking working group and the MJF TF.
Deleted:
<
<
On the one-but-last bullet of the report, A. McNab wrote in the Vidyo chat window: It's not "LHCb's announcement of DIRAC rewrite". It's Andrei proposing to write some more information collectors with a plugin framework with that.
 

IPv6 Validation and Deployment TF


Revision 372016-02-19 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Highlights

Changed:
<
<
>
>
  • The new Experiments Test Framework (ETF) is tentatively expected to be ready in April. It will contain few user-visible changes.
  • The WLCG workshop and the MB decided to freeze the gLExec deployment.
  • LHCb to discuss the future of the TF with the Multicore Deployment TF experts and advise the sites.
  • The T0 and T1s to install the patches for the critical vulnerability announced yesterday.
 

Agenda

Attendance

Changed:
<
<
  • local: Maria Alandes (chair), Maria Dimou (minutes)...
  • remote:
>
>
  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), David Cameron (ATLAS), Andrea Manzi, Marian Babik, Stephan Lammel (CMS), Julia Andreeva.
  • remote: Alessandra Forti (chair), Alessandra Doria, Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Pepe Flix, Andrew McNab, Daniele Bonacorsi, Hung-Te Lee, Javier Sanchez, Kyle Gross, Rob Quick, Renaud Vernet, Eric Lancon, Vladimir Romanovski, Andrea Valassi, John Kelly, Massimo Sgaravatto.
 
  • apologies: Catherine Biscarat (IN2P3)

Operations News

Line: 35 to 38
 

Experiments Test Framework (ETF)

Changed:
<
<
Marian Babik presented ETF, the successor of the SAM/Nagios test framework. It is currently under validation and should tentatively go into production on the 1st of March, provided the validation is successful by that date.
>
>
Marian Babik presented ETF, the successor of the SAM/Nagios test framework. It is currently under validation and should tentatively go into production on the 1st of April, provided the validation is successful by that date.
  ETF is a complete rewrite of the SAM/Nagios test framework, but it still uses the same plugins, so it is not a major change from the sites' perspective. There are a few changes that we will introduce that could impact sites:
  • Testing with RFC proxies (coordinated by RFC TF)
Line: 43 to 46
 
  • New HTTP tests (coordinated by HTTP TF)
  • Updated gLExec worker node test to the latest from UMD
Added:
>
>
Probes' output will be produced only via job submission.
 

Middleware News

  • Useful Links:
Line: 53 to 58
 
    • DPM 1.8.10 is now baseline. It is already in UMD3 and was verified by MW Readiness some time ago. It includes bug fixes and improvements in the core and frontend components.
    • perfSONAR 3.5.0 is baseline. The previous version (3.4.1) reaches end of life on 8th April. A security update has also just been released (v3.5.0.7): http://www.perfsonar.net/#20160216-security. Please make sure to have the latest version installed.
  • Issues:
Changed:
<
<
    • The latest version of Java OpenJDK for all platforms disabled support for MD5-signed certificates. This has caused some issues, mainly for LHCb, because of an old certificate used for transfers stored in MyProxy (solved), and for SAM tests towards CREAM (for CMS and ATLAS), because SAM is using a version of HTCondor (8.2.10) which includes an old version of the CREAM CLI that still signs with MD5. This is going to be fixed with the new version of SAM, which will move to HTCondor 8.4. Sites have been asked not to upgrade to the latest Java or to enable MD5 on their Java services for now. After the SAM upgrade we will ask sites to safely upgrade Java and disable MD5 again.
>
>
    • The latest version of Java OpenJDK for all platforms disabled support for MD5-signed certificates. This has caused some issues, mainly for LHCb, because of an old certificate used for transfers stored in MyProxy (solved), and for SAM tests towards CREAM (for CMS and ATLAS), because SAM is using a version of HTCondor (8.2.10) which includes an old version of the CREAM CLI that still signs with MD5. This is going to be fixed with the new version of SAM, which will move to HTCondor 8.4. Sites have been asked not to upgrade to the latest Java or to enable MD5 on their Java services for now. After the SAM upgrade we will ask sites to safely upgrade Java and disable MD5 again.
 
Line: 61 to 66
 
      • planned upgrade to dCache 2.13 in April
    • JINR-T1
      • minor dCache upgrade to v 2.10.54
Changed:
<
<
>
>
    • IN2P3
 
      • xrootd 4.2.3-3 and new balanced redirector on tape buffer under test
    • INFN-T1
      • Upgraded StoRM to v1.11.10 on the LHCb, CMS and ATLAS instances.
      • Installed the latest production version of the storm-webdav service on the LHCb GridFTP servers for the HTTP TF
Added:
>
>
Marian will need to evaluate the java issue reported above.
 

Tier 0 News

  • Condor: 118 kHS06 out of a total of 817 kHS06 (15%).
Changed:
<
<
  • Larger CREAM CE flavours being deployed.
>
>
  • Larger CREAM CE flavours being deployed.

Jerome clarified that much more memory will be configured for the CREAM CEs. By the summer the LSF vs HTCondor proportion at the T0 will be half-half.

 

DB News

Line: 91 to 100
 

ATLAS

  • High activity: reprocessing of some 2012 data completed one week ago, just in time to start re-reprocessing of all 2015 data (expected to last 1-2 more weeks)
Changed:
<
<
  • condor/CREAM issues: CREAM database issue a couple of weeks ago, today core dumps on the pilot factories. Neither issue is yet understood.
>
>
  • condor/CREAM issues: CREAM database issue a couple of weeks ago, today core dumps on the pilot factories. Neither issue is yet understood.
 
  • CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date? GGUS:119557
  • FTS: New release 3.4.x contains a couple of important features for ATLAS, can sites deploy it?
Added:
>
>
FTS 3.4.2 is now verified for Readiness at the CERN pilot and is expected in production at CERN next week. The Stratum 1 issue will become an action for firm follow-up.
 

CMS

  • General Operational Issues
Line: 104 to 116
 
    • Sites are asked to update to Phedex 4.1.5 (or higher, 4.1.7 is the recommended version) by the end of February. One Tier-1 and about 20 Tier-2 sites still need to upgrade.
    • All Tier-1 sites are running multi-core pilots, and Tier-2 sites are now switching, coordinated by the Submission Infrastructure team via GGUS tickets (still to be opened).
Added:
>
>
Maria A. suggested that sites report on 2016 pledges at the next Ops Coord meeting (March 3rd).
 

LHCb

  • Stripping for 2015 is almost finished. Cleaning processed RAW files from disk
Changed:
<
<
  • Validation of Turbo and TurCal; Prestaging files for them
>
>
  • Validation of Turbo and TurCal; Prestaging files for them
 
  • Validation of Sim09; We hope we can start massive MC production soon

Ongoing Task Forces and Working Groups

Line: 152 to 165
 

Added:
>
>
Andrea V. said that the Friday Feb. 26th LHCb meeting will discuss the issue and decide on the prefered model (slides 4 and 8 in Antonio's presentation).
 

Network and Transfer Metrics WG


  • WG has contributed to the International Committee for Future Accelerators (ICFA) Annual networking report (https://cds.cern.ch/record/2130751)
  • WLCG Network Throughput SU: BNL to PIC throughput degradation
    • Root cause was instability of the GEANT Spain fiber channels
    • Issue was reported by ATLAS and involved ESNet, LHCONE, perfSONAR and BNL
  • WLCG Network Throughput SU: FNAL to CERN
    • Issue at ESNet, resolved by LHCOPN ops
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • Meeting held on LHCb DIRAC bridge on January 18th:
    • Ongoing developments on adding additional graphs (latencies, throughput) and bug fixing, plan is to go production by Q3 2016
  • Throughput meeting held on January 27th:
Line: 175 to 190
 

Specific actions for experiments

Creation date Description Affected VO Affected TF Comments Deadline Completion
Changed:
<
<
22.01.2016 Provide feedback to AFS service managers at CERN on whether the AFS outage OTG:0027970 that happened on 18-19.01 affected any of their critical workflows All - AFS team at CERN is reducing the dependencies and usage of AFS and is collecting existing use cases that are critical for experiments. The outage is a good opportunity to discover unknown use cases - ONGOING
>
>
22.01.2016 Provide feedback to AFS service managers at CERN on whether the AFS outage OTG:0027970 that happened on 18-19.01 affected any of their critical workflows All - AFS team at CERN is reducing the dependencies and usage of AFS and is collecting existing use cases that are critical for experiments. The outage is a good opportunity to discover unknown use cases - DONE
 

Specific actions for sites

Creation date Description Affected VO Affected TF Comments Deadline Completion
Changed:
<
<
22.01.2016 CMS sites are requested to move to Phedex 4.1.5 (minimum) or 4.1.7 (recommended) on SL6 CMS - - - ONGOING
>
>
22.01.2016 CMS sites are requested to move to Phedex 4.1.5 (minimum) or 4.1.7 (recommended) on SL6 CMS - Should be done by the end of Feb - ONGOING
18.02.2016 CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date? GGUS:119557 ATLAS - - - ONGOING
 

AOB

Changed:
<
<
-- MariaDimou - 2016-0216
>
>
The vulnerability issue announced yesterday was raised at the 3pm Ops call and moved here due to lack of time. All positions known at this moment are in the site reports of WLCGDailyMeetingsWeek160215#Thursday. The CERN security team will tell the T0 when to do the batches within the allowed timeframe (before Feb 24th).

-- MariaDimou - 2016-02-16

Revision 362016-02-18 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 138 to 138
 
  • Information System discussed at the WLCG workshop:
    • General agreement that it would be desirable to become independent from the BDII, although in practice this needs to be understood.
    • No clear outcome about the new IS. There is a general feeling that a new IS is useful, but this needs in any case to be supported by the experiments. As a follow up at the MB on Tuesday, it was agreed to re-visit the experiment needs for this.
  • An IS TF meeting took place on 11th February:
    • In order to define a strategy for the BDII, EGI was invited to present their plans to support the BDII and it was made clear that EGI plans to support the BDII as many VOs rely on it.
    • It was agreed to assess the feasibility of moving static information to GOCDB/OIM, since experiments like ATLAS are interested in going in this direction.
    • It was agreed to work on a table where all primary information sources for each experiment will be described and identified. This should be a compact version of the Use Cases document and an easy way to understand where information is defined and where it is consumed, highlighting possible inconsistencies and also helping to steer the discussion on how to evolve the IS.
    • It was agreed to investigate whether there is room for collaboration between LHCb and ATLAS after LHCb’s implementation of multiple information collector plugins for the DIRAC CS.
    • It was decided to stop discussing definitions, since this work fits better within the benchmarking working group and the MJF TF.
Added:
>
>
On the one-but-last bullet of the report, A. McNab wrote in the Vidyo chat window: It's not "LHCb's announcement of DIRAC rewrite". It's Andrei proposing to write some more information collectors with a plugin framework with that.
 

IPv6 Validation and Deployment TF


Revision 352016-02-18 - DavidCameron

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 92 to 92
 
  • High activity: reprocessing of some 2012 data completed one week ago, just in time to start re-reprocessing of all 2015 data (expected to last 1-2 more weeks)
  • condor/CREAM issues: CREAM database issue a couple of weeks ago, today core dumps on the pilot factories. Neither issue is yet understood.
Changed:
<
<
  • CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date?
>
>
  • CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date? GGUS:119557
 
  • FTS: New release 3.4.x contains a couple of important features for ATLAS, can sites deploy it?

CMS

Revision 342016-02-18 - ChristophWissing

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 97 to 97
 

CMS

Added:
>
>
  • General Operational Issues
    • Overall good usage at Tier-1 and Tier-2 sites, good job success rate and CPU utilization (incl. HLT and Tier-0 Openstack resources for processing)
    • We are short on disk space and are in contact with sites about readiness of the 2016 pledges
  • Requests for Sites
    • Sites are asked to update to Phedex 4.1.5 (or higher, 4.1.7 is the recommended version) by the end of February. One Tier-1 and about 20 Tier-2 sites still need to upgrade.
    • All Tier-1 sites are running multi-core pilots, and Tier-2 sites are now switching, coordinated by the Submission Infrastructure team via GGUS tickets (still to be opened).
 

LHCb

  • Stripping for 2015 is almost finished. Cleaning processed RAW files from disk

Revision 332016-02-18 - DaveDykstra

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 153 to 153
 

Squid Monitoring and HTTP Proxy Discovery TFs

Added:
>
>
  • Squid monitoring based on OIM/GOCDB registrations is improving, with Alastair Dewhurst making a bit more progress on getting exceptions added on the monitoring machine
  • CMS is now planning on making a virtual opportunistic computing site, that can find its proxies with a single configuration based on a Web Proxy Auto Discovery service
    • Dave Dykstra is beginning to work on hosting http://wlcg-wpad.cern.ch/wpad.dat on an existing pair of 10 Gbit/s external proxy machines, starting with support for just a few sites but eventually basing it on the OIM/GOCDB data
  • A separate proxy service is also being added to the same external proxy machines for support of LHC@home and CMS opendata
 

Action list

Creation date Description Responsible Status Comments

Revision 322016-02-18 - DavidCameron

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 90 to 90
 

ATLAS

Added:
>
>
  • High activity: reprocessing of some 2012 data completed one week ago, just in time to start re-reprocessing of all 2015 data (expected to last 1-2 more weeks)
  • condor/CREAM issues: CREAM database issue a couple of weeks ago, today core dumps on the pilot factories. Neither issue is yet understood.
  • CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date?
  • FTS: New release 3.4.x contains a couple of important features for ATLAS, can sites deploy it?
 

CMS

LHCb

Revision 312016-02-18 - VladimirRomanovsky

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 94 to 94
 

LHCb

Added:
>
>
  • Stripping for 2015 is almost finished. Cleaning processed RAW files from disk
  • Validation of Turbo and TurCal; Prestaging files for them
  • Validation of Sim09; We hope we can start massive MC production soon
 

Ongoing Task Forces and Working Groups

gLExec Deployment TF

Revision 302016-02-18 - MariaALANDESPRADILLO

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 26 to 26
 
  • Next Ops Coord meetings:
    • March 3rd
    • April 7th
Changed:
<
<
    • May 5th
>
>
    • May 12th (Since May 5th is Ascension day)
    • June 2nd
    • July 7th
 
  • The WLCG workshop took place on 1-3 February in Lisbon. Very high participation. Interesting discussions. People are encouraged to check the agenda and attached material.
  • A follow-up of the WLCG workshop took place at the MB on Tuesday. Concrete actions, and probably new TFs and WGs, will be created; more news in the coming weeks.
  • The MB has also agreed to adapt and improve the LCG monthly accounting reports. A pre-GDB will be organised to discuss the accounting in detail and to review the way accounting reports are currently done. The question of comparing CPU usage to pledges was also addressed, and there is universal agreement that wall-clock time usage should be compared to pledges (the reports will be corrected, as they still compare CPU time to pledges). See MB slides for more details.

Revision 292016-02-18 - MariaALANDESPRADILLO

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 29 to 29
 
    • May 5th
  • The WLCG workshop took place on 1-3 February in Lisbon. Very high participation. Interesting discussions. People are encouraged to check the agenda and attached material.
  • A follow-up of the WLCG workshop took place at the MB on Tuesday. Concrete actions, and probably new TFs and WGs, will be created; more news in the coming weeks.
Added:
>
>
  • The MB has also agreed to adapt and improve the LCG monthly accounting reports. A pre-GDB will be organised to discuss the accounting in detail and to review the way accounting reports are currently done. The question of comparing CPU usage to pledges was also addressed, and there is universal agreement that wall-clock time usage should be compared to pledges (the reports will be corrected, as they still compare CPU time to pledges). See MB slides for more details.
 

Experiments Test Framework (ETF)

Revision 282016-02-18 - JeromeBelleman

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 66 to 66
 

Tier 0 News

Added:
>
>
  • Condor: 118 kHS06 out of a total of 817 kHS06 (15%).
  • Larger CREAM CE flavours being deployed.
 

DB News

Tier 1 Feedback

Revision 272016-02-18 - AndrewMcNab

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 48 to 48
 
  • Baselines/New releases:
    • DPM 1.8.10 is now baseline. It is already in UMD3 and was verified by MW Readiness some time ago. It includes bug fixes and improvements in the core and frontend components.
Changed:
<
<
    • perfSONAR 3.5.0 is baseline. The previous version (3.4.1) reaches end of life on 8th April. A security update has also just been released (v3.5.0.7): http://www.perfsonar.net/#20160216-security. Please make sure to have the latest version installed.
>
>
    • perfSONAR 3.5.0 is baseline. The previous version (3.4.1) reaches end of life on 8th April. A security update has also just been released (v3.5.0.7): http://www.perfsonar.net/#20160216-security. Please make sure to have the latest version installed.
 
  • Issues:
    • The latest version of Java OpenJDK for all platforms disabled support for MD5-signed certificates. This has caused some issues, mainly for LHCb, because of an old certificate used for transfers stored in MyProxy (solved), and for SAM tests towards CREAM (for CMS and ATLAS), because SAM is using a version of HTCondor (8.2.10) which includes an old version of the CREAM CLI that still signs with MD5. This is going to be fixed with the new version of SAM, which will move to HTCondor 8.4. Sites have been asked not to upgrade to the latest Java or to enable MD5 on their Java services for now. After the SAM upgrade we will ask sites to safely upgrade Java and disable MD5 again.
    • An EGI SVG advisory was sent yesterday describing a critical vulnerability in glibc on all platforms (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2015-7547). All running resources MUST be patched by 2016-02-24 21:00 UTC.
Line: 61 to 61
 
    • IN2P3
      • xrootd 4.2.3-3 and new balanced redirector on tape buffer under test
    • INFN-T1
Changed:
<
<
      •  Upgraded StoRM to v1.11.10 on the LHCb, CMS and ATLAS instances.
>
>
      • Upgraded StoRM to v1.11.10 on the LHCb, CMS and ATLAS instances.
 
      • Installed the latest production version of the storm-webdav service on the LHCb GridFTP servers for the HTTP TF

Tier 0 News

Line: 107 to 104
 

Machine/Job Features TF

Changed:
<
<

HTTP Deployment TF

>
>
  • Finalized specification and Technical Note document: https://twiki.cern.ch/twiki/bin/view/LCG/MachineJobFeaturesSpec
  • TN in HSF TN consultation process (for formatting, spelling etc.)
  • Next step is to update and complete implementations, starting with Vac/Vcycle for VMs (done), PBS/Torque (started), and HTCondor.
  • Will begin rolling out to sites, initially those that have volunteered to help test updated implementations.
  • Update MJF SAM tests.
 
Added:
>
>

HTTP Deployment TF

 

Information System Evolution

Revision 262016-02-18 - AndreaManzi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 43 to 43
 

Middleware News

  • Useful Links:
Changed:
<
<
>
>
 
  • Baselines/New releases:
Changed:
<
<
>
>
    • DPM 1.8.10 is now baseline. It is already in UMD3 and was verified by MW Readiness some time ago. It includes bug fixes and improvements in the core and frontend components.
    • perfSONAR 3.5.0 is baseline. The previous version (3.4.1) reaches end of life on 8th April. A security update has also just been released (v3.5.0.7): http://www.perfsonar.net/#20160216-security. Please make sure to have the latest version installed.
 
  • Issues:
Changed:
<
<

>
>
    • The latest version of Java OpenJDK for all platforms disabled support for MD5-signed certificates. This has caused some issues, mainly for LHCb, because of an old certificate used for transfers stored in MyProxy (solved), and for SAM tests towards CREAM (for CMS and ATLAS), because SAM is using a version of HTCondor (8.2.10) which includes an old version of the CREAM CLI that still signs with MD5. This is going to be fixed with the new version of SAM, which will move to HTCondor 8.4. Sites have been asked not to upgrade to the latest Java or to enable MD5 on their Java services for now. After the SAM upgrade we will ask sites to safely upgrade Java and disable MD5 again.
    • An EGI SVG advisory was sent yesterday describing a critical vulnerability in glibc on all platforms (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2015-7547). All running resources MUST be patched by 2016-02-24 21:00 UTC.
 
  • T0 and T1 services
Changed:
<
<
>
>
    • FNAL
      • EOS Upgraded to v 0.3.127, with xrootd 3.3.6-4.slc6 and Bestman 2.3.0-21
      • planned upgrade to dCache 2.13 in April
    • JINR-T1
      • minor dCache upgrade to v 2.10.54
    • IN2P3
      • xrootd 4.2.3-3 and new balanced redirector on tape buffer under test
    • INFN-T1
      •  Upgraded StoRM to v1.11.10 on the LHCb, CMS and ATLAS instances.
      • Installed the latest production version of the storm-webdav service on the LHCb GridFTP servers for the HTTP TF
 

Tier 0 News

Revision 252016-02-18 - MarianBabik

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 34 to 34
  Marian Babik presented ETF, the successor of the SAM/Nagios test framework. It is currently under validation and should tentatively go into production on the 1st of March, provided the validation is successful by that date.
Changed:
<
<
ETF is a complete rewrite of the SAM/Nagios test framework, but it uses the exact same plugins we had before, so it is not a major change from the sites' perspective. There are, however, a few changes that we will introduce in testing, such as:
  • all tests will use RFC proxies (coordinated by RFC TF)
  • we will test all endpoints that are listed in the VO feeds (we’ll no longer perform any checks wrt GOCDB/OIM) - long standing request from the experiments
  • we will start testing HTTP endpoints (coordinated by HTTP TF)
>
>
ETF is a complete rewrite of the SAM/Nagios test framework, but it still uses the same plugins, so it is not a major change from the sites' perspective. There are a few changes that we will introduce that could impact sites:
  • Testing with RFC proxies (coordinated by RFC TF)
  • All services in the VO feeds will be tested
  • New HTTP tests (coordinated by HTTP TF)
  • Updated gLExec worker node test to the latest from UMD
 

Middleware News

Revision 242016-02-18 - MariaALANDESPRADILLO

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 18 to 18
 

Operations News

Changed:
<
<
>
>
  • Operations Coordination Meetings have been reorganised as of 1st March. See MB slides presented this week:
    • 3PM meetings once a week on Mondays
    • Ops Coord meetings once per month on the first Thursday of the month
      • Topical meetings
      • Written reports still requested, but not necessary to go through them during the meeting
  • Next Ops Coord meetings:
    • March 3rd
    • April 7th
    • May 5th
  • The WLCG workshop took place on 1-3 February in Lisbon. Very high participation. Interesting discussions. People are encouraged to check the agenda and attached material.
  • A follow-up of the WLCG workshop took place at the MB on Tuesday. Concrete actions, and probably new TFs and WGs, will be created; more news in the coming weeks.
 

Experiments Test Framework (ETF)

Revision 232016-02-18 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Changed:
<
<

Highlights

>
>

Highlights

 

Agenda

Line: 119 to 119
 

Action list

Creation date Description Responsible Status Comments
Changed:
<
<
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Maarten ONGOING Host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit when the change finally comes in Globus 6.1, now foreseen for April 1 (sic). On Jan 21 there are only 2 tickets still open: GGUS:117043 for CNAF (largely done) and GGUS:118371 for FNAL (in progress). Maarten will follow-up the progress of these tickets.
>
>
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea M, Maarten DONE Host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. All identified affected services now have compliant certificates and the corresponding tickets have been closed.
 
2015-12-17 Recommend site configurations to enforce memory limits on jobs   DONE 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F. Status of Jan 12th: A new twiki BatchSystemsConfig was finally decided as a better idea. Tickets opened, answered, recommendation written in the same twiki and MB informed.

Revision 222016-02-17 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 56 to 56
 

ALICE

Added:
>
>
  • mostly high activity
  • disk space
    • about 2.5 PB were recovered thanks to ad-hoc cleanups
    • further cleanups expected by spring, pending agreement on policy changes
    • CASTOR situation for raw data reco looks good, thanks for the support!
 

ATLAS

CMS

Line: 67 to 73
 

gLExec Deployment TF

Changed:
<
<
>
>
  • a new plan for gLExec has been proposed in Ian Bird's presentation on the
    "Follow-up to the WLCG Workshop in Lisbon" during the Feb 16 MB meeting
  • page 4 says:
    "Freeze deployment of glexec - keep it supported for the existing use,
    but no point to expend further effort in deployment"
  • in principle the minutes of that meeting still need to be approved
  • in practice this will imply closing the remaining open tickets and wrapping up the TF
 

Machine/Job Features TF

Line: 96 to 110
 

RFC proxies

Added:
>
>
  • NTR
 

Squid Monitoring and HTTP Proxy Discovery TFs

Revision 21 2016-02-17 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, February 18th 2016

Line: 14 to 14
 

Attendance

  • local: Maria Alandes (chair), Maria Dimou (minutes)...
  • remote:
Changed:
<
<
  • apologies:
>
>
  • apologies: Catherine Biscarat (IN2P3)
 

Operations News

Added:
>
>

Experiments Test Framework (ETF)

Marian Babik presented ETF, the successor of the SAM/Nagios test framework. It is currently under validation and should tentatively go into production on the 1st of March, provided that the validation succeeds by that date.

ETF is a complete rewrite of the SAM/Nagios test framework, but it uses exactly the same plugins as before, so it is not a major change from the sites' perspective. A few changes will nevertheless be introduced in testing, such as:

  • all tests will use RFC proxies (coordinated by RFC TF)
  • we will test all endpoints that are listed in the VO feeds (we’ll no longer perform any checks wrt GOCDB/OIM) - long standing request from the experiments
  • we will start testing HTTP endpoints (coordinated by HTTP TF)
 

Middleware News

  • Useful Links:

Revision 20 2016-02-16 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"
Changed:
<
<

WLCG Operations Coordination Minutes, January 21st 2016

>
>

WLCG Operations Coordination Minutes, February 18th 2016

 

Highlights

Changed:
<
<
  • It is reminded to sites and experiments that CC7, SL7 and CentOS7 are compatible distributions and that there is no problem in using one or another.
  • AFS team at CERN is interested in collecting feedback from experiments who suffered from the AFS outage OTG:0027970 on 18-19.01 affecting any critical workflows.
  • ETF Nagios will move to RFC proxies on 01.03.2016. Validation is currently ongoing.
  • CMS sites are requested to move to Phedex 4.1.5 (minimum version) or to 4.1.7 (recommended version) on SL6.
  • ATLAS Sites Jamboree taking place on Wednesday 27th to Friday 29th January. Sites should register if they plan to attend.
>
>
 

Agenda

Changed:
<
<
>
>
 

Attendance

Changed:
<
<
  • local: Maria Alandes (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), Alessandro Di Girolamo (ATLAS), Oliver Keeble, Andrea Manzi, Marian Babik, Raja Nandakumar (LHCb)
  • remote: Alessandra Forti (chair), Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Felix Lee, Jeremy Coles, Pepe Flix, Jeremy Coles, Stefano Belforte, Catherine Biscarat, Vincenzo Spinoso, Frederique Chollet.
  • apologies: Maria Dimou (MWR WG)
>
>
  • local: Maria Alandes (chair), Maria Dimou (minutes)...
  • remote:
  • apologies:
 

Operations News

Changed:
<
<
  • GGUS support is returning to the care of Maria Dimou.
>
>
 

Middleware News

  • Useful Links:
Changed:
<
<
>
>
 
  • Baselines/New releases:
Changed:
<
<
>
>
 
  • Issues:
Changed:
<
<
    • globus-gssapi change for hostname verification: the new behaviour, which has been under discussion for a long time, is going to be released on the 1st of April.
    • openldap crash in TopBDII and ARC-CE resource BDII. RedHat is going to release the fix in RHEL 6.8 (To be scheduled). The same openldap version causing the issue is now available in CentOS7…likely we should have the same issue there so we have asked to include the fix also in the next version of RHEL 7
    • GGUS:118842, gfal-cat fails with Castor, issue discovered when using the ATLAS tool which collects storage dumps
  • T0 and T1 services
    • ASGC
      • CASTOR decommissioned
    • CNAF
      • installed the last production version of the storm-webdav service on the lhcb gridftp servers. Updated srm servers certificates to include alternative names used to contact them
    • CERN,RAL and BNL
      • Dev suggested DB change in order to fix a problem on Rucio polling has been applied
    • NDGF
      • dCache upgraded to v 2.14.8
>
>
 
Changed:
<
<
Alessandra asks whether the situation with CC7, SL7 and CentOS7 is clear, in the sense that they are different distributions and sites may be confused on which OS they have to install. Maarten explains that there is no difference as they are compatible and that this has always been the case in the past already with SL, as the SL distribution in Fermilab and the one at CERN were not exactly the same. The important thing is that all these distributions are compatible. Maarten adds that it is very unlikely that a package is built with a particular OS dependency that is only available in one of the distributions. In this case, it will be discovered and the dependency will have to be explicitly declared. It's not a big problem. Andrea reminds that the MW verification is always done on CentOS7. Maarten explains that also in the past this happened since verification was done in SLC. Some inconsistencies were found at the time between SL and SLC and they were corrected.
>
>
  • T0 and T1 services
 

Tier 0 News

Changed:
<
<
NTR.
>
>
 

DB News

Tier 1 Feedback

Deleted:
<
<
  • NDGF-T1 : Alice disk storage is really full, causing problems. (Ulf can't attend, I'm in a meeting at CERN at the same time)
 

Tier 2 Feedback

Line: 63 to 46
 

ALICE

Deleted:
<
<
  • mostly high activity
  • disk space
    • regular and ad-hoc cleanups ongoing
    • policy changes under discussion with the physics groups
    • CASTOR: the old disk servers remain available until April - thanks!
 

ATLAS

Deleted:
<
<
  • Activities running smoothly during the past 2 weeks. Stable around 200k running slots.
  • Discovered an issue on the Reprocessing data produced over the xmas break. The issue has been now fixed, the new tasks will be most probably submitted end of this week. Last round data will be most probably deleted starting from tomorrow (waiting for green light from DP)
  • At the last WLCG Ops Coord meeting https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes160107#ATLAS we reported an issue with MadGraph jobs producing huge log files that caused trouble for WNs. This was not the only problem they created: we discovered that they also created quite a lot of dark data, >1PB (i.e. data on storage but not recorded in Rucio), because the pilot was taking hours to tar the output logs and not sending updates to pandaserver, which then considered the job dead. This will be fixed in the next pilot release.
  • To all the ATLAS sites. ATLAS Sites Jamboree Wed-Fri 27-29 January. https://indico.cern.ch/event/440821/ . Please register if you plan to attend.
 

CMS

Deleted:
<
<
  • Tier0/PromptRECO
    • CMS took much more Heavy Ion data last year than PromptRECO capacity would allow
    • Had a rather long backlog of "Tier-0/PromptRECO" jobs
    • Backlog gone since early this week
  • Continue to have high to very high processing and production load
    • More than 100k parallel production jobs at most of the time
    • Reprocessing of 2015 data progressing well
  • Requests for sites
    • All sites are requested to move to Phedex 4.1.5 (minimum version) or to 4.1.7 (recommended version) on SL6. Please note that PhEDEx version 4.1.6 actually doesn't exist.
  • Operational issues
    • Kerberos/AFS problem at CERN affected user and various CMS services Alarm ticket - GGUS:118938
    • Some Kibana based monitoring from CERN-IT has been (is being) fixed

Maarten reminds that at the 3PM meeting today, the AFS service manager asked for feedback on critical workflows affected by the AFS outage.

 

LHCb

Deleted:
<
<
  • Activities :
    • Mostly user and MC jobs running on the grid
    • Started pre-staging for turbo-calibration.
  • Information
    • Restripping imminent - use pre-staged data from
    • Thanks to CNAF for enabling MJF. Look forward to other sites also enabling it.
    • Other site issues handled either by GGUS tickets or internally.
      • SARA srm problems
 

Ongoing Task Forces and Working Groups

gLExec Deployment TF

Changed:
<
<
  • NTR
>
>
 

Machine/Job Features TF

HTTP Deployment TF

Deleted:
<
<
  • All relevant sites have now been ticketed (around 40)
  • Meeting on 20th Jan was canceled
  • TF will continue to iterate on the remaining tickets
 

Information System Evolution

Changed:
<
<

  • Preparation and discussion of the slides to be presented in the WLCG workshop.
  • New Execution Environment service in GOCDB/OIM to give logical CPUs and Benchmark information of the resources in a site:
    • Discussion with GOCDB developer to understand whether a new Execution Environment service could be added to GOCDB. The answer is yes but there is no writeable REST API for the time being. Feedback being collected from sys admins to understand advantages and disadvantages of having this new service defined in GOCDB.
    • OSG is partially providing the needed information (Benchmark) already. They are planning to add HS06 normalisation constant to be able to derive the number of Logical CPUs from there (Logical cores = total hs06 / hs06 normalization)
  • After the WLCG workshop we hope to have more clear directions on next steps inside the TF, especially for the new IS, that for the time being is on hold.

Alessandro reminds that IS TF should align with any definitions done in other TFs like MJF. Maria reminds that Andrew McNab is making the link between the two TFs and brings in the discussion any relevant information that also affects MJF. Moreover, at the WLCG workshop a joint session between IS, Accounting and benchmarking will take place to discuss common issues.

>
>

  • Information System discussed at the WLCG workshop:
    • General agreement that it would be desirable to become independent from the BDII, although in practice this needs to be understood.
    • No clear outcome about the new IS. There is a general feeling that a new IS is useful, but this needs in any case to be supported by the experiments. As a follow up at the MB on Tuesday, it was agreed to re-visit the experiment needs for this.
  • An IS TF meeting took place on 11th February:
    • In order to define a strategy for the BDII, EGI was invited to present their plans to support the BDII and it was made clear that EGI plans to support the BDII as many VOs rely on it.
    • It was agreed to assess the feasibility of moving static information to GOCDB/OIM, since experiments like ATLAS are interested in going in this direction.
    • It was agreed to work on a table where all primary information sources for each experiment will be described and identified. This should be a compact version of the Use Cases document and an easy way to understand where information is defined and where information is consumed, highlighting possible inconsistencies and also helping to steer the discussion on how to evolve the IS.
    • It was agreed to investigate whether there is room for collaboration between LHCb and ATLAS after LHCb’s implementation of multiple information collector plugins for the DIRAC CS.
    • It was decided to stop discussing definitions since this work fits better within the benchmarking working group and the MJF TF.
 

IPv6 Validation and Deployment TF

Changed:
<
<

>
>

 

Middleware Readiness WG

Changed:
<
<

>
>

 

Multicore Deployment

Changed:
<
<
>
>

 

Network and Transfer Metrics WG

Changed:
<
<

  • WLCG Network Throughput SU: GGUS-118730 Throughput degradation between CA and EU
    • Root cause was instability of the transatlantic link (WIX reported submarine shunt fault), which in turn impacted Geant- CANARIE link.
    • perfSONAR network helped to identify the problematic segment and once Canarie was notified the issue was resolved by re-routing.
    • Issue was reported by ATLAS, but many different people were involved (ATLAS, TRIUMF, perfSONAR support, LHCONE, Canarie, WIX).
    • Multiple GGUS tickets were open, but only one was followed up, something to improve in the future.
    • Experiments: Please check if everyone was notified of the on-going incident and let us know if we need to add additional contacts (wlcg-network-throughput mailing list)
  • OSG perfSONAR production services: Storage failure (OASIS) at GOC has impacted the entire perfSONAR pipeline, initially just the datastore, but later on also collector and publisher. The issue was resolved yesterday and the systems are recovering now. We have proposed changes that would remove dependency on the shared storage.
>
>

  • WG has contributed to the International Committee for Future Accelerators (ICFA) Annual networking report (https://cds.cern.ch/record/2130751)
  • WLCG Network Throughput SU: BNL to PIC throughput degradation
    • Root cause was instability of the GEANT Spain fiber channels
    • Issue was reported by ATLAS and involved ESNet, LHCONE, perfSONAR and BNL
  • WLCG Network Throughput SU: FNAL to CERN
    • Issue at ESNet, resolved by LHCOPN ops
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • Meeting held on LHCb DIRAC bridge on January 18th:
    • Ongoing developments on adding additional graphs (latencies, throughput) and bug fixing; the plan is to go into production by Q3 2016
  • Throughput meeting held on January 27th:
 

RFC proxies

Deleted:
<
<
  • SAM
    • new ETF Nagios preprod hosts are using RFC proxies for ALICE, ATLAS and LHCb
    • also agreed for CMS
    • comparisons with production still to be done, but mostly to check other changes
    • tentative date for production: March 1

Alessandra asks whether there is any objection to the proposed date. Maarten explains that all experiment contacts for SAM have been informed and it's looking OK so far. Marian explains that validation is still ongoing and that he could do a short presentation at the next meeting to give more details. In any case these changes should be transparent.

 

Squid Monitoring and HTTP Proxy Discovery TFs

Changed:
<
<
  • Alastair Dewhurst has finished the implementation of a flexible exception list for squids to monitor and just needs to make it available for ATLAS & CMS to use
  • Vassil Verguilov will next fill in exceptions known to CMS and generate a CMS-specific MRTG monitoring page using the CMS Sitedb to translate from the names in GOCDB & OIM into the TN_CC_Site format that CMS uses to name sites.
>
>
 

Action list

Creation date Description Responsible Status Comments
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Maarten ONGOING Host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit when the change finally comes in Globus 6.1, now foreseen for April 1 (sic). On Jan 21 there are only 2 tickets still open: GGUS:117043 for CNAF (largely done) and GGUS:118371 for FNAL (in progress). Maarten will follow-up the progress of these tickets.
Changed:
<
<
2015-12-17 Recommend site configurations to enforce memory limits on jobs   ONGOING 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F. Status of Jan 12th: A new twiki BatchSystemsConfig was finally decided as a better idea. Tickets opened.
>
>
2015-12-17 Recommend site configurations to enforce memory limits on jobs   DONE 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F. Status of Jan 12th: A new twiki BatchSystemsConfig was finally decided as a better idea. Tickets opened, answered, recommendation written in the same twiki and MB informed.
 

Specific actions for experiments

Line: 174 to 108
 

AOB

Deleted:
<
<
-- MariaALANDESPRADILLO - 2016-01-19
 \ No newline at end of file
Added:
>
>
-- MariaDimou - 2016-02-16

Revision 19 2016-01-25 - MariaALANDESPRADILLO

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 8 to 8
 
  • It is reminded to sites and experiments that CC7, SL7 and CentOS7 are compatible distributions and that there is no problem in using one or another.
  • AFS team at CERN is interested in collecting feedback from experiments who suffered from the AFS outage OTG:0027970 on 18-19.01 affecting any critical workflows.
  • ETF Nagios will move to RFC proxies on 01.03.2016. Validation is currently ongoing.
Changed:
<
<
  • CMS sites are requested to move to Phedex 4.1.5 (minimum version) or to 4.1.7 (recommended version) on SL6. Please note that PhEDEx version 4.1.6 actually doesn't exist.
>
>
  • CMS sites are requested to move to Phedex 4.1.5 (minimum version) or to 4.1.7 (recommended version) on SL6.
 
  • ATLAS Sites Jamboree taking place on Wednesday 27th to Friday 29th January. Sites should register if they plan to attend.

Agenda

Revision 18 2016-01-22 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 8 to 8
 
  • It is reminded to sites and experiments that CC7, SL7 and CentOS7 are compatible distributions and that there is no problem in using one or another.
  • AFS team at CERN is interested in collecting feedback from experiments who suffered from the AFS outage OTG:0027970 on 18-19.01 affecting any critical workflows.
  • ETF Nagios will move to RFC proxies on 01.03.2016. Validation is currently ongoing.
Changed:
<
<
  • CMS sites are requested to move to Phedex 4.1.6 on SL6.
>
>
  • CMS sites are requested to move to Phedex 4.1.5 (minimum version) or to 4.1.7 (recommended version) on SL6. Please note that PhEDEx version 4.1.6 actually doesn't exist.
 
  • ATLAS Sites Jamboree taking place on Wednesday 27th to Friday 29th January. Sites should register if they plan to attend.

Agenda

Line: 16 to 16
 

Attendance

Changed:
<
<
  • local: Maria Alandes (minutes), Maarten Litmaath (ALICE), Jerome Belman (T0), Alessandro di Girolamo (ATLAS), Oliver Keeble, Andrea Manzi, Marian Babik, Raja Nandakumar (LHCb)
  • remote: Alessandra Forti (chair), Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, DAvid Mason, Felix Lee, Jeremy Coles, Pepe Flix, Jeremy Coles, Stefano Berforte, Catherine Biscarat, Vincenzo Spinoso, Frederique Chollet.
>
>
  • local: Maria Alandes (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), Alessandro Di Girolamo (ATLAS), Oliver Keeble, Andrea Manzi, Marian Babik, Raja Nandakumar (LHCb)
  • remote: Alessandra Forti (chair), Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Felix Lee, Jeremy Coles, Pepe Flix, Jeremy Coles, Stefano Belforte, Catherine Biscarat, Vincenzo Spinoso, Frederique Chollet.
 
  • apologies: Maria Dimou (MWR WG)

Operations News

Line: 46 to 46
 
    • NDGF
      • dCache upgraded to v 2.14.8
Changed:
<
<
Alessandra asks whether the situation with CC7, SL7 and CentOS7 is clear, in the sense that they are different distributions and sites may be confussed on which OS they have to install. Maarten explains that there is no difference as they are compatible and that this has always been the case in the past already with SL, as the SL distribution in Fermilab and the one at CERN were not exactly the same. The important thing is that all these distributions are compatible. Maarten adds that it is very unlikely that a package is built with a particular OS dependency that is only available in one of the distributions. In this case, it will be discovered and the dependency will have to be explicitely declared. It's not a big problem. Andrea reminds that the MW verification is always done on CentOS7. Maarten explains that also in the past this happened since verification was done in SLC. Some inconsistencies were found at the time between SL and SLC and they were corrected.
>
>
Alessandra asks whether the situation with CC7, SL7 and CentOS7 is clear, in the sense that they are different distributions and sites may be confused on which OS they have to install. Maarten explains that there is no difference as they are compatible and that this has always been the case in the past already with SL, as the SL distribution in Fermilab and the one at CERN were not exactly the same. The important thing is that all these distributions are compatible. Maarten adds that it is very unlikely that a package is built with a particular OS dependency that is only available in one of the distributions. In this case, it will be discovered and the dependency will have to be explicitly declared. It's not a big problem. Andrea reminds that the MW verification is always done on CentOS7. Maarten explains that also in the past this happened since verification was done in SLC. Some inconsistencies were found at the time between SL and SLC and they were corrected.
 

Tier 0 News

Line: 85 to 85
 
    • More than 100k parallel production jobs at most of the time
    • Reprocessing of 2015 data progressing well
  • Requests for sites
Changed:
<
<
    • All sites are reuqested to move to Phedex 4.1.6 on SL6
>
>
    • All sites are requested to move to Phedex 4.1.5 (minimum version) or to 4.1.7 (recommended version) on SL6. Please note that PhEDEx version 4.1.6 actually doesn't exist.
 
  • Operational issues
    • Kerberos/AFS problem at CERN affected user and various CMS services Alarm ticket - GGUS:118938
    • Some Kibana based monitoring from CERN-IT has been (is being) fixed
Line: 149 to 149
 
    • comparisons with production still to be done, but mostly to check other changes
    • tentative date for production: March 1
Changed:
<
<
Alessandra asks whether there is any objection to the proposed date. Maarten explains that all experiments contacts in the SAM group have been informed and it's OK. Marian explains that validation is still ongoing and that he could do a short presentation at the next meeting to give more details. In any case these changes should be transparent.
>
>
Alessandra asks whether there is any objection to the proposed date. Maarten explains that all experiment contacts for SAM have been informed and it's looking OK so far. Marian explains that validation is still ongoing and that he could do a short presentation at the next meeting to give more details. In any case these changes should be transparent.
 

Squid Monitoring and HTTP Proxy Discovery TFs

Line: 170 to 170
 

Specific actions for sites

Creation date Description Affected VO Affected TF Comments Deadline Completion
Changed:
<
<
22.01.2016 CMS sites are requested to move to Phedex 4.1.6 on SL6 CMS - - - ONGOING
>
>
22.01.2016 CMS sites are requested to move to Phedex 4.1.5 (minimum) or 4.1.7 (recommended) on SL6 CMS - - - ONGOING
 

AOB

Revision 17 2016-01-22 - MariaALANDESPRADILLO

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Highlights

Added:
>
>
  • It is reminded to sites and experiments that CC7, SL7 and CentOS7 are compatible distributions and that there is no problem in using one or another.
  • AFS team at CERN is interested in collecting feedback from experiments who suffered from the AFS outage OTG:0027970 on 18-19.01 affecting any critical workflows.
  • ETF Nagios will move to RFC proxies on 01.03.2016. Validation is currently ongoing.
  • CMS sites are requested to move to Phedex 4.1.6 on SL6.
  • ATLAS Sites Jamboree taking place on Wednesday 27th to Friday 29th January. Sites should register if they plan to attend.
 

Agenda

Attendance

Changed:
<
<
  • local: Maria Alandes (minutes)
  • remote: Alessandra Forti (chair)
>
>
  • local: Maria Alandes (minutes), Maarten Litmaath (ALICE), Jerome Belman (T0), Alessandro di Girolamo (ATLAS), Oliver Keeble, Andrea Manzi, Marian Babik, Raja Nandakumar (LHCb)
  • remote: Alessandra Forti (chair), Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, DAvid Mason, Felix Lee, Jeremy Coles, Pepe Flix, Jeremy Coles, Stefano Berforte, Catherine Biscarat, Vincenzo Spinoso, Frederique Chollet.
 
  • apologies: Maria Dimou (MWR WG)

Operations News

Line: 41 to 46
 
    • NDGF
      • dCache upgraded to v 2.14.8
Added:
>
>
Alessandra asks whether the situation with CC7, SL7 and CentOS7 is clear, in the sense that they are different distributions and sites may be confussed on which OS they have to install. Maarten explains that there is no difference as they are compatible and that this has always been the case in the past already with SL, as the SL distribution in Fermilab and the one at CERN were not exactly the same. The important thing is that all these distributions are compatible. Maarten adds that it is very unlikely that a package is built with a particular OS dependency that is only available in one of the distributions. In this case, it will be discovered and the dependency will have to be explicitely declared. It's not a big problem. Andrea reminds that the MW verification is always done on CentOS7. Maarten explains that also in the past this happened since verification was done in SLC. Some inconsistencies were found at the time between SL and SLC and they were corrected.
 

Tier 0 News

NTR.

Line: 83 to 90
 
    • Kerberos/AFS problem at CERN affected user and various CMS services Alarm ticket - GGUS:118938
    • Some Kibana based monitoring from CERN-IT has been (is being) fixed

Added:
>
>
Maarten reminds that at the 3PM meeting today, the AFS service manager asked for feedback on critical workflows affected by the AFS outage.
 

LHCb

  • Activities :
    • Mostly user and MC jobs running on the grid
Line: 114 to 123
 
  • Preparation and discussion of the slides to be presented in the WLCG workshop.
  • New Execution Environment service in GOCDB/OIM to give logical CPUs and Benchmark information of the resources in a site:
    • Discussion with GOCDB developer to understand whether a new Execution Environment service could be added to GOCDB. The answer is yes but there is no writeable REST API for the time being. Feedback being collected from sys admins to understand advantages and disadvantages of having this new service defined in GOCDB.
    • OSG is partially providing the needed information (Benchmark) already. They are planning to add HS06 normalisation constant to be able to derive the number of Logical CPUs from there (Logical cores = total hs06 / hs06 normalization)
  • After the WLCG workshop we hope to have more clear directions on next steps inside the TF, especially for the new IS, that for the time being is on hold.
Added:
>
>
Alessandro reminds that IS TF should align with any definitions done in other TFs like MJF. Maria reminds that Andrew McNab is making the link between the two TFs and brings in the discussion any relevant information that also affects MJF. Moreover, at the WLCG workshop a joint session between IS, Accounting and benchmarking will take place to discuss common issues.
 

IPv6 Validation and Deployment TF


Line: 138 to 149
 
    • comparisons with production still to be done, but mostly to check other changes
    • tentative date for production: March 1
Added:
>
>
Alessandra asks whether there is any objection to the proposed date. Maarten explains that all experiments contacts in the SAM group have been informed and it's OK. Marian explains that validation is still ongoing and that he could do a short presentation at the next meeting to give more details. In any case these changes should be transparent.
 

Squid Monitoring and HTTP Proxy Discovery TFs

  • Alastair Dewhurst has finished the implementation of a flexible exception list for squids to monitor and just needs to make it available for ATLAS & CMS to use
Line: 152 to 165
 

Specific actions for experiments

Creation date Description Affected VO Affected TF Comments Deadline Completion
Added:
>
>
22.01.2016 Provide feedback to AFS service managers at CERN on whether the AFS outage OTG:0027970 that happened on 18-19.01 affected any of their critical workflows All - AFS team at CERN is reducing the dependencies and usage of AFS and is collecting existing use cases that are critical for experiments. The outage is a good opportunity to discover unknown use cases - ONGOING
 

Specific actions for sites

Creation date Description Affected VO Affected TF Comments Deadline Completion
Added:
>
>
22.01.2016 CMS sites are requested to move to Phedex 4.1.6 on SL6 CMS - - - ONGOING
 

AOB

Revision 16 2016-01-21 - AndreaManzi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 38 to 38
 
      • installed the last production version of the storm-webdav service on the lhcb gridftp servers. Updated srm servers certificates to include alternative names used to contact them
    • CERN,RAL and BNL
      • Dev suggested DB change in order to fix a problem on Rucio polling has been applied
Added:
>
>
    • NDGF
      • dCache upgraded to v 2.14.8
 

Tier 0 News

Revision 15 2016-01-21 - DaveDykstra

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 138 to 138
 

Squid Monitoring and HTTP Proxy Discovery TFs

Added:
>
>
  • Alastair Dewhurst has finished the implementation of a flexible exception list for squids to monitor and just needs to make it available for ATLAS & CMS to use
  • Vassil Verguilov will next fill in exceptions known to CMS and generate a CMS-specific MRTG monitoring page using the CMS Sitedb to translate from the names in GOCDB & OIM into the TN_CC_Site format that CMS uses to name sites.
 

Action list

Creation date Description Responsible Status Comments

Revision 14 2016-01-21 - AleDiGGi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 61 to 61
 
    • CASTOR: the old disk servers remain available until April - thanks!

ATLAS

Added:
>
>
  • Activities running smoothly during the past 2 weeks. Stable around 200k running slots.
  • Discovered an issue on the Reprocessing data produced over the xmas break. The issue has been now fixed, the new tasks will be most probably submitted end of this week. Last round data will be most probably deleted starting from tomorrow (waiting for green light from DP)
  • last WLCG Ops Coord meeting https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes160107#ATLAS we reported about an issue with MadGraph jobs which were producing huge log files which were creating troubles to WNs. This was not the only problem they created, we discovered that they also created quite a lot of dark data >1PB (i.e. data on storage but not recorded in Rucio) because the pilot was taking hours to tar the output logs, not sending updates to pandaserver which then considered the job as dead. This will be fixed in the next pilot release.
  • To all the ATLAS sites. ATLAS Sites Jamboree Wed-Fri 27-29January. https://indico.cern.ch/event/440821/ . Please register if you plan to attend.
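The notion of dark data used in the MadGraph item above (files on storage that are not recorded in the catalogue) can be illustrated with a minimal sketch. It assumes two plain lists of file paths, one from a storage dump and one from the catalogue; the function name `find_dark_data` and the sample paths are hypothetical and do not correspond to any Rucio or pilot API.

```python
def find_dark_data(storage_dump, catalog_entries):
    """Return paths present on storage but absent from the catalogue.

    Both arguments are iterables of file paths (an assumption made for
    this sketch; real dumps carry more metadata).
    """
    return sorted(set(storage_dump) - set(catalog_entries))


# Hypothetical sample data for illustration only.
storage_dump = ["/data/a.root", "/data/b.root", "/data/log.tgz"]
catalog = ["/data/a.root", "/data/b.root"]

print(find_dark_data(storage_dump, catalog))
```

A set difference keeps the comparison linear in the number of entries, which matters when a dump lists millions of files.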
 

CMS

Revision 132016-01-21 - JeromeBelleman

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 41 to 41
 

Tier 0 News

Added:
>
>
NTR.
 

DB News

Tier 1 Feedback

Revision 122016-01-21 - OliverKeeble

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 95 to 95
 

HTTP Deployment TF

Added:
>
>
  • All relevant sites have now been ticketed (around 40)
  • Meeting on 20th Jan was canceled
  • TF will continue to iterate on the remaining tickets
 

Information System Evolution


  • Preparation and discussion of the slides to be presented in the WLCG workshop.
  • New Execution Environment service in GOCDB/OIM to give logical CPUs and Benchmark information of the resources in a site:
    • Discussion with GOCDB developer to understand whether a new Execution Environment service could be added to GOCDB. The answer is yes but there is no writeable REST API for the time being. Feedback being collected from sys admins to understand advantages and disadvantages of having this new service defined in GOCDB.
    • OSG already partially provides the needed information (Benchmark). They are planning to add an HS06 normalisation constant so that the number of logical CPUs can be derived from it (logical cores = total HS06 / HS06 normalisation).
  • After the WLCG workshop we hope to have clearer directions on the next steps inside the TF, especially for the new IS, which is on hold for the time being.
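The derivation of logical cores from HS06 figures mentioned above can be sketched as follows. This is only an illustration of the stated formula (logical cores = total HS06 / HS06 normalisation); the function name, the rounding choice and the sample numbers are assumptions, not values published by OSG or GOCDB.

```python
def logical_cores(total_hs06: float, hs06_normalisation: float) -> int:
    """Derive the number of logical cores from a site's total HS06
    capacity and its per-core HS06 normalisation constant."""
    if hs06_normalisation <= 0:
        raise ValueError("HS06 normalisation constant must be positive")
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(total_hs06 / hs06_normalisation)


# Hypothetical site: 10000 HS06 in total, 10 HS06 per logical core.
print(logical_cores(10000.0, 10.0))
```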

Revision 112016-01-21 - UlfBobsonSeverinTigerstedt

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 44 to 44
 

DB News

Tier 1 Feedback

Added:
>
>
* NDGF-T1: ALICE disk storage is really full, causing problems. (Ulf can't attend, I'm in a meeting at CERN at the same time)
 

Tier 2 Feedback

Revision 102016-01-21 - MariaALANDESPRADILLO

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 96 to 96
 

Information System Evolution

Changed:
<
<

Warning: Can't find topic LCG.section=20160121
>
>

  • Preparation and discussion of the slides to be presented in the WLCG workshop.
  • New Execution Environment service in GOCDB/OIM to give logical CPUs and Benchmark information of the resources in a site:
    • Discussion with GOCDB developer to understand whether a new Execution Environment service could be added to GOCDB. The answer is yes but there is no writeable REST API for the time being. Feedback being collected from sys admins to understand advantages and disadvantages of having this new service defined in GOCDB.
    • OSG already partially provides the needed information (Benchmark). They are planning to add an HS06 normalisation constant so that the number of logical CPUs can be derived from it (logical cores = total HS06 / HS06 normalisation).
  • After the WLCG workshop we hope to have clearer directions on the next steps inside the TF, especially for the new IS, which is on hold for the time being.
 

IPv6 Validation and Deployment TF

Revision 92016-01-21 - AndreaManzi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 30 to 30
 
  • Issues:
    • globus-gssapi change for hostname verification: the new behaviour, which we have been discussing for a long time, is going to be released on April 1st.
    • openldap crash in TopBDII and ARC-CE resource BDII. Red Hat is going to release the fix in RHEL 6.8 (to be scheduled). The openldap version causing the issue is now also available in CentOS 7, where we will likely see the same problem, so we have asked for the fix to be included in the next version of RHEL 7 as well.
Added:
>
>
    • GGUS:118842: gfal-cat fails with CASTOR; the issue was discovered when using the ATLAS tool which collects storage dumps
 
  • T0 and T1 services
    • ASGC
      • CASTOR decommissioned

Revision 82016-01-21 - AlessandraForti

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 17 to 17
 

Operations News

Added:
>
>
  • GGUS support is returning to the care of Maria Dimou.
 

Middleware News

  • Useful Links:

Revision 72016-01-21 - ChristophWissing

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 58 to 58
 

CMS

Added:
>
>
  • Tier0/PromptRECO
    • CMS took much more Heavy Ion data last year than PromptRECO capacity would allow
    • Had a rather long backlog of "Tier-0/PromptRECO" jobs
    • Backlog gone since early this week
  • Continue to have high to very high processing and production load
    • More than 100k parallel production jobs most of the time
    • Reprocessing of 2015 data progressing well
  • Requests for sites
    • All sites are requested to move to Phedex 4.1.6 on SL6
  • Operational issues
    • Kerberos/AFS problem at CERN affected users and various CMS services; alarm ticket GGUS:118938
    • Some Kibana based monitoring from CERN-IT has been (is being) fixed

 

LHCb

  • Activities :
    • Mostly user and MC jobs running on the grid

Revision 62016-01-21 - RajaNandakumar

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 59 to 59
 

CMS

LHCb

Added:
>
>
  • Activities :
    • Mostly user and MC jobs running on the grid
    • Started pre-staging for turbo-calibration.
  • Information
    • Restripping imminent - use pre-staged data from
    • Thanks to CNAF for enabling MJF. Look forward to other sites also enabling it.
    • Other site issues handled either by GGUS tickets or internally.
      • SARA srm problems
 

Ongoing Task Forces and Working Groups

Revision 52016-01-21 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 49 to 49
 

ALICE

  • mostly high activity
Added:
>
>
  • disk space
    • regular and ad-hoc cleanups ongoing
    • policy changes under discussion with the physics groups
    • CASTOR: the old disk servers remain available until April - thanks!
 

ATLAS

Revision 42016-01-21 - AndreaManzi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 23 to 23
 
Changed:
<
<
  • Baselines:
>
>
 
  • Issues:
Added:
>
>
    • globus-gssapi change for hostname verification: the new behaviour, which we have been discussing for a long time, is going to be released on April 1st.
    • openldap crash in TopBDII and ARC-CE resource BDII. Red Hat is going to release the fix in RHEL 6.8 (to be scheduled). The openldap version causing the issue is now also available in CentOS 7, where we will likely see the same problem, so we have asked for the fix to be included in the next version of RHEL 7 as well.
  • T0 and T1 services
    • ASGC
      • CASTOR decommissioned
    • CNAF
      • Installed the latest production version of the storm-webdav service on the LHCb gridftp servers. Updated the SRM server certificates to include the alternative names used to contact them
    • CERN, RAL and BNL
      • A DB change suggested by the developers to fix a problem with Rucio polling has been applied
 

Tier 0 News

Revision 32016-01-20 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 38 to 38
 

ALICE

Added:
>
>
  • mostly high activity
 

ATLAS

CMS

Line: 48 to 50
 

gLExec Deployment TF

Added:
>
>
  • NTR
 

Machine/Job Features TF

HTTP Deployment TF

Line: 74 to 78
 

RFC proxies

Added:
>
>
  • SAM
    • new ETF Nagios preprod hosts are using RFC proxies for ALICE, ATLAS and LHCb
    • also agreed for CMS
    • comparisons with production still to be done, but mostly to check other changes
    • tentative date for production: March 1
 

Squid Monitoring and HTTP Proxy Discovery TFs

Action list

Creation date Description Responsible Status Comments
Changed:
<
<
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Maarten ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Jan 7 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress). Maarten will follow-up the progress of these tickets. They will be mentioned at the 3pm Ops call on Jan 11th
>
>
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Maarten ONGOING Host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit when the change finally comes in Globus 6.1, now foreseen for April 1 (sic). On Jan 21 there are only 2 tickets still open: GGUS:117043 for CNAF (largely done) and GGUS:118371 for FNAL (in progress). Maarten will follow-up the progress of these tickets.
 
2015-12-17 Recommend site configurations to enforce memory limits on jobs   ONGOING 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F. Status of Jan 12th: A new twiki BatchSystemsConfig was finally decided as a better idea. Tickets opened.

Revision 22016-01-20 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Line: 13 to 13
 

Attendance

  • local: Maria Alandes (minutes)
  • remote: Alessandra Forti (chair)
Added:
>
>
  • apologies: Maria Dimou (MWR WG)
 

Operations News

Revision 12016-01-19 - MariaALANDESPRADILLO

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 21st 2016

Highlights

Agenda

Attendance

  • local: Maria Alandes (minutes)
  • remote: Alessandra Forti (chair)

Operations News

Middleware News

Tier 0 News

DB News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

ATLAS

CMS

LHCb

Ongoing Task Forces and Working Groups

gLExec Deployment TF

Machine/Job Features TF

HTTP Deployment TF

Information System Evolution


Warning: Can't find topic LCG.section=20160121

IPv6 Validation and Deployment TF


Middleware Readiness WG


Multicore Deployment

Network and Transfer Metrics WG


  • WLCG Network Throughput SU: GGUS:118730 Throughput degradation between CA and EU
    • Root cause was instability of the transatlantic link (WIX reported a submarine shunt fault), which in turn impacted the GEANT-CANARIE link.
    • The perfSONAR network helped to identify the problematic segment, and once CANARIE was notified the issue was resolved by re-routing.
    • The issue was reported by ATLAS, but many different parties were involved (ATLAS, TRIUMF, perfSONAR support, LHCONE, CANARIE, WIX).
    • Multiple GGUS tickets were opened, but only one was followed up; something to improve in the future.
    • Experiments: Please check if everyone was notified of the on-going incident and let us know if we need to add additional contacts (wlcg-network-throughput mailing list)
  • OSG perfSONAR production services: Storage failure (OASIS) at GOC has impacted the entire perfSONAR pipeline, initially just the datastore, but later on also collector and publisher. The issue was resolved yesterday and the systems are recovering now. We have proposed changes that would remove dependency on the shared storage.

RFC proxies

Squid Monitoring and HTTP Proxy Discovery TFs

Action list

Creation date Description Responsible Status Comments
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Maarten ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Jan 7 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress). Maarten will follow-up the progress of these tickets. They will be mentioned at the 3pm Ops call on Jan 11th
2015-12-17 Recommend site configurations to enforce memory limits on jobs   ONGOING 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F. Status of Jan 12th: A new twiki BatchSystemsConfig was finally decided as a better idea. Tickets opened.

Specific actions for experiments

Creation date Description Affected VO Affected TF Comments Deadline Completion

Specific actions for sites

Creation date Description Affected VO Affected TF Comments Deadline Completion

AOB

-- MariaALANDESPRADILLO - 2016-01-19

 