Difference: WLCGOpsMinutes160107 (1 vs. 41)

Revision 41 2018-02-28 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 153 to 153
 

Middleware Readiness WG

Changed:
<
<

>
>

The JIRA dashboard shows per experiment and per site the product versions pending for Readiness verification. Changes since the Ops Coord. meeting of Dec. 17th are few due to the year end holidays. Details:

 

Multicore Deployment

Revision 40 2016-01-12 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 176 to 176
 
Creation date Description Responsible Status Comments
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Maarten ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Jan 7 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress). Maarten will follow up on the progress of these tickets. They will be mentioned at the 3pm Ops call on Jan 11th.
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team CLOSE & Open New A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting. Julia explains why implementing in the SSB a Google calendar also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1's. It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to define a reasonable timescale. At the Jan 7th meeting, Maria Alandes reported that she is in touch with GOCDB and more news will hopefully come next week.
Changed:
<
<
2015-12-17 Recommend site configurations to enforce memory limits on jobs   CREATED 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F.
>
>
2015-12-17 Recommend site configurations to enforce memory limits on jobs   CREATED 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F. Status as of Jan 12th: It was finally decided that a new twiki, BatchSystemsConfig, is a better idea. Tickets opened.
 

Specific actions for experiments

Revision 39 2016-01-08 - MariaALANDESPRADILLO

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 6 to 6
 

Highlights

Changed:
<
<
  • The HTTP TF will be able to close when 90% of the sites will show correctly configured without interrupt for over a week. The TF opened GGUS tickets to give the sites [[https://twiki.cern.ch/twiki/bin/view/LCG/HTTPTFSAMProbe][all relevant instructions.
>
>
  • The HTTP TF will be able to close once 90% of the sites have shown as correctly configured without interruption for over a week. The TF opened GGUS tickets to give the sites all relevant instructions.
 
  • The Multicore Deployment TF announced that WLCG users should mainly use the Tier1 and Tier2 views which now use the same data as the production portal (i.e. include cores).

Agenda

Revision 38 2016-01-07 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 91 to 91
 
    • This is because they produce a large amount of output and the output log tarball contained data files.
    • The problem has been understood. A fix is needed in the ATLAS transformation, which may take a few weeks to be implemented and put in production, so we also decided to add a "safety" check in the pilot to make sure that this problem is caught before it causes trouble on the WNs. This fix will most probably be released in a week to 10 days from now.
Added:
>
>
Maria Alandes asked what the reasons were for the reduction in reprocessing time. The reasons are multiple: many more cores, better network performance and improved software quality.
 

CMS

  • Happy New Year to everyone!
Line: 119 to 121
 
    • Problem pre-staging files at RRCKI
    • Nickname VOMS attribute can not be retrieved (GGUS:118361)
Added:
>
>
There was a discussion on the reasons why the above ticket has had no activity since Dec. 16th and is in status "On hold". It should be followed up by LHCb offline.
 

Ongoing Task Forces and Working Groups

gLExec Deployment TF

Line: 170 to 174
 

Action list

Creation date Description Responsible Status Comments
Changed:
<
<
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Jan 7 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress).
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team CLOSE & Open New A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting. Julia explains why implementing in the SSB a Google calendar also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1's. It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to define a reasonable timescale.
>
>
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Maarten ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Jan 7 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress). Maarten will follow up on the progress of these tickets. They will be mentioned at the 3pm Ops call on Jan 11th.
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team CLOSE & Open New A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting. Julia explains why implementing in the SSB a Google calendar also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1's. It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to define a reasonable timescale. At the Jan 7th meeting, Maria Alandes reported that she is in touch with GOCDB and more news will hopefully come next week.
 
2015-12-17 Recommend site configurations to enforce memory limits on jobs   CREATED 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F.

Revision 37 2016-01-07 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 15 to 15
 

Attendance

  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Jerome Belleman, Julia Andreeva, Gavin McCance, Helge Meinhard, Oliver Keeble, Marian Babik, Xavier Espinal,
Changed:
<
<
  • remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Alessandra Forti, Di Qinq, Renaud Vernet, Dave Mason, Daniele Bonacorsi, Antonio Yzquierdo, Josep Flix, Zoltan Mathe (LHCb), Javier Sanchez, Federico Melaccio, Anton Gomel, Bjashal (T2_IN_TIFR).
>
>
  • remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Alessandra Forti, Di Qinq, Renaud Vernet, Dave Mason, Daniele Bonacorsi, Antonio Yzquierdo, Josep Flix, Zoltan Mathe (LHCb), Javier Sanchez, Federico Melaccio, Anton Gamel, B. Jashal (T2_IN_TIFR).
 
  • apologies: Vincenzo Spinoso (EGI)

Operations News

Changed:
<
<
  • Andrea Sciaba has stopped working in WLCG Operations. Many thanks for his valuable contribution! Maria Dimou and Maria Alandes will remain part of the Operations Coordinations team at CERN, together with Pepe and Alessandra.
>
>
  • Andrea Sciaba has stopped working in WLCG Operations. Many thanks for his valuable contribution! Maria Dimou and Maria Alandes will remain part of the Operations Coordination team at CERN, together with Pepe and Alessandra.
 
  • WLCG workshop: Registration closes on 22nd January.
  • Memory limits for batch queues: At the MB of 27.10.2015, it was decided to put an action on WLCG Operations to produce a set of recipes about how to best configure memory limits for batch queues. Operations coordination will open a set of GGUS tickets to a selection of sites (mostly T1s and a few T2s). Please, be ready to provide the necessary input. Thanks in advance.
Line: 61 to 61
 
    • Thanks to the sites for keeping things in good shape!
    • The first round of the heavy-ion reconstruction finished!
  • CASTOR issues
Changed:
<
<
    • Dec 19: alarm ticket GGUS:118443 because the transfer manager was stuck
>
>
    • Dec 18: alarm ticket GGUS:118443 because the transfer manager was stuck
 
      • Fixed later that afternoon, thanks!
    • Dec 31: team ticket GGUS:118554 because of same problem
      • OK again since Jan 1 00:00, thanks!
Line: 170 to 170
 

Action list

Creation date Description Responsible Status Comments
Changed:
<
<
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Dec 17 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress).
>
>
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Jan 7 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress).
 
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team CLOSE & Open New A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting. Julia explains why implementing in the SSB a Google calendar also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1's. It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to define a reasonable timescale.
2015-12-17 Recommend site configurations to enforce memory limits on jobs   CREATED 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F.

Revision 36 2016-01-07 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Highlights

Changed:
<
<
>
>
  • WLCG workshop: Registration closes on 22nd January.
  • The HTTP TF will be able to close when 90% of the sites will show correctly configured without interrupt for over a week. The TF opened GGUS tickets to give the sites [[https://twiki.cern.ch/twiki/bin/view/LCG/HTTPTFSAMProbe][all relevant instructions.
  • The Multicore Deployment TF announced that WLCG users should mainly use the Tier1 and Tier2 views which now use the same data as the production portal (i.e. include cores).
 

Agenda

Attendance

Changed:
<
<
  • local: Maria Alandes (chair), Maria Dimou (minutes),
  • remote:
>
>
  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Jerome Belleman, Julia Andreeva, Gavin McCance, Helge Meinhard, Oliver Keeble, Marian Babik, Xavier Espinal,
  • remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Alessandra Forti, Di Qinq, Renaud Vernet, Dave Mason, Daniele Bonacorsi, Antonio Yzquierdo, Josep Flix, Zoltan Mathe (LHCb), Javier Sanchez, Federico Melaccio, Anton Gomel, Bjashal (T2_IN_TIFR).
 
  • apologies: Vincenzo Spinoso (EGI)

Operations News

Line: 37 to 40
 
    • Triumf
      • dCache upgraded to v 2.10.44

Added:
>
>
Maria Alandes asked whether Red Hat has released the openldap fixes that we tested successfully. The answer is 'not yet'.
 

Tier 0 News

  • Condor: 86 kHS06 → 96 kHS06 out of a total of 784 kHS06 since 10 Dec
Line: 167 to 172
 
Creation date Description Responsible Status Comments
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Dec 17 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress).
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team CLOSE & Open New A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting. Julia explains why implementing in the SSB a Google calendar also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1's. It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to define a reasonable timescale.
Changed:
<
<
2015-12-17 Recommend site configurations to enforce memory limits on jobs   CREATED 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system
>
>
2015-12-17 Recommend site configurations to enforce memory limits on jobs   CREATED 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F.
 

Specific actions for experiments

Revision 35 2016-01-07 - DavidCameron

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 73 to 73
 

ATLAS

  • Smooth operations over the whole xMas break, almost steadily between 230-250k running parallel slots.
Changed:
<
<
    • RAL had a few issues with their storage
    • The NET2 - BNL network link was saturated because we put too much RAW data there for reprocessing
>
>
 
  • Reprocessing:
    • Almost completely finished the whole reprocessing campaign (around 1.8PB of RAW input data) during the Xmas break.
    • This is quite a remarkable result, in the past comparable reprocessing campaigns took 4-6 weeks.

Revision 34 2016-01-07 - AndreaManzi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 27 to 27
 
Changed:
<
<
    • As reported before the end of the holidays, a problem affected dCache pool version > 2.12 when using Berkeley DB as metadata backed which could lead to data loss.
>
>
    • As reported before the holidays, a problem affected dCache pool version > 2.12 when using Berkeley DB as metadata backend which could lead to data loss.
  For these particular installations, baselines are now dCache 2.12.28, 2.13.6, 2.14.5 and upgrade details circulated by dCache devs are available at: https://twiki.cern.ch/twiki/pub/LCG/WLCGBaselineVersions/dcache-bug.txt

Revision 33 2016-01-07 - DaveDykstra

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 160 to 160
 

Squid Monitoring and HTTP Proxy Discovery TFs

Changed:
<
<
>
>
  • NTR
 

Action list

Revision 32 2016-01-07 - ChristophWissing

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 87 to 87
 
    • The problem has been understood. A fix is needed in the ATLAS transformation, which may take a few weeks to be implemented and put in production, so we also decided to add a "safety" check in the pilot to make sure that this problem is caught before it causes trouble on the WNs. This fix will most probably be released in a week to 10 days from now.

CMS

Changed:
<
<
>
>
  • Happy New Year to everyone!
  • Rather high production load over Xmas break
    • Ran more than 100k jobs in parallel on many days
    • HLT (High Level Trigger) contributed a few thousand cores
    • No major issues
  • Tier-0 / PromptRECO
    • Backlog of pending jobs not fully cleared during the break
    • Partly due to lacking resources at CERN
    • Needed help from experts to provision fresh VMs GGUS:118546
  • Tape operations
    • Had a rather long backlog of not approved tape migrations at FNAL before Xmas break
      • Sorted out via CMS site contacts
    • Some datasets not moving at RAL
 

LHCb

Revision 31 2016-01-07 - OliverKeeble

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 111 to 111
 

HTTP Deployment TF

Deleted:
<
<
 
Changed:
<
<
  • A first set of tickets, around 15, have been assigned to sites.
  • The next meeting will concentrate on setting up the operational plan for the campaign to get the monitoring green.
>
>
 

Information System Evolution

Revision 30 2016-01-07 - JeromeBelleman

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 39 to 39
 

Tier 0 News

Added:
>
>
  • Condor: 86 kHS06 → 96 kHS06 out of a total of 784 kHS06 since 10 Dec
 

DB News

Tier 1 Feedback

Revision 29 2016-01-07 - DavidCameron

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 71 to 71
 

ATLAS

  • Smooth operations over the whole xMas break, almost steadily between 230-250k running parallel slots.
Changed:
<
<
  • Reprocessing : almost completely finished the whole reprocessing campaign (around 1.8PB of RAW input data) during the Xmas break. This is quite a remarkable result, in the past comparable reprocessing campaigns were taking 4-6 weeks. Thanks to the effort of the sites which were extremely stable during the xmas period and to some expert which made sure that few issues were quickly understood and solved. * FTS3 possibly quite dangerous bug. Noticed few lost files (registered in Rucio but not on storage) on Monday, it took few days to understand it, today an email has been sent to the FTS devels.
  • Minor: some sites noticed that some jobs (very few, event generation using MadGraph library) were creating troubles to the WN where they run. This is because they produce large amount of outputs. The problem has been understood, there is the need of a fix in the ATLAS trasnformation which can take few weeks to be done and be put in production, so we decided also to add some "safety" on the pilot which will make sure that this problem will be caught before it will create troubles to the WNs. This fix will be most probably released in one week/10days from now.

>
>
    • RAL had a few issues with their storage
    • The NET2 - BNL network link was saturated because we put too much RAW data there for reprocessing
  • Reprocessing:
    • Almost completely finished the whole reprocessing campaign (around 1.8PB of RAW input data) during the Xmas break.
    • This is quite a remarkable result, in the past comparable reprocessing campaigns took 4-6 weeks.
    • Thanks to the effort of the sites which were extremely stable during the xmas period and to some experts who made sure that the few issues were quickly understood and solved.
  • FTS3:
    • Another possibly quite dangerous bug.
    • Noticed a few lost files (registered in Rucio but not on storage) on Monday; it took a few days to understand the problem, and today an email was sent to the FTS developers.
  • Minor: some sites noticed that some jobs (very few, event generation using MadGraph library) were causing trouble on the WNs where they ran.
    • This is because they produce a large amount of output and the output log tarball contained data files.
    • The problem has been understood. A fix is needed in the ATLAS transformation, which may take a few weeks to be implemented and put in production, so we also decided to add a "safety" check in the pilot to make sure that this problem is caught before it causes trouble on the WNs. This fix will most probably be released in a week to 10 days from now.
 

CMS

Revision 28 2016-01-07 - AleDiGGi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 70 to 70
 
    • Dec 31: tape SE working again, thanks!

ATLAS

Changed:
<
<
>
>
  • Smooth operations over the whole xMas break, almost steadily between 230-250k running parallel slots.
  • Reprocessing : almost completely finished the whole reprocessing campaign (around 1.8PB of RAW input data) during the Xmas break. This is quite a remarkable result, in the past comparable reprocessing campaigns were taking 4-6 weeks. Thanks to the effort of the sites which were extremely stable during the xmas period and to some expert which made sure that few issues were quickly understood and solved. * FTS3 possibly quite dangerous bug. Noticed few lost files (registered in Rucio but not on storage) on Monday, it took few days to understand it, today an email has been sent to the FTS devels.
  • Minor: some sites noticed that some jobs (very few, event generation using MadGraph library) were creating troubles to the WN where they run. This is because they produce large amount of outputs. The problem has been understood, there is the need of a fix in the ATLAS trasnformation which can take few weeks to be done and be put in production, so we decided also to add some "safety" on the pilot which will make sure that this problem will be caught before it will create troubles to the WNs. This fix will be most probably released in one week/10days from now.

 

CMS

Revision 27 2016-01-07 - OliverKeeble

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 97 to 97
 

HTTP Deployment TF

Added:
>
>
  • The next TF meeting has been confirmed as 20th Jan - https://indico.cern.ch/event/473194/
  • ETF is up and running in preprod
  • A first set of tickets, around 15, have been assigned to sites.
  • The next meeting will concentrate on setting up the operational plan for the campaign to get the monitoring green.
 

Information System Evolution


  • IS TF meeting scheduled tomorrow Friday 8th January. (Agenda)
    • Definitions: summary of the proposed definitions and feedback from sys admins.
    • Status of new IS: news on the feedback given so far by experiments.
    • Preparation for the WLCG workshop discussion about the IS.

Revision 26 2016-01-07 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 13 to 13
 

Attendance

  • local: Maria Alandes (chair), Maria Dimou (minutes),
  • remote:
Added:
>
>
  • apologies: Vincenzo Spinoso (EGI)
 

Operations News

  • Andrea Sciaba has stopped working in WLCG Operations. Many thanks for his valuable contribution! Maria Dimou and Maria Alandes will remain part of the Operations Coordinations team at CERN, together with Pepe and Alessandra.

Revision 25 2016-01-07 - AndreaManzi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 26 to 26
 
Changed:
<
<
>
>
    • As reported before the end of the holidays, a problem affected dCache pool version > 2.12 when using Berkeley DB as metadata backed which could lead to data loss. For this particular installations, baselines are now dCache 2.12.28, 2.13.6, 2.14.5 and upgrade details circulated by dCache devs are available at : https://twiki.cern.ch/twiki/pub/LCG/WLCGBaselineVersions/dcache-bug.txt
 
  • Issues:
Changed:
<
<
*

  • T0 and T1 services
>
>
 

Tier 0 News

Revision 24 2016-01-07 - MariaALANDESPRADILLO

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 15 to 15
 
  • remote:

Operations News

Changed:
<
<
>
>
  • Andrea Sciaba has stopped working in WLCG Operations. Many thanks for his valuable contribution! Maria Dimou and Maria Alandes will remain part of the Operations Coordinations team at CERN, together with Pepe and Alessandra.
  • WLCG workshop: Registration closes on 22nd January.
  • Memory limits for batch queues: At the MB of 27.10.2015, it was decided to put an action on WLCG Operations to produce a set of recipes about how to best configure memory limits for batch queues. Operations coordination will open a set of GGUS tickets to a selection of sites (mostly T1s and a few T2s). Please, be ready to provide the necessary input. Thanks in advance.
 

Middleware News

  • Useful Links:

Revision 23 2016-01-07 - ZoltanMathe

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 69 to 69
 

LHCb

Changed:
<
<
>
>
  • Activities:
    • Monte Carlo and user analysis.
    • Pre-staging the data for re-stripping is almost finished.

  • Issue:
    • Problem pre-staging files at RRCKI
    • Nickname VOMS attribute can not be retrieved (GGUS:118361)
 

Ongoing Task Forces and Working Groups

Revision 22 2016-01-07 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, January 7th 2016

Line: 41 to 41
 

Experiments Reports

ALICE

Changed:
<
<
>
>
  • Best wishes for 2016!
  • Normal to high activity levels during the break
    • Thanks to the sites for keeping things in good shape!
    • The first round of the heavy-ion reconstruction finished!
  • CASTOR issues
    • Dec 19: alarm ticket GGUS:118443 because the transfer manager was stuck
      • Fixed later that afternoon, thanks!
    • Dec 31: team ticket GGUS:118554 because of same problem
      • OK again since Jan 1 00:00, thanks!
    • Jan 5: team ticket GGUS:118619 ditto
      • debugged live by the devs
      • root cause was not found yet
  • EOS issues
    • Dec 31: team ticket GGUS:118559 for EOS at CERN
      • Partly due to EOS-ALICE being ~full !
      • Some disk servers were unavailable
      • Mitigated by the admins, thanks!
  • KIT
    • Dec 31: tape SE working again, thanks!
 

ATLAS

Line: 56 to 75
 

gLExec Deployment TF

Changed:
<
<
>
>
  • NTR
 

Machine/Job Features TF

Line: 86 to 105
 

RFC proxies

Changed:
<
<
>
>
  • NTR
 

Squid Monitoring and HTTP Proxy Discovery TFs

Revision 21 2016-01-05 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"
Changed:
<
<

WLCG Operations Coordination Minutes, December 17th 2015

>
>

WLCG Operations Coordination Minutes, January 7th 2016

 

Highlights

Changed:
<
<
  • A critical bug was found in dCache and it may cause significant data loss. It affects versions 2.12.[0,27], 2.13.[0,15], 2.14.[0,4] if BerkeleyDB is used as backend. All concerned sites MUST apply the fix described in https://twiki.cern.ch/twiki/pub/LCG/WLCGBaselineVersions/dcache-bug.txt as soon as possible!
  • All ATLAS sites that have not done it already should take action ASAP on the tickets they received to run storage consistency checks
  • Thanks to everybody for the good work in WLCG operations during 2015!
>
>
 

Agenda

Changed:
<
<
>
>
 

Attendance

Changed:
<
<
  • local: Maria Dimou (chair), Andrea Sciabà (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Maite Barroso, Jerome Belleman, Julia Andreeva, Alessandro Di Girolamo
  • remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Vincenzo Spinoso, Hung-Te Lee
  • apologies: Andrew McNab (MJF TF)
>
>
  • local: Maria Alandes (chair), Maria Dimou (minutes),
  • remote:
 

Operations News

Changed:
<
<
  • The MB asked Operations Coordination to produce a recommendation on memory limits configuration for batch queues.
  • Our next meeting will take place on Jan. 7th 2016.
  • Workshop for HTCondor and ARC CE users in Barcelona, Spain, from Feb 29 to March 4, 2016. Aimed at users and admins of HTCondor, HTCondor-CEs and ARC-CEs. Several talks and tutorials, meetings with the developers. Proposals for contributions can be sent to hepix-condorworkshop2016-interest (at) cern (dot) ch. More information at https://indico.cern.ch/e/Spring2016HTCondorWorkshop.

Maite announces that she'll no longer represent the Tier-0. A successor (or a rota of them) will be chosen early next year.

Concerning the recommendations for configuring memory limits: the motivation is to improve on the current situation, where sites have to find out without any guidance how to set up memory limits for jobs, which also causes experiments to observe inconsistent behaviours among different sites. It also has implications for purchasing new hardware.

After some discussion, it was agreed that the first step will be to collect from sites information on their current setup (e.g. from the Tier-1s and any willing Tier-2s) and put it in a twiki. Tickets will be used only in case of insufficient feedback. Finally, recommendations will be given, depending on the batch system used and any other relevant factor.

Alessandro mentions that the HTCondor/ARC workshop will overlap with the ATLAS software week.

>
>
 

Middleware News

  • Useful Links:
Line: 38 to 23
 
Changed:
<
<
>
>
 
  • Issues:
Changed:
<
<
    • As reported by Ulf, a quite serious bug is affecting dCache v 2.12/2.13/2.14. Today dCache released a patch, we will contact the sites with more details ASAP.
    • Good news regarding the openldap crash affecting Top BDII and ARC-CE. A new set of rpms has been provided by RedHat, and tested at CERN, DESY and by ARC devs. The issue seems to be finally solved so we are now pushing RedHat to release the new version ASAP
    • A quite big issue affecting gfal2-2.10.2 (copy to/from SRM failed when using BDII resolution) was discovered only in production. A fix was immediately pushed to EPEL stable (gfal2-2.10.3).

Maarten adds that, for the dCache bug, most likely not all sites are affected, as it depends on the local configuration. Also, dCache site admins normally are subscribed to the dCache admin forum and would thus have already been informed of all the details. Still it would be good to send a WLCG broadcast about it (done by Andrea M). Alessandro mentions that NDGF was severely hit by it because it happens when doing disk-to-disk copies, and many disk servers were being decommissioned. A broadcast will be sent just after the meeting.

>
>
*
 
  • T0 and T1 services
Deleted:
<
<
    • ASGC
      • Castor Decommissioning planned for the end of the year
    • CNAF
      • plan to upgrade StoRM to 1.11.0 when released and to move from storm-http to the new storm-webdav
    • IN2P3
      • dCache upgraded to v 2.13.4
    • NDGF
      • dCache upgraded to v 2.14.4
 

Tier 0 News

Deleted:
<
<
Jerome reports that now the HTCondor pool has more than 85 kHS06 of computing power (corresponding to about 10K slots).
 

DB News

Tier 1 Feedback

Changed:
<
<
  • NDGF-T1 had a good update to dCache 2.14 on Monday, but then noticed a bug that had been introduced into dCache 2.12.0 in January. It causes files that have been moved around within the storage system to lose the stickiness flag, marking the files as available for garbage collection. This is OK (and the default behaviour) for tape files, but not for disk files. So far we know of 1428 lost files. ALICE and ATLAS will get a list of files at some point this week. (Ulf writing in since it's unclear if I can attend the meeting due to travel). The bug has been fixed in dCache, and affects the 2.12, 2.13 and 2.14 releases.

Ulf adds that also PIC and some German sites were hit by this bug.

>
>
 

Tier 2 Feedback

Experiments Reports

ALICE

Changed:
<
<
  • The heavy ion data taking has ended successfully!
    • Reconstruction and reprocessing will continue for many more weeks
    • The RSS memory usage has remained up to max ~2.5 GB
    • High-memory arrangements were undone also at KISTI and KIT
      • to allow more job slots to be used again, thanks!
  • The CASTOR team then rearranged the ALICE disk servers into a single pool:
    • to allow convenient usage of all available resources, thanks!
  • Grid activity has been high
  • Expectations for the end-of-year break:
    • steady MC production
    • heavy ion reconstruction
    • low analysis activity

  • Thanks to all sites and experts for another successful year!
  • Season's greetings and best wishes for 2016!
>
>
 

ATLAS

Changed:
<
<
  • During xmas break
  • FTS: we are still suffering from critical issues. Yesterday's incident is most probably related to the high prestaging activity (to prestage data for reprocessing). We have asked the FTS devs to clarify the best course of action to minimize the issues over the Xmas break.
    • DDM has implemented an automatic restart in case of FTS issues: this only mitigates the problem.
  • Storage Consistency checks: dear sites, please answer the GGUS ticket. Overview: approx. 130 GGUS tickets submitted in total, 80 closed/verified, 50 still open, 30 of which without any answer yet!!
  • Merry Xmas, happy new year: super thanks to everybody for making it such a successful year!

Alessandro explains that the problem with FTS is that there is an extremely high limit on the number of files in a prestaging request, and requests with too many files will slow down and possibly "collapse" FTS or the SE. Andrea M. adds that the latest patch (3.4.0), now in the pilot, reduces the limit to 1000, but - as the patch contains several other changes - it will not be deployed in production until next year.

>
>
 

CMS

Changed:
<
<
  • Heavy Ion run
    • Took more data than planned originally
    • Pushed the DAQ, StorageManager and PromptRECO to the limits
    • High load on CERN EOS
    • A few files were lost because they were deleted from the buffer discs before being processed
    • Still a big backlog of unprocessed data
  • Big MC RE-DIGIRECO on going
    • Utilizing (large fractions of) CERN, Tier-1s, most Tier-2s
  • "End of the Year" data RE-RECO about to be released
  • Computing will continue with high load during Xmas break

  • Many thanks for the support in 2015. Have a nice Xmas break, and best wishes already now for 2016!
>
>
 

LHCb

Changed:
<
<
  • Pre-staging data for Stripping 24.
  • Aim to run Monte Carlo during the YETS, including on HLT farm.
>
>
 

Ongoing Task Forces and Working Groups

gLExec Deployment TF

Changed:
<
<
  • NTR
>
>
 

Machine/Job Features TF

Changed:
<
<
  • We have produced a 2nd draft of the HSF technical note and hope to be able to move it into the HSF approval process at the start of next year with no major changes. After that we will look at updating the reference implementations to match the note, with the aim of providing values for all the keys listed in the note.
>
>
 

HTTP Deployment TF

Information System Evolution

Changed:
<
<

  • A proposal for a new WLCG IS based on AGIS was presented at the last GDB.
    • Ongoing discussions with experiments to understand their interest in this new IS.
    • The proposal will be presented at the MB next year to see whether it gets approved.
  • In the meantime, the following activities are ongoing within the TF:
    • Ongoing discussion to agree on a better definition of the GLUE 2 attributes defining HS06 (GLUE2BenchmarkValue) and Logical CPUs (GLUE2ExecutionEnvironmentLogicalCPUs): feedback from sys admins is being collected for two possible definitions.
    • Presented at the last UMD meeting a proposal to validate information at its source so that we can avoid publishing information that is known to be wrong. A technical solution will have to be worked out together with MW developers.
  • Preparing the IS session at the WLCG workshop in February together with Alessandra Forti who will be the chair and who is gathering feedback on what to discuss.
  • Next IS TF meeting scheduled on Friday 8th January. ( Preliminary agenda)
>
>

  • IS TF meeting scheduled tomorrow Friday 8th January. ( Agenda)
    • Definitions: summary of the proposed definitions and feedback from sys admins.
    • Status of new IS: news on the feedback given so far by experiments.
    • Preparation for the WLCG workshop discussion about the IS.
 

IPv6 Validation and Deployment TF

Changed:
<
<

>
>

 

Middleware Readiness WG

Changed:
<
<

>
>

 

Multicore Deployment

Changed:
<
<
>
>

  • Accounting:
    • John Gordon update: The default EGI view has not changed, but WLCG users should mainly use the Tier1 and Tier2 views (e.g. http://accounting.egi.eu/tier1.php ), which now use the same data as the production portal (i.e. they include cores). The EMI3(WLCG) view also includes cores and is useful for an integrated view of a country covering its Tier1, Tier2, Tier3 and other sites.
    • On the ATLAS side, work is ongoing to compare accounting records in the dashboard and in APEL, site by site for the T1s and region by region for the T2s.

 

Network and Transfer Metrics WG

Changed:
<
<

>
>

  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard), minor instability in the dashboard reported yesterday, being followed up by OSG
  • Additional monitoring metrics will be added to psomd.grid.iu.edu to capture collector's efficiency and report on freshness of the metadata in the OSG Datastore (for each sonar).
  • Proposed re-organization of the WG meetings, split into two areas, perfSONAR operations (throughput calls) and research/pilot projects
    • perfSONAR operations - main scope would be to continue with perfSONAR support, follow up on the existing infrastructure while at the same time start looking into issues already shown by the existing tools and try to fix them at the source. As this scope is well aligned with the existing North American throughput calls, we could alternate the meetings and publish common notes.
    • Research/pilot projects - will have separate on-demand meetings with notes published to WG mailing list
    • F2F meeting once a year, co-located with GDB or other workshop/conference
  • Pilot projects: LHCb DIRAC bridge available online
 

RFC proxies

Changed:
<
<
  • NTR
>
>
 

Squid Monitoring and HTTP Proxy Discovery TFs

Changed:
<
<
  • Nothing new to report. Existing code for automating monitoring based on GOCDB/OIM registration had broken but it got fixed again.
>
>
 

Action list

Creation date Description Responsible Status Comments
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Dec 17 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress).
Changed:
<
<
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team ONGOING A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting
>
>
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team CLOSE & Open New A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting. Julia explains why implementing in the SSB a Google calendar also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1's. It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to define a reasonable timescale.
 
2015-12-17 Recommend site configurations to enforce memory limits on jobs   CREATED 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system
Deleted:
<
<
Julia explains why implementing in the SSB a Google calendar also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1's. It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to define a reasonable timescale.
 

Specific actions for experiments

Creation date Description Affected VO Affected TF Comments Deadline Completion
Line: 182 to 106
 

Specific actions for sites

Creation date Description Affected VO Affected TF Comments Deadline Completion
Deleted:
<
<
2015-11-05 ATLAS would like to ask sites to provide consistency checks of storage dumps. More information and More details ATLAS - Status not clear at the 2015-12-03 Ops Coord meeting (ATLAS absent) None CLOSED

This action is closed, as it is being managed internally to ATLAS operations.

 

AOB

Deleted:
<
<
Maria mentions that Andrea S. will not work in WLCG operations coordination from next year. She thanks Maite and Andrea for their contributions to WLCG operations coordination.
 
Changed:
<
<
-- MariaALANDESPRADILLO - 2015-12-15
>
>
-- MariaDimou - 2016-01-05
 
META TOPICMOVED by="malandes" date="1450187182" from="LCG.WLCGOpsMinutes171203" to="LCG.WLCGOpsMinutes151217"

Revision 20 2016-01-05 - TWikiGuest

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Revision 19 2015-12-18 - AndreaSciaba

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Highlights

Added:
>
>
  • A critical bug was found in dCache and it may cause significant data loss. It affects versions 2.12.[0,27], 2.13.[0,15], 2.14.[0,4] if BerkeleyDB is used as backend. All concerned sites MUST apply the fix described in https://twiki.cern.ch/twiki/pub/LCG/WLCGBaselineVersions/dcache-bug.txt as soon as possible!
  • All ATLAS sites that have not done it already should take action ASAP on the tickets they received to run storage consistency checks
  • Thanks to everybody for the good work in WLCG operations during 2015!
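As a quick local check against the version ranges quoted in the highlight above, something like the following minimal Python sketch could be used. It assumes that a notation such as 2.12.[0,27] means patch levels 0 to 27 inclusive; the function name `is_affected` and the `uses_berkeleydb` flag are hypothetical and only illustrate the stated condition. It does not replace the fix described in dcache-bug.txt.

```python
# Affected ranges as quoted in the minutes. Reading the brackets as
# inclusive patch-level ranges is an assumption, not a dCache statement.
AFFECTED = {(2, 12): (0, 27), (2, 13): (0, 15), (2, 14): (0, 4)}


def is_affected(version: str, uses_berkeleydb: bool = True) -> bool:
    """Return True if a MAJOR.MINOR.PATCH string falls in a quoted range."""
    parts = version.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    major, minor, patch = (int(p) for p in parts)
    bounds = AFFECTED.get((major, minor))
    # The minutes state the bug only matters with a BerkeleyDB backend.
    if bounds is None or not uses_berkeleydb:
        return False
    low, high = bounds
    return low <= patch <= high
```

For instance, `is_affected("2.13.4")` evaluates the version that IN2P3 reported in these minutes against the 2.13 range.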
 

Agenda

Attendance

Changed:
<
<
  • local: Maria Dimou (chair), Andrea Sciaba (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Maite Barroso, Jerome Belleman, Julia Andreeva, Alessandro Di Girolamo
>
>
  • local: Maria Dimou (chair), Andrea Sciabà (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Maite Barroso, Jerome Belleman, Julia Andreeva, Alessandro Di Girolamo
 
  • remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Vincenzo Spinoso, Hung-Te Lee
  • apologies: Andrew McNab (MJF TF)
Line: 184 to 187
 This action is closed, as it is being managed internally to ATLAS operations.

AOB

Changed:
<
<
Maria mentions that Andrea S. will not work in WLCG operations coordination from next year.
>
>
Maria mentions that Andrea S. will not work in WLCG operations coordination from next year. She thanks Maite and Andrea for their contributions to WLCG operations coordination.
  -- MariaALANDESPRADILLO - 2015-12-15

Revision 18 2015-12-17 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 45 to 45
 
    • Good news regarding the openldap crash affecting Top BDII and ARC-CE. A new set of rpms has been provided by RedHat, and tested at CERN, DESY and by ARC devs. The issue seems to be finally solved so we are now pushing RedHat to release the new version ASAP
    • A quite big issue affecting gfal2-2.10.2 (copy to/from SRM failed when using BDII resolution) was discovered only in production. A fix was immediately pushed to EPEL stable (gfal2-2.10.3).
Changed:
<
<
Maarten adds that, for the dCache bug, a recipe for fixing the issue will be prepared as soon as possible. Most likely not all sites are affected by it as it depends on the local configuration.
>
>
Maarten adds that, for the dCache bug, most likely not all sites are affected, as it depends on the local configuration. Also, dCache site admins normally are subscribed to the dCache admin forum and would thus have already been informed of all the details. Still it would be good to send a WLCG broadcast about it (done by Andrea M).
 Alessandro mentions that NDGF was severely hit by it because it happens when doing disk-to-disk copies, and many disk servers were being decommissioned. A broadcast will be sent just after the meeting.
Line: 106 to 106
 

CMS

  • Heavy Ion run
Changed:
<
<
    • Took more data than planned origionally
>
>
    • Took more data than planned originally
 
    • Pushed the DAQ, StorageManager and PromptRECO to the limits
    • High load on CERN EOS
    • A few files were lost because they were deleted from the buffer discs before being processed
Line: 171 to 171
 
2015-12-17 Recommend site configurations to enforce memory limits on jobs   CREATED 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system

Julia explains why implementing in the SSB a Google calendar also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1's.

Changed:
<
<
It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to deafine a reasonable timescale.
>
>
It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to define a reasonable timescale.
 

Specific actions for experiments

Creation date Description Affected VO Affected TF Comments Deadline Completion

Revision 17 2015-12-17 - AndreaSciaba

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 11 to 11
 

Attendance

Changed:
<
<
  • local: Maria Dimou (chair), Andrea Sciaba (minutes)
  • remote:
>
>
  • local: Maria Dimou (chair), Andrea Sciaba (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Maite Barroso, Jerome Belleman, Julia Andreeva, Alessandro Di Girolamo
  • remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Vincenzo Spinoso, Hung-Te Lee
 
  • apologies: Andrew McNab (MJF TF)

Operations News

Line: 20 to 20
 
  • Our next meeting will take place on Jan. 7th 2016.
  • Workshop for HTCondor and ARC CE users in Barcelona, Spain, from Feb 29 to March 4, 2016. Aimed at users and admins of HTCondor, HTCondor-CEs and ARC-CEs. Several talks and tutorials, meetings with the developers. Proposals for contributions can be sent to hepix-condorworkshop2016-interest (at) cern (dot) ch. More information at https://indico.cern.ch/e/Spring2016HTCondorWorkshop.
Added:
>
>
Maite announces that she'll no longer represent the Tier-0. A successor (or a rota of them) will be chosen early next year.

Concerning the recommendations for configuring memory limits: the motivation is to improve on the current situation, where sites have to find out without any guidance how to set up memory limits for jobs, which also causes experiments to observe inconsistent behaviours among different sites. It also has implications for purchasing new hardware.

After some discussion, it was agreed that the first step will be to collect from sites information on their current setup (e.g. from the Tier-1s and any willing Tier-2s) and put it in a twiki. Tickets will be used only in case of insufficient feedback. Finally, recommendations will be given, depending on the batch system used and any other relevant factor.

Alessandro mentions that the HTCondor/ARC workshop will overlap with the ATLAS software week.

 

Middleware News

  • Useful Links:
Line: 37 to 45
 
    • Good news regarding the openldap crash affecting Top BDII and ARC-CE. A new set of rpms has been provided by RedHat, and tested at CERN, DESY and by ARC devs. The issue seems to be finally solved so we are now pushing RedHat to release the new version ASAP
    • A quite big issue affecting gfal2-2.10.2 (copy to/from SRM failed when using BDII resolution) was discovered only in production. A fix was immediately pushed to EPEL stable (gfal2-2.10.3).
Added:
>
>
Maarten adds that, for the dCache bug, a recipe for fixing the issue will be prepared as soon as possible. Most likely not all sites are affected by it, as it depends on the local configuration. Alessandro mentions that NDGF was severely hit by it because it happens when doing disk-to-disk copies, and many disk servers were being decommissioned. A broadcast will be sent just after the meeting.
 
  • T0 and T1 services
    • ASGC
      • Castor Decommissioning planned for the end of the year
Line: 48 to 60
 
      • dCache upgraded to v 2.14.4

Tier 0 News

Added:
>
>
Jerome reports that now the HTCondor pool has more than 85 kHS06 of computing power (corresponding to about 10K slots).
 

DB News

Tier 1 Feedback

  • NDGF-T1 had a good update to dCache 2.14 on Monday, but then noticed a bug that had been introduced into dCache 2.12.0 in January. It causes files that have been moved around within the storage system to lose the stickiness flag, marking the files as available for garbage collection. This is OK (and the default behaviour) for tape files, but not for disk files. So far we know of 1428 lost files. ALICE and ATLAS will get a list of files at some point this week. (Ulf writing in since it's unclear if I can attend the meeting due to travel). The bug has been fixed in dCache, and affects the 2.12, 2.13 and 2.14 releases.
Added:
>
>
Ulf adds that also PIC and some German sites were hit by this bug.
 

Tier 2 Feedback

Experiments Reports

Line: 86 to 101
 
  • Storage Consistency checks: dear sites, please answer the GGUS ticket. Overview: approx. 130 GGUS tickets submitted in total, 80 closed/verified, 50 still open, 30 of which without any answer yet!!
  • Merry Xmas, happy new year: super thanks to everybody for making it such a successful year!
Added:
>
>
Alessandro explains that the problem with FTS is that there is an extremely high limit on the number of files in a prestaging request, and requests with too many files will slow down and possibly "collapse" FTS or the SE. Andrea M. adds that the latest patch (3.4.0), now in the pilot, reduces the limit to 1000, but - as the patch contains several other changes - it will not be deployed in production until next year.
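The effect of a per-request file limit can be illustrated with a small Python sketch that splits a long file list into batches of at most 1000 entries, the figure quoted for the 3.4.0 patch. This is only an illustration under stated assumptions: the names `MAX_FILES_PER_REQUEST` and `split_into_requests` are hypothetical, no FTS or DDM API is called, and the minutes do not say that any client works this way.

```python
from typing import Iterator, List, Sequence

# Limit quoted in the minutes for the FTS 3.4.0 patch; the constant name
# and the idea of splitting on the client side are assumptions.
MAX_FILES_PER_REQUEST = 1000


def split_into_requests(
    files: Sequence[str], limit: int = MAX_FILES_PER_REQUEST
) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `limit` files, preserving order."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(files), limit):
        yield list(files[start:start + limit])
```

Each yielded batch would correspond to one prestaging request, so that no single request exceeds the limit.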
 

CMS

  • Heavy Ion run
Line: 151 to 168
 
Creation date Description Responsible Status Comments
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Dec 17 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress).
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team ONGOING A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting
Added:
>
>
2015-12-17 Recommend site configurations to enforce memory limits on jobs   CREATED 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system

Julia explains why implementing in the SSB a Google calendar also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1's. It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to deafine a reasonable timescale.

 

Specific actions for experiments

Creation date Description Affected VO Affected TF Comments Deadline Completion
Line: 158 to 179
 

Specific actions for sites

Creation date Description Affected VO Affected TF Comments Deadline Completion
Changed:
<
<
2015-11-05 ATLAS would like to ask sites to provide consistency checks of storage dumps. More information and More details ATLAS - Status not clear at the 2015-12-03 Ops Coord meeting (ATLAS absent) None -
>
>
2015-11-05 ATLAS would like to ask sites to provide consistency checks of storage dumps. More information and More details ATLAS - Status not clear at the 2015-12-03 Ops Coord meeting (ATLAS absent) None CLOSED

This action is closed, as it is being managed internally to ATLAS operations.

 

AOB

Added:
>
>
Maria mentions that Andrea S. will not work in WLCG operations coordination from next year.
  -- MariaALANDESPRADILLO - 2015-12-15

Revision 16 2015-12-17 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 34 to 34
 
  • Issues:
    • As reported by Ulf, a quite serious bug is affecting dCache v 2.12/2.13/2.14. Today dCache released a patch, we will contact the sites with more details ASAP.
Changed:
<
<
    • Good news regarding the openldap crash affecting Top BDII and ARC-CE. A new set of rpms has been provided by RedHat, and tested @CERN, DESY and by ARC devs. The issue seems to be finally solved so we are now pushing RedHat to release the new version ASAP
>
>
    • Good news regarding the openldap crash affecting Top BDII and ARC-CE. A new set of rpms has been provided by RedHat, and tested at CERN, DESY and by ARC devs. The issue seems to be finally solved so we are now pushing RedHat to release the new version ASAP
 
    • A quite big issue affecting gfal2-2.10.2 (copy to/from SRM failed when using BDII resolution) was discovered only in production. A fix was immediately pushed to EPEL stable (gfal2-2.10.3).

  • T0 and T1 services

Revision 15 2015-12-17 - DaveDykstra

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 144 to 144
 

Squid Monitoring and HTTP Proxy Discovery TFs

Added:
>
>
  • Nothing new to report. Existing code for automating monitoring based on GOCDB/OIM registration had broken but it got fixed again.
 

Action list

Creation date Description Responsible Status Comments

Revision 14 2015-12-17 - AndreaManzi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 33 to 33
 
    • SL5 decommissioning. As also presented at the GDB, EGI has set 30 April 2016 as the deadline for decommissioning SL5 services. WLCG sites that are part of EGI and still run SL5 services are therefore advised to start planning the upgrade to SL6/CentOS7.

  • Issues:
Added:
>
>
    • As reported by Ulf, a quite serious bug is affecting dCache v 2.12/2.13/2.14. Today dCache released a patch, we will contact the sites with more details ASAP.
 
    • Good news regarding the openldap crash affecting Top BDII and ARC-CE. A new set of rpms has been provided by RedHat, and tested @CERN, DESY and by ARC devs. The issue seems to be finally solved so we are now pushing RedHat to release the new version ASAP
    • A quite big issue affecting gfal2-2.10.2 (copy to/from SRM failed when using BDII resolution) was discovered only in production. A fix was immediately pushed to EPEL stable (gfal2-2.10.3).

Revision 13 2015-12-17 - ChristophWissing

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 87 to 87
 

CMS

Added:
>
>
  • Heavy Ion run
    • Took more data than planned origionally
    • Pushed the DAQ, StorageManager and PromptRECO to the limits
    • High load on CERN EOS
    • A few files were lost because they were deleted from the buffer discs before being processed
    • Still a big backlog of unprocessed data
  • Big MC RE-DIGIRECO on going
    • Utilizing (large fractions of) CERN, Tier-1s, most Tier-2s
  • "End of the Year" data RE-RECO about to be released
  • Computing will continue with high load during Xmas break

  • Many thanks for the support in 2015. Have a nice Xmas break, and best wishes already now for 2016!
 

LHCb

  • Pre-staging data for Stripping 24.

Revision 12 2015-12-17 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 13 to 13
 

Attendance

  • local: Maria Dimou (chair), Andrea Sciaba (minutes)
  • remote:
Added:
>
>
  • apologies: Andrew McNab (MJF TF)
 

Operations News

  • The MB asked Operations Coordination to produce a recommendation on memory limits configuration for batch queues.

Revision 11 2015-12-17 - AndrewMcNab

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 86 to 84
 
  • Storage Consistency checks: dear sites, please answer the GGUS ticket. Overview: approx. 130 GGUS tickets submitted in total, 80 closed/verified, 50 still open, 30 of which without any answer yet!!
  • Merry Xmas, happy new year: super thanks to everybody for making it such a successful year!
Deleted:
<
<
 

CMS

Deleted:
<
<
 

LHCb

Added:
>
>
  • Pre-staging data for Stripping 24.
  • Aim to run Monte Carlo during the YETS, including on HLT farm.
 

Ongoing Task Forces and Working Groups

gLExec Deployment TF

Line: 100 to 99
 

Machine/Job Features TF

Added:
>
>
  • We have produced a 2nd draft of the HSF technical note and hope to be able to move it into the HSF approval process at the start of next year with no major changes. After that we will look at updating the reference implementations to match the note, with the aim of providing values for all the keys listed in the note.
 

HTTP Deployment TF

Information System Evolution

Revision 10 2015-12-17 - AndreaSciaba

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 18 to 18
 

Operations News

  • The MB asked Operations Coordination to produce a recommendation on memory limits configuration for batch queues.
  • Our next meeting will take place on Jan. 7th 2016.
Added:
>
>
  • Workshop for HTCondor and ARC CE users in Barcelona, Spain, from Feb 29 to March 4 2016. Aimed at users and admins of HTCondor, HTCondor-CEs and ARC-CEs. Several talks and tutorials, meetings with the developers. Proposals for contributions can be sent to hepix-condorworkshop2016-interest (at) cern (dot) ch. More information at https://indico.cern.ch/e/Spring2016HTCondorWorkshop.
 

Middleware News

Revision 92015-12-17 - AleDiGGi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 76 to 76
 

ATLAS

Added:
>
>
  • During xmas break
  • FTS: we are still suffering from critical issues. Yesterday's occurrence is most probably related to the high prestaging activity (to prestage data for reprocessing). We have asked the FTS devs to clarify the best course of action to minimize the issues over the Xmas break.
    • DDM has implemented an automatic restart in case of issues from FTS: this will just mitigate the issue.
  • Storage Consistency checks: dear sites, please answer the GGUS ticket. Overview: approx. 130 GGUS tickets submitted in total, 80 closed/verified, 50 still open, 30 of which have not received any answer yet!!
  • Merry Xmas, happy new year: super thanks to everybody for making such a successful year!
 

CMS

Revision 82015-12-17 - AndreaManzi

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 22 to 22
 

Middleware News

  • Useful Links:
Changed:
<
<
>
>
 
Deleted:
<
<
 
  • Baselines:
Added:
>
>
 
  • Issues:
Added:
>
>
    • Good news regarding the openldap crash affecting Top BDII and ARC-CE. A new set of rpms has been provided by RedHat and tested at CERN, at DESY and by the ARC devs. The issue seems to be finally solved, so we are now pushing RedHat to release the new version ASAP.
    • A fairly big issue affecting gfal2-2.10.2 (copy to/from SRM failed when using BDII resolution) was discovered only in production. A fix was immediately pushed to EPEL stable (gfal2-2.10.3).

 
  • T0 and T1 services
Added:
>
>
    • ASGC
      • Castor Decommissioning planned for the end of the year
    • CNAF
      • plan to upgrade StoRM to 1.11.0 when released and to move from storm-http to the new storm-webdav
    • IN2P3
      • dCache upgraded to v 2.13.4
    • NDGF
      • dCache upgraded to v 2.14.4
 

Tier 0 News

Revision 72015-12-16 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 43 to 43
 

ALICE

Changed:
<
<
  • The heavy ion data taking has ended successfully: thanks to all sites and experts involved!
>
>
  • The heavy ion data taking has ended successfully!
 
    • Reconstruction and reprocessing will continue for many more weeks
    • The RSS memory usage has stayed within a maximum of ~2.5 GB
    • High-memory arrangements were undone also at KISTI and KIT
      • to allow more job slots to be used again, thanks!
Changed:
<
<
  • After data taking had finished the CASTOR team rearranged the ALICE disk servers into a single pool:
>
>
  • The CASTOR team then rearranged the ALICE disk servers into a single pool:
 
    • to allow convenient usage of all available resources, thanks!
  • Grid activity has been high
Added:
>
>
  • Expectations for the end-of-year break:
    • steady MC production
    • heavy ion reconstruction
    • low analysis activity

  • Thanks to all sites and experts for another successful year!
  • Season's greetings and best wishes for 2016!
 

ATLAS

Line: 64 to 72
 

gLExec Deployment TF

Added:
>
>
  • NTR
 

Machine/Job Features TF

HTTP Deployment TF

Line: 91 to 101
 

RFC proxies

Added:
>
>
  • NTR
 

Squid Monitoring and HTTP Proxy Discovery TFs

Action list

Creation date Description Responsible Status Comments
Changed:
<
<
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Dec 17 there are 2 tickets open: GGUS:117043 for INFN-T1 (in progress) and GGUS:118371 for FNAL (in progress).
>
>
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Dec 17 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress).
 
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team ONGOING A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting

Specific actions for experiments

Revision 62015-12-16 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 16 to 16
 
  • remote:

Operations News

Added:
>
>
  • The MB asked Operations Coordination to produce a recommendation on memory limits configuration for batch queues.
  • Our next meeting will take place on Jan. 7th 2016.
 

Middleware News

Revision 52015-12-16 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 41 to 41
 

ALICE

Added:
>
>
  • The heavy ion data taking has ended successfully: thanks to all sites and experts involved!
    • Reconstruction and reprocessing will continue for many more weeks
    • The RSS memory usage has stayed within a maximum of ~2.5 GB
    • High-memory arrangements were undone also at KISTI and KIT
      • to allow more job slots to be used again, thanks!
  • After data taking had finished the CASTOR team rearranged the ALICE disk servers into a single pool:
    • to allow convenient usage of all available resources, thanks!
  • Grid activity has been high
 

ATLAS

Line: 86 to 94
 

Action list

Creation date Description Responsible Status Comments
Changed:
<
<
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Dec 17 there are 2 tickets open: GGUS:117043 for INFN-T1 (in progress) and GGUS:118371 for FNAL (assigned).
>
>
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Dec 17 there are 2 tickets open: GGUS:117043 for INFN-T1 (in progress) and GGUS:118371 for FNAL (in progress).
 
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team ONGOING A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting

Specific actions for experiments

Revision 42015-12-16 - MariaDimou

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 12 to 12
 

Attendance

Changed:
<
<
  • local: Maria Dimou (chair), Maria Alandes (minutes).
>
>
  • local: Maria Dimou (chair), Andrea Sciaba (minutes)
 
  • remote:

Operations News

Line: 60 to 60
 

Information System Evolution

Changed:
<
<

>
>

  • A proposal for a new WLCG IS based on AGIS was presented at the last GDB.
    • Ongoing discussions with experiments to understand their interest in this new IS.
    • The proposal will be presented at the MB next year to see whether it gets approved.
  • In the meantime, the following activities are ongoing within the TF:
    • Ongoing discussion to agree on a better definition of the GLUE 2 attributes defining HS06 (GLUE2BenchmarkValue) and Logical CPUs (GLUE2ExecutionEnvironmentLogicalCPUs): feedback from sys admins is being collected for two possible definitions.
    • A proposal to validate information at its source was presented at the last UMD meeting, so that we can avoid publishing information that is known to be wrong. A technical solution will have to be worked out together with MW developers.
  • Preparing the IS session at the WLCG workshop in February together with Alessandra Forti who will be the chair and who is gathering feedback on what to discuss.
  • Next IS TF meeting scheduled on Friday 8th January (preliminary agenda).
 

IPv6 Validation and Deployment TF

Changed:
<
<

>
>

 

Middleware Readiness WG

Changed:
<
<

>
>

 

Multicore Deployment

Revision 32015-12-16 - MaartenLitmaath

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 86 to 86
 

Action list

Creation date Description Responsible Status Comments
Changed:
<
<
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 15 GGUS tickets opened for SRM and Myproxy certificates not correct, 6 already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). 5-6 tickets still open at the 2015-12-03 meeting. All for sites which have no technical issues to proceed.
>
>
2015-06-04 Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing Andrea Manzi ONGOING GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets opened for SRM and Myproxy certificates not correct, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Dec 17 there are 2 tickets open: GGUS:117043 for INFN-T1 (in progress) and GGUS:118371 for FNAL (assigned).
 
2015-10-01 Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites SCOD team ONGOING A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting

Specific actions for experiments

Revision 22015-12-16 - UlfBobsonSeverinTigerstedt

Line: 1 to 1
 
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Line: 33 to 33
 

DB News

Tier 1 Feedback

Added:
>
>
  • NDGF-T1 had a good update to dCache 2.14 on Monday, but then noticed a bug that had been introduced in dCache 2.12.0 in January. It causes files that have been moved around within the storage system to lose their stickiness flag, which marks them as available for garbage collection. This is OK (and the default behaviour) for tape files, but not for disk files. So far we know of 1428 lost files. ALICE and ATLAS will get a list of files at some point this week. (Ulf writing in since it's unclear if I can attend the meeting due to travel.) The bug has been fixed in dCache; it affects the 2.12, 2.13 and 2.14 releases.
 

Tier 2 Feedback

Revision 12015-12-15 - MariaALANDESPRADILLO

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="WLCGOpsCoordination"

WLCG Operations Coordination Minutes, December 17th 2015

Highlights

Agenda

Attendance

  • local: Maria Dimou (chair), Maria Alandes (minutes).
  • remote:

Operations News

Middleware News

  • Baselines:
  • Issues:
  • T0 and T1 services

Tier 0 News

DB News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

ATLAS

CMS

LHCb

Ongoing Task Forces and Working Groups

gLExec Deployment TF

Machine/Job Features TF

HTTP Deployment TF

Information System Evolution


IPv6 Validation and Deployment TF


Middleware Readiness WG


Multicore Deployment

Network and Transfer Metrics WG


RFC proxies

Squid Monitoring and HTTP Proxy Discovery TFs

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 15 GGUS tickets opened for SRM and Myproxy certificates not correct, 6 already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). 5-6 tickets still open at the 2015-12-03 meeting. All for sites which have no technical issues to proceed. |
| 2015-10-01 | Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | ONGOING | A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-11-05 | ATLAS would like to ask sites to provide consistency checks of storage dumps. More information and More details | ATLAS | - | Status not clear at the 2015-12-03 Ops Coord meeting (ATLAS absent) | None | - |

AOB

-- MariaALANDESPRADILLO - 2015-12-15

META TOPICMOVED by="malandes" date="1450187182" from="LCG.WLCGOpsMinutes171203" to="LCG.WLCGOpsMinutes151217"
 