WLCG MW Readiness WG 11th meeting Minutes - June 17th 2015
WG twiki
Agenda
Summary
- A DPM 1.8.9 with DPM-DSI 1.9.5-3 deletion test with GridFTP in Rucio is being set up in Edinburgh for ATLAS.
- Triumf and NDGF, testing multiple dCache versions, discovered in production an issue related to a DB table memory leak in versions <= 2.10.28 and 2.12.8. The fix was released and tested by the sites. An srm bringonline issue was also found in production; the ATLAS test workflow is now being extended by the experiment expert to test this functionality at Triumf as well.
- Fine-tuning configuration at CNAF for StoRM testing for ATLAS.
- DPM 1.8.9 with DPM-DSI 1.9.5-3 tests at GRIF for CMS showed a checksumming issue via PhEDEx tests. A fix by the DPM dev. team is in the process of being released.
- New pakiti-client version 3.0.1 is imminent in EPEL Stable. The updated documentation is available to all Volunteer Sites, together with a new configuration file that is needed because of the deployment of new PKG DB servers. This new pakiti-client version makes it possible to specify a tag (--tag option).
- MW Readiness nodes should start publishing their packages with the tag MWR. Andrea Manzi will contact the sites for this upgrade.
- Check the MW Readiness App https://wlcg-mw-readiness.cern.ch/, which now offers the management of Baseline MW versions.
- EL7 support and the move to Java 8 are now urgent for ARGUS. The CERN testbed will be available real soon now for testing under heavy load and other scenarios.
- PIC made progress with the dCache v.2.12.11 testing. PIC asks other sites to participate in test injections for the loadtest.
- The next MW Readiness WG vidyo meeting will take place on Wednesday September 16th at 4pm CEST.
Attendance
- Local: Alberto Aimar (CERN IT/SDC mgnt), Maria Dimou (chair & notes), Ben Jones (T0), Maarten Litmaath (ALICE & notes), Andrea Manzi (MW Officer), Andrea Sciaba (CMS), Vincent Brillault (security).
- Remote: C. Acosta (PIC), Ricardo Cruz (PIC), Raul Lopes (Brunel Univ.), Jeremy Coles (GridPP), Pepe Flix (PIC), Antonio Maria Perez-Calero Yzquierdo (PIC), Samuel Cadellin Skipsey (Glasgow), Vincenzo Spinoso (EGI Ops Officer).
- Apologies: David Cameron (ATLAS), Lionel Cons (MW Readiness software tools), Alessandra Doria (Napoli), Sven Gabriel (EGI Security Officer).
Minutes of previous meeting
The minutes of the last (10th) meeting HERE were approved.
MW Officer report
Andrea M.'s slides contained all recent information on our Software tools and led to this discussion:
- bringonline testing:
- other sites may not be able to set up a separate tape library for such tests
- besides ATLAS the functionality is also relevant for LHCb and (eventually) CMS
- PIC setup for CMS:
- Xrootd monitoring plugins are also being tested
- their monitoring info needs to be reported for a different site name
- to be defined in the Dashboard DB, as already done for a few similar cases
Report from the ARGUS meeting
- Argus meeting held on June 5
- also summarized in the GDB introduction of June 10
- main points for MW Readiness:
- EL7 support being worked on
- first builds expected in a few weeks
- basic testing should follow
- stress testing to some extent would be desirable before the release
- Java 8 support would come in the autumn
- when extra effort from DataCloud has become available
- some dependencies may need to be updated
- some code changes may be needed
- newer versions of such external products may bring fixes for issues that have been hampering us
- the recurrent issue at CERN is finally getting tackled!
- Andrea C now has a CERN account for easier access to service hosts in bad states
- since last Thu we happen to have one bad host taken out for investigation
- its argus-pepd developed a high load for no apparent reason
- various traces and logs have been sent for inspection
- at the time of writing, the cause was not yet determined
- a separate test instance of the service is mostly ready
- an NFS share for the gridmapdir needs to be obtained to mimic the production setup
- the initial testing will be done from lxplus
- it may already be largely sufficient for hammering the test setup
Discussion:
- the Argus tests from lxplus will be against a standalone service
- the gridmapdir may be kept on its "local" disk (the host is a VM)
- we will try to get the test service into a bad state that subsequently can be debugged
Sites' feedback
- Napoli
- CREAM CE tests in Napoli are running smoothly.
- CNAF
- Set-up for ATLAS StoRM tests done. Storage being configured. Details in JIRA:MWREADY-61.
- Triumf
- Good progress with ATLAS dCache tests.
- PIC status report
- SRM with pre-production dcache storage:
- SE srm-pps.pic.es, 10 TB of disk available, currently dcache 2.12.11
- SRM, GridFTP, NFS4.1, gsidcap, xrootd protocols enabled
- xrootd 3.3.6
- CMS specific part:
- voboxcms-pps.pic.es new vobox installed with Dev PhEDEx agents configured to point to srm-pps.pic.es
- Loadtests PIC to/from GRIF_LLR established and running. Currently failing for reasons not associated with PIC or dCache
- Loadtest to CERN next, injection from PIC set up, waiting for approval from CERN admin (request). Please also create an injection CERN->PIC
- HC tests: test dataset replicated to the validation storage, TFC modified accordingly, HC test jobs submitted and running at PIC
- Results
- tests of initial interaction with storage as a CMS user working fine (fts-transfer-submit, lcg-ls, xrootd, etc)
- xrootd monitoring plugins not working with dCache 2.12.11. The issue has been reported and is being worked on. Action 20150617-01
- PhEDEx and HC already set up and working fine
- Next steps:
- upgrade to validate dCache 2.13
- xrootd 4 tests?
- WLCG monitoring for xrootd activity on srm-pps. What is the procedure to implement this? (Creating a new site pic_mwr, or similar, was discussed.)
- CMS SAM tests: Put, Get, TFC
- A final note: we are going to be validating dCache releases; however, the upgrade procedure may differ from upgrades between golden releases in production storages. We are in principle validating and documenting any problem found with each release (2.12, moving next to 2.13), not jumps from one golden release to another (next jump 2.10 to 2.13).
Discussion:
- currently there is no way in SAM to test a non-production SE without impact on A/R results
- switching from Xrootd 3 to 4 probably would be good:
- sites should anyway move to Xrootd 4 this year (e.g. for IPv6)
- new dCache versions would no longer be tested against Xrootd 3
- to be discussed further in CMS
Actions
Action items done from past meetings can be found HERE.
- 20150617-02: Andrea S. to discuss with CMS mgnt whether to stay with dCache testing with xrootd3 or move to xrootd4. JIRA:MWREADY-66 New
- 20150617-01: Antonio Y. (PIC) to follow progress on the xrootd monitoring plugin issue found via the dCache testing at PIC for CMS. JIRA:MWREADY-65 New
- 20150506-04: CNAF to participate in the StoRM Readiness verification. Details in JIRA:MWREADY-61 Done
- 20150506-03: NDGF, Triumf, CNAF, PIC to install the pakiti client. Updated instructions here.
- 20150506-02: Joel and Stefan to state if and how they wish to participate in the MW Readiness verification effort. Status?
- 20150506-01: Maarten to check with ALICE which version use which xrootd version and if they wish to participate in the MW Readiness verification effort. Status?
- 20150318-05: Pepe to proceed with the MW Readiness set-up at PIC Done
- 20150318-02: Ben to set-up the ARGUS testbed at the T0 Re-opened
- 20150318-01: Manuel to communicate to EOS and FTS managers the reminder of the Pakiti client installation instructions here. Status?
- 20141119-03: Andrea M. to contact the GRIF site to proceed with WN testing via the CMS workflow POSTPONED
- 20140702-06: Andrea M. & Lionel to discuss the visualization of testing results. On-going
AOB & Next meeting
- End of July (Wed 29th?) or early September (Wed 9th?) were suggested. The WG concluded on Wednesday September 16th at 4pm CEST.
--
MariaDimou - 2015-06-15