WLCG MW Readiness WG 16th meeting Minutes - March 16th 2016

WG twiki

Agenda

Summary

  • Tier1s are invited to tell the e-group wlcg-ops-coord-wg-middleware at cern.ch whether they agree to install the pakiti client on their production service nodes, so that the MW versions running at a site are known to its authorised site managers (DNs taken from GOCDB) and to expert operations supporters.
  • The SRM-less DPM test is on hold until the ATLAS pilot code is changed, as per JIRA:MWR-104
  • Excellent progress with gfal2 testing (various configurations) as per JIRA:MWR-101 & JIRA:MWR-117
  • Proposed date for the next meeting is Wed May 18th at 4pm CEST.

Attendance

  • local: Maria Dimou (chair & notes), Maarten Litmaath (ARGUS report), Andrea Manzi (MW Officer), Lionel Cons (MW Readiness software developer), David Cameron (ATLAS), Ben Jones (T0 & ARGUS).
  • remote: Matt Doidge (Lancaster), Frederique Chollet (French Grids), Di Qing (TRIUMF), Viktor Zhiltsov (JINR), Peter Gronbech, Ricard (PIC), C. Acosta (PIC).
  • apologies: Vincent Brillault (Security), Vincenzo Spinoso (EGI), Ron Trompert (NL_T1), Antonio Yzquierdo (PIC), Alessandra Doria (Napoli).

Minutes of previous meeting

The minutes of the last (15th) meeting HERE are accepted.

On the deployment of Pakiti in WLCG production

The pakiti client is now running on the WLCG MW Readiness volunteer sites; the data are collected at CERN and displayed to site admins via the WLCG MW Readiness app.

For the purposes of the WG, this helps to understand exactly which version of the software is installed at the sites when the verification tests are run.

The idea is to extend the deployment of pakiti to the WLCG production hosts (except for WNs, where most of the software is taken from CVMFS). This deployment will then allow:

  • Site managers to easily check the versions of the software installed on their hosts
  • WLCG ops to monitor which software versions are deployed in the infrastructure (now that the role of the Information System is under discussion, we may otherwise not have a view of the software running on the infrastructure)

The data collected are accessible only to WLCG Ops people and the site admins defined in the GOCDB for each site.

We would like to hear from the sites what their opinion is and whether they are willing to deploy the software in production, so that the matter can then be discussed at the WLCG MB level.
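As an illustration, the kind of report a pakiti-style client assembles can be sketched in a few lines of Python. This is a hypothetical sketch, not the actual pakiti client: the real client gathers the installed rpm list (e.g. via `rpm -qa` with a query format) and posts it to the Pakiti server, whereas here the package list is mocked and nothing is sent anywhere.

```python
# Hypothetical sketch of a pakiti-style package report. The real client
# collects the installed rpm inventory (e.g. via
#   rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE} %{ARCH}\n'
# ) and posts it to the Pakiti server; here the package list is mocked.
import socket

def build_report(host, packages):
    """Format a host's package inventory as 'name version arch' lines."""
    lines = [f"host: {host}"]
    lines += [f"{name} {version} {arch}" for name, version, arch in packages]
    return "\n".join(lines)

mock_packages = [                      # mocked data, not a real inventory
    ("dpm", "1.8.10-1.el6", "x86_64"),
    ("gfal2", "2.11.0-1.el6", "x86_64"),
]
report = build_report(socket.getfqdn(), mock_packages)
print(report)
```

The value for the WG lies in the server side: once such inventories are collected centrally, the authorised site managers and WLCG ops can compare the versions actually installed against the versions under verification.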

Discussion during the meeting:

  • Matt asked whether putting the pakiti client in CVMFS would be a good idea. Maarten's answer was 'No' because it would be too much overhead, given the easy pakiti installation, to mount CVMFS from all nodes. The service nodes (which don't mount CVMFS) are interesting, the WNs are not. EGI is interested in pakiti running on the WNs for security reasons. WLCG is interested in the service nodes for functionality reasons.
  • Maarten said the pakiti client is a good tool for someone debugging an operational problem at a site: it allows checking the versions of MW running at the site and seeing whether some are not at the right version level.
  • Maarten and Andrea also said the BDII is not offering this level of detail for the packages. Pakiti reports at the level of rpms.
  • Ben said that the Tier0 is willing to use existing tools of its own, avoiding yet another thing to install/observe/maintain, but is happy to report, via the existing API, to the MW Readiness App. Lionel will document this API, Action 20160316-01 below.
  • David said the sites will find it difficult to see the advantages justifying this extra work, which is why people advised to better advertise the advantages. Frederique, together with Catherine Biscarat, will propagate the info to the French grid sites.
  • Viktor said they had tried pakiti2 at JINR and decided it wasn't useful enough to continue using it. Our tool is based on pakiti3.

Verification status report

ATLAS workflow Readiness Verification Status:

MW Product | Version | Volunteer Site(s) | Comments | Verification status
DPM (SRM-less) | 1.8.10 | LAPP Annecy | JIRA:MWR-104, waiting for a new version of the ATLAS pilot to support non-SRM stage-in | Ongoing
FTS | 3.4.2 | CERN | JIRA:MWR-114 | Completed
StoRM | 1.11.10 | QMUL, CNAF | JIRA:MWR-110, JIRA:MWR-109. The verification was done at CNAF; no answer from QMUL | Completed
dCache | 2.10.56 | TRIUMF | JIRA:MWR-118 | Completed
dCache | 2.15.0 | NDGF | JIRA:MWR-120 | Ongoing

Frederique said that Silvain Brunier was making progress on JIRA:MWR-104 until early March, but has now realised that the ATLAS pilots assume SRM. This is hard-coded and needs to be changed. David confirmed this, so the ticket is on hold.

CMS workflow Readiness Verification Status

MW Product | Version | Volunteer Site(s) | Comments | Verification status
FTS | 3.4.2 | CERN | JIRA:MWR-114 | Completed
gfal2 | 2.11.0 | GRIF_LLR | JIRA:MWR-101. Tested on WNs running gfal2 and gfal2-utils, staging out via SRM and GridFTP; PhEDEx tests via SRM and GridFTP also successful | Completed
gfal2 | 2.11.0 | GRIF_LLR | JIRA:MWR-117 | Pending tests using xrootd
dCache | 2.14.13 | PIC | JIRA:MWR-119. Transfers and jobs look OK; some connectivity issues still noticed and an issue writing to CERN EOS (issue on the EOS side) | Ongoing

Andrea thanked GRIF_LLR for the substantial, high-quality work done on gfal2 testing.

"Near" future activities

  • Set up gfal2 verification for ATLAS. Check whether some site has a small development cluster to test the WN stage-in/stage-out, as GRIF_LLR is doing for CMS. We have checked at INFN-NAPOLI that the current production version (available via the ATLAS CVMFS) is fine, but we don't have a way to test a new version before it is pushed to the ATLAS CVMFS.
  • ARGUS (see Maarten's report)
  • EOS PPS is still unstable; to be checked with CERN
  • New FTS (v3.4.3) to be deployed in the pilot
  • HTCondor 3.5.x available, to be tested in the ATLAS pilot factory
  • DPM tests on CentOS7: waiting for a new version of DPM fixing some issues
  • Edinburgh has obtained new effort to work on DPM testing for ATLAS

WLCG MW Readiness Software Status

App development work:

Description | Ticket | Status
Tag to show deployment status | JIRA:MWR-113 | Done
Store update times for products in the table and the package DB | JIRA:MWR-115 | Done
Add the DNs of admins allowed to view pakiti data | JIRA:MWR-116 | Done

Sites' feedback

  • PIC report:
    • MWR storage running dCache 2.14.13
    • Phedex Dev transfers:
      • from PIC: PIC to CERN OK; PIC to GRIF-LLR shows intermittent errors, "Failed to connect 134.158.132.151:24570: Connection refused"
      • to PIC: GRIF-LLR to PIC is OK; CERN to PIC is not, with problems when reading test files from EOS, "No such file or directory 500-A system call failed: No such file or directory 500 End."
    • CMS HC jobs reading from the dataset on the PPS storage continue running OK.

Report from the ARGUS meeting

The JIRA ticket where progress is recorded is JIRA:MWR-30.

  • main items for MW Readiness:
    • one month ago CMS updated an "old" ticket GGUS:118701 about SAM CE test errors at CERN
      • this time all CEs failed the gLExec test, hence the site was unavailable according to the critical profile for CMS
      • this led to the discovery that the SAM proxy mappings were "oscillating"
        due to a few issues in the Argus code vs. our peculiar group mappings
      • currently only the CMS SAM tests are affected, due to the non-trivial way gLExec is tested
      • we added our findings to our existing ticket GGUS:117125 for Argus
    • this led to a reimplementation of the pool account mapping code
    • next we were confronted with how to deploy the new code on a few QA nodes
      to test it out in production for a few days, while the rest stay on the old version
      • the old and the new code would have conflicting mappings for a few affected cases
      • we can temporarily hardcode those mappings, based just on their DN
      • this worked fine, except for the DNs used by CMS GlideinWMS factories!
      • their DNs contain a '/' character within the CN field
        • example: /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch
      • part of the Argus code assumes that every unescaped '/' is a field separator
        • the internal '/' characters were escaped with backslashes and then it worked
          • we will open a ticket about this
    • Ben has prepared a CentOS 7 QA node to try out the new version
      • the config is done with Puppet, the rpm list is manual for now
      • this node was added to the production alias for ~1 hour this afternoon!
      • to be repeated tomorrow: if it keeps looking good, we move to the next phase
      • rpms to be added to the preview repo
    • Michel has converted the old Twiki docs into modern Read-the-Docs format
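The DN-splitting pitfall described above (an unescaped '/' inside the CN field being taken as a field separator) can be sketched in a few lines of Python. This is only an illustration of the parsing behaviour, not the actual Argus code: a naive split breaks the CMS factory DN into five fields, while splitting only on unescaped '/' keeps the escaped CN field whole.

```python
import re

def split_dn(dn):
    """Split an OpenSSL-style DN on '/' separators not escaped by a backslash."""
    # (?<!\\)/ matches a '/' that is not preceded by a backslash
    return [f for f in re.split(r"(?<!\\)/", dn) if f]

# Naive split: the '/' inside the CN is wrongly treated as a field separator.
naive = [f for f in
         "/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch".split("/")
         if f]

# With the internal '/' escaped, the DN splits into the intended four fields.
escaped = split_dn(r"/DC=ch/DC=cern/OU=computers/CN=cmspilot02\/vocms080.cern.ch")

print(len(naive))    # 5 fields: the CN got broken in two
print(escaped[-1])   # the CN field kept whole (still containing the escaped '/')
```

This mirrors the workaround reported at the meeting: escaping the internal '/' characters with backslashes before handing the DN to the field-splitting code made the mappings work.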

Actions

Action items Done from past meetings can be found HERE.

  • 20160316-01: Lionel to document the pakiti client API for packages collected with other tools.
  • 20160127-03: Ben to install the ARGUS EL7 rpms available on github. Done
  • 20160127-02: Andrea S. and David C. to obtain their experiments' plans concerning EL7 and/or CentOS7. Involve Maria Alandes in the CMS Workflow for MW Readiness. Pending
  • 20160127-01: Andrea M., Andrea S., David C., Paul M. to see how the nightly data scratch can be handled so that the Prometheus dCache tests can start, JIRA:MWREADY:36. Pending
  • 20150318-02: Ben to set up the ARGUS testbed at the T0. The testbed is there; the load testing is in the list, but with the recent bug fixes we may not need this testing anymore. Closed
  • 20141119-03: Andrea M. and Andrea Sartirana to discuss how the GRIF-LLR Volunteer site can proceed with gfal2 and WN testing via the CMS workflow. Done

Next meeting

  • Proposed date is Wed May 18th at 4pm CEST. No experiment event was found for that date, and we are clear of the Easter, Ascension and Whit Monday holidays.

AOB

-- MariaDimou - 2016-03-09

Topic attachments
  • HC_test_T1_ES_PIC.png (32.0 K, 2015-06-16, AntonioPerezCalero): HC jobs reading from the dCache validation storage at PIC