WLCG MW Readiness WG 17th meeting Minutes - May 18th 2016

WG twiki

Agenda

Summary

  • The WG concluded that the pakiti client installation will remain on the Volunteer Sites' testbeds to reveal the rpms of products under verification. Further expansion to production installations will not happen. The need for a tool that securely reveals what is installed at a site will be followed up by the Information Systems' Evolution TF. Detailed replies from the sites are recorded here.
  • The pakiti client documentation has been extended to include the API definition for use by other reporting tools. Further details in these minutes and on the documentation twiki MiddlewarePackageReporter. Comments by the Tier0 led to a new action; to be followed up.
  • Volunteer sites Lancaster and Edinburgh proposed to test UI/WN under CentOS7.
  • EGI to push Product Teams to make more MW packages available on CentOS7, as the experiments are gradually getting there. See the ATLAS & CMS positions here. ALICE & LHCb are kindly asked to state their intentions.
  • Proposed date for the next meeting is Wednesday July 6th at 4pm CEST. In case of problems, please email the e-group a.s.a.p.

Attendance

  • local: Maria Dimou (chair & notes), Maarten Litmaath (ARGUS report), Andrea Manzi (MW Officer), Gavin McCance (T0), Julia Andreeva (WLCG Ops), Vincent Brillault (WLCG Security).
  • remote: Matt Doidge (Lancaster), Frederique Chollet (LAPP), Di Qing (Triumf), Andrew Washbrook (Edinburgh), Vincenzo Spinoso (EGI), Linda Cornwall (RAL Security), Peter Gronbech (Oxford), Sang-Un Ahn (KISTI).
  • apologies: Alessandra Doria (Napoli)

Minutes of previous meeting

The minutes of the last (16th) meeting HERE are approved.

Replies from sites on the deployment of Pakiti in WLCG production

  • CERN as Tier0:
    • Ben Jones at the last meeting - The Tier0 is willing to use existing tools of its own, avoiding yet another thing to install/observe/maintain, but is happy to report, via the existing API, to the MW Readiness App. Lionel agreed to document this API, Action 20160316-01, now done and documented later in this twiki and on the MiddlewarePackageReporter page.
    • McCance/Meinhard: we regard this as unnecessarily intrusive, leaking sensitive information out of the tier-0 site, and not a good use of the site's manpower. While we may report for individual Grid services via the API directly, it would be more efficient to do this via a Twiki. It's also not clear why the version cannot be entered directly in GOCDB. CERN-PROD does not expect to install or run this tool.
    • Feedback on the "API" as documented by 20160316-01: This is not a usable API in any reasonable form, and not what we meant; the description contains no information other than a link to the generic description of the internal pakiti message format. It contains no information about what is actually required other than a list of RPMs (which ones?), with the requirement that you send sensitive information such as the kernel version, leaving you little option but to, as suggested, just run the tool ("If someone wants to write their own submitter tool, I would strongly recommend to use the pakiti-client script to send the report."). Having to maintain a whitelist/blacklist to avoid leaking operationally sensitive information while actually providing the required information is too expensive with this tool and is not a reasonable use of the site's manpower.
    • What we mean by API would be something like a service that accepts a POST of JSON triples of {gocDBhostname, gocdbServicetype, serviceVersion}. Then the question occurs: given the ISTF strategy is now to use the GOCDB to store service information (and GOCDB is developing an API for this), would it not be a better strategy to just add a version tag to GOCDB rather than invent and run a whole new service?
  • JINR - Victor Zhiltsov at the last meeting - They had tried pakiti2 at JINR and decided it was not useful enough to continue using. (The WG's tool is based on pakiti3.)
  • RAL - Gareth Smith in email - We already run Pakiti internally as a way of tracing the patch status of our systems. We are concerned about the security implications of this request. This information reveals the patch status - and therefore possible security weaknesses - of our systems. We note the comment already received from Vincent Brillault highlighting the security implications. Given that we already run Pakiti locally, the first benefit of this request (that site managers can easily check the version of software installed at their hosts) is not valid in our case. We are not convinced that the additional risks created by exposing this sensitive information, and the risks from additional configuration and maintenance complexity, are outweighed by the proposed benefits.
  • GRIF - Michel Jouvin in email - In the name of GRIF, I'd like to say that we share RAL's view for exactly the same reasons. We have been running Pakiti on all our machines (not only the grid ones) for years and we don't think it is a good move to expose everything centrally, as it also tends to "deresponsibilize" the sites for tracking the status of their systems. At least, it should be considered acceptable for WLCG that sites opt out of this central reporting.
  • NL_T1 - Ron Trompert in email - The WLCG Tier-1 representatives were recently invited to indicate their stance towards reporting detailed middleware software stack and version information for their production service nodes to a central location for inspection by external reviewers and supporters. We hereby respond to the effect that we will NOT be reporting such information to third parties. The Dutch National e-Infrastructure is a generic service provider for multiple customers, including WLCG, which is a large but not the sole client. We offer specified public interfaces to all our user communities, but how such services are implemented internally is a local operational concern and immaterial to our users. What software, and what version of such software, is used to best provide the requisite capabilities for the largest number of our user groups is also a local choice, and exposing such choices to just one of our customers, specifically WLCG, is known to lead to extensive and costly discussions. The efforts expended in such discussions far outweigh for us any perceived benefits of sharing such internals. In this respect, this response to the monitoring request is in line with our long-term policy not to designate single preferential customers amongst our user base - following the same policy we applied when replying to the WLCG effort survey in February last year.
  • WLCG Security - Vincent Brillault in email - I have a few comments concerning the deployment of pakiti:
    • EGI's interest in running pakiti is not limited to the WNs; we are interested in running it in any part of the infrastructure, in particular service nodes. The current situation, where pakiti only runs on WNs, is strictly due to the fact that we are using standard jobs and not asking sites to configure pakiti themselves (there is currently no policy/requirement for this).
    • If WLCG operations were to run a campaign to ask sites to install pakiti on their systems, the EGI CSIRT would probably either kindly ask such sites, if they are part of EGI, to configure their client to also report to the EGI CSIRT pakiti server, or discuss with WLCG Operations if they could forward the information they receive from EGI production sites.
    • While I believe that the security of the information has been properly taken into account while designing the WLCG MW readiness app, I would like to make it clear again that package information is sensitive, as it can expose potential targets to malicious actors. Was the possibility of offering sites a whitelist/blacklist mechanism for reported packages considered? Would sites be interested in this? I understand that the goal of this session was to understand the interest and willingness of sites regarding this potential deployment. However, I believe that before WLCG operations starts discussing this matter at the WLCG MB, it would be beneficial to first discuss with the EGI CSIRT and the WLCG Security Officer.
  • Triumf - Di Qing in email - We run Pakiti locally already, but the version may be different from the one asked by WLCG. We can agree to run Pakiti to publish information with the following requirements:
    1. the published info will not be visible publicly
    2. it doesn't grow beyond the grid service nodes defined in GOCDB
    3. we control cron timing and the user ID to run cron
    4. the Pakiti installation doesn't conflict with our local Pakiti setup
    5. the RPM is signed and in a regular repository
  • Manchester - Alessandra Forti in email - As a site I always thought a centralized pakiti was intrusive. From the security point of view I'm not sure how much meaning it can have when we heavily use opportunistic resources, where it won't be possible to install it. And while it is somewhat useful as monitoring if installed locally, it would be yet another service, not really in line with the effort to reduce the number of services sites have to provide.
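The alternative the Tier0 describes above - a service accepting a POST of JSON triples of {gocDBhostname, gocdbServicetype, serviceVersion} - could look roughly as follows. This is only a sketch: the endpoint URL, hostname and version values are invented for illustration, since no such service exists yet.

```shell
# Hypothetical payload for the Tier0's suggested API: a list of
# {gocDBhostname, gocdbServicetype, serviceVersion} triples.
cat > versions.json <<'EOF'
[
  {"gocDBhostname": "fts3.example.ch",
   "gocdbServicetype": "FTS",
   "serviceVersion": "3.4.3"}
]
EOF
# The submission itself would then be a single POST (endpoint is invented):
# curl -X POST -H 'Content-Type: application/json' \
#      --data @versions.json https://mw-readiness.example.ch/api/service-versions
# Sanity-check that the payload is valid JSON with the expected fields:
python3 -c 'import json; d = json.load(open("versions.json")); print(d[0]["serviceVersion"])'
```

Such a payload exposes only the GOCDB-registered service identity and one version string per service, which is the minimal information the verification workflow needs - in contrast to a full package list.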

Verification status report

The MWREADY JIRA dashboard shows the latest status of the open tickets. A summary of progress since our last meeting on March 16th is given in the following tables.

ATLAS workflow Readiness Verification Status:

| *MW Product* | *Version* | *Volunteer Site(s)* | *Comments* | *Verification status* |
| DPM (srm-less) | 1.8.10 | LAPP Annecy | JIRA:MWR-104, last update in the ticket dates from our last meeting | Waiting for ATLAS Pilot code changes |
| FTS | 3.4.3 | CERN | JIRA:MWR-122, also verified for CMS | Completed |
| FTS | 3.4.4 | CERN | JIRA:MWR-129, also verified for CMS | Ongoing |
| dCache | 2.15.0 | NDGF | JIRA:MWR-120 | Completed |
| DPM | 1.8.10 | Glasgow | JIRA:MWR-82, verification on CentOS7 | Missing ATLAS setup at site; closed as a new DPM version was released in the meantime |
| DPM | 1.8.11 | Glasgow | JIRA:MWR-126, verification on CentOS7 | Missing ATLAS setup at site |
| DPM | 1.8.11 | Edinburgh=UK-SCOTGRID-ECDF | JIRA:MWR-125 | Ongoing |
| StoRM | 1.11.11 | CNAF | JIRA:MWR-127 | Ongoing |
| UI bundle | centos7-ui-0.1 | Lancaster | JIRA:MWR-128, verification on CentOS7, also for CMS. Version number is just a place-holder by Matt, as the port doesn't exist at this point | To contact M Doidge to see if a first UI bundle for CentOS7 can be made available for testing |
| ARGUS | 1.7.0 | CERN | JIRA:MWR-30, installed on CentOS7 on a qa node. Test also valid for CMS | Almost completed; CERN is planning to use this version in production soon |

CMS workflow Readiness Verification Status

| *MW Product* | *Version* | *Volunteer Site(s)* | *Comments* | *Verification status* |
| dCache | 2.15.5 | PIC | JIRA:MWR-123 | Ongoing |
| EOS | 4.0.8-citrine | CERN | JIRA:MWR-106 | Completed |
| EOS | 4.0.12-citrine | CERN | JIRA:MWR-121 | Ongoing; some misconfiguration to check with the EOS admins |
| DPM | 1.8.11 | GRIF_LLR | JIRA:MWR-124 | Ongoing; after an upgrade to the latest globus-gridftp-server from epel-testing we observed systematic crashes. To contact Globus ASAP |

Open issues with packages

  • Set up gfal2 verification for ATLAS. Check if some site has a small development cluster to test the WN stage-in/stage-out as GRIF_LLR is doing for CMS. We have checked at INFN-NAPOLI that the current production version (available via ATLAS CVMFS) is fine, but we don't have a way to test a new version before it is pushed to ATLAS CVMFS.
  • ARGUS (see Maarten's report below)
  • EOS PPS is still unstable, to check with CERN
  • New FTS (v 3.4.3) to be deployed in pilot. DONE
  • HTCondor 8.5.x available, to be tested in the ATLAS pilot factory. TO DO
  • DPM on CentOS7 tests: waiting for a new version of DPM fixing some issues. DONE
  • Edinburgh has new effort to work on DPM testing for ATLAS. DONE

WLCG MW Readiness Software Status

App development work:

  • Version 0.4.0RC is in dev (https://mw-readiness-dev.cern.ch/); to be deployed in production soon.

Sites' feedback

Report from the ARGUS meeting

The JIRA ticket where progress is recorded is JIRA:MWR-30.

  • Argus meeting held Apr 15

  • main items for MW Readiness:
    • the Argus 1.7 beta rpms were used for a few weeks on part of the Argus cluster at CERN
    • we were waiting for the official rpms, which would still bring a few other improvements
    • then CMS opened a new ticket GGUS:121025 on April 25
      • a few normal users had unstable pool account mappings
    • the new code turned out to have a bug affecting the mapping of the simplest proxies
      • containing just the VO membership, no other group or role
    • to avoid further instabilities there, the Argus 1.7 beta nodes were taken out of production
      • the existing beta rpms will not be released in the UMD
    • the developers think they have a fix, to be tested against the CERN configuration
    • if the fix looks good, the idea is to upgrade the whole cluster in one go:
      1. prepare sufficient nodes with the 1.7 (beta) rpms
      2. let those nodes replace the current production nodes
      3. clean up the temporary gridmapdir hacks
      4. clean up historical hacks we found there as well
    • we then work further with the Argus PT and EGI to come to an official UMD release
      • on EL6 the existing configuration mechanism (YAIM) will keep working as before
      • on CentOS/EL7 the initial release may not have full configuration support yet
        • a self-contained "mini" Puppet configuration is foreseen
      • documentation updates will only go into the new site

Pakiti documentation enhancement

Lionel documented the format used by the pakiti-client script in the script's man page (in GitHub). The TWiki (MiddlewarePackageReporter) has been updated accordingly. Lionel's recommendations:

  • If someone wants to write their own submitter tool, I would strongly recommend to use the pakiti-client script to send the report. This can be done with the --input option. The idea would be to:
    • prepare one or more reports to be sent and save them as files
    • for each file, run pakiti-client with the --input option to send the prepared report(s)
  • If someone wants to filter the list of packages in the report, I would strongly recommend to use the pakiti-client script anyway. The idea would be to:
    • run pakiti-client with the --output option to save the report to file instead of sending it
    • edit the saved file to remove the packages that should not be exposed (think “grep -v”)
    • run pakiti-client with the --input option to send the edited report
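As a concrete illustration of the filtering recipe above - with an invented report file standing in for the output of pakiti-client, since the real report contents depend on the host:

```shell
# Invented example report; a real one would be produced by:
#   pakiti-client --output report.txt
cat > report.txt <<'EOF'
dpm-1.8.10-1.el6.x86_64
openssl-1.0.1e-42.el6.x86_64
kernel-2.6.32-573.el6.x86_64
EOF
# Edit out operationally sensitive packages (the "grep -v" step):
grep -v '^kernel-' report.txt > report.filtered.txt
# The edited report would then be submitted with:
#   pakiti-client --input report.filtered.txt
cat report.filtered.txt
```

The same pattern extends naturally to a site-maintained blacklist file (grep -v -f blacklist.txt), which is one simple way to realise the whitelist/blacklist mechanism raised in the security discussion above.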

Maria warmly thanks Lionel for his development work, presentations and participation in this WG. He is now moving to other tasks, following the CERN IT 2016 re-organisation.

Actions

Action items Done from past meetings can be found HERE.

  • 20160518-02: Collect the experiments' CentOS7 intentions also from:
    • ALICE: Maarten to check and report the experiment's intentions at the next meeting. So far ALICE runs on SL6 with binaries built on SL5; this works, but in the future it might not be the case.
    • LHCb: Joel/Stefan to provide the experiment's intentions.
  • 20160518-01: Re-visit the API definition and documentation, based on the Tier0 comments here above.
  • 20160316-01: Lionel to document the pakiti client API for packages collected with other tools. DONE See section above.
  • 20160127-02: David C. and Andrea S. to obtain their experiments' plans concerning EL7 and/or CentOS7. On-going
    • ATLAS: Information is collected in this ATLAS twiki. See in particular the statement on ATLAS migration
    • CMS: The CMS software built on SLC6 is known to be not binary compatible with an OS other than SLC6. CMS is evaluating a container based approach to allow running SLC6 (or other) binaries on WNs with CC7 or other OS versions. In addition, CMSSW is routinely built on the CC7 architecture as a possible future production architecture. Formal physics validation of CMSSW on CentOS7 hasn't started yet, but CMS is definitely doing more than just building on it.
  • 20160127-01: Andrea M., Andrea S., David C., Paul M. to see how the nightly data scratch can be handled so that the Prometheus dCache tests can start, JIRA:MWREADY:36. The last update of this ticket dates from June 2015. If there is no interest currently, we should probably close the ticket and this action. Pending

Next meeting

  • Proposed date is Wednesday July 6th at 4pm CEST. Objections to the e-group a.s.a.p. please!

AOB

-- MariaDimou - 2016-05-14

Topic revision: r103 - 2018-02-28 - MaartenLitmaath
 