CentOS7Readiness

Introduction

This page summarizes the readiness of ATLAS Distributed Computing to utilize resources on CentOS7. It contains information about both central services and clients, as well as links to the proper pages for sites installing CentOS7 WNs.

By CentOS7 we mean any of the CentOS7/SL7/EL7/CC7 distributions. centos7 is also used in the native binary identifiers. The only exception is queue names, which are kept short. For brevity, CentOS7 = C7.

Worker Nodes Readiness

Third party software requirements

ATLAS distributes most of its software via cvmfs, but the required OS libraries are contained in HEP_OSlibs, so the two requirements are cvmfs and HEP_OSlibs:

  • The CentOS7 HEP_OSlibs rpm will be included in the singularity images, so sites using singularity will no longer need to install it. However, if for whatever reason singularity cannot be used on the WNs, even temporarily, HEP_OSlibs must be installed on the WNs so that jobs can fall back to standard (non-container) running.

Check the WLCG baseline page for the required version.
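
As a quick check on a WN (a minimal sketch using plain rpm, nothing ATLAS-specific; compare the reported version against the WLCG baseline page):

# Query the installed HEP_OSlibs meta-rpm and its version
rpm -q HEP_OSlibs || echo "HEP_OSlibs is not installed on this node"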

ATLAS position on migration to CentOS7

  • Grid middleware services: can be upgraded. As long as they work and WLCG declares them ready, ATLAS is happy to use them.
  • Worker nodes: ATLAS is aiming to have sites move to CentOS7 by the 1st of June 2019.
    See the validation section for the status of software and jobs categories.

Important notes

Middleware status

  • OSG middleware is available
  • EGI middleware is available in the UMD 4 repositories.
    • Be aware that YAIM configuration has been dropped for several services in UMD4/CentOS7, in particular for the WNs.
    • Be aware that lcg_utils is deprecated. GFAL2 replaces it.
      • This will affect the movers in your panda queues. lcgcp mover will not work anymore.
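
For illustration, a GFAL2-based copy using the gfal2-util command line (the endpoints below are placeholders, not real ATLAS storage):

# gfal-copy replaces lcg-cp; source and destination URLs are examples only
gfal-copy file:///tmp/testfile srm://se.example.org:8446/dpm/example.org/home/atlas/testfile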

Using containers

  • pilot2 is currently undergoing large-scale tests and will go into production in the first half of 2019; with it we plan to start rolling out the use of containers for production jobs at CentOS7 sites.
  • Sites moving to CentOS7 are required to install singularity so that the pilots can use it when ready.
    • The baseline version of singularity (2.6.1) can be found in the standard EPEL repositories.
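
A minimal installation sketch on a C7 node (assuming the stock EPEL package name; verify the installed version against the baseline above):

yum install -y epel-release
yum install -y singularity        # EPEL package for the 2.x series
singularity --version             # should report the 2.6.x baseline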

Movers in AGIS

  • Unless you have specific reasons not to, it is recommended to use the rucio sitemover (Associated Pilot Copytools) on the new Panda Queues and to eliminate the other entries, in particular any old lcg-cp2 mover still enabled on the old SL6 queues. As noted in the middleware status section, lcg_utils is deprecated on CentOS7.

Sites already in production

  • You can find out which queues have already moved by using this script: /cvmfs/atlas.cern.ch/repo/sw/local/bin/node-description
  • CentOS7Deployment
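
A usage sketch (the assumption here is that the script can simply be executed and prints the per-queue node descriptions; check its behaviour before relying on the output):

/cvmfs/atlas.cern.ch/repo/sw/local/bin/node-description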

ATLAS software status

  • SL5 releases (Athena release 17 and earlier) will not be supported, and their setup scripts will fail on C7 nodes
  • SL6 production releases have been tested and validated on the grid and are working in legacy mode ATLINFR-1102
  • SL6 analysis has been validated too.
  • C7 builds are not going to be used in 2019 other than for testing (ART and other types)

| OS  | Analysis      | Production    |
| SL5 | Not supported | Not supported |
| SL6 | Working       | Working       |
| C7  | In progress   | In progress   |

Known Software Problems

The following list of problems is aimed at users; they will be solved once pilot2 is commissioned and containers are used to run jobs in their natural OS.

  1. offline help topic (import numpy failing on CC7)
    • Importing numpy no longer works when using slc6 AnalysisBase releases.
      • Solution: set up numpy using lcgenv (see the lcgenv sketch after the workaround code below).
  2. offline help topic (athena not working in release 20.20.14)
    • athena does not work in release 20.20.14 (nor in 20.20.11, 20.20.12 and 20.20.13): it seems to need liblzma.so.0 (see the note after the workaround code below).
  3. Automatically Tuned Linear Algebra Software for SSE extensions (ATLAS) does not work (its rpms are unavailable on centos7). This is a proposed workaround if you need it:
# On C7 the "atlas" rpm ships one combined library (libsatlas.so) instead of
# the individual BLAS/LAPACK libraries that SL6 builds expect; expose the
# combined library under the old sonames in a private directory.
if [ -e /usr/lib64/atlas/libsatlas.so ]; then
  workaroundLib="$(pwd)/extraLibs"
  mkdir -p "$workaroundLib"
  export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$workaroundLib"
  ln -s /usr/lib64/atlas/libsatlas.so "$workaroundLib/libptf77blas.so.3"
  ln -s /usr/lib64/atlas/libsatlas.so "$workaroundLib/libptcblas.so.3"
  ln -s /usr/lib64/atlas/libsatlas.so "$workaroundLib/libatlas.so.3"
  ln -s /usr/lib64/atlas/libsatlas.so "$workaroundLib/liblapack.so.3"
  # do the same for any other atlas lib that is missing and needed
fi
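
For problem 1, a sketch of the numpy setup with lcgenv via ALRB (the LCG release number and platform tag below are examples, not a tested recipe; pick ones that exist in your environment):

# run after setupATLAS; LCG_95 and x86_64-centos7-gcc8-opt are assumptions
lsetup "lcgenv -p LCG_95 x86_64-centos7-gcc8-opt numpy"

For problem 2, a hypothetical workaround in the same spirit as the hack above: expose the C7 liblzma under the soname the SL6 builds look for. This is untested and unsupported (the soversions differ for a reason); the clean solution is to run the release inside an SLC6 container.

# liblzma.so.0 (SL6) vs liblzma.so.5 (C7): symlinking across soversions is a hack
workaroundLib="$(pwd)/extraLibs"
mkdir -p "$workaroundLib"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$workaroundLib"
ln -s /usr/lib64/liblzma.so.5 "$workaroundLib/liblzma.so.0"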

To avoid the problems reported above there are 3 methods users can adopt. The suggestions here refer to lxplus, but the same holds for non-CERN UIs.

  1. Log in to an SLC6 host:
    ssh lxplus6.cern.ch
    compile code on SLC6
    pathena/prun ...
    
  2. Log in to a CentOS7 host:
    ssh lxplus.cern.ch
    spawn singularity with SLC6: setupATLAS -c slc6
    compile code on SLC6 in the singularity container
    prun/pathena ...
    
  3. When grid sites are on CentOS7 (5/4/2019: migration currently in progress):
    ssh lxplus.cern.ch
    compile code on Centos7
    prun/pathena --osMatching ....  # osMatching will send jobs only to grid sites (now mostly CentOS7 sites) running the same OS as your machine
    
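As an end-to-end sketch of method 2 (the release number and dataset name are placeholders, not recommendations):

ssh lxplus.cern.ch
setupATLAS -c slc6                    # spawns an SLC6 singularity container
# inside the container:
asetup AnalysisBase,21.2.60           # example release, pick your own
# ... compile your code here ...
prun --exec "run.sh" --outDS user.<nickname>.c7test   # placeholders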

Upgrade problems and work arounds

Upgrade paths

ATLAS can run either on a cluster with C7 WNs or on a cluster with SL6 WNs, but cannot run transparently on a mixed C7/SL6 cluster. This is because in a mixed environment the system would try to run C7 releases on SL6 WNs and SL5 releases on C7 WNs, neither of which can work. Therefore we foresee two possible scenarios for the upgrade to C7, described below. ATLAS suggests the first scenario, unless the migration is foreseen to take many days of downtime.

Big Bang Transition to C7

  • The site declares a downtime in GOC to upgrade the WNs to C7 and sends an email to atlas-grid-install@cernNOSPAMPLEASE.ch
    • If the site can rely on good site or cloud support, the site contact or the cloud contact should be contacted beforehand; he/she can help with the communication process and post more ATLAS-specific announcements, e.g. to the ADC ELOG.
    • The ATLAS SW installation manager un-tags all releases at the site.
    • The site shuts down the farm (with the standard procedure of closing queues and CEs) and upgrades the WNs to C7.
    • The ATLAS SW installation manager re-tags the site on C7.

Rolling Transition to C7

  • The site announces to ATLAS the intention to carry out a rolling transition by sending an email to atlas-project-adc-operations@cernNOSPAMPLEASE.ch
    • If the site can rely on good site or cloud support, the site contact or the cloud contact should be contacted beforehand; he/she can help with the communication process and post more ATLAS-specific announcements, e.g. to the ADC ELOG.
    • The site should configure a new batch queue into which WNs will be moved as they are migrated to C7. The site should point at least one Computing Element at this queue and communicate the name of the Computing Element (host and queue) to ATLAS.
    • The site can then start moving WNs from the SL6 queue(s) to the C7 queue(s).
    • ATLAS will create a new Panda Site and a new Panda Queue pointing to the site batch queue.
      • The new panda site is needed for two reasons:
        1. The HC tests need a master queue. When the master queue is blacklisted, all the queues that depend on it are blacklisted too. There can be only one master queue per panda site per activity. You can have a master queue that controls the status of both the old and the new system, but this is not recommended.
        2. The installation system also uses the concept of master panda queue to decide whether it needs to install and validate the releases or just copy the tags. Not all SL6 tags work on CentOS7 nodes (as noted in the sw status section).
      • In the new queues please use the rucio mover, unless the site has its own local mover (mostly US sites).
    • The ATLAS installation system will tag the ATLAS SW for the new Panda Resource (and therefore for the new queue)
    • When all WNs have been migrated to C7, the site should announce the end of the transition to ATLAS. ATLAS will then retire the SL6 Panda Resource and stop sending pilots to the CEs pointing at SL6.

You can of course also create a TEST queue before you do anything, so you can test adapting your site before moving.

Panda Queue re-organisation during the migration

Multi-CE queues: If a site already has a multi-CE queue, please check that it does not mix CEs with SL6 nodes and CEs with C7 nodes under the same Panda Queue; ATLAS cannot cope with mixed queues at any level. In that case you need to create another Panda Site/Panda Resource for the C7 CEs. You can get away with creating new Panda Queues under the same Panda Site if you make the new queue the master queue; otherwise there will be problems with automatically getting the release tags. This is a good strategy if you plan to move the majority of your resources to CentOS7 quickly, because the master queue also controls the switcher that sets the PQ online and offline.

Panda Queue and Panda Resource names: After 6 years we still have Panda Queues with names different from their Panda Resources. If you are going to keep the same queues, please DO rename the Panda Queue to match the Panda Resource. For new queues this is now enforced by the system.

UCORE queues and Harvester

Many UCORE queues have already been created and moved under Harvester. For some sites this didn't happen; if things have changed and a new UCORE queue can now be used, remember that the following AGIS settings are needed on the new queue:

  • is_default: Yes (once the SCORE queue is disabled)
  • Capability: UCORE
  • HC Suites: PFT and PFT_MCORE
  • Harvester: CERN central A
  • Pilot Manager: Harvester
  • Workflow: pull UPS

How to find cloud support

  • If you don't know how to contact your cloud support, here is what to do: find out which cloud you belong to by checking the panda cloud page, or search for your site in the AGIS system.

What about Tier3s?

We are in a transition period during which grid sites move from SL6-compatible OS resources to CentOS7-compatible ones. By Tier3, unless explicitly mentioned, we include Tier3 sites as well as users' desktops and laptops.

There have been corner cases in which the SL6 software didn't work, but 99% of the ATLAS software should work on CentOS7 in legacy mode if you also install HEP_OSlibs (https://gitlab.cern.ch/linuxsupport/rpms/HEP_OSlibs/tree/el7) locally. We don't have CentOS7 native releases yet, so all the SL6 software should run on both SL6 and CentOS7 nodes.

ALRB (atlasSetup) should set things up correctly for users; on top of that, if you have singularity, users can run in containers seamlessly by running the same commands with an extra parameter. This also works well on interactive machines.
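
For example (the -c option is the extra parameter; both images appear elsewhere on this page):

setupATLAS -c slc6      # SLC6 environment in a singularity container
setupATLAS -c centos7   # CentOS7 environment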

If you have a batch system like HTCondor or Slurm you can integrate containers into the batch system. This completely decouples upgrading the bare-metal worker nodes to CentOS7 from migrating the users' workflows to CentOS7. Some sites, like Bonn, have already done this: local jobs never run "bare metal", but always in a container, and users just add a line to their job submission scripts saying which OS they want.
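
A sketch of what this can look like with HTCondor (the +ContainerOS attribute is hypothetical: the attribute name and its mapping to a container image are defined by the site configuration, not by HTCondor itself):

# write and submit a job that asks the site for a CentOS7 container
cat > job.sub <<'EOF'
universe     = vanilla
executable   = myjob.sh
output       = myjob.out
error        = myjob.err
log          = myjob.log
+ContainerOS = "CentOS7"
queue
EOF
condor_submit job.sub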

One thing to note is that ATLAS will be requiring Singularity at grid sites when they migrate to CentOS7. Tier3s are strongly encouraged to install it as well when they move to CentOS7 (Mac Tier3 users can install Docker). With Singularity installed, Tier3 users can run containers very easily (https://twiki.atlas-canada.ca/bin/view/AtlasCanada/Containers) and can even submit to local batch queues while inside the container. Everything you need to know is described in the Twiki link above.

The container-based solutions are here for the long term. We also have CentOS7-based containers (setupATLAS -c centos7) and it is now also possible to run user-provided containers.

Other Info

Documentation pages

  • ADC Weekly and other updates
  • Tier3s
Major updates:
-- AlessandraForti - 2016-05-13

Responsible: AlessandraForti
Last reviewed by: Never reviewed
