SquadHowTo

Introduction

This page provides information needed for the cloud squads (a.k.a. cloud support) in the Atlas Distributed Computing.

A cloud squad is a team that provides support for the corresponding cloud: there is one squad team per cloud. It also helps sites with ATLAS-specific issues and plays an interface role between the sites and ATLAS central operation.

The squad for cloud "xx" may be contacted using the following email address: atlas-adc-cloud-xx.at.cern.ch.

Extensive description of squad role

General principles

The cloud squad is a team of people at the interface between site admins, ADC central team and shifters.

At the beginning of WLCG (~2005), the MONARC model defined the notion of a cloud, which included one T1 and its associated T2s (10 clouds + CERN for ATLAS). It was natural to create squads matching each cloud. The other motivation for the cloud squads was to have a small team of experts per cloud who followed the ADC tool developments and applied them to their sites. In contrast, CMS decided to request one expert per site and enabled cross-cloud transfers.

By 2018 the MONARC model had progressively disappeared, so a squad can be an aggregation of any sites, but it is recommended that sites within the same FA (funding agency) are associated to the same squad (the impact of a missing squad is addressed in a later section).

The liaison between the squad and the associated sites is organised internally. Usually, there are weekly, bi-weekly or monthly meetings.

The squad members should be ATLAS members in order to have access to all ADC documentation and monitoring. Since they are credited with Class 3 OTP, in contrast to site admins (Class 4), the squad members are expected to be pro-active and their commitment should match the fraction of time declared in OTP (reviewed a posteriori by the Scrutiny group every 6 months). The squad, in liaison with the shifters (CRC, ADCoS and DAST), should cover most of the issues related to sites in order to let ADC focus on the common tools.

The squad skills can benefit from contributions to the ADC shifts such as ADCoS; confirmed squad members are encouraged to maintain and improve their expertise by contributing to CRC shifts.

The ICB representative should keep close contact with his/her squads to be aware of site evolutions and issues.

Detailed squad activities

The responsibilities of the squad are to:

  • organise the introduction, upgrade, migration and removal of a Grid site in liaison with ADC central. Based on the site infrastructure and the site admins' commitments, the squad proposes the site classification (T1-T2D, T2, T3, analysis or not, storage of primary replicas or not) to ADC central;
  • ensure that Grid sites permanently fulfil the ATLAS requirements (batch queue, storage, network, ...), which depend on the site classification (T1-T2D, T2, T3), and provide reliable services;
  • ensure that WLCG federations have deployed the pledged resources for ATLAS and that they are fully and efficiently used according to ATLAS monitoring (which might be inconsistent with WLCG monitoring);
  • ensure that sites with non-pledged resources are used according to the site expectations.

In order to reach these targets, the squad members are expected to:

  • General
    • Organise the creation, migration and site decommissioning (Jira tickets in ADC-INFR)
    • Organise the transfer of information on ADC evolutions to the site admins and to the whole squad team
    • Check site availability and different efficiencies based on ATLAS monitoring
    • Provide feedback on the usability of ATLAS monitoring tools and ATLAS correction tools (dark data for example)
    • Update AGIS according to new site resources or new AGIS organisation (?)
  • Issues:
    • Follow up JIRA/GGUS tickets: the squad has to ensure that the ticket contains all the information necessary for the site (including pointers to the ADC central documentation) and that the site admin addresses the issues according to the WLCG MoU.
    • Address issues reported by DAST shifters (only by mail) and follow up with the sites (submit GGUS/JIRA tickets if necessary)
    • Report issues and solutions which could be useful to the whole community (ADC weekly or a written report on a monthly basis?)
  • Jobs
    • Identify problematic Panda queues (overlap with ADCoS for production queues, but no shifter checks the analysis queues): jobs or pilots failing
  • Storage and transfers
    • Manage LOCALGROUPDISK/TAPE
    • Ensure that file transfers match expected speed
    • Check and declare lost files to consistency service

A target for site/endpoint availability/efficiency/... should be defined to serve as a reference and help squads figure out whether site improvements are still possible. The guideline could be a mixture of the best sites (ultimate goal) and the mean behaviour.

The 24/7 survey to inform sites about production problems is handled by ADCoS production. As a consequence, the squad is expected to follow up reported issues on a daily basis (broken SE, lost files, broken panda queues or squid). Most other activities can be organised on a weekly basis.

A shared documentation accessible to squads and site admins and managed by a coordinator would be useful to optimise the time of each squad.

Impact on a site of missing squad survey

If no group of people takes care of squad duties, only ADCoS shifters will regularly monitor the site behaviour (unless it is a T3). The follow-up by ADC central and the CRC will be done with low priority. The impact will be:

  • No scheduling of the sites upgrades (requested by the site or ADC central)
  • No follow up of the efficiency of analysis queues
  • No contact to receive alerts from DAST

This should prevent analysis queues from being set up at these sites and enforce that the storage is used as a cache only (no primary replicas). In summary, the site could be integrated in BOINC.

This problem could be partially solved by having squad documentation, accessible to all squads and regularly maintained, which could also be accessed by site admins or newcomers.

The following sections have not been updated to be coherent with the text above.

Check list

This is a short list of items to check to gain an overview of the cloud status:

Requirements to be a cloud squad member

The next steps should be done only once you are validated as a member of atlas-adc-cloud-xx and registered in Savannah

  • Check your membership in atlas-adc-operations.
    • You should see your address in the list of "Email Addresses" once you are included in the above atlas-adc-cloud-xx.

  • To belong to the VOMS group /atlas/team (to be able to handle GGUS TEAM tickets and to declare bad files)
    • Contact the person responsible for your cloud squad and ask him/her to contact the ATLAS VO administrator project-lcg-vo-atlas-admin@cernNOSPAMPLEASE.ch for approval. If you contact the VO admin directly, mention that you have been approved for the e-group atlas-adc-cloud-xx.at.cern.ch.
    • Request at https://lcg-voms.cern.ch:8443/vo/atlas/vomrs
      1. click [+] in front of Members on the left menu
      2. click Select Groups & Group Roles
        • If you don't see the item Select Groups & Group Roles, then you are likely a Group Manager of some VOMS group;
          in that case do as follows:
          1. click Manage Groups & Group Roles
          2. put your DN and/or First name and Last name
          3. choose the group name /atlas/team at Groups
          4. check "Roles", and click [Search]
      3. check the box on the right of the group name if you don't see Approved in the line
        • If you see Approved, you are already a member of the /atlas/team group; no need to request it again.
      4. click [Submit]
      5. The validation is manually approved by the VO admin and can take a few days

  • atlas/Role=production is usually not necessary, except if the squad member will:
    • actively declare corrupted/lost files to the recovery service. It should be possible to do this with membership of the /atlas/team group
    • manually clean the LFC or SE (example: LFC cleaning for an LFC migration). Cleaning LFC+SE will soon be possible by declaring files to the recovery service
    • blacklist a cloud in Panda (should be done by the ADCoS expert)
    • run a local pilot factory whose pilots use his/her DN. However, it is recommended not to use a personal certificate; CERN uses robot certificates.
In case an action requiring the production role is necessary, the list of people holding it is maintained in ADCOperationsProductionRole

Production and Analysis support

Production

Monitoring and errors

  • Panda production monitoring
  • Main errors by site (errors link below the graph in the panda production page). Determine the error origin:
    • To check if the errors are linked to a particular task, look at the task list; you may find the ones related to your preferred cloud by clicking the number associated to it in the active-task line of the Panda Production monitoring page. If a task is associated with a large number of errors, contact the owner of the task (the name is found by clicking on the task number).
    • If the errors are due to a site, check the status of the site (see below).
  • The Production dashboard is also useful to get an overall view of the errors in the cloud (choose the view for your favourite cloud)

What to check if there is no significant production in a cloud?

To check if the cloud had few jobs integrated over the last week: http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/summary

  • Check if there are defined jobs in Panda:
    • No defined jobs means there is nothing to be run.
    • Check if there are assigned jobs.
    • If there are, the site should run production unless the input data blocks are not arriving at the site.

  • High number of jobs in waiting state:
    • This needs investigation, as it usually means that the input data is not found. Panda brokering is modulated for waiting jobs.

  • Check if pilots are running:
    • If there are activated jobs and the site is not running production, it probably means that pilots are not arriving, so jobs cannot be pulled from Panda.

  • Check if the number of transferring jobs is high:
    • This means that outputs cannot be aggregated to the T1. Panda brokering has a protection against this.

  • Check that there is enough disk space:
    • Lack of free disk space blocks brokering.

Some more info

Analysis

  • Panda analysis monitoring is there
  • Main errors by site can be found here (errors link below the graph in the panda production page).
  • If a site has a large error percentage which is not due to the users themselves (e.g. Athena crashes, users killing their own jobs, ...), check the status of the site (see below).
  • If a site gets no analysis jobs, check the Panda Brokering Monitoring. Filter by site and increase the period and/or the number of displayed lines.
  • NEW Beware! In case of a CE-only downtime, the current switcher sets only the production queues offline: the analysis queue has to be set offline by hand by the ADC shifter or the squad, and set back manually when the downtime is over. See also SiteStatusBoardInstructionForShifter.

Data Management support

DDM-OPS Savannah

DDM transfer monitorings

Data Transfer Request

  • Check if there are pending data transfer requests for your cloud in DATRI.

FTS monitoring

DDM problem tracking

Procedures

  • SE downtime
    • The downtime should be registered in GOCDB/OIM so that DDM automatically stops issuing FTS jobs when the downtime starts
  • Tracking dark data on storage
    • The following tool helps to find files on storage which are not registered in the LFC or in DDM

Sites

Introduce a new site in ATLAS Grid

  • All steps described in this Twiki

Modify site configuration in ATLAS Grid

Panda queue configuration

Merging of CEs within same panda queue

ATLAS recommends merging Panda queues which have the same parameters except for the CEs. This is usually already the case for ANALY queues. The target should be to have one Panda queue per pandaSiteID. Each pandaSiteID corresponds to the site resources with a set of specific parameters that cannot be mixed, for example all the CEs with SL6 WNs behind them, queues corresponding to different memory settings, multicore WNs, etc. The queue and the SiteID can have the same name.

The site should be informed of the changes in order to check:

  • whether all CEs should be aggregated (before the migration)
  • that all CEs are used as expected (after the migration)
Before merging the queues, check that the Panda queue parameters are identical. Then, to merge the queues, choose one of them in AGIS and clone it. After updating the queue name, just add other CE-queues. The information is propagated to schedconfig within minutes.

The new queue should appear in the pilot monitoring after 12 hours at most (during week days). Set the new queue to 'test' (to check that a few pilots can enter) and, if OK, set it 'online' with the comment 'HC.Blacklist.set.online' (HC will not change the status if the other Panda queues are online). There is no need to contact the HC team to add the queue (the pandaSiteID is tested, not single queues).

When the new queue runs properly, set the single-CE queues offline and wait for them to be purged. Then disable the Panda queues in AGIS: the Panda queue will disappear from the APF monitoring one week later.

Trace changes in AGIS for a site

Request sent on Sept. 2013 to AGIS devs

Insert test SE within existing site

Change CE within existing Panda queue

  • Drain the panda queue by setting it offline (do not forget to open a Savannah ticket in ADC-Site-exclusion)
  • Update AGIS (squad experts can do it). Schedconfig and the pilot factories are automatically updated within hours
  • Put the queue in test mode

Upgrading middleware

SL6 upgrade

The SL6 transition can be done in two ways: at once after a downtime (the so-called "Big Bang transition") or by progressively moving WNs from SL5 to SL6 ("Rolling transition"). Squad team instructions for:
  • Big Bang transition
  • Rolling transition
    • In AGIS create new panda queue(s) for SL6 (production only, analysis only or both, depending on the site request)
      • Create a new panda resource for SL6, e.g. SITE_SL6 (select your site in AGIS, click on the panda item, in the table on the right click on "Panda Site" and follow the instructions)
      • Create a new queue associated to this resource (clone the SL5 one and modify the CE and the job submission parameters -jdl, jdladd, queue, ...- according to the site instructions)
      • Associate the new CE to the queue
    • Set the queue(s) in test mode
    • Post an Elog to inform ATLAS
    • Ask ADC Hammercloud support to add the new queue(s) to ADC database so that they are tested
    • Ask mailto:atlas-grid-install@cernNOSPAMPLEASE.ch to tag the queues for SL6 software releases
    • Follow up and document the progression in the corresponding Savannah
    • Once the queue is running properly:
      • For Tier1s : the panda resource name in cloudconfig.tier1 needs to be changed to the one for SL6 (see #CloudConfig)
      • Drain the SL5 queues and set offline once all jobs are finished
        • For Tier1s: Only once all jobs with destinationSE=oldsite are finished
      • Remove the panda queue and the panda resource in AGIS (Note big red "delete" button sometimes hidden at the very bottom right corner of the edit page for panda resource and queue)
Renaming panda resource back to original
  • After a "rolling transition", the site may rename the queue back to the original. This has the following implications.
    • case A) Instead of "removing the panda resource in AGIS" as the last step above, one may move the SL6 CE or the SL6 panda queue to the original panda resource. In this case the software releases tagged for the "original" panda resource are still the ones for SL5 and need to be re-tagged, which can take from several hours to a day.
    • case B) After "removing the panda resource in AGIS" as the last step above, one may "rename" the panda resource to the original (the one just removed). *this has quite some implications that need to be sorted out. currently not recommended*
  • In either case, atlas-grid-install@cernNOSPAMPLEASE.ch should be contacted.
  • T1s should not forget about cloudconfig.tier1 (see above)
Upgrades followed in Savannah tickets: Important Notes

Status

An overview of the sites' availability for ATLAS is shown on the ATLAS site status board (SSB).

Site Exclusion and Notification

| Reason | Info | Special Procedures | Exclusion | Notification |
| Summary | | | SSB | "[ATLAS SSB Notification] Cloud XX: Daily Résumé" |
| Downtime | AtlasGridDowntime | see below | DDM | "[DQ2 notification] AGIS collector summary" to AMOD and ADCoS Experts, with state change by DQ2 |
| | | | PanDA (*) | "[Queue AutoExclusion] Summary for XX cloud" to atlas-adc-cloud-xx, with state change by Switcher (adcssb01). See SiteStatusBoardInstructionForShifter#Instructions_for_Queue_Autoexclu |
| Disk Space | DDMOperationsGroupArchived, Auto-cleaning agent (Victor) | | DDM | "[DQ2 notification] Diskspace collector summary" to atlas-adc-cloud-xx and ADCoS Experts, with state change by ddmusr01 |
| | | | warning only | "[DQ2 notification] Storage space alerts for XXX" to atlas-adc-cloud-xx, daily while free space is small |
| Problematic Files | DDMOperationProcedures | | warning only | "[DQ2 notification] Potentially bad files" to atlas-adc-cloud-xx, daily while there are access problems |
| SRM failure | SAAB | | DDM | by SAAB with state change (see more info) |
| Test Jobs (PFT/AFT) | HammerCloud, list of sites and their status | | HC.Incidents | "[HammerCloud][XXX] XXX Auto-Excluded" by atlas-adc-hammercloud-support |
| Manual | ADCOpsSiteExclusion | see below | Savannah | by Savannah with update to the tickets |
(*) Panda queue status

Special Procedures

Downtime

Before and after a downtime, the PanDA queues are set to:
  • production queues:
    • offline 8h before the downtime
  • analysis queues:
    • brokeroff 8h before the downtime
    • offline 4h before the downtime
  • test after the end of the downtime
  • online after successful test jobs (PFT for production queues and AFT for analysis queues, see IT.HammerCloud#APPENDIX_2_ATLAS_Automatic_Site)
  • see SiteStatusBoardInstructionForShifter
  • NEW Note: in case of a CE-only downtime, the current switcher sets only the production queues offline: the analysis queue has to be set offline by hand by the ADC shifter or the squad, and set back manually when the downtime is over. See also SiteStatusBoardInstructionForShifter.

Manual Exclusion

after manual exclusion,

Status of DDM transfer activities

  • Subscription Tools: go to http://panda.cern.ch, look at the left menu near the bottom;
    • Functional tests: click Functional Tests and then Tiersinfo on the top menu. The subscriptions can be enabled/disabled by manual interventions.
    • Distribution of data produced by T0: click ATLAS Data and then TiersInfo. The subscriptions can be enabled/disabled by manual interventions.
    • Distribution of data produced by Panda: click AODs and then Clouds summary. The subscriptions can be enabled/disabled by manual interventions.
  • If the transfer problems are solved, here are the instructions to follow: ADCoS#Checking_blacklisted_sites_in_DD

  • the reasons for being offline can in principle be found in the elog:

ATLAS Software Distribution

Operations

What to do to prepare for scheduled downtime (T2)

  • For SE downtimes,
    • If the downtime is published in AGIS (based on GOCDB or OIM in the US), the DDM transfer machinery stops inserting new FTS requests when the downtime starts. The already submitted FTS jobs being processed are not purged and will fail. An SE downtime usually implies stopping the Panda queues; look at the next section for the steps to stop them.

  • For CE downtimes,
    • The site can drain the Grid queues but it is not mandatory. If the site drains, the CE is usually set in downtime in advance
    • Any action from the squad or ADCoS must be recorded in the ELOG and in the ADC Site Exclusion Savannah (see ADCOpsSiteExclusion)
    • The following part is now automated
      • To prevent new jobs from being assigned to the site, the associated Panda queues should be set 'brokeroff' 10 hours in advance for production and 4 hours in advance for analysis (analysis jobs are supposed to be shorter than production ones). Activated jobs will still start.
      • The site should be put offline just before the downtime (in particular to stop sending pilots). Activated jobs will not start and running jobs will continue to run.
      • The restart and testing of the queues is managed by the ADCoS shifter. Squads can contact him/her to make sure that the queues are restarted on time.
    • NEW The site can trigger testing of the queues by setting the queue to status 'test' with the comment HC.Test.Me (see the command sketch below)
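
As an illustration, such a status change can in principle be pushed through the same PanDA controller interface as the 'nqueue' setting shown in the Pilot factory section below. This is only a sketch: the 'tpmes=settest' keyword and the queue name MYSITE_PROD are assumptions and should be verified with ADC central before use.

# Sketch only: 'tpmes=settest' is assumed by analogy with the 'setnqueue' example
# in the Pilot factory section; verify the keyword and your rights before use.
curl -k --cert $X509_USER_PROXY "https://panda.cern.ch:25943/server/controller/query?tpmes=settest&queue=MYSITE_PROD&comment=HC.Test.Me"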

How to deploy storage at sites ?

All details about storage deployment at sites (space tokens, ACLs, shares) are described in this Twiki. The request for the 2012 pledge is identical for the moment (Nov 2011).

How to reassign space between space tokens

The space-token reassignment is done by the site but should follow the ATLAS requirements. If there is enough free space, the site simply reduces the necessary space. If there is not enough free space, and after having checked that there are enough secondary replicas to delete, the site can reduce the space down to the limit at which Victor will clean the least-used datasets. Several steps could be necessary.

How to proceed for a Grid storage migration

  • All information should be provided in the DDM_Ops Savannah.
    • Important dates
    • Old and new SE addresses
    • Precious data to be migrated
  • Stop DDM transfers to the site ( dq2-set-location-status -S SITE -p wu -s off -r 'Savannah reference')
  • If a data migration is necessary, two methods are possible (mainly for DATADISK, group DDM endpoints and LOCALGROUPDISK):
    1. Create a new DDM endpoint called site_TMPDATADISK and replicate the datasets with DDM/FTS. When a dataset is replicated, clean it at the source. When the migration is done, ask DDM Ops to change the location from _TMPxxxx to _xxx
      • IMPORTANT NOTE FOR DATADISK: keep the same custodiality at the destination as at the source

  • T2 DATADISK :
    • To speed up the migration, it is possible to clean all secondary replicas and input (except the step09 ones).
    • Datasets required by HC should be reinstalled with first priority (reference)
  • T2 PRODDISK migration (no data migrated) :
    • Put the production queue(s) offline (a few days in advance)
    • When there is no more transfer activity, delete all dataset replicas on PRODDISK
    • When the deletion is finished, update ToA and SchedConfig to point to the new SE
  • T2 SCRATCHDISK (no data migrated) :
    • Put the analysis queue in brokeroff to purge the existing jobs
    • When there are no more jobs, wait for 7 days before deleting the datasets
    • When the deletion is finished, update ToA and SchedConfig to point to the new SE
  • T2 HOTDISK :
    • When neither production nor analysis queues are active, just clean the datasets.
    • When the deletion is finished, update ToA and SchedConfig to point to the new SE
    • Put the site back in DDM to populate the DDM endpoint again

How to proceed for a Grid storage reset

This applies when a site decides to renew its storage from scratch.

  • All information should be provided in the DDM_Ops Savannah.
    • Important dates
  • The analysis queue should be put offline 10 days in advance to give users time to get their output back.
  • The production queue should be put offline a few days in advance to give time to clean the datasets in PRODDISK as soon as the last job is finished
  • When production is finished, set the 'w' option in the DDM site status to stop transfers to the site ( dq2-set-location-status -S SITE -p wu -s off -r 'Savannah reference'). Transfers from the site remain active
  • Dataset deletion (request action from DDM Ops through a Savannah ticket):
    • DATADISK :
      • In order to populate the site again, make a list of dataset replicas per custodiality (input/primary/extra) ( dq2-list-dataset-site2 -a primary SITE) before the deletion starts
      • Input replicas should be inputs to HC (AFT or PFT) or to production. DDM Ops will delete them immediately when there are no more Panda jobs.
      • Primary: they should be replicated by DDM Ops to other sites as primary before deletion. To provide the list,
        • if the number of datasets in the DDM endpoint is below 1k, run dq2-list-dataset-site2 -a primary SITE
        • otherwise ask DDM Ops for the list
      • Secondary: to be deleted as soon as the last analysis job is finished. Keep a list (same procedure as for primary)
    • SCRATCHDISK : Datasets are guaranteed to be kept for 7 days. Datasets older than a few weeks can be deleted immediately. For dataset replicas younger than 7 days, a 7-day lifetime should be set ( dq2-set-replica-metadata DATASET SITE lifetime '7 days')
    • HOTDISK : Start as soon as there is no more analysis or production running
    • PRODDISK : Start as soon as there is no more production running or in transferring mode
    • LOCALGROUPDISK : It is the responsibility of the site/squad to copy the datasets somewhere else as a temporary copy
  • When the deletion is done, completely stop any DDM activity (read+deletion) at the site (set all DDM options 'off': dq2-set-location-status -S SITE -p 'rdf' -s off -r 'Savannah reference')
  • HC will complain that input datasets have disappeared.
  • The site can now reset the hardware
  • AGIS is updated by DDM Ops
  • As soon as the site is ready, put DDM back in production ( dq2-set-location-status -S SITE -p 'rdfuw' -s on -r 'Savannah reference') and test the storage with FT over a few hours
  • DDM Ops will trigger the transfer of all necessary datasets
    • DATADISK : step09*sonar* , input for HC
    • HOTDISK will be automatically populated
  • When the previous step is finished, set the production and analysis queues in test mode (reference to add).

In case of problems

If a site gets a lot of errors

Contact the ADCoS shifter through the VCR. The shifter should have followed this procedure. Update Savannah or GGUS if you have information which could be useful for the shifter or an ATLAS expert. There is an automatic site exclusion for analysis queues which should set the queue status properly. A similar test is not yet implemented for production queues.

Site not getting jobs

[lxplus414] /afs/cern.ch/user/j/jezequel > lcg-info --vo atlas --list-ce | grep tbit
- CE: tbit03.nipne.ro:8443/cream-pbs-atlas
    • Check the releases published in the tags
[lxplus243] /afs/cern.ch/user/j/jezequel > lcg-info --vo atlas --list-ce --sed --attr Tag | grep tbit03
tbit03.nipne.ro:8443/cream-pbs-atlas%CREAMCE&GLITE-3_0_0&GLITE-3_1_0&GLITE-3_2_0&LCG-2&LCG-2_1_0&LCG-2_1_1&LCG-2_2_0&LCG-2_3_0&LCG-2_3_1&LCG-2_4_0&LCG-2_5_0&LCG-2_6_0&LCG-2_7_0&VO-atlas-AtlasCAFHLT-16.1.2.8.1-i686-slc5-gcc43-opt&VO-atlas-AtlasCAFHLT-16.1.3.19.1-i686-slc5-gcc43-opt&VO-atlas-AtlasCAFHLT-16.1.3.25.1-i686-slc5-gcc43-opt&VO-atlas-AtlasCAFHLT-16.1.3.25.2-i686-slc5-gcc43-opt&VO-atlas-AtlasHLT-16.1.2-i686-slc5-gcc43-opt&VO-atlas-AtlasHLT-16.1.3-i686-slc5-gcc43-opt&VO-atlas-AtlasHLT-17.1.0-i686-slc5-gcc43-opt&VO-atlas-AtlasHLT-17.1.1-i686-slc5-gcc43-opt&VO-atlas-AtlasHLT-17.1.2-i686-slc5-gcc43-opt&VO-atlas-AtlasHLT-17.1.3-i686-slc5-gcc43-opt&VO-atlas-AtlasHLT-17.1.4-i686-slc5-gcc43-opt&VO-atlas-AtlasP1HLT-16.1.2.
    • If a major release version (used by other sites for production) is missing:
      • Check if the version is validated at the site with software installation monitoring. If necessary, contact Grid Software Installation
      • If the site already had this tag (i.e. already ran jobs with this release), open a GGUS ticket to the site to ask about the status of the ATLAS tag file hosted on the CE (e.g. a backup with an old version). If the file has to be reinstalled, contact the Grid Software Installation team.
      • If the tag has always been missing, contact Grid Software Installation team to reinstall the tags

  • Compare the schedconfig parameters of a good and a bad Panda queue (usually from two different sites): maxinputsize, maxtime, ... . The parameters are described in SchedconfigParameterDefinitions. A quick way to compare two parameter dumps is sketched below.
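
If the parameters of the two queues have been dumped to text files (one parameter per line), a plain diff is often the quickest way to spot the discrepancy. The file names below are placeholders.

# good_queue_params.txt / bad_queue_params.txt are hypothetical dumps of the
# schedconfig parameters of a working and of a problematic queue.
diff <(sort good_queue_params.txt) <(sort bad_queue_params.txt)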

Declare and correct lost or corrupted files at a site

Potentially lost or corrupted files

Lost or corrupted files
  • Files can be identified as lost or corrupted, either by transfer failures, job failures, or sites finding problems in their storage.
  • Once the site confirms the loss or the corruption of files, they should be declared to the DDM Recovery Service.

Users putting sites in danger

If a site observes that a user is not using the ATLAS tools and is creating too much load on the site, the site can temporarily blacklist the user.

  1. The site should inform AMOD and the ADC operation management (private email with the user DN and a description of the bad usage). AMOD and ADC Operation Management should check whether the same behaviour is observed at other sites.
  2. AMOD/ADC Operation informs the user that he/she is blacklisted at the site.
    • user information can be obtained with dq2-finger DN or via VOMRS (see the example after this list)
    • If the user answers that he/she will stop all problematic jobs and will correct his/her tool, the user should be unblacklisted once the application has been modified
    • The user will remain blacklisted at the site if he/she does not react to the mail
    • If the same behaviour is observed at other sites, the user will be blacklisted from the ATLAS VO (decision taken by ADC Operation Management) and will be informed by mail. The user will remain blacklisted until he/she reacts to the mail.
  3. To implement the blacklisting or unblacklisting of a user within the VO, the VO manager should be contacted.
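
For reference, a minimal dq2-finger call is shown below; the DN is a made-up placeholder and should be replaced by the DN reported with the problematic jobs.

# The DN below is a placeholder, not a real user.
dq2-finger "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdoe/CN=John Doe"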

Space token within site is full (Being rewritten in Feb. 2013)

The recommended sizes of the space tokens are given here (to be updated in Feb. 2013).

The squads are informed by mail,

  • on a daily basis, if a DDM endpoint is close to full
  • on an hourly basis, if the site is blacklisted as a destination in DDM (no more FTS transfers accepted, but existing FTS jobs are still processed) (monitoring)
To avoid DDM endpoints becoming full, the central ATLAS cleaning tool, called Victor, selects on a daily basis unused+secondary dataset replicas and sets a 1-day lifetime (this excludes _LOCALGROUPDISK and GROUPDISK endpoints). Victor processes a DDM endpoint at a threshold (detailed in Victor's monitoring page) above the one which triggers the mail. To check the status of the automatic cleaning for a DDM endpoint (a command sketch to list candidate replicas follows this list):
  • Look at the status of Victor and check if its last update (timestamp just after the table) is older than ~30 hours.
  • Look at the occupancy per custodiality within DQ2 accounting and check if there are TB in state 'ToBeDeleted'. The pages are updated on a daily basis.
  • Look at the status of the central deletion and check that there is not a long backlog of files/datasets to be deleted.
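
As a rough cross-check of how much could still be cleaned, one can list the replicas that are candidates for deletion with dq2-list-dataset-site2. Only '-a primary' is shown elsewhere on this page, so the 'secondary' custodiality keyword below is an assumption and should be verified before relying on the output.

# Assumption: '-a secondary' is accepted as a custodiality value (only
# '-a primary' is documented on this page); verify the keyword first.
dq2-list-dataset-site2 -a secondary SITE_DATADISK
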
If you notice an unexpected behaviour, contact DDM Ops or, in case of urgency, escalate to AMOD. Only the central ATLAS team can trigger manual deletion. To solve the problem on a short timescale the site can add space, but on the medium term (days) the following items provide actions.

It is recommended to regularly detect and delete dark data, which is not deleted automatically.

ATLASSCRATCHDISK

For this space token, DQ2 monitoring provides a breakdown of the storage usage per user (ddmadmin means Datri). It is updated on a daily basis (go to http://bourricot.cern.ch/dq2/accounting/global_view/30/ , browse Reports and Top user Reports, and choose your cloud).

The space token can get full for 3 reasons:

  • only at Tier-1s: a massive cross-cloud transfer to a non-Tier2D is under way (group production or a Datri request). The intermediate dataset owner appears as 'ddmadmin' in the Top10-user usage. Central operation should identify the datasets which have reached their final destination in order to reduce the lifetime (3 days after the replica is complete in SCRATCHDISK)

  • only at analysis sites: a massive replication request to SCRATCHDISK. Massive subscriptions can be identified with dq2-list-subscription-site SITE_SCRATCHDISK . If these replications are unexpected, the subscription owner can be contacted by the squad and asked to stop the replication

  • only at analysis sites: users writing 'big' outputs. If a user is identified as occupying a significant amount of space, the squad can contact DAST and ask them to contact the user and request a clean-up
Possible improvements in the pipeline (Feb 2013):
  • Immediately delete the dataset replica if the Datri request to the final destination is completed (otherwise the usual 15-day lifetime applies)
  • Stop assigning new analysis jobs to the site (maybe already implemented in Panda)
It is strongly recommended to regularly check for dark data which is not cleaned by the central deletion.

If the space token is completely full and no HC output can be written, the site is blacklisted by HC. If a full space token induces a low availability according to the HC tests, ADC Ops can be contacted to correct the monthly report.

ATLASPRODDISK

This space token can become full if the site accepts group/merging jobs (big input and output) and reconstruction jobs (big output) and the PRODDISK size is underestimated. DQ2 monitoring shows that the occupancy then increases quickly, within hours.

The cleaning of the output datasets is done on a daily basis when they are fully replicated at T1s. If the output datasets are never fully replicated at T1s, the cleaning is done after 7 days.

The input datasets have a lifetime of 5 days (Feb 2012: value to be checked).

If ATLASPRODDISK is full and deletion needs to be sped up, contact DDM Ops or escalate to AMOD if urgent.

If the space token is completely full and no HC output can be written, the site is blacklisted by HC. For T1s, this occurs only when DATADISK is full. If a full space token induces a low availability according to the HC tests, ADC Ops can be contacted to correct the monthly report.

ATLASGROUPDISK

The cleaning of these DDM endpoints is under the group's responsibility; no action is required from the squads or central operation. If the space token is full (usually because space is overbooked), the site is blacklisted centrally in DDM.

ATLASLOCALGROUPDISK

In this case, the squad should inform the site and explain the actions to be done:

  • Find the datasets to be deleted: no accounting is displayed
  • Trigger the deletion: the dataset replica owner or a person with the role atlas/country/Role=production can trigger the deletion. The actions are described in DQ2ClientsHowTo#DeleteDatasetReplica (a command sketch follows).
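
As an illustration, deleting a single replica with the DQ2 clients could look like the command below. The dq2-delete-replicas syntax shown here is an assumption; DQ2ClientsHowTo#DeleteDatasetReplica remains the authoritative reference.

# Assumed syntax (check DQ2ClientsHowTo#DeleteDatasetReplica); requires the
# replica owner's proxy or the atlas/country/Role=production role.
dq2-delete-replicas DATASET SITE_LOCALGROUPDISK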

User support within cloud

User groups and role

One can check the groups associated to a user from this link. Providing the name or email address is sufficient.

Cloud services

In case of problems affecting the whole cloud, it can be set offline or brokeroff all at once; see the ADCoS instructions.

FTS downtime

  • If published in gocDB, the downtime should be taken into account automatically

CE/SE downtimes

  • If published in gocDB, the downtime should be taken into account automatically

Pilot factory

  • The central monitoring of the pilots is displayed here
  • To configure the number of queued pilots at each factory, set 'nqueue':
curl -k --cert $X509_USER_PROXY "https://panda.cern.ch:25943/server/controller/query?tpmes=setnqueue&nqueue=20&queue=ANALY_CERN&comment="

Cloudconfig

The cloud-wide parameters needed for the PanDA workflow are stored in cloudconfig.

There is information based on the Tier-1 site-specific configuration, such as tier1 and tier1se:
  • If a Tier-1 changes its "main" panda resource name, the parameter tier1 needs to be updated correspondingly.
  • If a new DDM endpoint that hosts persistent official data is added to a Tier-1, e.g. a group space, the parameter tier1se needs to be updated correspondingly.

Actions requested to cloud squads

Require change in site setup

  • evgen squid cache (4 July 2011) : link

Deletion of empty directories in LFC and on Storage Element

Due to a current limitation of the DQ2 deletion service, empty directories are not removed from the LFC (resp. SE). In some cases this can lead to failures of HC tests, especially when the number of subdirectories in one LFC (resp. SE) directory reaches 999999 (resp. 32k or 65k, depending on the Storage Element type). The problem will be fixed in a future version of the deletion service, but for the time being squads are asked to clean these leftovers. If it is not done, jobs will suddenly fail when the limit is reached.

Cleaning of empty/obsolete LFC directories (July 2011)

This cleaning has to be done by the squad until the LFC catalog is migrated to CERN.

The clean-up of empty directories should be done at least for the following LFC directories:

  • /grid/atlas/users/pathena/user.elmsheus
  • /grid/atlas/dq2/step09
  • /grid/atlas/dq2/valid*
To do it, run as close as possible to the LFC catalog (a sketch for the wildcard entry follows the warning below):
export  LFC_HOST=my.lfc.host
# replace the LFC_HOST by the host for your cloud
lfc-rm -r /grid/atlas/users/pathena/user.elmsheus

Tip, idea The command mentioned above will delete all orphans (files without replicas) and empty directories. Running it on non-empty directories won't remove them.

Warning, important It will take some time to clean the directories mentioned above (probably days). Please ensure you have a long (e.g. 96:00 hours) proxy.
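
The wildcard entry /grid/atlas/dq2/valid* is not expanded by the LFC tools themselves, so a small loop over an lfc-ls listing can be used. This is only a sketch, assuming lfc-ls lists the entries of the parent directory by name; test it on a single directory first.

export LFC_HOST=my.lfc.host
# Sketch: list the 'valid*' entries and remove them one by one.
# Test on a single directory before looping over all of them.
for d in $(lfc-ls /grid/atlas/dq2 | grep '^valid'); do
  lfc-rm -r /grid/atlas/dq2/$d
done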

Cleaning of obsolete/empty directories on Grid storage (July 2012)

The cleaning of obsolete directories includes the deletion of subdirectories and files.

  • Empty directories
    • Any empty directory can be deleted
  • Obsolete directories
    • Request issued before July 2011
      • SE_path/atlasproddisk/mc08*
      • SE_path/atlasproddisk/mc09*
    • Request issued on July 2011
      • SE_path/atlasdatadisk/fast*
      • SE_path/atlasdatadisk/pile*
      • SE_path/atlasdatadisk/special*
      • SE_path/atlasdatadisk/ccrc08*
      • SE_path/atlasdatadisk/fdr*
      • SE_path/atlasdatadisk/step09/DPD*
      • SE_path/atlasdatadisk/step09/AOD/closed/step09.202009*
      • SE_path/atlasdatadisk/step09/ESD/closed/step09.202009*
      • SE_path/atlasdatadisk/step09/RAW/closed/step09.202009*
      • SE_path/atlasdatadisk/step09/AOD/closed/step09.202010*
      • SE_path/atlasdatadisk/step09/ESD/closed/step09.202010*
      • SE_path/atlasdatadisk/step09/RAW/closed/step09.202010*
      • SE_path/atlasscratchdisk/*user09.JohannesElmsheuser* (also atlasuserdisk instead of atlasscratchdisk in the US)
      • SE_path/atlasscratchdisk/*user10.JohannesElmsheuser* (also atlasuserdisk instead of atlasscratchdisk in the US)
    • Request issued on August 2012
      • SE_path/atlasdatadisk/step09/AOD/closed/step09.202011*
      • SE_path/atlasdatadisk/step09/ESD/closed/step09.202011*
      • SE_path/atlasdatadisk/step09/RAW/closed/step09.202011*

Tip, idea Depending on the storage technology, different methods can be used to find empty directories.
For dCache sites that have /pnfs mounted on one machine, a simple
find /pnfs/blahblahblah -type d -empty
can be run and the directories then removed with rmdir (a combined one-liner is sketched below).
For sites with DPM, a recursive
dpns-rm -r
can be used.
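
For dCache, the search and the removal can be combined in a single pass; the /pnfs path below is the same placeholder as above. Directories that only become empty after their children are removed may require a second run.

# Combine find and rmdir in one pass; rerun if nested empty directories remain.
find /pnfs/blahblahblah -depth -type d -empty -exec rmdir {} \;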

Communication

Several teams take care of the different operations and systems of ATLAS distributed computing and some of their duties overlap. It is thus necessary to ensure good communication between all these people:

Documentation

Other related twikis

Specific cloud twiki pages

| Cloud | Twiki Page |
| CA | CA Cloud twiki |
| DE | GridKa Cloud twiki |
| ES | Cloud twiki ?? |
| FR | Fr Cloud twiki |
| IT | Cloud twiki ?? |
| NDGF | Cloud twiki ?? |
| NL | Cloud twiki ?? |
| TW | Cloud twiki ?? |
| UK | Cloud twiki ?? |
| US | US ATLAS Cloud twiki |

References

  1. Squad Tutorial at CC-IN2P3, Lyon 2010.Jan.15
  2. An introduction to ATLAS Computing at ASGC, Taipei, 2010.Oct.25.


Major updates:
-- StephaneJezequel - 29-Oct-2009 -- SabineCrepe - 29-Oct-2009 -- SabineCrepe - 23-Dec-2009 -- SabineCrepe - 18-Jan-2010

Responsible: StephaneJezequel
Last reviewed by: Never reviewed
