
ADCoSWIP   Follow ADCShifts on Twitter

Requirements

  • ADCoS Shifter has to be an ATLAS member.
  • All current ADCoS Shifters are registered in https://e-groups.cern.ch/e-groups/Egroup.do?egroupName=atlas-project-adc-operations-shifts
  • Grid Certificate requirements
    • Every ADCoS Shifter on duty (Trainee, Senior, or Expert) is required to have a valid grid certificate registered to VO atlas at the time of the shift, see WorkBookStartingGrid. When an ADCoS Shifter starts his/her shift without a valid grid certificate registered to VO atlas, the shift booking will be cancelled and he/she will not get any OTP credit for the shift. Recurring certificate issues may result in losing the possibility to sign up for ADCoS Shifts.
    • The shifter's valid grid certificate must be in the /atlas/team VOMS group; that addition is done by the ADCoS Coordinators at Step 2 of the Trainee Shifter setup procedure. If for some reason your certificate is not yet in the /atlas/team VOMS group, ask the ADCoS Coordinators in advance to add it. In particular, don't forget to do that when you get a completely new certificate.
    • Check whether you are able to submit an ATLAS TEAM GGUS ticket, https://ggus.eu/?mode=ticket_team (you should see the TEAM option on your screen). A valid grid certificate (1) in your browser (2) with the /atlas/team VOMS group (3) known to GGUS (4) is required for opening a TEAM ticket. A quick command-line check is sketched after this list.
  • OTP requirements
    • It is strictly forbidden to book more than 1 shift within 24 hours. Violating this rule may result in losing the possibility to sign up for ADCoS Shifts.
    • An ADCoS Shifter books shifts in her/his own name and does the shift as the person who booked it. It is forbidden to book a shift in one person's name on behalf of a different person. The ADCoS Coordinator can book shifts on behalf of other people; such a shift will be booked in the name of the Shifter.
    • In case of an emergency that makes it impossible to take a shift, the ADCoS Shifter immediately notifies the ADCoS Coordinator and the ADCoS Shift Captain of the shift timezone. The ADCoS Coordinator then cancels the shift booking in OTP. The ADCoS Coordinator or ADCoS Shift Captain may announce a Call for shifters to find a replacement Shifter.
    • Booked shifts can only be cancelled up to 45 days in advance; after that, the shifter has to find a replacement.
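
A minimal command-line sketch of the certificate checks above, assuming a grid UI with the standard VOMS client tools installed (the /atlas/team attribute itself still has to be granted by the ADCoS Coordinators):

   # create a proxy explicitly requesting the /atlas/team group;
   # this fails if your certificate is not registered in that VOMS group
   voms-proxy-init -voms atlas:/atlas/team
   # list the attributes attached to the proxy and check for /atlas/team
   voms-proxy-info --fqan | grep /atlas/team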

CHECKLIST

NEW The For ADCOS Shifters page collects useful CHECKLIST links on one page.

  • Before coming to your shift, make sure that you fulfill shift requirements!
  • Open Jabber and say hello in the Virtual Control Room to let the ADC/ADCoS community know you are on shift.
  • List of current shifters to check who covers each shift.
  • Check recent updates and additions to ADCoS procedures in this TWiki, as they will be marked with NEW stamp.
  • If TEAM GGUS tickets do not work for you, please DO NOT SUBMIT non-TEAM tickets. Ask another shifter, an ADCoS expert or the AMOD to submit a TEAM ticket. When you are opening a TEAM GGUS ticket, please decide which priority you need, see How_to_Submit_GGUS_Team_Tickets. Top-priority problems are rare and mostly concern Tier-1s and especially CERN-PROD.

  • At the beginning of your shift please have a look at the Known Problems and Daily_SHIFT_reports of previous shifters
  • Check ADC eLog
  • Check the status of the TEAM tickets in GGUS and hand-over. See TicketManagement
  • Remember to report every action in eLog. (for 'new' entry, click on existing entry first. If you solve an issue, put [SOLVED] in the eLog subject)
  • Remember to check if sites are in Scheduled Downtime before opening bugs.

Most Common Mistakes by Shifters

  • Opening regular GGUS ticket rather than GGUS Team ticket.
  • Opening a duplicate GGUS ticket. Please don't forget to check the list of open tickets before submitting a new GGUS ticket.
  • Reopening a closed GGUS ticket or updating an existing one, rather than opening a new GGUS ticket, when the issue/problem is different from the one in the existing GGUS ticket. Please check with the expert shifter if you are not sure whether it is a new problem or a different manifestation of one which has already been reported.
  • Submitting a GGUS ticket to a site in downtime. Please always check the AGIS Downtime Calendar before opening a new ticket.
  • Forgetting to write the site name in the subject of the GGUS ticket. It is strongly advised to start the subject with the site name. That will make browsing by ticket subjects (GGUS/ELOG) much easier. On the other hand, if the site name is at the end of a lengthy subject line, it may not show on the summary list of GGUS tickets.
  • Forgetting to put an ELOG entry after opening a new GGUS/Jira ticket. Major status updates, as well as closing the ticket, need an ELOG entry as well.
  • Opening a new ELOG thread on the evolving issue, which already has entry(s) in ELOG. Instead please continue the existing thread.
  • Forgetting to submit an evaluation report of the participant trainee shifter.
  • Using an email address from TWiki without editing it to exclude SPAMNOT. Remove the word SPAMNOT, otherwise the email will bounce back.
  • For jobs failing at one of the ARC queues, assigning the ticket to NDGF-T1 even though the jobs were failing at computing elements of another site (see ND(ARC/ARC-T2/ARC_MCORE) queues and associations to sites in GGUS)

MC production

General guide-line (Fast Troubleshooting)

  • Spot sites with major problems
    • Start chasing major failures in both interfaces: e.g. a cloud failing 100% of data import/export.
  • Look at the validation tasks status. These are top priority for bug reporting.
  • Concentrate on tasks with major problems (low efficiency and a large number of failures)
    • Do not worry too much about small numbers of failed jobs per task

BigPanda monitor

  • Go to error distribution page
    • First look for sites with a lot of failing jobs
    • Then look for tasks with a lot of failing jobs, then for those with fewer failing jobs, but remember to track them all!
    • Compare them to know if problems are site related or task related
      • If a task is found to be failing at several sites, it is probably a task problem; then file an ADCO-Support Jira bug report for non-validation tasks or a Validation Jira bug report for validation tasks.
      • When a task is failing due to missing files, please follow the #Missing_files procedure
      • If all jobs are failing at one single site, it is probably a site-related problem; then file a GGUS team ticket.

Job-states definitions in Panda

There are 10 values in Panda describing different possible states of the jobs, these are:

  • defined : job-record inserted in PandaDB
  • assigned : dispatchDBlock is subscribed to site
  • waiting : input files are not ready
  • activated: waiting for pilot requests
  • sent : sent to a worker node
  • running : running on a worker node
  • holding : adding output files to DQ2 datasets
  • transferring : output files are moving from T2 to BNL
  • finished : completed successfully
  • failed : failed due to errors

The normal sequence of job-states is the following:

 defined -> assigned -> activated -> sent -> running -> holding -> transferring -> finished/failed
If input files are not available:

 defined -> waiting
then, when files are ready
  -> assigned -> activated
And the workflow is:
  • defined -> assigned/waiting : automatic
  • assigned -> activated : received a callback for the dispatchDBlock. If jobs don't have input files, they get activated without a callback.
  • activated -> sent : sent the job to a pilot
  • sent -> running : the pilot received the job
  • waiting -> assigned : received a callback for the destinationDBlock of upstream jobs
  • running -> holding : received the final status report from the pilot
  • holding -> transferring : added the output files to destinationDBlocks
  • transferring -> finished/failed : received callbacks for the destinationDBlocks

The job brokering for production is listed in PandaBrokerage#Special_brokerage_for_production

The delay for job rebrokering is listed in PandaBrokerage#Rebrokerage_policies_for_product

Task-states definitions in Panda

  • registered : the task information is inserted to the JEDI_Tasks table
  • defined : all task parameters are properly defined
  • assigning : the task brokerage is assigning the task to a cloud
  • ready : the task is ready to generate jobs
  • pending : the task has a temporary problem
  • scouting : the task is running scout jobs to collect job data
  • scouted : all scout jobs were successfully finished
  • running : the task is running jobs
  • prepared : outputs are ready for post-processing
  • done : all inputs of the task were successfully processed
  • failed : all inputs of the task were failed
  • finished : some inputs of the task were successfully processed but others were failed or not processed since the task was terminated
  • aborting : the task is being killed
  • aborted : the task is killed
  • finishing : the task is forced to get finished
  • topreprocess : preprocess job is ready for the task
  • preprocessing : preprocess job is running for the task
  • tobroken : the task is going to be broken
  • broken : the task is broken, e.g., the task definition is wrong
  • toretry : the retry command was received for the task
  • toincexec : the incexec command was received for the task
  • rerefine : task parameters are going to be changed for incremental execution
For more details, see https://twiki.cern.ch/twiki/bin/view/PanDA/PandaJEDI#Transition_of_task_status

What to do when

A task is failing

  • If the task is assigned to CERN cloud to a queue different from CERN-PROD, i.e. CERN-BUILDS, CERN-RELEASE, CERN-UNVALID, CERNVM, CERN_8CORE, then forget about it. If the failing task is running in the CERN-PROD queue, then please follow standard procedure.
  • Validation tasks. Submit an ATLAS validation Jira ticket for validation tasks (those beginning with valid). There is no need to file a bug for tasks with a small number of failures or to report bugs that have already been reported:
    • Make sure that the bug has not been reported before.
  • Other tasks (those not beginning with valid): use ADCO-Support for non-validation tasks with a high failure rate.
    • To submit the ADCO-support bug report for failing task:
      • Check the task-status page for the given taskID (example: http://bigpanda.cern.ch/prodsys/prodtask/task/4000854/):
        • If "Comment" line has words "group production...", the task is from "Group production"
        • If "Comment" line has "MC12(or M11, MC, Common DPD etc.) production...", the task is from "Official production"

A task is not assigned to site

A site is heavily failing

  • If the burst of errors is restricted to less than a few hours and there are no more errors (from the Panglia plot, the increase of failed jobs is sharp and flat since then), no action needs to be taken.
  • If jobs are continuously heavily failing:
    • If the site is in downtime, the site should have already been automatically set offline
    • Make sure it is a true site issue (not an Athena issue, for example)
    • File a GGUS Team-Ticket as described in this section with cloud responsible in CC.

A site is not getting jobs

  • Remember if the site has no jobs assigned, there's no chance to run.
  • Check that the software versions requested by the current tasks are installed at the site (monitoring).
  • Check if site/queues are online
    • If queue is offline and site not in downtime and if there's no incident ongoing, contact the cloud responsible, file eLog and wait for confirmation to turn the queue online.
    • If queue is offline and there's no incident ongoing, contact the cloud responsible, ADCoS coordinators and file eLog.
  • Check if pilots run in the last hours at the site
    • If you find problems, fill elog and contact the cloud responsible and the pilot factory responsible
  • More elaborate checks are to be done by the squad

Details about errors

'Lost Heartbeat' error

An update by the pilot is sent to the Panda server every 30 minutes. If there is no update within 6 hours, the job is declared 'lostheartbeat'. When the job is finished, the pilot will try to update the Panda Server 10 times, separated by 2 minutes. PandaPilot contains all details about Pilots.

  • Most common reason : the local batch system has killed the job because it used more than the accepted resources (CpuTime, WallTime, memory). The ATLAS requirements are published in the VOid Card. By comparing similar jobs on different sites, try to identify which is the problematic variable. In this case, the number of failing jobs should be spread over time

  • Site or batch system is broken : The failing jobs should be spread over a period of few minutes.

  • CE has lost track of the job

If you ticket the site, provide an example to the site (jobID to be documented).

'Job killed by signal 15'

The local batch system issued a warning to the pilot informing it that the job would be killed soon. The job stopped itself to be able to report the log.

'Exception caught in runJob'

No documentation yet

'Pilot has decided to kill looping job'

No documentation yet

'Get error: Failed to get LFC replicas'

The pilot was not able to get, from the LFC catalog, the file address on the storage. It can be related to a glitch in LFC. This is problematic only if it lasts for more than an hour. If there is a real problem, it should affect other sites linked to the same LFC (same cloud).

'Get error : open connection to sename' or 'Get error: dccp/rfcp failed'

The pilot is not able to copy the input file from the SE. It means that the SE is totally or partially broken. Cross-check with DDM transfers. If the error rate is increasing, put the panda queues offline.

'Get error: with guid xxx not found at'

The dataset is supposed to be available at the site but the LFC scan reveals that the file is not at the site. No action for the moment.

'GUID for xxxx not found in DQ2 '

It usually means that a file was removed from the dataset definition after the task defined the input files. In most cases, the file was found to be unavailable on the SE with no other replica to recover. For the moment, there is no action from the shifter. The jobs will fail quickly and it is up to GDP to treat the task properly.

'Put error'

No information yet

'/opt/lcg/bin/lcg-cr '

The job is not able to copy the file on SE or register the file in LFC. The pilot tries the command twice with file deletion in between (not correct for the moment)

'Transformation not installed in CE '

No documentation yet

'Transfer time out'

The output was not transferred in time. If you manage to find the transfer on the DDM dashboard, report it as any other failing transfer. If you cannot find it on the DDM dashboard, write an eLog with all the information you can get about the issue and then send an email with the link and explanation to the cloud support to check the activity of the FTS channel.

'No PFN found in catalogue for GUID'

No information yet

ATLAS Software Releases problem

  • Jobs may fail if the needed software release is not present or badly installed. The procedure to follow is to notify ATLAS SW managers and ask for re-installation of the release at the site.
  • We will discuss asking the GGUS team to add a special tag when submitting TEAM Tickets to address this specific problem to the SW responsible, but the interim solution to follow is:
    • Open a normal GGUS team ticket and after that set the "type of problem" to "VO Specific Software", select "VO Specific"="yes", provide the involved ATLAS release, site name and CE, and put atlas-grid-install@cernNOSPAMPLEASE.ch in CC

Missing input files on T2_PRODDISK

Missing files

  • For missing files as inputs for MC jobs taking inputs from T2 PRODDISK, please follow instructions at Missing input files on T2_PRODDISK.
  • Otherwise, follow these instructions:
  • Sometimes files look missing from the site but they were actually never registered or the task is misconfigured. Two simple steps that could help to understand what is happening (in case the errors are ambiguous):

In case you have difficulty with the procedure, please contact the expert either at the #ADC_Virtual_Control_Room, or via the ADCoS team ML (be aware: if there is no answer after 15 min in the chat, send an email to the list)

Follow the procedure (slides)

  1. From the panda job page where you found the missing file, click the file name
  2. Then on the page opened, you will find a SURL ( srm://... ) or a list of them
    • If a replica for the site investigated is not listed then it is not a site problem
    • If a replica is listed you should check the file with lcg-ls -l
      lcg-ls -l SURL
    • if it is ONLINE or AVAILABLE, then try to download it using lcg-cp
      lcg-cp --vo atlas <SURL> <LOCALFILE> 
    • If it is accessible, it was a transitory problem
    • If it is inaccessible, it is a site problem (a condensed command-line sketch of this check is given after this procedure)
  3. If it is a site problem
    1. Open GGUS Team ticket to the site and report the files that have been lost.
    2. Notify the cloud contact about the missing files by adding the following address in the cc field of the GGUS ticket
      • atlas-adc-cloud-[CLOUD]@cern.ch [where [CLOUD] stands for CA, CERN, DE, ES, FR, IT, ND, NL, RU, TW, UK, US] (see #ContactingCloudSupport)
  4. In any case (either a site problem or not a site problem)
    1. Open DDM ops Jira ticket to notify data management team about the lost files with
      • the lost files information (SURL, file name and the associated dataset)
      • the panda job information (a link to the panda job and TaskID)
      • a link to the GGUS ticket
      • in Mail Notification Carbon-Copy List add cloud support and task owner
      • Important! If the reported issue is urgent, please explicitly state this in eLog and notify the ADCoS Expert. The ADCoS Expert will then escalate the issue to the AMOD.
    2. When opening DDM ops Jira please follow the pattern
      • Task type Task xxxxx: task status in XX cloud
        eg
        • MC production Task xxxxx: waiting in XX cloud
        • Reprocessing Task yyyyy: input RAW file corrupted at SITE
        • Group production Task zzzzz: waiting for input in ZZ cloud
      • If you have additional information, eg. the input dataset has been deleted, add this information to Jira:
        • Dataset xxxx deleted from SITE
      • It is important for ddm-ops experts to know the category of the problem (task type, dataset deleted) and the location (cloud, site), so that they can start doing some real work without navigating through panda pages to collect this information. The TaskID is necessary for ProdSys.
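
A condensed sketch of the lcg-ls/lcg-cp check from step 2 above, assuming a grid UI with lcg-utils and a valid ATLAS proxy; the SURL below is a placeholder for the one taken from the panda file page:

   # check the file status on the storage; look for ONLINE or AVAILABLE
   lcg-ls -l "srm://some.se.example.org/path/to/file"
   # if the replica is listed, try to actually copy it locally
   lcg-cp --vo atlas "srm://some.se.example.org/path/to/file" /tmp/testcopy
   # copy succeeds -> transitory problem; copy fails -> site problem, open a GGUS Team ticket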

checksum errors during dq2-get/lcg-cp

DDM tools can check file consistency (on Storage) with the LFC/DDM catalogs (filled when the file is registered in DDM and before any replication) on a file-by-file basis. As soon as you have a doubt about a file, follow the procedure:

  1. Check the file consistency:
    • If the file belongs to a tid dataset: run a script to check the consistency of the file on the Storage at the source T1: link
    • If the file does not belong to a tid dataset: use dq2-get, which will copy the file locally, compute the checksum and report any inconsistency (see the sketch after this list)
  2. If the file is not corrupted on the source T1 Storage:
    • Run dq2-get to check the file consistency on the SE used by your application. If the file is correct, go to the next point. If the file is not correct, file a Jira ticket to DDM Ops providing the dataset name and the file name. Somebody with special privileges will do the cleaning (not automatic yet). Consider the file as lost.
    • Check within your application. For example, it is possible that the file was not copied to the scratch disk associated with the CPU because it was full, or the copy timeout occurred before the file was completely copied.
  3. If the file is corrupted at the source T1, it needs to be deleted from the Storage and DDM. File a Jira ticket to DDM Ops providing the dataset name and the file name. Somebody with special privileges will do the cleaning (not automatic yet). Consider the file as lost.
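
A minimal sketch of the dq2-get consistency check described in step 1, assuming a configured DQ2 client and a valid ATLAS proxy; DATASETNAME and FILENAME are placeholders:

   # download a single file; dq2-get recomputes the checksum and reports
   # a mismatch against the value stored in the DDM catalogue
   dq2-get -f FILENAME DATASETNAME
   # list the catalogued files and checksums for comparison
   dq2-ls -f DATASETNAME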

Waiting Jobs Procedure


  • Waiting jobs (definition): Jobs that were not able to locate the input files within 7 hours after submission. See e.g Panda Waiting jobs
  • This situation could be due to three different scenarios:
    • a) Check if there is a problem at task definition level, i.e. wrong dataset, typos, etc. In case you find an inconsistency, it has to be submitted to the ATLAS MC or group production team or the ATLAS validation team through the Jira bug reporting portal, with CC to the task owner, specifying the task ID and the problem. (A good check of the correctness of the defined input files is to query the dq2 CLI for the given dataset, i.e. dq2-ls -r DATASET; if dq2 says there is no dataset with this name, then most likely the dataset name is not correct. See the sketch after this list.)
    • b) Job in waiting state before being assigned to any cloud: this means that the input files cannot be found at ANY site. This should be reported to the DDM ops team through a Jira ticket, specifying the dataset name or input files (if only some files in the dataset cannot be found), together with the task ID(s)
    • c) Job in waiting after being assigned to a cloud: Task has been associated to a cloud so this means that at some point the input file was found in the cloud but after some time disappeared from LFC or is not available on disk. Please follow #Missing_files procedure in this case.
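
A minimal sketch of the dataset check mentioned in a), assuming a configured DQ2 client; DATASETNAME is the input dataset taken from the task definition:

   # list the replicas of the input dataset; an "unknown dataset" answer
   # suggests a typo or a wrong dataset name in the task definition
   dq2-ls -r DATASETNAME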

How to know if the problem is task related or site related?

  • Check to see if any jobs are done (e.g. scouts) by entering the task number in the task field at the bottom of the panda browser. If the scouts have gone through and there are a lot of failures at 1-2 sites, the sites are more suspect than the task.
  • If the same task is found to be failing at several sites, it is probably a task-related problem.
  • If jobs from a task are failing at one single site and running OK at other clusters, it is probably a site-related problem.

Tips for ticketing the right site

Sometimes one site relies on services from another site, for example Tier-2s usually rely on Tier-1 LFCs, so read the error messages carefully to ensure you ticket the correct site - it may not be immediately obvious.

The Information System (BDII) is a special case here too. Since it is a service split over three levels (top bdii, site bdii and resource bdii), it's sometimes challenging to work out where the failure is. Thankfully, GStat 2 makes this task simpler - and they have a convenient guide you can use.

Reprocessing jobs

NEW Group production jobs

  • Group production is now being handled by the new DPD Production Team (contact: atlas-phys-dpd-coordination@cernNOSPAMPLEASE.ch.)
  • Monitoring for group production tasks
  • NEW Twiki with useful info for group production reporting
  • Please report Group production jobs which did not finish within 1 day.
  • A Group production task should run for less than 1 week.
  • Problematic tasks are to be reported to ADCo Support Jira:
    • put "TASK" string and task ID(s) to the Subject
    • put task owner to CC
  • Site issue is to be reported to GGUS.
  • DaTrI requests for Group production datasets - please check at the beginning of your shift.
  • Group contacts
  • More info for Groups: DPDProductionTeam

Group production data transfer requests in DaTrI

  • Please check DaTrI requests at the beginning of your shift.
  • If there is anything odd on the DaTrI requests page, please take an action:
    • In case the issue is missing file(s)
      • Please check the corresponding task which produced the data (see the tid part of the dataset name to determine the task ID; a command-line one-liner for this is sketched after this list).
        • If the task is still running, just ignore the problem for now and check later.
        • If the task is aborted, file DDM ops Jira and put the Group production task owner to CC.
    • In case transfer cannot be performed due to full destination spacetoken, file ADCo Support Jira and CC the Group production task owner, ask him/her to clean-up some space on destination spacetoken or request transfer to somewhere else.
    • If the transfer cannot be performed due to clear site issue, file a GGUS ticket to the site.
    • In any other clearly DDM-related case please file DDM ops Jira and put DaTrI owner to CC.
    • If you are not sure what kind of problems you observe, please ask your fellow ADCoS Expert Shifter in ADC VCR.
  • Group production Jira subject convention
    • Put group name to the subject
    • Put destination spacetoken to the subject
    • Put DaTrI ID to the subject
    • e.g. "PHYS-SUSY DaTrI request 1234 is awaiting_subscription to DESY-HH_PHYS-SUSY"
  • Group production Jira content convention
    • Put pointer to DaTrI request ID.
    • Put pointer to DaTrI request to the ticket.
    • Put status of production task which triggered the subscription to the ticket, put task ID of this production task to the ticket, and put URL to the task status in Panda Monitor.
    • Put output of the following dq2- commands to the ticket:
      dq2-ls -r  DATASETNAME
      
      dq2-list-replica-history  DATASETNAME SPACETOKEN
      
      dq2-list-subscription DATASETNAME
      
      dq2-get-metadata DATASETNAME
      
    • Add Group contacts to Jira watchers. You can use the xwho/phonebook service to find more information on how to contact Group contacts.
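
One possible way to extract the task ID from a tid dataset name on the command line (an illustrative one-liner, not an official tool):

   # prints 280150 for e.g. mc10_valid.105001.pythia_minbias.simul.HITS.e574_s1149_tid280150_00
   echo "$DATASET" | sed -n 's/.*_tid\([0-9]*\).*/\1/p'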

Group production jobs experience/hints

  • When a job fails with "ATH_FAILURE - Athena non-zero exit", please make sure that it has inputs defined first. If the task has no input defined, mention it in Jira. In that case you don't have to check the athena logs further or put a long excerpt of the log into the Jira ticket; the task has to be redefined by the task owner (whom you put in CC of the ticket).

Symptoms of input file missing at T2_PRODDISK

  • On Panda Monitor: e.g. the error SFN not set in LFC, or anything other than ready in the list of input files in the table of files at the top of the page
  • Go to the job page (on Panda Monitor) of one of the failed jobs; you will see the dispatch block dataset. Take the dataset name and see when it was created, e.g. at BEIJING-LCG2_PRODDISK
dq2-list-replica-history $dataset BEIJING-LCG2_PRODDISK
  • Then, from the file name, you will be able to get which job created it, and going into the job page, you will see destination block dataset. See when it was created, and deleted.
dq2-list-replica-history $dataset BEIJING-LCG2_PRODDISK
  • Panda will re-subscribe the deleted dataset before the next job attempt. At the moment, there is nothing the site can do to make the error go away, therefore we do not file a GGUS ticket to the site (see instructions above).
  • Example output of the commands:
% dq2-list-replica-history panda.280663.03.08.GEN.623cd534-e447-4898-b439-c1521cfeb4e6_dis001205962636  BEIJING-LCG2_PRODDISK
------------------------------------------------------------------------
2011-03-08 22:33:29.568403 | /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart | voatlas59.cern.ch
group                     : /atlas/role=production
request type              : lookup
request state             : waiting
location                  : BEIJING-LCG2_PRODDISK
owner                     : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart


% dq2-list-replica-history mc10_valid.105001.pythia_minbias.simul.HITS.e574_s1149_tid280150_00_sub017233989 BEIJING-LCG2_PRODDISK
------------------------------------------------------------------------
2011-03-06 07:28:19.366454 | /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart | voatlas58.cern.ch
group                     : /atlas/role=production
request type              : lookup
request state             : waiting
location                  : BEIJING-LCG2_PRODDISK
owner                     : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart
------------------------------------------------------------------------
2011-03-08 18:16:35.393986 | /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin | voatlas161.cern.ch
group                     : /atlas/role=production
request type              : physical deletion
request state             : queued
location                  : BEIJING-LCG2_PRODDISK
owner                     : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart
------------------------------------------------------------------------
2011-03-08 23:54:11.132004 | /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vgaronne | atlddm02.cern.ch
group                     : /atlas/role=production
request type              : physical deletion
request state             : success
location                  : BEIJING-LCG2_PRODDISK
owner                     : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart

NEW Task in submitted state for long time

DDM

  • DDMGlobalOverview
  • Spot most problematic clouds in DDM dashboard: (begin with those in RED, then YELLOW and then BLUE):
    • Click on the Tier-1 name to get a breakdown for the sites. Chase the site(s) that is causing the low efficiency at the cloud by clicking on the error number (breakdown for errors).
      • Understand if the problem is site-related (DESTINATION error):
        • FTS State [Failed] FTS Retries [3] Reason [DESTINATION error during PREPARATION phase: [CONNECTION] failed to contact on remote SRM...
      • or if the problem is outside of this site (SOURCE error):
        • FTS State [Failed] FTS Retries [3] Reason [SOURCE error during PREPARATION phase: [CONNECTION] failed to contact on remote SRM
      • If the error message is : * SOURCE error during TRANSFER_PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare source file in 180 seconds
          1. click the number on the right. You will see the list of files with FAILED_TRANSFER
          2. click some of the files to see the history of the file transfer
          3. if you see the error is persistent (many errors for more than 1 day), the problem should be reported explicitly mentioning the error is persistent.
          4. otherwise (if only a few errors, or errors within 1 day), no need to report
      • DDM is intrinsically interlinked, as downtime at one site can cause collateral effects on all sites pulling data from it or pushing data to it.
    • For problematic sites check the Services column: DQ/Grid Status, and report in case it is not OK (at this time only DQ is monitored)
  • If you're new to the Team, please check DDMDashboardHowTo
  • Please follow guidelines on what information to fill in DDM-specific ticket. This information is valid for both GGUS tickets to the site, and to DDM ops Jira tickets.

DDM dashboard shows timeout/SRMV2STAGER errors

  • When DDM dashboard shows timeout errors or SRMV2STAGER errors, you should wait and see if the error re-occurs and persists before you submit ticket to a site.
    • The aim of waiting is to make sure that the issue is still there, and to prevent us from sending false alarms to sites.
    • Duration of waiting period is shown on the error message, it can be from tens of minutes to day(s). In any case please file an eLog entry about the timeout/SRMV2STAGER issue, and mention this issue in your daily report.
  • NEW Staging Statistics and Staging Errors logs. Use these 2 pages to get summary information about the staging failures. If the failure rate is too high, please consult with your fellow ADCoS Expert shifter whether a GGUS ticket should be filed.

What to fill into GGUS ticket subject (short description)

  • Site name or spacetoken name
  • Short description of observed issue:
    • If the FTS error transfer contains 'locality is unavailable' : put locality is unavailable into GGUS ticket subject
      • Example ERROR MSG:
        ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [LOCALITY] S
        ource file [srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/atlas/atlasdatadisk/step09/ESD/closed/step09.202010410000
        54L.physics_C.recon.ESD.closed/step09.20201041000054L.physics_C.recon.ESD.closed._lb0002._0001_1286547114]: l
        ocality is UNAVAILABLE]
    • If the FTS error contains 'gridftp_copy_wait: Connection timed out ' : put gridftp_copy_wait: Connection timed out
      • Example ERROR MSG:
        [FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNEC
        TION_ERROR] failed to contact on remote SRM [httpg://grid05.lal.in2p3.fr:8446/srm/managerv2]. Givin' up after
        3 tries]
    • Otherwise: try to describe the problem within up to 4 words; please avoid phrases like "many transfer errors" when a better error description is available in the ERROR MSG.
      • If FTS error message states SOURCE error, put SITE_X cannot export data (SITE_X is name of the SOURCE site)
      • If FTS error message states DESTINATION error, put SITE_X cannot receive data (SITE_X is name of the DESTINATION site)
      • If FTS error message states TRANSFER error, put Transfer issues between SITE_X and SITE_Y

Which Problems to report

  • Tier-1s: no transfer reported at dashboard level for a few hours (cross-check first whether the site is in Downtime).
  • Report DDM errors only if :
    • the source site is problematic (as reported by FTS). Probably the site is down or a file is lost
    • T1/T0 <-> T1/T0
    • T1<->T2 within same cloud
    • T1/T0 <-> T2_PRODDISK (affects production) or T2_GROUPDISK (group datasets are not aggregated at the final destination) (cross cloud or not)
    • Do NOT report issues with T2-T2 transfers.
    • NEW When the ERROR code in the DDM dashboard is [DDM Site Services internal], just report the error following the DDM-specific ticket guidelines
    • If FTS error means that it is a problem at source (pattern SOURCE in FTS error log)
  • Dashboard: if there are no transfers shown in the DDM dashboard, please notify the dashboard team immediately: dashboard-support@cernNOSPAMPLEASE.ch. If you get no response after one hour and the status is the same, contact the AMOD directly.
    • DDM dashboard: Please report to atlas-adc-expert@cernNOSPAMPLEASE.ch and dashboard-support@cernNOSPAMPLEASE.ch. Please wait until the issue is resolved. Please monitor SAM SRMv2 tests in the meantime http://tinyurl.com/ATLAS-SRM-last48 and report the most recent issues to sites. Do not report to atlas-dq2-support at cern.ch.
    • That could be a side problem of various things:
      • Dashboard agents are not working
      • Site-services are not working
      • No data is transferred to the sites
      • Everything fine, but no data at all -very rarely seen, as there is always traffic either in the Tier-0 or at the production dashboard-
  • Please follow guidelines on what information to fill in DDM-specific ticket. This information is valid for both GGUS tickets to the site, and to DDM ops Jira tickets.
  • When a ticket is solved by the site, and the issue disappears from our monitoring tools (no new issue of the same kind occurs within 1 hr of the ticket solution), consider the issue to be solved. When the same issue reoccurs more than 1 hr after the old ticket was solved, please open a new ticket.

Tier-0/Tier-1/Tier-2 Data export

  • Cooperative work with the ADC Point-1 Shift. Taking care of the data export from Tier-0 (Tier-0 - Tier-1s and Tier-0 - Tier-2_CALIBDISK) is the main duty of the Point-1 colleagues, but sometimes Point-1 shifts are not covered, so:
    • Regularly check and spot whether there are stopped or long-lasting subscriptions

  • If latencies are found please
    • Cross-check the problem has not been reported (eLog, virtual control room)
    • If problem hasn't been addressed:
      • Open DDM ops Jira indicating site name, and dataset statistics.

What to do when a site is failing ?

  • Open GGUS team-ticket with relevant information CC'ing the cloud contact
  • Please follow guidelines on what information to fill in DDM-specific ticket. This information is valid for both GGUS tickets to the site, and to DDM ops Jira tickets.
  • Fill ADC eLog

What to do when a site has no FREE disk space in space tokens?

  • In most cases, when there is no free disk space in a particular spacetoken, that spacetoken should be blacklisted for writing automatically

  • If you see error DESTINATION error [NO_SPACE_LEFT]
    DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] at Thu Jun 04 20:11:49 CEST 2009 state Failed : space with id=1209 does not have enough space
    
    • this is not a site problem, but an ATLAS issue in the usage of the given resources; do not send a GGUS ticket to the site
    • If there is indeed no space left and spacetoken is not blacklisted submit an ADC Site Status Jira with the cloud support in CC so that they can take an action (increase the space, reduce the share, etc...)
    • EXCEPTIONS
      • If you see the error [NO_SPACE_LEFT] for DATATAPE or MCTAPE, it is a site issue. Send email to atlas-adc-expert(at)cern.ch.

  • The following error messages mean that the log area for the FTS server is full. In this case, submit a GGUS ticket to the site which hosts the FTS server
    [FTS] FTS State [Failed] FTS Retries [1] Reason [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] 
    cannot create archive repository: No space left on device]
    
    or
    Reason [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR]
    error creating file for memmap /var/tmp/glite-url-copy-edguser/BNL-NDGF__2010-01-16-0659_m91sgu.mem: No space
    left on device]
    
    or
    [FTS] FTS State [Failed] FTS Retries [1] Reason [TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus
    _ftp_client: the server responded with an error 500 Command failed. : write error: No space left on device] 
    

The DDM endpoints *_LOCALGROUPDISK are managed neither centrally nor by the site admins. If a DDM endpoint is full, DDM automatically blacklists the site as a destination. The ADCoS shifter should submit a Jira ticket within ADC Site Status for information. The cloud squad should acknowledge and close the ticket. The squad is responsible for informing the local users.

The actions are similar for ATLASGROUPDISK (DDM endpoints mainly called PERF-* or PHYS-*). The Jira ticket should be submitted to ADCo Support Jira and assigned to Group Production and the DPD contact person should be put in CC (list in DPDProductionTeam#Group_DPD_contact_persons). The ticket should be acknowledged by the Group production responsible and closed.

For the other space tokens, an automatic cleaning algorithm is defined and running. If the site is full, it means that the cleaning procedure is not perfect. The cleaning monitoring can be found at

QUEUE MANAGEMENT FOR FULL SPACETOKENS:

  • If SCRATCHDISK (for EGEE sites) or GROUPDISK (for US sites) is (almost) full (analysis jobs will not be able to write output)
    • set ANALY queue 'offline'
    • do not change settings for production queues
    • If the SCRATCHDISK/GROUPDISK spacetoken has less than 200 GB of remaining space, ANALY jobs are not brokered to that ANALY site.

  • If PRODDISK is (almost) full (production jobs will not be able to get their input from T1 or write their output)
    • set production queues 'offline'
    • do not change setting for ANALY queue(s)
  • If other spacetokens (DATADISK,MCDISK etc) are (almost) full (Data replication will fail)
    • Do not change settings for any queue of the affected site

  • General procedure:
    • change queue(s) status if applicable. HOWTO change queue status? List of queues.
    • fill eLog saying that queue has been set to 'offline' (specify its name)
    • fill an ADC Site Status Jira ticket and put there a pointer to the eLog entry you have just made, so that afterwards the AMOD will be able to reply to the eLog
    • update eLog with reference to adc-site-status Jira

  • More detailed information about the whole procedure (involving AMODs) is at ADCOpsSiteExclusion.

What to do when subscriptions are not processing?

  • If Site Services are under suspicion, follow: Central Services procedure
  • If the problem is related to data loss or catalogue inconsistencies:
    • Place as DDM ops Jira bug
      • Report the error message and associated link.
      • Report the dataset not transferred and when it was done.
      • Please follow guidelines on what information to fill in DDM-specific ticket. This information is valid for both GGUS tickets to the site, and to DDM ops Jira tickets.
  • Hardware status

Checking blacklisted sites in DDM

  • When a site is heavily failing and exceeding a certain error threshold (#errors/time), the site is removed from site services so no further transfer requests happen for the site. This is done by ADCoS Expert Shifters for T2 and T3 sites, and by the AMOD for T0 and T1 sites. There could be several reasons for doing this: long scheduled/unscheduled downtime, persistent storage problems, FTS issues, etc.
    • Sites on downtime (GOCDB/AGIS) are excluded automatically.
    • NEW Sites failing SAAB nagios tests are excluded automatically. SAAB blacklistings can be monitored here. Currently only put tests are active, so in case of a site problem only the write/upload (w/u) part is blacklisted by SAAB. If the site also has a problem as a source or with deletions (r/f/d), the shifter must treat them as a regular transfer/deletion failure case. If the site was blacklisted by SAAB, then the failing test issue should be followed up in a GGUS ticket to the site. See more information in the SAAB TWiki.
    • Sites which are not excluded automatically have to be excluded manually.
  • Shifters should check at the end of the shift whether the blacklisted sites are still in trouble or have solved their problems so the site can be set online again; this has to be done mainly following these steps:
  • Instructions for ADCoS Expert shifters to exclude/re-include spacetoken from DDM: ADCoSExpert#DDM_spacetoken_exclusion

Checking transfers to CERN-PROD_PHYS-GENER

  • At the beginning of your shift please check transfers to CERN-PROD_PHYS-GENER spacetoken here. If transfers fail because of missing input file, submit DDM Ops Jira ticket. In other cases treat them as transfer failures on any other site.

Checking the deletion backlog

  • Go to page http://bourricot.cern.ch/dq2/deletion/#period=1 , scroll down to the CLOUDS table.
  • Click on name of a particular cloud, list of sites in this cloud appears.
  • Click on name of a particular site, list of spacetokens in this site appears.
  • Click on number of Queued deletions in the PRODDISK row. You will be forwarded to a new page.
  • On the new page, check that column "Catalog Cleanup" contains only Y.
    • If column "Catalog Cleanup" contains only Y and column "Storage Cleanup" contains Y/N, do not file GGUS ticket to the site. Do not file DDM ops Jira.
    • If column "Catalog Cleanup" contains also N, please check LFC backlog on other sites in the cloud. Do not file GGUS ticket to the site. Do not file DDM ops Jira. Create an eLog.
  • NEW If the first 2 columns ( To Delete and Waiting ) show more than a few hundred, both numbers are close together, and the column Files Deleted is < 1k, one should:
    • Check manually if the site is down
    • For CERN, do not care about CERN-PROD_DAQ and CERN-PROD_TZERO (Deletion should not appear)
    • Do not care about _GRIDFTP sites which are not cleaned centrally (although users can issue the deletion command and create central deletion request)
    • Check, within a DDM site, if the deletion request was issued too recently (look at 'Creation date' and 'Grace period')
    • Report the issue via DDM Ops Jira

Checking the deletion error rate per site

  • Go to page http://bourricot.cern.ch/dq2/deletion/#period=4, scroll down to the CLOUDS table.
  • For each cloud in the table
    • click on name of that cloud (to make list of sites appear)
    • if a site has more than 100 errors over the last 4 hours:
      • Check if the error rate is constant over these 4 hours (click on the name of the cloud and site on the top-left part of the page and look at the error rate plot).
    • Report to ADCoS expert who will check if it is worthwhile to contact the site and fill GGUS ticket if necessary
      • file a GGUS ticket to that site with CC to the corresponding cloud support
      • file an elog
        • reference created GGUS ticket in that eLog

Panda queues

If a site or a queue at a site is in downtime or is heavily failing, the site should be set offline so that jobs are no longer directed to it until the problem is solved. Before putting the site online again it should be tested with the procedure described below; if the tests are OK it can be set online again.

Controlling Panda Queues

  • Secure protocol is now required to change queue status, so you will need a proxy certificate
  • If you are on a site where AtlasLocalRootBase setup is available (for example on lxplus) run that first.
    • export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
    • source $ATLAS_LOCAL_ROOT_BASE/user/atlasLocalSetup.sh
    • localSetupEmi
    • voms-proxy-init -voms atlas
  • Then run the command setqueuestatus.
    • The command defaults to values valid for an EGI UI and should work on most sites in Europe even without the setup above.
    • It works both for analysis and production queues
    • Without arguments the command gives the help with examples
    • With arguments: setqueuestatus <status> <queue> <comment>
      • <status> can be: setoffline/settest/setonline/setmanual/setauto
      • <queue> is the Auto-Pilot queue name
      • <comment> should be a string that is meaningful either for the shifters or for the switcher; if there is an eLog or a GGUS ticket, include it in the comment (a full example sequence is sketched after this list).
        • If you include the elog number in the command (for other than setonline!), do it in this format: &comment=elog.123456 and Pandamon will make a link to the eLog.
      • If setting a queue online, LEAVE THE COMMENT FIELD BLANK (empty string). No eLog number.
  • Only queues which are configured for 'manual' control can be set offline/test/online via the URL; the rest are auto-set by the Auto-Pilot DB loader. The setmanual command will set any queue to manual mode so it can subsequently be controlled by URL (and the Auto-Pilot loader program pilotController.py will leave its status alone). setauto will remove manual control.
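
A sketch of how the full sequence might look on lxplus, assuming AtlasLocalRootBase is available under the path quoted above and that setqueuestatus is found in the PATH after localSetupEmi; the queue name and eLog number are placeholders:

   export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
   source $ATLAS_LOCAL_ROOT_BASE/user/atlasLocalSetup.sh
   localSetupEmi
   voms-proxy-init -voms atlas
   # no arguments: print the built-in help and examples
   setqueuestatus
   # take a queue out of production, referencing the related eLog entry
   setqueuestatus setoffline SOME_QUEUE elog.123456
   # put it back online: comment left as an empty string
   setqueuestatus setonline SOME_QUEUE ""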

Cloud Control

  • This should only be done by ADC experts and requires a certificate with /atlas/Role=production credentials (VOMS proxy with production role creation howto; a minimal proxy-creation sketch follows this list). Currently, any atlas/xx/Role=production can manipulate a cloud.
    • Simply replace queue=NAME above with cloud=CLOUD. The setonline, setoffline, settest and setbrokeroff commands are valid.
      • brokeroff prevents new tasks from being assigned to the cloud (and lets the already assigned jobs run) and should be done in advance of extended T1 downtimes where cloud services stay up (LFC, FTS). No effect on analysis brokering.
    • When there is a T1 downtime and this downtime should last longer than 2 hours, please set the whole cloud to offline (no new tasks are assigned, activated jobs do not start running).
      • Please also set multicloud sites offline, so that they cannot run jobs on behalf of another cloud when their T1 is down.
        • NEW To get list of multicloud sites go to Sites page of Panda Monitor, and click on line with "production multicloud sites". Check each of them, look for abbreviated name of the affected cloud.
    • Follow the rest of the procedure as normal - eLog and ADC Site Status Jira.
    • Make sure to notify : atlas-adc-cloud-<CLOUD>@cern.ch AND CC the GDP Team (atlas-adc-gdp@cernNOSPAMPLEASE.ch).
  • Feature request: manipulate all ANALY queues in cloud with one command
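
A minimal sketch of creating the production-role proxy needed for cloud control (the full procedure is in the howto linked above):

   # request the production role from the atlas VO
   voms-proxy-init -voms atlas:/atlas/Role=production
   # verify that the role is attached to the proxy
   voms-proxy-info --fqan | grep Role=production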

HOWTO change queue status?

  • Do not manipulate T1 queues unless the ADCoS Expert Shifter, the ADC Manager On Duty, or Cloud Support explicitly asks you to!
    • List of T1s sites is available here in the table on the top of the page.
  • Have a look at the Controlling Panda Queues section for more details
  • The 'Clouds' link at the top of the monitor now gives a page that can be used to track all this. It shows the queues by cloud and site, with their Auto-Pilot and Panda names, and the status.

  • If the site or queue was set "offline", then before bringing this site/queue back "online" a shifter should do the following:
    • set the site/queue in "test" mode with the corresponding "settest" curl command and &comment=HC.Test.Me, and the system will pick it up and set it online after the required number of successful tests.
    • check whether these jobs completed OK (if not - submit corresponding RT or GGUS ticket)

Cloud and queue status

  • Cloud offline (Only relevant for cloud still hosting LFC): no additional jobs are submitted to the cloud. Already activated and running jobs keep going.
    • LCG autopyfactories stop submitting pilots. US and ND clouds might behave differently.
    • Activated jobs will keep going into running until pilots already in the CE queues are exhausted.
  • Cloud brokeroff: no additional tasks assigned to the cloud. Current task will keep going until completed.
    • To be implemented in case of long T1 downtime (> 1 day) or FTS downtime (> 1 day)
    • ANALY queues unaffected by the cloud status and need to be handled separately.
    • ANALY shouldn't be handled separately anymore because
      1. The (autopy)factories will stop submitting pilots when the cloud is offline, so there will be no pilots.
      2. The queue data status goes to offline when the cloud is offline, so any pilots which do arrive abort.
  • Queue offline: no additional pilots are sent to the site. Running jobs keep going.
    • Pilots see the offline status from schedconfig, and will not pull a job, so queued pilots will not execute activated jobs.
  • Queue brokeroff: running and activated jobs will complete but no additional jobs will be activated.

How to send test jobs

Test jobs are now sent automatically through PFT. If the site was offline, the shifter has to put the site in test mode. See how to do it in the Controlling Panda Queues section.

Automatic change of panda queues

SiteStatusBoardInstructionForShifter

Analysis Queues (AFT)

The analysis queues are tested through HC jobs submitted permanently to the panda queue. The automatic blacklisting for downtime is not implemented yet. To implement it

Production queue (PFT)

The procedure is similar to AFT with a different site name

How to track site status?

  • A Jira project is available for handling site status. eLog works fine as a logbook but it is complicated to track the status of a site after it has been set offline/brokeroff, etc. For that reason we agreed to create a new Jira project for handling interventions at the sites: ADC Site Status Jira
  • Jira has the capability of closing issues once the intervention is finished, hence allowing good tracking.

  • The procedure for shifters would be the following:
    • Shifter notices a failing site
    • Shifter sends a GGUS ticket to the site to report the problem
    • Shifter fills the eLog
    • Shifter fills an ADC Site Status Jira ticket and adds the cloud squad in CC (selectable through the "Assigned To:" list)
    • Site answers back
    • Shifter submits test jobs to the site. If the test jobs finish successfully, the shifter will set the queue online and the AMOD will close the Jira.

  • Experts may check on a daily basis the status of open issues for sites in ADC Site Status Jira

  • Note that this Jira project is shared with Point-1 shifters and holds entries for DDM status.


Central services

Central services (hosted at CERN) to be monitored https://sls.cern.ch/sls/service.php?id=ADC_CS (need NICE login)

CERN databases https://cern.ch/atlas-service-dbmonitor/shifter/

  • Please, during the shutdown period, do not report problems with online database processes (like PVSS2COOL); they could be off at any time. This concerns both the online and offline DB.
  • For the CERN databases, the escalation procedure when there are red spots on the web page is to notify atlas-adc-expert@cernNOSPAMPLEASE.ch and atlas-dba@cernNOSPAMPLEASE.ch; they will check elogs and mails on a daily basis and take over the issues raised by shifters. Never send email to physdb.support; this is an action item for atlas-adc-expert or atlas-dba.

Monitoring data transfer from T0 to T1s

  • This activity is done by Comp@P1 shifters during data taking. When there is no Comp@P1 shifter, ADCoS shifters take care of it.
  • Monitor the Tier 0 dashboard: the "T0 export" view on DDM dashboard
  • Problems at a Site seen on the Tier 0 dashboard
  • Problems with the CERN infrastructure or the Central Services

Frontier

The Frontier service provides access to the conditions data stored in the 3D databases which is streamed from CERN to several Tier1 sites. Conditions data accessed from Frontier is primarily used in user analysis jobs. Because conditions data changes relatively slowly a lot of requests are the same and so a series of squid caches have been set up to reduce the load on the Oracle databases. When a job requires conditions data it will first try and get it from a local site squid. If the required data is not in the squid, the squid should connect to the designated Frontier server which will connect to the Oracle database if it doesn't have the data cached already. The system is setup so that if a site squid or Frontier server fails then the request will try other Frontier / squid combinations in order to get their data. Problems with a site squid or Frontier server should therefore not cause jobs to fail, although this will cause additional load elsewhere. If this is allowed to build up the whole service could eventually fail.

Periodically (2-3 times per shift) check: http://sls.cern.ch/sls/service.php?id=ATLAS-Frontier

If this is not at 100% for any of the sites for more than an hour check:

If the site is not in a downtime and it is down in both SLS and MRTG then submit an urgent GGUS ticket to the site and cc in atlas-frontier-support@cernNOSPAMPLEASE.ch. If in doubt email atlas-frontier-support and copy in the expert shifter.

Once per shift check: http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Frontier_Squid&highlight=false

If the site is red, click on the link. This will take you to the MRTG monitoring page which will show you when the squid stopped working. Check if the site is in downtime and if it isn't and the squid has not been responding for more than 4 hours (no "had been up for" line) submit a less urgent GGUS ticket to the site and cc in atlas-frontier-support@cernNOSPAMPLEASE.ch. Exception sites are mentioned on Known Problem page.

Miscellanea

Contacting the Cloud Contact Experts

  • Carbon Copy always the Tier-1 expert list when submitting a GGUS
  • Carbon Copy when an action is performed affecting the sites inside the cloud

NEW Contacting the ticket portal support

In case GGUS or Jira tracker pages are not available, you should try to
  • clear the cache of your browser and try again
  • ask other shifters or colleagues whether they are affected by this problem or it is only you
  • if you cannot find a cause of the problem on your side, send an email to the portal's support

Pilot Factories and methods

How to get pilot jobs to the site when no pilot job is running?

  • Set one of the production queues corresponding to that site to brokeroff for 30-60 minutes.
  • When site gets appropriate number of pilot jobs, change queue status to the one before brokeroff (e.g. 'test', 'online', ...)
  • When setting site to brokeroff does not help, check pilots on pilot factory monitoring, and contact the contact persons.

Communication and Organization

Elog Management

Choosing the right criticality in eLog

  • 1) top priority: data export from CERN, problems at the Tier-0, problems with the central services, central catalog, etc., LFC at CERN
  • 2) very urgent: problems at the Tier-1s, like not accepting data from the Tier-0, LFC or FTS down
  • 3) urgent: problems that affect the cloud
  • 4) less urgent: others

Replying to eLog entry

  • When replying to an eLog entry, modify the subject of the eLog entry, for example:
    • If you are updating information on a problem which has already been reported, put [update] in the subject
    • If the problem is solved, please put [SOLVED] in the subject
    • If the subject says that queues were set OFFLINE and you are setting them to TEST/ONLINE, reflect this in the subject: queues were set in TEST mode/ONLINE
    • If you can see that the eLog subject does not briefly describe the problem (for example the site name is missing from the subject), please modify the eLog subject (add a site name, if appropriate)

Ticket Management

Site naming convention - exceptions

General Rules

  1. Check GGUS Atlas tickets (see How to find tickets section)
    • Now shifters can follow all the TEAM tickets on the GGUS interface
  2. DO NOT open duplicate tickets.
    • If a ticket is already open about the SAME problem follow up on that ticket.
  3. Open only TEAM tickets so that other shifters can find them and follow up
  4. When you open a ticket: Write in the ADCoS eLog the reason it was opened and put a link to the opened GGUS ticket
    • Tickets for the US can now be opened in GGUS so they can also be treated in the same way.
    • Some T3 sites cannot be found in the list of available sites. Please use the TPM option instead of direct routing to a site in this case. It is very important to put the site name in the description line (for example, SITET3: transfers are failing because certificate has expired).
  5. When you close a ticket: Write the solution in the ADCoS eLog
  6. Write everything that happens in between in the ticket. The ticket is now the reference for what happens between opening and closing.
    • Only the reason the ticket was opened, the link to the ticket, and the solution when the ticket is closed should go in the ADCoS eLog.
  7. When updating the ticket do not change the ticket status to "waiting for reply"; this status is reserved for sites. That way, when a shifter checks for tickets which need to be updated, the "waiting for reply" tickets are easy to spot.
  8. Don't open tickets for sites in downtime UNLESS you are putting them offline in Panda. In this case follow: How to change a site status and tickets section.
  9. Do not try to re-open an ALARM ticket. If an ALARM ticket is solved, the problem has re-appeared, and you cannot contact the AMOD, open a new TEAM ticket.

How to Submit GGUS Team-Tickets (direct routing to sites)

  • Notice that when you open the submit-new-ticket interface, a label appears at the top: Open TEAM ticket
  • Clicking it takes you to the interface for our special tickets, which are routed directly to the site
    • Set the ticket priority:
      1. top priority : Problems at CERN services (affecting exports to every site) should be marked as "top priority". This includes LFC at CERN, FTS at CERN, SRM(CASTOR) at CERN
      2. very urgent : Problem at services at Tier-1s (affecting exports to the given Tier-1 and within the Tier-1 cloud) or services at calibration Tier-2s should be marked as "very urgent". This includes LFC at Tier-1s, FTS at Tier-1s, SRM at Tier-1s, SRM at Calibration Tier-2s.
      3. urgent : Any other problem should be marked as "urgent"
      4. less urgent : Informational entries should be marked as "less urgent"
    • Select type of the problem
    • Select MoU Area
    • Select site affected
    • Put cloud support in the CC field - atlas-adc-cloud-[CLOUD]@cern.ch [where [CLOUD] stands for CA, CERN, DE, ES, FR, IT, ND, NL, RU, TW, UK, US] (see #ContactingCloudSupport and the sketch after this list)
  • (Screenshot of the TEAM ticket submission form: ggus_team_2011-11-18.png)
  • All people from ADCoS team with the correct certificate permissions in GGUS machinery can track and follow the tickets opened by anyone in our team.
    • Please notify GGUS support in case you find problems accessing it.

  • NEW EXCEPTION : South African sites (ZA-*) are not registered in GGUS. Submit the ticket to ROC NGI_ZA (status as of 14 May 2012; this should be solved in the coming weeks).
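
A tiny sketch for building the cloud-support CC address mentioned in the list above (the cloud codes are the ones listed there; nothing else is assumed):

  # Build the cloud-support CC address for a GGUS TEAM ticket.
  CLOUDS = ["CA", "CERN", "DE", "ES", "FR", "IT", "ND", "NL", "RU", "TW", "UK", "US"]

  def cloud_support_cc(cloud):
      if cloud not in CLOUDS:
          raise ValueError("unknown cloud: %s" % cloud)
      return "atlas-adc-cloud-%s@cern.ch" % cloud

  print(cloud_support_cc("FR"))   # -> atlas-adc-cloud-FR@cern.ch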

How to submit Jira bug about panda monitoring problems

How to find tickets in GGUS

How to change a site status and tickets

When a site is put into the offline/test/online states in ATLAS (see the ATLAS queues section), it must be communicated to the site. As per point 6 of the General Rules section above, site problems will now be followed up using the tickets.

  1. When you put a site offline, open a ticket
  2. When you change the status to test, put the information in the ticket
  3. When you put a site online, write the information in the ticket and close it
  4. These are normal tickets, so when you open and close them the information should also go in the eLog, as said in points 4 and 5 of the General Rules section.

Example of a site offline/test/online cycle(s):

  • A shifter puts a site offline and writes it in the ticket
  • The site fixes the problem and writes in the ticket "we think it's solved"
  • The next shifter goes to the list of team tickets, parses the opened tickets, sees that the site has fixed the problem, puts the site in the test state, and writes it in the ticket
  • If the site doesn't pass the tests, it goes back to offline, this is again written in the ticket, and the cycle restarts
  • If the site passes the tests, it is put back online. The shifter who puts the site back online writes this in the ticket and closes it; this is the end of the cycle.

Shift Reports and tickets

To help expert shifters compile the weekly report and also help your fellow shifters get oriented when they start their shifts:

  • Write the numbers of the tickets you have opened and closed in the shifter report.
    • There is now a dedicated field for it.

Ticket format

GGUS

  • Provide a meaningful subject (non-ATLAS people should understand it), including the site name.
  • Provide time information: when did the failure(s) start to happen?
  • Provide an extract of the error message. The shifter should understand the error before reporting it and translate it from "ATLAS" language into general language.
    • When possible, provide detailed info on the failed command (if the failure is reproducible)
  • Provide a link to the log file(s) (panda/production dashboard for MC or DDM dashboard for data distribution)
  • Approximate number of failures (related to the problem reported):
    • Last 12h for Monte Carlo (default Panda monitoring view)
    • Last 4h/24h for Data distribution (possible views in DDM dashboard)
  • Monte Carlo Specific:
    • Node(s) affected
      • Sometimes worker nodes act as black holes. A high number of failures related to the same processing host could be evidence of this.
    • Provide the local batch system job ID
    • When providing a link to Panda monitoring, please provide a link to one particular failing job; the "last12h" aggregation link might be invalid by the time the site tries to address the ticket (a ticket-body sketch follows this list)
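
As a memory aid, the checklist above can be strung together into a ticket description; the sketch below does that with made-up field names and an invented example problem, purely for illustration:

  # Sketch: assemble a GGUS ticket description covering the points above.
  TICKET_TEMPLATE = """\
  {site}: {short_summary}

  Failures started: {start_time} (UTC)
  Error message (extract): {error_extract}
  Plain-language meaning: {translation}
  Log / monitoring link: {log_link}
  Approximate number of failures: {n_failures} in the last {window}
  """

  def ggus_description(**fields):
      return TICKET_TEMPLATE.format(**fields)

  print(ggus_description(
      site="SOMESITE",   # hypothetical site name
      short_summary="production jobs failing at stage-out",
      start_time="2014-06-30 02:00",
      error_extract="Error in copying the file from job workdir to local SE",
      translation="output files cannot be written to the site storage element",
      log_link="link to one particular failing job in Panda monitoring",
      n_failures=250, window="12h",
  ))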

Production or Validation Jira Bug Reporting

  • Task:
    • Task ID
    • Task name
    • Task Progress (Done, ToBeDone, Running, Pending)
    • Task efficiency
    • Task details (release, trf_version, DBrelease)
  • Errors:
    • Error summary. The content of the job log file is accessible from its panda monitoring page (and the panda page is linked from the dashboard). Try the Find and view log files link. If it doesn't work, click on the job log file name (in the table above, file type log). At the bottom of the new page you will find the SURL(s) for the log file, and you can download them directly in a shell.
    • Link to the Log files (Panda/ProdSys dashboard)
  • Info flow:
    • Remember to CC the task owner.
    • Remember to make an eLog entry.
NEW Note! The ATLAS Distributed Computing groups have moved away from Savannah and are now using the Jira Issue Tracking Service for bug reporting. Jira is quite easy and intuitive to use. In Jira, to see the list of issues (tickets), click on "Issues" in the left side menu. On the "Issues" view one can select "All Issues" or only the ones belonging to a certain category (Unresolved, Added recently, Resolved recently, etc.). To open a new ticket, click on the "Create Issue" button in the upper menu and fill in the form. In Jira there is no direct CC option, but the same can be done by adding "Watchers". To do so, find the label called "Watchers:" in the right side menu while inside a particular issue (ticket), and click on the colored circle with a number inside (the number indicates how many watchers that issue already has). You will be prompted to add a watcher: simply start typing the first letters of the name of the person you want to CC (the task owner, for example). Jira will open a matching list from which you can select the desired name, add that person as a "Watcher", and send an email notification every time the ticket is updated.
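
If you prefer the command line, adding a watcher can in principle also be done through the standard Jira REST API; a hedged sketch follows (the base URL, issue key and credentials are placeholders, and your Jira instance may require a different authentication method such as SSO):

  # Hedged sketch: add a watcher (the Jira equivalent of CC) via the
  # standard Jira REST API v2. All concrete values are placeholders.
  import json
  import requests   # assumes the python-requests package is available

  JIRA_BASE = "https://your-jira-instance.example.org"   # placeholder
  ISSUE_KEY = "PROJ-123"                                  # placeholder
  WATCHER   = "taskowner"                                 # placeholder username

  resp = requests.post(
      "%s/rest/api/2/issue/%s/watchers" % (JIRA_BASE, ISSUE_KEY),
      data=json.dumps(WATCHER),   # body is just the username as a JSON string
      headers={"Content-Type": "application/json"},
      auth=("myuser", "mypassword"),   # placeholder credentials
  )
  print(resp.status_code)   # 204 indicates success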

Downtimes

  • Check the AGIS prod calendar for downtimes. In case the prod version is not working, please use the AGIS dev calendar. As this is a development version it might be unreliable by nature, so no ticket should be submitted about its unavailability.
  • Check all ongoing entries for the site in question. Also pay attention to downtimes marked as NO_RISK_FOR_ATLAS.

Tier-1s Downtime

  • When there is a Tier-1 downtime that is expected to last longer than 2 hours, please set the whole cloud to offline (this can be done only by experts with a Production role proxy).

Monitoring tools

Daily SHIFT report

  • Submit your daily shift summary report using the interface located at: Shift report elog form. This triggers an automated shift report that is sent to the ADCoS mailing list; a copy is also stored in the elog: Shift summaries elog

Trainee evaluation report

  • If a trainee shifter participated in the shift, send an e-mail to: 1) the ADCoS coordinators: atlas-adc-adcos-coordinators@cern.ch, as well as 2) the current ADCoS Expert (the name of the current Expert shifter can be found from the query (PDF) at the top of the Checklist). The e-mail subject should be of the form "Trainee evaluation of ShifterName (shift Number), date timezone", where an example of the date and timezone is "10/10/2012 EU", and the total number of trainee shifts taken so far should be reported, e.g. "(shift 3)".
  • Please, report the following (use copy/paste) :
    • Active presence: how present and how proactive the shifter is,
    • Monitoring tools understanding,
    • Errors understanding (at least as far as those explained in the twiki),
    • Ticket handling (learning how not to open duplicates and not to write a single error line... etc etc).
    • Evaluation grade: 0-3 range (1: new shifter, still learning; 2: quite experienced, but not yet ready for promotion; 3: ready to be promoted to senior shifter; 0: very quiet shift, not enough information to evaluate) NEW

ADC Virtual Control Room

Troubleshooting

  • Currently, it looks like Gmail jabber accounts behave strangely when you join the chat: you will see a very old chat log, but not the most recent one, when you log in. In the meantime, please use a jabber account other than Gmail, e.g. try out your jabber account at CERN (jabber.cern.ch).
  • Jabber server jabbim.* is a privately-held jabber server. If it stops working there is nothing ADC experts can do about it. In that case please check jabbim.* servers monitoring and wait for the servers to get back. If getting jabbim back takes too long, please use your CERN jabber account.
  • If you are disconnected with a "Conflict" error, please reconnect again with a different nickname (handle).
  • If you are disconnected during the jabbim.com server downtime, please use your CERN jabber account to reconnect.

GGUS ATLAS TEAM membership

  • If for some reason your grid certificate is not yet in the /atlas/team VOMS group, ask the ADCoS Coordinators to add it. In particular, don't forget to do that when you get a completely new certificate. GGUS receives this list on a daily basis and updates the membership accordingly.

DQ2 nicknames

  • Starting Monday May 3rd 2010, DQ2 will restrict the creation of datasets to those starting with a valid project (e.g. user, user10, group, group10, etc.). The complete list of valid projects is available at AMI ReferenceTables. An attempted creation of a dataset with an unknown project will fail (see the sketch after this list).
  • If you do not have a nickname attribute, you need to get one by visiting the VOMRS page. From that site, use the "Edit Personal Info" page to add or modify your nickname.
  • more information: elog:12103 or ATLAS DDM news
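
A sketch of the project check described in the first bullet above (only the example projects from the text are listed; the authoritative list is in the AMI ReferenceTables):

  # Check whether a dataset name starts with a valid project, as DQ2 requires.
  EXAMPLE_VALID_PROJECTS = {"user", "user10", "group", "group10"}   # examples only

  def has_valid_project(dataset_name):
      project = dataset_name.split(".", 1)[0]   # project is the first dot-separated field
      return project in EXAMPLE_VALID_PROJECTS

  print(has_valid_project("user10.JohnDoe.test.dataset"))   # True
  print(has_valid_project("mydata.JohnDoe.dataset"))        # False -> creation would fail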

Site exclusion/blacklisting metrics

Proposed mechanism

  • Shifters spot a potential problem and evaluate whether the impact is severe enough to exclude the site either from Data Processing or from Data Transfers according to the metrics described below. The evaluation of site exclusion will be done by ADC experts, so once shifters have filed the GGUS ticket and pre-evaluated whether a site is a candidate for exclusion using the site exclusion metrics, they contact the experts, eLog the issue, and send mail to the adc-experts list.

  • Experts will evaluate and determine the action to take. This is necessary only in case AFT/PFT are not sufficient.

  • Exclusion/blacklisting of a site has to be communicated to the site/cloud people, making use of the ATLAS cloud-squads: once exclusion/blacklisting is done, it is immediately communicated to the relevant cloud-squad via a Jira ticket using a fixed subject (SITE EXCLUSION: SITE_NAME_ACTIVITY_[DD/MM/YYYY]; see the sketch at the end of this list), mentioning the reason for the action and adding the reference to the GGUS ticket. In parallel an entry is placed in the ATLAS-blacklisted-sites Google Calendar.

  • Once site is blacklisted it is automatically excluded from ATLAS Test activities.

  • Once a site has been notified about the problem and the blacklisting, it is up to them to react. No further actions will be taken by ADC operations, although a list of blacklisted sites will be provided during the ADC operations weekly meeting.

  • It is also the responsibility of the ATLAS cloud-squad to certify the solution of the problem: a simple certification, which is to succeed in the SAM tests (WLCG and ATLAS).

  • Once the cloud squad has verified this, they interact with ATLAS operations by responding to the original Jira ticket. ATLAS operations will then re-introduce the site in the DDM Functional Tests. Production Functional Tests and Analysis Functional Tests are restarted automatically.

  • After the site is validated inside the Test activities (see site recovery metrics) the site is re-introduced in Distributed Computing activities and the Jira cloud-squad ticket is closed.
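
A tiny sketch for the fixed Jira subject mentioned earlier in this list (SITE EXCLUSION: SITE_NAME_ACTIVITY_[DD/MM/YYYY]); the site and activity names are hypothetical:

  # Build the fixed subject for the cloud-squad Jira ticket.
  import datetime

  def exclusion_subject(site, activity, day=None):
      day = day or datetime.date.today()
      return "SITE EXCLUSION: %s_%s_[%s]" % (site, activity, day.strftime("%d/%m/%Y"))

  print(exclusion_subject("SOMESITE", "DATATRANSFERS", datetime.date(2014, 6, 30)))
  # -> SITE EXCLUSION: SOMESITE_DATATRANSFERS_[30/06/2014]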

Useful links

Tutorials

Shift Credits

  • Before coming to your shift, make sure that you fulfill shift requirements!
  • ADCoS is a Class 2 shift. All ADCoS shifts have the same weight within ATLAS OTP.
  • NEW Each shifter is required to take at least 6 shifts every 4 months!!!
  • We have 3 flavours of shifters: ADCoS Expert shifters, ADCoS Senior shifters, ADCoS Trainee shifters.
    • Senior shifter: 8-hour shifts; 2-day blocks (Mon+Tue, Wed+Thu) and Friday are credited with 78% (scaled from 100%), the 2-day block (Sat+Sun) gets a bonus credit of 155% (scaled from 100%). A small worked example follows this list.
      • No upper limit on the number of shifts. Please take at least a few shifts a month. The ADCoS training should be repeated if no shifts were taken within a year.
    • Trainee shifter: 8-hour shifts; shift slots available Mon-Sat, 0% shift credit (scaled from 100%), no Sunday shift.
      • Please book your Trainee shift slot only if it is "red" on the OTP calendar. Please do not over-allocate shift slots.
      • The trainee period takes about 10 shifts; however, the final number of Trainee shifts strictly depends on the Trainee shifter's performance and can be significantly lower or higher than 10.
      • There is a time limit of 3 months for trainee shifters to finish training. If a trainee shifter has not taken any shift within the last month, this shifter will be automatically excluded from the list. Each shift will be evaluated by a Senior shifter.
      • After promotion to Senior it is required to take the first Senior shift as soon as possible (within a month).
    • Expert shifter: 9-hour shifts, 7-day block Wed-Tue, credited with 100%, no weekend bonus.
  • We provide 24/7 operations shifts in three timezones (defined in CERN time):
    • 00:00 - 08:00 ASIA/PACIFIC (AP) - Shift Captain: Hiroshi Sakamoto
    • 08:00 - 16:00 EUROPE (EU) - Shift Captain: Alexei Sedov
    • 16:00 - 24:00 AMERICAS (US) - Shift Captain: Armen Vartapetian
  • Shifts are booked on a first-come-first-served basis in ATLAS OTP.
  • ADCoS tasks in OTP:
    • 529221 - ADCoS Expert shifts
    • 529222 - ADCoS Senior shifts
    • 529223 - ADCoS Trainee shifts
    • 86 - ADCoS Coordination Shifts
  • Generally, more information about ATLAS shifts is available on the ATLAS OtpShiftClasses TWiki page.
  • In case of questions please contact the ADCoS coordinators (atlas-adc-adcos-coordinators@cern.ch).
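
For orientation, a small worked example of the credit percentages listed above, assuming the percentage applies to each shift in a block (OTP itself is the authority on the actual credit):

  # Illustrative only: OTP credit per shift type, in percent, as listed above.
  CREDIT_PERCENT = {
      "senior_weekday": 78,    # Mon+Tue / Wed+Thu blocks and Friday
      "senior_weekend": 155,   # Sat+Sun block bonus credit
      "trainee":        0,
      "expert":         100,   # 7-day Wed-Tue block, no weekend bonus
  }

  # Example: a senior shifter doing one weekday block and one weekend block.
  shifts = ["senior_weekday", "senior_weekday", "senior_weekend", "senior_weekend"]
  print(sum(CREDIT_PERCENT[s] for s in shifts))   # -> 466 (percent)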

ADCoS Expert Duties

TEAM MEMBERS


Major updates:
-- XavierEspinal - 30 Jul 2008 -- JaroslavaSchovancova - 2010-2011

%RESPONSIBLE% AlexeySedov
%REVIEW% Never reviewed

-- MichalSvatos - 25 Jun 2014
