
Notes about deployment and potential backward compatibility issues


Several pieces of software get upgraded every month: the client, the REST interface, the CRABCache, the Oracle database schema, the Task Worker, the post-job and worker node (WN) code (cmscp and the job wrapper), ASO, and the CouchDB views.

In general everything starts with a CMSWEB update that brings the new version of the REST interface, the CRABCache, and the ASO CouchDB views. As soon as the CRABCache goes down, submission of CRAB jobs from the client fails. In principle this is not true for the REST interface, since there are multiple crabserver backends and the CMSWEB deployment is done in such a way that at least one backend is always active.

The Oracle schema is usually updated upfront (say, we add a column). Changes to the Oracle schema are written here by developers: https://github.com/dmwm/CRABServer/blob/master/etc/updateOracle.sql and are performed by a CRAB3 operator. Some types of schema changes need to be performed after the deployment; in this case developers need to make sure the REST code is backward compatible and operators need to test it (for example, if we change the type of a column from VARCHAR to CLOB).
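
As an illustration of the VARCHAR-to-CLOB case, a minimal sketch of what such a change might look like. Table and column names here are made up for the example, not taken from updateOracle.sql; Oracle cannot ALTER a VARCHAR2 column to CLOB in place, so the data is copied through a temporary column:

    # Hypothetical sketch; "tasks" and "tm_task_failure" are illustrative names.
    sqlplus "$ORACLE_USER/$ORACLE_PASS@$ORACLE_TNS" <<'EOF'
    ALTER TABLE tasks ADD (tm_task_failure_tmp CLOB);
    UPDATE tasks SET tm_task_failure_tmp = tm_task_failure;
    ALTER TABLE tasks DROP COLUMN tm_task_failure;
    ALTER TABLE tasks RENAME COLUMN tm_task_failure_tmp TO tm_task_failure;
    COMMIT;
    EOF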

The next piece of software that gets updated is the Task Worker. This is done by a CRAB3 operator. Usually the Task Worker is shut down from the start to the end of the CMSWEB deployment. This is a precaution, because the new REST code might return data in a new format that breaks the old Task Worker (usually "new REST/old TW" works, but developers do not guarantee this). Since the Task Worker is shut down before the CMSWEB deployment starts, there may be tasks submitted with "old client/old REST" that go into the database and are processed by the "new REST/new TW" code. Developers need to take care of this case when they write the code, and operators need to check that nothing breaks.

The Task Worker deployment brings the new post-job (PJ) and WN code. This does not mean that everything gets updated, since only new tasks will use the new code! On the schedds and on the WNs there will be jobs that use the old code. Again, developers need to account for this case when they write the code and operators need to check that everything works fine.

The client gets updated after REST/TW, so the server-side code needs to be backward compatible and work with the previous client version. During validation, operators need to make sure that submission with the old client works. ASO is also updated after REST/TW, and again we need to make sure that the old post-job running on the schedds writes documents that are well formed and do not break the new ASO code (or better, the new ASO code needs to be backward compatible).

Summarizing, this is the backward-compatibility checklist:

  • Old PJ and old WN code: before the testbed deployment we send tasks that get processed by the old REST/old TW. Jobs should be long enough that the PJ and WN code run after the testbed deployment is done, i.e. they get processed by the new code.
  • Old REST (for submission)/new TW: again, before the testbed deployment starts we stop the TW and submit some tasks. These tasks will be processed once the testbed deployment is done and the new TW is switched on; see the sketch after this list.
  • Old client: when everything is done, the operator needs to make sure the old client works with the new code as well, by submitting a task and checking the basic commands.
  • Old ASO: make sure the old ASO code works with the new REST/PJ code.
  • Old PJ/new ASO: there will be old jobs/PJs in the system that run the old code and talk to ASO.
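
A minimal sketch of the pre-deployment part of the first two checks, assuming a standard client setup; the config file names are placeholders:

    # 1) Old PJ / old WN code: submit long jobs before the testbed deployment,
    #    so the PJ and WN code only run once the new server code is in place.
    source /cvmfs/cms.cern.ch/crab3/crab.sh
    crab submit -c crabConfig_longjobs.py      # hypothetical long-running config

    # 2) Old REST (at submission) / new TW: with the TaskWorker stopped
    #    (how to stop it depends on the TW node setup, not shown here),
    #    submit tasks that will wait in the database until the new TW starts.
    crab submit -c crabConfig_plain.py         # hypothetical config

    # After the deployment, check that both kinds of tasks complete:
    crab status -d <project_directory>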

Marco notes: I don't think we need to go through all of this at every deployment, but IMHO each month we need to evaluate what has changed in the code and decide which of these use cases need to be checked by the operator.

Notes about normal validation:

There is a test suite that sends different tasks with different parameters: https://github.com/dmwm/CRABServer/tree/master/test/templates

To use it, copy the contents of the templates directory from the CRABServer repo to somewhere with more space, for example your work area: '/afs/cern.ch/work/e/erupeika/public/validationScripts'. Then modify the file validation-args.sh, which contains the parameters for the validation.sh script (a sketch of this file follows the list below). The main things to change are:

  • TAG - refers to the HG tags found at https://github.com/dmwm/deployment/releases and should match the release being validated. It appears in the name of every submitted task; it is cosmetic and only useful for reference when looking at the validation results.
  • VERSION - also cosmetic; it appears in the task names and can be used to differentiate between multiple validation runs.
  • WORK_DIR - where CRAB tasks, CMSSW and other things are stored; could look like this: '/afs/cern.ch/work/e/erupeika/public/crabValidation/$TAG'.
  • MAIN_DIR - should point to where your validation-args.sh, validation.sh and other files are located, for example '/afs/cern.ch/work/e/erupeika/public/validationScripts'.
  • Sourcing the client - it is important to use the correct version of the client for the validation; set the environment variable CRAB_SOURCE_SCRIPT to '/cvmfs/cms.cern.ch/crab3/crab_pre_standalone.sh'. This way, when the client is sourced via the light script '/cvmfs/cms.cern.ch/crab3/crab_pre.sh', it uses crab_pre_standalone.sh rather than crab_standalone.sh (the latter would be the wrong client for validating the pre-production setup).
  • STORAGE_SITE, INSTANCE - self-explanatory, but should also be set accordingly.
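
A minimal sketch of the setup, assuming the variable names above are what validation-args.sh actually defines; the values are examples only:

    # Get the templates and go to your copy (paths are examples).
    cp -r CRABServer/test/templates /afs/cern.ch/work/e/erupeika/public/validationScripts
    cd /afs/cern.ch/work/e/erupeika/public/validationScripts

    # validation-args.sh -- illustrative values:
    TAG='HG1709e'                # CMSWEB release being validated (cosmetic)
    VERSION='1'                  # distinguishes repeated runs (cosmetic)
    MAIN_DIR='/afs/cern.ch/work/e/erupeika/public/validationScripts'
    WORK_DIR="/afs/cern.ch/work/e/erupeika/public/crabValidation/$TAG"
    export CRAB_SOURCE_SCRIPT='/cvmfs/cms.cern.ch/crab3/crab_pre_standalone.sh'
    STORAGE_SITE='T2_CH_CERN'    # example site; use one you can write to
    INSTANCE='preprod'           # REST instance under validation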

Afterwards, run the validation.sh script and you should see multiple tasks being submitted using different templates and parameters.
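
For example (a sketch; whether validation.sh reads validation-args.sh on its own is an assumption, adjust if it needs arguments):

    cd /afs/cern.ch/work/e/erupeika/public/validationScripts
    ./validation.sh 2>&1 | tee validation.log
    # Later, check any of the submitted tasks from its project directory:
    # crab status -d $WORK_DIR/<project_directory>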

There is also a script to test the client commands, called client_validation.sh. It sources the pre-prod client and goes through most of the commands, using all of the available parameters, on the latest task. Some commands/options aren't tested by the script at the moment (--dryrun, proceed, resubmit, uploadlog, purge, kill) and should be tested by hand, as sketched below.
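
A sketch of the by-hand checks, run against one of the validation tasks; <projdir> stands for a CRAB project directory created during the validation:

    crab submit --dryrun -c crabConfig.py    # dry run of a submission
    crab proceed   -d <projdir>              # continue a task submitted with --dryrun
    crab resubmit  -d <projdir>              # resubmit failed jobs
    crab uploadlog -d <projdir>              # upload crab.log to the CRABCache
    crab purge     -d <projdir>              # clean the task's cache area
    crab kill      -d <projdir>              # kill the task (use a throwaway task)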

TODO: describe what the different validation templates are.

In principle, for full coverage we should test every command with every parameter for each task, but that is basically impossible (unless we automate it). The best thing to do is to check each command on a subset of meaningful tasks. However, some commands only need to be tested once:

  • checkusername
  • checkwrite
  • proceed
  • purge
  • remake
  • tasks
  • uploadlog

while the other commands should be checked on different types of tasks. For example, they can be run on tasks using the Analysis plugin vs. PrivateMC, or on tasks using the user input files feature; it really depends on what changed in the code (a sketch of such checks follows the list). The list of the other commands is:

  • kill
  • resubmit
  • report
  • status
  • submit
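
A sketch of running the per-task commands on a couple of representative tasks; the project directory names are placeholders:

    # One Analysis task and one PrivateMC task, for example:
    for projdir in <analysis_projdir> <privatemc_projdir>; do
        crab status   -d "$projdir"
        crab report   -d "$projdir"
        crab resubmit -d "$projdir"    # meaningful only if some jobs failed
        crab kill     -d "$projdir"    # only on a task you can afford to kill
    done
    # 'submit' is exercised implicitly when the tasks are created.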

Validation tasks

These are (more or less) the tasks that currently make up the monthly general release validation. Some of them were made to test specific bugs or features but are not very useful past the initial release.

The plan is to narrow down the number of tasks submitted every month to around 10 while still covering most of the critical functionality in CRAB3. The rest of the tasks could prove useful in specific cases, so they are also worth keeping around. Finally, I intend to start running two extra tasks by hand at each validation: one to publish a dataset in phys03, and another to analyze it (and also publish the result). This should help test ASO.

| *Task* | *Purpose* | *Include in general validation?* | *Details* |
| Analysis_150Chars_WF-L-T_O-T_P-T_IL-F | Problems associated with task name length | NO | Analysis on GenericTTbar MC dataset, FileBased splitting |
| Analysis_225Chars_WF-L-T_O-T_P-T_IL-F | Problems associated with task name length | NO | Analysis on GenericTTbar MC dataset, FileBased splitting |
| Analysis_240Chars_WF-L-T_O-T_P-T_IL-F | Problems associated with task name length | NO | Analysis on GenericTTbar MC dataset, FileBased splitting |
| Analysis_Partial_DS_OneSite-L-T_O-T_P-T_IL-F | Running on an incomplete dataset | NO | Analysis on a MINIAOD dataset, FileBased splitting |
| Analysis_Partial_DS_TwoSites-L-T_O-T_P-T_IL-F | Running on an incomplete dataset distributed over multiple sites | NO, BROKEN | Analysis on a MINIAOD dataset, FileBased splitting |
| Analysis_Use_Deprecated-L-T_O-T_P-T_IL-F | Running on a deprecated dataset | NO, BROKEN | Analysis on a deprecated phys03 dataset, LumiBased splitting |
| Analysis_Use_Invalid-L-T_O-T_P-T_IL-F | Running on an invalid dataset | NO, BROKEN | Analysis on an invalid phys03 dataset, LumiBased splitting |
| Analysis_Use_Parent-L-T_O-T_P-T_IL-F | Parent dataset functionality | YES | Analysis on a PromptReco dataset, LumiBased splitting, useParent = True |
| Analysis_UserDS_on_PHYS03-L-T_O-T_P-T_IL-F | Running on a phys03 dataset | YES, BROKEN | Analysis on a phys03 dataset, FileBased splitting |
| Analysis_User_InputFiles-L-T_O-T_P-T_IL-F | Using a custom list of input files for analysis | YES | Analysis on GenericTTbar files taken from a user-defined list, FileBased splitting |
| MC_Analysis_FileBased-L-T_O-T_P-T_IL-F | Generic MC analysis task | YES | Analysis on a GenericTTbar dataset, FileBased splitting |
| Analysis_LumiBased_on_Data_BIG-L-T_O-T_P-T_IL-F | Removal of jobs with excessive disk usage | NO | Analysis on an AOD dataset, LumiBased splitting |
| MinBias_PrivateMC_EventBased_ExtraParams-L-T_O-T_P-T-MemoryBIG | Jobs close to the RSS limit do not get removed (?) | NO | PrivateMC |
| MinBias_PrivateMC_EventBased_ExtraParams-L-T_O-T_P-T-MemoryRM | Jobs exceeding the RSS limit (750 MB) get removed | NO | PrivateMC |
| MinBias_PrivateMC_EventBased_ExtraParams-L-T_O-T_P-T-RuntimeBig | Jobs with a high maxJobRuntimeMin requirement | NO | PrivateMC |
| MinBias_PrivateMC_EventBased_ExtraParams-L-T_O-T_P-T-RuntimeRM | Removal because of excessive walltime | NO | PrivateMC |
| MinBias_PrivateMC_EventBased_ExtraParams-L-T_O-T_P-T-HammerCloud | Running jobs similar to HammerCloud's | YES, BROKEN? | PrivateMC, extraJDL = ['+CRAB_NoWNStageout=1', '+CRAB_HC=True'] |
| MinBias_PrivateMC_EventBased_ExtraParams-L-T_O-T_P-T-Stageout | The remote stageout option | YES | PrivateMC, extraJDL = ['+CRAB_StageoutPolicy="remote"'] |
| MinBias_PrivateMC_EventBased_Ignore_Global_Blacklist-L-T_O-T_P-T_IL-F | Uses the ignoreGlobalBlacklist flag; jobs often fail, as expected | NO | |
| MinBias_PrivateMC_EventBased-L-F_O-F_P-F_IL-T | One of 6 tasks with different flag configurations | NO | |
| MinBias_PrivateMC_EventBased-L-F_O-T_P-F_IL-F | One of 6 tasks with different flag configurations | NO | |
| MinBias_PrivateMC_EventBased-L-T_O-F_P-F_IL-F | One of 6 tasks with different flag configurations | NO | |
| MinBias_PrivateMC_EventBased-L-T_O-T_P-F_IL-F | One of 6 tasks with different flag configurations | NO | |
| MinBias_PrivateMC_EventBased-L-T_O-T_P-T_IL-F | One of 6 tasks with different flag configurations | NO | |
| MinBias_PrivateMC_EventBased-L-T_O-T_P-T_IL-T | One of 6 tasks with different flag configurations | NO | |
| PrivateMC_for_LHE-L-T_O-T_P-T_IL-F | ??? | NO? | PrivateMC; generator set to 'lhe', special LHE pset 'pset_on_lhe.py' and special 'dynlo.lhe' input file |
| Skimming_dataset_lumi_based-L-T_O-T_P-T_IL-F | Skimming? | NO? | Analysis on an AOD dataset, LumiBased splitting, special 'skimming_dataset.py' pset |
| testsubmit-L-T_O-T_P-T_IL-F | No specific purpose AFAICT | NO | |
| TFile_Analysis_LumiBased-L-T_O-T_P-T_IL-F | TFile use case? | NO? | Analysis on a GenericTTbar dataset, LumiBased splitting with a lumi mask, special pset 'pset_Tfile.py' |
| UseSecondary-L-T_O-T_P-T_IL-F | Secondary dataset functionality | YES | Analysis on RECO and RAW datasets, LumiBased splitting, 'pset_user_parent.py' pset |
| MinBias_PrivateMC_EventBased_ExtraParams-L-T_O-T_P-T-NumCores | Task using multiple cores | YES, BROKEN | PrivateMC with 2 cores |

Notes

  • Things like 'L-T_O-T_P-T_IL-T_DOC-F' at the end of the task names specify the flags with which the tasks were submitted. The validation script currently tries out some combinations of these flags; however, Marco and I think that none of them are useful enough for the general validation. For reference, the flags that can be switched on or off in the current validation script are:
    • transferLogs
    • transferOutputs
    • publication
    • ignoreLocality
    • disableAutomaticOutputCollection (previously, every task was duplicated with this flag switched on and off, which doubled the number of validation tasks described above).
  • There are three templates for LHE use cases (PrivateMC_for_LHE.py, PrivateMC_for_LHE_NOLS.py, PrivateMC_for_LHE_pythia.py) which use a custom input file 'dynlo.lhe' and the 'lhe' or 'pythia' splitting modes, but I never got them to work and I don't really know how they differ from normal tasks. I assume they shouldn't be run every month and that we keep them around just in case.
  • There is a skimming task template, 'Skimming_dataset_lumi_based.py', which uses a special 'skimming_dataset.py' PSet. This task works fine, but I assume it does not need to be run every month, especially if the only thing that differs is the PSet.
  • There are two templates which try to use a voGroup (specifically 'becms' and 'escms'), but I don't know how to submit them. I am also assuming that the functionality itself works fine and that they can be skipped in the monthly validation.

Summary

  • My proposal is to keep or create one task for each of the most critical CRAB3 features / use cases. In my opinion these would be:
    • Analysis with a parent dataset, LumiBased splitting
    • Analysis with a secondary dataset, LumiBased splitting
    • Analysis on a custom list of input files, LumiBased splitting
    • Analysis on data with a Lumimask and runrange applied, LumiBased splitting
    • Analysis on data, FileBased splitting
    • Analysis on data, EventAware splitting
    • A PrivateMC task to generate and publish a dataset in phys03
    • Analysis on the phys03 dataset from the previous task, LumiBased splitting
    • PrivateMC using the 'remote' stageout option (other tasks would use the default 'local' stageout)
    • PrivateMC using 8 cores
    • Analysis using an old CMSSW release (CMSSW_5_3_22)
    • A task similar to HammerCloud's: analysis on a GenericTTbar dataset, extraJDL = ['+CRAB_JobReleaseTimeout=300', '+CRAB_NoWNStageout=1', '+CRAB_HC=True', 'accounting_group=production', 'accounting_group_user=cmsdataops']

-- MarcoMascheroni - 2017-01-31
