WMAgent end-to-end validation tests for the HG1409 cmsweb upgrade

Upgrade schedule

  • 18 August: release candidate RPMs due for pre-prod deployment (deadline for requests)
  • 19 August: cmsweb-testbed pre-prod release candidate deployment
  • 31 August: validation results due (deadline for validation)
  • 2 September: production deployment

Release changes trac ticket
Validation results trac ticket

Versions tested

HG1409a

ReqMgr version 0.9.97.pre2
Global WQ version 0.9.97.pre2
WMStats version 0.9.97.pre2
WMAgent version used for the testing: v0.9.97.pre2

Release Notes (UPDATE THEM)

RequestManager
Global_WorkQueue
WMStats

Premixing performance table

Mean

| Metric | DIGI Classic | DIGI Premixed | RECO Classic | RECO Premixed |
| AvgEventTime | 113.383 | 13.622 | 51.404 | 53.053 |
| Timing-file-read-totalMegabytes | 34961.866 | 320.997 | 281.444 | 275.293 |
| Timing-file-write-totalMegabytes | 632.859 | 623.153 | 174.417 | 175.182 |

Standard deviation

| Metric | DIGI Classic | DIGI Premixed | RECO Classic | RECO Premixed |
| AvgEventTime | 1214.332 | 2.489 | 8.849 | 10.824 |
| Timing-file-read-totalMegabytes | 7266.516 | 79.536 | 37.220 | 33.869 |
| Timing-file-write-totalMegabytes | 131.735 | 83.732 | 22.869 | 21.376 |

Classic: pdmvserv_TOP-Spring14dr-00027_00159_v0__140717_193859_6252
Premixed: alahiff_TOP-Spring14premixdr-00001_00003_v0__140731_152029_8397
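
To make the comparison easier to read at a glance, the classic-to-premixed ratios of the mean values can be computed directly. The sketch below is illustrative only: the dictionary layout and the helper name `classic_to_premixed_ratio` are assumptions made here, and the numbers are copied from the Mean table above.

```python
# Mean values copied from the "Mean" table above: (classic, premixed) per step.
MEANS = {
    "AvgEventTime": {
        "DIGI": (113.383, 13.622),
        "RECO": (51.404, 53.053),
    },
    "Timing-file-read-totalMegabytes": {
        "DIGI": (34961.866, 320.997),
        "RECO": (281.444, 275.293),
    },
    "Timing-file-write-totalMegabytes": {
        "DIGI": (632.859, 623.153),
        "RECO": (174.417, 175.182),
    },
}


def classic_to_premixed_ratio(metric, step):
    """Return classic / premixed for one metric and step (hypothetical helper)."""
    classic, premixed = MEANS[metric][step]
    return classic / premixed


if __name__ == "__main__":
    for metric, steps in MEANS.items():
        for step in steps:
            ratio = classic_to_premixed_ratio(metric, step)
            print(f"{metric:34s} {step}: {ratio:8.2f}")
```

With these numbers the DIGI step is where the two workflows differ most: the mean event time is roughly 8 times lower and the data read is roughly 100 times lower for the premixed workflow, while the RECO ratios all stay close to 1.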

A more detailed "raw" summary for the classic workflow:

Results for cmsRun1:
Timing-tstoragefile-read-maxMsecs               : {'std': '180770.823', 'max': '3955560.000', 'avg': '86378.179', 'min': '1689.960'}
Timing-tstoragefile-read-numOperations          : {'std': '133147.912', 'max': '836847.000', 'avg': '640552.396', 'min': '742.000'}
Timing-tstoragefile-read-totalMegabytes         : {'std': '7266.516', 'max': '45648.600', 'avg': '34961.866', 'min': '2.865'}
Timing-tstoragefile-read-totalMsecs             : {'std': '3641310.849', 'max': '36041500.000', 'avg': '3858931.029', 'min': '17739.700'}
Timing-file-write-totalMegabytes                : {'std': '131.735', 'max': '835.107', 'avg': '632.859', 'min': '0.940'}
AvgEventTime                                    : {'std': '1214.332', 'max': '34035.300', 'avg': '113.383', 'min': '23.579'}
TotalJobTime                                    : {'std': '6769.541', 'max': '66029.600', 'avg': '13982.054', 'min': '620.854'}

Results for cmsRun2:
Timing-tstoragefile-read-maxMsecs               : {'std': '985.882', 'max': '29524.600', 'avg': '649.173', 'min': '0.356'}
Timing-tstoragefile-read-numOperations          : {'std': '2002.212', 'max': '16809.000', 'avg': '12215.550', 'min': '1087.000'}
Timing-tstoragefile-read-totalMegabytes         : {'std': '37.220', 'max': '355.123', 'avg': '281.444', 'min': '19.057'}
Timing-tstoragefile-read-totalMsecs             : {'std': '27782.258', 'max': '425791.000', 'avg': '18393.568', 'min': '11.191'}
Timing-file-write-totalMegabytes                : {'std': '22.869', 'max': '219.973', 'avg': '174.417', 'min': '13.533'}
AvgEventTime                                    : {'std': '8.849', 'max': '221.488', 'avg': '51.404', 'min': '27.365'}
TotalJobTime                                    : {'std': '2979.771', 'max': '58442.700', 'avg': '13777.353', 'min': '1001.630'}

A more detailed "raw" summary for the premixed workflow:

Results for cmsRun1:
Timing-file-read-maxMsecs                       : {'std': '4365.709', 'max': '93309.200', 'avg': '576.579', 'min': '0.000'}
Timing-file-read-numOperations                  : {'std': '1905.609', 'max': '17765.000', 'avg': '2898.919', 'min': '0.000'}
Timing-file-read-totalMegabytes                 : {'std': '79.536', 'max': '822.993', 'avg': '320.997', 'min': '0.000'}
Timing-file-read-totalMsecs                     : {'std': '6362.667', 'max': '187221.000', 'avg': '8053.120', 'min': '0.000'}
Timing-file-write-totalMegabytes                : {'std': '83.732', 'max': '793.488', 'avg': '623.153', 'min': '8.976'}
AvgEventTime                                    : {'std': '2.489', 'max': '29.834', 'avg': '13.622', 'min': '8.895'}
TotalJobTime                                    : {'std': '828.326', 'max': '8115.290', 'avg': '3648.640', 'min': '122.102'}

Results for cmsRun2:
Timing-file-read-maxMsecs                       : {'std': '274.991', 'max': '28094.000', 'avg': '25.905', 'min': '0.336'}
Timing-file-read-numOperations                  : {'std': '1842.571', 'max': '17267.000', 'avg': '11745.452', 'min': '775.000'}
Timing-file-read-totalMegabytes                 : {'std': '33.869', 'max': '346.489', 'avg': '275.293', 'min': '16.331'}
Timing-file-read-totalMsecs                     : {'std': '708.951', 'max': '60584.400', 'avg': '256.675', 'min': '7.958'}
Timing-file-write-totalMegabytes                : {'std': '21.376', 'max': '220.818', 'avg': '175.182', 'min': '11.993'}
AvgEventTime                                    : {'std': '10.824', 'max': '135.981', 'avg': '53.053', 'min': '35.340'}
TotalJobTime                                    : {'std': '3416.552', 'max': '34403.300', 'avg': '14275.803', 'min': '719.193'}
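
Each summary line above is a metric name followed by a Python dictionary literal whose values are quoted strings, so it can be parsed with the standard library alone. The following is a minimal sketch under that assumption; the helper names `parse_summary_line` and `coefficient_of_variation` are made up for this example and are not part of WMAgent or WMStats.

```python
import ast


def parse_summary_line(line):
    """Split 'Name : {...}' into (name, stats), converting the values to float."""
    # Metric names contain no colon, so the first colon separates name and payload.
    name, _, payload = line.partition(":")
    stats = ast.literal_eval(payload.strip())  # values are quoted strings
    return name.strip(), {key: float(value) for key, value in stats.items()}


def coefficient_of_variation(stats):
    """Return std / avg, a unitless measure of spread relative to the mean."""
    return stats["std"] / stats["avg"]


if __name__ == "__main__":
    # The two AvgEventTime lines for cmsRun1, copied from the summaries above.
    classic = ("AvgEventTime : {'std': '1214.332', 'max': '34035.300', "
               "'avg': '113.383', 'min': '23.579'}")
    premixed = ("AvgEventTime : {'std': '2.489', 'max': '29.834', "
                "'avg': '13.622', 'min': '8.895'}")
    for label, line in (("classic", classic), ("premixed", premixed)):
        name, stats = parse_summary_line(line)
        print(label, name, round(coefficient_of_variation(stats), 2))
```

Applied to the cmsRun1 AvgEventTime rows, the classic workflow has a standard deviation more than ten times its mean, in line with the very large maximum (34035.300) in that row, whereas the premixed value stays below 0.2.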

Observed changes from previous versions

| Test | Tester | Completed | Status | Comments |
| Lumi mask input | | | | |
| MC from scratch - Lumi mask input | | | | |
| MC with input - Lumi mask input | | | | |
| ReDiGi - Lumi mask input | | | | |
| ReReco - Lumi mask input | | | | |
| TaskChain - Lumi mask input | | | | |
| Wrong run number - Lumi mask input | | | | |
| Wrong lumi range - Lumi mask input | | | | |
| Force completing workflows 5249 | | | | |
| PhEDEx node naming | | | | |
| Disk subscription 5142 | | | | |

Tests

| Test | Tester | Completed | Status | Comments |
| Bug fixes / New features in WMStats | | | | |
| Bug fixes in WMAgent | | | | |
| New features in WMAgent | | | | |
| LheInputFiles feature added to TaskChain requests 4871 | Alan | | | |
| EventsPerLumi capability added to TaskChain requests 4872 | Alan | | | |
| Bug fixes in ReqMgr | | | | |
| MC from scratch workflow extension | | | | |
| Change the permission for the agent to update the ReqMgr status | | | | |
| New features in ReqMgr | | | | |
| Bug fixes in WorkQueue | | | | |
| Standard workflows | | | | |
| Old request moved from completed to closed-out and announced | | | | |
| Old request moved from completed to rejected | | | | |
| Old request moved from assignment-approved to rejected | | | | |
| Request moved from assigned to aborted | Alan | | | |
| Request moved from assigned to rejected | Alan | | | |
| Request moved from acquired to aborted | Alan | | | |
| Request moved from acquired to rejected | Alan | | | |
| Request moved from running to aborted | Alan | | | |
| MonteCarlo workflow | Justas | | | |
| MonteCarlo LHE workflow | Justas | | | |
| MonteCarloFromGEN workflow | Justas | | | |
| ReDigi workflow | Justas | | | |
| ReReco+skim workflow | Justas | | | |
| ACDC for Production | Justas | | | |
| High Scale Test | Alan | | | |
| TaskChain: MC recycling | Alan | | | |
| TaskChain: MC from scratch | Alan | | | |
| TaskChain: FastSim workflow + event splitting | Alan | | | |
| TaskChain: Data workflow | Alan | | | |
| TaskChain: Pileup workflow by recycling | Alan | | | |
| TaskChain: Pileup workflow from scratch | Alan | | | |
| TaskChain: Pileup Pyquen workflow (PrimaryDataset override) | Alan | | | |
| TaskChain: automatic harvesting | Alan | | | |
| TaskChain: different ProcessingString per task | Alan | | | |
| TaskChain: KeepOutput = False feature (single and cascade) | Alan | | | |
| TaskChain: 'TransientOutputModules': ['RAWoutput'] and TransientOutputModules = ['RECOSIMoutput'] | Alan | | | |
| TaskChain: ACDC via WMStats | Alan | | | |
| TaskChain: MC Pre-Mixing workflow | Alan | | | |

Optional things to test

| Test | Tester | Completed | Status | Comments |
| TaskChain: cascade "closed-out" and "announced" changes via script | Alan | | | |
| Propagate Memory (RequestMemory in MB), Disk (RequestDisk in KB) and Job length (MaxWallTimeMins in minutes) estimates to Condor through the JDL #4472 | | | | |
| Apply smart error handling for jobs that failed due to high memory usage or excessive run time #4473 | | | | |
| Robust merge jobs - add missing merge files to ACDC, proceed with existing files #4476 | | | | |
| Fixed timeouts when connecting to the ReqMgr which prevented workflows from being acquired #4660 | | | | |
| Track pileup location and do NOT fail out requests #3733 and #4507 | | | | |
| Priority: now a required parameter; it can only take values up to 1 million | Alan | | | |
| Argument validation is stricter: in general, a parameter is either present with a valid value or not present at all; dummy values will most likely fail validation | Alan | | | |
| JobSplitting can now be specified at request creation; use "SplittingAlgo" and other parameters for that | Alan | | | |
| Do not allow rejection of requests in "assigned" state 4976 | Alan | | | |
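
The stricter argument rules listed above can be illustrated with a small sketch. It is not the ReqMgr implementation: the function `validate_request_args`, its error messages, the lower bound of 0 for `Priority`, and the choice to treat `None` and empty strings as dummy values are all assumptions made for illustration. Only the parameter names (`Priority`, `SplittingAlgo`) and the 1 million limit come from the list above.

```python
MAX_PRIORITY = 1_000_000  # upper limit stated in the list above


def validate_request_args(args):
    """Return a list of validation errors for a request dictionary (hypothetical)."""
    errors = []

    # Priority is required and must be an integer no larger than the limit.
    # The lower bound of 0 is an assumption, not taken from the source.
    if "Priority" not in args:
        errors.append("Priority is required")
    else:
        priority = args["Priority"]
        if isinstance(priority, bool) or not isinstance(priority, int):
            errors.append("Priority must be an integer")
        elif not 0 <= priority <= MAX_PRIORITY:
            errors.append("Priority out of range")

    # Every other parameter is either absent or carries a real value:
    # placeholders such as None or "" are rejected (assumed dummy values).
    for name, value in args.items():
        if name != "Priority" and (value is None or value == ""):
            errors.append(f"{name} has a dummy value; omit it instead")

    return errors


if __name__ == "__main__":
    # "EventBased" is used here only as an example value.
    print(validate_request_args({"Priority": 90000, "SplittingAlgo": "EventBased"}))
    print(validate_request_args({"SplittingAlgo": ""}))
```

The point of the sketch is the shape of the rule rather than the exact checks: a request that omits an optional parameter passes, while one that sends a placeholder for it does not.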

-- AlanMalta - 14 Aug 2014

Topic revision: r4 - 2014-08-18 - AlanMalta
 