WMAgent end to end Validation Tests for HG1409 cmsweb upgrade
Upgrade schedule
- 18 August: release candidate RPMs due for pre-prod deployment *deadline for requests*
- 19 August: cmsweb-testbed pre-prod release candidate deployment
- 31 August: validation results due *deadline for validation*
- 02 Sept: production deployment
Release changes trac ticket
Validation results trac ticket
Versions tested
HG1409a
ReqMgr version 0.9.97.pre2
Global WQ version 0.9.97.pre2
WMStats version 0.9.97.pre2
WMAgent version used for the testing: v0.9.97.pre2
Release Notes (UPDATE THEM)
RequestManager
Global_WorkQueue
WMStats
Premixing performance table
Mean
| Metric \ WF | DIGI Classic | DIGI Premixed | RECO Classic | RECO Premixed |
| AvgEventTime | 113.383 | 13.622 | 51.404 | 53.053 |
| Timing-file-read-totalMegabytes | 34961.866 | 320.997 | 281.444 | 275.293 |
| Timing-file-write-totalMegabytes | 632.859 | 623.153 | 174.417 | 175.182 |
Std. deviation
| Metric \ WF | DIGI Classic | DIGI Premixed | RECO Classic | RECO Premixed |
| AvgEventTime | 1214.332 | 2.489 | 8.849 | 10.824 |
| Timing-file-read-totalMegabytes | 7266.516 | 79.536 | 37.220 | 33.869 |
| Timing-file-write-totalMegabytes | 131.735 | 83.732 | 22.869 | 21.376 |
Classic: pdmvserv_TOP-Spring14dr-00027_00159_v0__140717_193859_6252
Premixed: alahiff_TOP-Spring14premixdr-00001_00003_v0__140731_152029_8397
A more detailed "raw" summary for the classic workflow:
Results for cmsRun1:
Timing-tstoragefile-read-maxMsecs : {'std': '180770.823', 'max': '3955560.000', 'avg': '86378.179', 'min': '1689.960'}
Timing-tstoragefile-read-numOperations : {'std': '133147.912', 'max': '836847.000', 'avg': '640552.396', 'min': '742.000'}
Timing-tstoragefile-read-totalMegabytes : {'std': '7266.516', 'max': '45648.600', 'avg': '34961.866', 'min': '2.865'}
Timing-tstoragefile-read-totalMsecs : {'std': '3641310.849', 'max': '36041500.000', 'avg': '3858931.029', 'min': '17739.700'}
Timing-file-write-totalMegabytes : {'std': '131.735', 'max': '835.107', 'avg': '632.859', 'min': '0.940'}
AvgEventTime : {'std': '1214.332', 'max': '34035.300', 'avg': '113.383', 'min': '23.579'}
TotalJobTime : {'std': '6769.541', 'max': '66029.600', 'avg': '13982.054', 'min': '620.854'}
Results for cmsRun2:
Timing-tstoragefile-read-maxMsecs : {'std': '985.882', 'max': '29524.600', 'avg': '649.173', 'min': '0.356'}
Timing-tstoragefile-read-numOperations : {'std': '2002.212', 'max': '16809.000', 'avg': '12215.550', 'min': '1087.000'}
Timing-tstoragefile-read-totalMegabytes : {'std': '37.220', 'max': '355.123', 'avg': '281.444', 'min': '19.057'}
Timing-tstoragefile-read-totalMsecs : {'std': '27782.258', 'max': '425791.000', 'avg': '18393.568', 'min': '11.191'}
Timing-file-write-totalMegabytes : {'std': '22.869', 'max': '219.973', 'avg': '174.417', 'min': '13.533'}
AvgEventTime : {'std': '8.849', 'max': '221.488', 'avg': '51.404', 'min': '27.365'}
TotalJobTime : {'std': '2979.771', 'max': '58442.700', 'avg': '13777.353', 'min': '1001.630'}
And a more detailed "raw" summary for the premixed workflow:
Results for cmsRun1:
Timing-file-read-maxMsecs : {'std': '4365.709', 'max': '93309.200', 'avg': '576.579', 'min': '0.000'}
Timing-file-read-numOperations : {'std': '1905.609', 'max': '17765.000', 'avg': '2898.919', 'min': '0.000'}
Timing-file-read-totalMegabytes : {'std': '79.536', 'max': '822.993', 'avg': '320.997', 'min': '0.000'}
Timing-file-read-totalMsecs : {'std': '6362.667', 'max': '187221.000', 'avg': '8053.120', 'min': '0.000'}
Timing-file-write-totalMegabytes : {'std': '83.732', 'max': '793.488', 'avg': '623.153', 'min': '8.976'}
AvgEventTime : {'std': '2.489', 'max': '29.834', 'avg': '13.622', 'min': '8.895'}
TotalJobTime : {'std': '828.326', 'max': '8115.290', 'avg': '3648.640', 'min': '122.102'}
Results for cmsRun2:
Timing-file-read-maxMsecs : {'std': '274.991', 'max': '28094.000', 'avg': '25.905', 'min': '0.336'}
Timing-file-read-numOperations : {'std': '1842.571', 'max': '17267.000', 'avg': '11745.452', 'min': '775.000'}
Timing-file-read-totalMegabytes : {'std': '33.869', 'max': '346.489', 'avg': '275.293', 'min': '16.331'}
Timing-file-read-totalMsecs : {'std': '708.951', 'max': '60584.400', 'avg': '256.675', 'min': '7.958'}
Timing-file-write-totalMegabytes : {'std': '21.376', 'max': '220.818', 'avg': '175.182', 'min': '11.993'}
AvgEventTime : {'std': '10.824', 'max': '135.981', 'avg': '53.053', 'min': '35.340'}
TotalJobTime : {'std': '3416.552', 'max': '34403.300', 'avg': '14275.803', 'min': '719.193'}
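The raw summary lines above store every statistic as a string, so they need parsing before the classic and premixed workflows can be compared numerically. The following is a minimal sketch, assuming each line has the form `Metric : {...}` exactly as printed above; the helper names `parse_summary_line` and `avg_ratio` are hypothetical and are not part of WMAgent or of the script that produced the summaries.

```python
import ast

# Two lines copied verbatim from the cmsRun1 summaries above.
CLASSIC = "AvgEventTime : {'std': '1214.332', 'max': '34035.300', 'avg': '113.383', 'min': '23.579'}"
PREMIXED = "AvgEventTime : {'std': '2.489', 'max': '29.834', 'avg': '13.622', 'min': '8.895'}"


def parse_summary_line(line):
    """Split 'Metric : {...}' into (metric_name, {stat_name: float})."""
    name, _, payload = line.partition(" : ")
    # The payload is a Python dict literal whose values are numeric strings.
    stats = ast.literal_eval(payload)
    return name.strip(), {key: float(value) for key, value in stats.items()}


def avg_ratio(classic_line, premixed_line):
    """Return the classic average divided by the premixed average."""
    _, classic = parse_summary_line(classic_line)
    _, premixed = parse_summary_line(premixed_line)
    return classic["avg"] / premixed["avg"]


if __name__ == "__main__":
    print("classic/premixed AvgEventTime ratio: %.2f" % avg_ratio(CLASSIC, PREMIXED))
```

The same two helpers can be pointed at any other pair of lines, for example the read-totalMegabytes entries, as long as both lines describe the same cmsRun step.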
Observed changes from previous versions
| Test | Tester | Completed | Status | Comments |
| Lumi mask input | | | | |
| MC from scratch - Lumi mask input | | | | |
| MC with input - Lumi mask input | | | | |
| ReDiGi - Lumi mask input | | | | |
| ReReco - Lumi mask input | | | | |
| TaskChain - Lumi mask input | | | | |
| Wrong run number - Lumi mask input | | | | |
| Wrong lumi range - Lumi mask input | | | | |
| Force completing workflows 5249 | | | | |
| PhEDEx node naming | | | | |
| Disk subscription 5142 | | | | |
Tests
| Test | Tester | Completed | Status | Comments |
| Bug fixes / New features in WMStats | | | | |
| Bug fixes in WMAgent | | | | |
| New features in WMAgent | | | | |
| LheInputFiles feature added to TaskChain requests 4871 | Alan | | | |
| EventsPerLumi capability added to TaskChain requests 4872 | Alan | | | |
| Bug fixes in ReqMgr | | | | |
| MC from scratch workflow extension | | | | |
| Change the permission for the agent to update the ReqMgr status | | | | |
| New features in ReqMgr | | | | |
| Bug fixes in WorkQueue | | | | |
| Standard workflows | | | | |
| Old request moved from completed to closed-out and announced | | | | |
| Old request moved from completed to rejected | | | | |
| Old request moved from assignment-approved to rejected | | | | |
| Request moved from assigned to aborted | Alan | | | |
| Request moved from assigned to rejected | Alan | | | |
| Request moved from acquired to aborted | Alan | | | |
| Request moved from acquired to rejected | Alan | | | |
| Request moved from running to aborted | Alan | | | |
| MonteCarlo workflow | Justas | | | |
| MonteCarlo LHE workflow | Justas | | | |
| MonteCarloFromGEN workflow | Justas | | | |
| ReDigi workflow | Justas | | | |
| ReReco+skim workflow | Justas | | | |
| ACDC for Production | Justas | | | |
| High Scale Test | Alan | | | |
| TaskChain: MC recycling | Alan | | | |
| TaskChain: MC from scratch | Alan | | | |
| TaskChain: FastSim workflow + event splitting | Alan | | | |
| TaskChain: Data workflow | Alan | | | |
| TaskChain: Pileup workflow by recycling | Alan | | | |
| TaskChain: Pileup workflow from scratch | Alan | | | |
| TaskChain: Pileup Pyquen workflow (PrimaryDataset override) | Alan | | | |
| TaskChain: automatic harvesting | Alan | | | |
| TaskChain: different ProcessingString per task | Alan | | | |
| TaskChain: KeepOutput = False feature (single and cascade) | Alan | | | |
| TaskChain: 'TransientOutputModules': ['RAWoutput'] and TransientOutputModules = ['RECOSIMoutput'] | Alan | | | |
| TaskChain: ACDC via WMStats | Alan | | | |
| TaskChain: MC Pre-Mixing workflow | Alan | | | |
Optional things to test
| Test | Tester | Completed | Status | Comments |
| TaskChain: cascade "closed-out" and "announced" changes via script | Alan | | | |
| Propagate Memory (RequestMemory in MB), Disk (RequestDisk in KB) and Job length (MaxWallTimeMins in minutes) estimates to Condor through the JDL #4472 | | | | |
| Apply smart error handling for jobs that failed due to high memory usage or excessive run time #4473 | | | | |
| Robust merge jobs - add missing merge files to ACDC, proceed with existing files #4476 | | | | |
| Fixed timeouts when connecting to the ReqMgr which prevented workflows from being acquired #4660 | | | | |
| Track pileup location and NOT fail out requests #3733 and #4507 | | | | |
| Priority: now a required parameter; it can only take values up to 1 million | Alan | | | |
| Argument validation is stricter: a parameter must either have a valid value or be absent; dummy values will most likely fail validation | Alan | | | |
| JobSplitting can now be specified at request creation. Use "SplittingAlgo" and other parameters for that | Alan | | | |
| Do not allow rejection of requests in "assigned" state 4976 | Alan | | | |
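The Priority rule listed in the table above (required, values up to 1 million) can be illustrated with a small check. This is only a sketch of the rule as stated on this page, not the ReqMgr validation code: the function name `validate_priority`, the integer-only requirement and the lower bound of 0 are assumptions made for the example.

```python
MAX_PRIORITY = 1000000  # "values up to 1 million", as stated in the table above


def validate_priority(request_args):
    """Return the Priority value, or raise ValueError if it is missing or invalid.

    Assumptions for this sketch: Priority is an integer and cannot be negative.
    """
    if "Priority" not in request_args:
        raise ValueError("Priority is a required parameter")
    value = request_args["Priority"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("Priority must be an integer, got %r" % (value,))
    if not 0 <= value <= MAX_PRIORITY:
        raise ValueError("Priority must be between 0 and %d, got %d" % (MAX_PRIORITY, value))
    return value


if __name__ == "__main__":
    print(validate_priority({"RequestType": "TaskChain", "Priority": 85000}))
```

A request dictionary without a Priority key, or with a placeholder such as a string, is rejected instead of being given a default, which mirrors the "valid value or not present" idea in the stricter argument validation.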
--
AlanMalta - 14 Aug 2014