Cross-check list of main objectives:

  • status re-organized without re-mapping known states;
  • improved user monitoring for the output transfers through the status (including monitoring of failures);
  • added lumi-mask without using ACDC (avoiding, for the moment, deploying it centrally);
  • deploy crabserver on cmsweb including dedicated reqmgr and global workqueue;
  • output dataset paths: understand various issues related to the LFN generation depending on the output type and the data publication step;
  • MC production.

Developer check list for testing

This is not meant to be the full list of items to test, but it covers the important steps related to the new features introduced in the last release.

Deployment --> All

New dedicated services for reqmgr and workqueue need to be deployed. Mattia has prepared new scripts plus deployment patches and is finalizing them.

Once these are ready, developers can start using them to deploy their environment. [Mattia] UPDATE: the script to make the setup using dedicated services for analysis can be found here: /afs/cern.ch/user/m/mcinquil/public/crab3/an_reqmgr_wq/install-CRAB3-stack.sh . This includes the patch /afs/cern.ch/user/m/mmascher/public/WMCore0.9.12.patch, which is needed to support the lumi mask.


[Test Result]:

[Mattia] A couple of things fixed. Works fine to me.

[Mattia] ASO deployment works if you set WMAGENT_SECRETS_LOCATION instead of ASYNC_SECRETS_LOCATION, contrary to what is described in AsyncStageOutManagement; this has to be fixed, also to avoid collisions between a wmagent setup and an ASO setup. In addition, the ASO default config points to the local database for monitoring, which leads to errors; the secret should therefore also contain the WMStats URL (as the agent does) and the user monitoring URL, to help with the deployment; it should be easy to fix. These also have to be taken from the secret:

  • config.AsyncTransfer.serviceCert = '/path/to/host-cert'
  • config.AsyncTransfer.serviceKey = '/path/to/valid/host-key'
  • config.DBSPublisher.serverDN = 'Your host DN'
  • config.DBSPublisher.serviceCert = '/path/to/valid/host-cert'
  • config.DBSPublisher.serviceKey = '/path/to/valid/host-key'
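
A minimal sketch of how the parameters listed above could be filled from the secret instead of being edited by hand. It assumes a plain KEY=VALUE secrets file; the key names (HOST_CERT, HOST_KEY, SERVER_DN) and both helper functions are hypothetical and are not taken from the ASO code.

```python
def parse_secrets(text):
    """Parse KEY=VALUE lines (assumed secrets format) into a dict."""
    secrets = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values such as a DN may contain '='.
        key, value = line.split("=", 1)
        secrets[key.strip()] = value.strip()
    return secrets


def settings_from_secrets(secrets):
    """Map the hypothetical secret keys to the config parameters listed above."""
    return {
        "AsyncTransfer.serviceCert": secrets["HOST_CERT"],
        "AsyncTransfer.serviceKey": secrets["HOST_KEY"],
        "DBSPublisher.serverDN": secrets["SERVER_DN"],
        "DBSPublisher.serviceCert": secrets["HOST_CERT"],
        "DBSPublisher.serviceKey": secrets["HOST_KEY"],
    }
```

Reading the file from the path in ASYNC_SECRETS_LOCATION (rather than WMAGENT_SECRETS_LOCATION) would keep the ASO and wmagent setups separate, as asked for above.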

[Hassen] The /data/cfg/admin/InstallDev -s start command crashes with the following error if connectUrl is not set manually in the CRABServerAuth.py file. CMSWEBCRABManagement should be updated.

...
starting crabserver
Traceback (most recent call last):
  File "/data/hg1210a/sw.pre.mcinquil/slc5_amd64_gcc461/cms/crabserver/3.1.2pre2/bin/wmc-httpd", line 3, in <module>
    main()
  File "/data/hg1210a/sw.pre.mcinquil/slc5_amd64_gcc461/cms/crabserver/3.1.2pre2/lib/python2.6/site-packages/WMCore/REST/Main.py", line 488, in main
    cfg = loadConfigurationFile(args[0])
  File "/data/hg1210a/sw.pre.mcinquil/slc5_amd64_gcc461/cms/crabserver/3.1.2pre2/lib/python2.6/site-packages/WMCore/Configuration.py", line 559, in loadConfigurationFile
    raise RuntimeError, msg
RuntimeError: Unable to load Configuration File:
/data/current/config/crabserver/config.py
Due to error:
cannot import name connectUrlTraceback (most recent call last):
  File "/data/hg1210a/sw.pre.mcinquil/slc5_amd64_gcc461/cms/crabserver/3.1.2pre2/lib/python2.6/site-packages/WMCore/Configuration.py", line 552, in loadConfigurationFile
    modPath[1], modPath[2])
  File "/data/current/config/crabserver/config.py", line 2, in <module>
    from CRABServerAuth import connectUrl
ImportError: cannot import name connectUrl
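
A possible pre-flight check for this failure, as a sketch only: it inspects the text of an auth file and reports which required names are not assigned at top level, without importing or executing the file. The function names are hypothetical; only the name connectUrl comes from the traceback above.

```python
import ast


def defined_names(source):
    """Collect names bound by simple top-level assignments in Python source text."""
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    names.add(target.id)
    return names


def missing_settings(source, required=("connectUrl",)):
    """Return the required names that the source does not define."""
    found = defined_names(source)
    return [name for name in required if name not in found]
```

Running missing_settings on the contents of CRABServerAuth.py before starting the server would turn the ImportError into an explicit "connectUrl is not set" message.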

[Marco] Everything seems to work fine. There were a couple of things I had to update in the CMSWEBCRABManagement twiki, but it should be ok now.

[Mattia] Need to apply a couple of patches, one on client and one on server.

Status --> MATTIA

This should include the monitoring of the transfers.


[Test Result]:

[Mattia] Generic issue (not just status): the client version in CRABClient/__init__.py should be changed since on the server we always see 3.0.7, and therefore we cannot distinguish which client is making the request from the log file.

[Mattia] Status layout: when no information is available for a line, we could drop the line when printing (e.g. "Details: ", "Using (0) sites:", etc.); when we have resubmissions we could show the percentage of resubmitted jobs in brackets, aligned with the other commands.

  • [Marco] 3945 and 3963 take care of this.
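
The layout suggestion can be sketched as follows. This is only an illustration of dropping empty lines when printing; format_status is a hypothetical helper, not the client's actual code.

```python
def format_status(fields):
    """Render 'label: value' lines, skipping fields that carry no information."""
    lines = []
    for label, value in fields:
        # None, empty strings, and empty collections are treated as "no information".
        if value is None or value == "" or value == [] or value == {}:
            continue
        lines.append("%s: %s" % (label, value))
    return "\n".join(lines)


print(format_status([("Task Status", "running"), ("Details", ""), ("Sites", [])]))
```

With the input above only the "Task Status" line is printed.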

[Mattia] Task status is often delayed: while the task is still shown as acquired it is already possible to see jobs, and it becomes running only after a while.

  • [Mattia] The problem is temporary: the job status is updated directly by the analytic components from the agent, while the workflow status is updated through the LWQ-GWQ-ReqMgr chain. The issue could be solved by reporting the workflow status through the analytic component as well.

[Mattia] Queued status: this appears and disappears; it corresponds to new jobs being pulled by the agent and already processed by the local wq (the value is often 100%).

  • [Meeting] this could be fixed by improving Global WQ monitoring.

  • [Marco] Trying to explain a bit further here. Imho, the real problem is that we don't know the total number of jobs in advance. Say 3 jobs are created (status shows queued 100%) and then submitted (status shows submitted 100%), and say that another 3 jobs then arrive and are created. Queued jobs will appear again and the status will show (queued 50% submitted 50%).
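
Marco's example can be reproduced with a few lines. This is a sketch that assumes the percentages are computed over the jobs known so far; state_percentages is a hypothetical helper.

```python
from collections import Counter


def state_percentages(job_states):
    """Percentage of jobs in each state, relative to the jobs known so far."""
    counts = Counter(job_states)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {state: 100.0 * n / total for state, n in counts.items()}


# Three jobs are created, then submitted, then three more are created.
print(state_percentages(["queued"] * 3))                      # {'queued': 100.0}
print(state_percentages(["submitted"] * 3))                   # {'submitted': 100.0}
print(state_percentages(["submitted"] * 3 + ["queued"] * 3))  # {'submitted': 50.0, 'queued': 50.0}
```

Since the denominator grows as new jobs are created, the "queued" share can reappear even though nothing went backwards.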

[Mattia] "status -f": sometimes it doesn't return anything more than the normal status, while geterror actually works. We need to investigate why this happens.

  • [Marco] Can it be that status -f shows jobs waiting resubmission in the cooloff state (they are taken from the JSM), while geterror looks at the FWJR and shows the error of these jobs? If so, in my opinion this makes sense.

[Mattia] Transfer details might need to be shown slightly differently: I get 100% success jobs, and for these the transfer details show 100% new and 100% total; in this case I do not see the point of showing the total. At some point both new and total dropped to 74.6% (success jobs were still 100%).

[Mattia] Submission failure is correctly reported on the client, but there is an inconsistency with geterror, which returns different failure reasons not related to submission failures (see the geterror report).

Kill --> MATTIA

The kill should be performed and tracked correctly by the monitoring.


[Test Result]:

[Mattia] Kill is generally working; in many cases the wf status is switched to Aborted and then it takes some time to really cancel the jobs. Various cases tested:

  • Kill with 100% cooloff works: given the various delays, the jobs may get resubmitted in the meantime, so they will appear as failure 100% - exception 100%.
  • Kill when the request is assignment-approved (not yet in GWQ) works, user gets a detailed (redundant) message.
  • Kill when in the global queue but not in the local: ok.
  • Kill when in the local: ok; jobs may appear as queued if already processed by the JobCreator.
  • Kill when already submitted by the agent: ok (kill takes a while to be propagated to the local wq, but then it works fine).
Generally OK: most of the issues (all minor, in general) are related to the architecture.

[Marco]

  • This patch 4085 is needed, otherwise killed jobs are not always correctly reported
  • I am not able to resubmit a killed workflow. I am not sure if this is a bug or a feature, but in my opinion it can be useful to resubmit a killed workflow.
    [lxplus411] /afs/cern.ch/user/m/mmascher/wf > crab status -t crab_lumi_kill_rsbm_big5
    Registering user credentials
    Task Status:      aborted
    Details:
      canceled 100.0 %   
    Log file is /afs/cern.ch/user/m/mmascher/wf/crab_lumi_kill_rsbm_big5/crab.log
    [lxplus411] /afs/cern.ch/user/m/mmascher/wf > crab resubmit -t crab_lumi_kill_rsbm_big5
    Registering user credentials
    Error contacting the server.
    Server answered with: Invalid input parameter
    Reason is: Impossible to resubmit a workflow with pending jobs.
    Log file is /afs/cern.ch/user/m/mmascher/wf/crab_lumi_kill_rsbm_big5/crab.log
    

Get-errors --> MATTIA

The new command should work fine also including transfers.


[Test Result]:

[Mattia] It always returns the most recent information, but things are not always synchronized (what is shown in the status does not always correspond to what geterror returns, as in the kill case).

[Mattia] The server can also handle the quantity and per-exit-code parameters: the client needs to be adapted.

* [Marco] Fixed in 3944

[Mattia] When a submission failure happens due to an agent issue, geterror doesn't return the correct failure errors: the problem is not with geterror, but with the agent not being able to report the error. I have several cases where the job failures are related to

2012-09-13 02:41:38,252:ERROR:JobSubmitterPoller:Could not find pickled jobObject /data/WMAgent/WMCORE.0.9.10/v01/install/wmagent/JobCreator/JobCache/mcinquil_crab_Test_an_100_120912_180853/Analysis/JobCollection_316_0/job_401/job.pkl
which looks like an agent failure at submission time (more investigation is needed, and we need to evaluate whether this is already fixed in the latest agent version). The problem is that the job name used to update the monitoring is taken by the submitter from the job.pkl itself, so we have a dog chasing its tail. For the monitoring this could be resolved by loading the job name from Oracle when the job.pkl is missing, but we should aim at fixing the underlying issue in the agent, if it isn't already fixed in the latest version.
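
The monitoring-side workaround could look roughly like the sketch below. It assumes the pickled job behaves like a dict with a "name" key, and lookup_name stands in for whatever database query would return the name; both are assumptions for illustration, not the agent's actual interfaces.

```python
import pickle


def job_name_for(pkl_path, job_id, lookup_name):
    """Return the job name from job.pkl, or from lookup_name(job_id) if the pickle is unusable."""
    try:
        with open(pkl_path, "rb") as handle:
            job = pickle.load(handle)
        return job["name"]
    except (OSError, EOFError, pickle.UnpicklingError):
        # job.pkl is missing or unreadable: fall back to the (hypothetical) database lookup.
        return lookup_name(job_id)
```

This only keeps the monitoring consistent; it does not address why the job.pkl is missing in the first place.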

Web monitoring --> MARCO

Verify that the web monitoring of WMStats works as expected.


[Test Result]:

[Marco] The monitoring page starts, however:

  • if you click on a request you are redirected to an address like https://.../reqmgr/... instead of https://.../an_reqmgr/...
  • I am not able to look at any graph. I get an error "TypeError: RequestView.draw is not a function". Not sure if it is related to my configuration.

Publication --> MARCO + HASSEN

Verify that the DBS publication works fine, including using the final site name and correctly creating blocks.


[Test Result]:

[Hassen] It seems to work fine after applying https://github.com/dmwm/AsyncStageout/pull/3945.

[Mattia] It also needs this.

[Mattia] There is an issue with authN/Z with cmsdbsprod.cern.ch, reported on the crab feedback hypernews, which prevents publishing there, so I could not complete the publication step yet because of the 403 error the DBSPublisher component is getting:

 2012-09-24 20:47:32,270:INFO:PublisherWorker:Starting data publication for: mcinquil_crab_test_publish_03_monday_120924_171044
2012-09-24 20:47:32,593:ERROR:PublisherWorker:Error migrating datasetFailed to connect in 0 retry attempt(s)

Call to DBS Server (https://cmsdbsprod.cern.ch:8443/cms_dbs_ph_analysis_02_writer/servlet/DBSServlet) failed
HTTP ERROR Status '403', 
Status detail: 'Access to the specified resource (DN provided not found in grid-mapfile.) has been forbidden.'Traceback (most recent call last):
  File "/data/ASO/012pre2/v01/sw.pre/slc5_amd64_gcc461/cms/asyncstageout/0.1.2pre2/lib/python2.6/site-packages/AsyncStageOut/PublisherWorker.py", line 404, in publishInDBS
    migrateAPI.migrateDataset(inputDataset)
  File "/data/ASO/012pre2/v01/sw.pre/slc5_amd64_gcc461/cms/dbs-client/DBS_2_1_9/lib/DBSAPI/dbsMigrateApi.py", line 38, in migrateDataset
    self.migrateDataset(aparentds['PathList'][0])
  File "/data/ASO/012pre2/v01/sw.pre/slc5_amd64_gcc461/cms/dbs-client/DBS_2_1_9/lib/DBSAPI/dbsMigrateApi.py", line 43, in migrateDataset
    if self.doesPathExist(self.apiDst, path):
  File "/data/ASO/012pre2/v01/sw.pre/slc5_amd64_gcc461/cms/dbs-client/DBS_2_1_9/lib/DBSAPI/dbsMigrateApi.py", line 114, in doesPathExist
    datasets = api.listProcessedDatasets(patternPrim = tokens[1], patternProc = tokens[2], patternDT = tokens[3])
  File "/data/ASO/012pre2/v01/sw.pre/slc5_amd64_gcc461/cms/dbs-client/DBS_2_1_9/lib/DBSAPI/dbsApi.py", line 213, in listProcessedDatasets
    raise ex
DbsConnectionError: Failed to connect in 0 retry attempt(s)

The above issue happens when X509_HOST{CERT|KEY} point to the local host certificates.

Get-log --> MARCO

Verify that it works fine also with the --exitcode option.


[Test Result]:

Looks ok, but in case saveLogs=True we return the full tar.gz containing the logs of all jobs.

Get-output --> ALL

[Test Result]:

[Mattia]: getoutput is failing: it seems to always retrieve the output from the ASO remote destination, even if not all outputs have been transferred yet; in fact crab status reports that 0.9% of transfers are done and that jobs have run at different sites

Using 6 site(s):   
Details:
  failure 5.0 %      (  submit 100.0 %  )
  success 95.0 %      (Transfers details:   acquired 1.8 %    new 97.4 %    total 100.0 %    done 0.9 %  )
T2_ES_CIEMAT: 
    success 24.2 %   
T2_DE_DESY: 
    success 25.0 %   
T2_US_Florida: 
    success 8.3 %   
T2_IT_Legnaro: 
    
T2_UK_London_IC: 
    failure 5.0 %   success 20.8 %   
T2_FR_GRIF_LLR: 
    success 16.7 %   
While getoutput should retrieve them from the temporary location, it retrieves the files from the remote destination using the /store/temp location:
Command finished
Failed retrieving file A0222ADA-11FD-E111-AF2E-842B2B758BAA.root
Error:
     no such file or directory ///pnfs/lcg.cscs.ch/cms/trivcat/store/temp/user/mcinquil/doublemu/1347476721/v1/00000/a0222ada-11fd-e111-af2e-842b2b758baa.root
     no such file or directory
Retrieving file '44B426E2-13FD-E111-A8A9-001EC9F87D4A.root' 
Executing 'lcg-cp --connect-timeout 20 --sendreceive-timeout 240 --verbose -b -D srmv2  --srm-timeout 60  srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/temp/user/mcinquil/DoubleMu/1347476721/v1/00000/44B426E2-13FD-E111-A8A9-001EC9F87D4A.root file:///afs/cern.ch/user/m/mcinquil/scratch1/3/3.1.2-test/crab_Test_an_101/results/44B426E2-13FD-E111-A8A9-001EC9F87D4A.root' 
and the only one succeeding is the one which has already been transferred (the 0.9%); the agent monitoring reports the correct information about LFN and SITE, but the ASO user transfer monitoring database has many entries like this:
{"id":"628d916a7001c07e9d2b9800ad402aa1","key":"628d916a7001c07e9d2b9800ad402aa1","value":{"rev":"1-2da02ed677652dd46f11b18579d4887b"},"doc":{"_id":"628d916a7001c07e9d2b9800ad402aa1","_rev":"1-2da02ed677652dd46f11b18579d4887b","workflow":"mcinquil_crab_Test_an_100_120912_180853","checksum":{"adler32":"32bc44d2","cksum":"3932434898"},"lfn":"/store/temp/user/mcinquil/DoubleMu/1347473310/v1/00000/324A0FCE-06FD-E111-8BED-00259057492E.root","jobid":"5bdc08ba-fd06-11e1-bb49-0026b95c499b-0","retry_count":1,"state":"new","location":"T2_CH_CSCS","timestamp":1347475893,"type":"aso_file","size":1330287}}
and this is most probably due to the fact that the map step of the FilesByWorkflow view returns all the files regardless of their status, which means that the CRAB-Interface will use them for the getoutput. A fix is urgently needed.

[Hassen] I think that is the correct behavior of FilesByWorkflow. We do not need the state there; we just need the updated LFN and the site hosting the output. An LFN set to a temporary area means that the output is not yet transferred, while an LFN set to a permanent area means that the output is transferred.

[Hassen] Here is the patch for this bug: https://github.com/dmwm/AsyncStageout/pull/3949
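
The rule Hassen describes can be sketched as below, assuming that the temporary area is exactly the /store/temp/ prefix seen in the LFNs above; the helper names and the second, permanent-area LFN are made up for the example.

```python
TEMP_PREFIX = "/store/temp/"


def is_transferred(lfn):
    """An LFN outside the temporary area is taken to mean the transfer is done."""
    return not lfn.startswith(TEMP_PREFIX)


def split_by_transfer_state(records):
    """Split records with an "lfn" field into (transferred, pending) lists."""
    transferred, pending = [], []
    for record in records:
        (transferred if is_transferred(record["lfn"]) else pending).append(record)
    return transferred, pending


records = [
    # LFN taken from the monitoring entry quoted above.
    {"lfn": "/store/temp/user/mcinquil/DoubleMu/1347473310/v1/00000/324A0FCE-06FD-E111-8BED-00259057492E.root"},
    # Illustrative permanent-area LFN (made up for this example).
    {"lfn": "/store/user/example/DoubleMu/v1/00000/example.root"},
]
done, waiting = split_by_transfer_state(records)
print(len(done), len(waiting))  # 1 1
```

Under this reading, getoutput would fetch the pending files from the site hosting the temporary copy and only the transferred ones from the final destination.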

Lumi-mask + resubmit --> MARCO

Just re-introduced: also verify that it is working as expected with the resubmission.


[Test Result]:

[Marco] Feature tested and working fine

Log collect --> MARCO

Verify that:

* savelogs=false: log collect jobs are not sent and the wf is closed

* savelogs=true: log collect jobs work fine and are not shown

[Test Result]:

[Marco] It looks fine to me.

ASO and cmsweb --> HASSEN

Verify that ASO works fine with the new tag of CMSWEB.

[Test Result]:

[Hassen] It seems to work fine.

[Mattia] Works fine to me too.

Optimize ASO parameters in files_database --> HASSEN

[Test Result]:

[Hassen] Requires https://github.com/dmwm/AsyncStageout/pull/3944 and https://github.com/dmwm/WMCore/pull/4074 and then it seems to work fine.

ASO monitoring for operators --> HASSEN

[Test Result]:

[Hassen] It seems to work fine.

Pending issues

Duplicated analyzed files and lumis

It seems there are problems in the LumiBased splitting: the algorithm seems to assign the very same file (and related lumi sections) to two consecutive jobs. Both CouchDB documents and MySQL queries reveal this. I've opened a new issue in WMCore.

[Mattia] The issue has been understood. The problem concerns only the files which are apparently assigned to a job; in reality the lumi mask is the object used to make the assignment. Everything is reported in the issue linked above. We can consider this ok to proceed.

Datatier issue --> MARCO

If the DataTier is not defined it should default to "USER". Opened 3956.

New issues

* [Mattia] I've seen the ASO failing like this and dying more than once:
2012-09-13 06:45:52,294:INFO:TransferWorker:failed : [u'/store/temp/user/mcinquil/DoubleMu/1347473310/v1/00000/BE47B116-0BFD-E111-9EA8-60EB69BACA30.root', u'/store/temp/user/mcinquil/DoubleMu/1347476721/v1/00000/AAAF4D91-13FD-E111-B7D2-60EB69BACA8C.root', u'/store/temp/user/mcinquil/DoubleMu/1347476721/v1/00000/CE76FBDD-19FD-E111-9B27-00266CF85EC8.root']
2012-09-13 06:53:05,716:INFO:TransferWorker:failed : [u'/store/temp/user/mcinquil/DoubleMu/1347473310/v1/00000/BE47B116-0BFD-E111-9EA8-60EB69BACA30.root', u'/store/temp/user/mcinquil/DoubleMu/1347476721/v1/00000/AAAF4D91-13FD-E111-B7D2-60EB69BACA8C.root', u'/store/temp/user/mcinquil/DoubleMu/1347476721/v1/00000/CE76FBDD-19FD-E111-9B27-00266CF85EC8.root']
2012-09-13 06:59:45,438:INFO:TransferWorker:failed : [u'/store/temp/user/mcinquil/DoubleMu/1347473310/v1/00000/BE47B116-0BFD-E111-9EA8-60EB69BACA30.root', u'/store/temp/user/mcinquil/DoubleMu/1347476721/v1/00000/AAAF4D91-13FD-E111-B7D2-60EB69BACA8C.root', u'/store/temp/user/mcinquil/DoubleMu/1347476721/v1/00000/CE76FBDD-19FD-E111-9B27-00266CF85EC8.root']
2012-09-13 06:59:45,975:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
  <AsyncStageOut.TransferDaemon.TransferDaemon instance at 0x1b53ccb0> 'dbSource_update'  File "/data/ASO/012pre2/v01/sw.pre/slc5_amd64_gcc461/cms/asyncstageout/0.1.2pre2/lib/python2.6/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 160, in __call__
    self.algorithm(parameters)
  File "/data/ASO/012pre2/v01/sw.pre/slc5_amd64_gcc461/cms/asyncstageout/0.1.2pre2/lib/python2.6/site-packages/AsyncStageOut/TransferDaemon.py", line 86, in algorithm
    self.logger.info(result.get())
  File "/data/ASO/012pre2/v01/sw.pre/slc5_amd64_gcc461/external/python/2.6.8-comp2/lib/python2.6/multiprocessing/pool.py", line 422, in get
    raise self._value

2012-09-13 06:59:45,976:INFO:Harness:>>>Terminating worker threads
2012-09-13 06:59:46,038:ERROR:BaseWorkerThread:Error in event loop (2): <AsyncStageOut.TransferDaemon.TransferDaemon instance at 0x1b53ccb0> 'dbSource_update'
Backtrace:
  File "/data/ASO/012pre2/v01/sw.pre/slc5_amd64_gcc461/cms/asyncstageout/0.1.2pre2/lib/python2.6/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 189, in __call__
    raise ex

2012-09-13 06:59:46,038:INFO:BaseWorkerThread:Worker thread <AsyncStageOut.TransferDaemon.TransferDaemon instance at 0x1b53ccb0> terminated
Restarting seems to help, but the problem seems to recur.

[Hassen] It is a bug and here is the patch: https://github.com/dmwm/AsyncStageout/pull/3944

[Hassen] The crabserver crashes with this error when starting:

bash-3.2$ tail -10 logs/crabserver/crabserver-20120924.log
[24/Sep/2012:10:43:54]  ERROR:       module = __import__(module_name, globals(), locals(), [class_name])
[24/Sep/2012:10:43:54]  ERROR:     File "/data/hg1210a/sw.pre.mcinquil/slc5_amd64_gcc461/cms/crabserver/3.1.2pre2/lib/python2.6/site-packages/CRABInterface/RESTBaseAPI.py", line 8, in <module>
[24/Sep/2012:10:43:54]  ERROR:       from CRABInterface.RESTUserWorkflow import RESTUserWorkflow
[24/Sep/2012:10:43:54]  ERROR:     File "/data/hg1210a/sw.pre.mcinquil/slc5_amd64_gcc461/cms/crabserver/3.1.2pre2/lib/python2.6/site-packages/CRABInterface/RESTUserWorkflow.py", line 7, in <module>
[24/Sep/2012:10:43:54]  ERROR:       from CRABInterface.DataUserWorkflow import DataUserWorkflow
[24/Sep/2012:10:43:54]  ERROR:     File "/data/hg1210a/sw.pre.mcinquil/slc5_amd64_gcc461/cms/crabserver/3.1.2pre2/lib/python2.6/site-packages/CRABInterface/DataUserWorkflow.py", line 237
[24/Sep/2012:10:43:54]  ERROR:        def submit(self, *args, **kwargs):
[24/Sep/2012:10:43:54]  ERROR:                                         ^
[24/Sep/2012:10:43:54]  ERROR:    IndentationError: unindent does not match any outer indentation level
[24/Sep/2012:10:43:54]  WATCHDOG: server exited with exit code 1

[Hassen] Fixed by applying this patch: https://github.com/dmwm/CRABServer/pull/3960

[Marco] There is a typo in WMCore and we have jobs in the "runnig" state:

submitted 99.3 %      (  runnig 0.7 %    running 16.8 % 
Fixed in https://github.com/dmwm/WMCore/pull/4107

[Hassen] The "Creating the oracle database" section in CMSWEBCRABManagement is updated to use an_reqmgr instead of reqmgr

-- MattiaCinquilli - 10-Sep-2012

Topic revision: r28 - 2012-09-25 - MattiaCinquilli