LHC has some small projects running on BOINC, a working volunteer computing platform that can be used for any branch of science. Theory, ATLAS and LHCb already run jobs on it and have the proper interfaces; CMS doesn't (yet). The goal of this project is to find the best possible model for interfacing both production and analysis CMS jobs with this platform, which is quite different from the CMS sites we are used to.


Initial considerations

The best way to start is to compare the two environments - BOINC worker nodes vs. Grid sites - and, based on that, bring up the needed (conceptual) modifications to the system, followed by their technical implementation:

  • BOINC worker nodes
    • Simple submitter; stage-in and stage-out limited to port 80 over HTTP
      • No defined Globus CE, so it won't interface easily with our submitters unless we add a Globus-like interface.
    • May have network issues (bandwidth) - you never know who or what is on the other end (or where)
      • This may require changing how jobs are submitted: input file size, using unmerged instead of merged files, adapting parameters so the job runs long (or short) enough, etc.
    • May need to run the same job twice or more, in order to ensure it finishes. Some control to prevent unnecessary work once the first "twin" job finishes and stages out would be nice.
  • Grid sites
    • All the work is already done
    • Merged files are used; a Globus CE is defined
    • Network expected to be good (big input and output files)
    • Stage-out to an SE.
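The "twin job" control mentioned above can be sketched as a small submitter-side bookkeeping layer. Everything below is hypothetical (class and job names are made up; a real implementation would call the BOINC/Co-Pilot APIs instead of returning a set to cancel), but it shows the idea: mark the job done on the first successful stage-out and cancel the remaining replicas.

```python
# Sketch: cancel redundant "twin" replicas once one copy of a job succeeds.
# Names are hypothetical; the stubs stand in for real BOINC/Co-Pilot calls.

class ReplicatedJob:
    def __init__(self, job_id, n_replicas=2):
        self.job_id = job_id
        # each replica gets a distinct id, e.g. mcprod_0042.0, mcprod_0042.1
        self.replicas = {f"{job_id}.{i}" for i in range(n_replicas)}
        self.done = False

    def on_replica_finished(self, replica_id, staged_out_ok):
        """On the first successful stage-out, mark the job done and
        return the set of still-running replicas that should be cancelled."""
        if not staged_out_ok or self.done:
            return set()
        self.done = True
        return self.replicas - {replica_id}

job = ReplicatedJob("mcprod_0042", n_replicas=2)
to_cancel = job.on_replica_finished("mcprod_0042.0", staged_out_ok=True)
# to_cancel now holds the twin that is no longer needed
```

BOINC itself already supports replication at the workunit level (quorum/replica settings), so in practice this logic might live there rather than in our submitter; the sketch is only the conceptual shape of the control.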

Needed adaptations for the BOINC system

Looking at the comparison we can already see the first differences, and start to think about the first solutions:

  • Stage-in/out - either has to travel along with the job (less control) or should be fetched and put with a standard HTTP-like protocol. For now we have chosen the second, as we keep control of the code that does it and can add the needed interfaces/verifications.
  • Standard submission interface - Globus - it seems the Co-Pilot system still doesn't have a standard interface for job submission (not as standard as Globus), so if we make one we will make our lives much easier when integrating the existing CMS job-submission systems, and open a whole new set of possibilities for any other experiment/project that already submits with Globus (really, many).
  • File size issue - not for now, but we can think about running against unmerged files/datasets and, only at the end of processing, running merge jobs at the site which holds the unmerged files [Low Prio]

Technical implementations

For now, the only problem we want to get rid of is stage-in/out, as it's the most fundamental and it changes the basic script of a CMS job; only after that can we think about scaling and real tests. So the first step is to run a single job, or a bunch of jobs, successfully on a BOINC system (by hand, so it's quicker). The solution found so far is:

* Staging in/out to a WebDAV-enabled SE - this protocol is very similar to HTTP and is made for stage-in/out, and fortunately both main SE systems (for T2s) in CMS support it - dCache (since > 1.9.11) and Hadoop (natively). This also solves the authentication problem: sites can set up public temp storage, and jobs can stage in/out to/from there without needing X509 credentials, which is better in an environment where you don't know where your proxy is going. For now this satisfies the communication requirements set by the Co-Pilot team. It is probably the best way to go; after everything is done, a trusted agent with X509 credentials can transfer the results to the proper place and clean up the leftovers.

Preliminary tests [unfinished]

I could successfully "extract" a created workflow from CRAB and run it manually on the grid; what will help you reproduce the test is the -createJdl option. There's also the wrapper of the job and some other things that I don't remember right now but will add - this section should be considered unfinished. Once you have the JDL for one or more jobs, you can just submit it by hand.

The recipe for CRAB is:

  • create your task (MC is better)
  • run with -createJdl - but don't touch the produced JDL
  • look in the directory near the JDL that the previous command spits out; there's a job script under job/
  • there's a stageOut section, where you should include some cadaver lines, like:
mv netrc ~/.netrc
echo "put ${file_list}" > davcommands.txt
cadaver webdav://  < davcommands.txt
rm ~/.netrc
  • one should remove the grid stuff, but make sure that doesn't break the job (non-0 exit code). I didn't remove it and it works, but it does 2 stage-outs.
  • submit the job with glite-wms-job-submit using the created JDL - there you go with your task =)

Working instances:

/afs/ --- first test that worked with cadaver; also contains an MC configuration

/afs/ -- made to use a local CRAB

Interesting notes:

This can be found in the wsCopyOutput class inside

        if int(self.copy_data) == 1:
            # resolve the stage-out endpoint via the PhEDEx data service
            stageout = PhEDExDatasvcInfo(self.cfg_params)
            endpoint, lfn, SE, SE_PATH, user = stageout.getEndpoint()
            if self.check_RemoteDir == 1:
                # optionally verify the remote directory for the output list
                self.checkRemoteDir(endpoint, jbt.outList('list'))

[lxplus407] /afs/ > source /tmp/samir/CRAB_2_7_9_patch1/

Tricky WMAgent tips

./config/wmagent/manage execute-reqmgr requestDbAdmin add -u cmsdataops -g cmsdataops

./config/wmagent/manage execute-reqmgr requestDbAdmin add -v all

This is to open the factory port.

Relevant issues

  • Memory needed by CMSSW on the client node -- Marko says that with the VM and everything inside, it's 2.5 GB
  • Output size - we're taking care of it. Small MC shouldn't be a problem.

-- SamirCury - 09-Nov-2011

Topic revision: r6 - 2020-08-18 - TWikiAdminUser