
pre-GDB on Volunteer Computing, CERN, November 11, 2014


BOINC - N Høimyr

Reasons to use volunteer computing in HEP

  • Volunteers: uncoordinated, unpledged/opportunistic, allow engaging with new people/communities
  • Opportunistic use of institution desktop resources: coordinated, unpledged/opportunistic
  • Small to medium cluster: coordinated, pledged, easier to deploy than grid MW

BOINC: main MW for volunteer computing

  • To be a volunteer: 3 steps on Linux
    • Get BOINC client
    • Get VirtualBox (installed by BOINC client on Windows)
    • Attach to a project with the boinc-client utility
  • Designed not to disturb the user: many (optional) features ensure that jobs are suspended/killed when the machine is used by its main user
  • Provide a GUI to manage the client
  • Core client is a daemon

Key concept to motivate volunteers: credits

  • Based on the host benchmark; used to calculate a user's contribution to a project
  • Credits collected from multiple projects
  • Statistics produced exposing the credits of each user
  • The 2 main projects (SETI@home and Einstein@home) have collected 600 TFlops each...
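The benchmark-based credit idea above can be sketched as follows. This is a simplified illustration, not BOINC's actual accounting code: the 200-credits-per-GFLOPS-day rate is the classic "cobblestone" reference definition, and real BOINC servers apply further corrections and cross-validation.

```python
# Hedged sketch of BOINC-style credit accounting (the "cobblestone").
# Assumption: ~200 credits per day of CPU time on a 1 GFLOPS (Whetstone)
# host; real servers adjust and cross-check claimed credit.

COBBLESTONES_PER_GFLOPS_DAY = 200.0  # reference rate (assumption)

def claimed_credit(benchmark_gflops: float, cpu_seconds: float) -> float:
    """Credit a host would claim for cpu_seconds of work, from its benchmark."""
    days = cpu_seconds / 86400.0
    return benchmark_gflops * days * COBBLESTONES_PER_GFLOPS_DAY

# A 2.5 GFLOPS host running a 6-hour job:
print(claimed_credit(2.5, 6 * 3600))  # 125.0
```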

Use of virtualization (VirtualBox) pioneered at CERN in 2010-11 with Test4Theory

  • Now mainstream BOINC code
  • Several projects outside CERN using it now
  • Classic BOINC still usable but more difficult: unpredictable environment, labour intensive application building/testing
    • At CERN, only SixTrack uses classic BOINC
  • When using virtualization, BOINC acts as a VM scheduler
  • @CERN, images based on CERNVM + CVMFS
    • Interaction with experiment frameworks (PanDA, DIRAC...)

Volunteers vs. institutional desktops/clusters

  • With volunteers, publicity and marketing are essential, whereas they are less relevant for institutional resources
  • More controlled environments in institutional resources
  • Volunteers/desktops: CPU-intensive apps mainly

BOINC @CERN: now a generic service supported by IT/PES (0.5 FTE)

  • BOINC server (LHC@home) for CERN-related projects
    • Relying on DB on Demand for the BOINC MySQL DB
  • Application support: in particular monitoring
  • Using central NFS service
  • Drupal portal

Planned improvements

  • Scalability: split BOINC server into several servers
  • Tools to automate app deployment
  • Security: https rather than http, alternative login methods (e.g. OpenID)

XtremWeb/XWHEP - O. Lodygensky

XtremWeb started as a project to support multiple projects on the same computing infrastructure while offering some authentication/security features

  • Forked in 2006 as XWHEP, the actively developed production version based on XtremWeb
  • Many new features added in XWHEP for authorization and data management
  • 3-tier architecture: the user wanting to run apps, the XWHEP server and the volunteer resources
    • P2P direct communication possible between the app user and the volunteer resources

Authorization based on user profiles: administrator, group administrator, standard user, worker user

  • A standard user can submit apps, manage their private apps and retrieve results
    • A private app, until certified by a group admin, can be seen/used only by the user who submitted it
  • Worker user: used only to deploy the worker, no other rights
  • Access rights, similar to Unix files, associated with all objects
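The Unix-style access rights mentioned above can be sketched as an owner/group/other rwx check. All names and the API here are illustrative, not XWHEP's actual code; only the permission logic mirrors the description.

```python
# Sketch of Unix-style access rights on XWHEP objects (illustrative names,
# not the actual XWHEP API). Each object carries an owner, a group and an
# rwx mode, checked in owner -> group -> other order, as for Unix files.

from collections import namedtuple

XWObject = namedtuple("XWObject", "owner group mode")  # mode e.g. 0o750

def can_access(obj, user, user_groups, perm):
    """perm: 4=read, 2=write, 1=execute, as in Unix file modes."""
    if user == obj.owner:
        bits = (obj.mode >> 6) & 0o7
    elif obj.group in user_groups:
        bits = (obj.mode >> 3) & 0o7
    else:
        bits = obj.mode & 0o7
    return bool(bits & perm)

app = XWObject(owner="alice", group="lal", mode=0o750)
print(can_access(app, "alice", [], 2))     # owner may write: True
print(can_access(app, "bob", ["lal"], 4))  # group member may read: True
print(can_access(app, "eve", [], 4))       # others have no rights: False
```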

Virtualization supported by XWHEP: users can submit apps with a VirtualBox image

  • Based on an XWHEP feature allowing volunteers to declare which apps they support and to request payloads requiring those apps
    • Not restricted to virtualization support; can also be arbitrary apps
  • First implementation is VirtualBox, but the mechanism will be used to implement other kinds of virtualization (e.g. libvirt)
  • Implements VM contextualization based on HEPiX WG proposal
  • Demonstrated the ability to deploy CERNVM + ROOT
  • P2P feature used to give SSH access to VMs
  • No access allowed to the local network (protection) but unrestricted access to internet
  • Direct communications possible between different volunteer resources if allowed by volunteers
  • Several different kinds of apps using the platform

XWHEP has both a portal and REST API

XWHEP uses several medium-size infrastructures in best-effort mode

  • GRID5000: French computer science infrastructure
  • Qarnot: startup company providing heating systems for apartments based on computers

LHC@home - M. Giovannozzi

SixTrack simulations: studies about field quality in superconducting magnets

  • Critical to understand the stability of the beam particles and to predict beam lifetime and emittance growth
  • Easy to split a simulation into multiple independent jobs
  • Used originally during the design phase of the LHC magnets
  • Now used during operations to compare with the original simulations and refine the understanding of the beam with the measured magnet parameters
  • Now used for HL-LHC studies
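The "easy to split" point above comes from the structure of a tracking scan: each point of the parameter space is tracked independently. A minimal sketch, with illustrative parameter names (not SixTrack's actual job format):

```python
# Sketch of splitting a parameter-scan simulation into independent BOINC
# jobs: each (seed, amplitude) point is tracked on its own, so the scan is
# trivially parallel. Names and values are illustrative.

from itertools import product

def make_jobs(seeds, amplitudes, turns):
    """One self-contained job description per (seed, amplitude) point."""
    return [{"seed": s, "amplitude": a, "turns": turns}
            for s, a in product(seeds, amplitudes)]

jobs = make_jobs(seeds=range(1, 61), amplitudes=[2, 4, 6, 8], turns=1_000_000)
print(len(jobs))  # 60 seeds x 4 amplitudes = 240 independent jobs
```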

Key issue: numeric compatibility of results obtained on heterogeneous architectures
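One way to illustrate this issue: for chaotic tracking, even a last-bit difference between hosts eventually diverges, so results from heterogeneous architectures are typically compared bit-for-bit (e.g. via a digest of the raw doubles) rather than within a floating-point tolerance. This is a generic sketch, not SixTrack's actual validator.

```python
# Bit-exact comparison of floating-point results across hosts: hash the raw
# IEEE 754 representation rather than comparing with a tolerance.

import hashlib
import struct

def result_digest(values):
    """Bit-exact digest of a list of doubles (IEEE 754 little-endian)."""
    raw = b"".join(struct.pack("<d", v) for v in values)
    return hashlib.sha256(raw).hexdigest()

a = [0.1 + 0.2, 1.0 / 3.0]
b = [0.30000000000000004, 1.0 / 3.0]   # same bit patterns on IEEE 754 hosts
print(result_digest(a) == result_digest(b))  # True: identical bits
print(result_digest([0.3]) == result_digest([0.1 + 0.2]))  # False: 1-ulp off
```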

Attempt to keep track of volunteers and acknowledge them in all articles

  • Also on website

Future work

  • Reduce the length of jobs: split jobs into smaller ones, better for efficiency
  • Reduce size of output files
  • Possibility of performing collimation studies

BOINC and CERNVM - B. Segal

Original challenge: run real LHC physics on BOINC

  • Huge codebase, strict OS requirements

Early tests with virtualization (VMware) done in 2006 as a proof of concept but very far from a production solution

2008: CERNVM developed and became an enabling technology

  • Solved the image size problem of previous prototype

2009: CoPilot added, solving the VM credential problem and allowing integration with experiment frameworks

Philosophy: volunteer cloud

  • BOINC is only used to start VMs
  • Everything else done by experiment frameworks

No marketing/promotion effort but able to run a steady 2500 VMs per day

Now called vLHC@home.

  • Aim to attract other use cases from other experiments

Experiment Experiences

ATLAS - D. Cameron


  • (Almost) free resources
  • Public outreach

Design considerations

  • Low priority CPU-intensive jobs, typically non-urgent MC jobs
  • Virtualization environment for ATLAS SW, using CVMFS
    • Consequence: need a different image per version of ATLAS SW. Happens once every few months.
  • No (grid) credential on volunteer host: ARC CE used for staging data and interacting with data catalogs
    • Grid credentials not leaving the ARC CE
    • All jobs using the same credential: ATLAS production
  • Apart from these specificities, it looks like a regular PanDA queue
    • ARC CE inserts the job into the BOINC queue
    • The VM image to use is pre-loaded onto the BOINC server: all jobs use the same image
    • ARC CE and BOINC server on separate machines with an NFS shared file system

Work started early this year at IHEP (Beijing)

  • July: ARC CE + BOINC server moved to CERN
  • Recently: dedicated BOINC server replaced by a vLHC@home server

Simulation job profile

  • Full Athena jobs, 50 evts/job
  • Image is 1.1 GB (500 MB compressed): CERNVM + pre-cached SW
  • Input tarball is 1-100 MB
  • Output ~100 MB
  • Memory requirement : 2 GB/job
  • Some simple validation done at the end of the job
  • Physics validation done later

Issues found

  • Volunteers: entry barrier quite high, many volunteers never run a job
    • Requirements: 64-bit OS, 4 GB RAM, decent bandwidth
  • Instability/failing jobs: people not credited and thus moving away
    • Virtualization/VM wrapper causing a lot of problems
    • Firewall at the volunteer PC preventing access to conditions data...

Current status

  • Running 2000-3000 jobs continuously
  • 75% successful jobs
  • 50% efficiency
  • 28th simulation site in ATLAS

Very active message board: answering eats a lot of time...

  • In particular volunteers asking why their jobs fail or why they are not credited for their contribution

Lessons learned

  • Requires a lot of effort, in particular the interaction with volunteers
    • A "community follower" is needed
    • Some competent volunteers can help others
    • This is why these are not completely free resources...
  • Number of running jobs has reached a plateau: due to current setup, technical reasons understood and solutions being discussed with CERN IT
    • Mostly an issue of scale out (adding new boxes) + a few adjustments in the workflow (increased caching)
  • Main problems caused by vboxwrapper, but BOINC developers very enthusiastic and helpful
  • Not ready yet for pushing ATLAS@home in production but this is the goal
    • Could be an alternative to a grid CE for small sites

Discussion about pushing up the outreach effort

  • Infrastructure not yet ready but clearly needed to expand the number of volunteers
    • Involve the outreach team at CERN? Make the need for contributors much more visible
  • Main effort is to build vLHC@home as a platform usable by all experiments: not yet there
    • This coordinated outreach effort should be done at the platform level rather than by each experiment

CMS - L. Field

Test4Theory model

  • BOINC server submits a job wrapper that is in charge of starting and stopping the VM
    • Stopping after 24h so the contribution gets credited, as crediting happens only at the end of jobs
  • VM contains an agent that contacts CoPilot to get the actual job and the data (data bridge)
  • Challenge: CoPilot support
    • Relies on no longer maintained dependencies

CMS took the same model but with a different implementation

  • HTCondor used to push jobs to BOINC: a backend needs to be written
  • Authentication using standard web server authentication
    • BOINC puts user credentials in a MySQL DB and pushes them to the VM through /dev/fd0
    • mod_auth_ssl (Apache) used to authenticate against BOINC credentials
  • Data: http-based data bridge, allowing to use http dynamic federations
    • DataBridge is no more than http Dynamic Federation + mod_auth_sql to authenticate against BOINC
      • No development needed
    • Monitoring LHCb tests in Canada, promising results so far
    • Great scalability potential due to the redirect capabilities and the use of an S3 backend
    • Asynchronous stage-out after job completion through FTS built upon standard behaviour/feature of CMS jobs
  • BOINC is just used to instantiate the VM: could technically live without it, but important to keep it for the community of volunteers
    • BOINC is the main way to attract resources

Basic workflow, based on DataBridge, can be generalized

  • PUT job description in data bridge
  • PUT data in data bridge
  • Create a job agent with input and output parameters
  • Async. stage-out of output bucket
  • DataBridge could be used also for storage-less IaaS resources
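The generalized workflow above can be sketched against an in-memory stand-in for the bridge; the real service speaks plain HTTP PUT/GET, and the bucket keys used here are hypothetical, not actual DataBridge endpoints.

```python
# Sketch of the generalized DataBridge workflow: PUT job description and
# input, let the job agent GET them, then PUT the output bucket for
# asynchronous stage-out. A dict stands in for the HTTP bridge.

bridge = {}  # key -> bytes, standing in for the bridge's buckets

def put(key, payload):   # client/agent side: HTTP PUT in the real service
    bridge[key] = payload

def get(key):            # job agent side: HTTP GET in the real service
    return bridge[key]

# 1-2. PUT the job description and input data into the bridge.
put("jobs/42/description", b"run sim --events 50")
put("jobs/42/input", b"<input tarball bytes>")

# 3. The job agent fetches both, runs, and writes its output bucket.
desc, data = get("jobs/42/description"), get("jobs/42/input")
put("jobs/42/output", b"<output bytes>")

# 4. Asynchronous stage-out (e.g. via FTS) then drains the output bucket.
print(sorted(bridge))  # description, input and output buckets
```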

Status: still an early prototype, work started recently

  • Too early for reporting results

LHCb - F. Stagni

LHCb was in fact the first LHC experiment to start with BOINC but has not gone as far as ATLAS and CMS

  • Several generations of early prototypes done and abandoned in 2013 and 2014

Started with a very simple model of VMs contacting DIRAC directly, but it was not possible to use it in production in the volunteer computing environment

  • Requires credentials on the volunteer PC

Several issues found, such as job length and resource volatility; the solutions are applicable to other environments too

  • Job masonry to make the most of an available slot: running short MC jobs to fill the slot if needed
  • Smaller output: partly a result of shorter jobs, but also of splitting simulation and RECO into different jobs
  • Some issues not solved yet: job monitoring, accounting
  • CVMFS: no pre-caching, depends on volunteer location, suffered from the lack of Shoal support until now
    • May consider doing the same as ATLAS: images specialized for BOINC, with pre-cached SW
  • grid credentials: no solution adopted yet, very interested by DataBridge
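The "job masonry" idea above can be sketched as a greedy packing of a volunteer slot: place the longest jobs that fit, then top up the remainder with short MC fillers. Durations and names are illustrative, not LHCb's actual scheduler.

```python
# Sketch of "job masonry": fill a volunteer slot with the largest jobs
# that fit, topping up the leftover time with short MC filler jobs.

def fill_slot(slot_seconds, job_lengths, filler_length):
    """Greedy packing: long jobs first, then short MC fillers."""
    plan, remaining = [], slot_seconds
    for length in sorted(job_lengths, reverse=True):
        if length <= remaining:
            plan.append(length)
            remaining -= length
    while remaining >= filler_length:        # top up with short MC jobs
        plan.append(filler_length)
        remaining -= filler_length
    return plan, remaining

# A 24 h slot, three long jobs, 1 h MC fillers:
plan, idle = fill_slot(24 * 3600, [8 * 3600, 10 * 3600, 5 * 3600], 3600)
print(plan, idle)  # [36000, 28800, 18000, 3600] 0 -> slot fully used
```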


  • E. Lançon: important to have a BOINC server in charge of managing volunteers, to avoid exposing the pilot framework to (malicious) volunteers
    • LHCb also convinced of this!


Helge: is calling for "institutional" volunteers energy efficient? Not convinced. It may be better to set up a small dedicated cluster...

Image pre-caching: probably no alternative, as the VM must be restarted very regularly to get the contribution credited

  • Without caching, the price of the first download will be paid at every instantiation
  • Optimizing the cost of image management: may be worth evaluating the real impact of using a SW version different from the pre-cached one
    • Experience at grid sites shows that there is only a very minimal amount of different files between releases

Currently 3 different BOINC projects: SixTrack, vLHC and ATLAS

  • Still under discussion whether merging them is a goal or not: experimenting with app-level credits in BOINC

Michel: avoid the term "desktop grid": it tends to designate both desktops and clusters, but these are 2 very different resources

  • An institutional desktop remains a desktop, i.e. an untrusted environment
  • L. Field: must concentrate on volunteer computing (rather than opportunistic usage), as the real challenge is to attract and manage volunteers

Volunteers: how to get more of them? Do we really want more?

  • Must understand the "price" of volunteers in terms of support, manpower... Must find the right balance
  • Institutional desktops: low-hanging fruit, BOINC client installed by default, no concern about credits
  • Costs of running desktops for volunteer computing in HEP institutions compared to dedicated clusters
    • Several studies/pres on this but difficult to get a precise idea

Opportunistic resources: volunteer computing is not necessarily the appropriate response, as these are dependable resources that would be treated as volatile ones

  • VAC or clouds may be better alternatives
  • L. Field: the real showstopper for a technology to be used as a provisioning method for WLCG T2s is the possibility to do accounting
    • Still not understood how to do proper accounting in cloud and volunteer computing world
  • A. McNab: running multiple jobs in a VM is not that different from running multiple jobs in a pilot job...
  • Agreement that we must concentrate on volunteer computing, as the first goal is to get additional ("free") resources, as an alternative to commercial cloud providers
    • A similar effort is probably necessary to support using commercial clouds, but here the resources are free

F. Stagni: can we state the level of resources we must be able to provision from volunteers for the effort to be worthwhile?

  • L. Field: ATLAS said 500 usable cores was a minimum
    • O. Lodygensky: remember that 10% of volunteers actually providing resources is already a very good result

O. Lodygensky: it may be interesting to look at 3GBridge, developed by the EDGeS/EDGI projects, which allows bridging between many different computing infrastructures, including BOINC and grid

Outreach: clearly need a meeting similar to this one between the people involved in outreach efforts around volunteer computing


Use the mailing list for further discussions and for things specific to the LHC@home platform.

Probably another similar meeting next Spring.

Topic revision: r2 - 2014-11-11 - NilsHoeimyr