I want a T3, what should I buy?

Introduction

  • CMS Computing often gets asked this question: "I want a T3 at my institute, what should I buy and how do I have to operate it?"
  • There are many answers to this question, because what constitutes a T3 depends on what you want to do with it.
  • In the following, we try to give recommendations starting from the most basic use case (I want my students to have a place to run their analysis) to the most complicated use case (I want to be able to have GRID jobs run at my site).

I want my students to have a place to run their analysis

  • What you want is:
    • One or more "beefy" interactive machines with plenty of available disk space where people can login and run their analysis interactively.
  • An example "beefy" machine is:
    • Dual-CPU machine with 8 cores per CPU and hyper-threading enabled, resulting in 32 usable (logical) cores in total, 64 GB of memory, and lots of disk (suggestion: configure the disk space in RAID 6 to have some redundancy in case of broken disks)
      • An example configuration with 12 x 4 TB disks cost about $8.5k in December 2013.
  • CMS software (CMSSW) should be accessed via CVMFS. CVMFS is a network-mounted virtual file system, so after the initial client configuration you do not need to install CMSSW locally and you will always have access to the latest software releases.
  • Access to AOD and AODSIM data: storage.xml can be configured so that accessing an LFN (Logical File Name; all paths start with '/store/...') is done through the CMS data federation, Any Data, Anytime, Anywhere (AAA). AAA serves files via an XRootD redirector.
    • Attention: AAA streams CMS data over the wide-area network. The network requirement is at least 250 KB/s per CMS job.
  • Access to alignment and calibration conditions: must be done through the Squid proxy at the nearest T2, configured in site-local-config.xml. Your site should not try to access conditions directly from CERN; this puts too great a load on the central servers.
  • Maintenance: These systems can usually be maintained by a computing-interested grad student. Once configured, you should have little to do beyond keeping the machine(s) up to date (operating-system and security updates) and the occasional operating-system upgrade (approximately every 2 years).
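As a minimal sketch of the CVMFS client setup mentioned above, `/etc/cvmfs/default.local` could contain the lines below; the proxy host `squid.example.edu` is a placeholder for your nearest T2's Squid, and the cache size is just an illustrative value:

```shell
# /etc/cvmfs/default.local -- minimal CVMFS client configuration sketch.
# "squid.example.edu" is a placeholder; use your nearest T2's Squid proxy.
CVMFS_REPOSITORIES=cms.cern.ch
CVMFS_HTTP_PROXY="http://squid.example.edu:3128"
CVMFS_QUOTA_LIMIT=20000   # local cache size in MB (illustrative value)

# After installing the CVMFS client packages, activate and verify the mount:
#   cvmfs_config setup
#   cvmfs_config probe cms.cern.ch
```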

I want my students to be able to run their applications in a local batch system

  • What you need is:
    • One or more "beefy" interactive machines (See above for an example "beefy" machine.)
    • One or more batch nodes
    • Example batch node:
      • Dual-CPU machine with 6-8 cores per CPU and hyper-threading enabled, resulting in 24-32 usable (logical) cores in total. The machine should have at least 2 GB of memory per core (3 GB is better), i.e. between 48 and 96 GB. The recommended scratch space is 20 GB per core; the operating system itself generally uses less than 50 GB.
      • The above example configuration was about $4.5k in December 2013.
    • Mount the users' work space from the interactive machine(s) via NFS so that their batch jobs have access to their code/releases.
    • A batch system (Condor, PBS, SGE, or LSF) is installed on the interactive machine(s) and batch nodes.
    • CMS software (CMSSW) is mounted via CVMFS, so you don't need to install it locally and you always have access to the latest software releases.
    • Access to AOD and AODSIM data: Can be configured (storage.xml) so that accessing an LFN (file names starting with '/store/...') is done through the CMS data federation via AAA.
      • Attention: AAA streams data over the network, utilizing at least 250 KB/s of bandwidth per job.
    • Access to alignment and calibration conditions: done through the nearest T2 site's squid, configured in site-local-config.xml.
  • Maintenance: Suggest soliciting a very computing-interested grad student or postdoc to maintain security patches and handle the occasional operating-system upgrade.
  • Attention: The more cores that access files over the network, the larger the network connection needs to be! 1 Gbps can serve ~250 cores!
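The bandwidth rule of thumb above can be checked with a little arithmetic. The factor-2 headroom used here is an assumption that reconciles the 250 KB/s-per-job minimum with the ~250 cores per 1 Gbps figure:

```shell
# Jobs a network link can feed at a given per-job streaming rate,
# keeping a safety factor of spare capacity (integer arithmetic).
max_jobs() {
    local link_gbps=$1 per_job_kb_s=$2 headroom=$3
    echo $(( link_gbps * 1000000000 / 8 / (per_job_kb_s * 1000 * headroom) ))
}

max_jobs 1 250 2   # 1 Gbps, 250 KB/s per job, 2x headroom -> prints 250
```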

I want my students to be able to stage their CRAB job output to my site

  • First you should consider whether this is really necessary: CRAB output can straightforwardly be staged to the nearest T2 site's storage instead, and you can then read those outputs on your T3 through AAA.
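For instance, reading a file staged at the T2 back through AAA needs only an XRootD client and a valid grid proxy; in this sketch, the `/store/user` path is a hypothetical example, while `cms-xrd-global.cern.ch` is the global redirector of the CMS data federation:

```shell
# Copy a CRAB output file stored at the nearest T2 back to the T3 via AAA.
# The /store/user LFN below is a hypothetical example.
voms-proxy-init -voms cms   # a valid CMS grid proxy is required
xrdcp root://cms-xrd-global.cern.ch//store/user/jdoe/analysis/output.root /tmp/output.root
```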

  • What you need is:
    • one or more "beefy" interactive machines
    • an SRM server as endpoint, plus a shared file system
  • For the SRM server you need:
    • a separate machine accessible from the internet
    • the GRID software stack (EMI or OSG) installed for authentication
    • SRM server software (e.g. BeStMan)
  • You also need to configure the disk space attached to your interactive machine(s) as a shared file system (e.g. HDFS, the Hadoop Distributed File System), which can span several machines
    • If more than 50 TB is available to the file system, you need an additional node to host the file system's central services
    • You can set up your storage.xml so that LFNs are opened first from your local shared file system and then through the CMS data federation (fallback)
  • Maintenance: a computing-interested postdoc
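A sketch of such local-first/fallback rules in storage.xml, assuming the shared file system is mounted at `/hadoop` (a placeholder path). The exact protocol names and fallback wiring depend on your SITECONF setup, so start from a working T2's storage.xml as a template:

```xml
<storage-mapping>
  <!-- Try the local shared file system first ("/hadoop" is a placeholder). -->
  <lfn-to-pfn protocol="direct" path-match="/+store/(.*)" result="/hadoop/store/$1"/>
  <!-- Fall back to the CMS data federation (AAA) for files not held locally. -->
  <lfn-to-pfn protocol="xrootd" path-match="/+store/(.*)"
              result="root://cms-xrd-global.cern.ch//store/$1"/>
</storage-mapping>
```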

I want GRID jobs to be able to run at my site

  • What you need is:
    • one or more "beefy" interactive machines
    • one or more batch nodes
    • a compute element (the GRID door to your batch system)
  • For the compute element you need:
    • a separate machine accessible from the internet
    • the GRID software stack (EMI or OSG) installed for authentication
    • compute element software (CREAM CE from EMI or the CE from OSG)
    • the compute element needs access to your batch system
  • Maintenance: a system administrator
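Once the compute element is up, a basic smoke test from a machine with the EMI user interface installed might look like this (for the CREAM CE case; the CE hostname, queue name, and JDL file are placeholders for your own site):

```shell
# Obtain a CMS VO proxy, then submit a trivial job to the CREAM CE.
# "ce.example.edu", the "cream-pbs-cms" queue, and "hello.jdl" are placeholders.
voms-proxy-init -voms cms
glite-ce-job-submit -a -r ce.example.edu:8443/cream-pbs-cms hello.jdl
```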

Caveats

  • These are suggestions; you can combine them to your liking.
  • CMS provides T3 support only through the community; there is no dedicated support from CMS.
  • All suggestions are simple to implement for a professional, but can represent significant trouble for a physicist without computing experience; it is possible, though.
    • Starting from scratch will take time.
  • CMS is not suggesting that a T3 install or run PhEDEx.
    • If large-scale data access is wanted, it should be realized through the CMS data federation.
    • The network requirements for PhEDEx and the CMS data federation are not different; efficient usage of both needs about the same bandwidth.
Topic revision: r8 - 2014-04-24 - DouglasJohnson
 