AAA Scale Tests

Current action items

  • Get detailed monitoring rolled out to more sites
    • But: we can start the scale testing in the US, which has detailed monitoring working at all sites, and expand from there
    • Status: We appear to have it working at 4 T1's, although FR is dubious, as they might not have the xrootd access service enabled. 16 T2's are there (but where has Caltech gone now?). More sites should start coming in now that a proper dCache RPM has been released. Need to find a way to press sites to continue implementation.
      • Should emphasize yet again that this doesn't stop us from starting scaling tests at many sites.
  • Prepare a test dataset that can be placed at sites
    • One solution is to use the /store/test/xrootd/SITENAME/LFN trick at those sites where it is implemented. It is implemented at least at all US T2 sites except Florida.
    • Dan Riley/DBS team will provide a tool that can create datasets in DBS from a list of LFN's, which would allow the creation of datasets that appear to exist uniquely at a site using the above solution. One candidate is the pileup dataset, /MinBias_TuneZ2star_8TeV-pythia6/Summer12-START50_V13-v3/GEN-SIM, which we know everyone has locally.
      • Note that we don't actually have a delivery date for the tool, which is currently the limitation for some small-scale CRAB testing. If we go with Condor-G, maybe we don't even need the dataset tool and can just provide a list of LFN's?
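The site-pinning trick above amounts to a simple LFN rewrite. A minimal sketch of what such a helper could look like (the function name, the example LFN, and the example site are all hypothetical, not an existing tool):

```python
def pin_lfn_to_site(lfn, site):
    """Prefix an LFN with /store/test/xrootd/SITENAME so the federation
    serves it uniquely from that site (hypothetical helper)."""
    if not lfn.startswith("/store/"):
        raise ValueError("expected an LFN starting with /store/")
    return "/store/test/xrootd/%s%s" % (site, lfn)

# Example with a made-up LFN and an example site name:
print(pin_lfn_to_site("/store/mc/Summer12/MinBias/GEN-SIM/0000/file.root",
                      "T2_US_Nebraska"))
# -> /store/test/xrootd/T2_US_Nebraska/store/mc/Summer12/MinBias/GEN-SIM/0000/file.root
```

A list of LFN's rewritten this way could then be fed either to the DBS dataset tool or directly to Condor-G jobs.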

Timeline

  • October 1: T2 sites have AAA fallback implemented
    • Currently have 43/51 sites, remaining sites are harder cases that may never come together
  • December 1: T1 sites have both fallback and access implemented

Metrics

  • Number of simultaneous connections to storage element
    • AAA dashboard
  • Outbound data from storage element (throughput, really)
    • AAA dashboard, FJR files
  • CPU efficiency for remote access
    • Standard dashboard, as long as jobs are identifiable
    • Potentially the FJR files, which might carry this information.
  • File-open failures
    • Job log files/dashboard
  • Job-failure rates
    • Job log files/dashboard, obviously correlated with above
  • Redirector metrics
    • Don't know what these are yet, or how we capture them
  • Possibly later: something to test balancing of different sources

  • Note: one element of this is that we need to be able to track some of these things as a function of the number of jobs hitting a site. Just how to correlate all the data is an interesting issue.
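The correlation bookkeeping could start as simply as binning each metric sample by the concurrent job count at its timestamp. A minimal sketch, with made-up data layouts (real inputs would come from the AAA dashboard and FJR files):

```python
from collections import defaultdict

def correlate(job_counts, metric_samples):
    """job_counts: {timestamp: n_jobs at the site};
    metric_samples: [(timestamp, metric value)].
    Returns {n_jobs: mean metric value}, so each metric can be
    plotted as a function of the number of jobs hitting the site."""
    by_load = defaultdict(list)
    for ts, value in metric_samples:
        if ts in job_counts:
            by_load[job_counts[ts]].append(value)
    return {n: sum(vals) / len(vals) for n, vals in by_load.items()}

# Toy example: two samples at load 10, one at load 20
jobs = {0: 10, 60: 20}
samples = [(0, 100.0), (0, 120.0), (60, 240.0)]
print(correlate(jobs, samples))  # {10: 110.0, 20: 240.0}
```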

Job configuration and execution

  • To test I/O rates, can use the StandardCandle module, which straightforwardly allows one to just read an entire file. KB has a configuration that seems to test OK.
  • To test connection (file open) rates, can just run a path that only has an input module!
  • Note: both of these jobs will run pretty quickly -- we will have to find some way to sustain the scale for some amount of time.
  • Would like to run these tests under the production role to try to commandeer a lot of slots; need permission from CompOps.
  • Wisconsin staff/students are willing to work on the job submission; they propose Condor-G as the simplest thing.
  • Need efficient way of harvesting the metrics: is Imperial interested in this?
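The two job types above could look roughly like the following CMSSW configuration fragment. This is only a sketch: the process name and test LFN are made up, and KB's actual configuration is the reference.

```python
# Sketch of the I/O-rate test configuration (assumed names; KB's actual
# config is the reference).  Reads an entire file through the federation.
import FWCore.ParameterSet.Config as cms

process = cms.Process("AAASCALE")
process.source = cms.Source("PoolSource",
    # Hypothetical site-pinned test LFN, per the /store/test/xrootd trick
    fileNames = cms.untracked.vstring(
        "/store/test/xrootd/T2_US_Nebraska/store/mc/example/file.root"))
# Read every event to test sustained I/O rates
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(-1))

# For the connection (file-open) rate test, the same config with
# maxEvents.input = 0 would exercise the file open without reading events.
```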

Sequence of events

  • Run tests at one US site, using a different US site as source
    • Make sure we can get enough job slots
    • Check out metrics, adjust plans as need be, get sense of scale
  • Start adding in multiple US sites as sources to the one US site
  • Add in one more destination site at a time, stay in US
  • Start using non-US sources at US destinations
    • Will test redirection to EU, transatlantic network issues
    • Need to start with sites that have detailed monitoring working
  • Repeat within EU -- start with one "good" site and use EU sources, then add more destinations and more sources
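The ramp-up above can be written down as an ordered list of phases, each a (destination, sources) pair. A sketch with made-up site lists (the hypothetical planner below just enumerates the steps in the order given):

```python
def phases(us_sites, eu_sites):
    """Yield (destination site, source sites) pairs for each test phase,
    following the sequence of events above (hypothetical planner)."""
    dest, others = us_sites[0], us_sites[1:]
    yield (dest, [others[0]])          # one US destination, one US source
    yield (dest, others)               # add multiple US sources
    for d in others:                   # add US destinations one at a time
        yield (d, [s for s in us_sites if s != d])
    yield (dest, eu_sites)             # non-US sources at a US destination
    yield (eu_sites[0], eu_sites[1:])  # repeat within the EU
```

For example, `list(phases(["A", "B", "C"], ["X", "Y"]))` starts with `("A", ["B"])` and ends with `("X", ["Y"])`.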
Topic revision: r3 - 2013-09-12 - KenBloom
 