Castor time table

Week 23 4-10 June

Upgrade of the pre-production system to 2.1.3-14

Larger scale SRM tests.

Larger scale Castor-xrootd tests on the ITDC setup with 2.1.3-14.

Larger scale ‘interference’ tests (also using SRM)

Further tests for the ATLAS export setup

CMS is planning ‘repacker’ tests at 600 MB/s

Move the full ATLAS Castor setup to the new release

Week 24 11-17 June

SPS long MD for COMPASS and NA48; we could move c2public to 2.1.3-14. This is critical CDR activity, so it may be too early, but the current system will fail under high load.

Continue large scale tests

Week 25 18-24 June

Continue large scale tests

Start the discussion with ALICE about the production setup of the Castor-xrootd system.

Week 26 25 June–1 July

Possible date for a move of CMS to the new Castor release. Plan B would be to make a separate Castor instance for CMS just for the pre-CSA07 tests and leave the standard production on the old Castor stager for a few weeks. (DES can still provide a new Castor DB instance; the disk-server situation is somewhat critical.)

Continue large scale tests

Week 27 2-8 July

Start of the ATLAS full (2008) dress rehearsal

End of the Castor task force; continue with normal operations. Focus on the stress tests.

Week 28 9-15 July

Possible date for the move of ALICE and/or LHCb to the new Castor release

Continue large scale tests

Week 29 16-22 July

Start of large scale pre-CSA07 tests of CMS

Week 30 23-29 July

Continue pre-CSA07 for CMS

Castor Tests

List of Castor test suites

SRM tests

-- BerndPanzersteindel - 03 Jun 2007

Test program for the next weeks

The principal scheme is common to all tests: change a set of parameters and trace the resulting changes in the behavior of the system (performance, response, stability). A key point is simplifying the debugging and tracing procedure by improving the automatic monitoring (new sensors, better and richer visualization), which requires constant feedback with the monitoring teams. The ‘cross-link’ monitoring between the different items in the dataflow (Castor, SRM, tape, disk server, FTS, GridFTP, network, experiment data management, etc.) is essential for the stable operation of the data management.
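As a purely illustrative sketch of this scheme (vary one parameter, drive load, record the response), the harness below uses the hypothetical hooks `apply_setting` and `run_load` as stand-ins for the real Castor/LSF configuration and monitoring interfaces; the numbers are simulated, not measurements:

```python
"""Sketch of the common test scheme: sweep one parameter, trace the response.

All hooks here are hypothetical stand-ins; a real run would reconfigure the
stager / disk servers and read back the monitoring sensors instead.
"""
import statistics

def apply_setting(name, value):
    # placeholder: would reconfigure the stager or the disk servers
    print(f"setting {name} = {value}")

def run_load(streams):
    # placeholder: would start `streams` concurrent read/write clients;
    # here we simulate a throughput curve that saturates at 600 MB/s
    return min(streams * 12.0, 600.0)

def sweep(name, values, streams=10, repeats=3):
    """Vary one parameter, repeat the load run, record mean and spread."""
    results = {}
    for v in values:
        apply_setting(name, v)
        samples = [run_load(streams * v) for _ in range(repeats)]
        results[v] = (statistics.mean(samples), statistics.pstdev(samples))
    return results

if __name__ == "__main__":
    for slots, (mean, dev) in sweep("lsf_slots_per_server", [2, 4, 8]).items():
        print(f"slots={slots}: {mean:.0f} ± {dev:.0f} MB/s")
```

In a real sweep the repeated runs and the recorded spread are what distinguish a genuine congestion threshold from ordinary load fluctuations.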

Examples of more detailed tests:

  1. changing the number of available LSF slots on the disk servers within the t0perm pool
  2. changing the weights of the Castor load-balancing formula.
  3. testing a ‘read-only’ analysis/calibration pool scenario with multiple users and hundreds of streams
  4. testing whether the implementation of user shares in the Castor LSF queues guarantees a throttling of different user activities in a pool
  5. the SRM tests need to be disentangled and made available as part of the standard Castor release certification tests. One needs to ensure that the SRM part can be tested as independently as possible from the Castor part.
  6. the ATLAS export performance is not understood and needs dedicated tests to identify possible congestion areas in Castor (load balancing, hardware, influence of the FTS setup, network parameters, etc.) or at the end-points at the T1s. A few tests have already been done, but none of them identified a clear reason for the export fluctuations. Possible tests are changing some key parameters in the T0 setup (file size) and monitoring the detailed data-flow effects, adding more artificial load on the system to look for congestion thresholds, and changing the queuing mechanism of FTS (together with the way DQ2 enters export requests).
  7. general large scale tests where different pools are loaded with several thousand read and write requests. This will provide important information about scaling and throttling limits and general stability limits.
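For test 7, the driver is conceptually simple: fire off thousands of concurrent requests and observe how many are served versus throttled. The sketch below is hypothetical — `request` stands in for a real client call (rfcp, xrdcp, GridFTP) and the slot check only simulates server-side throttling:

```python
"""Minimal sketch of test 7: flood a pool with concurrent requests and
count how many would be throttled by a configured slot limit."""
from concurrent.futures import ThreadPoolExecutor
import threading

SLOTS = 50          # simulated Castor slot limit on the pool
_active = 0
_rejected = 0
_lock = threading.Lock()

def request(i):
    """Stand-in for one client read/write request against the pool."""
    global _active, _rejected
    with _lock:
        if _active >= SLOTS:          # all slots busy: request is throttled
            _rejected += 1
            return False
        _active += 1
    # a real client would move data here
    with _lock:
        _active -= 1
    return True

def flood(n_requests, n_workers=200):
    """Fire n_requests concurrently; return (served, throttled) counts."""
    global _active, _rejected
    _active, _rejected = 0, 0
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        served = sum(ex.map(request, range(n_requests)))
    return served, _rejected

if __name__ == "__main__":
    ok, throttled = flood(5000)
    print(f"{ok} served, {throttled} throttled out of 5000")
```

In the real test the interesting quantities are where the served/throttled ratio breaks down and whether the stager and LSF stay stable past that point.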

The performance and behavior of Castor in a production environment depend on a large number of parameters and setup details:

  • number of disk pools (same stager for all pools)
  • number of disk servers per pool
  • number of file systems per disk server
  • type of file system and disk RAID configuration
  • Linux version and IO sub-system, which define the characteristics of read and write operations
  • low level disk server Linux network configuration (TCP buffers, …)
  • number of tape drives
  • type of tape server (memory configuration correlated with file size distribution)
  • file size distribution in the pool
  • number of configured Castor ‘slots’ (#concurrently running IO jobs) per disk server
  • parameters which define the load-balancing policy:
      - type of parameter (load, network IO, number of read or write streams, …)
      - weight of these parameters
      - formula used to combine the parameters
      - extrapolation weight to take the scheduled future streams into account
      - update intervals of the monitoring information for these measured parameters
  • queuing mechanism inside Castor (shares, priorities)
  • queuing mechanism inside FTS (experiment driven)
  • number of stager threads handling IO and control requests (sharing and balancing)
  • job throughput in LSF
  • number of FTS slots (possible concurrent file transfers running)
  • number of parallel TCP streams configured in GridFTP
  • total number of read and write streams from applications
  • Oracle data base performance
  • …
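The load-balancing items above (weighted parameters plus an extrapolation term for streams that are scheduled but not yet visible in the periodically updated monitoring data) can be sketched as follows; the weights, the formula, and the file-system names are illustrative assumptions, not the actual Castor/LSF configuration:

```python
"""Hedged sketch of a load-balancing score of the kind described above:
a weighted combination of measured quantities per file system, plus an
extrapolation penalty for already-scheduled streams."""

# illustrative weights, not the production values
WEIGHTS = {"load": 1.0, "net_io": 0.5, "streams": 2.0}
EXTRAPOLATION_WEIGHT = 1.5   # penalty per already-scheduled stream

def score(metrics, scheduled_streams):
    """Lower score = better candidate for the next request."""
    base = sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)
    return base + EXTRAPOLATION_WEIGHT * scheduled_streams

def pick_filesystem(candidates):
    # candidates: {name: (metrics dict, already-scheduled stream count)}
    return min(candidates, key=lambda n: score(*candidates[n]))

if __name__ == "__main__":
    fs = {  # hypothetical monitoring snapshot
        "serverA:/srv1": ({"load": 0.8, "net_io": 90.0, "streams": 12}, 3),
        "serverB:/srv1": ({"load": 0.3, "net_io": 40.0, "streams": 5}, 0),
    }
    print(pick_filesystem(fs))
```

Because the monitoring values only refresh at fixed intervals, the extrapolation term is what keeps the scheduler from piling every new request onto the server that merely looked idle at the last update.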
