Archival Site Survey Conclusions
Conclusions from the survey.
Main Message
- Submit recalls as far in advance as possible
- Keep the queue as full as possible
Campaign Planning
- Group recall requests by creation time or tape family if possible
- Inform the site of recall plans with as much warning as possible
  - This allows the site to synchronise recalls with local activities such as repack
- Understand how priority requests are handled
- Submitting priority requests will degrade throughput
- Withholding recall submissions to keep latency down will degrade throughput
- Synchronise data use with recalls to avoid purge/recall loops
- The client should delete a staged file from the disk buffer once the workflow requiring the retrieval has completed.
- Do not wait for the last byte to be recalled before processing
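The grouping advice above can be sketched as follows. This is a minimal illustration, not a client implementation: the `FileInfo` record and its `tape_family` and `created` fields are hypothetical, and it assumes the client can obtain this metadata from its own catalogue.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class FileInfo(NamedTuple):
    """Hypothetical catalogue record; field names are assumptions."""
    path: str
    tape_family: str  # tape family the file was written to
    created: float    # creation time, seconds since the epoch


def group_for_recall(files: Iterable[FileInfo]) -> Dict[str, List[FileInfo]]:
    """Group files by tape family and order each group by creation time.

    Files written close together in time within one family are more
    likely to sit on the same tapes, so recalling them together should
    reduce the number of mounts.
    """
    groups: Dict[str, List[FileInfo]] = defaultdict(list)
    for f in files:
        groups[f.tape_family].append(f)
    return {family: sorted(members, key=lambda f: f.created)
            for family, members in groups.items()}
```

Each resulting group can then be submitted as one recall campaign, well in advance of processing.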
Client Behaviour
- Consider queue size to be unlimited
  - Exceptions
    - FNAL, PIC - 15k per VO
    - KIT - 2k per pool
    - UNIKHEF-SARA - 1k (?)
- Back off on a combination of SRM_INTERNAL_ERROR (request status) and SRM_FILE_BUSY (file status) (Castor).
- Back off when the number of files in SRM_REQUEST_QUEUED approaches the server-configured limit (dCache).
- Use bulk requests
- The best bulk recall size is unknown. 1k is the reference; some sites want more, some want fewer.
- Interaction rates under 10 Hz are typically acceptable
- Run with no timeouts, or with timeouts of at least 48 hours
- Ignore disk buffer occupancy
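The client-side rules above (bulk requests, back-off conditions, interaction rate) could be sketched as below. This is an illustration under stated assumptions, not the behaviour of FTS or any real client: all function names are hypothetical, the 0.9 margin is an arbitrary reading of "approaches the limit", and the SRM status codes are handled as plain strings rather than through an SRM library.

```python
from typing import Iterator, List, Optional, Sequence

REFERENCE_BULK_SIZE = 1000   # the 1k reference from the survey
MAX_INTERACTION_RATE = 10.0  # Hz, upper bound reported as acceptable


def bulk_requests(paths: Sequence[str],
                  size: int = REFERENCE_BULK_SIZE) -> Iterator[List[str]]:
    """Split a list of files into bulk requests of at most `size` files."""
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


def queue_near_limit(queued: int, limit: Optional[int],
                     margin: float = 0.9) -> bool:
    """dCache-style rule: back off when the number of files in
    SRM_REQUEST_QUEUED approaches the server-configured limit.
    `limit=None` means the queue is treated as unlimited.
    The margin is an assumption, not a surveyed value."""
    if limit is None:
        return False
    return queued >= margin * limit


def castor_busy(request_status: str, file_statuses: Sequence[str]) -> bool:
    """Castor-style rule: back off on SRM_INTERNAL_ERROR at request level
    combined with SRM_FILE_BUSY on at least one file."""
    return (request_status == "SRM_INTERNAL_ERROR"
            and "SRM_FILE_BUSY" in file_statuses)


def seconds_to_wait(last_call: float, now: float,
                    max_rate_hz: float = MAX_INTERACTION_RATE) -> float:
    """Time to sleep before the next call so the rate stays under the cap."""
    return max(0.0, 1.0 / max_rate_hz - (now - last_call))
```

For a site with a known limit, such as the 2k per pool quoted for KIT, the client would pass that value as `limit`; elsewhere it would pass `None` and keep the queue as full as possible.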
Discussion points
Writing strategy
Should we make recommendations on writing strategy?
Should particular pools or resources be selected for particular types of data, for example by probability of future deletion?
Perhaps a writing strategy is not possible as repack will eventually destroy locality (?)
Approach
The survey exposes the significant diversity in these systems. What should the strategy be?
Produce some "lowest common denominator" advice - do the following... it may help, and will never hurt
Enumerate a small number of basic site characteristics and classify each site individually
Next steps
- Experiments have to join the conversation
- Which new scenarios to exploit tape are the experiments actually considering?
- Carousels – R&D started at BNL
- What else?
- Understand what actions, if any, need to be taken client side
- FTS configurations can be updated easily
- e.g. size of a bulk request, number of outstanding requests
- What about developments in FTS or in experiment-specific clients?
- Track progress using the reported metrics
- Finalise the advice to clients in a “mode d’emploi” for tape systems
--
OliverKeeble - 2018-05-07