J.P. Baud, F. Donno, A. Frohner, E. Lanciotti, G. Lo Presti, R. Mollon, A. Sciabà, D. Smith, P. Tedesco

11/02/2009

TotalRequestTime


JP: specifying desiredTotalRequestTime is mandatory for dCache because the default could be bad.

R: setting it is allowed by the GFAL version in certification.

Flavia: a default of 4000 is used. dCache seems to ignore the client's setting anyway. However, in v1.9 there seems to be a configurable default in dCache, which is too low for bringOnline and too high for prepareToPut.

An: if the request has expired, can I ask for its status? JP: yes.

G: in CASTOR, requests can stay alive even after that time (the desired time is ignored...).

F: if you set this parameter in the config file, it is ignored because of a bug in dCache; this is fixed in the new version, 1.9. This is the time the request stays in the queue no matter what the client specifies. It would be better if the client could specify this time.

Summary: it is not important to report RemainingRequestTime, but it is desirable to be able to specify DesiredTotalRequestTime. For dCache it is very important to set it.
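A minimal sketch of this recommendation in Python (here `srm` stands for any SOAP binding of the SRM v2.2 interface; the wrapper function and the chosen value are assumptions, while desiredTotalRequestTime is the field name from the spec):

    DESIRED_TOTAL_REQUEST_TIME = 3600  # seconds; an assumed sensible value

    def prepare_to_get(srm, surls):
        # Submit srmPrepareToGet with an explicit total request time.
        request = {
            "arrayOfFileRequests": [{"sourceSURL": s} for s in surls],
            # Always set this: the dCache server-side default may be too
            # low for bringOnline and too high for prepareToPut.
            "desiredTotalRequestTime": DESIRED_TOTAL_REQUEST_TIME,
        }
        return srm.srmPrepareToGet(request)

The point is simply that the client never relies on the server default.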

SRM too busy


How to signal to the client that the SRM is too busy?

JP: better to use SRM_INTERNAL_ERROR.

JP: if we do a PrepareToGet and the system is too busy and cannot answer, you should just retry later. If you don't want the client to retry, then the server should return SRM_FAILURE.

G: what is the status today if you get an SRM_INTERNAL_ERROR?

P: in FTS, you just give up, clean up and later try a new transfer job.

F: take a synchronous operation on CASTOR, for example, and suppose the server is busy: what would you like the client to do, depending on the situation? G: the client must back off, for all operations. CASTOR already returns SRM_INTERNAL_ERROR if the backend is busy. One can retry in 10 minutes.

JP: FTS should be modified and should retry instead of cleaning up everything.

Ak: how does SRM_INTERNAL_ERROR help in case of sustained load?

P: SRM_FILE_BUSY could also tell the client that the SE is running out of transfer slots, so for example FTS could reduce the size of the channel going to the given SE.

An: SRM_FILE_BUSY means that a PutDone has not yet been done for a file after a PrepareToPut.

Ak: It would be nice to have this information [no. of transfer slots] directly available from the SE.

F: we must try to solve the problem the experiments have with SRM being busy. We must find a way even if we have to overload the specifications.

D: SRM_INTERNAL_ERROR is OK; we must define how to back off.

G: just implement the back-off in the client.

Ak: how can the experiments figure out that SRM is busy? Now FTS is used to monitor SEs!

G: Ping could have some monitoring info.

Ak: or the information system.

F: if the client gets SRM_INTERNAL_ERROR, what does it do?

R: GFAL retries with exponential wait times until the timeout expires.

F: in a test srmLs gave the answer at once but StatusofBringOnline took days. CASTOR did its own exponential polling. srmLs does not guarantee the copy is pinned.

F: the CASTOR backend should tell the SRM server that the file is online without the need for polling. The same holds for dCache.

G: the problem with StatusofBringOnline is now fixed in CASTOR (not the problem with the polling!). The SRM polling is such that you might learn that a file is online only after it has been garbage collected.

F: tell experiments that they should not use srmLs.

JP: why?

F: srmLs hammers PNFS even if you query a single file, and PNFS is overloaded.

G: srmLs is a synchronous query on the CASTOR backend, less optimal than StatusofBringOnline. srmLs talks to the stager to know if a file is online.

F: for dCache srmLs gives you an immediate answer but does not guarantee that the file is pinned. This happened very frequently at SARA because the disk was small.

JP: using StatusofBringOnline would not solve the problem.

F: yes it would, because the file would have been pinned.

JP: I don't think StatusofBringOnline guarantees that.

F: in dCache there are still some outstanding bugs in StatusofBringOnline. In 1.9.2 it should be OK, but when testing it they discovered other bugs which prevent them from publishing it.

F: For the time being the experiments still have to use srmLs.

JP: it's strange, because if the file is online the new BringOnline should have been processed immediately.

F: Timur said that the request was in the queue. The pin time can be comparable with the time the request spends in the queue.

JP: requests should stay in the queue for no more than a few tens of seconds before being processed.

F: I don't know how they pull requests from the queue.

G: for asynchronous requests CASTOR does not return SRM_INTERNAL_ERROR, because StatusofBringOnline is very simple. If the database is down, everything gives SRM_INTERNAL_ERROR. Even in case of overload, requests are still honoured, unless everything is down.

R and Ak: if FTS and lcg-utils cannot talk to the SRM (for example, a connection timeout), no retries are done.

JP: in case of SRM_INTERNAL_ERROR at the request level, you can still see some status at the file level. You keep polling at incremental intervals and check whether the file status has changed. One can use GetRequestSummary to know how many files changed status. It is much lighter than StatusofBringOnline if you have e.g. 1000 files, because it doesn't load the backend.

Ak: there is an estimatedWaitTime field in the file status that the server could use to suggest when the client should poll again.

G: to recap:

- Asynchronous requests: use SRM_INTERNAL_ERROR and poll using estimatedWaitTime
- Synchronous requests: use SRM_INTERNAL_ERROR
- Statusof...: poll based on GetRequestSummary

G: in CASTOR srmAbort does NOT work, whereas srmRm does. In dCache it is the other way round.

Conclusions


Suggestion: the client should specify DesiredTotalRequestTime, and the server should do its best to abort a request if that time is reached and the client did not abort it explicitly (e.g. because the client has crashed).

Suggested back-off algorithm for both synchronous and asynchronous methods: if an SRM_INTERNAL_ERROR is received, the client shall retry the same operation with an exponentially increasing retry period.
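A minimal sketch of this back-off in Python (the function name, the initial wait, the cap and the deadline are illustrative assumptions, not part of the SRM specification):

    import time

    def call_with_backoff(operation, initial_wait=10, max_wait=600, deadline=3600):
        # Retry `operation` (a callable returning an SRM status code string)
        # while the server reports SRM_INTERNAL_ERROR, doubling the wait
        # between attempts and giving up after `deadline` seconds in total.
        wait, elapsed = initial_wait, 0
        while True:
            status = operation()
            if status != "SRM_INTERNAL_ERROR":
                return status  # success, or a definitive error for the caller
            if elapsed + wait > deadline:
                raise TimeoutError("SRM still busy after %d s" % elapsed)
            time.sleep(wait)  # back off: the SE is overloaded
            elapsed += wait
            wait = min(wait * 2, max_wait)  # exponential growth, capped

The same wrapper applies both to synchronous calls (e.g. srmLs) and to the submission of asynchronous ones (srmPrepareToGet/Put).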

Suggested asynchronous polling method:

1) submit srmPrepareToGet/Put, then call srmGetRequestSummary in a loop with an exponential retry period until something changes;

2) call srmGetStatus to get the details of the change (see the sketch below).
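A sketch of this polling pattern in Python (again with an assumed generic `srm` client object; the terminal status set and the dictionary layout of the reply are illustrative assumptions):

    import time

    TERMINAL = {"SRM_SUCCESS", "SRM_FAILURE", "SRM_ABORTED"}  # assumed terminal codes

    def wait_for_request(srm, token, initial_wait=10, max_wait=600):
        # Poll the cheap srmGetRequestSummary call and only pay for the
        # heavier per-file status call when the summary has changed.
        wait, last_summary = initial_wait, None
        while True:
            summary = srm.srmGetRequestSummary([token])  # one record per request
            if summary != last_summary:
                details = srm.srmStatusOfGetRequest(token)
                if details["returnStatus"] in TERMINAL:
                    return details
                last_summary = summary
            time.sleep(wait)
            wait = min(wait * 2, max_wait)  # exponential wait between polls

For a request with e.g. 1000 files this keeps the backend load low, since only the one-record summary is queried on every iteration, as JP noted above.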

On the metascheduling level: it is very hard to give a meaningful estimate of how "busy" an SE is, so we give up on that for now.

-- ElisaLanciotti - 23 Feb 2009
