File Transfer Service support area

The purpose of this page is to keep track of the problems and support requests posted to GGUS.

This page applies to the gLite FTS 1.4.1 and FTS 1.5 releases.

Summary

Configuration

File Transfer Service

File Transfer Agent

Channel Administration

Configuration

YAIM Configuration explained

Starting from FTS version 1.5, the configuration has moved from the gLite Python configuration script to YAIM (you can find an example in /opt/glite/yaim/example/site-info.def). For the YAIM details, please refer to the related documentation. The parts relevant here are the FTS and the FTA ones.

WebServer Configuration

For configuring the FTS WebServer, you need:

# Node names
[...]
FTS_HOST=%FTS_WS_HOSTNAME%.$MY_DOMAIN
[...]

# BDII/GIP specific settings
[...]
BDII_FTS_URL="ldap://$FTS_HOST:2170/mds-vo-name=resource,o=grid"
[...]

# FTS config file for web-service
FTS_DBURL=... # The JDBC url for connecting to the DB
FTS_HOST_ALIAS=prod-fts-ws.cern.ch

Where %FTS_WS_HOSTNAME% is the name of the host where the FTS Web Server is installed (if you use DNS aliases, put the name of the DNS alias here).

If the FTS WS and the Agents are configured in the same file (this is usually the case), you don't need to provide the DB username and password, since these are taken from the Agents parameters (see below). If you have separate files for the WS and the Agents, you need to provide these values using the parameters:

FTS_DB_TYPE=ORACLE
FTS_DB_USER=...
FTS_DB_PASSWORD=...

FTA Configuration

The FTA configuration is slightly more complex. The first thing you have to specify is the hosts that will be used by the FTA and which agents will be installed on these hosts. For example, you can have:

FTA_MACHINES="ONE TWO FIVE"

FTA_AGENTS_ONE_HOSTNAME="fts101.cern.ch"
FTA_AGENTS_ONE="CERN-BNL BNL-CERN CERN-INFN INFN-CERN"

FTA_AGENTS_TWO_HOSTNAME="fts102.cern.ch"
FTA_AGENTS_TWO="CERN-FNAL FNAL-CERN CERN-RAL RAL-CERN"

FTA_AGENTS_FIVE_HOSTNAME="fts105.cern.ch"
FTA_AGENTS_FIVE="DTEAM ALICE ATLAS CMS LHCB OPS"

In that case, two hosts will be used for the ChannelAgents (fts101.cern.ch and fts102.cern.ch) and one for the VOAgents (fts105.cern.ch). Please note that this example is taken from the production FTS at CERN, and doesn't force you to have the agents spread over different boxes (this choice mainly depends on the load you expect on your setup). You then have to specify the type of each agent, like:

FTA_CERN_BNL="URLCOPY" 
FTA_BNL_CERN="URLCOPY" 
FTA_CERN_INFN="URLCOPY" 
FTA_INFN_CERN="URLCOPY" 
FTA_CERN_FNAL="SRMCOPY" 
FTA_FNAL_CERN="URLCOPY" 
FTA_CERN_RAL="URLCOPY" 
FTA_RAL_CERN="URLCOPY" 

FTA_ATLAS="VOAGENT_PYTHON"
FTA_ALICE="VOAGENT_PYTHON"
FTA_LHCB="VOAGENT_PYTHON"
FTA_DTEAM="VOAGENT_PYTHON"
FTA_CMS="VOAGENT_PYTHON"
FTA_OPS="VOAGENT_PYTHON"

The naming convention is straightforward: FTA_%INSTANCE_NAME%, where %INSTANCE_NAME% is one of the names specified in the FTA_AGENTS_* parameters. Please note that the character "-" should be converted to "_". The supported types are:

  • Channel Agent types: URLCOPY (transfers are executed using 3rd party gridftp copy), SRMCOPY (uses srmcopy)
  • VO Agent types: VOAGENT_PYTHON (the VOAgent retry logic is provided by a Python script; recommended), VOAGENT (the VOAgent with the basic retry logic)
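The "-" to "_" conversion described above can be sketched as a small shell helper. This is only an illustration: the function name fta_type_var is hypothetical and is not part of YAIM.

```shell
# Hypothetical helper (not part of YAIM): build the name of the variable
# that holds the agent type for a given instance, replacing "-" with "_".
fta_type_var() {
    printf 'FTA_%s\n' "$(printf '%s' "$1" | sed 's/-/_/g')"
}

fta_type_var CERN-BNL   # prints FTA_CERN_BNL
```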

The only mandatory parameters are the Database type, username, password and connection string:

FTA_GLOBAL_DBTYPE=ORACLE
FTA_GLOBAL_DB_CONNECTSTRING=...
FTA_GLOBAL_DB_USER=...
FTA_GLOBAL_DB_PASSWORD=...

In addition, please leave the verbosity level of the log files to INFO:

FTA_GLOBAL_LOG_PRIORITY=INFO

These values apply to all the agents. In fact, we defined three different scopes for the configuration parameters:

  • GLOBAL: the values of the parameter are used for all the agents (VOs and Channels). The parameters mentioned above are examples of global parameters. This kind of parameter can also be used to define default values that can be overridden by more specific scopes.
  • TYPEDEFAULT_%TYPE%: the values are used for all the agents of the same type. The supported types are listed above: URLCOPY, SRMCOPY, VOAGENT_PYTHON, VOAGENT. Please note that in this context URLCOPY and SRMCOPY are considered different types, even if both refer to ChannelAgents. The same applies to VOAGENT_PYTHON and VOAGENT.
  • %INSTANCE_NAME%: the values are specific to the instance of the agent identified by %INSTANCE_NAME% (the name of the VO or the Channel the agent is responsible for).

In order to specify the FTA configuration parameters, we adopted the following naming convention:

FTA_%SCOPE%_%PARAM_NAME%

where %SCOPE% is one of the values listed above and %PARAM_NAME% is the name of the parameter you want to set. For example, in FTA_GLOBAL_LOG_PRIORITY, GLOBAL is the scope and LOG_PRIORITY is the configuration parameter name.
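As an illustration of the three scopes, a site-info.def fragment might look like the sketch below. The variable names are composed from the naming convention, ATLAS is only an example instance, and the values are borrowed from other examples on this page rather than being a recommended setup.

```shell
# GLOBAL scope: applies to every agent (VOs and Channels).
FTA_GLOBAL_LOG_PRIORITY=INFO

# TYPEDEFAULT scope: applies to every SRMCOPY ChannelAgent.
FTA_TYPEDEFAULT_SRMCOPY_ACTIONS_SURLNORMALIZATION=compact-with-port

# Instance scope: applies only to the agent of the ATLAS VO (example instance).
FTA_ATLAS_AGENT_RETRY_INTERVAL=30
```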

Usually, the parameters have meaningful default values, but in some circumstances you may want to tune some of them:

  • Parameters related to ChannelAgents (URLCOPY or SRMCOPY):
    • GUC_MAXTRANSFERS: The maximum number of concurrent transfers the agent will process (acts as a hard limit on the number of files specified for a channel). Default is 50.
    • GUC_TRANSFERTIMEOUT: The timeout in seconds for completing the transfer. For an srmcopy transfer, the total timeout is this value multiplied by the number of files specified in the srmcopy request. Default is 600 for URLCOPY and 0 (no timeout) for SRMCOPY. Recommended value is 1800 for both types.
    • GUC_HTTPTIMEOUT: The http timeout for all the SOAP calls. Default is -1 (i.e. the gLite transfer-url-copy default applies: 40 seconds).

In addition, for ChannelAgents, we recommend you to set:

        FTA_TYPEDEFAULT_%TYPE%_FSM_ENABLEHOLD=false     # since the "Hold" state is a VO policy, only VOAgents should move files to this state
        FTA_TYPEDEFAULT_%TYPE%_AGENT_CANCEL_INTERVAL=60 # Check every minute whether there are active transfers to cancel
        FTA_TYPEDEFAULT_%TYPE%_AGENT_DEFAULTINTERVAL=5  # Execute the ChannelAgent operations (fetch new transfers, check the status of the active ones) every 5 seconds instead of every 3 seconds, in order to reduce the load on the DBServer
      
where %TYPE% is URLCOPY or SRMCOPY (please set both).

  • Parameters related only to URLCOPY ChannelAgents:
    • GUC_STREAMS: The maximum number of streams that would be used for a gridftp transfer (acts as a hard limit on the number of streams specified for a channel). Default is 10.
    • GUC_SRMPUTTIMEOUT: The timeout for completing an SrmPut operation and retrieving a valid Turl to be used for the transfer. Default is 60. Recommended value is 180
    • GUC_SRMGETTIMEOUT: The timeout for completing an SrmGet operation and retrieving a valid Turl to be used for the transfer. Default is 60. Recommended value is 180
    • GUC_SRMPUTDONETIMEOUT: The timeout for releasing the Turl returned by the SrmPut call. Default is 60. Recommended value is 180
    • GUC_SRMGETDONETIMEOUT: The timeout for releasing the Turl returned by the SrmGet call. Default is 60. Recommended value is 180
    • GUC_TRANSFERMARKERSTIMEOUT: The timeout between two consecutive transfer markers: if the gridftp server is not returning markers with at least this frequency, the transfer is considered stuck and will therefore be aborted. Default is 120.
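Put together, the recommended values for URLCOPY ChannelAgents could be written as the following site-info.def fragment. This is a sketch that simply applies the FTA_%SCOPE%_%PARAM_NAME% convention at the TYPEDEFAULT scope; whether you set them per type or per channel is a local choice.

```shell
# Recommended timeouts for all URLCOPY ChannelAgents (values from the list above).
FTA_TYPEDEFAULT_URLCOPY_GUC_TRANSFERTIMEOUT=1800
FTA_TYPEDEFAULT_URLCOPY_GUC_SRMPUTTIMEOUT=180
FTA_TYPEDEFAULT_URLCOPY_GUC_SRMGETTIMEOUT=180
FTA_TYPEDEFAULT_URLCOPY_GUC_SRMPUTDONETIMEOUT=180
FTA_TYPEDEFAULT_URLCOPY_GUC_SRMGETDONETIMEOUT=180
```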

  • Parameters related only to SRMCOPY ChannelAgents:
    • GUC_MAXBULKSIZE: the maximum size for an SrmCopy bulk request. Default is 100
    • AGENT_CHECK_INTERVAL: the frequency for checking the status of active SrmCopy requests. Recommended value is 30.

In addition, for SRMCOPY ChannelAgent, in order to prevent an issue with dCache 1.6.6 we recommend you to set:

          FTA_TYPEDEFAULT_SRMCOPY_ACTIONS_SURLNORMALIZATION=compact-with-port 
        

  • Parameters you need to set only for VOAGENT_PYTHON VOAgents:
    • PYTHON_PYTHONPATH: the paths where the Python modules and strategies can be located. Unless your setup differs from the default one, please set this value to: ${GLITE_LOCATION}/lib/python2.2/site-packages:${GLITE_LOCATION}/lib/python/glite/fts/strategies/
    • ACTIONS_RETRYMODULE: the name of the Python module that provides the retry logic for the VO. We recommend setting this value to smarter_retry
    • ACTIONS_RETRYPARAMS: the parameters passed to the retry logic. The format of this string depends on the strategy module itself. For the smarter_retry module, this value looks like:
      "MaxFailures = 3 ; HoldEnabled = false ; OverwriteFailedFiles = true ; OverwriteExistingFiles = false ; DefaultRetryDelay = 300 ; RetryDelayForTimeoutOnGet = 1800 ; RetryDelayForDestFileExists = 300 ;"
          
      The parameter names should be self-explanatory. Please note that if a VO needs a shorter retry delay, you may also need to modify the parameter AGENT_RETRY_INTERVAL, which by default is set to 60 seconds. For example, if a VO wants a retry delay of 30 seconds, you may need to specify:
      FTA_%VO%_AGENT_RETRY_INTERVAL=30
      FTA_%VO%_ACTIONS_RETRYPARAMS="MaxFailures = 3 ; HoldEnabled = false ; OverwriteFailedFiles = true ; OverwriteExistingFiles = false ; DefaultRetryDelay = 30 ; RetryDelayForTimeoutOnGet = 1800 ; RetryDelayForDestFileExists = 300 ;"
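The relation between the two settings above can be checked mechanically. The sketch below assumes (this is a reading of the example, not a documented rule) that AGENT_RETRY_INTERVAL should not be larger than DefaultRetryDelay; both function names are hypothetical.

```shell
# Extract the DefaultRetryDelay value from a smarter_retry parameter string.
default_retry_delay() {
    printf '%s\n' "$1" |
        sed -n 's/.*DefaultRetryDelay *= *\([0-9][0-9]*\).*/\1/p'
}

# Succeed when the retry interval ($1) does not exceed the DefaultRetryDelay
# found in the parameter string ($2). Assumption: interval <= delay is wanted.
retry_interval_ok() {
    delay=$(default_retry_delay "$2")
    [ -n "$delay" ] && [ "$1" -le "$delay" ]
}
```

For instance, `retry_interval_ok 60 "DefaultRetryDelay = 30 ;"` fails, which is the situation the example avoids by lowering AGENT_RETRY_INTERVAL to 30.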
          

There are many other aspects of the FTA you could configure, but for a production server we suggest you limit yourself to the configuration parameters illustrated on this page; if these are not sufficient, you can have a look at the FTS documentation or contact fts-support.

Troubleshooting

In case the Yaim configure_node script returns an error like:

ERROR: The variable FTA_TYPEDEFAULT_%TYPE%_%PARAM_NAME% was specified in the configuration file.
This is not used by any of the agents configured in the file.

This means that you're setting a property for a type that is not used (it usually happens when some FTA_TYPEDEFAULT_SRMCOPY_* properties are set but all the Channel agents are URLCOPY). The solution is to simply comment out or remove the lines concerning the unused parameters.
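To locate the offending lines before running YAIM, something like the sketch below may help. The function name is hypothetical, and it only looks at the two channel types: the VO agent types are left out because the prefix FTA_TYPEDEFAULT_VOAGENT_ also matches VOAGENT_PYTHON variables.

```shell
# Print the FTA_TYPEDEFAULT_<TYPE>_* lines of a site-info.def ($1) whose
# channel agent type is not assigned to any agent in the same file.
unused_typedefaults() {
    for type in URLCOPY SRMCOPY; do
        # An agent of this type is declared as e.g. FTA_CERN_BNL="URLCOPY".
        if ! grep -q "^FTA_[A-Za-z0-9_]*=\"$type\"" "$1"; then
            grep "^FTA_TYPEDEFAULT_${type}_" "$1" || true
        fi
    done
    return 0
}
```

Any line it prints is a candidate for being commented out or removed.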

If you're using different types of VOAgents at the same time, you're likely to receive this error:

New agent type VOAGENT used. Creating the default generator config files for it.
Writing generator input file for agent type VOAGENT to temporary file:
   /tmp/tmp.iGRUB18291/agenttype.VOAGENT.config.properties
Agent type VOAGENT overrides some defaults with the following variables:
FTA_TYPEDEFAULT_VOAGENT_ACTIONS_MAXFAILURES FTA_TYPEDEFAULT_VOAGENT_PYTHON_ACTIONS_RETRYMODULE
 
ERROR: The type parameter you have set - FTA_TYPEDEFAULT_VOAGENT_PYTHON_ACTIONS_RETRYMODULE - does not correspond to any known variable in
       the type template file /opt/glite/share/config/glite-data-transfer-agents/glite-transfer-vo-agent-fts-oracle.config.xml
       I don't know what to do with this variable, so am aborting.
       Perhaps you mistyped the parameter name?

This is due to a bug in the adopted naming convention (please see #18265). In order to prevent this, please use only one type of VOAgent: we recommend using VOAGENT_PYTHON. If a VO is not satisfied with the smarter_retry logic, you can restore the basic one by setting the property:

FTA_%VO%_ACTIONS_RETRYMODULE=basic_retry

File Transfer Service

My DN changed. Could you please grant me the same privileges I had before?

If a user's DN changes, for example because of a change of the CERN CA, all his/her privileges on the FTS Server should be updated. If the old certificate is still valid, the user can perform this operation on his/her own, without the help of the FTS administrator. To do that, the user has to execute the following steps with a valid proxy generated from the old certificate:

  • Invoke glite-transfer-getroles to retrieve the list of privileges
  • For each channel he/she has the management privileges on, execute
        glite-transfer-channel-addmanager CHANNEL_NAME NEW_DN
        
  • For each VO he/she has the management privileges on, execute
        glite-transfer-addvomanager VO_NAME NEW_DN
        

If the user's old certificate has expired, the FTS administrator has to list all managers of all the channels (glite-transfer-channel-listmanagers) and VOs (glite-transfer-listvomanagers) and then execute glite-transfer-channel-addmanager and glite-transfer-addvomanager as above.

The user can then check that the privileges are correct by executing glite-transfer-getroles with a proxy generated from the new certificate.
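When many channels and VOs are involved, the grant commands can be generated rather than typed by hand. The sketch below only prints the commands so they can be reviewed first; the function name is hypothetical, and the channel and VO lists are assumed to be copied manually from the glite-transfer-getroles output (its format is not parsed here).

```shell
# Print (do not run) the commands that grant a new DN the manager roles.
# $1 = new DN, $2 = space-separated channel names, $3 = space-separated VO names
print_grant_commands() {
    for channel in $2; do
        printf 'glite-transfer-channel-addmanager %s "%s"\n' "$channel" "$1"
    done
    for vo in $3; do
        printf 'glite-transfer-addvomanager %s "%s"\n' "$vo" "$1"
    done
}
```

The DN is printed in double quotes because DNs usually contain spaces.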

If the user is also an FTS administrator, the file /opt/glite/etc/glite-transfer-admin-mapfile should be manually modified on every node where the FTS-WS is installed, and a new entry corresponding to the new DN should be added.

When the old certificate expires or is no longer needed, the user should then remove the privileges granted to the old DN by executing the following commands, with a proxy generated from the new certificate:

  • Invoke glite-transfer-getroles to retrieve the list of privileges
  • For each channel he/she has the management privileges on, execute
       glite-transfer-channel-removemanager CHANNEL_NAME OLD_DN
       
  • For each VO he/she has the management privileges on, execute
       glite-transfer-removevomanager VO_NAME OLD_DN
       

Symptom: I tried to submit a job and it said: submit: You are not authorised to submit jobs to this service

The user is not authorised to submit jobs to the FTS service. In order to authorize him/her, you have to add his/her DN to the submit-mapfile on the FTS server. You can have a look at FtsServerInstall in the Mapfile section and at FtsServerSubmitMapfile.

However, due to a bug in the FTS (#10362), if the user has a proxy that was delegated two or more times (i.e. the DN ends with /CN=proxy/CN=proxy), a parsing error will cause the authorization to be denied. This bug has been fixed in FTS version 1.4 and in the latest QuickFix for 1.3.

If the user is still not authorized to submit requests, check that his/her DN is not in the veto-mapfile.

Symptom: I submitted a job from site X to Y but it didn't work. The channel Y-X exists and has a share for my VO!

From version 1.3 onwards the channel definitions are unidirectional. You have to create another channel in the opposite direction (glite-transfer-channel-add), set the share for the VO interested in using the channel (glite-transfer-channel-setvoshare) and install a Channel Agent that will manage it.

Which format should I use for the SURLs?

Starting from gLite 1.4.1, the FTA implements the enhancement request #8364, which allows a user to specify any format he/she prefers: before transferring or registering into the catalog, the agent converts each SURL to either a fully qualified format

srm://<host>:<port>/srm/managerv1?SFN=<file_path>

or a compact one

srm://<host>/<file_path>

depending on the configuration. By default it uses the compact format. If you want to change this, you have to set the related ChannelAgent configuration parameter ACTIONS_SURLNORMALIZATION (transfer-agent-channel-actions.SurlNormalization) to one of the following values:

  • compact: all the SURLs will be converted to the format:
            srm://<host>/<file_path>
            
  • compact-with-port: all the SURLs will be converted to the format:
            srm://<host>:<port>/<file_path>
            
  • fully-qualified: all the SURLs will be converted to the format:
            srm://<host>:<port>/srm/managerv1?SFN=<file_path>
            
  • disabled: no SURL conversion will be performed
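To make the four modes concrete, here is a rough shell sketch of the conversions. It is not the agent's implementation: normalize_surl is a hypothetical name, the file path is taken to be the absolute path (starting with "/"), and the port to use when the SURL carries none is passed in as an argument, whereas the agent would have to resolve it itself.

```shell
# $1 = mode, $2 = SURL, $3 = port to use when the SURL does not contain one
normalize_surl() {
    rest=${2#srm://}
    hostport=${rest%%/*}          # everything before the first "/"
    path=/${rest#*/}              # everything from the first "/" on
    host=${hostport%%:*}
    case "$hostport" in
        *:*) port=${hostport#*:} ;;
        *)   port=$3 ;;
    esac
    # Keep only the file path when the SURL uses the ?SFN= form.
    case "$path" in
        *\?SFN=*) path=${path#*\?SFN=} ;;
    esac
    case "$1" in
        compact)           printf 'srm://%s%s\n' "$host" "$path" ;;
        compact-with-port) printf 'srm://%s:%s%s\n' "$host" "$port" "$path" ;;
        fully-qualified)   printf 'srm://%s:%s/srm/managerv1?SFN=%s\n' "$host" "$port" "$path" ;;
        disabled)          printf '%s\n' "$2" ;;
        *)                 return 1 ;;
    esac
}
```

With this sketch, `normalize_surl compact 'srm://castorgridsc.cern.ch:8443/srm/managerv1?SFN=/castor/cern.ch/grid/dteam/src_file' 8443` prints `srm://castorgridsc.cern.ch/castor/cern.ch/grid/dteam/src_file`.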

If you're using a previous version, for interoperability reasons we suggest using fully qualified SURLs, i.e. in the format

srm://<srm_host>:<srm_port>/srm/managerv1/?SFN=<file_path>

If you know the type of the SRM that will be involved in the transfer, you can also specify one of the supported compact formats. For Castor, for example, you can use

srm://<castorsrm>:8443/srm/managerv1?SFN=<file_path>
srm://<castorsrm>:8443//srm/managerv1?SFN=<file_path>
srm://<castorsrm>:8443/?SFN=<file_path>
srm://<castorsrm>:8443/<file_path>
srm://<castorsrm>/<file_path>

If the transfer is processed by a channel configured to use srmcopy, the fully qualified format may not work. Please have a look here for a workaround.

Symptom: I've tried to submit a job but I get back an error saying: SOAP-ENV:Server.userException - org.xml.sax.SAXException

Usually this issue is related to an endpoint pointing to the wrong server (typically ChannelManagement instead of FileTransfer): when you observe an error similar to

submit: SOAP fault: SOAP-ENV:Server.userException -
org.xml.sax.SAXException: Deserializing parameter 'job':  could not find deserializer for type {http://transfer.data.glite.org}TransferJob

please ask the user to look at the command he/she just submitted and to check that the specified endpoint is correct: all the CLI commands that start with glite-transfer-channel-* require a ChannelManagement interface, while the ones that start with glite-transfer-* require the FileTransfer interface. To check whether the endpoint is correct, the user can also re-run the command with the -v option and check whether the line Using Endpoint ends with FileTransfer or ChannelManagement.
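The rule above can be written down as a small check. This is a sketch following the rule exactly as stated on this page; the function names are hypothetical, and the endpoint is whatever the -v output reports on the Using Endpoint line.

```shell
# Print the interface a CLI command needs, following the rule stated above.
required_interface() {
    case "$1" in
        glite-transfer-channel-*) echo ChannelManagement ;;
        glite-transfer-*)         echo FileTransfer ;;
        *)                        return 1 ;;
    esac
}

# Succeed when the endpoint ($2) ends with the interface needed by command $1.
endpoint_matches() {
    iface=$(required_interface "$1") || return 1
    case "$2" in
        *"$iface") return 0 ;;
        *)         return 1 ;;
    esac
}
```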

Symptom: I've tried to submit a job but I get back an error saying: No match

When users submit a transfer job, they usually specify SURLs that may contain a question mark (?). In some shells this character has to be escaped, for instance by quoting it ('?'). For example, if the SURLs are

srm://castorgridsc.cern.ch:8443/srm/managerv1?SFN=/castor/cern.ch/grid/dteam/src_file
srm://castorgridsc.cern.ch:8443/srm/managerv1?SFN=/castor/cern.ch/grid/dteam/dst_file

please make sure you run glite-transfer-submit in this way

glite-transfer-submit \
    srm://castorgridsc.cern.ch:8443/srm/managerv1'?'SFN=/castor/cern.ch/grid/dteam/src_file \
    srm://castorgridsc.cern.ch:8443/srm/managerv1'?'SFN=/castor/cern.ch/grid/dteam/dst_file

Symptom: I was able to list the channels but I cannot get the channel details

Listing channels is open to any user as long as he/she is not in the veto mapfile - you only get the channel name from this call.

However, getting the details of a channel (source, destination, bandwidth, etc.) is restricted. For this you need to be one of:

  • an admin
  • manager of the channel being queried
  • manager of any VO on the given FTS

You can check your roles on a given FTS by running glite-transfer-getroles. Information on channel and VO managers can be managed by a service admin or other managers by using the appropriate client tools. Information on service ADMINs is stored inside the admin-mapfile.

How do I setup a non-dedicated Channel?

Non-dedicated channels (a.k.a. "catch-all" channels) are a special channel configuration that matches any site as source or destination, and is therefore not coupled with the underlying network. Using "catch-all" channels lets you limit the number of channels you need to manage, but also limits the degree of control you have over what is coming into your site (although it still provides the other advantages like queueing, policy enforcement and error recovery). These channels are mainly recommended at a Tier1 for providing full connectivity to all other sites, where the suggested channel definition is:

  • Dedicated channels from any other Tier1 to the T1
  • Non-dedicated channels to each of the related Tier2
  • A non-dedicated channel to the T1

You can set up a non-dedicated channel that will manage all the transfers from any site to your site by issuing a glite-transfer-channel-add and using * as the source site name, like:

glite-transfer-channel-add -f NUM_OF_FILES -S CHANNEL_STATE [...] CHANNEL_NAME "*" YOUR_SITE

Of course, you then have to issue a glite-transfer-channel-setvoshare for each VO that should be authorized to use the channel and then configure a ChannelAgent for that channel.

Please note that if a VO is not authorized to use a channel between site A and B but has privileges on a *-B channel, transfer requests for that VO from site A to B are denied, since the non-dedicated channel is evaluated after all the dedicated ones.

In addition, please also note that the default ChannelAgent configuration for that channel requires that all the SRMs involved in the managed transfers be listed in the information system. If a VO needs to relax this constraint, for example in order to transfer files to/from Classic SEs not included in the information system, the following parameters should be added to the VOAgent configuration:

  • ACTIONS_ENABLEUNKNOWNSOURCE (transfer-agent-vo-actions.EnableUnknownSource) should be set to true if SEs not known to the InfoSys should be allowed as valid source (these would be matched by the *-Site catch-all channels)
  • ACTIONS_ENABLEUNKNOWNDEST (transfer-agent-vo-actions.EnableUnknownDest) should be set to true if SEs not known to the InfoSys should be allowed as valid destination (these would be matched by the Site-* catch-all channels)
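In YAIM terms, and assuming these parameters follow the same FTA_%SCOPE%_%PARAM_NAME% convention as the others, the settings for a single VO might look like this sketch (DTEAM is used only as an example instance):

```shell
# Allow SEs unknown to the information system for the DTEAM VOAgent only.
FTA_DTEAM_ACTIONS_ENABLEUNKNOWNSOURCE=true
FTA_DTEAM_ACTIONS_ENABLEUNKNOWNDEST=true
```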

If a VO needs these parameters, it would be better to turn off the SURL normalization, or at least set it to fully-qualified, for all the ChannelAgents associated with non-dedicated channels, since it would be impossible to resolve the correct endpoint for an SRM not listed in the InformationSystem. It is also worth recommending that users use fully-qualified SURLs for transfers that should be processed through these channels.

Use of the *-* 'catch everything' channel is not recommended for production grids.

Symptom: After upgrading to FTS 1.5 I got "No Channel found or VO not authorized ..." error

While running the FTS service we encountered many inconsistencies in the way the information was published in BDII, especially in the case used to publish the site name. This is not a problem when BDII is used directly, since it is case insensitive, but it creates interoperability issues when used via ServiceDiscovery (which is case sensitive). We therefore decided to apply a convention, within the FTS boundaries, to have all the site names uppercase in the channel definitions. Starting from version 1.5, the FTS WebService forces the case when you create a new channel, but when upgrading from previous versions this convention may conflict with already defined channels. In order to fix this, we have provided an admin pack that allows changing the channel definitions. The instructions on how to use these tools are available here.

Therefore, if you hit this problem, download the glite-data-transfer-scripts RPM and follow the instructions mentioned above in order to replace all the site names that contain lowercase letters in all the channel definitions (you may need the support of your DBA).

Note: If this RPM is not yet available in the repository, please contact fts-support.
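Before asking your DBA for help, it can be useful to list which site names are affected. The sketch below is only a helper for spotting them and does not replace the admin pack; the function names are hypothetical.

```shell
# Convert a site name to the uppercase form expected by FTS 1.5.
to_upper() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

# Succeed when the site name contains lowercase letters, i.e. when it
# differs from its uppercase form and would need to be changed.
needs_uppercasing() {
    [ "$1" != "$(to_upper "$1")" ]
}
```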

Symptom: My jobs fail if I have a short time left on the proxy in MyProxy

Make sure you have a fresh version in MyProxy that will last at least the length of all your jobs (assume queue length of 2 days from your last submission).

File Transfer Agent

Symptom: Job always in Submitted state

The first action that is executed on a transfer request is the Allocation, performed by the VO agent associated with the VO of the submitter. This action checks the source and destination SURLs of the job request, finds the sites of the involved SEs using ServiceDiscovery and then looks in the registered channels for a match. When this operation succeeds, the job is moved to Pending and the channel_name property is filled with the name of the channel found.

Due to a bug in FTA 1.3 and 1.4 (#10076) a job stays in Submitted state instead of going to Failed in one of the following cases

  • The channel doesn't exist but the source and destination SE are registered in ServiceDiscovery or the VO is configured to accept unknown source and destination
  • The VO of the user who submitted the job has no valid share on the channel
  • The channel is in Stopped, Drain or Halted (actually, when the channel status is Halted, a job should go in Pending and not in Failed)

Usually this problem is due to a configuration error. The first thing to do is to retrieve the status of the channel that should be involved in the transfer

glite-transfer-channel-list CHANNEL_NAME

Check the channel state, that the VO has a share, and that the names of the source and destination sites match the ones retrieved using ServiceDiscovery. If the file plugin is used, look at the site element of the SRM services reported in the services.xml file

  <service name='CERNSC3-SRM'>
    <parameters>
      <endpoint>httpg://castorgridsc.cern.ch:8443/srm/managerv1</endpoint>
      <type>SRM</type>
      <version>1.1.0</version>
      <site>CERN-SC</site>
      <param name='SEMountPoint'>/castor/cern.ch/grid/dteam/storage</param>
    </parameters>
  </service>

and compare them with the value returned by glite-transfer-channel-list
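When the file plugin is used, the site names can be pulled out of services.xml with a one-liner such as the sketch below. It assumes each site element sits on its own line, as in the example above, and it lists the sites of all services, not only the SRM ones; list_sites is a hypothetical name.

```shell
# Print the distinct <site> values found in a services.xml file ($1).
list_sites() {
    sed -n 's:.*<site>\(.*\)</site>.*:\1:p' "$1" | sort -u
}
```

For example, `list_sites /opt/glite/etc/services.xml` prints one site name per line, ready to be compared with the channel definition.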

In case this doesn't fix the problem, check that a VO agent is configured and running for that VO. Do

glite-transfer-status --verbose JOB_ID

And check that the value of the VOName property is correct; if it is not, there is a problem with the FTS glite-data-transfer-submit-mapfile: edit that file manually or regenerate it following the procedures reported by FtsServerSubmitMapfile, cancel the job, wait until the file is reloaded by the FTS and ask the user to resubmit the request.

In case the VO is set correctly, check on the agents node that an agent is configured:

  • if you're using gLite 1.3, please have a look at /opt/glite/etc/config/glite-data-transfer-agents-oracle.cfg.xml and see if there is an instance for the VO:
           <instance name="YOUR_VO-fts">
             <parameters>
               <transfer-vo-agent.Name value="YOUR_VO"/>
               <!-- Other parameter -->
               <!-- ... -->
             </parameters>
           </instance>
         
  • if you're using gLite 1.4, open the file /opt/glite/etc/config/glite-file-transfer-agents-oracle.cfg.xml and look for an instance:
           <instance name="YOUR_VO" service="transfer-vo-agent-fts"/>
         

If the instance is missing, or the naming convention is not correct, edit the appropriate file and rerun the configuration script.

If the instance is there, check if it's running, using the command

/opt/glite/etc/init.d/glite-data-transfer-agents --instance glite-transfer-vo-agent-YOUR_VO status

or

service transfer-agents --instance glite-transfer-vo-agent-YOUR_VO status

(was service glite-data-transfer-agents ... before 1.5)

If the job is still Submitted, follow the procedure reported here

Symptom: Job always in Pending state

After a transfer request is allocated to a channel, its status is moved to Pending. The ChannelAgent will then process this request based on its internal inter-VO scheduling.

If the job state remains Pending forever, you have to check the following things:

  • The related ChannelAgent daemon should be running
  • The Channel state should be set to Active
  • The VO should have a share on the channel that is greater than 0

In order to check if the agent is running, use the command

/opt/glite/etc/init.d/glite-data-transfer-agents --instance glite-transfer-channel-agent-TYPE-CHANNEL_NAME status

or

service transfer-agents --instance glite-transfer-channel-agent-TYPE-CHANNEL_NAME status

(was service glite-data-transfer-agents ... before 1.5)

You can check the Channel state and VO share using the command:

glite-transfer-channel-list CHANNEL_NAME

If the job is still Pending, follow the procedure reported here

Symptom: All my transfers fail with a SECURITY_ERROR

This issue is usually due to a problem in the interaction between an FTA and the MyProxy server. This mainly happens in the following cases:

  • User is mistyping the MyProxy passphrase when submitting the job
  • User has an invalid or expired certificate in MyProxy
  • The agent is not an authorized retriever for MyProxy
  • There is an authentication problem (expired certificate or CRL)

In the first two cases, all the transfers of this user should fail while the ones of other users succeed; in the other cases all the transfers would fail, independently of the user.

Usually, you can detect the type of the error by having a look at the agent log file in /opt/log/glite/glite-transfer-channel-agent-TYPE-CHANNEL_NAME.log or /opt/log/glite/glite-transfer-vo-agent-VO_NAME.log

  • If the problem is due to a wrong passphrase, you'll see
       2005-08-26 07:25:52,281 ERROR  transfer-agent-myproxy - Failed to get the proxy from the MyProxyServer. Reason is: 
       Reason is Error in bind()
       ERROR from server: invalid pass phrase
       

Then ask the user to resubmit the job, possibly using the -p option of glite-transfer-submit. If the problem persists, the user may have forgotten the passphrase, so ask him/her to store the credential in MyProxy again using

myproxy-init -s MYPROXY_SERVER -d

  • If the agent is not an authorized retriever, you'll see an entry similar to
       2005-08-26 07:25:52,281 ERROR  transfer-agent-myproxy - Failed to get the proxy from the MyProxyServer. Reason is: 
       ERROR from server: "<anonymous>" not authorized by server's authorized_retriever policy
       

If that is the case, you have to contact the MyProxy server administrator and ask him/her to add the DN of the certificate of the account used to run the agent. If it still doesn't work, please also check that the agent is running with a valid certificate, following what is described here

  • in case the entry is similar to
       2005-08-26 07:25:52,281 ERROR  transfer-agent-myproxy - Failed to get the proxy from the MyProxyServer. Reason is: 
       Error authenticating: GSS Major Status: Authentication Failed
       GSS Minor Status Error Chain: (null)
       

This problem is usually due to an expired certificate or to an expired certificate revocation list (CRL). Please check the validity of the certificates and update the CRL on both the agent and MyProxy nodes

  • If you see errors like:
       2005-08-26 07:25:52,281 ERROR  transfer-agent-myproxy - Failed to get the proxy from the MyProxyServer. Reason is: 
       Reason is Error in bind()
       
    without any other details, please check that the environment variables MYPROXY_TCP_PORT_RANGE and GLOBUS_TCP_PORT_RANGE are unset for the account used to run the agents.

  • In the other cases, ask the user to store his/her certificate in MyProxy again, running the command myproxy-init -s MYPROXY_SERVER -d

Please note that the -d option is required in order to associate the credentials with the DN of the user instead of the account name
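The cases above can be summarised as a lookup from log text to likely cause. The sketch below matches only on the message fragments quoted in this section; the function name and the returned labels are hypothetical, and real log files may contain other variants.

```shell
# Print a short diagnosis for a MyProxy error excerpt taken from an agent log.
# $1 = the log text (one or more lines)
diagnose_myproxy_error() {
    case "$1" in
        # Checked first: this message also contains "Error in bind()".
        *"invalid pass phrase"*)   echo "wrong-passphrase" ;;
        *"authorized_retriever"*)  echo "agent-not-authorized-retriever" ;;
        *"Authentication Failed"*) echo "expired-certificate-or-crl" ;;
        *"Error in bind()"*)       echo "check-tcp-port-range-variables" ;;
        *)                         echo "store-credential-again" ;;
    esac
}
```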

If you need to know which MyProxy server is used, have a look here

Which MyProxy Server is used?

When an agent has to perform an operation on behalf of the user, it retrieves the user's delegated credentials from the configured MyProxy server, caches them in the local file system and then impersonates the user by setting the environment variable X509_USER_PROXY. The operations where this is required are:

  • Retrieve services endpoints and information from ServiceDiscovery
  • Perform the transfer
  • Contact the catalog for retrieving the list of replicas and registering the new ones when the transfer is finished (only in case of FPS VO Agent)

The endpoint of the MyProxy server is usually retrieved using ServiceDiscovery, so in case of the file plugin, you need to have an entry in /opt/glite/etc/services.xml like

 <service name='MyProxy'>
    <parameters>
      <endpoint>myproxy://myproxy.cern.ch</endpoint>
      <type>MyProxy</type>
      <version>1.14</version>
    </parameters>
  </service>

You can query the InfoSys using the command

glite-sd-query -t MyProxy

In order to resolve which MyProxy server should be used, the FileTransferAgent looks into the associated services of the FileTransferService that received the user's request (available from gLite 1.3 QF23) or, if none is found, takes the first MyProxy server returned by the InformationSystem. You can also force the agent to use a specific instance by setting the agent configuration property MYPROXY_SERVER (transfer-agent-myproxy.Server). If this property is not set and there is no MyProxy entry registered in the InfoSys, the environment variable $MYPROXY_SERVER is used.

Starting from version gLite 1.3 QF23, the user can also specify the MyProxy server to use by providing the option -m myproxy_hostname on the glite-transfer-submit command line.
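The lookup order described in this section can be summarised as a small decision chain. This is only an illustrative sketch: resolve_myproxy is a hypothetical helper, not agent code, and the exact precedence is not spelled out above, so the sketch assumes that a forced configuration property wins over the information system lookups. The -m option of glite-transfer-submit is not modelled.

```shell
# resolve_myproxy FORCED ASSOCIATED FIRST_FROM_INFOSYS
# An empty argument means "not available". Hypothetical helper.
resolve_myproxy() {
    forced=$1
    associated=$2
    infosys=$3
    if [ -n "$forced" ]; then
        # Agent property MYPROXY_SERVER (assumed to take precedence)
        echo "$forced"
    elif [ -n "$associated" ]; then
        # Service associated with the FTS that received the request
        echo "$associated"
    elif [ -n "$infosys" ]; then
        # First MyProxy server returned by the information system
        echo "$infosys"
    else
        # Last resort: the environment variable of the agent account
        echo "${MYPROXY_SERVER:-}"
    fi
}
```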

Symptom: I've noticed a warning "Cannot Get Agent DN" in the agent log files

You can see this entry when the agent doesn't run with a valid certificate. When an FTA starts, it logs the DN of the certificate the agent will use. This certificate is used to perform the following actions:

  • Retrieve the user's delegated credentials from MyProxy using the passphrase provided by the user. This happens on both the Channel and the VO Agents
  • Perform the transfer

If the agent doesn't have a valid certificate, it's likely that these operations would fail.

To fix this problem, first check that the user running the agents has a valid certificate: usually the certificate and key are installed in $HOME/.globus/usercert.pem and $HOME/.globus/userkey.pem and should be owned by that user. If the certificate is installed in a different place, the environment variables X509_USER_CERT and X509_USER_KEY should be set accordingly. You should also check that the certificate has not expired, by running:

openssl x509 -text -in ~/.globus/usercert.pem

or

openssl x509 -text -in $X509_USER_CERT

If the certificate is valid but the agent still reports the warning, check whether there is an expired proxy certificate in /tmp/x509up_uUSER_ID (where USER_ID is the user id of the account used to run the agent) and delete it.
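The openssl commands above print the whole certificate. As a shorter alternative, the sketch below prints only the expiry date and classifies a PEM file. The helper names proxy_path, cert_end_date and pem_status are hypothetical; the sketch relies on the -enddate and -checkend options of openssl x509 and assumes it is run as the account used for the agents.

```shell
# Sketch only: helper names are hypothetical.

# Path of the default proxy file for the current account.
proxy_path() {
    echo "/tmp/x509up_u$(id -u)"
}

# Prints the notAfter date of the first certificate in the PEM file $1.
cert_end_date() {
    openssl x509 -noout -enddate -in "$1"
}

# Prints "missing", "valid" or "expired-or-unreadable" for the file $1.
pem_status() {
    if [ ! -f "$1" ]; then
        echo "missing"
    elif openssl x509 -noout -checkend 0 -in "$1" >/dev/null 2>&1; then
        echo "valid"
    else
        echo "expired-or-unreadable"
    fi
}
```

For example, pem_status "$(proxy_path)" tells whether a stale proxy file is present before you decide to delete it.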

Symptom: My srmcopy transfers fail with a dCache MalformedUrl exception

You may notice this error when a user is transferring files to a dCache SE using a channel configured to perform srmcopy transfers. This is due to a bug in the URL parsing of dCache versions <= 1.6.5. You have to ask the user to resubmit his/her requests using the following conventions:

  • In case the destination SE is dCache, and the source is Castor or DPM
    • Source SURL can be
             srm://<castorsrm>:<port>//srm/managerv1?SFN=<path>
             srm://<castorsrm>:<port>/?SFN=<path>
             srm://<castorsrm>/<path>
             
    • Destination SURL should be
             srm://<dcachesrm>:<port>/srm/managerv1?SFN=<path>
             srm://<dcachesrm>/<path>
             
  • In case the source SE is dCache and the destination one is Castor or DPM
    • Source SURL should be
             srm://<dcachesrm>:<port>/srm/managerv1?SFN=<path>
             srm://<dcachesrm>/<path>
             
    • Destination SURL can be
             srm://<castorsrm>:<port>/srm/managerv1?SFN=<path>
             srm://<castorsrm>:<port>//srm/managerv1?SFN=<path>
             srm://<castorsrm>:<port>/?SFN=<path>
             srm://<castorsrm>:<port>/<path>
             srm://<castorsrm>/<path>
             
  • In case both the source and destination SE are dCache
    • Source SURL should be
             srm://<dcachesrm>:<port>//srm/managerv1?SFN=<path>
             srm://<dcachesrm>/<path>
             
    • Destination SURL should be
             srm://<dcachesrm>:<port>/srm/managerv1?SFN=<path>
             srm://<dcachesrm>/<path>
             

This problem is fixed in dCache v 1.6.6; however, this new version doesn't seem to accept the compact SURL format

       srm://<srmhost>/<path>
       

If the destination SE is dCache and its version is 1.6.6, we therefore suggest using, for both source and destination SURLs, either:

       srm://<srmhost>:<port>/<path>
       

or the fully qualified one:

       srm://<srmhost>:<port>/srm/managerv1?SFN=<path>
       

Symptom: I've upgraded to 1.4.1 but srmcopy doesn't seem to work

Starting from version 1.3QF23, the FileTransferAgent normalizes the SURLs before executing all the SRM get, put and copy requests, and the default normalization is to convert them into the compact format

       srm://<srmhost>/<path>
       

As described in the previous section, we observed a problem with dCache srmcopy in version 1.6.6 not working with this format: after ~30 minutes the error returned is

number of retries exceeded:org.dcache.srm.scheduler.NonFatalJobFailure: java.io.IOException: both from and to url are not local srm

In order to work around this problem, you have to change the FileTransferAgent normalization to use a different format, by setting the ChannelAgent configuration property ACTIONS_SURLNORMALIZATION (=transfer-agent-channel-actions.SurlNormalization) to either compact-with-port for converting to the format

       srm://<srmhost>:<port>/<path>
       

or fully-qualified for the format

       srm://<srmhost>:<port>/srm/managerv1?SFN=<path>
       

Please note that this is not a bug in FTS, but a problem in dCache; you might have observed it after upgrading to 1.4.1 because this version of FTS was released at more or less the same time as dCache 1.6.6.
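To make the three SURL formats concrete, here is a minimal sketch that rewrites a SURL from any of them into the requested one. It is not the agent's normalization code: surl_convert is a hypothetical helper, the format names are borrowed from the property values above, the port defaults to 8443 when the SURL carries none, and the file path is assumed to be absolute.

```shell
# surl_convert FORMAT SURL
# FORMAT is one of: compact, compact-with-port, fully-qualified.
# Hypothetical helper for illustration only.
surl_convert() {
    format=$1
    rest=${2#srm://}              # drop the scheme
    hostport=${rest%%/*}          # "host" or "host:port"
    path=/${rest#*/}              # everything after the first slash
    case $path in
        *\?SFN=*) path=${path#*\?SFN=} ;;   # keep only the SFN value
    esac
    host=${hostport%%:*}
    case $hostport in
        *:*) port=${hostport#*:} ;;
        *)   port=8443 ;;         # assumed default SRM port
    esac
    case $format in
        compact)           echo "srm://$host$path" ;;
        compact-with-port) echo "srm://$host:$port$path" ;;
        fully-qualified)   echo "srm://$host:$port/srm/managerv1?SFN=$path" ;;
        *) echo "unknown format: $format" >&2; return 1 ;;
    esac
}
```

For instance, surl_convert compact-with-port 'srm://se.example.org/data/file1' prints srm://se.example.org:8443/data/file1 (se.example.org is a placeholder host).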

Symptom: I've upgraded to 1.4.1 but the transfer failed with Error in srm__ping: NULL

Starting from version 1.4.1, FTS retrieves the SRM endpoint from the information system. Earlier versions parsed the SURL instead and, when one of the compact formats was used, fell back to the default port (8443) and service path (srm/managerv1). In case your transfers start failing after the upgrade with an error:

       Cannot Contact SRM Service. Error in srm__ping: NULL
       

the entry in the information system is probably not correct: a common error that has been observed is that the SRM endpoint is stored as

       srm://<srmhost>:<port>/srm/managerv1
       

instead of

       httpg://<srmhost>:<port>/srm/managerv1
       

You can also look into the transfer log files (located in /var/tmp/glite-transfer-url-copy-UID/CHANNEL_NAMEfailed on the related ChannelAgent box) and check which endpoint is used for the SRM calls.
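Once you have the endpoint string, whether from the information system or from a transfer log, the scheme mistake described above is easy to spot mechanically. A minimal sketch, with check_srm_endpoint as a hypothetical helper that only looks at the scheme prefix:

```shell
# check_srm_endpoint ENDPOINT
# Reports whether the published SRM endpoint uses the httpg:// scheme.
# Hypothetical helper for illustration only.
check_srm_endpoint() {
    case $1 in
        httpg://*) echo "ok" ;;
        srm://*)   echo "wrong scheme: expected httpg://" ;;
        *)         echo "unrecognised endpoint" ;;
    esac
}
```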

Symptom: The transfer failed with the error: No site found for host ...

During the allocation phase the VOAgent needs to resolve which sites will be involved in the transfer. To do that, the agent looks up in the information system the site names of the source and destination SRMs, querying by the hostname retrieved from the provided SURLs.

In case the user gets an error like:

Failed to Get Channel Name: No site found for host ...

You have to check the following things:

  • The entries for the SRM services are listed in the information system
  • The SD library plugins are defined and configured properly (environment variables, files, etc.)
  • If the file-based plugin is chosen, the /opt/glite/etc/services.xml file is properly formatted

In order to detect errors, it's useful to run the command:

su - ACCOUNT_USED_TO_RUN_THE_VOAGENT -c '/opt/glite/bin/glite-sd-query -t SRM --host SRM_HOSTNAME' 

and check the result (this command executes the same query as the agent).

If the problem still persists, it may be worth having a look at the /proc table to see whether

/proc/VOAGENT_PROCESS_ID/environ

contains the correct values for the GLITE_LOCATION and GLITE_SD_* environment variables.
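Entries in /proc/PID/environ are separated by NUL characters, so the file is hard to read directly. A minimal sketch for extracting the relevant variables follows; glite_env_from is a hypothetical helper, and reading the environ file of another account's process normally requires root.

```shell
# glite_env_from FILE
# Prints the GLITE_LOCATION and GLITE_SD_* entries found in a
# NUL-separated environment dump such as /proc/PID/environ.
# Hypothetical helper for illustration only.
glite_env_from() {
    tr '\0' '\n' < "$1" | grep -E '^GLITE_(LOCATION|SD_[A-Za-z0-9_]*)='
}
```

Call it as glite_env_from /proc/VOAGENT_PROCESS_ID/environ, replacing VOAGENT_PROCESS_ID with the process id of the running VO agent.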

In case the StorageElement should not be listed in the information system, you may want to have a look here

Which Service Types are used?

The File Transfer Agent needs to interact with external services in order to accomplish its tasks and uses the gLite ServiceDiscovery API to discover their properties. The services involved are:

  • MyProxy: used to retrieve the clients' delegated credentials
  • SRM & GridFtp: the site information is used to allocate a transfer job to a channel
  • FileCatalog: used by the vo-agent in FPS mode in order to retrieve the source replicas to be used for a transfer and to register the new replicas when the transfer is finished

In order to discover that information the File Transfer Agent uses the service types listed in Glue Service Types.

As reported in bug #12961, however, the service type for a GridFtp server is set to GridFTP instead of gsiftp and a backward compatible fix is foreseen for a future release. As a temporary workaround you could follow the comments reported on the bug.

I've tried everything, and it still doesn't seem to work

If your problem is listed on this page but none of the proposed solutions seems to work, you can generate verbose log files and send them to fts-support. To generate these files, please follow this procedure:

For each agent involved (the VO one, responsible for allocating a transfer to a channel and retrying failed transfers, and the Channel one, responsible for transferring the files and monitoring their status), please edit the files glite-transfer-vo-agent-VO_NAME.log-properties (in case of VO FTA) and/or glite-transfer-channel-agent-TYPE-CHANNEL_NAME.log-properties (in case of Channel FTA) in /opt/glite/etc/glite-data-transfer-agents.d/ and replace the lines

log4j.rootCategory=INFO, file

with

log4j.rootCategory=DEBUG, file

and

log4j.appender.file.fileName=/var/log/glite/glite-transfer-channel-agent-TYPE-CHANNEL_NAME.log

or

log4j.appender.file.fileName=/var/log/glite/glite-transfer-vo-agent-VO_NAME.log

with

log4j.appender.file.fileName=/var/log/glite/glite-transfer-channel-agent-TYPE-CHANNEL_NAME.debug.log

or

log4j.appender.file.fileName=/var/log/glite/glite-transfer-vo-agent-VO_NAME.debug.log

Restart the agents and let them run for about 1 minute; then stop the agents, restore the original values in the modified files, start the agents again and mail the /var/log/glite/*.debug.log files to fts-support.
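The edit and restore steps of this procedure can be scripted to avoid typing mistakes. This is a minimal sketch, not a tool shipped with FTS: enable_debug and restore_log are hypothetical helpers, and the .orig backup suffix is an arbitrary choice. Stopping and starting the agents is not covered. Run enable_debug only once per file, since a second run would overwrite the backup.

```shell
# enable_debug FILE
# Keeps a backup in FILE.orig, then switches the root category to DEBUG
# and redirects the output to a *.debug.log file. Hypothetical helper.
enable_debug() {
    cp -p "$1" "$1.orig" &&
    sed -e 's/^log4j\.rootCategory=INFO, file$/log4j.rootCategory=DEBUG, file/' \
        -e '/^log4j\.appender\.file\.fileName=/ s/\.log$/.debug.log/' \
        "$1.orig" > "$1"
}

# restore_log FILE
# Puts the original properties file back. Hypothetical helper.
restore_log() {
    mv "$1.orig" "$1"
}
```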

Channel Administration

Symptom: How do I set the number of files transferred per VO instead of per channel?

In the FTS Channel Agent there are three parameters you can act on in order to tune the inter-VO scheduling: the channel VO share, the number of files that the channel can process concurrently and the AGENT_VOSHARETYPE (transfer-channel-agent.VOShareType) configuration property. The purpose of this configuration property is to define the policy for how the VO share should be interpreted for a channel; you can add it to the instance that corresponds to the related channel agent in the configuration file. The allowed values are:

  • normalized: the share is the value of the channel voshare property for the given VO, normalized to the sum of all the shares for all the VOs in the same channel. This option can be used when channel administrators want to guarantee slots for certain VOs, in order to implement some sort of QoS, accepting that the total throughput may be penalized (transfer slots are reserved for a VO even if that VO has no job to process)

  • absolute: the share is the value of the channel voshare property expressed as a percentage. No normalization is performed, which means that the sum of all the shares on the same channel can exceed 100%. This option can be used when channel administrators want to balance the share between the VOs, without allowing a single VO to fully allocate a channel but minimizing the risk of allocating slots to VOs that don't have any job to process. This option implies some tuning of the VO share values based on experience, but it allows a compromise between throughput and QoS.

  • normalized-on-active: the share is the value of the channel voshare property for the given VO, normalized to the sum of all the shares for all the VOs in the same channel that have at least one job that can be processed by the Channel Agent (job state should be Active, Pending or Canceling). This option is the default and should be used when the channel administrators want to optimize the throughput of the channel (the channel can be fully allocated even by one VO), but with a lower QoS

As an example, supposing you have a channel that has 30 files and 3 VOs, you could have:

                 Normalized   Absolute    Normalized-on-active*
VO     Share     Max Files    Max Files   Max Files
VO_1   50        15           15          0
VO_2   30        9            9           18
VO_3   20        6            6           12

(* supposing VO_1 has no job to submit)

As you can see, when the sum of the VO shares is 100 there's no difference between the "normalized" and "absolute" setups. But if this constraint is not respected, you can have:

                 Normalized   Absolute    Normalized-on-active*
VO     Share     Max Files    Max Files   Max Files
VO_1   70        14           21          0
VO_2   50        10           15          19
VO_3   30        6            9           11

(* supposing VO_1 has no job to submit)

Please note that the values in the "Max Files" columns correspond to the maximum number of files a VO is authorized to submit at the same time. In any case the constraint imposed by the "files" channel property is always respected.

If you want to start with two VOs, each able to perform up to 15 transfers concurrently, set AGENT_VOSHARETYPE (transfer-channel-agent.VOShareType) to normalized (or absolute), the VO share of each VO to 50 and the channel files to 30. You will then allow up to 30 parallel transfers on the channel, but each VO will not be able to submit more than 15 at the same time. If you later have to support other VOs, you'll need to adjust these percentages.
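The figures in the tables above can be reproduced with a few lines of integer arithmetic. This is a minimal sketch, not the Channel Agent's code: vo_max_files is a hypothetical helper, and rounding to the nearest integer is an assumption inferred from the second table (18.75 becomes 19, 11.25 becomes 11). A VO without processable jobs gets 0 under normalized-on-active and its share is left out of the active sum; the caller has to handle that case.

```shell
# vo_max_files TYPE FILES SHARE SUM_ALL SUM_ACTIVE
#   TYPE        normalized | absolute | normalized-on-active
#   FILES       value of the channel "files" property
#   SHARE       voshare of the VO being evaluated
#   SUM_ALL     sum of the shares of all VOs on the channel
#   SUM_ACTIVE  sum of the shares of the VOs with processable jobs
# Hypothetical helper for illustration only.
vo_max_files() {
    case $1 in
        normalized)           total=$4 ;;
        absolute)             total=100 ;;
        normalized-on-active) total=$5 ;;
        *) echo "unknown share type: $1" >&2; return 1 ;;
    esac
    # FILES * SHARE / total, rounded to the nearest integer (assumption)
    echo $(( ($2 * $3 * 2 + total) / (2 * total) ))
}
```

With the values of the second table, vo_max_files normalized 30 70 150 80 prints 14 and vo_max_files absolute 30 70 150 80 prints 21.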


Last edit: LaurenceField on 2008-09-26 - 15:44

Number of topics: 1

Maintainer: PaoloBadino


Topic revision: r26 - 2007-01-22 - PaoloBadino
 