GridFTP errors knowledge base

This page describes the GridFTP errors seen on production tier-0 export transfers and details the causes of these errors.

data_connect_failed() failed: authentication failed

Description

GridFTP tries to read the user's ~/.globus directory, checking for certificates. If the user's home directory does not exist, there is no problem (ie. dteam001, cms002 on the diskservers).

If the user's home directory does exist, the ~/.globus directory needs to be "listable". Otherwise, gridftp will complain like this:

error: the server sent an error response: 425 425 Can't open data connection. data_connect_failed() failed: authentication failed:
GSS Major Status: General failure
GSS Minor Status Error Chain:

accept_sec_context.c:115: gss_accept_sec_context: Error with gss context
globus_i_gsi_gss_utils.c:327: globus_i_gsi_gss_create_and_fill_context: Error with GSI credential
globus_gsi_system_config.c:3493: globus_gsi_sysconfig_get_cert_dir_unix: Could not find a valid trusted CA certificates directory
globus_gsi_system_config.c:313: globus_i_gsi_sysconfig_create_cert_dir_string: Could not find a valid trusted CA certificates directory: /afs/cern.ch/user/j/jveldik/.globus/certificates has invalid permissions.

Scope and Impact

This problem affects users of non-LHC VO's (compass!), who are likely to be mapped to AFS accounts.

Solution

Check ACLs with fs la ~/.globus. If necessary, run fs sa -dir ~/.globus -acl system:anyuser l

451 Local resource failure: malloc: Cannot allocate memory

The error message 451 Local resource failure: malloc: Cannot allocate memory in /var/log/gridftp.log has a non-obvious background. It could be caused by problems on the other end of the transfer...

>

> Does someone know how to interpret the gridftp error message

> '451 Local resource failure: malloc: Cannot allocate memory'?

David Smith explains:

It really is a malloc() failure - but the reason (at least for all 6 malloc errors in today's log so far, 
on that disk server) is a logic error during the premature termination of the loop the gridftp server 
enters while it is sending out the data:

There is a timeout of 20 minutes inactivity on the data channel once it is connected: if that is exceeded 
the transfer should abort. However the way g_send_data() (from globus_hack.c) works at the moment is that 
after the timer expires the rest of whatever file is being sent is read into memory (without waiting to 
send it anywhere). The 3 different files involved with the 6 instances of the malloc failure I looked at 
were all with ~3.6Gb files. This may not actually quite exhaust the memory on the machine, but probably 
caused the heap for the single ftp process to reach its maximum size, just because of the limitation of 
addressable memory.

The 6 problem transfers were all to various disk servers at in2p3. Checking one of the FTS logs (which 
include the performance markers) there was apparently no data transfered at all - so perhaps the IN2P3 
side was waiting for some condition before being able to accept any bytes. (Free space, free mover?)

And indeed: note the 20 minute gap in the log entries below!

          May  4 07:34:28 lxfsrc6606 gridftpd[17904]: connection from fts109.cern.ch [128.142.160.17]
          May  4 07:34:29 lxfsrc6606 gridftpd[17904]: <--- 220 lxfsrc6606.cern.ch CASTOR GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2(cern-2) (gcc32dbg,
1069715860-42) ready.
          May  4 07:34:29 lxfsrc6606 gridftpd[17904]: <--- 334 Using authentication type GSSAPI; ADAT must follow
          May  4 07:34:29 lxfsrc6606 gridftpd[17904]: <--- ADAT (13749 bytes)
          May  4 07:34:30 lxfsrc6606 gridftpd[17904]: <--- ADAT (109 bytes)
          May  4 07:34:30 lxfsrc6606 gridftpd[17904]: <--- ADAT (413 bytes)
          May  4 07:34:30 lxfsrc6606 gridftpd[17904]: client identity is: "/C=CH/O=CERN/OU=GRID/CN=Miguel De Oliveira Branco 2423"
          May  4 07:34:31 lxfsrc6606 gridftpd[17904]: <--- 235 GSSAPI Authentication succeeded
          May  4 07:34:31 lxfsrc6606 gridftpd[17904]: GSSAPI user "/C=CH/O=CERN/OU=GRID/CN=Miguel De Oliveira Branco 2423" is authorized as atlas002
          May  4 07:34:31 lxfsrc6606 gridftpd[17904]: <--- 331 GSSAPI user "/C=CH/O=CERN/OU=GRID/CN=Miguel De Oliveira Branco 2423" is authorized as atlas002
          May  4 07:34:32 lxfsrc6606 gridftpd[17904]: USER :globus-mapping:
          May  4 07:34:32 lxfsrc6606 gridftpd[17904]: PASS password
          May  4 07:34:32 lxfsrc6606 gridftpd[17904]: <--- 230-No directory! Logging in with home=/
          May  4 07:34:33 lxfsrc6606 gridftpd[17904]: stager-mapping: STAGE_HOST: <c2atlast0rh>; STAGE_POOL: <t0perm>; STAGE_SVCCLASS: <t0perm>;
RFIO_USE_CASTOR_V2: <YES>
          May  4 07:34:33 lxfsrc6606 gridftpd[17904]: <--- 230 User atlas002 logged in.
          May  4 07:34:33 lxfsrc6606 gridftpd[17904]: FTP LOGIN FROM fts109.cern.ch [128.142.160.17], atlas002
          May  4 07:34:33 lxfsrc6606 gridftpd[17904]: failed to stat umask-mapfile "/etc/castor/gridftpumaskmap.conf": No such file or directory.
          May  4 07:34:33 lxfsrc6606 gridftpd[17904]: FEAT
          May  4 07:34:34 lxfsrc6606 gridftpd[17904]: <--- 211-Extensions supported:
          May  4 07:34:34 lxfsrc6606 gridftpd[17904]: <---  REST STREAM
          May  4 07:34:35 lxfsrc6606 gridftpd[17904]: <---  ESTO
          May  4 07:34:35 lxfsrc6606 gridftpd[17904]: <---  ERET
          May  4 07:34:36 lxfsrc6606 gridftpd[17904]: <---  MDTM
          May  4 07:34:37 lxfsrc6606 gridftpd[17904]: <---  SIZE
          May  4 07:34:38 lxfsrc6606 gridftpd[17904]: <---  PARALLEL
          May  4 07:34:38 lxfsrc6606 gridftpd[17904]: <---  DCAU
          May  4 07:34:39 lxfsrc6606 gridftpd[17904]: <--- 211 END
          May  4 07:34:40 lxfsrc6606 gridftpd[17904]: TYPE Image
          May  4 07:34:41 lxfsrc6606 gridftpd[17904]: <--- 200 Type set to I.
          May  4 07:34:41 lxfsrc6606 gridftpd[17904]: MODE 0
          May  4 07:34:42 lxfsrc6606 gridftpd[17904]: <--- 200 MODE E ok.
          May  4 07:34:42 lxfsrc6606 gridftpd[17904]: SIZE //castor/cern.ch/grid/atlas/t0/perm/T0.D.run009688.RAW/T0.D.run009688.RAW._lumi0014._0001._sfo01
          May  4 07:34:42 lxfsrc6606 gridftpd[17904]: <--- 213 3840000000
          May  4 07:34:43 lxfsrc6606 gridftpd[17904]: <--- 200 Opts successful.
          May  4 07:34:43 lxfsrc6606 gridftpd[17904]: DCAU
          May  4 07:34:44 lxfsrc6606 gridftpd[17904]: <--- 200 DCAU N
          May  4 07:34:44 lxfsrc6606 gridftpd[17904]: PORT
          May  4 07:34:45 lxfsrc6606 gridftpd[17904]: ok: PORT 193.48.99.86,30018 from fts109.cern.ch [128.142.160.17]
          May  4 07:34:46 lxfsrc6606 gridftpd[17904]: <--- 200 PORT command successful.
          May  4 07:34:46 lxfsrc6606 gridftpd[17904]: RETR //castor/cern.ch/grid/atlas/t0/perm/T0.D.run009688.RAW/T0.D.run009688.RAW._lumi0014._0001._sfo01
          May  4 07:35:02 lxfsrc6606 gridftpd[17904]: using rfio: file not local
          May  4 07:35:02 lxfsrc6606 gridftpd[17904]: <--- 150 Opening BINARY mode data connection.
          May  4 07:55:35 lxfsrc6606 gridftpd[17904]: <--- 451 Local resource failure: malloc: Cannot allocate memory.
          May  4 07:55:35 lxfsrc6606 gridftpd[17904]: Local resource failure: malloc
          May  4 07:55:37 lxfsrc6606 gridftpd[17904]: <--- 221 You could at least say goodbye.
          May  4 07:55:43 lxfsrc6606 gridftpd[17904]: FTP session closed

Data connection. data_write() failed: Handle not in the proper state

To be described...

Example:

          May 16 04:45:08 lxfsra3001 gridftpd[5523]: connection from fts108.cern.ch [128.142.160.81]
          May 16 04:45:08 lxfsra3001 gridftpd[5523]: <--- 220 lxfsra3001.cern.ch CASTOR GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2(cern-2) (gcc32dbg, 1069715860-42) ready.
          May 16 04:45:08 lxfsra3001 gridftpd[5523]: <--- 334 Using authentication type GSSAPI; ADAT must follow
          May 16 04:45:08 lxfsra3001 gridftpd[5523]: <--- ADAT (13749 bytes)
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- ADAT (109 bytes)
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- ADAT (413 bytes)
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: client identity is: "/C=CH/O=CERN/OU=GRID/CN=Maarten Litmaath 1689"
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 235 GSSAPI Authentication succeeded
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: GSSAPI user "/C=CH/O=CERN/OU=GRID/CN=Maarten Litmaath 1689" is authorized as dteam001
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 331 GSSAPI user "/C=CH/O=CERN/OU=GRID/CN=Maarten Litmaath 1689" is authorized as dteam001
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: USER :globus-mapping:
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: PASS password
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 230-No directory! Logging in with home=/
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: stager-mapping: STAGE_HOST: <castorpublic>; STAGE_POOL: <dteam>; STAGE_SVCCLASS: <dteam>; RFIO_USE_CASTOR_V2: <YES>
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 230 User dteam001 logged in.
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: FTP LOGIN FROM fts108.cern.ch [128.142.160.81], dteam001
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: failed to stat umask-mapfile "/etc/castor/gridftpumaskmap.conf": No such file or directory.
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: FEAT
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 211-Extensions supported:
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <---  REST STREAM
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <---  ESTO
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <---  ERET
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <---  MDTM
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <---  SIZE
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <---  PARALLEL
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <---  DCAU
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 211 END
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: TYPE Image
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 200 Type set to I.
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: MODE 0
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 200 MODE E ok.
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: SIZE //castor/cern.ch/grid/dteam/sc4/044/file-051.dat
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 213 1048576000
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 200 Opts successful.
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: DCAU
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 200 DCAU N
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: PORT
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: ok: PORT 131.169.80.47,20002 from fts108.cern.ch [128.142.160.81]
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 200 PORT command successful.
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: RETR //castor/cern.ch/grid/dteam/sc4/044/file-051.dat
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: using rfio: file not local
          May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 150 Opening BINARY mode data connection.
          May 16 04:45:24 lxfsra3001 gridftpd[5523]: <--- 426 Data connection. data_write() failed: Handle not in the proper state
          May 16 04:45:24 lxfsra3001 gridftpd[5523]: Data connection. data_write() failed: Handle not in the proper state
          May 16 04:45:24 lxfsra3001 gridftpd[5523]: <--- 221 You could at least say goodbye.
          May 16 04:45:24 lxfsra3001 gridftpd[5523]: lost connection to fts108.cern.ch [128.142.160.81]
          May 16 04:45:24 lxfsra3001 gridftpd[5523]: FTP session closed
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: connection from fts108.cern.ch [128.142.160.81]
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 220 lxfsra3001.cern.ch CASTOR GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2(cern-2) (gcc32dbg, 1069715860-42) ready.
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 334 Using authentication type GSSAPI; ADAT must follow
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- ADAT (13749 bytes)
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- ADAT (109 bytes)
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- ADAT (413 bytes)
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: client identity is: "/C=CH/O=CERN/OU=GRID/CN=Maarten Litmaath 1689"
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 235 GSSAPI Authentication succeeded
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: GSSAPI user "/C=CH/O=CERN/OU=GRID/CN=Maarten Litmaath 1689" is authorized as dteam001
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 331 GSSAPI user "/C=CH/O=CERN/OU=GRID/CN=Maarten Litmaath 1689" is authorized as dteam001
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: USER :globus-mapping:
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: PASS password
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 230-No directory! Logging in with home=/
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: stager-mapping: STAGE_HOST: <castorpublic>; STAGE_POOL: <dteam>; STAGE_SVCCLASS: <dteam>; RFIO_USE_CASTOR_V2: <YES>
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 230 User dteam001 logged in.
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: FTP LOGIN FROM fts108.cern.ch [128.142.160.81], dteam001
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: failed to stat umask-mapfile "/etc/castor/gridftpumaskmap.conf": No such file or directory.
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: FEAT
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 211-Extensions supported:
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <---  REST STREAM
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <---  ESTO
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <---  ERET
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <---  MDTM
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <---  SIZE
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <---  PARALLEL
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <---  DCAU
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 211 END
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: TYPE Image
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 200 Type set to I.
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: SIZE //castor/cern.ch/grid/dteam/sc4/042/file-020.dat
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 213 1048576000
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: QUIT
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 221-You have transferred 0 bytes in 0 files.
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 221-Total traffic for this session was 22299 bytes in 0 transfers.
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 221-Thank you for using the FTP service on lxfsra3001.cern.ch.
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: <--- 221 Goodbye.
          May 16 08:58:31 lxfsra3001 gridftpd[25523]: FTP session closed

451 Timeout (900 seconds): closing control connection.

 Hi,

 

Since the last 'export meeting' I'm not sure if anyone had been looking at the:

 

ERROR the server sent an error response: 421 421 Timeout (900

seconds): closing control connection.

 

errors (on the castor side), but since that meeting FTS had a GGUS ticket opened asking about it - so in the last few days I decided to check into so we could at least answer the ticket. I looked at one particular instance of the error - generally the rate at which they are seen is variable, the day I was looking at there were ~100.

 

I'll check one or two more instances of the error to see if there is any other failure paths - but I believe this is what is happening:

 

[Example transfer was a CMS transfer, from gsiftp:// lxfsra2008.cern.ch//castor/cern.ch/cms/store/PhEDEx_LoadTest07_2/

LoadTest07_CERN_74 to in2p3 (dCache), started at about 08:36:04 on 27 Apr, error reported at about 08:53:04 - i.e. 17 minutes later]. It is clear the error was coming from the castor gridftp side, as the dcache griddftp door does not have this timeout, at least not at 900 seconds.

 

(1) There is a bug in the gridftp code in retrieve() in ftpd.c:

 

     /* no command */

     if (cmd == NULL)

     {

         fin = fopen(name, "r"), closefunc = fclose;

         st.st_size = 0;

     }

     else

[...]

 

     if (fin == NULL)

     {

         if (errno != 0)

             perror_reply(550, name);

[...]

         return;

 

The 'fopen' above is defined as a preprocessor macro for 'rfio_fopen64'. Under error conditions the rfio open returns the error number in serrno, rather than errno. So the later perror_reply

() may not be called, and retrieve() may return without sending a reply to the ftp client. The ftp server reenters its command loop, waiting for commands on the control channel, while the client is waiting for a result from its 'RETR/ERET' command. The server eventually times-out waiting for the next command which is why we get the 900 second timeout error.

 

(2) The reason for the rfio_fopen64 failure (for the job which I

checked) is a timeout between the stager on an rm_enterjob() call and the rmmaster for the submission of the transfer job:

 

(From the dlf logging: time, hostname, subreq id, parameters)

 

27-04-2007 08:36:04.523124: c2cmssrv03.cern.ch: 

463199d4-0000-1000-98c8-9ac26901000:

SYSTEM=stager_db_process

STRING=Forcing filesystem(s): "lxfsra2008:/srv/castor/03/"

File=stager_db_service.c

Line=1301

errno=0

serrno=0

 

27-04-2007 08:38:04.533415: c2cmssrv03.cern.ch: 

463199d4-0000-1000-98c8-9ac26901000:

ERROR=rmlog

STRING=usersend2rmd: recv error : Timed out File=stager_db_service.c

Line=2008

errno=115

serrno=1004

 

The time difference is the expected 120 seconds from:

 

RM TIMEOUT 120

 

(3) The rmmaster logged this:

 

[...]

Apr 27 08:36:04 rmmaster[3139,73289]: Crmd_processuser_enterjob: ... 

rfs            : lxfsra2008:/srv/castor/03/

Apr 27 08:36:04 rmmaster[3139,73289]: 

Crmd_processuser_validate_usergroup: Entering Apr 27 08:36:04 rmmaster[3139,73289]: 

Crmd_processuser_validate_usergroup: Leaving Apr 27 08:36:04 rmmaster[3139,73289]: Crmd_processuser_enterjob: Job No 463199d4-0000-1000-98c8-9ac269010000 being submitted as PSUSP to LSF

DATE=20070427083604.533435 HOST=c2cmssrv06.cern.ch LVL=Usage FACILITY=rmmaster PID=3139 TID=73289 MSG="Job is being submitted as PSUSP to LSF" REQID=46272f11-0000-1000-b928-950d6721

000 NSHOSTNAME=N/A NSFILEID=0 FUNC="Crmd_processuser_enterjob" 

JOB_ID="463199d4-0000-1000-98c8-9ac269010000"

Apr 27 08:36:04 rmmaster[3139,73289]: Crmd_process_user_lsf_request: 

Entering

Apr 27 08:36:04 rmmaster[31705]: Crmd_process_user_lsf_doit: Entering Apr 27 08:36:04 rmmaster[31705]: Crmd_process_user_lsf_doit: Asking for 1 hosts:

Apr 27 08:36:04 rmmaster[31705]: Crmd_process_user_lsf_doit: Host No

0: lxfsra2008

Apr 27 08:36:04 rmmaster[31705]: Crmd_process_user_lsf_doit: 

Processing job enter

[... next log lines from process 31705 ...] Apr 27 08:39:12 rmmaster[31705]: Crmd_process_user_lsf_report_ok: 

Entering

Apr 27 08:39:12 rmmaster[31705]: Crmd_process_user_lsf_report_ok: 

Leaving

[...]

 

I can't see anything in the code that should take time apart from the

lsb_submit() - so I think LSF just took an extended time, just over 3 mins, to submit the job. At that point I haven't gone further - e.g. 

to see why LSF may have taken longer than usual. Do you think it would be useful to try to do that, or perhaps it is will known that the response time is rather variable?

 

Yours,

David


Last edit: DavidSmith on 2007-10-22 - 15:28

Number of topics: 1

Maintainer: GavinMcCance, JanVanEldik

Edit | Attach | Watch | Print version | History: r5 | r4 < r3 < r2 < r1 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r1 - 2007-06-27 - GavinMcCance
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    LCG All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback