TWiki
>
LCG Web
>
FtsWlcg
>
TransferOperations
>
TransferOperationsGridFTPErrors
(2007-10-22,
DavidSmith
)
(raw view)
E
dit
A
ttach
P
DF
---+ !GridFTP errors knowledge base This page describes the !GridFTP errors seen on production tier-0 export transfers and details the causes of these errors. %TOC{depth="2"}% ---++ =data_connect_failed() failed: authentication failed= ---++++ Description !GridFTP tries to read the user's =~/.globus= directory, checking for certificates. If the user's home directory does not exist, there is no problem (ie. dteam001, cms002 on the diskservers). If the user's home directory _does_ exist, the =~/.globus= directory needs to be "listable". Otherwise, gridftp will complain like this: <verbatim> error: the server sent an error response: 425 425 Can't open data connection. data_connect_failed() failed: authentication failed: GSS Major Status: General failure GSS Minor Status Error Chain: accept_sec_context.c:115: gss_accept_sec_context: Error with gss context globus_i_gsi_gss_utils.c:327: globus_i_gsi_gss_create_and_fill_context: Error with GSI credential globus_gsi_system_config.c:3493: globus_gsi_sysconfig_get_cert_dir_unix: Could not find a valid trusted CA certificates directory globus_gsi_system_config.c:313: globus_i_gsi_sysconfig_create_cert_dir_string: Could not find a valid trusted CA certificates directory: /afs/cern.ch/user/j/jveldik/.globus/certificates has invalid permissions. </verbatim> ---++++ Scope and Impact This problem affects users of non-LHC VO's (compass!), who are likely to be mapped to AFS accounts. ---++++ Solution Check ACLs with =fs la ~/.globus=. If necessary, run =fs sa -dir ~/.globus -acl system:anyuser l= ---++ =451 Local resource failure: malloc: Cannot allocate memory= The error message =451 Local resource failure: malloc: Cannot allocate memory= in /var/log/gridftp.log has a non-obvious background. It _could_ be caused by problems on the other end of the transfer... > > Does someone know how to interpret the gridftp error message > '451 Local resource failure: malloc: Cannot allocate memory'? David Smith explains: <verbatim> It really is a malloc() failure - but the reason (at least for all 6 malloc errors in today's log so far, on that disk server) is a logic error during the premature termination of the loop the gridftp server enters while it is sending out the data: There is a timeout of 20 minutes inactivity on the data channel once it is connected: if that is exceeded the transfer should abort. However the way g_send_data() (from globus_hack.c) works at the moment is that after the timer expires the rest of whatever file is being sent is read into memory (without waiting to send it anywhere). The 3 different files involved with the 6 instances of the malloc failure I looked at were all with ~3.6Gb files. This may not actually quite exhaust the memory on the machine, but probably caused the heap for the single ftp process to reach its maximum size, just because of the limitation of addressable memory. The 6 problem transfers were all to various disk servers at in2p3. Checking one of the FTS logs (which include the performance markers) there was apparently no data transfered at all - so perhaps the IN2P3 side was waiting for some condition before being able to accept any bytes. (Free space, free mover?) </verbatim> And indeed: note the 20 minute gap in the log entries below! <verbatim> May 4 07:34:28 lxfsrc6606 gridftpd[17904]: connection from fts109.cern.ch [128.142.160.17] May 4 07:34:29 lxfsrc6606 gridftpd[17904]: <--- 220 lxfsrc6606.cern.ch CASTOR GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2(cern-2) (gcc32dbg, 1069715860-42) ready. May 4 07:34:29 lxfsrc6606 gridftpd[17904]: <--- 334 Using authentication type GSSAPI; ADAT must follow May 4 07:34:29 lxfsrc6606 gridftpd[17904]: <--- ADAT (13749 bytes) May 4 07:34:30 lxfsrc6606 gridftpd[17904]: <--- ADAT (109 bytes) May 4 07:34:30 lxfsrc6606 gridftpd[17904]: <--- ADAT (413 bytes) May 4 07:34:30 lxfsrc6606 gridftpd[17904]: client identity is: "/C=CH/O=CERN/OU=GRID/CN=Miguel De Oliveira Branco 2423" May 4 07:34:31 lxfsrc6606 gridftpd[17904]: <--- 235 GSSAPI Authentication succeeded May 4 07:34:31 lxfsrc6606 gridftpd[17904]: GSSAPI user "/C=CH/O=CERN/OU=GRID/CN=Miguel De Oliveira Branco 2423" is authorized as atlas002 May 4 07:34:31 lxfsrc6606 gridftpd[17904]: <--- 331 GSSAPI user "/C=CH/O=CERN/OU=GRID/CN=Miguel De Oliveira Branco 2423" is authorized as atlas002 May 4 07:34:32 lxfsrc6606 gridftpd[17904]: USER :globus-mapping: May 4 07:34:32 lxfsrc6606 gridftpd[17904]: PASS password May 4 07:34:32 lxfsrc6606 gridftpd[17904]: <--- 230-No directory! Logging in with home=/ May 4 07:34:33 lxfsrc6606 gridftpd[17904]: stager-mapping: STAGE_HOST: <c2atlast0rh>; STAGE_POOL: <t0perm>; STAGE_SVCCLASS: <t0perm>; RFIO_USE_CASTOR_V2: <YES> May 4 07:34:33 lxfsrc6606 gridftpd[17904]: <--- 230 User atlas002 logged in. May 4 07:34:33 lxfsrc6606 gridftpd[17904]: FTP LOGIN FROM fts109.cern.ch [128.142.160.17], atlas002 May 4 07:34:33 lxfsrc6606 gridftpd[17904]: failed to stat umask-mapfile "/etc/castor/gridftpumaskmap.conf": No such file or directory. May 4 07:34:33 lxfsrc6606 gridftpd[17904]: FEAT May 4 07:34:34 lxfsrc6606 gridftpd[17904]: <--- 211-Extensions supported: May 4 07:34:34 lxfsrc6606 gridftpd[17904]: <--- REST STREAM May 4 07:34:35 lxfsrc6606 gridftpd[17904]: <--- ESTO May 4 07:34:35 lxfsrc6606 gridftpd[17904]: <--- ERET May 4 07:34:36 lxfsrc6606 gridftpd[17904]: <--- MDTM May 4 07:34:37 lxfsrc6606 gridftpd[17904]: <--- SIZE May 4 07:34:38 lxfsrc6606 gridftpd[17904]: <--- PARALLEL May 4 07:34:38 lxfsrc6606 gridftpd[17904]: <--- DCAU May 4 07:34:39 lxfsrc6606 gridftpd[17904]: <--- 211 END May 4 07:34:40 lxfsrc6606 gridftpd[17904]: TYPE Image May 4 07:34:41 lxfsrc6606 gridftpd[17904]: <--- 200 Type set to I. May 4 07:34:41 lxfsrc6606 gridftpd[17904]: MODE 0 May 4 07:34:42 lxfsrc6606 gridftpd[17904]: <--- 200 MODE E ok. May 4 07:34:42 lxfsrc6606 gridftpd[17904]: SIZE //castor/cern.ch/grid/atlas/t0/perm/T0.D.run009688.RAW/T0.D.run009688.RAW._lumi0014._0001._sfo01 May 4 07:34:42 lxfsrc6606 gridftpd[17904]: <--- 213 3840000000 May 4 07:34:43 lxfsrc6606 gridftpd[17904]: <--- 200 Opts successful. May 4 07:34:43 lxfsrc6606 gridftpd[17904]: DCAU May 4 07:34:44 lxfsrc6606 gridftpd[17904]: <--- 200 DCAU N May 4 07:34:44 lxfsrc6606 gridftpd[17904]: PORT May 4 07:34:45 lxfsrc6606 gridftpd[17904]: ok: PORT 193.48.99.86,30018 from fts109.cern.ch [128.142.160.17] May 4 07:34:46 lxfsrc6606 gridftpd[17904]: <--- 200 PORT command successful. May 4 07:34:46 lxfsrc6606 gridftpd[17904]: RETR //castor/cern.ch/grid/atlas/t0/perm/T0.D.run009688.RAW/T0.D.run009688.RAW._lumi0014._0001._sfo01 May 4 07:35:02 lxfsrc6606 gridftpd[17904]: using rfio: file not local May 4 07:35:02 lxfsrc6606 gridftpd[17904]: <--- 150 Opening BINARY mode data connection. May 4 07:55:35 lxfsrc6606 gridftpd[17904]: <--- 451 Local resource failure: malloc: Cannot allocate memory. May 4 07:55:35 lxfsrc6606 gridftpd[17904]: Local resource failure: malloc May 4 07:55:37 lxfsrc6606 gridftpd[17904]: <--- 221 You could at least say goodbye. May 4 07:55:43 lxfsrc6606 gridftpd[17904]: FTP session closed </verbatim> ---++ =425 Can't open data connection. .= Error on side sending data: Means the attempt to establish the data connection(s) with the peer timed out, by default after 2 minutes. Normally the sending side initiates a TCP connection towards the receiving side. A TCP connect may fail with an error after about 2 minutes, so it is possible that a failure rather than a timeout may sometimes be reported. ---++ =425 Can't open data connection. timed out() failed.= Error on side receiving data: Means the attempt to establish the data connection(s) with the peer timed out, by default after 2 minutes. Normally the receiving side listens on a variable numbered port, awaiting data connection(s) from the sending side. In a third party copy the receiving side is contacted before the sending side, so this timeout may indicate the client is slow to contact the source site or that the source site is slow to initiate the connection to the destination, or that there is a connectivity or firewall problem. ---++ =425 Can't open data connection. data_connect_failed() failed: (message).= Error on side sending OR receiving data: Means the attempt to establish the data connection(s) with the peer had an error. See the example above for the particular case of the message "authentication failed". ---++ =426 Data connection. data_write() failed: Handle not in the proper state= Is rather generic: means that the side sending out data encountered an error while sending the data to the ftp subsystem. However often this simply indictaes that the TCP data connection(s) closed - probably the peer closed them although network problems have also been known to cause the connection to reset. The message '150 Opening BINARY mode data connection' is usually logged just before the error and indicates the point at which the data connection(s) were established. Normally the side sending the data will initiate the connection to the receving side, this arrangement is required when using extended mode (e.g. when more than one parallel stream). In the example below the connection was indeed made from CERN out to the peer, as can be seen from the earlier 'PORT' command. So in the example below the peer (131.169.80.47) accepted data connections on port 20002 but probably the connections were broken after 15 seconds. The short duration of the connection could indicate the peer deliberatly closed the connection rather than a generic connectivity issue. David Example: <verbatim> May 16 04:45:08 lxfsra3001 gridftpd[5523]: connection from fts108.cern.ch [128.142.160.81] May 16 04:45:08 lxfsra3001 gridftpd[5523]: <--- 220 lxfsra3001.cern.ch CASTOR GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2(cern-2) (gcc32dbg, 1069715860-42) ready. May 16 04:45:08 lxfsra3001 gridftpd[5523]: <--- 334 Using authentication type GSSAPI; ADAT must follow May 16 04:45:08 lxfsra3001 gridftpd[5523]: <--- ADAT (13749 bytes) May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- ADAT (109 bytes) May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- ADAT (413 bytes) May 16 04:45:09 lxfsra3001 gridftpd[5523]: client identity is: "/C=CH/O=CERN/OU=GRID/CN=Maarten Litmaath 1689" May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 235 GSSAPI Authentication succeeded May 16 04:45:09 lxfsra3001 gridftpd[5523]: GSSAPI user "/C=CH/O=CERN/OU=GRID/CN=Maarten Litmaath 1689" is authorized as dteam001 May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 331 GSSAPI user "/C=CH/O=CERN/OU=GRID/CN=Maarten Litmaath 1689" is authorized as dteam001 May 16 04:45:09 lxfsra3001 gridftpd[5523]: USER :globus-mapping: May 16 04:45:09 lxfsra3001 gridftpd[5523]: PASS password May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 230-No directory! Logging in with home=/ May 16 04:45:09 lxfsra3001 gridftpd[5523]: stager-mapping: STAGE_HOST: <castorpublic>; STAGE_POOL: <dteam>; STAGE_SVCCLASS: <dteam>; RFIO_USE_CASTOR_V2: <YES> May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 230 User dteam001 logged in. May 16 04:45:09 lxfsra3001 gridftpd[5523]: FTP LOGIN FROM fts108.cern.ch [128.142.160.81], dteam001 May 16 04:45:09 lxfsra3001 gridftpd[5523]: failed to stat umask-mapfile "/etc/castor/gridftpumaskmap.conf": No such file or directory. May 16 04:45:09 lxfsra3001 gridftpd[5523]: FEAT May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 211-Extensions supported: May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- REST STREAM May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- ESTO May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- ERET May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- MDTM May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- SIZE May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- PARALLEL May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- DCAU May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 211 END May 16 04:45:09 lxfsra3001 gridftpd[5523]: TYPE Image May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 200 Type set to I. May 16 04:45:09 lxfsra3001 gridftpd[5523]: MODE 0 May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 200 MODE E ok. May 16 04:45:09 lxfsra3001 gridftpd[5523]: SIZE //castor/cern.ch/grid/dteam/sc4/044/file-051.dat May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 213 1048576000 May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 200 Opts successful. May 16 04:45:09 lxfsra3001 gridftpd[5523]: DCAU May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 200 DCAU N May 16 04:45:09 lxfsra3001 gridftpd[5523]: PORT May 16 04:45:09 lxfsra3001 gridftpd[5523]: ok: PORT 131.169.80.47,20002 from fts108.cern.ch [128.142.160.81] May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 200 PORT command successful. May 16 04:45:09 lxfsra3001 gridftpd[5523]: RETR //castor/cern.ch/grid/dteam/sc4/044/file-051.dat May 16 04:45:09 lxfsra3001 gridftpd[5523]: using rfio: file not local May 16 04:45:09 lxfsra3001 gridftpd[5523]: <--- 150 Opening BINARY mode data connection. May 16 04:45:24 lxfsra3001 gridftpd[5523]: <--- 426 Data connection. data_write() failed: Handle not in the proper state May 16 04:45:24 lxfsra3001 gridftpd[5523]: Data connection. data_write() failed: Handle not in the proper state May 16 04:45:24 lxfsra3001 gridftpd[5523]: <--- 221 You could at least say goodbye. May 16 04:45:24 lxfsra3001 gridftpd[5523]: lost connection to fts108.cern.ch [128.142.160.81] May 16 04:45:24 lxfsra3001 gridftpd[5523]: FTP session closed </verbatim> ---++ 451 Timeout (900 seconds): closing control connection. <verbatim> Hi, Since the last 'export meeting' I'm not sure if anyone had been looking at the: ERROR the server sent an error response: 421 421 Timeout (900 seconds): closing control connection. errors (on the castor side), but since that meeting FTS had a GGUS ticket opened asking about it - so in the last few days I decided to check into so we could at least answer the ticket. I looked at one particular instance of the error - generally the rate at which they are seen is variable, the day I was looking at there were ~100. I'll check one or two more instances of the error to see if there is any other failure paths - but I believe this is what is happening: [Example transfer was a CMS transfer, from gsiftp:// lxfsra2008.cern.ch//castor/cern.ch/cms/store/PhEDEx_LoadTest07_2/ LoadTest07_CERN_74 to in2p3 (dCache), started at about 08:36:04 on 27 Apr, error reported at about 08:53:04 - i.e. 17 minutes later]. It is clear the error was coming from the castor gridftp side, as the dcache griddftp door does not have this timeout, at least not at 900 seconds. (1) There is a bug in the gridftp code in retrieve() in ftpd.c: /* no command */ if (cmd == NULL) { fin = fopen(name, "r"), closefunc = fclose; st.st_size = 0; } else [...] if (fin == NULL) { if (errno != 0) perror_reply(550, name); [...] return; The 'fopen' above is defined as a preprocessor macro for 'rfio_fopen64'. Under error conditions the rfio open returns the error number in serrno, rather than errno. So the later perror_reply() may not be called, and retrieve() may return without sending a reply to the ftp client. The ftp server reenters its command loop, waiting for commands on the control channel, while the client is waiting for a result from its 'RETR/ERET' command. The server eventually times-out waiting for the next command which is why we get the 900 second timeout error. (2) The reason for the rfio_fopen64 failure (for the job which I checked) is a timeout between the stager on an rm_enterjob() call and the rmmaster for the submission of the transfer job: (From the dlf logging: time, hostname, subreq id, parameters) 27-04-2007 08:36:04.523124: c2cmssrv03.cern.ch: 463199d4-0000-1000-98c8-9ac26901000: SYSTEM=stager_db_process STRING=Forcing filesystem(s): "lxfsra2008:/srv/castor/03/" File=stager_db_service.c Line=1301 errno=0 serrno=0 27-04-2007 08:38:04.533415: c2cmssrv03.cern.ch: 463199d4-0000-1000-98c8-9ac26901000: ERROR=rmlog STRING=usersend2rmd: recv error : Timed out File=stager_db_service.c Line=2008 errno=115 serrno=1004 The time difference is the expected 120 seconds from: RM TIMEOUT 120 (3) The rmmaster logged this: [...] Apr 27 08:36:04 rmmaster[3139,73289]: Crmd_processuser_enterjob: ... rfs : lxfsra2008:/srv/castor/03/ Apr 27 08:36:04 rmmaster[3139,73289]: Crmd_processuser_validate_usergroup: Entering Apr 27 08:36:04 rmmaster[3139,73289]: Crmd_processuser_validate_usergroup: Leaving Apr 27 08:36:04 rmmaster[3139,73289]: Crmd_processuser_enterjob: Job No 463199d4-0000-1000-98c8-9ac269010000 being submitted as PSUSP to LSF DATE=20070427083604.533435 HOST=c2cmssrv06.cern.ch LVL=Usage FACILITY=rmmaster PID=3139 TID=73289 MSG="Job is being submitted as PSUSP to LSF" REQID=46272f11-0000-1000-b928-950d6721000 NSHOSTNAME=N/A NSFILEID=0 FUNC="Crmd_processuser_enterjob" JOB_ID="463199d4-0000-1000-98c8-9ac269010000" Apr 27 08:36:04 rmmaster[3139,73289]: Crmd_process_user_lsf_request: Entering Apr 27 08:36:04 rmmaster[31705]: Crmd_process_user_lsf_doit: Entering Apr 27 08:36:04 rmmaster[31705]: Crmd_process_user_lsf_doit: Asking for 1 hosts: Apr 27 08:36:04 rmmaster[31705]: Crmd_process_user_lsf_doit: Host No 0: lxfsra2008 Apr 27 08:36:04 rmmaster[31705]: Crmd_process_user_lsf_doit: Processing job enter [... next log lines from process 31705 ...] Apr 27 08:39:12 rmmaster[31705]: Crmd_process_user_lsf_report_ok: Entering Apr 27 08:39:12 rmmaster[31705]: Crmd_process_user_lsf_report_ok: Leaving [...] I can't see anything in the code that should take time apart from the lsb_submit() - so I think LSF just took an extended time, just over 3 mins, to submit the job. At that point I haven't gone further - e.g. to see why LSF may have taken longer than usual. Do you think it would be useful to try to do that, or perhaps it is will known that the response time is rather variable? Yours, David </verbatim> ----- Last edit: %SEARCH{".*" nosearch="on" regex="on" scope="title" nototal="no" topic="TransferOperationsGridFTPErrors" format="$wikiusername on $date"}% Maintainer: Main.GavinMcCance, Main.JanVanEldik
E
dit
|
A
ttach
|
Watch
|
P
rint version
|
H
istory
: r5
<
r4
<
r3
<
r2
<
r1
|
B
acklinks
|
V
iew topic
|
WYSIWYG
|
M
ore topic actions
Topic revision: r5 - 2007-10-22
-
DavidSmith
Log In
LCG
LCG Wiki Home
LCG Web Home
Changes
Index
Search
LCG Wikis
LCG Service Coordination
LCG Grid Deployment
LCG Applications Area
Public webs
Public webs
ABATBEA
ACPP
ADCgroup
AEGIS
AfricaMap
AgileInfrastructure
ALICE
AliceEbyE
AliceSPD
AliceSSD
AliceTOF
AliFemto
ALPHA
ArdaGrid
ASACUSA
AthenaFCalTBAna
Atlas
AtlasLBNL
AXIALPET
CAE
CALICE
CDS
CENF
CERNSearch
CLIC
Cloud
CloudServices
CMS
Controls
CTA
CvmFS
DB
DefaultWeb
DESgroup
DPHEP
DM-LHC
DSSGroup
EGEE
EgeePtf
ELFms
EMI
ETICS
FIOgroup
FlukaTeam
Frontier
Gaudi
GeneratorServices
GuidesInfo
HardwareLabs
HCC
HEPIX
ILCBDSColl
ILCTPC
IMWG
Inspire
IPv6
IT
ItCommTeam
ITCoord
ITdeptTechForum
ITDRP
ITGT
ITSDC
LAr
LCG
LCGAAWorkbook
Leade
LHCAccess
LHCAtHome
LHCb
LHCgas
LHCONE
LHCOPN
LinuxSupport
Main
Medipix
Messaging
MPGD
NA49
NA61
NA62
NTOF
Openlab
PDBService
Persistency
PESgroup
Plugins
PSAccess
PSBUpgrade
R2Eproject
RCTF
RD42
RFCond12
RFLowLevel
ROXIE
Sandbox
SocialActivities
SPI
SRMDev
SSM
Student
SuperComputing
Support
SwfCatalogue
TMVA
TOTEM
TWiki
UNOSAT
Virtualization
VOBox
WITCH
XTCA
Welcome Guest
Login
or
Register
Cern Search
TWiki Search
Google Search
LCG
All webs
Copyright &© 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback