One of our database systems runs an RMAN backup to a CIFS share on a NetApp NAS. The OS is Oracle Linux 6.10 and the database is Oracle 12.1. The nightly backup had been running fine until two days ago, when the backup logs reported that the backup location did not exist. When I checked with "df -h", the CIFS share was gone, and even a simple command like "ls -l /backup" returned "Input/output error" along with garbage characters.
root@JOEDBS02:~# df -h /backup
df: `/backup': Input/output error
df: no file systems processed
root@JOEDBS02:~# ls -lrt /backup
ls: cannot access /backup: Input/output error
root@JOEDBS02:/# ls -lrt
ls: cannot access backup: Input/output error
total 164
d?????????? ? ? ? ?        ?            ? backup
My first impression was filesystem corruption, but I was able to mount the CIFS share from another Linux machine. Since this database is just an Oracle Data Guard standby system, I decided to reboot the server to see how it would go. Then the bad thing happened: after I issued the reboot command, it didn't come back. I didn't have console access and had to ask the system administrator to take a look. He told me it was stuck in the shutdown process. Anyway, he reset the machine and it came back fine: the backup location looked good, and I could access it with read/write.
We thought that was it, maybe a system glitch somewhere. But the story continued: the next day the same issue happened. So I asked the system administrator, who also manages the NAS, whether anything had changed, and he said there had been a NetApp upgrade a few days earlier. Coincidence? That sounded suspicious.
I spent more time looking around. When I tried to mount the share I got the error "can't read superblock", although once again I could mount it on another machine. Then I noticed the share was actually still mounted even though it didn't show up in the output of "df -h": the output of "mount" still listed it.
Interesting. I unmounted and remounted it, which succeeded, and the share looked perfect. But the next day it was broken again, so I needed to find out when it stopped working. I remounted it and checked it occasionally during the day. After an hour or two the problem appeared, which looked like an idle timeout, since during the day there is normally no activity on this CIFS share.
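The check I kept running boils down to a small probe: a stale CIFS session leaves the share listed by "mount" while actual access returns an I/O error, so stat the mount point and remount only when that probe fails. A minimal sketch (the /backup mount point matches this server; your path may differ):

```shell
#!/bin/sh
# Probe a CIFS mount that "mount" still lists but access may have broken.
MNT=${MNT:-/backup}

if mount | grep -q " $MNT "; then
    if stat "$MNT" >/dev/null 2>&1; then
        echo "$MNT: session alive"
    else
        # This is the broken state: listed by mount, but stat gets EIO.
        echo "$MNT: listed by mount but not accessible, remounting"
        umount -f "$MNT" && mount "$MNT"
    fi
else
    echo "$MNT: not mounted"
fi
```

Running this from cron during the day is how I narrowed the failure down to roughly one to two hours of idle time.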
The system log showed:
CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-13
An rc of -13 is EACCES ("permission denied"), which pointed at the server side rather than at local filesystem corruption. Searching the NetApp documentation, I found that NetApp does have a setting to control the SMB/CIFS session timeout, with a default value of 900 seconds. There is also a known bug in ONTAP 9.7P8 onwards: CIFS sessions are closed due to idle timeout for Linux clients with older kernels. After setting the timeout to a huge value with the following command on the NetApp, the issue was gone:
vserver cifs options modify -vserver SVM -client-session-timeout 4294967295
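To confirm the change took effect, ONTAP's generic `-fields` selector should display the current value (the SVM name here is a placeholder, as in the command above):

```
vserver cifs options show -vserver SVM -fields client-session-timeout
```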
With the upgrade, the NetApp started forcing client sessions to disconnect after the configured idle timeout. Newer Linux kernels recover from this transparently, which explains why the other Linux system didn't see the issue.
We will upgrade the Linux on this server soon.
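Until the kernel upgrade, a purely client-side workaround is to generate a little traffic on the share before the idle timer can fire. A hypothetical cron entry (the path and the 5-minute interval are assumptions; anything that touches the share inside the 900-second window works):

```
# /etc/cron.d/cifs-keepalive (hypothetical): list the share every 5 minutes
# so the ONTAP session idle timer (default 900 s) never expires.
*/5 * * * * root ls /backup > /dev/null 2>&1
```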
6 thoughts on “CIFS share disconnected after NetApp upgraded”
Hi Joe, we're facing a similar issue, but the error still shows up with kernel 5.15.32-051532-generic (patched onto Ubuntu 20.04 LTS). We mainly used to work with CentOS 7.9, kernel 3.10.0-1160.31.1.el7.x86_64, but with that we were facing random write errors when transferring huge files (tens of GB) to the shares. That's why we switched to Ubuntu, although we did have good experiences with CentOS 8.3; since that will no longer be maintained after the end of 2022, we gave Ubuntu a try. Can you tell me which kernel should have the fix for the timeout? An upgrade of ONTAP will take place next week, so we're looking forward to that; meanwhile the workaround is to run df in a loop every 60 seconds.
Sounds like you had the issue on CentOS 7.9 and moved to Ubuntu to avoid it. Has this ever worked well on your Ubuntu system?
Kernel 5.15.x is fairly new. We use OL 8.x with kernel 5.4.x and don't have any problems. I wonder if your write error is not a simple timeout, since it happens during a huge file transfer; mine only happened after there had been no activity for some time. Also, have you checked the ONTAP version and the timeout setting there?
BTW, this is the bug which caused my problem:
Hi Joe, thanks a lot for the prompt reply, which I appreciate a lot. Since I applied the df loop on Friday, the inaccessible-share error has not happened again on any host (approx. 10 hosts). But it did not fix the access-denied errors, so I assume there must be two "problems" between some Linux hosts and the share on the NetApp FAS. One is the inaccessible shares; the other is that in one specific workflow we get an "access denied" error for some files on a specific share, even though all files have the same ownership, permissions, and path. Since this workflow only runs on that one share, I can't yet figure out whether it's a share-specific problem or not.
I also can't name the current ONTAP version, but I heard that the department responsible for it will apply a newer version this week anyway; I'll try to find out what version this will be and was.
Do you still have the vserver timeout parameter configured? Or is it back to the default of 900?
Yes, that timeout setting is still applied; I guess it doesn't hurt to leave it there. It sounds strange that you get an "access denied" error during a large file transfer rather than at the beginning of the transfer. Increasing the timeout on the ONTAP side after it gets updated to the newer version is worth trying, at least to rule it out.
Thank you Joe. May I ask if you are sure that the newer kernel version fixed it, and not the vserver timeout parameter?
I believe so. That's why I didn't have the issue on OL8 but had it on OL6, without that timeout parameter set on ONTAP back then.