Synth-DIY Yahoo! Groups Archives

Thread

ext4fs reliability bug

2009-03-20 by manu@netbsd.org

Hello

Following up this:
http://www.h-online.com/open/Ext4-data-loss-explanations-and-workarounds
--/news/112892

It seems ext4fs people do not assume that closing a file will make the
data flushed to disk. It is claimed fsync(2) should be called first. 

The standards say "Any unwritten buffered data for the stream shall be
written to the file" on a fclose(3) call:
http://www.opengroup.org/onlinepubs/000095399/functions/fclose.html

I may be missing something, but I have the feeling this behavior is a
standard violation. However, milter-greylist does not run on a standard,
it runs on actual implementation, so we may need a fix for it. 

Request for comments: do we need to modify our Fclose() macro so that it
calls fsync() before fclose()  on Linux? Is it useful? Can it harm?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

Re: [milter-greylist] ext4fs reliability bug

2009-03-20 by Vladimir Vassiliev

As I understand it's not violation of standards.

man fclose says
Note  that fclose() only flushes the user space buffers provided by the C library.  To ensure that the
       data is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or
       fsync(2).

Problems arised with sequence 

open(newfile), 
write(newfile), 
close(newfile), 
rename(oldfile, newfile) 

which may leave us with empty file in ext4 when system crash occured. 

It will be useful to read blog of ext4's author.
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/


-- 
Vladimir Vassiliev <vova@...>

Re: [milter-greylist] ext4fs reliability bug

2009-03-20 by Vladimir Vassiliev

The last should be
rename(newfile, oldfile)




-- 
Vladimir Vassiliev <vova@...>

Re: [milter-greylist] ext4fs reliability bug

2009-03-20 by kogan

Hi!

About fclose/fsync behavior:

fclose() guaranees that internal stdio lib buffers are "flushed" to disk 
using write() and close() syscalls. Years ago that was a guarantee that 
data are in fact made it's way to persistent storage (in the early days 
of Unix read/write syscalls were translated to actual read/write 
operations of a storage device).

In our days, we have at least two more buffering layers between 
write() syscall and actual device:
- VFS/filesystem/io-scheduler kernel buffers
- Internal storage device buffers

So, fclose() only pushes down user-level buffers into kernel but does 
not guarantee that changes are really written to disk. Attempting to 
guarantee than makes necessary not only write any dirty kernel 
buffers but also to issue bus/drive-specific command to flush drvice 
internal buffer. And belive me, you don't want to do it on every file 
close(). Sove devices tend to execute flush command far longer then 
others, and some flush not only write-back but also read-ahead caches 
which impacts subsequent read() operations.

That is why we need explicit fsync()/fdatasync() to ensure actual disk 
writes and internal buffer flushes.

About fclose() macro:

I belive there is no need to fsync() data before each close(). Most of 
files we write are of no value to us after a system restart. The only 
file that matters - is a database dump. I think that the only one
fsync() we need is before closing a database dump file.

Show quoted textHide quoted text

On Fri, Mar 20, 2009 at 07:21:28AM +0100, manu@... wrote:
> Hello
> 
> Following up this:
> http://www.h-online.com/open/Ext4-data-loss-explanations-and-workarounds
> --/news/112892
> 
> It seems ext4fs people do not assume that closing a file will make the
> data flushed to disk. It is claimed fsync(2) should be called first. 
> 
> The standards say "Any unwritten buffered data for the stream shall be
> written to the file" on a fclose(3) call:
> http://www.opengroup.org/onlinepubs/000095399/functions/fclose.html
> 
> I may be missing something, but I have the feeling this behavior is a
> standard violation. However, milter-greylist does not run on a standard,
> it runs on actual implementation, so we may need a fix for it. 
> 
> Request for comments: do we need to modify our Fclose() macro so that it
> calls fsync() before fclose()  on Linux? Is it useful? Can it harm?
> 
> -- 
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> manu@...

Re: [milter-greylist] ext4fs reliability bug

2009-03-20 by Oliver Fromme

manu@... wrote:
 > Following up this:
 > http://www.h-online.com/open/Ext4-data-loss-explanations-and-workarounds
 > --/news/112892
 > 
 > It seems ext4fs people do not assume that closing a file will make the
 > data flushed to disk. It is claimed fsync(2) should be called first. 

That's not new.  FreeBSD's "soft updates" feature behaves
similarly:  Files created within about 30 seconds before a
crash (e.g. power failure) can be lost, unless you forcibly
sync.  That's because the meta data updates are re-ordered
and delayed.  This is a feature, not a bug.

 > The standards say "Any unwritten buffered data for the stream shall be
 > written to the file" on a fclose(3) call:
 > http://www.opengroup.org/onlinepubs/000095399/functions/fclose.html

Careful:  Buffering occurs on many different layers.  The
stdio library has buffers, the file system has buffers,
the operating system usually has a VM buffer cache, and
finally the disk controllers and disk drives have caches.

The "unbuffered data" mentioned in the fclose(3) function
refers to stdio buffers only.  It does *not* guarantee that
any data hits the physical disks.  In theory, stdio is
agnostic to the actual file system.

So, the behaviour of ext4fs (and many other file systems
that do the same) is not a standard violation, as far as I
can tell.  I haven't looked at the actual code, though;
there might be other bugs in ext4fs. :-)

 > Request for comments: do we need to modify our Fclose() macro so that it
 > calls fsync() before fclose()  on Linux? Is it useful? Can it harm?

The correct sequence would be to use fflush(3) first, then
fsync(2), then fclose(3).  If you don't use stdio functions,
then fsync(2) before close(2) is sufficient.

It doesn't harm, except that it might have a small impact
on performance, because it prevents the operating system
from optimizing the time and order of syncing dirty FS
buffers to the physical disks.  But I think this effect
is negligible in the case of milter-greylist.

Of course, if you want to be safe, you must check the return
value from fsync(2) (as all other I/O functions).

Note:  If you create a temporary file and move it over
the previous file using rename(2), then you must also use
fsync(2) after the rename, in order to make sure that the
directory meta data was updated on the disk.  Otherwise
you might still have the old file after a crash.

Of course, the question is whether milter-greylist really
has to do all of that.  It is not necessarily critical if it
comes up with tuple list that is not completely up-to-date
after a crash.  How often do your mail servers crash anyway?
Some people might have servers that are so busy that the
extra performance impact from a forced sync might be too
much.

Maybe it would be best to make it an option, so everybody
is happy.  Note that most MTAs have such options, too, for
example sendmail calls it "SuperSafe" (by default it's on).

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Gesch\ufffdftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht M\ufffdn-
chen, HRB 125758,  Gesch\ufffdftsf\ufffdhrer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

In my experience the term "transparent proxy" is an oxymoron (like jumbo
shrimp).  "Transparent" proxies seem to vary from the distortions of a
funhouse mirror to barely translucent.  I really, really dislike them
when trying to figure out the corrective lenses needed with each of them.
        -- R. Kevin Oberman, Network Engineer

Re: [milter-greylist] ext4fs reliability bug

2009-03-20 by Emmanuel Dreyfus

On Fri, Mar 20, 2009 at 01:24:46PM +0600, kogan wrote:
> I belive there is no need to fsync() data before each close(). Most of 
> files we write are of no value to us after a system restart. The only 
> file that matters - is a database dump. I think that the only one
> fsync() we need is before closing a database dump file.

This is just the operation I was worrying about.

-- 
Emmanuel Dreyfus
manu@...

Re: [milter-greylist] ext4fs reliability bug

2009-03-20 by Greg Troxel

kogan <kogan@...> writes:

> fclose() guaranees that internal stdio lib buffers are "flushed" to disk
> using write() and close() syscalls. Years ago that was a guarantee that
> data are in fact made it's way to persistent storage (in the early days
> of Unix read/write syscalls were translated to actual read/write
> operations of a storage device).

I think it was only supposed to be a guarantee that the metadata had hit
the disk, but it may be that in practice the writes were scheduled on
the data and the RK05 driver had no reordering.

> So, fclose() only pushes down user-level buffers into kernel but does
> not guarantee that changes are really written to disk. Attempting to
> guarantee than makes necessary not only write any dirty kernel
> buffers but also to issue bus/drive-specific command to flush drvice
> internal buffer. And belive me, you don't want to do it on every file
> close(). Sove devices tend to execute flush command far longer then
> others, and some flush not only write-back but also read-ahead caches
> which impacts subsequent read() operations.

Agreed, but fsync really needs to ensure that the data is on disk, not
just in a disk cache, and there's FUA and tagged queuing and this seems
to be quite hard in practice.

> That is why we need explicit fsync()/fdatasync() to ensure actual disk
> writes and internal buffer flushes.
>
> About fclose() macro:
>
> I belive there is no need to fsync() data before each close(). Most of
> files we write are of no value to us after a system restart. The only
> file that matters - is a database dump. I think that the only one
> fsync() we need is before closing a database dump file.

I agree that only files that need transaction properties should be
fsynced. But that may be most of them.

I also do not understand how fclose(3) can be used with fsync. It seems it really needs an FSYNC flag and doesn't have it. So we need to do

fflush()
fsync()
fclose()

in these cases.

One can argue that the real trouble in ext4 is not that data is pending,
but that it's wrong to commit a rename to the journal before all data
associated with the new file has been committed. I don't know if soft
updates gets this right, or if NetBSD's WAPBL gets it right either.

So in the real world, we should fsync.

It's not just about avoiding data that's a bit stale - that's what
happens when the fsync doesn't happen. It's about replacing the data
with a zero-length file, which is much worse.

Re: [milter-greylist] ext4fs reliability bug

2009-03-20 by manu@netbsd.org

Greg Troxel <gdt@...> wrote:

> fflush()
> fsync()
> fclose()

So we would do that?

Index: dump.c
===================================================================
RCS file: /milter-greylist/milter-greylist/dump.c,v
retrieving revision 1.35
diff -U2 -r1.35 dump.c
--- dump.c      29 Dec 2007 19:06:49 -0000      1.35
+++ dump.c      20 Mar 2009 20:30:02 -0000
@@ -288,5 +288,13 @@
            "whitelisted\n#\n", done, greylisted_count,
whitelisted_count);
 
-       Fclose(dump);
+       /*
+        * Ensure that the data is really flushed to disk.
+        * Some systems might delay the data write from kernel buffer
+        * to disk even after the file is closed
+        */
+       (void)fflush(dump);
+       (void)fsync(fileno(dump));
+       (void)Fclose(dump);
+
        if (s_buffer)
                free(s_buffer);


-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

Re: [milter-greylist] ext4fs reliability bug

2009-03-21 by Greg Troxel

manu@... writes:

> Greg Troxel <gdt@...> wrote:
>
>> fflush()
>> fsync()
>> fclose()
>
> So we would do that?
>
> Index: dump.c
> ===================================================================
> RCS file: /milter-greylist/milter-greylist/dump.c,v
> retrieving revision 1.35
> diff -U2 -r1.35 dump.c
> --- dump.c 29 Dec 2007 19:06:49 -0000 1.35
> +++ dump.c 20 Mar 2009 20:30:02 -0000
> @@ -288,5 +288,13 @@
> "whitelisted\n#\n", done, greylisted_count,
> whitelisted_count);
>
> - Fclose(dump);
> + /*
> + * Ensure that the data is really flushed to disk.
> + * Some systems might delay the data write from kernel buffer
> + * to disk even after the file is closed
> + */
> + (void)fflush(dump);
> + (void)fsync(fileno(dump));
> + (void)Fclose(dump);
> +
> if (s_buffer)
> free(s_buffer);

Yes, looks good. This isn't frequent, so it shouldn't hurt enough to
motivate figuring out how to not do it sometimes.

Milter-greylist

ext4fs reliability bug

ext4fs reliability bug

Re: [milter-greylist] ext4fs reliability bug

Re: [milter-greylist] ext4fs reliability bug

Re: [milter-greylist] ext4fs reliability bug

Re: [milter-greylist] ext4fs reliability bug

Re: [milter-greylist] ext4fs reliability bug

Re: [milter-greylist] ext4fs reliability bug

Re: [milter-greylist] ext4fs reliability bug

Re: [milter-greylist] ext4fs reliability bug

Move to quarantaine