Kernel Hackers On Ext3/4 After 2.6.29 Release

timothy posted more than 5 years ago | from the good-things-come-from-certain-clashes dept.

Data Storage 316

microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints about what seems to be a long-standing performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Kleen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"

Slow performance (4, Funny)

rootnl (644552) | more than 5 years ago | (#27327795)

The server is taking too long to respond; please wait a minute or 2 and try again.

Mmmh, must be a big problem

Re:Slow performance (-1, Redundant)

betterunixthanunix (980855) | more than 5 years ago | (#27327835)

Wish I had some mod points for that...

Re:Slow performance (5, Funny)

morgan_greywolf (835522) | more than 5 years ago | (#27328081)

Well, they had to switch the lkml server to ext3 because posts kept getting killed and cut into pieces with their old filesystem and the admins just kept saying "Well, they must've gone to Russia."

Re:Slow performance (0)

Anonymous Coward | more than 5 years ago | (#27328253)

I call this idiotic! :-)

Re:Slow performance (0, Redundant)

SIR_Taco (467460) | more than 5 years ago | (#27328779)

The server is taking too long to respond; please wait a minute or 2 and try again.

Mmmh, must be a big problem

You sure that wasn't an ad for Viagra targeted specifically to the over-the-hill nerd community?

look at the top for 'skip this ad'

Re:Slow performance (1)

markov_chain (202465) | more than 5 years ago | (#27328965)

The web server tried to fsync() on the logs and keeps waiting for 2+ minutes. Good luck.

lkml.org server is slashdotted. (4, Funny)

javilon (99157) | more than 5 years ago | (#27327843)

this is what I get from http://lkml.org/lkml/2009/3/24/460 [lkml.org] :

"The server is taking too long to respond; please wait a minute or 2 and try again."

Considering that there is only one comment on this slashdot thread, that means that most people will comment without actually reading TFA.

Like me... :-)

Re:lkml.org server is slashdotted. (0)

Anonymous Coward | more than 5 years ago | (#27327917)

and me.

Re:lkml.org server is slashdotted. (5, Funny)

FernandoTorres (1499759) | more than 5 years ago | (#27328009)

Well this is just my meta comment. I'll be writing my real comment later...

Re:lkml.org server is slashdotted. (5, Insightful)

Anonymous Coward | more than 5 years ago | (#27328629)

Well this is just my meta comment. I'll be writing my real comment later...

You forgot to include a link to the comment you'll be writing later.

Re:lkml.org server is slashdotted. (4, Interesting)

thomasdz (178114) | more than 5 years ago | (#27329629)

You forgot to include a link to the comment you'll be writing later.

Maybe the power failed in the middle of him writing his comment?
Don't worry...it'll appear in some other Slashdot thread until CmdrTaco does a fsck.

Re:lkml.org server is slashdotted. (1)

digitalunity (19107) | more than 5 years ago | (#27328749)

This would be one of those posts where a score over 5 is appropriate.

Would have been funnier though if it was Linus saying it in lkml.

Re:lkml.org server is slashdotted. (0)

Anonymous Coward | more than 5 years ago | (#27328163)

and *

Re:lkml.org server is slashdotted. (1)

Tei (520358) | more than 5 years ago | (#27328017)

I doubt it. I suppose it has been preemptively put offline. Now it is not the Slashdot effect, it is the slash-leper.

Re:lkml.org server is slashdotted. (2, Funny)

hesaigo999ca (786966) | more than 5 years ago | (#27328291)

I actually read it, and the emails from Linus, really good read, his performance was as usual, quite outstanding.

Re:lkml.org server is slashdotted. (5, Insightful)

linuxrocks123 (905424) | more than 5 years ago | (#27329391)

Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking. All the filesystem can normally promise in the event of a crash is that the metadata will describe a valid filesystem somewhere between the last returned synchronization call and the state at the event of the crash. If you need more than that -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes and putting a special capacitor in the system so the OS can flush everything to disk before the computer loses power in an outage.

Re:lkml.org server is slashdotted. (5, Informative)

AigariusDebian (721386) | more than 5 years ago | (#27329577)

On-disk state must always be consistent. That was the point of journaling, so that you do not have to do a fsck to get to a consistent state. You write to a journal what you are planning to do, then you do it, then you activate it and mark it done in the journal. At any point in time, if power is lost, the filesystem is in a consistent state: either the state before the operation or the state after the operation. You might get some half-written blocks, but that is perfectly fine, because they are not referenced in the directory structure until the final activation step is written to disk, and those half-written blocks are still considered empty by the filesystem.
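A toy illustration of the journal-then-do-then-activate sequence described above. This is only a sketch in Python, not how ext3 is implemented: the class `ToyJournaledFS`, its fields, and the `crash_after` parameter are all made up for the example, and the "disk" is just a couple of dictionaries.

```python
class ToyJournaledFS:
    """Toy model only: a 'disk' made of data blocks, a directory, and a journal."""

    def __init__(self):
        self.blocks = {}      # block number -> bytes; unreferenced blocks count as empty
        self.directory = {}   # file name -> block number
        self.journal = []     # transactions that have not been fully applied
        self.next_block = 0

    def write_file(self, name, data, crash_after=None):
        """Write a file in three steps; `crash_after` simulates losing power."""
        # Step 1: record in the journal what we are planning to do.
        block = self.next_block
        self.next_block += 1
        txn = {"name": name, "block": block, "committed": False}
        self.journal.append(txn)
        if crash_after == "journal":
            return
        # Step 2: do it. The block is written but nothing references it yet.
        self.blocks[block] = data
        if crash_after == "data":
            return
        # Step 3: activate it. Mark the transaction committed, then update
        # the directory and retire the journal entry.
        txn["committed"] = True
        if crash_after == "commit":
            return
        self.directory[name] = block
        self.journal.remove(txn)

    def recover(self):
        """Replay the journal after a crash instead of running a full fsck."""
        for txn in self.journal:
            if txn["committed"]:
                # Finished apart from activation: redo the directory update.
                self.directory[txn["name"]] = txn["block"]
            else:
                # Never committed: the half-written block is treated as empty.
                self.blocks.pop(txn["block"], None)
        self.journal.clear()
```

After `recover()`, a file name points either at its old contents or at its new contents, never at a half-written block, which is the before-or-after property the comment describes.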

try gmane (0)

Anonymous Coward | more than 5 years ago | (#27327845)

http://thread.gmane.org/gmane.linux.kernel/811167/focus=811228

Good Developers Make Bad Choices? (0)

Anonymous Coward | more than 5 years ago | (#27327901)

If a developer has a difficult time justifying his choices, that could be an indication that the choice is not well thought out.

If a developer, failing to explain a choice, hunkers down and refuses to change, that could be an indication of excessive ego.

A fsck would seem to be in order,

Let me guess... (5, Funny)

Puls4r (724907) | more than 5 years ago | (#27328005)

The server is running linux.

Re:Let me guess... (5, Funny)

UnRDJ (712762) | more than 5 years ago | (#27328037)

too much karma for your tastes?

Re:Let me guess... (4, Informative)

Anonymous Coward | more than 5 years ago | (#27328117)

According to Netcraft, yes. Ubuntu. [netcraft.com]

Wait, this is Slashdot... I need a cliche... uh...

Netcraft confirms it: that server is dying?

Re:Let me guess... (0)

Anonymous Coward | more than 5 years ago | (#27328653)

The posts above that made fun of the coincidence between Linux being slow and the slow server containing the article have been modded funny. You have been modded troll, for making the exact same joke. Welcome to Slashdot.

Re:Let me guess... (0)

Anonymous Coward | more than 5 years ago | (#27329291)

Well in that case, that is very bad moderation. As you just pointed out, he should have been modded redundant and not troll. Mods, please mod the GP funny, then redundant, just to make the point. Might need to do that twice, just to make the redundant stick.

OK, then... *WHO* is the official ext3 "moron"? (5, Insightful)

Anonymous Coward | more than 5 years ago | (#27328043)

Quote from Linus:

"...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."

In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.

  How about ASKING them rather than calling them morons?

(note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)

TDz.

Re:OK, then... *WHO* is the official ext3 "moron"? (3, Informative)

morgan_greywolf (835522) | more than 5 years ago | (#27328229)

Most likely Ted Ts'o, based on the git commit logs [kernel.org] . I say most likely because someone more familiar with the kernel git repo than myself should probably confirm or deny this statement.

Re:OK, then... *WHO* is the official ext3 "moron"? (5, Informative)

644bd346996 (1012333) | more than 5 years ago | (#27328489)

ext3 was merged to the mainline kernel in 2001. Git was created in 2005. I wouldn't trust any authorship evidence in a git repo for code predating the repo.

The journalling behavior of ext3 was probably decided by Stephen Tweedie [wikipedia.org]

Re:OK, then... *WHO* is the official ext3 "moron"? (2, Informative)

morgan_greywolf (835522) | more than 5 years ago | (#27328637)

Right, but this problem doesn't go back to 2001.

Re:OK, then... *WHO* is the official ext3 "moron"? (0)

Anonymous Coward | more than 5 years ago | (#27328249)

I know little to nothing about the Linux development process, but Linus can't fix the code in his own kernel? From my non-kernel-hacker perspective it seems like a pretty simple order of operations thing, i.e.

write_meta(); copy_raw_data();
to
copy_raw_data(); write_meta();

Re:OK, then... *WHO* is the official ext3 "moron"? (2, Informative)

morgan_greywolf (835522) | more than 5 years ago | (#27328407)

I can see you've never written any filesystem drivers ;). It's not quite that simple, but more or less that's the type of change you'd make.

Re:OK, then... *WHO* is the official ext3 "moron"? (1, Funny)

Anonymous Coward | more than 5 years ago | (#27328815)

It is that simple, but the GPL explicitly forbids to change the write order of file system operations without the written consent of the author who lives currently in a hut on an isolated Pacific island.

Re:OK, then... *WHO* is the official ext3 "moron"? (5, Insightful)

Anonymous Coward | more than 5 years ago | (#27328307)

Torvalds exactly knows who it is and most people following the discussion will probably know it, too.
Also, there has been a fairly public discussion including a statement by the responsible person in question.

Not saying the name is Torvalds' attempt at saving grace. Similar to a parent of two children saying, I don't know who made the mess, but when I come back, it better be cleaned up.

Yes, Mr. Torvalds is fairly outspoken.

Re:OK, then... *WHO* is the official ext3 "moron"? (2, Interesting)

gbjbaanb (229885) | more than 5 years ago | (#27329367)

hm. Similar to a parent of two children ranting at them without taking time to think first. Calling them morons is just going to get them growing up to be dysfunctional at best. No wonder the world has a dim view of the "geek" community.

It seems to me that, as usual, the issue is not as clear cut as it first appears [slashdot.org]

Re:OK, then... *WHO* is the official ext3 "moron"? (1)

Bill, Shooter of Bul (629286) | more than 5 years ago | (#27329523)

Ahh... That link explains a lot. However, I have a different parenting strategy. If the kid does something wrong, let him know it. If he does something good let him know it too. Calling them a moron is ok, as long as its balanced out with genius every now and then. Of course, don't actually use the word, if the kid is a moron. Like Linus that should only be used to indicate a temporary lapse of judgment in an otherwise intelligent person.

Saving grace (4, Funny)

coryking (104614) | more than 5 years ago | (#27329491)

Not saying the name is Torvalds attempt at saving grace

Is the person responsible going to pull a classic political step-down where they resign "in order to spend more time with their family"?

Maybe it was Hans Reiser? Sure the guy is locked up in San Quentin, but nobody knows how to hack a filesystem to bits better than Reiser. Bada ba ching! Thank you, thank you... I'll be here all night.

Re:OK, then... *WHO* is the official ext3 "moron"? (4, Interesting)

Skuto (171945) | more than 5 years ago | (#27328415)

Well, some Linux filesystem developers (and some fanboys) have been chastising other (higher-performance) filesystems for not providing the guarantees that ext3 ordered mode provides.

Application developers hence were indirectly educated to not use fsync(), because apparently a filesystem giving anything other than the ext3 ordered mode guarantees is just unreasonable, and ext3 fsync() performance really sucks. (The reason why you don't actually *want* what fsync implies has been explained in the previous ext4 data-loss posts).

Some of those developers are now complaining that their "new" filesystem (designed to do away with the bad performance of the old one) is disliked by users who are losing data due to applications being encouraged to be written in a bad way, and telling the developers that they now should add fsync() anyway (instead of fixing the actual problem with the filesystem).

Moreover, they are complaining that the application developers are "weird" because of expecting to be able to write many files to the filesystem and not having them *needlessly* corrupted. IMAGINE THAT!

As an aside joke, the "next generation" btrfs which was supposed to solve all problems has ordered mode by default, but it's an ordered mode that will erase your data in exactly the same way as ext4 does.

Honestly, the state of filesystems in Linux is SO f***d that just blaming whoever added writeback mode is irrelevant.

Re:OK, then... *WHO* is the official ext3 "moron"? (5, Funny)

Ecuador (740021) | more than 5 years ago | (#27329069)

Yep, we urgently need some kind of killer FS for Linux...

Oh, wait...

Re:OK, then... *WHO* is the official ext3 "moron"? (2, Informative)

BigBuckHunter (722855) | more than 5 years ago | (#27329121)

Honestly, the state of filesystems in Linux is SO f***d that just blaming whoever added writeback mode is irrelevant.

I agree that the who-dun-it part is irrelevant. I disagree on the "SO f***d" part. We have three filesystems that write the journal prior to the data. Basically, we know the issue, and a similar fix can be shared amongst the three affected filesystems. We've had far more "f***d" situations than this (think etherbrick-1000) where hardware was being destroyed without a good understanding of what was happening. Everything will work out as it seems to have everyone's attention.

BBH

Re:OK, then... *WHO* is the official ext3 "moron"? (2, Insightful)

Skuto (171945) | more than 5 years ago | (#27329373)

I agree that the who-dun-it part is irrelevant. I disagree on the "SO f***d" part. We have three filesystems that write the journal prior to the data. Basically, we know the issue, and a similar fix can be shared amongst the three affected filesystems.

I would be very surprised if the fix can be shared between the filesystems. At least the most serious among those involved, XFS, sits on a complete intermediate compatibility layer that makes Linux look like IRIX.

Linux filesystems are seriously in a bad state. You simply cannot pick a good one. Either you get one that does not actively kill your data (ext3 ordered/journal) or you pick one which actually gives decent performance (anything besides ext3).

Obviously, we should have both. It's not like that is impossible. But it's surprising how long those problems lasted. It's not like filesystems are a MINOR part of the entire OS.

Probably part of the reason is that we have JFS, XFS, ext3/4, reiser3/4, tux3, btrfs... Filesystem developers suffer very heavily from NIH syndrome. Instead of one good one we have 8 that "almost" work.

But almost is not good for something so essential. This is not the kind of choice that is good. It's time one filesystem wins, gets fixed, and the rest is left dead.

Re:OK, then... *WHO* is the official ext3 "moron"? (1)

Kjella (173770) | more than 5 years ago | (#27329495)

Would you care to make an educated guess on how many run one of said three filesystems - particularly ext3, compared to using an etherbrick-1000? Scale matters, even if it sucks equally much if *your* data was eaten by a one-in-a-billion freak bug or a common one.

Re:OK, then... *WHO* is the official ext3 "moron"? (4, Insightful)

SpinyNorman (33776) | more than 5 years ago | (#27329273)

fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.

I think sometimes programmers do fsync() when they really want fflush() (flush library buffers to driver) which is about program behavior ("I want this data written to disk real-soon-now", not hanging around in the library buffer indefinitely) rather than a data-on-disk guarantee.

IMO telling programmers to flatly avoid fsync is almost as bad as having a borked meta-data/data write order - programmers should be educated about what fsync does and when they really want/need it and when they don't. I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.
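To make the fflush()/fsync() distinction above concrete, here is a minimal sketch in Python, whose `file.flush()` and `os.fsync()` correspond to the C calls being discussed. The helper names `write_flushed`, `write_durable`, and `demo` are invented for the example.

```python
import os
import tempfile


def write_flushed(path, data):
    """Hand the data to the kernel, but do not wait for it to reach the disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # library buffer -> kernel page cache (like fflush)


def write_durable(path, data):
    """Hand the data to the kernel AND ask the kernel to push it to storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # library buffer -> kernel page cache
        os.fsync(f.fileno())   # kernel page cache -> storage device (like fsync)


def demo():
    """Both variants read back identically; they differ only after a crash."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "flushed")
        b = os.path.join(d, "durable")
        write_flushed(a, b"hello")
        write_durable(b, b"hello")
        with open(a, "rb") as fa, open(b, "rb") as fb:
            return fa.read(), fb.read()
```

Other processes can see the data as soon as `flush()` returns, which is often all a program wants. Only the `os.fsync()` variant asks for an on-disk guarantee, and it is the call that carries the performance cost.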

Re:OK, then... *WHO* is the official ext3 "moron"? (3, Interesting)

Rich0 (548339) | more than 5 years ago | (#27329645)

I agree. What we need is a mechanism for an application to indicate to the OS what kind of data is being written (in terms of criticality/persistence/etc). If it is the gimp swapfile chances are you can optimize differently for performance than if it is a file containing innodb tables.

Right now app developers are having to be concerned with low-level assumptions about how data is being written at the cache level, and that is not appropriate.

I got burned by this when my mythtv backend kept losing chunks of video when the disk was busy. Turns out the app developers had a tiny buffer in RAM, which they'd write out to disk, and then do an fsync every few seconds. So, if two videos were being recorded the disk is constantly thrashing between two huge video files while also busy doing whatever else the system is supposed to be doing. When I got rid of the fsyncs and upped the buffer a little all the issues went away. When I record video to disk I don't care if, when the system goes down, in addition to losing the next 5 minutes of the show during the reboot I also lose the last 20 seconds as well. This is just bad app design, but it highlights the problems when applications start messing with low-level details like the cache.

Linux filesystems just aren't optimal. I think that everybody is more interested in experimenting with new concepts in file storage, and they're not as interested in just getting files reliably stored to disk. Sure, most of this is volunteer-driven, so I can't exactly put a gun to somebody's head to tell them that no, they need to do the boring work before investing in new ideas. However, it would be nice if things "just worked".

We need a gradual level of tiers ranging from a database that does its own journaling and needs to know that data is fully written to disk to an application swapfile that if it never hits the disk isn't a big deal (granted, such an app should just use kernel swap, but that is another issue). The OS can then decide how to prioritize actual disk IO so that in the event of a crash chances are the highest priority data is saved and nothing is actually corrupted.

And I agree completely regarding transaction support. That would really help.

Re:OK, then... *WHO* is the official ext3 "moron"? (4, Funny)

red_dragon (1761) | more than 5 years ago | (#27328501)

they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus

He's following Ext3 writeback semantics. You'll have to wait for a patch to fix his behaviour.

Re:OK, then... *WHO* is the official ext3 "moron"? (0)

Anonymous Coward | more than 5 years ago | (#27328839)

Shoot first, write data later.

Re:OK, then... *WHO* is the official ext3 "moron"? (0)

Anonymous Coward | more than 5 years ago | (#27328555)

nonsense. we can always use a good pillorying

Re:OK, then... *WHO* is the official ext3 "moron"? (2, Funny)

Anonymous Coward | more than 5 years ago | (#27328635)

nonsense. we can always use a good pillorying

That's why I'm an OpenBSD developer as well... I like the abuse and scorn that Theo throws at me.
It's good to see that Linus is becoming more like Theo. What's the quote: "that which doesn't kill me only makes me stronger"

Re:OK, then... *WHO* is the official ext3 "moron"? (5, Insightful)

houghi (78078) | more than 5 years ago | (#27328649)

Knowing the humor that Linus has, it could be himself.

Um. This doesn't make sense. (4, Insightful)

Colin Smith (2679) | more than 5 years ago | (#27328739)

Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.

from the FAQ: http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html [sapienti-sat.org]

"mount -o data=ordered"
                Only journals metadata changes, but data updates are flushed to
                disk before any transactions commit. Data writes are not atomic
                but this mode still guarantees that after a crash, files will
                never contain stale data blocks from old files.

"mount -o data=writeback"
                Only journals metadata changes, and data updates are entirely
                left to the normal "sync" process. After a crash, files
                may contain stale data blocks from old files: this mode is
                exactly equivalent to running ext2 with a very fast fsck on reboot.

So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
 

Re:Um. This doesn't make sense. (2, Insightful)

Skuto (171945) | more than 5 years ago | (#27329433)

So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...

The thread starts with someone having serious performance problems exactly because ext3 ordered mode is so slow in some circumstances...

Like when you fsync().

Re:OK, then... *WHO* is the official ext3 "moron"? (0)

Anonymous Coward | more than 5 years ago | (#27328975)

Imagine what phrases this thread would be using if it were a Microsoft filesystem's behaviour being discussed.

Re:OK, then... *WHO* is the official ext3 "moron"? (1)

David Greene (463) | more than 5 years ago | (#27329553)

How about ASKING them rather than calling them morons?

Ah, but that would mean that Linus would have to grow up and actually lead.

I would go further than Linus on this one... (4, Insightful)

pla (258480) | more than 5 years ago | (#27328051)

FTA: "if you write your data _first_, you're never going to see corruption at all"

Agreed, but I think this still misses the point - Computers go down unexpectedly. Period.

Once upon a time, we all seemed to understand that, and considered writeback behavior (when rarely available) always a dangerous option only for use in non-production systems and with a good UPS connected. And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!

Yes, it gives a huge performance boost... But performance without reliability means absolutely nothing. Eventually every computer will go down without enough warning to flush the write buffers.

Re:I would go further than Linus on this one... (5, Informative)

Skuto (171945) | more than 5 years ago | (#27328177)

You are confusing writeback caching with ext3/4's writeback option, which is simply something different.

The problem with all the ext3/ext4 discussions has been the ORDER in which things get written, not whether they are cached or not. (Hence the existence of an "ordered" mode)

You want new data written first, and the references to that new data updated later, and most definitely NOT the other way around.

Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.

Re:I would go further than Linus on this one... (3, Interesting)

AlterRNow (1215236) | more than 5 years ago | (#27328297)

Am I right believing that the new data is written elsewhere and then the metadata is updated in place to point to the new data? I don't know much about filesystems..

Re:I would go further than Linus on this one... (2, Informative)

AvitarX (172628) | more than 5 years ago | (#27328685)

It is by default, using the ordered journal type in Ext3.

It is not an option yet in Ext4, and for now may not be the default, but an option to be set at mount time.

Currently in Ext4, the meta data in journal is first updated, then the data written.

When software assumes that it can send commands and have them take place in the order sent, this becomes problematic: without costly immediate writes there is a risk of losing very, very old data, as the file's metadata gets updated but the data is not yet written to the new place.

Re:I would go further than Linus on this one... (4, Informative)

Spazmania (174582) | more than 5 years ago | (#27329231)

Here's what Linus had to say, and I think he hit the nail on the head:

The point is, if you write your metadata earlier (say, every 5 sec) and
the real data later (say, every 30 sec), you're actually MORE LIKELY to
see corrupt files than if you try to write them together.

And if you write your data _first_, you're never going to see corruption
at all.

This is why I absolutely _detest_ the idiotic ext3 writeback behavior. It
literally does everything the wrong way around - writing data later than
the metadata that points to it. Whoever came up with that solution was a
moron. No ifs, buts, or maybes about it.

Re:I would go further than Linus on this one... (4, Insightful)

Anonymous Coward | more than 5 years ago | (#27328345)

Yes! This is the whole point. I am not a filesystem guy either. I don't even know that much about filesystems. But imagine you write a program with some common data storage. Imagine part of that common data is a pointer to some kind of matrix or whatever. Does anybody think it is a good idea to set that pointer first, and then initialize the data later?

Sure, a really robust program should be able to somehow recover from corrupt data. But that doesn't mean you can just switch your brain off when writing the data.
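The pointer analogy above can be played out in a few lines. This is a toy model in Python, not filesystem code: the function `simulate`, the block number 7, and the byte strings are all invented for the example. The "disk" has one block that still holds stale bytes from a deleted file, plus one metadata pointer.

```python
STALE = b"stale bytes from another file"
NEW = b"new contents"


def simulate(order, steps_before_crash):
    """Run the first `steps_before_crash` steps of `order`, then 'crash'.

    Returns what a reader of the file sees after reboot: None for an empty
    file, otherwise the bytes the metadata pointer leads to.
    """
    blocks = {7: STALE}   # block 7 was freed earlier but never scrubbed
    pointer = None        # the file currently has no data block

    for step in order[:steps_before_crash]:
        if step == "data":
            blocks[7] = NEW   # write the real contents into the block
        elif step == "meta":
            pointer = 7       # make the file's metadata reference the block

    return None if pointer is None else blocks[pointer]
```

With `order=("meta", "data")` and a crash after one step, the file points at the stale bytes. With `order=("data", "meta")` and the same crash, the file is merely still empty.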

Re:I would go further than Linus on this one... (4, Interesting)

mysidia (191772) | more than 5 years ago | (#27328359)

This is a potential problem when you are overwriting existing bytes or removing data.

In that case, you've removed or overwritten the data on disk, but now the metadata is invalid.

i.e. You truncated a file to 0 bytes, and wrote the data.

You started re-using those bytes for a new file that another process is creating.

Suddenly you are in a state where your metadata on disk is inconsistent, and you crash before that write completes.

Now you boot back up.. you're ext3, so you only journal metadata, so that's the only thing you can revert, unfortunately, there's really nothing to rollback, since you haven't written any metadata yet.

Instead of having a 0 byte file, you have a file that appears to be the size it was before you truncated it, but the contents are silently corrupt, and contain other-program-B's data

Re:I would go further than Linus on this one... (0)

Anonymous Coward | more than 5 years ago | (#27328585)

So you are assuming the order is like that:

1. allocate new data
2. mark old data as free
3. update metadata

What about this:

1. allocate new data
2. update metadata
3. mark old data as free

OR since freeing the space can't be that slow why not:

1. allocate new data
2. update metadata AND mark old data as free (atomic and journaled)

Re:I would go further than Linus on this one... (3, Insightful)

Hatta (162192) | more than 5 years ago | (#27328947)

In that case, you've removed or overwritten the data on disk, but now the metadata is invalid.

i.e. You truncated a file to 0 bytes, and wrote the data.

Why on earth would you do that? Write the new data, update the metadata, THEN remove the old file.

Re:I would go further than Linus on this one... (1)

Logic and Reason (952833) | more than 5 years ago | (#27329343)

Why can't the filesystem just update data and metadata in given order for a particular file? For example, if you truncate a file and then write to it, the following should happen:
  1. Metadata for `foo' is updated (length=0)
  2. New data for `foo' is written elsewhere
  3. Metadata for `foo' is updated (contents=new_data)

If, on the other hand, you're doing the create-write-close-rename trick to get an "atomic file replace", then the following should happen:

  1. Metadata for `foo.new' is created (length=0)
  2. New data for `foo.new' is written elsewhere
  3. Metadata for `foo.new' is updated (contents=new_data)
  4. Metadata for `foo.new' is updated (filename=foo), replacing old `foo'

It seems like in both cases, ensuring that data and metadata are written in given order for a particular file would solve the problem, without imposing any performance penalties on I/O operations going on for other files. I assume I'm missing something-- does all metadata need to be written in order with respect to all other metadata or something?
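For reference, the create-write-close-rename trick mentioned above looks roughly like this from the application side. This is a sketch in Python for a POSIX system: the function name `atomic_replace` and the `.tmp-` prefix are made up, and whether the fsync calls are needed is exactly what the thread is arguing about.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace `path` so that readers see either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the new file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # get the data to disk before the rename
        os.replace(tmp_path, path)  # atomically swap the name over to the new file
    except BaseException:
        # Something failed before the rename: remove the temporary file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Assumption: POSIX only. Sync the directory so the rename itself is durable.
    if os.name == "posix":
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Note that `tempfile.mkstemp` creates the file with owner-only permissions, so a real program would probably want to copy the old file's mode across before the rename.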

Re:I would go further than Linus on this one... (1)

Spazmania (174582) | more than 5 years ago | (#27329401)

It's also an easily solved problem:

After a truncate(), you lock the deleted blocks against a write until after you've written the updated metadata for the file. Until then, anything you write to the file will have to be allocated elsewhere on the disk. But then that's part of what the reserve slack is for: to increase the probability that there is somewhere else on the disk that you can write it.
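A rough sketch of that idea, as a toy block allocator in Python. Nothing here reflects how ext3 actually allocates blocks: the class `ToyAllocator` and its method names are hypothetical, and "metadata written" is reduced to a single method call.

```python
class ToyAllocator:
    """Toy allocator that quarantines freed blocks until metadata is on disk."""

    def __init__(self, n_blocks):
        self.free = set(range(n_blocks))
        self.pending_free = set()   # freed by truncate, metadata not yet written

    def allocate(self):
        """Hand out a block that is genuinely free (never a quarantined one)."""
        if not self.free:
            raise OSError("no space left on toy device")
        block = min(self.free)
        self.free.remove(block)
        return block

    def truncate(self, file_blocks):
        """Release a file's blocks, but lock them against reuse for now."""
        self.pending_free.update(file_blocks)

    def commit_metadata(self):
        """Call once the truncated file's metadata has been written out."""
        self.free |= self.pending_free
        self.pending_free.clear()
```

Between `truncate()` and `commit_metadata()`, new writes land elsewhere on the disk, so a crash in that window cannot leave the old metadata pointing at another file's data. The price is that the disk looks fuller than it is for a short while, which is where the reserve slack comes in.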

Re:I would go further than Linus on this one... (3, Insightful)

morgan_greywolf (835522) | more than 5 years ago | (#27328575)

Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.

It's common sense! Duh. Write data first, pointers to data second. If the system goes down, you're far less likely to lose anything. That's obvious. Those who think this is somehow not obvious don't have the right mentality to be writing kernel code.

I think the problem is Ted Ts'o has had a slight 'works for me' attitude about it:

All I can tell you is that *I* don't run into them, even when I was
using ext3 and before I got an SSD in my laptop. I don't understand
why; maybe because I don't get really nice toys like systems with
32G's of memory. Or maybe it's because I don't use icecream (whatever
that is).

Re:I would go further than Linus on this one... (1)

Chrisq (894406) | more than 5 years ago | (#27328723)

Is that you Linus?

Re:I would go further than Linus on this one... (2, Funny)

hey (83763) | more than 5 years ago | (#27328741)

Well, it's not ironic. It would be ironic if the ext3/4 authors lost their code in a crash because of the order that the data was written.

Except ordered data mode is the (slower) default (1)

Colin Smith (2679) | more than 5 years ago | (#27328987)

Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.

You specifically have to choose writeback mode in the full knowledge that the datablocks will almost certainly be written after the metadata journal.

I think Ted Ts'o etc. are probably perfectly aware of how it works.

Frankly I think Linus is trolling.

Re:Except ordered data mode is the (slower) defaul (2, Funny)

Skuto (171945) | more than 5 years ago | (#27329529)

You specifically have to choose writeback mode in the full knowledge that the datablocks will almost certainly be written after the metadata journal.

I think Ted Ts'o etc. are probably perfectly aware of how it works.

Except that ext4 loses data in ordered mode for exactly the same reason, and we had a big fuss about that the last few weeks, because *someone* (cough) said that it's the application developers' fault for not fsync()-ing.
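For anyone wondering what "fsync()-ing" looks like from the application side, here is a minimal sketch of the usual write-temp, fsync, rename pattern, in Python. The function name atomic_write is made up for illustration, and the directory fsync at the end assumes a POSIX system. It is one common way to do this, not the only one.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace `path` with `data` so that a crash leaves either the old
    contents or the new contents, never an empty or partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push the data to disk
        os.replace(tmp_path, path)   # data is on disk before the name points at it
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also sync the directory so the rename itself is durable (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This is the application-level version of "write data first, pointers second": the rename is the pointer update, and the fsync() forces the data out before it.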

Safest mkfs/mount options? (3, Interesting)

Per Wigren (5315) | more than 5 years ago | (#27328173)

If I were to set up a new home spare-part-server using software RAID-5 and LVM today, using kernel 2.6.28 or 2.6.29 and I really care about not losing important data in case of a power outage or system crash but still want reasonable performance (not run with -o sync), what would be my best choice of filesystem (EXT4 or XFS), mkfs and mount options?

Re:Safest mkfs/mount options? (1, Offtopic)

mysidia (191772) | more than 5 years ago | (#27328385)

ISO9660 for most filesystems. (i.e. read-only)

EXT3.

EXT4 is bleeding edge.

I wouldn't recommend XFS unless you have NVRAM-backed storage.

Re:Safest mkfs/mount options? (2, Interesting)

AvitarX (172628) | more than 5 years ago | (#27328571)

Ext3 with an ordered (default) style journal.

I believe XFS has a similar option, and Ext4 will with the next kernel, but for a home type system Ext3 should meet all of your needs, and Linux utilities still know it best.

Of course you should probably use RAID-10 too, with data disk space so cheap it is well worth it. Using the "far" disk layout, you get very fast reads, and though it penalizes writes (vs RAID 0) in theory, the benchmarks I have seen show that penalty to be smaller than the theory.

as for mkfs, large inodes probably, and when mounting use noatime.

for some anti-raid 5 propaganda:
http://www.baarf.com/ [baarf.com]

Re:Safest mkfs/mount options? (3, Insightful)

Blackknight (25168) | more than 5 years ago | (#27328689)

Solaris 10 with ZFS, if you actually care about your data.

Re:Safest mkfs/mount options? (1)

JayAEU (33022) | more than 5 years ago | (#27328883)

If I recall correctly, BtrFS also does checksumming of individual files and has become available in the latest kernel as well, so it's easier to use with Linux.

I wouldn't use it on a server just yet, since there might still be some changes to the ondisk format.

Re:Safest mkfs/mount options? (1)

^me^ (129402) | more than 5 years ago | (#27329061)

btrfs isn't done yet, you'd be an idiot to use it for real data.

The on disk format will change sometime in the future, it is in the freaking help.

Re:Safest mkfs/mount options? (3, Informative)

larry bagina (561269) | more than 5 years ago | (#27328729)

with lvm, you can easily try out the various file systems (don't forget jfs!). Personally, I've found linux XFS to corrupt itself beyond repair, so I use ext3.

A UPS (0)

Anonymous Coward | more than 5 years ago | (#27328751)

n/t

Re:A UPS (2, Insightful)

ledow (319597) | more than 5 years ago | (#27329225)

Yeah, I have to second this... all the journalling filesystems in the world can't compete with a bog-standard, home-based UPS. You just need to make ABSOLUTELY sure that the system shuts down when the battery STARTS going (don't try and be fancy about getting it to run until the battery lifetime) and that the system WILL shut down, no questions asked.

A UPS costs, what, £50 for a cheap, home-based one? Batteries might cost you £20 a year or so on average (and probably a lot less if you just need "shutdown safely" rather than "carry on running"). You don't need it to give a lot of power (run ONLY the base unit off it... anything else and you could hit overloads, etc... you *won't* be operating the PC when it's on battery, you just want it to shut down and, optionally, give you a beep or two when it has shut down successfully), or for very long at all. You just need a fail-safe way of detecting when the power is out so that you can safely shut down. You also want to check that your cabling is good (nothing more embarrassing than having a UPS and then pulling the wrong cable out).

Above and beyond that, filesystem and/or data corruption is one of those things that are almost guaranteed to happen unless you put a lot of effort into it (battery-backed RAID controllers, filesystems with slow-but-sure settings, integrity checking etc.). Make it easy on yourself - use a UPS to stop the problem happening ever, rather than try to have something that *might* clean up nicely if it does happen. Even Google don't bother with journalling - if a PC loses power, it's rebuilt from an image. It's not worth faffing about to see if/when/how a filesystem can be repaired, just ensure you have adequate backups and try to stop it happening in the first place.

Re:Safest mkfs/mount options? (1, Insightful)

Anonymous Coward | more than 5 years ago | (#27328789)

JFS

Re:Safest mkfs/mount options? (4, Informative)

mmontour (2208) | more than 5 years ago | (#27328927)

My advice:

- Make regular backups; you'll need them eventually. Keep some off-site.
- ext3 filesystem, default "data=ordered" journal
- Disable the on-drive write-cache with 'hdparm'
- "dirsync" mount option
- Consider a "relatime" or "noatime" mount option to increase performance (depending on whether or not you use applications that care about atime)
- If you don't want the performance hit from disabling the on-drive write-cache, add a UPS and set up software to shut down your system cleanly when the power fails. You are still vulnerable to power-supply failures etc. even if you have a UPS.
- Schedule regular "smartctl" scans to detect low-level drive failures
- Schedule regular RAID parity checks (triggered through a "/sys/.../sync_action" node) to look for inconsistencies. I have a software-RAID1 mirror and I've found problems here a few times (one of which was that 'grub' had written to only one of the disks of the md device for my /boot partition).
- Periodically compare the current filesystem contents against one of your old backups. Make sure that the only files that are different are ones that you expected to be different.

If you decide to use ext4 or XFS most of the above points will still apply. I don't have any experience with ext4 yet so I can't say how well it compares to ext3 in terms of data-preservation.
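As a rough way to verify that the mount options from the list above are actually in effect, here is a small Python sketch that parses /proc/mounts-style text. The function name, the sample line and the /srv mount point are all made up for illustration. Note that a default such as data=ordered may not be listed explicitly, so a missing entry does not prove the option is off.

```python
def check_mount_options(mounts_text, mount_point, wanted=("dirsync", "noatime")):
    """Return {option: present?} for mount_point, given /proc/mounts-style text.

    Each line is expected to look like:
        device mount_point fstype options dump pass
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))
            # Keep the last match: a later mount on the same point shadows earlier ones.
            result = {opt: opt in options for opt in wanted}
    if result is None:
        raise LookupError(f"{mount_point} not found in mount table")
    return result


# Made-up sample line, not taken from a real system.
SAMPLE = "/dev/md0 /srv ext3 rw,noatime,data=ordered 0 0\n"
print(check_mount_options(SAMPLE, "/srv"))
```

On a real Linux box you would pass in the contents of /proc/mounts instead of SAMPLE.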

Linus Torvalds...sponsored by "Girl World Live" (-1, Troll)

Anonymous Coward | more than 5 years ago | (#27328217)

???!

Geez... (2, Funny)

hesaigo999ca (786966) | more than 5 years ago | (#27328271)

Tell us what you really think there Linus.

~I went home today knowing I made someone cry!~

I don't understand (0, Offtopic)

Sam36 (1065410) | more than 5 years ago | (#27328315)

What is this article talking about?

the new linux destroys your data (1)

hildi (868839) | more than 5 years ago | (#27328495)

some people think it doesnt matter, some people think it does.

Linus (-1, Flamebait)

Anonymous Coward | more than 5 years ago | (#27328379)

The more I read about Linus and his cursing and babbling the more I think that he has some kind of self-inferiority issues. Someone give him a hug.

Re:Linus (2, Funny)

Anonymous Coward | more than 5 years ago | (#27328561)

I think he's sad because he never got that job at Microsoft he always wanted.

Maybe only a hug from Bill Gates would solve his problem.

Re:Linus (1)

Andr T. (1006215) | more than 5 years ago | (#27328865)

Sometimes I get the impression that Linus says things the way he says because the other 'powerful' guys who are really important and active in the Linux community don't say anything or even agree with him when he talks like that. I remember a similar episode some time ago when a guy wanted to port Git to C++ or something like that. I think he cried.

I can't imagine a reason to be this rude.

Re:Linus (0)

Anonymous Coward | more than 5 years ago | (#27329185)

Maybe, but compared to Theo De Raadt he's positively polite ...

Linus himself author of code in question (0)

Anonymous Coward | more than 5 years ago | (#27329247)

Ummm, the piece of code Linus called idiotic, he may have written himself. While Linus is well known for not holding back his feelings with colorful language, he's also got a strange sense of humour.

Re:Linus (1, Insightful)

Anonymous Coward | more than 5 years ago | (#27329625)

Oh come off it. You must be an American, because in America excessive gentleness and tenderness in dealing with even the most outrageous and inexcusable problems seems to be the present cultural norm.

Linus, perhaps, is a taskmaster and perfectionist. The Linux OS is his baby and any major difficulties will ultimately be a bad reflection on him alone.

It is not inappropriate to sometimes rudely castigate one's associates. It is a kind of shaming game that is intended to inspire better performance. I recall that during the Intel ethernet fiasco involving the e1000e driver, Torvalds was equally brusque toward the Intel developers for their "stupid" oversights.

What we need is more, and not less, of such an aggressive attitude. A real man can take it. Indeed, real men will welcome it, because the end result, in spite of any hurt feelings, is an overall higher quality of craftsmanship.

Maybe Linus should have gone for one more RC (0)

Anonymous Coward | more than 5 years ago | (#27328381)

after all

Epic fail, eh?

misspelling (1)

destiney (149922) | more than 5 years ago | (#27328433)

Andi Kleen, the l is missing.

ZFS (4, Informative)

chudnall (514856) | more than 5 years ago | (#27328995)

Linux seriously needs to find a workaround to its licensing squabbles [blogspot.com] and find a way to get a rock-solid ZFS in the kernel. Right now, ZFS on OpenSolaris [opensolaris.org] is simply wonderful, and this is what I am deploying for file service at all my customer sites now. The scary thing about file system corruption is that it is often silent, and can go on for a long time, until your system crashes, and you find that all of your backups are also crap. I've replaced a couple of linux servers (and more than a couple of Windows servers) after filesystem and disk corruption compounded by naive RAID implementations (RAID[1-5] without end-to-end checksumming can make your data *less* safe), and my customers couldn't be happier. Having hourly snapshots [dzone.com] and a fast in-kernel CIFS server fully integrated with ZFS ACLs [sun.com] (and with support for NTFS-style mixed case naming) is just icing on the cake. Now if only I could have an OpenSolaris desktop with all the nice linux userland apps available. Oh wait, I can! [nexenta.org]
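To illustrate why checksumming catches the silent corruption described above, here is a minimal Python sketch. It is not ZFS's on-disk scheme; the block size and function names are assumptions chosen for the example. The point is only that checksums kept apart from the data let you notice a block that changed without any I/O error being reported.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_checksums(data: bytes):
    """Return one SHA-256 hex digest per BLOCK_SIZE chunk of data."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def find_corrupt_blocks(data: bytes, expected):
    """Return the indices of blocks whose checksum no longer matches."""
    actual = block_checksums(data)
    bad = [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
    # Blocks that appeared or vanished also count as mismatches.
    shorter, longer = sorted((len(actual), len(expected)))
    bad.extend(range(shorter, longer))
    return bad
```

A plain mirror without checksums can tell you that two copies differ, but not which one is right; a stored checksum can.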

Re:ZFS (0)

Anonymous Coward | more than 5 years ago | (#27329229)

Does Btrfs not seem like an adequate ZFS replacement for Linux?

Re:ZFS (2, Funny)

Anonymous Coward | more than 5 years ago | (#27329621)

Linux seriously needs to find a workaround to its licensing squabbles and find a way to get a rock-solid ZFS in the kernel.

You must have missed Linus's memo:

To: Samuel J. Palmisano, IBM Business Guy
From: Linus Torvalds, Super Genius

Dear Sam:
As you know, I've been trying to get a decent file system into Linux for a while. Let's face it, none of these johnny-come-lately open-source arseholes can write a file system to save their life; the last one to have a chance was Reiser, and I really don't want him hanging around here even if we can spring him; he creeps me out. And your guys are no better. JFS? It is to laugh. Sun has one called ZFS, but they are being utter dicks about licensing it. Since licensing seems to be more in your purview than mine, I thought you might be able to help me out. I can't help but notice that in this recession, IBM is doing relatively good and Sun's stock is in the crapper. Perhaps the easiest thing to do would be to just pick up the whole company. Just a thought
    -- Linus

Data - metadata ordering: softupdates (5, Informative)

ivoras (455934) | more than 5 years ago | (#27329215)

Somebody's going to mention it so here it is: there was a BSD unix research project that ended as the soft-updates implementation (currently present in all modern free BSDs). It deals precisely with the ordering of metadata and data writes. The paper is here: http://www.ece.cmu.edu/~ganger/papers/softupdates.pdf [cmu.edu]. Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keep track of ordering *within* metadata. It has proven to be very resilient (up to hardware problems).

Here's an excerpt:

We refer to this requirement as an update dependency, because safely writing the directory entry depends on first writing the inode. The ordering constraints map onto three simple rules: (1) Never point to a structure before it has been initialized (e.g., an inode must be initialized before a directory entry references it). (2) Never reuse a resource before nullifying all previous pointers to it (e.g., an inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode). (3) Never reset the last pointer to a live resource before a new pointer has been set (e.g., when renaming a file, do not remove the old name for an inode until after the new name has been written). The metadata update problem can be addressed with several mechanisms. The remainder of this section discusses previous approaches and the characteristics of an ideal solution.

There's some quote about this... something about those who don't know unix and about reinventing stuff, right :P ?
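Rule (1) from the excerpt can be pictured as a dependency graph between pending writes. The Python sketch below is a toy, not the BSD implementation: the names flush_order and pending are made up, and real soft updates track dependencies at a much finer grain. It only shows that "never point to a structure before it has been initialized" amounts to flushing writes in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def flush_order(dependencies):
    """Return pending writes in an order that respects update dependencies.

    `dependencies` maps each pending write to the set of writes that must
    reach the disk before it.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pending writes for creating a one-block file.
pending = {
    "data block": set(),
    "inode (block pointer)": {"data block"},         # pointer after its target
    "directory entry": {"inode (block pointer)"},    # name after the inode
}
order = flush_order(pending)
print(order)
```

Whatever order this prints, the thing being pointed to always comes before the pointer.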

Re:Data - metadata ordering: softupdates (1, Interesting)

Anonymous Coward | more than 5 years ago | (#27329565)

Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keep track of ordering *within* metadata.

Maybe I misinterpret something here but doesn't that sound like the exact opposite of what you claim:

Block Allocation. When a new block or fragment is allocated for a file,
the new block pointer (whether in the inode or an indirect block) should not
be written to stable storage until after the block has been initialized.

So first initialize the data, then update the pointer in the metadata. If I am not totally mistaken that is exactly what Linus argues for.

No more beating about the bush (-1, Redundant)

Keith_Beef (166050) | more than 5 years ago | (#27329285)

I wish Linus would just come clean and say what he thinks, instead of beating about the bush.

K.

faiLzo8s (-1, Flamebait)

Anonymous Coward | more than 5 years ago | (#27329337)

on My Pentium Pro

idiotic? (0, Flamebait)

xushi (740195) | more than 5 years ago | (#27329393)

well fsck you too... let me see you do a better job..
