How to recover data from a corrupted .tar.bz2 file?

...

That is definitely a good question. To get the answer, bzip2 has a neat option: -t.

Code Block

bzip2 -t archive.tar.bz2

This will tell you if your bzipped archive is fine, or not.
If it's fine, well, enjoy your day :D Otherwise, read on, we'll recover it.

...

cd into recovery.
Here, we'll use the magic bzip2recover command. Hey, but what's that bzip2recover command? Hmmm.
Bzip2 compressed files are divided into blocks (each block holding 100k, 200k, ..., 900k bytes of uncompressed data,
depending on which compression option you used - default is 900k).
What bzip2recover does is split a bzip2 archive into many smaller
bzip2 archives (one per block, actually). That's why it generates soooo many small files.
So, here we go:

Code Block

bzip2recover archive.tar.bz2

...

No, I'm not. I'm not the one with a corrupted archive <evil laugh>.
Seriously, now that we have divided the archive into smaller parts, we'll be able to "isolate" the corrupted parts.
To do so, we'll use bzip2 -t, as we did before, but this time on every small archive file.
Here we go:

Code Block

bzip2 -tv rec*.bz2 > testoutput.log 2>&1

...

Ok, now, we will search for any corrupted small archive through the log file.

Code Block

grep -v 'ok$' testoutput.log

(This filters the output of bzip2 -tv to keep only the lines which don't end with a candid "ok". Guess what, corrupted files don't generate this kind of candid output :D )
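
Parsing the log works, but the exit status of bzip2 -t tells the same story with less guesswork. Here is a minimal sketch, assuming the small archives sit in the current directory; the function name list_bad_blocks is made up for this page and is not part of bzip2.

```shell
# list_bad_blocks FILE...: print the name of every file that fails "bzip2 -t".
# (hypothetical helper, not part of bzip2)
list_bad_blocks() {
    for f in "$@"; do
        # bzip2 -t exits with a non-zero status when the file is damaged
        if ! bzip2 -t "$f" >/dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}

# usage: list_bad_blocks rec*.bz2
```

It prints one file name per line, so the result is easy to feed into other commands.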

Ouch, I've got corrupted blocks. What should I do with that?

...

Ok, cd into recovery1.
Here, we have the beginning of a tar file, nothing's corrupted, but the tar file is not complete.
Right. That makes things easy.
We will just bunzip all the small archives into one recovery1.tar file:

Code Block

bzip2 -dc rec*.bz2 > recovery1.tar

Let's have a look at the resulting .tar file:

Code Block

tar tf recovery1.tar

Wow! We're getting a list of files, and an error. Not perfect, but better than nothing!
We have here all the files which were in the original archive.tar.bz2 up to the first corrupted block.
We're done for recovery1 !

...

recovery2 !! cd ../recovery2
Hmmmm, trying the same method as above fails. Why is that? Because tar sux. Yes, it does.
It does not manage to find a correct header right at the start of the file, and so it fails.
Creepy, huh? But we are smarter than tar, and there's not much that a little Perl magic can't solve.
First, let's have our small bzip2 archives bunzipped into a "failing" tar.

Code Block

bzip2 -dc rec*.bz2 > recovery2_failing.tar

As I told you right before, a tar tf recovery2_failing.tar would.... fail :D
What we need to fix it is to have our recovery2_failing.tar
start at the beginning of a clean header block.
A simple but efficient Perl script will help us make our way out: findtarheader.pl

Code Block (findtarheader.pl)

#!/usr/bin/perl -w
use strict;

# 99.9% of all credits for this script go
# to Tore Skjellnes <torsk@elkraft.ntnu.no>
# who is the originator.

my $tarfile;
my $c;
my $hit;
my $header;

# magic string found at byte 257 of a tar header (GNU flavour: "ustar  \0").
# if you don't get any results, comment out the line below,
# uncomment the one below it (POSIX flavour: "ustar\0" followed by "00") and retry
my @src = (ord('u'),ord('s'),ord('t'),ord('a'),ord('r'),ord(" "), ord(" "),0);
#my @src = (ord('u'),ord('s'),ord('t'),ord('a'),ord('r'),0,ord('0'),ord('0'));

die "No tar file given on command line" if $#ARGV != 0;

$tarfile = $ARGV[0];

open(IN,$tarfile) or die "Could not open `$tarfile': $!";
binmode(IN);

$hit = 0;
$| = 1;
seek(IN,257,0) or die "Could not seek forward 257 characters in `$tarfile': $!";
while (read(IN,$c,1) == 1){
    ($hit = 0, next) unless (ord($c) == $src[$hit]);
    $hit = $hit + 1;
    next unless $hit > $#src;   # magic only partially matched so far, keep reading

    # we have a probable header at (pos - 265)!
    my $pos = tell(IN) - 265;
    seek(IN,$pos,0)
        or do { warn "Could not seek to position $pos in `$tarfile': $!"; $hit = 0; next };
    (read(IN,$header,512) == 512)
        or do { warn "Could not read 512 byte header at position $pos in `$tarfile': $!"; seek(IN,$pos+265,0); $hit = 0; next };
    my ($name, $mode, $uid, $gid, $size, $mtime, $chksum, $typeflag,
        $linkname, $magic, $version, $uname, $gname,
        $devmajor, $devminor, $prefix)
        = unpack ("Z100a8a8a8Z12a12a8a1a100a6a2a32a32a8a8Z155", $header);
    $size = oct($size);   # the size field is stored as an octal string
    # output format: file:offset (counted from 1):name:size
    printf("%s:%s:%s:%s\n",$tarfile,($pos+1),$name,$size);
    $hit = 0;
}

close(IN) or warn "Error closing `$tarfile': $!";

Yeah, copy/paste and save it as findtarheader.pl (if you got it as a .bz2 instead, bunzip2 it first), then chmod +x on it.
Now, to find the first clean tar header on recovery2_failing.tar, do the following:

Code Block

./findtarheader.pl recovery2_failing.tar

This will generate quite a bunch of output. The only interesting line here is the first result, so you can do:

Code Block

./findtarheader.pl recovery2_failing.tar | head -n 1
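
As a cross-check that doesn't need Perl, the same offset can be approximated with grep, since the "ustar" magic sits 257 bytes into every tar header. This is only a sketch: it assumes GNU grep (for -a, -b and -o used together), and the function name first_tar_header is made up for this page. It can also be fooled by file contents that happen to contain the string "ustar", so treat its answer as a candidate, not a certainty.

```shell
# first_tar_header FILE: print the byte position (counted from 1) of the
# first probable tar header in FILE, in the same convention as findtarheader.pl.
# (hypothetical helper; assumes GNU grep)
first_tar_header() {
    LC_ALL=C grep -a -b -o 'ustar' "$1" | while IFS=: read -r magic_pos rest; do
        # the magic lives 257 bytes into a header, so earlier hits cannot be headers
        if [ "$magic_pos" -ge 257 ]; then
            echo $((magic_pos - 257 + 1))
            break
        fi
    done
}

# usage: first_tar_header recovery2_failing.tar
```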

...

To do so, do the following :

Code Block

tail -c +17185 recovery2_failing.tar > recovery2_working.tar

This command copies everything from recovery2_failing.tar, starting at offset +17185 into recovery2_working.tar.
Great, now we have a "recovery2_working.tar" tar file, which WORKS !

Code Block

tar tf recovery2_working.tar
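
A note on the offset: findtarheader.pl prints positions counted from 1, which is exactly what tail -c +N expects (tail -c +1 copies the whole file). The two steps can be glued together as sketched below. The name strip_to_first_header is made up for this page, and the sketch assumes the tar file name contains no ':' because the script's output is colon-separated.

```shell
# strip_to_first_header FINDER_OUTPUT FAILING_TAR WORKING_TAR
# FINDER_OUTPUT is a file holding the lines printed by findtarheader.pl
# (format: file:offset:name:size). Hypothetical helper.
strip_to_first_header() {
    # the second colon-separated field of the first line is the offset
    offset=$(head -n 1 "$1" | cut -d: -f2)
    # tail -c +N starts copying at byte number N (the first byte is number 1)
    tail -c +"$offset" "$2" > "$3"
}

# usage:
#   ./findtarheader.pl recovery2_failing.tar > headers.log
#   strip_to_first_header headers.log recovery2_failing.tar recovery2_working.tar
```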

...

Well, right, you did it.
We can get something out of it.
For instance, thou shalt not use gzip compressed archives for relatively critical stuff,
because if one ever gets corrupted, well, everything after the corrupted spot is just lost. Sad story, huh?
Second thing, tar archives cope quite well with data corruption, at least
better than gzipped files do.
Here, we could restore everything from a .tar.bz2 file, EXCEPT what was
within the corrupted bzip2 block, and everything until the first clean header
after the corrupted block. To sum it up: we lost one block, and any file with
either its header or a part of its body in that block.
If you are saving critical stuff, you could tell bzip2 to use a 100k block size.
If your archive gets corrupted, you lose a multiple of 100k,
against a multiple of 900k if you use a 900k block size,
which could actually make a BIG difference!
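
For the record, the block size is picked with the -1 ... -9 options of bzip2 (-1 means 100k blocks, -9 means 900k). A minimal sketch, where the function name and the paths are placeholders made up for this page:

```shell
# make_small_block_archive DIR ARCHIVE: tar up DIR and compress it
# with 100k bzip2 blocks (bzip2 -1). Hypothetical helper.
make_small_block_archive() {
    tar cf - "$1" | bzip2 -1 > "$2"
}

# usage: make_small_block_archive mydata mydata.tar.bz2
```

The fourth byte of a .bz2 file records the block size ("BZh1" up to "BZh9"), so head -c 4 on an archive tells you which one was used.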

Addendum : Expected Minimal Data Loss
Best case (minimal loss): no file has its header within a corrupted block and its data in other blocks.
Worst case (maximal loss): each corrupted block contains the header of a big file. The whole block is lost, plus that file. (Hypothetically, an unlimited amount of data can be lost; it could be a 100GB file....)

Please note that statistically, with a block size of 'B' kB and a large number of corrupted blocks ('N'), if the average
file size is 'M' kB, the expected data loss is around

Code Block

Estimated average data loss over corruption: (B + (M+1)/2) x N kB

On a tar file within which the average file size is 200 kB, bzipped with 900 kB per block, with 10 faulty blocks, the data
loss is around (900 + 101) x 10 = 10010 kB, about 10 MB.
Same thing with 100 kB per block: (100 + 101) x 10 = 2010 kB, about 2 MB.
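
The arithmetic is easy to replay with your own numbers. Here is a small sketch using awk for the fractional part; expected_loss_kb is a made-up name, and the formula is just the estimate given above, nothing more rigorous.

```shell
# expected_loss_kb B M N: estimated loss in kB for a block size of B kB,
# an average file size of M kB and N corrupted blocks,
# using (B + (M+1)/2) x N. Hypothetical helper.
expected_loss_kb() {
    awk -v b="$1" -v m="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", (b + (m + 1) / 2) * n }'
}

expected_loss_kb 900 200 10   # 900 kB blocks
expected_loss_kb 100 200 10   # 100 kB blocks
```

Since awk keeps (M+1)/2 at 100.5 instead of rounding it up to 101, the two calls print 10005.0 and 2005.0.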

This should be considered when deciding to build a bzip2 compressed archive: the smaller the block size is, the worse the compression will be (speed barely changes, but less memory is needed), and the less data will be lost in case of corruption.