

Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF?

timothy posted about 3 months ago | from the which-ones-are-not-like-the-others? dept.


postbigbang writes "Imagine having thousands of images on disparate machines. Many are dupes, even across those machines, and it's impossible to delete them all manually and build a single, accurate photo image base. Is there an app out there that can scan a file system, perhaps a target sub-folder tree, and suck in the images -- WITHOUT creating duplicates? Perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible."


243 comments

write it yourself (2, Insightful)

retchdog (1319261) | about 3 months ago | (#46051315)

exactly what you mean by deduplication is kind of vague, but whatever you decide on, it could probably be done in a hundred lines of perl (using CPAN libraries of course).

Re:write it yourself (5, Informative)

Anonymous Coward | about 3 months ago | (#46051341)

ExifTool is probably your best start:

http://www.sno.phy.queensu.ca/~phil/exiftool/


Re:write it yourself (4, Informative)

shipofgold (911683) | about 3 months ago | (#46051619)

I second exiftool. It has lots of options for renaming files. If you rename files based on CreateDate and perhaps other fields like resolution, you will end up with unique filenames, and then you can filter out the duplicates.

Here is a quick command which will rename every file in a directory according to CreateDate:

  exiftool "-FileName<CreateDate" -d "%Y%m%d_%H%M%S.%%e" DIR

If the files were all captured with the same device it is probably super easy since the exif info will be consistent. If the files are from lots of different sources...good luck.
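If you copy everything into one staging tree first, exiftool's copy-number feature makes the leftover collisions easy to spot: adding %%-c to the date format appends "-1", "-2", ... whenever the target filename already exists. A rough sketch of that follow-up pass (dry run only; /staging is a placeholder path, and it assumes the renamed files carry EXIF CreateDate):

exiftool -r "-FileName<CreateDate" -d "%Y%m%d_%H%M%S%%-c.%%e" /staging
# anything that picked up a copy number collided with an earlier file;
# if it is also byte-identical to that file, it is safe to drop
for dup in /staging/*-[0-9].*; do
    base="${dup%-[0-9].*}.${dup##*.}"
    [ -f "$base" ] && cmp -s "$base" "$dup" && echo rm "$dup"
done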

Re:write it yourself (1)

postbigbang (761081) | about 3 months ago | (#46051343)

Imagine tons of iterative backups of photos. Generations of backups. Now they need consolidation. Something that can look at file systems and vacuum up the files -- but only one copy of each photo, even if there are many copies of that photo, as in myphoto(1).jpg, etc.

Re:write it yourself (0)

Anonymous Coward | about 3 months ago | (#46051525)

I don't know if anything exists, but would some kind of image-hashing program work? It would just have to create hashes for each photo and then compare the hashes, removing any duplicate hashes.

Re:write it yourself (1)

Anonymous Coward | about 3 months ago | (#46051543)

No, you would want to remove the duplicate photos. Removing the duplicate hashes doesn't solve the problem.

General case (5, Informative)

xaxa (988988) | about 3 months ago | (#46051603)

For the general case (any file), I've used this script:


#!/bin/sh
OUTF=rem-duplicates.sh;
echo "#! /bin/sh" > $OUTF;
find "$@" -type f -print0 |
    xargs -0 -n1 md5sum |
        sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
            sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF

It should be straightforward to change "md5sum" to some other key -- e.g. EXIF Date + some other EXIF fields.

(Also, isn't this really a question for superuser.com or similar?)
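If you want the EXIF-keyed variant suggested above, a sketch (bash; the choice of fields is a guess, and files with no EXIF will all collide on an empty key, so treat it as a starting point only):

#!/bin/bash
find "$@" -type f -print0 |
while IFS= read -r -d '' f; do
    # hash DateTimeOriginal + camera model + pixel dimensions instead of the file bytes,
    # so re-saved copies with identical metadata still group together
    key=$(exiftool -q -S -DateTimeOriginal -Model -ImageSize "$f" | md5sum | cut -c1-32)
    printf '%s  %s\n' "$key" "$f"
done |
    sort --key=1,32 | uniq -w 32 --all-repeated=separate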

Re:General case (0)

Anonymous Coward | about 3 months ago | (#46052081)

now the question is, how do we adapt that to de-dup stories? I've seen this question everywhere but phoronix by now.

Re:General case (0)

Anonymous Coward | about 3 months ago | (#46052437)

Same thing, but faster...
The script above md5sums every file, but you only need to do it for files that share the same size:

#!/bin/bash
find "$@" -type f -not -empty -printf "%-32i%-32s%p\n" \
| sort -n -r \
| uniq -w32 \
| cut -b33- \
| sort -n -r \
| uniq -D -w32 \
| cut -b33- \
| xargs -d"\n" -l1 md5sum \
| sort \
| uniq --all-repeated=separate -w32 \
| cut -b35-

(you are correct)

Re:write it yourself (1)

vux984 (928602) | about 3 months ago | (#46051651)

If the files are in fact identical internally, just backups and backups of backups then it should be pretty straightforward.

Simplest would be simply to:

start with an empty destination

Compare each file in the source tree(s) against each file in the destination by file size in bytes; if there's a match, do a file compare using cmp. Copy it to the destination if it doesn't match anything, otherwise move on to the next file. Seems like something that would take 10-20 lines of command-line script, tops. It's a one-time job, so who cares if it's ideally efficient.

A more sophisticated method that generates and compares file hashes would potentially be somewhat faster and cleverer, but it would depend on how much duplication actually exists. cmp terminates at the first mismatched byte, so it will short-circuit out of virtually all comparisons almost immediately, whereas generating hashes requires processing every file completely, as well as coming up with a system for managing the hash/filename map, etc. It gets cleverer than it needs to be for a one-off job pretty fast.
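For what it's worth, a rough sketch of that size-then-cmp loop (bash with GNU find/stat/cp assumed; the paths are placeholders):

#!/bin/bash
SRC=/mnt/backups    # tree(s) holding the duplicated photos
DST=/srv/photos     # consolidated destination, starts out empty

find "$SRC" -type f -print0 |
while IFS= read -r -d '' f; do
    size=$(stat -c%s "$f")
    dup=""
    # only files of identical size can be identical; cmp bails at the first differing byte
    while IFS= read -r -d '' candidate; do
        cmp -s "$f" "$candidate" && { dup=1; break; }
    done < <(find "$DST" -type f -size "${size}c" -print0)
    # --backup=numbered keeps different files that happen to share a basename
    [ -z "$dup" ] && cp --backup=numbered "$f" "$DST/"
done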

Re:write it yourself (0)

Anonymous Coward | about 3 months ago | (#46051375)

I actually expect some helpful Gus to post a 1-liner....

Write a quick script. (4, Informative)

khasim (1285) | about 3 months ago | (#46051377)

If they are identical then their hashes should be identical.

So write a script that generates hashes for each of them and checks for duplicate hashes.

Re:Write a quick script. (1)

xpucadad (244993) | about 3 months ago | (#46052253)

I've done this in Perl. It was easy. But the files have to be identical: even if the pictures are the same size and look identical, if the files' contents aren't exactly the same, they won't match.

findimagedupes in Debian (5, Interesting)

nemesisrocks (1464705) | about 3 months ago | (#46051473)

whatever you decide on, it could probably be done in a hundred lines of perl

Funny you mention perl.

There's a tool written in perl called "findimagedupes" in Debian [debian.org]. Pretty awesome tool for large image collections, because it can identify duplicates even if they have been resized or messed with a little (e.g. by adding logos, etc.). Point it at a directory, and it'll find all the dupes for you.
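For reference, the invocation really is about that simple (a sketch; the -R recursion flag is from memory, so check your version's manpage):

# compare every image under the tree and print groups of visually similar files
findimagedupes -R /srv/photos > similar-groups.txt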

Re:findimagedupes in Debian (3, Interesting)

msobkow (48369) | about 3 months ago | (#46051739)

Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...

From what this user is talking about (multiple drives full of images), they may well have reached the point where it is impossible to sort out the dupes without one hell of a heavy hitting cluster to do the comparisons and sorting.

Re:findimagedupes in Debian (2)

complete loony (663508) | about 3 months ago | (#46051809)

What you want is a first pass that identifies some interesting points in the image, similar to Microsoft's Photosynth. Then you can compare this greatly simplified data for similar sets of points, allowing you to ignore the effects of scaling or cropping.

A straight hash won't identify similarities between images, and would be totally confused by compression artefacts.

Re:findimagedupes in Debian (2, Informative)

nemesisrocks (1464705) | about 3 months ago | (#46051879)

Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...

It's actually pretty nifty how findimagedupes works. It creates a 16x16 thumbnail of each image (it's a little more complicated than that -- read more on the manpage [jhnc.org]), and uses this as a fingerprint. Fingerprints are then compared using an algorithm that looks like O(n^2).

I doubt the difference between O(2^n) and O(n^2) would make a huge impact anyway: the biggest bottleneck is going to be disk read and seek time, not comparing fingerprints. It's akin to running compression on a filesystem: read speed is an order of magnitude slower than the compression.
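If you're curious what such a fingerprint looks like, here's a rough shell illustration of the idea (not findimagedupes' actual algorithm) using ImageMagick: shrink to 16x16 greyscale, threshold to one bit per pixel, and hash the raw bitmap. Identical hashes mean visually near-identical images; findimagedupes itself compares fingerprints bit-by-bit against a threshold rather than hashing them.

# lines that share a hash are visually near-identical
for f in *.jpg; do
    convert "$f" -resize 16x16! -colorspace Gray -normalize -threshold 50% \
        -depth 1 gray:- | md5sum | sed "s|-|$f|"
done | sort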

Re:findimagedupes in Debian (2)

sexconker (1179573) | about 3 months ago | (#46052367)

Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...

It's actually pretty nifty how findimagedupes works. It creates a 16x16 thumbnail of each image (it's a little more complicated than that -- read more on the manpage [jhnc.org]), and uses this as a fingerprint. Fingerprints are then compared using an algorithm that looks like O(n^2).

I doubt the difference between O(2^n) and O(n^2) would make a huge impact anyway: the biggest bottleneck is going to be disk read and seek time, not comparing fingerprints. It's akin to running compression on a filesystem: read speed is an order of magnitude slower than the compression.

O(n^2) vs O(2^n) is a huge difference even for very small datasets (hundreds of pictures).
You have to read all the images and generate the hashes, but that's Theta(n).
Comparing one hash to every other hash is Theta(n^2).

If the hashes are small enough to all live in memory (or enough of them that you can intelligently juggle your comparisons without having to wait on the disk too much), then you'll be fine for tens of thousands of pictures.
But photographers can take thousands of pictures per shoot, hundreds of thousands in a year, and have millions of photos to dedupe.
When you're at that level, comparisons have to be 6 orders of magnitude faster than your disk read to avoid being the bottleneck. With large hard drives shitting out 60-120 MBps (we'll ignore SSDs because they can't hold that many photos, and we'll ignore RAID just because), that's not going to be the case.

Re:findimagedupes in Debian (0)

Anonymous Coward | about 3 months ago | (#46051893)

Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...

Exponential time would require quite a bit of creativity. The simplest algorithm, two nested loops and checking for equality, performs only a quadratic number of comparisons. A simple sorting variant would require O(n log n) comparisons to sort the data, then O(n) to filter out the equal ones. Hash functions can be used to reduce the constant factor.

Re:findimagedupes in Debian (1)

safetyinnumbers (1770570) | about 3 months ago | (#46051949)

I've used findimagedupes. IIRC, it rescales each image to a standard size (64x64 or something), then filters and normalizes it down to a 1-bit-depth image.

It then builds a database of these 'hashes'/'signatures' and can output a list of files that have a threshold of bits in common.

That's how it can ignore small changes: it loses most detail and then can ignore a threshold of differences.

It would fail if an image was cropped or rotated, for instance. It could handle picture orientation if it were modified to store four versions of the signature, I guess.

It won't actually remove images itself (I wrote a script to read its output and delete listed images matching a specific path).

I needed it because Dropbox was 'fixing' orientation when it uploaded images and I wanted to clear out the ones I'd backed up directly from the camera. (I usually delete duplicate images based on hash.)

Re:write it yourself (1)

Anonymous Coward | about 3 months ago | (#46051531)

Perl to the rescue.

$ sudo cpan App::dupfind
$ dupfind [ --options ] --dir ./path/to/search/

Or you can try your luck with something designed for finding similar image files:
http://www.jhnc.org/findimagedupes/

Re:write it yourself (0)

Anonymous Coward | about 3 months ago | (#46051695)

exactly what you mean by deduplication is kind of vague

And yet you pretty much know what he means.

but whatever you decide on, it could probably be done in a hundred lines of perl (using CPAN libraries of course).

Re:write it yourself (0)

Anonymous Coward | about 3 months ago | (#46051815)

You must get paid by the hour if you think it'll take a hundred lines.

Hashes should be relatively easy (0)

Anonymous Coward | about 3 months ago | (#46051323)

On the Mac I use a program called Gemini. It can root out any duplicate files between multiple sources and give you options on which ones to keep/delete (i.e. manual, oldest, etc.).

ZFS filesystem with dedup (0)

Anonymous Coward | about 3 months ago | (#46051327)

Id have put them all on FreeBSd ZFS filesystems and enabled dedup........ whalah job completed ... :P

Re:ZFS filesystem with dedup (-1)

Anonymous Coward | about 3 months ago | (#46051517)

Id have put them all on FreeBSd ZFS filesystems and enabled dedup........ whalah job completed ... :P

It is clear this guy did not take French in high school.
WTF is "whalah"? Did you intend to say "voila"?
Next you're going to tell me how you keep "loosing" your car keys, or how for all "intensive purposes" it is the same?

Re:ZFS filesystem with dedup (1)

mlts (1038732) | about 3 months ago | (#46051555)

One can use NTFS and turn on deduplication, then manually fire off the background "optimization" task. It isn't a "presto!", but after a good long while, it will find and merge duplicate files, or duplicate blocks of different files.

Caveat: This is only Windows 8 and newer, or Windows Server 2012 and newer.

Re:ZFS filesystem with dedup (1)

DiSKiLLeR (17651) | about 3 months ago | (#46051875)

whalah is not a word.... seriously. wtf people. It's "voilà".

As for ZFS, sure, I recommend ZFS. But I'm not sure how I feel about ZFS's dedupe. Besides, the multiple files are still there even if they no longer take up extra space.

You'd want a script that finds dupes by hash, but that will only detect images that are identical copies, not 'similar' ones, say where an image has been cropped or retouched or resized. A program that can find image dupes even with changes, like tineye.com, would be ideal. Does anything like that exist?

fdupes -rd (5, Informative)

Anonymous Coward | about 3 months ago | (#46051337)

I've had the same problem, since I stupidly try to make the world a better place by renaming files or putting them in sub-directories.

fdupes will do a bit-wise comparison. -r = recurse. -d = delete.

fdupes would be the fastest way.

Re:fdupes -rd (1)

Xolotl (675282) | about 3 months ago | (#46052415)

fdupes is excellent and I second that (please mod the parent up!)

The only drawback to fdupes is that the files must be identical, so two otherwise-identical images where one has some additional metadata (e.g. inside the EXIF) won't be deduplicated.

You don't need software for this (0)

Anonymous Coward | about 3 months ago | (#46051339)

Just script something that grabs a list of image files from the filesystem, runs an MD5 hash on all of them, locates any duplicate MD5s, then outputs a list of files to delete later. Now if you're talking about a somewhat more sophisticated duplicate detection (such as, say, detecting images that are the same picture but are not in the same size or format) you're getting into the "someone will pay you money for this" territory.

Re:You don't need software for this (1)

Anonymous Coward | about 3 months ago | (#46051417)

This is what I'd do, but I doubt the submitter is a Bourne shell wizard.

Shell scripts ARE still software by the way.

Re:You don't need software for this (3, Informative)

unrtst (777550) | about 3 months ago | (#46051761)

Adjust as needed:

find ./ -type f -iname '*.jpg' -exec md5sum {} \; > image_md5.txt
cat image_md5.txt | cut -d" " -f1 | sort | uniq -d | while read md5; do grep $md5 image_md5.txt; done

...though I think something more sophisticated than an md5sum would be wise (exif data could have been changed but nothing else, and you'd miss that dupe).

Re:You don't need software for this (0)

Anonymous Coward | about 3 months ago | (#46051435)

Yes, I had the same issue and did this. Then you just manually compare the output of the list in a web browser. Cheap and easy. Filenames would also be a big clue if they are dups, so maybe sort by MD5 hash, filename, then full path filename.

Perhaps you might ask why first (1)

Anonymous Coward | about 3 months ago | (#46051357)

It is important here to know why you want to remove duplicate images. Is it just so you can have one large photo album without seeing the same picture twice? If that is true then you could sync all images on all machines onto one large drive, sort the files by size and manually delete the duplicates as they would all bunch together.

If you are trying to save disk space, then using a file system like ZFS can automatically remove duplicate data and add compression.

Also consider that if you do not know where your duplicate files are, then any duplicates in existence are, effectively, acting as backups for your disorganized collection. Erasing duplicates before you find a way to cleanly back up your data may be a mistake in the long run.

ZFS dedup (0)

brambus (3457531) | about 3 months ago | (#46051363)

# zfs set dedup=on mypool/photos

Make sure you have enough RAM though (1GB of RAM per TB of unique data) and/or an SSD for L2ARC to make sure it doesn't grind to a halt.

Re:ZFS dedup (3, Informative)

Anonymous Coward | about 3 months ago | (#46051519)

Have you read the zfs documentation? Setting zfs dedup does not remove duplicate files (per OP request, since there are eleven different file systems), but removes redundant storage for files which are duplicates. In other words, if you have the exact same picture in three different folders/subdirectories on the same file system, zfs will only allocate storage for one copy of the data, and point the three file entries to that one copy. Similar to how hard links work in ext2 and friends.

Re:ZFS dedup (0)

Anonymous Coward | about 3 months ago | (#46051527)

# zfs set dedup=on mypool/photos

Make sure you have enough RAM though (1GB of RAM per TB of unique data) and/or an SSD for L2ARC to make sure it doesn't grind to a halt.

I'm really interested in building a 7 or 8TB RAIDZ2 over the next year or so, and I really, really want to use dedup because it's cool and stuff. But the RAM requirements are painful - the data I intend to store will make heavy use of dedup and could easily exceed 20TB. I've looked around and never saw a mention of using an SSD to supplement this - do you have any good links on that?

Re:ZFS dedup (0)

Anonymous Coward | about 3 months ago | (#46052265)

For 7-8TB, you might be better off just buying enough RAM. 16GB is ~ $200. Depending on your workload, that might be better overall than using an SSD for L2ARC.

Re:ZFS dedup (0)

Anonymous Coward | about 3 months ago | (#46051613)

That requires the files all be exactly identical, bit-for-bit, and thus also requires looking at every single bit of every file. If you have different versions with different changes (one rotated, another resized), it wouldn't find them.

If you just extract the EXIF data, you can process any file in milliseconds no matter how big, and then create a database to easily tell you which files were copies of the same one.

dom

there are many duplicate file finders (0)

Anonymous Coward | about 3 months ago | (#46051397)

There are many duplicate file finders, if the files are binary identical. (Search on Google for "find duplicate files" or "delete duplicate files".) However, if the files have been modified in any way, this becomes much more difficult, because similar files for music or photos have a degree of tolerance for errors and variation. Signed: the author of two of those programs.

Fuzzy Hashing (2)

Oceanplexian (807998) | about 3 months ago | (#46051399)

I would try running all the files through ssdeep [sourceforge.net].

You could script it to find a certain % match that you're satisfied with. Only catch to this is that it could be a very time-intensive process to scan a huge number of files. Exif might be a faster option which could be cobbled together in Perl pretty quickly, but that wouldn't catch dupes that had their exif stripped or have slight differences due to post-processing.
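A possible invocation (a sketch; the flags are from memory, so double-check them against your ssdeep version's manpage):

# -r: recurse; -p: compare every input file against every other and print
# matching pairs with a similarity score you can then filter on
ssdeep -r -p /srv/photos > fuzzy-matches.txt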

Try "Uniquefiler" under WINE (0)

Anonymous Coward | about 3 months ago | (#46051405)

It has been reported to work under WINE, but your mileage may vary.
Sorry - don't have any links.

fslint (1)

Anonymous Coward | about 3 months ago | (#46051411)

I did just this: I copied all of the pics from the various devices to a Linux fileshare, and then ran http://www.pixelbeat.org/fslint/ [pixelbeat.org]. Nice software, did exactly what I wanted.

Few options (0)

Anonymous Coward | about 3 months ago | (#46051413)

You could use Unison to merge them two at a time.

Another option is something like FSLint, which can detect duplicates.

I think I wrote one of these. (1)

paradxum (67051) | about 3 months ago | (#46051419)

I'm pretty sure I wrote something like this in perl/bash in like 20 minutes.
1 - do an md5sum of each file and toss it in a file
2 - sort
3 - perl (or your language of choice) program, basically:
sum = "a"
newsum = next line
if newsum == sum delete file
else sum = newsum
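A runnable version of that sketch might look like this (it only prints rm commands, so you can review before deleting; /srv/photos is a placeholder path):

find /srv/photos -type f -exec md5sum {} + |
    sort |
    awk '{ hash = substr($0, 1, 32); path = substr($0, 35) }
         hash == prev { print "rm \"" path "\"" }
         { prev = hash }'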

Re:I think I wrote one of these. (3, Insightful)

Cummy (2900029) | about 3 months ago | (#46051765)

Why do people on this site believe that everyone who is interested in tech is a programmer? This "just write it" is foolishness of the highest order. For many of us non-programmers, "just write it" is like telling someone living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow". If that seems like a ridiculous ask, then so is asking a person without the skill to write a script for that. If it can be done in 20 minutes, use that 20 minutes to help someone by writing the program and uploading it to a repo. All the 20-second tutorials in the world will not get someone to write a program if they just don't have the skill set.
This is part of the reason Windows is successful: think of a problem, and there is likely a program out there that solves it already; and if there isn't one, someone will soon write one (Apple users just go and buy one). Linux will not get out of single-digit adoption until people with the skills write and edit programs for the non-programmers like myself, because when stuff needs to get done fast, Windows will have the program (and yes, it is easier to clean out the malware and fight the popups than it is to write the program).

Re:I think I wrote one of these. (2, Informative)

VortexCortex (1117377) | about 3 months ago | (#46052109)

This"just write it" is foolishness of the highest order. For many of us non-programers "just write it" is like telling some one living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow".

Computer literacy used to involve typing a terminal command. All the PC folks in the 80's and 90's did it. I can't be fucked to care if folks are too stupid to learn how to use their computers. If you can't "write it yourself" in this instance, which amounts to running an operation across a set of files, then sorting the result, then you do not know how to use a computer. You know how to use some applications and input devices. It's a big difference.

This is part of the reason Windows is successful: think of a problem, there is likely program out there that solves it already, and if there isn't one someone will soon write one

Which is why it's a nightmare to administer Windows. MS had to create a fucking scripting terminal, "PowerShell", because they ditched DOS and didn't expose OS features to a terminal... Now go press the Towel key to open Windows 8's start screen. Start typing... AT A NEUTERED TERMINAL... ugh. Sometimes it's better not to have to wait for someone to create something for you, especially when it's something very easy to do. You would FIRE a secretary that could not sort a set of physical files by customer ID and remove duplicates, or add up totals with a calculator, etc. Your standard for a computer "operator" is so low it's pitiable.

If you paid attention to the thread, you'd have noticed that nothing you said about Windows is exclusive to windows. Indeed, a Google search for any OS would have turned up solutions for it. Some would be a few lines of BASH or Perl, Powershell, BATCH scripts, etc. Some would be 'free' programs, some of those would have adware, some would have malware. At least the ones in the FLOSS repositories wouldn't.

The OS exposes your computer's features to you. If you do not know how to write a simple set of instructions for it to follow, then you do not know how to use a computer.

Re:I think I wrote one of these. (1)

dowens81625 (2500160) | about 3 months ago | (#46052111)

Why do people on this site believe that everyone who is interested in tech is a programmer? This "just write it" is foolishness of the highest order. For many of us non-programmers, "just write it" is like telling someone living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow". If that seems like a ridiculous ask, then so is asking a person without the skill to write a script for that. If it can be done in 20 minutes, use that 20 minutes to help someone by writing the program and uploading it to a repo. All the 20-second tutorials in the world will not get someone to write a program if they just don't have the skill set.
This is part of the reason Windows is successful: think of a problem, and there is likely a program out there that solves it already; and if there isn't one, someone will soon write one (Apple users just go and buy one). Linux will not get out of single-digit adoption until people with the skills write and edit programs for the non-programmers like myself, because when stuff needs to get done fast, Windows will have the program (and yes, it is easier to clean out the malware and fight the popups than it is to write the program).

To learn is to know one's own value.

To expect something is to not care about yourself or your worth.

Nothing worth doing is ever easy.

Re: I think I wrote one of these. (0)

Anonymous Coward | about 3 months ago | (#46052143)

We think that because we assume that only someone exactly like us would presume to ask for free help. An idiot Windows user would probably pick up the phone and pay one of us to do it for him. At least a cheapskate would download one of our free virus-embedded tools. THIS FUCKER, however, is too lazy even for Google.
I've got an idea: why don't you go over and personally sort out his thumbnailed porn collection?

Don't reinvent the wheel: fdupes, md5deep, gqview (2)

Jody Bruchon (3404363) | about 3 months ago | (#46051443)

fdupes will work and is faster than writing a homemade script for the job. The big problem is "across multiple machines" which might require use of, say, sshfs to bring all the machines' data remotely onto one temporarily for duplicate scanning. fdupes checks sizes first, and only then starts trying to hash anything, so obvious non-duplicates don't get hashed at all. Significant time savings. Across multiple machines, another option is using md5deep to build recursive hash lists.

The only tool so far that I've used for image duplicate finding that checks CONTENT rather than doing bitwise 1:1 duplicate checking is GQview on Linux. It works fairly well; though it's a bit dated by now, it's still a good viewer program. Add -D_FILE_OFFSET_BITS=64 to the CFLAGS if you compile it yourself on a 32-bit machine today, though.
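The md5deep route across machines could be as simple as this sketch (host1.md5 and the paths are placeholders; run the hashing line on each machine, then pull the lists together anywhere):

md5deep -r /home/user/Pictures > host1.md5      # repeat on every machine
cat host*.md5 | sort | uniq -w32 -D > cross-machine-dupes.txt   # hashes seen more than once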

Re:Don't reinvent the wheel: fdupes, md5deep, gqvi (5, Informative)

rwa2 (4391) | about 3 months ago | (#46051655)

Yeah, this Ask Slashdot should really be about teaching people how to search for packages in aptitude or whatever your package manager is...
Here are some others:

findimagedupes
Finds visually similar or duplicate images
findimagedupes is a commandline utility which performs a rough "visual diff" to
two images. This allows you to compare two images or a whole tree of images and
determine if any are similar or identical. On common image types,
findimagedupes seems to be around 98% accurate.
Homepage: http://www.jhnc.org/findimaged... [jhnc.org]

fslint :

kleansweep :
File cleaner for KDE
KleanSweep allows you to reclaim disk space by finding unneeded files. It can
search for files based on several criteria; you can look for:
* empty files
* empty directories
* backup files
* broken symbolic links
* broken executables (executables with missing libraries)
* dead menu entries (.desktop files pointing to non-existing executables)
* duplicated files ...
Homepage: http://linux.bydg.org/~yogin/ [bydg.org]

komparator :
directories comparator for KDE
Komparator is an application that searches and synchronizes two directories. It
discovers duplicate, newer or missing files and empty folders. It works on
local and network or kioslave protocol folders.
Homepage: http://komparator.sourceforge.... [sourceforge.net]

backuppc : (just in case this was related to your intended use case for some reason)
high-performance, enterprise-grade system for backing up PCs
BackupPC is disk based and not tape based. This particularity allows features
not found in any other backup solution:
* Clever pooling scheme minimizes disk storage and disk I/O. Identical files
    across multiple backups of the same or different PC are stored only once
    resulting in substantial savings in disk storage and disk writes. Also known
    as "data deduplication".

I bet if you throw Picasa at your combined images directory, it might have some kind of "similar image" detection too, particularly since it sorts everything by EXIF timestamp.

That said, I've never had to use any of this stuff, because my habit was to rename my camera image dumps to a timestamped directory (e.g. 20140123_DCIM ) to begin with, and upload it to its final resting place on my main file server immediately, so I know all other copies I encounter on other household machines are redundant.

Re:Don't reinvent the wheel: fdupes, md5deep, gqvi (1)

Anonymous Coward | about 3 months ago | (#46051971)

Be careful with fdupes. It defaults to including zero length files and will hard link those together too, which is generally a really bad idea.

Re:Don't reinvent the wheel: fdupes, md5deep, gqvi (1)

Impy the Impiuos Imp (442658) | about 3 months ago | (#46052267)

Reminds me of Windows link repairer, automatically searching for the nearest file size, which was almost always the wrong thing to do, then suggesting grampa accept the new pointer.

fdupes (2)

ender8282 (1233032) | about 3 months ago | (#46051499)

Under *buntu
sudo apt-get install fdupes
man fdupes:
fdupes - finds duplicate files in a given set of directories

Seriously? (0)

DigitAl56K (805623) | about 3 months ago | (#46051523)

Are we seriously discussing how to dedupe files based on a hash here?

News for nerds, stuff that matters, questions that belong in a forum where people answer things you couldn't be bothered to Google.

Re:Seriously? (3, Interesting)

postbigbang (761081) | about 3 months ago | (#46051557)

Yeah. Thanks. It's a simple question. So far, I've seen scripting suggestions, which might be useful. I'm a nerd, but I don't want to do much coding because I'm really rusty at it. Instead, I'm amazed that no one who runs into this problem has built an app that does this. That's all I'm looking for: consolidation.

Re:Seriously? (5, Informative)

zakkie (170306) | about 3 months ago | (#46051753)

See my earlier contribution: geeqie. It will even scan for image similarity, not just do rudimentary hashing. Someone else mentioned gqview & that it was out of date - geeqie is what gqview became.

Re:Seriously? (0)

Anonymous Coward | about 3 months ago | (#46051587)

I for one, would welcome a website that combines moderation features of stackoverflow and slashdot.

Typical approach doesn't always work (1)

Anonymous Coward | about 3 months ago | (#46051675)

You can google forever and not get the correct answer.

This is not a trivial problem, and in my case, I had to test multiple ways to do this before finding the correct tools. Also, most approaches work fine with 100 photos, but the problem becomes different if you are talking about 80k photos.

And if it were 100 photos, very likely he would do it by hand and wouldn't need a tool.

I checked multiple places including slashdot before almost writing my own tools in perl.

AND, the devil is in the details (1)

Anonymous Coward | about 3 months ago | (#46051709)

AND, most people come up with the trivial answer for deduping files. You DON'T want to MD5 or do anything purely hash-based for 80k photos. That doesn't work. Photos are a particular type of file with particular characteristics, which can reduce your workload a lot.

The trivial approach sucks in this case, and carefully picking the correct tools (in my case, classifying photos with an EXIF / date approach) before deduplicating can convert an impractical solution into a working one.

Re:Seriously? (2)

Cley Faye (1123605) | about 3 months ago | (#46052073)

When you're talking about duplicate content, you can't limit yourself to "just hashes".
In this case, with pictures, just opening one and saving it again might produce a different hash, simply from recompression or a change of file format. How do all these "just check the hashes" solutions work for that?
Finding duplicate images is not that easy.

Consider git-annex (1)

dondelelcaro (81997) | about 3 months ago | (#46051561)

In addition to the other methods (ZFS, fdupes, etc), I personally use git-annex.

Git annex can even run on android, so I keep at least two copies of my photos spread throughout all of my computers and removable devices.

DigicaMerge (1)

jalet (36114) | about 3 months ago | (#46051581)

See http://www.librelogiciel.com/s... [librelogiciel.com]

I haven't modified or used it in years (I don't own a digital camera anymore...), so I don't know if it still works with up-to-date libraries, but its "--nodupes" option does what you want, and its numerous other command-line options (http://www.librelogiciel.com/software/DigicaMerge/commandline) help you solve the main problems of managing directories full of pictures.

It's Free Software, licensed under the GNU GPL of the Free Software Foundation.

Hoping this helps

If not, you can (0)

Anonymous Coward | about 3 months ago | (#46051585)

You can even compile your home-grown photo-deduplicator into your custom kernel if you want to.

I would... (0)

Anonymous Coward | about 3 months ago | (#46051593)

Get your co-workers @ the nsa to do their own work

Use DIM and a deduplicator (0)

Anonymous Coward | about 3 months ago | (#46051607)

Hi. I've been through the same problem. In my case, I had to deduplicate 80k photos. The reason why most suggestions in this thread won't work is because generic solutions don't take advantage of the extra information photos contain. In my case, around 90% of the photos had good EXIF information, but in itself, that is not enough.

I used DIM to classify photos into year / month / day structure, and later I used a photo deduplicator on each day's sub folder.

Additionally, there was extra manual work for those photos not resolved in this way, but it was definitely way better than comparing 80k against 80k.

FSLint (0)

Anonymous Coward | about 3 months ago | (#46051623)

If you're not big into scripting there's a program on the Ubuntu Software Center called FSLint that does exactly what you're looking for. You can have it match on filenames, filesize, hashes, etc. It's just a generic file deduplicator, not optimized for images or anything.

What's the problem? (0)

Anonymous Coward | about 3 months ago | (#46051649)

What's the problem? Just cp -u the $file to /newhd/by_md5/$(md5sum < "$file" | cut -d' ' -f1).${file##*.}
( ...and store the original file name in EXIF, create another hardlink to the md5 filename, or whatever way you prefer to locate your stuff )
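Spelled out as a loop (same idea; bash assumed, /mnt/backups and /newhd/by_md5 are placeholders, and cp -n keeps the first copy it sees):

mkdir -p /newhd/by_md5
find /mnt/backups -type f -iname '*.jpg' -print0 |
while IFS= read -r -d '' f; do
    h=$(md5sum < "$f" | cut -d' ' -f1)
    cp -n "$f" "/newhd/by_md5/$h.${f##*.}"      # identical content maps to the same name
done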

git-annex (0)

Anonymous Coward | about 3 months ago | (#46051653)

Create a git-annex repository on each file system and set them up with at least one common remote. Then add all of the photos on each file system into the git-annex repository (git annex add *.jpg), sync it with the common remote (git annex sync yourremote), and move all content to the remote (git annex move -t yourremote .).

Quick shell script using exiftool (4, Interesting)

Khopesh (112447) | about 3 months ago | (#46051689)

This will help find exact matches by exif data. It will not find near-matches unless they have the same exif data. If you want that, good luck. Geeqie [sourceforge.net] has a find-similar command, but it's only so good (image search is hard!). Apparently there's also a findimagedupes tool available, see comments above (I wrote this before seeing that and had assumed apt-cache search had already been exhausted).

I would write a script that runs exiftool on each file you want to test, removes the items that refer to timestamp, file name, path, etc., and makes an md5.

Something like this exif_hash.sh (sorry, slashdot eats whitespace so this is not indented):

#!/bin/sh
for image in "$@"; do
echo "`exiftool "$image" |grep -ve 20..:..: -e 19..:..: -e File -e Directory |md5sum` $image"
done

And then run:

find [list of paths] -type f -print0 |xargs -0 ./exif_hash.sh |sort > output

If you have a really large list of images, do not run this through sort. Just pipe it into your output file and sort it later. It's possible that the sort utility can't deal with the size of the list (you can work around this by using grep '^[0-8]' output |sort >output-1 and grep -v '^[0-8]' output |sort >output-2, then cat output-1 output-2 > output.sorted or thereabouts; you may need more than two passes).

There are other things you can do to display these, e.g. awk '{print $1}' output |uniq -c |sort -n to rank them by hash.

On Debian, exiftool is part of the libimage-exiftool-perl package. If you know perl, you can write this with far more precision (I figured this would be an easier explanation for non-coders).

perl (0)

Anonymous Coward | about 3 months ago | (#46051729)

this would be a nice intermediate-level weekend perl project

checksum? (0)

Anonymous Coward | about 3 months ago | (#46051771)

script to output path, filename, checksum to somewhere maybe?

Going for the obvious unmentioned (1)

Anonymous Coward | about 3 months ago | (#46051797)

Picasa as a local instance, importing from all the other locations... just remember to check the "exclude duplicates" box

ftwin (1)

DaJoky (782414) | about 3 months ago | (#46052009)

ftwin is a command-line tool which, when built with libpuzzle, is able to generate a signature for each image and detect duplicates (including resized/slightly modified ones). Link: http://freecode.com/projects/f... [freecode.com] Disclaimer: I'm the author and don't maintain it actively :-P

http://en.wikipedia.org/wiki/List_of_duplicate_fil (2, Informative)

Anonymous Coward | about 3 months ago | (#46052075)

http://en.wikipedia.org/wiki/List_of_duplicate_file_finders

Perhaps a better way exists. (1)

Anonymous Coward | about 3 months ago | (#46052107)

This seems like rather a lot of work just to automate deduping of your porn collection. It might be more enjoyable to do it by hand anyway.

Duplicated in RSS (0)

Anonymous Coward | about 3 months ago | (#46052157)

I love how this article was listed twice in the RSS feed. Kudos!

There's a command line tool for that. (0)

Anonymous Coward | about 3 months ago | (#46052241)

findimagedupes

fslint? (1)

PaperGeek (1045780) | about 3 months ago | (#46052371)

If you are going strictly based on hashing (e.g. not trying to match images that may have different EXIF data embedded, thus making the hashes different), fslint works quite well. It will chug through a filesystem, basically wrapping Python commands to compare by hash and file size (using both md5 and sha256), and will give you a report of wasted space. You can then save a parseable plain-text file. It can take a while - it's bandwidth-bound, as you might expect - I just did this for a 2 TB network share and it took over 12 hours. But it got the job done, and all I had to do was sudo apt-get install fslint.

Hope it doesn't access your backup drive (1)

Anonymous Coward | about 3 months ago | (#46052395)

Hope it doesn't access your backup drive and wipe out your backups as "duplicates".

Follow the FBI's lead (1)

Mike Buddha (10734) | about 3 months ago | (#46052403)

They use a database of hashes of kiddie porn to identify offending material without forcing anyone to look at the stuff. Seems like it would be easy to use Perl to crawl your filesystem and identify dupes.
