CCleaner Community Forums
Guest Keatah

Finishing the Duplicate File Finder


Guest Keatah

Any true duplicate finder will need to look inside the files at the actual contents. So there needs to be an option for this. A byte-by-byte comparison or CRC32, including a warning that it will take extra time to complete this type of scan.
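For illustration, the two options named in this post can be sketched in Python with only the standard library. This is a minimal sketch and not how CCleaner works; the function names `same_bytes` and `crc32_of` and the 1 MiB chunk size are made up for this example.

```python
import filecmp
import zlib


def same_bytes(path_a, path_b):
    """Byte-by-byte comparison; shallow=False forces the contents to be read."""
    return filecmp.cmp(path_a, path_b, shallow=False)


def crc32_of(path, chunk_size=1 << 20):
    """CRC32 of a file, read in chunks so large files do not fill memory."""
    crc = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc
```

Either way every byte of every candidate file has to be read from disk, which is where the extra scan time comes from.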


Any true duplicate finder will need to look inside the files at the actual contents. So there needs to be an option for this. A byte-by-byte comparison or CRC32, including a warning that it will take extra time to complete this type of scan.

 

CRC32 on a large enough system will find some false positive results. Better would be MD5 duplicate matches, though you're right - a disclaimer that this will require some *time* to complete.


CRC32 on a large enough system will find some false positive results. Better would be MD5 duplicate matches, though you're right - a disclaimer that this will require some *time* to complete.

Unlikely

I guess a given file with a 32-bit checksum would need to be compared with "2 raised to the power of 32" = 4,294,967,296 other files before you would expect, on average, 1 accidental matching result.

That is a tremendous number, much bigger than the 74,602 files present in Windows Ultimate.

Just searching C:\ for duplicated names and sizes finds fewer than 36 duplicate pairs, i.e. 99.9% of the total population is excluded by name and size,

so a 32 bit checksum is probably good for a total population of 4,294,967,296,000 files
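The arithmetic behind both sides of this argument can be checked with a few lines of Python. This is a rough sketch that assumes CRC32 values behave like uniformly random 32-bit numbers and that the comparisons are independent; `p_false_match` is a name invented for this example, and the figures 74,602 and 36 are the ones quoted in the post.

```python
import math

CRC32_VALUES = 2 ** 32  # number of distinct 32-bit checksums


def p_false_match(pairs, values=CRC32_VALUES):
    """Chance that at least one of `pairs` comparisons between files with
    different contents happens to give equal checksums."""
    return -math.expm1(pairs * math.log1p(-1 / values))


files = 74_602                        # file count quoted in the post
all_pairs = files * (files - 1) // 2  # every file against every other file

print(p_false_match(36))         # only the name-and-size candidates
print(p_false_match(all_pairs))  # checksum alone, no pre-filter
```

Under those assumptions the first figure is on the order of 1 in 100 million, while the second comes out a little under one half. So the risk depends heavily on whether the checksum is the only test or just the final check on a short list of candidates.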


I completely agree with Keatah and 4NTFan.

 

As a long time fan of CCleaner I was excited to see this new feature added, but in its current implementation it's beyond useless. To have an even half way proper duplicate file finder, the mechanism for identifying such files MUST be some sort of hashing algorithm *as a minimum*.

 

@Alan_B

Regarding CRC32, this is simply not suitable. It's not an issue of mathematics, it's an issue of maliciousness (and fundamentally design). You need to use a proper hashing algorithm, such as MDx or SHA-x.

 

I've seen on numerous occasions, through either malicious 'cleverness' or simply quirks of design, DISTINCT files that share the same CRC32 but clearly not the same MD5/SHA-1. They are different files. It's just these sorts of suspicious files you might want to verify. Not everything is purely hypothetical….
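As an illustration of how easily distinct data can share a CRC32, here is a small Python sketch that simply searches for such a pair by brute force. The function name `find_crc32_collision` is made up for this example, and the test messages are arbitrary 32-byte values derived from a counter; nothing here is specific to any real file. SHA-256 is used as the stronger hash for the final check.

```python
import hashlib
import itertools
import zlib


def find_crc32_collision():
    """Search pseudo-random 32-byte messages until two share a CRC32."""
    seen = {}
    for i in itertools.count():
        message = hashlib.sha256(str(i).encode()).digest()
        crc = zlib.crc32(message)
        if crc in seen and seen[crc] != message:
            return seen[crc], message
        seen[crc] = message


first, second = find_crc32_collision()
print(first != second)                          # are the contents different?
print(zlib.crc32(first) == zlib.crc32(second))  # do the CRC32 values match?
print(hashlib.sha256(first).digest() == hashlib.sha256(second).digest())
```

With a 32-bit checksum a matching pair usually turns up after roughly a hundred thousand messages, which takes a fraction of a second on ordinary hardware. The same search against a 128-bit or larger digest is not feasible this way.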

 

I'm sorry if I'm sounding overly critical/negative. I think having a (reliable) duplicate file finder built right into CCleaner is an excellent thing, but at the moment it just isn't. Not even close.


@Alan_B

Regarding CRC32, this is simply not suitable. It's not an issue of mathematics, it's an issue of maliciousness (and fundamentally design). You need to use a proper hashing algorithm, such as MDx or SHA-x.

 

I've seen on numerous occasions, through either malicious 'cleverness' or simply quirks of design, DISTINCT files that share the same CRC32 but clearly not the same MD5/SHA-1. They are different files. It's just these sorts of suspicious files you might want to verify. Not everything is purely hypothetical….

 

I'm sorry if I'm sounding overly critical/negative. I think having a (reliable) duplicate file finder built right into CCleaner is an excellent thing, but at the moment it just isn't. Not even close.

I totally disagree with your conclusions as applied to this particular application.

 

I totally disagree with recommending MD5 for protection against malicious 'cleverness', because some years ago it was being cracked, see for example.

http://crypto.stacke...md5-hash-value.

http://www.coresecur...om/blogs/?p=478

 

If you really want to avoid malicious 'cleverness' you need SHA-256 or better,

and it would be infinitely preferable to validate a download BEFORE it ever gets moved into your system for use before CCleaner ever gets around to accessing it.

 

It can only be sensible to look for matching file names and sizes,

and then as suggested in this topic title,

to finish off by computing checksums to ensure the contents also match.
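The approach described here (a cheap filter first, a content check last) might look something like the following Python sketch. It is only an assumption about how such a scan could be structured, not a description of CCleaner; `sha256_of`, `find_duplicates` and the `match_names` switch are hypothetical names, and SHA-256 stands in for whichever checksum is preferred.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, match_names=False):
    """Group files under `root` by size (and optionally name), then keep
    only the groups whose contents hash to the same value."""
    candidates = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            key = (size, name) if match_names else (size,)
            candidates[key].append(path)

    groups = []
    for paths in candidates.values():
        if len(paths) < 2:
            continue  # nothing shares this size (and name), so no hashing needed
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling `find_duplicates(folder, match_names=True)` follows the name-and-size suggestion in this post; leaving `match_names` off treats files as duplicates regardless of what they are called. Only files that survive the cheap filter are ever read and hashed.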

 

If the file names and sizes etc match,

a CRC32 error rate of one in 4,294,967,296 means that after deleting 4,294,967,296 "duplicate" files,

then on average one file may have been deleted even though the content was different.

That is a risk I am happy to take - I do not even have 4 billion files with different names :(

Guest Keatah

One tiny advantage the dupefinder in ccleaner has is that it can find files by names. There are rare times when you don't care about the file contents, but just the name.


It seriously is needing a way to checksum verify which files are actually duplicates for it to be a complete tool worthy of safe use.


I totally disagree with recommending MD5 for protection against malicious 'cleverness', because some years ago it was being cracked, see for example.

http://crypto.stacke...md5-hash-value.

http://www.coresecur...om/blogs/?p=478

 

If you really want to avoid malicious 'cleverness' you need SHA-256 or better,

and it would be infinitely preferable to validate a download BEFORE it ever gets moved into your system for use before CCleaner ever gets around to accessing it.

 

Congratulations on so vehemently disagreeing with something I DIDN'T recommend. Read what I wrote. I didn't recommend MD5 specifically; what I said was you need to use a PROPER HASHING ALGORITHM such as one from the MDx FAMILY or the SHA-x FAMILY, including (oh yeah) SHA-256.

 

Additionally, I didn't mention anything about downloads. Stop making assumptions.

 

It can only be sensible to look for matching file names and sizes,

and then as suggested in this topic title,

to finish off by computing checksums to ensure the contents also match.

 

Yes, OBVIOUSLY if you are comparing File_A to File_B and they have different file sizes then OBVIOUSLY they are distinct and it's wasteful doing further 'checks' to confirm this FOR THIS SPECIFIC FILE_A TO FILE_B COMPARISON. The subject of discussion was never how most efficiently to design the programmatic logic behind this process.

 

The file name matching or not matching is COMPLETELY irrelevant.

 

If the file names and sizes etc match,

a CRC32 error rate of one in 4,294,967,296 means that after deleting 4,294,967,296 "duplicate" files,

then on average one file may have been deleted even though the content was different.

That is a risk I am happy to take - I do not even have 4 billion files with different names :(

 

Again, it's not just as simple as the theoretical mathematics.

 

 

Look.. at the end of the day it's becoming obvious you have very specific ideas about what a duplicate file finder in CCleaner *should* do and how it *should* meet your specific (I dare say very limited) needs. I don't know whether you are envisioning this as something to find duplicate *.dll libraries or something, but that's certainly not JUST what I would use it for. Nor does it appear that anyone else who has commented here would use it in such a limited way.


And off we go. . .

 

I totally disagree with your conclusions as applied to this particular application. . . .

 

Congratulations on so vehemently disagreeing with something I DIDN'T recommend. Read what I wrote. . . .

 

There is nothing better than a good old fashioned slapfest between anonymous combatants in writing.

At least when I get to watch from the sidelines. :P

 

Edit: Fwiw, I use Nirsoft's hasher app to determine if the files found by CCleaner are in fact identical.

It's quite fast, supports drag & drop, ignores file names. It's here.

http://www.nirsoft.net/utils/hash_my_files.html

 

I actually prefer that, so that the "finder" points'em out quickly, and I can decide which to spend time comparing.


The main purpose of CCleaner is to remove Junk files that are not needed.

 

The Duplicate File Finder capability has only recently been added to CCleaner,

and it is reasonable to be concerned that a file should not be deleted just because it has the same name as something else.

 

Again, it's not just as simple as the theoretical mathematics.

Please explain why you would believe that.

 

MD5 has an output value of 128 bits, so at best it could give a distinct value to every possible 16-byte file

SHA-512 has an output value of 512 bits, so at best it could give a distinct value to every possible 64-byte file

https://en.wikipedia.org/wiki/SHA-2

I do not understand how you expect perfect distinction via "proper hashing algorithm" between files that are likely to be hundreds of times larger than 64 bytes.

 

CRC32 has a 32-bit output value, so at best it can give a distinct value to every possible 4-byte file

SHA-512 is overwhelmed by a 64 kB file just as much as CRC32 is overwhelmed by a 4 kB file.

 

The only difference I can see is in the amount of the probability of error.
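The counting argument and the "amount of the probability" can both be put into numbers. The Python sketch below assumes an idealized digest whose outputs are spread evenly; `inputs_per_digest` and `DIGEST_BITS` are names invented for this example, and it covers accidental matches only, not collisions that someone constructs on purpose.

```python
from fractions import Fraction

DIGEST_BITS = {"CRC32": 32, "MD5": 128, "SHA-256": 256, "SHA-512": 512}


def inputs_per_digest(file_bytes, digest_bits):
    """Average number of distinct files of `file_bytes` bytes that have to
    share each digest value (pigeonhole principle)."""
    return Fraction(256 ** file_bytes, 2 ** digest_bits)


for name, bits in DIGEST_BITS.items():
    chance = Fraction(1, 2 ** bits)  # accidental match for one pair of files
    print(f"{name}: 1 in 2**{bits} (about {float(chance):.1e})")
```

For example, `inputs_per_digest(5, 32)` is 256: once files are even one byte longer than the checksum, many different files must share each value. Every fixed-size digest has this property, so what separates the algorithms is how small the per-pair chance is and how hard it is to produce a match deliberately.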

 

When I want to "know for sure" whether two files are the same then I use "HashMyFiles" from Nirsoft, and I am happy to live with the 1 in millions of possibilities that the same bit restricted hash output could be produced by files with different content,

otherwise I have utilities that will do a byte by byte comparison - It is a capability I almost never use - life is too short.


let's just agree to disagree.

what works for one person will be different for someone else.

they are called Personal Computers for a reason. :)

 

but to dip my toe into the bloodied waters, I think the dup file finder functionality is at least, a waste of time, and at worst, potentially disastrous.

 

OK, DING DONG, Round 3 (someone should find an emoticon that raises a card over her bikini clad frame)


I think the dup file finder functionality is at least, a waste of time, and at worst, potentially disastrous.

 

Harsh but true!


Changed my mind.

Must agree with Keatah's first post.

If the option to compare the hashes is included, that would be better.


If the option to compare the hashes is included, that would be better.

 

It's essential to trust the results.


I have happily used CCleaner for several years and recommended it to friends. I only recently discovered the "duplicate file finder" feature, ran it and have a long list...not surprisingly. However, how do I know I can safely remove the duplicates? Which one do I delete? Do I delete both? I would sure appreciate words of wisdom here.


Whilst I usually frown upon threadcomancy, I'll field this one.

Only delete one of them and only only only only only onnnnnnnnnnnnnnnly the one you KNOW should be deleted. EXAMPLE: Two copies of the same document; one was saved on Sunday to your desktop, the other on Friday to your "My Pictures" folder.

Edit:

As with my registry rule, only use this feature to remove what you don't actually need. You are responsible for what you allow to be removed; that's why there's an analysis first.

[Opinion]I haven't used this feature since the inception of this thread. I, personally, am against this feature's addition to the program; my objections have been published throughout the past.[/opinion]


Hi, I'm pretty new to this forum, although I have popped in from time to time. I was just reading this thread as I was looking for an answer to a problem I had today with a newly installed v5.0 CCleaner....I had to install again as there was a problem with my pc losing part of a file...When I uninstalled, I thought I'd load the new version. I didn't fancy trusting the Beta, so this seemed great...I have used CCleaner for quite some time now + never had a problem....Until today. It was doing a good job with other tasks, but then I thought I'd give the Dupe File Finder a go for all the pics I have on my HD. It started up ok, filled the page but when it got to the bottom, it just froze. The interface greyed out + I could do no more with it. I couldn't cancel, I couldn't click to minimize or shut down, I couldn't even right click on the program on the taskbar to close...I ended up having to do the unthinkable + reboot by going straight to the start button + overriding the task manager to not wait for the program to finish uninstalling... :rolleyes: Not quite sure how it got to that, but It turned out ok in the end....Phew!!!  Just wondering if this has happened to anyone else...

 

I had a really bad experience with a dupe file prog about 7 yrs ago. The program found tons of files + it was letting me choose after every file what to do with it...Things were going well, until it started to speed up + stopped asking what to do with files + took over. The next thing I knew was a banner on screen telling me to update my drivers urgently. My OS was a copy, so I didn't have the backup. The pc shut down + wouldn't restart....It had erased ALL of my drivers....One dead pc... :P...My ex had bought me the pc + I knew nothing at all, I got myself into some really nasty messes, but about 75% of them, I managed to work it out by trial + error...Things like managing to turn my screen upsidedown + not knowing how I did it....Enlarging my screen to the point that I couldn't see anything to find my way out as even the start button, (when I eventually found it) was as big as the desert + as many miles away...lol

 

Everything I know now (which compared to you guys is nothing) I had to teach myself...I enjoy learning + when I do get in a scrape, ( not so much these days) I feel I learned something new.... :wub:


I'd recommend posting about it in the Bug Reporting area by starting your own unique topic:

http://forum.piriform.com/index.php?showforum=8

Thanx for that....It seems like everybody else, (well a good few of them) have had problems ranging right across the board with this one....The night I posted in here, I decided to just uninstall v5 altogether + downloaded another copy of v4.17....That was the last one I had + was happy with that + a few had said that v4.19 was a bit buggy....I shall just wait till Piriform give V5 a rethink...Until then, it's bliss to look at the old v4..... :D

