
Duplicate file finder/remover using Perl and SHA1

When you use a computing device (a laptop, PC, or tablet) for personal use, after a few years you will realise that your disk is full and much of the space is occupied by duplicate files (the same file stored in different locations).

For example, you might have a favourite music file in your "My Favourite" folder as well as in the "Album" folder. Finding such duplicates manually is a huge task, especially when the file names are different.

There are lots of free utilities that do this automatically, but if you are a programmer, you will always prefer to do it on your own.

Here are the steps we are going to follow. This is purely on a Linux (Ubuntu) system; for Windows you might need to change the paths to match its conventions.
  • Get the SHA1 of every file, recursively, in a given directory
  • Compare the SHA1 values to find files with identical content
  • Remove the duplicate files
Getting SHA1 of a file

Using the CPAN module Digest::SHA1 (the core Digest::SHA module provides the same sha1_hex function) we can get the SHA1 of a file's data as follows:

use Digest::SHA1 'sha1_hex';
use File::Slurp;

my $file  = $ARGV[0];                            # path of the file to hash (first command-line argument)
my $fdata = read_file($file, binmode => ':raw'); # read the raw bytes
my $hash  = sha1_hex($fdata);

In the above code I used the read_file function provided by the File::Slurp module; the binmode => ':raw' option makes sure binary files such as music and images are read byte-for-byte.
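Note that read_file loads the whole file into memory, which is wasteful for large video or music files. As a minimal alternative sketch, Digest::SHA1 can also stream the file through its object interface (assuming $file already holds the path):

use Digest::SHA1;

open my $fh, '<:raw', $file or die "Cannot open $file: $!";
my $sha1 = Digest::SHA1->new;
$sha1->addfile($fh);          # digest the file in chunks instead of slurping it
my $hash = $sha1->hexdigest;
close $fh;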

The next step is to compute the SHA1 for all the files recursively in a directory. There are many modules available on www.cpan.org for iterating over a directory, but my favourite is the File::Find module, which works much like the Unix find command.

use File::Find;
use File::Slurp;
use Digest::SHA1 'sha1_hex';

my $dir = "./";

# Calls the process_file subroutine for every file and directory found
find({ wanted => \&process_file, no_chdir => 1 }, $dir);

sub process_file {
    my $file = $_;    # with no_chdir => 1, $_ holds the full path
    print "Taking file $file\n";

    # -f skips directories and special files
    if (-f $file) {
        my $fdata = read_file($file, binmode => ':raw');
        my $hash  = sha1_hex($fdata);
    }
}
Finding the duplicates

Our next step is to find the duplicates based on the SHA1 values computed above. I am going to use a hash ref whose keys are SHA1 values and whose values are array refs holding the matching file paths. Once we have processed all the files, we can spot the duplicates just by checking the length of each array.

use File::Find;
use File::Slurp;
use Digest::SHA1 'sha1_hex';

my $dir = "./";
my $file_list;    # SHA1 value => array ref of file paths

# Calls the process_file subroutine for every file and directory found
find({ wanted => \&process_file, no_chdir => 1 }, $dir);

sub process_file {
    my $file = $_;
    print "Taking file $file\n";

    if (-f $file) {
        my $fdata = read_file($file, binmode => ':raw');
        my $hash  = sha1_hex($fdata);

        push @{ $file_list->{$hash} }, $file;
    }
}
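After the scan, $file_list might look like this (the paths and truncated hashes below are made up purely for illustration):

use Data::Dumper;
print Dumper($file_list);
# $VAR1 = {
#     'da39a3ee5e6b...' => [
#         './Album/song.mp3',
#         './My Favourite/song.mp3'   # two entries => duplicate
#     ],
#     '2fd4e1c67a2d...' => [
#         './notes.txt'               # one entry => unique
#     ]
# };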

Removing the duplicates

Now we have the list of duplicate files. The only thing left is removing those files while keeping one copy of each. Perl has a built-in function called unlink which removes a file from its location.

unlink $file;

Now combine everything, add some print statements and options, and you get a nice utility script to remove duplicate files.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Slurp;
use Digest::SHA1 'sha1_hex';

my $dir            = shift || './';
my $count          = 0;
my $file_list      = {};    # SHA1 value => array ref of file paths
my $dup_file_count = 0;
my $removed_count  = 0;

find({ wanted => \&process_file, no_chdir => 1 }, $dir);

foreach my $sha_hash (keys %{$file_list}) {
    next unless scalar @{ $file_list->{$sha_hash} } > 1;

    # Number of duplicate files (all but the first copy)
    $dup_file_count += scalar(@{ $file_list->{$sha_hash} }) - 1;

    my $first_file = 1;
    foreach my $file (@{ $file_list->{$sha_hash} }) {
        # Keep the first copy, remove the rest
        if ($first_file) {
            $first_file = 0;
            next;
        }
        if (unlink($file) == 1) {
            print "REMOVED: $file\n";
            $removed_count++;
        }
        else {
            warn "Could not remove $file: $!\n";
        }
    }
}

print "********************************************************\n";
print "$count files scanned\n";
print "$dup_file_count duplicate files found\n";
print "$removed_count duplicate files removed\n";
print "********************************************************\n";

sub process_file {
    my $file = $_;             # with no_chdir => 1 this is the full path

    return unless -f $file;    # skip directories and special files

    my $fdata = read_file($file, binmode => ':raw');
    my $hash  = sha1_hex($fdata);

    push @{ $file_list->{$hash} }, $file;
    $count++;

    local $| = 1;              # flush so the progress line updates in place
    print "Processing file: $count\r";
}
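To run it, save the script (dedup.pl below is just an example name) and pass the directory to clean as the first argument:

$ perl dedup.pl ~/Music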


The above script removes duplicate files in a given directory based on the SHA1 of their data. Keep in mind that audio or video files downloaded from different sources may have different SHA1 values even though they sound or look the same. This script removes only byte-for-byte identical files; it has no AI to recognise the same video, audio, or image across variants. A human can tell at a glance that two images are the same, but to the computer they are different files if any property differs, for example if one copy was compressed or its resolution was changed.
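A quick way to see why: even a one-byte difference produces a completely different digest.

use Digest::SHA1 'sha1_hex';

# One extra byte changes the digest entirely
print sha1_hex("hello"),  "\n";   # aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
print sha1_hex("hello "), "\n";   # a completely different hash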
