Detecting copies/moved files

Author pmarek
Full name P.Marek
Date 2007-08-26 02:53:30 PDT
Message Hello everybody,

I'm doing a short braindump of my thoughts regarding copy/move detection.
Any feedback is welcome.

[[ I'm CCing users@ too to get a broader audience,
   but would like to keep that on dev@. ]]

1) User interface.

- There'll be "fsvs copy"/"fsvs move" commands, which (when given some
  parameter) will call "cp -a"/"mv" with the arguments, for manual copy/move.
- Likewise some "fsvs copied-from", to tell what already *has* been done.
- Then I'll do some "fsvs detect-copies", which will output some kind of list
  to STDOUT, for manual checking. This list can be re-imported and used.
- On commit itself normally no such things would happen; although there'll
  probably be some option to re-enable that.
  I'd like the user to be informed, rather than silently pushing copy
  information into the repository.

2) Algorithms for automatic detection.

- If two files have the same MD5, they'll be found. If the original doesn't
  exist anymore, it's a move - else a copy.
  Currently that's easy to do - subversion knows only "copy" and "delete", so
  internally there's no difference. But as soon as there's a distinct "rename"
  operation, there's a problem: If file A is missing, but there are B and C
  with the same data -- is one renamed, and the other copied?
  - Currently I think that if there is more than one target, it'll be done as
    a symmetric copy/delete operation.
- What about small files, which share the same MD5 because they have the same
  data, but are "different" in the meaning of "independent"? (Eg. the default
  config data in users' home directories).
  Should there be some size threshold, with files below it being skipped?
- For big files that share some data, we can use the pre-existing
  manber-hashes ... that's what they are there for.
  Should smaller files be manber-hashed with other parameters? Technically no
  problem, but how do we associate them in the first place? Originally the
  manber hashes were used to find similar files ... All files on the
  filesystem were hashed, and the list of manber-blocks was compared for
  similar/identical files. Should FSVS do simply that?
  Although there are programs that do simply that ... we could simply use
  their output.
  (Part of the problem is that the manber-hash parameters should match the
  file size ... Splitting a 20kB file in kB blocks is fine, but for a 1GB
  movie that wouldn't really help).
- Should we take a look at the file extension and/or magic bytes and/or
  detected mime-type (see file(1)), and use that information to shorten the
  to-be-compared list? Would possibly solve the problem with different
  manber-parameters, as such files should have similar sizes.
  (There'll be few, if any, 10kB .avi, and likewise no 500M .c)
- We could use the size for initial autodetection. Identical files are already
  found via their MD5; if files are changed, their filesize would probably
  change only a bit.
- Could we use the inode number for detecting moved files? Only on local
  filesystems; on NFS that won't work, and I'm not sure about NTFS and VFAT.
- For detecting copied/moved directories FSVS would see that there is a new
  directory, and check its files and subdirectories ... if there are entries
  that relate to some other directory (deleted or not) we could draw some
  conclusions. Possibly use some percentage?
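
To make the MD5 idea above concrete, here is a minimal sketch in Python (used
purely for illustration). Everything named here is an assumption and not part
of FSVS: the function names, the `known` mapping of committed path to MD5, and
the 64-byte `MIN_SIZE` threshold. A single new file whose source is gone is
reported as a move; anything else is reported as a copy, leaving the delete of
the source as a separate operation.

```python
import hashlib
import os
from collections import defaultdict

MIN_SIZE = 64  # hypothetical threshold: smaller files are never matched


def md5_of(path, bufsize=1 << 16):
    """Return the hex MD5 of a file, reading it in blocks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def detect_copies(known, new_paths, min_size=MIN_SIZE):
    """Guess copy/move relations by identical MD5.

    known:     dict of previously committed path -> hex MD5 (assumed input)
    new_paths: paths that exist now but are not in `known`
    Returns a sorted list of (kind, source, target), kind is "copy" or "move".
    """
    by_md5 = defaultdict(list)
    for path, digest in known.items():
        by_md5[digest].append(path)

    targets_of = defaultdict(list)
    for target in new_paths:
        if os.path.getsize(target) < min_size:
            continue  # too small to be meaningful, e.g. default config files
        sources = by_md5.get(md5_of(target))
        if sources:
            # Prefer a source that has vanished, as that hints at a move.
            best = sorted(sources, key=lambda p: (os.path.exists(p), p))[0]
            targets_of[best].append(target)

    result = []
    for source, targets in sorted(targets_of.items()):
        source_gone = not os.path.exists(source)
        # More than one target: treat all as copies (copy/delete semantics).
        kind = "move" if source_gone and len(targets) == 1 else "copy"
        for target in sorted(targets):
            result.append((kind, source, target))
    return result
```

The returned list is the kind of thing a "detect-copies" step could print for
manual checking before anything is recorded.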

3) About hardlinks

Furthermore, there are hardlinks ... which should be tagged as such in the
repository. But a simple "copy" changes the source and with it the history.
There was some discussion about hardlinks
(http://svn.haxx.se/dev/archive-2001-11/0498.shtml), but that was about linked
entries that propagate their changes ... I don't think that's what's needed
here; we just want to have the correct data (possibly shared in the
repository), and a way to see that these two were hardlinks.
I'd lean towards simply using some property on the file "UUID: had inode
major:minor:inode", and using that information on update/revert/checkout
etc. ... but that works only if *all* paths are processed simultaneously on
commit and update.
I don't think that such markers would be valid across different revisions ...
If I commit /bin as r4, and /sbin as r5, there might be hardlinked files in
them, but either
- we'd not see them, as two different parts are committed, or
- we'd have to send the inode information *always*, as there might be some
  files in other revisions that match.
But what about machine A committing /bin, and machine B committing /sbin? How
should such hardlinks be handled?

Maybe we'll just send some UUID for hardlinked files in a property, and if we
find two entries with the same UUID on an update we'd hardlink them.
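
A rough sketch of that last idea, in Python for illustration only. The function
names and the in-memory `ids` mapping (standing in for a per-file property) are
assumptions, not anything FSVS provides. Entries are grouped by device and
inode on "commit", each group gets one UUID, and on "update" all paths sharing
a UUID are hardlinked together. Note that it only finds links among the paths
it is given, which is exactly the limitation described above.

```python
import os
import stat
import uuid
from collections import defaultdict


def hardlink_groups(paths):
    """Group regular files that share (device, inode) within `paths`."""
    groups = defaultdict(list)
    for p in paths:
        st = os.lstat(p)
        if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
            groups[(st.st_dev, st.st_ino)].append(p)
    # A link whose partners are outside `paths` cannot be detected here.
    return [sorted(g) for g in groups.values() if len(g) > 1]


def assign_link_ids(paths):
    """Commit side: return path -> UUID string for hardlinked entries."""
    ids = {}
    for group in hardlink_groups(paths):
        link_id = str(uuid.uuid4())
        for p in group:
            ids[p] = link_id
    return ids


def relink(ids):
    """Update side: hardlink all paths that carry the same UUID."""
    by_id = defaultdict(list)
    for p, link_id in ids.items():
        by_id[link_id].append(p)
    for group in by_id.values():
        first, *rest = sorted(group)
        for p in rest:
            if os.path.lexists(p):
                os.unlink(p)  # replace the separate copy with a link
            os.link(first, p)
```

Using a fresh UUID instead of the raw inode number avoids sending
machine-specific data, but it does not answer the question of two machines
committing /bin and /sbin separately.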

... That's it. Comments welcome. ...



Versioning your /etc, /home or even your whole installation?
             Try fsvs (fsvs.tigris.org)!
