
Re: Bemerkungen zum RFC

Author pmarek
Full name P.Marek
Date 2006-05-17 01:10:16 PDT
Message Hello Dirk,

>> http://svn.haxx.se/dev/archive-2006-05/0393.shtml
> I just read it. Very nice! But I'd change something:
>>> The things described here apply to issue 2286: a kind of shared-storage
>>> for identical files in the repository is needeed, without having any
>>> means
>>> to infer such sharing from the outside.
>>> (That is, any of the identical files behaves like a normal entry, only
>>> the storage space in the database is stored.)
> I'd replace "needeed" by "wanted".
> And I guess, "is stored" should read "is shared"... ?
Right on both accounts.

> And now for something (nearly completely) different:
> I just looked into the db dir of a repository under FSFS (it has the
> advantage that one can simply read individual files (=
> revision-descriptions)). It turned out that these files contain (in
> addition to the pure delta data) some other stuff that I'm unfamiliar
> with but looks like it has something to do with delta/version
> organization.
> To cut out the delta data and store it apart from the organizing info
> might become "expensive": for small deltas, the size of the whole file
> will often be below the blocksize of the filesystem (1K-4K for
> ext2/ext3), and then you use twice the space that a single file uses on
> the hard disk.
The idea was not to store the delta data apart from the meta-data, but
to write a pointer instead of the delta-data.
In effect, there's a mark saying "instead of a delta to the previous
version, look there ---> for the data".
So there's only a single lookup more to do - then the normal data
retrieval happens.
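The effect is much like a filesystem hardlink: the second name is a
full-fledged entry, but the data behind it is stored only once. As a rough
illustration (plain shell, nothing Subversion-specific assumed):

```shell
# Create a file and a hardlink to it; both names behave like normal
# files, but the data blocks on disk exist only once.
echo "some delta data" > original
ln original duplicate

# Both names resolve to the same inode, i.e. the same storage;
# the link count shows there are two names for it.
stat -c '%i %h %s' original duplicate
```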

> No idea how BDB stores the data but for FSFS it seems that you might end
> up needing more disk space when you store duplicate data once than if
> you don't do it.
Have a look at the documents in
especially "structure".
In BDB the versioning data is already split into several parts, with
pointers linking the various pieces together; there it's simply a new kind
of "hardlink" pointer which has to be traversed.
That's one of the reasons why I'll start with BDB. (The other is the
problem with committing several thousand files at once.)

> Possible countermeasure: split delta data from organization info only if
> it's feasible, say, when the delta data is more than 4K. Of course,
> this would increase the implementation effort considerably...
> As a fast check, I just did the following:
> find * -path "*/db/revs/*" -printf "%k \t %P\n" | sort -n | less
> Of the 991 files that were found, are:
> 599 <= 4KiB
> 781 <= 8KiB (i.e. 192 > 4K and <= 8K)
> 840 <= 12KiB
> 868 <= 16KiB
> ...
> 924 <= 32KiB
> ...
> 953 <= 64KiB
> ...
> 987 <= 1MiB(=1024KiB)
> Only four are bigger; they are 1128, 7775, 7804 and 16204 KiB.
> Well, 991 files = revisions over all repositories isn't that much...
> Further, we deal a lot with texts instead of program data and
> patches; I think that when the main usage is program versioning, deltas
> of several 100 KiB might be very rare. Then again, there are very few
> branches here.
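A one-pass variant of that `find` check, aggregated with awk instead of
eyeballing the sorted list (assuming GNU find/awk; the bucket limits are
chosen to match the list above):

```shell
# Count revision files per cumulative size bucket, using %k (disk
# usage in 1K blocks) just like the original find invocation.
find . -path "*/db/revs/*" -type f -printf "%k\n" \
  | awk '{ total++
           for (limit = 4; limit <= 1024; limit *= 2)
               if ($1 <= limit) bucket[limit]++ }
         END { for (limit = 4; limit <= 1024; limit *= 2)
                   printf "%d <= %dKiB\n", bucket[limit], limit
               printf "%d files total\n", total }'
```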
I don't have real usage data yet.
My test repository (where I version my development machine) shows for
daily (or at least every 2nd day) dist-upgrades (against unstable):
  # for a in `seq 10 16`; do svnadmin dump --incremental -r $a | wc -c; done
  * Dumped revision 10.
  * Dumped revision 11.
  * Dumped revision 12.
  * Dumped revision 13.
  * Dumped revision 14.
  * Dumped revision 15.
  * Dumped revision 16.
(Warning! Takes some time!)
That's 78M per dist-upgrade. A lot of space.

> Hopefully this doesn't create any frustrations... Our usage is not
> typical and thus not representative. One should apply the
> abovementioned 'find' command to a bigger & more typical repository to
> get more usable results. And I have no clue about the storage
> efficiency of BDB.
> And I must confess to my shame that I still haven't found the
> time/opportunity to install FSVS on our backup server. I can easily
> imagine that once you back up whole Unix machines with several users,
> you quickly get into regions where sharing & linking duplicate data is
> worthwhile.
If I do some hand-waving and say: Debian has about 1G of updates per
month, which would perhaps make a delta of 50M ... if you're running 10
installations, you'd save about 450M, or 90%.
I think that would be very much *wanted*, if not required.
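Spelling out that back-of-the-envelope calculation (50M per delta and 10
installations are the assumed figures from above, not measurements):

```shell
# Without sharing: each of the 10 machines stores its own 50M delta.
# With sharing: the identical delta is stored once.
installations=10
delta_mb=50
without=$((installations * delta_mb))   # 500M in total
with=$delta_mb                          # 50M stored once
saved=$((without - with))
echo "saved ${saved}M ($((100 * saved / without))%)"
# → saved 450M (90%)
```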

> Well, that should be enough blurb for today.
I'm already looking forward to other discussions!

> Feel free to forward anything to the list if you think it's worth
> being discussed or read by the public... :-)
Thank you.

> Best regards, good luck & have fun
> Dirk

