Why can't Git handle large files and large repos?


Dozens of questions and answers online emphasize that Git can't handle large files or large repos. A handful of workarounds are suggested, such as git-fat and git-annex, but ideally Git would handle large files and repos natively.

If this limitation has been around for years, is there a reason it has not yet been removed? I assume there's some technical or design challenge baked into Git that makes large file and large repo support extremely difficult.

There are lots of related questions, but none seem to explain why this is such a big hurdle.

Basically, it comes down to tradeoffs.

One of your questions has an example from Linus himself:

[...] CVS, ie it really ends up being pretty much oriented to a "one file at a time" model.

Which is nice in that you can have a million files, and then only check out a few of them - you'll never even see the impact of the other 999,995 files.

Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one huge repository. I don't think that part is really fixable, although we can probably improve on it.

And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.

Just as you won't find a data structure with O(1) index access and insertion, you won't find a content tracker that does everything fantastically.

Git has deliberately chosen to be better at some things, to the detriment of others.


Disk usage

Since Git is a DVCS (distributed version control system), everyone has a copy of the entire repo (unless you use the relatively recent shallow clone).

This has some really nice advantages, which is why DVCSs like Git have become insanely popular.

However, a 4 TB repo on a central server with SVN or CVS is manageable, whereas if you use Git, nobody will be thrilled about carrying that around.

Git has nifty mechanisms for minimizing the size of your repo by creating delta chains ("diffs") across files. Git isn't constrained by paths or commit order when creating these, and they work quite well, kind of like gzipping the entire repo.

Git puts all these little diffs into packfiles. Delta chains and packfiles make retrieving objects take a little longer, but they are very effective at minimizing disk usage. (There are those tradeoffs again.)

That mechanism doesn't work as well for binary files, as they tend to differ quite a bit, even after a "small" change.
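
To make the difference concrete, here is a minimal Python sketch. It uses zlib's preset-dictionary feature as a stand-in for delta encoding (this is not Git's actual delta algorithm). As an assumption, it models a changed binary as two unrelated random byte strings, which is roughly what a compressed or encrypted format looks like after a small logical edit. The helper `delta_size` and the sample data are hypothetical.

```python
import random
import zlib


def delta_size(old: bytes, new: bytes) -> int:
    """Bytes needed to store `new` when `old` is available as a dictionary.

    Stand-in for delta encoding; this is not Git's real delta format.
    """
    compressor = zlib.compressobj(level=9, zdict=old)
    return len(compressor.compress(new) + compressor.flush())


rng = random.Random(42)  # fixed seed so runs are repeatable

# Text-like file: 400 short lines, followed by a one-line edit.
lines = [f"setting_{i} = {rng.random()}\n" for i in range(400)]
old_text = "".join(lines).encode()
lines[200] = "setting_200 = changed\n"
new_text = "".join(lines).encode()

# Assumed model of a re-saved binary: nearly every byte is different.
old_bin = bytes(rng.getrandbits(8) for _ in range(12_000))
new_bin = bytes(rng.getrandbits(8) for _ in range(12_000))

text_delta = delta_size(old_text, new_text)
binary_delta = delta_size(old_bin, new_bin)

print(f"text:   {len(new_text)} bytes, delta {text_delta} bytes")
print(f"binary: {len(new_bin)} bytes, delta {binary_delta} bytes")
```

The text delta should come out as a small fraction of the file, while the binary delta is about as large as the file itself, so every new version of such a binary costs nearly its full size.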


History

When you check in a file, you have it forever and ever. Your grandchildren's grandchildren's grandchildren will download your cat gif every time they clone your repo.

This of course isn't unique to Git, but being a DVCS makes the consequences more significant.

And while it is possible to remove files, Git's content-based design (each object id is a SHA of its content) makes removing those files difficult, invasive, and destructive to history. In contrast, I can delete a crufty binary from an artifact repo, or an S3 bucket, without affecting the rest of my content.
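
A rough Python sketch of why removal is so invasive. `git_blob_id` follows the way Git hashes a blob in a SHA-1 repository (a `blob <size>` header, a NUL byte, then the content). `toy_commit_id` and `build_history` are hypothetical, heavily simplified stand-ins for real commit objects, which also include trees, authors, and timestamps.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object id for a blob: SHA-1 over 'blob <size>\\0' plus the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def toy_commit_id(content_id: str, parent_id: str) -> str:
    """Simplified commit id (assumption): hash of content id and parent id."""
    return hashlib.sha1(f"{content_id} {parent_id}".encode()).hexdigest()


def build_history(snapshots):
    """Chain snapshots into commits; each id depends on all earlier ones."""
    ids = []
    parent = ""
    for snapshot in snapshots:
        parent = toy_commit_id(git_blob_id(snapshot), parent)
        ids.append(parent)
    return ids


# The same project, with and without a large file in the two oldest commits.
original = build_history([b"code v1 + cat.gif", b"code v2 + cat.gif", b"code v3"])
rewritten = build_history([b"code v1", b"code v2", b"code v3"])

# The last snapshot is byte-for-byte identical, yet its commit id changes,
# because the id of every ancestor changed.
print(original[2])
print(rewritten[2])
```

Because each id folds in its parent's id, stripping the file from old commits changes the id of every commit after them, which is why everyone with a clone is affected by the rewrite.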


Difficulty

Working with really large files requires a lot of careful work to make sure you minimize your operations and never load the whole thing into memory. This is extremely difficult to do reliably when creating a program with a feature set as complex as Git's.
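
As an illustration of the kind of care involved, here is a small Python sketch that hashes a file in fixed-size chunks instead of reading it all at once. The function name, chunk size, and temporary file are hypothetical; this shows the general streaming pattern, not code from Git.

```python
import hashlib
import os
import tempfile


def hash_file_in_chunks(path: str, chunk_size: int = 1024 * 1024) -> str:
    """SHA-1 of a file, holding at most `chunk_size` bytes in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file of a bit over 3 MiB.
payload = b"x" * (3 * 1024 * 1024 + 123)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

streamed = hash_file_in_chunks(tmp_path)
os.remove(tmp_path)

# Same digest as hashing everything in one go, without the memory cost.
print(streamed == hashlib.sha1(payload).hexdigest())
```

Hashing is the easy case. Diffing, merging, and delta compression all need the same discipline, and a single code path that reads a whole object into memory undoes it.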


Conclusion

Ultimately, developers who say "don't put large files in Git" are a bit like those who say "don't put large files in databases". They don't like it, but any alternatives have disadvantages (Git integration in the one case, ACID compliance and FKs in the other). In reality, it usually works okay, especially if you have enough memory.

It just doesn't work as well as it does with what it was designed for.

