|Distributed is NOT a superset of Centralized Version Control|
|Written by Mark Domowicz|
|Friday, 25 June 2010 21:52|
NOTE: From the discussion for this article on reddit, I've noticed that many people think I'm saying that DVCS tools cannot do everything a CVCS tool can do. But this article is not about tools; it's about models. My point is that a project can often benefit greatly from file locking, and that file locking is a Centralized operation, no matter whether your tools are Distributed or Centralized.
In any case, thanks for the great discussion reddit!
Linus Torvalds is a smart man, but I often get the impression that his powerful simplifying and generalizing instincts sometimes force him into a narrow view, leaving him unable to see the software development world from a perspective other than his own. To me, the most tangible example of this is his view on Git. Git is a Distributed Version Control System (DVCS), and because it is distributed, Git is often seen as inherently superior to Centralized Version Control Systems (CVCS) such as Subversion and Perforce.
Popular Version Control Systems
In a 2007 Google Tech Talk, Linus had this to say:
"Distributed is really really central to any SCM you should ever use. So get rid of perforce. Now. It's sad, but it's so so true."
Linus is not talking about Git specifically, here. He is claiming that the Distributed Model is superior to the Centralized Model.
But this is much too big a generalization, and it is incorrect. If I had to guess, the reason so many people tend to assume that DVCSs are intrinsically better than CVCSs is that they think the Distributed Model is a superset of the Centralized Model. That is, they believe that any development model that can be supported in a Centralized system can also be supported in a Distributed system, and that the Distributed system brings with it a host of additional benefits.
The best example of this is commit-access topology - the organization by which people have access to commit changes to the database. In the Centralized model, there is only one master database - a main "server" that everyone works with, and that is administered by only a handful of developers. In the Distributed model, all developers have their own private copy of the source database, and no one person's copy is a second-class citizen to anybody else's. This means that everyone works with their own database, which is quite convenient, but more importantly, it means that any commit-access topology in use under a DVCS setup is one defined by the team, and not by the tool. In other words, while a CVCS, by its nature, enforces a centralized topology, with a DVCS the users can define the topology as they want it and enforce it socially.
DVCS tool users can, therefore, also choose to work in a Centralized way. And thus the misconception that the Distributed Model is a superset of the Centralized Model.
But here's the problem with Distributed: it's a poor model to use when your project has a lot of binary data. The best example I can think of is video game development (surprise!), which routinely deals with vastly more binary assets - in the form of audio files, image files, and geometry files - than code. This is bad because binary data is unmergeable, which means that if two people are working on the same binary file at the same time, one of the two developers is going to lose their work when it's time to commit.
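You can see the problem in a throwaway Git repository (the file and branch names here are just illustrative): when two commits touch the same binary file, the merge machinery has nothing to work with, and one side's edit must simply be thrown away.

```shell
# Sketch (throwaway repo): concurrent edits to one binary file cannot be merged.
git init -q demo && cd demo
git config user.name dev
git config user.email dev@example.com
base=$(git symbolic-ref --short HEAD)     # default branch name varies by Git version
printf '\000\001\002' > sprite.bin        # stand-in for an audio/image/geometry asset
git add sprite.bin
git commit -qm "base asset"
git checkout -qb alice                    # one developer edits the asset...
printf '\000\251\002' > sprite.bin
git commit -qam "alice's edit"
git checkout -q "$base"                   # ...while another edits the same asset
printf '\000\273\002' > sprite.bin
git commit -qam "bob's edit"
git merge alice || echo "conflict: Git cannot merge binaries; one edit gets discarded"
```

Text conflicts at least leave both versions in the file for a human to reconcile; a binary conflict only lets you pick `--ours` or `--theirs` wholesale.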
Topologies, Paul Baran, 1964 (only somewhat related to the article :)
So two developers can't work on the same binary files at the same time. It would therefore be best if developers could somehow know when a file is "in use" and when it is "available". Of course, one way to do this would be to simply ask every developer whether they are currently using the file in question, but that's a lot of overhead and quite impractical in many situations. It would be better if the VCS tool itself could keep track of when a file is, or is not, usable. Distributed systems could do this, but only if they forced every database to talk to a server (at which point it's no longer distributed), or if all databases talked to each other, which is not only inefficient but also removes a major benefit of Distributed tools, namely the ability to work offline.
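To make the point concrete, here is a minimal sketch (not any real tool's API - the class and method names are invented) of what file locking boils down to: a single authoritative table mapping files to whoever has them "in use". The whole scheme only works because exactly one copy of this table exists - which is precisely the centralized piece the Distributed Model lacks.

```python
# Minimal sketch, not any real VCS's implementation: file locking is a
# central registry, and it only works because there is exactly one of it.
class LockManager:
    """Authoritative table mapping file paths to the user holding the lock."""

    def __init__(self):
        self._locks = {}  # path -> username

    def lock(self, path, user):
        """Grant the lock if the file is available; return True on success."""
        if path in self._locks:
            return False  # someone else is already working on this file
        self._locks[path] = user
        return True

    def unlock(self, path, user):
        """Release a lock; only the current holder may release it."""
        if self._locks.get(path) == user:
            del self._locks[path]
            return True
        return False

    def holder(self, path):
        """Report who, if anyone, has the file 'in use'."""
        return self._locks.get(path)


server = LockManager()
assert server.lock("textures/hero.png", "alice")    # alice starts editing
assert not server.lock("textures/hero.png", "bob")  # bob is told it's in use
assert server.holder("textures/hero.png") == "alice"
assert server.unlock("textures/hero.png", "alice")
assert server.lock("textures/hero.png", "bob")      # now bob can safely edit
```

Replicate this table across every developer's machine and you are back to asking how the copies stay in agreement - either through a server or through all-to-all chatter, which is exactly the dilemma described above.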
To emphasize, I'm not talking about any specific tool. I'm talking about the model. Can a Distributed tool like Git or Mercurial be used in a centralized way? With the right features, certainly. I'm just saying that the Centralized Model can do things the Distributed Model cannot. By definition then, Distributed is not a superset of Centralized. Further, sometimes the Centralized Model is a better fit for the project, video games being a prime example. Perhaps Git is a more flexible tool than Subversion, but personally, if I have to work in the Centralized Model, I'll use tools that are made for it.
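For the record, using a Distributed tool in a centralized way is just a social convention, as described earlier: the team agrees on one "blessed" repository that everyone clones from and pushes to. A sketch with Git (the repository names and paths here are made up):

```shell
# Sketch: Git used centrally by convention - one agreed-upon master database.
git init -q --bare central.git            # the team's single "blessed" repository
git clone -q central.git work             # each developer clones it...
cd work
git config user.name dev
git config user.email dev@example.com
echo "v1" > notes.txt
git add notes.txt
git commit -qm "first commit"
git push -q origin HEAD                   # ...and everyone integrates through it
```

Nothing in Git enforces that `central.git` is special; the team does. That is the point: the topology is defined socially, while features like locking still need the tool's help.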
And so, if I may be so bold, perhaps a more accurate statement from Linus would have been, "Distributed is really really central to any SCM you should ever use ... if the only software development you do is the kind of software development I know about."
|Last Updated on Saturday, 26 June 2010 23:59|