Last year I did a
[[http://lumberjaph.net/graph/2010/03/25/github-explorer.html][small
exploration of GitHub]] to show the various communities using
[[http://github.com][GitHub]] and how they work. I wanted to do it again
this year, but I was lacking time and motivation to start over. A couple
of months ago, I got a message from
[[https://twitter.com/#!/mojombo][mojombo]] asking me if I was planning
to do a new poster. This triggered the motivation to work on it again.
This time I got help from [[https://twitter.com/#!/jacomyal][Alexis]] to
provide you with an awesome tool: [[http://www.stargit.net][a real
explorer of your graph]], but more on this later ;)
And of course, [[http://labs.linkfluence.net][the poster]]. Feel free to
print it yourself, the size of the poster is A1.
** The data
All the data are available! Last year I got some mails asking me for the
dataset. So this time I asked first if I could release the
[[http://maps.stargit.net/dump/github.tgz][data]] with the
[[http://git.lumberjaph.net/p5-stargit.git/][code]] and the poster, and
the answer is yes! So if you're interested, you can download it.
The data are stored in MongoDB, so I provide a dump which you can
easily use:
#+BEGIN_SRC sh
% wget http://maps.stargit.net/dump/github.tgz
% tar xvzf github.tgz
% cd github
% mongorestore -d github .
#+END_SRC
Now you can use MongoDB to browse the imported database. There are five
collections: profiles / repositories / relations / contributions /
edges.
** Methodology
Last year I did a simple "follower/following" graph. It was already
interesting, but it was also /really/ too simple. This time I wanted to
go deeper in the exploration.
The steps to process all this data are:
- using the GitHub API, fetch information from the profiles.
- once all the profiles are collected, fetch information about the
repositories. Only forked repositories are kept.
- keep "simple" relations (followers/following), which are used later to
add weight to relations.
- tag each user with the main programming language they use. Using the
GitHub API, I was able to categorize ~40k profiles (about 1/3 of my
whole dataset).
- using the GeoNames API, extract the name of the country the user is
in. This time, about 55k profiles were tagged.
- fetch contributions for each repository.
- compute a score between the author of the contribution and the owner
of the repo.
- add a weight to each edge, using the computed score and "+1" if the
developer follows the other developer.
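The last two steps can be sketched as follows. This is only a minimal sketch: how the score is actually computed is not described here, so it is assumed to be the author's share of the contributions to the repository, and the names "contribution_score" and "edge_weight" are hypothetical.
#+BEGIN_SRC python
def contribution_score(author_commits, total_commits):
    """Assumed score: the author's share of all contributions to a repo."""
    if total_commits <= 0:
        # A repository without contributions yields no score.
        return 0.0
    return author_commits / total_commits


def edge_weight(score, follows):
    """Edge weight: the computed score, plus 1 if one developer follows the other."""
    return score + (1 if follows else 0)


# Hypothetical example: 5 of the 20 contributions come from the author,
# who also follows the owner of the repository.
weight = edge_weight(contribution_score(5, 20), follows=True)
print(weight)
#+END_SRC
With this assumed score, a contributor who also follows the owner always ends up with a heavier edge than one who does not, whatever their share of the contributions.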
For all the graphs, each of the following language groups has its own color:
- Ruby
- JavaScript
- Python
- C (C++, C#)
- Perl
- PHP
- JVM (Java, Clojure, Scala)
- Lisp (Emacs Lisp, Common Lisp)
- Other
** Exploring
Feel free to do your own analysis in the comments :) For each map,
you'll find a PDF of the map, and the graph to explore using Gephi (in
GEXF or GDF format).
*** But first, some numbers
I've collected:
- 123 562 profiles
- 2 730 organizations
- 40 807 repositories
It took me about a month to collect the data and to build the tools
for it.
*** Account creation
The following charts show the number of accounts created per month.
"Everyone" means the total number of accounts created. You can also see
the numbers for each community.
On the "Everyone" graph, you can see a huge peak around April 2008,
that's the date GitHub [[https://github.com/blog/40-we-launched][was
launched]].
For most of the communities, the number of created accounts has been
decreasing since 2010. I think the reason is that most of the developers
from those communities are now on GitHub.
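Counting accounts per month comes down to grouping profiles by their creation month. Here is a minimal sketch, assuming each profile is a dict with a "created_at" ISO 8601 timestamp; the field name and the sample profiles are assumptions for illustration, not necessarily what the dump contains.
#+BEGIN_SRC python
from collections import Counter


def accounts_per_month(profiles):
    """Count profiles per creation month, keyed by 'YYYY-MM'."""
    # The first seven characters of an ISO 8601 timestamp are 'YYYY-MM'.
    return Counter(p["created_at"][:7] for p in profiles)


# Made-up sample profiles, for illustration only.
sample = [
    {"login": "alice", "created_at": "2008-04-12T10:00:00Z"},
    {"login": "bob", "created_at": "2008-04-30T18:30:00Z"},
    {"login": "carol", "created_at": "2010-01-05T09:15:00Z"},
]

for month, count in sorted(accounts_per_month(sample).items()):
    print(month, count)
#+END_SRC
Filtering the profiles by their language tag before counting would give the per-community numbers.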
[Charts: accounts created per month, for everyone and for each community.]
*** Languages
(Keep in mind that these numbers come from the profiles I was able to
tag, roughly 40k)
- Ruby: 10046 (28%)
- Python: 5403 (15%)
- JavaScript: 5282 (15%) (JavaScript + CoffeeScript)
- C: 5093 (14%) (C, C++, C#)
- PHP: 3933 (11%)
- JVM: 3790 (10%) (Java, Clojure, Scala, Groovy)
- Perl: 1215 (3%)
- Lisp: 348 (<1%) (Emacs Lisp, Common Lisp)
Those numbers don't really match
[[https://github.com/languages][what GitHub gives]], but that could be
explained by the way I've selected my users.
*** Country
- United States: 19861 (36%)
- United Kingdom: 3533 (6%)
- Germany: 3009 (5%)
- Canada: 2657 (4%)
- Brazil: 2454 (4%)
- France: 1833 (3%)
- Japan: 1799 (3%)
- Russia: 1604 (2%)
- Australia: 1441 (2%)
- China: 1159 (2%)
The United States is still the most represented country on GitHub, no
surprise here.
If you are interested in the "geography" of Open Source, you should read
these two articles: [[http://takhteyev.org/dissertation/][Coding
Places]] and
[[http://takhteyev.org/papers/Takhteyev-Hilts-2010.pdf][Investigating
the Geography of Open Source Software through GitHub]].
*** Companies
Looking at the "company" field on users' profiles, here are some stats
about which companies have employees using GitHub:
- ThoughtWorks: 102
- Google: 66
- Mozilla: 65
- Yahoo!: 65
- Red Hat: 64
- Globo.com: 55
- Twitter: 53
- Facebook: 45
- Yandex: 43
- Intridea: 34
- Microsoft: 33
- Engine Yard: 32
- Pivotal Labs: 29
- MIT: 28
- Rackspace: 27
- IBM: 24
- Caelum: 23
- Novell: 22
- GitHub: 22
- VMware: 22
I didn't know the first company, ThoughtWorks, and I was expecting to
see Facebook or Twitter as the company with the most developers on
GitHub. It's also interesting to see Yandex here.
** Global graph (1628 nodes, 9826 edges)
([[http://maps.stargit.net/global/global.pdf][download PDF]],
[[http://maps.stargit.net/global/global.gdf][download GDF]])
The main difference from last year is the Android / modders community.
They're developing mostly in C and Java. The poster has been created
from this map.
** Ruby (1968 nodes, 9662 edges)
([[http://maps.stargit.net/ruby/ruby.pdf][download PDF]],
[[http://maps.stargit.net/ruby/ruby.gdf][download GDF]],
[[http://maps.stargit.net/ruby/ruby.gexf][download GEXF]])
This is still the main community on GitHub, even if JavaScript is now
[[https://github.com/languages/JavaScript][the most popular language]].
This graph is really dense and not easy to read, since there is no
real cluster in it.
** Python (1062 nodes, 2631 edges)
([[http://maps.stargit.net/python/python.pdf][download PDF]],
[[http://maps.stargit.net/python/python.gdf][download GDF]])
Here we have some clusters. I'm not familiar with the Python community,
so I can't really give any insight.
** Perl (608 nodes, 2967 edges)
([[http://maps.stargit.net/perl/perl.pdf][download PDF]],
[[http://maps.stargit.net/perl/perl.gdf][download GDF]],
[[http://maps.stargit.net/perl/perl.gexf][download GEXF]])
I really like this graph since it shows (in my opinion) one of the real
strengths of this community: everybody works with everybody. People
working on a web framework will collaborate with people working on Moose,
or an ORM, or other tools. It shows that in this community, people are
competent in more than one field.
The Perl community is about the same size as last year. However, we can
extract the following information:
- the Japanese Perl hackers are still a cluster by themselves
- [[http://github.com/miyagawa][miyagawa]] is still the glue between
the Japanese community and the "rest of the world"
- other leaders are: Florian Ragwitz
([[http://github.com/rafl][rafl]]), Andy Armstrong
([[http://github.com/andya][AndyA]]), Dave Rolsky
([[http://github.com/autarch][autarch]])
- some clusters exist for Moose and Dancer.
As we can see on the previous charts, the number of accounts created by
Perl developers is stalling.
** United States (2646 nodes, 11344 edges)
([[http://maps.stargit.net/unitedstates/unitedstates.pdf][download
PDF]],
[[http://maps.stargit.net/unitedstates/unitedstates.gdf][download
GDF]],
[[http://maps.stargit.net/unitedstates/unitedstates.gexf][download
GEXF]])
This one is really nice. We can clearly see all the communities. Two
things stand out:
- C and Ruby are on opposite sides (C on the left, Ruby on the
right)
- Python and Perl are also opposed (Perl at the bottom and Python at
the top)
I'll let you draw your own conclusions on this one ;)
** France (706 nodes, 1059 edges)
([[http://maps.stargit.net/france/france.pdf][download PDF]],
[[http://maps.stargit.net/france/france.gdf][download GDF]],
[[http://maps.stargit.net/france/france.gexf][download GEXF]])
We have a lot of small clusters on this one, and some very big
authorities.
** Japan (464 nodes, 1091 edges)
([[http://maps.stargit.net/japan/japan.pdf][download PDF]],
[[http://maps.stargit.net/japan/japan.gdf][download GDF]],
[[http://maps.stargit.net/japan/japan.gexf][download GEXF]])
There are three dominant clusters on this one:
- Ruby
- Perl
- C
The Ruby and Perl ones are well connected. There are a lot of Japanese
hackers on CPAN using both languages.
** StarGit
[[http://stargit.net][StarGit]] is a great tool we built with Alexis to
let you explore *your* community on GitHub. You can read more about the
application on
[[http://ofnodesandedges.com/2011/06/20/stargit.html][Alexis' blog]].
It's hosted on [[http://dotcloud.com][dotcloud]] (I'm still amazed at
how easy it was to deploy the code ...), using the Perl
[[http://perldancer.org][Dancer web framework]], MongoDB to store the
data, and Redis to do some caching.
** Credits
I would like to thank the whole GitHub team for being interested in the
previous poster and for asking for another one this year :)
A *huge* thanks to Alexis for his help on building the awesome StarGit.
Another big thanks to Antonin for his work on the poster.