#+BEGIN_QUOTE
  /More information about the poster is available in
  [[http://lumberjaph.net/graph/2010/04/02/github-poster.html][this
  post]]./
#+END_QUOTE

Last year, with help from my coworkers at
[[http://linkfluence.net/][Linkfluence]], I created two sets of maps of
the [[http://perl.org][Perl]] and [[http://search.cpan.org/][CPAN]]
communities. For this, I collected data from CPAN to create three maps:

-  [[http://cpan-explorer.org/2009/07/28/new-version-of-the-distributions-map-for-yapceu/][dependencies
   between distributions]]
-  [[http://cpan-explorer.org/2009/07/28/version-of-the-authors-graph-for-yapceu/][which
   authors were important in terms of reliability]]
-  [[http://cpan-explorer.org/2009/07/28/new-web-communities-map-for-yapceu/][and
   how the websites of these authors are structured]]

I wanted to do something similar again, but not with the same data. So I
took a look at what could be a good subject. One of the things that we
saw from the map of the websites is the importance
[[http://github.com/][GitHub]] is gaining inside the Perl community.
GitHub provides a [[http://develop.github.com/][really good API]], so I
started to play with it.

#+BEGIN_QUOTE
  This graph will be printed as a poster, in
  [[http://en.wikipedia.org/wiki/A2_paper_size][A2]] and
  [[http://en.wikipedia.org/wiki/A1_paper_size][A1]] sizes. Please contact me
  at franck.cuny [at] linkfluence.net if you are interested in one.
#+END_QUOTE

This time, I didn't aim for the Perl community only, but for the whole
GitHub community. I've created several graphs:

#+BEGIN_QUOTE
  All the graphs are available
  [[http://www.flickr.com/photos/franck_/sets/72157623447857405/][on my Flickr account]].
#+END_QUOTE

-  [[http://www.flickr.com/photos/franck_/4460144638/][a graph of all
   languages]]
-  [[http://www.flickr.com/photos/franck_/4456072255/in/set-72157623447857405/][a
   graph of the Perl community]]
-  [[http://www.flickr.com/photos/franck_/4456914448/][a graph of the
   Ruby community]]
-  [[http://www.flickr.com/photos/franck_/4456118597/in/set-72157623447857405/][a
   graph of the Python community]]
-  [[http://www.flickr.com/photos/franck_/4456830956/in/set-72157623447857405/][a
   graph of the PHP community]]
-  [[http://www.flickr.com/photos/franck_/4456862434/in/set-72157623447857405/][a
   graph of the European community]]
-  [[http://www.flickr.com/photos/franck_/4456129655/in/set-72157623447857405/][a
   graph of the Japanese community]]

I think a disclaimer is important at this point. I know that GitHub
doesn't represent the whole open source community. With these maps, I
don't claim to represent what the open source world looks like right
now. This is not a troll about which language is best or most widely
used. It's *ONLY* about GitHub.

Also, I won't provide deep analysis for each of these graphs, as I lack
insight about some of those communities. So feel free to
[[http://franck.lumberjaph.net/graphs.tgz][re-use the graphs]] and
provide your own analyses.

** Methodology

I didn't collect all the profiles.
[[http://twitter.com/gfouetil][Guilhem]] and I decided to keep only
people who are followed by at least two other people. We did the same
thing for repositories, keeping only those that have been forked at
least once. Using this technique, more than 17k profiles have been
collected, and nearly as many repositories.
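
The two thresholds can be sketched as simple filters. This is only an illustration, not the code used for the maps: the record layout and the field names (=login=, =followers=, =forks=) are assumptions, and the sample data is made up.

#+BEGIN_SRC python
# Hypothetical records: field names and values are illustrative only.
profiles = [
    {"login": "alice", "followers": 12},
    {"login": "bob", "followers": 1},
    {"login": "carol", "followers": 2},
]
repositories = [
    {"name": "alice/app", "forks": 3},
    {"name": "bob/notes", "forks": 0},
    {"name": "carol/lib", "forks": 1},
]

MIN_FOLLOWERS = 2  # followed by at least two other people
MIN_FORKS = 1      # forked at least once


def keep_profile(profile):
    """Return True when the profile has enough followers to be collected."""
    return profile["followers"] >= MIN_FOLLOWERS


def keep_repository(repo):
    """Return True when the repository has been forked at least once."""
    return repo["forks"] >= MIN_FORKS


kept_profiles = [p["login"] for p in profiles if keep_profile(p)]
kept_repositories = [r["name"] for r in repositories if keep_repository(r)]
print(kept_profiles)
print(kept_repositories)
#+END_SRC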

For each profile, using the GitHub API, I've tried to determine what the
main language for this person is. Then, with the help of
[[http://www.geonames.org][GeoNames]], I tried to find the right country
to attach the profile to.
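
One plausible way to pick a main language, assuming we already have the language reported for each of a person's repositories, is a simple majority vote. The function name =main_language= and the input format are hypothetical, and the rule actually used for the maps may have been different.

#+BEGIN_SRC python
from collections import Counter


def main_language(repo_languages):
    """Return the most frequent language, or None when nothing is known.

    repo_languages is a list with one entry per repository; entries may be
    None when no language was detected for that repository.
    """
    counts = Counter(lang for lang in repo_languages if lang)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(main_language(["Perl", "Perl", "Ruby", None]))
#+END_SRC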

Each profile is represented by a node. For each node, the following
attributes are set:

-  name of the profile
-  main language used by this profile, determined by github
-  name of the country
-  follower count
-  following count
-  repository count
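
As a sketch, a node with these attributes could be modelled as a small record. The class and field names below are my own choice for illustration, and the sample values are made up.

#+BEGIN_SRC python
from dataclasses import asdict, dataclass


@dataclass
class ProfileNode:
    """One node of the graph, carrying the six attributes listed above."""
    name: str
    main_language: str
    country: str
    followers: int
    following: int
    repositories: int


# Made-up values, only to show the shape of a node.
node = ProfileNode(
    name="example-user",
    main_language="Perl",
    country="France",
    followers=42,
    following=10,
    repositories=7,
)
print(asdict(node))
#+END_SRC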

An edge is a link between two profiles. Each time someone follows
another profile, a link is created. By default, the weight of this link
is 1. For each project this person forked from the target profile, the
weight is incremented.
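
The weighting rule can be sketched like this. The function name =build_edges= and the pair-based input format are assumptions, and I assume here that a fork only counts when a follow link exists between the two profiles.

#+BEGIN_SRC python
from collections import Counter


def build_edges(follows, forks):
    """Build weighted edges from follow and fork relations.

    follows: iterable of (follower, followed) pairs.
    forks:   iterable of (forker, owner) pairs, one per forked project.
    Returns a dict mapping (source, target) to the edge weight.
    """
    fork_counts = Counter(forks)
    # Every follow link starts at weight 1; each fork from the target adds 1.
    return {pair: 1 + fork_counts[pair] for pair in set(follows)}


follows = [("a", "b"), ("b", "c")]
forks = [("a", "b"), ("a", "b"), ("c", "a")]
edges = build_edges(follows, forks)
print(edges)
#+END_SRC

In this example, =a= follows =b= and forked two of =b='s projects, so that edge gets a weight of 3. The fork from =c= to =a= is ignored because =c= does not follow =a=.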

As always, I've used [[http://gephi.org/][Gephi]] (now in version 0.7)
to create the graphs. Feel free to download the various graph files and
use them with Gephi.

** Github

#+BEGIN_QUOTE
  properties of the graph: 16443 nodes / 130650 edges
#+END_QUOTE

The first map is about all the languages available on GitHub. This one
was really heavy, with more than 16k nodes and 130k edges. The final
version of the graph uses the 2270 most connected nodes.
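
Reducing a graph to its most connected nodes can be sketched as follows. I assume here that "most connected" means the highest degree (incoming plus outgoing links); the function name =most_connected= is hypothetical, and the actual selection for the map may have been done differently.

#+BEGIN_SRC python
from collections import Counter


def most_connected(edges, n):
    """Keep the n nodes with the highest degree and the edges between them.

    edges: list of (source, target) pairs.
    Returns (kept_nodes, kept_edges).
    """
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    kept_nodes = {node for node, _ in degree.most_common(n)}
    kept_edges = [
        (s, t) for s, t in edges if s in kept_nodes and t in kept_nodes
    ]
    return kept_nodes, kept_edges


sample_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
nodes, kept = most_connected(sample_edges, 3)
print(sorted(nodes), kept)
#+END_SRC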

You can't miss Ruby on this map. As GitHub uses Ruby on Rails, it's not
really surprising that the Ruby community has a particular interest in
this website. The other main languages on GitHub are what we can expect:
PHP, Python, Perl and Javascript.

Some languages are not really well represented. We can assume that most
Haskell projects might use darcs, and therefore are not on github. Some
other languages may use other platforms, like launchpad, or sourceforge.

** Perl

#+BEGIN_QUOTE
  properties of the graph: 365 nodes / 4440 edges
#+END_QUOTE

The Perl community is split into two parts. On the left side, there is
the Western community, driven by people like
[[http://github.com/rafl][Florian]], [[http://github.com/nothingmuch][Yuval]],
[[http://github.com/rjbs][rjbs]], ... The second part is made up of the
Japanese Perl hackers, with Tokuhirom, Typester, Yappo, ... And in between
them, Miyagawa acts as a glue. This map looks a lot like the previous map
of the CPAN. We can see that this community is international, with the
exception of Japan, which doesn't mix with the others.

There is no main project on GitHub that gathers people, even though we
can see a fair amount of MooseX:: projects. Most of the developers work
on different modules, which may not have the same purpose. Lately we
have seen a fair amount of work on various Plack stuff, mainly
middleware, but also HTTP servers (Twiggy, Starman, ...) and web
frameworks (Dancer).

One important project that is not (deliberately) represented on this
graph is the gitpan, Schwern's project. The gitpan is an import of all
the CPAN modules, with complete history using the Backpan.

To conclude about Perl, there are only 365 nodes on this graph, but no
fewer than 4440 edges. That's nearly twice the number of edges of the
Python community, which has more nodes. Perl is a really well structured
community, probably thanks to the CPAN, which already acted as a hub for
contributors.
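
To make the comparison concrete, here is a small sketch computing the average number of edges per node and the density of the two graphs, using the node and edge counts given in the graph properties of the Perl and Python sections. The density formula assumes a directed graph without self-loops, which seems reasonable since following someone is a one-way relation.

#+BEGIN_SRC python
# (nodes, edges) taken from the "properties of the graph" lines.
graphs = {
    "Perl": (365, 4440),
    "Python": (532, 2566),
}


def edges_per_node(nodes, edges):
    """Average number of edges per node."""
    return edges / nodes


def density(nodes, edges):
    """Share of all possible directed links that actually exist."""
    return edges / (nodes * (nodes - 1))


for name, (nodes, edges) in graphs.items():
    print(name, round(edges_per_node(nodes, edges), 2),
          round(density(nodes, edges), 4))
#+END_SRC

This gives about 12.2 edges per node for Perl against about 4.8 for Python, so the Perl graph is much more tightly connected even though it is smaller.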

** Python

#+BEGIN_QUOTE
  properties of the graph: 532 nodes / 2566 edges
#+END_QUOTE

The Python community looks a lot like the Perl community, but only in
the structure of the graph. If we look closely, Django is the main
project that represents Python on GitHub, in contrast with Perl where
there is no leader. Some small projects gather small communities of
developers.

** PHP

#+BEGIN_QUOTE
  properties of the graph: 301 nodes / 1071 edges
#+END_QUOTE

PHP is the only community that is structured this way on GitHub. We can
clearly see that people are grouped around the project to which they
mainly contribute.

CakePHP and Symphony are the two main projects. Nearly all the projects
gather an international community, with the exception of a few
Japanese-only projects.

** Ruby

#+BEGIN_QUOTE
  properties of the graph: 3742 nodes / 24571 edges
#+END_QUOTE

As with the GitHub graph, we can clearly see that some countries are
isolated. On the right side, the Japanese community is at the bottom and
the Spanish one at the top. The Australians are in the upper right
corner, while on the left side we have the Brazilians.

The main projects that gather most of the hackers are Rails and Sinatra,
two famous web frameworks.

** Europe

#+BEGIN_QUOTE
  properties of the graph: 2711 nodes / 11259 edges
#+END_QUOTE

This one shows interesting features. Some countries are really isolated.
If we look at Spain, we can see a community of Ruby programmers, with
strong connectivity among them, but no really strong connection with
any foreign developers. We can clearly see that the Perl community
exists as a single community, and is not split by country. The same is
true for Python.

** Japanese hackers community

#+BEGIN_QUOTE
  properties of the graph: 559 nodes / 5276 edges
#+END_QUOTE

This community is unique on GitHub. In 2007, Yappo created
coderepos.org, a repository for open source developers in Japan. It was
a Subversion repository, with Trac as an HTTP front-end. It gathered
around 900 developers, with all kinds of projects (Perl, Python, Ruby,
Javascript, ...). Most of these users have switched to GitHub now.

Three main communities are visible on this graph: Perl, Ruby and PHP. As
always, the Javascript community acts as a glue between them. And yes, we
can confirm that Perl is big in Japan.

We have seen in the previous graphs that the Japanese hackers are always
isolated. We can assume that the language barrier is an obstacle.

This is a really well-connected graph too.

** Conclusions and graphs

I may not have provided a deep analysis of all the graphs. I don't have
much knowledge of the communities outside of Perl. Feel free to
download the graphs, load them in Gephi, experiment, and provide your
own thoughts.

I would like to thank everybody at Linkfluence (guilhem for his
advice, camille for giving me time to work on this, and antonin for the
amazing poster), who helped me and let me use time and resources to
finish this work. Special thanks to blob for reviewing my prose and cdlm
for the discussion :)