author: Franck Cuny <franckcuny@gmail.com>, 2016-08-04 11:45:44 -0700
committer: Franck Cuny <franckcuny@gmail.com>, 2016-08-04 11:45:44 -0700
commit: 585b48b6a605cb71ef99dd767880e1b7ee5bf24e
tree: c65377350d12bd1e62e0bdd58458c1044541c27b /posts/2011-06-20-stargit.org
parent: Use Bullet list for the index.
parent: Mass convert all posts from markdown to org.
Merge branch 'convert-to-org'
Diffstat (limited to 'posts/2011-06-20-stargit.org')
-rw-r--r-- posts/2011-06-20-stargit.org 459
1 files changed, 459 insertions, 0 deletions
diff --git a/posts/2011-06-20-stargit.org b/posts/2011-06-20-stargit.org
new file mode 100644
index 0000000..4d4f77f
--- /dev/null
+++ b/posts/2011-06-20-stargit.org
@@ -0,0 +1,459 @@
+Last year I did a
+[[http://lumberjaph.net/graph/2010/03/25/github-explorer.html][small
+exploration of GitHub]] to show the various communities using
+[[http://github.com][GitHub]] and how they work. I wanted to do it again
+this year, but I was lacking time and motivation to start over. A couple
+of months ago, I got a message from
+[[https://twitter.com/#!/mojombo][mojombo]] asking me if I was planning
+to do a new poster. This triggered the motivation to work on it again.
+
+This time I got help from [[https://twitter.com/#!/jacomyal][Alexis]] to
+provide you with an awesome tool: [[http://www.stargit.net][a real
+explorer of your graph]], but more on this later ;)
+
+And of course, [[http://labs.linkfluence.net][the poster]]. Feel free to
+print it yourself; it is sized for A1.
+
+** The data
+
+All the data are available! Last year I got some emails asking me for the
+dataset, so this time I asked first whether I could release the
+[[http://maps.stargit.net/dump/github.tgz][data]] with the
+[[http://git.lumberjaph.net/p5-stargit.git/][code]] and the poster, and
+the answer is yes! So if you're interested, you can download it.
+
+The data are stored in MongoDB, so I provide a dump that you can
+easily restore:
+
+#+BEGIN_SRC sh
+ % wget http://maps.stargit.net/dump/github.
+ % tar xvzf github.tgz
+ % cd github
+ % mongorestore -d github .
+#+END_SRC
+
+Now you can use MongoDB to browse the imported database. There are five
+collections: profiles / repositories / relations / contributions /
+edges.
+
+** Methodology
+
+Last year I did a simple "follower/following" graph. It was already
+interesting, but it was also /really/ too simple. This time I wanted to
+go deeper in the exploration.
+
+The steps to process all this data are:
+
+- using the GitHub API, fetch information from the profiles.
+- once all the profiles are collected, fetch information about the
+ repositories. Only forked repositories are kept.
+- "simple" relations (followers/following) are kept and used later to
+ add weight to relations.
+- tag each user with the main programming language they use. Using the
+ GitHub API, I was able to categorize ~40k profiles (about 1/3 of my
+ whole dataset).
+- using the GeoNames API, extract the name of the country the user is
+ in. This time, about 55k profiles were tagged.
+- fetch contributions for each repository
+- compute a score between the author of the contribution and the owner
+ of the repo
+- add a weight to each edge, using the computed score and "+1" if the
+ developer follows the other developer
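
The last two steps of the list can be sketched roughly as follows. This is only an illustration, not the code from the p5-stargit repository: the function names, the shape of the input records, and the use of a plain contribution count as the score are all assumptions, since the post does not say how the score is actually computed.

```javascript
// Hypothetical sketch: names, record shapes and the scoring rule are assumptions.

// Stand-in score: the number of contributions `author` made to
// repositories owned by `owner` (the real scoring rule is not described).
function contributionScore(contributions, author, owner) {
  return contributions.filter(function (c) {
    return c.author === author && c.owner === owner;
  }).length;
}

// Edge weight = computed score, plus 1 if the author follows the owner.
function edgeWeight(contributions, follows, author, owner) {
  var score = contributionScore(contributions, author, owner);
  var bonus = follows.some(function (f) {
    return f.from === author && f.to === owner;
  }) ? 1 : 0;
  return score + bonus;
}

// Made-up sample data, for illustration only.
var contributions = [
  { author: 'alice', owner: 'bob' },
  { author: 'alice', owner: 'bob' },
  { author: 'carol', owner: 'bob' }
];
var follows = [{ from: 'alice', to: 'bob' }];

console.log(edgeWeight(contributions, follows, 'alice', 'bob')); // 3
console.log(edgeWeight(contributions, follows, 'carol', 'bob')); // 1
```

Keeping the follower bonus separate from the score makes it easy to swap in a different scoring rule without touching the "+1" logic.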
+
+For all the graphs, each of the following language groups has its own
+color:
+
+- Ruby
+- JavaScript
+- Python
+- C (C++, C#)
+- Perl
+- PHP
+- JVM (Java, Clojure, Scala)
+- Lisp (Emacs Lisp, Common Lisp)
+- Other
+
+** Exploring
+
+Feel free to do your own analysis in the comments :) For each map,
+you'll find a PDF of the map, and the graph to explore using Gephi (in
+GEXF or GDF format).
+
+*** But first, some numbers
+
+I've collected:
+
+- 123 562 profiles
+- 2 730 organizations
+- 40 807 repositories
+
+It took me about a month to collect the data and to build the tools
+needed to process it.
+
+*** Account creation
+
+The following chart shows the number of accounts created per month.
+"Everyone" means the total number of accounts created. You can also see
+the numbers for each community.
+
+On the "Everyone" graph, you can see a huge peak around April 2008,
+that's the date GitHub [[https://github.com/blog/40-we-launched][was
+launched]].
+
+For most of the communities, the number of created accounts has been
+decreasing since 2010. I think the reason is that most of the developers
+from those communities are now on GitHub.
+
+#+BEGIN_HTML
+ <script language="javascript" type="text/javascript" src="/js/jquery.js"></script>
+ <script language="javascript" type="text/javascript" src="/js/jquery.flot.js"></script>
+ <div id="placeholder" style="width:800px;height:300px;"></div>
+ <ul class="actions">
+  <li class="minibutton"></li>
+  <li class="minibutton"></li>
+  <li class="minibutton"></li>
+  <li class="minibutton"></li>
+  <li class="minibutton"></li>
+  <li class="minibutton"></li>
+  <li class="minibutton"></li>
+  <li class="minibutton"></li>
+  <li class="minibutton"></li>
+  <li class="minibutton"></li>
+  <li class="minibutton"></li>
+ </ul>
+#+END_HTML
+
+#+BEGIN_HTML
+ <script type="text/javascript">
+ $(function () {
+ var options = {
+ lines: { show: true },
+ points: { show: true },
+ xaxis: { mode:"time" }
+ };
+ var data = [];
+ var placeholder = $("#placeholder");
+
+ $.plot(placeholder, data, options);
+
+ // fetch one series, adding to what we got
+ var alreadyFetched = {};
+
+ $("input.resetSeries").click(function() {
+ alreadyFetched = {};
+ data = [];
+ $.plot(placeholder, data, options);
+ });
+
+ $("input.fetchSeries").click(function () {
+ var button = $(this);
+
+ // read the data URL from the button's own href attribute
+ var dataurl = button.attr('href');
+
+ // callback invoked with the fetched series: { label: ..., data: [[x, y], ...] }
+ function onDataReceived(series) {
+ // let's add it to our current data
+ if (!alreadyFetched[series.label]) {
+ alreadyFetched[series.label] = true;
+ data.push(series);
+ }
+
+ // and plot all we got
+ $.plot(placeholder, data, options);
+ }
+
+ $.ajax({
+ url: dataurl,
+ method: 'GET',
+ dataType: 'json',
+ success: onDataReceived
+ });
+ });
+ });
+ </script>
+#+END_HTML
+
+*** Languages
+
+(Keep in mind that these numbers are coming from the profiles I was able
+to tag, roughly 40k)
+
+- Ruby: 10046 (28%)
+- Python: 5403 (15%)
+- JavaScript: 5282 (15%) (JavaScript + CoffeeScript)
+- C: 5093 (14%) (C, C++, C#)
+- PHP: 3933 (11%)
+- JVM: 3790 (10%) (Java, Clojure, Scala, Groovy)
+- Perl: 1215 (3%)
+- Lisp: 348 (0%) (Emacs Lisp, Common Lisp)
+
+Those numbers don't really match
+[[https://github.com/languages][what GitHub gave]], but that could be
+explained by the way I've selected my users.
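
As a quick check of the arithmetic: the percentages in the list look consistent with each count divided by the sum of the eight listed groups, rounded down, rather than by the ~40k tagged profiles. That reading is an assumption, not something stated in the post; the sketch below just reproduces the figures under it.

```javascript
// Counts copied from the list above.
var counts = {
  Ruby: 10046, Python: 5403, JavaScript: 5282, C: 5093,
  PHP: 3933, JVM: 3790, Perl: 1215, Lisp: 348
};

// Assumption: the denominator is the sum of the listed groups.
var total = Object.keys(counts).reduce(function (sum, lang) {
  return sum + counts[lang];
}, 0);

// Assumption: percentages are rounded down to a whole number.
function share(lang) {
  return Math.floor(100 * counts[lang] / total);
}

console.log('total: ' + total); // total: 35110
Object.keys(counts).forEach(function (lang) {
  console.log(lang + ': ' + share(lang) + '%');
});
```

Under these assumptions Lisp's 348 profiles come out just under 1% and are shown as 0%, which matches the list.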
+
+*** Country
+
+- United States: 19861 (36%)
+- United Kingdom: 3533 (6%)
+- Germany: 3009 (5%)
+- Canada: 2657 (4%)
+- Brazil: 2454 (4%)
+- France: 1833 (3%)
+- Japan: 1799 (3%)
+- Russia: 1604 (2%)
+- Australia: 1441 (2%)
+- China: 1159 (2%)
+
+The United States is still the most represented country on GitHub; no
+surprise here.
+
+If you are interested in the "geography" of Open Source, you should read
+these two articles: [[http://takhteyev.org/dissertation/][Coding
+Places]] and
+[[http://takhteyev.org/papers/Takhteyev-Hilts-2010.pdf][Investigating
+the Geography of Open Source Software through GitHub]].
+
+*** Companies
+
+Looking at the "company" field on users' profiles, here are some stats
+about which companies have employees using GitHub:
+
+- ThoughtWorks: 102
+- Google: 66
+- Mozilla: 65
+- Yahoo!: 65
+- Red Hat: 64
+- Globo.com: 55
+- Twitter: 53
+- Facebook: 45
+- Yandex: 43
+- Intridea: 34
+- Microsoft: 33
+- Engine Yard: 32
+- Pivotal Labs: 29
+- MIT: 28
+- Rackspace: 27
+- IBM: 24
+- Caelum: 23
+- Novell: 22
+- GitHub: 22
+- VMware: 22
+
+I didn't know the first company, ThoughtWorks, and I was expecting to
+see Facebook or Twitter as the company with the most developers on GitHub.
+It's also interesting to see Yandex here.
+
+** Global graph (1628 nodes, 9826 edges)
+
+([[http://maps.stargit.net/global/global.pdf][download PDF]],
+[[http://maps.stargit.net/global/global.gdf][download GDF]])
+
+The main difference from last year is the Android / modders community.
+They're developing mostly in C and Java. The poster has been created
+from this map.
+
+** Ruby (1968 nodes, 9662 edges)
+
+([[http://maps.stargit.net/ruby/ruby.pdf][download PDF]],
+[[http://maps.stargit.net/ruby/ruby.gdf][download GDF]],
+[[http://maps.stargit.net/ruby/ruby.gexf][download GEXF]])
+
+This is still the main community on GitHub, even if JavaScript is now
+[[https://github.com/languages/JavaScript][the most popular language]].
+This graph is really dense and not easy to read, since there is no
+real cluster in it.
+
+** Python (1062 nodes, 2631 edges)
+
+([[http://maps.stargit.net/python/python.pdf][download PDF]],
+[[http://maps.stargit.net/python/python.gdf][download GDF]])
+
+Here we have some clusters. I'm not familiar with the Python community,
+so I can't really give any insight.
+
+** Perl (608 nodes, 2967 edges)
+
+([[http://maps.stargit.net/perl/perl.pdf][download PDF]],
+[[http://maps.stargit.net/perl/perl.gdf][download GDF]],
+[[http://maps.stargit.net/perl/perl.gexf][download GEXF]])
+
+I really like this graph since it shows (in my opinion) one of the real
+strengths of this community: everybody works with everybody. People
+working on a web framework will collaborate with people working on Moose,
+or an ORM, or other tools. It shows that in this community, people are
+competent in more than one field.
+
+The Perl community is about the same size as last year. However, we can
+make the following observations:
+
+- the Japanese Perl hackers are still a cluster by themselves
+- [[http://github.com/miyagawa][miyagawa]] is still the glue between
+ the Japanese community and the "rest of the world"
+- other leaders are: Florian Ragwitz
+ ([[http://github.com/rafl][rafl]]), Andy Armstrong
+ ([[http://github.com/andya][AndyA]]), Dave Rolsky
+ ([[http://github.com/autarch][autarch]])
+- some clusters exist for Moose and Dancer.
+
+As we can see on the previous charts, the number of created accounts for
+the Perl developers is stalling.
+
+** United States (2646 nodes, 11344 edges)
+
+([[http://maps.stargit.net/unitedstates/unitedstates.pdf][download
+PDF]],
+[[http://maps.stargit.net/unitedstates/unitedstates.gdf][download
+GDF]],
+[[http://maps.stargit.net/unitedstates/unitedstates.gexf][download
+GEXF]])
+
+This one is really nice. We can clearly see all the communities. Two
+things stand out:
+
+- C and Ruby are on opposite sides (C on the left, Ruby on the
+ right)
+- Python and Perl are also opposed (Perl at the bottom and Python at
+ the top)
+
+I'll let you draw your own conclusions on this one ;)
+
+** France (706 nodes, 1059 edges)
+
+([[http://maps.stargit.net/france/france.pdf][download PDF]],
+[[http://maps.stargit.net/france/france.gdf][download GDF]],
+[[http://maps.stargit.net/france/france.gexf][download GEXF]])
+
+We have a lot of small clusters on this one, and some very big
+authorities.
+
+** Japan (464 nodes, 1091 edges)
+
+([[http://maps.stargit.net/japan/japan.pdf][download PDF]],
+[[http://maps.stargit.net/japan/japan.gdf][download GDF]],
+[[http://maps.stargit.net/japan/japan.gexf][download GEXF]])
+
+There are three dominant clusters on this one:
+
+- Ruby
+- Perl
+- C
+
+The Ruby and Perl ones are well connected. There are a lot of Japanese
+hackers on CPAN using both languages.
+
+** StarGit
+
+[[http://stargit.net][StarGit]] is a great tool we built with Alexis to
+let you explore *your* community on GitHub. You can read more about the
+application on
+[[http://ofnodesandedges.com/2011/06/20/stargit.html][Alexis' blog]].
+
+It's hosted on [[http://dotcloud.com][dotcloud]] (I'm still amazed at
+how easy it was to deploy the code ...), using the Perl
+[[http://perldancer.org][Dancer web framework]], MongoDB to store the
+data, and Redis to do some caching.
+
+** Credits
+
+I would like to thank the whole GitHub team for their interest in the
+previous poster and for asking for another one this year :)
+
+A *huge* thanks to Alexis for his help on building the awesome StarGit.
+Another big thanks to Antonin for his work on the poster.