Welcome to git2net’s documentation!

Overview and Installation

git2net is an Open Source Python package that facilitates the extraction of co-editing networks from git repositories.

Requirements

git2net is pure Python code. It has no platform-specific dependencies and thus works on all platforms. The only requirement is a version of Git >= 2.0.
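Whether the Git installation on your machine meets this requirement can be checked from Python. The following is a minimal sketch using only the standard library; the helper names parse_git_version and git_is_recent_enough are hypothetical and not part of git2net.

```python
import re
import subprocess


def parse_git_version(output):
    """Extract the numeric version from `git --version` output as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"could not find a version number in {output!r}")
    # A missing patch level (e.g. "2.0") is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def git_is_recent_enough(minimum=(2, 0, 0)):
    """Return True if the Git executable found on PATH is at least `minimum`."""
    try:
        result = subprocess.run(
            ["git", "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return False  # Git is missing or could not be executed
    return parse_git_version(result.stdout) >= minimum
```

Calling git_is_recent_enough() returns False both when Git is too old and when it cannot be found at all, which is usually sufficient for a quick pre-installation check.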

Installing git2net

Assuming you are using pip, you can install the latest version of git2net from the command line by running:

$ pip install git2net

This command also installs the necessary dependencies. Among other dependencies, which are listed as install_requires in git2net’s setup file, git2net depends on the python-Levenshtein package to compute Levenshtein distances for edited lines of code. On systems running Windows, automatically compiling this C-based module might fail during installation. In this case, unofficial Windows binaries can be found here, which might help you get started.
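To confirm that the installation succeeded, you can query the installed package versions from Python. This is a small sketch based on the standard library module importlib.metadata (Python 3.8 or later); installed_version is a hypothetical helper name, not a git2net function.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the state of git2net and the Levenshtein dependency mentioned above.
for package in ("git2net", "python-Levenshtein"):
    print(package, installed_version(package) or "not installed")
```

If python-Levenshtein is reported as not installed on Windows, the compilation issue described above is a likely cause.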

Contributing to git2net

The source code for git2net is available in the gotec/git2net repository on GitHub.

If you find any bugs related to git2net, please report them as issues there.

git2net is developed as an Open Source project. This means that your ideas and inputs are highly welcome. Feel free to share the project and contribute yourself. To get started, you can clone git2net’s repository as follows:

$ git clone git@github.com:gotec/git2net.git

Now uninstall any existing version of git2net and install a local version based on the cloned repository:

$ pip uninstall git2net
$ cd git2net
$ pip install -e .

This will also install git2net’s dependencies.

git2net provides a set of tests that you should run before creating a pull request. To do so, you will first need to unzip the test repository they are based on:

$ unzip test_repos/test_repo_1.zip -d test_repos/

Then, you can run the tests with:

$ pytest

Citing git2net

@inproceedings{gote2019git2net,
    title={git2net: {M}ining time-stamped co-editing networks from large git repositories},
    author={Gote, Christoph and Scholtes, Ingo and Schweitzer, Frank},
    booktitle={Proceedings of the 16th International Conference on Mining Software Repositories},
    pages={433--444},
    year={2019},
    organization={IEEE Press}
}

@article{gote2021analysing,
    title={Analysing time-stamped co-editing networks in software development teams using git2net},
    author={Gote, Christoph and Scholtes, Ingo and Schweitzer, Frank},
    journal={Empirical Software Engineering},
    volume={26},
    number={4},
    pages={1--41},
    year={2021},
    publisher={Springer}
}

License

This software is licensed under the GNU Affero General Public License v3 (AGPL-3.0).

Getting Started

A central aim of git2net is to allow you to conveniently obtain and visualise network projections of editing activity in git repositories. Let’s have a look at an example of how you can achieve this:

import git2net
import pathpy as pp

github_url = 'gotec/git2net'
git_repo_dir = 'git2net4analysis'
sqlite_db_file = 'git2net4analysis.db'

# Clone and mine repository from GitHub
git2net.mine_github(github_url, git_repo_dir, sqlite_db_file)

# Disambiguate author aliases in the resulting database
git2net.disambiguate_aliases_db(sqlite_db_file)

# Obtain temporal bipartite network representing authors editing files over time
# (optionally pass time_from to only consider edits after a given point in time)
t, node_info, edge_info = git2net.get_bipartite_network(sqlite_db_file)

# Aggregate to a static network
n = pp.Network.from_temporal_network(t)

# Visualise the resulting network
colour_map = {'author': '#73D2DE', 'file': '#2E5EAA'}
node_color = {node: colour_map[node_info['class'][node]] for node in n.nodes}
pp.visualisation.plot(n, node_color=node_color)

In the example above, we used three functions of git2net. First, we extract edits from the repository using mine_github. Then, we disambiguate author identities using disambiguate_aliases_db. Finally, we obtain the bipartite author-file network with get_bipartite_network, which we then aggregate and visualise using pathpy.

Corresponding to the calls above, git2net’s functionality is partitioned into the modules extraction, disambiguation, and visualisation, complemented by a fourth module, complexity. We outline the most important functions of each module here. For comprehensive details on all functions of git2net, we refer to the API reference.
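Since the mining step in the example writes its results to a regular SQLite file, you can inspect that file with Python’s built-in sqlite3 module before computing any network. The sketch below lists the tables in a database; list_tables is a hypothetical helper and makes no assumption about which tables git2net creates.

```python
import sqlite3
from contextlib import closing


def list_tables(sqlite_db_file):
    """Return the names of all tables in an SQLite database file, sorted by name."""
    # Note: sqlite3.connect creates an empty database if the file does not exist.
    with closing(sqlite3.connect(sqlite_db_file)) as connection:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

For the example above, list_tables('git2net4analysis.db') would show the tables produced by mine_github.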

Tutorials

To help you get started, we provide an extensive set of tutorials covering different aspects of analysing your repository with git2net. You can directly interact with the notebooks in Binder, or view them in NBViewer via the links below.


We show how to clone and prepare a git repository for analysis with git2net.

Usage Examples

We have published some motivating results as well as details on the mining algorithm in “git2net - Mining Time-Stamped Co-Editing Networks from Large git Repositories”.

In “Analysing Time-Stamped Co-Editing Networks in Software Development Teams using git2net”, we use git2net to mine more than 1.2 million commits of over 25,000 developers. We use this data to test a hypothesis on the relation between developer productivity and co-editing patterns in software teams.

Finally, in “Big Data = Big Insights? Operationalising Brooks’ Law in a Massive GitHub Data Set”, we mine a corpus containing over 200 GitHub repositories using git2net. Based on the resulting data, we study the relationship between team size and productivity in OSS development teams. If you want to use this extensive data set for your own study, we made it publicly available on zenodo.org.

Modules

git2net provides four modules—extraction, disambiguation, visualisation, and complexity.

Extraction

The module extraction contains all the functions that operate directly on a git repository. The most important functions in this module are:

  • mine_git_repo: mines edits from a locally cloned git repository to an SQLite database.

  • mine_github: creates a local clone and mines edits from a repository on GitHub to an SQLite database.

  • check_mining_complete: checks if a repository has been fully mined.

  • mining_state_summary: provides information on any commits that have not been fully mined.

Check out the API reference for information on the complete list of available functions.
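As a sketch of how the extraction functions might be combined for a local clone, consider the following. It assumes that mine_git_repo and check_mining_complete both take the repository path and the database path as their first two positional arguments; please verify this against the API reference. The helpers is_git_repo and mine_and_check are hypothetical names introduced for this illustration.

```python
from pathlib import Path


def is_git_repo(path):
    """Return True if `path` looks like a git working tree (contains a .git entry)."""
    return (Path(path) / ".git").exists()


def mine_and_check(git_repo_dir, sqlite_db_file):
    """Mine a local clone, then return the result of the completeness check."""
    if not is_git_repo(git_repo_dir):
        raise ValueError(f"{git_repo_dir!r} does not look like a git repository")

    # Imported here so that is_git_repo can be used without git2net installed.
    import git2net

    # Assumption: (git_repo_dir, sqlite_db_file) are the first two arguments.
    git2net.mine_git_repo(git_repo_dir, sqlite_db_file)
    return git2net.check_mining_complete(git_repo_dir, sqlite_db_file)
```

Checking for the .git entry first avoids starting a mining run on a directory that was never cloned.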

Disambiguation

The module disambiguation contains a single function, which allows you to disambiguate author identities. The disambiguation is based on the algorithm gambit.

  • disambiguate_aliases_db: disambiguates author aliases in a database mined with git2net.

Visualisation

The visualisation module provides functions to generate various network projections based on the SQLite database created during the mining process.

  • get_coediting_network: creates a co-editing network where nodes are authors who are connected by a directed link if they consecutively edited the same line of code. Links are directed from the previous to the subsequent author.

  • get_coauthorship_network: creates a co-authorship network where nodes are authors who are connected by an undirected link if they edited the same file.

  • get_bipartite_network: creates a bipartite network with nodes representing authors and files. Undirected links exist between an author and all files the author edited.

  • get_line_editing_paths: creates paths for all lines in a repository. The paths contain ordered sequences of authors who subsequently edited a line. The number of paths generated for a line depends on the number of forks and merges the line was involved in.

  • get_commit_editing_dag: creates a directed acyclic graph where nodes are commits. Commits are connected by a directed link if one commit modifies lines last edited in another commit. Links are directed from the editing commit to the edited commit.
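To make the definitions of these projections concrete, the following toy sketch derives four of them from a small, made-up edit log. It is a conceptual illustration only and not git2net’s implementation: it assumes stable line identifiers, ignores forks and merges, and drops self-loops in the co-editing network. All function names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy edit log in chronological order: (author, file, line identifier).
edits = [
    ("alice", "a.py", 1),
    ("bob", "a.py", 1),
    ("carol", "a.py", 2),
    ("alice", "b.py", 1),
    ("carol", "a.py", 1),
]


def bipartite_edges(edits):
    """Undirected author-file links: an author is linked to every file they edited."""
    return {(author, file) for author, file, _ in edits}


def coauthorship_edges(edits):
    """Undirected author-author links between authors who edited the same file."""
    authors_by_file = defaultdict(set)
    for author, file, _ in edits:
        authors_by_file[file].add(author)
    edges = set()
    for authors in authors_by_file.values():
        edges.update(combinations(sorted(authors), 2))
    return edges


def line_editing_paths(edits):
    """Ordered sequence of authors who edited each line."""
    paths = defaultdict(list)
    for author, file, line in edits:
        paths[(file, line)].append(author)
    return dict(paths)


def coediting_edges(edits):
    """Directed links from the previous to the subsequent author of a line."""
    edges = []
    for path in line_editing_paths(edits).values():
        for previous, subsequent in zip(path, path[1:]):
            if previous != subsequent:  # this sketch drops self-loops
                edges.append((previous, subsequent))
    return edges


print(coediting_edges(edits))
print(sorted(coauthorship_edges(edits)))
```

In this toy log, line 1 of a.py is edited by alice, bob, and carol in turn, so the co-editing network contains the links alice to bob and bob to carol, while the co-authorship network links all three authors because they edited the same file.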

Complexity

The module complexity provides the functionality to compute a variety of complexity measures for the commits and files in a git repository. Specifically, for all commits, we compute the number of editing events (events) and the total Levenshtein edit distance (levenshtein_distance) for all modified files. In addition, we compute the Halstead effort (HE), the cyclomatic complexity (CCN), the number of lines of code (NLOC), the number of tokens (TOK), and the number of functions (FUN) in all modified files before (*_pre) and after (*_post) each commit. We further compute the change (*_delta) for all complexity measures. As we show in this publication, the absolute value of the change in complexity can be used as a proxy for the productivity of developers in Open Source software projects.

  • compute_complexity: computes complexity measures for all mined commit/file combinations in a database mined with git2net.
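The measures described above can be illustrated with a short, self-contained sketch. It shows a textbook dynamic-programming implementation of the Levenshtein distance (git2net itself relies on the python-Levenshtein package) and how a *_delta value and its absolute value follow from *_pre and *_post. The function names and numbers are hypothetical.

```python
def levenshtein(a, b):
    """Levenshtein edit distance between two strings (row-wise dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def complexity_change(pre, post):
    """Return the change (post - pre) of a measure and its absolute value."""
    delta = post - pre
    return delta, abs(delta)


# Edit distance for a single modified line of code.
print(levenshtein("x = foo(a)", "x = foo(a, b)"))

# Made-up NLOC values of a file before and after a commit.
print(complexity_change(pre=120, post=95))
```

The absolute value is what matters for the productivity proxy mentioned above: removing 25 lines counts the same as adding 25 lines.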

API Reference

Extraction

Disambiguation

git2net.disambiguation.disambiguate_aliases_db(sqlite_db_file, method='gambit', **quargs)

Disambiguates author aliases in a given SQLite database mined with git2net. The disambiguation is performed using the Python package gambit. Internally, disambiguate_aliases_db calls the function gambit.disambiguate_aliases.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • method (str) – disambiguation method from {“gambit”, “bird”, “simple”}

  • **quargs – hyperparameters for the gambit and bird algorithms:
      gambit: thresh (float) – similarity threshold from the interval 0 to 1; sim (str) – similarity measure from {‘lev’, ‘jw’}
      bird: thresh (float) – similarity threshold from the interval 0 to 1

Returns:

creates new column with unique author_id in the commits table of the provided database
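A call with explicit hyperparameters could look like the sketch below. It first validates the arguments against the value ranges documented above and only then calls disambiguate_aliases_db. The helpers check_disambiguation_args and disambiguate_checked are hypothetical, and restricting sim to the gambit method is our reading of the parameter description rather than confirmed behaviour.

```python
ALLOWED_METHODS = {"gambit", "bird", "simple"}
ALLOWED_SIMILARITIES = {"lev", "jw"}


def check_disambiguation_args(method="gambit", **hyperparameters):
    """Raise ValueError if the arguments fall outside the documented ranges."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unknown method {method!r}")
    thresh = hyperparameters.get("thresh")
    if thresh is not None and not 0 <= thresh <= 1:
        raise ValueError("thresh must lie in the interval from 0 to 1")
    sim = hyperparameters.get("sim")
    if sim is not None:
        if method != "gambit":
            raise ValueError("sim is only documented for the gambit method")
        if sim not in ALLOWED_SIMILARITIES:
            raise ValueError(f"unknown similarity measure {sim!r}")


def disambiguate_checked(sqlite_db_file, method="gambit", **hyperparameters):
    """Validate the arguments, then run git2net's disambiguation on the database."""
    check_disambiguation_args(method, **hyperparameters)
    # Imported here so the validation can be used without git2net installed.
    import git2net
    git2net.disambiguate_aliases_db(sqlite_db_file, method=method, **hyperparameters)
```

For example, disambiguate_checked('git2net4analysis.db', method='gambit', thresh=0.95, sim='lev') would run gambit with a Levenshtein-based similarity; the threshold of 0.95 is an arbitrary illustrative value.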

Visualisation

Complexity
