Welcome to git2net’s documentation!

Overview and Installation

git2net is an Open Source Python package that facilitates the extraction of co-editing networks from git repositories.

Requirements

git2net is pure Python code. It has no platform-specific dependencies and thus works on all platforms. The only requirement is a version of Git >= 2.0.

Installing git2net

Assuming you are using pip, you can install the latest version of git2net from the command line by running:

$ pip install git2net

This command also installs the necessary dependencies, which are listed as install_requires in git2net’s setup file. Among others, git2net depends on the python-Levenshtein package to compute Levenshtein distances for edited lines of code. On systems running Windows, automatically compiling this C-based module might fail during installation. In this case, unofficial Windows binaries are available online, which might help you get started.
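
The Levenshtein distance that git2net computes for edited lines is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch of this measure; the installed python-Levenshtein package provides a much faster C implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Dynamic programming over prefixes; prev[j] holds the distance
    # between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein('kitten', 'sitting'))  # → 3
```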

Contributing to git2net

The source code for git2net is available in a repository on GitHub, which can be browsed at https://github.com/gotec/git2net.

If you find any bugs in git2net, please report them as issues there.

git2net is developed as an Open Source project. This means that your ideas and inputs are highly welcome. Feel free to share the project and contribute yourself. To get started, you can clone git2net’s repository as follows:

$ git clone git@github.com:gotec/git2net.git

Now uninstall any existing version of git2net and install a local version based on the cloned repository:

$ pip uninstall git2net
$ cd git2net
$ pip install -e .

This will also install git2net’s dependencies.

git2net provides a set of tests that you should run before creating a pull request. To do so, you will first need to unzip the test repository they are based on:

$ unzip test_repos/test_repo_1.zip -d test_repos/

Then, you can run the tests with:

$ pytest

Citing git2net

If you use git2net in your research, please consider citing the following publications:

@inproceedings{gote2019git2net,
    title={git2net: {M}ining time-stamped co-editing networks from large git repositories},
    author={Gote, Christoph and Scholtes, Ingo and Schweitzer, Frank},
    booktitle={Proceedings of the 16th International Conference on Mining Software Repositories},
    pages={433--444},
    year={2019},
    organization={IEEE Press}
}

@article{gote2021analysing,
    title={Analysing time-stamped co-editing networks in software development teams using git2net},
    author={Gote, Christoph and Scholtes, Ingo and Schweitzer, Frank},
    journal={Empirical Software Engineering},
    volume={26},
    number={4},
    pages={1--41},
    year={2021},
    publisher={Springer}
}

License

This software is licensed under the GNU Affero General Public License v3 (AGPL-3.0).

Getting Started

A central aim of git2net is allowing you to conveniently obtain and visualise network projections of editing activity in git repositories. Let’s have a look at an example on how you can achieve this:

import git2net
import pathpy as pp

github_url = 'gotec/git2net'
git_repo_dir = 'git2net4analysis'
sqlite_db_file = 'git2net4analysis.db'

# Clone and mine repository from GitHub
git2net.mine_github(github_url, git_repo_dir, sqlite_db_file)

# Disambiguate author aliases in the resulting database
git2net.disambiguate_aliases_db(sqlite_db_file)

# Obtain temporal bipartite network representing authors editing files over time
t, node_info, edge_info = git2net.get_bipartite_network(sqlite_db_file)

# Aggregate to a static network
n = pp.Network.from_temporal_network(t)

# Visualise the resulting network
colour_map = {'author': '#73D2DE', 'file': '#2E5EAA'}
node_color = {node: colour_map[node_info['class'][node]] for node in n.nodes}
pp.visualisation.plot(n, node_color=node_color)

In the example above, we used three functions of git2net. First, we extract edits from the repository using mine_github. Then, we disambiguate author identities using disambiguate_aliases_db. Finally, we obtain the temporal bipartite author-file network using get_bipartite_network, before aggregating and visualising it with pathpy.

Corresponding to the calls above, git2net’s functionality is partitioned into four modules: extraction, disambiguation, visualisation, and complexity. We outline the most important functions of each module here. For comprehensive details on all functions of git2net, we refer to the API reference.

Tutorials

To help you get started, we provide an extensive set of tutorials covering different aspects of analysing your repository with git2net. You can directly interact with the notebooks in Binder, or view them in NBViewer via the links below.

In addition, we provide links to the individual tutorial notebooks in the tabs below:

The first tutorial shows how to clone and prepare a git repository for analysis with git2net.

Usage Examples

We have published some motivating results as well as details on the mining algorithm in “git2net - Mining Time-Stamped Co-Editing Networks from Large git Repositories”.

In “Analysing Time-Stamped Co-Editing Networks in Software Development Teams using git2net”, we use git2net to mine more than 1.2 million commits of over 25,000 developers. We use this data to test a hypothesis on the relation between developer productivity and co-editing patterns in software teams.

Finally, in “Big Data = Big Insights? Operationalising Brooks’ Law in a Massive GitHub Data Set”, we mine a corpus containing over 200 GitHub repositories using git2net. Based on the resulting data, we study the relationship between team size and productivity in OSS development teams. If you want to use this extensive data set for your own study, we made it publicly available on zenodo.org.

Modules

git2net provides four modules—extraction, disambiguation, visualisation, and complexity.

Extraction

The module extraction contains all the functions that operate directly on a git repository. The most important functions in this module are:

  • mine_git_repo: mines edits from a locally cloned git repository to an SQLite database.

  • mine_github: creates a local clone and mines edits from a repository on GitHub to an SQLite database.

  • check_mining_complete: checks if a repository has been fully mined.

  • mining_state_summary: provides information on any commits that have not been fully mined.

Check out the API reference for the complete list of available functions.

Disambiguation

The module disambiguation contains a single function, which allows you to disambiguate author identities. The disambiguation is based on the algorithm gambit.

  • disambiguate_aliases_db: disambiguates author aliases in a database mined with git2net.
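
To illustrate the general idea behind identity disambiguation (this is a naive sketch, not gambit’s actual algorithm), aliases could be merged whenever they share an e-mail address:

```python
def group_aliases(aliases):
    """Assign a common id to (name, email) pairs sharing an e-mail address.

    Naive illustration only; gambit additionally compares name and
    e-mail similarity to merge aliases with differing addresses.
    """
    ids = {}      # normalised e-mail -> author id
    result = {}   # (name, email) -> author id
    for name, email in aliases:
        key = email.strip().lower()
        if key not in ids:
            ids[key] = len(ids)
        result[(name, email)] = ids[key]
    return result

aliases = [('Jane Doe', 'jane@example.com'),
           ('J. Doe', 'Jane@example.com'),
           ('John Roe', 'john@example.com')]
print(group_aliases(aliases))  # Jane's two aliases share one id
```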

Visualisation

The visualisation module provides functions to generate various network projections based on the SQLite database created during the mining process.

  • get_coediting_network: creates a co-editing network where nodes are authors who are connected by a directed link if they consecutively edited the same line of code. Links are directed from the previous to the subsequent author.

  • get_coauthorship_network: creates a co-authorship network where nodes are authors who are connected by an undirected link if they edited the same file.

  • get_bipartite_network: creates a bipartite network with nodes representing authors and files. Undirected links exist between an author and all files the author edited.

  • get_line_editing_paths: creates paths for all lines in a repository. The paths contain ordered sequences of authors who subsequently edited a line. The number of paths generated for a line depends on the number of forks and merges the line was involved in.

  • get_commit_editing_dag: creates a directed acyclic graph where nodes are commits. Commits are connected by a directed link if one commit modifies lines last edited in another commit. Links are directed from the editing commit to the edited commit.
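
The projection rule behind the co-editing network can be sketched on a toy edit log (the authors and the log format below are invented for illustration): whenever a line is edited, a directed link is added from the line’s previous author to the new author.

```python
from collections import defaultdict

# Toy edit log: (timestamp, file, line_no, author), one entry per edit.
edits = [
    (1, 'main.py', 10, 'alice'),
    (2, 'main.py', 10, 'bob'),    # bob edits alice's line -> link alice -> bob
    (3, 'main.py', 10, 'carol'),  # carol edits bob's line -> link bob -> carol
    (4, 'util.py', 3,  'alice'),  # first edit of this line, no link
]

last_author = {}          # (file, line) -> most recent author
links = defaultdict(int)  # (previous_author, new_author) -> weight
for _, path, line, author in sorted(edits):
    key = (path, line)
    if key in last_author and last_author[key] != author:
        links[(last_author[key], author)] += 1
    last_author[key] = author

print(dict(links))  # → {('alice', 'bob'): 1, ('bob', 'carol'): 1}
```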

Complexity

The module complexity provides the functionality to compute a variety of complexity measures for the commits and files in a git repository. Specifically, for all commits, we compute the number of editing events (events) and the total Levenshtein edit distance (levenshtein_distance) for all modified files. In addition, we compute the Halstead effort (HE), the cyclomatic complexity (CCN), the number of lines of code (NLOC), the number of tokens (TOK), and the number of functions (FUN) in all modified files before (*_pre) and after (*_post) each commit. We further compute the change (*_delta) for all complexity measures. As we show in this publication, the absolute value of the change in complexity can be used as a proxy for the productivity of developers in Open Source software projects.

  • compute_complexity: computes complexity measures for all mined commit/file combinations in a database mined with git2net.
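
As a toy illustration of the *_pre/*_post/*_delta pattern, consider only the NLOC measure (the line-counting rule below is a simplification; git2net relies on established code-analysis tooling for measures such as HE and CCN):

```python
def nloc(source: str) -> int:
    """Count non-blank lines of code (a simplified NLOC)."""
    return sum(1 for line in source.splitlines() if line.strip())

# File content before and after a hypothetical commit.
before = "def f(x):\n    return x\n"
after = "def f(x):\n    if x < 0:\n        return -x\n    return x\n"

nloc_pre = nloc(before)
nloc_post = nloc(after)
nloc_delta = nloc_post - nloc_pre
print(nloc_pre, nloc_post, nloc_delta)  # → 2 4 2
```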

API Reference

Extraction

git2net.extraction.check_mining_complete(git_repo_dir, sqlite_db_file, commits=[], all_branches=False, return_number_missing=False)

Checks the status of a mining operation.

Parameters:
  • git_repo_dir (str) – path to the git repository that is mined

  • sqlite_db_file (str) – path (including the database name) of the SQLite database

  • commits (List[str]) – only consider specific set of commits, considers all if empty

Returns:

bool – True if all commits are included in the database, otherwise False

git2net.extraction.get_commit_dag(git_repo_dir)

Extracts commit dag from given path to git repository.

Parameters:

git_repo_dir (str) – path to the git repository that is mined

Returns:

pathpy.DAG – dag linking successive commits in the same branch

git2net.extraction.get_unified_changes(git_repo_dir, commit_hash, file_path)

Returns dataframe with github-like unified diff representation of the content of a file before and after a commit for a given git repository, commit hash and file path.

Parameters:
  • git_repo_dir (str) – path to the git repository that is mined

  • commit_hash (str) – commit hash for which the changes are computed

  • file_path (str) – path to file (within the repository) for which the changes are computed

Returns:

pandas.DataFrame – pandas dataframe listing changes made to file in commit

git2net.extraction.identify_file_renaming(git_repo_dir)

Identifies all names and locations that different files in a repository have had.

Parameters:

git_repo_dir (str) – path to the git repository that is mined

Returns:

  • pathpy.DAG – pathpy DAG object depicting the renaming process

  • dict – dictionary containing all aliases for all files

git2net.extraction.is_binary_file(filename, file_content)

Detects if a file with given content is a binary file.

Parameters:
  • filename (str) – name of the file including its file extension

  • file_content (str) – content of the file

Returns:

bool – True if binary file is detected, otherwise False

git2net.extraction.mine_git_repo(git_repo_dir, sqlite_db_file, commits=[], use_blocks=False, no_of_processes=2, chunksize=1, exclude=[], blame_C='', blame_w=False, max_modifications=0, timeout=0, extract_text=False, extract_merges=True, extract_merge_deletions=False, all_branches=False)

Creates sqlite database with details on commits and edits for a given git repository.

Parameters:
  • git_repo_dir (str) – path to the git repository that is mined

  • sqlite_db_file (str) – path (including database name) where the sqlite database will be created

  • commits (List[str]) – only consider specific set of commits, considers all if empty

  • use_blocks (bool) – determines if the analysis is performed on a block or line basis

  • no_of_processes (int) – number of parallel processes that are spawned

  • chunksize (int) – number of tasks that are assigned to a process at a time

  • exclude (List[str]) – file paths that are excluded from the analysis

  • blame_C (str) – string for the blame C option following the pattern “-C[<num>]” (computationally expensive)

  • blame_w (bool) – ignore whitespace in git blame (-w option)

  • max_modifications (int) – ignore commits with more than this number of modifications

  • timeout (int) – stop processing commit after given time in seconds

  • extract_text (bool) – extract the commit message and line texts

  • extract_merges (bool) – process merges

  • extract_merge_deletions (bool) – extract lines that are not accepted during a merge as ‘deletions’

Returns:

SQLite database will be written at specified location

git2net.extraction.mine_github(github_url, git_repo_dir, sqlite_db_file, branch=None, **kwargs)

Clones a repository from GitHub and starts the mining process.

Parameters:
  • github_url (str) – URL of the publicly accessible GitHub project that will be mined; can be provided as a full URL or as <OWNER>/<REPOSITORY>

  • git_repo_dir (str) – path to the git repository that is mined; if the path ends with ‘/’ an additional folder will be created

  • sqlite_db_file (str) – path (including database name) where the sqlite database will be created

  • branch (str) – The branch of the github project that will be checked out and mined. If no branch is provided the default branch of the repository is used.

  • **kwargs – arguments that will be passed on to mine_git_repo

Returns:

  • git repository will be cloned to specified location

  • SQLite database will be written at specified location

git2net.extraction.mining_state_summary(git_repo_dir, sqlite_db_file, all_branches=False)

Prints mining progress of database and returns dataframe with details on missing commits.

Parameters:
  • git_repo_dir (str) – path to the git repository that is mined

  • sqlite_db_file (str) – path (including the database name) of the SQLite database

Returns:

pandas.DataFrame – dataframe with details on missing commits

git2net.extraction.text_entropy(text)

Computes entropy for a given text based on UTF8 alphabet.

Parameters:

text (str) – string to compute the text entropy for

Returns:

float – text entropy of the given string
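
The character-level Shannon entropy this function computes can be sketched as follows (a simplified re-implementation for illustration, not git2net’s exact code):

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character distribution of text."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    # H = sum over characters of p * log2(1/p)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(char_entropy('aaaa'))  # → 0.0 (a single symbol carries no information)
print(char_entropy('abab'))  # → 1.0 (two equiprobable symbols)
```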

Disambiguation

git2net.disambiguation.disambiguate_aliases_db(sqlite_db_file, method='gambit', **quargs)

Disambiguates author aliases in a given SQLite database mined with git2net. The disambiguation is performed using the Python package gambit. Internally, disambiguate_aliases_db calls the function gambit.disambiguate_aliases.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • method (str) – disambiguation method from {“gambit”, “bird”, “simple”}

  • **quargs – hyperparameters for the gambit and bird algorithms; gambit: thresh (float) – similarity threshold from interval 0 to 1, sim (str) – similarity measure from {‘lev’, ‘jw’}, bird: thresh (float) – similarity threshold from interval 0 to 1

Returns:

creates new column with unique author_id in the commits table of the provided database

Visualisation

git2net.visualisation.get_bipartite_network(sqlite_db_file, author_identifier='author_id', time_from=None, time_to=None)

Returns temporal bipartite network containing time-stamped file-author relationships for given time window.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • time_from (datetime.datetime) – start time of time window filter, datetime object

  • time_to (datetime.datetime) – end time of time window filter, datetime object

Returns:

  • pathpy.TemporalNetwork – bipartite network

  • dict – info on node characteristics, e.g. membership in bipartite class

  • dict – info on edge characteristics

git2net.visualisation.get_coauthorship_network(sqlite_db_file, author_identifier='author_id', time_from=None, time_to=None)

Returns coauthorship network containing links between authors who coedited at least one code file within a given time window.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • time_from (datetime.datetime) – start time of time window filter

  • time_to (datetime.datetime) – end time of time window filter

Returns:

  • pathpy.Network – coauthorship network

  • dict – info on node characteristics

  • dict – info on edge characteristics

git2net.visualisation.get_coediting_network(sqlite_db_file, author_identifier='author_id', time_from=None, time_to=None, engine='pathpy')

Returns coediting network containing links between authors who coedited at least one line of code within a given time window.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • time_from (datetime.datetime) – start time of time window filter

  • time_to (datetime.datetime) – end time of time window filter

Returns:

  • pathpy.TemporalNetwork – coediting network

  • dict – info on node characteristics

  • dict – info on edge characteristics

git2net.visualisation.get_commit_editing_dag(sqlite_db_file, time_from=None, time_to=None, filename=None)

Returns DAG of commits where an edge between commit A and B indicates that lines written in commit A were changed in commit B. Further outputs editing paths extracted from the DAG.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • time_from (datetime.datetime) – start time of time window filter, datetime object

  • time_to (datetime.datetime) – end time of time window filter, datetime object

  • filename (str) – filter to obtain only commits editing a certain file

Returns:

  • pathpy.DAG – commit editing dag

  • dict – info on node characteristics

  • dict – info on edge characteristics

git2net.visualisation.get_line_editing_paths(sqlite_db_file, git_repo_dir, author_identifier='author_id', commit_hashes=None, file_paths=None, with_start=False, merge_renaming=False)

Returns line editing DAG as well as line editing paths.

Parameters:
  • sqlite_db_file (str) – path to SQLite database mined with git2net line method

  • git_repo_dir (str) – path to the git repository that is mined

  • commit_hashes (List[str]) – list of commits to consider, by default all commits are considered

  • file_paths (List[str]) – list of files to consider, by default all files are considered

  • with_start (bool) – determines if node for filename is included as start for all editing paths

  • merge_renaming (bool) – determines if file renaming is considered

Returns:

  • pathpy.Paths – line editing paths

  • pathpy.DAG – line editing directed acyclic graph

  • dict – info on node characteristics

  • dict – info on edge characteristics

Complexity

git2net.complexity.compute_complexity(git_repo_dir, sqlite_db_file, no_of_processes=2, read_chunksize=1000000.0, write_chunksize=100)

Computes complexity measures for all mined commit/file combinations in a given database. Computing complexities for merge commits is currently not supported.

Parameters:
  • git_repo_dir (str) – path to the git repository that is analysed

  • sqlite_db_file (str) – path to the SQLite database containing the mined commits

  • no_of_processes (int) – number of parallel processes that are spawned

  • read_chunksize (int) – number of commit/file combinations that are processed at once

  • write_chunksize (int) – number of commit/file combinations for which complexities are written at once

Returns:

adds table complexity containing computed complexity measures for all commit/file combinations.
