API Reference

Extraction

git2net.extraction.check_mining_complete(git_repo_dir, sqlite_db_file, commits=[], all_branches=False, return_number_missing=False)

Checks the status of a mining operation.

Parameters:
  • git_repo_dir (str) – path to the git repository that is mined

  • sqlite_db_file (str) – path (including database name) of the SQLite database

  • commits (List[str]) – only consider a specific set of commits; considers all if empty

  • all_branches (bool) – consider commits from all branches, not only the currently checked-out one

  • return_number_missing (bool) – additionally return the number of commits missing from the database

Returns:

bool – True if all commits are included in the database, otherwise False

git2net.extraction.get_commit_dag(git_repo_dir)

Extracts the commit DAG from a given git repository.

Parameters:

git_repo_dir (str) – path to the git repository that is mined

Returns:

pathpy.DAG – DAG linking successive commits on the same branch

git2net.extraction.get_unified_changes(git_repo_dir, commit_hash, file_path)

Returns a dataframe with a GitHub-like unified diff representation of the content of a file before and after a commit, for a given git repository, commit hash, and file path.

Parameters:
  • git_repo_dir (str) – path to the git repository that is mined

  • commit_hash (str) – commit hash for which the changes are computed

  • file_path (str) – path to file (within the repository) for which the changes are computed

Returns:

pandas.DataFrame – pandas dataframe listing changes made to file in commit
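Calling the function requires a mined repository. As a stdlib-only illustration of what a unified diff captures (independent of git2net), consider:

```python
import difflib

# Stdlib-only illustration of a unified diff; get_unified_changes computes
# comparable information from the repository itself and returns it as a
# pandas DataFrame with one row per line.
before = ["def f(x):", "    return x"]
after = ["def f(x):", "    # doubled", "    return 2 * x"]
diff = list(difflib.unified_diff(before, after, lineterm=""))
# diff contains '+'-prefixed added lines and '-'-prefixed removed lines
```

A call such as get_unified_changes('path/to/repo', '<commit-hash>', 'src/module.py') (paths hypothetical) returns the same kind of information for an actual commit.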

git2net.extraction.identify_file_renaming(git_repo_dir)

Identifies all names and locations that the files in a repository have had over their history.

Parameters:

git_repo_dir (str) – path to the git repository that is mined

Returns:

  • pathpy.DAG – pathpy DAG object depicting the renaming process

  • dict – dictionary containing all aliases for all files

git2net.extraction.is_binary_file(filename, file_content)

Detects if a file with given content is a binary file.

Parameters:
  • filename (str) – name of the file including its file extension

  • file_content (str) – content of the file

Returns:

bool – True if binary file is detected, otherwise False
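A minimal sketch of one common detection heuristic (NUL characters in the content). This is an illustration only, not necessarily how git2net implements the check; the real function also receives the filename, so the extension can inform the decision:

```python
def looks_binary(file_content):
    # Simplified heuristic: NUL characters rarely occur in text files.
    # git2net.extraction.is_binary_file additionally receives the filename
    # and may use different logic.
    return "\x00" in file_content

looks_binary("plain text file")     # False
looks_binary("PK\x03\x04\x00\x00")  # True
```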

git2net.extraction.mine_git_repo(git_repo_dir, sqlite_db_file, commits=[], use_blocks=False, no_of_processes=2, chunksize=1, exclude=[], blame_C='', blame_w=False, max_modifications=0, timeout=0, extract_text=False, extract_merges=True, extract_merge_deletions=False, all_branches=False)

Creates an SQLite database with details on commits and edits for a given git repository.

Parameters:
  • git_repo_dir (str) – path to the git repository that is mined

  • sqlite_db_file (str) – path (including database name) where the sqlite database will be created

  • commits (List[str]) – only consider a specific set of commits; considers all if empty

  • use_blocks (bool) – determines if the analysis is performed on a block or line basis

  • no_of_processes (int) – number of parallel processes that are spawned

  • chunksize (int) – number of tasks that are assigned to a process at a time

  • exclude (List[str]) – file paths that are excluded from the analysis

  • blame_C (str) – string for the blame C option following the pattern “-C[<num>]” (computationally expensive)

  • blame_w (bool) – ignore whitespaces in git blame (-w option)

  • max_modifications (int) – ignore commits with more modifications than this value (0 disables the limit)

  • timeout (int) – stop processing a commit after the given time in seconds (0 disables the timeout)

  • extract_text (bool) – extract the commit message and line texts

  • extract_merges (bool) – process merges

  • extract_merge_deletions (bool) – extract lines that are not accepted during a merge as ‘deletions’

  • all_branches (bool) – consider commits from all branches, not only the currently checked-out one

Returns:

SQLite database will be written at specified location
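A typical invocation might look as follows. The repository path and database name are hypothetical, and the mining call is commented out because it requires a local clone with git2net installed:

```python
# Requires git2net and a local clone; paths below are hypothetical.
exclude = ["vendor", "docs"]  # file paths skipped during the analysis

# import git2net
# git2net.mine_git_repo(
#     "path/to/repo", "repo.db",
#     no_of_processes=4,       # parallel worker processes
#     exclude=exclude,
#     max_modifications=1000,  # skip unusually large commits
#     extract_text=True,       # keep commit messages and line texts
# )
```

Mining can be resumed by re-running the same call; check_mining_complete reports whether all commits have been processed.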

git2net.extraction.mine_github(github_url, git_repo_dir, sqlite_db_file, branch=None, **kwargs)

Clones a repository from GitHub and starts the mining process.

Parameters:
  • github_url (str) – URL of the publicly accessible GitHub project that will be mined; can be provided as a full URL or as <OWNER>/<REPOSITORY>

  • git_repo_dir (str) – path to the git repository that is mined; if the path ends with ‘/’, an additional folder will be created

  • sqlite_db_file (str) – path (including database name) where the sqlite database will be created

  • branch (str) – The branch of the github project that will be checked out and mined. If no branch is provided the default branch of the repository is used.

  • **kwargs – arguments that will be passed on to mine_git_repo

Returns:

  • git repository will be cloned to specified location

  • SQLite database will be written at specified location
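Since the project can be given either as a full URL or as <OWNER>/<REPOSITORY>, a small hypothetical helper makes the normalisation explicit. The mining call itself is commented out because it clones over the network:

```python
def to_github_url(spec):
    # Hypothetical helper mirroring the two accepted input forms:
    # a full URL, or the short '<OWNER>/<REPOSITORY>' form.
    if spec.startswith("http://") or spec.startswith("https://"):
        return spec
    return "https://github.com/" + spec

url = to_github_url("gotec/git2net")
# import git2net
# git2net.mine_github(url, "git2net4analysis", "git2net4analysis.db")
```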

git2net.extraction.mining_state_summary(git_repo_dir, sqlite_db_file, all_branches=False)

Prints mining progress of database and returns dataframe with details on missing commits.

Parameters:
  • git_repo_dir (str) – path to the git repository that is mined

  • sqlite_db_file (str) – path (including database name) of the SQLite database

  • all_branches (bool) – consider commits from all branches, not only the currently checked-out one

Returns:

pandas.DataFrame – dataframe with details on missing commits

git2net.extraction.text_entropy(text)

Computes the entropy of a given text based on the UTF-8 alphabet.

Parameters:

text (str) – string to compute the text entropy for

Returns:

float – text entropy of the given string
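The measure is the Shannon entropy of the character distribution of the string. A self-contained sketch of an entropy computation of this kind (git2net's exact alphabet handling and normalisation may differ):

```python
import math
from collections import Counter

def char_entropy(text):
    # Shannon entropy (in bits) of the character distribution of `text`.
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

char_entropy("aaaa")  # 0.0 – a single repeated symbol carries no information
char_entropy("ab")    # 1.0 – two equally likely symbols require one bit
```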

Disambiguation

git2net.disambiguation.disambiguate_aliases_db(sqlite_db_file, method='gambit', **quargs)

Disambiguates author aliases in a given SQLite database mined with git2net. The disambiguation is performed using the Python package gambit. Internally, disambiguate_aliases_db calls the function gambit.disambiguate_aliases.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • method (str) – disambiguation method from {“gambit”, “bird”, “simple”}

  • **quargs – hyperparameters for the gambit and bird algorithms. For gambit: thresh (float) – similarity threshold from the interval [0, 1]; sim (str) – similarity measure from {‘lev’, ‘jw’}. For bird: thresh (float) – similarity threshold from the interval [0, 1]

Returns:

creates new column with unique author_id in the commits table of the provided database
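The effect on the database can be illustrated with an in-memory SQLite table (schema heavily simplified; a real call such as git2net.disambiguate_aliases_db('repo.db', method='gambit') populates the column in a mined database):

```python
import sqlite3

# Simplified illustration of the result: two aliases of the same person
# end up sharing one author_id. The commits table mined by git2net has
# many more columns than shown here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE commits (author_name TEXT, author_email TEXT)")
con.executemany("INSERT INTO commits VALUES (?, ?)",
                [("Jane Doe", "jane@example.org"),
                 ("J. Doe", "jane@example.org")])
con.execute("ALTER TABLE commits ADD COLUMN author_id INTEGER")
con.execute("UPDATE commits SET author_id = 0")  # both aliases -> one author
rows = con.execute("SELECT author_name, author_id FROM commits").fetchall()
# rows == [('Jane Doe', 0), ('J. Doe', 0)]
```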

Visualisation

git2net.visualisation.get_bipartite_network(sqlite_db_file, author_identifier='author_id', time_from=None, time_to=None)

Returns temporal bipartite network containing time-stamped file-author relationships for given time window.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • author_identifier (str) – identifier by which authors are distinguished; one of ‘author_id’ (requires prior disambiguation with disambiguate_aliases_db), ‘author_name’, or ‘author_email’

  • time_from (datetime.datetime) – start time of time window filter, datetime object

  • time_to (datetime.datetime) – end time of time window filter, datetime object

Returns:

  • pathpy.TemporalNetwork – bipartite network

  • dict – info on node characteristics, e.g. membership in bipartite class

  • dict – info on edge characteristics
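A sketch of restricting the network to one year of activity; 'repo.db' is a hypothetical database previously mined with git2net, so the call itself is commented out:

```python
from datetime import datetime

time_from = datetime(2020, 1, 1)  # start of the time window
time_to = datetime(2021, 1, 1)    # end of the time window

# import git2net
# t, node_info, edge_info = git2net.get_bipartite_network(
#     "repo.db", time_from=time_from, time_to=time_to)
```

The same time-window filtering applies to the other network functions below.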

git2net.visualisation.get_coauthorship_network(sqlite_db_file, author_identifier='author_id', time_from=None, time_to=None)

Returns coauthorship network containing links between authors who coedited at least one code file within a given time window.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • author_identifier (str) – identifier by which authors are distinguished; one of ‘author_id’ (requires prior disambiguation), ‘author_name’, or ‘author_email’

  • time_from (datetime.datetime) – start time of time window filter

  • time_to (datetime.datetime) – end time of time window filter

Returns:

  • pathpy.Network – coauthorship network

  • dict – info on node characteristics

  • dict – info on edge characteristics

git2net.visualisation.get_coediting_network(sqlite_db_file, author_identifier='author_id', time_from=None, time_to=None, engine='pathpy')

Returns coediting network containing links between authors who coedited at least one line of code within a given time window.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • author_identifier (str) – identifier by which authors are distinguished; one of ‘author_id’ (requires prior disambiguation), ‘author_name’, or ‘author_email’

  • time_from (datetime.datetime) – start time of time window filter

  • time_to (datetime.datetime) – end time of time window filter

Returns:

  • pathpy.TemporalNetwork – coediting network

  • dict – info on node characteristics

  • dict – info on edge characteristics

git2net.visualisation.get_commit_editing_dag(sqlite_db_file, time_from=None, time_to=None, filename=None)

Returns DAG of commits where an edge between commit A and B indicates that lines written in commit A were changed in commit B. Further outputs editing paths extracted from the DAG.

Parameters:
  • sqlite_db_file (str) – path to SQLite database

  • time_from (datetime.datetime) – start time of time window filter, datetime object

  • time_to (datetime.datetime) – end time of time window filter, datetime object

  • filename (str) – filter to obtain only commits editing a certain file

Returns:

  • pathpy.DAG – commit editing DAG

  • dict – info on node characteristics

  • dict – info on edge characteristics

git2net.visualisation.get_line_editing_paths(sqlite_db_file, git_repo_dir, author_identifier='author_id', commit_hashes=None, file_paths=None, with_start=False, merge_renaming=False)

Returns line editing DAG as well as line editing paths.

Parameters:
  • sqlite_db_file (str) – path to SQLite database mined with git2net line method

  • git_repo_dir (str) – path to the git repository that is mined

  • author_identifier (str) – identifier by which authors are distinguished; one of ‘author_id’ (requires prior disambiguation), ‘author_name’, or ‘author_email’

  • commit_hashes (List[str]) – list of commits to consider, by default all commits are considered

  • file_paths (List[str]) – list of files to consider, by default all files are considered

  • with_start (bool) – determines if a node for the filename is included as the start of all editing paths

  • merge_renaming (bool) – determines if file renaming is considered

Returns:

  • pathpy.Paths – line editing paths

  • pathpy.DAG – line editing directed acyclic graph

  • dict – info on node characteristics

  • dict – info on edge characteristics

Complexity

git2net.complexity.compute_complexity(git_repo_dir, sqlite_db_file, no_of_processes=2, read_chunksize=1000000.0, write_chunksize=100)

Computes complexity measures for all mined commit/file combinations in a given database. Computing complexities for merge commits is currently not supported.

Parameters:
  • git_repo_dir (str) – path to the git repository that is analysed

  • sqlite_db_file (str) – path to the SQLite database containing the mined commits

  • no_of_processes (int) – number of parallel processes that are spawned

  • read_chunksize (int) – number of commit/file combinations that are processed at once

  • write_chunksize (int) – number of commit/file combinations for which complexities are written at once

Returns:

adds table complexity containing computed complexity measures for all commit/file combinations.
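The two chunk sizes trade memory for I/O: read_chunksize bounds how many commit/file combinations are held in memory at once, while write_chunksize controls how many results are buffered before each write. A sketch (paths hypothetical; the call is commented out since it needs a mined database):

```python
import math

# With 250 commit/file combinations and write_chunksize=100, results are
# flushed to the database in ceil(250 / 100) = 3 batches.
n_combinations, write_chunksize = 250, 100
n_writes = math.ceil(n_combinations / write_chunksize)

# import git2net
# git2net.compute_complexity("path/to/repo", "repo.db",
#                            no_of_processes=4, write_chunksize=100)
```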