API Reference
Extraction
- git2net.extraction.check_mining_complete(git_repo_dir, sqlite_db_file, commits=[], all_branches=False, return_number_missing=False)
Checks the status of a mining operation.
- Parameters:
git_repo_dir (str) – path to the git repository that is mined
sqlite_db_file (str) – path (including the database name) of the SQLite database
commits (List[str]) – only consider specific set of commits, considers all if empty
- Returns:
bool – True if all commits are included in the database, otherwise False
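A minimal usage sketch; the repository and database paths below are placeholders:

    from git2net import extraction

    # True once every commit of the repository is contained in the database
    complete = extraction.check_mining_complete('path/to/repo', 'repo.db')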
- git2net.extraction.get_commit_dag(git_repo_dir)
Extracts commit dag from given path to git repository.
- Parameters:
git_repo_dir (str) – path to the git repository that is mined
- Returns:
pathpy.DAG – dag linking successive commits in the same branch
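For illustration, the DAG can be obtained as follows (the repository path is a placeholder):

    from git2net import extraction

    # dag links successive commits on the same branch
    dag = extraction.get_commit_dag('path/to/repo')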
- git2net.extraction.get_unified_changes(git_repo_dir, commit_hash, file_path)
Returns a dataframe with a GitHub-like unified diff representation of a file's content before and after a given commit, for a given git repository, commit hash, and file path.
- Parameters:
git_repo_dir (str) – path to the git repository that is mined
commit_hash (str) – commit hash for which the changes are computed
file_path (str) – path to file (within the repository) for which the changes are computed
- Returns:
pandas.DataFrame – pandas dataframe listing changes made to file in commit
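A usage sketch, assuming a local repository; the commit hash and file path are placeholders:

    from git2net import extraction

    changes = extraction.get_unified_changes('path/to/repo', '<commit_hash>', 'src/main.py')
    print(changes.head())  # unified diff of the file in the given commit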
- git2net.extraction.identify_file_renaming(git_repo_dir)
Identifies all names and locations that the files in a repository have had over their history.
- Parameters:
git_repo_dir (str) – path to the git repository that is mined
- Returns:
pathpy.DAG – pathpy DAG object depicting the renaming process
dict – dictionary containing all aliases for all files
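For example (the repository path is a placeholder):

    from git2net import extraction

    # dag depicts the renaming process, aliases maps the names each file has had
    dag, aliases = extraction.identify_file_renaming('path/to/repo')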
- git2net.extraction.is_binary_file(filename, file_content)
Detects if a file with given content is a binary file.
- Parameters:
filename (str) – name of the file including its file extension
file_content (str) – content of the file
- Returns:
bool – True if binary file is detected, otherwise False
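A minimal sketch with made-up file name and content:

    from git2net import extraction

    is_binary = extraction.is_binary_file('README.md', '# git2net\nSome text content.')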
- git2net.extraction.mine_git_repo(git_repo_dir, sqlite_db_file, commits=[], use_blocks=False, no_of_processes=2, chunksize=1, exclude=[], blame_C='', blame_w=False, max_modifications=0, timeout=0, extract_text=False, extract_merges=True, extract_merge_deletions=False, all_branches=False)
Creates sqlite database with details on commits and edits for a given git repository.
- Parameters:
git_repo_dir (str) – path to the git repository that is mined
sqlite_db_file (str) – path (including database name) where the sqlite database will be created
commits (List[str]) – only consider specific set of commits, considers all if empty
use_blocks (bool) – determines if the analysis is performed on a block or line basis
no_of_processes (int) – number of parallel processes that are spawned
chunksize (int) – number of tasks that are assigned to a process at a time
exclude (List[str]) – file paths that are excluded from the analysis
blame_C (str) – string for the blame C option following the pattern “-C[<num>]” (computationally expensive)
blame_w (bool) – ignore whitespace changes in git blame (-w option)
max_modifications (int) – ignore commits with more modifications than the given number
timeout (int) – stop processing a commit after the given time in seconds
extract_text (bool) – extract the commit message and line texts
extract_merges (bool) – process merges
extract_merge_deletions (bool) – extract lines that are not accepted during a merge as ‘deletions’
- Returns:
SQLite database will be written to the specified location
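A typical invocation, assuming a local repository; paths and options are illustrative:

    from git2net import extraction

    # Writes commit and edit details to repo.db using four parallel processes
    extraction.mine_git_repo('path/to/repo', 'repo.db',
                             no_of_processes=4, extract_text=True)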
- git2net.extraction.mine_github(github_url, git_repo_dir, sqlite_db_file, branch=None, **kwargs)
Clones a repository from GitHub and starts the mining process.
- Parameters:
github_url (str) – URL of the publicly accessible GitHub project that will be mined; can be provided as a full URL or as <OWNER>/<REPOSITORY>
git_repo_dir (str) – path to which the git repository is cloned and mined; if the path ends with ‘/’, an additional folder will be created
sqlite_db_file (str) – path (including database name) where the sqlite database will be created
branch (str) – branch of the GitHub project that will be checked out and mined. If no branch is provided, the default branch of the repository is used.
**kwargs – arguments that will be passed on to mine_git_repo
- Returns:
git repository will be cloned to the specified location
SQLite database will be written to the specified location
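For example (the owner/repository placeholder must point to a public GitHub project):

    from git2net import extraction

    # Clones the project and forwards additional options to mine_git_repo
    extraction.mine_github('<OWNER>/<REPOSITORY>', 'path/to/clone/', 'repo.db',
                           extract_text=True)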
- git2net.extraction.mining_state_summary(git_repo_dir, sqlite_db_file, all_branches=False)
Prints mining progress of database and returns dataframe with details on missing commits.
- Parameters:
git_repo_dir (str) – path to the git repository that is mined
sqlite_db_file (str) – path (including the database name) of the SQLite database
- Returns:
pandas.DataFrame – dataframe with details on missing commits
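A usage sketch with placeholder paths:

    from git2net import extraction

    # Prints progress and returns a dataframe describing commits not yet in the database
    missing = extraction.mining_state_summary('path/to/repo', 'repo.db')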
- git2net.extraction.text_entropy(text)
Computes the entropy of a given text based on the UTF-8 alphabet.
- Parameters:
text (str) – string to compute the text entropy for
- Returns:
float – text entropy of the given string
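For example:

    from git2net import extraction

    # Entropy of the character distribution of the string
    print(extraction.text_entropy('git2net mines git repositories'))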
Disambiguation
- git2net.disambiguation.disambiguate_aliases_db(sqlite_db_file, method='gambit', **quargs)
Disambiguates author aliases in a given SQLite database mined with git2net. The disambiguation is performed using the Python package gambit. Internally, disambiguate_aliases_db calls the function gambit.disambiguate_aliases.
- Parameters:
sqlite_db_file (str) – path to SQLite database
method (str) – disambiguation method from {“gambit”, “bird”, “simple”}
**quargs – hyperparameters for the gambit and bird algorithms. gambit: thresh (float) – similarity threshold from the interval [0, 1]; sim (str) – similarity measure from {‘lev’, ‘jw’}. bird: thresh (float) – similarity threshold from the interval [0, 1].
- Returns:
creates a new column with a unique author_id in the commits table of the provided database
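A minimal sketch; the database path and threshold value are only examples:

    from git2net import disambiguation

    # Adds an author_id column to the commits table of repo.db
    disambiguation.disambiguate_aliases_db('repo.db', method='gambit', thresh=0.45)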
Visualisation
- git2net.visualisation.get_bipartite_network(sqlite_db_file, author_identifier='author_id', time_from=None, time_to=None)
Returns temporal bipartite network containing time-stamped file-author relationships for given time window.
- Parameters:
sqlite_db_file (str) – path to SQLite database
time_from (datetime.datetime) – start time of time window filter, datetime object
time_to (datetime.datetime) – end time of time window filter, datetime object
- Returns:
pathpy.TemporalNetwork – bipartite network
dict – info on node characteristics, e.g. membership in the bipartite classes
dict – info on edge characteristics
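A usage sketch with a placeholder database path:

    from git2net import visualisation

    t, node_info, edge_info = visualisation.get_bipartite_network('repo.db')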
- git2net.visualisation.get_coauthorship_network(sqlite_db_file, author_identifier='author_id', time_from=None, time_to=None)
Returns coauthorship network containing links between authors who coedited at least one code file within a given time window.
- Parameters:
sqlite_db_file (str) – path to SQLite database
time_from (datetime.datetime) – start time of time window filter
time_to (datetime.datetime) – end time of time window filter
- Returns:
pathpy.Network – coauthorship network
dict – info on node characteristics
dict – info on edge characteristics
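For example, restricted to a time window (dates and database path are illustrative):

    import datetime
    from git2net import visualisation

    n, node_info, edge_info = visualisation.get_coauthorship_network(
        'repo.db',
        time_from=datetime.datetime(2020, 1, 1),
        time_to=datetime.datetime(2021, 1, 1))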
- git2net.visualisation.get_coediting_network(sqlite_db_file, author_identifier='author_id', time_from=None, time_to=None, engine='pathpy')
Returns coediting network containing links between authors who coedited at least one line of code within a given time window.
- Parameters:
sqlite_db_file (str) – path to SQLite database
time_from (datetime.datetime) – start time of time window filter
time_to (datetime.datetime) – end time of time window filter
- Returns:
pathpy.TemporalNetwork – coediting network
dict – info on node characteristics
dict – info on edge characteristics
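A sketch using disambiguated author identities (the database path is a placeholder):

    from git2net import visualisation

    t, node_info, edge_info = visualisation.get_coediting_network(
        'repo.db', author_identifier='author_id')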
- git2net.visualisation.get_commit_editing_dag(sqlite_db_file, time_from=None, time_to=None, filename=None)
Returns DAG of commits where an edge between commit A and B indicates that lines written in commit A were changed in commit B. Further outputs editing paths extracted from the DAG.
- Parameters:
sqlite_db_file (str) – path to SQLite database
time_from (datetime.datetime) – start time of time window filter, datetime object
time_to (datetime.datetime) – end time of time window filter, datetime object
filename (str) – filter to obtain only commits editing a certain file
- Returns:
pathpy.DAG – commit editing dag
dict – info on node characteristics
dict – info on edge characteristics
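For example, restricted to a single file (file name and database path are placeholders):

    from git2net import visualisation

    dag, node_info, edge_info = visualisation.get_commit_editing_dag(
        'repo.db', filename='src/main.py')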
- git2net.visualisation.get_line_editing_paths(sqlite_db_file, git_repo_dir, author_identifier='author_id', commit_hashes=None, file_paths=None, with_start=False, merge_renaming=False)
Returns line editing DAG as well as line editing paths.
- Parameters:
sqlite_db_file (str) – path to SQLite database mined with git2net line method
git_repo_dir (str) – path to the git repository that is mined
commit_hashes (List[str]) – list of commits to consider, by default all commits are considered
file_paths (List[str]) – list of files to consider, by default all files are considered
with_start (bool) – determines if a node representing the filename is included as the start of all editing paths
merge_renaming (bool) – determines if file renaming is considered
- Returns:
pathpy.Paths – line editing paths
pathpy.DAG – line editing directed acyclic graph
dict – info on node characteristics
dict – info on edge characteristics
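A usage sketch, assuming the database was mined on a line basis (use_blocks=False); paths are placeholders:

    from git2net import visualisation

    paths, dag, node_info, edge_info = visualisation.get_line_editing_paths(
        'repo.db', 'path/to/repo', merge_renaming=True)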
Complexity
- git2net.complexity.compute_complexity(git_repo_dir, sqlite_db_file, no_of_processes=2, read_chunksize=1000000.0, write_chunksize=100)
Computes complexity measures for all mined commit/file combinations in a given database. Computing complexities for merge commits is currently not supported.
- Parameters:
git_repo_dir (str) – path to the git repository that is analysed
sqlite_db_file (str) – path to the SQLite database containing the mined commits
no_of_processes (int) – number of parallel processes that are spawned
read_chunksize (int) – number of commit/file combinations that are processed at once
write_chunksize (int) – number of commit/file combinations for which complexities are written at once
- Returns:
adds a table complexity to the database, containing the computed complexity measures for all commit/file combinations.
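A minimal invocation with placeholder paths:

    from git2net import complexity

    # Adds a 'complexity' table with complexity measures per commit/file combination
    complexity.compute_complexity('path/to/repo', 'repo.db', no_of_processes=4)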