Title: | Identify Duplicated R Code in a Project |
---|---|
Description: | Identifies code blocks that have a high level of similarity within a set of R files. |
Authors: | Russ Hyde |
Maintainer: | Russ Hyde <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.0.9000 |
Built: | 2025-02-17 02:43:58 UTC |
Source: | https://github.com/russhyde/dupree |
Convert a 'dups' object to a 'tibble'
## S3 method for class 'dups'
as_tibble(x, ...)
x |
A data frame, list, matrix, or other object that could reasonably be coerced to a tibble. |
... |
Unused, for extensibility. |
as.data.frame method for 'dups' class
## S3 method for class 'dups'
as.data.frame(x, ...)
x |
any R object. |
... |
additional arguments to be passed to or from methods. |
This function identifies all code-blocks in a set of files and then computes a similarity score between those code-blocks to help identify functions / classes that have a high level of duplication, and could possibly be refactored.
dupree(files, min_block_size = 40, ...)
files |
A set of files over which code-duplication should be measured. |
min_block_size |
Size threshold: code-blocks smaller than this are disregarded before analysis. |
... |
Unused at present. |
Code-blocks under a size threshold are disregarded before analysis (the size threshold is controlled by min_block_size), and only top-level code blocks are considered.
Every sufficiently large code-block in the input files will be present in the results at least once. If code-block X and code-block Y are present in a row of the resulting data-frame, then either X is the closest match to Y, or Y is the closest match to X (or possibly both) according to the similarity score; as such, some code-blocks may be present multiple times in the results.
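The closest-match rule described above can be illustrated with a small base-R sketch. The block labels and similarity scores are made up for illustration, and the code is not dupree's implementation; it only shows why every block appears at least once in the results and why some blocks appear several times.

```r
# Hypothetical similarity scores between four code-blocks (higher = more alike).
blocks <- c("A", "B", "C", "D")
sim <- matrix(
  c(NA,   0.90, 0.20, 0.40,
    0.90, NA,   0.30, 0.50,
    0.20, 0.30, NA,   0.25,
    0.40, 0.50, 0.25, NA),
  nrow = 4, byrow = TRUE, dimnames = list(blocks, blocks)
)

# For each block (row), find the index of its closest match.
# which.max() skips the NA entries on the diagonal.
best <- unname(apply(sim, 1, which.max))

# One row per (block, closest match), stored with the lower index first so
# that "A matches B" and "B matches A" collapse to a single row.
idx <- seq_along(blocks)
pairs <- data.frame(
  a = blocks[pmin(idx, best)],
  b = blocks[pmax(idx, best)],
  score = sim[cbind(idx, best)],
  stringsAsFactors = FALSE
)
pairs <- unique(pairs)
pairs <- pairs[order(pairs$score, decreasing = TRUE), ]
pairs
```

With these scores, block B is the closest match for A, C and D, so it appears in every row, while each of the other blocks appears exactly once.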
Similarity between code-blocks is calculated using the longest-common-subsequence (lcs) measure from the package stringdist. This measure is applied to a tokenised version of the code-blocks. That is, each function name / operator / variable in the code blocks is converted to a unique integer so that a code-block can be represented as a vector of integers and the lcs measure is applied to each pair of these vectors.
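The idea of tokenising code and comparing integer vectors can be sketched in base R. The helpers `tokenise` and `lcs_distance` below are hypothetical and written only for this illustration; dupree itself relies on stringdist, and its tokenisation may differ. The distance used here is the usual lcs distance: the number of tokens left unpaired once the longest common subsequence has been matched.

```r
# Hypothetical helper: split a piece of R code into its terminal tokens.
tokenise <- function(code) {
  parsed <- parse(text = code, keep.source = TRUE)
  pd <- utils::getParseData(parsed)
  pd$text[pd$terminal]
}

# Hypothetical helper: lcs distance between two vectors, computed as
# length(a) + length(b) - 2 * (length of the longest common subsequence).
lcs_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  # dp[i + 1, j + 1] holds the LCS length of a[1:i] and b[1:j]
  dp <- matrix(0L, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      dp[i + 1, j + 1] <- if (a[i] == b[j]) {
        dp[i, j] + 1L
      } else {
        max(dp[i, j + 1], dp[i + 1, j])
      }
    }
  }
  n + m - 2L * dp[n + 1, m + 1]
}

block_x <- "total <- sum(values) / length(values)"
block_y <- "average <- sum(scores) / length(scores)"

tokens_x <- tokenise(block_x)
tokens_y <- tokenise(block_y)

# Map every distinct token to an integer using a vocabulary shared by both blocks.
vocabulary <- unique(c(tokens_x, tokens_y))
ints_x <- match(tokens_x, vocabulary)
ints_y <- match(tokens_y, vocabulary)

# A smaller distance means the two blocks share more of their token sequence.
lcs_distance(ints_x, ints_y)
```

The two blocks differ only in their variable names, so most tokens pair up and the distance is small relative to the combined length of the two vectors.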
A tibble. Each row in the table summarises the comparison between two code-blocks (block 'a' and block 'b') in the input files. Each code-block in the pair is indicated by: i) the file (file_a / file_b) that contains it; ii) its position within that file (block_a / block_b; 1 being the first code-block in a given file); and iii) the line where that code-block starts in that file (line_a / line_b). The pairs of code-blocks are ordered by decreasing similarity. Any match that is returned is either the top hit for block 'a' or for block 'b' (or both).
# To quantify duplication between the top-level code-blocks in a file
example_file <- system.file("extdata", "duplicated.R", package = "dupree")
dup <- dupree(example_file, min_block_size = 10)
dup

# For the block-pair with the highest duplication, we print the first four
# lines:
readLines(example_file)[dup$line_a[1] + c(0:3)]
readLines(example_file)[dup$line_b[1] + c(0:3)]

# The code-blocks in the example file are rather small, so if
# `min_block_size` is too large, none of the code-blocks will be analysed
# and the results will be empty:
dupree(example_file, min_block_size = 40)
Run duplicate-code detection over all R-files in a directory
dupree_dir(path = ".", min_block_size = 40, filter = NULL, ..., recursive = TRUE)
path |
A directory (by default, the current working directory). All files in this directory that have a ".R", ".r" or ".Rmd" extension will be checked for code duplication. |
min_block_size |
Size threshold: code-blocks smaller than this are disregarded before analysis. |
filter |
A pattern for use in grep. This is used to keep only particular files: e.g., filter = "classes" would compare only files with 'classes' in the filename. |
... |
Further arguments for grep. For example, 'filter = "test", invert = TRUE' would disregard all files with 'test' in the file-path. |
recursive |
Should we consider files in subdirectories as well? |
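The effect of filter and of extra grep arguments such as invert can be previewed with plain grep calls on a vector of file paths. The paths below are hypothetical, and how dupree_dir forwards these arguments internally is assumed from the argument descriptions above rather than taken from its source.

```r
# Hypothetical file paths, as might be found under a project directory.
paths <- c(
  "R/classes.R",
  "R/utils.R",
  "tests/testthat/test-classes.R",
  "vignettes/intro.Rmd"
)

# filter = "classes": keep only paths that match the pattern.
kept <- grep("classes", paths, value = TRUE)
kept

# filter = "test", invert = TRUE: drop every path that matches the pattern.
without_tests <- grep("test", paths, invert = TRUE, value = TRUE)
without_tests
```

Because the pattern is matched against the whole path, "test" also matches the tests/ directory name, not only file names that contain it.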
See also: dupree
The function fails if the path does not look like a typical R package (it should have both an R/ subdirectory and a DESCRIPTION file present).
dupree_package(package = ".", min_block_size = 40)
package |
The name of, or path to, the package that is to be checked (by default, the current working directory). |
min_block_size |
Size threshold: code-blocks smaller than this are disregarded before analysis. |
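The package-layout requirement stated in the description (an R/ subdirectory plus a DESCRIPTION file) can be checked up front with a few lines of base R. The helper name `looks_like_package` is hypothetical and is not part of dupree; this is a sketch of the stated condition, not the package's own check.

```r
# Hypothetical helper: TRUE if `path` contains both an R/ subdirectory
# and a DESCRIPTION file.
looks_like_package <- function(path) {
  dir.exists(file.path(path, "R")) &&
    file.exists(file.path(path, "DESCRIPTION"))
}

# Build a throwaway package skeleton and an empty directory to compare.
pkg <- file.path(tempdir(), "demopkg")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(pkg, "DESCRIPTION"))

empty <- file.path(tempdir(), "not_a_pkg")
dir.create(empty, showWarnings = FALSE)

looks_like_package(pkg)
looks_like_package(empty)
```

Running such a check before calling dupree_package lets a script fall back to dupree_dir for directories that are not laid out as packages.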
See also: dupree
An S4 class to represent the code blocks as strings of integers
blocks
A tbl_df with columns 'file', 'block', 'start_line' and 'enumerated_code'
print method for 'dups' class
## S3 method for class 'dups'
print(x, ...)
x |
an object used to select a method. |
... |
further arguments passed to or from other methods. |