Package 'dupree' reference manual

Title:	Identify Duplicated R Code in a Project
Description:	Identifies code blocks that have a high level of similarity within a set of R files.
Authors:	Russ Hyde
Maintainer:	Russ Hyde <[email protected]>
License:	MIT + file LICENSE
Version:	0.3.0.9000
Built:	2025-02-17 02:43:58 UTC
Source:	https://github.com/russhyde/dupree

Convert a 'dups' object to a 'tibble'

Description

Convert a 'dups' object to a 'tibble'

Usage

## S3 method for class 'dups'
as_tibble(x, ...)

Arguments

`x`	A 'dups' object to convert to a tibble.
`...`	Unused, for extensibility.
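
A minimal usage sketch, assuming the dupree and tibble packages are both installed. It reuses the example file shipped with dupree; the variable names `dups` and `dups_tbl` are illustrative only.

```r
# Sketch only: the body runs when both packages are available.
if (requireNamespace("dupree", quietly = TRUE) &&
    requireNamespace("tibble", quietly = TRUE)) {
  example_file <- system.file("extdata", "duplicated.R", package = "dupree")
  dups <- dupree::dupree(example_file, min_block_size = 10)

  # Convert the duplication results to a tibble for further processing
  dups_tbl <- tibble::as_tibble(dups)
  print(dups_tbl)
}
```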

as.data.frame method for 'dups' class

Description

as.data.frame method for 'dups' class

Usage

## S3 method for class 'dups'
as.data.frame(x, ...)

Arguments

`x`	A 'dups' object to convert to a data frame.
`...`	additional arguments to be passed to or from methods.
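
A minimal usage sketch, assuming the dupree package is installed. The column names selected below are those described in the Value section of `dupree()`; the variable names are illustrative only.

```r
# Sketch only: the body runs when dupree is available.
if (requireNamespace("dupree", quietly = TRUE)) {
  example_file <- system.file("extdata", "duplicated.R", package = "dupree")
  dups <- dupree::dupree(example_file, min_block_size = 10)

  # Convert to a plain data frame and show where each paired block starts
  dups_df <- as.data.frame(dups)
  print(dups_df[, c("block_a", "line_a", "block_b", "line_b")])
}
```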

Detect code duplication between the code-blocks in a set of files

Description

This function identifies all code-blocks in a set of files and computes a similarity score between each pair of code-blocks. The scores help to identify functions or classes that have a high level of duplication and could possibly be refactored.

Usage

dupree(files, min_block_size = 40, ...)

Arguments

`files`	A set of files over which code-duplication should be measured.
`min_block_size`	`dupree` uses a notion of non-trivial symbols. These are the symbols / code-words that remain after filtering out really common symbols like `<-`, `,`, etc. After filtering out these symbols from each code-block, only those blocks containing at least `min_block_size` symbols are used in the inter-block code-duplication measurement.
`...`	Unused at present.
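
To make the idea of non-trivial symbols concrete, the following base-R sketch counts the tokens that remain in a piece of code after a small set of common symbols is removed. The helper `count_nontrivial()` and its `trivial` set are hypothetical and chosen for illustration; they are not dupree's own filter.

```r
# Hypothetical helper (not part of dupree): count the tokens left in `code`
# after dropping an assumed set of "trivial" symbols.
count_nontrivial <- function(code,
                             trivial = c("<-", ",", "(", ")", "{", "}")) {
  parsed <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)
  # Keep only leaf tokens (symbols, operators, constants, brackets)
  tokens <- tokens[tokens$terminal, ]
  sum(!tokens$text %in% trivial)
}

n_symbols <- count_nontrivial("x <- f(a, b)")
print(n_symbols)
```

Under this assumed filter, only `x`, `f`, `a` and `b` are counted, so a block this small would be skipped for any `min_block_size` above 4.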

Details

Code-blocks below a size threshold (controlled by `min_block_size`) are discarded before analysis, and only top-level code-blocks are considered.

Every sufficiently large code-block in the input files will be present in the results at least once. If code-block X and code-block Y are present in a row of the resulting data-frame, then either X is the closest match to Y, or Y is the closest match to X (or possibly both) according to the similarity score; as such, some code-blocks may be present multiple times in the results.
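
The closest-match rule above can be sketched in base R. The similarity scores below are made up, and the selection logic is an illustration of the rule as described, not dupree's implementation.

```r
# Hypothetical similarity scores between three code-blocks
# (higher means more similar; the diagonal is left as NA)
sim <- matrix(
  c(NA,  0.9, 0.2,
    0.9, NA,  0.4,
    0.2, 0.4, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("X", "Y", "Z"), c("X", "Y", "Z"))
)

# Closest match for each block (which.max() skips the NA diagonal)
best <- unname(apply(sim, 1, which.max))

# One pair per block, stored with the lower index first, then de-duplicated
pairs <- unique(data.frame(
  a = pmin(seq_along(best), best),
  b = pmax(seq_along(best), best)
))

# Attach the score for each pair and order by decreasing similarity
pairs$score <- sim[cbind(pairs$a, pairs$b)]
pairs <- pairs[order(pairs$score, decreasing = TRUE), ]
print(pairs)
```

Here X and Y are each other's closest match, while Z's closest match is Y, so block Y appears in two rows of the result.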

Similarity between code-blocks is calculated using the longest-common-subsequence (lcs) measure from the stringdist package. The measure is applied to a tokenised version of the code-blocks: each function name, operator or variable is mapped to an integer (the same token always maps to the same integer), so that a code-block is represented as a vector of integers. The lcs measure is then applied to each pair of these vectors.
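
The tokenise-then-compare idea can be sketched in base R. The token vectors are hand-written for illustration, and `lcs_length()` is a hypothetical dynamic-programming helper standing in for the stringdist measure; how dupree turns the measure into its reported score is not shown here.

```r
# Hand-written tokens for two small code-blocks (illustrative only)
tokens_a <- c("x", "<-", "mean", "(", "y", ")")
tokens_b <- c("z", "<-", "mean", "(", "w", ")")

# Map every distinct token to an integer code shared by both blocks
vocab  <- unique(c(tokens_a, tokens_b))
ints_a <- match(tokens_a, vocab)
ints_b <- match(tokens_b, vocab)

# Hypothetical helper: length of the longest common subsequence of two vectors
lcs_length <- function(a, b) {
  m <- matrix(0L, nrow = length(a) + 1L, ncol = length(b) + 1L)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      m[i + 1L, j + 1L] <- if (a[i] == b[j]) {
        m[i, j] + 1L
      } else {
        max(m[i, j + 1L], m[i + 1L, j])
      }
    }
  }
  m[length(a) + 1L, length(b) + 1L]
}

shared <- lcs_length(ints_a, ints_b)
print(shared)
```

The two blocks differ only in their variable names, so the tokens `<-`, `mean`, `(` and `)` form the common subsequence.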

Value

A tibble. Each row in the table summarises the comparison between two code-blocks (block 'a' and block 'b') in the input files. Each code-block in the pair is indicated by: i) the file (file_a / file_b) that contains it; ii) its position within that file (block_a / block_b; 1 being the first code-block in a given file); and iii) the line where that code-block starts in that file (line_a / line_b). The pairs of code-blocks are ordered by decreasing similarity. Any match that is returned is either the top hit for block 'a' or for block 'b' (or both).

Examples

# To quantify duplication between the top-level code-blocks in a file
example_file <- system.file("extdata", "duplicated.R", package = "dupree")
dup <- dupree(example_file, min_block_size = 10)
dup

# For the block-pair with the highest duplication, we print the first four
# lines:
readLines(example_file)[dup$line_a[1] + 0:3]
readLines(example_file)[dup$line_b[1] + 0:3]

# The code-blocks in the example file are rather small, so if
# `min_block_size` is too large, none of the code-blocks will be analysed
# and the results will be empty:
dupree(example_file, min_block_size = 40)

Run duplicate-code detection over all R-files in a directory

Description

Run duplicate-code detection over all R-files in a directory

Usage

dupree_dir(
  path = ".",
  min_block_size = 40,
  filter = NULL,
  ...,
  recursive = TRUE
)

Arguments

`path`	A directory (by default, the current working directory). All files in this directory that have a ".R", ".r" or ".Rmd" extension will be checked for code duplication.
`min_block_size`	`dupree` uses a notion of non-trivial symbols. These are the symbols / code-words that remain after filtering out really common symbols like `<-`, `,`, etc. After filtering out these symbols from each code-block, only those blocks containing at least `min_block_size` symbols are used in the inter-block code-duplication measurement.
`filter`	A pattern for use in grep, used to keep only particular files: e.g., filter = "classes" would compare only files with 'classes' in the filename.
`...`	Further arguments for grep. For example, 'filter = "test", invert = TRUE' would disregard all files with 'test' in the file-path.
`recursive`	Should we consider files in subdirectories as well?
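
The effect of `filter` and the extra grep arguments can be previewed with base R's `grep()` on a vector of paths. The paths below are hypothetical, and the way `dupree_dir()` calls grep internally is assumed rather than shown.

```r
# Hypothetical file paths, standing in for the R files found under `path`
paths <- c("R/classes.R", "R/utils.R", "tests/testthat/test-classes.R")

# Comparable to filter = "classes": keep paths that match the pattern
kept <- grep("classes", paths, value = TRUE)
print(kept)

# Comparable to filter = "test", invert = TRUE: drop paths that match
not_tests <- grep("test", paths, value = TRUE, invert = TRUE)
print(not_tests)
```

The corresponding calls would be `dupree_dir(".", filter = "classes")` and `dupree_dir(".", filter = "test", invert = TRUE)`.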

Run duplicate-code detection over all files in the 'R' directory of a package

Description

The function fails if the path does not look like a typical R package (it should have both an R/ subdirectory and a DESCRIPTION file present).

Usage

dupree_package(package = ".", min_block_size = 40)

Arguments

`package`	The name of, or path to, the package that is to be checked (by default, the current working directory).
`min_block_size`	`dupree` uses a notion of non-trivial symbols. These are the symbols / code-words that remain after filtering out really common symbols like `<-`, `,`, etc. After filtering out these symbols from each code-block, only those blocks containing at least `min_block_size` symbols are used in the inter-block code-duplication measurement.
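
The package-layout requirement from the Description can be checked up front with base R. The helper `looks_like_package()` is hypothetical and only mirrors the two conditions stated there; dupree's own check may differ.

```r
# Hypothetical helper (not part of dupree): does `path` have both an R/
# subdirectory and a DESCRIPTION file?
looks_like_package <- function(path = ".") {
  dir.exists(file.path(path, "R")) &&
    file.exists(file.path(path, "DESCRIPTION"))
}

# Build a throwaway package skeleton in the session's temporary directory
pkg <- file.path(tempdir(), "examplepkg")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(pkg, "DESCRIPTION"))

is_pkg  <- looks_like_package(pkg)
not_pkg <- looks_like_package(file.path(tempdir(), "no-such-package"))
print(c(is_pkg, not_pkg))
```

A guard such as `if (looks_like_package(path)) dupree_package(path)` avoids the failure for directories that are not packages.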

An S4 class to represent the code blocks as strings of integers

Description

An S4 class to represent the code blocks as strings of integers

Slots

blocks: A tbl_df with columns 'file', 'block', 'start_line' and 'enumerated_code'
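
A rough sketch of a class with this shape, for illustration only. The class name `BlockTableSketch` is hypothetical, a plain data frame stands in for the tbl_df, and storing `enumerated_code` as a list column of integer vectors is an assumption.

```r
library(methods)

# Hypothetical class: one slot holding a table of code-blocks
setClass(
  "BlockTableSketch",
  slots = c(blocks = "data.frame"),
  validity = function(object) {
    required <- c("file", "block", "start_line", "enumerated_code")
    missing_cols <- setdiff(required, names(object@blocks))
    if (length(missing_cols) > 0) {
      paste("missing columns:", paste(missing_cols, collapse = ", "))
    } else {
      TRUE
    }
  }
)

# One code-block, with its tokens already converted to integers
blocks <- data.frame(
  file = "a.R", block = 1L, start_line = 1L,
  stringsAsFactors = FALSE
)
blocks$enumerated_code <- list(c(1L, 2L, 3L))

obj <- new("BlockTableSketch", blocks = blocks)
print(obj@blocks)
```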

print method for 'dups' class

Description

print method for 'dups' class

Usage

## S3 method for class 'dups'
print(x, ...)

Arguments

`x`	A 'dups' object to print.
`...`	further arguments passed to or from other methods.
