API Reference

Lineage Definition Strategies

RepertoireMetrics.AbstractLineageDefinition — Type

AbstractLineageDefinition

Abstract supertype for all lineage definition strategies. Concrete subtypes define how sequences are grouped into clonal lineages.

Implementing a new strategy

To implement a new lineage definition strategy:

1. Create a subtype of AbstractLineageDefinition.
2. Implement lineage_key(strategy, row), returning a hashable key.
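The two steps above can be sketched as follows. This is a minimal illustration, not code from the package: VGeneLengthDefinition is a hypothetical strategy, and the abstract type and generic function are redefined locally as stand-ins so the snippet runs on its own. With the package loaded you would subtype RepertoireMetrics.AbstractLineageDefinition and add a method to RepertoireMetrics.lineage_key instead. The row is assumed to support property access (row.v_call), as in the CustomDefinition example.

```julia
# Stand-ins so this sketch runs without the package (assumption, see above).
abstract type AbstractLineageDefinition end
function lineage_key end

# Step 1: a hypothetical strategy grouping rows by V gene call and CDR3 length.
struct VGeneLengthDefinition <: AbstractLineageDefinition
    v_column::Symbol
    cdr3_column::Symbol
end
VGeneLengthDefinition() = VGeneLengthDefinition(:v_call, :cdr3)

# Step 2: return a hashable key (here a tuple) for one row.
function lineage_key(s::VGeneLengthDefinition, row)
    v = string(getproperty(row, s.v_column))
    cdr3 = string(getproperty(row, s.cdr3_column))
    return (v, length(cdr3))
end

# A NamedTuple stands in for a table row.
row = (v_call = "IGHV1-2*01", cdr3 = "TGTGCGAGAGAT")
key = lineage_key(VGeneLengthDefinition(), row)
```

Tuples of strings and integers are hashable, so a key of this shape can be used directly as a dictionary key when aggregating counts.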

RepertoireMetrics.LineageIDDefinition — Type

LineageIDDefinition <: AbstractLineageDefinition

Define lineages using the lineage_id column directly. This is the simplest strategy when lineage assignment has already been performed.

Fields

column::Symbol: Column name containing lineage IDs (default: :lineage_id)

RepertoireMetrics.VJCdr3Definition — Type

VJCdr3Definition <: AbstractLineageDefinition

Define lineages using the combination of V gene, J gene, and CDR3 sequence. This is consistent with LineageCollapse.jl's default behavior.

Fields

v_column::Symbol: Column name for V gene call (default: :v_call)
j_column::Symbol: Column name for J gene call (default: :j_call)
cdr3_column::Symbol: Column name for CDR3 sequence (default: :cdr3)
use_first_allele::Bool: Use only first allele from calls (default: true)

RepertoireMetrics.CustomDefinition — Type

CustomDefinition{F} <: AbstractLineageDefinition

Define lineages using a custom function that extracts a key from each row.

Fields

key_func::F: Function row -> key that returns a hashable lineage key

Example

# Group by V gene family only
strategy = CustomDefinition(row -> first(split(string(row.v_call), "-")))

RepertoireMetrics.lineage_key — Function

lineage_key(strategy::AbstractLineageDefinition, row) -> key

Extract a lineage key from a data row using the given strategy. Returns a hashable value that identifies the lineage.

RepertoireMetrics.first_allele — Function

first_allele(call::AbstractString) -> String

Extract the first allele from a gene call string. When the call lists several comma-separated alleles, only the first is returned; the allele suffix (e.g. *01) is kept.

Examples

first_allele("IGHV1-2*01,IGHV1-2*02") # "IGHV1-2*01"
first_allele("IGHV1-2*01")             # "IGHV1-2*01"
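For intuition, the behaviour shown in the examples can be reproduced with a few lines of plain Julia. The function first_allele_sketch is a hypothetical re-implementation written for this illustration; the package's own code may handle more cases (for example missing values or other separators).

```julia
# Hypothetical sketch, not the package's source.
function first_allele_sketch(call::AbstractString)
    # Keep only the text before the first comma, then trim surrounding whitespace.
    return String(strip(first(split(call, ','))))
end

a = first_allele_sketch("IGHV1-2*01,IGHV1-2*02")
b = first_allele_sketch("IGHV1-2*01")
```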

Reading Data

RepertoireMetrics.read_repertoire — Function

read_repertoire(
    filepath::AbstractString,
    strategy::AbstractLineageDefinition;
    count_column::Union{Symbol,Nothing} = :count,
    donor_id::String = "",
    donor_column::Union{Symbol,Nothing} = nothing,
    length_column::Union{Symbol,Nothing} = nothing,
    length_aa::Bool = false,
    kwargs...
) -> Repertoire{Int}

Read a MiAIRR-format file (TSV or CSV, optionally gzipped) and convert it to a Repertoire. The delimiter is auto-detected by CSV.jl.

Arguments

filepath: Path to file (supports .tsv, .csv, .tsv.gz, .csv.gz)
strategy: Lineage definition strategy
count_column: Column containing sequence counts (default :count)
donor_id: Donor/sample identifier
donor_column: Column to extract donor ID from (e.g., :library_id)
length_column: Column to compute length statistics from (e.g., :cdr3)
length_aa: If true, length_column contains amino acid sequences
kwargs...: Additional arguments passed to CSV.read

Example

# Using lineage_id column
rep = read_repertoire("collapsed_data.tsv", LineageIDDefinition())

# Using V-J-CDR3 definition with CDR3 length stats
rep = read_repertoire("sequences.tsv", VJCdr3Definition(); length_column=:cdr3)

# With donor ID from file
rep = read_repertoire("data.tsv", VJCdr3Definition(); donor_column=:library_id)

# Compute length from amino acid column
rep = read_repertoire("data.tsv", VJCdr3Definition(); 
    length_column=:cdr3_aa, length_aa=true)

RepertoireMetrics.read_repertoires — Function

read_repertoires(
    filepaths::Vector{<:AbstractString},
    strategy::AbstractLineageDefinition;
    kwargs...
) -> RepertoireCollection{Int}

Read multiple MiAIRR TSV files and return a RepertoireCollection.

Example

files = ["donor1.tsv", "donor2.tsv", "donor3.tsv"]
collection = read_repertoires(files, VJCdr3Definition())
all_metrics = compute_metrics(collection)

RepertoireMetrics.read_repertoires_from_directory — Function

read_repertoires_from_directory(
    dirpath::AbstractString,
    strategy::AbstractLineageDefinition;
    pattern::Regex = r"\.tsv$"i,
    kwargs...
) -> RepertoireCollection{Int}

Read all matching files from a directory and return a collection.

Arguments

dirpath: Path to directory containing TSV files
strategy: Lineage definition strategy
pattern: Regex pattern to match filenames (default: .tsv files)
kwargs...: Additional arguments passed to read_repertoire

Example

collection = read_repertoires_from_directory("data/", VJCdr3Definition())

RepertoireMetrics.repertoire_from_dataframe — Function

repertoire_from_dataframe(
    df::DataFrame,
    strategy::AbstractLineageDefinition;
    count_column::Union{Symbol,Nothing} = :count,
    donor_id::String = "",
    donor_column::Union{Symbol,Nothing} = nothing,
    length_column::Union{Symbol,Nothing} = nothing,
    length_aa::Bool = false
) -> Repertoire{Int}

Convert a DataFrame to a Repertoire using the specified lineage definition strategy.

Arguments

df: Input DataFrame in MiAIRR format
strategy: Lineage definition strategy (e.g., VJCdr3Definition(), LineageIDDefinition())
count_column: Column containing sequence counts (default :count). If nothing, or if the column does not exist, each row counts as 1.
donor_id: Donor/sample identifier. If empty and donor_column is specified, the identifier is taken from the first row of that column.
donor_column: Column containing donor IDs (e.g., :library_id)
length_column: Column to compute length statistics from (e.g., :cdr3, :junction). If nothing (default), no length statistics are computed.
length_aa: If true, length_column contains amino acid sequences. If false (default), assumes nucleotide sequences and converts to amino acid length (÷3).

Returns

A Repertoire{Int} with aggregated lineage counts. If length_column is specified, length statistics are stored in metadata and accessible via composable metrics like MeanLength(), MedianLength(), etc.

Example

using CSV, DataFrames
df = CSV.read("sequences.tsv", DataFrame)

# Basic repertoire
rep = repertoire_from_dataframe(df, VJCdr3Definition())

# With CDR3 length statistics
rep = repertoire_from_dataframe(df, VJCdr3Definition(); length_column=:cdr3)
println(mean_length(rep))  # Access via function
metrics = compute_metrics(rep, MeanLength() + MedianLength())  # Or compose

# Using amino acid column directly
rep = repertoire_from_dataframe(df, VJCdr3Definition(); 
    length_column=:cdr3_aa, length_aa=true)

RepertoireMetrics.split_by_donor — Function

split_by_donor(
    df::DataFrame,
    donor_column::Symbol,
    strategy::AbstractLineageDefinition;
    kwargs...
) -> RepertoireCollection{Int}

Split a multi-donor DataFrame into separate Repertoires by donor.

Arguments

df: DataFrame containing data from multiple donors
donor_column: Column containing donor identifiers
strategy: Lineage definition strategy
kwargs...: Additional arguments passed to repertoire_from_dataframe

Example

using CSV, DataFrames
df = CSV.read("all_donors.tsv", DataFrame)
collection = split_by_donor(df, :library_id, VJCdr3Definition())

Computing Metrics

Main Functions

RepertoireMetrics.compute_metrics — Function

compute_metrics(rep::Repertoire) -> Metrics
compute_metrics(rep::Repertoire, metrics::MetricSet) -> Metrics

Compute metrics for a repertoire. Without a MetricSet argument, computes all metrics.

Examples

# Compute all metrics (default)
result = compute_metrics(rep)

# Compute specific metrics
result = compute_metrics(rep, ShannonEntropy() + Clonality() + D50())

# Use predefined sets
result = compute_metrics(rep, DIVERSITY_METRICS)

compute_metrics(collection::RepertoireCollection) -> Vector{Metrics}
compute_metrics(collection::RepertoireCollection, metrics::MetricSet) -> Vector{Metrics}

Compute metrics for all repertoires in a collection.

RepertoireMetrics.Metrics — Type

Metrics <: AbstractMetricResult

Result container for computed metrics. Access values by property name. All values are stored as Float64 for type stability.

Example

result = compute_metrics(rep)
println(result.shannon_entropy)
println(result.clonality)
println(result.d50)

RepertoireMetrics.compute_metric — Function

compute_metric(rep::Repertoire, metric::AbstractMetric) -> value

Compute a single metric for a repertoire.

Individual Metric Functions

RepertoireMetrics.shannon_entropy — Function

shannon_entropy(freqs::Vector{Float64}) -> Float64

Compute Shannon entropy H = -Σ(pᵢ log pᵢ) using the natural logarithm. Zero frequencies contribute nothing to the sum, following the convention 0 * log(0) = 0.

shannon_entropy(rep::Repertoire) -> Float64

Compute Shannon entropy for a repertoire.
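As a worked illustration of the formula, the sketch below computes H directly from a frequency vector in plain Julia. shannon_sketch is a hypothetical name used only here; it is not the package's implementation.

```julia
# Hypothetical helper mirroring H = -Σ pᵢ log pᵢ with the natural logarithm.
function shannon_sketch(freqs::Vector{Float64})
    h = 0.0
    for p in freqs
        # Skip zeros: by convention 0 * log(0) contributes nothing.
        p > 0 && (h -= p * log(p))
    end
    return h
end

h_uniform = shannon_sketch([0.25, 0.25, 0.25, 0.25])  # four equal lineages: log(4)
h_single = shannon_sketch([1.0, 0.0])                  # one lineage: zero entropy
```

For S equally abundant lineages the entropy reaches its maximum log(S), which is the normalising constant used by the evenness and clonality metrics.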

RepertoireMetrics.simpson_index — Function

simpson_index(freqs::Vector{Float64}) -> Float64

Compute Simpson's index D = Σpᵢ². This is the probability that two randomly selected individuals belong to the same lineage.

simpson_index(rep::Repertoire) -> Float64

Compute Simpson's index for a repertoire.

RepertoireMetrics.simpson_diversity — Function

simpson_diversity(rep::Repertoire) -> Float64

Compute Simpson's diversity (1 - D) for a repertoire.

RepertoireMetrics.inverse_simpson — Function

inverse_simpson(rep::Repertoire) -> Float64

Compute inverse Simpson index (1/D) for a repertoire.
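The three Simpson-based quantities are simple transformations of the same sum. A minimal sketch on a frequency vector, using the hypothetical helper simpson_sketch rather than the package functions:

```julia
# Hypothetical helper: D = Σ pᵢ²
simpson_sketch(freqs) = sum(abs2, freqs)

freqs = [0.5, 0.25, 0.25]
D = simpson_sketch(freqs)   # Simpson's index
gini_simpson = 1 - D        # Simpson's diversity (1 - D)
inv_simpson = 1 / D         # inverse Simpson (1/D)
```

Here D = 0.25 + 0.0625 + 0.0625 = 0.375, so 1 - D = 0.625 and 1/D is about 2.67 effective lineages.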

RepertoireMetrics.berger_parker_index — Function

berger_parker_index(freqs::Vector{Float64}) -> Float64

Compute Berger-Parker index: proportion of the most abundant lineage. Assumes frequencies are sorted in descending order.

berger_parker_index(rep::Repertoire) -> Float64

Compute Berger-Parker index for a repertoire.

RepertoireMetrics.gini_coefficient — Function

gini_coefficient(freqs::Vector{Float64}) -> Float64

Compute Gini coefficient measuring inequality in lineage abundances. Returns 0 for perfect equality, approaches 1 for maximum inequality.

gini_coefficient(rep::Repertoire) -> Float64

Compute Gini coefficient for a repertoire.
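The reference does not state which Gini formula is used. One common choice, assumed here for illustration, sorts the values in ascending order and computes G = 2 Σ(i xᵢ) / (n Σxᵢ) - (n + 1)/n. The helper gini_sketch is hypothetical and may differ from the package in edge cases.

```julia
# Hypothetical sketch of a standard Gini formula on ascending-sorted values.
function gini_sketch(x::Vector{Float64})
    n = length(x)
    n == 0 && return 0.0
    xs = sort(x)
    total = sum(xs)
    total == 0 && return 0.0
    # Rank-weighted sum: larger values receive larger ranks.
    weighted = sum(i * xs[i] for i in 1:n)
    return 2 * weighted / (n * total) - (n + 1) / n
end

g_equal = gini_sketch([0.25, 0.25, 0.25, 0.25])  # perfect equality
g_skew = gini_sketch([0.0, 0.0, 0.0, 1.0])       # one lineage holds everything
```

With this formula the maximum for n values is (n - 1)/n, which is why the coefficient approaches but never reaches 1 for a finite number of lineages.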

RepertoireMetrics.clonality — Function

clonality(rep::Repertoire) -> Float64

Compute clonality: 1 - normalized Shannon entropy. High clonality indicates oligoclonal expansion. Range: [0, 1] where 1 = maximally clonal (single dominant lineage).

RepertoireMetrics.evenness — Function

evenness(rep::Repertoire) -> Float64

Compute Pielou's evenness J = H / H_max. Range: [0, 1] where 1 = perfectly even distribution.
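Evenness and clonality are complements of each other, as the sketch below shows on plain frequency vectors. The helper names are hypothetical, and the handling of a single-lineage repertoire (evenness 0, hence clonality 1) is an assumption chosen to match the stated range; the package may use a different convention.

```julia
# Hypothetical sketch: Pielou's J = H / log(S) over lineages with non-zero frequency.
function evenness_sketch(freqs::Vector{Float64})
    p = filter(>(0), freqs)
    s = length(p)
    # Assumption: with at most one lineage, H_max = log(1) = 0, so return 0
    # instead of dividing by zero.
    s <= 1 && return 0.0
    h = -sum(x * log(x) for x in p)
    return h / log(s)
end

# Clonality is defined as 1 - normalized Shannon entropy.
clonality_sketch(freqs::Vector{Float64}) = 1 - evenness_sketch(freqs)

j_even = evenness_sketch(fill(0.25, 4))   # perfectly even repertoire
c_even = clonality_sketch(fill(0.25, 4))
c_single = clonality_sketch([1.0])        # single lineage
```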

RepertoireMetrics.d50 — Function

d50(counts::Vector{<:Real}) -> Int

Compute D50: the minimum number of lineages whose combined counts make up 50% of the total repertoire. Assumes counts are sorted in descending order.

d50(rep::Repertoire) -> Int

Compute D50 for a repertoire.
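A cumulative-sum sketch makes the definition concrete. d50_sketch is a hypothetical helper; unlike the documented function it sorts its input itself, and it treats reaching exactly 50% as sufficient, which is an assumption about the boundary case.

```julia
# Hypothetical sketch: count lineages, largest first, until half the total is covered.
function d50_sketch(counts::Vector{<:Real})
    sorted = sort(counts; rev = true)
    half = sum(sorted) / 2
    running = 0.0
    for (i, c) in enumerate(sorted)
        running += c
        running >= half && return i
    end
    return length(sorted)
end

n_dominant = d50_sketch([50, 30, 10, 5, 5])  # the largest lineage alone holds 50%
n_spread = d50_sketch([40, 30, 20, 10])      # two lineages are needed
```

A low D50 relative to richness indicates that a few expanded lineages dominate the repertoire.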

RepertoireMetrics.chao1 — Function

chao1(counts::Vector{<:Integer}) -> Float64

Compute Chao1 richness estimator. Estimates total species richness including unobserved species.

Formula: S_chao1 = S_obs + f₁²/(2f₂), where f₁ is the number of singletons and f₂ the number of doubletons.

If f₂ = 0, uses the bias-corrected formula: S_chao1 = S_obs + f₁(f₁-1)/2

chao1(rep::Repertoire) -> Float64

Compute Chao1 richness estimator for a repertoire.
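The two branches of the formula can be sketched in plain Julia as follows. chao1_sketch is a hypothetical helper that follows the formulas stated above; it is not the package's source.

```julia
# Hypothetical sketch of the Chao1 estimator from a vector of lineage counts.
function chao1_sketch(counts::Vector{<:Integer})
    s_obs = count(>(0), counts)   # observed lineages
    f1 = count(==(1), counts)     # singletons
    f2 = count(==(2), counts)     # doubletons
    if f2 > 0
        return s_obs + f1^2 / (2 * f2)
    else
        # Bias-corrected form used when there are no doubletons.
        return s_obs + f1 * (f1 - 1) / 2
    end
end

est = chao1_sketch([10, 5, 2, 2, 1, 1, 1, 1])      # S_obs = 8, f1 = 4, f2 = 2
est_no_doubletons = chao1_sketch([10, 5, 1, 1, 1]) # S_obs = 5, f1 = 3, f2 = 0
```

In the first case the estimate is 8 + 16/4 = 12; in the second it is 5 + 3·2/2 = 8.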

RepertoireMetrics.hill_number — Function

hill_number(freqs::Vector{Float64}, q::Real) -> Float64

Compute Hill number of order q. Hill numbers provide a unified framework for diversity metrics.

Special cases

q=0: Richness (number of species with non-zero frequency)
q=1: exp(Shannon entropy), the limit of the general formula as q → 1 (obtained via L'Hôpital's rule)
q=2: Inverse Simpson (1/Σpᵢ²)
q→∞: 1/max(pᵢ) (reciprocal of Berger-Parker)

Formula

D_q = (Σpᵢ^q)^(1/(1-q)) for q ≠ 1

hill_number(rep::Repertoire, q::Real) -> HillNumber{Q}

Compute Hill number of order q for a repertoire. Returns a typed HillNumber{Q} for compile-time known orders.
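The special cases listed above all fall out of one formula plus two limits. The sketch below, using the hypothetical helper hill_sketch, works on a plain frequency vector, assumes q ≥ 0 and at least one non-zero frequency, and returns a Float64 rather than the typed HillNumber{Q} that the package returns for repertoires.

```julia
# Hypothetical sketch of the Hill number of order q (q ≥ 0 assumed).
function hill_sketch(freqs::Vector{Float64}, q::Real)
    p = filter(>(0), freqs)
    if q == 1
        # Limit as q → 1: exponential of Shannon entropy.
        return exp(-sum(x * log(x) for x in p))
    elseif q == Inf
        # Limit as q → ∞: reciprocal of the largest frequency.
        return 1 / maximum(p)
    else
        return sum(x^q for x in p)^(1 / (1 - q))
    end
end

freqs = [0.5, 0.25, 0.25]
hill0 = hill_sketch(freqs, 0)      # richness
hill1 = hill_sketch(freqs, 1)      # exp(H)
hill2 = hill_sketch(freqs, 2)      # inverse Simpson
hill_inf = hill_sketch(freqs, Inf) # reciprocal of Berger-Parker
```

Hill numbers are non-increasing in q, so hill0 ≥ hill1 ≥ hill2 ≥ hill_inf for any frequency vector.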

Hill Numbers

RepertoireMetrics.HillNumber — Type

HillNumber{Q} <: AbstractMetricResult

Hill number of order Q, providing a unified framework for diversity.

Type parameter

Q: Order of the Hill number (compile-time constant for type stability)

Fields

value::Float64: The computed Hill number
richness::Int: Number of unique lineages
total_count::Int: Total count

Special cases

Q=0: Richness (number of species)
Q=1: Exponential of Shannon entropy
Q=2: Inverse Simpson index
Q=∞: Reciprocal of Berger-Parker index

Composable Metric Selection

RepertoireMetrics.AbstractMetric — Type

AbstractMetric

Abstract supertype for individual metric selectors. Each concrete subtype represents a specific diversity/clonality metric.

RepertoireMetrics.MetricSet — Type

MetricSet{M<:Tuple}

A composable set of metrics to compute. Use the + operator to combine metrics.

Examples

# Select specific metrics
metrics = Richness() + ShannonEntropy() + Clonality()
result = compute_metrics(rep, metrics)

# Use predefined sets
result = compute_metrics(rep, DIVERSITY_METRICS)
result = compute_metrics(rep, ALL_METRICS)

# Default computes all metrics
result = compute_metrics(rep)

Predefined Metric Sets

RepertoireMetrics.ALL_METRICS — Constant

ALL_METRICS

All available metrics. This is the default when calling compute_metrics(rep).

RepertoireMetrics.DIVERSITY_METRICS — Constant

DIVERSITY_METRICS

Common diversity metrics: Shannon entropy/diversity, Simpson diversity, inverse Simpson.

RepertoireMetrics.CLONALITY_METRICS — Constant

CLONALITY_METRICS

Metrics focused on clonal expansion: clonality, Gini, Berger-Parker, D50.

RepertoireMetrics.RICHNESS_METRICS — Constant

RICHNESS_METRICS

Richness-related metrics: observed richness and Chao1 estimator.

RepertoireMetrics.ROBUST_METRICS — Constant

ROBUST_METRICS

Depth-robust metrics recommended for comparing samples of different sizes. These metrics are computed from frequencies and are less sensitive to sequencing depth. Includes Depth so you always report sequencing depth alongside comparisons.

Metric Types

RepertoireMetrics.Richness — Type

Richness: Number of unique lineages (S)

RepertoireMetrics.TotalCount — Type

TotalCount: Total sequence/cell count (N)

RepertoireMetrics.Depth — Type

Depth: Sequencing depth (alias for total count, commonly used term)

RepertoireMetrics.ShannonEntropy — Type

ShannonEntropy: H = -Σ(pᵢ log pᵢ)

RepertoireMetrics.ShannonDiversity — Type

ShannonDiversity: exp(H) - effective number of lineages

RepertoireMetrics.NormalizedShannon — Type

NormalizedShannon: H / log(S), normalized to [0,1]

RepertoireMetrics.SimpsonIndex — Type

SimpsonIndex: D = Σpᵢ² - probability two random sequences are same lineage

RepertoireMetrics.SimpsonDiversity — Type

SimpsonDiversity: 1 - D (Gini-Simpson index)

RepertoireMetrics.InverseSimpson — Type

InverseSimpson: 1/D - effective number of lineages (Hill q=2)

RepertoireMetrics.BergerParker — Type

BergerParker: Proportion of most abundant lineage

RepertoireMetrics.Evenness — Type

Evenness: Pielou's J = H / log(S)

RepertoireMetrics.Clonality — Type

Clonality: 1 - normalized Shannon entropy

RepertoireMetrics.GiniCoefficient — Type

GiniCoefficient: Inequality measure [0,1]

RepertoireMetrics.D50 — Type

D50: Minimum lineages comprising 50% of repertoire

RepertoireMetrics.Chao1 — Type

Chao1: Richness estimator including unobserved species

Sampling

RepertoireMetrics.rarefaction — Function

rarefaction(rep::Repertoire, depth::Integer; rng=nothing) -> Repertoire

Randomly subsample a repertoire to a specified depth (total count).

Many diversity metrics are sensitive to sample size—deeper sequencing captures more rare lineages. Rarefaction normalizes sample sizes by randomly subsampling all repertoires to the same depth, enabling fair comparisons.

Arguments

rep: Input repertoire
depth: Target total count (must be ≤ current total count of rep)
rng: Optional random number generator for reproducibility

Returns

A new Repertoire with subsampled counts. Lineages that received zero counts in the subsample are removed.

Example

# Compare two repertoires of different sizes
rep1 = read_repertoire("donor1.tsv", VJCdr3Definition())  # 50,000 sequences
rep2 = read_repertoire("donor2.tsv", VJCdr3Definition())  # 12,000 sequences

# Rarefy to the smaller depth
target = min(total_count(rep1), total_count(rep2))
rep1_rare = rarefaction(rep1, target)
rep2_rare = rarefaction(rep2, target)

# Now compare fairly
compute_metrics(rep1_rare)
compute_metrics(rep2_rare)

# For reproducibility, use a fixed RNG
using Random
rng = MersenneTwister(42)
rarefied = rarefaction(rep1, 10000; rng=rng)

Notes

Rarefaction is stochastic; results vary between runs unless rng is fixed.
For robust estimates, consider averaging metrics over multiple rarefactions.
Metrics such as the Simpson index are naturally less sensitive to sample size.
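To make the subsampling step concrete, here is a minimal sketch on a plain vector of lineage counts. It assumes sampling without replacement, which is consistent with the requirement that depth not exceed the total count but is not confirmed by this reference. rarefy_sketch is a hypothetical helper and it expands every sequence into memory, so it is only suitable for small examples.

```julia
using Random

# Hypothetical sketch: subsample lineage counts to `depth` without replacement.
function rarefy_sketch(counts::Vector{Int}, depth::Integer; rng = Random.default_rng())
    total = sum(counts)
    depth <= total || throw(ArgumentError("depth must not exceed the total count"))
    # One lineage index per sequence, e.g. counts [3, 1] become [1, 1, 1, 2].
    pool = [i for (i, c) in enumerate(counts) for _ in 1:c]
    # Shuffle and keep the first `depth` sequences.
    drawn = shuffle(rng, pool)[1:depth]
    sub = zeros(Int, length(counts))
    for i in drawn
        sub[i] += 1
    end
    # Drop lineages that received no sequences in the subsample.
    return filter(>(0), sub)
end

original = [500, 300, 150, 40, 10]
sub_a = rarefy_sketch(original, 200; rng = MersenneTwister(42))
sub_b = rarefy_sketch(original, 200; rng = MersenneTwister(42))
```

Seeding two generators identically yields identical subsamples, which is the reproducibility property the rng argument provides.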

Exporting Results

RepertoireMetrics.metrics_to_dataframe — Function

metrics_to_dataframe(m::Metrics, donor_id::String="") -> DataFrame

Convert Metrics to a single-row DataFrame.

metrics_to_dataframe(collection::RepertoireCollection, metrics::Vector{Metrics}) -> DataFrame

Convert a collection's metrics to a DataFrame with one row per donor.

RepertoireMetrics.write_metrics — Function

write_metrics(filepath::AbstractString, df::DataFrame; kwargs...)

Write metrics DataFrame to a TSV file.

Length Statistics

Types and Functions

RepertoireMetrics.LengthStats — Type

LengthStats

Container for sequence length statistics, stored in Repertoire metadata.

Fields

mean_length::Float64: Mean sequence length
median_length::Float64: Median sequence length
std_length::Float64: Standard deviation of sequence length
min_length::Int: Minimum sequence length
max_length::Int: Maximum sequence length
n_sequences::Int: Number of sequences analyzed
column::Symbol: Column used for length calculation
use_aa::Bool: Whether lengths are in amino acids

Computed during repertoire construction when a length_column is specified.

RepertoireMetrics.compute_length_stats — Function

compute_length_stats(sl::SequenceLengths) -> LengthStats

Compute length statistics from extracted sequence lengths.

compute_length_stats(df::DataFrame; length_column=:cdr3, count_column=nothing, use_aa=false) -> LengthStats

Compute length statistics from a DataFrame column.

Arguments

df: DataFrame containing sequences
length_column: Column to measure length from (default :cdr3)
count_column: If provided, weight statistics by this count column
use_aa: If true, column contains amino acid sequences. If false (default), assumes nucleotide sequences and divides length by 3.

Returns

A LengthStats object with computed statistics.

RepertoireMetrics.length_distribution — Function

length_distribution(sl::SequenceLengths) -> Dict{Int,Int}

Compute length distribution from extracted sequence lengths.

length_distribution(df::DataFrame; length_column=:cdr3, count_column=nothing, use_aa=false) -> Dict{Int,Int}

Compute the distribution of sequence lengths.

Arguments

df: DataFrame containing sequences
length_column: Column to measure length from (default :cdr3)
count_column: If provided, weight by this count column
use_aa: If true, column contains amino acid sequences

Returns

A Dict{Int,Int} mapping length to count.

Example

dist = length_distribution(df; length_column=:cdr3)
for (len, cnt) in sort(collect(dist))
    println("Length $len: $cnt sequences")
end
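The weighting by count_column and the nucleotide-to-amino-acid conversion can be illustrated without a DataFrame. The helper below is hypothetical; in particular, using integer division (÷) for the division by 3 is an assumption, since the reference only says the length is divided by 3.

```julia
# Hypothetical sketch: count-weighted length distribution from raw sequences.
function length_distribution_sketch(seqs::Vector{String}, counts::Vector{Int};
                                    use_aa::Bool = false)
    dist = Dict{Int,Int}()
    for (s, c) in zip(seqs, counts)
        # Nucleotide lengths are converted to amino-acid lengths (assumed integer division).
        len = use_aa ? length(s) : length(s) ÷ 3
        dist[len] = get(dist, len, 0) + c
    end
    return dist
end

seqs = ["TGTGCGAGA", "TGTGCGAAAGAT", "TGTGCCAGA"]  # 9, 12 and 9 nucleotides
dist = length_distribution_sketch(seqs, [5, 2, 1])
```

The two 9-nucleotide sequences fall into the same bin (length 3) with a combined weight of 6, while the 12-nucleotide sequence contributes 2 to length 4.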

RepertoireMetrics.has_length_stats — Function

has_length_stats(rep::Repertoire) -> Bool

Check if length statistics are available for this repertoire.

RepertoireMetrics.length_stats — Function

length_stats(rep::Repertoire) -> LengthStats

Return the LengthStats object for this repertoire.

Composable Length Metrics

RepertoireMetrics.MeanLength — Type

MeanLength <: AbstractMetric

Metric type for mean sequence length. Requires length_column to be specified during repertoire construction.

RepertoireMetrics.MedianLength — Type

MedianLength <: AbstractMetric

Metric type for median sequence length.

RepertoireMetrics.StdLength — Type

StdLength <: AbstractMetric

Metric type for standard deviation of sequence length.

RepertoireMetrics.MinLength — Type

MinLength <: AbstractMetric

Metric type for minimum sequence length.

RepertoireMetrics.MaxLength — Type

MaxLength <: AbstractMetric

Metric type for maximum sequence length.

RepertoireMetrics.mean_length — Function

mean_length(rep::Repertoire) -> Float64

Return the mean sequence length. Requires length statistics to be computed during repertoire construction (via length_column parameter).

RepertoireMetrics.median_length — Function

median_length(rep::Repertoire) -> Float64

Return the median sequence length.

RepertoireMetrics.std_length — Function

std_length(rep::Repertoire) -> Float64

Return the standard deviation of sequence length.

RepertoireMetrics.min_length — Function

min_length(rep::Repertoire) -> Int

Return the minimum sequence length.

RepertoireMetrics.max_length — Function

max_length(rep::Repertoire) -> Int

Return the maximum sequence length.

RepertoireMetrics.LENGTH_METRICS — Constant

LENGTH_METRICS

Predefined MetricSet containing all length metrics: MeanLength() + MedianLength() + StdLength() + MinLength() + MaxLength()

Index

RepertoireMetrics.ALL_METRICS
RepertoireMetrics.CLONALITY_METRICS
RepertoireMetrics.DIVERSITY_METRICS
RepertoireMetrics.LENGTH_METRICS
RepertoireMetrics.RICHNESS_METRICS
RepertoireMetrics.ROBUST_METRICS
RepertoireMetrics.AbstractLineageDefinition
RepertoireMetrics.AbstractMetric
RepertoireMetrics.BergerParker
RepertoireMetrics.Chao1
RepertoireMetrics.Clonality
RepertoireMetrics.CustomDefinition
RepertoireMetrics.D50
RepertoireMetrics.Depth
RepertoireMetrics.Evenness
RepertoireMetrics.GiniCoefficient
RepertoireMetrics.HillNumber
RepertoireMetrics.InverseSimpson
RepertoireMetrics.LengthStats
RepertoireMetrics.LineageIDDefinition
RepertoireMetrics.MaxLength
RepertoireMetrics.MeanLength
RepertoireMetrics.MedianLength
RepertoireMetrics.MetricSet
RepertoireMetrics.Metrics
RepertoireMetrics.MinLength
RepertoireMetrics.NormalizedShannon
RepertoireMetrics.Repertoire
RepertoireMetrics.RepertoireCollection
RepertoireMetrics.Richness
RepertoireMetrics.ShannonDiversity
RepertoireMetrics.ShannonEntropy
RepertoireMetrics.SimpsonDiversity
RepertoireMetrics.SimpsonIndex
RepertoireMetrics.StdLength
RepertoireMetrics.TotalCount
RepertoireMetrics.VJCdr3Definition
RepertoireMetrics.berger_parker_index
RepertoireMetrics.chao1
RepertoireMetrics.clonality
RepertoireMetrics.compute_length_stats
RepertoireMetrics.compute_metric
RepertoireMetrics.compute_metrics
RepertoireMetrics.counts
RepertoireMetrics.d50
RepertoireMetrics.donor_id
RepertoireMetrics.evenness
RepertoireMetrics.first_allele
RepertoireMetrics.frequencies
RepertoireMetrics.gini_coefficient
RepertoireMetrics.has_length_stats
RepertoireMetrics.hill_number
RepertoireMetrics.inverse_simpson
RepertoireMetrics.length_distribution
RepertoireMetrics.length_stats
RepertoireMetrics.lineage_ids
RepertoireMetrics.lineage_key
RepertoireMetrics.max_length
RepertoireMetrics.mean_length
RepertoireMetrics.median_length
RepertoireMetrics.metrics_to_dataframe
RepertoireMetrics.min_length
RepertoireMetrics.rarefaction
RepertoireMetrics.read_repertoire
RepertoireMetrics.read_repertoires
RepertoireMetrics.read_repertoires_from_directory
RepertoireMetrics.repertoire_from_dataframe
RepertoireMetrics.richness
RepertoireMetrics.shannon_entropy
RepertoireMetrics.simpson_diversity
RepertoireMetrics.simpson_index
RepertoireMetrics.split_by_donor
RepertoireMetrics.std_length
RepertoireMetrics.total_count
RepertoireMetrics.write_metrics