API Reference

Lineage Definition Strategies

RepertoireMetrics.AbstractLineageDefinition - Type
AbstractLineageDefinition

Abstract supertype for all lineage definition strategies. Concrete subtypes define how sequences are grouped into clonal lineages.

Implementing a new strategy

To implement a new lineage definition strategy:

  1. Create a subtype of AbstractLineageDefinition
  2. Implement lineage_key(strategy, row) returning a hashable key
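
The two steps above might look like the following. This is a minimal sketch, not the package's own code: it defines local stand-ins (AbstractLineageDefinitionSketch, lineage_key_sketch) so it runs without RepertoireMetrics, and the VGeneFamilyDefinition strategy is hypothetical. With the package loaded, the subtype would presumably derive from AbstractLineageDefinition and add a method to lineage_key instead.

```julia
# Local stand-ins (assumptions) so the sketch runs without the package.
abstract type AbstractLineageDefinitionSketch end

# Step 1: a concrete strategy type holding whatever configuration it needs.
struct VGeneFamilyDefinition <: AbstractLineageDefinitionSketch
    v_column::Symbol
end
VGeneFamilyDefinition() = VGeneFamilyDefinition(:v_call)

# Step 2: a key function that returns a hashable value (here a String).
function lineage_key_sketch(s::VGeneFamilyDefinition, row)
    call = string(getproperty(row, s.v_column))
    return String(first(split(call, "-")))
end

# Any row-like object with property access works, e.g. a NamedTuple.
row = (v_call = "IGHV1-2*01", j_call = "IGHJ4*02")
key = lineage_key_sketch(VGeneFamilyDefinition(), row)   # "IGHV1"
```

Rows that produce equal keys end up in the same lineage, so the key should contain exactly the fields that define clonal identity.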

RepertoireMetrics.LineageIDDefinition - Type
LineageIDDefinition <: AbstractLineageDefinition

Define lineages using the lineage_id column directly. This is the simplest strategy when lineage assignment has already been performed.

Fields

  • column::Symbol: Column name containing lineage IDs (default: :lineage_id)

RepertoireMetrics.VJCdr3Definition - Type
VJCdr3Definition <: AbstractLineageDefinition

Define lineages using the combination of V gene, J gene, and CDR3 sequence. This is consistent with LineageCollapse.jl's default behavior.

Fields

  • v_column::Symbol: Column name for V gene call (default: :v_call)
  • j_column::Symbol: Column name for J gene call (default: :j_call)
  • cdr3_column::Symbol: Column name for CDR3 sequence (default: :cdr3)
  • use_first_allele::Bool: Use only first allele from calls (default: true)

RepertoireMetrics.CustomDefinition - Type
CustomDefinition{F} <: AbstractLineageDefinition

Define lineages using a custom function that extracts a key from each row.

Fields

  • key_func::F: Function row -> key that returns a hashable lineage key

Example

# Group by V gene family only
strategy = CustomDefinition(row -> first(split(string(row.v_call), "-")))

RepertoireMetrics.lineage_key - Function
lineage_key(strategy::AbstractLineageDefinition, row) -> key

Extract a lineage key from a data row using the given strategy. Returns a hashable value that identifies the lineage.

RepertoireMetrics.first_allele - Function
first_allele(call::AbstractString) -> String

Extract the first allele from a gene call string. Handles comma-separated multiple calls and allele notation.

Examples

first_allele("IGHV1-2*01,IGHV1-2*02") # "IGHV1-2*01"
first_allele("IGHV1-2*01")             # "IGHV1-2*01"

Data Structures

RepertoireMetrics.Repertoire - Type
Repertoire{T<:Real}

Immutable representation of a B cell repertoire for diversity analysis. Stores lineage counts in a type-stable manner for efficient metric computation.

Type parameter

  • T: Numeric type for counts (typically Int or Float64)

Fields

  • counts::Vector{T}: Sorted vector of lineage counts (descending order)
  • lineage_ids::Vector{String}: Lineage identifiers corresponding to counts
  • donor_id::String: Identifier for the donor/sample
  • total_count::T: Total count (cached for efficiency)
  • metadata::Dict{String,Any}: Optional metadata (flexible storage for extensibility; known keys like "length_stats" contain typed values)

Construction

Use the constructors or read_repertoire for type-safe creation.

Example

# From counts directly
rep = Repertoire([100, 50, 25, 10, 5], donor_id="Donor1")

# From a file, using a lineage definition strategy
rep = read_repertoire("data.tsv", VJCdr3Definition())

RepertoireMetrics.RepertoireCollection - Type
RepertoireCollection{T<:Real}

Collection of repertoires from multiple donors for comparative analysis.

Fields

  • repertoires::Vector{Repertoire{T}}: Vector of repertoires
  • donor_ids::Vector{String}: Donor identifiers (for quick lookup)

Example

collection = RepertoireCollection([rep1, rep2, rep3])
metrics = compute_metrics(collection)  # Returns a Vector{Metrics}, one per repertoire

RepertoireMetrics.richness - Function
richness(rep::Repertoire) -> Int

Return the number of unique lineages (richness/species count).

RepertoireMetrics.frequencies - Function
frequencies(rep::Repertoire{T}) -> Vector{Float64}

Return normalized frequencies (proportions) for each lineage. Frequencies sum to 1.0.
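
As a rough illustration of what richness and frequencies represent, the sketch below derives both from a plain count vector. It does not call the package; the variable names are hypothetical, and the counts are the ones used in the Repertoire example.

```julia
# Lineage counts, sorted in descending order as a Repertoire stores them.
counts = [100, 50, 25, 10, 5]

# Richness: the number of lineages with at least one observation.
n_lineages = count(>(0), counts)

# Frequencies: each count divided by the total, so the result sums to 1.
freqs = counts ./ sum(counts)
```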

Reading Data

RepertoireMetrics.read_repertoire - Function
read_repertoire(
    filepath::AbstractString,
    strategy::AbstractLineageDefinition;
    count_column::Union{Symbol,Nothing} = :count,
    donor_id::String = "",
    donor_column::Union{Symbol,Nothing} = nothing,
    length_column::Union{Symbol,Nothing} = nothing,
    length_aa::Bool = false,
    kwargs...
) -> Repertoire{Int}

Read a MiAIRR file (TSV, CSV, or gzipped) and convert it to a Repertoire. The delimiter is auto-detected by CSV.jl.

Arguments

  • filepath: Path to file (supports .tsv, .csv, .tsv.gz, .csv.gz)
  • strategy: Lineage definition strategy
  • count_column: Column containing sequence counts (default :count)
  • donor_id: Donor/sample identifier
  • donor_column: Column to extract donor ID from (e.g., :library_id)
  • length_column: Column to compute length statistics from (e.g., :cdr3)
  • length_aa: If true, length_column contains amino acid sequences
  • kwargs...: Additional arguments passed to CSV.read

Example

# Using lineage_id column
rep = read_repertoire("collapsed_data.tsv", LineageIDDefinition())

# Using V-J-CDR3 definition with CDR3 length stats
rep = read_repertoire("sequences.tsv", VJCdr3Definition(); length_column=:cdr3)

# With donor ID from file
rep = read_repertoire("data.tsv", VJCdr3Definition(); donor_column=:library_id)

# Compute length from amino acid column
rep = read_repertoire("data.tsv", VJCdr3Definition(); 
    length_column=:cdr3_aa, length_aa=true)

RepertoireMetrics.read_repertoires - Function
read_repertoires(
    filepaths::Vector{<:AbstractString},
    strategy::AbstractLineageDefinition;
    kwargs...
) -> RepertoireCollection{Int}

Read multiple MiAIRR TSV files and return a collection.

Example

files = ["donor1.tsv", "donor2.tsv", "donor3.tsv"]
collection = read_repertoires(files, VJCdr3Definition())
all_metrics = compute_metrics(collection)

RepertoireMetrics.read_repertoires_from_directory - Function
read_repertoires_from_directory(
    dirpath::AbstractString,
    strategy::AbstractLineageDefinition;
    pattern::Regex = r"\.tsv$"i,
    kwargs...
) -> RepertoireCollection{Int}

Read all matching files from a directory and return a collection.

Arguments

  • dirpath: Path to directory containing TSV files
  • strategy: Lineage definition strategy
  • pattern: Regex pattern to match filenames (default: .tsv files)
  • kwargs...: Additional arguments passed to read_repertoire

Example

collection = read_repertoires_from_directory("data/", VJCdr3Definition())

RepertoireMetrics.repertoire_from_dataframe - Function
repertoire_from_dataframe(
    df::DataFrame,
    strategy::AbstractLineageDefinition;
    count_column::Union{Symbol,Nothing} = :count,
    donor_id::String = "",
    donor_column::Union{Symbol,Nothing} = nothing,
    length_column::Union{Symbol,Nothing} = nothing,
    length_aa::Bool = false
) -> Repertoire{Int}

Convert a DataFrame to a Repertoire using the specified lineage definition strategy.

Arguments

  • df: Input DataFrame in MiAIRR format
  • strategy: Lineage definition strategy (e.g., VJCdr3Definition(), LineageIDDefinition())
  • count_column: Column containing sequence counts (default :count). If nothing or column doesn't exist, each row counts as 1.
  • donor_id: Donor/sample identifier. If empty and donor_column is specified, extracts from first row.
  • donor_column: Column containing donor IDs (e.g., :library_id)
  • length_column: Column to compute length statistics from (e.g., :cdr3, :junction). If nothing (default), no length statistics are computed.
  • length_aa: If true, length_column contains amino acid sequences. If false (default), assumes nucleotide sequences and converts to amino acid length (÷3).

Returns

A Repertoire{Int} with aggregated lineage counts. If length_column is specified, length statistics are stored in metadata and accessible via composable metrics like MeanLength(), MedianLength(), etc.

Example

df = CSV.read("sequences.tsv", DataFrame)

# Basic repertoire
rep = repertoire_from_dataframe(df, VJCdr3Definition())

# With CDR3 length statistics
rep = repertoire_from_dataframe(df, VJCdr3Definition(); length_column=:cdr3)
println(mean_length(rep))  # Access via function
metrics = compute_metrics(rep, MeanLength() + MedianLength())  # Or compose

# Using amino acid column directly
rep = repertoire_from_dataframe(df, VJCdr3Definition(); 
    length_column=:cdr3_aa, length_aa=true)

RepertoireMetrics.split_by_donor - Function
split_by_donor(
    df::DataFrame,
    donor_column::Symbol,
    strategy::AbstractLineageDefinition;
    kwargs...
) -> RepertoireCollection{Int}

Split a multi-donor DataFrame into separate Repertoires by donor.

Arguments

  • df: DataFrame containing data from multiple donors
  • donor_column: Column containing donor identifiers
  • strategy: Lineage definition strategy
  • kwargs...: Additional arguments passed to repertoire_from_dataframe

Example

df = CSV.read("all_donors.tsv", DataFrame)
collection = split_by_donor(df, :library_id, VJCdr3Definition())

Computing Metrics

Main Functions

RepertoireMetrics.compute_metrics - Function
compute_metrics(rep::Repertoire) -> Metrics
compute_metrics(rep::Repertoire, metrics::MetricSet) -> Metrics

Compute metrics for a repertoire. Without a MetricSet argument, computes all metrics.

Examples

# Compute all metrics (default)
result = compute_metrics(rep)

# Compute specific metrics
result = compute_metrics(rep, ShannonEntropy() + Clonality() + D50())

# Use predefined sets
result = compute_metrics(rep, DIVERSITY_METRICS)
compute_metrics(collection::RepertoireCollection) -> Vector{Metrics}
compute_metrics(collection::RepertoireCollection, metrics::MetricSet) -> Vector{Metrics}

Compute metrics for all repertoires in a collection.

RepertoireMetrics.Metrics - Type
Metrics <: AbstractMetricResult

Result container for computed metrics. Access values by property name. All values are stored as Float64 for type stability.

Example

result = compute_metrics(rep)
println(result.shannon_entropy)
println(result.clonality)
println(result.d50)

Individual Metric Functions

RepertoireMetrics.shannon_entropy - Function
shannon_entropy(freqs::Vector{Float64}) -> Float64

Compute Shannon entropy H = -Σ(pᵢ log pᵢ) using the natural logarithm. Zero frequencies contribute nothing to the sum, following the convention 0 * log(0) = 0.

shannon_entropy(rep::Repertoire) -> Float64

Compute Shannon entropy for a repertoire.
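
The formula and the zero-frequency convention can be sketched in a few lines. The helper below is a hypothetical standalone function (shannon_entropy_sketch), not the package's implementation.

```julia
# Hypothetical standalone helper illustrating H = -Σ pᵢ log pᵢ.
function shannon_entropy_sketch(freqs::AbstractVector{<:Real})
    h = 0.0
    for p in freqs
        # Skip zero frequencies: 0 * log(0) is taken to be 0.
        p > 0 && (h -= p * log(p))
    end
    return h
end

freqs = [0.5, 0.25, 0.25, 0.0]
h = shannon_entropy_sketch(freqs)   # equals 1.5 * log(2)
```

The expected value follows by hand: 0.5 log 2 + 2 × 0.25 × log 4 = 1.5 log 2.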

RepertoireMetrics.simpson_index - Function
simpson_index(freqs::Vector{Float64}) -> Float64

Compute Simpson's index D = Σpᵢ². This is the probability that two randomly selected individuals belong to the same lineage.

simpson_index(rep::Repertoire) -> Float64

Compute Simpson's index for a repertoire.
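
A short worked example of D = Σpᵢ², using a hypothetical standalone helper (simpson_index_sketch) rather than the package function:

```julia
# Hypothetical helper: sum of squared frequencies.
simpson_index_sketch(freqs) = sum(abs2, freqs)

freqs = [0.5, 0.25, 0.25]
d = simpson_index_sketch(freqs)   # 0.25 + 0.0625 + 0.0625 = 0.375
inverse_simpson = 1 / d           # the Hill number of order 2
```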

RepertoireMetrics.berger_parker_index - Function
berger_parker_index(freqs::Vector{Float64}) -> Float64

Compute Berger-Parker index: proportion of the most abundant lineage. Assumes frequencies are sorted in descending order.

berger_parker_index(rep::Repertoire) -> Float64

Compute Berger-Parker index for a repertoire.

RepertoireMetrics.gini_coefficient - Function
gini_coefficient(freqs::Vector{Float64}) -> Float64

Compute Gini coefficient measuring inequality in lineage abundances. Returns 0 for perfect equality, approaches 1 for maximum inequality.

gini_coefficient(rep::Repertoire) -> Float64

Compute Gini coefficient for a repertoire.
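
The exact estimator is not stated here, so the following is only a sketch under an assumption: it uses the common rank-based formula G = 2 Σ(i xᵢ) / (n Σxᵢ) - (n + 1)/n with values sorted in ascending order. The helper name gini_sketch is hypothetical.

```julia
# Hypothetical helper using the rank-based Gini formula (an assumption;
# the package may use a different but equivalent or corrected estimator).
function gini_sketch(freqs::AbstractVector{<:Real})
    x = sort(freqs)                      # ascending order
    n = length(x)
    total = sum(x)
    (n == 0 || total == 0) && return 0.0
    weighted = sum(i * x[i] for i in 1:n)
    return 2 * weighted / (n * total) - (n + 1) / n
end

g_equal = gini_sketch([0.25, 0.25, 0.25, 0.25])   # perfect equality
g_skewed = gini_sketch([1.0, 0.0, 0.0])           # one lineage holds everything
```

With this formula the maximum for n lineages is (n - 1)/n, which is why the coefficient approaches but does not reach 1.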

RepertoireMetrics.clonality - Function
clonality(rep::Repertoire) -> Float64

Compute clonality: 1 - normalized Shannon entropy. High clonality indicates oligoclonal expansion. Range: [0, 1] where 1 = maximally clonal (single dominant lineage).

RepertoireMetrics.evenness - Function
evenness(rep::Repertoire) -> Float64

Compute Pielou's evenness J = H / H_max. Range: [0, 1] where 1 = perfectly even distribution.
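
Evenness and clonality are two views of the same quantity. The sketch below assumes H_max = log(S), where S is the number of lineages with non-zero frequency, and leaves the single-lineage case undefined by raising an error; how the package treats that edge case is not described here. Both helper names are hypothetical.

```julia
# Hypothetical helpers; H_max = log(S) is an assumption.
function evenness_sketch(freqs::AbstractVector{<:Real})
    p = filter(>(0), freqs)
    length(p) >= 2 || throw(ArgumentError("need at least two lineages"))
    h = -sum(x -> x * log(x), p)      # Shannon entropy
    return h / log(length(p))         # J = H / H_max
end

clonality_sketch(freqs) = 1 - evenness_sketch(freqs)

j_uniform = evenness_sketch([0.25, 0.25, 0.25, 0.25])   # perfectly even
c_skewed = clonality_sketch([0.5, 0.25, 0.25])
```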

RepertoireMetrics.d50 - Function
d50(counts::Vector{<:Real}) -> Int

Compute D50: the minimum number of lineages, taken from most to least abundant, whose cumulative count reaches 50% of the total repertoire. Assumes counts are sorted in descending order.

d50(rep::Repertoire) -> Int

Compute D50 for a repertoire.
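
The cumulative-sum logic behind D50 can be sketched as follows. The helper d50_sketch is hypothetical; unlike the documented function it sorts its input first, and it treats "50%" as "at least half of the total", which is an assumption about tie handling.

```julia
# Hypothetical helper: walk down the sorted counts until half the total is covered.
function d50_sketch(counts::AbstractVector{<:Real})
    sorted = sort(counts; rev = true)
    half = sum(sorted) / 2
    running = zero(eltype(sorted))
    for (i, c) in enumerate(sorted)
        running += c
        running >= half && return i
    end
    return length(sorted)
end

d50_sketch([100, 50, 25, 10, 5])   # 1: the top lineage alone exceeds 95 of 190
d50_sketch([40, 30, 20, 10])       # 2: 40 + 30 = 70 passes the halfway mark of 50
```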

RepertoireMetrics.chao1 - Function
chao1(counts::Vector{<:Integer}) -> Float64

Compute Chao1 richness estimator. Estimates total species richness including unobserved species.

Formula: S_chao1 = S_obs + f₁²/(2f₂) where f₁ = singletons, f₂ = doubletons

If f₂ = 0, uses bias-corrected formula: S_obs + f₁(f₁-1)/2

chao1(rep::Repertoire) -> Float64

Compute Chao1 richness estimator for a repertoire.
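
Both branches of the formula can be checked with a small standalone sketch. The helper chao1_sketch is hypothetical and simply transcribes the two formulas given above.

```julia
# Hypothetical helper transcribing the two Chao1 formulas.
function chao1_sketch(counts::AbstractVector{<:Integer})
    s_obs = count(>(0), counts)    # observed lineages
    f1 = count(==(1), counts)      # singletons
    f2 = count(==(2), counts)      # doubletons
    return f2 > 0 ? s_obs + f1^2 / (2 * f2) : s_obs + f1 * (f1 - 1) / 2
end

chao1_sketch([10, 5, 2, 2, 1, 1, 1])   # 7 + 3²/(2·2) = 9.25
chao1_sketch([10, 1, 1, 1])            # no doubletons: 4 + 3·2/2 = 7.0
```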

RepertoireMetrics.hill_number - Function
hill_number(freqs::Vector{Float64}, q::Real) -> Float64

Compute Hill number of order q. Hill numbers provide a unified framework for diversity metrics.

Special cases

  • q=0: Richness (number of species with non-zero frequency)
  • q=1: exp(Shannon entropy), defined as the limit q → 1 because the general formula is undefined at q = 1
  • q=2: Inverse Simpson (1/Σpᵢ²)
  • q→∞: 1/max(pᵢ) (reciprocal of Berger-Parker)

Formula

^qD = (Σ pᵢ^q)^(1/(1-q)) for q ≠ 1, where the left superscript q denotes the order

hill_number(rep::Repertoire, q::Real) -> HillNumber{Q}

Compute Hill number of order q for a repertoire. Returns a typed HillNumber{Q} for compile-time known orders.
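
To make the special cases concrete, the sketch below evaluates the general formula alongside the q = 1 and q = ∞ limits. It is a hypothetical standalone helper (hill_number_sketch) that returns a plain Float64, not the package's typed HillNumber{Q}, and it assumes q ≥ 0.

```julia
# Hypothetical helper; assumes q >= 0 and at least one non-zero frequency.
function hill_number_sketch(freqs::AbstractVector{<:Real}, q::Real)
    p = filter(>(0), freqs)
    q == 1 && return exp(-sum(x -> x * log(x), p))   # limit as q → 1
    q == Inf && return 1 / maximum(p)                # limit as q → ∞
    return sum(x -> x^q, p)^(1 / (1 - q))
end

freqs = [0.5, 0.25, 0.25]
h0 = hill_number_sketch(freqs, 0)      # richness: 3
h1 = hill_number_sketch(freqs, 1)      # exp(1.5 log 2) = 2^1.5
h2 = hill_number_sketch(freqs, 2)      # 1 / 0.375
hinf = hill_number_sketch(freqs, Inf)  # 1 / 0.5
```

The values decrease as q grows, because higher orders weight abundant lineages more heavily.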

Hill Numbers

RepertoireMetrics.HillNumber - Type
HillNumber{Q} <: AbstractMetricResult

Hill number of order Q, providing a unified framework for diversity.

Type parameter

  • Q: Order of the Hill number (compile-time constant for type stability)

Fields

  • value::Float64: The computed Hill number
  • richness::Int: Number of unique lineages
  • total_count::Int: Total count

Special cases

  • Q=0: Richness (number of species)
  • Q=1: Exponential of Shannon entropy
  • Q=2: Inverse Simpson index
  • Q=∞: Reciprocal of Berger-Parker index

Composable Metric Selection

RepertoireMetrics.AbstractMetric - Type
AbstractMetric

Abstract supertype for individual metric selectors. Each concrete subtype represents a specific diversity/clonality metric.

RepertoireMetrics.MetricSet - Type
MetricSet{M<:Tuple}

A composable set of metrics to compute. Use the + operator to combine metrics.

Examples

# Select specific metrics
metrics = Richness() + ShannonEntropy() + Clonality()
result = compute_metrics(rep, metrics)

# Use predefined sets
result = compute_metrics(rep, DIVERSITY_METRICS)
result = compute_metrics(rep, ALL_METRICS)

# Default computes all metrics
result = compute_metrics(rep)

Predefined Metric Sets

RepertoireMetrics.ROBUST_METRICS - Constant
ROBUST_METRICS

Depth-robust metrics recommended for comparing samples of different sizes. These metrics are computed from frequencies and are less sensitive to sequencing depth. Includes Depth so you always report sequencing depth alongside comparisons.

Metric Types

Sampling

RepertoireMetrics.rarefaction - Function
rarefaction(rep::Repertoire, depth::Integer; rng=nothing) -> Repertoire

Randomly subsample a repertoire to a specified depth (total count).

Many diversity metrics are sensitive to sample size: deeper sequencing captures more rare lineages. Rarefaction normalizes sample sizes by randomly subsampling all repertoires to the same depth, enabling fair comparisons.

Arguments

  • rep: Input repertoire
  • depth: Target total count (must be ≤ current total count of rep)
  • rng: Optional random number generator for reproducibility

Returns

A new Repertoire with subsampled counts. Lineages that received zero counts in the subsample are removed.

Example

# Compare two repertoires of different sizes
rep1 = read_repertoire("donor1.tsv", VJCdr3Definition())  # 50,000 sequences
rep2 = read_repertoire("donor2.tsv", VJCdr3Definition())  # 12,000 sequences

# Rarefy to the smaller depth
target = min(total_count(rep1), total_count(rep2))
rep1_rare = rarefaction(rep1, target)
rep2_rare = rarefaction(rep2, target)

# Now compare fairly
compute_metrics(rep1_rare)
compute_metrics(rep2_rare)

# For reproducibility, use a fixed RNG
using Random
rng = MersenneTwister(42)
rarefied = rarefaction(rep, 10000; rng=rng)

Notes

  • Rarefaction is stochastic; results vary between runs unless rng is fixed
  • For robust estimates, consider averaging metrics over multiple rarefactions
  • Metrics like Simpson index are naturally less sensitive to sample size
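
Conceptually, rarefaction draws individual sequences at random and recounts the lineages. The sketch below operates on a plain count vector and samples without replacement, which is an assumption suggested by the requirement that depth not exceed the total count; the helper name rarefy_counts_sketch is hypothetical and the package may use a more memory-efficient method.

```julia
using Random

# Hypothetical helper: subsample `depth` sequences without replacement.
function rarefy_counts_sketch(counts::Vector{Int}, depth::Integer;
                              rng = Random.default_rng())
    total = sum(counts)
    0 <= depth <= total || throw(ArgumentError("depth must be between 0 and $total"))
    # One entry per sequence, labelled with its lineage index.
    pool = Int[]
    for (i, c) in enumerate(counts)
        append!(pool, fill(i, c))
    end
    shuffle!(rng, pool)
    # Recount lineages among the first `depth` shuffled sequences.
    sub = zeros(Int, length(counts))
    for i in view(pool, 1:depth)
        sub[i] += 1
    end
    # Drop lineages that were not sampled and keep descending order.
    return sort(filter(>(0), sub); rev = true)
end

rarefied = rarefy_counts_sketch([100, 50, 25, 10, 5], 60; rng = MersenneTwister(42))
```

Expanding the pool costs memory proportional to the total count, so this form is meant for illustration rather than for very deep repertoires.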

Exporting Results

RepertoireMetrics.metrics_to_dataframe - Function
metrics_to_dataframe(m::Metrics, donor_id::String="") -> DataFrame

Convert Metrics to a single-row DataFrame.

metrics_to_dataframe(collection::RepertoireCollection, metrics::Vector{Metrics}) -> DataFrame

Convert a collection's metrics to a DataFrame with one row per donor.

Length Statistics

Types and Functions

RepertoireMetrics.LengthStats - Type
LengthStats

Container for sequence length statistics, stored in Repertoire metadata.

Fields

  • mean_length::Float64: Mean sequence length
  • median_length::Float64: Median sequence length
  • std_length::Float64: Standard deviation of sequence length
  • min_length::Int: Minimum sequence length
  • max_length::Int: Maximum sequence length
  • n_sequences::Int: Number of sequences analyzed
  • column::Symbol: Column used for length calculation
  • use_aa::Bool: Whether lengths are in amino acids

Computed during repertoire construction when a length_column is specified.

RepertoireMetrics.compute_length_stats - Function
compute_length_stats(sl::SequenceLengths) -> LengthStats

Compute length statistics from extracted sequence lengths.

compute_length_stats(df::DataFrame; length_column=:cdr3, count_column=nothing, use_aa=false) -> LengthStats

Compute length statistics from a DataFrame column.

Arguments

  • df: DataFrame containing sequences
  • length_column: Column to measure length from (default :cdr3)
  • count_column: If provided, weight statistics by this count column
  • use_aa: If true, column contains amino acid sequences. If false (default), assumes nucleotide sequences and divides length by 3.

Returns

A LengthStats object with computed statistics.
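
The effect of use_aa and count_column can be illustrated on plain vectors. This sketch assumes nucleotide lengths are converted with integer division by 3 and that weighting means each sequence contributes in proportion to its count; neither detail is confirmed here, and the variable names are hypothetical.

```julia
using Statistics

# Three nucleotide CDR3 sequences (12, 9 and 15 nt) with their counts.
cdr3_nt = ["TGTGCGAGAGAT", "TGTGCGAAA", "TGTGCGAGAGATTAC"]
counts = [1, 2, 1]

# use_aa = false: convert nucleotide length to amino acid length.
aa_lengths = [length(s) ÷ 3 for s in cdr3_nt]             # [4, 3, 5]

unweighted_mean = mean(aa_lengths)                         # 4.0
weighted_mean = sum(aa_lengths .* counts) / sum(counts)    # (4 + 6 + 5) / 4 = 3.75
shortest, longest = extrema(aa_lengths)                    # (3, 5)
```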

RepertoireMetrics.length_distribution - Function
length_distribution(sl::SequenceLengths) -> Dict{Int,Int}

Compute length distribution from extracted sequence lengths.

length_distribution(df::DataFrame; length_column=:cdr3, count_column=nothing, use_aa=false) -> Dict{Int,Int}

Compute the distribution of sequence lengths.

Arguments

  • df: DataFrame containing sequences
  • length_column: Column to measure length from (default :cdr3)
  • count_column: If provided, weight by this count column
  • use_aa: If true, column contains amino acid sequences

Returns

A Dict{Int,Int} mapping length to count.

Example

dist = length_distribution(df; length_column=:cdr3)
for (len, cnt) in sort(collect(dist))
    println("Length $len: $cnt sequences")
end

Composable Length Metrics

RepertoireMetrics.MeanLength - Type
MeanLength <: AbstractMetric

Metric type for mean sequence length. Requires length_column to be specified during repertoire construction.

RepertoireMetrics.mean_length - Function
mean_length(rep::Repertoire) -> Float64

Return the mean sequence length. Requires length statistics to be computed during repertoire construction (via length_column parameter).

RepertoireMetrics.LENGTH_METRICS - Constant
LENGTH_METRICS

Predefined MetricSet containing all length metrics: MeanLength() + MedianLength() + StdLength() + MinLength() + MaxLength()
