LineageCollapse

Documentation for LineageCollapse, a Julia package for collapsing lineages in AIRR data.

Overview

LineageCollapse provides tools for processing and analyzing adaptive immune receptor repertoire (AIRR) data. It offers functions for data loading, preprocessing, lineage assignment, and lineage collapsing.

Architecture

Processing Pipeline

The library follows a linear pipeline where each stage transforms your data:

┌─────────────────────────────────────────────────────────────────────────────┐
│                          LineageCollapse Pipeline                           │
└─────────────────────────────────────────────────────────────────────────────┘

   AIRR TSV File
        │
        ▼
┌───────────────┐
│  load_data()  │  Load and validate AIRR-formatted data
└───────┬───────┘
        │ DataFrame
        ▼
┌──────────────────────┐
│  preprocess_data()   │  Filter, derive D-region, normalize columns
└──────────┬───────────┘
           │ DataFrame + d_region, v_call_first, j_call_first
           ▼
┌────────────────────────────────────────────────────────────────────────────┐
│  process_lineages(df, threshold)                                           │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │  For each (V-gene, J-gene, CDR3-length) group:                       │  │
│  │    1. Compute pairwise CDR3 distances (AbstractDistanceMetric)       │  │
│  │    2. Cluster sequences (ClusteringMethod)                           │  │
│  │    3. Assign lineage_id to each cluster                              │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
└──────────┬─────────────────────────────────────────────────────────────────┘
           │ DataFrame + lineage_id, cluster, cluster_size, cdr3_frequency
           ▼
┌────────────────────────────────────────────────────────────────────────────┐
│  collapse_lineages(df, strategy)                                           │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │  CollapseStrategy determines output:                                 │  │
│  │    • Hardest() → one representative per lineage                      │  │
│  │    • Soft(cutoff) → all clones above frequency threshold             │  │
│  │                                                                      │  │
│  │  AbstractTieBreaker resolves ties when selecting representatives     │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
└──────────┬─────────────────────────────────────────────────────────────────┘
           │
           ▼
   Collapsed DataFrame (representatives or filtered clones)

Type Hierarchy

The library uses Julia's multiple dispatch with abstract types for extensibility:

AbstractDistanceMetric           ClusteringMethod              CollapseStrategy
         │                              │                             │
         ├── HammingDistance            └── HierarchicalClustering    ├── Hardest
         ├── NormalizedHammingDistance          │                     └── Soft
         └── LevenshteinDistance                └── cutoff::T
                                                                           │
                                                                           │
AbstractTieBreaker ◄───────────────────────────────────────────────────────┘
         │                                                    (used by Hardest)
         ├── TieBreaker
         │       └── criteria::Vector{Pair{Symbol,Bool}}
         │
         └── MostCommonVdjNtTieBreaker

Extension Points

You can extend the library by defining new types and methods:

Custom Distance Metric:

struct MyDistance <: AbstractDistanceMetric end

# Implement the required method
LineageCollapse.compute_distance(::MyDistance, x::LongDNA{4}, y::LongDNA{4}) = ...
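
The extension pattern above can be tried in isolation. The following is a minimal sketch that does not load LineageCollapse: `DemoMetric`, `IgnoreNDistance`, and `demo_distance` are hypothetical stand-ins for `AbstractDistanceMetric`, a user-defined metric, and `compute_distance`, and plain strings stand in for `LongDNA{4}`.

```julia
# Stand-in for the package's abstract metric type (hypothetical name).
abstract type DemoMetric end

# A custom metric: mismatches are counted, but positions where either
# sequence carries an ambiguous 'N' are skipped.
struct IgnoreNDistance <: DemoMetric end

# One method per metric type; dispatch on the first argument selects it.
function demo_distance(::IgnoreNDistance, x::AbstractString, y::AbstractString)
    length(x) == length(y) || throw(ArgumentError("sequences must have equal length"))
    mismatches = count(((a, b),) -> a != 'N' && b != 'N' && a != b, zip(x, y))
    return Float32(mismatches)  # mirrors the Float32 return of compute_distance
end

d = demo_distance(IgnoreNDistance(), "ACGTN", "ACCTA")
```

In the real package, the same shape applies with `AbstractDistanceMetric` as the supertype and a `LineageCollapse.compute_distance` method taking `LongDNA{4}` arguments.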

Custom Tie-Breaker:

struct MyTieBreaker <: AbstractTieBreaker end

# Or use the built-in TieBreaker with custom criteria
my_breaker = TieBreaker([:my_column => true, :cdr3 => false])

Key Concepts

| Concept | Description |
|---|---|
| Lineage | Group of sequences sharing V-gene, J-gene, CDR3 length, and similar CDR3 |
| Clone | Unique combination of D-region + lineage + V + J + CDR3 within a lineage |
| Clone Frequency | Proportion of sequences in a lineage belonging to a clone |
| Representative | Selected sequence to represent an entire lineage |

Built-in Tie-Breakers

| Function | Strategy |
|---|---|
| ByMostCommonVdjNt() | Most common VDJ nucleotide sequence (igdiscover-compatible) |
| ByVdjCount() | Highest VDJ count, then lexicographic CDR3 |
| ByCdr3Count() | Highest CDR3 count, then lexicographic CDR3 |
| BySequenceCount() | Highest sequence count, then lexicographic CDR3 |
| ByMostNaive() | Highest V/J identity (closest to germline) |
| ByLexicographic() | Lexicographically smallest CDR3 |
| ByFirst() | First candidate (no sorting) |

Tie-breakers can be combined: ByVdjCount() + ByLexicographic()

Installation

using Pkg
Pkg.add("LineageCollapse")

Quick Start

using LineageCollapse

# Load and preprocess
df = load_data("airr_data.tsv.gz")
df = preprocess_data(df; min_d_region_length=3)

# Assign lineages (threshold: an integer is an absolute mismatch count, here 1;
# a float such as 0.1 means 10% of CDR3 length)
lineages = process_lineages(df, 1)

# Collapse options:
# Option A: One representative per lineage
result = collapse_lineages(lineages, Hardest())

# Option B: Keep clones with frequency ≥ 20%
result = collapse_lineages(lineages, Soft(0.2))

# Option C: Custom tie-breaking
result = collapse_lineages(lineages, Hardest(); 
                           tie_breaker=ByMostNaive(),
                           tie_atol=0.01)  # 1% tolerance

Detailed Examples

Using Different Distance Metrics

# Explicit metric configuration
lineages = process_lineages(df;
    distance_metric = NormalizedHammingDistance(),
    clustering_method = HierarchicalClustering(0.1f0),
    linkage = :average
)

Diagnostic: Finding Ties

# Identify lineages where multiple clones have the same max frequency
ties = hardest_tie_summary(lineages; atol=0.01)
filter(:hardest_tied => identity, ties)  # Show only tied lineages

Column reference

This section describes the meaning and computation of the columns each stage adds. The most important is count, which is both an optional input and the main abundance output.

Count (most important)

Input (optional): A numeric column count (e.g. read or UMI count per sequence). If present, it gives the abundance of that row; if absent, each row is treated as having abundance 1. Missing values are treated as 0 when summing.

Output (Hardest only): The collapsed result includes a count column: for each lineage it is the sum of the input count over all sequences in that lineage (total lineage abundance). If the input had no count column, this equals the number of sequences in the lineage.

How it is used: Tie-breakers such as ByMostCommonVdjNt() and ByVdjCount() use these abundances to select the "most common" VDJ nucleotide sequence or clone (by highest total count). Clone frequency and the Soft strategy use the number of sequences (rows) per clone, not the sum of count.
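
The count rules above can be illustrated with plain Julia. This is a sketch on toy rows, not the package's implementation; the row layout and variable names are made up for the example.

```julia
# Rows of one lineage as NamedTuples; `count` may hold `missing`.
rows = [(vdj_nt = "ACGT", count = 5),
        (vdj_nt = "ACGA", count = missing),
        (vdj_nt = "ACGT", count = 2)]

# With a count column: missing values contribute 0 to the lineage total.
lineage_count = sum(coalesce(r.count, 0) for r in rows)

# Without a count column: every row counts as abundance 1,
# so the total equals the number of sequences in the lineage.
rows_no_count = [(vdj_nt = "ACGT",), (vdj_nt = "ACGA",)]
lineage_count_default = sum(1 for _ in rows_no_count)
```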

After preprocess_data

| Column | Meaning | Computation |
|---|---|---|
| d_region | D-region nucleotide sequence | sequence[v_sequence_end+1:j_sequence_start] |
| v_call_first | First V-gene allele | First token of v_call (before comma) |
| j_call_first | First J-gene allele | First token of j_call (before comma) |
| vdj_nt | V–D–J nucleotide sequence | sequence[v_sequence_start:j_sequence_end] (only if those columns exist) |
| cdr3_length | CDR3 length | length(cdr3) |
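
As a worked example of the computations in the table above, the sketch below derives the columns for a single toy record using only Base Julia. The record values are invented for illustration, and the slice is copied as written in the table rather than taken from the package source.

```julia
# Hypothetical toy record; coordinates are 1-based, as in Julia indexing.
sequence         = "AAACCCGGGTTT"
v_sequence_end   = 3
j_sequence_start = 9
v_call           = "IGHV1-69*01,IGHV1-69*02"
cdr3             = "CCCGGG"

# d_region: the slice given in the table above.
d_region = sequence[v_sequence_end+1:j_sequence_start]

# v_call_first: first token of v_call, before the comma.
v_call_first = String(first(split(v_call, ',')))

# cdr3_length: number of characters in the CDR3.
cdr3_length = length(cdr3)
```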

After process_lineages

| Column | Meaning | Computation |
|---|---|---|
| lineage_id | Lineage identifier | Unique id per (V, J, CDR3 length, cluster) |
| cluster | Cluster within V/J/CDR3-length group | From hierarchical clustering of CDR3 distances |
| cluster_size | Sequences in this cluster | Number of rows in the same cluster |
| min_distance | Min distance to another CDR3 in cluster | Smallest pairwise distance to another sequence in the cluster |
| cdr3_count | How many sequences share this CDR3 in the cluster | Number of rows with same CDR3 in same cluster |
| max_cdr3_count | Max cdr3_count in this cluster | Maximum of cdr3_count over the cluster |
| cdr3_frequency | Relative frequency of this CDR3 in cluster | cdr3_count / max_cdr3_count (0–1) |
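
The count-based columns in the table above follow from simple tallies. Here is a small sketch for one cluster, using toy CDR3 strings and Base Julia only; it illustrates the formulas and is not the package's code.

```julia
# CDR3 sequences of the rows in one cluster (toy data).
cluster_cdr3 = ["CARDY", "CARDY", "CARDY", "CARDF", "CARDW", "CARDW"]

# Tally how many rows share each CDR3 within the cluster.
counts = Dict{String,Int}()
for c in cluster_cdr3
    counts[c] = get(counts, c, 0) + 1
end

cdr3_count     = [counts[c] for c in cluster_cdr3]  # per-row count
max_cdr3_count = maximum(cdr3_count)                # same value for every row
cdr3_frequency = cdr3_count ./ max_cdr3_count       # relative to the most common CDR3
cluster_size   = length(cluster_cdr3)
```

Note that cdr3_frequency is relative to the most common CDR3 in the cluster, so the most common CDR3 always has frequency 1.0 and the values do not sum to 1.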

After collapse_lineages with Hardest

| Column | Meaning | Computation |
|---|---|---|
| count | Total lineage abundance | Sum of input count (or 1 per row if no input count) over all sequences in the lineage; see "Count (most important)" above |
| nVDJ_nt | Number of unique VDJ nucleotide sequences in lineage | Count of distinct vdj_nt in the lineage (or missing if no vdj_nt) |

After collapse_lineages with Soft

| Column | Meaning | Computation |
|---|---|---|
| clone_frequency | Fraction of lineage (by row count) in this clone | Number of sequences (rows) in this clone ÷ total sequences in the lineage |
| sequence_count | Number of sequences in this clone | Number of rows with same (d_region, lineage_id, v_call_first, j_call_first, cdr3) |
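
To make the Soft columns concrete, the sketch below computes sequence_count and clone_frequency for one toy lineage and then applies a cutoff. The clone labels are placeholders for the five-column clone key, and the filtering shown is an assumption about how a frequency cutoff is applied, not a copy of the package's code.

```julia
# One clone label per row of a single lineage (toy data: 6 + 3 + 1 rows).
clone_keys = vcat(fill("cloneA", 6), fill("cloneB", 3), ["cloneC"])

# sequence_count: number of rows per clone.
sequence_count = Dict{String,Int}()
for k in clone_keys
    sequence_count[k] = get(sequence_count, k, 0) + 1
end

# clone_frequency: rows in the clone divided by rows in the lineage.
total = length(clone_keys)
clone_frequency = Dict(k => n / total for (k, n) in sequence_count)

# Soft-style filter: keep clones whose frequency is at least the cutoff.
cutoff = 0.2
kept = sort([k for (k, f) in clone_frequency if f >= cutoff])
```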

For detailed function signatures and options, see the API Reference below.

LineageCollapse.HierarchicalClustering - Type
HierarchicalClustering{T} <: ClusteringMethod

Hierarchical clustering with a distance cutoff.

Fields

  • cutoff::T: Distance threshold for cluster merging

Example

HierarchicalClustering(1.0f0)  # Merge clusters within distance 1.0

LineageCollapse.MostCommonVdjNtTieBreaker - Type
MostCommonVdjNtTieBreaker

A tie-breaker that selects the representative based on the most common vdj_nt sequence weighted by count, matching igdiscover's clonotypes behavior.

For each lineage, it:

  1. Sums the count for each unique vdj_nt sequence
  2. Selects the vdj_nt with the highest total count
  3. Returns the first row with that vdj_nt
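
The three steps above can be sketched on toy rows with Base Julia. This is an illustration only: the row layout is invented, and resolving equal totals by row order is an assumption of the sketch.

```julia
rows = [(id = 1, vdj_nt = "AAA", count = 3),
        (id = 2, vdj_nt = "CCC", count = 4),
        (id = 3, vdj_nt = "AAA", count = 2)]

# 1. Sum the count for each unique vdj_nt.
totals = Dict{String,Int}()
for r in rows
    totals[r.vdj_nt] = get(totals, r.vdj_nt, 0) + r.count
end

# 2. Select the vdj_nt with the highest total count
#    (equal totals are resolved by row order in this sketch).
best_total = maximum(values(totals))
best_vdj = first(r.vdj_nt for r in rows if totals[r.vdj_nt] == best_total)

# 3. Return the first row with that vdj_nt.
representative = first(r for r in rows if r.vdj_nt == best_vdj)
```

Here "CCC" has the single largest row, but "AAA" wins because its summed count (5) exceeds 4.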

LineageCollapse.Soft - Type
Soft{T} <: CollapseStrategy

Collapse strategy that keeps all clones above a frequency threshold.

Fields

  • cutoff::T: Minimum clone frequency to retain (0.0 to 1.0)

Example

Soft(0.2)  # Keep clones with frequency ≥ 20%

LineageCollapse.TieBreaker - Type
TieBreaker <: AbstractTieBreaker

A configurable tie-breaker that sorts candidates by specified column criteria.

Fields

  • criteria::Vector{Pair{Symbol,Bool}}: Sorting criteria as column => descending pairs. Columns are checked in order; true means sort descending (higher is better).

Example

# Sort by count descending, then by cdr3 ascending (lexicographic)
TieBreaker([:count => true, :cdr3 => false])
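
How ordered `column => descending` criteria pick a winner can be sketched without the package. In the example below, `ranks_before` is a hypothetical helper and the candidate rows are toy data; it shows the sorting idea, not the package's internals.

```julia
candidates = [(cdr3 = "CARDY", count = 5),
              (cdr3 = "CARAA", count = 5),
              (cdr3 = "CARZZ", count = 2)]
criteria = [:count => true, :cdr3 => false]

# Returns true if row `a` should rank before row `b` under the criteria.
function ranks_before(a, b, criteria)
    for (col, descending) in criteria
        x, y = getproperty(a, col), getproperty(b, col)
        x == y && continue                 # tie on this column: check the next one
        return descending ? x > y : x < y
    end
    return false                           # equal on every criterion
end

winner = first(sort(candidates; lt = (a, b) -> ranks_before(a, b, criteria)))
```

Two candidates tie on count, so the second criterion decides and the lexicographically smaller CDR3 wins.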

LineageCollapse.ByMostCommonVdjNt - Method
ByMostCommonVdjNt()

Create a tie-breaker that matches igdiscover's clonotypes representative selection.

Selects the representative by finding the vdj_nt sequence with the highest total count across all members of the lineage, then returns the first row with that vdj_nt.

Requires columns: vdj_nt, count


LineageCollapse.ByMostNaive - Method
ByMostNaive()

Tie-breaker prioritizing sequences closest to germline (highest V/J identity), then by VDJ count, CDR3 count, and lexicographic CDR3.


LineageCollapse.collapse_lineages - Function
collapse_lineages(df::DataFrame, strategy=Hardest(); tie_breaker, tie_atol) -> DataFrame

Collapse lineages to representative sequences.

Arguments

  • df::DataFrame: Data from process_lineages with lineage_id column
  • strategy::CollapseStrategy=Hardest(): Collapse strategy
    • Hardest(): One representative per lineage (highest clone frequency)
    • Soft(cutoff): Keep all clones with frequency ≥ cutoff

Keyword Arguments

  • tie_breaker::AbstractTieBreaker=ByMostCommonVdjNt(): Strategy for breaking ties
  • tie_atol::Real=0.0: Tolerance for frequency comparison
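
The effect of a frequency tolerance can be sketched as follows. This assumes that clones whose frequency lies within the tolerance of the maximum are treated as tied; `tied_clones` is a hypothetical helper, and the package's exact comparison may differ.

```julia
# Clone frequencies within one lineage (toy values).
frequencies = Dict("cloneA" => 0.400, "cloneB" => 0.395, "cloneC" => 0.205)

# Clones whose frequency is within `atol` of the maximum count as tied.
function tied_clones(freqs; atol = 0.0)
    top = maximum(values(freqs))
    return sort([k for (k, f) in freqs if top - f <= atol])
end

strict  = tied_clones(frequencies)               # only the exact maximum
relaxed = tied_clones(frequencies; atol = 0.01)  # 1% tolerance
```

With zero tolerance only cloneA qualifies; with a tolerance of 0.01, cloneA and cloneB both become candidates and the tie-breaker would decide between them.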

Returns

For Hardest():

  • Selected rows with added count::Int (sum of counts) and nVDJ_nt::Int (unique VDJ sequences)

For Soft(cutoff):

  • Rows meeting threshold with added clone_frequency::Float64 and sequence_count::Int

LineageCollapse.compute_distance - Function
compute_distance(metric::AbstractDistanceMetric, x::LongDNA{4}, y::LongDNA{4}) -> Float32

Compute distance between two DNA sequences using the specified metric.


LineageCollapse.deduplicate_data - Function
deduplicate_data(df::DataFrame, use_barcode::Bool=false)::DataFrame

Deduplicate the input DataFrame based on sequence or sequence+barcode.

Arguments

  • df::DataFrame: Input DataFrame.
  • use_barcode::Bool=false: Whether to use barcode for deduplication.

Returns

  • DataFrame: Deduplicated DataFrame.
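
The difference between the two deduplication keys can be shown on toy rows. In this sketch `dedup` is a hypothetical helper built on Base's `unique(f, itr)`, and keeping the first row per key is an assumption of the example.

```julia
rows = [(sequence = "ACGT", barcode = "bc1"),
        (sequence = "ACGT", barcode = "bc2"),
        (sequence = "ACGT", barcode = "bc1"),
        (sequence = "TTGA", barcode = "bc1")]

# Keep the first row for each distinct key:
# the sequence alone, or the (sequence, barcode) pair.
dedup(rows; use_barcode = false) =
    unique(r -> use_barcode ? (r.sequence, r.barcode) : r.sequence, rows)

by_sequence = dedup(rows)
by_sequence_and_barcode = dedup(rows; use_barcode = true)
```

Including the barcode in the key retains identical sequences that come from different barcodes.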

LineageCollapse.load_data - Method
load_data(filepath::String;
          delimiter::Char='\t',
          required_columns=[:sequence_id, :sequence, :v_sequence_end, :j_sequence_start, :cdr3, :v_call, :j_call, :stop_codon])::DataFrame

Load data from a file (compressed or uncompressed) and return a DataFrame.

Arguments

  • filepath::String: Path to the data file.
  • delimiter::Char='\t': Delimiter used in the data file (default: tab).
  • required_columns::Vector{Symbol}: Required columns to select from the data file.

Returns

  • DataFrame: DataFrame containing the loaded data.

Throws

  • ArgumentError: If any of the required columns are missing in the data file.
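
The required-column check can be sketched on a header line alone. `check_required` is a hypothetical helper written for this example; it only illustrates the kind of validation that leads to an ArgumentError and does not read files.

```julia
# Hypothetical helper: check a delimited header line against required columns.
function check_required(header::AbstractString, required::Vector{Symbol};
                        delimiter::Char = '\t')
    present = Symbol.(split(header, delimiter))
    missing_cols = setdiff(required, present)
    isempty(missing_cols) ||
        throw(ArgumentError("missing required columns: " * join(missing_cols, ", ")))
    return present
end

header = "sequence_id\tsequence\tcdr3"
cols = check_required(header, [:sequence_id, :cdr3])
```

Calling `check_required(header, [:v_call])` on the same header would throw, since v_call is absent.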

LineageCollapse.perform_clustering - Method
perform_clustering(method::HierarchicalClustering, linkage, dist_matrix) -> Vector{Int}

Perform hierarchical clustering and return cluster assignments.
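
For intuition about what a cutoff does under single linkage, the sketch below clusters a small distance matrix with a union-find structure. It relies on the fact that cutting a single-linkage tree at height h groups items connected by a chain of distances no larger than h. `single_linkage` is a hypothetical helper, the `<=` comparison is an assumption, and the package's own method may be implemented differently.

```julia
# Single-linkage clustering at a cutoff, via union-find: two items share a
# cluster if a chain of pairwise distances <= cutoff connects them.
function single_linkage(dist::AbstractMatrix{<:Real}, cutoff::Real)
    n = size(dist, 1)
    parent = collect(1:n)
    # Find the root of i's set, shortening the path along the way.
    function root(i)
        while parent[i] != i
            parent[i] = parent[parent[i]]
            i = parent[i]
        end
        return i
    end
    for i in 1:n, j in i+1:n
        if dist[i, j] <= cutoff
            parent[root(i)] = root(j)   # merge the two sets
        end
    end
    # Relabel roots as consecutive cluster ids in order of first appearance.
    labels = Dict{Int,Int}()
    return [get!(labels, root(i), length(labels) + 1) for i in 1:n]
end

dist = Float32[0 1 5 6;
               1 0 4 6;
               5 4 0 1;
               6 6 1 0]
assignments = single_linkage(dist, 1.0f0)
```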


LineageCollapse.preprocess_data - Method
preprocess_data(df::DataFrame; min_d_region_length::Union{Int,Nothing}=nothing, deduplicate::Bool=false, use_barcode::Bool=false)::DataFrame

Preprocess the input DataFrame by performing data cleaning and transformation.

Arguments

  • df::DataFrame: Input DataFrame.
  • min_d_region_length::Union{Int,Nothing}=nothing: Minimum length of the D region to keep. If nothing, no filtering is applied.
  • deduplicate::Bool=false: Whether to deduplicate the DataFrame.
  • use_barcode::Bool=false: Whether to use barcode for deduplication (only applicable if deduplicate is true).

Returns

  • DataFrame: Preprocessed DataFrame.

LineageCollapse.process_lineages - Method
process_lineages(df::DataFrame; distance_metric, clustering_method, linkage) -> DataFrame

Process sequences into lineages with explicit metric and clustering configuration.

Keyword Arguments

  • distance_metric::AbstractDistanceMetric=HammingDistance(): Distance metric for CDR3 comparison
  • clustering_method::ClusteringMethod=HierarchicalClustering(1.0f0): Clustering method and cutoff
  • linkage::Symbol=:single: Hierarchical clustering linkage

See process_lineages(df, threshold) for return value documentation.


LineageCollapse.process_lineages - Method
process_lineages(df::DataFrame, threshold; linkage=:single) -> DataFrame

Process sequences into lineages using CDR3 clustering.

Arguments

  • df::DataFrame: Preprocessed data with columns v_call_first, j_call_first, cdr3, cdr3_length, d_region
  • threshold: Clustering threshold. Integer for absolute mismatches, Float (0.0-1.0) for fraction of CDR3 length.
  • linkage::Symbol=:single: Hierarchical clustering linkage (:single, :complete, :average)
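
The two threshold interpretations can be illustrated with dispatch on the threshold type. This is a sketch with plain strings: `hamming` and `within_threshold` are hypothetical helpers, and it compares a single pair of sequences rather than performing clustering.

```julia
# Number of mismatched positions between two equal-length sequences.
hamming(a::AbstractString, b::AbstractString) =
    count(((x, y),) -> x != y, zip(a, b))

# Integer threshold: absolute number of mismatches.
within_threshold(a, b, threshold::Integer) = hamming(a, b) <= threshold

# Float threshold: mismatches as a fraction of CDR3 length.
within_threshold(a, b, threshold::AbstractFloat) =
    hamming(a, b) / length(a) <= threshold

# Two 15-nt CDR3s differing at two positions.
a, b = "TGTGCGAGAGATTAC", "TGTGCGAGAGGTTTC"
```

For this pair, two mismatches out of 15 positions is about 13%, so the pair falls outside a threshold of 1 or 0.1 but inside a threshold of 2 or 0.2.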

Returns

DataFrame with added columns:

  • lineage_id::Int: Unique lineage identifier
  • cluster::Int: Cluster assignment within V/J/CDR3-length group
  • cluster_size::Int: Number of sequences in cluster
  • min_distance::Float32: Minimum distance to other CDR3s in cluster
  • cdr3_count::Int: Count of this CDR3 in cluster
  • max_cdr3_count::Int: Maximum CDR3 count in cluster
  • cdr3_frequency::Float64: cdr3_count / max_cdr3_count