LineageCollapse
Documentation for LineageCollapse, a Julia package for collapsing lineages in AIRR data.
Overview
LineageCollapse provides tools for processing and analyzing adaptive immune receptor repertoire (AIRR) data. It offers functions for data loading, preprocessing, lineage assignment, and lineage collapsing.
Architecture
Processing Pipeline
The library follows a linear pipeline where each stage transforms your data:
┌─────────────────────────────────────────────────────────────────────────────┐
│ LineageCollapse Pipeline │
└─────────────────────────────────────────────────────────────────────────────┘
AIRR TSV File
│
▼
┌───────────────┐
│ load_data() │ Load and validate AIRR-formatted data
└───────┬───────┘
│ DataFrame
▼
┌──────────────────────┐
│ preprocess_data() │ Filter, derive D-region, normalize columns
└──────────┬───────────┘
│ DataFrame + d_region, v_call_first, j_call_first
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ process_lineages(df, threshold) │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ For each (V-gene, J-gene, CDR3-length) group: │ │
│ │ 1. Compute pairwise CDR3 distances (AbstractDistanceMetric) │ │
│ │ 2. Cluster sequences (ClusteringMethod) │ │
│ │ 3. Assign lineage_id to each cluster │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└──────────┬─────────────────────────────────────────────────────────────────┘
│ DataFrame + lineage_id, cluster, cluster_size, cdr3_frequency
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ collapse_lineages(df, strategy) │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ CollapseStrategy determines output: │ │
│ │ • Hardest() → one representative per lineage │ │
│ │ • Soft(cutoff) → all clones above frequency threshold │ │
│ │ │ │
│ │ AbstractTieBreaker resolves ties when selecting representatives │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└──────────┬─────────────────────────────────────────────────────────────────┘
│
▼
Collapsed DataFrame (representatives or filtered clones)
Type Hierarchy
The library uses Julia's multiple dispatch with abstract types for extensibility:
AbstractDistanceMetric ClusteringMethod CollapseStrategy
│ │ │
├── HammingDistance └── HierarchicalClustering ├── Hardest
├── NormalizedHammingDistance │ └── Soft
└── LevenshteinDistance └── cutoff::Float
│
│
AbstractTieBreaker ◄───────────────────────────────────────────────────────┘
│ (used by Hardest)
├── TieBreaker
│ └── criteria::Vector{Pair{Symbol,Bool}}
│
└── MostCommonVdjNtTieBreaker
Extension Points
You can extend the library by defining new types and methods:
Custom Distance Metric:
struct MyDistance <: AbstractDistanceMetric end
# Implement the required method
LineageCollapse.compute_distance(::MyDistance, x::LongDNA{4}, y::LongDNA{4}) = ...
Custom Tie-Breaker:
struct MyTieBreaker <: AbstractTieBreaker end
# Or use the built-in TieBreaker with custom criteria
my_breaker = TieBreaker([:my_column => true, :cdr3 => false])
Key Concepts
| Concept | Description |
|---|---|
| Lineage | Group of sequences sharing V-gene, J-gene, CDR3 length, and similar CDR3 |
| Clone | Unique combination of D-region + lineage + V + J + CDR3 within a lineage |
| Clone Frequency | Proportion of sequences in a lineage belonging to a clone |
| Representative | Selected sequence to represent an entire lineage |
Built-in Tie-Breakers
| Function | Strategy |
|---|---|
| ByMostCommonVdjNt() | Most common VDJ nucleotide sequence (igdiscover-compatible) |
| ByVdjCount() | Highest VDJ count, then lexicographic CDR3 |
| ByCdr3Count() | Highest CDR3 count, then lexicographic CDR3 |
| BySequenceCount() | Highest sequence count, then lexicographic CDR3 |
| ByMostNaive() | Highest V/J identity (closest to germline) |
| ByLexicographic() | Lexicographically smallest CDR3 |
| ByFirst() | First candidate (no sorting) |
Tie-breakers can be combined: ByVdjCount() + ByLexicographic()
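The ordered-criteria idea behind tie-breakers such as ByVdjCount() (highest count first, then lexicographic CDR3) can be sketched in plain Julia without the package. The named tuples and the helper `pick_representative` are hypothetical and only illustrate the ordering; the package's own implementation may differ.

```julia
# Hypothetical candidate rows; field names mirror columns mentioned on this page.
candidates = [
    (cdr3 = "TGTGCGAGA", vdj_count = 3),
    (cdr3 = "TGTGCAAGA", vdj_count = 5),
    (cdr3 = "TGTGCAAGG", vdj_count = 5),
]

# Sort by vdj_count descending, then cdr3 ascending, and take the first row.
# Negating the count turns the ascending tuple comparison into "descending count".
pick_representative(rows) = first(sort(rows; by = r -> (-r.vdj_count, r.cdr3)))

best = pick_representative(candidates)
```

Two candidates share the highest count here, so the second criterion (lexicographic CDR3) decides between them.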
Installation
using Pkg
Pkg.add("LineageCollapse")
Quick Start
using LineageCollapse
# Load and preprocess
df = load_data("airr_data.tsv.gz")
df = preprocess_data(df; min_d_region_length=3)
# Assign lineages (threshold: 1 mismatch or 0.1 = 10% of CDR3 length)
lineages = process_lineages(df, 1)
# Collapse options:
# Option A: One representative per lineage
result = collapse_lineages(lineages, Hardest())
# Option B: Keep clones with frequency ≥ 20%
result = collapse_lineages(lineages, Soft(0.2))
# Option C: Custom tie-breaking
result = collapse_lineages(lineages, Hardest();
tie_breaker=ByMostNaive(),
tie_atol=0.01) # 1% tolerance
Detailed Examples
Using Different Distance Metrics
# Explicit metric configuration
lineages = process_lineages(df;
distance_metric = NormalizedHammingDistance(),
clustering_method = HierarchicalClustering(0.1f0),
linkage = :average
)
Diagnostic: Finding Ties
# Identify lineages where multiple clones have the same max frequency
ties = hardest_tie_summary(lineages; atol=0.01) # input needs lineage assignments from process_lineages
filter(:hardest_tied => identity, ties) # Show only tied lineages
Column reference
This section describes the meaning and computation of the columns added by each stage. The most important is count, which is both an optional input and the main abundance output.
Count (most important)
Input (optional): An optional numeric column count (e.g. read or UMI count per sequence). If present, it is the abundance of that row; if absent, each row is treated as abundance 1. Missing values are treated as 0 when summing.
Output (Hardest only): The collapsed result includes a count column: for each lineage it is the sum of the input count over all sequences in that lineage (total lineage abundance). If the input had no count column, this equals the number of sequences in the lineage.
How it is used: Tie-breakers such as ByMostCommonVdjNt() and ByVdjCount() use these abundances to select the "most common" VDJ nucleotide sequence or clone (by highest total count). Clone frequency and the Soft strategy use the number of sequences (rows) per clone, not the sum of count.
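The abundance rules above (missing values count as 0, and an absent count column means 1 per row) can be illustrated with a small sketch in plain Julia. The helper names `lineage_count` and `lineage_count_default` are hypothetical and not part of the package.

```julia
# Hypothetical per-sequence abundances for one lineage; `missing` marks an absent value.
counts = [10, missing, 4]

# Missing values contribute 0 to the lineage total.
lineage_count(counts) = sum(c -> coalesce(c, 0), counts)

# Without a count column, every row has abundance 1, so the total is the row count.
lineage_count_default(nrows::Integer) = nrows

total = lineage_count(counts)
```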
After preprocess_data
| Column | Meaning | Computation |
|---|---|---|
| d_region | D-region nucleotide sequence | sequence[v_sequence_end+1:j_sequence_start] |
| v_call_first | First V-gene allele | First token of v_call (before comma) |
| j_call_first | First J-gene allele | First token of j_call (before comma) |
| vdj_nt | V–D–J nucleotide sequence | sequence[v_sequence_start:j_sequence_end] (only if those columns exist) |
| cdr3_length | CDR3 length | length(cdr3) |
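As a rough illustration of the computations in the table above, the following plain-Julia sketch derives the same values for a single made-up record. The record values are invented for the example, and the slicing simply mirrors the expressions in the table rather than the package's internal code.

```julia
# Hypothetical AIRR-style record; coordinates are 1-based, matching Julia indexing.
sequence = "AAACCCGGGTTT"
v_sequence_end = 3
j_sequence_start = 9
v_call = "IGHV1-2*01,IGHV1-2*02"

# Slice exactly as written in the table above.
d_region = sequence[v_sequence_end+1:j_sequence_start]

# First allele: the token before the first comma.
v_call_first = String(first(split(v_call, ',')))

cdr3 = "TGTGCGAGA"
cdr3_length = length(cdr3)
```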
After process_lineages
| Column | Meaning | Computation |
|---|---|---|
| lineage_id | Lineage identifier | Unique id per (V, J, CDR3 length, cluster) |
| cluster | Cluster within V/J/CDR3-length group | From hierarchical clustering of CDR3 distances |
| cluster_size | Sequences in this cluster | Number of rows in the same cluster |
| min_distance | Min distance to another CDR3 in cluster | Smallest pairwise distance to another sequence in the cluster |
| cdr3_count | How many sequences share this CDR3 in the cluster | Number of rows with same CDR3 in same cluster |
| max_cdr3_count | Max cdr3_count in this cluster | Maximum of cdr3_count over the cluster |
| cdr3_frequency | Relative frequency of this CDR3 in cluster | cdr3_count / max_cdr3_count (0–1) |
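The count-based columns in the table above follow from simple tallies. A minimal sketch in plain Julia, using an invented cluster of CDR3 strings (the variable names match the column names, but the code is illustrative and not taken from the package):

```julia
# Hypothetical CDR3 strings for the rows of a single cluster.
cluster_cdr3s = ["TGTGCG", "TGTGCG", "TGTGCG", "TGTGCA", "TGTACA"]

# cdr3_count: number of rows sharing each CDR3 within the cluster.
counts = Dict{String,Int}()
for c in cluster_cdr3s
    counts[c] = get(counts, c, 0) + 1
end

max_cdr3_count = maximum(values(counts))

# cdr3_frequency per row: cdr3_count / max_cdr3_count,
# so the most common CDR3 in the cluster scores 1.0.
cdr3_frequency = [counts[c] / max_cdr3_count for c in cluster_cdr3s]
cluster_size = length(cluster_cdr3s)
```

Because the denominator is the largest count in the cluster rather than the cluster size, the frequencies do not sum to 1.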
After collapse_lineages with Hardest
| Column | Meaning | Computation |
|---|---|---|
| count | Total lineage abundance | Sum of input count (or 1 per row if no input count) over all sequences in the lineage; see "Count (most important)" above |
| nVDJ_nt | Number of unique VDJ nucleotide sequences in lineage | Count of distinct vdj_nt in the lineage (or missing if no vdj_nt) |
After collapse_lineages with Soft
| Column | Meaning | Computation |
|---|---|---|
| clone_frequency | Fraction of lineage (by row count) in this clone | Number of sequences (rows) in this clone ÷ total sequences in the lineage |
| sequence_count | Number of sequences in this clone | Number of rows with same (d_region, lineage_id, v_call_first, j_call_first, cdr3) |
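To make the row-count-based clone frequency concrete, here is a small plain-Julia sketch of a Soft-style filter over one lineage. The clone keys are placeholders for the full (d_region, lineage_id, v_call_first, j_call_first, cdr3) tuple, and the code is an illustration under that assumption rather than the package's implementation.

```julia
# Hypothetical clone keys for the rows of one lineage; each key stands in for the
# (d_region, lineage_id, v_call_first, j_call_first, cdr3) combination.
clone_keys = ["A", "A", "A", "B", "C", "A", "B", "A", "A", "A"]

# sequence_count: rows per clone.
sequence_count = Dict{String,Int}()
for k in clone_keys
    sequence_count[k] = get(sequence_count, k, 0) + 1
end

# clone_frequency: rows in the clone divided by rows in the lineage.
n = length(clone_keys)
clone_frequency = Dict(k => c / n for (k, c) in sequence_count)

# A Soft-style filter keeps clones whose frequency is at least the cutoff.
cutoff = 0.2
kept = sort([k for (k, f) in clone_frequency if f >= cutoff])
```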
For detailed function signatures and options, see the API Reference below.
LineageCollapse.AbstractDistanceMetric — Type
Abstract type for all distance metrics used in sequence comparison.
LineageCollapse.AbstractTieBreaker — Type
Abstract type for tie-breaking strategies used when selecting lineage representatives.
LineageCollapse.ClusteringMethod — Type
Abstract type for clustering methods.
LineageCollapse.CollapseStrategy — Type
Abstract type for lineage collapse strategies.
LineageCollapse.HammingDistance — Type
HammingDistance <: AbstractDistanceMetric
Hamming distance metric: counts the number of mismatches between sequences.
LineageCollapse.Hardest — Type
Hardest <: CollapseStrategy
Collapse strategy that selects exactly one representative per lineage.
LineageCollapse.HierarchicalClustering — Type
HierarchicalClustering{T} <: ClusteringMethod
Hierarchical clustering with a distance cutoff.
Fields
cutoff::T: Distance threshold for cluster merging
Example
HierarchicalClustering(1.0f0) # Merge clusters within distance 1.0
LineageCollapse.LevenshteinDistance — Type
LevenshteinDistance <: AbstractDistanceMetric
Levenshtein (edit) distance metric.
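For readers unfamiliar with edit distance, the following is a generic two-row dynamic-programming sketch on plain strings. The function name `levenshtein` is hypothetical; the package itself operates on LongDNA{4} sequences and may compute the distance differently.

```julia
# Edit distance between two strings using the classic two-row dynamic program.
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))          # distances from the empty prefix of `a`
    curr = similar(prev)
    for i in eachindex(s)
        curr[1] = i
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j+1] = min(prev[j+1] + 1,   # deletion
                            curr[j] + 1,     # insertion
                            prev[j] + cost)  # substitution or match
        end
        prev, curr = curr, prev
    end
    return prev[end]
end
```

Unlike Hamming distance, this handles sequences of different lengths, since insertions and deletions are allowed.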
LineageCollapse.MostCommonVdjNtTieBreaker — Type
MostCommonVdjNtTieBreaker
A tie-breaker that selects the representative based on the most common VDJ_nt sequence weighted by count, matching igdiscover's clonotypes behavior.
For each lineage, it:
- Sums the count for each unique vdj_nt sequence
- Selects the vdj_nt with the highest total count
- Returns the first row with that vdj_nt
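The three steps above can be sketched in plain Julia as follows. The rows and the helper `most_common_vdj_row` are hypothetical, and how the package resolves two vdj_nt values with equal totals is not stated here; this sketch simply takes the earliest matching row.

```julia
# Hypothetical rows of one lineage.
rows = [
    (sequence_id = "s1", vdj_nt = "ACGT", count = 2),
    (sequence_id = "s2", vdj_nt = "ACGA", count = 3),
    (sequence_id = "s3", vdj_nt = "ACGT", count = 2),
]

function most_common_vdj_row(rows)
    # 1. Sum count for each unique vdj_nt.
    totals = Dict{String,Int}()
    for r in rows
        totals[r.vdj_nt] = get(totals, r.vdj_nt, 0) + r.count
    end
    # 2. Find the highest total count.
    best_total = maximum(values(totals))
    # 3. Return the first row whose vdj_nt reaches that total
    #    (earliest row wins on equal totals; that rule is an assumption).
    return first(r for r in rows if totals[r.vdj_nt] == best_total)
end

representative = most_common_vdj_row(rows)
```

Note that "s2" has the largest single count, but "ACGT" wins on summed count (2 + 2 = 4), so the representative is the first "ACGT" row.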
LineageCollapse.NormalizedHammingDistance — Type
NormalizedHammingDistance <: AbstractDistanceMetric
Normalized Hamming distance: mismatches divided by sequence length.
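As a quick illustration of the two Hamming variants, here is a sketch on plain strings. The function names `hamming` and `normalized_hamming` are hypothetical stand-ins; the package dispatches on metric types and LongDNA{4} sequences instead.

```julia
# Number of positions at which two equal-length sequences differ.
function hamming(a::AbstractString, b::AbstractString)
    length(a) == length(b) || throw(ArgumentError("sequences must have equal length"))
    return count(((x, y),) -> x != y, zip(a, b))
end

# Mismatches divided by sequence length.
normalized_hamming(a, b) = hamming(a, b) / length(a)
```

With the normalized form, a cutoff such as 0.1 means "at most 10% of positions differ", which scales with CDR3 length.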
LineageCollapse.Soft — Type
Soft{T} <: CollapseStrategy
Collapse strategy that keeps all clones at or above a frequency threshold.
Fields
cutoff::T: Minimum clone frequency to retain (0.0 to 1.0)
Example
Soft(0.2) # Keep clones with frequency ≥ 20%
LineageCollapse.TieBreaker — Type
TieBreaker <: AbstractTieBreaker
A configurable tie-breaker that sorts candidates by specified column criteria.
Fields
criteria::Vector{Pair{Symbol,Bool}}: Sorting criteria as column => descending pairs. Columns are checked in order; true means sort descending (higher is better).
Example
# Sort by count descending, then by cdr3 ascending (lexicographic)
TieBreaker([:count => true, :cdr3 => false])
LineageCollapse.ByCdr3Count — Method
ByCdr3Count()
Tie-breaker that selects by highest CDR3 count, then lexicographic CDR3.
LineageCollapse.ByFirst — Method
ByFirst()
Tie-breaker that selects the first candidate (no sorting).
LineageCollapse.ByLexicographic — Method
ByLexicographic()
Tie-breaker that selects by lexicographically smallest CDR3.
LineageCollapse.ByMostCommonVdjNt — Method
ByMostCommonVdjNt()
Create a tie-breaker that matches igdiscover's clonotypes representative selection.
Selects the representative by finding the vdj_nt sequence with the highest total count across all members of the lineage, then returns the first row with that vdj_nt.
Requires columns: vdj_nt, count
LineageCollapse.ByMostNaive — Method
ByMostNaive()
Tie-breaker prioritizing sequences closest to germline (highest V/J identity), then by VDJ count, CDR3 count, and lexicographic CDR3.
LineageCollapse.BySequenceCount — Method
BySequenceCount()
Tie-breaker that selects by highest sequence count, then lexicographic CDR3.
LineageCollapse.ByVdjCount — Method
ByVdjCount()
Tie-breaker that selects by highest VDJ nucleotide count, then lexicographic CDR3.
LineageCollapse.collapse_lineages — Function
collapse_lineages(df::DataFrame, strategy=Hardest(); tie_breaker, tie_atol) -> DataFrame
Collapse lineages to representative sequences.
Arguments
- df::DataFrame: Data from process_lineages with lineage_id column
- strategy::CollapseStrategy=Hardest(): Collapse strategy
  - Hardest(): One representative per lineage (highest clone frequency)
  - Soft(cutoff): Keep all clones with frequency ≥ cutoff
Keyword Arguments
- tie_breaker::AbstractTieBreaker=ByMostCommonVdjNt(): Strategy for breaking ties
- tie_atol::Real=0.0: Tolerance for frequency comparison
Returns
For Hardest():
- Selected rows with added count::Int (sum of counts) and nVDJ_nt::Int (unique VDJ sequences)
For Soft(cutoff):
- Rows meeting threshold with added clone_frequency::Float64 and sequence_count::Int
LineageCollapse.compute_distance — Function
compute_distance(metric::AbstractDistanceMetric, x::LongDNA{4}, y::LongDNA{4}) -> Float32
Compute distance between two DNA sequences using the specified metric.
LineageCollapse.compute_pairwise_distance — Method
compute_pairwise_distance(metric, sequences) -> Matrix{Float32}
Compute pairwise distance matrix for DNA sequences.
LineageCollapse.deduplicate_data — Function
deduplicate_data(df::DataFrame, use_barcode::Bool=false)::DataFrame
Deduplicate the input DataFrame based on sequence or sequence+barcode.
Arguments
- df::DataFrame: Input DataFrame.
- use_barcode::Bool=false: Whether to use barcode for deduplication.
Returns
DataFrame: Deduplicated DataFrame.
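The difference between the two deduplication keys can be seen in a small plain-Julia sketch. The rows, the `barcode` field name, and the keyword-style helper `dedup` are all hypothetical; which row the package keeps for each duplicate group is not documented here, and this sketch keeps the first.

```julia
# Hypothetical rows; `barcode` stands in for whatever barcode column the data carries.
rows = [
    (sequence = "ACGT", barcode = "b1"),
    (sequence = "ACGT", barcode = "b2"),
    (sequence = "ACGT", barcode = "b1"),
    (sequence = "TTGA", barcode = "b1"),
]

# Keep the first row for each key: the sequence alone, or the (sequence, barcode) pair.
dedup(rows; use_barcode::Bool = false) =
    use_barcode ? unique(r -> (r.sequence, r.barcode), rows) :
                  unique(r -> r.sequence, rows)

by_sequence = dedup(rows)
by_sequence_and_barcode = dedup(rows; use_barcode = true)
```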
LineageCollapse.hardest_tie_summary — Method
hardest_tie_summary(df::DataFrame; atol=0.0) -> DataFrame
Diagnostic function to identify lineages with tied maximum clone frequencies.
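One plausible reading of "tied within a tolerance" is that more than one clone lies within atol of the maximum frequency. The sketch below illustrates that reading in plain Julia; the helper `is_tied` is hypothetical and the package's exact comparison may differ.

```julia
# Hypothetical clone frequencies within one lineage.
freqs = [0.40, 0.395, 0.205]

# A lineage counts as tied when more than one clone lies within `atol`
# of the maximum frequency (assumed interpretation of the tolerance).
function is_tied(freqs; atol = 0.0)
    fmax = maximum(freqs)
    return count(f -> fmax - f <= atol, freqs) > 1
end

strict = is_tied(freqs)                 # exact comparison
tolerant = is_tied(freqs; atol = 0.01)  # 1% tolerance
```

With a zero tolerance only exact ties are reported; raising atol also flags near-ties such as 0.40 versus 0.395.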
LineageCollapse.load_data — Method
load_data(filepath::String;
delimiter::Char='\t',
required_columns=[:sequence_id, :sequence, :v_sequence_end, :j_sequence_start, :cdr3, :v_call, :j_call, :stop_codon])::DataFrame
Load data from a file (compressed or uncompressed) and return a DataFrame.
Arguments
- filepath::String: Path to the data file.
- delimiter::Char='\t': Delimiter used in the data file (default: tab).
- required_columns::Vector{Symbol}: Required columns to select from the data file.
Returns
DataFrame: DataFrame containing the loaded data.
Throws
ArgumentError: If any of the required columns are missing in the data file.
LineageCollapse.perform_clustering — Method
perform_clustering(method::HierarchicalClustering, linkage, dist_matrix) -> Vector{Int}
Perform hierarchical clustering and return cluster assignments.
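To show how a distance matrix and a cutoff turn into cluster labels, here is a self-contained sketch of single-linkage clustering in plain Julia. It relies on the fact that cutting a single-linkage tree at a given height yields the connected components of the graph that joins every pair within that distance. The helpers `pairwise` and `single_linkage` are hypothetical and do not reflect how the package performs clustering internally.

```julia
# Pairwise Hamming distance matrix for equal-length sequences.
hamming(a, b) = count(((x, y),) -> x != y, zip(a, b))
pairwise(seqs) = Float32[hamming(a, b) for a in seqs, b in seqs]

# Single-linkage clustering cut at `cutoff` equals the connected components of the
# graph joining every pair with distance <= cutoff; union-find tracks the components.
function single_linkage(dist::AbstractMatrix, cutoff)
    n = size(dist, 1)
    parent = collect(1:n)
    find(i) = parent[i] == i ? i : (parent[i] = find(parent[i]))
    for i in 1:n, j in i+1:n
        if dist[i, j] <= cutoff
            parent[find(j)] = find(i)
        end
    end
    roots = [find(i) for i in 1:n]
    # Relabel roots as consecutive integers in order of first appearance.
    labels = Dict(r => k for (k, r) in enumerate(unique(roots)))
    return [labels[r] for r in roots]
end

cdr3s = ["TGTGCG", "TGTGCA", "TGTACA", "AAAAAA"]
clusters = single_linkage(pairwise(cdr3s), 1)
```

The first and third sequences differ at two positions, yet they land in the same cluster because each is within one mismatch of the second. This chaining is characteristic of single linkage; :complete or :average linkage would behave differently.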
LineageCollapse.preprocess_data — Method
preprocess_data(df::DataFrame; min_d_region_length::Union{Int,Nothing}=nothing, deduplicate::Bool=false, use_barcode::Bool=false)::DataFrame
Preprocess the input DataFrame by performing data cleaning and transformation.
Arguments
- df::DataFrame: Input DataFrame.
- min_d_region_length::Union{Int,Nothing}=nothing: Minimum length of the D region to keep. If nothing, no filtering is applied.
- deduplicate::Bool=false: Whether to deduplicate the DataFrame.
- use_barcode::Bool=false: Whether to use barcode for deduplication (only applicable if deduplicate is true).
Returns
DataFrame: Preprocessed DataFrame.
LineageCollapse.process_lineages — Method
process_lineages(df::DataFrame; distance_metric, clustering_method, linkage) -> DataFrame
Process sequences into lineages with explicit metric and clustering configuration.
Keyword Arguments
- distance_metric::AbstractDistanceMetric=HammingDistance(): Distance metric for CDR3 comparison
- clustering_method::ClusteringMethod=HierarchicalClustering(1.0f0): Clustering method and cutoff
- linkage::Symbol=:single: Hierarchical clustering linkage
See process_lineages(df, threshold) for return value documentation.
LineageCollapse.process_lineages — Method
process_lineages(df::DataFrame, threshold; linkage=:single) -> DataFrame
Process sequences into lineages using CDR3 clustering.
Arguments
- df::DataFrame: Preprocessed data with columns v_call_first, j_call_first, cdr3, cdr3_length, d_region
- threshold: Clustering threshold. Integer for absolute mismatches, Float (0.0-1.0) for fraction of CDR3 length.
- linkage::Symbol=:single: Hierarchical clustering linkage (:single, :complete, :average)
Returns
DataFrame with added columns:
- lineage_id::Int: Unique lineage identifier
- cluster::Int: Cluster assignment within V/J group
- cluster_size::Int: Number of sequences in cluster
- min_distance::Float32: Minimum distance to other CDR3s in cluster
- cdr3_count::Int: Count of this CDR3 in cluster
- max_cdr3_count::Int: Maximum CDR3 count in cluster
- cdr3_frequency::Float64: cdr3_count / max_cdr3_count
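The two threshold interpretations (integer for absolute mismatches, float for a fraction of CDR3 length) lend themselves to multiple dispatch. The sketch below shows the idea with a hypothetical helper `within_threshold`; it is an illustration of the semantics described above, not the package's internal mechanism.

```julia
# Multiple dispatch on the threshold type: integers are absolute mismatch counts,
# floats are fractions of the CDR3 length.
within_threshold(mismatches::Integer, cdr3_length::Integer, t::Integer) =
    mismatches <= t
within_threshold(mismatches::Integer, cdr3_length::Integer, t::AbstractFloat) =
    mismatches / cdr3_length <= t

# One mismatch in a 12-nt CDR3 passes both an absolute threshold of 1
# and a relative threshold of 0.1; two mismatches pass neither.
one_abs = within_threshold(1, 12, 1)
two_abs = within_threshold(2, 12, 1)
one_rel = within_threshold(1, 12, 0.1)
two_rel = within_threshold(2, 12, 0.1)
```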