SequenceTokenizers

Documentation for SequenceTokenizers.

SequenceTokenizers.SequenceTokenizers - Module
SequenceTokenizers

A module for tokenizing sequences of symbols into numerical indices and vice versa. This module provides functionality for creating tokenizers, encoding sequences, and working with one-hot representations of tokenized data.

Exports

  • SequenceTokenizer: A struct for tokenizing sequences
  • onehot_batch: Convert tokenized sequences to one-hot representations
  • onecold_batch: Convert one-hot representations back to tokenized sequences

Example

using SequenceTokenizers

# Create a tokenizer for DNA sequences
dna_alphabet = ['A', 'C', 'G', 'T']
tokenizer = SequenceTokenizer(dna_alphabet, 'N')

# Tokenize a sequence
seq = "ACGTACGT"
tokenized = tokenizer(seq)

# Convert to one-hot representation
onehot = onehot_batch(tokenizer, tokenized)

# Convert back to tokens
recovered = onecold_batch(tokenizer, onehot)

SequenceTokenizers.SequenceTokenizer - Type
SequenceTokenizer{T, V <: AbstractVector{T}}

A struct for tokenizing sequences of symbols into numerical indices.

Fields

  • alphabet::V: The set of valid symbols in the sequences
  • lookup::Vector{UInt32}: A lookup table for fast symbol-to-index conversion
  • unksym::T: The symbol to use for unknown tokens
  • unkidx::UInt32: The index assigned to the unknown symbol

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
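
The fields above suggest a table-driven design. As a minimal sketch in plain Julia, assuming the unknown symbol occupies index 1 and the alphabet symbols follow in order (which matches the outputs documented on this page), a lookup table keyed by code point could look like the following. The names `build_lookup`, `encode`, and `decode` are hypothetical and used only for illustration; they are not part of SequenceTokenizers, and the package's actual implementation may differ.

```julia
# Hypothetical sketch, not the package's implementation.
# Assumes `unksym` is not already contained in `alphabet`.
function build_lookup(alphabet::Vector{Char}, unksym::Char)
    symbols = pushfirst!(copy(alphabet), unksym)      # unknown symbol takes index 1
    lookup = fill(UInt32(1), maximum(Int, symbols))   # every code point defaults to "unknown"
    for (i, s) in enumerate(symbols)
        lookup[Int(s)] = UInt32(i)
    end
    return symbols, lookup
end

# Symbol -> index; code points beyond the table also map to the unknown index.
function encode(lookup::Vector{UInt32}, c::Char)
    code = Int(c)
    return 1 <= code <= length(lookup) ? lookup[code] : UInt32(1)
end

# Index -> symbol is a plain array lookup.
decode(symbols::Vector{Char}, idx::Integer) = symbols[idx]

symbols, lookup = build_lookup(['a', 'b', 'c'], 'x')
println(encode(lookup, 'a'))   # 2
println(encode(lookup, 'z'))   # 1 (unknown)
println(decode(symbols, 2))    # a
```

A dense table trades memory for speed: the table is as long as the largest code point in the alphabet, but each lookup is a single array access.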

SequenceTokenizers.SequenceTokenizer - Method
(tokenizer::SequenceTokenizer)(idx::Integer)

Convert an index back to its corresponding token.

Arguments

  • idx::Integer: An index to be converted back to a token

Returns

The token corresponding to the given index in the tokenizer's alphabet

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
println(tokenizer(2))  # Output: a
println(tokenizer(1))  # Output: x (unknown token)

SequenceTokenizers.SequenceTokenizer - Method
(tokenizer::SequenceTokenizer{T})(token::T) where T

Convert a single token to its corresponding index.

Arguments

  • token::T: A single token to be converted to an index

Returns

The index of the token in the tokenizer's alphabet, or the unknown token index if not found

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
println(tokenizer('a'))  # Output: 2
println(tokenizer('x'))  # Output: 1
println(tokenizer('z'))  # Output: 1 (unknown token)

SequenceTokenizers.SequenceTokenizer - Method
(tokenizer::SequenceTokenizer{T})(x::AbstractArray) where T

Tokenize an array of symbols.

Arguments

  • x::AbstractArray: An array of symbols to be tokenized

Returns

An array of indices corresponding to the input symbols

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
println(tokenizer(['a', 'b', 'z', 'c']))  # Output: [2, 3, 1, 4]

SequenceTokenizers.SequenceTokenizer - Method
(tokenizer::SequenceTokenizer{T})(input::AbstractString) where T

Tokenize a string input using the SequenceTokenizer.

This method converts the input string to a vector of tokens of type T, then applies the tokenizer to each element.

Arguments

  • tokenizer::SequenceTokenizer{T}: The tokenizer to use
  • input::AbstractString: The input string to be tokenized

Returns

A Vector{UInt32} of token indices corresponding to the characters in the input string

Performance Notes

  • This method uses collect(T, input) to convert the string to a vector of type T
  • It's marked as @inline for potential performance benefits in certain contexts

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
result = tokenizer("abcx")
println(result)  # Output: [2, 3, 4, 1]
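
To illustrate the `collect(T, input)` step mentioned in the performance notes: in Base Julia, `collect(Char, s)` turns a string into a `Vector{Char}`, after which each character can be mapped to an index. The `Dict`-based mapping below is a stand-in chosen for brevity (an assumption, not the package's lookup mechanism), and `tokenize_string` is a hypothetical name.

```julia
# Stand-in index table: unknown symbol 'x' at index 1, then 'a', 'b', 'c'.
index_of = Dict('x' => UInt32(1), 'a' => UInt32(2), 'b' => UInt32(3), 'c' => UInt32(4))

# Hypothetical helper: collect the string into characters, then map each one.
# Characters missing from the table fall back to the unknown index 1.
function tokenize_string(table::Dict{Char,UInt32}, input::AbstractString)
    chars = collect(Char, input)                      # Vector{Char}
    return UInt32[get(table, c, UInt32(1)) for c in chars]
end

println(collect(Char, "abcx"))                        # ['a', 'b', 'c', 'x']
println(Int.(tokenize_string(index_of, "abcx")))      # [2, 3, 4, 1]
println(Int.(tokenize_string(index_of, "abz")))       # [2, 3, 1]
```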

SequenceTokenizers.SequenceTokenizer - Method
(tokenizer::SequenceTokenizer{T})(batch::AbstractVector{<:AbstractString}) where T

Tokenize a batch of string sequences, padding shorter sequences with the unknown token.

Arguments

  • tokenizer::SequenceTokenizer{T}: The tokenizer to use
  • batch::AbstractVector{<:AbstractString}: A vector of string sequences to be tokenized

Returns

A matrix of indices, where each column represents a tokenized and padded sequence

Example

tokenizer = SequenceTokenizer(['A','T','G','C'], 'N')
sequences = ["ATG", "ATGCGC"]
result = tokenizer(sequences)
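
Assuming indices are assigned as in the other examples on this page (unknown symbol at index 1, then alphabet order, so A=2, T=3, G=4, C=5), `result` would be a 6×2 matrix whose first column ends in three padding entries. Because padding reuses the unknown index, the original sequence lengths are not stored in the matrix. The sketch below uses a hypothetical helper, `unpad` (not part of the package), to strip trailing padding from such a matrix; it would also strip genuine unknown symbols at the end of a sequence.

```julia
# Hypothetical helper: remove trailing padding from each column of a padded batch.
function unpad(batch::AbstractMatrix{<:Integer}, padidx::Integer = 1)
    # findlast returns `nothing` when a column is all padding; treat that as length 0.
    return [col[1:something(findlast(!=(padidx), col), 0)] for col in eachcol(batch)]
end

# Padded batch for ["ATG", "ATGCGC"] under the index assignment assumed above.
padded = UInt32[2 2; 3 3; 4 4; 1 5; 1 4; 1 5]

for seq in unpad(padded)
    println(Int.(seq))
end
# [2, 3, 4]
# [2, 3, 4, 5, 4, 5]
```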

SequenceTokenizers.SequenceTokenizer - Method
(tokenizer::SequenceTokenizer{T})(batch::AbstractVector{<:AbstractVector{T}}) where T

Tokenize a batch of sequences, padding shorter sequences with the unknown token.

Arguments

  • batch::AbstractVector{<:AbstractVector{T}}: A vector of sequences to be tokenized

Returns

A matrix of indices, where each column represents a tokenized and padded sequence

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
sequences = [['a', 'b'], ['c', 'a', 'b']]
println(tokenizer(sequences))
# Output:
# [2 4
#  3 2
#  1 3]
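
The padding step can be sketched in plain Julia as follows. This assumes already-tokenized sequences and a padding index of 1 (the unknown index in the example above); `pad_batch` is a hypothetical name and not the package's implementation.

```julia
# Hypothetical helper: stack variable-length index vectors into a matrix,
# one sequence per column, filling the tail of shorter sequences with `padidx`.
function pad_batch(seqs::Vector{Vector{UInt32}}, padidx::UInt32 = UInt32(1))
    maxlen = maximum(length, seqs)
    out = fill(padidx, maxlen, length(seqs))
    for (j, s) in enumerate(seqs)
        out[1:length(s), j] .= s
    end
    return out
end

# [['a', 'b'], ['c', 'a', 'b']] tokenized with x=1, a=2, b=3, c=4
batch = pad_batch([UInt32[2, 3], UInt32[4, 2, 3]])
println(Int.(batch))   # [2 4; 3 2; 1 3]
```

Storing one sequence per column keeps each sequence contiguous in memory, since Julia arrays are column-major.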

Base.length - Method
Base.length(tokenizer::AbstractSequenceTokenizer)

Get the number of unique tokens in the tokenizer's alphabet, including the unknown symbol.

Returns

The length of the tokenizer's alphabet

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
println(length(tokenizer))  # Output: 4

Base.show - Method
Base.show(io::IO, tokenizer::SequenceTokenizer{T}) where T

Custom display method for SequenceTokenizer instances.

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
println(tokenizer)  # Output: SequenceTokenizer{Char}(length(alphabet)=4, unksym=x)

SequenceTokenizers.onecold_batch - Method
onecold_batch(tokenizer::AbstractSequenceTokenizer, onehot_batch::OneHotArray)

Convert a one-hot representation back to tokenized sequences.

Arguments

  • tokenizer::AbstractSequenceTokenizer: The tokenizer used for the sequences
  • onehot_batch::OneHotArray: A OneHotArray representing the one-hot encoding of sequences

Returns

A matrix of tokens (symbols from the tokenizer's alphabet), one column per sequence; padded positions hold the unknown symbol

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
sequences = [['a', 'b'], ['c', 'a', 'b']]
tokenized = tokenizer(sequences)
onehot = onehot_batch(tokenizer, tokenized)
recovered = onecold_batch(tokenizer, onehot)
# The recovered result is batched, so it remains padded
println(recovered == ['a' 'c'; 'b' 'a'; 'x' 'b']) # Output: true

SequenceTokenizers.onehot_batch - Method
onehot_batch(tokenizer::SequenceTokenizer, batch::AbstractMatrix{UInt32})

Convert a batch of tokenized sequences to one-hot representations.

Arguments

  • tokenizer::SequenceTokenizer: The tokenizer used for the sequences
  • batch::AbstractMatrix{UInt32}: A matrix of tokenized sequences

Returns

A OneHotArray representing the one-hot encoding of the input batch

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
sequences = [['a', 'b'], ['c', 'a', 'b']]
tokenized = tokenizer(sequences)
onehot = onehot_batch(tokenizer, tokenized)
println(size(onehot))  # Output: (4, 3, 2)
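
The shape in the example follows from adding a leading dimension of size length(tokenizer) to the (sequence length, batch size) matrix. A minimal sketch in plain Julia, using a Bool array as a stand-in for a OneHotArray, shows the encoding and its inverse. `onehot_sketch` and `onecold_sketch` are hypothetical names, and the symbol order (unknown symbol first) is an assumption taken from the examples on this page.

```julia
# Hypothetical sketch: set exactly one `true` per position along the first dimension.
function onehot_sketch(indices::AbstractMatrix{<:Integer}, nclasses::Integer)
    out = falses(nclasses, size(indices)...)
    for I in CartesianIndices(indices)
        out[indices[I], I] = true
    end
    return out
end

# Inverse: find the hot row for each position and look up its symbol.
function onecold_sketch(onehot::AbstractArray{Bool}, symbols::AbstractVector)
    hot = dropdims(argmax(onehot; dims=1); dims=1)   # one CartesianIndex per position
    return map(I -> symbols[I[1]], hot)
end

tokens = UInt32[2 4; 3 2; 1 3]          # padded batch from the earlier example
oh = onehot_sketch(tokens, 4)
println(size(oh))                        # (4, 3, 2)
println(Int.(oh[:, 1, 1]))               # [0, 1, 0, 0]

decoded = onecold_sketch(oh, ['x', 'a', 'b', 'c'])
println(decoded == ['a' 'c'; 'b' 'a'; 'x' 'b'])   # true
```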

SequenceTokenizers.onehot_batch - Method
onehot_batch(tokenizer::AbstractSequenceTokenizer, batch::AbstractVector{UInt32})

Convert a vector of token indices (a single tokenized sequence) to a one-hot representation.

This function takes a vector of token indices and converts it into a one-hot encoded representation using the alphabet of the provided tokenizer.

Arguments

  • tokenizer::AbstractSequenceTokenizer: The tokenizer used for the sequences. Its length determines the size of the one-hot encoding dimension.
  • batch::AbstractVector{UInt32}: A vector of token indices to be converted to one-hot representation.

Returns

  • OneHotArray: A one-hot encoded representation of the input batch. The resulting array will have dimensions (length(tokenizer), length(batch)).

Example

alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
tokenized_sequence = UInt32[2, 3, 1, 4]  # Corresponds to ['a', 'b', 'x', 'c']
onehot = onehot_batch(tokenizer, tokenized_sequence)
println(size(onehot))  # Output: (4, 4)
println(onehot[:, 1])  # Output: [0, 1, 0, 0]

Note

This function assumes that all indices in the input batch are valid for the tokenizer's alphabet. Indices outside the valid range may result in errors or unexpected behavior.
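
One way to guard against the out-of-range indices mentioned in the note is to validate the batch before encoding it. The sketch below is plain Julia; `check_indices` is a hypothetical helper (not part of the package), and `n` stands in for length(tokenizer).

```julia
# Hypothetical guard: every index must lie in 1:n.
function check_indices(batch::AbstractArray{<:Integer}, n::Integer)
    bad = findfirst(i -> !(1 <= i <= n), batch)
    bad === nothing ||
        throw(ArgumentError("index $(batch[bad]) at position $bad is outside 1:$n"))
    return batch
end

check_indices(UInt32[2, 3, 1, 4], 4)     # passes and returns the input unchanged

try
    check_indices(UInt32[2, 5], 4)
catch err
    println(err.msg)                     # index 5 at position 2 is outside 1:4
end
```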

See also

onecold_batch: converts one-hot representations back to tokens.