SequenceTokenizers
Documentation for SequenceTokenizers.
SequenceTokenizers.SequenceTokenizers
SequenceTokenizers.SequenceTokenizer
SequenceTokenizers.SequenceTokenizer
SequenceTokenizers.SequenceTokenizer
SequenceTokenizers.SequenceTokenizer
SequenceTokenizers.SequenceTokenizer
SequenceTokenizers.SequenceTokenizer
SequenceTokenizers.SequenceTokenizer
Base.length
Base.show
SequenceTokenizers.onecold_batch
SequenceTokenizers.onehot_batch
SequenceTokenizers.onehot_batch
SequenceTokenizers.SequenceTokenizers — Module
SequenceTokenizers
A module for tokenizing sequences of symbols into numerical indices and vice versa. This module provides functionality for creating tokenizers, encoding sequences, and working with one-hot representations of tokenized data.
Exports
SequenceTokenizer: A struct for tokenizing sequences
onehot_batch: Convert tokenized sequences to one-hot representations
onecold_batch: Convert one-hot representations back to tokenized sequences
Example
using SequenceTokenizers
# Create a tokenizer for DNA sequences
dna_alphabet = ['A', 'C', 'G', 'T']
tokenizer = SequenceTokenizer(dna_alphabet, 'N')
# Tokenize a sequence
seq = "ACGTACGT"
tokenized = tokenizer(seq)
# Convert to one-hot representation
onehot = onehot_batch(tokenizer, tokenized)
# Convert back to tokens
recovered = onecold_batch(tokenizer, onehot)
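Continuing the example, the round trip can be checked explicitly. The exact index values below are an assumption: they presume the unknown symbol 'N' is assigned index 1 and the remaining symbols keep their given order, consistent with the method examples further down.
println(tokenized)                  # Assumed output (as integers): [2, 3, 4, 5, 2, 3, 4, 5]
println(size(onehot))               # Expected: (5, 8), i.e. 5 alphabet symbols (including 'N') by 8 positions
println(recovered == collect(seq))  # Expected: true, since a single sequence needs no padding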
SequenceTokenizers.SequenceTokenizer — Type
SequenceTokenizer{T, V <: AbstractVector{T}}
A struct for tokenizing sequences of symbols into numerical indices.
Fields
alphabet::V: The set of valid symbols in the sequences
lookup::Vector{UInt32}: A lookup table for fast symbol-to-index conversion
unksym::T: The symbol to use for unknown tokens
unkidx::UInt32: The index assigned to the unknown symbol
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
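The documented fields can be inspected directly. A minimal sketch, assuming the unknown symbol is added to the alphabet and assigned index 1, consistent with the method examples below:
println(tokenizer.unksym)            # 'x'
println(Int(tokenizer.unkidx))       # Assumed: 1, the index reserved for unknown tokens
println(length(tokenizer.alphabet))  # Assumed: 4, the three given symbols plus the unknown symbol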
SequenceTokenizers.SequenceTokenizer — Method
(tokenizer::SequenceTokenizer)(idx::Integer)
Convert an index back to its corresponding token.
Arguments
idx::Integer: An index to be converted back to a token
Returns
The token corresponding to the given index in the tokenizer's alphabet
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
println(tokenizer(2)) # Output: 'a'
println(tokenizer(1)) # Output: 'x' (unknown token)
SequenceTokenizers.SequenceTokenizer — Method
(tokenizer::SequenceTokenizer{T})(token::T) where T
Convert a single token to its corresponding index.
Arguments
token::T: A single token to be converted to an index
Returns
The index of the token in the tokenizer's alphabet, or the unknown token index if not found
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
println(tokenizer('a')) # Output: 2
println(tokenizer('x')) # Output: 1
println(tokenizer('z')) # Output: 1 (unknown token)
SequenceTokenizers.SequenceTokenizer — Method
(tokenizer::SequenceTokenizer{T})(x::AbstractArray) where T
Tokenize an array of symbols.
Arguments
x::AbstractArray: An array of symbols to be tokenized
Returns
An array of indices corresponding to the input symbols
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
println(tokenizer(['a', 'b', 'z', 'c'])) # Output: [2, 3, 1, 4]
SequenceTokenizers.SequenceTokenizer — Method
(tokenizer::SequenceTokenizer{T})(input::AbstractString) where T
Tokenize a string input using the SequenceTokenizer.
This method efficiently converts the input string to a vector of tokens of type T and applies the tokenizer to each element.
Arguments
tokenizer::SequenceTokenizer{T}: The tokenizer to use
input::AbstractString: The input string to be tokenized
Returns
A Vector{UInt32} of token indices corresponding to the characters in the input string
Performance Notes
- This method uses collect(T, input) to convert the string to a vector of type T
- It is marked @inline for potential performance benefits in certain contexts
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
result = tokenizer("abcx")
println(result) # Output: [2, 3, 4, 1]
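Given the performance note above, the string method should agree with tokenizing the collected character vector. A quick equivalence check, assuming both methods behave as documented:
println(tokenizer("abcx") == tokenizer(collect("abcx")))  # Expected: true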
SequenceTokenizers.SequenceTokenizer — Method
(tokenizer::SequenceTokenizer{T})(batch::AbstractVector{<:AbstractString}) where T
Tokenize a batch of string sequences, padding shorter sequences with the unknown token.
Arguments
tokenizer::SequenceTokenizer{T}: The tokenizer to use
batch::AbstractVector{<:AbstractString}: A vector of string sequences to be tokenized
Returns
A matrix of indices, where each column represents a tokenized and padded sequence
Example
tokenizer = SequenceTokenizer(['A','T','G','C'], 'N')
sequences = ["ATG", "ATGCGC"]
result = tokenizer(sequences)
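The example does not show the resulting matrix, so here is its expected layout as a sketch. It assumes the unknown symbol 'N' receives index 1 and 'A', 'T', 'G', 'C' receive indices 2 through 5 in order:
println(size(result))  # Expected: (6, 2), one column per sequence, padded to the longest length
# Assumed column 1 ("ATG"):    [2, 3, 4, 1, 1, 1], padded with the unknown index 1
# Assumed column 2 ("ATGCGC"): [2, 3, 4, 5, 4, 5]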
SequenceTokenizers.SequenceTokenizer — Method
(tokenizer::SequenceTokenizer{T})(batch::AbstractVector{<:AbstractVector{T}}) where T
Tokenize a batch of sequences, padding shorter sequences with the unknown token.
Arguments
batch::AbstractVector{<:AbstractVector{T}}: A vector of sequences to be tokenized
Returns
A matrix of indices, where each column represents a tokenized and padded sequence
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
sequences = [['a', 'b'], ['c', 'a', 'b']]
println(tokenizer(sequences))
# Output:
# [2 4
# 3 2
# 1 3]
Base.length — Method
Base.length(tokenizer::AbstractSequenceTokenizer)
Get the number of unique tokens in the tokenizer's alphabet.
Returns
The length of the tokenizer's alphabet
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
println(length(tokenizer)) # Output: 4
Base.show — Method
Base.show(io::IO, tokenizer::SequenceTokenizer{T}) where T
Custom display method for SequenceTokenizer instances.
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
println(tokenizer) # Output: SequenceTokenizer{Char}(length(alphabet)=4, unksym=x)
SequenceTokenizers.onecold_batch — Method
onecold_batch(tokenizer::AbstractSequenceTokenizer, onehot_batch::OneHotArray)
Convert a one-hot representation back to tokenized sequences.
Arguments
tokenizer::AbstractSequenceTokenizer: The tokenizer used for the sequences
onehot_batch::OneHotArray: A OneHotArray representing the one-hot encoding of sequences
Returns
A matrix of tokens from the tokenizer's alphabet representing the decoded sequences
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
sequences = [['a', 'b'], ['c', 'a', 'b']]
tokenized = tokenizer(sequences)
onehot = onehot_batch(tokenizer, tokenized)
recovered = onecold_batch(tokenizer, onehot)
# The recovered result is batched, so shorter sequences remain padded with the unknown symbol
println(recovered == ['a' 'c'; 'b' 'a'; 'x' 'b']) # Output: true
SequenceTokenizers.onehot_batch — Method
onehot_batch(tokenizer::SequenceTokenizer, batch::AbstractMatrix{UInt32})
Convert a batch of tokenized sequences to one-hot representations.
Arguments
tokenizer::SequenceTokenizer: The tokenizer used for the sequences
batch::AbstractMatrix{UInt32}: A matrix of tokenized sequences
Returns
A OneHotArray representing the one-hot encoding of the input batch
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
sequences = [['a', 'b'], ['c', 'a', 'b']]
tokenized = tokenizer(sequences)
onehot = onehot_batch(tokenizer, tokenized)
println(size(onehot)) # Output: (4, 3, 2)
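To make the layout of the returned array concrete: the size (4, 3, 2) indicates the alphabet dimension comes first, then sequence position, then batch index. A small indexing sketch, assuming 'a' maps to index 2 as in the examples above:
# onehot[:, position, sequence] is the one-hot vector of a single token
println(onehot[:, 1, 1])  # Expected: a one-hot vector with a 1 at position 2, e.g. Bool[0, 1, 0, 0]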
SequenceTokenizers.onehot_batch — Method
onehot_batch(tokenizer::AbstractSequenceTokenizer, batch::AbstractVector{UInt32})
Convert a batch of tokenized sequences to one-hot representations.
This function takes a vector of token indices and converts it into a one-hot encoded representation using the alphabet of the provided tokenizer.
Arguments
tokenizer::AbstractSequenceTokenizer: The tokenizer used for the sequences. Its length determines the size of the one-hot encoding dimension.
batch::AbstractVector{UInt32}: A vector of token indices to be converted to one-hot representation.
Returns
OneHotArray: A one-hot encoded representation of the input batch. The resulting array will have dimensions (length(tokenizer), length(batch)).
Example
alphabet = ['a', 'b', 'c']
tokenizer = SequenceTokenizer(alphabet, 'x')
tokenized_sequence = UInt32[2, 3, 1, 4] # Corresponds to ['a', 'b', 'x', 'c']
onehot = onehot_batch(tokenizer, tokenized_sequence)
println(size(onehot)) # Output: (4, 4)
println(onehot[:, 1]) # Output: [0, 1, 0, 0]
Note
This function assumes that all indices in the input batch are valid for the tokenizer's alphabet. Indices outside the valid range may result in errors or unexpected behavior.
See also
SequenceTokenizer: The tokenizer struct used to create the input batch.
onecold_batch: The inverse operation, converting one-hot representations back to tokens from the alphabet.