Models

Activations

This module contains custom activation functions that can be used in PyTorch models.

Available functions:

Sigmax: Implements the custom activation function for attention.
CReLu: Implements the Clipped Rectified Linear Unit (CReLu) activation function.

class speeq.models.activations.CReLu(max_val: int)[source]

Bases: Module

Implements the Clipped Rectified Linear Unit (CReLu) activation function as described in https://arxiv.org/abs/1412.5567

Args:

max_val (int): The maximum value for clipping

forward(x: Tensor) → Tensor[source]

Passes the input tensor x through the Clipped Rectified Linear Unit (CReLu) activation function.

Args:

x (Tensor): The input tensor.

Returns:

Tensor: The result tensor after applying the activation function.

training: bool
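
The clipping described above can be sketched as follows. This is a minimal illustration that assumes the activation computes min(max(x, 0), max_val) element-wise, as in the referenced paper; the class name ClippedReLU is a hypothetical stand-in and not the speeq implementation.

```python
import torch
from torch import nn


class ClippedReLU(nn.Module):
    """Hypothetical stand-in for CReLu, written only for illustration."""

    def __init__(self, max_val: int) -> None:
        super().__init__()
        self.max_val = max_val

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise min(max(x, 0), max_val).
        return torch.clamp(x, min=0, max=self.max_val)


x = torch.tensor([-3.0, 0.5, 20.0, 42.0])
clipped = ClippedReLU(max_val=20)(x)
```

Negative inputs are mapped to zero and values above max_val are capped at max_val.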

class speeq.models.activations.Sigmax(dim: int = -1)[source]

Bases: Module

Implements the custom activation function for attention proposed in https://arxiv.org/abs/1506.07503

Args:

dim (int): The dimension to apply the activation function on.

forward(x: Tensor) → Tensor[source]

Passes the input tensor x through the Sigmax activation function.

Args:

x (Tensor): The input tensor.

Returns:

Tensor: The result tensor after applying the activation function.

training: bool
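
A possible reading of this activation is sketched below. It assumes Sigmax follows the smoothing idea from the referenced paper, where the exponential of the softmax is replaced by a sigmoid before normalizing along dim; the function name sigmax is hypothetical and the library's implementation may differ in detail.

```python
import torch


def sigmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Assumption: sigmoid of each score, normalized to sum to one along `dim`.
    s = torch.sigmoid(x)
    return s / s.sum(dim=dim, keepdim=True)


scores = torch.tensor([[2.0, 0.0, -2.0]])
weights = sigmax(scores, dim=-1)
```

Because the sigmoid is bounded, the resulting weights are typically less peaked than softmax weights for the same scores.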

CTC Models

This module contains various CTC (Connectionist Temporal Classification) models for speech recognition. The CTC models are implemented as subclasses of the base class CTCModel.

Classes:

CTCModel(nn.Module): Base class for CTC models.
DeepSpeechV1(CTCModel): DeepSpeech version 1 model.
BERT(nn.Module): Bidirectional Encoder Representations from Transformers (BERT) model.
DeepSpeechV2(CTCModel): DeepSpeech version 2 model.
Conformer(CTCModel): Conformer model.
Jasper(CTCModel): Jasper model.
Wav2Letter(CTCModel): Wav2Letter model.
QuartzNet(CTCModel): QuartzNet model.
Squeezeformer(CTCModel): Squeezeformer model.

class speeq.models.ctc.BERT(max_len: int, in_features: int, d_model: int, h: int, ff_size: int, n_layers: int, n_classes: int, p_dropout: float)[source]

Bases: Module

Implements the BERT Model as described in https://arxiv.org/abs/1810.04805

Args:

max_len (int): The maximum length for positional encoding.

in_features (int): The input/speech feature size.

d_model (int): The model dimensionality.

h (int): The number of attention heads.

ff_size (int): The inner size of the feed forward module.

n_layers (int): The number of transformer encoders.

n_classes (int): The number of classes.

p_dropout (float): The dropout rate.

embed(x: Tensor, mask: Tensor)[source]

forward(x: Tensor, mask: Tensor) → Tuple[Tensor, Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool

class speeq.models.ctc.CTCModel(*args, **kwargs)[source]

Bases: Module

Builds the base of a CTC model. If it is used directly, the encoder parameters have to be added; otherwise, the forward method will raise an error.

forward(x: Tensor, mask: Tensor, *args, **kwargs)[source]

Passes the speech input through the model.

Args:

x (Tensor): The input speech signal of shape [B, M, d]

mask (Tensor): The speech mask of shape [B, M], which is False for the positions that contain padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the predictions of shape [M, B, C] and the second element is the lengths tensor of shape [B].

training: bool
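
The shapes above line up with what torch.nn.CTCLoss expects, namely time-major log-probabilities of shape [M, B, C] and per-example input lengths of shape [B]. The sketch below uses random values as a stand-in for the model output. It assumes that class 0 is the blank symbol and that log-softmax still has to be applied, neither of which is stated in this reference.

```python
import torch
from torch import nn

B, M, C = 2, 6, 5  # batch size, frames, classes (class 0 assumed to be blank)

# Boolean mask of shape [B, M]: True for data positions, False for padding.
mask = torch.tensor([
    [True, True, True, True, True, True],
    [True, True, True, True, False, False],
])
lengths = mask.sum(dim=-1)  # shape [B]

# Random stand-in for the model predictions of shape [M, B, C].
logits = torch.randn(M, B, C)
log_probs = logits.log_softmax(dim=-1)

# Padded target ids of shape [B, S] and their true lengths.
targets = torch.tensor([[1, 2, 3], [2, 4, 0]])
target_lengths = torch.tensor([3, 2])

loss = nn.CTCLoss(blank=0)(log_probs, targets, lengths, target_lengths)
```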

class speeq.models.ctc.Conformer(*args, **kwargs)[source]

Bases: CTCModel

Implements the conformer model proposed in https://arxiv.org/abs/2005.08100. This model is used with CTC, whereas the paper uses RNN-T.

Args:

n_classes (int): The number of classes.

d_model (int): The model dimension.

n_conf_layers (int): The number of conformer blocks.

ff_expansion_factor (int): The feed-forward expansion factor.

h (int): The number of attention heads.

kernel_size (int): The convolution module kernel size.

ss_kernel_size (int): The subsampling layer kernel size.

ss_stride (int): The subsampling layer stride size.

ss_num_conv_layers (int): The number of subsampling convolutional layers.

in_features (int): The input/speech feature size.

res_scaling (float): The residual connection multiplier.

p_dropout (float): The dropout rate.

training: bool

class speeq.models.ctc.DeepSpeechV1(*args, **kwargs)[source]

Bases: CTCModel

Builds the DeepSpeech model described in https://arxiv.org/abs/1412.5567

Args:

in_features (int): The input feature size.

hidden_size (int): The hidden size of the rnn layers.

n_linear_layers (int): The number of feed-forward layers.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

n_classes (int): The number of classes to predict.

max_clip_value (int): The maximum relu clipping value.

rnn_type (str): The RNN type; it has to be one of rnn, gru or lstm.

p_dropout (float): The dropout rate.

predict(x: Tensor) → Tensor[source]

training: bool

class speeq.models.ctc.DeepSpeechV2(*args, **kwargs)[source]

Bases: CTCModel

Implements the deep speech model proposed in https://arxiv.org/abs/1512.02595

Args:

n_conv (int): The number of convolution layers.

kernel_size (int): The kernel size of the convolution layers.

stride (int): The stride size of the convolution layer.

in_features (int): The input/speech feature size.

hidden_size (int): The hidden size of the RNN layers.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

n_rnn (int): The number of RNN layers.

n_linear_layers (int): The number of linear layers.

n_classes (int): The number of classes.

max_clip_value (int): The maximum relu clipping value.

rnn_type (str): The RNN type; it has to be one of rnn, gru or lstm.

tau (int): The future context size.

p_dropout (float): The dropout rate.

training: bool

class speeq.models.ctc.Jasper(*args, **kwargs)[source]

Bases: CTCModel

Implements Jasper model architecture proposed in https://arxiv.org/abs/1904.03288

Args:

n_classes (int): The number of classes.

in_features (int): The input/speech feature size.

num_blocks (int): The number of Jasper blocks (denoted as ‘B’ in the paper).

num_sub_blocks (int): The number of Jasper subblocks (denoted as ‘R’ in the paper).

channel_inc (int): The rate to increase the number of channels across the blocks.

epilog_kernel_size (int): The kernel size of the epilog block convolution layer.

prelog_kernel_size (int): The kernel size of the prelog block convolution layer.

prelog_stride (int): The stride size of the prelog block convolution layer.

prelog_n_channels (int): The output channels of the prelog block convolution layer.

blocks_kernel_size (Union[int, List[int]]): The kernel size(s) of the convolution layer for each block.

p_dropout (float): The dropout rate.

training: bool

class speeq.models.ctc.QuartzNet(*args, **kwargs)[source]

Bases: CTCModel

Implements QuartzNet model architecture proposed in https://arxiv.org/abs/1910.10261

Args:

n_classes (int): The number of classes.

in_features (int): The input/speech feature size.

num_blocks (int): The number of QuartzNet blocks (denoted as ‘B’ in the paper).

block_repetition (int): The number of times to repeat each block (denoted as ‘S’ in the paper).

num_sub_blocks (int): The number of QuartzNet subblocks, (denoted as ‘R’ in the paper).

channels_size (List[int]): A list of integers representing the number of output channels for each block.

epilog_kernel_size (int): The kernel size of the convolution layer in the epilog block.

epilog_channel_size (Tuple[int, int]): A tuple of the channel sizes of the two convolution layers in the epilog block.

prelog_kernel_size (int): The kernel size of the convolution layer in the prelog block.

prelog_stride (int): The stride size of the convolutional layer in the prelog block.

prelog_n_channels (int): The number of output channels of the convolutional layer in the prelog block.

groups (int): The groups size.

blocks_kernel_size (Union[int, List[int]]): An integer or a list of integers representing the kernel size(s) for each block’s convolutional layer.

p_dropout (float): The dropout rate.

training: bool

class speeq.models.ctc.Squeezeformer(*args, **kwargs)[source]

Bases: CTCModel

Implements the Squeezeformer model architecture as described in https://arxiv.org/abs/2206.00888

Args:

n_classes (int): The number of classes.

in_features (int): The input/speech feature size.

n (int): The number of layers per block, (denoted as N in the paper).

d_model (int): The model dimension.

ff_expansion_factor (int): The expansion factor of linear layer in the feed forward module.

h (int): The number of attention heads.

kernel_size (int): The kernel size of the depth-wise convolution layer.

pooling_kernel_size (int): The kernel size of the pooling convolution layer.

pooling_stride (int): The stride size of the pooling convolution layer.

ss_kernel_size (Union[int, List[int]]): The kernel size of the subsampling layer(s).

ss_stride (Union[int, List[int]]): The stride of the subsampling layer(s).

ss_n_conv_layers (int): The number of subsampling convolutional layers.

p_dropout (float): The dropout rate.

ss_groups (Union[int, List[int]]): The subsampling convolution groups size(s).

masking_value (int): The masking value. Default -1e15

training: bool

class speeq.models.ctc.Wav2Letter(*args, **kwargs)[source]

Bases: CTCModel

Implements Wav2Letter model proposed in https://arxiv.org/abs/1609.03193

Args:

in_features (int): The input/speech feature size.

n_classes (int): The number of classes.

n_conv_layers (int): The number of convolution layers.

layers_kernel_size (int): The kernel size of the convolution layers.

layers_channels_size (int): The number of output channels of each convolution layer.

pre_conv_stride (int): The stride of the prenet convolution layer.

pre_conv_kernel_size (int): The kernel size of the prenet convolution layer.

post_conv_channels_size (int): The number of output channels of the postnet convolution layer.

post_conv_kernel_size (int): The kernel size of the postnet convolution layer.

p_dropout (float): The dropout rate.

wav_kernel_size (Optional[int]): The kernel size of the first layer that processes the wav samples directly if wav is modeled. Default None.

wav_stride (Optional[int]): The stride size of the first layer that processes the wav samples directly if wav is modeled. Default None.

training: bool

Decoders

This module contains various pre-implemented decoders used in different models.

Classes:

GlobAttRNNDecoder: Implements an RNN decoder with global attention mechanism.
LocationAwareAttDecoder: Implements an RNN decoder with location aware attention mechanism.
SpeechTransformerDecoder: Implements the speech transformer decoder.
TransducerRNNDecoder: Implements a simple RNN decoder with an embedding layer and a single RNN layer.
TransformerDecoder: Implements a transformer decoder.
TransformerTransducerDecoder: Implements a Transformer-Transducer decoder.

class speeq.models.decoders.GlobAttRNNDecoder(embed_dim: int, hidden_size: int, n_layers: int, n_classes: int, pred_activation: Module, teacher_forcing_rate: float = 0.0, rnn_type: str = 'rnn')[source]

Bases: Module

Implements RNN decoder with global attention.

Args:

embed_dim (int): The size of the embedding.

hidden_size (int): The size of the RNN hidden state.

n_layers (int): The number of RNN layers.

n_classes (int): The number of output classes.

pred_activation (Module): An instance of an activation function.

teacher_forcing_rate (float): The teacher forcing rate. Default 0.0

rnn_type (str): The RNN type; it has to be one of rnn, gru or lstm. Default ‘rnn’.

forward(h: Optional[Union[Tensor, Tuple[Tensor, Tensor]]], enc_out: Tensor, enc_mask: Tensor, dec_inp: Tensor, *args, **kwargs) → Tensor[source]

Decodes the input autoregressively.

Args:

h (Union[Tensor, Tuple[Tensor, Tensor], None]): The last hidden state of the encoder. If not provided, set as None. Its shape is [1, B, hidden_size].

enc_out (Tensor): The encoder output tensor of shape [B, M, h].

enc_mask (Tensor): The encoder mask tensor of shape [B, M], where True denotes data positions and False denotes padding ones.

dec_inp (Tensor): The decoder input tensor of shape [B, M_dec].

Returns:

Tensor: A tensor of shape [B, M_dec, C], representing the output of the forward pass.

predict(state: dict) → Tuple[Tensor, dict, Tensor][source]

training: bool

class speeq.models.decoders.LocationAwareAttDecoder(embed_dim: int, hidden_size: int, n_layers: int, n_classes: int, pred_activation: Module, kernel_size: int, activation: str, inv_temperature: Union[float, int] = 1, teacher_forcing_rate: float = 0.0, rnn_type: str = 'rnn')[source]

Bases: GlobAttRNNDecoder

Implements RNN decoder with location aware attention.

Args:

embed_dim (int): The embedding size.

hidden_size (int): The RNN hidden size.

n_layers (int): The number of RNN layers.

n_classes (int): The number of classes.

pred_activation (Module): An activation function instance.

kernel_size (int): The attention kernel size.

activation (str): The activation function to use. It can be either softmax or sigmax.

inv_temperature (Union[float, int]): The inverse temperature value. Default 1.

teacher_forcing_rate (float): The teacher forcing rate. Default 0.0

rnn_type (str): The RNN type; it has to be one of rnn, gru or lstm. Default ‘rnn’.

forward(h: Optional[Union[Tensor, Tuple[Tensor, Tensor]]], enc_out: Tensor, enc_mask: Tensor, dec_inp: Tensor, *args, **kwargs) → Tensor[source]

Runs the forward pass on the input.

Args:

h (Union[Tensor, Tuple[Tensor, Tensor], None]): The last hidden state of the encoder if provided, which is of shape [1, B_enc, hidden_size].

enc_out (Tensor): The encoder outputs of shape [B, M, h].

enc_mask (Tensor): The encoder mask of shape [B, M], which is True for the data positions and False for the padding ones.

dec_inp (Tensor): The decoder input of shape [B, M_dec].

Returns:

A tensor of shape [B, M_dec, C], which is the output of the LocationAwareAttDecoder module.

predict(state: dict) → Tuple[Tensor, dict, Tensor][source]

training: bool
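
The role of inv_temperature can be illustrated with a short sketch. It assumes the value multiplies the attention scores before the softmax normalization, which is how sharpening is commonly formulated; the function name sharpened_softmax is hypothetical and the decoder's internals are not shown in this reference.

```python
import torch


def sharpened_softmax(
    scores: torch.Tensor, inv_temperature: float = 1.0, dim: int = -1
) -> torch.Tensor:
    # Assumption: scores are scaled by the inverse temperature before softmax.
    return torch.softmax(inv_temperature * scores, dim=dim)


scores = torch.tensor([1.0, 2.0, 3.0])
flat = sharpened_softmax(scores, inv_temperature=1.0)
sharp = sharpened_softmax(scores, inv_temperature=4.0)
```

Under this assumption, values above 1 concentrate the attention weights on the highest-scoring positions, while values below 1 flatten them.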

class speeq.models.decoders.SpeechTransformerDecoder(n_classes: int, n_layers: int, d_model: int, ff_size: int, h: int, pred_activation: Module, masking_value: int = -1000000000000000.0)[source]

Bases: TransformerDecoder

Implements the speech transformer decoder as described in https://ieeexplore.ieee.org/document/8462506

Args:

n_classes (int): The number of classes the model will predict.

n_layers (int): The number of decoder layers.

d_model (int): The model dimensionality.

ff_size (int): The dimensionality of the feed-forward inner layer.

h (int): The number of attention heads.

pred_activation (Module): An activation function instance.

masking_value (int): The attention masking value. Default -1e15

forward(enc_out: Tensor, enc_mask: Optional[Tensor], dec_inp: Tensor, dec_mask: Optional[Tensor], *args, **kwargs) → Tensor[source]

Passes the inputs through the speech transformer decoder.

Args:

enc_out (Tensor): The output tensor of the encoder of shape [B, M_enc, d].

enc_mask (Union[Tensor, None]): The encoder mask of shape [B, M_enc], which is True for the data positions and False for the padding ones.

dec_inp (Tensor): The input tensor of the decoder of shape [B, M_dec].

dec_mask (Union[Tensor, None]): The decoder mask of shape [B, M_dec], which is True for the data positions and False for the padding ones.

Returns:

Tensor: The output tensor of the speech transformer decoder of shape [B, M_dec, C].

predict(state: dict) → dict[source]

training: bool

class speeq.models.decoders.TransducerRNNDecoder(vocab_size: int, emb_dim: int, hidden_size: int, rnn_type: str, n_layers: int = 1)[source]

Bases: Module

Builds a simple RNN decoder that contains an embedding layer and a single RNN layer.

Args:

vocab_size (int): The vocabulary size.

emb_dim (int): The embedding dimension.

hidden_size (int): The RNN’s hidden size.

rnn_type (str): The RNN type; it has to be one of rnn, gru or lstm.

n_layers (int): The number of RNN layers to use. Default 1.

forward(x: Tensor, mask: Tensor) → Tuple[Tensor, Tensor][source]

Runs the input tensor through the RNN transducer decoder.

Args:

x (Tensor): The input tensor of shape [B, M].

mask (Tensor): The encoder mask of shape [B, M]. It is True for data positions and False for padding ones.

Returns:

Tuple[Tensor, Tensor]: A tuple containing two tensors. The first tensor is the output tensor of shape [B, M, hidden_size] and the second tensor is the length tensor of shape [B], representing the actual length of each input sequence in the batch.

predict(state: dict) → dict[source]

training: bool

class speeq.models.decoders.TransformerDecoder(n_classes: int, n_layers: int, d_model: int, ff_size: int, h: int, pred_activation: Module, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements the transformer decoder as described in https://arxiv.org/abs/1706.03762

Args:

n_classes (int): The number of classes the model will predict.

n_layers (int): The number of decoder layers.

d_model (int): The model dimensionality.

ff_size (int): The dimensionality of the feed-forward inner layer.

h (int): The number of attention heads.

pred_activation (Module): An activation function instance.

masking_value (int): The attention masking value. Default -1e15

forward(enc_out: Tensor, enc_mask: Optional[Tensor], dec_inp: Tensor, dec_mask: Optional[Tensor], *args, **kwargs) → Tensor[source]

Passes the inputs through the transformer decoder.

Args:

enc_out (Tensor): The output tensor of the encoder of shape [B, M_enc, d].

enc_mask (Union[Tensor, None]): The encoder mask of shape [B, M_enc], which is True for the data positions and False for the padding ones.

dec_inp (Tensor): The input tensor of the decoder of shape [B, M_dec].

dec_mask (Union[Tensor, None]): The decoder mask of shape [B, M_dec], which is True for the data positions and False for the padding ones.

Returns:

Tensor: The output tensor of the transformer decoder of shape [B, M_dec, C].

predict(state: dict) → dict[source]

training: bool
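
How a boolean mask and masking_value typically interact can be sketched as follows. This is an illustration with made-up scores, assuming that padded positions are overwritten with masking_value before the softmax; it is not taken from the library's source.

```python
import torch

masking_value = -1e15

# Made-up attention scores of shape [B, M_dec, M_enc].
scores = torch.zeros(1, 2, 4)

# Encoder mask of shape [B, M_enc]: True for data, False for padding.
enc_mask = torch.tensor([[True, True, True, False]])

# Overwrite scores at padded encoder positions, then normalize.
masked = scores.masked_fill(~enc_mask.unsqueeze(1), masking_value)
weights = torch.softmax(masked, dim=-1)
```

With a value this negative, the exponential underflows to zero, so padded positions receive no attention weight.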

class speeq.models.decoders.TransformerTransducerDecoder(vocab_size: int, n_layers: int, d_model: int, ff_size: int, h: int, left_size: int, right_size: int, p_dropout: float, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements the Transformer-Transducer decoder with relative truncated multi-head self attention as described in https://arxiv.org/abs/2002.02562

Args:

vocab_size (int): The vocabulary size.

n_layers (int): The number of transformer encoder layers with truncated self attention and relative positional encoding.

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of heads in the attention mechanism.

left_size (int): The size of the left window that each time step is allowed to look at.

right_size (int): The size of the right window that each time step is allowed to look at.

p_dropout (float): The dropout rate.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.

forward(x: Tensor, mask: Tensor) → Tuple[Tensor, Tensor][source]

Passes the input x through the decoder layers.

Args:

x (Tensor): The input tensor of shape [B, M]

mask (Tensor): The input boolean mask of shape [B, M], where it’s True if there is no padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded text of shape [B, M, d_model] and the second element is the lengths of shape [B].

training: bool
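
The effect of left_size and right_size can be pictured as a banded attention mask. The sketch below assumes each step may attend to itself, to at most left_size earlier steps, and to at most right_size later steps; the helper name window_mask is hypothetical and the exact boundary handling in the library may differ.

```python
import torch


def window_mask(length: int, left_size: int, right_size: int) -> torch.Tensor:
    idx = torch.arange(length)
    # offset[i, j] = j - i, the position of key j relative to query i.
    offset = idx.unsqueeze(0) - idx.unsqueeze(1)
    # True where key j lies inside the window around query i.
    return (offset >= -left_size) & (offset <= right_size)


allowed = window_mask(length=5, left_size=2, right_size=1)
```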

Encoders

This module provides various speech encoders.

The available encoders are:

DeepSpeechV1Encoder: The encoder implementation of the DeepSpeech V1 model.
DeepSpeechV2Encoder: The encoder implementation of the DeepSpeech V2 model.
ConformerEncoder: The encoder implementation of the Conformer model.
JasperEncoder: The encoder implementation of the Jasper model.
Wav2LetterEncoder: The encoder implementation of the Wav2Letter model.
QuartzNetEncoder: The encoder implementation of the QuartzNet model.
SqueezeformerEncoder: The encoder implementation of the Squeezeformer model.
SpeechTransformerEncoder: The encoder implementation of the Speech Transformer model.
RNNEncoder: The encoder implementation of a general RNN model.
PyramidRNNEncoder: The encoder implementation of the Pyramid RNN model.
ContextNetEncoder: The encoder implementation of the ContextNet model.
VGGTransformerEncoder: The encoder implementation of the VGG-Transformer.
TransformerTransducerEncoder: The encoder implementation of the transformer transducer with relative truncated multi-head self-attention.

Each encoder takes a speech input of shape [B, M, d] and a boolean mask of shape [B, M], and returns the encoded speech together with the lengths of shape [B], where B is the batch size, M is the length of the speech sequence, and d is the number of features.
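
The forward methods documented below take a boolean mask of shape [B, M] that is True for data positions. When only the per-example lengths are available, such a mask can be built as sketched here; the helper name lengths_to_mask is hypothetical and not part of the documented API.

```python
import torch


def lengths_to_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    positions = torch.arange(max_len).unsqueeze(0)  # shape [1, M]
    # True where the position index is smaller than the example's length.
    return positions < lengths.unsqueeze(1)  # shape [B, M]


x = torch.randn(2, 5, 80)  # [B, M, d], with an illustrative feature size
lengths = torch.tensor([5, 3])
mask = lengths_to_mask(lengths, max_len=x.shape[1])
```

Summing the mask over its last dimension recovers the lengths tensor of shape [B].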

class speeq.models.encoders.ConformerEncoder(d_model: int, n_conf_layers: int, ff_expansion_factor: int, h: int, kernel_size: int, ss_kernel_size: int, ss_stride: int, ss_num_conv_layers: int, in_features: int, res_scaling: float, p_dropout: float)[source]

Bases: Module

Implements the conformer encoder proposed in https://arxiv.org/abs/2005.08100

Args:

d_model (int): The model dimension.

n_conf_layers (int): The number of conformer blocks.

ff_expansion_factor (int): The feed-forward expansion factor.

h (int): The number of attention heads.

kernel_size (int): The convolution module kernel size.

ss_kernel_size (int): The subsampling layer kernel size.

ss_stride (int): The subsampling layer stride size.

ss_num_conv_layers (int): The number of subsampling convolutional layers.

in_features (int): The input/speech feature size.

res_scaling (float): The residual connection multiplier.

p_dropout (float): The dropout rate.

forward(x: Tensor, mask: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The boolean input mask of shape [B, M], which is True for the positions that are not padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].

training: bool
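
The subsampling parameters determine how much shorter the encoded sequence is than the input. The arithmetic below is a sketch that assumes each subsampling convolution uses no padding and a dilation of 1; the library's layers may be configured differently, so the numbers are illustrative only.

```python
def subsampled_length(
    length: int, kernel_size: int, stride: int, num_layers: int
) -> int:
    # Standard output-length formula for an unpadded 1D convolution,
    # applied once per subsampling layer.
    for _ in range(num_layers):
        length = (length - kernel_size) // stride + 1
    return length


out_len = subsampled_length(length=100, kernel_size=3, stride=2, num_layers=2)
```

Under these assumptions, 100 input frames become 49 after the first layer and 24 after the second.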

class speeq.models.encoders.ContextNetEncoder(in_features: int, n_layers: int, n_sub_layers: Union[int, List[int]], stride: Union[int, List[int]], out_channels: Union[int, List[int]], kernel_size: int, reduction_factor: int)[source]

Bases: Module

Implements the ContextNet encoder proposed in https://arxiv.org/abs/2005.03191

Args:

in_features (int): The input feature size.

n_layers (int): The number of ContextNet blocks.

n_sub_layers (Union[int, List[int]]): The number of convolutional layers per block. If list is passed, it has to be of length equal to n_layers.

stride (Union[int, List[int]]): The stride of the last convolutional layers per block. If list is passed, it has to be of length equal to n_layers.

out_channels (Union[int, List[int]]): The channels size of the convolutional layers per block. If list is passed, it has to be of length equal to n_layers.

kernel_size (int): The convolutional layers kernel size.

reduction_factor (int): The feature reduction size of the Squeeze-and-excitation module.

forward(x: Tensor, mask: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The boolean input mask of shape [B, M], which is True for the positions that are not padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].

training: bool
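
Several arguments above accept either a single integer or a list with one entry per block. One way such an argument can be normalized is sketched below; the helper name expand_per_layer is hypothetical and the library may validate its inputs differently.

```python
from typing import List, Union


def expand_per_layer(
    value: Union[int, List[int]], n_layers: int, name: str
) -> List[int]:
    # A single integer is repeated for every block.
    if isinstance(value, int):
        return [value] * n_layers
    # A list must provide exactly one entry per block.
    if len(value) != n_layers:
        raise ValueError(
            f"{name} has {len(value)} entries, expected {n_layers}"
        )
    return list(value)


strides = expand_per_layer(1, n_layers=3, name="stride")
channels = expand_per_layer([256, 256, 512], n_layers=3, name="out_channels")
```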

class speeq.models.encoders.DeepSpeechV1Encoder(in_features: int, hidden_size: int, n_linear_layers: int, bidirectional: bool, max_clip_value: int, rnn_type: str, p_dropout: float)[source]

Bases: Module

Builds the DeepSpeech encoder described in https://arxiv.org/abs/1412.5567

Args:

in_features (int): The input feature size.

hidden_size (int): The hidden size of the rnn layers.

n_linear_layers (int): The number of feed-forward layers.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

max_clip_value (int): The maximum relu clipping value.

rnn_type (str): The RNN type; it has to be one of rnn, gru or lstm.

p_dropout (float): The dropout rate.

forward(x: Tensor, mask: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The boolean input mask of shape [B, M], which is True for the positions that are not padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].

training: bool

class speeq.models.encoders.DeepSpeechV2Encoder(n_conv: int, kernel_size: int, stride: int, in_features: int, hidden_size: int, bidirectional: bool, n_rnn: int, n_linear_layers: int, max_clip_value: int, rnn_type: str, tau: int, p_dropout: float)[source]

Bases: Module

Implements the deep speech 2 encoder proposed in https://arxiv.org/abs/1512.02595

Args:

n_conv (int): The number of convolution layers.

kernel_size (int): The kernel size of the convolution layers.

stride (int): The stride size of the convolution layer.

in_features (int): The input/speech feature size.

hidden_size (int): The hidden size of the RNN layers.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

n_rnn (int): The number of RNN layers.

n_linear_layers (int): The number of linear layers.

max_clip_value (int): The maximum relu clipping value.

rnn_type (str): The RNN type; it has to be one of rnn, gru or lstm.

tau (int): The future context size.

p_dropout (float): The dropout rate.

forward(x: Tensor, mask: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The boolean input mask of shape [B, M], which is True for the positions that are not padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].

training: bool

class speeq.models.encoders.JasperEncoder(in_features: int, num_blocks: int, num_sub_blocks: int, channel_inc: int, epilog_kernel_size: int, prelog_kernel_size: int, prelog_stride: int, prelog_n_channels: int, blocks_kernel_size: Union[int, List[int]], p_dropout: float)[source]

Bases: Module

Implements Jasper’s encoder proposed in https://arxiv.org/abs/1904.03288

Args:

in_features (int): The input/speech feature size.

num_blocks (int): The number of Jasper blocks (denoted as ‘B’ in the paper).

num_sub_blocks (int): The number of Jasper subblocks (denoted as ‘R’ in the paper).

channel_inc (int): The rate to increase the number of channels across the blocks.

epilog_kernel_size (int): The kernel size of the epilog block convolution layer.

prelog_kernel_size (int): The kernel size of the prelog block convolution layer.

prelog_stride (int): The stride size of the prelog block convolution layer.

prelog_n_channels (int): The output channels of the prelog block convolution layer.

blocks_kernel_size (Union[int, List[int]]): The kernel size(s) of the convolution layer for each block.

p_dropout (float): The dropout rate.

forward(x: Tensor, mask: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The boolean input mask of shape [B, M], which is True for the positions that are not padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].

training: bool

class speeq.models.encoders.PyramidRNNEncoder(in_features: int, hidden_size: int, reduction_factor: int, bidirectional: bool, n_layers: int, p_dropout: float, rnn_type: str = 'rnn')[source]

Bases: Module

Implements a pyramid of RNN as described in https://arxiv.org/abs/1508.01211.

Args:

in_features (int): The input features size.

hidden_size (int): The RNN hidden size.

reduction_factor (int): The time resolution reduction factor.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

n_layers (int): The number of RNN layers.

p_dropout (float): The dropout rate.

rnn_type (str): The RNN type; it has to be one of rnn, gru or lstm.

forward(x: Tensor, mask: Tensor, return_h=False, *args, **kwargs) → Tuple[Tensor, Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The boolean input mask of shape [B, M], which is True for the positions that are not padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].

training: bool
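
The time resolution reduction can be sketched as a reshape that concatenates consecutive frames, which is how the referenced paper describes its pyramid layers. The sketch assumes that trailing frames which do not fill a complete group are dropped; the library may pad instead, and the helper name reduce_time is hypothetical.

```python
import torch


def reduce_time(x: torch.Tensor, reduction_factor: int) -> torch.Tensor:
    B, M, d = x.shape
    # Assumption: drop trailing frames that do not fill a whole group.
    M_trim = M - M % reduction_factor
    x = x[:, :M_trim, :]
    # Concatenate every `reduction_factor` consecutive frames along features.
    return x.reshape(B, M_trim // reduction_factor, d * reduction_factor)


x = torch.arange(2 * 5 * 3, dtype=torch.float32).reshape(2, 5, 3)
y = reduce_time(x, reduction_factor=2)
```

Each reduction of this kind shortens the sequence by reduction_factor and widens the feature dimension by the same factor.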

class speeq.models.encoders.QuartzNetEncoder(in_features: int, num_blocks: int, block_repetition: int, num_sub_blocks: int, channels_size: List[int], epilog_kernel_size: int, epilog_channel_size: Tuple[int, int], prelog_kernel_size: int, prelog_stride: int, prelog_n_channels: int, groups: int, blocks_kernel_size: Union[int, List[int]], p_dropout: float)[source]

Bases: JasperEncoder

Implements QuartzNet encoder proposed in https://arxiv.org/abs/1910.10261

Args:

in_features (int): The input/speech feature size.

num_blocks (int): The number of QuartzNet blocks (denoted as ‘B’ in the paper).

block_repetition (int): The number of times to repeat each block (denoted as ‘S’ in the paper).

num_sub_blocks (int): The number of QuartzNet subblocks, (denoted as ‘R’ in the paper).

channels_size (List[int]): A list of integers representing the number of output channels for each block.

epilog_kernel_size (int): The kernel size of the convolution layer in the epilog block.

epilog_channel_size (Tuple[int, int]): A tuple of the channel sizes of the two convolution layers in the epilog block.

prelog_kernel_size (int): The kernel size of the convolution layer in the prelog block.

prelog_stride (int): The stride size of the convolutional layer in the prelog block.

prelog_n_channels (int): The number of output channels of the convolutional layer in the prelog block.

groups (int): The groups size.

blocks_kernel_size (Union[int, List[int]]): An integer or a list of integers representing the kernel size(s) for each block’s convolutional layer.

p_dropout (float): The dropout rate.

training: bool

class speeq.models.encoders.RNNEncoder(in_features: int, hidden_size: int, bidirectional: bool, n_layers: int, p_dropout: float, rnn_type: str = 'rnn')[source]

Bases: Module

Implements a stack of RNN layers.

Args:

in_features (int): The input features size.

hidden_size (int): The RNN hidden size.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

n_layers (int): The number of RNN layers.

p_dropout (float): The dropout rate.

rnn_type (str): The RNN type; it has to be one of rnn, gru or lstm.

forward(x: Tensor, mask: Tensor, return_h=False, *args, **kwargs) → Tuple[Tensor, Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The boolean input mask of shape [B, M], which is True for the positions that are not padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].

training: bool

class speeq.models.encoders.SpeechTransformerEncoder(in_features: int, n_conv_layers: int, kernel_size: int, stride: int, d_model: int, n_layers: int, ff_size: int, h: int, att_kernel_size: int, att_out_channels: int)[source]

Bases: Module

Implements the speech transformer encoder described in https://ieeexplore.ieee.org/document/8462506

Args:

in_features (int): The input/speech feature size.

n_conv_layers (int): The number of down-sampling convolutional layers.

kernel_size (int): The kernel size of the down-sampling convolutional layers.

stride (int): The stride size of the down-sampling convolutional layers.

d_model (int): The model dimensionality.

n_layers (int): The number of encoder layers.

ff_size (int): The dimensionality of the inner layer of the feed-forward module.

h (int): The number of attention heads.

att_kernel_size (int): The kernel size of the attentional convolutional layers.

att_out_channels (int): The number of output channels of the attentional convolution layers.

forward(x: Tensor, mask: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The boolean input mask of shape [B, M], which is True at positions that hold data and False at padded positions.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].

training: bool

class speeq.models.encoders.SqueezeformerEncoder(in_features: int, n: int, d_model: int, ff_expansion_factor: int, h: int, kernel_size: int, pooling_kernel_size: int, pooling_stride: int, ss_kernel_size: Union[int, List[int]], ss_stride: Union[int, List[int]], ss_n_conv_layers: int, p_dropout: float, ss_groups: Union[int, List[int]] = 1, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements the Squeezeformer encoder as described in https://arxiv.org/abs/2206.00888

Args:

in_features (int): The input/speech feature size.

n (int): The number of layers per block (denoted as N in the paper).

d_model (int): The model dimension.

ff_expansion_factor (int): The expansion factor of the linear layer in the feed-forward module.

h (int): The number of attention heads.

kernel_size (int): The kernel size of the depth-wise convolution layer.

pooling_kernel_size (int): The kernel size of the pooling convolution layer.

pooling_stride (int): The stride size of the pooling convolution layer.

ss_kernel_size (Union[int, List[int]]): The kernel size of the subsampling layer(s).

ss_stride (Union[int, List[int]]): The stride of the subsampling layer(s).

ss_n_conv_layers (int): The number of subsampling convolutional layers.

p_dropout (float): The dropout rate.

ss_groups (Union[int, List[int]]): The subsampling convolution groups size(s).

masking_value (int): The value used to mask padded positions. Default -1e15.

forward(x: Tensor, mask: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The boolean input mask of shape [B, M], which is True at positions that hold data and False at padded positions.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].

training: bool

class speeq.models.encoders.TransformerTransducerEncoder(in_features: int, n_layers: int, d_model: int, ff_size: int, h: int, left_size: int, right_size: int, p_dropout: float, stride: int = 1, kernel_size: int = 1, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements the Transformer-Transducer encoder with relative truncated multi-head self attention as described in https://arxiv.org/abs/2002.02562

Args:

in_features (int): The input feature size.

n_layers (int): The number of transformer encoder layers with truncated self attention and relative positional encoding.

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of heads in the attention mechanism.

left_size (int): The size of the left window that each time step is allowed to look at.

right_size (int): The size of the right window that each time step is allowed to look at.

p_dropout (float): The dropout rate.

stride (int): The stride of the convolution layer. Default 1.

kernel_size (int): The kernel size of the convolution layer. Default 1.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.

forward(x: Tensor, mask: Tensor) → Tuple[Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The input boolean mask of shape [B, M], where it’s True if there is no padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].
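The effect of left_size and right_size can be pictured as a banded attention mask. The sketch below is plain Python rather than the library's implementation, and it assumes that each step may always attend to itself and that the window sizes count neighbours on each side; both points are assumptions.

```python
from typing import List


def truncated_window_mask(seq_len: int, left_size: int, right_size: int) -> List[List[bool]]:
    """Returns an [M, M] matrix where entry [i][j] is True if step i may attend to step j."""
    return [
        [i - left_size <= j <= i + right_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Print the band as rows of 1s (allowed) and 0s (blocked).
for row in truncated_window_mask(seq_len=5, left_size=2, right_size=1):
    print("".join("1" if allowed else "0" for allowed in row))
```

Setting right_size to 0 in this sketch gives a causal band, which is the usual choice for streaming recognition.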

training: bool

class speeq.models.encoders.VGGTransformerEncoder(in_features: int, n_layers: int, n_vgg_blocks: int, n_conv_layers_per_vgg_block: List[int], kernel_sizes_per_vgg_block: List[List[int]], n_channels_per_vgg_block: List[List[int]], vgg_pooling_kernel_size: List[int], d_model: int, ff_size: int, h: int, left_size: int, right_size: int, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements the VGGTransformer encoder as described in https://arxiv.org/abs/1910.12977

Args:

in_features (int): The input feature size.

n_layers (int): The number of transformer encoder layers with truncated self attention.

n_vgg_blocks (int): The number of VGG blocks to use.

n_conv_layers_per_vgg_block (List[int]): A list of integers that specifies the number of convolution layers in each block.

kernel_sizes_per_vgg_block (List[List[int]]): A list of lists that contains the kernel size for each layer in each block. The length of the outer list should match n_vgg_blocks, and each inner list should be the same length as the corresponding block’s number of layers.

n_channels_per_vgg_block (List[List[int]]): A list of lists that contains the number of channels for each convolution layer in each block. This argument should also have length equal to n_vgg_blocks, and each sublist should have length equal to the number of layers in the corresponding block.

vgg_pooling_kernel_size (List[int]): A list of integers that specifies the size of the max pooling layer in each block. The length of this list should be equal to n_vgg_blocks.

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of heads in the attention mechanism.

left_size (int): The size of the left window that each time step is allowed to look at.

right_size (int): The size of the right window that each time step is allowed to look at.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.

forward(x: Tensor, mask: Tensor) → Tuple[Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The input boolean mask of shape [B, M], where it’s True if there is no padding.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].
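Because the VGG arguments are nested lists whose lengths must agree, it can help to check a configuration before building the encoder. The helper below is a hypothetical sketch that only encodes the length constraints stated in the argument descriptions above; it is not part of speeq.

```python
from typing import List


def check_vgg_config(
    n_vgg_blocks: int,
    n_conv_layers_per_vgg_block: List[int],
    kernel_sizes_per_vgg_block: List[List[int]],
    n_channels_per_vgg_block: List[List[int]],
    vgg_pooling_kernel_size: List[int],
) -> None:
    """Raises ValueError if the per-block lists disagree with n_vgg_blocks or each other."""
    per_block = {
        "n_conv_layers_per_vgg_block": n_conv_layers_per_vgg_block,
        "kernel_sizes_per_vgg_block": kernel_sizes_per_vgg_block,
        "n_channels_per_vgg_block": n_channels_per_vgg_block,
        "vgg_pooling_kernel_size": vgg_pooling_kernel_size,
    }
    # Every outer list needs one entry per VGG block.
    for name, values in per_block.items():
        if len(values) != n_vgg_blocks:
            raise ValueError(f"{name} must have {n_vgg_blocks} entries, got {len(values)}")
    # Every inner list needs one entry per convolution layer of its block.
    for i, n_conv in enumerate(n_conv_layers_per_vgg_block):
        for name in ("kernel_sizes_per_vgg_block", "n_channels_per_vgg_block"):
            if len(per_block[name][i]) != n_conv:
                raise ValueError(f"{name}[{i}] must have {n_conv} entries")


# A consistent two-block configuration passes silently.
check_vgg_config(2, [2, 2], [[3, 3], [3, 3]], [[32, 32], [64, 64]], [2, 2])
```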

training: bool

class speeq.models.encoders.Wav2LetterEncoder(in_features: int, n_conv_layers: int, layers_kernel_size: int, layers_channels_size: int, pre_conv_stride: int, pre_conv_kernel_size: int, post_conv_channels_size: int, post_conv_kernel_size: int, p_dropout: float, wav_kernel_size: Optional[int] = None, wav_stride: Optional[int] = None)[source]

Bases: Module

Implements the Wav2Letter encoder proposed in https://arxiv.org/abs/1609.03193

Args:

in_features (int): The input/speech feature size.

n_conv_layers (int): The number of convolution layers.

layers_kernel_size (int): The kernel size of the convolution layers.

layers_channels_size (int): The number of output channels of each convolution layer.

pre_conv_stride (int): The stride of the prenet convolution layer.

pre_conv_kernel_size (int): The kernel size of the prenet convolution layer.

post_conv_channels_size (int): The number of output channels of the postnet convolution layer.

post_conv_kernel_size (int): The kernel size of the postnet convolution layer.

p_dropout (float): The dropout rate.

wav_kernel_size (Optional[int]): The kernel size of the first layer that processes the wav samples directly if wav is modeled. Default None.

wav_stride (Optional[int]): The stride size of the first layer that processes the wav samples directly if wav is modeled. Default None.

forward(x: Tensor, mask: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor][source]

Passes the input x through the encoder layers.

Args:

x (Tensor): The input speech tensor of shape [B, M, d]

mask (Tensor): The boolean input mask of shape [B, M], which is True at positions that hold data and False at padded positions.

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the encoded speech of shape [B, M, F] and the second element is the lengths of shape [B].

training: bool

Layers

This module contains implementations of various atomic layers used in neural network models.

Layers:

PackedRNN: RNN layer with support for PackedSequence.
PackedLSTM: LSTM layer with support for PackedSequence.
PackedGRU: GRU layer with support for PackedSequence.
PredModule: A simple feedforward prediction module.
ConvPredModule: A convolutional prediction module.
FeedForwardModule: A transformer feedforward module.
AddAndNorm: A layer that performs residual connection and layer normalization.
MultiHeadAtt: Multi-Head Attention layer.
MaskedMultiHeadAtt: Masked Multi-Head Attention layer.
TransformerEncLayer: Transformer Encoder layer.
RowConv1D: A 1D convolution layer that convolves each row separately.
Conv1DLayers: A stack of 1D convolutional layers.
GlobalMulAttention: Global Multiplicative Attention layer.
ConformerFeedForward: A feedforward module used in Conformer model.
ConformerConvModule: A convolutional module used in Conformer model.
ConformerRelativeMHSA: Conformer Relative Multi-Head Self-Attention layer.
ConformerBlock: Conformer block.
ConformerPreNet: A pre-processing network used in Conformer model.
JasperSubBlock: Jasper Sub-block.
JasperResidual: Jasper Residual module.
JasperBlock: Jasper Block.
JasperBlocks: A stack of Jasper Blocks.
LocAwareGlobalAddAttention: Location-Aware Global Additive Attention layer.
MultiHeadAtt2d: 2D Multi-Head Attention layer.
SpeechTransformerEncLayer: Speech Transformer Encoder layer.
TransformerDecLayer: Transformer Decoder layer.
PositionalEmbedding: Positional embedding layer.
GroupsShuffle: Group Shuffle layer.
QuartzSubBlock: Quartz Sub-block.
QuartzBlock: Quartz Block.
QuartzBlocks: A stack of Quartz Blocks.
Scaling1d: A learnable scaling layer.
SqueezeformerConvModule: A convolutional module used in Squeezeformer model.
SqueezeformerRelativeMHSA: Squeezeformer Relative Multi-Head Self-Attention layer.
SqueezeformerFeedForward: A feedforward module used in Squeezeformer model.
SqueezeformerBlock: Squeezeformer block.
SqueezeAndExcit1D: Squeeze-and-Excitation layer for 1D inputs.
ContextNetConvLayer: ContextNet convolution layer.
ContextNetResidual: ContextNet residual module.
ContextNetBlock: ContextNet block.
CausalVGGBlock: Causal VGG Block.
TruncatedSelfAttention: Truncated self attention.
TransformerEncLayerWithAttTruncation: Transformer Encoder layer with truncated self attention.
VGGTransformerPreNet: VGG Transformer prenet.
TruncatedRelativeMHSA: Truncated relative multi-head self attention.
TransformerTransducerLayer: Transformer transducer layer with truncated relative multi-head self attention.

class speeq.models.layers.AddAndNorm(d_model: int)[source]

Bases: Module

Implements the Add and Norm module of the transformer model as described in the paper https://arxiv.org/abs/1706.03762

Args:

d_model (int): The model dimensionality.

forward(x: Tensor, sub_x: Tensor) → Tensor[source]

Takes the output tensor x from the last layer and the output tensor sub_x from the sub-layer, adds them, and normalizes the sum using layer normalization.

Args:

x (Tensor): The output tensor of the last layer with shape [B, M, d].

sub_x (Tensor): The output tensor of the sub-layer with shape [B, M, d].

Returns:

Tensor: The result tensor obtained after normalizing the sum of the inputs with shape [B, M, d].
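As a rough illustration of the add-then-normalize step, the sketch below applies it to a single feature vector of size d in plain Python. It omits the learnable scale and bias that a layer normalization module normally carries, and the eps value of 1e-5 is an assumption.

```python
import math
from typing import List


def add_and_norm(x: List[float], sub_x: List[float], eps: float = 1e-5) -> List[float]:
    """Adds two feature vectors of size d and layer-normalizes the sum."""
    summed = [a + b for a, b in zip(x, sub_x)]
    mean = sum(summed) / len(summed)
    # Layer normalization uses the biased (population) variance over the features.
    var = sum((v - mean) ** 2 for v in summed) / len(summed)
    return [(v - mean) / math.sqrt(var + eps) for v in summed]


out = add_and_norm([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5])
print(out)
```

The result has approximately zero mean and unit variance across the feature dimension, regardless of the scale of the two inputs.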

training: bool

class speeq.models.layers.CausalVGGBlock(n_conv: int, in_channels: int, out_channels: List[int], kernel_sizes: List[int], pooling_kernel_size: int)[source]

Bases: Module

Implements a causal VGG block consisting of causal 2D convolution layers, as described in the paper https://arxiv.org/pdf/1910.12977.pdf.

Args:

n_conv (int): Specifies the number of convolution layers.

in_channels (int): Specifies the number of input channels.

out_channels (List[int]): A list of integers that specifies the number of output channels of each convolution layer.

kernel_sizes (List[int]): A list of integers that specifies the kernel size of each convolution layer.

pooling_kernel_size (int): Specifies the kernel size of the pooling layer.

forward(x: Tensor, lengths: Tensor) → Tuple[Tensor, Tensor][source]

Passes the input x of shape [B, C, M, f] through the network.

Args:

x (Tensor): The input tensor of shape [B, C, M, f].

lengths (Tensor): The lengths tensor of shape [B].

Returns:

Tuple[Tensor, Tensor]: A tuple where the first element is the result of shape [B, C’, M’, f’] and the second element is the updated lengths of shape [B].

training: bool

class speeq.models.layers.ConformerBlock(d_model: int, ff_expansion_factor: int, h: int, kernel_size: int, p_dropout: float, res_scaling: float = 0.5)[source]

Bases: Module

Implements the conformer block described in https://arxiv.org/abs/2005.08100

Args:

d_model (int): The model dimension.

ff_expansion_factor (int): The expansion factor of the linear layer.

h (int): The number of heads.

kernel_size (int): The kernel size of depth-wise convolution.

p_dropout (float): The dropout rate.

res_scaling (float): The multiplier for residual connection.

forward(x: Tensor, mask: Union[None, Tensor] = None) → Tensor[source]

Passes the input to the conformer block.

Args:

x (torch.Tensor): The input tensor of shape [B, M, d].

mask (Tensor, optional): Boolean tensor of shape [B, M], where False marks padding. If None is provided, no masking is applied. Default is None.

Returns:

Tensor: The output tensor of the same shape as the input tensor x.
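The role of res_scaling can be sketched as follows. This plain-Python outline follows the block ordering of the Conformer paper, where the two feed-forward modules contribute half-step residuals; applying res_scaling only to the feed-forward modules is an assumption here, the final layer normalization is omitted, and the sub-modules are passed in as hypothetical callables.

```python
from typing import Callable, List

Vector = List[float]
SubModule = Callable[[Vector], Vector]


def scaled_residual(x: Vector, module: SubModule, scale: float = 1.0) -> Vector:
    """Computes x + scale * module(x) element-wise."""
    return [xi + scale * oi for xi, oi in zip(x, module(x))]


def conformer_block_sketch(
    x: Vector,
    ff1: SubModule,
    mhsa: SubModule,
    conv: SubModule,
    ff2: SubModule,
    res_scaling: float = 0.5,
) -> Vector:
    """Feed-forward, self-attention, convolution, feed-forward, each with a residual."""
    x = scaled_residual(x, ff1, res_scaling)  # half-step feed-forward residual
    x = scaled_residual(x, mhsa)              # full residual around self-attention
    x = scaled_residual(x, conv)              # full residual around the convolution module
    x = scaled_residual(x, ff2, res_scaling)  # second half-step feed-forward residual
    return x                                  # the closing layer norm is left out


double = lambda v: [2.0 * e for e in v]   # stand-in for a feed-forward module
zero = lambda v: [0.0 for _ in v]         # stand-in that contributes nothing
print(conformer_block_sketch([1.0], double, zero, zero, double))
```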

training: bool

class speeq.models.layers.ConformerConvModule(d_model: int, kernel_size: int, p_dropout: float)[source]

Bases: Module

Implements the conformer convolution module as described in https://arxiv.org/abs/2005.08100

Args:

d_model (int): The model dimension.

kernel_size (int): The depth-wise convolution kernel size.

p_dropout (float): The dropout rate.

forward(x: Tensor) → Tensor[source]

Passes the input tensor through the Conformer Convolutional Module.

Args:: x (Tensor): Input tensor of shape [B, M, d].
Returns:: Tensor: Result tensor of shape [B, M, d].

training: bool

class speeq.models.layers.ConformerFeedForward(d_model: int, expansion_factor: int, p_dropout: float)[source]

Bases: Module

Implements the feed-forward module used in Conformer models as described in https://arxiv.org/abs/2005.08100

Args:

d_model (int): The input feature dimensionality.

expansion_factor (int): The expansion factor used by the linear layer.

p_dropout (float): The dropout rate.

forward(x: Tensor) → Tensor[source]

Passes the input x through the conformer feed-forward module.

Args:: x (Tensor): The input tensor of shape [B, M, d].
Returns:: Tensor: The output tensor of shape [B, M, d].

training: bool

class speeq.models.layers.ConformerPreNet(in_features: int, kernel_size: Union[int, List[int]], stride: Union[int, List[int]], n_conv_layers: int, d_model: int, p_dropout: float, groups: Union[int, List[int]] = 1)[source]

Bases: Module

Implements the pre-net that precedes the Conformer blocks and contains the subsampling layers, as described in https://arxiv.org/abs/2005.08100

Args:

in_features (int): The input/speech feature size.

kernel_size (Union[int, List[int]]): The kernel size of the subsampling layer.

stride (Union[int, List[int]]): The stride of the subsampling layer.

n_conv_layers (int): The number of convolutional layers.

d_model (int): The model dimension.

p_dropout (float): The dropout rate.

groups (Union[int, List[int]]): The convolution groups size. Default 1.

forward(x: Tensor, lengths: Tensor) → Tuple[Tensor, Tensor][source]

Passes the input x through the pre-net that contains the subsampling convolution layers.

Args:

x (Tensor): The input tensor of shape [B, M, d].

lengths (Tensor): A tensor of shape [B] containing the lengths of each sequence in x before subsampling.

Returns:

Tuple[Tensor, Tensor]: A tuple containing two tensors. The first tensor is the output of the pre-conformer block of shape [B, N, d]. The second tensor is a tensor of shape [B] containing the lengths of each sequence in the output tensor after subsampling.
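How the lengths shrink under subsampling can be estimated with the standard 1D convolution output-length formula. The sketch below assumes no padding and a dilation of 1, which may differ from what the layer actually uses; the function names are hypothetical.

```python
from typing import List


def conv_output_length(length: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Standard 1D convolution output length with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


def subsampled_lengths(lengths: List[int], kernel_sizes: List[int], strides: List[int]) -> List[int]:
    """Applies the length formula once per convolution layer, in order."""
    result = list(lengths)
    for kernel_size, stride in zip(kernel_sizes, strides):
        result = [conv_output_length(n, kernel_size, stride) for n in result]
    return result


# Two layers with kernel size 3 and stride 2 reduce the time axis by roughly 4x.
print(subsampled_lengths([100, 73], kernel_sizes=[3, 3], strides=[2, 2]))
```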

training: bool

class speeq.models.layers.ConformerRelativeMHSA(d_model: int, h: int, p_dropout: float, masking_value: int = -1000000000000000.0)[source]

Bases: MultiHeadAtt

Multi-Head Self-Attention module with relative positional encoding, based on the paper https://arxiv.org/abs/2005.08100

Args:

d_model (int): The input and output feature dimension.

h (int): The number of attention heads.

p_dropout (float): The dropout rate.

masking_value (int): The masking value used for padding. Default -1e15.

forward(x: Tensor, mask: Union[None, Tensor] = None) → Tensor[source]

Performs Multi-Head Self-Attention operation with relative positional encoding on input tensor x.

Args:

x (Tensor): Input tensor of shape [B, M, d].

mask (Tensor, optional): Boolean tensor of shape [B, M], where False marks padding. If None is provided, no masking is applied. Default is None.

Returns:

Tensor: Result tensor of shape [B, M, d].

training: bool

class speeq.models.layers.ContextNetBlock(n_layers: int, in_channels: int, out_channels: int, kernel_size: int, reduction_factor: int, add_residual: bool, last_layer_stride: int = 1)[source]

Bases: Module

Implements the convolution block of the ContextNet model proposed in https://arxiv.org/abs/2005.03191

Args:

n_layers (int): The number of convolutional layers in the block.

in_channels (int): The number of channels in the input.

out_channels (int): The number of output channels.

kernel_size (int): The convolution kernel size.

reduction_factor (int): The reduction factor for the Squeeze-and-excitation module.

add_residual (bool): A flag indicating whether to include a residual connection.

last_layer_stride (int): The stride size of the last convolutional layer.

forward(x: Tensor, lengths: Tensor) → Tuple[Tensor, Tensor][source]

Passes the input through the convolution block of the ContextNet.

Args:

x (Tensor): The input tensor of shape [B, in_channels, M].

lengths (Tensor): The tensor of shape [B] containing the lengths of each sequence in x.

Returns:

Tuple[Tensor, Tensor]: A tuple of the output tensor of shape [B, out_channels, N] and the updated lengths tensor of shape [B], both obtained after passing through the convolution block.

training: bool

class speeq.models.layers.ContextNetConvLayer(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1)[source]

Bases: Module

Implements the convolution layer of the ContextNet model proposed in https://arxiv.org/abs/2005.03191. This layer applies a convolution operation followed by batch normalization and an activation function.

Args:

in_channels (int): The number of input channels.

out_channels (int): The number of output channels.

kernel_size (int): The convolution layer kernel size.

stride (int): The stride of the convolution layer. Default 1.

forward(x: Tensor, lengths: Tensor) → Tuple[Tensor, Tensor][source]

Passes the input tensor to the ContextNet convolution layer and returns a tuple of the output tensor and the updated lengths tensor.

Args:

x (Tensor): The input tensor of shape [B, in_channels, M].

lengths (Tensor): The tensor of shape [B] containing the lengths of each sequence in x.

Returns:

Tuple[Tensor, Tensor]: A tuple of two tensors. The first tensor is the output tensor after applying convolution of shape [B, out_channels, N], and the second tensor is the updated lengths tensor of shape [B], after applying the convolution layer.

training: bool

class speeq.models.layers.ContextNetResidual(in_channels: int, out_channels: int, kernel_size: int, stride: int)[source]

Bases: Module

Implements the residual branch of the ContextNet block as proposed in https://arxiv.org/abs/2005.03191

Args:

in_channels (int): The number of input channels.

out_channels (int): The number of output channels.

kernel_size (int): The convolution kernel size.

stride (int): The convolution stride size.

forward(x: Tensor, out: Tensor) → Tensor[source]

Args:

x (Tensor): The input tensor of shape [B, d, M].

out (Tensor): The output tensor of the previous block of shape [B, d/s, M] where s is the stride value. If the block has no stride, s is set to 1.

Returns:: Tensor: The output tensor after applying the residual connection of shape [B, d, M].

training: bool

class speeq.models.layers.Conv1DLayers(in_size: int, out_size: Union[List[int], int], kernel_size: Union[List[int], int], stride: Union[List[int], int], n_layers: int, p_dropout: float, groups: Union[List[int], int] = 1, activation: Optional[Module] = None)[source]

Bases: Module

Implements a stack of Conv1d layers.

Args:

in_size (int): The input feature size.

out_size (Union[List[int], int]): The output feature size(s) of each layer. If a list is passed, it has to be of length equal to n_layers.

kernel_size (Union[List[int], int]): The kernel size(s) of the Conv1d layers. If a list is passed, it has to be of length equal to n_layers.

stride (Union[List[int], int]): The stride size(s) of the Conv1d layers. If a list is passed, it has to be of length equal to n_layers.

n_layers (int): The number of Conv1d layers to stack.

p_dropout (float): The dropout rate.

groups (Union[List[int], int]): The groups size(s) of the Conv1d layers. If a list is passed, it has to be of length equal to n_layers. Default 1.

activation (Module, optional): An optional activation module that is applied to the output of each convolution layer. Default None.

forward(x: Tensor, data_len: Tensor) → Tuple[Tensor, Tensor][source]

Passes the input tensor x through the Conv1D layers and returns the result.

Args:

x (Tensor): The input tensor of shape [B, M, in_size].

data_len (Tensor): A tensor of shape [B] containing the length of each sequence in x.

Returns:

Tuple[Tensor, Tensor]: A tuple containing the result tensor of shape [B, M, out_size] and a tensor of shape [B] containing the new length of each sequence after applying the conv layers.
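The rule that each argument may be either a single integer or a list of length n_layers can be sketched as below. This is a hypothetical plain-Python helper that resolves the arguments into one (in, out, kernel, stride) tuple per layer; chaining each layer's input size to the previous layer's output size is an assumption about how the stack is wired.

```python
from typing import List, Tuple, Union

IntOrList = Union[int, List[int]]


def per_layer(value: IntOrList, n_layers: int, name: str) -> List[int]:
    """Broadcasts an int to n_layers entries, or checks the length of a list."""
    if isinstance(value, int):
        return [value] * n_layers
    if len(value) != n_layers:
        raise ValueError(f"{name} must have {n_layers} entries, got {len(value)}")
    return list(value)


def layer_plan(
    in_size: int, out_size: IntOrList, kernel_size: IntOrList, stride: IntOrList, n_layers: int
) -> List[Tuple[int, int, int, int]]:
    """Returns (in_channels, out_channels, kernel_size, stride) for every layer."""
    outs = per_layer(out_size, n_layers, "out_size")
    kernels = per_layer(kernel_size, n_layers, "kernel_size")
    strides = per_layer(stride, n_layers, "stride")
    ins = [in_size] + outs[:-1]  # each layer consumes the previous layer's output
    return list(zip(ins, outs, kernels, strides))


print(layer_plan(in_size=80, out_size=[128, 256], kernel_size=3, stride=[1, 2], n_layers=2))
```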

training: bool

class speeq.models.layers.ConvPredModule(in_features: int, n_classes: int, activation: Module)[source]

Bases: Module

A prediction module that consists of a single Conv1d layer followed by a pre-defined activation function.

Args:

in_features (int): The input feature size.

n_classes (int): The number of classes to be produced.

activation (Module): The activation function to be used.

forward(x: Tensor) → Tensor[source]

Passes the input x through the layers of the module.

Args:: x (Tensor): The input tensor of shape [B, M, d]
Returns:: Tensor: The output tensor of shape [B, M, C] obtained after passing through the layers’ modules.

training: bool

class speeq.models.layers.FeedForwardModule(d_model: int, ff_size: int, p_dropout: float = 0.0)[source]

Bases: Module

Implements the feed-forward module of the transformer architecture as described in the paper https://arxiv.org/abs/1706.03762

Args:

d_model (int): The model dimensionality.

ff_size (int): The dimensionality of the inner layer.

p_dropout (float): The dropout rate. Default 0.0.

forward(x: Tensor) → Tensor[source]

Passes the input through the layer.

Args:: x (Tensor): The input tensor of shape [B, M, d]
Returns:: Tensor: The output tensor of shape [B, M, d] obtained after passing through the layer.

training: bool

class speeq.models.layers.GlobalMulAttention(enc_feat_size: int, dec_feat_size: int, scaling_factor: Union[float, int] = 1, mask_val: float = -1000000000000.0)[source]

Bases: Module

Implements the global multiplicative attention mechanism as described in https://arxiv.org/abs/1508.04025, using direct dot product for scoring.

Args:

enc_feat_size (int): The size of encoder features.

dec_feat_size (int): The size of decoder features.

scaling_factor (Union[float, int]): The scaling factor for numerical stability used inside the softmax. Default: 1.

mask_val (float): The masking value. Default -1e12.

forward(key: Tensor, query: Tensor, mask: Union[None, Tensor] = None) → Tensor[source]

Applies the global multiplicative attention mechanism to the input key and query.

Args:

key (Tensor): The key tensor of shape [B, M, enc_feat_size].

query (Tensor): The query tensor of shape [B, 1, dec_feat_size].

mask (Union[None, Tensor], optional): The boolean mask tensor of shape [B, M], where False marks padding. Default None.

Returns:

Tensor: The attention tensor of shape [B, enc_feat_size].
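For intuition, dot-product attention over a single batch item can be written in a few lines of plain Python. The sketch assumes that the key and query share the same feature size and that the scores are divided by scaling_factor before the softmax; the text above only states that the factor is used inside the softmax, so the division is an assumption.

```python
import math
from typing import List, Optional


def global_mul_attention(
    key: List[List[float]],
    query: List[float],
    mask: Optional[List[bool]] = None,
    scaling_factor: float = 1.0,
    mask_val: float = -1e12,
) -> List[float]:
    """key: [M, d] encoder states, query: [d] decoder state, mask: [M] (False = padding)."""
    # Dot-product score between the query and every encoder step.
    scores = [sum(k * q for k, q in zip(row, query)) / scaling_factor for row in key]
    if mask is not None:
        scores = [s if keep else mask_val for s, keep in zip(scores, mask)]
    # Numerically stable softmax over the M encoder steps.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    return [sum(w * row[i] for w, row in zip(weights, key)) for i in range(len(key[0]))]


context = global_mul_attention(
    key=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    query=[1.0, 1.0],
    mask=[True, True, False],
)
print(context)
```

The third encoder step is padding in this example, so it receives no weight even though its raw score would be the largest.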

training: bool

class speeq.models.layers.GroupsShuffle(groups: int)[source]

Bases: Module

Implements the group shuffle proposed in https://arxiv.org/abs/1707.01083

Args:

groups (int): The groups size.

forward(x: Tensor) → Tensor[source]

Applies the group shuffle on the input tensor x.

Args:

x (Tensor): The input tensor of shape [B, C, …].

Returns:

Tensor: The output tensor after applying group shuffle of shape [B, C, …].
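The channel reordering performed by a group shuffle can be shown on a flat list of channel indices. This plain-Python sketch follows the reshape, transpose, flatten recipe of the ShuffleNet paper and is not taken from the speeq source.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def groups_shuffle(channels: Sequence[T], groups: int) -> List[T]:
    """Interleaves channels so that each output group mixes all input groups."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("the number of channels must be divisible by groups")
    per_group = n // groups
    # View as [groups, per_group], transpose to [per_group, groups], then flatten.
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


print(groups_shuffle(list(range(6)), groups=2))
```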

training: bool

class speeq.models.layers.JasperBlock(num_sub_blocks: int, in_channels: int, out_channels: int, kernel_size: int, p_dropout: float)[source]

Bases: Module

Implements the main block of the Jasper model as described in https://arxiv.org/abs/1904.03288

Args:

num_sub_blocks (int): The number of subblocks, which is denoted as ‘R’ in the paper.

in_channels (int): The number of input channels.

out_channels (int): The number of output channels.

kernel_size (int): The kernel size of the convolutional layer.

p_dropout (float): The dropout rate.

forward(x: Tensor) → Tensor[source]

Passes the input x through the layer.

Args:: x (Tensor): The input tensor of shape [B, in_channels, M]

Returns:

Tensor: The result tensor of shape [B, out_channels, M]

training: bool

class speeq.models.layers.JasperBlocks(num_blocks: int, num_sub_blocks: int, in_channels: int, channel_inc: int, kernel_size: Union[int, List[int]], p_dropout: float)[source]

Bases: Module

Implements the series of Jasper blocks as described in https://arxiv.org/abs/1904.03288

Args:

num_blocks (int): The number of Jasper blocks (denoted as ‘B’ in the paper).

num_sub_blocks (int): The number of Jasper subblocks (denoted as ‘R’ in the paper).

in_channels (int): The number of input channels.

channel_inc (int): The amount by which the number of channels increases from one block to the next.

kernel_size (Union[int, List[int]]): The kernel size(s) of the convolution layer for each block.

p_dropout (float): The dropout rate.

forward(x: Tensor) → Tensor[source]

Passes the input tensor through the JasperBlocks layer.

Args:: x (Tensor): The input tensor of shape [B, in_channels, M].

Returns:

Tensor: The output tensor of shape [B, in_channels + channel_inc * num_blocks, M].
This tensor is the result of applying the JasperBlocks layer to the input tensor x.
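The output channel count in the shape above follows from each block adding channel_inc channels. A small sketch of the resulting per-block plan, assuming the increase is applied uniformly at every block (the helper name is hypothetical):

```python
from typing import List, Tuple


def jasper_channel_plan(in_channels: int, channel_inc: int, num_blocks: int) -> List[Tuple[int, int]]:
    """Returns (in_channels, out_channels) for every block in the stack."""
    plan = []
    current = in_channels
    for _ in range(num_blocks):
        plan.append((current, current + channel_inc))
        current += channel_inc
    return plan


plan = jasper_channel_plan(in_channels=256, channel_inc=128, num_blocks=3)
print(plan)
# The last block's output matches in_channels + channel_inc * num_blocks.
print(plan[-1][1] == 256 + 128 * 3)
```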

training: bool

class speeq.models.layers.JasperResidual(in_channels: int, out_channels: int)[source]

Bases: Module

Implements the residual connection module described in https://arxiv.org/abs/1904.03288

Args:

in_channels (int): The number of input channels.

out_channels (int): The number of output channels.

forward(x: Tensor) → Tensor[source]

Passes the input x through the residual branch.

Args:: x (Tensor): The input tensor of shape [B, in_channels, M]

Returns:

Tensor: The result tensor of shape [B, out_channels, M]

training: bool

class speeq.models.layers.JasperSubBlock(in_channels: int, out_channels: int, kernel_size: int, p_dropout: float, stride: int = 1, padding: Union[str, int] = 'same')[source]

Bases: Module

Implements the subblock of the Jasper model as described in https://arxiv.org/abs/1904.03288

Args:

in_channels (int): The number of input channels.

out_channels (int): The number of output channels.

kernel_size (int): The kernel size of the convolutional layer.

p_dropout (float): The dropout rate.

stride (int): The stride of the convolutional layer. Default is 1.

padding (Union[str, int]): The padding mode or size. Default is ‘same’.

forward(x: Tensor, residual: Union[None, Tensor] = None) → Tensor[source]

Passes the input through the layer.

Args:

x (Tensor): The input tensor of shape [B, d, M].

residual (Union[Tensor, None], optional): An optional tensor of shape [B, out_channels, M]. If not None, it is added element-wise to the output tensor. Defaults to None.

Returns:

Tensor: The output tensor of shape [B, out_channels, M].

training: bool

class speeq.models.layers.LocAwareGlobalAddAttention(enc_feat_size: int, dec_feat_size: int, kernel_size: int, activation: str, inv_temperature: Optional[Union[float, int]] = 1, mask_val: Optional[float] = -1000000000000.0)[source]

Bases: Module

Implements the location-aware global additive attention proposed in https://arxiv.org/abs/1506.07503

Args:

enc_feat_size (int): The encoder feature size.

dec_feat_size (int): The decoder feature size.

kernel_size (int): The size of the attention kernel.

activation (str): The activation function to use. Can be either ‘softmax’ or ‘sigmoid’.

inv_temperature (Union[float, int], optional): The value of the inverse temperature parameter. Default is 1.

mask_val (float, optional): The masking value for the attention weights. Default is -1e12.

forward(key: Tensor, query: Tensor, alpha: Tensor, mask: Union[None, Tensor] = None) → Tuple[Tensor, Tensor][source]

Computes the forward pass of the location-aware global additive attention mechanism.

Args:

key (Tensor): The encoder feature maps of shape [B, M_enc, enc_feat_size].

query (Tensor): The decoder feature maps of shape [B, 1, dec_feat_size].

alpha (Tensor): The previous attention weights of shape [B, 1, M_enc].

mask (Tensor, optional): The mask tensor of shape [B, M_enc], with zeros/False in the positions that should be masked. Default is None.

Returns:

A tuple of two tensors:

context (Tensor): The context tensor of shape [B, 1, M_dec].
attn_weights (Tensor): The attention weights tensor of shape [B, 1, M_enc].
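The two activation choices differ in how raw attention scores are turned into weights. The plain-Python sketch below contrasts a softmax with a normalized sigmoid, in the spirit of the smoothing described in the referenced paper. Mapping the ‘sigmoid’ option to a normalized sigmoid, and applying inv_temperature as a multiplier on the scores in both modes, are assumptions of this sketch.

```python
import math
from typing import List


def normalize_scores(scores: List[float], activation: str, inv_temperature: float = 1.0) -> List[float]:
    """Turns raw attention scores into weights that sum to 1."""
    scaled = [inv_temperature * s for s in scores]
    if activation == "softmax":
        peak = max(scaled)  # subtract the maximum for numerical stability
        unnormalized = [math.exp(s - peak) for s in scaled]
    elif activation == "sigmoid":
        unnormalized = [1.0 / (1.0 + math.exp(-s)) for s in scaled]
    else:
        raise ValueError("activation must be 'softmax' or 'sigmoid'")
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


scores = [1.0, 2.0, 3.0]
print(normalize_scores(scores, "softmax"))
print(normalize_scores(scores, "softmax", inv_temperature=2.0))
print(normalize_scores(scores, "sigmoid"))
```

In this sketch a larger inv_temperature sharpens the softmax distribution, while the sigmoid variant spreads the weight more evenly because its unnormalized values are bounded by 1.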

training: bool

class speeq.models.layers.MaskedMultiHeadAtt(d_model: int, h: int, masking_value: float = -1000000000000000.0)[source]

Bases: MultiHeadAtt

A multi-head attention module that performs masking to handle padded sequences. This implementation is based on the architecture described in https://arxiv.org/abs/1706.03762

Args:

d_model (int): The model dimensionality.

h (int): The number of heads in the attention mechanism.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.

forward(key: Tensor, query: Tensor, value: Tensor, key_mask: Optional[Tensor]) → Tensor[source]

Applies masked multi-head attention to the input.

Args:

key (Tensor): The key input tensor of shape [B, M, d].

query (Tensor): The query input tensor of shape [B, M, d].

value (Tensor): The value input tensor of shape [B, M, d].

key_mask (Union[Tensor, None]): The mask tensor of the key of shape [B, M], where True indicates that the corresponding key position contains data, not padding, and therefore should not be masked. If None, the function acts as a normal multi-head attention.

Returns:

Tensor: The attention result tensor of shape [B, M, d].

get_looking_ahead_mask(key_mask: Tensor) → Tensor[source]
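
get_looking_ahead_mask is undocumented here. One plausible reading, given as a sketch only, is that it combines the padding mask with a causal (lower-triangular) constraint so that each step attends to itself and earlier steps. The helper name causal_mask and this behaviour are assumptions, and plain lists stand in for tensors.

```python
def causal_mask(key_mask):
    """Return an M x M boolean matrix for one sequence.

    Entry [i][j] is True when query step i may attend to key step j,
    i.e. when j is not in the future and key position j is not padding.
    """
    m = len(key_mask)
    return [[(j <= i) and key_mask[j] for j in range(m)] for i in range(m)]

mask = causal_mask([True, True, False])
```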

training: bool

class speeq.models.layers.MultiHeadAtt(d_model: int, h: int, masking_value: int = -1000000000000000.0)[source]

Bases: Module

A module that implements the multi-head attention mechanism described in https://arxiv.org/abs/1706.03762.

Args:

d_model (int): The dimensionality of the model.

h (int): The number of heads to use in the attention mechanism.

masking_value (float, optional): The value used for masking. Defaults to -1e15.

forward(key: Tensor, query: Tensor, value: Tensor, key_mask: Union[None, Tensor] = None, query_mask: Union[None, Tensor] = None) → Tensor[source]

Passes the input through multi-head attention by computing a weighted sum of the values using queries and keys. The weights are computed as a softmax over the dot products of the queries and keys for each attention head. Optionally, attention can be masked using key and query masks.

Args:

key (Tensor): The key input tensor of shape [B, M, d]

query (Tensor): The query of shape [B, M, d]

value (Tensor): The value tensor of shape [B, M, d]

key_mask (Tensor, optional): A boolean tensor of shape [B, M] where True indicates that the corresponding key position contains data, not padding, and should not be masked

query_mask (Tensor, optional): A boolean tensor of shape [B, M] where True indicates that the corresponding query position contains data, not padding, and should not be masked

Returns:

Tensor: The tensor of shape [B, M, d] resulting from the multi-head attention computation.

perform_attention(key: Tensor, query: Tensor, value: Tensor, key_mask: Union[None, Tensor] = None, query_mask: Union[None, Tensor] = None) → Tensor[source]

Performs multi-head attention by computing a weighted sum of the values using queries and keys. The weights are computed as a softmax over the dot products of the queries and keys for each attention head. Optionally, attention can be masked using key and query masks.

Args:

key (Tensor): The key input tensor of shape [B, M, d]

query (Tensor): The query of shape [B, M, d]

value (Tensor): The value tensor of shape [B, M, d]

key_mask (Tensor, optional): A boolean tensor of shape [B, M] where True indicates that the corresponding key position contains data, not padding, and should not be masked

query_mask (Tensor, optional): A boolean tensor of shape [B, M] where True indicates that the corresponding query position contains data, not padding, and should not be masked

Returns:

Tensor: The tensor of shape [B, M, d] resulting from the multi-head attention computation.
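
The weighted-sum computation described above can be sketched for a single head and a single sequence. This is an illustration with plain Python lists, not the library's code; the 1/sqrt(d) scaling comes from the cited paper and is assumed to apply here, and the function names are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(query, key, value):
    """query, key, value: lists of M vectors, each a list of floats."""
    d = len(key[0])
    out = []
    for q in query:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([
            sum(w * v[c] for w, v in zip(weights, value))
            for c in range(len(value[0]))
        ])
    return out

key = [[1.0, 0.0], [1.0, 0.0]]
value = [[0.0, 2.0], [4.0, 6.0]]
result = single_head_attention([[1.0, 1.0]], key, value)
```

Because both keys are identical, the two values receive equal weight and the result is their average.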

training: bool

class speeq.models.layers.MultiHeadAtt2d(d_model: int, h: int, out_channels: int, kernel_size: int)[source]

Bases: MultiHeadAtt

Implements the 2-dimensional multi-head self-attention proposed in https://ieeexplore.ieee.org/document/8462506

Args:

d_model (int): The input feature dimensionality.

h (int): The number of attention heads.

out_channels (int): The number of output channels of the convolution layer.

kernel_size (int): The size of the convolutional kernel to apply.

forward(key: Tensor, query: Tensor, value: Tensor, mask: Union[None, Tensor] = None) → Tensor[source]

Applies both time-domain and frequency-domain multi-head self-attention on the input.

Args:

key (Tensor): A tensor of shape [B, M, d].

query (Tensor): A tensor of shape [B, M, d].

value (Tensor): A tensor of shape [B, M, d].

mask (Tensor, optional): Boolean tensor of shape [B, M], where False for padding. If None is provided, no masking is applied. Default is None.

Returns:: Tensor: The result tensor of shape [B, M, d].

perform_frequency_attention(key: Tensor, query: Tensor, value: Tensor) → Tensor[source]

Applies frequency-domain multi-head self-attention.

Args:

key (Tensor): A tensor of shape [B, M, d].

query (Tensor): A tensor of shape [B, M, d].

value (Tensor): A tensor of shape [B, M, d].

Returns:

Tensor: A tensor of shape [B, M, d], representing the result after performing the attention mechanism on the frequency domain.

training: bool

class speeq.models.layers.PackedGRU(input_size: int, hidden_size: int, batch_first=True, enforce_sorted=False, bidirectional=False)[source]

Bases: PackedRNN

training: bool

class speeq.models.layers.PackedLSTM(input_size: int, hidden_size: int, batch_first=True, enforce_sorted=False, bidirectional=False)[source]

Bases: PackedRNN

training: bool

class speeq.models.layers.PackedRNN(input_size: int, hidden_size: int, batch_first=True, enforce_sorted=False, bidirectional=False)[source]

Bases: Module

A packed RNN module that combines PyTorch's built-in RNN with PyTorch's sequence packing and padding utilities.

Args:

input_size (int): The RNN input size.

hidden_size (int): The RNN hidden size.

batch_first (bool): Whether the batch is in the first dimension. Defaults to True.

enforce_sorted (bool): Whether the inputs are sorted by length. Defaults to False.

bidirectional (bool): Whether the RNN is bidirectional. Defaults to False.

forward(x: Tensor, lens: Union[List[int], Tensor], h: Union[None, Tensor] = None) → Tuple[Tensor, Tensor, Tensor][source]

Passes the input tensor x of shape [B, M, d], along with tensor or list of lengths lens of shape [B] representing the length of each sequence without padding, through the layer. An optional tensor h representing the last hidden state can also be provided.

Args:

x (Tensor): The input sequence tensor of shape [B, M, d].

lens (Union[List[int], Tensor]): The length of each sequence without padding, of shape [B].

h (Tensor, optional): The last hidden state if there’s any. Defaults to None.

Returns:

Tuple[Tensor, Tensor, Tensor]: A tuple of three tensors containing the output sequence of shape [B, max(lens), hidden_size], the last hidden state of shape [D, B, hidden_size], and the new lengths.
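
The lens argument carries the same information as the boolean masks used by the attention layers in this module (True for data, False for padding). As a minimal sketch of that relationship, using plain lists and a hypothetical helper name:

```python
def lengths_to_mask(lens):
    """Build a [B, max(lens)] boolean mask from per-sequence lengths."""
    max_len = max(lens)
    # Position t of sequence b holds data when t is below that sequence's length.
    return [[t < n for t in range(max_len)] for n in lens]

mask = lengths_to_mask([3, 1])
```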

training: bool

class speeq.models.layers.PositionalEmbedding(vocab_size: int, embed_dim: int)[source]

Bases: Module

Implements the positional embedding proposed in https://arxiv.org/abs/1706.03762

output = positional_encoding + Embedding(input)

Args:

vocab_size (int): The vocabulary size.

embed_dim (int): The embedding size.

forward(x: Tensor) → Tensor[source]

Applies the positional embedding to the input tensor.

Args:

x (Tensor): The input tensor of shape [B, M].

Returns:

Tensor: The output tensor of shape [B, M, d].
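
The positional_encoding term can be illustrated with the sinusoidal formulation from the cited paper. This sketch assumes that formulation is the one used here; the function names are hypothetical and plain lists stand in for tensors.

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal encoding: sine on even feature indices, cosine on odd ones."""
    pe = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_positions(embedded):
    """embedded: list of M embedding vectors for one sequence."""
    pe = positional_encoding(len(embedded), len(embedded[0]))
    return [[e + p for e, p in zip(erow, prow)] for erow, prow in zip(embedded, pe)]

pe = positional_encoding(2, 4)
shifted = add_positions([[1.0, 1.0, 1.0, 1.0]])
```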

training: bool

class speeq.models.layers.PredModule(in_features: int, n_classes: int, activation: Module)[source]

Bases: Module

A prediction module that comprises a single feed-forward layer followed by a given activation function.

Args:

in_features (int): The input feature size.

n_classes (int): The number of classes to be produced.

activation (Module): The activation function to be used.

forward(x: Tensor) → Tensor[source]

Passes the input x of shape [B, M, d] through the layer's modules.

Args:: x (Tensor): The input tensor of shape [B, M, d]
Returns:: Tensor: The output tensor of shape [B, M, C] obtained after passing through the layers’ modules.

training: bool

class speeq.models.layers.QuartzBlock(num_sub_blocks: int, in_channels: int, out_channels: int, kernel_size: int, groups: int, p_dropout: float)[source]

Bases: JasperBlock

Implements the main block of the QuartzNet model as described in https://arxiv.org/abs/1910.10261

Args:

num_sub_blocks (int): Number of subblocks, denoted as ‘R’ in the paper.

in_channels (int): Number of input channels of the convolution layer.

out_channels (int): Number of output channels of the convolution layer.

kernel_size (int): Convolution layer’s kernel size.

groups (int): Group size for the convolution layer.

p_dropout (float): Dropout rate.

training: bool

class speeq.models.layers.QuartzBlocks(num_blocks: int, block_repetition: int, num_sub_blocks: int, in_channels: int, channels_size: List[int], kernel_size: Union[int, List[int]], groups: int, p_dropout: float)[source]

Bases: JasperBlocks

Implements the series of blocks in the QuartzNet model, as described in https://arxiv.org/abs/1910.10261

Args:

num_blocks (int): The number of QuartzNet blocks, denoted as ‘B’ in the paper.

block_repetition (int): The number of times to repeat each block, denoted as ‘S’ in the paper.

num_sub_blocks (int): The number of QuartzNet subblocks, denoted as ‘R’ in the paper.

in_channels (int): The number of channels in the input.

channels_size (List[int]): A list of integers representing the number of output channels for each block.

kernel_size (Union[int, List[int]]): An integer or a list of integers representing the kernel size(s) for each block’s convolutional layer.

groups (int): The group size.

p_dropout (float): The dropout rate.

training: bool

class speeq.models.layers.QuartzSubBlock(in_channels: int, out_channels: int, kernel_size: int, p_dropout: float, groups: int, stride: int = 1, padding: Union[str, int] = 'same')[source]

Bases: JasperSubBlock

Implements the subblock module of QuartzNet as described in https://arxiv.org/abs/1910.10261

Args:

in_channels (int): The number of channels of the input.

out_channels (int): The number of channels of the output.

kernel_size (int): The kernel size of the convolution layer.

p_dropout (float): The dropout rate.

groups (int): The number of groups in the convolution layer.

stride (int): The stride of the convolution layer. Default is 1.

padding (Union[str, int]): The padding mode or size. Default is ‘same’.

forward(x: Tensor, residual: Union[None, Tensor] = None) → Tensor[source]

Applies the QuartzNet subblock module to the input tensor x and an optional residual tensor.

Args:

x (Tensor): The input tensor of shape [B, in_channels, M].

residual (Tensor, optional): The residual tensor of shape [B, out_channels, M]. Default is None.

Returns:: Tensor: The output tensor of shape [B, out_channels, M].

training: bool

class speeq.models.layers.RowConv1D(tau: int, feat_size: int)[source]

Bases: Module

Implements the row convolution module proposed in https://arxiv.org/abs/1512.02595

Args:

tau (int): The size of future context.

feat_size (int): The size of the input feature.

forward(x: Tensor) → Tensor[source]

Passes the input tensor x through the row convolution layer.

Args:: x (Tensor): The input tensor of shape [B, M, feat_size].
Returns:: Tensor: The result tensor of the same shape [B, M, feat_size].
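
Row convolution mixes each frame with a small number of future frames, feature by feature. The sketch below assumes tau + 1 weight vectors (the current frame plus tau future frames) and zero padding past the end of the sequence; the function name and the weight layout are hypothetical, and a single sequence is shown with plain lists.

```python
def row_conv(x, weights):
    """x: list of M frames, each a list of d features.
    weights: list of tau + 1 weight vectors, each of length d."""
    m, d = len(x), len(x[0])
    out = []
    for t in range(m):
        frame = [0.0] * d
        for j, w in enumerate(weights):
            if t + j < m:  # frames past the end count as zeros
                for c in range(d):
                    frame[c] += w[c] * x[t + j][c]
        out.append(frame)
    return out

x = [[1.0], [2.0], [3.0]]
y = row_conv(x, weights=[[1.0], [0.5]])  # tau = 1
```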

training: bool

class speeq.models.layers.Scaling1d(d_model: int)[source]

Bases: Module

Implements the scaling layer proposed in https://arxiv.org/abs/2206.00888

Args:

d_model (int): The model dimension.

forward(x: Tensor) → Tensor[source]

Scales the input x.

Args:: x (Tensor): The input tensor of shape [B, M, d].
Returns:: Tensor: The scaled and shifted tensor of shape [B, M, d].
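
A scale-and-shift layer of this kind can be sketched as a per-feature affine transform. The parameter names gamma and beta are assumptions for illustration, and plain lists stand in for one [M, d] sequence.

```python
def scale_and_shift(x, gamma, beta):
    """Apply y = gamma * x + beta independently to each feature."""
    return [[g * v + b for v, g, b in zip(frame, gamma, beta)] for frame in x]

y = scale_and_shift([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```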

training: bool

class speeq.models.layers.SpeechTransformerDecLayer(d_model: int, ff_size: int, h: int, masking_value: int = -1000000000000000.0)[source]

Bases: TransformerDecLayer

Implements a single decoder layer of the speech transformer as described in https://ieeexplore.ieee.org/document/8462506

Args:

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of attention heads.

masking_value (int): The masking value. Default -1e15

forward(enc_out: Tensor, enc_mask: Optional[Tensor], dec_inp: Tensor, dec_mask: Optional[Tensor]) → Tensor[source]

Applies a single decoder layer of speech transformer to the input.

Args:

enc_out (Tensor): The output of the encoder. Its shape is [B, M_enc, d].

enc_mask (Tensor, optional): The mask tensor for the encoder output. Its shape is [B, M_enc], where it is False for the padding positions.

dec_inp (Tensor): The input to the decoder layer. Its shape is [B, M_dec, d_model].

dec_mask (Tensor, optional): The mask tensor for the decoder input. Its shape is [B, M_dec], where it is False for the padding.

Returns:

The output of the decoder layer. Its shape is [B, M_dec, d_model].

training: bool

class speeq.models.layers.SpeechTransformerEncLayer(d_model: int, ff_size: int, h: int, out_channels: int, kernel_size: int)[source]

Bases: TransformerEncLayer

Implements a single encoder layer of the speech transformer as described in https://ieeexplore.ieee.org/document/8462506

Args:

d_model (int): The model dimensionality.

ff_size (int): The dimensionality of the inner layer of the feed-forward module.

h (int): The number of attention heads.

out_channels (int): The number of output channels of the convolution layer.

kernel_size (int): The kernel size of the convolutional layers.

forward(x: Tensor, mask: Union[None, Tensor] = None) → Tensor[source]

Passes the input tensor x through a single encoder layer of the speech transformer.

Args:

x (Tensor): The input tensor of shape [B, M, d].

mask (Tensor, optional): The mask tensor of shape [B, M], or None if no mask is needed. Default None.

Returns:

Tensor: The output tensor of shape [B, M, d].

training: bool

class speeq.models.layers.SqueezeAndExcit1D(in_feature: int, reduction_factor: int)[source]

Bases: Module

Implements the squeeze and excite module proposed in https://arxiv.org/abs/1709.01507 and used in https://arxiv.org/abs/2005.03191

Args:

in_feature (int): The number of channels or feature size.

reduction_factor (int): The feature reduction size.

forward(x: Tensor, mask: Tensor)[source]

Applies the squeeze and excite operation to the input tensor.

Args:

x (Tensor): The input tensor of shape [B, d, M].

mask (Tensor): The masking tensor of shape [B, M].

Returns:: Tensor: The output tensor of shape [B, d, M] after applying the squeeze and excite operation.
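
The mask matters mainly in the squeeze step, where each channel is averaged over time. A minimal sketch, assuming padded positions are excluded from that average (the helper name is hypothetical and one [d, M] example is shown with plain lists):

```python
def masked_squeeze(x, mask):
    """x: list of d channels, each a list of M values.
    mask: list of M bools, True for data and False for padding."""
    n_valid = sum(mask)
    # Average each channel over the valid time steps only.
    return [
        sum(v for v, keep in zip(channel, mask) if keep) / n_valid
        for channel in x
    ]

pooled = masked_squeeze([[1.0, 3.0, 100.0], [2.0, 4.0, 100.0]], [True, True, False])
```

The pooled vector would then pass through the bottleneck layers and a sigmoid gate that rescales each channel of x.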

training: bool

class speeq.models.layers.SqueezeformerBlock(d_model: int, ff_expansion_factor: int, h: int, kernel_size: int, p_dropout: float, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements the Squeezeformer block described in https://arxiv.org/abs/2206.00888

Args:

d_model (int): The model dimension.

ff_expansion_factor (int): The linear layer’s expansion factor.

h (int): The number of attention heads.

kernel_size (int): The kernel size of the depth-wise convolution layer.

p_dropout (float): The dropout rate.

masking_value (int): The masking value. Default -1e15

forward(x: Tensor, mask: Union[None, Tensor] = None) → Tensor[source]

Forward pass of the Squeezeformer block.

Args:

x (Tensor): The input tensor of shape [B, M, d].

mask (Optional[Tensor]): The optional mask tensor of shape [B, M]. Default None.

Returns:

Tensor: The output tensor of shape [B, M, d].

training: bool

class speeq.models.layers.SqueezeformerConvModule(d_model: int, kernel_size: int, p_dropout: float)[source]

Bases: ConformerConvModule

Implements the conformer convolution module with the modification as described in https://arxiv.org/abs/2206.00888

Args:

d_model (int): The model dimension.

kernel_size (int): The size of the depth-wise convolution kernel.

p_dropout (float): The dropout rate.

forward(x: Tensor) → Tensor[source]

Passes the input x through the layers of the SqueezeformerConvModule.

Args:

x (torch.Tensor): A tensor of shape [B, M, d].

Returns:: Tensor: The result tensor of shape [B, M, d]

training: bool

class speeq.models.layers.SqueezeformerFeedForward(d_model: int, expansion_factor: int, p_dropout: float)[source]

Bases: ConformerFeedForward

Implements the conformer feed-forward module with the modifications introduced in https://arxiv.org/abs/2206.00888

Args:

d_model (int): The model dimension.

expansion_factor (int): The expansion factor of the linear layer.

p_dropout (float): The dropout rate.

forward(x: Tensor) → Tensor[source]

Passes the input to the feed-forward layers

Args:: x (Tensor): Input tensor of shape [B, M, d].
Returns:: Tensor: Output tensor of shape [B, M, d].

training: bool

class speeq.models.layers.SqueezeformerRelativeMHSA(d_model: int, h: int, p_dropout: float, masking_value: int = -1000000000000000.0)[source]

Bases: MultiHeadAtt

Implements the multi-head self attention module with relative positional encoding and pre-scaling module.

Args:

d_model (int): The model dimension.

h (int): The number of attention heads.

p_dropout (float): The dropout rate.

masking_value (int): The masking value. Default -1e15

forward(x: Tensor, mask: Union[None, Tensor] = None) → Tensor[source]

Computes the multi-head self-attention of the input tensor with an optional mask tensor.

Args:

x (Tensor): The input tensor of shape [B, M, d].

mask (Tensor, optional): Boolean tensor of shape [B, M], where it’s set to False for padding positions. If None is provided, no masking is applied. Default is None.

Returns:: Tensor: A tensor of shape [B, M, d] representing the output of the multi-head self-attention module.

training: bool

class speeq.models.layers.TransformerDecLayer(d_model: int, ff_size: int, h: int, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements a single decoder layer of the transformer as described in https://arxiv.org/abs/1706.03762

Args:

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of attention heads.

masking_value (int): The masking value. Default -1e15

forward(enc_out: Tensor, enc_mask: Optional[Tensor], dec_inp: Tensor, dec_mask: Optional[Tensor]) → Tensor[source]

Applies a single decoder layer of the transformer to the input.

Args:

enc_out (Tensor): The output of the encoder. Its shape is [B, M_enc, d].

enc_mask (Tensor, optional): The mask tensor for the encoder output. Its shape is [B, M_enc], where it is False for the padding positions.

dec_inp (Tensor): The input to the decoder layer. Its shape is [B, M_dec, d_model].

dec_mask (Tensor, optional): The mask tensor for the decoder input. Its shape is [B, M_dec], where it is False for the padding.

Returns:

The output of the decoder layer. Its shape is [B, M_dec, d_model].

training: bool

class speeq.models.layers.TransformerEncLayer(d_model: int, ff_size: int, h: int, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements a single layer of the transformer encoder model as presented in the paper https://arxiv.org/abs/1706.03762

Args:

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of heads in the attention mechanism.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.

forward(x: Tensor, mask: Union[None, Tensor] = None) → Tensor[source]

Performs a forward pass of the transformer encoder layer.

Args:

x (Tensor): The input tensor of shape [B, M, d].

mask (Union[Tensor, None], optional): Boolean tensor of the input of shape [B, M], where True indicates that the corresponding key position contains data, not padding, and therefore should not be masked. If None, the function acts as a normal multi-head attention. Defaults to None.

Returns:: Tensor: Result tensor of the same shape as x.

training: bool

class speeq.models.layers.TransformerEncLayerWithAttTruncation(d_model: int, ff_size: int, h: int, left_size: int, right_size: int, masking_value: int = -1000000000000000.0)[source]

Bases: TransformerEncLayer

Implements a single encoder layer of the transformer with truncated self attention as described in https://arxiv.org/abs/1910.12977

Args:

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of heads in the attention mechanism.

left_size (int): The size of the left window that each time step is allowed to look at.

right_size (int): The size of the right window that each time step is allowed to look at.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.

forward(x: Tensor, mask: Union[None, Tensor] = None) → Tensor[source]

Performs a forward pass of the transformer encoder layer.

Args:

x (Tensor): The input tensor of shape [B, M, d].

mask (Union[Tensor, None], optional): Boolean tensor of the input of shape [B, M], where True indicates that the corresponding key position contains data, not padding, and therefore should not be masked. If None, the function acts as a normal multi-head attention. Defaults to None.

Returns:: Tensor: Result tensor of the same shape as x.

training: bool

class speeq.models.layers.TransformerTransducerLayer(d_model: int, ff_size: int, h: int, left_size: int, right_size: int, p_dropout: float, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements a single encoder layer of the transformer transducer with truncated relative self attention as described in https://arxiv.org/abs/2002.02562

Args:

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of heads in the attention mechanism.

left_size (int): The size of the left window that each time step is allowed to look at.

right_size (int): The size of the right window that each time step is allowed to look at.

p_dropout (float): The dropout rate.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.

forward(x: Tensor, mask: Union[None, Tensor] = None) → Tensor[source]

Performs a forward pass of the transformer-transducer encoder layer.

Args:

x (Tensor): The input tensor of shape [B, M, d].

mask (Union[Tensor, None], optional): Boolean tensor of the input of shape [B, M], where True indicates that the corresponding key position contains data, not padding, and therefore should not be masked. If None, the function acts as a normal multi-head attention. Defaults to None.

Returns:: Tensor: Result tensor of the same shape as x.

training: bool

class speeq.models.layers.TruncatedRelativeMHSA(d_model: int, h: int, left_size: int, right_size: int, masking_value: float = -1000000000000000.0)[source]

Bases: TruncatedSelfAttention

Builds the truncated self attention with relative positional encoding module proposed in https://arxiv.org/abs/2002.02562

Args:

d_model (int): The model dimension.

h (int): The number of attention heads.

left_size (int): The size of the left window that each time step is allowed to look at.

right_size (int): The size of the right window that each time step is allowed to look at.

masking_value (float): The attention masking value.

forward(x: Tensor, mask: Optional[Tensor]) → Tensor[source]

Applies truncated masked relative multi-head self attention to the input.

Args:

x (Tensor): The input tensor of shape [B, M, d].

mask (Union[Tensor, None]): The mask tensor of the input of shape [B, M], where True indicates that the corresponding input position contains data, not padding, and therefore should not be masked. If None, the function acts as a normal multi-head self attention.

Returns:

Tensor: The attention result tensor of shape [B, M, d].

training: bool

class speeq.models.layers.TruncatedSelfAttention(d_model: int, h: int, left_size: int, right_size: int, masking_value: float = -1000000000000000.0)[source]

Bases: MultiHeadAtt

Builds the truncated self attention module used in https://arxiv.org/abs/1910.12977

Args:

d_model (int): The model dimension.

h (int): The number of attention heads.

left_size (int): The size of the left window that each time step is allowed to look at.

right_size (int): The size of the right window that each time step is allowed to look at.

masking_value (float): The attention masking value.

forward(x: Tensor, mask: Optional[Tensor]) → Tensor[source]

Applies truncated masked multi-head self attention to the input.

Args:

x (Tensor): The input tensor of shape [B, M, d].

mask (Union[Tensor, None]): The mask tensor of the input of shape [B, M], where True indicates that the corresponding input position contains data, not padding, and therefore should not be masked. If None, the function acts as a normal multi-head self attention.

Returns:

Tensor: The attention result tensor of shape [B, M, d].

get_looking_ahead_mask(mask: Tensor) → Tensor[source]
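
The effect of left_size and right_size can be sketched as a windowed attention mask. This is an illustration only: the helper name, the inclusive window bounds, and the way padding is combined with the window are assumptions, and plain lists stand in for tensors.

```python
def truncated_mask(mask, left_size, right_size):
    """Return an M x M boolean matrix for one sequence.

    Entry [i][j] is True when step i may attend to step j: j must hold data
    and lie within left_size steps before i or right_size steps after i.
    """
    m = len(mask)
    return [
        [mask[j] and (i - left_size <= j <= i + right_size) for j in range(m)]
        for i in range(m)
    ]

window = truncated_mask([True, True, True, False], left_size=1, right_size=0)
```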

training: bool

class speeq.models.layers.VGGTransformerPreNet(in_features: int, n_vgg_blocks: int, n_layers_per_block: List[int], kernel_sizes_per_block: List[List[int]], n_channels_per_block: List[List[int]], pooling_kernel_size: List[int], d_model: int)[source]

Bases: Module

Implements the VGGTransformer prenet module as described in https://arxiv.org/abs/1910.12977

Args:

in_features (int): The input feature size.

n_vgg_blocks (int): The number of VGG blocks to use.

n_layers_per_block (List[int]): A list of integers that specifies the number of convolution layers in each block.

kernel_sizes_per_block (List[List[int]]): A list of lists that contains the kernel size for each layer in each block. The length of the outer list should match n_vgg_blocks, and each inner list should be the same length as the corresponding block’s number of layers.

n_channels_per_block (List[List[int]]): A list of lists that contains the number of channels for each convolution layer in each block. This argument should also have length equal to n_vgg_blocks, and each sublist should have length equal to the number of layers in the corresponding block.

pooling_kernel_size (List[int]): A list of integers that specifies the size of the max pooling layer in each block. The length of this list should be equal to n_vgg_blocks.

d_model (int): The size of the output feature

forward(x: Tensor, lengths: Tensor) → Tuple[Tensor, Tensor][source]

Passes the input x through the VGGTransformer prenet and returns a tuple of tensors.

Args:

x (Tensor): Input tensor of shape [B, M, in_features].

lengths (Tensor): Lengths of shape [B] that has the length for each sequence in x.

Returns:

A tuple of two tensors (output, updated_lengths):

output (Tensor): Output tensor of shape [B, M, d_model].
updated_lengths (Tensor): Updated lengths of shape [B].

training: bool

Registry

speeq.models.registry.get_model(model_config: ModelConfig, n_classes: int) → Module[source]

Creates and returns a targeted model using the provided configuration object model_config.

Args:

model_config (object): The model configuration object.

n_classes (int): The number of classes for the model to predict.

Returns:

Module: The targeted model created using the configuration object.

speeq.models.registry.list_ctc_models() → List[str][source]: Lists all pre-implemented CTC-based models.

speeq.models.registry.list_seq2seq_models() → List[str][source]: Lists all pre-implemented seq2seq-based models.

speeq.models.registry.list_transducer_models() → List[str][source]: Lists all pre-implemented transducer-based models.

Seq2Seq Models

This module contains implementations of various sequence-to-sequence (seq2seq) speech recognition models.

Classes:

BasicAttSeq2SeqRNN: A basic seq2seq model with an RNN encoder and an attention-based RNN decoder.
LAS: A Listen, Attend and Spell (LAS) model.
RNNWithLocationAwareAtt: An RNN-based seq2seq model with location-aware attention mechanism.
SpeechTransformer: A transformer-based seq2seq model for speech processing.

class speeq.models.seq2seq.BasicAttSeq2SeqRNN(in_features: int, n_classes: int, hidden_size: int, enc_num_layers: int, bidirectional: bool, dec_num_layers: int, emb_dim: int, p_dropout: float, pred_activation: Module, teacher_forcing_rate: float = 0.0, rnn_type: str = 'rnn')[source]

Bases: Module

Implements the basic RNN encoder-decoder ASR model.

Args:

in_features (int): The encoder’s input feature speech size.

n_classes (int): The number of classes/vocabulary.

hidden_size (int): The hidden size of the RNN layers.

enc_num_layers (int): The number of layers in the encoder.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

dec_num_layers (int): The number of the RNN layers in the decoder.

emb_dim (int): The embedding size.

p_dropout (float): The dropout rate.

pred_activation (Module): An instance of an activation function.

teacher_forcing_rate (float): The teacher forcing rate. Default 0.0

rnn_type (str): The RNN type; it must be one of ‘rnn’, ‘gru’, or ‘lstm’. Default ‘rnn’.
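
The role of teacher_forcing_rate can be sketched as a per-step random choice between the ground-truth token and the model's own previous prediction. Whether the library samples per step or per batch is not stated here, so the per-step choice and the helper name below are assumptions.

```python
import random

def next_decoder_input(true_token, predicted_token, teacher_forcing_rate, rng=random):
    """Choose the token fed to the decoder at the next step."""
    if rng.random() < teacher_forcing_rate:
        return true_token       # teacher forcing: feed the ground truth
    return predicted_token      # otherwise feed the model's own prediction

always_truth = next_decoder_input(7, 3, teacher_forcing_rate=1.0)
never_truth = next_decoder_input(7, 3, teacher_forcing_rate=0.0)
```

With the default rate of 0.0 this choice always falls on the model's own prediction.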

forward(speech: Tensor, speech_mask: Tensor, text: Tensor, *args, **kwargs) → Tensor[source]

Passes the input to the model

Args:

speech (Tensor): The input speech of shape [B, M, d]

speech_mask (Union[Tensor, None]): The speech mask of shape [B, M], which is True for the data positions and False for the padding ones.

text (Tensor): The text tensor of shape [B, M_dec]

Returns:

Tensor: The prediction tensor of shape [B, M_dec, C]

predict(x: Tensor, mask: Tensor, state: dict) → dict[source]

training: bool

class speeq.models.seq2seq.LAS(in_features: int, n_classes: int, hidden_size: int, enc_num_layers: int, reduction_factor: int, bidirectional: bool, dec_num_layers: int, emb_dim: int, p_dropout: float, pred_activation: Module, teacher_forcing_rate: float = 0.0, rnn_type: str = 'rnn')[source]

Bases: BasicAttSeq2SeqRNN

Implements Listen, Attend and Spell model proposed in https://arxiv.org/abs/1508.01211

Args:

in_features (int): The encoder’s input feature speech size.

n_classes (int): The number of classes/vocabulary.

hidden_size (int): The hidden size of the RNN layers.

enc_num_layers (int): The number of layers in the encoder.

reduction_factor (int): The time resolution reduction factor.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

dec_num_layers (int): The number of the RNN layers in the decoder.

emb_dim (int): The embedding size.

p_dropout (float): The dropout rate.

pred_activation (Module): An instance of an activation function to be applied on the last dimension of the predicted logits.

teacher_forcing_rate (float): The teacher forcing rate. Default 0.0

rnn_type (str): The RNN type; it must be one of ‘rnn’, ‘gru’, or ‘lstm’. Default ‘rnn’.

training: bool

class speeq.models.seq2seq.RNNWithLocationAwareAtt(in_features: int, n_classes: int, hidden_size: int, enc_num_layers: int, bidirectional: bool, dec_num_layers: int, emb_dim: int, kernel_size: int, activation: str, p_dropout: float, pred_activation: Module, inv_temperature: Union[float, int] = 1, teacher_forcing_rate: float = 0.0, rnn_type: str = 'rnn')[source]

Bases: BasicAttSeq2SeqRNN

Implements the RNN seq2seq model proposed in https://arxiv.org/abs/1506.07503

Args:

in_features (int): The encoder’s input feature speech size.

n_classes (int): The number of classes/vocabulary.

hidden_size (int): The hidden size of the RNN layers.

enc_num_layers (int): The number of layers in the encoder.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

dec_num_layers (int): The number of the RNN layers in the decoder.

emb_dim (int): The embedding size.

kernel_size (int): The attention kernel size.

activation (str): The activation function to use in the attention layer. It can be either ‘softmax’ or ‘sigmax’.

p_dropout (float): The dropout rate.

pred_activation (Module): An instance of an activation function to be applied on the last dimension of the predicted logits.

inv_temperature (Union[float, int]): The inverse temperature value. Default 1.

teacher_forcing_rate (float): The teacher forcing rate. Default 0.0

rnn_type (str): The RNN type; it must be one of ‘rnn’, ‘gru’, or ‘lstm’. Default ‘rnn’.

training: bool

class speeq.models.seq2seq.SpeechTransformer(in_features: int, n_classes: int, n_conv_layers: int, kernel_size: int, stride: int, d_model: int, n_enc_layers: int, n_dec_layers: int, ff_size: int, h: int, att_kernel_size: int, att_out_channels: int, pred_activation: Module, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements the Speech Transformer model proposed in https://ieeexplore.ieee.org/document/8462506

Args:

in_features (int): The input/speech feature size.

n_classes (int): The number of classes.

n_conv_layers (int): The number of down-sampling convolutional layers.

kernel_size (int): The kernel size of the down-sampling convolutional layers.

stride (int): The stride size of the down-sampling convolutional layers.

d_model (int): The model dimensionality.

n_enc_layers (int): The number of encoder layers.

n_dec_layers (int): The number of decoder layers.

ff_size (int): The dimensionality of the inner layer of the feed-forward module.

h (int): The number of attention heads.

att_kernel_size (int): The kernel size of the attentional convolutional layers.

att_out_channels (int): The number of output channels of the attentional convolution layers.

pred_activation (Module): An activation function instance to be applied on the last dimension of the predicted logits.

masking_value (int): The attention masking value. Default -1e15

forward(speech: Tensor, speech_mask: Tensor, text: Tensor, text_mask: Tensor, *args, **kwargs) → Tensor[source]

Passes the input to the model.

Args:

speech (Tensor): The input speech of shape [B, M, d]

speech_mask (Union[Tensor, None]): The speech mask of shape [B, M], which is True for the data positions and False for the padding ones.

text (Tensor): The text tensor of shape [B, M_dec]

text_mask (Union[Tensor, None]): The text mask of shape [B, M_dec], which is True for the data positions and False for the padding ones.

Returns:: Tensor: The prediction tensor of shape [B, M_dec, C]
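The mask convention described above (True for data positions, False for padding) can be sketched from per-utterance lengths. This is a minimal pure-Python illustration with nested lists standing in for tensors; lengths_to_mask is a hypothetical helper, not a speeq function.

```python
def lengths_to_mask(lengths, max_len):
    """Return a [B, max_len] mask: True for data positions, False for padding."""
    return [[pos < length for pos in range(max_len)] for length in lengths]


# Two utterances padded to 4 frames; the second has only 2 real frames.
speech_mask = lengths_to_mask([4, 2], max_len=4)
```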

predict(speech: Tensor, mask: Tensor, state: dict)[source]

training: bool

Skeletons

Builds the skeletons for CTC, seq2seq, and transducer models. These are used for building custom models, where the user can combine or build a custom encoder or decoder.

class speeq.models.skeletons.CTCSkeleton(*args, **kwargs)[source]

Bases: CTCModel

Builds the CTC-based model skeleton

Args:

encoder (Module): The speech encoder (acoustic model), such that the forward method of the encoder returns a tuple of the encoded speech tensor and a length tensor for the encoded speech.

pred_net (Union[Module, None]): The prediction network. If provided, its forward method is expected to apply log softmax as the activation function and return predictions of shape [B, T, C], where T is the sequence length, B is the batch size, and C is the number of classes. Default None.

feat_size (Union[int, None]): The encoder’s output feature size. Used if the pred_net parameter is not None. Default None.

n_classes (Union[int, None]): The number of classes/characters to be predicted. Used if the pred_net parameter is not None.

training: bool
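The log softmax expected at the output of a custom prediction network can be sketched for a single time step as follows. This is a pure-Python illustration over a list of C logits, not speeq code; in a real model the same normalisation would be applied along the class dimension of the prediction tensor.

```python
import math


def log_softmax(logits):
    """Log-probabilities over the class dimension for one time step."""
    peak = max(logits)  # subtract the max for numerical stability
    log_total = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_total for v in logits]


log_probs = log_softmax([2.0, 0.5, -1.0])
```

Exponentiating the returned values gives probabilities that sum to 1, which is the form a CTC loss typically consumes.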

class speeq.models.skeletons.Seq2SeqSkeleton(encoder: Module, decoder: Module, *args, **kwargs)[source]

Bases: Module

Builds the Seq2Seq model skeleton

Args:

encoder (Module): The speech encoder (acoustic model), such that the forward method of the encoder returns a tuple of the encoded speech tensor, the last encoder hidden state tensor/tuple if there is any, and a length tensor for the encoded speech.

decoder (Module): The text decoder such that the forward method of the decoder takes the encoder’s output, the last encoder’s hidden state (if there is any), the encoder mask, the decoder input, and the decoder mask and returns the prediction tensor.

forward(speech: Tensor, speech_mask: Tensor, text: Tensor, text_mask: Tensor, *args, **kwargs) → Tensor[source]

Passes the input to the model.

Args:

speech (Tensor): The speech of shape [B, M, d]

speech_mask (Tensor): The speech mask of shape [B, M]

text (Tensor): The text tensor of shape [B, N]

text_mask (Tensor): The text mask tensor of shape [B, N]

Returns:: Tensor: The result tensor of shape [B, N, C]

training: bool

class speeq.models.skeletons.TransducerSkeleton(encoder: Module, decoder: Module, join_net: Optional[Module] = None, feat_size: Optional[int] = None, n_classes: Optional[int] = None)[source]

Bases: _BaseTransducer

Builds the Transducer-based model skeleton

Args:

encoder (Module): The speech encoder (acoustic model), such that the forward method of the encoder returns a tuple of the encoded speech tensor and a length tensor for the encoded speech.

decoder (Module): The text decoder such that the forward method of the decoder returns a tuple of the encoded text tensor and a length tensor for the encoded text.

join_net (Union[Module, None]): The join network. If provided, its forward method is expected to apply no activation function and return results of shape [B, Ts, Tt, C], where B is the batch size, Ts is the speech sequence length, Tt is the text sequence length, and C is the number of classes. Default None.

feat_size (Union[int, None]): The output feature size of the encoder and the decoder. Used if the join_net parameter is not None. Default None.

n_classes (Union[int, None]): The number of classes/characters to be predicted. Used if the join_net parameter is not None. Default None.

training: bool
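The [B, Ts, Tt, C] result shape of a join network comes from pairing every encoder time step with every decoder step. The sketch below shows that pairing with nested lists and an additive combination. The additive choice and the function name are assumptions for illustration only, and the final projection from the feature size F to C classes is omitted.

```python
def join(enc, dec):
    """Combine enc [B, Ts, F] and dec [B, Tt, F] into [B, Ts, Tt, F] by addition."""
    return [
        [[[e + d for e, d in zip(enc_t, dec_u)] for dec_u in dec_b] for enc_t in enc_b]
        for enc_b, dec_b in zip(enc, dec)
    ]


enc = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]  # B=1, Ts=3, F=2
dec = [[[10.0, 20.0], [30.0, 40.0]]]          # B=1, Tt=2, F=2
joined = join(enc, dec)
shape = (len(joined), len(joined[0]), len(joined[0][0]), len(joined[0][0][0]))
```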

Model Templates

This file contains templates for various pre-implemented models. Each template is a model configuration for a specific pre-implemented model in the framework.

Classes:

BaseTemplate: Base template that defines common configuration parameters for all models.
DeepSpeechV1Temp: Template for configuring DeepSpeechV1 model.
BERTTemp: Template for configuring BERT model.
DeepSpeechV2Temp: Template for configuring DeepSpeechV2 model.
ConformerCTCTemp: Template for configuring Conformer CTC model.
JasperTemp: Template for configuring Jasper model.
Wav2LetterTemp: Template for configuring Wav2Letter model.
LASTemp: Template for configuring LAS model.
BasicAttSeq2SeqRNNTemp: Template for configuring Basic Attention Seq2Seq RNN model.
RNNWithLocationAwareAttTemp: Template for configuring RNN with Location-Aware Attention model.
SpeechTransformerTemp: Template for configuring Speech Transformer model.
QuartzNetTemp: Template for configuring QuartzNet model.
SqueezeformerCTCTemp: Template for configuring Squeezeformer CTC model.
RNNTTemp: Template for configuring RNNT model.
ConformerTransducerTemp: Template for configuring Conformer Transducer model.
ContextNetTemp: Template for configuring ContextNet model.
VGGTransformerTransducerTemp: Template for configuring the VGG transformer with truncated self attention model.
TransformerTransducerTemp: Template for configuring the transformer-transducer with truncated relative self attention model.

Builder:

The templates below can be used to build custom models:

CTCModelBuilderTemp: Template for building CTC models.
TransducerBuilderTemp: Template for building Transducer models.
Seq2SeqBuilderTemp: Template for building Seq2Seq models.

class speeq.models.templates.BERTTemp(max_len: int, in_features: int, d_model: int, h: int, ff_size: int, n_layers: int, p_dropout: float)[source]

Bases: BaseTemplate

BERT model template https://arxiv.org/abs/1810.04805

Args:

max_len (int): The maximum length for positional encoding.

in_features (int): The input/speech feature size.

d_model (int): The model dimensionality.

h (int): The number of attention heads.

ff_size (int): The inner size of the feed forward module.

n_layers (int): The number of transformer encoders.

p_dropout (float): The dropout rate.

d_model: int

ff_size: int

h: int

in_features: int

max_len: int

n_layers: int

p_dropout: float

class speeq.models.templates.BaseTemplate[source]

Bases: ITemplate

get_dict()[source]

property name

property type
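The per-field attribute listings under each template suggest a dataclass-like layout. As a rough sketch under that assumption, a template could expose its configuration as a plain dictionary like this. ToyTemplate is a hypothetical stand-in and is not part of speeq; the real get_dict may return different keys.

```python
from dataclasses import asdict, dataclass


@dataclass
class ToyTemplate:
    """Hypothetical stand-in for a model template; not part of speeq."""

    in_features: int
    hidden_size: int
    p_dropout: float
    rnn_type: str = "rnn"

    def get_dict(self) -> dict:
        # Expose every configured field as a plain dictionary.
        return asdict(self)


config = ToyTemplate(in_features=80, hidden_size=256, p_dropout=0.1).get_dict()
```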

class speeq.models.templates.BasicAttSeq2SeqRNNTemp(in_features: int, hidden_size: int, enc_num_layers: int, bidirectional: bool, dec_num_layers: int, emb_dim: int, p_dropout: float, pred_activation: Module, teacher_forcing_rate: float = 0.0, rnn_type: str = 'rnn')[source]

Bases: BaseTemplate

Basic RNN encoder-decoder model template.

Args:

in_features (int): The encoder’s input feature speech size.

hidden_size (int): The hidden size of the RNN layers.

enc_num_layers (int): The number of layers in the encoder.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

dec_num_layers (int): The number of the RNN layers in the decoder.

emb_dim (int): The embedding size.

p_dropout (float): The dropout rate.

pred_activation (Module): An instance of an activation function.

teacher_forcing_rate (float): The teacher forcing rate. Default 0.0

rnn_type (str): The RNN type; it must be one of rnn, gru, or lstm. Default ‘rnn’.

bidirectional: bool

dec_num_layers: int

emb_dim: int

enc_num_layers: int

hidden_size: int

in_features: int

p_dropout: float

pred_activation: Module

rnn_type: str = 'rnn'

teacher_forcing_rate: float = 0.0

class speeq.models.templates.CTCModelBuilderTemp(encoder: Module, pred_net: Optional[Module] = None, feat_size: Optional[int] = None)[source]

Bases: BaseTemplate

CTC-based model builder template

Args:

encoder (Module): The speech encoder (acoustic model), such that the forward of the encoder returns a tuple of the encoded speech tensor and a length tensor of the encoded speech.

pred_net (Union[Module, None]): The prediction network. If provided, its forward method is expected to apply log softmax as the activation function and return predictions of shape [T, B, C], where T is the sequence length, B is the batch size, and C is the number of classes. Default None.

feat_size (Union[int, None]): The encoder’s output feature size. Used if the pred_net parameter is not None. Default None.

encoder: Module

feat_size: Optional[int] = None

pred_net: Optional[Module] = None

class speeq.models.templates.ConformerCTCTemp(d_model: int, n_conf_layers: int, ff_expansion_factor: int, h: int, kernel_size: int, ss_kernel_size: int, ss_stride: int, ss_num_conv_layers: int, in_features: int, res_scaling: float, p_dropout: float)[source]

Bases: BaseTemplate

ConformerCTC model template https://arxiv.org/abs/2005.08100

Args:

d_model (int): The model dimension.

n_conf_layers (int): The number of conformer blocks.

ff_expansion_factor (int): The feed-forward expansion factor.

h (int): The number of attention heads.

kernel_size (int): The convolution module kernel size.

ss_kernel_size (int): The subsampling layer kernel size.

ss_stride (int): The subsampling layer stride size.

ss_num_conv_layers (int): The number of subsampling convolutional layers.

in_features (int): The input/speech feature size.

res_scaling (float): The residual connection multiplier.

p_dropout (float): The dropout rate.

d_model: int

ff_expansion_factor: int

h: int

in_features: int

kernel_size: int

n_conf_layers: int

p_dropout: float

res_scaling: float

ss_kernel_size: int

ss_num_conv_layers: int

ss_stride: int

class speeq.models.templates.ConformerTransducerTemp(d_model: int, n_conf_layers: int, n_dec_layers: int, ff_expansion_factor: int, h: int, kernel_size: int, ss_kernel_size: int, ss_stride: int, ss_num_conv_layers: int, in_features: int, res_scaling: float, emb_dim: int, rnn_type: str, p_dropout: float)[source]

Bases: BaseTemplate

Conformer transducer model template https://arxiv.org/abs/2005.08100

Args:

d_model (int): The model dimension.

n_conf_layers (int): The number of conformer blocks.

n_dec_layers (int): The number of RNNs in the decoder (predictor).

ff_expansion_factor (int): The feed-forward expansion factor.

h (int): The number of attention heads.

kernel_size (int): The convolution module kernel size.

ss_kernel_size (int): The subsampling layer kernel size.

ss_stride (int): The subsampling layer stride size.

ss_num_conv_layers (int): The number of subsampling convolutional layers.

in_features (int): The input/speech feature size.

res_scaling (float): The residual connection multiplier.

emb_dim (int): The embedding layer’s size.

rnn_type (str): The RNN type; it must be one of rnn, gru, or lstm.

p_dropout (float): The dropout rate.

d_model: int

emb_dim: int

ff_expansion_factor: int

h: int

in_features: int

kernel_size: int

n_conf_layers: int

n_dec_layers: int

p_dropout: float

res_scaling: float

rnn_type: str

ss_kernel_size: int

ss_num_conv_layers: int

ss_stride: int

class speeq.models.templates.ContextNetTemp(in_features: int, emb_dim: int, n_layers: int, n_dec_layers: int, n_sub_layers: Union[int, List[int]], stride: Union[int, List[int]], out_channels: Union[int, List[int]], kernel_size: int, reduction_factor: int, rnn_type: str)[source]

Bases: BaseTemplate

ContextNet transducer model template https://arxiv.org/abs/2005.03191

Args:

in_features (int): The input feature size.

emb_dim (int): The embedding layer’s size.

n_layers (int): The number of ContextNet blocks.

n_dec_layers (int): The number of RNNs in the decoder (predictor).

n_sub_layers (Union[int, List[int]]): The number of convolutional layers per block. If list is passed, it has to be of length equal to n_layers.

stride (Union[int, List[int]]): The stride of the last convolutional layers per block. If list is passed, it has to be of length equal to n_layers.

out_channels (Union[int, List[int]]): The channels size of the convolutional layers per block. If list is passed, it has to be of length equal to n_layers.

kernel_size (int): The convolutional layers kernel size.

reduction_factor (int): The feature reduction size of the Squeeze-and-excitation module.

rnn_type (str): The RNN type; it must be one of rnn, gru, or lstm.

emb_dim: int

in_features: int

kernel_size: int

n_dec_layers: int

n_layers: int

n_sub_layers: Union[int, List[int]]

out_channels: Union[int, List[int]]

reduction_factor: int

rnn_type: str

stride: Union[int, List[int]]
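Arguments such as n_sub_layers, stride, and out_channels accept either a single integer or a list with one entry per block. A minimal sketch of how such a value could be normalised to a list of length n_layers is shown here; expand_per_block is a hypothetical helper and does not describe speeq's internal handling.

```python
def expand_per_block(value, n_layers, name="value"):
    """Broadcast an int to n_layers entries, or check a list's length."""
    if isinstance(value, int):
        return [value] * n_layers
    values = list(value)
    if len(values) != n_layers:
        raise ValueError(f"{name} must have {n_layers} entries, got {len(values)}")
    return values


strides = expand_per_block(1, n_layers=3, name="stride")
channels = expand_per_block([256, 256, 512], n_layers=3, name="out_channels")
```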

class speeq.models.templates.DeepSpeechV1Temp(in_features: int, hidden_size: int, n_linear_layers: int, bidirectional: bool, max_clip_value: int, p_dropout: float, rnn_type: str = 'rnn')[source]

Bases: BaseTemplate

DeepSpeech 1 model template https://arxiv.org/abs/1412.5567

Args:

in_features (int): The input feature size.

hidden_size (int): The hidden size of the rnn layers.

n_linear_layers (int): The number of feed-forward layers.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

max_clip_value (int): The maximum relu clipping value.

p_dropout (float): The dropout rate.

rnn_type (str): The RNN type; it must be one of rnn, gru, or lstm. Default ‘rnn’.

bidirectional: bool

hidden_size: int

in_features: int

max_clip_value: int

n_linear_layers: int

p_dropout: float

rnn_type: str = 'rnn'
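The role of max_clip_value can be illustrated with the clipped ReLU used in the cited DeepSpeech paper, min(max(x, 0), max_clip_value). The scalar function below is a sketch for illustration, not the speeq module itself.

```python
def clipped_relu(x, max_clip_value):
    """Clamp x to the range [0, max_clip_value]."""
    return min(max(x, 0.0), float(max_clip_value))


outputs = [clipped_relu(x, max_clip_value=20) for x in (-3.0, 5.0, 25.0)]
```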

class speeq.models.templates.DeepSpeechV2Temp(n_conv: int, kernel_size: int, stride: int, in_features: int, hidden_size: int, bidirectional: bool, n_rnn: int, n_linear_layers: int, max_clip_value: int, tau: int, p_dropout: float, rnn_type: str = 'rnn')[source]

Bases: BaseTemplate

DeepSpeech 2 model template https://arxiv.org/abs/1512.02595

Args:

n_conv (int): The number of convolution layers.

kernel_size (int): The kernel size of the convolution layers.

stride (int): The stride size of the convolution layer.

in_features (int): The input/speech feature size.

hidden_size (int): The hidden size of the RNN layers.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

n_rnn (int): The number of RNN layers.

n_linear_layers (int): The number of linear layers.

max_clip_value (int): The maximum relu clipping value.

tau (int): The future context size.

p_dropout (float): The dropout rate.

rnn_type (str): The RNN type; it must be one of rnn, gru, or lstm. Default ‘rnn’.

bidirectional: bool

hidden_size: int

in_features: int

kernel_size: int

max_clip_value: int

n_conv: int

n_linear_layers: int

n_rnn: int

p_dropout: float

rnn_type: str = 'rnn'

stride: int

tau: int
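One way to read tau, the future context size, is through the lookahead (row) convolution from the cited paper, where each output frame mixes the current frame with the next tau frames. The sketch below applies that idea to a single feature channel with zero padding at the end of the sequence. The padding choice and the function name are assumptions, not a description of speeq's layer.

```python
def lookahead(frames, weights):
    """Mix each frame with the next len(weights) - 1 frames of one channel."""
    tau = len(weights) - 1
    padded = list(frames) + [0.0] * tau  # zero-pad the future context
    return [
        sum(w * padded[t + j] for j, w in enumerate(weights))
        for t in range(len(frames))
    ]


# tau = 2: each output sees the current frame and two future frames.
out = lookahead([1.0, 2.0, 3.0, 4.0], weights=[0.5, 0.25, 0.25])
```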

class speeq.models.templates.JasperTemp(in_features: int, num_blocks: int, num_sub_blocks: int, channel_inc: int, epilog_kernel_size: int, prelog_kernel_size: int, prelog_stride: int, prelog_n_channels: int, blocks_kernel_size: Union[int, List[int]], p_dropout: float)[source]

Bases: BaseTemplate

Jasper model template https://arxiv.org/abs/1904.03288

Args:

in_features (int): The input/speech feature size.

num_blocks (int): The number of Jasper blocks (denoted as ‘B’ in the paper).

num_sub_blocks (int): The number of Jasper subblocks (denoted as ‘R’ in the paper).

channel_inc (int): The rate to increase the number of channels across the blocks.

epilog_kernel_size (int): The kernel size of the epilog block convolution layer.

prelog_kernel_size (int): The kernel size of the prelog block convolution layer.

prelog_stride (int): The stride size of the prelog block convolution layer.

prelog_n_channels (int): The output channels of the prelog block convolution layer.

blocks_kernel_size (Union[int, List[int]]): The kernel size(s) of the convolution layer for each block.

p_dropout (float): The dropout rate.

blocks_kernel_size: Union[int, List[int]]

channel_inc: int

epilog_kernel_size: int

in_features: int

num_blocks: int

num_sub_blocks: int

p_dropout: float

prelog_kernel_size: int

prelog_n_channels: int

prelog_stride: int

class speeq.models.templates.LASTemp(in_features: int, hidden_size: int, enc_num_layers: int, reduction_factor: int, bidirectional: bool, dec_num_layers: int, emb_dim: int, p_dropout: float, pred_activation: Module, teacher_forcing_rate: float = 0.0, rnn_type: str = 'rnn')[source]

Bases: BaseTemplate

Listen, Attend and Spell model template https://arxiv.org/abs/1508.01211

Args:

in_features (int): The encoder’s input feature speech size.

hidden_size (int): The hidden size of the RNN layers.

enc_num_layers (int): The number of layers in the encoder.

reduction_factor (int): The time resolution reduction factor.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

dec_num_layers (int): The number of the RNN layers in the decoder.

emb_dim (int): The embedding size.

p_dropout (float): The dropout rate.

pred_activation (Module): An instance of an activation function to be applied on the last dimension of the predicted logits.

teacher_forcing_rate (float): The teacher forcing rate. Default 0.0

rnn_type (str): The RNN type; it must be one of rnn, gru, or lstm. Default ‘rnn’.

bidirectional: bool

dec_num_layers: int

emb_dim: int

enc_num_layers: int

hidden_size: int

in_features: int

p_dropout: float

pred_activation: Module

reduction_factor: int

rnn_type: str = 'rnn'

teacher_forcing_rate: float = 0.0
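The effect of reduction_factor on time resolution can be sketched by concatenating consecutive frames, as in the pyramidal encoder of the cited Listen, Attend and Spell paper. Dropping a ragged tail is an assumption made here for simplicity, and reduce_time is a hypothetical helper rather than speeq code.

```python
def reduce_time(frames, reduction_factor):
    """Concatenate every `reduction_factor` consecutive frames into one."""
    usable = len(frames) - len(frames) % reduction_factor  # drop the ragged tail
    return [
        [x for frame in frames[i : i + reduction_factor] for x in frame]
        for i in range(0, usable, reduction_factor)
    ]


frames = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]  # T=5, d=2
reduced = reduce_time(frames, reduction_factor=2)   # T=2, d=4
```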

class speeq.models.templates.QuartzNetTemp(in_features: int, num_blocks: int, block_repetition: int, num_sub_blocks: int, channels_size: List[int], epilog_kernel_size: int, epilog_channel_size: Tuple[int, int], prelog_kernel_size: int, prelog_stride: int, prelog_n_channels: int, groups: int, blocks_kernel_size: Union[int, List[int]], p_dropout: float)[source]

Bases: BaseTemplate

QuartzNet model template https://arxiv.org/abs/1910.10261

Args:

in_features (int): The input/speech feature size.

num_blocks (int): The number of QuartzNet blocks (denoted as ‘B’ in the paper).

block_repetition (int): The number of times to repeat each block (denoted as ‘S’ in the paper).

num_sub_blocks (int): The number of QuartzNet subblocks (denoted as ‘R’ in the paper).

channels_size (List[int]): A list of integers representing the number of output channels for each block.

epilog_kernel_size (int): The kernel size of the convolution layer in the epilog block.

epilog_channel_size (Tuple[int, int]): A tuple holding the channel size of each of the two epilog convolution layers.

prelog_kernel_size (int): The kernel size of the convolution layer in the prelog block.

prelog_stride (int): The stride size of the convolutional layer in the prelog block.

prelog_n_channels (int): The number of output channels of the convolutional layer in the prelog block.

groups (int): The groups size.

blocks_kernel_size (Union[int, List[int]]): An integer or a list of integers representing the kernel size(s) for each block’s convolutional layer.

p_dropout (float): The dropout rate.

block_repetition: int

blocks_kernel_size: Union[int, List[int]]

channels_size: List[int]

epilog_channel_size: Tuple[int, int]

epilog_kernel_size: int

groups: int

in_features: int

num_blocks: int

num_sub_blocks: int

p_dropout: float

prelog_kernel_size: int

prelog_n_channels: int

prelog_stride: int

class speeq.models.templates.RNNTTemp(in_features: int, emb_dim: int, n_layers: int, n_dec_layers: int, hidden_size: int, bidirectional: bool, rnn_type: str, p_dropout: float)[source]

Bases: BaseTemplate

RNN transducer model template https://arxiv.org/abs/1211.3711

Args:

in_features (int): The input feature size.

emb_dim (int): The embedding layer’s size.

n_layers (int): The number of the RNN layers in the encoder.

n_dec_layers (int): The number of RNNs in the decoder (predictor).

hidden_size (int): The hidden size of the RNN layers.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

rnn_type (str): The RNN type.

p_dropout (float): The dropout rate.

bidirectional: bool

emb_dim: int

hidden_size: int

in_features: int

n_dec_layers: int

n_layers: int

p_dropout: float

rnn_type: str

class speeq.models.templates.RNNWithLocationAwareAttTemp(in_features: int, hidden_size: int, enc_num_layers: int, bidirectional: bool, dec_num_layers: int, emb_dim: int, kernel_size: int, activation: str, p_dropout: float, pred_activation: Module, inv_temperature: Union[float, int] = 1, teacher_forcing_rate: float = 0.0, rnn_type: str = 'rnn')[source]

Bases: BaseTemplate

RNN seq2seq with location-aware attention model template https://arxiv.org/abs/1506.07503

Args:

in_features (int): The encoder’s input feature speech size.

hidden_size (int): The hidden size of the RNN layers.

enc_num_layers (int): The number of layers in the encoder.

bidirectional (bool): A flag indicating if the rnn is bidirectional or not.

dec_num_layers (int): The number of the RNN layers in the decoder.

emb_dim (int): The embedding size.

kernel_size (int): The attention kernel size.

activation (str): The activation function to use in the attention layer. It can be either softmax or sigmax.

p_dropout (float): The dropout rate.

pred_activation (Module): An instance of an activation function to be applied on the last dimension of the predicted logits.

inv_temperature (Union[float, int]): The inverse temperature value. Default 1.

teacher_forcing_rate (float): The teacher forcing rate. Default 0.0

rnn_type (str): The RNN type; it must be one of rnn, gru, or lstm. Default ‘rnn’.

activation: str

bidirectional: bool

dec_num_layers: int

emb_dim: int

enc_num_layers: int

hidden_size: int

in_features: int

inv_temperature: Union[float, int] = 1

kernel_size: int

p_dropout: float

pred_activation: Module

rnn_type: str = 'rnn'

teacher_forcing_rate: float = 0.0

class speeq.models.templates.Seq2SeqBuilderTemp(encoder: Module, decoder: Module)[source]

Bases: BaseTemplate

Seq2Seq-based model builder template

Args:

encoder (Module): The speech encoder (acoustic model), such that the forward method of the encoder returns a tuple of the encoded speech tensor, the last encoder hidden state tensor/tuple if there is any, and a length tensor for the encoded speech.

decoder (Module): The text decoder such that the forward method of the decoder takes the encoder’s output, the last encoder’s hidden state (if there is any), the encoder mask, the decoder input, and the decoder mask and returns the prediction tensor.

decoder: Module

encoder: Module

class speeq.models.templates.SpeechTransformerTemp(in_features: int, n_conv_layers: int, kernel_size: int, stride: int, d_model: int, n_enc_layers: int, n_dec_layers: int, ff_size: int, h: int, att_kernel_size: int, att_out_channels: int, pred_activation: Module, masking_value: int = -1000000000000000.0)[source]

Bases: BaseTemplate

Speech Transformer model template https://ieeexplore.ieee.org/document/8462506

Args:

in_features (int): The input/speech feature size.

n_conv_layers (int): The number of down-sampling convolutional layers.

kernel_size (int): The kernel size of the down-sampling convolutional layers.

stride (int): The stride size of the down-sampling convolutional layers.

d_model (int): The model dimensionality.

n_enc_layers (int): The number of encoder layers.

n_dec_layers (int): The number of decoder layers.

ff_size (int): The dimensionality of the inner layer of the feed-forward module.

h (int): The number of attention heads.

att_kernel_size (int): The kernel size of the attentional convolutional layers.

att_out_channels (int): The number of output channels of the attentional convolution layers.

pred_activation (Module): An activation function instance to be applied on the last dimension of the predicted logits.

masking_value (int): The attention masking value. Default -1e15

att_kernel_size: int

att_out_channels: int

d_model: int

ff_size: int

h: int

in_features: int

kernel_size: int

masking_value: int = -1000000000000000.0

n_conv_layers: int

n_dec_layers: int

n_enc_layers: int

pred_activation: Module

stride: int

class speeq.models.templates.SqueezeformerCTCTemp(in_features: int, n: int, d_model: int, ff_expansion_factor: int, h: int, kernel_size: int, pooling_kernel_size: int, pooling_stride: int, ss_kernel_size: Union[int, List[int]], ss_stride: Union[int, List[int]], ss_n_conv_layers: int, p_dropout: float, ss_groups: Union[int, List[int]] = 1, masking_value: int = -1000000000000000.0)[source]

Bases: BaseTemplate

Squeezeformer model template https://arxiv.org/abs/2206.00888

Args:

in_features (int): The input/speech feature size.

n (int): The number of layers per block (denoted as N in the paper).

d_model (int): The model dimension.

ff_expansion_factor (int): The expansion factor of linear layer in the feed forward module.

h (int): The number of attention heads.

kernel_size (int): The kernel size of the depth-wise convolution layer.

pooling_kernel_size (int): The kernel size of the pooling convolution layer.

pooling_stride (int): The stride size of the pooling convolution layer.

ss_kernel_size (Union[int, List[int]]): The kernel size of the subsampling layer(s).

ss_stride (Union[int, List[int]]): The stride of the subsampling layer(s).

ss_n_conv_layers (int): The number of subsampling convolutional layers.

p_dropout (float): The dropout rate.

ss_groups (Union[int, List[int]]): The subsampling convolution groups size(s).

masking_value (int): The masking value. Default -1e15

d_model: int

ff_expansion_factor: int

h: int

in_features: int

kernel_size: int

masking_value: int = -1000000000000000.0

n: int

p_dropout: float

pooling_kernel_size: int

pooling_stride: int

ss_groups: Union[int, List[int]] = 1

ss_kernel_size: Union[int, List[int]]

ss_n_conv_layers: int

ss_stride: Union[int, List[int]]

class speeq.models.templates.TransducerBuilderTemp(encoder: Module, decoder: Module, join_net: Optional[Module] = None, feat_size: Optional[int] = None)[source]

Bases: BaseTemplate

Transducer-based model builder template

Args:

encoder (Module): The speech encoder (acoustic model), such that the forward method of the encoder returns a tuple of the encoded speech tensor and a length tensor for the encoded speech.

decoder (Module): The text decoder such that the forward method of the decoder returns a tuple of the encoded text tensor and a length tensor for the encoded text.

join_net (Union[Module, None]): The join network. If provided, its forward method is expected to apply no activation function and return results of shape [B, Ts, Tt, C], where B is the batch size, Ts is the speech sequence length, Tt is the text sequence length, and C is the number of classes. Default None.

feat_size (Union[int, None]): The output feature size of the encoder and the decoder. Used if the join_net parameter is not None. Default None.

decoder: Module

encoder: Module

feat_size: Union[None, int] = None

join_net: Optional[Module] = None

class speeq.models.templates.TransformerTransducerTemp(in_features: int, n_layers: int, n_dec_layers: int, d_model: int, ff_size: int, h: int, joint_size: int, enc_left_size: int, enc_right_size: int, dec_left_size: int, dec_right_size: int, p_dropout: float, stride: int = 1, kernel_size: int = 1, masking_value: int = -1000000000000000.0)[source]

Bases: BaseTemplate

Transformer-Transducer model template https://arxiv.org/abs/2002.02562

Args:

in_features (int): The input feature size.

n_layers (int): The number of transformer encoder layers with truncated self attention.

n_dec_layers (int): The number of layers in the decoder (predictor).

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of heads in the attention mechanism.

joint_size (int): The joint layer feature size.

enc_left_size (int): The size of the left window that each time step is allowed to look at in the encoder.

enc_right_size (int): The size of the right window that each time step is allowed to look at in the encoder.

dec_left_size (int): The size of the left window that each time step is allowed to look at in the decoder.

dec_right_size (int): The size of the right window that each time step is allowed to look at in the decoder.

p_dropout (float): The dropout rate.

stride (int): The stride of the convolution layer in the prenet. Default 1.

kernel_size (int): The kernel size of the convolution layer in the prenet. Default 1.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.

d_model: int

dec_left_size: int

dec_right_size: int

enc_left_size: int

enc_right_size: int

ff_size: int

h: int

in_features: int

joint_size: int

kernel_size: int = 1

masking_value: int = -1000000000000000.0

n_dec_layers: int

n_layers: int

p_dropout: float

stride: int = 1
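The left and right window sizes can be pictured as a boolean attention mask in which time step t may look at steps t - left_size through t + right_size. The sketch below builds such a mask with nested lists. It assumes the window includes the current step, and window_mask is a hypothetical helper rather than a speeq function.

```python
def window_mask(length, left_size, right_size):
    """mask[t][s] is True when step t may attend to step s."""
    return [
        [t - left_size <= s <= t + right_size for s in range(length)]
        for t in range(length)
    ]


# Each step sees itself and one step to the left, with no future context.
mask = window_mask(4, left_size=1, right_size=0)
```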

class speeq.models.templates.VGGTransformerTransducerTemp(in_features: int, emb_dim: int, n_layers: int, n_dec_layers: int, rnn_type: str, n_vgg_blocks: int, n_conv_layers_per_vgg_block: List[int], kernel_sizes_per_vgg_block: List[List[int]], n_channels_per_vgg_block: List[List[int]], vgg_pooling_kernel_size: List[int], d_model: int, ff_size: int, h: int, joint_size: int, left_size: int, right_size: int, p_dropout: float, masking_value: int = -1000000000000000.0)[source]

Bases: BaseTemplate

VGG Transformer transducer model template https://arxiv.org/abs/1910.12977

Args:

in_features (int): The input feature size.

emb_dim (int): The embedding layer’s size.

n_layers (int): The number of transformer encoder layers with truncated self attention.

n_dec_layers (int): The number of RNNs in the decoder (predictor).

rnn_type (str): The RNN type.

n_vgg_blocks (int): The number of VGG blocks to use.

n_conv_layers_per_vgg_block (List[int]): A list of integers that specifies the number of convolution layers in each block.

kernel_sizes_per_vgg_block (List[List[int]]): A list of lists that contains the kernel size for each layer in each block. The length of the outer list should match n_vgg_blocks, and each inner list should be the same length as the corresponding block’s number of layers.

n_channels_per_vgg_block (List[List[int]]): A list of lists that contains the number of channels for each convolution layer in each block. This argument should also have length equal to n_vgg_blocks, and each sublist should have length equal to the number of layers in the corresponding block.

vgg_pooling_kernel_size (List[int]): A list of integers that specifies the size of the max pooling layer in each block. The length of this list should be equal to n_vgg_blocks.

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of heads in the attention mechanism.

joint_size (int): The joint layer feature size (denoted as d_o in the paper).

left_size (int): The size of the left window that each time step is allowed to look at.

right_size (int): The size of the right window that each time step is allowed to look at.

p_dropout (float): The dropout rate.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.
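
The length constraints on the VGG arguments above can be checked before a template is built. The following is a minimal sketch in plain Python; check_vgg_config is a hypothetical helper, not part of speeq, and it assumes that n_conv_layers_per_vgg_block also has one entry per block.

```python
from typing import List


def check_vgg_config(
    n_vgg_blocks: int,
    n_conv_layers_per_vgg_block: List[int],
    kernel_sizes_per_vgg_block: List[List[int]],
    n_channels_per_vgg_block: List[List[int]],
    vgg_pooling_kernel_size: List[int],
) -> None:
    """Hypothetical helper: raise ValueError if the list lengths disagree."""
    per_block = {
        "n_conv_layers_per_vgg_block": n_conv_layers_per_vgg_block,
        "kernel_sizes_per_vgg_block": kernel_sizes_per_vgg_block,
        "n_channels_per_vgg_block": n_channels_per_vgg_block,
        "vgg_pooling_kernel_size": vgg_pooling_kernel_size,
    }
    # Every per-block argument needs exactly one entry per VGG block.
    for name, value in per_block.items():
        if len(value) != n_vgg_blocks:
            raise ValueError(
                f"{name} must have {n_vgg_blocks} entries, got {len(value)}"
            )
    # Each inner list needs one entry per convolution layer of its block.
    nested_args = {
        "kernel_sizes_per_vgg_block": kernel_sizes_per_vgg_block,
        "n_channels_per_vgg_block": n_channels_per_vgg_block,
    }
    for i, n_conv in enumerate(n_conv_layers_per_vgg_block):
        for name, nested in nested_args.items():
            if len(nested[i]) != n_conv:
                raise ValueError(
                    f"{name}[{i}] must have {n_conv} entries, got {len(nested[i])}"
                )


# Two blocks with two convolution layers each (illustrative values only).
example = dict(
    n_vgg_blocks=2,
    n_conv_layers_per_vgg_block=[2, 2],
    kernel_sizes_per_vgg_block=[[3, 3], [3, 3]],
    n_channels_per_vgg_block=[[32, 32], [64, 64]],
    vgg_pooling_kernel_size=[2, 2],
)
check_vgg_config(**example)  # passes silently when the shapes agree
```

Running a check like this first gives a readable error instead of a failure deep inside layer construction.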

d_model: int

emb_dim: int

ff_size: int

h: int

in_features: int

joint_size: int

kernel_sizes_per_vgg_block: List[List[int]]

left_size: int

masking_value: int = -1000000000000000.0

n_channels_per_vgg_block: List[List[int]]

n_conv_layers_per_vgg_block: List[int]

n_dec_layers: int

n_layers: int

n_vgg_blocks: int

p_dropout: float

right_size: int

rnn_type: str

vgg_pooling_kernel_size: List[int]

class speeq.models.templates.Wav2LetterTemp(in_features: int, n_conv_layers: int, layers_kernel_size: int, layers_channels_size: int, pre_conv_stride: int, pre_conv_kernel_size: int, post_conv_channels_size: int, post_conv_kernel_size: int, p_dropout: float, wav_kernel_size: Optional[int] = None, wav_stride: Optional[int] = None)[source]

Bases: BaseTemplate

Wav2Letter model template https://arxiv.org/abs/1609.03193

Args:

in_features (int): The input/speech feature size.

n_conv_layers (int): The number of convolution layers.

layers_kernel_size (int): The kernel size of the convolution layers.

layers_channels_size (int): The number of output channels of each convolution layer.

pre_conv_stride (int): The stride of the prenet convolution layer.

pre_conv_kernel_size (int): The kernel size of the prenet convolution layer.

post_conv_channels_size (int): The number of output channels of the postnet convolution layer.

post_conv_kernel_size (int): The kernel size of the postnet convolution layer.

p_dropout (float): The dropout rate.

wav_kernel_size (Optional[int]): The kernel size of the first layer, which processes the wav samples directly. Used only if the raw waveform is modeled. Defaults to None.

wav_stride (Optional[int]): The stride of the first layer, which processes the wav samples directly. Used only if the raw waveform is modeled. Defaults to None.

in_features: int

layers_channels_size: int

layers_kernel_size: int

n_conv_layers: int

p_dropout: float

post_conv_channels_size: int

post_conv_kernel_size: int

pre_conv_kernel_size: int

pre_conv_stride: int

wav_kernel_size: Optional[int] = None

wav_stride: Optional[int] = None

Transducer Models

The transducer module provides implementations for different models used in speech recognition based on the transducer architecture.

Classes:

RNNTransducer: An implementation of the RNN transducer model.
ConformerTransducer: An implementation of the Conformer transducer model.
ContextNet: An implementation of the ContextNet transducer model.
TransformerTransducer: An implementation of the Transformer transducer model.
VGGTransformerTransducer: An implementation of the VGGTransformer transducer model with truncated self attention.

class speeq.models.transducers.ConformerTransducer(d_model: int, n_conf_layers: int, n_dec_layers: int, ff_expansion_factor: int, h: int, kernel_size: int, ss_kernel_size: int, ss_stride: int, ss_num_conv_layers: int, in_features: int, res_scaling: float, n_classes: int, emb_dim: int, rnn_type: str, p_dropout: float)[source]

Bases: RNNTransducer

Implements the conformer transducer model proposed in https://arxiv.org/abs/2005.08100

Args:

d_model (int): The model dimension.

n_conf_layers (int): The number of conformer blocks.

n_dec_layers (int): The number of RNNs in the decoder (predictor).

ff_expansion_factor (int): The feed-forward expansion factor.

h (int): The number of attention heads.

kernel_size (int): The convolution module kernel size.

ss_kernel_size (int): The subsampling layer kernel size.

ss_stride (int): The subsampling layer stride size.

ss_num_conv_layers (int): The number of subsampling convolutional layers.

in_features (int): The input/speech feature size.

res_scaling (float): The residual connection multiplier.

n_classes (int): The number of classes/vocabulary.

emb_dim (int): The embedding layer’s size.

rnn_type (str): The RNN type. It has to be one of rnn, gru, or lstm.

p_dropout (float): The dropout rate.
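
To see how ss_kernel_size, ss_stride, and ss_num_conv_layers interact, the sketch below estimates the sequence length after the subsampling stack. It uses the standard output-length formula for a convolution and assumes no padding and no dilation; whether speeq pads its subsampling layers is not stated here, so treat the numbers as illustrative. subsampled_length is a hypothetical helper, not a speeq function.

```python
def subsampled_length(
    length: int, kernel_size: int, stride: int, num_layers: int
) -> int:
    """Length after `num_layers` unpadded, undilated 1D convolutions."""
    for _ in range(num_layers):
        # Standard convolution output length: floor((L - k) / s) + 1.
        length = (length - kernel_size) // stride + 1
    return length


# 1000 input frames, two layers with kernel size 3 and stride 2.
print(subsampled_length(1000, kernel_size=3, stride=2, num_layers=2))  # 249
```

Each stride-2 layer roughly halves the number of frames, so two layers reduce the sequence by about a factor of four.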

training: bool

class speeq.models.transducers.ContextNet(in_features: int, n_classes: int, emb_dim: int, n_layers: int, n_dec_layers: int, n_sub_layers: Union[int, List[int]], stride: Union[int, List[int]], out_channels: Union[int, List[int]], kernel_size: int, reduction_factor: int, rnn_type: str)[source]

Bases: _BaseTransducer

Implements the ContextNet transducer model proposed in https://arxiv.org/abs/2005.03191

Args:

in_features (int): The input feature size.

n_classes (int): The number of classes/vocabulary.

emb_dim (int): The embedding layer’s size.

n_layers (int): The number of ContextNet blocks.

n_dec_layers (int): The number of RNNs in the decoder (predictor).

n_sub_layers (Union[int, List[int]]): The number of convolutional layers per block. If a list is passed, it has to be of length equal to n_layers.

stride (Union[int, List[int]]): The stride of the last convolutional layer in each block. If a list is passed, it has to be of length equal to n_layers.

out_channels (Union[int, List[int]]): The number of channels of the convolutional layers in each block. If a list is passed, it has to be of length equal to n_layers.

kernel_size (int): The convolutional layers kernel size.

reduction_factor (int): The feature reduction size of the Squeeze-and-excitation module.

rnn_type (str): The RNN type. It has to be one of rnn, gru, or lstm.
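
The Union[int, List[int]] arguments above accept either one value shared by all blocks or one value per block. A minimal sketch of that convention follows; per_block is a hypothetical helper, and speeq's internal handling may differ.

```python
from typing import List, Union


def per_block(value: Union[int, List[int]], n_layers: int, name: str) -> List[int]:
    """Expand an int to one entry per block, or validate a list's length."""
    if isinstance(value, int):
        # A single int is shared by every block.
        return [value] * n_layers
    if len(value) != n_layers:
        raise ValueError(f"{name} must have {n_layers} entries, got {len(value)}")
    return list(value)


n_layers = 3
strides = per_block(1, n_layers, "stride")
out_channels = per_block([256, 256, 512], n_layers, "out_channels")
print(strides)       # [1, 1, 1]
print(out_channels)  # [256, 256, 512]
```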

training: bool

class speeq.models.transducers.RNNTransducer(in_features: int, n_classes: int, emb_dim: int, n_layers: int, n_dec_layers: int, hidden_size: int, bidirectional: bool, rnn_type: str, p_dropout: float)[source]

Bases: _BaseTransducer

Implements the RNN transducer model proposed in https://arxiv.org/abs/1211.3711

Args:

in_features (int): The input feature size.

n_classes (int): The number of classes/vocabulary.

emb_dim (int): The embedding layer’s size.

n_layers (int): The number of RNN layers in the encoder.

n_dec_layers (int): The number of RNNs in the decoder (predictor).

hidden_size (int): The hidden size of the RNN layers.

bidirectional (bool): A flag indicating whether the RNN is bidirectional.

rnn_type (str): The RNN type.

p_dropout (float): The dropout rate.

training: bool

class speeq.models.transducers.TransformerTransducer(in_features: int, n_classes: int, n_layers: int, n_dec_layers: int, d_model: int, ff_size: int, h: int, joint_size: int, enc_left_size: int, enc_right_size: int, dec_left_size: int, dec_right_size: int, p_dropout: float, stride: int = 1, kernel_size: int = 1, masking_value: int = -1000000000000000.0)[source]

Bases: Module

Implements the Transformer-Transducer model as described in https://arxiv.org/abs/2002.02562

Args:

in_features (int): The input feature size.

n_classes (int): The number of classes/vocabulary.

n_layers (int): The number of transformer encoder layers with truncated self attention.

n_dec_layers (int): The number of layers in the decoder (predictor).

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of heads in the attention mechanism.

joint_size (int): The joint layer feature size.

enc_left_size (int): The size of the left window that each time step is allowed to look at in the encoder.

enc_right_size (int): The size of the right window that each time step is allowed to look at in the encoder.

dec_left_size (int): The size of the left window that each time step is allowed to look at in the decoder.

dec_right_size (int): The size of the right window that each time step is allowed to look at in the decoder.

p_dropout (float): The dropout rate.

stride (int): The stride of the convolution layer in the prenet. Default 1.

kernel_size (int): The kernel size of the convolution layer in the prenet. Default 1.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.
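
The left and right window sizes restrict which positions each time step may attend to. The sketch below builds such a truncated attention mask in plain Python. window_mask is a hypothetical helper, and it assumes the window includes the current step itself; speeq's exact convention is not stated here.

```python
from typing import List


def window_mask(seq_len: int, left_size: int, right_size: int) -> List[List[bool]]:
    """mask[i][j] is True when step i may attend to step j."""
    return [
        [i - left_size <= j <= i + right_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = window_mask(seq_len=5, left_size=2, right_size=1)
for row in mask:
    # "x" marks an allowed position, "." a blocked one.
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xxx..
# xxxx.
# .xxxx
# ..xxx
```

A right size of 0 would make the layer causal, since no step could see later positions.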

forward(speech: Tensor, speech_mask: Tensor, text: Tensor, text_mask: Tensor, *args, **kwargs) → Tuple[Tensor, Tensor, Tensor][source]

Passes the input through the model.

Args:

speech (Tensor): The input speech of shape [B, M, d]

speech_mask (Tensor): The speech mask of shape [B, M]

text (Tensor): The text input of shape [B, N]

text_mask (Tensor): The text mask of shape [B, N]

Returns:

Tuple[Tensor, Tensor, Tensor]: A tuple of three tensors. The first holds the predictions of shape [B, M, N, C]; the last two hold the speech lengths and the text lengths, each of shape [B].
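
The two length tensors relate to the input masks in a simple way when no subsampling is applied. The sketch below uses nested lists in place of tensors and assumes that True marks a valid (unpadded) position, which is an assumption about the mask convention rather than something stated here. lengths_from_mask is a hypothetical helper.

```python
from typing import List


def lengths_from_mask(mask: List[List[bool]]) -> List[int]:
    """Count the valid positions in each row of a [B, T] boolean mask."""
    return [sum(row) for row in mask]


# Batch of two examples: speech mask of shape [2, 4], text mask of shape [2, 3].
speech_mask = [[True, True, True, True], [True, True, False, False]]
text_mask = [[True, True, True], [True, False, False]]

speech_len = lengths_from_mask(speech_mask)
text_len = lengths_from_mask(text_mask)
print(speech_len)  # [4, 2]
print(text_len)    # [3, 1]
```

With a prenet stride greater than 1, the speech lengths returned by the model would be shorter than these counts.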

training: bool

class speeq.models.transducers.VGGTransformerTransducer(in_features: int, n_classes: int, emb_dim: int, n_layers: int, n_dec_layers: int, rnn_type: str, n_vgg_blocks: int, n_conv_layers_per_vgg_block: List[int], kernel_sizes_per_vgg_block: List[List[int]], n_channels_per_vgg_block: List[List[int]], vgg_pooling_kernel_size: List[int], d_model: int, ff_size: int, h: int, joint_size: int, left_size: int, right_size: int, p_dropout: float, masking_value: int = -1000000000000000.0)[source]

Bases: RNNTransducer

Implements the VGG Transformer transducer model with truncated self attention as described in https://arxiv.org/abs/1910.12977

Args:

in_features (int): The input feature size.

n_classes (int): The number of classes/vocabulary.

emb_dim (int): The embedding layer’s size.

n_layers (int): The number of transformer encoder layers with truncated self attention.

n_dec_layers (int): The number of RNNs in the decoder (predictor).

rnn_type (str): The RNN type.

n_vgg_blocks (int): The number of VGG blocks to use.

n_conv_layers_per_vgg_block (List[int]): A list of integers that specifies the number of convolution layers in each block.

kernel_sizes_per_vgg_block (List[List[int]]): A list of lists that contains the kernel size for each layer in each block. The length of the outer list should match n_vgg_blocks, and each inner list should be the same length as the corresponding block’s number of layers.

n_channels_per_vgg_block (List[List[int]]): A list of lists that contains the number of channels for each convolution layer in each block. This argument should also have length equal to n_vgg_blocks, and each sublist should have length equal to the number of layers in the corresponding block.

vgg_pooling_kernel_size (List[int]): A list of integers that specifies the size of the max pooling layer in each block. The length of this list should be equal to n_vgg_blocks.

d_model (int): The model dimensionality.

ff_size (int): The feed forward inner layer dimensionality.

h (int): The number of heads in the attention mechanism.

joint_size (int): The joint layer feature size (denoted as d_o in the paper).

left_size (int): The size of the left window that each time step is allowed to look at.

right_size (int): The size of the right window that each time step is allowed to look at.

p_dropout (float): The dropout rate.

masking_value (float, optional): The value to use for masking padded elements. Defaults to -1e15.
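
A large negative masking_value works because softmax turns it into a weight of zero. The sketch below shows this with plain Python floats. masked_softmax is a hypothetical helper, and it assumes the masking value replaces the scores at padded positions; speeq may apply it differently.

```python
import math
from typing import List

MASKING_VALUE = -1e15  # same magnitude as the documented default


def masked_softmax(scores: List[float], keep: List[bool]) -> List[float]:
    """Softmax over `scores`, with padded positions (keep=False) masked out."""
    masked = [s if k else MASKING_VALUE for s, k in zip(scores, keep)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(m - peak) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_softmax([2.0, 1.0, 0.5], keep=[True, True, False])
print(weights)  # the last weight is exactly 0.0
```

The remaining weights still sum to 1, so padded positions contribute nothing to the attention output.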

training: bool