Tutorials

Framework structure

The SpeeQ framework is organized in the following structure:

speeq
├── config
├── constants
├── data
│   ├── augmenters
│   ├── decorators
│   ├── loaders
│   ├── padders
│   ├── processes
│   ├── processors
│   ├── registry
│   └── tokenizers
├── interfaces
├── models
│   ├── activations
│   ├── ctc
│   ├── decoders
│   ├── encoders
│   ├── layers
│   ├── registry
│   ├── seq2seq
│   ├── skeletons
│   ├── templates
│   └── transducers
├── predictors
├── trainers
│   ├── criterions
│   ├── decorators
│   ├── registry
│   ├── schedulers
│   ├── templates
│   └── trainers
├── utils
│   ├── loggers
│   └── utils
└── _version

Components

The main components of the SpeeQ framework are:

Config

Includes all configuration objects that are necessary to initialize and instantiate various objects and jobs, such as trainers, predictors, models, and data input/output.

Data

Encompasses all data-related components, including data augmenters, data loaders, data padders, and tokenizers, as well as various data preprocessing pipelines.

Models

Comprises all pre-implemented speech recognition models, along with layers, encoders, and decoders.

Predictors

Consists of different modules for speech recognition prediction, which can be used at the inference stage with a pre-trained model.

Trainers

Incorporates all modules and components required for training speech recognition models.

Utils

Contains various helper functionalities and training loggers.

Data Preparation

To use the framework, it is essential to structure all datasets into a CSV file that contains two fields: “file_path” for the audio file and “text” for the transcription. If you want to sort the data, add a sorting column to the CSV file. It’s crucial to verify that all paths are valid and that the corresponding files exist.

Here is an example of a CSV file:

file_path,text,cleaned_text,notes,duration
audios/1.wav,the cat sat on the mat,the cat sat on the mat,A note!,12.5
audios/2.wav,what do you read?,what do you read,A note!,3.5
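Since every path in the CSV must point to an existing file, it can help to check the file before launching a job. The following is a minimal sketch that uses only the Python standard library; the helper name find_csv_problems is hypothetical and is not part of SpeeQ. It assumes that relative paths are resolved against the current working directory and that no field spans several lines.

```python
import csv
import os
from typing import List, Sequence


def find_csv_problems(
    csv_path: str, required: Sequence[str] = ("file_path", "text")
) -> List[str]:
    """Return human-readable problems found in a dataset CSV (empty if none)."""
    problems: List[str] = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [col for col in required if col not in header]
        if missing:
            return ["missing column(s): " + ", ".join(missing)]
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            path = (row.get("file_path") or "").strip()
            text = (row.get("text") or "").strip()
            if not os.path.isfile(path):
                problems.append(f"line {line_no}: audio file not found: {path!r}")
            if not text:
                problems.append(f"line {line_no}: empty transcription")
    return problems
```

Calling find_csv_problems('path/to/train.csv') returns an empty list when the required columns are present, every audio file exists, and every transcription is non-empty.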

Train your first model

To start a training job in the SpeeQ framework, you need to configure three main things: the model you want to train, the data you want to train it on together with the processing pipeline for that data, and the training procedure. Each of these is described by a configuration object, and all three can be found in the speeq.config module:

ModelConfig
ASRDataConfig
TrainerConfig

In the rest of this tutorial, we will explain how to create each of these objects and how to successfully launch a training job.

Model Building

Within the framework, models can be constructed using what is referred to as a “template.” This template serves as a structure that must be filled with the model configuration, and all available model templates can be found in the “speeq.models.templates” module.

There are two types of model templates:

Static templates: These allow you to alter only the model configuration while keeping the model architecture unchanged. This option provides the flexibility to replicate previous research works.
Dynamic templates: These permit changes to both the model architecture and configuration, enabling the combination of components from different layers or the creation of custom models. For that reason, we call them model builders.

Within the “speeq.models.templates” module, only three Dynamic templates (model builders) are available. These are:

CTCModelBuilderTemp: This template is utilized for constructing and customizing CTC models.
Seq2SeqBuilderTemp: This template is used for constructing and customizing Seq2Seq models (Encoder/Decoder).
TransducerBuilderTemp: This template is used for constructing and customizing Transducer models.

All other templates within the module are Static templates.

Example on Static templates:

Suppose you want to experiment with the Speech Transformer model. You can build its template as shown below; the same approach applies to any other model architecture.

# import nn (for the prediction activation) and the templates module
from torch import nn
from speeq.models import templates

template = templates.SpeechTransformerTemp(
    in_features=160,
    n_conv_layers=3,
    kernel_size=32,
    stride=2,
    d_model=512,
    n_enc_layers=8,
    n_dec_layers=8,
    ff_size=1024,
    h=8,
    att_kernel_size=16,
    att_out_channels=512,
    pred_activation=nn.Softmax(dim=-1)
)

Example on Dynamic templates:

Suppose you want to experiment with a dummy feed-forward CTC model. You can build its template as shown below.

# importing the templates module
from torch import nn
from speeq.models import templates

# define the encoder
class Encoder(nn.Module):
    def __init__(self, in_features: int, feat_size: int):
        super().__init__()
        self.fc = nn.Linear(in_features, feat_size)

    def forward(self, x, mask):
        # x is of shape [B, M, in_features]
        lengths = mask.sum(dim=-1)
        out = self.fc(x)
        return out, lengths

# define an instance of the encoder
feat_size = 512
encoder = Encoder(80, feat_size)

# define the template
template = templates.CTCModelBuilderTemp(
    encoder=encoder,
    feat_size=feat_size
)

Once you have defined the model structure and architecture using a template, it is time to create the model. This is done by creating a configuration object that includes the template and, if available, the path of a pre-trained model. You can then pass the model configuration object to the get_model function found in the speeq.models.registry module, as shown in the code below.

# import the registry to use the get_model function
from torch import nn
from speeq.models import registry, templates
# import ModelConfig to set up the model configuration
from speeq.config import ModelConfig

# defining a dummy template
template = templates.SpeechTransformerTemp(
    in_features=160,
    n_conv_layers=3,
    kernel_size=32,
    stride=2,
    d_model=512,
    n_enc_layers=8,
    n_dec_layers=8,
    ff_size=1024,
    h=8,
    att_kernel_size=16,
    att_out_channels=512,
    pred_activation=nn.Softmax(dim=-1)
)

# creating model configuration object
model_cfg = ModelConfig(template=template)
# creating the model
model = registry.get_model(model_config=model_cfg, n_classes=5)

Setting up data pipelines

With the data in the CSV format described earlier, we can use the data loaders built into the framework. To train a model or launch a training job, we first have to define a speech processor and a text processor, and then the speeq.config.ASRDataConfig object.

Defining Speech and Text processor

Both processors are required to be instances of speeq.data.processors.OrderedProcessor, which holds a list of process objects that implement the speeq.interfaces.IProcess interface. For convenience, the framework provides a set of commonly used predefined processes, which can be found in the speeq.data.processes module.

Text Processor

Suppose we want to create a dummy text processor that consists of two processes: text stripping (which removes trailing and leading white spaces from the input text) and lowering the text. We can do this by creating the processes first and then passing them to the processor, as shown in the code below:

from speeq.data.processors import OrderedProcessor
from speeq.interfaces import IProcess

# Create the text stripper process.
class TextStripper(IProcess):
    def run(self, text: str) -> str:
        return text.strip()

# Create the text lowering process.
class TextLowering(IProcess):
    def run(self, text: str) -> str:
        return text.lower()

# Create a list of the processes that will be executed by the processor in order.
processes = [TextStripper(), TextLowering()]

# Create the text processor.
text_processor = OrderedProcessor(processes)
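To make the chaining behaviour concrete, here is a small standalone sketch of the pattern in plain Python. It does not import SpeeQ: the classes Strip, Lower, and OrderedChain and the method name execute are hypothetical stand-ins chosen for illustration, and they are not claimed to match the internals or the public interface of OrderedProcessor.

```python
from typing import Any, List


class Strip:
    """Removes leading and trailing white space."""

    def run(self, text: str) -> str:
        return text.strip()


class Lower:
    """Converts the text to lower case."""

    def run(self, text: str) -> str:
        return text.lower()


class OrderedChain:
    """Hypothetical stand-in: feeds each process's output into the next one."""

    def __init__(self, processes: List[Any]):
        self.processes = processes

    def execute(self, x: Any) -> Any:
        # Apply the processes one after another, in list order.
        for process in self.processes:
            x = process.run(x)
        return x


chain = OrderedChain([Strip(), Lower()])
result = chain.execute("  What Do You Read?  ")
print(result)
```

The input is first stripped and then lowered, so the printed value is the trimmed, lower-case sentence.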

Speech Processor

The speech processor can be created in a similar way as the text processor. However, for the speech processor, there are predefined processes that fall into two categories: speech processors and speech augmenters. These can be accessed from the speeq.data.processes and speeq.data.augmenters modules.

The processes must always be executed in a specific order, whereas the augmenters can be applied in various orders to produce different augmentation combinations. To address this, we have two types of processors: OrderedProcessor and StochasticProcessor. The former applies processes in sequential order, while the latter shuffles the processes before applying them sequentially to the input.
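The difference between the two execution strategies can be sketched in a few lines of plain Python. This is an illustration of the idea only, under the assumption that each step is a callable taking and returning a value; run_ordered and run_stochastic are hypothetical helpers, not SpeeQ functions.

```python
import random
from typing import Any, Callable, List, Optional

Step = Callable[[Any], Any]


def run_ordered(steps: List[Step], x: Any) -> Any:
    """Apply the steps in the order they were given."""
    for step in steps:
        x = step(x)
    return x


def run_stochastic(steps: List[Step], x: Any, rng: Optional[random.Random] = None) -> Any:
    """Shuffle a copy of the steps, then apply them sequentially."""
    rng = rng or random.Random()
    shuffled = list(steps)  # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    for step in shuffled:
        x = step(x)
    return x


def add_one(v: int) -> int:
    return v + 1


def double(v: int) -> int:
    return v * 2


steps = [add_one, double]
ordered_result = run_ordered(steps, 3)  # always (3 + 1) * 2
stochastic_result = run_stochastic(steps, 3, random.Random(0))  # depends on the shuffle
```

Because add_one and double do not commute, the ordered run always yields the same value, while the stochastic run yields either (3 + 1) * 2 or 3 * 2 + 1 depending on the shuffled order. This mirrors why fixed processing steps belong in an ordered processor and interchangeable augmenters in a stochastic one.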

To combine these processors, the framework provides speeq.data.processors.SpeechProcessor. Below is an example of how to create a speech processor in both scenarios:

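from speeq.data.processors import SpeechProcessor, OrderedProcessor, StochasticProcessor
# AudioLoader and VariableAttenuator are used below; their import locations are
# assumed to follow the processes/augmenters split described above
from speeq.data.processes import AudioLoader, FeatExtractor
from speeq.data.augmenters import FrequencyMasking, VariableAttenuator, WhiteNoiseInjector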

# Create a speech processor that loads speech and extracts mel-scale spectrogram
sample_rate = 16000
speech_processor = OrderedProcessor([
    AudioLoader(sample_rate=sample_rate),
    FeatExtractor(feat_ext_name='melspec', feat_ext_args={})
])

# If you want to add data augmentation in an ordered process, you can add augmenters as follows:
speech_processor_with_aug = OrderedProcessor([
    AudioLoader(sample_rate=sample_rate),
    # Time-domain augmentation
    WhiteNoiseInjector(ratio=0.3),
    FeatExtractor(feat_ext_name='melspec', feat_ext_args={}),
    # Frequency-domain augmentation
    FrequencyMasking(n=5, max_length=10, ratio=0.2)
])

# However, if you have more than one time-domain or frequency-domain augmenter and you want to
# shuffle their execution order, you can use `SpeechProcessor` as follows:

speech_file_processor = OrderedProcessor([
    AudioLoader(sample_rate=sample_rate)
])
time_domain_aug = StochasticProcessor([
    WhiteNoiseInjector(ratio=0.3),
    VariableAttenuator(ratio=0.4)
])
spec_processor = OrderedProcessor([
    FeatExtractor(feat_ext_name='melspec', feat_ext_args={})
])
freq_domain_aug = StochasticProcessor([
    FrequencyMasking(n=5, max_length=10, ratio=0.2)
])
speech_processor_with_rand_aug = SpeechProcessor(
    audio_processor=speech_file_processor,
    audio_augmenter=time_domain_aug,
    spec_processor=spec_processor,
    spec_augmenter=freq_domain_aug
)

"""The speech_processor_with_rand_aug will perform the following steps in a
specific order: first, it will provide the file path to speech_file_processor,
then it will pass the time domain signal to the time domain augmentation. After
that, it will extract the features using spec_processor and, finally, apply frequency
domain augmentation using freq_domain_aug."""

Building ASRDataConfig

Once the text processor and speech processor are built, we can create the data configuration object, which is similar to the model configuration. The code below demonstrates how to create an ASRDataConfig object:

from speeq.config import ASRDataConfig

data_cfg = ASRDataConfig(
    training_path='path/to/train.csv',
    testing_path='path/to/test.csv',
    speech_processor=speech_processor,
    text_processor=text_processor,
    tokenizer_path='outdir/tokenizer.json',
    tokenizer_type='char_tokenizer',
    add_sos_token=True,
    add_eos_token=True,
    sort_key='duration'
)

This will create a configuration object for ASR with training and testing data paths, speech and text processors, tokenizer information, and sorting criteria.
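To illustrate what a character-level tokenizer with start-of-sequence and end-of-sequence tokens does conceptually, here is a minimal standalone sketch. It is not SpeeQ's tokenizer: the token strings "<sos>" and "<eos>", the vocabulary layout, and the function names build_vocab and encode are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, List

SOS, EOS = "<sos>", "<eos>"  # hypothetical special-token names


def build_vocab(texts: Iterable[str]) -> Dict[str, int]:
    """Map the special tokens and every distinct character to an integer id."""
    chars = sorted({ch for text in texts for ch in text})
    return {token: idx for idx, token in enumerate([SOS, EOS] + chars)}


def encode(
    text: str, vocab: Dict[str, int], add_sos: bool = True, add_eos: bool = True
) -> List[int]:
    """Convert text to ids, optionally wrapping it with the SOS/EOS ids."""
    # A character missing from the vocabulary raises KeyError in this sketch.
    ids = [vocab[ch] for ch in text]
    if add_sos:
        ids = [vocab[SOS]] + ids
    if add_eos:
        ids = ids + [vocab[EOS]]
    return ids


vocab = build_vocab(["the cat", "a mat"])
ids = encode("cat", vocab)
```

With both flags enabled, the encoded sequence starts with the SOS id and ends with the EOS id, which is the role the add_sos_token and add_eos_token options play in the configuration above.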

Training

The function speeq.trainers.trainers.launch_training_job is used to start a training job, and it takes three input parameters: the model configuration, the data configuration, and the training configuration. The following code demonstrates how to create the trainer configuration, which specifies the training settings:

from speeq.config import TrainerConfig, DistConfig

# create a single-GPU training configuration
single_gpu_trainer_cfg = TrainerConfig(
    name='seq2seq', # can be 'ctc', 'seq2seq', or 'transducer', depending on the selected model
    batch_size=16,
    epochs=100,
    outdir='outdir',
    logdir='outdir/logs',
    log_steps_frequency=100,
    criterion='ctc',
    optimizer='adam',
    optim_args={'lr': 0.005},
    device='cuda'
)

# use DDP for multiple-GPU training
dist_cfg = DistConfig(
    port=12345,
    n_gpus=2,
    address='tcp://localhost:5555',
    backend='nccl'
)
multiple_gpus_trainer_cfg = TrainerConfig(
    name='seq2seq',
    batch_size=16,
    epochs=100,
    outdir='outdir',
    logdir='outdir/logs',
    log_steps_frequency=100,
    criterion='ctc',
    optimizer='adam',
    optim_args={'lr': 0.005},
    dist_config=dist_cfg
)

After creating the trainer configuration, and having the model and data configuration objects, you can easily launch the training job using the following code:

from speeq.trainers.trainers import launch_training_job

launch_training_job(
    # or the single-GPU configuration created above
    trainer_config=multiple_gpus_trainer_cfg,
    data_config=data_cfg,
    model_config=model_cfg
)

Prediction

To use a pre-trained model for prediction, follow these steps. First, create a speech processor object and a model configuration object as done previously; this time, however, the model configuration object must be given the path of the pre-trained model so that it can be loaded. Then use a predictor from the speeq.predictors module.

Here is an example of how to use a CTC predictor with a pre-trained model and a speech processor object:

from speeq.predictors import CTCPredictor

predictor = CTCPredictor(
    speech_processor=speech_processor,
    tokenizer_path='path/to/tokenizer.json',
    model_config=model_config,
    device='cuda',
)

# Use the predictor to transcribe an audio file
print(predictor.predict('path/to/audio.wav'))