Finite Automata in Natural Language Processing

Chetan Lohkare
8 min readJun 8, 2022

A Friendly Introduction

Natural language processing (NLP) comprises a number of tasks that deal with the processing of natural language through computers. The theory of automata plays a significant role in providing solutions to many problems in natural language processing. For example, speech recognition, spelling correction, information retrieval, etc. Finite state methods are pretty useful in processing natural language as the modeling of information using rules has many advantages for language modeling. Finite state automaton has a mathematical model which is quite understandable; data can be represented in a compacted form using finite state automaton and it allows automatic compilation of system components.

Finite state automata (deterministic and non-deterministic finite automata) provide decisions regarding the acceptance and rejection of a string while transducers provide some output for a given input. Thus the two machines are quite useful in language processing tasks. Finite state automata are useful in deciding whether a given word belongs to a particular language or not.

Finite State Automata

The term automata, derived from the Greek word “αὐτόματα” meaning “self-acting”, is the plural of automaton which may be defined as an abstract self-propelled computing device that follows a predetermined sequence of operations automatically.

An automaton having a finite number of states is called a Finite Automaton (FA) or Finite State automata (FSA).

Mathematically, an automaton can be represented by 5-tuple (Q, Σ, δ, q0, F), where −
► Q is a finite set of states.
► Σ is a finite set of symbols, called the alphabet of the automaton
► δ is the transition function
► q0 is the initial state from where any input is processed (q0 ∈ Q)
► F is a set of final state/states of Q (F ⊆ Q)

Types of Finite State Automation (FSA)

Finite state automation is of two types.

Deterministic Finite Automation (DFA)

It may be defined as the type of finite automation wherein, for every input symbol we can determine the state to which the machine will move. It has a finite number of states which is why the machine is called Deterministic Finite Automaton (DFA).

Mathematically, a DFA can be represented by 5-tuple (Q, Σ, δ, q0, F), where −
► Q is a finite set of states.
► Σ is a finite set of symbols, called the alphabet of the automaton.
► δ is the transition function where δ: Q × Σ → Q .
► q0 is the initial state from where any input is processed (q0 ∈ Q).
► F is a set of final state/states of Q (F ⊆ Q).

Whereas graphically, a DFA can be represented by diagraphs called state diagrams where −
► The states are represented by vertices.
► The transitions are shown by labeled arcs.
► The initial state is represented by an empty incoming arc.
► The final state is represented by double circle.

Example of DFA —

Non-deterministic Finite Automation (NDFA)

It may be defined as the type of finite automation where for every input symbol we cannot determine the state to which the machine will move i.e. the machine can move to any combination of the states. It has a finite number of states which is why the machine is called Non-deterministic Finite Automation (NDFA).

Mathematically, NDFA can be represented by 5-tuple (Q, Σ, δ, q0, F), where −
► Q is a finite set of states.
► Σ is a finite set of symbols, called the alphabet of the automaton.
► δ is the transition function where δ: Q × Σ → 2 Q.
► q0 is the initial state from where any input is processed (q0 ∈ Q).
► F is a set of final state/states of Q (F ⊆ Q).

Whereas graphically (same as DFA), a NDFA can be represented by diagraphs called state diagrams where −
► The states are represented by vertices.
► The transitions are shown by labeled arcs.
► The initial state is represented by an empty incoming arc.
► The final state is represented by double circle.

Example of NDFA —

Finite Automata in NLP

Finite State Transducer

A finite-state transducer (FST) is a finite-state machine with two memory tapes, following the terminology for Turing machines: an input tape and an output tape. This contrasts with an ordinary finite-state automaton, which has a single tape. An FST is a type of finite-state automaton (FSA) that maps between two sets of symbols. An FST is more general than an FSA. An FSA defines a formal language by defining a set of accepted strings, while an FST defines relations between sets of strings.

Language Recognizer

There are many tasks that need language recognizing mechanism. For example, spelling checker, morphological analysis, language identification etc.. Finite state machine are quite useful as a language recognizer. For a given word, a NFA can be designed easily that recognize the word. For example, NFA for the words ‘boy’ and ‘bat’ is shown in the Figure below. Similarly for every word a NFA can be designed and the different NFA’s can be combined to form spelling checker or dictionary compilation for a language.

NFA for the words ‘boy’ and ‘bat’

For each category of words, we can form a separate NFA and then combine them using transitions. For example, nouns and their plural can be recognized through one NFA and verbs and their different forms can be recognized through another NFA and finally, the two NFA can be combined. The figure below shows the NFA for some words and their morphological variations.

NFA for some words and their morphological variations

A Simple Machine that can laugh

A finite state generator is a simple computing machine that outputs a sequence of symbols.

It starts in some start state and then tries to reach a final state by making transitions from one state to another. Every time it makes such a transition it emits (or writes or generates) a symbol.

So, what does the generator in the pictures say? It laughs. It generates sequences of symbols of the form ha! or haha! or hahaha! or hahahaha! and so on. Why does it behave like that? Well, it first has to make a transition emitting h. The state that it reaches through this transition is not a final state. So, it has to keep on going emitting an a. Here, it has two possibilities: it can either follow the ! arrow, emitting ! and then stopping in the final state (but remember, it can’t look ahead to see that it would reach a final state with the ! transition) or it can follow the h arrow emitting an h and going back to the state where it just came from.

Finite state generators can be thought of as directed graphs. And in fact finite state generators are usually drawn as directed graphs. Here is our laughing machine as we will from now on draw finite state generators:

Morphological parsing and generation

Morphological parsing is the process of producing a lexical structure of a word, that is, breaking a word into stem and affixes and labeling the morphemes with category label. For example the word ‘books’ can be parsed as book+s and further as book + N+PL, the word ‘went’ can be parsed as go+V+PAST etc. generation is the reverse process of parsing, that is combining the lexical form of a word to produce the word. For example, box+N+Pl generates the word ‘boxes’. Finite state transducers are quite useful in morphological parsing. Let us consider the lexicon form of regular inflectional ‘girl +N +PL’. Figure below shows the transducer which convert the lexicon form ‘girl+N+PL’ into the word ‘girls’. Let us assume that x represents the word ‘girl’ for simplification purpose as for every regular singular noun the input and output will remain same in a transducer, that’s why a variable x can be used for a noun. The word ‘girl’ can be replaced by any other regular noun like ‘boy’.

Generation of word from its lexical form through a transducer

Here simple regular nouns have been considered on exemplarily basis, orthographic rules can be applied for irregular inflections and derivations and NFAs can be designed for every word and its variations in a certain language.

In other sense, we can say that morphology is the study of −
► The formation of words.
► The origin of the words.
► Grammatical forms of the words.
► Use of prefixes and suffixes in the formation of words.
► How parts-of-speech (PoS) of a language are formed.

Finite state automata for recognition

The approach to spelling rules that is described here involves the use of finite state transducers (FSTs). Rather than jumping straight into this, we will briefly consider the simpler finite state automata and how they can be used in a simple recogniser. Suppose we want to recognise dates (just day and month pairs) written in the format day/month. The day and the month may be expressed as one or two digits (e.g. 11/2, 1/12 etc). This format corresponds to the following simple FSA, where each character corresponds to one transition:

This is a non-deterministic FSA: for instance, an input starting with the digit 3 will move the FSA to both state 2 and state 3. This corresponds to a local ambiguity: i.e., one that will be resolved by subsequent context. By convention, there must be no ‘left over’ characters when the system is in the final state.

--

--

Chetan Lohkare

Student at Vishwakarma Institute of Technology, Pune