| Home | Trees | Indices | Help |
|
|---|
|
|
1 """
2 Finite state transducers.
3
4 A finite state trasducer, or FST, is a directed graph that is used to
5 encode a mapping from a set of I{input strings} to a set of I{output
6 strings}. An X{input string} is a sequence of immutable values (such
7 as integers, characters, or strings) called X{input symbols}.
8 Similarly, an C{output string} is a sequence of immutable values
9 called X{output symbols}. Collectively, input strings and output
10 strings are called X{symbol strings}, or simply X{strings} for short.
11 Note that this notion of I{string} is different from the python string
12 type -- symbol strings are always encoded as tuples of input or output
13 symbols, even if those symbols are characters. Also, note that empty
14 sequences are valid symbol strings.
15
16 The nodes of an FST are called X{states}, and the edges are called
17 X{transition arcs} or simply X{arcs}. States may be marked as
18 X{final}, and each final state is annotated with an output string,
19 called the X{finalizing string}. Each arc is annotated with an input
20 string and an output string. An arc with an empty input string is
21 called an I{epsilon-input arc}; and an arc with an empty output
22 string is called an I{epsilon-output arc}.
23
24 The set of mappings encoded by the FST are defined by the set of paths
25 through the graph, starting at a special state known as the X{initial
26 state}, and ending at a final state. In particular, the FST maps an
27 input string X to an output string Y iff there exists a path from the
28 initial state to a final state such that:
29
30 - The input string X is formed by concatenating the input strings
31 of the arcs along the path (in order).
32 - The output string Y is formed by concatenating the output strings
33 of the arcs along the path (in order), plus the final state's
34 output string.
35
36 The following list defines some terms that apply to finite state
37 transducers.
38
39 - The X{transduction} defined by a FST is the mapping from input
40 strings to output strings.
41
42 - An FST X{encodes a deterministic transduction} if each input
43 string maps to at most one output string. An FST X{encodes a
44 nondeterministic transduction} if any input string maps to more
45 than one output string.
46
47 - An FST is X{deterministic} if it every state contains at most one
48 outgoing arc that is consistent with any input string; otherwise,
49 the FST is X{nondeterministic}. If an FST is deterministic, then
50 it necessarily encodes a deterministic transduction; however, it
51 is possible to define an FST that is nondeterministic but that
52 encodes a deterministic transduction.
53
54 - An FST is X{sequential} if each arc is labeled with exactly one
55 input symbol, no two outgoing arcs from any state have the same
56 input symbol, and all finalizing strings are empty. (Sequential
57 implies deterministic).
58
59 - An FST is I{subsequential} if each arc is labeled with exactly
60 one input symbol, and no two outgoing arcs from any state have
61 the same input symbol. (Finalizing strings may be non-empty.)
62
63 An FSA can be represented as an FST that generates no output symbols.
64
65 The current FST class does not provide support for:
66
67 - Weighted arcs. (However, weights can be used as, or included
68 in, the output symbols. The total weight of a path can then
69 be found after transduction by combining the weights. But
70 there's no support for e.g., finding the path with the minimum
71 weight.
72
73 - Multiple initial states.
74
75 - Initializing strings (an output string associated with the initial
76 state, which is always generated when the FST begins).
77
78 Possible future changes:
79
80 - Define several classes, in a class hierarchy? E.g., FSA is a base
81 class, FST inherits from it. And maybe a further subclass to add
82 finalizing sequences. I would need to be more careful to only
83 access the private variables when necessary, and to usually go
84 through the accessor functions.
85 """
86
87 import re, os, random, tempfile
88 from subprocess import Popen, PIPE
89 from nltk_lite.draw import *
90 from nltk_lite.contrib.fst.draw_graph import *
91
92 ######################################################################
93 # CONTENTS
94 ######################################################################
95 # 1. Finite State Transducer
96 # - State information
97 # - Transition Arc Information
98 # - FST Information
99 # - State Modification
100 # - Transition Arc Modification
101 # - Transformations
102 # - Misc
103 # - Transduction
104 # 2. AT&T fsmtools support
105 # 3. Graphical Display
106 # - FSTDisplay
107 # - FSTDemo
108 ######################################################################
109
110 ######################################################################
111 #{ Finite State Transducer
112 ######################################################################
113
115 """
116 A finite state transducer. Each state is uniquely identified by a
117 label, which is typically a string name or an integer id. A
118 state's label is used to access and modify the state. Similarly,
119 each arc is uniquely identified by a label, which is used to
120 access and modify the arc.
121
122 The set of arcs pointing away from a state are that state's
123 I{outgoing} arcs. The set of arcs pointing to a state are that
124 state's I{incoming} arcs. The state at which an arc originates is
125 that arc's I{source} state (or C{src}), and the state at which it
126 terminates is its I{destination} state (or C{dst}).
127
128 It is possible to define an C{FST} object with no initial state.
129 This is represented by assigning a value of C{None} to the
130 C{initial_state} variable. C{FST}s with no initial state are
131 considered to encode an empty mapping. I.e., transducing any
132 string with such an C{FST} will result in failure.
133 """
135 """
136 Create a new finite state transducer, containing no states.
137 """
138 self.label = label
139 """A label identifying this FST. This is used for display &
140 debugging purposes only."""
141
142 #{ State Information
143 self._initial_state = None
144 """The label of the initial state, or C{None} if this FST
145 does not have an initial state."""
146
147 self._incoming = {}
148 """A dictionary mapping state labels to lists of incoming
149 transition arc labels."""
150
151 self._outgoing = {}
152 """A dictionary mapping state labels to lists of outgoing
153 transition arc labels."""
154
155 self._is_final = {}
156 """A dictionary mapping state labels to boolean values,
157 indicating whether the state is final."""
158
159 self._finalizing_string = {}
160 """A dictionary mapping state labels of final states to output
161 strings. This string should be added to the output
162 if the FST terminates at this state."""
163
164 self._state_descr = {}
165 """A dictionary mapping state labels to (optional) state
166 descriptions."""
167 #}
168
169 #{ Transition Arc Information
170 self._src = {}
171 """A dictionary mapping each transition arc label to the label of
172 its source state."""
173
174 self._dst = {}
175 """A dictionary mapping each transition arc label to the label of
176 its destination state."""
177
178 self._in_string = {}
179 """A dictionary mapping each transition arc label to its input
180 string, a (possibly empty) tuple of input symbols."""
181
182 self._out_string = {}
183 """A dictionary mapping each transition arc label to its output
184 string, a (possibly empty) tuple of input symbols."""
185
186 self._arc_descr = {}
187 """A dictionary mapping transition arc labels to (optional)
188 arc descriptions."""
189 #}
190
191 #////////////////////////////////////////////////////////////
192 #{ State Information
193 #////////////////////////////////////////////////////////////
194
196 """Return an iterator that will generate the state label of
197 each state in this FST."""
198 return iter(self._incoming)
199
201 """Return true if this FST contains a state with the given
202 label."""
203 return label in self._incoming
204
208 if label is not None and label not in self._incoming:
209 raise ValueError('Unknown state label %r' % label)
210 self._initial_state = label
211 initial_state = property(_get_initial_state, _set_initial_state,
212 doc="The label of the initial state (R/W).")
213
215 """Return an iterator that will generate the incoming
216 transition arcs for the given state. The effects of modifying
217 the FST's state while iterating are undefined, so if you plan
218 to modify the state, you should copy the incoming transition
219 arcs into a list first."""
220 return iter(self._incoming[state])
221
223 """Return an iterator that will generate the outgoing
224 transition arcs for the given state. The effects of modifying
225 the FST's state while iterating are undefined, so if you plan
226 to modify the state, you should copy the outgoing transition
227 arcs into a list first."""
228 return iter(self._outgoing[state])
229
231 """Return true if the state with the given state label is
232 final."""
233 return self._is_final[state]
234
236 """Return the output string associated with the given final
237 state. If the FST terminates at this state, then this string
238 will be emitted."""
239 #if not self._is_final[state]:
240 # raise ValueError('%s is not a final state' % state)
241 return self._finalizing_string.get(state, ())
242
244 """Return the description for the given state, if it has one;
245 or None, otherwise."""
246 return self._state_descr.get(state)
247
248 #////////////////////////////////////////////////////////////
249 #{ Transition Arc Information
250 #////////////////////////////////////////////////////////////
251
253 """Return an iterator that will generate the arc label of
254 each transition arc in this FST."""
255 return iter(self._src)
256
258 """Return the state label of this transition arc's source
259 state."""
260 return self._src[arc]
261
263 """Return the state label of this transition arc's destination
264 state."""
265 return self._dst[arc]
266
268 """Return the given transition arc's input string, a (possibly
269 empty) tuple of input symbols."""
270 return self._in_string[arc]
271
273 """Return the given transition arc's output string, a
274 (possibly empty) tuple of output symbols."""
275 return self._out_string[arc]
276
278 """Return the description for the given transition arc, if it
279 has one; or None, otherwise."""
280 return self._arc_descr.get(arc)
281
283 """Return a tuple (src, dst, in_string, out_string) for the
284 given arc, where:
285 - C{src} is the label of the arc's source state.
286 - C{dst} is the label of the arc's destination state.
287 - C{in_string} is the arc's input string.
288 - C{out_string} is the arc's output string.
289 """
290 return (self._src[arc], self._dst[arc],
291 self._in_string[arc], self._out_string[arc])
292
293 #////////////////////////////////////////////////////////////
294 #{ FST Information
295 #////////////////////////////////////////////////////////////
296
298 """
299 Return true if this FST is sequential.
300 """
301 for state in self.states():
302 if self.finalizing_string(state): return False
303 return self.is_subsequential()
304
306 """
307 Return true if this FST is subsequential.
308 """
309 for state in self.states():
310 out_syms = set()
311 for arc in self.outgoing(state):
312 out_string = self.out_string(arc)
313 if len(out_string) != 1: return False
314 if out_string[0] in out_syms: return False
315 out_syms.add(out_string)
316 return True
317
318 #////////////////////////////////////////////////////////////
319 #{ State Modification
320 #////////////////////////////////////////////////////////////
321
324 """
325 Create a new state, and return its label. The new state will
326 have no incoming or outgoing arcs. If C{label} is specified,
327 then it will be used as the state's label; otherwise, a new
328 unique label value will be chosen. The new state will be
329 final iff C{is_final} is true. C{descr} is an optional
330 description string for the new state.
331
332 Arguments should be specified using keywords!
333 """
334 label = self._pick_label(label, 'state', self._incoming)
335
336 # Add the state.
337 self._incoming[label] = []
338 self._outgoing[label] = []
339 self._is_final[label] = is_final
340 self._state_descr[label] = descr
341 self._finalizing_string[label] = tuple(finalizing_string)
342
343 # Return the new state's label.
344 return label
345
347 """
348 Delete the state with the given label. This will
349 automatically delete any incoming or outgoing arcs attached to
350 the state.
351 """
352 if label not in self._incoming:
353 raise ValueError('Unknown state label %r' % label)
354
355 # Delete the incoming/outgoing arcs.
356 for arc in self._incoming[label]:
357 del (self._src[arc], self._dst[arc], self._in_string[arc],
358 self._out_string[arc], self._arc_descr[arc])
359 for arc in self._outgoing[label]:
360 del (self._src[arc], self._dst[arc], self._in_string[arc],
361 self._out_string[arc], self._arc_descr[arc])
362
363 # Delete the state itself.
364 del (self._incoming[label], self._otugoing[label],
365 self._is_final[label], self._state_descr[label],
366 self._finalizing_string[label])
367
368 # Check if we just deleted the initial state.
369 if label == self._initial_state:
370 self._initial_state = None
371
373 """
374 If C{is_final} is true, then make the state with the given
375 label final; if C{is_final} is false, then make the state with
376 the given label non-final.
377 """
378 if state not in self._incoming:
379 raise ValueError('Unknown state label %r' % state)
380 self._is_final[state] = is_final
381
383 """
384 Set the given state's finalizing string.
385 """
386 if not self._is_final[state]:
387 raise ValueError('%s is not a final state' % state)
388 if state not in self._incoming:
389 raise ValueError('Unknown state label %r' % state)
390 self._finalizing_string[state] = tuple(finalizing_string)
391
393 """
394 Set the given state's description string.
395 """
396 if state not in self._incoming:
397 raise ValueError('Unknown state label %r' % state)
398 self._state_descr[state] = descr
399
401 """
402 Duplicate an existing state. I.e., create a new state M{s}
403 such that:
404 - M{s} is final iff C{orig_state} is final.
405 - If C{orig_state} is final, then M{s.finalizing_string}
406 is copied from C{orig_state}
407 - For each outgoing arc from C{orig_state}, M{s} has an
408 outgoing arc with the same input string, output
409 string, and destination state.
410
411 Note that if C{orig_state} contained self-loop arcs, then the
412 corresponding arcs in M{s} will point to C{orig_state} (i.e.,
413 they will I{not} be self-loop arcs).
414
415 The state description is I{not} copied.
416
417 @param label: The label for the new state. If not specified,
418 a unique integer will be used.
419 """
420 if orig_state not in self._incoming:
421 raise ValueError('Unknown state label %r' % src)
422
423 # Create a new state.
424 new_state = self.add_state(label=label)
425
426 # Copy finalization info.
427 if self.is_final(orig_state):
428 self.set_final(new_state)
429 self.set_finalizing_string(new_state,
430 self.finalizing_string(orig_state))
431
432 # Copy the outgoing arcs.
433 for arc in self._outgoing[orig_state]:
434 self.add_arc(src=new_state, dst=self._dst[arc],
435 in_string=self._in_string[arc],
436 out_string=self._out_string[arc])
437
438 return new_state
439
440 #////////////////////////////////////////////////////////////
441 #{ Transition Arc Modification
442 #////////////////////////////////////////////////////////////
443
446 """
447 Create a new transition arc, and return its label.
448
449 Arguments should be specified using keywords!
450
451 @param src: The label of the source state.
452 @param dst: The label of the destination state.
453 @param in_string: The input string, a (possibly empty) tuple of
454 input symbols. Input symbols should be hashable
455 immutable objects.
456 @param out_string: The output string, a (possibly empty) tuple
457 of output symbols. Output symbols should be hashable
458 immutable objects.
459 """
460 label = self._pick_label(label, 'arc', self._src)
461
462 # Check that src/dst are valid labels.
463 if src not in self._incoming:
464 raise ValueError('Unknown state label %r' % src)
465 if dst not in self._incoming:
466 raise ValueError('Unknown state label %r' % dst)
467
468 # Add the arc.
469 self._src[label] = src
470 self._dst[label] = dst
471 self._in_string[label] = tuple(in_string)
472 self._out_string[label] = tuple(out_string)
473 self._arc_descr[label] = descr
474
475 # Link the arc to its src/dst states.
476 self._incoming[dst].append(label)
477 self._outgoing[src].append(label)
478
479 # Return the new arc's label.
480 return label
481
483 """
484 Delete the transition arc with the given label.
485 """
486 if label not in self._src:
487 raise ValueError('Unknown arc label %r' % src)
488
489 # Disconnect the arc from its src/dst states.
490 self._incoming[self._dst[label]].remove(label)
491 self._outgoing[self._src[label]].remove(label)
492
493 # Delete the arc itself.
494 del (self._src[label], self._dst[label], self._in_string[label],
495 self._out_string[label], self._arc_descr[label])
496
497 #////////////////////////////////////////////////////////////
498 #{ Transformations
499 #////////////////////////////////////////////////////////////
500
502 """Swap all in_string/out_string pairs."""
503 fst = self.copy()
504 fst._in_string, fst._out_string = fst._out_string, fst._in_string
505 return fst
506
508 """Reverse the direction of all transition arcs."""
509 fst = self.copy()
510 fst._incoming, fst._outgoing = fst._outgoing, fst._incoming
511 fst._src, fst._dst = fst._dst, fst._src
512 return fst
513
515 fst = self.copy()
516
517 if fst.initial_state is None:
518 raise ValueError("No initial state!")
519
520 # Determine whether there is a path from the initial node to
521 # each node.
522 queue = [fst.initial_state]
523 path_from_init = set(queue)
524 while queue:
525 state = queue.pop()
526 dsts = [fst.dst(arc) for arc in fst.outgoing(state)]
527 queue += [s for s in dsts if s not in path_from_init]
528 path_from_init.update(dsts)
529
530 # Determine whether there is a path from each node to a final
531 # node.
532 queue = [s for s in fst.states() if fst.is_final(s)]
533 path_to_final = set(queue)
534 while queue:
535 state = queue.pop()
536 srcs = [fst.src(arc) for arc in fst.incoming(state)]
537 queue += [s for s in srcs if s not in path_to_final]
538 path_to_final.update(srcs)
539
540 # Delete anything that's not on a path from the initial state
541 # to a final state.
542 for state in list(fst.states()):
543 if not (state in path_from_init and state in path_to_final):
544 fst.del_state(state)
545
546 return fst
547
549 """
550 Return a new FST that is identical to this FST, except that
551 all state and arc labels have been replaced with new labels.
552 These new labels are consecutive integers, starting with zero.
553
554 @param relabel_states: If false, then don't relabel the states.
555 @param relabel_arcs: If false, then don't relabel the arcs.
556 """
557 if label is None: label = '%s (relabeled)' % self.label
558 fst = FST(label)
559
560 # This will ensure that the state relabelling is canonical, *if*
561 # the FST is subsequential.
562 state_ids = self._relabel_state_ids(self.initial_state, {})
563 if len(state_ids) < len(self._outgoing):
564 for state in self.states():
565 if state not in state_ids:
566 state_ids[state] = len(state_ids)
567
568 # This will ensure that the arc relabelling is canonical, *if*
569 # the state labelling is canonical.
570 arcs = sorted(self.arcs(), key=self.arc_info)
571 arc_ids = dict([(a,i) for (i,a) in enumerate(arcs)])
572
573 for state in self.states():
574 if relabel_states: label = state_ids[state]
575 else: label = state
576 fst.add_state(label, is_final=self.is_final(state),
577 finalizing_string=self.finalizing_string(state),
578 descr=self.state_descr(state))
579
580 for arc in self.arcs():
581 if relabel_arcs: label = arc_ids[arc]
582 else: label = arc
583 src, dst, in_string, out_string = self.arc_info(arc)
584 if relabel_states:
585 src = state_ids[src]
586 dst = state_ids[dst]
587 fst.add_arc(src=src, dst=dst, in_string=in_string,
588 out_string=out_string,
589 label=label, descr=self.arc_descr(arc))
590
591 if relabel_states:
592 fst.initial_state = state_ids[self.initial_state]
593 else:
594 fst.initial_state = self.initial_state
595
596 return fst
597
599 """
600 A helper function for L{relabel()}, which decides which new
601 label should be assigned to each state.
602 """
603 if state in ids: return
604 ids[state] = len(ids)
605 for arc in sorted(self.outgoing(state),
606 key = lambda a:self.in_string(a)):
607 self._relabel_state_ids(self.dst(arc), ids)
608 return ids
609
611 """
612 Return a new FST which defines the same mapping as this FST,
613 but is determinized.
614
615 The algorithm used is based on [...].
616
617 @require: All arcs in this FST must have exactly one input
618 symbol.
619 @require: The mapping defined by this FST must be
620 deterministic.
621 @raise ValueError: If the determinization algorithm was unable
622 to determinize this FST. Typically, this happens because
623 a precondition is not met.
624 """
625 # Check preconditions..
626 for arc in self.arcs():
627 if len(self.in_string(arc)) != 1:
628 raise ValueError("All arcs must have exactly one "
629 "input symbol.")
630
631 # State labels have the form:
632 # frozenset((s1,w1),(s2,w2),...(sn,wn))
633 # Where si is a state and wi is a string of output symbols.
634 if label is None: label = '%s (determinized)' % self.label
635 new_fst = FST(label)
636
637 initial_state = frozenset( [(self.initial_state,())] )
638 new_fst.add_state(initial_state)
639 new_fst.initial_state = initial_state
640
641 queue = [initial_state]
642 while queue:
643 new_fst_state = queue.pop()
644
645 # For each final state from the original FSM that's
646 # contained in the new FST's state, compute the finalizing
647 # string. If there is at least one finalizing string,
648 # then the new state is a final state. However, if the
649 # finalizing strings are not all identical, then the
650 # transduction defined by this FST is nondeterministic, so
651 # fail.
652 finalizing_strings = [w+self.finalizing_string(s)
653 for (s,w) in new_fst_state
654 if self.is_final(s)]
655 if len(set(finalizing_strings)) > 0:
656 if not self._all_equal(finalizing_strings):
657 # multiple conflicting finalizing strings -> bad!
658 raise ValueError("Determinization failed")
659 new_fst.set_final(new_fst_state)
660 new_fst.set_finalizing_string(new_fst_state,
661 finalizing_strings[0])
662
663 # sym -> dst -> [residual]
664 # nb: we checked above that len(in_string)==1 for all arcs.
665 arc_table = {}
666 for (s,w) in new_fst_state:
667 for arc in self.outgoing(s):
668 sym = self.in_string(arc)[0]
669 dst = self.dst(arc)
670 residual = w + self.out_string(arc)
671 arc_table.setdefault(sym,{}).setdefault(dst,set())
672 arc_table[sym][dst].add(residual)
673
674 # For each symbol in the arc table, we need to create a
675 # single edge in the new FST. This edge's input string
676 # will be the input symbol; its output string will be the
677 # shortest common prefix of strings that can be generated
678 # by the original FST in response to the symbol; and its
679 # destination state will encode the set of states that the
680 # original FST can go to when it sees this symbol, paired
681 # with the residual output strings that would have been
682 # generated by the original FST, but have not yet been
683 # generated by the new FST.
684 for sym in arc_table:
685 for dst in arc_table[sym]:
686 if len(arc_table[sym][dst]) > 1:
687 # two arcs w/ the same src, dst, and insym,
688 # but different residuals -> bad!
689 raise ValueError("Determinization failed")
690
691 # Construct a list of (destination, residual) pairs.
692 dst_residual_pairs = [(dst, arc_table[sym][dst].pop())
693 for dst in arc_table[sym]]
694
695 # Find the longest common prefix of all the residuals.
696 # Note that it's ok if some of the residuals disagree,
697 # but *only* if the states associated with those
698 # residuals can never both reach a final state with a
699 # single input string.
700 residuals = [res for (dst, res) in dst_residual_pairs]
701 prefix = self._common_prefix(residuals)
702
703 # Construct the new arc's destination state. The new
704 # arc's output string will be `prefix`, so the new
705 # destination state should be the set of all pairs
706 # (dst, residual-prefix).
707 new_arc_dst = frozenset([(dst, res[len(prefix):])
708 for (dst,res) in dst_residual_pairs])
709
710 # If the new arc's destination state isn't part of
711 # the FST yet, then add it; and add it to the queue.
712 if not new_fst.has_state(new_arc_dst):
713 new_fst.add_state(new_arc_dst)
714 queue.append(new_arc_dst)
715
716 # Create the new arc.
717 new_fst.add_arc(src=new_fst_state, dst=new_arc_dst,
718 in_string=(sym,), out_string=prefix)
719 return new_fst
720
722 """Return true if all elements in the list are equal"""
723 for item in lst[1:]:
724 if item != lst[0]: return False
725 return True
726
728 """Return the longest sequence that is a prefix of all of the
729 given sequences."""
730 prefix = sequences[0]
731 for seq in sequences[1:]:
732 # If the sequence is longer then the prefix, then truncate
733 # the prefix to the length of the sequence.
734 prefix = prefix[:len(seq)]
735 # If the prefix doesn't match item i of the sequence, then
736 # truncate the prefix to include everything up to (but not
737 # including) element i.
738 for i in range(len(prefix)):
739 if seq[i] != prefix[i]:
740 prefix = prefix[:i]
741 break
742 return prefix
743
744 #////////////////////////////////////////////////////////////
745 #{ Misc
746 #////////////////////////////////////////////////////////////
747
749 # Choose a label & create the FST.
750 if label is None: label = '%s-copy' % self.label
751 fst = FST(label)
752
753 # Copy all state:
754 fst._initial_state = self._initial_state
755 fst._incoming = self._incoming.copy()
756 fst._outgoing = self._outgoing.copy()
757 fst._is_final = self._is_final.copy()
758 fst._finalizing_string = self._finalizing_string.copy()
759 fst._state_descr = self._state_descr.copy()
760 fst._src = self._src.copy()
761 fst._dst = self._dst.copy()
762 fst._in_string = self._in_string.copy()
763 fst._out_string = self._out_string.copy()
764 fst._arc_descr = self._arc_descr.copy()
765 return fst
766
768 lines = ['FST %s' % self.label]
769 for state in sorted(self.states()):
770 # State information.
771 if state == self.initial_state:
772 line = '-> %s' % state
773 lines.append(' %-40s # Initial state' % line)
774 if self.is_final(state):
775 line = '%s ->' % state
776 if self.finalizing_string(state):
777 line += ' [%s]' % ' '.join(self.finalizing_string(state))
778 lines.append(' %-40s # Final state' % line)
779 # List states that would otherwise not be listed.
780 if (state != self.initial_state and not self.is_final(state)
781 and not self.outgoing(state) and not self.incoming(state)):
782 lines.append(' %-40s # State' % state)
783 # Outgoing edge information.
784 for arc in sorted(self.arcs()):
785 src, dst, in_string, out_string = self.arc_info(arc)
786 line = ('%s -> %s [%s:%s]' %
787 (src, dst, ' '.join(in_string), ' '.join(out_string)))
788 lines.append(' %-40s # Arc' % line)
789 return '\n'.join(lines)
790
791 @staticmethod
795
796 @staticmethod
798 fst = FST(label)
799 prev_src = None
800 lines = s.split('\n')[::-1]
801 while lines:
802 line = lines.pop().split('#')[0].strip() # strip comments
803 if not line: continue
804
805 # Initial state
806 m = re.match(r'->\s*(\S+)$', line)
807 if m:
808 label = m.group(1)
809 if not fst.has_state(label): fst.add_state(label)
810 fst.initial_state = label
811 continue
812
813 # Final state
814 m = re.match('(\S+)\s*->\s*(?:\[([^\]]*)\])?$', line)
815 if m:
816 label, finalizing_string = m.groups()
817 if not fst.has_state(label): fst.add_state(label)
818 fst.set_final(label)
819 if finalizing_string is not None:
820 finalizing_string = finalizing_string.split()
821 fst.set_finalizing_string(label, finalizing_string)
822 continue
823
824 # State
825 m = re.match('(\S+)$', line)
826 if m:
827 label = m.group(1)
828 if not fst.has_state(label): fst.add_state(label)
829 continue
830
831 # State description
832 m = re.match(r'descr\s+(\S+?):\s*(.*)$', line)
833 if m:
834 label, descr = m.groups()
835 # Allow for multi-line descriptions:
836 while lines and re.match(r'\s+\S', lines[-1]):
837 descr = descr.rstrip()+' '+lines.pop().lstrip()
838 if not fst.has_state(label): fst.add_state(label)
839 fst.set_descr(label, descr)
840 continue
841
842 # Transition arc
843 m = re.match(r'(\S+)?\s*->\s*(\S+)\s*'
844 r'\[(.*?):(.*?)\]$', line)
845 if m:
846 src, dst, in_string, out_string = m.groups()
847 if src is None: src = prev_src
848 if src is None: raise ValueError("bad line: %r" % line)
849 prev_src = src
850 if not fst.has_state(src): fst.add_state(src)
851 if not fst.has_state(dst): fst.add_state(dst)
852 in_string = tuple(in_string.split())
853 out_string = tuple(out_string.split())
854 fst.add_arc(src, dst, in_string, out_string)
855 continue
856
857 raise ValueError("bad line: %r" % line)
858
859 return fst
860
862 """
863 Return an AT&T graphviz dot graph.
864 """
865 # [xx] mark initial node??
866 lines = ['digraph %r {' % self.label,
867 'node [shape=ellipse]']
868 state_id = dict([(s,i) for (i,s) in enumerate(self.states())])
869 if self.initial_state is not None:
870 lines.append('init [shape="plaintext" label=""]')
871 lines.append('init -> %s' % state_id[self.initial_state])
872 for state in self.states():
873 if self.is_final(state):
874 final_str = self.finalizing_string(state)
875 if len(final_str)>0:
876 lines.append('%s [label="%s\\n%s", shape=doublecircle]' %
877 (state_id[state], state, ' '.join(final_str)))
878 else:
879 lines.append('%s [label="%s", shape=doublecircle]' %
880 (state_id[state], state))
881 else:
882 lines.append('%s [label="%s"]' % (state_id[state], state))
883 for arc in self.arcs():
884 src, dst, in_str, out_str = self.arc_info(arc)
885 lines.append('%s -> %s [label="%s:%s"]' %
886 (state_id[src], state_id[dst],
887 ' '.join(in_str), ' '.join(out_str)))
888 lines.append('}')
889 return '\n'.join(lines)
890
891 #////////////////////////////////////////////////////////////
892 #{ Transduction
893 #////////////////////////////////////////////////////////////
894
897
899 """
900 This is implemented as a generator, to make it easier to
901 support stepping.
902 """
903 if not self.is_subsequential():
904 raise ValueError('FST is not subsequential!')
905
906 # Create a transition table that indicates what action we
907 # should take at any state for a given input symbol. In
908 # paritcular, this table maps from (src, in) tuples to
909 # (dst, out, arc) tuples. (arc is only needed in case
910 # we want to do stepping.)
911 transitions = {}
912 for arc in self.arcs():
913 src, dst, in_string, out_string = self.arc_info(arc)
914 assert len(in_string) == 1
915 assert (src, in_string[0]) not in transitions
916 transitions[src, in_string[0]] = (dst, out_string, arc)
917
918 output = []
919 state = self.initial_state
920 try:
921 for in_pos, in_sym in enumerate(input):
922 (state, out_string, arc) = transitions[state, in_sym]
923 if step: yield 'step', (arc, in_pos, output)
924 output += out_string
925 yield 'succeed', output
926 except KeyError:
927 yield 'fail', None
928
931
933 """
934 This is implemented as a generator, to make it easier to
935 support stepping.
936 """
937 input = tuple(input)
938 output = []
939 in_pos = 0
940
941 # 'frontier' is a stack used to keep track of which parts of
942 # the search space we have yet to examine. Each element has
943 # the form (arc, in_pos, out_pos), and indicates that we
944 # should try rolling the input position back to in_pos, the
945 # output position back to out_pos, and applying arc. Note
946 # that the order that we check elements in is important, since
947 # rolling the output position back involves discarding
948 # generated output.
949 frontier = []
950
951 # Start in the initial state, and search for a valid
952 # transduction path to a final state.
953 state = self.initial_state
954 while in_pos < len(input) or not self.is_final(state):
955 # Get a list of arcs we can possibly take.
956 arcs = self.outgoing(state)
957
958 # Add the arcs to our backtracking stack. (The if condition
959 # could be eliminated if I used eliminate_multi_input_arcs;
960 # but I'd like to retain the ability to trace what's going on
961 # in the FST, as its specified.)
962 for arc in arcs:
963 in_string = self.in_string(arc)
964 if input[in_pos:in_pos+len(in_string)] == in_string:
965 frontier.append( (arc, in_pos, len(output)) )
966
967 # Get the top element of the frontiering stack.
968 if len(frontier) == 0:
969 yield 'fail', None
970
971 # perform the operation from the top of the frontier.
972 arc, in_pos, out_pos = frontier.pop()
973 if step:
974 yield 'step', (arc, in_pos, output[:out_pos])
975
976 # update our state, input position, & output.
977 state = self.dst(arc)
978 assert out_pos <= len(output)
979 in_pos = in_pos + len(self.in_string(arc))
980 output = output[:out_pos]
981 output.extend(self.out_string(arc))
982
983 # If it's a subsequential transducer, add the final output for
984 # the terminal state.
985 output += self.finalizing_string(state)
986
987 yield 'succeed', output
988
989
990 #////////////////////////////////////////////////////////////
991 #{ Helper Functions
992 #////////////////////////////////////////////////////////////
993
995 """
996 Helper function for L{add_state} and C{add_arc} that chooses a
997 label for a new state or arc.
998 """
999 if label is not None and label in used_labels:
1000 raise ValueError("%s with label %r already exists" %
1001 (typ, label))
1002 # If no label was specified, pick one.
1003 if label is not None:
1004 return label
1005 else:
1006 label = 1
1007 while '%s%d' % (typ[0], label) in used_labels: label += 1
1008 return '%s%d' % (typ[0], label)
1009
1010 ######################################################################
1011 #{ AT&T fsmtools Support
1012 ######################################################################
1013
1015 """
1016 A class used to interface with the AT&T fsmtools package. In
1017 particular, L{FSMTools.transduce} can be used to transduce an
1018 input string using any subsequential transducer where each input
1019 and output arc is labelled with at most one symbol.
1020 """
1021 EPSILON = object()
1022 """A special symbol object used to represent epsilon strings in
1023 the symbol<->id mapping (L{FSMTools._symbol_ids})."""
1024
1026 self.fsmtools_path = fsmtools_path
1027 """The path of the directory containing the fsmtools binaries."""
1028
1029 self._symbol_ids = self.IDMapping(self.EPSILON)
1030 """A mapping from symbols to unique integer IDs. We manage
1031 our own mapping, rather than using 'symbol files', since
1032 symbol files can't handle non-string symbols, symbols
1033 containing whitespace, unicode symbols, etc."""
1034
1035 self._state_ids = self.IDMapping()
1036 """A mapping from state labels to unique integer IDs."""
1037
1038 #////////////////////////////////////////////////////////////
1039 #{ Transduction
1040 #////////////////////////////////////////////////////////////
1041
1043 try:
1044 # Create a temporary working directory for intermediate files.
1045 tempdir = tempfile.mkdtemp()
1046 def tmp(s): return os.path.join(tempdir, s+'.fsm')
1047
1048 # Comile the FST & input file into binary fmstool format.
1049 self.compile_fst(fst, tmp('fst'))
1050 self.compile_string(input_string, tmp('in'))
1051
1052 # Transduce the input using the FST. We do this in two
1053 # steps: first, use fsmcompose to eliminate any paths that
1054 # are not consistent with the input string; and then use
1055 # fsmbestpath to choose a path through the FST. If the
1056 # FST is nondeterministic, then the path chosen is
1057 # arbitrary. Finally, print the result, so we can process
1058 # it and extract the output sequence.
1059 p1 = Popen([self._bin('fsmcompose'), tmp('in'), tmp('fst')],
1060 stdout=PIPE)
1061 p2 = Popen([self._bin('fsmbestpath')],
1062 stdin=p1.stdout, stdout=PIPE)
1063 p3 = Popen([self._bin('fsmprint')],
1064 stdin=p2.stdout, stdout=PIPE)
1065 out_string_fsm = p3.communicate()[0]
1066 finally:
1067 for f in os.listdir(tempdir):
1068 os.unlink(os.path.join(tempdir, f))
1069 os.rmdir(tempdir)
1070
1071 # If the empty string was returned, then the input was not
1072 # accepted by the FST; return None.
1073 if len(out_string_fsm) == 0:
1074 return None
1075
1076 # Otherwise, the input was accepted, so extract the
1077 # corresponding output string.
1078 out_string = []
1079 final_state_id = 0
1080 for line in out_string_fsm.split('\n'):
1081 words = line.split()
1082 if len(words) == 5:
1083 out_string.append(self._symbol_ids.getval(words[3]))
1084 final_state_id += int(words[4])
1085 elif len(words) == 4:
1086 out_string.append(self._symbol_ids.getval(words[3]))
1087 elif len(words) == 2:
1088 final_state_id += int(words[1])
1089 elif len(words) != 0:
1090 raise ValueError("Bad output line: %r" % line)
1091
1092 # Add on the finalizing string for the final state.
1093 final_state = self._state_ids.getval(final_state_id)
1094 out_string += fst.finalizing_string(final_state)
1095 return out_string
1096
1097 #////////////////////////////////////////////////////////////
1098 #{ FSM Compilation
1099 #////////////////////////////////////////////////////////////
1100
1102 """
1103 Compile the given FST to an fsmtools .fsm file, and write it
1104 to the given filename.
1105 """
1106 if fst.initial_state is None:
1107 raise ValueError("FST has no initial state!")
1108 if not (fst.is_final(fst.initial_state) or
1109 len(fst.outgoing(fst.initial_state)) > 0):
1110 raise ValueError("Initial state is nonfinal & "
1111 "has no outgoing arcs")
1112
1113 # Put the initial state first, since that's how fsmtools
1114 # decides which state is the initial state.
1115 states = [fst.initial_state] + [s for s in fst.states() if
1116 s != fst.initial_state]
1117
1118 # Write the outgoing edge for each state, & mark final states.
1119 lines = []
1120 for state in states:
1121 for arc in fst.outgoing(state):
1122 src, dst, in_string, out_string = fst.arc_info(arc)
1123 lines.append('%d %d %d %d\n' %
1124 (self._state_ids.getid(src),
1125 self._state_ids.getid(dst),
1126 self._string_id(in_string),
1127 self._string_id(out_string)))
1128 if fst.is_final(state):
1129 lines.append('%d %d\n' % (self._state_ids.getid(state),
1130 self._state_ids.getid(state)))
1131
1132 # Run fsmcompile to compile it.
1133 p = Popen([self._bin('fsmcompile'), '-F', outfile], stdin=PIPE)
1134 p.communicate(''.join(lines))
1135
1137 """
1138 Compile the given symbol string into an fsmtools .fsm file,
1139 and write it to the given filename. This FSM will generate
1140 the given symbol string, and no other strings.
1141 """
1142 # Create the input for fsmcompile.
1143 lines = []
1144 for (i, sym) in enumerate(sym_string):
1145 lines.append('%d %d %d\n' % (i, i+1, self._symbol_ids.getid(sym)))
1146 lines.append('%d\n' % len(sym_string))
1147
1148 # Run fsmcompile to compile it.
1149 p = Popen([self._bin('fsmcompile'), '-F', outfile], stdin=PIPE)
1150 p.communicate(''.join(lines))
1151
1152 #////////////////////////////////////////////////////////////
1153 #{ Helpers
1154 #////////////////////////////////////////////////////////////
1155
1157 return os.path.join(self.fsmtools_path, command)
1158
1160 if len(sym_string) == 0:
1161 return self._symbol_ids.getid(self.EPSILON)
1162 elif len(sym_string) == 1:
1163 return self._symbol_ids.getid(sym_string[0])
1164 else:
1165 raise ValueError('fsmtools does not support multi-symbol '
1166 'input or output strings on arcs.??')
1167
1181
1182 ######################################################################
1183 #{ Graphical Display
1184 ######################################################################
1185
1188 GraphWidget.__init__(self, canvas, (), (), **attribs)
1189
1190 # Create a widget for each state.
1191 self.state_widgets = state_widgets = {}
1192 for state in fst.states():
1193 label = TextWidget(canvas, state)
1194 if fst.is_final(state) and fst.finalizing_string(state):
1195 fstr = fst.finalizing_string(state)
1196 label = StackWidget(canvas, label,
1197 SequenceWidget(canvas,
1198 SymbolWidget(canvas, 'rightarrow'),
1199 TextWidget(canvas, fstr)))
1200 w = OvalWidget(canvas, label,
1201 double=fst.is_final(state), margin=2,
1202 fill='white')
1203 if state == fst.initial_state:
1204 w = SequenceWidget(canvas,
1205 SymbolWidget(canvas, 'rightarrow'), w)
1206 w['draggable'] = True
1207 state_widgets[state] = w
1208 self.add_node(w)
1209
1210 # Create a widget for each arc.
1211 self.arc_widgets = arc_widgets = {}
1212 for arc in fst.arcs():
1213 label = TextWidget(canvas, ' '.join(fst.in_string(arc)) + ' : ' +
1214 ' '.join(fst.out_string(arc)))
1215 w = GraphEdgeWidget(canvas, 0,0,0,0, label, color='cyan4')
1216 arc_widgets[arc] = w
1217 self.add_edge(state_widgets[fst.src(arc)],
1218 state_widgets[fst.dst(arc)], w)
1219
1220 # Arrange the graph.
1221 if fst.initial_state is not None:
1222 toplevel = [self.state_widgets[fst.initial_state]]
1223 else:
1224 toplevel = None
1225 self.arrange(toplevel=toplevel)
1226
1228 oval = self.state_widgets[state]
1229 if isinstance(oval, SequenceWidget):
1230 oval = oval.child_widgets()[1]
1231 oval['fill'] = color
1232
1234 oval = self.state_widgets[state]
1235 if isinstance(oval, SequenceWidget):
1236 oval = oval.child_widgets()[1]
1237 oval['fill'] = 'white'
1238
1243
1248
1251 self.cf = CanvasFrame(width=580, height=600, background='#f0f0f0')
1252 self.cf._parent.geometry('+650+50') # hack..
1253
1254 self.fst_widgets = {}
1255 for fst in fsts:
1256 self.add_fst(fst)
1257
1259 w = FSTWidget(self.cf.canvas(), fst, draggable=True,
1260 xspace=130, yspace=100)
1261 self.fst_widgets[fst] = w
1262 self.cf.add_widget(w, x=20)
1263
1266 self.top = Tk()
1267 f1 = Frame(self.top)
1268 f2 = Frame(self.top)
1269 f3 = Frame(self.top)
1270 f4 = Frame(self.top)
1271
1272 # The canvas for the FST itself.
1273 self.cf = CanvasFrame(f1, width=800, height=400,
1274 background='#f0f0f0',
1275 relief="sunken", border="2")
1276 self.cf.pack(expand=True, fill='both')
1277
1278 # The description of the current state.
1279 self.state_label = Label(f4, font=('bold', -16))
1280 self.state_label.pack(side='top', anchor='sw')
1281 self.state_descr = Text(f4, height=3, wrap='word', border="2",
1282 relief="sunken", font='helvetica',
1283 width=10)
1284 self.state_descr.pack(side='bottom', expand=True, fill='x')
1285
1286 # The input string.
1287 font = ('courier', -16, 'bold')
1288 Label(f2,text=' Input:', font='courier').pack(side='left')
1289 self.in_text = in_text = Text(f2, height=1, wrap='none',
1290 font=font, background='#c0ffc0',
1291 #padx=0, pady=0,
1292 width=10,
1293 highlightcolor='green',
1294 highlightbackground='green',
1295 highlightthickness=1)
1296 in_text.tag_config('highlight', foreground='white',
1297 background='green4')
1298 in_text.insert('end', 'r = reset; <space> = step')
1299 in_text.tag_add('read', '1.0', '1.4')
1300 in_text.pack(side='left', expand=True, fill="x")
1301
1302 # The input string.
1303 Label(f3,text='Output:', font='courier').pack(side='left')
1304 self.out_text = out_text = Text(f3, height=1, wrap='none',
1305 font=font, background='#ffc0c0',
1306 #padx=0, pady=0,
1307 width=10,
1308 highlightcolor='red',
1309 highlightbackground='red',
1310 highlightthickness=1)
1311 out_text.tag_config('highlight', foreground='white',
1312 background='red4')
1313 out_text.pack(side='left', expand=True, fill="x")
1314
1315 f1.pack(expand=True, fill='both', side='top', padx=5, pady=5)
1316 f4.pack(expand=False, fill='x', side='bottom', padx=5, pady=5)
1317 f3.pack(expand=False, fill='x', side='bottom', padx=5, pady=5)
1318 f2.pack(expand=False, fill='x', side='bottom', padx=5, pady=5)
1319
1320 self.top.title('FST')
1321 self.top.geometry('+650+50')
1322 self.top.bind('<Control-p>', lambda e: self.cf.print_to_file())
1323 self.top.bind('<Control-x>', self.destroy)
1324 self.top.bind('<Control-q>', self.destroy)
1325 self.top.bind('<space>', self.step)
1326 self.top.bind('r', lambda e: self.transduce(self.stepper_input))
1327
1328 self.stepper = None
1329
1330 self.graph = None
1331 self.set_fst(fst)
1332
1334 if self.fst.is_subsequential():
1335 self.stepper = self.fst.step_transduce_subsequential(input)
1336 else:
1337 self.stepper = self.fst.step_transduce(input)
1338 self.stepper_input = input
1339 # the following really duplicates code in step(), and should be
1340 # factored out.
1341 self.in_text.delete('1.0', 'end')
1342 self.out_text.delete('1.0', 'end')
1343 self.in_text.insert('1.0', ' '.join(self.stepper_input))
1344 for state in self.fst.states():
1345 self.graph.unmark_state(state)
1346 self.graph.mark_state(self.fst.initial_state)
1347 self.state_label['text'] = 'State = %s' % self.fst.initial_state
1348 self.state_descr.delete('1.0', 'end')
1349 state_descr = fst.state_descr(self.fst.initial_state)
1350 self.state_descr.insert('end', state_descr or '')
1351
1353 if self.stepper is None: return
1354
1355 # Perform one step.
1356 try: result, val = self.stepper.next()
1357 except StopIteration: return
1358
1359 if result == 'fail':
1360 self.stepper = None
1361 self.out_text.insert('end', ' (Failed!)')
1362 elif result == 'succeed':
1363 self.stepper = None
1364 self.out_text.delete('1.0', 'end')
1365 self.out_text.insert('end', ' '.join(val))
1366 self.out_text.tag_add('highlight', '1.0', 'end-1c')
1367 self.out_text.insert('end', ' (Finished!)')
1368 elif result == 'backtrack':
1369 self.out_text.insert('end', ' (Backtrack)')
1370 for state, widget in self.graph.state_widgets.items():
1371 if state == val: self.graph.mark_state(state, '#f0b0b0')
1372 else: self.graph.unmark_state(state)
1373 else:
1374 (arc, in_pos, output) = val
1375
1376 # Update in text display
1377 in_pos += len(fst.in_string(arc))
1378 output = list(output)+list(fst.out_string(arc))
1379 self.in_text.delete('1.0', 'end')
1380 self.in_text.insert('end', ' '.join(self.stepper_input[:in_pos]))
1381 self.in_text.tag_add('highlight', '1.0', 'end-1c')
1382 if in_pos > 0:
1383 self.in_text.insert('end', ' ')
1384 l,r= self.in_text.xview()
1385 if (r-l) < 1:
1386 self.in_text.xview_moveto(1.0-(r-l)/2)
1387 self.in_text.insert('end', ' '.join(self.stepper_input[in_pos:]))
1388
1389 # Update out text display
1390 self.out_text.delete('1.0', 'end')
1391 self.out_text.insert('end', ' '.join(output))
1392 self.out_text.tag_add('highlight', '1.0', 'end-1c')
1393
1394 # Update the state descr display
1395 self.state_label['text'] = 'State = %s' % fst.dst(arc)
1396 self.state_descr.delete('1.0', 'end')
1397 state_descr = fst.state_descr(fst.dst(arc))
1398 self.state_descr.insert('end', state_descr or '')
1399
1400 # Highlight the new dst state.
1401 for state, widget in self.graph.state_widgets.items():
1402 if state == fst.dst(arc):
1403 self.graph.mark_state(state, '#00ff00')
1404 elif state == fst.src(arc):
1405 self.graph.mark_state(state, '#b0f0b0')
1406 else: self.graph.unmark_state(state)
1407
1408 # Highlight the new arc.
1409 for a, widget in self.graph.arc_widgets.items():
1410 if a == arc: self.graph.mark_arc(a)
1411 else: self.graph.unmark_arc(a)
1412
1413 # Make end of output visible..
1414 l,r= self.out_text.xview()
1415 if (r-l) < 1:
1416 self.out_text.xview_moveto(1.0-(r-l)/2)
1417 self.out_text.insert('end', ' '*100)
1418
1420 self.fst = fst
1421 c = self.cf.canvas()
1422 if self.graph is not None:
1423 self.cf.remove_widget(self.graph)
1424 self.graph = FSTWidget(c, self.fst, xspace=130, yspace=100)
1425 self.cf.add_widget(self.graph, 20, 20)
1426
1431
1434
1435 ######################################################################
1436 #{ Test Code
1437 ######################################################################
1438
1439 if __name__ == '__main__':
1440 # This is a very contrived example. :)
1441 # Something phonetic might be better.
1442 fst = FST.parse("test", """
1443 -> start
1444 start -> vp [john:john]
1445 start -> vp [mary:mary]
1446
1447 # Delay production of the determiner until we know the gender.
1448 start -> subj_noun [the:]
1449 subj_noun -> vp [dog:le chien]
1450 subj_noun -> vp [cow:la vache]
1451
1452 vp -> obj [eats:mange]
1453 obj -> obj_noun [the:]
1454 obj -> obj_noun [:]
1455 obj_noun -> end [grass:de l'herbe]
1456 obj_noun -> end [bread:du pain]
1457 end ->
1458 """)
1459
1460 print "john eats the bread ->"
1461 print ' '+ ' '.join(fst.transduce("john eats the bread".split()))
1462 rev = fst.inverted()
1463 print "la vache mange de l'herbe ->"
1464 print ' '+' '.join(rev.transduce("la vache mange de l'herbe".split()))
1465
1466 demo = FSTDemo(fst)
1467 demo.transduce("the cow eats the bread".split())
1468 demo.mainloop()
1469
| Home | Trees | Indices | Help |
|
|---|
| Generated by Epydoc 3.0beta1 on Wed May 16 22:47:57 2007 | http://epydoc.sourceforge.net |