Thiago R. Adams website


TkGen - A Lexical Analyzer

What is tkgen?

TkGen is a lexical analyzer generator (also known as scanner generator) for C++, written in C++.

Input format

Basically, the input file is a list of token names, each followed on the next line by its regular expression. Sample input file:

NUMBER
[0-9]+(\.[0-9]+)?
SYMBOL
[a-zA-Z]+([0-9]+[a-zA-Z]+)?
BLANKS
[\ \n]+
OPEN
\(
CLOSE
\)
MULTI
\*
DIV
\/
PLUS
\+
MINUS
\-

The output is a C++ header file containing the transition information of a DFA (deterministic finite automaton).
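
To give an idea of what DFA transition information can look like, here is a minimal hand-written sketch for the NUMBER token from the sample above. It is not the header that tkgen actually generates; the names `kTransitions`, `kAccepting`, `Classify` and `MatchNumber` are made up for this illustration, and the real layout may be quite different.

```cpp
#include <cstddef>
#include <string>

// Hypothetical hand-written DFA for NUMBER: [0-9]+(\.[0-9]+)?
// Illustration only: this is NOT tkgen's actual output format.
enum CharClass { CC_DIGIT, CC_DOT, CC_OTHER, CC_COUNT };

const int REJECT = -1;

// Rows are states, columns are character classes.
// 0: start, 1: integer part (accepting),
// 2: just after '.', 3: fraction part (accepting)
const int kTransitions[4][CC_COUNT] = {
    /* 0 */ { 1, REJECT, REJECT },
    /* 1 */ { 1, 2,      REJECT },
    /* 2 */ { 3, REJECT, REJECT },
    /* 3 */ { 3, REJECT, REJECT },
};
const bool kAccepting[4] = { false, true, false, true };

// Maps a character to the column used in the transition table.
CharClass Classify(wchar_t ch)
{
    if (ch >= L'0' && ch <= L'9') return CC_DIGIT;
    if (ch == L'.') return CC_DOT;
    return CC_OTHER;
}

// Returns the length of the longest prefix of text that is a NUMBER,
// or 0 when no prefix matches.
std::size_t MatchNumber(const std::wstring& text)
{
    int state = 0;
    std::size_t longest = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        state = kTransitions[state][Classify(text[i])];
        if (state == REJECT) break;
        if (kAccepting[state]) longest = i + 1;
    }
    return longest;
}
```

For instance, `MatchNumber(L"3.14+x")` stops at the `+` and reports a match of 4 characters, while `MatchNumber(L"12.")` reports 2 because the state reached after the dot is not an accepting one.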

The generated file is combined with two template classes to create the final scanner, which recognizes tokens from a source of characters such as a file or a string.
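
The way a state machine and a character source can be plugged into a tokenizer template can be sketched roughly as follows. This is only an assumption about the general shape, not the code from tokenizer.h: `SketchStringStream`, `SketchMachine` and `SketchTokenizer` are hypothetical names, the tiny state machine is written by hand instead of being generated, and the longest-match strategy is a common choice that the real classes may or may not use.

```cpp
#include <cstddef>
#include <string>

// Hypothetical character source over an in-memory string.
class SketchStringStream
{
public:
    explicit SketchStringStream(const std::wstring& text) : text_(text), pos_(0) {}
    // Reads the next character; returns false at the end of the input.
    bool Get(wchar_t& ch)
    {
        if (pos_ >= text_.size()) return false;
        ch = text_[pos_++];
        return true;
    }
    std::size_t Position() const { return pos_; }
    void Seek(std::size_t pos) { pos_ = pos; }
private:
    std::wstring text_;
    std::size_t pos_;
};

// Hypothetical stand-in for a generated state machine with two tokens:
// DIGITS = [0-9]+ and PLUS = \+
struct SketchMachine
{
    enum Token { DIGITS, PLUS };
    // Returns the next state, or -1 when there is no transition.
    static int Next(int state, wchar_t ch)
    {
        const bool digit = ch >= L'0' && ch <= L'9';
        if (state == 0) return digit ? 1 : (ch == L'+' ? 2 : -1);
        if (state == 1) return digit ? 1 : -1;
        return -1;
    }
    // Tells whether state is accepting and, if so, which token it yields.
    static bool Accepts(int state, Token& token)
    {
        if (state == 1) { token = DIGITS; return true; }
        if (state == 2) { token = PLUS; return true; }
        return false;
    }
};

// Hypothetical tokenizer: runs the machine over the stream and keeps the
// longest accepted prefix, then rewinds the stream to just after it.
template <class Machine, class Stream>
class SketchTokenizer
{
public:
    explicit SketchTokenizer(Stream& stream) : stream_(stream) {}

    bool NextToken(std::wstring& lexeme, typename Machine::Token& token)
    {
        const std::size_t start = stream_.Position();
        std::wstring scanned;
        std::size_t accepted = 0;
        int state = 0;
        wchar_t ch;
        while (stream_.Get(ch))
        {
            state = Machine::Next(state, ch);
            if (state < 0) break;
            scanned += ch;
            typename Machine::Token candidate = token;
            if (Machine::Accepts(state, candidate))
            {
                token = candidate;
                accepted = scanned.size();
            }
        }
        stream_.Seek(start + accepted);
        lexeme = scanned.substr(0, accepted);
        return accepted > 0;
    }
private:
    Stream& stream_;
};
```

With this sketch, a `SketchTokenizer<SketchMachine, SketchStringStream>` built over the text `12+345` yields the lexemes `12`, `+` and `345` and then returns false. Keeping the machine and the stream as template parameters means the same loop can work with a file-based source or with a different generated machine.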

Try tkgen online

http://www.thradams.com/webtkgen.aspx

How to use the generated code?

To create a tokenizer you will need two more classes:

  • TokenizerStream
  • Tokenizer

Both can be found on the Tokenizer and InputStream tokenizer pages.

Complete sample


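#include "stdafx.h"

#include <iostream>
#include <fstream>

// Download it from http://www.thradams.com/codeblog/tkgencode.htm
#include "tokenizer.h"

// Generated by tkgen. Copy it from the online tkgen and paste it into your file.
#include "statemachine.h"


int _tmain(int argc, _TCHAR* argv[])
{
  // The input file name is expected as the first argument
  if (argc < 2)
  {
      std::wcerr << L"usage: pass the input file name as the first argument" << std::endl;
      return 1;
  }

  // Character source: the input file read as wide characters
  std::wifstream ss(argv[1]);
  FileTokenizerStream<wchar_t> fileStream(ss);

  // Final scanner: generated state machine + stream
  Tokenizer<StateMachine, FileTokenizerStream<wchar_t> > tk(fileStream);

  // Print each token name and its lexeme until the input is exhausted
  std::wstring lexeme;
  Tokens token;
  while (tk.NextToken(lexeme, token))
  {
      std::wcout << TokensToString(token) << L": '" << lexeme << L"'" << std::endl;
  }
  return 0;
}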

Input file details

TkGen accepts the following regex syntax:

?  : optional
+  : one or more
*  : zero or more
.  : any char
[] : or-groups
\  : escape
0-9: range inside groups
(Note: ^ is not yet supported)
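
As a quick illustration of what these operators mean, the sketch below checks a few patterns with `std::regex` from the C++ standard library. This uses the standard library's ECMAScript grammar, not tkgen itself, so it only shows the usual meaning of the operators; tkgen may differ in details. `FullMatch` and `SyntaxExamplesHold` are helper names made up for this example.

```cpp
#include <regex>
#include <string>

// Returns true when the whole of text matches pattern.
// Uses std::regex (ECMAScript grammar), not tkgen.
bool FullMatch(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}

// Returns true when every example below behaves as described.
bool SyntaxExamplesHold()
{
    return FullMatch("ab?", "a")        // ?  : b is optional
        && FullMatch("ab?", "ab")
        && FullMatch("ab+", "abbb")     // +  : one or more b
        && !FullMatch("ab+", "a")
        && FullMatch("ab*", "a")        // *  : zero or more b
        && FullMatch("a.c", "abc")      // .  : any char
        && FullMatch("[abc]", "b")      // [] : any one of a, b, c
        && FullMatch("\\+", "+")        // \  : escape, a literal +
        && FullMatch("[0-9]", "7");     // 0-9: range inside a group
}
```

The NUMBER pattern from the sample input combines several of these: `FullMatch("[0-9]+(\\.[0-9]+)?", "3.14")` is true, while the same pattern against `"3."` is false because the fraction digits are required once the dot is present.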

Download sample

tkgensample1.zip

Acknowledgments

  • Cesar Mello, for the incentive over the years to implement this kind of tokenizer generator based on DFAs.
  • Marcelo B., for the feedback and patience while talking about NFAs, DFAs, etc.

History

  • 18 nov 2009 : web page released
  • 02 dec 2010 : compact version of the code added

About the author: I am Thiago Adams. I work as a professional C++ software engineer. I created this website to share ideas and source code with other people with similar interests.
I would like to hear your comments, criticism, questions and suggestions about this topic or any other part of this website. Email: thiago.adams at gmail dot com