下面的例子展示了如何使用lex.py对输入进行标记
# ------------------------------------------------------------
# calclex.py
#
# tokenizer for a simple expression evaluator for
# numbers and +,-,*,/
# ------------------------------------------------------------
import ply.lex as lex
# List of token names. This is always required
tokens = (
"NUMBER",
"PLUS",
"MINUS",
"TIMES",
"DIVIDE",
"LPAREN",
"RPAREN",
)
# Regular expression rules for simple tokens
t_PLUS = r"+"
t_MINUS = r"-"
t_TIMES = r"*"
t_DIVIDE = r"/"
t_LPAREN = r"("
t_RPAREN = r")"
# A regular expression rule with some action code
def t_NUMBER(t):
r"d+"
t.value = int(t.value)
return t
# Define a rule so we can track line numbers
def t_newline(t):
r"
+"
t.lexer.lineno += len(t.value)
# A string containing ignored characters (spaces and tabs)
t_ignore = " "
# Error handling rule
def t_error(t):
print "Illegal character "%s"" % t.value[0]
t.lexer.skip(1)
# Build the lexer
lexer = lex.lex()
为了使lexer工作,你需要给定一个输入,并传递给input()
方法。然后,重复调用token()
方法来获取标记序列,下面的代码展示了这种用法:
# Test it out
data = """
3 + 4 * 10
+ -20 *2
"""
# Give the lexer some input
lexer.input(data)
# Tokenize
while True:
tok = lexer.token()
if not tok: break # No more input
print tok
程序执行,将给出如下输出:
$ python example.py
LexToken(NUMBER,3,2,1)
LexToken(PLUS,"+",2,3)
LexToken(NUMBER,4,2,5)
LexToken(TIMES,"*",2,7)
LexToken(NUMBER,10,2,10)
LexToken(PLUS,"+",3,14)
LexToken(MINUS,"-",3,16)
LexToken(NUMBER,20,3,18)
LexToken(TIMES,"*",3,20)
LexToken(NUMBER,2,3,21)
Lexers也同时支持迭代,你可以把上面的循环写成这样:
for tok in lexer:
print tok
由lexer.token()方法返回的标记是LexToken类型的实例,拥有tok.type
,tok.value
,tok.lineno
和tok.lexpos
属性,下面的代码展示了如何访问这些属性:
# Tokenize
while True:
tok = lexer.token()
if not tok: break # No more input
print tok.type, tok.value, tok.line, tok.lexpos
tok.type
和tok.value
属性表示标记本身的类型和值。tok.line
和tok.lexpos
属性包含了标记的位置信息,tok.lexpos
表示标记相对于输入串起始位置的偏移。