For one of my projects I needed a parser that turns shell one-liners into an AST. I tried PLY, pyPEG and a few others, and settled on pyparsing. It’s actively maintained, works without magic, and is easy to use.
Ideally I wanted to parse something like:
LANG=en_US.utf-8 git diff | wc -l >> diffs
into something like:
(= LANG en_US.utf-8)
(>> (| (git diff) (wc -l))
    (diffs))
Let’s start with a simple shell command, which is just a sequence of space-separated tokens:
import pyparsing as pp
token = pp.Word(pp.alphanums + '_-.')
command = pp.OneOrMore(token)
command.parseString('git branch --help')
>>> ['git', 'branch', '--help']
That was simple. Another simple part is parsing environment variables. A single environment variable has the form token=token, and a list of them is separated by spaces:
env = pp.Group(token + '=' + token)
env.parseString('A=B')
>>> [['A', '=', 'B']]
env_list = pp.OneOrMore(env)
env_list.parseString('VAR=test X=1')
>>> [['VAR', '=', 'test'], ['X', '=', '1']]
Now we can easily combine the command and the environment variables. Keep in mind that the environment variables are optional:
command_with_env = pp.Optional(pp.Group(env_list)) + pp.Group(command)
command_with_env.parseString('LOCALE=en_US.utf-8 git diff')
>>> [[['LOCALE', '=', 'en_US.utf-8']], ['git', 'diff']]
Now we need to add support for pipes, redirects and logical operators. We don’t need to know what they do here, so we’ll treat them simply as separators between commands:
separators = ['1>>', '2>>', '>>', '1>', '2>', '>', '<', '||', '|', '&&', '&', ';']
separator = pp.oneOf(separators)
command_with_separator = pp.OneOrMore(pp.Group(command) + pp.Optional(separator))
command_with_separator.parseString('git diff | wc -l >> out.txt')
>>> [['git', 'diff'], '|', ['wc', '-l'], '>>', ['out.txt']]
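One caveat is worth checking here. Because token accepts digits, a file-descriptor redirect such as 2> is never seen by the separator rule when it follows a space: the 2 is consumed as the last argument of the command. The sketch below shows that behaviour and one possible fix, a negative lookahead that stops a token from starting where a separator starts. The names loose, strict and strict_token are hypothetical and not part of the original grammar.

```python
import pyparsing as pp

separators = ['1>>', '2>>', '>>', '1>', '2>', '>', '<', '||', '|', '&&', '&', ';']
separator = pp.oneOf(separators)
token = pp.Word(pp.alphanums + '_-.')

# Grammar as in the post: the digit is swallowed by the command.
command = pp.OneOrMore(token)
loose = pp.OneOrMore(pp.Group(command) + pp.Optional(separator))
loose_result = loose.parseString('make 2> err.log').asList()
# [['make', '2'], '>', ['err.log']]

# Assumed variant: a token may not begin where a separator begins.
strict_token = ~separator + token
strict_command = pp.OneOrMore(strict_token)
strict = pp.OneOrMore(pp.Group(strict_command) + pp.Optional(separator))
strict_result = strict.parseString('make 2> err.log').asList()
# [['make'], '2>', ['err.log']]
```

A plain numeric argument followed by a space, as in head -n 2 > out, is still parsed as an argument, because the 2> literal does not match across the space.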
And now we can merge environment variables, commands and separators:
one_liner = pp.Optional(pp.Group(env_list)) + pp.Group(command_with_separator)
one_liner.parseString('LANG=C DEBUG=true git branch | wc -l >> out.txt')
>>> [[['LANG', '=', 'C'], ['DEBUG', '=', 'true']], [['git', 'branch'], '|', ['wc', '-l'], '>>', ['out.txt']]]
The result is hard to process as it is, so let’s give its parts names:
one_liner = pp.Optional(env_list).setResultsName('env') + \
    pp.Group(command_with_separator).setResultsName('command')
result = one_liner.parseString('LANG=C DEBUG=true git branch | wc -l >> out.txt')
print('env:', result.env, '\ncommand:', result.command)
>>> env: [['LANG', '=', 'C'], ['DEBUG', '=', 'true']]
>>> command: [['git', 'branch'], '|', ['wc', '-l'], '>>', ['out.txt']]
We still don’t have an AST, though, just a bunch of grouped tokens. So now we need to transform them into a proper AST:
def prepare_command(command):
    """We don't need to work with pyparsing internal data structures,
    so we just convert them to list.
    """
    for part in command:
        if isinstance(part, str):
            yield part
        else:
            yield list(part)


def separator_position(command):
    """Find last separator position."""
    for n, part in enumerate(command[::-1]):
        if part in separators:
            return len(command) - n - 1


def command_to_ast(command):
    """Recursively transform command to AST."""
    n = separator_position(command)
    if n is None:
        return tuple(command[0])
    else:
        return (command[n],
                command_to_ast(command[:n]),
                command_to_ast(command[n + 1:]))


def to_ast(parsed):
    if parsed.env:
        for env in parsed.env:
            yield ('=', env[0], env[2])
    command = list(prepare_command(parsed.command))
    yield command_to_ast(command)


list(to_ast(result))
>>> [('=', 'LANG', 'C'),
>>> ('=', 'DEBUG', 'true'),
>>> ('>>', ('|', ('git', 'branch'),
>>> ('wc', '-l')),
>>> ('out.txt',))]
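One edge case to be aware of: the grammar accepts a trailing separator, as in git fetch &, which leaves an empty list on the right-hand side, and command_to_ast then fails on command[0] with an IndexError. The following is a minimal sketch of a guarded variant that works on plain lists. The names SEPARATORS, last_separator and command_to_ast_safe are hypothetical, and representing a missing operand as an empty tuple is an assumption rather than something the original code does.

```python
SEPARATORS = ['1>>', '2>>', '>>', '1>', '2>', '>', '<', '||', '|', '&&', '&', ';']


def last_separator(parts):
    """Return the index of the last separator in parts, or None."""
    for n in range(len(parts) - 1, -1, -1):
        if parts[n] in SEPARATORS:
            return n
    return None


def command_to_ast_safe(parts):
    """Like command_to_ast, but a missing operand becomes an empty tuple."""
    if not parts:
        return ()
    n = last_separator(parts)
    if n is None:
        return tuple(parts[0])
    return (parts[n],
            command_to_ast_safe(parts[:n]),
            command_to_ast_safe(parts[n + 1:]))


background = command_to_ast_safe([['git', 'fetch'], '&'])
# ('&', ('git', 'fetch'), ())
```

Splitting on the last separator first is what makes the tree left-nested, so earlier operators end up deeper in the AST.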
It works. The last part is a bit of glue that makes it easier to use:
def parse(command):
    result = one_liner.parseString(command)
    ast = to_ast(result)
    return list(ast)


parse('LANG=en_US.utf-8 git diff | wc -l >> diffs')
>>> [('=', 'LANG', 'en_US.utf-8'),
>>> ('>>', ('|', ('git', 'diff'),
>>> ('wc', '-l')),
>>> ('diffs',))]
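The tuples returned by parse map directly onto the parenthesised notation from the beginning of the post. Here is a minimal sketch of that rendering, where to_sexpr is a hypothetical helper name and not part of the original code:

```python
def to_sexpr(node):
    """Render an AST node (nested tuples of strings) as an S-expression."""
    if isinstance(node, str):
        return node
    return '(' + ' '.join(to_sexpr(child) for child in node) + ')'


# The AST produced for 'LANG=en_US.utf-8 git diff | wc -l >> diffs'.
ast = [('=', 'LANG', 'en_US.utf-8'),
       ('>>', ('|', ('git', 'diff'), ('wc', '-l')), ('diffs',))]

rendered = [to_sexpr(node) for node in ast]
# ['(= LANG en_US.utf-8)', '(>> (| (git diff) (wc -l)) (diffs))']
```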
It can’t parse every one-liner, though. For example, it doesn’t support nested commands like:
echo $(git branch)
echo `git branch`
But it’s enough for my task, and support for the missing features can be added easily.
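As one illustration of how a missing feature might be added, the sketch below handles the $(...) form with a recursive grammar built on pp.Forward. It covers only plain commands, without environment variables or separators, and the names substitution and nested_command are hypothetical. Treat it as a starting point under those assumptions, not as a complete solution.

```python
import pyparsing as pp

token = pp.Word(pp.alphanums + '_-.')

# Forward lets the command rule refer to itself inside $( ... ).
nested_command = pp.Forward()
substitution = pp.Group(pp.Suppress('$(') + nested_command + pp.Suppress(')'))
nested_command <<= pp.OneOrMore(substitution | token)

simple = nested_command.parseString('echo $(git branch)').asList()
# ['echo', ['git', 'branch']]

deeper = nested_command.parseString('echo $(basename $(pwd))').asList()
# ['echo', ['basename', ['pwd']]]
```

Each substitution comes back as its own nested list, so a transformation step like to_ast could recurse into it.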