Parse shell one-liners with pyparsing


For one of my projects I needed some one-liners parser to AST. I’ve tried PLY, pyPEG and a few more. And stopped on pyparsing. It’s actively maintained, works without magic and easy to use.

Ideally I wanted to parse something like:

LANG=en_US.utf-8 git diff | wc -l >> diffs

To something like:

(= LANG en_US.utf-8)
(>> (| (git diff) (wc -l))
    (diffs))

So let’s start with simple shell command, it’s just space-separated tokens:

import pyparsing as pp


token = pp.Word(pp.alphanums + '_-.')
command = pp.OneOrMore(token)

command.parseString('git branch --help')
>>> ['git', 'branch', '--help']

It’s simple, another simple part is parsing environment variables. One environment variable is token=token, and list of them separated by spaces:

env = pp.Group(token + '=' + token)

env.parseString('A=B')
>>>[['A', '=', 'B']]

env_list = pp.OneOrMore(env)

env_list.parseString('VAR=test X=1')
>>> [['VAR', '=', 'test'], ['X', '=', '1']]

And now we can easily merge command and environment variables, mind that environment variables are optional:

command_with_env = pp.Optional(pp.Group(env_list)) + pp.Group(command)

command_with_env.parseString('LOCALE=en_US.utf-8 git diff')
>>> [[['LOCALE', '=', 'en_US.utf-8']], ['git', 'diff']]

Now we need to add support of pipes, redirects and logical operators. Here we don’t need to know what they’re doing, so we’ll treat them just like separators between commands:

separators = ['1>>', '2>>', '>>', '1>', '2>', '>', '<', '||', '|', '&&', '&', ';']
separator = pp.oneOf(separators)
command_with_separator = pp.OneOrMore(pp.Group(command) + pp.Optional(separator))

command_with_separator.parseString('git diff | wc -l >> out.txt')
>>> [['git', 'diff'], '|', ['wc', '-l'], '>>', ['out.txt']]

And now we can merge environment variables, commands and separators:

one_liner = pp.Optional(pp.Group(env_list)) + pp.Group(command_with_separator)
            
one_liner.parseString('LANG=C DEBUG=true git branch | wc -l >> out.txt')
>>> [[['LANG', '=', 'C'], ['DEBUG', '=', 'true']], [['git', 'branch'], '|', ['wc', '-l'], '>>', ['out.txt']]]

Result is hard to process, so we need to structure it:

one_liner = pp.Optional(env_list).setResultsName('env') + \
            pp.Group(command_with_separator).setResultsName('command')
result = one_liner.parseString('LANG=C DEBUG=true git branch | wc -l >> out.txt')

print('env:', result.env, '\ncommand:', result.command)
>>> env: [['LANG', '=', 'C'], ['DEBUG', '=', 'true']] 
>>> command: [['git', 'branch'], '|', ['wc', '-l'], '>>', ['out.txt']]

Although we didn’t get AST, but just a bunch of grouped tokens. So now we need to transform it to proper AST:

def prepare_command(command):
    """We don't need to work with pyparsing internal data structures,
    so we just convert them to list.
    
    """
    for part in command:
        if isinstance(part, str):
            yield part
        else:
            yield list(part)


def separator_position(command):
    """Find last separator position."""
    for n, part in enumerate(command[::-1]):
        if part in separators:
            return len(command) - n - 1


def command_to_ast(command):
    """Recursively transform command to AST.""" 
    n = separator_position(command)
    if n is None:
        return tuple(command[0])
    else:
        return (command[n],
                command_to_ast(command[:n]),
                command_to_ast(command[n + 1:]))


def to_ast(parsed):
    if parsed.env:
        for env in parsed.env:
            yield ('=', env[0], env[2])
    command = list(prepare_command(parsed.command))
    yield command_to_ast(command)
   
   
list(to_ast(result))
>>> [('=', 'LANG', 'C'),
>>>  ('=', 'DEBUG', 'true'),
>>>  ('>>', ('|', ('git', 'branch'),
>>>               ('wc', '-l')),
>>>         ('out.txt',))]

It’s working. The last part, glue that make it easier to use:

def parse(command):
    result = one_liner.parseString(command)
    ast = to_ast(result)
    return list(ast)
    
    
parse('LANG=en_US.utf-8 git diff | wc -l >> diffs')
>>> [('=', 'LANG', 'en_US.utf-8'),
     ('>>', ('|', ('git', 'diff'),
                  ('wc', '-l')),
            ('diffs',))]

Although it can’t parse all one-liners, it doesn’t support nested commands like:

echo $(git branch)
echo `git branch`

But it’s enough for my task and support of not implemented features can be added easily.

Gist with source code.



comments powered by Disqus