Practical Refactoring with Syntax Trees

Example:

            
#!/usr/bin/env bash
sed -i 's/old_function/new_function/g' *.py

Code as Data

`(1 + 2) * 3`

Let's start with the "tree" part of Abstract Syntax Tree. An expression like this one can be represented as a tree: there's an operator, a left-hand side and a right-hand side, and the left-hand side is itself an operation. The nicer visualizations were all in the tutorial session this morning; if you missed it, you're stuck with my design skills. The root node is the multiplication operator, with the left-hand side and the right-hand side as its children. The left-hand child is itself a subtree: the addition, with 1 and 2 as its children. If you had to turn yourself into a computer, you would evaluate it starting from the bottom: you'd do 1 + 2 first (essentially evaluating the left child subtree), then use that result to evaluate the root node.
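To make the bottom-up evaluation concrete, here is a minimal sketch. The `Num` and `BinOp` classes and the `evaluate` function are made up for illustration; they are not the node types Python itself uses.

```python
from dataclasses import dataclass
import operator


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str
    left: "Num | BinOp"
    right: "Num | BinOp"


OPERATORS = {"+": operator.add, "*": operator.mul}


def evaluate(node):
    """Evaluate the children first, then apply the operator at this node."""
    if isinstance(node, Num):
        return node.value
    left = evaluate(node.left)
    right = evaluate(node.right)
    return OPERATORS[node.op](left, right)


# (1 + 2) * 3: multiplication at the root, addition as the left subtree
tree = BinOp("*", BinOp("+", Num(1), Num(2)), Num(3))
result = evaluate(tree)  # 9
```

The recursion reaches the leaves before any operator is applied, which is the "start from the bottom" order described above.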

                
$ echo "(1 + 2) * 3" > simple.py
$ python -m ast simple.py


Module(
   body=[
      Expr(
         value=BinOp(
            left=BinOp(
               left=Constant(value=1),
               op=Add(),
               right=Constant(value=2)),
            op=Mult(),
            right=Constant(value=3)))],
   type_ignores=[])

Let's take a look at the Python AST for this expression. There's more stuff here. There's a Module, which is basically always the top-level element. The module has a body, a list that here contains just one expression statement. Inside it we find a tree similar to the one we had. Notice we got that tree by running the `ast` module from the standard library as a script, with the `python -m` syntax. You can run that in your terminal on your own Python files. It's not particularly readable: we're better at reading source code. Every node is a Python object, and the children are attributes on that object. There are nodes for function definitions (`FunctionDef`), assignments (`Assign`), `for` loops (`For`)... You don't need to know them beforehand.
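A small sketch of "every node is a Python object, and the children are attributes", using only the standard library. The variable names are just for illustration.

```python
import ast

tree = ast.parse("(1 + 2) * 3")

expr = tree.body[0]  # the single Expr statement in the module body
root = expr.value    # the outer BinOp, i.e. the multiplication

root_op = type(root.op).__name__       # 'Mult'
left_op = type(root.left.op).__name__  # 'Add'
right_value = root.right.value         # 3

# ast.walk yields every node in the tree, in no particular order
node_types = {type(node).__name__ for node in ast.walk(tree)}
```

Navigating with plain attribute access like this is usually enough to explore a tree by hand before writing any transformer.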

What is the Abstract Syntax Tree?

a tree of Python objects that represent the source
node types represent language constructs: expression, assignment, import, etc.
used by the Python interpreter to run code
a representation of the code that we can modify

>>> import ast
>>> source = "(1 + 2) * 3"
>>> node = ast.parse(source)
>>> node
<ast.Module object at 0x7bcbef830370>
>>> ast.dump(node)
'Module(body=[Expr(value=BinOp(left=BinOp(left=Constant(value=1), op=Add(), right=Constant(value=2)), op=Mult(), right=Constant(value=3)))], type_ignores=[])'
>>> ast.unparse(node)
'(1 + 2) * 3'

Anatomy of a refactoring script

            
# read, parse, transform, write
source_code = read(file_path)
tree = parse(source_code)
transformed_tree = transform_tree(tree)
write(transformed_tree.unparse(), file_path)
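The slide above is pseudocode. A runnable version with the standard library could look like the sketch below, assuming Python 3.9+ for `ast.unparse`. The `transform_tree` placeholder and the `refactor_file` name are illustrative, and the transform deliberately does nothing yet.

```python
import ast
import tempfile
from pathlib import Path


def transform_tree(tree: ast.AST) -> ast.AST:
    """Placeholder transform: returns the tree unchanged."""
    return tree


def refactor_file(file_path: Path) -> None:
    # read, parse, transform, write
    source_code = file_path.read_text()
    tree = ast.parse(source_code)
    transformed_tree = transform_tree(tree)
    file_path.write_text(ast.unparse(transformed_tree) + "\n")


# Try it on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "simple.py"
    path.write_text("(1 + 2) * 3\n")
    refactor_file(path)
    result = path.read_text()
```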

Transforming the AST

Utilities: ast.NodeVisitor, ast.NodeTransformer
Depth-first traversal
Define visit_[NodeType] methods
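As a read-only warm-up before transforming anything, here is a minimal `ast.NodeVisitor` sketch. The `FunctionCollector` class is a made-up example that records function names in the order the depth-first traversal reaches them.

```python
import ast


class FunctionCollector(ast.NodeVisitor):
    def __init__(self):
        self.names = []

    def visit_FunctionDef(self, node):
        self.names.append(node.name)
        # Without this call, the traversal would stop at this node
        # and nested functions would never be visited.
        self.generic_visit(node)


source = """
def outer():
    def inner():
        pass

def other():
    pass
"""

collector = FunctionCollector()
collector.visit(ast.parse(source))
```

`inner` is collected before `other` because the visitor goes all the way down into `outer` before moving on to its sibling.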

ast.NodeTransformer

before.py

b = a + 1

after.py

data['b'] = data['a'] + 1


Assign(
  targets=[
    AssignTarget(
      target=Name(value='b')
    )
  ],
  value=BinaryOperation(
    left=Name(value='a'),
    operator=Add(),
    right=Integer(value='1')
  )
)


Assign(
  targets=[
    AssignTarget(
      target=Subscript(
        value=Name(value='data'),
        slice=[
          SubscriptElement(
            slice=Index(
              value=SimpleString(
                        value="'b'"
                        ),
            )
          )
        ],
      )
    )
  ],
  value=BinaryOperation(
    left=Subscript(
      value=Name(value='data'),
      slice=[
        SubscriptElement(
          slice=Index(
            value=SimpleString(
                        value="'a'"
                        ),
          )
        )
      ]
    ),
    operator=Add(),
    right=Integer(value='1')
  )
)

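from ast import Constant, Load, Name, NodeTransformer, Subscript


class RewriteName(NodeTransformer):

    def visit_Name(self, node):
        # Replace every bare name `foo` with the lookup `data['foo']`
        return Subscript(
            value=Name(id='data', ctx=Load()),
            slice=Constant(value=node.id),
            ctx=node.ctx
        )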

Now say we want to modify the AST. Let's look at this example from the docs. The `ast` module gives us the `NodeTransformer` class, which we can use to transform an AST: we have a piece of code and we want to modify it, as if rewriting it on the fly. `NodeTransformer` follows the "visitor" pattern, meaning we define methods named `visit_NodeType`, and they are called on the relevant nodes as the tree is visited. The way you'd work this out is to print both trees, before and after; then you know what to change. A `Subscript` node means bracket access, and the slice is the index or key we use inside the brackets. This transformer replaces every variable like `foo` with `data['foo']`. The part we're interested in is the `NodeTransformer`: it's a pattern we can use to modify the AST.
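For reference, this is roughly how the transformer can be applied end to end, assuming Python 3.9+ (for `ast.unparse` and for passing a `Constant` directly as the slice). The class is repeated so the snippet runs on its own.

```python
import ast


class RewriteName(ast.NodeTransformer):
    def visit_Name(self, node):
        return ast.Subscript(
            value=ast.Name(id="data", ctx=ast.Load()),
            slice=ast.Constant(value=node.id),
            ctx=node.ctx,
        )


tree = ast.parse("b = a + 1")
new_tree = RewriteName().visit(tree)

# The new nodes have no line/column information; fill it in so the tree
# could also be compiled, not just unparsed.
ast.fix_missing_locations(new_tree)

result = ast.unparse(new_tree)  # "data['b'] = data['a'] + 1"
```

The freshly created `Name(id="data")` node is returned, not visited again, so it does not get rewritten into `data['data']`.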

Python AST for refactoring

exhibit A


1 + (2 * 3)  # IP-protected formula

exhibit B

1 + 2 * 3

Same AST 😱

Python AST for refactoring: oh no

does not preserve formatting
does not preserve comments

Good if your codebase is not formatted and comments are overrated anyway.
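The two exhibits can be checked with a short standard-library sketch (Python 3.9+ assumed for `ast.unparse`; the variable names are illustrative):

```python
import ast

exhibit_a = "1 + (2 * 3)  # IP-protected formula"
exhibit_b = "1 + 2 * 3"

# ast.dump leaves out line/column attributes by default,
# so this compares only the structure of the two trees.
same_tree = ast.dump(ast.parse(exhibit_a)) == ast.dump(ast.parse(exhibit_b))

# Round-tripping exhibit A through the AST loses the redundant
# parentheses and the comment.
round_tripped = ast.unparse(ast.parse(exhibit_a))
```

After a parse and unparse, exhibit A comes back as exhibit B: the information that distinguished them was never in the tree.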

Concrete Syntax Trees


(1 + 2) * 3   # comment


SimpleStatementLine(
  body=[
    Expr(
      value=BinaryOperation(
        left=BinaryOperation(
          left=Integer(value='1'),
          operator=Add(),
          right=Integer(value='2'),
          lpar=[LeftParen()],
          rpar=[RightParen()],
        ),
        operator=Multiply(),
        right=Integer(value='3'),
      ),
    ),
  ],
  trailing_whitespace=TrailingWhitespace(
    whitespace=SimpleWhitespace(value='   '),
    comment=Comment(value='# comment'),
  ),
)

Concrete Syntax Trees

Cousins of AST. Sometimes called Parse Trees.
Preserve whitespace, parentheses, comments...
Allow 'round-tripping'
Great fit for refactoring scripts
We do not have to care about whitespace

They are cousins of ASTs. Sometimes they are called parse trees. Sometimes... they are also called ASTs. The frontier is a little blurry. They are different trees: the node types and structure won't be exactly the same, but the mental model transfers. We don't have to care about whitespace, and in fact we will not. We also won't really have to care about comments in our refactoring scripts. We do care that they are preserved, but that happens essentially without us having to do anything. If you put all the formatting information aside, it looks similar to the AST. Let's look at some examples now to get a feel for the tree structure and see different types of nodes.

Concrete Syntax Trees

                    
import numpy

Import(
  names=[
    ImportAlias(
      name=Name(
        value='numpy'
      )
    )
  ]
)

Concrete Syntax Trees

                    
import numpy as np

                    
Import(
  names=[
    ImportAlias(
      name=Name(
        value='numpy'
      ),
      asname=AsName(
        name=Name(
          value='np'
        )
      )
    )
  ]
)

Concrete Syntax Trees

                    
a = 'Hello'

Assign(
  targets=[
    AssignTarget(
      target=Name(value='a')
    )
  ],
  value=SimpleString(value="'Hello'")
)

Concrete Syntax Trees

                    
a = f('Hello')

                    
Assign(
  targets=[
    AssignTarget(
      target=Name(value='a')
    )
  ],
  value=Call(
    func=Name(value='f'),
    args=[
      Arg(
        value=SimpleString(value="'Hello'")
      )
    ]
  )
)

Example: rename pytest fixtures

                    
@pytest.fixture
def test_user():
    return {"name": "test"}

def test_login(test_user):
    ...

What's wrong with this?

Naming convention
This is wrong
Why would you
Hurts my feelings
Can cause crashes with some pytest versions

Hundreds of fixtures like this one.

What do we want?

before.py



import pytest

@pytest.fixture
def test_user():
    return {"name": "test"}

def test_login(test_user):
    assert test_user["name"] == "test"

after.py


import pytest

@pytest.fixture
def user_fixture():
    return {"name": "test"}

def test_login(user_fixture):
    assert user_fixture["name"] == "test"

Change this, on hundreds of fixtures.

LibCST API

CSTTransformer ~= ast.NodeTransformer
Define methods:
- visit_[NodeType]
- leave_[NodeType]


import libcst as cst


class Transformer(cst.CSTTransformer):
    def visit_FunctionDef(self, node):
        ...

    def leave_Name(self, original_node, updated_node):
        ...


class Transformer(cst.CSTTransformer):

    def __init__(self):
        self.renames: dict[str, str] = {}

    def visit_FunctionDef(self, node):
        """Collect fixtures that need to be renamed."""
        if is_pytest_fixture(node) and should_rename(node):
            old_name = node.name.value
            self.renames[old_name] = generate_new_name(old_name)
        return True  # Continue visiting children


class Transformer(cst.CSTTransformer):

    def leave_Name(self, original_node, updated_node):
        """Update variables that match renamed fixtures."""
        name = updated_node.value
        if name in self.renames:
            new_name = self.renames[name]
            return updated_node.with_changes(value=new_name)
        return updated_node

Matching fixtures

                
def test_user():
    ...

FunctionDef(
  name=Name(value='test_user'),
  params=Parameters(),
  body=SimpleStatementSuite(body=[
      Expr(value=Ellipsis()),
    ]),
  decorators=[],
)

Matching fixtures

                
@pytest.fixture
def test_user():
    ...

        
FunctionDef(
  name=Name(value='test_user'),
  params=Parameters(),
  body=SimpleStatementSuite(body=[
      Expr(value=Ellipsis()),
    ]),
  decorators=[
    Decorator(
      decorator=Attribute(
        value=Name(value='pytest'),
        attr=Name(value='fixture'),
      ),
    ),
  ],
)

Matching fixtures


def is_pytest_fixture(node: cst.FunctionDef) -> bool:
    # Only bare decorators are matched here. A decorator written with
    # parentheses, like @pytest.fixture(scope="module"), is a cst.Call node.
    for decorator in node.decorators:
        match decorator.decorator:
            # Handle @fixture
            case cst.Name(value="fixture"):
                return True
            # Handle @pytest.fixture
            case cst.Attribute(value=cst.Name(value="pytest"),
                               attr=cst.Name(value="fixture")):
                return True
    return False


class Transformer(cst.CSTTransformer):
    def __init__(self):
        self.renames: dict[str, str] = {}

    def visit_FunctionDef(self, node) -> bool:
        """Collect fixtures that need to be renamed."""
        if is_pytest_fixture(node) and should_rename(node):
            old_name = node.name.value
            self.renames[old_name] = generate_new_name(old_name)
        return True  # Continue visiting children

    def leave_Name(self, original_node, updated_node):
        """Update variables that match renamed fixtures."""
        name = updated_node.value
        if name in self.renames:
            new_name = self.renames[name]
            return updated_node.with_changes(value=new_name)
        return updated_node

Running codemods

Clean git working tree
Run codemod script
Run formatters/linters
Commit just this

Automated changes = isolated commits

So this is the stuff I talk about at parties: a workflow to run codemods! 1. Make sure `git status` says everything is clean. You don't want unstaged changes. 2. Run the codemod. If something went wrong, just collect a couple of examples and `git reset --hard`. 3. Run formatters and linters. I mention this as a separate step, but you can run it as part of your script; I like pre-commit for that. 4. Commit _just that_. This is a bit of boring advice, but I'd feel bad not mentioning it. The rule (automated changes = isolated commits) is important! You will thank yourself if you need to rebase on top of a lot of changes: instead of dealing with conflicts, it's often easier to drop the automated changes and re-run the script.
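If you want the workflow itself scripted, a rough sketch could look like this. The function names and the `codemod_command` / `formatter_command` parameters are made up for illustration; only standard `git` subcommands are used.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """`git status --porcelain` prints nothing when the working tree is clean."""
    return porcelain_output.strip() == ""


def git(*args: str) -> str:
    completed = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return completed.stdout


def run_codemod(
    codemod_command: list[str],
    formatter_command: list[str],
    message: str,
) -> None:
    # 1. Clean git working tree
    if not is_clean(git("status", "--porcelain")):
        raise RuntimeError("Working tree is not clean: commit or stash first")
    # 2. Run the codemod script
    subprocess.run(codemod_command, check=True)
    # 3. Run formatters/linters
    subprocess.run(formatter_command, check=True)
    # 4. Commit just this (--all stages changes to tracked files only)
    git("commit", "--all", "--message", message)
```

Keeping the commit step in the script makes it harder to accidentally mix manual edits into the automated commit.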

Writing codemods

Maintainability? No
90%? Claim success
Leave formatting to tooling: more fun
Test Driven Development

So, the advice on running codemods: be rigorous. Follow the method, atomic commits... BORING. Now what do we care about when we're _writing_ codemods? Maintainability: not so much. This doesn't always apply: if you're a library author writing a codemod so all your users can upgrade, you'll need a higher bar, and you probably can't assume your users all have a formatter like black or ruff. If the script gets you 90% of the way there, that might be enough for your use case! It's fine to do things manually too; it's just another tool in the toolbox, as long as you keep automated changes and manual changes in separate git commits. I prefer to give control of formatting to a tool like black or ruff. Then the refactoring script can mess everything up: there's no need to manage whitespace for the code to still look good after applying the script. It's similar for things like duplicate or unused imports: if you have tooling that handles them, your refactoring script can add the same import three times. It's the Unix philosophy of combining small tools that each do one thing well. All in all, if you're targeting one codebase for a one-off change, that's a bit liberating :). Also, you might notice these bullet points are really where AI agents thrive.

Example: limitations (many)

Same-named local variables
Defined "after use"
Fixtures across files
Matching not the most robust (ex: pytest alias)

Example: Defined "after use"

before.py


def test_function(test_user):
    assert test_user["name"] == "test"


@pytest.fixture
def test_user():
    return {"name": "test"}

after.py


def test_function(user_fixture):
    assert user_fixture["name"] == "test"


@pytest.fixture
def user_fixture():
    return {"name": "test"}

Example: same-name variables

before.py


@pytest.fixture
def test_user():
    return {"name": "test"}

def test_another_function():
    test_user = {"name": "local"}
    assert test_user["name"] == "local"

after.py


@pytest.fixture
def user_fixture():
    return {"name": "test"}

def test_another_function():
    test_user = {"name": "local"}
    assert test_user["name"] == "local"

Going further with LibCST

State across files: multiple passes [Metadata APIs]
Variable scope management [ScopeProvider]
Pattern-matching nodes [matchers]

Automated refactoring: getting started

First step: use existing tools!

django-upgrade, pyupgrade
npx @next/codemod upgrade canary

Start asking "would this be doable?"

Ideas of refactoring scripts

Unittest to pytest

before.py


class TestAssertNotEqual(TestCase):

    def test_you(self):
        self.assertNotEqual(abc, 'xxx')

after.py

def test_you(self):
    assert abc != 'xxx'

From https://github.com/pytest-dev/unittest2pytest/tree/main

Ideas of refactoring scripts

Rewrite apps.get_model (Django) to local imports

before.py


Message = apps.get_model("Chat", "Message")

after.py


from chat.models import Message

Ideas of refactoring scripts

Cleanup feature flags automatically

before.js


const data = featureFlag('new-release')
                ? {name: 'Product'}
                : undefined;

after.js


const data = { name: 'Product' };

Blog post: https://martinfowler.com/articles/codemods-api-refactoring.html

The elephant

LibCST vs the Claudes

Deterministic vs faith-based
Easy rebasing vs another 30 minutes of GPU time
Claude can write the LibCST transformer

LibCST vs the Claudes

Clear and obvious conclusion:

AI really good in many cases :)
Codemods feel better for some changes
Large codebases, reusability: push towards codemods

Resources

Repo with the pytest codemod: https://github.com/ldirer/codemod-rename-pytest-fixtures
Explore ASTs in the browser: https://ast-explorer.dev/
LibCST example, with blog post: https://github.com/seatgeek/tornado-async-transformer
Template inspired by above: https://github.com/ldirer/libcst-codemod-template
Great blog post (mentioned in feature toggles example): https://martinfowler.com/articles/codemods-api-refactoring.html

Thank you

Syntax Trees Everywhere

Interpreter
Linters
Refactoring
'Transpiling': converting from one language to another
Template engines

I want to mention this as a motivation. It's nice to learn about syntax trees because they show up in so many developer tools. So: where else are ASTs used? Interpreters: if you take the JavaScript AST and run it in Python, you have a JavaScript interpreter written in Python. You can also write a Python interpreter in Python. That's one way of confusing people. Linters: ESLint in JavaScript, flake8 in Python. Formatters too: black works on a syntax tree. Transpiling: Babel is an example in JavaScript, converting new syntax to older, compatible syntax. So we can use the fancy new JavaScript features, and it converts the code back to an equivalent version that all browsers can understand. Template engines: Jinja2 is one example. I think it's a fundamental concept, not because it is basic or simple (it's not) but because it powers a lot of the tools we use. And while it might be hard to write these tools (handling edge cases, different Python versions...), understanding the principles goes a long way.
