Practical Refactoring with Syntax Trees

Example:

            
#!/usr/bin/env bash
sed -i 's/old_function/new_function/g' *.py
            

Code as Data

(1 + 2) * 3

                
$ echo "(1 + 2) * 3" > simple.py
$ python -m ast simple.py
                
                

Module(
   body=[
      Expr(
         value=BinOp(
            left=BinOp(
               left=Constant(value=1),
               op=Add(),
               right=Constant(value=2)),
            op=Mult(),
            right=Constant(value=3)))],
   type_ignores=[])
                

What is the Abstract Syntax Tree?

  • a tree of Python objects that represent the source
  • node types represent language constructs: expression, assignment, import, etc.
  • used by the Python interpreter to run code
  • a representation of the code that we can modify

>>> import ast
>>> source = "(1 + 2) * 3"
>>> node = ast.parse(source)
>>> node
<ast.Module object at 0x7bcbef830370>
>>> ast.dump(node)
'Module(body=[Expr(value=BinOp(left=BinOp(left=Constant(value=1), op=Add(), right=Constant(value=2)), op=Mult(), right=Constant(value=3)))], type_ignores=[])'
>>> ast.unparse(node)
'(1 + 2) * 3'
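
Two of the bullets above (the interpreter runs the tree, and the tree can be modified) can be tried in a few lines. This is a minimal sketch using only the standard library; swapping the operator is an arbitrary change chosen for illustration, not something from the talk.

```python
import ast

# Parse in "eval" mode so the tree can be compiled as a single expression.
tree = ast.parse("(1 + 2) * 3", mode="eval")

# The interpreter can compile and run the tree directly.
original_value = eval(compile(tree, "<ast>", "eval"))

# The tree is made of plain Python objects, so it can be modified in place:
# replace the multiplication with an addition.
tree.body.op = ast.Add()

modified_source = ast.unparse(tree)
modified_value = eval(compile(tree, "<ast>", "eval"))

print(original_value)   # 9
print(modified_source)  # 1 + 2 + 3
print(modified_value)   # 6
```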

Anatomy of a refactoring script

            
# read, parse, transform, write (pseudocode)
source_code = read(file_path)
tree = parse(source_code)
transformed_tree = transform_tree(tree)
write(unparse(transformed_tree), file_path)
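
The four steps above could look like this with the standard `ast` module. This is a sketch under assumptions: the `RenameFunction` transformer, the `refactor_file` helper and the temporary file are made up for the example, and the rename itself mirrors the `sed` one-liner from the first slide.

```python
import ast
import tempfile
from pathlib import Path


class RenameFunction(ast.NodeTransformer):
    """Rename definitions of, and references to, `old_function`."""

    def visit_FunctionDef(self, node):
        if node.name == "old_function":
            node.name = "new_function"
        self.generic_visit(node)  # keep walking into the function body
        return node

    def visit_Name(self, node):
        if node.id == "old_function":
            node.id = "new_function"
        return node


def refactor_file(file_path: Path) -> None:
    source_code = file_path.read_text()                # read
    tree = ast.parse(source_code)                      # parse
    transformed_tree = RenameFunction().visit(tree)    # transform
    file_path.write_text(ast.unparse(transformed_tree) + "\n")  # write


# Try it on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.py"
    path.write_text(
        "def old_function():\n"
        "    return 'old_function'\n"
        "\n"
        "print(old_function())\n"
    )
    refactor_file(path)
    result = path.read_text()

print(result)
```

Unlike the `sed` version, the string literal `'old_function'` is left alone: it is a `Constant` node, not a `Name`.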
                
                

Transforming the AST

  • Utilities: ast.NodeVisitor, ast.NodeTransformer
  • Depth-first traversal
  • Define visit_[NodeType] methods
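
A small `ast.NodeVisitor` illustrates the `visit_[NodeType]` convention and the traversal order. The `NameCollector` class is a hypothetical example, not part of the talk.

```python
import ast


class NameCollector(ast.NodeVisitor):
    """Record every variable name in the order the visitor reaches it."""

    def __init__(self):
        self.names = []

    def visit_Name(self, node):
        self.names.append(node.id)
        self.generic_visit(node)  # continue into child nodes


collector = NameCollector()
collector.visit(ast.parse("b = a + 1"))
print(collector.names)  # ['b', 'a']
```

The assignment target `b` is reached before `a` because the visitor walks the fields of `Assign` in order (`targets`, then `value`) and fully descends into each one before moving on.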

ast.NodeTransformer

before.py

b = a + 1

after.py

data['b'] = data['a'] + 1

Assign(
  targets=[
    AssignTarget(
      target=Name(value='b')
    )
  ],
  value=BinaryOperation(
    left=Name(value='a'),
    operator=Add(),
    right=Integer(value='1')
  )
)
                    

Assign(
  targets=[
    AssignTarget(
      target=Subscript(
        value=Name(value='data'),
        slice=[
          SubscriptElement(
            slice=Index(
              value=SimpleString(value="'b'"),
            )
          )
        ],
      )
    )
  ],
  value=BinaryOperation(
    left=Subscript(
      value=Name(value='data'),
      slice=[
        SubscriptElement(
          slice=Index(
            value=SimpleString(value="'a'"),
          )
        )
      ]
    ),
    operator=Add(),
    right=Integer(value='1')
  )
)
                    

ast.NodeTransformer

            
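from ast import Constant, Load, Name, NodeTransformer, Subscript


class RewriteName(NodeTransformer):

    def visit_Name(self, node):
        # Replace every bare name `x` with `data['x']`, keeping the
        # original context (Load or Store) on the new subscript.
        return Subscript(
            value=Name(id='data', ctx=Load()),
            slice=Constant(value=node.id),
            ctx=node.ctx
        )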
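
To see the transformer produce `after.py` from `before.py`, it can be run on the one-line source and unparsed again. A minimal sketch, assuming Python 3.9 or later (for `ast.unparse` and for passing a `Constant` directly as `slice`):

```python
import ast


class RewriteName(ast.NodeTransformer):
    def visit_Name(self, node):
        return ast.Subscript(
            value=ast.Name(id="data", ctx=ast.Load()),
            slice=ast.Constant(value=node.id),
            ctx=node.ctx,
        )


tree = ast.parse("b = a + 1")
new_tree = RewriteName().visit(tree)
# Newly created nodes carry no line/column information; fill it in so the
# tree could also be compiled, not only unparsed.
ast.fix_missing_locations(new_tree)

result = ast.unparse(new_tree)
print(result)  # data['b'] = data['a'] + 1
```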

            

Python AST for refactoring

exhibit A


1 + (2 * 3)  # IP-protected formula
                    

exhibit B

1 + 2 * 3

Same AST 😱

Python AST for refactoring: oh no

  • does not preserve formatting
  • does not preserve comments

Good if your codebase is not formatted and comments are overrated anyway.
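
Both limitations can be checked with the standard library alone. A short sketch comparing the two exhibits:

```python
import ast

exhibit_a = "1 + (2 * 3)  # IP-protected formula"
exhibit_b = "1 + 2 * 3"

# ast.dump leaves out line/column attributes by default, so the two
# dumps are equal whenever the tree structure is the same.
same_tree = ast.dump(ast.parse(exhibit_a)) == ast.dump(ast.parse(exhibit_b))

# Round-tripping exhibit A drops the redundant parentheses and the comment.
round_tripped = ast.unparse(ast.parse(exhibit_a))

print(same_tree)      # True
print(round_tripped)  # 1 + 2 * 3
```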

Concrete Syntax Trees


(1 + 2) * 3   # comment
            

SimpleStatementLine(
  body=[
    Expr(
      value=BinaryOperation(
        left=BinaryOperation(
          left=Integer(value='1'),
          operator=Add(),
          right=Integer(value='2'),
          lpar=[LeftParen()],
          rpar=[RightParen()],
        ),
        operator=Multiply(),
        right=Integer(value='3'),
      ),
    ),
  ],
  trailing_whitespace=TrailingWhitespace(
    whitespace=SimpleWhitespace(value='   '),
    comment=Comment(value='# comment'),
  ),
)
                
            

Concrete Syntax Trees

  • Cousins of AST. Sometimes called Parse Trees.
  • Preserve whitespace, parentheses, comments...
  • Allow 'round-tripping'
  • Great fit for refactoring scripts
  • We do not have to care about whitespace

Concrete Syntax Trees

                    
import numpy

Import(
  names=[
    ImportAlias(
      name=Name(
        value='numpy'
      )
    )
  ]
)





                    
                

Concrete Syntax Trees

                    
import numpy as np
                    
                
                    
Import(
  names=[
    ImportAlias(
      name=Name(
        value='numpy'
      ),
      asname=AsName(
        name=Name(
          value='np'
        )
      )
    )
  ]
)
                    
                

Concrete Syntax Trees

                    
a = 'Hello'

Assign(
  targets=[
    AssignTarget(
      target=Name(value='a')
    )
  ],
  value=SimpleString(value="'Hello'")
)







                    
                

Concrete Syntax Trees

                    
a = f('Hello')
                    
                
                    
Assign(
  targets=[
    AssignTarget(
      target=Name(value='a')
    )
  ],
  value=Call(
    func=Name(value='f'),
    args=[
      Arg(
        value=SimpleString(value="'Hello'")
      )
    ]
  )
)
                    
                

Example: rename pytest fixtures

                    
@pytest.fixture
def test_user():
    return {"name": "test"}

def test_login(test_user):
    ...
                
                

What's wrong with this?

  • Naming convention
  • This is wrong
  • Why would you
  • Hurts my feelings
  • Can cause crashes with some pytest versions

Hundreds of fixtures like this one.

What do we want?

before.py



import pytest

@pytest.fixture
def test_user():
    return {"name": "test"}

def test_login(test_user):
    assert test_user["name"] == "test"
      

after.py


import pytest

@pytest.fixture
def user_fixture():
    return {"name": "test"}

def test_login(user_fixture):
    assert user_fixture["name"] == "test"
      

Change this, on hundreds of fixtures.

LibCST API

  • CSTTransformer ~= ast.NodeTransformer
  • Define methods:
    • visit_[NodeType]
    • leave_[NodeType]

import libcst as cst


class Transformer(cst.CSTTransformer):
    def visit_FunctionDef(self, node):
        ...

    def leave_Name(self, original_node, updated_node):
        ...
            

class Transformer(cst.CSTTransformer):

    def __init__(self):
        self.renames: dict[str, str] = {}

    def visit_FunctionDef(self, node):
        """Collect fixtures that need to be renamed."""
        if is_pytest_fixture(node) and should_rename(node):
            old_name = node.name.value
            self.renames[old_name] = generate_new_name(old_name)
        return True  # Continue visiting children

class Transformer(cst.CSTTransformer):

    def leave_Name(self, original_node, updated_node):
        """Update variables that match renamed fixtures."""
        name = updated_node.value
        if name in self.renames:
            new_name = self.renames[name]
            return updated_node.with_changes(value=new_name)
        return updated_node
                
            

Matching fixtures

                
def test_user():
    ...
                
            
FunctionDef(
  name=Name(value='test_user'),
  params=Parameters(),
  body=SimpleStatementSuite(body=[
      Expr(value=Ellipsis()),
    ]),
  decorators=[],
)







                    
            

Matching fixtures

                
@pytest.fixture
def test_user():
    ...
                
            
        
FunctionDef(
  name=Name(value='test_user'),
  params=Parameters(),
  body=SimpleStatementSuite(body=[
      Expr(value=Ellipsis()),
    ]),
  decorators=[
    Decorator(
      decorator=Attribute(
        value=Name(value='pytest'),
        attr=Name(value='fixture'),
      ),
    ),
  ],
)
                    
            

Matching fixtures


def is_pytest_fixture(node: cst.FunctionDef) -> bool:
    for decorator in node.decorators:
        match decorator.decorator:
            # Handle @fixture
            case cst.Name(value="fixture"):
                return True
            # Handle @pytest.fixture
            case cst.Attribute(value=cst.Name(value="pytest"),
                               attr=cst.Name(value="fixture")):
                return True
    return False
            

class Transformer(cst.CSTTransformer):
    def __init__(self):
        self.renames: dict[str, str] = {}

    def visit_FunctionDef(self, node) -> bool:
        """Collect fixtures that need to be renamed."""
        if is_pytest_fixture(node) and should_rename(node):
            old_name = node.name.value
            self.renames[old_name] = generate_new_name(old_name)
        return True  # Continue visiting children

    def leave_Name(self, original_node, updated_node):
        """Update variables that match renamed fixtures."""
        name = updated_node.value
        if name in self.renames:
            new_name = self.renames[name]
            return updated_node.with_changes(value=new_name)
        return updated_node


            

Running codemods

  1. Clean git working tree
  2. Run codemod script
  3. Run formatters/linters
  4. Commit just this

Automated changes = isolated commits

Writing codemods

  • Maintainability? No
  • 90%? Claim success
  • Leave formatting to tooling: more fun
  • Test Driven Development

Example: limitations (many)

  • Same-named local variables
  • Defined "after use"
  • Fixtures across files
  • Matching not the most robust (ex: pytest alias)

Example: Defined "after use"

before.py


def test_function(test_user):
    assert test_user["name"] == "test"


@pytest.fixture
def test_user():
    return {"name": "test"}
                    

after.py


def test_function(user_fixture):
    assert user_fixture["name"] == "test"


@pytest.fixture
def user_fixture():
    return {"name": "test"}
                    

Example: same-name variables

before.py


@pytest.fixture
def test_user():
    return {"name": "test"}

def test_another_function():
    test_user = {"name": "local"}
    assert test_user["name"] == "local"
                        

after.py


@pytest.fixture
def user_fixture():
    return {"name": "test"}

def test_another_function():
    test_user = {"name": "local"}
    assert test_user["name"] == "local"
                        

Going further with LibCST

  • State across files: multiple passes [Metadata APIs]
  • Variable scope management [ScopeProvider]
  • Pattern-matching nodes [matchers]

Automated refactoring: getting started

  • First step: use existing tools!
    • django-upgrade, pyupgrade
    • npx @next/codemod upgrade canary
  • Start asking "would this be doable?"

Ideas of refactoring scripts

Unittest to pytest

before.py


class TestAssertNotEqual(TestCase):

    def test_you(self):
        self.assertNotEqual(abc, 'xxx')
                    

after.py

class TestAssertNotEqual(TestCase):

    def test_you(self):
        assert abc != 'xxx'

                    

From https://github.com/pytest-dev/unittest2pytest/tree/main

Ideas of refactoring scripts

Rewrite apps.get_model (Django) to local imports

before.py


Message = apps.get_model("Chat", "Message")
                    

after.py


from chat.models import Message
                    

Ideas of refactoring scripts

Cleanup feature flags automatically

before.js


const data = featureFlag('new-release')
  ? { name: 'Product' }
  : undefined;
                    

after.js


const data = { name: 'Product' };
                    

Blog post: https://martinfowler.com/articles/codemods-api-refactoring.html

The elephant

LibCST vs the Claudes

  • Deterministic vs faith-based
  • Easy rebasing vs another 30 minutes of GPU time
  • Claude can write the LibCST transformer

LibCST vs the Claudes

Clear and obvious conclusion:

  • AI really good in many cases :)
  • Codemods feel better for some changes
  • Large codebases, reusability: push towards codemods

Resources

Thank you

Syntax Trees Everywhere

  • Interpreter
  • Linters
  • Refactoring
  • 'Transpiling': converting from one language to another
  • Template engines

Big picture: from source code to execution
