Typos correction in code identifiers

Irina Khismatullina, source{d}.

#MLonCode

Plan

  1. Intro
  2. When do we correct typos in code identifiers
  3. How do we correct typos in code identifiers
  4. How can we get better

Code identifiers

class ClassName:
    @classmethod
    def function_name(cls) -> str:
        variable_name = "Hello, I'm ClassName object!"
        return variable_name

Typos correction

funktion    function

GetValu    GetValue

str_lenght    str_length
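Each of the three typos above is a single edit away from its correction, as long as swapping two adjacent characters counts as one edit. The sketch below checks this with the optimal string alignment distance (a restricted Damerau-Levenshtein distance). The function name `osa_distance` and the `PAIRS` list are illustrative choices, not part of the talk.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance.

    Insertions, deletions, substitutions and transpositions of two
    adjacent characters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# The typo/correction pairs from the slide above.
PAIRS = [
    ("funktion", "function"),      # substitution
    ("GetValu", "GetValue"),       # missing character
    ("str_lenght", "str_length"),  # swapped characters
]

for typo, fixed in PAIRS:
    print(f"{typo} -> {fixed}: distance {osa_distance(typo, fixed)}")
```

With plain Levenshtein distance, which has no transposition operation, the last pair would cost 2 instead of 1.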

Typos correction inside code identifiers

class ClasName:
    @classmethod
    def function_name(cls) -> str:
        varyable_name = "Hello, I'm ClassName object!"
        return varyable_name

Typos correction inside code identifiers

class ClassName:
    @classmethod
    def function_name(cls) -> str:
        variable_name = "Hello, I'm ClassName object!"
        return variable_name

When do we fix typos in code?

Stages of development

  1. Writing code
  2. Compiling, building, running, fixing code locally
  3. Creating a Pull Request (PR) to add code to the project's repository
  4. Code review
  5. Merging new code with the whole project code

Stages of development

  1. Writing code - developer & IDE
  2. Running, fixing code locally
  3. Creating a PR to add code to the project's repository
  4. Code review
  5. Merging new code with the whole project code

Stages of development

  1. Writing code - developer & IDE
  2. Running, fixing code locally - developer & IDE & tools
  3. Creating a PR to add code to the project's repository
  4. Code review
  5. Merging new code with the whole project code

Stages of development

  1. Writing code - developer & IDE
  2. Running, fixing code locally - developer & IDE & tools
  3. Creating a PR to add code to the project's repository - automatic checks
  4. Code review
  5. Merging new code with the whole project code

Code review

Colleagues, leads, maintainers or teachers check the new code and suggest changes. Once the reviewers are happy, the code is merged into the codebase.

Stages of development

  1. Writing code - developer & IDE
  2. Running, fixing code locally - developer & IDE & tools
  3. Creating a PR to add code to the project's repository - automatic checks
  4. Code review - colleagues, teachers, team leads
  5. Merging new code with the whole project code

Solution: automate what we can!

Stages of development

  1. Writing code - developer & IDE
  2. Running, fixing code locally - developer & IDE
  3. Creating a PR to add code to the project's repository - automatic checks
  4. Code review - colleagues, teachers, team leads
  5. Merging new code with the whole project code

Goal: filter out as many typos as possible automatically on PR creation

Automatic checks on PR creation

Lookout

Lookout analyzer

High-level API

class MyAnalyzer(Analyzer):
    @classmethod
    def train(cls, ...) -> AnalyzerModel:
        # ...

    def analyze(self, ...) -> [Comment]:
        # do something with self.model

How do we fix typos in code identifiers?

Classical approach

Typos correction in code

Built-in-IDEs spellcheckers

Typos correction with src-d/lookout

Perks

We can use Machine Learning

Vocabulary

  1. Universal vocabulary is derived from the dataset of identifiers:
    • Contains most frequent tokens
    • Contains English words
    • Filtering based on edit distances
  2. All identifiers' tokens in the repo are added to the vocabulary
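The two vocabulary sources above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the talk: `split_identifier`, `build_vocabulary` and `top_n` are hypothetical names, the splitting rule (snake_case and camelCase boundaries) is an assumption, and the English-word list and edit-distance filtering mentioned above are left out.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def split_identifier(identifier: str) -> List[str]:
    """Split a snake_case or camelCase identifier into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def build_vocabulary(
    dataset_identifiers: Iterable[str],
    repo_identifiers: Iterable[str],
    top_n: int,
) -> Set[str]:
    """Combine frequent dataset tokens with every token found in the repo."""
    # 1. Universal part: the top_n most frequent tokens in the dataset.
    counts = Counter(
        token
        for identifier in dataset_identifiers
        for token in split_identifier(identifier)
    )
    universal = {token for token, _ in counts.most_common(top_n)}
    # 2. Repository part: all tokens of all identifiers in the repo.
    repo_tokens = {
        token
        for identifier in repo_identifiers
        for token in split_identifier(identifier)
    }
    return universal | repo_tokens


# Toy data, invented for illustration only.
dataset = ["get_value", "GetValue", "set_value", "str_length"]
repo = ["HTTPServer", "function_name"]
print(sorted(build_vocabulary(dataset, repo, top_n=2)))
```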

Detection

The correction pipeline

Token embeddings

Candidates generation

  1. Find candidates with SymSpell
  2. The identifier to which the token belongs serves as context
  3. Features: based on the frequencies, embeddings, similarity and edit distance between the elements
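The idea behind SymSpell-style lookup is to index every vocabulary word under the strings obtained by deleting a few of its characters, then to look up the deletes of the misspelled token in that index. The pure-Python sketch below illustrates the technique only; it does not use the SymSpell library, and it is not the talk's implementation. All names (`deletes`, `build_index`, `candidates`) and the toy frequencies are assumptions. It attaches just two of the features listed above (edit distance and frequency), leaving out embeddings and identifier context.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def deletes(word: str, max_deletes: int) -> Set[str]:
    """The word itself plus all strings with up to max_deletes characters removed."""
    result = {word}
    for k in range(1, min(max_deletes, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            result.add("".join(word[i] for i in keep))
    return result


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(vocabulary: Dict[str, int], max_deletes: int = 1) -> Dict[str, Set[str]]:
    """Map every delete variant to the vocabulary words that produce it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for word in vocabulary:
        for variant in deletes(word, max_deletes):
            index[variant].add(word)
    return index


def candidates(
    token: str,
    index: Dict[str, Set[str]],
    frequencies: Dict[str, int],
    max_deletes: int = 1,
) -> List[Tuple[str, int, int]]:
    """Return (candidate, edit distance, frequency) tuples for a token."""
    found: Set[str] = set()
    for variant in deletes(token, max_deletes):
        found |= index.get(variant, set())
    features = [(w, levenshtein(token, w), frequencies[w]) for w in found]
    # Sorted only to make the output deterministic; real ranking is a separate step.
    return sorted(features, key=lambda f: (f[1], -f[2], f[0]))


# Toy vocabulary with invented frequencies.
FREQ = {"value": 120, "valve": 3, "length": 80, "variable": 60}
INDEX = build_index(FREQ)
for typo in ("valu", "lenght", "vale"):
    print(typo, candidates(typo, INDEX, FREQ))
```

Deleting from both the vocabulary word and the token is what lets a single delete on each side catch substitutions and adjacent swaps as well as missing or extra characters.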

Candidates ranking

Suggestions

Training and testing

Results

Metrics

Results on a token level

Note: the test dataset is not a ground truth.

Main problems

The worst

Complicated architecture strongly based on human knowledge:

We need something simpler

DL models for typos correction

Summary

Thank you!