Alright, who committed styling changes on 13000 files to the main branch on the day before we need to merge 5 long-lived feature branches for a release cycle?

- A Mouthful Legacy Developer

The Problem


Introducing code standards to large legacy repositories can seem straightforward, but several shortcomings must be accepted or addressed.

Too large

Some issues that can I’ve come across are that some code bases are too large for regular scanning. They overwhelm the linting tool and take several minutes to an hour to complete. That isn’t practical when you enforce standards on every repository push.

Disruption downstream

Other downsides are successfully running the tool and then using an auto-fix feature to update the codebase at once. This en masse update sounds like a great plan, but when pushed downstream to multiple long-lived feature branches, it can become a nightmare to sort out by requiring delicate merge strategies.

Regression

This action may introduce unwarranted risk to the product and force a full regression test. With non-automated QA, product owners or business units may consider this request unreasonable and reject it.

Buy-in

Code quality is a concept that’s difficult for people who don’t have their hands in the code every day to recognize how important it is. It’s especially critical in larger applications where comments, indentation, and general readability result in more confident and comfortable developers.

So what can be done?


The quickest result is pulling the band-aid off, pushing the styling changes en masse, and dealing with the repercussions. Results could consist of angry developers due to complex merges, delayed QA tasks, regression testing, and –perish the thought– a root cause analysis meeting.

Another option is for a gradual introduction of coding conventions. Reformat the code only when it’s touched will lead to eventual consistency.

A concern with this method is that it’s a manual process without tooling that would require discipline and trust in your teammates. Code reviews would help out, but code style should not be the focal point of a code review. There is a need for automated tools.

Gradual integration via tooling


Since I’m primarily a Java developer, this tool focuses on that ecosystem. There may be similar tools for other frameworks/languages that can be used interchangeably.

This section details gradual coding conventions implemented with Samurai Style Cop, a tool for Java projects using Checkstyle. Similar methods can be used for your tool of choice.

Checkstyle is an excellent linting tool with effortless incorporation into most build processes. Whatever tool you use, determine beforehand if it supports suppressions.

Checkstyle is very convenient because it can ignore files and has enough granularity to suppress lines within a file. So if the incoming change is only for lines 50 through 57, it will disregard warnings for other lines.

Use the git diff

To produce a suppression file, start by fetching a list of changed files. Do that by running a git diff against the local repository.

git --no-pager diff --minimal -U0 -- *.java

The above example uses –no-pager to dump unimpeded output to the screen without using an interface like less. Remove if scrolling is desired.

Additionally, –minimal is used to create the smallest possible diff, and -U0 turns off context lines.

Here is an example output using the Samurai Style Cop repository:


diff --git a/src/main/java/com/epicmonstrosity/samuraistyle/Main.java b/src/main/java/com/epicmonstrosity/samuraistyle/Main.java
index 8dc6c1e..43c108b 100644
--- a/src/main/java/com/epicmonstrosity/samuraistyle/Main.java
+++ b/src/main/java/com/epicmonstrosity/samuraistyle/Main.java
@@ -4,0 +5 @@ import com.epicmonstrosity.samuraistyle.diff.UnifiedDiffReader;
+import com.epicmonstrosity.samuraistyle.lint.CheckstyleService;
@@ -8,0 +10 @@ import com.epicmonstrosity.samuraistyle.utils.feeds.StandardInputReader;
+import com.puppycrawl.tools.checkstyle.api.CheckstyleException;
@@ -13,0 +16 @@ import java.nio.file.Files;
+import java.util.List;
@@ -25,2 +28,4 @@ public class Main {
-            InputFeed diffFeed = getInputFeed(diffOptions);
-            final String xml = generateSuppressionXml(diffOptions, diffFeed);
+            final InputFeed diffFeed = getInputFeed(diffOptions);
+            final CheckstyleSuppressor suppressor = getSuppressor(diffOptions);
+            final UnifiedDiffReader reader = getDiffReader(diffFeed);
+            final String xml = suppressor.buildXml(reader.getDocuments());

Brief overview of git diff format (2-way, patch)

Git Diff is an extension of the Unified Diff Format with a few alterations to the header.

All patch headers start with a diff --git header line. a/ and b/ is the filename before and after the changes. Renames will have a different b/ filename. If the file is new or removed, the filenames will be identical. The status of the file will be described later in the header, if applicable.


diff --git a/src/Main.java b/src/Main.java

Some diffs will have extended information. The index header is a common extension that displays the shortened blob object names and the file mode (<hash>..<hash> <mode>). Blob object names are, from my understanding, the name of the file’s versioned content stored in the git database. The mode portion is a record of the file type and permissions bits.


index 8dc6c1e..43c108b 100644

Other extended headers can also be present:


old mode <mode>
new mode <mode>
deleted file mode <mode>
new file mode <mode>
copy from <path>
copy to <path>
rename from <path>
rename to <path>
similarity index <number>
dissimilarity index <number>

The rest of the diff is Unified format.

--- represents the filename prior to the changes. +++ is the filename after changes were made. This should be verbatim of what is in the header, with the event of /dev/null present for either line when the file is created or deleted. The diff --git header line does not utilize /dev/null.


diff --git a/examples/pixel-aspect-ratio/Cargo.toml b/examples/pixel-aspect-ratio/Cargo.toml
new file mode 100644
index 0000000..aaf4732
--- /dev/null
+++ b/examples/pixel-aspect-ratio/Cargo.toml

Following the original/modified filenames is the diff hunk header. This header starts and ends with two ampersands and is defined by the following:

@@ -<Line Number Start>,<Length> +<Line Number Start>,<Length> @@ 

The minus - and plus + correspond with the original and modified file revision. Line number starting refers to the line number in the file where the differences start. The length is the total number of changed lines. For example, if the change starts on line 14 and ends on line 16, the value would be 14,3. The range is inclusive of the starting line.

If no number is after the line start, such as in the example of @@ -13,0 +16 @@, the range is 1.

A range of 0 indicates no modifications, and the change is distinctly an addition or removal.

The next part of the diff hunk is the content lines. Line content starts at column 1. Diff markers populate column 0 and are either - or +, which indicates if the line was removed or added.


@@ -25,2 +28,4 @@ public class Main {
-            InputFeed diffFeed = getInputFeed(diffOptions);
-            final String xml = generateSuppressionXml(diffOptions, diffFeed);
+            final InputFeed diffFeed = getInputFeed(diffOptions);
+            final CheckstyleSuppressor suppressor = getSuppressor(diffOptions);
+            final UnifiedDiffReader reader = getDiffReader(diffFeed);
+            final String xml = suppressor.buildXml(reader.getDocuments());

In this example hunk, lines 25 and 26 were removed in the old file revision, and then starting with line 28 of the new revision, lines 28, 29, 30, and 31 were added. The lines may or may not match up depending on the number of changes in a prior hunk.

Multiple hunks can be defined per file diff, and more than one file diff may be present in a patch.

Enter the Samurai

Samurai Style Cop has a simple parser for git diff files using the above definition. It tracks the line numbers and file paths for every file mentioned in the diff, then builds a Checkstyle suppressions.xml file. This file can be used with an external or embedded Checkstyle executable. If embedded, it only passes the paths to Checkstyle to scan, significantly reducing the number of files to review.

Since it only inspects the diff and not the individual files, there are caveats.

The length of the file (the max num of lines) is just a static guess of 50,000. Users can tweak this line limit (--limit switch), but it’s a global constant used for every scanned file. Checkstyle doesn’t care if the file is shorter than the max line limit, but larger files would have lines beyond that limit scanned.

As an example, a scanned file is 127 lines long. The diff cues that changes are on lines 3 through 5.

<suppress checks=".*" files="Main.java" lines="1-2,6-50000"/>

Since it tracks the file path, the program could be extended to scan each file for length before producing the suppression markup.

Areas of Interest in the Code

  • Diff Scanner/Parser
    • Features a state machine for more accurate transitions and stop false-positive state parsing
    • Simple parsing using Regular Expressions
    • Not a true parsing library; it only implements enough of the git diff format to process the suppression list.
  • Embedded Checkstyle
    • Suppression list builder and setup for embedded Checkstyle.

Output


<?xml version="1.0"?><!DOCTYPE suppressions PUBLIC "-//Puppy Crawl//DTD Suppressions 1.0//EN" "http://www.puppycrawl.com/dtds/suppressions_1_0.dtd">
<suppressions>
<suppress checks=".*" files="src[\\/]main[\\/]java[\\/]com[\\/]epicmonstrosity[\\/]samuraistyle[\\/]Main.java" lines="1-4,7-9,12-15,18-27,41-43,46-58,66-68,71-50000"/>
<suppress checks=".*" files="src[\\/]main[\\/]java[\\/]com[\\/]epicmonstrosity[\\/]samuraistyle[\\/]cli[\\/]DiffOptions.java" lines="1-9,12,18-40,46,49-51,54-50000"/>
<suppress checks=".*" files="src[\\/]main[\\/]java[\\/]com[\\/]epicmonstrosity[\\/]samuraistyle[\\/]lint[\\/]CheckstyleSuppressor.java" lines="1-62,65-67,70-74,82-50000"/>
</suppressions>

Only the files listed in the suppression list will be checked when using embedded. In this case, Main.java, DiffOptions.java and CheckstyleSuppressor.java.


Starting audit...
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\Main.java:63:48: Parameter diffOptions should be final. [FinalParameters]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\cli\DiffOptions.java:10:5: Missing a Javadoc comment. [JavadocVariable]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\cli\DiffOptions.java:11:20: Variable 'exclusiveInput' must be private and have accessor methods. [VisibilityModifier]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\cli\DiffOptions.java:13:5: Missing a Javadoc comment. [JavadocVariable]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\cli\DiffOptions.java:14:23: Variable 'checkStyleOptions' must be private and have accessor methods. [VisibilityModifier]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\cli\DiffOptions.java:16:5: Missing a Javadoc comment. [JavadocVariable]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\cli\DiffOptions.java:17:10: Variable 'outputFile' must be private and have accessor methods. [VisibilityModifier]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\cli\DiffOptions.java:41:5: Missing a Javadoc comment. [MissingJavadocMethod]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\cli\DiffOptions.java:47:17: Variable 'stdIn' must be private and have accessor methods. [VisibilityModifier]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\cli\DiffOptions.java:52:9: Missing a Javadoc comment. [MissingJavadocMethod]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\lint\CheckstyleSuppressor.java:63:5: Missing a Javadoc comment. [MissingJavadocMethod]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\lint\CheckstyleSuppressor.java:68:5: Missing a Javadoc comment. [MissingJavadocMethod]
[ERROR] .\src\main\java\com\epicmonstrosity\samuraistyle\lint\CheckstyleSuppressor.java:75:5: Missing a Javadoc comment. [MissingJavadocMethod]
Audit done.

Integrate into your build process

Currently, client-side pre-commit git hooks are the only method I inspected. It puts the onus on the developer to set up and not deactivate, so this might not be ideal for all users.

I could imagine setting up Jenkins to run the Checkstyle process and failing the build or setting it as unstable would be trivial. I plan to set up a Jenkins container and test this out in the future. I will post an update when this occurs.

Part 2

Part 2 will use metrics obtained from git to create a list of low-risk files for en masse styling changes.