How I’ve Been Creating My Own Pull Request Copilot
I know what you’re thinking: “Oh, fantastic, another boring article promoting LLMs.” But this won’t be another article claiming Copilot is a game-changer and that everyone needs to jump on the hype train.
To be honest, the original title of this article was “How I Failed to Build My Own Pull Request Copilot”. However, while writing, I realized that it wasn’t a complete failure, and I still haven’t given up on the idea of having my own customized Copilot.
GitHub Pull Request Copilot
Maybe you’re aware that GitHub conducted an experiment to create a GitHub Pull Request Copilot. The experiment ended last year, and according to this summary page, only a few ideas that originated from the experiment were added to GitHub Copilot or Copilot Workspace.
If you read the article carefully, you’ll notice that the only features added to the existing product were writing pull request summaries and composing poems.
Another experiment was generating tests for pull requests. This seemed promising until you discover that the GitHub team didn’t include test generation in GitHub Copilot; it was only open-sourced. It seems the GitHub team wasn’t satisfied enough with the results to integrate this test-generating capability into their paid services.
If you ask me, the experiment outcome seems a bit unpromising.
However, I was inspired and began considering creating a simple, customized Pull Request Copilot tailored to my needs.
Why Do Tools Matter?
Before we get to the Copilot, let’s discuss tooling. During my software career, I’ve realized I’m a big fan of checklists and tooling. I’ve often heard that tools are not the most important thing, and I agree that many of them might be just a waste of time or bike-shedding. But I can’t help myself. I love even a tool that saves me only a few keystrokes.
That means I’m a big fan of ReSharper, various linters, and custom CLI programs. I even have a Word document, which I use to store my checklist for reviewing PRs. So, having a tailored Pull Request Copilot is a natural step for me.
Can an AI Copilot replace a human reviewer?
I don’t think so. Pull request review practice is not only about improving code and catching bugs. Another very important aspect of the PR process is sharing knowledge, learning from others, and teaching others. To have a good code review is to have a good conversation.
A Pull Request Copilot would be just another tool. It can’t replace your team members.
On the other hand, Copilot can serve as another (a bit weird) team member. When learning new C# syntax, I rely on R# suggestions and compiler warnings. I read numerous posts about new C# features, but I learn the most when I receive immediate feedback from ReSharper telling me that the code I just wrote should be updated to idiomatic C#.
From this perspective, Pull Request Copilot can provide that type of additional feedback. It may resemble a junior developer specializing in a specific area, with a low ego but high confidence.
The first version
My initial attempt to build a Copilot was really simple: I wrote a small NextJS application that calls my C# API.
The NextJS app sent the number of the pull request to review. The C# API fetched each changed file and processed its content separately with the Llama3 model, using the prompt Improve the code {code}. The system message was simple: You are a senior C# developer. The result from the LLM was displayed as raw text.
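For illustration, here’s a rough sketch of what that first version’s call looked like. It assumes Llama3 is served locally through Ollama’s /api/generate endpoint and skips error handling; it’s a simplified sketch, not the exact code.

```csharp
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

// Simplified sketch of the first version: send a changed file's content to a
// locally hosted Llama3 model (Ollama's default endpoint) and return the raw reply.
public class NaiveReviewer(HttpClient http)
{
    public async Task<string> ReviewAsync(string code)
    {
        var response = await http.PostAsJsonAsync("http://localhost:11434/api/generate", new
        {
            model = "llama3",
            system = "You are a senior C# developer",
            prompt = $"Improve the code {code}",
            stream = false
        });
        response.EnsureSuccessStatusCode();

        // Ollama returns the generated text in the "response" property.
        using var json = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return json.RootElement.GetProperty("response").GetString() ?? string.Empty;
    }
}
```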
The result wasn’t good. Since I was sending only a piece of changed code, sometimes just one line, the LLM often generated random code. I should have expected that. The LLM wasn’t reviewing; it was writing new code from the chunk I provided. It wasn’t useful at all.
After several attempts at modifying the prompt and the system message, I came to the conclusion that I needed to change my approach. The problem was that the LLM didn’t have the context, and the code snippet alone wasn’t sufficient for a helpful suggestion.
Reduce the scope
I believe my goal was too ambitious, so I decided to reduce the scope and see how the LLM would behave when reviewing only methods.
Using simple code analysis, I could extract recently updated or added methods from the pull request and analyze them separately (a sketch of this extraction step follows after the prompts below). The LLM was still a bit chatty, so I changed the system prompt from:
You are a senior C# developer
to
You are a senior C# developer reviewing pull requests.
Be short and concise. Looking for possible bugs and improvements
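As for the extraction step, here’s a minimal sketch of how it can be done with Roslyn (the Microsoft.CodeAnalysis.CSharp package). The changedLines set, standing in for the line numbers touched by the PR diff, is a placeholder:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class MethodExtractor
{
    // Sketch: return the methods whose lines were touched by the pull request,
    // so each one can be sent to the LLM separately. `changedLines` holds the
    // zero-based line numbers taken from the PR diff.
    public static IEnumerable<string> ExtractChangedMethods(string fileContent, ISet<int> changedLines)
    {
        var root = CSharpSyntaxTree.ParseText(fileContent).GetRoot();

        foreach (var method in root.DescendantNodes().OfType<MethodDeclarationSyntax>())
        {
            var span = method.GetLocation().GetLineSpan();
            int firstLine = span.StartLinePosition.Line;
            int lastLine = span.EndLinePosition.Line;

            // Keep the method if any of its lines appear in the diff.
            if (changedLines.Any(line => line >= firstLine && line <= lastLine))
                yield return method.ToFullString();
        }
    }
}
```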
I also wanted the developer to get a more detailed explanation of what was changed and why, so I changed my prompt to:
Improve the code.
This is my entire code:
{file}
this is the method I'm changing:
{method}
please focus on this change:
{change}
Add explanations of what you improved. Use the latest version of C#.
Use the following coding style:
- collection expressions
- pattern matching and other new features
Return only the modified method, not the whole class.
These two small changes did the trick and helped a lot.
I also changed the model from Llama3 to GPT-4o, which seems to give better answers.
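Put together, the second version’s request looks roughly like this. It goes straight to OpenAI’s chat completions endpoint over HTTP; the class and parameter names are just for illustration, not the exact code:

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

// Sketch of the second version's call: the system message plus the assembled
// prompt (file + method + change), sent to GPT-4o. Retries and error handling
// are left out for brevity.
public class Gpt4oReviewer(HttpClient http, string apiKey)
{
    public async Task<string> ReviewAsync(string systemMessage, string prompt)
    {
        using var request = new HttpRequestMessage(HttpMethod.Post, "https://api.openai.com/v1/chat/completions")
        {
            Content = JsonContent.Create(new
            {
                model = "gpt-4o",
                messages = new object[]
                {
                    new { role = "system", content = systemMessage },
                    new { role = "user", content = prompt }
                }
            })
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

        var response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();

        // The reviewed method and the explanation come back as plain text.
        using var json = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return json.RootElement
            .GetProperty("choices")[0]
            .GetProperty("message")
            .GetProperty("content")
            .GetString() ?? string.Empty;
    }
}
```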
Improve developer experience
Those were the changes I made on the LLM’s side. The second part was improving the developer experience. My original idea was to show just the reply from the LLM, but it quickly turned out that wasn’t enough.
A pull request review allows every developer to compare the difference between two versions of the code. Therefore, adding a diff viewer made perfect sense.
Instead of one, I added two GitHub-like diff views:
- Diff view of the old method and the new method
- Diff view of the new method and the improved method provided by LLM
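On the server side, producing the data for such a view can be as simple as running a text diff between the two versions. A sketch, assuming the DiffPlex NuGet package (the rendering itself happens in the NextJS app):

```csharp
using System.Collections.Generic;
using System.Linq;
using DiffPlex;
using DiffPlex.DiffBuilder;
using DiffPlex.DiffBuilder.Model;

public static class MethodDiff
{
    // Sketch: compute a line-by-line diff between two versions of a method,
    // e.g. the new method vs. the LLM-improved one, and tag each line with a marker.
    public static IReadOnlyList<(string Marker, string Text)> Build(string oldMethod, string newMethod)
    {
        var builder = new InlineDiffBuilder(new Differ());
        DiffPaneModel diff = builder.BuildDiffModel(oldMethod, newMethod);

        return diff.Lines
            .Select(line => (line.Type switch
            {
                ChangeType.Inserted => "+",
                ChangeType.Deleted => "-",
                _ => " "
            }, line.Text))
            .ToList();
    }
}
```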
That’s my current version, which I’m still testing. The following sections highlight what I’ve learned so far on this brief journey.
Is building Pull Request Copilot really a good idea?
The real problem here is the name. Pull Request Review Copilot might be setting up some expectations it can’t quite meet. The word review is the troublemaker here.
The latest generative AI models are great at generating. Writing code taps into your creativity and focus, while reviewing needs an analytical approach, critical thinking, and solid memory recall. And that recall isn’t just about knowing C# and standard patterns learned from open-source code on GitHub. This is the central issue for every LLM: missing context.
Missing context
After the initial enthusiasm of starting the project wore off, I realized that the amount of knowledge I use when reviewing PRs is immense.
For example, you can write a single line of code that is acceptable to SonarQube, R#, the C# compiler, and the LLM. However, during the PR review, it might turn out that the line belongs in a completely different microservice. The LLM is not aware of your architectural context. Furthermore, there are also issues more closely related to the code itself.
There are several reasons for leaving a comment on your colleague’s PR. A few of them are:
- Code quality
- Bug Prevention
- Consistency
- Performance
Many of these rules may be informal guidelines shared within a team, or even a casual agreement made during a meeting. For instance, a team might decide not to use inheritance in their codebase, or agree on a convention that deviates from the language’s standard coding style. Some teams do not prefix private fields with an underscore, but LLMs add it by default since it is the standard C# convention.
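A trivial, purely illustrative example of such a clash (the repository type is made up):

```csharp
// Purely illustrative; IOrderRepository stands in for any dependency.
public interface IOrderRepository { }

public class OrderService
{
    // What an LLM writes by default, following the standard C# convention:
    private IOrderRepository _orderRepository;

    // What a team that skips the underscore prefix expects instead:
    private IOrderRepository orderRepository;
}
```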
During a PR review, your brain holds these types of information:
- This part of the code is legacy and uses different approaches for testing.
- We have decided that we no longer want to use this library.
- This part of the code is written poorly but for a good reason (a.k.a. micro-optimization)
And many more. All of these pieces of information are effectively hidden from the LLM.
In all of these instances, the team would need to formalize the rules in a document that the LLM could access and use. That’s one reason why I don’t believe in generic third-party pull request copilots.
However, some things are much simpler, yet the LLM still can’t handle them. Take this suggestion:
Code:
int[] numbers = [1, 2, 3, 4, 5];
var greaterThanTen = Array.Exists(numbers, num => num > 10);
Suggestion:
Used Any Instead of Array.Exists: Updated the code to use LINQ's Any method instead of Array.Exists for better readability and consistency.
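For reference, the suggested rewrite would look roughly like this:

```csharp
// The rewrite the LLM proposed (requires `using System.Linq;`):
var greaterThanTen = numbers.Any(num => num > 10);
```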
Unfortunately, this suggestion goes against the SonarQube rule that prefers Array.Exists over LINQ’s Any. You can customize the LLM’s responses to address almost all of these issues, but it comes at a cost.
Price
Instead of a short prompt like “Improve the code quality of this method”, you can provide a larger context, like:
Improve the code quality of this method
The rules that should be applied are:
- The method should be less than 20 lines of code.
- The method should use Array.Exists instead of LINQ Any method
- etc.
However, you need to be aware that a larger message means more tokens, which may result in higher costs. There are two ways to receive a very high bill every month:
- Use as much context as possible.
- Use less context, but call the LLM multiple times with different contexts. This is suitable when the full context doesn’t fit into the context window.
Let’s do some math:
For the OpenAI API (model GPT-4o), one million input tokens cost $5, and one million output tokens cost $15.
- A larger class of ~150 lines of code consists of ~1200 tokens.
- One larger method consists of approximately 200–300 tokens, and a smaller context containing the rules you want to impose on the code might add another 1000 tokens, or more if you use a large context with many rules.
That is about 2,500 tokens for one request. This means that each reviewed method costs roughly a cent just for the input data.
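The back-of-the-envelope arithmetic, in case you want to plug in your own numbers:

```csharp
// Rough input-only cost per reviewed method, using the GPT-4o pricing quoted
// above ($5 per one million input tokens). The token counts are estimates.
const decimal pricePerMillionInputTokens = 5.00m;

int fileTokens = 1200;    // the surrounding class
int methodTokens = 300;   // the method under review
int rulesTokens = 1000;   // team rules / coding-style context

int inputTokens = fileTokens + methodTokens + rulesTokens;                  // 2500 tokens
decimal inputCost = inputTokens * pricePerMillionInputTokens / 1_000_000m;  // 0.0125 USD

Console.WriteLine($"~${inputCost} per method, input tokens only");
```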
Is it too much? I don’t think so. Nevertheless, the price could go up if you provide a very detailed context with all the rules. For example, SonarQube has over 400 rules for the C# language.
Note: I tried asking GPT-4o to generate code that complies with all SonarQube rules, but without success. It ignored the mention of SonarQube.
My biggest concern is that I might end up with a tool that replicates the functionality of R# or SonarQube but at a much higher cost. Standard linters, which detect code issues with deterministic static-analysis rules, are way cheaper.
Alternatives
This blog post wouldn’t be complete without mentioning existing commercial services. We touched on Copilot Workspace from GitHub, but it’s still in preview. Fortunately, there are existing commercial solutions like CodeRabbit.ai, which works within GitHub’s UI and provides a bot that makes suggestions and summaries. Additionally, there are other options like What The Diff and DeepCode from Snyk.
I haven’t tried these services myself because of concerns about protecting intellectual property. However, if you maintain open source projects, they are definitely worth trying.
Do you need your own Copilot?
Well, I’m not sure yet. Building the Copilot was fun, and I learned a lot, mainly about the limitations of current LLMs. There are several potential improvements that I haven’t had the chance to test yet. Some of them involve using Retrieval Augmented Generation, adding more preprocessing, or using tailored prompts for each type of change: for example, a different prompt for a one-liner change than for adding a new class.
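To give an idea of the Retrieval Augmented Generation direction, here’s a rough sketch: the team’s informal rules are embedded once, and only the few rules most relevant to the reviewed method are appended to the prompt. The EmbedAsync delegate is a placeholder for whatever embedding API you use; none of this is a finished implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Placeholder for any embedding API (OpenAI embeddings, a local model, etc.).
public delegate Task<float[]> EmbedAsync(string text);

public static class RuleRetriever
{
    // Sketch: score every team rule against the reviewed method and keep the
    // top few, so the prompt carries only the relevant rules instead of the
    // whole rulebook. In practice the rule embeddings would be cached.
    public static async Task<IReadOnlyList<string>> TopRulesAsync(
        EmbedAsync embed, IReadOnlyList<string> teamRules, string methodCode, int take = 5)
    {
        var methodVector = await embed(methodCode);

        var scored = new List<(string Rule, double Score)>();
        foreach (var rule in teamRules)
            scored.Add((rule, CosineSimilarity(methodVector, await embed(rule))));

        return scored.OrderByDescending(x => x.Score)
                     .Take(take)
                     .Select(x => x.Rule)
                     .ToList();
    }

    private static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```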
However, as I mentioned, I need to be mindful of the price and not create another SonarQube or R#. It might be more trouble than it’s worth, but only time will tell.