Here at SingleStone, clients partner with us to solve all kinds of problems, and many of those solutions involve building or improving their software. There’s great diversity in the problems we tackle with our clients, so we’re particular about who we hire — we look for attitude, aptitude, and skills, in that order. This helps us to foster a culture of exploration, not just execution. We’re always looking for new ways to deliver value more efficiently, so we’ve been exploring whether generative AI tools are ready to become part of our engineering teams.
You may be thinking: hang on there! Are we talking about AI replacing humans on a team, or just AI as another tool that teams can use to produce higher-quality work?
Well, we’re not sure. Perhaps AI will only replace “tasks, not jobs.” That’s an interesting idea, but it seems challenged when GPT-4 can go from a sketch to a working website largely on its own. What’s still missing before it can do the whole job by itself?
To answer this question, we’ve been investigating two publicly available generative AI tools: GitHub Copilot (based on a GPT-3 derivative) and ChatGPT (using the pre-release GPT-4 model).
First, we’ll walk through the capabilities that we’ve found impressive and useful for some tasks, then we’ll discuss what we think the limitations are and why we don’t think it’s coming for anyone’s job just yet.
Copilot is “the world’s first at-scale generative AI development tool made with OpenAI’s Codex model, a descendant of GPT-3.” Unlike ChatGPT’s chat-style interface, Copilot integrates its capabilities directly into an IDE, such as Visual Studio Code. This integration presents itself essentially as an autocomplete, which is great for usability.
Copilot is trained on a wealth of source code from GitHub and has ingested context from many languages, frameworks, libraries, and patterns. Copilot is aware of the files and code that you have loaded in your IDE and can incorporate what it finds there into the suggestions it makes through the auto-complete system.
Auto-complete...turned up to 11
For example, if your codebase makes liberal use of logging statements throughout your functions, then Copilot will proactively offer to add logger calls to new functions you’re authoring, in locations that often make sense in context. It will also parameterize those calls using patterns it has identified elsewhere in your codebase, filling in the variables from the function you’re currently writing. Neat.
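To make that concrete, here’s a hypothetical sketch (the function names and the toy logger are ours, not real Copilot output) of the kind of completion we’re describing. Given an existing function that logs its arguments, Copilot tends to suggest a parallel logger call in the next function you write:

```typescript
// A tiny in-memory logger standing in for whatever logging
// library a real codebase would use.
const logger = {
  messages: [] as string[],
  info(msg: string) {
    this.messages.push(msg);
  },
};

// An existing function whose logging pattern Copilot picks up on.
function getUser(userId: string): string {
  logger.info(`getUser called with userId=${userId}`);
  return `user:${userId}`;
}

// When you begin typing this function, Copilot will typically
// suggest the logger line below, mirroring the pattern above but
// parameterized with this function's own variables.
function getOrder(orderId: string): string {
  logger.info(`getOrder called with orderId=${orderId}`);
  return `order:${orderId}`;
}
```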
There are many more examples of clever ways that Copilot can accelerate development. While handy, these capabilities can be hit or miss, sometimes producing nonsensical or non-functional suggestions. Moreover, when it’s not working well, it’s not always obvious why.
Copilot will probably speed up your typing, but not by a lot. Internally, we’ve described our time with Copilot as a “minor convenience” — which we do in fact consider to be a win for Copilot. It’s just not as transformational in its current form as we had hoped. We’re excited about the release of Copilot X, though.
Unlike Copilot, ChatGPT is trained in a more generalized fashion, without a specific focus on code and software development. Rather than integrating into your IDE like Copilot, ChatGPT offers a robust chat-style interface through which you, the operator, converse with the model.
In our testing, we used the pre-release version of the GPT-4 model which, at the time of this writing, is only available through OpenAI’s for-pay ChatGPT Plus subscription.
GPT-4 is significantly more complex than GPT-3 under the hood and is trained on a much larger and more recent data set. Systematic comparisons have suggested that it “vastly surpasses” the capabilities of earlier language models, including the intermediate model, GPT-3.5, which is available in the free version of ChatGPT. GPT-4 also has a relatively large amount of working memory available to it, which means it can take longer prompts and return longer answers than its predecessors.
To illustrate some of the impressive characteristics of GPT-4, we described to the model something that we wanted it to do in plain English:
I need a REST service that allows a user to query the current weather for a zip code that they provide. The service needs to be written in TypeScript and should call the OpenWeather API to get the necessary weather data.
GPT-4’s answer to this prompt was thorough, so we won’t include it all here, but these are the highlights:
- It described the tools and packages one would need to install locally to build and run a TypeScript-based NodeJS application
- It provided terminal commands for installing the packages it would be using
- It generated code snippets for all necessary portions of the solution, an express.js-based web API service, along with environment configuration files that accounted for application secrets
- The generated code included a reasonable level of error handling, rudimentary response caching, and rudimentary rate limiting
- Between the code snippets, the model described in plain English what each snippet was for
Code that runs
The user does need to follow the model’s instructions: copying its code blocks into the file and folder structure it described and executing the run command it provided. Strikingly, the code it produced ran and performed the correct action.
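For a sense of what those pieces look like, here is a condensed, hypothetical sketch in the same spirit: not GPT-4’s verbatim output, but our own minimal versions of the zip-code validation, response caching, and rate limiting the generated service contained. The express routing and the actual OpenWeather call are omitted, and the URL builder assumes the API key arrives via configuration:

```typescript
// Basic input validation: US zip codes are five digits.
function isValidZip(zip: string): boolean {
  return /^\d{5}$/.test(zip);
}

// The OpenWeather current-weather endpoint, queried by zip code.
// In the real service the key would come from an environment variable.
function buildWeatherUrl(zip: string, apiKey: string): string {
  return `https://api.openweathermap.org/data/2.5/weather?zip=${zip},us&appid=${apiKey}`;
}

// Rudimentary response cache with a time-to-live.
class TtlCache<V> {
  private store = new Map<string, { value: V; expires: number }>();
  constructor(private ttlMs: number) {}
  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expires < Date.now()) return undefined;
    return entry.value;
  }
  set(key: string, value: V): void {
    this.store.set(key, { value, expires: Date.now() + this.ttlMs });
  }
}

// Rudimentary fixed-window rate limiter: allow N requests per window.
class RateLimiter {
  private count = 0;
  private windowStart = Date.now();
  constructor(private limit: number, private windowMs: number) {}
  allow(): boolean {
    const now = Date.now();
    if (now - this.windowStart >= this.windowMs) {
      this.windowStart = now;
      this.count = 0;
    }
    return ++this.count <= this.limit;
  }
}
```

In GPT-4’s version these concerns were wired into express middleware and route handlers; the point here is only that each concern it produced is a small, recognizable pattern.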
This example is relatively simple, but software engineers are accustomed to breaking down complex problems into simpler ones. Therefore, we think that ChatGPT and GPT-4 could be really effective accelerators in this space.
These models are not infallible, however. They exhibit what might feel like superhuman comprehension, but they can’t read minds, they don’t know everything, and they don’t ask questions. They guess and then just continue, and sometimes they’ll guess very, very wrong in rather subtle ways. If we were to resubmit the exact same prompt we used above, there’s a fairly high probability the output would need additional tweaks before it would run. It would probably be close, but you still largely need to know how to do the things you’re asking the model to do so that you can test and debug effectively.
Not like a human programmer
We think both of these tools are great, and we’ve incorporated them into our work because they make us more efficient. Copilot—aptly named—perhaps as much as doubles the speed at which a skilled developer could write certain code. GPT-4 writes code itself, easily 1–2 orders of magnitude faster than a human could.
So why are we not ready to say that GPT-4 can take the place of a human programmer? We think that it comes down to the difficulty of communicating implicit knowledge.
As consultants, we come into new situations and gather information about the client’s business and culture. Some of this information is written down, but a lot comes from snippets of conversations spread out across many interviews, workshops, and meetings. The entire team participates in these meetings, and each member ends up understanding the work in a slightly different way. We compare notes continuously as we go. Although we document that knowledge in several forms, it is always incomplete and evolving. The entire team gains an understanding of the project that is more than the sum of what everyone knows.
A second type of implicit knowledge is embedded in our longer-term relationships with each other. Though we rearrange teams from project to project, many of us have worked together before. We also share a company-wide culture that informs how we approach projects. When someone agrees to take on a piece of work, they know what their teammates expect, and their teammates can count on them to do a lot of small things without being asked. Arguably GPT-4 has the same kind of implicit understanding, but it is not anchored by the team’s shared experience and is, therefore, less predictable.
GPT-4, at least for now, has no long-term memory. It doesn’t process spoken conversations. It can’t piece together new facts from many interactions. Although it can create remarkably complete code from minimal instructions, in our experience the cost of writing code isn’t the limiting factor in software development projects.
Tools, not teammates
For now, there is no way to get all that implicit knowledge into GPT-4. Sure, you could feed it a bunch of text as part of the prompt, but which parts of which conversations are important? Who is going to write it down? As amazing as the chat interface is at making these models more useful, and it is amazing, it still represents a tremendous information bottleneck. The model can’t learn new information the way human teams do, and that fact, we think, relegates these models to the status of tools. Not teammates. They can speed up tasks, but they can’t yet do the job of a person.