title

Last week, I attended the QCon AI 2026 conference in Boston. One theme kept coming back throughout the conference: what happens to engineering organizations when code generation becomes cheaper, but review, evaluation, infrastructure, and ownership remain expensive?

This question came up in different forms throughout the conference. Sometimes it appeared in discussions about code review. Sometimes it appeared in talks about agent evaluation, infrastructure, or the changing role of software engineers. But to me, many of the conversations pointed in the same direction: AI is not only making software development faster. It is also shifting the bottlenecks.

First of all, I want to say that I enjoyed the event a lot. The speakers were fantastic, the organization was great, and the attendees were highly engaged with the topic and came from a wide variety of organizations. I especially appreciated that the conference was concise overall: no fluff in the talks or themes, no pretensions of being posh, just down-to-earth and very professional.

The overall mood in the room

If I had to categorize all the discussions and presentations I participated in, I would split them into two groups: “how to build AI-powered products” and “how to make the work of your organization more efficient.”

Thoughts on specific topics were diverse, but I felt there was more or less a common consensus on the following:

  1. There were no radical skeptics of AI technologies. I suppose this is somewhat obvious, since it was an AI conference after all.

  2. Most people were concerned about the effect the AI boom could have on the financial market in the short term and on the economy in the long term. Nevertheless, we need to learn how to utilize this new technology for good and generate more value.

  3. People from organizations with aggressive AI adoption shared stories about how employees often feel burned out and stressed. My personal thought is that this will remain a trend as long as AI keeps getting better this rapidly. It is the uncertainty and the constant need to adapt while still “doing the business” that creates extra stress.

  4. There is still no good recipe for measuring the real return from AI. Most organizations measure what is easily accessible: token usage for adoption, diff counts for productivity. But such metrics are far from showing the real picture. Answering the question “Have we shipped more features?” is hard to do over a short period, does not have a good definition, and cannot be A/B tested easily.

Code review as the new bottleneck

“Human code review is the new bottleneck” was quite a hot topic at the conference.

A smaller cohort, represented by people from frontier labs or aggressive adopters like Roblox — for some products only — argued for removing humans from the review process. The main argument was that AI is doing a better job at finding bugs. And if a change introduced by an AI agent causes an incident, then tweaking the skills and tools is the action item that should be taken to avoid it next time, assuming the root cause was a bad code change and not a lack of guardrails or observability.

Many people I talked to agreed with the problem, but at the same time were somewhat skeptical about the solution. I tend to lean toward the same point of view for the following reasons:

  1. Even when I use top-tier publicly available models, I see agents forget certain recommendations or rules, which leads to mistakes. And that is understandable due to the nature of LLMs. We can make it better by adding guardrails, doing automated reviews, cleaning up the tech debt in our prompts, and so on. But that is a whole big problem, and fixing the prompt for your skills is not enough.

  2. I tend to assume that these companies and their products can take more risks. It is hard to imagine a big company whose product is not entertainment, and whose users have different expectations, taking such risks easily.

  3. Code reviews are not done only for bug detection and knowledge sharing. They are also a way to communicate and align with other stakeholders on decisions: abstractions, ownership, consistency. They are a way to get buy-in from another engineer about the decisions you are making.

There were examples shared where unreviewed agentic development, after several months of work, resulted in projects that were hard to maintain and had many inconsistencies. It has always been interesting to me to try this unreviewed approach on a real production project as an experiment and see what happens. Unfortunately, I have not had an opportunity to run such an experiment myself; the risk has always been too high. If you know of any real-world examples where teams have applied this approach to production features, I would love to hear about them.

Interestingly enough, I also heard different information about code review practices in some of these companies that are pushing for minimal or no human code review. They still review specs, plans, or core functionality. They do not review the code line by line or get picky about little details. This approach makes way more sense to me, and it addresses my third point well enough.

A few technical themes I took away

The conference had many great talks. Some of them were especially insightful to me.

Chat-assistant architecture

When I had an opportunity to work on a production chat assistant in fall 2025, the industry had not yet settled on the best architecture for such applications.

Some people tended to implement a workflow where each node had a small area of responsibility, which allowed developers to guide the application through the user request. Others leaned toward a hub-and-spoke model, in which a central node has more power and responsibility and reaches out to smaller nodes or tools to execute something deterministic or to save context.

After the conference, I got the feeling that the hub-and-spoke approach is becoming the default for this type of application.

Agent harness architecture

One of the sessions was by Vinoth Govindarajan (OpenAI) about agent harnesses like Codex or OpenClaw. Such agents are booming and becoming very popular. Many people, including me, use them every day.

I have not yet had an opportunity to work on a product like this, so learning about the internals first-hand was very insightful to me. A typical set of principles for agents looks like this:

  • A model proposes.
  • The harness commits.
  • The receipt proves it.

It was especially interesting to learn that most agents on the market more or less follow the same set of principles and architecture.

NL to SQL

While assistants and agents internally rely on having a main decision node with a powerful model, there are use cases where a workflow approach gives better and cheaper results.

One example of that was generating SQL queries from natural language, which was presented by Francesca Lazzeri (Microsoft). Frontier models and coding agents are pretty good at turning natural language into SQL, but they are still not as good as tailor-made assistants in a production environment. Frontier models get better every time and the gap is closing, but this gap is still significant.

Agent evaluation

Evaluating agentic and assistant applications is still an unsolved problem. The process is flaky and inaccurate.

An interesting and reasonable thing I learned from Zhou Yu’s (Arklex.ai) session is that we should find ways to turn evaluation into the evaluation of a deterministic action. Instead of using a blunt “LLM as a judge” approach to evaluate the agent’s response, we can assert whether the agent’s work resulted in an order refund, for example.

The more realistic the simulation we create for the application during evaluation, the better. One additional key idea is to use synthetic users modeled after real users. But even very good simulations do not allow us to beat the nondeterministic nature of the models, and even companies specializing in such evaluations assume test retries up to five times.

One of the main problems with running good evaluation frameworks for agentic applications is the cost of the run.

Closing predictions

The conference ended with a keynote by Meryem Arik (Doubleword), where she made predictions about what QCon 2030 could look like. A summary of all predictions can be found in this LinkedIn post.

What I found interesting is that these predictions were not just about AI making engineering faster. They were also about the second-order effects: what breaks, what becomes more expensive, and what new skills become important when code generation becomes much cheaper.

Massively parallel agents will create new infrastructure problems. To me, the most important part of this prediction is not only the parallelism itself, but the fact that agents can dramatically increase the amount of code we ship. When code generation becomes cheaper, the pressure moves to everything that happens after the code is written: pipelines, builds, tests, reviews, deployments, observability, and incident response.

I think this is a widely discussed topic, which I often hear about at conferences and in interviews. If we ship more code, we need more resources for running pipelines, we compile more, and we create more changes that need to be tested, reviewed, deployed, monitored, and sometimes fixed. I found this presentation by Adam Bender (Google) highly relevant to that.

Software engineering skills may shift further toward product thinking and project management. If agents make implementation cheaper, the relative value of deciding what to build, why to build it, and how to coordinate the work may increase. Project management will matter for managing teams of agents: a developer ships more code through agents, and those agents need to be managed. Product thinking will matter for negotiating and navigating the product among stakeholders.

Deep understanding of the technology and ownership of the system may become less central, while product judgment, coordination, and the ability to evaluate trade-offs may become more important. I often hear the opinion — I think from Gergely Orosz at The Pragmatic Engineer and his guests — that the area of responsibility for every role is expanding and starting to overlap with other roles. This is logical because people are still trying to figure out how to use AI. And this idea matches the prediction well.

Coding agents may start shaping infrastructure. This was one of the more interesting predictions to me because it goes beyond productivity. If agents become the main consumers of software, then software may need to be designed not only for humans, but also for agents that operate, inspect, modify, and integrate with it.

That could significantly change how we approach product building. We may end up building software to “please” an AI and not only a human being. In that world, good APIs, clear contracts, observability, deterministic behavior, and machine-readable structure may become even more important than they are today.

Summary

To summarize, it was a great conference. I learned a lot, met interesting people, and came away with a better understanding of where the industry is today and where it may be heading over the next few years.

What I appreciated most was that the discussions were grounded in real-world experience. There was plenty of excitement about AI, but also a healthy amount of skepticism, practical lessons, and honest conversations about the challenges that come with adopting these technologies at scale.

For me, the main takeaway is that AI is not necessarily making software engineering easier. It makes writing the same amount of code cheaper and faster, but it also raises expectations. Engineers may be expected to ship more, review more, evaluate more, operate more, and take ownership of results produced with tools that are still nondeterministic.

And thank you, Boston, for the nice weather during the conference and the great views.

boston