
The State of AI

Aug, 2025

This first part might only stay current for a short while.

Best Models for Use Cases

- Best model for internet search: Grok, although a plain Google search is still better (it is hard to beat 20 years of crawling and indexing by Google).
- Best model for extracting text from photos: Gemini Flash (probably because of 20 years of Google captchas annotating handwritten letters).
- Best model for troubleshooting problems with AI: Claude. Its thinking text is the least censored and shows the exact sequence of thought that led to its answer.
- Best model for conversations: ChatGPT. It keeps getting better at different types of conversations. More on this in a section below.
- Best model for code: Claude. Some might disagree, but the publicly available usage stats support this. More on usage stats below.
- Best model for thinking about input: Claude. Reading an image and thinking about it are different things, and Claude generally performs better at thinking about images, and in fact at thinking about everything.
- Best model for fast outputs: Gemini Flash. For some reason it is the only fast model that is still smart enough to be useful; most other fast models are dumb to the point of uselessness.

If you want to find out which models are currently better than others, looking at leaderboards or benchmark scores is not a good idea. Most scores and leaderboards are built on artificial tests that keep getting outdated and were not good to begin with. What you want to do instead is look at usage statistics. The model companies don't release these themselves, but there is one model router service that publishes statistics on which models people are using the most in their apps. It is updated daily, and you can even see the top models in a category like code or conversations. The website is OpenRouter.

You can also use the Chat section of the same website to do side-by-side model comparisons: select as many models as you want, give them a prompt, and it will return every model's response to the same prompt.

---

Vibe Coding

The best AI code editor is actually not Cursor. It's a free VS Code extension called Cline, where you pay per API call. The best way to use it is to give it unrestricted read access but limit write access to a permission-based flow.

Vibe coding is tricky if you're making changes without actually looking through them. You should be reading every line of code to make sure the logic the AI used is correct. You can delegate the syntax and the typing to AI, but you cannot delegate your thinking to it. You still need to think about the logic yourself.

If you're using AI to code, switch to an abundance mindset. Before AI, you wrote limited debugging code into your program because it took time to think about it, add it, and then maintain it. Now that AI can do it for you without affecting your program, you should be writing much more detailed debugging logs. The execution of every part of the program should be logged along with its details. This reduces the time it takes to find and fix a problem later on. It was not done before because the tradeoff of writing and maintaining detailed logging code was too high relative to the time it saved when facing bugs.
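As a rough illustration of what logging "every part of the program along with its details" can look like in practice, here is a minimal sketch using Python's standard logging module. The function, logger name, and discount rule are hypothetical placeholders, not anything from a real codebase.

```python
import logging

# Verbose-by-default logging: cheap for AI to write and maintain, valuable when a bug shows up.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s %(funcName)s: %(message)s",
)
log = logging.getLogger("billing")  # hypothetical module name


def apply_discount(order_total: float, coupon: str | None) -> float:
    # Log the inputs of every meaningful step...
    log.debug("called with order_total=%.2f coupon=%r", order_total, coupon)
    if not coupon:
        log.debug("no coupon supplied, returning total unchanged")
        return order_total
    discounted = round(order_total * 0.9, 2)  # placeholder rule: flat 10% off
    # ...and the outputs, so a bug report can be traced from the logs without reproducing it.
    log.debug("coupon %r applied: %.2f -> %.2f", coupon, order_total, discounted)
    return discounted


apply_discount(120.0, "WELCOME10")
```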
---

Non-Technical People

The best way for non-technical people to try building something with AI is lovable.dev.

In building a piece of software, there are a lot of parts that are a pain to set up but are not actually that important for people starting out. Setting up the code editor (which is like a text editor on your computer, but built specifically for writing and editing code) takes unnecessary effort; if you want to build a simple website, you should not need to go through it. Other examples include setting up a server or hosting for your website, which is what lets people on the internet access it. The specifics of these are also not important for someone starting out. What lovable.dev does is abstract all of this away so you can focus on two things: AI prompting and your product.

What do you need to know before you start using Lovable to build something? Nothing. Just tell it what you want and keep asking questions about the parts you don't understand. Having access to an expert who can answer every question you have is a very new privilege, and you should be using it to the extreme.

One small tip when you're building something with AI: as you keep working on the product, the AI might start getting stuck on some things. This happens because of the history of your prompts and the way the AI has already configured the code. If at some point you get really stuck and the AI just does not seem able to understand what you are asking, start a new project from scratch. From your previous attempt you will have learned a number of things about your own project, because when you start building something you only have a vague idea that is missing a lot of the specifics of your product. When you start the same project again from scratch, combine all of your new knowledge of those details into the first prompt, so that the AI has the entire context from the beginning and is unlikely to get stuck.

An alternative to this restart approach is to have AI write a detailed product requirement document outside of Lovable and then feed it to Lovable. But the problem with this approach for people new to building things is that you will not know which parts of the product requirement are unnecessary or which parts are missing, so I don't recommend it.

---

Hiring Engineers

With AI, the context that an engineer needs to manage in their head has become smaller. You can benefit from this by getting more built by your engineers, but that is tricky to implement. Pushing engineers to build more in less time has never worked. Anyone who has worked with story points knows that even when things start getting quoted fewer story points, they keep taking just as long to actually ship.

What can work instead is pushing your engineers to become product engineers: to understand not just what they're building but also why they're building it. This is only possible now. Without AI, engineers had to manage a lot of context in depth, with no leftover capacity in their heads for anything product related, and they were working at a layer of depth too far from the product layer, so they would have had to keep switching between the two, which is not practical. Engineers don't necessarily have to start thinking about product problems from scratch, but they should definitely focus on the reasons behind the product logic of the things they are working on.
This allows them to catch edge cases that sit in the overlap of engineering and product, which sometimes get missed, and to push to simplify or remove product functionality that is too complicated to build or maintain but does not actually solve that big of a product problem.

If you're hiring AI engineers, good luck. You're going to need help.

---

AI Projects

If you are working on an AI project, start with a quick and dirty proof of concept that works. A proof of concept is allowed to crash, but it is not allowed to be unreliable. Fix reliability first, then focus on engineering. You should not be spending time on system architecture and other normal engineering work before you have shipped a proof of concept that works for your use case, because the architecture will change based on what you discover during the proof of concept phase.

An interesting part of fixing the reliability of an AI project is that non-engineering members of the team are often much better at troubleshooting AI problems than engineers, because AI is like a word problem and engineers are not good at those. A smart PM will find it easier to fix an AI problem than a smart engineer, because the nature of AI problems is different from the nature of engineering problems; it is closer to product problems.

If a person runs into an AI problem and their first thought is to fine-tune a model, fire that person, or wait for them to make the AI perform worse than it already does and then fire them. Most AI problems have simpler solutions, and reaching straight for fine-tuning is some kind of pandemic. The purpose of fine-tuning is to overfit to a pattern. If you want a model that can tell whether something is hotdog or not hotdog, then you should use fine-tuning, and better yet, maybe contact Jian Yang.

---

AI Products

The most interesting property of AI is that it is improving unexpectedly fast. Because of this speed of improvement, most people can't keep up with it, and so most people in the world don't even know what it can do or which problems it can solve. You can make good money just by building a product that shows users how to do one very specific thing or solve one very specific problem. For example, there is a product that probably uses a publicly available API to let people convert PDF statements into Excel sheets reliably, irrespective of the format of the PDF, and it does about $40k MRR. These types of products are useful, but their long-term sustainability might be questionable. Then again, so is the long-term sustainability of every other product.

---

AI Cost Calculation

There are a lot of parts that go into calculating the cost AI would incur for a specific task, and having to calculate them yourself is unnecessary work: you have to convert your input into a token count, get a sample output and convert that into a token count too, and then multiply both by the price of tokens, which model providers usually quote as dollars per million tokens.

The short way to calculate cost for any model and any use case is to use the model router service OpenRouter with the same prompt you were going to use; it will return the response along with the cost of that message. You will need to top up your account with a minimum balance though.
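To make the manual version of that calculation concrete, here is a minimal sketch of the arithmetic. The token counts and per-million-token prices below are made-up placeholders; real prices vary by model and provider.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, given prices quoted in dollars per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical numbers: a 2,000-token prompt and a 500-token response,
# at $3 per million input tokens and $15 per million output tokens.
per_request = estimate_cost(2_000, 500, 3.00, 15.00)
print(f"${per_request:.4f} per request")          # $0.0135
print(f"${per_request * 10_000:.2f} per 10,000")  # $135.00
```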
---

This second part will stay current for a long while, I think.

Consistent Patterns

The problem with following the state of AI is that it never seems to be in a single state. The best model today is not the best model tomorrow, or even tonight, so if you try to stay up to date on all of it, you feel like you are not up to date on any of it. But the truth is, there are some patterns in AI that do not change, or at least not as frequently.

The oldest and most consistent pattern is the one the current age of AI is built on: the idea that intelligence is largely a matter of recalling things. That is what the underlying transformer architecture does. It recalls things. When the first GPT model was released by OpenAI with its sub-par intelligence, people theorised that it would not get more intelligent, because real intelligence requires more than just recall. But they missed that one of our theories of human intelligence is also that it is primarily memory based. It is the theory that I have personally found most useful in understanding human intelligence, and it has held up well for artificial intelligence too. There might be a problem of it being less efficient than human intelligence, but efficiency problems get solved over time. We can see models getting smarter while sticking to the same premise: first with more data, and then with better data. (One reason DeepSeek could be trained at a fraction of the cost of ChatGPT is that it was trained not on raw data but on the responses of ChatGPT itself, which is better data than raw data. But this also limits the performance of their models to that of ChatGPT.)

The second consistent pattern is the thinking, or reasoning, part of AI. Because AI is memory-based intelligence, if it can first break a problem down into more words, it can look at the problem in more ways than attacking it directly. This, too, is a function of memory-based intelligence. But it took longer for us to learn about this pattern because ChatGPT did not show its thinking process, and even when it started to, the thinking was censored: what you saw was not the actual thinking but an intentionally filtered version of it. That is why it read like business speak instead of actual thinking. DeepSeek fixed this, then Claude did as well, and more models have followed. But most model providers copied OpenAI's pattern of filtered thinking and are therefore not that helpful. Only Claude seems to show actual unfiltered thinking.

Why does it matter that we can look at thinking tokens? Because when you can look at the thinking, you can tell exactly why a response was wrong and what to change in your prompt or data so the problem does not happen again. For example, if you give a model a file with a list of numbers on separate lines and ask it to calculate the total, the thinking shows that it adds each number to the running sum of the previous lines as it goes. This approach leads to mistakes, because models can apparently miss lines or read them twice. It's like an attention-switching problem: when the model moves its attention to summing the previous lines, it forgets which line it was on. A very human-like problem. But if you ask it to first list the numbers from each line and then add them in a second step, it does not miss or double-count a line, and after listing them it adds them up to the correct sum.
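For what that two-step prompt can look like, here is an illustrative sketch; the numbers and the exact wording are made up, not a tested prompt.

```python
# Illustrative input: one number per line, as in the example above.
numbers_text = "\n".join(str(n) for n in [12, 7, 53, 8, 41])

# Naive prompt: the model keeps a running sum while it reads,
# which is where the missed or double-counted lines show up in its thinking.
naive_prompt = "Here is a file with one number per line. What is the total?\n\n" + numbers_text

# Two-step prompt: separate transcription from arithmetic.
two_step_prompt = (
    "Here is a file with one number per line.\n"
    "Step 1: Write out every number you see, one per line, with its line number.\n"
    "Step 2: Only after the list is complete, add the listed numbers and give the total.\n\n"
    + numbers_text
)

print(two_step_prompt)
```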
The best part about the understanding you gain from looking at Claude's thinking is that it is not actually limited to Claude. It applies just as well to other models, because what you are learning is how the underlying memory-based intelligence works. I often troubleshoot a problem with Claude but use a different model for the actual use case. And if I'm being honest, it is very close to how humans make mistakes as well, and to exactly how we change things to help humans make fewer mistakes (Predictably Irrational by Dan Ariely is a good book on this). This pattern of thinking text being helpful in troubleshooting AI is going to continue to hold, and it is going to keep helping us understand AI better, and maybe even understand humans better.

The third pattern is the compounding of specializations of AI models. A model that is better than other models at something specific keeps getting fed more data, because users keep using it more for that thing, so it keeps getting better. To clarify, this is not data as we normally call it: it is usage data, which is different from training data. For example, the Google search algorithm started with PageRank, where the ranking of a webpage was based on how many other websites linked to it. But after that starting point it started getting usage, and that usage data contained things that could never be collected without the product. The fact that a user opens a link but comes back to the search page five seconds later to open a different link means the first link did not actually answer their query, even if it had all the right keywords. This is basically collecting data points that are unknown but still significant. Google got really good at knowing the intent of the user better than the user himself.

Similar patterns hold for model providers. (There might be a question of whether they also hold for companies downstream of the model providers that use these models, but that part is less important. While a downstream company might improve somewhat from its users, the model provider improves from the users of all the companies, so the model provider comes out ahead. The exception is where the niche runs counter to generalization: a person learning to throw only darts will probably become a better dart thrower than a person learning to throw both darts and javelins.)

Right now, from what it seems, Google is ahead on the OCR front, probably because of its history of scanning handwritten books and its free annotation program (captcha). OpenAI seems to be ahead on the conversational front because more people talk to it about their lives (and it doubled down on this by focusing on a wider range of conversational use cases in its latest release). Anthropic seems to be ahead on the coding front. The reason for Anthropic being ahead is probably more of a PageRank-type algorithm. While I don't know the algorithm itself, I can comment on one ability that makes it better: it is exceptional at understanding overlapping logics in a piece of text, figuring out how to work with them, and deciding which logic matters more than the others. This ability to understand nested logic is also what I suspect makes it better at working with code, which is always a jumble of overlapping logics.
This overlapping-logic reason is why other models sometimes fail horribly when you tell them "do not do x": they still do x, and in fact the instruction may make them more likely to do x, because x is now mentioned. They miss the nested logic. This is also similar to how children interpret what you say. If you tell a two-year-old not to do x, their mind immediately focuses on x. If you want a two-year-old to actually not do x, talk to them about y instead. This works much better. (I have a two-year-old nephew that I sometimes have to manage, and his father says that if I ran a school it would look like the school at the end of 3 Idiots, the one run by Phunsukh Wangdu.)

The fourth pattern might be a slightly contrarian take, because people don't like companies making money and would rather support a world of free things and open-source models. But money is a sustainable way to make things better, and the more money a model makes, the better its makers can make it. (I say "can" because this is not automatic. Having the right height for basketball does not automatically make someone a great basketball player.) So the pattern is that open-source models lag behind commercially provided closed-source models. The reason this pattern holds should be obvious, because it is not new. The best innovations in the world have often been patented, and often they were only built because they could be patented. Otherwise, the number of people willing to spend ten years of their lives and money building something makes no sense if the payoff in case of success is not high enough. The same goes for copyrights on books. We feel like removing patents would make the world a better place by putting every innovation at everyone's disposal. Temporarily it might, but in the long term progress would halt because new innovations would stop happening. Extraordinary effort usually comes in the hope of extraordinary rewards, and when extraordinary rewards are removed by removing patents and copyrights, extraordinary effort stops. The same applies to model providers. The best models for specific use cases will continue to be closed source, and that is actually a win for users, because a model provider will optimize the cost of serving its models a lot more when doing so saves it money. And providers may keep releasing some parts of their models for free to drive adoption, like Facebook did with React Native.

The fifth pattern that will continue to hold is the limit of the context window, because it is inherent to memory-based intelligence. The bigger you make the text, the more difficult it becomes to decide which memory is relevant. I can explain this with a computer analogy. Computers have cache and RAM. Both are memories, but cache is faster than RAM and much smaller in capacity. Any kid learning about the two for the first time asks the naive question of why we can't just make the cache bigger if it is so much faster than RAM, and never learns the actual reason, because most explanations sit at a horrible level of abstraction. We can't, because what makes cache fast is not the hardware but the process of getting memories out of it. A better abstraction is the drawers in your home. The desk you are sitting at has one or two drawers where you just throw in the stuff you use on a routine basis.
You don't actually remember which of the two drawers you put a specific thing in, but that does not matter. When you're looking for something, you check both and find it quickly, on either the first attempt or the second. Cache works exactly like this. Now the problem: if you added 20 drawers to your desk and started throwing frequently used stuff randomly into one of them without remembering which, then every time you need something you will have to go through a lot of drawers to find it. The more drawers you have, the more time you will spend, on average, looking for the thing. There is a similar problem with the context windows of AI models. The bigger you make the context window, the more confusion there is for the model about where a memory is, or which of several similar memories you are talking about. So this pattern will also continue to hold.

The sixth pattern, related to the one above, is that memory is still unsolved and will stay unsolved for a while, because it is not part of the model and because it does not stay consistent across models. Cross-platform compatibility is usually a big challenge and does not automatically go away; iPhones and Androids still can't transfer data over a high-speed Bluetooth or AirDrop connection, you have to use the internet. Not only is memory not solved, the direction of its solution is not visible yet either. AI did not work until we got the transformer architecture; memory might not work until we get something that works architecturally and scales. Human memory scales relatively well by separating short-term memory from long-term memory. Most AI systems do not have this differentiation. The default short-term memory is the context window, but that is not necessarily a good system for memory, even if it is a good system for context management.

The seventh consistent pattern, also a somewhat contrarian take, is that AI responses are actually extremely consistent and deterministic. For the same inputs provided the same way, you will get the same outputs every time. The only cases where this varies are when you don't actually understand the inputs or how they affect each other, or when you don't know how to change the temperature setting of the model. The more deterministic an output you want, the more deterministic an input you should give. Take the same example of summing numbers across multiple lines. The type of mistake the AI makes when you ask it directly to sum the numbers stays the same; the problem is consistent. It only appears inconsistent when you look at the final sums, which are sometimes higher than the correct total and sometimes lower. But if you look at the model's thinking text, you can see that the problem is consistently one of missing lines or counting them twice. This lets you be deterministic about the things that matter while letting the AI use its discretion for the parts where being deterministic is not important.
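If you want to pin down the sampling side of that determinism, the temperature setting is the usual knob. Here is a minimal sketch against an OpenAI-compatible chat completions endpoint; the URL, model name, and API key variable are hypothetical placeholders, and even at temperature 0 providers generally promise greedy, most-likely-token decoding rather than bit-identical outputs.

```python
import os
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint

payload = {
    "model": "some-model-name",  # hypothetical model id
    "temperature": 0,            # greedy decoding: always pick the most likely next token
    "messages": [
        {
            "role": "user",
            "content": "First list the numbers, one per line, then add them: 12, 7, 53, 8, 41",
        }
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},  # key name is a placeholder
    json=payload,
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```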

P.S. If you need more details on a specific part, let me know and I can try to add it.

Thanks to Tamseel, Uzair and Hiba for reading drafts of this.

---

If you want to be notified of new articles, this is my Substack, where I send out unscheduled emails with new essays and things I am thinking about or reading.

If you are stuck on a problem with AI and want me to have a look, you can reach out on LinkedIn: this is my LinkedIn.