[{"content":"For years the architecture visualization industry has been stuck with a strange limit. Studios can produce a still render so convincing you cannot tell it from a photograph, but the moment a client wants to move through that space, the cost jumps by an order of magnitude. Animation means cameras, keyframes, render farms, and days of compute per shot. So most projects ship as a handful of gorgeous frozen images, and the sense of actually being there never makes it to the viewer.\nTwo things changed that recently. Image-to-video models can now invent believable motion from a couple of stills. And coding assistants like Claude can turn that footage into a polished, interactive site in an afternoon. I wanted to see how far that combination goes, so I built a small experiment. This is how it went.\nSee it live: d1v38cpm8emdm5.cloudfront.net. Open it, press begin, and scroll.\nThe raw material\nA friend runs a studio that does 3D architectural visualization, the kind of work where you model a building that may not be built yet and light it until it looks real. He had four finished renders of an apartment: the towers from outside at dusk, the lobby, the living and dining area, and the kitchen. Beautiful images, all completely static.\nThat is the typical handoff in this field. The architect already pictured the space as something you walk through, but the client only ever receives the one angle that looked best in the deck. Everything between those frames lives in the architect\u0026rsquo;s head.\nInventing the motion\nThis is where image-to-video models come in, and they are genuinely good now.\nI used the Seedance 2.0 image-to-video model on fal.ai. The feature that made this whole project possible is that you can give it two images, a start frame and an end frame, and it generates the camera move that connects them. 
So instead of describing a shot in words and hoping, you hand it exactly where the camera begins and exactly where it lands, and it fills in everything in between.\nI ran it three times:\nExterior render as the start, lobby render as the end.\nLobby as the start, living area as the end.\nLiving area as the start, kitchen as the end.\nA few seconds of footage each. The model handled the hard part, which is plausible parallax and perspective as the camera glides forward through a doorway or across a room.\nThe key insight is the chaining. Because each clip ends on one render and the next clip begins on that same render, the three clips line up into one unbroken path. The end frame of one is the start frame of the next. Stitched together, three separate generations become a single continuous walk from the street to the kitchen.\nWhat it actually costs\nThis is the part that surprises people, because the number is small.\nGenerating video runs roughly $1.50 for a 5-second clip at 1080p. The cost scales with the number of transitions you need, not with the size of the building, since each transition is one clip between two renders. For this apartment I had three transitions, so the entire walk cost in the neighborhood of $4.50 in generation.\nThe math is refreshingly linear:\nSpaces in the walk | Transitions | Approx. generation cost\n4 rooms | 3 | ~$4.50\n6 rooms | 5 | ~$7.50\n10 rooms | 9 | ~$13.50\nEverything downstream is effectively free. The starting renders already exist as part of the studio\u0026rsquo;s normal work. The website is static files, so hosting is pennies a month or nothing at all on a free tier. And the code was written in a single session with an AI assistant rather than billed as developer time.\nCompare that to a traditional rendered walkthrough animation, which is typically quoted in the thousands and measured in days of render time. 
The gap is not incremental; it is a different category of spending.\nTurning footage into an experience with Claude\nHaving the video was half the problem. The other half was building something worth showing it in, and I did not want a plain embedded player with a play button. I wanted it to feel like you are the one moving.\nI described the idea to Claude and it wrote the entire front end. No framework, no build tooling, just HTML, CSS, and JavaScript that runs anywhere. The interaction it implemented is scroll-linked playback: the video does not play on its own, your scroll position is the playhead. Scroll down and you walk forward through the apartment. Scroll up and you walk back. You set the pace, so you can pause in a doorway or move quickly to the next room.\nIt is the same mechanic behind those premium product pages where scrolling rotates a phone or assembles a watch. Claude applied it to a building.\nIt also handled the details that separate a demo from something presentable:\nA title card and a single clear call to action to start the walk.\nA live room label and a progress rail down the side, so you always know where you are and can jump straight to any room.\nMotion smoothing, so even a jerky scroll wheel resolves into a slow cinematic drift.\nA soft vignette and a faint film grain laid over everything, which is what makes rendered frames read as shot rather than modeled.\nA couple of technical decisions were worth the trouble. The video is re-encoded so that every frame is independently seekable, which is the difference between smooth scrubbing and a stuttering mess when you drag through it. And scroll-linked video needs a host that serves byte ranges, which is basically every static host out there, so deployment is just uploading a folder. 
No server, no backend, effectively free to run.\nIt also reflows down to a phone, where the walk becomes a full-screen vertical experience you drive with your thumb.\nWhy this matters for the industry\nStep back from the apartment and the pattern is the interesting part.\nThe expensive, slow step in architectural visualization has always been motion. Image-to-video collapses that. A studio that already produces strong stills, which is their entire craft, can now generate the connective movement between them without a render farm or an animation pipeline. And the part that used to require a web developer, wrapping that footage in something that feels considered and premium, can be handled by describing it to an AI coding assistant.\nWhat else you can build with this\nThe walkable apartment is just one shape. The underlying recipe is broader: take a set of strong stills, generate the motion between them, and bind that motion to an interaction. Once you see it that way, a lot of use cases open up.\nIn and around architecture and real estate:\nFinished residential or commercial projects presented as a walk instead of a slideshow.\nOff-plan developments where buyers move through a unit that has not been built yet.\nDay-to-night or summer-to-winter transitions of the same space, scrubbed with a slider.\nBefore-and-after renovation reveals, scrolling from the existing room to the proposed design.\nMaster plans and infrastructure, where the walk becomes a flythrough over a site or down a street.\nInterior design and staging options, scrolling between furniture or material schemes in the same room.\nBeyond buildings:\nProduct and industrial design, turning a few render angles into a 360-style spin the visitor controls.\nAutomotive, gliding around the exterior and into the cabin.\nFashion and retail, a lookbook where scrolling walks the model or rotates the garment.\nTravel and hospitality, a hotel or venue tour that moves room to room.\n
Museums, galleries, and events, a guided path through a space at the viewer\u0026rsquo;s pace.\nEducation and storytelling, scroll-driven explainers where each scene dissolves into the next.\nThe common thread is that none of these previously justified a full motion-graphics budget. Now the motion is a few dollars and the interface is a conversation, so the experiences that were too expensive to bother with become routine.\nHonest about the seams\nThis is an experiment, and it has rough edges. If you look closely, the generated motion is not flawless, and the joins between clips are good rather than invisible. It is, underneath, three short AI-generated clips and a few hundred lines of JavaScript pretending to be a building.\nBut it proves the point. Two capabilities that did not exist in usable form a couple of years ago, image-to-video generation and AI-assisted coding, now stack on top of each other cleanly. Together they take a handful of static architectural renders, for a few dollars and an afternoon, and turn them into something a client can open in any browser and feel like they are walking through. No game engine, no app, no specialist pipeline. Just a link.\nFor a field whose whole job is helping people experience a space before it is real, that is a meaningful shift.\n","permalink":"https://ashishsaini.work/posts/walkable-building-image-to-video-claude/","summary":"\u003cp\u003eFor years the architecture visualization industry has been stuck with a strange limit. Studios can produce a still render so convincing you cannot tell it from a photograph, but the moment a client wants to \u003cem\u003emove\u003c/em\u003e through that space, the cost jumps by an order of magnitude. Animation means cameras, keyframes, render farms, and days of compute per shot. So most projects ship as a handful of gorgeous frozen images, and the sense of actually being there never makes it to the viewer.\u003c/p\u003e\n\u003cp\u003eTwo things changed that recently. 
Image-to-video models can now invent believable motion from a couple of stills. And coding assistants like Claude can turn that footage into a polished, interactive site in an afternoon. I wanted to see how far that combination goes, so I built a small experiment. This is how it went.\u003c/p\u003e","title":"From still renders to a walkable building: image-to-video and Claude are rewriting how architecture gets shown"},{"content":"There\u0026rsquo;s a lot of \u0026ldquo;AI copywriting\u0026rdquo; out there and almost all of it does the same thing: you give it a prompt, it gives you five variations, you pick one, the end. It\u0026rsquo;s a better autocomplete. It writes once and forgets everything the moment the message goes out, including, crucially, whether anyone clicked.\nThe agent we built at Comify is the opposite of forgetful. It researches a brand before it writes a word, generates message templates with actual intent behind the wording, ships them, watches what real people do, and then changes how it writes based on what worked. It\u0026rsquo;s a closed loop. And closing the loop is where it stops being a writing tool and starts being a system, with all the upside and all the ways that can go wrong.\nThree stages, and only the middle one looks like \u0026ldquo;AI writing\u0026rdquo; Research. Before generating anything, the agent builds a model of the brand. What do they sell, to whom, in what register? A premium eyewear brand and a value mobile-recharge service should not sound alike, and a generic LLM left to its own devices will flatten both into the same pleasant, lifeless marketing voice. So the agent does homework first (the brand\u0026rsquo;s domain, its products, its existing tone) and carries that context into everything downstream. This is the unsexy part that determines whether the output sounds like the brand or like ChatGPT wearing the brand\u0026rsquo;s logo.\nGeneration with intent. 
This is the part that looks like AI copywriting, but the difference is that the agent isn\u0026rsquo;t just writing pleasant sentences. It\u0026rsquo;s applying deliberate psychological triggers (urgency, social proof, curiosity, loss aversion, the well-worn levers of persuasion), and it knows which one it\u0026rsquo;s pulling on each template. That matters enormously for the next stage, because \u0026ldquo;this message got clicks\u0026rdquo; is useless feedback, but \u0026ldquo;messages using a curiosity hook for this audience got clicks\u0026rdquo; is something you can act on. The intent has to be structured, not vibes, or there\u0026rsquo;s nothing to learn from.\nOptimization. The templates go out as real messages. Click-through data comes back. The agent uses it to update which approaches it favours, for this brand, for this audience, at this time. A hook that lands for one brand\u0026rsquo;s customers falls flat for another\u0026rsquo;s, and the agent\u0026rsquo;s whole job is to discover that empirically rather than assume it. Over weeks, it converges on what actually works for each specific brand instead of what works in general, which is the only kind of \u0026ldquo;works\u0026rdquo; that pays.\nbrand research ──▶ generate templates ──▶ send\n  (context)       (with tagged intent)   │\n   ▲                                     ▼\n   │                             click-through data\n   └──────── update strategy ◀───────────┘\n        (which intents win, for whom)\nWhy \u0026ldquo;self-learning\u0026rdquo; is a promise you have to be careful with\nA feedback loop that optimizes itself is exactly as dangerous as it is powerful, and anyone who builds one and isn\u0026rsquo;t a little nervous hasn\u0026rsquo;t thought about it hard enough. Here\u0026rsquo;s what we had to design around.\nClickbait is a local maximum, and the loop will find it. If the only thing you optimize for is click-through, the agent will happily learn that the highest-clicking message is a misleading one. \u0026ldquo;Your order has a problem, click here\u0026rdquo; gets clicks. 
It also gets unsubscribes, complaints, and a brand that no longer trusts you. A naive optimizer walks straight into this, because the metric it\u0026rsquo;s climbing genuinely does go up. We had to constrain the objective so the agent optimizes within the brand\u0026rsquo;s voice and honesty guardrails, not just for the raw number. The guardrails aren\u0026rsquo;t a nice-to-have bolted on the side; they\u0026rsquo;re part of the objective, or the system optimizes itself into something embarrassing.\nFeedback is noisy and slow, and the agent must not over-fit to it. Click-through depends on a hundred things that have nothing to do with the copy: time of day, the offer itself, the audience, what else was in the inbox. Treat one campaign\u0026rsquo;s numbers as gospel and the agent learns superstitions. We had to make it update gradually and weight evidence by how much of it there was, so a single lucky send doesn\u0026rsquo;t yank its whole strategy. This is the difference between learning and twitching.\nExploration costs real money, so you can\u0026rsquo;t explore recklessly. To learn, the agent has to sometimes try a template it\u0026rsquo;s not sure about. That\u0026rsquo;s the explore side of explore-versus-exploit. But every message goes to a real customer of a real brand, so a bad exploratory send has a real cost. We kept exploration deliberate and bounded rather than letting it experiment freely on people who didn\u0026rsquo;t sign up to be a test group.\nThe result, and the honest caveat On targeted workflows, this approach (the agent plus the clickstream recommendation system feeding it audience signal) lifted revenue 60–80% over untargeted sends. That\u0026rsquo;s a big number and I want to be precise about what it means: it\u0026rsquo;s the gap between blasting everyone the same generic message and sending the right intent to the right audience and improving on it over time. A lot of that lift is the targeting; a lot of it is the agent learning the brand. 
Pulling them apart cleanly is genuinely hard and I\u0026rsquo;d be lying if I claimed an exact split.\nWhat I\u0026rsquo;m confident about is the shape of the win. The value isn\u0026rsquo;t in any single clever message. It\u0026rsquo;s in the loop: a system that gets measurably better at a specific brand the longer it runs, because it\u0026rsquo;s learning from that brand\u0026rsquo;s actual customers rather than from a model\u0026rsquo;s general prior about marketing.\nWhat building it changed in how I think about agents The hype around agents is mostly about capability: what can it do, how autonomous is it, how many steps can it chain. After building this, I think that\u0026rsquo;s the less interesting axis. The interesting one is the feedback loop: what does the agent learn from, how fast, and what stops it from learning the wrong thing.\nAn agent without a feedback loop is just an expensive function call. An agent with one is a system that changes over time, which means it can get better on its own, and, if you\u0026rsquo;re careless with the objective, worse on its own, confidently, while every metric you\u0026rsquo;re watching goes up. The engineering that matters isn\u0026rsquo;t the generation. It\u0026rsquo;s the loop, the guardrails on the loop, and the humility to assume the loop will find every shortcut you left open.\nWe built an agent that writes copy. The hard part, and the part I\u0026rsquo;m proud of, was building one that can be trusted to keep doing it without us watching every word.\n","permalink":"https://ashishsaini.work/posts/copywriting-agent-that-learns-from-clicks/","summary":"\u003cp\u003eThere\u0026rsquo;s a lot of \u0026ldquo;AI copywriting\u0026rdquo; out there and almost all of it does the same thing: you give it a prompt, it gives you five variations, you pick one, the end. It\u0026rsquo;s a better autocomplete. 
It writes once and forgets everything the moment the message goes out, including, crucially, whether anyone clicked.\u003c/p\u003e\n\u003cp\u003eThe agent we built at Comify is the opposite of forgetful. It researches a brand before it writes a word, generates message templates with actual intent behind the wording, ships them, watches what real people do, and then changes how it writes based on what worked. It\u0026rsquo;s a closed loop. And closing the loop is where it stops being a writing tool and starts being a system, with all the upside and all the ways that can go wrong.\u003c/p\u003e","title":"A copywriting agent that learns from clicks"},{"content":"When people hear that Comify\u0026rsquo;s backend is fifty-some AWS Lambda functions moving more than fifty million messages a day, the usual reaction is \u0026ldquo;isn\u0026rsquo;t serverless expensive at that scale?\u0026rdquo; It\u0026rsquo;s the right question with the wrong assumption baked in. Serverless got chosen because of the scale and the economics, not in spite of them. But you only get the good version if you design for it deliberately. The naive version is genuinely a trap.\nThe shape of the problem Communication traffic is spiky in a way that punishes always-on infrastructure. A brand fires a campaign and ten million push notifications need to go out in a few minutes. Then, for hours, almost nothing. Then a WhatsApp flow for a different brand. Then a quiet overnight stretch. Then a morning surge.\nIf you provision servers for the peak, you\u0026rsquo;re paying for a fleet that sits at 5% utilization most of the day. If you provision for the average, you fall over the moment a real campaign launches, which is the one moment that matters, because a campaign that goes out late is worse than one that doesn\u0026rsquo;t go out at all. 
Autoscaling groups help, but they scale in minutes and the spikes happen in seconds, so you\u0026rsquo;re perpetually either over-provisioned or behind the wave.\nThis traffic shape is exactly what serverless is for. You pay for invocations, the platform absorbs the spike, and when nothing\u0026rsquo;s happening you pay nothing. The cost curve follows the business: the platform charges us per unit of work, so we effectively pay per message, and we charge per message too. That alignment is the whole reason the architecture works.\nWhy fifty functions and not five services\nThere\u0026rsquo;s a fair critique of \u0026ldquo;Lambda everything\u0026rdquo;: you can end up with a sprawl of tiny functions that\u0026rsquo;s impossible to reason about, a distributed monolith with worse tooling. We pushed back on that by being deliberate about boundaries. The fifty-odd functions aren\u0026rsquo;t fifty microservices; they\u0026rsquo;re a smaller number of pipelines, each decomposed into stages that have genuinely different scaling and failure characteristics.\nThe decision rule was simple: a stage becomes its own function when it scales differently or fails differently from its neighbours. Audience resolution (read-heavy, bursty, cacheable) is not the same workload as channel delivery (I/O-bound, rate-limited by external providers), which is not the same as click ingestion (high-volume, write-heavy, must never block a send). Splitting those means each can scale to its own shape, and a slow downstream provider can\u0026rsquo;t back-pressure into the part of the system that\u0026rsquo;s deciding who to message.\ncampaign trigger\n       │\n       ▼\naudience resolution ─▶ queue ─▶ content / personalization ─▶ queue ─▶ delivery\n   (read-heavy)                   (agentic, LLM-backed)           (rate-limited)\n                                                                       │\nclick / event ingestion ◀──────────────────────────────────────────────┘\n  (write-heavy, async)\nThe queues between stages are not decoration. They\u0026rsquo;re the shock absorbers. 
A campaign dumps ten million people into the front of the pipeline; the queue holds them while delivery drains at whatever rate the downstream providers (WhatsApp\u0026rsquo;s API, the push gateways) will actually accept. The queue depth becomes your natural backpressure and your natural retry buffer at the same time.\nWhere serverless will bite you, and what we did about it I don\u0026rsquo;t want to sell this as free. Serverless at this volume has sharp edges, and pretending otherwise is how people end up with a surprise bill and a 3am page.\nCost is invocations × duration × memory, and duration is where you bleed. A function that\u0026rsquo;s slow because it\u0026rsquo;s waiting on a network call is paying for the wait. We spent real effort making sure functions did work, not waiting: pushing I/O-bound waits into queue-driven steps rather than holding a function open while a slow API responded. A 200ms function and a 2-second function that do the same useful work cost 10x apart, and at fifty million invocations that gap is the difference between a sustainable product and a dead one.\nThe downstream rate limits are the real ceiling. You can invoke Lambda almost arbitrarily fast. You cannot send WhatsApp messages arbitrarily fast: the provider has limits, and blowing through them gets you throttled or blocked, which is catastrophic for a communication company. So the delivery stage is deliberately not trying to go as fast as Lambda can. It\u0026rsquo;s pacing itself to the provider\u0026rsquo;s limits, with the queue absorbing the difference between how fast we could go and how fast we\u0026rsquo;re allowed to.\nCold starts matter at the edges, not the middle. During a campaign, everything\u0026rsquo;s warm. The cold-start tax shows up on the low-traffic, latency-sensitive paths. 
We kept those functions lean and reached for provisioned concurrency only where a cold start would actually be felt by a user, not as a blanket policy, because provisioned concurrency is just renting an always-on server again, and if you sprinkle it everywhere you\u0026rsquo;ve thrown away the entire reason you went serverless.\nIdempotency is non-negotiable. Retries are a feature of every queue and every Lambda, which means every message-affecting operation will, eventually, run twice. Sending a customer the same push notification twice because a retry fired is a real, visible failure. Every stage that could cause a duplicate send is built to be safe to re-run.\nThe part people underrate: the bill is observability Because we pay per invocation and per millisecond, the AWS bill is a remarkably honest profiler. A function that quietly got slower shows up as a line item that quietly got bigger. A pipeline stage that\u0026rsquo;s retrying too much shows up as invocation count drifting away from message count. We watch cost-per-message the way a different team might watch p99 latency, because for this business they\u0026rsquo;re nearly the same signal. A regression in efficiency is a regression, and the billing dashboard catches it.\nWould I do it again For this workload, without hesitation. The traffic is spiky, the unit economics demand that cost track usage, and the team is small enough that not running a server fleet is a genuine multiplier. If the traffic were flat and predictable and huge, I\u0026rsquo;d reconsider. At constant high utilization, reserved instances win on raw cost, and serverless stops paying for its premium. Architecture is a response to a workload, not a religion.\nBut the lesson that transfers regardless of platform is this: at scale, cost-per-unit-of-work is not a finance concern you bolt on later. It\u0026rsquo;s a design constraint you put on the whiteboard on day one, next to latency and reliability. 
We can move fifty million messages a day on fifty functions because we treated the bill as a feature. Most of the architecture decisions that look clever in hindsight were really just us refusing to pay for work we weren\u0026rsquo;t doing.\n","permalink":"https://ashishsaini.work/posts/fifty-million-messages-fifty-lambdas/","summary":"\u003cp\u003eWhen people hear that Comify\u0026rsquo;s backend is fifty-some AWS Lambda functions moving more than fifty million messages a day, the usual reaction is \u0026ldquo;isn\u0026rsquo;t serverless expensive at that scale?\u0026rdquo; It\u0026rsquo;s the right question with the wrong assumption baked in. Serverless got \u003cem\u003echosen\u003c/em\u003e because of the scale and the economics, not in spite of them. But you only get the good version if you design for it deliberately. The naive version is genuinely a trap.\u003c/p\u003e","title":"Fifty million messages a day on fifty Lambda functions"},{"content":"Every e-commerce catalogue has a dirty secret, and it\u0026rsquo;s a number nobody likes to say out loud: the percentage of products that have a proper photo of a real person wearing them. Ours was 73%. Which means more than a quarter of what we sold, thousands of frames, was represented online by a flat shot of the product on a white background, while its neighbour had a crisp image of a model looking great in it. The model shots convert better. Everyone knows this. The problem is that closing the gap means a physical photoshoot, and a physical photoshoot does not scale.\nWe took catalogue coverage from 73% to 100% in about a month. Not by shooting faster. By not shooting at all for the long tail.\nWhy the bottleneck was real, not just expensive It\u0026rsquo;s tempting to frame this as a cost problem: model shoots are pricey, generate them instead, save money. The cost was real, but the actual constraint was throughput. New frames land in the catalogue continuously. A studio shoots a finite number of SKUs a day. 
The backlog doesn\u0026rsquo;t shrink; it\u0026rsquo;s a queue with an arrival rate that beats its service rate, which means it grows forever. You can\u0026rsquo;t hire your way out of an unbounded queue.\nSo the goal wasn\u0026rsquo;t \u0026ldquo;cheaper photos.\u0026rdquo; It was \u0026ldquo;make coverage a solved problem instead of a permanent backlog.\u0026rdquo; That reframing mattered, because it told us what we could and couldn\u0026rsquo;t compromise on. We could tolerate a generated image being slightly less perfect than a studio shot. We could not tolerate a pipeline that still needed a human in the loop for every single SKU, because then we\u0026rsquo;d just have rebuilt the bottleneck with extra steps.\nThe pipeline, honestly The headline tools were Stable Diffusion and ComfyUI, but the model was maybe a third of the work. The other two-thirds was everything around it.\nInput conditioning. You can\u0026rsquo;t just prompt \u0026ldquo;a model wearing these glasses\u0026rdquo; and get the actual glasses. The product has to be preserved pixel-faithfully: the brand will not accept a frame that\u0026rsquo;s \u0026ldquo;inspired by\u0026rdquo; their product. So the real product image is the anchor, and generation happens around it: we composited the genuine product onto a generated person and scene, using the diffusion model for the parts that need to be invented (the face, the body, the lighting, the environment) while protecting the parts that must not change (the product itself).\nThe ComfyUI graph as a product, not a notebook. ComfyUI is brilliant for exploration and a trap for production if you treat it like a sketchpad. We turned the graph into a parameterized, versioned pipeline: fixed nodes, controlled randomness, and inputs driven by catalogue metadata rather than a human dragging sliders. 
A new SKU enters as data (product type, color, intended demographic) and comes out the other end as a finished frame without anyone opening the UI.\nDemographic and brand consistency. A model image isn\u0026rsquo;t neutral. The \u0026ldquo;right\u0026rdquo; model for a frame depends on the market and the brand\u0026rsquo;s guidelines. We drove that from metadata too, so a value-line product and a premium sub-brand didn\u0026rsquo;t both get the same generic face. Getting this wrong is the kind of mistake that looks like a small aesthetic slip and is actually a brand-safety incident.\nSKU metadata + product image\n        │\n        ▼\n  conditioning ──▶ diffusion (SD + ComfyUI graph) ──▶ candidate frames\n        │                                                    │\n        └────────────── brand / demographic params           ▼\n                                                    automated QA gates\n                                                             │\n                                        pass ──▶ catalogue    fail ──▶ human review\nQA was the actual hard part\nHere\u0026rsquo;s where most \u0026ldquo;we used GenAI to make images\u0026rdquo; stories quietly skip ahead. Generating a plausible image is easy now. Generating a correct one, ten thousand times, without a human checking each one, is the entire engineering problem.\nDiffusion models fail in specific, repeatable ways, and at catalogue scale you will hit every one of them many times a day: the extra finger, the melted product edge, a frame subtly distorted into the wrong shape, a face that wandered into uncanny territory, lighting that doesn\u0026rsquo;t match the product\u0026rsquo;s real material. If even 2% of generated frames have a defect and you\u0026rsquo;re publishing thousands, you\u0026rsquo;re shipping dozens of broken images a day straight to customers. That\u0026rsquo;s not a coverage win; that\u0026rsquo;s a trust loss.\nSo we built automated QA gates and treated them as first-class. The product region in the output was checked against the original to catch distortion. If the glasses came out warped, the frame was rejected before a human ever saw it. We checked for the common anatomical failure modes. 
We checked that the generated lighting was consistent enough to look like a real shot. Anything that failed a gate didn\u0026rsquo;t go to the catalogue; it went to a small human review queue.\nThat review queue is the key to the whole thing. The pipeline didn\u0026rsquo;t have to be perfect. It had to be good enough to auto-pass the easy majority and route only the genuinely hard cases to a person. That\u0026rsquo;s what turned an O(n) human bottleneck into an O(hard-cases) one, and the hard cases are a small, shrinking fraction. That\u0026rsquo;s the difference between a demo and a system.\nWhat closing the gap actually bought Coverage hit 100% in about a month, and it stayed there, which is the part I care about more than the headline. New SKUs no longer joined a backlog; they joined a pipeline. Photoshoot cost came down, but the durable win was that \u0026ldquo;products without a model image\u0026rdquo; stopped being a category that existed.\nThe general lesson I\u0026rsquo;ve now built a few of these: image generation at Lenskart, a product-photoshoot engine at Comify that does the same trick for ad creatives. The pattern is always the same, and it\u0026rsquo;s the opposite of where the excitement is.\nThe generative model is a commodity. It\u0026rsquo;s good, it\u0026rsquo;s getting better, and it\u0026rsquo;s not your moat. Your moat is the conditioning that makes it faithful to a real product, the parameterization that lets it run on data instead of vibes, and, above all, the automated QA that lets you trust the output at a scale no human could review. The model makes the image. The system around the model is what lets you actually publish it.\nEveryone wants to talk about the prompt. 
The prompt was the easy part.\n","permalink":"https://ashishsaini.work/posts/73-to-100-genai-photoshoot-pipeline/","summary":"\u003cp\u003eEvery e-commerce catalogue has a dirty secret, and it\u0026rsquo;s a number nobody likes to say out loud: the percentage of products that have a proper photo of a real person wearing them. Ours was 73%. Which means more than a quarter of what we sold, thousands of frames, was represented online by a flat shot of the product on a white background, while its neighbour had a crisp image of a model looking great in it. The model shots convert better. Everyone knows this. The problem is that closing the gap means a physical photoshoot, and a physical photoshoot does not scale.\u003c/p\u003e\n\u003cp\u003eWe took catalogue coverage from 73% to 100% in about a month. Not by shooting faster. By not shooting at all for the long tail.\u003c/p\u003e","title":"From 73% to 100%: a GenAI photoshoot pipeline that cleared the catalogue"},{"content":"Virtual try-on has a problem nobody puts in the marketing video: it works beautifully for people who don\u0026rsquo;t already wear glasses. For everyone else (which, at an eyewear company, is most of your customers), the experience is broken. You point the camera at your face, the app renders a gorgeous new frame, and it sits on top of the glasses you\u0026rsquo;re already wearing. Two pairs of glasses. It looks ridiculous, and worse, it doesn\u0026rsquo;t answer the only question the customer has: do these look good on me?\nSo we built the thing that sounds impossible when you say it out loud: remove the glasses the user is wearing, live, from the camera feed, before rendering the new ones. On the phone. Fast enough that it feels like video, not a slideshow.\nWhy it had to run on the device The obvious architecture is to stream frames to a server, run a big model, stream them back. 
We never seriously considered it, for three reasons that I think generalize to almost any real-time CV product:\nLatency. A try-on is a mirror. The moment the reflection lags your movement, the illusion dies. Round-tripping every frame to a server adds a hundred-plus milliseconds you cannot hide. Cost. Millions of users, each generating a live video stream you\u0026rsquo;d have to run inference on, is a GPU bill that turns a feature into a liability. On-device inference costs you nothing per frame after the user has downloaded the model. Trust. People are pointing a camera at their own face. \u0026ldquo;We process your video on our servers\u0026rdquo; is a sentence you don\u0026rsquo;t want in your privacy policy if you can avoid it. The catch is that \u0026ldquo;run it on the device\u0026rdquo; turns every problem into a budget problem. You have roughly 80 milliseconds per frame to hit a usable frame rate, and an iPhone 12 (a great phone, but a phone) is doing everything else at the same time: the camera pipeline, the AR rendering, the UI. The CV model gets a slice of that, and the slice is small.\nThe naive pipeline, and why it was too slow Removing an object from an image and convincingly filling in what was behind it is, in general, two hard problems stacked on top of each other: segmentation (find every pixel that is \u0026ldquo;glasses\u0026rdquo;) and inpainting (reconstruct the face, eyes, and skin that the glasses were covering). The textbook approach is a heavy segmentation network followed by a generative inpainting model.\nRun that per frame and you get maybe two or three frames per second on a phone. It\u0026rsquo;s a tech demo, not a product. Three FPS feels worse than no feature at all, because now the user is watching a juddery, uncanny version of their own face.\nGetting from three FPS to twelve-plus wasn\u0026rsquo;t one clever trick. It was a stack of unglamorous ones.\nWhat actually got it to 12+ FPS Stop solving the whole frame every frame. 
A face doesn\u0026rsquo;t teleport between frames. We tracked the glasses region across frames and only ran the expensive work inside a tight, predicted bounding box, not the full image. Most of the frame is skin and background that didn\u0026rsquo;t change; spending model budget on it is waste.\nRight-size the model brutally. The instinct from research is to reach for the most accurate network. The instinct from shipping is to ask what\u0026rsquo;s the smallest network that\u0026rsquo;s still convincing. We used a compact segmentation backbone, quantized it, and accepted that it would be slightly worse at edge cases in exchange for being three times faster. On a live feed, the eye barely registers a single imperfect frame; it absolutely registers stutter.\nInpaint cheaply, not generatively. Full generative inpainting per frame was the budget killer. For the temple arms and the lenses over skin, a much lighter reconstruction, informed by the symmetric, un-occluded side of the face and a short temporal memory of recent clean frames, was good enough to be invisible at video speed. We saved the heavy generation for the hardest occluded regions, not the whole thing.\nExploit temporal coherence. This is the single biggest lever for any real-time CV system and it\u0026rsquo;s the one that people coming from image models forget. Consecutive frames are almost the same frame. You can amortize work across them: run the full segmentation every N frames and propagate it on the frames in between using cheap optical-flow-style tracking. The user sees twelve fresh-looking frames a second; the model only did the expensive thing a few times.\nLean on the hardware. Core ML and the Neural Engine exist for exactly this. A model that crawls on the CPU flies on the NPU. 
A meaningful chunk of our speedup was simply making sure every operation in the graph was one the Neural Engine would actually accept, and reworking the few that weren\u0026rsquo;t.\nframe N ──▶ full segmentation + heavy inpaint (every ~5th frame) frames N+1… ──▶ track region + propagate mask + light fill (cheap) │ └─ temporal memory of recent clean pixels The bugs that taught me the most The interesting failures were never about accuracy on a benchmark. They were about the gap between a dataset and a human in their kitchen.\nLighting was the first humbling lesson. Models trained on well-lit catalogue-style faces fall apart under a yellow ceiling bulb at night, which is where a lot of people actually shop on their phone. We ended up doing a lightweight per-frame normalization before the model ever saw the pixels, just to drag the input back toward something the network recognized.\nThick frames and reflective lenses were the second. A glossy lens reflecting a window confuses a segmenter into thinking the reflection is part of the face. Temporal smoothing helped here too: a single confused frame gets outvoted by its neighbors.\nAnd motion. People move their heads while evaluating glasses; that\u0026rsquo;s the whole point. Fast motion blurs the temple arms into the hair and the tracker loses the region. The fix was less about the model and more about graceful degradation: when tracking confidence dropped, we\u0026rsquo;d briefly fall back rather than render something wrong, because a momentary \u0026ldquo;hold still\u0026rdquo; beats a glitch on your own face.\nWhat I\u0026rsquo;d tell anyone building real-time CV on the edge The frame budget is the design. Everything follows from \u0026ldquo;you have ~80ms.\u0026rdquo; Accuracy you can\u0026rsquo;t afford is worse than slightly lower accuracy you can sustain, because stutter breaks the illusion harder than imperfection does. Temporal coherence is free performance and most people leave it on the table. 
And the device\u0026rsquo;s accelerators are not optional: a model is only as fast as its slowest unsupported op.\nWe shipped this to existing spectacle wearers across iOS, and for the first time they could actually see themselves in new frames. It was part of an AR try-on program that moved online revenue by around 9%. But the number I remember is twelve. Going from three frames a second to twelve was the difference between a clever research result and something a person could look into and believe.\n","permalink":"https://ashishsaini.work/posts/erasing-glasses-12-fps-iphone/","summary":"\u003cp\u003eVirtual try-on has a problem nobody puts in the marketing video: it works beautifully for people who don\u0026rsquo;t already wear glasses. For everyone else (which, at an eyewear company, is most of your customers), the experience is broken. You point the camera at your face, the app renders a gorgeous new frame, and it sits \u003cem\u003eon top of the glasses you\u0026rsquo;re already wearing\u003c/em\u003e. Two pairs of glasses. It looks ridiculous, and worse, it doesn\u0026rsquo;t answer the only question the customer has: do these look good on \u003cem\u003eme\u003c/em\u003e?\u003c/p\u003e\n\u003cp\u003eSo we built the thing that sounds impossible when you say it out loud: remove the glasses the user is wearing, live, from the camera feed, before rendering the new ones. On the phone. Fast enough that it feels like video, not a slideshow.\u003c/p\u003e","title":"Erasing glasses in real time: 12 FPS on an iPhone 12"},{"content":"There\u0026rsquo;s a version of a scaling story that\u0026rsquo;s all dashboards and Kubernetes. This isn\u0026rsquo;t that one. This is about a phone-number-verification product called CODAC that, at its peak, placed more than ten million calls a day, made around $12M a year, and was run by three people. 
We did it on a cluster of telephony servers that mostly looked like pets, not cattle, and the hardest problem we solved had nothing to do with the volume.\nWhat CODAC actually did The premise was unglamorous and lucrative. An e-commerce company (Lenskart, Snapdeal, Myntra, take your pick) wants to confirm that the phone number a customer typed at checkout is real and reachable before they ship a cash-on-delivery order to it. The cheapest way to do that, in India, at that time, was a \u0026ldquo;missed call\u0026rdquo; flow: the system places a call, the user\u0026rsquo;s phone rings, they don\u0026rsquo;t even have to pick up, and the act of the call connecting (or the user calling a number back) verifies the line.\nMultiply that by every COD order across several of the country\u0026rsquo;s biggest retailers and you get a firehose. Tens of millions of call attempts a day, each of which has to be placed, tracked, retried on failure, and reported back to the client\u0026rsquo;s order system within seconds, because a customer is standing on the checkout page waiting.\nThe first architecture was wrong, and that was fine We started in PHP. I want to be honest about that because there\u0026rsquo;s a temptation, years later, to pretend you reached for the perfect tool on day one. We didn\u0026rsquo;t. PHP was what we knew, it got the first version in front of paying customers fast, and \u0026ldquo;it works and bills\u0026rdquo; beats \u0026ldquo;it\u0026rsquo;s elegant and theoretical\u0026rdquo; every single time at a startup.\nPHP held longer than you\u0026rsquo;d think. What eventually pushed us off it wasn\u0026rsquo;t request throughput. It was the long-running, stateful nature of managing call legs and retries. A call isn\u0026rsquo;t a request-response. It\u0026rsquo;s a little state machine that lives for thirty seconds, can fail in six different ways, and needs to be reconciled against what the telephony hardware actually did. 
We ported the core to Python, kept MySQL as the system of record, and leaned hard on Redis and Memcached for the hot path: the per-number, per-second state that you cannot afford to hit the database for.\nThe telephony itself ran across ten-plus servers, each one wired to carrier trunks. From the outside it was one product. From the inside it was a small fleet, and fleets have a specific failure mode that took me a while to fully respect.\nThe actual hard problem: distributing numbers across servers Here\u0026rsquo;s the part nobody warns you about. When you have ten telephony servers and a river of numbers to call, which server calls which number turns out to be the whole game.\nNaively, you round-robin. Number comes in, hand it to the next server. This breaks in ways that are invisible until they\u0026rsquo;re catastrophic:\nA carrier rate-limits per trunk, and trunks are tied to servers. Round-robin a hot batch of numbers from one client onto a server whose trunk is already near its ceiling and that whole batch fails, not because the system is overloaded, but because you put the wrong work in the wrong place. Retries have to remember where they came from. If number X failed on server 3 because of a carrier issue specific to server 3\u0026rsquo;s route, retrying it on server 3 is the dumbest possible choice. Servers die. When server 3 goes down mid-batch, ten thousand in-flight numbers need to go somewhere, immediately, without double-dialing the ones that already connected. So we wrote a distribution layer. Custom, boring, and the single most important piece of software in the product. It tracked per-server, per-trunk capacity in real time, kept a short memory of where each number had already been tried, and made placement decisions that balanced load while respecting the physical reality of the carrier routes. 
When a server fell over, its outstanding work drained to healthy peers without replaying anything that had already completed.\nIt was, in effect, a purpose-built scheduler for a resource (carrier trunk capacity) that you can\u0026rsquo;t autoscale because it\u0026rsquo;s a contract with a phone company. You can\u0026rsquo;t spin up more trunk on a Tuesday. The whole architecture had to be designed around the fact that the scarce resource was fixed and lumpy.\nincoming numbers ──▶ distribution layer ──▶ server pool │ ├─ srv1 (trunk A, 78% util) │ ├─ srv2 (trunk B, 41% util) per-trunk capacity ├─ srv3 ✗ draining + recent-attempt memory └─ ... That diagram is the entire trick. Everything else (the calling, the reporting, the billing) was comparatively easy.\nWhy three people could run it People hear \u0026ldquo;10M calls a day, three engineers\u0026rdquo; and assume heroics. It was almost the opposite. The team was small because the system was designed to not need babysitting, and it was designed that way because the team was small. The constraint and the architecture fed each other.\nA few decisions that bought us our weekends back:\nThe database was the source of truth, and nothing else was allowed to be. Caches were disposable. Any server could be rebuilt from MySQL plus a cold start. That meant a dead server was a non-event, not an incident. Idempotency everywhere on the call path. Placing the same call twice is worse than not placing it: you annoy a customer and you pay the carrier. Every operation that touched a number was safe to retry, which is what let the distribution layer be aggressive about reassignment. We monitored the carrier, not just the servers. Most of our real outages originated outside our walls: a route degrading, a trunk flapping. The alerts that mattered watched connection rates per route, so we found out before the client did. We said no to features that would have added state. 
Every fancy capability someone wanted usually meant another thing to reconcile when a server died. The discipline of a tiny team is that you feel the cost of complexity immediately. What I took with me I\u0026rsquo;ve since built GenAI pipelines and agent fleets, and the lessons from CODAC keep showing up wearing different clothes. The scarce, lumpy resource that you have to schedule around isn\u0026rsquo;t carrier trunk anymore: it\u0026rsquo;s GPU, or an API rate limit, or a model\u0026rsquo;s context window. The principle is identical: find the thing you can\u0026rsquo;t just autoscale your way out of, and design the whole system around respecting it.\nThe other thing CODAC taught me is that revenue-per-engineer is a real engineering metric, not just a finance one. Three people made $12M a year not because we worked harder than everyone else, but because we spent our complexity budget on the one problem that mattered and ruthlessly avoided spending it anywhere else.\nThat\u0026rsquo;s still the job. The river just carries different water now.\n","permalink":"https://ashishsaini.work/posts/ten-million-calls-team-of-three/","summary":"\u003cp\u003eThere\u0026rsquo;s a version of a scaling story that\u0026rsquo;s all dashboards and Kubernetes. This isn\u0026rsquo;t that one. This is about a phone-number-verification product called CODAC that, at its peak, placed more than ten million calls a day, made around $12M a year, and was run by three people. We did it on a cluster of telephony servers that mostly looked like pets, not cattle, and the hardest problem we solved had nothing to do with the volume.\u003c/p\u003e","title":"Ten million calls a day with a team of three"},{"content":"I\u0026rsquo;m Ashish. 
I build AI products and the teams that ship them.\nFor fifteen years I\u0026rsquo;ve done some version of the same job: take something that doesn\u0026rsquo;t exist yet (a telephony platform, a virtual try-on, a fleet of agents) and get it running reliably in front of real users. The technology keeps changing. The job, mostly, doesn\u0026rsquo;t.\nThese days I\u0026rsquo;m CTO and co-founder of Comify, where we\u0026rsquo;re building communication infrastructure that decides what to say, to whom, and when, not just how to deliver it. Before that I led AI and AR research at Lenskart, and before that I spent a decade at MyOperator going from founding engineer to Director of Technology while the company grew from its first customer to more than ten thousand.\nThe short version of the long story I started out writing CRUD apps for clients at a small shop in 2011. Unglamorous. I learned a lot.\nIn 2012 I joined a tiny cloud-telephony startup then called VoiceTree, later MyOperator. I was employee-number-small. Over the next ten years I helped build two products from nothing, designed the architecture for a platform that peaked at 100+ servers and 99.9% uptime, and grew into leading a 30+ person engineering org. The project I\u0026rsquo;m proudest of from those years is CODAC: a phone-number-verification system for e-commerce that, at its peak, handled 10M+ calls a day and brought in around $12M a year. We ran it with a team of three. I learned more about distributed systems from keeping CODAC alive than from any book.\nIn 2022 I made a deliberate jump: from infrastructure to applied AI. I joined Lenskart to start an AI/AR research team, which on day one was just me. Within a year we\u0026rsquo;d shipped a virtual eyeglass try-on to millions of users that moved online revenue by about 9%. 
Then we kept going: real-time eyeglass removal on the live camera feed at 12+ FPS on an iPhone 12, contact-lens try-on, a GenAI photoshoot pipeline that took catalogue coverage from 73% to 100% in a month, and a 3D asset pipeline rebuilt around Blender automation. The team grew from 1 to 12.\nIn 2025 I co-founded Comify to build what I\u0026rsquo;d kept wishing existed: communication infrastructure with judgment. We move 50M+ messages a day across push and WhatsApp on a serverless backend, and we hand the parts that used to be a marketer\u0026rsquo;s full-time job (writing the copy, picking the audience, generating the creative) to agents that learn from what actually got clicked.\nHow I work A few things I believe, mostly because I\u0026rsquo;ve been burned by the alternatives:\nStay close to the code. I\u0026rsquo;ve never been the kind of leader who stops building. My best architecture decisions came from having recently felt the pain myself. My worst came from a slide deck. Tie it to a number. Uptime, revenue, cost, coverage, FPS: pick the one that matters and move it. \u0026ldquo;We modernized the stack\u0026rdquo; is not a result. \u0026ldquo;We replaced a vendor in a week and saved $800k a year\u0026rdquo; is. The bill is part of the design. Scale is easy if you ignore cost. Doing 50M messages a day cheaply is the actual engineering problem. Build-vs-buy is a decision, not a reflex. I\u0026rsquo;ve bought when buying was right and built when the vendor was the bottleneck. Knowing which is which is most of the job. People stay when the work is good and the leadership is honest. At MyOperator I was told more than once that we had the lowest attrition in the company. I think that\u0026rsquo;s the achievement that compounds the most. Beyond the day job I tinker constantly. A digital-twin agent that produces fully automated talking-avatar news videos. Videofarm, an API-driven layout engine that generates video programmatically. 
SnapStitch, an apparel photoshoot app. An agentic development system built on Claude Code. A home-security setup running detection on a Raspberry Pi. A handful of small models I trained for fun and spite: audio anomaly detection, anger detection in speech, a custom text-to-speech voice.\nMost of these will never be products. That\u0026rsquo;s fine. They\u0026rsquo;re how I keep my hands in the clay.\nIf any of this overlaps with what you\u0026rsquo;re working on, come say hi.\n","permalink":"https://ashishsaini.work/about/","summary":"\u003cp\u003eI\u0026rsquo;m Ashish. I build AI products and the teams that ship them.\u003c/p\u003e\n\u003cp\u003eFor fifteen years I\u0026rsquo;ve done some version of the same job: take something that doesn\u0026rsquo;t exist yet (a telephony platform, a virtual try-on, a fleet of agents) and get it running reliably in front of real users. The technology keeps changing. The job, mostly, doesn\u0026rsquo;t.\u003c/p\u003e\n\u003cp\u003eThese days I\u0026rsquo;m CTO and co-founder of \u003cstrong\u003eComify\u003c/strong\u003e, where we\u0026rsquo;re building communication infrastructure that decides \u003cem\u003ewhat\u003c/em\u003e to say, \u003cem\u003eto whom\u003c/em\u003e, and \u003cem\u003ewhen\u003c/em\u003e, not just how to deliver it. Before that I led AI and AR research at \u003cstrong\u003eLenskart\u003c/strong\u003e, and before that I spent a decade at \u003cstrong\u003eMyOperator\u003c/strong\u003e going from founding engineer to Director of Technology while the company grew from its first customer to more than ten thousand.\u003c/p\u003e","title":"About"},{"content":"The fastest way to reach me is email:\nashy.saini@gmail.com I read everything and reply to most things within a day or two.\nGood reasons to write You\u0026rsquo;re putting AI agents into production and want a second opinion on the architecture. You\u0026rsquo;re doing computer vision or AR on real devices and fighting the frame budget. 
You\u0026rsquo;re scaling a serverless or telephony/messaging backend and the bill is starting to hurt. You\u0026rsquo;re building an AI/ML team and want to compare notes on hiring, structure, and not burning people out. You want to talk about a build-vs-buy call before you commit to either. Advisory, consulting, or just a good technical conversation. If you email, a couple of lines on what you\u0026rsquo;re working on and what you\u0026rsquo;re stuck on goes a long way. It lets me actually be useful in the first reply instead of the third.\nElsewhere LinkedIn: linkedin.com/in/ashysaini I\u0026rsquo;m based in Noida, India, and work across time zones regularly, so don\u0026rsquo;t worry about where you are.\n","permalink":"https://ashishsaini.work/contact/","summary":"\u003cp\u003eThe fastest way to reach me is email:\u003c/p\u003e\n\u003ch3 id=\"ashysainigmailcom\"\u003e\u003ca href=\"mailto:ashy.saini@gmail.com\"\u003eashy.saini@gmail.com\u003c/a\u003e\u003c/h3\u003e\n\u003cp\u003eI read everything and reply to most things within a day or two.\u003c/p\u003e\n\u003ch2 id=\"good-reasons-to-write\"\u003eGood reasons to write\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003eYou\u0026rsquo;re putting \u003cstrong\u003eAI agents into production\u003c/strong\u003e and want a second opinion on the architecture.\u003c/li\u003e\n\u003cli\u003eYou\u0026rsquo;re doing \u003cstrong\u003ecomputer vision or AR on real devices\u003c/strong\u003e and fighting the frame budget.\u003c/li\u003e\n\u003cli\u003eYou\u0026rsquo;re scaling a \u003cstrong\u003eserverless or telephony/messaging backend\u003c/strong\u003e and the bill is starting to hurt.\u003c/li\u003e\n\u003cli\u003eYou\u0026rsquo;re \u003cstrong\u003ebuilding an AI/ML team\u003c/strong\u003e and want to compare notes on hiring, structure, and not burning people out.\u003c/li\u003e\n\u003cli\u003eYou want to talk about a \u003cstrong\u003ebuild-vs-buy\u003c/strong\u003e call before you commit to 
either.\u003c/li\u003e\n\u003cli\u003eAdvisory, consulting, or just a good technical conversation.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eIf you email, a couple of lines on what you\u0026rsquo;re working on and what you\u0026rsquo;re stuck on goes a long way. It lets me actually be useful in the first reply instead of the third.\u003c/p\u003e","title":"Contact"},{"content":"I\u0026rsquo;d rather show you a short, honest list of things I\u0026rsquo;ve actually shipped than a long one of things I\u0026rsquo;ve heard of. Here\u0026rsquo;s where I\u0026rsquo;m genuinely useful.\nGenerative AI \u0026amp; Agentic Systems This is most of my work right now. Not chatbots. Systems that do things and improve on their own.\nSelf-learning agents. At Comify I built a copywriting agent that researches a brand\u0026rsquo;s domain, applies psychological triggers to draft message templates, then keeps optimizing them against live click-through data. It gets better at a brand the longer it runs. Multi-agent orchestration at scale. The Comify platform is a fleet of agents and 50+ Lambda functions coordinating to deliver 50M+ messages a day, deciding audience, channel, content, and timing. Image \u0026amp; video generation pipelines. Product-photoshoot engines that turn a single product shot into on-brand campaign creatives in bulk, using vision LLMs (Gemini, Claude) and diffusion tooling (Stable Diffusion, ComfyUI). At Lenskart this took catalogue photoshoot coverage from 73% to 100% in a month. Recommendation systems. Clickstream-driven recommenders (recently-viewed, popular, demography-based) that lifted revenue on targeted communication workflows by 60–80% over untargeted sends. RAG and retrieval where it earns its keep, and not where it doesn\u0026rsquo;t. Computer Vision \u0026amp; AR Five years of getting models to run fast, on real devices, in real lighting.\nVirtual try-on at consumer scale. 
Eyeglass AR try-on delivered to millions across Android, iOS, and web; ~9% lift in online revenue since launch. Real-time, on-device CV. Eyeglass removal on the live AR video feed at 12+ FPS on an iPhone 12, so existing spectacle wearers could actually see themselves in new frames. Edge inference, tight frame budgets, no round-trip to a server. Try-on beyond glasses. Contact-lens try-on for the Aqualens sub-brand (20%+ sales lift in markets like the UAE), and a real-time 2D image try-on shipped in a week that replaced a vendor and saved ~$800k a year in licensing. 3D asset pipelines. Rebuilt Lenskart\u0026rsquo;s 3D try-on pipeline with Blender automation and purpose-built QA tooling: coverage from 65% to 92% in three months, asset development time down ~30%. High-Scale, Low-Cost Infrastructure Before AI was my job, this was. It still underpins everything I build.\nServerless at scale. A backend of 50+ production AWS Lambda functions orchestrating 50M+ daily messages, designed, deliberately, around minimizing cost per message. Distributed telephony. CODAC peaked at 10M+ calls a day across 10+ telephony servers (PHP → Python, MySQL, Redis/Memcached) and generated ~$12M a year with a three-person team. High availability by design. Multi-server, load-balanced, fault-tolerant clusters spanning 100+ servers at 99.9% uptime, with custom tooling for the genuinely hard parts, like distributing phone numbers across servers without collisions. The unglamorous foundations. Monitoring, release planning, capacity, and the alerting that lets a small team sleep. Leadership \u0026amp; Technical Strategy I\u0026rsquo;ve built two teams from one person, and led an org of thirty.\nTeam building \u0026amp; retention. Grew Lenskart\u0026rsquo;s AI/AR/Data team from 1 to 12; led a 30+ person engineering org at MyOperator with the company\u0026rsquo;s lowest attrition. Build-vs-buy \u0026amp; vendor strategy. 
The calls that quietly save the most money usually aren\u0026rsquo;t about technology at all. Working with founders \u0026amp; product. Roadmap and ideation directly with co-founders and senior product leadership, translating between \u0026ldquo;what\u0026rsquo;s possible this quarter\u0026rdquo; and \u0026ldquo;what the business actually needs.\u0026rdquo; Agentic engineering practice. Using AI tooling to accelerate the team\u0026rsquo;s own delivery without trading away reliability. If one of these is the problem you\u0026rsquo;re staring at right now, let\u0026rsquo;s talk.\n","permalink":"https://ashishsaini.work/expertise/","summary":"\u003cp\u003eI\u0026rsquo;d rather show you a short, honest list of things I\u0026rsquo;ve actually shipped than a long one of things I\u0026rsquo;ve heard of. Here\u0026rsquo;s where I\u0026rsquo;m genuinely useful.\u003c/p\u003e\n\u003ch2 id=\"generative-ai--agentic-systems\"\u003eGenerative AI \u0026amp; Agentic Systems\u003c/h2\u003e\n\u003cp\u003eThis is most of my work right now. Not chatbots. Systems that \u003cem\u003edo\u003c/em\u003e things and improve on their own.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSelf-learning agents.\u003c/strong\u003e At Comify I built a copywriting agent that researches a brand\u0026rsquo;s domain, applies psychological triggers to draft message templates, then keeps optimizing them against live click-through data. 
It gets better at a brand the longer it runs.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMulti-agent orchestration at scale.\u003c/strong\u003e The Comify platform is a fleet of agents and 50+ Lambda functions coordinating to deliver 50M+ messages a day, deciding audience, channel, content, and timing.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eImage \u0026amp; video generation pipelines.\u003c/strong\u003e Product-photoshoot engines that turn a single product shot into on-brand campaign creatives in bulk, using vision LLMs (Gemini, Claude) and diffusion tooling (Stable Diffusion, ComfyUI). At Lenskart this took catalogue photoshoot coverage from 73% to 100% in a month.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRecommendation systems.\u003c/strong\u003e Clickstream-driven recommenders (recently-viewed, popular, demography-based) that lifted revenue on targeted communication workflows by 60–80% over untargeted sends.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRAG and retrieval\u003c/strong\u003e where it earns its keep, and not where it doesn\u0026rsquo;t.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"computer-vision--ar\"\u003eComputer Vision \u0026amp; AR\u003c/h2\u003e\n\u003cp\u003eFive years of getting models to run \u003cem\u003efast\u003c/em\u003e, on \u003cem\u003ereal devices\u003c/em\u003e, in \u003cem\u003ereal lighting\u003c/em\u003e.\u003c/p\u003e","title":"Expertise"},{"content":"A walk through the things I\u0026rsquo;ve built: what the problem was, what I did, and what it actually moved. Most of the AI/AR work has video, so where I can, I\u0026rsquo;ll just show you.\nComify: CTO \u0026amp; Co-founder 2025 – present. Intelligent, omnichannel customer-communication infrastructure for large consumer brands.\nI co-founded Comify on a simple frustration: every brand spends enormous effort delivering messages and almost none deciding whether a message should be sent at all. 
We\u0026rsquo;re building the layer that decides what to say, to whom, and when, then delivers it at scale, cheaply.\nArchitected a multi-agent communication platform handling 50M+ messages a day (push, WhatsApp) for brands including Lenskart and Cars24. Designed a serverless backend of 50+ production AWS Lambda functions to orchestrate and deliver at scale with minimal cost per message. Built a product-photoshoot image \u0026amp; video generation tool on vision LLMs (Gemini, Claude, Higgsfield) that turns product images into on-brand ad creatives in bulk from style guidelines, cutting catalogue creative cost ~30%. Designed self-learning agents, including a copywriting agent that researches a brand\u0026rsquo;s domain, applies psychological triggers to generate templates, then continuously optimizes them against live click-through. Built a clickstream recommendation system (recently-viewed, popular, demography-based) that lifted revenue 60–80% on targeted workflows versus untargeted sends. Hired and lead a cross-functional team across frontend, backend, AI, and data engineering. Lenskart: AI \u0026amp; AR Research Lead 2022 – 2025. Built and led the AI / AR / Data Science team (grew 1 → 12), partnering directly with co-founders and senior product on roadmap.\nI joined to start an AI/AR team from scratch and spent three years turning research demos into features millions of people actually used.\nDelivered best-in-class AR virtual try-on across Android, iOS, and web, with a ~9% lift in online revenue since launch. Built real-time eyeglass removal on the live AR feed at 12+ FPS on an iPhone 12, so existing spectacle wearers could see themselves in new frames. Launched contact-lens try-on for the Aqualens sub-brand, with a 20%+ sales lift in markets like the UAE. Architected a GenAI product-photoshoot pipeline (Stable Diffusion + ComfyUI) that raised model-photoshoot coverage from 73% → 100% in one month. 
Rebuilt the 3D try-on asset pipeline with Blender automation and custom QA apps: 3D coverage 65% → 92% in three months, asset dev time down ~30%. Shipped a real-time 2D try-on in a single week that replaced a third-party vendor and saved ~$0.8M/year in licensing. Dynamic True Fit: animated glasses tracking on a live feed.\nReal-time frame segmentation and removal, running on-device at the edge.\nAI eyeglass generator: product variations rendered for try-on.\nMyOperator (formerly VoiceTree): Founding Engineer → Director of Technology 2012 – 2022. Founding engineer at one of India\u0026rsquo;s fastest-growing cloud-telephony companies; grew from the first customer to 10,000+ paying customers.\nTen years, two products built from scratch, and the architecture education of a lifetime.\nOwned end-to-end architecture for a high-availability platform spanning 100+ servers at peak, sustaining a 99.9% uptime SLA. Created CODAC, a phone-number-verification product for e-commerce that peaked at 10M+ calls/day and generated ~$12M a year with a 3-person team: clients included Lenskart, Snapdeal, and Myntra. Distributed architecture, PHP → Python, MySQL, Redis/Memcached, 10+ telephony servers. Designed multi-server, load-balanced, fault-tolerant clusters with custom tooling for number distribution across servers. Led early cloud-telephony AI research: raw-audio emotion/anger recognition, audio keyword detection, voicebots, TTS/ASR foundations, and neural-net anomaly detection on time-series data. Led and grew a 30+ person engineering org with the lowest attrition across departments. Selected experiments \u0026amp; demos The stuff I build on weekends. 
Some of it became real products; most of it just taught me something.\nDigital Twin: fully automated talking-avatar news videos, end to end.\nSnapStitch: apparel catalogue photoshoots generated on-model.\nA few more, by link rather than embed:\nAI news automation: heatwave segment · markets segment · shorts version Apparel virtual try-on: demo Catalog photoshoots for t-shirts: demo Contact-lens try-on: demo Real-time frame removal, take two: demo And a few without a camera pointed at them:\nVideofarm: an API-driven layout engine that generates video programmatically; the engine under most of the automation above. Agentic development system built on Claude Code: autonomous, AI-driven software development I use on my own projects. Trained models: audio anomaly detection, keyword and anger detection in speech, and a custom text-to-speech voice. Home security on a Raspberry Pi: on-device detection, no cloud round-trip. Want the full history? My résumé is here, and the About page has the narrative version. Or just email me.\n","permalink":"https://ashishsaini.work/work/","summary":"\u003cp\u003eA walk through the things I\u0026rsquo;ve built: what the problem was, what I did, and what it actually moved. Most of the AI/AR work has video, so where I can, I\u0026rsquo;ll just show you.\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"comify-cto--co-founder\"\u003eComify: CTO \u0026amp; Co-founder\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003e2025 – present.\u003c/strong\u003e Intelligent, omnichannel customer-communication infrastructure for large consumer brands.\u003c/p\u003e\n\u003cp\u003eI co-founded Comify on a simple frustration: every brand spends enormous effort \u003cem\u003edelivering\u003c/em\u003e messages and almost none deciding whether a message should be sent at all. We\u0026rsquo;re building the layer that decides what to say, to whom, and when, then delivers it at scale, cheaply.\u003c/p\u003e","title":"Work"}]