Financial AI Agents Are Moving Beyond Chatbots, but Memory…

Financial AI agents are no longer being judged only by whether they can answer a market question. The harder test is whether they can participate in a workflow over time without losing context, repeating the same mistakes or forcing users to manually rebuild the agent’s memory at every step.

That is the broader issue raised by a new arXiv paper authored by Ailiya Borjigin, Igor Stadnyk, Ben Bilski, Maksym Chikita, Dmytro Kyrylenko, Sofiia Pidturkina and Julia Stadnyk, researchers and engineers affiliated with TRUE AI and Inc4.net. The paper proposes one possible technical framing for this problem, called an interaction-native knowledge harness, or InKH, but the more important industry question is wider than any single architecture: what does it take for AI agents to become reliable participants in financial workflows?

FinanceFeeds spoke with the authors about the problem financial firms are increasingly facing as AI moves from one-off question answering toward market analysis, portfolio review, copy-trading evaluation, trade preparation and operational decision support.

The key point is not that the paper proves live trading performance. It does not. Its reported results are based on a controlled synthetic benchmark, not real-money execution, investment returns, trading alpha or live market deployment. The paper is best read as infrastructure research: an attempt to measure how financial AI agents handle context, memory, stale information, latency and auditability.

The user is still doing too much of the agent’s work

The first problem is simple: many financial AI agents still behave like chatbots with tools attached. A user asks a question. The agent answers. Then, when the workflow continues later, the user often has to repeat the same goals, constraints, risk preferences, portfolio assumptions and prior judgments. The system may have access to a transcript, but it does not necessarily convert that transcript into a reliable operational state.

Ailiya Borjigin said this is where many financial AI systems fail before they reach anything resembling trading intelligence.

“Financial AI is often described as if the main challenge is prediction,” Borjigin said. “But in many real workflows, the first failure is memory. The system forgets what the user already explained, what the portfolio context was, what risk limits mattered and what assumptions were already challenged. When that happens, the user becomes the memory layer.”

That creates what the authors describe as financial cognition friction. The phrase refers to the repeated mental work imposed on users when a system cannot preserve useful context across turns, sessions or workflow stages.

In consumer chat, that friction is inconvenient. In finance, it can become material. A portfolio review may depend on constraints stated earlier. A copy-trading assessment may depend on prior concerns about drawdown, leverage or market regime. A trade-preparation workflow may depend on whether the user is exploring, confirming or risk-checking. If the agent loses that context, it may still produce a fluent answer. The danger is that the answer can sound coherent while being built on incomplete or outdated assumptions.

Financial agents need state, not just conversation history

Igor Stadnyk said firms should distinguish between storing conversation history and maintaining financial state.

“A transcript is not the same thing as memory,” Stadnyk said. “Financial workflows need structured state: user preferences, market assumptions, risk constraints, portfolio facts, trader observations, evidence and timestamps. The question is not only what was said before. The question is what remains valid now.”

That distinction matters as AI agents begin to move closer to operational workflows. A simple market summary can tolerate some repetition. A workflow that reviews a trader, prepares an order summary or flags portfolio risk cannot rely only on the user remembering to restate every constraint.

The authors argue that financial AI agents should continuously transform interaction traces into structured and auditable knowledge. In practical terms, that means a system should be able to identify relevant entities, track what was learned, understand when information was last validated and suppress knowledge that has been superseded. The issue is not whether an agent can have “more memory.” The issue is whether the memory is usable, current and governed.
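To make the difference between a transcript and structured state concrete, the following is a minimal Python sketch. It is not the paper's InKH design: the `StateItem` schema, its field names and the 30-day validity window are all assumptions chosen for illustration, loosely following the elements Stadnyk lists (entities, evidence, timestamps, supersession).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class StateItem:
    """One piece of structured financial state (hypothetical schema)."""
    entity: str                 # e.g. "portfolio:main" or "market:EURUSD"
    kind: str                   # e.g. "risk_constraint", "market_assumption"
    statement: str              # the content in plain language
    evidence: str               # where it came from (turn, document, feed)
    learned_at: datetime
    last_validated_at: datetime
    superseded_by: Optional[str] = None   # set when newer knowledge replaces it


def still_valid(item: StateItem, now: datetime, max_age: timedelta) -> bool:
    """An item counts as valid only if it is not superseded and was validated recently."""
    if item.superseded_by is not None:
        return False
    return now - item.last_validated_at <= max_age


now = datetime(2025, 1, 10)
limit = StateItem("portfolio:main", "risk_constraint", "Max leverage 2x",
                  "user turn, session 1", datetime(2025, 1, 2), datetime(2025, 1, 9))
view = StateItem("market:EURUSD", "market_assumption", "Low volatility regime",
                 "user turn, session 1", datetime(2024, 11, 1), datetime(2024, 11, 1))

# The window of 30 days is an arbitrary illustrative choice.
valid = [i.statement for i in (limit, view) if still_valid(i, now, timedelta(days=30))]
print(valid)  # ['Max leverage 2x']
```

Both statements would appear in a raw transcript. Only the recently validated one survives the question Stadnyk poses: what remains valid now.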

More memory can create more risk

One of the clearest messages from the authors is that memory is not automatically beneficial. A financial AI system that remembers too little creates repetition and context loss. But a system that remembers too much, without controls, can create a different problem: stale assumptions being reused with confidence.

Ben Bilski, Founder of TrueLabs, said this is why financial AI products should not treat long-term memory as a simple feature upgrade.

“Remembering everything is not the goal,” Bilski said. “In finance, old information can be dangerous because it often still sounds reasonable. A liquidity assumption, a trader profile or a market view may have been valid under one regime and wrong under another. The agent has to know what to use, what to question and what to stop using.”

This is one of the biggest differences between financial AI and general productivity AI. In finance, context is time-sensitive. Market assumptions decay. Risk conditions change. User preferences can change. A strategy that looked robust in one volatility environment may break in another.

That means the memory layer needs something closer to governance than storage. It needs confidence, maturity, evidence, timestamps and invalidation.
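Bilski's three outcomes (use, question, stop using) can be sketched as a small policy gate. This is an illustrative assumption rather than anything described in the paper: the `Maturity` levels, the `POLICY` thresholds and the workflow names are hypothetical, and confidence scores are assumed to be assigned by some upstream process.

```python
from dataclasses import dataclass
from enum import IntEnum


class Maturity(IntEnum):
    TENTATIVE = 1    # seen once, not yet confirmed
    CONFIRMED = 2    # confirmed by the user or by repeated evidence
    ESTABLISHED = 3  # confirmed and re-validated over time


@dataclass(frozen=True)
class Memory:
    statement: str
    confidence: float      # 0.0 to 1.0, assumed to be assigned upstream
    maturity: Maturity
    invalidated: bool = False


# Hypothetical thresholds per workflow: (minimum confidence, minimum maturity).
POLICY = {
    "exploration": (0.3, Maturity.TENTATIVE),
    "trade_preparation": (0.8, Maturity.CONFIRMED),
}


def decide(memory: Memory, workflow: str) -> str:
    """Return 'use', 'question' or 'drop' for a memory in a given workflow."""
    if memory.invalidated:
        return "drop"
    min_conf, min_maturity = POLICY[workflow]
    if memory.confidence >= min_conf and memory.maturity >= min_maturity:
        return "use"
    return "question"   # surface it to the user instead of relying on it silently


hunch = Memory("Trader X trades mostly during the London session", 0.5, Maturity.TENTATIVE)
print(decide(hunch, "exploration"))        # use
print(decide(hunch, "trade_preparation"))  # question
```

The same memory is acceptable in a low-stakes exploratory workflow and held back for confirmation in a higher-risk one, which is the sense in which the layer behaves like governance rather than storage.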

Stale memory is harder to detect than missing memory

Missing context is visible. The user notices when the agent asks the same question again. Stale context is more subtle.

Dmytro Kyrylenko said stale memory is dangerous because it often appears legitimate.

“Stale knowledge rarely announces itself as stale,” Kyrylenko said. “It may look like a perfectly valid statement from a previous workflow. That is the problem. In financial systems, the agent needs to know when new evidence supersedes old memory, not simply retrieve old memory because it is related.”

This is particularly important in shock-driven environments. A change in central bank expectations, liquidity conditions, crypto protocol status, broker execution quality or trader performance can make previous judgments unreliable.

An AI agent that cannot invalidate old assumptions may become worse over time. It accumulates information, but it also accumulates outdated beliefs. In a high-stakes workflow, that can lead to repeated errors that feel personalized and intelligent because they are based on the user’s own history.

The authors’ broader point is that financial AI should not only learn. It should also forget safely.
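One way to read "forget safely" is that superseded knowledge stops being retrievable but is not erased. The sketch below assumes a simple keyed store in which newer evidence about the same key retires the older belief while keeping it in history. `BeliefStore`, its methods and the key format are hypothetical names, not the authors' implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Belief:
    key: str                      # what the belief is about
    value: str
    observed_at: datetime
    superseded_at: Optional[datetime] = None


@dataclass
class BeliefStore:
    history: List[Belief] = field(default_factory=list)   # nothing is deleted
    active: Dict[str, Belief] = field(default_factory=dict)

    def record(self, key: str, value: str, observed_at: datetime) -> None:
        """Add new evidence and retire any older belief about the same key."""
        current = self.active.get(key)
        if current is not None:
            if observed_at <= current.observed_at:
                # Out-of-order evidence: keep it for review, never activate it.
                self.history.append(
                    Belief(key, value, observed_at, superseded_at=current.observed_at))
                return
            current.superseded_at = observed_at
        belief = Belief(key, value, observed_at)
        self.history.append(belief)
        self.active[key] = belief

    def recall(self, key: str) -> Optional[str]:
        """Return only the belief that has not been superseded."""
        belief = self.active.get(key)
        return belief.value if belief else None


store = BeliefStore()
key = "broker:B1/execution_quality"
store.record(key, "stable", datetime(2025, 1, 5))
store.record(key, "degraded under stress", datetime(2025, 2, 1))
print(store.recall(key))    # degraded under stress
print(len(store.history))   # 2
```

Retrieval by relatedness alone would return both statements. Here the older one remains available for review but can no longer reach the agent through `recall`.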

Auditability will separate demos from deployable systems

For many firms, the next stage of AI adoption will depend less on how impressive the interface looks and more on whether the system can be reviewed.

Sofiia Pidturkina said auditability is central to making AI usable in financial teams.

“Financial users do not only need an answer,” Pidturkina said. “They need to know what context shaped that answer, whether the context was current and whether the system relied on information that should no longer influence the workflow. Without that trail, the output is hard to trust operationally.”

This is where financial AI agents differ from consumer assistants. A broker, fintech platform, research desk or trading-technology provider may need to understand why an agent flagged a risk, why it viewed a trader differently after a drawdown or why it prepared a certain trade summary.

The output alone is not enough. The system needs an inspectable memory trail. That does not mean every detail has to be shown to the end user at all times. But the underlying system should be able to reconstruct what mattered, when it was learned and why it was considered valid.
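An inspectable memory trail can be as simple as a structured record written alongside each output. The sketch below is an assumption about what such a record might contain (what mattered, when it was learned, why it was considered valid). The field names and identifiers are hypothetical and do not come from the paper.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import List


@dataclass
class UsedMemory:
    memory_id: str
    statement: str
    learned_at: str          # ISO 8601 timestamp
    last_validated_at: str   # ISO 8601 timestamp
    reason: str              # why it was considered valid for this output


def audit_record(output_id: str, workflow: str, used: List[UsedMemory],
                 created_at: datetime) -> str:
    """Serialize what shaped one output so it can be reviewed later."""
    record = {
        "output_id": output_id,
        "workflow": workflow,
        "created_at": created_at.isoformat(),
        "memories_used": [asdict(m) for m in used],
    }
    return json.dumps(record, sort_keys=True)


line = audit_record(
    "out-1",
    "copy_trading_review",
    [UsedMemory("m-12", "User flagged drawdown above 15% as unacceptable",
                "2025-01-02T09:00:00+00:00", "2025-01-09T14:30:00+00:00",
                "re-confirmed by user in latest session")],
    datetime(2025, 1, 10, 8, 0, tzinfo=timezone.utc),
)

parsed = json.loads(line)
print(parsed["memories_used"][0]["memory_id"])  # m-12
```

One JSON line per output is a deliberate simplification. It keeps the trail out of the end user's way while letting a reviewer reconstruct the context later.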

Financial AI is becoming a workflow problem

The authors argue that the industry is moving from single-turn AI toward sustained financial workflows. That shift changes the evaluation criteria. A one-shot assistant can be judged mostly on answer quality. A workflow agent must also be judged on continuity, memory discipline, latency, cost, traceability and risk handling.

Maksym Chikita said the challenge is to make continuity efficient enough for practical use.

“If memory makes the agent slow, expensive or noisy, users will work around it,” Chikita said. “The system has to provide relevant context without turning every interaction into a retrieval project. Financial workflows need continuity, but that continuity has to feel lightweight.”

This is one reason the paper emphasizes compact, pre-assembled working context rather than asking the model to search through memory during every task. The broader concept is not product-specific: financial agents need the right information available before reasoning begins, without forcing the user or the model to reconstruct it from scratch.

In a broker or fintech environment, this could apply to several workflows: reviewing a client’s recurring preferences, checking whether a trader’s past behavior remains relevant, preparing a trade with known risk constraints, or analyzing a market event against previously stated assumptions.

The goal is not to make the AI autonomous by default. The goal is to reduce repeated cognitive work while keeping the human in control.
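The idea of a compact, pre-assembled working context can be illustrated with a small selection routine that runs before the model is called. This is a sketch under stated assumptions, not the paper's method: relevance scores and validity flags are assumed to be computed elsewhere, and a word count stands in crudely for a token budget.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContextItem:
    text: str
    relevance: float   # assumed to be scored upstream for the current task
    valid: bool        # False if stale or superseded


def assemble_context(items: List[ContextItem], budget_words: int) -> str:
    """Build a compact context block before reasoning begins.

    Items are considered in order of relevance. An item is skipped if it is
    invalid or if adding it would exceed the word budget.
    """
    chosen: List[str] = []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        cost = len(item.text.split())
        if not item.valid or used + cost > budget_words:
            continue
        chosen.append(item.text)
        used += cost
    return "\n".join(chosen)


items = [
    ContextItem("Max leverage 2x", 0.9, True),
    ContextItem("Low volatility regime assumed", 0.8, False),   # stale
    ContextItem("User is in risk-check mode", 0.7, True),
    ContextItem("Earlier summary of an unrelated market event", 0.2, True),
]

print(assemble_context(items, budget_words=8))
# Max leverage 2x
# User is in risk-check mode
```

Because selection happens once, ahead of the task, the model receives a bounded context and does not have to search memory on every interaction.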

The benchmark should be read carefully

The authors’ paper includes benchmark results, but those results should be interpreted narrowly. The evaluation uses a controlled synthetic benchmark with simulated workflows. It tests architecture-level behavior such as latency, token cost, memory reuse, stale-knowledge suppression and traceability. It does not test live trading profitability, real execution outcomes, alpha generation or regulatory deployment.

Julia Stadnyk said that distinction is important because financial AI results are often misunderstood.

“People naturally want to ask whether an AI system makes better trades,” Julia Stadnyk said. “That is not what this benchmark is designed to answer. It is designed to ask whether the agent handles memory, context and traceability better under controlled conditions. Those are necessary infrastructure questions, but they are not the same as live trading performance.”

The paper reports that the proposed design outperformed several baseline agent-memory setups in the synthetic test, particularly on stale-memory suppression and traceability. But the authors are careful to state that the reported results validate system behavior, not live market performance.

That limitation makes the research more useful, not less. It keeps the discussion focused on the part of financial AI that many firms are still underestimating: the operating layer between a model’s reasoning ability and a financial workflow’s practical requirements.

The real test is what happens after conditions change

A financial AI agent may look useful in stable conditions. The harder test is what happens after a shock.

Markets are full of regime changes. Volatility rises. Liquidity disappears. A trader’s behavior shifts. A portfolio constraint changes. A token, stock or macro assumption becomes outdated. A broker’s execution quality changes under stress. An agent that remembers well during stable periods but fails to update after shocks can become a liability.

Borjigin said this is why financial AI memory should be tested under changing conditions, not only on static tasks.

“Financial memory has to be dynamic,” she said. “A system should not treat every previous conclusion as a permanent fact. It needs to understand that some knowledge matures, some decays and some has to be invalidated when conditions change.”

That point may become increasingly relevant as firms use agents for monitoring and preparation rather than simple Q&A. A monitoring agent that cannot update its assumptions after new evidence is not truly monitoring. It is replaying old context.

Execution safety is not enough

The AI finance debate often focuses on execution safety: preventing unauthorized trades, enforcing constraints, blocking unsafe orders and maintaining logs at the point of action. The authors do not dismiss that layer. Instead, they argue that execution safety and cognition safety should be treated as separate but connected problems.

Igor Stadnyk said execution controls cannot fully compensate for a weak cognition layer.

“You can have strong execution guardrails and still have a poor upstream reasoning process,” he said. “If the agent is reasoning from stale context, the problem starts before execution. Financial AI needs safe action controls, but it also needs safe memory and safe context formation.”

This is an important point for institutions testing agentic systems. Guardrails at the execution layer may prevent the worst outcomes, but they do not ensure that the agent’s analysis, summaries or recommendations are based on valid context.

In workflows such as copy-trading review, investment research or risk preparation, the agent may influence human judgment even if it never places a trade. That means the cognition layer itself needs governance.

The industry should ask better procurement questions

For firms evaluating AI agents, the authors suggest that the most important questions may not be about model size or interface design. They are more practical: Does the system maintain structured context across sessions? Can it distinguish current knowledge from stale knowledge? Can it show which memory influenced an output? Can it prevent immature or low-confidence information from influencing high-risk workflows? Can it adapt when market conditions change?

Bilski said these questions are likely to become more important as AI products move from demos to deployment.

“The first wave of AI adoption was about whether the system could answer,” Bilski said. “The next wave is about whether it can participate in work. Participation requires continuity. It requires knowing what changed, what stayed valid and what should be escalated to a human.”

That is the core difference between an AI assistant and an AI workflow participant. The assistant answers a prompt. The workflow participant carries context, but must do so with discipline.

The memory layer may become part of compliance infrastructure

Financial firms already care about records, audit trails, suitability, supervision and risk controls. AI memory will not sit outside those concerns.

Pidturkina said memory governance should be designed with reviewability from the beginning.

“If an AI system influences a financial workflow, someone may eventually ask why,” she said. “Why did it flag this risk? Why did it rely on this trader history? Why did it ignore a previous assumption? If the memory layer cannot answer those questions, the system will be difficult to use in serious environments.”

This does not mean every AI memory system is a regulated recordkeeping system. But in financial contexts, memory architecture and governance architecture will increasingly overlap. A model may generate the final text, but the memory layer shapes what the model sees. That makes memory a source of both value and risk.

Human control still matters

The authors are not arguing for fully autonomous financial AI. Their emphasis is almost the opposite: systems should absorb complexity so humans can make better-supervised decisions.

Chikita said the goal is not to remove the user from the loop, but to stop wasting the user’s attention.

“The user should not have to repeat basic context just to make the agent useful,” Chikita said. “Human attention should go to judgment, exceptions and decisions, not to constantly reminding the system what it already learned.”

This framing may resonate with brokers and fintech platforms that want AI to improve productivity without handing control to opaque systems. A well-designed agent should reduce repetitive work while increasing inspectability.

That is a different message from the promotional idea of a black-box trading agent. It is closer to operational AI: a system that helps financial professionals manage context, risk and workflow continuity.

Why this matters for brokers and fintech platforms

Brokers, trading platforms and fintech infrastructure providers are likely to encounter this issue as they add AI features to client and internal workflows.

A customer-facing AI assistant may need to remember user preferences without making unsuitable assumptions. An internal operations agent may need to track recurring risk issues across support cases. A research assistant may need to preserve analyst judgments while updating them after new data. A copy-trading tool may need to distinguish between historical performance and current risk.

In each case, the challenge is not merely generating language. It is maintaining context over time.

Kyrylenko said the wrong memory architecture can make an agent appear smarter than it is.

“A fluent answer can hide a weak memory process,” he said. “That is why teams should test agents after context changes, after contradictions and after stale information is introduced. The important question is not only whether the answer sounds good. It is whether the system used the right context.”

That kind of testing may become a standard part of AI due diligence in financial services.
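The kind of test Kyrylenko describes can be sketched as a scenario check on the context an agent would use, rather than on how its answer reads. Everything here is a hypothetical illustration: `latest_wins` is a toy context builder standing in for a real system, and `check_after_change` is an assumed helper, not an existing tool.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, str]  # (key, value), in the order the agent would see them


def latest_wins(events: List[Event]) -> Dict[str, str]:
    """Toy context builder under test: the most recent value per key wins."""
    state: Dict[str, str] = {}
    for key, value in events:
        state[key] = value
    return state


def check_after_change(build: Callable[[List[Event]], Dict[str, str]],
                       events: List[Event],
                       must_contain: Dict[str, str],
                       must_not_contain: List[str]) -> List[str]:
    """Return a list of failures; an empty list means the scenario passed."""
    state = build(events)
    failures = []
    for key, expected in must_contain.items():
        if state.get(key) != expected:
            failures.append(f"{key}: expected {expected!r}, got {state.get(key)!r}")
    for stale in must_not_contain:
        if stale in state.values():
            failures.append(f"stale value still in context: {stale!r}")
    return failures


# Scenario: a market-regime assumption is contradicted by later evidence.
scenario = [
    ("risk_limit", "max drawdown 10%"),
    ("regime", "low volatility"),
    ("regime", "high volatility"),
]

failures = check_after_change(
    latest_wins,
    scenario,
    must_contain={"regime": "high volatility", "risk_limit": "max drawdown 10%"},
    must_not_contain=["low volatility"],
)
print(failures)  # []
```

A builder that kept the first value it saw for each key would fail the same scenario, even if the text it eventually produced sounded plausible.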

A small research paper with a larger industry question

The arXiv paper itself is a technical contribution, but the issue it raises is larger than the specific design it proposes.

Financial AI agents are being pulled in two directions. On one side, users want more continuity and personalization. On the other, financial workflows require caution, auditability and control. Memory sits directly between those two demands.

Too little memory, and the agent becomes repetitive and shallow. Too much unmanaged memory, and the agent may reuse stale assumptions. The useful middle ground is governed memory: context that is structured, current, bounded and reviewable.

Julia Stadnyk said the adoption question will come down to whether agents can operate responsibly across a chain of work.

“The market does not need more systems that sound confident for one turn,” she said. “It needs systems that can support a sequence of work, update their assumptions and show what shaped the output.”

That may be the clearest takeaway. Financial AI agents will not be judged only by their best answer in a clean demo. They will be judged by how they behave after repeated use, changing conditions and accumulated context.