Enterprise AI has a File Data Problem
Key Takeaways
- Enterprise GenAI projects rarely fail because of the model. They fail because the underlying file data is fragmented, ungoverned, and not AI-ready.
- Hybrid cloud storage and enterprise file services are not solved problems, especially for AI workloads in regulated industries.
- Semantic enrichment at the file system layer, not just vector embeddings, is what turns unstructured enterprise content into a governed, queryable knowledge base for GenAI.
- CTERA’s Intelligent Data Platform makes file data across data centers, edge, and cloud accessible to AI in place, with security and governance intact.
In 2020, I transitioned away from leading marketing for the hybrid, edge, and data migration portion of the AWS Storage portfolio, which included AWS Storage Gateway. By then, I’d spent decades working with or inside data management and storage firms like EqualLogic, Dell, Actifio, and then AWS, and I was ready for something different, something a little more bleeding edge. So, I went to work building a team to support emerging technologies at AWS.
Quantum computing was just blowing up. Robotics test/dev was getting a lot of attention, both inside Amazon and out (e.g., iRobot). Customers were building geospatial capabilities into apps everywhere: on the back end for asset and delivery tracking and fraud detection, and in front-end UIs.
As I moved toward the “new new,” I assumed that hybrid cloud storage and enterprise file services were essentially solved problems. The industry had been working on them for years. There was traction, real enterprise use cases, and lots of options, and I assumed the rough edges would be worked out. Plus, all the data was going to move to the cloud, right? Time to move on.
Today, I know that assumption was wrong.
What changed my mind wasn’t abstract, and it wasn’t a revelation I had after joining CTERA. In my last role at AWS, my team spent a few years working closely with Fortune 500 organizations, building and applying new cloud technologies and AI to real enterprise workflows. This started with betas, PoCs, and pilots, but eventually progressed to real production deployments in regulated, mission-critical industries.
What I consistently found was that the new tech (whether an LLM or some new knowledge graph) wasn’t the bottleneck. The data was.
Specifically, the challenge was getting access to current, curated, and trustworthy data to feed into compute engines, sophisticated analytical systems, or AI-powered applications. In some cases, this involved structured databases that had to merge outputs from systems of record with observed realities in the field, or fact-check values against physical engineering impossibilities.
But as my conversations with customers shifted towards Generative AI (GenAI) and business workflows, unstructured file data showed up everywhere.
It came into businesses through PDFs, Word docs, and even hand-drawn diagrams. It was spread across on-premises systems, file shares, desktops, note-taking apps, edge locations, and, of course, cloud storage. It was all meant to be governed by internal policies (at a minimum), if not legal and regulatory requirements.
In large, regulated enterprises, these are the sorts of constraints that don’t disappear just because someone promises innovation and efficiency gains. Those hard realities break the old “move fast and break things” Silicon Valley ethos.
Getting data aggregated, auditable, TRUSTED, and usable (often in the cloud) for AI workflows is actually a hard problem.
For many enterprises, and for AI use cases that truly matter to the business, hybrid file services at scale are not solved, especially for AI. Not even close.
Why Enterprise File Data Is Still Unstructured Chaos
There’s a version of the enterprise AI story you see a lot on LinkedIn or in conference talks: clean data lakes, smooth pipelines, incrementally polished data products, and frontier LLMs delivering insight at scale.
That story is real for some organizations. For most, though, the data reality is messier, if not ugly.
File data is still how many enterprises create, share, and store information that runs their businesses. The applications and devices that generate it still speak SMB and NFS. That’s not going to change. Windows apps and users are here to stay, and many machines (think life science labs) insist on writing to an NFS share. There’s simply no way around it.
What has accelerated is the pressure on organizations to make that data work harder: for AI, analytics, governance, and security.
Ransomware, phishing attacks, and other security threats have become more sophisticated. Regulatory requirements around data classification and auditability have become stricter. Yet the AI use cases with the highest business value — the ones that require reasoning over real enterprise content, in context, with governance — require a file data foundation that most organizations still haven’t been able to build effectively.
That’s the problem CTERA is solving. And it’s a bigger market opportunity than most of the industry realizes.
Why I Joined CTERA: Solving Hybrid File Services for AI
In short, I got a call and listened.
Cheryle Cushion, CTERA’s SVP of Marketing, and I worked together 20 years ago at EqualLogic, a storage startup that had genuinely great technology and earned a loyal enterprise customer base before being acquired by Dell.
Cheryle is one of the best marketing leaders I’ve worked with, and she knows how to build a machine from scratch. Succeeding in startup life certainly takes luck, but your odds improve dramatically when you have people like Cheryle on your team.
So, when she called, I paid attention and did some research.
What I found was a company that had been quietly doing hard things for years. CTERA’s Intelligent Data Platform delivers a global file system across on-premises data centers, edge sites, and multiple clouds (including private clouds) — with enterprise-grade security, active ransomware protection, and data governance built directly into the architecture.
OK, great, that sounds useful. I clearly believe in hybrid cloud storage, but I also understand why some customers prefer a cost-effective object storage back end for their on-premises file systems. CTERA covers all of those use cases.
Turning Unstructured Files into AI-Ready Data with Semantic Enrichment
But what I found interesting about CTERA were the investments the company was making on top of its (really cool) global file system to make customer data more accessible and trustworthy for AI workloads.
For example, CTERA Classify performs semantic enrichment across file content to make unstructured data actionable for AI. That includes:
- Semantic entity extraction: extracting people, companies, and dates
- Semantic extraction: identifying topics and concepts
- Summary generation: producing concise representations of file content
- Metadata augmentation: generating tags, labels, and attributes attached to files
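To make those four outputs concrete, here is a deliberately simple sketch of what an enriched file record could look like. It is not CTERA Classify's implementation or schema: the `EnrichedFile` record, the regular expressions, and the keyword lists are all hypothetical stand-ins for what a real system would do with trained models.

```python
import re
from dataclasses import dataclass, field


@dataclass
class EnrichedFile:
    """Hypothetical metadata record attached to one file (illustrative only)."""
    path: str
    entities: dict = field(default_factory=dict)  # entity extraction
    topics: list = field(default_factory=list)    # topics and concepts
    summary: str = ""                             # summary generation
    tags: list = field(default_factory=list)      # metadata augmentation


# Toy patterns standing in for real entity-recognition models (assumptions).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
COMPANY_RE = re.compile(r"\b[A-Z][A-Za-z]+ (?:Inc|Corp|LLC|Ltd)\b")

# Toy topic vocabulary; a real system would not rely on keyword lists.
TOPIC_KEYWORDS = {
    "contract": ["agreement", "termination", "renewal"],
    "finance": ["invoice", "payment", "revenue"],
}


def enrich(path: str, text: str) -> EnrichedFile:
    """Build an enriched record from raw file text."""
    lowered = text.lower()
    entities = {
        "dates": sorted(set(DATE_RE.findall(text))),
        "companies": sorted(set(COMPANY_RE.findall(text))),
    }
    topics = sorted(
        topic for topic, words in TOPIC_KEYWORDS.items()
        if any(word in lowered for word in words)
    )
    # Naive "summary": the text up to the first period.
    summary = text.strip().split(".")[0].strip()
    tags = topics + [f"company:{name}" for name in entities["companies"]]
    return EnrichedFile(path, entities, topics, summary, tags)


record = enrich(
    "/legal/acme_renewal.txt",
    "Acme Corp signed a renewal agreement on 2024-03-01. Payment is due 2024-04-15.",
)
print(record.tags)
```

The point of the sketch is the shape of the result: once entities, topics, a summary, and tags travel with the file as metadata, downstream AI systems can query them directly instead of re-reading the raw content.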
Semantic Enrichment & Vector Embeddings for Retrieval-Augmented Generation (RAG)
How does this relate to vectors, which have been discussed much more widely? Vector similarity can find related content. Semantic enrichment and extraction can find specific content within the context of your organization. Applied at the file system layer, semantic enrichment turns unstructured content into a governed, queryable knowledge base without moving the data or rebuilding the pipeline. And in production RAG (Retrieval-Augmented Generation) systems, it should be combined with vector embeddings.
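A small sketch shows why the two work better together. Everything here is an assumption made for illustration: the three-dimensional embeddings are made up, the `INDEX` layout and function names are hypothetical, and a production RAG system would use a real embedding model and a vector store rather than a Python list.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy index: each file carries an embedding plus enriched metadata.
INDEX = [
    {"path": "/legal/acme_msa.pdf", "companies": {"Acme Corp"},
     "embedding": [0.9, 0.1, 0.0]},
    {"path": "/legal/globex_msa.pdf", "companies": {"Globex Inc"},
     "embedding": [0.95, 0.05, 0.0]},
    {"path": "/hr/acme_picnic.docx", "companies": {"Acme Corp"},
     "embedding": [0.0, 0.2, 0.9]},
]


def vector_search(query_embedding, index, top_k=1):
    """Rank every file by similarity alone."""
    ranked = sorted(index, key=lambda d: cosine(query_embedding, d["embedding"]),
                    reverse=True)
    return [d["path"] for d in ranked[:top_k]]


def hybrid_search(query_embedding, company, index, top_k=1):
    """Filter on an extracted entity first, then rank by similarity."""
    candidates = [d for d in index if company in d["companies"]]
    return vector_search(query_embedding, candidates, top_k)


contract_query = [1.0, 0.0, 0.0]  # pretend embedding of "master services agreement"
print(vector_search(contract_query, INDEX))
print(hybrid_search(contract_query, "Acme Corp", INDEX))
```

With these toy numbers, similarity alone returns the Globex contract because it happens to sit closest to the query, while the entity filter narrows the search to Acme files before ranking and returns the Acme contract. That is the difference between finding related content and finding the specific content you asked about.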
When I thought about the value that CTERA’s large enterprise customers in financial services, government, and manufacturing can get by connecting their file data – in place – to LLMs with governance and security, it sounded like a real opportunity. For me, solving hard problems and delivering measurable value for enterprise customers is meaningful work.
Organizations in those industries don’t give their trust easily. Earning their business, and keeping it, is proof of something real.
GenAI Will Force Every Enterprise to Fix Its Data Foundation
The GenAI and agentic conversation is becoming much more grounded in the messy realities of data and business workflows. The early narrative, focused almost entirely on GPUs, training, and frontier models, is giving way to harder questions about data quality, governance, and the actual enterprise data that GenAI systems and agents need.
Enterprises that build a solid data foundation by understanding, classifying, and structuring their unstructured data will have better options for applying AI successfully and getting real value from it.
That’s a major reason why I’m at CTERA.
FAQs: Enterprise AI and File Data
- Why do most enterprise AI projects fail?
Most enterprise AI projects fail because of unprepared file data, not the model. Unstructured content sits across on-premises systems, edge locations, and clouds without the classification, governance, or AI-ready metadata that GenAI systems require.
- What is the file data problem in GenAI?
GenAI requires current, curated, and trustworthy data. Enterprise file data lives in PDFs, Word docs, file shares, and edge systems governed by strict internal and regulatory policies. Making it accessible to LLMs in place, with governance intact, is the file data problem.
- What is semantic enrichment?
Semantic enrichment is the process of extracting entities (people, companies, dates), topics, concepts, and summaries from unstructured files and attaching them as metadata so the content becomes queryable by AI systems and agents.
- How is semantic enrichment different from vector embeddings?
Vector embeddings find content similar to a query. Semantic enrichment finds specific entities and concepts within the context of your organization. Production RAG systems use both together.
- What is a global file system?
A global file system unifies file storage across data centers, edge sites, and multiple clouds into a single namespace, with consistent security, governance, and access regardless of where the data physically lives.
Dylan, Vice President of Product Marketing at CTERA, brings more than 25 years of experience in enterprise technology marketing and GTM leadership to the company. He has helped scale unicorn startups such as EqualLogic and Actifio, and has driven multiples of growth for products and services inside giants such as Dell and Amazon Web Services.
Prior to joining CTERA, Dylan led go-to-market efforts, technical business development, and alliance management for early-stage businesses in AWS Applied AI Solutions and AWS Industry Products. He has also led product marketing for key hybrid cloud and edge services, including AWS Storage Gateway, DataSync, and the Snowball family. Dylan holds an MBA from the Johnson School at Cornell University, and a BA in International Relations and French from Tufts University.