CrewAI: Practical lessons learned

15 min read · May 26, 2025

There is a lot of buzz about CrewAI in my social bubble. The descriptions range from the excited “This is magic and will solve all our problems” to the equally excited “Wow, I will just tell him, and he will do anything”.

For those not familiar with it — CrewAI is described as the “Leading Multi-Agent Platform”. An agent, in CrewAI’s understanding, is an AI entity designed to perform specific roles and tasks. I would add the word “autonomously”, though that may set expectations too high. Practically, you define the agent using a role, goal and backstory in plain English — e.g.:

agent:
  role: "SaaS Metrics Specialist focusing on growth-stage startups"
  goal: "Identify actionable insights from business data that can directly impact customer retention and revenue growth"
  backstory: "With 10+ years analyzing SaaS business models, you've developed a keen eye for the metrics that truly matter for sustainable growth. You've helped numerous companies identify the leverage points that turned around their business trajectory. You believe in connecting data to specific, actionable recommendations rather than general observations."

Then you specify a task — e.g.: “Get the best insights from all the metrics we have”, hit the Run button, put your feet on the table and grab the paycheck at the end of the day.
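
To make that concrete, a task is defined in the same plain-English YAML style. Here is a minimal sketch of what such a tasks.yaml entry could look like (the task name and agent name are made up for illustration):

business_insights_task:
  description: >
    Get the best insights from all the metrics we have.
  expected_output: >
    A short list of actionable insights, each tied to the specific metric it comes from.
  agent: saas_metrics_specialist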

So how does it work in my reality?

The Use-case

I work as a software engineer, and I was looking for a practical use-case for CrewAI in my work. This proved more difficult than it looked.

The CrewAI course mentions a couple of use-cases — for example, creating agents to polish your resume and get a better job. Well, that seems weird to me — why would I do that? Surely it can be done, but it can be done by any current AI chat thingy. CrewAI is a tool that can run repeatedly, from the command line, with no chat interface. Why would I use that to polish my resume when I can do the same in a Claude Desktop chat, having a back-and-forth conversation about it?

Surely, setting up agents for this task would be useful if I wanted to generate dozens of resumes, but I don’t need that. Just because something can be done with the help of someToolAI™ doesn’t necessarily mean that it is a good idea to do so. Yeah, you can ride a scooter down the stairs, but your teeth will appreciate it if you don’t, even though you read about it on X/Twitter this morning.

Anyway, I tried a couple of ideas, most of which proved too difficult to even define with enough precision. I tried some mundane tasks like “update library XY to a new version”, and although it seems straightforward to me, it’s composed of so many hidden steps that my CrewAI attempts got completely lost. Though these ideas were unsuccessful, they were valuable in shaping a better use-case. So, I can recommend that exercise to everyone — try to use the tool. If it doesn’t work, it will still help you shape how it can be used.

The most successful idea was generating example configurations. The platform I work on is composed of different components (1000+), each component being configured by some complex JSON configuration. I have a database of all existing configurations of all components and want to create samples that can be used for few-shot learning. A single configuration JSON for one component looks like this:

{
  "id": "1244507422",
  "name": "Opportunity CSV Download",
  "component": "HTTP Extractor",
  "description": "Downloads the opportunity.csv file for demonstration purposes.",
  ... snip ~20 lines ...
  "configuration": {
    "parameters": {
      "baseUrl": "https:\/\/help.keboola.com"
    },
    "runtime": {
      "uiMode": "default"
    }
  },
  "rows": [
    {
      ... snip ~20 lines ...
      "configuration": {
        "parameters": {
          "path": "\/tutorial\/opportunity.csv"
        },
        "processors": {
          "after": [
            {
              "definition": {
                "component": "keboola.processor-move-files"
              },
              "parameters": {
                "direction": "tables",
                "folder": "opportunity-data"
              }
            },
            ... snip 20 lines ...
          ]
        }
      }
    }
  ]
}

From this JSON, the interesting information (actually the only information supplied by the end-user) is the values “https://help.keboola.com/” and “tutorial/opportunity.csv”. The rest of the JSON is somehow derived from this information. I could probably say that every configuration for this component (HTTP Extractor) will have a baseUrl property and a path property nested in rows.configuration.parameters. In probably 80% of the configurations there will also be a keboola.processor-move-files instance. And I can go on with such rules. And all that applies only to a single component out of the 1000+ existing ones.

The JSON configurations are very flexible, supporting many configuration attributes. However, to create an initial version of a configuration, only a few values are typically required. My idea was to take the existing configurations and generate example configurations for each component. Ideally these would represent “average” use-cases of the given component. For example, the above JSON would be “CSV file from HTTP”, then I could have another one for “archive of multiple CSV files from HTTP”, etc.

So much for the setup — how did it go?

The Good: It works!

Yes, it does. My year 2020 brain is very excited about this. I chose a use-case, created a few agents for it, specified a few tasks, and I got the use-case solved. While writing almost no code.

Let’s back up a little bit. Like I said before, this use-case was about the third one I tried with CrewAI. So, by this time I had gone through most of the docs, some experiments, and the DeepLearning.AI course. It even got me a neat little certificate:

Certificate of Wizardry Level 3

The course is neatly done, and if you know nothing about AI agents, I highly recommend it, because it gives a nice general overview of the concepts and capabilities. If you are familiar with the concept of agents, it’s mostly a waste of time — with one exception (I’ll get back to it later). The CrewAI docs are high quality too; even if you are not particularly interested in using CrewAI, they are worth browsing through. They give plenty of practical advice. I can especially recommend reading the First Flow page about 10 times.

So I started with a crew of agents:

  • database_analyst — to pick up the correct data from the database,
  • configuration_analyst — to analyze the existing configurations,
  • training_data_specialist — to create meaningful and representative examples of the configurations,
  • QA_specialist — to ensure that the examples for each component are as required,
  • prompt_creator — to craft an output prompt with few-shot examples that I would feed into another system.

Each agent had one respective task, corresponding to the above — so select_source_configurations, analyze_component_configurations, generate_configuration_examples, validate_examples, create_prompt.

I ran the crew, and I almost got a result that I could use. I didn’t have any specific requirements for the output; I initially envisioned that the result would be a markdown-formatted prompt for each component with instructions and few-shot examples on how to create the JSON configuration. Which I almost got.

Sunshine and Roses?

My year 2025 brain is much less excited about this, because with the advent of reasoning LLMs, it seems obvious that agents should work. But how difficult would it be?

I initially tested my crew on a sample database containing configurations of only one component. My initial enthusiasm evaporated quite quickly when I tried running it repeatedly and on multiple components (meaning wildly different configuration JSONs).

Source data

The first problem was the amount of source data: the source database has about 7 million rows and low hundreds of gigabytes. Even a single configuration can in extreme cases be several megabytes, although the median is in the low hundreds of kilobytes.

Executing (and fetching) SELECT * from the source table leads nowhere, because the result will overflow the LLM context window. I tried to instruct the agent to first select all component IDs, and only then select the configurations of each component. The LLM (and my crew) will not always follow the instruction (yes, we all know that), meaning that it will either crash (if it hits the context window) or return something else (e.g. descriptions instead of configurations).

To avoid all this, I created custom tools list_component_ids and get_configurations_for_component_id. That helped a lot, but didn’t really solve the problem. The database_analyst agent would still sometimes run the get_configurations_for_component_id tool first, with a made-up component ID. Another idea I had was to split the selection task select_source_configurations into two tasks: provide_component_ids and select_configurations. It helped somewhat, but the issue just moved one level up. The agent is not obliged to run the corresponding tools and can still answer the tasks without them. It would easily skip provide_component_ids, make up its own component IDs, select no configurations for them (because they don’t exist) and then make up the results too.
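
For context, a custom CrewAI tool is essentially a small class with a name, a description and a _run method. A minimal sketch of how get_configurations_for_component_id could be shaped (the query_configurations helper and the class name are hypothetical stand-ins, not my actual implementation):

from typing import Type

from crewai.tools import BaseTool
from pydantic import BaseModel, Field


def query_configurations(component_id: str) -> list[str]:
    """Hypothetical database helper; replace with a real query against the configurations table."""
    raise NotImplementedError


class GetConfigurationsInput(BaseModel):
    component_id: str = Field(..., description="ID of the component to fetch configurations for")


class GetConfigurationsForComponentIdTool(BaseTool):
    name: str = "get_configurations_for_component_id"
    description: str = (
        "Returns existing JSON configurations of a single component. "
        "Call list_component_ids first to obtain valid component IDs."
    )
    args_schema: Type[BaseModel] = GetConfigurationsInput

    def _run(self, component_id: str) -> str:
        configurations = query_configurations(component_id)
        if not configurations:
            # A helpful error message gives the LLM a chance to correct itself
            return f"Invalid component id: '{component_id}'. Use list_component_ids to get valid IDs."
        return "\n".join(configurations)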

Guiding the agents

Here comes the top trick — and I’d say the most valuable advice from the DeepLearning.AI course — use task context. The context is both a cool and a disappointing feature. On one hand, it provides a clear path from the results of one task to another task. On the other hand, it largely removes the autonomy of the decisions (because one task is always assigned to one agent). This may be the reason why the CrewAI description is somewhat toned down in this matter. And finally I realized that I’m organizing the tasks into a DAG, because the context marks a clear dependency between the tasks, like so:

provide_component_ids:
  agent: database_analyst

select_configurations:
  agent: database_analyst
  context: ['provide_component_ids']

Setting context was the first thing that helped tremendously. Now, it’s probably time to set expectations — until this point, I was able to use my crew to process up to 2 components (approximately 15 configurations out of the 200k+). With properly configured context, I got to about 10 components (and ~100 configurations). It’s a huge improvement, but at the same time it’s only a small fraction of what I needed.

The task context is a cool feature in that it nicely constrains the reasoning model to some sensible boundary (do A first, because B requires A). On the other hand, it takes away the magic, because you must do the hard work of designing the system.

With task context added, though, I saw that the CrewAI tool is clearly capable of solving the task, but it was still hugely unreliable. At this moment I turned my attention to CrewAI Flows, which allow combining the CrewAI reasoning with some event-like execution behavior using the start and listen decorators.

from crewai.flow.flow import Flow, listen, start


class SchemaFlow(Flow[SchemaState]):
    @start()
    def retrieve_component_ids(self):
        print("Retrieving component IDs")
        self.state.component_ids = get_component_ids()

    @listen(retrieve_component_ids)
    def generate_sample_jsons(self):
        for component_id in self.state.component_ids:
            print(f"Generating example JSONs for component ID {component_id}")
            result = (
                JsonCrew().crew().kickoff(inputs={"component_id": component_id})
            )

Instead of relying on the agents to pick up and process all the data, I supply only the relevant piece of data to the crew. At the same time, I created a custom tool GetConfigurationsTool that takes a component_id and returns the source configurations for it. Using flows is what took my use-case from 1% to 90%.

Context window

While solving the source data problem, I started hitting another big issue: if one agent crashes (due to overflowing the model context window), the whole crew continues to work, which means that it may run the agent again — especially if it is responsible for a task that is a predecessor of another task. Running the agent again will likely crash again, entering a loop of doom. CrewAI has built-in loop detection, so it will eventually stop repeating itself, but it can easily take tens of minutes.

The other option is that the crew will just completely hallucinate the task result, or it will state that it cannot get the real configurations and will imagine how a JSON configuration might look.

I could have solved it by using a different model with a larger context window, but I would hit the limit with all the configurations anyway. So, I took a different approach. I modified the GetConfigurationsTool tool to return only as many configurations as fit in the model context window. The agent is instructed to call the tool again to request more configurations if it didn’t produce the requested number of samples. This sometimes works and sometimes doesn’t (meaning that fewer than the required number of samples is produced), but it certainly solves the context window problem.
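
The core of the change is trivial: cap the returned payload and tell the model explicitly that more data exists. A sketch of the idea (the character budget, names and offset mechanism are illustrative, not the actual code):

MAX_CHARS = 60_000  # rough budget chosen to stay well below the model's context window


def fetch_configurations_page(configurations: list[str], offset: int = 0) -> str:
    """Return as many configurations as fit into the budget, starting at the given offset."""
    chunk, used = [], 0
    for configuration in configurations[offset:]:
        if used + len(configuration) > MAX_CHARS:
            remaining = len(configurations) - offset - len(chunk)
            chunk.append(
                f"NOTE: {remaining} more configurations exist. "
                f"Call the tool again with offset={offset + len(chunk)} to get them."
            )
            break
        chunk.append(configuration)
        used += len(configuration)
    return "\n".join(chunk)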

Ensuring Correct Output

I ditched the QA_specialist and subsequently the prompt_creator agents completely. They somewhat worked, but not well enough. The QA specialist sometimes “returned” the result back to the crew (e.g. because not enough samples were generated or because the output format was not right), but the second try rarely fixed the issue. I attribute this to using memory, where the crew had a tendency to return to the “proven” solution rather than starting from scratch, but I haven’t tested this hypothesis. Also, the QA agent itself was unreliable (that’s inherent to LLMs, no surprise). Once I switched to the CrewAI flow, I easily replaced the QA agent with a simple piece of code wrapped around running the crew.

max_attempts = 5
attempt = 1
while attempt <= max_attempts:
    result = (
        JsonCrew()
        .crew()
        .kickoff(inputs={"component_id": component_id, "number_of_samples": 10})
    )

    # Check if the result is a valid JSONL file
    print(f"Example JSONs generated (attempt {attempt}/{max_attempts})", result.raw)

    # Validate the generated JSONL file
    validation_result = CheckOutputTool()._run(result.raw)
    if validation_result.startswith("Success"):
        self.state.sample_jsons[component_id] = result.raw
        break
    else:
        if attempt == max_attempts:
            print(f"Failed to generate valid JSONL for {component_id} after {max_attempts} attempts")
            self.state.sample_jsons[component_id] = result.raw  # Store the last attempt anyway
        attempt += 1

The same goes for the prompt_creator agent — I realized that all the resulting samples (or the prompts generated from them) should in fact be in the same format, so I replaced the agent+task with a simple f-string template.
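
Roughly like this; a minimal sketch of the template with illustrative wording and variable names:

def build_prompt(component_name: str, samples: list[str]) -> str:
    """Assemble the few-shot prompt for one component from its example configurations."""
    examples_block = "\n\n".join(samples)
    return f"""You are creating a configuration for the component "{component_name}".

Below are example configurations representing typical use-cases of this component.
Use them as few-shot examples when generating a new configuration.

{examples_block}
"""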

Later I realized that I needed to add a sanitization agent. While the configurations do not directly contain any personal information or secrets, it is sometimes possible that an email can be extracted from them (e.g. as part of an account name). Also, a configuration can sometimes reference specific company names (e.g. “This configuration gets data for subsidiary A”), so I wanted to replace these with nonsense names like “Company” or “Contoso”. The data sanitization specialist worked surprisingly well — in about 5000 output examples I found only about 15 items that still had to be removed.

Memory

The CrewAI docs strongly suggest turning on memory. I had an interesting experience with it. When I used a plain crew, I wouldn’t be afraid to say that it didn’t work at all without memory. Once I switched to flows, memory doesn’t seem to make any noticeable difference. Which is sort of understandable — the way I run them in the loop, every component is processed individually, so there isn’t much room for memory. In fact, when I was still using just a crew without a flow, I found that sometimes the result was better without memory. Not in terms of operation (that didn’t work at all) but in terms of results. With memory, the configuration samples would sometimes get mixed up between different components — e.g. the agent would assume that the configuration of component B has a baseUrl property, because component A’s configuration had it.
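
For reference, memory is a single switch on the Crew itself. A minimal, self-contained sketch showing where it lives (the agent and task here are placeholders, not the ones from this post):

from crewai import Agent, Crew, Process, Task

analyst = Agent(
    role="Configuration analyst",
    goal="Analyze JSON configurations of platform components",
    backstory="You analyze component configurations and summarize their structure.",
)
analyze = Task(
    description="Analyze the supplied configurations.",
    expected_output="A short structural analysis.",
    agent=analyst,
)
crew = Crew(
    agents=[analyst],
    tasks=[analyze],
    process=Process.sequential,
    memory=True,  # flip to False to disable the crew's memory
)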

Lessons Learned

Well, it’s been an interesting journey, but it definitely helped me to learn a couple of new things and remember some old things. Here’s a list in no particular order.

Take the effort to provide correct inputs. In the end this is an ML task like any other, and the quality of the outputs is a function of the quality of the inputs. In my case the initial SELECT query grew from a simple SELECT config_json FROM configurations to a 100+ line beast, because it had to exclude broken configurations, unfinished configurations, configurations for one specific component that doesn’t follow the pattern of the 99% others, etc. Cleaning up the data fixed a great number of issues.

Tools need to be specific. Yes, you can use a generic SQL tool to select data from a database, and yes, it works very badly. The problem doesn’t lie only in the amount of data or the incorrect inputs I described above. The LLM reasoning model (or the crew) can eventually deal with these — it will use the generic query tool, then realize it doesn’t contain specific configurations, update the query, or filter the results, or do both, and eventually it will probably find the correct input data. The trouble is that it will take many iterations, and while I may not care about the speed, every iteration means a non-zero risk of error, so the output quality deteriorates quickly. This may not be a big issue when working with interactive tools like Claude Desktop or Cursor, where I can spot and correct the error after every ad-hoc interaction. With CrewAI, which is supposed to represent a repeatable/reusable system, it’s much more important.

Tools must be reliable. I cannot emphasize this enough. The no-code/low-code approach leads me to the “just hack this tool” mindset, and it is a path straight to hell. The LLM won’t tell you that the tool is returning bad data; it will either take convoluted paths (error-prone — see above) to work around the problem or will just make up the results. The time invested in writing tool tests pays off quickly.
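
Even a couple of plain pytest checks against the tool’s _run method go a long way. A sketch, assuming the hypothetical GetConfigurationsForComponentIdTool from the earlier sketch is wired to a test database:

import json


def test_returns_one_json_configuration_per_line():
    result = GetConfigurationsForComponentIdTool()._run(component_id="some.component-id")
    lines = [line for line in result.splitlines() if line.strip()]
    assert lines, "tool returned no configurations"
    for line in lines:
        json.loads(line)  # every returned line must be valid JSON


def test_unknown_component_id_returns_helpful_error():
    result = GetConfigurationsForComponentIdTool()._run(component_id="does-not-exist")
    assert result.startswith("Invalid component id"), result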

The task order matters. Obviously. What’s not so obvious is that the task order matters in the Crew class, not in the YAML file. I wish this were implemented differently in CrewAI. I.e. swapping the two tasks below breaks everything (and not in an immediately obvious way — the LLM will hallucinate the required data).

from crewai import Task
from crewai.project import CrewBase, task


@CrewBase
class JsonCrew:
    agents_config = "config/agents.yaml"
    tasks_config = "config/tasks.yaml"

    @task
    def analyze_configurations(self) -> Task:
        return Task(
            config=self.tasks_config["analyze_configurations"],
        )

    @task
    def sanitize_configurations(self) -> Task:
        return Task(
            config=self.tasks_config["sanitize_configurations"],
        )

Unless…

Context matters. I wish this were emphasized more in the CrewAI docs. It makes a huge difference to tell the reasoning model the dependencies between the steps. It makes things more reliable and much faster.

QA agent sucks. I like the idea, but it doesn’t serve the purpose very well. If I want the results to match certain conditions, then I write deterministic checks. If I cannot write deterministic checks, I use the LLM to generate properties on which I can write deterministic checks. I.e. let the LLM generate e.g. tags, but don’t let it check whether the number of tags is correct.
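
In my case that boiled down to a plain validation function in the spirit of the CheckOutputTool used in the flow above. A sketch with an illustrative expected format (one JSON sample per line):

import json


def validate_samples(raw_output: str, required_samples: int) -> str:
    """Deterministic check: at least required_samples lines, each a valid JSON document."""
    lines = [line for line in raw_output.splitlines() if line.strip()]
    if len(lines) < required_samples:
        return f"Error: expected at least {required_samples} samples, got {len(lines)}"
    for i, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            return f"Error: sample {i} is not valid JSON: {exc}"
    return f"Success: {len(lines)} valid samples"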

Propagate tool errors. The LLM makes mistakes; if the tool returns reasonable errors, it’s often possible to auto-correct those. Sometimes it’s real trivialities — e.g. passing "myComponentId (with a stray quote) instead of myComponentId. Every tool should give helpful error messages (Invalid component id: '"myComponentId') instead of stupid ones (component not found) or no messages at all.
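
For example, a trivial guard inside the tool that produces an actionable message instead of a generic “component not found” (the ID format rule here is an assumption for illustration):

import re


def check_component_id(component_id: str) -> str | None:
    """Return an actionable error message, or None if the ID looks plausible."""
    if not re.fullmatch(r"[A-Za-z0-9._-]+", component_id):
        return (
            f"Invalid component id: '{component_id}'. Component IDs contain only letters, "
            "digits, dots, dashes and underscores. Did you include stray quotes?"
        )
    return None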

Beware of the model context size. The issue here is that the LLM has no way to tell that a context overflow occurred; it will just see it as a failed step. Also, it has no way to back up or guess that it should change the tool parameters. This needs to be built into the tools — limit the result size and, if the limit is reached, let the LLM know that there is more unreturned data. Then it can deal with it. All simple tools like file reading, directory listing, or querying a database suffer from this.

Debugging is pain. Yes, debugging a crew (or any LLM with function calls, for that matter) is a pain. One thing is the unreliability (or randomness, if you wish) of the LLM, but I can deal with that. In my view, the more serious deficiency is not being able to write unit tests. In other words — it is difficult to test it per partes (and no, CrewAI testing does not help).

LLM models are interchangeable. I tried running with GPT-4o and Gemini 2.5 Pro. In my case 4o was subjectively a lot faster, with fewer weird errors and timeouts. I couldn’t tell any difference in the quality of the results. Surely they were different, but since the desired output was a set of examples, I didn’t see any set of examples particularly better or worse than the other; they were all OK.

Conclusion

The resulting crew is available on GitHub. My final verdict and thoughts are mixed. CrewAI is definitely a very interesting tool/framework. The documentation is high quality, and the introductory course is also well made. The framework is surprisingly mature, meaning that it mostly works as it should. The first impression is amazing.

On the other hand, in all of the use-cases I tried, I ended up spending hours analysing the logs and results, trying to make it work repeatably. It is also incredibly slow (in my case one run takes 5–10 minutes), which is mostly due to the planning and thinking process. But once I want it repeatable and resort to defining the flow and the task DAG using the task context, there is much less need for the “magic” crew thinking. In the end, by the time I had the crew working, I would have had the use-case implemented like 10 times if I had used the LLM API directly.

Then again, I’m assuming that CrewAI is supposed to be a no-code/low-code tool and I’m totally not their target persona. My sentiment towards no-code/low-code tools is generally not favourable (I remember the age of “Someone made this magic Excel macro 8 years ago and today cell $AG$234 is wrong”). Yet I have to say that the CrewAI approach is really intriguing in that it allows a smooth transition from no-code (just define agents + tasks) via low-code (add some tools) to code (write a flow, rewrite the flow to call the LLM directly). I’m curious how it will evolve.

P.S. The entire experiment cost $414 on Gemini.
