SLMs on edge: A classifier approach for function calling
The game changed with transformer-based architectures such as Google’s BERT [3] in 2018 and OpenAI’s GPT [4], which showcased unprecedented language capabilities. However, their size and computational demands made them impractical for lightweight tasks, particularly in applications with hardware constraints.
Small Language Models (SLMs) [5] have recently emerged as a solution to this problem: they are optimized for efficiency and specificity. They excel in niche applications like classification, where inputs are mapped to predefined categories or functions. By training SLMs to recognize intents and execute the corresponding actions, we’ve unlocked a powerful and accessible tool for bridging user queries with technical operations, all while maintaining speed and precision.
This article explains one such application of SLMs in detail: function calling in real-time systems. As we continue to explore ways in which AI can improve our work and lives, one of the key things we need our AI models to do is work with other models and software, which requires them to be able to call and execute other programs. This is what function calling enables. Put simply, function calling means getting an AI model to execute a function.
Setting up the problem
Suppose you have a drone that is delivering groceries. It needs to make real-time decisions about navigation and may not always have internet connectivity, so some core decisions must be made on-device in real time. As an example, let’s assume these three functions need to be called by the SLM on the drone:
- find_current_location()
- calculate_fastest_route(destination_location)
- navigate_to_nearest_charging_station(current_location)
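For concreteness, a minimal sketch of these functions as Python stubs is shown below. The signatures and type hints are assumptions made for illustration; the real bodies would be implemented against the drone’s onboard navigation and battery systems.

```python
# Hypothetical stubs for the three drone functions (illustrative only).

def find_current_location() -> str:
    """Return the drone's current location, e.g. read from its GPS module."""
    raise NotImplementedError

def calculate_fastest_route(destination_location: str) -> list:
    """Return the fastest route (as a list of waypoints) to the destination."""
    raise NotImplementedError

def navigate_to_nearest_charging_station(current_location: str) -> None:
    """Fly to the charging station closest to the given location."""
    raise NotImplementedError
```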
As we see, there are two types of functions here:
- Functions that have parameters that need to be inferred from the user’s prompt. In this case, calculate_fastest_route.
- Functions that don’t have parameters that need to be inferred from the user’s prompt. In this case, find_current_location and navigate_to_nearest_charging_station. (Even though this requires the current location as an input, that would just be the output of the find_current_location function, thus it does not actually need to be inferred from the user’s prompt.)
Approach and methodology
This naturally suggests a two-step approach to solve this problem:
1. Classify the function name: Given the user’s prompt, pick the function from the list of available functions that best matches the intent of the prompt. In this step, the SLM effectively acts as a classifier. An example of how this can be done is shown below.
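The following is a minimal sketch of this step. The model choice, prompt template, and helper name are assumptions rather than the article’s exact code; it uses a Flan-T5 model from HuggingFace, the family discussed in the Results section.

```python
# Step 1 sketch: use an SLM as a classifier over the candidate function names.
from transformers import pipeline

slm = pipeline("text2text-generation", model="google/flan-t5-large")

def classify_function(user_prompt: str, functions: dict) -> str:
    # List every candidate function with a short description and ask the
    # model to reply with exactly one function name.
    options = "\n".join(f"- {name}: {desc}" for name, desc in functions.items())
    prompt = (
        "You control a delivery drone. Given the user request, reply with "
        "the single function name from the list that best fulfils it.\n"
        f"Functions:\n{options}\n"
        f"User request: {user_prompt}\n"
        "Function name:"
    )
    return slm(prompt, max_new_tokens=20)[0]["generated_text"].strip()
```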
And the list of functions is defined as follows:
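A sketch of how this list might be represented is shown here; the descriptions attached to each function name are illustrative assumptions.

```python
# Candidate functions mapped to short natural-language descriptions.
FUNCTIONS = {
    "find_current_location": "Report the drone's current location.",
    "calculate_fastest_route": "Plan the fastest route to a destination named in the request.",
    "navigate_to_nearest_charging_station": "Fly to the charging station nearest the drone.",
}

# Example from the dataset described later: this should return "find_current_location".
print(classify_function("Which area are you in?", FUNCTIONS))
```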
2. Extract the parameters: Once the function has been identified, we go back to the user’s prompt and extract the parameter(s) that function needs.
The system prompt needed for this step is more complex, as shown below:
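One plausible formulation is sketched below; the exact wording is an assumption rather than the article’s original prompt.

```python
# A possible system prompt for step 2 (parameter extraction); illustrative wording.
EXTRACTION_PROMPT = (
    "You control a delivery drone. The function "
    "calculate_fastest_route(destination_location) has been selected.\n"
    "Read the user request and reply with only the destination location it "
    "mentions, copied word for word from the request. Do not add anything else.\n"
    "User request: {user_prompt}\n"
    "Destination location:"
)
```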
The code needed to extract the parameters can be written as follows:
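A minimal sketch of the extraction call and the match checks discussed next is given below; the helper names are assumptions, and the code reuses the slm pipeline defined earlier.

```python
# Step 2 sketch: extract the parameter, then score it against the expected value.

def extract_parameter(user_prompt: str) -> str:
    prompt = EXTRACTION_PROMPT.format(user_prompt=user_prompt)
    return slm(prompt, max_new_tokens=30)[0]["generated_text"].strip()

def exact_match(predicted: str, expected: str) -> bool:
    return predicted.strip().lower() == expected.strip().lower()

def fuzzy_match(predicted: str, expected: str) -> bool:
    # Fuzzy match: the expected parameter appears somewhere in the model's answer.
    return expected.strip().lower() in predicted.strip().lower()
```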
As illustrated, we can check for either an exact match or a fuzzy match on the extracted parameter (useful when there are spelling mistakes or simply different ways of writing a location). In our case, we define a fuzzy match as the expected parameter appearing as a substring of the SLM’s answer.
Combining the results of these two steps gives us the complete function call along with its parameters, as sketched below.
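Here is a sketch of the end-to-end dispatch; the routing logic is an illustrative assumption built from the helpers and stubs defined above.

```python
# Combine step 1 and step 2: classify, extract the parameter if needed, dispatch.

def handle_request(user_prompt: str):
    function_name = classify_function(user_prompt, FUNCTIONS)
    if function_name == "calculate_fastest_route":
        destination = extract_parameter(user_prompt)
        return calculate_fastest_route(destination)
    if function_name == "navigate_to_nearest_charging_station":
        # The current location comes from find_current_location, not the prompt.
        return navigate_to_nearest_charging_station(find_current_location())
    return find_current_location()
```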
Data
To test these functions, we create two datasets. The first is for the classifier approach to identify the three functions. For each function, we use GPT-4.5 to generate 20 prompts that correspond to “realistic” things a person might ask of a drone. As an example, for the find_current_location function, one of the prompts is “Which area are you in?” The full list of prompts can be found .
Similarly, to test the parameter extraction step, we create a dataset using GPT-4.5, this time specifically for the calculate_fastest_route function, since it takes the location as an input parameter. For each prompt, we also record the expected parameter, as shown in the example below:
"prompt": "Navigate to the drone launch site in Phoenix as quickly as possible.",
"extracted_parameter": "drone launch site in Phoenix"
The full list of prompts and parameters used can be found .
Results
For identifying the function name, we start by trying a broad range of openly available models on HuggingFace, including the famed DeepSeek base model.
We see that the best-performing models are from Google’s Flan-T5 series. We can take these further to check whether they are also capable of parameter extraction. The large model gives very good results out of the box and, with a bit of prompt engineering, could likely get even better.
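A sketch of the evaluation loop used for this kind of comparison is shown below. The dataset format and model list are assumptions; the single test example shown is the one quoted from the dataset above, and the FUNCTIONS dictionary is reused from earlier.

```python
# Compare models on the function-classification dataset by simple accuracy.
from transformers import pipeline

test_set = [
    {"prompt": "Which area are you in?", "function": "find_current_location"},
    # ... remaining generated prompts
]

def classification_accuracy(model_name: str) -> float:
    clf = pipeline("text2text-generation", model=model_name)
    options = "\n".join(f"- {n}: {d}" for n, d in FUNCTIONS.items())
    correct = 0
    for example in test_set:
        prompt = (
            "Reply with the single function name from the list that best "
            f"fulfils the request.\nFunctions:\n{options}\n"
            f"User request: {example['prompt']}\nFunction name:"
        )
        answer = clf(prompt, max_new_tokens=20)[0]["generated_text"].strip()
        correct += int(answer == example["function"])
    return correct / len(test_set)

for name in ["google/flan-t5-base", "google/flan-t5-large", "google/flan-t5-xl"]:
    print(name, classification_accuracy(name))
```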
Conclusions and future outlook
As this article has aimed to show, SLMs provide a powerful tool for building function calling capabilities and can hence be part of larger systems that work together to create workflows and autonomous systems. Out of the box, there are already models that work very well in this space, for example Google’s Flan-T5 series of models.
Interpreting the results
The results of this analysis indicate that Small Language Models (SLMs) like Flan-T5 can be effectively employed for function calling tasks, particularly when approached from a classifier-based perspective. Models such as Flan-T5-large and Flan-T5-xl achieved high accuracy for function classification (98 percent and 92 percent, respectively) and for parameter extraction. This success is largely attributable to the architecture of these models, which incorporates a full encoder-decoder framework that allows for richer contextual understanding and sequence-to-sequence training. This structure is particularly advantageous for instruction-following tasks, which aligns well with the requirements of function calling. The classifier-based approach achieves high performance by leveraging pre-trained models already fine-tuned on a diverse set of instructions, such as the Flan models [6]. However, it is inherently limited by the predefined list of functions and requires structured prompts for effective parameter extraction, which can introduce errors if the prompt is too ambiguous or varied.
An alternative to the classifier-based approach is the direct fine-tuning of an SLM such as Flan-T5. Unlike the classifier approach, fine-tuning involves training the model on a task-specific dataset, optimizing it for both function identification and parameter extraction. This method offers the advantage of customizing the model to achieve high accuracy for specific applications. However, this approach is costly and time-consuming, requiring substantial computational resources and well-curated datasets. Additionally, once fine-tuned, the model’s performance is highly dependent on the training data, limiting its flexibility if the set of functions changes or expands.
Another intriguing direction involves the use of masked language models (MLMs) [7]. This approach utilizes a masked head mechanism to predict function names or other relevant outputs by filling in masked tokens within a prompt. Unlike classification or fine-tuning, this approach is particularly appealing for zero-shot or few-shot learning scenarios where explicit training is not feasible or efficient. The architecture allows for generalization to unseen tasks by relying on the model’s existing pre-training rather than task-specific fine-tuning. However, the approach’s generalization capabilities come at the expense of precision and consistency, particularly when handling complex function calling tasks that require high specificity. Additionally, the reliance on masked token prediction introduces ambiguity and may require architectural adjustments or prompt engineering to achieve satisfactory performance.
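To make the idea concrete, here is a small sketch using HuggingFace’s fill-mask pipeline. The model and prompt are illustrative assumptions, and a single mask predicts only one token, so a practical system would map each function to a single-token label or score multi-token names separately.

```python
# Masked-head sketch: ask an MLM to fill in the function to call.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
prompt = (
    "The user said: 'Which area are you in?'. "
    "The drone should therefore call the [MASK] function."
)
for prediction in fill(prompt, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```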
Recent advancements have sought to enhance the robustness of function calling in language models through various adaptations of the masked head approach. The Hammer model family, for instance, employs an augmented dataset that increases sensitivity to irrelevant functions while incorporating function masking to reduce overfitting to specific naming conventions. This technique has demonstrated improved generalization across diverse benchmarks, outperforming larger models in function-calling tasks by minimizing dependency on rigid naming structures [8].
Additionally, the ADC framework aims to enhance function calling via adversarial datasets and code line-level feedback. By fine-tuning models with high-quality code datasets and providing granular supervision, ADC improves logical reasoning and adherence to function formats, resulting in better performance in complex function-calling scenarios. This approach highlights the potential for masked language models to achieve higher accuracy by integrating feedback loops and dataset augmentation during training [9].
These approaches represent different strategies for addressing the function calling problem. The classifier-based method strikes a balance between accuracy and efficiency, making it a suitable choice for most applications. Fine-tuning offers the highest potential accuracy but at a substantial cost in terms of training time and computational resources. The masked head approach is the most flexible, allowing for zero-shot learning, but struggles with precision when compared to the other methods.
Hardware considerations
When it comes to deploying these models on hardware, it is worth looking across a range of possibilities for different model sizes. A summary of potential hardware options for the different models is shown below:
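As a rough way to gauge which hardware class each model needs, the weight memory can be estimated from the parameter count. The sketch below uses approximate public parameter counts for the Flan-T5 models (an assumption) and ignores activations, caches, and runtime overhead.

```python
# Back-of-the-envelope weight memory: parameters x bytes per weight.
PARAM_COUNTS = {          # approximate parameter counts
    "flan-t5-base": 0.25e9,
    "flan-t5-large": 0.78e9,
    "flan-t5-xl": 3.0e9,
}

for model, n_params in PARAM_COUNTS.items():
    for dtype, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
        print(f"{model} ({dtype}): ~{n_params * nbytes / 1e9:.1f} GB of weights")
```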
In summary, this shows there are realistic options for function calling using readily available hardware and openly available SLMs that could be deployed at the edge, for example on autonomous drones.
The code for running the analysis of this article can be found .
I would like to thank for many discussions and suggestions that helped improve this.
Darsh Kodwani is on .
References
[1]
[2]
[3] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” 2018. arXiv:1810.04805.
[4] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I., “Improving Language Understanding by Generative Pre-Training,” OpenAI, 2018.
[5]
[6] Chung, H. W., et al., “Scaling Instruction-Finetuned Language Models,” 2022. arXiv:2210.11416.
[7]
[8] “Hammer: Robust Function-Calling for On-Device Language Models via Function Masking,” 2024.
[9] “ADC: Enhancing Function Calling via Adversarial Datasets and Code Line-Level Feedback,” 2024.