AI in Software Testing: The LLM Way
AI is transforming traditional software testing into a more efficient, accurate, and intelligent process. Advanced LLMs backed by machine learning algorithms can autonomously generate unit test cases, simulate user interactions, run the tests, and analyze the output to identify security vulnerabilities and other issues, improving the code with little to no human intervention.
The LLM Way
Since the introduction of LLMs, they have been widely used for testing and generating code. The most common approach is to hand the code to an LLM and ask it to review or refactor it; this is the traditional way. The newer, more efficient approach is to create agents with different roles, each with a backstory, a goal, and access to tools for testing the code.
LLM Model
The most crucial part of software testing with LLMs is selecting the model. There are tens of thousands of models out there, both paid and publicly available. Paid models such as Gemini and ChatGPT give good results, but they carry the risk of code being sent outside the development premises, which is a major concern for many companies. The solution is to back the agents with an open-source model such as IBM Granite, Mistral, or Llama 3, or a Mixture-of-Experts model like llama2-hermes-orca-platypus-wizardlm.
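For teams that want to keep code in-house, one option is to serve an open-source model locally and query it directly. The sketch below is a minimal illustration, assuming the model is hosted through Ollama with the Mistral model already pulled; the article does not prescribe a serving stack, so treat the setup and the snippet under review as assumptions.

```python
# Minimal sketch: asking a locally hosted open-source model to review code.
# Assumes Ollama is running locally with the "mistral" model pulled
# (`ollama pull mistral`), so no code leaves the development premises.
import ollama

code_snippet = '''
def divide(a, b):
    return a / b
'''

response = ollama.chat(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a code reviewer. Point out bugs and missing tests."},
        {"role": "user", "content": f"Review this function:\n{code_snippet}"},
    ],
)
print(response["message"]["content"])
```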
What is an Agent?
An agent is an autonomous unit programmed to:
- Perform tasks
- Make decisions
- Communicate with other agents
Think of an agent as a member of a development team, with specific skills and a particular job to do. Agents can have different roles like ‘Manager’, ‘Developer’, or ‘Tester’, each contributing to the overall goal of the crew.
Agent Attributes
| Attribute | Description |
| --- | --- |
| Role | Defines the agent's function within the crew and determines the kind of tasks the agent is best suited for. |
| Goal | The individual objective the agent aims to achieve; it guides the agent's decision-making process. |
| Backstory | Provides context for the agent's role and what it is good at. |
| LLM | The large language model used by the agent. Defaults to open-source models such as Mistral MoE or IBM Granite. |
| Tools | The set of capabilities or functions the agent can use to perform tasks. Tools can be shared or exclusive to specific agents and are set when the agent is initialized. |
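Putting these attributes together, here is a minimal sketch of how one of the agents described below could be declared. It assumes a CrewAI-style framework (the "crew" terminology above points to one) and an illustrative local model identifier; the exact parameter values are assumptions, not a fixed specification.

```python
# Sketch: declaring an agent with the attributes from the table (CrewAI-style API assumed).
from crewai import Agent

test_case_generator = Agent(
    role="Test Case Generator",   # the agent's function within the crew
    goal="Write unit tests that cover every branch of the code under review",
    backstory=(
        "A meticulous QA engineer who has written unit tests for large "
        "codebases and never lets an edge case slip through."
    ),
    llm="ollama/mistral",   # assumed locally served open-source model
    tools=[],               # execution or search tools can be attached here
    verbose=True,
)
```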
The first step in building an autonomous software testing system is creating the agents. We can create different agents that talk to each other to solve a particular task. For software testing we will create agents with roles such as 'Test Case Generator', 'Test Executor', 'Result Analyzer', 'Developer', and 'Manager'.
Test Case Generator
This agent is responsible for writing unit test cases for the code. Its backstory can describe its skill at creating good unit test cases that cover every aspect of testing.
Test Executor
This agent runs the test cases. It has access to tools that can execute the software just as a human would, so test cases run without human intervention.
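As a concrete illustration, such a tool can be an ordinary function that runs the test suite in a subprocess and captures the output for the next agent. The sketch below uses pytest; the function name and arguments are illustrative, and in an agent framework it would be registered as a tool so the Test Executor can call it autonomously.

```python
# Sketch of a code-execution tool for the Test Executor agent.
# Runs pytest in a subprocess and returns the captured output so the
# Result Analyzer agent can inspect it.
import subprocess

def run_pytest(test_dir: str, timeout: int = 300) -> str:
    """Run pytest on the given directory and return the combined output."""
    result = subprocess.run(
        ["pytest", test_dir, "-q", "--tb=short"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return f"exit code: {result.returncode}\n{result.stdout}\n{result.stderr}"
```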
Result Analyzer
This agent analyzes the execution results produced by the Test Executor and passes them to the Manager agent.
Manager
This agent turns the analyzed results into an actionable report with solutions to the problems found. Its backstory can describe a seasoned engineer who has been coding for years and never fails to find the optimal solution to the problems it comes across. It may have access to internet tools for searching for solutions and summarizing the results.
Developer
The Manager agent passes the detailed report, along with the proposed solutions, to the Developer agent, which could be a backend or frontend developer depending on the task. The Developer refactors the code based on the fixes and passes the new code back to the Test Case Generator agent. Once the software has been tested thoroughly over multiple iterations, the Manager decides whether the reported defects have been corrected. There is also a default iteration cutoff so the execution does not end up in an infinite loop.
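The loop described above could be wired together roughly as follows. This is a sketch assuming the same CrewAI-style API as the earlier example; the task descriptions, the file name app.py, the stopping check, and the cutoff value of 5 are illustrative choices, not part of the article.

```python
# Sketch: wiring the five agents into a crew with a default iteration cutoff.
from crewai import Agent, Task, Crew

MAX_ITERATIONS = 5  # default cutoff so the generate-test-fix loop cannot run forever

def build_crew(code: str) -> Crew:
    # llm/tools omitted for brevity; they would be set as in the earlier sketch.
    generator = Agent(role="Test Case Generator",
                      goal="Write unit tests covering the given code",
                      backstory="A QA engineer who never misses an edge case.")
    executor = Agent(role="Test Executor",
                     goal="Run the generated tests and capture the results",
                     backstory="An automation specialist with access to execution tools.")
    analyzer = Agent(role="Result Analyzer",
                     goal="Summarize failures and likely root causes",
                     backstory="A debugging expert who reads stack traces fluently.")
    manager = Agent(role="Manager",
                    goal="Turn the analysis into an actionable report with fixes",
                    backstory="A veteran engineer who always finds the optimal solution.")
    developer = Agent(role="Developer",
                      goal="Refactor the code according to the Manager's report",
                      backstory="A backend developer who writes clean, tested code.")
    tasks = [
        Task(description=f"Generate unit tests for:\n{code}",
             expected_output="A pytest test file", agent=generator),
        Task(description="Execute the generated tests",
             expected_output="Raw test output", agent=executor),
        Task(description="Analyze the test output and list defects",
             expected_output="A list of defects", agent=analyzer),
        Task(description="Produce a report with proposed fixes",
             expected_output="An actionable defect report", agent=manager),
        Task(description="Refactor the code based on the report",
             expected_output="The corrected code", agent=developer),
    ]
    return Crew(agents=[generator, executor, analyzer, manager, developer], tasks=tasks)

code_under_test = open("app.py").read()  # hypothetical file under test
for iteration in range(MAX_ITERATIONS):
    result = build_crew(code_under_test).kickoff()
    # In practice the Manager's verdict would be parsed here; stop once no defects remain.
    if "no defects" in str(result).lower():
        break
```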
Benefits
- Improved accuracy: AI testing tools can learn what normal behavior looks like and then spot anything that does not fit, with higher accuracy than human reviewers.
- Adaptability: AI systems have the capacity to adapt and evolve over time, continuously learning from past testing experiences and incorporating new insights into their testing strategies through reinforcement learning.
- Ease of use: AI tools can be integrated into the development workflow, which provides feedback directly within code repositories or development environments.
- Time Effective: Manual code review is time-consuming, whereas an AI-based code review system can analyze code much faster and provide instant feedback with high accuracy.
- Cost Effective: Traditional code review can be expensive, whereas an AI-based code review system provides instant feedback at a fraction of the cost of a manual review.
Real world examples and use cases
- InApp AICoder: An internal tool developed by InApp Information Technologies for code refactoring, unit test case generation, and checking code for security vulnerabilities.
- Facebook’s Sapienz: Facebook uses an AI-powered tool called Sapienz for automated software testing at scale.
- Google’s DeepMind for Game Testing: Google uses its AI technology, DeepMind, for testing games.
- IBM Watson: IBM Watson uses AI to automate software testing processes, enabling faster and more accurate testing.
- Testsigma: Testsigma is a cloud-based continuous testing tool that uses NLP for test case creation and an AI-powered core for maintenance of all automated test cases.
- Mabl: Mabl employs AI for autonomous testing, creating and executing end-to-end tests that evolve alongside the application.
- Appvance.ai: Appvance.ai offers AI-driven testing solutions that generate thousands of test variations and data combinations to test applications thoroughly.
Challenges
Data Quality issue
LLM-based testing systems rely on the quality of the code dataset used to train the model. Non-standard code, buggy code, and untested code can lead to prediction issues and unreliable test results. The data also has to cover all scenarios and use cases, and maintaining and curating such high-quality data is a difficult process.
Model context length
Most open-source models have a context length of up to 32k tokens, which means we can only analyze roughly 1,600 lines of code at a time, and even that comes with challenges such as the 'lost in the middle' phenomenon. The best results from an open-source model came from Mistral, which has a 32k context length but performs well only for code under about 600 lines. Other models such as IBM Granite and llama2-hermes-orca-platypus-wizardlm showed promising results, but only for code under 500 and 200 lines respectively.
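The 1,600-line figure follows from assuming roughly 20 tokens per line of code (32,000 ÷ 20 ≈ 1,600). When a file exceeds the window, one simple workaround is to split it into chunks that fit before sending it to the model, as sketched below; the per-line token estimate and the reserved prompt budget are assumptions.

```python
# Sketch: splitting a large source file into chunks that fit a model's context window.
CONTEXT_TOKENS = 32_000
TOKENS_PER_LINE = 20     # rough average; measure with the model's tokenizer in practice
PROMPT_BUDGET = 2_000    # tokens reserved for instructions and the model's reply

def chunk_source(path: str) -> list[str]:
    """Split a source file into line-based chunks that fit the context window."""
    max_lines = (CONTEXT_TOKENS - PROMPT_BUDGET) // TOKENS_PER_LINE
    with open(path) as f:
        lines = f.readlines()
    return ["".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]
```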
Algorithm Biases
Machine learning bias refers to models producing skewed results that reflect human biases in society, including social and economic bias and social inequality. Bias is mostly introduced through the initial training data.
Code complexity
This relates to the complexity of the code generated by AI. Sometimes the AI-generated code is far more complex than necessary even when simpler solutions exist, which creates challenges of trust and adoption among testing teams.
Future Scope
The integration of AI technologies into software testing is revolutionizing software development practices. AI testing is expected to become the new standard within the next few years, with forecasts projecting the AI-enabled testing market to grow from $736.8 million to $2.7 billion by 2030. As new models are released, the testing capabilities of agents keep improving, and this will help human testers boost their productivity and ease of work. Fingers crossed 🤞