AI datasets and VLAI model

cedric · June 10, 2025, 5:35am

Introduction

At CIRCL (Computer Incident Response Center Luxembourg), we faced the challenge of evaluating vulnerabilities with only partial information, often just a textual description.

To address this, we built an NLP model using the existing dataset from Vulnerability Lookup. The entire solution has now been released, including integration into the free online service and the open-source code. With this model, you can obtain the VLAI vulnerability score even when no existing score is available, by assessing severity based solely on the description.

Below, you’ll find the complete process we developed for the VLAI Severity models, which can be applied to many other use cases.

Datasets

Among the datasets we provide, a key one is dedicated to vulnerability scoring and features CPE data, CVSS scores, and detailed descriptions.

This dataset is updated daily.

Sources of the data:

CVE Program (enriched with data from vulnrichment and Fraunhofer FKIE)
GitHub Security Advisories
PySec advisories
CSAF Red Hat
CSAF Cisco

The licenses for each security advisory feed are listed here:

Get started with the dataset

import json
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("CIRCL/vulnerability-scores")

# Identifiers to look up (CVE, Red Hat, GitHub and PySec advisories)
vulnerabilities = {"CVE-2012-2339", "RHSA-2023:5964", "GHSA-7chm-34j8-4f22", "PYSEC-2024-225"}

# Keep only the rows whose id is in the set above
filtered_entries = dataset.filter(lambda elem: elem["id"] in vulnerabilities)

for entry in filtered_entries["train"]:
    print(json.dumps(entry, indent=4))

For each vulnerability, you will find all assigned severity scores and associated CPEs.

Models

How We Build Our VLAI Model

With the various vulnerability feeders of Vulnerability-Lookup (for the CVE Program, NVD, Fraunhofer FKIE, GHSA, PySec, CSAF sources, Japan Vulnerability Database, etc.),
we’ve collected over a million JSON records. This allows us to generate datasets for training and building models.

During our explorations, we realized that we can automatically update a BERT-based text classification model daily using a dataset of approximately 600k rows from Vulnerability-Lookup.
With powerful GPUs, it’s a matter of hours.

Models are generated on our own GPUs and with our various open source trainers.

Similar to the datasets, model updates are performed on a regular basis.

Text classification

vulnerability-severity-classification-roberta-base

This model is a fine-tuned version of RoBERTa base on the dataset
CIRCL/vulnerability-scores.

Generating the model takes approximately 6 hours on two NVIDIA L40S GPUs.

Try it with Python:

>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer
... import torch
... 
... labels = ["low", "medium", "high", "critical"]
... 
... model_name = "CIRCL/vulnerability-severity-classification-roberta-base"
... tokenizer = AutoTokenizer.from_pretrained(model_name)
... model = AutoModelForSequenceClassification.from_pretrained(model_name)
... model.eval()
... 
... test_description = "langchain_experimental 0.0.14 allows an attacker to bypass the CVE-2023-36258 fix and execute arbitrary code via the PALChain in the python exec method."
... inputs = tokenizer(test_description, return_tensors="pt", truncation=True, padding=True)
... 
... # Run inference
... with torch.no_grad():
...     outputs = model(**inputs)
...     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
... 
... 
... # Print results
... print("Predictions:", predictions)
... predicted_class = torch.argmax(predictions, dim=-1).item()
... print("Predicted severity:", labels[predicted_class])
... 
Predictions: tensor([[2.5910e-04, 2.1585e-03, 1.3680e-02, 9.8390e-01]])
Predicted severity: critical

The "critical" class gets a probability of about 98%.
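To read the whole distribution rather than only the top class, the probabilities can be paired with their labels. This is a minimal sketch in plain Python that reuses the values printed above; the helper name `rank_severities` is only illustrative and is not part of the model or of Vulnerability-Lookup.

```python
labels = ["low", "medium", "high", "critical"]

# Softmax output copied from the run above
probabilities = [2.5910e-04, 2.1585e-03, 1.3680e-02, 9.8390e-01]


def rank_severities(labels, probabilities):
    """Return (label, probability) pairs sorted from most to least likely."""
    return sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)


ranked = rank_severities(labels, probabilities)
for label, prob in ranked:
    # Show each class with its probability as a percentage
    print(f"{label:<8} {prob:6.2%}")
```

Looking at the runner-up class as well can help spot borderline cases, for instance when "high" and "critical" are close to each other.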

Putting Our Models to Work in Vulnerability-Lookup

Models are loaded locally in the ML-Gateway to ensure minimal latency. All processing is done locally — no data is sent to Hugging Face servers.
We use the Hugging Face platform to share our datasets and models, as part of our commitment to open collaboration.

ML-Gateway implements a FastAPI-based local server designed to load one or more pre-trained
NLP models during startup and expose them through a clean, RESTful API for inference.

Clients interact with the server via dedicated HTTP endpoints corresponding to each loaded model. Additionally, the server automatically generates
comprehensive OpenAPI documentation that details the available endpoints, their expected input formats, and sample responses—making it easy to explore and integrate the services.

The ultimate goal is to enrich vulnerability data descriptions through the application of a suite of NLP models, providing direct benefits to Vulnerability-Lookup and supporting other related projects.

dymaxion · June 10, 2025, 5:09pm

@cedric
Huh. It might be interesting if it was possible to get the model to output a confidence interval, rather than just a single number.

cedric · June 10, 2025, 8:34pm

Sure, it’s possible! This is something we could simulate at inference time, so we would only have to update our ML-Gateway (and not the model) if we want to test this. That’s the good thing.

We are currently getting point estimates from the model: the predicted class (a class is already a range, but the range is “lost” during the generation of the model) and its associated probability from the softmax layer.
So we are getting point estimates for each class. This is maybe more visual if you look at the Space on Hugging Face (it is using the same model):

You can see the point estimates, per class.

Like most models, this model is deterministic at inference time because dropout is disabled. It always gives you the same point estimates.
So we could perform multiple passes at inference time in ML-Gateway, forcing dropout to stay enabled when computing the softmax. Then we need to keep all predictions in a list and use the distribution of predictions to compute a confidence interval. That would simulate this.
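The idea above (often called Monte Carlo dropout) can be sketched without any ML library. The example below is only an illustration under stated assumptions: the toy linear classifier, its `WEIGHTS`, the input `features`, the dropout rate and the number of passes are all made up, and none of this is ML-Gateway code. With the real model, one would instead keep the network's dropout layers active during inference (in PyTorch, `model.train()` enables them) and collect the softmax output of each pass.

```python
import math
import random

LABELS = ["low", "medium", "high", "critical"]

# Hypothetical weights of a toy linear classifier (4 classes x 4 features)
WEIGHTS = [
    [0.1, -0.2, 0.0, -0.5],
    [0.3, 0.1, -0.1, -0.2],
    [0.2, 0.4, 0.3, 0.1],
    [0.1, 0.6, 0.5, 0.9],
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(features, weights, drop_rate, rng):
    """One forward pass with dropout left on (inverted dropout on the inputs)."""
    keep = 1.0 - drop_rate
    dropped = [x / keep if rng.random() < keep else 0.0 for x in features]
    logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    return softmax(logits)


def percentile(sorted_values, q):
    """Percentile with linear interpolation, q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    below, above = math.floor(pos), math.ceil(pos)
    frac = pos - below
    return sorted_values[below] * (1 - frac) + sorted_values[above] * frac


def mc_dropout_intervals(features, weights, passes=200, drop_rate=0.1,
                         level=0.95, seed=0):
    """Run several stochastic passes and summarise each class as
    (mean probability, lower bound, upper bound)."""
    rng = random.Random(seed)
    runs = [stochastic_forward(features, weights, drop_rate, rng)
            for _ in range(passes)]
    tail = (1.0 - level) / 2
    summary = {}
    for i, label in enumerate(LABELS):
        samples = sorted(run[i] for run in runs)
        mean = sum(samples) / len(samples)
        summary[label] = (mean, percentile(samples, tail),
                          percentile(samples, 1 - tail))
    return summary


features = [0.2, 1.5, 0.7, 2.0]  # made-up input features
intervals = mc_dropout_intervals(features, WEIGHTS)
for label, (mean, low, high) in intervals.items():
    print(f"{label:<8} mean={mean:.3f}  95% interval=[{low:.3f}, {high:.3f}]")
```

The interval here is an empirical percentile range over the passes, which is one simple choice among several; a wider range for a class suggests the prediction is more sensitive to the dropout noise.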

gnyman · June 17, 2025, 4:52pm

@cedric this is interesting, I took a quick look and, for the few samples I tried, the VLAI was more in line with how I rated it than the CVSS

it would be interesting to do a bit more systematic testing to see if it agrees often enough to be useful with prioritising things

cedric · June 18, 2025, 5:15am

Same here. We ran various tests over the last few months and are quite satisfied with the classifications. I know that @adulau ran more tests programmatically with a script. Most of the time the results are quite close to the severity defined in the CVE, I think around 75 percent of the time. Sometimes the classification contradicts the one defined in the CVE, which may or may not make sense. It is not trivial to assess the results of this model, but for now we are really impressed.

adulau · June 18, 2025, 6:54am

Indeed, I wrote a very simple script that downloads the latest vulnerabilities, calculates the VLAI score for those missing a CVSS score, and compares them to the CVSS scores later assigned by the CNA/vendor.

High accuracy for the ‘High’ and ‘Critical’ categories. When the vulnerability descriptions are very short or unclear, the results can be a bit off but I imagine analysts face a similar pattern.
85% match rate with the final CVSS scores assigned by the vendor.
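For readers who want to run a similar comparison, here is a minimal sketch of how a match rate could be computed. It is not the script mentioned above: the sample pairs are invented, and comparing at the level of the CVSS v3.x qualitative rating (low, medium, high, critical) is an assumption about what "match" means.

```python
def cvss_to_severity(score):
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if score == 0.0:
        return "none"
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "medium"
    if score < 9.0:
        return "high"
    return "critical"


def match_rate(pairs):
    """Fraction of (predicted_label, final_cvss_score) pairs that agree."""
    if not pairs:
        return 0.0
    hits = sum(1 for predicted, score in pairs
               if predicted == cvss_to_severity(score))
    return hits / len(pairs)


# Made-up sample: (VLAI label, CVSS score assigned later)
sample = [
    ("critical", 9.8),
    ("high", 7.5),
    ("medium", 7.1),  # disagreement: 7.1 falls in the "high" band
    ("low", 3.1),
]

rate = match_rate(sample)
print(f"match rate: {rate:.0%}")
```

Breaking the same count down per class would show whether agreement is indeed stronger for "high" and "critical" than for the lower bands.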

I was not expecting such good results tbh.

adulau · August 22, 2025, 1:01pm

We did a preprint about the VLAI model.

mmclellan21 · October 31, 2025, 3:58pm

Question here: if organizations were to adopt this model to bridge the gap by proposing a CVSS score until NVD publishes theirs, what are the recommendations / communications when what the model predicted is higher / lower than the NVD score?

The use case here is helping clients that are bound by regulation to patch critical and high CVSS vulnerabilities within X period, by proposing a CVSS score when one does not exist yet. How can we communicate with them in instances where time / resources are taken up patching a vulnerability that was proposed as a high CVSS and turns out to be a medium?

adulau · October 31, 2025, 5:30pm

This is a very good question, and there are many strategies to address potentially conflicting severity scores. The list below is not exhaustive and could be expanded further:

When the VLAI score is clearly lower, reviewing the CVSS parameters may provide clarification. For example, the description might claim code execution, but the CVSS metric group clearly states that exploitation is remote or requires no additional steps. In such cases, CVSS metrics can reveal that the original description was downplaying the severity of the vulnerability. It should also be noted that CVSS scores can be calculated by individuals who are not the original submitters of the vulnerability.
When the VLAI score is clearly higher, reviewing external parameters such as sightings, the presence of a PoC, or additional third-party sources can help. We have seen original reporters give a strong evaluation of the severity, only for an external review to assign a much lower CVSS score later.

I suppose there are many strategies, but in any case a difference could be a trigger to investigate the third-party vulnerability information further rather than just trusting the publishing source.
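As a rough illustration of using a disagreement as a trigger, the sketch below compares two severity labels and returns a review hint based on the two cases listed above. The function name, the label ordering table and the wording of the hints are only illustrative; this is not a feature of Vulnerability-Lookup.

```python
# Ordinal ranking of the four severity classes
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def review_hint(vlai_label, cvss_label):
    """Suggest a follow-up when the VLAI and CVSS severities differ."""
    gap = SEVERITY_RANK[vlai_label] - SEVERITY_RANK[cvss_label]
    if gap == 0:
        return "match: no extra review triggered"
    if gap < 0:
        return "VLAI lower: review the CVSS metrics against the description"
    return "VLAI higher: check sightings, PoCs and third-party sources"


for vlai, cvss in [("high", "high"), ("medium", "critical"), ("critical", "medium")]:
    print(f"VLAI={vlai:<8} CVSS={cvss:<8} -> {review_hint(vlai, cvss)}")
```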