The ML Techniques Behind DGA Domain Detection

Apr 26, 2022

From sophisticated targeted attacks to zero-day exploits, cyberspace is seeing new and suspicious kinds of communication, which makes network security a critical concern for malware analysts.

Modern malware such as botnets and ransomware uses the DNS service to communicate with a C2 (Command and Control) server for file transfers and software updates. An IP address hardcoded in the malware is easily detected and blocked by traditional security tools. To hide that IP and avoid being blocked, attackers use a DGA (Domain Generation Algorithm).

What is a DGA, and how does it work?

A Domain Generation Algorithm is a program that uses a sequence of random characters to generate a huge number of pseudo-random, non-existent domain names for the C2 server. When the domain the malware is currently using is detected and blocked during the attack cycle, the malware switches to another generated domain instead of starting the process over, which lets it evade blacklist-based blocking.

Seeds are the main parameter in generating DGA domains. There are two types of seeds: static and dynamic. Static seeds can be dictionary words, random characters, or numbers. Dynamic seeds change over time and can be Twitter hashtags, exchange rates, or the current date and time. Both the source and the destination know the seeds, so both sides generate the same domains, which lets the attackers open a communication channel for the malware.
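To make the role of seeds concrete, here is a minimal sketch of a toy generator. It is not the algorithm of any real malware family: the function name generate_domains, the seed string, the use of SHA-256, and the 12-letter label length are all assumptions chosen for illustration. It combines a static seed with a dynamic seed (the date) and shows that two parties holding the same seeds derive the same list of domains.

```python
import hashlib
from datetime import date


def generate_domains(seed: str, day: date, count: int = 5, tld: str = ".com") -> list[str]:
    """Derive `count` pseudo-random domain names from a static seed and a date."""
    domains = []
    for i in range(count):
        # Mix the static seed, the dynamic seed (the date) and a counter.
        material = f"{seed}|{day.isoformat()}|{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits to the letters a-p to get a random-looking label.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(label + tld)
    return domains


# Hypothetical seed value; both sides run the same function with the same inputs.
malware_side = generate_domains("example-seed", date(2022, 4, 26))
server_side = generate_domains("example-seed", date(2022, 4, 26))
print(malware_side == server_side)  # True: same seeds, same domain list
```

Because the date is part of the input, the list changes every day, so blocking yesterday's domains does not block today's.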

With such a massive number of domains, detecting and blocking malicious domains and the IPs behind them is critical for security analysts, but traditional security software cannot keep up with the volume of requests. This makes separating DGA domains from non-DGA domains a difficult task in cyber defense, and it is why machine learning has become essential against DGA domains.

How can machine learning be used to detect DGAs?

The main use of machine learning in security is to detect threats and take appropriate action against them, based on what the model learned from its training dataset.

For DGA domain detection, a two-level machine learning model has been applied and has proved successful against DGA-generated domains: a classification model and a clustering model.

First-level model (Classification):

A classification model is a supervised learning technique: it is trained on labeled examples from a given dataset (known attack elements) so that it can predict and detect a specific attack.
This technique is widely used for detecting DGA-generated domains. The only raw data it needs is the domain names. All other information is discarded by a domain-request packet filter, which filters the domains and stores them in a blacklist. Domain features are then extracted from the stored domains and used for classification.
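The supervised step can be sketched with a very small classifier. The article does not name a specific algorithm, so the nearest-centroid classifier below is a stand-in chosen for brevity, and the two-value feature vectors and their labels are hand-made, hypothetical training data rather than real measurements.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(samples):
    """samples: list of (feature_vector, label). Returns {label: centroid}."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Hypothetical labeled vectors: [meaningful_word_ratio, digit_ratio]
training = [
    ([1.0, 0.0], "benign"),
    ([0.8, 0.0], "benign"),
    ([0.0, 0.4], "dga"),
    ([0.1, 0.3], "dga"),
]
model = train(training)
print(predict(model, [0.9, 0.05]))  # benign
print(predict(model, [0.0, 0.5]))   # dga
```

A production system would more likely use a tree ensemble or a neural network trained on many thousands of labeled domains, but the train-then-predict shape stays the same.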

Domain features used in the classification method:

.: Linguistic features:

Many linguistic features can be used in the classification process, such as length, meaningful word ratio, pronounceability score, percentage of numerical characters, and the length of the longest meaningful string (LMS) as a percentage of the domain length.
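As a rough illustration of how such features could be computed, the sketch below works on the first label of a domain name. Several parts are assumptions: WORDS is a tiny stand-in for a real dictionary, the input is assumed to be a registered domain such as "facebook.com" with no subdomain, and vowel_ratio is only a crude proxy for a pronounceability score, which is normally based on n-gram statistics.

```python
# Tiny stand-in dictionary (assumption); a real system would load a full word list.
WORDS = {"face", "book", "mail", "news", "shop", "cloud", "bank"}
VOWELS = set("aeiou")


def word_spans(label, min_len=3):
    """Return (start, end) index pairs of every dictionary word found in label."""
    spans = []
    for start in range(len(label)):
        for end in range(start + min_len, len(label) + 1):
            if label[start:end] in WORDS:
                spans.append((start, end))
    return spans


def linguistic_features(domain):
    """Compute simple linguistic features for the first label of a domain."""
    label = domain.lower().split(".")[0]
    n = len(label)
    if n == 0:
        raise ValueError("domain has an empty first label")
    spans = word_spans(label)
    covered = {i for start, end in spans for i in range(start, end)}
    longest = max((end - start for start, end in spans), default=0)
    return {
        "length": n,
        "meaningful_word_ratio": len(covered) / n,  # share of chars inside a word
        "vowel_ratio": sum(c in VOWELS for c in label) / n,  # crude pronounceability proxy
        "digit_ratio": sum(c.isdigit() for c in label) / n,
        "lms_ratio": longest / n,  # longest meaningful string / length
    }


print(linguistic_features("facebook.com")["meaningful_word_ratio"])  # 1.0
print(linguistic_features("x7k2q9zt.net")["digit_ratio"])            # 0.375
```

A human-chosen name tends to score high on the word-based ratios, while a random-looking label scores near zero on them and higher on the digit ratio.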

.: DNS features:

A variety of data can be collected about the received domain, such as the DNS records, geographic location, distinct IP addresses, and the creation and expiration dates of the domain. The dates are especially important because DGA domains tend to have a short registration period, typically one year. This data is then used to classify the detected domains as good or bad.
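The DNS-side data can be turned into numeric features in a similar way. The sketch below assumes the lookup results have already been collected into a plain dictionary; the field names (created, expires, ips, countries), the sample values, and the 366-day cutoff for "short-lived" are all hypothetical and only mirror the "typically one year" observation above.

```python
from datetime import date


def dns_features(record: dict) -> dict:
    """Turn one (hypothetical) DNS/WHOIS record into numeric features."""
    lifetime_days = (record["expires"] - record["created"]).days
    return {
        "lifetime_days": lifetime_days,
        "distinct_ips": len(set(record["ips"])),
        "distinct_countries": len(set(record["countries"])),
        # Assumed cutoff: registered for about one year or less.
        "short_lived": lifetime_days <= 366,
    }


# Made-up record; the IPs come from a range reserved for documentation.
sample = {
    "domain": "x7k2q9zt.net",
    "created": date(2022, 1, 10),
    "expires": date(2023, 1, 10),
    "ips": ["203.0.113.5", "203.0.113.9", "203.0.113.5"],
    "countries": ["US", "NL", "US"],
}
features = dns_features(sample)
print(features["lifetime_days"], features["distinct_ips"])  # 365 2
```

These values would be appended to the linguistic features so the classifier sees both kinds of evidence for each domain.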

Figure: two-level model of classification and clustering. A clustering model groups the domains based on the similarity of their features. Source: researchgate.net

Second-level model (Clustering):

Clustering is an unsupervised learning technique that groups data points that share similar features.
For DGA domains, clustering groups the domains based on the static features used at the classification level above and on a measure of similarity of their DNS traffic, for example by putting domains requested from the same source into the same group. These groups are then used to find similar DGA domains and block them through the blacklist.
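The grouping idea can be sketched with a simple distance threshold. The article does not say which clustering algorithm is used, so the single-linkage approach below is a stand-in: two domains land in the same group when a chain of close-enough neighbours connects them. The feature vectors and the threshold of 0.1 are hypothetical.

```python
import math


def cluster(points: dict, threshold: float) -> list:
    """Group names whose vectors are chained together by distances <= threshold."""
    unvisited = set(points)
    clusters = []
    while unvisited:
        start = min(unvisited)  # pick alphabetically so the output is deterministic
        unvisited.remove(start)
        group, frontier = {start}, [start]
        while frontier:
            current = frontier.pop()
            near = {
                name for name in unvisited
                if math.dist(points[current], points[name]) <= threshold
            }
            unvisited -= near
            group |= near
            frontier.extend(near)
        clusters.append(sorted(group))
    return clusters


# Hypothetical feature vectors: [length / 20, digit_ratio]
points = {
    "facebook.com": [0.40, 0.0],
    "mailcloud.com": [0.45, 0.0],
    "x7k2q9zt.net": [0.40, 0.375],
    "p4r8w1mv.net": [0.40, 0.375],
}
groups = cluster(points, threshold=0.1)
print(groups)
# [['facebook.com', 'mailcloud.com'], ['p4r8w1mv.net', 'x7k2q9zt.net']]
```

Once one member of a group is confirmed as a DGA domain, the rest of the group becomes a natural candidate list for the blacklist.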

Using classification and clustering together on the training dataset strengthens the detection of DGA domains and makes the task more effective and accurate than traditional security software alone.