Computer Science Notes

CS Notes is a simple blog to keep track of CS-related stuff I consider useful.

19 Mar 2023

Using CNN for a Domain name Generation Algorithm (2)

by Harpo MAxx (6 min read)

This post is the second part of a series discussing the development of a domain generation algorithm (DGA). See the first post here. The current post is centered around creating the model using Keras for the R language.

Creating the model

In this case, we are going to write the create_model() function with a simple one-dimensional convolutional neural network (CNN) architecture. The architecture consists of 4 layers:

layer_conv_1d: This layer applies a 1D convolution operation to the input. It takes as input the number of filters, kernel size, activation function, padding, strides, and input shape. In this case, we are using 32 filters with a kernel size of 3, a ReLU activation function, valid padding, and a stride of 1. Notice that the input shape is a 3-dimensional tensor of the form (batch_size, maxlen, n_tokens), where n_tokens is the number of unique characters in the input text and maxlen is the length of the input sequences (i.e., the number of characters to consider at each time step).

layer_flatten: This layer flattens the output from the previous layer into a 1D tensor.

layer_dense: This layer is a fully connected dense layer that maps the flattened output to a vector of length n_tokens.

layer_activation: This layer applies the softmax activation function to the output of the previous layer to obtain a probability distribution over the output tokens. The output will be a 1-dimensional tensor of size n_tokens. Using this we can choose the most likely next character in the sequence.

create_model <- function(n_tokens, maxlen){
    keras_model_sequential() %>%
        layer_conv_1d(filters = 32, 
                      kernel_size = 3,
                      activation = 'relu', 
                      padding = 'valid',
                      strides = 1,
                      input_shape = c(maxlen, n_tokens)
                      ) %>%
        layer_flatten() %>%
        layer_dense(n_tokens) %>%
        layer_activation("softmax") %>%
        compile(
            loss = "categorical_crossentropy",
            optimizer = optimizer_adam()
        )
}

We could have used an LSTM layer (or even a bidirectional LSTM), which is arguably more suitable for this kind of problem. However, in my experience, I have always had good results using 1D convolutional layers, with the extra benefit of a speed improvement due to their inherent parallelization.
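
For comparison, here is a minimal sketch of what an LSTM-based variant could look like. This is only an illustration under that assumption, not the model used in this post:

# Hypothetical LSTM-based variant (not the model used in this post):
# the Conv1D + Flatten layers are replaced by a single LSTM layer.
create_model_lstm <- function(n_tokens, maxlen){
    keras_model_sequential() %>%
        layer_lstm(units = 64, input_shape = c(maxlen, n_tokens)) %>%
        layer_dense(n_tokens) %>%
        layer_activation("softmax") %>%
        compile(
            loss = "categorical_crossentropy",
            optimizer = optimizer_adam()
        )
}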

The create_model() function also compiles the model, which sets up the backend computation graph based on the specified loss function, optimizer, and any additional metrics. In this case, we are going to use the categorical_crossentropy loss function and the Adam optimizer.

dga_gen_model <- create_model(n_tokens,maxlen)
summary(dga_gen_model)
Model: "sequential"
________________________________________________________
 Layer (type)             Output Shape          Param #        
========================================================
 conv1d (Conv1D)          (None, 38, 32)        4256           
 flatten (Flatten)        (None, 1216)          0              
 dense (Dense)            (None, 44)            53548          
 activation (Activation)  (None, 44)            0              
========================================================
Total params: 57,804
Trainable params: 57,804
Non-trainable params: 0
________________________________________________________
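
As a quick sanity check (assuming maxlen is 40 here, which matches the output shapes above), the parameter counts follow directly from the architecture: the Conv1D layer has (kernel_size × n_tokens + 1) × filters = (3 × 44 + 1) × 32 = 4,256 parameters; valid padding shrinks the 40-position input to 40 − 3 + 1 = 38 positions, so flattening yields 38 × 32 = 1,216 values; and the dense layer adds (1,216 + 1) × 44 = 53,548 parameters, for a total of 57,804.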

After compilation, the model is ready for training. We are going to use 3D tensor inputs containing sequences of characters (domain names) encoded as one-hot vectors, and 2D tensor labels containing the next character to predict (also one-hot encoded).
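
The tokenize() and to_onehot() helpers used later in this post are available in the repository linked at the end. As a rough sketch of how such an encoding could work (the helper name and details below are assumptions, shown only to illustrate the tensor shapes), consider:

# Hypothetical sketch of one-hot encoding a batch of character sequences,
# producing the (batch_size, maxlen, n_tokens) tensor expected by fit().
# The real helpers (tokenize()/to_onehot()) live in the GitHub repo.
encode_onehot <- function(sequences, valid_characters_vector, maxlen) {
    n_tokens <- length(valid_characters_vector)
    x <- array(0, dim = c(length(sequences), maxlen, n_tokens))
    for (i in seq_along(sequences)) {
        chars <- strsplit(sequences[i], "")[[1]]
        for (t in seq_len(min(length(chars), maxlen))) {
            idx <- match(chars[t], valid_characters_vector)
            if (!is.na(idx)) x[i, t, idx] <- 1
        }
    }
    x
}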

dga_gen_model %>% fit(seq_vectorized_x, 
                      seq_vectorized_y,
                      batch_size = 128, 
                      epochs = 10, 
                      verbose = 1)
save_model_tf(dga_gen_model,"dgagen.keras")

OK, so now we have our model trained, and we are almost ready to start generating new sequences. For instance, we can feed the model with one of the domain names already seen during training and ask it to start generating new variations of the original sequence. But first, just a few words about how to select the next character.

Sampling the next character

The resulting model outputs a probability distribution over the valid_characters_vector using a softmax function. However, always selecting the most probable character is not necessarily a good idea. Don’t get me wrong, choosing the most probable character according to the softmax output is a valid approach when generating text, but it can lead to repetitive and uninteresting output.

The reason for this is that always selecting the most probable character will result in a deterministic output where the same sequence of characters is generated every time the model is given the same starting input.

On the other hand, using a sampling function to select the next character introduces a level of randomness into the output, allowing for greater variability and creativity in the generated text. By randomly selecting from the probability distribution, the model can produce unexpected and diverse output that is not limited to a single deterministic path.

Below we have a snippet of the sampling function used for generating domain names. The function includes a temperature parameter. By varying the value of temperature, we can control the level of randomness in the generated domain. A higher temperature value will produce more diverse and unexpected output, while a lower temperature value will produce more conservative and predictable output.

# Function to choose the next character in a sequence based on predicted probabilities and a temperature parameter
choose_next_char <- function(preds, valid_characters_vector, temperature, seed) {
  
  # Set the random seed for reproducibility
  set.seed(seed)
  
  # Scale the predicted probabilities by the temperature parameter
  preds <- log(preds) / temperature
  
  # Exponentiate the scaled probabilities to obtain 
  # unnormalized probabilities
  exp_preds <- exp(preds)
  
  # Normalize the unnormalized probabilities to obtain a 
  # probability distribution over all possible characters
  preds <- exp_preds / sum(exp_preds)
  
  # Sample from the probability distribution using the
  # rmultinom function. rmultinom produces a binary vector
  # indicating which character was selected. The which.max
  # function converts the binary vector to an integer index,
  # which is used to select the corresponding character from
  # the valid_characters_vector
  next_index <- rmultinom(1, 1, preds) %>%
    as.integer() %>%
    which.max()
  next_char <- valid_characters_vector[next_index]
  
  # Return the selected character
  return(next_char)
}

Besides temperature, the function takes three other arguments:

  • preds a vector of predicted probabilities for each possible character, output by the model.

  • valid_characters_vector a vector of possible characters.

  • seed a seed value for reproducibility of the output.
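
To get a feel for the effect of temperature, here is a toy call of choose_next_char() with a made-up three-character distribution (the probabilities and characters below are hypothetical):

# Toy example with a hypothetical distribution over three characters
probs <- c(0.7, 0.2, 0.1)
chars <- c("a", "b", "c")

# A low temperature sharpens the distribution, so "a" is picked almost always
choose_next_char(preds = probs, valid_characters_vector = chars,
                 temperature = 0.2, seed = 42)

# A higher temperature flattens it, so "b" or "c" show up more often
choose_next_char(preds = probs, valid_characters_vector = chars,
                 temperature = 0.9, seed = 42)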

Generating new domains

Finally, we have all the elements necessary to start generating new domains using our character-level generation network.

The process is basically the following:

  1. Pick a random domain from the list used for training. This domain is going to be our initial sequence.
  2. Encode the selected domain and feed it to the model.
  3. Obtain the model output and pick the new character according to the sampling function choose_next_char() with a temperature of 0.2.
  4. Concatenate the new character to the original domain and remove the first character.
  5. Repeat the process n times, where n is the number of new characters we want to generate for the new DGA domain.

Here is the code. In this case, we generate 10 new DGA domains. Each new domain will have a variable size, since we are adding between 15 and 18 new characters.

# Set the initial random seed for reproducibility
seed <- 1

# Initialize variables for storing generated domains and the next random seed
dga_domains <- c()
nextseed <- seed

# Load the pre-trained  model for generating DGA domains
dga_gen_model <- load_model_tf("../../../models/dgagen.keras")

# Set the random seed for reproducibility
set.seed(seed)

# Generate 10 new DGA domains
for (j in seq(1:10)) {
  
  # Use the domain at index `seed` in the training sequences as the initial sentence
  initial_sentence <- seq_x[seed]
  generated <- ""
  # Generate a sequence of characters to complete the domain.
  # We are going to pick between 15 and 18 characters to add
  for (i in seq(0:sample(15:18, 1))) {
    
    # Vectorize the current sentence and convert it to one-hot encoding
    set.seed(seed)
    vectorized_test <- tokenize(initial_sentence, "n", maxlen)
    shape <- c(nrow(vectorized_test$x),
               maxlen, 
               length(valid_characters_vector))
    vectorized_test <- to_onehot(vectorized_test$x, shape)
    
    # Use the pre-trained  model to predict the next character 
    # in the sequence
    predictions <- dga_gen_model(vectorized_test)
    predictions <- predictions %>% as.array()
    
    # Choose the next character based on the predicted probabilities 
    # using the choose_next_char function
    next_char <- choose_next_char(preds = predictions, 
                                  valid_characters_vector = valid_characters_vector, 
                                  temperature = 0.2, 
                                  seed = nextseed)
    
    # Update the initial sentence with the chosen next character
    # and remove the first one.
    initial_sentence <- paste0(initial_sentence, next_char)
    initial_sentence <- substr(initial_sentence, 2, nchar(initial_sentence))
    
    # Add the chosen next character to the generated domain name
    generated <- paste0(generated, next_char)
    
    # Increment the random seed for the next iteration
    nextseed <- nextseed + 1
  }
  
  # Add the generated domain name to the list of generated domains using a  
  # predefined TLD
  dga_domains[j] <- paste0(generated, ".com")
}

The vector dga_domains will contain the generated DGA domains with the .com TLD appended. Here is the final list.

[0] "generated: alestanthentereq.com"
[1] "generated: netalanestoreateran.com"
[2] "generated: storeanthertarth.com"
[3] "generated: nationstalestares.com"
[4] "generated: eancertardentare.com"
[5] "generated: amestonlinestarta.com"
[6] "generated: aleandingsingodem.com"
[7] "generated: amestonthonettress.com"
[8] "generated: amestoromaricarq.com"
[9] "generated: amestallandianter.com"

Using a temperature value of 0.2, the resulting domains look pretty much like normal domains. Let’s see what happens if we change the temperature to 0.9.

[0] "generated: eengragzinetrabt.com"
[1] "generated: abiedbiolsorotass.com"
[2] "generated: acheronalatpinfoxca.com"
[3] "generated: sinjlostahockintea.com"
[4] "generated: exichstums9rikin.com"
[5] "generated: hedadelphantcomfuca.com"
[6] "generated: artctscyfudrachandr.com"
[7] "generated: detaxelaulicenuss.com"
[8] "generated: eercacecoms1207fa49.com"
[9] "generated: flecbroctopsalrtun.com"

Perhaps it is just me, but I think the first group of DGA domains looks more “normal” than the second one. In any case, the real test would be to check the DGA domains against some sort of detector. I tested the generated domains against a simple convnet DGA detector, and the results were actually pretty good.

library(curl)

# Query the DGA detector for each generated domain (stored in `dga`) and
# collect the predicted class (0 = normal, 1 = DGA)
res <- c()
reference <- factor(rep(1, length(dga)), levels = c(0, 1))
for (j in seq(1:length(dga))) {
  req <- curl::curl_fetch_memory(
    paste0("http://ns158.ingenieria.uncuyo.edu.ar:8000/predict?domain=", dga[j]))
  res[j] <- ({jsonlite::parse_json(rawToChar(req$content))}$class)
}
caret::confusionMatrix(data = as.factor(res), reference)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0  0 94
         1  0  7
                                          
               Accuracy : 0.0693          
                 95% CI : (0.0283, 0.1376)
    No Information Rate : 1               
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity :      NA         
            Specificity : 0.06931         
         Pos Pred Value :      NA         
         Neg Pred Value :      NA         
             Prevalence : 0.00000         
         Detection Rate : 0.00000         
   Detection Prevalence : 0.93069         
      Balanced Accuracy :      NA         
                                          
       'Positive' Class : 0      

As you can see, only 7 out of 101 generated domains were detected by the DGA detector (notice that observations labeled as 0 are normal domains, while 1 was used for DGA). Of course, that means little on its own, since there are far more effective DGA detectors, but it gives an idea. I guess…

Some final words…

As you can see, it is not difficult to implement a character-level generative approach for DGA. The quality of the generated domains is actually pretty decent. However, there are still some considerations before deploying an approach like this.

The first issue is that the whole approach requires a predefined list of domains to use as seed sequences (i.e., the ones used to start the generation of new domains). Such a list must be included in (or somehow accessible to) the malware. That inclusion can easily be detected by forensic analysis, and it would then be pretty easy to take countermeasures to take over the whole botnet.

A second issue, also related to the seed, is that we have not considered any C2 synchronization approach. In this post, we have just used a hard-coded seed, but this strategy could be easy to detect. The botmaster and the botnet need to agree on a pseudo-random seed to start generating, and using the current date is the common approach. So, we would need to build a function for mapping the current date to one of the domains in the predefined list.
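
As a rough sketch of what such a date-based mapping could look like (the seed_domains list and the function name below are hypothetical):

# Hypothetical sketch: both the botmaster and the bots derive the same
# seed domain from the current date, assuming they share `seed_domains`.
pick_seed_domain <- function(date, seed_domains) {
    day_number <- as.integer(as.Date(date))  # days since 1970-01-01
    idx <- (day_number %% length(seed_domains)) + 1
    seed_domains[idx]
}

# Example usage: pick_seed_domain(Sys.Date(), seed_domains)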

You can check the complete source code, as well as a plumber implementation of the approach, in my GitHub repository.

References

[1] An application of CNN for DGA detection. Some approaches for DGA detection are explained.

[2] Character-level text generation with LSTM using KERAS

[3] Deep Learning with Python (2nd Ed.) has a complete chapter devoted to generative approaches.

[4] Link to the GitHub repo containing most code shown here.