Computer Science Notes

CS Notes is a simple blog to keep track of CS-related stuff I consider useful.

30 Jul 2023

Sharing your own dataset on Hugging Face

For the last few months, I’ve been working with the Hugging Face 🤗 ecosystem, mostly because of the LLM hype, of course. In a previous post, I built a simple DGA detector based on a CNN and uploaded it to the Hugging Face (HF) Hub. The Hub is a place where you can share your models. Actually, not only your models: you can also share the datasets you used for training them, and you can even create a simple app to let people try your model in production. I personally find that very useful, not only for sharing but also for keeping records of your models, code, data, and so on.

When I uploaded the DGA detector model, I also wanted to share my dataset there. So I used HF’s git repository/UI approach to upload the dataset. The approach is pretty simple: you create a repo, GitHub style, and upload the dataset using the web UI. And that’s all. Simple… and effective.

So I picked my CSV file, compressed it using gzip, and uploaded it to the Hub using the UI. Wow. That was fast! 🏃
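
If you prefer to do the same thing from code, the huggingface_hub library offers an equivalent path. The snippet below is just a sketch: it assumes you are already logged in with a write token (e.g. via huggingface-cli login), and it reuses the compressed file name that appears later in this post.

from huggingface_hub import HfApi

api = HfApi()
# Create the dataset repository (same as the "New dataset" button in the UI)
api.create_repo("harpomaxx/dga-detection", repo_type="dataset", exist_ok=True)
# Upload the gzip-compressed CSV to the repository
api.upload_file(
    path_or_fileobj="argencon.csv.gz",
    path_in_repo="argencon.csv.gz",
    repo_id="harpomaxx/dga-detection",
    repo_type="dataset",
)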

What are the benefits of doing this? Well, besides sharing your dataset with the community, which is perhaps the most important thing, the other benefit of uploading the dataset to the HF Hub is the datasets library. Once you have uploaded your dataset to the Hub, it can be accessed through the datasets library with just one line of code.

import datasets
dataset = datasets.load_dataset("harpomaxx/dga-detection")

Basic functionality

By using the datasets library, you gain access to a lot of functionality, such as dataset splitting, versioning, streaming, and preprocessing, among others.
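
Versioning, for instance, is exposed through the revision argument of load_dataset(). This is just a sketch: "main" is simply the default branch, but you could pin a tag or a specific commit hash instead.

from datasets import load_dataset
# Pin the dataset to a specific git revision (branch, tag, or commit hash)
dataset = load_dataset("harpomaxx/dga-detection", revision="main")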

The streaming functionality is very useful if you don’t want to (or can’t) download the complete dataset.

from datasets import load_dataset
streaming_dataset = load_dataset('harpomaxx/dga-detection', split='train', streaming=True)
domains = iter(streaming_dataset)
print(next(domains))
{'domain': '0-1.ru', 'label': 'normal.alexa', 'class': 0}
print(next(domains))
{'domain': '0-60specs.com', 'label': 'normal.alexa', 'class': 0}

With streaming=True, load_dataset() returns an iterable dataset that lets you access the data one element at a time. Notice that access with streaming=True can be slower than reading the complete dataset, but sometimes we don’t have another option… 🤷

Preprocessing is fundamental in the machine learning pipeline, and the HF datasets library fits perfectly with the well-known transformers library (also from HF). You can tokenize a particular feature of your dataset with just a few lines of code. In the example below I tokenize a single domain (the first one in the train split) using the bert-base-uncased tokenizer.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer(dataset["train"][0]["domain"])
{'input_ids': [101, 1014, 1011, 1015, 1012, 21766, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
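
If you want to tokenize the whole domain column instead of a single example, you can combine the tokenizer with the map() method. This is a minimal sketch, assuming the non-streaming dataset loaded at the beginning of the post:

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("harpomaxx/dga-detection")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_batch(batch):
    # Tokenize a whole batch of domain names at once
    return tokenizer(batch["domain"], truncation=True)

# batched=True processes the dataset in chunks, which is much faster
tokenized_dataset = dataset.map(tokenize_batch, batched=True)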

Keep in mind that tokenization is just one of the possible preprocessing steps you can apply to your dataset using the transformers library.

Splitting your dataset is also simple: just call the train_test_split() method. For instance, let’s say we want to split the dataset with the usual 80/20 ratio. We just need to execute the following code:

dataset_split = dataset["train"].train_test_split(test_size=0.2)
dataset_split
DatasetDict({
    train: Dataset({
        features: ['domain', 'label', 'class'],
        num_rows: 1838652
    })
    test: Dataset({
        features: ['domain', 'label', 'class'],
        num_rows: 204295
    })
})

Now, we can access the train and test splits by simply using dataset_split["train"] or dataset_split["test"].

Sometimes you want to give your dataset a particular structure or configuration. For instance, you may want to predefine sub-datasets for training, testing, or validation. To do that, you can use the README.md file, which contains a YAML section where you can set up a lot of information about your dataset. A common approach is to predefine your splits with the following configuration:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: "domains_train.csv"
  - split: test
    path: "domains_test.csv"
---
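
With a configuration like this in place, the predefined splits can be requested directly through the split argument. Just a quick sketch: the file names above are only an example, and they have to match the files actually present in your repository.

from datasets import load_dataset
# Load only the predefined test split declared in the YAML configuration
test_dataset = load_dataset("harpomaxx/dga-detection", split="test")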

Using a load script

Finally, you can create your own load script. As mentioned in the HF site, this is a more advanced way to define a dataset than using YAML metadata in the dataset card. A dataset script is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data. You can, for instance, download data files from any website, or from the same dataset repository. Then convert or transform the data the way you want.

Just for testing, I decided to create a simple loader script for my DGA dataset, just to learn how to do it. The loader script should have the same name as the HF repository, so I created a script called dga-detection.py. The script basically has a couple of imports (datasets and pandas), a set of constants, and a class. First, you create a class like DGADataset inheriting from datasets.GeneratorBasedBuilder.

class DGADataset(datasets.GeneratorBasedBuilder):

In this class you will need to define three methods: _info(), _split_generators(), and _generate_examples().

The _info() method provides metadata about the dataset, such as its description, features, supervised keys, and the homepage. This metadata is essential for the Hugging Face Datasets library to know how to handle and represent the dataset.

def _info(self):
        # Provide metadata for the dataset
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {"domain": datasets.Value("string"), 
                 "label": datasets.Value("string"),
                 "class": datasets.Value("int32")
                }
            ),
            supervised_keys=("domain", "class"),
            homepage=_HOMEPAGE,
        )

The method provides information about the structure of the dataset, such as the names of the features and their types. Notice that _info() references two constants: _HOMEPAGE and _DESCRIPTION. The latter contains the description of the dataset, while the former is just the homepage of the dataset. You can also add other information here, such as a citation, a license, and so on.

In this case, I have filled both constants with the following information:

_DESCRIPTION = """\
A dataset containing both DGA and normal domain names. The normal domain names were taken from Alexa's top one million domains. An additional 3,161 normal
domains were included in the dataset, provided by the Bambenek Consulting feed. This latter group is particularly interesting since it consists of suspicious domain
names that were not generated by a DGA. Therefore, the total number of normal domains in the dataset is 1,003,161. DGA domains were obtained from the repositories
of DGA domains of Andrey Abakumov and John Bambenek, for a total of 1,915,335 domains corresponding to 51 different malware families. About 55% of the DGA portion
of the dataset is composed of samples from the Banjori, Post, Timba, Cryptolocker, Ramdo, and Conficker malware families.
"""
_HOMEPAGE = "https://huggingface.co/datasets/harpomaxx/dga-detection"
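
Citation information can be handled the same way. Below is just a sketch: the BibTeX entry is a placeholder (not the dataset's actual citation), and it would be passed to datasets.DatasetInfo() through its citation argument inside _info().

_CITATION = """\
@misc{dga_detection_dataset,
  title  = {DGA detection dataset},
  author = {Placeholder Author},
  year   = {2023},
  url    = {https://huggingface.co/datasets/harpomaxx/dga-detection}
}
"""

# Inside _info():
#     return datasets.DatasetInfo(..., citation=_CITATION)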

Then we have the _split_generators() method, which defines how the dataset should be split into different sets, like train, test, and validation. Here, the path to the CSV file with the domain names is specified. The method then returns a SplitGenerator object for each dataset split. These objects define how data should be fetched and processed for each split.

def _split_generators(self, dl_manager: datasets.DownloadManager):
        # Load your dataset file
        csv_path = "https://huggingface.co/datasets/harpomaxx/dga-detection/resolve/main/argencon.csv.gz"

        # Create SplitGenerators for each dataset split (train, test, validation)
        return [
            datasets.SplitGenerator(
                name=split,
                gen_kwargs={
                    "filepath": csv_path,
                    "split": split,
                },
            )
            for split in ["train", "test", "validation"]
        ]

The datasets.SplitGenerator objects in the _split_generators() method are responsible for creating the three different keys ('train', 'test', 'validation'). When you load your dataset using load_dataset(), the Hugging Face datasets library will automatically call the _split_generators() method to create the three different dataset splits.

In turn, the library will call the _generate_examples() method once for each split, passing the corresponding split name through gen_kwargs as the split argument. This is how the different keys are created.
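
Once the script is in the repository, loading the dataset works as before; without a split argument you get the three splits defined above. A quick sketch (depending on your version of the datasets library, you may also need to pass trust_remote_code=True):

from datasets import load_dataset
# The loading script builds the 'train', 'test' and 'validation' splits
dataset = load_dataset("harpomaxx/dga-detection")
print(dataset)  # DatasetDict with 'train', 'test' and 'validation' keys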

In this particular implementation of the _generate_examples() method, we load the dataset and shuffle it. Then we create a new feature named class containing the numeric representation of the label (i.e., 0 for normal and 1 for DGA).

def _generate_examples(
        self,
        filepath: str,
        split: str,
    ):
        # Read the CSV dataset (the file is gzip-compressed)
        dataset = pd.read_csv(filepath, compression='gzip')
        # Shuffle the dataset using a fixed seed for reproducibility
        seed = 42
        dataset = dataset.sample(frac=1, random_state=seed).reset_index(drop=True)
        
        # Create the 'class' column based on the 'label' column
        dataset['class'] = dataset['label'].apply(lambda x: 0 if 'normal' in x else 1)

        # Get the total number of rows
        total_rows = len(dataset)

        # Define the ratio for train, test, and validation splits
        train_ratio = 0.7
        test_ratio = 0.2

        # Calculate the indices for each split
        train_end = int(train_ratio * total_rows)
        test_end = train_end + int(test_ratio * total_rows)

        # Filter your dataset based on the 'split' argument
        if split == "train":
            dataset = dataset.iloc[:train_end]
        elif split == "test":
            dataset = dataset.iloc[train_end:test_end]
        elif split == "validation":
            dataset = dataset.iloc[test_end:]

        # Generate examples
        for index, row in dataset.iterrows():
            yield index, {
                "domain": row["domain"],
                "label": row["label"],
                "class": row["class"],
            }

Then, we calculate the indices for the train, test, and validation splits based on the total number of rows and the specified ratios: 0.7 for training, 0.2 for testing, and the remaining 0.1 for validation. Finally, we filter the dataset based on the provided split (train, test, or validation).

Just a few final words

Submitting your own dataset to Hugging Face is a straightforward process, at least when using the git/UI approach and the dataset card. If you want something more complex, you can create your own loading script to deal with different aspects of your dataset generation pipeline.

Remember that when you upload your dataset to Hugging Face, you get the benefits provided by the datasets library. This library makes it much easier to access and manipulate your datasets. Whether you are diving into data transformation or crafting a preprocessing routine, the datasets library keeps every operation streamlined and efficient.

Also, with the datasets library you can deal with huge datasets that normally would not fit in your computer’s memory. Part of the library’s secret is the use of Apache Arrow for storing the data efficiently. You can go here for a better understanding of what is under the hood of the datasets library.

For sure, this is just the first step for diving into the Hugging Face ecosystem.