Generic Data Structures in C

Sep 22, 2018 20:12 · 825 words · 4 minute read Tags:

In C, writing generic data structures is hard. There are no templates, generics, or overloaded functions built in to the language, so the programmer needs to get creative. In this post, we will explore 3 ways to write a generic vector in C: using void*, dynamically storing size, and using macros. We will implement append and get, and set.

Using Void*

In C, void* is a generic pointer type – it is up to the user to interpret the data as an apporpriate type. This is usually the go to way of implementing generics. Basically, we have a data structure of pointers. We don’t care about the type, we just store the pointers and give the pointers back to the user. Such an implementation might look as follows:

#include <stddef.h>
#include <stdlib.h>

#define vector_MIN_CAP 32

typedef struct vector {
  void **buf;
  size_t capacity;
  size_t size;
} vector;

void vector_init(vector *vec) {
  vec->capacity = vector_MIN_CAP;
  vec->buf = malloc(sizeof(*vec->buf) * vec->capacity);
  vec->size = 0;
}

void *vector_get(const vector *vec, size_t idx) { return vec->buf[idx]; }

void vector_set(vector *vec, size_t idx, const void *data) { vec->buf[idx] = data; }

void vector_push(vector *vec, const void *data) {
  if (vec->size == vec->capacity) {
    vec->capacity *= 2;
    vec->buf = realloc(vec->buf, sizeof(*vec->buf) * vec->capacity);
  }
  vector_set(vec, vec->size++, data);
}

The down side with this approach is that all data in the vector needs to be heap allocated. This also means that in order to access any data, a random read into memory must be performed. The next approach solves this problem by allowing arbitrary sized data to be stored in situ.

Dynamically Storing Data Size

In this approach, the initializer for the vector will take in a size. It is assumed that all data read and written from the vector is in chunks of this size. This is coded below:

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define vector_MIN_CAP 32

typedef struct vector {
  void *buf;
  size_t capacity;
  size_t size;
  size_t data_size;
} vector;

void vector_init(vector *vec, size_t data_size) {
  vec->data_size = data_size;
  vec->capacity = vector_MIN_CAP;
  vec->buf = malloc(vec->data_size * vec->capacity);
  vec->size = 0;
}

void *vector_get(const vector *vec, size_t idx) {
  return vec->buf + vec->data_size * idx;
}

void vector_set(vector *vec, size_t idx, const void *data) {
  memcpy(vec->buf + idx * vec->data_size, data, vec->data_size);
}

void vector_push(vector *vec, const void *data) {
  if (vec->size == vec->capacity) {
    vec->capacity *= 2;
    vec->buf = realloc(vec->buf, sizeof(*vec->buf) * vec->capacity);
  }
  vector_set(vec, vec->size++, data);
}

In this scenario, the data needs to be copied into the vector on insertion. Pointers are returned as addresses into the vector’s buffer; so if the buffer address changes during appending, any previous pointers to the data are invalidated. There is a small runtime overhead of storing the data size and doing computation based on it. There is also an ergonomic cost – if you want to create a vector of integers, to insert an integer into it, you need to do int x = 10; vector_set(&v, 0, &x) rather than the more natural vector_set(&v, 0, 10). However, by using macros, the index computations can be done at compile time and data can be passed by value.

Macros

The final solution is to use macros which generate code to define each vector method above for the types. This might look something like:

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define vector_MIN_CAP 32

#define vector_struct(T)                                                       \
  typedef struct T##_vector {                                                  \
    T *buf;                                                                    \
    size_t capacity;                                                           \
    size_t size;                                                               \
  } T##_vector;

#define vector_init(T)                                                         \
  void T##_vector_init(T##_vector *vec) {                                      \
    vec->capacity = vector_MIN_CAP;                                            \
    vec->buf = malloc(sizeof(T) * vec->capacity);                              \
    vec->size = 0;                                                             \
  }

#define vector_get(T)                                                          \
  void *T##_vector_get(T##_vector *vec, size_t idx) { return vec->buf + idx; }

#define vector_set(T)                                                          \
  void T##_vector_set(T##_vector *vec, size_t idx, T data) {                   \
    vec->buf[idx] = data;                                                      \
  }

#define vector_push(T)                                                         \
  void T##_vector_push(T##_vector *vec, T data) {                              \
    if (vec->size == vec->capacity) {                                          \
      vec->capacity *= 2;                                                      \
      vec->buf = realloc(vec->buf, sizeof(T) * vec->capacity);                 \
    }                                                                          \
    T##_vector_set(vec, vec->size++, data);                                    \
  }

#define vector(T)                                                              \
  vector_struct(T);                                                            \
  vector_init(T) vector_get(T) vector_set(T) vector_push(T)

For every type you want to use this vector, you must declare it with vector(int) (or whatever type). This will increase the amount of code generated. Furthermore, it has the obvious downside that T must be a one word type (so for example you cannot pass struct foo as T; it must be typedefed). In order to call any vector methods, you must prepend them with the type. For example, here is how you might use an int_vector:

vector(int);

int main() {
  int_vector v;
  int_vector_init(&v);
  int_vector_push(&v, 10);
  int_vector_set(&v, 0, 5);
  int_vector_get(&v, 0);
}

This macro based solution is closer to what C++ templates provide. One of the reasons higher level languages like C++ or Rust are preferred to C is that the programmer can naturally get generic data structures without having to think too much about it. C provides the capability to do all that, but with a bit more work from the programmer.

tweet Share