What You Need to Know About C

August 1, 2024

You may want to read this in tandem with our article Compiling and Running C and C++.

C was designed in the early 1970s at Bell Labs for the purpose of writing software for UNIX, including its kernel. It is known for being low level, very close to the hardware. Anything you can do on a computer, you can do in C! It is also a rather simple language, in that once you learn its rules, you can pretty much figure out what any piece of software written in C does. It is not going to do anything funny behind your back, as many modern languages do.

Why Study C?

To be clear, I don’t think you should write new independent software in C. There are far better languages for nearly any application. Rust, in particular, is an excellent choice for software that has traditionally been written in C. That said, there are two excellent reasons to be at least conversant in it: to know how to read code, write some small programs, and compile C software. First, C is the foundation of Linux and the entire UNIX ecosystem. Second, it is the best language with which to experiment with algorithms that require the use of raw pointers, though C++ and Go could also be in the running there.

This article intends to give you a jump-start in being able to read and understand C code, and be able to make some simple changes. It will not endeavor to make you a master C programmer.

Some examples on this site will be given in C, particularly to demonstrate lower-level concepts.

The Foundation of Linux

The Linux kernel is written in C, along with some assembly language and, lately, a bit of Rust. This means that syscalls, meaning the calls your application code makes into the Linux kernel, are defined in terms of C calling conventions. You need to be able to read system documentation which may well assume C knowledge. When you use utilities like strace to figure out which calls are being made by your application, the output from that will look like C calls.

There is also a lot of common user-space software, and its associated libraries, that is written in C. You may very well need to interface with these C libraries in your own code. One important example that is relevant to us is the PostgreSQL relational database system. Another is the system service manager systemd.

Algorithms

Most important algorithms and data structures in computer science use pointers. A pointer is a variable that points to another location in memory. When data is processed, the pointer must be followed. Most modern languages hide pointers to the point where a programmer rarely needs to deal with them. But if you’re going to either implement existing algorithms as a learning exercise, or experiment with new algorithms, you really need easy access to pointers. C is one of the best languages for this because it will not get in your way. Of course, with great power comes great responsibility. Use of pointers and associated memory allocation that fails to use proper precautions will yield bugs that can be very difficult to find.

A Basic C Program

The classic “Hello, World!” C program reads about like this:

#include <stdio.h>
int main(int argc, char *argv[]) {
    printf("Hello, World!\n");
    return 0;
}

Let’s break it down. Any line starting with # is processed by the C pre-processor. It takes special commands that inject code into the compilation unit that the compiler is currently working on. In this case it is including another file, namely stdio.h, which is the header file for the main input/output routines in the C standard library.

A header file is a list of resources – functions, types, variables, macros, etc. Resources declared in the header that are not in the current compilation unit are expected to be defined in another compilation unit (that is, another C file). The linker will ensure that the resource actually exists; if not, it will throw an error. (See separate article on compiling and running C programs.)

It should be noted that the C pre-processor is pretty dumb. This is just a blind inclusion of a file. You can, for example, end the file in the middle of a block and finish the block in the main program. (If you do this, you’re a horrible person and deserve unending grief!) This is in contrast to most other languages, whose inclusion mechanisms look at the thing they are including as a cohesive self-contained whole.

Moving on, we get to the declaration and definition of the main function. This is special in that the linker, when creating the binary executable that the CPU will actually run, will set main as the entry point into the program. While it can be defined to take no arguments (type void), traditionally it takes an integer argc, which is the count of the number of command line arguments given at runtime, and argv, an array of pointers to character strings that contain the actual arguments.

A block of code in C is delineated by braces ({ and }). Each line of normal code ends in a semicolon ;.

printf is a function that writes formatted output (f) to standard output. It takes a format string, followed by more arguments if the string has formatting codes in it. C allows for a function to have a variable number of arguments (the number of which is not known at the time of compiling the function); this is called a variadic function.

Finally, return 0 tells the function to return zero, which in turn is returned to the operating system. In Linux this is considered a success. If there is an error condition, main should return a non-zero integer.

C Types

Being a low level language, C is statically typed, meaning the compiler knows the type, and therefore the size, of every variable at compile time. It is also considered weakly typed, since it is fairly easy to treat a variable as a different type than it is defined to be. Of course, if this is done in a way that doesn’t make sense, a bug will be introduced, and the compiler may not tell you!

C is pretty limited in terms of built-in basic types: pretty much int for integers, char for a character, and float or double for a floating-point number. A boolean type arrived in C99 under the name _Bool, with bool, true, and false provided as macros by the stdbool.h header; they only became proper keywords in C23. Before C99, programmers simply used the integers 1 and 0!

The C specification doesn’t even specify exactly how many bits are in these basic types. On most modern 64-bit systems, an int is 32 bits. You can add long and short qualifiers; a long int may be 64 bits and a short int may be 16 bits. You can check the sizeof() of any type to know for sure how many bytes it will consume with your compiler.

Here’s a brief demo of basic types:

#include <stdio.h>

int main(int argc, char *argv[]) {
    int i = 5, j = 10;
    char c = 'a';
    int sum = i + j;
    double pi = 3.14159265358979;
    printf("My char is %c and the sum of %d and %d is %d\n", c, i, j, sum);
    printf("Also, Pi is about %lf\n", pi);
    return 0;
}

Here’s a quick program to show the size of the basic types:

#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("bytes/type\n");
    printf("%zu char\n", sizeof(char));
    printf("%zu short int\n", sizeof(short int));
    printf("%zu int\n", sizeof(int));
    printf("%zu long int\n", sizeof(long int));
    printf("%zu long long int\n", sizeof(long long int));
    printf("%zu float\n", sizeof(float));
    printf("%zu double\n", sizeof(double));
    return 0;
}

The output, from both GCC and Clang (the two main open source C compilers) on 64-bit x86_64 and ARM64 (the only architectures that are particularly relevant for Linux servers at this time) is the same:

bytes/type
1 char
2 short int
4 int
8 long int
8 long long int
4 float
8 double

(Just remember for the sake of portable code that it’s not guaranteed to be the same on other architectures or even compilers.)

Strings in C are simply a pointer to a character; the string starts at the target memory location and ends with a byte simply containing zero. This is referred to as a null-terminated string. The * character before the variable name denotes a pointer, so char *mystring would point to the beginning of a string.

This method of string processing opens up significant potential for bugs: If there is a buffer of a fixed size, and a string exceeds it or doesn’t have the null byte at the end, something processing the string can just go off into other memory. There are even functions in the C standard library that need extreme caution.

Here’s a quick test of inputting and outputting strings that can trigger the bug:

#include <stdio.h>

int main(int argc, char *argv[]) {
    char buf[20];
    printf("Enter your name: ");
    scanf("%s", buf);
    printf("Hello, %s!\n", buf);
    return 0;
}

We can run it…

Enter your name: Micah
Hello, Micah!

So far so good.

Enter your name: Micah Yoder
Hello, Micah!

Note that the %s format specifier stops reading at the first whitespace character.

Enter your name: TheQuickBrownFoxJumpsOverTheLazyDog
Hello, TheQuickBrownFoxJumpsOverTheLazyDog!
*** stack smashing detected ***: terminated

Uh-oh. We over-ran our 20-character buffer! Note that I tried it in different circumstances and did not always get the crash. It really depends on the memory layout of the process and whether stack smash detection is enabled in the compiler. You might also get a segmentation fault, caused by an attempt to access memory that the process is not allowed to touch. The bug can be fixed by changing the %s format specifier to %19s, which limits the input to 19 characters plus the null byte, fitting nicely in our 20-character buffer:

Enter your name: TheQuickBrownFoxJumpsOverTheLazyDog
Hello, TheQuickBrownFoxJum!

The moral of this is that you need to be very careful about memory usage in C. It is very easy to shoot yourself in the foot!

Structs and Functions

Next up is the struct, a collection of simple types into one more complex one. Here’s a quick example:

#include <stdio.h>

struct city {
    double latitude, longitude;
    int population, elevation;
    char name[30];
};

void print_city(struct city c) {
    printf("*** %s ***\n", c.name);
    printf("Location: %f, %f\n", c.latitude, c.longitude);
    printf("Population: %d, elevation: %d feet\n", c.population, c.elevation);
}

void input_city(struct city *c) {
    printf("Enter all on one line, separated by spaces: name latitude longitude population elevation\n");
    printf("Sorry, the city name can't have spaces or special characters.\n ==> ");
    scanf("%29s %lf %lf %d %d", c->name, &c->latitude, &c->longitude, &c->population, &c->elevation);
}

int main(int argc, char *argv[]) {
    struct city c = {35.10498, -106.63008, 654559, 5312, "Albuquerque"};
    print_city(c);
    printf("Now your turn.....\n");
    input_city(&c);
    print_city(c);
}

The city definition should be self-explanatory. When declaring a variable to be the type of a struct, you do need to use the struct keyword, unlike in most other languages.

The print_city function takes a city, c, as its parameter, and returns nothing (void). It accesses and prints the member variables as you’d expect.

Now we get to input_city. Note that it takes a pointer to the city, denoted by *c. We pass it by pointer, whereas in print_city we passed the whole thing by value, forcing the machine to copy the contents of the city so that the function has its own private copy. If print_city had modified the members, those changes would not have shown up in main(). But here we have a pointer to the structure, so any changes are made to the caller’s copy. Note that we call it using input_city(&c); the ampersand instructs the compiler to pass the address of the variable, that is, a pointer. The same is true inside the function when we pass its member parts to scanf(); we need it to modify them in place. Do note that arrays, such as strings that are in a character array buffer, are passed by pointer automatically, so we do not need the & there.

Also note how we access members of a struct. If we have the value itself, we just use a period, such as c.name. If we’re accessing it through a pointer, we use the -> operator, as in c->name.

Here’s an example run:

*** Albuquerque ***
Location: 35.104980, -106.630080
Population: 654559, elevation: 5312 feet
Now your turn.....
Enter all on one line, separated by spaces: name latitude longitude population elevation
Sorry, the city name can't have spaces or special characters.
 ==> Portland 45.58572 -122.67073 652503 161
*** Portland ***
Location: 45.585720, -122.670730
Population: 652503, elevation: 161 feet

A union is somewhat like a struct, except that its members occupy exactly the same spot in memory, so only one of them is valid. It’s like an enum in some other languages, except C has no built-in way to know which value is supposed to be active. So it’s up to the application to simply know by context which value can be accessed. Example:

#include <stdio.h>

union test {
    int i;
    float f;
};

int main(int argc, char *argv[]) {
    union test u;
    u.i = 5;
    u.f = 3.14159;
    printf("f: %f\n", u.f);
    printf("i: %d", u.i);
}

Output:

f: 3.141590
i: 1078530000

As you can see, the integer gets clobbered by the float. We can access it as an integer, but we won’t get what we expect! By the way, if we swap the assignments, we then clobber the float:

f: 0.000000
i: 5

Note that C doesn’t have a great way to pass the size of an array into a function, so the convention is to pass an integer along with the array:

#include <stdio.h>

double average(int n, double args[]) {
    double accumulator = 0;
    for (int i=0; i<n; i++) {
        accumulator += args[i];
    }
    return accumulator / n;
}

int main(int argc, char *argv[]) {
    double a1[] = {3.14159, 2.71828};
    printf("First average: %lf\n", average(2, a1));
    double a2[] = {38.34532145, 372.2637214, 62.473832, 936.3762185 };
    printf("Second average: %lf", average(4, a2));
}

This also introduces the for loop, which really only has one form in C. There are three expressions, separated by semicolons. The first is just something it runs prior to the loop, which is usually the initializer of the loop counter. The second is a condition, checked before every cycle; the loop keeps running while it is true and stops as soon as it evaluates to false. Since the count is zero based, we check to ensure it’s less than our count n; when it’s equal to n we want to exit. The third expression tells it what to do at the end of every cycle. i++ just increments the loop counter i.

Memory Management

Now we get to the unpleasant topic of C memory management. Example to look at:

#include <stdbool.h>  // bool, true, and false (needed before C23)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct person {
    char name[50];
    unsigned int age;
    bool likes_pineapple_on_pizza;
};

void print_person(struct person *p) {
    if (!p) {
        printf("You passed me a null pointer, nothing here!\n");
        return;
    }
    printf("Data for:  %s\n", p->name);
    printf("Age: %u\n", p->age);
    printf("%s\n", p->likes_pineapple_on_pizza ? "Civilized and respectable individual" : "Uncivilized savage");
}

int main(int argc, char *argv[]) {
    struct person *p = NULL;   // Or just 0
    print_person(p);
    p = malloc(sizeof(struct person));
    if (!p) {
        printf("Couldn't allocate memory! :'(\n");
        return 1;
    }
    strcpy(p->name, "Micah");
    p->age = 12;  // Well, when I got started programming
    p->likes_pineapple_on_pizza = true;  // duh
    print_person(p);
    free(p);
    return 0;
}

In main(), we first declare a variable, p, which is a pointer to a person. We first assign it NULL, then pass it to our print function, which behaves appropriately. At this point, it is nothing more than a pointer which does not point to anything.

Next we call malloc(), defined in the stdlib.h header. This is the primary method of allocating memory in C. We pass in the amount of memory we need in bytes, which we just ask the compiler to provide us using sizeof(struct person). Upon success, it returns a non-NULL pointer to the new memory, which is on the heap. Upon failure, it will just return a null, which we can catch with !p and exit if we don’t have the memory.

By the way, we should review the concepts of the two kinds of memory in a program so we’re all on the same page: the stack and the heap. The stack holds the function-local variables; everything we have used up till now. Function parameters conceptually go there too: when we call a function, its parameters are pushed one by one onto the stack (in practice, compilers on many platforms pass the first few in CPU registers instead). The compiler always knows exactly where everything is on the stack. The heap, on the other hand, is a separate area of memory that is used for dynamically allocated structures. If we don’t know how large something might be, or if it’s too large for the stack (there’s a limit), or if it must stay alive after the function that created it returns (for example, because other functions or threads will keep using it), it needs to go on the heap.

With our newly allocated memory on the heap for a person, we can start to put our data there. We can’t just assign strings; they have to be copied with the strcpy() function from the string.h header. We can just assign simple types like integers and booleans (or even single characters), however.

We then have to be careful to free our memory. If we don’t, there will be a memory leak. All memory is freed when the program exits, so we could get away without the free() here. But in a long-running program with complex logic, where various structures go in and out of needing to exist, it’s critical to properly free memory. If you don’t, you’ll eventually run out of memory and make the program (or system) crash.

It’s also worth noting the conditional operator in the print_person() function, within the printf call. In summary, it is: <expr> ? <value_if_true> : <value_if_false>. It evaluates to the second or third expression, depending on whether the first expression is true or false. It can be very handy!

Sharing variables and functions between source files

Often you want to define functions that will be used in more than one source file. Occasionally you may want a global variable declaration, though that is usually bad practice. (Normally, any source file containing a variable should contain all the functions that operate on it, and other files would just do what they need by calling those functions.)

But here’s how to do both. I have three files:

Our functions and variable are declared in test.h:

#ifndef CPLAY_TEST_H
#define CPLAY_TEST_H

extern int shared_val;
void fun_in_main();
void fun_in_test();

#endif //CPLAY_TEST_H

Note the preprocessor directives. This is just to prevent the “meat” of the header from being included more than once. That could happen, for example, if headers include other headers. You may have it being included more than once in the same compilation unit, and that would cause re-declaration errors, or even allow recursive inclusions. They are simply saying “if CPLAY_TEST_H is not defined, define it and continue; if it is defined, just skip to the end and do nothing.”

Our shared variable is declared as extern, meaning the linker is expecting its definition to be somewhere in one of the C files. The functions are just the prototypes, without the bodies.

In main.c, we include the header and define one of the functions. In main() we initialize the shared variable and call the other function, which is defined in the other file.

#include <stdio.h>
#include "test.h"

void fun_in_main() {
    printf("I'm in main! Value is %d\n", shared_val);
}

int main(int argc, char *argv[]) {
    shared_val = 10;
    fun_in_test();
}

Finally, in test.c we have shared_val actually defined. It is fine that test.c also includes the header in which the variable is declared extern; the declaration and the definition do not conflict. But you do not want to define it in more than one C file, or there will be a linker error.

#include <stdio.h>
#include "test.h"

int shared_val;

void fun_in_test() {
    printf("I'm in test!\n");
    shared_val++;
    fun_in_main();
}

The program flow is thus: main() just gives shared_val a value and calls fun_in_test(), defined in the other file. It prints a message and increments the variable, then calls fun_in_main() which uses the shared variable.

Pointer arithmetic

An important aspect of C is that its pointers can be manipulated; indeed many algorithms work by iterating through data structures with pointers. A simple example of iterating through an array:

#include <stdio.h>

int main(int argc, char *argv[]) {
    double numbers[] = {3.32597425, 9.2314375, 12.3475854, 1.342955,
                        9.1234654365, 7.2153783456, 2.3593863, -1.0};

    double *p = numbers;
    while (*p > 0.0) {
        printf("Number %lf at location %p\n", *p, (void *)p);
        p++;
    }
}

We’re also seeing the while loop here for the first time; it simply iterates until its condition is false. The condition is that *p, which is the value that the pointer p is currently pointing at, is greater than zero. Our array, of course, ends with a negative value.

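We then print it and its pointer. You’ll notice that each time, the pointer increases by 8 bytes, which is the exact size of a double. That is because pointer arithmetic works in units of the pointed-to type: p++ advances the pointer by sizeof(double) bytes, moving it to the next element rather than to the next byte.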

A bit more advanced: Using the qsort() function and pointers to functions

Not only is the qsort() function useful, but it will also get us into pointers to functions, which you’ll want to recognize when you see them in code.

Here we go:

#include <stdio.h>
#include <stdlib.h>

struct color {
    char name[20];
    float red, green, blue;
};

int comp_green(const void *arg1, const void *arg2) {
    const struct color *c1 = (const struct color*)arg1;
    const struct color *c2 = (const struct color*)arg2;
    if (c1->green < c2->green) return -1;
    if (c1->green > c2->green) return 1;
    return 0;
}

int main(int argc, char *argv[]) {
    struct color my_colors[] = {
            {"Red", 1.0, 0.0, 0.0},
            {"Green", 0.0, 1.0, 0.0},
            {"Blue", 0.0, 0.0, 1.0},
            {"Olive", 0.59, 0.62, 0.11},
            {"Pink", 1.0, 0.1, 0.9},
            {"Cyan", 0.0, 0.92, 1.0},
            {"Yellow", 0.91, 1.0, 0.0}
    };

    int n = sizeof(my_colors) / sizeof(struct color);  // calculate number of items in our array
    qsort(my_colors, n, sizeof(struct color), comp_green);

    printf("The colors sorted by their green content:\n");
    for (int i=0; i<n; i++) {
        printf("%s\n", my_colors[i].name);
    }
}

qsort() comes from stdlib.h, so we include that.

Let’s look at the two lines under the color definitions. We first compute n, the number of elements, by taking the size of the whole array divided by the size of the type of one element. In this case that’s 7.

We call qsort() by passing:

First, the array as a whole, which is passed as a pointer to the first element
Then, the number of elements, n
Then, the size of each element
Finally, the comparison function, which is passed as a pointer to the function

About the comparison function: It takes pointers to two elements of the type we’re comparing. The qsort algorithm internally decides which elements to call it on. Unfortunately, C is not very ergonomic here; we must accept const pointers to void. (Modern languages would allow us to pass in the exact type we need.) So we must convert each void pointer to a pointer to our type, const struct color *. This is what (const struct color*)arg1 does. It just tells the C compiler “I know this doesn’t look like the right type to you, but trust me, I know what I’m doing.” (Strictly speaking, C converts from a void pointer implicitly, so the cast is optional here, but it makes the intent explicit.)

So we then have our appropriately typed variables. Here we just compare the green content of the two arguments. qsort() expects our function to return a negative number if the first argument is less, a positive number if the first argument is greater, or 0 if they are equal; -1 and +1 are the conventional choices.

Let’s continue with a bit of polymorphism in C. Polymorphism here means calling through a single interface when the function that actually runs is only chosen at runtime.

#include <stdio.h>

int square(int x) {
    return x * x;
}

int twice(int x) {
    return x + x;
}

int negative(int x) {
    return 0 - x;
}

typedef int(*do_int_thing)(int);

int main(int argc, char *argv[]) {
    do_int_thing funcs[] = {square, twice, negative};
    printf("Num Square Double Negative\n");
    for (int i=0; i<20; i++) {
        printf("For %d: ", i);
        for (int j=0; j<3; j++) {
            int x = (funcs[j])(i);  // Call a function through a pointer
            printf("%d ", x);
        }
        printf("\n");
    }
}

First we define three simple functions that operate on an integer and return an integer.

Next we define a typedef – a way to associate a name with a type that might be more complex. In its simple form, we can say something like typedef int mytype, which will simply create an alias mytype for int.

The syntax for defining a type for function pointers is rather more confusing, however. It’s basically the return type of the function, then in parentheses an asterisk followed by the new type name, followed by the types of its arguments in another set of parentheses.

In main(), we first define an array containing all three of our functions. We could pass these around and add them dynamically based on user input, but here it’s just a static list of what we have.

For the meat of the program, we loop from 0 to 19 in the outer loop with counter variable i, and from 0 to 2 in the inner loop with counter j, which will be an index into our array of functions. Then we call it. Instead of just having the name of a function to call, we put our value, which contains a pointer to the function, in parentheses - (funcs[j]), just an index into our array to get the right pointer. That is followed by another set of parentheses with the argument we’re passing to it.

Conclusion

We’ve done a whirlwind tour of the most important facets of the C language. Hopefully it will help you as you encounter C code in the wild.

References

Great C/C++ reference – C is at the bottom of the page. https://en.cppreference.com/
