C++ Tutorial 5 - Arrays and Strings Using Collatz

This is the fifth tutorial in a series on the C++ language. If you have not read the previous four or are not already familiar with their content the topics covered in this section will probably be hard to follow. Today will at look arrays which are a closely related to the material which we saw in the third article. Arrays are contiguous, homogenous segments of memory which are all of the same type. That means that you can have an array of all ints, or an array of all doubles, or any other data type. The advantage to this might not seem too useful and without loops they are not particularly useful but with loops arrays let us work on a large amount of data in a fast, uniform fashion. Vectors, like what you might find in a physics class, are a closely related concept in mathematics to arrays, but thankfully arrays are a much simpler structure to work with because much of the messy mathematics falls away when we can operate on individual elements inside of an array. With that introduction we will start looking at some of the things which we can do with arrays.

In order to illustrate how arrays work we will return to our old example of the Collatz-conjecture except this time we will calculate the number of steps for the first one hundred integers and store those values in an array. After we calculate the number of steps for the first one hundred integers, arbitrarily set to one hundred, we will then try examining the array for information such as the average number of steps, the maximum value, and the number of unique steps.

<iostream>
using namespace std;

int main(int argc, char *argv[])
{
cout << "=== Starting Program ===\n" << endl;

const int SIZE_OF_DATASET = 100;

int array_of_steps[SIZE_OF_DATASET];

for(int i = 1; i <= SIZE_OF_DATASET; ++i)
{
int n = i;
int steps = 0;

while(n != 1)
{
if(n % 2 == 0)
{
n /= 2;
}

else
{
n = n * 3 + 1;
}

++steps;
}

array_of_steps[i - 1] = steps;
}

cout << "=== Ending Program ===\n" << endl;

return 0;
}

Like of the concepts which have introduced thus far arrays require slightly different syntax than what we have seen so far. First of all we have something called a "const int" in this program, "SIZE OF DATASET", which we have set to one hundred. The key word "const" stands for constant and prevents us from changing the value of SIZE OF DATASET, in other words making it immutable, anywhere in the program. The next line of code is "int array_of_steps[SIZE OF DATASET];" which creates an array of size 100, equal to SIZE OF DATASET, of type int. At this point you might be a little confused so we will slow down and explain what these two lines of code actually do. The first line of code creates a constant integer which we use to refer to the size of the array. In the C++ language arrays have to have a constant size of elements. At a hardware level arrays create a contiguous, homogeneous segment of memory of size n which the compiler can access by multiplying the size of the data type, this is why the array has to be of one type which we call homogeneous, by the index of the array. On most computers an int eight bytes so in order to get the "nth" element from an array of ints we could use this short formula n * 8 + base address to calculate the index for an array of ints. The base address is not something that we have to worry about because the operating system, which ever operating system you areworking on will have its own way of managing memory but they will manage this for us, but the array has a definite size which means that it has to have definite start and end as well. If we have the base address and the length the we can calculate the end so we only need two pieces of information and not three.

The next thing which many beginners find confusing about arrays that they have zero based number system which means that we start counting elements from the zeroth position and not the first. This might sound confusing at first but mathematically it works out perfectly. Suppose that we create an array of size 100 at the base address 1000 in system memory. If we want the "first" element in that array we would want to refer back to that formula above. If we start counting at one then the first address we would get is 1 * 8 + 1000 which is equal to 1008. Obviously we are leaving out the "first" element of the array, and even worse is that if we continue in this fashion all the way up to 100 then we would eventually have an operation that looks like 100 * 8 + 100 which would be equal to 1800. If each element in the array is an int, and an int is equal to 8 bytes in length then that would mean that our array would end at address 1808 in memory. But the operating system only allocated 100 * 8, 800, bytes of RAM for our array so that would mean that we are accessing bytes which are outside of the array. Obviously this is wrong and in fact if you tried to do this then the program would most likely crash because the you would probably be accessing memory from another application. Now let’s look at this example with a zero based number system. If we want the "first" element of an array of integers, and that array starts at address 1000, then our little formula would look something like 0 * 8 + 1000. If our integers are 8 bits wide then that element would end at 1008 and if we want the last element, 100 – 1, then we would get 99 *+ 1000 which would end at 1800. So it seems like zero based number systems are harder to use then our more organic way of counting, but this significantly reduces the complexity of programs and once you start using them it will become quite familiar. The only thing to remember about this is that we count the "first" element, zeroth element is what you should use, from zero and not one and the last element is n – 1 because we counted from zero and not from one.

Moving on in the program we see our old friend the for loop with an upper limit of SIZE_OF_DATASET, which is always 100 because it is constant, and a lower bound of one. If you think about the arithmetic behind this loop and compare that with the previous paragraph where we talked about counting from zero and not from one this might seem contradictory. If we look at the rest of the loop however it become clear why the boundaries are set the way that they are. Moving forward in the loop you probably recognize the Collatz example from the third lesson is more or less unchanged. At the end of that while loop we see line of code "array_of_steps[i - 1] = steps;". This line of code takes care of two things. First of all the minus one operation handles the problem which we specified above, this problem was chosen to illustrate the off by one, "OBO", errors, and second it inserts the number of steps taken to converge to one, held in the variable "steps", into the array at the index i minus one. So now we have a program which can calculate the number of steps it takes the collatz conjecture to converge to one from one to some arbitrary n which we have set to 100. Let's try expanding on this program to give us the average number of steps, mean, the most number of steps and the number of unique steps.

#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
cout << "=== Starting Program ===\n" << endl;

const int SIZE_OF_DATASET = 100;

int array_of_steps[SIZE_OF_DATASET];

for(int i = 1; i <= SIZE_OF_DATASET; ++i)
{
int n = i;
int steps = 0;

while(n != 1)
{
if(n % 2 == 0)
{
n /= 2;
}

else
{
n = n * 3 + 1;
}

++steps;
}

array_of_steps[i - 1] = steps;
}

int max = 0;
double avg = 0;
int num = 0;

for(int i = 0; i < SIZE_OF_DATASET; ++i)
{
if(array_of_steps[i] > max)
{
max = array_of_steps[i];
}

avg += array_of_steps[i];

bool unique_value = true;

for(int j = 0; j < i; ++j)
{
if(array_of_steps[i] == array_of_steps[j])
{
unique_value = false;
}
}

if(unique_value)
{
++num;
}
}

avg /= SIZE_OF_DATASET;

cout << "The maximum number of steps from 1 to: " << SIZE_OF_DATASET << " is: " << max << endl;
cout << "The average number of steps from 1 to: " << SIZE_OF_DATASET << " is: " << avg << endl;
cout << "The number of unique values from 1 to: " << SIZE_OF_DATASET << " is: " << num << endl;

cout << "=== Ending Program ===\n" << endl;

return 0;
}

The code above should print out the maximum number of steps, the average number of steps, and the number of unique steps, some values for "i" might have the same number of steps, required to calculate the our dataset. Right after our previous program left off we create three variables, max, an integer, avg, a double which is a variation on floating point values, and num which we will use to record the number of unique steps taken. After that we have another loop with a more "proper" upper and lower bound however there are some interesting things going on in the loop. First off we have an if statement which compares array_of_steps at index of i to the integer max. This conditional statement keeps track of the maximum value that we have seen so far in the array, if the next value in the array is greater than max then we set max equal to that value and if it is less than or equal to we just move on. After that we have the summation for the average value of steps which we will divide after the loop has finished. But right after all that we see something which is perhaps a little less obvious. First of all there is a bool which we set to true, we’ll talk about what that does in a moment, after the bool is initialized we have a second loop which has a bound of i. This is a fairly common way of doing what we call a "nested loop". We did not go over this example in tutorial three but we can have two loops after another, like the first loop where we create the values in the array, and we can have two loops inside of another. This has significant consequence when we set the upper bound of the inner loop to a value which is equal to, or approaches the upper bound of the outer loop. Regardless of what the contents of the inner loop are if the upper bound of the outer loop is n, and the upper bound of the lower loop is "close" to n, the number of times we iterate through the inner loop actually happen n^2 times. We will not jump off the deep end into this branch of theory, formally called complexity theory, but now that we can use loops and arrays we need to know some of the pitfalls of using them. Finally getting inside of the loop we have another if statement which tests if the element of the array at index i is equal to the element of the array at index j, which is always at least one less than i, then we set that bool to false. The purpose of the bool should become obvious after the loop inner loop when we test if it is still true. If it is true then the number of unique values, num, is incremented and if it is not then the program jumps to the start of the loop. After the outer loop has finished there is some small amount of housekeeping where we have to divide the average number by the size of the dataset and the of course we have to display our findings.

This lesson might seem to be lacking at times but because there is no way of concisely addressing this topic, because of how many other things are tangentially related to arrays, so we will have to make do and struggle forward and deal with those concepts as we encounter them.

As a final note, an important pair of concepts which we glossed over in previous lessons were characters and strings. A brief reminder of how characters work is necessary here because we will need them for some of the example code. Characters are fundamentally a "map" of integers to characters which takes a number, generally an eight-bit number but it depends on your system, and maps that number to a character. For example, the letter ‘A’ is stored in a C++ program as the number "65" and the letter ‘B’ as "66" and so on with each character in the Latin alphabet being stored one after another and the lower case letters, like ‘a’ and ‘b’, being stored thirty-two bits above the uppercase. There are of course digits, starting at "48" with the digit ‘0’, and special characters like the ampersand, ‘&’, but for now we will only worry about letters which we will use to form words. Strings are an array of characters all laid out in orderly fashion in both memory, physically, and logically, in a permutation of characters which make words, so that a computer can interpret human friendly data and so that a user has the ability to use a computer in an easy way.

#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
cout << "=== Starting Program ===" << endl;

char sentence[37] = {'T', 'h', 'i', 's', ' ' , 'i', 's', ' ', 'h', 'o', 'w', ' ', 'y', 'o', 'u', ' ', 'r', 'c', 'o', 'r', 'd', ' ', 's', 't', 'r', 'i', 'n', 'g', 's', ' ', 'i', 'n', ' ', 'C', '+', '+', '\0'};

for(int i = 0; i < 37; ++i)
{
cout << sentence[i];
}

cout << "\n=== Ending Program ===" << endl;

return 0;
}

A brief note on the syntax in this program as opposed to the previous is that here we use something called array initialization where we are declaring the contents of the array when the array itself is created. The contents are defined between the curly braces, {}, and they have to be separated with a comma. The syntax to create an array might seem like we are creating a single variable, in this case named sentence, but what we are actually doing is creating a whole bunch, thirty-seven to be exact, characters which we have initialized, the initialization occurs between the two curly braces and requires a comma between each element, to the phrase, "This is how you record strings in C++". And indeed it is, although you should note that there is also a string data type which we will also look at. Another thing worth mentioning is that if you ever have to work with this array of characters method you always have to end the string with a "NULL value" which is the ‘\0’ character at the very end. This character tells the compiler that the end of the string has been reached and there are no more characters to write to memory. A brief note on the syntax is probably necessary here, first of all you need to either have a number inside of the first two braces, in this case we have "[37]", or have the initialization. One final thing to mention is that this is called a “C String” which is of course the standard way of interacting with strings in the C programming language.

While strings and arrays in particularly are closely connected, strings in C++ are created using another type aptly named, "string". This type has something called functions which we can "call"in order to "mutate"the array of chars. Something which we have already mentioned is illegal in C++ is to access an index of an array which is greater than the size of the array. Again, the string type in C++ takes care of this for us if we were to, for example, insert something new into the string like if we wanted to replace certain characters for words in a string. That is exactly what do below where we take a word from the user, scan that word for each letter, and then we replace that with the "letters" of the NATO phonetic alphabet. If you do not understand all of the workings of the program below then do not panic because we have not encountered all of the concepts yet, the major area of confusion is probably the functions calls which the next lesson will address.

#include <iostream>
#include <string>
#include <stdlib.h>

using namespace std;

int main(int argc, char *argv[])
{
cout << "=== Starting Program ===" << endl;

string input;
string output;
string nato[] = {"alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel", "india", "juliett", "kilo", "lima", "mike", "november", "oscar", "papa", "quebec", "romeo", "sierra", "tango", "uniform", "victor", "whiskey", "xray", "yankee", "xray"};

cout << "Input a string to be changed: ";
cin >> input;
cout << sizeof(nato) / sizeof(*nato) << endl;

cout << "The string you entered was: " << input << endl;

for(int i = 0; i < input.size(); ++i)
{
input[i] = toupper(input[i]);
}

for(int i = 0; i < input.size(); ++i)
{
if(isalpha(input[i]))
{
output.append(nato[input[i] - 'A']);
output.append(" ");
}
}

cout << "And after changing it we get: " << output << endl;

cout << "=== Ending Program ===" << endl;
}