Critique of C++ code for encoding and decoding text

Jamin2112 · Aug 22, 2013

These are some functions I'm using to encode and decode text.

bitstring returns a string of bits representing the ascii values of the characters in the string passed to the function. For example, passing in "Hi" changes it to "0100100001101001" since (int)'H'=72=01001000 and (int)'i'=105=01101001. The function compress_bitstring takes any sequence of 3 to 9 of the same bits and compresses them by putting the number followed by the bit. For example, "0100100001101001" is changed to "01001401101001". Then the other 2 functions are for undoing the functions I just described.

Code:

#include <iostream>
#include <sstream>
#include <string>
#include <math.h>


static const size_t charbits = 8;  // note that sizeof(char) is 1 by the C++ standard


static void bitstring(std::string& str) { 
	std::string str2(charbits*str.length(), '0');
	for (unsigned i = 0; i < str2.length(); i += charbits) { 
		int this_ascii = (int)str[i/charbits];
		for (unsigned j = 0; j < charbits ; j++) { // NOTE: ASCII values are 0-127, 00000000-01111111
			if (this_ascii % 2) str2[i+charbits-j-1] = '1';
			this_ascii /= 2;
		}
	} 
	str = str2;
} 


static void undo_bitstring(std::string& str) { 
	for (unsigned i = 0; i < str.length(); i += charbits) { 
		int ascii_val = 0; 
		for (unsigned j = 0; j < charbits; j++) { 
			if (str[i+j] == '1') ascii_val += (int)exp2(charbits-j-1);
		}
		str[i/charbits] = (char)ascii_val;
	}
	str.erase(str.begin()+str.size()/charbits, str.end());
}

static void compress_bitstring(std::string& str) { 
	std::stringstream ss;
	std::string cur_num;
	int count = 1;
	for (unsigned i = 0; i < str.size()-1; ++i) {
		if (str[i] == str[i+1]){
			++count;
			cur_num = str[i];
			if (i == str.size()-2) {
				ss << count << cur_num;
			} else {
				if (count == 9) {
					ss << count << cur_num;
					count = 0;
				}
			}
		} else if (count > 2) {
			ss << count << cur_num;
			count = 1;
			if (i == str.size()-2) {
				ss << str[i+1];
			}
		} else if (i != str.size()-2) {
			for (int j = 0; j < count; ++j) {
				ss << str[i];
			}
			count = 1;
		} else {
			ss << str[i] << str[i];
		}
	}
	str = ss.str();
}

static void decompress_bitstring(std::string& str) {
	unsigned i = 0;
	int count;
	std::stringstream ss;
	while (i < str.size()) {
		std::stringstream to_int;
		to_int << str[i];
		to_int >> count;
		if (count > 1){
			for (int j = 0; j < count; ++j) {
				ss << str[i+1];
			}
			i = i + 2;
		} else {
			ss << str[i];
			++i;
		}
	}
	str = ss.str();
}

Grep · Aug 26, 2013

Firstly, where are the tests? Comments? Both are important, and would even increase the odds on people taking the time to check over your code, since it makes it easier to read, and gives them a way to run it without writing their own test code.

Why static functions? Do you really intend them to only be visible in this compilation unit? Make sure you know what static does to a function in C++.

No putting things on one line that should really be on two or more. I still suggest using curly braces, even when not strictly necessary, but definitely don't mash things onto one line just to save space.

Use #include <cmath>, not #include <math.h>. Though they both should work, using math.h *in C++* is deprecated. It may stop working, eventually. Same goes for a number of other header files with a C origin.

You can use <climits>, also, which contains CHAR_BIT, which you should most definitely use rather than defining your own. Try not to assume its value is 8.

Use static_cast and friends, and not C style casts. For a good explanation, see the top answer here: c++ - Should I really use static_cast every single time I want to convert between primitive types? - Stack Overflow

In for loops, use pre-increment, not post. For one, it is simply idiomatic and semantically correct. In some cases, it can be faster, but I won't get into that. But consider the difference between 'x = ++y;' and 'x = y++'. Then think which is more appropriate, semantically, for the for loop increment. I found this, and the top answer seems ok: c++ - Avoid Postfix Increment Operator - Programmers Stack Exchange

I'd normally suggest using bit shifting, masks, and so on rather than multiply/divide/exp2/etc when using factors of 2. It's faster, generally, and any programmer that doesn't understand what's going on is probably not well qualified. However, all that is still more complicated than necessary.

I'd suggest you look very closely at bitset. I wrote my own versions of your bitstring and undo_bitstring functions using bitset, and it is far, far simpler and more easy to maintain. Do these functions *really* end up being the bottleneck in a program, such that you need to optimize them? Doubtful. Don't optimize unnecessarily. Don't be wasteful and inefficient, sure, but don't waste time and increase maintenance costs just to squeeze a few nanoseconds, either! In this case, use the already written bit handling code in bitset. Consider that over 100,000 iterations with the string 'Hi' to binary digits and back with bitset, on my system, took slightly over 100ms, whereas bit shifting and so on was about 55ms. Not worth the trouble in most cases.

Not huge on the way you're modifying the parameter, either, though technically ok (not a const parameter). Since your str = str2; line will copy anyways, why not just return the result by value? Still, particularly depending on intended use, there's room for argument, here, I suppose.

You might also document your functions. Look at doxygen. For example:

Code:

/*! Convert string parameter to binary representation.
 *
 * \param str String of characters to convert to a string representation of bits.
 * \return String of parameter's binary representation.
 */

A note about comments that can't hurt: They should be brief and to the point, explaining what you're doing in general or pointing out complicated parts. Let them guide the reader. They should aid any reader in comprehending the code quickly, and yet not space out the code unnecessarily or be annoying to maintain. You must *always* check comments related to code when you change the code, so overdoing it isn't great either. It's a tough line to walk, but no comments is almost always too few.

Use size_t when appropriate. Why use unsigned (int) when it should be size_t?

I've long thought that knowing C is sometimes a barrier to learning C++ properly. In a way, forget that the two languages are related, and don't just do things the way you might in C by default. It's so often not the right way to do it. Just a personal observation.

Do check out features in the new C++11 standard, as well. There's some awesome bits in there. Ranged based for loops, initializer lists... Much goodness.

Haven't really had much of a look at the compress/decompress functions, but you should have plenty to work with, I hope. If you post bitset based versions of the bitstring and undo_bitstring functions, I'll show you what I wrote and we can compare.

I also wrote my own versions of a function to calculate the mode of a vector of integers (mentioned in the last thread you posted like this). Also a good exercise for you, perhaps.

Comments on my comments are welcome, as usual, from anyone. Hope this helps you! Good luck.

Jamin2112 · Aug 26, 2013

Grep said:

I also wrote my own versions of a function to calculate the mode of a vector of integers (mentioned in the last thread you posted like this). Also a good exercise for you, perhaps.

How did you do it?

I think that making a binary tree class does the job the best. Any comments on the following code?

Code:

using namespace std; 

class binArr { 
    
private: 
    typedef struct node { 
        node* next[2];
        int cnt;
    };
    node* root; 
    node* newNode();
    void clearBranch(node*);
    int mostVal;     
    int mostCnt;

    public: 
    binArr();
    ~binArr();
    void add(int);
    void addArray(int, int*); 
    int getMost(){return mostVal;}
};

void binArr::addArray(int size, int* intArr) { 
    for (int i = 0; i < size; i++) { 
        add(intArr[i]);
    }
}

void binArr::add(int newVal) { 
    node* thisPtr = root;
    int thisVal = newVal;
    while(!thisVal) {
        int nextBit = thisVal % 2;
        if (thisPtr->next[nextBit] == NULL) {
            thisPtr->next[nextBit] = newNode();
        }
        thisPtr = thisPtr->next[nextBit];
        thisVal /= 2;
    }
    thisPtr->cnt++;
    if (thisPtr->cnt > mostCnt) { 
        mostCnt = thisPtr->cnt;
        mostVal = newVal;
    }
} 

binArr::node* binArr::newNode() { 
    node* newNode = new node;
    newNode->cnt = 0;
    newNode->next[0] = NULL;
    newNode->next[1] = NULL;
    return newNode;
}


void binArr::clearBranch(node* branch) { 
    if (branch) { 
        clearBranch(branch->next[0]);
        clearBranch(branch->next[1]);
        delete branch;
    }
} 

binArr::binArr() { 
    root = newNode();
    mostCnt = 0; 
}

binArr::~binArr() { 
    clearBranch (root);
}

int main() { 
    int arr[] = {9, 3, 2, 11, 87, 4, 3, 9, 21, 11, 91, 11, 9, 2, 9};
    binArr B; 
    B.addArray(sizeof(arr)/sizeof(int), arr);
    cout << "The mode of arr is " << B.getMost();
    return 0;
}

Jamin2112 · Aug 26, 2013

Grep said:

Firstly, where are the tests?

Should I make a separate .h file that has the machinery for running tests?

Grep · Aug 27, 2013

Jamin2112 said:

Should I make a separate .h file that has the machinery for running tests?

In general, you can have a .h for your class/function declarations as usual, and perhaps a separate test program that uses them. There are many ways to do it. Though for our purposes, a main() in the same file that runs through some more comprehensive tests works, and makes our lives easier (no Makefiles and linking to worry about, for example). In real life, you'd maybe look at some kind of test framework.

Some additional tests might reveal an issue with your (mode) algorithm. For example, add the number 43 to the end of the array and see what happens.

Other than that, incorporate advice you've already received, I'd say. A lot of previous advice you've gotten applies to the mode code as well.

Jamin2112 · Aug 27, 2013

Grep said:

Some additional tests might reveal an issue with your (mode) algorithm. For example, add the number 43 to the end of the array and see what happens.

It returned 43 as the mode. I can't figure out why this happened though ...

Grep · Aug 27, 2013

Jamin2112 said:

It returned 43 as the mode. I can't figure out why this happened though ...

Looking at your add() function, there's some issues there. It's not doing what you're expecting, so find out exactly what's happening.

I usually develop with various std::cout calls to see what's going on while I develop. But you also need to really learn to use and love your debugger. Step through the code and see what happens. It should be apparent rather quickly what it's doing and why it always returns the last element in the input array.

I've noticed so many professional developers that don't use the debugger, and that's a shame. I've never understood why that is, but don't be one of those people. Possibly some of the most important advice I give to neophite programmers, and sadly one of the most ignored.

Jamin2112 · Aug 27, 2013

Grep said:

Looking at your add() function, there's some issues there. It's not doing what you're expecting, so find out exactly what's happening.

I usually develop with various std::cout calls to see what's going on while I develop. But you also need to really learn to use and love your debugger. Step through the code and see what happens. It should be apparent rather quickly what it's doing and why it always returns the last element in the input array.

I've noticed so many professional developers that don't use the debugger, and that's a shame. I've never understood why that is, but don't be one of those people. Possibly some of the most important advice I give to neophite programmers, and sadly one of the most ignored.

That's great advice. Thanks!

Grep · Sep 22, 2013

Your latest thread reminded me of this one, and that I had written my own code to show you how I might approach these problems. I may as well share them so you have the chance to learn from them. You've already done the work, and enough time has passed that it's not likely to be of use to you in cheating on anything, so I presume the mods won't mind. Here are far simpler versions of your encoding/decoding functions:

Code:

#include <string>
#include <bitset>
// For CHAR_BIT, i.e. number of bits per char. Usually 8 bits, but you
// never know.
#include <climits>

/*! Convert string parameter to binary representation.
 *
 * \param str The character string to convert to a string of bits.
 * \return String of parameter's binary representation.
 */
std::string bitstring(const std::string& str)
{
    std::string result;
    result.reserve(str.size()*CHAR_BIT);

    for (size_t i = 0; i < str.size(); ++i)
    {
        result.append(std::bitset<CHAR_BIT>(str[i]).to_string());
    }

    return result;
}

/*! Convert string with binary representation back to a string of the
 *  characters it represents.
 *
 * \param str The character string of bits to convert back to a string if
 *     characters.
 * \return String with characters represented by the bits in the input string
 *     parameter.
 */
std::string undo_bitstring(const std::string& str)
{
    std::string result;
    size_t num_chars = str.size() / CHAR_BIT;
    result.reserve(num_chars);

    for (size_t i = 0; i < num_chars; ++i)
    {
        std::string sub_str = str.substr(i*CHAR_BIT, CHAR_BIT);
        char c = static_cast<char>(std::bitset<CHAR_BIT>(sub_str).to_ulong());
        result.append(1, c);
    }

    return result;
}

You could make faster versions, but the added complexity isn't worth the loss of clarity, unless it becomes a bottleneck. Note that I'm returning the resulting string by value. Return Value Optimization (RVO) will almost certainly eliminate the copying, anyways. Unless you find it to be a problem in a profiler, as a general rule, don't try and optimize prematurely. Don't be wantonly inefficient, either, mind you. For example, reserving space in those strings doesn't really make it harder to follow, and it's easy to see why that could help performance.

If anyone has any constructive criticism of this code, please do share. I'm always looking for ways to improve.

I also have a version of the mode finding function that is more efficient, flexible, shorter and more understandable. If you show me a working version of yours, I'll show you mine. :)

Jamin2112 · Sep 22, 2013

Grep said:

If you show me a working version of yours, I'll show you mine. :)

Can you give me a hint about what you did?

Grep · Sep 22, 2013

Jamin2112 said:

Can you give me a hint about what you did?

One hint: unordered_map. :)

For bonus points, so to speak, make it a template that can operate on a vector of arbitrary element types. For example, maybe it can operate on a vector of strings. The logic is basically the same.

I'll just add that I did use features from C++11. For g++, you can add the option '-std=c++11' to enable it.

Jamin2112 · Sep 22, 2013

Grep said:

One hint: unordered_map. :)

For bonus points, so to speak, make it a template that can operate on a vector of arbitrary element types. For example, maybe it can operate on a vector of strings. The logic is basically the same.

I'll just add that I did use features from C++11. For g++, you can add the option '-std=c++11' to enable it.

k I'll work on this

Critique of C++ code for encoding and decoding text

Related to Critique of C++ code for encoding and decoding text

1. What is the purpose of encoding and decoding text in C++?

2. How does C++ handle text encoding and decoding?

3. What are some common errors that can occur when encoding and decoding text in C++?

4. How can I improve the efficiency of my C++ code for text encoding and decoding?

5. Are there any best practices for encoding and decoding text in C++?

Similar threads

Hot Threads

Recent Insights