Critique of C++ code for encoding and decoding text

  • C/C++
  • Thread starter Jamin2112
  • Start date
  • Tags
    C++ Code
In summary: replace str with str2 in the caller, it's not technically a parameter modification, but more of a function call.
  • #1
Jamin2112
986
12
These are some functions I'm using to encode and decode text.

bitstring returns a string of bits representing the ascii values of the characters in the string passed to the function. For example, passing in "Hi" changes it to "0100100001101001" since (int)'H'=72=01001000 and (int)'i'=105=01101001. The function compress_bitstring takes any sequence of 3 to 9 of the same bits and compresses them by putting the number followed by the bit. For example, "0100100001101001" is changed to "01001401101001". Then the other 2 functions are for undoing the functions I just described.



Code:
#include <iostream>
#include <sstream>
#include <string>
#include <math.h>


static const size_t charbits = 8;  // note that sizeof(char) is 1 by the C++ standard


static void bitstring(std::string& str) { 
	std::string str2(charbits*str.length(), '0');
	for (unsigned i = 0; i < str2.length(); i += charbits) { 
		int this_ascii = (int)str[i/charbits];
		for (unsigned j = 0; j < charbits ; j++) { // NOTE: ASCII values are 0-127, 00000000-01111111
			if (this_ascii % 2) str2[i+charbits-j-1] = '1';
			this_ascii /= 2;
		}
	} 
	str = str2;
} 


static void undo_bitstring(std::string& str) { 
	for (unsigned i = 0; i < str.length(); i += charbits) { 
		int ascii_val = 0; 
		for (unsigned j = 0; j < charbits; j++) { 
			if (str[i+j] == '1') ascii_val += (int)exp2(charbits-j-1);
		}
		str[i/charbits] = (char)ascii_val;
	}
	str.erase(str.begin()+str.size()/charbits, str.end());
}

static void compress_bitstring(std::string& str) { 
	std::stringstream ss;
	std::string cur_num;
	int count = 1;
	for (unsigned i = 0; i < str.size()-1; ++i) {
		if (str[i] == str[i+1]){
			++count;
			cur_num = str[i];
			if (i == str.size()-2) {
				ss << count << cur_num;
			} else {
				if (count == 9) {
					ss << count << cur_num;
					count = 0;
				}
			}
		} else if (count > 2) {
			ss << count << cur_num;
			count = 1;
			if (i == str.size()-2) {
				ss << str[i+1];
			}
		} else if (i != str.size()-2) {
			for (int j = 0; j < count; ++j) {
				ss << str[i];
			}
			count = 1;
		} else {
			ss << str[i] << str[i];
		}
	}
	str = ss.str();
}

static void decompress_bitstring(std::string& str) {
	unsigned i = 0;
	int count;
	std::stringstream ss;
	while (i < str.size()) {
		std::stringstream to_int;
		to_int << str[i];
		to_int >> count;
		if (count > 1){
			for (int j = 0; j < count; ++j) {
				ss << str[i+1];
			}
			i = i + 2;
		} else {
			ss << str[i];
			++i;
		}
	}
	str = ss.str();
}
 
Technology news on Phys.org
  • #2
Firstly, where are the tests? Comments? Both are important, and would even increase the odds on people taking the time to check over your code, since it makes it easier to read, and gives them a way to run it without writing their own test code.

Why static functions? Do you really intend them to only be visible in this compilation unit? Make sure you know what static does to a function in C++.

No putting things on one line that should really be on two or more. I still suggest using curly braces, even when not strictly necessary, but definitely don't mash things onto one line just to save space.

Use #include <cmath>, not #include <math.h>. Though they both should work, using math.h *in C++* is deprecated. It may stop working, eventually. Same goes for a number of other header files with a C origin.

You can use <climits>, also, which contains CHAR_BIT, which you should most definitely use rather than defining your own. Try not to assume its value is 8.

Use static_cast and friends, and not C style casts. For a good explanation, see the top answer here: c++ - Should I really use static_cast every single time I want to convert between primitive types? - Stack Overflow

In for loops, use pre-increment, not post. For one, it is simply idiomatic and semantically correct. In some cases, it can be faster, but I won't get into that. But consider the difference between 'x = ++y;' and 'x = y++'. Then think which is more appropriate, semantically, for the for loop increment. I found this, and the top answer seems ok: c++ - Avoid Postfix Increment Operator - Programmers Stack Exchange

I'd normally suggest using bit shifting, masks, and so on rather than multiply/divide/exp2/etc when using factors of 2. It's faster, generally, and any programmer that doesn't understand what's going on is probably not well qualified. However, all that is still more complicated than necessary.

I'd suggest you look very closely at bitset. I wrote my own versions of your bitstring and undo_bitstring functions using bitset, and it is far, far simpler and more easy to maintain. Do these functions *really* end up being the bottleneck in a program, such that you need to optimize them? Doubtful. Don't optimize unnecessarily. Don't be wasteful and inefficient, sure, but don't waste time and increase maintenance costs just to squeeze a few nanoseconds, either! In this case, use the already written bit handling code in bitset. Consider that over 100,000 iterations with the string 'Hi' to binary digits and back with bitset, on my system, took slightly over 100ms, whereas bit shifting and so on was about 55ms. Not worth the trouble in most cases.

Not huge on the way you're modifying the parameter, either, though technically ok (not a const parameter). Since your str = str2; line will copy anyways, why not just return the result by value? Still, particularly depending on intended use, there's room for argument, here, I suppose.

You might also document your functions. Look at doxygen. For example:

Code:
/*! Convert string parameter to binary representation.
 *
 * \param str String of characters to convert to a string representation of bits.
 * \return String of parameter's binary representation.
 */

A note about comments that can't hurt: They should be brief and to the point, explaining what you're doing in general or pointing out complicated parts. Let them guide the reader. They should aid any reader in comprehending the code quickly, and yet not space out the code unnecessarily or be annoying to maintain. You must *always* check comments related to code when you change the code, so overdoing it isn't great either. It's a tough line to walk, but no comments is almost always too few.

Use size_t when appropriate. Why use unsigned (int) when it should be size_t?

I've long thought that knowing C is sometimes a barrier to learning C++ properly. In a way, forget that the two languages are related, and don't just do things the way you might in C by default. It's so often not the right way to do it. Just a personal observation.

Do check out features in the new C++11 standard, as well. There's some awesome bits in there. Ranged based for loops, initializer lists... Much goodness.

Haven't really had much of a look at the compress/decompress functions, but you should have plenty to work with, I hope. If you post bitset based versions of the bitstring and undo_bitstring functions, I'll show you what I wrote and we can compare.

I also wrote my own versions of a function to calculate the mode of a vector of integers (mentioned in the last thread you posted like this). Also a good exercise for you, perhaps.

Comments on my comments are welcome, as usual, from anyone. Hope this helps you! Good luck.
 
  • #3
Grep said:
I also wrote my own versions of a function to calculate the mode of a vector of integers (mentioned in the last thread you posted like this). Also a good exercise for you, perhaps.

How did you do it?

I think that making a binary tree class does the job the best. Any comments on the following code?

Code:
using namespace std; 

class binArr { 
    
private: 
    typedef struct node { 
        node* next[2];
        int cnt;
    };
    node* root; 
    node* newNode();
    void clearBranch(node*);
    int mostVal;     
    int mostCnt;

    public: 
    binArr();
    ~binArr();
    void add(int);
    void addArray(int, int*); 
    int getMost(){return mostVal;}
};

void binArr::addArray(int size, int* intArr) { 
    for (int i = 0; i < size; i++) { 
        add(intArr[i]);
    }
}

void binArr::add(int newVal) { 
    node* thisPtr = root;
    int thisVal = newVal;
    while(!thisVal) {
        int nextBit = thisVal % 2;
        if (thisPtr->next[nextBit] == NULL) {
            thisPtr->next[nextBit] = newNode();
        }
        thisPtr = thisPtr->next[nextBit];
        thisVal /= 2;
    }
    thisPtr->cnt++;
    if (thisPtr->cnt > mostCnt) { 
        mostCnt = thisPtr->cnt;
        mostVal = newVal;
    }
} 

binArr::node* binArr::newNode() { 
    node* newNode = new node;
    newNode->cnt = 0;
    newNode->next[0] = NULL;
    newNode->next[1] = NULL;
    return newNode;
}


void binArr::clearBranch(node* branch) { 
    if (branch) { 
        clearBranch(branch->next[0]);
        clearBranch(branch->next[1]);
        delete branch;
    }
} 

binArr::binArr() { 
    root = newNode();
    mostCnt = 0; 
}

binArr::~binArr() { 
    clearBranch (root);
}

int main() { 
    int arr[] = {9, 3, 2, 11, 87, 4, 3, 9, 21, 11, 91, 11, 9, 2, 9};
    binArr B; 
    B.addArray(sizeof(arr)/sizeof(int), arr);
    cout << "The mode of arr is " << B.getMost();
    return 0;
}
 
  • #4
Grep said:
Firstly, where are the tests?

Should I make a separate .h file that has the machinery for running tests?
 
  • #5
Jamin2112 said:
Should I make a separate .h file that has the machinery for running tests?

In general, you can have a .h for your class/function declarations as usual, and perhaps a separate test program that uses them. There are many ways to do it. Though for our purposes, a main() in the same file that runs through some more comprehensive tests works, and makes our lives easier (no Makefiles and linking to worry about, for example). In real life, you'd maybe look at some kind of test framework.

Some additional tests might reveal an issue with your (mode) algorithm. For example, add the number 43 to the end of the array and see what happens.

Other than that, incorporate advice you've already received, I'd say. A lot of previous advice you've gotten applies to the mode code as well.
 
  • #6
Grep said:
Some additional tests might reveal an issue with your (mode) algorithm. For example, add the number 43 to the end of the array and see what happens.

It returned 43 as the mode. I can't figure out why this happened though ...
 
  • #7
Jamin2112 said:
It returned 43 as the mode. I can't figure out why this happened though ...

Looking at your add() function, there's some issues there. It's not doing what you're expecting, so find out exactly what's happening.

I usually develop with various std::cout calls to see what's going on while I develop. But you also need to really learn to use and love your debugger. Step through the code and see what happens. It should be apparent rather quickly what it's doing and why it always returns the last element in the input array.

I've noticed so many professional developers that don't use the debugger, and that's a shame. I've never understood why that is, but don't be one of those people. Possibly some of the most important advice I give to neophite programmers, and sadly one of the most ignored.
 
  • Like
Likes 1 person
  • #8
Grep said:
Looking at your add() function, there's some issues there. It's not doing what you're expecting, so find out exactly what's happening.

I usually develop with various std::cout calls to see what's going on while I develop. But you also need to really learn to use and love your debugger. Step through the code and see what happens. It should be apparent rather quickly what it's doing and why it always returns the last element in the input array.

I've noticed so many professional developers that don't use the debugger, and that's a shame. I've never understood why that is, but don't be one of those people. Possibly some of the most important advice I give to neophite programmers, and sadly one of the most ignored.

That's great advice. Thanks!
 
  • #9
Your latest thread reminded me of this one, and that I had written my own code to show you how I might approach these problems. I may as well share them so you have the chance to learn from them. You've already done the work, and enough time has passed that it's not likely to be of use to you in cheating on anything, so I presume the mods won't mind. Here are far simpler versions of your encoding/decoding functions:

Code:
#include <string>
#include <bitset>
// For CHAR_BIT, i.e. number of bits per char. Usually 8 bits, but you
// never know.
#include <climits>

/*! Convert string parameter to binary representation.
 *
 * \param str The character string to convert to a string of bits.
 * \return String of parameter's binary representation.
 */
std::string bitstring(const std::string& str)
{
    std::string result;
    result.reserve(str.size()*CHAR_BIT);

    for (size_t i = 0; i < str.size(); ++i)
    {
        result.append(std::bitset<CHAR_BIT>(str[i]).to_string());
    }

    return result;
}

/*! Convert string with binary representation back to a string of the
 *  characters it represents.
 *
 * \param str The character string of bits to convert back to a string if
 *     characters.
 * \return String with characters represented by the bits in the input string
 *     parameter.
 */
std::string undo_bitstring(const std::string& str)
{
    std::string result;
    size_t num_chars = str.size() / CHAR_BIT;
    result.reserve(num_chars);

    for (size_t i = 0; i < num_chars; ++i)
    {
        std::string sub_str = str.substr(i*CHAR_BIT, CHAR_BIT);
        char c = static_cast<char>(std::bitset<CHAR_BIT>(sub_str).to_ulong());
        result.append(1, c);
    }

    return result;
}

You could make faster versions, but the added complexity isn't worth the loss of clarity, unless it becomes a bottleneck. Note that I'm returning the resulting string by value. Return Value Optimization (RVO) will almost certainly eliminate the copying, anyways. Unless you find it to be a problem in a profiler, as a general rule, don't try and optimize prematurely. Don't be wantonly inefficient, either, mind you. For example, reserving space in those strings doesn't really make it harder to follow, and it's easy to see why that could help performance.

If anyone has any constructive criticism of this code, please do share. I'm always looking for ways to improve.

I also have a version of the mode finding function that is more efficient, flexible, shorter and more understandable. If you show me a working version of yours, I'll show you mine. :)
 
  • #10
Grep said:
If you show me a working version of yours, I'll show you mine. :)

Can you give me a hint about what you did?
 
  • #11
Jamin2112 said:
Can you give me a hint about what you did?

One hint: unordered_map. :)

For bonus points, so to speak, make it a template that can operate on a vector of arbitrary element types. For example, maybe it can operate on a vector of strings. The logic is basically the same.

I'll just add that I did use features from C++11. For g++, you can add the option '-std=c++11' to enable it.
 
  • #12
Grep said:
One hint: unordered_map. :)

For bonus points, so to speak, make it a template that can operate on a vector of arbitrary element types. For example, maybe it can operate on a vector of strings. The logic is basically the same.

I'll just add that I did use features from C++11. For g++, you can add the option '-std=c++11' to enable it.

k I'll work on this
 

Related to Critique of C++ code for encoding and decoding text

1. What is the purpose of encoding and decoding text in C++?

The purpose of encoding and decoding text in C++ is to convert data from one format to another in order to make it more secure or to ensure compatibility with different systems. In the context of text, encoding and decoding is often used to convert human-readable text into a format that can be easily transmitted and stored, and then back to its original form for consumption.

2. How does C++ handle text encoding and decoding?

C++ offers a variety of built-in functions and libraries for handling text encoding and decoding, such as the string class and the codecvt library. These tools allow for efficient conversion between different character sets, including ASCII, UTF-8, and UTF-16.

3. What are some common errors that can occur when encoding and decoding text in C++?

Some common errors that can occur when encoding and decoding text in C++ include incorrect character mapping, buffer overflows, and incorrect use of encoding functions. These errors can lead to unexpected results and even security vulnerabilities in the code.

4. How can I improve the efficiency of my C++ code for text encoding and decoding?

To improve the efficiency of text encoding and decoding in C++, you can use algorithms and data structures that are specifically designed for handling text, such as std::wstring and std::codecvt_utf8. Additionally, optimizing your code for memory usage and minimizing unnecessary conversions can also help improve efficiency.

5. Are there any best practices for encoding and decoding text in C++?

Yes, there are several best practices for encoding and decoding text in C++, including properly handling errors and exceptions, using secure input/output functions, and testing your code thoroughly to ensure it can handle different input scenarios. It is also important to follow the conventions and standards for the specific character set you are working with.

Similar threads

  • Programming and Computer Science
Replies
1
Views
901
  • Programming and Computer Science
2
Replies
40
Views
2K
  • Programming and Computer Science
2
Replies
55
Views
4K
  • Programming and Computer Science
Replies
10
Views
1K
  • Engineering and Comp Sci Homework Help
Replies
2
Views
973
  • Programming and Computer Science
3
Replies
97
Views
7K
  • Programming and Computer Science
Replies
3
Views
1K
  • Programming and Computer Science
Replies
12
Views
1K
  • Programming and Computer Science
Replies
19
Views
1K
  • Programming and Computer Science
Replies
8
Views
1K
Back
Top