The attached code and test files show it in action; what follows is only a description.
The text is converted by analyzing it line by line and building up an array of metadata about the document. The metadata describes each line: is it long? Is it blank? Does it look like a quote? That metadata is then analyzed to determine paragraph groupings.
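A minimal sketch of such a pass might look like this. The field names and thresholds here are my own illustrative assumptions, not necessarily those used in the attached code:

```php
<?php
// First pass: build a metadata record for each line. Nothing in the
// text itself is modified. Field names here are illustrative only.
function scan_lines(string $text): array
{
    $meta = [];
    foreach (explode("\n", $text) as $i => $line) {
        $meta[$i] = [
            'text'  => $line,
            'blank' => trim($line) === '',
            'long'  => strlen($line) > 60,
            'quote' => preg_match('/^\s*>/', $line) === 1,
        ];
    }
    return $meta;
}

// A later pass reads the metadata to group consecutive
// non-blank lines into paragraphs.
function group_paragraphs(array $meta): array
{
    $paras = [];
    $current = [];
    foreach ($meta as $m) {
        if ($m['blank']) {
            if ($current) { $paras[] = $current; $current = []; }
        } else {
            $current[] = $m;
        }
    }
    if ($current) { $paras[] = $current; }
    return $paras;
}
```

Note that the grouping pass never touches the raw text; it works purely from the metadata built in the first pass.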
This differs from the typical solution of using regular expressions to add HTML code to text. For one, we try not to manipulate the text in place. Rather, we simply “look at” the text, and “notice” features. Later, we analyze the features to determine what tags to insert.
This technique works well because a lot of formatting information is embedded in the layout of the text. By preserving the layout, we can guess what the formatter intended. Also, by allowing for multiple passes over the text, we can refine the metadata.
For example, we could detect whether one of the first few lines is capitalized like a title. If so, we can assume it is a title and add that metadata. Then we can look at the line just below it, and if it looks like a byline, add that metadata too.
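That refinement pass could be sketched roughly as follows. The heuristics (the capitalization test, the three-line window, the "by ..." byline pattern) are assumptions for illustration, not the attached code:

```php
<?php
// Illustrative refinement pass: mark a title near the top of the
// document, and a byline on the line just below it. All of the
// heuristics here are assumptions, not the attached code.
function looks_like_title(string $line): bool
{
    $line = trim($line);
    if ($line === '' || strlen($line) > 60) {
        return false;
    }
    // Treat the line as a title if most words start with a capital.
    $words = preg_split('/\s+/', $line);
    $caps  = count(array_filter($words, fn($w) => preg_match('/^[A-Z]/', $w)));
    return $caps >= count($words) * 0.8;
}

function mark_title_and_byline(array &$meta): void
{
    // Only consider the first few lines of the document.
    foreach (array_slice(array_keys($meta), 0, 3) as $i) {
        if (looks_like_title($meta[$i]['text'])) {
            $meta[$i]['title'] = true;
            // The line just below a title may be a byline ("by Jane Doe").
            if (isset($meta[$i + 1])
                && preg_match('/^\s*by\s+/i', $meta[$i + 1]['text'])) {
                $meta[$i + 1]['byline'] = true;
            }
            return;
        }
    }
}
```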
What you detect depends on your data. This function is being written to convert plain-text email messages into HTML, for easier reading on small screens, so bylines and titles aren't that important, but getting quoted text right is.
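For the email case, the rendering step for quoted text might look something like this. This is a sketch assuming the conventional ">" quoting prefix; the attached code may handle it differently:

```php
<?php
// Sketch: wrap runs of '>'-quoted lines in <blockquote> tags.
// Assumes the conventional '>' email quoting prefix.
function render_quotes(array $meta): string
{
    $out = [];
    $inQuote = false;
    foreach ($meta as $m) {
        $quoted = preg_match('/^\s*>/', $m['text']) === 1;
        if ($quoted && !$inQuote) {
            $out[] = '<blockquote>';
            $inQuote = true;
        } elseif (!$quoted && $inQuote) {
            $out[] = '</blockquote>';
            $inQuote = false;
        }
        // Strip one level of quoting inside the blockquote.
        $out[] = $quoted
            ? preg_replace('/^\s*>\s?/', '', $m['text'])
            : $m['text'];
    }
    if ($inQuote) {
        $out[] = '</blockquote>';
    }
    return implode("\n", $out);
}
```

Because the quote flag is already in the metadata, a run of quoted lines can be treated as one unit, which is hard to do with a single line-at-a-time regex substitution.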
I’ve left the linking of URLs to another function, and the escaping of special characters to htmlspecialchars().
The paradump() function is not related to all this – it’s just a way to view the text alongside the metadata.