Text File Statistics

Linguist: A Statistical Extension for LibreOffice Writer

Photo by LinkedIn Sales Solutions on Unsplash

Author(s):

Linguist offers writers interesting, valuable data to help them make the best of their words.

LibreOffice Writer includes features for both the home user's short files and a publisher's lengthy manuscripts. However, a professional writer would benefit from statistical tools to answer such questions as: How often does each word appear? What is the average sentence length? How readable is a passage? Needing such information for a book-length manuscript, I turned to LibreOffice's extension site, confident that among the hundreds of extensions available I would find at least one that would give me such statistics. Within seconds I had found Linguist, which gave me the raw data I needed, although without any filters and with nothing more than a plain summary in a LibreOffice file.

Linguist is installed like every other LibreOffice extension: Download the extension, open Tools | Extensions in Writer to find and add it, and Linguist will be available the next time you restart Writer. Just be sure that you download the 1.5.1 release at the bottom of Linguist's page rather than use the Download Latest button at the top of the page, which does not do what it says. When you restart, you will find that, unlike many extensions that add a menu item or a toolbar icon seemingly at random, Linguist adds its own easily found menu. If, like me, you are not fond of extra menus, you can leave Linguist disabled in Tools | Extensions and only enable it when you need it during revisions.

Linguist's menu lists four self-explanatory items. Selecting any one generates a Writer document with its results displayed one per line. The analysis is rapid, generating results for a 400-page document in less than 20 seconds on a modern desktop system. However, you may prefer to work chapter by chapter rather than with an entire document in order to minimize the scrolling as you work.

List Unrecognized Words

This item is useful for detecting not only non-standard coinages such as "weirded" that are not found in most dictionaries, but also foreign words such as "hypocaust" (a Roman heating system that circulated hot air beneath the floor) (Figure 1). Typos, too, are detected, and while using Linguist, I discovered several common typos of my own of which I was previously unaware. Similarly, I found that I had spelled one name in five different ways and another in three. No doubt the spell-check would have queried each one, but I might not have seen the pattern or learned to be more careful in the future. Should a document contain no unrecognized words, the results will of course be blank.

Figure 1: The start of the display of non-standard words.
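The idea behind this item can be sketched in a few lines of Python. This is a minimal illustration, not Linguist's actual code: the function name `unrecognized_words` and the tiny `KNOWN` word set are hypothetical, and a real check would load a full dictionary.

```python
import re
from collections import Counter

def unrecognized_words(text, known_words):
    """Count every word in text that is absent from known_words."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(w for w in words if w not in known_words)

# Hypothetical miniature dictionary, for illustration only.
KNOWN = {"the", "villa", "had", "a", "that", "him", "out"}

sample = ("The villa had a hypocaust that weirded him out. "
          "The hypocaust had a hypocuast.")

# One result per line, in alphabetical order.
for word, count in sorted(unrecognized_words(sample, KNOWN).items()):
    print(word, count)
```

Listing the results alphabetically places near-identical spellings next to each other, which is how inconsistent spellings of the same name become easy to spot.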

Complete Word List (Alphabetical)

At one result per line, the results for this list are extremely long (Figure 2). I probably had no need to know that my test document uses "the" 252 times, but the results did confirm my suspicion that I am too fond of "but" and "and," and made me eliminate many complex and compound sentences.

The results end with the number of different words, the total word count, and the lexical variety. As a ratio, the number of different words and the word count indicate the complexity of expression, while the lexical variety indicates the number of words derived from the same root, such as "lives" and "lively" from "live." My test document had a lexical variety of 0.38, more complex than most newspaper articles, but simpler than most academic papers.

Figure 2: Linguist's list of words in a document.
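For readers who want to reproduce the basic counts outside of Writer, the following Python sketch builds an alphabetical word list and the closing totals. The function names and the sample sentence are my own assumptions, and the ratio shown is simply different words divided by total words; it does not attempt the root-based lexical variety figure, whose exact formula the extension does not document here.

```python
import re
from collections import Counter

def word_counts(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

def summary(counts):
    """Return (different words, total words, different/total ratio)."""
    different = len(counts)
    total = sum(counts.values())
    ratio = different / total if total else 0.0
    return different, total, ratio

sample = "The cat sat and the dog sat, but the bird sang and sang."
counts = word_counts(sample)

# Alphabetical list, one word per line.
for word in sorted(counts):
    print(f"{word}\t{counts[word]}")

different, total, ratio = summary(counts)
print(f"different words: {different}, total words: {total}, ratio: {ratio:.2f}")
```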

Sort Words on Frequency

These results give the same information as the complete word list, including the final statistics, but arranged by how often each word appears rather than by alphabetical order (Figure 3). The complete word list is useful for checking on a particular word, while the frequency list is handy when line editing, showing me words that I might want to swap for synonyms or directing me toward sentences I should consider restructuring. When checking frequency, I begin by deleting frequent common words that I can hardly do without, as well as low-frequency words, and then feed the words that are left, one at a time, into Find and Replace and consider each occurrence of the word. While the complete word list gives me an overview, I use the word frequency list for the heavy editing.

Figure 3: How often each word appears.
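The manual pruning described above (drop the unavoidable common words, drop the rare ones, then review what remains) can be approximated with a short script. This is a sketch under stated assumptions: the `COMMON` set and the `min_count` threshold are hypothetical choices of mine, not anything Linguist provides.

```python
import re
from collections import Counter

# Hypothetical list of words too common to be worth reviewing.
COMMON = {"the", "a", "an", "of", "to", "in", "is", "was"}

def editing_candidates(text, common=COMMON, min_count=2):
    """Return (word, count) pairs worth reviewing, most frequent first."""
    counts = Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    kept = [(w, n) for w, n in counts.items()
            if w not in common and n >= min_count]
    # Sort by descending count, then alphabetically to break ties.
    return sorted(kept, key=lambda item: (-item[1], item[0]))

sample = ("But the hall was cold, and the fire was low, but the wine "
          "was warm and the talk was warm and long.")

for word, count in editing_candidates(sample):
    print(f"{count}\t{word}")
```

Each word that survives the filter is a candidate to paste into Find and Replace and check occurrence by occurrence.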

Statistics

The Statistics item includes the lexical variety and the number of different words, but also the number of words per full stop (that is, per sentence), as well as the percentage of long words, often down to several decimal places (Figure 4). These figures are used to derive the LIX readability score, in which 20 is very easy and 60 is very difficult. My test document was just over 30, which is close to what I always aim for.

Figure 4: General Statistics on a document.
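As usually defined, LIX is the average number of words per sentence plus the percentage of words longer than six letters. The Python sketch below follows that standard definition; treating ".", "!", and "?" as sentence endings is my own simplification, and whether Linguist counts sentences and long words in exactly this way is an assumption.

```python
import re

def lix(text):
    """Return (words per sentence, percent long words, LIX score)."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not words:
        return 0.0, 0.0, 0.0
    # Assumption: each run of ., !, or ? ends one sentence.
    sentences = len(re.findall(r"[.!?]+", text)) or 1
    long_words = sum(1 for w in words if len(w) > 6)
    words_per_sentence = len(words) / sentences
    percent_long = 100 * long_words / len(words)
    return words_per_sentence, percent_long, words_per_sentence + percent_long

sample = ("The extension counts words. "
          "It reports surprisingly detailed statistics.")
per_sentence, percent_long, score = lix(sample)
print(f"words per sentence: {per_sentence:.2f}")
print(f"long words: {percent_long:.2f}%")
print(f"LIX: {score:.2f}")
```

Very short samples like this one produce exaggerated scores because a handful of long words dominates the percentage; the figure is only meaningful over a passage of reasonable length.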

Further Enhancements

Linguist would benefit from filters. For instance, it would save time if you could filter common words from the results. Some users, too, might want to see results by part of speech so that they can see, for example, how many adverbs they use and which ones could be deleted in favor of stronger verbs. Adjectives could also be minimized in favor of nouns.

Those who want such enhancements might look at LanguageTool. In the last few releases, LibreOffice has partially incorporated LanguageTool into its structure. From Tools | Options | LibreOffice | Language Settings, users can enable either the free or subscription version of LanguageTool (Figure 5). Alternatively, they can install the self-contained extension version of LanguageTool, which functions locally rather than online. However, both the free and the extension versions of LanguageTool are merely grammar checkers, and the fact that they work as you type may not be to everyone's taste. To get editing statistics, you must pay a monthly or yearly subscription fee, and although LanguageTool's statistics are far more advanced than those in the latest release of Linguist and have a graphical interface, paying for a subscription may be unacceptable to free software users. With luck, Linguist will evolve into something closer to LanguageTool, but even in its current raw form, Linguist lets writers get the job done.

Figure 5: Optionally, LanguageTool can be integrated into Writer.


