pdfgrep

Checking the statistical terminology in a manuscript is important because it often reveals confusion and misunderstandings. Reviewers, however, rarely prioritize terminology, and corrections are not always appreciated. Nevertheless, ideal scientific writing is clear, specific, and unambiguous: the reader should not have to guess what the author really means by terms such as variable, parameter, and quartile. Correct definitions of statistical terms can be found in The Oxford Dictionary of Statistical Terms (The International Statistical Institute; Oxford University Press, New York, 2003).

To facilitate my own checking of the statistical terminology in manuscripts, I use the Linux command-line utility pdfgrep, a program that scans one or more pdf documents for defined keywords and reports each occurrence it finds. A Windows version of the program exists, but the Linux version can be run directly using the Windows Subsystem for Linux (WSL).

I have written a short Bash shell script to facilitate my terminology checks. The script calls pdfgrep and searches the pdf manuscripts in the $HOME/Downloads folder for the keywords defined in a separate text file.

#!/bin/bash
 
if [ "$1" == "-h" ] || [ "$1" == "--help" ]; then
   echo "Usage: statrev.sh [OPTION]"
   echo
   echo "With no option, matching terms are listed."
   echo "  -c, --context     show each separate occurrence of a matched term in its context"
   echo "  -t, --terms       list matching terms"
   echo "  -h, --help        display this help and exit"
   echo
   echo "This program matches the terms specified in $HOME/statrevterms.cfg with the contents of the pdf files located in the $HOME/Downloads/ directory, writing to standard output."
   echo
   exit 0
fi
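# Work in the download folder; stop if it cannot be entered.
cd "$HOME/Downloads" || exit 1
# compgen -G succeeds only if the glob matches at least one file,
# so this test also works when several pdf files are present.
if ! compgen -G "*.pdf" > /dev/null ; then
    echo "There are no pdf files in $HOME/Downloads/."
    echo
    exit 1
fi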
echo
echo "Statrev Terminology Check"
echo "========================="
if [ -z "$1" ] || [ "$1" == "-t" ] || [ "$1" == "--terms" ] ; then
   echo
   echo "Matched terms"
   echo "-------------" 
   echo
   # Lowercase first so that differently capitalised matches sort together before counting.
   pdfgrep -Hio -f"$HOME/statrevterms.cfg" *.pdf | tr '[:upper:]' '[:lower:]' | sort | uniq -c
   echo 
else
   echo
   echo "Context"
   echo "-------" 
   echo
   pdfgrep -FHin -C1 -f"$HOME/statrevterms.cfg" *.pdf
fi
echo
echo "Finish"
echo
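
The counting step in the terms branch can be tried without any pdf files. The sketch below is only an illustration: count_terms is a hypothetical helper name, and the sample lines imitate the "file:match" form that I assume pdfgrep prints when -H and -o are combined. It lowercases the input first, so that differently capitalised matches end up next to each other before uniq -c counts them.

```shell
# Hypothetical helper: count how often each input line occurs, ignoring case.
# The sed call only strips the leading padding that uniq -c adds.
count_terms() {
    tr '[:upper:]' '[:lower:]' | sort | uniq -c | sed 's/^ *//'
}

# Sample lines imitating the assumed "file:match" output format.
printf '%s\n' 'ms.pdf:Quartile' 'ms.pdf:range' 'ms.pdf:quartile' | count_terms
```

Each output line starts with the count, followed by the file name and the matched term, which makes it easy to see which terms deserve a closer look in the context listing.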

The keywords I wish to check are stored in a separate text file, statrevterms.cfg. The content of this file is currently:

### Descriptive ###
range
tertile
quartile
quintile
### Models ###
multivariateR2
predictor
prediction
demonstrate
strongly
univariate
stepwise
propensity
### Study design ###
cohort
cross-sectional
case-control
### Parameters ###
odds ratio
### Epidemiology ###
prevalence
incidence
mortality rate
fatality rate
### Randomised trials ###
primary endpoint
primary outcome
adverse event
crossover
cross-over
multiplicity
non-inferiority
equivalence
### Results ###
not differ
no difference
no effect
no significant difference
a significant difference
statistical difference
normally distributed
normal distribution
### Miscellaneous ###
goodness-of-fit
R2
α
β
association
predictor
prediction
demonstrate
strongly
### Finished ###
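
If the file is read the way grep reads a pattern file, every line counts as a search pattern, including the "### ... ###" section headers and the terms that are listed twice (predictor, prediction, demonstrate, strongly). Should that be unwanted, the file could be filtered before it is handed to pdfgrep. A minimal sketch, assuming that header lines always start with "#"; clean_terms is a hypothetical helper name and not part of the script above.

```shell
# Hypothetical helper: drop header lines (starting with '#'), blank lines,
# and repeated terms, keeping the first occurrence in the original order.
clean_terms() {
    grep -v -e '^#' -e '^[[:space:]]*$' "$1" | awk '!seen[$0]++'
}

# Demo on a small temporary file laid out like statrevterms.cfg.
tmp=$(mktemp)
printf '%s\n' '### Models ###' 'predictor' 'stepwise' \
              '### Miscellaneous ###' 'predictor' 'R2' > "$tmp"
clean_terms "$tmp"
rm -f "$tmp"
```

The filtered list could be written to a temporary file and passed to pdfgrep with -f in place of the original file, leaving statrevterms.cfg itself readable with its section headers.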

Running the script with a pdf manuscript in the folder gives output that allows a simple and quick check of whether the manuscript uses correct terminology. The files can be downloaded here.
