linux - Is there any way to get the page numbers in a PDF of a search pattern?
I have a PDF named test.pdf and need to search for the text "my name" in it.
Using a script, this does the job:
    pdftotext test.pdf - | grep 'my name'
Is there any way to get the page number of the text "my name" in the terminal itself?
If you want the linear page number (as opposed to the number that appears on the page), you can get it by counting form-feed characters while you search for the text. pdftotext puts a form-feed at the end of every page, so the number of form-feeds prior to the text is one less than the (linear) number of the page the text is on. (Or thereabouts. PDF files are not always as simple as they seem.)
Something like the following should work:
    pdftotext test.pdf - | awk -v RS=$'\f' -v name="my name" \
      'index($0,name){printf "%d: %s\n", NR, name;}'
The following more complicated solution may prove useful if you want to scan for more than one pattern. Unlike the simple solution above, this one will give you one line per pattern match, even if the same pattern matches twice on the same page:
    pdftotext test.pdf - | grep -F -o -e $'\f' -e 'my name' |
      awk 'BEGIN{page=1} /\f/{++page;next} 1{printf "%d: %s\n", page, $0;}'
You can add as many patterns as you like to the grep command (by adding another -e string argument). -F causes it to match exact strings, but that's not essential; use -E if you want regexes. The awk script assumes that every match is either a form-feed or a string that was matched, which is what the -o option to grep gives you.
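To make the "add more -e arguments" step concrete, here is a minimal sketch that wraps the grep and awk stages in a function. The name find_pages is made up for illustration; it reads pdftotext-style text on standard input and takes each fixed-string pattern as an argument. It uses $(printf '\f') in place of $'\f' only so that it also runs in shells without that quoting form.

```shell
# find_pages (hypothetical helper): print "page: match" for every occurrence
# of each fixed-string pattern given as an argument. Input is text with pages
# separated by form-feeds, as written by `pdftotext file.pdf -`.
find_pages() {
    ff=$(printf '\f')            # literal form-feed character
    n=$#
    # Rewrite the argument list "a b" into "-e a -e b" for grep.
    while [ "$n" -gt 0 ]; do
        set -- "$@" -e "$1"
        shift
        n=$((n - 1))
    done
    # grep -o emits each match (form-feed or pattern) on its own line;
    # awk counts the form-feeds to track the current page.
    grep -F -o -e "$ff" "$@" |
        awk 'BEGIN { page = 1 } /\f/ { ++page; next } { printf "%d: %s\n", page, $0 }'
}
```

With the file from the question it would be called as: pdftotext test.pdf - | find_pages 'my name' 'other phrase' (the second pattern is just a placeholder).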
If you are looking for phrases, you should be aware that they might have line breaks (or even page breaks) in the middle. There's not a lot you can do about page breaks, but the first (pure awk) solution can handle line breaks if you change the call to index into a regular expression search, and write the regular expression with [[:space:]]+ replacing every single space in the original phrase:
    pdftotext test.pdf - | awk -v RS=$'\f' \
      '/my[[:space:]]+name/{printf "%d: %s\n", NR, "my name";}'
In theory, you could extract the visible page number (or "page label", as it is called), but many PDF files do not retain that metadata, and you'd need a real PDF parser to extract it.
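Going back to the line-break-tolerant search: if you don't want to rewrite the phrase by hand, something along these lines could build the [[:space:]]+ pattern for you. This is only a sketch; phrase_pages is a made-up name, it reads pdftotext-style text on standard input, and it assumes the phrase contains no regex metacharacters (they would be interpreted as such) and that your awk supports POSIX character classes.

```shell
# phrase_pages (hypothetical helper): print "page: phrase" for every page on
# which the phrase occurs, allowing any whitespace (including line breaks)
# wherever the phrase has a space.
phrase_pages() {
    # Turn each run of spaces in the phrase into [[:space:]]+
    pat=$(printf '%s' "$1" | sed 's/  */[[:space:]]+/g')
    # One record per page: awk's -v processes the \f escape into a form-feed.
    awk -v RS='\f' -v pat="$pat" -v phrase="$1" \
        '$0 ~ pat { printf "%d: %s\n", NR, phrase }'
}
```

Usage would be: pdftotext test.pdf - | phrase_pages 'my name'. Like the pure awk version, it reports each page at most once.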