Sed language translation script: improving efficiency for long texts
Here's the issue. I'm a Spanish translator, and I have a lengthy Spanish-English glossary file (50k entries). Additionally, I have a stop-word glossary of around 1k entries. I want to strip these entries from the texts I plan to translate. So I built a sed script that, in turn, builds two more sed scripts from the glossaries, does the stripping, and leaves me with the untranslated text (so I don't have to solve the same problem twice). It works well; the problem is that it takes a long time on long texts, upwards of 15 minutes. Is this inevitable, or is there a more efficient way to do this?
Here's the main script:
    #!/bin/sh
    before="$(date +%s)"
    #wordstxt=$(wc -w < $1)
    #mintime=$(expr "$wordstxt / 200" |bc -l)
    #maxtime=$(expr "$wordstxt / 175" |bc -l)
    #echo "Estimated time to process: between $mintime and $maxtime seconds."

    sed '
    s/\,/\n/g # strip commas
    s/\?/\n/g # strip question marks
    s/\*/\n/g # strip asterisks
    s/\!/\n/g # strip exclamation marks
    s/:/\n/g # strip colons
    s/\-/\n/g # strip hyphens
    s/\./\n/g # strip periods
    s/«/\n/g # strip left euro-quotes
    s/»/\n/g # strip right euro-quotes
    s/”/\n/g # strip slanted quotes
    s/\"/\n/g # strip left quotes
    s/(/\n/g # strip left paren
    s/)/\n/g # strip right paren
    s/\[/\n/g # strip left bracket
    s/\]/\n/g # strip right bracket
    s/¿/\n/g # strip "¿"
    s/—/\n/g # strip m-dash
    s/\ –\ /\n/g # strip n-dash
    s/…/\n/g # strip ellipsis as a single character, not 3 periods
    s/;/\n/g # strip semicolon
    s/[0-9]/\n/g # strip out numbers, replace with returns
    ' $1 > $1.z.tmp

    #echo "Punctuation eliminated."
    #cp ../../spanish\ to\ english\ projects/glossary/stoplist.txt .

    sed '
    s/^\ //g # strip leading spaces
    s/\ $// # strip trailing spaces
    /^$/d # delete blank lines
    s/\./\n/g # strip periods
    s/\ /\\ /g # make spaces literals
    s/^/s\// # begins the substitution
    s/$/\/\\n\/g/ # concludes the substitution
    1 s/^/#!\ \/bin\/sed\ \-f\n\ns\/\[0\-9\]\/\/g\ns\/\\\ \\\ \/\\\ \/g\ns\/\\\.\\\ \/\\n\/g\n\n/
    ' stoplist.txt > stoplist.sed

    chmod +x stoplist.sed
    echo "Eliminating stopwords."
    ./stoplist.sed $1.z.tmp > $1.0.tmp

    sed 's/\([a-zA-Z\ ]*\t\).*/\1/' spanishglossary.utf8 > tempgloss.2.txt
    #echo "Target phrases stripped."
    sort -u tempgloss.2.txt > tempgloss.3.txt
    awk '{ print length(), $0 | "sort -rn" }' tempgloss.3.txt > tempgloss.4.txt
    #echo "List ordered by length."
    #echo "Now creating the new sed script."
    # This affects the sed script, not the output file.

    sed '
    s/[0-9]//g # strip out numbers
    s/^\ //g # strip leading spaces -- lines have them due to the sort
    /^$/d # delete blank lines
    s/\//\\\//g # make text slashes literals
    s/"/\n/g # strip quotes
    s/\t//g # strip tabs
    s/\./\n/g # strip periods
    s/'\''/\\'\''/g # make straight apostrophes literals
    s/'\’'/\\'\’'/g # make curly apostrophes literals
    s/\ /\\ /g # make spaces literals
    /^.\{0,5\}$/d # delete lines less than 5 characters
    s/^/s\/\\b/ # begins the substitution
    s/$/\\b\/\\n\/g/ # concludes the substitution
    1 s/^/#!\ \/bin\/sed\ \-f\n\ns\/\[0\-9\]\/\/g\ns\/\\\ \\\ \/\\\ \/g\ns\/\\\.\\\ \/\\n\/g\n\n/
    ' tempgloss.4.txt > glossy.sed

    #echo "glossy.sed created."
    chmod +x glossy.sed
    echo "Eliminating existing entries. This may take a while."
    ./glossy.sed $1.0.tmp > $1.1.tmp

    echo "Now cleaning lines."
    sed -e '
    s/\ $// # strip trailing spaces
    s/^\ *//g # strip any and all leading spaces
    s/\ el$//g # strip "el" from the end
    s/\ la$//g # strip "la" from the end
    s/\ los//g # strip "los" from the end
    s/\ las//g # strip "las" from the end
    s/\ o$//g # strip "o" from the end
    s/\ y$//g # strip "y" from the end
    s/\ $// # strip trailing spaces (yes, again)
    ' $1.1.tmp > $1.2.tmp

    echo "Creating ngrams."
    ./ngrams 5 < $1.2.tmp > $1.3.tmp 2> /dev/null
    linecount="$(wc -l < $1.3.tmp)"
    #echo $linecount "lines."
    if [ "$linecount" -gt "1000" ]
    then
        echo "Eliminating single instances."
        sed '/^1\t/d' $1.3.tmp > $1.4.tmp
    else
        echo "Fewer than 1000 entries, keeping all."
        cp $1.3.tmp $1.4.tmp
    fi

    sed -e '
    s/[0-9]//g # strip out numbers
    s/^\t//g # strip leading tab
    s/^\ *//g # strip any and all leading spaces
    /^.\{0,7\}$/d # delete lines less than 6 characters
    s/\ $// # strip trailing spaces (yes, again)
    #s/$/\t/ # add in a tab
    ' $1.4.tmp > $1.csv

    echo "Looking for duplicates."
    sh ./dedupe $1.csv

    wordstxt=$(wc -w < $1)
    #echo $wordstxt
    wordslist=$(wc -w < $1.csv)
    #echo $wordslist
    wordspercent=$(echo "scale=4; $wordslist / $wordstxt" |bc -l)
    wordspercentage=$(echo "$wordspercent * 100" |bc -l)
    after="$(date +%s)"
    elapsed_seconds="$(expr $after - $before)"
    rate=$(echo "scale=3; $wordstxt / $elapsed_seconds" |bc -l)
    echo "Created "$1.csv", $wordspercentage% left, in" $elapsed_seconds "seconds." #, effective rate of" $rate "words per second."

    rm tempgloss.*.txt
    rm *.tmp
    rm glossy.sed
Rewrite the script in awk and it will run in seconds instead of minutes, and it will be briefer, simpler, and clearer. Sed is an excellent tool for simple substitutions on a single line. For anything else, use awk.
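The answer gives no code, so what follows is only a minimal sketch of one way such an awk rewrite might look, not the answerer's actual method. Instead of generating tens of thousands of `s///` commands, it loads the stop words and glossary terms into a single awk array and looks each candidate phrase up directly. The function name `strip_known`, the `maxwords` limit of 5, and the file layouts (one stop word per line; glossary lines of the form `term<TAB>translation`) are assumptions for illustration. Matching is case-sensitive and works on whole words only, and the cleanup, n-gram, and dedupe steps of the original script are not reproduced.

```shell
#!/bin/sh
# strip_known (hypothetical helper): print the parts of TEXT that are not
# covered by a stop word or a glossary term, one fragment per line.
# usage: strip_known STOPLIST GLOSSARY TEXT
strip_known() {
    awk -v maxwords=5 '
    # First file: one stop word (or phrase) per line.
    FILENAME == ARGV[1] {
        if ($0 != "") known[$0] = 1
        next
    }
    # Second file: assumed layout is term<TAB>translation; keep the term.
    FILENAME == ARGV[2] {
        split($0, f, "\t")
        term = f[1]
        sub(/^ +/, "", term)
        sub(/ +$/, "", term)
        if (term != "") known[term] = 1
        next
    }
    # Third file: the text to filter.
    {
        # Turn punctuation and digits into fragment breaks.
        gsub(/[0-9,.;:!?*"()-]/, "\n")
        gsub(/\[|\]|«|»|”|¿|—|…/, "\n")
        nseg = split($0, seg, "\n")
        for (s = 1; s <= nseg; s++) {
            nw = split(seg[s], w, " ")
            out = ""
            i = 1
            while (i <= nw) {
                # Try the longest phrase starting at word i first.
                len = nw - i + 1
                if (len > maxwords) len = maxwords
                matched = 0
                for (; len >= 1; len--) {
                    phrase = w[i]
                    for (k = 1; k < len; k++) phrase = phrase " " w[i + k]
                    if (phrase in known) { matched = len; break }
                }
                if (matched) {
                    # A known phrase ends the current fragment.
                    if (out != "") print out
                    out = ""
                    i += matched
                } else {
                    out = (out == "" ? w[i] : out " " w[i])
                    i++
                }
            }
            if (out != "") print out
        }
    }
    ' "$1" "$2" "$3"
}
```

With the file names from the question, it could be called as `strip_known stoplist.txt spanishglossary.utf8 "$1" > "$1.1.tmp"`. Each array lookup takes roughly constant time, so the work per line depends on the number of words in the line rather than on the number of glossary entries.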