sed language translation script - improving efficiency for long texts

- July 15, 2010

Here's the issue. I'm a Spanish translator, and I have a lengthy Spanish-English glossary file (around 50k entries). Additionally, I have a stop-word glossary of around 1k entries. I want to strip these entries from the texts that I plan to translate. So I built a sed script that, in turn, builds 2 more sed scripts from the glossaries, does the stripping, and leaves me with the untranslated text (so I don't have to solve the same problem twice). It works well; the problem is that it takes a long time on long texts, upwards of 15 minutes. Is this inevitable, or is there a more efficient way to do this?

Here's the main script:

#!/bin/sh
before="$(date +%s)"

#wordstxt=$(wc -w < $1)
#mintime=$(expr "$wordstxt / 200" |bc -l)
#maxtime=$(expr "$wordstxt / 175" |bc -l)
#echo "Estimated time to process: between $mintime and $maxtime seconds."

sed '
s/\,/\n/g           # strip commas
s/\?/\n/g           # strip question marks
s/\*/\n/g           # strip asterisks
s/\!/\n/g           # strip exclamation marks
s/:/\n/g            # strip colons
s/\-/\n/g           # strip hyphens
s/\./\n/g           # strip periods
s/«/\n/g            # strip left euro-quotes
s/»/\n/g            # strip right euro-quotes
s/”/\n/g            # strip slanted quotes
s/\"/\n/g           # strip straight double quotes
s/(/\n/g            # strip left paren
s/)/\n/g            # strip right paren
s/\[/\n/g           # strip left bracket
s/\]/\n/g           # strip right bracket
s/¿/\n/g            # strip "¿"
s/—/\n/g            # strip m-dash
s/\ –\ /\n/g        # strip n-dash
s/…/\n/g            # strip ellipsis as a single character, not 3 periods
s/;/\n/g            # strip semicolon
s/[0-9]/\n/g        # strip out numbers, replace with newlines
' "$1" > "$1.z.tmp"
#echo "Punctuation eliminated."

#cp ../../spanish\ to\ english\ projects/glossary/stoplist.txt .
sed '
s/^\ //g        # strip leading spaces
s/\ $//         # strip trailing spaces
/^$/d           # delete blank lines
s/\./\n/g       # strip periods
s/\ /\\ /g      # make spaces literals
s/^/s\//        # begins substitution
s/$/\/\\n\/g/   # concludes substitution

1 s/^/#!\ \/bin\/sed\ \-f\n\ns\/\[0\-9\]\/\/g\ns\/\\\ \\\ \/\\\ \/g\ns\/\\\.\\\ \/\\n\/g\n\n/

' stoplist.txt > stoplist.sed
chmod +x stoplist.sed
echo "Eliminating stopwords."
./stoplist.sed "$1.z.tmp" > "$1.0.tmp"

sed 's/\([a-zA-Z\ ]*\t\).*/\1/' spanishglossary.utf8 > tempgloss.2.txt
#echo "Target phrases stripped."

sort -u tempgloss.2.txt > tempgloss.3.txt

awk '{ print length(), $0 | "sort -rn" }' tempgloss.3.txt > tempgloss.4.txt
#echo "List ordered by length."

#echo "Now creating new sed script."
# This affects the sed script, not the output file.

sed '
s/[0-9]//g          # strip out numbers
s/^\ //g            # strip leading spaces -- lines have them due to the sort
/^$/d               # delete blank lines
s/\//\\\//g         # make text slashes literals
s/"/\n/g            # strip quotes
s/\t//g             # strip tabs
s/\./\n/g           # strip periods
s/'\''/\\'\''/g     # make straight apostrophes literals
s/'\’'/\\'\’'/g     # make curly apostrophes literals
s/\ /\\ /g          # make spaces literals
/^.\{0,5\}$/d       # delete lines of 5 characters or fewer
s/^/s\/\\b/         # begins substitution
s/$/\\b\/\\n\/g/    # concludes substitution

1 s/^/#!\ \/bin\/sed\ \-f\n\ns\/\[0\-9\]\/\/g\ns\/\\\ \\\ \/\\\ \/g\ns\/\\\.\\\ \/\\n\/g\n\n/

' tempgloss.4.txt > glossy.sed

#echo "glossy.sed created."
chmod +x glossy.sed

echo "Eliminating existing entries. This may take a while."
./glossy.sed "$1.0.tmp" > "$1.1.tmp"

echo "Now cleaning lines."
sed -e '
s/\ $//         # strip trailing spaces
s/^\ *//g       # strip any leading spaces
s/\ el$//g      # strip "el" at end
s/\ la$//g      # strip "la" at end
s/\ los//g      # strip "los" (unanchored, so anywhere in the line)
s/\ las//g      # strip "las" (unanchored, so anywhere in the line)
s/\ o$//g       # strip "o" at end
s/\ y$//g       # strip "y" at end
s/\ $//         # strip trailing spaces (yes, again)
' "$1.1.tmp" > "$1.2.tmp"

echo "Creating ngrams."
./ngrams 5 < "$1.2.tmp" > "$1.3.tmp" 2> /dev/null

linecount="$(wc -l < "$1.3.tmp")"
#echo $linecount "lines."
if [ "$linecount" -gt "1000" ]
then
    echo "Eliminating single instances."
    sed '/^1\t/d' "$1.3.tmp" > "$1.4.tmp"
else
    echo "Fewer than 1000 entries, keeping all."
    cp "$1.3.tmp" "$1.4.tmp"
fi

sed -e '
s/[0-9]//g      # strip out numbers
s/^\t//g        # strip leading tab
s/^\ *//g       # strip any leading spaces
/^.\{0,7\}$/d   # delete lines of 7 characters or fewer
s/\ $//         # strip trailing spaces (yes, again)
#s/$/\t/        # add in tab
' "$1.4.tmp" > "$1.csv"

echo "Looking for duplicates."
sh ./dedupe "$1.csv"

wordstxt=$(wc -w < "$1")
#echo $wordstxt
wordslist=$(wc -w < "$1.csv")
#echo $wordslist
wordspercent=$(echo "scale=4; $wordslist / $wordstxt" |bc -l)
wordspercentage=$(echo "$wordspercent * 100" |bc -l)

after="$(date +%s)"
elapsed_seconds="$(expr $after - $before)"
rate=$(echo "scale=3; $wordstxt / $elapsed_seconds" |bc -l)
echo "Created "$1.csv", $wordspercentage% left, in" $elapsed_seconds "seconds." #, effective rate of" $rate "words per second."

rm tempgloss.*.txt
rm *.tmp
rm glossy.sed

Rewrite the script in awk and it will run in seconds instead of minutes, and be briefer, simpler, and clearer. sed is an excellent tool for simple substitutions on a single line. For anything else, use awk.
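
As a rough illustration of that advice (this is not the answerer's code, which the post does not include), here is a minimal sketch of the stripping step in awk, wrapped in a shell function. Assumptions: strip_known, its argument order, and maxn are made-up names; the stop list has one entry per line; the glossary is tab-separated with the Spanish phrase first, as the sed expression in the question suggests; matching is case-sensitive; and only a subset of ASCII punctuation is treated as a break. Rather than running tens of thousands of substitutions over every line, it loads both lists into one array and checks each run of up to maxn words with a hash lookup, longest run first. It does not cover the ngrams or dedupe steps.

```shell
#!/bin/sh
# Sketch only. strip_known and maxn are hypothetical names, not from the post.
# Usage: strip_known STOPLIST GLOSSARY TEXT
strip_known() {
    awk -v maxn=5 '
    # Print the pending fragment of unknown words, if any, then reset it.
    function emit() { if (out != "") print out; out = "" }

    # 1st file: stop list, one word or phrase per line.
    FILENAME == ARGV[1] { if ($0 != "") known[$0] = 1; next }

    # 2nd file: glossary, assumed tab-separated with the source phrase first.
    FILENAME == ARGV[2] {
        split($0, field, "\t")
        if (field[1] != "") known[field[1]] = 1
        next
    }

    # 3rd file: the text. Digits and some ASCII punctuation break segments.
    {
        nseg = split($0, seg, /[0-9,.;:!?*()"-]+/)
        for (i = 1; i <= nseg; i++) {
            m = split(seg[i], w, " ")
            j = 1
            while (j <= m) {
                hit = 0
                # Try the longest candidate phrase first, then shorter ones.
                for (n = (m - j + 1 < maxn ? m - j + 1 : maxn); n >= 1; n--) {
                    phrase = w[j]
                    for (k = 1; k < n; k++) phrase = phrase " " w[j + k]
                    if (phrase in known) { hit = n; break }
                }
                if (hit) { emit(); j += hit }   # known entry: drop it, end the fragment
                else { out = (out == "" ? w[j] : out " " w[j]); j++ }
            }
            emit()
        }
    }' "$1" "$2" "$3"
}
```

For example, with a stop list containing el, la, de and y, and glossary entries for "casa blanca" and "perro", the line "el perro de la casa blanca come, y la gata duerme." comes out as two lines: "come" and "gata duerme". Each word costs at most maxn array lookups no matter how large the glossary is, which is where the speedup over one sed substitution per glossary entry would come from.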
