Transcripts of all of Star Trek and some commands to search them
Check out this repo, and put it somewhere you like:
$ git clone https://github.com/varenc/star_trek_transcript_search
$ cd star_trek_transcript_search/scripts
$ ls
DS9 Discovery Enterprise Movies NextGen TAS TOS Voyager
Now search the transcripts. Here's the most basic command for searching:
grep -rin "<search term>" .
For example:
$ grep -rin "terraform.*venus" .
./DS9/200457.txt:401:O'BRIEN: All of it. The Utopia Planitia yards on Mars, the terraforming stations on Venus, Starfleet Headquarters. I'm not detecting a single sign of Starfleet activity anywhere in this sector.
For a better experience, use The Silver Searcher (ag) instead of grep:
$ ag "terraform.*venus" .
DS9/200457.txt
401:O'BRIEN: All of it. The Utopia Planitia yards on Mars, the terraforming stations on Venus, Starfleet Headquarters. I'm not detecting a single sign of Starfleet activity anywhere in this sector.
Make this an easily used function by adding this to your .bashrc or .zshrc:
function trekLines() {
  # Run the search in a subshell so your own working directory is left unchanged.
  (cd /path/to/star_trek_transcript_search/scripts/ && ag "${1}" .)
}
Then just call trekLines "terraform.*venus" to do a search.
There are lots more things you can do as well, like counting the number of lines per character per episode, or finding the episode where each character spoke the fewest words. I may update this with examples of how to do that later.
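As a rough starting point for the first idea, here is a minimal sketch that counts spoken lines per character in a single episode. The function name trekLineCounts is made up for this example, and it assumes speaker lines look like "WORF: ..." or "WORF [OC]: ...", which may not hold for every transcript.

```shell
# Hypothetical helper (not part of the repo): count spoken lines per speaker
# in one transcript file.
# Assumption: a spoken line starts with an upper-case name, optionally
# followed by a bracketed note such as [OC], and then a colon.
trekLineCounts() {
  grep -E "^[[:upper:]][[:upper:][:digit:]' .-]*( ?\[[^]]*\])?:" "$1" |
    sed -E 's/ ?\[[^]]*\]:.*$//; s/:.*$//' |
    sort | uniq -c | sort -rn
}
```

Call it with an episode file, for example trekLineCounts DS9/200457.txt, to get one line per speaker with the busiest speaker first.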
Get the average words per episode for each series
$ printf "%-15s %-12s %-12s %-12s\n" "SERIES" "TOTAL_WORDS" "EPISODES" "WORD_PER_EP"; for f in *; do W=$(cat "$f"/*.txt | wc -w); E=$(ls "$f"/*.txt | wc -l); printf "%-15s %-12s %-12s %-12s\n" "$f" $W $E $((W / E)); done
SERIES          TOTAL_WORDS  EPISODES     WORD_PER_EP
DS9             949778       173          5490
Discovery       69637        15           4642
Enterprise      472007       97           4866
Movies          93772        10           9377
NextGen         908469       176          5161
TAS             67066        22           3048
TOS             423886       79           5365
Voyager         960510       160          6003
(Note: not super accurate since the transcripts include some descriptions of what's happening on screen and the name of each speaker. Running this on subtitles instead of transcripts would be more accurate.)
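If you only have the transcripts, one way to get a bit closer is to strip the speaker labels and on-screen descriptions before counting. The sketch below is only an approximation: trekSpokenWords is a made-up name, and it assumes descriptions sit in (parentheses) or [brackets] on a single line, which I have not checked against every transcript.

```shell
# Hypothetical helper (not part of the repo): approximate spoken-word count.
# Assumptions: on-screen descriptions sit in (parentheses) or [brackets] on a
# single line, and speaker labels are upper-case names ending in a colon.
trekSpokenWords() {
  cat "$@" |
    sed -E "s/\([^)]*\)//g; s/\[[^]]*\]//g; s/^[[:upper:]][[:upper:][:digit:]' .-]*://" |
    wc -w
}
```

For example, trekSpokenWords TOS/*.txt gives a total for one series that you can compare against the raw wc -w number above.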
Make a function to find the episodes where a character has the fewest/shortest lines, and then run it on Worf and then Tom Paris:
$ trekQuietestEpisodesFor () {
  limit_num=${2:-10}
  for p in "$1"
  do
    echo -n "\n\n========= $p ========"
    for f in $(ag -ti "${p} ?(\[[\w\s]+\])?:" --count | sed "s/txt:/txt /" | sort -nr -k 2 -k 1 | tail -n $limit_num | cut -d ' ' -f 1)
    do
      echo "\n=== Episode ${f:r} ==="
      cat "$f" | ag -ti "${p} ?(\[[\w\s]+\])?:"
    done
  done
}
$ trekQuietestEpisodesFor Worf 3
========= Worf ========
=== Episode DS9/200510 ===
WORF: Constable. Why are you talking to your beverage?
=== Episode DS9/200507 ===
WORF: We found him on top of the mountain, slumped over a subspace transmitter.
=== Episode DS9/200493 ===
WORF: I would.
$ trekQuietestEpisodesFor Paris 2
========= Paris ========
=== Episode Voyager/300622 ===
PARIS: Yes, ma'am.
=== Episode Voyager/300225 ===
PARIS: I'm picking up a lot of plasmatic turbulence in there. It might be a bumpy ride.
(Note: The above requires ag and probably zsh instead of bash, since it relies on zsh's ${f:r} modifier to drop the file extension and on echo interpreting \n.)
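If you are on bash and do not have ag, something along these lines might work instead. This is an untested-against-the-real-transcripts sketch, not a drop-in replacement: trekQuietestPortable is a made-up name, it uses grep rather than ag, it anchors the speaker label to the start of the line (the original pattern does not), and it lists the quietest episode first rather than last.

```shell
# Hypothetical bash-friendly take on trekQuietestEpisodesFor; uses grep, not ag.
# Run it from the scripts directory. The body is a subshell, so its variables
# do not leak into your session.
trekQuietestPortable() (
  name=$1
  limit=${2:-10}
  # Speaker label at the start of a line, e.g. "WORF:" or "WORF [OC]:".
  pattern="^${name} ?(\[[^]]*\])?:"
  # Pipeline: print file:count for every transcript, drop episodes where the
  # character never speaks, sort by count (fewest first), keep $limit files,
  # then print the matching lines from each of them.
  grep -rciE "$pattern" . |
    grep -v ':0$' |
    sort -t: -k2,2n |
    head -n "$limit" |
    cut -d: -f1 |
    while IFS= read -r f; do
      printf '\n=== Episode %s ===\n' "${f%.txt}"
      grep -iE "$pattern" "$f"
    done
)
```

Usage is the same as above, for example trekQuietestPortable Worf 3.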