Markivet is a Python library that converts TXT files exported from Retriever Mediearkivet into JSON files with structured metadata.
The structured output makes large-scale text analysis easier.
pip install git+https://github.com/peterdalle/markivet.git@v0.5
Convert a text file:
from markivet import Markivet
markivet = Markivet("aftonbladet.txt")
markivet.save("aftonbladet.json")
Show a summary:
print(markivet)
Convert multiple files:
ab1 = Markivet("aftonbladet1.txt")
ab2 = Markivet("aftonbladet2.txt")
ab3 = Markivet("aftonbladet3.txt")
markivet = ab1 + ab2 + ab3
markivet.remove_duplicates()
markivet.save("aftonbladet.json")
Convert all text files in a directory:
markivet = Markivet.from_path("/home/username/*.txt")
markivet.save("articles.json")
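The path passed to from_path above looks like a shell-style wildcard. Assuming it is expanded the same way as Python's standard glob module (an assumption; this page does not say how the pattern is resolved), the sketch below shows which files such a pattern would select. The directory and file names are made up for illustration, and the sketch does not call Markivet at all.

```python
import glob
import os
import tempfile

# Build a throwaway directory with a few hypothetical export files.
with tempfile.TemporaryDirectory() as workdir:
    for name in ["aftonbladet1.txt", "aftonbladet2.txt", "notes.md"]:
        with open(os.path.join(workdir, name), "w", encoding="utf-8") as handle:
            handle.write("placeholder")

    # Expand the wildcard with the standard library (assumed to mirror from_path).
    pattern = os.path.join(workdir, "*.txt")
    matched = sorted(os.path.basename(path) for path in glob.glob(pattern))

# Only the two .txt files are selected; notes.md is ignored.
print(matched)
```

Checking the pattern this way before converting can help confirm that no unrelated text files in the directory would be picked up.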
Loop through news articles and display their fields:
markivet = Markivet("aftonbladet.txt")
markivet.add_id() # adds incremental id to each article (e.g. 1 to 50 if you have 50 articles)
for news in markivet:
    print(news.id)
    print(news.title)
    print(news.section)
    print(news.page)
    print(news.newspaper)
    print(news.edition)
    print(news.date) # parsed date as yyyy-mm-dd hh:mm:ss
    print(news.date_raw) # date as it was found
    print(news.lead)
    print(news.body)
    print(news.url) # url to article on Mediearkivet
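Because each article exposes its metadata as attributes, the same kind of loop can feed a simple tally. The following is a minimal sketch that counts articles per newspaper and per section. It uses SimpleNamespace objects as stand-ins for real articles so that it runs without any exported files; the helper count_by and the sample values are hypothetical and not part of Markivet.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-ins for the articles yielded by iterating over a Markivet object.
articles = [
    SimpleNamespace(newspaper="Aftonbladet", section="Nyheter"),
    SimpleNamespace(newspaper="Aftonbladet", section="Sport"),
    SimpleNamespace(newspaper="Expressen", section="Nyheter"),
]


def count_by(items, attribute):
    """Count how often each value of `attribute` occurs across the items."""
    return Counter(getattr(item, attribute) for item in items)


per_newspaper = count_by(articles, "newspaper")
per_section = count_by(articles, "section")

print(per_newspaper.most_common())
print(per_section.most_common())
```

With real data, passing the Markivet object in place of the stand-in list should behave the same way, provided the articles expose the attributes shown in the loop above.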
Note: All examples on this page assume that you've downloaded text files from Retriever Mediearkivet with default settings (Swedish).
A parser is responsible for converting the article text string into structured metadata (of the type NewsArticle).
You can write your own parser if you don't like the default ArticleParser.
How to:
- Create your own class, like MyParser
- Add a parse() method
- The method must take a string as an input argument
- The method must return a NewsArticle object
- When you want to use your parser, pass the class name as an argument: Markivet("file.txt", parser=MyParser)
Example:
from markivet import Markivet, NewsArticle
class MyParser:
    def parse(self, content: str) -> NewsArticle:
        """Extract the info you want, put it into NewsArticle, and return it"""
        news = NewsArticle()
        news.title = "I see no God here other than me"
        news.newspaper = "Journal of Advanced Self-Indulgence"
        news.lead = "I walked by the mirror and looked God into the eyes."
        news.body = "True story."
        news.section = "Domestic News"
        return news
journal = Markivet("journal.txt", parser=MyParser) # <---- Inject your parser here
journal.save("journal.json")
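The example above returns fixed values. A parser that actually reads its input might look like the sketch below, which treats the first non-empty line as the title and the remaining lines as the body. The class FirstLineTitleParser is hypothetical, and the small NewsArticle dataclass is only a stand-in so the sketch runs without Markivet installed; with the library available, import NewsArticle from markivet instead.

```python
from dataclasses import dataclass


# Stand-in for markivet.NewsArticle (assumption: attributes are set after creation).
@dataclass
class NewsArticle:
    title: str = ""
    body: str = ""


class FirstLineTitleParser:
    """Hypothetical parser: first non-empty line is the title, the rest is the body."""

    def parse(self, content: str) -> NewsArticle:
        # Drop blank lines and surrounding whitespace before splitting title and body.
        lines = [line.strip() for line in content.splitlines() if line.strip()]
        news = NewsArticle()
        if lines:
            news.title = lines[0]
            news.body = "\n".join(lines[1:])
        return news


sample = "Storm hits the coast\n\nHeavy rain fell overnight.\nRoads were closed."
article = FirstLineTitleParser().parse(sample)
print(article.title)
print(article.body)
```

Real exports contain more fields than a title and body, so a practical parser would need rules for the newspaper name, date, and section as well.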
Markivet consists of three classes.
Class | What it does |
---|---|
Markivet | Loads TXT files, identifies all articles in a TXT file, and saves JSON files |
ArticleParser | Converts an article text string into a NewsArticle object |
NewsArticle | Represents a news article with title, name of newspaper, lead, body, pages etc. |
Create a new issue if you find an error with the software or have a feature request.