Understanding Querido Diário
For an official gazette file to travel from a municipality's website to you through Querido Diário, it goes through a few stages:
In data collection, we put scraping robots to work on our behalf: every day they visit the publishing websites of the integrated municipalities to obtain the original official gazette files. Here we use Python and Scrapy for scraping, and PostgreSQL for storage.
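For illustration, here is a minimal sketch of such a scraping robot written with Scrapy. The URL, CSS selectors, and item fields are hypothetical; each real spider is tailored to one municipality's publishing website.

```python
import scrapy


class ExampleGazetteSpider(scrapy.Spider):
    """Hypothetical spider for one municipality's gazette page."""

    name = "example_gazette"
    # Hypothetical publishing page of an integrated municipality
    start_urls = ["https://www.example-municipality.gov.br/gazettes"]

    def parse(self, response):
        # Each listed gazette links to the original file (usually a PDF)
        for entry in response.css("a.gazette-link"):
            yield {
                "date": entry.css("::attr(data-date)").get(),
                "file_url": response.urljoin(entry.attrib["href"]),
            }
```

Run daily (for example with `scrapy runspider`), a spider like this yields the metadata and file URLs of new gazette editions to download and store.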
In data processing, we process each collected file, mainly by extracting the textual content of closed formats (usually PDFs) into an open, searchable one. Python, Apache Tika (text extraction), and OpenSearch (full-text search engine) make this possible.
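A minimal sketch of this step, assuming the `tika` and `opensearch-py` client libraries; the host, index name, and document fields are placeholders, not the project's actual schema.

```python
from tika import parser
from opensearchpy import OpenSearch

# Extract the textual content of a closed file (usually a PDF)
parsed = parser.from_file("gazette-2024-01-15.pdf")
text = parsed.get("content") or ""

# Index the extracted text so it becomes searchable
client = OpenSearch(hosts=["http://localhost:9200"])  # placeholder host
client.index(
    index="gazettes",  # placeholder index name
    body={"date": "2024-01-15", "content": text},
)
```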
In data sharing, we provide the means of accessing our data. Anyone can search in a user-friendly way with the search engine on this site's home page, developed in TypeScript and Angular; and any program can search programmatically via the Public API, developed in Python with FastAPI.
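As an example, any program that can issue HTTP requests can query the Public API. The sketch below uses Python's `requests` library; the endpoint and parameter names are assumptions based on the API's public documentation, so check https://queridodiario.ok.org.br/api/docs for the authoritative reference.

```python
import requests

# Search gazettes mentioning a keyword; "querystring" and "size" are
# assumed parameter names (verify against the API docs before relying on them)
response = requests.get(
    "https://queridodiario.ok.org.br/api/gazettes",
    params={"querystring": "education", "size": 5},
)
response.raise_for_status()
for gazette in response.json().get("gazettes", []):
    print(gazette.get("date"), gazette.get("url"))
```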
The following image summarizes how these pieces interact to produce the complete data flow.
