
Obtain updated PDF URLs automatically #155

Open
grimord opened this issue Apr 17, 2021 · 1 comment
Labels
enhancement New feature or request question Further information is requested

Comments


grimord commented Apr 17, 2021

Create a simple web scraper to automatically obtain links to all ISEL PDF timetables from the programme pages (it could also capture the official programme name and degree level - licenciatura / mestrado).

To consider:

  • The current solution uses a ".properties" file for each programme, holding two key-value pairs: the PDF URL and an alert recipient email address. These files have to be updated manually.
  • Should the files used to store programme metadata and PDF URL be updated automatically as soon as a new URL is found?
  • Should we keep a file "history" of some sort?
  • If yes, should we fall back to the previous working URL if a new one fails? (downloading or parsing? both?)
  • Should the source file format be changed to store this data in a more structured format like YAML, JSON or CSV?
  • Is there any other source of information that could be scraped that would prove useful for other components?
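To make the structured-format question above concrete, here is a minimal sketch of migrating one programme's ".properties" pairs into a JSON record that also seeds a URL history. The key names (`pdf_url`, `alert_email`) and the record shape are assumptions, since the issue does not show the actual file contents:

```javascript
// Parse ".properties"-style "key=value" lines into an object.
// Split on the FIRST "=" only, since URLs may themselves contain "=".
const parseProperties = (text) =>
  Object.fromEntries(
    text
      .split("\n")
      .filter((line) => line.includes("="))
      .map((line) => {
        const idx = line.indexOf("=");
        return [line.slice(0, idx).trim(), line.slice(idx + 1).trim()];
      })
  );

// Build a hypothetical JSON record; "urlHistory" (newest first) keeps
// fallback candidates if a future URL turns out to be broken.
const toProgrammeRecord = (props) => ({
  pdfUrl: props.pdf_url,
  alertEmail: props.alert_email,
  urlHistory: [props.pdf_url],
});

const example = "pdf_url=https://example.org/leic.pdf\nalert_email=alerts@example.org";
console.log(JSON.stringify(toProgrammeRecord(parseProperties(example)), null, 2));
```

A JSON (or YAML) record like this would let the scraper update files programmatically without the ad-hoc parsing that ".properties" files require.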
@grimord grimord added enhancement New feature or request question Further information is requested labels Apr 17, 2021

grimord commented Apr 17, 2021

Created a basic throwaway scraper as a practical proof of concept using Node.js.
ISEL's current website doesn't include ID attributes in most elements, so DOM queries will have to rely on element type + class combinations and even a little filtering through href attributes.

const fetch = require("node-fetch");
const cheerio = require("cheerio");

// Timetable pages for three programmes.
const LEIC = "https://www.isel.pt/cursos/licenciaturas/engenharia-informatica-e-computadores/horarios";
const LEM = "https://www.isel.pt/cursos/licenciaturas/engenharia-mecanica/horarios";
const LEC = "https://www.isel.pt/cursos/licenciaturas/engenharia-civil/horarios";
const programmes = [LEIC, LEC, LEM];

const getSchedule = async (uri) => {
  const body = await fetch(uri).then((resp) => resp.text());

  const $ = cheerio.load(body);

  // No usable IDs on the page, so match by element type + class and keep
  // the first anchor whose href points at a PDF. The "|| ''" guards
  // against anchors with no href attribute at all.
  const pdf_anchor = $("a[class=sizer]")
    .filter((i, el) => ($(el).attr("href") || "").endsWith("pdf"))
    .first();

  return {
    programme: $("h1[class=sizer]").text(),
    pdf: pdf_anchor.attr("href"),
  };
};

programmes.forEach((url) =>
  getSchedule(url)
    .then((d) => console.log(`${d.programme} => ${d.pdf}`))
    .catch((err) => console.error(`${url}: ${err.message}`))
);
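One possible answer to the update/history/fallback questions from the issue description, sketched with in-memory records. The record shape and the validation hook are assumptions, not the current implementation:

```javascript
// When the scraper finds a new URL, update the stored record but keep the
// old URLs so a broken new link can fall back to the previous working one.
const updateRecord = (record, scrapedUrl) => {
  if (scrapedUrl === record.pdfUrl) return record; // nothing new found
  return {
    ...record,
    pdfUrl: scrapedUrl,
    urlHistory: [scrapedUrl, ...record.urlHistory], // newest first
  };
};

// Pick the first URL in the history that passes validation -- e.g. a HEAD
// request returning 200 with a PDF content type (not implemented here).
const firstWorkingUrl = (record, isWorking) => record.urlHistory.find(isWorking);

// Example: the freshly scraped URL turns out to be broken, so we fall back.
const updated = updateRecord(
  { pdfUrl: "https://example.org/old.pdf", urlHistory: ["https://example.org/old.pdf"] },
  "https://example.org/new.pdf"
);
console.log(firstWorkingUrl(updated, (url) => url.endsWith("old.pdf")));
// falls back to https://example.org/old.pdf
```

Persisting `urlHistory` alongside the metadata would also give the file "history" the issue asks about for free.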
