Skip to content

Latest commit

 

History

History
321 lines (243 loc) · 9.17 KB

README.md

File metadata and controls

321 lines (243 loc) · 9.17 KB

web scrapping 101

In this project we will understand the DOM and interact with it, we will learn and assure HTML and some python best practices like requirements.txt and pip.


TOC


Pre Requisites

Go ahead and read these:

  • HTML
  • We will try to understand the DOM

Python Extras

pip

pip is the python package manage, from the web:

is the standard package manager for Python. It allows you to install and manage additional packages that are not part of the Python standard library

requirements.txt

“Requirements files” are files containing a list of items to be installed using pip install like so:

pip install -r requirements.txt

Requirements are meant for (but not limited to):

  1. Requirements files are used to hold the result from pip freeze for the purpose of achieving repeatable installations

    pip freeze > requirements.txt
    pip install -r requirements.txt
  2. Requirements files are used to force pip to properly resolve dependencies. As it is now, pip doesn’t have true dependency resolution, but instead simply uses the first specification it finds for a project. E.g. if pkg1 requires pkg3>=1.0 and pkg2 requires pkg3>=1.0,<=2.0, and if pkg1 is resolved first, pip will only use pkg3>=1.0, and could easily end up installing a version of pkg3 that conflicts with the needs of pkg2. To solve this problem, you can place pkg3>=1.0,<=2.0 (i.e. the correct specification) into your requirements file directly along with the other top level requirements

    pkg1
    pkg2
    pkg3>=1.0,<=2.0
  3. Requirements files are used to force pip to install an alternate version of a sub-dependency. For example, suppose ProjectA in your requirements file requires ProjectB, but the latest version (v1.3) has a bug, you can force pip to accept earlier versions like so:

    ProjectA
    ProjectB<1.3

There are other ways of achieving the same result but will leave those for later.


Web Scraping with bs4 and requests

We will be using requests to GET the html and bs4 to parse it

will be use to make http requests (GET by default) and retrieve a html web page content

is a Python library for pulling data out of HTML and XML files


Your project

For every item here you must display the results in a very understandable way:

<YOUR NAME GOES HERE>
=============================
1. Portal
# item_title: <result>
GET the title and print it: <result>
---------------------------------------
GET the Complete Address of UFM: <result>
------------------------------------------
.
.
.
find all properties that have href (link to somewhere):
- <result 1>
- <result 2>
- <result 3>
=============================
2. Estudios
# ----- : separator between items

# ===== : separator between parts

# 1. Title: Title of the section

# use '-' if its a list

It will be possible to pass an argument to your app to specify which section to run, if no argument provided it will default to "run all parts"

# default to run all parts
python3 soup.py

# run part 1
python3 soup.py 1

# run part 2
python3 soup.py 2

# run part 3
python3 soup.py 3

  • NOTE If for some reason the result exceeds 30 lines you will display "Output exceeds 30 lines, sending output to: <logfile>" and send the output to a text file inside logs/ , example format:
$ python3 soup.py 1
=============================
1. Portal
GET the title and print it: Output exceeds 30 lines, sending output to: logs/1portal_GET_the_title_and_print_it.txt


$ ls logs/1portal_GET_the_title_and_print_it.txt

$ cat logs/1portal_GET_the_title_and_print_it.txt

Date of generation: Mon Sep  9 22:58:30 CST 2019
================================================

Universidad Francisco Marroquín

this log files will not be git tracked.


1. Portal

using "http://ufm.edu/Portal"

  • GET the title and print it
  • GET the Complete Address of UFM
  • GET the phone number and info email
  • GET all item that are part of the upper nav menu (id: menu-table)
  • find all properties that have href (link to somewhere)
  • GET href of "UFMail" button
  • GET href "MiU" button.
  • get hrefs of all <img>
  • count all <a>

1.1 Extra points

  • From all (<a>) Create a csv file (logs/extra_as.csv) with the following columns: Text, href

example:

<ul><li><a target="_blank" rel="nofollow noreferrer noopener" class="external text" href="https://www.ufm.edu/english/">UFM Key Projects</a></li>
Text href
UFM Key Projects https://www.ufm.edu/english/

2. Estudios

using "http://ufm.edu/Estudios"

  • now navigate to /Estudios (better if you obtain href from the DOM)
  • display all items from "topmenu" (8 in total)
  • display ALL "Estudios" (Doctorados/Maestrias/Posgrados/Licenciaturas/Baccalaureus)
  • display from "leftbar" all <li> items (4 in total)
  • get and display all available social media with its links (href) "class=social pull-right"
  • count all <a> (just display the count)

3. CS

using "https://fce.ufm.edu/carrera/cs/"

  • GET title
  • Download the "FACULTAD de CIENCIAS ECONOMICAS" logo. (you need to obtain the link dynamically)
  • GET following <meta>: "title", "description" ("og")
  • count all <a> (just display the count)
  • count all <div> (just display the count)

4. Directorio

using "https://www.ufm.edu/Directorio"

  • Sort all emails alphabetically (href="mailto:arquitectura@ufm.edu") in a list, dump it to logs/4directorio_emails.txt
  • Count all emails that start with a vowel. (just display the count)
  • Group in a JSON all rows that have Same Address (dont use Room number) as address, dump it to logs/4directorio_address.json
{
    "Edificio Academico":[
        "Arquitectura",
        "Ciencias Economicas",
        .
        .
        .
        "Crédito Educativo"
    ],
    "Centro Estudiantil":[
        "Admisiones",
        .
        .
        .
         "Desarrollo"
    ],
    .
    .
    .
}
  • Try to correlate in a JSON Faculty Dean and Directors, and dump it to logs/4directorio_deans.json
{
    "Facultad de Arquitectura": {
            "Dean/Director": "Roberto Quevedo",
            "email": "rquevedo@ufm.edu",
            "Phone Number": "2338-7709"
        },
    "Facultad de Ciencias Económicas": {
        "Dean/Director": "Mónica Rio Nevado de Zelaya",
        "email": "zelaya@ufm.edu",
        "Phone Number": "2338-7723 2338-7724"
    }
    .
    .
    .
}
  • GET the directory of all 3 column table and generate a CSV with these columns (Entity,FullName, Email), and dump it to logs/4directorio_3column_tables.csv
Entity FullName Email
Rector Gabriel Calzada Álvarez rectoria@ufm.edu
Campus Madrid Gonzalo Melián gmelian@ufm.edu
Alumni Marcela Porta alumni@ufm.edu

5. Extra

  • Complete Dockerfile
  • Create README section for Dockerfile under Usage Dockerfile
  • Add CI to your own repo.

Start your project

In order to start your project:

  • you MUST fork this repository into your own personal repo in github

  • you will need to use git and commit every once in a while, every commit must have a meaningful message.

  • to start using it:

    # clone
    git clone <your own personal repo URL>
    # install dependencies
    pip install -r requirements.txt
    # run it
    python soup.py
    # or
    ./soup.py
  • everytime you complete an "item" make sure to mark it as done [x]

Usage Dockerfile

Put your Docker build/run/etc commands here

Delivery

  • FORK IT!!
  • This will be developed individually
  • You will send the response via miU
  • You will respond only with the URL of your git repo. (preferable git tags)
  • your name (username) MUST have commits in the git log.
  • it must compile & work!
  • READ all README.me first