Install
After compiling the product by following the Compile instructions, you need to create a Hadoop cluster to process a large number of arc files. This page is based on the tutorials at http://wiki.apache.org/hadoop/QuickStart, http://hadoop.apache.org/common/docs/current/cluster_setup.html and http://wiki.apache.org/hadoop/GettingStartedWithHadoop. The pwa uses the Hadoop cluster only to create the indexes for the collections; for an active production Hadoop cluster, consult the Project page.
This tutorial assumes basic knowledge of Hadoop and some knowledge of the Apache Tomcat server.
First, take the Hadoop files generated by the Compile procedure and copy them to all your cluster servers.
As the tutorial pages above explain, Hadoop can run in distributed mode, which is the best way to process a large number of arc files.
As a base configuration, the Portuguese Web Archive assumes the following directory structure, and this tutorial is based on it:
/opt/searcher
/apache-tomcat-5.5.25 -> tomcat server
/arcproxy -> files for the arcproxy database
/collections -> collections served by this hadoop search server
/dictionaries -> dictionaries for the spellchecker application
/hadoop -> hadoop for processing the arc files
/logs -> Logs from the hadoop search servers
/run -> directory to hold the pid files
/scripts -> scripts for starting applications
/data/outputs -> directory for the indexes created from the arc files
/hadooptemp -> directory for all the Hadoop files used for indexing
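The layout above can be created in one pass. The following is a minimal sketch, not part of the PWA scripts: the function name create_layout and its optional prefix argument are assumptions made for illustration. The prefix exists only so the layout can be rehearsed in a scratch directory. The tomcat directory is left out because it is created later when the tomcat archive is unpacked.

```shell
#!/bin/sh
# Sketch: create the directory layout described above.
# create_layout is a hypothetical helper; the optional PREFIX argument is an
# assumption that lets you rehearse in a scratch directory (an empty PREFIX
# creates the real paths and usually needs root privileges).
create_layout() {
    prefix="${1:-}"
    for dir in \
        /opt/searcher/arcproxy \
        /opt/searcher/collections \
        /opt/searcher/dictionaries \
        /opt/searcher/hadoop \
        /opt/searcher/logs \
        /opt/searcher/run \
        /opt/searcher/scripts \
        /data/outputs \
        /hadooptemp
    do
        mkdir -p "${prefix}${dir}" || return 1
    done
}

# Example (rehearsal in a temporary directory):
#   create_layout /tmp/pwa-layout
```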
Starting up a cluster:
- Define a server to be the Hadoop master. This server will run the namenode and the jobtracker (for more information on Hadoop, see the Hadoop wiki page and project page).
- Copy the Hadoop folder after Compile to every server of the cluster.
- Ensure that the Hadoop package is accessible from the same path on all nodes that are to be included in the cluster. If you have separated configuration from the install then ensure that the config directory is also accessible the same way.
- Populate the slaves file with the nodes to be included in the cluster, one node per line.
- Format the Namenode.
- Configure an environment variable for the Hadoop home
% export HADOOP_HOME=/opt/searcher/hadoop
- Run the command
% ${HADOOP_HOME}/bin/start-dfs.sh
on the node you want the Namenode to run on. This will bring up HDFS with the Namenode running on the machine you ran the command on and Datanodes on the machines listed in the slaves file mentioned above.
- Run the command
% ${HADOOP_HOME}/bin/start-mapred.sh
on the machine you plan to run the Jobtracker on. This will bring up the Map/Reduce cluster with the Jobtracker running on the machine you ran the command on and Tasktrackers running on the machines listed in the slaves file.
Once you know which server to use as master, define it in the configuration files.
Set up the masters file at ${HADOOP_HOME}/conf/masters:
master.example.com
Set up the slaves file at ${HADOOP_HOME}/conf/slaves:
server1.example.com
server2.example.com
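Both files are plain lists of host names, so they can be generated instead of edited by hand. This is a sketch under stated assumptions: write_cluster_files is a hypothetical helper, and the host names are the placeholders used in this tutorial.

```shell
#!/bin/sh
# Sketch: write the masters and slaves files, one host name per line.
# write_cluster_files is a hypothetical helper, not a Hadoop command.
# Usage: write_cluster_files CONF_DIR MASTER SLAVE...
write_cluster_files() {
    conf_dir="$1"
    master="$2"
    shift 2
    printf '%s\n' "$master" > "${conf_dir}/masters" || return 1
    : > "${conf_dir}/slaves" || return 1      # start from an empty slaves file
    for slave in "$@"; do
        printf '%s\n' "$slave" >> "${conf_dir}/slaves" || return 1
    done
}

# Example:
#   write_cluster_files "${HADOOP_HOME}/conf" master.example.com \
#       server1.example.com server2.example.com
```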
For your cluster you need to edit the file ${HADOOP_HOME}/conf/hadoop-site.xml. Change the value of fs.default.name to the namenode server and the value of mapred.job.tracker to the jobtracker server. You also need to correct the directory paths for your system: dfs.name.dir, dfs.data.dir, mapred.system.dir, mapred.local.dir and hadoop.tmp.dir. Every option that can be configured in this file is listed in ${HADOOP_HOME}/conf/hadoop-default.xml. An example hadoop-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>master.example.com:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>master.example.com:9001</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/hadooptemp/dfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadooptemp/dfs/datanode</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/hadooptemp/mapred/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/hadooptemp/mapred/local</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx6000m</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/hadooptemp/tmp/hadoop-${user.name}</value>
</property>
</configuration>
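Before copying hadoop-site.xml to the other machines, it can help to confirm that none of the properties named above was dropped while editing. The check below is only a sketch: missing_site_properties is a hypothetical helper, and it does a plain text search for each name, not XML parsing, so it says nothing about whether the values are correct.

```shell
#!/bin/sh
# Sketch: report which of the properties discussed above are absent from a
# hadoop-site.xml. missing_site_properties is a hypothetical helper.
# Prints each missing property name; returns 1 if any is missing.
missing_site_properties() {
    site_file="$1"
    status=0
    for prop in fs.default.name mapred.job.tracker dfs.name.dir \
                dfs.data.dir mapred.system.dir mapred.local.dir hadoop.tmp.dir
    do
        # Fixed-string match on the <name> element of the property.
        if ! grep -qF "<name>${prop}</name>" "$site_file"; then
            echo "$prop"
            status=1
        fi
    done
    return "$status"
}

# Example:
#   missing_site_properties "${HADOOP_HOME}/conf/hadoop-site.xml"
```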
Edit the JAVA_HOME variable in ${HADOOP_HOME}/conf/hadoop-env.sh:
...
export JAVA_HOME=/usr/java/default
...
Edit the files on the master and copy them to all the other machines.
You also need to share the ssh public key of the master server with the other servers:
- Generate an ssh key:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
- Remove the password request for localhost:
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
- Test (if no password is requested, it is OK):
ssh localhost
- Put the key on the other machines; repeat for every machine in the cluster:
ssh-copy-id -i ~/.ssh/id_dsa.pub user@server1.example.com
- Create the directories configured in hadoop-site.xml:
mkdir -p /hadooptemp/dfs/namenode/
mkdir -p /hadooptemp/dfs/datanode/
mkdir -p /hadooptemp/mapred/system/
mkdir -p /hadooptemp/mapred/local/
- Format the HDFS (attention: the Y that confirms the format must be capitalized):
${HADOOP_HOME}/bin/hadoop namenode -format
Note: If the format is aborted, remove the directories configured in hadoop-site.xml and format again:
rm -rf /hadooptemp/dfs/namenode/*
rm -rf /hadooptemp/dfs/datanode/*
rm -rf /hadooptemp/mapred/system/*
rm -rf /hadooptemp/mapred/local/*
- Start the Hadoop daemons on all machines of the Hadoop cluster:
${HADOOP_HOME}/bin/start-all.sh
- See if the services started OK:
NameNode http://master.example.com:50070/
JobTracker http://master.example.com:50030/
- Stop the Hadoop daemons:
${HADOOP_HOME}/bin/stop-all.sh
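The key-distribution step in the list above has to be repeated for every machine, and the slaves file already names them. The sketch below only prints the ssh-copy-id commands so they can be reviewed before being run. The helper name print_key_copy_commands is hypothetical, and the remote user name is an assumption you must supply.

```shell
#!/bin/sh
# Sketch: print one ssh-copy-id command per host listed in the slaves file.
# Dry run only: nothing is executed. print_key_copy_commands is a
# hypothetical helper; USER is the remote account (an assumption).
# Usage: print_key_copy_commands SLAVES_FILE USER
print_key_copy_commands() {
    slaves_file="$1"
    user="$2"
    while read -r host rest; do
        case "$host" in
            ''|'#'*) continue ;;    # skip blank lines and comments
        esac
        echo "ssh-copy-id -i ~/.ssh/id_dsa.pub ${user}@${host}"
    done < "$slaves_file"
}

# Example:
#   print_key_copy_commands "${HADOOP_HOME}/conf/slaves" user
```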
Some server parameters need to be changed. Because Hadoop opens a large number of files, add these 2 lines to the file /etc/security/limits.conf:
...
* hard nofile 65000
* soft nofile 30000
...
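The new limits only apply to sessions opened after the change, so it is worth checking the value a fresh login shell actually sees. A minimal sketch, assuming the soft limit of 30000 from the lines above; nofile_at_least is a hypothetical helper built on the shell's ulimit -n.

```shell
#!/bin/sh
# Sketch: succeed if the soft open-file limit of the current shell is at
# least REQUIRED (or unlimited). nofile_at_least is a hypothetical helper.
nofile_at_least() {
    required="$1"
    current=$(ulimit -n)
    [ "$current" = "unlimited" ] && return 0
    [ "$current" -ge "$required" ]
}

# Example (run in a new login session of the user that starts Hadoop):
#   nofile_at_least 30000 || echo "soft nofile limit is below 30000"
```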
Change the LANG variable in the file /etc/sysconfig/i18n:
LANG="pt_PT.ISO-8859-1"
...
- Install apache tomcat 5.5.25 http://archive.apache.org/dist/tomcat/tomcat-5/v5.5.25/bin/apache-tomcat-5.5.25.tar.gz
- Get the compilation result of Hadoop 0.14.4 (instructions in Compile)
- For simplicity, the files will be referred to as:
- nutchwax.jar:
pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-job/target/nutchwax-job-0.11.0-SNAPSHOT.jar
- nutchwax.war:
pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-webapp/target/nutchwax-webapp-0.11.0-SNAPSHOT.war
- wayback.war:
pwa-technologies/PwaArchive-access/projects/wayback/wayback-webapp/target/wayback-1.2.1.war
- pwalucene.jar:
pwa-technologies/PwaLucene/target/pwalucene-1.0.0-SNAPSHOT.jar
- To install and configure tomcat this documentation should be followed.
- Untar the file to /opt/searcher/apache-tomcat-5.5.25
tar -zxf apache-tomcat-5.5.25.tar.gz
- Configure an environment variable for the Catalina home
% export CATALINA_HOME=/opt/searcher/apache-tomcat-5.5.25
Copy the file nutchwax.war to ${CATALINA_HOME}/webapps/
Set ${CATALINA_HOME}/webapps/nutchwax/WEB-INF/classes/hadoop-site.xml with:
<name>searcher.dir</name>
<value>/opt/searcher/scripts</value>
...
<name>wax.host</name>
<value>example.com:8080/wayback/wayback</value>
The searcher.dir directory (/opt/searcher/scripts) should contain a file named search-servers.txt that defines where the nutch servers are running:
example.com 21111
example.com 21112
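A malformed line in this file is easy to miss. The sketch below checks that every non-empty line has exactly a host and a numeric port, which is the format shown above. validate_search_servers is a hypothetical helper, and the two-field rule is inferred from the example, not taken from the nutchwax documentation.

```shell
#!/bin/sh
# Sketch: validate the "host port" lines of search-servers.txt.
# validate_search_servers is a hypothetical helper. It prints every
# offending line with its line number and returns 1 if any was found.
validate_search_servers() {
    awk 'NF == 0 { next }                       # ignore blank lines
         NF != 2 || $2 !~ /^[0-9]+$/ {          # need host plus numeric port
             printf "line %d: %s\n", NR, $0
             bad = 1
         }
         END { exit (bad ? 1 : 0) }' "$1"
}

# Example:
#   validate_search_servers /opt/searcher/scripts/search-servers.txt
```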
The property wax.host should hold the host with the wayback configuration that is open to the world.
Copy the file wayback.war to ${CATALINA_HOME}/webapps/
Update the file ${CATALINA_HOME}/webapps/wayback/WEB-INF/wayback.xml:
- The resourceStore maps the arc name to a URL for the arc itself; normally the arc is served by an http server. The configuration should refer to the arcproxy, which knows where all the arc files are.
- The remotecollection is the location used to search for a URL. It should be configured with the nutchwax search.
- The uriConverter is used to build the replay URL to be accessed.
...
<property name="resourceStore">
<bean class="org.archive.wayback.resourcestore.Http11ResourceStore">
<property name="urlPrefix" value="http://127.0.0.1:8080/arcproxy/arcproxy/" />
...
<property name="resourceIndex">
<bean class="org.archive.wayback.resourceindex.NutchResourceIndex" init-method="init">
<property name="searchUrlBase" value="http://127.0.0.1:8080/nutchwax/opensearch" />
...
<property name="uriConverter">
<bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
<property name="replayURIPrefix" value="http://MACHINE:8080/wayback/wayback/" />
...
There are two ways of setting up an arcproxy:
1 - Copy the file wayback.war to ${CATALINA_HOME}/webapps/ and rename it to arcproxy, mv wayback.xml arcproxy.xml
Replace the file ${CATALINA_HOME}/webapps/arcproxy/WEB-INF/wayback.xml:
- Change bdbPath, bdbName and logPath to the values you want; the directories have to be writable by the tomcat application.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
<!--
The following 3 beans are required when using the ArcProxy for providing
HTTP 1.1 remote access to ARC files distributed across multiple computers
or directories.
-->
<bean id="filelocationdb" class="org.archive.wayback.resourcestore.http.FileLocationDB"
init-method="init">
<property name="bdbPath" value="/home/wayback/searcher/arcproxy" />
<property name="bdbName" value="arquivo" />
<property name="logPath" value="/home/wayback/searcher/arcproxy/tmp_arc-db.log" />
</bean>
<bean name="8080:arcproxy" class="org.archive.wayback.resourcestore.http.ArcProxyServlet">
<property name="locationDB" ref="filelocationdb" />
</bean>
<bean name="8080:locationdb" class="org.archive.wayback.resourcestore.http.FileLocationDBServlet">
<property name="locationDB" ref="filelocationdb" />
</bean>
</beans>
2 - Wayback: append the configuration of the beans above to the file wayback.xml.
...
<property name="resourceStore">
<bean class="org.archive.wayback.resourcestore.Http11ResourceStore">
<property name="urlPrefix" value="http://127.0.0.1:8080/wayback/arcproxy/" />
...
<bean id="filelocationdb" class="org.archive.wayback.resourcestore.http.FileLocationDB"
init-method="init">
<property name="bdbPath" value="/opt/searcher/arcproxy" />
<property name="bdbName" value="arquivo" />
<property name="logPath" value="/opt/searcher/arcproxy/tmp_arc-db.log" />
</bean>
<bean name="8080:arcproxy" class="org.archive.wayback.resourcestore.http.ArcProxyServlet">
<property name="locationDB" ref="filelocationdb" />
</bean>
<bean name="8080:locationdb" class="org.archive.wayback.resourcestore.http.FileLocationDBServlet">
<property name="locationDB" ref="filelocationdb" />
</bean>
</beans>
...
<property name="resourceIndex">
<bean class="org.archive.wayback.resourceindex.NutchResourceIndex" init-method="init">
<property name="searchUrlBase" value="http://127.0.0.1:8080/nutchwax/opensearch" />
...
<property name="uriConverter">
<bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
<property name="replayURIPrefix" value="http://MACHINE:8080/wayback/wayback/" />
...
The browser application makes the arc files available through http by creating a repository of arc files. To create the application, create a folder for it inside the webapps directory of tomcat.
mkdir -p ${CATALINA_HOME}/webapps/browser/WEB-INF
mkdir -p ${CATALINA_HOME}/webapps/browser/files
Then create a file web.xml in the WEB-INF directory:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE web-app
PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
"http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
<display-name>browser</display-name>
<description>File Browsing Application for the Document Share</description>
<!-- Enable directory listings by overriding the server default web.xml -->
<!-- definition for the default servlet -->
<servlet>
<servlet-name>DefaultServletOverride</servlet-name>
<servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
<init-param>
<param-name>listings</param-name>
<param-value>true</param-value>
</init-param>
<load-on-startup>1</load-on-startup>
</servlet>
<!-- Add a mapping for our new default servlet -->
<servlet-mapping>
<servlet-name>DefaultServletOverride</servlet-name>
<url-pattern>/</url-pattern>
</servlet-mapping>
</web-app>
For this you need to have the nutchwax.jar and the Hadoop folder after Compile has been run.
- Copy the Hadoop folder to the server that will serve the collection. Each collection needs its own copy of this folder (example: /opt/searcher/collections/test_collection_hadoop).
- Copy the file nutchwax.jar to /opt/searcher/collections/test_collection_hadoop
- Get the scripts from https://github.com/arquivo/pwa-technologies/tree/master/scripts and put them in /opt/searcher/scripts
- Configure an environment variable for the Scripts directory
% export SCRIPTS_DIR=/opt/searcher/scripts
- Copy the file pwalucene.jar to /opt/searcher/scripts.
- Configure an environment variable for the Collections directory
% export COLLECTIONS_DIR=/opt/searcher/collections
- Create the file ${COLLECTIONS_DIR}/search-servers.txt with the following definition, 1 line per collection:
hostname port_for_server folder_of_hadoop_server folder_for_outputs
example:
master.example.com 21111 /opt/searcher/collections/test_collection_hadoop /data/outputs
- Start the servers:
% ${SCRIPTS_DIR}/start-slave-searchers.sh
- Check /opt/searcher/logs/slave-searcher-21111.log for startup errors of the server.
- To stop the server use:
% ${SCRIPTS_DIR}/stop-slave-searchers.sh
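Each line of the collections search-servers.txt names two directories that must exist before the servers are started. The sketch below reads the four fields described in the list above and reports any directory that is missing. check_collection_dirs is a hypothetical helper and is not one of the PWA scripts.

```shell
#!/bin/sh
# Sketch: for every "hostname port hadoop_folder outputs_folder" line,
# report directories that do not exist on this machine.
# check_collection_dirs is a hypothetical helper.
# Returns 1 if at least one directory is missing.
check_collection_dirs() {
    servers_file="$1"
    status=0
    while read -r host port hadoop_dir outputs_dir; do
        [ -z "$host" ] && continue              # skip blank lines
        for dir in "$hadoop_dir" "$outputs_dir"; do
            if [ ! -d "$dir" ]; then
                echo "${host}:${port}: missing directory ${dir}"
                status=1
            fi
        done
    done < "$servers_file"
    return "$status"
}

# Example:
#   check_collection_dirs "${COLLECTIONS_DIR}/search-servers.txt"
```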
- Download https://github.com/arquivo/pwa-technologies/tree/master/Plone/ploneConf
- Download http://sobre.arquivo.pt/~pwa/PWA-TechnologiesSourceCodeDump22-11-2013/Data.fs
- tar -xvf Plone-3.0.5-UnifiedInstaller.tar.gz
- tar -xvf LinguaPlone-2.0.tar.gz
- Install Plone without PWA configurations
Plone-3.0.5-UnifiedInstaller/install.sh standalone
- The following files are the PWA visual configurations
Plone-3.0.5/zinstance/lib/python/plone/app/i18n/locales/browser/selector.py
Plone-3.0.5/zinstance/lib/python/plone/app/i18n/locales/browser/languageselector.pt
Plone-3.0.5/zinstance/lib/python/plone/app/layout/viewlets/personal_bar.pt
Plone-3.0.5/zinstance/Products/LinguaPlone/browser/languageselector.pt
Plone-3.0.5/zinstance/Products/PloneTranslations/i18n/plone-pt.po
- Replace the files generated by the Plone installation with the PWA files
cp ploneConf/selector.py Plone-3.0.5/zinstance/lib/python/plone/app/i18n/locales/browser/
cp ploneConf/languageselector.pt Plone-3.0.5/zinstance/lib/python/plone/app/i18n/locales/browser/
cp ploneConf/personal_bar.pt Plone-3.0.5/zinstance/lib/python/plone/app/layout/viewlets/
cp ploneConf/LinguaPlone/languageselector.pt Plone-3.0.5/zinstance/Products/LinguaPlone/browser/
cp ploneConf/PloneTranslations/* Plone-3.0.5/zinstance/Products/PloneTranslations/i18n/
cp Data.fs Plone-3.0.5/zinstance/var/
- Start Plone
Plone-3.0.5/zinstance/bin/zopectl start
- Test configuration
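The cp commands in the list above fail part-way if the ploneConf download is incomplete, which leaves a mix of stock and PWA files. A small pre-check can confirm the sources are present before anything is copied. This is a sketch only: missing_plone_conf is a hypothetical helper, and the file list is taken from the cp commands above.

```shell
#!/bin/sh
# Sketch: list the ploneConf entries used by the cp commands above that are
# not present. missing_plone_conf is a hypothetical helper.
# Prints each missing entry; returns 1 if any is missing.
missing_plone_conf() {
    conf_dir="$1"
    status=0
    for rel in selector.py languageselector.pt personal_bar.pt \
               LinguaPlone/languageselector.pt PloneTranslations
    do
        if [ ! -e "${conf_dir}/${rel}" ]; then
            echo "$rel"
            status=1
        fi
    done
    return "$status"
}

# Example:
#   missing_plone_conf ploneConf && echo "ploneConf is complete"
```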