urlwatch

This is a simple URL watcher, designed to send you diffs of webpages as they change. Ideal for watching web pages of university courses, so you always know when lecture dates have changed or new tasks are online 🙂

urlwatch is a tool that makes it easy to check whether a website has been updated. The need to check whether a site has changed comes up from time to time, and urlwatch is useful in those cases.

On Debian or Ubuntu, urlwatch can be installed with the following command.

$ sudo apt-get install urlwatch

List the URLs you want to watch in ~/.urlwatch/urls.txt.

# This is an example urls.txt file for urlwatch
# Empty lines and lines starting with "#" are ignored

http://www.dubclub-vienna.com/
http://www.openpandora.org/developers.php
#http://www.statistik.tuwien.ac.at/lv-guide/u107.369/info.html
#http://www.statistik.tuwien.ac.at/lv-guide/u107.369/blatter.html
#http://www.dbai.tuwien.ac.at/education/dbs/current/index.html
#http://www.dbai.tuwien.ac.at/education/dbs/current/uebung.html
http://ti.tuwien.ac.at/rts/teaching/courses/systems_programming
http://ti.tuwien.ac.at/rts/teaching/courses/systems_programming/labor
http://ti.tuwien.ac.at/rts/teaching/courses/betriebssysteme
#http://www.complang.tuwien.ac.at/anton/lvas/effiziente-programme.html
#http://www.complang.tuwien.ac.at/anton/lvas/effizienz-aufgabe08/
http://www.kukuk.at/ical/events
http://guckes.net/cal/

# You can use the pipe character to "watch" the output of shell commands
|ls -al ~

# If you want to use spaces in URLs, you have to URL-encode them (e.g. %20)
http://example.org/With%20Spaces/

# You can do POST requests by writing the POST data behind the URL,
# separated by a single space character. POST data is URL-encoded.
http://example.com/search.cgi button=Search&q=something&category=4
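
The rules stated in the comments of this file (skip blank and "#" lines, treat a leading "|" as a shell command, treat text after the first space as POST data) can be illustrated with a small parser. This is only a sketch of the file format as described above, not urlwatch's own parsing code; the names Job and parse_urls_txt are hypothetical.

```python
from typing import NamedTuple, Optional


class Job(NamedTuple):
    """One entry from a urls.txt-style file (hypothetical helper type)."""
    kind: str                        # "url" or "command"
    target: str                      # the URL, or the shell command to run
    post_data: Optional[str] = None  # URL-encoded POST data, if any


def parse_urls_txt(text):
    """Parse urls.txt-style text into a list of Job entries."""
    jobs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comments are ignored
        if line.startswith("|"):
            # The pipe character means "watch the output of this command"
            jobs.append(Job("command", line[1:].strip()))
            continue
        # A single space separates the URL from optional POST data
        url, _, post = line.partition(" ")
        jobs.append(Job("url", url, post or None))
    return jobs
```

Because spaces in URLs must be written as %20, splitting at the first space is enough to tell a plain URL from a URL followed by POST data.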

urlwatch also ships an example hooks.py. With it you can filter out the parts of a page you do not care about, so that changes in those parts are not tracked [3].

#
# Example hooks file for urlwatch
#
# Copyright (c) 2008-2011 Thomas Perl <thp.io/about>
# All rights reserved.
# 
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
#    derived from this software without specific prior written permission.
# 
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
# IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#


# You can decide which filter you want to apply using the "url"
# parameter and you can use the "re" module to search for the
# content that you want to filter, so the noise is removed.


# Needed for regular expression substitutions
import re

# Additional modules installed with urlwatch
from urlwatch import ical2txt
from urlwatch import html2txt


def filter(url, data):
    if url == 'http://www.inso.tuwien.ac.at/lectures/usability/':
        return re.sub('.*TYPO3SEARCH_end.*', '', data)
    elif url == 'https://www.auto.tuwien.ac.at/courses/viewDetails/11/':
        return re.sub(r'</html><!-- \d+ -->', '', data)
    elif url == 'http://grenzlandvagab.gr.funpic.de/events/':
        return re.sub('<!-- Ad by .*by funpic.de -->', '', data)
    elif url == 'http://www.mv-eberau.at/terminliste.php':
        return data.replace('</br>', '\n')
    elif 'iuner.lukas-krispel.at' in url:
        # Remove always-changing entries from FTP server listing
        return re.sub('drwx.*usage', '', re.sub('drwx.*logs', '', data))
    elif url.startswith('http://ti.tuwien.ac.at/rts/teaching/courses/'):
        # example of using the "tidy" module for cleaning up bad HTML
        import tidy
        mlr = re.compile('magicCalendarHeader.*magicCalendarBottom', re.S)
        data = str(tidy.parseString(data, output_xhtml=1, indent=0, tidy_mark=0))
        return re.sub(mlr, '', data)
    elif url == 'http://www.poleros.at/calender.htm':
        # remove style changes, because we only want to see content changes
        return re.sub(r'style="[^"]*"', '', data)
    elif url == 'http://www.ads.tuwien.ac.at/teaching/LVA/186170.html':
        data = re.sub(r'This page has been accessed .* times\.', '', data)
        data = re.sub(r'Served by aragon in .* secs\.', '', data)
        return re.sub(r'Saved in parser cache with key .* and timestamp .* --', '', data)
    elif url.endswith('.ics') or url == 'http://www.kukuk.at/ical/events':
        # example of generating a summary for icalendar files
        # append "data" to the converted ical data, so you get
        # all minor changes to the ICS that are not included
        # in the ical2text summary (remove this if you want)
        return ical2txt.ical2text(data).encode('utf-8') + '\n\n' + data
    elif url == 'http://www.oho.at/programm/programm.php3':
        # example of converting HTML to plaintext for very
        # ugly HTML code that cannot be deciphered when just
        # diffing the HTML source (or if the user is just not
        # used to HTML, use this for every web page)
        #   
        # You need to install "lynx" for this to work or use
        # "html2text" as method (needs "html2text") or use
        # "re" (does not need anything, but only strips tags
        # using a regular expression and does no formatting)
        return html2txt.html2text(data, method='lynx')

    # The next line is optional - if the filter function returns
    # None (or no value at all), the input data will be taken as
    # the result -> None as return value means "don't filter".
    return data
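
The closing comment in the hooks file says that a return value of None means "don't filter". A caller could honor that convention roughly as follows. This is a minimal sketch and not urlwatch's actual code: apply_filter, example_filter, and the example.org URL are all hypothetical.

```python
import re


def example_filter(url, data):
    """Hypothetical filter: strip inline style attributes for one site."""
    if url == "http://example.org/calendar":
        return re.sub(r'\s*style="[^"]*"', "", data)
    # Falling off the end returns None, which means "don't filter"


def apply_filter(filter_func, url, data):
    """Return the filtered data, or the input unchanged if the filter returns None."""
    result = filter_func(url, data)
    return data if result is None else result
```

Calling apply_filter(example_filter, url, data) with sample HTML in a Python shell is a quick way to try out a regular expression before putting it into hooks.py.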

To run a check, enter the following command.

$ urlwatch

If anything has changed, urlwatch prints the changes as a diff, as shown below.

$ urlwatch
***************************************************************************
CHANGED: http://www.naver.com/
***************************************************************************
--- @   Thu, 19 Mar 2015 14:28:25 +0900
+++ @   Thu, 19 Mar 2015 14:28:27 +0900
@@ -39,7 +39,7 @@
 var strHost = "www.naver.com";
 var isMobile = false;
 var isMyCast = false;
-var svr = "<!--tnweb105.ntop-->";
+var svr = "<!--cweb102.ntop-->";
...
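
The only change in this output is the name of the server that answered the request, which is exactly the kind of noise a filter is meant to remove. The sketch below assumes the marker always looks like the two values in the diff; strip_server_marker and diff_lines are hypothetical names, and difflib stands in for whatever urlwatch uses internally. In a real hooks.py the substitution would be one more branch of filter(url, data).

```python
import difflib
import re

# Assumed shape of the changing marker, based on the two values in the diff
SERVER_MARKER = re.compile(r'<!--\w+\.ntop-->')


def strip_server_marker(url, data):
    """Remove the per-request server marker from the watched page."""
    if url == "http://www.naver.com/":
        return SERVER_MARKER.sub("", data)
    return data


def diff_lines(old_text, new_text):
    """Return the unified diff between two texts as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""))


old = 'var isMyCast = false;\nvar svr = "<!--tnweb105.ntop-->";\n'
new = 'var isMyCast = false;\nvar svr = "<!--cweb102.ntop-->";\n'
url = "http://www.naver.com/"

unfiltered = diff_lines(old, new)
filtered = diff_lines(strip_server_marker(url, old),
                      strip_server_marker(url, new))
```

Here unfiltered contains the changed svr line, while filtered is empty, so no change would be reported for this page.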

References:
[1] urlwatch – a tool for monitoring webpages for updates, https://thp.io/2008/urlwatch/
[2] urlwatch – a tool for monitoring webpages for updates, github, https://github.com/thp/urlwatch
[3] Ben Martin, Automatically watching Web sites for changes, http://archive09.linux.com/feature/132197

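3 comments on “urlwatch”
    • gumdaeng says:

      This is much more intuitive and convenient 🙂
      I will try it next time.

    • partrita says:

      It is a convenient service, but it only checks once a day before sending an email. In my case, to check for restocks at an online shop, I run urlwatch from crontab every 30 minutes and have it email me the results :)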
