Document (#41240)

Author
Klic, L.
Miller, M.
Nelson, J.K.
Germann, J.E.
Title
Approaching the largest 'API' : extracting information from the Internet with Python
Source
Code4Lib journal. Issue 39(2018), [http://journal.code4lib.org]
Year
2018
Abstract
This article explores the need for libraries to algorithmically access and manipulate the world's largest API: the Internet. The billions of pages on the 'Internet API' (HTTP, HTML, CSS, XPath, DOM, etc.) are easily accessible and manipulable. Libraries can assist in creating meaning through the datafication of information on the World Wide Web. Because most information is created for human consumption, some programming is required for automated extraction. Python is an easy-to-learn programming language with extensive packages and community support for web page automation. Four Python packages (urllib, Selenium, BeautifulSoup, Scrapy) can automate almost any web page for projects of all sizes. An example warrant data project is explained to illustrate how well Python packages can manipulate web pages to create meaning by assembling custom datasets.
Content
Cf.: http://journal.code4lib.org/articles/13197.
Theme
Internet
Object
Python

Similar documents (author)

  1. Klic, L.; Miller, M.; Nelson, J.K.; Pattuelli, C.; Provo, A.: ¬The drawings of the Florentine painters : from print catalog to linked open data (2017) 3.35
    3.3513775 = sum of:
      3.3513775 = sum of:
        1.3798823 = weight(author_txt:miller in 4105) [ClassicSimilarity], result of:
          1.3798823 = score(doc=4105,freq=1.0), product of:
            0.6190816 = queryWeight, product of:
              7.132539 = idf(docFreq=95, maxDocs=44218)
              0.08679681 = queryNorm
            2.2289183 = fieldWeight in 4105, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              7.132539 = idf(docFreq=95, maxDocs=44218)
              0.3125 = fieldNorm(doc=4105)
        1.9714952 = weight(author_txt:nelson in 4105) [ClassicSimilarity], result of:
          1.9714952 = score(doc=4105,freq=1.0), product of:
            0.78532666 = queryWeight, product of:
              1.1262926 = boost
              8.033325 = idf(docFreq=38, maxDocs=44218)
              0.08679681 = queryNorm
            2.5104141 = fieldWeight in 4105, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              8.033325 = idf(docFreq=38, maxDocs=44218)
              0.3125 = fieldNorm(doc=4105)
    
  2. Nelson, M.J.: Correlation of term usage and term indexing frequencies (1988) 1.97
    1.9714952 = sum of:
      1.9714952 = product of:
        3.9429903 = sum of:
          3.9429903 = weight(author_txt:nelson in 651) [ClassicSimilarity], result of:
            3.9429903 = score(doc=651,freq=1.0), product of:
              0.78532666 = queryWeight, product of:
                1.1262926 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.08679681 = queryNorm
              5.0208282 = fieldWeight in 651, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.625 = fieldNorm(doc=651)
        0.5 = coord(1/2)
    
  3. Nelson, M.G.: Catalogers as librarians (1986) 1.97
    1.9714952 = sum of:
      1.9714952 = product of:
        3.9429903 = sum of:
          3.9429903 = weight(author_txt:nelson in 2880) [ClassicSimilarity], result of:
            3.9429903 = score(doc=2880,freq=1.0), product of:
              0.78532666 = queryWeight, product of:
                1.1262926 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.08679681 = queryNorm
              5.0208282 = fieldWeight in 2880, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.625 = fieldNorm(doc=2880)
        0.5 = coord(1/2)
    
  4. Nelson, T.H.: ¬A file structure for the complex, the changing, and the indeterminate (1965) 1.97
    1.9714952 = sum of:
      1.9714952 = product of:
        3.9429903 = sum of:
          3.9429903 = weight(author_txt:nelson in 4468) [ClassicSimilarity], result of:
            3.9429903 = score(doc=4468,freq=1.0), product of:
              0.78532666 = queryWeight, product of:
                1.1262926 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.08679681 = queryNorm
              5.0208282 = fieldWeight in 4468, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.625 = fieldNorm(doc=4468)
        0.5 = coord(1/2)
    
  5. Nelson, M.J.: ¬The design of a hypertext interface for information retrieval (1991) 1.97
    1.9714952 = sum of:
      1.9714952 = product of:
        3.9429903 = sum of:
          3.9429903 = weight(author_txt:nelson in 4805) [ClassicSimilarity], result of:
            3.9429903 = score(doc=4805,freq=1.0), product of:
              0.78532666 = queryWeight, product of:
                1.1262926 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.08679681 = queryNorm
              5.0208282 = fieldWeight in 4805, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.625 = fieldNorm(doc=4805)
        0.5 = coord(1/2)
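
The score trees above follow the pattern score = queryWeight x fieldWeight x coord. As a minimal sketch, the arithmetic for entries 2 to 5 can be reproduced in Python. The function names are hypothetical, and the tf and idf formulas are inferred from the printed values (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))), so treat them as assumptions rather than the catalog's actual implementation.

```python
import math


def classic_tf(freq):
    # tf as shown in the trees: sqrt of the raw term frequency
    return math.sqrt(freq)


def classic_idf(doc_freq, max_docs):
    # Assumed idf form; it reproduces 8.033325 for docFreq=38, maxDocs=44218
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, field_norm, query_norm, boost=1.0):
    # queryWeight = boost * idf * queryNorm
    # fieldWeight = tf * idf * fieldNorm
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm
    field_weight = classic_tf(freq) * idf * field_norm
    return query_weight * field_weight


# Values copied from the tree for "author_txt:nelson in 651"
weight = term_score(
    freq=1.0,
    doc_freq=38,
    max_docs=44218,
    field_norm=0.625,
    query_norm=0.08679681,
    boost=1.1262926,
)
coord = 1 / 2  # one of two query terms matched
score = weight * coord
print(round(weight, 4), round(score, 4))
```

Under these assumptions the term weight comes out at about 3.943 and the final score at about 1.9715, in line with the listed 3.9429903 and 1.9714952. Small differences in the last digits are expected because the catalog appears to use single-precision floats.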
    

Similar documents (content)

  1. Eiter, T.; Kaminski, T.; Redl, C.; Schüller, P.; Weinzierl, A.: Answer set programming with external source access (2017) 0.08
    0.07764307 = sum of:
      0.07764307 = product of:
        0.4852692 = sum of:
          0.009019479 = weight(abstract_txt:information in 3938) [ClassicSimilarity], result of:
            0.009019479 = score(doc=3938,freq=3.0), product of:
              0.034415625 = queryWeight, product of:
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.014215774 = queryNorm
              0.26207513 = fieldWeight in 3938, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0625 = fieldNorm(doc=3938)
          0.015790354 = weight(abstract_txt:through in 3938) [ClassicSimilarity], result of:
            0.015790354 = score(doc=3938,freq=1.0), product of:
              0.06298531 = queryWeight, product of:
                1.1045774 = boost
                4.011184 = idf(docFreq=2176, maxDocs=44218)
                0.014215774 = queryNorm
              0.250699 = fieldWeight in 3938, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.011184 = idf(docFreq=2176, maxDocs=44218)
                0.0625 = fieldNorm(doc=3938)
          0.07793144 = weight(abstract_txt:programming in 3938) [ClassicSimilarity], result of:
            0.07793144 = score(doc=3938,freq=1.0), product of:
              0.18257998 = queryWeight, product of:
                1.880629 = boost
                6.829353 = idf(docFreq=129, maxDocs=44218)
                0.014215774 = queryNorm
              0.42683455 = fieldWeight in 3938, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.829353 = idf(docFreq=129, maxDocs=44218)
                0.0625 = fieldNorm(doc=3938)
          0.38252792 = weight(abstract_txt:python in 3938) [ClassicSimilarity], result of:
            0.38252792 = score(doc=3938,freq=1.0), product of:
              0.6644007 = queryWeight, product of:
                5.0734873 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.014215774 = queryNorm
              0.5757488 = fieldWeight in 3938, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.0625 = fieldNorm(doc=3938)
        0.16 = coord(4/25)
    
  2. Falk, H.: Internet browsing tools (1995) 0.07
    0.06962042 = sum of:
      0.06962042 = product of:
        0.43512765 = sum of:
          0.013018497 = weight(abstract_txt:information in 2431) [ClassicSimilarity], result of:
            0.013018497 = score(doc=2431,freq=1.0), product of:
              0.034415625 = queryWeight, product of:
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.014215774 = queryNorm
              0.37827286 = fieldWeight in 2431, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.15625 = fieldNorm(doc=2431)
          0.039475888 = weight(abstract_txt:through in 2431) [ClassicSimilarity], result of:
            0.039475888 = score(doc=2431,freq=1.0), product of:
              0.06298531 = queryWeight, product of:
                1.1045774 = boost
                4.011184 = idf(docFreq=2176, maxDocs=44218)
                0.014215774 = queryNorm
              0.62674755 = fieldWeight in 2431, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.011184 = idf(docFreq=2176, maxDocs=44218)
                0.15625 = fieldNorm(doc=2431)
          0.06718351 = weight(abstract_txt:internet in 2431) [ClassicSimilarity], result of:
            0.06718351 = score(doc=2431,freq=2.0), product of:
              0.081573084 = queryWeight, product of:
                1.5395564 = boost
                3.7271836 = idf(docFreq=2891, maxDocs=44218)
                0.014215774 = queryNorm
              0.823599 = fieldWeight in 2431, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.7271836 = idf(docFreq=2891, maxDocs=44218)
                0.15625 = fieldNorm(doc=2431)
          0.31544974 = weight(abstract_txt:packages in 2431) [ClassicSimilarity], result of:
            0.31544974 = score(doc=2431,freq=1.0), product of:
              0.28818312 = queryWeight, product of:
                2.8937194 = boost
                7.0055394 = idf(docFreq=108, maxDocs=44218)
                0.014215774 = queryNorm
              1.0946156 = fieldWeight in 2431, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0055394 = idf(docFreq=108, maxDocs=44218)
                0.15625 = fieldNorm(doc=2431)
        0.16 = coord(4/25)
    
  3. Priss, U.: Alternatives to the "Semantic Web" : multi-strategy knowledge representation (2003) 0.07
    0.06617492 = sum of:
      0.06617492 = product of:
        0.41359323 = sum of:
          0.011716647 = weight(abstract_txt:information in 2733) [ClassicSimilarity], result of:
            0.011716647 = score(doc=2733,freq=9.0), product of:
              0.034415625 = queryWeight, product of:
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.014215774 = queryNorm
              0.34044558 = fieldWeight in 2733, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.046875 = fieldNorm(doc=2733)
          0.032321867 = weight(abstract_txt:pages in 2733) [ClassicSimilarity], result of:
            0.032321867 = score(doc=2733,freq=1.0), product of:
              0.1230084 = queryWeight, product of:
                1.5436325 = boost
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.014215774 = queryNorm
              0.26276144 = fieldWeight in 2733, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.046875 = fieldNorm(doc=2733)
          0.082658775 = weight(abstract_txt:programming in 2733) [ClassicSimilarity], result of:
            0.082658775 = score(doc=2733,freq=2.0), product of:
              0.18257998 = queryWeight, product of:
                1.880629 = boost
                6.829353 = idf(docFreq=129, maxDocs=44218)
                0.014215774 = queryNorm
              0.4527264 = fieldWeight in 2733, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.829353 = idf(docFreq=129, maxDocs=44218)
                0.046875 = fieldNorm(doc=2733)
          0.28689593 = weight(abstract_txt:python in 2733) [ClassicSimilarity], result of:
            0.28689593 = score(doc=2733,freq=1.0), product of:
              0.6644007 = queryWeight, product of:
                5.0734873 = boost
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.014215774 = queryNorm
              0.4318116 = fieldWeight in 2733, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.211981 = idf(docFreq=11, maxDocs=44218)
                0.046875 = fieldNorm(doc=2733)
        0.16 = coord(4/25)
    
  4. Falk, H.: Library databases on the Web (1996) 0.06
    0.060072068 = sum of:
      0.060072068 = product of:
        0.37545043 = sum of:
          0.010414798 = weight(abstract_txt:information in 6905) [ClassicSimilarity], result of:
            0.010414798 = score(doc=6905,freq=1.0), product of:
              0.034415625 = queryWeight, product of:
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.014215774 = queryNorm
              0.3026183 = fieldWeight in 6905, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.125 = fieldNorm(doc=6905)
          0.026484234 = weight(abstract_txt:libraries in 6905) [ClassicSimilarity], result of:
            0.026484234 = score(doc=6905,freq=1.0), product of:
              0.056012243 = queryWeight, product of:
                1.0416409 = boost
                3.782635 = idf(docFreq=2735, maxDocs=44218)
                0.014215774 = queryNorm
              0.47282937 = fieldWeight in 6905, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.782635 = idf(docFreq=2735, maxDocs=44218)
                0.125 = fieldNorm(doc=6905)
          0.08619164 = weight(abstract_txt:pages in 6905) [ClassicSimilarity], result of:
            0.08619164 = score(doc=6905,freq=1.0), product of:
              0.1230084 = queryWeight, product of:
                1.5436325 = boost
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.014215774 = queryNorm
              0.7006972 = fieldWeight in 6905, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6055775 = idf(docFreq=441, maxDocs=44218)
                0.125 = fieldNorm(doc=6905)
          0.25235978 = weight(abstract_txt:packages in 6905) [ClassicSimilarity], result of:
            0.25235978 = score(doc=6905,freq=1.0), product of:
              0.28818312 = queryWeight, product of:
                2.8937194 = boost
                7.0055394 = idf(docFreq=108, maxDocs=44218)
                0.014215774 = queryNorm
              0.8756924 = fieldWeight in 6905, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0055394 = idf(docFreq=108, maxDocs=44218)
                0.125 = fieldNorm(doc=6905)
        0.16 = coord(4/25)
    
  5. Hilts, P.: Mosaic provides stained-glass windows into the world of the Internet (1994) 0.06
    0.059843447 = sum of:
      0.059843447 = product of:
        0.37402156 = sum of:
          0.02763312 = weight(abstract_txt:through in 779) [ClassicSimilarity], result of:
            0.02763312 = score(doc=779,freq=1.0), product of:
              0.06298531 = queryWeight, product of:
                1.1045774 = boost
                4.011184 = idf(docFreq=2176, maxDocs=44218)
                0.014215774 = queryNorm
              0.43872327 = fieldWeight in 779, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.011184 = idf(docFreq=2176, maxDocs=44218)
                0.109375 = fieldNorm(doc=779)
          0.03325414 = weight(abstract_txt:internet in 779) [ClassicSimilarity], result of:
            0.03325414 = score(doc=779,freq=1.0), product of:
              0.081573084 = queryWeight, product of:
                1.5395564 = boost
                3.7271836 = idf(docFreq=2891, maxDocs=44218)
                0.014215774 = queryNorm
              0.4076607 = fieldWeight in 779, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7271836 = idf(docFreq=2891, maxDocs=44218)
                0.109375 = fieldNorm(doc=779)
          0.09231949 = weight(abstract_txt:page in 779) [ClassicSimilarity], result of:
            0.09231949 = score(doc=779,freq=1.0), product of:
              0.14076075 = queryWeight, product of:
                1.651267 = boost
                5.9964437 = idf(docFreq=298, maxDocs=44218)
                0.014215774 = queryNorm
              0.655861 = fieldWeight in 779, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9964437 = idf(docFreq=298, maxDocs=44218)
                0.109375 = fieldNorm(doc=779)
          0.22081481 = weight(abstract_txt:packages in 779) [ClassicSimilarity], result of:
            0.22081481 = score(doc=779,freq=1.0), product of:
              0.28818312 = queryWeight, product of:
                2.8937194 = boost
                7.0055394 = idf(docFreq=108, maxDocs=44218)
                0.014215774 = queryNorm
              0.7662309 = fieldWeight in 779, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0055394 = idf(docFreq=108, maxDocs=44218)
                0.109375 = fieldNorm(doc=779)
        0.16 = coord(4/25)