Issei's Blog

RxJS: How to fetch listing pages recursively with expand

RxJS is a great library for writing asynchronous programs such as web scrapers, and it has many operators that help keep code readable. One of them is the expand operator, which lets us express recursion or a while loop in RxJS.

In this post, I will cover:

  • what is expand()?
  • fetching listing pages with expand()

What is expand()?

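expand takes a function that receives each emitted item and returns an Observable.
Every value emitted by that returned Observable is passed on to the subscriber and also fed back into the same function, and this repeats until the function returns an empty Observable.
As an example, let's compute a Collatz sequence without RxJS and with RxJS, and compare the two. According to Wikipedia, the process is: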

Consider the following operation on an arbitrary positive integer:

  • If the number is even, divide it by two.
  • If the number is odd, triple it and add one.

Without RxJS,

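// Prints each number in the Collatz sequence starting at n
function collatz(n: number): number {
  console.log(n)
  if (n === 1) return 1
  if (n % 2 === 0) return collatz(n / 2)
  // n is odd here, so every code path returns a value
  return collatz(3 * n + 1)
}
collatz(200)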


With RxJS,

import { EMPTY, of } from 'rxjs'
import { expand } from 'rxjs/operators'
of(200)
  .pipe(
    expand(n => {
      // Returning EMPTY ends the recursion once the sequence reaches 1
      if (n === 1) return EMPTY
      if (n % 2 === 0) {
        return of(n / 2)
      } else {
        return of(3 * n + 1)
      }
    })
  )
  .subscribe(v => console.log(v))
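
To make the feedback loop explicit, here is a small dependency-free sketch of the idea behind expand. It is only a synchronous, array-based model and not how RxJS implements the operator; the names expandModel and collatzStep are made up for this illustration.

```typescript
// A synchronous, array-based model of expand (not the RxJS implementation).
// `project` returns the follow-up values for one value; an empty array plays
// the role of EMPTY and ends that branch of the recursion.
function expandModel<T>(seed: T, project: (value: T) => T[]): T[] {
  const emitted: T[] = []
  const queue: T[] = [seed]
  while (queue.length > 0) {
    const value = queue.shift() as T
    emitted.push(value) // every value is emitted...
    queue.push(...project(value)) // ...and also fed back into `project`
  }
  return emitted
}

// One Collatz step, mirroring the callback passed to expand above.
const collatzStep = (n: number): number[] =>
  n === 1 ? [] : n % 2 === 0 ? [n / 2] : [3 * n + 1]

const collatzFrom6 = expandModel(6, collatzStep)
console.log(collatzFrom6) // [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

The real operator does the same thing with Observables, so the follow-up values may arrive asynchronously.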


Now you should have a feel for what this operator does. But how can it be used in the real world? The next section shows one application.

Fetching listing pages

One application of this operator is fetching all items from a paginated listing.
Most websites use pagination to show a limited number of items per listing page, and each page links to the next one. By following those links recursively, we can collect every item, and this is where the expand operator is very useful. To fetch the HTML and manipulate it in a jQuery-like way, I will use rxjs, axios, and cheerio.
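
Before touching the network, the idea can be tried against an in-memory stand-in. The sketch below is only an illustration: the Page shape, the fakeSite data, and fetchFakePage are all made up for this example and replace real HTTP requests.

```typescript
import { EMPTY, Observable, of } from 'rxjs'
import { expand, toArray } from 'rxjs/operators'

// Hypothetical page shape: the items on the page plus an optional next link.
interface Page {
  items: string[]
  next?: string
}

// Hypothetical in-memory "site" that stands in for real HTTP responses.
const fakeSite: Record<string, Page> = {
  '/list?page=1': { items: ['a', 'b'], next: '/list?page=2' },
  '/list?page=2': { items: ['c', 'd'], next: '/list?page=3' },
  '/list?page=3': { items: ['e'] }
}

// Stand-in for a network call: emits the stored page, or nothing if unknown.
const fetchFakePage = (url: string): Observable<Page> =>
  url in fakeSite ? of(fakeSite[url]) : EMPTY

let allItems: string[] = []
fetchFakePage('/list?page=1')
  .pipe(
    // Follow `next` until a page has none; EMPTY stops the recursion.
    expand(page => (page.next ? fetchFakePage(page.next) : EMPTY)),
    toArray()
  )
  .subscribe(pages => {
    allItems = pages.reduce<string[]>((acc, page) => acc.concat(page.items), [])
  })

console.log(allItems) // ['a', 'b', 'c', 'd', 'e']
```

Because of() emits synchronously, allItems is already filled when console.log runs; with real requests the result would only be available inside subscribe.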

Code

First, let's write a function that creates an Observable which fetches an HTML page and parses it into a Cheerio object.

import { EMPTY, from } from 'rxjs'
import { expand, map } from 'rxjs/operators'
import * as urlPath from 'url'
import axios from 'axios'
import * as cheerio from 'cheerio'

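// Fetches `url` and emits the parsed document as a Cheerio object.
// Note: `encoding` is accepted but not used anywhere in this function yet.
function RxFetch(url: string, encoding: 'utf8' | 'sjis' = 'utf8') {
  if (!url) return EMPTY
  return from(
    axios.get(url, {
      headers: {
        // Send a browser-like user agent with the request
        'user-agent':
          'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'
      }
    })
  ).pipe(map(response => cheerio.load(response.data)))
}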

With RxFetch and expand, we can fetch pages recursively.

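const getAllPagesRx = (
  url: string,
  nextSelector: string,
  // Optional extractor for sites where the next URL is not in the href attribute
  f?: (selector: any) => string | undefined
) =>
  RxFetch(url).pipe(
    expand($ => {
      const $next = $(nextSelector).first()
      const relativeUrl = f ? f($next) : $next.attr('href')
      // Stop when there is no next link or it is a javascript: pseudo-link
      if (!relativeUrl || /javascript/i.test(relativeUrl)) {
        return EMPTY
      }
      // Note: links are resolved against the initial url, not the current page
      const nextUrl = urlPath.resolve(url, relativeUrl)

      return RxFetch(nextUrl)
    })
  )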
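
The link-resolution step can also be isolated into a small pure function, which makes it easy to check on its own. The sketch below is one possible variant, not part of the code above: resolveNextUrl is a hypothetical helper, and it uses the WHATWG URL constructor, which the Node.js documentation recommends over the legacy url.resolve. Note that getAllPagesRx always resolves against the initial url, so a helper like this gives the most reliable results when it is passed the URL of the page the link was actually found on.

```typescript
// Hypothetical helper: turn an href found on `pageUrl` into an absolute URL,
// or return null when there is nothing to follow.
function resolveNextUrl(pageUrl: string, href: string | undefined): string | null {
  // Missing links and javascript: pseudo-links cannot be fetched
  if (!href || /^\s*javascript:/i.test(href)) return null
  try {
    // The WHATWG URL constructor resolves `href` relative to `pageUrl`
    return new URL(href, pageUrl).toString()
  } catch {
    return null // `pageUrl` was not a valid absolute URL
  }
}

console.log(resolveNextUrl('https://example.com/list?page=1', '?page=2'))
// https://example.com/list?page=2
```

Returning null instead of throwing keeps the caller simple: a null result maps directly to returning EMPTY inside expand.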

As an example, let's fetch every page of the search results for my name.

getAllPagesRx(
  'https://www.google.com/search?q=Issei+morita+programmer&oq=Issei+morita+programmer',
  '#pnnext'
).subscribe(v => console.log(v))
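
Since expand keeps going for as long as a next link exists, it can be worth capping the number of pages while experimenting. The sketch below shows this with the take operator; the numbers are only stand-ins for fetched pages, and the variable visited exists just for this example.

```typescript
import { of } from 'rxjs'
import { expand, take, toArray } from 'rxjs/operators'

// Numbers stand in for fetched pages; this recursion would never end by itself.
let visited: number[] = []
of(1)
  .pipe(
    expand(page => of(page + 1)),
    take(3), // stop after three emissions
    toArray()
  )
  .subscribe(pages => {
    visited = pages
  })

console.log(visited) // [1, 2, 3]
```

The same cap should apply to the real crawler by adding .pipe(take(3)) to the Observable returned from getAllPagesRx before subscribing.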

Conclusion

Today, I covered one application of expand.
It is useful whenever you need a recursive Observable.
I hope this article helps.