Set up 2025 files and started parsing the archive site, but was rate limited. Will need to finish it in the future.
237
memory/scraping-guide.md
Normal file
@@ -0,0 +1,237 @@
# IRW Scraping Guide — Full Process for Adding a New Year

## Overview
The Inlander Restaurant Week website (inlanderrestaurantweek.com) runs on WordPress with the Divi theme.
Menu pages are partially JS-rendered, but WP-Super-Cache creates static HTML snapshots
that the Wayback Machine archives. We scrape those static snapshots.

---

## Step 1: Find Restaurant Slugs

Fetch the price listing page to get all slugs for that year:
```bash
curl -s "https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/price/" \
  -o /tmp/irw-price-YEAR.html
```

Pick a timestamp close to the event (Wayback Machine format: YYYYMMDDHHmmss).
The price listing page has portfolio items like:
```html
<article class="et_pb_portfolio_item ... project_category_45">
  <a href="https://inlanderrestaurantweek.com/project/SLUG/">
```

Extract the slug from the href. The class `project_category_(25|35|45)` gives the authoritative price.

**Important**: Scrape the price listing page FIRST and save the slug→price map.
Some restaurant pages list other prices (for example $22 for wine, $33 for lunch) that confuse the price parser.

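
The slug→price extraction described above could be sketched as follows. This is a minimal sketch, not the script actually used: `Get-SlugPriceMap` and the sample slugs are hypothetical, and it assumes each `<article>` carries exactly one price class and is followed by its own `/project/SLUG/` link.

```powershell
# Hypothetical helper: build a slug -> price hashtable from the saved price page.
function Get-SlugPriceMap([string]$html) {
    $map = @{}
    # One match per portfolio item: the price class first, then the project href.
    # Matching on "/project/SLUG/" also works when Wayback prefixes the href.
    $pattern = '(?s)<article[^>]*\bproject_category_(25|35|45)\b[^>]*>.*?/project/([^/"]+)/'
    foreach ($m in [regex]::Matches($html, $pattern)) {
        $map[$m.Groups[2].Value] = [int]$m.Groups[1].Value
    }
    $map
}

# Made-up sample input with placeholder slugs.
$sample = @'
<article class="et_pb_portfolio_item project_category_45">
  <a href="https://inlanderrestaurantweek.com/project/example-bistro/">
</article>
<article class="et_pb_portfolio_item project_category_25">
  <a href="https://inlanderrestaurantweek.com/project/sample-cafe/">
</article>
'@
$prices = Get-SlugPriceMap $sample
$prices['example-bistro']   # 45
```

In practice the input would be the file saved by the `curl` command, read with `Get-Content -Raw`.
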

---

## Step 2: Scrape Each Restaurant Page

Use a PowerShell script (written to the project dir, then copied to local temp to run; see the execution pattern section below).

**Wayback Machine URL format**:
```
https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/project/SLUG/
```

**Key fields to extract**:
```powershell
# Name
$nameM = [regex]::Match($html, '<title>(.+?) \| Inlander')

# Price (from page, but USE PRICE LISTING MAP - this can be wrong)
$priceM = [regex]::Match($html, '<strong>\$(\d+)</strong>')

# Cuisine (normalized to title case)
$cuisineM = [regex]::Match($html, 'CUISINE:\s*([A-Z][A-Za-z/ ]+?)\s*<')
$cuisine = (Get-Culture).TextInfo.ToTitleCase($cuisineM.Groups[1].Value.ToLower())

# Phone
$phoneM = [regex]::Match($html, '\((?:208|509)\) \d{3}-\d{4}')

# Hours
$hoursM = [regex]::Match($html, 'Menu served [^<]+')

# Area (match against the keys of $areaMap, case-insensitive). Keys:
#   "AIRWAY HEIGHTS","ATHOL","COEUR D'ALENE","POST FALLS","HAYDEN",
#   "LIBERTY LAKE","NORTH SPOKANE","SOUTH SPOKANE","SPOKANE VALLEY",
#   "WEST SPOKANE","WORLEY","DOWNTOWN"
```
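
The area lookup is only described in a comment above, so here is one way it might look. This is a sketch under assumptions: `Get-Area` is a hypothetical name, the map holds only a subset of the keys, and the display values on the right-hand side are invented for illustration.

```powershell
# Hypothetical area map: keys come from the guide, display values are assumed.
$areaMap = [ordered]@{
    "COEUR D'ALENE"  = "Coeur d'Alene"
    "NORTH SPOKANE"  = "North Spokane"
    "SPOKANE VALLEY" = "Spokane Valley"
    "DOWNTOWN"       = "Downtown"
}

function Get-Area([string]$html, $map) {
    foreach ($key in $map.Keys) {
        # -match is case-insensitive by default; escape the key so it is literal.
        if ($html -match [regex]::Escape($key)) { return $map[$key] }
    }
    return $null   # no known area found
}

Get-Area '<p>Downtown | (509) 555-0100</p>' $areaMap   # Downtown
```

An ordered map makes the match order explicit, which matters if one key is ever a substring of another.
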

**Rate limiting**: Add `Start-Sleep -Milliseconds 2000` between requests.
After a 429 (Too Many Requests) response, stop and wait 30+ minutes before trying again.

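
A fetch loop that follows these two rules might look like the sketch below. `Get-WaybackUrl` and `Save-RestaurantPages` are hypothetical names, and the loop assumes the failed request exposes its status code through `$_.Exception.Response.StatusCode`, which is how `Invoke-WebRequest` errors usually surface it.

```powershell
# Hypothetical helpers; the slug list, timestamp and output folder are placeholders.
function Get-WaybackUrl([string]$timestamp, [string]$slug) {
    "https://web.archive.org/web/$timestamp/https://inlanderrestaurantweek.com/project/$slug/"
}

function Save-RestaurantPages([string[]]$slugs, [string]$timestamp, [string]$outDir) {
    # $outDir should be an absolute path: .NET file APIs ignore the PowerShell location.
    foreach ($slug in $slugs) {
        $url = Get-WaybackUrl $timestamp $slug
        try {
            $resp = Invoke-WebRequest -Uri $url -UseBasicParsing
            [System.IO.File]::WriteAllText((Join-Path $outDir "$slug.html"), $resp.Content)
        }
        catch {
            $status = $null
            if ($_.Exception.Response) { $status = [int]$_.Exception.Response.StatusCode }
            if ($status -eq 429) {
                Write-Warning "429 at ${slug}: stopping; wait 30+ minutes before resuming."
                return
            }
            Write-Warning "Failed ${slug}: $($_.Exception.Message)"
        }
        Start-Sleep -Milliseconds 2000   # pause between requests
    }
}
```

Returning on the first 429 rather than retrying keeps the run from extending the block; the remaining slugs can be passed to a later run.
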

---

## Step 3: Parse Menu Courses

### Course Block Extraction (`Get-CourseBlock`)
Two HTML layouts exist:

**Layout A** (most common): heading and items in SEPARATE `et_pb_text_inner` blocks
```powershell
# Strategy 1: find content between this label and next label
$m = [regex]::Match($html, [regex]::Escape($label) + '(.+?)(?=' + [regex]::Escape($nextLabel) + ')', $opts)

# Strategy 3 (fallback): items in next et_pb_text_inner block
$im = [regex]::Match($sub, '(?s)et_pb_text_inner">(?!<h[123])(.+?)(?=et_pb_text_inner"><h|</div>\s*</div>\s*</div>\s*</div>\s*<div)', $opts)
```

**Layout B** (some restaurants: tavolata, durkins, table13, etc.): heading + items in SAME block
```powershell
# Strategy 2: extract <p> tags after </h3> within same div
$sameDivM = [regex]::Match($sub, '(?s)</h[123]>\s*(<p.+?)(?=</div>)', $opts)
```
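
To make Layout B concrete, the sketch below runs the Strategy 2 pattern on a made-up fragment and then splits the captured block into one string per `<p>`. The sample HTML and the `<p>` splitting step are assumptions for illustration; the guide does not show how the real script splits items.

```powershell
$opts = [System.Text.RegularExpressions.RegexOptions]::Singleline

# Made-up Layout B fragment: heading and dishes share one et_pb_text_inner div.
$sub = '<div class="et_pb_text_inner"><h3>First Course</h3>' +
       '<p><strong>Soup</strong><br/>Tomato</p>' +
       '<p><strong>Salad</strong><br/>Greens</p></div>'

# Strategy 2 from the guide: everything from the first <p> up to the closing </div>.
$sameDivM = [regex]::Match($sub, '(?s)</h[123]>\s*(<p.+?)(?=</div>)', $opts)

# Assumed follow-up step: one string per <p>, ready to hand to Parse-Dish.
$items = @([regex]::Matches($sameDivM.Groups[1].Value, '(?s)<p\b[^>]*>(.*?)</p>') |
    ForEach-Object { $_.Groups[1].Value })

$items.Count   # 2
$items[0]      # <strong>Soup</strong><br/>Tomato
```
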

### Dish Parsing (`Parse-Dish`)
Three tag styles exist:

**Style 1** (most restaurants): `<strong>` for name
```html
<p><strong>Dish Name</strong><br/>Description text</p>
```

**Style 2** (India House, Lebanon, Karma, others): `<b>` with `<br/>` before `</b>`
```html
<p><b>Dish Name <br/></b><span>Description text</span></p>
```

**Style 3** (1898): `<b>` + `<strong>` combination
```html
<p><span><b>Part1</b></span><strong>Part2</strong> Description</p>
```

**Multi-strategy parser** (handles all three):
```powershell
function Parse-Dish($pContent) {
    $opts = [System.Text.RegularExpressions.RegexOptions]::Singleline

    # Style 2: <b>Name <br/></b>
    $bWithBrM = [regex]::Match($pContent, '(?s)<b>(.*?)<br\s*/?>', $opts)
    if ($bWithBrM.Success) {
        $name = Get-CleanText $bWithBrM.Groups[1].Value
        if (Test-ValidDishName $name) {
            $desc = Get-CleanText ($pContent.Substring($bWithBrM.Index + $bWithBrM.Length))
            return [PSCustomObject]@{ name = $name; desc = $desc }
        }
    }

    # Style 3: <b>Part1</b>...<strong>Part2</strong>
    $bM = [regex]::Match($pContent, '(?s)<b>(.*?)</b>', $opts)
    if ($bM.Success) {
        $namePart = Get-CleanText $bM.Groups[1].Value
        if (Test-ValidDishName $namePart) {
            $afterB = $pContent.Substring($bM.Index + $bM.Length)
            $sM2 = [regex]::Match($afterB, '(?s)^[^<]*<strong>(.*?)</strong>(.*)', $opts)
            if ($sM2.Success) {
                $p2 = Get-CleanText $sM2.Groups[1].Value
                if (-not (Test-DietaryTag $p2) -and $p2.Length -ge 2) {
                    return [PSCustomObject]@{ name = "$namePart $p2".Trim(); desc = Get-CleanText $sM2.Groups[2].Value }
                }
            }
            return [PSCustomObject]@{ name = $namePart; desc = Get-CleanText $afterB }
        }
    }

    # Style 1: <strong>Name</strong>
    $sM = [regex]::Match($pContent, '(?s)<strong>(.*?)</strong>', $opts)
    if ($sM.Success) {
        $name = Get-CleanText $sM.Groups[1].Value
        if (-not (Test-ValidDishName $name)) { return $null }
        $afterBr = ''
        if ($pContent -match '(?s)<br\s*/?>(.*?)$') { $afterBr = $matches[1] }
        else { $am = [regex]::Match($pContent, '(?s)</strong>(.*?)$', $opts); if ($am.Success) { $afterBr = $am.Groups[1].Value } }
        return [PSCustomObject]@{ name = $name; desc = Get-CleanText $afterBr }
    }
    return $null
}

function Test-ValidDishName($name) {
    $name.Length -ge 3 -and $name.Length -le 80 -and
    $name -notmatch '^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$' -and
    $name -notmatch '^[A-Z]{1,3}:'
}

function Test-DietaryTag($str) {
    $str -match '^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$'
}
```
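
### HTML Cleanup
```powershell
function Get-CleanText($rawHtml) {
    # Strip tags first, then decode the common HTML entities.
    # Entity spellings are assumed; adjust them to what the pages actually contain.
    $t = $rawHtml -replace '<[^>]+>', ' '
    $t = $t -replace '&#0?39;|&apos;', "'" -replace '&quot;', '"'
    $t = $t -replace '&lt;', '<' -replace '&gt;', '>' -replace '&nbsp;', ' '
    # En and em dashes (entity or literal character) become plain hyphens.
    $t = $t -replace '&ndash;|&#8211;|\u2013', '-' -replace '&mdash;|&#8212;|\u2014', '-'
    # Decode &amp; last so that "&amp;lt;" is not decoded twice.
    $t = $t -replace '&amp;', '&'
    ($t -replace '\s+', ' ').Trim()
}
```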

---

## Step 4: Fix Prices

After scraping, apply authoritative prices from the price listing page:
- Parse `project_category_(25|35|45)` CSS class from portfolio items
- Match slug from adjacent `href` attribute
- Build a hashtable and apply to all entries

Common gotcha: Restaurant pages may show $22 (wine) or $33 (lunch); these are NOT the event price.

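
Applying the map could be as simple as the sketch below. The entry shape (`slug` and `price` properties) and the sample values are assumptions; the real scraped objects may use different field names.

```powershell
# Hypothetical data: $entries stands in for the scraped restaurant objects.
$priceMap = @{ 'example-bistro' = 45; 'sample-cafe' = 25 }
$entries = @(
    [PSCustomObject]@{ slug = 'example-bistro'; price = 22 }   # page showed a wine price
    [PSCustomObject]@{ slug = 'sample-cafe';    price = 25 }
    [PSCustomObject]@{ slug = 'not-listed';     price = 35 }
)

foreach ($e in $entries) {
    if ($priceMap.ContainsKey($e.slug)) {
        $e.price = $priceMap[$e.slug]      # listing price wins over the page price
    }
    else {
        Write-Warning "No listing price for $($e.slug); keeping page value $($e.price)"
    }
}

$entries[0].price   # 45
```

Warning on slugs missing from the map, rather than silently keeping the page price, makes it easier to spot restaurants the price listing snapshot did not include.
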

---

## Step 5: Recover Missing Restaurants

If a restaurant has 0/0/0 courses:
1. Try alternate Wayback timestamps: `20250401000000`, `20250415000000`, `20250501000000`, `20250601000000`
2. Check whether the page uses Layout B (same-block); if so, add Strategy 2 to the course block extractor
3. Check whether the page uses `<b>` tags instead of `<strong>` for dish names

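
For the first recovery step, the candidate URLs can be generated up front, as in this small sketch. `Get-CandidateUrls` and the slug `example-bistro` are hypothetical; the timestamps are the ones listed above.

```powershell
$timestamps = '20250401000000', '20250415000000', '20250501000000', '20250601000000'

# Hypothetical helper: one Wayback URL per candidate timestamp for a given slug.
function Get-CandidateUrls([string]$slug, [string[]]$timestamps) {
    foreach ($ts in $timestamps) {
        "https://web.archive.org/web/$ts/https://inlanderrestaurantweek.com/project/$slug/"
    }
}

$urls = @(Get-CandidateUrls 'example-bistro' $timestamps)
$urls.Count   # 4
```

Each URL would then be fetched in order, with the usual delay between requests, stopping at the first snapshot that yields a non-empty menu.
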

**Known JS-only restaurants** (no static cache recoverable for 2025):
heritage, kismet, littlenoodle, macdaddys, purgatory, redtail, republickitchen, republicpi, vicinopizza

---

## Step 6: Output and Validation

```powershell
# Save as UTF-8 without a BOM (important: special characters in restaurant names).
# [System.Text.Encoding]::UTF8 would prepend a BOM, so build the encoding explicitly.
$json = $data | ConvertTo-Json -Depth 10
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
[System.IO.File]::WriteAllText($outPath, $json, $utf8NoBom)

# Validate: list any restaurant not at 3/3/3
$data | Where-Object {
    $_.menu.courses.'First Course'.Count -ne 3 -or
    $_.menu.courses.'Second Course'.Count -ne 3 -or
    $_.menu.courses.'Third Course'.Count -ne 3
} | ForEach-Object {
    "$($_.slug): $($_.menu.courses.'First Course'.Count)/$($_.menu.courses.'Second Course'.Count)/$($_.menu.courses.'Third Course'.Count)"
}
```

---

## PowerShell Script Execution Pattern (REQUIRED)

```bash
# Write script to project dir (via Write tool or Edit)
# Then in bash:
cp "//WinServ-20-3.chns.local/Profiles/derekc/Documents/Coding Projects/.../script.ps1" \
   "/c/Users/derekc.CHNSLocal/AppData/Local/Temp/script.ps1"
powershell.exe -ExecutionPolicy Bypass -File "C:\Users\derekc.CHNSLocal\AppData\Local\Temp\script.ps1"
```

**Never** use `powershell -Command "..."` for multi-line scripts: escaping is unreliable.
**Never** try to run `.ps1` directly from the `\\WinServ-20-3...` UNC path: execution policy blocks it.

---

## PowerShell Gotchas
- `"$slug: text"` fails when `:` directly follows the variable name (PowerShell reads `$slug:` as a scope- or drive-qualified variable); use `"${slug}: text"`
- Function names like `Is-X`, `Decode-X`, `Parse-X` get PSScriptAnalyzer warnings (unapproved verbs) but work fine
- `return ,$array` (comma prefix) forces PowerShell to return an array, not unroll it
- `[System.IO.File]::WriteAllText(path, json, encoding)` with `New-Object System.Text.UTF8Encoding $false` as the encoding: use this, not `Out-File`, to avoid BOM/encoding issues (`[System.Text.Encoding]::UTF8` itself writes a BOM)

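
The first and third gotchas are easy to check in a console. The snippet below is a small illustration with made-up names (`Get-OneDish`, `Get-OneDishUnrolled`), not code from the scraper.

```powershell
$slug = 'example-bistro'
"${slug}: 3/3/3"                             # braces keep the colon out of the variable name

function Get-OneDish { return ,@('Soup') }   # comma keeps the one-element array intact
function Get-OneDishUnrolled { return @('Soup') }

(Get-OneDish) -is [array]                    # True
(Get-OneDishUnrolled) -is [array]            # False: unrolled to a single string
```

The unrolling matters for the 3/3/3 validation: a course with a single dish would otherwise come back as a bare object instead of a one-element array.
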