Set up 2025 files and started parsing the archive site, but was rate limited. Will need to finish it in the future.
152
memory/html-structures.md
Normal file
@@ -0,0 +1,152 @@
# IRW Website HTML Structure Reference

## Restaurant Page URL
Live: `https://inlanderrestaurantweek.com/project/SLUG/`
Archived: `https://web.archive.org/web/TIMESTAMP/https://inlanderrestaurantweek.com/project/SLUG/`
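
A minimal sketch of building both URL forms from a slug. The helper names `live_url` and `archived_url` and the example timestamp are hypothetical; the timestamp is assumed to use the Wayback Machine's 14-digit `YYYYMMDDhhmmss` form.

```python
LIVE_BASE = "https://inlanderrestaurantweek.com/project"
WAYBACK_BASE = "https://web.archive.org/web"


def live_url(slug: str) -> str:
    """Return the live restaurant page URL for a slug."""
    return f"{LIVE_BASE}/{slug}/"


def archived_url(slug: str, timestamp: str) -> str:
    """Return the Wayback Machine URL for a slug at a snapshot timestamp."""
    return f"{WAYBACK_BASE}/{timestamp}/{live_url(slug)}"


# Hypothetical snapshot timestamp, for illustration only.
print(archived_url("tavolata", "20250301000000"))
```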

## Page Framework
The site uses WordPress + Divi theme. Relevant container class: `et_pb_text_inner`.
Each course section typically occupies one or two `et_pb_text_inner` divs.

---

## Course Layout Types

### Layout A — Heading and items in SEPARATE divs (most restaurants)
```html
<div class="et_pb_text_inner"><h3>First Course</h3></div>
<div class="et_pb_text_inner">
  <p><strong>Dish Name</strong><br/>Description</p>
  <p><strong>Dish Name 2</strong><br/>Description 2</p>
</div>
<div class="et_pb_text_inner"><h3>Second Course</h3></div>
...
```

### Layout B — Heading and items in SAME div (tavolata, durkins, table13, others)
```html
<div class="et_pb_text_inner">
  <h3>First Course</h3>
  <p><strong>Dish Name</strong><br/>Description</p>
  <p><strong>Dish Name 2</strong><br/>Description 2</p>
</div>
<div class="et_pb_text_inner">
  <h3>Second Course</h3>
  ...
</div>
```
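
One way to handle both layouts with a single pass is to remember the most recent `<h3>` and attach every following `<p>` to it. The sketch below is a hypothetical helper (`group_items_by_course` is not from the original parser), and it assumes `et_pb_text_inner` divs contain no nested `<div>`, which the regex would cut short.

```python
import re

BLOCK_RE = re.compile(r'<div class="et_pb_text_inner">(.*?)</div>', re.DOTALL)
HEADING_RE = re.compile(r"<h3[^>]*>(.*?)</h3>", re.DOTALL)
ITEM_RE = re.compile(r"<p[^>]*>(.*?)</p>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def group_items_by_course(page: str) -> dict:
    """Map each course heading to the raw inner HTML of its <p> items."""
    courses = {}
    current = None
    for block in BLOCK_RE.findall(page):
        heading = HEADING_RE.search(block)
        if heading:
            # Layout A and B both start a course at an <h3>.
            current = TAG_RE.sub("", heading.group(1)).strip()
            courses.setdefault(current, [])
        if current is None:
            continue  # text block before the first course heading
        # Layout B: items share the heading's div. Layout A: items are in
        # the next div, which has no heading, so `current` carries over.
        courses[current].extend(p.strip() for p in ITEM_RE.findall(block))
    return courses


layout_a = (
    '<div class="et_pb_text_inner"><h3>First Course</h3></div>'
    '<div class="et_pb_text_inner">'
    "<p><strong>Dish Name</strong><br/>Description</p>"
    "</div>"
)
print(group_items_by_course(layout_a))
```

Because the heading is tracked across blocks, the same function returns identical results for the Layout A and Layout B markup shown above.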

---

## Dish Name Tag Styles

### Style 1 — `<strong>` tag (most restaurants)
Examples: 315cuisine, anthonys, bardenay, barkrescuepub, etc.
```html
<p><strong>Dish Name</strong><br/>Description text here</p>
<p><strong>Dish Name</strong> <br/>With space before br</p>
```

### Style 2 — `<b>` tag with `<br/>` inside (India House, Lebanon, Karma, ponderosa)
```html
<p><b>Dish Name <br/></b><span>Description text</span></p>
<p><b>Dish Name<br/></b> Description without span</p>
```
Key: name is inside `<b>`, the `<br/>` is INSIDE the `<b>` tag.

### Style 3 — `<b>` + `<strong>` combo (1898 restaurant)
```html
<p><span><b>First Part</b></span><strong>Second Part</strong> Description</p>
```
Full dish name = "First Part" + " " + "Second Part"
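
A single extractor can cover all three styles by collecting every `<b>` or `<strong>` run in a paragraph, dropping inner tags such as `<br/>`, and joining the parts with a space. This is a sketch under that assumption; `dish_name` is a hypothetical helper, and it would also pick up bold text inside a description if a page used any.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
# \b keeps "<b" from matching "<br/>"; \1 pairs each opener with its closer.
BOLD_RE = re.compile(r"<(b|strong)\b[^>]*>(.*?)</\1>", re.DOTALL)


def dish_name(paragraph_html: str) -> str:
    """Join the text of every <b>/<strong> run in one <p> into a dish name."""
    parts = []
    for _tag, inner in BOLD_RE.findall(paragraph_html):
        # Replace inner tags (e.g. <br/>) with spaces, then collapse whitespace.
        text = " ".join(TAG_RE.sub(" ", inner).split())
        if text:
            parts.append(text)
    return " ".join(parts)


style_3 = "<p><span><b>First Part</b></span><strong>Second Part</strong> Description</p>"
print(dish_name(style_3))
```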

---

## Field Extraction Patterns

### Name (from page title)
```html
<title>Restaurant Name | Inlander Restaurant Week</title>
```
Regex: `<title>(.+?) \| Inlander`
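
The title regex can be applied as below. The `restaurant_name` helper is hypothetical, and decoding HTML entities with `html.unescape` is an added assumption (WordPress titles often encode apostrophes and ampersands), not something the notes above require.

```python
import re
from html import unescape
from typing import Optional

TITLE_RE = re.compile(r"<title>(.+?) \| Inlander")


def restaurant_name(page: str) -> Optional[str]:
    """Return the restaurant name from the <title> tag, or None if absent."""
    match = TITLE_RE.search(page)
    if match is None:
        return None
    # Assumption: entities such as &amp; should be decoded for display.
    return unescape(match.group(1)).strip()


print(restaurant_name("<title>Restaurant Name | Inlander Restaurant Week</title>"))
```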

### Price (WARNING: unreliable — use price listing page instead)
```html
<h1 style="text-align: left;"><strong>$45</strong></h1>
```
Regex: `<strong>\$(\d+)</strong>`
PROBLEM: Some pages show drink prices like $22 that match before the real price.
SOLUTION: Always build an authoritative slug→price map from the price listing page.

### Price Listing Page — Authoritative Prices
URL: `https://inlanderrestaurantweek.com/price/` (or Wayback archived version)
```html
<article class="et_pb_portfolio_item ... project_category_45 ...">
  ...
  <a href="https://inlanderrestaurantweek.com/project/SLUG/">
```
Extract price tier from `project_category_(25|35|45)` CSS class.
Extract slug from `href=".../project/SLUG/"`.
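
The slug→price map described above could be built roughly as follows. This is a sketch: `price_map` is a hypothetical name, and it assumes each portfolio item is closed by `</article>`, which the snippet above does not show. The slug pattern tolerates a Wayback prefix in front of the original URL.

```python
import re

ARTICLE_RE = re.compile(r"<article\b([^>]*)>(.*?)</article>", re.DOTALL)
TIER_RE = re.compile(r"project_category_(25|35|45)\b")
SLUG_RE = re.compile(r'href="[^"]*/project/([^/"]+)/"')


def price_map(listing_html: str) -> dict:
    """Return {slug: price tier} for every portfolio item on the listing page."""
    prices = {}
    for attrs, body in ARTICLE_RE.findall(listing_html):
        tier = TIER_RE.search(attrs)   # tier comes from the CSS class
        slug = SLUG_RE.search(body)    # slug comes from the first project link
        if tier and slug:
            prices[slug.group(1)] = int(tier.group(1))
    return prices


sample = (
    '<article class="et_pb_portfolio_item project_category_45 post-1">'
    '<a href="https://inlanderrestaurantweek.com/project/example-slug/">x</a>'
    "</article>"
)
print(price_map(sample))
```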

### Cuisine
```html
CUISINE: AMERICAN COMFORT FOOD
```
Often inside `<strong>` or `<em>` tags. Extract uppercase text after "CUISINE:".
Lowercase the text, then apply `.ToTitleCase()` for proper formatting. If this is .NET's `TextInfo.ToTitleCase`, the lowercase step is required: it leaves all-uppercase words unchanged.
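
A Python sketch of the same extraction, for comparison. The `cuisine` helper and the exact character class are assumptions; `string.capwords` is used instead of `str.title()` because `title()` would turn `CHEF'S` into `Chef'S`.

```python
import re
from string import capwords
from typing import Optional

# Skip any tags between the label and the value, then capture runs of
# uppercase words. The lookahead stops the capture from swallowing the first
# capital of a following mixed-case word.
CUISINE_RE = re.compile(
    r"CUISINE:\s*(?:<[^>]+>\s*)*([A-Z][A-Z&/'-]*(?: [A-Z&/'-]+)*)(?![a-z])"
)


def cuisine(page: str) -> Optional[str]:
    """Return the cuisine label in title case, or None if not found."""
    match = CUISINE_RE.search(page)
    if match is None:
        return None
    return capwords(match.group(1))


print(cuisine("<strong>CUISINE: AMERICAN COMFORT FOOD</strong>"))
```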

### Phone
Area codes: 509 (Spokane area) or 208 (Idaho/CDA area)
Pattern: `(509) 555-1234` or `(208) 555-1234`
Regex: `\((?:208|509)\) \d{3}-\d{4}`
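
The phone regex applied in Python, as a small sketch; `phone` is a hypothetical helper that returns the first match only.

```python
import re
from typing import Optional

PHONE_RE = re.compile(r"\((?:208|509)\) \d{3}-\d{4}")


def phone(page: str) -> Optional[str]:
    """Return the first 208/509 phone number on the page, or None."""
    match = PHONE_RE.search(page)
    return match.group(0) if match else None


print(phone("<p>Call (509) 555-1234 today</p>"))
```

Numbers written in other formats, such as `509-555-1234`, will not match this pattern.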

### Hours
```
Menu served 5pm-9pm nightly
Menu served Thursday-Sunday, 5-9pm
```
Regex: `Menu served [^<]+`
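
Since `[^<]+` also captures trailing whitespace and newlines up to the next tag, a sketch of the extraction might trim the result. The `hours` helper and the entity decoding are assumptions beyond the regex itself.

```python
import re
from html import unescape
from typing import Optional

HOURS_RE = re.compile(r"Menu served [^<]+")


def hours(page: str) -> Optional[str]:
    """Return the 'Menu served ...' line, trimmed, or None if absent."""
    match = HOURS_RE.search(page)
    if match is None:
        return None
    # Assumption: decode entities and drop whitespace before the next tag.
    return unescape(match.group(0)).strip()


print(hours("<p>Menu served Thursday-Sunday, 5-9pm \n</p>"))
```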

### Area
Look for area keywords (ALL CAPS in source) anywhere in the HTML:
- DOWNTOWN, NORTH SPOKANE, SOUTH SPOKANE, WEST SPOKANE, SPOKANE VALLEY
- AIRWAY HEIGHTS, LIBERTY LAKE, COEUR D'ALENE, POST FALLS, HAYDEN, ATHOL, WORLEY
Default to ["Downtown"] if nothing matched.
Some restaurants appear in multiple areas — collect all matches.
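
The keyword scan with its default can be sketched as below. The display names in `AREAS` and the `areas` helper are assumptions; an explicit mapping is used because automatic title-casing would mangle "Coeur d'Alene". Normalizing curly apostrophes is also an assumption about how the pages encode that name.

```python
from html import unescape

# Keyword as it appears in the page source -> display name (assumed spelling).
AREAS = {
    "DOWNTOWN": "Downtown",
    "NORTH SPOKANE": "North Spokane",
    "SOUTH SPOKANE": "South Spokane",
    "WEST SPOKANE": "West Spokane",
    "SPOKANE VALLEY": "Spokane Valley",
    "AIRWAY HEIGHTS": "Airway Heights",
    "LIBERTY LAKE": "Liberty Lake",
    "COEUR D'ALENE": "Coeur d'Alene",
    "POST FALLS": "Post Falls",
    "HAYDEN": "Hayden",
    "ATHOL": "Athol",
    "WORLEY": "Worley",
}


def areas(page: str) -> list:
    """Return every matching area, or ["Downtown"] when none match."""
    # Decode entities and normalize curly apostrophes before matching.
    text = unescape(page).replace("\u2019", "'")
    found = [name for keyword, name in AREAS.items() if keyword in text]
    return found or ["Downtown"]


print(areas("<p>DOWNTOWN</p><p>COEUR D&#8217;ALENE</p>"))
```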

---

## Dietary Tag Filtering
Skip these as dish names — they appear in `<strong>` but are dietary labels, not dish names:
- GF (gluten free)
- GFA (gluten free available)
- V, V+ (vegetarian, vegan)
- DF, DFA (dairy free, dairy free available)
- V:, V+A (legend lines)
- 2025 (year marker some restaurants include)
- Drink (some restaurants label beverage course)

Full regex: `^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$`
Also skip names matching `^[A-Z]{1,3}:` (legend lines like "GF: Gluten Free")
Also skip names shorter than 3 chars or longer than 80 chars.
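
The three filters above combine into one predicate. This is a sketch; `is_dish_name` is a hypothetical helper name, and the regexes are copied from the rules as written.

```python
import re

DIETARY_RE = re.compile(r"^(GF|GFA|V\+?|DF|DFA|V:|2025|Drink|V\+A)$")
LEGEND_RE = re.compile(r"^[A-Z]{1,3}:")


def is_dish_name(name: str) -> bool:
    """Return False for dietary labels, legend lines, and out-of-range lengths."""
    name = name.strip()
    if DIETARY_RE.match(name) or LEGEND_RE.match(name):
        return False
    return 3 <= len(name) <= 80


candidates = ["GF", "V+", "GF: Gluten Free", "Drink", "Dish Name"]
print([c for c in candidates if is_dish_name(c)])
```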

---

## Restaurants by Known HTML Style (2025)

**Layout B (same-block)**: tavolata, durkins, table13, terraza, and others
**Style 2 (`<b>` tags)**: indiahouse, lebanon, karma, ponderosa, collectivekitchen, dryfly, masselowslounge, vieuxcarre, wileys, osprey, shawnodonnells, ganderryegrass
**Style 3 (`<b>`+`<strong>` combo)**: 1898

Note: These styles may change year to year as restaurants update their pages.
Always check a few representative pages before assuming the same structure applies.

---

## JS-Only Pages (no static HTML menu content)
These restaurants had no recoverable menu data from any Wayback snapshot in 2025:
heritage, kismet, littlenoodle, macdaddys, purgatory, redtail, republickitchen, republicpi, vicinopizza

Their pages are fully JS-rendered — the static HTML captured by Wayback Machine
shows the page shell but not the menu content. For future years, these may or may not
have static caches depending on server-side rendering changes.