Friday, November 21, 2014

Regular Expressions, Redirects, and Rewrites oh my!

I recently got a request worded as such:
You choose a technical topic related to programming computers that you know a great deal about already, send me the topic so I can read up on it if necessary, and then prepare to give in-depth 10 to 15 minute lecture about the topic to me and I'll ask follow questions.
I thought this would be the perfect thing to share with everyone else as well. One of my biggest strengths is networking, which I had already covered with that person a bit. My extensive embedded experience also gives me command of the HW-SW interface. We had already touched on optimizing databases and critical code, which I view as more abstract computer architecture (things like pipelining and the effect of spatial locality on cache misses), and things like pull-ups, flash loaders, and bypass capacitors were not an appropriate subject.

I pondered diving into computability, and the things one could evaluate with seminal computing models like the state machine, pushdown automata, and Turing machines, but this seemed a bit too much to bite off in short order. However, thinking about state machines (e.g. a soda machine) gave me an idea that would incorporate a bit of one of my favorite subjects, Regular Expressions (regex), and let me touch on a scenario that we run into often in the modern interconnected web world.

It will also allow me to demonstrate a few interesting things under the hood of the internet. What you perceive as simply loading a web page is often a server accessing databases and services on the server side to serve a page, which may in turn deliver executable code that can load local and cached resources, as well as make browser-side calls to further web services. While I use the example of loading web content below, this is analogous (via Solution 4, code, and combinations of other methods) to using any remote procedure call, be it REST, SOAP, or some other method, which may deliver JSON, XML, or other things instead of an HTML document.

The Problem


I use case 1 as the base problem, and will add to the problem as we come up with decision points down the line:

1) I want to type www.example.com and see the content of www.yahoo.com

The following cases follow as we go through the analysis:

2) I want to be able to share a URL by itself as a way to share the content I am viewing at the moment. (No iFrame)

3) I want query parameters to pass through to my page, easily, without server or browser code (No iFrame)

4) I want cookies to be shareable between domains (no cross domain iFrame, or careful consideration).

5) I want to support https (no matter what you must get certificates).

6) I want a custom path scheme for my content (no CNAME).

7) I want the user to type www.example.com every time they want to get to the site (don't send an HTTP 301)

8) I want the user to still see the domain www.example.com even though they see the content from www.yahoo.com (don't send an HTTP 302)

9) The site/content I am integrating with uses a different version of javascript, jquery, node.js, etc. or I need to tightly control the order of javascript execution, for example using a message passing interface (don't use client side javascript without using an iFrame, or heavy focus on synchronization constructs).

Solution 1: HTML iFrame

This is probably the easiest way for most people to think of. HTML has a mechanism for this called an iFrame. This simple document will embed a window with www.example.com in my page. I created a simple page like this at:

http://hl1264.blogspot.com/2014/11/blog-post.html

I will note some limitations:

1) I click a link inside the iFrame. The iFrame content changes, but my URL doesn't. This is ok for me, but say I am on my nth click and want to send a link to a friend. When I copy-paste the URL I will go to my blog post, not to the page I wanted to share!

2) Query parameters are not passed through. I actually ran into this problem when I asked someone to forward a domain in a way I'll cover next, but they used an iFrame. While 1) above was an issue, we were also using Google Analytics, and while the user might enter:

http://hl1264.blogspot.com/2014/11/blog-post.html?utm_source=Tom&utm_medium=Blogspot&utm_campaign=Example1

I would not get:

www.example.com/?utm_source=Tom&utm_medium=Blogspot&utm_campaign=Example1

which would allow the information to pass through. I would just get a hit on:

www.example.com

So the information was lost. While it's a quick JavaScript or PHP script to grab those parameters and append them to the iFrame address, I don't want to do this with code because it's messy, and in this particular case the partner would not do this for us either.

3) Cookies will not automatically be shared without careful consideration. I run into this often when using iFrames to integrate multiple sites. You must carefully navigate this with a smart CNAME, cookie declaration, and possibly certificate which I will give as a real world example at the end.
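For limitation 2) above, the "quick script" might look roughly like the sketch below. It is only an illustration under my own assumptions: the function name withForwardedQuery and the element id "embedded" are made up for this example, and it relies on the standard URL class available in current browsers and Node.

```javascript
// Hypothetical helper: copy the query string of the outer page onto the iFrame target.
function withForwardedQuery(pageUrl, frameUrl) {
  const page = new URL(pageUrl);
  const frame = new URL(frameUrl);
  // Append each outer-page parameter, keeping any parameters the frame URL already has.
  for (const [key, value] of page.searchParams) {
    frame.searchParams.append(key, value);
  }
  return frame.toString();
}

// In a browser this could be wired up as follows (the element id is an assumption):
// document.getElementById("embedded").src =
//   withForwardedQuery(window.location.href, "http://www.example.com/");
```

This is exactly the kind of glue code I wanted to avoid, but it shows why the parameters are lost by default: nothing copies them unless someone writes this step.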

iFrames are a useful tool, and can be blended with following items in various ways to achieve different end goals. One prime example is blending sites or content that use different versions of javascript or

Solution 2: DNS CName

In DNS parlance a "cname" is short for "canonical name," and means one domain is an alias for another. This is analogous to a file alias in the Mac world, a shortcut in PC land, and a link in the *nix universe. This allows you to say in 1 DNS entry:

www.example.com     CNAME  www.yahoo.com

As long as the site uses relative paths, then you can navigate the entire site and the paths on example.com will mirror those on yahoo.com or whatever target site you choose. For example most sites will have something like this in place:

www.example.com     CNAME  example.com
example.com         A      93.184.216.119

Where the second line is an "A" or "address" record. The above two lines mean that the canonical name of www.example.com is example.com, and the address of example.com is 93.184.216.119.
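To make the alias idea concrete, here is a toy sketch of how a lookup follows a CNAME to its A record. It is not a real resolver: the in-memory records table and the resolve function are my own illustrative names, and the hop limit is an arbitrary guard against alias loops.

```javascript
// Toy zone data mirroring the two records above; not a real DNS resolver.
const records = {
  "www.example.com": { type: "CNAME", value: "example.com" },
  "example.com": { type: "A", value: "93.184.216.119" },
};

// Follow CNAME aliases until an A record is found.
function resolve(name, maxHops = 8) {
  let current = name;
  for (let hop = 0; hop < maxHops; hop++) {
    const record = records[current];
    if (!record) return null;                      // name is not in the toy zone
    if (record.type === "A") return record.value;  // found an address
    current = record.value;                        // CNAME: look up the canonical name next
  }
  return null;                                     // too many aliases in a row
}
```

Calling resolve("www.example.com") walks the alias to example.com and returns its address, which is why both names end up at the same server.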

1) One limitation here is https, which is rapidly becoming the norm for all sites. If you use this method but your domain is different, modern browsers will block the content or warn users that they are likely entering a dicey situation. This is because the domain the user entered will not match the domain on the certificate. There is no way around this, and any valid case will have no trouble justifying the cost of a new certificate for the project.

2) Another limitation is with regard to the path. With this method the path will always mirror that of the aliased site. In the case of www.example.com and example.com this makes obvious sense but there may be cases where you will want a custom mapping whether arbitrary, or for backwards compatibility, but more on that later.

Solution 3: HTTP Protocol

The Hyper Text Transfer Protocol (HTTP) is used to transfer Hyper Text Markup Language (HTML) documents. You get a code 200 for success almost every time you load a web page, and most people will recognize 404 (Not Found - usually a mistyped or dead link), 503 (Service Unavailable - usually server maintenance or overload), and 500 (Internal Server Error - any unhandled exception, for me usually an unhandled PHP error). Within this protocol there is a class of responses (3XX) which are specifically aimed at redirection.

These codes also fulfill our purpose in some ways:

301 - Moved Permanently

Sending this response would indicate a permanent change. If I clicked on a bookmark and received this response, a smart browser should change the URL in my bookmark. Smart search engines should no longer index the old URL, and should index the new URL.

302 - Moved Temporarily OR Found

Moved Temporarily is the old nomenclature, and now Found is used, but this means keep the URL you have. This response is frequently used in parts of sites that are dynamic with respect to site structure, things like AB testing which could guide user subsets to slightly different interfaces. You can still type www.example.com and get your other site content, but by the time you looked up at the address bar you would quickly see the result site, in our case www.yahoo.com so this option is out if you want to see www.example.com as the domain name.
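On the wire, both responses are just a status line plus a Location header telling the browser where to go next. The sketch below builds that text by hand so you can see it; the redirectResponse function is a made-up name for illustration, and a real server would normally let its HTTP library write these headers.

```javascript
// Build the raw text of a minimal HTTP/1.1 redirect response.
function redirectResponse(status, location) {
  const reasons = { 301: "Moved Permanently", 302: "Found" };
  if (!(status in reasons)) throw new Error("unsupported status: " + status);
  return [
    `HTTP/1.1 ${status} ${reasons[status]}`,
    `Location: ${location}`,   // the URL the browser should request next
    "Content-Length: 0",
    "",                        // blank line ends the headers
    "",
  ].join("\r\n");
}
```

For example, redirectResponse(302, "http://www.yahoo.com/") produces the response that makes the address bar change to www.yahoo.com.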

Solution 4 - The code silly!

There are, of course, many ways to do this by writing good old code! Here I present 3 different modalities of using server and browser side code to accomplish the effect. In all cases one can get the path from the request URL and use that in the URL requested, with similar effects to the cname mapping. This is the method we did not consider before in the simple iFrame case. In a simple Pseudocode:

Take www.example.com/path
read www.yahoo.com/path
return the html document received

This is also a simple example analogous to getting data via a SOAP or REST call and applying a CSS style sheet to it
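The pseudocode above could be sketched in JavaScript as follows. The names upstreamUrl and serveMirrored are hypothetical, the destination is hard-coded to match the example, and the fetch call assumes a runtime that provides it (a current browser or Node 18+). Treat it as a rough outline rather than a finished proxy.

```javascript
// Map an incoming request URL onto the destination host, keeping path and query.
function upstreamUrl(requestUrl, destinationOrigin) {
  const incoming = new URL(requestUrl);
  return new URL(incoming.pathname + incoming.search, destinationOrigin).toString();
}

// Read the destination page and return the HTML document received.
async function serveMirrored(requestUrl) {
  const response = await fetch(upstreamUrl(requestUrl, "http://www.yahoo.com"));
  return response.text();
}
```

Only the first function is pure string work; the second one performs the network read, so error handling, https, and headers would all need attention in anything real.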

Regex Aside 1

A quick introduction to regular expressions. Most people know that in a shell wildcard '*' means "everything," so 'ap*' will match "ap," "app," "application," "apache" and anything else that starts with "ap." Regular expressions reuse some of the same symbols with different meanings: as a regex, 'ap*' means an 'a' followed by zero or more 'p' characters. To get particular, let's talk about some typical regex syntax (the particulars of which may be platform and language dependent). Some regex basics:
  • '.' represents any character, except in a character class
    • '.' will match 'a' or 'b' but also 'ab' as there was in fact a character, there just happened to be a second
  • '+' represents one or more of the preceding character
    • a+ will match 'a', 'aa', 'ab' but not 'b'. It will still match 'ba' and 'baa'
  • '*' represents 0 or more of the preceding character - be careful
    • a bare '*' has nothing to repeat, and most engines reject it as an error
    • '.*' will match everything
    • 'ab*a' will match 'aa', 'aba', 'abbbbbba'
  • '?' makes something optional, or indicates 0 or 1 of the preceding character
    • 'ab?a' will match anything with 'aa' or 'aba' in it but not 'abba'
  • () groups a piece of the regex for later reference. The value in the first parentheses can typically be referred to as $1 or \1, the second as $2 and so on....
  • [] are used to indicate a character class, such that multiple characters are possible
    • [aeiou] will match any string with a vowel in it
    • [1234567890] will match anything with a decimal digit in it
  • ^ Matches the start of a string
    • '^ab' matches "abatement"or "absinthe" but not "an abatement system" or "a bottle of absinthe"
  • $ Matches the end of a string
    • 'ing$' matches "fishing", "swimming", and "lounging", but not "stinking fish" or "It's too freezing out there"
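The bullets above can be checked quickly in JavaScript, whose regex syntax agrees with Perl's for all of these basics. This is just a small sketch: the cases table is my own arrangement, and test() performs an unanchored search, which is the sense of "match" used in the list.

```javascript
// Each entry: [pattern, input, expected result of an unanchored search].
const cases = [
  [/./, "ab", true],                     // any character is present
  [/a+/, "baa", true],                   // one or more 'a' somewhere
  [/a+/, "b", false],
  [/ab*a/, "abbbbbba", true],            // zero or more 'b' between two 'a'
  [/ab?a/, "aba", true],                 // optional single 'b'
  [/ab?a/, "abba", false],
  [/[aeiou]/, "rhythm", false],          // character class: no vowel here
  [/^ab/, "abatement", true],            // anchored at the start
  [/^ab/, "an abatement system", false],
  [/ing$/, "fishing", true],             // anchored at the end
  [/ing$/, "stinking fish", false],
];

const results = cases.map(([pattern, input]) => pattern.test(input));

// Groups: $1 and $2 refer back to the first and second parentheses.
const swapped = "2014-11".replace(/(\d+)-(\d+)/, "$2/$1");
```

Every entry in results agrees with the expected value in its row, and swapped ends up as "11/2014".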
In Perl, which is similar enough to PHP, Python, and JavaScript with respect to regular expressions, here is a simple yet befuddling-looking expression that breaks out the pieces of a URL:

$uri = "http://www.yahoo.com/folder1/page1.html?key1=val1&key2=val2#FragmentOrAnchor";

# Groups: $2 scheme, $4 host, $5 path, $7 query string, $9 fragment
if ($uri =~ m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!) {
        print "protocol:$2, domain:$4, path:$5, query:$7, fragment:$9\n";
}

gives 

protocol:http, domain:www.yahoo.com, path:/folder1/page1.html, query:key1=val1&key2=val2, fragment:FragmentOrAnchor

$uri = "ftp://ftp.example.com/user1?user=tom&password=12345";

I will leave deconstructing that regex in your favorite scripting language for you as an exercise later. Regex will come in handy again later when we talk about rewrite rules in 5).
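If JavaScript is your favorite scripting language, the same expression ports over almost unchanged. The sketch below is one way to write it; the names URI_PARTS and splitUri are mine, and the only syntactic difference is that the forward slashes are escaped inside the regex literal.

```javascript
// Same pattern as the Perl version; '/' is escaped because it delimits the literal.
const URI_PARTS = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Return the numbered groups under readable names (missing parts are undefined).
function splitUri(uri) {
  const m = URI_PARTS.exec(uri);
  return { protocol: m[2], domain: m[4], path: m[5], query: m[7], fragment: m[9] };
}
```

Calling splitUri on the first URL above yields the same five pieces that the Perl code prints.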

Solution 4a: Server Code loads destination site and serves it

In this case code on the server will naively fetch the content, and can programmatically parse the path out and append it to the call. For example in Perl I could write:

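# '.' is string concatenation; -qO- makes wget write the page to stdout instead of a file.
# $path must be sanitized first, since it is passed to a shell.
$command = "wget -qO- http://www.yahoo.com/".$path;
print `$command`; # backticks execute the command in the shell and return its stdout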

As with the other cases, https is a concern. Here, however, I could insert elements around or inside the loaded page, modify the style sheet, and do many dynamic things. This modality will frequently be used in the form of server-side RESTful calls.

Solution 4b: SSI

This will normally not be an option as it opens all kinds of security vulnerabilities, but an easy way for demonstration purposes is to use a Server Side Include (SSI). This command lets the web page execute any command that Apache's own user and group permissions allow. In this case the simple example is:



Adding in the path needs could be accomplished multiple ways via further code, but is messy, and this example is just academic.

Solution 4c: Browser side JavaScript

With JavaScript you can actually accomplish the feat in many ways that parallel some of the other pitfalls. One way to achieve this is to simply change to the destination URL:

var url = "http://www.yahoo.com/";

window.location = url;

But this will give you the destination URL in your browser, and really is equivalent to using an HTTP 3XX code to change your destination. Another option is to load the code and re-write the current document with something like the wget above, or even more fun, just insert the iFrame:

document.body.innerHTML = '<iframe src="http://www.example.com"></iframe>';


The sensible time to do this may be when you are actually receiving, for example, XML to be rendered according to CSS rules. This is a much more complex but typical case for embedding widgets and 3rd party content into your site. Especially in the case where all of the traffic is between the browser and a 3rd party server, this option will reduce latency, but it requires smart tricks like subdomains to deal with XSS issues.
The common trouble is, again, different versions of JavaScript and JS frameworks. Doing this on the client side also exposes you to race conditions (the 'A' in AJAX is asynchronous), so this route is not recommended in that case either without careful consideration and synchronization.


Solution 5 Rewrite Rules

Solution 5a: Web Server Rewrite Rules

I know that a similar mechanism exists in the Windows world, but I'm used to Apache, which has the mod_rewrite module. From the Apache documentation:

The mod_rewrite module uses a rule-based rewriting engine, based on a PCRE regular-expression parser, to rewrite requested URLs on the fly. By default, mod_rewrite maps a URL to a filesystem path. However, it can also be used to redirect one URL to another URL, or to invoke an internal proxy fetch.

So you can use this module to map one path to another, as well as parse things like HTTP headers as I mention below. It is regular expression based (you see regex everywhere), so you can create logical mappings, not just one-to-one relationships. Examples abound on the net, but here are a couple of excerpts:

In the example ruleset below we replace /~user by the canonical /u/user and fix a missing trailing slash for /u/user.

RewriteRule   ^/~([^/]+)/?(.*)    /u/$1/$2  [R]
RewriteRule   ^/([uge])/([^/]+)$  /$1/$2/   [R]
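To see what the two patterns above do to a path, here is a JavaScript stand-in that applies the same regular expressions and substitutions. It is only a sketch: canonicalUserPath is a made-up name, and the external redirect requested by the [R] flag is not modeled, only the string rewriting.

```javascript
// Apply the two rewrite patterns in order and return the resulting path.
function canonicalUserPath(path) {
  // ^/~([^/]+)/?(.*)  ->  /u/$1/$2
  let result = path.replace(/^\/~([^\/]+)\/?(.*)/, "/u/$1/$2");
  // ^/([uge])/([^/]+)$  ->  /$1/$2/   (adds the missing trailing slash)
  result = result.replace(/^\/([uge])\/([^\/]+)$/, "/$1/$2/");
  return result;
}
```

With this, "/~tom/docs/index.html" becomes "/u/tom/docs/index.html" and "/u/tom" becomes "/u/tom/", while a path that matches neither pattern passes through unchanged.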

The goal of this rule is to force the use of a particular hostname, in preference to other hostnames which may be used to reach the same site. For example, if you wish to force the use of www.example.com instead of example.com, you might use a variant of the following recipe.

# For sites running on a port other than 80
RewriteCond %{HTTP_HOST}   !^www\.example\.com [NC]
RewriteCond %{HTTP_HOST}   !^$
RewriteCond %{SERVER_PORT} !^80$
RewriteRule ^/(.*)         http://www.example.com:%{SERVER_PORT}/$1 [L,R]

# And for a site running on port 80
RewriteCond %{HTTP_HOST}   !^www\.example\.com [NC]
RewriteCond %{HTTP_HOST}   !^$
RewriteRule ^/(.*)         http://www.example.com/$1 [L,R]

Solution 5b: Rewrite Rules with Sub-domains & Cookies

In the modern world, while I can load an iFrame with content from another site, as soon as you run into active content, especially scripts, the domains can become a problem. If I want to run this script:

www.example.com/cgi-bin/script.js

using my cname example, this would be forwarded as a request to:

www.yahoo.com/cgi-bin/script.js. Certificates aside, a modern browser will not let this happen. Its behavior will be somewhere from simply nothing happening (my default behavior for Firefox at the moment), to getting a warning/error about cross domain scripts or security. In addition if I wrote a re-write rule as in example 5) above, based on HTTP cookies

Real World Examples

To be discussed live....

HTTP Referer

Unified CSS and templating for white label applications via CNAME and rewrite rules

Redirect to Mobile site based on User Agent using Apache mod_rewrite

Have you noticed that you typically get redirected to an m-dot "m.site.com" site (not ironman site) often on your mobile device? Apache rewrite rules allow access to HTTP headers:

RewriteCond  %{HTTP_USER_AGENT}  (iPhone|Blackberry|Android)
RewriteRule  ^/$                 /homepage.mobile.html  [L]
RewriteRule  ^/$                 /homepage.std.html  [L]
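The decision those three lines make for a request to "/" can be sketched in JavaScript like this. The function name homepageFor is hypothetical, and the user agent strings you pass in are your own test inputs; like the RewriteCond above (which has no [NC] flag), the match is case-sensitive.

```javascript
// Pick the homepage file the way the rules above would for a request to "/".
function homepageFor(userAgent) {
  const mobile = /(iPhone|Blackberry|Android)/;   // same pattern as the RewriteCond
  return mobile.test(userAgent) ? "/homepage.mobile.html" : "/homepage.std.html";
}
```

The condition only guards the first rule, so anything that does not look like a mobile user agent falls through to the standard homepage.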

3rd Party Integration example

The complete package. Example will be given in my presentation
