Meta tags:
Headings (most frequently used words):
microformats, for, entry, this, tags, and, data, vs, the, org, are, microformat, to, parser, html, browse, considerations, content, year, 14, welcome, new, admins, happy, 13th, improving, php, mf2, all, metadata, entries, by, month, in, blog, archive, what, specifications, upcoming, events, post, categories, choose, news, format, recommended, raw, vocabularies, parsed, canonical, json, derived, storing, navigating, parsing, structures, possible, property, values, know, your, algorithms, sanitise, still, how, during, confirms, google, consume, fetching, steps, validate, next, world, real, with, test, truncate, latest,
Text of the page (most frequently used words):
the (154), and (91), you (66), for (57), microformats (55), #data (40), html (37), #your (31), url (31), #value (30), #with (30), #that (29), #this (28), are (26), #property (26), use (25), #more (21), which (21), properties (20), one (18), content (16), #parser (16), embedded (15), there (15), can (15), mf2 (14), have (14), struct (14), photo (13), php (13), first_val (12), cases (12), when (12), structs (12), card (12), type (11), return (11), alt (11), also (11), not (11), may (11), all (11), parsing (11), some (11), entry (10), using (10), values (10), might (10), get (10), key (10), post (10), will (10), how (9), different (9), other (9), but (9), plaintext (9), from (9), code (9), consuming (9), see (8), need (8), plaintext_val (8), microformats2 (8), time (8), indieweb (8), mf_struct (8), could (8), where (8), property_name (8), first (8), own (8), google (7), used (7), consume (7), make (7), real (7), way (7), test (7), like (7), has (7), world (7), than (7), they (7), microformat (7), application (7), them (7), nested (7), json (7), potentially (7), what (7), either (7), look (7), under (7), page (7), common (7), take (7), example (7), want (6), canonical (6), something (6), plain (6), should (6), parsers (6), else (6), events (6), level (6), expect (6), possible (6), comments (6), any (6), text (6), name (6), raw (6), out (6), case (6), off (6), none (6), new (6), various (6), examples (6), org (6), well (6), most (6), top (6), support (6), multiple (5), started (5), into (5), structures (5), supported (5), such (5), start (5), about (5), each (5), many (5), same (5), original (5), valid (5), string (5), check (5), avoid (5), algorithm (5), site (5), got (5), parsed (5), then (5), contain (5), containing (5), its (5), algorithms (5), year (5), relevant (4), often (4), good (4), ways (4), display (4), search (4), handle (4), against (4), aware (4), posts (4), happy (4), attacks (4), following (4), through (4), expecting (4), metadata (4), format (4), john 
(4), keys (4), consider (4), originally (4), storing (4), derived (4), welcome (4), items (4), tantek (4), specific (4), particular (4), personal (4), language (4), useful (4), was (4), people (4), several (4), event (4), built (4), tools (4), found (4), effective (4), whether (4), feed (4), com (4), write (4), now (4), find (4), improvements (4), languages (4), treat (4), looking (4), adding (4), both (4), tags (4), wiki (4), structure (4), two (4), img (4), list (4), author (4), information (4), def (3), vocabularies (3), june (3), except (3), know (3), resolution (3), https (3), plans (3), don (3), software (3), implied (3), response (3), their (3), testing (3), within (3), fetching (3), library (3), try (3), image (3), images (3), keyerror (3), improving (3), without (3), fetch (3), web (3), tree (3), libraries (3), writing (3), sites (3), three (3), isinstance (3), dict (3), provide (3), markup (3), blog (3), separate (3), get_first_plaintext (3), necessary (3), article (3), publishing (3), indexerror (3), still (3), authorship (3), relative (3), representative (3), functions (3), assume (3), every (3), posted (3), parse (3), recommend (3), having (3), urls (3), able (3), never (3), marked (3), admins (3), walters (3), nest (3), been (3), xss (3), needs (3), finally (3), suite (3), dealing (3), barnaby (3), escape (3), function (3), base (3), present (3), plugins (3), store (3), single (3), lot (3), community (3), usually (3), sense (3), years (3), 2020 (3), just (3), rather (2), pages (2), fallback (2), turn (2), version (2), hostname (2), source (2), usage (2), getting (2), achieve (2), doesn (2), end (2), context (2), capable (2), contains (2), dangerous (2), section (2), structured (2), photos (2), approaches (2), shows (2), handling (2), certain (2), flexible (2), variety (2), entire (2), exist (2), deal (2), future (2), additional (2), geo (2), potential (2), strong (2), representation (2), links (2), featured (2), maps (2), appropriate (2), scenarios 
(2), validate (2), starting (2), person (2), although (2), worth (2), thought (2), webmention (2), sure (2), scope (2), based (2), injection (2), these (2), external (2), covering (2), truncate (2), above (2), mentioned (2), read (2), only (2), leaving (2), convenience (2), python (2), javascript (2), probably (2), profiles (2), always (2), profile (2), leave (2), best (2), johnmu (2), better (2), much (2), processwire (2), add (2), features (2), xray (2), improved (2), coming (2), available (2), least (2), january (2), parecki (2), changes (2), currently (2), ready (2), published (2), vocabulary (2), 2017 (2), 22nd (2), morrill (2), confirmed (2), gregor (2), microformats1 (2), aaron (2), archive (2), 2018 (2), building (2), address (2), redirect (2), during (2), browser (2), around (2), received (2), quite (2), few (2), children (2), 13th (2), experience (2), important (2), especially (2), bugs (2), network (2), websites (2), active (2), members (2), updates (2), returns (2), request (2), very (2), correct (2), social (2), rich (2), waterpigs (2), categories (2), engine (2), share (2), rel (2), simple (2), news (2), upcoming (2), recommended (2), wordpress (2), formats (2), applications (2), actively (2), existing (2), issues (2), next (2), graph, super, otherwise, plenty, considerations, agnostic, hosted, plug, ambitious, works, implementation, neither, basic, approach, anywhere, order, encoded, discuss, latest, basics, follow, times, final, belated, chain, reliably, reader, called, expects, resolved, grasp, online, applies, solid, going, secondly, help, developers, interesting, specification, line, provided, server, ruby, passing, thirdly, choice, get_img_alt, rust, options, depends, recent, fragments, pin13, net, download, experiment, sandbox, updating, live, falling, back, build, call, import, get_first_html, low, comparison, enabling, mark, choose, volume, differently, command, side, point, classic, depending, inferring, programming, empty, hardest, 
assuming, fetched, whatever, settings, those, navigating, already, traverse, perform, cleaning, sanitisation, assumptions, hard, working, work, mixture, previous, map, match, unescaped, entities, additionally, therefore, makes, archiving, focusing, copy, intermediate, brevity, encapsulating, mapping, places, filtering, create, searching, individual, guaranteed, primary, identifying, sort, speaking, generally, thing, represents, represent, whole, things, once, callback, wanting, detailing, lang, goals, homepage, matches, reading, right, finding, tasks, number, basis, easily, scripts, caching, firstly, non, 200, necessarily, prioritise, mean, nothing, simply, message, implications, let, explaining, publishers, deletion, http, equivalent, rely, regardless, before, 410, complexity, partially, gone, habit, str, making, yourself, custom, heuristic, further, background, cleans, jobs, jpg, update, taking, advantage, hundreds, broken, gradually, contained, fixing, occasionally, entirely, tweaked, access, unexpected, edge, over, processing, simpler, cleaned, sanitised, website, representations, likely, refer, somewhere, quick, suit, powered, bit, accordingly, march, 4th, continued, growth, across, extend, great, comment, result, due, abandon, noted, our, almost, jamietanna, ago, helping, 21st, among, growing, iterated, seeing, specs, doing, ven, essential, gardening, sven, knebel, martijn, van, der, announcement, uses, removing, webmaster, trends, analyst, being, announced, results, appeared, exchange, jamie, tanna, confirms, february, 19th, 2022, questions, mueller, twitter, longest, moment, parses, since, 2009, 7th, anniversary, announce, hand, yes, broadly, guess, said, zce7rtkmpa, knows, deprecate, spending, runs, come, second, xfn, organizations, specifications, designed, humans, machines, set, calendar, open, upon, widely, adopted, standards, learn, calendars, reviews, entries, xoxo, hosting, sponsored, www, linode, week, indiewebcamp, lists, ratings, outlines, tag, 
keywords, license, licenses, review, opinions, browse, month, popular, extension, improve, discovered, impressed, novelty, classes, contact, book, easier, later, involved, learned, became, block, hcard, past, catch, added, gregorlove, forward, stable, installed, richer, interactions, known, attributes, backwards, compatible, because, consistently, combination, experimental, feature, cool, join, pushing, variations, location, forms, ordinates, combined, addresses, adr, totally, twice, publish, liberal, accepts, longitude, latitude, presenting, showing, logo, safely, input, preventing, gives, measures, sanitising, truncating, sanitise, corresponding, app, supports, outside, sniff, presence, remove, ignore, instead, addition, involves, last, others, knowing, include, general, principles, applied, however, problems, trend, interpret, standardised, invalid, run, looks, even, usages, solve, formal, exact, consumption, definitions, techniques, implementations, deviate, slightly, meets, established, via, main, paginated, who, wrote, less, formally, untrusted, majority, consumer, permissive, method, marking, yet, perfectly, goal, strive, diverse, accept, decide, maximum, length, piece, ideally, place, sections, breaking, head, chat, room, hopefully, helped, gotchas, gave, towards, consumes, successfully, steps, forget, mind, wider, sources, determine, truncation, words, vast, absolutely, configure, interfere, css, shouldn, ever, purifier, ideas, respected, attack, highly, referring, owasp, resources, unicode, prevention, sanitizer, everything, apart, reducing, truncated, link, proxying, resource, local, copies, size, pass, mixed, missing, break, specifically, everywhere, actually, checking, blocks,
Text of the page (random words):
... mf2 data will be tweaked and improved as you add new features and handle unexpected edge cases. mf2 parsers gradually get improved, fixing bugs and occasionally adding entirely new features. Therefore, if it makes sense for your use case, I recommend archiving a copy of the original HTML as well as your derived data, leaving out the intermediate canonical JSON. That way, you can easily create scripts or background jobs to update all the derived data based on the original HTML, taking advantage of both parser improvements and improvements to your own code at the same time, without having to re-fetch potentially hundreds of potentially broken links.

As mentioned in the previous section, if you archive original HTML for re-parsing, you'll need to additionally store the effective URL for correct relative URL resolution.

For some languages, there are already libraries (such as XRay for PHP) which will perform common cleaning and sanitisation for you. If the assumptions with which these libraries are built suit your applications, you may be able to avoid a lot of the hard work of handling raw microformats 2 data structures. If not, read on.

Navigating Microformat Structures

A parsed page may contain a number of microformat data structures (mf structs) in various different places. Take a look at the parsed canonical microformats JSON for the article you're reading right now, for example.

"items" is a list of top-level mf structs, each of which may contain nested mf structs either under their "properties" or "children" keys.

Each individual mf struct is guaranteed to have at least two keys, "type" and "properties". "type" is the primary way of identifying what sort of thing that struct represents (e.g. a person, a post, an event). Structs can have more than one type if they represent multiple things at once without wanting to nest them. For example, a post detailing an event might be both a h-entry and a h-event at the same time. Structs can also have additional top-level keys such as "id" and "lang".

Generally speaking, type
information is most useful when dealing with top-level mf structs and mf structs nested under a "children" key. Nested mf structs found in properties will also have type information, but their usage is usually implied by the property name they're found under.

For many common use cases (e.g. a homepage feed and profile) there are several different ways people might nest mf structs to achieve the same goals, so it's important that your code is capable of searching the entire tree, rather than just looking at the top-level mf structs. Never assume that the microformat struct you're looking for will be in the top level of the "items" list. You need to search the whole tree.

I recommend writing some functions which can traverse a mf tree and return all structs which match a filtering callback. This can then be used as a basis for writing more specific convenience functions for common tasks, such as finding all microformats on a page of a particular type, or where a certain property matches a certain value. See my microformats2 PHP functions for some working examples.

Possible Property Values

Each key in a mf struct's "properties" dict maps to a list of values for that property. Every property may map to multiple values, and those values may be a mixture of any of the following:

- A plain string value, containing no HTML and leaving HTML entities unescaped, e.g.

    {"items": [{"type": ["h-card"], "properties": {"name": ["Barnaby Walters"]}}]}

  In future examples, I will leave out the encapsulating {"items": [{"type": [...]}]} for brevity, focusing on the "properties" key of a single mf struct.

- An embedded HTML struct, containing two keys: "html", which maps to an HTML representation of the property, and "value", mapping to a plain text version.

    "properties": {"content": [{"html": "<p>The content of a post, as <strong>raw HTML</strong> (or not).</p>", "value": "The content of a post, as raw HTML (or not)."}]}

- An img alt struct, containing the URL of a parsed image under "value", and its alt text under "alt".

    "properties": {"photo": [{"value": "https://example.com/profile-photo.jpg", "alt": "Example Person"}]}

- A nested microformat
data structure, with an additional "value" key containing a plaintext representation of the data contained within.

    "properties": {"author": [{"type": ["h-card"], "properties": {"name": ["Barnaby Walters"]}, "value": "Barnaby Walters"}]}

All properties may have more than one value. In cases where you expect a single property value (e.g. name), simply take the first one you find, and in cases where you expect multiple values, use all values you consider valid. There are also some cases where it may make sense to use multiple values, but to prioritise one based on some heuristic. For example, an h-card may have multiple url values, in which case the first one is usually the canonical URL, and further URLs refer to external profiles.

Let's look at the implications of each of the potential property value structures in turn.

Firstly, never assume that a property value will be a plaintext string. Microformats publishers can nest microformats, embedded content and img alt structures in a variety of different ways, and your consuming code should be as flexible as possible.

To partially make up for this complexity, you can always rely on the "value" key of nested structs to provide you with an equivalent plaintext value, regardless of what type of struct you've found.

When you start consuming microformats 2, write a function like this, and get into the habit of using it every time you want a single plaintext value from a property:

    def get_first_plaintext(mf_struct, property_name):
        try:
            first_val = mf_struct['properties'][property_name][0]
            if isinstance(first_val, str):
                return first_val
            else:
                # Nested mf, embedded HTML and img alt structs all carry a plaintext "value".
                return first_val['value']
        except (IndexError, KeyError):
            # Property missing, empty, or lacking a "value" key.
            return None

Secondly, never assume that a particular property will contain an embedded HTML struct. This usually applies to content, but is relevant anywhere your application expects embedded HTML. If you want to reliably get a value encoded as raw HTML, then you need to check whether the first property value is an embedded HTML struct, i.e. has an "html" key. If so, take the value of the "html" key. Otherwise, get the first plaintext
property value using the approach above, and HTML-escape it. If neither is found, the property has no value.

In Python 3.5+, that could look something like this:

    from html import escape

    def get_first_html(mf_struct, property_name):
        try:
            first_val = mf_struct['properties'][property_name][0]
            if isinstance(first_val, dict) and 'html' in first_val:
                return first_val['html']
            else:
                # Fall back to the plaintext value, escaped so it is safe to treat as HTML.
                plaintext_val = get_first_plaintext(mf_struct, property_name)
                if plaintext_val is not None:
                    plaintext_val = escape(plaintext_val)
                return plaintext_val
        except (IndexError, KeyError):
            return None

In some cases, it may make sense for your application to be aware of whether a value was parsed as embedded HTML or a plain text string, and to store or treat them differently. In all other cases, always use a function like this when you're expecting embedded HTML data.

Thirdly, when expecting an image URL, check for an img alt structure, falling back to the plain text value and either assuming an empty alt text or inferring an appropriate one, depending on your specific use case. Something like this could be a good starting point:

    def get_img_alt(mf_struct, property_name):
        try:
            first_val = mf_struct['properties'][property_name][0]
            if isinstance(first_val, dict) and 'alt' in first_val:
                return first_val
            else:
                # No img alt struct: build one from the plaintext value with an empty alt text.
                plaintext_val = get_first_plaintext(mf_struct, property_name)
                if plaintext_val is not None:
                    return {'value': plaintext_val, 'alt': ''}
                return None
        except (IndexError, KeyError):
            return None

Finally, in cases where you expect a nested microformat, you might end up getting something else. This is the hardest case to deal with, and the one which depends the most on the specific data and use case you're dealing with. For example, if you're expecting a nested h-card under an author property but get something else, you could use any of the following approaches:

- If you got a plain string which doesn't look like a URL, treat it as the name property of an implied h-card structure with no other properties, and, if you need a URL, you could potentially take the hostname of the effective URL, if it works in
context as a useful fallback value.
- If you got an img alt struct, you could treat the value as the photo property, the alt as the name property, and potentially even take the hostname of the photo URL to be the implied fallback url property (although that's pushing it a bit, and in most cases it's probably better to just leave out the URL).
- If you got an embedded HTML struct, take its plaintext value and use one of the first two approaches.
- If you got a plain string, check to see if it looks like a URL. If so, fetch that URL and look for a representative h-card to use as the author value.
- If you get an embedded mf struct with a url property but no photo, you could fetch the URL, look for a representative h-card (more on that in the next section) and see if it has a photo property.
- Treat the author property as invalid and run the h-entry (or entire page, if relevant) through the authorship algorithm.

The first three are general principles which can be applied to many scenarios where you expect an embedded mf struct but find something else. The last three, however, are examples of a common trend in consuming microformats 2 data: for many common use cases, there are well-thought-through algorithms you can use to interpret data in a standardised way.

Know Your Algorithms and Vocabularies

The authorship algorithm mentioned above is one of several more or less formally established algorithms used to solve common problems in indieweb usages of microformats 2. Some others which are worth knowing about include:

- Who wrote this post? Authorship algorithm
- There's more than one h-card on this page, which one should I use? Representative h-card
- I want to get a paginated feed of posts from this page. How to consume h-feed
- How do I find and display the main post on this page? How to consume h-entry
- I received a response to one of my posts via webmention, how do I display it? How to display comments

Library implementations of these algorithms exist for some languages, although they often deviate slightly from the exact
text. See if you can find one which meets your needs, and if not, write your own and share it with the community.

In addition to the formal consumption algorithms, it's worth looking through the definitions of the microformats vocabularies you're using (as well as testing with real-world data) and adding support for properties or publishing techniques you might not have thought of the first time around. Some examples to get you started:

- If an h-card has no valid photo, see if there's a valid logo you can use instead.
- When presenting a h-entry with a featured photo, check both the photo property and the featured property, as one or the other might be used in different scenarios.
- When dealing with address or location data (e.g. on an h-card, h-entry or h-event), be aware that either might be present in various different forms. Co-ordinates might be separate latitude and longitude properties, a combined plaintext geo property, or an embedded h-geo. Addresses might be separate top-level properties or an embedded h-adr. There are many variations which are totally valid to publish, and your consuming code should be as liberal as possible in what it accepts.
- If a h-entry contains images which are marked up with u-photo within the e-content, they'll be present both in the content "html" key and also under the photo property. If your app shows the embedded content HTML rather than using the plaintext version, and also supports photo properties (which may also be present outside the content), you may have to sniff the presence of photos within the content, and either remove them from it or ignore the corresponding photo properties to avoid showing photos twice.

Sanitise, Validate, and Truncate

In the vast majority of cases, consuming microformats 2 data involves handling, storing and potentially re-publishing untrusted and potentially dangerous input data. Preventing XSS and other attacks is out of the scope of the microformats parsing algorithm, so the data your parser gives you is just as dangerous as the original
source. You need to take your own measures for sanitising and truncating it so you can store and display it safely.

Covering every possible injection and XSS attack is out of the scope of this article, so I highly recommend referring to the OWASP resources on XSS Prevention, Unicode Attacks and Injection Attacks for more information.

Other than that, the following ideas are a good start:

- Use plaintext values where possible, only using embedded HTML when absolutely necessary.
- Pass everything (HTML or not) through a well-respected HTML sanitizer such as PHP's HTML Purifier. Configure it to make sure that embedded HTML can't interfere with your own markup or CSS. It probably shouldn't contain any JavaScript, ever, either.
- In any case where you're expecting a value with a specific format, validate it as appropriate.
- More specifically, everywhere that you expect a URL, check that what you got was actually a URL. If you're using the URL as an image, consider fetching it and checking its content type.
- Consider either proxying resources such as images, or storing local copies of them (reducing size and resolution as necessary), to avoid mixed content issues, potential attacks, and missing images if the links break in the future.
- Decide on relevant maximum length values for each separate piece of external content, and truncate them as necessary. Ideally, use a language-aware truncation algorithm to avoid breaking words apart. When the content of a post is truncated, consider adding a "Read More" link for convenience.

Test with Real-World Data

The web is a diverse place, and microformats are a flexible, permissive method of marking up structured data. There are often several different yet perfectly valid ways to achieve the same goal, and as a good consumer of mf2 data, your application should strive to accept as many of them as possible.

The best way to test this is with real-world data. If your application is built with a particular source of data in mind, then start off with testing it against that. If you want to be
able to handle a wider variety of sources, the best way is to determine what vocabularies and publishing use cases your application consumes, and look at the Examples sections of the relevant indieweb.org wiki pages for real-world sites to test your code against.

Don't forget to test your code against examples you've published on your own personal site.

Next Steps

Hopefully this article helped you avoid a lot of common gotchas, and gave you a good head start towards successfully consuming real-world microformats 2 data.

If you have questions or issues, or want to share something cool you've built, come and join us in the indieweb chat room.

February 19th, 2022, waterpigs.co.uk. Comments off on "How to consume Microformats 2 data".

Google confirms Microformats are still a recommended metadata format for content

This post originally appeared on Jamie Tanna's site.

Google announced that they are removing support for the data-vocabulary metadata markup that could be used to provide rich search results on its search engine. i...
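The article text above recommends writing functions which traverse a mf tree and return all structs matching a filtering callback, but only points to PHP functions for examples. The following is a minimal Python sketch of that idea, not the author's implementation. It assumes canonical mf2 JSON as described above (a top-level "items" list, with nested structs under "properties" or "children"); the names is_mf_struct, find_structs and find_by_type, and the sample data, are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, List

MfStruct = Dict[str, Any]


def is_mf_struct(value: Any) -> bool:
    """Return True if value looks like an mf struct (has both required keys)."""
    return isinstance(value, dict) and "type" in value and "properties" in value


def find_structs(parsed: Dict[str, Any],
                 matches: Callable[[MfStruct], bool]) -> List[MfStruct]:
    """Walk the whole tree and collect every mf struct accepted by `matches`."""
    found: List[MfStruct] = []

    def visit(struct: MfStruct) -> None:
        if matches(struct):
            found.append(struct)
        # Nested structs may appear among any property's values...
        for values in struct.get("properties", {}).values():
            for value in values:
                if is_mf_struct(value):
                    visit(value)
        # ...or under the "children" key.
        for child in struct.get("children", []):
            if is_mf_struct(child):
                visit(child)

    for item in parsed.get("items", []):
        if is_mf_struct(item):
            visit(item)
    return found


def find_by_type(parsed: Dict[str, Any], mf_type: str) -> List[MfStruct]:
    """Convenience wrapper: all structs with the given type, at any depth."""
    return find_structs(parsed, lambda struct: mf_type in struct["type"])


# Hypothetical sample data: an h-feed with a nested author and one child entry.
parsed = {
    "items": [{
        "type": ["h-feed"],
        "properties": {
            "author": [{
                "type": ["h-card"],
                "properties": {"name": ["Barnaby Walters"]},
                "value": "Barnaby Walters",
            }],
        },
        "children": [{
            "type": ["h-entry"],
            "properties": {"name": ["A post"]},
        }],
    }]
}

cards = find_by_type(parsed, "h-card")
entries = find_by_type(parsed, "h-entry")
```

Here neither the h-card nor the h-entry sits at the top level of "items", which is the situation the article warns about. Embedded HTML and img alt structs are skipped because they lack the "type" and "properties" keys.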