So a week ago, I investigated publisher-produced Version of Record PDFs with pdfinfo, and the results were very disappointing. A lot of metadata was missing, and most of these PDFs could not be reliably identified from metadata alone, let alone mined for particular fields of interest. But Rod Page kindly alerted me to the fact that I might be using the wrong tool for this investigation.
This is my 3-minute talk for Beyond The PDF 2 (Amsterdam), #btpdf2, pointing out that scholarly publishers don't tend to embed very rich metadata in their products, e.g. PDFs. Sure, arguably this data is inside the PDF, but I don't want to look inside the PDF. I want my machines to be able to grep what the PDF is in a nanosecond from structured, standardised metadata. I believe the standards exist: there's XMP, and it's been around for a long time. It's just not being implemented much, or fully.
[Update: I’ve submitted this idea as a FORCE11 £1K Challenge research proposal 2015-01-13. I may be unemployed from April 2015 onwards (unsolicited job offers welcome!), so I certainly might find myself with plenty of time on my hands to properly get this done…!] Inspired by something I heard Stephen Curry say recently, and with a little bit of help from Jo McIntyre, I’ve started a project to compare EuropePMC author manuscripts with their publisher-made (mangled?) ‘version of record’ twins.
Muck Rack makes it simple to find people, tweets, or articles that mention any name, keyword, company, hashtag etc. We've compiled this guide to help you make the most of your search.
Selecting a term
Start searching tweets, articles from media outlets, articles mentioned in tweets, journalists'
names, titles and bios with some suggested searches:
Companies or Topics (e.g. iPhone, Microsoft)
Phrases (e.g. "cloud computing") — use quotes to keep the terms together
Twitter handles (e.g. @username) — returns those who have mentioned or replied to
Names (e.g. "David Pogue")
Hashtags (e.g. #sxsw, #london2012)
Bio details (e.g. vegan, Olympics, father)
Muck Rack's Advanced Search allows for many boolean operators.
To find results that mention multiple specified terms, use AND or
+. For example, ensure each result contains both Elon Musk and Mark Zuckerberg by
searching Musk AND Zuckerberg or Musk + Zuckerberg.
Use the operators OR or , (a comma) to broaden your search when you'd like either of
multiple terms to appear in results. (This is the default behavior of our search when no operators
are used.) For example, results will contain either cake or cookie by searching cake OR cookie or cake,cookie.
Use NOT or - to subtract results from your search. For
example, searching Disney will yield results about the Walt Disney Company as well as Walt Disney
World Resort. To exclude mentions of Disney World, search for Disney -World or Disney
When using one of these operators with a phrase, enclose it in quotation marks. For example, you can
find results about smartphones excluding Apple's iPhone 4S by searching smartphone -"iPhone
Exact case matching or punctuation
If you're searching for a brand name or keyword that relies on specific punctuation marks or capitalization, you can
find results that match your exact query by adding matchcase: before the keyword you're searching for, like matchcase:E*TRADE .
Use parentheses to separate multiple
boolean phrases. For example, to find journalists talking about having fun in Disney World or
Disneyland, search for ("disney world" OR disneyland) AND fun.
An asterisk can be used to search for any variation of a root word truncated by the asterisk. For example, searching for admin* will return results for administrator, administration, administer, administered, etc.
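To make the truncation behavior concrete, here is a minimal Python sketch of prefix matching. It is only an illustration of the idea, not Muck Rack's implementation: the function name wildcard_matches and the simple word-splitting rule are assumptions made for this example.

```python
import re


def wildcard_matches(query: str, text: str) -> list[str]:
    """Return the words in `text` that begin with the root before the trailing '*'.

    Hypothetical helper for illustration; real search engines tokenize
    and normalize text in more sophisticated ways.
    """
    if not query.endswith("*"):
        raise ValueError("expected a trailing asterisk, e.g. 'admin*'")
    root = query[:-1].lower()
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word.startswith(root)]


if __name__ == "__main__":
    sample = "The administrator will administer the admin panel, not the minister."
    print(wildcard_matches("admin*", sample))
```

In this sketch the root itself (admin) also counts as a match, and a word such as minister is skipped because it does not start with the root.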
A near operator is an AND operator where you can control the distance between the words. You can vary the distance the near operation uses by adding a forward slash and number (between 0-99) such as strawberries NEAR/10 "whipped cream", which means the strawberries must exist within 10 words of "whipped cream".
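The proximity check can be sketched in a few lines of Python. This is a rough illustration under stated assumptions rather than a description of how Muck Rack evaluates NEAR: it assumes the distance is the number of words sitting between the two terms, that the terms may appear in either order, and that matching is case-insensitive. The function names tokenize, phrase_positions, and near are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: words are runs of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def phrase_positions(tokens: list[str], phrase: str) -> list[int]:
    """Return every start index where `phrase` occurs in `tokens`."""
    words = tokenize(phrase)
    if not words:
        return []
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def near(text: str, left: str, right: str, distance: int) -> bool:
    """True if `left` and `right` both occur with at most `distance`
    words between them, in either order (assumed semantics)."""
    tokens = tokenize(text)
    left_len = len(tokenize(left))
    right_len = len(tokenize(right))
    for i in phrase_positions(tokens, left):
        for j in phrase_positions(tokens, right):
            if i < j:
                gap = j - (i + left_len)   # words between left and right
            else:
                gap = i - (j + right_len)  # words between right and left
            if 0 <= gap <= distance:
                return True
    return False


if __name__ == "__main__":
    sentence = "Fresh strawberries served with a bowl of whipped cream"
    print(near(sentence, "strawberries", "whipped cream", 10))
    print(near(sentence, "strawberries", "whipped cream", 3))
```

Because both terms must be present for the function to return True, the sketch behaves like an AND with an extra distance constraint, which matches the description above.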