Identifying categories of projects, axes on the map, and approaches and trends in OSS.
by John W Maxwell, Erik Hanson, Leena Desai, Carmen Tiampo, Kim O'Donnell, Avvai Ketheeswaran, Melody Sun, Emma Walter, and Ellen Michelle
Published on Aug 02, 2019
Mapping the Landscape
Some axes of analysis:
Across the 52 projects that we have catalogued here, there is great variation across a number of different axes. The following subsections provide some possibilities for subdividing the landscape along some of the more obvious lines.
Journal publishing & book publishing
Some projects we catalogued are straightforwardly oriented to journal publishing. Some (especially given the Mellon Foundation’s recent funding moves12) are oriented to monographs and books. But a substantial number occupy a space in between—agnostic with regard to journals or books, and sometimes reaching for new forms altogether.
Centralized vs distributed models
Several projects we catalogued are designed around a central hosting model where there is considerable value in how the project host supports the software centrally; a prime example is Fulcrum, developed and hosted by the University of Michigan Library and Press. Other projects, like OJS, are designed around a distributed model where anyone can download and deploy the software. An increasing number of projects seem to anticipate a hybrid position in which any number of third-party hosting/integration partners will take care of deployment (and effectively become partners in the operational life of the software). None of these projects, by virtue of their open-source licenses, are strictly constrained to one or other deployment model; our observations here are about how the development is unfolding currently.
Old projects and new projects
As might be expected, we were able to catalogue a wealth of new projects, and a smaller number of older, more established projects. OJS is the longest-running project we catalogued, established in 2002. Some other notable projects are the bibliography manager Zotero (est. 2006), the French journal platform Lodel (est. 2006), the conversion tool Pandoc (est. 2007), the authoring tool Omeka (est. 2008), the annotation platform Hypothes.is (est. 2011), and the Math typesetting system MathJax (est. 2011). By contrast, fully half of our catalogue has emerged since 2015, with more than a dozen of these projects having their first release since 2018.
This, again, makes comparison difficult. Brand new, bursting-with-promise projects simply aren’t directly comparable to those that have weathered time, competition, and the ongoing demands of users. Conversely, longevity tells a story of fitness, but it is difficult to generalize from. A ‘graveyard’ of old and abandoned projects does not really exist, as GitHub only emerged as a common platform for software development projects around 2012–2013. Projects older than that, even if the source code is still available, are not easily findable.
Very few of the projects we catalogued even do the same things. Some are attempts to create end-to-end functionality for an entire publishing process; an example is the Libero suite from eLife. Others offer very specific functionality, but may be usable in concert with other components; the best example here is Hypothes.is, which does one thing—annotation—very well and can be integrated in a variety of contexts.
To help visualize the functional scope of various development agendas, we propose a hypothetical publishing workflow that covers a number of stages in order to show how various projects address different functional areas. But we must emphasize one serious caveat: even though different projects may address the same workflow stages in this diagram, they most likely do so differently, with different boundaries and different goals. Our focus here is on software development priorities, rather than “features” per se. We thus offer the following diagram for illustrative, but not comparative, purposes:
The projects we catalogued also differ in development features, languages and frameworks, and licenses. Some are well supported by external funding, some struggle to maintain financial support, some (including some important projects) are effectively unfunded. We offer the following summary data, again for illustrative purposes:
License: Seventeen projects are released under the MIT License, seven under the GPLv3, seven under the GPLv2, and seven under a BSD license. The remainder use AGPL, Apache, or ECL licenses. Compared with the figures in a 2018 report by Ayala Goldstein, the proportions here are close to those for GitHub as a whole.3
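As a quick check on these proportions, the counts can be tallied in a few lines. This is only a sketch: it assumes each of the 52 catalogued projects carries exactly one license, so that the AGPL, Apache, and ECL group is simply the remainder. The variable names are illustrative.

```python
# License counts as reported in the summary above.
TOTAL_PROJECTS = 52  # catalogue size stated earlier in this section
named = {"MIT": 17, "GPLv3": 7, "GPLv2": 7, "BSD": 7}

# Assumption: one license per project, so the rest is the remainder.
other = TOTAL_PROJECTS - sum(named.values())
counts = {**named, "AGPL/Apache/ECL": other}

# Share of the catalogue for each license group, as a percentage.
shares = {name: round(100 * n / TOTAL_PROJECTS, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name:16} {counts[name]:2d} projects  {pct:5.1f}%")
```

Under that assumption, MIT accounts for roughly a third of the catalogue and the remainder group for about a quarter.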
Funding: About a dozen of the projects we catalogued claim multiple funding agencies; this unsurprisingly tends to correlate with the age of the project. Another dozen projects appear to have no funding at all—apart from the developers’ time on the project. At least fourteen of the projects have received funding from the Andrew W Mellon Foundation.
Traditional functions, new capacities
It would be easier to examine publishing software if we all weren’t simultaneously in the midst of reinventing publishing itself. If publishing functions and the scope definition of journal or monograph publishing were stable over time, it would be more straightforward to judge software offerings against a functional standard. But, at the same time that we are re-building publishing infrastructure (in both open-source and proprietary contexts), we are broadly at work redefining publishing itself, as well as the forms and genres that define scholarly communications.
Journal publishing, while the most transformed by a three-decade shift to online distribution, at least sees some stability in its essential forms. A great deal of the innovation in journal publishing is concerned with the drive to scale and production efficiency, leaving the basic form of the article alone (there are of course exceptions, as in eLife’s Reproducible Document Stack and similar data-rich, interactive formats).4
Book publishing is another story, where a key source of innovation comes from the desire to produce and publish interactive scholarly works that have comparable size and significance to a traditional book, but share little with it production-wise. The latter shift has been identified and encouraged by the Mellon Foundation in recent years.5
At the same time, the affordances of web publishing have spawned a host of publication formats and platforms that are web native—neither journal nor book—that proceed less from a sense of traditional forms than a sense of what can be done, quickly and elegantly, online.
These trends complicate our landscape analysis. Some of the projects we catalogued seek very straightforwardly to model existing publishing practices while extending their efficiency or flexibility through digital media. OJS is perhaps the original case, aiming to pave the way to a fluid, open-access ecosystem. Its original design principles sought to embody existing best practices in journal publishing. OJS was not designed to be disruptive; rather its goal was to allow journal publishers to move their existing operations into an online, indexed environment.
An example of modeling existing publishing practices in book production is Editoria, developed by the Coko Foundation, the University of California Press, and a community of other interested academic publishers. Editoria is an editorial and production system for scholarly monographs designed to provide a web-based, collaborative platform with much more output flexibility than traditional proprietary tools offer. Editoria’s aspiration to be a drop-in replacement for existing tools makes it an ambitious development effort, but perhaps a necessary one if uptake in traditional university press operations is the goal.
While tools like OJS and Editoria serve established publication models, many of the tools we catalogued seek to break new ground and open up new possibilities in scholarly communication. MIT’s PubPub provides a full-featured platform for research teams to communicate with colleagues and the wider world; PubPub could be used to publish traditional scholarly works (journals, books), but it opens up a faster, more reader-centric modality that isn’t neatly contained by current publication norms. The University of Minnesota Press and CUNY’s Manifold Scholarship can hold scholarly monographs within it, but the point of the tool is to facilitate and capture the ongoing discourse around a book, rather than just the book’s content. Well-established tools like Omeka and Scalar exist to break new ground with the integration of multiple media and non-linear content organization.
Special-purpose components—from web-based word processors (Wax, Texture, and FidusWriter) to typographic toolkits (Hyphenopoly, KaTeX) and annotation and reference systems (Hypothes.is, Zotero)—often are agnostic to the publishing formats or genres they can serve, with the exception of assuming the Web as a common platform. It is worth noting that we also include contemporary examples of print production tools (Paged.js, Vivliostyle).
Technological approaches and trends
The software projects surveyed here represent a variety of approaches to contemporary problems, and as such provide a rich snapshot of contemporary thinking about publishing and software strategies. While the vast majority of these projects are web-based in one way or another, they vary greatly in their priorities and the bids they make to exist in a much larger ecosystem. The following are some significant trends we noted:
Approaches to XML
Most of the software we surveyed involves representation of text: for authoring and editing purposes and for display and publication. XML is central, in one way or another, to almost all of the projects. But what does that mean, exactly? Two dominant approaches to XML are evident: the first, employing the JATS XML schema for rich semantic markup and robust in-document metadata, seems to be a popular choice with projects focused on journal publishing workflows. The Texture editor from the Substance Consortium (including eLife and PKP) provides an excellent open-source, JATS-based authoring and editing platform which can then be incorporated into other tools. eLife’s Libero Producer is designed around Texture, building a JATS-native6 editing interface right into the core of eLife’s platform. OJS, which for most of its history has eschewed dealing with the text directly (opting to move .doc and .pdf files through its review workflow), now allows Texture integration as an option, and PKP seems enthusiastic about Texture’s development and future. Janeway, designed for the Open Library of the Humanities platform, is also based on JATS and seems poised to adopt Texture as well. This is a potentially important moment for JATS XML. While a ‘standard,’ JATS has not enjoyed actual standardized practice, because JATS-based workflows are typically buried in proprietary toolchains owned by corporate publishers. The emergence of an open, common editing tool for JATS is a welcome development for XML-based publishing ambitions.
The second major current of XML development is the use of web-native HTML as the basis for content and workflow. Owing to the ubiquity of this format and the wealth of readily available tools and standard ways of working, many of the projects we surveyed have opted for an HTML-first approach. This is true of journal-friendly projects like PubPub and Vega, but is especially the case with the more book-oriented projects such as Fulcrum, Manifold, Editoria, Pressbooks, and Scalar. In an HTML-based workflow, rendering in the browser comes more or less for free, and the associated EPUB standard (which includes HTML as its core text representation) provides a handy distribution or import/export format. More interestingly, authoring and editing tools for HTML are by now in their third or fourth generation, and sophisticated software is not hard to come by. An emerging open-source toolkit, ProseMirror, has already seen significant uptake on the web (major news sites like the New York Times and The Guardian have reportedly built editorial tools around ProseMirror) owing to features like collaborative editing. ProseMirror is found in PubPub, Coko’s Wax editor (part of Editoria), and the science-oriented FidusWriter. There seems to be increasing interest in ProseMirror as an adaptable foundation for building specialized HTML editing environments.7
A third alternative, which puts markdown before markup, is seen in some production systems such as ElectricBook and Getty’s Quire. The markdown approach relies on a simplest-possible authoring environment (in a text editor) and up-converting to HTML or other XML formats. Markdown is also a straightforward import format for tools like Manifold, PubPub, and Pandoc. ProseMirror seems able to work as easily with markdown as with HTML, so the apparent distinctiveness of a markdown-based workflow may fade over time.
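To make the idea of up-converting concrete, here is a deliberately tiny sketch that handles only headings, paragraphs, and bold or italic spans. It is not how ElectricBook, Quire, or Pandoc work internally; production systems rely on full Markdown parsers, and the function names here are illustrative.

```python
import html
import re

def inline(text: str) -> str:
    """Escape HTML special characters, then convert **bold** and *italic* spans."""
    text = html.escape(text, quote=False)
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    return text

def markdown_to_html(source: str) -> str:
    """Up-convert a tiny Markdown subset: ATX headings and plain paragraphs."""
    # Blocks are separated by one or more blank lines.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", source) if b.strip()]
    out = []
    for block in blocks:
        heading = re.match(r"(#{1,6})\s+(.*)", block)
        if heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{inline(heading.group(2))}</h{level}>")
        else:
            # Join soft-wrapped lines into a single paragraph.
            out.append(f"<p>{inline(' '.join(block.split()))}</p>")
    return "\n".join(out)

sample = "# Mapping the Landscape\n\nSome projects are *very* new.\n\nOthers are **established**."
print(markdown_to_html(sample))
```

The author writes only the plain-text source; all structural markup is generated downstream, which is what keeps the authoring environment as simple as possible.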
LaTeX deserves a mention here. One of the original OSS publishing tools (LaTeX, and TeX especially, predate the term “open-source” by many years), LaTeX is still alive and well in scientific publishing. Its support for equations and formulae remains hard to beat, despite efforts to move LaTeX’s features into more modern environments. In our survey, LaTeX appears in only a few cases. We examined here one contemporary platform, Tectonic, which seems to be an easily adoptable typesetting tool. We considered including Overleaf, the leading commercial LaTeX-based production system, as their codebase is open-source and accessible on GitHub, but we ultimately decided to remove it as it seems to have no substantial interest beyond Overleaf’s own application. LaTeX also appears in a few web-typography tools aimed at math typesetting: KaTeX from the Khan Academy, and MathJax, both of which aim to provide a browser-native math typesetting system that does what LaTeX does, and indeed can speak LaTeX.
Conversion and ingestion strategies
Despite the maturing contexts of XML in publishing, it appears to be a largely unchallenged fact that “authors will write in Word.” Word processor documents, despite the advent of XML file formats over the past decade, are just not structured documents, because the scope of possibilities that an author can express in a tool like Word is not constrained by any schema. Further, the vast legacy of online publishing has been the proliferation of PDF files—again, not a structured content format. So any publishing system that attempts to leverage structured content while allowing content to come from unstructured sources must have a strategy for ingesting these source documents and making sense of them.
This problem is as old as XML (indeed, as old as SGML), and toolchains to solve it are as numerous as the grasses; people continue to build them today. The emergence of XML-based word-processor file formats has at least made parsing a bit more straightforward, allowing XSLT to be used to take the original document apart. In our landscape survey, we have catalogued at least half a dozen projects dedicated to import and conversion, and at least as many larger projects have ingest tools built into them.
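A rough illustration of what "taking the document apart" looks like for a .docx file, which is a ZIP archive whose main text lives in `word/document.xml`. The sketch uses Python's standard library rather than XSLT, and the sample archive built here is a hand-written, heavily simplified stand-in (a real .docx carries additional parts such as a content-types file). The function names are illustrative.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Simplified stand-in for the main document part of a .docx file.
DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Title"/></w:pPr><w:r><w:t>Mapping the Landscape</w:t></w:r></w:p>
    <w:p><w:r><w:t>Some projects are </w:t></w:r><w:r><w:t>very new.</w:t></w:r></w:p>
  </w:body>
</w:document>"""

def build_sample_docx() -> bytes:
    """Pack the sample XML into an in-memory ZIP, mimicking a .docx container."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", DOCUMENT_XML)
    return buffer.getvalue()

def read_paragraphs(docx_bytes: bytes):
    """Return (style name or None, text) for each paragraph in the document."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iterfind(".//w:p", NS):
        style = p.find("w:pPr/w:pStyle", NS)
        style_name = style.get(f"{{{W}}}val") if style is not None else None
        # Text is split across runs (w:r); concatenate every w:t in order.
        text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
        paragraphs.append((style_name, text))
    return paragraphs

for style_name, text in read_paragraphs(build_sample_docx()):
    print(style_name, "|", text)
```

Note what the parse does and does not recover: the first paragraph carries a named style, but the second is just text, and nothing in the file says what role it plays.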
The traditional way to convert legacy documents is to parse them, either via XSLT or some other way of reading the native file format, and then attempt to make reasonable guesses about what the formatting means: the big, boldfaced line at the beginning of an article is likely the title, for instance. If the original document was formatted using named paragraph or character styles, so much the better. Some of these parsing tools are mature and can handle a good many variations. Pandoc, for instance, is a robust conversion utility that has been in development for over a decade, with support for dozens of input and export formats. It is usable as a tool on its own, but it is also incorporated as a library or a component in several of the tools in our survey.
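The guessing step can be sketched as a small set of rules. Everything here is hypothetical: the `Paragraph` record, the style-name mapping, and the thresholds are invented for illustration and are not taken from any of the tools surveyed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Paragraph:
    text: str
    style: Optional[str] = None   # named paragraph style, if the author used one
    font_size: float = 12.0       # in points
    bold: bool = False

# Hypothetical mapping from named styles to structural roles.
STYLE_ROLES = {"Title": "title", "Heading 1": "heading", "Heading 2": "heading"}

def guess_role(p: Paragraph, index: int, body_size: float = 12.0) -> str:
    """Guess a paragraph's structural role from its style or its formatting."""
    # Named styles are the most reliable signal, so check them first.
    if p.style in STYLE_ROLES:
        return STYLE_ROLES[p.style]
    # A big, boldfaced line at the very beginning is likely the title.
    if index == 0 and p.bold and p.font_size > body_size:
        return "title"
    # A short bold line elsewhere is probably a section heading.
    if p.bold and len(p.text) < 80:
        return "heading"
    return "paragraph"

doc = [
    Paragraph("Mapping the Landscape", font_size=20, bold=True),
    Paragraph("Old projects and new projects", bold=True),
    Paragraph("As might be expected, we were able to catalogue a wealth of new projects."),
    Paragraph("Conversion and ingestion strategies", style="Heading 1"),
]
roles = [guess_role(p, i) for i, p in enumerate(doc)]
print(roles)
```

Rules like these are brittle by nature, which is one reason mature converters accumulate so many special cases.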
A traditional strategy for managing conversion from legacy formats is to constrain the scope of possibilities. Building a conversion tool around documents that consistently look like journal articles is easier than building a general-purpose converter. PKP’s Open Typesetting Stack8 has been designed using this approach, as has OpenEdition’s Lodel. Open Typesetting Stack is composed of a series of tools that are designed to take apart journal articles: front matter, body text, bibliographic references, and so on.
A newer approach altogether is to forgo parsing the internals of a file and instead pay attention to the visual and presentational characteristics of a PDF. Grobid, a machine-learning tool trained on a corpus of many thousands of journal articles, exemplifies this strategy. The latest versions of PKP’s Open Typesetting Stack include Grobid in their arsenal. Machine-learning tools improve over time and over larger datasets, so it seems likely that this approach will become common, if not dominant, in large-scale conversion and ingest of journal articles. Grobid, like several other tools (including le-tex Transpect and Lodel), uses the Text Encoding Initiative’s (TEI) extremely rich and flexible descriptive XML tagset as an intermediate conversion target before normalizing to JATS XML for publication purposes.
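The normalization from a TEI intermediate to JATS can be pictured as a tree rewrite. The sketch below maps only a handful of elements and is not the mapping used by Grobid, Transpect, or Lodel; the `TAG_MAP` table and `tei_to_jats` function are illustrative names, and a real conversion handles far more of both tagsets.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Illustrative mapping only: TEI element name -> JATS element name.
TAG_MAP = {"body": "body", "div": "sec", "head": "title", "p": "p"}

def tei_to_jats(node: ET.Element) -> ET.Element:
    """Recursively copy a TEI tree, renaming the elements we know about."""
    local = node.tag.replace(TEI_NS, "")
    if local == "hi" and node.get("rend") == "italic":
        new_tag = "italic"
    else:
        # Unknown elements pass through under their local name.
        new_tag = TAG_MAP.get(local, local)
    out = ET.Element(new_tag)
    out.text, out.tail = node.text, node.tail
    for child in node:
        out.append(tei_to_jats(child))
    return out

tei = ET.fromstring(
    '<body xmlns="http://www.tei-c.org/ns/1.0">'
    '<div><head>Introduction</head>'
    '<p>Grobid emits <hi rend="italic">TEI</hi> first.</p></div>'
    '</body>'
)
jats = tei_to_jats(tei)
print(ET.tostring(jats, encoding="unicode"))
```

Using a rich intermediate format means the hard work of recognizing structure is done once, while the final normalization step stays comparatively mechanical.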
Workflow modeling and management
Scholarly publishing is typically characterized by formal editorial review processes, including blind peer review. Modeling and capturing these formal review stages in software is a hallmark of scholarly publishing applications. OJS first established a formal model for peer review workflow nearly twenty years ago, designed around a hierarchy of editorial authority, explicit hand-offs from stage to stage, and a series of automated email reminders keeping every member of the process on task. OJS’s fine-grained, formal peer review has clearly stood the test of time (the model was made more modular in OJS 3), but developers and aspirants have been re-thinking and re-building editorial and review workflows ever since. The most recent generation of publishing software carries on this tradition, and re-designing workflow management is a feature in most of the projects we examined.
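The general shape of such a model (ordered stages, explicit hand-offs, role-based authority, and automatic notifications) can be sketched as a small state machine. This is not OJS code: the stage names, roles, and the `Submission` class are assumptions made for illustration, and a log entry stands in for an email reminder.

```python
from dataclasses import dataclass, field

# Hypothetical stage names and roles, chosen only for illustration.
STAGES = ["submission", "review", "copyediting", "production", "published"]
HANDOFF_ROLE = {
    "submission": "editor",
    "review": "editor",
    "copyediting": "copyeditor",
    "production": "layout_editor",
}

@dataclass
class Submission:
    title: str
    stage: str = "submission"
    log: list = field(default_factory=list)

    def hand_off(self, role: str) -> None:
        """Advance to the next stage, if this role has authority to do so."""
        if self.stage == STAGES[-1]:
            raise ValueError("already published")
        if role != HANDOFF_ROLE[self.stage]:
            raise PermissionError(f"{role} cannot advance the {self.stage} stage")
        next_stage = STAGES[STAGES.index(self.stage) + 1]
        # Stand-in for the automated email sent at each hand-off.
        self.log.append(f"notify: '{self.title}' moved from {self.stage} to {next_stage}")
        self.stage = next_stage

article = Submission("Mapping the Landscape")
article.hand_off("editor")        # submission -> review
article.hand_off("editor")        # review -> copyediting
article.hand_off("copyeditor")    # copyediting -> production
print(article.stage)
print(len(article.log))
```

Even in this toy form, the rigidity is visible: every new editorial practice means changing the stage list or the authority table, which may be part of why workflow designs keep being rebuilt.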
Some approaches aim to make submission and review simpler. PubPub, for instance, aims to make collaborative reviews easy and intuitive. Vega takes a similar approach, establishing a new conceptual vocabulary around the review model. Manifold brings robust commenting and annotation to its review process, perhaps more in the spirit of ‘open review.’ Ubiquity Press, while relying on OJS as the core of their journal-publishing platform, have made customizations for article review and have built an entirely different system, Rua, for managing book editorial processes.
The Coko Foundation and its partners have taken a somewhat different approach by building a layered and modular framework for workflows. Coko’s PubSweet framework exposes a set of components for integration. Specific applications—like eLife’s Libero Reviewer or Hindawi’s Phenom—configure these to the specific business/editorial needs of their publishers. EuropePMC and Wormbase’s micropublications framework also manage submissions this way. On the book-publishing side, Editoria is also built on top of the PubSweet framework, as is the BookSprints platform. As such there are at least six different workflow applications based on the PubSweet workflow system, and Coko’s promise is that many more are possible.
Whatever the specifics of workflow management in various contexts, it would appear that many people still see this as a problem that needs a solution, or indeed more solutions. It may be the case that workflow modeling is something that resists being solved once and for all. In an interview, one of the PKP team quipped that once some of the newer projects have been around for as long as OJS has (and if they are to serve a diverse user base), their simple workflows will need to evolve to serve those diverse needs. The many attempts to address workflow models in the current catalogue seem to support this view.
Innovating new possibilities
Many of the projects in this survey also seek to push the envelope, to expand the possibilities of digital scholarly publishing. These range from infrastructural innovation to blue-sky revolutionary thinking—like dokie.li’s decentralized, distributed authoring/publishing project, which is part of a rethink of the entire World-Wide Web from a linked-data perspective. Most projects we surveyed are a little more conventional, but many break new ground in thinking about how scholarly communications actually happens.
The University of Michigan’s Fulcrum project, for instance, makes a significant structural change in how we think about infrastructure. Fulcrum does not take great strides with user interface, but by building a robust, media-friendly ebook platform on top of the Samvera repository, developing robust metadata linkages between books and media objects, and integrating a set of modular tools for displaying and embedding these, Fulcrum has potentially emerged as a major new platform for digital book distribution, one that several other publishers seem to find attractive. Fulcrum potentially changes the ecosystem for scholarly ebooks, making media-rich content workable and discoverable, at scale.
The University of Minnesota & CUNY Graduate Centre’s Manifold Scholarship also elegantly integrates a set of good ideas, while pushing out the post-production scope for book-length works. Manifold aspires to gather the discourse around a book—review, commentary, annotation, and even social media discourse—and collect it within the book itself. The result is that books expand over time as they gather their surrounding discourse. Manifold was initially designed as a monograph publication tool but has already found applications in open educational resources and in critical digital editions, owing to its reader-focused feature set.
MIT’s Knowledge Futures Group offers PubPub, a scholarly publishing tool that hosts journals, books, reports, and related content types, but seems poised to gain a devoted audience by making it incredibly easy for a research lab or team of like-minded scholars to collaboratively develop and publish media-rich content on an ongoing basis. It is early yet to tell if PubPub will evolve into a research-publishing platform or a turn-key publishing alternative. Vega, designed by Cheryl Ball after many years of publishing the Kairos journal, aims to bring multi-media authoring and collaboration into the centre of scholarly discourse. Vega has been long anticipated by those inspired by the promise of its model; it appeared in alpha release in early 2019.
Omeka has been in development for more than a decade already, but it, along with ANVC’s Scalar and Washington State University’s Mukurtu, pushes on the boundaries of what a book might be in a natively digital mode. Omeka, Scalar, and Mukurtu have all been focused on scholars and researchers first, as opposed to presses, but the wealth of content and projects published on these systems already (including the Ravenspace project from the University of British Columbia and University of Washington Presses, which draws in various ways on all three) means that these platforms are part of the discourse around the nature of the book in an online context. Stanford University Press’s embrace of Scalar-based projects is evidence that this platform is being taken seriously by traditional publishers.
An emerging genre of writing tools, exemplified by Jupyter Notebooks, RStudio’s Shiny, and the Stencila project (part of eLife’s Reproducible Document Stack initiative), integrates written documentation with live code and data in a publishable interactive environment. A researcher can write an article, incorporate a dataset, and feature live code snippets and data visualizations in the body of the article. Once the article is shared or published online, a reader can interact with the data or the code directly, effectively bringing into play a richer way of constructing and communicating a scholarly or scientific argument. Shared between two researchers, these tools are clever enough; all three projects are pushing towards much broader scale publication of interactive documents.
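To show how prose and live code sit side by side in one such document, the sketch below assembles a minimal Jupyter notebook as plain JSON. It assumes the version 4 notebook layout (a list of typed cells); the helper functions and cell contents are invented for illustration, and Shiny and Stencila use different formats.

```python
import json

def markdown_cell(text: str) -> dict:
    """A prose cell: the written argument of the article."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}

def code_cell(source: str) -> dict:
    """A code cell: executable by the reader; outputs are filled in when run."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }

notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        markdown_cell("# Licenses in the catalogue\nSeventeen of 52 projects use the MIT License."),
        code_cell("share = 17 / 52\nprint(f'{share:.1%}')"),
        markdown_cell("Readers can change the numbers above and re-run the cell."),
    ],
}

# The serialized form is what would be saved as an .ipynb file.
serialized = json.dumps(notebook, indent=1)
print([cell["cell_type"] for cell in notebook["cells"]])
```

The interleaving of cell types is the point: the argument and the computation that supports it travel together in one publishable file.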
Two well-established projects, the Hypothes.is annotation system and the Zotero reference management software, plus one newer one, the Rebus Foundation’s Ink platform for research-based reading, deserve mention here too. These are not publishing tools per se, but they serve critical parts of the publishing and scholarly ecosystem. Hypothes.is, while not the only approach to annotation represented here, has established a standard approach to web annotation that now appears to be essential. Zotero, which as a networked platform is much more than the personal reference manager most people use it for, is the primary open-source platform for large-scale bibliography handling. Both Hypothes.is and Zotero should, at this point in time, be judged in terms of their integration with other applications in the publishing and scholarship ecosystem; certainly no one should be developing in this space without considering the contributions already made by these tools. Which brings us to the Rebus Foundation’s Ink project: funded by a grant from the Mellon Foundation, Ink is an experiment in developing a better integrated environment for scholarly reading, reference and document management, and annotation. Ink is being developed with tools like Hypothes.is and Zotero already established; if it comes to fruition, it should shift the thinking around what happens to scholarly publications when they reach readers, an aspect that is currently somewhat under-developed.
Maxwell, John W., Alessandra Bordini, and Katie Shamash. “Reassembling Scholarly Communications: An Evaluation of the Andrew W. Mellon Foundation’s Monograph Initiative (Final Report, May 2016).” Journal of Electronic Publishing 20, no. 1 (2017). http://dx.doi.org/10.3998/3336451.0020.101
Texture's XML file format is .dar, which is an encapsulated collection of XML content (compliant with a Texture-specific JATS subset) and its related assets, plus a manifest file listing the contents. See https://github.com/substance/dar
Figure 3 will no doubt provoke some annotations and feedback, as the coverage of these projects across functional areas is often a matter of perspective — and evolution.
The emphasis in Figure 3 is on projects’ provision of open-source code that addresses these functional areas, as opposed to the features of the software in use; these are in many places not the same thing.